Pipelines (for hardware people)

Pipelining is traditionally a tedious and error-prone process. Designers need to keep all signals in sync by manually inserting pipeline registers and, more importantly, ensure that the correct register is referenced for each expression. The problem gets even worse when the depth of a pipeline needs to change for some reason: the developer then has to update every affected register reference throughout the design.

Spade natively includes a pipelining construct that ensures that pipelines without feedback are correct by construction and which makes it significantly easier to write and reason about pipelines with feedback.

A basic pipeline

Let's look at a basic example of a pipeline which computes the multiplication or addition of two numbers depending on an Op signal:

enum Op {
    Add,
    Mul
}

pipeline(1) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = x * y;
  reg;
    match op {
      Op::Add => sext(sum), // Sign extend to match mul
      Op::Mul => prod,
    }
}

The head of a pipeline looks similar to the entity and fn definitions that we saw before, but includes a number in parentheses. This number is the depth of the pipeline, i.e. the number of registers it contains, which is the same as its latency from input to output. While the compiler could in theory infer this number from the body, it always has to be specified since it is a very important part of the public "API" of the pipeline. Without reading the body of the pipeline, you know how many clock cycles you have to wait between input and output.

The first two lines of the body of the pipeline are somewhat uninteresting: they compute a sum and a product and store them in corresponding variables.

The next line, reg;, is another pipeline-specific construct. It adds a new stage to the pipeline by creating a pipeline register for every variable defined above the reg; statement, and re-mapping any references to those variables below the reg; statement to the registered versions.

The final match statement selects either the sum or the product depending on the op variable. Crucially, because this is a pipeline, the compiler ensures that all three variables are delayed by the same amount, so op from a previous cycle will never be interleaved with sum and prod from the current cycle.
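To make the register insertion concrete, here is a hand-written entity that is roughly equivalent to what this pipeline describes. This is only a sketch: the _s1 names are invented for illustration, and the compiler's actual elaboration may differ.

```spade
// Sketch: roughly the hardware described by the pipeline(1) above.
// One register is created per variable live across the reg; boundary.
entity compute_desugared(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    reg(clk) op_s1 = op;        // op delayed by one cycle
    reg(clk) sum_s1 = x + y;    // sum delayed by one cycle
    reg(clk) prod_s1 = x * y;   // product delayed by one cycle
    // Code below the reg; boundary sees the registered versions
    match op_s1 {
        Op::Add => sext(sum_s1),
        Op::Mul => prod_s1,
    }
}
```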

All this means that the resulting hardware looks like this:

Nested Pipelines

Spade of course also supports nested pipelines. Let's extend the example above to showcase how that is done.

pipeline(1) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
  reg;
    result
}

pipeline(1) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = inst(1) mul(clk, x, y);
  reg;
    match op {
      Op::Add => sext(sum), // Sign extend to match mul
      Op::Mul => prod,
    }
}

Here, the multiplier from the previous example has been broken out into its own sub-pipeline with its own internal register. Since the compiler is aware of this, it ensures that the signals stay in sync, in this case by not inserting an extra register for the prod signal.

Spade also requires you to specify the depth of a pipeline when instantiating it. This ensures that when you change the depth of a pipeline, you are forced to consider how that change affects behaviour everywhere the pipeline is instantiated.

Compiler guarantees

If you synthesize the previous example on a typical FPGA, you may realize that we are not using the multipliers in the DSP blocks as efficiently as we could - they have built-in optional pipelining registers that allow us to raise the \(f_{max}\). This means we could get higher performance from our design by adding two more registers to our mul pipeline. Traditionally, this would require updating a bunch of code, but with Spade, all we have to do is make the change to mul:

pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
  reg;
  reg;
  reg;
    result
}

The astute reader will notice that the latency of this pipeline is now wrong, oh no 😱. Luckily, even if you didn't notice this problem, the compiler did:

error: Pipeline depth mismatch. Expected 1 got 3
   ┌─ src/pipelines_hw.spade:40:1
   │
40 │ ╭ pipeline(1) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
   │            - Type 1 inferred here
41 │ │     let result = x * y;
42 │ │   reg;
43 │ │   reg;
44 │ │   reg;
45 │ │     result
46 │ │ }
   │ ╰─^ Found 3 stages in this pipeline
   │
   = note: Expected: 3
                Got: 1

Error: aborting due to previous error

Let's update the code accordingly, and while we're at it, change the repeated reg; to reg * 3;, which is shorthand for the same thing:

pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
  reg * 3;
    result
}

Now mul looks correct, but if we look at the bigger picture, we're not out of the woods yet. Our compute pipeline as currently written is now this abomination, which will have a very different output than before:
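To see why this is a problem, here is a hand-written sketch of what the mismatched design would compute if the compiler accepted it (the _sN names are invented for illustration):

```spade
// Sketch: compute with a 3-deep mul but only one reg; stage.
entity mismatched_compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    reg(clk) op_s1 = op;      // ready after 1 cycle
    reg(clk) sum_s1 = x + y;  // ready after 1 cycle
    // mul's three internal registers
    reg(clk) prod_s1 = x * y;
    reg(clk) prod_s2 = prod_s1;
    reg(clk) prod_s3 = prod_s2; // ready after 3 cycles
    // op_s1 and sum_s1 belong to a different input than prod_s3,
    // so this match would mix values from different cycles
    match op_s1 {
        Op::Add => sext(sum_s1),
        Op::Mul => prod_s3,
    }
}
```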

Luckily, the compiler once again has our back here. If we compile the new code:

pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
  reg * 3;
    result
}

pipeline(1) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = inst(1) mul(clk, x, y);
  reg;
    match op {
      Op::Add => sext(sum), // Sign extend to match mul
      Op::Mul => prod,
    }
}

error: Pipeline depth mismatch
   ┌─ src/pipelines_hw.spade:61:21
   │
53 │ pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
   │          - swim_test_project::pipelines_hw::m3::mul has depth 3
   ·
61 │     let prod = inst(1) mul(clk, x, y);
   │                     ^ Expected depth 3, got 1
   │
   = note: Expected: 3
                Got: 1

This means we have to update the inst(1) to inst(3) to match the definition of mul, which gives us yet another compiler error:

error: Use of swim_test_project::pipelines_hw::m3::prod before it is ready
   ┌─ src/pipelines_hw.spade:65:18
   │
65 │       Op::Mul => prod,
   │                  ^^^^ Is unavailable for another 2 stages
   │
   = note: Requesting swim_test_project::pipelines_hw::m3::prod from stage 1
   = note: But it will not be available until stage 3

This error is saying that there aren't enough pipeline registers between the definition of prod and its use, which is the error we were seeing graphically before. We'll update our compute pipeline accordingly, which finally gives:

pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
  reg * 3;
    result
}

pipeline(3) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = inst(3) mul(clk, x, y);
  reg * 3;
    match op {
      Op::Add => sext(sum), // Sign extend to match mul
      Op::Mul => prod,
    }
}

At this point, the compiler is happy, and we should be too: the hardware now makes proper use of the DSP blocks, giving higher performance, and its output is still the same as before (though, of course, the latency has changed).

Fearless Refactoring

At this point it is worth taking a step back and analyzing what happened. We started out with a pipeline that computed a correct value, but was not implemented as efficiently as it could have been. To fix this, we made a minimal change to the mul pipeline to make better use of the DSP blocks. Then, by running the compiler and mindlessly addressing the things it complained about, we updated the rest of our code to reflect this change. Once the compiler stopped complaining, our code still had the correct output but ran faster!

If our code is used elsewhere in the project, or by someone else in another project, the compiler would start complaining there until all the issues are fixed.

This is something that happens in several places in Spade, the type system being another notable example. You make a small, localized change, and the compiler then tells you every place you need to update to reflect that change in order to get back to hardware that works correctly. Essentially, you can refactor code without having to think about the consequences.

Feedback

The pipelines discussed so far are useful if you're building a compute pipeline where there is no dependence between values. However, this is not always the case. A notable example is processors, which are often pipelined but where values certainly are not independent. In this case, guaranteed correctness when adding or removing registers is no longer possible, but being able to reason about pipelines structurally, as individual stages rather than a soup of control registers mixed with pipeline registers, is still very helpful.

For cases like this, Spade has support for "stage references", where you can refer to values from previous or future stages using stage(...).

As an example, to write a pipeline that computes the sum of a window "around the current" value, we can write

pipeline(2) window(clk: clock, x: int<16>) -> int<18> {
    reg;
    reg;
        x + stage(-1).x + sext(stage(-2).x)
}

where we use relative stage references to refer to x from the stage above, and from 2 stages above. The corresponding hardware looks like this:

As you can see, negative references refer to stages "above" the current stage while positive references refer to stages "below". Values in stages "above" have passed through fewer registers, so negative references give you newer values "from the future", while positive references give you older values "from the past".
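As a sketch of what the stage references expand to, the window pipeline above is roughly equivalent to this hand-written entity (the _sN names are invented for illustration):

```spade
// Sketch: x delayed by two, one and zero cycles, summed together.
entity window_desugared(clk: clock, x: int<16>) -> int<18> {
    reg(clk) x_s1 = x;    // x one cycle ago; stage(-1).x in the last stage
    reg(clk) x_s2 = x_s1; // x two cycles ago; plain x in the last stage
    x_s2 + x_s1 + sext(x)
}
```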

You can also use labels ('label) to refer to stages. For example, if you want to refer to a variable without any delay, you can label the first stage 'first and then refer to variables from that stage using stage(first).

pipeline(2) without_delay(clk: clock, x: int<16>) -> int<16> {
        'first
    reg;
    reg;
        stage(first).x
}

Dynamic pipelines

Spade has experimental support for stalling pipelines, as documented in the language reference section. However, make sure you follow the note at the top of that page to avoid unexpected bugs.