# Pipelines (for hardware people)
Pipelining is traditionally a tedious and error-prone process. Designers must keep all signals in sync by manually inserting pipeline registers and, more importantly, ensure that the correct registers are used for each expression. The problem gets even worse when the depth of a pipeline needs to change for some reason: the developer then has to update every register reference accordingly throughout the design.
Spade natively includes a pipelining construct that ensures that pipelines without feedback are correct by construction and which makes it significantly easier to write and reason about pipelines with feedback.
## A basic pipeline
Let's look at a basic example of a pipeline which computes the multiplication or addition of two numbers depending on an `Op` signal:
```spade
enum Op {
    Add,
    Mul
}

pipeline(1) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = x * y;
    reg;
    match op {
        Op::Add => sext(sum), // Sign extend to match mul
        Op::Mul => prod,
    }
}
```
The head of a pipeline looks similar to the `entity` and `fn` definitions that we saw before, but includes a number in parentheses. This number is the depth of the pipeline, i.e. the number of registers it contains, which is also its latency from input to output.
While the compiler could in theory infer this number from the body, it always has to be specified since it is an important part of the public "API" of the pipeline: without reading the body, you know how many clock cycles you have to wait between input and output.
The first two lines of the body are straightforward: they compute a sum and a product and store them in corresponding variables. The next line, `reg;`, is another pipeline-specific construct. It adds a new stage to the pipeline by creating a pipeline register for every variable defined above the `reg;` statement, and re-mapping any references to those variables below the `reg;` to the registered versions.
The final `match` statement selects either the "sum" or the "product" value depending on the `op` variable. Crucially, because this is a pipeline, the compiler ensures that all three variables are delayed by the same amount, so `op` from a previous cycle will never be interleaved with `sum` and `prod` from the current cycle.
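To make the stage timing concrete, here is a small behavioral model of this pipeline in plain Python (not Spade; it ignores bit widths and the sign extension, and assumes the pipeline registers reset to zero):

```python
def make_compute():
    """Cycle-accurate model of the 1-deep compute pipeline.

    The single `reg;` registers sum, prod, AND op together, so the
    output always combines values from the same input cycle.
    """
    state = {"sum": 0, "prod": 0, "op": "Add"}  # assumed reset values

    def clock(op, x, y):
        # Below the `reg;`: read the registered copies from last cycle
        out = state["sum"] if state["op"] == "Add" else state["prod"]
        # Above the `reg;`: compute this cycle's values, then register them
        state.update({"sum": x + y, "prod": x * y, "op": op})
        return out

    return clock

compute = make_compute()
compute("Mul", 3, 4)             # cycle 0: output is still the reset value
result = compute("Add", 10, 20)  # cycle 1: result == 12, because cycle 0's
                                 # op (Mul) selects cycle 0's prod (3 * 4)
```

Note how `result` never mixes cycle 1's inputs with cycle 0's operation: all three values cross the register boundary together.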
All this means that the resulting hardware looks like this:
## Nested Pipelines
Spade of course also supports nested pipelines. Let's extend the example above to showcase how that is done.
```spade
pipeline(1) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
    reg;
    result
}

pipeline(1) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = inst(1) mul(clk, x, y);
    reg;
    match op {
        Op::Add => sext(sum), // Sign extend to match mul
        Op::Mul => prod,
    }
}
```
Here, the multiplier from the previous example has been broken out into its own sub-pipeline with its own internal register. Since the compiler is aware of this, it will ensure that the signals remain in sync, in this case by not inserting an extra register for the `prod` signal.
Spade also requires you to specify the depth of a pipeline when instantiating it. This ensures that when you change the depth of a pipeline, you must also confirm that the change does not break the behaviour of every place where that pipeline is instantiated.
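As a rough sketch of the rule, the depth checks could be modeled like this (plain Python with hypothetical names; the real compiler of course works on the typed AST, not on counts passed in by hand):

```python
def check_pipeline_depths(declared_depth, reg_count, instantiations):
    """Hypothetical model of Spade's pipeline depth checks.

    declared_depth: the number written in `pipeline(N)`
    reg_count:      how many `reg` stages the body contains
    instantiations: list of (callee, depth_written_at_inst, callee_real_depth)
    """
    errors = []
    # The body must contain exactly as many stages as the head declares
    if reg_count != declared_depth:
        errors.append(
            f"Pipeline depth mismatch. Expected {declared_depth} got {reg_count}"
        )
    # Every inst(N) must match the instantiated pipeline's real depth
    for callee, written, actual in instantiations:
        if written != actual:
            errors.append(f"{callee}: Expected depth {actual}, got {written}")
    return errors

# compute declares depth 1, has one `reg;`, and writes inst(1) for a depth-1 mul
assert check_pipeline_depths(1, 1, [("mul", 1, 1)]) == []

# If mul later grows to depth 3, the stale inst(1) is reported
assert check_pipeline_depths(1, 1, [("mul", 1, 3)]) == [
    "mul: Expected depth 3, got 1"
]
```

The second check is exactly what catches the stale instantiation in the refactoring walkthrough below.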
## Compiler guarantees
If you synthesize the previous example on a typical FPGA, you may realize that we are not using the multipliers in the DSP blocks as efficiently as we could: they have built-in optional pipelining registers that allow us to raise the \(f_{max}\). This means we could get higher performance from our design by adding two more `reg` stages to our `mul` pipeline. Traditionally, this would require updating a bunch of code, but with Spade, all we have to do is make the change to `mul`:
```spade
pipeline(1) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
    reg;
    reg;
    reg;
    result
}
```
The astute reader will notice that the latency of this pipeline is now wrong, oh no 😱. Luckily, even if you didn't notice this problem, the compiler did:
```
error: Pipeline depth mismatch. Expected 1 got 3
  ┌─ src/pipelines_hw.spade:40:1
   │
40 │ ╭ pipeline(1) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
   │ - Type 1 inferred here
41 │ │     let result = x * y;
42 │ │     reg;
43 │ │     reg;
44 │ │     reg;
45 │ │     result
46 │ │ }
   │ ╰─^ Found 3 stages in this pipeline
   │
   = note: Expected: 3
           Got: 1
Error: aborting due to previous error
```
Let's update the code accordingly, and while we're at it change the repeated `reg;` to `reg*3;`, which is a shorthand for the same thing:
```spade
pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
    reg * 3;
    result
}
```
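Behaviorally, `reg * 3;` acts like a three-entry shift register on `result`. A plain-Python sketch of the new `mul` (not Spade; registers assumed to reset to zero):

```python
from collections import deque

def make_mul(depth=3):
    regs = deque([0] * depth)  # the `depth` pipeline registers

    def clock(x, y):
        regs.append(x * y)     # product entering the first register
        return regs.popleft()  # value that has passed through all registers

    return clock

mul = make_mul()
outputs = [mul(i, i) for i in range(1, 6)]  # feed 1*1, 2*2, ..., 5*5
# Each product emerges exactly 3 cycles after its inputs:
# outputs == [0, 0, 0, 1, 4]
```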
Now `mul` looks correct, but if we look at the bigger picture we're not out of the woods yet. Our `compute` pipeline as currently described is now this abomination, which will have a very different output than before:
Luckily, the compiler once again has our back here. If we compile the new code
```spade
pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
    reg * 3;
    result
}

pipeline(1) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = inst(1) mul(clk, x, y);
    reg;
    match op {
        Op::Add => sext(sum), // Sign extend to match mul
        Op::Mul => prod,
    }
}
```
```
error: Pipeline depth mismatch
  ┌─ src/pipelines_hw.spade:61:21
   │
53 │ pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
   │ - swim_test_project::pipelines_hw::m3::mul has depth 3
   ·
61 │     let prod = inst(1) mul(clk, x, y);
   │                     ^ Expected depth 3, got 1
   │
   = note: Expected: 3
           Got: 1
```
This means we have to update the `inst(1)` to `inst(3)` to match the definition of `mul`, which gives us yet one more compiler error:
```
error: Use of swim_test_project::pipelines_hw::m3::prod before it is ready
  ┌─ src/pipelines_hw.spade:65:18
   │
65 │         Op::Mul => prod,
   │                    ^^^^ Is unavailable for another 2 stages
   │
   = note: Requesting swim_test_project::pipelines_hw::m3::prod from stage 1
   = note: But it will not be available until stage 3
```
This error is saying that there aren't enough pipeline registers between our definition of `prod` and its use, which is the error we were seeing graphically before. We'll update our `compute` pipeline accordingly, which finally gives:
```spade
pipeline(3) mul(clk: clock, x: int<18>, y: int<18>) -> int<36> {
    let result = x * y;
    reg * 3;
    result
}

pipeline(3) compute(clk: clock, op: Op, x: int<18>, y: int<18>) -> int<36> {
    let sum = x + y;
    let prod = inst(3) mul(clk, x, y);
    reg * 3;
    match op {
        Op::Add => sext(sum), // Sign extend to match mul
        Op::Mul => prod,
    }
}
```
At this point, the compiler is happy, and we should be too: the hardware now uses the DSP blocks correctly, giving higher performance, and its output is still the same as before (though of course, the latency has changed).
## Fearless Refactoring
At this point it is worth taking a step back and analyzing what happened. We started out with a pipeline that computed a correct value, but that was not implemented as efficiently as it could have been. To fix this, we made a minimal change to the `mul` pipeline to make better use of the DSP blocks. Then, by running the compiler and mindlessly addressing the things it complained about, we updated the rest of our code to reflect this change. Once the compiler stopped complaining, our code still produced the correct output but ran faster!
If our code is used elsewhere in the project, or by someone else in another project, the compiler would start complaining there until all the issues are fixed.
This is something that happens in several places in Spade, the type system being another notable example. You make a small localized change, then the compiler tells you every place you need to update to reflect that change and get back to hardware that still works correctly. Essentially, you can refactor code without having to think about the consequences.
## Feedback
The pipelines discussed so far are useful if you're building a compute pipeline with no dependence between values. However, this is not always the case. A notable example is processors, which are often pipelined but where values certainly are not independent. In such cases, guaranteed correctness when adding or removing registers is no longer possible, but being able to reason about pipelines structurally, as individual stages rather than a soup of control registers mixed with pipeline registers, is still very helpful.
For cases like this, Spade supports "stage references", which let you refer to values from previous or future stages using `stage(...)`.
As an example, to write a pipeline that computes the sum of a window around the current value, we can write:
```spade
pipeline(2) window(clk: clock, x: int<16>) -> int<18> {
    reg;
    reg;
    x + stage(-1).x + sext(stage(-2).x)
}
```
where we use relative stage references to refer to `x` from the stage above and from two stages above. The corresponding hardware looks like this:
As you can see, negative references refer to stages "above" the current stage while positive references refer to stages "below". Since stages "above" have gone through fewer registers, negative references give values from the "future", while positive references give values from the "past".
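In other words, at every cycle the output is the sum of the newest input and the inputs from one and two cycles earlier. A plain-Python model of `window` makes this concrete (not Spade; it ignores bit widths and the sext, and assumes zero reset values):

```python
def make_window():
    x1, x2 = 0, 0  # x after one and two pipeline registers

    def clock(x):
        nonlocal x1, x2
        # In the final stage: plain `x` has passed through both registers,
        # stage(-1).x through one, and stage(-2).x through none
        out = x2 + x1 + x
        x2, x1 = x1, x  # shift x one register further
        return out

    return clock

window = make_window()
outputs = [window(v) for v in [1, 2, 3, 4]]
# outputs == [1, 3, 6, 9]: each output sums three consecutive inputs
```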
You can also use labels (`'label`) to refer to stages. For example, if you want to refer to a variable without delay, you can label the first stage `'first` and then refer to variables from that stage using `stage(first)`.
```spade
pipeline(2) without_delay(clk: clock, x: int<16>) -> int<16> {
    'first
    reg;
    reg;
    stage(first).x
}
```
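A plain-Python model (not Spade, zero reset values assumed) shows the effect: the two registers still exist and delay a copy of `x`, but the labeled stage reference bypasses them entirely. The model returns both values to make the contrast visible:

```python
def make_without_delay():
    regs = [0, 0]  # the two pipeline registers created by `reg; reg;`

    def clock(x):
        delayed = regs.pop(0)  # what plain `x` would read in the last stage
        regs.append(x)
        # stage(first).x reads the undelayed input, ignoring the registers
        return x, delayed

    return clock

f = make_without_delay()
steps = [f(v) for v in [5, 6, 7]]
# The stage reference has zero latency; the registered copy lags by 2 cycles:
# steps == [(5, 0), (6, 0), (7, 5)]
```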
## Dynamic pipelines
Spade has experimental support for stalling pipelines, as documented in the language reference section. However, make sure you follow the note at the top of that page to avoid unexpected bugs.