Memories and Register Files

So far, we have seen that almost all operations in Spade are combinational, and that all state is stored in reg statements, which map to discrete registers in hardware. A register updates its value on every clock cycle. This is not well suited for memory-like structures, which in principle consist of a large number of registers of which only a small subset is updated each cycle. In the resulting hardware, memories are also very different from a large bank of registers: FPGAs have dedicated block RAMs, and ASICs use dedicated macros for large memories. For this reason, Spade has a dedicated mechanism for dealing with memories.

Note that the distinction between a memory and a register bank is slightly fuzzy when dealing with smaller memories. A “register file” for a processor is often easiest to implement with the memory primitives, even if it ends up being implemented with registers in the end. The deciding factor when choosing between a memory and a collection of registers should be whether only a small subset of the values is updated at a time.

Spade provides two ways of working with memories: a higher level primitive that is quite close to the dual port memories present in FPGAs (currently 1 write port and 1 read port), and a lower level primitive that is trickier to work with but more general.

The Dual Port Memory Primitive

The dual port memory primitive is defined as dp_bram in the standard library like this:

/// A dual port block RAM that supports read and write ports
/// being in different domains. If writes and reads happen
/// to the same address in the same clock cycle, the behaviour is undefined.
pub entity dp_bram<#uint W, D, #uint C>(
    write_clk: clock,
    read_clk: clock
) -> (WritePort<W, D>, ReadPort<W, D>) {

It takes three generic parameters, W, D and C, which are the width of the address, the type of the contained data, and the number of elements in the memory. It also takes two clocks, one for the write port and one for the read port, in case you want to use it to cross clock domains. When starting out, you will generally only have a single clock in your design, so you can pass it to both. It returns two ports, a write port and a read port, which are what you will use to interact with the memory.

To use this primitive, first instantiate it. Here it is instantiated to store 255 bool values:

let (w, r) = inst std::mem::dp_bram::<8, bool, 255>(clk, clk);

In this case, we specified all the generic parameters, but the address width and stored type can often be inferred.

The w and r ports can then be passed to the units that require memory access, or be used directly. Writes are performed with the write method, which takes an address and an Option value that is Some(value) if something should be written this clock cycle, and None otherwise. As an example, we can write true to address 5 like this:

w.write(5, Some(true));

Reading is done similarly, with a read method on the read port, which takes the address to read from and returns the value stored at that address. Because memories are clocked, the read method is a pipeline and needs inst(1) to indicate that the value arrives with a latency of 1 clock cycle.

let read_out = r.inst(1) read(clk, 5);
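Putting these pieces together, a minimal sketch of a unit that both writes to and reads from a dp_bram might look like the following. The entity name bram_example and its parameter names are made up for this illustration, and the sketch assumes that the read pipeline can be instantiated from inside an entity in the same way as in the snippet above.

```spade
// Hypothetical wrapper around a single-clock dp_bram. The entity name and
// parameter names are illustrative and not part of the standard library.
entity bram_example(
    clk: clock,
    write_addr: uint<8>,
    write_value: Option<bool>,
    read_addr: uint<8>,
) -> bool {
    // 255 bool values addressed by an 8 bit address. The same clock is
    // passed to both the write port and the read port
    let (w, r) = inst std::mem::dp_bram::<8, bool, 255>(clk, clk);

    // Write `write_value` to `write_addr` in the cycles where it is `Some`
    w.write(write_addr, write_value);

    // The value stored at `read_addr` arrives one clock cycle later
    r.inst(1) read(clk, read_addr)
}
```

Keep in mind that, according to the documentation of dp_bram, the behaviour is undefined if write_addr and read_addr are equal in a cycle where a write takes place.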

The Low Level Interface

If the primitive above is sufficient for your needs, you should prefer using it, but because it only supports one read port and one write port, it is sometimes not sufficient for general purpose usage. It is itself built using the std::mem::clocked_memory primitive, which is also defined in the standard library:

/// Define a memory where values are written on the rising edge of the clock.
/// Reads can occur asynchronously with the `read_memory` entity. If clocked reads
/// should be used, the read result should be placed in a register
///
/// The write array defines all the write ports to the memory. It consists of a
/// `write enable`, `address` and `data` field. When WE is enabled, data is written
/// to address. Otherwise no change takes effect
/// NOTE: When possible, we should compute AddrWidth from NumElements
pub extern entity clocked_memory<#uint NumElements, #uint WritePorts, #uint AddrWidth, D>(
    clk: clock,
    writes: [(bool, uint<AddrWidth>, D); WritePorts],
) -> Memory<D, NumElements>;

The extern keyword means that it is not defined in Spade. In most cases extern means that you want to use some external Verilog module, but in this case it is used because the compiler will replace the instantiation with dedicated code.

It takes four generic parameters: the number of elements in the memory, the number of write ports, the width of the address, and finally the type of the stored data.

It takes a single clock, which is the clock that will be used for all write ports, and an array of write ports. The elements in this array are tuples containing a bool, which is the write enable signal for that port, an address to write the value to, and a data value to write. It returns a Memory<D, NumElements> which is also a special type that cannot be passed around freely. Instead, you can only use the read_memory entity on it, which is defined as

/// Get the value out of a memory
pub extern entity read_memory<#uint AddrWidth, D, #uint NumElements> (
    mem: Memory<D, NumElements>,
    addr: uint<AddrWidth>
) -> D;

Note that this is not a pipeline: reads from this memory primitive are combinational by default, but can be made clocked by adding registers on the output.

As a more concrete example, let’s see how we can use this primitive to implement a register file for a processor. This example makes quite heavy use of the ports feature, so go back and read about that if you don’t feel comfortable with ports yet. Generally, the register file of a processor will have two read ports and one write port. We will use a similar interface to the dp_bram ports above, so we will define two types for these ports:

pub struct port ReadPort<#uint W, D> {
    addr: inv &uint<W>,
    out: &D,
}

pub struct port WritePort<#uint W, D> {
    addr: inv &uint<W>,
    write: inv &Option<D>,
}

Next, we will define the entity where we instantiate the memory primitive:

entity regfile(clk: clock)
    -> (
        ReadPort<5, [bool; 32]>,
        ReadPort<5, [bool; 32]>,
        WritePort<5, [bool; 32]>
    )
{

In here, we instantiate the memory primitive and transform the Option write value into the low level (write_enable, address, data) representation that the memory expects:

    // Define some ports which we will pass along to the read and write
    // ports we return
    let read_addr0 = port;
    let read_addr1 = port;
    let write_addr = port;
    let write_data = port;

    // The compiler can infer the types but not the number of elements (32) in this
    // case
    let mem = inst std::mem::clocked_memory::<32, _, _, _>(
        clk,
        [(
            // Use the `Option` value as a write_enable
            (*write_data.0).is_some(),
            // The address can be passed straight through
            *write_addr.0,
            // And take the inner value out of the option. Since
            // we used the `.is_some()` method as a write enable,
            // we will never observe the undef value
            (*write_data.0).unwrap_or_undef()
        )]
    );

Reads are done outside the definition of the memory using the read_memory entity:

    let r0 = inst std::mem::read_memory(mem, *read_addr0.0);
    let r1 = inst std::mem::read_memory(mem, *read_addr1.0);

However, note that these reads are combinational, so without adding some registers after them, you may end up with a combinational memory that cannot be mapped to dedicated memory hardware such as block RAMs, which typically only support clocked reads. Generally, it is best to either use a pipeline and add a pipeline register after the reads:

    let r0 = inst std::mem::read_memory(mem, *read_addr0.0);
    let r1 = inst std::mem::read_memory(mem, *read_addr1.0);
    reg;

or to wrap the outputs in registers directly:

    reg(clk) r0 = inst std::mem::read_memory(mem, *read_addr0.0);
    reg(clk) r1 = inst std::mem::read_memory(mem, *read_addr1.0);

Finally, we can wrap the port signals defined at the start of the unit in the read and write ports defined above:

    (
        ReadPort(read_addr0.1, &r0),
        ReadPort(read_addr1.1, &r1),
        WritePort(write_addr.1, write_data.1),
    )

This interface works, but it does not communicate the latency we introduced to the memory reads. To make sure that users get this right, we can define methods for accessing the ports like this:

impl<#uint W, D> ReadPort<W, D> {
    /// Read the value stored at `addr`
    pub pipeline(1) read(self, read_clk: clock, addr: uint<W>) -> D {
        let ReadPort(saddr, sout) = self;
        set saddr = &addr;
    reg;
        *sout
    }
}

impl<#uint W, D> WritePort<W, D> {
    pub fn write(self, addr: uint<W>, write: Option<D>) {
        let WritePort$(addr: saddr, write: swrite) = self;
        set saddr = &addr;
        set swrite = &write;
    }
}

These ports can now be passed along to a processor that can make use of the register file.
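As a rough sketch of what such a user could look like, the entity below instantiates regfile and uses the read and write methods defined above. The entity name regfile_user and the parameter names (rs1, rs2, rd, writeback) are hypothetical and chosen only for this illustration.

```spade
// Hypothetical consumer of the register file. All names in this entity
// are illustrative
entity regfile_user(
    clk: clock,
    rs1: uint<5>,
    rs2: uint<5>,
    rd: uint<5>,
    writeback: Option<[bool; 32]>,
) -> ([bool; 32], [bool; 32]) {
    let (rport0, rport1, wport) = inst regfile(clk);

    // Write a result back to register `rd` in the cycles where there is one
    wport.write(rd, writeback);

    // Both operands arrive one clock cycle after their addresses are given
    let a = rport0.inst(1) read(clk, rs1);
    let b = rport1.inst(1) read(clk, rs2);
    (a, b)
}
```

Because the latency is part of the read method, the user has to write inst(1) at each read and cannot silently treat the read as combinational.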

If you happened to look at the standard library while reading this, you may have noticed that these ports are exactly the same as those defined for memory ports in the standard library, and that the implementation of dp_bram is very similar to the register file we just defined.

Reading the above code, you may be inclined to think that because we only instantiated one memory primitive, we will only end up with one physical memory. However, in FPGAs, our register file will most likely map to block RAM, which typically only has two ports for reading or writing. Our regfile uses three ports, so the synthesis tool will not be able to map it to a single block RAM directly.

When we have a single write port and multiple read ports, however, the synthesis tool can pull off a little trick: it can generate two physical memories and map the write signals to both. That effectively gives two read ports, one from each memory, but it does mean that the memory usage is doubled. The trick works for more read ports as well, but of course the cost goes up with the number of memories.
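To make the trick concrete, the sketch below writes the duplication out by hand with the low level primitive. This is only an illustration of the resulting structure, not something you normally need to write yourself, and the entity and parameter names are hypothetical.

```spade
// Hypothetical illustration of the duplication a synthesis tool may
// perform behind the scenes. All names are illustrative
entity duplicated_memories(
    clk: clock,
    write_addr: uint<5>,
    write: Option<[bool; 32]>,
    read_addr0: uint<5>,
    read_addr1: uint<5>,
) -> ([bool; 32], [bool; 32]) {
    // A single write port description that is shared by both memories
    let writes = [(write.is_some(), write_addr, write.unwrap_or_undef())];

    // Two memories that receive identical writes, so every location that
    // has been written holds the same value in both copies
    let mem0 = inst std::mem::clocked_memory::<32, _, _, _>(clk, writes);
    let mem1 = inst std::mem::clocked_memory::<32, _, _, _>(clk, writes);

    // Each copy serves one clocked read port
    reg(clk) r0 = inst std::mem::read_memory(mem0, read_addr0);
    reg(clk) r1 = inst std::mem::read_memory(mem1, read_addr1);
    (r0, r1)
}
```

Each copy now needs only one write port and one read port, which matches the two-port block RAMs described above.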

For multiple write ports, there is no such trick, and synthesis tools will most likely fail to map the description to a memory, instead implementing it with general purpose logic. For a small memory, this may be fine, but for a large memory it will lead to excessive resource usage and very long synthesis times.
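For reference, a description with two write ports could be sketched as below, based on the clocked_memory signature shown earlier. The entity and parameter names are hypothetical. The text above does not say what happens when both ports write to the same address in the same cycle, so the sketch makes no assumption about that case.

```spade
// Hypothetical memory with two write ports and one clocked read port.
// Expect this to be implemented with general purpose logic rather than
// a block RAM. All names are illustrative
entity two_write_ports(
    clk: clock,
    // Each write is a (write_enable, address, data) tuple
    write0: (bool, uint<5>, [bool; 32]),
    write1: (bool, uint<5>, [bool; 32]),
    read_addr: uint<5>,
) -> [bool; 32] {
    // The number of write ports is inferred from the length of the array
    let mem = inst std::mem::clocked_memory::<32, _, _, _>(clk, [write0, write1]);

    // Clocked read, as recommended for the register file above
    reg(clk) out = inst std::mem::read_memory(mem, read_addr);
    out
}
```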

For these reasons, it is always worth checking synthesis reports when working with multi-port memories using the provided primitives.