Background

Recently, I have been working on cache implementations that support cache coherency, such as the rocket-chip implementation and the SiFive implementation, so I needed to study the TileLink protocol. This article assumes that the reader has some knowledge of AXI, so much of the content refers directly to AXI.

Signals

According to TileLink Spec 1.8.0, TileLink is divided into the following three conformance levels.

  • TL-UL (Uncached Lightweight): reads and writes only, no burst support, analogous to AXI-Lite
  • TL-UH (Uncached Heavyweight): adds atomic operations, prefetch hints and burst support, analogous to AXI+ATOP (atomic operations, introduced in AXI5)
  • TL-C (Cached): adds a cache coherency protocol on top of TL-UH, analogous to AXI+ACE/CHI

TileLink Uncached (TL-UL and TL-UH) consists of two channels.

  • A channel: M->S sends request, analogous to AXI’s AR/AW/W
  • D channel: S->M sends response, analogous to AXI’s R/B

Therefore, TileLink can issue only one request per cycle, either a read or a write, while AXI can issue a read on AR and a write on AW in the same cycle.

Some examples of requests.

  • Read: M->S sends Get on A channel, S->M sends AccessAckData on D channel
  • Write: M->S sends PutFullData/PutPartialData on A channel, S->M sends AccessAck on D channel
  • Atomic operation: M->S sends ArithmeticData/LogicalData on A channel, S->M sends AccessAckData on D channel
  • Prefetch operation: M->S sends Intent on A channel, S->M sends HintAck on D channel
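
The pairing of requests and responses can be written down as a small lookup. The following is a minimal sketch in plain Scala (no Chisel); the object name TLUncachedModel and the function responseFor are made up for illustration, while the message names come from the TileLink spec, where Intent is answered with HintAck.

```scala
// Plain-Scala model (no Chisel): which D-channel message answers each A-channel request.
// TLUncachedModel and responseFor are hypothetical names used only for this sketch.
object TLUncachedModel {
  sealed trait AMessage
  case object Get extends AMessage
  case object PutFullData extends AMessage
  case object PutPartialData extends AMessage
  case object ArithmeticData extends AMessage
  case object LogicalData extends AMessage
  case object Intent extends AMessage

  sealed trait DMessage
  case object AccessAck extends DMessage
  case object AccessAckData extends DMessage
  case object HintAck extends DMessage

  def responseFor(req: AMessage): DMessage = req match {
    case Get                          => AccessAckData // read returns data
    case PutFullData | PutPartialData => AccessAck     // write returns a bare acknowledgement
    case ArithmeticData | LogicalData => AccessAckData // atomics return the old value
    case Intent                       => HintAck       // prefetch hint is only acknowledged
  }
}
```

Only the requests whose response must carry data (Get and the atomics) map to AccessAckData; this is the same distinction the bridges below use to route D-channel responses.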

AXI4ToTL

For the AXI4ToTL module, let’s analyze how to convert an AXI4 Master into a TileLink Master.

First, consider the differences between AXI4 and TileLink. One is that TileLink merges the read and write channels, so an arbiter is needed here. The other is that AXI4 separates AW and W, so they need to be merged as well. This module does not handle bursts itself; it relies on AXI4Fragmenter to do the splitting, i.e., to produce several AW beats that can each be paired with a W beat.

In the code, the AR channel is first mapped to the A channel.

val r_out = Wire(out.a)
r_out.valid := in.ar.valid
r_out.bits :<= edgeOut.Get(r_id, r_addr, r_size)._2

Then the AW and W channels are also connected to the A channel. Since bursts are not taken into account, a request is issued when aw and w are valid at the same time.

val w_out = Wire(out.a)
in.aw.ready := w_out.ready && in.w.valid && in.w.bits.last
in.w.ready  := w_out.ready && in.aw.valid
w_out.valid := in.aw.valid && in.w.valid
w_out.bits :<= edgeOut.Put(w_id, w_addr, w_size, in.w.bits.data, in.w.bits.strb)._2

Interestingly, the read and write ids are widened by a few bits: the lowest bit is 0 for a read and 1 for a write, and the remaining bits hold the request number, so that multiple requests are sent out with different ids.
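
The id layout described here can be illustrated with a few lines of plain Scala. This is only a sketch of the encoding idea; SourceIdModel, encode, isWrite and reqNum are hypothetical names and are not taken from the rocket-chip source.

```scala
// Plain-Scala sketch of the id layout: bit 0 is the read/write flag, upper bits are the request number.
// All names here are hypothetical and chosen only for illustration.
object SourceIdModel {
  def encode(reqNum: Int, isWrite: Boolean): Int =
    (reqNum << 1) | (if (isWrite) 1 else 0)

  // Recover the two fields from an id seen on the D channel.
  def isWrite(source: Int): Boolean = (source & 1) == 1
  def reqNum(source: Int): Int = source >>> 1
}
```

With this layout a read and a write that share the same request number still get distinct ids, and the low bit alone tells whether a response belongs to a read or a write.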

Then the read and write A-channel requests are connected to the arbiter.

TLArbiter(TLArbiter.roundRobin)(out.a, (UInt(0), r_out), (in.aw.bits.len, w_out))

The remaining step is to inspect the D channel: a response that carries data is forwarded to the R channel, and one that does not is forwarded to the B channel.

out.d.ready := Mux(d_hasData, ok_r.ready, ok_b.ready)
ok_r.valid := out.d.valid && d_hasData
ok_b.valid := out.d.valid && !d_hasData

Finally, the module handles a difference in how TileLink and AXI4 acknowledge write requests: in TileLink the acknowledgement can be returned as early as the first beat of the burst, while AXI4 must return it after the last beat.

TLToAXI4

Since TileLink issues either a read or a write at any given time, the module first builds a fictitious arw channel, which can be understood as the ar and aw channels of AXI4 merged into one; the same design can be seen in the SpinalHDL code. The arw channel is then connected to the ar or aw channel, depending on whether the request is a write.

val queue_arw = Queue.irrevocable(out_arw, entries=depth, flow=combinational)
out.ar.bits := queue_arw.bits
out.aw.bits := queue_arw.bits
out.ar.valid := queue_arw.valid && !queue_arw.bits.wen
out.aw.valid := queue_arw.valid &&  queue_arw.bits.wen
queue_arw.ready := Mux(queue_arw.bits.wen, out.aw.ready, out.ar.ready)

The following code handles the valid signals for aw and w.

in.a.ready := !stall && Mux(a_isPut, (doneAW || out_arw.ready) && out_w.ready, out_arw.ready)
out_arw.valid := !stall && in.a.valid && Mux(a_isPut, !doneAW && out_w.ready, Bool(true))
out_w.valid := !stall && in.a.valid && a_isPut && (doneAW || out_arw.ready)

The reason is that in TileLink every beat of a burst is a request on the A channel, while in AXI4 only the first beat carries an aw request and every beat carries a w request, so the doneAW signal is used here to make the distinction.
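
To make the role of doneAW concrete, here is a minimal plain-Scala sketch of which AXI4 channels each beat of a Put burst drives. It ignores stalls and back-pressure, and it assumes that doneAW is set after the first beat and cleared after the last one; DoneAwModel, Beat and convertPutBurst are hypothetical names.

```scala
// Plain-Scala sketch: one aw request per burst, one w beat per TileLink A-channel beat.
// Assumption (not shown in the excerpt above): doneAW is set after the first beat and
// cleared after the last beat. Handshaking and stalls are ignored.
object DoneAwModel {
  final case class Beat(sendAw: Boolean, sendW: Boolean, last: Boolean)

  def convertPutBurst(beats: Int): Seq[Beat] = {
    require(beats >= 1, "a burst has at least one beat")
    var doneAW = false
    (0 until beats).map { i =>
      val last = i == beats - 1
      val beat = Beat(sendAw = !doneAW, sendW = true, last = last)
      doneAW = !last // remember that aw was already sent until the burst ends
      beat
    }
  }
}
```

For a four-beat burst only the first beat produces an aw request, while all four produce a w beat; a single-beat Put produces both at once.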

Next, the results on the b and r channels are connected to the d channel. As in the previous module, this requires another arbitration.

val r_wins = (out.r.valid && b_delay =/= UInt(7)) || r_holds_d
out.r.ready := in.d.ready && r_wins
out.b.ready := in.d.ready && !r_wins
in.d.valid := Mux(r_wins, out.r.valid, out.b.valid)

Finally, the module also handles the ordering of requests and responses.

The two modules mentioned above both use TileLink Uncached, so how does TileLink support cache coherency? It introduces three more channels, B, C and E, which together with A and D support three operations.

  • Acquire: M->S sends Acquire on A channel, S->M sends Grant on D channel, and then M->S sends GrantAck on E channel; the purpose is to obtain a copy
  • Release: M->S sends Release on C channel, S->M sends ReleaseAck on D channel; the purpose is for M to give up or downgrade its own copy
  • Probe: S->M sends Probe on B channel, M->S sends ProbeAck on C channel; the purpose is to ask M to give up or downgrade its copy

You can see that the three channels A, C and E are M->S, and the two channels B and D are S->M.

If a cache (Master A) wants to write to a cache line it holds read-only, or to read a cache line that misses, then under a broadcast cache coherency protocol it goes through the following process.

  • Master A -> Slave: Acquire
  • Slave -> Master B: Probe
  • Master B -> Slave: ProbeAck
  • Slave -> Master A: Grant
  • Master A -> Slave: GrantAck

First Master A sends an Acquire request. The Slave then broadcasts Probe to the other Masters, and once they have all returned ProbeAck, it returns Grant to Master A. Finally Master A sends GrantAck to the Slave. Master A now holds a copy of the cache line, and Master B’s copy has been invalidated or made read-only.
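
The message sequence can be replayed with a small plain-Scala model. It is a deliberately simplified sketch: every other Master is probed (no filter), each Probe is answered before the next one is sent, and a read always ends in Branch. AcquireFlow, NoneState and acquire are hypothetical names, not rocket-chip APIs.

```scala
// Plain-Scala sketch of a broadcast Acquire. Simplifying assumptions: all other Masters
// are probed one after another, and the requester ends in Trunk for a write (needT) or
// Branch for a read. Names are hypothetical.
object AcquireFlow {
  sealed trait State
  case object NoneState extends State // the "None" state of the text
  case object Branch extends State
  case object Trunk extends State

  // Returns the message trace and the resulting state of every Master.
  def acquire(states: Map[String, State], requester: String, needT: Boolean)
      : (Vector[String], Map[String, State]) = {
    val others = states.keys.toVector.sorted.filter(_ != requester)
    val probes = others.flatMap { m =>
      Vector(s"Slave -> $m: Probe", s"$m -> Slave: ProbeAck")
    }
    val trace = Vector(s"$requester -> Slave: Acquire") ++ probes ++
      Vector(s"Slave -> $requester: Grant", s"$requester -> Slave: GrantAck")
    val updated: Map[String, State] = states.map { case (m, s) =>
      val next: State =
        if (m == requester) { if (needT) Trunk else Branch }
        else if (needT) NoneState        // others are invalidated
        else if (s == Trunk) Branch      // others are made read-only
        else s
      m -> next
    }
    (trace, updated)
  }
}
```

Calling acquire with Master A in None, Master B in Trunk and needT = true reproduces the five messages listed above and leaves Master A in Trunk and Master B in None.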

A TileLink cache line has three states: None, Branch and Trunk (Tip). This basically corresponds to the MSI model: None -> Invalid, Branch -> Shared and Trunk -> Modified.

The Rocket Chip code also defines a Dirty state in ClientStates, which roughly corresponds to the MESI model: None -> Invalid, Branch -> Shared, Trunk -> Exclusive, and Dirty -> Modified.

In addition, the standard says that TL-UH operations can be performed on the B and C channels. The intent of the standard is to allow the Slave to forward operations to the Master that has cached data. For example, if Master A sends a Put request on channel A, the Slave sends a Put request to Master B on channel B, Master B sends an AccessAck response on channel C, and the Slave forwards the response back to Master A on channel D.

This is like a network on a chip, where the Slave is responsible for routing requests between Masters.

Broadcast

The next step is to look at Rocket Chip’s own broadcast-based implementation of the cache coherency protocol. The core implementation is TLBroadcast, and its core logic is this: when Master A sends an Acquire, TLBroadcast sends a Probe to the other Masters, and returns the Grant to Master A once all of them have responded with ProbeAck.

First look at the Probe logic on the B channel. It keeps a todo bitmask recording which Masters still need to be sent a Probe. A Probe Filter is used to reduce the number of Probes sent, since a Probe only needs to go to the Masters that hold the cache line.

val probe_todo = RegInit(0.U(max(1, caches.size).W))
val probe_line = Reg(UInt())
val probe_perms = Reg(UInt(2.W))
val probe_next = probe_todo & ~(leftOR(probe_todo) << 1)
val probe_busy = probe_todo.orR()
val probe_target = if (caches.size == 0) 0.U else Mux1H(probe_next, cache_targets)

// Probe whatever the FSM wants to do next
in.b.valid := probe_busy
if (caches.size != 0) {
    in.b.bits := edgeIn.Probe(probe_line << lineShift, probe_target, lineShift.U, probe_perms)._2
}
when (in.b.fire()) { probe_todo := probe_todo & ~probe_next }

Here probe_next is a one-hot bitmask selecting the Master being probed, and probe_target is that Master’s id. The input to this Probe FSM is the Probe Filter, which reports which caches currently hold the cache line.
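
The expression for probe_next isolates the lowest set bit of probe_todo. The plain-Scala sketch below models this on Int values; leftOR here is a software re-implementation written for this example (bit i of the result is set when any bit at or below i is set), and ProbeNextModel and probeOrder are hypothetical names.

```scala
// Plain-Scala sketch of how probe_next picks one Master at a time from the todo bitmask.
// leftOR is re-implemented here for Int values; it is a model, not the rocket-chip utility.
object ProbeNextModel {
  // Bit i of the result is set when any bit j <= i of x is set (within `width` bits).
  def leftOR(x: Int, width: Int): Int = {
    require(width >= 1 && width <= 30, "model only supports small widths")
    val mask = (1 << width) - 1
    var acc = x & mask
    var shift = 1
    while (shift < width) {
      acc = (acc | (acc << shift)) & mask
      shift <<= 1
    }
    acc
  }

  // Same shape as the hardware expression: keep only the lowest set bit of todo.
  def probeNext(todo: Int, width: Int): Int =
    todo & ~(leftOR(todo, width) << 1)

  // Drain the bitmask as the FSM does, clearing one bit per fired Probe.
  def probeOrder(todo: Int, width: Int): List[Int] = {
    var remaining = todo
    val order = scala.collection.mutable.ListBuffer.empty[Int]
    while (remaining != 0) {
      val next = probeNext(remaining, width)
      order += Integer.numberOfTrailingZeros(next)
      remaining &= ~next
    }
    order.toList
  }
}
```

For a todo mask of 0b1011 the Masters are probed in the order 0, 1, 3, and the result of probeNext always equals the familiar todo & -todo idiom.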

val leaveB = !filter.io.response.bits.needT && !filter.io.response.bits.gaveT
val others = filter.io.response.bits.cacheOH & ~filter.io.response.bits.allocOH
val todo = Mux(leaveB, 0.U, others)
filter.io.response.ready := !probe_busy
when (filter.io.response.fire()) {
    probe_todo  := todo
    probe_line  := filter.io.response.bits.address >> lineShift
    probe_perms := Mux(filter.io.response.bits.needT, TLPermissions.toN, TLPermissions.toB)
}

There are two cases. If the Acquire needs to enter the Trunk state (e.g. for a write), the other Masters must enter the None state, so toN is sent. If the Acquire does not need the Trunk state (e.g. for a read), the other Masters only need to drop to the Branch state, so toB is sent.
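
The decision can be restated as two pure functions over the filter response. This plain-Scala sketch mirrors the structure of the snippet above; the case class FilterResponse and the object ProbePlan are hypothetical stand-ins for the hardware bundle, and the bitmasks are ordinary Ints.

```scala
// Plain-Scala sketch of the probe decision: whom to probe and with which permission cap.
// FilterResponse and ProbePlan are hypothetical names modelling the fields used above.
object ProbePlan {
  sealed trait Cap
  case object ToN extends Cap // other Masters must drop to None
  case object ToB extends Cap // other Masters may keep a read-only copy

  final case class FilterResponse(needT: Boolean, gaveT: Boolean, cacheOH: Int, allocOH: Int)

  // Masters to probe: everyone holding the line except the requester, unless all
  // current copies can stay as they are (nobody needs or was given Trunk).
  def todo(r: FilterResponse): Int = {
    val leaveB = !r.needT && !r.gaveT
    if (leaveB) 0 else r.cacheOH & ~r.allocOH
  }

  def perms(r: FilterResponse): Cap = if (r.needT) ToN else ToB
}
```

A read of a line that is only shared needs no Probe at all, which is the case the leaveB shortcut covers.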

ProbeAck and ProbeAckData on the C channel are processed while the B channel is sending Probes.

// Incoming C can be:
// ProbeAck     => decrement tracker, drop 
// ProbeAckData => decrement tracker, send out A as PutFull(DROP)
// ReleaseData  =>                    send out A as PutFull(TRANSFORM)
// Release      => send out D as ReleaseAck

Since an invalidation-based protocol is used here, a Master that was previously in the Dirty state sends ProbeAckData, and that data must be written back, so it is written out with PutFull.
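
The four cases in the comment can be summarized as a dispatch table. The sketch below is plain Scala and only restates the comment; CChannelModel, Action and handle are hypothetical names, and the DROP and TRANSFORM tags are copied from the comment without further interpretation.

```scala
// Plain-Scala restatement of the C-channel comment as a dispatch table.
// Names are hypothetical; Drop and Transform mirror the DROP/TRANSFORM tags in the comment.
object CChannelModel {
  sealed trait CMessage
  case object ProbeAck extends CMessage
  case object ProbeAckData extends CMessage
  case object Release extends CMessage
  case object ReleaseData extends CMessage

  sealed trait PutTag
  case object Drop extends PutTag
  case object Transform extends PutTag

  final case class Action(
    decrementTracker: Boolean,  // one fewer outstanding ProbeAck to wait for
    putFullOnA: Option[PutTag], // write the data out as PutFull, with the given tag
    releaseAckOnD: Boolean      // answer directly with ReleaseAck
  )

  def handle(msg: CMessage): Action = msg match {
    case ProbeAck     => Action(decrementTracker = true, putFullOnA = None, releaseAckOnD = false)
    case ProbeAckData => Action(decrementTracker = true, putFullOnA = Some(Drop), releaseAckOnD = false)
    case ReleaseData  => Action(decrementTracker = false, putFullOnA = Some(Transform), releaseAckOnD = false)
    case Release      => Action(decrementTracker = false, putFullOnA = None, releaseAckOnD = true)
  }
}
```

Only the two messages that carry data lead to a PutFull, and only the two probe responses touch the tracker.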

Reference Documentation