We recently analyzed parts of TileLink’s cache coherence protocol (see TileLink Bus Protocol Analysis), and this post takes the opportunity to study the ACE protocol as well.

The following mainly refers to the IHI0022E version, which is the ACE version corresponding to AXI4.

Review

Let’s start by reviewing which operations a cache coherence protocol needs to support. A higher-level cache needs to be able to do several things:

  1. When a read or write misses, it needs to request the data of that cache line and update its state, e.g. to Shared on a read or to Modified on a write.
  2. When writing a cache line that is valid && !dirty, it needs to upgrade its state, e.g. from Shared to Modified.
  3. When evicting a cache line that is valid && dirty, it needs to write the dirty data back and downgrade its state, e.g. Modified -> Shared/Invalid. When evicting a cache line that is valid && !dirty, it can choose whether or not to notify the next level.
  4. When a snoop request is received, the current cache data needs to be returned and the state updated.
  5. A method is needed to notify the next-level cache/interconnect that operations 1 and 2 are complete.

If you’ve seen my TileLink analysis before, the operations above correspond to TileLink as follows:

  1. When a read or write misses, you need to request the data of that cache line (send AcquireBlock, wait for GrantData) and update your state, e.g. to Shared on a read or to Modified on a write.
  2. When writing a valid && !dirty cache line, you need to upgrade your state (send AcquirePerm, wait for Grant), e.g. from Shared to Modified.
  3. When evicting a valid && dirty cache line, you need to write back the dirty data (send ReleaseData, wait for ReleaseAck) and downgrade your state, e.g. Modified -> Shared/Invalid. When evicting a valid && !dirty cache line, you can choose to notify the next level (send Release, wait for ReleaseAck) or not.
  4. When a snoop request is received (Probe received), the current cache data needs to be returned (send ProbeAck/ProbeAckData) and the state updated.
  5. A method (send GrantAck) is needed to notify the next-level cache/interconnect that operations 1 and 2 are complete.

With this in mind, we can move on to the ACE design.

Cache state model

First, let’s look at the cache state model of ACE, which I also covered in my previous analysis of cache coherence protocols. It has the following five states, which are simply different names for the MOESI states.

  1. UniqueDirty: Modified
  2. SharedDirty: Owned
  3. UniqueClean: Exclusive
  4. SharedClean: Shared
  5. Invalid: Invalid

The definition in the documentation is as follows.

  • Valid, Invalid: When valid, the cache line is present in the cache. When invalid, the cache line is not present in the cache.
  • Unique, Shared: When unique, the cache line exists only in one cache. When shared, the cache line might exist in more than one cache, but this is not guaranteed.
  • Clean, Dirty: When clean, the cache does not have responsibility for updating main memory. When dirty, the cache line has been modified with respect to main memory, and this cache must ensure that main memory is eventually updated.

Roughly speaking, Unique means that only one cache has this cache line, Shared means that there may be multiple caches with this cache line; Clean means that it is not responsible for updating memory, and Dirty means that it is responsible for updating memory. Many of the following operations are based around these states.
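As a rough illustration of the state model above, here is a minimal Python sketch that encodes each ACE state as a (valid, unique, dirty) triple and maps it to its MOESI name. The names LineState, MOESI_NAME, can_write_locally and must_write_back_on_evict are hypothetical helpers made up for this example; they do not come from the specification.

```python
from enum import Enum


class LineState(Enum):
    # Each value is (valid, unique, dirty), following the definitions above.
    UNIQUE_DIRTY = (True, True, True)
    SHARED_DIRTY = (True, False, True)
    UNIQUE_CLEAN = (True, True, False)
    SHARED_CLEAN = (True, False, False)
    INVALID = (False, False, False)

    @property
    def valid(self):
        return self.value[0]

    @property
    def unique(self):
        return self.value[1]

    @property
    def dirty(self):
        return self.value[2]


# ACE state name -> MOESI name, as listed in the article.
MOESI_NAME = {
    LineState.UNIQUE_DIRTY: "Modified",
    LineState.SHARED_DIRTY: "Owned",
    LineState.UNIQUE_CLEAN: "Exclusive",
    LineState.SHARED_CLEAN: "Shared",
    LineState.INVALID: "Invalid",
}


def can_write_locally(state):
    """Hypothetical helper: no other cache can hold the line, so a store
    needs no coherence request first."""
    return state.valid and state.unique


def must_write_back_on_evict(state):
    """Hypothetical helper: a dirty line carries the duty to update memory."""
    return state.valid and state.dirty


if __name__ == "__main__":
    for s in LineState:
        print(s.name, MOESI_NAME[s], can_write_locally(s), must_write_back_on_evict(s))
```

Splitting the state into independent unique and dirty flags is what makes the later operations easy to describe: "who else may hold the line" and "who must update memory" are separate questions.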

The documentation also says that, besides full MOESI, subsets such as MESI, ESI and MEI are supported, so in a simplified system where some states cannot occur, the implementation may differ.

Channel usage examples

So far I haven’t introduced ACE signals, but let’s try to see how we would add signals to accomplish this if we were the protocol designer.

First, consider the first thing mentioned above: when a read or write misses, you need to request the data of that cache line and update your state, e.g. to Shared on a read or to Modified on a write.

As we know, AXI has AR and R channels for reading data, so when you encounter a read or write miss, you can piggyback some information on the AR channel to let the next level Interconnect know whether you intend to read or write, and then the Interconnect will return the data on the R channel.

So, what exactly is the message to be piggybacked? For example, on a read miss I need to read the data and enter the Shared state, so I call it ReadShared; on a write miss I need to read the data (usually a write to the cache covers only part of a cache line, so the rest of the line has to be fetched first) and enter the Unique state, so I call it ReadUnique. This operation can be encoded in a signal and passed to the Interconnect.

Consider the second thing mentioned above: when writing a cache line that is valid && !dirty, it needs to upgrade its status, for example from Shared to Modified.

This operation requires Interconnect to clear the cache line from the other caches and upgrade itself to Unique, which we can name CleanUnique based on the operation+destination nomenclature above, i.e., clean all the other caches and make itself Unique.

Next, consider the third thing mentioned above: when you need to evict a valid && dirty cache line, you need to write back the dirty data and downgrade your state, e.g. Modified -> Shared/Invalid.

Following the operation + destination state nomenclature used above, it could be named WriteBackInvalid; the name ACE actually uses is simply WriteBack.
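The three requester-side cases derived so far can be summarized in a short Python sketch. It deliberately uses the article's simplified valid/dirty view of a line; the function name choose_transaction and the event strings are assumptions made for illustration, and real ACE defines more transaction types than the four shown here.

```python
def choose_transaction(event, valid, dirty):
    """Pick the request a cache would send, or None if no request is needed.

    event: "read", "write" or "evict" (hypothetical labels for this sketch).
    valid, dirty: the article's simplified view of the line's state.
    """
    if event == "read":
        # Read miss: fetch the data and end up Shared.
        return None if valid else "ReadShared"
    if event == "write":
        if not valid:
            # Write miss: fetch the data and end up Unique.
            return "ReadUnique"
        # Write hit on a clean line: data is already here, only the
        # permission is missing. In this simplified view a dirty line is
        # treated as Modified; an Owned (SharedDirty) line is not covered.
        return None if dirty else "CleanUnique"
    if event == "evict":
        # Dirty lines must be written back; clean ones are dropped silently
        # here (notifying the next level is optional and omitted).
        return "WriteBack" if valid and dirty else None
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    print(choose_transaction("read", valid=False, dirty=False))
    print(choose_transaction("write", valid=True, dirty=False))
    print(choose_transaction("evict", valid=True, dirty=True))
```

The first two results travel on the AR channel and the write-back on the AW/W channels, which is why no new request channel is needed for these cases.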

Finally, we get to the fourth thing: when we receive a snoop request, we need to return the current cache data and update the state.

Since the snoop is sent from the Interconnect to the Master, it cannot be carried on the existing AR, R, AW, W and B channels without breaking their existing logic. So we have to add a pair of channels: say, an AC channel on which the snoop request is sent and a C channel on which the master sends its response. This is equivalent to the B channel (Probe request) and C channel (ProbeAck response) in TileLink. The actual ACE is a bit different from the one just designed, splitting the C channel into two: CR returns the response for every snoop and CD returns the data for those snoops that need it. This is like the relationship between AW and W, where one passes the address and the other passes the data; similarly, CR passes the status and CD passes the data.

So, let’s consider which requests are sent on the AC channel. Reviewing the request types used above, ReadShared, ReadUnique and CleanUnique require a snoop, while WriteBack does not. We can then simply forward the ReadShared, ReadUnique and CleanUnique requests as they are, through the AC channel, to the caches that need to be snooped.

When the cache receives these requests on the AC channel, it can act accordingly. Since the same request can be answered in different ways under the MOESI protocol, we won’t go into details here.
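To give a feel for what "act accordingly" can mean, here is a Python sketch of one possible policy for a snooped cache. It follows the simplified design in this article, where the request names are forwarded on AC unchanged. The fields data_transfer, pass_dirty and is_shared are loosely modeled on the snoop response bits of the same names; the function respond_to_snoop and the choice among the permitted responses are assumptions of this sketch, not the specification's full table.

```python
from typing import NamedTuple


class SnoopResponse(NamedTuple):
    data_transfer: bool  # data follows on the data channel (CD)
    pass_dirty: bool     # responsibility for updating memory is handed over
    is_shared: bool      # the snooped cache keeps a copy
    new_state: str       # state of the line in the snooped cache afterwards


def respond_to_snoop(snoop, state):
    """One possible response policy (an assumption, other choices are legal)."""
    if state == "Invalid":
        return SnoopResponse(False, False, False, "Invalid")
    dirty = state in ("UniqueDirty", "SharedDirty")
    if snoop == "ReadShared":
        # Supply the data and keep a copy. A dirty holder stays responsible
        # for memory here, so it ends up SharedDirty (Owned).
        new_state = "SharedDirty" if dirty else "SharedClean"
        return SnoopResponse(True, False, True, new_state)
    if snoop == "ReadUnique":
        # The requester wants the only copy: supply data, then invalidate.
        return SnoopResponse(True, dirty, False, "Invalid")
    if snoop == "CleanUnique":
        # The requester already has the data; only dirty data must be
        # handed over before invalidating.
        return SnoopResponse(dirty, dirty, False, "Invalid")
    raise ValueError(f"unknown snoop: {snoop}")


if __name__ == "__main__":
    print(respond_to_snoop("ReadShared", "UniqueDirty"))
    print(respond_to_snoop("ReadUnique", "SharedClean"))
```

This policy keeps dirty data with its current owner on a shared read, which is where the Owned state earns its place; a MESI-only design would instead have to pass the dirty data on or write it back.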

At this point we have basically deduced the signals and overall workflow of the ACE protocol. Oh, and we forgot the fifth thing: we need a way to notify the next-level cache/interconnect that the first and second operations are complete. TileLink adds an extra E channel to do this, and ACE is even more blunt: it directly uses a pair of signals, RACK and WACK, to indicate that the last read and the last write have completed, respectively.

For more information on WACK and RACK, see the discussion "What’s the purpose for WACK and RACK for ACE and what’s the relationship with WVALID and RVALID?".
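As a sketch of why such an acknowledgment is useful, the following Python model shows an interconnect that holds back snoops to a master for a line whose read data it has returned but which the master has not yet acknowledged. This reflects my understanding of the purpose of RACK rather than a statement from the specification. The class SnoopGate and its methods are hypothetical, and the explicit master and address arguments are a simplification: the real RACK is a bare signal that carries neither.

```python
class SnoopGate:
    """Toy model: block snoops until the requester acknowledges its read."""

    def __init__(self):
        # (master, line address) pairs with read data sent but no RACK yet.
        self.awaiting_rack = set()

    def read_data_returned(self, master, addr):
        # The last beat of read data went out on the R channel.
        self.awaiting_rack.add((master, addr))

    def rack(self, master, addr):
        # The master signals that it has finished updating its cache state.
        self.awaiting_rack.discard((master, addr))

    def may_snoop(self, master, addr):
        # Snooping earlier could hit the line before the master installed it.
        return (master, addr) not in self.awaiting_rack


if __name__ == "__main__":
    gate = SnoopGate()
    gate.read_data_returned("cpu0", 0x1000)
    print(gate.may_snoop("cpu0", 0x1000))  # still waiting for RACK
    gate.rack("cpu0", 0x1000)
    print(gate.may_snoop("cpu0", 0x1000))  # acknowledged
```

TileLink's GrantAck plays the same role, except that it is a full message on its own channel rather than a single wire.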

Summary

We will not continue the analysis here. Many of the other request types exist to serve additional scenarios: writing an entire cache line at once, so that the existing data does not need to be read first; reading data once without keeping it; or masters that have no cache at all, such as accelerators and DMA engines. For these there are targeted optimizations or simplifications. For example, a master without a cache can use the simplified ACE-Lite; ARM’s CCI-400 supports two ACE masters and three ACE-Lite masters, and the latter can be used to connect peripherals such as GPUs. Simplifying ACE-Lite further gives ACP (Accelerator Coherency Port).

References