Cache coherence protocol analysis

Write-invalidate and Write-update

There are two basic cache coherence ideas.

write-invalidate: when writing data, set this Cache Line in other Cache to Invalid
write-update: when writing data, write the new result to other Cache that has this Cache Line

Write-once protocol

The write-once protocol defines four states.

Invalid: indicates that the block is not legal
Valid: indicates that the block is legal and may be shared, and that the data has not been modified
Reserved: indicates that the block is legal, not shared, and that the data has not been changed
Dirty: means the block is legal, not shared, and the data has been modified, which is different from the memory.

As you can see, when one cache state is R or D, the other caches can only be I. When the cache state is V, there can be multiple caches in V state.

The characteristic of the write-once protocol is that the first write is written to memory (similar to write-through) and successive writes are written to cache only, similar to write-back.

When a Read hit occurs, the state is unchanged.

`1`	`Read hit: The information is supplied by the current cache. No state change.`

When a Read miss occurs, it looks at all caches and if there are other caches in the Valid/Reserved/Dirty state, it reads the data from the other caches and sets it to Valid and the other caches to Valid. If the other caches are in the Dirty state, it also writes the data to memory.

Read miss: The data is read from main memory. The read is snooped by other caches; if any of them have the line in the Dirty state, the read is interrupted long enough to write the data back to memory before it is allowed to continue. Any copies in the Dirty or Reserved states are set to the Valid state.

When a Write hit occurs, if it is in the Valid state, the memory is written to first, setting all other Caches to Invalid and entering the Reserved state, which means the first write is Write-through. This means that all subsequent writes are Write-back.

Write hit: If the information in the cache is in Dirty or Reserved state, the cache line is updated in place and its state is set to Dirty without updating memory. If the information is in Valid state, a write-through operation is executed updating the block and the memory and the block state is changed to Reserved. Other caches snoop the write and set their copies to Invalid.

When a Write miss occurs, the behavior is different from what Wikipedia tells us in class. According to Wikipedia, it is first handled as a Read miss and then as a Write hit, similar to the idea of a Write Allocate. If this is the case, then the data is first read from other caches or memory, then all other caches are set to Invalid, and the updated data is written to memory and enters the Reserved state. This is equivalent to Write miss, which is also implemented as a write-through.

Write miss: A partial cache line write is handled as a read miss (if necessary to fetch the unwritten portion of the cache line) followed by a write hit. This leaves all other caches in the Invalid state, and the current cache in the Reserved state.

The textbook says that a write-miss is treated as a write-back. If all the other caches are Invalid, read the data from memory and write it to the cache to enter the Dirty state. If the other caches are Valid/Reserved/Dirty, read the data from the other caches, put all the other caches into Invalid state, and then update your own data and enter Dirty state.

MSI Protocol

The MSI protocol is relatively simple, it defines three states.

Modified: indicates that the data has been modified and is not the same as in memory
Shared: the data is consistent with the memory, there can be one or more caches in Shared state at the same time
Invalid: not in the cache

When Read hit, the state is unchanged.

When Read miss, check the status of other caches, if they are all Invalid, read from memory and enter Shared state. If there is Shared, read from other caches. If there is Dirty, then write the data from other caches to memory and local cache, and then go to Shared state.

When a Write hit occurs, if it is now in Shared state, put the other Shared cache into Invalid state, then update the data and enter Modified state. If the state is Modified, the data is modified and the state remains unchanged.

When a Write miss occurs, if there is another cache in the Modified/Shared state, then read the data from the other cache and put the other cache into the Invalid state, then modify the local data and go to the Modified state. If all caches are in Invalid state, then read from memory and modify the cache data to go to Modified state.

MESI Protocol

The MESI protocol defines four states.

Modified: data is inconsistent with memory and only one cache has data
Exclusive: data is consistent with memory and only one cache has data
Shared: data is consistent with memory, and there can be multiple caches with data at the same time
Invalid: not in the cache

When Read hit, the status is unchanged.

When Read miss, first check the status of other cache, if there is data, read data from other cache, and all will enter Shared state. If the other caches are in Modified state, the data will be written to memory; if there is no data in the other caches, it will be read from memory, and then it will enter Exclusive state.

When a Write hit occurs, it enters the Modified state and the other caches enter the Invalid state.

When a Write miss occurs, the state of the other caches is checked, and if there is data, it is read from the other caches, otherwise it is read from memory. Then, all other caches enter the Invalid state, and the local cache updates its data and enters the Modified state.

It is worth mentioning that Shared state does not necessarily mean that only one cache has data: for example, if there are two caches in Shared state and one of them becomes Invalid due to a cache replacement, the other one will not be notified of the change to Exclusive. For example, when there is only one core accessing the data, only the first Read miss will send a bus request, after that it will be in Exclusive/Modified state and no bus request will be sent.

MOESI Protocol

MOESI defines five states.

Modified: the data has been modified and only one cache has this data
Owned: Multiple caches have the data at the same time, but only this cache can modify the data
Exclusive: the data is not modified and only one cache has the data
Shared: Multiple caches have the data at the same time, but the data cannot be modified
Invalid: not in the cache

In the state, M and E are exclusive and there can be only one in all caches. In addition, there can be multiple S’s at the same time, or multiple S’s plus one O, but not multiple O’s at the same time.

Its state transfer is similar to MESI, except that when the core writes to a cache in the Owned state, there are two ways: 1) notify the other Shared cache to update the data; 2) set the other Shared cache to Invalid, and then the local cache enters the Modified state. In the case of a Read miss, the data can be read from the Owned cache and enter the Shared state without writing to memory. The advantage of this over MESI is that it reduces the number of writes back to memory.

The AMD64 documentation uses the MOESI protocol, and the AMBA ACE protocol is actually the MOESI protocol, but with a few name changes to indicate compatibility with one of the MEI/MESI/MOESI protocols.

UniqueDirty: Modified
SharedDirty: Owned
UniqueClean: Exclusive
SharedClean: Shared
Invalid: Invalid

Note that SharedClean does not mean that its data is consistent with memory, for example, with the SharedDirty cache, it just means that when the cache is replaced, it does not need to be written back to memory.

Dragon Protocol

The Dragon protocol is an update based protocol, meaning that when a cache is written, the updated data is synchronized to the other cores that own this cache line. It defines four states.

Exclusive clean(E): exclusive, and data is consistent with memory
Shared clean(Sc): data exists in multiple caches at the same time, and it is not the last to write to the cache data
Shared modified(Sm): data exists in multiple caches at the same time, and it is the last one to write data to the cache, similar to the owner state of the MOESI protocol.
Modify(M): exclusive, and the data is not consistent with the memory

As you can see, both E and M are exclusive, and if there are multiple caches with the same cache line, there are several Sc and one Sm.

When a Read miss occurs, it checks on the bus if any cache already has the data of this cache line, if not, it reads from memory and moves to the Exclusive clean state; if it is already in another cache, it reads from the other cache, moves the other cache to the Shared clean/Shared modified state, and then the cache then the cache is moved to Shared clean state.

When a Write miss occurs, the state of the other cache is also checked, and if it is the first one accessed, it is read from memory, updated, and moved to the Modify state; if it is not the first one accessed, it goes to the Shared modified state, and the original Shared modified cache goes to the Shared clean state.

When a Write hit occurs, if the state is Shared modified, you need to notify the other caches to update the data; if the state is Shared clean, you need to notify the other caches to update the data and let the original Shared modified cache enter the Shared clean state at the same time; if the state is If the status is Shared clean, then the cache that was Shared modified will enter the Shared clean state.

Here, the Shared modified cache is responsible for writing the data to memory when swapping out.

ACE protocol

The ACE protocol adds three channels to the AXI:

AC: Coherent address channel, Input to master: ACADDR, ACSNOOP, ACPROT
CR: Coherent response channel, Output from master: CRRESP
CD: Coherent data channel, Output from master: CDDATA, CDLAST

In addition, the following signals have been added to the existing channels:

ARSNOOP[3:0]/ARBAR[1:0]/ARDOMAIN[1:0]
awsnoop[3:0]/awbar[1:0]/awdomain[1:0]/awunique
rresp[3:2]
RACK/WACK

ACE-lite only adds a new signal to an existing channel, not a new channel, so it cannot have a Cache internally, but it can access consistent cache contents.

When a read miss occurs, first the AXI master sends a read transaction to Interconnect, which sends an AC request to the cache where the cache line is stored, and if another master provides data, it returns the data to the requesting master; if no other If no other master provides data, the read request is made to memory and the result is returned to the master, and finally the master provides the RACK signal.

When a Write miss occurs, similarly, the AXI master sends a MakeUnique request to Interconnect, which sends a request to the cache where the cache line is stored, asking the other masters to change their status to Invalid; when all masters have invalidated has been invalidated, the result is returned to the original AXI master.

Directory-based cache consistency

The cache coherence protocol above often has this operation: send/receive messages to all caches that have this cache line. The simple way is to broadcast it directly, and then the receiver side determines for itself whether to process it or not. But this method causes too much broadcast traffic when there are a lot of cores, so you need to save the information about which caches will have this cache first, and then send it peer-to-peer for those caches. This will save some network traffic.

So, how to record this information? A simple solution (Full bit vector format) is to have a global table that records a bit vector of size N (N is the number of cores) for each cache line, with 1 indicating that the corresponding core has the cache line. However, this approach is too large: the number of cache lines is proportional to N, which is multiplied by N again, for a total capacity of O(N^2).

A slightly better approach (Coarse bit vector format) is to group the cores, say by NUMA nodes, where each cache line holds a bit vector of size M (M is the number of NUMAs), and as long as the cache line is present in that NUMA node, the corresponding bit is taken as 1. This is equivalent to sacrificing some traffic （This is equivalent to saving some directory storage space at the expense of some traffic (broadcast within the NUMA node).

In practice, however, a cache line is usually only kept in a small number of cores, so there is a lot of room for optimization here. For example, one can set an upper limit on the number of simultaneous cache lines (Limited pointer format) and save the subscripts of the cores instead of the bit vectors, so that the storage space is O(Nlog2N). However, this limits the number of simultaneous occurrences of cache lines, and if the limit is exceeded, the existing cache needs to be replaced, which may degrade performance in some scenarios.

There is another way, which is the chained directory format. The directory saves the core number of the last accessed core, and then the cache of each core holds the number of the next core that holds this cache line, or indicates that the chain is terminated. The storage space is also O(Nlog2N), but the delay in sending messages is longer because they have to be traversed serially instead of being sent simultaneously. Similarly, a number-balanced binary tree format can be used: each cache holds two pointers to the left and right subtrees, which are then traversed separately, again with the goal of speeding up the traversal and allowing messages to be sent to multiple cores simultaneously.

Table of Contents