rocksdb supports two concurrency control modes, PessimisticTransactionDB and OptimisticTransactionDB. Both are wrappers around the DB object that perform concurrency control outside the storage engine, giving applications transactional KV reads and writes through BEGIN, COMMIT, and ROLLBACK style APIs.
rocksdb can already write a WriteBatch atomically, and transactions build on top of WriteBatch: writes within a transaction are staged in the transaction's own WriteBatch, and reads within the transaction first consult that WriteBatch, then the MemTable, L0, L1, and so on. During this period other transactions cannot see the WriteBatch's contents; when the transaction commits, the WriteBatch is written into the MemTable and WAL, making it visible to other transactions.
Since the LSM Tree already has atomic write capabilities, what rocksdb does is mainly at the level of concurrency control outside the LSM Tree, and supports both optimistic and pessimistic concurrency control.
- In optimistic concurrency control, the keys read and written within the transaction are tracked, and conflict detection runs at Commit() to check whether those keys have been modified by other transactions; if so, the WriteBatch is discarded and the transaction aborts, as if nothing happened.
- In pessimistic concurrency control, a lock is taken on each key written within the transaction; when those keys are written out at Commit(), the locks are released, and no conflict detection is needed.
The transaction API is used, roughly, as follows.
Optimistic Concurrency Control: Snapshots and Conflict Detection
Let’s look at optimistic concurrency control. As mentioned earlier, optimistic concurrency control mainly lies in the conflict detection at Commit() time, but it also relies on the Snapshot mechanism to achieve transaction isolation. rocksdb’s Snapshot is unchanged from leveldb’s: each WriteBatch write increments the sequence number, a Snapshot is simply a sequence number, and a query filters for the latest value of a key whose sequence number is less than or equal to the Snapshot's.
The entry point for tracking the key comes from the TrackKey() method called by
where tracked_locks_ is a unique_ptr to a LockTracker object.
From the LockTracker class you can see the difference between optimistic and pessimistic concurrency control: pessimistic concurrency control uses a LockMgr to manage locking and unlocking, while optimistic concurrency control has no LockMgr and only tracks keys without actually locking. The LockTracker implementation is roughly a convenience wrapper around a collection, so let’s skip it and look at how Commit performs conflict detection based on the information it records.
The main function entry for conflict detection is
OptimisticTransaction::CheckTransactionForConflicts(), which further calls
TransactionUtil::CheckKeysForConflicts() to perform conflict detection.
rocksdb does not implement SSI; it only tracks conflicts for write operations, so the process is simpler than badger’s SSI. Only one thing needs to be checked at commit time: whether any key written by the current transaction was updated in the db after the sequence number at which the transaction started.
The naive way to do this is to read the db once for each key, fetch its most recent sequence number, and compare it with the transaction’s sequence number. If the sequence number in the db is larger, then sorry, there is a transaction conflict.
rocksdb certainly will not issue one read IO per key in the transaction. The idea is this: transactions do not run for long, so for conflict checking you do not need a key's full history, only its most recent data, and the most recent data lives in the MemTable. Recent conflict detection can therefore be done by reading only the MemTable, with no IO at all.
However, there is one more constraint: if the transaction started earlier than the oldest key in the MemTable, no determination can be made. rocksdb’s handling here is rather blunt: it simply declares the transaction expired and asks you to retry. The following is the relevant logic in
The idea of using the MemTable for conflict detection is quite clever, but it only works for SI, where only writes need to be tracked. If SSI used the same mechanism, every read would have to become a MemTable write, which is far less practical than tracking key sets the way badger does.
Pessimistic Concurrency Control: Lock Management
Pessimistic concurrency control is implemented by locking the keys a transaction reads and writes, so that other transactions trying to take the lock go into a wait. However, "locking" here is not literally one mutex per key; rocksdb has its own set of row-lock semantics.
The main objects are LockMap, LockMapStripe, and LockInfo.
LockMap is the entry point for all locks in a Column Family. Each LockMap is divided into 16 LockMapStripes, and each LockMapStripe holds a mapping from keys to LockInfo.
In simple terms, LockMap is a mapping table from key to LockInfo, internally partitioned into LockMapStripes to improve concurrency.
LockInfo carries read/write lock semantics: exclusive indicates whether the lock is exclusive; if not, multiple read operations may hold the lock at the same time, and the list of holding transactions is maintained in txn_ids.
If the lock is already held, the acquirer waits on stripe_cv. As you can see, the locks here are user-space read/write locks built on system primitives such as CondVar and mutex.
Let’s look at the implementation of TryLock().
This part is relatively easy to read: it locates the LockMapStripe, generates the LockInfo, and finally calls AcquireWithTimeout to enter the lock acquisition flow.
Leaving aside the deadlock detection part, the logic here is relatively simple:
- try to take the lock; AcquireLocked() is non-blocking and returns failure immediately if the lock cannot be taken
- if acquisition fails, back off and wait for a stripe_cv notification
- loop and retry the acquisition
AcquireLocked() is a textbook implementation of read/write locking.
- If the lock for the key is not yet held, then in the normal case the lock information is stored in the stripe and txn_ids is set to the current transaction ID.
- The "abnormal" case is that rocksdb has a max_num_locks_ option limiting the total number of locks; if the limit would be exceeded, locking fails with Status::Busy.
- If the lock for the key is already held:
- If neither the held lock nor the requested lock is exclusive, multiple readers may share the lock: the current transaction ID is appended to the txn_ids of the lock_info in the stripe, and the read lock is acquired.
- If either the held lock or the requested lock is exclusive, acquisition normally fails with Status::TimedOut, and txn_ids is returned so the caller knows which transactions hold the lock, to assist deadlock detection. There are two special cases, though:
- Recursive locking: if the lock is held by the current transaction itself, the lock's exclusive flag is overwritten and acquisition succeeds.
- Lock expiration: if the held lock happens to have expired, it can be taken over.
The code of AcquireLocked() is as follows.
Finally, let's look at unlocking.
where UnLockKey() is a wrapper that removes the current transaction ID from the LockInfo in the stripe, or deletes the entire LockInfo.
Waking lock waiters here is a rather brute-force stripe_cv->NotifyAll(): all waiters are woken in a thundering herd, but only one of them can grab stripe_mutex and proceed.
Pessimistic Concurrency Control: Deadlock Detection
What deadlock detection does is track lock dependencies between transactions, determine via BFS traversal whether the dependency graph contains a cycle, and block such cyclic locking operations in advance. deadlock_detect is off by default in rocksdb; with active deadlock detection disabled, deadlocks can still be recovered from via the lock timeout mechanism.
Deadlock detection occurs in the AcquireWithTimeout function.
A single call to AcquireLocked, when it fails, returns wait_ids, the list of transactions being waited on. This information is the basis for tracking the lock dependency graph.
The fields related to deadlock detection are.
rev_wait_txn_map_ appears to be used for pruning: it tracks the number of waiters for each transaction ID. If a transaction has zero waiters, it cannot possibly be part of a deadlock, so the graph traversal can be skipped. Conversely, a nonzero waiter count does not necessarily mean a deadlock, so the graph still has to be traversed to find out.
The wait_txn_map_ field is the graph that BFS traverses. In IncrementWaiters, wait_txn_map_ is traversed starting from the current transaction's wait_ids, and if the traversal reaches the current transaction's own ID again, a cycle exists.
The logic of the BFS part is as follows.
The BFS queue is implemented with two arrays of maximum length deadlock_detect_depth_, queue_values and queue_parents, plus two indices head and tail; tail marks the queue tail, head the queue head, and traversal ends when head catches up with tail. When a deadlock dependency is found, the deadlock path is reconstructed from the information in queue_parents and recorded in dlock_buffer_ to aid diagnosis.
- In OptimisticTransactionDB, rocksdb reads the key's latest sequence number directly from the MemTable and uses it for conflict detection during Commit, to determine whether the key was written by another transaction while the transaction was running.
- In PessimisticTransactionDB, rocksdb implements row lock semantics based on CondVar and mutex in LockManager, each Key corresponds to a row lock, and waits for CondVar notification if there is a lock conflict when obtaining a row lock.
- Deadlock detection is a BFS over the lock dependency graph, looking for a cycle among lock dependencies; if one is found, the locking operation would deadlock, so it is blocked in advance. rocksdb does not enable deadlock detection by default; if a deadlock occurs, recovery is still possible via lock timeout.