I recently wanted to learn something about the postgres ecosystem, and I didn’t quite understand its MVCC mechanism before, so I came back to try to understand it again. Here we ignore the concurrency control and cleanup part of MVCC and just look at the Snapshot part first.

Tuple

Postgres doesn’t have the MySQL kind of UNDO log, the multi-version data (Tuple) is stored directly in the tablespace with meta information to distinguish the versions.

  • xmin: indicates the xid (transaction ID) of the Tuple when it was inserted
  • xmax: the xid of the Tuple when it is deleted

For example, if there is a tuple that is inserted and committed.

1
2
| xmin | xmax | band  | fans  |
| 023  | 0    | tfboy | 9000w |

After deletion in a new transaction.

1
2
| xmin | xmax | band  | fans  |
| 023  | 024  | tfboy | 9000w |

You can see that xmax is set to the xid of the new transaction.

If the Tuple is updated in a new transaction, Postgres treats the update as a delete + insert.

1
2
3
| xmin | xmax | band  | fans   |
| 023  | 024  | tfboy | 9000w  |
| 024  | 0    | tfboy | 10000w |

A counterintuitive point here is that the Tuple of the tablespace does not change immediately when the transaction is COMMIT or ROLLBACK, and the commit status of the transaction depends on the record of the XACT structure.

XACT can be thought of as a synonym for clog (Commit Log), which consists of a set of 8kb pages with two bits for each transaction ID, indicating whether the transaction is Ingress, Committed, or Aborted. clog is continuously appended, rotating every 256kb, but it does not grow indefinitely. vacuum can clean up useless clog files.

So when querying table data, you often need to dichotomously look up a XACT (clog) to get the commit status of this row of data, and there is some overhead in querying XACT more than once, so postgres also has two hint bits in Tuple, which refer to committed or rollbacked, respectively. If a Tuple is committed/rollbacked during a read, a hint bit is set so that the next time you don’t need to access XACT again. It is more like a Read Repair process.

Why is it designed this way? This is the original quote from “MVCC in PostgreSQL-3. Row Versions”.

Why does not the transaction that performs the insert set these bits? When an insert is being performed, the transaction is yet unaware of whether it will be completed successfully. And at the commit time it’s already unclear which rows and in which pages were changed. There can be a lot of such pages, and it is impractical to keep track of them. Besides, some of the pages can be evicted to disk from the buffer cache; to read them again in order to change the bits would mean a considerable slowdown of the commit.

In this way, the XACT setup is a bit like the Commit Point in percolator, where one atomic step determines the commit status of N participants in a transaction.

Snapshot

In a store like Rocksdb, where there is no uncommitted data in the storage tier, Snapshot only requires a sequence number. However, Postgres requires a bit more information.

  • xmin: the earliest XID that was still active when the current transaction was started, all data created with less than xmin should be visible (except for rollback data)
  • xmax: XID + 1 for the most recent committed transaction, all data greater than xmax should not be visible
  • xip[]: list of currently active transaction XIDs, all data related to active transactions should be invisible

Refer to the diagram in How Postgres Makes Transactions Atomic to draw a similar one.

Snapshot

In this Snapshot, there are 100 and 102 transactions that satisfy 100 ≤ XID < 105, and the data they commit is visible; the XID = 99 and 104 transactions are not visible because they are Rollback, and the XID = 101 and XID = 103 transactions are still in progress and are also not visible. It seems that the visibility of a transaction depends on the state of the transaction, and the xmin and xmax ranges can play the role of pruning.

References