When it comes to distributed databases, whether centralized ones like Etcd/Zookeeper or decentralized ones like the Ethereum blockchain, two keywords cannot be avoided: “consistency” and “consensus”.

This article is my notes from studying “consistency”, “consensus”, and the related theory. It can help us understand the differences between Etcd/Zookeeper/Consul/MySQL/PostgreSQL/DynamoDB/Cassandra/MongoDB/CockroachDB/TiDB and the like, the advantages and limitations of each database, what database isolation levels mean and how they should be set, and how to choose a suitable database for various application scenarios.

If you are interested in blockchain, this article can also help you understand how decentralized databases like blockchain differ from the popular distributed databases in the industry in terms of technology, what they have in common, and how they are implemented.

I. Consistency

“Consistency” itself is a rather vague term, with many different meanings depending on the scenario. Since distributed databases are still a fast-evolving field, there are many different consistency models, and some of the terms overlap with one another in ways that can confuse even professional database developers.

At the root, though, when we talk about consistency we are really talking about two things: transaction consistency and data consistency. Each is described below.

1. Transaction Consistency

“Transaction consistency” refers to the consistency of transactions in a database (the C in ACID, arguably the least prominent of the four properties) and is not the focus of this article. Still, a single sentence is not enough, so here is a closer look at transactions and ACID theory.

Transactions and ACID Theory

A transaction is an “All or Nothing” mechanism for running instructions.

ACID Theory defines a “sequence of database instructions” as a transaction if it has the following four characteristics.

  • Atomicity: A transaction is an indivisible unit of work in which all operations are either completed or not completed, and cannot be stalled in some intermediate state.
    • For example, if A transfers $100 to B, either the transfer fails or the transfer succeeds; it cannot be stuck in an intermediate state where A has been deducted $100 and B has not received $100.
    • Atomicity has been well addressed in standalone databases, but it becomes a new challenge in distributed databases. Supporting atomicity in a distributed architecture is not easy, so many NoSQL products chose to sidestep the problem and focus on niche scenarios that are not sensitive to atomicity.
  • Consistency: also known as data “correctness” or integrity; it means that the changes a transaction makes to the database state must satisfy all predefined rules, including “constraints”, “cascades”, “triggers”, and any combination of them.
    • For example, if a user sets a unique constraint on a field, then any change the transaction makes to that table must keep this constraint satisfied, or the transaction fails.
    • It is the least prominent of the four characteristics.
  • Isolation: Multiple transactions executing concurrently are completely isolated from each other, and they execute exactly as if they were executed serially in the order in which the transactions started.
    • This is the most complex characteristic of a transaction; more on it below.
  • Durability: once a transaction has committed, its result is preserved permanently, even across crashes. This is the easiest to understand.

ACID is a core feature of traditional standalone databases, such as MySQL/PostgreSQL.

The most complex feature in ACID - Isolation

Fully implementing ACID yields a database with very poor performance, so in relational databases designers usually choose to relax the comparatively less critical “isolation” to get better performance.

Once isolation is not thorough enough, you may run into anomalies where transactions affect each other. They are classified as follows.

  • Dirty writes: transactions T1 and T2 update the same data on top of the same original value at the same time, producing a result that does not meet expectations.
    • Example: two transactions each try to debit $1000 from an account at the same time; both read the initial balance of $5000 and both try to set it to $4000, so the final result is $1000 short.
    • The simplest solution: for UPDATE table SET field = field - 1000 WHERE id = 1, take a “row write lock” on the row being updated, so that other transactions that need to write this row must wait (a toy simulation of this row-lock idea follows this list).
  • Dirty reads: transaction T1 reads data that transaction T2 has not yet committed. This data is not necessarily accurate and is called dirty data, because if transaction T2 rolls back, T1 will have read incorrect data.
    • Case: suppose Xiao Ming and Xiao Hong share a bank account with a balance of 5000 yuan, and both are spending 1000 yuan from the same account at the same time. Partway through, Xiao Ming’s payment transaction reads the balance of 4000 yuan that Xiao Hong’s uncommitted transaction has written, so it sets the balance to 3000 yuan and the payment succeeds. Afterwards, Xiao Hong’s payment fails and is rolled back, restoring the balance from 3000 back to 5000. Xiao Ming has thus achieved a 0-yuan purchase.
    • The simplest solution: have transaction T2 take a “row write lock” on the row it modifies and release it only after T2 finishes, so that transaction T1’s read is blocked until the lock is released.
  • Non-repeatable reads: after transaction T1 reads the data, transaction T2 updates and commits that data, so that reading the same data again within T1 returns a different value.
    • Case:
      • Xiao Ming buys a product on Jingdong; when the transaction starts it reads that 36 items are left, so it continues with the purchase logic, but reading the stock again later in the same transaction may return a different number.
      • If, inside the transaction, you first read the balance with SELECT field INTO myvar FROM mytable WHERE uid = 1 and then UPDATE the balance based on that value, the data can easily end up in a mess.
        • The correct way is UPDATE mytable SET field = field - 1000 WHERE id = 1: each single SQL statement is itself atomic, so this statement is not a problem.
    • The simplest solution: when transaction T1 reads the data, it also takes a “row” lock on it, and releases the lock once it no longer needs to read that data.
  • Phantom reads: while transaction T1 performs several bulk reads over a range of data, transaction T2 inserts or deletes rows in that range, so T1’s reads return different sets of rows rather than a stable view of the data.
    • The simplest solution: transaction T1 takes a range lock over the bulk read, and releases it after T1 has finished reading. This solves both the “phantom read” and “non-repeatable read” problems.
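
To make the row-lock idea above concrete, here is a minimal Python sketch (an in-memory analogy using threads, not a real database) showing how two concurrent “debit $1000” transactions can lose an update when they read-modify-write without a lock, and how serializing them with a lock restores the expected result:

```python
import threading
import time

balance = {"acct": 5000}        # one "row" in a toy table
row_lock = threading.Lock()     # stands in for the database's row write lock

def debit_unsafe(amount):
    # read-modify-write without a lock: both transactions read 5000,
    # both write 4000, and one debit is silently lost
    current = balance["acct"]
    time.sleep(0.01)            # widen the race window
    balance["acct"] = current - amount

def debit_safe(amount):
    # analogue of UPDATE ... SET field = field - amount WHERE id = ...:
    # the row lock forces the two writers to run one after the other
    with row_lock:
        current = balance["acct"]
        time.sleep(0.01)
        balance["acct"] = current - amount

def run(worker):
    balance["acct"] = 5000
    threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    return balance["acct"]

print("without lock:", run(debit_unsafe))  # usually 4000: one debit lost
print("with lock:   ", run(debit_safe))    # always 3000
```

A real database achieves the same effect with row write locks, or simply by expressing the change as a single atomic UPDATE ... SET field = field - 1000 statement as noted above.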

According to the degree of isolation, the ANSI SQL-92 standard subdivides “isolation” into four levels (avoiding “dirty writes” is a mandatory requirement for databases, so it is not reflected in the four levels below).

  • Serializable: complete isolation; whenever transactions could affect each other, they are forced to execute serially (via a locking mechanism).
  • Repeatable read: avoids dirty reads and non-repeatable reads, but does not solve phantom reads.
  • Read committed: only avoids dirty reads
  • Read uncommitted: the lowest level, completely abandoning isolation

The default isolation level of MySQL is “Repeatable Read”, and the default isolation level of PostgreSQL and Oracle is “Read committed”.

Why is the default isolation level of MySQL/PostgreSQL/Oracle set in this way? How to choose the correct isolation level? Let’s do a simple analysis for a common high concurrency business scenario.

  • First of all, “dirty reads” must be avoided; they would make the transaction read wrong data. The lowest level, “read uncommitted”, is therefore excluded right away.
  • As long as SQL is used correctly, the “non-repeatable read” problem usually has no impact on the correctness of the business logic, so it can be tolerated.
  • So “read-committed” is generally the best isolation level, which is why PostgreSQL/Oracle set it as the default isolation level.
  • So why is MySQL such a maverick, raising the default isolation level to “repeatable read”? And why would a big Internet company like Alibaba change MySQL’s default isolation level back to “read committed”?
    • According to the information I found online, this is a result of MySQL’s history. MySQL before 5.0 only supported the statement binlog format, which has many problems under the “read committed” isolation level; the most obvious is that it can lead to inconsistent data between the master and slave databases.
    • In addition to setting the default isolation level, MySQL also prohibits the use of READ COMMITTED as the transaction isolation level when using binlogs in statement format, and attempts to modify the isolation level will result in the error Transaction level 'READ-COMMITTED' in InnoDB is not safe for binlog mode 'STATEMENT'
    • The reason Internet companies change the isolation level to “READ COMMITTED” is also easy to understand: performance. The lower the isolation level, the higher the concurrency.

The essence of “isolation” is concurrency control of transactions: different isolation levels represent different degrees of isolation between concurrent transactions, and the main implementation means are “multi-version concurrency control (MVCC)” and “locking”. The locking mechanism was briefly introduced above; MVCC, roughly speaking, creates a snapshot for each transaction under a given isolation level, so that reads and writes do not block each other and performance improves (a toy sketch of the snapshot idea follows).
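
As an illustration of the snapshot idea behind MVCC, here is a minimal, purely conceptual Python sketch (not how any particular engine stores versions): every committed write appends a new version tagged with a timestamp, and a reader only sees versions committed no later than its own snapshot timestamp, so readers and writers do not block each other:

```python
import itertools

class MVCCStore:
    """Toy multi-version store: every committed write appends a new version."""
    def __init__(self):
        self._clock = itertools.count(1)   # monotonically increasing "commit timestamps"
        self._versions = {}                # key -> list of (commit_ts, value)

    def begin(self):
        # a transaction's snapshot is simply the timestamp at which it started
        return next(self._clock)

    def write(self, key, value):
        commit_ts = next(self._clock)
        self._versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        # visible version = newest version committed no later than the snapshot
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("balance", 5000)
snap = store.begin()           # reader takes its snapshot here
store.write("balance", 4000)   # a later writer does not block the reader
print(store.read("balance", snap))            # 5000: a stable, repeatable view
print(store.read("balance", store.begin()))   # 4000: a newer snapshot sees the new value
```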

The analysis of anomalies in ANSI SQL-92 is still too simple. The 1995 paper A Critique of ANSI SQL Isolation Levels enriches and refines SQL-92, defining six isolation levels and eight anomalies; it is highly recommended reading. Of these levels, the one we care most about is Snapshot Isolation (SI).

2. Data Consistency

“Data Consistency” means that every read operation to the database should either read the latest data written, or simply report an error.

“Data consistency” is usually not a problem for standalone databases, because there is normally only one copy of the data on disk or in memory. In distributed systems, however, each piece of data is usually stored as multiple copies on multiple nodes for safety, which raises the question of consistency among those copies. So when we talk about “data consistency”, we usually mean the data consistency of distributed systems.

The CAP principle is a well-known theorem in the field of distributed systems. It tells us that it is impossible for a distributed system to achieve all three of the following properties at once, hence the name “CAP impossible triangle”.

  • Data Consistency: Every read operation by a client, no matter which node of the system it accesses, either reads the same copy of the latest data written or fails
    • Emphasizes that the data is exactly correct
  • Availability: Any request from the client, regardless of which non-faulty node is accessed, will get the response data, but it is not guaranteed to be the same latest data
    • Emphasis on service availability, but no guarantee of correct data
  • Partition Tolerance: The system can operate normally even if there is an arbitrary number of lost messages or high latency between nodes
    • That is, network packet loss or latency can cause the system to be divided into multiple Partitions, and the system can tolerate this situation

To see why, consider what happens when a distributed system is split into multiple partitions by network problems: each partition has only two choices, and one of A and C must be sacrificed.

  • cancel the operation and deny the service, which reduces availability but ensures data consistency
  • Continue processing requests, which ensures availability, but data consistency is not guaranteed

If multiple partitions of a system keep serving requests at the same time, producing inconsistent, conflicting data that cannot be merged, this is known as “split brain” in a distributed system, and obviously no distributed system wants a split brain to happen.

A distributed system, unlike a standalone system, involves network communication among multiple nodes, and wherever there is network interaction there will be delays and lost packets, so partition failures between nodes are bound to occur. Therefore, for correct operation, P is a property that distributed systems must guarantee; when a partition failure happens, only A or C can be sacrificed to preserve P.

Whether to engineer AP or CP depends on the situation:

  • Etcd/Zookeeper/Consul: they are typically used to store critical metadata about the operation of the system, and every read must return the latest data. They therefore implement CP at the expense of A.
  • DynamoDB/Cassandra/MongoDB: strict data consistency is not required, and serving slightly stale data for a while is not a big problem, but availability is required, so they implement AP and sacrifice C.

Data Consistency Model

A set of read and write rules over multi-copy data in a distributed system is called a “(data) consistency model”. There are so many consistency models that they are hard to tell apart. To make them easier to understand, we first distinguish strong consistency from weak consistency from a state perspective, and then look at the many consistency models from an operational perspective.

1. State Perspective - Strong Consistency and Weak Consistency

We first treat the whole distributed system as a white box. From the state perspective, after any write operation the multiple data replicas of the system can only be in one of the following three situations.

  • The replicas may be temporarily inconsistent but will eventually transition to a consistent state; this is called “weakly consistent”.
    • This is usually achieved with asynchronous replication to synchronize the states of the replicas.
  • In contrast, if the replicas can never be observed in an “inconsistent” state, and the data must be identical as long as the change operation succeeds, the system is called “strongly consistent”.
    • This requires that updates be applied to all replicas fully synchronously, i.e. fully synchronous replication must be used.
  • Never consistent: this is a bug in distributed systems, also known as “split brain”.

The above describes the objective, actual state of the entire system, but for the vast majority of users a distributed system is more of a black box, so the more popular classification takes a “black box” approach and divides systems into two types based on their externally observable state.

  • Strong Consistency: means that for any node/process of the system, after the write operation is completed, any subsequent access to any node by any user will read the new value. It is as if only one copy of the system exists.
    • The most commonly used algorithms are Raft/Paxos, whose write operations only require more than half of the nodes to be written successfully, so the internal state is actually inconsistent when the write completes, but the effect of reading and writing to it is no different from “fully synchronous replication”.
  • Weakly Consistent: means that for any node/process of the system, after the write operation completes, the value that any subsequent access may get is uncertain, but after some time, any subsequent access reads the new value.
    • Weak Consistency is very vaguely defined. If we refer to the fact that eventually all users can access the new value as “system convergence”, the time used for system convergence can be well-bounded or not. The access behavior before system convergence can have explicit specification or no specification. It all depends on the implementation of the specific system.
    • If the system converges within a finite time, it is “eventually consistent”; otherwise it can be considered “inconsistent”.

For practical purposes, database experts have obtained many consistency models by imposing various restrictions on the effect of reading and writing before the system converges and various restrictions on the convergence time of the system.

2. Operational Perspective - Multiple Consistency Models

From the operational perspective of each client, there are four consistency models.

  • Read after Write Consistency: also known as “read-your-writes consistency”, i.e. after you write version N of a piece of data, any version you subsequently read must be no older than version N.
    • The problem it solves: user A posts a Douyin video, but after refreshing the page it mysteriously disappears (an old version is served), and only shows up again a few minutes later.
    • One way to implement this: add a special read rule for the writer, so that all of their reads are handled by a replica that has already applied their writes (see the toy routing sketch after this list).
  • Monotonic Read Consistency: guarantees the order of multiple read operations, i.e. once a client reads a certain version N of data, it will not subsequently read a version lower than N.
    • The problem it solves: user A deletes a Douyin video, but on repeated refreshes the video is sometimes gone and sometimes reappears (the old version), and only disappears completely a few minutes later.
    • One way to implement this: create a user-to-replica mapping for reads, so that each user’s subsequent reads are handled by a fixed replica, avoiding randomly switching replicas and reading older values.
  • Monotonic Write Consistency: guarantees the order of multiple write operations, i.e., two write operations to the same data by the client must be executed in the order they were committed.
  • Write after Read Consistency: guarantees that after a client reads version N of a piece of data (which may have been written by another client), its subsequent write operations on that data are executed on a replica whose version number is greater than or equal to N.
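
As a toy illustration of the implementation hints above, the following Python sketch routes each user’s reads to a fixed replica (for monotonic reads) and routes a user who has just written to a node assumed to already hold their write (for read-your-writes). The replica names, the “leader” node, and the recently_written set are all hypothetical; real systems track replication progress rather than a simple flag:

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]   # hypothetical replica names

def replica_for_user(user_id: str) -> str:
    """Monotonic reads: the same user is always served by the same replica,
    so they can never observe a copy that lags behind one they already read."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

def replica_for_read(user_id: str, recently_written: set) -> str:
    """Read-your-writes: a user who just wrote is routed to the leader,
    which is assumed to have the write, until the change has propagated."""
    if user_id in recently_written:
        return "leader"
    return replica_for_user(user_id)

print(replica_for_user("alice"))                               # always the same replica for alice
print(replica_for_read("alice", recently_written={"alice"}))   # "leader" right after a write
```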

The above four consistency models only define rules from the perspective of each client, which is rather one-sided, so they are all “weak consistency models”.

Without considering the clients, and directly from the perspective of all database users’ operations, there are several consistency models as follows.

  • Linearizability: linearizability relies on the commit order of events; it guarantees that the data returned by any read operation is consistent with the commit order of read/write events.

    • Simply put, it requires that the entire system behave as if only one copy of the data exists and that all operations execute serially in the order they were committed. In effect, every operation is atomic, and conflicting concurrent operations must execute one after another, which is why some call it “atomic consistency”.
    • Linearizability is exactly equivalent to the “strong consistency” of the system’s external state described above.
    • A linearly consistent system is fully deterministic
    • Implementation: requires a “global clock” that is consistent across all nodes so that all events can be globally ordered.
      • Most distributed databases such as TiDB/Etcd implement the global clock with a single timestamp-allocating node, synchronized with protocols like NTP.
      • Google Spanner, which needs global deployment, uses TrueTime, a global clock built from GPS plus atomic clocks, with the global error kept within 7 ms.
    • Limitations: According to Einstein’s theory of relativity, “time is relative”, there is no absolute time, so linear consistency is only applicable within the scope of classical physics.
  • Sequential Consistency: sequential consistency was first used by Leslie Lamport to describe the behavior of multiprocessor CPUs, and is less used in the distributed systems domain.

    • The requirements for sequential consistency are twofold.
      • From the perspective of a single process (replica), the order of execution of all instructions is identical to the order of the code logic.
      • From the perspective of all processors (the entire distributed system), write operations need not be immediately visible to all users, but all replicas must receive these write operations in the same order.
    • Both sequential consistency and linearizability are about finding an ordering of operations in which reads observe preceding writes. The difference is that linearizability demands a strict real-time order, whereas sequential consistency only requires that the logical order of each process’s code be respected; for events whose order is not constrained by code logic (such as events on different replicas), the exact order does not matter, as long as all replicas observe the same order.
    • Sequential consistency does not provide “determinism”: the same two operations may still end up in different event orders on different runs.
    • Implementation: since a strict global time order is not required, no global clock is needed, but in practice some complex coordination is still required for all replicas to agree on a single global order.
  • Causal Consistency: since the global clock required by linearizability has its limitations, causal consistency instead builds a “logical clock” on top of the “partial order” of write events, and guarantees that reads observe writes in an order consistent with that logical clock (a minimal Lamport-clock sketch follows this list).

    • The “partial order” of write events means that at least some events (e.g. events within a single node) can be ordered directly using the local clock, and that when nodes communicate, the receiver’s event must be later than the sender’s event. A “logical clock” can be built on this basis; its drawback is that if two events are not causally related, the order the logical clock assigns them has no meaning.
    • Most argue that causal consistency is weaker than linear consistency, but has advantages in concurrency performance and is also sufficient to handle most anomalies, so causal consistency is also used in industry.
    • Both CockroachDB and YugabyteDB use Hybrid Logical Clocks in their designs, a scheme derived from Lamport’s logical clocks, which has also achieved good results
  • Consistent Prefix: During synchronization between replicas, there will be some replicas that do not receive data in the same order. “Consistent Prefix” means that the prefix of the data order read by all users is always the same.

    • “Prefix” means that the program needs to explicitly declare its “prefix” events when performing write operations, so that each event has a prefix that is arranged by other write events. For example, if there is currently a write event arrangement “A B C D”, then all data read by the user will have the same write event prefix, such as “A”, “A B”, “A B C”, “A B C D”, but no result such as “A C” or “C A” is possible.
    • The problem it solves in a sharded distributed database: users A, B, and C read and write different replicas because they are in different regions. B asks a question in a Douyin comment section, and then A posts an answer. If the question and the answer live in different shards, their order cannot be guaranteed during replica synchronization, so C may see A’s answer before B’s question has appeared, and the order of historical events is scrambled.
    • Implementation: The program needs to actively add explicit dependencies between messages, and then control their reading order accordingly, which is more complicated to implement.
    • Problem: The order between events can only be guaranteed if they are explicitly defined with cause-and-effect relationships.
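
The “logical clock” mentioned under causal consistency can be sketched with Lamport’s classic rules: every node increments a local counter for each local event, attaches the counter to outgoing messages, and on receipt advances its counter past the received timestamp. A minimal sketch (it orders only causally related events, which is exactly the limitation noted above):

```python
class LamportClock:
    """Minimal Lamport logical clock, one instance per node."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # a local event
        self.time += 1
        return self.time

    def send(self) -> int:
        # sending is a local event; the timestamp travels with the message
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # the receive event must be later than the send event it depends on
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_write = a.tick()            # node A writes a value           -> 1
t_msg   = a.send()            # node A replicates it to node B  -> 2
t_apply = b.receive(t_msg)    # node B applies the write        -> 3, always > 2
print(t_write, t_msg, t_apply)
```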

Among these, linearizability is strong consistency; all the other models are weak or eventual consistency models. Listed in descending order of strength:

  • Linear consistency/strong consistency: the system behaves externally as if the whole system is perfectly consistent and there are no inconsistencies.
  • Sequential Consistency: only the order of events on each node is guaranteed to be consistent, with only very loose requirements on the order of events between nodes.
  • Causal Consistency: Again, only the order of events on each node is guaranteed to be consistent, but the requirements for the order of events between nodes are more lenient than sequential consistency.
  • Bounded Staleness: Ensures that the read data is no more than K versions away from the latest version.
  • Session Consistency: monotonic reads, monotonic writes, and read-your-writes are guaranteed within a session, but not across sessions.
  • Prefix Consistency: monotonic reads are guaranteed within each session, but not between sessions
  • The four consistency models from the client’s perspective: read after write, monotonic reads, monotonic writes, and write after read. All four take a very one-sided perspective and are usually subsumed by the models above.

A more complete relational tree diagram: Consistency Models

II. BASE and Eventual Consistency in Distributed Systems

BASE theory:

  • Basically Available: When a distributed system has unpredictable failures, the availability of some functions is allowed to be lost to guarantee the availability of core functions
    • Four means of achieving basic availability: traffic peak shaving, delayed response, degraded experience, and overload protection.
  • Soft state: the system is allowed to have intermediate states, and these intermediate states do not affect its overall availability. For example, with read/write splitting there is a replication delay from the write (master) database to the read (slave) database, which is exactly such a soft state.
  • Eventually consistent: described in detail earlier; after a write completes, the value a subsequent access gets is uncertain for a while, but after a bounded period of time any subsequent access will read the new value.

ACID and BASE are essentially two extremes in the implementation of distributed systems.

  • ACID, as the word “acid” suggests, sits at the consistency boundary of the CAP principle: the strongest consistency, the extreme of achieving CP at the expense of A.
  • BASE, which translates to “base/alkali”, sits at the availability boundary of the CAP principle: the highest availability and the weakest consistency, the extreme of achieving AP by sacrificing C.

According to CAP theory, if strong consistency is pursued in a distributed system, availability is bound to suffer; for example, if one node fails, the entire distributed transaction fails. In fact, most scenarios do not need that much consistency, and transient inconsistencies are acceptable. For availability and concurrency reasons, it is therefore recommended that when developing distributed systems, you avoid distributed transactions unless they are truly necessary, and consider using eventual consistency instead.

Common means of implementing eventual consistency:

  • Read-time repair: detecting data inconsistencies when reading data and repairing them
  • Write-time repair: detects data inconsistencies and repairs them when writing data
  • Asynchronous repair: the most common approach; replica data is checked and repaired through scheduled reconciliation jobs

When implementing eventual consistency, it is also recommended to provide tunable write consistency levels (e.g. All, Quorum, One, Any), which many distributed databases expose. A toy sketch of the underlying N-W-R quorum idea follows.
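These tunable levels come from the N-W-R quorum model: with N replicas, a write must be acknowledged by W of them and a read consults R of them; choosing W + R > N makes every read quorum overlap every write quorum, so a read always sees the newest acknowledged version. A toy in-memory sketch (the replica structure and reachability lists are hypothetical):

```python
N, W, R = 3, 2, 2                     # W + R > N  =>  reads see the latest acknowledged write
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, reachable):
    """Succeeds only if at least W replicas acknowledge the new version."""
    new_version = max(r["version"] for r in replicas) + 1
    acked = 0
    for i in reachable:
        replicas[i].update(version=new_version, value=value)
        acked += 1
    return acked >= W

def read(reachable):
    """Consults R replicas and returns the value with the newest version among them."""
    answers = [replicas[i] for i in reachable[:R]]
    if len(answers) < R:
        raise RuntimeError("not enough replicas for a read quorum")
    newest = max(answers, key=lambda r: r["version"])
    return newest["value"]

write("v1", reachable=[0, 1])     # replica 2 missed the write (e.g. it was down)
print(read(reachable=[1, 2]))     # still returns "v1": the quorums overlap at replica 1
```

In a real system, the stale replica discovered during such a read would then be fixed by the read-time repair mentioned above.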

However, with the rise of distributed relational databases such as TiDB, BASE is actually losing ground to ACID in the distributed space, and ACID is taking on a new lease of life.

III. Consensus Algorithm

Consensus algorithms, also known as consistency protocols, are the set of procedures by which multiple nodes in a distributed system reach agreement on a proposal (for example: of several transaction requests, which should execute first?).

The meaning of “proposal” is very broad in distributed systems: the order in which multiple events occur, the value of a certain key, which node is the master, and so on. Any information that needs to be agreed upon can be considered a proposal.

In a distributed system, each node usually runs the same deterministic state machine (this is known as the State Machine Replication problem): starting from the same initial state and receiving the same sequence of instructions guarantees the same resulting state. Therefore, the most critical thing for the nodes to agree on is the order of events, i.e. ordering (illustrated by the sketch below).
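
A minimal illustration of why ordering is the crux: the same deterministic state machine, fed the same command log in the same order, always reaches the same state, while a different order can diverge (the commands here are made up for the example):

```python
def apply_log(commands):
    """Deterministic state machine whose state is a single account balance."""
    state = 100
    for op, amount in commands:
        if op == "add":
            state += amount
        elif op == "double":        # deliberately order-sensitive
            state *= 2
    return state

log = [("add", 50), ("double", 0)]
print(apply_log(log), apply_log(log))     # 300 300: same order -> same state on every node
print(apply_log(list(reversed(log))))     # 250: a different order gives a different state
```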

Consensus algorithms are a means of achieving data consistency, and they are a necessary but not sufficient condition for strong data consistency. For example, using the Raft algorithm but allowing reads from any node of the cluster only yields eventual consistency; additional measures are needed to guarantee strong consistency.

The Byzantine Generals Problem and Byzantine Fault Tolerance

The Byzantine fault is a fault model proposed by Leslie Lamport in 1982 in “The Byzantine Generals Problem”. It asks whether consensus can be reached when some nodes are not merely faulty but actively malicious. The problem is described roughly as follows.

Nine Byzantine generals lead their armies in a joint siege of a city. The city’s defenses are strong, so if the generals do not coordinate their strategy and part of the army attacks while part retreats, the siege will fail. The generals must therefore vote and agree on one strategy: either all attack or all retreat.

Since each general camps on a different side of the city, they can only communicate through messengers. During coordination, each general sends his vote of “attack” or “retreat” to all the other generals by messenger, and then decides whether to attack or retreat based on his own vote and the votes received from the others.

The problem is complicated by the fact that there may be traitors among the generals who not only vote for the wrong decision but also send different votes to different recipients. Suppose one of the 9 generals is a traitor, and of the 8 loyal generals, 4 vote “attack” and 4 vote “retreat”. The traitor can deliberately send “attack” to the 4 generals who voted “attack” and “retreat” to the 4 who voted “retreat”. The 4 who voted “attack” then see a 5-to-4 majority for attack, while the other 4 see a 5-to-4 majority for retreat. Consistency is thus broken.

In another case, since the generals need to communicate with each other by messenger, even if all generals are loyal, the messengers sent may be intercepted by the enemy or even replaced by spies, which means that the message channel for communication between generals is not reliable. So when no message is received from the corresponding general, the generals will default to a vote, such as “attack”.

More generally: given that N of the generals are known to be traitors, can the remaining M loyal generals reach consensus despite the traitors’ interference? Under what preconditions, and how? This is the Byzantine Generals problem.

If a consensus algorithm can solve the Byzantine Generals problem under certain conditions, we call it a “Byzantine Fault Tolerant (BFT)” algorithm. Conversely, if a consensus algorithm cannot tolerate any malicious node, it is a non-Byzantine, “Crash Fault Tolerant (CFT)” algorithm.

Simple exhaustive enumeration shows that two loyal generals plus one traitor cannot reach consensus. Combined with a proof by contradiction, this leads to the conclusion that Byzantine fault tolerant algorithms require the proportion of traitors to be less than 1/3.

Commonly used consensus algorithms

For the non-Byzantine “Crash Fault Tolerance (CFT)” case, a number of classical algorithms already exist, including Paxos (1990), Raft (2014), and their variants. Such algorithms tend to perform well and process requests quickly, and they tolerate fewer than half of the nodes failing.

For the “Byzantine Fault Tolerance (BFT)” case, the choices include deterministic algorithms such as PBFT (Practical Byzantine Fault Tolerance, 1999) and probabilistic algorithms such as PoW (1999). Once a deterministic algorithm reaches consensus the result is irreversible, i.e. the consensus is final; the consensus of a probabilistic algorithm is provisional, and as time passes or reinforcement accumulates it becomes less and less likely to be overturned, eventually becoming the de facto result. Byzantine fault tolerant algorithms tend to perform worse and tolerate fewer than 1/3 of the nodes failing.
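
The two fault-tolerance bounds can be stated directly: a CFT algorithm such as Paxos/Raft needs a majority of correct nodes, so with n nodes it tolerates f failures when n ≥ 2f + 1, while a BFT algorithm such as PBFT needs n ≥ 3f + 1. A small helper to compare them (the function names are mine):

```python
def max_crash_faults(n: int) -> int:
    """CFT (e.g. Paxos/Raft): needs a majority, so n >= 2f + 1."""
    return (n - 1) // 2

def max_byzantine_faults(n: int) -> int:
    """BFT (e.g. PBFT): needs n >= 3f + 1."""
    return (n - 1) // 3

for n in (3, 4, 7, 10):
    print(n, "nodes ->", max_crash_faults(n), "crash faults,",
          max_byzantine_faults(n), "byzantine faults")
```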

In addition, recently proposed improved algorithms such as XFT (Cross Fault Tolerance, 2015) can provide CFT-like processing response speed and can provide BFT guarantees when most nodes are working properly. The Algorand algorithm (2017) is improved based on PBFT and solves the proposal selection problem by introducing verifiable random functions, which can theoretically achieve better performance (1000+ TPS) while tolerating Byzantine errors.

Note: in practice, a client needs to verify the consensus result itself, typically by querying enough service nodes and comparing their answers to make sure the result obtained is accurate.

Common consensus algorithms are listed as follows.

| Algorithm | Byzantine fault tolerant | Consistency | Performance | Availability (tolerable share of failed nodes) |
| --- | --- | --- | --- | --- |
| Two-phase Commit (2PC) | No | Strong consistency | Low | Low |
| TCC (Try-Confirm-Cancel) | No | Eventual consistency | Low | Low |
| Paxos | No | Strong consistency | Medium | Medium |
| ZAB | No | Eventual consistency | Medium | Medium |
| Raft | No | Strong consistency | Medium | Medium |
| Gossip | No | Eventual consistency | High | High |
| Quorum NWR | No | Strong consistency | Medium | Medium |
| PBFT | Yes | N/A | Low | Medium |
| PoW | Yes | N/A | Low | Medium |
| PoS | Yes | N/A | Low | Medium |
| PoH | Yes | N/A | Medium | Medium |

Note: although PoW/PoS/PoH are listed here as consensus algorithms used in blockchains, they differ greatly from other Byzantine fault tolerant algorithms such as PBFT; the differences are discussed later.

Application Scenarios for Different Consensus Algorithms

In untrustworthy environments, where malicious behavior may exist, consensus algorithms that support Byzantine fault tolerance such as PoW/PoS are needed to enable the system to reach consensus despite the presence of some nodes acting in a malicious manner. This is the reason why blockchains use PoW/PoS algorithms instead of Paxos/Raft algorithms.

In scenarios such as enterprise intranets, which can be considered trusted environments, there are essentially no malicious nodes, or node identities can be authenticated with mTLS; in such scenarios crash fault tolerance is enough.

Non-Byzantine fault tolerant consensus algorithms: Paxos and Raft

Due to space and effort, I will skip this part for now… I may write a new article dedicated to Paxos/Raft algorithms later.

PoW: a probabilistic consensus algorithm that tolerates Byzantine faults

PoW stands for Proof of Work, a measure a system sets up to achieve a certain goal; simply put, it is a proof that you have done a certain amount of work. Monitoring the whole work process is usually extremely inefficient, whereas certifying the result of the work is simple and efficient and still proves that the corresponding amount of work was done. That is the point of PoW.

In 1993, Cynthia Dwork and Moni Naor designed a system for anti-spam and avoiding misuse of resources, which is the prototype of the PoW algorithm. The core idea is as follows.

The main idea is to require a user to compute a moderately hard but not intractable function in order to gain access to the resource, thus preventing frivolous use.

In 1999, Markus Jakobsson and Ari Juels first distilled the concept of Proofs of Work from various protocols.

A PoW system must have two roles, the worker and the verifier, and it needs the following properties.

  • There must be a certain amount of work to be done by the worker, and this amount is given by the work verifier.
  • The verifier can quickly check whether the amount of work meets the standard.
  • The worker cannot “create the work” himself; the work must be issued by the verifier.
  • The worker must not be able to find a shortcut that gets the work done quickly.

At this point, we should have enough understanding of PoW, which is to let workers consume a certain amount of resources as the cost of using the system. For a normal user this consumed resource is perfectly acceptable, but for a malicious attacker who wants to abuse the system’s resources or send massive amounts of spam, it needs to consume massive computing resources as a cost, which greatly increases the cost of the attack.

To recap, the core of the PoW algorithm is that it adds cost to message delivery and reduces the rate of message delivery.
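
A minimal hash-puzzle sketch of this idea (the difficulty and message are toy values, and real PoW schemes differ in detail): the “work” is finding a nonce such that the SHA-256 hash of the message plus the nonce starts with enough zeros; producing the nonce is expensive, while verifying it takes a single hash:

```python
import hashlib

DIFFICULTY = 4          # required number of leading zero hex digits (toy value)

def verify(message: str, nonce: int) -> bool:
    """Verifier: a single hash is enough to check the work."""
    digest = hashlib.sha256(f"{message}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

def proof_of_work(message: str) -> int:
    """Worker: brute-force a nonce until the hash meets the difficulty target."""
    nonce = 0
    while not verify(message, nonce):
        nonce += 1
    return nonce

nonce = proof_of_work("send 1 BTC to Bob")
print(nonce, verify("send 1 BTC to Bob", nonce))   # expensive to find, cheap to check
```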

If we recast the Bitcoin blockchain as a Byzantine Generals problem, it works roughly as follows:

  • Limit the number of proposals over a period of time, and only the nodes with the corresponding privileges (generals) can initiate proposals.
    • This is achieved through PoW workload proofs, where the Bitcoin blockchain requires nodes to perform massive hash computations as the cost of acquiring proposal privileges, and the algorithm difficulty is adjusted every two weeks to ensure that the average time to find the correct hash value for the entire system is about 10 minutes.
  • Relax from strong consistency to eventual consistency.
    • The result of a proposal does not have to be adopted by all nodes immediately; each node simply selects the longest chain it can find in the whole network and extends it.
  • Use asymmetric cryptography to sign messages passed between nodes; each node (general) has its own key pair (public/private key) that uniquely identifies it.
    • Delivering messages with asymmetric cryptography protects the messages, and the signatures cannot be tampered with, which prevents malicious nodes from forging messages.

We gave a conclusion earlier: a Byzantine fault tolerant algorithm requires the proportion of traitors to be less than 1/3.

But the blockchain setting differs significantly from the Byzantine Generals problem. For example:

  • A blockchain allows any node to join or leave at any time, while the Byzantine Generals problem has a predetermined number of nodes and does not consider nodes being added or removed.
  • The PoW algorithm of the Bitcoin blockchain can only guarantee that the average time for the entire system to find the correct Hash value is about 10 minutes, so it is entirely possible that nodes with better performance will take less time, nodes with worse performance will take longer, and even some nodes will be lucky enough to get the result in a few seconds. The earlier the node calculates the Hash value, the higher the probability that its proposal (block) will be the longest chain.
  • A chain that is short at first can still become the longest chain if it grows fast enough. The Byzantine Generals problem, by contrast, allows no branching: only one result exists.
    • Limited by hash power, though, the probability of a short chain catching up with the longest chain gets smaller and smaller as time goes by.

Because of these characteristics, the blockchain setting produces conclusions that differ from those of classical Byzantine fault tolerant algorithms.

  • The percentage of nodes the attacker owns is meaningless; what matters is hash power, which corresponds to proposal power in the blockchain.
    • Even if the attacker owns 99% of the nodes, if its overall hash power is weak, the probability of its proposal (block) becoming the longest chain is low.
  • Because of the rule that “the system always extends the longest chain”, only an attacker with more than 50% of the total hash power has an absolute advantage: after some time its blocks are bound to form the longest chain, and it can keep that advantage to carry out the attack (see the catch-up probability sketch below).
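
The “smaller and smaller” catch-up probability can be quantified with the simplified gambler’s-ruin model from the Bitcoin whitepaper: if the attacker controls a fraction q of the hash power and the honest network controls p = 1 - q, the chance that the attacker ever overtakes a chain that is z blocks ahead is 1 when q ≥ p and (q/p)^z otherwise. A small sketch:

```python
def catch_up_probability(q: float, z: int) -> float:
    """Probability an attacker with fraction q of the hash power ever
    overtakes a chain that is z blocks ahead (Bitcoin whitepaper, section 11)."""
    p = 1 - q
    return 1.0 if q >= p else (q / p) ** z

for z in (1, 3, 6, 12):
    print(z, round(catch_up_probability(0.3, z), 6))
# with 30% of the hash power the chance drops roughly geometrically as z grows
```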

As for the concrete implementation of PoW, and the principles and implementations of alternatives such as PoS/PoH and other emerging algorithms, they will be covered in detail in a later blockchain series, so stay tuned…