Raft Consensus Mechanism and The Applications
Abstract. The Raft consensus algorithm is one of the most commonly used consensus algorithms in
distributed systems. It is mainly used to keep replicated logs consistent. It serves the same purpose as
Paxos but, compared with Paxos, it is easier to understand and easier to apply to real systems. Raft is
also a consensus algorithm commonly adopted by consortium (alliance) blockchains. This article
describes the Raft consensus algorithm and its applications in detail.
1. Raft Overview
Follower:
Passively accepts requests from the Leader. All nodes start in the Follower state.
Candidate:
An intermediate state in the transition from Follower to Leader.
Leader:
Responsible for interacting with clients and for log replication (replication is one-way: the Leader
sends log entries to Followers). Only one Leader exists in the whole system at any given time; two or
more Leaders never appear simultaneously.
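As a concrete illustration, the three roles can be modeled as an enumerated state held by each node. The following is a minimal sketch in Go; the type and field names (State, Node, currentTerm, votedFor) are illustrative assumptions, not part of the paper:

```go
package raft

// State enumerates the three roles a Raft node can take on.
type State int

const (
	Follower State = iota // passively accepts requests from the Leader
	Candidate             // intermediate state while running for Leader
	Leader                // interacts with clients and drives one-way log replication
)

// Node holds the minimal per-node state; every node starts as a Follower.
type Node struct {
	state       State
	currentTerm int // this node's local view of the current Term
	votedFor    int // candidate ID voted for in currentTerm; -1 means none
}

// NewNode returns a node in the initial Follower state.
func NewNode() *Node {
	return &Node{state: Follower, currentTerm: 0, votedFor: -1}
}
```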
2. Key concepts
2.1. Replicated State Machine
If the log entries are replicated to every node correctly, the State Machine of each node executes the
same commands in the same sequence and finally reaches a consistent state[6]. The result of the
consensus is then returned to the client, as shown in Fig. 3 below.
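To make the replicated-state-machine idea concrete, the hedged sketch below applies committed log entries to a toy key-value state in log order; the Entry and StateMachine types and a 1-indexed log are illustrative assumptions:

```go
package raft

// Entry is one command in the replicated log; every node stores the same sequence.
type Entry struct {
	Term  int
	Key   string
	Value string
}

// StateMachine applies committed entries deterministically, in log order,
// so any two nodes that apply the same log reach the same state.
type StateMachine struct {
	store       map[string]string
	lastApplied int // index (1-based) of the last entry already applied
}

func NewStateMachine() *StateMachine {
	return &StateMachine{store: make(map[string]string)}
}

// Apply executes every newly committed entry up to commitIndex, in order.
func (sm *StateMachine) Apply(log []Entry, commitIndex int) {
	for sm.lastApplied < commitIndex {
		sm.lastApplied++
		e := log[sm.lastApplied-1] // Raft describes the log as 1-indexed
		sm.store[e.Key] = e.Value
	}
}
```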
2.2. Term
In distributed systems, time synchronization is a hard problem[7], because the clocks of different
machines drift apart due to geographical location, machine environment, and other factors. Yet in
order to identify outdated information, some notion of time is essential.
The Raft consensus algorithm therefore uses the concept of a Term. Time is divided into Terms
(and each node maintains its Current Term locally), which can be regarded as a logical clock[8], as
shown in Fig. 4 below.
Fig. 4. Term.
Each Term begins with a leader election in which one or more Candidates run for Leader. If a
Candidate wins the election, it serves as Leader for the rest of the Term. In some cases the votes may
be split evenly among multiple Candidates, so that no Leader is elected (as in Fig. 4); a new Term is
then started and the next election begins immediately. The Raft algorithm guarantees that there is at
most one Leader in any given Term; a Term with a split vote simply ends without a Leader.
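Because the Current Term acts as a logical clock, a node can recognize outdated information simply by comparing term numbers. Below is a minimal sketch of that rule, reusing the assumed Node type from the sketch in Section 1; the method name observeTerm is an illustrative assumption:

```go
// observeTerm applies Raft's term rule to an incoming message: a higher
// term means this node's own information is outdated, so it adopts the new
// term and reverts to Follower; a lower term marks the sender as outdated.
// It returns true when the sender's request should be rejected as stale.
func (n *Node) observeTerm(msgTerm int) bool {
	if msgTerm > n.currentTerm {
		n.currentTerm = msgTerm
		n.state = Follower
		n.votedFor = -1 // new term, no vote cast yet
	}
	return msgTerm < n.currentTerm
}
```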
3.1. Leader Election
If a node finds that it has not received a heartbeat from the Leader within the Election Timeout, the
node becomes a Candidate and stays in this state until one of the following three situations occurs:
⚫ The Candidate wins the election
⚫ Another Candidate wins the election
⚫ After a period of time, no server has won the election (the system enters the next round of Term
elections, and each node sets a new randomized Election Timeout)
The Candidate then sends Request Vote RPCs to the other nodes; if more than half of the nodes
grant their votes, it becomes Leader. If the election times out and no Leader has been elected, the
system enters the next Term and holds a new election.
After the Leader election completes, the Leader periodically sends Heartbeats to the other nodes,
telling them that the Leader is still running and resetting their Election Timeouts. The re-election
process is shown in Fig. 5, and a sketch of the timeout logic follows below.
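The timeout-driven transition just described can be sketched as follows. This is an assumed illustration, not the paper's implementation: the 150-300 ms range and the channel-based heartbeat signal are hypothetical choices, and the Node type is the one assumed earlier.

```go
import (
	"math/rand"
	"time"
)

// runFollower waits for Leader heartbeats; if none arrives within a
// randomized Election Timeout, the node becomes a Candidate. The heartbeat
// channel is assumed to be signalled whenever a Leader message is received.
func (n *Node) runFollower(heartbeat <-chan struct{}) {
	for n.state == Follower {
		// Randomizing the timeout (here 150-300 ms) keeps split votes rare.
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeat:
			// Heartbeat received: the Leader is alive, so loop and re-arm the timer.
		case <-time.After(timeout):
			n.state = Candidate // timed out: run for Leader in the next Term
		}
	}
}
```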
3.3. Safety
The previous sections discussed how the Raft algorithm conducts leader elections and replicates
logs. However, this mechanism alone cannot guarantee that every state machine executes the same
instructions in the same order. For example, while the Leader commits several log entries, a Follower
might be down; if this Follower were later elected Leader, it could overwrite the committed entries
with new ones. As a result, different state machines might execute different command sequences[11].
The Raft algorithm closes this gap by adding a restriction during the Leader election phase. The
restriction guarantees that the Leader of any given term holds all the log entries committed in
previous terms. Raft uses the voting process itself to prevent nodes that do not contain all committed
log entries from winning elections.
If a Candidate wants to win an election, it must communicate with a majority of the nodes in the
distributed network, and every committed log entry is present on at least one server of any such
majority. If the Candidate's log is at least as up-to-date as the logs of a majority of the servers, then it
must contain all the committed log entries. The Request Vote RPC implements this restriction: the
RPC includes information about the Candidate's log, and a voter rejects the Candidate's vote request
if its own log is more up-to-date than the Candidate's[12].
How to judge which of two nodes' logs is newer: the Raft algorithm determines which log is more
up-to-date by comparing the Index and Term of the last entry in each log, as implemented in the
sketch below.
⚫ If the last entries of the two logs have different terms, the log whose last entry has the larger
term is more up-to-date.
⚫ If the terms of the last entries are the same, the longer log is more up-to-date.
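These two rules translate directly into a comparison function. The sketch below is a hedged illustration; the function name logUpToDate and its parameter names are assumptions:

```go
// logUpToDate reports whether a Candidate's log, described by the Term and
// Index of its last entry, is at least as up-to-date as the voter's log.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // rule 1: larger last term wins
	}
	return candLastIndex >= myLastIndex // rule 2: same term, longer log wins
}
```

A voter grants its vote only when this check returns true (and it has not already voted in the current Term), which is exactly the restriction carried in the Request Vote RPC.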
4. Case Study
This section examines how Raft maintains consistency when the network is partitioned. As shown
below, the distributed network is divided into two subnets: subnet AB and subnet CDE. At the
moment of partition, node B is the Leader.
After the partition, there is no Leader in subnet CDE. Nodes C, D, and E no longer receive the
Leader's heartbeats, so their Election Timeouts expire and they enter the Candidate state, and a
Leader election begins in that partition.
Assume that node C wins this election and becomes the Leader of subnet CDE. The election
process is shown in Fig. 8.
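This outcome follows from the majority rule in Section 3.1: with five nodes, subnet CDE (three nodes) still contains a majority and can elect node C, while subnet AB (two nodes) can never gather more than half of the votes. A minimal, assumed sketch of that check:

```go
// hasQuorum reports whether a partition holding `votes` nodes can elect a
// Leader in a cluster of clusterSize nodes: strictly more than half needed.
func hasQuorum(votes, clusterSize int) bool {
	return votes > clusterSize/2
}

// In the example: hasQuorum(3, 5) == true for subnet CDE, so C can be
// elected; hasQuorum(2, 5) == false for subnet AB, which can elect no one.
```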
Acknowledgments
Thanks to Professor Bo Mi for his patient guidance and to the graduate team for their selfless help.
The theoretical guidance provided by Professor Bo Mi and the good suggestions from the graduate
team made the writing and research of this article highly efficient. Thanks also to the School of
Information Science and Engineering of Chongqing Jiaotong University for its support; without it,
this work could not have been completed so fully and efficiently.
References
[1] BOLOSKY, W. J., BRADSHAW, D., HAAGENS, R. B., KUSTERS, N. P., AND LI, P.
Paxos replicated state machines as the basis of a high-performance data store. In Proc.
NSDI’11, USENIX Conference on Networked Systems Design and Implementation
(2011), USENIX, pp. 141-154.
[2] GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.T. The Google file system. In Proc.
SOSP’03, ACM Symposium on Operating Systems Principles (2003), ACM, pp. 29-43.
[3] O’NEIL, P., CHENG, E., GAWLICK, D., AND O’NEIL, E. The log-structured merge-tree
(LSM-tree). Acta Informatica 33, 4 (1996), 351-385.
[4] SCHNEIDER, F. B. Implementing fault-tolerant services using the state machine approach: a
tutorial. ACM Computing Surveys 22, 4 (Dec. 1990), 299-319.
[5] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-
structured file system. ACM Trans. Comput. Syst. 10 (February 1992), 26-52.
[6] OKI, B. M., AND LISKOV, B. H. Viewstamped replication: A new primary copy method to
support highly-available distributed systems. In Proc. PODC’88, ACM Symposium on
Principles of Distributed Computing (1988), ACM, pp. 8-17.
[7] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: wait-free
coordination for internet-scale systems. In Proc ATC’10, USENIX Annual Technical
Conference (2010), USENIX, pp. 145-158.
[8] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: wait-free
coordination for internet-scale systems. In Proc. ATC’10, USENIX Annual Technical
Conference (2010), USENIX, pp. 145-158.
[9] LAMPORT, L. Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM 21, 7 (July 1978), 558-565.
[10] LAMPORT, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May
1998), 133-169.
[11] HERLIHY, M. P., AND WING, J. M. Linearizability: a correctness condition for concurrent
objects. ACM Transactions on Programming Languages and Systems 12 (July 1990), 463-
492.
[12] MORARU, I., ANDERSEN, D. G., AND KAMINSKY, M. There is more consensus in
egalitarian parliaments. In Proc. SOSP’13, ACM Symposium on Operating Systems
Principles (2013), ACM.
[13] VAN RENESSE, R. Paxos made moderately complex. Tech. rep., Cornell University, 2012.
[14] LAMPORT, L. Generalized consensus and Paxos. Tech. Rep. MSR-TR-2005-33, Microsoft
Research, 2005.