
CS8603 DISTRIBUTED SYSTEMS UNIT 4

UNIT IV
RECOVERY & CONSENSUS
CHECKPOINTING AND ROLLBACK RECOVERY

INTRODUCTION

o Distributed systems are not inherently fault-tolerant.


o To add reliability and high availability to distributed systems, many techniques have been developed.
o The techniques include :
→ Transactions,
→ Group communication, and
→ Rollback recovery.
o Rollback recovery treats a distributed system application as a collection
of processes that communicate over a network.
o A distributed application achieves fault tolerance by periodically saving the state of each process during failure-free execution, enabling it to restart from a saved state upon a failure and thereby reduce the amount of lost work. The saved state is called a checkpoint.
o The procedure of restarting from a previously checkpointed state is called
rollback recovery.
o Rollback recovery is complicated because messages induce inter-process
dependencies during failure-free operation.
o Upon failure of one or more processes in a system, inter-process
dependencies may force some of the processes that did not fail to roll
back. It is called rollback propagation.
o Rollback propagation occurs when the sender of a message m rolls back to a state that precedes the sending of m:
→ The receiver of m must also roll back to a state that precedes m's receipt;
→ otherwise, the states of the two processes would be inconsistent.
o This phenomenon of cascaded rollback is called the domino effect.


o Independent or uncoordinated checkpointing
→ In a distributed system, if each participating process takes its checkpoints independently, then the system is susceptible to the domino effect. This approach is called independent or uncoordinated checkpointing.
o Coordinated checkpointing
→ In coordinated checkpointing, processes coordinate their checkpoints to form a system-wide consistent state.
o Communication-induced checkpointing
→ Communication-induced checkpointing forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.
o The two approaches used in rollback recovery are:
→ Checkpoint-based rollback recovery
→ Log-based rollback recovery
o Checkpoint-based rollback recovery relies only on checkpoints to
achieve fault-tolerance.
o Log-based rollback recovery combines checkpointing with logging of
nondeterministic events.
→ Log-based rollback recovery relies on the piecewise
deterministic (PWD) assumption, which postulates that all non-
deterministic events that a process executes can be identified and
that the information necessary to replay each event during recovery
can be logged in the event’s determinant.

BACKGROUND AND DEFINITIONS

System Model
o A distributed system consists of a fixed number of processes, P1, P2, …, PN, which communicate only through messages.
o Processes cooperate to execute a distributed application and interact with
the outside world by receiving and sending input and output messages.


o Rollback-recovery protocols enhance the reliability of inter-process communication.
o Some protocols assume that the communication subsystem delivers
messages reliably, in first-in-first-out (FIFO) order, while other
protocols assume that the communication subsystem can lose,
duplicate, or reorder messages.

Local checkpoint
o All processes save their local states at certain instants of time. This saved
state is known as a local checkpoint.
o A local checkpoint is a snapshot of the state of the process at a given instant.
o The event of recording the state of a process is called local
checkpointing.
o Assumption :
→ A process stores all local checkpoints on the stable storage.
→ A process is able to roll back to any of its existing local
checkpoints.
o Ci,k -- kth local checkpoint at process Pi.
o Ci,0 – A process Pi takes a checkpoint Ci,0 before it starts execution.
o A local checkpoint is shown in the process-line by the symbol “|”.

Consistent states

o A global state of a distributed system


→ a collection of the individual states of all participating processes
and the states of the communication channels.

o Consistent global state


→ a global state that may occur during a failure-free execution of the distributed computation
→ if a process’s state reflects a message receipt, then the state of the
corresponding sender must reflect the sending of the message.
o A global checkpoint
→ a set of local checkpoints, one from each process.
o A consistent global checkpoint
→ a global checkpoint such that no message sent by a process after taking its local checkpoint is received by another process before taking its checkpoint.

o Example:

In the consistent state,

o Message m1 has been sent but not yet received. The state is consistent because, for every message that has been received, there is a corresponding message send event.

In the inconsistent state,

o Process P2 is shown to have received m2, but the state of process P1 does not reflect having sent it.


Interactions with the outside world


o A distributed application often interacts with the outside world to
receive input data or deliver the outcome of a computation.
o A special process that interacts with the rest of the system through message passing is called the "outside world process" (OWP).
o The outside world should also perceive a consistent behavior of the system despite failures.
o Approach : (Output Commit Problem)
→ Before sending output to the OWP, the system must ensure that
the state from which the output is sent will be recovered despite
any future failure. It is called the output commit problem.
o An interaction with the outside world to deliver the outcome of a
computation is shown on the process-line by the symbol “||”

o Different types of messages


o In-transit message
→ Messages that have been sent but not yet received.
→ The global state {C1,8 ,C2,9, C3,8, C4,8} shows that message m1 has
been sent but not yet received.
→ Message m2 is also an in-transit message.

o Lost messages
→ Messages whose ‘send’ is done but ‘receive’ is undone due to
rollback.
→ This type of message occurs when a process rolls back to a checkpoint prior to the reception of the message, while the sender does not roll back beyond the send operation of the message.
→ Message m1 is a lost message.

o Delayed messages
→ Messages whose ‘receive’ is not recorded because the receiving
process was either down or the message arrived after rollback.
→ Messages m2 and m5 are delayed messages.


o Orphan messages
→ Messages with 'receive' recorded but message 'send' not recorded.
→ Orphan messages do not arise if processes roll back to a consistent global state.

o Duplicate messages
→ Arise due to message logging and replaying during process
recovery.
→ Message m4 was sent and received before the rollback.
→ Due to the rollback of process P4 to C4,8 and process P3 to C3,8,
both send and receipt of message m4 are undone.
→ When process P3 restarts from C3,8, it will resend message m4.
Therefore, P4 should not replay message m4 from its log. If P4
replays message m4, then message m4 is called a duplicate
message.

ISSUES IN FAILURE RECOVERY

Checkpoints: {Ci,0, Ci,1}, {Cj,0, Cj,1, Cj,2}, and {Ck,0, Ck,1, Ck,2}

Messages: A–J

The restored global consistent state: {Ci,1, Cj,1, Ck,1}


o Suppose process Pi fails at the instant shown; process Pi's state is restored to a valid state by rolling it back to its most recent checkpoint Ci,1.
o The rollback of process Pi to checkpoint Ci,1 created an orphan
message H.
o Orphan message I is created due to the rollback of process Pj to
checkpoint Cj,1
o Messages C, D, E, and F are problematic :
→ Message C: a delayed message
→ Message D: a lost message since the send event for D is recorded
in the restored state for Pj, but the receive event has been undone
at process Pi.
→ Lost messages can be handled by having processes keep a
message log of all the sent messages
→ Messages E, F: delayed orphan messages. After resuming execution
from their checkpoints, processes will generate both of these
messages.

CHECKPOINT-BASED RECOVERY

o In checkpoint-based recovery, the state of each process and the communication channel is checkpointed frequently so that, upon a failure, the system can be restored to a globally consistent set of checkpoints.


o It does not rely on the PWD assumption, and so does not need to detect,
log, or replay non-deterministic events.
o Does not guarantee that prefailure execution can be deterministically
regenerated after a rollback.
o Checkpoint-based rollback-recovery techniques can be classified into
three types:
→ Uncoordinated checkpointing,
→ Coordinated checkpointing,
→ Communication-induced checkpointing

Uncoordinated checkpointing

o Each process has autonomy in deciding when to take checkpoints.


o Eliminates the synchronization overhead as there is no need for
coordination between processes.
o Advantages
→ Lower runtime overhead during normal execution.
o Disadvantages
→ Domino effect during a recovery.
→ Recovery from a failure is slow because processes need to iterate to find a consistent set of checkpoints.
→ Each process maintains multiple checkpoints and must periodically invoke a garbage collection algorithm.
→ Not suitable for applications with frequent output commits.
o The processes record the dependencies among their checkpoints
caused by message exchange during failure-free operation.
o Direct dependency tracking technique
→ Assume each process Pi starts its execution with an initial checkpoint Ci,0.
→ Ii,x : the checkpoint interval between Ci,x-1 and Ci,x.
→ When Pj receives a message m during Ij,y that was sent by Pi during Ii,x, it records the dependency from Ii,x to Ij,y, which is later saved onto stable storage when Pj takes Cj,y.


o Steps:
→ When a failure occurs, the recovering process initiates rollback by
broadcasting a dependency request message to collect all the
dependency information maintained by each process.
→ When a process receives this message, it stops its execution and
replies with the dependency information saved on the stable
storage.
→ The initiator then calculates the recovery line based on the global
dependency information.
→ Then the initiator broadcasts a rollback request message
containing the recovery line.
→ Upon receiving this message, a process whose current state
belongs to the recovery line simply resumes execution;
→ Otherwise, it rolls back to an earlier checkpoint.
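As an illustration of how the initiator might turn the collected dependency information into a recovery line, the following Python sketch applies the consistency rule as a fixed-point computation. The tuple layout (i, x, j, y) and the function name are illustrative assumptions, not part of a specific published protocol.

def recovery_line(latest, dependencies):
    # latest:       {pid: index of the most recent checkpoint of that process}
    # dependencies: tuples (i, x, j, y) meaning "a message sent by Pi in interval
    #               I_{i,x} was received by Pj before Pj took checkpoint C_{j,y}"
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for (i, x, j, y) in dependencies:
            # The restored state of Pj records the receipt (line[j] >= y) but the
            # restored state of Pi does not record the send (line[i] < x), so the
            # message would become an orphan: Pj must roll back before the receipt.
            if line[j] >= y and line[i] < x:
                line[j] = y - 1
                changed = True
    return line

# Example: rolling P0 back to C_{0,1} undoes a send made in I_{0,2}; the message
# was received by P1 before C_{1,2}, so P1 must also roll back from C_{1,3} to C_{1,1}.
print(recovery_line({0: 1, 1: 3}, [(0, 2, 1, 2)]))   # {0: 1, 1: 1}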

Coordinated checkpointing
o In coordinated checkpointing, processes orchestrate their
checkpointing activities so that all local checkpoints form a consistent
global state.
o Simplifies recovery.
o Not susceptible to the domino effect.
o Disadvantages :
→ Large latency is involved in committing output,


→ Delays and overhead are involved every time a new global checkpoint is taken.
o The techniques used for coordinated checkpointing are :
1. Blocking coordinated checkpointing.
2. Non-blocking checkpoint coordination.
3. min-process non-blocking checkpointing

1. Blocking coordinated checkpointing


▪ Block communications while the checkpointing protocol executes.
▪ After a process takes a local checkpoint, it remains blocked until the
entire checkpointing activity is complete.
▪ The coordinator takes a checkpoint and broadcasts a request
message to all processes, asking them to take a checkpoint.
▪ When a process receives this message, it stops its execution, flushes
all the communication channels, takes a tentative checkpoint, and
sends an acknowledgment message back to the coordinator.
▪ When the coordinator receives acknowledgments from all
processes, it broadcasts a commit message.
▪ After receiving the commit message, a process removes the old
permanent checkpoint and atomically makes the tentative checkpoint
permanent and then resumes its execution and exchange of messages
with other processes.
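A minimal Python sketch of this two-phase, blocking exchange, assuming direct method calls stand in for the request/acknowledgment/commit messages and that channel flushing happens inside take_tentative(); the class and method names are illustrative assumptions.

class Participant:
    def __init__(self, pid):
        self.pid = pid
        self.blocked = False
        self.permanent = None
        self.tentative = None

    def take_tentative(self):
        self.blocked = True                  # stop execution, flush channels (assumed)
        self.tentative = f"checkpoint-of-P{self.pid}"
        return True                          # acknowledgment; False if the checkpoint failed

    def finish(self, decision):
        if decision == "commit":
            self.permanent = self.tentative  # atomically replace the old permanent checkpoint
        self.tentative = None
        self.blocked = False                 # resume execution and message exchange

class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def checkpoint(self):
        # Phase 1: broadcast the request; every process blocks and takes a tentative checkpoint.
        acks = [p.take_tentative() for p in self.participants]
        # Phase 2: commit only if all acknowledged, otherwise discard the tentative checkpoints.
        decision = "commit" if all(acks) else "abort"
        for p in self.participants:
            p.finish(decision)
        return decision

print(Coordinator([Participant(i) for i in range(3)]).checkpoint())   # commit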

2. Non-blocking checkpoint coordination


▪ The processes need not stop their execution while taking checkpoints.
▪ A fundamental problem in coordinated checkpointing is to prevent a
process from receiving application messages that could make the
checkpoint inconsistent.
▪ Example (a) : checkpoint inconsistency
o Message m is sent by P0 after receiving a checkpoint request from the checkpoint coordinator.
o Assume m reaches P1 before the checkpoint request.


o This situation results in an inconsistent checkpoint, since checkpoint C1,x shows the receipt of message m from P0, while checkpoint C0,x does not show m being sent from P0.
▪ Example (b) : a solution with FIFO channels
o If channels are FIFO, this problem can be avoided by preceding
the first post-checkpoint message on each channel by a
checkpoint request, forcing each process to take a checkpoint
before receiving the first post-checkpoint message.
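A small Python sketch of this FIFO rule, in the style of a marker protocol: the checkpoint request is appended to every outgoing FIFO channel before any later application message, so a receiver always checkpoints before it can deliver a post-checkpoint message. The class and method names are illustrative assumptions.

from collections import deque

CKPT_REQUEST = "CKPT_REQUEST"

class Process:
    def __init__(self, pid, out_channels):
        self.pid = pid
        self.out = out_channels            # {dest_pid: deque}; FIFO channels
        self.checkpointed = False

    def take_checkpoint(self):
        if self.checkpointed:
            return
        self.checkpointed = True
        print(f"P{self.pid}: checkpoint taken")
        # The request precedes every later message on each outgoing FIFO channel.
        for channel in self.out.values():
            channel.append((self.pid, CKPT_REQUEST))

    def send(self, dest, payload):
        self.out[dest].append((self.pid, payload))

    def on_receive(self, sender, payload):
        if payload == CKPT_REQUEST:
            # FIFO guarantees that any post-checkpoint message from `sender` is
            # still behind this request, so checkpointing now keeps the states consistent.
            self.take_checkpoint()
        else:
            print(f"P{self.pid}: delivered {payload} from P{sender}")

# P0 checkpoints and then sends m; P1 drains its incoming FIFO channel and is
# forced to checkpoint before it delivers the post-checkpoint message m.
channel = deque()
p0, p1 = Process(0, {1: channel}), Process(1, {})
p0.take_checkpoint()
p0.send(1, "m")
while channel:
    p1.on_receive(*channel.popleft())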

3. Min-process non-blocking checkpointing

▪ A min-process, non-blocking checkpointing algorithm is one that forces only a minimum number of processes to take a new checkpoint, and at the same time it does not force any process to suspend its computation.
▪ The algorithm consists of two phases:
▪ Phase 1 :
o The checkpoint initiator identifies all processes with which it
has communicated since the last checkpoint and sends them a
request.


o Upon receiving the request, each process in turn identifies all processes it has communicated with since the last checkpoint and sends them a request, and so on, until no more processes can be identified.
▪ Phase 2 :
o All processes identified in the first phase take a checkpoint.
o The result is a consistent checkpoint that involves only the
participating processes.
▪ After a process takes a checkpoint, it cannot send any message until the second phase terminates successfully. The algorithm relies on the notion of z-dependency, defined below.
▪ Definition : z-dependency
o If a process Pp sends a message to process Pq during its ith
checkpoint interval and process Pq receives the message
during its jth checkpoint interval, then Pq Z-depends on
Pp during Pp’s ith checkpoint interval and Pq’s jth
checkpoint interval, denoted by Pp →(i,j) Pq.
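Viewed centrally, the first phase simply computes the transitive closure of "has communicated with since its last checkpoint", starting from the initiator. A minimal Python sketch, where the bookkeeping dictionary is an illustrative assumption:

def processes_to_checkpoint(initiator, communicated_since_last_ckpt):
    # communicated_since_last_ckpt: {pid: set of pids it exchanged messages with
    # since its last checkpoint} -- an assumed bookkeeping structure.
    must_checkpoint = {initiator}
    frontier = [initiator]
    while frontier:
        p = frontier.pop()
        for q in communicated_since_last_ckpt.get(p, set()):
            if q not in must_checkpoint:   # request q, which then asks its own partners
                must_checkpoint.add(q)
                frontier.append(q)
    return must_checkpoint

# P0 talked to P2 and P2 talked to P4; P1 and P3 were silent, so they are not forced.
print(processes_to_checkpoint(0, {0: {2}, 2: {0, 4}, 4: {2}}))   # {0, 2, 4}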
Communication-induced checkpointing
o Avoids the domino effect.
o Eliminates the useless checkpoints.
o In communication-induced checkpointing, processes take two types of
checkpoints, namely,
→ Autonomous and
→ Forced checkpoints.
o The checkpoints that a process takes independently are called local
checkpoints.
o The checkpoints that a process is forced to take are called forced checkpoints.
o Communication-induced checkpointing piggybacks protocol-related
information on each application message.


o The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line.
o The forced checkpoint must be taken before the application may
process the contents of the message.
o Two types of communication-induced checkpointing :
1. Model-based checkpointing
2. Index-based checkpointing.
o In model-based checkpointing, the system maintains checkpoints
and communication structures that prevent the domino effect.
o In index-based checkpointing, the system uses an indexing scheme
for the local and forced checkpoints, such that the checkpoints of the
same index at all processes form a consistent state.
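As a concrete illustration of the index-based idea (in the spirit of the Briatico-Ciuffoletti-Simoncini protocol), the Python sketch below piggybacks the sender's checkpoint index on every message and forces the receiver to checkpoint with the larger index before delivering a message that carries a higher index; the class and method names are illustrative assumptions.

class IndexBasedProcess:
    def __init__(self, pid):
        self.pid = pid
        self.index = 0                     # sequence number of the latest checkpoint
        self._checkpoint()                 # initial basic checkpoint C_{pid,0}

    def _checkpoint(self):
        print(f"P{self.pid}: checkpoint C_{self.pid},{self.index}")

    def basic_checkpoint(self):
        # Autonomous (local) checkpoint, taken whenever the process decides to.
        self.index += 1
        self._checkpoint()

    def send(self, payload):
        return (self.index, payload)       # piggyback the current checkpoint index

    def receive(self, message):
        piggybacked, payload = message
        if piggybacked > self.index:
            # Forced checkpoint, taken BEFORE the application processes the message,
            # so checkpoints with equal indices at different processes stay consistent.
            self.index = piggybacked
            self._checkpoint()
        return payload                     # now safe to deliver to the application

p, q = IndexBasedProcess(0), IndexBasedProcess(1)
p.basic_checkpoint()                       # P0 advances to C_{0,1}
q.receive(p.send("m"))                     # forces P1 to take C_{1,1} before delivering m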

LOG-BASED ROLLBACK RECOVERY

o Log-based rollback recovery makes use of deterministic and nondeterministic events in a computation.

Deterministic and non-deterministic events

o A non-deterministic event can be the receipt of a message from another process or an event internal to the process.
o A message send event is not a non-deterministic event.
o Example :
→ The execution of process P0 is a sequence of four deterministic
intervals.


→ The first one starts with the creation of the process, while the
remaining three start with the receipt of messages m0, m3, and m7.
→ Send event of message m2 is uniquely determined by the initial
state of P0 and by the receipt of message m0, and is therefore not a
non-deterministic event.
o Log-based rollback recovery assumes that all non-deterministic
events can be identified and their corresponding determinants can be
logged into the stable storage.
o During failure-free operation, each process logs the determinants of all
non-deterministic events that it observes onto the stable storage.

No-orphans consistency condition

o Let e be a non-deterministic event that occurs at process p.


o Depend(e)
→ The set of processes that are affected by a non-deterministic event
e.
→ This set consists of p, and any process whose state depends on the
event e according to Lamport’s happened before relation
o Log(e)
→ The set of processes that have logged a copy of e’s determinant in
their volatile memory
o Stable(e)
→ a predicate that is true if e’s determinant is logged on the stable
storage.
o always-no-orphans condition
→ Suppose a set of processes crashes. A process p becomes an orphan when p itself does not fail and p's state depends on the execution of a nondeterministic event e whose determinant cannot be recovered from the stable storage or from the volatile memory of a surviving process.
→ The always-no-orphans condition states: ∀e : ¬Stable(e) ⇒ Depend(e) ⊆ Log(e)


PESSIMISTIC LOGGING
o Pessimistic protocols log to the stable storage the determinant of each
non-deterministic event before the event affects the computation.
o Pessimistic protocols implement the following property called as
synchronous logging.
o It is defined by :
∀e: ¬Stable(e) ⇒ |Depend(e)| = 0
o If an event has not been logged on the stable storage, then no process
can depend on it.
o Processes also take periodic checkpoints to minimize the amount of
work that has to be repeated during recovery.
o When a process fails, the process is restarted from the most recent
checkpoint and the logged determinants are used to recreate the
prefailure execution.
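A minimal Python sketch of synchronous determinant logging, assuming an append-only file as the stable storage and an illustrative record layout; the only point being made is that the determinant is forced to disk before the message is allowed to affect the application.

import json, os

class PessimisticLog:
    def __init__(self, path):
        self.path = path
        self.order = 0                        # delivery order is part of each determinant

    def log_determinant(self, sender, msg_id):
        # Force the determinant to stable storage BEFORE delivery, so that
        # "not Stable(e) implies |Depend(e)| = 0" always holds.
        self.order += 1
        record = {"sender": sender, "msg_id": msg_id, "order": self.order}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())              # blocks until the write is stable

    def replay(self):
        # On recovery: yield determinants in their original delivery order.
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

log = PessimisticLog("/tmp/p1.determinants")

def deliver(sender, msg_id, payload, handler):
    log.log_determinant(sender, msg_id)       # synchronous logging ...
    handler(payload)                          # ... only then may the event affect the state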
o Example :
→ During failure-free operation the logs of processes P0, P1, and P2
contain the determinants needed to replay messages m0, m4, m7,
m1, m3, m6, and m2, m5, respectively.
→ Suppose processes P1 and P2 fail as shown, restart from checkpoints B and C, and roll forward using their determinant logs to deliver again the same sequence of messages as in the pre-failure execution.
→ Once the recovery is complete, both processes will be consistent with the state of P0 that includes the receipt of message m7 from P1.


OPTIMISTIC LOGGING
o Processes log determinants asynchronously to the stable storage
o Optimistically assume that logging will be complete before a failure
occurs
o Do not implement the always-no-orphans condition
o To perform rollbacks correctly, optimistic logging protocols track causal
dependencies during failure free execution
o Optimistic logging protocols require a non-trivial garbage collection
scheme
o Pessimistic protocols need only keep the most recent checkpoint of each
process, whereas optimistic protocols may need to keep multiple
checkpoints for each process.

Example:

o Suppose process P2 fails before the determinant for m5 is logged to the stable storage. Process P1 then becomes an orphan process and must roll back to undo the effects of receiving the orphan message m6. The rollback of P1 further forces P0 to roll back to undo the effects of receiving message m7.

CAUSAL LOGGING
o Combines the advantages of both pessimistic and optimistic logging at
the expense of a more complex recovery protocol.
o Like optimistic logging, it does not require synchronous access to the
stable storage except during output commit.


o Like pessimistic logging, it allows each process to commit output independently and never creates orphans, thus isolating processes from the effects of failures at other processes.
o It ensures that the always-no-orphans property holds.
o Each process maintains information about all the events that have
causally affected its state.

Example:
→ Messages m5 and m6 are likely to be lost on the failures of P1
and P2 at the indicated instants.
→ Process P0 at state X will have logged the determinants of the
nondeterministic events that causally precede its state
according to Lamport’s happened-before relation.
→ These events consist of the delivery of messages m0, m1, m2,
m3, and m4. The determinant of each of these non-
deterministic events is either logged on the stable storage or
is available in the volatile log of process P0. The determinant
of each of these events contains the order in which its
original receiver delivered the corresponding message.
→ The message sender, as in sender-based message logging,
logs the message content. Thus, process P0 will be able to
“guide” the recovery of P1 and P2 since it knows the order in
which P1 should replay messages m1 and m3 to reach the
state from which P1 sent message m4.


KOO-TOUEG COORDINATED CHECKPOINTING ALGORITHM

o The Koo-Toueg algorithm is a coordinated checkpointing and recovery technique that takes a consistent set of checkpoints and avoids the domino effect and livelock problems during recovery.
o Includes two parts:
→ The checkpointing algorithm
→ The recovery algorithm
o Checkpointing Algorithm
→ Assumptions:
▪ FIFO channel,
▪ End-to-end protocols,
▪ Communication failures do not partition the network,
▪ Single process initiation,
▪ No process fails during the execution of the algorithm.
o Two kinds of checkpoints:
▪ Permanent checkpoint
▪ Tentative checkpoint
→ Permanent checkpoint: local checkpoint, part of a consistent
global checkpoint
→ Tentative checkpoint: a temporary checkpoint that becomes a permanent checkpoint when the algorithm terminates successfully.
o The algorithm consist of two phases :
→ First Phase :
▪ An initiating process Pi takes a tentative checkpoint and
requests all other processes to take tentative checkpoints.
▪ Each process informs Pi whether it succeeded in taking a
tentative checkpoint.
▪ A process says “no” to a request if it fails to take a tentative
checkpoint.
▪ If Pi learns that all the processes have successfully taken
tentative checkpoints, Pi decides that all tentative
checkpoints should be made permanent; otherwise, Pi


decides that all the tentative checkpoints should be discarded.
→ Second Phase
▪ Pi informs all the processes of the decision it reached at the
end of the first phase.
▪ A process, on receiving the message from Pi, will act
accordingly.
o Correctness:
▪ Either all or none of the processes take permanent
checkpoint.
▪ No process sends message after taking permanent
checkpoint
o Optimization:
▪ Not all of the processes may need to take checkpoints.
o Example :

o The set { x1, y1, z1} is a consistent set of checkpoints.


o Suppose process X decides to initiate the checkpointing algorithm after receiving message m. It takes a tentative checkpoint x2 and sends "take tentative checkpoint" messages to processes Y and Z, causing Y and Z to take checkpoints y2 and z2, respectively.
o {x2, y2, z2} forms a consistent set of checkpoints.
o {x2, y2, z1} also forms a consistent set of checkpoints.


o There is no need for process Z to take checkpoint z2, because Z has not sent any message since its last checkpoint.

o Rollback Recovery Algorithm

o The rollback recovery algorithm restores the system state to a consistent state after a failure, with the assumptions of a single initiator and that the checkpoint and rollback recovery algorithms are not invoked concurrently.
o The algorithm consist of two phases :
→ First Phase :
▪ The initiating process sends a message to all other processes and asks for their preferences regarding restarting from the previous checkpoints.
▪ All processes need to agree on whether to roll back or not.
→ Second Phase:
▪ The initiating process sends the final decision to all processes; all the processes act accordingly after receiving the final decision.
o Correctness:
▪ Resume from a consistent state.
o Optimization:
▪ Not all processes may need to roll back, since some of the processes did not change anything.

JUANG-VENKATESAN ALGORITHM FOR ASYNCHRONOUS CHECKPOINTING AND RECOVERY

o System Model and Assumptions


→ Communication channels are reliable,
→ Deliver messages in FIFO order,
→ Infinite buffers,
→ Message transmission delay is arbitrary but finite.
→ The processors directly connected to a processor via
communication channels are called its neighbors.


o Underlying computation/application is event-driven:


→ A processor P waits until a message m is received,
→ Process P is at state s,
→ receives message m,
→ processes the message,
→ changes its state from s to s’, and
→ sends zero or more messages to some of its neighbors.
o So the triplet (s, m, msgs_sent) represents the state of P
o The events at a processor are identified by unique, monotonically increasing numbers, ex0, ex1, ex2, …

o Two type of log storage are maintained:


→ Volatile log: short access time, but lost if the processor crashes; its contents are periodically moved to the stable log.
→ Stable log: longer access time, but its contents survive a crash.

o Asynchronous checkpointing
o After executing an event, the triplet is recorded without any
synchronization with other processes.
o A local checkpoint consists of a set of such records; they are first stored in the volatile log and then moved to the stable log.


o The recovery algorithm

o Notations:

→ RCVDi←j(CkPti) : the number of messages received by pi from pj, from the beginning of the computation to checkpoint CkPti.
→ SENTi→j(CkPti) : the number of messages sent by pi to pj, from the beginning of the computation to checkpoint CkPti.
o Idea:
→ From the set of checkpoints, find a set of consistent checkpoints.
→ This is done based on the number of messages sent and received.
o Algorithm :


o Description of the Algorithm:

→ When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it has failed.

→ The recovery algorithm at a processor is initiated when it restarts after a failure.

→ Because of the broadcast of ROLLBACK messages, the recovery algorithm is initiated at all processors.

→ The rollback starts at the failed processor and slowly diffuses into the entire system through ROLLBACK messages.

→ Note that the procedure has |N| iterations.

o During the kth iteration (k ≥ 1), a processor pi does the following:


(i) Based on the state CkPti to which it was rolled back in the (k − 1)th iteration, it computes SENTi→j(CkPti) for each neighbor pj and sends this value in a ROLLBACK message to that neighbor; and

(ii) pi waits for and processes ROLLBACK messages that it receives from its neighbors in the kth iteration and determines a new recovery point CkPti for pi based on the information in these messages.

(iii) At the end of each iteration, at least one processor will roll back to its final recovery point, unless the current recovery points are already consistent.
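The per-iteration rule can be captured compactly if the SENT/RCVD counters are viewed centrally. The following Python sketch is a centralized rendering of the iterations (the real protocol exchanges these counts in ROLLBACK messages between neighbors), and the dictionary layout is an illustrative assumption.

def juang_venkatesan(latest, sent, rcvd, neighbors):
    # latest:    {p: index of the latest restorable event at p}
    # sent:      sent[p][q][k] = SENT_{p->q} counted up to p's event k
    # rcvd:      rcvd[p][q][k] = RCVD_{p<-q} counted up to p's event k
    # neighbors: assumed symmetric (q in neighbors[p] iff p in neighbors[q])
    ckpt = dict(latest)
    for _ in range(len(ckpt)):                       # |N| iterations
        # Each processor reports SENT_{p->q}(CkPt_p) to every neighbor q.
        msg = {(p, q): sent[p][q][ckpt[p]] for p in ckpt for q in neighbors[p]}
        for p in ckpt:
            for q in neighbors[p]:
                c = msg[(q, p)]                      # ROLLBACK value received from q
                # p has recorded more receptions from q than q's restored state
                # ever sent: roll p back to the latest event where RCVD <= c.
                while rcvd[p][q][ckpt[p]] > c:
                    ckpt[p] -= 1
    return ckpt

# Two neighbors A and B; B fails and restarts from its event 1, A is at event 3.
# A has received 3 messages from B, but B's restored state has sent only 1,
# so A must also roll back to its event 1.
sent = {"A": {"B": [0, 0, 1, 2]}, "B": {"A": [0, 1]}}
rcvd = {"A": {"B": [0, 1, 2, 3]}, "B": {"A": [0, 0]}}
print(juang_venkatesan({"A": 3, "B": 1}, sent, rcvd, {"A": ["B"], "B": ["A"]}))
# {'A': 1, 'B': 1}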

o Example:

→ Suppose processor Y fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from the state corresponding to ey2.

→ The recovery algorithm is also initiated at processors X and Z.

→ Initially, X, Y, and Z set CkPtX ← ex3, CkPtY ← ey2 and CkPtZ ← ez2, respectively, and

o X, Y, and Z send the following messages during the first iteration:

▪ Y sends ROLLBACK(Y, 2) to X and ROLLBACK(Y, 1) to Z;

▪ X sends ROLLBACK(X, 2) to Y and ROLLBACK(X, 0) to Z; and

▪ Z sends ROLLBACK(Z, 0) to X and ROLLBACK(Z, 1) to Y.

▪ Since RCVDX←Y(CkPtX) = 3 > 2 (2 is the value received in the ROLLBACK(Y, 2) message from Y), X will set CkPtX to ex2, satisfying RCVDX←Y(ex2) = 2 ≤ 2.

▪ Since RCVDZ←Y(CkPtZ) = 2 > 1, Z will set CkPtZ to ez1, satisfying RCVDZ←Y(ez1) = 1 ≤ 1.


▪ At Y, RCVDY←X(CkPtY) = 1 < 2 and RCVDY←Z(CkPtY) = 1 = SENTZ→Y(CkPtZ).

▪ Hence, Y need not roll back further.

→ In the second iteration,

o Y sends ROLLBACK(Y, 2) to X and ROLLBACK(Y, 1) to Z;
o Z sends ROLLBACK(Z, 1) to Y and ROLLBACK(Z, 0) to X;
o X sends ROLLBACK(X, 0) to Z and ROLLBACK(X, 1) to Y.

CONSENSUS AND AGREEMENT ALGORITHMS

Introduction

o Agreement among the processes in a distributed system is a fundamental requirement for a wide range of applications.

o Many forms of coordination require the processes to exchange information to negotiate with one another and eventually reach a common understanding or agreement, before taking application-specific actions.

o A classical example is that of the commit decision in database systems, wherein the processes collectively decide whether to commit or abort a transaction that they participate in.

Assumptions

o Failure models


→ Among the n processes in the system, at most f processes can be faulty.
→ A faulty process can behave in any manner allowed by the failure model assumed.
→ The various failure models are fail-stop, send omission and receive omission, and Byzantine failures.
→ In the fail-stop model, a process may crash in the middle of a step; in particular, it may send a message to only a subset of the destination set before crashing.
→ In the Byzantine failure model, a process may behave arbitrarily.
→ The choice of the failure model determines the feasibility and
complexity of solving consensus.
o Synchronous/asynchronous communication
o Network connectivity
o Sender identification
o Channel reliability
o Authenticated vs. non-authenticated messages
o Agreement variable

Problem Specifications

o Byzantine Agreement (single source has an initial value)

Agreement: All non-faulty processes must agree on the same value.

Validity: If the source process is non-faulty, then the agreed upon value by all the non-faulty processes must be the same as the initial value of the source.

Termination: Each non-faulty process must eventually decide on a value.


o Consensus Problem (all processes have an initial value)

Agreement: All non-faulty processes must agree on the same (single) value.

Validity: If all the non-faulty processes have the same initial value,
then the agreed upon value by all the non-faulty processes must be
that same value.
Termination: Each non-faulty process must eventually decide on a
value.

o Interactive Consistency (all processes have an initial value)

Agreement: All non-faulty processes must agree on the same array of values A[v1 . . . vn].

Validity: If process i is non-faulty and its initial value is vi, then all
non-faulty processes agree on vi as the ith element of the array A. If
process j is faulty, then the non-faulty processes can agree on any
value for A[j].

Termination: Each non-faulty process must eventually decide on the array A.

OVERVIEW OF RESULTS


o In a failure-free system, consensus can be attained in a straightforward manner.

AGREEMENT IN A FAILURE-FREE SYSTEM (SYNCHRONOUS OR ASYNCHRONOUS)

o In a failure-free system, consensus can be reached by collecting information from the different processes, arriving at a "decision," and distributing this decision in the system.
o A distributed mechanism would have each process broadcast its
values to others, and each process computes the same function
on the values received. The decision can be reached by using an
application specific function.


o Algorithms to collect the initial values and then distribute the decision may be based on the token circulation on a logical ring, or the three-phase tree-based broadcast–convergecast–broadcast, or direct communication with all nodes.
AGREEMENT IN (MESSAGE-PASSING) SYNCHRONOUS SYSTEMS
WITH FAILURES

Consensus algorithm for crash failures (synchronous system)

o Up to f (< n) crash failures possible.


o In f + 1 rounds, at least one round has no failures.
o Now justify: agreement, validity, termination conditions are
satisfied.
o Complexity: O((f + 1) n^2) messages.
o f + 1 is a lower bound on the number of rounds.

Algorithm
(global constants)
integer: f ; // maximum number of crash failures tolerated
(local variables)
integer: x ←− local value;
(1) Process Pi (1 ≤ i ≤ n) executes the Consensus algorithm for up to
f crash failures:
(1a) for round from 1 to f + 1 do
(1b) if the current value of x has not been broadcast then
(1c) broadcast(x);
(1d) yj ←− value (if any) received from process j in this round;
(1e) x ←− min∀j (x, yj);
(1f) output x as the consensus value.

o The agreement condition is satisfied because in the f + 1 rounds, there must be at least one round in which no process failed.
o The validity condition is satisfied because processes do not send
fictitious values in this failure model. For all i, if the initial value is
identical, then the only value sent by any process is the value that
has been agreed upon as per the agreement condition.
o The termination condition is seen to be satisfied.
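A round-by-round Python simulation of the algorithm above may help; the crash model used here (a faulty process reaches only a random subset of receivers in its crash round) and the data layout are illustrative assumptions.

import random

def crash_consensus(initial, crashes, f):
    # initial: {pid: initial value};  crashes: {pid: round in which pid crashes
    # after reaching only some receivers}; at most f processes appear in crashes.
    x = dict(initial)                                # current estimate at each process
    last_broadcast = {p: None for p in initial}
    alive = set(initial)
    for rnd in range(1, f + 2):                      # f + 1 rounds
        inbox = {p: [] for p in initial}
        for p in sorted(alive):
            if x[p] != last_broadcast[p]:            # broadcast only if not yet broadcast
                receivers = set(initial)
                if crashes.get(p) == rnd:            # crash mid-broadcast: partial delivery
                    receivers = set(random.sample(sorted(initial), len(initial) // 2))
                    alive.discard(p)
                for q in receivers:
                    inbox[q].append(x[p])
                last_broadcast[p] = x[p]
        for p in alive:
            x[p] = min([x[p]] + inbox[p])            # adopt the minimum value seen so far
    return {p: x[p] for p in alive}

# Four processes, at most f = 1 crash; P0 (value 0) crashes during round 1.
# All surviving processes output the same value (0 or 3, depending on who saw 0).
print(crash_consensus({0: 0, 1: 3, 2: 5, 3: 7}, crashes={0: 1}, f=1))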


CONSENSUS ALGORITHMS FOR BYZANTINE FAILURES (SYNCHRONOUS SYSTEM)

Upper bound on Byzantine processes

o In a system of n processes, the Byzantine agreement problem can be solved in a synchronous system only if the number of Byzantine processes f is such that f ≤ ⌊(n − 1)/3⌋.


BYZANTINE AGREEMENT TREE ALGORITHM: EXPONENTIAL (SYNCHRONOUS SYSTEM)

BYZANTINE GENERALS (ITERATIVE FORMULATION), SYNC, MSG-PASSING
o In the recursive version of the algorithm, each message has the
following parameters:
→ a consensus estimate value (v);
→ a set of destinations (Dests);


→ a list of nodes traversed by the message, from most recent to least recent (List); and

→ the number of Byzantine processes that the algorithm still needs to tolerate (faulty).

→ The list L = <Pi, Pk1, …, Pkf+1−faulty> represents the sequence of processes (subscripts) in the knowledge expression Ki(Kk1(Kk2(… Kkf+1−faulty(v0) …))).

→ The commander invokes the algorithm with parameter faulty set to f, the maximum number of malicious processes to be tolerated.

→ The algorithm uses f + 1 synchronous rounds.


→ Each message (having this parameter faulty = k) received by a process invokes several other instances of the algorithm with parameter faulty = k − 1.

→ The terminating case of the recursion is when the parameter faulty is 0.
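The recursion itself is easy to state. The following Python sketch is a minimal, centralized simulation of the recursive Oral Messages formulation (not of the iterative tree bookkeeping described above); the adversarial behaviour chosen for Byzantine senders and the default value are illustrative assumptions.

from collections import Counter

DEFAULT = 0                                     # default value / tie-breaker

def majority(values):
    top = Counter(values).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return DEFAULT                          # no clear majority
    return top[0][0]

def send(src, value, receiver, byzantine):
    # One possible adversarial behaviour: a Byzantine sender flips the bit
    # for odd-numbered receivers, so different receivers see different values.
    return value ^ (receiver % 2) if src in byzantine else value

def om(m, commander, value, lieutenants, byzantine):
    # Step 1: the commander sends its value to every lieutenant.
    received = {p: send(commander, value, p, byzantine) for p in lieutenants}
    if m == 0:
        return received                         # OM(0): use the value received
    # Step 2: each lieutenant q relays what it received, acting as commander
    # in OM(m-1) towards the remaining lieutenants.
    relayed = {q: om(m - 1, q, received[q],
                     [p for p in lieutenants if p != q], byzantine)
               for q in lieutenants}
    # Step 3: lieutenant p decides on the majority of its direct value and the
    # values relayed to it by the other lieutenants.
    return {p: majority([received[p]] +
                        [relayed[q][p] for q in lieutenants if q != p])
            for p in lieutenants}

# n = 10, f = 3 (n > 3f): honest commander P0 with value 1, lieutenants P1..P9,
# three of them Byzantine. All non-faulty lieutenants decide 1.
print(om(3, 0, 1, list(range(1, 10)), byzantine={5, 7, 9}))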

TREE DATA STRUCTURE FOR AGREEMENT PROBLEM (BYZANTINE GENERALS)

EXPONENTIAL ALGORITHM: AN EXAMPLE

Consider the tree at a lieutenant node P3, for n = 10 processes P0 through P9 and f = 3 malicious processes. The commander is P0. Only one branch of the tree is shown for simplicity.

P3's perspectives are outlined next, with respect to the iterative formulation of the algorithm.


PHASE-KING ALGORITHM FOR CONSENSUS: POLYNOMIAL (SYNCHRONOUS SYSTEM)

o The phase-king algorithm proposed by Berman and Garay solves the consensus problem under the same model, requiring f + 1 phases and a polynomial number of messages, but it can tolerate only f < ⌈n/4⌉ malicious processes.

o The algorithm is so called because it operates in f + 1 phases, each with two rounds, and a unique process plays an asymmetrical role as a leader in each round.

o The phase-king algorithm assumes a binary decision variable.

o Each phase has a unique "phase king" derived, say, from the PID.

o Each phase has two rounds:

→ in the 1st round, each process sends its estimate to all other processes;

→ in the 2nd round, the "phase king" process arrives at an estimate based on the values it received in the 1st round, and broadcasts its new estimate to all others.


o (f + 1) phases, (f + 1)[(n − 1)(n + 1)] messages, and can tolerate up to f < ⌈n/4⌉ malicious processes.
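A compact Python simulation of the two-round phase structure, assuming a binary decision variable; the Byzantine behaviour modelled here (faulty processes send different bits to different receivers) and the data layout are illustrative assumptions.

from collections import Counter

def phase_king(n, f, initial, byzantine):
    assert n > 4 * f                                   # simple sufficient condition
    est = dict(initial)                                # current estimate at each process
    honest = [p for p in range(n) if p not in byzantine]

    def sent(src, receiver):                           # adversarial model for faulty senders
        return est[src] ^ (receiver % 2) if src in byzantine else est[src]

    for phase in range(f + 1):                         # f + 1 phases; king of phase k is P_k
        king = phase
        # Round 1: every process broadcasts its estimate; each receiver tallies.
        tally = {}
        for p in range(n):
            maj, mult = Counter(sent(q, p) for q in range(n)).most_common(1)[0]
            tally[p] = (maj, mult)
        # Round 2: the phase king broadcasts its own majority value as the tiebreaker.
        for p in honest:
            tiebreak = tally[king][0] if king not in byzantine else p % 2
            maj, mult = tally[p]
            est[p] = maj if mult > n // 2 + f else tiebreak
    return {p: est[p] for p in honest}

# n = 9, f = 2 (n > 4f): all honest processes end up with the same bit.
print(phase_king(9, 2, {p: p % 2 for p in range(9)}, byzantine={4, 7}))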
