February 2002 |
||||||||||||||||||||||||
| Fundamentals of Distributed Computing: A Practical Tour of Vector Clock Systems | ||||||||||||||||||||||||
| Roberto Baldoni Universita di Roma, Italy Michel Raynal IRISA, France A distributed computation consists of a set of processes that cooperate to achieve a common goal. A main characteristic of these computations is that the processes do not already share a common global memory and that they communicate only by exchanging messages over a communication network. Moreover, message transfer delays are finite yet unpredictable. This computation model defines what is known as the asynchronous distributed system model, which includes systems that span large geographic areas and are subject to unpredictable loads. A key concept of asynchronous distributed systems is causality. More precisely, given two events in a distributed computation, a crucial problem is knowing whether they are causally related. Could the occurrence of one event be a consequence of the other? Processes produce message sendings, message receives, and internal events. Events that are not causally dependent are concurrent. Fidge 1 and Mattern 2 simultaneously and independently introduced vector clocks to let processes track causality (and concurrency) between the events they produce. A vector clock is an array of n integers (one entry per process), where the entry j counts the number of relevant events that process Pj produces. The timestamp of an event a process produced (or of the local state this event generated) is the current value of the corresponding process's vector clock. So, by associating vector timestamps with events or local states, we can safely decide whether two events or two local states are causally related (see the "A Historical View of Vector Clocks" sidebar). Here, we present basic vector clock properties, mechanisms, and application examples to help distributed systems engineers solve the casuality problems they face. A model of distributed executionA distributed program is made up of n sequential local programs that, when executed, can communicate and synchronize only by exchanging messages. A distributed computation describes a distributed program's execution. Executing a local program gives rise to a sequential process. Let P1, P2, ..., Pn be this finite set of processes. We assume that, at runtime, each ordered pair of communicating processes (Pi, Pj) is connected by a reliable channel cij through which Pi can send messages to Pj. Executing an internal, send, or receive statement produces an internal, send, or receive event. Let
be the xth event process Pi produces. The sequence
constitutes the history of Pi. Let H be the set of events that a distributed computation produces. This set is structured as a partial order by L. Lamport?s ?happened-before? relation,[1] denoted ?—>? and defined as
e —>f means that event e can affect event f. Consequently, ¬(e —> f) means e cannot affect f. The partial order
constitutes a formal model of the distributed computation with which it is associated. Figure 1 depicts a distributed computation, where black points denote events.
Two events e and f are concurrent (or causally independent) if
The causal past of event e is the (partially ordered) set of events f such that f —> e. Similarly, the causal future of event e is the (partially ordered) set of events f such that e —> f. For example, in Figure 1, we have
and the causal past of the event e22 corresponds to the set
Vector clocks: A causality tracking mechanismA vector clock system is a mechanism that associates timestamps with events (local states) such that comparing two events' timestamps indicates whether those events (local states) are causally related (and, if they are, which one comes first). In the time-stamping system, each process Pi has a vector of integers VCi[1..n] (initialized to [,0,...,0]) that is maintained as follows:
Note that VCi[i] counts the number of events that Pi has so far produced. Moreover, for VCi[j] represents the number of events Pj produced that belong to the current causal past of Pi. When a process Pi produces an event e, it can associate with that event a vector timestamp whose value equals the current value of VCi. Figure 1 shows vector timestamp values associated with events and local states. For example, e26.VC = (5,6,5). Let e.VC and f.VC be the vector timestamps associated with two distinct events e and f, respectively. The following property is the fundamental property associated with vector clocks:2,3
where e.VC < f.VC is an abbreviation for
Let Pi be the process that produced e. This additional information lets us simplify the previous relation to 2,3
(See the "An Efficient Implementation of Vector Clocks" sidebar.) In our discussion of basic vector clock properties, we investigate three problems—causal broadcast, detecting message stability, and detecting an event pattern. Causal broadcast This means that when a proccess delivers a message m to a process, all messages whose broadcasts causally precede the broadcast of m have already been delivered to that process. The ISIS system first proposed such a communication abstraction.4 Several researchers have proposed vector clock-based implementations of causal broadcast, based on the following idea:5,6 A receiving process Pi must delay delivering a message m until all the messages broadcast in the causal past of m are delivered to Pi. Consider Figure 2. When m´ arrives at P2, its delivery must be delayed because m´ arrived at P2 before m, and the sending of m causally precedes m´. To this end, each process Pi must manage a vector clock (VCi) tracking its current knowledge on the number of messages that each process has sent.
Figure 3 describes a simple broadcast protocol (similar to one presented elsewhere5). Broadcast events are a computation's relevant events, and VCi[j] represents Pi?s knowledge of the number of messages that Pj did broadcasts and delivered to Pi. Each message m piggybacks a vector timestamp m.VC, revealing how many messages each process has broadcast in the causal past of m?s broadcast. Then, when a process Pi receives a message m, it delays its delivery until all the messages that belong to its causal past are delivered. This is expressed by a simple condition involving the vector clock of the receiving process Pi and the vector timestamp (m.VC) of the received message m—namely,
Figure 3 describes the resulting causal broadcast protocol (vectors are initialized to [0,...,0]).
Figure 3. A simple causal broadcast protocol. Detecting message
stability In the context of reliable broadcasting, operations correspond to messages, and, to meet the problem requirements in the presence of sender process failures and network partitions, each process must buffer a copy of every message it sends or receives. If a process Pi fails, any process with a copy of a message m sent by Pi can forward m to any process Pj that detects it has not received m. This can induce a rapid growth of the buffer at each process with the risk of overflowing. Therefore, we need a policy that reduces buffer overflow occurrence. A simpler observation shows that buffering a message that has been delivered to all its intended destinations is not necessary. Such a message is called a stable message, and we can safely discard such messages from a process's local buffer. A message stability detection protocol manages the process buffers. Such a protocol can be lazy (stability information piggybacks on application messages), use gossiping (periodic broadcast of control messages propogates stability information), or hybrid (both piggybacking and gossiping propogate stability information). To concentrate on the buffer management actions, we consider the simple case where communication channels are first-in first-out, and we assume there is no failure. Moreover, causal delivery is not ensured (that is, each message is delivered on receipt). Broadcast events are the computation's relevant events. Each process Pi has a vector (MCi) of vector clocks. This vector of vectors is such that the vector MCi[k] keeps Pi aware of messages delivered to Pk. More precisely,
represents Pi?s knowledge of the number of messages that Pk delivered and Pl sent; MCi[i][i] represents the sequence number of the next message Pi sent. Hence, the minimum value over column j of MCi —that is,
—represents Pi?s knowledge of the sequence number of the last message Pj sent that is stable. To propagate stability information, each message m that Pi sends piggybacks the identity of its sender (m.sender) and a vector timestamp m.VC, indicating how many messages Pi has delivered from each other process Pl, (that is, m.VC corresponds to the vector MCi[i][*]). Two operations update the local buffer (bufferi): deposit(m) inserts a message m in the buffer and discard(m) removes m from the buffer. A process buffers a message immediately after it receives it and discards it as soon as it becomes stable (that is, when the process learns that all processes have delivered m). We can express the stability predicate for a message m using
where m.VC[m.sender] represents the sequence number of m. Figure 4 describes the resulting protocol.
Figure 4. A simple lazy stability tracking protocol. Figure 5 describes an example of running this stability tracking protocol. P3 discards m immediately after receiving m´, because
which corresponds to the sequence number of m. At the end of the example, P1?s and P3?s buffers contain m´and m´´, while Pj?s buffer contains only m´´. To extend this protocol to handle causal delivery, we just need to add a delivery condition, similar to the one in Figure 3 and in the second clause of the protocol in Figure 4.
Detecting an event pattern Consider a distributed execution that produces two types of internal events: some are tagged black and others are tagged white. All communication events are tagged white. (As an example, in a distributed debugging context, an internal event is tagged black if the associated local state satisfies a given local predicate; otherwise, it is tagged white.) Given two black events, s and t, the problem consists of deciding if there is another black event u, such that
Let black(e) be a predicate indicating whether event e is black. More formally, given two events s and t, the problem consists of deciding if the following predicate P(s,t) is true:
Figure 6 shows that vector clocks do not solve this problem. In these two executions, both events s have the same timestamp: s.VC = (0,0,2). Similarly, both events t have also the same timestamp—namely, t.VC = (3,4,2). However, the right execution satisfies the pattern, while the left one does not. (Note that s and t will have the same timestamp in both executions, even if vector clocks are incremented only on the occurrence of black events.) Figure 6. Recognizing a pattern.
Which clocks solve it? The predicate P(s,t) can be decomposed into two subpredicates P1(s,u,t) and P2(s,u,t):
with
P1 indicates that only the black events are relevant for the predicate detection. So, detecting P(s,t) requires only tracking black events. This means we can use vector clocks managed in the following way: A process Pi increments VCi[i] only when it produces a black event, and the other statements associated with vector clocks are left unchanged. (Actually, black events define the abstraction level at which the distributed computation must be observed to detect P. All the other events—namely, the white events—are not relevant for detecting P). Consider Figure 7, where only black events are indicated. We have P(s,t1) = false, while P(s,t2)= true. The underlying idea to solve the problem lies in associating two timestamps with each black event e:
Note that we can consider e.MC[j] as a pointer from e to the last event that precedes it on Pj. When considering Figure 7, we have
Figure 7. P(s,t2) is true; P(s,t1) is not.
Managing the
clocks
Figure 8. Detection protocol for P(s,t). As before, the notation VC: = max(VC1,VC2) (statement S3 in Figure 8) is an abbreviation for
Moreover, in statement S3, MCi[k] and m.MC[k] contain vector timestamps of two black events of Pk. It follows that one of them is greater than (or equal to) the other. The result of max(MCi[k],m.MC[k]) is this greatest timestamp. Let us finally note that MCi[i][i] = VCi[i] – 1 and
So, we can deduce the vector clock VCi from the diagonal of the matrix MCi. This can reduce the number and size of data structures that processes manage and messages carry. The pattern
detection predicate
Note that, because the protocol considers only black events, the predicate P1 is trivially satisfied by any triple of events. So, detecting P(s,t) amounts to only detecting
Given s and t with their timestamps (namely, s.VC and s.MC for s; t.VC and t.MC for t), we can state the predicate
in a more operational way using vector timestamps:
If such an event u does exist, some process Pk produced it, and it belongs to the causal past of t. Consequently, its vector timestamp is such that From this observation, the previous relation translates into
As
is the vector timestamp of a black event in the causal past of t, we have
Consequently, the pattern detection predicate simplifies and becomes
To summarize, when this condition is true, it means that a process Pk exists that has produced a black event u such that
So, when the system is equipped with the vector clock system we described, we can evaluate the predicate P(s,t) using a simple test—namely,
Moreover, when we know the identity of the process (say Pi) that produced s, we can simplify this test. Using the relation R (presented earlier), the test becomes
Bounded vector clocksA vector clock system's main drawback is its inability to face scalability problems. To fully capture the causality relation among the events that a distributed computation's processes produce, a vector clock system requires vectors of size n (n being the number of processes). To circumvent this problem, researchers have introduced two types of bounded vector clocks (whose size is bounded by a constant that is less than n): approximate 11 and k-dependency12vector clocks. Approximate vector clocks use a space-folding approach. We can use this approach when we are only interested in never missing causality between related events (so, we accept that we perceive two events as ordered when they are actually concurrent). k-dependency vectors involve a time-folding approach in which an event's bounded timestamp provides causal dependencies that, when recursively exploited, reconstruct the event's vector timestamp. Approximate vector clocks
(let e.TS be the timestamp associated with e). Such a timestamping never violates causality in the sense that, from e.TS < f.TS, we can safely conclude ¬(f —> e). If we optimistically conclude e —>f, then we can be wrong, because it is possible that e and f are not causally related. That is why concluding e —>f from e.TS < f.TS constitutes an approximation of the causality relation. F.J. Torres and M. Ahamad introduced approximate vector clocks. They provide a simple mechanism that associates approximate vector timestamps with events. Consider vector clocks whose size is bounded by a constant (with k < n). So, TSi[1..k] is Pi's approximate vector clock. Moreover, let fk be a deterministic function from {1,...,n} to {1,...,k}. Given a process identity i, this function associates with it the entry fk(i) of all vector clocks TS[1..k]—that is, TS[fk(i)]. Implementing such a time-stamping system is similar to the one described earlier. Each process Pi manages its vector clock TSi[1..k], initialized to (0,..,0) in the following way:
Combined with the function fk, these rules ensure that all processes Pi share the xth entry of any vector clock, such that fk(i) = x. Such an entry sharing makes the vector clocks approximate as far as causality tracking is concerned. These approximate vector clocks are characterized by[11]
More generally, if k = n and
then we get classic vector clocks that track full causality. In that case, the vector clock system's entry i is private to Pi in the sense that only Pi can entail its increase. If k = 1, then
and all processes share the unique entry of the (degenerated) vector. The resulting clock system is Lamport?s scalar clock system.[1] This scalar clock system is well known for its property
Many applications consider the timestamp of an event e that Pi produced as the pair (e.TS,i). This provides an easy way to totally order (without violating causal relations) the set of all the events a distributed computation produces. This is the famous total order relation Lamport defined[1]—namely, if Pi and Pj produce e and f, respectively, e is ordered before f if
Also, scalar clocks detect some concurrent events—more precisely,
If 1 < k < n, then all processes Pi such that fk(i) = x share the same entry x of the vector clock system. This sharing adds false causality detections that make this vector clock system approximate. Experimental results[11] show that with n = 100 and 2 < k < 5, the percentage of situations in which e —> f is concluded from e.TS < f.TS (while e and f are concurrent) is less than 10 percent. Dependency vectors The following behavior characterizes a k-dependency vector clock system. Each process Pi has a vector clock DVi[1..n], which is initialized to (0,...,0) and managed in the following way: 12
A k-dependency vector clock system provides each process with an n size vector, but each message carries only a subset of size k. This subset always includes the current value of DVi[i] (where Pi is the sender process). Choosing the other k – 1 values is left to the user. A good heuristics consists in choosing the last modified k – 1 entries of DVi.12 It is easy to see that k = n provides classical vector clocks. Let us consider two events e and f, timestamped e.DV and f.DV, respectively. Moreover, let's assume that Pi produced e. The k-dependency vector protocol ensures the following property:
Note that the implication is in one direction only. This means that it is possible that e—>f while e.DV[i] > f.DV[i]. But, unlike approximate vector clocks, k-dependency vectors can (using additional computation) reconstruct the causality relation (see the "Reconstructing Vector Timestamps from Dependency Timestamps" sidebar). Of course, according to the problem to be solved, we can use k-dependency vectors and approximate vectors simultaneously. The concept of causality among events is fundamental to designing and analyzing distributed programs. However, a vector clock system suffers from limitations other than scalability. For example, the system can't cope with hidden channels.13 This problem arises when a system's processes can communicate through one or more channels that are distinct from the ones application messages use. Hidden channels can causally relate events in distinct processes; the vector clock system doesn't reveal these relations. Shared memory, a database, and a shared file are examples of hidden channels. Moreover, vector clocks can be difficult to adapt to dynamic systems, such as systems of multithreaded processes. A vector clock system also suffers limitations when we consider the computation model at a higher abstraction level where computation atoms are intervals (sets of events) instead of events. The "Can Vector Clocks Always Track Precedence Relations?" sidebar briefly addresses this issue. References
Roberto Baldoni is an associate professor at the school of engineering of the University of Rome, La Sapienza. He has published more than seventy scientific papers in the fields of distributed computing, dependable middleware, and communication protocols. He received a degree in electronic engineering and a PhD in computer science from the University of Rome, La Sapienza. Contact him at baldoni@dis.uniroma1.it; www.dis.uniroma1.it/~baldoni. Michel Raynal is a professor of
computer science at the University of Rennes, France. At IRISA
(CNRS-INRIA-University joint computing laboratory located in Rennes), he leads
the ADP
|