February 2002  
Fundamentals of Distributed Computing: A Practical Tour of Vector Clock Systems  
Roberto Baldoni, Universita di Roma, Italy
Michel Raynal, IRISA, France

A distributed computation consists of a set of processes that cooperate to achieve a common goal. A main characteristic of these computations is that the processes do not share a common global memory and communicate only by exchanging messages over a communication network. Moreover, message transfer delays are finite yet unpredictable. This computation model defines what is known as the asynchronous distributed system model, which includes systems that span large geographic areas and are subject to unpredictable loads.

A key concept of asynchronous distributed systems is causality. More precisely, given two events in a distributed computation, a crucial problem is knowing whether they are causally related. Could the occurrence of one event be a consequence of the other? Processes produce message sendings, message receipts, and internal events. Events that are not causally dependent are concurrent. Fidge[2] and Mattern[3] simultaneously and independently introduced vector clocks to let processes track causality (and concurrency) between the events they produce. A vector clock is an array of n integers (one entry per process), where entry j counts the number of relevant events that process P_j produces. The timestamp of an event a process produced (or of the local state this event generated) is the current value of the corresponding process's vector clock. So, by associating vector timestamps with events or local states, we can safely decide whether two events or two local states are causally related (see the "A Historical View of Vector Clocks" sidebar). Here, we present basic vector clock properties, mechanisms, and application examples to help distributed systems engineers solve the causality problems they face.

A model of distributed execution

A distributed program is made up of n sequential local programs that, when executed, can communicate and synchronize only by exchanging messages.
A distributed computation describes a distributed program's execution. Executing a local program gives rise to a sequential process. Let P_1, P_2, ..., P_n be this finite set of processes. We assume that, at runtime, each ordered pair of communicating processes (P_i, P_j) is connected by a reliable channel c_ij through which P_i can send messages to P_j. Executing an internal, send, or receive statement produces an internal, send, or receive event. Let e_i^x be the xth event process P_i produces. The sequence h_i = e_i^1 e_i^2 ... constitutes the history of P_i. Let H be the set of events that a distributed computation produces. This set is structured as a partial order by L. Lamport's "happened-before" relation,[1] denoted "→": e → f means that event e can affect event f. Consequently, ¬(e → f) means e cannot affect f. The partial order (H, →) constitutes a formal model of the distributed computation with which it is associated. Figure 1 depicts a distributed computation, where black points denote events.
Two events e and f are concurrent (or causally independent) if ¬(e → f) and ¬(f → e). The causal past of event e is the (partially ordered) set of events f such that f → e. Similarly, the causal future of event e is the (partially ordered) set of events f such that e → f. For example, in Figure 1, the causal past of the event e_2^2 is the set of events f such that f → e_2^2.

Vector clocks: A causality tracking mechanism

A vector clock system is a mechanism that associates timestamps with events (local states) such that comparing two events' timestamps indicates whether those events (local states) are causally related (and, if they are, which one comes first). In the timestamping system, each process P_i has a vector of integers VC_i[1..n] (initialized to [0,...,0]) that is maintained as follows:

- Before producing an event (internal, send, or receive), P_i increments its own entry: VC_i[i] := VC_i[i] + 1.
- Each message m that P_i sends piggybacks the current value of VC_i, denoted m.VC.
- When P_i receives a message m, it updates its vector to the componentwise maximum: ∀k: VC_i[k] := max(VC_i[k], m.VC[k]).
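A minimal sketch of this vector clock maintenance (the classic rules: increment your own entry at each event, piggyback the vector on messages, take a componentwise maximum on receipt). Process ids are 0-based and the class and function names are illustrative, not from the article:

```python
class VectorClock:
    def __init__(self, n, pid):
        self.pid = pid
        self.vc = [0] * n

    def local_event(self):
        # increment own entry before producing an event
        self.vc[self.pid] += 1
        return list(self.vc)              # timestamp of the event

    def send(self):
        # a message piggybacks the sender's current vector
        self.vc[self.pid] += 1
        return list(self.vc)

    def receive(self, m_vc):
        # componentwise maximum, then count the receive event itself
        self.vc = [max(a, b) for a, b in zip(self.vc, m_vc)]
        self.vc[self.pid] += 1
        return list(self.vc)

def happened_before(ts1, ts2):
    # e -> f  iff  e.VC < f.VC (componentwise <=, and not equal)
    return all(a <= b for a, b in zip(ts1, ts2)) and ts1 != ts2

def concurrent(ts1, ts2):
    return not happened_before(ts1, ts2) and not happened_before(ts2, ts1)
```

The two comparison functions implement the fundamental property discussed next: comparing timestamps decides causality or concurrency.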
Note that VC_i[i] counts the number of events that P_i has so far produced. Moreover, for j ≠ i, VC_i[j] represents the number of events P_j produced that belong to the current causal past of P_i. When a process P_i produces an event e, it can associate with that event a vector timestamp e.VC whose value equals the current value of VC_i. Figure 1 shows vector timestamp values associated with events and local states. For example, e_2^6.VC = (5,6,5).

Let e.VC and f.VC be the vector timestamps associated with two distinct events e and f, respectively. The following property is the fundamental property associated with vector clocks:[2,3]

(e → f) ⇔ (e.VC < f.VC),

where e.VC < f.VC is an abbreviation for (∀k: e.VC[k] ≤ f.VC[k]) ∧ (e.VC ≠ f.VC). Let P_i be the process that produced e. This additional information lets us simplify the previous relation to the following relation R:[2,3]

R: (e → f) ⇔ (e.VC[i] ≤ f.VC[i]).

(See the "An Efficient Implementation of Vector Clocks" sidebar.) In our discussion of basic vector clock properties, we investigate three problems: causal broadcast, detecting message stability, and detecting an event pattern.

Causal broadcast

Causal broadcast requires that message deliveries respect the causality of the corresponding broadcasts: when a process delivers a message m, all messages whose broadcasts causally precede the broadcast of m have already been delivered to that process. The ISIS system first proposed such a communication abstraction.[4] Several researchers have proposed vector clock-based implementations of causal broadcast, based on the following idea:[5,6] A receiving process P_i must delay delivering a message m until all the messages broadcast in the causal past of m have been delivered to P_i. Consider Figure 2. When m´ arrives at P_2, its delivery must be delayed because m´ arrived at P_2 before m, and the sending of m causally precedes the sending of m´. To this end, each process P_i must manage a vector clock (VC_i) tracking its current knowledge of the number of messages that each process has sent.
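This delay-then-deliver idea can be sketched as follows, assuming each message piggybacks the sender's vector taken before the sender counts its own broadcast (so m.VC counts exactly the messages broadcast in m's causal past); class and method names are illustrative, not the article's protocol:

```python
from collections import deque

class CausalBroadcastProcess:
    def __init__(self, n, pid):
        self.n, self.pid = n, pid
        self.vc = [0] * n            # vc[j]: broadcasts of P_j delivered here
        self.pending = deque()       # received but not yet deliverable
        self.delivered = []

    def broadcast(self, data):
        m = (self.pid, list(self.vc), data)  # m.VC: causal past of this broadcast
        self.vc[self.pid] += 1               # a process delivers its own broadcast
        self.delivered.append(data)
        return m                             # to be sent to all other processes

    def receive(self, m):
        self.pending.append(m)
        progress = True
        while progress:                      # deliver every deliverable message
            progress = False
            for msg in list(self.pending):
                sender, m_vc, data = msg
                # all broadcasts in m's causal past already delivered here?
                if all(self.vc[k] >= m_vc[k] for k in range(self.n)):
                    self.pending.remove(msg)
                    self.delivered.append(data)
                    self.vc[sender] += 1
                    progress = True
```

With this rule, a message that overtakes one of its causal predecessors simply waits in `pending` until the predecessor arrives.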
Figure 3 describes a simple causal broadcast protocol (similar to one presented elsewhere[5]). Broadcast events are the computation's relevant events, and VC_i[j] represents P_i's knowledge of the number of messages that P_j broadcast and that have been delivered to P_i. Each message m piggybacks a vector timestamp m.VC, revealing how many messages each process has broadcast in the causal past of m's broadcast. Then, when a process P_i receives a message m, it delays its delivery until all the messages that belong to m's causal past have been delivered. This is expressed by a simple condition involving the vector clock of the receiving process P_i and the vector timestamp (m.VC) of the received message m, namely, ∀k: VC_i[k] ≥ m.VC[k]. The resulting protocol appears in Figure 3 (vectors are initialized to [0,...,0]).

Figure 3. A simple causal broadcast protocol.

Detecting message stability

In the context of reliable broadcasting, operations correspond to messages, and, to meet the problem requirements in the presence of sender process failures and network partitions, each process must buffer a copy of every message it sends or receives. If a process P_i fails, any process with a copy of a message m sent by P_i can forward m to any process P_j that detects it has not received m. This can induce a rapid growth of the buffer at each process, with the risk of overflow. Therefore, we need a policy that reduces the occurrence of buffer overflow. A simple observation shows that buffering a message that has been delivered to all its intended destinations is unnecessary. Such a message is called a stable message, and we can safely discard stable messages from a process's local buffer.

A message stability detection protocol manages the process buffers. Such a protocol can be lazy (stability information piggybacks on application messages), use gossiping (periodic broadcast of control messages propagates stability information), or be hybrid (both piggybacking and gossiping propagate stability information). To concentrate on the buffer management actions, we consider the simple case where communication channels are first-in, first-out (FIFO), and we assume there are no failures. Moreover, causal delivery is not ensured (that is, each message is delivered on receipt). Broadcast events are the computation's relevant events. Each process P_i has a vector (MC_i) of vector clocks. This vector of vectors is such that the vector MC_i[k] keeps P_i aware of messages delivered to P_k. More precisely, MC_i[k][l] represents P_i's knowledge of the number of messages that P_k delivered and P_l sent; MC_i[i][i] represents the sequence number of the next message P_i sends. Hence, the minimum value over column j of MC_i (that is, min_k(MC_i[k][j])) represents P_i's knowledge of the sequence number of the last stable message P_j sent.
To propagate stability information, each message m that P_i sends piggybacks the identity of its sender (m.sender) and a vector timestamp m.VC indicating how many messages P_i has delivered from each other process P_l (that is, m.VC corresponds to the vector MC_i[i][*]). Two operations update the local buffer (buffer_i): deposit(m) inserts a message m in the buffer, and discard(m) removes m from the buffer. A process buffers a message immediately after it receives it and discards it as soon as it becomes stable (that is, when the process learns that all processes have delivered m). We can express the stability predicate for a message m at P_i as

stable(m) ≡ ∀k: MC_i[k][m.sender] ≥ m.VC[m.sender],

where m.VC[m.sender] represents the sequence number of m. Figure 4 describes the resulting protocol.

Figure 4. A simple lazy stability tracking protocol.

Figure 5 describes an example of running this stability tracking protocol. P_3 discards m immediately after receiving m´, because the minimum value over column m.sender of MC_3 then reaches the sequence number of m. At the end of the example, P_1's and P_3's buffers contain m´ and m´´, while P_2's buffer contains only m´´. To extend this protocol to handle causal delivery, we just need to add a delivery condition, similar to the one in Figure 3, to the second clause of the protocol in Figure 4.
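A minimal sketch of this lazy scheme under the same assumptions as above (FIFO channels, no failures, delivery on receipt). The class name and the dict-based message format are illustrative; for brevity, only the receive path discards stable messages:

```python
class StabilityProcess:
    """mc[k][l] is this process's knowledge of how many messages sent by
    P_l have been delivered to P_k; mc[pid][pid] doubles as the sequence
    number of the next message this process sends."""

    def __init__(self, n, pid):
        self.n, self.pid = n, pid
        self.mc = [[0] * n for _ in range(n)]
        self.buffer = []                       # copies kept until stable

    def broadcast(self, data):
        self.mc[self.pid][self.pid] += 1       # sequence number of this message
        m = {"sender": self.pid, "vc": list(self.mc[self.pid]), "data": data}
        self.buffer.append(m)                  # deposit(m)
        return m

    def receive(self, m):
        j = m["sender"]
        self.mc[self.pid][j] += 1              # one more message from P_j delivered
        # learn what P_j had delivered when it sent m
        self.mc[j] = [max(a, b) for a, b in zip(self.mc[j], m["vc"])]
        self.buffer.append(m)                  # deposit(m)
        self.buffer = [x for x in self.buffer if not self._stable(x)]  # discard

    def _stable(self, m):
        # every process has delivered m, as far as this process knows
        s = m["sender"]
        return all(self.mc[k][s] >= m["vc"][s] for k in range(self.n))
```

A later message from any process carries that process's delivery counts, so stability knowledge spreads without any control messages.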
Detecting an event pattern

Consider a distributed execution that produces two types of internal events: some are tagged black and others are tagged white. All communication events are tagged white. (As an example, in a distributed debugging context, an internal event is tagged black if the associated local state satisfies a given local predicate; otherwise, it is tagged white.) Given two black events, s and t, the problem consists of deciding if there is another black event u such that s → u and u → t. Let black(e) be a predicate indicating whether event e is black. More formally, given two events s and t, the problem consists of deciding if the following predicate P(s,t) is true:

P(s,t) ≡ ∃u: black(u) ∧ (s → u) ∧ (u → t).

Figure 6 shows that vector clocks do not solve this problem. In these two executions, both events s have the same timestamp: s.VC = (0,0,2). Similarly, both events t also have the same timestamp, namely, t.VC = (3,4,2). However, the right execution satisfies the pattern, while the left one does not. (Note that s and t will have the same timestamp in both executions, even if vector clocks are incremented only on the occurrence of black events.)

Figure 6. Recognizing a pattern.
Which clocks solve it?

The predicate P(s,t) can be decomposed into two subpredicates P_1(s,u,t) and P_2(s,u,t):

P(s,t) ≡ ∃u: P_1(s,u,t) ∧ P_2(s,u,t), with P_1(s,u,t) ≡ black(u) and P_2(s,u,t) ≡ (s → u) ∧ (u → t).

P_1 indicates that only the black events are relevant for the predicate detection. So, detecting P(s,t) requires tracking only black events. This means we can use vector clocks managed in the following way: a process P_i increments VC_i[i] only when it produces a black event, and the other statements associated with vector clocks are left unchanged. (Actually, black events define the abstraction level at which the distributed computation must be observed to detect P. All the other events, namely the white events, are not relevant for detecting P.) Consider Figure 7, where only black events are indicated. We have P(s,t_1) = false, while P(s,t_2) = true. The underlying idea to solve the problem lies in associating two timestamps with each black event e: its vector timestamp e.VC and a matrix timestamp e.MC where, for each j, e.MC[j] is the vector timestamp of the last black event that P_j produced in the causal past of e.
Note that we can consider e.MC[j] as a pointer from e to the last black event that precedes it on P_j. Figure 7 illustrates these timestamps for the events s, t_1, and t_2.
Figure 7. P(s,t_{2}) is true; P(s,t_{1}) is not.
Managing the clocks

Figure 8. Detection protocol for P(s,t).

As before, the notation VC := max(VC1,VC2) (statement S3 in Figure 8) is an abbreviation for ∀k: VC[k] := max(VC1[k], VC2[k]). Moreover, in statement S3, MC_i[k] and m.MC[k] contain vector timestamps of two black events of P_k. It follows that one of them is greater than (or equal to) the other. The result of max(MC_i[k], m.MC[k]) is this greatest timestamp. Let us finally note that MC_i[i][i] = VC_i[i] - 1 and, for j ≠ i, MC_i[j][j] = VC_i[j]. So, we can deduce the vector clock VC_i from the diagonal of the matrix MC_i. This can reduce the number and size of data structures that processes manage and messages carry.

The pattern detection predicate

Note that, because the protocol considers only black events, the predicate P_1 is trivially satisfied by any triple of events. So, detecting P(s,t) amounts to detecting only

∃u: (s → u) ∧ (u → t).

Given s and t with their timestamps (namely, s.VC and s.MC for s; t.VC and t.MC for t), we can state the predicate in a more operational way using vector timestamps:

∃u: (s.VC < u.VC) ∧ (u.VC < t.VC).

If such an event u does exist, some process P_k produced it, and it belongs to the causal past of t. Consequently, its vector timestamp is such that u.VC ≤ t.MC[k] (componentwise).
From this observation, the previous relation translates into

∃k: (s.VC < t.MC[k]) ∧ (t.MC[k] < t.VC).

As t.MC[k] is the vector timestamp of a black event in the causal past of t, we always have t.MC[k] < t.VC. Consequently, the pattern detection predicate simplifies and becomes

P(s,t) ≡ ∃k: s.VC < t.MC[k].

To summarize, when this condition is true, it means that a process P_k exists that has produced a black event u such that s → u and u → t.
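Evaluating the final predicate amounts to comparing s.VC against the entries of t.MC. A sketch, where timestamps are plain lists and the values in the usage below are illustrative, not taken from the figures:

```python
def vc_less(a, b):
    # componentwise <=, and not equal: the "<" relation on vector timestamps
    return all(x <= y for x, y in zip(a, b)) and a != b

def pattern(s_vc, t_mc):
    # P(s,t): does a black event u with s -> u and u -> t exist?
    # t_mc[k] is the vector timestamp of the last black event of P_k
    # in t's causal past (all zeros if there is none)
    return any(vc_less(s_vc, t_mc[k]) for k in range(len(t_mc)))
```

For instance, with s.VC = [0,0,1], a matrix t.MC = [[1,0,1],[0,0,0],[0,0,1]] satisfies the pattern (P_0 produced a black event after seeing s), while [[1,0,0],[0,1,0],[0,0,1]] does not.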
So, when the system is equipped with the vector clock system we described, we can evaluate the predicate P(s,t) using a simple test, namely, ∃k: s.VC < t.MC[k]. Moreover, when we know the identity of the process (say P_i) that produced s, we can simplify this test. Using the relation R (presented earlier), the test becomes ∃k: s.VC[i] ≤ t.MC[k][i].

Bounded vector clocks

A vector clock system's main drawback is its inability to face scalability problems. To fully capture the causality relation among the events that a distributed computation's processes produce, a vector clock system requires vectors of size n (n being the number of processes). To circumvent this problem, researchers have introduced two types of bounded vector clocks (whose size is bounded by a constant that is less than n): approximate[11] and k-dependency[12] vector clocks. Approximate vector clocks use a space-folding approach. We can use this approach when we are only interested in never missing causality between related events (so, we accept that we perceive two events as ordered when they are actually concurrent). k-dependency vectors involve a time-folding approach in which an event's bounded timestamp provides causal dependencies that, when recursively exploited, reconstruct the event's vector timestamp.

Approximate vector clocks

An approximate vector clock system associates a bounded timestamp with each event (let e.TS be the timestamp associated with e). Such a timestamping never violates causality in the sense that, from e.TS < f.TS, we can safely conclude ¬(f → e). If we optimistically conclude e → f, then we can be wrong, because it is possible that e and f are not causally related. That is why concluding e → f from e.TS < f.TS constitutes an approximation of the causality relation. F.J. Torres-Rojas and M. Ahamad introduced approximate vector clocks.[11] They provide a simple mechanism that associates approximate vector timestamps with events. Consider vector clocks whose size is bounded by a constant k (with k < n). So, TS_i[1..k] is P_i's approximate vector clock.
Moreover, let f_k be a deterministic function from {1,...,n} to {1,...,k}. Given a process identity i, this function associates with it the entry f_k(i) of all vector clocks TS[1..k], that is, TS[f_k(i)]. Implementing such a timestamping system is similar to the implementation described earlier. Each process P_i manages its vector clock TS_i[1..k], initialized to (0,...,0), in the following way: P_i increments TS_i[f_k(i)] each time it produces an event; each message m piggybacks the current value of TS_i, denoted m.TS; and, on receiving a message m, P_i updates its clock to the componentwise maximum, ∀x: TS_i[x] := max(TS_i[x], m.TS[x]).
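The maintenance of TS_i mirrors the vector clock rules, with the shared entry f_k(i) playing the role of the private entry. A sketch, assuming the folding function f_k(i) = i mod k (the mapping and the names are illustrative; any deterministic map onto the k entries works):

```python
class ApproxVectorClock:
    def __init__(self, k, pid):
        self.slot = pid % k          # f_k(pid): the entry this process shares
        self.ts = [0] * k

    def event(self):
        # increment the (possibly shared) entry on every event
        self.ts[self.slot] += 1
        return list(self.ts)         # timestamp of the event

    def send(self):
        return self.event()          # a message piggybacks the sender's vector

    def receive(self, m_ts):
        # componentwise maximum, then count the receive event itself
        self.ts = [max(a, b) for a, b in zip(self.ts, m_ts)]
        return self.event()

def ts_less(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b
```

From ts_less(e.TS, f.TS) we can safely conclude ¬(f → e); concluding e → f from it is only an approximation, since processes sharing a slot can make unrelated events look ordered.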
Combined with the function f_k, these rules ensure that all processes P_i such that f_k(i) = x share the xth entry of any vector clock. Such entry sharing makes the vector clocks approximate as far as causality tracking is concerned. These approximate vector clocks are characterized by[11]

(e → f) ⇒ (e.TS < f.TS).

More generally, if k = n and f_k(i) = i for every i, then we get classic vector clocks that track full causality. In that case, the vector clock system's entry i is private to P_i in the sense that only P_i can entail its increase. If k = 1, then f_k(i) = 1 for every i, and all processes share the unique entry of the (degenerate) vector. The resulting clock system is Lamport's scalar clock system.[1] This scalar clock system is well known for its property (e → f) ⇒ (e.TS < f.TS). Many applications consider the timestamp of an event e that P_i produced as the pair (e.TS, i). This provides an easy way to totally order (without violating causal relations) the set of all the events a distributed computation produces. This is the famous total order relation Lamport defined,[1] namely, if P_i and P_j produce e and f, respectively, e is ordered before f if (e.TS < f.TS) ∨ (e.TS = f.TS ∧ i < j). Also, scalar clocks detect some concurrent events; more precisely, (e.TS = f.TS) ⇒ (e and f are concurrent). If 1 < k < n, then all processes P_i such that f_k(i) = x share the same entry x of the vector clock system. This sharing adds false causality detections that make this vector clock system approximate. Experimental results[11] show that with n = 100 and 2 < k < 5, the percentage of situations in which e → f is concluded from e.TS < f.TS (while e and f are concurrent) is less than 10 percent.

Dependency vectors

The following behavior characterizes a k-dependency vector clock system.[12] Each process P_i has a vector clock DV_i[1..n], which is initialized to (0,...,0) and managed in the following way: P_i increments DV_i[i] each time it produces an event; each message m that P_i sends carries a subset of k entries of DV_i (always including DV_i[i]); and, on receiving m, the destination process updates each entry for which m carries a value to the maximum of that value and its current one.
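A sketch of such a k-dependency vector, carrying the sender's own entry plus the k - 1 most recently modified other entries (the dict-based message format and names are illustrative):

```python
class DependencyVector:
    def __init__(self, n, k, pid):
        self.n, self.k, self.pid = n, k, pid
        self.dv = [0] * n
        self.recent = []                 # other entries, most recently modified last

    def event(self):
        self.dv[self.pid] += 1
        return list(self.dv)             # timestamp of the event

    def send(self):
        self.dv[self.pid] += 1
        carried = {self.pid: self.dv[self.pid]}   # always carry own entry
        if self.k > 1:                            # plus up to k-1 recent entries
            for j in self.recent[-(self.k - 1):]:
                carried[j] = self.dv[j]
        return carried

    def receive(self, carried):
        self.dv[self.pid] += 1                    # the receive is an event too
        for j, v in carried.items():
            if j != self.pid and v > self.dv[j]:
                self.dv[j] = v
                if j in self.recent:
                    self.recent.remove(j)
                self.recent.append(j)
```

With a small k, transitive dependencies can be missed on the fast path (and must then be reconstructed offline), which is exactly why the guarantee below is one-directional.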
A k-dependency vector clock system provides each process with a vector of size n, but each message carries only a subset of size k. This subset always includes the current value of DV_i[i] (where P_i is the sender process). Choosing the other k - 1 values is left to the user. A good heuristic consists of choosing the last k - 1 modified entries of DV_i.[12] It is easy to see that k = n provides classical vector clocks. Let us consider two events e and f, timestamped e.DV and f.DV, respectively. Moreover, let's assume that P_i produced e. The k-dependency vector protocol ensures the following property:

(e.DV[i] ≤ f.DV[i]) ⇒ (e → f).

Note that the implication is in one direction only. This means that it is possible that e → f while e.DV[i] > f.DV[i]. But, unlike approximate vector clocks, k-dependency vectors can (using additional computation) reconstruct the causality relation (see the "Reconstructing Vector Timestamps from Dependency Timestamps" sidebar). Of course, according to the problem to be solved, we can use k-dependency vectors and approximate vectors simultaneously.

The concept of causality among events is fundamental to designing and analyzing distributed programs. However, a vector clock system suffers from limitations other than scalability. For example, the system can't cope with hidden channels.[13] This problem arises when a system's processes can communicate through one or more channels that are distinct from the ones application messages use. Hidden channels can causally relate events in distinct processes; the vector clock system doesn't reveal these relations. Shared memory, a database, and a shared file are examples of hidden channels. Moreover, vector clocks can be difficult to adapt to dynamic systems, such as systems of multithreaded processes. A vector clock system also suffers limitations when we consider the computation model at a higher abstraction level, where computation atoms are intervals (sets of events) instead of events.
The "Can Vector Clocks Always Track Precedence Relations?" sidebar briefly addresses this issue.

References
Roberto Baldoni is an associate professor at the school of engineering of the University of Rome, La Sapienza. He has published more than seventy scientific papers in the fields of distributed computing, dependable middleware, and communication protocols. He received a degree in electronic engineering and a PhD in computer science from the University of Rome, La Sapienza. Contact him at baldoni@dis.uniroma1.it; www.dis.uniroma1.it/~baldoni.

Michel Raynal is a professor of computer science at the University of Rennes, France. At IRISA (CNRS-INRIA-University joint computing laboratory located in Rennes), he leads the ADP
