Certifying Safety when Implementing Consensus

03/08/2019 ∙ by Aurojit Panda, et al. ∙ 0

Ensuring the correctness of distributed system implementations remains a challenging and largely unaddressed problem. In this paper we present a protocol that can be used to certify the safety of consensus implementations. Our proposed protocol is efficient both in terms of the number of additional messages sent and their size, and is designed to operate correctly in the presence of n-1 nodes failing in an n node distributed system (assuming fail-stop failures). We also comment on how our construction might be generalized to certify other protocols and invariants.




1 Introduction

Correctly implementing distributed systems remains challenging. As distributed systems have become a crucial part of the infrastructure, underlying many popular services, there has been tremendous interest and work in improving their reliability. Interactive theorem provers such as TLA+ [33], Coq [4], Isabelle/HOL [23], etc., and automated theorem provers such as Dafny [20], IVY [26], etc. are increasingly used to check the correctness of distributed protocols [22] before deployment. However, while these tools can ensure that the protocol meets safety and liveness conditions, they cannot provide similar guarantees for implementations.

In response to this gap between protocol and implementation correctness, several recent projects, e.g., IronFleet [14], Ivy [25], Diesel [30], etc., have proposed extracting correct implementations from verified protocol specifications. These implementations are correct as long as these tools correctly account for low-level system behavior (e.g., the behavior of Unix sockets), and can be incorrect otherwise [9]. While these approaches are promising, they have several shortcomings: first, incorporating low-level optimizations, e.g., ones designed to take advantage of network connectivity [1, 29] or I/O [12], is challenging and can require changes to the underlying extraction tool; second, any change to the protocol, no matter how minor, requires extracting a new implementation, since current tools do not support incremental extraction, which increases the time and effort required for validation and deployment; third, most deployed systems provide functionality beyond the basic protocol, e.g., consensus services such as Chubby [2], ZooKeeper [15], etc. offer not just a basic consensus protocol but also a key-value interface, support for leases [13], etc., and incorporating these additional mechanisms into protocol specifications is challenging and might even render these protocols inexpressible in some of these tools. As a result of these shortcomings, at present hand-written code generally provides more features than automatically generated, correct-by-construction code, and thus cannot easily be replaced. Adopting these techniques therefore remains challenging in practice, which points to a need for additional techniques to ensure correctness in distributed systems.

In this paper we take a different tack and propose a complementary approach which can be used to certify (i.e., check) safety properties of a running distributed system implementation. Our approach, which is implemented by distributed processes we refer to as certification agents, checks whether inputs provided by the implementation satisfy a programmer provided predicate, which encodes the safety condition being checked. In §3 we show that such a framework suffices to check agreement and validity for distributed consensus implementations.

To enable practical deployments, in our work we aim for certification protocols that are sound (i.e., they correctly identify all cases where a predicate does not hold), complete (i.e., they do not generate false positives) and efficient. In terms of efficiency this work focuses on certification protocols where during a round of certification (i.e., one round of checking that the predicate holds) each node sends no more than a constant number of messages, and the size of each message is bounded in terms of $n$, the number of nodes in the distributed system. In practice, beyond these asymptotic bounds we aim for protocols which require each node to send no more than a few bytes for each certification round, thus limiting their overheads. We also aim for protocols which exhibit message locality, i.e., where nodes only communicate with a few neighboring nodes. The latter is useful in practice when certifying systems deployed on networks with heterogeneous link speeds. We discuss some additional concerns around deployment and use of such a framework in §5.

The approach we describe in this paper builds on recent results on how to distribute verifiers in interactive proof systems [16, 21, 17]. In the context of that work (and borrowing terminology from [16]), we aim to find 1-round distributed Merlin-Arthur (1-dMA) protocols and then execute them in a non-interactive setting. In our setting the prover represents the implementation (which provides us with inputs for the predicate) and the predicate itself, while certification agents represent verifiers. There are well-known techniques for translating interactive proofs into non-interactive proofs, including the Fiat-Shamir heuristic [7]; however, a naive application of these techniques to our setting does not result in efficient algorithms, and we therefore use a different construction here. Our problem statement is also similar to the previous work on proof labeling schemes [17]; however, while that work focused on checking properties of the network topology, here we apply this approach to checking safety properties of distributed protocols. We provide a more detailed discussion of related work in §6.

Finally, while this paper focuses on presenting certification protocols for agreement and validity in consensus algorithms, we hope in the future to generalize this construction to support certification of safety properties for other systems. To do so we need to ask what the limits are on the kinds of predicates that can be checked using a distributed protocol. Recent results by Naor et al [21] present a construction for converting any centralized verifier which can be executed in time on a RAM machine into a distributed verifier; however, this construction requires at least three rounds of interaction between the prover and the verifier. Converting this to a non-interactive setting at least requires a random oracle (that can provide the same random number to all nodes) and potentially several rounds of messages, which might violate our requirements. At the same time, in developing our protocol for consensus algorithms we found that most boolean predicates and arithmetic predicates can be efficiently translated to such a framework, indicating that such a framework might suffice for safety properties in several other distributed systems. We leave the question of generality to future work, but provide a brief discussion of this question based on our experience thus far in §4.

2 System Model and Requirements

Figure 1: System model: we show a distributed system with 5 nodes. Each node runs both a process implementing the distributed protocol and a certification agent, which is the focus of this work.

We begin by providing a system model, within which we describe certification protocols and then state requirements for these protocols. We consider distributed systems consisting of $n$ nodes (e.g., a 5 node system is shown in Figure 1). We assume that a single node can execute several processes which share fate, i.e., the failure of one process leads to the failure of all other processes on the same node. We also assume that nodes are connected by a partially synchronous network [6], i.e., we assume there are bounds on how long any computation can take and on how long messages take to be transferred between nodes. We assume that certification agents can communicate with each other by sending and receiving messages.

In our discussion we assume that each node runs two processes: an implementation process running the protocol being certified, and a certification agent; thus allowing us to assume fate-sharing between the implementation and the certification agent. Each implementation process can communicate with its local certification agent, i.e., the certification agent running on the same node. We do not constrain the communication pattern used by implementation processes, nor do we assume visibility into any messages sent or received by these processes. As a result our setting is similar to that used in proof labeling schemes [17], with the certification agent in each node being provided an input from the local implementation (referred to as a label in the proof labeling work) and the certification protocol checking whether these inputs satisfy some user-supplied predicate. We discuss our relation to this existing work in greater detail later in §6.

The certification agents as a whole implement a distributed protocol to check whether the inputs at all nodes satisfy some predicate. We assume that a safety condition is violated when the predicate does not hold, and thus require that at least one (out of $n$) certification agent signal an error whenever the predicate is violated. Note that the requirement that at least one node signal an error is very weak, since we can neither assume that a particular node will always signal errors during predicate violations, nor that all nodes will signal violations. This means that in practice using a certification protocol to check correctness requires a process or administrator to monitor all nodes for error messages; however, this allows the use of protocols that can operate in significantly weaker failure models than consensus algorithms (surviving $n-1$ node failures).

A certification round is invoked by the implementation process on a node when it sends a message to the local certification agent. In what follows, for ease of exposition, we assume that implementations at all nodes invoke the certification agent within a short time bound; however, we note that this assumption can be relaxed by allowing certification agents to query the local implementation process for input. We make our description more concrete in the next section.

We require correct certification protocols to provide both soundness and completeness. We define a certification protocol which evaluates a predicate to be sound if at least one node signals an error on any certification round where the predicate does not hold. We similarly define the certification protocol to be complete if no node signals an error in any round where the predicate holds. In this paper we will focus our attention on predicates for which sound and complete certification protocols exist. Our current results do not address the question of what class of predicates can be correctly checked using distributed certification protocols, and we hope to address this question in future work. We discuss this point in greater detail in §5.

Beyond correctness we also impose a few efficiency requirements on the certification protocols considered in this paper. These efficiency requirements, which minimize the number and size of messages used for certification, are designed to allow practical deployments of these protocols without requiring operators and developers to add significant network resources to existing systems. Specifically, we require that during each certification round, each node send no more than messages, and each message sent by the certification protocol be no more than bytes long. In practice, during failure-free operations, our certification protocols for agreement and validity require sending no more than one message per-node per-certification round, and the messages sent are of constant size. In addition to ensuring that nodes send a fixed number of small messages, we also aim to ensure that our certification protocols require each node to communicate with only a few neighboring nodes. The concept of what nodes are neighboring here can be decided on by operators, thus enabling the use of certification in deployments where some nodes might be in the same datacenter, while others might only be accessible over a wide area network and thus communication between different sets of nodes imposes different costs. A key focus of our future work is to design a framework for automatically generating efficient and correct certification protocols for a wide range of predicates, which we discuss in §4.

3 Certification of Consensus Protocols

In this section we make concrete the notion of distributed certification protocols by developing protocols that can be used to check the correctness of distributed consensus protocols [19]. We focus on consensus protocols since they have been widely studied: we have theoretical bounds on what a consensus protocol can achieve [8], and a wide range of consensus protocols have been developed [18, 24, 6], targeting a range of different network models (asynchronous, synchronous and partially synchronous) and a range of failure models [3]. Furthermore, consensus protocols are widely deployed in practical systems [2, 15], forming a critical part of how distributed datastores and systems ensure consistency. Errors in the implementation of these systems can thus affect data consistency and the availability of web services, so mechanisms for checking and preserving the correctness of these systems are of practical importance.

3.1 Safety Properties Considered

We consider consensus protocols where each node receives 0 or more proposals and all nodes execute a protocol to agree on (i.e., choose) a single proposal that they then output. We built certification protocols for two safety properties that are satisfied by all consensus protocols [19]:

Agreement: which requires that all nodes executing a consensus protocol output the same value.

Validity: which requires that the agreed upon value was proposed. Equivalently, each node receives a set of proposals, and validity requires that a value output by the consensus protocol be contained in the union of the sets of proposals across nodes.

3.2 Certification Protocol

Here we provide a brief overview of our protocols for checking agreement and validity, which we present in greater detail below. Both protocols represent circuits which are evaluated over inputs distributed across nodes: the agreement circuit checks for cases where nearby nodes disagree on the chosen value, while the validity circuit checks whether the node or any nearby node has a proposal proving validity of the chosen value. To implement these protocols we therefore need a notion of nearby nodes, and we need to arrange nodes in a topology that allows each node to send a message to one or a few neighbors (so we can meet our efficiency goals) while still ensuring evaluation across all nodes. Our approach to doing so is to arrange the nodes in a spanning tree, and then embed the circuit on this spanning tree. We allow users of our framework to specify how the spanning tree is created, and they can rely on protocols such as Perlman’s spanning tree protocol [28] or any other protocol, including ones that consider the underlying physical network over which communication is performed. Since the spanning tree is computed by a user-specified protocol, we also certify that the spanning tree computed is in fact correct, i.e., it is structurally a tree, and spans all nodes.

We begin by presenting certification protocols for the spanning tree, agreement and validity in the failure free case. We present our mechanisms for handling failures in §3.3, however we note that this mechanism requires little more than recomputing and re-certifying the spanning tree, for which we again call into the user provided algorithm. We now present these protocols in greater detail.

3.2.1 Certifying a Spanning Tree

As specified above, our framework is designed to allow users to use an arbitrary protocol for constructing a spanning tree. We require the output from this spanning tree construction to be supplied to each node as a six-tuple comprising:

  • A unique ID for the node. We assume that these IDs are totally ordered; this allows for integer IDs, IDs based on network address, etc.

  • A list of IDs for all active nodes in the system.

  • The ID of the tree’s root.

  • The ID of the node’s parent in the spanning tree, or null in case the node is the root.

  • A list of IDs of the node’s children, or null in case the node is a leaf.

  • Finally, the node’s distance from the root.
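
The six-tuple above can be written down as a small record type. The following is a minimal sketch rather than the authors' data structure: the class name `SpanningTreeInput` and all field names are hypothetical, and integer IDs are assumed as one instance of a totally ordered ID type.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpanningTreeInput:
    """Per-node output of the user-supplied spanning tree protocol (hypothetical names)."""
    node_id: int                    # unique, totally ordered ID of this node
    all_ids: List[int]              # IDs of all active nodes in the system
    root_id: int                    # ID of the tree's root
    parent_id: Optional[int]        # None when this node is the root
    children: Optional[List[int]]   # None when this node is a leaf
    distance: int                   # distance from the root

    def is_root(self) -> bool:
        return self.parent_id is None

    def is_leaf(self) -> bool:
        return not self.children


# A three-node tree: node 1 is the root, nodes 2 and 3 are its children.
root = SpanningTreeInput(1, [1, 2, 3], 1, None, [2, 3], 0)
leaf = SpanningTreeInput(2, [1, 2, 3], 1, 1, None, 1)
```

Representing "null" as `None` keeps the root and leaf cases explicit, which is what the structural checks rely on.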

As mentioned above, we need to certify the correctness of this input, for which we rely on techniques which have already been described in prior work, including the work of Korman et al [17] and Naor et al [21]. Certifying the correctness of this input requires checking structural properties of the supplied graph, i.e., ensuring that the graph is acyclic, has a single root and spans all nodes; and checking that node IDs are unique.

To check that the graph is acyclic and has a single root we have each node send its parent (if any) the ID of the root node and its distance from the root. The parent then ensures that the root node supplied to it as a part of its input is the same as the root node supplied to its children, and that the distance from any of its children to the root is greater than its distance to the root. Checking that a parent’s root agrees with its children ensures that the supplied graph cannot be a connected forest, i.e., it ensures that the supplied graph either has one root or is disconnected. This is because for the graph to have two roots which are connected, there must be sibling nodes which disagree on the ID of the root, in which case at least one of them must also disagree about the root ID with its parent. Checking distances ensures that the graph is acyclic.

Next, we need to check that the graph spans all nodes. This is done by having each node, starting with the leaves, send its parent a message containing the number of nodes in the subgraph rooted at that node. For leaf nodes, this simply requires sending $1$, while any non-leaf nodes add up the counts received from their children, and add one to this sum before sending it to their parent. Once the root has computed this value it checks whether the computed value equals $n$, in which case the supplied graph does span all nodes, and reports an error otherwise.
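
The parent-side check described above is small enough to sketch directly. This is an illustrative sketch under assumptions: the function name `check_children` and the representation of a child's message as a `(root_id, distance)` pair are hypothetical, not taken from the paper.

```python
from typing import Iterable, Tuple


def check_children(my_root: int, my_distance: int,
                   reports: Iterable[Tuple[int, int]]) -> bool:
    """Return True when every child report is consistent with this node.

    Each report is an assumed (root_id, distance) pair sent by one child.
    """
    for child_root, child_distance in reports:
        if child_root != my_root:
            return False   # the child was told about a different root
        if child_distance <= my_distance:
            return False   # the child is not strictly farther from the root
    return True
```

A node would signal an error whenever this function returns `False`; for example, `check_children(1, 0, [(2, 1)])` is `False` because the child reports root 2.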
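
The counting step can be illustrated with a centralized simulation of the bottom-up pass. This is a sketch under assumptions: the tree is given as a hypothetical `children` dictionary, and the recursion stands in for the messages that each node would send to its parent.

```python
from typing import Dict, List


def subtree_size(node: int, children: Dict[int, List[int]]) -> int:
    """Count the nodes in the subtree rooted at `node` (a leaf reports 1)."""
    return 1 + sum(subtree_size(c, children) for c in children.get(node, []))


def spans_all_nodes(root: int, children: Dict[int, List[int]], n: int) -> bool:
    """The root's check: the count it computes must equal n."""
    return subtree_size(root, children) == n


# Node 1 is the root with children 2 and 3; node 3 has child 4.
tree = {1: [2, 3], 3: [4]}
```

With this tree, `spans_all_nodes(1, tree, 4)` holds, while `spans_all_nodes(1, tree, 5)` does not, which corresponds to the root reporting an error because one node is missing from the tree.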

The procedure thus far is sufficient for proving structural properties of the graph, however it cannot prove that nodes have unique IDs. To prove uniqueness we adopt a multi-set equality protocol described in Naor et al [21], however in adopting this protocol we needed to make minor modifications to allow its use in a setting where nodes cannot interact with a designated centralized prover. We describe this protocol and our changes next.

3.2.2 Certifying Unique IDs

1: procedure CertifyUniqueID( )  ▷ The outer function called to check uniqueness of IDs
2:     for  do  ▷ Assuming nodes in the system
3:         CheckUniqueID(i)
4:     end for
5: end procedure
6: procedure CheckUniqueID()  ▷ Run at a node to check if all nodes have unique IDs
7:     Receive messages from all children.
8:     CheckSetEquality(, , , )  ▷ is the node’s ID, is the successor ID.
9: end procedure
10: procedure CheckSetEquality(, , , )  ▷ Check equality between set and
11:     ▷ is the value from at the current node.
12:     ▷ is the value from at the current node.
15:     for  do
18:     end for
19:     if IsRoot() then  ▷ Is the current node the root
20:         if  then  ▷ Check if sets differ
21:             SignalError( )  ▷ Signal an error
22:         end if
23:     else
24:         Send(parent, (p0, p1))  ▷ Send a tuple of products to parent
25:     end if
26: end procedure
Algorithm 1 Protocol for certifying that node IDs are unique

We show pseudocode for a protocol that can be used to certify that IDs are unique in Algorithm 1. The core of this protocol is the multi-set equality protocol proposed in Naor et al [21], and the only changes are to allow this protocol to be executed in the absence of a random oracle and without a centralized prover. We briefly explain the original protocol before explaining our technique for eliminating the use of a random oracle and some performance trade-offs this offers.

First, in order to use a set equality protocol we need to show that checking uniqueness of IDs can be reduced to multi-set equality. In §3.2.1 we required that IDs have a total order, and that each node be provided with a list of all IDs. We define the successor of an ID to be the smallest valid ID that is greater than it. If there is no valid ID larger than some ID, then we define its successor to be the smallest valid ID in the system. This puts node IDs in a ring. Next we consider two multi-sets: the multi-set of IDs assigned to each node, and the multi-set of successors to each node’s ID. We note that each node is provided its own ID as input, and can use this and the list of IDs to compute the corresponding successor. Finally observe that these two multi-sets are equal if and only if each node has a unique ID. As a result, checking uniqueness requires us to use a distributed multi-set equality protocol to check equality between the two multi-sets, and to run a local check that each ID in the list of IDs is unique. In Algorithm 1 we assume that all nodes have already checked that the provided list does not contain duplicate IDs.
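
The reduction from uniqueness to multi-set equality can be checked on small examples with a centralized sketch. The function names below are hypothetical, and the comparison uses `collections.Counter` in place of the distributed multi-set equality protocol; it illustrates only the reduction, not the protocol itself.

```python
from collections import Counter
from typing import List


def successor(node_id: int, valid_ids: List[int]) -> int:
    """Smallest valid ID greater than node_id, wrapping around to the smallest ID."""
    larger = [i for i in valid_ids if i > node_id]
    return min(larger) if larger else min(valid_ids)


def ids_unique(assigned: List[int], valid_ids: List[int]) -> bool:
    """Compare the multi-set of assigned IDs with the multi-set of their successors.

    `assigned` holds one ID per node; `valid_ids` is the duplicate-free ID list.
    """
    ids = Counter(assigned)
    succs = Counter(successor(i, valid_ids) for i in assigned)
    return ids == succs
```

For instance, with `valid_ids = [1, 2, 3]`, the assignment `[1, 2, 3]` yields successors `[2, 3, 1]` and the multi-sets match, whereas `[1, 1, 3]` yields successors `[2, 2, 1]` and the mismatch exposes the duplicate.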

The multi-set equality protocol itself relies on polynomial evaluation to check equality. First we note that given a set $S = \{s_1, \ldots, s_n\}$, where each $s_i$ is an integer drawn from some finite field (to treat elements in this manner we treat fixed length elements as integers represented by their binary representation, or rely on cryptographic hash functions in cases where elements might be variable length), one can construct and evaluate the polynomial function $P_S(x) = \prod_{i=1}^{n} (x - s_i)$ for any point $x$ using an arithmetic circuit embedded in a spanning tree (which we are given as input, and whose structure we have already checked using the procedure described in the previous section). We show an evaluation mechanism for this polynomial on lines 13-16 and 24 of Algorithm 1. Next, observe that given two degree $n$ polynomials $p$ and $q$, the polynomial $p - q$ must either be $0$ at no more than $n$ points, or $p$ and $q$ must be equal (and thus the difference must be identically 0). This is because in the case where $p \neq q$, $p - q$ is a nonzero polynomial of degree at most $n$, and hence has at most $n$ distinct roots. Given this we can now construct polynomials from the two multi-sets defined in the previous paragraph, and then use the difference of these polynomials to check set equality.

In Naor et al’s work this check is performed by picking points at random from a finite field and evaluating the difference of the two polynomials on these random points. One can bound the probability of any point picked at random being a root of this difference, and hence can evaluate the polynomial difference on a constant number of points to check multi-set equality. However, in our setting we neither assume a random oracle, nor a designated prover who can broadcast a set of random numbers. We resolve this in Algorithm 1 by evaluating the difference on a fixed range of points rather than on random points. While this suffices for correctness, it is inefficient in the number of bits each node must send each time it is executed.
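
The deterministic variant can be sketched as follows. Several details here are assumptions rather than the paper's choices: the modulus `PRIME`, the evaluation range `0..n` (chosen because two distinct monic degree-$n$ polynomials cannot agree on $n+1$ points), the function names, and the recursion that stands in for messages flowing from children to parents. IDs are assumed to be non-negative integers smaller than `PRIME`.

```python
from typing import Dict, List, Tuple

PRIME = 2_147_483_647  # 2**31 - 1; an assumed field modulus, larger than any ID


def node_products(own_id: int, own_succ: int, points: List[int],
                  child_msgs: List[Tuple[List[int], List[int]]]
                  ) -> Tuple[List[int], List[int]]:
    """Evaluate both product polynomials for this node's subtree at every point."""
    p0 = [(x - own_id) % PRIME for x in points]    # factor for the node's ID
    p1 = [(x - own_succ) % PRIME for x in points]  # factor for the successor ID
    for c0, c1 in child_msgs:                      # fold in each child's products
        p0 = [(a * b) % PRIME for a, b in zip(p0, c0)]
        p1 = [(a * b) % PRIME for a, b in zip(p1, c1)]
    return p0, p1


def multisets_equal(tree: Dict[int, List[int]], ids: Dict[int, int],
                    succs: Dict[int, int], root: int, n: int) -> bool:
    """Simulate the bottom-up pass; the root compares the two product vectors."""
    points = list(range(n + 1))

    def evaluate(node: int) -> Tuple[List[int], List[int]]:
        msgs = [evaluate(c) for c in tree.get(node, [])]
        return node_products(ids[node], succs[node], points, msgs)

    p0, p1 = evaluate(root)
    return p0 == p1
```

Each node sends one pair of product vectors to its parent, so the message grows with the number of evaluation points, which matches the inefficiency noted above.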

However, we note that in practice node IDs rarely change, e.g., they are unlikely to change during normal operation or due to node failures. As a result, this certification process runs infrequently, likely only in cases where new nodes are added, and if we can assume that IDs remain constant for at least certification rounds, then we find that our relatively simple protocol requires nodes to send out an amortized bits each time uniqueness has to be certified, and is thus efficient in practice. However, even when ID uniqueness has to be certified more frequently, we can make use of the root of the spanning tree to further increase efficiency, by having the root compute a set of random numbers and then having them propagate through the tree before executing the certification procedure described above. While this increases efficiency, it comes at the cost of increased protocol complexity.

3.2.3 Certifying Agreement

1: procedure CertifyAgreement(d, msgs)  ▷ d is the output (decision) of the local consensus implementation
2:     ▷ msgs is the set of messages received from the node’s children.
3:     for m ∈ msgs do
4:         if m ≠ d then  ▷ Check if a child has decided on a different value
5:             SignalError( )  ▷ Signal an error
6:         end if
7:     end for
8:     Send(parent, d)  ▷ Send this node’s decision to its parent.
9: end procedure
Algorithm 2 Protocol for certifying agreement between nodes.

Having described how during initialization certification agents need to be organized in a spanning tree, and how to certify correctness for this spanning tree, we now turn to certifying the correctness of the consensus protocol itself. We start by describing our protocol for checking agreement, which we show in Algorithm 2. For ease of exposition, we show only the core of this and the validity certification algorithm, and we assume that the CertifyAgreement procedure is called similarly to the CheckSetEquality procedure in Algorithm 1 (line 8), i.e., a calling procedure first waits to receive messages from all children, collects these messages in a set msgs and then calls CertifyAgreement with the input from the implementation and the set msgs.

The certification protocol simply requires that each node check whether the decision reached by it differs from the decision reached by any of its children (lines 3-7). If so the node signals an error (line 5), and independent of whether an error is signalled it forwards its local decision to its parent (line 8). This protocol ensures completeness since any node signalling an error must have observed a case where the value it has decided on differs from the value one of its children has decided on, i.e., an error is signalled only when the protocol observes disagreement between two nodes. The protocol provides soundness because equality is transitive, and our use of a spanning tree ensures that we perform a transitive equality check between any two nodes in the system.
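
The agreement check can be simulated on a small tree as follows. This is a sketch under assumptions: the function names and the dictionary representation of the tree are hypothetical, and the recursion stands in for the messages that children send to their parents.

```python
from typing import Dict, Hashable, List


def certify_agreement(decision: Hashable, child_msgs: List[Hashable]) -> bool:
    """Return True when this node must signal an error."""
    return any(m != decision for m in child_msgs)


def run_agreement_round(tree: Dict[int, List[int]],
                        decisions: Dict[int, Hashable],
                        root: int) -> List[int]:
    """Simulate one round and return the IDs of nodes that signal an error."""
    errors: List[int] = []

    def visit(node: int) -> Hashable:
        msgs = [visit(c) for c in tree.get(node, [])]   # wait for all children
        if certify_agreement(decisions[node], msgs):
            errors.append(node)
        return decisions[node]                          # forward own decision

    visit(root)
    return errors


# Node 0 is the root with children 1 and 2; node 1 has child 3.
tree = {0: [1, 2], 1: [3]}
```

When node 3 alone decides a different value, only its parent (node 1) observes the disagreement, which illustrates why an operator has to monitor every node for errors rather than just the root.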

In terms of efficiency, each node sends a single message at each round of this protocol. For any consensus protocol where the agreed value is of constant size, this message is also of constant size. However, one could use a consensus protocol to decide on values whose size depends on the number of nodes in the system (e.g., in case the consensus protocol is used to decide on a schedule of nodes). In this latter case we use a cryptographic hash function to map this arbitrary length value to a constant-length hash string. Since standard assumptions about cryptographic hash functions imply that, with high probability, two distinct values do not hash to the same value, our protocol ensures soundness and completeness with high probability in this case. Finally, we note that while this protocol has low message complexity, it requires as many rounds of communication as the height of the spanning tree, which can mean requiring $O(n)$ rounds in the worst case. However, we note that a good choice of spanning tree (e.g., one with height $O(\log n)$) can be used to reduce the number of rounds required, and flexibility in choosing the tree can be used in practice to optimize communication patterns, e.g., reducing the number of messages sent on links connecting multiple data centers.
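
Mapping a variable-length decision to a constant-size message can be sketched with Python's standard `hashlib` module. SHA-256 is an assumption made for illustration; the paper does not name a specific hash function, and the helper name `decision_digest` is hypothetical.

```python
import hashlib


def decision_digest(value: bytes) -> bytes:
    """Map an arbitrary-length decision to a fixed 32-byte digest (SHA-256 assumed)."""
    return hashlib.sha256(value).digest()


# A decision whose size grows with the number of nodes, e.g. a schedule.
schedule_a = b",".join(str(i).encode() for i in range(1000))
schedule_b = b",".join(str(i).encode() for i in reversed(range(1000)))
```

Nodes would then compare and forward `decision_digest(d)` instead of `d`, so the per-round message stays at 32 bytes regardless of how large the decided value is.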

3.2.4 Certifying Validity

1: procedure CertifyValidity(, )  ▷ is the output (decision) of the local consensus implementation.
2:     ▷ is the set of proposals received by the node.
3:     ▷ msgs is the set of messages received from the node’s children.
5:     for  do
7:     end for
8:     for  do
10:     end for
11:     if IsRoot( ) then  ▷ The complete certificate is only visible to the root.
12:         if  then
13:             SignalError( )
14:         end if
15:     else
16:         Send(c)  ▷ Send the certificate for the subtree rooted at this node to its parent.
17:     end if
18: end procedure
Algorithm 3 Protocol for certifying validity.

We present our protocol for certifying validity in Algorithm 3. Certifying validity requires that we find at least one node where the value decided by the consensus algorithm was proposed (§3.1). Our procedure for certifying validity requires that one also check agreement using the protocol in §3.2.3, however these protocols can be run concurrently. For generality, we also assume that a single node may receive an arbitrary number of proposals.

Given these assumptions our algorithm starts by first checking if local information is sufficient to certify validity, i.e., did the current node receive a proposal with the same value as the one that was decided (lines 5-7). Next, we use logical disjunction (i.e., the or function) to combine the boolean value of this local certificate with certificates received from the node’s children (lines 8-10). Each node then forwards this combined certificate to its parent. Note that the combined certificate can be used to check whether any node in the subtree rooted at the current node can certify validity. As a result the root’s certificate signifies whether any node in the entire network can certify validity, and the root therefore signals an error (line 11) in case its computed certificate is false.

Since no node except for the root signals errors in this protocol, completeness is preserved as long as the root’s computed certificate is correct. Soundness is similarly a consequence of the root’s computed certificate being correct. We observe that, starting from the leaves of the tree, a local certificate is true if and only if a node has a proposal that proves validity. Furthermore, for any non-leaf node $v$ the combined certificate is true if and only if $v$ has seen a proposal proving validity or there is a path from some node $u$ to $v$, where $u$ has seen such a proposal. As a result, for the computed certificate to be true at the root, there must be a node that is contained in the spanning tree (thus ensuring there is a path from the node to the root) which has a proposal proving validity, and the certificate must be false otherwise. This ensures soundness and completeness.
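
The disjunction-based aggregation can be simulated as follows. This is a sketch under assumptions: the function names and the dictionary representation of the tree, decisions and proposal sets are hypothetical, and the recursion stands in for certificates sent from children to parents.

```python
from typing import Dict, Hashable, List, Set


def certify_validity(decision: Hashable, proposals: Set[Hashable],
                     child_certs: List[bool]) -> bool:
    """Compute the certificate for the subtree rooted at this node."""
    cert = decision in proposals        # local certificate
    for child in child_certs:           # combine with the children's certificates
        cert = cert or child
    return cert


def run_validity_round(tree: Dict[int, List[int]],
                       decisions: Dict[int, Hashable],
                       proposals: Dict[int, Set[Hashable]],
                       root: int) -> bool:
    """Simulate one round; return True when the root signals an error."""
    def visit(node: int) -> bool:
        certs = [visit(c) for c in tree.get(node, [])]
        return certify_validity(decisions[node], proposals[node], certs)

    return not visit(root)


tree = {0: [1, 2]}
decisions = {0: "a", 1: "a", 2: "a"}
```

If only node 2 received the proposal `"a"`, its certificate is true and propagates to the root, so no error is signalled; if no node received `"a"`, the root's certificate is false and it signals an error.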

The efficiency of this protocol is similar to the agreement protocol in §3.2.3, except that all messages carry a single boolean value.

3.3 Handling Failures in Fail Stop Environments

Thus far in our exposition we have no considered cases where nodes might fail. We designed our current protocol to deal with fail-stop errors. We observe that a failed node can disrupt the structure of the spanning tree our protocols rely on. As a result when a node fails we need to recompute a new spanning tree, a task for which we can use existing protocols including the well-known spanning tree protocol [28]. Once a new spanning has been computed, we rely on the certification protocol in §3.2.1 to check structural properties of the spanning tree. We assume node IDs do not change as a result of failures, and thus do not execute CertifyUniqueID3.2.2).
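
Since the spanning tree protocol is supplied by the user, any concrete recomputation step is an assumption. As one illustration, the sketch below rebuilds a tree by breadth-first search over the links between live nodes; it is a centralized stand-in, not the spanning tree protocol of [28], and the function name, the `links` adjacency map and the requirement that the chosen root is alive are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def recompute_tree(links: Dict[int, List[int]], failed: Set[int], root: int
                   ) -> Dict[int, Tuple[Optional[int], int]]:
    """Map each live node reachable from `root` to (parent, distance from root)."""
    tree: Dict[int, Tuple[Optional[int], int]] = {root: (None, 0)}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in links.get(node, []):
            if nbr in failed or nbr in tree:
                continue                      # skip failed or already placed nodes
            tree[nbr] = (node, tree[node][1] + 1)
            queue.append(nbr)
    return tree


# Four nodes in a cycle 1-2-4-3-1; node 2 has failed.
links = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
```

With node 2 failed, node 4 is re-attached under node 3 at distance 2. The resulting parent and distance values are exactly the inputs that the structural certification in §3.2.1 would then re-check.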

Once a new spanning tree has been computed, we can now again use the protocols described in §3.2.3 and §3.2.4 to check agreement and validity. However, in this case failures might result in cases where our check for agreement is not sound (i.e., we do not detect a case where agreement is violated) or in cases where our check for validity is not complete (i.e., we return an error even when a valid proposal exists). This is because we do not have visibility into the state at failed nodes, so in cases where a node communicates an incorrect decision to an external entity and then fails, our agreement check will not in fact observe a violation and hence will not signal an error. Similar, if the failing node is the only node in the system to have seen a proposal that validates the selected value, then our validation protocol cannot observe this proof and hence will generate an error where non exists. To address this problem, we modify our definitions of soundness and completeness to only consider data that is visible to certification agents during a certification round. Specifically, we now define a certification protocol which evaluates a predicate to be sound if at least one node signals an error on any certification round where predicate does not hold for inputs from non-faulty nodes. We similarly define the certification protocol to be complete if no node signals an error in any round where predicate holds for inputs from non-faulty nodes.

Finally, we conjecture that this gap in soundness and completeness is fundamental: no certification protocol can ensure soundness and completeness while considering the data at failed nodes without assuming knowledge of the distributed system implementation.

3.4 Putting it All Together

1:procedure CertifyConsensus((, ))
2:        Receive messages from children
3:       CertifyAgreement(, )
4:        Receive messages from children
5:       CertifyValidity((, ))
6:end procedure
7:procedure DetectFailure( )
8:       RecomputeSpanningTree( )
9:       CertifySpanningTreeStructure( ) ▷ Check structural properties of the spanning tree.
10:end procedure
11:procedure Initialize( )
12:       RecomputeSpanningTree( )
13:       CertifySpanningTreeStructure( )
14:       CertifyUniqueID( )
15:end procedure
Algorithm 4 Overall protocol for certifying consensus.

Finally, we summarize the protocol executed by certification agents for consensus in Algorithm 4. Certification agents run the initialization function (Initialize, line 11) when the system is started, whenever node IDs change, or whenever a new node is added to the system; they respond to failures by recomputing the spanning tree and checking the spanning tree structure (DetectFailure, line 7); and they respond to certification requests from the implementation by certifying agreement and validity (CertifyConsensus, line 1). We note that while, for ease of exposition, we separate checking for agreement and validity into separate phases in Algorithm 4, they are trivially combined into one phase where each node combines messages from both protocols into a single tuple.
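The event-driven structure of Algorithm 4 can be sketched as follows. This is only an illustration of the dispatch logic: each sub-protocol is replaced by a placeholder that records its name, and the event names (`start`, `ids_changed`, `node_added`, `failure`, `certify`), the `request` argument, and the class itself are assumptions, not part of the protocol description.

```python
class CertificationAgent:
    """Dispatches system events to the three procedures of Algorithm 4."""

    def __init__(self):
        self.trace = []  # names of the sub-protocols run, in order

    def _run(self, name):
        # Placeholder: a real agent would execute the named protocol here.
        self.trace.append(name)

    def initialize(self):
        self._run("RecomputeSpanningTree")
        self._run("CertifySpanningTreeStructure")
        self._run("CertifyUniqueID")

    def detect_failure(self):
        # Node IDs are assumed not to change on failure, so no CertifyUniqueID.
        self._run("RecomputeSpanningTree")
        self._run("CertifySpanningTreeStructure")

    def certify_consensus(self, request):
        # `request` is a hypothetical stand-in for the implementation's inputs.
        self._run("CertifyAgreement")
        self._run("CertifyValidity")

    def handle(self, event, request=None):
        if event in ("start", "ids_changed", "node_added"):
            self.initialize()
        elif event == "failure":
            self.detect_failure()
        elif event == "certify":
            self.certify_consensus(request)
        else:
            raise ValueError(f"unknown event: {event}")

agent = CertificationAgent()
for event in ("start", "failure", "certify"):
    agent.handle(event)
```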

4 Certificates for Other Protocols

In the previous section we presented a certification protocol for consensus. We now turn our attention to generalizing this certification framework so that it can be used to certify additional distributed protocols. Ideally, such a generalization would provide a procedure that takes programmer-supplied predicates as input and then generates an efficient certification protocol for checking these predicates. We can then rely on existing work (such as the code generation tools described in §1) to generate code for certification agents.

Thus, an important question in generalizing the certification framework in this way is to determine the types of predicates for which we can automatically generate efficient protocols. This is an open question that we are currently working on, but the construction in §3 provides some early hints about what might be possible. In particular, the protocol for checking validity in §3.2.4 indicates that boolean formulas whose inputs are distributed across the nodes of a distributed system can be efficiently evaluated; the certificate for spanning tree structure in §3.2.1 shows that the same is true of arithmetic circuits, including ones that either count the number of elements or add up values at different nodes; and the protocol for checking agreement in §3.2.3 shows that we can also efficiently check equality for a fixed number of values.
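One way to see why these three classes are efficient to evaluate is that each can be computed by aggregating values up a spanning tree with a constant-size message per edge. The sketch below simulates such an aggregation centrally; it is not the protocols of §3.2, and the helper `convergecast`, the example tree, and the per-node values are assumptions made for illustration.

```python
import operator

def convergecast(children, local, node, combine):
    """Combine a node's local value with the aggregated values of its subtree."""
    result = local[node]
    for child in children.get(node, []):
        result = combine(result, convergecast(children, local, child, combine))
    return result

def combine_equal(left, right):
    """Carry one representative value plus a flag saying all values seen so far match."""
    (left_value, left_ok), (right_value, right_ok) = left, right
    return (left_value, left_ok and right_ok and left_value == right_value)

# Hypothetical spanning tree rooted at "r": r -> a, b and a -> c.
children = {"r": ["a", "b"], "a": ["c"]}
nodes = ["r", "a", "b", "c"]

# Arithmetic: count nodes and add up per-node values.
node_count = convergecast(children, {n: 1 for n in nodes}, "r", operator.add)
total = convergecast(children, {"r": 3, "a": 1, "b": 4, "c": 2}, "r", operator.add)

# Boolean formula: does any node hold a matching proposal?
saw_proposal = {"r": False, "a": False, "b": False, "c": True}
any_saw = convergecast(children, saw_proposal, "r", operator.or_)

# Equality of a single decided value across nodes.
agree = convergecast(children, {n: ("x", True) for n in nodes}, "r", combine_equal)
decided = {"r": ("x", True), "a": ("x", True), "b": ("x", True), "c": ("y", True)}
disagree = convergecast(children, decided, "r", combine_equal)
```

In each case a node forwards a single number, boolean, or (value, flag) pair to its parent, so the message size does not grow with the number of nodes in the subtree.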

More generally, we also see some hope in the recent results from Naor et al. [21], which describe a compiler that can be used to convert any centralized verifier that can run in time on a RAM machine into a 3-round distributed protocol that can be executed in the interactive proof setting. It is unclear whether this result can be translated to our scenario in general, since the construction depends on the prover executing the verifier algorithm while recording a trace, after which the verifiers use a multi-set equality protocol to check the correctness of this trace. In our case we do not assume a distinguished prover, and assuming otherwise would require the prover to have information about all nodes in the implementation. Despite these challenges, we continue to investigate how to apply these results to our setting. Finally, it is also unclear whether predicates for all safety properties can be expressed as algorithms, e.g., when certifying conflict-serializability [27] in a distributed database, one may need a predicate over the set of transactions, requiring the use of an algorithm with higher asymptotic complexity.

5 Discussion

Next, we briefly discuss a few additional questions and concerns about certification, before presenting a survey of related work and concluding.

Using Certification Protocols. Thus far in the paper we have not addressed the question of how implementors and operators might use certification protocols. In particular, since any arbitrary node can output an error message on a predicate violation, it is unclear how one might use this message. For example, using such a mechanism to revert erroneous processing would require any error message to be broadcast to all nodes, which might itself require executing an atomic broadcast protocol. Our original intention with certification protocols was to provide a mechanism that could warn developers and operators of erroneous runs, allowing them to then use offline processes to analyze and fix the error. The protocols described thus far suffice for this purpose. In future work we hope to develop additional mechanisms that implementations might use to react automatically to certification errors.

Choice of failure models. Our choice of the fail-stop failure model might appear strange in the context of certification, where we are trying to check whether an implementation is bug free, since it requires that we place some trust in the implementation so that inputs to the certification agent are correct. We chose the fail-stop model for simplicity, and also because we assume that implementations are not malicious, merely buggy. In this non-malicious setting we assume that implementations do not deliberately send incorrect inputs to the certification agent. Extending our framework to a more realistic Byzantine setting requires us to consider clients, who consume the output from the original distributed system, and is left to future work.

6 Related Work

We have already discussed some of the related work in earlier sections of the paper, particularly the work on distributing verifiers in interactive proof systems [21, 16], which our protocols build on.

The work that is closest to the problem we address is the work on proof labeling schemes initiated by Korman et al. [17]. The main goal of this proof labeling work was to certify that a network configuration met some boolean predicate, e.g., certifying that nodes were arranged in a spanning tree or in a clique, or that nodes had been assigned colors correctly to prove the graph was 3-colorable. Much of the focus of this work has been on determining bounds on the label size (which corresponds to the size of the input from the implementation in our model). While our work builds on some of the constructions in the proof labeling literature, we differ in two significant ways: first, the class of predicates we consider is different, since our focus is on allowing distributed systems to repeatedly certify their behavior and not on checking topological properties of the network; second, in our setting, where certification is run repeatedly, failure handling is an important concern. To the best of our knowledge, prior work on proof labeling schemes has not considered failures.

There are also close analogies between our certification method and a proposal from Varghese and Lynch [32] for developing self-stabilizing algorithms [5]. In this proposal, nodes perform local checks of correctness and execute a protocol to fix any detected violations. Certification serves a similar purpose to the local checks used in this work. However, while our certification protocol could be used in a similar setting, this is not our primary goal.

There is also a large body of work focused on runtime monitoring and verification of distributed systems (see [11] for a survey). Similar to our work, these efforts aim to detect and signal cases where the behavior of a distributed system deviates from some specified behavior. However, unlike our work, these efforts do so by collecting consistent logs from across the system and analyzing these logs. Consistent log collection is enabled by including vector clocks or other ordering information in messages, an approach also adopted by logging systems such as Dapper [31] and X-Trace [10]. Compared to certification protocols, these approaches increase the size of each message, adding significant overheads.

7 Conclusion

There is significant interest in ensuring correctness for distributed systems, and as a result there has been significant recent work addressing this problem. Much of this work has focused on mechanisms for building distributed systems that are correct-by-construction. While the results in this area have been promising, this approach to correctness requires abandoning existing implementations and implicitly trusting the tools that generate these correct-by-construction implementations. In reality both these requirements are hard to meet: existing implementations usually provide richer functionality than new correct-by-construction implementations, and have often been extensively tested to meet correctness and performance requirements in deployments. In this work we suggest a different approach, where runtime certification is used to efficiently check the correctness of distributed system implementations. We have shown that such certification protocols can be used to check agreement and validity for consensus protocols, and we are now working on determining which safety properties are amenable to being checked using certification protocols.

8 Acknowledgements

We thank Mooly Sagiv, Scott Shenker, Eylon Yogev and Moni Naor for several early discussions on this work, and help with formulating these ideas. We also thank Changgeng Zhao and Mike Walfish for comments and suggestions on the current write-up. This work was funded in part by an early career faculty award from VMware Research.


References

  • [1] Jonathan Behrens, Ken Birman, Sagar Jha, Matthew Milano, Theo Gkountouvas, Weijia Song, and Robbert van Renesse. Derecho: Group Communication at the Speed of Light. Technical report, Cornell University, 2016.
  • [2] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, 2006.
  • [3] Miguel Castro. Practical Byzantine Fault Tolerance. In OSDI, 1999.
  • [4] The Coq Development Team. The Coq Proof Assistant Reference Manual, version 8.7, October 2017. URL: http://coq.inria.fr.
  • [5] Edsger W. Dijkstra. Self-stabilizing Systems in Spite of Distributed Control. Commun. ACM, 17:643–644, 1974.
  • [6] Cynthia Dwork, Nancy A. Lynch, and Larry J. Stockmeyer. Consensus in the presence of partial synchrony. J. ACM, 35:288–323, 1988.
  • [7] Amos Fiat and Adi Shamir. How to prove yourself: practical solutions to identification and signature problems. In CRYPTO 1987, 1986.
  • [8] Michael J. Fischer, Nancy A. Lynch, and Mike Paterson. Impossibility of Distributed Consensus with One Faulty Process. In PODS, 1983.
  • [9] Pedro Fonseca, Kaiyuan Zhang, Xi Wang, and Arvind Krishnamurthy. An Empirical Study on the Correctness of Formally Verified Distributed Systems. In EuroSys, 2017.
  • [10] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A Pervasive Network Tracing Framework. In NSDI, 2007.
  • [11] Adrian Francalanza, Jorge A. Pérez, and C. Sánchez. Runtime Verification for Decentralized and Distributed Systems. In Ezio Bartocci and Yliès Falcone, editors, Lectures on Runtime Verification. Springer, 2017.
  • [12] Eli Gafni and Leslie Lamport. Disk Paxos. Distributed Computing, 16:1–20, 2003.
  • [13] Cary G. Gray and David R. Cheriton. Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency. In SOSP, 1989.
  • [14] Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath T. V. Setty, and Brian Zill. IronFleet: proving practical distributed systems correct. In SOSP, 2015.
  • [15] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. ZooKeeper: wait-free coordination for internet-scale systems. In ATC, 2010.
  • [16] Gillat Kol, Rotem Oshman, and Raghuvansh R. Saxena. Interactive Distributed Proofs. In PODC, 2018.
  • [17] Amos Korman, Shay Kutten, and David Peleg. Proof labeling schemes. Distributed Computing, 22:215–233, 2010.
  • [18] Leslie Lamport. Fast Paxos. Distributed Computing, 19:79–103, 2006.
  • [19] Butler W. Lampson. How to Build a Highly Available System Using Consensus. In WDAG, 1996.
  • [20] K. Rustan M. Leino. Dafny: An Automatic Program Verifier for Functional Correctness. In LPAR, 2010.
  • [21] Moni Naor, Merav Parter, and Eylon Yogev. The Power of Distributed Verifiers in Interactive Proofs. Electronic Colloquium on Computational Complexity (ECCC), 25:213, 2018.
  • [22] Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff. How Amazon Web Services uses formal methods. Commun. ACM, 58:66–73, 2015.
  • [23] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer, 2002.
  • [24] Diego Ongaro and John K. Ousterhout. In search of an understandable consensus algorithm. In USENIX Annual Technical Conference, 2014.
  • [25] Oded Padon, Giuliano Losa, Shmuel Sagiv, and Sharon Shoham. Paxos made EPR: decidable reasoning about distributed protocols. PACMPL, 1:108:1–108:31, 2017.
  • [26] Oded Padon, Kenneth L. McMillan, Aurojit Panda, Shmuel Sagiv, and Sharon Shoham. Ivy: safety verification by interactive generalization. In PLDI, 2016.
  • [27] Christos H. Papadimitriou. The serializability of concurrent database updates. J. ACM, 26:631–653, 1979.
  • [28] Radia J. Perlman. An algorithm for distributed computation of a spanning tree in an extended LAN. In SIGCOMM, 1985.
  • [29] Dan R. K. Ports, Jialin Li, Vincent S Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks. In NSDI, 2015.
  • [30] Ilya Sergey, James R. Wilcox, and Zachary Tatlock. Programming and proving with distributed protocols. PACMPL, 2:28:1–28:30, 2017.
  • [31] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical report, Google, Inc., 2010. URL: https://research.google.com/archive/papers/dapper-2010-1.pdf.
  • [32] George Varghese and Nancy A. Lynch. Self-stabilization by local checking and correction. In FOCS 1991, 1991.
  • [33] Yuan Yu, Panagiotis Manolios, and Leslie Lamport. Model Checking TLA+ Specifications. In CHARME, 1999.