A Generalised Solution to Distributed Consensus

02/18/2019 ∙ by Heidi Howard, et al.

Distributed consensus, the ability to reach agreement in the face of failures and asynchrony, is a fundamental primitive for constructing reliable distributed systems from unreliable components. The Paxos algorithm is synonymous with distributed consensus, yet it performs poorly in practice and is famously difficult to understand. In this paper, we re-examine the foundations of distributed consensus. We derive an abstract solution to consensus, which utilises immutable state for intuitive reasoning about safety. We prove that our abstract solution generalises over Paxos as well as the Fast Paxos and Flexible Paxos algorithms. The surprising result of this analysis is a substantial weakening to the quorum requirements of these widely studied algorithms.


1 Introduction

We depend upon distributed systems, yet the computers and networks that make up these systems are asynchronous and unreliable. The longstanding problem of distributed consensus formalises how to reliably reach agreement in such systems. When solved, we become able to construct strongly consistent distributed systems from unreliable components [13, 21, 4, 17]. Lamport’s Paxos algorithm [14] is widely deployed in production to solve distributed consensus [5, 6], and experience with it has led to extensive research to improve its performance and our understanding but, despite its popularity, both remain problematic.

Paxos performs poorly in practice because its use of majorities means that each decision requires a round trip to many participants, thus placing substantial load on each participant and the network connecting them. As a result, systems are typically limited in practice to just three or five participants. Furthermore, Paxos is usually implemented in the form of Multi-Paxos, which establishes one participant as the master, introducing a performance bottleneck and increasing latency as all decisions are forwarded via the master. Given these limitations, many production systems often opt to sacrifice strong consistency guarantees in favour of performance and high availability [7, 3, 18]. Whilst compromise is inevitable in practical distributed systems [10], Paxos offers just one point in the space of possible trade-offs. In response, this paper aims to improve performance by offering a generalised solution allowing engineers the flexibility to choose their own trade-offs according to the needs of their particular application and deployment environment.

Paxos is also notoriously difficult to understand, leading to much follow-up work explaining the algorithm in simpler terms [20, 15, 19, 23] and filling the gaps in the original description that must be addressed when constructing practical systems [6, 2]. In recent years, immutability has been increasingly widely utilised in distributed systems to tame complexity [11]. Examples such as append-only log stores [1, 8] and CRDTs [22] have inspired us to apply immutability to the problem of consensus.

This paper re-examines the problem of distributed consensus with the aim of improving performance and understanding. We proceed as follows. Once we have defined the problem of consensus (§2), we propose a generalised solution to consensus that uses only immutable state to enable more intuitive reasoning about correctness (§3). We subsequently prove that both Paxos and Fast Paxos [16] are instances of our generalised consensus algorithm and thus show that both algorithms are conservative in their approach, particularly in their rules for quorum intersection and quorum agreement (§4 & §5). Finally, we conclude by illustrating the power of our abstraction by outlining three different instances of our generalised consensus algorithm which provide alternative performance trade-offs compared to Paxos (§6).

2 Problem definition

The classic formulation of consensus considers how to decide upon a single value in a distributed system. This seemingly simple problem is made non-trivial by the weak assumptions made about the underlying system: we assume only that the algorithm is correctly executed (i.e., the non-Byzantine model). We do not assume that participants are either reliable or synchronous. Participants may operate at arbitrary speeds and messages may be arbitrarily delayed.

We consider systems comprised of two types of participant: servers, which store the value, and clients, which read/write the value. Clients take as input a value to be written and produce as output the value decided by the system. Messages may only be exchanged between clients and servers and we assume that the set of participants, servers and clients, is fixed and known to the clients.

An algorithm solves consensus if it satisfies the following three requirements:

  • Non-triviality. All output values must have been the input value of a client.

  • Agreement. All clients that output a value must output the same value.

  • Progress. All clients must eventually output a value if the system is reliable and synchronous for a sufficient period.

The progress requirement rules out algorithms that could reach deadlock. As termination cannot be guaranteed in an asynchronous system where failures may occur [9], algorithms need only guarantee termination assuming liveness.

If we have only one server, the solution is straightforward. The server has a single persistent write-once register, $R_0$, to store the decided value. Clients send requests to the server with their input value. If $R_0$ is unwritten, the value received is written to $R_0$ and is returned to the client. If $R_0$ is already written, then the value in $R_0$ is read and returned to the client. The client then outputs the returned value. This algorithm achieves consensus but requires the server to be available for clients to terminate. Overcoming this limitation requires deploying more than one server, so we now consider how to generalise to multiple servers.
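The single-server case can be sketched in a few lines of Python. This is only an illustration: the class and method names (`WriteOnceRegister`, `SingleServer`, `handle_request`) are hypothetical, and the register is held in memory, whereas the text requires it to be persistent.

```python
from typing import Optional


class WriteOnceRegister:
    """In-memory stand-in for the persistent write-once register R0."""

    def __init__(self) -> None:
        self._written = False
        self._value: Optional[str] = None

    def write_once(self, value: str) -> str:
        # Write only if the register is still unwritten, then return
        # whatever the register now holds.
        if not self._written:
            self._value = value
            self._written = True
        assert self._value is not None
        return self._value


class SingleServer:
    def __init__(self) -> None:
        self.r0 = WriteOnceRegister()

    def handle_request(self, input_value: str) -> str:
        # The reply is the decided value, which the client then outputs.
        return self.r0.write_once(input_value)


server = SingleServer()
first_output = server.handle_request("A")   # register unwritten: A is written
second_output = server.handle_request("B")  # register written: A is returned
```

Both clients output the value written first, which is the agreement property; non-triviality holds because the register only ever stores a client input.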

3 Generalised solution

Consider a set of servers, $\{S_0, S_1, \ldots, S_n\}$, where each has an infinite series of write-once, persistent registers, $\{R_0, R_1, \ldots\}$. Clients read and write registers on servers and, at any time, each register is in one of three states:

  • unwritten, the starting state for all registers; or

  • contains a value, e.g., A, B, C; or

  • contains nil, a special value denoted as $\bot$.

Figure 1: Sample configurations for systems of three or four servers. Each of the panels (a) to (d) lists the quorums configured for each register set.
S0 S1 S2
R0 A B
R1
R2 B A A
A decided by R2
S0 S1 S2
R0 A A A
R1 A A
A decided by R0 & R1
S0 S1 S2
R0 A A
R1 A C
R2 C B
No decisions yet
Figure 2: Sample state tables for a system using the configuration in Figure 1(a).

A quorum, $Q$, is a (non-empty) subset of servers, such that if all servers in $Q$ have the same (non-nil) value $v$ in the same register then $v$ is said to be decided. A register set, $i$, is the set comprised of the register $R_i$ from each server. Each register set $i$ is configured with a set of quorums, $\mathcal{Q}_i$, and four example configurations are given in Figure 1. The state of all registers can be represented in a table, known as a state table, where each column represents the state of one server and each row represents a register set. By combining a configuration with a state table, we can determine whether any decision(s) have been reached, as shown in Figure 2.
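As a minimal sketch of how a configuration and a state table combine, the following Python function reports which values are decided in each register set. The data layout (dictionaries keyed by register-set index and server name) and the example configuration of majority quorums are assumptions made for illustration; they are not the configurations of Figure 1.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Set

NIL = "nil"  # stands in for the special nil value

Config = Dict[int, List[FrozenSet[str]]]       # register set -> its quorums
State = Dict[int, Dict[str, Optional[str]]]    # register set -> server -> value


def decided_values(config: Config, state: State) -> Dict[int, Set[str]]:
    """Return, per register set, the values decided by at least one quorum."""
    result: Dict[int, Set[str]] = {}
    for r, quorums in config.items():
        row = state.get(r, {})
        for quorum in quorums:
            # A missing entry means the register is unwritten (None).
            values = {row.get(server) for server in quorum}
            if len(values) == 1:
                (v,) = values
                if v is not None and v != NIL:
                    result.setdefault(r, set()).add(v)
    return result


servers = ["S0", "S1", "S2"]
majorities = [frozenset(c) for c in combinations(servers, 2)]
config = {r: majorities for r in range(3)}      # hypothetical configuration
state = {
    0: {"S0": "A", "S1": "B"},
    2: {"S0": "B", "S1": "A", "S2": "A"},
}
decisions = decided_values(config, state)
```

Here only register set 2 reaches a decision, because the quorum {S1, S2} holds A in both of its registers.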

3.1 Correctness

Rule 1: Quorum agreement. A client may only output a (non-nil) value $v$ if it has read $v$ from a quorum of servers in the same register set.

Rule 2: New value. A client may only write a (non-nil) value $v$ provided that either $v$ is the client’s input value or that the client has read $v$ from a register.

Rule 3: Current decision. A client may only write a (non-nil) value $v$ to register $r$ on server $s$ provided that if $v$ is decided in register set $r$ by a quorum $Q \in \mathcal{Q}_r$ where $s \in Q$ then no value $v'$ where $v \neq v'$ can also be decided in register set $r$.

Rule 4: Previous decisions. A client may only write a (non-nil) value $v$ to register $r$ provided no value $v'$ where $v \neq v'$ can be decided by the quorums in register sets $0$ to $r-1$.

Figure 3: The four rules for correctness.

Figure 3 describes a generalised solution to consensus by giving four rules governing how clients interact with registers to ensure that the non-triviality and agreement requirements for consensus (§2) are satisfied.

Rule 1 (quorum agreement) ensures that clients only output values that have been decided. Rule 2 (new value) ensures that only client input values can be written to registers, thus only client input values can be decided and output by clients. Rules 3 and 4 ensure that no two quorums can decide upon different values. Rule 3 (current decision) ensures that all decisions made by a register set will be for the same value whilst Rule 4 (previous decisions) ensures that all decisions made by different register sets are for the same value.

3.2 Implementing the correctness rules

Rules 1 and 2 are easy to implement, but Rules 3 and 4 require more careful treatment.

Register sets    Client
R0, R3, ...      C0
R1, R4, ...      C1
R2, R5, ...      C2
Figure 4: Sample round robin allocation of register sets to clients.

Rule 3 (current decision).

The simplest implementation of Rule 3 is to permit only configurations with one quorum per register set, as in Figure 1(b). We generalise this to intersecting quorums configurations by permitting multiple quorums per register set, provided that all quorums for a given register set intersect, as in Figure 1(d). The requirement for intersection ensures that if multiple quorums in a register set decide a value then they must decide the same value as they must share a common register.

Alternatively, we can support disjoint quorums if we require that all values written to a given register set are the same. This can be achieved by assigning register sets to clients and requiring that clients write only to their own register sets, with at most one value. In practice, this could be implemented by using an allocation such as that in Figure 4 and by requiring clients to keep a persistent record of which register sets they have written to. We refer to these as client restricted configurations.

Both techniques, intersecting quorums configurations and client restricted configurations, can be combined on a per-register-set basis.
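Both techniques are simple to check mechanically. The sketch below assumes quorums are represented as frozensets of server names and that the round robin allocation maps register set r to client r mod n; the names `quorums_intersect`, `owner` and `ClientRestrictedWriter` are hypothetical, and the record of used register sets is kept in memory although the text calls for it to be persistent.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def quorums_intersect(quorums: List[FrozenSet[str]]) -> bool:
    """True if every pair of quorums shares at least one server."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


def owner(register_set: int, num_clients: int) -> int:
    """Round robin allocation of register sets to client indices."""
    return register_set % num_clients


class ClientRestrictedWriter:
    def __init__(self, client_id: int, num_clients: int) -> None:
        self.client_id = client_id
        self.num_clients = num_clients
        self.used: Set[int] = set()  # would be persistent in practice

    def may_write(self, register_set: int) -> bool:
        # Only the owner may write, and only once per register set.
        mine = owner(register_set, self.num_clients) == self.client_id
        return mine and register_set not in self.used

    def record_write(self, register_set: int) -> None:
        self.used.add(register_set)


majorities = [frozenset(c) for c in combinations(["S0", "S1", "S2"], 2)]
disjoint = [frozenset({"S0", "S1"}), frozenset({"S2", "S3"})]

writer = ClientRestrictedWriter(client_id=1, num_clients=3)
allowed_before = writer.may_write(1)
writer.record_write(1)
allowed_after = writer.may_write(1)
```

A register set with disjoint quorums fails `quorums_intersect` and therefore has to be client restricted; a register set with a single quorum passes trivially.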

Rule 4 (previous decisions).

Rule 4 requires clients to ensure that, before writing a (non-nil) value, previous register sets cannot decide a different value. This is trivially satisfied for register set $0$; however, more work is required by clients to satisfy this rule for subsequent register sets.

Assume each client maintains their own local copy of the state table. Initially, each client’s state table is empty as they have not yet learned anything regarding the state of the servers. A client can populate its state tables by reading registers and storing the results in its copy of the state table. Since the registers are persistent and write-once, if a register contains a value (nil or otherwise) then any reads will always remain valid. Each client’s state tables will therefore always contain a subset of the values from the state table.

From its local state table, each client can track whether decisions have been reached or could be reached by previous quorums. We refer to this as the decision table. At any given time, each quorum is in one of four decision states:

  • Any: Any value could be decided by this quorum.

  • Maybe $v$: If this quorum reaches a decision, then value $v$ will be decided.

  • Decided $v$: The value $v$ has been decided by this quorum; a final state.

  • None: This quorum will not decide a value; a final state.

The rules for updating the decision table are as follows:

  • Initially, the decision state of all quorums is Any.

  • If there is a quorum where all registers contain the same value $v$ then its decision state is Decided $v$.

  • When a client reads nil from register $r$ on server $s$ then for all quorums $Q \in \mathcal{Q}_r$ where $s \in Q$, the decision state Any/Maybe $v$ becomes None.

  • When a client reads a non-nil value $v$ from a client restricted register set $r$ then for all quorums over register sets $0$ to $r$, the decision state Any becomes Maybe $v$ and Maybe $v'$ where $v \neq v'$ becomes None.

  • When a client reads a non-nil value $v$ from a quorum intersecting register set $r$ on server $s$ then for all quorums $Q \in \mathcal{Q}_r$ where $s \in Q$ and for all quorums over register sets $0$ to $r-1$, the state Any becomes Maybe $v$ and Maybe $v'$ where $v \neq v'$ becomes None.

These rules utilise the knowledge that if a client reads a (non-nil) value $v$ from the register $r$ on server $s$, it learns that:

  • If $r$ is client restricted then all quorums in $\mathcal{Q}_r$ must decide $v$ if they reach a decision (Rule 3).

  • If any quorum of register sets $0$ to $r-1$ reaches a decision then value $v$ is decided (Rule 4).

A client may output value $v$ provided at least one quorum state is Decided $v$ (Rule 1).

A client $c$ may write a non-nil value $v$ to register set $r$ provided:

  1. $v$ is $c$’s input value or has been read from a register (Rule 2), and

  2. $r$ is either:

    • quorum intersecting, or

    • client restricted and has been allocated to $c$ but not yet used (Rule 3), and

  3. the decision state of each quorum from register sets $0$ to $r-1$ is None, Maybe $v$ or Decided $v$ (Rule 4).

Figure 5: Client decision table rules

Figure 5 describes how clients can use decision tables to implement the four rules for correctness.
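The update rules can be sketched as a small Python class. This is one possible reading of the rules rather than a definitive implementation: the state encoding (tuples such as `("Maybe", "B")`), the class name `DecisionTable`, and the example configuration with one quorum per register set over four servers are all assumptions. `may_write` checks only the third write condition (Rule 4); the first two conditions are left to the caller.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

NIL = "nil"        # stands in for the special nil value
ANY = ("Any",)
NONE = ("None",)

Quorum = FrozenSet[str]
Key = Tuple[int, Quorum]


class DecisionTable:
    def __init__(self, config: Dict[int, List[Quorum]], restricted: Set[int]) -> None:
        self.config = config
        self.restricted = restricted  # client restricted register sets
        self.reads: Dict[Tuple[int, str], str] = {}  # local state table
        self.table: Dict[Key, tuple] = {
            (r, q): ANY for r, quorums in config.items() for q in quorums
        }

    def _narrow(self, key: Key, v: str) -> None:
        state = self.table[key]
        if state == ANY:
            self.table[key] = ("Maybe", v)
        elif state[0] == "Maybe" and state[1] != v:
            self.table[key] = NONE
        # Decided and None are final states and are left untouched.

    def read(self, r: int, server: str, value: str) -> None:
        self.reads[(r, server)] = value
        for r2, q in list(self.table):
            if value == NIL:
                hit = r2 == r and server in q
                if hit and self.table[(r2, q)][0] in ("Any", "Maybe"):
                    self.table[(r2, q)] = NONE
            elif r2 < r or (r2 == r and (r in self.restricted or server in q)):
                self._narrow((r2, q), value)
        if value != NIL:
            for q in self.config[r]:
                if all(self.reads.get((r, s)) == value for s in q):
                    self.table[(r, q)] = ("Decided", value)

    def may_write(self, r: int, v: str) -> bool:
        # Rule 4: every earlier quorum is None, Maybe v or Decided v.
        return all(
            state == NONE or state[1:] == (v,)
            for (r2, _), state in self.table.items() if r2 < r
        )


Q0 = frozenset({"S0", "S1"})   # hypothetical single quorum for register set 0
Q1 = frozenset({"S2", "S3"})   # hypothetical single quorum for register set 1
t = DecisionTable({0: [Q0], 1: [Q1]}, restricted=set())
can_write_r1_initially = t.may_write(1, "B")
t.read(1, "S2", "B")
after_first_read = dict(t.table)
t.read(0, "S0", "A")
t.read(1, "S3", "B")
```

Reading B from register set 1 moves both quorums to Maybe B, a later read of A from register set 0 moves its quorum to None, and a second B in register set 1 completes the quorum and yields Decided B.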

3.3 Examples

S0 S1 S2 S3
R0
Register Quorum Decision state
Any
(a) Initial state.
S0 S1 S2 S3
R0
R1 B
Register Quorum Decision state
Maybe B
Maybe B
(b) State after reading B from R1 on one server.
S0 S1 S2 S3
R0 A
R1 B
Register Quorum Decision state
None
Maybe B
(c) State after reading A from R0 on one server.
S0 S1 S2 S3
R0 A
R1 B B
Register Quorum Decision state
None
Decided B
(d) State after reading B from R1 on a second server.
Figure 6: Sample client state tables (left) and decision tables (right).

This process is illustrated by Figures 6 and 7, which demonstrate how a client’s state is updated as they read registers. Figure 6 shows the state of a client in a system of 4 servers using the intersecting quorum configuration from Figure 1(b). Figure 6(a) shows the client’s initial state. The client’s state table is empty thus the status of all quorums in the decision table is Any. At this time, the client may only write non-nil values to due to condition (iii) in Figure 5. Next, in Figure 6(b), the status of quorum over register set is updated to Maybe B since, depending on the state of register on , either this quorum will not reach a decision or it decides value B. Since the client that wrote B into on must have followed Rule 4, the quorum in must decide B if it reaches a decision thus its status is updated to Maybe B. The client can now write value B to or . Subsequently in Figure 6(c), the client could now safely write its input value to but there would be no use in doing so. Finally in Figure 6(d), the client learns that B is decided and thus can output B.

S0 S1 S2 S3
R0
Register Quorum Decision state
Any
Any
(a) Initial state.
S0 S1 S2 S3
R0
Register Quorum Decision state
None
Any
(b) State after reading nil from R0 on one server.
S0 S1 S2 S3
R0
R1 B
Register Quorum Decision state
None
None
Maybe B
Maybe B
(c) State after reading nil from R0 and B from R1 on one server.
S0 S1 S2 S3
R0
R1 B B
Register Quorum Decision state
None
None
Maybe B
Decided B
(d) State after reading B from R1 on a second server.
Figure 7: Sample client state tables (left) and decision tables (right).

Figure 7 shows the state of a client in a system using the configuration in Figure 1(c) and the client allocation from Figure 4. Figure 7(a) shows the initial state of the client . At this time, the client can only write non-nil values to . Later in Figure 7(c), the client has updated the status of both quorums in to Maybe B after reading B from . This is because register set is client restricted to value B.

4 Generalisation of Paxos

Phase 1

  • A client $c$ chooses a register set $r$ that it has been assigned but not yet used and sends P1a($r$) to all servers.

  • Upon receiving P1a($r$), each server checks if register $r$ is unwritten. If so, any unwritten registers up to $r-1$ (inclusive) are set to nil. The server replies with P1b($r$, $V$) where $V$ is a set of all written non-nil registers.

  • If $c$ receives P1b messages from a majority of servers then $c$ chooses the value $v$ from the greatest (non-nil) register. If no values were returned with P1b messages then $c$ chooses its input value. $c$ then proceeds to phase two. Otherwise, $c$ times out and restarts phase one.

Phase 2

  • $c$ sends P2a($r$, $v$) to all servers where $v$ is the value chosen at the end of phase one.

  • Upon receiving P2a($r$, $v$), each server checks if register $r$ is unwritten. If so, any unwritten registers up to $r-1$ (inclusive) are set to nil and register $r$ is set to $v$. The server replies with P2b($r$, $v$).

  • If $c$ receives P2b messages from a majority of servers then $c$ learns that the value $v$ has been decided and can output $v$. Otherwise, $c$ times out and restarts phase one.

Figure 8: The Paxos algorithm using write-once registers.

P1a(0)

P1b(0,{})

P1b(0,{})

P2a(0,A)

P2b(0,A)

P2b(0,A)

P1a(1)

P1b(1,{R0:A})

P1b(1,{R0:A})

P2a(1,A)

P2b(1,A)

P2b(1,A)
Figure 9: Sample message exchange for Paxos

The (unoptimised) Paxos algorithm is described in Figure 8 using only write-once registers. Figure 9 gives an example of the message exchange as two clients execute Paxos with three servers.
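The two phases can be exercised with a small in-memory simulation, sketched below under several assumptions: messages are plain method calls, every server is reachable (so timeouts never occur), and a server only acknowledges a P2a if register r ends up holding the proposed value. That last point is an interpretation of the figure, which does not say what happens when the register is already written. The names `Server` and `run_client` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

NIL = "nil"  # stands in for the special nil value


class Server:
    def __init__(self) -> None:
        self.registers: Dict[int, str] = {}  # a missing index is unwritten

    def _nil_below(self, r: int) -> None:
        # Set every unwritten register before r to nil.
        for i in range(r):
            self.registers.setdefault(i, NIL)

    def p1a(self, r: int) -> Tuple[int, Dict[int, str]]:
        if r not in self.registers:
            self._nil_below(r)
        written = {i: v for i, v in self.registers.items() if v != NIL}
        return r, written  # the P1b reply

    def p2a(self, r: int, v: str) -> Optional[Tuple[int, str]]:
        if r not in self.registers:
            self._nil_below(r)
            self.registers[r] = v
        # Assumption: acknowledge only if register r now holds v.
        return (r, v) if self.registers[r] == v else None


def run_client(servers: List[Server], r: int, input_value: str) -> Optional[str]:
    majority = len(servers) // 2 + 1
    p1b = [s.p1a(r) for s in servers]
    if len(p1b) < majority:
        return None  # would time out and restart phase one
    seen: Dict[int, str] = {}
    for _, written in p1b:
        seen.update(written)
    # Value from the greatest non-nil register, else the client's input.
    value = seen[max(seen)] if seen else input_value
    p2b = [ack for ack in (s.p2a(r, value) for s in servers) if ack is not None]
    return value if len(p2b) >= majority else None


servers = [Server() for _ in range(3)]
output_c0 = run_client(servers, 0, "A")  # first client uses register set 0
output_c1 = run_client(servers, 1, "B")  # second client uses register set 1
```

As in Figure 9, the second client reads A from register set 0 during phase one and therefore proposes A rather than its own input B.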

S0 S1 S2
R0
Register Quorum Decision state
Any
Any
Any
(a) Initial state, unchanged after receiving P1b(0,{}) from a server.
S0 S1 S2
R0 A A
Register Quorum Decision state
Decided A
Maybe A
Maybe A
(b) State after receiving P2b(0,A) from two servers.
Figure 10: Sample client state tables (left) and decision tables (right) for client $C_0$ during the execution in Figure 9.
S0 S1 S2
R0 A
Register Quorum Decision state
Maybe A
Maybe A
Maybe A
(a) State after receiving P1b(1,{R0:A}) from one server.
S0 S1 S2
R0 A A
Register Quorum Decision state
Decided A
Maybe A
Maybe A
(b) State after receiving P1b(1,{R0:A}) from a second server.
Figure 11: Sample client state tables (left) and decision tables (right) for client $C_1$ during the execution in Figure 9.

We observe that Paxos is a conservative instance of our generalised solution to consensus. The configuration used by Paxos is majorities for all register sets; such a configuration is given in Figure 1(d). Paxos also uses client restricted configurations for all register sets and a suitable client assignment is given in Figure 4. The purpose of phase one is to implement Rule 4 and the purpose of phase two is to implement Rule 1. Earlier (§3), we proposed client state and decision tables as a mechanism for clients to implement the rules for correctness. Upon receiving P1b($r$, $V$) where $V$ is the set of registers from a server, the client learns the contents of registers $0$ to $r-1$. This is because registers are always written to in-order on each server and register $r$ must be unwritten. Therefore the client’s state table and thus its decision table can be updated accordingly. This is demonstrated in Figure 10 for client $C_0$ and Figure 11 for client $C_1$.

4.1 Weakened quorum intersection requirements

The boolean function $I$ tests whether two or more quorum sets are intersecting and is defined as $I(\mathcal{Q}, \mathcal{Q}', \ldots) = \forall Q \in \mathcal{Q}, \forall Q' \in \mathcal{Q}', \ldots : Q \cap Q' \cap \cdots \neq \emptyset$.

Paxos utilises majorities as it requires all quorums, $\mathcal{Q}$, to intersect, regardless of the register set or phase of the algorithm. That is, in terms of $I$, $I(\mathcal{Q}, \mathcal{Q})$.

Instead, we differentiate between the quorums used for each register set and which phase of Paxos the quorum is used for. $\mathcal{Q}^p_r$ is the set of quorums for phase $p$ of register set $r$. We observe that quorum intersection is required only between the phase one quorum for register set $r$ and the phase two quorums of register sets 0 to $r-1$. This is the case because a client can always proceed to phase two after intersecting with all previous phase two quorums since the condition (iii) in Figure 5 will be satisfied. More formally,

$\forall r, r' \in \mathbb{N}_0 : r' < r \implies I(\mathcal{Q}^1_r, \mathcal{Q}^2_{r'})$. (*)

This result confirms the findings of Flexible Paxos [12]. This is illustrated in Figure 10(a) where the client was safe to proceed to phase two from startup since there is no intersection requirement for register set $0$.
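The weakened requirement can be checked mechanically for a finite prefix of register sets. The sketch below is an illustration under assumptions: quorums are frozensets of server names, `phase1[r]` and `phase2[r]` list the quorums of register set r, and the four-server example (phase one quorums of any three servers, phase two quorums of any two) is hypothetical. The function names are not from the paper.

```python
from itertools import combinations, product
from typing import FrozenSet, List

Quorums = List[FrozenSet[str]]


def intersecting(*quorum_sets: Quorums) -> bool:
    """I(...): every choice of one quorum per set has a common server."""
    return all(
        frozenset.intersection(*choice) for choice in product(*quorum_sets)
    )


def weakened_requirement_holds(phase1: List[Quorums], phase2: List[Quorums]) -> bool:
    """Phase one quorums of r intersect phase two quorums of every r' < r."""
    return all(
        intersecting(phase1[r], phase2[earlier])
        for r in range(len(phase1))
        for earlier in range(r)
    )


servers = ["S0", "S1", "S2", "S3"]


def of_size(k: int) -> Quorums:
    return [frozenset(c) for c in combinations(servers, k)]


phase1 = [of_size(3)] * 3
phase2 = [of_size(2)] * 3
weakened_ok = weakened_requirement_holds(phase1, phase2)
all_pairs_ok = intersecting(of_size(2), of_size(2))
```

In this example the weakened requirement holds even though the phase two quorums do not intersect one another, which is the kind of configuration that the all-quorums-intersect condition rules out.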

4.2 Progress without quorums

Each of the two phases of Paxos waits for agreement from a quorum of servers. However, we observe that it may be possible to proceed prior to reaching quorum agreement.

A client can safely terminate once it learns that a value has been decided (Rule 1). This need not be the result of completing both phases of the algorithm. This is illustrated in Figure 10(b) where the client learns that value A has been decided prior to starting phase two.

Similarly, if a client learns that a register $r$ contains a (non-nil) value $v$ then it also learns that if any quorums from register sets $0$ to $r$ reach a decision then $v$ must be chosen. By updating its decision table, we observe that it is no longer necessary for the client in phase one to intersect with the phase two quorums of register sets up to $r$ (inclusive). This is illustrated in Figure 11(a) where the client could safely proceed to phase two after one P1b message as the client reads a non-nil value from the predecessor register set.

5 Generalisation of Fast Paxos

Register Quorums
(a)
Register Quorums
(b)
Register Quorums
(c)
Register Quorums
(d)
Figure 12: Additional sample configurations.

Paxos requires client restricted configurations for all register sets. Fast Paxos [16] generalises Paxos by permitting intersecting quorum configurations for some register sets, known as fast sets, whilst still utilising client restricted configurations for remaining sets, known as classic sets. Quorums for classic sets must include of servers whereas quorums for fast sets must include of servers. Figure 12(a) is an example of such a configuration.

Fast Paxos modifies our original Paxos algorithm (Figure 8) as follows:

  • If a register set is fast then a client does not need to be assigned the register set nor does it need to ensure that it writes to it with only one value. Any client can use any fast register set.

  • If the register set is fast then completion of each phase requires responses from of servers instead of of servers.

  • When choosing a value at the end of phase one, multiple values may have been read from the same register set (if it was a fast set), in which case the client chooses the most common.
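The value selection rule in the last bullet can be sketched as follows. The representation of replies (one dictionary per server, mapping register index to the non-nil value read) and the function name `choose_value` are assumptions, and the text does not say how ties between equally common values are broken, so this sketch simply takes whichever `Counter.most_common` returns first.

```python
from collections import Counter
from typing import Dict, List


def choose_value(replies: List[Dict[int, str]], input_value: str) -> str:
    """Pick the most common value in the greatest register set seen."""
    greatest = max((i for reply in replies for i in reply), default=None)
    if greatest is None:
        return input_value  # no values returned: use the client's input
    counts = Counter(reply[greatest] for reply in replies if greatest in reply)
    return counts.most_common(1)[0][0]


fast_set_choice = choose_value([{0: "A"}, {0: "B"}, {0: "A"}], "C")
no_values_choice = choose_value([{}, {}], "C")
later_set_choice = choose_value([{0: "A", 1: "D"}, {0: "A"}], "C")
```

When the greatest register set is classic, all replies for it carry the same value, so the count is only informative for fast sets.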

5.1 Weakened quorum intersection requirements

Fast Paxos uses quorums of of servers for fast sets and of servers for classic sets since it requires the following intersection between quorums for fast sets, and quorums for classic sets, : , and . (Footnote 1: Generalisation to quorums requires us to rewrite the value selection rule to choose the value which may be decided.)

As with Paxos, these intersection requirements are conservative. We differentiate between the quorums used for each register set and which phase of the algorithm the quorum is used for. $\mathcal{Q}^p_r$ is the set of quorums for phase $p$ of register set $r$. In addition to Paxos’s weakened intersection requirement (Eq. (*)), we observe that two additional quorum intersections are required: between the quorums for each fast register set, and between the phase one quorum for register set $r$ and any pair of phase two quorums of fast register sets from 0 to $r-1$. Denoting the set of fast register sets as $F$, we express these requirements as follows:

$\forall r \in F : I(\mathcal{Q}^2_r, \mathcal{Q}^2_r)$ and $\forall r \in \mathbb{N}_0, \forall r' \in F : r' < r \implies I(\mathcal{Q}^1_r, \mathcal{Q}^2_{r'}, \mathcal{Q}^2_{r'})$. (**)
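The two additional requirements described above can also be checked for a finite set of register sets. This sketch rests on assumptions: the intersections within a fast register set are taken to be over its phase two quorums, configurations are dictionaries from register-set index to a list of frozensets, and the four-server example with fast register set 0 is hypothetical. It checks only the additional requirements, not Eq. (*).

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Set

Quorums = List[FrozenSet[str]]


def intersecting(*quorum_sets: Quorums) -> bool:
    """I(...): every choice of one quorum per set has a common server."""
    return all(
        frozenset.intersection(*choice) for choice in product(*quorum_sets)
    )


def fast_requirements_hold(
    phase1: Dict[int, Quorums], phase2: Dict[int, Quorums], fast: Set[int]
) -> bool:
    # Quorums of each fast register set must pairwise intersect.
    within = all(intersecting(phase2[r], phase2[r]) for r in fast)
    # Each phase one quorum must intersect any pair of phase two quorums
    # of every earlier fast register set.
    across = all(
        intersecting(phase1[r], phase2[earlier], phase2[earlier])
        for r in phase1
        for earlier in fast
        if earlier < r
    )
    return within and across


servers = ["S0", "S1", "S2", "S3"]


def of_size(k: int) -> Quorums:
    return [frozenset(c) for c in combinations(servers, k)]


phase2 = {0: of_size(3), 1: of_size(3)}
holds_with_three = fast_requirements_hold({0: of_size(3), 1: of_size(3)}, phase2, {0})
holds_with_two = fast_requirements_hold({0: of_size(2), 1: of_size(2)}, phase2, {0})
```

With four servers and fast quorums of three, a phase one quorum of three always meets the overlap of any two fast quorums, whereas a phase one quorum of two can miss it entirely.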

5.2 Progress without quorums

Utilising decision tables, we observe that quorum agreement is sufficient but not necessary for a client to complete a phase of the algorithm. In particular, during the following three cases.

(i) As with Paxos, once a client learns that a quorum of registers contain a value then the client can terminate and return that value.

(ii) If a client learns that a register $r$ contains a (non-nil) value $v$ then it also learns that if any quorums from register sets 0 to $r-1$ reach a decision then $v$ must be chosen. If $r$ is a classic register set then it also learns that if any quorums from register set $r$ reach a decision then $v$ must be chosen. The client therefore no longer needs to intersect with quorums $0$ to $r-1$ if $r$ is fast or quorums $0$ to $r$ if $r$ is classic.

S0 S1 S2 S3
R0
Register Quorum Decision state
None
None
None
None
Figure 13: Sample client state table (left) and decision table (right) for a client during Fast Paxos.
S0 S1 S2 S3
R0 A B
Register Quorum Decision state
None
None
Maybe A
Maybe B
Figure 14: Sample client state table (left) and decision table (right) for a client during Fast Paxos.

(iii) Furthermore, a client in phase one will only need to intersect with any previous two fast quorums if it is unable to determine which value to propose. Figures 13 and 14 give an example of this with the configuration from Figure 12(a). According to Equation (**), the client needs to read three registers from register set 0 before it can safely write to register set 1. However, in Figure 13, the client can safely write to register set 1 after reading just two registers. This is not the case in Figure 14, however.

Phase 1

  • A client $c$ chooses a register set $r$ that is either quorum intersecting, or client restricted and assigned to $c$ but not yet used. $c$ sends P1a($r$) to all servers.

  • Upon receiving P1a($r$), each server checks if register $r$ is unwritten. If so, any unwritten registers up to $r-1$ (inclusive) are set to nil. The server replies with P1b($r$, $V$) where $V$ is a set of all written registers.

  • Each time $c$ receives a P1b, it updates its state and decision tables accordingly. If the decision state of all quorums from register sets $0$ to $r-1$ is None or Maybe $v$ then $c$ chooses $v$ (or if all states are None then its input value) and proceeds to phase two. If $c$ times out before completing phase one, it restarts phase one.

Phase 2

  • $c$ sends P2a($r$, $v$) to all servers where $v$ is the value chosen at the end of phase one.

  • Upon receiving P2a($r$, $v$), each server checks if register $r$ is unwritten. If so, any unwritten registers up to $r-1$ (inclusive) are set to nil and register $r$ is set to $v$. The server replies with P2b($r$, $v$).

  • Each time $c$ receives a P2b, it updates its state and decision tables accordingly. If the decision state of a quorum is Decided $v$ then $c$ outputs $v$. If $c$ times out before completing phase two, it restarts phase one.

Figure 15: The Generalised Fast Paxos algorithm.

Figure 15 summarises how these generalisations can be combined into a revised Fast Paxos algorithm. Note that a client can complete a phase once the completion criteria (underlined) have been satisfied even if it has not executed every step.

6 Example consensus algorithms

In this section, we outline three uses of our generalisation of Paxos and Fast Paxos by utilising different configurations.

Co-located consensus. Consider a configuration which uses a quorum containing all servers for the first $k$ register sets and majority quorums afterwards, as shown in Figure 12(b). All register sets are client restricted. Participants in a system may be deciding a value between themselves, and so a server and client are co-located on each participant. A client can therefore either achieve consensus in one round trip to all servers (if all are available) or two round trips to any majority (in case a server has failed).

Fixed-majority consensus. Consider a configuration with one majority quorum for register set $0$ and majority quorums for register sets $1$ onwards, as shown in Figure 12(c). Register set $0$ is quorum intersecting and register sets $1$ onwards are client restricted. A client can either achieve consensus in one round trip to a specific majority or two round trips to any majority.

Reconfigurable consensus. Consider a set of servers partitioned into a primary set and backup set. Consider a configuration which uses only primary servers for register sets $0$ to $k$ and only backup servers from register set $k+1$, as shown in Figure 12(d). A client can move the system from primary servers to backup servers by executing Paxos for register set $k+1$ or greater. No subsequent client will need a reply from a primary server to make progress whilst the backup set is available.
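The first two configurations can be generated programmatically, as in the sketch below. Several things here are assumptions: the infinite series of register sets is truncated to a finite `total`, the number of all-server register sets `k` is a free parameter, and the "specific majority" for fixed-majority consensus is arbitrarily taken to be the first majority in enumeration order. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Quorums = List[FrozenSet[str]]


def majorities(servers: List[str]) -> Quorums:
    size = len(servers) // 2 + 1
    return [frozenset(c) for c in combinations(servers, size)]


def co_located(servers: List[str], k: int, total: int) -> Dict[int, Quorums]:
    """All-server quorum for the first k register sets, majorities after."""
    everyone = [frozenset(servers)]
    return {r: everyone if r < k else majorities(servers) for r in range(total)}


def fixed_majority(servers: List[str], total: int) -> Dict[int, Quorums]:
    """One specific majority for register set 0, any majority afterwards."""
    fixed = [majorities(servers)[0]]
    return {r: fixed if r == 0 else majorities(servers) for r in range(total)}


servers = ["S0", "S1", "S2"]
co_located_config = co_located(servers, k=1, total=3)
fixed_majority_config = fixed_majority(servers, total=3)
```

In both cases register set 0 has a single quorum, which is why a decision there needs only one round trip, while later register sets fall back to ordinary majorities.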

7 Conclusion

Paxos has long been the de facto approach to reaching consensus; however, this “one size fits all” solution performs poorly in practice and is famously difficult to understand. In this paper, we have reframed the problem of distributed consensus in terms of write-once registers and thus proposed a generalised solution to distributed consensus. We have demonstrated that this solution not only unifies existing algorithms including Paxos and Fast Paxos but also demonstrates that such algorithms are conservative as their quorum intersection requirements and quorum agreement rules can be substantially weakened. We have illustrated the power of our generalised consensus algorithm by proposing three novel algorithms for consensus, demonstrating a few interesting points on the diverse array of algorithms made possible by our abstraction.

Our aim is to make reasoning about correctness sufficiently intuitive that proofs are not necessary to make a convincing case for safety; nonetheless, we include proofs in Appendix A for completeness.

8 Acknowledgements

We would like to thank Jon Crowcroft, Stephen Dolan and Martin Kleppmann for their valuable feedback on this paper. This work was funded in part by EPSRC EP/N028260/2 and EP/M02315X/1.

References

  • [1] M. Balakrishnan, D. Malkhi, J. D. Davis, V. Prabhakaran, M. Wei, and T. Wobber. Corfu: A distributed shared log. ACM Trans. Comput. Syst., 31(4):10:1–10:24, Dec. 2013.
  • [2] W. J. Bolosky, D. Bradshaw, R. B. Haagens, N. P. Kusters, and P. Li. Paxos replicated state machines as the basis of a high-performance data store. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pages 141–154, Berkeley, CA, USA, 2011. USENIX Association.
  • [3] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. Tao: Facebook’s distributed data store for the social graph. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference, USENIX ATC’13, pages 49–60, Berkeley, CA, USA, 2013. USENIX Association.
  • [4] N. Budhiraja, K. Marzullo, F. B. Schneider, and S. Toueg. Distributed systems (2nd ed.). chapter The Primary-backup Approach, pages 199–216. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.
  • [5] M. Burrows. The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 335–350, Berkeley, CA, USA, 2006. USENIX Association.
  • [6] T. D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: An engineering perspective. In Proceedings of the Twenty-sixth Annual ACM Symposium on Principles of Distributed Computing, PODC ’07, pages 398–407, New York, NY, USA, 2007. ACM.
  • [7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP ’07, pages 205–220, New York, NY, USA, 2007. ACM.
  • [8] Facebook. LogDevice project homepage. https://logdevice.io/. [Online; accessed 5-Oct-2018].
  • [9] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374–382, 1985.
  • [10] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.
  • [11] P. Helland. Immutability changes everything. Queue, 13(9):40:101–40:125, Nov. 2015.
  • [12] H. Howard, D. Malkhi, and A. Spiegelman. Flexible Paxos: Quorum Intersection Revisited. In P. Fatourou, E. Jiménez, and F. Pedone, editors, 20th International Conference on Principles of Distributed Systems (OPODIS 2016), volume 70 of Leibniz International Proceedings in Informatics (LIPIcs), pages 25:1–25:14, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [13] L. Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks (1976), 2(2):95 – 114, 1978.
  • [14] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, May 1998.
  • [15] L. Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 2001.
  • [16] L. Lamport. Fast Paxos. Technical Report MSR-TR-2005-112, Microsoft Research, 2005.
  • [17] B. W. Lampson. How to build a highly available system using consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms, WDAG ’96, pages 1–17, London, UK, 1996. Springer-Verlag.
  • [18] H. Lu, K. Veeraraghavan, P. Ajoux, J. Hunt, Y. J. Song, W. Tobagus, S. Kumar, and W. Lloyd. Existential consistency: Measuring and understanding consistency at Facebook. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 295–310, New York, NY, USA, 2015. ACM.
  • [19] D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In Proc. USENIX Annual Technical Conference, pages 305–320, 2014.
  • [20] R. D. Prisco, B. W. Lampson, and N. A. Lynch. Revisiting the Paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms, WDAG ’97, pages 111–125, London, UK, 1997. Springer-Verlag.
  • [21] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv., 22(4):299–319, Dec. 1990.
  • [22] M. Shapiro, N. Preguiça, C. B. Moreno, and M. Zawirski. Conflict-free replicated data types. pages 386–400, July 2011.
  • [23] R. Van Renesse and D. Altinbuken. Paxos made moderately complex. ACM Comput. Surv., 47(3):42:1–42:36, Feb. 2015.

Appendix A Proofs of safety

In this appendix, we provide proofs for the safety properties (non-triviality, agreement) of our proposed algorithms for solving consensus.

A.1 Four correctness rules

Figure 3 proposed four rules which we claim are sufficient to satisfy the non-triviality and agreement requirements of distributed consensus (§2). We now consider each requirement in turn. We will use to denote that the value is in register on server .

Theorem A.1 (Satisfying non-triviality).

If a value is the output of a client then was the input of some client .

Proof.

Assume was the output of client .

According to Rule 1, therefore at least one register contains .

Consider the invariant that all (non-nil) registers contain client input values. Initially, all registers are unwritten thus this invariant holds. According to Rule 2, each client will only write either their input value or a value copied from another register, thus the invariant will be preserved. ∎
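The invariant used in this proof can be illustrated with a small sketch. The class `WriteOnceRegister`, the marker `NIL` and the helper `invariant_holds` are hypothetical names introduced here for illustration only; they are not part of the algorithms in the paper. The sketch assumes a register is either unwritten, written with nil, or written with a single value, and it checks that every (non-nil) written register holds some client's input value.

```python
NIL = "nil"  # hypothetical marker for an explicitly written nil value


class WriteOnceRegister:
    """A register that starts unwritten and accepts exactly one write."""

    def __init__(self):
        self._written = False
        self._value = None

    def write(self, value):
        # A second write is rejected, so a stored value never changes.
        if self._written:
            raise ValueError("register has already been written")
        self._written = True
        self._value = value

    def read(self):
        """Return the stored value, or None while the register is unwritten."""
        return self._value


def invariant_holds(registers, client_inputs):
    """Check that every written, non-nil register holds a client input value."""
    return all(
        reg.read() in client_inputs
        for reg in registers
        if reg.read() is not None and reg.read() != NIL
    )
```

Writing an input value, copying a value read from another register, or writing nil all preserve the check, which mirrors the two kinds of write permitted by Rule 2.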

Theorem A.2 (Satisfying agreement).

If two clients, and , output values, and (respectively), then .

Proof.

Assume that value was the output of client . Assume that value was the output of client .

According to Rule 1, the following must be true:

Since register sets are totally ordered, it must be the case that either , or :

Case

:
Both decisions are in the same register set. It is either the case that both clients have read from the same quorum or they have read from different quorums.

Case

:
Each quorum can decide at most one value thus

Case

:
According to Rule 3, since has decided , each client who wrote a register in must have ensured that no other quorum in register set can reach a different decision. Thus .

Case

:
According to Rule 4, a client will only write to register set after ensuring no quorum in register set will reach a different decision. Thus .

Case

:
This is the same as with and swapped. Thus .

A.2 Client decision table rules

We have shown that the four rules for correctness are sufficient to satisfy the non-triviality and agreement requirements of consensus. We will now show that the client decision table rules (Figure 5) implement the four rules for correctness (Figure 3) and thus satisfy the non-triviality and agreement requirements of consensus.

Theorem A.3 (Satisfying Rule 1).

If the value is the output of client then has read from a quorum in register set .

Proof.

Assume the value is the output of client . There must exist a register set and quorum in the decision table of with the status Decided  (Figure 5). A quorum can only reach decision state Decided  if . ∎

Theorem A.4 (Satisfying Rule 2).

If the value is written by a client then either is ’s input value or has been read from some register.

Proof.

Assume the value has been written by client . According to Figure 5, must be either the input value of or read from some register. ∎

Theorem A.5 (Satisfying Rule 3).

If the values and are decided in register set then .

Proof.

Assume the value is decided in register set by quorum , thus . Assume the value is decided in register set by quorum , thus .

The register set uses either intersecting quorums or a client restricted configuration.

Case

is client restricted:
Each client is assigned a disjoint subset of register sets thus at most one client is assigned . A client will only write a (non-nil) value to if they have been assigned it and not yet written to it (Figure 5). The register set will therefore only contain one (non-nil) value thus .

Case

has intersecting quorums:
This requires that there exists a server such that and . We require that both and , thus .
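The intersecting-quorums case relies on every pair of quorums in a register set sharing at least one server. A minimal sketch of that check follows; the function names `quorums_intersect` and `majority_quorums` are hypothetical, and majorities are used only as one familiar example of an intersecting quorum system, not as a requirement of the proof.

```python
from itertools import combinations


def quorums_intersect(quorums):
    """Return True if every pair of quorums shares at least one server."""
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))


def majority_quorums(servers):
    """Enumerate all strict-majority subsets of servers (illustrative only)."""
    size = len(servers) // 2 + 1
    return [frozenset(c) for c in combinations(servers, size)]
```

For example, the majorities of three servers pass the check, whereas the pair of quorums `{"s0", "s1"}` and `{"s2"}` does not, so two different values could be decided by those two quorums without any single write-once register having to hold both.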

Theorem A.6 (Satisfying Rule 4).

If the value is decided in register set and the (non-nil) value is written to register set where then

We will prove this by induction over the writes to register sets .

Theorem A.7 (Satisfying Rule 4 - Base case).

If the value is decided in register set then the first (non-nil) value to be written to a register set where is .

Proof.

Assume the value is decided in register set by quorum thus . Since registers are write once, the following always holds true: .

Assume the value is written to register set by client where . Assume that is the first value to be written thus cannot read any (non-nil) values from registers before writing .

We will show that .

Consider the decision table of client when it is writing to . Since , the decision state of must be either None, Maybe  or Decided  (Figure 5).

Case

Decided :
This decision state requires that . Since we know that then .

Case

Maybe :
The decision state Maybe can be reached in one of three ways:

Case

read from register of some server :
Since we know that then .

Case

is client restricted and read from register of some server :
Since we know that then .

Case

read from a register : Since is the first value to be written to a register , this case cannot occur.

Case

None:
The decision state None can be reached in one of five ways:

Case

read nil from register of some server :
Since , this case cannot occur.

Case

read two different values from two servers, :
Since , this case cannot occur.

Case

read two different values from registers :
Since is the first value to be written to a register , this case cannot occur.

Case

read a value from register of some server and a different value from a register :
Since is the first value to be written to a register , this case cannot occur.

Case

is client restricted, read a value from a register in and a different value from a register :
Since is the first value to be written to a register , this case cannot occur.

Since the following proof overlaps significantly with the previous proof, we have underlined the parts which have been altered.

Theorem A.8 (Satisfying Rule 4 - Inductive case).

If the value is decided in register set and all (non-nil) values written to registers are then the next (non-nil) value to be written to a register set where is also .

Proof.

Assume the value is decided in register set by quorum thus . Since registers are write once, the following always holds true: .

Assume the value is written to register set by client where . Assume that all (non-nil) values written to registers are thus can only read from (non-nil) registers .

We will show that .

Consider the decision table of client when it is writing to . Since , the decision state of must be either None, Maybe  or Decided  (Figure 5).

Case

Decided :
This decision state requires that . Since we know that then .

Case

Maybe :
The decision state Maybe can be reached in one of three ways:

Case

read from register of some server :
Since we know that then .

Case

is client restricted and read from register of some server :
Since we know that then .

Case

read from a register :
Since is the only (non-nil) value to be written to registers then .

Case

None:
The decision state None can be reached in one of five ways:

Case

read nil from register of some server :
Since , this case cannot occur.

Case

read two different values from two servers, :
Since , this case cannot occur.

Case

read two different values from registers :
Since is the only (non-nil) value to be written to registers , this case cannot occur.

Case

read a value from register of some server and a different value from a register :
Since we know that and is the only (non-nil) value to be written to registers , this case cannot occur.

Case

is client restricted, read a value from a register in and a different value from a register :
Since we know that at some time , then if is client restricted then all non-nil registers in must contain . Since is the only (non-nil) value to be written to registers , this case cannot occur.

A.3 (Fast) Paxos

Figure 8 describes the Paxos algorithm using write-once registers. Section 5 describes how to generalise Figure 8 to Fast Paxos. In this section, we prove that Fast Paxos (and therefore Paxos) implements the four rules for correctness (Figure 3) and thus satisfies the non-triviality and agreement requirements of consensus.

Theorem A.9 (Satisfying Rule 1).

If the value is the output of client then has read from a quorum in register set .

Proof.

Assume the value is the output of client .

This must be the result of completing phase two of Fast Paxos for some register set . must have received the message P2b(,) from / of servers (depending on whether is classic/fast). Prior to sending P2b(,), each server has written register to . is any subset of servers containing / of servers (depending on whether is classic/fast). Thus has read a quorum in register set . ∎

Theorem A.10 (Satisfying Rule 2).

If the value is written by a client then either is ’s input value or has been read from some register.

Proof.

Assume a value is written by a client . This must be the result of completing phase one of Fast Paxos for some register set and choosing the value . The value must have been chosen in one of the following ways:

Case

No (non-nil) registers were returned with P1b messages:
In this case, is ’s input value.

Case

One or more (non-nil) registers were returned with P1b messages:
In this case, is the most common value from the greatest register set thus has been read from some register.
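The two cases above can be sketched as a single selection function. This is an illustrative reading of the case analysis, not the authors' implementation: the name `choose_value` and the representation of the P1b replies as `(register_set_index, value)` pairs for the non-nil registers are assumptions made here. The text does not say how ties between equally common values are resolved, so the sketch simply keeps whichever tied value it counted first.

```python
from collections import Counter


def choose_value(input_value, p1b_registers):
    """Pick the value a client writes after completing phase one.

    p1b_registers is an iterable of (register_set_index, value) pairs, one
    for each non-nil register returned with the P1b messages (an assumed
    representation).
    """
    registers = list(p1b_registers)
    if not registers:
        # No non-nil registers were returned: use the client's own input.
        return input_value
    # Otherwise only the greatest register set that was returned matters.
    greatest = max(index for index, _ in registers)
    counts = Counter(value for index, value in registers if index == greatest)
    # Most common value in that register set (ties: first counted wins).
    return counts.most_common(1)[0][0]
```

In both branches the returned value is either the client's input or a value read from some register, which is exactly what Rule 2 requires.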

Theorem A.11 (Satisfying Rule 3).

If the values and are decided in register set then .

Proof.

Assume the values and are decided in register set . It is therefore the case that there exist two quorums such that and . The register set is either fast (quorum intersecting) or classic (client restricted):

Case

is fast:
There exists a server where and . We require that thus

Case

is classic:
At most one client is assigned register set . Each client only writes (non-nil) values to assigned register sets and each does so with only one value. Therefore .

Theorem A.12 (Satisfying Rule 4).

If the value is decided in register set and the (non-nil) value is written to register set where then

We will prove this by induction over the writes to register sets .

Theorem A.13 (Satisfying Rule 4 - Base case).

If the value is decided in register set then the first (non-nil) value to be written to a register set where is .

Proof.

Assume the value is decided in register set . If is fast (quorum intersecting), must have been written to register on or more of servers. Otherwise, if is classic (client restricted), must have been written to register on at least of servers. The writing of to must be the result of receiving P2a(,).

Assume the (non-nil) value is written to register set by client . This must be the result of completing phase one of Fast Paxos for register set and choosing the value . The value could be chosen in one of two ways:

Case

is ’s input value:
This requires that no (non-nil) registers were returned to with the P1b messages for . At least one server must both write and send a P1b message to since both require at least of servers.

Case

sends P1b for register first:
Prior to sending P1b, the server must write nil to all unwritten registers to , including register since . Server will not be able to later write so this case cannot occur.

Case

must write first:
Since no registers were returned with P1b messages, this case cannot occur.

Case

is the most common value read from the greatest (non-nil) register set:
This requires that one or more (non-nil) registers were returned to with the P1b messages for . As we have already seen, at least one P1b message for register must include . The chosen value must have either been read from register set or from any register set .

Case

was read from register set :
The register set is either fast (quorum intersecting) or classic (client restricted):

Case

is classic:
All (non-nil) registers from returned with P1b messages will contain thus .

Case

is fast:
At least of servers will reply with . Therefore will be the most common value and it will be chosen by thus .

Case

was read from a register set :
Since the client is the first to write to a register then will not read any registers . Therefore this case cannot occur.

Since the following proof overlaps significantly with the previous proof, we have underlined the parts which have been altered.

Theorem A.14 (Satisfying Rule 4 - Inductive case).

If the value is decided in register set and all (non-nil) values written to registers are then the next (non-nil) value to be written to a register set where is also .

Proof.

Assume the value is decided in register set . If is fast (quorum intersecting), must have been written to register on or more of servers. Otherwise, if is classic (client restricted), must have been written to register on at least of servers. The writing of to must be the result of receiving P2a(,).

Assume the (non-nil) value is written to register set by client . This must be the result of completing phase one of Fast Paxos for register set and choosing the value . The value could be chosen in one of two ways:

Case

is ’s input value:
This requires that no (non-nil) registers were returned to with the P1b messages for . At least one server must both write and send a P1b message to since both require at least of servers.

Case

sends P1b for register first:
Prior to sending P1b, the server must write nil to all unwritten registers to , including register since . Server will not be able to later write so this case cannot occur.

Case

must write first:
Since no registers were returned with P1b messages, this case cannot occur.

Case

is the most common value read from the greatest (non-nil) register set:
This requires that one or more (non-nil) registers were returned to with the P1b messages for . As we have already seen, at least one P1b message for register must include . The chosen value must have either been read from register set or from any register set .

Case

was read from register set :
The register set is either fast (quorum intersecting) or classic (client restricted):

Case

is classic:
All (non-nil) registers from returned with P1b messages will contain thus .

Case

is fast:
At least of servers will reply with . Therefore will be the most common value and it will be chosen by thus .

Case

was read from a register set :
Since all non-nil registers contain then will not read any other value from any registers thus .