Raft Consensus Algorithm: an Effective Substitute for Paxos in High Throughput P2P-based Systems

11/04/2019 ∙ by Mohammad Fazlali, et al. ∙ 0

One of the significant problems in peer-to-peer databases is the collision problem. These databases do not rely on a central leader, which increases scalability and fault tolerance. Using such systems for high throughput computing makes the computing system more flexible and avoids the problems of systems that depend on central nodes. Research in this area is limited, and the existing approaches do not appear suitable for large-scale use. In this paper, we use Cassandra, a distributed database built on a peer-to-peer network, as a high throughput computing system. By default, Cassandra uses Paxos to elect a central leader, which causes the collision problem. Among existing consensus algorithms, Raft separates the key elements of consensus, such as leader election, and enforces a stronger degree of coherency to reduce the number of states that must be considered, such as collisions.




1 Introduction

According to the CAP theorem, consistency, high availability, and partition tolerance are the three axes of any distributed database management system, and at most two of them can be fulfilled at once. Systems whose primary goal is scalability usually provide high availability and partition tolerance. They therefore have to use various approaches to manage replicas and must be prepared for different partitioning scenarios. Horizontal partitioning in server clusters is a cost-effective way to provide scalability. With horizontal partitioning, each cluster manages a part of the data with its own database system. In such a system, maintenance and load balancing are traditionally manual processes; recently, however, new architectures have emerged for automatic horizontal partitioning and load balancing [8, 12, 15, 18], mostly based on hash keys. Accessibility is the second challenging issue, and it is mostly achieved through replication.

The simplest solution for replication is a master-slave relationship between pairs of nodes. In fact, many database systems implement horizontal partitioning with master-slave replication. However, this solution has limitations. For instance, if one of the nodes fails, the other can still carry out the tasks, but a sequence of failures puts the system in trouble. Another solution is tripartite replication, which is more reliable and makes problematic situations easier to manage. It also allows one replica to be updated offline while the other two remain online [8]. Managing a sequence of failures is more convoluted in this solution, so many efforts have aimed at simplifying and solving this problem [25]. Experiments have shown that the best way to manage more than three replicas is to use consensus algorithms [58]. Peer-to-peer (P2P) database systems are the most prominent example of this type of system.

A P2P database is based on a network of heterogeneous interconnected nodes that share data via files, each acting as both supplier and consumer. In this structure, all nodes have the same function and are autonomous. Theoretically, there is no central coordinator in the system, but in practice this has not yet been fully realized. Peers construct an overlay network and form a connection topology that puts nodes in touch after they join and keeps the links for rejoins [54].

Cassandra is one of the outstanding products that is based on a hash table and has a P2P structure [12]. It is an open source distributed database management system developed on the basis of Amazon Dynamo and the Bigtable data model. Its key features include distribution, decentralization, scalability, high availability, fault tolerance, tunable consistency, and a column-oriented model. Cassandra is an appropriate tool for managing vast amounts of data across an intricate network. Other features, such as consistent hashing for partitioning, a gossip-based algorithm for membership control, replication, an anti-entropy algorithm, and failure detection, make Cassandra a fault-tolerant and symmetric distributed storage system with eventual consistency.

The eventual consistency that Cassandra provides causes some unexpected problems. For example, when different data are written to Cassandra, multiple copies with different timestamps are created. The major issue emerges when several nodes choose the same item from a queue to run. One of them takes the job and removes it from the queue, while the other nodes may still believe that the job is in the queue and try to run it. This situation is called workers’ collision [8]. Consensus algorithms, such as Paxos [33], have been proposed to solve this problem. They let the requesting nodes learn about preceding candidates that have priority for popping jobs [8].
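
The collision scenario can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the replica lists, the worker names, and the helper functions are hypothetical and are not part of Cassandra. Each worker pops from its own replica of the queue before the removal has propagated to the other replica.

```python
def pop_from_local_replicas(replicas, assignment):
    """Each worker pops the head of the replica it reads from.

    replicas:   list of lists, each a not-yet-synchronized copy of the queue
    assignment: dict mapping worker name -> index of the replica it reads
    """
    taken = {}
    for worker, replica_index in assignment.items():
        queue = replicas[replica_index]
        if queue:
            # The removal is local only; other replicas still hold the job.
            taken[worker] = queue.pop(0)
    return taken


def colliding_jobs(taken):
    """Return the set of jobs that more than one worker picked up."""
    seen, collisions = set(), set()
    for job in taken.values():
        if job in seen:
            collisions.add(job)
        seen.add(job)
    return collisions


replicas = [["job-1", "job-2"], ["job-1", "job-2"]]
taken = pop_from_local_replicas(replicas, {"w1": 0, "w2": 1})
print(taken, colliding_jobs(taken))
```

Both workers end up with the same job, which is the situation a consensus step is meant to prevent.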

Paxos is a family of protocols for reaching consensus in a network of unreliable nodes. Its main limitations are its processing complexity and the difficulty of understanding its details, which leads to poorly structured system architectures in practice. Efforts to address this led to the Raft algorithm [49], which is simpler than Paxos [15]. Raft uses concepts such as leader election, log replication, and safety to make consensus more comprehensible. Raft is equivalent to Paxos in fault tolerance and performance, but it separates consensus into relatively independent sub-problems and clearly addresses all the pieces essential for a real-world implementation.

Raft starts consensus by electing a leader and giving it the authority to manage the logs. Raft provides fault tolerance by having the leader take log entries from the servers and pass them to the others, so that they know when entries have been safely received. Previous experiments indicated that Paxos has higher read and write latency.

In this paper, we briefly show the similarities and differences between Paxos and Raft. First, we describe the consensus problem. Second, we describe other consensus algorithms and protocols. We then describe how leaders are elected in Paxos and Raft. Finally, the goal of our experiment is to find a solution to the collision problem in a distributed queue for the Cassandra database system, because solving it is positively correlated with the efficiency of high throughput capabilities and with reducing the redundancy of the P2P storage system. To this end, we survey some well-known consensus algorithms: Paxos, Viewstamped Replication (VR), Zab, Chandra-Toueg, and Raft.

Studying these algorithms leads us to focus on Paxos and to propose using Raft in Cassandra, in order to manage replicas more efficiently and decrease read/write request latency, achieving better performance and fairer load balancing.

In the rest of this paper, we first define the consensus problem; then, in Section 3, previous work is reviewed. Our claim is supported in Section 4 by simulation results and several assessments. Finally, Section 5 presents the conclusions.

2 Consensus Problem

In this section, we look at the basic concepts of the consensus problem.

Consensus deals with a set of processes agreeing on the same data, even though some of these processes might fail or be unreliable. Hence, consensus algorithms must be able to handle failures. The general approach is that the processes agree on a common data value. A faulty process might change the result of consensus to a wrong value. To tolerate temporary failures, consensus algorithms must have the following properties:

  • Termination: Each valid (non-faulty) process eventually terminates with a decided value.

  • Validity: If all the processes propose the same value v, all the valid processes choose v.

  • Integrity: Each valid process decides on at most one value.

  • Agreement: All valid processes must agree on a common value.
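
The properties above can be checked mechanically on the outcome of a single run. The following is a minimal sketch, not part of any of the surveyed algorithms; the function name and the dictionary-based representation are assumptions made for illustration. Integrity holds by construction here, because each process maps to at most one decision.

```python
def check_consensus(proposals, decisions):
    """Check termination, agreement and validity for one run.

    proposals: dict process -> proposed value (valid processes only)
    decisions: dict process -> decided value, or None if undecided
    """
    decided = [v for v in decisions.values() if v is not None]
    proposed = set(proposals.values())
    return {
        # Every valid process has decided.
        "termination": all(v is not None for v in decisions.values()),
        # No two valid processes decided differently.
        "agreement": len(set(decided)) <= 1,
        # If everyone proposed the same v, only v may be decided.
        "validity": len(proposed) != 1 or all(v in proposed for v in decided),
    }


print(check_consensus({"p1": "a", "p2": "a"}, {"p1": "a", "p2": "a"}))
```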

Besides, the system might encounter two types of failures:

1) Crash failure: It occurs when a process stops abnormally.

2) Byzantine failure: A process with a Byzantine failure might send wrong data to other processes, or remain silent for a long time.

The second type of failure is more destructive and harder to handle. So, it is expected that the consensus algorithms anticipate the occurrence of such failures and attempt to neutralize their effect.

3 Consensus Algorithms and Protocols

This section briefly reviews the prominent consensus approaches in order to identify the one that best fits real-world use with the Cassandra database system. First, however, some required background is presented.

A distributed system comprises a collection of independent, interconnected computing devices that appears to the end user as a single coherent system [45, 53]. This concept also extends to distributed database systems, which are applications that provide access to data in a distributed environment. Designers of such systems attempt to provide fault tolerance through techniques like replication and data distribution, but these techniques come with increased redundancy and more difficult management and implementation [17].

As mentioned, the database system considered in this research is Cassandra, which was designed by Avinash Lakshman and Prashant Malik in 2008 and was initially used by Facebook. Cassandra is an open source database system based on Amazon Dynamo, for managing massive data in the Bigtable model.

In the following, we will review the literature and survey some of the well-known consensus algorithms.

3.1 Background

According to [26, 56], P2P systems were devised for sharing files. Computation in P2P systems has given rise to productive research areas, such as search-related issues [57, 44, 46], and has attracted many studies to this field. In [51], it was stated that a P2P system has the following features:

  • It is absolutely distributed, without central leading points.

  • Users are able to enter the tasks from their connected node and then disconnect from the network.

  • Global scheduling is needed.

Some existing studies, such as [38, 14, 22], use centralized solutions and cannot provide complete distribution. Other works, such as [13] and [10], use the super-peer model, which is partly distributed and lets worker nodes be connected and scheduled by super-peers. In [11], distributed scheduling is proposed based on an overlay network and a tree structure, but in that system the tasks must be supervised; hence, the second requirement is violated. Centralized leadership, the most important drawback of [37] and [2], limits the scalability and accessibility of high throughput systems. To cope with this, consensus algorithms have been proposed that let processors exchange their local computation results in order to coordinate [5, 39]. Establishing simple local rules can make nodes behave consistently [28, 42], moving in the same direction and converging in their actions [21]. In one prominent study in this field, a special class of consensus algorithms was proposed based on convergence over weak connections [59]; it initiated follow-up studies such as [47] for weakly connected graphs and [42] for asynchronous systems [4].

Different replicas in the system might see multiple copies of events in different orders. This problem is addressed as causality in many studies, such as [32] by Lamport. Some solutions were presented in [19, 60, 52], but none was perfect. Lamport claimed in [31] that it is possible to reach consensus even with some faulty processes. However, asynchronous systems are an exception to this rule [20].

3.2 Paxos protocol

Paxos is known as the most popular consensus protocol; it was presented in Lamport’s paper [20] and developed further in Lampson’s article [34]. Paxos is a consensus protocol for a network of unreliable processors. Consensus is the process of agreeing on a particular value or item among a group of participants. This process becomes harder when there are failures in processing or communication. Paxos guarantees consistency while making a broad range of trade-offs among processors, latency, the level of participation of nodes, messages, and failures. On the other hand, Paxos has an intricate structure that makes it ill-suited to practical system design.

There are many real-world implementations of Paxos, such as the Chubby lock service [6], Megastore [3], and the Spanner storage system [16], all created by Google; the Autopilot service [27], Bing, and Azure [7] by Microsoft; and the Ceph storage system.

Two significant optimizations for improving the performance of Paxos are batching [55] and pipelining [43]. Batching aggregates messages in a manner based on TCP’s Nagle algorithm, and pipelining is a general optimization technique that runs requests in parallel to improve efficiency; HTTP [50] is an example of pipelining. A brief look at the literature suggests that no effort has so far been made to combine the batching and pipelining techniques, although some work has been done to optimize Paxos-based protocols. For example, LCR [23] proposes a pervasive atomic protocol based on the ring topology for optimizing vector clocks. Ring Paxos [40] combines several techniques for improving network efficiency, such as IP multicast, a ring topology, and a minimum number of acceptors. Both [23] and [40] target only the LAN environment; hence they use LAN-specific techniques like IP multicast.

Although Paxos is widely used for consensus in recent studies, it has some shortcomings that stem from its single-decree formulation. That is, the servers can only agree on one log entry at a time. This makes the consensus mechanism hard to understand, because it consists of two phases that cannot be separated. Paxos also relies on a weak form of leadership, introduced only as an optimization.

3.3 Paxos algorithm

Paxos lets a group of processes (called replicas) agree on the same value in a faulty environment, in which replicas might fail and messages might be lost in the network. Each replica is able to recover its state from before a failure. If all replicas run without failure for a certain period of time, the Paxos algorithm guarantees that the processes agree on the same value. Paxos consists of three steps, which are repeated if failures occur [24]:

1) Choosing a replica as the leader

2) The leader chooses a value and sends it to all other replicas in the form of an accept message. Other replicas can accept (by ack message) or reject (by reject message) the value.

3) If the majority of replicas consent with the leader, the consensus is done, and the leader sends the commit message to all the replicas.
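
Steps 2 and 3 can be sketched as a single round in which the leader counts acknowledgements. This is a simplified illustration rather than a full Paxos implementation; the function name `run_round` and the boolean vote list are hypothetical, and leader election (step 1) is assumed to have already happened.

```python
def run_round(leader_value, replica_votes):
    """Decide whether the leader may commit its value.

    replica_votes: one boolean per replica in the group
                   (True = ack message, False = reject message).
    """
    acks = sum(1 for vote in replica_votes if vote)
    majority = len(replica_votes) // 2 + 1
    if acks >= majority:
        # The leader would now send the commit message to all replicas.
        return ("commit", leader_value)
    # Without a majority, the three steps are repeated.
    return ("retry", None)


print(run_round("x", [True, True, False]))
```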

In Paxos, several processes might decide to become the next leader. The system’s policies should therefore restrict such competition, because it delays consensus. The flexibility of allowing multiple leaders means that different leaders may select different values. Paxos nevertheless guarantees consensus on a single value by attaching a sequence number to each leader and restricting each leader’s choice of value. Ordering the leaders lets the replicas distinguish the current leader from previous ones, so that they ignore messages from former leaders and accept messages from the current leader.

A replica sends a propose message to the others whenever it decides to become the new leader. If the other replicas accept the proposed leader, they promise to reject messages from previous leaders [24]. More details can be found in [33]. Figure 1 shows an example.
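
The promise mechanism can be sketched as follows. This is an illustrative fragment under assumptions: the class `Replica`, its method names, and the integer sequence numbers are hypothetical, and the restriction on which value a new leader may choose is left out.

```python
class Replica:
    """Tracks the highest leader sequence number this replica has promised."""

    def __init__(self):
        self.promised = -1      # highest sequence number promised so far
        self.accepted = None    # (sequence number, value) last accepted

    def on_propose(self, seq):
        """A would-be leader with sequence number `seq` asks for support."""
        if seq > self.promised:
            self.promised = seq  # promise to ignore older leaders
            return True
        return False

    def on_accept(self, seq, value):
        """Accept a value only from the current (or a newer) leader."""
        if seq >= self.promised:
            self.promised = seq
            self.accepted = (seq, value)
            return True
        return False


r = Replica()
r.on_propose(1)
r.on_propose(2)
print(r.on_accept(1, "old"), r.on_accept(2, "new"))
```

The message from the former leader (sequence number 1) is rejected, while the one from the current leader is accepted.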

Figure 1: Paxos algorithm [33].

The comparison of Chandra-Toueg and Paxos presented in [24] shows that Paxos performs better when a leader has failed, but in other situations the latency, and consequently the performance, is almost the same for both algorithms.

3.4 Chandra-Toueg

Chandra-Toueg [24] is designed for a network of unreliable processes and uses a strong failure detector. A failure detector is an abstract version of timeouts. The failure detector signals each process so that failures can be detected, which makes the algorithm somewhat similar to Paxos. Both are based on a strong failure detector and require the number of failed processes to be less than n/2, where n is the total number of processes.

The algorithm proceeds in rounds which are shown in Figure 2 and utilizes a rotating coordinator: in each round r, the process whose identity is given by (r mod n)+1 is selected as the coordinator. Each process keeps track of its current preferred decision value (initially equal to the input of the process) and the last round that it altered its decision value (the value’s timestamp). The actions in each round are:

1) All processes send (r, preference, timestamp) to the coordinator.

2) The coordinator waits to receive messages from at least half of the processes (including itself). It then chooses as its preference a value with the most recent timestamp among those sent.

3) The coordinator sends (r, preference) to all processes.

4) Each process waits to receive (r, preference) from the coordinator or for its failure detector to identify the coordinator as crashed. In the first case, it sets its own preference to the coordinator’s preference and responds with ack(r). In the second case, it sends nack(r) to the coordinator.

5) The coordinator waits to receive ack(r) or nack(r) from a majority of processes. If it receives ack(r) from a majority, it sends decide(preference) to all processes.

6) Any process that receives decide(preference) for the first time sends decide(preference) to all processes, then decides preference and terminates.
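
Two pieces of a round lend themselves to a short sketch: the rotating coordinator rule, (r mod n)+1, and the coordinator's choice of the value with the most recent timestamp. The function names and the tuple layout below are assumptions for illustration only; message passing and the failure detector are omitted.

```python
def coordinator_for_round(r, n):
    """Identity (1..n) of the coordinator in round r, with n processes."""
    return (r % n) + 1


def choose_preference(messages):
    """Pick the preference carrying the most recent timestamp.

    messages: list of (preference, timestamp) pairs received in step 2.
    """
    preference, _timestamp = max(messages, key=lambda m: m[1])
    return preference


print(coordinator_for_round(5, 4))
print(choose_preference([("a", 0), ("b", 3), ("c", 1)]))
```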

Figure 2: Phases in a round of Chandra-Toueg [24].

3.5 Zab

Zab is a crash-recovery atomic broadcast algorithm designed for the ZooKeeper coordination service. ZooKeeper implements a primary-backup scheme in which a primary process executes client operations and uses Zab to propagate the corresponding incremental state changes to backup processes [30].

At a high level, Zab is a leader-based protocol similar to Paxos. Several aspects distinguish Zab from Paxos: Zab recovers histories rather than single instances of protocols; Zab has prefix ordering properties that allow primary/backup implementations to have multiple operations outstanding at a time; and Zab does special processing at leadership change to guarantee unique sequence numbers for values proposed by a leader. All servers start off looking for a leader. Once the leader election instance at a given server indicates that a leader has emerged, the server moves to Phase 1. If the leader election instance indicates that the server itself is the leader, it enters Phase 1 as a leader; otherwise, it enters as a follower.

Phase 1: In this phase, an elected leader makes sure that previous leaders cannot commit new proposals and decides on an initial history. (Note a leader is also considered a follower of itself.)

Phase 2: The leader synchronizes with its followers.

Phase 3: The leader and followers can have multiple proposals in progress and at various stages in the pipeline. The leader remains active as long as a quorum of followers acknowledges its proposals or pings within a timeout interval. A follower continues to support a leader as long as it receives proposals or pings within a timeout interval. Any failure or timeout results in the server going back to leader election.
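
The quorum condition in Phase 3 can be sketched as a simple check. This is not ZooKeeper code; the function name, the timestamp dictionary, and the use of a simple majority as the quorum are assumptions for illustration. The leader counts itself, since a leader is also considered a follower of itself.

```python
def leader_has_quorum(last_ack_times, now, timeout, cluster_size):
    """Return True if the leader may stay active.

    last_ack_times: dict follower id -> time of its last acknowledgement
                    (the leader itself is not listed)
    """
    responsive = sum(1 for t in last_ack_times.values() if now - t <= timeout)
    responsive += 1  # the leader is a follower of itself
    return responsive >= cluster_size // 2 + 1


acks = {"f1": 9.5, "f2": 9.0, "f3": 2.0, "f4": 1.0}
print(leader_has_quorum(acks, now=10.0, timeout=2.0, cluster_size=5))
```

If the check fails, the server would go back to leader election, as described above.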

3.6 Viewstamped Replication

VR provides state machine replication in an asynchronous network like the Internet and handles node failures. The VR code contains logic for maintaining consensus among a majority of replicas for each operation. The VR proxy sits between the real application clients and the replicas and handles client requests. It shares configuration state, such as the current view number and the current primary, with the replicas. The VR protocol can be broadly divided into three modes of operation:

1) Normal Operation: In this mode, the primary replica accepts an operation from the VR proxy and queues it for replication onto at least f+1 replicas (including itself). Once the operation is logged on a majority of replicas, it is committed on the primary, executed against the application service, and the response is returned to the client.

2) View Change: If the primary goes down, the replicas time out and start the view change algorithm. Each replica sends StartViewChange to all replicas, and when a replica receives f such requests, it sends a DoViewChange message to the new primary. The new primary waits for f+1 DoViewChange messages and picks the most up-to-date log among them. It then sends StartView to all replicas and resumes operation in normal mode. Since log state is transferred in multiple messages, this phase transfers a large amount of data; checkpointing the logs is used to decrease the number of bytes exchanged.

3) State Recovery: When a replica rejoins after a failure, it catches up with the latest log.
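
The f+1 threshold used in normal operation and view change can be made concrete with a little arithmetic. The sketch below assumes the usual VR setting of a group of 2f+1 replicas tolerating f failures, which the text above does not state explicitly; the function names are hypothetical.

```python
def max_faults(group_size):
    """Largest f such that 2f + 1 <= group_size."""
    return (group_size - 1) // 2


def quorum_size(group_size):
    """Replicas (including the primary) that must log an operation: f + 1."""
    return max_faults(group_size) + 1


def is_committed(group_size, replicas_logged):
    """True once at least f + 1 replicas have logged the operation."""
    return replicas_logged >= quorum_size(group_size)


print(quorum_size(3), quorum_size(5), is_committed(5, 2))
```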

3.7 Raft

Raft is designed to make the consensus problem easier to understand by dividing it into sub-problems, in order to achieve better quality in consensus [48]. To get a clear picture of Raft, it helps to look first at the concept of distributed consensus.

In a system with multiple processes, each process is in one of three states: follower, candidate, or leader. All processes start in the follower state. If they do not receive a message from the leader, they can move to the candidate state. A candidate requests votes from the other processes, and the candidate that receives a majority of votes becomes the leader. This process is called leader election. From then on, all changes to the system are supervised by the new leader. For consensus, the leader waits for a majority of processes to write an entry to their logs; once this is done, the leader notifies the followers that the entry is complete and of the consensus result [48].

Given the leader approach, Raft decomposes the consensus problem into three relatively independent sub-problems as follows:
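
The state transitions of leader election can be sketched in a few lines. This is a simplified, single-threaded illustration and not the algorithm as specified in [48]: timeouts, RPCs, and the log comparison are omitted, and the names `Node`, `State`, and `run_election` are hypothetical.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = State.FOLLOWER   # every process starts as a follower
        self.term = 0
        self.voted_for = None

    def on_election_timeout(self):
        """No message from a leader: become a candidate and vote for self."""
        self.state = State.CANDIDATE
        self.term += 1
        self.voted_for = self.node_id

    def on_vote_request(self, candidate_id, candidate_term):
        """Grant at most one vote per term."""
        if candidate_term > self.term:
            self.term = candidate_term
            self.state = State.FOLLOWER
            self.voted_for = None
        if candidate_term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate, peers):
    """The candidate asks every peer for a vote; a majority makes it leader."""
    candidate.on_election_timeout()
    votes = 1 + sum(p.on_vote_request(candidate.node_id, candidate.term) for p in peers)
    if votes >= (len(peers) + 1) // 2 + 1:
        candidate.state = State.LEADER
    return votes


nodes = [Node(i) for i in range(3)]
print(run_election(nodes[0], nodes[1:]), nodes[0].state)
```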

  • Leader election: a new leader must be chosen when an existing leader fails.

  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own.

  • Safety: the key safety property for Raft is the State Machine Safety Property: if any server has applied a particular log entry to its state machine, then no other server may use a different command for the same log index.

As a final note, Raft is structurally similar to Paxos, but it is simpler and more understandable, and it can cope with the problem of read/write request latency. In Section 4, we show that Paxos can be replaced by Raft in the Cassandra database system to achieve better performance; the experimental results support this claim.

3.8 Discussion

Among the existing consensus algorithms, Raft can be compared with Paxos, VR, and Zab, partly because all of them handle fail-stop failures. VR and Zab are leader-based algorithms and are the most similar to Raft. Both choose a leader as the initial step, and the leader manages the replicated logs. They differ in how the leader is managed and how the logs are maintained. Raft, on the other hand, has a lighter mechanism than VR and Zab. For example, VR and Zab use 10 types of messages, while Raft uses just 4 (two RPC requests and their responses). However, Zab provides a stronger guarantee for running requests concurrently through pipelining, and it also preserves their FIFO order [29]. Furthermore, there is no failure detector in Paxos, VR, and Zab; failure detection is treated there as entirely separate from consensus, whereas Raft handles failure detection through heartbeats. Another point concerns how term (turn) numbers are allocated to servers, which is done in two ways. Zab and Raft run consensus to ensure that there is a unique leader at any time: if the nodes agree on a server, it can replicate log entries. Paxos and VR instead divide the domain of term numbers so that servers cannot compete for them (for example, in round-robin fashion). In practice, the two methods do not differ in effect, because both reach consensus correctly.

Table 1 shows some details of the different consensus algorithms. The New leader field shows which servers can be selected as the next leader; Management of configurations indicates whether the algorithm configures the server for leadership during the selection process; and Vote collectors shows the servers that can carry out the consensus and collect the votes [26].

Algorithm            New leader                      Management of configurations   Vote collectors
Paxos [20]
Paxos [32]           All the servers                 Yes                            New leader
Chandra-Toueg [24]
VR [36]              Servers with most recent logs   No                             View manager
VRR* [35]            Determined by view number       No                             New leader
Zab [29]             Servers with most recent logs   No                             New leader
Raft [49]            Servers with most recent logs   No                             New leader

* Viewstamped Replication Revisited utilizes a round turn approach for leader selection.

Table 1: Details of leadership in different algorithms

Paxos (in its non-optimized version), Zab, and VRR (VR) need extra processing to ensure that the new leader has the complete list of entries, because they disregard the logs when choosing the new leader. In Paxos, a leader runs two single-decree phases for each log entry and is not aware of whether the log is complete. This can cause a long latency period when choosing a new leader. Zab and VRR send entire logs to the new leader, which adopts the most recent one. This is a good idea but difficult in practice; reducing the number of entries sent is recommended as an optimization.

Finally, there are various ways to reduce the number of servers involved without affecting fault tolerance. A common way is to simplify replication so that it uses only a small number of the servers in a cluster [1]. This approach is called “thrifty” [41]. It halves the network load compared with the normal case, because entries are replicated to only half of the servers in the cluster. Other methods, proposed in [36] and [9], use witness servers; surveying them is outside the scope of this paper, whose aim is to analyze the fault-tolerance behavior that affects the results.

One of the most significant factors in the performance of consensus algorithms is leader election, which is compared for Paxos and Raft in the following paragraphs.

Both Paxos and Raft assume that eventually there will be a leader that all stable servers trust and a single leader is responsible for a term. A new leader will propose a new term, which must be greater than the previous one, if the current leader is suspected to have failed.

In Paxos, leader election is unconditional: the only requirement is that a candidate server must be stable and available. The same conditions apply in Raft, with one important difference: a server that stands as a candidate for leader must be the most up-to-date server among the others. In Paxos, because any server may become leader, a leader that receives a request concerning past data it has no record of must first learn that data from the other servers. This puts more pressure on the cluster. Raft avoids this problem by choosing the most up-to-date server as leader.

It is worth noting that, in both Raft and Paxos, the leader affects the traffic that passes through it.
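
The "most up-to-date" condition can be expressed as a comparison of the last log entries, which a voter applies before granting its vote. The sketch below follows the rule described in the Raft paper (compare the term of the last entry first, then the log length); the function and parameter names are chosen here for illustration.

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up to date
    as the voter's log, so the voter may grant its vote."""
    if cand_last_term != voter_last_term:
        # The log whose last entry has the later term is more up to date.
        return cand_last_term > voter_last_term
    # Same last term: the longer log is more up to date.
    return cand_last_index >= voter_last_index


print(candidate_log_is_up_to_date(3, 7, 3, 9))
```

Because a candidate needs votes from a majority, a server with a stale log cannot win, so the new leader does not have to fetch missing entries from the other servers.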

4 Experimental results

To compare Paxos and Raft, four evaluation parameters are considered:

1) I/O latency: This refers to the speed of read/write operations. Latency is the time between the start and the end of a transaction. For database systems, low latency is important because of the vast number of I/O operations needed to execute transactions.

2) Read/Write requests: Load balancing must be considered an important factor when clustering nodes, in order to avoid latency. In Cassandra, this is handled through replication, which provides the same data on several nodes. With the default settings, several queues are used for read/write operations. This balances the load on the one hand and avoids collisions on the other. It also lets the system accept client requests even during maintenance.

3) OS load: In Cassandra, each node is able to use 100% of the processor’s power. OS load refers to the load imposed on the hardware through the operating system.

4) OS: Net sent/received: This is the amount of data transferred over the network by the operating system. Network load is selected as a measure for assessing consensus algorithms because of the huge read/write load that Cassandra keeps in its queues.

For the experiments, we used a Cassandra server with the following details:

- CPU: Intel Xeon X5660 @ 2.80 GHz with 6 Cores and 12 threads

- Memory: 32 GB, DDR3

We installed four virtual machines with Linux (Ubuntu 15.04) and the Cassandra database on this system. To monitor the virtual machines, we used Opscenter (from DataStax.com). Devcenter (from DataStax.com) was used to connect to the database and manage queries. Next, a cluster was created from all four virtual machines, which we call nodes. The nodes were organized in a ring topology. To evaluate the system under high network load, a stress tool was used. This is an application that automatically generates a vast amount of data for Cassandra and other databases in order to analyze their behavior. In this case, we used the Cassandra stress tool from DataStax.com.

Figure 3: Cluster health under the high load using Paxos (a) and Raft (b) consensus algorithms.

The specifications of the nodes:

- Number of Nodes: 8 in WMW

- Hard Disk: 10 G

- CPU: 2.8 GHZ Intel

- Memory: 2G

- OS: Ubuntu 15.04 64X

The test parameter:  /Cassandra-stress/target/appassembler# bin/stress -o insert -b 1000 -c 10 -n 9000000 -t 5

During the experiment with Paxos, we increased the load continuously. Under these circumstances, one of the nodes stopped while the others kept interacting under high load. The stopped node was unable to rejoin and stayed out of the cluster.

For the experiment with Raft, we installed the golang-raft-dev package. Opscenter and the stress tool were used in the same way as in the Paxos experiment.

Unlike with Paxos, in the Raft experiment the cluster health improved, and none of the nodes stopped as the load increased. All of them also remained under a normal load. This can be considered an advantage of Raft over Paxos.

Figure 4: Write request latency in Paxos (a) and Raft (b).
Figure 5: Read request latency in Paxos (a) and Raft (b).

Figure 3 shows the cluster health with Paxos and with Raft. In the Paxos experiment, one of the nodes ran into trouble as the load increased. Three servers are operating under a heavy system load while one node is down and cannot rejoin. Raft, in contrast, copes better with the increasing load: all four servers keep operating under a more balanced load.

The figures show that Raft distributes the load pressure among the nodes of the cluster, so the nodes are under less strain than with Paxos.

As mentioned before, by choosing a proper leader Raft preserves the cluster's mechanisms far better than Paxos and relieves pressure on the other servers. This can be readily observed in Figure 10.
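The role of leader choice can be made concrete with a small sketch of majority voting. This is a simplified illustration of Raft-style leader election under stated assumptions, not the code used in the experiments: the `Node` class and `run_election` function are hypothetical names, and the log up-to-date check, election timeouts, and networking of real Raft are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified voter state (illustration only)."""
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    alive: bool = True

    def handle_vote_request(self, term: int, candidate_id: int) -> bool:
        """Grant at most one vote per term; stopped nodes never answer."""
        if not self.alive:
            return False
        if term > self.current_term:
            # A newer term resets the vote cast in the older term.
            self.current_term = term
            self.voted_for = None
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: Node, cluster: List[Node]) -> bool:
    """Return True if the candidate collects a strict majority of votes.

    The candidate is assumed to be a member of `cluster`, so it votes
    for itself through the same code path as every other node.
    """
    term = candidate.current_term + 1
    votes = sum(node.handle_vote_request(term, candidate.node_id) for node in cluster)
    quorum = len(cluster) // 2 + 1
    return votes >= quorum


if __name__ == "__main__":
    nodes = [Node(i) for i in range(4)]
    nodes[3].alive = False  # one of four nodes has stopped
    print(run_election(nodes[0], nodes))
```

In a four-node cluster such as the one used here, the quorum is three, so a leader can still be elected with one node stopped but not with two.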

In Figure 4, the vertical axis shows the write request latency in milliseconds per operation (ms/op), and the horizontal axis shows time. Figure 4 reveals the unbalanced write request latency of the system under Paxos and the better performance of Raft. As the load increases, write request latency in Paxos reaches about 4000 ms/op, while the maximum latency in Raft is about 20 ms/op. The minimum latency in Raft is also approximately half that of Paxos. These differences indicate a substantial improvement in write request latency with Raft over Paxos. A likely explanation is that in Paxos write requests are buffered and wait to be written, so pending write requests accumulate, whereas this happens far less in Raft.

Figure 6: Operating system load in Paxos (a) and Raft (b).
Figure 7: Network received data by the operating system in Paxos (a) and Raft (b).
Figure 8: Network sent data by the operating system in Paxos (a) and Raft (b).
Figure 9: Write requests in Paxos (a) and Raft (b).
Figure 10: Read requests in Paxos (a) and Raft (b).

In Figure 5, the vertical axis shows the read request latency in milliseconds per operation (ms/op), and the horizontal axis shows time. Figure 5 shows the read request latency for one of the nodes in the cluster as a sample. In Paxos, read request latency soars as write request latency increases, which suggests that read and write request latencies are interdependent. Read request latency in Raft, on the other hand, follows a more balanced pattern. The maximum latency in Paxos, about 4.5 ms/op, is also higher than that of Raft, which is approximately 2.2 ms/op. Although read request latency is much smaller than write request latency, it can have a significant effect on the overall speed of the system, especially in larger clusters. Overall, the peak read request latency in Paxos is roughly twice that of Raft.
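The peak latencies quoted for Figures 4 and 5 can be turned into ratios with a few lines of arithmetic. The sketch below only restates the approximate values given in the text; the dictionary and function names are illustrative, and the inputs are read off the prose rather than taken from raw measurements.

```python
# Approximate peak latencies (ms/op) as quoted in the text for Figures 4 and 5.
REPORTED_PEAK_LATENCY_MS = {
    "write": {"paxos": 4000.0, "raft": 20.0},
    "read": {"paxos": 4.5, "raft": 2.2},
}


def peak_ratio(kind: str) -> float:
    """Return how many times larger the Paxos peak is than the Raft peak."""
    peaks = REPORTED_PEAK_LATENCY_MS[kind]
    return peaks["paxos"] / peaks["raft"]


if __name__ == "__main__":
    for kind in ("write", "read"):
        print(f"{kind}: Paxos peak is about {peak_ratio(kind):.1f}x the Raft peak")
```

With these inputs, the write peak differs by a factor of about 200 and the read peak by a factor of about 2, which matches the "twice" stated for read latency.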

Figure 6 shows the average operating system load of the cluster. The maximum loads of Paxos and Raft are about 2 and 0.4, respectively, which shows the efficiency of Raft in comparison with Paxos. Comparing the operating system load with the read/write request latency in Paxos indicates that both worsen as the system load increases, so there appears to be a positive relationship between OS load and read/write request latency. In Raft, no such correlation is observed. Moreover, OS load in Paxos varies over a wider range, while Raft follows a more regular pattern, which reflects better load balancing in Raft than in Paxos. In short, with Raft the average OS load in the cluster is not only much lower than with Paxos but also steadier.

In Figure 7, the vertical axis shows the rate of data received in the cluster in KB/sec, while the horizontal axis shows time.

In Figure 8, the vertical axis shows the rate of data sent in the cluster in KB/sec, while the horizontal axis shows time.
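One way to quantify the relationship between OS load and request latency is the Pearson correlation coefficient over two sampled time series. The following is a minimal sketch of that calculation; the `pearson` helper is an illustrative name, and the sample values are invented for demonstration and are not measurements from the experiments.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical samples (NOT data from the paper): OS load and latency in ms/op.
    os_load = [0.4, 0.9, 1.3, 1.7, 2.0]
    latency = [150.0, 900.0, 1800.0, 3100.0, 3900.0]
    print(round(pearson(os_load, latency), 3))
```

A coefficient near 1 would support a positive relationship of the kind described for Paxos, while a value near 0 would match the absence of correlation described for Raft.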

Figures 7 and 8 show the network received and sent load for one of the nodes in the cluster. In Paxos, received data has an inverse relationship with read/write request latency. The reason is simple: as latency increases, nodes send less data to the network. Consequently, both sent and received data drop when the load is high, and sent data approaches zero as read/write request latency reaches its maximum. In Raft, by contrast, sent and received data have a steadier rate and a more regular pattern, which results from appropriate load balancing. Figure 7 shows that the maximum network receive rate in Raft is far lower and steadier than in Paxos, and Figure 8 shows that the transfer rate in the Raft cluster is likewise steadier than in Paxos.

In Figure 9, the vertical axis shows the number of write requests per second in the cluster, and the horizontal axis shows time. The average number of write requests in Paxos has an ascending trend, while Raft retains a regular pattern. The maximum number of write requests reaches 14/sec in Paxos but 11/sec in Raft, which again points to the unbalanced load of Paxos compared with Raft. It could be inferred that in Paxos read and write requests have an inverse relationship with latency, while in Raft requests follow a fairly constant pattern. With Raft, write requests are more orderly, with up to 11 write requests processed per second, whereas in Paxos the number of write requests varies widely, which can increase pressure on the nodes.
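The notion of a "regular" versus "heterogeneous" request rate can be expressed with the coefficient of variation, the standard deviation divided by the mean. The sketch below is one possible way to compute it; the function name is illustrative and both sample series are invented for demonstration, not taken from Figure 9.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Population standard deviation divided by the mean (0 means perfectly steady)."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(samples) / mean


if __name__ == "__main__":
    # Hypothetical write-request rates per second (NOT data from the paper).
    steady_series = [11, 11, 10, 11, 11]
    uneven_series = [4, 9, 14, 6, 12]
    print(f"steady: {coefficient_of_variation(steady_series):.3f}")
    print(f"uneven: {coefficient_of_variation(uneven_series):.3f}")
```

A lower value indicates a steadier request rate, so comparing the two coefficients gives a single number for the difference in regularity that the figures show visually.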

In Figure 10, the vertical axis shows the number of read requests per second in the cluster, and the horizontal axis shows time. According to Figure 10, the number of read requests in Paxos fluctuates as the system load increases, while the number of read requests in Raft remains constant. With Raft, read request handling is thus more regular, and read request latency improves accordingly. The Paxos panels also show that the number of read requests tends to increase.

In summary, Raft outperformed Paxos on every criterion we measured. The main reason for Raft's superiority was appropriate load balancing, which brought more flexibility, steadiness, lower latency, and higher speed. Together, these advantages led to better performance of the whole system.

5 Conclusion and future work

The need for scalability and availability in distributed database systems led to the emergence of consensus algorithms for managing replicas, because earlier approaches such as master-slave and tripartite replication could not solve the problem for clusters with a large number of nodes. In this paper, we investigated some of the prominent consensus algorithms, such as Chandra-Toueg, Paxos, and Raft. Then, focusing on the effect of read/write request latency on overall performance, we attempted to find the best solution for the Cassandra database system. Experiments indicated that Raft performed best among these algorithms. We therefore tested the Raft algorithm on the Cassandra database, using Opscenter for monitoring and the stress tool for imposing high load on the system. The results showed acceptable load balancing and, consequently, improved system efficiency when Raft was applied.

Cassandra uses queues to put requests in order. One idea for future work is to examine and change the queue management strategies to achieve better performance. Another is to improve the efficiency of the Raft algorithm by partitioning Cassandra clusters.


  • [1] S. Amid and T. Mesri Gundoshmian (2016) Prediction of output energy based on different energy inputs on broiler production using application of adaptive neural-fuzzy inference system. Agriculture Science Developments 5 (2), pp. 14–21. Cited by: §3.8.
  • [2] D. P. Anderson (2004) Boinc: a system for public-resource computing and storage. In Grid Computing, 2004. Proceedings. Fifth IEEE/ACM International Workshop on, pp. 4–10. Cited by: §3.1.
  • [3] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J. Leon, Y. Li, A. Lloyd, and V. Yushprakh (2011) Megastore: providing scalable, highly available storage for interactive services. Cited by: §3.2.
  • [4] A. Bartoli, C. Calabrese, M. Prica, E. A. Di Muro, and A. Montresor (2003) Adaptive message packing for group communication systems. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pp. 912–925. Cited by: §3.1.
  • [5] D. P. Bertsekas and J. N. Tsitsiklis (1989) Parallel and distributed computation: numerical methods. Vol. 23, Prentice hall Englewood Cliffs, NJ. Cited by: §3.1.
  • [6] M. Burrows (2006) The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th symposium on Operating systems design and implementation, pp. 335–350. Cited by: §3.2.
  • [7] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, et al. (2011) Windows azure storage: a highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 143–157. Cited by: §3.2.
  • [8] D. G. Campbell, G. Kakivaya, and N. Ellis (2010) Extreme scale with full sql language support in microsoft sql azure. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 1021–1024. Cited by: §1, §1, §1.
  • [9] M. Cao, A. S. Morse, and B. Anderson (2005) Coordination of an asynchronous multi-agent system via averaging. IFAC Proceedings Volumes 38 (1), pp. 17–22. Cited by: §3.8.
  • [10] E. Caron, F. Desprez, and C. Tedeschi (2007) Enhancing computational grids with peer-to-peer technology for large scale service discovery. Journal of Grid Computing 5 (3), pp. 337–360. Cited by: §3.1.
  • [11] J. Celaya and U. Arronategui (2010) Distributed scheduler of workflows with deadlines in a p2p desktop grid. In 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 69–73. Cited by: §3.1.
  • [12] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber (2008) Bigtable: a distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26 (2), pp. 4. Cited by: §1, §1.
  • [13] J. Charr, R. Couturier, and D. Laiymani (2011) JACEP2P-v2: a fully decentralized and fault tolerant environment for executing parallel iterative asynchronous applications on volatile distributed architectures. Future Generation Computer Systems 27 (5), pp. 606–613. Cited by: §3.1.
  • [14] G. Chmaj and K. Walkowiak (2013) A p2p computing system for overlay networks. Future Generation Computer Systems 29 (1), pp. 242–249. Cited by: §3.1.
  • [15] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni (2008) PNUTS: yahoo!’s hosted data serving platform. Proceedings of the VLDB Endowment 1 (2), pp. 1277–1288. Cited by: §1, §1.
  • [16] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. (2013) Spanner: google’s globally distributed database. ACM Transactions on Computer Systems (TOCS) 31 (3), pp. 8. Cited by: §3.2.
  • [17] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica (2001) Wide-area cooperative storage with cfs. ACM SIGOPS Operating Systems Review 35 (5), pp. 202–215. Cited by: §3.
  • [18] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels (2007) Dynamo: amazon’s highly available key-value store. ACM SIGOPS Operating Systems Review 41 (6), pp. 205–220. Cited by: §1.
  • [19] C. J. Fidge (1987) Timestamps in message-passing systems that preserve the partial ordering. Australian National University. Department of Computer Science. Cited by: §3.1.
  • [20] M. J. Fischer, N. A. Lynch, and M. S. Paterson (1985) Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM) 32 (2), pp. 374–382. Cited by: §3.1, §3.2, Table 1.
  • [21] V. Gazi (2002) Stability analysis of swarms. Ph.D. Thesis, The Ohio State University. Cited by: §3.1.
  • [22] P. Goudarzi, H. T. Malazi, and M. Ahmadi (2016) Khorramshahr: a scalable peer to peer architecture for port warehouse management system. Journal of Network and Computer Applications 76, pp. 49–59. Cited by: §3.1.
  • [23] R. Guerraoui, R. R. Levy, B. Pochon, and V. Quéma (2010) Throughput optimal total order broadcast for cluster environments. ACM Transactions on Computer Systems (TOCS) 28 (2), pp. 5. Cited by: §3.2.
  • [24] N. Hayashibara, P. Urbán, A. Schiper, and T. Katayama (2002) Performance comparison between the paxos and chandra-toueg consensus algorithms. In Proc. Int'l Arab Conference on Information Technology (ACIT 2002), pp. 526–533. Cited by: Figure 2, §3.3, §3.3, §3.3, §3.4, Table 1.
  • [25] M. Herlihy, S. Rajsbaum, and M. Tuttle (2009) An axiomatic approach to computing the connectivity of synchronous and asynchronous systems. Electronic Notes in Theoretical Computer Science 230, pp. 79–102. Cited by: §1.
  • [26] D. Hughes, G. Coulson, and J. Walkerdine (2005) Free riding on gnutella revisited: the bell tolls?. IEEE distributed systems online 6 (6). Cited by: §3.1, §3.8.
  • [27] M. Isard (2007) Autopilot: automatic data center management. ACM SIGOPS Operating Systems Review 41 (2), pp. 60–67. Cited by: §3.2.
  • [28] A. Jadbabaie, J. Lin, and A. S. Morse (2003) Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on automatic control 48 (6), pp. 988–1001. Cited by: §3.1.
  • [29] F. P. Junqueira, B. C. Reed, and M. Serafini (2011) Zab: high-performance broadcast for primary-backup systems. In 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), pp. 245–256. Cited by: §3.8, Table 1.
  • [30] F. P. Junqueira, B. C. Reed, and M. Serafini (2011) Zab: high-performance broadcast for primary-backup systems. In 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), pp. 245–256. Cited by: §3.5.
  • [31] L. Lamport, R. Shostak, and M. Pease (1982) The byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS) 4 (3), pp. 382–401. Cited by: §3.1.
  • [32] L. Lamport (1978) Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21 (7), pp. 558–565. Cited by: §3.1, Table 1.
  • [33] L. Lamport (1998) The part-time parliament. ACM Transactions on Computer Systems (TOCS) 16 (2), pp. 133–169. Cited by: §1, Figure 1, §3.3.
  • [34] B. Lampson (1996) How to build a highly available system using consensus. Distributed Algorithms, pp. 1–17. Cited by: §3.2.
  • [35] B. Liskov and J. Cowling (2012) Viewstamped replication revisited. Cited by: Table 1.
  • [36] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams (1991) Replication in the harp file system. ACM. Cited by: §3.8, Table 1.
  • [37] M. J. Litzkow, M. Livny, and M. W. Mutka (1988) Condor-a hunter of idle workstations. In Distributed Computing Systems, 1988., 8th International Conference on, pp. 104–111. Cited by: §3.1.
  • [38] C. Lucchese, C. Mastroianni, S. Orlando, and D. Talia (2010) Mining@ home: toward a public-resource computing framework for distributed data mining. Concurrency and Computation: Practice and Experience 22 (5), pp. 658–682. Cited by: §3.1.
  • [39] N. A. Lynch (1996) Distributed algorithms. Morgan Kaufmann. Cited by: §3.1.
  • [40] P. J. Marandi, M. Primi, N. Schiper, and F. Pedone (2010) Ring paxos: a high-throughput atomic broadcast protocol. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN), pp. 527–536. Cited by: §3.2.
  • [41] I. Moraru, D. G. Andersen, and M. Kaminsky (2013) There is more consensus in egalitarian parliaments. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 358–372. Cited by: §3.8.
  • [42] L. Moreau (2003) Leaderless coordination via bidirectional and unidirectional time-dependent communication. In Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, Vol. 3, pp. 3070–3075. Cited by: §3.1.
  • [43] J. Nagle (1984) Congestion control in ip/tcp internetworks. Cited by: §3.2.
  • [44] A. Nezarat, M. Raja, and G. Dastghaibifard (2015) A new high performance gpu-based approach to prime numbers generation. World Applied Programming 5 (1), pp. 1–7. Cited by: §3.1.
  • [45] M. Nosrati, M. Nosrati, R. Karimi, and R. Karimi (2016) Energy efficient and latency optimized media resource allocation. International Journal of Web Information Systems 12 (1), pp. 2–17. Cited by: §3.
  • [46] B. Okelo, P. Mogotu, S. Omaoro, and C. Rwenyo (2016) Generalization in metric space. General Scientific Researches 4 (1), pp. 1–4. Cited by: §3.1.
  • [47] R. Olfati-Saber and R. M. Murray (2004) Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on automatic control 49 (9), pp. 1520–1533. Cited by: §3.1.
  • [48] D. Ongaro and J. Ousterhout (2014) In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pp. 305–319. Cited by: §3.7, §3.7.
  • [49] D. Ongaro (2014) Consensus: bridging theory and practice. Ph.D. Thesis, STANFORD UNIVERSITY. Cited by: §1, Table 1.
  • [50] V. N. Padmanabhan and J. C. Mogul (1995) Improving http latency. Computer Networks and ISDN Systems 28 (1), pp. 25–35. Cited by: §3.2.
  • [51] C. Pérez-Miguel, J. Miguel-Alonso, and A. Mendiburu (2013) High throughput computing over peer-to-peer networks. Future Generation Computer Systems 29 (1), pp. 352–360. Cited by: §3.1.
  • [52] N. Preguiça, C. Baquero, P. S. Almeida, V. Fonte, and R. Gonçalves (2010) Dotted version vectors: logical clocks for optimistic replication. arXiv preprint arXiv:1011.5808. Cited by: §3.1.
  • [53] M. Sabaghi, M. Dashtbayazi, and S. Marjani (2016) Dynamic hysteresis band fixed frequency current control. World Applied Programming 6, pp. 1–4. Cited by: §3.
  • [54] P. J. Sadalage and M. Fowler (2012) NoSQL distilled: a brief guide to the emerging world of polyglot persistence. Pearson Education. Cited by: §1.
  • [55] N. Santos and A. Schiper (2013) Optimizing paxos with batching and pipelining. Theoretical Computer Science 496, pp. 170–183. Cited by: §3.2.
  • [56] D. Spitz and S. D. Hunter (2005) Contested codes: the social construction of napster. The information society 21 (3), pp. 169–180. Cited by: §3.1.
  • [57] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, and S. Surana (2002) Internet indirection infrastructure. ACM SIGCOMM Computer Communication Review 32 (4), pp. 73–86. Cited by: §3.1.
  • [58] A. S. Tanenbaum and M. Van Steen (2007) Distributed systems. Prentice-Hall. Cited by: §1.
  • [59] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet (1995) Novel type of phase transition in a system of self-driven particles. Physical review letters 75 (6), pp. 1226. Cited by: §3.1.
  • [60] S. Vinoski (2014) Rediscovering distributed systems.. IEEE Internet Computing 18 (2). Cited by: §3.1.