Index-Based Scheduling for Parallel State Machine Replication

11/26/2019 · by Gang Wu, et al. · Northeastern University

State Machine Replication (SMR) is a fundamental approach to designing fault-tolerant services. However, its requirement for the deterministic execution of transactions often results in single-threaded replicas, which cannot fully exploit the multicore capabilities of today's processors. Therefore, parallel SMR has become a hot topic of recent research. The basic idea behind it is that independent transactions can be executed in parallel, while dependent transactions must be executed in their relative order to ensure consistency among replicas. The dependency detection of existing parallel SMR methods is mainly based on pairwise transaction comparison or batch comparison. These methods cannot simultaneously guarantee both effective detection and concurrent execution. Moreover, their scheduling process cannot itself run concurrently, which introduces extra scheduling overhead. In order to further reduce scheduling overhead and ensure the parallel execution of transactions, we propose an efficient scheduler based on a specific index structure. The index is composed of a Bloom filter and associated transaction queues, which provide efficient dependency detection and preserve the necessary dependency information, respectively. Based on this index structure, we further devise an elaborate concurrent scheduling process. The experimental results show that the proposed scheduler is more efficient, scalable and robust than the comparison methods.

I Introduction

Large-scale online service systems need to ensure high availability and high efficiency of their services. State Machine Replication (SMR) [1, 2] based on various consensus protocols, such as Paxos [3] and PBFT [4], is a common approach to designing fault-tolerant services. Under the SMR model, even if some of the replicas fail, the services remain available through the consistent replicas. SMR achieves strong consistency [5] by requiring every replica to execute the same transactions in the same order: (i) every available replica eventually receives all the same transactions; (ii) all replicas agree on the same order of the transactions received; and (iii) every replica starts from the same initial state and executes the agreed transactions deterministically (i.e., a transaction must guarantee ACID, and its changes to the state of the records are a function of only the initial state of the records and the transaction itself).

As we know, SMR is mainly designed to improve the system's availability rather than its performance [6, 7, 8, 9, 10]. The requirement of sequentially executing the totally ordered transactions (the same order on all replicas) makes it difficult for SMR to take full advantage of multi-core servers. Transactions cannot be executed concurrently in a straightforward way, because the uncertainty of thread scheduling and lock competition would result in nondeterministic execution. However, sequential execution is not a necessary requirement for consistency [2]. In short, dependent transactions (which access the same records) must be handled in the same relative order on each replica to keep consistency, while independent transactions (which access different records) can be executed in parallel, thereby fully utilizing the processor's multi-core capability. Thus, based on transaction semantics, how to use transaction independence to improve the performance of SMR has become a hot research direction [11, 12, 13, 14, 15, 16, 17].

For example, CBASE [11] is a classic parallel replication framework proposed to enhance the performance of the PBFT algorithm. It sets up a scheduler for every replica, which constructs a dependency graph by pairwise detection of dependencies among transactions in their total order. Based on the dependency graph, the scheduler dispatches transactions to idle threads in the thread pool for execution. Once a transaction has been executed by a thread, the scheduler removes it from the graph and responds to the client. The scheduler of CBASE maximizes concurrency among executions while ensuring replica consistency.

However, recent research [15] has shown that, under high workload, determining dependencies among transactions that have not yet been executed becomes a performance bottleneck due to the high overhead of pairwise transaction comparisons. To overcome this problem, batchCBASE [15] determines dependencies by batch comparison rather than comparing a single transaction at a time, which dramatically reduces the number of comparisons. However, it increases the possibility of inter-batch dependencies, and as transactions within each batch are executed sequentially, it loses some of the parallelism for those transactions. In this way, batchCBASE provides a possible trade-off between parallel execution and dependency detection. Moreover, in order to guarantee replica consistency and operation safety, the scheduling processes of CBASE and batchCBASE run in single-threaded mode, which means the scheduler and worker threads cannot access the dependency graph at the same time; this introduces additional overhead to the system.

In summary, parallel SMR schedulers now face four challenges: 1) faster detection of transaction dependencies; 2) not sacrificing any parallelism of the execution; 3) a concurrent scheduling process; and 4) ensuring correctness. In this paper, we propose an efficient scheduler based on a specific index structure to address the above challenges. It consists of a special Bloom filter and a transaction queue for each filter element. With the Bloom filter, the dependencies among transactions can be detected in constant time; the transaction queues maintain the total order relations of the transactions and also simplify the representation of the transaction dependency graph. Moreover, the proposed scheduler supports record-granularity locks with the help of the above-mentioned index structure, thereby supporting a concurrent scheduling process (specifically the insert, remove, and get operations) for transactions. In summary, the proposed method can efficiently solve the performance loss caused by the heavy scheduling overhead of dependency-graph-based comparisons, and it can guarantee execution parallelism under workloads with various dependency rates. To show the proposed model's advantages in throughput, scalability and robustness in comparison with CBASE and batchCBASE, experiments are conducted and analyzed on a database prototype. Furthermore, the consistency among replicas and other scheduling safety propositions are proved formally.

The remainder of this paper is organized as follows. The system model is described in Section II. The parallel SMR models of CBASE and batchCBASE are introduced in Section III. In Section IV, the proposed index-based scheduling approach is described in detail. The experimental results are shown in Section V. Finally, we introduce some related work in Section VI and conclude in Section VII.

II System Model

We assume a general distributed service system model composed of an unbounded set of clients and a bounded set of servers. All servers are replicas of each other and work together to provide highly available services to the clients, where the Paxos protocol is used to ensure consistency. Message transmission among distributed replicas is asynchronous, which allows arbitrary message loss and delay. We assume that replicas follow the fail-stop model and never encounter Byzantine errors, which means the state of each replica is either correct or crashed; hence a system with 2f+1 replicas can tolerate f replicas crashing simultaneously.

The system ensures that if a request message m is sent without failing, all the non-faulty replicas will receive it, and m will eventually be decided in some consensus instance i; we then say that the replica accepts ⟨i, m⟩. The Paxos protocol guarantees that at least a majority of the replicas accept ⟨i, m⟩, and that no replica accepts ⟨i, m'⟩ or ⟨i', m⟩ with m' ≠ m and i' ≠ i. Intuitively, every message exists on most replicas or on none of them. If the messages exist, the order of messages on each replica is exactly the same, i.e., a total order. Every replica must handle messages in this total order.

In our system, the request messages carry transaction requests. According to the Paxos protocol, each transaction has two states in a replica: committed and applied (see Fig. 1). The committed state represents that the transaction has been agreed on consistently by a majority of replicas but has not been executed, and the applied state represents that it has been executed in this replica.

(a) Standard SMR
(b) Parallel SMR
Fig. 1: Standard versus parallel state machine replication

III Parallel SMR

In standard SMR, each replica executes the transactions sequentially following their total order (see Figure 1(a)). Therefore, no matter how many thread resources are available, transactions can only be executed as if they were in a single-threaded replica. With the development of high-speed networks and efficient consensus protocols (e.g., [18][19]), CPU processing efficiency has become the next major performance bottleneck of SMR; it manifests in the fact that transactions are executed much more slowly than they are ordered by consensus. Although the concurrent execution of transactions causes uncertainty, consistency is not broken as long as only independent transactions are executed concurrently (see Figure 1(b)). As we know, two transactions are independent if they operate on different records or if they only read the same records. Conversely, if a record is modified by one transaction and accessed by the other, then these two transactions are said to be dependent or conflicting.
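
The notion of conflict above is easy to state in code. The following is a minimal Go sketch of such an independence test under a simple read/write access model; the types and names are ours for illustration and are not taken from CBASE or any other system discussed here.

```go
package main

import "fmt"

// Op is a single access of a transaction: which record it touches and
// whether it writes it (illustrative model, not the paper's actual API).
type Op struct {
	Record string
	Write  bool
}

// conflict reports whether two transactions are dependent: they access at
// least one common record and at least one of the two accesses writes it.
func conflict(a, b []Op) bool {
	writesA := map[string]bool{}
	readsA := map[string]bool{}
	for _, op := range a {
		if op.Write {
			writesA[op.Record] = true
		} else {
			readsA[op.Record] = true
		}
	}
	for _, op := range b {
		if op.Write && (writesA[op.Record] || readsA[op.Record]) {
			return true // b writes a record that a reads or writes
		}
		if !op.Write && writesA[op.Record] {
			return true // b reads a record that a writes
		}
	}
	return false
}

func main() {
	t1 := []Op{{Record: "x", Write: true}}
	t2 := []Op{{Record: "x", Write: false}}
	t3 := []Op{{Record: "y", Write: true}}
	fmt.Println(conflict(t1, t2)) // true: write-read conflict on x
	fmt.Println(conflict(t2, t3)) // false: disjoint records, independent
}
```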

There have been several attempts (e.g., [11, 14, 15]) to boost SMR with parallel execution by exploiting such transaction dependencies. In this section, CBASE [11] and its improved version batchCBASE [15] are discussed, and the motivation for our method is summarized at the end. More details about other related work can be found in Section VI.

To parallelize the execution of transactions, CBASE sets up a scheduler for each replica. The main part of the CBASE algorithm is shown in Algorithm 1. The core of the scheduler is a dependency graph, which takes transactions as vertices and the dependencies among transactions as directed edges. It keeps the partial order relationship (line 3) between transactions. When accepting a transaction, the scheduler inserts it into the dependency graph (lines 6-8). Based on the dependency graph, the scheduler dispatches free transactions to idle threads (lines 18-19) in the thread pool for execution. Once a transaction has been executed by a thread, the corresponding vertex and edges are removed from the graph (line 20), so that other transactions without predecessor dependencies can be executed next.

(a)
(b)
(c)
Fig. 2: CBASE and batchCBASE dependency graphs. (a) the dependency graph of CBASE. (b) and (c) show how batchCBASE works, i.e., transactions are grouped into batches, so they have to be executed sequentially within a batch even though they are parallelizable; two batches have to be executed sequentially just because some of their transactions conflict. In the end, all of them have to be executed sequentially as in (c), where solid lines represent batch dependencies and dotted lines represent the execution trace.

Figure 2(a) shows how CBASE maintains the partial order of transactions based on the dependency graph. These transactions are agreed on at each replica in a total order sequence, among which several dependent subsequences exist. For transactions in each such dependent subsequence, their position on the dependency graph path is determined by their relative order in the total order; that is, a transaction depends on the dependent transactions that arrive before it. A new transaction needs to be compared with all transactions in the graph to determine its dependencies.

Intuitively, the overhead of building the dependency graph is related to the number of nodes in the graph; for n transactions the pairwise comparison takes O(n²) time. Experiments in batchCBASE [15] confirm that detecting conflicts (dependencies) between transactions is time consuming under heavy workloads. Therefore, batchCBASE is designed to reduce the number of comparisons by packing transactions into batches, as in the example shown in Figure 2(b). Compared with CBASE, the detection overhead of batchCBASE is reduced by a factor related to the size of the batch. Bitmap technology is introduced to detect conflicts between batches: a bitmap of 1,000Kbit is allocated for each batch, and if the intersection of two bitmaps is not empty, the two corresponding batches are determined to have dependencies. Therefore, the time complexity of batchCBASE dependency detection is on the order of c·(n/b)² (pairwise comparison over n/b batches), where c is a constant representing the cost of a bitmap comparison, n is the number of transactions, and b is the size of the batch. However, such a batch-based method has a higher conflict probability between two batches. If the conflict probability between two random transactions is p, then the conflict probability between two batches of size b is 1 − (1 − p)^(b²). Thus, when the batch-based method is applied, the conflict probability increases exponentially with the batch size b.
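
As a back-of-the-envelope check of this exponential growth (our own derivation, assuming every pair of transactions conflicts independently with probability p): two batches of size b conflict unless none of their b × b cross pairs conflict, so

    P_batch = 1 − (1 − p)^(b²) ≈ 1 − e^(−p·b²)   for small p.

For instance, with p = 10⁻⁵ and b = 200, P_batch ≈ 1 − e^(−0.4) ≈ 33%, even though any two individual transactions almost never conflict.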

Since the transactions within each batch of batchCBASE are executed sequentially, the parallelism between transactions is reduced. In addition, if any two transactions from two different batches conflict with each other, the two batches are considered to conflict and have to be executed sequentially as well. As shown in Figure 2(b), when the batch size is 2, the example degenerates into the sequential execution of Figure 2(c).

Moreover, since the internal scheduler operations of CBASE and batchCBASE, i.e., insert, get, and remove, are mutually exclusive, a call to any of these operations locks the whole dependency graph (lines 10, 12) until it finishes. From this perspective, the scheduler runs in single-threaded mode, which introduces extra scheduling overhead.

1:data structures and variables
2: {or Batch for batchCBASE}
3: {dependency graph, or }
4: …
5:The scheduler executes as follows:
6:while accept(do {or Batch for batchCBASE}
7:     dgInsertTrans()
8:procedure dgInsertTrans()
9:     {occupy the whole DG}
10:     …
11:     {release lock of DG}
12:procedure T: dgGetTrans()
13:     … {omitted for simplicity}
14:procedure dgRemoveTrans()
15:     … {similar to }
16:Each worker thread executes as follows:
17:while  do
18:     execute transaction {batchCBASE executes }
19:     dgRemoveTrans()
Algorithm 1 CBASE (batchCBASE) scheduler

To sum up: (i) CBASE has a high overhead of dependency detection; (ii) batchCBASE increases the conflict probability, which makes it highly likely to degenerate into sequential execution; and (iii) the scheduler operations of both run in single-threaded mode. In our opinion, the main reason behind this is that the granularity of the scheduling object is not appropriate. As stated earlier, transactions that access different records must be independent and need not be checked for dependencies. For CBASE, the scheduling granularity is the transaction, which is so fine-grained that each dependency detection must be performed against all the other transactions. For batchCBASE, though the batch granularity is coarse enough to reduce comparisons, it still does not consider the dependencies between transactions when selecting transactions to form a batch. Therefore, a better solution is to organize transactions into a specific index structure according to the records to be accessed beforehand. In this way, both efficient dependency detection and good parallelism can be achieved.

IV Index-based Scheduler Model

We propose a deterministic and efficient parallel SMR scheduler for handling dependencies among transactions and scheduling them to execute concurrently on all available worker threads. The proposed method is dedicated to improving the performance of the scheduler by designing a specific index structure and devising an elaborate concurrent scheduling scheme accordingly.

IV-A Overall idea

The basic idea of the scheduler is as follows:

  • The main part of the index structure is a simplified Bloom filter constructed from a single HashMap. Each key of the HashMap represents one record accessed by the transactions. Hence, the scheduler can determine the dependency between transactions with a single hash, since dependent transactions fall into the same Bloom filter bit, without actually constructing and traversing a dependency graph.

  • The value corresponding to each key of the HashMap is a FIFO queue containing all the transactions accessing the record of the key. Hence, any different transactions at the heads of all transaction queues of the HashMap can be executed concurrently.

  • Based on the above index structure, it is easy to make the scheduler concurrently perform scheduling operations (i.e., insert, remove, get) with record-granularity lock, which can guarantee safety and correctness as well.

Transactions and Records: A transaction is composed of one or more commands together with the records they access. The total order agreed by consensus defines the relative order between any two transactions. For a transaction, we refer to the set of records it accesses as its record set, and for a record, we refer to the set of transactions accessing it as the record's transaction set.

Bloom Filter: The Bloom filter is constructed from a single HashMap. Although a Bloom filter is usually composed of more than one hash function, the only hash used here is that of the HashMap. The reason is that our Bloom filter is used not only for testing the existence of dependencies but also for indexing transaction queues according to the record accessed. This is achieved by letting the record be the key to be hashed and all the transactions accessing it be the corresponding value. Thus, for a transaction, the time complexity of finding all dependent transactions related to one of its records is O(1).

Transaction Queue: In order to provide efficient dependency detection and concurrent execution, all transactions accessing a record are organized in a FIFO (First In First Out) queue, which serves as the value part of our Bloom filter corresponding to that record's key. The transaction queue of a record thus materializes the relative order among the transactions accessing it. For a record, the time complexity of inserting a transaction at the end of its queue or removing a transaction from the head of the queue is O(1). Note that a transaction may exist in several transaction queues because it usually operates on multiple records.

Simplified Dependency Graph: All transaction queues together form a simplified dependency graph, which is consistent in order with the original complete dependency graph but much simpler in structure. Since the dependency relation and the relative order between transactions are both transitive, it is not necessary to explicitly establish the complete order through pairwise comparison of transactions. Therefore, the proposed index structure can effectively reduce the overhead of detection and scheduling. Figure 3 exemplifies the basic idea of dependency graph simplification.

(a)
(b)
Fig. 3: Example of the index structure and the simplified dependency graph. (a) The index structure with four transactions accessing the same record x placed in the same FIFO transaction queue. (b) The transaction queue in (a) only keeps the necessary order between every two adjacent transactions; the transitive edges between non-adjacent transactions in the original dependency graph are omitted naturally.

Free Transaction: In our scheduler, after a transaction is inserted into the scheduler, it is said to be free iff it is at the head of every transaction queue corresponding to its records. If a transaction is free, it can be scheduled for execution. If a transaction is still in a transaction queue, it means that the transaction is under execution or not yet executed, in one word, unfinished.

Fine-grained lock: During the scheduling process, the granularity of operation locking is a record. When the scheduler operates on a transaction in the above index structure, it only locks those transaction queues corresponding to the records of that transaction. Obviously, operations at the same location of the HashMap (i.e., on the same transaction queue) are mutually exclusive, while operations at different locations run concurrently. Thus the maximum concurrency among these operations can be guaranteed.
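
To make the above structure concrete, the following is a minimal Go sketch of such an index: a fixed-size table of buckets, each holding its own lock and a FIFO queue of unfinished transactions. All type and field names here are illustrative assumptions of ours and are not taken from the published fastCBASE code.

```go
package scheduler

import (
	"container/list"
	"hash/fnv"
	"sync"
)

// Transaction records which records a transaction accesses and carries a
// flag used to dispatch it exactly once.
type Transaction struct {
	Records    []string
	dispatched uint32 // set via atomic CAS when the transaction is handed to a worker
}

// bucket is one slot of the simplified Bloom filter: a lock plus the FIFO
// queue of unfinished transactions whose records hash to this slot.
type bucket struct {
	mu    sync.Mutex
	queue *list.List // FIFO of *Transaction
}

// Index is the scheduler's index structure: a fixed-size table addressed by
// the hash of a record. Two different records may map to the same bucket
// (a false positive), which only costs some parallelism, never consistency.
type Index struct {
	buckets []bucket
}

// NewIndex creates an index with the given number of buckets (the "HashMap
// length" in the paper's terms).
func NewIndex(size int) *Index {
	idx := &Index{buckets: make([]bucket, size)}
	for i := range idx.buckets {
		idx.buckets[i].queue = list.New()
	}
	return idx
}

// slot maps a record to its bucket with a single hash, so the transactions
// that may depend on a given record are located in O(1).
func (idx *Index) slot(record string) *bucket {
	h := fnv.New64a()
	h.Write([]byte(record))
	return &idx.buckets[h.Sum64()%uint64(len(idx.buckets))]
}
```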

IV-B Detailed algorithm

Algorithm 2 shows how our scheduler works in detail. The dependency graph is not explicitly defined because the transaction queues effectively replace it. When the system starts, procedure Initialization initializes a HashMap (line 7) and then starts the worker threads that wait to execute transactions (lines 8-10). The length of the HashMap can be less than the number of records. In this case, there is a certain probability that the hash function maps two different records to the same position. Fortunately, such false positives do not violate consistency, because transactions that incorrectly fall into the same transaction queue are simply executed sequentially. Although such false-positive conflicts may occur, false negatives never occur.

Once the scheduler accepts transactions, it inserts them into the index according to their total order (lines 12-14). As stated earlier, a transaction can be scheduled for execution if it does not depend on any other transactions, i.e., if it is free. There are two situations. (i) For a newly accepted transaction, if no dependency is detected, it can be executed directly after being inserted into the transaction queues. (ii) For a transaction in the transaction queues that has not been executed yet, it must be dependent and cannot be executed until the transactions it depends on are all executed and removed. Therefore, unlike CBASE and batchCBASE, our scheduler does not require a separate get operation, but combines it with the insert operation and the remove operation, yielding dgInsertAndGet and dgRemoveAndGet respectively. They are detailed as follows:

1:data structures and variables
2: {transaction}
3:{number of worker threads}
4:{transaction queue}
5:{HashMap}
6:procedure Initialization()
7:     initialize
8:     
9:     for  do{initialize every worker thread}
10:         create and start a worker thread      
11:The scheduler executes as follows:
12:while accept(do{accept from T}
13:     {used for executed exactly once}
14:     dgInsertAndGet(){scheduler inserts }
15:function bool: free()
16:     for  do
17:         {Bloom Filter used as index}
18:         if  then
19:              return               return
20:procedure dgInsertAndGet()
21:     for  do
22:         
23:         Lock()
24:         insert()
25:         if  then
26:                        
27:         Unlock()      
28:     if  then{no dependency after insert}
29:         notify worker threads to execute      
30:procedure dgRemoveAndGet()
31:     for  do
32:         {Bloom Filter used as index}
33:         Lock()
34:          remove()
35:         Unlock()
36:         {candidate next to be executed}
37:         if  then
38:              notify working threads to execute               
39:Each worker thread executes as follows:
40:while  do
41:     execute transaction
42:     dgRemoveAndGet()
Algorithm 2 Index-based scheduler

dgInsertAndGet: The execution order of two dependent transactions is subject to their relative order in the total order; therefore, they cannot run concurrently. A call to dgInsertAndGet consists of two operations: inserting the transaction into the transaction queues corresponding to each of its records (lines 22-24), and determining whether it can be executed now (lines 25-29). According to the previous description of a free transaction, if the transaction appears at the head of all corresponding transaction queues after insertion, it must be free and can be executed immediately, because it is the only transaction in those queues; more intuitively, it has no incoming dependency edges in the dependency graph. Otherwise, it cannot be executed directly and will be scheduled to worker threads later in dgRemoveAndGet. The lock granularity of the operation is a record (lines 23 and 27), i.e., only the transaction queue corresponding to one record of the transaction is locked at a time. It does not lock all its transaction queues at the same time, ensuring maximum concurrency with the dgRemoveAndGet operation.

dgRemoveAndGet: Just like dgInsertAndGet, removing a finished transaction from the index also needs to operate on multiple transaction queues. With the help of the HashMap in our index, the transaction queues corresponding to each of its records can be easily obtained (line 32). In our scheduler, transactions to be executed or finished transactions to be removed are kept at the heads of the corresponding transaction queues, which makes the remove operation efficient. According to the previous description of a free transaction, if a transaction is free, it must appear at the head of all the transaction queues corresponding to its records. Thus, after the finished transaction is removed, the transaction at the head of each affected queue is checked for being free, and free transactions are dispatched to available worker threads to execute next. This check is safe and does not need to acquire the locks of other transaction queues. Both dgInsertAndGet and dgRemoveAndGet achieve the goal of not having to lock all transaction queues; measured by the number and granularity of locks, the operations of the index-based scheduler achieve the maximum concurrency.
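
Continuing the Go sketch from Section IV-A (same hypothetical types; "sync/atomic" is additionally imported), the two operations and the worker loop could look roughly as follows. This is only our reading of Algorithm 2, not the authors' implementation; in particular, the free check below briefly locks each bucket it inspects, whereas the paper argues the check is safe even without those locks.

```go
// isFree reports whether t sits at the head of every queue it belongs to,
// i.e. no unfinished transaction precedes it on any of its records.
func (idx *Index) isFree(t *Transaction) bool {
	for _, r := range t.Records {
		b := idx.slot(r)
		b.mu.Lock()
		front := b.queue.Front()
		atHead := front != nil && front.Value.(*Transaction) == t
		b.mu.Unlock()
		if !atHead {
			return false
		}
	}
	return true
}

// dispatchOnce hands t to a worker exactly once, even if both the insert
// path and the remove path observe it as free at about the same time.
// The execute channel should be buffered generously to avoid blocking.
func dispatchOnce(t *Transaction, execute chan<- *Transaction) {
	if atomic.CompareAndSwapUint32(&t.dispatched, 0, 1) {
		execute <- t
	}
}

// dgInsertAndGet appends t to the queue of every record it accesses
// (assuming t.Records has no duplicates), locking one bucket at a time,
// and dispatches t immediately if it ends up at the head of all its queues.
func (idx *Index) dgInsertAndGet(t *Transaction, execute chan<- *Transaction) {
	free := true
	for _, r := range t.Records {
		b := idx.slot(r)
		b.mu.Lock()
		b.queue.PushBack(t)
		if b.queue.Front().Value.(*Transaction) != t {
			free = false // an earlier transaction on this record is unfinished
		}
		b.mu.Unlock()
	}
	if free {
		dispatchOnce(t, execute)
	}
}

// dgRemoveAndGet removes the finished transaction t from the head of each
// of its queues and dispatches any successor that has thereby become free.
func (idx *Index) dgRemoveAndGet(t *Transaction, execute chan<- *Transaction) {
	for _, r := range t.Records {
		b := idx.slot(r)
		b.mu.Lock()
		b.queue.Remove(b.queue.Front()) // t was dispatched, so it is at the head
		var next *Transaction
		if e := b.queue.Front(); e != nil {
			next = e.Value.(*Transaction)
		}
		b.mu.Unlock()
		if next != nil && idx.isFree(next) {
			dispatchOnce(next, execute)
		}
	}
}

// worker is the executor loop: apply a free transaction, then remove it
// from the index, which may in turn free its successors.
func worker(idx *Index, execute chan *Transaction, apply func(*Transaction)) {
	for t := range execute {
		apply(t)
		idx.dgRemoveAndGet(t, execute)
	}
}
```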

Iv-C Correctness

The key to the design of the scheduler is to ensure the safety of scheduling operations and the consistency of the transaction execution results between replicas. Here, we highlight the deadlock freedom and starvation freedom of our index-based scheduler, as well as replica consistency, from the perspectives of data structure, lock granularity and scheduling strategy. See the appendix for details.

  • Operation safety: With the fine-grained locks, the scheduler is deadlock-free and starvation-free. i) Deadlock-free. First of all, for any two transactions, the index-based scheduler never produces scheduling results in which they wait for each other: within every shared transaction queue they are executed sequentially in FIFO order. Thus deadlocks never occur. ii) Starvation-free. All free transactions are scheduled for execution directly by the scheduler. Non-free transactions can only become free during the following remove operations, in which the transactions they depend on are executed and removed. Hence, the transaction scheduling process is always driven by the insert and remove operations: as long as there are unexecuted transactions in the transaction queues, they will eventually be executed. Thus starvation never occurs.

  • Replica consistency: The transaction queues ensure the order between dependent transactions. Although only the order between any two adjacent transactions is maintained, the transaction queues are still consistent with the complete dependency graph with respect to order. Suppose there is competition for locks, and dgInsertAndGet and dgRemoveAndGet operate on the same records, i.e., the same transaction queues; no matter which operation executes first, the order between the transactions is not affected. In addition, it is important for a transaction to execute only once in order to keep replica consistency. To decide whether a transaction can be executed in both the insert and remove operations, a flag is defined in the transaction; it guarantees that the transaction is executed exactly once even when it appears free in both dgInsertAndGet and dgRemoveAndGet (line 39). From the system perspective, the Paxos protocol guarantees a unique total order across the replicas on which the same index-based scheduler of Algorithm 2 runs; even though the execution speed of each replica may differ, the related records of all replicas reach the same states whenever a transaction becomes applied.

V Experiments

This section introduces the system prototype, the experimental environment, and the goals, methods and conclusions of our experiments.

V-A System prototype

To evaluate the performance of our index-based scheduler, called fastCBASE, we implemented an in-memory database in a client/server service model. The system provides three transaction operations: PUT, GET and DELETE. CBASE, batchCBASE, and our fastCBASE are all implemented on top of it with different schedulers. Clients sequentially send transaction commands, the replicas first agree on a total order of all transactions received, and then the corresponding operations are performed by the specific scheduler. The implementation of Algorithm 1 follows [15]. We have published the source code of Algorithm 1 and our Algorithm 2 online [20].

V-B Environment

Our experimental environment consists of a cluster of four HP nodes. Three of them work as servers, playing the roles of proposer and acceptor in the Paxos protocol; each has two E5-2620 CPUs at 2.10GHz with hyper-threading, 24 hardware threads in total, and 256GB memory. The clients are deployed on the other HP node, which has a four-way E7-4820 CPU at 2.0GHz, 8 cores per socket with hyper-threading, 64 hardware threads in total. The operating systems are all Ubuntu 18.04.2 LTS. The clients in the client node send a large number of transactions to keep the servers fully loaded. All applications are implemented in the Go language, version go1.12.1. The communication within the cluster goes through an ER3200G2 gigabit network switch.

V-C Goals and methods

Since the index-based scheduler is proposed to ensure the maximum concurrency among transactions as well as a lower scheduler load, the main experimental purposes include:

  • the speed-up achieved compared to the state of the art

  • the scalability with a growing number of worker threads

  • the impacts of scheduling overhead

  • the false positives introduced by the Bloom filter

  • the impacts of conflicts on performance of scheduler

For the first point, we evaluate each scheduler's performance under conflict-free workloads and compare our scheduler with CBASE and batchCBASE under the same number of worker threads.

For the second point, we evaluate the performance improvement of our scheduler with an increasing number of threads under conflict-free workloads, and compare it with CBASE and batchCBASE.

For the third point, we analyze the scheduling overhead based on the above experimental results.

For the fourth point, since batchCBASE uses bitwise comparison of two bitmaps for conflict detection and our scheduler uses a Bloom filter, both of them introduce false-positive conflicts. We compare the false positive rates introduced by these two scheduler models under different bitmap (HashMap) sizes.

For the fifth point, we compare the performance changes of our scheduler and batchCBASE under workloads with different conflict rates.

V-D Speed-up analysis

Fig. 4: Threads scalability for contention-free workloads

In order to observe the maximum speed-up of each scheduler, we analyze the throughput of each scheduler for workloads without conflicting transactions, measured as the average throughput per replica per second. Figure 4 shows the system throughput of CBASE, batchCBASE and our fastCBASE without conflicts. Different batch sizes are tested since the batch size of batchCBASE has a significant impact on its performance.

It can be seen that the traditional CBASE has very low performance, because the scheduler has a large overhead in dependency detection, which severely limits the throughput of the whole system. The speed-up of CBASE is poor: as the number of worker threads increases, its performance does not increase significantly. With 16 threads, it only achieves a throughput of about 1000 Trans/s. Even though many worker threads are available, the scheduler cannot fully utilize them.

To improve the performance of the system, batchCBASE sacrifices scheduling freedom (transactions are executed sequentially within a batch) for a lower number of comparisons. It can be seen that the performance of batchCBASE is significantly better than that of CBASE, which effectively solves the problem of the heavy scheduling workload in CBASE.

As can be seen from Figure 4, the performance of batchCBASE increases linearly with the number of threads at 1, 2 and 4 threads, but it does not increase with the batch size, which indicates that in this range neither scheduler is the performance bottleneck of the system; throughput is limited by the number of available worker threads.

Our fastCBASE and batchCBASE have similar performance at 1 and 2 threads, and fastCBASE performs slightly worse than batchCBASE at 4 threads. The reason is that, in order to ensure concurrency between every two transactions, fastCBASE dispatches work to the worker threads at the granularity of a single transaction, whereas batchCBASE executes at the granularity of a batch and needs no scheduling within a batch. Under the conflict-free workload, as the number of available threads grows, this per-transaction dispatching incurs a higher thread synchronization overhead. However, as the number of threads increases, the performance gain of our method is much higher than that of batchCBASE.

While reducing the overhead of comparison, batchCBASE also reduces the scheduling load. For example, for 10,000 transactions with a batch size of 1,000, there are only 10 batches in the scheduler, and the scheduling load is relatively low (so for batchCBASE, the scheduling load of its procedures is not the performance bottleneck of the scheduler). The larger the batch of batchCBASE, the lower the scheduler workload and the better the performance. The speed-up of its performance gradually stabilizes as the batch grows, as shown in Figure 4. Starting from 8 threads, the performance no longer increases linearly with the number of threads for any batch size. This is because, although the number of comparisons is reduced by batching, the scheduling load is still relatively high compared to our scheduler.

Because of our elaborate concurrent scheduling process based on the special index structure, although our scheduler needs to manage each individual transaction, it is still more efficient than batchCBASE. Figure 4 shows that the throughput of our scheduler at 8 and 16 threads is much higher than that of batchCBASE. And unlike batchCBASE, the performance of our method improves nearly linearly with the number of threads, so it has strong scalability.

V-E Conflict rate analysis

Dependency detection based on the index structure of fastCBASE is simple and efficient. If a record of a new transaction hashes to a HashMap location that is already occupied, there is already an unfinished transaction associated with that location in the index. However, if the number of records is greater than the length of the HashMap, the Bloom filter may produce false positives, i.e., different records may be mapped to the same location in the HashMap. Since our Bloom filter is also used as an index, it can only be set up with one hash function. Each batch in batchCBASE corresponds to a bitmap; when comparing two batches, the bitmaps are compared bitwise to determine whether there is a conflict, so false positives may also be generated. We compare the conflict rate of our fastCBASE scheduler using the Bloom filter with that of batchCBASE using one bitmap per batch.

Before evaluating the conflict rate, recall the conflict rate formula given in Section III: the probability of conflict between two unexecuted batches in batchCBASE grows exponentially with the batch size b. Although the conflict rate can be represented accurately, the result is not intuitive, hence simulation experiments for the different schedulers are performed.

In the simulation, unfinished transactions in the scheduler are stored in a transaction queue. For our scheduler, each new transaction generated by the simulation is checked against a Bloom filter (the transaction queue and the Bloom filter used in the simulation differ from the previous ones). The number at each location of the Bloom filter represents the number of unfinished transactions hashed to it. If the Bloom filter location corresponding to the newly generated transaction is not zero, a conflict is counted. When the conflict detection is completed, the oldest transaction in the transaction queue is removed, the corresponding location of the Bloom filter is decreased by 1, and the new transaction is added to the transaction queue. If the new transaction conflicts with all unfinished transactions, the conflict rate is 100%; if it conflicts with none of them, the conflict rate is 0. Therefore, the conflict rate is defined as the proportion of unfinished transactions in the queue that the new transaction conflicts with, for a given period of time or a specific queue length. In our simulation, we use a fixed-length transaction queue to calculate the conflict rate. For batchCBASE, only the conflict detection differs: if at least one common bitmap position is set to 1 in the bitmaps of two batches, a conflict is counted.
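
For concreteness, the following Go sketch shows one way the described simulation could be coded for the fastCBASE side, using a counting filter with a single hash and a fixed-length sliding window of unfinished transactions. It is our own reconstruction of the procedure described above, not the authors' simulation code; since each simulated transaction touches one fresh random record, a random filter slot stands in for the hashed record.

```go
package main

import (
	"fmt"
	"math/rand"
)

// simulateConflictRate keeps a window of queueLen unfinished single-record
// transactions and, for each newly generated transaction, measures the
// fraction of unfinished transactions whose records hash to the same filter
// slot (i.e. the false-positive conflict proportion defined in the text).
func simulateConflictRate(filterSize, queueLen, rounds int) float64 {
	counter := make([]int, filterSize) // counting Bloom filter with one hash
	queue := make([]int, queueLen)     // ring buffer of slot indices of unfinished txs

	for i := range queue { // pre-fill the window
		queue[i] = rand.Intn(filterSize)
		counter[queue[i]]++
	}

	head := 0
	var conflictSum float64
	for i := 0; i < rounds; i++ {
		s := rand.Intn(filterSize) // fresh random record ~ random slot
		conflictSum += float64(counter[s]) / float64(queueLen)
		counter[queue[head]]-- // retire the oldest unfinished transaction
		queue[head] = s        // admit the new one
		counter[s]++
		head = (head + 1) % queueLen
	}
	return conflictSum / float64(rounds)
}

func main() {
	// Roughly the first row of Table I: HashMap size 102400, queue length 10000.
	rate := simulateConflictRate(102400, 10000, 1000000)
	fmt.Printf("estimated fastCBASE conflict rate: %.6f%%\n", 100*rate)
}
```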

In our simulation experiments, a transaction contains only one record, and records are generated randomly, so the probability of generating the same record twice in the simulation is almost zero; the measured conflict rate is therefore mainly caused by false positives. In our scheduler, the conflict rate is mainly affected by the size of the HashMap; the conflict rate of batchCBASE is also affected by the sizes of the bitmap and the batch. We conducted repeated simulations with the length of the transaction queue set to 10,000, i.e., there are on average 10,000 unfinished transactions in the scheduler. The corresponding batchCBASE dependency graph has 50 nodes when the batch size is 200, and 25 nodes when the batch size is 400. We set the HashMap and bitmap sizes to 100K and 1M respectively. The experimental results are shown in Table I.

HashMap size | fastCBASE conflict rate | batchCBASE conflict rate, batch=200 | batchCBASE conflict rate, batch=400
102400       | 0.000984%               | 32.558%                             | 79.332%
1024000      | 0.0000975%              | 3.844%                              | 14.796%
TABLE I: Conflict rate

It can be seen from Table I that under the same configuration, the conflict rate of batchCBASE is more than 10,000 times that of fastCBASE, which meets the expectation of the aforementioned formula. As the size of the HashMap or bitmap increases, the conflict rates of both fastCBASE and batchCBASE decrease, but batchCBASE amplifies the conflict rate due to batching, which also amplifies the false positive rate. Therefore, in practice, even if the actual conflict rate is very low, batchCBASE will still be greatly affected, while the false positive rate introduced by our scheduler hardly affects performance. The next experiments confirm these two aspects.
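
A back-of-the-envelope cross-check (our own calculation, treating the per-pair false-positive probability as roughly p ≈ 1/m for a filter of m slots) is consistent with these measurements: for m = 102400, p ≈ 0.00098%, essentially the measured fastCBASE rate, and plugging the same p into the batch formula 1 − (1 − p)^(b²) from Section III gives about 32% for b = 200 and about 79% for b = 400; for m = 1024000 the same calculation gives about 0.0001%, 3.8% and 14.5%, all close to the corresponding entries of Table I.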

V-F Speed-up analysis for conflict-prone workloads

Fig. 5: Throughput under light conflict rate workload

Fig. 6: Throughput under different conflict rate workloads

Now we analyze the performance of our scheduler and batchCBASE as the conflict rate increases. According to the analysis in the previous section, the configuration that leads to the lowest false positive rate is adopted, that is, the lengths of the HashMap of fastCBASE and the bitmap of batchCBASE are 1M, and the batch size of batchCBASE is 200. Although the actual conflict rate is extremely low, the batch method still amplifies its effect; together with the sequential execution of transactions within a batch, the performance degradation is significantly magnified. As Figure 5 verifies, a slight increase in the conflict rate causes a drastic drop in the performance of batchCBASE. In contrast, the performance of the fastCBASE scheduler does not change significantly at the corresponding conflict rate.

Figure 6 shows that the throughput of our scheduler decreases as the conflict rate increases. When the conflict rate is 10%, only the throughput with 16 threads decreases. This is because, as the conflict rate grows, the parallelism of transaction execution decreases and consequently the utilization of multi-threading is reduced. For the same reason, when the conflict rate is 20%, the throughput with 8 threads begins to decrease. As the conflict rate continues to increase, the performance gain from adding threads is significantly reduced. When the conflict rate exceeds 50%, i.e., more than half of the transactions cannot be executed in parallel, the redundant threads cannot be utilized, and the performance with different numbers of threads is approximately equal.

Based on the results of Figures 5 and 6, our scheduler allows maximum parallelism among transactions. Even when the conflict rate reaches 20%, its performance is still similar to that of batchCBASE under a conflict-free workload. As the conflict rate increases, our scheduler is clearly more robust.

VI Related Work

CBASE [11] and batchCBASE [15] propose to set up a deterministic scheduler on the replica; they are actually late scheduling models [17], which have been described in detail above. Different from CBASE, an early scheduling system model is proposed in [13]. By setting up a client proxy, all client transactions are grouped according to transaction semantics: independent transactions can be allocated to different groups, while dependent transactions must be allocated to the same group. Each group of transactions is sent to all servers by atomic broadcast, and a server-side proxy maps groups to specific threads. A transaction may conflict with transactions in multiple groups; therefore, synchronization among groups is required to ensure that the transaction is executed only once. To optimize the thread scheduling process of [13], a multi-objective programming model [14] is proposed to maximize parallelism and minimize execution time, with the constraint of preserving the relative order among the transactions. However, achieving the optimal scheduling results requires high time complexity. Therefore, whether there exists an optimization model that combines early scheduling and high concurrency is still an open question.

Eve [12] implements deterministic parallelism through a scheduler called the mixer, which groups requests into batches; replicas execute the transactions of a batch in parallel in a speculative manner. After the batch execution, the validity of the replica states is checked during a verification phase. If too many replicas are inconsistent, the replicas roll back to the previous verified state and re-execute the commands sequentially. Eve thus follows an execute-verify process. Unlike Eve, Storyboard [21] enhances SMR through a prediction mechanism that forecasts the order of locks across replicas based on application-specific knowledge. When the prediction is correct, the transactions can be executed in parallel. If the prediction does not match the execution path of the transactions, the replica must establish a deterministic execution sequence with the other replicas through the consensus protocol; in this case, Storyboard stops the current execution and re-predicts the execution path, and all replicas re-execute the transactions based on the new path.

Rex [22] follows an execute-agree-follow model, in which a primary machine first executes transactions concurrently and records nondeterministic decisions in a partial-order trace; the trace is then sent to the secondary machines, which replay it concurrently and thereby stay consistent with the primary. Rex detects the relationships among transactions through their competition for locks and encapsulates this detection mechanism into C++ synchronization primitives, so any application developed with these primitives can generate traces. Traces can be generated only after the transactions on the primary have been executed, so replication on the secondaries is a passive replication process, which incurs a higher reconfiguration cost when the primary goes down.

VII Conclusion

To achieve high performance, parallel state machine replication requires an elaborate design that executes independent transactions in parallel and dependent transactions in their relative order. To achieve this goal, efficient and correct dependency detection and scheduling strategies are needed. The existing models cannot strike a good balance among these aspects; their advantages come with corresponding weaknesses, so their schedulers tend to become the performance bottleneck of the system. In this paper, an efficient scheduler based on a specific index structure is designed to detect dependencies, express partial order relations and schedule transactions. It ensures the maximum parallelism of execution between transactions to fully exploit multi-core processors, while keeping consistency among replicas.

References

  • [1] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Communications of the ACM, vol. 21, no. 7, pp. 558–565, 1978.
  • [2] F. B. Schneider, “Implementing fault-tolerant services using the state machine approach: A tutorial,” ACM Computing Surveys (CSUR), vol. 22, no. 4, pp. 299–319, 1990.
  • [3] L. Lamport, “The part-time parliament,” ACM Transactions on Computer Systems (TOCS), vol. 16, no. 2, pp. 133–169, 1998.
  • [4] M. Castro and B. Liskov, “Practical byzantine fault tolerance and proactive recovery,” ACM Transactions on Computer Systems (TOCS), vol. 20, no. 4, pp. 398–461, 2002.
  • [5] M. P. Herlihy and J. M. Wing, “Linearizability: A correctness condition for concurrent objects,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 12, no. 3, pp. 463–492, 1990.
  • [6] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, “Zookeeper: Wait-free coordination for internet-scale systems.” in USENIX annual technical conference, vol. 8, no. 9.   Boston, MA, USA, 2010.
  • [7] D. Ongaro and J. Ousterhout, “In search of an understandable consensus algorithm,” in 2014 USENIX Annual Technical Conference (USENIXATC 14), 2014, pp. 305–319.
  • [8] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster management at google with borg,” in Proceedings of the Tenth European Conference on Computer Systems.   ACM, 2015, p. 18.
  • [9] K. Shvachko, H. Kuang, S. Radia, R. Chansler et al., “The hadoop distributed file system.” in MSST, vol. 10, 2010, pp. 1–10.
  • [10] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum et al., “The case for ramclouds: scalable high-performance storage entirely in dram,” ACM SIGOPS Operating Systems Review, vol. 43, no. 4, pp. 92–105, 2010.
  • [11] R. Kotla and M. Dahlin, “High throughput byzantine fault tolerance,” in International Conference on Dependable Systems and Networks, 2004.   IEEE, 2004, pp. 575–584.
  • [12] M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin, “All about eve: execute-verify replication for multi-core servers,” in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012, pp. 237–250.
  • [13] P. J. Marandi, C. E. Bezerra, and F. Pedone, “Rethinking state-machine replication for parallelism,” in 2014 IEEE 34th International Conference on Distributed Computing Systems.   IEEE, 2014, pp. 368–377.
  • [14] E. Alchieri, F. Dotti, and F. Pedone, “Early scheduling in parallel state machine replication,” arXiv preprint arXiv:1805.05152, 2018.
  • [15] O. M. Mendizabal, R. S. T. D. Moura, F. L. Dotti, and F. Pedone, “Efficient and deterministic scheduling for parallel state machine replication,” in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017, pp. 748–757.
  • [16] E. Alchieri, F. Dotti, O. M. Mendizabal, and F. Pedone, “Reconfiguring parallel state machine replication,” in 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS).   IEEE, 2017, pp. 104–113.
  • [17] E. Alchieri, F. Dotti, P. Marandi, O. Mendizabal, and F. Pedone, “Boosting state machine replication with concurrent execution,” in 2018 Eighth Latin-American Symposium on Dependable Computing (LADC), Oct 2018, pp. 77–86.
  • [18] L. Lamport, “Fast paxos,” Distributed Computing, vol. 19, no. 2, pp. 79–103, 2006.
  • [19] P. J. Marandi, M. Primi, N. Schiper, and F. Pedone, “Ring paxos: A high-throughput atomic broadcast protocol,” in 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).   IEEE, 2010, pp. 527–536.
  • [20] (2019) The fastCBASE website. [Online]. Available: https://github.com/kisisjrlly/PSMR.git
  • [21] R. Kapitza, M. Schunter, C. Cachin, K. Stengel, and T. Distler, “Storyboard: Optimistic deterministic multithreading.” in HotDep, 2010.
  • [22] Z. Guo, C. Hong, M. Yang, D. Zhou, L. Zhou, and L. Zhuang, “Rex: Replication at the speed of multi-core,” in Proceedings of the Ninth European Conference on Computer Systems.   ACM, 2014, p. 11.

Appendix

Definition 1 (Total order).

A transaction sequence is a pair (T, <), where T is a set of transactions and < is an irreflexive and antisymmetric total order (this total order represents the Paxos functionality).

Definition 2 (Conflict, dependency relation).

Two transactions conflict if they access at least one common record and at least one of them modifies it. Given a conflict relation among transactions, the dependency relation is the transitive closure of the conflicts ordered by the total order <, so it is a partial order.

Definition 3 (Dependency Graph).

Given the transactions inserted into the scheduler, the dependency graph DG takes them as vertices, and every two adjacent transactions in the same transaction queue are connected by an edge in their relative order; by construction, the transitive closure of these edges is equivalent to the dependency relation.

Replica consistency 1. A transaction is executed exactly once. Suppose a dgInsertAndGet of one transaction and a dgRemoveAndGet of another transaction involve some common records, i.e., some common transaction queues (when the remove operation is over, the removed transaction's successor is at the head of its queue (line 39)). In this case, the transaction detected as executable in these two operations may actually be the same transaction.

We define dgInsertAndGet = (1⃝ 2⃝), where 1⃝ represents the insert operation (lines 22-24 of Algorithm 2, and similarly below), 2⃝ represents the detection operation after insert (lines 25-29), and the parentheses mean that all operations within them are protected by locks (lines 23, 27 and lines 33, 35); similarly, dgRemoveAndGet = (3⃝ 4⃝), where 3⃝ represents the remove operation (lines 32-35) and 4⃝ represents the detection after remove (lines 36-38). The detection operation 2⃝ of dgInsertAndGet happens after operation 1⃝ has been performed on the last record of the transaction. If detection operation 4⃝ succeeds on the transaction queue corresponding to a shared record, the earlier detection in dgInsertAndGet must have failed, because the remove operation on a later transaction queue had not yet been executed. In this most extreme case, all possible orders of processing are the interleavings of these locked operations. According to Algorithm 2, only one of the two detection operations will schedule the transaction to be executed. In other cases, safety is guaranteed by the fact given below.

Replica consistency 2. The dependency graph (DG) is a directed acyclic graph (DAG). According to Definition 3, DG keeps the relative order between dependent transactions. Since the dependency relation is a partial order, DG is a DAG.

Operation safety 1. No deadlock. According to Replica consistency 2, DG is acyclic, because there are no transactions in DG that depend on each other cyclically. Since both dgInsertAndGet and dgRemoveAndGet operate on the records in FIFO order, there is always a transaction at the head of every non-empty transaction queue, which means that it does not depend on others and is free to be executed.

Operation safety 2. No starvation. dgInsertAndGet inserts transactions at the end of the queues, and the detection of dgRemoveAndGet checks the transactions at the head. Based on Replica consistency 1, if a transaction is not dispatched at insertion time, it will be detected by operation 4⃝ once the transactions it depends on are removed; if it is dispatched in operation 2⃝, its flag is set so that the detection in operation 4⃝ (line 37) does not dispatch it again. As each transaction has a position in the total order and there is no deadlock, every transaction will be executed eventually.

Replica consistency 3. Since the total order is the same for all replicas, the execution respects the relative order of dependent transactions, and there is no deadlock and no starvation, all transactions are executed exactly once. Thus all the replicas will have identical states after every transaction in the total order has finished.