DxHash: A Scalable Consistent Hash Based on the Pseudo-Random Sequence

07/16/2021
by   Chaos Dong, et al.

Consistent hashing has played a fundamental role as a data router and a load balancer in various fields, such as distributed databases, cloud infrastructure, and peer-to-peer networks. However, existing consistent hashing schemes cannot simultaneously meet all of the requirements of full consistency, scalability, small memory footprint, low update time and low query complexity. Thus, we propose DxHash, a scalable consistent hashing algorithm based on the pseudo-random sequence. For the scenario of distributed storage, two optimizations based on DxHash are proposed. First, Weighted DxHash can adjust the workloads on arbitrary nodes. Second, the Asymmetric Replica Strategy (ARS) combines the replica strategy in distributed storage with the scaleup process to improve the availability of the system and reduce the remapping rate. The evaluation indicates that, compared with the state-of-the-art works, DxHash achieves significant improvements on the 5 requirements. Even with 50% of the nodes failed in a million-node cluster, DxHash still handles 16.5 million queries per second. What's more, the two optimizations both achieve their intended effects.


1 Introduction

When the concept of consistent hashing (CH) was first proposed [5], it quickly became popular and played an essential role as a data router and a load balancer in various fields, such as distributed databases [10], [7], [1], cloud infrastructure [9], [13] and peer-to-peer networks [11], [4]. In these scenarios, CH evenly maps the keys of the load to the backends (or nodes) with consistency. Different from a normal hash function, CH meets two extra requirements. The first is minimal disruption, or Monotonicity, meaning minimal remapping as nodes are arbitrarily removed or added. The second is balance, meaning an equal probability for a key to map to each working node. If a hashing algorithm satisfies both of these requirements, it is consistent. Consistency is indispensable to balance the load and to protect against large-scale data migration caused by topological changes of the cluster.


As CH is used more widely, in addition to the requirement of consistency, more pressing demands have emerged. Up to now, there are five properties used to evaluate whether a CH algorithm is ideal.
Minimal disruption, also known as Monotonicity. As mentioned, minimal disruption means that when an arbitrary node B is removed or added, the altered keys are either remapped from B to other nodes or remapped from other nodes to B. When the altered nodes account for 1/n of the total number of nodes (n is the cluster size), the number of remapped keys must be no more than 1/n of the total number of keys.
Uniform balance. A key has an equal probability of mapping to each working node in the cluster. As a result, the load distribution among nodes is uniform.
Fast lookup. The lookup performance of CH is vital, since the core mission of CH is serving data routing. The ideal CH should complete a query at O(1) complexity.
Low memory footprint. The memory footprint of CH has received attention recently [9]. To provide consistency or good query performance, some existing CH algorithms consume far more memory than the minimum demand. As the number of cluster nodes increases, the large memory footprint in turn degrades performance. The ideal CH should take up O(n) memory, where n is the cluster size.
Low update complexity. Good update performance is necessary for CH. In cloud or distributed storage, nodes are inevitably altered for various reasons, such as node failure and cluster scaling. In networks, nodes join or exit the cluster even more frequently. An ideal CH should provide O(1) complexity to update the mapping when a node is altered.
Many previously proposed methods struggle to overcome the trade-off among the above five requirements. The original CH, named Ring in this paper, achieves minimal disruption, but cannot support both uniform balance and a low memory footprint [5]. The Maglev hash [2], proposed by Google in 2016, is in a similar situation, although its lookup and update performance exceeds Ring's. SACH, proposed in 2021, is a CH algorithm similar to Maglev. Although SACH makes a series of optimizations, its memory footprint and update performance are still not ideal. The Jump Consistent Hash [6] and AnchorHash [8] have relatively ideal performance on the five properties, but the scalability of these two CHes is limited in different ways. In this paper, we propose DxHash, a scalable consistent hash based on the pseudo-random sequence. By rehashing according to the successive items of a pseudo-random sequence, DxHash provides nearly ideal performance that satisfies the five properties mentioned above. In the evaluation, when the cluster scale exceeds 1 million nodes and 50% of the nodes have failed, DxHash can still handle 16.5 million queries per second. Compared to the state-of-the-art work, DxHash has better lookup and update performance and better scalability, with a smaller memory footprint. What's more, we combine the scenario of distributed storage with DxHash to propose two optimizations, which provide more flexible load sharing and higher availability. First, Weighted DxHash can adjust the load on arbitrary nodes for full utilization of hardware resources. Second, the Asymmetric Replica Strategy (ARS) is used to improve the availability of the system and reduce the remapping rate during scaleup.
The rest of the paper is organized as follows. Section 2 provides the background and motivation; in it, some classical and state-of-the-art CH algorithms are compared. Section 3 introduces DxHash in three parts: the algorithm, the implementation, and the complexity proof. Section 4 presents the optimizations based on DxHash, including Weighted DxHash and ARS. In Section 5, we evaluate the performance of DxHash in comparison with existing CH algorithms. Finally, Section 6 concludes the paper.

              Ring[5]  Maglev[2]  JCH[6]  SACH[9]  AnchorHash[8]  DxHash
Disruption    +++      +          +++     +++      +++            +++
Balance       +        +++        +++     ++       +++            +++
Query
Update
Memory
Statelessness
  • The constant v in Ring denotes the number of virtual nodes mapped to a physical node. Ring introduces virtual nodes to balance the workloads, while n denotes the number of physical nodes.

  • M in Maglev denotes the size of the lookup table. To maintain balance when nodes enter or exit, the value of M is recommended to be much larger than n, for example about 100 times n.

  • The updates in JCH are limited because updates are only allowed at the tail node.

  • The updates in SACH are limited because the total number of nodes cannot exceed the initial maximum size. Similar to Maglev, M in SACH denotes the size of the lookup table, which is much larger than n. There are two update complexities corresponding to the two update schemes in SACH.

  • The update limitations in AnchorHash are similar to SACH's: the number of nodes cannot exceed the initial size.

  • The update limitation in DxHash is less restrictive. When the number of nodes reaches the upper bound, DxHash doubles the scale, which remaps about 1/2 of the data. In DxHash, n denotes the number of all nodes while w denotes the number of working nodes.

Table 1: The Comparison of DxHash and Common CH Algorithms

2 Related Work & Motivation

The Karger Hash Ring is the original CH scheme, proposed in 1997 [5]. The Ring maps both the nodes and the keys into a cyclic hash space. In the clockwise direction, the value on the ring increases from 0 to the maximum hash value. Each node is responsible for the keys in the segment that ends with it. The insertion or removal of a node only affects the keys in the segment belonging to that node; the mapping of other keys stays unchanged, so the Ring achieves minimal disruption. However, balance is hard to guarantee, since the segment length varies with the nodes' distribution on the ring. The usual solution is to introduce virtual nodes, which increases the memory footprint [12]. What's more, the update and lookup complexity is O(log n), which is too high. Some work tries to redistribute the data when the load is unbalanced [3]; however, this brings extra data migration and breaks minimal disruption.
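As a minimal illustration (not the paper's implementation), the Ring can be sketched with an ordered map from hash positions to node IDs, where a key is served by the first node position clockwise from its own hash; std::hash stands in for the ring's hash function here.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical hash-ring sketch with virtual nodes; assumes at least one node was added.
struct Ring {
    std::map<uint64_t, int> ring;                       // position on the ring -> physical node ID

    void addNode(int id, int vnodes = 100) {
        for (int v = 0; v < vnodes; ++v)                // each physical node gets `vnodes` positions
            ring[std::hash<std::string>{}(std::to_string(id) + "#" + std::to_string(v))] = id;
    }
    int getNode(const std::string& key) const {
        uint64_t h = std::hash<std::string>{}(key);
        auto it = ring.lower_bound(h);                  // first node position clockwise from the key
        return (it == ring.end()) ? ring.begin()->second : it->second;  // wrap around the ring
    }
};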
MaglevHash, proposed by Google in 2016 [2], is a high-efficiency CH. It maintains a large lookup table in memory in which keys are mapped to table entries by hashing, and the contents of the table are node IDs. Thus, a query can be completed in O(1) complexity. However, for balance, the size of the table is always much larger than the number of nodes, which introduces significant extra memory consumption. Besides, minimal disruption and low update complexity cannot be guaranteed in MaglevHash either.
Jump Consistent Hash (JCH) is an interesting but not fully practical CH algorithm which utilizes the Pseudo-Random Sequence (PRS) [6]. JCH calculates the pseudo-random sequence according to the key and compares it with a given probability to determine which node the key belongs to. For example, consider a cluster with 2 nodes that is about to add a third. JCH calculates the PRS of a key, fetches the second item, normalizes it, and compares the value with 1/3. If the value is smaller, the key needs to migrate to the new node; otherwise the data stays at the original node. JCH satisfies all the demands of CH, but nodes cannot be removed or added arbitrarily: JCH only allows the tail node to be altered, otherwise minimal disruption cannot be guaranteed.
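For reference, the jump algorithm of [6] fits in a few lines; the sketch below follows the loop published in that paper.

#include <cstdint>

// Jump Consistent Hash, following the algorithm published in [6]:
// maps a 64-bit key to a bucket in [0, num_buckets).
int32_t jump_consistent_hash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;          // 64-bit LCG step
        j = (int64_t)((b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1)));
    }
    return (int32_t)b;
}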
Recently, some new CH work has also been proposed. SACH [9] and AnchorHash [8] are two new CH algorithms. SACH uses double hashing, which is similar to Maglev. There are two different update algorithms in SACH: a fast but unbalanced update and a slow but balanced update. SACH updates the allocation with the CH-like fast-update algorithm against backend failures/recoveries, and updates the allocation with the slow-update algorithm for scaling the cluster up/down. Although the performance of SACH is better than Maglev on some workloads, the memory footprint and the update complexity are still unsatisfactory. What's more, in SACH, the data skew increases as the failure rate increases.


AnchorHash is a near-ideal CH algorithm. In AnchorHash, the complexity of a lookup is O(ln(n/w)), where n is the total number of nodes; these nodes are divided into failed nodes and working nodes, and w is the number of the latter. Even at a 90% failure rate, keys can be routed to working nodes in about 5 hashing operations on average. The memory footprint, the consistency and the update complexity all meet the requirements. However, there are two fatal issues in AnchorHash. First, the upper bound of the cluster is fixed and cannot be changed after initialization; when the number of nodes reaches the upper bound, no new node can join the cluster. Second, AnchorHash is strictly stateful and cannot support concurrent updates.
Among the CH algorithms mentioned, AnchorHash is the state-of-the-art work but still has defects. Since the existing CH algorithms each have their own problems, we propose DxHash, a stateless, scalable and consistent hash which meets the five requirements almost perfectly. As Table 1 shows, DxHash uses a pseudo-random sequence to route a key to a working node with expected complexity O(n/w). Since both the memory accesses and the memory footprint of DxHash are smaller, the performance of DxHash is usually better. For example, as our experiments in Section 5.5 show, the lookup rate of DxHash is higher than AnchorHash's even with a 90% failure ratio. Besides lookup, the other performance metrics of DxHash are all better than AnchorHash's. What's more, DxHash has more flexible scalability and supports concurrent updates.

3 DxHash

3.1 DxHash Algorithm

DxHash uses the Pseudo-Random Sequence to map keys to nodes. First of all, we introduce the Pseudo-Random Generator (PRG) and the Pseudo-Random Sequence (PRS).
Pseudo-random generator. Random numbers are widely used in production and computation. However, computers cannot generate truly random numbers, only pseudo-random numbers. A Pseudo-Random Generator (PRG) is a random function with the property that, for a fixed seed, the result is always the same, and for different seeds, the results are evenly distributed over the range. We use R to denote an ideal PRG in this paper.
Pseudo-random sequence. A PRS is generated iteratively by a seed and a PRG. The i-th item in the PRS is generated in the i-th iteration, and we use R_i(s) to denote the i-th item for seed s. Thus, when the seed and the PRG are given, the PRS {R_1(s), R_2(s), ...} is determined. The features of a PRS are (1) for a fixed seed, the corresponding PRS is well-determined, and (2) the different items in a PRS are distributed evenly over the range.
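To make the notation concrete, the following C++ sketch shows one possible PRG and the PRS it induces. The splitmix64-style mix function is our own choice for illustration; the paper does not prescribe a particular generator.

#include <cstdint>
#include <vector>

// Illustrative PRG R: deterministic, maps a seed to a pseudo-random 64-bit value.
uint64_t prg(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// The PRS of a seed s: R_1(s), R_2(s), ... obtained by feeding each output back into R.
std::vector<uint64_t> prs(uint64_t seed, int len) {
    std::vector<uint64_t> seq;
    uint64_t r = seed;
    for (int i = 0; i < len; ++i) {
        r = prg(r);                 // the i-th iteration produces R_i(seed)
        seq.push_back(r);
    }
    return seq;
}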

3.1.1 Lookup

We now explain how the lookup operation for a key works in DxHash. We start with a cluster of 8 nodes, of which 4 are working and 4 have failed. A traditional hash calculates the mapping over the range of the 4 working nodes. However, when a node enters or exits, every key in the cluster may be remapped, which breaks the property of minimal disruption.
To address the issue, DxHash uses the PRS iteratively to find a working node in the range of all the nodes. In this paper, we use N to denote the set of all nodes, W to denote the set of working nodes, and F to denote the set of failed nodes. Obviously, N = W ∪ F. For a given key k, DxHash chooses node b_i as the mapped node, where b_i = R_i(k) mod n, and i is the minimum index that satisfies b_i ∈ W. For the sake of description, we assume the range of the pseudo-random generator is between 0 and n−1 to avoid the modulus operation.

Figure 1: Example of the Queries of Two Keys in a Specific Cluster

For example, the given cluster is shown in the lower portion of Figure 1. Nodes 0, 1, 3 and 5 are working and nodes 2, 4, 6 and 7 have failed, which can be abstracted into an array, called the Cluster State Array (CSA) in this paper. DxHash processes the queries of two keys, denoted K1 and K2, whose PRSes are shown in the upper part of Figure 1. The query of K1 is simple: its first PRS item corresponds to a working node (node 1), so K1 is mapped immediately. The query of K2 is more complex: the first 3 items of the PRS of K2 correspond to 3 failed nodes, so K2 is mapped by the 4th item of the PRS, which is node 3. Obviously, if there are no failed nodes in the cluster, every query gets its result in a single step, meaning that the complexity of a query is O(1). However, as the number of failed nodes increases, so does the search length. In Section 3.3, we prove that the average length of searching for a key depends only on the fraction of failed nodes, not their absolute number.

3.1.2 Update

Figure 2: Example of the Updates of Two Nodes in a Specific Cluster

Updates of nodes can be divided into joins and exits. When a node joins the cluster, DxHash randomly allocates a failed node ID to the newcomer, then removes that node ID from the failed set F and adds it into the working set W. When a node exits the cluster, the corresponding node ID is removed from the working set W and inserted into the failed set F. After the working set changes, the lookup results for the related keys change accordingly.
Consider Figure 1: if we add node 2 into the cluster, we get the state illustrated in Figure 2, and the update is recorded in the CSA. Since node 2 joins the cluster, some data is remapped to this node; for example, K2 on node 3 is remapped to node 2. Then we remove node 1. Since node 1 exits the cluster, some data is remapped from this node to others; for example, K1 is remapped from node 1 to node 2. We can intuitively see that altered nodes only affect the keys remapped to or from them, while other keys are unaffected, which is exactly minimal disruption. In the following, we prove that DxHash achieves minimal disruption and is balanced.
Theorem 1 (Minimal Disruption): DxHash guarantees minimal disruption.
Proof: Given an arbitrary key k, the PRS it generates is fixed, represented as R_1(k), R_2(k), .... We assume the i-th item b_i = R_i(k) is the node that serves k. From the definition of DxHash, we know that all the nodes in the set S = {b_1, ..., b_{i−1}} are failed.
(1) Removal. Assume the newly removed node is r. If r = b_i, then k is on node r and is about to be remapped to another node, which does not break minimal disruption. If r ≠ b_i, then k is not on node r. Since the states of the nodes in S stay failed and are not affected by a removal, k is still mapped to node b_i.
(2) Addition. Assume the newly added node is a. If a ∈ S, there exists some m < i with b_m = a. Note that the possible value of m is not unique, so we use m* to denote the minimum such m. Except for node a, the states of the nodes b_1, ..., b_{m*−1} stay failed, so k is remapped to a, because the m*-th item becomes a working node while the first m*−1 items remain failed. If a ∉ S, the recovery of node a has no effect on the states of the nodes corresponding to the first i−1 items of the PRS, so the mapping of k stays unchanged.
We can see that, no matter whether a node is removed from or added to the cluster, the remapped keys are all moved from or to the altered node; other keys keep their original mapping. So minimal disruption in DxHash is proved.
Theorem 2 (Balance): DxHash guarantees the uniform balance.
Proof: Given a specific state of the cluster, the mapping generated during each pseudo-random iteration is always balanced. Induction is used to prove this conclusion.
Basis. We first prove that at the first iteration, the mapping of keys to the working nodes is even. Since the pseudo-random generator is ideal, the distribution of keys is uniform among all the nodes at the first iteration. So the workload assigned to working nodes in the first iteration is uniform, too. Meanwhile, the keys mapped to failed nodes wait for the next iterations to be remapped.
Induction Hypothesis. Assume the balance holds after every iteration up to the i-th.
Induction Step. Now, let us prove the balance after the (i+1)-th iteration. The keys mapped to failed nodes in the previous iteration need to be remapped in this iteration. As the pseudo-random generator is ideal, the distribution of these keys is uniform, too. Therefore, as long as the pseudo-random generator is ideal, DxHash always guarantees uniform balance.
We have now introduced the algorithm and proved that DxHash is consistent. When W ≠ ∅ and F ≠ ∅, nodes can be arbitrarily removed from or added to the cluster. To ensure the availability of the algorithm, let us discuss the boundary cases. When W = ∅, there is no surviving working node in the cluster, and every iteration for a key to find a working node fails. To keep the algorithm from going into an infinite loop, an upper bound on the number of iterations is introduced: DxHash terminates the pseudo-random sequence calculation after a fixed number of iterations and traverses the CSA instead. If there is still no working node, DxHash returns an error code and stops working.
The situation where all nodes have failed is rare. But the other boundary case may occur frequently: F = ∅, meaning all nodes are working. In this situation, some CH algorithms do not allow a new node to join, such as SACH [9] and AnchorHash [8]. In DxHash, the node can still be added, but this leads to a large number of remappings since the cluster size n changes. To reduce the amount of remapping, DxHash doubles the cluster scale and then adds the new node IDs to the failed set F. In this way, only about half of the keys are remapped, instead of nearly all of them as with traditional hashing. For example, as Figure 3 shows, in the original cluster of size 8 there are 4 failed nodes. After inserting 4 nodes, the cluster is full, and a new addition of a node might cause the entire data set to remap. So DxHash doubles the scale to 16, pushes nodes 9-15 into the failed set and leaves node 8 as the new working node.

Figure 3: Example of the Scaleup of a Specific Cluster

3.2 DxHash Implementation

We have proved the consistency of DxHash, but the practical performance depends on the implementation. So in the following, we introduce the implementation of DxHash.
In our implementation, we use an array to record the states of all nodes, called the Cluster State Array (CSA). The index of the CSA is the node ID, and the content of each entry is a flag bit which represents whether the node is working or not. In this way, each node uses as little as 1 bit of memory, which is far less than almost all existing CH algorithms. Besides the CSA, DxHash maintains a queue as the failed set F. Each item in F is a 32-bit integer representing a failed node ID. Note that in F, ordering is not required. The reason we use a queue as the implementation of F is that a queue can add and remove items at O(1) complexity; a stack could be used as well.

Input: The Cluster State Array CSA
1 F = empty queue;
2 for i = 0 to n-1 do
3       if CSA[i] is failed then
4             F.push(i);
Algorithm 1 Init

Initialization. As shown in Alg. 1, the function Init of DxHash receives a CSA as input. The CSA records which nodes are working and which have failed; its size is the total number of nodes n, and the failed node IDs are pushed into F.

Output: The working node ID
1 b = F.pop();
2 CSA[b] = working;
3 return b;
Algorithm 2 AddNode

AddNode. As shown in Alg. 2, the function AddNode receives no input but returns a node ID. It pops a failed node ID from F and sets the corresponding entry of the CSA to working, which means a failed node goes back to work again. The returned ID is the recovered node ID.

Input: The failed node ID b
1 F.push(b);
2 CSA[b] = failed;
Algorithm 3 RemoveNode

RemoveNode. As shown in Alg. 3, the function RemoveNode receives a node ID. It pushes the node ID into F and sets the corresponding entry of the CSA to failed, which means that node has failed.

Input: The key of data k
Output: A working node ID
1 r = R(k);
2 b = r mod n;
3 while CSA[b] is failed do
4       r = R(r);
5       b = r mod n;
6 return b;
Algorithm 4 GetNode

GetNode. As shown in Alg. 4, the function GetNode receives a key and returns the mapped node. DxHash uses the pseudo-random generator to calculate a PRS for the key and chooses the first item in the PRS that corresponds to a working node. That node ID is returned.
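The following C++ sketch is one possible rendering of Alg. 4, again assuming the illustrative splitmix64-style PRG from Section 3.1; the CSA is represented as a bit vector where true means working.

#include <cstdint>
#include <vector>

uint64_t prg(uint64_t x) {            // same illustrative PRG sketch as before (our assumption)
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// GetNode: walk the key's PRS until it hits a working node (Alg. 4).
uint32_t getNode(uint64_t key, const std::vector<bool>& csa /* true = working */) {
    uint64_t r = prg(key);
    uint32_t b = r % csa.size();
    while (!csa[b]) {                 // failed node: take the next item of the PRS
        r = prg(r);
        b = r % csa.size();
    }
    return b;
}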
Besides the core algorithms, the Scaleup and Scaledown of the cluster are important in DxHash. As mentioned, when the nodes in the CSA are all working and a new node is about to join, the cluster size is doubled. In the implementation, the CSA is doubled and the unused node IDs are pushed into F; after that, the addition of the node is handled by the function AddNode. When the number of working nodes is far less than n, for example less than 1/4 of n, Scaledown can be triggered. First, the nodes whose IDs are not less than the halved size are removed by the function RemoveNode. Then, the CSA is halved and the IDs not less than the halved size are deleted from F. At last, the same number of nodes as were removed earlier are added back into DxHash. We can see that the amount of data remapping brought by Scaledown is larger than that brought by Scaleup, since the former requires extra remove and add operations. So Scaledown is off by default and is suggested to be triggered manually.
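A sketch of the Scaleup step under the same representation (the function and variable names are ours, not the paper's code): the CSA is doubled and every newly created node ID is pushed into the failed queue, ready to be handed out by AddNode later.

#include <cstdint>
#include <queue>
#include <vector>

// Scaleup: double the CSA; all new node IDs start as failed and join the queue F.
void scaleUp(std::vector<bool>& csa, std::queue<uint32_t>& failed) {
    uint32_t old_n = csa.size();
    csa.resize(2 * old_n, false);          // new entries are marked failed
    for (uint32_t id = old_n; id < 2 * old_n; ++id)
        failed.push(id);                   // a later AddNode() pops one of these IDs
}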

3.3 DxHash Complexity Proof

In the following, let us discuss the complexity of DxHash.

3.3.1 Space complexity

First, the space complexity of DxHash is easy to calculate. The data structures used in DxHash are a queue and an array. The size of the array is in direct proportion to the total number of nodes, denoted as n. The size of the queue is in direct proportion to the number of failed nodes, which is less than n. So the space complexity of DxHash is O(n), where n is the total number of nodes.

3.3.2 Time complexity

Now, we turn to the time complexity of all operations in DxHash, including Addition, Removal, Query, Scaleup, and Scaledown.
Addition. In the addition process, both the access of the queue and the CSA can be completed in O(1) time. So the complexity of Addition is O(1).
Removal. The same as Addition, the complexity of Removal is O(1).
Query. The time complexity of a query plays a crucial role in the performance of DxHash. DxHash calculates pseudo-random numbers from a key iteratively until the number corresponds to a working node. Denote the total number of nodes |N| as n and the number of working nodes |W| as w. The probability of hitting a working node in each iteration is w/n and the probability of hitting a failed node is 1 − w/n. We denote w/n as r. Each iteration is thus an independent Bernoulli trial, so the search length follows a geometric distribution. For a key k, the number of pseudo-random number calculations during a query is denoted by X. The probability that a query ends exactly in the i-th iteration is (1 − r)^{i−1} · r, which means that in the first i−1 iterations DxHash gets a failed node and in the i-th iteration DxHash finds a working node. If the query ends in the i-th iteration, the number of pseudo-random number calculations is i. So the expectation of X can be calculated as Formula 1.

E(X) = Σ_{i=1}^{∞} i · (1 − r)^{i−1} · r = 1/r        (1)

In a similar way, the variance of X can be calculated as Formula 2.

Var(X) = (1 − r) / r²        (2)

Substituting r = w/n, we can get:
Theorem 3 (Query Complexity): Fix n and w, and let r = w/n. For a key k, the number of hash operations during a query is denoted by X. The expectation of X is n/w, while the variance of X is n(n − w)/w².
Scaleup. When scaling the cluster up, DxHash doubles the CSA and pushes the unused node IDs into F. These two operations involve n items, so the time complexity of Scaleup is O(n).
Scaledown. Similar to the process of Scaleup, the time complexity of Scaledown is O(n), too.
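As a sanity check of Theorem 3, the following small Monte Carlo sketch (our own test harness, not the paper's, using the illustrative PRG from before) measures the average search length at a 75% failure ratio and compares it with n/w:

#include <cstdint>
#include <cstdio>
#include <vector>

uint64_t prg(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

int main() {
    const uint32_t n = 1024, w = 256;                    // 75% of nodes failed
    std::vector<bool> working(n, false);
    for (uint32_t i = 0; i < w; ++i) working[i] = true;  // nodes 0..w-1 are working

    const uint64_t keys = 1000000;
    uint64_t total_iters = 0;
    for (uint64_t k = 1; k <= keys; ++k) {
        uint64_t r = prg(k);
        uint64_t iters = 1;
        while (!working[r % n]) { r = prg(r); ++iters; } // walk the PRS until a working node
        total_iters += iters;
    }
    // Theorem 3 predicts E(X) = n/w = 4.0 for this configuration.
    std::printf("measured ASL = %.3f, n/w = %.3f\n",
                (double)total_iters / keys, (double)n / w);
    return 0;
}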

4 DxHash in Distributed Storage

In this section, some optimizations of DxHash are introduced for the scenario of distributed storage. It is worth noting that in the hypothetical distributed storage system, the metadata of cluster nodes is centrally managed: all nodes share a view of the topology of the cluster, which is updated according to the failure and recovery of nodes in real time. In fact, many existing distributed storage systems are managed in this way, for example Memcached, Swift and Ceph. Therefore, the spread and load properties [5], which are two important properties of CH, do not need additional discussion.

4.1 Weighted DxHash

In real scenarios, a cluster is usually made up of physical nodes with different performance. Nodes with more memory, more advanced storage devices or a better processor can hold more workload, while nodes with poor hardware can only hold less. For physical nodes with better hardware, it is easy to increase the load by introducing virtual nodes: deploying multiple virtual nodes on a physical node naturally increases its load. For physical nodes with poor hardware, Weighted DxHash is introduced to reduce the workload. In the native DxHash algorithm, the node a key is mapped to is decided by the PRS generated by the key. Weighted DxHash introduces a weight for each node and a hash function denoted as h. The range of both the hash function and the weight values is from 0 to 1. In Weighted DxHash, the conditions to map a key to a node are more restrictive: not only must the i-th item of the PRS be a working node, but the hash value of the i-th item must also be smaller than the weight of that node. A more accurate description is as follows.
The mapping in Weighted DxHash. For a given key k, denote the PRS of k as R_1(k), R_2(k), .... Denote the cluster size as n and the weight of node j as W_j; the weight of a failed node is 0. Then, the node to which k belongs is b_i = R_i(k) mod n for the first item in the PRS that satisfies:

h(R_i(k)) < W_{b_i}
For example, as shown in Figure 4, there is a cluster with 8 nodes, and the PRS of a key k and the weights of the nodes are given. A node with a weight of 0 is failed, so nodes 4, 6 and 7 are failed while the others are working. In native DxHash, node 2 would be chosen as the mapped node of k. In Weighted DxHash, however, node 2 does not meet the requirement because the hash value is larger than the weight of node 2, so the query continues along the PRS and k is finally mapped by the first later item whose hash value falls below the weight of the corresponding working node.

Figure 4: Example of a Query in Weighted DxHash

Time complexity. It is obvious that the probability of accepting a node in each iteration is p = (Σ_{j=0}^{n−1} W_j) / n, where n is the size of the cluster and W_j is the weight of the j-th node. In each iteration there are two calculations: one for the pseudo-random number and one for the hash value. The calculation process is similar to that of native DxHash, so the expectation of the number of calculations per query is:

E(X) = 2/p = 2n / Σ_{j=0}^{n−1} W_j        (3)

It is worth noting that the expectation of the number of keys on node j is

E(k_j) = K · W_j / Σ_{i=0}^{n−1} W_i        (4)

where k_j is the number of keys on the j-th node and K is the total number of keys in the cluster.
Space complexity. In Weighted DxHash, each item in the CSA is not a flag bit but the weight of the corresponding node. So the CSA changes from a 1-bit array to a 4-byte array, while the other structures stay unchanged. Although the memory footprint scales up, the total memory usage of Weighted DxHash is still smaller than that of almost all CH algorithms. For example, AnchorHash, whose memory footprint is far less than Ring's and Maglev's, still needs 16 bytes of memory per node [8].
To adjust the load on different nodes, we introduce Weighted DxHash. For a node with poor hardware or a bad network condition, the weight can be set to a smaller value so that it receives less load. For a node with better hardware, multiple virtual nodes can be set up on the physical node to take full advantage of its hardware resources.
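A sketch of the Weighted DxHash query under the same assumptions as the earlier sketches; h() is a hypothetical hash mapping a PRS item to [0, 1), and the CSA stores the weight of each node (0 for failed nodes).

#include <cstdint>
#include <vector>

uint64_t prg(uint64_t x) {                 // illustrative PRG, as before
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}
// Hypothetical hash h: maps a PRS item to a value in [0, 1).
double h(uint64_t x) {
    return (double)(prg(x ^ 0xA5A5A5A5A5A5A5A5ULL) >> 11) / (double)(1ULL << 53);
}

// csa[j] holds the weight W_j in [0, 1]; failed nodes have weight 0.
uint32_t getNodeWeighted(uint64_t key, const std::vector<float>& csa) {
    uint64_t r = prg(key);
    uint32_t b = r % csa.size();
    while (h(r) >= csa[b]) {               // reject: failed node, or hash value above the weight
        r = prg(r);
        b = r % csa.size();
    }
    return b;
}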

(a) Lookup Rate
(b) Memory Footprint
Figure 5: The Lookup Rate and the Memory Footprint of Karger Ring, MaglevHash, AnchorHash and DxHash

4.2 Asymmetric Replica Strategy

Although DxHash supports scaling up when the number of nodes reaches the upper bound, the amount of remapping is still large: at least half of the total data. To reduce the amount of remapping, we introduce the Asymmetric Replica Strategy. In distributed storage, a replica strategy is indispensable for system reliability; to avoid data loss, a three-replica policy is usually used.
In the Asymmetric Replica Strategy, data is organized as key-value pairs (KV pairs). Each KV pair is saved as three copies, named the first, the second and the third replica in this paper. The key of each copy is suffixed by a cluster size: the key of the first replica is suffixed by the current cluster size, the key of the second replica is suffixed by double the current cluster size, and the key of the third replica is suffixed by quadruple the current cluster size. The mapping process differs accordingly: the mapping of the second replica is managed as if the cluster were doubled, and the mapping of the third replica is managed as if the cluster were quadrupled.
If the set of all nodes in the cluster is N with |N| = n, the three copies of a key k are k|n, k|2n and k|4n (k suffixed by n, 2n and 4n), and the mapped node of each copy is the first item of the corresponding PRS that satisfies:

b = R_i(k|n_j) mod n_j and b ∈ W, where n_j = 2^{j−1}·n for the j-th replica (j = 1, 2, 3)

In the formula, the node ID calculated for the second or third replica may exceed the cluster size; node IDs not in the cluster are uniformly treated as failed nodes. The three copies are asymmetric because their mappings are computed differently and their query performance differs as well: accessing the first replica is obviously fastest. However, when scaling the cluster up, the second replica, which was calculated as if the cluster had double its previous size, takes effect directly. In this situation, the original second and third replicas are promoted to the first and second replicas respectively, and the original first replica is demoted to the third replica, which is recalculated as 8 times the original cluster size. The system is temporarily downgraded from three copies to two, but high availability is guaranteed. What's more, the amount of remapping decreases greatly. First, the second and third replicas stay unmoved, which is 2/3 of the population. Second, 1/8 of the first replica stays unmoved, too. So at most 7/24 of the total data needs to be remapped.
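The following sketch illustrates the ARS placement rule. The PRG, the FNV-style string hash and the "#size" suffix encoding are our assumptions for illustration; the paper only specifies that each replica's key is suffixed by 1x, 2x and 4x the current cluster size and mapped over a space of that size, with IDs outside the real cluster treated as failed.

#include <cstdint>
#include <string>
#include <vector>

static uint64_t prg(uint64_t x) {                       // illustrative PRG, as before
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}
static uint64_t hashKey(const std::string& s) {         // FNV-1a string hash (assumption)
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    return h;
}
// Map a suffixed key over a virtual space of `space` node IDs; only IDs inside the real
// cluster (csa.size()) that are working count as hits.
static uint32_t getNodeAs(uint64_t key, uint32_t space, const std::vector<bool>& csa) {
    uint64_t r = prg(key);
    uint32_t b = r % space;
    while (b >= csa.size() || !csa[b]) { r = prg(r); b = r % space; }
    return b;
}
// Place the three asymmetric replicas of `key` in a cluster whose CSA has size n.
void placeReplicas(const std::string& key, const std::vector<bool>& csa, uint32_t out[3]) {
    uint32_t n = csa.size();
    for (int j = 0; j < 3; ++j) {
        uint32_t space = n << j;                        // n, 2n, 4n
        out[j] = getNodeAs(hashKey(key + "#" + std::to_string(space)), space, csa);
    }
}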
Although the Asymmetric Replica Strategy does reduce the amount of remapping, the query performance for the second and third replicas is affected. Read requests can be served by the first replica, but the performance of strongly consistent writes is necessarily worse than with the native replica strategy. Therefore, whether to use the Asymmetric Replica Strategy should be decided on a case-by-case basis.

Figure 6: The remapping ratio of different hashing methods. The ordinate is the remapping ratio during each increment of the working node number.

5 Evaluation

In this section we test and compare the performance of different CH algorithms, namely Karger Ring, MaglevHash, AnchorHash and DxHash.

5.1 Environment

As shown in Table 2, all experiments are performed on the same commercial machine with an Intel Xeon CPU E5-2620 processor at 2.00 GHz and 32 GB of memory. The operating system is CentOS 7.8, the kernel version is 3.10.0-1127 and the GCC version is 7.3.1. All algorithms are implemented in C++ and compiled by G++.

Processor Intel Xeon CPU E5-2620 0 @ 2.00GHz
Memory 32GB
Operating System CentOS Linux release 7.8.2003 (Core)
Kernel Version 3.10.0-1127.13.1.el7.x86_64
GCC Version 7.3.1 20180303
Table 2: Environment Configuration

5.2 Overview

We test the lookup rate and the memory footprint of 4 CH algorithms, including Karger Ring, MaglevHash, AnchorHash and DxHash. All algorithms are implemented in C++. For Karger Ring and MaglevHash, the balance depends on the memory footprint; for example, in MaglevHash, reaching a 1% hash-space imbalance requires a lookup table roughly 100 times larger than the minimal requirement.
In these experiments, all nodes are working. Figure 5(a) shows the lookup rate of the 4 CH algorithms at cluster sizes of 1K, 10K, 100K and 1M. The abscissa is the cluster size and the ordinate is the key lookup rate, where MKPS stands for million keys per second. From the figure, the lookup performance of Ring is consistently worse than the others because the lookup complexity of Ring is O(log n) while the complexity of the others is O(1). When the cluster size is small, the other three CH algorithms all perform well because their queries only need a few simple hashing operations. As the cluster size grows, the performance of Maglev becomes poor since the lookup table is too large to deliver results quickly. AnchorHash and DxHash have similar performance, which is less affected by the larger cluster size, and DxHash performs better. This is because the memory footprint of DxHash is smaller, so the metadata has a better hit rate in the CPU cache. It is worth noting that, since DRAM is byte-addressable, for the best performance we use one byte rather than one bit of memory to record the state of a DxHash node in this experiment, to avoid extra bit operations.
Now let us calculate the memory footprint of the different algorithms. In Karger Ring, a red-black tree (RBTree) is used to organize the node states. Since a node in the RBTree includes 3 pointers (to the parent, left child and right child), the color and a KV pair, each node takes up 28 bytes of memory. In our implementation, each physical node matches 100 virtual nodes for balance. In Maglev, one entry in the lookup table requires 4 bytes of memory and the number of slots is about 100 times the number of physical nodes to keep balance. In AnchorHash, a node uses 16 bytes of memory, and in DxHash a node needs only 1 bit. We calculate the memory footprint of the 4 CHes for cluster sizes of 1K, 10K, 100K and 1M. The final memory footprint is shown in Figure 5(b). The abscissa is the cluster size and the ordinate is the memory usage in bytes. Maglev and Ring take up quite a lot of memory, since both require over-provisioned space to keep balance: for 10^6 nodes, the memory they require is 2.8 GB and 400 MB, respectively. The memory footprint of AnchorHash is acceptable at 16 MB. But DxHash takes up surprisingly little space, as little as 125 KB, which can be stored entirely in the L2 cache; even when a full byte of memory is used per node for performance, the memory footprint is still only 1 MB. It is obvious that DxHash has the minimal memory usage, far less than any of the other CH algorithms.
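For concreteness, the per-node costs above give the following totals at 10^6 nodes, matching the values plotted in Figure 5(b):

Ring:       28 B per virtual node × 100 virtual nodes per physical node × 10^6 nodes ≈ 2.8 GB
Maglev:     4 B per entry × (100 × 10^6) lookup-table entries ≈ 400 MB
AnchorHash: 16 B per node × 10^6 nodes = 16 MB
DxHash:     1 bit per node × 10^6 nodes ≈ 125 KB (1 MB when one byte per node is used)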

5.3 Balance

In this experiment we evaluate the load balancing of DxHash and compare it against the other CH algorithms. We set the number of working nodes from 100 to 1000. In Karger Ring, one physical node matches 100 virtual nodes. The lookup table size of MaglevHash is 99991, a prime which is about 100 times the total number of nodes. For AnchorHash and DxHash, the total number of nodes is set to 1024; that means if the number of working nodes is 100, the failure rate of the cluster is about 0.9. We randomly generate 10 million keys and calculate the mapped node IDs. The result is shown in Figure 7. We measure the balancing efficiency in terms of the standard deviation of the number of keys across all working nodes. For the convenience of comparison between different cluster sizes, we divide the standard deviation by the mean to get the Coefficient of Variation (CV).


Figure 7: Load balancing efficiency of different hashing methods. The ordinate is the Coefficient of Variation, calculated by dividing the standard deviation by the mean.

From Figure 7, we can see that the balancing efficiency of Karger Ring is poor. The CV reaches 0.1 even though there are 100 virtual nodes per physical node. This poor result is nevertheless consistent with tests in other work [6]. The other algorithms have much better balancing efficiency: their CV grows from 0.003 to 0.01 as the number of working nodes increases from 100 to 1000. This matches expectations. According to the law of large numbers, when the number of keys is much greater than the number of nodes, the frequency converges to the probability, meaning that the number of keys on a node approaches the mean and the variance approaches 0; the closer the number of keys is to the number of nodes, the greater the variance. What's more, the balancing efficiency of MaglevHash also depends on the ratio of the lookup table size to the number of nodes.
In general, Maglev, AnchorHash and DxHash all achieve high balancing efficiency; only Karger Ring has poor load balancing.

5.4 Minimal disruption

In this experiment, we evaluate the disruption of the 4 CH algorithms caused by the addition or removal of nodes. We increase the number of working nodes from 100 to 1000 in steps of 100. Each time we add nodes, we count the number of remapped keys and divide it by the total number of keys to get the remapping ratio. Then we compare the remapping ratio to the ideal one, Δ/(n+Δ), where Δ is always 100 and n is the number of nodes before the addition. Figure 6 shows the result. As expected, of the 4 CH algorithms, Maglev cannot maintain minimal disruption when nodes are added or removed, although the extra disruption from Maglev is not as severe as with traditional hashing. The other 3 CH algorithms basically guarantee minimal disruption.

5.5 Fault Tolerance

In this section, we evaluate the fault tolerance of the CH algorithms. In the former experiments, we found that Karger Ring is not an ideal CH because of its poor performance, low balancing efficiency and large memory footprint. Maglev is a more advanced CH, but it still has a large memory footprint, and when nodes are added or removed, the time to rebuild the lookup table is too long. So we only compare the fault tolerance of AnchorHash and DxHash.
As mentioned, the Average Search Length (ASL) for a key to look up its mapped node depends on the failure ratio. We test the ASL at different failure ratios. In the experiment, the total number of nodes is 1000; we randomly delete 100 nodes at each step until only 100 working nodes remain, so the failure ratio increases from 0 to 0.9. Figure 8 presents the trend of the ASL as the failure ratio increases. The experimental results agree with our theoretical calculation: the lookup complexity of AnchorHash is O(ln(n/w)) [8] and the lookup complexity of DxHash is O(n/w), where n is the total number of nodes and w is the number of working nodes. It seems that AnchorHash has better fault tolerance than DxHash, but the difference only matters under extreme failure conditions where the failure ratio reaches 70% or more. From Figure 8, even when 70% of the nodes have failed, the ASL of a key lookup in DxHash is only 3.4, which is close to the performance of AnchorHash.

Figure 8: The Average Search Length to look up keys. The abscissa represents the failure ratio of 1000 nodes.
(a) Lookup Rate in 1K nodes
(b) Lookup Rate in 1M nodes
Figure 9: The Lookup Rate of AnchorHash and DxHash under different failure ratios and different cluster size. (a) is tested in a cluster with 1K nodes and (b) is tested in a cluster with 1M nodes.

Besides the ASL, we test the lookup performance of the two algorithms under different failure ratios. We repeat the test twice: once in a cluster with 1K nodes and once in a cluster with 1M nodes. From Figure 9(a), comparing the lookup rate at different failure ratios, DxHash always performs better. The first reason is that the memory footprint of DxHash is smaller, so the entire metadata can be cached in the CPU to speed up access. The second reason is the algorithm itself: although DxHash has a longer average search length, for each miss AnchorHash accesses memory 4 times while DxHash accesses it only once. Unless the ASL of DxHash is 4 times longer than that of AnchorHash (corresponding to a 97.5% failure rate), the performance of DxHash is always better. In Figure 9(b), when the cluster size reaches 1M nodes, the gap between the two is even more stark: the memory usage of DxHash is smaller and the number of memory accesses is lower, so the lookup performance of DxHash is clearly better than AnchorHash's. Even when the failure ratio is 0.9, the lookup rate of DxHash is still 6.44 Mkeys/s.

5.6 Weighted DxHash

In this experiment, we test the ASL and the data distribution in Weighted DxHash. The test cluster is composed of 1024 working nodes. We divide these nodes into two halves: the weight of one part is set to 1 (called 1-nodes) and the weight of the other part is varied over a range of values (called n-nodes). After 10 million lookups, we count the average number of accesses per node for the two parts. The Average Search Length of the tests is recorded, too. Figure 10 presents our test results: the avg. access of 1-nodes and n-nodes is measured under the different configurations. The ASL and the avg. access all fit our calculations in Formulas 3 and 4, with an error within 0.1%.

Figure 10: The Average Search Length and the Average Access per Node in different configurations. The cluster size is 1024. The weight of 512 nodes is 1 and the weight of the others is w, which is varied across configurations. We record the avg. access of the nodes with weight 1 and the nodes with weight w, respectively.

5.7 Asymmetric Replica Strategy

In this experiment, we test the remapping ratio of 10 million keys during scaleup. We set the cluster size to 1024, 2048, 4096, 8192 and 16384, respectively, with all nodes working, and then add a new node into the cluster. The scaleup process is triggered: the cluster size is doubled and the unused nodes are set as failed nodes. We count the ratio of remapped keys in DxHash with and without the Asymmetric Replica Strategy (ARS) and compare the two in Figure 11. As expected, the remapping ratio is 0.5 in the original DxHash. With ARS, the remapping ratio is only 0.2916, which is close to 7/24.

Figure 11: The remapping ratio in DxHash with/without the Asymmetric Replica Strategy. In the tests, we add a node into full clusters whose sizes are 1024, 2048, 4096, 8192 and 16384, respectively, and calculate the remapping ratio.

5.8 Discussion

The experimental results show that, compared with the state-of-the-art work, the disruption, the balance, the space complexity, the update complexity and the lookup complexity of DxHash are all equal or better. Although the lookup complexity of DxHash is higher than AnchorHash's, the performance of DxHash is more than enough for real scenarios, where the failure ratio is usually less than 0.9. Compared with AnchorHash [8], DxHash solves the problem of scaleup. What's more, unlike AnchorHash, DxHash is stateless, so it is easy to extend to multi-threaded, highly concurrent scenarios. And since the metadata of DxHash is tiny (125 KB per 1 million nodes), it is well suited to distributed clusters because even full metadata updates have a tiny network overhead.
The biggest drawback of DxHash is the remapping ratio during scaleup. Although the Asymmetric Replica Strategy helps reduce the remapping from 1/2 to 7/24 of all the data, that is still a large amount if the data scale is huge. Luckily, scaleup is infrequent because the upper bound is doubled after each scaleup. It is worth noting that if a large number of nodes join the cluster at the time of scaleup, the remapping ratio stays at 7/24 (or 1/2 without ARS); if the number of working nodes is also doubled during scaleup, the remapping ratio is still the ideal one. So DxHash also applies to elastic storage.

6 Conclusions

This paper presents DxHash, a fast, scalable and flexible consistent hashing algorithm. We provide the algorithm and the implementation of DxHash and prove its complexity. Based on DxHash, we propose two optimizations, Weighted DxHash and the Asymmetric Replica Strategy. We then compare DxHash to other existing CH algorithms in our evaluation. It shows that DxHash can maintain millions of nodes while providing a high key lookup rate, a very low memory footprint, and a small update time upon the addition or removal of nodes. The two optimizations achieve their respective effects. Finally, the code of DxHash and all the tests are open source.

References

  • [1] Chung, C., Koo, J., Im, J., Arvind, and Lee, S. Lightstore: Software-defined network-attached key-value drives. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2019), ASPLOS ’19, Association for Computing Machinery, p. 939–953.
  • [2] Eisenbud, D. E., Yi, C., Contavalli, C., Smith, C., Kononov, R., Mann-Hielscher, E., Cilingiroglu, A., Cheyney, B., Shang, W., and Hosein, J. D. Maglev: A fast and reliable software network load balancer. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16) (Santa Clara, CA, Mar. 2016), USENIX Association, pp. 523–535.
  • [3] Fu, X., Peng, C., and Han, W. A consistent hashing based data redistribution algorithm. In Intelligence Science and Big Data Engineering. Big Data and Machine Learning Techniques (Cham, 2015), X. He, X. Gao, Y. Zhang, Z.-H. Zhou, Z.-Y. Liu, B. Fu, F. Hu, and Z. Zhang, Eds., Springer International Publishing, pp. 559–566.
  • [4] Goel, P., Rishabh, K., and Varma, V. An alternate load distribution scheme in dhts. In 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom) (2017), pp. 218–222.
  • [5] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 1997), STOC '97, Association for Computing Machinery, pp. 654–663.
  • [6] Lamping, J., and Veach, E. A fast, minimal memory, consistent hash algorithm, 2014.
  • [7] Liu, Z., Bai, Z., Liu, Z., Li, X., Kim, C., Braverman, V., Jin, X., and Stoica, I. Distcache: Provable load balancing for large-scale storage systems with distributed caching. In 17th USENIX Conference on File and Storage Technologies (FAST 19) (Boston, MA, Feb. 2019), USENIX Association, pp. 143–157.
  • [8] Mendelson, G., Vargaftik, S., Barabash, K., Lorenz, D. H., Keslassy, I., and Orda, A. Anchorhash: A scalable consistent hash. IEEE/ACM Transactions on Networking 29, 2 (2021), 517–528.
  • [9] Nakatani, Y. Structured allocation-based consistent hashing with improved balancing for cloud infrastructure. IEEE Transactions on Parallel and Distributed Systems 32, 9 (2021), 2248–2261.
  • [10] Shu, J., Chen, Y., Wang, Q., Zhu, B., Li, J., and Lu, Y. Th-dpms: Design and implementation of an rdma-enabled distributed persistent memory storage system. ACM Trans. Storage 16, 4 (Oct. 2020).
  • [11] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and Balakrishnan, H. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking 11, 1 (2003), 17–32.
  • [12] Wang, X., and Loguinov, D. Load-balancing performance of consistent hashing: Asymptotic analysis of random node join. IEEE/ACM Transactions on Networking 15, 4 (2007), 892–905.
  • [13] Wu, C., Sreekanti, V., and Hellerstein, J. M. Autoscaling tiered cloud storage in anna. Proc. VLDB Endow. 12, 6 (Feb. 2019), 624–638.