1 Introduction
When the concept of consistent hashing (CH) was first proposed [5], it quickly became popular and has played an essential role as a data router and load balancer in various fields, such as distributed databases [10], [7], [1], cloud infrastructure [9], [13] and peer-to-peer networks [11], [4]. In these scenarios, CH evenly maps the keys of the load to the backends (or nodes) with consistency. Unlike a normal hash function, CH must meet two extra requirements. The first is minimal disruption, or monotonicity, meaning minimal remapping as nodes are arbitrarily removed or added. The second is balance, meaning an equal probability for a key to map to each working node. A hashing algorithm that satisfies both of these requirements is consistent. Consistency is indispensable to balance load and to protect against the large-scale data migration caused by topological changes of the cluster.
As CH is widely used, more pressing demands beyond consistency have emerged. Up to now, five properties are used to evaluate whether a CH algorithm is ideal.
Minimal disruption, also known as monotonicity. As mentioned, minimal disruption means that when an arbitrary node B is removed or added, the altered keys are either remapped from B to other nodes or remapped from other nodes to B. When the number of altered nodes is 1/n of the total number of nodes (n is the cluster size), the number of remapped keys must be 1/n of the total number of keys.
Uniform balance. A key has an equal probability to map to each working node in the cluster. As a result, the load distribution among nodes is uniform.
Fast lookup. Lookup performance is vital since the core mission of CH is serving data routing. An ideal CH should complete a query in O(1) complexity.
Low memory footprint. The memory footprint of CH has attracted attention recently [9]. To provide consistency or good query performance, some existing CH algorithms consume far more memory than the minimum demand. As the number of cluster nodes increases, the large memory footprint in turn degrades the performance. An ideal CH should take up O(n) memory, where n is the cluster size.
Low update complexity. Good update performance is necessary for CH. In cloud or distributed storage, nodes are inevitably altered for various reasons, such as node failures and cluster scaling. In peer-to-peer networks, nodes join or exit the cluster even more frequently. An ideal CH should provide O(1) complexity to update the mapping when a node is altered.
Many previously proposed methods struggle with the tradeoff among the above five requirements. The original CH, named Ring in this paper, achieves minimal disruption, but cannot support both uniform balance and a low memory footprint [5]. The Maglev hash [2], proposed by Google in 2016, is in a similar situation, although its lookup and update performance is better than Ring's. SACH, proposed in 2021, is a CH algorithm similar to Maglev. Although SACH makes a series of optimizations, its memory footprint and update performance are still not ideal. Jump consistent hash [6] and AnchorHash [8] perform relatively well on the five properties, but the scalability of these two CHes is limited in different respects. In this paper, we propose DxHash, a scalable consistent hash based on pseudo-random sequences. By rehashing according to the successive items of a pseudo-random sequence, DxHash provides nearly ideal performance that satisfies the five properties above. In our evaluation, when the cluster scale exceeds 1 million nodes and 50% of the nodes have failed, DxHash can still handle 16.5 million queries per second. Compared to the state-of-the-art, DxHash has better lookup and update performance and better scalability, with a smaller memory footprint. Moreover, we combine DxHash with distributed-storage scenarios to propose two optimizations, which provide more flexible load sharing and higher availability. First, Weighted DxHash can adjust the load on arbitrary nodes for full utilization of hardware resources. Second, the Asymmetric Replica Strategy (ARS) improves system availability and reduces the remapping rate when scaling up.
The rest of the paper is organized as follows. Section 2 provides the background and motivation, and compares some classical and state-of-the-art CH algorithms. Section 3 introduces DxHash in three parts: the algorithm, the implementation, and the complexity proofs. Section 4 presents the optimizations based on DxHash, including Weighted DxHash and ARS. In Section 5, we evaluate the performance of DxHash against existing CH algorithms. Finally, Section 6 concludes the paper.
Table 1: Comparison of the CH algorithms.

                Ring [5]      Maglev [2]  JCH [6]  SACH [9]  AnchorHash [8]  DxHash
Disruption      +++           +           +++      +++       +++             +++
Balance         +             +++         +++      ++        +++             +++
Query           O(log(kn))^1  O(1)        O(ln n)  O(1)      O(1 + ln(a/w))  O(a/w)
Update          O(log(kn))    ^2          ^3       ^4        O(1)^5          O(1)^6
Memory          O(kn)         O(M)        O(1)     O(M)      O(a)            O(a)
Statelessness   ✓             ✓           ✓                                  ✓

^1 The constant k in Ring denotes the number of virtual nodes mapped to one physical node; Ring introduces virtual nodes to balance the workload, while n denotes the number of physical nodes.
^2 M in Maglev denotes the size of the lookup table. To maintain balance when nodes enter or exit, M is recommended to be much larger than n, for example M ≈ 100n.
^3 Updates in JCH are limited because updates are only allowed at the tail node.
^4 Updates in SACH are limited because the total number of nodes cannot exceed the initial maximum size. Similar to Maglev, M in SACH denotes the size of the lookup table, which is much larger than n. There are two update complexities, corresponding to the two update schemes in SACH.
^5 Updates in AnchorHash are limited similarly to SACH: the number of nodes cannot exceed the initial size.
^6 The update limitation in DxHash is less restrictive: when the number of nodes reaches the upper bound, DxHash doubles the scale, which migrates 1/2 of the data. In DxHash, a denotes the number of all nodes and w the number of working nodes.
2 Related Work & Motivation
The Karger hash Ring is the original CH scheme, proposed in 1997 [5]. Ring maps both the nodes and the keys into a cyclic hash space. In the clockwise direction, the value in the ring increases from 0 to the maximum hash value. Each node is responsible for the keys in the segment that ends at it. The insertion or removal of a node only affects the keys in the segment belonging to that node; the mapping of other keys stays unchanged, so Ring achieves minimal disruption. However, balance is hard to guarantee since the segment length varies with the nodes' distribution on the ring. The common solution is to introduce virtual nodes, which raises the memory footprint [12]. Moreover, the update and lookup complexity is O(log n), which is relatively high. Some works try to redistribute the data when the load is unbalanced [3], but this brings extra data migration and breaks minimal disruption.
MaglevHash, proposed by Google in 2016 [2], is a high-efficiency CH. It maintains a large lookup table in memory: keys are mapped to table entries by hashing, and the contents of the entries are node IDs. Thus, a query can be completed in O(1) complexity. However, to keep balance, the size of the table is always much larger than the number of nodes, which introduces significant extra memory consumption. Besides, minimal disruption and low update complexity cannot be guaranteed in MaglevHash either.
Jump Consistent Hash (JCH) is an interesting but less practical CH algorithm that utilizes a Pseudo-Random Sequence (PRS) [6]. JCH calculates a pseudo-random sequence from the key and compares each item with a given probability to determine which node the key belongs to. For example, consider a cluster with 2 nodes that is about to add a third. JCH calculates the PRS of a key, fetches the second item, normalizes it, and compares the value with 1/3. If the value is smaller, the key is migrated to the new node; otherwise the data stays at the original node. JCH satisfies all the demands of CH except that nodes cannot be removed or added arbitrarily: JCH only allows the tail node to change, otherwise minimal disruption cannot be guaranteed.
Recently, some new CH work has been proposed. SACH [9] and AnchorHash [8] are two new CH algorithms. SACH uses double hashing, similar to Maglev. There are two update algorithms in SACH: a fast but unbalanced update and a slow but balanced update. SACH updates the allocation with the CH-like fast-update algorithm on backend failures/recoveries, and with the slow-update algorithm when scaling the cluster up/down. Although SACH performs better than Maglev on some workloads, its memory footprint and update complexity are still unsatisfactory. Moreover, in SACH, the data skew increases as the failure rate increases.
AnchorHash is a near-ideal CH algorithm. In AnchorHash, the expected complexity of a lookup is O(1 + ln(a/w)). Here, a is the total number of nodes; these nodes are divided into failed nodes and working nodes, and w is the number of the latter. Even at a 90% failure rate, keys can be routed to working nodes within about 5 hash computations on average. The memory footprint, the consistency and the update complexity all meet the requirements. However, there are two fatal issues in AnchorHash. First, the upper bound of the cluster size is fixed and cannot be changed after initialization; when the number of nodes reaches the upper bound, new nodes cannot join the cluster any more. Second, AnchorHash is strictly stateful and cannot support concurrent updates.
Among the CH algorithms mentioned, AnchorHash is the state-of-the-art but still has defects. Since the existing CH algorithms each have their own problems, we propose DxHash, a stateless, scalable consistent hash that meets the five requirements almost perfectly. As Table 1 shows, DxHash uses a pseudo-random sequence to route a key to a working node, with expected complexity O(a/w). Since both the memory accesses and the memory footprint of DxHash are smaller, its performance is usually better. For example, as our experiments in Section 5.5 show, the lookup rate of DxHash is higher than AnchorHash even at a 90% failure ratio. Beyond lookups, the other performance metrics of DxHash are all better than AnchorHash's. Moreover, DxHash has more flexible scalability and supports concurrent updates.
3 DxHash
3.1 DxHash Algorithm
DxHash uses a Pseudo-Random Sequence to map keys to nodes. First of all, we introduce the Pseudo-Random Generator (PRG) and the Pseudo-Random Sequence (PRS).
Pseudo-random generator. Random numbers are widely used in production and computation. However, computers cannot generate truly random numbers, only pseudo-random numbers. A Pseudo-Random Generator (PRG) is a function with two features: for a fixed seed, the result is always the same, and for different seeds, the results are evenly distributed over the range. We use R to denote an ideal PRG in this paper.
Pseudo-random sequence. A PRS is generated iteratively from a seed and a PRG. The i-th item of the PRS is generated in the i-th iteration; we use R_i(s) to denote the i-th item for seed s. Thus, when the seed s and the PRG R are given, the PRS is {R_1(s), R_2(s), R_3(s), ...}. The features of a PRS are (1) for a fixed seed, the corresponding PRS is well determined, and (2) the different items in a PRS are distributed evenly over the range.
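As a concrete illustration, the two definitions above can be sketched in a few lines of Python. The LCG constants and function names here are our own illustrative choices, not part of DxHash itself:

```python
# A minimal sketch of a PRG and a PRS, assuming a 64-bit LCG
# (Knuth's MMIX constants) as a stand-in for the ideal PRG R.
MASK = (1 << 64) - 1

def prg(x: int) -> int:
    """Deterministic pseudo-random step: same input, same output."""
    return (x * 6364136223846793005 + 1442695040888963407) & MASK

def prs(seed: int, length: int) -> list:
    """The first `length` items R_1(seed), ..., R_length(seed) of the PRS."""
    items, x = [], seed & MASK
    for _ in range(length):
        x = prg(x)          # the i-th iteration yields the i-th item
        items.append(x)
    return items
```

Calling `prs(42, 4)` twice yields the identical sequence, while a different seed yields a different one, matching features (1) and (2).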
3.1.1 Lookup
We now explain how the lookup of a key works in DxHash. Consider a cluster with 8 nodes, of which 4 are working and 4 have failed. A traditional hash would compute the mapping over the range of the 4 working nodes. However, when a node enters or exits, every key in the cluster may then be remapped, which breaks the property of minimal disruption.
To address the issue, DxHash searches the PRS iteratively for a working node over the range of all the nodes. In this paper, we use A to denote the set of all nodes, W the set of working nodes, and F the set of failed nodes. Obviously, A = W ∪ F. For a given key k, DxHash chooses node R_i(k) as the mapped node, where i is the minimum index that satisfies R_i(k) ∈ W. For the sake of description, we assume the range of the pseudo-random generator is between 0 and |A| − 1 to avoid the modulus operation.
For example, the given cluster is shown in the lower portion of Figure 1. Nodes 0, 1, 3, 5 are working and nodes 2, 4, 6, 7 are failed, which is abstracted into an array, called the Cluster State Array (CSA) in this paper. DxHash processes the queries of two keys, denoted K1 and K2; their PRSes are shown in the upper part of Figure 1. The query of K1 is simple: the first item of its PRS already corresponds to a working node. The query of K2 is more complex: the first 3 items of the PRS of K2 correspond to 3 failed nodes, so K2 is mapped by the 4th item of its PRS, which is node 3. Obviously, if there are no failed nodes in the cluster, all queries get their results in one search, meaning the query complexity is O(1). However, as the number of failed nodes increases, so does the search length. In Section 3.3, we prove that the average search length of a key depends only on the fraction of failed nodes, not their absolute number.
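The lookup just described can be sketched as follows. We assume an illustrative 64-bit LCG as the PRG and reduce each item modulo the CSA size rather than assuming the generator's range equals the cluster size:

```python
# Sketch of the DxHash lookup; the LCG and iteration bound are illustrative.
MASK = (1 << 64) - 1

def prg(x: int) -> int:
    """Stand-in 64-bit LCG used as the PRG."""
    return (x * 6364136223846793005 + 1442695040888963407) & MASK

def get_node(key: int, csa: list, max_iters: int = 4096) -> int:
    """Walk the key's PRS until an item indexes a working node in the CSA."""
    x = key & MASK
    for _ in range(max_iters):
        x = prg(x)                   # next item of the key's PRS
        node = (x >> 32) % len(csa)  # high LCG bits are better distributed
        if csa[node]:                # True means the node is working
            return node
    raise RuntimeError("iteration bound reached without a working node")
```

With the CSA of Figure 1 (nodes 0, 1, 3, 5 working), every key deterministically lands on one of the working nodes.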
3.1.2 Update
Node updates can be divided into joins and exits. When a node joins the cluster, DxHash randomly allocates a failed node ID to the new node, then removes that ID from the failed set F and adds it to the working set W. When a node exits the cluster, the corresponding node ID is removed from the working set W and inserted into the failed set F. After the working set changes, the lookup results for the related keys change accordingly.
Consider Figure 1 again: if we add node 2 to the cluster, we get the state illustrated in Figure 2, where the update is recorded in the CSA. Since node 2 joins the cluster, some data is remapped to it; for example, K2 is remapped from node 3 to node 2. Then we remove node 1. Since node 1 exits the cluster, some data is remapped from it to other nodes; for example, K1 is remapped from node 1 to node 2. We can intuitively see that altered nodes only affect the keys remapped to or from them, while other keys are unaffected, which is exactly minimal disruption. We prove that DxHash guarantees minimal disruption and balance in the following.
Theorem 1 (Minimal Disruption): DxHash guarantees minimal disruption.
Proof: Given an arbitrary key k, the PRS it generates is fixed, represented as {R_1(k), R_2(k), ...}. We assume the i-th item b = R_i(k) is the node that serves k. From the definition of DxHash, the nodes in the set {R_1(k), ..., R_{i−1}(k)} are all failed.
(1) Removal. Assume the newly removed node is r. If r = b, then k is on node r and is about to be remapped to another node, which does not break minimal disruption. If r ≠ b, then k is not on node r. Since the nodes corresponding to the first i − 1 items of the PRS stay failed and are not affected by a removal, k is still mapped to node b.
(2) Addition. Assume the newly added node is r. If r ∈ {R_1(k), ..., R_{i−1}(k)}, there is some m < i with R_m(k) = r. Note that the possible value of m is not unique, so we use m* to denote the minimum of all such m; thus R_{m*}(k) = r. Except for node r, the nodes corresponding to the first m* − 1 items stay failed, so k is remapped to r because the m*-th item becomes a working node while the first m* − 1 items still point to failed nodes. If r ∉ {R_1(k), ..., R_{i−1}(k)}, the recovery of node r has no effect on the states of the nodes corresponding to the first i items of the PRS, so the mapping of k stays unchanged.
We can see that for both removal and addition, the remapped keys all move from or to the altered node, while other keys keep their original mapping. So minimal disruption in DxHash is proved.
Theorem 2 (Balance): DxHash guarantees the uniform balance.
Proof: Given a specific state of the cluster, the mapping generated during each pseudo-random iteration is always balanced. Induction is used to prove this conclusion.
Basis. We first prove that at the first iteration, the mapping of keys to working nodes is even. Since the pseudo-random generator is ideal, the distribution of keys among all nodes at the first iteration is uniform, so the workload assigned to the working nodes in the first iteration is uniform too. Meanwhile, the keys mapped to failed nodes wait for the next iterations to be remapped.
Induction Hypothesis. Assume balance holds after each of the first i iterations.
Induction Step. We now prove balance after iteration i + 1. The keys mapped to failed nodes in the last iteration need to be remapped in this iteration. As the pseudo-random generator is ideal, the distribution of these keys is uniform as well. Therefore, as long as the pseudo-random generator is ideal, DxHash always guarantees uniform balance.
We have now introduced the algorithm and proved that DxHash is consistent. When W ≠ ∅ and F ≠ ∅, nodes can be arbitrarily removed from or added to the cluster. To ensure the availability of the algorithm, consider the boundary cases. When W = ∅, there are no surviving working nodes in the cluster, and every iteration for a key always fails to find a working node. To keep the algorithm from going into an infinite loop, an upper bound on the number of iterations is introduced: DxHash terminates the pseudo-random sequence calculation once the bound is reached and traverses the CSA instead. If there is still no working node, DxHash returns an error code and stops working.
The situation where all nodes have failed is rare, but the other boundary case may occur frequently: F = ∅, meaning all nodes are working. In this situation, some CH algorithms do not allow a new node to join, such as SACH [9] and AnchorHash [8]. In DxHash, a node can still be added, but growing the range naively leads to a large number of remappings since |A| changes. To reduce the amount of remapping, DxHash doubles the cluster scale and then adds the new node IDs to the failed set F. In this way, half the remapping is saved compared with the naive scheme, which behaves like traditional hashing. For example, as Figure 3 shows, in the original cluster of size 8 there are 4 failed nodes. After inserting 4 nodes, the cluster is full, and one more addition could cause all the data to be remapped. So DxHash doubles the scale to 16, pushes nodes 9-15 into the failed set and leaves node 8 as the new working node.
3.2 DxHash Implementation
We have proved the consistency of DxHash, but the practical performance depends on the implementation, which we introduce in the following.
In our implementation, we use an array to record the states of all nodes, called the Cluster State Array (CSA). The index of the CSA is the node ID, and the content is a flag bit representing whether the node is working. In this way, each node uses as little as 1 bit of memory, far less than almost all existing CH algorithms. Besides the CSA, DxHash maintains a queue as the failed set F. Each item in F is a 32-bit integer representing a failed node ID. Note that in F, ordering is not required. The reason we use a queue for F is that a queue can add and remove items in O(1) complexity; a stack could equally be used.
Initiation. As shown in Alg. 1, the function Init of DxHash receives a CSA as input. The CSA records which nodes are working or failed, and the failed nodes are pushed into F.
AddNode. As shown in Alg. 2, the function AddNode receives no input and returns a node ID. It pops a failed node ID from F and sets the corresponding item in the CSA to working, meaning a failed node goes back to work. The returned ID is the recovered node ID.
RemoveNode. As shown in Alg. 3, the function RemoveNode receives a node ID. It pushes the node ID into F and sets the corresponding item in the CSA to failed, meaning that node has failed.
GetNode. As shown in Alg. 4, the function GetNode receives a key and returns the mapped node. DxHash uses a pseudo-random generator to calculate a PRS for the key, and chooses the first item in the PRS that corresponds to a working node. That node ID is returned.
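Putting Alg. 1-4 together, the structures above can be sketched as a small class. The LCG PRG and the iteration bound are illustrative choices of ours; the fallback CSA scan follows the boundary case discussed in Section 3.1:

```python
# Compact sketch of the DxHash structures: a CSA of working flags
# plus a queue of failed node IDs. Details are illustrative.
from collections import deque

MASK = (1 << 64) - 1

def prg(x: int) -> int:
    """Stand-in 64-bit LCG used as the PRG."""
    return (x * 6364136223846793005 + 1442695040888963407) & MASK

class DxHash:
    def __init__(self, csa):                  # Init (Alg. 1)
        self.csa = list(csa)                  # True = working, False = failed
        self.free = deque(i for i, up in enumerate(self.csa) if not up)

    def add_node(self):                       # AddNode (Alg. 2), O(1)
        nid = self.free.popleft()             # reuse a failed node ID
        self.csa[nid] = True
        return nid

    def remove_node(self, nid):               # RemoveNode (Alg. 3), O(1)
        self.csa[nid] = False
        self.free.append(nid)

    def get_node(self, key, max_iters=4096):  # GetNode (Alg. 4)
        x = key & MASK
        for _ in range(max_iters):
            x = prg(x)
            nid = (x >> 32) % len(self.csa)
            if self.csa[nid]:
                return nid
        for nid, up in enumerate(self.csa):   # boundary case: scan the CSA
            if up:
                return nid
        raise RuntimeError("no working node in the cluster")
```

Removing a node only remaps the keys that were on it; all other keys keep their mapping, which is exactly the minimal disruption property proved above.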
Besides the core algorithms, the Scale-up and Scale-down of the cluster are important in DxHash. As mentioned, when the nodes in A are all working and a new node is about to join, the cluster size is doubled: the CSA is doubled and the unused node IDs are pushed into F. After that, the addition of the node is handled by the function AddNode. When the number of working nodes is far less than |A|, for example less than 1/4 of |A|, Scale-down can be triggered. First, the nodes whose IDs fall in the upper half of the range are removed by the function RemoveNode. Then, the CSA is halved and the node IDs in the upper half are deleted from F. At last, the same number of nodes as removed earlier are added back into DxHash. Note that the amount of data remapping brought by Scale-down is larger than that brought by Scale-up, since the former requires the extra removing and adding operations. So Scale-down is off by default and is suggested to be triggered manually.
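Under the same structures (a CSA list plus a queue of failed IDs), the Scale-up step might be sketched as below; Scale-down is omitted since it mirrors the removal-then-halving procedure described above:

```python
# Illustrative Scale-up: double the CSA and mark the new IDs as failed.
from collections import deque

def scale_up(csa: list, free: deque) -> None:
    """Double the CSA; every newly created node ID starts out failed."""
    old = len(csa)
    csa.extend([False] * old)         # new slots are failed
    free.extend(range(old, 2 * old))  # their IDs join the failed queue
```

After `scale_up`, a subsequent AddNode simply pops one of the new IDs, so the join itself stays O(1) while the doubling is O(a).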
3.3 DxHash Complexity Proof
In the following, we discuss the complexity of DxHash.
3.3.1 Space complexity
First, the space complexity of DxHash is easy to calculate. The data structures used in DxHash are a queue and an array. The size of the array is directly proportional to the total number of nodes, denoted as a. The size of the queue is directly proportional to the number of failed nodes, denoted as f, which is less than a. So the space complexity of DxHash is O(a), where a is the total number of nodes.
3.3.2 Time complexity
Now we turn to the time complexity of all operations in DxHash, including Addition, Removal, Query, Scale-up, and Scale-down.
Addition. In the addition process, both the access to the queue and the access to the CSA complete in O(1) time, so the complexity of Addition is O(1).
Removal. As with Addition, the complexity of Removal is O(1).
Query. The time complexity of query plays a crucial role in the performance of DxHash. DxHash calculates pseudo-random numbers from a key iteratively until a number corresponds to a working node. Denote the total number of nodes as a and the number of working nodes as w. The probability of hitting a working node in each iteration is q = w/a and the probability of hitting a failed node is p = 1 − w/a. From here, we can see that the query in DxHash follows a sequence of Bernoulli trials. For a key k, the number of pseudo-random number calculations during a query is denoted by X. The probability that a query ends exactly in the i-th iteration is p^{i−1} · q, which means that in the first i − 1 iterations DxHash gets failed nodes and in the i-th iteration it finds a working node. If ending in the i-th iteration, the number of pseudo-random number calculations is i. So the expectation of X can be calculated as Formula 1.

E(X) = Σ_{i=1}^{∞} i · p^{i−1} · q = 1/q        (1)

In a similar way, the variance of X can be calculated as Formula 2.

Var(X) = p/q²        (2)

Substituting q = w/a, we get:
Theorem 3 (Query Complexity): Fix a and w, and let p = 1 − w/a and q = w/a. For a key k, the number of hash operations during a query is denoted by X. The expectation of X is a/w, while the variance of X is a(a − w)/w².
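Theorem 3 can be checked empirically with a toy model that replays the Bernoulli trials behind the derivation: each iteration independently hits a working node with probability w/a (this simulates only the trial process, not the full algorithm):

```python
# Monte-Carlo check of E(X) = a/w; trial counts and seed are arbitrary.
import random

def mean_lookup_length(a: int, w: int, trials: int = 200000, seed: int = 1) -> float:
    """Average number of iterations until a working node is hit, when each
    iteration succeeds independently with probability q = w/a."""
    rng = random.Random(seed)
    q = w / a
    total = 0
    for _ in range(trials):
        i = 1
        while rng.random() >= q:  # this iteration hit a failed node
            i += 1
        total += i
    return total / trials
```

For a = 10 and w = 1 (a 90% failure rate), the estimate is close to a/w = 10, as the theorem predicts.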
Scale-up. When scaling the cluster up, DxHash doubles the CSA and pushes the unused node IDs into F. These two operations involve O(a) items, so the time complexity of Scale-up is O(a).
Scale-down. Similar to Scale-up, the time complexity of Scale-down is O(a), too.
4 DxHash in Distributed Storage
In this section, some optimizations of DxHash are introduced in the scenario of distributed storage. It is worth noting that in the assumed distributed storage, the metadata of the cluster nodes is centrally managed: all nodes share a view of the cluster topology, which is updated in real time according to node failures and recoveries. In fact, many existing distributed storage systems are managed in this way, for example Memcached, Swift and Ceph. Therefore, the spread and load properties [5], two other important properties of CH, need no additional discussion.
4.1 Weighted DxHash
In real scenarios, a cluster is usually made up of physical nodes with different performance. The nodes with more memory, more advanced storage devices or a better processor can hold more workload, while the nodes with poor hardware can hold less. For the physical nodes with better hardware, it is easy to increase the load by introducing virtual nodes: deploying multiple virtual nodes on one physical node naturally increases its load. For the physical nodes with poor hardware, Weighted DxHash is introduced to reduce the workload.
In native DxHash, the node a key maps to is decided by the PRS generated from the key. Weighted DxHash introduces a weight for each node and a hash function denoted as S. The ranges of both the function S and the weight values are from 0 to 1. In Weighted DxHash, the conditions to map a key to a node are more restrictive: not only must the i-th item of the PRS be a working node, but the hash value of the i-th item must also be smaller than the weight of that node. A more precise description follows.
The mapping in Weighted DxHash. For a given key k, denote the PRS of k as {R_1(k), R_2(k), ...}. Denote the cluster as A, and the weight of node j as w_j; the weight of a failed node is 0. Then, the node that k belongs to is R_i(k), where i is the first index in the PRS which satisfies:

S(R_i(k)) < w_{R_i(k)}
For example, as shown in Figure 4, there is a cluster with 8 nodes, with the PRS and the node weights given. A node with a weight of 0 is failed, so nodes 4, 6, 7 are failed while the others are working. In native DxHash, node 2 would be chosen as the mapped node of the key. However, in Weighted DxHash, node 2 cannot meet the requirement because the hash value is larger than the weight of node 2, so the lookup continues to the next eligible node in the PRS.
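The weighted condition can be sketched as follows, again with an illustrative LCG as the PRG and a second mixing step standing in for the hash S that maps items into [0, 1):

```python
# Sketch of the Weighted DxHash lookup; PRG, hash S and constants are
# illustrative choices, not the paper's exact functions.
MASK = (1 << 64) - 1

def prg(x: int) -> int:
    """Stand-in 64-bit LCG used as the PRG."""
    return (x * 6364136223846793005 + 1442695040888963407) & MASK

def s_hash(x: int) -> float:
    """Stand-in hash S: map a PRS item into [0, 1)."""
    return prg(x ^ 0x9E3779B97F4A7C15) / 2.0**64

def weighted_get_node(key: int, weights: list, max_iters: int = 8192) -> int:
    """Map the key to the first PRS item whose hash value falls below the
    weight of the node it indexes; failed nodes carry weight 0."""
    x = key & MASK
    for _ in range(max_iters):
        x = prg(x)
        node = (x >> 32) % len(weights)
        if s_hash(x) < weights[node]:
            return node
    raise RuntimeError("iteration bound reached without an eligible node")
```

A node with weight 0 can never be chosen, and lowering a node's weight proportionally lowers the share of keys it receives, per Formula 4.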
Time complexity. The probability of hitting an eligible node in each iteration is (Σ_{j=1}^{a} w_j)/a, where a is the size of the cluster and w_j is the weight of the j-th node. In each iteration there are two calculations: one for the pseudo-random number and one for the hash value. The analysis is similar to that of native DxHash, so the expectation of the number of calculations for a query is:

E(X) = 2a / Σ_{j=1}^{a} w_j        (3)

It is worth noting that the expectation of the number of keys in a node is

E(K_j) = (w_j / Σ_{i=1}^{a} w_i) · K        (4)

where K_j is the number of keys in the j-th node and K is the total number of keys in the cluster.
Space complexity. In Weighted DxHash, the item in the CSA is not a flag bit but the weight of the node, so the CSA changes from a 1-bit array to a 4-byte array while the other structures stay unchanged. Although the memory footprint scales up, the total memory usage of Weighted DxHash is still smaller than almost all CH algorithms. For example, AnchorHash, whose memory footprint is far less than Ring's and Maglev's, still needs 16 bytes of memory per node [8].
In summary, Weighted DxHash adjusts the load across different nodes. For a node with poor hardware or a bad network condition, the weight can be set to a smaller value to receive less load. For a node with better hardware, multiple virtual nodes can be set up on the physical node to take full advantage of its hardware resources.
4.2 Asymmetric Replica Strategy
Although DxHash supports scaling up when the number of nodes reaches the upper bound, the amount of remapping is still large: at least half of the total data. To reduce it, we introduce the Asymmetric Replica Strategy (ARS). In distributed storage, a replica strategy is indispensable for system reliability; to avoid data loss, a three-replica policy is commonly used.
In the Asymmetric Replica Strategy, data is organized as key-value pairs (KV pairs). Each KV pair is saved as three copies, named the first, the second and the third replica in this paper. The key of each copy is suffixed by a cluster size: the key of the first replica is suffixed by the current cluster size, the key of the second replica by double the current cluster size, and the key of the third replica by quadruple the current cluster size. The mapping processes differ accordingly: the mapping of the second replica is computed as if the cluster were doubled, and the mapping of the third replica as if the cluster were quadrupled.
If the set of all nodes in the cluster is A with |A| = n, the three copies of a pair with key k carry the suffixed keys k∥n, k∥2n and k∥4n, and the mapped node of each copy is the first item of the corresponding PRS that satisfies the DxHash condition, computed over the node ID ranges [0, n), [0, 2n) and [0, 4n) respectively.
In these mappings, the node IDs calculated for the second or third replica may exceed the actual cluster size; nodes not in the cluster are uniformly treated as failed. The three copies are asymmetric because their mappings are computed differently and their query performance differs: accessing the first replica is obviously faster. However, when scaling the cluster up, the second replica, which was calculated with double the previous cluster size, takes effect directly. In this situation, the original second and third replicas are promoted to the first and second replicas respectively, and the original first replica is demoted to the third replica, now calculated with 8 times the original cluster size. The system is temporarily downgraded from three copies to two, but high availability is guaranteed, and the amount of remapping decreases greatly: the second and third replicas stay unmoved, which is 2/3 of the population, and part of the first replica stays unmoved too. So at most 1/3 of the total data needs to remap.
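A sketch of the ARS mapping, assuming the same illustrative LCG PRG; the key-suffix encoding (mixing the assumed cluster size into the key) is our own stand-in:

```python
# Sketch of the Asymmetric Replica Strategy mapping; details illustrative.
MASK = (1 << 64) - 1

def prg(x: int) -> int:
    """Stand-in 64-bit LCG used as the PRG."""
    return (x * 6364136223846793005 + 1442695040888963407) & MASK

def get_node_sized(key: int, csa: list, size: int, max_iters: int = 4096) -> int:
    """DxHash lookup computed as if the cluster had `size` node IDs;
    IDs beyond the real CSA are treated as failed, per the ARS rule."""
    x = key & MASK
    for _ in range(max_iters):
        x = prg(x)
        node = (x >> 32) % size
        if node < len(csa) and csa[node]:
            return node
    raise RuntimeError("no working node found")

def replica_nodes(key: int, csa: list) -> list:
    """First/second/third replica: the key is suffixed (here, mixed) with
    n, 2n, 4n and mapped with the corresponding assumed cluster size."""
    n = len(csa)
    return [get_node_sized(prg(key ^ s), csa, s) for s in (n, 2 * n, 4 * n)]
```

After a scale-up from n to 2n, the node computed for the second replica's key is unchanged, so that replica takes effect without any data movement.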
Although the Asymmetric Replica Strategy does reduce the amount of remapping, the query performance of the second and third replicas is affected. Read requests can be served by the first replica, but the performance of strongly consistent writes is necessarily worse than with the native replica strategy. Therefore, whether to use the Asymmetric Replica Strategy should be decided on a case-by-case basis.
5 Evaluation
In this section we test and compare the performance of different CH algorithms: Karger Ring, MaglevHash, AnchorHash and DxHash.
5.1 Environment
As shown in Table 2, all experiments are performed on the same commercial machine with an Intel Xeon CPU E5-2620 processor at 2.00 GHz and 32 GB of memory. The system is CentOS 7.8, the kernel version is 3.10.0-1127 and the GCC version is 7.3.1. All algorithms are implemented in C++ and compiled by G++.
Processor         Intel Xeon CPU E5-2620 0 @ 2.00GHz
Memory            32GB
Operating System  CentOS Linux release 7.8.2003 (Core)
Kernel Version    3.10.0-1127.13.1.el7.x86_64
GCC Version       7.3.1 20180303
5.2 Overview
We test the lookup rate and the memory footprint of 4 CH algorithms: Karger Ring, MaglevHash, AnchorHash and DxHash. All algorithms are implemented in C++. For Karger Ring and MaglevHash, the balance depends on the memory footprint; for example, in MaglevHash, reaching a 1% hash-space imbalance requires a table many times larger than the minimal requirement.
In this experiment, the nodes are all working. Figure 5(a) shows the lookup rate of the 4 CH algorithms for cluster sizes of 1K, 10K, 100K and 1M. The abscissa is the cluster size and the ordinate is the key lookup rate, where MKPS means Million Keys Per Second. From the figure, the lookup performance of Ring is consistently worse than the others because the lookup complexity of Ring is O(log n) while that of the others is roughly O(1) here. When the cluster size is small, the other three CH algorithms all perform well because their queries only need a few hashing operations. As the cluster size gets larger, the performance of Maglev drops since the lookup table is too large to fetch results quickly. AnchorHash and DxHash have similar performance, which is less affected by larger cluster sizes. DxHash performs better because its memory footprint is smaller, so its metadata has a better CPU cache hit rate. It is worth noting that since DRAM is byte-addressable, for best performance we use a byte rather than a bit of memory to record the state of a DxHash node in this experiment, avoiding extra bit operations.
Now let us calculate the memory footprint of the different algorithms. In Karger Ring, a Red-Black Tree (RBTree) organizes the node states. Since an RBTree node includes 3 pointers (to the parent, left child and right child), the color and a KV pair, each node takes up 28 bytes of memory. In our implementation, each physical node is assigned 100 virtual nodes for balance. In Maglev, one entry in the lookup table requires 4 bytes of memory and the number of slots is about 100 times the number of physical nodes to keep balance. In AnchorHash, a node uses 16 bytes of memory, while in DxHash a node needs only 1 bit. We calculate the memory footprint of the 4 CH algorithms at cluster sizes of 1K, 10K, 100K and 1M. The result is shown in Figure 5(b). The abscissa is the cluster size and the ordinate is the memory usage in bytes. Ring and Maglev take up considerable memory because both require overprovisioned space to keep balance: for 1M nodes they require 2.8 GB and 400 MB, respectively. The memory footprint of AnchorHash is acceptable at 16 MB. DxHash, however, takes up surprisingly little space, as little as 125 KB, which fits entirely in the L2 cache. Even when a byte of memory is used per node for performance, the footprint is still only 1 MB. DxHash clearly has the smallest memory usage, far less than any other CH algorithm.
5.3 Balance
In this experiment we evaluate the load balancing of DxHash and compare it against the other CH algorithms. We vary the number of working nodes from 100 to 1000. In Karger Ring, one physical node maps to 100 virtual nodes. The lookup table size of MaglevHash is 99991, a prime that is about 100 times the total number of nodes. For AnchorHash and DxHash, the total number of nodes is set to 1024, which means that when the number of working nodes is 100, the failure ratio of the cluster is about 0.9. We randomly generate 10 million keys and calculate the mapped node IDs. The result is shown in Figure 7. We measure the balancing efficiency as the standard deviation of the key counts across all working nodes. For convenient comparison between different cluster sizes, we divide the standard deviation by the mean to get the Coefficient of Variation (CV).
From Figure 7, we find that the balancing efficiency of Karger Ring is poor: the CV reaches 0.1 even though each physical node has 100 virtual nodes. This poor performance is consistent with tests in other work [6]. The other algorithms have much better balancing efficiency; their CV rises from 0.003 to 0.01 as the number of working nodes increases from 100 to 1000. This matches expectations. By the law of large numbers, when the number of keys is much greater than the number of nodes, the observed frequency converges on the probability, so the key count on each node approaches the mean and the variance approaches 0. The closer the number of keys is to the number of nodes, the greater the variance. Moreover, the balancing efficiency of MaglevHash also depends on the ratio of the lookup table size to the number of nodes.
In general, Maglev, AnchorHash and DxHash all achieve high balancing efficiency; only Karger Ring balances load poorly.
5.4 Minimal disruption
In this experiment, we evaluate the disruption of the 4 CH algorithms caused by the addition or removal of nodes. We increase the number of working nodes from 100 to 1000 in steps of 100. On each addition, we count the number of remapped keys and divide it by the total number of keys to get the remapping ratio. We then compare the remapping ratio to the ideal, calculated as m/(N + m), where m is always 100 and N is the number of nodes before the addition. Figure 6 shows the result. As expected, of the 4 CH algorithms, only Maglev fails to maintain minimal disruption when nodes are added or removed, though the extra disruption from Maglev is far less than that of traditional hashing. The other 3 CH algorithms essentially guarantee minimal disruption.
5.5 Fault Tolerance
In this section, we evaluate the fault tolerance of the CH algorithms. The previous experiments showed that Karger Ring is not an ideal CH because of its poor performance, low balancing efficiency and large memory footprint. Maglev is a more advanced CH, but it still has a large memory footprint, and when nodes are added or removed, rebuilding the lookup table takes too long. We therefore compare only the fault tolerance of AnchorHash and DxHash.
As mentioned, the Average Search Length (ASL) for a key to find its mapped node depends on the failure ratio. We test the ASL at different failure ratios. In the experiment, the total number of nodes is 1000. We randomly delete 100 nodes at each step until only 100 working nodes remain, so the failure ratio increases from 0 to 0.9. Figure 8 presents the trend of the ASL as the failure ratio increases. The experimental results agree with our theoretical calculation: the lookup complexity of AnchorHash is O(ln(n/w)) [8] and that of DxHash is O(n/w), where n is the total number of nodes and w is the number of working nodes. AnchorHash thus seems to tolerate faults better than DxHash, but only under extreme failure conditions where the failure ratio reaches 70% or more. From Figure 8, even when 70% of the nodes have failed, the ASL of a key lookup in DxHash is only 3.4, close to the performance of AnchorHash.
Besides the ASL, we test the lookup performance of the two algorithms under different failure ratios. We run the test twice: once in a cluster with 1K nodes and once in a cluster with 1M nodes. From Figure 9(a), comparing the lookup rates at different failure ratios, DxHash surprisingly always performs better. The first reason is that the memory footprint of DxHash is smaller, so the entire metadata can be cached in the CPU to speed up access. The second reason lies in the algorithm itself: although DxHash has a longer average search length, AnchorHash accesses memory 4 times per miss while DxHash accesses it only once. Unless the ASL of DxHash is 4 times that of AnchorHash (corresponding to a 97.5% failure ratio), DxHash performs better. In Figure 9(b), when the cluster size reaches 1M, the gap between the two is even more stark: the memory usage of DxHash is smaller and it accesses memory less often, so its lookup performance is clearly better than that of AnchorHash. Even at a failure ratio of 0.9, the lookup rate of DxHash is still 6.44 Mkeys/s.
5.6 Weighted DxHash
In this experiment, we test the ASL and the data distribution of Weighted DxHash. The test cluster is composed of 1024 working nodes, which we divide into two halves. The weight of one half is set to 1 (called 1-nodes) and the weight of the other half is varied (called n-nodes). After 10 million lookups, we count the average number of accesses per node in each half, and also record the average search length. Figure 10 presents the results: the average accesses of the 1-nodes and n-nodes are measured under the different configurations, and both the ASL and the average accesses fit our calculations from Formulas 3 and 4, with an error within 0.1%.
5.7 Asymmetric Replica Strategy
In this experiment, we test the remapping ratio of 10 million keys during scale-up. We set the cluster size to 1024, 2048, 4096, 8192 and 16384, respectively, with all nodes working, and then add a new node to the cluster, which triggers the scale-up: the cluster size is doubled and the unused nodes are marked as failed. We count the key remapping ratio in DxHash with and without the Asymmetric Replica Strategy (ARS) and compare the two in Figure 11. As expected, the remapping ratio is 0.5 in the original DxHash. With ARS, the remapping ratio is only 0.2916, which is close to 7/24.
5.8 Discussion
The experimental results show that compared with state-of-the-art work, the disruption, balance, space complexity, update complexity and lookup complexity of DxHash are all equal or better. Although the lookup complexity of DxHash is higher than that of AnchorHash, its performance is more than sufficient for real scenarios, where the failure ratio is usually below 0.9. Compared with AnchorHash [8], DxHash solves the problem of scale-up. Moreover, unlike AnchorHash, DxHash is stateless, so it extends easily to multi-threaded, highly concurrent scenarios. And since the metadata of DxHash is tiny (125 KB per million nodes), it is well suited to distributed clusters, where even full metadata updates incur only a tiny network overhead.
The biggest drawback of DxHash is the remapping ratio during scale-up. Although the Asymmetric Replica Strategy reduces the remapped share of the data from 1/2 to 7/24, that is still a large amount if the data scale is huge. Fortunately, scale-up is infrequent because the upper bound doubles after each scale-up. It is worth noting that if a large number of nodes join the cluster at the time of scale-up, the remapping ratio stays at 7/24 (or 1/2 without ARS), and if the cluster size is itself doubled at scale-up, the remapping ratio remains ideal. DxHash therefore also suits elastic storage.
6 Conclusions
This paper presents DxHash, a fast, scalable and flexible consistent hashing algorithm. We describe the algorithm and the implementation of DxHash and prove its complexity. Based on DxHash, we propose two optimizations: Weighted DxHash and the Asymmetric Replica Strategy. We then evaluate DxHash against other existing CH algorithms. The results show that DxHash can maintain millions of nodes while providing a high key lookup rate, a very low memory footprint, and a small update time upon the addition or removal of nodes, and that the two optimizations achieve their respective effects. Finally, the code of DxHash and all the tests are open source.
References
 [1] Chung, C., Koo, J., Im, J., Arvind, and Lee, S. LightStore: Software-defined network-attached key-value drives. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2019), ASPLOS ’19, Association for Computing Machinery, pp. 939–953.
 [2] Eisenbud, D. E., Yi, C., Contavalli, C., Smith, C., Kononov, R., Mann-Hielscher, E., Cilingiroglu, A., Cheyney, B., Shang, W., and Hosein, J. D. Maglev: A fast and reliable software network load balancer. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16) (Santa Clara, CA, Mar. 2016), USENIX Association, pp. 523–535.

 [3] Fu, X., Peng, C., and Han, W. A consistent hashing based data redistribution algorithm. In Intelligence Science and Big Data Engineering. Big Data and Machine Learning Techniques (Cham, 2015), X. He, X. Gao, Y. Zhang, Z.-H. Zhou, Z.-Y. Liu, B. Fu, F. Hu, and Z. Zhang, Eds., Springer International Publishing, pp. 559–566.
 [4] Goel, P., Rishabh, K., and Varma, V. An alternate load distribution scheme in DHTs. In 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom) (2017), pp. 218–222.

 [5] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 1997), STOC ’97, Association for Computing Machinery, pp. 654–663.
 [6] Lamping, J., and Veach, E. A fast, minimal memory, consistent hash algorithm, 2014.
 [7] Liu, Z., Bai, Z., Liu, Z., Li, X., Kim, C., Braverman, V., Jin, X., and Stoica, I. DistCache: Provable load balancing for large-scale storage systems with distributed caching. In 17th USENIX Conference on File and Storage Technologies (FAST 19) (Boston, MA, Feb. 2019), USENIX Association, pp. 143–157.
 [8] Mendelson, G., Vargaftik, S., Barabash, K., Lorenz, D. H., Keslassy, I., and Orda, A. AnchorHash: A scalable consistent hash. IEEE/ACM Transactions on Networking 29, 2 (2021), 517–528.
 [9] Nakatani, Y. Structured allocation-based consistent hashing with improved balancing for cloud infrastructure. IEEE Transactions on Parallel and Distributed Systems 32, 9 (2021), 2248–2261.
 [10] Shu, J., Chen, Y., Wang, Q., Zhu, B., Li, J., and Lu, Y. TH-DPMS: Design and implementation of an RDMA-enabled distributed persistent memory storage system. ACM Trans. Storage 16, 4 (Oct. 2020).
 [11] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking 11, 1 (2003), 17–32.

 [12] Wang, X., and Loguinov, D. Load-balancing performance of consistent hashing: Asymptotic analysis of random node join. IEEE/ACM Transactions on Networking 15, 4 (2007), 892–905.
 [13] Wu, C., Sreekanti, V., and Hellerstein, J. M. Autoscaling tiered cloud storage in Anna. Proc. VLDB Endow. 12, 6 (Feb. 2019), 624–638.