Given a specified average load factor, hash tables offer constant-time lookup operations, and hence have been widely used in operating systems and applications. However, it is well known that hash tables are vulnerable to hash collisions (Crosby and Wallach, 2003), and randomizing a static hash function is not a complete solution (Edge, 2012b). For years, this vulnerability affected a long list of operating systems and programming languages, such as the Linux kernel (Edge, 2012a), Perl (Crosby and Wallach, 2003), and PHP (Joye, 2012).
One possible solution is to dynamically change a hash table’s hash function without affecting concurrent insert, delete, and lookup operations (henceforth simply common operations). We use the term dynamic to describe a hash table that provides this flexibility, and the term rebuild to describe the operation that dynamically changes its hash function. Researchers have proposed several dynamic hash tables. Herbert Xu created a dynamic hash table for the networking subsystem of the Linux kernel in 2010 (XU, February 2010) to handle unpredictable large bursts of fragmented packets (Elsasser, 2019) and potential DoS attacks. Thomas Graf created another, generic dynamic hash table in 2014 (Graf, August 2014), which has since been widely used in the Linux kernel. Other researchers partially overcame this challenge by proposing hash table algorithms that can only enlarge or shrink the number of buckets by a factor of 2 (Shalev and Shavit, 2006; Triplett et al., 2011; Fatourou et al., 2018), which we refer to as resizable hash tables.
The core problem in designing a dynamic hash table is how to atomically distribute each node from the old hash table to the new one during a rebuild. Prior work addressed this challenge with various techniques (detailed in Section 2), but in practice we found that they have performance drawbacks under heavy workloads, bursts of incoming data, and/or attacks. For example, Xu’s and Graf’s algorithms use per-bucket locks to serialize concurrent update operations, which leads to severe contention when the load factor rises to 20 or more. Resizable hash tables do not face this challenge, but they have limited ability to resolve hash collisions.
This paper presents DHash (Dynamic Hash table), a dynamic hash table that can meet the following main goals.
(1) Dynamic hash table: Users can dynamically change the hash function, without affecting concurrent common hash table operations.
Rationale: A dynamic hash table is the algorithm of choice for critical applications facing bursts of update requests, buggy applications, and even malicious attacks.
(2) Modularity: The hash table should be modular; it can utilize various lock-free/wait-free set algorithms as the implementation of hash table buckets, without heavy engineering workload.
Rationale: The choice of the algorithm to solve conflicts within each bucket is a trade-off between the algorithm’s progress guarantee, performance, and engineering efforts. In practice, many users of hash tables cannot know the right choice of the algorithm in advance. For example, wait-free linked lists (Kogan and Petrank, 2011; Yang and Mellor-Crummey, 2016) are the algorithms of choice for users who want the strongest progress guarantee for common hash table operations, and users who look for fast lookup speed would choose lock-free skip lists (Fomitchev and Ruppert, 2004).
(3) Fast and non-blocking lookup operations: Lookup operations should be fast and non-blocking, regardless of whether a rebuild operation is in progress.
Rationale: Hash tables are commonly designed for use cases with significantly more reads than writes. For example, Herlihy and Shavit suggested a common workload for hash tables of 90% lookups and 10% insertions and deletions (Herlihy and Shavit, 2008).
(4) Fast and non-blocking update operations: Insert and delete operations should be fast and non-blocking, even when a rebuild operation is distributing nodes.
Rationale: In modern computer systems, insert and delete requests typically reach hash tables in batches. For example, a variety of places between two servers (e.g., buffers in network interface cards and kernel TCP/IP stacks) can buffer network packets and then send them out in batches for higher throughput (Alizadeh et al., 2012). Failing to handle large bursts of update requests can result in performance degradation (Elsasser, 2019).
The core of DHash’s technical contribution is an efficient rebuilding strategy that distributes nodes by using regular operations. The rebuilding strategy allows DHash to use a variety of lock-free/wait-free set algorithms as hash buckets without heavy engineering effort. Experimental data shows that under light workloads, the overall performance of DHash matches or slightly exceeds that of other practical representatives, including the two dynamic hash tables in the Linux kernel (XU, February 2010; Edge, 2012a) and one resizable hash table based on the split-ordered list (Shalev and Shavit, 2006). Under heavy workloads, DHash outperforms these algorithms by factors of 2.3-6.2 and more.
2. Related Work
This section gives a high-level overview of prior research on dynamic and resizable hash tables.
Herbert Xu’s dynamic hash table: Herbert Xu created a dynamic hash table (XU, February 2010) for the management of IGMP packets in the Linux kernel in 2010. As far as we know, this is the first practical dynamic hash table. The key idea behind Xu’s algorithm is to manage two sets of pointers in each node, so that common operations traverse one set of pointers while the rebuild operation is updating the other set. The two sets are exchanged upon the completion of every rebuild operation. One major benefit of introducing two sets of pointers is that it is not necessary to delete nodes from the old hash table while a rebuild operation is in progress.
Xu’s hash table algorithm is straightforward and easy to implement, but it has two major drawbacks in practice. (1) Each bucket contains a common linked list along with a lock that serializes concurrent update and rebuild operations on the bucket. (2) A linked list algorithm must be customized by adding an extra set of pointers before it can be used by Xu’s hash table. This not only increases the memory footprint, but also prevents Xu’s algorithm from using other, faster linked list algorithms. In contrast, DHash overcomes these drawbacks in its design.
Generic dynamic hash table in Linux kernel: Thomas Graf introduced a generic dynamic hash table into the Linux kernel in 2014 (Graf, August 2014), and this algorithm has been widely used in the kernel. Graf’s algorithm was originally based on Josh Triplett’s ATC’11 paper (Triplett et al., 2011), but significantly improves the performance of the rebuild operation.
Graf’s hash table maintains a single pointer in each node, and utilizes a mutex lock to synchronize concurrent update and rebuild operations to the same bucket. A rebuild operation traverses the hash table, and always finds a non-empty bucket and distributes its last node, by inserting this node into the new hash table and then deleting it from the old hash table. There is a time period during which the node can be found in both the old and the new hash table, and a lookup operation searching the old hash table could be erroneously redirected to the new hash table. So the lookup operation of Graf’s algorithm is designed to tolerate these behaviors.
Graf’s algorithm is a practical design. However, this algorithm has the following drawbacks. (1) The rebuild thread must reach the tail of a list to distribute a single node. (2) It uses locks to serialize updates to a single bucket. (3) It maintains unordered lists as its buckets, which noticeably increases the overhead of lookup operations. In contrast, DHash overcomes these drawbacks in its design.
Resizable hash tables: Resizable hash tables do not change their hash functions; they can only enlarge and shrink the number of buckets by a factor of 2. Ori Shalev presented the first lock-free resizable hash table (Shalev and Shavit, 2006) in 2006. This algorithm keeps a singly linked list in which all nodes of the hash table are chained. To solve the atomic-distribution problem during resizing, Shalev’s algorithm does not move nodes among the buckets; instead, it moves the buckets among the nodes by pointing buckets at the proper nodes in the list. Although Shalev’s algorithm is lock-free, it has drawbacks in practice. (1) It must use a modulo hash function, which dramatically limits the flexibility of the algorithm. (2) The algorithm must first reverse the bit string of the key before performing any operation. Unfortunately, the reverse operation is not always efficient on platforms whose hardware does not provide special instructions for bit string processing.
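To make the cost of drawback (2) concrete, the following is a minimal sketch of a portable 32-bit bit reversal, the kind of software fallback that is needed when no hardware bit-reverse instruction is available. The function name reverse_bits32 is hypothetical and is not taken from Shalev and Shavit’s implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reverse the bit string of a 32-bit key using only shifts and masks.
 * Each step swaps progressively larger groups of bits: single bits,
 * pairs, nibbles, bytes, and finally the two 16-bit halves.
 */
static uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}
```

Every key passes through these five shift-and-mask steps before each operation, whereas a platform with a dedicated instruction pays for a single instruction.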
Josh Triplett presented a hash table that can incrementally shrink and expand by chaining multiple lists together and by splitting existing lists, respectively (Triplett et al., 2011). Triplett’s algorithm has drawbacks. For example, buckets are implemented as unordered lists, and concurrent insert and delete operations must block until a concurrent shrink operation finishes publishing the new hash table.
Researchers have proposed wait-free resizable hash tables that provide the strongest progress guarantee (Liu et al., 2014; Fatourou et al., 2018). However, it is not clear how features such as duplicated nodes and node replacement, which are commonly desired in practice, can be implemented in these wait-free resizable hash tables.
3. DHash Algorithm Overview
This section first presents the challenge in designing dynamic hash tables, and then sketches a high-level overview of the rebuild, lookup, insert, and delete operations of DHash, leaving technical details to Section 4.
Challenges: The key challenge in designing a dynamic hash table is to atomically move nodes from the old hash table to the new one, without affecting concurrent common hash table operations.
This challenge is hard to handle efficiently because moving a node must touch more than two buckets (linked lists), and the transition of the node must be atomic with respect to other concurrent operations. Prior work solved this issue by (1) introducing two sets of linked lists for each bucket (XU, February 2010), (2) acquiring corresponding per-bucket locks before distributing each node (XU, February 2010; Graf, August 2014), (3) avoiding moving nodes by adjusting bucket pointers (Shalev and Shavit, 2006), and/or (4) maintaining unordered linked lists that sometimes may contain nodes that do not belong to the linked lists (Triplett et al., 2011). These approaches, however, sacrifice the algorithms’ generality and/or performance. DHash, inspired by the RCU technique, takes a fundamentally different approach by relaxing the atomicity requirement in distributing a node: the rebuild operation first deletes the node from the old hash table and then inserts it into the new hash table by using regular operations, rather than expensive synchronization and memory fence instructions. Distributing a node this way leads to a short time period during which the node can be found in neither hash table. We call this the node’s hazard period. To allow other operations to access the node, DHash employs a global pointer that always points to the node that is in its hazard period. Consequently, if a rebuild operation is in progress, other operations need to check different locations, because a node may reside in either the old or the new hash table, or may be referenced by the global pointer. In Section 4, we prove that if a lookup, insert, or delete operation checks both hash tables and the node currently in its hazard period in a specified order, it can always find the node with the matching key and perform correctly.
3.1. Rebuild operation
DHash consists of a specified number of buckets. To resolve collisions between multiple keys that hash to the same bucket, a lock-free linked list is used to chain together the nodes containing these keys. The resulting data structure of DHash is illustrated in Figure 1(a). In this example, DHash consists of two buckets: Bkt 0 and Bkt 1. Bkt 0 contains three nodes (a, b, and c), and Bkt 1 contains two nodes (d and e).
When a rebuild operation starts, we assume that the new hash table contains three buckets, and that the user provides a new hash function that maps all of the keys to the new three-bucket array. The rebuild operation traverses the old hash table and distributes its nodes to the new one. Figure 1 illustrates the process of moving node a to the new hash table, with the initial state of two buckets shown in Figure 1(a) and with time advancing from subfigure to subfigure.
Specifically, node a is first referenced by the global pointer rebuild_cur, resulting in the state shown in Figure 1(b). Then, node a is removed from the old hash table and enters its hazard period, shown in Figure 1(c). While node a is in its hazard period, other lookup and update operations can access it via the global pointer rebuild_cur. Allowing other threads to access the node during its hazard period is the key reason that distributing a node in DHash need not involve expensive atomic operations and memory fences. Without loss of generality, we assume that while the rebuild operation is in progress, other operations concurrently insert a new node, f, into the new hash table, shown in Figure 1(c).
Then, node a is inserted into the new hash table, shown in Figure 1(d). After it has been successfully inserted into the new hash table, rebuild_cur is set to NULL. The rebuild operation traverses the old hash table and distributes all of the nodes to the new hash table in this way. Then the rebuild operation exposes the new hash table to subsequent common operations, shown in Figure 1(e). After that, the rebuild operation waits for all prior unfinished operations to complete before safely reclaiming the old hash table, shown in Figure 1(f).
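The distribution of a single node described above can be sketched as follows. This is a single-threaded illustration only, under stated assumptions: struct table, hash_mod, list_insert, list_remove, and distribute_one are hypothetical names, the buckets are plain sorted lists rather than lock-free lists, and all RCU calls and memory barriers are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BUCKETS 8                /* toy limit, for illustration only */

struct node {
    long key;
    struct node *next;
};

struct table {
    size_t nbuckets;
    size_t (*hash)(long key, size_t nbuckets);
    struct node *bkts[MAX_BUCKETS];
};

/* Global pointer to the node that is currently in its hazard period. */
static struct node *rebuild_cur;

/* Example hash function; assumes non-negative keys. */
static size_t hash_mod(long key, size_t nbuckets)
{
    return (size_t)key % nbuckets;
}

/* Insert n into a sorted list; fails if the key is already present. */
static bool list_insert(struct node **link, struct node *n)
{
    while (*link != NULL && (*link)->key < n->key)
        link = &(*link)->next;
    if (*link != NULL && (*link)->key == n->key)
        return false;
    n->next = *link;
    *link = n;
    return true;
}

/* Unlink n from a list; fails if n is not (or no longer) in the list. */
static bool list_remove(struct node **link, struct node *n)
{
    while (*link != NULL && *link != n)
        link = &(*link)->next;
    if (*link == NULL)
        return false;
    *link = n->next;
    n->next = NULL;
    return true;
}

/* Move one node from the old table to the new one, as in Figure 1(b)-(d). */
static bool distribute_one(struct table *old_ht, struct table *new_ht,
                           struct node *n)
{
    bool moved = false;

    rebuild_cur = n;                                   /* Figure 1(b) */
    if (list_remove(&old_ht->bkts[old_ht->hash(n->key, old_ht->nbuckets)], n)) {
        /* Hazard period: n is reachable only through rebuild_cur. */
        moved = list_insert(
            &new_ht->bkts[new_ht->hash(n->key, new_ht->nbuckets)], n);
    }
    rebuild_cur = NULL;                                /* after Figure 1(d) */
    return moved;
}
```

In this sketch the node is unreachable from both tables between list_remove and list_insert, which is exactly the window that rebuild_cur covers.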
3.2. Lookup, Insert, Delete operations
When no rebuild operation is in progress, which is the common case, lookup, insert, and delete operations perform common hash table operations on the only hash table in DHash. When a rebuild operation is in progress, a node may be in its hazard period, during which it can be found in neither hash table. Hence, lookup, delete, and insert operations must account for this situation.
Lookup: A lookup operation must search both hash tables and check the node referenced by the global pointer rebuild_cur. Since the operation involves multiple shared memory locations, synchronizing the rebuild and the lookup operation is a classic synchronization problem, which we solve by having the lookup operation check these locations in a specified order. Specifically, the lookup operation first searches for the node in the old hash table, then checks if the node pointed to by rebuild_cur has the matching key, and finally searches in the new hash table. Lemma 4.1 in Section 4.5 proves that if a lookup operation proceeds in this order, it can find the node with the matching key, regardless of whether a rebuild operation is in progress.
Delete: Similarly, a delete operation first searches in the old hash table and deletes the node and returns if the node can be found. Otherwise, it checks the node pointed to by rebuild_cur. Finally, the delete operation searches in the new hash table. Lemma 4.2 in Section 4.6 proves that if a delete operation performs in this order, it can successfully find and delete the node with the matching key.
Insert: A rebuild operation waits until all prior unfinished operations have completed before replacing the old hash table with the new one (detailed in Section 4.4). Lemma 4.4 in Section 4.7 proves that when a rebuild operation is in progress, the insert operation can simply insert the node into the new hash table and then return.
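The lookup order described above (old hash table, then rebuild_cur, then new hash table) can be sketched as follows. This is a simplified single-threaded illustration, not the paper’s ht_lookup: struct table, bucket_find, and ht_lookup_sketch are hypothetical names, keys are assumed to be non-negative and hashed by a plain modulo, and the RCU read-side critical section and read barriers are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUCKETS 8                /* toy limit, for illustration only */

struct node {
    long key;
    struct node *next;
};

struct table {
    size_t nbuckets;
    struct table *ht_new;            /* non-NULL only during a rebuild */
    struct node *bkts[MAX_BUCKETS];
};

/* Global pointer to the node that is currently in its hazard period. */
static struct node *rebuild_cur;

/* Search one sorted bucket of table t for key. */
static struct node *bucket_find(struct table *t, long key)
{
    struct node *cur = t->bkts[(size_t)key % t->nbuckets];

    while (cur != NULL && cur->key < key)
        cur = cur->next;
    return (cur != NULL && cur->key == key) ? cur : NULL;
}

/* Check the three possible locations of a node in the prescribed order. */
static struct node *ht_lookup_sketch(struct table *old_ht, long key)
{
    struct node *n = bucket_find(old_ht, key);   /* 1. old hash table */

    if (n != NULL)
        return n;
    if (old_ht->ht_new == NULL)                  /* no rebuild in progress */
        return NULL;
    n = rebuild_cur;                             /* 2. node in hazard period */
    if (n != NULL && n->key == key)
        return n;
    return bucket_find(old_ht->ht_new, key);     /* 3. new hash table */
}
```

The order matters because the rebuild operation moves a node in the same direction: a lookup that misses the node in the old table can only have been overtaken by the rebuild, so the node is then either referenced by rebuild_cur or already in the new table.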
4. DHash implementation
The design of DHash presented in Section 3 leads to a relatively straightforward implementation, which is the subject of this section.
We first give a brief overview of Read-Copy Update (RCU) which DHash uses to synchronize concurrent operations. Note that DHash can also use other synchronization mechanisms such as reference counters (Valois, 1995) and hazard pointers (Michael, 2004). We then present an RCU-based lock-free linked list which is used as the implementation of hash table buckets. Note that DHash is modular, such that the linked list can be replaced by other lock-free/wait-free set algorithms.
Read-Copy Update: RCU distinguishes between read-side code and write-side code and has the following primitives to synchronize read-write conflicts:
rcu_read_lock() / rcu_read_unlock(): Each time a thread wants to access shared variables, it accesses them in a read-side critical section, which begins with the primitive rcu_read_lock() and ends with the primitive rcu_read_unlock(). Within a read-side critical section, the lookup thread is safe to access the shared resources without needing to worry about the potential issues that these resources could be freed by other threads at the same time.
synchronize_rcu(): Works as a wait-for-readers barrier. Each time an updater thread wants to update shared variables (e.g., to delete a node), it first makes the resources unreachable to subsequent lookup operations, and then invokes synchronize_rcu() to wait until existing lookup operations have safely completed before updating the shared variables.
call_rcu() is an asynchronous version of synchronize_rcu(). It can be used by updater threads that do not want to block.
RCU synchronizes readers with writers by using constrained access order, instead of shared variables. Any RCU-protected node accessed by a reader is guaranteed to remain unreclaimed until the reader completes its access and calls rcu_read_unlock(). The production-quality implementations of rcu_read_lock() and rcu_read_unlock() are extremely lightweight; they have exactly zero overhead in Linux kernels built for production use with CONFIG_PREEMPT=n (McKenney, 2020) and have extremely close to zero overhead in user-space applications when the QSBR flavor is used (Desnoyers, 2012), such that readers of RCU-based data structures can perform as fast as single-threaded programs.
RCU-based lock-free linked list: For ease of presentation, in this paper we chose Michael’s lock-free linked list (Michael, 2002) as the implementation of DHash’s buckets. We optimized Michael’s algorithm to better meet our design goals. Specifically, Michael’s original algorithm uses hazard pointers (Michael, 2004) to synchronize concurrent accesses to shared variables, which is robust but incurs considerable programming and run-time overhead. In addition, a 64-bit tag field must be added to each node to prevent the potential ABA problem (Herlihy and Shavit, 2008).
To overcome these problems, we created an RCU-based lock-free linked list, which is based on Michael’s algorithm but leverages RCU to efficiently manage read-write conflicts. The modifications are as follows. (1) RCU, instead of hazard pointers, is used as the memory reclamation scheme, such that the expensive memory fences in traversing the list can be removed. (2) The tag field in each node is eliminated, because RCU prevents nodes from being reclaimed (and hence reused) before concurrent lookup operations holding references to these nodes have completed. (3) To reclaim a node, call_rcu is used, such that a delete operation will not be blocked by prior unfinished lookup operations.
The data structures and the API of our RCU-based lock-free linked list are presented in the accompanying algorithm listing. The structure lflist, which is used as the implementation of DHash’s hash table buckets, is fundamentally a chain of nodes. For each node, the key field holds the key value, and the next field points to the following node in the linked list if any, or has a NULL value otherwise. Since pointers are at least word aligned on all currently available architectures, the two least significant bits of next are used as a flag field indicating whether the node is in a special state. The least significant bit, denoted as LOGICALLY_REMOVED, indicates that the node has been logically removed by a delete operation. The second least significant bit, denoted as IS_BEING_DISTRIBUTED, indicates that the node has been logically removed from the list by a rebuild operation. The difference between these two states is how the node will be reclaimed, which we discuss in detail in the following paragraphs.
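The flag encoding in the low bits of next can be sketched with C11 atomics as follows. This is an illustration under assumptions rather than the paper’s code: the node layout, the helper next_ptr and get_flag functions, and the exact signatures of set_flag and clean_flag are hypothetical, and the default sequentially consistent memory order is used for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOGICALLY_REMOVED    ((uintptr_t)1)  /* least significant bit */
#define IS_BEING_DISTRIBUTED ((uintptr_t)2)  /* second least significant bit */
#define FLAG_MASK            (LOGICALLY_REMOVED | IS_BEING_DISTRIBUTED)

struct node {
    long key;
    _Atomic uintptr_t next;  /* successor address plus two flag bits */
};

/* Strip the flag bits to recover the successor pointer. */
static struct node *next_ptr(struct node *n)
{
    return (struct node *)(atomic_load(&n->next) & ~FLAG_MASK);
}

/* Return the flag bits currently stored in the next field. */
static uintptr_t get_flag(struct node *n)
{
    return atomic_load(&n->next) & FLAG_MASK;
}

/*
 * Atomically set a flag bit. Fails if the node already carries a flag,
 * i.e. another operation has logically removed it first.
 */
static bool set_flag(struct node *n, uintptr_t flag)
{
    uintptr_t old = atomic_load(&n->next);

    do {
        if (old & FLAG_MASK)
            return false;
    } while (!atomic_compare_exchange_weak(&n->next, &old, old | flag));
    return true;
}

/* Atomically clear a flag bit, e.g. before a distributed node is reused. */
static void clean_flag(struct node *n, uintptr_t flag)
{
    atomic_fetch_and(&n->next, ~flag);
}
```

Because the flag and the successor address share one word, a single compare-and-swap on next both marks the node and detects a concurrent change of its successor.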
Structure snapshot returns the search result to the function invoking lflist_find. Each time we want to search for a node, an instance of snapshot is passed to lflist_find. Upon the completion of lflist_find, it is guaranteed that the cur field of the snapshot points to the list node whose key is greater than or equal to the specified search key, and that the prev and next fields point to its predecessor and successor, respectively.
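The postcondition of lflist_find can be illustrated with a single-threaded sketch. The field types of struct snapshot and the function lflist_find_sketch are assumptions made for illustration; the sketch ignores flag bits, RCU, and concurrent modification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    long key;
    struct node *next;
};

struct snapshot {
    struct node *prev;  /* predecessor of cur, or NULL if cur is first */
    struct node *cur;   /* first node with key >= search key, or NULL */
    struct node *next;  /* successor of cur, or NULL */
};

/*
 * Walk a sorted list and fill in the snapshot.
 * Returns true only if cur holds exactly the search key.
 */
static bool lflist_find_sketch(struct node *head, long key,
                               struct snapshot *ss)
{
    struct node *prev = NULL;
    struct node *cur = head;

    while (cur != NULL && cur->key < key) {
        prev = cur;
        cur = cur->next;
    }
    ss->prev = prev;
    ss->cur = cur;
    ss->next = (cur != NULL) ? cur->next : NULL;
    return cur != NULL && cur->key == key;
}
```

An insert operation can use prev and cur as the insertion point when the key is absent, and a delete operation can use cur and next to unlink the matching node.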
Our RCU-based lock-free linked list provides three basic operations: lflist_find, lflist_insert, and lflist_delete. Before invoking any of these functions, a caller must have entered an RCU read-side critical section by invoking rcu_read_lock(). Functions lflist_insert and lflist_delete also need the read-side protection because they first traverse the list. Function lflist_delete takes a third parameter, flag, which is first stored to the flag field of the target node. Function lflist_delete deletes the matching node from the list and reclaims the node memory if flag is set to LOGICALLY_REMOVED. In contrast, if flag is set to IS_BEING_DISTRIBUTED, the node memory will not be reclaimed because the node will be inserted into the new hash table later. Function lflist_delete does not block; it uses call_rcu to asynchronously reclaim a node. Note that call_rcu is safe to invoke within an RCU read-side critical section.
4.2. Data structures
The data structures and auxiliary functions of DHash are as follows. The main structure, ht, is an array of buckets (bkts) of size B, where B is specified by the nbuckets field. Each element of bkts is fundamentally a pointer to our RCU-based lock-free linked list lflist. The hash field is a function pointer to the user-specified hash function. The ht_new field is set to NULL unless a rebuild operation is in progress, in which case it points to the new hash table that is going to replace the old one. The global variable rebuild_cur points to the node that is currently in its hazard period, or equals NULL if there is no such node in the system, and the mutex lock rebuild_lock serializes attempts to rebuild the hash table.
The two helper functions, clean_flag and set_flag, clear or set the flag bits of the node pointed to by htnp. Since the next field of a node could be updated by concurrent operations, these two functions must operate atomically. The helper function ht_alloc creates a hash table by allocating the array of buckets and assigning the user-specified hash function to the hash field.
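A minimal sketch of the ht structure and ht_alloc, as described above, might look as follows. The field order, the hash function signature, the use of a flexible array member, and the helpers ht_bucket_index and hash_identity are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node;                          /* defined by the bucket algorithm */

struct ht {
    struct ht *ht_new;                /* NULL unless a rebuild is in progress */
    unsigned long (*hash)(long key);  /* user-specified hash function */
    size_t nbuckets;                  /* B, the number of buckets */
    struct node *bkts[];              /* B bucket heads */
};

/* Allocate a table with nbuckets empty buckets and the given hash function. */
static struct ht *ht_alloc(size_t nbuckets, unsigned long (*hash)(long key))
{
    struct ht *htp;
    size_t i;

    if (nbuckets == 0 || hash == NULL)
        return NULL;
    htp = malloc(sizeof(*htp) + nbuckets * sizeof(htp->bkts[0]));
    if (htp == NULL)
        return NULL;
    htp->ht_new = NULL;
    htp->hash = hash;
    htp->nbuckets = nbuckets;
    for (i = 0; i < nbuckets; i++)
        htp->bkts[i] = NULL;
    return htp;
}

/* Map a key to a bucket index using the table's own hash function. */
static size_t ht_bucket_index(const struct ht *htp, long key)
{
    return (size_t)(htp->hash(key) % htp->nbuckets);
}

/* Example hash function for illustration; assumes non-negative keys. */
static unsigned long hash_identity(long key)
{
    return (unsigned long)key;
}
```

Keeping the hash function inside the table is what lets a rebuild install a different function simply by allocating a second table and linking it through ht_new.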
4.3. Solving read-write conflicts
There are read-write conflicts between DHash’s lookup and delete operations, and between common operations and rebuild operations. DHash solves this issue by leveraging the RCU synchronization mechanism, which is discussed in Section 4.1. Specifically, a caller must first enter an RCU read-side critical section before invoking DHash’s common operations and referencing any node in DHash, shown in the following code snippet.
    rcu_read_lock();
    node *cur = ht_lookup(htp, key);
    /* Accessing *cur is safe here. */
    rcu_read_unlock();
    /* Accessing *cur becomes unsafe. */
4.4. Rebuild operation
The rebuild operation is implemented by function ht_rebuild. It first attempts to acquire the global lock rebuild_lock, which serializes concurrent rebuild requests. Once it holds the lock, it checks again that the rebuild is still required. It then allocates a new hash table with the user-specified size and hash function, and stores a reference to the new hash table in the ht_new field of the old hash table, allowing subsequent operations to access the new hash table. Next, it performs an RCU synchronization barrier to wait for prior common operations, which may not be aware of the new hash table, to complete before the rebuild continues.
Function ht_rebuild then traverses the old hash table and distributes its nodes to the new hash table one by one. For each node, the global variable rebuild_cur is first set to point to the node. The two write barriers in the distribution loop pair with the read barriers in ht_lookup and ht_delete. Together they guarantee that the updates performed by ht_rebuild to rebuild_cur and the two hash tables are seen by other operations in the same order. Note that, for ease of presentation, we omit memory order specifications in the pseudo code. In practice, all accesses to bucket pointers (e.g., htbp), node pointers (e.g., htnp), and rebuild_cur must be made with std::memory_order_acquire or std::memory_order_release (Spec., 2011).
The node is then deleted from the old hash table. Function lflist_delete is invoked with IS_BEING_DISTRIBUTED as its third parameter, indicating that the node with the matching key will be deleted from the old hash table, but its memory will not be reclaimed. If this delete operation fails, which implies that the node has been deleted by a concurrent delete operation since the rebuild thread fetched its reference to the node, the rebuild process skips this node. Otherwise, the node is prepared for reuse by, for example, clearing its IS_BEING_DISTRIBUTED bit, and is then inserted into the proper bucket of the new hash table. Note that if the insertion fails, which means that another node with the same key value has been inserted into the new hash table by another thread, ht_rebuild invokes call_rcu, which frees the node after the currently unfinished operations referencing this node have completed. After the node has been inserted into the new hash table, the global pointer rebuild_cur is set back to NULL.
After distributing all nodes of the old hash table, ht_rebuild waits for unfinished common operations, which may still hold references to the old hash table, to complete. It then installs the new hash table as the current one, and again waits for all unfinished operations. Finally, it releases the global lock, frees the old hash table, and returns success.
In each iteration, ht_rebuild() deletes a node from the old hash table and then inserts it into the new hash table, reusing the node’s memory. One potential issue with the reuse of nodes is that it may redirect concurrent lookup operations that are traversing the old hash table to the wrong lists. For example, suppose that a lookup operation is traversing a hash bucket of the old hash table and holds a reference to node $n$. At this time, the rebuild operation distributes $n$ by inserting it into the proper hash bucket of the new hash table. This can redirect the lookup operation to the linked list in the new hash table, and result in a false negative if the node with the matching key is at the end of the linked list in the old hash table. There are two approaches to overcoming this problem. (1) The last node of each list stores the corresponding bucket id. Once a lookup operation reaches the last node of a bucket, it reads the id value from the node, and starts over if the value is not the expected one. (2) The lookup operation checks whether $n$ has been deleted before moving forward to subsequent nodes. For the lock-free linked list algorithm presented in this paper, deleting a node is performed by setting a flag in the two least significant bits of its next field, such that the two steps (checking deletion and moving forward) can be performed atomically by using one compare-and-swap operation on the next field of node $n$. The generic hash table in the Linux kernel (Graf, August 2014) uses the first approach, and DHash uses the second.
4.5. Lookup operation
The lookup operation is implemented by function ht_lookup. The function first searches for the specified key in the proper bucket of the old hash table. If a node with the matching key is found in the bucket, a pointer to the node is returned. Otherwise, the function checks whether a rebuild operation is in progress. If no rebuild operation is in progress, it returns -ENOENT, indicating that no node with the matching key can be found in DHash. The two read barriers in ht_lookup pair with the two write barriers in ht_rebuild. The lookup then continues by checking the node pointed to by the global pointer rebuild_cur. Recall that rebuild_cur always points to the node that is currently in its hazard period. If that node has the matching key, and if the LOGICALLY_REMOVED bit of its next field has not been set, which means that the node has not been deleted by a concurrent delete operation, a pointer to the node is returned. Otherwise, the function continues by searching the new hash table and returns a pointer to the node if the lflist_find operation succeeds.
In summary, a lookup operation first searches for the node with the matching key in the old hash table, then checks if the node pointed to by rebuild_cur is the right node, and finally searches in the new hash table. This order guarantees that lookup operations can always find the node even if a rebuild operation is in progress. That is, the following lemma holds:
Lemma 4.1 ().
If DHash contains node with key value of K, operation ht_lookup(K) can return a pointer to , no matter if a rebuild operation is in progress.
Obviously, if rebuild operation is absent, node resides in the only hash table. Operation ht_lookup(K) can find the node in the only hash table (lines LABEL:alg.lookup.bkt1 - LABEL:alg.lookup.find1end).
We then prove that ht_lookup(K) can find the node when a rebuild operation is in progress. The code snippet to distribute a node is shown on lines LABEL:alg.rebuild.cur1 - LABEL:alg.rebuild.cur2. We use to denote the event in which the thread running the rebuild operation (henceforth rebuild thread for short) assigns the address of node to the global variable rebuild_cur (line LABEL:alg.rebuild.cur1), and use and to denote the events in which node is deleted from and inserted into the old and the new hash table, respectively (lines LABEL:alg.rebuild.del and LABEL:alg.rebuild.insert). Similarly, we use and to denote the events in which the lookup thread searches for node in the old and the new hash table, respectively (lines LABEL:alg.lookup.find1 and LABEL:alg.lookup.find2). We use to denote the event in which the lookup thread checks the node pointed to by rebuild_cur (line LABEL:alg.lookup.global). In the following proof, since the rebuild thread is the only thread that performs write/delete/insert operations, and the lookup thread is the only thread that performs find operations, we omit thread symbol without introducing any ambiguity. For brevity, we use the acronym rbc to stand for rebuild_cur. One event precedes another event , written , if occurs at an earlier time.
By inspecting the code of ht_rebuild in Algorithm LABEL:alg.rebuild we get that:
By inspecting the code of ht_lookup in Algorithm LABEL:alg.lookup, we get that:
When the lookup and the rebuild thread is simultaneously accessing node , there are three types of interleaving between these two threads:
, which implies that the lookup thread searches for node before the rebuild thread starts distributing the node. Thus, the node can be found in the old hash table and the lookup operation can return a pointer to on line LABEL:alg.lookup.return1.
, which implies that the lookup thread searches for node after it has been inserted into the new hash table by the rebuild thread. Thus, the node can be found in the new table and the lookup operation can return a pointer to on line LABEL:alg.lookup.return3.
It follows that:
Once rbc(rebuild_cur) is set to point to node it remains. Hence the lookup thread can find node via rebuild_cur and can return a pointer to it on line LABEL:alg.lookup.return22.
In overall, if there is a node with the matching key in DHash, it is guaranteed that the ht_lookup operation can find the node and return a pointer to it, no matter if a rebuild operation is in progress. ∎
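The lookup order argued above (old table, then rebuild_cur, then new table) can be illustrated with a minimal single-threaded sketch. It is not the paper's implementation: it omits RCU, atomics, memory barriers, and the logical-deletion check, and all names (lookup_sketch, step_publish, step_unlink, step_relink, step_clear, the seed field standing in for a hash function) are hypothetical. The step_clear step in particular is an assumption of the sketch. The rebuild steps are split into separate functions so that the state between any two of them can be inspected.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 4

struct node {
    long key;
    struct node *next;
};

struct table {
    struct node *bkt[NBUCKETS];
    unsigned long seed;         /* stands in for this table's hash function */
};

struct dhash {
    struct table *old_tbl;      /* always valid */
    struct table *new_tbl;      /* non-NULL only while a rebuild is in progress */
    struct node *rebuild_cur;   /* node currently in transit, or NULL */
};

static size_t bucket_of(const struct table *t, long key)
{
    return (size_t)(((unsigned long)key ^ t->seed) % NBUCKETS);
}

static struct node *list_find(const struct table *t, long key)
{
    struct node *n = t->bkt[bucket_of(t, key)];
    while (n != NULL && n->key != key)
        n = n->next;
    return n;
}

static void list_insert(struct table *t, struct node *n)
{
    size_t b = bucket_of(t, n->key);
    n->next = t->bkt[b];        /* push at the head of the bucket list */
    t->bkt[b] = n;
}

static void list_remove(struct table *t, struct node *n)
{
    struct node **link = &t->bkt[bucket_of(t, n->key)];
    while (*link != NULL && *link != n)
        link = &(*link)->next;
    if (*link != NULL) {
        *link = n->next;
        n->next = NULL;
    }
}

/* Rebuild-side steps for one node, in the order described in the text. */
static void step_publish(struct dhash *h, struct node *n) { h->rebuild_cur = n; }
static void step_unlink(struct dhash *h, struct node *n)  { list_remove(h->old_tbl, n); }
static void step_relink(struct dhash *h, struct node *n)  { list_insert(h->new_tbl, n); }
static void step_clear(struct dhash *h)                   { h->rebuild_cur = NULL; } /* assumed */

/* Lookup order: old table, then rebuild_cur, then new table. */
static struct node *lookup_sketch(const struct dhash *h, long key)
{
    struct node *n;

    assert(h != NULL && h->old_tbl != NULL);
    n = list_find(h->old_tbl, key);
    if (n != NULL)
        return n;
    if (h->new_tbl == NULL)
        return NULL;            /* no rebuild in progress */
    n = h->rebuild_cur;
    if (n != NULL && n->key == key)
        return n;               /* node is in transit between the tables */
    return list_find(h->new_tbl, key);
}
```

In this sketch, the window between step_unlink and step_relink is the one in which the node is in neither table; lookup_sketch still returns it there because step_publish ran first.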
4.6. Delete operation
The delete operation of DHash is shown in Algorithm LABEL:alg.delete. The function first attempts to delete the node from the old hash table on line LABEL:alg.delete.del1, and returns SUCCESS on line LABEL:alg.delete.return1 if it succeeds. Otherwise, the function continues by checking whether a rebuild operation is in progress on line LABEL:alg.delete.checknew. The two read barriers on lines LABEL:alg.delete.mb1 and LABEL:alg.delete.mb2 pair with the two write barriers in ht_rebuild. If a rebuild operation is in progress, the delete operation checks whether the node pointed to by rebuild_cur has the expected key value on line LABEL:alg.delete.checkglobal, and if so, the delete operation deletes the node by setting the LOGICALLY_REMOVED bit of the next field of the node (Line LABEL:alg.delete.deleteglobal). The function continues by attempting to delete the node with the matching key from the new hash table (Line LABEL:alg.delete.del2). If this attempt fails as well, line LABEL:alg.delete.return4 returns -ENOENT, indicating that no node with the matching key can be found in DHash.
To delete a node, DHash adopts a classic lightweight mechanism presented in (Michael, 2002), by separating the deletion of a node into two stages: logical and physical deletion. The first stage is to mark a node (e.g., by setting the least-significant bits in the next field) to prevent subsequent lookup operations from returning this node, and prevent subsequent insert and delete operations from inserting and deleting nodes after this node. The second stage, which is typically performed by subsequent lookup operations, is to physically delete the node from the list by swinging the next pointer of the previous node to the next node in the list and then reclaiming the node memory.
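The two-stage deletion described above can be sketched with C11 atomics as follows. This is a simplified illustration under stated assumptions, not the list used by DHash: the names lfnode, logically_delete, physically_delete, and find_live are hypothetical, the list has a sentinel head node, traversal does not help with unlinking, and memory reclamation (which the paper delegates to call_rcu) is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOGICALLY_REMOVED ((uintptr_t)1)

struct lfnode {
    long key;
    _Atomic uintptr_t next;     /* successor address; low bit is the mark */
};

static struct lfnode *ptr_of(uintptr_t v)
{
    return (struct lfnode *)(v & ~LOGICALLY_REMOVED);
}

static int is_removed(uintptr_t v)
{
    return (v & LOGICALLY_REMOVED) != 0;
}

/* Stage 1: logical deletion. Sets the mark in n->next.
 * Returns 1 if this call set the mark, 0 if the node was already marked. */
static int logically_delete(struct lfnode *n)
{
    uintptr_t old = atomic_fetch_or(&n->next, LOGICALLY_REMOVED);
    return !is_removed(old);
}

/* Stage 2: physical deletion. Swings prev->next from n to n's successor.
 * Fails (returns 0) if prev is marked or no longer points to n. */
static int physically_delete(struct lfnode *prev, struct lfnode *n)
{
    uintptr_t expected = (uintptr_t)n;
    uintptr_t succ;

    assert(((uintptr_t)n & LOGICALLY_REMOVED) == 0); /* low bit must be free */
    succ = (uintptr_t)ptr_of(atomic_load(&n->next));
    return atomic_compare_exchange_strong(&prev->next, &expected, succ);
}

/* Lookup that never returns a logically removed node. */
static struct lfnode *find_live(struct lfnode *head, long key)
{
    struct lfnode *n = ptr_of(atomic_load(&head->next));

    while (n != NULL) {
        uintptr_t nx = atomic_load(&n->next);
        if (n->key == key && !is_removed(nx))
            return n;
        n = ptr_of(nx);
    }
    return NULL;
}
```

Because the mark lives in the same word as the successor pointer, a compare-and-swap on a marked node's next field with an unmarked expected value fails, which is what prevents inserting or deleting after a logically removed node.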
Since a delete operation fundamentally performs lookup operations in addition to a logical deletion if the node with the matching key is found, it is straightforward to prove that if the manipulation order of a delete operation is the same as that of a lookup operation shown in Algorithm LABEL:alg.lookup, the delete operation can always find the node (because of Lemma 4.1) and delete the node (a logical deletion can always succeed). That is, the following lemma holds:
Lemma 4.2.
If DHash contains a node $n$ with key value K, operation ht_delete(K) can successfully delete node $n$, regardless of whether a rebuild operation is in progress.
4.7. Insert operation
Function ht_insert() in Algorithm LABEL:alg.insert inserts a new node into DHash. The function first allocates a new node and initializes it (line LABEL:alg.insert.alloc) and then checks if a rebuild operation is in progress on line LABEL:alg.insert.checkglobal.
If there is no rebuild operation in progress, function ht_insert inserts the new node into the old hash table on line LABEL:alg.insert.insertold. In contrast, if a rebuild operation is in progress, it inserts the node into the new hash table on line LABEL:alg.insert.insertnew. If the insertion fails, which implies that a node with the same key value was inserted into DHash before this insert operation, function ht_insert frees the newly allocated node on line LABEL:alg.insert.free and returns a failure message on line LABEL:alg.insert.return3. Since the RCU technique is used to synchronize insert operations and the rebuild operation, the following lemma holds:
Lemma 4.3.
When a rebuild operation is in progress, function ht_insert can successfully insert a node into DHash.
Proof. Recall that function ht_insert() runs in an RCU read-side critical section, and function ht_rebuild() performs a synchronize_rcu barrier on line LABEL:alg.rebuild.sync1 (called barrier 1) before distributing nodes to the new hash table. If the insert operation starts before barrier 1, it may or may not see the new hash table, and hence could insert the node into either the old or the new hash table. Inserting the node into either hash table is correct, because barrier 1 prevents function ht_rebuild() from starting to distribute nodes until the insert operation completes and leaves its RCU read-side critical section. In the other case, if the insert operation starts after barrier 1, which means that function ht_rebuild() is distributing nodes, the insert operation inserts the node into the new hash table. A second synchronize_rcu barrier on line LABEL:alg.rebuild.sync2 forces function ht_rebuild() to wait until the insert operation completes and leaves its RCU read-side critical section. ∎
Now, we prove that the following lemma holds:
Lemma 4.4.
Regardless of whether a rebuild operation is in progress, when an operation ht_insert(K) returns, it is guaranteed that a node with the key value of K can be found in the hash table.
Proof. If a rebuild operation is absent, which is the common case, function ht_insert inserts the new node into the only hash table (Lines LABEL:alg.insert.common.b–LABEL:alg.insert.common.e). If a rebuild operation is in progress, Lemma 4.3 guarantees that ht_insert will eventually insert the new node into the new hash table. Function ht_insert fails only if another node with the same key value has been inserted into the hash table, which guarantees that a node with the key value of K can be found. ∎
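The table-selection logic of the insert operation can be sketched as below. This is a single-threaded illustration only, assuming hypothetical names (insert_sketch, list_insert_unique) and an assumed failure code of -EEXIST, since the text only says that a failure message is returned. It leaves out the RCU read-side critical section and the lock-free list, and it checks for duplicates only in the table it inserts into, mirroring the prose description rather than the full algorithm.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 4

struct node {
    long key;
    struct node *next;
};

struct table {
    struct node *bkt[NBUCKETS];
    unsigned long seed;         /* stands in for this table's hash function */
};

struct dhash {
    struct table *old_tbl;      /* always valid */
    struct table *new_tbl;      /* non-NULL only while a rebuild is in progress */
};

static size_t bucket_of(const struct table *t, long key)
{
    return (size_t)(((unsigned long)key ^ t->seed) % NBUCKETS);
}

static struct node *list_find(const struct table *t, long key)
{
    struct node *n = t->bkt[bucket_of(t, key)];
    while (n != NULL && n->key != key)
        n = n->next;
    return n;
}

/* Returns 0 on success, -EEXIST if the key is already in this table. */
static int list_insert_unique(struct table *t, struct node *n)
{
    size_t b;

    if (list_find(t, n->key) != NULL)
        return -EEXIST;
    b = bucket_of(t, n->key);
    n->next = t->bkt[b];
    t->bkt[b] = n;
    return 0;
}

/* Allocate first, then pick the target table: the new table if a rebuild
 * is in progress, the old table otherwise. Free the node on failure. */
static int insert_sketch(struct dhash *h, long key)
{
    struct node *n;
    struct table *target;

    assert(h != NULL && h->old_tbl != NULL);
    n = malloc(sizeof(*n));
    if (n == NULL)
        return -ENOMEM;
    n->key = key;
    n->next = NULL;

    target = (h->new_tbl != NULL) ? h->new_tbl : h->old_tbl;
    if (list_insert_unique(target, n) != 0) {
        free(n);
        return -EEXIST;
    }
    return 0;
}
```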
5. Correctness
For brevity, we provide only informal proof sketches. The full proof of correctness will be provided in the full version of the paper. Recall that DHash is modular: if a lock-free/wait-free set algorithm can provide the API set listed in Algorithm LABEL:alg.structure.list, DHash can utilize it as the implementation of hash buckets. Therefore, the correctness of DHash depends on the set algorithm used. In this paper, we choose the RCU-based linked list presented in Section 4.1 as the example.
Safety: When rebuild operations are absent, safety is proved by following similar arguments as those used to prove Michael’s lock-free hash table (Michael, 2002). When a rebuild operation is running, Lemmas 4.1, 4.2, and 4.4 show that concurrent lookup, insert, and delete operations can execute correctly.
Progress guarantee: DHash is a blocking data structure because the RCU technique is used, and therefore the rebuild operation can be blocked. Specifically, ht_rebuild serializes concurrent rebuild requests by using a mutex lock and waits for prior hash table operations by using the synchronize_rcu barriers. This is acceptable for a practical implementation because rebuild operations commonly are rare and their speed is not the major concern if they do not noticeably affect concurrent hash table operations.
Nevertheless, the lookup, insert, and delete operations can be lock-free or wait-free, depending on the set algorithm used. For example, Algorithm LABEL:alg.lookup shows that a lookup operation invokes the list operation lflist_find twice. The other statements in a lookup operation are regular instructions, which complete in a finite number of CPU cycles. As a result, for the implementation of DHash presented in this paper, the lookup operation is lock-free because searching the linked list is lock-free. (The find operation of Michael’s list may start over from the list head when it finds a marked node.) Similarly, we can prove that the insert and delete operations of DHash are lock-free. (As discussed in Section 4.1, call_rcu is used to reclaim deleted nodes, so that delete operations do not block.) Note that since DHash is modular, programmers can instead use a wait-free linked list to make the common operations wait-free.
Linearizability: DHash is linearizable if the set algorithm used is linearizable. The linked list presented in Section 4.1 is linearizable, because we did not change the control flow of Michael’s list algorithm. Specifically, we keep all the CAS instructions and memory barriers that the algorithm contains. As a result, DHash is linearizable because every operation on the hash table has a specific linearization point, where it takes effect.
Specifically, every lookup operation that finds the node with the matching key via rebuild_cur takes effect on line LABEL:alg.lookup.assignrebuild. For other cases, the lookup operation linearizes in either of the two invocations of lflist_find. Similarly, every delete operation that finds the node with the matching key via rebuild_cur takes effect on line LABEL:alg.delete.assignrebuild. For other cases, the delete operation linearizes in either of the two invocations of lflist_delete. Every insert operation takes effect in either of the two invocations of lflist_insert.
6. Evaluation
In this section, we demonstrate that on three different architectures (1) the overall performance of DHash matches or slightly exceeds other practical alternatives, (2) DHash noticeably outperforms other alternatives under heavy workloads, and (3) the rebuild operation of DHash is efficient and predictable in execution time.
6.1. Evaluation Methodology
We choose Xu’s hash table (XU, February 2010) (HT-Xu for short) as the representative of dynamic hash tables that maintain two sets of list pointers in each node. We choose the rhashtable algorithm in the Linux kernel (Graf, August 2014) (HT-RHT for short) as the representative of dynamic hash tables that use a single set of list pointers. We also compare the performance of DHash to the famous split-ordered-list resizable hash tables (Shalev and Shavit, 2006) (HT-Split for short) that maintain a single ordered linked list for a hash table.
We implemented DHash as a user-space library in C. The original implementation of HT-Xu is tightly coupled with the multicasting code of the Linux kernel, so we use the implementation in perfbook (McKenney and others, 2011), which is a good representative of HT-Xu and runs in user space. We implemented a user-space HT-RHT that closely follows the original kernel implementation, except that we omitted some sophisticated features such as Nested Tables to handle GFP_ATOMIC memory allocation failures and Listed Tables to support duplicated nodes. The open-source project userspace-rcu (Desnoyers, 2012) includes an up-to-date implementation of HT-Split. Hence, we use the implementation in userspace-rcu in our experiments.
For all of the implementations, optimizations such as cache-line padding are applied where possible. We compile the code with GCC 5.4.0 on all platforms, each of which runs Ubuntu 16.04.5. We use -O3 as the optimization level without any other special optimization flags.
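Hardware platforms. We evaluated the performance of the aforementioned hash tables on three different architectures. Table 1 lists the key characteristics of these servers.

Table 1. Key characteristics of the servers.

| Architecture | Clock frequency | Sockets | Cores | Last-level cache | Memory |
|---|---|---|---|---|---|
| Intel Ivy Bridge | 2.6 GHz | 2 | 24 | 15 MB | 64 GB |
| IBM Power9 | 2.9 GHz | 1 | 16 | 80 MB | 16 GB |
| Cavium ARMv8 | 2.0 GHz | 2 | 96 | 16 MB | 32 GB |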
Benchmarking framework. To compare the performance and robustness of DHash to the alternatives, we extended the hashtorture benchmarking framework in (McKenney and others, 2011). Specifically, the extended framework consists of a specified number of worker threads, each of which performs a workload with a specified distribution of insert, delete, and lookup operations. In mapping worker threads to CPU cores, we use a performance-first mapping: a new thread is mapped to the CPU core that has the smallest number of worker threads running on it. Experiments performed on a single CPU socket are marked with an *, experiments performed on multiple CPU sockets are marked with a #, and experiments in which worker threads oversubscribe CPU cores are marked with a !. In the experiments, we varied the parameters that significantly affect the performance of concurrent hash tables: the mix of operations, the average load factor, the number of buckets, and the range of keys. When a test starts, every worker thread enters an infinite loop. In each iteration, the worker thread randomly selects an operation type (insert, delete, or lookup) according to the specified distribution, chooses a key between 0 and the specified upper bound, and then performs the selected operation. We chose the parameters as follows. The upper bound of the key range is set to ten million, which is large enough to prevent CPU caches from buffering the whole test set. We controlled the average load factor indirectly by inserting nodes into a hash table before starting a test, and by setting the ratio of insert operations equal to that of delete operations, which guarantees a fixed number of nodes in the hash table.
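The per-iteration choices of a worker thread and the performance-first thread mapping can be sketched as follows. This is an illustrative sketch, not the hashtorture code: the names mix, pick_op, pick_key, next_rand, and least_loaded_core are hypothetical, the xorshift generator is merely one convenient choice of random source, ties between equally loaded cores are broken by lowest index (an assumption), and the small modulo bias in pick_op and pick_key is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum op { OP_LOOKUP, OP_INSERT, OP_DELETE };

/* Operation mix in percent; the three fields must sum to 100. Choosing
 * insert_pct == delete_pct is how the text balances inserts and deletes. */
struct mix {
    unsigned lookup_pct;
    unsigned insert_pct;
    unsigned delete_pct;
};

/* xorshift64 pseudo-random generator; *state must be non-zero. */
static uint64_t next_rand(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Map a random number to an operation type according to the mix. */
static enum op pick_op(const struct mix *m, uint64_t r)
{
    unsigned p = (unsigned)(r % 100);

    assert(m->lookup_pct + m->insert_pct + m->delete_pct == 100);
    if (p < m->lookup_pct)
        return OP_LOOKUP;
    if (p < m->lookup_pct + m->insert_pct)
        return OP_INSERT;
    return OP_DELETE;
}

/* Map a random number to a key in [0, upper]. */
static long pick_key(uint64_t r, long upper)
{
    assert(upper >= 0);
    return (long)(r % ((uint64_t)upper + 1));
}

/* Performance-first mapping: choose the core that currently runs the
 * fewest worker threads (lowest index wins on ties). */
static size_t least_loaded_core(const unsigned *threads_on_core, size_t ncores)
{
    size_t best = 0;

    assert(ncores > 0);
    for (size_t c = 1; c < ncores; c++)
        if (threads_on_core[c] < threads_on_core[best])
            best = c;
    return best;
}
```

A worker loop would call next_rand twice per iteration, once for pick_op and once for pick_key, and then dispatch to the hash table operation under test.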
6.2. Overall performance
Figure 2 shows the overall performance of the hash tables with various load factors and operation mixes. In this section, we present only the results of representative experiments performed on the Intel Ivy Bridge server. Experimental results on other architectures are discussed in Section 6.4. Note that for clarity, the range of the y-axis in the last two subfigures is smaller than in the others. Standard deviation is denoted by vertical bars, which may be too small to be visible in the figure. To compare with HT-Split, in this experiment both the new and the old hash tables use the same hash function, which degrades DHash, HT-RHT, and HT-Xu to resizable hash tables. A rebuild thread continuously rebuilds a hash table from its initial size to the alternative size and back. While continuous resizes do not necessarily reflect a common usage pattern for a hash table, this experiment demonstrates the overall performance of a hash table under rebuilding/resizing, which serves as the baseline performance of the hash table.
Experimental results show that (1) the overall performance of DHash matches or slightly exceeds other practical alternatives with small average load factors (Figures 2(a)–2(b)), and (2) DHash significantly outperforms other hash tables under heavy workloads (Figures 2(e)–2(f)). For example, Figure 2(f) shows that when 48 worker threads concurrently execute operations, DHash can still handle 9.87 million operations per second, which outperforms HT-Split, HT-Xu, and HT-RHT by factors of 2.3, 5.3, and 6.2, respectively.
Another important observation is that DHash outperforms the alternatives with respect to scalability and robustness. Figure 2 shows that when the number of worker threads exceeds the number of CPU cores (24 for the Intel Ivy Bridge server), the performance of DHash still increases, despite the fact that more threads are contending for the hash table. For example, Figure 2(e) shows that as the number of worker threads increases from 24 to 48, the overall performance of DHash increases from 7.03 to 9.74 million operations per second. The performance of the alternatives, however, becomes flat or decreases due to the increased contention on bucket locks.
6.3. Rebuilding efficiency
In this section, we measure how fast the various rebuild operations complete. For brevity, Figure 3 only shows the results of representative experiments running on the Intel Ivy Bridge server with one worker thread. The x-axis of the figure is the number of nodes in the hash tables, and the y-axis is the time spent in rebuilding these hash tables. Note that for clarity, neither axis starts from zero, and the y-axis uses a log scale. The results of experiments with different percentages of lookup operations (90% and 80%) are shown in Figures 3(a) and 3(b), respectively.
We make the following observations. As expected, the cost of the resize operation of HT-Split is consistently low, because HT-Split is a resizable hash table and hence only changes the array of bucket pointers when resizing. The rebuild operation of HT-Xu is much more efficient than those of DHash and HT-RHT because of its two-sets-of-pointers property, which allows a rebuild operation to rebuild the hash table by traversing it once. For the other dynamic hash tables, which need to distribute all nodes to the new hash table, the time required is essentially linear in the number of nodes in the old hash table. Among hash tables of this type, DHash outperforms HT-RHT in rebuilding efficiency because HT-RHT always traverses a bucket list and then distributes the last node. In contrast, DHash distributes the head nodes, avoiding the traversal overhead.
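The difference between distributing the head node and distributing the last node of a bucket list can be made concrete with a toy cost model that counts only next-pointer hops while draining one bucket. The sketch below is an assumption-laden illustration, not code from DHash or rhashtable: the names pop_head, pop_tail, and drain are hypothetical, and real costs such as locking, memory barriers, and cache misses are ignored.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    long key;
    struct node *next;
};

/* Detach the head node. No traversal is needed, so the hop count is 0.
 * *out is set to the detached node, or NULL if the bucket is empty. */
static size_t pop_head(struct node **bucket, struct node **out)
{
    assert(bucket != NULL && out != NULL);
    *out = *bucket;
    if (*out != NULL)
        *bucket = (*out)->next;
    return 0;
}

/* Detach the last node. Walks the list to its tail and returns the
 * number of next-pointer hops taken to get there. */
static size_t pop_tail(struct node **bucket, struct node **out)
{
    struct node **link = bucket;
    size_t hops = 0;

    assert(bucket != NULL && out != NULL);
    *out = NULL;
    if (*link == NULL)
        return 0;
    while ((*link)->next != NULL) {
        link = &(*link)->next;
        hops++;
    }
    *out = *link;
    *link = NULL;
    return hops;
}

/* Drain a bucket with the given policy and return the total hop count. */
static size_t drain(struct node **bucket,
                    size_t (*pop)(struct node **, struct node **))
{
    size_t hops = 0;
    struct node *n;

    for (;;) {
        hops += pop(bucket, &n);
        if (n == NULL)
            break;
    }
    return hops;
}
```

In this model, draining a bucket of L nodes from the tail costs (L-1) + (L-2) + ... + 0 = L(L-1)/2 hops, for example 190 hops for a bucket of 20 nodes, whereas draining from the head costs none.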
Another observation is that the operation mix does not noticeably affect the rebuilding efficiency of any of the evaluated hash tables, as shown by the comparison of Figures 3(a) and 3(b). This observation suggests that for DHash, given a hash table with a specified number of nodes, programmers can predict how long the algorithm will take to rebuild the hash table.
6.4. Performance on different architectures
We now evaluate the overall performance of DHash on ARM and PowerPC, two other important architectures in industry. The characteristics of the servers used are listed in Table 1. The benchmarking framework is the same as in Section 6.2. Experimental results with different average load factors are marked with different suffixes in Figure 4. For example, HT-DHash-20 shows the results of DHash with an average load factor of 20.
We make the following observations. On both architectures, DHash scales well. The overall performance of DHash increases nearly linearly as the number of worker threads increases, until worker threads oversubscribe CPU cores. After that, the performance of DHash increases slightly or stays constant, but does not decrease. Figure 4 shows that even if the average load factor of the hash table reaches 200, DHash can provide a throughput of 4.1 and 7.9 Mop/s on IBM Power9 and ARMv8, respectively, indicating that DHash is the algorithm of choice for real applications with heavy workloads.
7. Conclusion
To overcome the hash collision problem, this paper presents DHash, a flexible, efficient hash table that can dynamically change its hash function on the fly. DHash allows programmers to create specific implementations that meet their requirements in terms of the algorithm’s progress guarantee and performance. We present the core technique to efficiently distribute nodes from the old hash table to the new one in rebuilding, and show that the result is highly scalable and robust using a variety of benchmarks on three types of architectures.
References
- Less is more: trading a little bandwidth for ultra-low latency in the data center. In NSDI.
- Denial of service via algorithmic complexity attacks. In USENIX Security Symposium.
- Userspace RCU. https://liburcu.org/ [Online; accessed 9-May-2019].
- A generic hash table. https://lwn.net/Articles/510202/ [Online; accessed 9-Feb-2020].
- Denial of service via hash collisions. https://lwn.net/Articles/474912/ [Online; accessed 9-Feb-2020].
- rhashtable: avoid reschedule loop after rapid growth and shrink. https://lkml.org/lkml/2019/1/23/789 [Online; accessed 9-Feb-2020].
- An efficient wait-free resizable hash table. In SPAA.
- Lock-free linked lists and skip lists. In Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, pp. 50–59.
- Lib: resizable, scalable, concurrent hash table. Linux kernel git commit: 7e1e77636e36.
- The art of multiprocessor programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- [PHP-DEV] 5.3.9 Release, Hash DoS. https://lwn.net/Articles/474971/ [Online; accessed 9-Feb-2020].
- Wait-free queues with multiple enqueuers and dequeuers. In ACM SIGPLAN Notices, Vol. 46, pp. 223–234.
- Dynamic-sized nonblocking hash tables. In Proceedings of the 2014 ACM symposium on Principles of distributed computing, pp. 242–251.
- Is parallel programming hard, and, if so, what can you do about it?. Linux Technology Center, IBM Beaverton.
- Introduction to RCU. http://www.rdrop.com/~paulmck/RCU/ [Online; accessed 9-May-2019].
- High performance dynamic lock-free hash tables and list-based sets. In SPAA.
- Hazard pointers: safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems 15 (6), pp. 491–504.
- Split-ordered lists: lock-free extensible hash tables. J. ACM 53, pp. 379–405.
- Memory order specification. https://en.cppreference.com/w/cpp/atomic/memory_order [Online; accessed 9-Feb-2020].
- Resizable, scalable, concurrent hash tables via relativistic programming. In USENIX Annual Technical Conference.
- Lock-free linked lists using compare-and-swap. In PODC, Vol. 95, pp. 214–222.
- Bridge: add core igmp snooping support. Linux kernel git commit: eb1d16414339.
- A wait-free queue as fast as fetch-and-add. ACM SIGPLAN Notices 51 (8), pp. 16.