Dash: Scalable Hashing on Persistent Memory

03/16/2020 ∙ by Baotong Lu, et al. ∙ 0

Byte-addressable persistent memory (PM) brings hash tables the potential of low latency, cheap persistence and instant recovery. The recent advent of Intel Optane DC Persistent Memory Modules (DCPMM) further accelerates this trend. Many new hash table designs have been proposed, but most of them were based on emulation and perform sub-optimally on real PM. They were also piece-wise and partial solutions that side-step many important properties, in particular good scalability, high load factor and instant recovery. We present Dash, a holistic approach to building dynamic and scalable hash tables on real PM hardware with all the aforementioned properties. Based on Dash, we adapted two popular dynamic hashing schemes (extendible hashing and linear hashing). On a 24-core machine with Intel Optane DCPMM, we show that compared to state-of-the-art, Dash-enabled hash tables can achieve up to 3.9X higher performance with up to over 90% load factor and an instant recovery time of 57ms regardless of data size.

7 Related Work

Dash builds upon many techniques from prior in-memory and PM-based hash tables, tree structures and PM programming tools.

In-Memory Hash Indexes. Section 2.2 has covered extendible hashing [ExtHashing] and linear hashing [LarsonLinearHashing, LitwinLinearHashing], so we do not repeat them here. Cuckoo hashing [CuckooHashing] achieves high memory efficiency through displacement: a record can be inserted into one of the two buckets computed using two independent hash functions; if both buckets are full, a randomly-chosen record is evicted to its alternative bucket to make space for the new record. The evicted record is inserted in the same way. MemC3 [MemC3] proposes a single-writer, multi-reader optimistic concurrent cuckoo hashing scheme that uses version counters with a global lock. libcuckoo [libcuckoo] extends MemC3 to support multiple writers. These approaches may incur many memory writes due to consecutive cuckoo displacements. Dash limits the number of probings and uses optimistic locking to reduce PM writes.

Static Hashing on PM. Most work aims at reducing PM writes, improving load factor and reducing the cost of full-table rehashing. Some proposals use multi-level designs that consist of a main table and additional levels of stashes to store records that cannot be inserted into the main table. PFHT [PFHT] is a two-level scheme that allows only one displacement to reduce writes. It uses linked lists to resolve collisions in the stash, which may incur many cache misses during probing. Path hashing [PathHashing] further organizes the stash as an inverted complete binary tree to lower search cost. Level hashing [LevelHashing] is a two-level scheme that bounds the search cost to at most four buckets. Upon resizing, the bottom level is rehashed to be 4× the size of the top-level table, and the previous top level becomes the new bottom level. Compared to cuckoo hashing, the number of buckets needed to probe during a lookup is doubled. Dash also uses stashes to improve load factor, but most search operations only need to access two buckets thanks to the overflow metadata.

Dynamic Hashing on PM. CCEH [CCEH] is a crash-consistent extendible hashing scheme which avoids full-table rehashing [ExtHashing]. To improve search efficiency, it bounds its probing length to four cachelines, but this can lead to low load factor and frequent segment splits. CCEH’s recovery process requires scanning the directory upon restart, thus sacrificing instant recovery. Prior proposals often use pessimistic locking [LevelHashing, CCEH], which can easily become a bottleneck due to excessive PM writes when manipulating locks. As a result, even conflict-free search operations cannot scale. NVC-hashmap [NVCHashmap] is a lock-free, persistent hash table based on split-ordered lists [SplitOrderedLists]. Although the lock-free design can reduce PM writes, it is hard to implement; the linked list design may also incur many cache misses. Dash solves these problems with optimistic locking that reduces PM writes and allows near-linear scalability for search operations.

Range Indexes. Most range indexes for PM are B+-tree or trie variants and aim to reduce PM writes [Hwang2018, FPTree, BzTree, Chen2015, NV-Tree, DBPCM, PMwCAS, Lee2017, HiKV]. An effective technique is unsorted leaf nodes [Chen2015, DBPCM, NV-Tree, FPTree] at the cost of linear scans, while hash indexes mainly reduce PM writes by avoiding consecutive displacements. FP-tree [FPTree] proposes fingerprints in leaf nodes to reduce PM accesses; Dash adopts them to reduce unnecessary bucket probing and efficiently support variable-length keys. Some work [FPTree, HiKV] places part of the index in DRAM (e.g., inner nodes) to improve performance. This trades off instant recovery as the DRAM part has to be rebuilt upon restart [PiBench]. The same tradeoff can be seen in hash tables by placing the directory in DRAM. With bucket load balancing techniques, Dash can use larger segments and place the directory in PM, avoiding this tradeoff.

PM Programming. PM data structures rely heavily on userspace libraries and OS support to easily handle such issues as PM allocation and space management. PMDK [PMDK] is so far the most popular and comprehensive library. An important issue in these libraries is to avoid leaking PM permanently. A common solution [nvmmalloc, Makalu, PMDK, PAllocator] is to use an allocate-activate approach so that the allocated PM is owned by either the application or the allocator upon a crash. At the OS level, PM file systems provide direct access (DAX) to bypass caches and allow pointer-based accesses [ext4]. Some traditional file systems (e.g., ext4 and XFS) have been adapted to support DAX. PM-specific file systems are also being proposed to further reduce overhead [NOVA, BPFS, PMFS, Strata, Aerie, splitfs]. We find that support for PM programming is still in its early stage and evolving quickly, with possible bugs and inefficiencies as Section 6.9 shows. This requires careful integration and testing when designing future PM data structures.

8 Conclusion

Persistent memory brings new challenges to persistent hash tables in both performance (scalability) and functionality. We identify that the key is to reduce both unnecessary PM reads and writes, whereas prior work solely focused on reducing PM writes and ignored many practical issues such as PM management and concurrency control, and traded off instant recovery capability. Our solution is Dash, a holistic approach to scalable PM hashing. Dash combines both new and existing techniques, including (1) fingerprinting to reduce PM accesses, (2) optimistic locking, and (3) a novel bucket load balancing technique. Using Dash, we adapted extendible hashing and linear hashing to work on PM. On real Intel Optane DCPMM, Dash scales with up to 3.9× better performance than prior state-of-the-art, while maintaining desirable properties, including high load factor and sub-second level instant recovery.

1 Introduction

Dynamic hash tables that can grow and shrink as needed at runtime are a fundamental building block of many data-intensive systems, such as database systems [MySQL, PostgreSQL, MeetWalkers, Cicada, MMDB-Impl, MMDBMS] and key-value stores [Bluecache, MemC3, SLM-DB, WiscKey, LevelDB, Redis, FASTER]. Persistent memory (PM) such as 3D XPoint [Intel3DXP] and memristor [Memristor] promises byte-addressability, persistence, high capacity, low cost and high performance, all on the memory bus. These features make PM very attractive for building dynamic hash tables that persist and operate directly on PM, with high performance and instant recovery. The recent release of Intel Optane DC Persistent Memory Module (DCPMM) brings this vision closer to reality. Since PM exhibits several distinct properties (e.g., asymmetric read/write speeds and higher latency), blindly applying prior disk- or DRAM-based approaches [ExtHashing, LarsonLinearHashing, LitwinLinearHashing] would not reap its full benefits, necessitating a departure from conventional designs.

1.1 Hashing on PM: Not What You Assumed!

There has been a new breed of hash tables specifically designed for PM [CCEH, LevelHashing, Dali, PFHT, PathHashing, NVCHashmap] based on DRAM emulation, before actual PM was available. Their main focus is to reduce cacheline flushes and PM writes for scalable performance. But when they are deployed on real Optane DCPMM, we find (1) scalability is still a major issue, and (2) desirable properties are often traded off.

Figure 1 shows the throughput of two state-of-the-art PM hash tables [LevelHashing, CCEH] under insert (left) and search (right) operations, on a 24-core server with Optane DCPMM running workloads under uniform key distribution (details in Section 6). As core count increases, neither scheme scales for inserts, or even for read-only search operations. Corroborating recent work [PMPrimitives, PiBench], we find the main culprit is Optane DCPMM’s limited bandwidth, which is 3–14× lower than DRAM’s [UCSDMeasurement]. Although the server is fully populated to provide the maximum possible bandwidth, excessive PM accesses can still easily saturate the system and prevent it from scaling. We describe two major sources of excessive PM accesses that were not given enough attention before, followed by a discussion of important but missing functionality in prior work.

Figure 1: Throughput of state-of-the-art PM hashing (CCEH [CCEH] and Level Hashing [LevelHashing]) for insert (left) and search (right) operations on Optane DCPMM. Neither matches the expected scalability.

Excessive PM Reads. Much prior work focused on reducing writes to PM. However, we note that it is also imperative to reduce PM reads, yet many existing solutions reduce PM writes by incurring more PM reads. Different from the device-level behavior (PM reads being faster than writes), end-to-end write latency (i.e., the entire data path including CPU caches and write buffers in the memory controller) is often lower than read latency [PMPrimitives, UCSDGuide]. The reason is that while PM writes can leverage write buffers, PM reads mostly touch the PM media due to hash tables’ inherent random access patterns. In particular, existence checks in record probing constitute a large proportion of such PM reads: to find out if a key exists, one or multiple buckets (e.g., with linear probing) have to be searched, incurring many cache misses and PM reads when comparing keys.

Heavyweight Concurrency Control. Most prior work side-stepped the impact of concurrency control. Bucket-level locking has been widely used [CCEH, LevelHashing], but it incurs additional PM writes to acquire/release read locks, further pushing bandwidth consumption towards the limit. Lock-free designs [NVCHashmap] can avoid PM writes for read-only probing operations, but are notoriously hard to get right, more so in PM for safe persistence [PMwCAS].

Neither record probing nor concurrency control typically prevents a well-designed hash table from scaling on DRAM. However, on PM they can easily exhaust PM’s limited bandwidth. These issues call for new designs that can reduce unnecessary PM reads during probing and lightweight concurrency control that further reduces PM writes.

Missing Functionality. We observe that in prior designs, necessary functionality is often traded off for performance (though scalability is still an issue on real PM). (1) Indexes could occupy more than 50% of memory capacity [HybridIndex], so it is critical to improve load factor (records stored vs. hash table capacity). Yet high load factor is often sacrificed by organizing buckets using larger segments in exchange for smaller directories (fewer cache misses) [CCEH]. As we describe later, this in turn can trigger more premature splits and incur even more PM accesses, impacting performance. (2) Variable-length keys are widely used in reality, but prior approaches rarely discuss how to efficiently support them. (3) Instant recovery is a unique, desirable feature that could be provided by PM, but is often omitted in prior work, which requires a linear scan of metadata whose size scales with data size. (4) Prior designs also often side-step PM programming issues (e.g., PM allocation), which impact the proposed solution’s scalability and adoption in reality.

1.2 Dash

We present Dash, a holistic approach to dynamic and scalable hashing on real PM without trading off desirable properties. Dash uses a combination of new and existing techniques that are carefully engineered to achieve this goal.

1. We adopt fingerprinting [FPTree] that was first used in PM tree structures to avoid unnecessary PM reads during record probing. The idea is to generate fingerprints (one-byte hashes) of keys and place them compactly to summarize the possible existence of keys. This allows a thread to tell if a key possibly exists by scanning the fingerprints which are much smaller than the actual keys.

2. Instead of traditional bucket-level locking, Dash uses an optimistic, lightweight flavor of it that relies on verification to detect conflicts, rather than (expensive) shared locks. This allows Dash to avoid PM writes for search operations. With fingerprinting and optimistic concurrency, Dash avoids both unnecessary reads and writes, saving PM bandwidth and allowing Dash to scale well.

3. Dash retains desirable properties. We propose a new load balancing strategy to postpone segment splits with improved space utilization. To support instant recovery, we limit the amount of work to be done upon recovery to be small (reading and possibly writing a one-byte counter), and amortize recovery work to runtime. Compared to prior work that handles PM programming issues in ad hoc ways, Dash uses PM programming models (PMDK [PMDK], one of the most popular PM libraries) to systematically handle crash consistency, PM allocation and achieve instant recovery.
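The verification-based locking idea can be illustrated with a minimal, single-threaded sketch. It is not Dash's implementation: the class `VersionLock`, the function `optimistic_read` and the even/odd version convention are assumptions made for illustration, and a real version lock would update the counter with atomic instructions. The point it shows is that a reader only reads the version before and after its work and never writes to the lock.

```python
class VersionLock:
    """Hypothetical version lock: even = unlocked, odd = a writer holds it."""

    def __init__(self):
        self.version = 0

    def write_lock(self):
        # Sketch only: a real lock would use an atomic compare-and-swap here.
        assert self.version % 2 == 0, "sketch assumes one writer at a time"
        self.version += 1  # odd: a write is in progress

    def write_unlock(self):
        self.version += 1  # even again, but a different version than before


def optimistic_read(lock, read_fn, max_retries=100):
    """Run read_fn without writing to the lock; retry if a writer interfered."""
    for _ in range(max_retries):
        before = lock.version
        if before % 2 == 1:
            continue  # a writer is active, so try again
        result = read_fn()
        if lock.version == before:
            return result  # verification passed: no writer ran in between
    raise RuntimeError("too many conflicts")
```

A reader that observes a changed version simply discards its result and retries, so a search that meets no concurrent writer issues no store to the lock word at all.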

Although these techniques are not all new, Dash is the first to integrate them for building hash tables that scale without sacrificing features on real PM. Techniques in Dash can be applied to various static and dynamic hashing schemes. Compared to static hashing, dynamic hashing can adjust hash table size on demand without full-table rehashing which may block concurrent queries and significantly limit performance. In this paper, we focus on dynamic hashing and apply Dash to two classic approaches: extendible hashing [ExtHashing, CCEH] and linear hashing [LitwinLinearHashing, LarsonLinearHashing]. They are both widely used in database and storage systems, such as Oracle ZFS [ZFS], IBM GPFS [GPFS], Berkeley DB [BDB] and SQL Server Hekaton [hekaton].

Evaluation using a 24-core Intel Xeon Scalable CPU and 1.5TB of Optane DCPMM shows that Dash can deliver high performance, good scalability, high space utilization and instant recovery with a constant recovery time of 57ms. Compared to the aforementioned state-of-the-art [CCEH, LevelHashing], Dash achieves up to 3.9× better performance on realistic workloads, and up to over 90% load factor with high space utilization and the ability to instantly recover.

1.3 Contributions and Paper Organization

We make four contributions. First, we identify the mismatch between existing and desirable PM hashing approaches, and highlight the new challenges. Second, we propose Dash, a holistic approach to building scalable hash tables on real PM. Dash consists of a set of useful and general building blocks applicable to different hash table designs. Third, we present Dash-enabled extendible hashing and linear hashing, two classic and widely used dynamic hashing schemes. Finally, we provide a comprehensive empirical evaluation of Dash and existing PM hash tables to pinpoint and validate the important design decisions. Our implementation is open-source at:

https://github.com/baotonglu/dash.

In the rest of the paper, we give necessary background in Section 2. Sections 3–5 present our design principles and Dash-enabled extendible hashing and linear hashing. Section 6 evaluates Dash. We discuss related work in Section 7 and conclude in Section 8.

2 Background

We first give necessary background on PM (Optane DCPMM) and dynamic hashing, then discuss issues in prior PM hash tables.

2.1 Intel Optane DC Persistent Memory

Hardware. We target Optane DCPMMs (in DIMM form factor). In addition to byte-addressability and persistence, DCPMM offers high capacity (128/256/512GB per DIMM) at a price lower than DRAM’s. It supports two modes: Memory and AppDirect. The former presents capacious but slower volatile memory. DRAM is used as a cache to hide PM’s higher latency, with hardware-controlled caching policy. The AppDirect mode allows software to explicitly access DRAM and PM with persistence in PM, without implicit caching. Applications need to make judicious use of DRAM and PM. Similar to other work [LevelHashing, CCEH, Dali, NVCHashmap], we leverage the AppDirect mode, as it provides more flexibility and persistence guarantees.

System Architecture. The current generation of DCPMM requires the system be equipped with DRAM to function properly. We also assume such a setup with PM and DRAM, both of which are behind multiple levels of volatile CPU caches. Data is not guaranteed to be persisted in PM until a cacheline flush instruction (CLFLUSH, CLFLUSHOPT or CLWB [IntelManual]) is executed or other events that implicitly cause a cacheline flush occur. Writes to PM may also be reordered, requiring fences to avoid undesirable reordering. The application (hash table in our case) must explicitly issue fences and cacheline flushes to ensure correctness. CLFLUSH and CLFLUSHOPT will evict the cacheline that is being flushed, while CLWB does not (thus may give better performance). After a cacheline of data is flushed, it will reach the asynchronous DRAM refresh (ADR) domain which includes a write buffer and a write pending queue with persistence guarantees upon power failure. Once data is in the ADR domain, it is considered as persisted. Although DCPMM supports 8-byte atomic writes, internally it uses 256-byte blocks. However, software should not (be hardcoded to) depend on this as it is an internal parameter of the hardware which may change in future generations.

Performance Characteristics. At the device level, as many previous studies have shown, PM exhibits asymmetric read and write latency, with writes being slower. It exhibits 300ns read latency, 4× longer than DRAM’s. More recent studies [PMPrimitives, UCSDGuide], however, revealed that on Optane DCPMM, read latency as seen by the software is often higher than write latency. This is attributed to the fact that writes (store instructions) commit once the data reaches the ADR domain at the memory controller rather than when reaching DCPMM media. On the contrary, a read operation often needs to touch the actual media unless the data being accessed is cache-resident (which is rare especially in data structures with inherent randomness, e.g., hash tables). Tests also showed that the bandwidth of DCPMM depends on many factors of the workload. In general, compared to DRAM, it exhibits 3×/8× slower sequential/random read bandwidth. The numbers for sequential/random write are 11×/14×. Notably, the performance of small random stores is severely limited and non-scalable [UCSDGuide], which, however, is the inherent access pattern in hash tables. These properties exhibit a stark contrast to prior estimates about PM performance [NVM-DLog, NVRAMEra, vanRenen2018], and lead to significantly lower performance of many prior designs on DCPMM than originally reported. Thus, it is important to reduce both PM reads and writes for higher performance. More details on raw DCPMM device performance can be found elsewhere [UCSDMeasurement]; we focus on the end-to-end performance of hash tables on PM.

2.2 Dynamic Hashing

Now we give an overview of extendible hashing [ExtHashing] and linear hashing [LarsonLinearHashing, LitwinLinearHashing]. We focus on their memory-friendly versions which PM-adapted hash tables were based upon. Dash can also be applied to other approaches which we defer to future work.

Extendible Hashing. The crux of extendible hashing is its use of a directory to index buckets so that they can be added and removed dynamically at runtime. When a bucket is full, it is split into two new buckets with keys redistributed. The directory may get expanded (doubled) if there is not enough space to store pointers to the new bucket. Figure 2(a) shows an example with four buckets, each of which is pointed to by a directory entry; a bucket can store up to four records (key-value pairs). In the figure, indices of directory entries are shown in binary. The two least significant bits (LSBs) of the hash value are used to select a bucket; we call the number of suffix bits being used here the global depth. The hash table can have at most $2^{\text{global depth}}$ directory entries (buckets). A search operation follows the pointer in the corresponding directory entry to probe the bucket. Each bucket also has a local depth. In Figure 2(a), the local depth of each bucket is 2, same as the global depth. Suppose we want to insert key 30 that is hashed to bucket 01, which is full and needs to be split to accommodate the new key. (Choosing a proper hash function that evenly distributes keys to all buckets is an important but orthogonal problem to our work.) Splitting the bucket will require more directory entries. In extendible hashing, the directory always grows by doubling its current size. The result is shown in Figure 2(b). Here, bucket 01 in Figure 2(a) is split into two new buckets (001 and 101); one occupies the original directory entry, and the other occupies the second entry in the newly added half of the directory. Other new entries still point to their corresponding original buckets. Search operations will use three bits to determine the directory entry index (global depth now is three).

After a bucket is split, we increment its local depth by one, and update the new bucket’s local depth to be the same (3 in our example). The other unsplit buckets’ local depth remains 2. This allows us to determine whether a directory doubling is needed: if a bucket whose local depth equals the global depth is split (e.g., bucket 001 or 101), then the directory needs to be doubled to accommodate the new bucket. Otherwise (local depth < global depth), the directory should have $2^{\text{global depth} - \text{local depth}}$ directory entries pointing to that bucket, which can be used to accommodate the new bucket. For instance, if bucket 000 needs to be split, directory entry 100 (pointing to bucket 000) can be updated to point to the new bucket.
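The directory-doubling rule can be made concrete with a small in-memory sketch. It is a textbook-style illustration rather than any PM design from this paper: the class and method names are hypothetical, Python's built-in `hash` stands in for the hash function, and the sketch assumes keys whose hash values differ (otherwise splitting would never separate them).

```python
class Bucket:
    def __init__(self, local_depth, capacity=4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.records = {}


class ExtendibleHash:
    def __init__(self, capacity=4):
        self.global_depth = 1
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Use the global_depth least significant bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def search(self, key):
        return self.directory[self._index(key)].records.get(key)

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.records or len(bucket.records) < bucket.capacity:
                bucket.records[key] = value
                return
            self._split(bucket)  # full: split, then retry the insert

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # No spare entry points to this bucket: double the directory.
            # The new half initially points to the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Entries whose newly used bit is 1 now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = sibling
        # Redistribute records according to the newly used bit.
        for k in list(bucket.records):
            if hash(k) & high_bit:
                sibling.records[k] = bucket.records.pop(k)
```

After any sequence of inserts, the number of directory entries pointing to a bucket equals two raised to the difference between the global depth and that bucket's local depth, which is the invariant the split rule relies on.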

Figure 2: An example of extendible hashing. (a) The hash table is full. (b) The local depth of unsplit buckets is 2. Splitting buckets with local depth < global depth will not double the directory.

Linear Hashing. In-memory linear hashing takes a similar approach to organizing buckets using a directory with entries pointing to individual buckets [LarsonLinearHashing]. The main difference compared to extendible hashing is that in linear hashing, the bucket to be split is chosen “linearly.” That is, it keeps a pointer (page ID or address) to the bucket to be split next and only that bucket would be split in each round, and advances the pointer to the next bucket when the split of the current bucket is finished. Therefore, the bucket being split is not necessarily the same as the bucket that is full as a result of inserts, and eventually the overflowed bucket will be split and have its keys redistributed. If a bucket is full and an insert is requested to it, more overflow buckets will be created and chained together with the original, full bucket. For correct addressing and lookup, linear hashing uses a group of hash functions $h_0, \ldots, h_n$, where $h_n$ covers twice the range of $h_{n-1}$. For buckets that are already split, $h_n$ is used so we can address buckets in the new hash table capacity range, and for the other unsplit buckets we use $h_{n-1}$ to find the desired bucket. After all buckets are split (a round of splitting has finished), the hash table’s capacity will be doubled; the pointer to the next-to-be-split bucket is reset to the first bucket for the next round of splitting.
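The addressing rule can be sketched as a single function. This is the standard modulo-based formulation, given here as an assumption for illustration: the function name and the parameters `initial_buckets`, `level` and `next_to_split` are hypothetical, and the two hash functions are modeled as taking the hash value modulo a range and modulo twice that range.

```python
def linear_hash_address(hash_value, initial_buckets, level, next_to_split):
    """Return the bucket index for hash_value in a linear hash table.

    level counts completed splitting rounds; next_to_split is the pointer to
    the bucket that will be split next in the current round.
    """
    n = initial_buckets << level       # range of the narrower hash function
    bucket = hash_value % n            # address as if the bucket were unsplit
    if bucket < next_to_split:
        # This bucket was already split in the current round, so use the
        # function that covers twice the range.
        bucket = hash_value % (2 * n)
    return bucket
```

For example, with four initial buckets, no completed round and bucket 0 already split, a hash value of 4 maps to the new bucket 4 while a hash value of 5 still maps to the unsplit bucket 1.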

Determining when to trigger a split is an important problem in linear hashing. A typical approach is to monitor and keep the load factor bounded [LarsonLinearHashing]. The choice of a proper splitting strategy may also vary across workloads, and is orthogonal to the design of Dash.

2.3 Dynamic Hashing on PM

Now we discuss how dynamic hashing is adapted to work on PM. We focus on extendible hashing and start with CCEH [CCEH], a recent representative design; Section 5 covers linear hashing on PM.

To reduce PM accesses, CCEH groups buckets into segments, similar to in-memory linear hashing [LarsonLinearHashing]. Each directory entry then points to a segment which consists of a fixed number of buckets indexed by additional bits in the hash values. By combining multiple buckets into a larger segment, the directory can become significantly smaller as fewer bits are needed to address segments, making it more likely to be cached entirely by the CPU, which helps reduce accesses to PM. Note that splits now happen at the segment (instead of bucket) level. A segment is split once any bucket in it is full, even if the other buckets in the segment still have free slots, which results in low load factor and more PM accesses. To reduce such premature splits, linear probing can be used to allow a record to be inserted into a neighbor bucket. However, this improves load factor at the cost of more cache misses and PM accesses. Thus, most approaches bound probing distance to a fixed number, e.g., CCEH probes no more than four cachelines. However, our evaluation (Section 6) shows that linear probing alone is not enough to achieve high load factor.
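To illustrate two-level addressing with segments, the sketch below splits one hash value into a directory index and a bucket index, and computes how much the directory shrinks. The choice of bits is an assumption of this sketch (most significant bits for the directory, least significant bits for the bucket within the segment), and `locate` and `directory_entries` are hypothetical helper names.

```python
def locate(hash_value, global_depth, bucket_bits, hash_bits=64):
    """Split a hash value into (directory index, bucket index in segment).

    Assumption: the top global_depth bits select the directory entry and the
    low bucket_bits bits select the bucket inside the segment.
    """
    if global_depth:
        dir_index = hash_value >> (hash_bits - global_depth)
    else:
        dir_index = 0
    bucket_index = hash_value & ((1 << bucket_bits) - 1)
    return dir_index, bucket_index


def directory_entries(total_buckets, buckets_per_segment):
    """Directory entries needed when each entry covers a whole segment."""
    return total_buckets // buckets_per_segment
```

With about a million buckets (2^20) and 64 buckets per segment, `directory_entries` gives 16384 entries instead of 1048576, which is why a segmented directory is more likely to stay in the CPU cache.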

Another important aspect of dynamic PM hashing is to ensure failure atomicity, particularly during segment split which is a three-step process: (1) allocate a new segment in PM, (2) rehash records from the old segment to the new segment and (3) register the new segment in the directory and update local depth. Existing approaches such as CCEH only focused on step 3, side-stepping PM management issues surrounding steps 1–2. If the system crashes during step 1 or 2, we need to guarantee the new segment is reclaimed upon restart to avoid permanent memory leaks. In Sections 4 and 6.1, we describe Dash’s solution and a solution for existing approaches.

3 Design Principles

The aforementioned issues and performance characteristics of Optane DCPMM lead to the following design principles of Dash:

  • Avoid both Unnecessary PM Reads and Writes. Probing performance impacts not only search operations, but also all the other operations. Therefore, in addition to reducing PM writes, Dash must also remove unnecessary PM reads to conserve bandwidth and alleviate the impact of high end-to-end read latency.

  • Lightweight Concurrency. Dash should scale well on multicore machines with persistence guarantees. Given the limited bandwidth, concurrency control must be lightweight to not incur much overhead (i.e., avoid PM writes for search operations, such as read locks). Ideally, it should also be relatively easy to implement.

  • Full Functionality. Dash must not sacrifice or trade off important features that make a hash table useful in practice. In particular, it needs to support near-instantaneous recovery and variable-length keys and achieve high space utilization.

4 Dash for Extendible Hashing

Based on the principles in Section 3, we describe Dash in the context of Dash-Extendible Hashing (Dash-EH). We discuss how Dash applies to linear hashing in Section 5.

4.1 Overview

Similar to prior approaches [CCEH, LarsonLockFreeLH], Dash-EH uses segmentation. As shown in Figure 3, each directory entry points to a segment which consists of a fixed number of normal buckets and stash buckets for overflow records from normal buckets which did not have enough space for the inserts. The lock, version number and clean marker are for concurrency control and recovery, which we describe later.

Figure 4 shows the internals of a bucket. We place the metadata used for bucket probing on the first 32 bytes, followed by multiple 16-byte record (key-value pair) slots. The first 8 bytes in each slot store the key (or a pointer to it for keys longer than 8 bytes). The remaining 8 bytes store the payload which is opaque to Dash; it can be an inlined value or a pointer, depending on the application’s need. The size of a bucket is adjustable. In our current implementation it is set to 256 bytes (the block size of Optane DCPMM [UCSDMeasurement]) for better locality. This allows us to store 14 records (16 bytes each) per bucket.
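The slot count follows from simple arithmetic on the sizes given above, which the following sketch spells out. The helper names and the little-endian encoding of a slot are assumptions for illustration; only the sizes (256-byte bucket, 32-byte metadata, 8-byte key and 8-byte payload) come from the text.

```python
import struct

METADATA_BYTES = 32  # metadata at the start of each bucket
SLOT_BYTES = 16      # 8-byte key (or pointer) + 8-byte payload


def slots_per_bucket(bucket_bytes, metadata_bytes=METADATA_BYTES,
                     slot_bytes=SLOT_BYTES):
    """Number of record slots that fit after the metadata."""
    return (bucket_bytes - metadata_bytes) // slot_bytes


def pack_slot(key, payload):
    """Encode one record slot as two unsigned 64-bit integers.

    The little-endian byte order is an assumption of this sketch.
    """
    return struct.pack("<QQ", key, payload)
```

For a 256-byte bucket this gives (256 - 32) / 16 = 14 slots, matching the record count stated above.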

The 32-byte metadata includes key data structures for Dash-EH to handle hash table operations and realize the design principles. It starts with a 4-byte version lock for optimistic concurrency control (Section 4.4). The allocation bitmap reserves one bit per slot, to indicate whether the corresponding slot stores a valid record. The membership bitmap is reserved for bucket load balancing which we describe later (Section 4.3). A 4-bit counter records the number of records stored in the bucket. What follows are structures such as fingerprints and counters to accelerate probing and improve load factor. Most unnecessary probings are avoided by scanning the fingerprints area.

Figure 3: Overall architecture of Dash-EH.
Figure 4: Dash-EH bucket layout. The first 32 bytes are dedicated to metadata that optimizes probing and load factor, followed by records. Normal and stash buckets share the same layout.

4.2 Fingerprinting

Bucket probing (i.e., search in one bucket) is a basic operation needed by all the operations supported by a hash table (search, insert and delete) to check for key existence. Searching a bucket typically requires a linear scan of the slots. This can incur lots of cache misses and is a major source of PM reads, especially so for long keys stored as pointers. It is a major reason for hash tables on PM to exhibit low performance. Moreover, such scans for negative search operations (i.e., when the target key does not exist) are completely unnecessary.

We employ fingerprinting [FPTree] to reduce unnecessary scans. It was originally used by trees to reduce PM accesses with an amortized number of key loads of one. We adopt it in hash tables to reduce cache misses and accelerate probing. Fingerprints are one-byte hashes of keys for predicting whether a key possibly exists. We use the least significant byte of the key’s hash value. To probe for a key, the probing thread first checks whether any fingerprint matches the search key’s fingerprint. It then only accesses slots with matching fingerprints, skipping all the other slots. If there is no match, the key is definitely not present in the bucket. This process can be further accelerated with SIMD instructions [IntelManual].
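A minimal sketch of fingerprint-guided probing within one bucket is shown below. It is an illustration under stated assumptions, not Dash's code: `hash64` uses BLAKE2b from Python's standard library as a stand-in hash function, the `Bucket` class and its `key_comparisons` counter are hypothetical, the insert path skips the uniqueness check, and the scan is a plain loop rather than SIMD.

```python
import hashlib


def hash64(key: bytes) -> int:
    # Stand-in 64-bit hash; the hash function is an assumption of this sketch.
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")


def fingerprint(key: bytes) -> int:
    return hash64(key) & 0xFF  # least significant byte of the hash value


class Bucket:
    def __init__(self, num_slots=14):
        self.fingerprints = [None] * num_slots
        self.slots = [None] * num_slots
        self.key_comparisons = 0  # counts full key comparisons (slot accesses)

    def insert(self, key, value):
        # Sketch only: no uniqueness check, first free slot wins.
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.fingerprints[i] = fingerprint(key)
                self.slots[i] = (key, value)
                return True
        return False  # bucket is full

    def search(self, key):
        fp = fingerprint(key)
        for i, stored in enumerate(self.fingerprints):
            if stored != fp:
                continue  # fingerprint mismatch: skip the slot entirely
            self.key_comparisons += 1
            stored_key, value = self.slots[i]
            if stored_key == key:
                return value
        return None
```

A negative search touches a slot only when a stored fingerprint happens to equal the search key's fingerprint, so in this sketch `key_comparisons` grows by exactly the number of such matches.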

Fingerprinting particularly benefits negative search (where the search key does not exist) and uniqueness checks for inserts. It also allows Dash to use larger buckets to tolerate more collisions and improve load factor, without incurring many cache misses: most unnecessary probes are avoided by fingerprints. This design contrasts with many prior designs that trade load factor for performance by having small buckets of 1–2 cachelines [PFHT, CCEH, LevelHashing].

As Figure 4 shows, each bucket contains 14 slots but 18 fingerprints (bits 64–208): 14 are for slots in the bucket, and the other four represent keys that were originally hashed into the current bucket but placed in a stash bucket. They allow early avoidance of access to stash buckets, saving PM bandwidth. We describe details next as part of the bucket load balancing strategy that improves load factor.

4.3 Bucket Load Balancing

Segmentation reduces cache misses on the directory (by reducing its size). However, as we describe in Sections 2.3 and 6, this is at the cost of load factor: in a naive implementation the entire segment needs to be split if any bucket is full, yet other buckets in the segment may still have much free space. We observe that the key reason is load imbalance caused by the (inflexible) way buckets are selected for inserting new records, i.e., a key is only mapped to a single bucket. Dash uses a combination of techniques for new inserts to balance loads among buckets while limiting PM reads needed. Algorithm 1 shows how the insert operation works in Dash-EH at a high level, with three key techniques described below.

Balanced Insert. To insert a record whose key is hashed into bucket $b$ ($hash(key) = b$), Dash probes both bucket $b$ and $b+1$ and inserts the record into the bucket that is less full (Figure 3 step 1). Lines 17–21 in Algorithm 1 show the idea; we discuss insertion of a record into a bucket and other details in the following sections. The rationale is to improve load factor by amortizing the load of hot buckets while limiting the amount of PM accesses (at most two buckets). Compared to balanced insert, linear probing allows a record to be inserted into bucket $b+k$ where $k \geq 0$ if buckets $b$ through $b+k-1$ are full. Probing multiple buckets may degrade performance by imposing more PM reads and cache misses. It is also hard to tune the number of buckets to probe.

Displacement. If both the target bucket and probing bucket are full, Dash-EH tries to displace (move) a record from bucket $b$ or $b+1$ to make room for the new record (Algorithm 1 lines 23–26). With balanced insert, a record in bucket $b$ can be moved to bucket $b+1$ if (1) it could be inserted to either bucket (i.e., $b+1$ is the probing bucket of the record being moved), and (2) bucket $b+1$ has a free slot. Therefore, for a record with $hash(key) = b$ and both bucket $b$ and $b+1$ are full, we first try to find a record in bucket $b+1$ whose $hash(key) = b+1$ and move it to $b+2$. If such a record does not exist, we repeat in the reverse direction for bucket $b$ but move a record whose $hash(key) = b-1$ (the target bucket). In essence, displacement follows a similar strategy to balanced insert, but is for existing records.

We use a per-bucket membership bitmap (Figure 3) to accelerate displacement. If a bit is set, then the corresponding key was not originally hashed into this bucket; it was placed here because of balanced insert or displacement. When checking for bucket $b$ ($b+1$), a record whose membership bit is set (unset) can be displaced. Dash then only needs to scan the bitmap to pick directly a record to move, without having to examine the actual keys. This reduces unnecessary PM accesses, and is especially beneficial for variable length keys which are not inlined but represented by pointers.
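Balanced insert and membership-guided displacement can be illustrated together with a small in-memory sketch. It makes several simplifying assumptions that are not from the paper: buckets are Python lists with equal capacity, each slot carries its membership bit directly, bucket indexes wrap around modulo the number of buckets, and locking, persistence and stashing are left out. The names `Bucket`, `pick` and `insert` are hypothetical.

```python
class Bucket:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.slots = []  # list of (key, membership_bit)

    def full(self):
        return len(self.slots) >= self.capacity

def pick(bucket, want_set):
    """Index of the first slot whose membership bit equals want_set, else None."""
    for i, (_, member) in enumerate(bucket.slots):
        if member == want_set:
            return i
    return None

def insert(buckets, b, key):
    """Insert a key whose hash maps to bucket b; False means 'stash or split'."""
    n = len(buckets)
    target, probing = buckets[b], buckets[(b + 1) % n]

    # Balanced insert: use the less-full of b and b+1 (ties go to b).
    if not (target.full() and probing.full()):
        if len(target.slots) <= len(probing.slots):
            target.slots.append((key, False))  # stored in its home bucket
        else:
            probing.slots.append((key, True))  # membership bit set: not home
        return True

    # Displacement, forward: move a record native to b+1 (bit unset) to b+2.
    nxt = buckets[(b + 2) % n]
    i = pick(probing, want_set=False)
    if i is not None and not nxt.full():
        moved_key, _ = probing.slots.pop(i)
        nxt.slots.append((moved_key, True))
        probing.slots.append((key, True))
        return True

    # Displacement, reverse: move a record in b that came from b-1 (bit set) back.
    prev = buckets[(b - 1) % n]
    i = pick(target, want_set=True)
    if i is not None and not prev.full():
        moved_key, _ = target.slots.pop(i)
        prev.slots.append((moved_key, False))
        target.slots.append((key, False))
        return True

    return False
```

Note that `pick` inspects only the membership bits, never the keys, which mirrors the reason the bitmap exists: the record to move is chosen without loading any key from PM.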

 1  def dash_eh_insert(key, value):
 2    h = hash(key)
 3  retry:
 4    # Obtain references and lock buckets
 5    [target_seg] = get_segment(h)
 6    [target_bucket, probing_bucket] = target_seg.bk(h)
 7    Lock target_bucket and probing_bucket
 8
 9    # Verify the correctness of the segment reference
10    [verify_seg] = get_segment(h)
11    if verify_seg is not target_seg
12      Unlock and goto retry
13
14    if key exists in either bucket or the stash:
15      Unlock and return Result::KeyExists
16
17    if target_bucket or probing_bucket is not full:
18      if target_bucket.count <= probing_bucket.count:
19        target_bucket.insert(key, value, h)
20      else:
21        probing_bucket.insert(key, value, h)
22    else
23      # Try displacement (possibly stashing)
24      bucket = displace(target_bucket, probing_bucket)
25      if bucket is not NULL:
26        bucket.insert(key, value, h)
27      elif stash_bucket.insert(target, key, value, h):
28        target.overflow = true
29        Set overflow fingerprint bitmap and fingerprint
30      else  # Stashing failed, have to split
31        split_segment(h)
32        goto retry
33    Unlock target_bucket and probing_bucket
34    return Result::Inserted
Algorithm 1 Dash-EH insert algorithm with bucket load balancing.

Stashing. If the record cannot be inserted into bucket $b$ or $b+1$ after balanced insert and displacement, stashing is the last resort before a segment split has to happen. In Figure 3, a tunable number of stash buckets follow the normal buckets in each segment. If a record can be inserted into neither its target bucket nor its probing bucket, we insert the record into a stash bucket; we call these records overflow records. Stash buckets use the same layout as that of normal buckets; probing of a stash bucket follows the same procedure as probing a normal bucket (see Section 4.2). While stashing can be effective in improving load factor, it could incur non-trivial overhead: the more stash buckets are used, the more CPU cycles and PM reads will be needed to probe them. This is especially undesirable for negative search and uniqueness checks in insert operations, since both need to probe all stash buckets even though doing so may be completely unnecessary.

To solve this problem, we try to set up record metadata including fingerprints in a normal bucket and only refer actual record access to the stash bucket. As Figure 4 shows, four additional fingerprints per bucket are reserved for overflow records stored in stash buckets. A 1-bit overflow bit indicates whether the bucket has overflowed any record to a stash bucket. Another 4-bit overflow fingerprint bitmap records whether the corresponding fingerprint slot is occupied. This process is shown in Algorithm 1 (lines 27–29). Similar to inserting records into a normal bucket, the overflow record’s fingerprint can also follow the balanced insert strategy, with the help of an overflow membership bitmap which indicates whether the overflow fingerprints originally belong to this bucket. If the overflow fingerprint can be inserted into neither the target nor the probing bucket, we increment the overflow counter in the target bucket. Once the counter becomes positive, a probing thread will have to check the stash area to ensure that a key does or does not exist. Thus, it is desirable to reserve enough slots of overflow fingerprints in each bucket so that the overflow counter is rarely positive. As Section 6 shows, using 2–4 stash buckets per segment can improve load factor to over 90% without imposing significant overhead.
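The way overflow metadata decides whether a search must touch the stash can be sketched as follows. This is a simplified illustration under stated assumptions: `BucketMeta`, `record_overflow` and `stash_action` are hypothetical names, and the overflow membership bitmap is omitted, so a fingerprint parked in the probing bucket can only cause an extra stash probe, never a missed record.

```python
from dataclasses import dataclass, field

FP_SLOTS = 4  # overflow fingerprint slots per bucket, as in Figure 4

@dataclass
class BucketMeta:
    overflow_fps: list = field(default_factory=lambda: [0] * FP_SLOTS)
    overflow_bitmap: int = 0  # bit i set: overflow_fps[i] is occupied
    overflow_count: int = 0   # overflow records with no fingerprint slot

    def try_add(self, fp: int) -> bool:
        for i in range(FP_SLOTS):
            if not (self.overflow_bitmap >> i) & 1:
                self.overflow_fps[i] = fp
                self.overflow_bitmap |= 1 << i
                return True
        return False

    def matches(self, fp: int) -> bool:
        return any((self.overflow_bitmap >> i) & 1 and self.overflow_fps[i] == fp
                   for i in range(FP_SLOTS))

def record_overflow(target: BucketMeta, probing: BucketMeta, fp: int) -> None:
    """Called after a record was placed in a stash bucket."""
    if not (target.try_add(fp) or probing.try_add(fp)):
        target.overflow_count += 1  # no slot left: fall back to the counter

def stash_action(target: BucketMeta, probing: BucketMeta, fp: int) -> str:
    """What a search must do after missing in both normal buckets."""
    if target.overflow_count > 0:
        return "scan_all"  # fingerprints are incomplete, scan every stash bucket
    if target.matches(fp) or probing.matches(fp):
        return "probe"     # a stash record may match
    return "skip"          # the key cannot be in the stash
```

With two buckets offering eight fingerprint slots in total, only the ninth overflow record forces the counter above zero, which is why reserving enough slots keeps stash scans rare.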

4.4 Optimistic Concurrency

Dash employs optimistic locking, an optimistic flavor of bucket-level locking inspired by optimistic concurrency control [OCC, Silo]. Insert operations will follow traditional bucket-level locking to lock the affected buckets. Search operations are allowed to proceed without holding any locks (thus avoiding writes to PM) but need to verify the read record. For this to work, in Dash the lock consists of (1) a single bit that serves the role of “the lock” and (2) a version number for detecting conflicts (not to be confused with the version number in Figure 3 for instant recovery). As line 7 in Algorithm 1 shows, the inserting thread will acquire bucket-level locks for the target and probing buckets. This is done by atomically setting the lock bit in each bucket by trying the compare-and-swap (CAS) instruction [IntelManual] until success. Then the thread enters the critical section and continues its operations. After the insert is done, the thread releases the lock by (1) resetting the lock bit and (2) incrementing the version number by one, in one step using an atomic write.

To probe a bucket for a key, Dash first takes a snapshot of the lock word and checks whether the lock is being held by a concurrent writer (the lock bit is set). If so, it waits until the lock is released and repeats. Then it is allowed to read the bucket without holding any lock. Upon finishing its operations, the reader thread reads the lock word again to verify that the version number did not change; if it did, the reader retries the entire operation, as a concurrent write might have modified the record and it might not be valid. This lock-free read design requires that segment deallocation (due to merge) happen only after no readers are (and will be) using the segment. We use epoch-based reclamation [epoch] to achieve this without incurring much overhead.
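The writer and reader protocols above can be sketched with a single 32-bit lock word. The bit layout (top bit as the lock, low 31 bits as the version) is an assumption, and since Python has no hardware compare-and-swap, `_cas` emulates one with a `threading.Lock`; `VersionLock` and `optimistic_read` are hypothetical names.

```python
import threading

LOCK_BIT = 1 << 31           # assumption: top bit of the 4-byte word is the lock
VERSION_MASK = LOCK_BIT - 1  # low 31 bits hold the version number

class VersionLock:
    def __init__(self):
        self.word = 0
        self._guard = threading.Lock()  # only emulates the atomicity of CAS

    def _cas(self, expected, new):
        with self._guard:
            if self.word == expected:
                self.word = new
                return True
            return False

    def acquire(self):
        # Spin until the lock bit is atomically flipped from 0 to 1.
        while True:
            old = self.word
            if not (old & LOCK_BIT) and self._cas(old, old | LOCK_BIT):
                return

    def release(self):
        # One store: clear the lock bit and bump the version.
        self.word = ((self.word & VERSION_MASK) + 1) & VERSION_MASK

def optimistic_read(lock, read_fn):
    """Run read_fn without locking; retry if a writer interfered."""
    while True:
        snapshot = lock.word
        if snapshot & LOCK_BIT:
            continue               # a writer holds the lock, wait and retry
        value = read_fn()
        if lock.word == snapshot:  # version unchanged: the read was consistent
            return value
```

Readers never write to the lock word in this scheme, which is the property that avoids PM writes on the search path.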

Dash does not use segment-level locks, saving PM accesses at the segment level. As a result, structural modification operations (SMOs, such as segment split) need to lock all the buckets in each segment. Directory doubling/halving is handled similarly: the directory lock is only held when the directory is being doubled or halved. For other operations on the directory (e.g., updating a directory entry to point to a new segment), no lock is taken. Instead, they are treated as search operations without taking the directory lock. This is safe because we guarantee isolation at the segment level: an inserting thread must first acquire locks to protect the affected buckets. “Real” probings (search/insert) proceed without reading the directory lock but again need to verify that they entered the right segment by re-reading the directory to test whether these two read results match; if not, the thread aborts and retries the entire operation.

4.5 Support for Variable-Length Keys

Dash stores pointers to variable-length keys, which is a common approach [FPTree, BzTree, CCEH, LevelHashing]. A knob is provided to switch between the inline (fixed-length keys up to 8 bytes) and pointer modes. Though dereferencing pointers may incur extra overhead, fingerprinting largely alleviates this problem. For negative search where the target key does not exist, no fingerprint will match and so key probing will not happen at all. For positive search, as we have discussed in Section 4.2, the amortized number of key loads (and therefore the number of cache misses caused by following the key pointer) is one [FPTree].

4.6 Record Operations

Now we present details on how Dash-EH performs insert, search and delete operations on PM with persistence guarantees.

 1  def bucket::insert(key, value, h):
 2    slot = slots[slot_id = get_free_slot()]
 3    slot.assign(key, value)
 4    CLWB+FENCE(slot)  # Persist the record first
 5
 6    fingerprints[slot_id] = LSB_byte(h)
 7    # Since allocation_bitmap and counter are in the same
 8    # word, their updates are done in one store operation
 9    allocation_bitmap.set(slot_id)
10    ++counter
11
12    # Persist allocation bitmap, fingerprint and counter
13    # in one flush (same cacheline, no reordering on x86)
14    CLWB+FENCE(allocation_bitmap, fingerprint, counter)
15
16  def displace(target = b, prob = b+1):
17    # Try to move a record from b+1 to b+2
18    slot_id = prob.get_unset_LSB(membership)
19    if slot_id is not Invalid:
20      (prob+1).insert(prob.slots[slot_id])
21      # Mark deletion for the moved record and decrease
22      # the counter, done in one store operation
23      prob.allocation_bitmap &= ~(1 << slot_id)
24      prob.counter--
25      CLWB+FENCE(prob.allocation_bitmap, counter)
26      return prob
27    else
28      # Try to move a record from b to b-1
29      slot_id = target.get_set_LSB(membership)
30      if slot_id is not Invalid:
31        (target-1).insert(target.slots[slot_id])
32        target.allocation_bitmap &= ~(1 << slot_id)
33        target.counter--
34        CLWB+FENCE(target.allocation_bitmap)
35        return target
36      else
37        return NULL
Algorithm 2 Bucket insert and displacement in Dash-EH.

Insert. Section 4.3 presented the high-level steps for insert; here we focus on the bucket-level. As the bucket::insert function in Algorithm 2 shows, we first write and persist the new record (lines 3-4), and then set up the metadata (fingerprint, allocation bitmap and counter, lines 6–10). Note that the allocation bitmap and counter are in one word; they are updated in one atomic write. The CLWB and fence are then issued (line 14) to persist all the metadata. Once the corresponding bit in the bitmap is set, the record is visible to other threads. If a crash happens before the bitmap is persisted, the new record is regarded as invalid; otherwise, the record is successfully inserted. This allows us to avoid expensive logging while maintaining consistency.

Displacing a record needs two steps: (1) inserting it into the new bucket and (2) deleting it from the original bucket. As the displace function in Algorithm 2 shows, step 2 is done by resetting the corresponding bit in the allocation bitmap without moving data. In case a crash happens before step 2 finishes, a record will appear in both buckets. This necessitates a duplicate detection mechanism upon recovery, which is amortized over runtime (see Section 4.8).

If the insert has to happen in a stash bucket, we set the overflow metadata in the normal bucket. This cannot be done atomically with 8-byte writes and may need a (complex) protocol for crash consistency. We note that the overflow metadata is an optimization and does not influence correctness: records can still be found correctly even without it. So we do not explicitly persist it and rely on the lazy recovery mechanism to build it up gradually (described later).

 1  def dash_eh_search(key):
 2  retry:
 3    # Get the segment and buckets (h = hash(key))
 4    [target_seg] = get_segment(h)
 5    [target_bucket, probing_bucket] = target_seg.bk(h)
 6    vt = target_bucket.version_lock
 7    vp = probing_bucket.version_lock
 8
 9    # Verify the correctness of the segment reference
10    [verify_seg] = get_segment(h)
11    if verify_seg is not target_seg
12      goto retry
13
14    if is_lock_set(vt) or is_lock_set(vp)
15      goto retry
16
17    result = target_bucket.search(key)
18    if vt is not target_bucket.version_lock
19      goto retry
20    if result is not NULL
21      return result
22
23    result = probing_bucket.search(key)
24    if vp is not probing_bucket.version_lock
25      goto retry
26    if result is not NULL
27      return result
28
29    # Determine whether to search in the stash buckets
30    # Note that version lock check is omitted below
31    if probing bucket.overflow_count is zero
32      if key matches overflow fingerprints
33        search corresponding stash buckets and return
34      else
35        return NULL
36    else
37      Search by scanning the stash buckets and return
38    return NULL
Algorithm 3 Dash-EH search algorithm.

Search. With balanced insert and displacement, a record could be inserted into its target bucket $b$ where $hash(key) = b$ or its probing bucket $b+1$. A search operation then has to check both buckets if the record is not found in bucket $b$. As Algorithm 3 shows, to search for a key, the probing thread starts by probing the directory to obtain a reference to the corresponding segment and buckets (lines 4–5). It then takes a snapshot of the version number of both buckets (lines 6–7) for verification later. We verify at line 10 that the segment did not change (i.e., the corresponding directory entry still points to it) and retry if needed. Once the segment check has passed, we check whether the target/probing buckets are being modified (i.e., locked) at lines 14–15 and if not, proceed to search the target and probing buckets (lines 17–29) using the bucket::search function (not shown). Note that we need to verify that the lock version did not change after bucket::search returns (lines 18 and 24).

If neither bucket contains the record, it might be stored in a stash bucket (lines 31–37). If overflow_count > 0, then we search the stash buckets as the overflow fingerprint area does not have enough space for all overflow records from the bucket. Otherwise, stash access is only needed if there is a matching fingerprint (lines 31–35).

Delete. To delete a record from a normal bucket, we reset the corresponding bit in the allocation bitmap, decrement the counter and persist these changes. Then the slot becomes available for future reuse. A segment merge operation will be triggered if the load factor drops below a threshold. To delete a record from a stash bucket, in addition to marking the slot as free in the allocation bitmap, we also clear the overflow fingerprint in the target bucket which this record overflowed from if it exists; otherwise we only decrement the target bucket’s overflow counter.

4.7 Structural Modification Operations

When a thread has exhausted all the options to insert a record into a bucket, it triggers a segment split and possibly expansion of the directory. Conversely, when the load factor drops below a threshold, segments can be merged to save space. At a high level, three steps are needed to split a segment $S$: (1) allocate a new segment $N$, (2) rehash keys in $S$ and redistribute records in $S$ and $N$, and (3) attach $N$ to the directory and set the local depth of $N$ and $S$. These operations cause the structure of the hash table to change and must be made crash consistent on PM while maintaining high performance.

For crash consistency, Dash-EH chains all segments using side links to the right neighbor. Each segment has a state variable to indicate whether the segment is in an SMO and whether it is the one being split or the new segment. An initial value of zero indicates the segment is not part of an SMO. Figure 5 shows an example. Note that as shown by the figure, Dash-EH uses the most significant bits (MSBs) of hash values to address and organize segments and buckets (i.e., the directory is indexed by the MSBs of hash values), similar to other recent work [CCEH]. This is different from traditional extendible hashing described in Section 2.2 that uses LSBs of hash values to address buckets. Using LSBs was the choice in the disk era to reduce I/O: the directory can be doubled by simply copying the original directory and appending it to the directory file. On PM, such advantage is marginalized as to double a directory, one needs to allocate and persist a double-sized directory in PM anyway to keep the directory in a contiguous address space. Using MSBs also allows directory entries pointing to the same segment to be co-located, reducing cacheline flushes during splits [CCEH].
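The effect of MSB-based directory indexing can be illustrated with a short sketch. It assumes 64-bit hash values and models the directory as a plain Python list; `dir_index`, `entries_for_segment` and `double_directory` are hypothetical helper names, not functions from Dash or CCEH.

```python
HASH_BITS = 64  # assumption: 64-bit hash values

def dir_index(h: int, global_depth: int) -> int:
    """Directory slot for hash h: its global_depth most significant bits."""
    return h >> (HASH_BITS - global_depth) if global_depth else 0

def entries_for_segment(prefix: int, local_depth: int, global_depth: int) -> range:
    """Entries pointing to the segment that owns a local_depth-bit prefix.

    With MSB indexing these entries form one contiguous range, so a split
    updates neighboring directory slots rather than scattered ones.
    """
    shift = global_depth - local_depth
    start = prefix << shift
    return range(start, start + (1 << shift))

def double_directory(directory: list) -> list:
    """Doubling with MSB indexing: old entry i becomes entries 2i and 2i+1."""
    return [seg for seg in directory for _ in range(2)]
```

For example, `double_directory(['A', 'A', 'B', 'C'])` yields `['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C']`. Because the new directory interleaves the old entries instead of appending a copy, it has to be written out in full, which matches the observation above that the append-only advantage of LSB indexing does not carry over to PM.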

Figure 5: Segment split in Dash-EH; the global depth is 2.

To split a segment $S$, we first mark its state as SPLITTING and allocate a new segment $N$ whose address is stored in the side link of $S$. $N$ is then initialized to carry $S$’s side link as its own. Its local depth is set to the local depth of $S$ plus one. Then, we change $N$’s state to NEW to indicate this new segment is part of a split SMO for recovery purposes (see Section 4.8). We rely on PM programming libraries (PMDK [PMDK]) to atomically allocate and initialize the new segment; in case of a crash, the allocated PM block is guaranteed to be either owned by Dash or the allocator and will not be permanently leaked. After initialization, we finish up step 2 by redistributing records between $S$ and $N$. Records moved from $S$ to $N$ are deleted in $S$ after they are inserted into $N$. Note that the rehashing/redistributing process does not need to be done atomically: if a crash happens in the middle of rehashing, upon (lazy) recovery we redo the rehashing process with uniqueness check to avoid repeating work for records that were already inserted into $N$ before the crash; we describe more details later in Section 4.8. Figure 5(b) shows the state of the hash table after step 2. Then the directory entry for $N$ and the local depth of $S$ are updated as shown in Figure 5(c). Similarly, these updates are conducted using an atomic PMDK transaction which may use any approach such as lightweight logging. Many other systems avoid the use of logging to maintain high performance, largely because of frequent premature splits. But splits are much rarer in Dash thanks to bucket load balancing that gives high load factor (Section 4.3); this allows Dash-EH to employ logging-based PMDK transactions that abstract away many details and ease implementation.

4.8 Instant Recovery

Dash provides truly instant recovery by requiring a constant amount of work (reading and possibly writing a one-byte counter), before the system is ready to accept user requests. We add a global version number $V$ and a clean marker shown in Figure 3, and a per-segment version number. clean is a boolean that denotes whether the system was shut down cleanly; $V$ tells whether recovery (during runtime) is needed. Upon a clean shutdown, clean is set to true and persisted. Upon restart, if clean is true, we set clean to false and start to handle requests. Otherwise, we increment $V$ by one and start to handle requests. For both clean shutdown and crash cases, “recovery” only involves reading clean and possibly bumping $V$. The actual recovery work is amortized over segment accesses.

To access a segment, the accessing thread first checks whether the segment version matches $V$. If not, the thread (1) recovers the segment to a consistent state before doing its original operation (e.g., insert or search), and (2) sets the segment’s version number to $V$ so that future accesses can skip the recovery pass. With such a lazy recovery approach, a segment is not recovered until it is accessed. Multiple threads may access a segment that needs to be recovered. We employ a segment-level lock that is only for recovery purposes, but a thread only tries to acquire the lock if it sees the segment’s version number does not match $V$. Our current implementation uses one-byte version numbers. In case the version number wraps around and recovery is needed, we reset $V$ to zero and set the version number of each segment to one. Since crashes and repeated crashes are rare events, such wrap-around cases should be very rare.
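The restart and lazy recovery logic can be sketched as a small single-threaded model. The `Table` class and its fields are hypothetical; the segment-level recovery lock is omitted, persistence is not modeled, and `_recover` merely records which segment would be repaired.

```python
class Table:
    def __init__(self, n_segments: int):
        self.clean = False                # false while the table is in use
        self.global_version = 0           # one-byte global version number
        self.segment_versions = [0] * n_segments
        self.recovered = []               # segments that went through recovery

    def shutdown(self):
        self.clean = True                 # would be persisted on a clean shutdown

    def restart(self):
        """Work done before accepting requests: read clean, maybe bump the version."""
        if self.clean:
            self.clean = False
        elif self.global_version == 0xFF:
            # Rare wrap-around: reset the global version and force a mismatch.
            self.global_version = 0
            self.segment_versions = [1] * len(self.segment_versions)
        else:
            self.global_version += 1

    def access(self, seg: int):
        if self.segment_versions[seg] != self.global_version:
            self._recover(seg)            # lazy: only when the segment is touched
            self.segment_versions[seg] = self.global_version
        # ... the original operation (insert, search, ...) would run here

    def _recover(self, seg: int):
        self.recovered.append(seg)
```

A segment that is never touched after a crash keeps its stale version and is repaired whenever it is eventually accessed, even if clean restarts happen in between.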

Recovering a segment needs four steps: (1) clear bucket locks, (2) remove duplicate records, (3) rebuild overflow metadata, and (4) continue the ongoing SMO. Since some locks may be in the locked state at the moment of the crash, every lock in each bucket needs to be reset. Duplicate records are detected by checking the fingerprints in neighboring buckets. This is lightweight since the real key comparison is only needed if the fingerprints match. Overflow metadata in normal buckets also needs to be cleared and rebuilt based on the records in stash buckets as we do not guarantee their crash consistency for performance reasons. Finally, if a segment is in the SPLITTING state, the accessing thread will follow the segment’s side link to test whether the neighbor segment is in the NEW state. If so, we restart the rehashing-redistribution process and finish the split. Otherwise, we reset the state variable which in effect rolls back the split.

5 Dash for Linear Hashing

We present Dash-LH, Dash-enabled linear hashing that uses the building blocks discussed previously (balanced insert, displacement, fingerprinting and optimistic concurrency). We do not repeat them here and focus on the design decisions specific to linear hashing.

5.1 Overview

Figure 6 (focus on segments 0-3 for now) shows the overall structure of Dash-LH. Similar to Dash-EH, Dash-LH also uses segmentation and splits at the segment level. However, we follow the linear hashing approach to always split the segment pointed to by the Next pointer, which is advanced after the segment is split. Since the segment to be split is not necessarily a full segment, it needs to be able to accommodate overflow records, e.g., using linked lists. However, linked list traversal would incur many cache misses, which is a huge penalty for PM hash tables. Instead, we leverage the stashing design in Dash and use an adjustable number of stash buckets. In addition to a fixed number of stash buckets (e.g., 2 stash buckets) in each segment, we store a linked list of stash buckets. A segment split is triggered whenever a stash bucket is allocated to accommodate overflow records. This contrasts with classic linear hashing, which splits one bucket at a time and is vulnerable to long overflow chains under high insertion rates. Dash-LH uses a larger split unit (segment) and chaining unit (stash bucket rather than individual records), reducing chain length (and therefore pointer chasing and cache misses). The overflow metadata and fingerprints further help alleviate the performance penalty brought by the need to search stash buckets. Overall, as we show in Section 6, Dash-LH can also achieve near-linear scalability on realistic workloads.

5.2 Hybrid Expansion

Similar to Dash-EH, it is also important to reduce directory size for better cache locality. Some designs use double expansion [DoubleExpansion] which increases segment size exponentially: allocating a new segment doubles the number of buckets in the hash table. For example, the second segment allocated would be the size of the first segment, and the third segment would be larger than the first segment, and so on. The benefit is that the directory can become very small and often fits even in the L1 cache. However, it also makes load factor drop by half whenever a new segment is allocated.

To reduce space waste, we postpone double expansion and expand the hash table by several fixed-size segments first, before triggering double expansion. We call the number of such fixed expansions the stride. Figure 6 shows an example (stride = 4). A directory entry can point to an array of segments; the first four entries point to one-segment arrays, the next four entries point to two-segment arrays, and so on. With a larger stride, the allocation of larger segment arrays will have less impact on load factor. The result is a very small directory that is typically L1-resident. With 16KB segments, a first segment array of 64 segments and a stride of four, we can index TB-level data with a directory smaller than 1KB.
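The mapping from a logical segment number to a directory entry under hybrid expansion can be sketched as below. The sketch assumes that entry i points to an array of `first_size * 2**(i // stride)` segments, as in the Figure 6 example, and that directory entries are 8-byte pointers; `locate` and `entries_needed` are hypothetical helpers.

```python
def locate(seg_idx: int, stride: int, first_size: int = 1):
    """Map a logical segment number to (directory entry, offset in its array)."""
    entry, size = 0, first_size
    while True:
        group_capacity = size * stride  # segments covered by this group of entries
        if seg_idx < group_capacity:
            return entry + seg_idx // size, seg_idx % size
        seg_idx -= group_capacity
        entry += stride
        size *= 2                       # the next group uses arrays twice as large

def entries_needed(total_segments: int, stride: int, first_size: int) -> int:
    """Number of directory entries required to address total_segments segments."""
    entries, size, covered = 0, first_size, 0
    while covered < total_segments:
        covered += size
        entries += 1
        if entries % stride == 0:
            size *= 2
    return entries
```

As a rough check of the directory-size estimate under these assumptions: 1TB of 16KB segments is 2^26 segments, `entries_needed(2**26, 4, 64)` evaluates to 73, and 73 eight-byte entries occupy 584 bytes.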

Figure 6: Overview of Dash-LH. Segments are organized in arrays.

5.3 Concurrency

Since linear hashing expands in one direction, splits are essentially serialized by locking the Next pointer. To shorten the length of the critical section, we adopt the expansion strategy proposed by LHlf [LarsonLockFreeLH] where the expansion only atomically advances Next without actually splitting the segment. Then any thread that accesses a segment that should be split (denoted in the segment metadata area) will first conduct the actual split operation. As a result, multiple segment splits can execute in parallel by multiple threads. Before advancing the Next pointer, the accessing thread first probes the directory entry for the new segment to test whether the corresponding segment array is allocated. If not, it allocates the segment array and stores it in the directory entry. The performance of the PM allocator therefore may impact overall performance, as we show in Section 6.

Dash-LH uses a variable N to compute the number of buckets of the base table. After each round of the split, Next is reset to zero and N is incremented to denote that the number of buckets is doubled. For consistency guarantees, we store N (32-bit) and Next (32-bit) in a 64-bit word which can be updated atomically.
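Keeping N and Next in one 64-bit word can be sketched as follows. Which half holds which field is an assumption, and `segment_for` uses the classic modulo-based linear hashing address computation rather than Dash-LH's actual addressing, which is not spelled out here; all function names are hypothetical.

```python
MASK32 = (1 << 32) - 1

def pack(n: int, nxt: int) -> int:
    """Assumed layout: N in the high 32 bits, Next in the low 32 bits."""
    return ((n & MASK32) << 32) | (nxt & MASK32)

def unpack(word: int):
    return word >> 32, word & MASK32

def advance(word: int, base_segments: int) -> int:
    """Advance Next by one; when a round completes, reset Next and increment N."""
    n, nxt = unpack(word)
    nxt += 1
    if nxt == base_segments << n:  # every segment of this round has been split
        n, nxt = n + 1, 0
    return pack(n, nxt)            # both fields change in one 64-bit value

def segment_for(h: int, word: int, base_segments: int) -> int:
    """Classic linear hashing addressing from one snapshot of (N, Next)."""
    n, nxt = unpack(word)
    idx = h % (base_segments << n)
    if idx < nxt:                  # already split in this round: use the finer hash
        idx = h % (base_segments << (n + 1))
    return idx
```

Because a reader takes a single snapshot of the packed word, it never observes a new N paired with a stale Next, which is the inconsistency that two separate 32-bit updates could expose.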

6 Evaluation

This section evaluates Dash and compares it with two other state-of-the-art PM hash tables, CCEH [CCEH] and level hashing [LevelHashing]. Specifically, through experiments we confirm the following:

  • Dash-enabled hash tables (Dash-EH and Dash-LH) scale well on multicore servers with real Optane DCPMM;

  • The bucket load balancing techniques allow Dash to achieve high load factor while maintaining high performance;

  • Dash provides instant recovery with a minimal, constant amount of work needed upon restart, reducing service downtime.

6.1 Implementation

We implemented Dash-EH and Dash-LH using PMDK [PMDK], which provides crash-safe PM management, allocation and synchronization. These primitives are essential for building crash-consistent persistent data structures, but also introduce overhead. For example, the PMDK allocator exhibits scalability problems and is much slower than the DRAM counterpart [PiBench]. Such overheads are ignored in previous emulation-based work, but are not negligible in reality. We take them into account in our evaluation. The other hash tables under comparison (CCEH [CCEH] and level hashing [LevelHashing]) were both proposed based on DRAM emulation. We ported them to run on Optane DCPMM using their original code (downloaded from https://github.com/DICL/CCEH and https://github.com/Pfzuo/Level-Hashing). Now we summarize the key implementation issues and our solutions.

Crash Consistency. Dash uses PMDK transactions for segment splits. This frees Dash from handling low-level details while guaranteeing safe and atomic allocations. We noticed a consistency issue in the CCEH code where a power failure during segment split could leak PM. We fixed this problem using a PMDK transaction. We also adapted CCEH and level hashing to use PMDK reader-writer locks that are automatically unlocked upon recovery.

Persistent Pointers. Both CCEH and level hashing assume standard 8-byte pointers based on DRAM emulation, while some systems use 16-byte pointers for PM [PMDK, FOEDUS]. Long pointers break the memory layout and make atomic instructions hard to use. To solve these problems, we extended PMDK to ensure that PM is mapped onto the same virtual address range across different runs (using MAP_FIXED_NOREPLACE in mmap and setting mmap_min_addr in the kernel; we also had to replace MAP_SHARED_VALIDATE with MAP_SHARED for MAP_FIXED_NOREPLACE to work, as detailed in our code repository). The application then directly operates on traditional 8-byte pointers like using DRAM, but still with persistence guarantees. All hash tables experimented here use this approach.

Garbage Collection. We implemented a general-purpose epoch-based PM reclamation mechanism for Dash. We also observed that the open-sourced implementation of CCEH allows threads to access the directory without acquiring any locks, which may allow access to freed memory (due to directory doubling or halving). We fixed this problem with the same epoch-based reclamation approach.

6.2 Experimental Setup

We run experiments on a server with an Intel Xeon Gold 6252 CPU clocked at 2.1GHz, 768GB of Optane DCPMM (6×128GB DIMMs on all six channels) in AppDirect mode, and 192GB of DRAM (6×32GB DIMMs). The CPU has 24 cores (48 hyperthreads) and 35.75MB of L3 cache. The server runs Arch Linux with kernel 5.5.3 and PMDK 1.7. All the code is compiled using GCC 9.2 with all optimizations enabled. Threads are pinned to physical cores.

Parameters. For fair comparison, we set CCEH and level hashing to use the same parameters as in their original papers [LevelHashing, CCEH]. Our own tests showed these parameters gave the best performance and load factor overall. Level hashing uses 128-byte (two cachelines) buckets. CCEH uses 16KB segments and 64-byte (one cacheline) buckets, with a probing length of four. Dash-EH and Dash-LH use 256-byte (four cachelines) buckets and 16KB segments. Each segment has two stash buckets, making it enough to have four overflow fingerprint slots per bucket so that the overflow counter is rarely positive. Dash-LH uses hybrid expansion with a stride of eight and its first segment array includes 64 segments.

Benchmarks. We stress test each hash table using microbenchmarks. For search operations, we run positive search and negative search: the latter probes specifically non-existent keys. Unless otherwise specified, for all runs we preload the hash table with 10 million records, then execute 190 million inserts (as an insert-only benchmark), and then 190 million positive search/negative search/delete operations back-to-back on the 200-million-record hash table. A carefully chosen hash function should output uniformly distributed hash values to mitigate the impact of data skew. For all hash indexes, we use GCC’s std::_Hash_bytes (based on Murmur hash [MurmurHash]) as the hash function, which is known to be fast and provides high-quality hashes. Similar to other work [CCEH, LevelHashing] we use uniformly distributed random keys in our workloads. We also tested skewed workloads under the Zipfian distribution (with varying skewness) and found that all operations achieved better performance, benefiting from the higher cache hit ratios on hot keys; contention is rare because the hash values are largely uniformly distributed. Due to space limitations, we omit the detailed results over skewed workloads. For fixed-length key experiments, both keys and values are 8-byte integers; for variable-length keys we use (pointers to) 16-byte keys and 8-byte data.