Achieving High Throughput and Elasticity in a Larger-than-Memory Store

06/05/2020 ∙ by Chinmay Kulkarni, et al.

Millions of sensors, mobile applications and machines are now generating billions of events. This has led to specialized many-core key-value stores (KVSs) that can ingest and index these events at high rates (100 Mops/s/machine). These systems are efficient if events are generated on the same machine, but, to be practical and cost-effective, they must ingest events over the network and scale across cloud resources elastically. We present Shadowfax, a new distributed key-value store that transparently spans DRAM, SSDs, and cloud blob storage while serving 130 Mops/s/VM over commodity Azure VMs using conventional Linux TCP. Beyond high single-VM performance, Shadowfax uses a unique approach to distributed reconfiguration that avoids any server-side key ownership checks or cross-core coordination both during normal operation and migration. Hence, Shadowfax can shift load in 17 s to improve cluster throughput by 10% with little disruption. Compared to the state-of-the-art, it has 8x better throughput (than Seastar+memcached) and scales out 6x faster.

1 Introduction

Millions of sensors, mobile applications, users, and machines now continuously generate billions of events. These events are processed by streaming engines [11, 13] and ingested and aggregated by state management systems (Figure 1). Real-time queries are issued against this ingested data to train and update models for prediction, to analyze user behavior, or to generate device crash reports, etc. Hence, these state management systems are a focal point for massive numbers of events and queries over aggregated information about them.

Recently, this has led to specialized KVSs that can ingest and index these events at high rates (100 Mops/s/machine) by exploiting many-core hardware [14, 46]. These systems are efficient if events are generated on the same machine as the KVS, but, in practice, these events need to be aggregated from a wide and distributed set of data sources. Hence, fast indexing schemes alone only solve part of the problem. To be practical and cost-effective, a complete system for aggregating these events must ingest events over the network, must scale across machines as well as cores, and must be elastic (by provisioning and reconfiguring over inexpensive generic cloud resources as workload demands change).

Figure 1: A typical data processing pipeline. Services receive and process raw data and events. A state management system ingests processed data and serves offline queries against this data.

The only existing KVSs that provide similar performance [33, 25, 40, 32] rely on application-specific hardware acceleration, making them impossible to deploy on today’s cloud platforms. Furthermore, these systems only store data in DRAM, and they do not scale across machines; adding support to do so without cutting into normal-case performance is not straightforward. For example, many of them statically partition records across cores to eliminate cross-core synchronization. This optimizes normal-case performance, but it makes concurrent operations like migration and scale out impossible; transferring record data and ownership between machines and cores requires a stop-the-world approach due to these systems’ lack of fine-grained synchronization.

Achieving this level of performance while fulfilling all of these requirements on commodity cloud platforms requires solving two key challenges simultaneously. First, workloads change over time and cloud VMs fail, so systems must tolerate failure and reconfiguration. Doing this without hurting normal-case performance at 100 Mops/s is hard, since even a single extra server-side cache miss to check key ownership or reconfiguration status would cut throughput by tens-of-millions of operations per second. Second, the high CPU cost of processing incoming network packets that convey these events easily dominates in these workloads, especially since, historically, cloud networking stacks have not been designed for high data rates and high efficiency. We show this is changing; by careful design of each server’s data path, cloud applications can exploit transparent hardware acceleration and offloading offered by cloud providers to process more than 100 Mops/s per cloud virtual machine (VM).

We present Shadowfax, a new distributed key-value store that transparently spans DRAM, SSDs, and cloud blob storage while serving 130 Mops/s/VM over commodity Azure VMs [15] using conventional Linux TCP. Beyond high single-VM performance, its unique approach to distributed reconfiguration avoids any server-side key ownership checks and any cross-core coordination during normal operation and data migration both in its indexing and network interactions. Hence, Shadowfax can shift load in 17 s to improve cluster throughput by 10% with little disruption. Compared to the state-of-the-art, it has 8x better throughput (than Seastar+memcached [10]) and scales out 6x faster (than Rocksteady [26]).

In this paper, we describe and evaluate three key pieces of Shadowfax that together eliminate cross-request and cross-core coordination throughout both the client and the server:

Low-cost Coordination via Global Cuts:

All of the major components of Shadowfax, including indexing, request dispatching, durability, checkpointing/recovery, and migration, center around a key mechanism inherited from the FASTER index upon which it is built: asynchronous global cuts [14, 41, 17]. In contrast to the totally-ordered or stop-the-world approaches used by most systems, cores in Shadowfax avoid stalling to synchronize with one another, even when triggering complex operations that require defining clear before/after points in time among concurrent operations for consistency. Instead, each core participating in these operations, both at clients and servers, independently decides a point in a global cut that defines a boundary among operation sequences. In this paper, we focus on how global cuts coordinate both server and client threads (through Shadowfax’s partitioned sessions), examining in detail their role in Shadowfax’s low-coordination data migration and reconfiguration protocol.

End-to-end Asynchronous Clients:

All requests from a client on one machine to Shadowfax are asynchronous with respect to one another, all the way through Shadowfax’s client- and server-side network submission/completion paths and the servers’ indexing and (SSD and cloud storage) I/O paths. This avoids all client- and server-side stalls due to head-of-line blocking, ensuring that clients can always continue to generate requests and servers can always continue to process them. In turn, clients naturally batch requests, improving server-side throughput, especially under high load. This batching also suits the hardware-accelerated network offloads available in cloud platforms today, further lowering CPU load and improving throughput. Hence, despite batching, requests complete in 40 µs to 1.3 ms at more than 120 Mops/s/VM, depending on the transport and hardware acceleration chosen.

Partitioned Sessions, Shared Data:

Asynchronous requests eliminate blocking between requests within a client, but maintaining high throughput also requires minimizing coordination costs between cores at clients and servers. Instead of partitioning data among cores to avoid synchronization on record accesses [24, 43, 33, 10], Shadowfax partitions network sessions across cores; its lock-free hash index and log-structured record heap are shared among all cores. This risks contention when some records are hot and frequently mutated, but this is more than offset by the fact that no software-level inter-core request forwarding or routing is needed within server VMs.

In the remainder of this paper, we describe how the key synchronization mechanisms at the core of FASTER’s design (§2.1) naturally led to Shadowfax’s sessions that extend global cuts over the network (§3.1.1). We describe how this enables Shadowfax to perform as well over the network as a local FASTER instance (§4.2), and how sessions enable reconfiguration (§3.2) and parallel data migration (§3.3). We also describe how Shadowfax does this while supporting larger-than-memory datasets that span SSD and cloud blob storage. Finally, we evaluate Shadowfax against other state-of-the-art shared-nothing approaches (§4), showing that by eliminating record ownership checks and cross-core communication for routing requests it improves per-machine throughput by 8.5x on commodity cloud VMs. We also show that it retains high throughput during migrations and that it scales to an 8-machine cluster that ingests and indexes 400 Mops/s.

2 FASTER Key-Value Store

Shadowfax is built over the FASTER single-node KVS, which it relies on for hash indexing, record storage, and checkpointing. Here, we describe some key aspects of FASTER, since Shadowfax’s design integrates with it and builds on its mechanisms. More details about FASTER itself can be found elsewhere [14, 41]. Specifically, Shadowfax relies on FASTER’s asynchronous global cuts, which help avoid coordination, and its HybridLog, which transparently spans DRAM and SSD.

Figure 2: FASTER’s HybridLog allocator spans memory and local SSD. The portion in memory contains a mutable region that acts as a cache and a read-only region. FASTER’s hash table points to a reverse linked list of records on the HybridLog.

In most ways, FASTER works like most durable hash table libraries. It includes a lock-free hash table divided into cacheline-sized buckets (Figure 2). Each 8 byte bucket entry contains a pointer to a record whose key hashes to that bucket. Each record points to another record, forming a linked list of records with common significant key hash bits. Each bucket entry contains additional bits from the associated records’ key hash, increasing hashing resolution and disambiguating what records the bucket entry points to without extra cache misses and without full key comparisons. Each record pointed to by the hash table is stored in the HybridLog.
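
As a rough illustration of this layout, the sketch below packs a record address and extra key-hash tag bits into a single 8-byte bucket entry so that a lookup can skip most non-matching chains without dereferencing them. The field names and bit widths here are illustrative assumptions, not FASTER’s exact format.

```cpp
#include <cstdint>

// Sketch of an 8-byte hash bucket entry: a record address plus extra
// key-hash "tag" bits packed into one word (names and widths assumed).
struct BucketEntry {
  uint64_t address   : 48;  // HybridLog address of the newest record in this chain
  uint64_t tag       : 15;  // extra key-hash bits; lets lookups reject non-matching chains
  uint64_t tentative : 1;   // marks an in-progress, lock-free insertion

  // Compare only the tag bits; the full key comparison happens on the record itself.
  bool TagMatches(uint64_t key_hash) const {
    return tag == ((key_hash >> 48) & 0x7FFF);
  }
};
static_assert(sizeof(BucketEntry) == 8, "bucket entries stay cache-line friendly");
```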

FASTER clients can use it like any other library, but a common pattern is to pin one client application thread per CPU core to eliminate scheduler overheads. Each client thread calls read or read-modify-write operations on keys in FASTER. FASTER’s cache-conscious design and lock-freedom are key in its ability to perform more than 100 Mops/s on a single multicore machine. Shadowfax uses FASTER with a similar threading scheme described later.

2.1 Asynchronous Global Cuts

Lock-freedom makes FASTER fast, but it creates challenges for synchronization and memory safety. Updated versions of records may be installed in its hash table, even as old versions of that record are still being read by other threads. This is a common problem in all lock-free, RCU-like schemes [34]. To solve this, FASTER uses an epoch-based memory-protection scheme [27]. All threads calling into FASTER are registered with an epoch manager that tracks when threads begin and end access to FASTER’s internal structures. When a page is evicted to SSD, the epoch-based scheme ensures that the memory is not reused while any thread could still be accessing it. The full details of this scheme are beyond the scope of this paper.

Critically, this epoch-based scheme also plays a key role in coordinating information across threads lazily without inducing stalls. During complex, system-wide events (such as checkpointing), threads lazily coordinate by registering callback actions that are eventually executed once each thread synchronizes some local state with an updated global value. The same mechanism can also be used to trigger a function only once all threads are guaranteed to have updated their local state from some global state. In effect, this allows trigger actions that are guaranteed to take effect only after all threads agree on and have each locally observed some transition in global state. This can be used to create a machine-wide asynchronous global cut, where events such as global state transitions are realized asynchronously and lazily over a set of independent thread-local state transitions.

Figure 3: Checkpoint version changes in FASTER take place over a global cut. A global version number is updated first. Once every thread has independently observed this update, a checkpoint is taken.

FASTER’s checkpointing protocol serves as a great example of this [41]. It first moves all threads from a system checkpoint version v to a new checkpoint version v+1, and then it captures and persists records in version v. As threads process operations, they stamp each new record version they create with their local checkpoint version number. The boundary between the two checkpoint versions v and v+1 forms a global cut across all of the operations of all of the threads (Figure 3). The protocol first increments a global variable that contains the system’s checkpoint version number to v+1, and it registers an epoch action that checkpoints version v when all threads have observed this new checkpoint version. Each thread periodically (or when it encounters specific situations) checks for epoch actions, which triggers a refresh of each thread’s local checkpoint version number. Once each thread has refreshed its local copy of the global checkpoint version number to v+1, all operations before the global cut are guaranteed to have completed. Hence, it is safe to checkpoint version v of the database, which is triggered by the registered epoch action.
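
As a concrete illustration, the sketch below shows the shape of this protocol (simplified; names like BeginCheckpoint and Refresh are ours, and FASTER’s real epoch manager is more general): a coordinator bumps the global version and registers an action, each thread crosses the cut the next time it refreshes, and the checkpoint of the old version fires only once every thread has crossed.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// Minimal sketch of a version change over an asynchronous global cut.
struct GlobalState {
  std::atomic<uint32_t> version{1};
  std::atomic<int> threads_yet_to_observe{0};
  std::function<void(uint32_t)> on_cut_complete;  // e.g. persist records of the old version
};

struct ThreadState {
  uint32_t local_version = 1;
};

// Coordinator: advance the global version and register the action that must
// run once every thread has observed the new version.
void BeginCheckpoint(GlobalState& g, int num_threads,
                     std::function<void(uint32_t)> persist_old_version) {
  g.threads_yet_to_observe.store(num_threads, std::memory_order_relaxed);
  g.on_cut_complete = std::move(persist_old_version);
  g.version.fetch_add(1, std::memory_order_release);  // new records are stamped with the new version
}

// Worker threads call this between operations; the point at which a thread
// adopts the new version is its contribution to the global cut.
void Refresh(GlobalState& g, ThreadState& t) {
  uint32_t v = g.version.load(std::memory_order_acquire);
  if (v != t.local_version) {
    t.local_version = v;  // every later operation by this thread is in version v
    if (g.threads_yet_to_observe.fetch_sub(1) == 1) {
      g.on_cut_complete(v - 1);  // all threads crossed the cut: safe to checkpoint v-1
    }
  }
}
```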

Section 3.2.1 shows how Shadowfax employs these global cuts to safely move ownership of records between servers while preserving throughput.

2.2 HybridLog Allocator

FASTER allocates and stores all records in its HybridLog, which spans memory and SSD (Figure 2). The HybridLog combines in-place updates (for records in memory) and log-structured organization (for records on SSD), and it provides lock-free access to records.

The portion of the HybridLog’s address space on SSD forms the stable region. It contains cold records that have not been recently updated. The portion in memory is composed of two regions: a (larger) mutable region and a (smaller) read-only region. Records in the mutable region can be modified in-place with appropriate synchronization that is chosen by the application using FASTER (for example, atomic operations, locks, or validation). This region acts as a cache for recently updated records and avoids expensive per-update allocations.

The read-only region mostly contains records that are being asynchronously written to SSD. These records cannot be updated in place, since they must remain stable during I/O. The read-only region represents records that are becoming cold, and it acts as a second-chance cache. FASTER uses a read-copy-update to modify records in this region: the updated record version is appended to the mutable region, and the hash table is updated to point to it. This helps provide good cache hit rates without fine-grained, per-record metadata.
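
A sketch of the resulting update decision follows, assuming logical addresses grow toward the tail; the three-region layout and the ChooseUpdatePath helper are our simplification, not FASTER’s exact code.

```cpp
#include <cstdint>

// Illustrative HybridLog regions; addresses grow toward the tail:
//   [0, head)           stable region on SSD (and, in Shadowfax, shared cloud storage)
//   [head, read_only)   in-memory read-only region, being flushed to SSD
//   [read_only, tail)   in-memory mutable region, updated in place
struct HybridLogRegions {
  uint64_t head;
  uint64_t read_only;
  uint64_t tail;
};

enum class UpdatePath { InPlace, ReadCopyUpdate, FetchThenReadCopyUpdate };

UpdatePath ChooseUpdatePath(const HybridLogRegions& log, uint64_t record_address) {
  if (record_address >= log.read_only) {
    return UpdatePath::InPlace;                // hot record: mutate in place
  }
  if (record_address >= log.head) {
    return UpdatePath::ReadCopyUpdate;         // append new version at the tail, swing the hash entry
  }
  return UpdatePath::FetchThenReadCopyUpdate;  // on SSD: async I/O, then append a new version
}
```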

Each record entry in FASTER’s hash table points to a reverse linked list of records on the HybridLog, allowing it to maintain a compact hash table for larger-than-memory datasets that span storage media. Note that a consequence of this is that hash table lookups in FASTER may need to traverse chains of records that span from memory onto SSD. Section 3.3.2 describes how Shadowfax extends the HybridLog so that it also spans shared cloud storage and how this accelerates the completion of scale out and data migration.

3 Shadowfax Design

Shadowfax is a distributed key-value store. Each server in the system stores records inside an instance of FASTER, and clients issue requests for these records over the network. These requests can be of three types: reads that return a record’s value, upserts that blindly update a record’s value, and read-modify-writes that first read a record’s value and then update a particular field within it. Within a server, records are allocated on FASTER’s HybridLog, whose stable region is extended by Shadowfax to also span a shared remote storage tier in addition to main memory and local SSD.

Each server runs one thread per core, and it shares its FASTER instance among all threads. Threads on remote clients directly establish a network session with one server thread on the machine that contains the records the client wishes to access or modify (§3.1.1). Sessions are the key to retaining FASTER’s throughput over the network: they allow clients to issue asynchronous requests; they batch requests to improve server-side throughput and avoid head-of-line blocking; and they avoid software-level inter-core request dispatching.

Shadowfax uses hash partitioning to divide records among servers. The set of hash ranges owned by a server at a given logical point of time is associated with a per-server strictly increasing view number. A fault-tolerant, external metadata store (e.g. ZooKeeper [22]) durably maintains these view numbers along with mappings from hash ranges to servers and vice versa. View numbers serve two key purposes in Shadowfax. First, they help minimize the impact of record ownership checks at servers, helping them retain FASTER’s performance. Second, they allow the system to make lazy and asynchronous progress through record ownership changes (§3.2).

Sessions and low-coordination global cuts via views play a key role in Shadowfax’s reconfiguration, data migration, and scale out. Its scale out protocol migrates hash ranges from a source server to a target server and is designed to minimize migration’s impact to throughput. The protocol uses a view change to transfer ownership of the hash range from the source to the target along with a small set of recently accessed records. This allows the target to immediately start serving requests for these records and helps maintain high throughput during scale out. Next, threads on the source work in parallel to collect records from FASTER and transmit them over sessions to the target. Similarly, threads on the target work in parallel to receive these records and insert them into its FASTER instance. This parallel approach helps migrate records quickly, reducing the duration of scale out’s impact on throughput. Scale out completes once all records have been moved to the target.

3.1 Partitioned Dispatch & Sessions

Shadowfax’s network request dispatching mechanism and client library need to be able to saturate the FASTER instance inside each server. One option would be to maintain a FASTER instance per server thread, partitioning records across them to avoid cache coherence costs. However, this would create a routing problem at the server; requests picked up from the network would need to be routed to the correct thread. This would require cross-thread coordination, hurting throughput and scalability. Clients could be made responsible for routing requests to the correct server thread, but this would require every client thread to open a connection to every server thread and would not scale well on cloud networks. Clients could avoid this by partitioning all requests and transmitting them to the correct server thread, but this would require cross-thread coordination at the client which would also not scale well.

Using a connectionless transport like UDP could make client side routing feasible without introducing cross-thread coordination [37, 33]. However, the system would lose its ability to perform congestion control and flow control, and it would not tolerate packet loss. These are basic requirements for running a storage system over a cloud network.

Figure 4: Each server thread receives batches of requests from sessions and processes them via a shared, per-machine FASTER instance. Results are returned over the network by the same thread, avoiding cross-thread coordination.

Shadowfax avoids cross-thread coordination by sharing a single instance of FASTER between server threads. FASTER defers cross-core communication to hardware cache coherence on the accessed records themselves, cleanly partitioning the rest of the system (Figure 4). Each server runs a pinned thread on each vCPU inside a cloud VM. Each server thread runs a continuous loop that does two things. First, it polls the network for new incoming connections. Next, it polls existing connections for requests, and it unpacks these requests, calling into FASTER to handle each of them. After requests are executed, the returned results are transmitted back over the session they were received on. Since FASTER is shared, neither requests nor results are ever passed across server threads.
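
The sketch below captures the shape of this per-thread loop with stand-in types; Session, SharedStore, and the request/response structs are illustrative stubs, not Shadowfax’s actual interfaces, and accepting new connections is omitted.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

enum class Op { Read, Upsert, Rmw };
struct Request  { Op op; uint64_t key; uint64_t value; };
struct Response { bool ok; uint64_t value; };
using RequestBatch  = std::vector<Request>;
using ResponseBatch = std::vector<Response>;

// Stand-in for a network session pinned to this thread.
struct Session {
  std::optional<RequestBatch> TryReceive() { return std::nullopt; }  // poll socket (stubbed)
  void Send(const ResponseBatch&) {}                                 // transmit results (stubbed)
};

// Stand-in for the FASTER instance shared by every thread on the server.
struct SharedStore {
  Response Read(uint64_t key)                   { return {true, key}; }
  Response Upsert(uint64_t key, uint64_t value) { (void)key; return {true, value}; }
  Response Rmw(uint64_t key, uint64_t delta)    { (void)key; return {true, delta}; }
};

// One loop per pinned server thread. Requests arrive on this thread's
// sessions, execute against the shared store, and are answered on the same
// session; nothing is ever handed to another thread.
void ServerThreadLoop(std::vector<Session>& sessions, SharedStore& store) {
  for (;;) {
    for (auto& session : sessions) {
      auto batch = session.TryReceive();
      if (!batch) continue;
      ResponseBatch results;
      for (const Request& req : *batch) {
        switch (req.op) {
          case Op::Read:   results.push_back(store.Read(req.key));              break;
          case Op::Upsert: results.push_back(store.Upsert(req.key, req.value)); break;
          case Op::Rmw:    results.push_back(store.Rmw(req.key, req.value));    break;
        }
      }
      session.Send(results);
    }
  }
}
```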

3.1.1 Client Sessions

Figure 5: Client threads buffer requests in sessions along with a callback. Batches of asynchronous requests are kept pipelined to the server, keeping both the client and the server busy.

Shadowfax’s partitioned dispatch/shared data approach also extends to clients. Since they don’t need to route requests to specific server threads, they can reduce connection state while avoiding cross-thread coordination.

However, clients must also avoid stalling due to network delay in order to saturate servers. To do this, each client thread is pinned to a different vCPU of a cloud VM, and it issues asynchronous requests against an instance of Shadowfax’s client library (Figure 5). The library pipelines batches of these requests to servers.

The client library achieves this through sessions. When the library receives a request, it first checks if it has a connection to the server that owns the corresponding record. If it does not, it looks up a cached copy of ownership mappings (periodically refreshed from the metadata store), establishes a connection to a thread on the server that owns the record, and associates a new session with the connection. Next, it buffers the request inside the session, enqueues a completion callback for the request inside the session, and returns. This allows the client thread to continue issuing requests without blocking. Once enough requests have been buffered inside a session, the library sends them out in a batch to the server thread. On receiving a batch of results from the server thread, the library dequeues callbacks and executes them to complete the corresponding requests.

Sessions are fully pipelined, so multiple batches of requests can be sent to a server thread without waiting for responses from prior batches. This also means that a client thread can continue issuing asynchronous requests into session buffers while waiting for results. This pipelined approach hides network delays and helps saturate servers. It also helps keep request batch sizes small, which is good for latency.
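
A minimal sketch of the client-side session follows, assuming results return in the order requests were sent; the class name, batch-size threshold, and transport hook are our assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// One session object per (client thread, server thread) connection.
class ClientSession {
 public:
  using Callback = std::function<void(uint64_t /*result*/)>;

  // Issue an asynchronous read-modify-write: buffer it and return immediately,
  // so the client thread never blocks on the network.
  void Rmw(uint64_t key, uint64_t delta, Callback cb) {
    buffer_.push_back({key, delta});
    callbacks_.push_back(std::move(cb));
    if (buffer_.size() >= kBatchSize) Flush();
  }

  // Invoked when a batch of results arrives; results come back in request
  // order, so callbacks complete strictly FIFO.
  void OnResults(const std::vector<uint64_t>& results) {
    for (uint64_t r : results) {
      callbacks_.front()(r);
      callbacks_.pop_front();
    }
  }

 private:
  struct PendingRmw { uint64_t key; uint64_t delta; };
  static constexpr std::size_t kBatchSize = 512;  // illustrative threshold

  // Ship the buffered batch to the owning server thread. Sessions are fully
  // pipelined: we do not wait for earlier batches before sending this one.
  void Flush() { /* serialize buffer_ onto the connection (omitted) */ buffer_.clear(); }

  std::vector<PendingRmw> buffer_;   // requests not yet transmitted
  std::deque<Callback> callbacks_;   // completions for buffered and in-flight requests
};
```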

3.1.2 Exploiting Cloud Network Acceleration

The cloud network has traditionally not been designed for high data rates and efficiency. The high CPU cost of processing packets over this network can easily prevent servers and clients from retaining FASTER’s throughput. However, this is beginning to change; many cloud providers are now transparently offloading parts of their networking stack onto SmartNIC FPGAs to reduce this cost. Shadowfax’s design interplays well with this acceleration; batched requests avoid high per-packet overheads and its reduced connection count avoids the performance collapse some systems experience [18].

Since threads do not communicate or synchronize, all CPU cycles recovered from offloading the network stack can be used for executing requests at the server and issuing them from the client. This allows Shadowfax to retain FASTER’s high throughput using the Linux kernel’s TCP stack on cloud networks, avoiding dependence on kernel-bypass or RDMA.

3.2 Record Ownership

To support distributed operations such as scale out and crash recovery, Shadowfax must be able to move ownership of records between servers at runtime. This creates a problem during normal operation: a client might send out a batch of requests to a server after referring to its cache of ownership mappings. By the time the server receives the batch, it might have lost ownership of some of the requested records in the batch (e.g. due to scale out). To solve this, the server must validate that it still owns the records requested in the batch before it processes the batched requests. This can hurt normal case throughput if each request in the batch must be cross-checked against a set of hash ranges owned by the server.

Shadowfax solves this by associating the set of hash ranges owned by a server with a per-server strictly-increasing view number. All request batches are tagged with a view number, letting servers quickly assess whether the batch only includes requests for records that it currently owns. When a server’s set of owned ranges changes, its view number is advanced. Each server’s latest view number is durably stored along with a list of the hash ranges it owns in the metadata store.

When a client connects to a server, it caches a copy of the server’s latest view (a view number and its associated hash ranges) inside the session. Every batch of requests sent on that session is tagged with this view number, and clients only put requests for keys into batches that were owned by that server in that view number. Upon receiving a batch, the server always checks its current view number against the view number tagged on the batch. If they match, then the server and client agree about what hash ranges are owned by the server, ensuring the batch is safe to process without further key or hash range checks on each request. If they don’t match, then either the client or the server has out-of-date information about what hash ranges the server owns. In this case, the server rejects the batch and refreshes its view from the metadata server. Upon receiving this batch rejection, the client refreshes its view information from the metadata server and then reissues requests from the rejected batch.
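
A sketch of the server-side check, which reduces per-batch ownership validation to one integer comparison (types and names are illustrative):

```cpp
#include <cstdint>

struct RequestBatchHeader {
  uint64_t view_number;  // view the client believes this server is in
  // ... followed by the requests themselves
};

enum class BatchStatus { Accept, RejectStaleView };

// Per-key hash-range checks happen only at the client when it assembles the
// batch; the server validates the whole batch with a single comparison.
BatchStatus ValidateBatch(uint64_t server_view, const RequestBatchHeader& header) {
  return header.view_number == server_view ? BatchStatus::Accept
                                           : BatchStatus::RejectStaleView;
}
// On RejectStaleView, the server refreshes its view from the metadata store if
// needed; the client does the same and then reissues the rejected requests.
```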

In essence, view numbers offload expensive hash range checks on each requested key to clients, reducing load at servers. For a server that owns r hash ranges and accepts requests in batches of size b, views reduce the cost of ownership checks from O(b log r) per batch to O(1). Even more crucially, since the check is a single integer comparison per batch, it ensures we never take a cache miss to perform record ownership checks, which would be prohibitive at 100 Mops/s. Hence, views are key in supporting dynamic movement of ownership between servers while preserving normal case throughput.

3.2.1 Ownership Transfer

When ownership of a hash range needs to be transferred to or away from a server, its ownership mappings are first atomically updated at the metadata store. This increments its view number and adds or removes the hash range from its mapping. Servers and clients observe this view change either when they refresh their local caches of views and ownership mappings (via an epoch action) or when they communicate with a machine that has already observed this change.

When a server involved in the transfer observes that its view has changed, it must move into the new view. However, this step is not straightforward; keeping with Shadowfax’s design principle, it must be achieved without stalling server threads.

Figure 6: Ownership transfer in Shadowfax. A view change is first propagated using a global cut across server threads. Each server thread then pushes this cut out to clients over its sessions. This approach avoids coordination within both servers and clients.

Within the server, this view change is propagated asynchronously across threads via an epoch action (Figure 6). Threads each mark a point in their sequence of operations, collectively creating a global cut among all of the operations on all of the threads at the server (§2.1). This cut unambiguously ensures no two servers concurrently serve operations on an overlapping hash range. This approach is free of synchronous coordination, helping maintain high throughput.

The server might be connected to clients still using the old view; it must also propagate the view change to clients in a similar way without stalling client threads. Sessions help Shadowfax achieve this. When a server thread moves into a new view, view validations on request batches received over sessions with clients still in an older view are rejected. This effectively pushes the global cut taken on the server out to all clients connected to it. Each client thread independently updates its thread local cache of ownership mappings and reissues requests from the rejected batch. This helps avoid cross-thread coordination on the client too.

3.3 Scaling Out

Shadowfax migrates hash ranges from a source to a target server to scale out. Migration proceeds in phases that transfer hash range ownership to the target before migrating records.

Migration is implemented as a state machine on the source and target. Both servers transition through migration phases on global cuts, created in the same non-blocking, low-coordination way described in §2.1. First, each thread enters a phase at a point of its own choosing in the sequence of requests it is processing (a point that makes up part of the global cut for the transition into that phase), and then it starts performing the work of that phase. Once all threads have entered the phase and have completed all work relating to it, the server begins transitioning to the next phase.

Migration is driven by the source as we outline below:

Sampling:

Initiated by receiving a Migrate() RPC from a client, whereupon the source

  1. atomically remaps ownership of hash ranges from the source to the target, increments the source’s and target’s view numbers, and registers a dependency between the source and target (for crash recovery, §3.3.1) within the metadata store;

  2. begins sampling hot records by forcing all accessed records to be copied to the HybridLog tail.

Since the records are not yet at the target and a migration is in progress, both the source and the target continue to temporarily operate in the old ownership view; at this point the source is still servicing requests for records in the migrating ranges. To ensure that sampled records only get copied once, the source only copies records whose address is lower than the HybridLog tail address at the start of this phase.

Prepare:

Initiated after all source threads have completed the Sampling phase. The source asynchronously sends a PrepForTransfer() RPC to the target, transitioning the target to its own Target-Prepare phase. The Target-Prepare phase tells the target that ownership transfer is imminent. The target temporarily pends requests in the migrating hash ranges (since some clients may discover the new views) and services them after the source indicates that it has stopped servicing requests in the old view.

Transfer:

Initiated after all source threads have completed the Prepare phase. The source moves into its new view and stops servicing requests on the migrating hash ranges. When all server threads are in the new view, it sends out an asynchronous TransferOwnership() RPC to the target, which also includes the hot records sampled in the Sampling phase. This moves the target into its Target-Receive phase, whereupon it inserts the sampled records into its FASTER instance and then begins servicing requests for the migrating hash ranges. This also triggers the target to service any requests pending from the Target-Prepare phase.

Migrate:

Initiated after all source threads have completed the Transfer phase. The source uses thread-local sessions to send records in the migrating hash ranges to the target. Threads interleave processing normal requests with sending batches of migrating records collected from the source’s hash table to the target. Each thread works on independent, non-overlapping hash table regions, avoiding contention.

Complete:

Initiated after all source threads have completed the Migrate phase. The source sends an asynchronous CompleteMigration() RPC, moving the target to the Target-Complete phase. Then, the source asynchronously checkpoints its log, so if it crashes after migration has completed, it can be recovered without filtering out migrated records. When the checkpoint completes, the source sets a flag in the metadata store indicating that its role in migration is complete, and it returns to normal operation.

The target is mostly passive during migration; most of its phase changes are triggered by source RPCs. Requests for a record may come after the target has received a TransferOwnership() RPC but before the source has sent that record. The target marks these requests pending, and it processes them when it receives the corresponding record.

When the target receives the CompleteMigration() RPC, it also takes an asynchronous checkpoint, ensuring that it can be recovered to its post-migration state in the case that the source or target crashes. Afterward, it also sets a flag at the metadata store indicating that its role in the migration is complete, and it returns to normal operation.

Shadowfax maintains high throughput during scale up via low-coordination, non-blocking epoch actions and purely asynchronous inter-machine communication. The source prioritizes request processing, making progress in between request batches. Its state machine transitions are independent of the target; all migration RPCs and checkpoints are asynchronous. The target prioritizes request processing in the same way. Early ownership transfer, sampled records, and pending operations let the target start servicing requests on moved ranges quickly, improving throughput recovery. Sessions let the source collect and asynchronously transmit records in parallel while the target receives and inserts them in parallel.
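
A compact way to view the source-side protocol above is as a phase state machine. The sketch below is our summary of the phases and what happens in each, not Shadowfax’s code; every transition is realized over a global cut rather than a stop-the-world barrier.

```cpp
// Source-side migration phases; the source advances to the next phase only
// after every source thread has entered and finished the current one.
enum class SourcePhase { Normal, Sampling, Prepare, Transfer, Migrate, Complete };

SourcePhase NextPhase(SourcePhase p) {
  switch (p) {
    case SourcePhase::Normal:   return SourcePhase::Sampling;  // Migrate() RPC: remap ownership, sample hot records
    case SourcePhase::Sampling: return SourcePhase::Prepare;   // send PrepForTransfer() to the target
    case SourcePhase::Prepare:  return SourcePhase::Transfer;  // enter new view, stop serving migrating ranges,
                                                               //   send TransferOwnership() + sampled records
    case SourcePhase::Transfer: return SourcePhase::Migrate;   // threads stream records to the target over sessions
    case SourcePhase::Migrate:  return SourcePhase::Complete;  // send CompleteMigration(), take async checkpoint
    case SourcePhase::Complete: return SourcePhase::Normal;    // set completion flag in the metadata store
  }
  return SourcePhase::Normal;  // unreachable
}
```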

3.3.1 Fault Tolerance During Migration

Migration completes once the source and target have both completed their asynchronous checkpoint to ensure the migration’s effects will be recovered; have marked their part of the migration completed at the metadata store; and have returned to normal operation. After this point, the migration is durable, and the temporary dependency between the source and target at the metadata store is garbage collected. If either machine crashes hereafter, it can be independently recovered from a checkpoint containing the effects of the migration.

If either server crashes during migration, recovery must involve both servers, which is why the metadata store tracks the dependency between them. This is because of early ownership transfer; during migration, the target services operations on the migrating ranges, but many records belonging to those ranges may still be on the source. When recovering a server, if Shadowfax finds a migration dependency involving the server without both completion flags set, it cancels the migration by setting a cancellation flag in the metadata store. Then, it transfers ownership of the hash ranges back to the source (incrementing the source’s and target’s view numbers), restores both machines using their latest (pre-migration) checkpoints, and recovers requests on those hash ranges that were issued during migration at the source.

This cancellation procedure ensures migration is deadlock-free. If either server fails to make progress through the protocol in a timely manner, the migration can always be cancelled by any party, and both servers can be rolled back. No server can stall migration completion indefinitely.

We are currently working on implementing full crash recovery. As with ownership transfer, it creates a global cut across client sessions, indicating which operations clients should retransmit to servers after a crash. This client-assisted recovery eliminates the need for write-ahead-logging at servers, which would introduce a serial bottleneck. A full description and evaluation of this mechanism is beyond the scope of this paper, and we leave it to future work.

3.3.2 Decoupling Faster with Shared Storage

Migration cannot become durable until all records have been moved to the target, so Shadowfax must ensure that this happens quickly. However, FASTER’s larger-than-memory index makes this challenging: entries in its hash table point to linked lists of records, which can span onto local SSD. Performing I/O (sequential or random) to migrate these records can slow migration and hurt throughput.

Shadowfax’s shared remote tier helps solve this problem. Records on local SSD are always eventually flushed to this tier, so migration can avoid accessing them. When the source encounters an address for a record in a list that is on the SSD, it sends an indirection record to the target that indicates this record’s location in the shared tier. This indirection record contains the next address in the list, an identifier for the source’s log, the hash range being migrated, and the hash entry that pointed to the list. The target inserts these records into its hash table using the hash entry contained in the record. These records become part of its end-of-migration checkpoint. Overall, these fine-grained inter-log dependencies represented by indirection records accelerate migration completion by eliminating all I/O that would otherwise be needed to consolidate records to materialize the target’s checkpoint.
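
A sketch of what such an indirection record might carry, based on the description above; field names and widths are illustrative.

```cpp
#include <cstdint>

// Sent during migration in place of a record that has already left the
// source's memory; it lets the target resolve the record lazily from the
// shared storage tier instead of forcing the source to do I/O now.
struct IndirectionRecord {
  uint64_t next_address;      // next address in the source's record chain, on the shared tier
  uint64_t source_log_id;     // which server's log that address belongs to
  uint64_t hash_range_start;  // hash range being migrated; only keys in this range
  uint64_t hash_range_end;    //   resolve through this indirection record
  uint64_t hash_entry;        // source hash-table entry that pointed to the chain,
                              //   reused by the target to place this record in its own table
};
```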

During normal operation, if the target encounters an indirection record when processing a request and the request’s key falls in the hash range contained in the record, the target asynchronously retrieves the actual record from the shared tier using the contained address and log identifier, inserts it into its hash table, and then completes the request.

3.3.3 Cleaning Up Indirection Records

Figure 7: Indirection records create fine-grained data dependencies between logs. These dependencies are cleaned up lazily during log compaction. This allows migration to be restricted to main memory.

Migrations can accumulate indirection records between server logs for records that are never accessed (Figure 7). On scaling up (①) by migrating a hash range from Log 0 to Log 2, Log 2 contains indirection records that point to Log 0 on the shared tier. Dependencies are also created during scale down (②) when records on Log 1 are migrated to Log 2. These dependencies must eventually be cleaned up.

Shadowfax must already periodically do log compaction to eliminate stale versions of records from its shared tier; resolving and removing indirection records can be piggybacked on this process to eliminate overheads for cleaning up indirection records (③). When compacting its log, if a server encounters a record belonging to a hash range it no longer owns, the server transmits the record to the current owner. On receiving such a record, the owner first looks up the key. If it encounters an indirection record while doing so and the key falls in the contained hash range, then it means that the key was not retrieved from the shared tier after migration. In this case, the server inserts the received record; otherwise, it discards the record.
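
The owner’s decision when a compacting server ships it such a record can be summarized as below; this is a sketch with stubbed lookup and insert operations, and Shadowfax’s real logic also checks that the key falls within the indirection record’s hash range.

```cpp
#include <cstdint>

enum class LookupOutcome { FoundRecord, FoundIndirectionRecord, NotFound };

// Stand-ins for the owner's FASTER operations.
struct OwnerStore {
  LookupOutcome Lookup(uint64_t key_hash)            { (void)key_hash; return LookupOutcome::NotFound; }
  void Insert(uint64_t key_hash, const void* record) { (void)key_hash; (void)record; }
};

// Called when another server, while compacting its log, ships us a record for
// a hash range we now own.
void OnShippedRecord(OwnerStore& owner, uint64_t key_hash, const void* record) {
  if (owner.Lookup(key_hash) == LookupOutcome::FoundIndirectionRecord) {
    // The key was never fetched from the shared tier after migration, so the
    // shipped copy is the version we are missing: keep it.
    owner.Insert(key_hash, record);
  }
  // Otherwise we already resolved (or never needed) this key: discard the copy.
}
```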

Aside from normal-case request processing, this lazy approach ensures that records not in main memory are accessed only once, during the sequential I/O of compaction, which has to be done anyway. It is also deadlock-free: two servers might have indirection records pointing to each other’s logs, but the resulting dependencies are cleaned up independently.

4 Evaluation

To evaluate Shadowfax, we focused on six key questions:

Does it preserve FASTER’s performance?

§4.2 shows that Shadowfax preserves FASTER’s scalability and adds in negligible overhead. Its throughput scales to 130 Mops/s on 64 threads on a VM even when using Linux TCP.

How does it compare to an alternate design?

§4.2 shows that Shadowfax performs 4x better than a state-of-the-art approach that partitions dispatch as well as data.

Does it provide low latency?

§4.3 shows that while serving 130 Mops/s, Shadowfax’s median latency is 1.3 ms on Linux TCP. Using two-sided RDMA decreases this to 40 µs.

Can it maintain high throughput during scale out?

§4.4 shows that when migrating 10% of a server’s hash range, Shadowfax’s scale-out protocol can maintain system throughput above 80 Mops/s. Parallel data migration can help complete scale out in under 17 s, and sampled records help recover throughput 30% faster (§4.4.3).

Do indirection records help scale out?

§4.4.2 shows that by restricting migration to main memory, indirection records help speed scale out by 6x. They also have a negligible impact on server throughput once scale out completes.

Do views reduce scale out’s impact on normal operation?

§4.4.4 shows that validating ownership using views has a negligible impact on normal case server throughput. When compared to hash validating each request within a batch, views improve throughput by as much as 17% depending on the number of hash ranges owned by the server.

We also ran Shadowfax on a small CloudLab [42] cluster consisting of 256 cores spread across 8 servers and found that its throughput scales linearly to 400 Mops/s. We omit this experiment from this paper because of a lack of space.

4.1 Experimental Setup

CPU Xeon E5-2673 v4 2.3 GHz, 64 vCPUs in total
RAM 432 GB
SSD 96,000 IOPS, 500 MB/s sequential writes
Network 30 Gbps, Hardware accelerated
OS Ubuntu 18.04, Linux 5.0.0-1036-azure
Table 1: Virtual machine details used to evaluate Shadowfax. This is the E64_v3 series available on Azure. Instances were configured to use hardware acceleration to speed up Linux’s TCP stack.

We evaluated Shadowfax on the Azure public cloud [15]. We ran all experiments on the E64_v3 series of virtual machines [5] (Table 1). Experiments use 64 cores unless otherwise noted. Each VM uses accelerated networking, which offloads much of the networking stack onto FPGAs [1], allowing us to evaluate Shadowfax over regular Linux TCP. Shadowfax’s remote tier uses Azure page blobs on premium storage [3], which offer 7,500 random IOPS and a write throughput of 250 MB/s per blob.

We used a dataset of 250 million records, each consisting of an 8 byte key and 256 byte value (totalling 80 GB in Shadowfax). To evaluate the system under heavy ingest, we used YCSB’s F workload [12], consisting of read-modify-write requests. Each request reads a record, increments a counter within the record, and writes back the result. This counter could represent heartbeats for a sensor device, click counts for an advertisement, or views/likes on a social media profile. Unless noted, requested keys follow YCSB’s default Zipfian distribution.
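
For concreteness, the read-modify-write each request performs looks roughly like this; the record layout is an assumption for illustration, and only the key/value sizes above come from the experimental setup.

```cpp
#include <cstdint>

// 256-byte value with an 8-byte counter at a fixed offset (layout assumed).
struct Value {
  uint64_t counter;        // heartbeats, ad clicks, or profile views/likes
  uint8_t  payload[248];   // remainder of the 256-byte value
};

// The YCSB-F style request: read the record, bump the counter, write it back.
void RmwIncrement(Value& v) { v.counter += 1; }
```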

We compare against two baselines: one representing the state-of-the-art in fast request processing, the other representing the state-of-the-art in data migration.

Seastar+Memcached [10] is an open-source framework for building high-performance multi-core services. Its shared-nothing design contrasts with Shadowfax’s; servers partition data across cores, eliminating the need for locking. Clients can send requests to any server thread; Seastar uses message passing via shared-memory queues to route each request to the core that processes requests for that data item. Seastar represents a best case for the state-of-practice; it is highly optimized. It uses lightweight, asynchronous futures to avoid context switch overheads, and it uses advanced NIC features like FlowDirector [6] to partition and scale network processing. We used an open-source, lock-free, shared-nothing version of memcached on Seastar as a baseline [9]. We batched 100 operations per request, which maximized its throughput.

Rocksteady [26] is a state-of-the-art migration protocol for RAMCloud [39]. To accelerate migration, it immediately routes requests for migrated records to the target while it is transferring records (which only reside in memory). It slowly performs disk I/O in the background to incorporate the migrated records into durable, on-disk replicas that belong to the target; this must complete before the source and target can be independently recovered. We modified Shadowfax to use a similar approach as a baseline: instead of using indirection records, all in-memory records are moved first; then, the source performs a sequential scan over all records on durable storage, and all live records encountered are sent to the target.

4.2 Throughput Scalability

Figure 8: Shadowfax’s thread scalability. With TCP acceleration enabled, throughput scales linearly to 130 Mops/s and tracks FASTER. With acceleration disabled, throughput scales to only 75 Mops/s.
Figure 9: Shadowfax’s thread scalability under a uniform distribution. With TCP acceleration enabled, throughput scales linearly to 87 Mops/s. In comparison, Seastar scales to 10 Mops/s.

Network | Saturation Throughput (Mops/s) | Batch Size (KB) | Median Latency (µs) | Queue Depth
TCP | 130 | 32 | 1300 | 1927
w/o Accel | 75 | 32 | 2200 | 1927
Infrc | 126 | 1 | 38.6 | 60
TCP-IPoIB | 125 | 8 | 260 | 482
Table 2: Shadowfax’s latency at server saturation. On Azure’s RDMA instances, it can maintain a median latency of 40 µs while performing 126 Mops/s. With TCP, this increases to 1.3 ms.
Figure 10: Running throughput when 10% of a server’s load is migrated to an idle target. Migration was initiated at 1 minute. For a memory budget of 60 GB (graph (b)), the scale-out protocol shifts this load in 32 s while maintaining throughput above 80 Million ops/sec.
Figure 11: Source and target throughput during scale up. Sequentially scanning over the cold tier during migration (graph (c)), increases the duration of scale out to 180 s during which the source loses one thread’s worth of throughput (1.5 Million ops/sec).

Shadowfax partitions request dispatching across threads for performance. It shares access to FASTER between threads to provide high throughput even under skew. To demonstrate this, we measured throughput while scaling the number of threads on one server machine with one client machine. The entire dataset resides in memory, ensuring the experiment is CPU-limited. Figure 8 shows the results on Shadowfax, on FASTER when requests are generated on the same machine, and on Shadowfax without hardware accelerated networking.

Shadowfax retains FASTER’s scalability. FASTER scales to service 128 Mops/s on 64 threads. Adding in the dispatch layer and remote client preserves performance; Shadowfax scales to 130 Mops/s on 64 threads. This is because it avoids cross thread synchronization or communication for request processing from the point a client thread issues a request until the server thread executes it on FASTER. Client threads’ pipelined batches of asynchronous requests also avoid any slowdown from stalls induced by network delay, keeping all threads at the client and server busy at all times.

Hardware network acceleration also plays an important role in maintaining performance; when disabled, throughput drops by 1.7x to 75 Mops/s on 64 threads. Here, CPU overhead for TCP transport processing increases, so the server slows due to additional time spent in recv() syscalls instead of doing work. Hardware acceleration offloads a significant portion of packet processing to a SmartNIC, allowing Shadowfax to maintain FASTER’s scalability without relying on kernel-bypass networking using DPDK or RDMA.

Next, we compared Shadowfax to Seastar (Figure 9) using a uniform key access distribution; this is the only distribution that Seastar’s client harness supports (this advantages Seastar’s shared-nothing approach, which suffers imbalance under skew). Seastar scales to 10 Mops/s on 28 threads, after which throughput is flat. Shadowfax scales linearly to 85 Mops/s on 64 threads; even at 28 threads, it is already 4x faster than Seastar. This is because Seastar partitions work at the wrong layer; threads maintain independent indices to avoid synchronizing on records, but this forces a thread that receives a request to use inter-core message passing to route it to the thread that holds that record. Shadowfax’s shared FASTER instance is lock-free and minimizes cache footprint. This leaves all synchronization and communication to the hardware cache coherence, which is more efficient than explicit software coordination and only incurs high costs when real contention arises in data access patterns, rather than pessimistically synchronizing on all requests. Shadowfax’s advantage grows with skew; Figure 8 shows its performance improves by 1.5x under skew, whereas Seastar’s performance would decrease.

4.3 Batching and Latency

Shadowfax clients send requests in pipelined batches to amortize network overheads and keep servers busy. Asynchronous requests with hardware network acceleration help reduce batch sizes and latency. To show this, we measured its median latency and batch size at server saturation. Table 2 presents results on TCP, TCP with hardware acceleration disabled, and with two-sided RDMA (Infrc). We used Azure’s HC44rs [4] instances for Infrc, since they support (100 Gbps) RDMA; they have Xeon Platinum 8168 processors with 44 vCPUs.

Most of Shadowfax’s latency comes from batching, which amortizes network CPU costs. Accelerated networking reduces CPU load, decreasing the batching needed to retain throughput: the batch size required to saturate throughput stays small at 32 KB, which also keeps median latency low at 1.3 ms. Without acceleration, increasing the batch size doesn’t help; with 32 KB batches, throughput drops to 75 Mops/s and median latency increases to 2.2 ms.

Predictably, the batch size required to saturate throughput on Infrc is significantly lower at 1 KB, dropping median latency to 40 µs. This is because the network is faster and the stack is implemented in hardware; servers and clients can receive and transmit batches with near-zero software overhead (including system calls). Secondly, vCPUs on these instances are faster; they have a base clock rate of 2.7 GHz compared to 2.3 GHz on the TCP instances (Table 1). This speeds servers and clients, reducing the batch size and threads (from 64 to 44) required to reach the same throughput.

To evaluate this further, we ran Shadowfax using TCP over IPoIB [7] on the Infrc instances (Table 2, TCP-IPoIB). Throughput still saturates at 125 Mops/s. Compared to hardware accelerated TCP, faster vCPUs reduce the batch size by 4x (8 KB) and median latency by 5x (260 µs). Differences in the network might also contribute to these improvements, but we found Shadowfax to be CPU-bound in both cases.

4.4 Scale Out

Figure 12: Number of pending operations during scale up. Indirection records (graph (b)) result in remote accesses to shared storage, which leads to larger pending queues once scale up has completed. Without them (graphs (a), (c)), these queues drain shortly after scale up completes.

Shadowfax’s migration transfers hash ranges between two machines and minimizes throughput impact while doing so. Indirection records help restrict migration to memory, speeding up scale out, decoupling the source and target sooner. To demonstrate this, we measured throughput during scale up.

In a 5-minute experiment with one client and two servers (a source and a target), the entire hash space initially resides at the source. After one minute, 10% of this hash range is moved to the target. Figure 10 shows system throughput during the experiment; Figure 11 shows source and target throughput separately. In (a), all records are placed in memory. In (b) and (c), servers are restricted to a memory budget of 60 GB, allowing us to compare the impact of indirection records (in (b)) against Rocksteady’s scan-the-log approach (in (c)).

4.4.1 All-In-Memory Scale Out

Global cuts for ownership transfer avoid stalling cores at migration start, but the view change for this cut has some impact; request batches are invalidated, causing requests to be shuffled among session buffers at the client. This is visible in Figure 10 (a); throughput at the start of scale out (1 minute) briefly drops to 80 Mops/s.

Figure 11 (a) shows that throughput on the source stays at 85 Mops/s after this. This is because the source is collecting and transmitting records as it services requests. Parallel migration limits the length of this impact in two ways. First, it accelerates migration, completing in 17 s and restoring full throughput. Second, as more records shift to the target, it serves more requests, causing system throughput to recover even before scale up completes. Once scale up completes, system throughput increases by 10% as expected.

Shadowfax’s asynchronous client library helps limit the impact too. When the target receives a request for a record that has not been migrated yet, it marks the request as pending. This keeps clients from blocking, allowing them to continue sending requests. To prevent a buildup of pending requests, the target periodically tries to complete them. Figure 12 (a) shows the number of pending operations at the target during migration. When migration starts, requests flood the target, and pending requests accumulate to 100 million. As records migrate, these requests complete, with the last pending operation completing 100 s after migration starts. Hence, practical migrations must be small and incremental to bound this delay; in Shadowfax’s target applications, however, throughput recovery matters more, and the added latency can be tolerated through asynchrony.

4.4.2 Indirection Records

With a 60 GB memory budget, some records to be migrated are on the source’s SSD. Rocksteady’s approach (Figure 10 (c)) migrates records from memory and then scans the on-SSD log to migrate colder records. Parallel migration completes the in-memory phase in just 14 s. Throughput improves quickly after this phase, since these are hotter records. However, the second phase is single threaded, scans over files on SSD, and takes 165 s to complete. Hence, scale out takes 180 s, and the source and target remain inter-dependent for fault tolerance.

Figure 13: Impact of indirection records on migration size. Indirection records lead to larger migrations because migration can no longer scan and filter out records that are not in main memory.

Indirection records solve this, completing migration in 32 s (Figure 10 (b)) and decoupling the source and target 6x faster. By sending out records that point to shared remote storage, migration is restricted to memory and avoids I/O at the source altogether. However, this approach increases the amount of data transmitted to the target. Figure 13 shows this effect. Compared to Rocksteady’s 5.60 GB, indirection records cause 16.47 GB to be transmitted from memory to the target. This is because we must send about one indirection record per hash table bucket entry, totaling 11 GB here. The larger migration takes 18 s longer than Rocksteady’s in-memory phase, but it decreases the total duration of migration by 150 s.

After migration, requests that hit indirection records at the target cause remote accesses to shared cloud storage. These requests are infrequent (these records are cold), and they have little impact on throughput (Figure 10 (b)). However, cloud storage is slow, so in the time it takes to retrieve one such record, the target receives many requests for it which must pend. Requests that pend during scale out complete by 4 minutes (Figure 12 (b)). The gradual upward slope after this is due to the requests that pend on access to remote shared storage. Requests never pend after scale out with Rocksteady; however, its slow sequential scan causes requests to pend awaiting transmission from the source during its longer migration.

We also measured the impact of fetching records from shared remote storage when resolving indirection records during compaction, but its throughput impact was negligible.

4.4.3 Sampled Records

Figure 14: Impact of shipping sampled records during ownership transfer to the target. Sending these hot records over improves the target’s throughput during the first 5 s of migration.

Shadowfax sends a small set of hot records to the target during ownership transfer, which allows the target to start servicing requests and recovering throughput quickly. Figure 14 shows target throughput when this is enabled (Sampling) and when it is disabled (No Sampling). In this experiment, all data starts in the source’s memory, so scale out completes in 17 s. When enabled, throughput at the target rises up to 8 Mops/s immediately after ownership transfer. If disabled, this happens 5 s later, once sufficient records have been migrated over. At this point, nearly 30% of scale out has completed, meaning that by sampling and shipping hot records during ownership transfer, the target starts contributing to system throughput 30% faster.

4.4.4 Ownership Validation

Figure 15: The overhead of using views to validate record ownership at a server is negligible. When coupled with fast migration, this allows Shadowfax to shard and redistribute load whenever required.

Views allow Shadowfax to fluidly move ownership of hash ranges between servers and help minimize the overhead of scale out on normal operation of the system. Figure 15 demonstrates this; it presents normal case server throughput under an increasing number of hash splits. When using views to validate record ownership at the server (View Validation), throughput stays fairly constant. With an approach that instead hashes every received key and looks up a trie of owned hash ranges at the server (Hash Validation), throughput gradually drops as the number of hash splits increases.
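The contrast can be sketched as below; this is an illustration of the two checks rather than Shadowfax’s implementation, and a sorted map stands in for the trie of owned hash ranges.

```cpp
// Minimal sketch contrasting the two validation strategies. View validation:
// the client stamps each batch with the view it believes is current, so the
// server compares one integer per batch. Hash validation: the server hashes
// every key and searches its set of owned hash ranges per request.
#include <cstdint>
#include <map>
#include <vector>

struct Batch {
  uint64_t client_view;          // view the client thinks the server is in
  std::vector<uint64_t> hashes;  // key hashes in this batch
};

class Server {
 public:
  explicit Server(uint64_t view) : view_(view) {}

  // O(1) per batch: reject the whole batch if the client's view is stale.
  bool ValidateByView(const Batch& b) const { return b.client_view == view_; }

  // O(log n) per key: look up each hash in the owned-range map.
  bool ValidateByHash(const Batch& b) const {
    for (uint64_t h : b.hashes) {
      auto it = owned_.upper_bound(h);   // first range starting after h
      if (it == owned_.begin()) return false;
      --it;                              // candidate range containing h
      if (h > it->second) return false;  // h lies past the range's end
    }
    return true;
  }

  void Own(uint64_t start, uint64_t end) { owned_[start] = end; }

 private:
  uint64_t view_;
  std::map<uint64_t, uint64_t> owned_;  // start -> end (inclusive) of owned hash ranges
};

int main() {
  Server s(/*view=*/3);
  s.Own(0x0000, 0x7FFF);
  Batch b{3, {0x0042, 0x7000}};
  bool ok_view = s.ValidateByView(b);  // one comparison for the whole batch
  bool ok_hash = s.ValidateByHash(b);  // one range lookup per key
  (void)ok_view; (void)ok_hash;
}
```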

This figure shows the benefit of using views given a particular scale out granularity; if scale out always moves 7% of a server’s load (16 hash splits), then view validation can improve normal case server throughput by 5%. Similarly, if it always moves 0.2% of a server’s load (512 hash splits), then this improvement increases to 10%.

5 Related Work

Shadowfax builds on several areas of recent research.

Epochs and Cuts. There are many schemes for synchronization and memory protection in lock-free concurrent data structures, including hazard pointers [35], read-copy-update [34] and epoch-based schemes [27, 21]. Like Shadowfax, several other systems [30, 28, 29, 31] use epochs for this purpose.

Shadowfax’s use of epochs resembles that of Silo, a (single-node) in-memory store [44]. As in Silo, Shadowfax’s epochs avoid strong ordering among requests except on coarse boundaries, improving scalability. Silo also uses epochs to improve write-ahead logging scalability [47]. Shadowfax extends these epochs back to clients by asynchronously choosing points in server execution and correlating them back to per-client sequence numbers, effectively pushing the overhead of logging out of servers altogether. Similarly, Scalog’s persistence-before-ordering approach uses global cuts that define and order shards of operations on different machines [17]; Shadowfax uses similar cuts across threads and client session buffers to define an order that enforces boundaries among operations.

High-throughput Networked Stores. Some in-memory stores exploit kernel-bypass networking or RDMA and optimize for multicore. Many of these focus on throughput but do not provide fault tolerance or scale out [36, 33, 23], both of which can slow normal-case request processing. RAMCloud focuses on low-latency and has crash recovery, but its throughput is two orders of magnitude less than Shadowfax [39, 38].

FaRM [18, 19] creates large clusters of distributed memory where clients primarily use one-sided RDMA reads to construct data structures like hash tables. FaRM supports scale out and crash recovery by relying on whole-machine battery backup and in-memory replication. FaRM’s reported per-core throughput is about 300,000 reads/s/core, compared to Shadowfax’s 1.5 million read-modify-writes/s/core, though there are differences in experimental setup. For example, FaRM does not report numbers for read-modify-write or write-only workloads, which are significantly more expensive in FaRM since they involve the server CPU, require replication, and cannot be done with one-sided RDMA operations.

Elasticity. Scale out and migration are key features in shared, replicated stores [16, 2, 8]. High-throughput, multicore stores complicate this because normal-case request processing is highly optimized and migration competes for CPU. Some stores rely on in-memory replicas for fast load redistribution [19, 45]; replicas are expensive due to DRAM’s high cost and replication overhead, so this approach does not suit Shadowfax.

Squall [20] migrates data in the H-Store [24] database; it exploits skew via on-demand record pulls from source to target with colder data moved in the background. Rocksteady [26] uses this idea in RAMCloud along with a deferred replication scheme that avoids write-ahead logging for migrated data.

6 Conclusion

Practical KVSs must ingest events over the network and elastically scale across machines. Shadowfax does this with state-of-the-art performance that reaches 130 Mops/s/VM by relying on its asynchronous global cuts, partitioned sessions, and end-to-end asynchronous clients.

References

  • [1] Accelerated Networking. https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-cli. Accessed: 4/22/2020.
  • [2] Apache Cassandra. http://cassandra.apache.org/. Accessed: 2/28/2020.
  • [3] Azure Blob storage. https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-pageblob-overview. Accessed: 4/22/2020.
  • [4] Azure HPC VMs. https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/. Accessed: 4/27/2020.
  • [5] Azure Memory Optimized VMs. https://docs.microsoft.com/en-us/azure/virtual-machines/ev3-esv3-series. Accessed: 4/22/2020.
  • [6] Intel Flow Director. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/intel-ethernet-flow-director.pdf. Accessed: 4/22/2020.
  • [7] IPoIB. https://www.advancedclustering.com/act_kb/ipoib-using-tcpip-on-an-infiniband-network/. Accessed: 4/28/2020.
  • [8] Redis. http://redis.io/. Accessed: 2/28/2020.
  • [9] Seastar Applications. http://seastar.io/seastar-applications/. Accessed: 4/22/2020.
  • [10] Seastar Framework. http://seastar.io. Accessed: 4/22/2020.
  • [11] Spark Streaming. https://spark.apache.org/streaming/.
  • [12] YCSB Workloads. https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads. Accessed: 4/22/2020.
  • [13] Chandramouli, B., Goldstein, J., Barnett, M., DeLine, R., Fisher, D., Platt, J. C., Terwilliger, J. F., and Wernsing, J. Trill: A high-performance incremental query processor for diverse analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412.
  • [14] Chandramouli, B., Prasaad, G., Kossmann, D., Levandoski, J., Hunter, J., and Barnett, M. Faster: A concurrent key-value store with in-place updates. In Proceedings of the 2018 International Conference on Management of Data (New York, NY, USA, 2018), SIGMOD ’18, ACM, pp. 275–290.
  • [15] Copeland, M., Soh, J., Puca, A., Manning, M., and Gollob, D. Microsoft Azure: Planning, Deploying, and Managing Your Data Center in the Cloud, 1st ed. Apress, USA, 2015.
  • [16] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon’s highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (New York, NY, USA, 2007), SOSP ’07, Association for Computing Machinery, p. 205–220.
  • [17] Ding, C., Chu, D., Zhao, E., Li, X., Alvisi, L., and Renesse, R. V. Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (Santa Clara, CA, Feb. 2020), USENIX Association, pp. 325–338.
  • [18] Dragojević, A., Narayanan, D., Castro, M., and Hodson, O. Farm: Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (Seattle, WA, Apr. 2014), USENIX Association, pp. 401–414.
  • [19] Dragojević, A., Narayanan, D., Nightingale, E. B., Renzelmann, M., Shamis, A., Badam, A., and Castro, M. No compromises: distributed transactions with consistency, availability, and performance. In SOSP (2015), pp. 85–100.
  • [20] Elmore, A. J., Arora, V., Taft, R., Pavlo, A., Agrawal, D., and El Abbadi, A. Squall: Fine-grained live reconfiguration for partitioned main memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2015), SIGMOD ’15, ACM, pp. 299–313.
  • [21] Fraser, K. Practical lock-freedom. PhD thesis, University of Cambridge, UK, 2004.
  • [22] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In 2010 USENIX Annual Technical Conference, Boston, MA, USA, June 23-25, 2010 (2010), USENIX Association.
  • [23] Kalia, A., Kaminsky, M., and Andersen, D. G. Using RDMA efficiently for key-value services. In ACM SIGCOMM 2014 Conference, SIGCOMM’14, Chicago, IL, USA, August 17-22, 2014 (2014), pp. 295–306.
  • [24] Kallman, R., Kimura, H., Natkins, J., Pavlo, A., Rasin, A., Zdonik, S., Jones, E. P. C., Madden, S., Stonebraker, M., Zhang, Y., Hugg, J., and Abadi, D. J. H-store: A High-performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1496–1499.
  • [25] Kaufmann, A., Peter, S., Sharma, N. K., Anderson, T., and Krishnamurthy, A. High performance packet processing with flexnic. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2016), ASPLOS ’16, Association for Computing Machinery, p. 67–81.
  • [26] Kulkarni, C., Kesavan, A., Zhang, T., Ricci, R., and Stutsman, R. Rocksteady: Fast migration for low-latency in-memory storage. In Proceedings of the 26th Symposium on Operating Systems Principles (New York, NY, USA, 2017), SOSP ’17, ACM, pp. 390–405.
  • [27] Kung, H. T., and Lehman, P. L. Concurrent manipulation of binary search trees. ACM Trans. Database Syst. 5, 3 (Sept. 1980), 354–382.
  • [28] Levandoski, J., Lomet, D., Sengupta, S., Stutsman, R., and Wang, R. High Performance Transactions in Deuteronomy. In Conference on Innovative Data Systems Research (CIDR 2015) (2015).
  • [29] Levandoski, J., Lomet, D., Sengupta, S., Stutsman, R., and Wang, R. Multi-version Range Concurrency Control in Deuteronomy. Proceedings of the VLDB Endowment 8, 13 (Sept. 2015), 2146–2157.
  • [30] Levandoski, J. J., Lomet, D. B., and Sengupta, S. The Bw-Tree: A B-tree for new hardware platforms. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013 (2013), pp. 302–313.
  • [31] Levandoski, J. J., Lomet, D. B., Sengupta, S., Birka, A., and Diaconu, C. Indexing on Modern Hardware: Hekaton and Beyond. In SIGMOD (2014), pp. 717–720.
  • [32] Li, B., Ruan, Z., Xiao, W., Lu, Y., Xiong, Y., Putnam, A., Chen, E., and Zhang, L. Kv-direct: High-performance in-memory key-value store with programmable nic. In Proceedings of the 26th Symposium on Operating Systems Principles (New York, NY, USA, 2017), SOSP ’17, ACM, pp. 137–152.
  • [33] Li, S., Lim, H., Lee, V. W., Ahn, J. H., Kalia, A., Kaminsky, M., Andersen, D. G., Seongil, O., Lee, S., and Dubey, P. Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (New York, NY, USA, 2015), ISCA ’15, ACM, pp. 476–488.
  • [34] McKenney, P. E., and Slingwine, J. D. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems (1998), pp. 509–518.
  • [35] Michael, M. M. Safe Memory Reclamation for Dynamic Lock-free Objects Using Atomic Reads and Writes. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing (New York, NY, USA, 2002), PODC ’02, ACM, pp. 21–30.
  • [36] Mitchell, C., Geng, Y., and Li, J. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In 2013 USENIX Annual Technical Conference, San Jose, CA, USA, June 26-28, 2013 (2013), pp. 103–114.
  • [37] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013, Lombard, IL, USA, April 2-5, 2013 (2013), N. Feamster and J. C. Mogul, Eds., USENIX Association, pp. 385–398.
  • [38] Ongaro, D., Rumble, S. M., Stutsman, R., Ousterhout, J., and Rosenblum, M. Fast Crash Recovery in RAMCloud. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 29–41.
  • [39] Ousterhout, J., Gopalan, A., Gupta, A., Kejriwal, A., Lee, C., Montazeri, B., Ongaro, D., Park, S. J., Qin, H., Rosenblum, M., and et al. The RAMCloud storage system. ACM Trans. Comput. Syst. 33, 3 (Aug. 2015).
  • [40] Phothilimthana, P. M., Liu, M., Kaufmann, A., Peter, S., Bodik, R., and Anderson, T. Floem: A programming system for nic-accelerated network applications. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (USA, 2018), OSDI’18, USENIX Association, p. 663–679.
  • [41] Prasaad, G., Chandramouli, B., and Kossmann, D. Concurrent Prefix Recovery: Performing CPR on a Database. In Proceedings of the 2019 International Conference on Management of Data (New York, NY, USA, 2019), SIGMOD ’19, Association for Computing Machinery, p. 687–704.
  • [42] Ricci, R., Eide, E., and The CloudLab Team. Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications. USENIX ;login: 39, 6 (Dec. 2014).
  • [43] Stonebraker, M., and Weisberg, A. The voltdb main memory DBMS. IEEE Data Eng. Bull. 36, 2 (2013), 21–27.
  • [44] Tu, S., Zheng, W., Kohler, E., Liskov, B., and Madden, S. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP ’13, Association for Computing Machinery, p. 18–32.
  • [45] Wei, X., Shen, S., Chen, R., and Chen, H. Replication-driven live reconfiguration for fast distributed transaction processing. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017. (2017), pp. 335–347.
  • [46] Wu, C., Faleiro, J., Lin, Y., and Hellerstein, J. Anna: A kvs for any scale. In 2018 IEEE 34th International Conference on Data Engineering (ICDE) (2018), pp. 401–412.
  • [47] Zheng, W., Tu, S., Kohler, E., and Liskov, B. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’14, Broomfield, CO, USA, October 6-8, 2014. (2014), pp. 465–477.