Non-volatile memory (NVM), such as Intel’s Optane DC persistent memory module (PMM) (27), is now commercially available. NVM provides high-capacity durable memory with near-DRAM performance at much lower cost and energy use. These properties are likely to drive the adoption of NVM as a main memory substitute (ZDNet, 2019, 2018; InsideHPC, 2019). With remote direct memory access (RDMA), NVM access across the network is also low overhead (Yang et al., 2019; Shan et al., 2018), leading us to explore the use of NVM for distributed file systems.
A common paradigm for cloud-based distributed file systems, such as Amazon EFS (3), NFS (Haynes and Noveck, 2015), Ceph (Weil et al., 2006), Colossus/GFS (Ghemawat et al., 2003), and Octopus (Lu et al., 2017), is to separate storage from caching concerns. File storage is disaggregated, with file management centralized on machines physically separated from application servers. Client memory is treated as a shared, volatile block cache for file data and metadata managed by the local kernel. Storage disaggregation simplifies resource pooling and server specialization by physically separating application and file system concerns.
Disaggregation’s simplicity comes at a cost, which becomes apparent as we move from SSD/HDD to NVM. First, in steady state, applications are slowed by the overhead to access kernel-level client caches, and (on cache misses) by the need for multiple network round trips to consult disaggregated meta-data servers and then to access the actual data. Second, on failure, disaggregated file systems must rebuild data and metadata caches of failed clients from scratch, causing long recovery times to reestablish application-level service and high network utilization during recovery. Third, managing client caches at page-block granularity amplifies the small IO operations typical of many of today’s distributed applications. These costs are well-known within the high-performance storage community and have led some to advocate for a complete redesign of the file system interface (54; 46; 68; 37).
Rather than a new API, we present a design for a distributed file system aimed at colocated NVM storage rather than disaggregated storage. We present Assise, a distributed file system providing a POSIX API that unleashes the performance of NVM by making local NVM a primary design principle. For availability, Assise provides near-instantaneous application fail-over onto a cache replica that mirrors the application’s local file system cache in its NVM, an approach that would be impossible in a disaggregated design. Similarly, Assise provides applications efficient small file IO to NVM without kernel involvement or block amplification. Assise introduces reserve replicas that can act as a remote, next-level cache with lower latency than local SSD. In cascaded failure scenarios, these reserve replicas become cache replicas to preserve near-instantaneous fail-over.
To enable these properties, we designed and built what is, to our knowledge, the first crash-consistent file cache coherence layer for NVM (CC-NVM). CC-NVM serves cached file system state in Assise with strong consistency guarantees. CC-NVM provides crash consistency with prefix semantics (Wang et al., 2013) by enforcing write order to local NVM via logging and to cross-socket and remote NVM by leveraging the write ordering of DMA and RDMA, respectively. CC-NVM provides linearizability at arbitrary granularity via leases (Gray and Cheriton, 1989) that can be delegated among nodes and applications for direct management of file system state. Finally, CC-NVM supports automatic management of large-capacity, slow storage by migrating cold data at block granularity to disaggregated media, such as SSDs.
Assise achieves the following goals.
High availability. Assise provides near-instantaneous application fail-over to a configurable number of replicas, while minimizing the time to restore the system replication factor after a failure.
Scalability. Assise provides strong consistency but remains scalable using dynamic delegation of leases to nodes and applications; node-local sharing uses CC-NVM for consistency without network communication.
Low tail latency. To efficiently support diverse application workloads, Assise logs IO operations directly to colocated NVM via kernel-bypass, with asynchronous chain replication (van Renesse and Schneider, 2004) and optimistic crash consistency.
Simple programming model. Assise supports unmodified applications using the familiar POSIX API with linearizability and prefix crash consistency semantics (Wang et al., 2013).
We make the following contributions.
We present the design and implementation of Assise, a distributed file system that efficiently exploits NVM by leveraging colocation as a primary design principle. Assise is the first distributed file system to recover the file system cache for fast fail-over and to locally synchronize reads and writes to file system state.
We present CC-NVM, the first persistent and available cache coherence layer. CC-NVM provides locality for data and meta-data updates, replicates for availability, provides crash consistency with prefix semantics for persistence, and linearizability for shared state access.
We quantify the performance benefits of NVM colocation versus disaggregation for distributed file systems. We compare Assise’s steady-state and fail-over behavior to RDMA-accelerated versions of Ceph (with Bluestore (Aghayev et al., 2019)) and NFS, as well as Octopus (Lu et al., 2017), a distributed file system designed for RDMA and NVM, using common cloud applications and benchmarks, such as LevelDB, Postfix, MinuteSort, and FileBench.
Our evaluation shows that Assise provides up to 22× lower write latency and up to 56× higher throughput than NFS and Ceph (with Bluestore). Assise also outperforms Octopus by up to an order of magnitude for these workloads. For a sharded mail server workload, Assise scales better than Ceph, providing 6× higher throughput at scale. Finally, Assise is more available than Ceph, returning a recovering key-value store to full throughput up to 103× faster. Assise’s implementation builds on Strata (Kwon et al., 2017) as its node-local store, and we plan to release Assise as open source.
One limitation of our current implementation is that while we reuse NVM file data after a system crash, we do not yet implement fast reboot of the guest OS kernel from checkpointed NVM state. This does not impact application recovery time as the cache replica can be quickly restored to full performance, but it does mean that we need to wait for the OS kernel to reboot to reestablish the file system’s replication factor.
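Table 1: Measured access latency, bandwidth, and cost of server memory and storage technologies.

| Memory | R/W latency | Seq. R/W bandwidth (GB/s) | Cost ($/GB) |
|---|---|---|---|
| DDR4 DRAM | 82 ns | 107 / 80 | 35.16 (Hardware, 2019) |
| NVM (local) | 175 / 94 ns | 32 / 11.2 | 4.51 (Hardware, 2019) |
| NVM-NUMA | 230 ns | 4.8 / 7.4 | - |
| NVM-RDMA | 3 / 8 µs | 3.8 | - |
| SSD (local) | 10 µs | 2.4 / 2.0 | 0.32 (28) |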
Distributed applications have diverse workloads, with IO granularities large and small (Li et al., 2019), different sharding patterns, and consistency requirements. All demand high availability and scalability. Supporting all of these properties simultaneously has been the goal of decades of distributed file systems research (Weil et al., 2006; Lu et al., 2017; Yang et al., 2019; Hitz et al., 1994; Watanabe, 2009; Anderson et al., 1995; Haynes and Noveck, 2015). Until recently, trade-offs had to be made, for example, by favoring large transfers ahead of small IO, performance ahead of crash consistency, or common case performance ahead of fast recovery. We argue that with the advent of fast non-volatile memory, many of these trade-offs need to be re-evaluated.
The opportunity posed by NVM is two-fold:
Table 1 shows measured access latency, bandwidth, and cost for modern server memory and storage technologies, including Optane DC PMM (measurement details in §5). We can see that local NVM access latency and bandwidth are up to an order of magnitude better than SSD or NVM accessed via RDMA (NVM-RDMA) or resident on another NUMA node (NVM-NUMA). At the same time, NVM’s per-byte cost is 13% that of DRAM. NVM’s unique characteristics allow it to be used as the first layer in the storage hierarchy, as well as the last layer in a server’s memory hierarchy. Data center operators are already moving to deploy NVM at scale (ZDNet, 2018, 2019; InsideHPC, 2019).
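The ratios quoted here follow directly from Table 1. As a small worked calculation (using only the table's cost and sequential read bandwidth values; the variable names are ours):

```python
# Values copied from Table 1: cost in $/GB, sequential read bandwidth in GB/s.
DRAM_COST, NVM_COST = 35.16, 4.51
NVM_LOCAL_READ_BW, NVM_RDMA_BW, SSD_READ_BW = 32.0, 3.8, 2.4

cost_ratio = NVM_COST / DRAM_COST              # NVM cost relative to DRAM
bw_vs_rdma = NVM_LOCAL_READ_BW / NVM_RDMA_BW   # local NVM vs. NVM accessed via RDMA
bw_vs_ssd = NVM_LOCAL_READ_BW / SSD_READ_BW    # local NVM vs. local SSD

print(f"NVM cost is {cost_ratio:.0%} of DRAM cost")
print(f"Local NVM read bandwidth: {bw_vs_rdma:.1f}x NVM-RDMA, {bw_vs_ssd:.1f}x SSD")
```

The cost ratio rounds to 13%, and the bandwidth ratios fall between 8× and 14×, consistent with the "up to an order of magnitude" claim.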
NVM and recovery.
Persistent storage with near-DRAM performance can provide a recoverable cache for hot file system data that can persist across reboots. The vast majority of system failures are due to software crashes that simply require rebooting (Ford et al., 2010; Birke et al., 2014; Hennessy and Patterson, 2017). Caching hot file system data in NVM allows us to recover quickly from these failures.
2.1. Alternatives are Insufficient
Disaggregated block stores.
Block stores, such as Amazon’s EBS (2) and S3 (7), use a multi-layer storage hierarchy to provide cheap access to vast amounts of data (Li et al., 2019). However, block stores have a minimum IO granularity (16KB for EBS) and IO smaller than the block size suffers performance degradation from write amplification (Raju et al., 2017; Li et al., 2019).
Disaggregated file stores.
Disaggregated file systems like Ceph (Weil et al., 2006) use consistent hashing over data and metadata to provide scalable file service for cloud applications. However, remote access for data harms performance as shown in Table 1. While they can support small IO more efficiently than block stores, their promise of higher throughput via parallel access to disaggregated storage is surpassed by the up to 8× higher bandwidth of local NVM.
RDMA-optimized file stores.
To combat network overheads, several distributed file systems have been built (Lu et al., 2017; Yang et al., 2019) or retrofitted (Weil et al., 2006; Haynes and Noveck, 2015) to use RDMA. Orion (Yang et al., 2019) goes one step further and leverages local NVM for data storage, but it still disaggregates meta-data management. However, local NVM has both lower latency and higher bandwidth than the (RDMA) network and these file systems still incur network overhead for each access. The high network latency and limited bandwidth increase file system operation latency, reduce throughput, and limit scalability.
3. Assise Design
We first overview the design of Assise by highlighting how its architecture differs from that of disaggregated file systems. We then expand on the design of Assise’s components and the properties they achieve.
Figure 1 contrasts the cache coordination architecture of disaggregated file systems and Assise. Each subfigure shows two dual-socket nodes executing a number of application processes sharing a distributed file system. Both designs use a (replicated) cluster manager for membership management and failure detection, but they diverge in all other respects.
Disaggregated file systems first partition available cluster nodes into clients and servers. Clients cache hot file system state in a volatile kernel buffer cache that is shared by processors across NUMA nodes (NVM-NUMA) and accessed via expensive system calls. Persistent file system state is stored in NVM on remote servers. For persistence and consistency, clients thus have to coordinate updates with replicated storage and meta-data servers via the network (NVM-RDMA) with higher latency than local NVM. Data is typically distributed at random over replicated storage servers for simplicity and load balance. Crash consistency is often provided just for meta-data, which is centralized.
Assise avoids disaggregated servers and instead uses CC-NVM to coordinate linearizable hot state among processes. Processes access cached file system state in colocated NVM directly via a library file system (LibFS), which may be replicated (2 LibFS replicas shown in black) for immediate fail-over. CC-NVM coordinates LibFSes hierarchically via shared daemons (SharedFS, one per NUMA node) and the cluster manager; all may persist state in local NVM and replicate it via RDMA.
Crash consistency modes in Assise.
Assise supports two crash consistency modes: optimistic and pessimistic (Chidambaram et al., 2013). Mount options specify the chosen crash consistency mode. When pessimistic, fsync forces immediate, synchronous replication and all writes prior to an fsync persist across node failures. When optimistic, Assise commits all operations locally in order, but it is free to delay replication until the application forces replication with a dsync system call, which we adopt from the literature (Chidambaram et al., 2013). Optimistic mode provides lower latency persistence with higher throughput, but risks losing data in a crash that requires fail-over to another replica. In either mode, Assise guarantees a crash-consistent file system with prefix semantics (Wang et al., 2013): all recoverable writes are in order and no parts of a prefix of the write history are missing.
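The difference between the two modes can be illustrated with a toy model of one update log and a single remote replica. This is only a sketch: the class `ReplicatedLog` and its methods are hypothetical names, and treating fsync as purely local in optimistic mode (and dsync as replicating in both modes) is our simplifying assumption.

```python
class ReplicatedLog:
    """Toy model (not Assise code) of a LibFS update log with one replica."""

    def __init__(self, mode):
        assert mode in ("pessimistic", "optimistic")
        self.mode = mode
        self.local = []        # operations committed to local NVM, in order
        self.replicated = 0    # length of the log prefix held by the replica

    def write(self, op):
        self.local.append(op)  # both modes commit locally, in order

    def _replicate(self):
        self.replicated = len(self.local)

    def fsync(self):
        # Pessimistic mode: fsync forces immediate, synchronous replication.
        # Assumption: in optimistic mode fsync does not force replication.
        if self.mode == "pessimistic":
            self._replicate()

    def dsync(self):
        # dsync forces replication of everything written so far.
        self._replicate()

    def state_after_failover(self):
        # The replica only ever holds a prefix of the local write history.
        return self.local[:self.replicated]


log = ReplicatedLog("optimistic")
log.write("create /a")
log.fsync()
log.write("write /a")
print(log.state_after_failover())  # nothing replicated before dsync
log.dsync()
print(log.state_after_failover())  # full history after dsync
```

In either mode, whatever survives fail-over is a prefix of the write history, which is the property the text calls prefix semantics.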
We first describe cluster coordination and membership management in Assise (§3.1). We then describe the IO paths in more detail (§3.2) and show how CC-NVM interacts with them to provide linearizability and crash consistency with prefix semantics (§3.3). Finally, we describe recovery (§3.4) and reserve replicas (§3.5).
3.1. Cluster Coordination and Failure Detection
Like disaggregated file systems, Assise leverages a replicated cluster manager for storing the cluster configuration and detecting node failures. Assise uses the ZooKeeper (9) distributed coordination service as its cluster manager.
Unlike disaggregated file systems that only need to track storage node (and not client) membership, each SharedFS instance in Assise registers with the cluster manager. Application processes access the file system via a LibFS dynamically linked into the process. In our prototype, the system administrator decides which SharedFS instances manage replicas for which parts of the file system namespace at the granularity of a subtree; the cluster manager records this mapping. When a subtree is first accessed, LibFSes contact their local SharedFS, which consults the cluster configuration and sets up RDMA connections with the subtree’s replicas. The SharedFS allocates private cache space in NVM (1GB by default) for the LibFS on each replica and then hands the corresponding RDMA connections to LibFS.
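As an illustration of the subtree-granularity mapping, the lookup a SharedFS performs against the cluster configuration could resemble a longest-prefix match over paths. The function `replicas_for`, the dictionary layout, and the node names below are hypothetical; the paper does not specify how the mapping is stored.

```python
def replicas_for(path, subtree_map):
    """Return the replicas of the deepest configured subtree containing `path`.

    `subtree_map` maps absolute subtree paths to lists of replica names
    (an assumed representation of the administrator's configuration).
    """
    parts = [p for p in path.split("/") if p]
    # Try the full path first, then successively shorter prefixes, ending at "/".
    for depth in range(len(parts), -1, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in subtree_map:
            return subtree_map[prefix]
    raise KeyError(f"no subtree configured for {path}")


config = {"/": ["node1", "node2"], "/mail": ["node3", "node4"]}
print(replicas_for("/mail/alice/inbox", config))
print(replicas_for("/mailbox/readme", config))
```

Matching on whole path components, rather than on raw string prefixes, keeps `/mailbox` from being treated as part of the `/mail` subtree.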
The cluster manager sends heartbeat messages to each active SharedFS once every second. If no response is received after a timeout, the node is marked failed and all remaining SharedFS are notified. When the node comes back online, it contacts the cluster manager and initiates recovery (§3.4). We leave it as future work to integrate more advanced failure detectors, e.g., (So and Sirer, 2007; Leners et al., 2011).
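A minimal sketch of such a timeout-based failure detector follows. The class `HeartbeatMonitor`, the 3-second timeout, and the explicit `now` arguments (used instead of a real clock so the logic is easy to follow) are all assumptions for illustration; the paper states only the one-second heartbeat interval.

```python
class HeartbeatMonitor:
    """Toy failure detector: a node is marked failed after a reply timeout."""

    def __init__(self, timeout_s=3.0):      # timeout value is hypothetical
        self.timeout_s = timeout_s
        self.last_reply = {}                # node -> time of last heartbeat reply
        self.failed = set()

    def register(self, node, now):
        # Called when a SharedFS registers or rejoins after recovery.
        self.last_reply[node] = now
        self.failed.discard(node)

    def on_reply(self, node, now):
        if node not in self.failed:
            self.last_reply[node] = now

    def check(self, now):
        """Return the nodes newly marked failed; the rest would be notified."""
        newly_failed = [n for n, t in self.last_reply.items()
                        if n not in self.failed and now - t > self.timeout_s]
        self.failed.update(newly_failed)
        return newly_failed


monitor = HeartbeatMonitor()
monitor.register("sharedfs-a", now=0.0)
monitor.register("sharedfs-b", now=0.0)
monitor.on_reply("sharedfs-a", now=2.0)
print(monitor.check(now=4.0))
```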
3.2. IO Paths
We now describe how application writes and reads interact with Assise caches. Then, we discuss how cache eviction works, how DMA improves cross-socket evictions, and how permissions are enforced.
Writes in Assise occur in two stages:
1. LibFS directly writes to a local cache in NVM.
2. Local writes are later chain-replicated by LibFS.
Local writes: To allow direct, low latency access to NVM, Assise does not rely on a shared kernel buffer cache. Instead, LibFS caches file system state in process-local memory; file operations are function calls that implement the POSIX API using this cache. The LibFS cache is split into an NVM and a DRAM portion. NVM stores persistent updates, while DRAM is used to cache read-only state (see reads). To efficiently support small writes, the NVM portion of the cache holds an update log (§3.3), rather than data blocks.
Remote replication: To outlive node failures, local writes are replicated (on fsync when pessimistic, on dsync when optimistic) by LibFS to reserved NVM of the next replica along the appropriate replication chain via RDMA, followed by an RPC to continue the chain. The final replica in the chain sends an acknowledgment RPC back along the chain to indicate that the chain completed successfully, and the fsync/dsync can return.
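The replication step can be pictured with a deliberately simplified sketch in which each replica's reserved NVM is a plain Python list. The function `chain_replicate` is a hypothetical name; the real system uses RDMA writes and RPCs, which the comments only stand in for.

```python
def chain_replicate(pending, chain):
    """Append `pending` log entries to every replica log in `chain`, head to tail.

    Returns True when the tail has applied the entries, standing in for the
    acknowledgment that travels back along the chain before fsync/dsync returns.
    """
    for replica_log in chain:
        # Stands in for the RDMA write into the replica's reserved NVM,
        # followed by the RPC that continues the chain.
        replica_log.extend(pending)
    return len(chain) > 0


replica1, replica2 = [], []
acked = chain_replicate(["write A", "write B"], [replica1, replica2])
print(acked, replica1 == replica2)
```

Because every replica receives the same entries in the same order, all replicas hold identical logs once the acknowledgment arrives.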
To read data, LibFS first checks the local cache for the requested data block. If not found, it checks a shared, block-based, second level NVM cache provided by a SharedFS cache replica of the corresponding subtree. If not found there, LibFS checks reserve replicas (if configured) and, in parallel, checks cold storage (not shown in Figure 1(b)).
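The lookup order can be sketched as a walk over cache tiers, here modeled as dictionaries. The function `read_block` is a hypothetical name, and the sketch is sequential: where the text checks reserve replicas and cold storage in parallel, it simply prefers the reserve replica.

```python
def read_block(block, libfs, sharedfs, reserve=None, cold=None):
    """Return (tier_name, data) for the first tier that holds `block`."""
    # First the private LibFS cache, then the shared second-level cache.
    for name, tier in (("libfs", libfs), ("sharedfs", sharedfs)):
        if block in tier:
            return name, tier[block]
    # Assumption: sequential stand-in for the parallel reserve/cold lookup.
    for name, tier in (("reserve", reserve), ("cold", cold)):
        if tier is not None and block in tier:
            return name, tier[block]
    raise KeyError(block)


print(read_block(7, libfs={}, sharedfs={7: b"hot"}, cold={7: b"old"}))
```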
Reads from remote (including NUMA) nodes and cold storage are cached in DRAM. LibFS prefetches up to 256KB of data sequentially when caching in DRAM. The read cache uses 4KB blocks. For small (< 4 KB) remote NVM reads, LibFS first fetches the requested data and then prefetches the containing 4KB block (and continues prefetching). This minimizes small read latency while improving the performance of workloads with spatial locality.
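For a small remote read, the resulting sequence of fetches might look as follows. The function `small_read_plan` is hypothetical, and starting the 256KB prefetch window at the block containing the read offset is our assumption about how the two limits combine.

```python
BLOCK = 4 * 1024             # read-cache block size (4KB)
PREFETCH_MAX = 256 * 1024    # sequential prefetch limit (256KB)


def small_read_plan(offset, length):
    """Return (demand, prefetch) byte ranges [start, end) for a small remote read.

    `demand` is fetched first to minimize latency; `prefetch` lists the 4KB
    blocks fetched afterwards, beginning with the block containing `offset`.
    """
    assert 0 < length < BLOCK, "sketch covers only reads smaller than one block"
    demand = (offset, offset + length)
    first_block = offset // BLOCK * BLOCK   # align down to a block boundary
    prefetch = [(start, start + BLOCK)
                for start in range(first_block, first_block + PREFETCH_MAX, BLOCK)]
    return demand, prefetch


demand, prefetch = small_read_plan(offset=5000, length=100)
print(demand, prefetch[0], len(prefetch))
```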
When a LibFS private cache fills, it replicates any pending unreplicated writes and initiates batched eviction on each replica along the replication chain via RPC. Eviction is done in LRU fashion, first to the SharedFS second level caches, then to cold storage. Each replica along the chain evicts in parallel and acknowledges when eviction is finished. This ensures that all replicas cache identical state for fast failover.
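The LRU hand-off from the private cache to the second-level cache can be sketched with an ordered dictionary. The class `TwoLevelCache` is hypothetical, capacity is counted in entries rather than bytes, and replication and the further eviction to cold storage are omitted.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy model of a private LibFS cache backed by a shared second-level cache."""

    def __init__(self, private_capacity):
        self.capacity = private_capacity
        self.private = OrderedDict()   # least recently used entry first
        self.shared = {}               # stands in for the SharedFS cache

    def put(self, key, value):
        self.private[key] = value
        self.private.move_to_end(key)  # mark as most recently used
        if len(self.private) > self.capacity:
            self.evict(len(self.private) - self.capacity)

    def evict(self, count):
        # Batched eviction in LRU order to the second-level cache.
        for _ in range(min(count, len(self.private))):
            key, value = self.private.popitem(last=False)
            self.shared[key] = value


# Two replicas that apply the same operations end up with identical state.
replicas = [TwoLevelCache(2), TwoLevelCache(2)]
for cache in replicas:
    for key in ("a", "b", "c"):
        cache.put(key, key.upper())
print(list(replicas[0].private), replicas[0].shared)
```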
Processor stores to NVM on another NUMA node have overheads due to cross-socket hardware cache coherence, limiting throughput (Xu et al., 2019). Since CC-NVM provides cache coherence, Assise can bypass some hardware cache coherence mechanisms by using DMA (Le et al., 2017) when evicting to NVM on another NUMA node. This yields up to 30% improvement in cross-socket file system write throughput (§5.2).
Permissions and kernel bypass.
Assise assumes a single administrative domain with UNIX file and directory ownership and permissions. SharedFS enforces that LibFS caches only authorized data, by checking permissions and data integrity upon eviction and enforcing permissions on reads. To minimize latency of SharedFS cache reads, Assise allows read-only mapping of authorized parts of the SharedFS cache into the LibFS address space. LibFS caches and SharedFS mappings are invalidated when files or directories are closed and whenever contents are evicted from the cache.
3.3. Crash consistent cache coherence with CC-NVM
Assise uses CC-NVM, a distributed cache coherence layer. CC-NVM provides linearizability when sharing file system state among processes and prefix semantics upon a crash.
Crash consistency with prefix semantics.
To provide prefix crash consistency semantics, CC-NVM tracks write order in persistent memory. To do so, the LibFS cache is subdivided into an update log in NVM and a read-only cache in DRAM. Each POSIX call that updates state is recorded, in order, in the update log and encapsulated in a Strata transaction (Kwon et al., 2017). When chain-replicating, CC-NVM leverages the ordering guarantees of RDMA to write the log in order to remote replicas. This ensures that file system updates are persisted and replicated atomically and that a prefix of the write history can be recovered (§3.4).
When in optimistic mode, Assise might coalesce updates to save network bandwidth. To provide prefix semantics in optimistic mode, CC-NVM wraps each batch of replicated, coalesced file system operations in a Strata transaction (Kwon et al., 2017). This ensures that replicated batches are applied atomically, which is important in the face of crashes during replication.
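To illustrate why transactional batches yield prefix semantics, consider recovery over a log that may end in a partially written batch. The record format below (`begin`, `op`, `commit` tuples) and the function `recover_prefix` are invented for this sketch and are not Strata's on-NVM format.

```python
def recover_prefix(log):
    """Return the operations of the longest prefix of fully committed batches.

    `log` is a list of ("begin", tx_id), ("op", payload), and
    ("commit", tx_id) records (a hypothetical format).
    """
    recovered, pending, open_tx = [], [], None
    for kind, value in log:
        if kind == "begin":
            open_tx, pending = value, []
        elif kind == "op" and open_tx is not None:
            pending.append(value)
        elif kind == "commit" and value == open_tx:
            recovered.extend(pending)      # the whole batch becomes visible at once
            open_tx, pending = None, []
        else:
            break                          # torn or unexpected record: stop here
    return recovered                       # an uncommitted tail batch is dropped


crashed_log = [("begin", 1), ("op", "create /a"), ("op", "write /a"), ("commit", 1),
               ("begin", 2), ("op", "rename /a /b")]   # crash before commit of 2
print(recover_prefix(crashed_log))
```

Operations of the interrupted second batch are discarded as a unit, so the recovered state never reflects a batch only in part.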
Sharing with linearizability.
Protected sharing of the file system with linearizability requires a mechanism that allows CC-NVM to serialize concurrent access to shared state by multiple untrusted LibFS instances and to recover the same serialization after a crash.
Leases (Gray and Cheriton, 1989) provide a simple, fault-tolerant mechanism to delegate access. Leases function similarly to reader-writer locks, but they can be revoked (to allow another process to get a turn) and they expire after a timeout (after which they may be reacquired). In CC-NVM, leases are used to grant shared read or exclusive write access to a set of files and directories—multiple read leases to the same set may be concurrently valid, but write leases are exclusive. Reader/writer semantics efficiently support shared files and directories that are read-mostly and widely used, such as configuration files and executables, but also write-intensive files and directories that are not frequently shared. CC-NVM also supports a special type of directory lease called a namespace lease (Kwon et al., 2017) that includes all files and directories at or below a particular directory. A namespace lease holder controls access to files and directories within that namespace. For example, a LibFS with an exclusive namespace lease on /tmp/bwl-ssh/ can create files or directories within this directory.
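The reader/writer and expiry rules can be captured in a few lines. The class `LeaseTable` is a hypothetical, single-node sketch: it covers shared read leases, exclusive write leases, and expiry, but omits revocation, grace periods, and namespace leases.

```python
class LeaseTable:
    """Toy lease table: shared read leases, exclusive write leases, expiry."""

    def __init__(self):
        self.leases = {}   # path -> list of (holder, mode, expires_at)

    def acquire(self, path, holder, mode, now, duration):
        assert mode in ("read", "write")
        # Expired leases are dropped and may be reacquired by anyone.
        live = [l for l in self.leases.get(path, []) if l[2] > now]
        others = [l for l in live if l[0] != holder]
        writer_present = any(m == "write" for _, m, _ in others)
        if writer_present or (mode == "write" and others):
            self.leases[path] = live
            return False                   # conflicting lease still valid
        # Grant (or renew/upgrade) the holder's own lease.
        self.leases[path] = others + [(holder, mode, now + duration)]
        return True


table = LeaseTable()
print(table.acquire("/etc/app.conf", "libfs-1", "read", now=0, duration=10))
print(table.acquire("/etc/app.conf", "libfs-2", "read", now=1, duration=10))
print(table.acquire("/etc/app.conf", "libfs-3", "write", now=5, duration=10))
```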
Leases must be acquired by LibFS from SharedFS via a system call before LibFS can cache the data covered by the lease. Our prototype does this upon first IO; leases are kept until they are revoked by SharedFS. This occurs when another LibFS wishes access to a leased file or when a LibFS instance crashes or the lease times out. Revocation incorporates a grace period in which the current lease holder can finish its ongoing IO operations before releasing contended leases. SharedFS enforces that the private update log and dirty cache entries of the lease holder are clean and replicated before the lease is transferred. For crash consistency, SharedFS logs each lease transfer in NVM and then replicates it. A LibFS may overlap IO with SharedFS record replication until fsync/dsync, when it needs to wait for SharedFS replication to finish before returning.
Leases are managed hierarchically, with coarse leases for large portions of the namespace stored in the cluster manager and individual SharedFS managing namespaces (and possibly individual files) for their local LibFS. As with SharedFS, the cluster manager also logs and replicates each lease transfer, so that the cluster manager can also failover nearly instantaneously while preserving lease semantics.
LibFS requests leases first from their local SharedFS. If the local SharedFS is not the lease manager, it consults the cluster manager. If a lease manager (another SharedFS) exists for the requested directory or file, SharedFS forwards the request to the manager and it caches the lease manager’s information (leased namespace and expiration time of lease). If there is no current lease manager, the cluster manager assigns the lease to the requesting SharedFS. The cluster manager prevents leases from changing hands too quickly by assigning leases for at least 5 seconds. Once a lease expires, the cluster manager assigns it to the next SharedFS that requests it. In this way, the system naturally migrates leases to the SharedFS that is local to the LibFSes using them.
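The cluster manager's role in this protocol can be sketched as a small directory of lease managers. The class `ClusterLeaseDirectory` and its interface are hypothetical; only the 5-second minimum assignment and the reassign-on-expiry rule come from the description above, and lease renewal is left out.

```python
MIN_LEASE_S = 5.0   # minimum time a lease stays with its manager


class ClusterLeaseDirectory:
    """Toy model of the cluster manager's map from namespaces to lease managers."""

    def __init__(self):
        self.managers = {}   # namespace -> (sharedfs, expires_at)

    def request(self, namespace, sharedfs, now):
        """Return the SharedFS that manages `namespace` after this request."""
        current = self.managers.get(namespace)
        if current is not None and current[1] > now:
            # A live manager exists: the requester forwards to it.
            return current[0]
        # No manager, or the lease expired: assign it to the requester.
        self.managers[namespace] = (sharedfs, now + MIN_LEASE_S)
        return sharedfs


directory = ClusterLeaseDirectory()
print(directory.request("/mail", "sharedfs-1", now=0.0))
print(directory.request("/mail", "sharedfs-2", now=3.0))
print(directory.request("/mail", "sharedfs-2", now=5.0))
```

The third request arrives after the minimum period has elapsed, so the lease migrates to the SharedFS that is now asking for it.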
The hierarchical structure allows CC-NVM to reduce network communication and minimize lease operation latencies, even in the presence of sharing, e.g., two LibFS on the same node require only local communication with their SharedFS in the common case. Storing coarse leases in the cluster manager allows the manager to expire leases for crashed nodes and to reassign their namespace to a replacement SharedFS.
3.4. Fail-over and Recovery
CC-NVM allows each node to persistently cache file-system state in NVM, which it can use for fast recovery. Assise can optimize recovery performance according to crash prevalence.
An application process crashing is the most common failure scenario. In this case, the local SharedFS simply replicates and evicts the dead LibFS update log, recovering all completed writes, even in optimistic mode, and then expires its leases. Log-replay based eviction is idempotent (Kwon et al., 2017), ensuring crash consistency even in the face of a LibFS or SharedFS crash during eviction. The crashed process can be restarted on the local node and immediately re-use all file system state. The LibFS DRAM read-only cache has to be rebuilt, but performance impact after fail-over is minimal (§5.4).
Another common failure mode is a reboot due to an OS crash. In this case, we can use NVM to dramatically accelerate OS reboot by storing a checkpoint of a freshly booted OS. After boot, Assise can initiate recovery for the previously running LibFS instances, by examining the SharedFS log stored in NVM.
Cache replica fail-over.
On a node failure due to a power outage or hardware problem, rebooting and/or repairing the server can take some time. To avoid waiting for the server to recover, as soon as the failure is detected, we fail-over to a node hosting its cache replica. The SharedFS on that node takes over the leases for the failed node and restarts its applications, using the replicated SharedFS log to re-grant leases to the LibFS instances. The new instance will see a prefix of the operations that were completed on the original node; it will see all operations that preceded the most recent completed dsync.
Writes to the file system during the server downtime can invalidate cached data of the failed node. To track writes during downtime, the cluster manager maintains an epoch number, which it increments on node failure and recovery. All SharedFS instances are notified of epoch updates. All SharedFS instances share a per-epoch bitmap in a sparse file indicating what inodes have been written during each epoch. The bitmaps are deleted at the end of an epoch when all nodes have recovered.
When a node crashes, the cluster manager makes sure that all of the node’s leases expire before the node can rejoin. When rejoining, Assise initiates SharedFS recovery. A recovering SharedFS contacts an online SharedFS to collect relevant epoch bitmaps. SharedFS then invalidates every block from every file that has been written since its crash. This simple protocol could be optimized, for instance, by tracking what blocks were written, or checksumming regions of the file to allow a recovering SharedFS to preserve more of its local data. But the table of files written during an epoch is small and quickly updated during file system operation, and our simple policy has been sufficient.
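A sketch of the epoch-based invalidation follows, with Python sets standing in for the per-epoch bitmaps. The class `EpochTracker` and its methods are hypothetical, and deletion of bitmaps once all nodes have recovered is omitted.

```python
class EpochTracker:
    """Toy model of per-epoch write tracking used for SharedFS recovery."""

    def __init__(self):
        self.epoch = 0
        self.written = {0: set()}   # epoch -> inodes written (stands in for a bitmap)

    def bump(self):
        # The cluster manager increments the epoch on node failure and recovery.
        self.epoch += 1
        self.written[self.epoch] = set()

    def record_write(self, inode):
        self.written[self.epoch].add(inode)

    def stale_inodes(self, last_epoch_alive):
        """Inodes a recovering node must invalidate, given its last live epoch."""
        stale = set()
        for e in range(last_epoch_alive + 1, self.epoch + 1):
            stale |= self.written[e]
        return stale


tracker = EpochTracker()
tracker.record_write(10)          # written while the node was still up
tracker.bump()                    # the node fails: a new epoch begins
tracker.record_write(11)
tracker.record_write(12)
print(sorted(tracker.stale_inodes(last_epoch_alive=0)))
```

Only files written after the node's last live epoch are invalidated; data cached for inode 10 remains usable after recovery.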
3.5. Reserve Replicas
To fully exploit the memory hierarchy presented in Table 1, remote NVM can be used as a third-level cache, behind local DRAM and local NVM. To do so, we introduce reserve replicas. Like cache replicas, reserve replicas receive all file system updates via chain-replication, but leverage a different data migration policy. Reserve replicas track the LRU chain for a specified “third-level” portion beyond the LibFS and SharedFS caches. Reserve replicas evict their third-level, rather than second-level, data to their colocated SharedFS cache.
Cache replicas can read from reserve replicas via RPC with lower latency and higher bandwidth than cold storage (NVM-RDMA versus SSD in Table 1). Applications do not run on reserve replicas in the common case. In the rare case of a failure cascade bringing down all cache replicas, processes can fail-over to reserve replicas, albeit with reduced short-term performance (since hot data must be migrated from cold storage back into NVM). After fail-over, reserve replicas become cache replicas.
Assise is built on libpmem (49) for persisting data on NVM and libibverbs to enable RDMA operations in userspace. Assise intercepts POSIX file system calls and invokes the corresponding LibFS implementation of these functions in userspace. The Assise implementation started with the Strata code base and consists of 28,982 (+10,136 added and -2,619 removed versus Strata) lines of C code (LoC), with LibFS and SharedFS using 16,515 (+9,402 -734) and 6,563 (+734 -159) LoC, respectively. The remaining 5,904 LoC contain utility code, such as hash table, linked list, and rb-tree implementations. SharedFS runs in its own user-level process, communicating with LibFSes via UNIX domain sockets (Kwon et al., 2017).
Assise has been tested on a variety of different applications. It can successfully run existing Filebench profiles. It has passed all unit tests for the LevelDB key-value store and passes MinuteSort validation.
4.1. Network IO Paths
For lossless, in-order data transfer among nodes, Assise uses RDMA reliable connections (RCs). RCs have low header overhead, improving throughput for small IO (MacArthur and Russell, 2012; Kalia et al., 2015). RCs also provide access to one-sided verbs, which bypass CPUs on the receiver side, reducing message transfer times (Dragojević et al., 2014; Mitchell et al., 2013) and memory copies (Taleb et al., 2018).
Logs are naturally suited for one-sided RDMA operations and Assise uses RDMA WRITEs for log replication. Replication operations typically require only one WRITE, reducing header and DMA overheads (MacArthur and Russell, 2012). The only exceptions are when the remote log wraps around to the beginning or when the local log is fragmented (due to coalescing), such that it exceeds the hardware limit for the number of gathers in a single WRITE.
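The wrap-around case can be made concrete by computing the contiguous remote segments that a replication touches in a circular log. The function `write_segments` is hypothetical and models only the remote wrap-around exception, not local fragmentation or gather limits.

```python
def write_segments(head, length, log_size):
    """Split a replication of `length` bytes starting at remote offset `head`
    into contiguous (offset, size) segments of a circular log of `log_size` bytes.
    Each segment would correspond to one RDMA WRITE."""
    assert 0 <= head < log_size and 0 < length <= log_size
    first = min(length, log_size - head)        # bytes that fit before the end
    segments = [(head, first)]
    if first < length:
        segments.append((0, length - first))    # remainder wraps to the beginning
    return segments


print(write_segments(head=100, length=50, log_size=1024))
print(write_segments(head=1000, length=50, log_size=1024))
```

The first call needs a single segment, the common case; the second wraps around the end of the log and therefore needs two.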
Persistent RDMA writes.
The RDMA specification does not define the persistence properties of remote NVM access via RDMA. In current practice, the remote CPU is required to flush any RDMA WRITE data from the remote processor cache to NVM. Assise flushes all writes via the CLWB and SFENCE instructions on each replica, before acknowledging successful replication. In the future, it is likely that enhancements to PCIe will allow NICs to bypass the processor cache and write directly into NVM to provide persistence (Kim et al., 2018).
Remote NVM reads.
Assise reads data from remote nodes by issuing RPC requests. To keep the request sizes small, Assise identifies files using their inode numbers instead of their path. As an optimization, DRAM read cache locations are pre-registered with the NIC. This allows the remote node to reply to a read RPC by RDMA writing the data directly to the requester’s cache, obviating the need for an additional data copy.
We evaluate Assise’s common-case as well as its recovery performance, and break down the performance benefits attained by each system component. To put our results in perspective, we compare Assise to three state-of-the-art distributed file systems that natively support (or are retrofitted to support) NVM and RDMA. Our experiments rely on several microbenchmarks and Filebench (Tarasov et al., 2016) profiles, in addition to several real applications, such as LevelDB, Postfix, and MinuteSort.
Our evaluation answers the following questions:
IO latency and throughput breakdown (§5.1). How close to NVM performance do the file systems operate under various IO patterns? What are the overheads?
CC-NVM scalability (§5.2). How do multiple processes sharing the file system perform? By how much can CC-NVM’s hierarchical coherence improve multi-node and multi-socket performance?
Cloud application performance (§5.3, §5.5). We evaluate the performance of a number of cloud applications with various consistency requirements. By how much can a reserve replica improve read latency? By how much can optimistic crash consistency improve throughput? Can a sharded application benefit from CC-NVM?
Availability (§5.4). How quickly can applications recover from various failure scenarios?
Our experimental testbed consists of 5 dual-socket Intel Cascade Lake-SP servers running at 2.2GHz, with a total of 48 cores (96 hyperthreads), 384 GB DDR4-2666 DRAM, 6 TB Intel Optane DC PMM, 375 GB Intel Optane DC P4800X series NVMe-SSD, and a 40 GbE ConnectX-3 Mellanox Infiniband NIC. To leverage all 6 memory channels per processor, there are 6 DIMMs of DRAM and NVM per socket. All nodes use Fedora 27 with Linux kernel version 4.18.19 and are connected via an Infiniband switch.
We first measure the achievable IO latency and throughput for each memory layer in our testbed server. We do this by using sequential IO and as many cores within a single NUMA domain as necessary. We measure DRAM and NVM latency and throughput using Intel’s memory latency checker (5). NVM-RDMA performance is measured using RDMA READ and WRITE_WITH_IMM (to flush remote processor caches) operations to remote NVM. SSD performance is measured via /dev/nvme device files. The IO sizes that yielded maximum performance are 64 B for DRAM, 256 B for NVM(-RDMA), and 4 KB for SSD. Table 1 shows these results. The measured IO performance for DRAM, NVM, and SSD matches the hardware specifications of the corresponding devices and is confirmed by others (Izraelevitz et al., 2019). NVM-RDMA throughput matches the line rate of the NIC. NVM-RDMA write latency is larger than read latency because writes have to invoke the remote CPU (to flush caches). We now investigate how close to these limits each file system can operate.
No open-source distributed file system provides all of Assise’s features. Hence, a direct performance comparison is difficult. We perform comparisons against the Linux kernel-provided NFS version 4 (Haynes and Noveck, 2015) and Ceph version 14.2.1 (Weil et al., 2006) with BlueStore (Aghayev et al., 2019), both retrofitted for RDMA, as well as Octopus (Lu et al., 2017). We cannot compare with Orion (Yang et al., 2019) as it is not publicly available. Only Ceph provides availability via replicated object storage daemons (OSDs), delegating meta-data management to a (potentially sharded) meta-data server (MDS). Octopus is designed for RDMA and NVM and, unlike Ceph and NFS, supports small IO efficiently by bypassing the kernel buffer cache, but uses distributed NVM for increased capacity, rather than availability. Hence, Ceph is the closest comparison.
The state-of-the-art does not support persistent caches and consistency semantics are often weaker than Assise’s. Assise provides data crash consistency, while both Ceph/Bluestore and Octopus provide only meta-data crash consistency (Chidambaram et al., 2012). For NFS, crash consistency is determined by the underlying file system. We use EXT4-DAX (55), which also provides meta-data crash consistency for performance. When sharing data, NFS provides close-to-open consistency (Haynes and Noveck, 2015), while Octopus and Ceph provide stronger consistency (Documentation, ), and Assise provides linearizability.
The LibFS cache is configured with a 1GB NVM update log and a 2GB DRAM cache, and we run Assise in pessimistic mode. When we specify a number of testbed machines, these are used as cache replicas in Assise and as OSD and MDS replicas in Ceph, while NFS uses only one machine as the server and the rest as clients. To keep cluster size identical, we place applications on replicas for Assise and Ceph, and on clients for NFS. Assise’s and Ceph’s cluster managers run on 2 separate testbed machines. We configure NFS to use RDMA for the server connection. We configure Ceph to provide its client-side file system via the Ceph kernel driver and use IP over Infiniband, which was the best performing configuration (we also tried FUSE and Accelio (10)). Ceph and NFS use the kernel buffer cache in DRAM to cache data and NVM for storage. Octopus uses FUSE to provide its file system interface to applications in direct IO mode to bypass the kernel buffer cache (47).
Average and tail write latency.
We compare synchronous write latencies on an otherwise idle cluster with 2 machines (except Assise-3r which uses 3 machines). Each experiment appends 1 GB of data into a single file, and we report per-operation latency. The file size is smaller than each file system’s cache size, so no evictions occur. Figure 2 shows the average and 99th percentile sequential write latencies over various common IO sizes (random write latencies are similar for all file systems). Synchronous writes involve write calls that operate locally (except for Octopus where writes may be remote), and fsync calls that involve remote nodes for replication or persistence.
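The measurement described above can be approximated with a short microbenchmark. The following is a minimal sketch, not the harness used in the paper: the function names `measure_sync_append_latency` and `summarize` are hypothetical, and the percentile is computed with a simple nearest-rank rule chosen for illustration.

```python
import os
import statistics
import time


def measure_sync_append_latency(path, io_size, total_bytes):
    """Append total_bytes to path in io_size chunks, calling fsync after each write.

    Returns the per-operation latencies (write + fsync) in seconds.
    """
    buf = os.urandom(io_size)
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(total_bytes // io_size):
            start = time.perf_counter()
            os.write(fd, buf)  # write call: local on most file systems
            os.fsync(fd)       # fsync call: persistence (and replication, if any)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Return (average, 99th percentile) using a nearest-rank percentile."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return statistics.mean(ordered), ordered[p99_index]
```

Calling `summarize(measure_sync_append_latency(path, 128, 1 << 30))` would mirror the 128B, 1 GB append case; a much smaller total is advisable for a quick trial run.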
For Assise, we break down the write call latency into the local data NVM write to the LibFS update log (dashed line) and local LibFS data structure updates (solid line) to track the write. We break down fsync call latency into NVM-RDMA write (dotted line) and RPC to enforce remote persistence (bar). For NFS and Ceph, we break writes down into write (solid line) and fsync call latencies (bar). Octopus’ fsync is a no-op.
Assise’s local write latencies match those of Strata (Kwon et al., 2017). Assise’s average write latency for 128B two-node replicated writes is only 8% higher than the aggregate latencies of required local and NVM-RDMA writes (cf. Table 1). Three replicas (Assise-3r) increase Assise’s overhead to 2.2× due to chain-replication with sequential RPCs. The 99th percentile replicated write latency is up to 2.1× higher than the average for 2 replicas. This is due to Optane PMM write tail-latencies (Izraelevitz et al., 2019). The tail difference diminishes to 19% for 3 replicas due to the higher average.
Ceph and NFS use the kernel buffer cache and interact at 4KB block granularity with servers. On a write miss, the client performs block read-modify-write to the server. For small writes, the incurred network IO amplification during fsync is the main reason for up to an order of magnitude higher aggregate write latency than Assise. In this case, NFS’ write latency is up to 3.2× higher than Assise due to kernel crossings and copy overheads. For large writes, network IO amplification diminishes but the memory copy required to maintain a kernel DRAM buffer cache becomes a major overhead. NFS write latency alone is now higher than Assise’s replicated write latency (and up to 2.7× higher than Assise’s write latency), while NFS aggregate write latency is up to 7.2× higher than Assise. Ceph has higher fsync overheads due to replication.
Octopus eliminates the DRAM buffer cache and block orientation, which improves its performance drastically versus NFS and Ceph. However, Octopus still requires kernel crossings for each IO operation and treats all NVM as disaggregated. Octopus exhibits up to 2.1× higher latency than Assise for small (< 64 KB) writes. This overhead stems from FUSE kernel crossings (around 10 µs (Vangoor et al., 2017)) and Octopus’ use of the NIC to write to both local and remote NVM, while Assise writes to local and remote NVM at user-level. Large writes (≥ 64 KB) amortize Octopus’ write overheads and Assise now has up to 1.7× higher write latency. This is because Assise replicates for availability, while Octopus does not provide availability.
Average and tail read latency.
Read latency is affected by whether a read hits or misses in the cache. While Assise cache replicas always store the entire hot state in local NVM, Ceph and NFS’s DRAM-based kernel buffer cache has less capacity and misses can occur for an application’s hot dataset. In this case, Ceph and NFS have to read from disaggregated NVM. We show both cases by reading a 1GB file, once with a warm cache and once with a cold cache. The results are shown in Figure 2.
We first compare Assise’s cache-hit latencies, where data is served from the LibFS cache. We see that Assise is up to 38% faster than NFS and Ceph for small (128B) IO. Kernel crossing overheads amortize for large IO and Assise latency is up to 49% higher than NFS at 1MB IO, as LibFS reads from NVM, while the kernel buffer cache uses DRAM.
If Assise misses in the LibFS cache, cache replica data is served from the local SharedFS, while NFS and Ceph clients have to fetch data from disaggregated servers when it is not present in the DRAM cache. When NFS and Ceph miss in the cache, they incur orders of magnitude higher average and tail latencies, especially for small reads. Again, Ceph performs worse than NFS due to more user/kernel crossings when reading. The elimination of a cache hurts Octopus read performance, as it has to fetch metadata and data (serially) from remote NVM. Octopus’ read latency is up to two orders of magnitude higher than the other file systems hitting in the cache. Even when compared to cache miss performance, Octopus does not handle small (128B) IO well, due to FUSE overhead. This overhead amortizes for larger IO (≥ 4KB), where Octopus incurs 58% of the read latency of an NFS cache miss at 64KB IO. It is possible to configure FUSE to use a DRAM buffer cache for Octopus. In this case, Octopus read hit latency is 1.8× that of an Assise and NFS cache hit, with the remaining overhead due to FUSE. However, this inflates write latencies by up to an order of magnitude due to additional memory copies introduced by the buffer cache.
Figure 3 shows average throughput at 4KB IO size with 24 threads (all cores of one socket) of one process via sequential and random IO over 10 runs. To evaluate a standard replication factor of 3, we use 3 machines for this experiment (2 for NFS, which does not replicate). The process reads/writes 12 GB of data, sharded over 24 files, or one 512 MB file per thread. write calls are not followed by fsync in this experiment, giving file systems opportunity to optimize throughput (e.g., via batching). To measure throughput through the entire cache hierarchy we limit the kernel buffer-cache size to 3GB, the same size as the LibFS cache. The total amount of accessed data is larger than the cache size, causing cache eviction. Octopus crashes during this experiment and is not shown.
For sequential writes, Assise and NFS achieve 74% and 58% of the NVM-RDMA bandwidth (cf. Table 1), respectively, due to protocol overhead for NFS and log header overhead for Assise. Chain-replication in Assise affects throughput marginally. Ceph replicates in parallel to 2 remote replicas, consuming 2× the network bandwidth. This reduces its throughput to 61% of Assise and 77% of NFS. Assise achieves similar performance for sequential and random writes, as Assise’s writes are log-structured. NFS and Ceph perform poorly for random writes due to sequential cache prefetching incurring additional reads from remote servers, reducing their goodput and causing Assise to achieve 2.9× the throughput of Ceph. NFS’s performance is especially poor at only 33% of Ceph’s throughput.
For sequential and random reads, Assise achieves 90% and 68%, respectively, of the local NVM bandwidth, reading from the second-level SharedFS cache that is the data store in Assise’s case. The 10% difference for sequential reads to full local NVM bandwidth is due to meta-data lookups, while random reads additionally suffer PMM buffer misses (Izraelevitz et al., 2019). NFS and Ceph are limited by NVM-RDMA bandwidth for sequential reads (3.8 GB/s) and again provide worse random read performance due to cache prefetching.
5.2. CC-NVM Scalability
To evaluate CC-NVM, we run a multi-processing benchmark conducting atomic file creation, a common file system sharing pattern. Processes in parallel create and write 4KB files with random data in private directories, then rename the files to a shared directory. We will see this pattern used in Postfix (§5.5). This benchmark uses 2 machines. Processes are balanced over machines and NUMA nodes (i.e., 16 processes imply 4 processes per NUMA node per machine). For scalability, file creation is sharded over available NUMA nodes, with one shared directory per NUMA node. We configure Ceph with 2 sharded MDSes (1 per node) to eliminate contention on a single MDS. We repeat the benchmark 5 times and report the average delivery throughput. Each run atomically creates 480K files.
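The atomic file creation pattern exercised by this benchmark can be sketched with standard POSIX calls. This is an illustration under stated assumptions rather than the benchmark's actual code: the helper name `atomic_create` is hypothetical, and the `fsync` before the rename is an assumption (the text does not say whether the benchmark syncs each file).

```python
import os


def atomic_create(private_dir, shared_dir, name, data):
    """Write a file in a process-private directory, then publish it via rename."""
    tmp_path = os.path.join(private_dir, name)
    final_path = os.path.join(shared_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # assumption: make contents durable before publishing
    # rename is atomic within one POSIX file system: readers of shared_dir
    # observe either no file or the complete file, never a partial one.
    os.rename(tmp_path, final_path)
    return final_path
```

Only the final rename touches the shared directory, which is why the cost of synchronizing access to that directory dominates as the number of processes grows.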
Figure 4 presents throughput scalability over an increasing number of parallel processes. For Assise, CC-NVM delegates directory leases to shard-local SharedFS, which act as arbiters for their shard’s LibFS. This allows Assise to scale to 64 processes before being bottlenecked by intra-shard synchronization. Ceph uses disaggregated MDSes that cannot mirror the sharding pattern, resulting in higher overhead for consistent access to the shared directories, as each client has to communicate with 2 MDSes on average for atomic rename (we tried a single MDS with similar performance due to increased MDS contention). We observe MDS CPU utilization going from 10% with 2 processes to 70% with 96 processes. Ceph scales to 16 processes.
We emulate Orion by restricting CC-NVM to use a single SharedFS lease manager. While Orion provides several light-weight mechanisms to communicate with its MDS, these methods cannot be applied for metadata operations that affect multiple inodes (e.g. renames). In this case, data is stored on colocated NVM, but synchronization requires communication with a disaggregated MDS (Assise-disagg). Assise outperforms this variant by 76% at scale.
To quantify the benefit of cross-socket DMA, we repeat the benchmark with shared directories across NUMA nodes. For 4KB files, use of DMA improves performance by 10% at scale. For 1MB files, performance improves by 30%. We use DMA only for data, hence larger files benefit more. Due to space limits, we do not show these results in detail.
5.3. Application Benchmarks
We evaluate the performance of a number of common cloud applications, such as the LevelDB key-value store (Dean and Ghemawat, 2011), as well as the Fileserver and Varmail profiles of the Filebench (Tarasov et al., 2016) benchmarking suite, emulating file and mail servers, and MinuteSort. We use 3 machines for LevelDB and Filebench.
We run a number of LevelDB latency benchmarks, including sequential and random IO, skewed random reads with 1% of highly accessed objects, and sequential synchronous writes (fsync after each write). All benchmarks use a key size of 16 bytes and a value size of 1 KB on a working set of 1M objects. The workload is closed-loop, single-threaded and accesses the entire dataset. Figure 5 presents average measured operation latency, as reported by the benchmark.
Assise, Ceph, and NFS perform similarly for reads, where LevelDB overhead is minimal and caching (for Ceph and NFS) allows them to operate close to hardware speeds. Ceph performs best due to serving cache misses from the local replica, while NFS has to serve misses from a disaggregated server. For non-synchronous writes, NFS is up to 26% faster than Assise, as these go to its client (volatile) buffer cache in large batches (LevelDB buffers, too), while Assise is 69% faster than NFS for synchronous writes that cannot be buffered. Random IO and synchronous writes also incur increasing LevelDB indexing overhead for all results. Ceph performs worse than NFS for writes because it replicates (as does Assise) and Assise performs 22× better. Octopus serves some IO remotely and thus performs worst for reads and better only than Ceph for writes.
Reserve replica read latency.
Reserve replicas reduce read latency for cold data by allowing these reads to be served from remote NVM, rather than cold storage. For this benchmark, we configure Assise to limit the aggregate (LibFS and SharedFS) cache to 2GB and to use the local SSD for cold storage. We then run the LevelDB random read experiment with a 3GB dataset. We repeat the experiment 1) with 3 cache replicas and 2) with 2 cache and 1 reserve replica. Figure 6 shows a CDF of the result. The benchmark accesses data uniformly at random, causing 33% of the reads to be cold. Consequently, at the 50th percentile, read latencies are similar for both configurations (served from cache). At the 66th percentile, reads in the first setup are served from SSD and have 2.2× higher latency than reserve replica reads in the second setup. At the 90th percentile, the latency gap extends to 6×.
Two benchmarks operate on a working set of 10,000 files of 16 KB and 128 KB average size for Varmail and Fileserver, respectively. Files grow via 16 KB appends in both benchmarks (emulating mail delivery in Varmail). Varmail reads entire files (emulating mailbox reads) and Fileserver copies files, both at 1 MB IO granularity. Varmail and Fileserver have write to read ratios of 1:1 and 2:1, respectively. Varmail leverages a write-ahead log with strict persistence semantics (fsync after log and mailbox writes), while Fileserver consistency is relaxed (no fsync). Figure 7 shows average measured throughput of both benchmarks. Assise outperforms Ceph by 3× for Fileserver and 56× for Varmail, respectively. Ceph performs worse than NFS for Varmail due to stricter persistence requiring it to replicate frequently and due to MDS contention, as Varmail is meta-data intensive, while NFS, Ceph, and Assise can buffer writes in the Fileserver benchmark. NFS performs worse than Ceph as Fileserver appends to random files, causing many random writes (§5.1).
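The fraction of cold reads follows from the cache and dataset sizes given above. The short derivation below assumes that every object is equally likely to be read and that the 2GB cache is fully occupied by dataset objects; the second point is a simplifying assumption and is not stated in the text.

```latex
\[
  P(\text{cold}) = \frac{3\,\text{GB} - 2\,\text{GB}}{3\,\text{GB}} = \frac{1}{3} \approx 33\%,
  \qquad
  P(\text{cached}) = \frac{2\,\text{GB}}{3\,\text{GB}} = \frac{2}{3} \approx 67\%.
\]
```

Under these assumptions, roughly the fastest two thirds of reads are cache hits, which is consistent with both configurations matching at the 50th percentile and diverging from about the 66th percentile onward.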
Optimistic crash consistency.
We repeat this benchmark for Assise in optimistic mode (Assise-Opt) and change Varmail to use dsync after each mailbox write (but not after the log write). Prefix semantics allow Assise to buffer and eliminate (coalesce (Kwon et al., 2017)) the temporary log write without losing consistency. Assise-Opt achieves 2.1× higher throughput than Assise. Fileserver has few redundant writes and Assise-Opt is only 7% faster.
| System | Partition [s] | Sort [s] | Flush [s] | Total [s] | GB/s |
We implement and evaluate Tencent Sort (Jiang et al., 2016), the current winner of the MinuteSort external sorting competition (53). Tencent Sort sorts a partitioned input dataset, stored on a number of cluster nodes, to a partitioned output dataset on the same nodes. It leverages a distributed sort consisting of a range partition, network shuffle, and a mergesort (cf. MapReduce (Dean and Ghemawat, 2004)). The first phase presorts unsorted input files into ranges, stored in partitioned temporary files on destination machines. Ranges serve as input to the mergesort phase, which sorts them and writes the output partitions. The number of partitions determines the amount of sort parallelism. Each phase in our implementation is realized with multiple processes, one per partition. A distributed file system is used to store input, output, and temporary files, implicitly taking care of the network shuffle.
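The two phases can be illustrated in memory with a small sketch. It is not Tencent Sort's implementation: the function names are hypothetical, Python lists stand in for the input, temporary, and output files, and sorting each temporary run during the partition phase is an assumption made so that the second phase reduces to a multi-way merge.

```python
import bisect
import heapq


def partition_phase(input_partitions, boundaries):
    """Split every input partition into key ranges.

    boundaries is a sorted list of split keys; range r holds keys k with
    boundaries[r-1] <= k < boundaries[r]. Returns runs[r], the list of sorted
    runs (stand-ins for temporary files) destined for range r.
    """
    runs = [[] for _ in range(len(boundaries) + 1)]
    for part in input_partitions:
        buckets = [[] for _ in range(len(boundaries) + 1)]
        for record in part:
            buckets[bisect.bisect_right(boundaries, record)].append(record)
        for r, bucket in enumerate(buckets):
            if bucket:
                runs[r].append(sorted(bucket))  # one run per (input, range) pair
    return runs


def merge_phase(runs):
    """Merge the runs of each range into one sorted output partition."""
    return [list(heapq.merge(*range_runs)) for range_runs in runs]
```

Because ranges do not overlap, concatenating the output partitions in range order yields the fully sorted dataset, and each range can be merged by an independent process.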
We benchmark the Indy category of MinuteSort. Indy requires sorting a synthetic dataset of 100 B records with 10 B keys, distributed uniformly at random. We partition a 320GB input dataset over 4 machines, stored on colocated NVM, with 40 input partitions per machine. This results in a parallelism of 40 cores per machine, a little less than one NUMA node per machine (160 processes in total sharing the file system). Higher amounts of parallelism did not improve performance, as we are bottlenecked by the network bandwidth. MinuteSort does not require replication, so we turn it off. It calls fsync only once for each output partition, after it was written. We compare a version running a single Assise file system with one leveraging per-machine NFS mounts. For Assise, we configure the temporary and output directories to be colocated with the mergesort processes. We do the same for NFS, by exporting corresponding directories from each mergesort node. We conduct 3 runs and report the average. We use the official competition tools (53) to generate and verify the input and output datasets. Table 2 shows the result. Assise sorts the dataset in roughly 1 minute, 2.2× faster than NFS. Scaling this result to the cluster size of the original Tencent Sort (512 machines), we can estimate that Assise sorts 1.5× faster than the current world record holder.
Ceph and Assise provide availability. We evaluate how quickly these file systems return an application back to full performance after fail-over and recovery. To do so, we run LevelDB on the same dataset (§5.3) with a 1:1 read-write ratio and measure operation latency before, during, and after fail-over and recovery. These experiments use 2 machines (primary and backup). LevelDB initially runs on the primary. We inject a primary failure by killing the primary’s file system daemon (SharedFS for Assise and OSD for Ceph) and LevelDB. Failures are detected using a 1 s heartbeat response timeout via each file system’s cluster manager. Once a primary failure is detected, LevelDB automatically restarts on the backup. We now wait 30 s before restarting file system daemons on the primary, emulating machine reboot time from NVM without a checkpoint. Once the primary is back online, LevelDB stops on the backup and is restarted on the primary. We repeat this benchmark 5 times and report average results.
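Timeout-based failure detection of the kind described above can be sketched as follows. The class `HeartbeatMonitor` and its methods are hypothetical and do not reflect the cluster managers used by Assise or Ceph; the injectable clock exists only to make the sketch easy to exercise.

```python
import time


class HeartbeatMonitor:
    """Declares a node failed if no heartbeat arrived within timeout_s seconds."""

    def __init__(self, timeout_s=1.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def record_heartbeat(self, node):
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the sorted names of nodes whose last heartbeat is too old."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

With a 1 s timeout, a crashed primary is reported at most slightly more than 1 s after its last heartbeat, which bounds the detection component of the fail-over times reported in this section.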
Fail-over to hot backup.
Figure 8 shows a time series of measured LevelDB operation latencies during one run of this experiment. Pre-failure, we see bursts of low-latency operations in between stretches of higher latency. This is LevelDB’s steady-state. Bursts show LevelDB writes to its own DRAM log. These are periodically merged with files when the DRAM log is full, causing higher latency for writes waiting on the log to become available. During primary failure, no operations are executed. It takes 1 s to detect the failure and restart LevelDB on the backup (light shaded box). Due to unclean shutdown, LevelDB first checks its dataset for integrity before executing further operations (dark shaded box). For cache replica failover, Assise only needs to digest the per-process log (up to 1GB) on the backup replica, making fail-over near-instantaneous. LevelDB returns to full performance in both latency and throughput 230 ms after failure detection. On Ceph, it takes 3.7 s after failure detection until further operations are executed. However, LevelDB stalls soon thereafter upon compaction (further dark shaded box), which involves access to further files, resulting in an additional 15.6 s delay, before reaching steady-state. Ceph’s long aggregate fail-over time of 23.7 s is due to Ceph losing its DRAM cache, which it rebuilds during LevelDB restart. Assise reaches full performance after failure detection 103× faster than Ceph. LevelDB performs better on the backup, as neither file system has to replicate.
We restart the primary after 30 s. During this time, many file system operations have occurred on the backup that need to be replayed on the primary. Both Assise and Ceph allow applications to operate during recovery, but performance is affected. As soon as the primary is back online, we cleanly close the database on the backup and restart on the primary. Assise detects outdated files via epochs and reads their contents from the remote cache replica upon access. Once read, the local copy is updated, causing future reads to be local. LevelDB returns to full performance 938 ms after restarting it on the recovering primary. Ceph also rebuilds the local OSD, but eagerly and in the background. Ceph takes 13.2 s before LevelDB serves its first operation due to contention with OSD recovery and suffers another delay of 24.9 s on first compaction, reaching full performance 43.4 s after recovery start. Assise recovers to full performance 46× faster than Ceph.
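The lazy, epoch-based recovery described above can be illustrated with a simplified model. This sketch makes several assumptions that are not taken from the text: each file carries a single epoch number, the recovering node knows the newest epoch of every file, and the remote cache replica is modeled as a dictionary. The class name `RecoveringNode` is hypothetical.

```python
class RecoveringNode:
    """Serves reads locally unless the local copy is older than the newest epoch."""

    def __init__(self, local_files, latest_epochs, remote):
        self.files = dict(local_files)      # name -> (epoch, data) held locally
        self.latest_epochs = latest_epochs  # name -> newest epoch in the cluster
        self.remote = remote                # name -> (epoch, data) on the replica
        self.remote_reads = 0

    def read(self, name):
        epoch, data = self.files.get(name, (-1, None))
        if epoch < self.latest_epochs.get(name, epoch):
            # Outdated (or missing) local copy: fetch once from the remote
            # replica and update the local copy so later reads stay local.
            epoch, data = self.remote[name]
            self.files[name] = (epoch, data)
            self.remote_reads += 1
        return data
```

Recovery work in this model is paid only for files that are actually accessed, in contrast to an eager background rebuild that competes with the application for resources.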
Fail-over to reserve backup.
We measure worst-case LevelDB fail-over time to an Assise reserve replica by cleaning its cache, causing all data to be in cold storage. We repeat the experiment. LevelDB serves its first request on the backup 303 ms after failure detection, but with higher latency due to SSD reads. LevelDB returns to full performance after another 2.5 s. At this point, the entire dataset has migrated back to cache.
For this benchmark, we simply kill LevelDB. In this case, the failure is immediately detected by the local OS and LevelDB is restarted. Ceph can reuse the shared kernel buffer cache in DRAM, resulting in LevelDB restoring its database after 1.63 s and returning to full performance after an additional 2.15 s, for an aggregate 3.78 s fail-over duration. With Assise, the DB is restored in 0.71 s, including recovery of the log of the failed process and acquisition of the required leases. Full-performance operations occur after an additional 0.16 s, for an aggregate 0.87 s fail-over time. Assise recovers this case 4.34× faster than Ceph, showing that process-local caches are not a hindrance to recovery.
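The aggregate durations and the resulting speedup can be checked from the component times reported in this paragraph:

```latex
\[
  T_{\text{Ceph}} = 1.63\,\text{s} + 2.15\,\text{s} = 3.78\,\text{s}, \qquad
  T_{\text{Assise}} = 0.71\,\text{s} + 0.16\,\text{s} = 0.87\,\text{s}, \qquad
  \frac{T_{\text{Ceph}}}{T_{\text{Assise}}} = \frac{3.78}{0.87} \approx 4.34.
\]
```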
5.5. Assise Scaling with a Postfix Mail Server
We use a Postfix mail server deployment to measure the performance of parallel mail delivery. A load balancer forwards incoming email to Postfix mail queue daemons running on a cluster of machines. On each machine, a pool of mail delivery processes pulls email from the machine-local mail queue and delivers it to user Maildir directories on a cluster-shared distributed file system. To ensure atomic mail delivery, a Postfix delivery process writes each incoming email to a new file in a process-private directory and then renames this file to the recipient’s Maildir (cf. §5.2).
Our benchmark delivers 80K emails of the Enron dataset (Klimt and Yang, 2004), with each email reaching an average of 4.5 recipients. This results in a total of 360K mail deliveries. Each email has an average size of 200 KB (inclusive of attachments) and the dataset occupies 70 GB. We run Postfix on 3 machines, the load balancer on a 4th machine, and as many mail clients as necessary to maximize throughput. We repeat each experiment 3 times and report average mail delivery throughput and standard deviation (error bars) in Figure 9 over an increasing number of delivery processes, balanced over machines (3 processes = 1 process per machine). We compare various Assise configurations and Ceph with 2 MDSes (but 1 MDS performed similarly).
In the first configuration (Assise-rr) the load balancer uses a round-robin policy to send emails to mail queues. Due to a lack of locality, mails delivered to the same Maildir often require synchronization across machines, causing CC-NVM to frequently delegate leases remotely and resulting in increased mail delivery latencies. Despite this, Assise-rr is able to outperform Ceph by up to 5.6× at 15 processes. We verified that Ceph indeed stops scaling by running the benchmark with 300 delivery processes. For this 20× increase in resources, Ceph’s throughput improves by 1.8×.
We use the Leiden algorithm (Traag et al., 2019) to find communicating user cliques within the dataset and partition Maildirs such that cliques reside on the same server. This is akin to sharding by suborganization, a classic mail sharding pattern (Birrell et al., 1981). The load balancer is configured to prefer the recipient’s shard. For mail messages with multiple recipients, it picks the shard with the most receivers. In case of mail queue overload, the load balancer sends mail to a random unloaded shard. Sharding users in this manner provides up to 20% better performance (Assise-sharded) due to the fact that repeated deliveries to users of the same clique are likely to occur on the same server, allowing CC-NVM to synchronize delivery locally. At 15 processes, we are network-bound due to replication. Sharding did not improve Ceph’s performance.
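The load balancer policy described above can be sketched as a single function. This is an illustration, not the load balancer used in the experiments: the name `pick_shard`, its parameters, the alphabetical tie-break, and the fallback when every shard is overloaded are all assumptions.

```python
import random
from collections import Counter


def pick_shard(recipients, shard_of, overloaded, all_shards, rng=random):
    """Choose the shard whose mail queue should receive a message.

    recipients: list of recipient names
    shard_of:   mapping recipient -> home shard (from the clique partitioning)
    overloaded: set of shards whose mail queue is currently overloaded
    all_shards: list of all shard names
    """
    counts = Counter(shard_of[r] for r in recipients)
    # Prefer the shard hosting the most receivers; break ties by name.
    preferred = min(counts, key=lambda shard: (-counts[shard], shard))
    if preferred not in overloaded:
        return preferred
    unloaded = [s for s in all_shards if s not in overloaded]
    if not unloaded:
        return preferred  # assumption: fall back when every shard is overloaded
    return rng.choice(unloaded)
```

For example, a message addressed to one user on shard `s0` and two users on shard `s1` is routed to `s1`, so two of the three deliveries can be synchronized locally.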
To show performance without contention on shared directories, we shard Maildirs by delivery process. This allows delivery processes to deliver email without the need for synchronization. This configuration (Assise-private) is a best-case scenario for Assise since directory contention overheads are non-existent. Assise-private is able to outperform all prior configurations and scale linearly until it reaches roughly 2k mail operations per second, where we become bottlenecked by network bandwidth. Ceph performance is again unaffected.
Our evaluation with a real application demonstrates Assise’s ability to easily scale its performance. Most importantly, the results show Assise’s ability to deliver almost the same level of performance even if it has to perform synchronization on shared directories.
6. Related Work
Distributed file systems.
Ceph (Weil et al., 2006) and HDFS (4) are popular cloud file systems that can leverage both RDMA (10) and NVM (Islam et al., 2016), but are designed for high latency and low bandwidth storage (esp. for small IO). Thus, both dedicate metadata storage to a special cluster and separate small metadata updates from large ones. Both focus on mapping files to objects and leave storage problems to user space OSDs, which store objects in local file systems. Due to a layer of indirection, both incur additional latency and CPU overhead, especially for small IO. Ceph uses a pseudo-random data distribution function that allows clients to find OSDs themselves, but without locality.
Octopus (Lu et al., 2017) and Orion (Yang et al., 2019) are designed to use RDMA and NVM for high performance. Still, neither leverages kernel-bypass for low-latency IO (Octopus uses FUSE, Orion runs in the kernel). Like Ceph, Octopus uses a hash function for file placement (Octopus does not replicate), ignoring data locality. Orion stores data locally (for internal clients), but uses a disaggregated metadata server. Both systems perform multiple remote operations per file IO in the common case to update file and/or metadata, increasing IO latency. Because of update contention at the central metadata server, Orion scales to a small number of clients. Orion has no evaluation of server failure handling (Assise’s is in §5.4).
Assise, in contrast, is designed for a scenario where CPU and networking overheads are high compared to storage access. Hence, Assise eliminates kernel overheads for file operations and localizes storage operations to efficiently use NVM. Assise stores metadata with data, allowing reads to be handled locally. IO incurs a single replica operation in the common case, without requiring dedicated metadata servers or a distribution function to balance load. Instead of relying on per-node local file systems, Assise supports hot and cold storage on multiple media. To scale further, Assise delegates lease management with locality according to access patterns.
Apache Crail (1) provides the HDFS file system API and focuses on throughput by accessing distributed data in parallel, but does not provide fault tolerance. Assise provides fault tolerance and high throughput by accessing fast, local NVM with higher bandwidth than the network. Dropbox uses Amazon S3 for data blocks, but keeps metadata in DRAM, backed by an SSD (Metz, 2016); technical details on its operation are not public. RAMcloud (Ousterhout et al., 2015) is a key-value store that uses SSD as a back up for data in DRAM. The capacity limits of DRAM mean that many RAMcloud operations still involve the network, and because DRAM state cannot be recovered after a crash, it is vulnerable to cascading node failures. Even after single node failures, state must be restored from remote nodes. For example, RAMcloud uses the full-bisection bandwidth of the network to speed recovery. Ursa (Li et al., 2019) uses HDD as a back up for SSD. It attains SSD small IO performance at HDD storage cost by similarly leveraging log structure. Blizzard (Mickens et al., 2014) uses a full-bisection bandwidth network to stripe client IO to remote HDDs for high throughput. Tachyon (Li et al., 2014) leverages DRAM and aims to circumvent replication network bottlenecks by leveraging the concept of lineage, where lost output is recovered by re-executing the operations that created the output. Tachyon requires applications to be ported to its lineage API.
Assise leverages log structure on NVM and SSD and reduces network overheads via log coalescing and local operation. Assise recovery does not require a full bisection bandwidth network, backup storage on SSD or HDD, or a lineage API.
We argue that NVM’s unique characteristics require a redesign of distributed file systems to cache and manage data on colocated persistent memory. We show how to leverage NVM colocation in the design and implementation of Assise, a distributed file system that provides low tail latency, high throughput, scalability, and high availability with a strong consistency model. Assise uses cache replicas in NVM to minimize application recovery time and ensure data availability, while leveraging a crash consistent file system cache coherence layer (CC-NVM) to provide scalability. In comparing with several state-of-the-art file systems, our results show that Assise improves write latency up to 22×, throughput up to 56×, fail-over time up to 103×, and scalability up to 6× versus Ceph, while providing stronger consistency semantics.
-  Note: http://crail.apache.org/
-  Note: https://aws.amazon.com/ebs/
-  Note: https://aws.amazon.com/efs/
-  Note: http://hadoop.apache.org/
-  Note: https://software.intel.com/en-us/articles/intelr-memory-latency-checker
-  (2019) File systems unfit as distributed storage backends: lessons from 10 years of ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19.
-  (2017-08) Amazon S3. Note: https://aws.amazon.com/s3/
-  (1995) Serverless network file systems. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, New York, NY, USA, pp. 109–126.
-  (2017-08) Apache ZooKeeper. Note: https://zookeeper.apache.org
-  (2018-08) Bandwidth: a memory bandwidth benchmark. Note: http://www.accelio.org/
-  (2014) Failure analysis of virtual and physical machines: patterns, causes and characteristics. In Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN ’14, Washington, DC, USA, pp. 1–12.
-  (1981) Grapevine: an exercise in distributed computing. In Proceedings of the Eighth ACM Symposium on Operating Systems Principles, SOSP ’81, New York, NY, USA, pp. 178–179.
-  (2013) Optimistic crash consistency. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, New York, NY, USA, pp. 228–243.
-  (2012) Consistency without ordering. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST’12, Berkeley, CA, USA, pp. 9–9.
-  (2011) LevelDB: A Fast Persistent Key-Value Store. Note: https://opensource.googleblog.com/2011/07/leveldb-fast-persistent-key-value-store.html
-  (2004) MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, Berkeley, CA, USA, pp. 10–10.
-  Differences from POSIX. Note: http://docs.ceph.com/docs/master/cephfs/posix/
-  (2014) FaRM: fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, pp. 401–414.
-  (2010) Availability in globally distributed storage systems. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, Berkeley, CA, USA, pp. 61–74.
-  (2003) The google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, New York, NY, USA, pp. 29–43.
-  (1989) Leases: an efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, SOSP ’89, New York, NY, USA, pp. 202–210.
-  (2019-04) Intel Optane DC persistent DIMM prices listed: $842 for 128gb, $2,668 for 256gb. Note: https://www.tomshardware.com/news/intel-optane-dimm-pricing-performance,39007.html
-  (2015-03) Network file system (nfs) version 4 protocol. Note: https://tools.ietf.org/html/rfc7530
-  (2017) Computer architecture, sixth edition: a quantitative approach. 6th edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
-  (1994) File system design for an nfs file server appliance. In Proceedings of the USENIX Winter 1994 Technical Conference, WTEC’94, Berkeley, CA, USA, pp. 19–19.
-  (2019-09) Intel optane DC persistent memory comes to oracle exadata X8M. Note: https://insidehpc.com/2019/09/intel-optane-dc-persistent-memory-comes-to-oracle-exadata-x8m/
-  (2019-03) Intel optane dc persistent memory. Note: http://www.intel.com/optanedcpersistentmemory
-  (2019-04) Intel SSD DC P4610 1.6TB. Note: Google Shopping search. Lowest non-discount price.
-  (2016) High performance design for hdfs with byte-addressability of nvm and rdma. In Proceedings of the 2016 International Conference on Supercomputing, ICS ’16, New York, NY, USA, pp. 8:1–8:14.
-  (2019-04) Basic performance measurements of the Intel Optane DC Persistent Memory Module. Note: https://arxiv.org/abs/1903.05714v2
-  (2016) Tencent sort. Technical report Tencent Corporation. Note: http://sortbenchmark.org/TencentSort2016.pdf
-  (2015) Using rdma efficiently for key-value services. ACM SIGCOMM Computer Communication Review 44 (4), pp. 295–306.
-  (2018) Hyperloop: group-based nic-offloading to accelerate replicated transactions in multi-tenant storage systems. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18, New York, NY, USA, pp. 297–312.
-  The Enron corpus: a new dataset for email classification research. In European Conference on Machine Learning, pp. 217–226.
-  (2017) Strata: a cross media file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA, pp. 460–477.
-  (2017-04) Fast memcpy with SPDK and intel I/OAT DMA engine. Note: https://software.intel.com/en-us/articles/fast-memcpy-using-spdk-and-ioat-dma-engine
-  (2019) RECIPE: reusing concurrent in-memory indexes for persistent memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, Canada.
-  (2011) Detecting failures in distributed systems with the falcon spy network. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11.
-  (2014) Tachyon: reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, New York, NY, USA, pp. 6:1–6:15.
-  (2019) URSA: hybrid block storage for cloud-scale virtual disks. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19, New York, NY, USA, pp. 15:1–15:17.
-  (2017) Octopus: an rdma-enabled distributed persistent memory file system. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’17, Berkeley, CA, USA, pp. 773–785.
-  (2012) A performance study to guide rdma programming decisions. In High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pp. 778–785.
-  (2016-03) The epic story of Dropbox’s exodus from the Amazon cloud empire. Note: https://www.wired.com/2016/03/epic-story-dropboxs-exodus-amazon-cloud-empire/
-  (2014) Blizzard: fast, cloud-scale block storage for cloud-oblivious applications. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI’14, Berkeley, CA, USA, pp. 257–273.
-  (2013) Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In USENIX Annual Technical Conference, pp. 103–114.
-  (2017-06) NVM programming model (NPM). Version 1.2 edition, SNIA.
-  (2017) Octopus - github repository. Note: https://github.com/thustorage/octopus
-  (2015-08) The ramcloud storage system. ACM Trans. Comput. Syst. 33 (3), pp. 7:1–7:55.
-  (2017-08) Persistent memory programming. Note: http://pmem.io/
-  (2017-10) PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP ’17), Shanghai, China.
-  (2018-10) LegoOS: a disseminated, distributed OS for hardware resource disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 69–87.
-  (2007) Latency and bandwidth-minimizing failure detectors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, New York, NY, USA, pp. 89–99.
-  Sort benchmark home page. Note: http://sortbenchmark.org/
-  (1981-07) Operating system support for database management. Commun. ACM 24 (7), pp. 412–418.
-  (2014-09) Supporting filesystems in persistent memory. Note: https://lwn.net/Articles/610174/
-  (2018) Tailwind: fast and atomic rdma-based replication.
-  (2016) Filebench: a flexible framework for file system benchmarking. USENIX ;login: 41 (1).
-  (2019) From louvain to leiden: guaranteeing well-connected communities. Scientific reports 9.
-  (2004) Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, Berkeley, CA, USA, pp. 7–7.
-  (2017) To fuse or not to fuse: performance of user-space file systems. In FAST, pp. 59–72.
-  (2013) Robustness in the salus scalable block store. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, nsdi’13, Berkeley, CA, USA, pp. 357–370.
-  (2009) Solaris 10 ZFS essentials. Prentice Hall.
-  (2006) Ceph: a scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, Berkeley, CA, USA, pp. 307–320.
-  (2019) Finding and fixing performance pathologies in persistent memory software stacks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, New York, NY, USA, pp. 427–439.
-  (2019) Orion: a distributed file system for non-volatile main memory and rdma-capable networks. In 17th USENIX Conference on File and Storage Technologies (FAST 19), Boston, MA, pp. 221–234.
-  (2018-07) Google cloud taps new intel memory module for SAP HANA workloads. Note: https://www.zdnet.com/article/google-cloud-taps-new-intel-memory-module-for-sap-hana-workloads/
-  (2019-08) Baidu swaps DRAM for optane to power in-memory database. Note: https://www.zdnet.com/article/baidu-swaps-dram-for-optane-to-power-in-memory-database/
-  (2018-10) Write-optimized and high-performance hashing index scheme for persistent memory. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 461–476.
Appendix A IO Paths Appendix
In this appendix, we describe in detail the IO paths introduced in Section 3.2, including cold storage in SSD, and how they integrate with Strata. We use some Strata terminology in this section, which is defined in the Strata paper. Figure 10 illustrates the major IO paths in Assise. Application processes using Assise run on a configurable number of cache replicas. For illustration, the figure shows an example with 2 cache replicas and 1 reserve replica.
A.1. Write Path
Writes in Assise involve three high-level mechanisms that operate on different time scales: 1. To allow for persistence with low latency, LibFS writes directly into a local, configurably sized, private update log in NVM (Figure 10). 2. The local update log is later chain-replicated by LibFS (on fsync when pessimistic, on dsync when optimistic), which provides opportunities to save network bandwidth by coalescing the log before replication. 3. When the update log fills beyond a threshold, a digest of the log is initiated on every replica to evict its contents. We describe replication and digestion next.
Replication and crash consistency.
When pessimistic, fsync forces immediate, synchronous replication. The caller is blocked until all writes up to the fsync have been replicated. Thus, all writes prior to an fsync outlive node failures.
When optimistic, fsync is a no-op and Assise is free to delay replication to coalesce more operations in the write log before replication. In this case, Assise initiates replication on dsync or upon digestion (see below).
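The two modes and the effect of coalescing can be illustrated with a minimal sketch. It is not Assise code: the `UpdateLog` class, its fields, and the coalescing rule (keep only the latest write per file block) are assumptions made for illustration, as is the choice to let `dsync` also replicate in pessimistic mode.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateLog:
    """Toy model of a process-private update log (hypothetical names)."""
    optimistic: bool
    entries: list = field(default_factory=list)     # pending (path, block, data) writes
    replicated: list = field(default_factory=list)  # batches handed to the replication chain

    def write(self, path, block, data):
        # A write only appends to the local log; nothing crosses the network yet.
        self.entries.append((path, block, data))

    def _coalesce(self):
        # Assumed coalescing rule: keep only the latest write to each (path, block).
        latest = {}
        for path, block, data in self.entries:
            latest[(path, block)] = data
        return [(p, b, d) for (p, b), d in latest.items()]

    def _replicate(self):
        if self.entries:
            self.replicated.append(self._coalesce())
            self.entries.clear()

    def fsync(self):
        # Pessimistic: replicate immediately. Optimistic: no-op.
        if not self.optimistic:
            self._replicate()

    def dsync(self):
        # Forces replication of everything logged so far.
        self._replicate()
```

In optimistic mode, two writes to the same block followed by `dsync` produce a single replicated entry, which is where the bandwidth saving comes from.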
After coalescing the local update log, its contents are written to the LibFS private update log on the next replica along the replication chain via RDMA writes. Finally, an RPC is sent to the replica to initiate chain replication to the next replica, and so on. The final replica in the chain sends an acknowledgment back along the chain to indicate that the chain completed successfully.
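The chain protocol can be sketched as follows. This is a simplified in-process model, not the Assise implementation: `replicate_along_chain` is a hypothetical name, list appends stand in for RDMA writes, and the recursive call stands in for the RPC that asks a replica to continue the chain.

```python
def replicate_along_chain(chain, batch):
    """Copy `batch` into every replica log in `chain`; return True once acknowledged.

    `chain` is a list of per-replica logs (plain lists), ordered along the chain.
    """
    def forward(i):
        chain[i].extend(batch)         # stands in for the RDMA write into replica i's log
        if i + 1 < len(chain):
            return forward(i + 1)      # stands in for the RPC asking replica i to continue
        return True                    # final replica acknowledges

    # The acknowledgment is the return value that unwinds back along the chain.
    return forward(0) if chain else True
```

The caller learns of success only after the last replica has received the batch, matching the acknowledgment flowing back along the chain.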
When a process’s private update log fills beyond a threshold, LibFS replicates all log contents and then initiates a digest on each replica along the replication chain via RPC. Each replica checks the integrity of the log contents and potentially coalesces them further. The replicas along the chain digest in parallel, and each acknowledges when its digest operation has finished.
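A minimal sketch of the threshold-triggered digest, under simplifying assumptions: the log is a list of `(key, data)` pairs already replicated to every replica, each replica's shared area is a dictionary, and the function name and parameters are hypothetical. Integrity checking and parallel execution are omitted.

```python
def maybe_digest(log, capacity, threshold, replicas):
    """Digest `log` into every replica's shared area once it fills past the threshold.

    `threshold` is a fraction of `capacity` (both counted in log entries here).
    Returns True if a digest ran.
    """
    if len(log) <= threshold * capacity:
        return False
    for shared_area in replicas:
        for key, data in log:
            shared_area[key] = data    # later log entries overwrite earlier ones
    log.clear()                        # digested entries are evicted from the log
    return True
```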
Cold data migration.
Digests insert new data into SharedFS hot shared areas in NVM (the second-level cache), migrating cold data out of these areas. Assise migration is LRU based. Data migrates from the private write log to the hot shared area, then to the optional reserve shared area, and finally to the cold shared area. For cache replicas, the hot shared area resides in NVM, there is no reserve shared area, and the cold shared area resides in SSD. For the reserve replica, the hot and cold shared areas both reside in SSD, and there is a reserve shared area in NVM that is used to accelerate cold reads (cf. §A.2 and §3.5).
A.2. Read Path
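LRU-based migration between areas can be sketched with a chain of bounded tiers. The `Tier` class, its capacities (counted in blocks), and the unbounded cold tier are assumptions for illustration; the sketch models a cache replica with a hot and a cold area and no reserve area.

```python
from collections import OrderedDict


class Tier:
    """A storage area that evicts its least recently used block to a lower tier."""

    def __init__(self, name, capacity, lower=None):
        self.name = name
        self.capacity = capacity       # maximum number of blocks; None means unbounded
        self.lower = lower             # next (colder) tier, if any
        self.blocks = OrderedDict()    # least recently used block first

    def insert(self, key, data):
        self.blocks[key] = data
        self.blocks.move_to_end(key)   # mark as most recently used
        while self.capacity is not None and len(self.blocks) > self.capacity:
            old_key, old_data = self.blocks.popitem(last=False)
            if self.lower is not None:
                self.lower.insert(old_key, old_data)  # migrate the cold block downward


cold = Tier("cold (SSD)", capacity=None)
hot = Tier("hot (NVM)", capacity=2, lower=cold)
for key in ("a", "b", "c"):
    hot.insert(key, b"data")
```

After these three inserts, block `a` has migrated to the cold tier while `b` and `c` remain hot.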
Assise allows different versions of the same file block to be available in multiple storage layers simultaneously. The LRU data migration mechanism guarantees that the latest version is always available in the fastest media of the storage hierarchy. Upon a read, LibFS 1. checks the process-private update log and DRAM read cache (log hashtable and read cache in Figure 10) for the requested data block. If not found, LibFS 2. checks the node-local hot shared area via extent trees (cached in process-local DRAM; see extent tree in Figure 10). If the data is found in either of these areas, it is read locally. If not found, LibFS 3. checks the reserve shared area on the reserve replica, if it exists, and in parallel checks the cold shared area on the local replica. If the data is found in the reserve shared area, LibFS reads it remotely. Otherwise, it is read locally.
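The lookup order can be sketched as a function over dictionary-like layers. The function name, its parameters, and the returned source labels are hypothetical, and the sketch checks the reserve and cold areas sequentially, whereas the text describes these two checks as running in parallel.

```python
def read_block(key, update_log, read_cache, hot_area, reserve_area, cold_area):
    """Return (data, source) for `key`, searching the fastest layers first.

    Each layer is a mapping from block key to data; `reserve_area` may be None
    when no reserve replica exists.
    """
    local_layers = (
        ("update_log", update_log),    # step 1: process-private update log
        ("read_cache", read_cache),    # step 1: DRAM read cache
        ("hot", hot_area),             # step 2: node-local hot shared area
    )
    for name, layer in local_layers:
        if key in layer:
            return layer[key], name
    # Step 3: reserve area (remote read) takes precedence over local cold storage.
    if reserve_area is not None and key in reserve_area:
        return reserve_area[key], "reserve"
    if key in cold_area:
        return cold_area[key], "cold"
    raise KeyError(key)
```

Because newer versions live in faster layers, returning the first hit yields the latest version of the block.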
Read cache management.
Recently read data is cached in DRAM, except if it was read from local NVM, where DRAM caching does not provide much benefit. Assise prefetches up to 256KB of data sequentially when reading from cached media. The read cache caches 4KB blocks, which is also the IO granularity of the SSD. Remote NVM reads can happen at smaller granularity (§3.2). Filling the DRAM cache with new data might necessitate evicting old data. In this case, the data is written back from DRAM to NVM by LibFS to the local update log. The updated data migrates to the hot shared area on digest. Finally, upon release of a lease, LibFS invalidates corresponding cache entries.
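These caching rules can be sketched as a small block cache. The `ReadCache` class and its methods are hypothetical, the eviction policy is assumed to be LRU, leases are assumed to be per path, and prefetching is omitted.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # the read cache holds 4KB blocks


class ReadCache:
    """Toy DRAM read cache with write-back to a local update log on eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()    # (path, block_no) -> data, least recently used first
        self.update_log = []           # stands in for the local NVM update log

    def fill(self, path, offset, data, from_local_nvm=False):
        if from_local_nvm:
            return                     # local NVM reads are not cached in DRAM
        key = (path, offset // BLOCK_SIZE)
        self.blocks[key] = data
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            # Evicted blocks are written back to the update log as (key, data).
            self.update_log.append(self.blocks.popitem(last=False))

    def release_lease(self, path):
        # Invalidate every cached block of the file whose lease is released.
        for key in [k for k in self.blocks if k[0] == path]:
            del self.blocks[key]
```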