Analyzing IO Amplification in Linux File Systems

07/26/2017
by   Jayashree Mohan, et al.

We present the first systematic analysis of read, write, and space amplification in Linux file systems. While many researchers are tackling write amplification in key-value stores, IO amplification in file systems has been largely unexplored. We analyze data and metadata operations on five widely-used Linux file systems: ext2, ext4, XFS, btrfs, and F2FS. We find that data operations result in significant write amplification (2-32X) and that metadata operations have a large IO cost. For example, a single rename requires 648 KB write IO in btrfs. We also find that small random reads result in read amplification of 2-13X. Based on these observations, we present the CReWS conjecture about the relationship between IO amplification, consistency, and storage space utilization. We hope this paper spurs people to design future file systems with less IO amplification, especially for non-volatile memory technologies.


1 Introduction

File systems were developed to enable users to easily and efficiently store and retrieve data. Early file systems such as the Unix Fast File System [1] and ext2 [2] were simple file systems. To enable fast recovery from crashes, crash-consistency techniques such as journaling [3] and copy-on-write [4] were incorporated into file systems, resulting in file systems such as ext4 [5] and xfs [6]. Modern file systems such as btrfs [7] include features such as snapshots and checksums for data, making the file system even more complex.

While the new features and strong crash-consistency guarantees have enabled wider adoption of Linux file systems, they have come at the cost of a crucial aspect: efficiency. File systems now maintain a large number of data structures on storage, and both data and metadata paths are complex and involve updating several blocks on storage. In this paper, we ask two questions: what is the IO cost of various Linux file-system data and metadata operations, and what is the IO amplification of these operations? While these questions are receiving wide attention in the world of key-value stores [8, 9, 10, 11, 12, 13] and databases [14], they have been largely ignored in file systems. File systems have traditionally been optimized for latency and overall throughput [15, 16, 17, 18], and not for IO or space amplification.

We present the first systematic analysis of read, write, and space amplification in Linux file systems. Read amplification is the ratio of total read IO to the user data requested. For example, if the user wanted to read 4 KB, and the file system read 24 KB off storage to satisfy that request, the read amplification is 6. Write amplification is defined similarly. Space amplification measures how efficiently the file system stores data: if the user writes 4 KB, and the file system consumes 40 KB on storage (including data and metadata), the space amplification is 10.

We analyze five widely-used Linux file systems that occupy different points in the design space: ext2 (no crash consistency guarantees), ext4 (metadata journaling), XFS (metadata journaling), F2FS (log-structured file system), and btrfs (copy-on-write file system). We analyze the write IO and read IO resulting from various metadata operations, and the IO amplification arising from data operations. We also analyze these measures for two macro-benchmarks: compiling the Linux kernel, and Filebench varmail [19]. We break down write IO cost by IO that was performed synchronously (during fsync()) and IO that was performed during delayed background checkpointing.

We find several interesting results. For data operations such as overwriting a file or appending to a file, there was significant write amplification (2–32×). Small random reads resulted in a read amplification of 2–8×, even with a warm cache. Metadata operations such as directory creation or file rename result in significant storage IO: for example, a single file rename required 12–648 KB to be written to storage. Even though ext4 and xfs both implement metadata journaling, we find XFS significantly more efficient for file updates. Similarly, though F2FS and btrfs are implemented based on the log-structured approach (copy-on-write is a dual of the log-structured approach), we find F2FS to be significantly more efficient across all workloads. In fact, in all our experiments, btrfs was an outlier, producing the highest read, write, and space amplification. While this may partly arise from the new features of btrfs (that other file systems do not provide), the copy-on-write nature of btrfs is also part of the reason.

We find that IO amplification arises due to three main factors: the block-based interface, the crash-consistency mechanisms of file systems, and the different data structures maintained by storage to support features such as snapshots. Based on these observations, we introduce the CReWS conjecture. The CReWS conjecture states that for a general-purpose file system on a shared storage device, it is impossible to provide strong crash-consistency guarantees while also minimizing read, write, and space amplification. We discuss different designs of file systems, and show that for a general-purpose file system (used by many applications), minimizing write amplification leads to space amplification. We hope the CReWS conjecture helps guide future file-system designers.

With the advent of non-volatile memory technologies such as Phase Change Memory [20] that have limited write cycles, file-system designers can no longer ignore IO amplification. Such technologies offer the byte-based interface, which can greatly help to reduce IO amplification. Data structures can be updated byte-by-byte if required, and the critical metadata operations can be redesigned to have low IO footprint. We hope this paper indicates the current state of IO amplification in Linux file systems, and provides a useful guide for the designers of future file systems.

2 Analyzing Linux File Systems

We now analyze five Linux file systems which represent a variety of file-system designs. First, we present our methodology (§2.1) and a brief description of the design of each file system (§2.2). We then describe our analysis of common file-system operations based on three aspects: read IO, write IO, and space consumed (§2.3).

2.1 Methodology

We use blktrace [21], dstat [22], and iostat [23] to monitor the block IO trace of different file-system operations such as rename() on five different Linux file systems. These tools allow us to accurately identify the following three metrics.

Write Amplification. The ratio of total storage write IO to the user data. For example, if the user wrote 4 KB, and that resulted in the file system writing 8 KB to storage, the write amplification is 2. For operations such as file renames, where there is no user data, we simply report the total write IO. Write IO and write amplification both should be minimized.

Read Amplification. Similar to write amplification, this is the ratio of total storage read IO to user-requested data. For example, if the user wants to read 4 KB, and the file system reads 12 KB off the storage to serve the read request, the read amplification is 3. We report the total read IO for metadata operations such as file creation. Read amplification should also be minimized.

Space Amplification. The ratio of bytes consumed on storage to bytes stored by the user. For example, the user wants to store 4 KB. If the file system has to consume 20 KB on storage to store 4 KB of user data, the space amplification is 5. Space amplification is a measure of how efficiently the file system is using storage, and thus should be minimized. We calculate space amplification based on the unique disk locations written to the storage, during the workloads.

Note that if the user stores one byte of data, the write and space amplification is trivially 4096, since the file system performs IO in block-sized units of 4096 bytes. We assume that a careful application will perform reads and writes in multiples of the block size. We also mount the file systems we study with the noatime option. Thus, our results represent amplification that will be observed even for careful real-world applications.
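The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names and the sample trace are hypothetical, and a real trace would come from a tool such as blktrace. Following the definition above, space consumption is counted over unique block locations.

```python
BLOCK_SIZE = 4096  # bytes; the file systems studied perform IO in 4 KB blocks


def amplification(storage_bytes, user_bytes):
    """Ratio of bytes moved to (or consumed on) storage to bytes the user asked for."""
    if user_bytes <= 0:
        raise ValueError("amplification is undefined without user data")
    return storage_bytes / user_bytes


def unique_bytes_written(writes, block_size=BLOCK_SIZE):
    """Bytes covered by the unique blocks touched by a list of (offset, length) writes."""
    blocks = set()
    for offset, length in writes:
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return len(blocks) * block_size


# Hypothetical trace for a single 4 KB user write: one data block, plus one
# metadata block that is written twice at the same storage location.
trace = [(0, 4096), (8192, 4096), (8192, 4096)]
user_bytes = 4096

write_amp = amplification(sum(length for _, length in trace), user_bytes)
space_amp = amplification(unique_bytes_written(trace), user_bytes)
one_byte_amp = amplification(BLOCK_SIZE, 1)  # the one-byte case noted above
```

In this made-up trace the repeated metadata write counts toward write amplification but not toward space amplification, which is why the two measures can differ for the same workload.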

2.2 File Systems Analyzed

We analyze five different Linux file systems. Each of these file systems is (or was in the recent past) used widely, and represents a different point in the file-system design spectrum.

ext2. The ext2 file system [24] is a simple file system based on the Unix Fast File System [1]. ext2 does not include machinery for providing crash consistency, instead opting to fix the file system with fsck after reboot. ext2 writes data in place, and stores file metadata in inodes. ext2 uses direct and indirect blocks to find data blocks.

ext4. ext4 [2] builds on the ext2 codebase, but uses journaling [3] to provide strong crash-consistency guarantees. All metadata is first written to the journal before being checkpointed (written in-place) to the file system. ext4 uses extents to keep track of allocated blocks.

XFS. The XFS [6] file system also uses journaling to provide crash consistency. However, XFS implements journaling differently from ext4. XFS was designed to have high scalability and parallelism. XFS manages the allocated inodes through the inode B+ tree, while the free space information is managed by B+ trees. The inodes keep track of their own allocated extents.

F2FS. F2FS [25] is a log-structured file system designed specifically for solid state drives. Similar to the original LFS [26], F2FS writes all updates to storage sequentially. The logs in F2FS are composed of multiple segments, with the segment utilization monitored using a Segment Information Table (SIT). Additionally, to avoid the wandering tree problem [27], F2FS assigns a node ID to metadata structures such as inodes and direct and indirect blocks. The mapping between a node ID and the actual block address is maintained in a Node Address Table (NAT), which must be consulted to read data off storage, resulting in some overhead. Though data is written sequentially to the logs, NAT and SIT updates are first journaled and then written out in place.
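The role of the node-ID indirection can be illustrated with a toy model. The sketch below is not F2FS's on-disk format; the class, its methods, and the node IDs are hypothetical. It only shows the idea that when a node is rewritten at a new log position, the table entry changes while references held by other nodes (which store node IDs, not addresses) stay valid.

```python
class ToyNodeAddressTable:
    """Toy model of a node-ID -> block-address indirection table."""

    def __init__(self):
        self._addr = {}       # node ID -> current block address
        self._next_block = 0  # head of the sequential log

    def write_node(self, node_id):
        """Write a node out of place at the log head and repoint its table entry."""
        block = self._next_block
        self._next_block += 1
        self._addr[node_id] = block
        return block

    def lookup(self, node_id):
        """Extra lookup needed on the read path to find a node's current block."""
        return self._addr[node_id]


nat = ToyNodeAddressTable()
INODE, DIRECT = 1, 2       # hypothetical node IDs
inode_refs = [DIRECT]      # the inode names its child by node ID, not by address

nat.write_node(INODE)
old_addr = nat.write_node(DIRECT)
new_addr = nat.write_node(DIRECT)  # the direct node moves to a new log position
```

After the second write of the direct node, `inode_refs` is unchanged and the inode does not need to be rewritten; the cost is the extra `lookup` on every read.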

btrfs. btrfs [7] is a copy-on-write file system based on B+ trees. The entire file system is composed of different B+ trees (e.g., file-system tree, extent tree, checksum tree, etc.), all emerging from a single tree called the tree of tree roots. All the metadata of btrfs is located in these trees. The file-system tree stores the information about all the inodes, while the extent tree holds the metadata related to each allocated extent. btrfs uses copy-on-write logging, in which any modification to a B+ tree leaf/node is preceded by copying the entire leaf/node to the log tree.

2.3 Analysis

We measure the read IO, write IO, and space consumed by different file-system operations.

2.3.1 Data Operations

First, we focus on data operations: file read, file overwrite, and file append. For such operations, it is easy to calculate write amplification, since the workload involves a fixed amount of user data. The results are presented in Table 1.

Measure                    ext2    ext4    xfs     f2fs    btrfs
File Overwrite
  Write Amplification      2.00    4.00    2.00    2.66    32.65
  Space Amplification      1.00    4.00    2.00    2.66    31.17
File Append
  Write Amplification      3.00    6.00    2.01    2.66    30.85
  Space Amplification      1.00    6.00    2.00    2.66    29.77
File Read (cold cache)
  Read Amplification       6.00    6.00    8.00    9.00    13.00
File Read (warm cache)
  Read Amplification       2.00    2.00    5.00    3.00    8.00

Table 1: Amplification for Data Operations. The table shows the read, write, and space amplification incurred by different file systems when reading and writing files.

File Overwrite. The workload randomly seeks to a 4 KB-aligned location in a 100 MB file, does a 4 KB write (overwriting file data), then calls fsync() to make the data durable. The workload does 10 MB of such writes. From Table 1, we observe that ext2 has the lowest write and space amplification, primarily due to the fact that it has no extra machinery for crash consistency; hence the overwrites are simply performed in-place. The 2× write amplification arises from writing both the data block and the inode (to reflect modified time). XFS has a similar low write amplification, but higher space amplification since the metadata is first written to the journal. When compared to XFS, ext4 has higher write and space amplification: this is because ext4 writes the superblock and other information into its journal with every transaction; in other words, XFS journaling is more efficient than ext4 journaling. Interestingly, F2FS has an efficient implementation of the copy-on-write technique, leading to low write and space amplification. The roll-forward recovery mechanism of F2FS allows F2FS to write only the direct node block and data on every fsync(), with other data checkpointed infrequently [25]. In contrast, btrfs has a complex implementation of the copy-on-write technique (mostly due to a push to provide more features such as snapshots and stronger data integrity) that leads to extremely high space and write amplification. When btrfs is mounted with the default mount options that enable copy-on-write and checksumming of both data and metadata, we see 32× write amplification as shown in Table 1. However, if the checksumming of the user data is disabled, the write amplification drops to 28×, and when the copy-on-write feature is also disabled for user data (metadata is still copied on write), the write amplification for overwrites comes down to about 18.6×. An interesting take-away from this analysis is that even if you pre-allocate all your files on these file systems, writes will still lead to 2–30× write amplification.
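The overwrite workload described above can be approximated with a short Python script. This is a sketch under assumptions, not the authors' benchmark code: the function name, the fixed random seed, and the payload are hypothetical, and the block trace would be collected separately (for example with blktrace) while the script runs.

```python
import os
import random

BLOCK = 4096  # bytes


def random_overwrite(path, total_bytes, block=BLOCK, seed=0):
    """Overwrite random block-aligned offsets of an existing file, with fsync() after each write."""
    rng = random.Random(seed)
    payload = b"x" * block
    fd = os.open(path, os.O_WRONLY)
    try:
        nblocks = os.fstat(fd).st_size // block
        if nblocks == 0:
            raise ValueError("file is smaller than one block")
        written = 0
        while written < total_bytes:
            offset = rng.randrange(nblocks) * block
            written += os.pwrite(fd, payload, offset)
            os.fsync(fd)  # force the data and any dirty metadata to storage
        return written
    finally:
        os.close(fd)
```

With the parameters given in the text, the call would be `random_overwrite(path, 10 * 1024 * 1024)` on a pre-created 100 MB file. Write amplification is then the traced storage write IO divided by the returned byte count.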

File Append. Our next workload appends a 4 KB block to the end of a file and calls fsync(). The workload does 10 MB of such writes. The appended file is empty initially. Our analysis for the file overwrite workload mostly holds for this workload as well; the main difference is that more metadata (for block allocation) has to be persisted, thus leading to more write and space amplification for ext2 and ext4 file systems. In F2FS and xfs, the block allocation information is not persisted at the time of fsync(), leading to behavior similar to file overwrites. Thus, on xfs and f2fs, pre-allocating files does not provide a benefit in terms of write amplification.

Figure 1: Write Amplification for Various Write Sizes. The figure shows the write amplification observed for writes of various sizes followed by a fsync() call.

We should note that write amplification is high in our workloads because we do small writes followed by an fsync(). The fsync() call forces file-system activity, such as committing metadata transactions, which has a fixed cost regardless of the size of the write. As Figure 1 shows, as the size of the write increases, the write amplification drops close to one. Applications which issue small writes should take note of this effect: even if the underlying hardware (such as an SSD) does not benefit from large sequential writes, the file system itself benefits from larger writes.
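The effect of a fixed per-fsync() cost can be seen with a simple model: if every fsync() adds a constant overhead c to a user write of size s, the write amplification is (s + c) / s, which approaches one as s grows. The sketch below assumes a constant overhead of 12 KB, chosen only so that a 4 KB write gives the 4× ext4 overwrite figure from Table 1; it is an illustration of the trend, not a fit to the data in Figure 1.

```python
def modeled_write_amplification(write_kb, fixed_overhead_kb):
    """Write amplification when each fsync() adds a fixed overhead to the user's write."""
    return (write_kb + fixed_overhead_kb) / write_kb


# Assumed overhead per fsync(); a hypothetical constant, not a measured value.
OVERHEAD_KB = 12

for size_kb in (4, 16, 64, 256, 1024):
    amp = modeled_write_amplification(size_kb, OVERHEAD_KB)
    print(f"{size_kb:5d} KB write -> {amp:.2f}x")
```

Real file systems need not have a strictly constant overhead, so the measured curve may deviate from this model.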

File Reads. The workload seeks to a random 4 KB-aligned block in a 10 MB file and reads one block. In Table 1, we make a distinction between a cold-cache read and a warm-cache read. On a cold cache, the file read usually involves reading a lot of file-system metadata: for example, the directory, the file inode, the super block, etc. On subsequent reads (warm cache), reads to these blocks will be served out of memory. The cold-cache read amplification is quite high for all the file systems. Even in the case of simple file systems such as ext2, reading a file requires reading the inode. The inode read triggers a read-ahead of the inode table, increasing the read amplification. Since the read path does not include crash-consistency machinery, ext2 and ext4 have the same read amplification. The high read amplification of xfs results from reading the metadata B+ tree and readahead for file data. F2FS read amplification arises from reading extra metadata structures such as the NAT and the SIT [25]. In btrfs, a cold-cache file read involves reading the tree of tree roots, the file-system tree, and the checksum tree, leading to high read amplification. On a warm cache, the read amplification of all file systems is greatly reduced, since global data structures are likely to be cached in memory. Even in this scenario, there is 2–8× read amplification for Linux file systems.

Measure                ext2    ext4    xfs     f2fs    btrfs
File Create
  Write Cost (KB)      24      52      52      16      116
    fsync              4       28      4       4       68
    checkpoint         20      24      48      12      48
  Read Cost (KB)       24      24      32      36      40
  Space Cost (KB)      24      52      20      16      116
Directory Create
  Write Cost (KB)      28      64      80      20      132
    fsync              4       36      4       8       68
    checkpoint         24      28      76      12      64
  Read Cost (KB)       20      20      60      36      60
  Space Cost (KB)      28      64      54      20      132
File Rename
  Write Cost (KB)      12      32      16      20      648
    fsync              4       20      4       8       392
    checkpoint         8       12      12      12      256
  Read Cost (KB)       20      24      48      40      48
  Space Cost (KB)      12      32      16      20      392

Table 2: IO Cost for Metadata Operations. The table shows the read, write, and space IO costs incurred by different file systems for different metadata operations. The write cost is broken down into IO at the time of fsync(), and checkpointing IO performed later.

2.3.2 Metadata Operations

We now analyze the read and write IO (and space consumed) by different file-system operations. We present file create, directory create, and file rename. We have experimentally verified that the behavior of other metadata operations, such as file link, file deletion, and directory deletion, is similar to our presented results. Table 2 presents the results. Overall, we find that metadata operations are very expensive: even a simple file rename results in 12–648 KB being written to storage. On storage with limited write cycles, a metadata-intensive workload may wear out the storage quickly if any of these file systems are used.

In many file systems, there is a distinction between IO performed at the time of the fsync() call, and IO performed later in the background. The fsync() IO is performed in the critical path, and thus contributes to user-perceived latency. However, both kinds of IO ultimately contribute to write amplification. We show this breakdown for the write cost in Table 2.
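The breakdown in Table 2 can be checked with a few lines of Python: for every operation and file system, the fsync() and checkpoint components should add up to the total write cost. The numbers below are copied from Table 2; the helper name and data layout are hypothetical. The function also reports the share of write IO that falls on the fsync() critical path.

```python
FILE_SYSTEMS = ["ext2", "ext4", "xfs", "f2fs", "btrfs"]

# Write cost in KB, copied from Table 2, in the column order of FILE_SYSTEMS.
TABLE2_WRITE_KB = {
    "File Create": {
        "total": [24, 52, 52, 16, 116],
        "fsync": [4, 28, 4, 4, 68],
        "checkpoint": [20, 24, 48, 12, 48],
    },
    "Directory Create": {
        "total": [28, 64, 80, 20, 132],
        "fsync": [4, 36, 4, 8, 68],
        "checkpoint": [24, 28, 76, 12, 64],
    },
    "File Rename": {
        "total": [12, 32, 16, 20, 648],
        "fsync": [4, 20, 4, 8, 392],
        "checkpoint": [8, 12, 12, 12, 256],
    },
}


def critical_path_share(operation):
    """Fraction of write IO issued at fsync() time, per file system."""
    rows = TABLE2_WRITE_KB[operation]
    shares = {}
    for fs, total, fsync, ckpt in zip(
        FILE_SYSTEMS, rows["total"], rows["fsync"], rows["checkpoint"]
    ):
        # The two components should account for all of the write cost.
        assert fsync + ckpt == total, (operation, fs)
        shares[fs] = fsync / total
    return shares


for op in TABLE2_WRITE_KB:
    print(op, {fs: round(s, 2) for fs, s in critical_path_share(op).items()})
```

For a btrfs rename, for instance, roughly 60% of the write IO (392 of 648 KB) is on the critical path, while ext2, XFS, and F2FS defer most of their write IO to checkpointing.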

File Create. The workload creates a new file in a pre-existing directory of depth three (e.g., a/b/c) and calls fsync() on the parent directory to ensure the creation is persisted. File creation requires allocating a new inode and updating a directory, and thus requires 16–116 KB of write IO and 24–40 KB of read IO in the various file systems. F2FS is the most efficient in terms of write IO (but requires a lot of read IO). Overall, ext2 is the most efficient in performing file creations. ext2, XFS, and F2FS all strive to perform the minimum amount of IO in the fsync() critical path. Due to metadata journaling, ext4 writes 28 KB in the critical path. btrfs performs the worst, requiring 116 KB of write IO (68 KB in the critical path and, per Table 2, 48 KB in checkpointing IO) and 40 KB of read IO. The poor performance of btrfs results from having to update a number of data structures, including the file-system tree, the directory index, and backreferences to create a file [7].
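The create-then-fsync-the-parent pattern used by this workload can be sketched as follows. This is an assumed reconstruction of the workload in Python, not the authors' code; the function name is hypothetical. On Linux, opening a directory read-only and calling fsync() on its descriptor is the usual way to make a new directory entry durable.

```python
import os


def create_and_persist(parent_dir, name):
    """Create an empty file, then fsync() the parent directory so the new entry is durable."""
    path = os.path.join(parent_dir, name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)

    dir_fd = os.open(parent_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # persists the directory update, not just the file
    finally:
        os.close(dir_fd)
    return path
```

The workload in the text would call this once on a pre-existing directory such as a/b/c while the block trace is being recorded.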

Directory Create. The workload creates a new directory in an existing directory of depth four, and calls fsync() on the parent directory. Directory creation follows a similar trend to file creation. The main difference is the additional IO in creating the directory itself. As before, btrfs incurs the highest write IO cost and read IO cost for this workload. ext2 and F2FS are the most efficient.

File Rename. The workload renames a file within the same directory, and calls fsync() on the parent directory to ensure the rename is persisted. Renaming a file requires updating two directories. Performing rename atomically requires machinery such as journaling or copy-on-write. ext2 is the most efficient, requiring only 32 KB of IO overall. Renaming a file is a surprisingly complex process in btrfs. Apart from linking and unlinking files, renames also change the backreferences of the files involved. btrfs also logs the inode of every file and directory (from the root to the parent directory) involved in the operation. The root directory is persisted twice, once for unlink, and once for the link. As a result, btrfs is the least efficient, requiring 696 KB of IO to rename a single file. Even if many of these inodes are cached, btrfs renames are significantly less efficient than in other file systems.
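The 32 KB and 696 KB totals quoted above appear to be the sum of the write cost and read cost rows for File Rename in Table 2; that reading is an inference from the table rather than something the text states. A short check, with the values copied from Table 2:

```python
FILE_SYSTEMS = ["ext2", "ext4", "xfs", "f2fs", "btrfs"]
RENAME_WRITE_KB = [12, 32, 16, 20, 648]  # Table 2, File Rename, Write Cost
RENAME_READ_KB = [20, 24, 48, 40, 48]    # Table 2, File Rename, Read Cost

# Total IO per rename, assuming "IO overall" means write IO plus read IO.
rename_total_kb = {
    fs: write + read
    for fs, write, read in zip(FILE_SYSTEMS, RENAME_WRITE_KB, RENAME_READ_KB)
}

cheapest = min(rename_total_kb, key=rename_total_kb.get)
costliest = max(rename_total_kb, key=rename_total_kb.get)
print(rename_total_kb, cheapest, costliest)
```

Under this reading, ext2 and btrfs are the two extremes, matching the figures in the paragraph above.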

Macro-benchmark: Kernel Compilation. To provide a more complete picture of the IO amplification of file systems, we also measure IO amplification for a macro-benchmark: uncompressing a Linux kernel tarball, and compiling the kernel. The results are presented in Table 3. The file systems perform 6.09–6.41 GB of write IO and 0.23–0.27 GB of read IO. ext2 is the most efficient file system, achieving the lowest write and space cost. Among file systems providing crash-consistency guarantees, ext4 and XFS perform well, achieving lower write and space cost than the copy-on-write file systems F2FS and btrfs. btrfs performs the most write IO, and uses the most space on storage. The kernel compilation workload does not result in a lot of write amplification (or variation between file systems), because fsync() is not called often; thus each file system is free to group together operations to reduce IO and space cost. Even in this scenario, the higher write and space amplification of btrfs is observed.

Macro-benchmark: Filebench Varmail. We ran the Varmail benchmark from the Filebench benchmark suite [19] with the following parameters: 16 threads, total files 100K, mean file size 16 KB. Varmail simulates a mail server, and performs small writes followed by fsync() on different files using multiple threads. In this fsync()-heavy workload, we see that the effects of write, read, and space amplification are clear. ext2 still performs the least IO and uses the least storage space. btrfs performs 38% more write IO than ext2, and uses 39% more space on storage. F2FS performs better than btrfs, but has a high read cost (10× that of the other file systems).

Measure                ext2    ext4    xfs     f2fs    btrfs
Kernel Compilation
  Write Cost (GB)      6.09    6.19    6.21    6.38    6.41
  Read Cost (GB)       0.25    0.24    0.24    0.27    0.23
  Space Cost (GB)      5.94    6.03    5.96    6.2     6.25
Filebench Varmail
  Write Cost (GB)      1.52    1.63    1.71    1.82    2.10
  Read Cost (KB)       116     96      116     1028    0
  Space Cost (GB)      1.45    1.57    1.50    1.77    2.02

Table 3: IO Cost for Macro-benchmarks. The table shows the read, write, and space IO costs incurred by different file systems when compiling the Linux kernel 3.0 and when running the Varmail benchmark in the Filebench suite.
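The Varmail percentages quoted in the text follow directly from the Filebench Varmail rows of Table 3. A minimal check, with the helper name being our own:

```python
def percent_more(value, baseline):
    """Percentage by which value exceeds baseline."""
    return (value / baseline - 1.0) * 100.0


# Filebench Varmail rows of Table 3, in GB.
write_gb = {"ext2": 1.52, "btrfs": 2.10}
space_gb = {"ext2": 1.45, "btrfs": 2.02}

extra_write = percent_more(write_gb["btrfs"], write_gb["ext2"])
extra_space = percent_more(space_gb["btrfs"], space_gb["ext2"])
print(f"btrfs vs ext2: {extra_write:.0f}% more write IO, {extra_space:.0f}% more space")
```

The same helper applied to the Kernel Compilation rows gives much smaller differences, consistent with the observation that infrequent fsync() calls let each file system batch its updates.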

Discussion. IO and space amplification arise in Linux file systems from three sources: the block interface, crash-consistency techniques, and the need to maintain and update a large number of data structures on storage. Comparison of XFS and ext4 shows that even when the same crash-consistency technique (journaling) is used, the implementation leads to a significant difference in IO amplification. With byte-addressable non-volatile memory technologies arriving on the horizon, using such block-oriented file systems will be disastrous. We need to develop lean, efficient file systems where operations such as file renames will result in a few bytes written to storage, not tens to hundreds of kilobytes.

3 The CReWS Conjecture

Inspired by the RUM conjecture [28] from the world of key-value stores, we propose a similar conjecture for file systems: the CReWS conjecture. (Footnote 1: We spent some time trying to come up with something cool like RUM, but alas, this is the best we could do.)

The CReWS conjecture states that it is impossible for a general-purpose file system to provide strong crash (C)onsistency guarantees while simultaneously achieving low (R)ead amplification, (W)rite amplification, and (S)pace amplification.

By a general-purpose file system we mean a file system used by multiple applications on a shared storage device. If the file system can be customized for a single application on a dedicated storage device, we believe it is possible to achieve all four properties simultaneously.

For example, consider a file system designed specifically for an append-only log such as Corfu [29] (without the capability to delete blocks). The storage device is dedicated for the append-only log. In this scenario, the file system can drop all metadata and treat the device as a big sequential log; storage block 0 is block 0 of the append-only log, and so on. Since there is no metadata, the file system is consistent at all times implicitly, and there is low write, read, and space amplification. However, this only works if the storage device is completely dedicated to one application.

Note that we can extend our simple file system to a case where there are N applications. In this case, we would divide up the storage into N units, and assign one unit to each application. For example, let's say we divide up a 100 GB disk for 10 applications. Even if an application only used one byte, the rest of its 10 GB is not available to other applications; thus, this design leads to high space amplification.
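The cost of static partitioning can be made concrete with the numbers from the example above. In this sketch, space amplification for one application is taken to be the space reserved for it divided by the bytes it actually stores; the function name and the use of decimal gigabytes are assumptions made for illustration.

```python
GB = 10**9  # decimal gigabyte, assumed for this illustration


def partition_space_amplification(disk_bytes, num_apps, bytes_used):
    """Space amplification for one application that owns a fixed 1/N share of the disk."""
    reserved = disk_bytes // num_apps
    if not 0 < bytes_used <= reserved:
        raise ValueError("usage must be positive and fit in the partition")
    return reserved / bytes_used


# 100 GB disk split statically among 10 applications.
one_byte = partition_space_amplification(100 * GB, 10, 1)    # one byte stored
full = partition_space_amplification(100 * GB, 10, 10 * GB)  # partition fully used
```

Only an application that fills its entire partition reaches a space amplification of one; any lighter user pays for the unused remainder of its share.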

In general, if multiple applications want to share a single storage device without space amplification, dynamic allocation is required. Dynamic allocation necessitates metadata keeping track of resources which are available; if file data can be dynamically located, metadata such as the inode is required to keep track of the data locations. The end result is a simple file system such as ext2 [24] or NoFS [15]. While such systems offer low read, write, and space amplification, they compromise on consistency: ext2 does not offer any guarantees on a crash, and a crash during a file rename on NoFS could result in the file disappearing.

File systems that offer strong consistency guarantees such as ext4 and btrfs incur significant write amplification and space amplification, as we have shown in previous sections. Thus, to the best of our knowledge, the CReWS conjecture is true.

Implications. The CReWS conjecture has useful implications for the design of storage systems. If we seek to reduce write amplification for a specific application such as a key-value store, it is essential to sacrifice one of the above aspects. For example, by specializing the file system to a single application, it is possible to minimize the three amplification measures. For applications seeking to minimize space amplification, the file system design might sacrifice low read amplification or strong consistency guarantees. For non-volatile memory file systems [30, 31], given the limited write cycles of non-volatile memory [32], file systems should be designed to trade space amplification for write amplification; given the high density of non-volatile memory technologies [20, 33, 34, 35, 36], this should be acceptable. Thus, given a goal, the CReWS conjecture focuses our attention on possible avenues to achieve it.

4 Conclusion

We analyze the read, write, and space amplification of five Linux file systems. We find that all examined file systems have high write amplification (2–32×) and read amplification (2–13×). File systems that use crash-consistency techniques such as journaling and copy-on-write also suffer from high space amplification (2–30×). Metadata operations such as file renames have large IO cost, requiring 32–696 KB of IO for a single rename.

Based on our results, we present the CReWS conjecture: that a general-purpose file system cannot simultaneously achieve low read, write, and space amplification while providing strong consistency guarantees. With the advent of byte-addressable non-volatile memory technologies, we need to develop leaner file systems without significant IO amplification: the CReWS conjecture will hopefully guide the design of such file systems.

References

  • [1] Marshall K McKusick, William N Joy, Samuel J Leffler, and Robert S Fabry. A fast file system for unix. ACM Transactions on Computer Systems (TOCS), 2(3):181–197, 1984.
  • [2] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. The new ext4 filesystem: current status and future plans. In Proceedings of the Linux symposium, volume 2, pages 21–33. Citeseer, 2007.
  • [3] Robert Hagmann. Reimplementing the Cedar file system using logging and group commit. In SOSP, 1987.
  • [4] Dave Hitz, James Lau, and Michael Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the USENIX Winter Technical Conference (USENIX Winter ’94), San Francisco, California, January 1994.
  • [5] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. The New Ext4 filesystem: Current Status and Future Plans. In Ottawa Linux Symposium (OLS ’07), Ottawa, Canada, July 2007.
  • [6] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, volume 15, 1996.
  • [7] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 9(3):9, 2013.
  • [8] Michael A Bender, Martin Farach-Colton, Jeremy T Fineman, Yonatan R Fogel, Bradley C Kuszmaul, and Jelani Nelson. Cache-oblivious streaming B-trees. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 81–92. ACM, 2007.
  • [9] Leonardo Marmol, Swaminathan Sundararaman, Nisha Talagala, and Raju Rangaswami. NVMKV: a scalable, lightweight, FTL-aware key-value store. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 207–219, 2015.
  • [10] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 71–82, 2015.
  • [11] Russell Sears and Raghu Ramakrishnan. bLSM: a general purpose log structured merge tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 217–228. ACM, 2012.
  • [12] Pradeep J Shetty, Richard P Spillane, Ravikant R Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. Building workload-independent storage with VT-trees. In 11th USENIX Conference on File and Storage Technologies (FAST 13), pages 17–30, 2013.
  • [13] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. WiscKey: separating keys from values in SSD-conscious storage. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 133–148, 2016.
  • [14] Percona TokuDB. https://www.percona.com/software/mysql-database/percona-tokudb.
  • [15] Vijay Chidambaram, Tushar Sharma, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Consistency Without Ordering. In Proceedings of the 10th USENIX Symposium on File and Storage Technologies (FAST ’12), pages 101–116, San Jose, California, February 2012.
  • [16] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Optimistic Crash Consistency. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP ’13), Farmington, PA, November 2013.
  • [17] Thanumalayan Sankaranarayana Pillai, Ramnatthan Alagappan, Lanyue Lu, Vijay Chidambaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Application crash consistency and performance with CCFS. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 181–196, Santa Clara, CA, 2017. USENIX Association.
  • [18] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, et al. BetrFS: A right-optimized write-optimized file system. In FAST, pages 301–315, 2015.
  • [19] Andrew Wilson. The new and improved filebench. In 6th USENIX Conference on File and Storage Technologies (FAST 08), 2008.
  • [20] Simone Raoux, Geoffrey W. Burr, Matthew J. Breitwisch, Charles T. Rettner, Y. C. Chen, Robert M. Shelby, Martin Salinga, Daniel Krebs, S. H. Chen, H.L. Lung, and C. H. Lam. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development, 52(4.5):465–479, 2008.
  • [21] Block I/O Layer Tracing. https://linux.die.net/man/8/blktrace, December 2016.
  • [22] Generating System Resource Statistics. https://linux.die.net/man/1/dstat, December 2016.
  • [23] Reporting I/O Statistics. https://linux.die.net/man/1/iostat, December 2016.
  • [24] Remy Card, Theodore Ts’o, and Stephen Tweedie. Design and Implementation of the Second Extended Filesystem. In First Dutch International Symposium on Linux, Amsterdam, Netherlands, December 1994.
  • [25] Changman Lee, Dongho Sim, Joo Young Hwang, and Sangyeun Cho. F2FS: A new file system for flash storage. In FAST, pages 273–286, 2015.
  • [26] Mendel Rosenblum and John K Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.
  • [27] Artem B Bityutskiy. JFFS3 design issues, 2005.
  • [28] Manos Athanassoulis, Michael S Kester, Lukas M Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. Designing access methods: The RUM conjecture. In International Conference on Extending Database Technology, pages 461–466, 2016.
  • [29] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D Davis. CORFU: A shared log design for flash clusters. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 1–14, 2012.
  • [30] Jian Xu and Steven Swanson. NOVA: a log-structured file system for hybrid volatile/non-volatile main memories. In FAST, 2016.
  • [31] Subramanya R Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems, page 15. ACM, 2014.
  • [32] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In ACM SIGARCH Computer Architecture News, volume 37, pages 14–23. ACM, 2009.
  • [33] Chun Jason Xue, Youtao Zhang, Yiran Chen, Guangyu Sun, J Jianhua Yang, and Hai Li. Emerging non-volatile memories: opportunities and challenges. In Proceedings of the seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 325–334, 2011.
  • [34] Yenpo Ho, Garng M Huang, and Peng Li. Nonvolatile Memristor Memory: Device Characteristics and Design Implications. In Proceedings of the 2009 International Conference on Computer-Aided Design, pages 485–490. ACM, 2009.
  • [35] Dmitri B. Strukov, Gregory S. Snider, Duncan R. Stewart, and R. Stanley Williams. The missing memristor found. Nature, 2008.
  • [36] Leon Chua. Resistance switching memories are memristors. Applied Physics A, 102(4):765–783, 2011.