Flash solid-state drives (SSDs) are widely used to deploy persistent data storage in data centers, both in bare-metal and cloud deployments [mcdipper, fatcache, rocksdb, leveldb, ssd-dc1, ssd-dc2, schroeder:fast:2016], while also being an integral part of public cloud offerings [ibm:ssd, aws:ssd, google:ssd]. While SSDs with novel technologies such as 3D Xpoint [micron, optane] offer significant advantages [lepers:sosp:2019, kourtis:fast:2019, wu:damon:2019], they are not expected to replace flash-based SSDs anytime soon. These newer technologies are not yet as mature from a technology density standpoint, and have a high cost per GB, which significantly hinders their adoption. Hence, flash SSDs are expected to be the storage medium of choice for many applications in the short and medium term future [future].
Persistent tree data structures (PTSes) are widely used to index large datasets. PTSes are particularly appealing because, as opposed to hash-based structures, they store data in sorted order, thus enabling efficient implementations of range queries and iterators over the dataset. Examples of PTSes are the log-structured merge (LSM) tree [oneil:lsm:1996], used, e.g., by RocksDB [rocksdb] and BigTable [chang:tocs:2008]; the B+Tree [comer:csur:1979], used, e.g., by Db2 [db2] and WiredTiger [wiredtiger] (which is the default storage engine in MongoDB [mongodb]); the Bw-tree [Levandoski:sigmod:2013], used, e.g., by Hekaton [diaconu:sigmod:2013] and Peloton [pavlo:cidr:2017, peloton]; and the Bε-tree [brodal:soda:2003], used, e.g., by Tucana [Papagiannis:atc:2016] and TokuMX [tokumx].
Over the last decade, due to the reduction in the prices of flash memory, PTSes have been increasingly deployed over flash SSDs [price1, price2]. Not only do PTSes use flash SSDs as a drop-in replacement for hard disk drives, but new PTS designs are specifically tailored to exploit the capabilities of flash SSDs and their internal architectures [wang:bwtree:sigmod18, wang:eurosys:2014, lu:tos:2017, shen:tos:2018, trivedi:tos:2018].
Benchmarking PTSes on flash SSDs. Given their ubiquity, accurately and fairly evaluating the performance of PTSes on flash SSDs is of paramount importance for industry and research alike when comparing alternative designs. Unfortunately, as we show in this paper, this task is complex and prone to subtle pitfalls that can lead to inaccurate and non-reproducible performance assessments.
The reason for this complexity is the intertwined effects of the internal dynamics of flash SSDs and of the PTS implementations. On the one hand, flash SSDs employ firmware logic to deal with the idiosyncrasies of the underlying flash memory, which results in highly non-linear dynamics [Hu:systor:2009, stoica:vldb2013, desnoyer:tocs:2014, stoica:mascots:2019]. On the other hand, PTSes implement complex operations (e.g., compactions in LSM-Trees and rebalancing in B+Trees) and update policies (e.g., log-structured vs. in-place). These design choices are known to lead to performance that is hard to analyze [lsm-bush:sigmod19]. They also result in widely different access patterns towards the underlying SSDs, thus leading to complex, intertwined performance dynamics.
Contributions. In this paper, we identify seven benchmarking pitfalls which relate to different aspects of the evaluation of a PTS deployed on a flash SSD. Furthermore, we provide guidelines on how to avoid these pitfalls. We provide specific suggestions both to academic researchers, to improve the fairness, completeness and reproducibility of their results, and to performance engineers, to help them identify the most efficient and cost-effective PTS for their workload and deployment.
In brief, the pitfalls we describe and their consequences on the PTS benchmarking process are the following:
Running short tests. Flash SSDs have time-evolving performance dynamics. Short tests lead to results that are not representative of the long-term application performance.
Ignoring device write amplification (WA-D). SSDs perform internal garbage collection that leads to write amplification. Ignoring this metric leads to inaccurate measurements of the I/O efficiency of a PTS.
Ignoring internal state of the SSD. Experimental results may significantly vary depending on the initial state of the drive. This pitfall leads to unfair and non-reproducible performance measurements.
Ignoring the effect of the dataset size on SSD performance. SSDs will exhibit different performance depending on the amount of data stored. This pitfall leads to biased evaluations.
Ignoring the extra storage capacity that a PTS needs. This pitfall leads to ignoring the storage versus performance trade-off of a PTS, which is methodologically wrong and can result in sub-optimal deployments in production systems.
Ignoring software overprovisioning. Performance of SSDs can be improved by overprovisioning storage space. This pitfall leads to ignoring the storage versus performance trade-off achievable by a PTS.
Ignoring the effect of the underlying storage technology on performance. This pitfall leads to drawing quantitative and qualitative conclusions that do not hold across different SSD types.
We present experimental evidence of these pitfalls using an LSM-Tree and a B+Tree, the two most widely used PTSes, which are also at the basis of several other PTS designs [bortnikov:vldb:2018, ren:vldb:2017, luo:vldb:2020]. In particular, we consider the LSM-Tree implementation of RocksDB, and the B+Tree implementation of WiredTiger. We use these two systems since they are widely adopted in production systems and research studies.
The storage community has studied the performance of flash SSDs extensively, focusing on understanding, modeling, and benchmarking their performance [stoica:vldb2013, desnoyer:tocs:2014, tavakkol:fast:2018, ioannou:mascots:2018]. Despite this, we have found that many works from the systems and databases communities are not aware of the pitfalls in evaluating PTSes on SSDs. Our work offers a list of factors to consider, and bridges this gap between these communities by illustrating the intertwined effects of the complex dynamics of PTSes and flash SSDs. Ultimately, we hope that our work paves the way for a more rigorous, fair, and reproducible benchmarking of PTSes on flash SSDs.
Outline. The remainder of the paper is organized as follows. Section 2 provides background on the internals of flash SSDs, LSM-Trees and B+Trees. Section 3 describes the experimental setup of our study. Section 4 presents the results of our experimental analysis and discusses the benchmarking pitfalls that we identify. Section 5 discusses related work. Section 6 concludes the paper.
Section 2.1 provides an overview of the PTSes that we use to demonstrate the benchmarking pitfalls, namely LSM-Trees and B+Trees. Section 2.2 provides background on the key characteristics of flash-based SSDs that are related to the benchmarking pitfalls we describe. Figure 1 shows the flow of write operations from an application deployed over a PTS to the flash memory of an SSD.
2.1 Persistent Tree Data Structures
We start by introducing the two PTSes that we use in our experimental study, and two key metrics of the design of a PTS.
LSM-Trees [oneil:lsm:1996] have two main components: an in-memory component, called memtable, and a disk component. Incoming writes are buffered in the memtable. Upon reaching its maximum size, the memtable is flushed onto the disk component, and a new, empty memtable is set up. The disk component is organized in levels $L_0, L_1, \ldots, L_N$, with progressively increasing sizes. $L_0$ stores the memtables that have been flushed. Each level $L_i$, $i > 0$, organizes data in sorted files that store disjoint key ranges. When $L_i$ is full, part of its data is pushed to $L_{i+1}$ through an operation called compaction. Compaction merges data from one level to the next, and discards older versions of duplicate keys, to maintain disjoint key ranges in each level.
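The write path described above can be illustrated with a minimal in-memory sketch. It is a toy model, not the RocksDB implementation: the class name `ToyLSM`, its parameters, and the choice of pushing an entire overflowing level down (rather than part of it) are assumptions made for brevity.

```python
class ToyLSM:
    """Toy LSM-Tree: a memtable, a list of flushed memtables (L0), and deeper levels."""

    def __init__(self, memtable_limit=2, l0_limit=2, growth=2):
        self.memtable_limit = memtable_limit  # keys buffered before a flush
        self.l0_limit = l0_limit              # flushed memtables tolerated in L0
        self.growth = growth                  # size ratio between adjacent levels
        self.memtable = {}
        self.l0 = []                          # flushed memtables, newest last
        self.levels = []                      # levels[0] is L1; one dict per level

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.l0.append(self.memtable)     # flush to the disk component
            self.memtable = {}
            if len(self.l0) > self.l0_limit:
                self._compact()

    def _capacity(self, idx):
        # Capacity (in keys) of level idx+1, growing geometrically with depth.
        return self.memtable_limit * self.l0_limit * self.growth ** (idx + 1)

    def _compact(self):
        incoming = {}
        for table in self.l0:                 # oldest first: newer versions win
            incoming.update(table)
        self.l0 = []
        idx = 0
        while incoming:
            if idx == len(self.levels):
                self.levels.append({})
            merged = dict(self.levels[idx])   # older data already in this level
            merged.update(incoming)           # newer data overwrites duplicates
            if len(merged) > self._capacity(idx):
                self.levels[idx] = {}         # level is full: push data down
                incoming = merged
                idx += 1
            else:
                self.levels[idx] = merged
                incoming = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.l0):       # newest flushed memtable first
            if key in table:
                return table[key]
        for level in self.levels:             # shallower levels hold newer data
            if key in level:
                return level[key]
        return None


db = ToyLSM()
for i in range(20):
    db.put(i % 5, i)                          # repeatedly overwrite 5 keys
print(db.get(3))
```

Every flush and every merge in `_compact` rewrites data that the application already wrote once, which is the source of the application-level write amplification discussed later in this section.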
B+Trees [comer:csur:1979] are composed of internal nodes and leaf nodes. Leaf nodes store key-value data. Internal nodes contain information needed to route a request for a target key to the corresponding leaf. Writing a key-value pair entails writing to the appropriate leaf node. A key-value write may also involve modifying the internal nodes to update routing information, or to perform re-balancing of the tree.
2.1.3 Application-level write amplification
Internal operations such as flushing and compactions in LSM-Trees, and internal node updating in B+Trees, incur extra writes to persistent storage. These extra writes are detrimental to performance because they compete with key-value write operations for the SSD bandwidth, resulting in lower overall throughput and higher latencies [sears:sigmod:2012, balmau:atc:2017, lepers:sosp:2019, luo:vldb:2020]. We define application-level write amplification (WA-A) as the ratio between the overall data written by the PTS (which considers both application data and internal operations) and the amount of application data written. WA-A is depicted in the left part of Figure 1.
2.1.4 Space amplification
A PTS may require additional capacity of the drive, beyond what is needed to store the latest value associated with each key. LSM-Trees, for example, may store multiple values of the same key in different levels (the latest of which is at the lowest level that contains the key). B+Trees store routing information in internal nodes, and may reserve some buffers to implement particular update policies [Wiredtiger:cow]. Space amplification captures the amount of extra capacity needed by a PTS, and it is defined as the ratio between the number of bytes that the PTS occupies on the drive and the size of the application’s key-value dataset.
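The two PTS-level ratios defined in this section can be written down directly from byte counters. The sketch below is only a restatement of the definitions; the function names and the example numbers are hypothetical and are not measurements from this study.

```python
def write_amplification_app(pts_bytes_written, app_bytes_written):
    """WA-A: bytes written by the PTS divided by bytes written by the application."""
    if app_bytes_written <= 0:
        raise ValueError("application bytes must be positive")
    return pts_bytes_written / app_bytes_written


def space_amplification(bytes_on_drive, dataset_bytes):
    """Bytes occupied by the PTS on the drive divided by the KV dataset size."""
    if dataset_bytes <= 0:
        raise ValueError("dataset size must be positive")
    return bytes_on_drive / dataset_bytes


# Hypothetical counters, in GB, chosen only to illustrate the ratios.
print(write_amplification_app(1200, 100))
print(space_amplification(300, 200))
```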
2.2 Flash SSDs
In this section we describe the internal architecture of flash SSDs, as well as key concepts relevant to their performance dynamics.
Flash-based SSDs organize data in pages, which are combined in blocks. A prominent characteristic of flash memory is that pages do not support in-place updates. A page needs to be erased before it can be programmed (i.e., set to a new value). The erase operation is performed at the block level, so as to amortize its cost.
Flash translation layers (FTLs) [ftl:csur] hide such idiosyncrasy of the medium. In general, an FTL performs writes out-of-place, in a log-structured fashion, and maintains a mapping between logical and physical addresses. When space runs out, a garbage collection process selects some memory blocks, relocates their valid data, and then erases the blocks.
In the remainder of the paper, for the sake of brevity, we use the term SSD to refer to a flash SSD.
Over-provisioning is a key technique to enable garbage collection and to reduce the amount of data that it relocates. Over-provisioning means adding extra capacity to the SSD to store extra blocks used for garbage collection. The more an SSD is over-provisioned, the lower the amount of valid data that needs to be relocated upon performing garbage collection, and, hence, the higher the performance of the SSD. SSD manufacturers always over-provision SSDs by a certain amount. The user can further implement software over-provisioning of the SSD by erasing its blocks and enforcing that a portion of the logical block address (LBA) space is never written. This is achieved by restricting the addresses written by the host application, either programmatically, or by reserving a partition of the disk without ever writing to it.
2.2.3 Device-level write amplification
Garbage collection reduces the performance of the SSD as it leads to internal re-writing of data in an SSD. We define device-level write amplification (WA-D) as the ratio between the amount of data written to flash memory (including the writes induced by garbage collection) and the amount of host data sent to the device. WA-D is depicted in the right part of Figure 1.
3 Experimental Setup
This section describes our experimental setup, which includes the PTS systems we benchmark, the hardware on which we deploy them, and the workloads we consider.
We consider two key-value (KV) stores: RocksDB [rocksdb], which implements an LSM-Tree, and WiredTiger [wiredtiger], which implements a B+Tree. Both are mature systems widely used on their own and as building blocks of other data management systems. We configure RocksDB and WiredTiger to use small (10 MB) in-memory page caches and direct I/O, so that the dataset does not fit into RAM, and both KV and internal operations are served from secondary storage.
Unless stated otherwise, we use the following workload in our tests. The dataset is composed of 50M KV pairs, with 16-byte keys and 4000-byte values. The size of the dataset is 200 GB, which represents 50% of the capacity of the storage device. Before each experiment we ingest all KV pairs in sequential order. We consider a write-only workload, where one user thread updates existing KV pairs according to a uniform random distribution. We focus on a write workload as it is the most challenging to benchmark accurately, both for the target data structures and for the SSDs. We use a single user thread to avoid the performance dynamics caused by concurrent data accesses. We also consider variations of this workload to show that the pitfalls we describe apply to a broad class of workloads.
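As a sanity check on the default workload, the dataset size follows from the number of KV pairs and their size. The sketch below assumes decimal units (1 GB = 10^9 bytes) and uses the 400 GB drive capacity reported with the hardware description; the constant names are illustrative.

```python
NUM_PAIRS = 50_000_000          # 50M KV pairs
KEY_BYTES = 16
VALUE_BYTES = 4000
DRIVE_BYTES = 400 * 10**9       # assumes decimal GB for the 400 GB drive

dataset_bytes = NUM_PAIRS * (KEY_BYTES + VALUE_BYTES)
drive_fraction = dataset_bytes / DRIVE_BYTES

print(f"dataset: {dataset_bytes / 10**9:.1f} GB")
print(f"fraction of drive capacity: {drive_fraction:.2f}")
```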
To demonstrate our pitfalls, we analyze several application, system and hardware performance metrics.
i) KV store throughput, i.e., the number of operations per second completed by the KV store.
ii) Device throughput, i.e., the amount of data written per second to the drive as observed by the OS. The device throughput is often used to measure the capability of a system to utilize the available I/O resources [lepers:sosp:2019]. We measure device throughput using iostat.
iii) Application-level write amplification (WA-A), which we measure by taking the ratio between the device write throughput and the product of the KV store throughput and the size of a KV pair. By using the device write throughput, we also factor in the write overhead posed by the filesystem. We assume such overhead to be negligible with respect to the amplification caused by the PTS itself.
iv) Device-level write amplification (WA-D), which we measure via SMART attributes of the device.
v) Space amplification, which we obtain by taking the ratio between the total disk utilization and the cumulative size of the KV pairs in the dataset. This metric also factors in the overhead posed by the filesystem, which is negligible with respect to the multi-GB datasets that we consider.
For the sake of readability of the plots, unless stated otherwise, we report 10-minute average values when plotting the evolution of a performance metric over time.
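The ratio-based metrics in the list above can be sketched as follows, using the WA-A and WA-D definitions of Sections 2.1.3 and 2.2.3. The function names are hypothetical. The SMART counters are passed in as plain numbers because the attributes that expose host and flash writes are vendor-specific; reading them is outside the scope of this sketch.

```python
def wa_a_from_throughput(device_write_bytes_per_s, kv_ops_per_s, kv_pair_bytes):
    """WA-A estimated from device write throughput and KV store throughput."""
    app_bytes_per_s = kv_ops_per_s * kv_pair_bytes
    if app_bytes_per_s <= 0:
        raise ValueError("application write rate must be positive")
    return device_write_bytes_per_s / app_bytes_per_s


def wa_d_from_counters(flash_before, flash_after, host_before, host_after):
    """WA-D from two samples of (flash bytes written, host bytes written) counters."""
    host_delta = host_after - host_before
    if host_delta <= 0:
        raise ValueError("no host writes between the two samples")
    return (flash_after - flash_before) / host_delta


# Hypothetical samples: 3000 ops/s of 4016-byte pairs, 120.48 MB/s at the device.
print(wa_a_from_throughput(120_480_000, 3000, 4016))
print(wa_d_from_counters(1000, 1300, 500, 700))
```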
3.4 State of the drive
We experiment with two different initial conditions of the internal state of the drive.
Trimmed. All blocks of the device are erased (using the blkdiscard utility). Hence, initial writes are stored directly into free blocks and do not incur additional overhead (no WA-D occurs), while updates after the free blocks are exhausted incur internal garbage collection. A trimmed device exhibits performance dynamics close (i.e., modulo the wear of the storage medium) to the ones of a mint factory-fresh device. This setting is representative of bare-metal standalone deployments, where a system administrator can reset the state of the drive before deploying the KV store, and the drive is not shared with other applications.
Preconditioned. The device undergoes a preliminary writing phase so that its internal state resembles the state of a device that has been in use. To this end, we first write the whole drive sequentially, to ensure that all logical addresses have associated data. Then, we issue random writes for an amount of bytes that is twice the size of the disk, so as to trigger garbage collection. In this setting even the first write operation issued by an application towards any page is effectively an over-write operation. This setting is representative of consolidated deployments, e.g., public clouds, where multiple applications share the same physical device, or standalone deployments with an aged filesystem.
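The two initial states can be expressed as command sequences. The sketch below only builds the commands and does not run them, since running them destroys all data on the target device. The use of fio, the 128k block size, the queue depth, and the device path are assumptions of this sketch; the text above only names blkdiscard and does not state which tool issues the preconditioning writes.

```python
def trim_command(device):
    """Command that discards (trims) all blocks of the device."""
    return ["blkdiscard", device]


def precondition_commands(device, block_size="128k"):
    """Sequential fill of the whole device, then random writes for twice its size."""
    common = [
        "fio", "--name=precondition", f"--filename={device}",
        "--direct=1", "--ioengine=libaio", "--iodepth=32",
        f"--bs={block_size}", "--size=100%",
    ]
    sequential_fill = common + ["--rw=write"]
    # Two passes of random writes over the full LBA range to trigger garbage collection.
    random_overwrite = common + ["--rw=randwrite", "--loops=2"]
    return [sequential_fill, random_overwrite]


if __name__ == "__main__":
    device = "/dev/nvmeXnY"  # hypothetical placeholder, replace with the test drive
    print(" ".join(trim_command(device)))
    for cmd in precondition_commands(device):
        print(" ".join(cmd))
```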
These two configurations represent the two endpoints of the spectrum of the possible initial conditions of the drive, concerning the state of the drive’s blocks. In a real-world deployment the initial conditions of the drive would be somewhere in-between these two endpoints.
We use a machine equipped with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 physical cores, without hyper-threading) and 126 GB of RAM. The machine runs Ubuntu 18.04 with a 4.15 generic Linux kernel. The machine’s persistent storage device is a 400 GB Intel p3600 enterprise-class SSD [intel:p3600]. Unless stated otherwise, we set up a single partition on the drive, which takes the whole available space. We mount an ext4 filesystem configured with the nodiscard parameter [intel-nvme].
4 Benchmarking Pitfalls
This section discusses the benchmarking pitfalls in detail. For each pitfall we first give a brief description; then discuss the pitfall in depth, by providing experimental evidence that demonstrates the pitfall itself and its implications on the performance evaluation process; and finally outline guidelines on how to avoid the pitfall in research studies and production systems.
4.1 Steady-state vs. bursty performance
Pitfall 1: Running short tests. Because both PTS and SSD performance vary over time, short-lived tests are unable to capture how the systems will behave under a continuous (non-bursty) workload.
Discussion. Figure 6 shows the KV store and device throughput (top), and the WA-D and WA-A (bottom) over time, for RocksDB (left) and WiredTiger (right). These results refer to running the two systems on a trimmed SSD. The plots do not show the performance of the systems during the initial loading phase. The pictures show that, during a performance test, a PTS exhibits an initial transient phase during which the dynamics and the performance indicators may widely differ from the ones observed at steady-state. Hence, taking early measurements of a data store’s performance may lead to a substantially wrong performance assessment.
Figure 6(a) shows that measuring the performance of RocksDB in the first 15 minutes would report a throughput between 8 and 11 KOps/s, which is 2.6 to 3.6 times higher than the 3 KOps/s that RocksDB is able to sustain at steady-state. In the first 15 minutes, the device throughput of RocksDB is between 300 and 375 MB/s, which is more than 3 times the throughput sustained at steady-state.
Figure 6(c) sheds light on the causes of such performance degradation. The performance of RocksDB decreases over time because of the increased write amplification, both at the PTS and device level. WA-A increases over time while the levels of the LSM-Tree fill up, and its curve flattens once the layout of the LSM-Tree has stabilized. WA-D increases over time because of the effect of garbage collection. The initial value of WA-D is close to one, because the SSD is initially trimmed, and keys are ingested in order during the initial data loading, which results in RocksDB issuing sequential writes to the drive. The compaction operations triggered over time, instead, do not result in sequential writes to the SSD flash cells, which ultimately leads to a WA-D slightly higher than 2.
WiredTiger exhibits performance degradation as well, as shown in Figure 6(b). The performance reduction in WiredTiger is lower than in RocksDB for three reasons. First, WA-A is stable over time, because updating the B+Tree to accommodate new application writes incurs an amount of extra writes that does not change over time. Second, the increase in WA-D is lower than in RocksDB: WA-D reaches at most the value of 1.7, and converges to 1.5. We analyze the WA-D of WiredTiger in more detail in the next section. Third, WiredTiger is less sensitive to the performance of the underlying device because of synchronization and CPU overheads [lepers:sosp:2019].
Guidelines. Researchers and practitioners should distinguish between steady-state and bursty performance, and prioritize reporting the former. In light of the results portrayed by Figure 6, we advocate that, to detect steady-state behavior, one should implement a holistic approach that encompasses application-level throughput, WA-A, and WA-D. Techniques such as CUSUM [cusum] can be used to detect that the values of these metrics do not change significantly for a long enough period of time.
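As an illustration of the detection step, the following is a simplified two-sided CUSUM over a series of metric samples (for example, 10-minute throughput averages). It is a sketch under stated assumptions rather than the procedure used in this study: the reference mean is re-estimated from the first `warmup` samples after each alarm, and `slack`, `threshold`, `warmup`, and `min_stable` are hypothetical tuning parameters. The sample series is invented.

```python
def cusum_alarms(samples, slack, threshold, warmup=3):
    """Return the indices at which a two-sided CUSUM flags a shift in the mean."""
    alarms = []
    start = 0                      # first sample of the current reference window
    pos = neg = 0.0                # cumulative upward / downward deviations
    for i, x in enumerate(samples):
        if i - start < warmup:
            continue               # still collecting samples for the reference mean
        ref = sum(samples[start:start + warmup]) / warmup
        pos = max(0.0, pos + (x - ref - slack))
        neg = max(0.0, neg + (ref - x - slack))
        if pos > threshold or neg > threshold:
            alarms.append(i)
            start, pos, neg = i, 0.0, 0.0   # restart from the changed regime
    return alarms


def steady_state_start(samples, slack, threshold, min_stable, warmup=3):
    """Index after which no shift is detected for at least min_stable samples."""
    alarms = cusum_alarms(samples, slack, threshold, warmup)
    last = alarms[-1] if alarms else 0
    return last if len(samples) - last >= min_stable else None


# Invented throughput series (KOps/s) that decays and then flattens.
series = [11, 10, 9, 8, 6, 5, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3]
print(cusum_alarms(series, slack=0.5, threshold=2))
print(steady_state_start(series, slack=0.5, threshold=2, min_stable=6))
```

Under the holistic approach advocated above, the same check would be applied to throughput, WA-A, and WA-D, and steady state declared only once all three series are stable.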
To measure throughput we suggest using an average over long periods of time, e.g., in the order of ten minutes. In fact, it is well known that PTSes are prone to exhibit large performance variations over short periods of time [sears:sigmod:2012, luo:vldb:2019, lepers:sosp:2019]. Furthermore, we suggest expressing the WA-A at time $t$ as the ratio between the cumulative host writes up to time $t$ and the cumulative application writes up to time $t$, consistently with the definition of WA-A in Section 2.1.3. This is aimed at avoiding the oscillations that would be obtained if measuring write amplification over small time windows. Finally, if WA-D cannot be computed directly from SMART attributes, then we suggest, as a rule of thumb, considering the SSD as having reached steady-state after the cumulative host writes amount to at least 3 times the capacity of the drive. The first device write ensures that the drive is filled once, so that each block has data associated with it. The second write triggers garbage collection, which overwrites the block layout induced by the initial filling of the dataset. The third write ensures that the garbage collection process approaches steady state, and is needed also to account for the (possibly unknown) amount of hardware extra capacity of the SSD (which makes the actual capacity of the drive higher than the nominal capacity exposed to the application).
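The cumulative form of WA-A and the rule of thumb on host writes can be sketched as below. The sketch follows the WA-A definition of Section 2.1.3 (bytes written by the PTS to the device over bytes written by the application); the function names and the per-interval samples are hypothetical.

```python
from itertools import accumulate


def cumulative_wa_a(host_bytes_per_interval, app_bytes_per_interval):
    """WA-A at each sampling instant, computed over all writes since the test began."""
    host_cum = accumulate(host_bytes_per_interval)
    app_cum = accumulate(app_bytes_per_interval)
    return [h / a for h, a in zip(host_cum, app_cum)]


def reached_rule_of_thumb(cumulative_host_bytes, drive_capacity_bytes, factor=3):
    """True once the host has written at least `factor` times the drive capacity."""
    return cumulative_host_bytes >= factor * drive_capacity_bytes


# Invented per-interval counters: the per-interval ratios (10, 30, 10) oscillate,
# while the cumulative ratio moves more smoothly.
print(cumulative_wa_a([10, 30, 20], [1, 1, 2]))
print(reached_rule_of_thumb(1200 * 10**9, 400 * 10**9))
```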
4.2 Analysis of WA-D
Pitfall 2: Not analyzing WA-D. Overlooking WA-D leads to partial or even inaccurate performance analysis.
Discussion. Many evaluations only consider WA-A in their analysis, which can lead to inaccurate conclusions. We advocate considering WA-D when discussing the performance of a PTS for three main reasons, in addition to its being fundamental to identifying steady state, as discussed previously.
WA-D directly affects the throughput of the device, which strongly correlates with the application-level throughput. Analyzing WA-D explains performance variations that cannot be inferred by the analysis of the WA-A alone. Figure 6(b) shows that WiredTiger exhibits a throughput drop at around the 50th minute mark, despite the fact that Figure 6(d) shows no variations in WA-A. Figure 6(d) shows that at the 50th minute mark WA-D increases from its initial value of 1, indicating that the SSD has run out of clean blocks, and the garbage collection process has started. This increase in WA-D explains the reduction in SSD throughput, which ultimately determines the drop in the throughput achieved by WiredTiger.
The analysis of WA-D also contributes to explaining the performance drops in RocksDB, depicted in Figure 6(a). Throughout the test, the KV throughput drops by a factor of 4, from 11 to 2.5 KOps/s. Such a large degradation is not consistent with the 2× increase of WA-A and the slightly increased CPU overhead caused by internal LSM-Tree operations (most of the CPUs are idle throughout the test). The doubling of the WA-D explains the large device throughput degradation, which contributes to the application-level throughput loss.
WA-D is an essential measure of the I/O efficiency of a PTS. One needs to multiply WA-A by WA-D to obtain the end-to-end write amplification (from application to memory cells) incurred by a PTS on flash. This is the write amplification value that should be used to quantify the I/O efficiency of a PTS on flash, and its implications on the lifetime of an SSD. Focusing on WA-A alone, as done in the vast majority of PTS performance studies, may lead to incorrect conclusions. For example, Figure 6 (bottom) shows that RocksDB incurs a steady-state WA-A of 12, which is higher than the WA-A achieved by WiredTiger by a factor of 1.2. However, the end-to-end write amplification of RocksDB is 25, which is 2.1× higher than WiredTiger’s.
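The end-to-end figure quoted for RocksDB can be reproduced from its two factors. The sketch uses the steady-state WA-A of 12 reported above and the WA-D of around 2.1 reported for RocksDB in this section; the function name is illustrative.

```python
def end_to_end_wa(wa_a, wa_d):
    """Flash bytes written per application byte: the product of WA-A and WA-D."""
    return wa_a * wa_d


rocksdb_wa = end_to_end_wa(wa_a=12, wa_d=2.1)
print(round(rocksdb_wa))  # about 25 flash bytes written per application byte
```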
WA-D measures the flash-friendliness of a PTS. A low WA-D indicates that a PTS generates a write access pattern towards the SSD that does not incur much garbage collection overhead. Measuring WA-D, hence, allows for quantifying the fitness of a PTS for flash SSD deployments, and for verifying the effectiveness of flash-conscious design choices.
For example, LSM-Trees are often regarded as flash-friendly due to their highly sequential writes, while B+Trees are considered less flash-friendly due to their random write access pattern. The direct measurement of WA-D in our tests, however, overturns this conventional wisdom, showing that RocksDB and WiredTiger achieve a WA-D of around 2.1 and 1.5, respectively. As a reference, a pure random write workload, targeting also 60% of the device capacity, has a WA-D of 1.4 [stoica:vldb2013]. In the next section we provide additional insights on the causes of such mismatch between expectation and measured performance.
Guidelines. The analysis of the WA-D should be a standard step in the performance study of any PTS. Such analysis is fundamental to measure properly the flash-friendliness and I/O efficiency of alternative systems. Moreover, the analysis of WA-D leads to important insights on the internal dynamics and performance of a PTS, as we show in the following sections.
4.3 Initial conditions of the drive
Pitfall 3: Overlooking the internal state of the SSD. Not controlling the initial condition of the SSD may lead to biased and non-reproducible performance results.
Discussion. Figure 11 shows the performance over time of RocksDB (left) and WiredTiger (right), when running on an SSD that has been trimmed or preconditioned before starting the experiment. The top row reports KV throughput, and the bottom one reports WA-D. The plots do not show the performance of the systems during the initial loading phase.
The plots show that the initial state of the SSD heavily affects the performance delivered by a PTS and that, crucially, the steady-state performance of a PTS can greatly differ depending on the initial state of the drive. Such a result is surprising, given that one would expect the internal state of an SSD to converge to the same configuration, if running the same workload for long enough, and hence to deliver the same steady-state performance regardless of the initial state of the SSD.
To understand the cause of this phenomenon, we have monitored the host write access pattern generated by RocksDB and WiredTiger with blktrace. Figure 12 reports the CDF of the page access frequency in the two systems. We observe that in WiredTiger 46% of the pages are not written (0 read or write accesses). This indicates that WiredTiger only writes to a limited portion of the logical block address (LBA) space, corresponding to the LBAs that initially store the 200 GB of KV pairs plus some small extra capacity, i.e., the remaining 54% or so of the SSD’s capacity in total. In the case of the trimmed device, this data access pattern corresponds to having only that portion of the LBAs with associated valid data. Because SSD garbage collection only relocates valid data, the LBAs that are never written act as extra over-provisioning for the SSD, which in turn leads to a low WA-D. In a preconditioned device, instead, all LBAs have associated valid data, which means that the garbage collection process has only the hardware over-provisioning available, and needs to relocate more valid pages when erasing a block, leading to a higher WA-D.
The difference in WA-D, and hence in performance, depending on the initial state of the SSD is much less visible in RocksDB. This is because the LSM-Tree utilizes more capacity than a B+Tree, and RocksDB writes to the whole range of the LBA space. Hence, the initial WA-D for RocksDB depends heavily on the initial device state. However, all LBAs are eventually over-written, and thus the WA-D converges to roughly the same value, regardless of the initial state of the drive.
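The kind of LBA coverage analysis described above can be sketched as follows. The sketch assumes that the block trace has already been parsed into `(offset_bytes, length_bytes)` write records (blktrace itself reports sectors, so a conversion step is implied) and that coverage is tracked at a 4 KiB page granularity; both are assumptions of this example, as is the tiny synthetic trace.

```python
def written_fraction(writes, device_bytes, page_bytes=4096):
    """Fraction of the device's pages touched by at least one write."""
    num_pages = device_bytes // page_bytes
    touched = set()
    for offset, length in writes:
        first = offset // page_bytes
        last = (offset + length - 1) // page_bytes
        # Clamp to the device size in case a record runs past the last page.
        touched.update(range(first, min(last, num_pages - 1) + 1))
    return len(touched) / num_pages


# Synthetic trace on a 10-page device: pages 0, 1 and 2 are written.
trace = [(0, 4096), (4096, 8192), (0, 100)]
fraction = written_fraction(trace, device_bytes=10 * 4096)
print(f"written: {fraction:.0%}, never written: {1 - fraction:.0%}")
```

On a trimmed drive, the never-written fraction reported by such an analysis is the portion of the LBA space that holds no valid data and therefore never needs to be relocated by garbage collection.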
Our results and analysis lead to two important lessons.
The I/O efficiency of a PTS on SSD is not only a matter of the high-level design of the PTS, but also of its low-level implementation. Our experiments show that the benefits on WA-D given by the large sequential writes of the LSM implementation of RocksDB are lower than the benefits achieved by the B+Tree implementation of WiredTiger, despite the fact that WiredTiger generates a more random write access pattern.
Not controlling the initial state of the SSD can potentially jeopardize two key properties of a PTS performance evaluation: fairness and reproducibility. The fairness of a benchmarking process can be compromised by simply running the same workload on two different PTSes back to back. The performance of the second PTS is going to be heavily influenced by the state of the SSD that is determined by the previous test with the other PTS. The lack of fairness can lead a performance engineer to pick a sub-optimal PTS for their workload, or a researcher to report incorrect results.
The reproducibility of a benchmarking process can be compromised because running two independent tests of a PTS with the same workload and on the same hardware may lead to substantially different results. For production engineers this means that the performance measured on a test machine may differ widely from the performance observed in production. For researchers, it means that it may be impossible to replicate the results published in another work.
Guidelines. To overcome the aforementioned issues, we recommend controlling and reporting the initial state of the SSD before every test. This state depends on the target deployment of the PTS. For a performance engineer, such state should be as similar as possible to the one observed in the production environment, which also depends on other applications possibly collocated with the PTS. We suggest that researchers precondition the SSD as described in Section 3.4. In this way, they can evaluate the PTS in the most general case possible, thus broadening the scope of their results. To save on the preconditioning time, the SSD can be trimmed, provided that one checks that the steady-state performance of the PTS does not substantially differ from the one observed on a preconditioned drive.
4.4 Data-set size
Pitfall 4: Testing with a single dataset size. The amount of data stored by the SSD changes its behavior and affects overall performance results.
Discussion. Figure 16 reports the steady-state throughput (left), WA-D (middle), and WA-A (right) of RocksDB and WiredTiger with datasets whose sizes span from 0.25 to 0.88 of the capacity of the SSD (from 100GB to 350GB of KV pairs). We do not report results for RocksDB for the two biggest datasets because it runs out of space. The figure reports results both with a trimmed and with a preconditioned SSD.
Figure (a)a shows that the throughput achieved by the two systems is affected by the size of the dataset that they manage, although to a different extent and in different ways depending on the initial state of the SSD. By contrasting Figure (b)b and Figure (c)c we conclude that the performance degradation brought by the larger data-set is primarily due to the idiosyncrasies of the SSD. In fact, larger datasets lead to more valid pages in each flash block, which increases the amount of data being relocated upon performing garbage collection, i.e., the WA-D.
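The link between the amount of valid data and WA-D can be illustrated with a toy simulator of a log-structured FTL with greedy garbage collection under uniform random page writes. It is a deliberately simplified model and not a model of the drive used in this study: the geometry (128 blocks of 32 pages), the greedy victim selection, and the warm-up length are assumptions, and the absolute values it produces should not be compared with the measured ones.

```python
import random


def simulate_wa_d(utilization, num_blocks=128, pages_per_block=32, seed=0):
    """Return the WA-D of uniform random writes over `utilization` of the flash pages."""
    rng = random.Random(seed)
    physical_pages = num_blocks * pages_per_block
    logical_pages = int(utilization * physical_pages)
    location = {}                                  # logical page -> block holding it
    valid = [set() for _ in range(num_blocks)]     # valid logical pages per block
    used = [0] * num_blocks                        # programmed pages per block
    free = list(range(1, num_blocks))              # erased blocks
    active = 0                                     # block currently being filled
    flash_writes = 0

    def program(lpn):
        nonlocal active, flash_writes
        if used[active] == pages_per_block:        # active block full: open a new one
            active = free.pop()
        old = location.get(lpn)
        if old is not None:
            valid[old].discard(lpn)                # out-of-place update invalidates old copy
        valid[active].add(lpn)
        location[lpn] = active
        used[active] += 1
        flash_writes += 1

    def collect():
        # Greedy policy: erase the full block with the fewest valid pages.
        candidates = [b for b in range(num_blocks)
                      if b != active and used[b] == pages_per_block]
        victim = min(candidates, key=lambda b: len(valid[b]))
        for lpn in list(valid[victim]):
            program(lpn)                           # relocation counts as a flash write
        used[victim] = 0
        free.append(victim)

    def host_write(lpn):
        while len(free) < 2:                       # keep one spare block for relocation
            collect()
        program(lpn)

    for lpn in range(logical_pages):               # sequential fill
        host_write(lpn)
    for _ in range(3 * physical_pages):            # warm-up towards steady state
        host_write(rng.randrange(logical_pages))
    before, measured = flash_writes, 3 * physical_pages
    for _ in range(measured):
        host_write(rng.randrange(logical_pages))
    return (flash_writes - before) / measured


for u in (0.5, 0.8):
    print(f"utilization {u:.0%}: WA-D = {simulate_wa_d(u):.2f}")
```

With more logical pages holding valid data, the least-valid block found by the collector still contains more pages to relocate, so the simulated WA-D grows with utilization, in line with the trend discussed above.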
Changing the dataset size affects the comparison between the two systems both quantitatively and qualitatively. We also note that the comparison among the two systems is affected by the initial condition of the drive.
On a trimmed SSD, RocksDB achieves a throughput that is 3.3 higher than WiredTiger’s when evaluating the two systems on the smallest dataset. On the largest dataset, however, this performance improvement shrinks to 1.9. Moreover, WiredTiger exhibits a lower WA-D across the board, due to the LBA access pattern discussed in the previous section.
On a preconditioned SSD, the speedup of RocksDB over WiredTiger still depends on the size of the dataset, but it is lower in absolute value than on a trimmed SSD, ranging from 2.7× on the smallest dataset to 2.57× on the largest one. Moreover, whether RocksDB has a better WA-D than WiredTiger depends on the dataset size. In particular, the WA-D of RocksDB and WiredTiger are approximately equal when storing datasets whose sizes are up to half of the drive’s capacity. Past that point, RocksDB’s WA-D is noticeably lower than WiredTiger’s (2.3 versus 2.6). This happens because the benefits of WiredTiger’s LBA access pattern decrease as the dataset grows (and, with it, the range of LBAs storing KV data) and as preconditioning reduces the available over-provisioning.
Guidelines. We suggest that production engineers benchmark alternative PTSes with a dataset of the size that is expected in production, and refrain from testing with scaled-down datasets for the sake of time. We suggest that researchers experiment with different dataset sizes. This suggestion has a twofold goal. First, it allows researchers to study the sensitivity of their PTS design to different device utilization values. Second, it prevents evaluations that are purposely or accidentally biased in favor of one design over another.
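As an illustration of the second guideline, the following minimal sketch lays out a dataset-size sweep as a function of device utilization. It assumes a 400GB drive, which is the capacity implied by the figures above (100GB corresponding to 0.25 of the capacity). The function name and the intermediate utilization points are hypothetical and only serve the example.

```python
def dataset_sizes_gb(capacity_gb, fractions):
    """Return the dataset size (in GB) for each target fraction of the drive capacity."""
    return [round(capacity_gb * f) for f in fractions]


# Assumption: 400 GB drive, implied by 100 GB being 0.25 of the capacity.
CAPACITY_GB = 400
# Hypothetical utilization points; only the end points (about 0.25 and 0.88)
# are taken from the text.
FRACTIONS = [0.25, 0.375, 0.5, 0.625, 0.75, 0.875]

sizes = dataset_sizes_gb(CAPACITY_GB, FRACTIONS)
for fraction, size in zip(FRACTIONS, sizes):
    print(f"utilization {fraction:.3f} -> {size} GB of KV pairs")
```

Expressing the sweep in terms of utilization, rather than absolute sizes, makes it easier to repeat the same experiment on drives of different capacities.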
4.5 Space amplification
Pitfall 5: Not accounting for space amplification. The space utilization overhead of a PTS determines its storage requirements and deployment monetary cost.
Discussion. PTSes frequently trade additional space for improved performance, and understanding their behavior depends on understanding these trade-offs. Figure (a)a reports the total disk utilization incurred by RocksDB and WiredTiger depending on the size of the dataset. The disk utilization includes the overhead due to filesystem metadata. Because RocksDB frequently writes and erases many large files, its disk utilization varies noticeably over time. The value we report is the maximum utilization that RocksDB reaches. Figure (b)b reports the space amplification corresponding to the utilization depicted in Figure (a)a.
WiredTiger uses an amount of space only slightly higher than the bare space needed to store the dataset, and achieves a space amplification that ranges from 1.15 to 1.12. RocksDB, instead, requires much more additional disk space to store the several levels of its LSM-Tree. Overall, RocksDB achieves an application space amplification ranging from 1.86, with the smallest dataset we consider, to 1.39, with the biggest dataset that it can handle. (Footnote 1: The disk utilization in RocksDB depends on the setting of its internal parameters, most importantly the maximum number of levels and the size of each level [rocksdb-tuning]. It is possible to achieve a lower space amplification than the one we report, but at the cost of substantially lower throughput due to increased compaction overhead.)
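The space amplification values above can be obtained as the ratio between the disk space used and the size of the dataset. The sketch below illustrates this computation under the assumption that disk utilization is sampled periodically during the run and that the peak is used, as we do for RocksDB. The sample values are hypothetical; they are chosen so that the peak reproduces the 1.86 figure for a 100GB dataset.

```python
def space_amplification(disk_used_gb, dataset_gb):
    """Ratio between the disk space used by the PTS and the size of the dataset."""
    if dataset_gb <= 0:
        raise ValueError("dataset size must be positive")
    return disk_used_gb / dataset_gb


# Hypothetical utilization samples (GB) collected over time for a 100 GB dataset.
utilization_samples_gb = [150.0, 171.5, 186.0, 162.3]
DATASET_GB = 100.0

# Use the maximum utilization, since the drive must accommodate the peak.
peak_gb = max(utilization_samples_gb)
amplification = space_amplification(peak_gb, DATASET_GB)
print(f"peak utilization: {peak_gb} GB, space amplification: {amplification:.2f}")
```

Using the peak rather than the average is a conservative choice: a drive provisioned for the average utilization may run out of space during compactions.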
These results show that space amplification plays a key role in the performance versus storage space trade-off. This trade-off affects the total storage cost of a PTS deployment, given an SSD drive model, a target throughput, and a total dataset size. In fact, a PTS with a low space amplification may fit the target dataset in a smaller and cheaper drive than another PTS with a higher space amplification, or can index more data given the same capacity, requiring fewer drives to store the whole dataset.
To showcase this last point, we perform a back-of-the-envelope computation to identify which of the two systems requires fewer SSDs (and hence incurs a lower storage cost) to store a given dataset and at the same time achieve a target throughput. We use the throughput and disk utilization values that we measure for our SSD (see Figure (a)a and Figure (a)a). For simplicity, we assume one PTS instance per SSD, and that the aggregate throughput of the deployment is the sum of the throughputs of the instances. Figure (c)c reports the result of this computation. Despite having a lower per-instance throughput, the higher space efficiency of WiredTiger makes it preferable over RocksDB in scenarios with a large dataset and a relatively low target throughput. This configuration represents an important class of workloads, given that with the ever-increasing amount of data being collected and stored, many applications begin to be storage capacity-bound rather than throughput-bound [cidon:atc:2017].
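A computation of this kind can be sketched as follows, under the simplifying assumptions stated above (one PTS instance per SSD, additive throughput). The per-instance throughput and space amplification values used here are hypothetical placeholders rather than our measurements, and the sketch keeps them fixed, whereas in practice both depend on how much data each drive stores.

```python
import math


def drives_needed(dataset_gb, target_kops, drive_gb, space_amp, kops_per_instance):
    """Drives required to satisfy both the capacity and the throughput constraint."""
    by_capacity = math.ceil(dataset_gb * space_amp / drive_gb)
    by_throughput = math.ceil(target_kops / kops_per_instance)
    return max(by_capacity, by_throughput)


DRIVE_GB = 400  # assumption: capacity implied by the dataset sizes in Section 4.4
# Hypothetical profiles: (space amplification, per-instance throughput in Kops/s).
FAST_BUT_LARGE = (1.39, 10.0)   # e.g., an LSM-tree-like system
SLOW_BUT_COMPACT = (1.13, 4.0)  # e.g., a B+Tree-like system

# Large dataset, low target throughput: capacity dominates.
large_fast = drives_needed(10_000, 40, DRIVE_GB, *FAST_BUT_LARGE)
large_compact = drives_needed(10_000, 40, DRIVE_GB, *SLOW_BUT_COMPACT)

# Small dataset, high target throughput: throughput dominates.
small_fast = drives_needed(1_000, 100, DRIVE_GB, *FAST_BUT_LARGE)
small_compact = drives_needed(1_000, 100, DRIVE_GB, *SLOW_BUT_COMPACT)

print(large_fast, large_compact, small_fast, small_compact)
```

With these placeholder values, the more space-efficient system needs fewer drives in the capacity-bound scenario, while the faster system needs fewer drives in the throughput-bound one.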
Guidelines. The experimental evaluation of a PTS should not focus only on performance, but should also include space amplification. For research works, analyzing space amplification provides additional insights on the performance dynamics and trade-offs of the design choices of a PTS, and allows for a multi-dimensional comparison among designs. For production engineers, analyzing space amplification is key to computing the monetary cost of provisioning the storage for a PTS in production, which is typically more important than sheer maximum performance [microsoft:ssd].
As a final remark, we note that this pitfall applies also to PTSes not deployed over an SSD, and hence our considerations apply more broadly to PTSes deployed over any persistent storage medium.
4.6 SSD over-provisioning
Pitfall 6: Overlooking SSD software over-provisioning. Allocating extra over-provisioning capacity to the SSD may lead to more favorable capacity versus performance trade-offs.
Discussion. Figure 23 compares the steady-state throughput (left) and WA-D (right) achieved by RocksDB and WiredTiger in two settings: the default one in which the whole SSD capacity is assigned to the disk partition accessible by the filesystem underlying the PTS, and one in which some SSD space is not made available to the filesystem underlying the PTS, and is instead assigned as extra over-provisioning to the SSD. Specifically, in the second setting we trim the SSD and assign a 300GB partition to the PTS. Hence, the SSD has 100GB of trimmed capacity that is not visible to the PTS. We choose this value because 100GB corresponds to half of the free capacity of the drive once the 200 GB dataset has been loaded. For both settings we consider the case in which the PTS partition remains clean after the initial trimming, and the case in which it is preconditioned.
Extra over-provisioning improves the performance of RocksDB by a factor of 1.83. This substantial improvement is caused by a drastic reduction of WA-D, which drops from 2.3 to 1.4, and it applies to RocksDB regardless of the initial state of the PTS partition, for the reason discussed in Section 4.3.
The impact of extra over-provisioning is much less evident in WiredTiger. In the trimmed device case, the extra over-provisioning has no effect on WiredTiger. This happens because WiredTiger writes only to a certain range of the LBA space (see Figure 12). Hence, all other trimmed blocks act as over-provisioning, regardless of whether they belong to the PTS partition or to the extra over-provisioning one. On the preconditioned device, instead, all blocks of the PTS partition have data associated with them, so the only software over-provisioning is given by the trimmed partition. This extra over-provisioning reduces WA-D from 1.7 to 1.3, yielding a throughput improvement of 1.14×.
Allocating extra over-provisioning can be an effective technique to reduce the number of PTS instances needed in a deployment (and hence reduce its storage cost), because it increases the throughput of the PTS without requiring additional hardware resources. However, extra over-provisioning also reduces the amount of data that a single drive can store, which potentially increases the number of drives needed to store a dataset and the storage deployment cost. To assess in which use cases extra over-provisioning is the most cost-effective choice, we perform a back-of-the-envelope computation of the number of drives needed to provision a RocksDB deployment given a dataset size and a target throughput value. We perform this study on RocksDB because it benefits the most from extra over-provisioning. We use the same simplifying assumptions that we made for the previous similar analysis. Figure 24 reports the results of our computation. As expected, extra over-provisioning is beneficial for use cases that require high throughput for relatively small datasets. For larger datasets with relatively low performance requirements, it is more convenient to allocate as much of the drive’s capacity as possible to RocksDB.
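The trade-off between the two settings can be sketched as follows. The partition sizes (400GB versus 300GB) and the 1.83 throughput factor come from the experiment above; the baseline per-instance throughput and the fixed space amplification are hypothetical placeholders, and applying the 1.83 factor uniformly to every dataset size is a simplification made only for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveConfig:
    name: str
    usable_gb: float  # capacity visible to the PTS partition
    kops: float       # per-instance steady-state throughput (Kops/s)


def drives_needed(cfg, dataset_gb, target_kops, space_amp):
    """Drives required to meet both the capacity and the throughput target."""
    by_capacity = math.ceil(dataset_gb * space_amp / cfg.usable_gb)
    by_throughput = math.ceil(target_kops / cfg.kops)
    return max(by_capacity, by_throughput)


BASE_KOPS = 10.0  # hypothetical baseline throughput of one instance
SPACE_AMP = 1.39  # assumption: kept fixed across dataset sizes
default = DriveConfig("default", usable_gb=400.0, kops=BASE_KOPS)
extra_op = DriveConfig("extra-OP", usable_gb=300.0, kops=BASE_KOPS * 1.83)

# Small dataset, high throughput target: extra over-provisioning pays off.
small = {c.name: drives_needed(c, 1_000, 200, SPACE_AMP) for c in (default, extra_op)}
# Large dataset, low throughput target: capacity is the bottleneck.
large = {c.name: drives_needed(c, 20_000, 50, SPACE_AMP) for c in (default, extra_op)}

print(small)
print(large)
```

With these placeholder values, the over-provisioned configuration roughly halves the number of drives in the throughput-bound case, but requires about a third more drives in the capacity-bound case.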
Guidelines. It is well known that PTSes have several tuning knobs that have a huge effect on performance [lim:fast:2016, lsm-bush:sigmod19, vat:2020]. We suggest considering SSD over-provisioning as an additional, yet first-class, tuning knob of a PTS. SSD extra over-provisioning trades capacity for performance, and can reduce the storage cost of a PTS deployment in some use cases.
4.7 Storage technology
Pitfall 7: Testing on a single SSD type. The performance of a PTS heavily depends on the type of SSD used. This makes it hard to extrapolate the expected performance when running on other drives and also to reach conclusive results when comparing two systems.
Discussion. We exemplify this pitfall through an experiment where we keep the workload and the RocksDB and WiredTiger configurations constant and only swap the underlying storage device. We use three SSDs: an Intel p3600 [intel:p3600] flash SSD, i.e., the drive used for the experiments discussed in previous sections; an Intel 660 [intel-660p] flash SSD; and an Intel Optane [optane]. We refer to these SSDs as SSD1, SSD2 and SSD3, respectively, in the following discussion. SSD3 is a high-end SSD, based on the newer 3DXP technology that achieves higher performance than flash SSDs. We use SSD3 as an upper bound on performance that a PTS can achieve on a flash block device.
To isolate the performance impact of the SSD technology itself (i.e., architecture and underlying storage medium performance) in the assessment of a PTS, we eliminate, as much as possible, the other sources of performance variability that we have discussed so far. To this end, we run a workload with a dataset that is 10× smaller than the default one, and we trim the flash SSDs. In this way, the effect of garbage collection in the flash SSDs is minimized, resulting in a WA-D very close to one.
Figure 25 shows the steady-state throughput of RocksDB and WiredTiger when deployed on the three SSDs. As the plot shows, the performance impact of changing the underlying drive varies drastically across the two systems. Explaining these performance variations requires gaining a deeper understanding of the low-level design of the SSDs, which is not always achievable.
RocksDB achieves the highest throughput on SSD3, and a higher throughput on SSD1 than on SSD2. This performance trend is mostly determined by the write latencies of the SSDs, which are the lowest in SSD3, and lower in SSD1 than in SSD2. WiredTiger also achieves the highest throughput on SSD3 but, surprisingly, it obtains a higher throughput on SSD2 than on SSD1. We argue that the reason for this performance dynamic lies in the fact that SSD2 has a larger internal cache than SSD1. Because WiredTiger performs small writes, uniformly distributed over time, the cache of SSD2 is able to absorb them with very low latency, and destages them in the background. The larger cache of SSD2 does not yield the same beneficial effects to RocksDB because RocksDB performs large bursty writes, which overwhelm the cache, leading to longer write latencies and, hence, lower throughput.
These dynamics also lead to the surprising result that either of the two systems we consider can achieve a higher throughput than the other, just by changing the SSD on which they are deployed.
We also observe very different performance variations for the two systems, when deployed on different SSDs. On the one hand, the best and worst throughputs achieved by RocksDB vary by a factor of almost 20 (SSD2 versus SSD3). On the other hand, they vary only by a factor of 2.4 for WiredTiger.
These results are especially important, because they indicate that the performance comparison across PTS design points, and the corresponding conclusions that are drawn, are strongly dependent on the SSD employed in the benchmarks, and hence hard to generalize.
The type of SSD also dramatically affects the performance predictability of the two systems. Figure 28 reports the throughput of RocksDB (left) and WiredTiger (right) when deployed over the three SSDs. To highlight the performance variability, we average the throughput over a 1-minute interval (as opposed to the default 10 minutes used in previous plots).
The throughput of RocksDB varies widely over time, and the extent of the variability depends on the type of SSD. When using SSD1, RocksDB suffers from throughput swings of . When using SSD2, RocksDB has long periods of time where no application-level writes are executed at all. This happens because the large writes performed by RocksDB overwhelm the cache of SSD2 and lead to long stall times due to internal data destaging. On SSD3, the relative RocksDB throughput variability decreases to . WiredTiger is less prone to performance variability, and exhibits a more steady and predictable performance irrespective of the storage technology.
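The effect of the averaging interval on the visibility of such swings can be illustrated with the following sketch. It uses a synthetic per-second throughput trace that alternates one minute of writes with one minute of stalls (a hypothetical pattern, not our measurements), and it quantifies variability with the coefficient of variation, which is a metric we pick here for illustration only.

```python
from statistics import mean, pstdev


def window_average(per_second_ops, window_s):
    """Average a per-second throughput trace over consecutive windows of window_s seconds."""
    return [
        mean(per_second_ops[i:i + window_s])
        for i in range(0, len(per_second_ops), window_s)
    ]


def coefficient_of_variation(series):
    """Standard deviation relative to the mean (0 means perfectly steady)."""
    avg = mean(series)
    return pstdev(series) / avg if avg else float("inf")


# Synthetic trace (assumption): 60 s at 1000 ops/s followed by a 60 s stall,
# repeated 10 times, i.e., 20 minutes in total.
trace = ([1000] * 60 + [0] * 60) * 10

one_minute = window_average(trace, 60)    # exposes the stalls
ten_minutes = window_average(trace, 600)  # averages the stalls away

print(coefficient_of_variation(one_minute))
print(coefficient_of_variation(ten_minutes))
```

On this trace the 10-minute average reports a perfectly steady 500 ops/s, whereas the 1-minute average reveals that the system alternates between full speed and complete stalls.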
Guidelines. We have shown that it is difficult to predict the performance of a PTS just based on the high-level SSD specifications (e.g., bandwidth and latency specifications). Therefore, we recommend testing a PTS on multiple SSD classes, preferably using devices from multiple vendors, and using multiple flash storage technologies. Testing on multiple SSD classes allows researchers to draw broader and more significant conclusions about the performance of a PTS’s design, and to assess the “intrinsic” validity of such design, without tying it to specific characteristics of the medium on which the design has been tested. For a performance engineer, testing with multiple drives is essential to identify which one yields the best combination of storage capacity, performance and cost depending on the target workload.
4.8 Additional workloads
In this section, we show that our pitfalls affect a wider set of workloads than the one we have considered so far. Due to space constraints, we focus on the first three pitfalls. Figure 33 reports the throughput and WA-D over time achieved by RocksDB (left) and WiredTiger (right) with two workloads that we obtain by varying one parameter of our default workload. The plots do not show the performance of the systems during the initial loading phase. The top row refers to a workload in which the size of the values is 128 bytes (vs. 4000 used so far). To keep the amount of data indexed by the PTS equal to the one used in the previous experiments, we increase the number of keys accordingly. The bottom row refers to a workload in which the top-level application submits a mix of read and write operations, with a 50:50 ratio. We run these workloads using both a preconditioned SSD and a trimmed one.
As the plots show, the first three pitfalls apply to these workloads as well. First, steady-state behavior can be very different from the performance observed at the beginning of the test. In the mixed read/write workload, it takes longer for the PTSes and the SSD to stabilize, because writes are less frequent. This is visible by comparing the top row in Figure 33 with the top row in Figure 6 (Section 4.1).
Second, the value and the variations of WA-D are important to explain performance changes and dynamics. We note that the WA-D function of WiredTiger in the trimmed case with small values (Figure (d)d) is different from the one observed for the workload with 4000B values (Figure (d)d in Section 4.1). With 4000B values the WA-D function starts at a value close to 1, whereas with 128B values its starting point is closer to 2. This happens because the initial data loading phase leads to different SSD states depending on the size of the KV pairs. With 4000B values, one KV pair can fill one filesystem page, which is written only once to the underlying SSD. With small values, instead, the same page needs to be written multiple times to pack more KV pairs, which results in higher fragmentation at the SSD level. Such a phenomenon is not visible in RocksDB, because it writes large chunks of data regardless of the size of the individual KV pairs.
Third, the initial state of the SSD leads to different transient and steady-state performance. As for the other workloads considered in the paper, this pitfall applies especially to WiredTiger.
In light of these results, we argue that our pitfalls and guidelines should apply to every workload that has a sustained write component. Some of our pitfalls are also relevant to read-only workloads, namely the ones concerning the dataset size, the space amplification, and the storage technology.
5 Related Work
In this section we discuss the performance analyses of existing storage systems on SSD, and how they relate to the pitfalls we describe; work on modeling and benchmarking SSDs in the storage community; and research in the broader field of system benchmarking.
Performance analyses of SSD-based storage systems. Benchmarking the performance of PTSes on SSDs is a task frequently performed both in academia [balmau:atc:2019, sears:sigmod:2012, balmau:atc:2017, bindschaedler:asplos:2020, raju:sosp17:pebblesdb, luo:vldb:2019, dayan:monkey:2018, dayan:dostoevsky:2018, bortnikov:vldb:2018, marmol:atc:2015, lim:sosp:2011] and in industry [leveldb-bench, hyperleveldb-bench, wiredtiger-bench, rocksdb-bench] to compare different designs.
In general, these evaluations fall short in considering the benchmarking pitfalls we discuss in this paper. For example, the evaluations of the systems do not report the duration of the experiments, or they do not specify the initial conditions of the SSD on which experiments are run, or consider a single dataset size. As we show in this paper, these aspects are crucial for both the quantitative and the qualitative analysis of the performance of a data store deployed on an SSD.
In addition, performance benchmarks of PTSes typically focus on application-level write amplification to analyze the I/O efficiency and flash-friendliness of a system [lim:sosp:2011, balmau:atc:2017, raju:sosp17:pebblesdb, dayan:monkey:2018, dayan:dostoevsky:2018, marmol:atc:2015]. We show that also device-level write amplification must be taken into account for these purposes.
A few works distinguish between bursty and sustained performance, by investigating the variations of throughput and latencies over time in LSM-tree key-value stores [luo:vldb:2019, sears:sigmod:2012, balmau:atc:2019]. Balmau et al. [balmau:atc:2019] additionally show how some optimizations can improve the performance of LSM-tree key-value stores in the short term, but lead to performance degradation in the long run. These works focus on the high-level, i.e., data-structure specific, causes of such performance dynamics. Our work, instead, investigates the low-level causes of performance variations in PTSes, and correlates them with the idiosyncratic performance dynamics of SSDs.
Yang et al. [yang:inflow:2014] show that stacking log-structured data structures on top of each other may hinder the effectiveness of the log-structured design. Oh et al. [oh:fast:2012] investigate the use of SSD over-provisioning to increase the performance of an SSD-based cache. Athanassoulis et al. [rum] propose the RUM conjecture, which states that PTSes have an inherent trade-off between performance and storage cost. Our paper touches on some of these issues, and complements the findings of these works, by covering in a systematic fashion several pitfalls of benchmarking PTSes on SSDs, and by providing experimental evidence for each of them.
SSD performance modeling and benchmarking. The research and industry storage communities have produced much work on modeling and benchmarking the performance of SSDs. The Storage Networking Industry Association has defined the Solid State Storage Performance Test Specification [snia], which contains the guidelines on how to perform rigorous and reproducible SSD benchmarking, so that performance results are comparable across vendors. Many analytical models [Hu:systor:2009, stoica:vldb2013, desnoyer:tocs:2014, stoica:mascots:2019] aim to express in closed form the performance and the device level WA of an SSD as a function of the workload and the SSD internals and parameters. MQSim [tavakkol:fast:2018] is a simulator specifically designed to replicate quickly and accurately the behavior of an SSD at steady state.
Despite this huge body of work, we have previously shown that practitioners and researchers in the system and database communities do not properly, or entirely, take into account the performance dynamics of SSDs when benchmarking PTSes. With our work, we aim to raise awareness about the SSD performance intricacies in the system and database communities as well. To this end, we show the intertwined effects of the SSD idiosyncrasies on the performance of PTSes, and provide guidelines on how to conduct more rigorous and SSD-aware performance benchmarking.
System benchmarking. The process of benchmarking a system is notoriously difficult, and can incur subtle pitfalls that may undermine its results and conclusions. Such complexity is epitomized by the list of benchmarking crimes [benchmarking-crimes], a popular collection of benchmarking errors and anti-patterns that are frequently found in the evaluation of research papers. Raasveldt et al. [raasveldt:2018] provide a similar list with a focus on DB systems. Many research papers target different aspects of the process of benchmarking a system. Maricq et al. [mariq:osdi:2018], Uta et al. [uta:nsdi:2020], and Hoefler and Belli [hoefler:sc:2015] focus on the statistical relevance of the measurements, investigating whether experiments can be repeatable, and how many trials are needed to consider a set of measurements meaningful. Our work is complementary to this body of research, in that it aims to provide guidelines to obtain a fairer and more reproducible performance assessment of PTSes deployed on SSDs.
The complex interaction between a persistent tree data structure and a flash SSD device can easily lead to inaccurate performance measurements. In this paper we show seven pitfalls that one can incur when benchmarking a persistent tree data structure on a flash SSD. We demonstrate these pitfalls using RocksDB and WiredTiger, two of the most widespread implementations of the LSM-tree and of the B+tree persistent data structures, respectively. We also present guidelines to avoid the benchmarking pitfalls, so as to obtain accurate and representative performance measurements. We hope that our work raises awareness about, and provides a deeper understanding of, some benchmarking pitfalls, and that it paves the way for more rigorous, fair, and reproducible benchmarking.