On Failure Diagnosis of the Storage Stack

05/06/2020 ∙ by Duo Zhang, et al.

Diagnosing storage system failures is challenging even for professionals. One example is the "When Solid State Drives Are Not That Solid" incident that occurred at the Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, such obscure failures will likely occur more often. As one step toward addressing the challenge, we present our ongoing effort called X-Ray. Different from traditional methods that focus on either the software or the hardware, X-Ray leverages virtualization to collect events across layers and correlates them to generate a correlation tree. Moreover, by applying simple rules, X-Ray can highlight critical nodes automatically. Preliminary results based on 5 failure cases show that X-Ray can effectively narrow down the search space for failures.

1 Motivation

The storage stack is witnessing a sea-change driven by the advances in non-volatile memory (NVM) technologies  [40, 27, 43, 52, 18, 83, 66, 23, 80]. For example, flash-based solid state drives (SSDs) and persistent memories (PMs) are replacing hard disk drives (HDDs) as the durable device  [91, 6, 61, 92, 38, 17]; NVMe [65] and CXL [31] are redefining the host-device interface; blk-mq [21] alleviates the single queue and lock contention bottleneck at the block I/O layer; the SCSI subsystem and the Ext4 file system, which have been tuned for HDDs for decades, are also being adapted for NVM (e.g., scsi-mq [79, 29, 88] and DAX [33]); in addition, various NVM-oriented new designs/optimizations have been proposed (e.g., F2FS [54], NOVA [94], Kevlar [42]), some of which require cohesive modifications throughout the storage stack (e.g., the TRIM support [57]).

The new systems generally offer higher performance. However, as a disruptive technology, the NVM-based components have to co-exist with the traditional storage ecosystem, which is notoriously complex and difficult to get right despite decades of effort [95, 70, 59, 82]. Compared with the performance gains, the implications for system reliability are much less studied or understood.

One real example is the “When Solid-State Drives Are Not That Solid” incident that occurred at the Algolia data center [92], where a random subset of SSD-based servers crashed and corrupted files for unknown reasons. The developers “spent a big portion of two weeks just isolating machines and restoring data as quickly as possible”. After trying to diagnose almost all software in the stack (e.g., Ext4, mdadm [3]) and switching to SSDs from different vendors, they finally (and mistakenly) concluded that Samsung’s SSDs were to blame. Samsung’s SSDs were criticized and blacklisted, until one month later Samsung engineers found that the failure was actually caused by a TRIM-related Linux kernel bug [72].

As another example, Zheng et al. studied the behavior of SSDs under power fault [101]. The testing framework bypassed the file system but relied on the block I/O layer to apply workloads and check the behavior of devices. In other words, the SSDs were essentially evaluated together with the block I/O layer. Their initial experiments were performed on Linux kernel v2.6.32, and eight out of fifteen SSDs exhibited a symptom called “serialization errors” [101]. However, in their follow-up work, where similar experiments were conducted on a newer kernel (v3.16.0) [100], the authors observed that the failure symptoms on some SSDs changed significantly (see Table 1, adapted from [100]). It was eventually confirmed that the different symptoms were caused by a sync-related Linux kernel bug [100].

OS (Kernel)            SSD-1   SSD-2   SSD-3
Debian 6.0 (2.6.32)      317      27       0
Ubuntu 14.04 (3.16)       88       1       0
Table 1: SSDs exhibit different symptoms when tested on different OSes. Each cell shows the average number of errors. Reported by [100].

One commonality of the two cases above is that people try to infer the behavior of storage devices indirectly through the operating system (OS) kernel, and they tend to believe that the kernel is correct. This is natural in practice because users typically have to access storage devices through the kernel, and they usually do not have the luxury of inspecting device behavior directly. Also, NVM devices are relatively young compared with the long history of the OS kernel, so they might seem less trustworthy. We call this common practice a top-down approach.

Nevertheless, both cases show that the OS kernel may play a role in causing system failures, while the device may be innocent. More strangely, in both cases, different devices appear to have different sensitivity to the kernel bug, and some devices may even “tolerate” it. For example, no failure was observed on Intel SSDs in the Algolia case [92], and SSD-3 in Table 1 never exhibited any serialization errors in Zheng et al.’s experiments [101]. Since switching devices is a simple and common strategy for identifying device issues when diagnosing system failures, the different sensitivity of devices to software bugs can easily drive the investigation in the wrong direction, wasting human effort and leading to wrong conclusions, as manifested in the two cases above.

In fact, similar confusing and debatable failures are not uncommon today [36, 2, 1, 46]. With storage devices becoming more capable and exposing more special features to host-side software [68, 12, 5], the interaction between hardware and software is expected to become more complex. Consequently, analyzing storage system failures solely with the existing top-down approach will likely become more problematic. In other words, new methodologies for diagnosing failures of the storage stack are much needed.

The rest of the paper is organized as follows: we first discuss the limitations of existing efforts (§2); next, we introduce our idea (§3) and preliminary results (§4); finally, we describe other related work (§5) and conclude with the discussion topics section (§6).

2 Why Existing Efforts Are Not Enough

In this section, we discuss two groups of existing efforts that may alleviate the challenge of diagnosing storage stack failures to some extent. We defer the discussion of other related work (e.g., diagnosing performance issues and distributed systems) to §5.

2.1 Testing the Storage Software Stack

Great efforts have been made to test the storage software in the stack [95, 62, 64, 99], with the goal of exposing bugs that could lead to failures. For example, EXPLODE [95] and B3 [64] apply fault injections to detect crash-consistency bugs in file systems.

However, testing tools are generally not suitable for diagnosing system failures because they typically require a well-controlled environment (e.g., a highly customized kernel [95, 64]), which may be substantially different from the storage stack that needs to be diagnosed.

2.2 Practical Diagnosing Tools

To some extent, failure diagnosis is the reverse process of fault injection testing. Given its importance, many practical tools have been built, including the following:

Debuggers  [41, 26, 49] are the de facto way to diagnose system failures. They usually support fine-grained manual inspection (e.g., setting breakpoints, checking memory bytes). However, significant human effort is needed to harness their power for diagnosing the storage stack, and the manual effort required will keep increasing as the software becomes more complex. Also, these tools typically cannot collect device information directly.

Software Tracers  [93, 11, 34, 35] can collect various events from a target system to help understand its behavior. However, similar to debuggers, they focus on host-side events only and usually lack automation support for failure inspection.

Bus Analyzers  [45, 78] are hardware equipment that can capture the communication data between a host system and a device, which is particularly useful for analyzing device behavior. However, since they only report bus-level information, they cannot help much with understanding system-level behaviors.

Note that both debuggers and software tracers represent the traditional top-down diagnosis approach. On the other hand, bus analyzers have been used to diagnose some of the most obscure failures that involved host-device interactions [32, 74], but they are not as convenient as the software tools.

3 X-Ray: A Cross-Layer Approach

Our goal is to help practitioners narrow down the root causes of storage system failures quickly. To this end, we are exploring a framework called X-Ray, which is expected to have the following key features:

Figure 1: The X-Ray Approach. The target software stack is hosted in a virtual machine; DevAgent, HostAgent, and X-Explorer are the three main components; the basic mode visualizes a correlation tree for inspection; the advanced mode highlights critical nodes based on rules.
  • Full stack: many critical operations (e.g., sync, TRIM) require cohesive support at both device and host sides; inspecting only one side (and assuming the other side is correct) is fundamentally limited;

  • Isolation: the device information should be collected without relying on the host-side software (which may be problematic itself);

  • Usability: no special hardware or software modification is needed; manual inspection should be reduced as much as possible.

3.1 Overview

Figure 1 shows an overview of the X-Ray framework, which includes three major components: DevAgent, HostAgent, and X-Explorer.

First, we note that virtualization technology is mature enough to support an unmodified OS today [89, 67, 69]. Moreover, recent research efforts have enabled emulating sophisticated storage devices in a virtual machine (VM), including SATA SSDs (e.g., VSSIM [96]) and NVMe SSDs (e.g., FEMU [55]). Therefore, we can leverage virtualization to support cross-layer analysis with high fidelity and no hardware dependence.

Specifically, we host the target storage software stack in a QEMU VM [19]. At the virtual device layer, the DevAgent (§3.2) monitors the commands (e.g., SCSI, NVMe) transferred from the kernel under the bug-triggering workload. Optionally, the associated data (i.e., bits transferred by commands) can be recorded too.

Meanwhile, to understand the high-level semantics of the system activities, the HostAgent (§3.3) monitors the function invocations throughout the software stack (e.g., system calls, kernel internal functions), and records them with timestamps at the host side. Optionally, key variables may be recorded too with additional overhead.

The X-Explorer (§3.4) helps to diagnose the system behavior in two modes: (1) the basic mode visualizes a correlation tree of cross-layer events for inspection; (2) the advanced mode highlights critical events based on rules, which can be either specified by the user or derived from a normal system execution.

3.2 DevAgent

The device-level information is helpful because storage failures are often related to persistent states, and changing persistent states (in)correctly requires (in)correct device command sequences. The DevAgent records this information directly in a command log, without any dependency on the host-side kernel (which might be buggy itself), similar to a bus analyzer [45].

SCSI Device. The Linux kernel communicates with a SCSI device by sending Command Descriptor Blocks (CDBs) over the bus. QEMU maintains a struct SCSICommand for each SCSI command, which contains a 16-byte buffer (SCSICommand->buf) holding the CDB. Every SCSI command type is identified by the opcode at the beginning of the CDB, and the size of the CDB is determined by the opcode. For example, the CDB of the WRITE_10 command is represented by the first 10 bytes of the buffer. For simplicity, we always transfer 16 bytes from the buffer to the command log and use the opcode to identify the valid bytes. QEMU classifies SCSI commands into either Direct Memory Access (DMA) commands (e.g., READ_10) or Admin commands (e.g., VERIFY_10); both are handled in the same way in the DevAgent since they share the same structure.
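
To make the mechanism concrete, below is a minimal sketch of such a logging hook. The hook name (devagent_log_scsi) and the log format are our own illustrative assumptions; only SCSICommand->buf (the 16-byte CDB buffer) follows QEMU's definition cited above.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define CDB_LOG_LEN 16          /* always log 16 bytes; the opcode
                                   determines how many are valid */

static FILE *cmd_log;           /* command log, opened at VM start (assumed) */

/* Hypothetical hook, called from QEMU's SCSI emulation whenever a
 * CDB arrives from the guest kernel. */
static void devagent_log_scsi(const uint8_t *cdb /* SCSICommand->buf */)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);          /* epoch timestamp */

    /* The first CDB byte is the opcode, e.g., 0x2a (WRITE_10),
     * 0x35 (SYNCHRONIZE_CACHE). */
    fprintf(cmd_log, "%lld.%09ld SCSI op=0x%02x cdb=",
            (long long)ts.tv_sec, ts.tv_nsec, cdb[0]);
    for (int i = 0; i < CDB_LOG_LEN; i++) {
        fprintf(cmd_log, "%02x", cdb[i]);
    }
    fputc('\n', cmd_log);
}
```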

NVMe Device. QEMU maintains a struct NvmeCmd for each NVMe command and emulates the io_uring [73, 47] interface to transfer NVMe commands to an NVMe device. The interface defines two types of command queues: submission and completion. Submission queues are further classified into I/O submission queues and Admin submission queues, which are processed via nvme_process_sq_io and nvme_process_sq_admin in QEMU, respectively. The DevAgent intercepts both queues and records both I/O commands and Admin commands, similar to SCSI.
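
A similar hedged sketch for the NVMe path follows; the hook name is again hypothetical, and NvmeCmd is abbreviated here to the fields used, whereas the real structure covers the full 64-byte NVMe command.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef struct NvmeCmd {        /* abbreviated; the real layout follows
                                   the 64-byte NVMe command format */
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t cid;               /* command identifier */
    uint8_t  rest[60];
} NvmeCmd;

/* Hypothetical hook, called from both nvme_process_sq_io and
 * nvme_process_sq_admin (names as in the text) so that I/O and
 * Admin commands are logged uniformly. */
static void devagent_log_nvme(FILE *log, const NvmeCmd *cmd, bool admin)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    fprintf(log, "%lld.%09ld NVMe %s op=0x%02x cid=%u\n",
            (long long)ts.tv_sec, ts.tv_nsec,
            admin ? "admin" : "io", cmd->opcode, cmd->cid);
}
```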

Figure 2: A partial correlation tree. The tree includes one syscall (green), 704 kernel functions (white nodes), and 3 device commands (blue); the critical path (red) is selected by a simple rule: all ancestors of the command nodes.

3.3 HostAgent

The HostAgent aims to track host-side events to help understand the high-level semantics of system activities. As mentioned in §2, many tracers have been developed with different tradeoffs [71]. The current prototype of the HostAgent is based on ftrace [39], which has native support on Linux based on kprobes [11]. We select ftrace because of its convenient support for tracing caller-callee relationships. When CONFIG_FUNCTION_GRAPH_TRACER is defined, the ftrace_graph_call routine stores the pointer to the parent into a ring buffer at function return via the link register, which is ideal for X-Ray. On the other hand, ftrace only records function execution time instead of the epoch time needed for synchronization with DevAgent events. To work around this limitation, we modify the front end of ftrace to record the epoch time at system calls, and calculate the epoch time of kernel functions based on their execution time relative to the corresponding system calls. Another issue we observe is that ftrace may miss executed kernel functions; we are working on improving the completeness.
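
The timestamp reconstruction above can be summarized by a small sketch (the names and structure are ours, not ftrace's): each kernel-function event carries an ftrace-reported offset relative to its enclosing system call, which we anchor to the epoch time recorded at syscall entry.

```c
#include <stdint.h>

/* One host-side event as assembled by the (modified) HostAgent front end. */
typedef struct {
    uint64_t syscall_epoch_ns;  /* epoch time logged at syscall entry */
    uint64_t offset_ns;         /* ftrace-reported time since that syscall */
} host_event_t;

/* Epoch time of a kernel function = epoch of its syscall + offset. */
static inline uint64_t host_event_epoch_ns(const host_event_t *ev)
{
    return ev->syscall_epoch_ns + ev->offset_ns;
}
```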

3.4 X-Explorer

The events collected by DevAgent and HostAgent are valuable for diagnosis. However, the quantity is usually too large for manual inspection. Inspired by the visualization layer of other diagnosis tools [20, 84, 10, 75], the X-Explorer visualizes the relationships among the events and highlights the critical ones.

3.4.1 TreeBuilder

The TreeBuilder generates a correlation tree to represent the relationships among events in the storage stack. The tree contains three types of nodes based on the events from HostAgent and DevAgent: (1) SYSCALL nodes represent the system calls invoked in the bug-triggering workload; (2) KERNEL nodes represent the internal kernel functions involved; (3) CMD nodes represent the commands observed at the device.

There are two types of edges in the tree: (1) the edges among SYSCALL and KERNEL nodes represent function invocation relations (i.e., parent and child); (2) the edges between CMD nodes and other nodes represent closeness in time. In other words, the device-level events are correlated to the host-side events based on timestamps. While the idea is straightforward, we observe an out-of-order issue caused by virtualization: the HostAgent timestamp is collected within the VM, while the DevAgent timestamp is collected outside the VM, so device commands may appear to occur before the corresponding system calls based on the raw timestamps. To work around the issue, we set up an NTP server [87] on the DevAgent side and perform NTP synchronization on the HostAgent side. We find that such NTP-based synchronization can mitigate the timestamp gap to a great extent, as will be shown in §4. Another potential solution is to modify the dynamic binary translation (DBT) layer of QEMU to minimize the latency.
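
As an illustration of the timestamp-based correlation (a sketch under our own data structures, not the actual X-Ray code), the TreeBuilder can attach each CMD node to the latest host-side event whose NTP-synchronized timestamp does not exceed the command's:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum node_type { NODE_SYSCALL, NODE_KERNEL, NODE_CMD };

typedef struct node {
    enum node_type type;
    uint64_t epoch_ns;      /* NTP-synchronized epoch timestamp */
    struct node *parent;    /* invocation edge for SYSCALL/KERNEL nodes */
    bool critical;          /* set later by the TreePruner (§3.4.2) */
} node_t;

/* host[] holds SYSCALL/KERNEL nodes sorted by epoch_ns; binary-search
 * for the latest host event at or before the command timestamp. */
static node_t *correlate_cmd(node_t *host[], size_t n, uint64_t cmd_ns)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (host[mid]->epoch_ns <= cmd_ns)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? host[lo - 1] : NULL;   /* NULL: no preceding host event */
}
```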

3.4.2 TreePruner

The correlation tree is typically large due to the complexity of the storage stack. Inspired by the rule-based diagnosis tools [20], the TreePruner traverses the tree and highlights the critical paths and nodes (i.e., the paths and nodes of interest) automatically based on a set of rules stored in the RuleDB, which can be either specified by the user or derived from a normal system execution.

User-specified rules. Users may specify expected relations among system events as rules. For example, the sync-family system calls (e.g., sync, fsync) should generate SYNC_CACHE (SCSI) or FLUSH (NVMe) commands to the device, which is crucial for crash consistency; similarly, blkdev_fsync should be triggered when calling fsync on a raw block device. In addition, users may specify simple rules to reduce the tree (e.g., keeping all ancestor nodes of WRITE commands), as sketched below.
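
Reusing the node_t from the TreeBuilder sketch in §3.4.1, a rule like “all ancestor nodes of WRITE commands” reduces to walking parent pointers from the matching CMD nodes and flagging everything up to the root (again an illustrative sketch, not the prototype's code):

```c
/* Mark the ancestors of each selected CMD node as critical. */
static void mark_ancestors(node_t *cmd_nodes[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (node_t *p = cmd_nodes[i]; p != NULL; p = p->parent) {
            p->critical = true;
        }
    }
}
```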

Our current prototype hard-codes a few rules as tree-traversal operations based on the failure cases we studied (§4). We are exploring more flexible interfaces (e.g., SQL-like [44] or formula-based [22]) to express more sophisticated rules.

Normal system execution. Failures are often tricky to reproduce due to different environments (e.g., different kernel versions)  [25]. In other words, failures may not always manifest even under the same bug-triggering workloads. Based on this observation, and inspired by delta debugging [97, 63], we may leverage a normal system execution as a reference when available.

When a normal system is available, we host the corresponding software stack in the X-Ray VM and build the correlation tree under the same bug-triggering workload. For clarity, we name the tree from the normal system execution as the reference tree, which essentially captures the implicit rules among events in the normal execution. By comparing the trees, divergences that cause different symptoms can be identified quickly.
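
As a sketch of the comparison (illustrative only, with critical paths represented as arrays of function/command names ordered from the root), one simple strategy is to walk the two paths in lockstep and report the first divergence:

```c
#include <stddef.h>
#include <string.h>

/* Compare two critical paths (root first); return the index of the
 * first divergence, -1 if the paths are identical, or the length of
 * the shorter path if one is a prefix of the other. In the case study
 * of §4, the paths diverge right below blkdev_fsync. */
static ptrdiff_t first_divergence(const char *a[], size_t na,
                                  const char *b[], size_t nb)
{
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(a[i], b[i]) != 0)
            return (ptrdiff_t)i;
    }
    return na == nb ? -1 : (ptrdiff_t)n;
}
```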

4 Preliminary Results

Figure 3: Comparison. (a) the critical path from Figure 2; (b) the critical path from a reference tree.

We have built a preliminary prototype of X-Ray and applied it to diagnose 5 failure cases based on real bugs from the literature [100, 58, 64, 50]. We discuss one case in detail and summarize the results at the end.

Case Study. Figure 2 shows a partial correlation tree for diagnosing a failure where synchronous writes appear to be committed out of order on a raw block device. The tree starts with a syscall node (green), which triggers 704 kernel functions (white nodes) and three device commands (blue nodes). The red lines show the critical path and nodes selected by one simple rule: “ancestors of device commands” (Rule#3). The tree itself is part of the original tree (not shown), selected via another simple rule: “syscalls with WRITE commands” (Rule#1).

Figure 3 (a) shows a zoomed-in view of the critical path and nodes from Figure 2. We can easily see that the fsync syscall only generates three WRITE (0x2a) commands without explicitly sending a SYNC_CACHE (0x35) command to the device, which is incorrect given the POSIX semantics of fsync. Further investigation confirms that the root cause lies in the blkdev_fsync node on the critical path.

When a normal system is available, X-Ray may help even more. Figure 3 (b) shows the critical path of a reference tree. In the reference tree, the SYNC_CACHE (0x35) command appears, and a new function, blkdev_issue_flush, is involved. The comparison makes it clear that the difference stems from the blkdev_fsync node.

Summary. Table 2 summarizes the results. Besides Rule#1 and Rule#3, we define another rule, Rule#2: “functions between the syscall and the commands”. Table 2 shows the node count of the original tree and the node counts after applying each rule. A bold cell means the root cause is covered by the nodes selected via the corresponding rule. Overall, the simple rules can effectively narrow down the search space for the root cause (to 0.06% - 4.97% of the original trees). We are studying other failure patterns and developing more intelligent rules.

ID   Original          Rule#1           Rule#2           Rule#3
1    11,353 (100%)     704 (6.20%)      571 (5.03%)      30 (0.26%)
2    34,083 (100%)     697 (2.05%)      328 (0.96%)      22 (0.06%)
3    24,355 (100%)     1,254 (5.15%)    1,210 (4.97%)    15 (0.06%)
4    273,653 (100%)    10,230 (3.74%)   9,953 (3.64%)    40 (0.01%)
5    284,618 (100%)    5,621 (1.97%)    5,549 (1.95%)    50 (0.04%)
Table 2: Result Summary. Each cell shows the node count (and the percentage relative to the original tree) after applying each rule.

5 Related Work

Analyzing Storage Devices. Many researchers have studied the behaviors of storage devices in depth, including both HDDs [14, 15, 30, 51, 76] and SSDs [8, 28, 40, 43, 53, 56, 77, 85, 60]. For example, Maneas et al. [60] study the reliability of 1.4 million SSDs deployed in NetApp RAID systems. Generally, these studies provide valuable insights for reasoning about complex system failures that involve devices, which is complementary to X-Ray.

Diagnosing Distributed Systems. Great efforts have been made on tracing and analyzing distributed systems  [9, 16, 84, 48, 75, 13]. For example, Aguilera et al. [9] trace network messages and infer causal relationships and latencies to diagnose performance issues. Similar to X-Ray, these methods need to align traces. However, their algorithms typically rely on unique features of network events (e.g., RPC Send/Receive pairs, IDs in message headers), which are not available to X-Ray. On the other hand, some statistics-based methods [48] are potentially applicable when enough traces are collected.

Software Engineering. Many software engineering techniques have been proposed for diagnosing user-level programs (e.g., program slicing [7, 90, 98], delta debugging [97, 63], checkpoint/re-execution [81, 86]). In general, applying them directly to the storage stack remains challenging due to the complexity. On the other hand, some high-level ideas are likely applicable. For example, Sambasivan et al. [75] apply delta debugging to compare request flows to diagnose performance problems in Ursa Minor [4], similar to the reference-tree part of X-Ray.

6 Discussion Topics Section

Emulating Storage Devices. As mentioned in §3, sophisticated SATA/NVMe SSDs have been emulated in QEMU VMs  [96, 55]. Among others, such efforts are important for realizing VM-based full-stack tracing and diagnosis. However, we have observed some limitations of existing emulated devices, which may affect failure reproduction (and thus diagnosis) in the VM. For example, advanced features like the TRIM operation are not yet fully supported on VSSIM or FEMU, but the Algolia failure case [92] requires a TRIM-capable device to manifest. As a result, we are not able to reproduce the Algolia failure in a VM. Therefore, emulating storage devices precisely would benefit the X-Ray approach and failure analysis in general, in addition to the other well-known benefits of emulation [96, 55]. We would like to discuss how to improve emulation accuracy under practical constraints (e.g., confidentiality).

Deriving Rules. The automation of X-Ray depends on the rules. The current prototype hard-codes a number of simple rules based on our preliminary study and domain knowledge, which is limited. We would like to explore other implicit rules in the storage stack with other domain experts. Also, we plan to collect correlation trees from normal system executions and apply machine learning algorithms to derive potential rules. We would like to discuss the feasibility.

Other Usages of X-Ray. We envision that other analyses could be enabled by X-Ray. For example, with precise latency and causal relationships among events, we may identify the paths that are critical for I/O performance, similar to the request-flow analysis in distributed systems [75]. Another possibility is to measure the write amplification at different layers across the stack. We would like to discuss the opportunities.

Other Challenges of Failure Diagnosis. There are other challenges that are not covered by X-Ray. For example, X-Ray assumes that there is a bug-triggering workload that can reliably lead to the failure. In practice, deriving bug-triggering workloads from user workloads (which may be huge or inconvenient to share) is often tricky [37, 24]. We would like to discuss such challenges.

Sharing Failure Logs. The cross-layer approach would be most effective for diagnosing obscure failures that involve both the OS kernel and the device [92, 100]. Based on our communications with storage practitioners, such failures are not uncommon. However, the details of such failures are usually unavailable to the public, which limits the use cases that could shape the design of reliability tools like X-Ray. The availability of detailed failure logs at scale is critical for moving similar research efforts forward. We would like to discuss how to improve log sharing given constraints (e.g., privacy).

7 Acknowledgements

We thank the anonymous reviewers for their insightful comments and suggestions. This work was supported in part by the National Science Foundation (NSF) under grants CNS-1566554/1855565 and CCF-1717630/1853714. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF.

References

  • [1] (2016) Discussion on data loss on mSATA SSD module and Ext4 . Note: http://pcengines.ch/msata16a.htm Cited by: §1.
  • [2] (Nov 7 - Dec 8, 2014) Discussion on kernel TRIM support for SSDs: [1/3] libata: Whitelist SSDs that are known to properly return zeroes after TRIM . Note: http://patchwork.ozlabs.org/patch/407967/ Cited by: §1.
  • [3] A guide to mdadm. Note: https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm Cited by: §1.
  • [4] M. Abd-El-Malek, W. V. Courtright II, C. Cranor, G. R. Ganger, J. Hendricks, A. J. Klosterman, M. Mesnier, M. Prasad, B. Salmon, R. R. Sambasivan, S. Sinnamohideen, J. D. Strunk, E. Thereska, M. Wachs, and J. J. Wylie (2005) Ursa minor: versatile cluster-based storage. In FAST’05: Proceedings of the 4th USENIX Conference on File and Storage Technologies, Cited by: §5.
  • [5] A. Abulila, V. S. Mailthody, Z. Qureshi, J. Huang, N. S. Kim, J. Xiong, and W. Hwu (2019) FlatFlash: exploiting the byte-accessibility of ssds within a unified memory-storage hierarchy. In ASPLOS ’19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: §1.
  • [6] Advanced Flash Technology Status, Scaling Trends & Implications to Enterprise SSD Technology Enablement. Note: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2012/20120821_TA12_Yoon_Tressler.pdf Cited by: §1.
  • [7] H. Agrawal, R. A. DeMillo, and E. H. Spafford (1991) An execution-backtracking approach to debugging. In IEEE Software, Cited by: §5.
  • [8] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy (2008) Design Tradeoffs for SSD Performance. In ATC’08: USENIX 2008 Annual Technical Conference, Cited by: §5.
  • [9] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. A. Reynolds, and A. Muthitacharoen (2003) Performance debugging for distributed systems of black boxes. In ACM SIGOPS Operating Systems Review, Cited by: §5.
  • [10] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. A. Reynolds, and A. Muthitacharoen (2003) Performance debugging for distributed systems of black boxes. In ACM SIGOPS Operating Systems Review, Cited by: §3.4.
  • [11] An introduction to KProbes. Note: https://lwn.net/Articles/132196/ Cited by: §2.2, §3.3.
  • [12] D. Bae, I. Jo, Y. adel Choi, J. Hwang, S. Cho, D. Lee, and J. Jeong (2018) 2B-ssd: the case for dual, byte- and block-addressable solid-state drives. In ISCA ’18: Proceedings of the 45th Annual International Symposium on Computer Architecture, Cited by: §1.
  • [13] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang (2007) Towards highly reliable enterprise network services via inference of multi-level dependencies. In ACM SIGCOMM Computer Communication Review, Cited by: §5.
  • [14] L. N. Bairavasundaram, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, G. R. Goodson, and B. Schroeder (2008-11) An analysis of data corruption in the storage stack. Trans. Storage 4 (3), pp. 8:1–8:28. External Links: ISSN 1553-3077, Link, Document Cited by: §5.
  • [15] L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler (2007) An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’07, New York, NY, USA, pp. 289–300. External Links: ISBN 978-1-59593-639-4, Link, Document Cited by: §5.
  • [16] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier (2004) Using magpie for request extraction and workload modelling. In OSDI’04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, Cited by: §5.
  • [17] Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. Note: https://arxiv.org/abs/1903.05714 Cited by: §1.
  • [18] H. P. Belgal, N. Righos, I. Kalastirsky, J. J. Peterson, R. Shiner, and N. Mielke (2002) A new reliability model for post-cycling charge retention of flash memories. In Proceedings of the 40th Annual Reliability Physics Symposium, pp. 7–20. Cited by: §1.
  • [19] F. Bellard (2005) QEMU, a fast and portable dynamic translator.. In USENIX Annual Technical Conference, FREENIX Track, Vol. 41, pp. 46. Cited by: §3.1.
  • [20] S. Bhatia, A. Kumar, M. E. Fiuczynski, and L. Peterson (2008) Lightweight, high-resolution monitoring for troubleshooting production systems. In OSDI’08: Proceedings of the 8th USENIX conference on Operating systems design and implementation, Cited by: §3.4.2, §3.4.
  • [21] M. Bjorling, J. Axboe, D. Nellans, and P. Bonnet (2013) Linux block io: introducing multi-queue ssd access on multi-core systems. In SYSTOR ’13: Proceedings of the 6th International Systems and Storage Conference, Cited by: §1.
  • [22] J. Bornholt, A. Kaufmann, J. Li, A. Krishnamurthy, E. Torlak, and X. Wang (2016) Specifying and checking file system crash-consistency models. In ASPLOS ’16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, Cited by: §3.4.2.
  • [23] A. Brand, K. Wu, S. Pan, and D. Chin (1993) Novel read disturb failure mechanism induced by FLASH cycling. In Proceedings of the 31st Annual Reliability Physics Symposium, pp. 127–132. Cited by: §1.
  • [24] Bug 201639 - ext4 and btrfs filesystem corruption. Note: https://bugzilla.kernel.org/show_bug.cgi?id=201639 Cited by: §6.
  • [25] Bug 206397 XFS: Assertion failed . Note: https://bugzilla.kernel.org/show_bug.cgi?id=206397 Cited by: §3.4.2.
  • [26] P. A. Buhr, M. Karsten, and J. Shih (1996) KDB: a multi-threaded debugger for multi-threaded applications. In SPDT ’96: Proceedings of the SIGMETRICS symposium on Parallel and distributed tools, Cited by: §2.2.
  • [27] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai (2012) Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’12, San Jose, CA, USA, pp. 521–526. External Links: ISBN 978-3-9810801-8-6, Link Cited by: §1.
  • [28] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai (2014) Neighbor-cell assisted error correction for MLC NAND flash memories. In ACM SIGMETRICS Performance Evaluation Review, Vol. 42, pp. 491–504. Cited by: §5.
  • [29] B. Caldwell (2015) Improving block-level efficiency with scsi-mq. arXiv preprint arXiv:1504.07481. Cited by: §1.
  • [30] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson (1994-06) RAID: high-performance, reliable secondary storage. ACM Comput. Surv. 26 (2), pp. 145–185. External Links: ISSN 0360-0300, Link, Document Cited by: §5.
  • [31] Computeexpresslink(CXL). Note: https://www.computeexpresslink.org/ Cited by: §1.
  • [32] Correlating debugger. Note: https://patents.google.com/patent/US20050183066A1/en Cited by: §2.2.
  • [33] DAX: Page cache bypass for filesystems on memory storage. Note: https://lwn.net/Articles/618064/ Cited by: §1.
  • [34] Dtrace. Note: http://dtrace.org/blogs/ Cited by: §2.2.
  • [35] U. Erlingsson, M. Peinado, S. Peter, and M. Budiu (2012) Fay: extensible distributed tracing from kernels to clusters. In ACM Transactions on Computer Systems (TOCS), Cited by: §2.2.
  • [36] (April 29, 2013) Failure on FreeBSD/SSD: Seeing data corruption with zfs trim functionality . Note: https://lists.freebsd.org/pipermail/freebsd-fs/2013-April/017145.html Cited by: §1.
  • [37] Firefox 3 uses fsync excessively. Note: https://bugzilla.mozilla.org/show_bug.cgi?id=421482 Cited by: §6.
  • [38] Flash array. Note: https://patents.google.com/patent/US4101260A/en Cited by: §1.
  • [39] Ftrace. Note: https://elinux.org/Ftrace Cited by: §3.3.
  • [40] R. Gabrys, E. Yaakobi, L. M. Grupp, S. Swanson, and L. Dolecek (2012) Tackling intracell variability in TLC flash through tensor product codes. In ISIT’12, pp. 1000–1004. Cited by: §1, §5.
  • [41] GDB: The GNU Project Debugger. Note: https://www.gnu.org/software/gdb/ Cited by: §2.2.
  • [42] V. Gogte, W. Wang, S. Diestelhorst, A. Kolli, P. M. Chen, S. Narayanasamy, and T. F. Wenisch (2019) Software wear management for persistent memories. In 17th USENIX Conference on File and Storage Technologies (FAST 19), Cited by: §1.
  • [43] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, and J. K. Wolf (2009) Characterizing flash memory: anomalies, observations, and applications. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, New York, NY, USA, pp. 24–33. External Links: ISBN 978-1-60558-798-1, Link, Document Cited by: §1, §5.
  • [44] H. S. Gunawi, A. Rajimwale, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau (2008) SQCK: a declarative file system checker. In OSDI’08: Proceedings of the 8th USENIX conference on Operating systems design and implementation, Cited by: §3.4.2.
  • [45] How to Read a SCSI Bus Trace. Note: https://www.drdobbs.com/how-to-read-a-scsi-bus-trace/199101012 Cited by: §2.2, §3.2.
  • [46] HP Warns That Some SSD Drives Will Fail at 32,768 Hours of Use. Note: https://www.bleepingcomputer.com/news/hardware/hp-warns-that-some-ssd-drives-will-fail-at-32-768-hours-of-use/ Cited by: §1.
  • [47] io_uring IO interface. Note: https://lwn.net/Articles/776428/ Cited by: §3.2.
  • [48] S. Kandula, R. Mahajan, P. D. Verkaik, S. Agarwal, J. D. Padhye, and P. Bahl (2009) Detailed diagnosis in enterprise networks. In ACM SIGCOMM Computer Communication Review, Cited by: §5.
  • [49] Kgdb. Note: https://elinux.org/Kgdb Cited by: §2.2.
  • [50] S. Kim, M. Xu, S. Kashyap, J. Yoon, W. Xu, and T. Kim (2019) Finding semantic bugs in file systems with an extensible fuzzing framework. In SOSP ’19: Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 147–161. Cited by: §4.
  • [51] A. Krioukov, L. N. Bairavasundaram, G. R. Goodson, K. Srinivasan, R. Thelen, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau (2008) Parity lost and parity regained.. In FAST, Vol. 8, pp. 1–15. Cited by: §5.
  • [52] H. Kurata, K. Otsuga, A. Kotabe, S. Kajiyama, T. Osabe, Y. Sasago, S. Narumi, K. Tokami, S. Kamohara, and O. Tsuchiya (2006) The impact of random telegraph signals on the scaling of multilevel flash memories. In VLSI Circuits, 2006. Digest of Technical Papers. 2006 Symposium on, pp. 112–113. Cited by: §1.
  • [53] H. Kurata, K. Otsuga, A. Kotabe, S. Kajiyama, T. Osabe, Y. Sasago, S. Narumi, K. Tokami, S. Kamohara, and O. Tsuchiya (2006) The impact of random telegraph signals on the scaling of multilevel flash memories. In VLSI Circuits, 2006. Digest of Technical Papers. 2006 Symposium on, pp. 112–113. Cited by: §5.
  • [54] C. Lee, D. Sim, J. Hwang, and S. Cho (2015) F2FS: a new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15), pp. 273–286. External Links: ISBN 978-1-931971-201 Cited by: §1.
  • [55] H. Li, M. Hao, M. H. Tong, S. Sundararaman, M. Bjørling, and H. S. Gunawi (2018) The case of femu: cheap, accurate, scalable and extensible flash emulator. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pp. 83–90. Cited by: §3.1, §6.
  • [56] J. Li, K. Zhao, X. Zhang, J. Ma, M. Zhao, and T. Zhang (2015) How much can data compressibility help to improve nand flash memory lifetime?. In FAST, pp. 227–240. Cited by: §5.
  • [57] (November 15, 2009) Libata: add trim support. Note: https://lwn.net/Articles/362108/ Cited by: §1.
  • [58] L. Lu, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and S. Lu (2013) A study of linux file system evolution. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST’13, pp. 31–44. External Links: Link Cited by: §4.
  • [59] L. Lu, Y. Zhang, T. Do, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau (2014) Physical disentanglement in a container-based file system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO, pp. 81–96. External Links: ISBN 978-1-931971-16-4, Link Cited by: §1.
  • [60] S. Maneas, K. Mahdaviani, T. Emami, and B. Schroeder (2020) A study of ssd reliability in large scale enterprise storage deployments. In 18th USENIX Conference on File and Storage Technologies (FAST 20), Cited by: §5.
  • [61] J. Meza, Q. Wu, S. Kumar, and O. Mutlu (2015) A large-scale study of flash memory failures in the field. In ACM SIGMETRICS Performance Evaluation Review, Cited by: §1.
  • [62] C. Min, S. Kashyap, B. Lee, C. Song, and T. Kim (2015) Cross-checking semantic correctness: the case of finding file system bugs. In Proceedings of the 25th Symposium on Operating Systems Principles, pp. 361–377. Cited by: §2.1.
  • [63] G. Misherghi and Z. Su (2006) HDD: hierarchical delta debugging. In Proceedings of the 28th international conference on Software engineering, pp. 142–151. Cited by: §3.4.2, §5.
  • [64] J. Mohan, A. Martinez, S. Ponnapalli, P. Raju, and V. Chidambaram (2018) Finding crash-consistency bugs with bounded black-box crash testing. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 33–50. Cited by: §2.1, §2.1, §4.
  • [65] (2016) NVM Express. Note: https://nvmexpress.org/ Cited by: §1.
  • [66] T. Ong, A. Frazio, N. Mielke, S. Pan, N. Righos, G. Atwood, and S. Lai (1993) Erratic Erase In ETOX/sup TM/ Flash Memory Array. In Symposium on VLSI Technology, VLSI’93. Cited by: §1.
  • [67] Oracle VM VirtualBox. Note: https://www.virtualbox.org/ Cited by: §3.1.
  • [68] X. Ouyang, D. Nellans, R. Wipfel, D. Flynn, and D. K. Panda (2011) Beyond block i/o: rethinking traditional storage primitives. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture, Cited by: §1.
  • [69] Parallels Workstation. Note: https://www.parallels.com/products/ Cited by: §3.1.
  • [70] V. Prabhakaran, L. N. Bairavasundaram, N. Agrawal, H. S. Gunawi, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau (2005-10) IRON File Systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP’05), Brighton, United Kingdom, pp. 206–220. Cited by: §1.
  • [71] Ptrace, utrace, uprobes: lightweight, dynamic tracing of user apps. Note: https://landley.net/kdocs/ols/2007/ols2007v1-pages-215-224.pdf Cited by: §3.3.
  • [72] (July 19, 2015) Raid0: data corruption when using trim. Note: https://www.spinics.net/lists/raid/msg49440.html Cited by: §1.
  • [73] Ringing in a new asynchronous I/O API. Note: https://lwn.net/Articles/776703/ Cited by: §3.2.
  • [74] A. Riska and E. Riedel (2009) Evaluation of disk-level workloads at different time-scales. In 2009 IEEE International Symposium on Workload Characterization (IISWC), Cited by: §2.2.
  • [75] R. R. Sambasivan, A. X. Zheng, M. D. Rosa, and E. Krevat (2011) Diagnosing performance changes by comparing request flows. In NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation, Cited by: §3.4, §5, §5, §6.
  • [76] B. Schroeder and G. A. Gibson (2007) Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), Cited by: §5.
  • [77] B. Schroeder, R. Lagisetty, and A. Merchant (2016-02) Flash reliability in production: the expected and the unexpected. In 14th USENIX Conference on File and Storage Technologies (FAST 16), Santa Clara, CA, pp. 67–80. External Links: ISBN 978-1-931971-28-7, Link Cited by: §5.
  • [78] SCSI bus analyzer. Note: https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/diagnosticsubsystem/header_54.html Cited by: §2.2.
  • [79] (March 20, 2017) scsi-mq. Note: https://lwn.net/Articles/602159/ Cited by: §1.
  • [80] J. Seo, W. Kim, W. Baek, B. Nam, and S. H. Noh (2017) Failure-atomic slotted paging for persistent memory. ACM SIGPLAN Notices 52 (4), pp. 91–104. Cited by: §1.
  • [81] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou (2004) Flashback: a lightweight extension for rollback and deterministic replay for software debugging. In USENIX 2004 Annual Technical Conference, Cited by: §5.
  • [82] R. Stallman, R. Pesch, S. Shebs, et al. (2002) Debugging with gdb. Free Software Foundation 51, pp. 02110–1301. Cited by: §1.
  • [83] K. Suh, B. Suh, Y. Lim, J. Kim, Y. Choi, Y. Koh, S. Lee, S. Kwon, B. S. Choi, J. Yum, J. Choi, J. Kim, and H. Lim (1995) A 3.3V 32Mb NAND flash memory with incremental step pulse programming scheme. In IEEE Journal of Solid-State Circuits, JSSC’95. Cited by: §1.
  • [84] J. Tan, X. Pan, S. P. Kavulya, R. Gandhi, and P. Narasimhan (2009) Mochi: visual log-analysis based tools for debugging hadoop. In HotCloud’09: Proceedings of the 2009 conference on Hot topics in cloud computing, Cited by: §3.4, §5.
  • [85] H. Tseng, L. M. Grupp, and S. Swanson (2011) Understanding the impact of power loss on flash memory. In Proceedings of the 48th Design Automation Conference (DAC’11), Cited by: §5.
  • [86] J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou (2007) Triage: diagnosing production run failures at the user’s site. In ACM SIGOPS Operating Systems Review, Cited by: §5.
  • [87] Using NTP to Control and Synchronize System Clocks. Note: http://www-it.desy.de/common/documentation/cd-docs/sun/blueprints/0701/NTP.pdf Cited by: §3.4.1.
  • [88] B. Van Assche (2015) Increasing scsi lld driver performance by using the scsi multiqueue approach. Cited by: §1.
  • [89] VMware. Note: https://www.vmware.com/ Cited by: §3.1.
  • [90] M. Weiser (1982) Programmers use slices when debugging. In Communications of the ACM, Cited by: §5.
  • [91] B. Welch and G. Noer (2013) Optimizing a hybrid ssd/hdd hpc storage system based on file size distributions. In 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), Cited by: §1.
  • [92] (June 15, 2015) When solid state drives are not that solid. Note: https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/ Cited by: §1, §1, §1, §6, §6.
  • [93] XRay: A Function Call Tracing System. Note: https://research.google/pubs/pub45287/ Cited by: §2.2.
  • [94] J. Xu and S. Swanson (2016) NOVA: a log-structured file system for hybrid volatile/non-volatile main memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16), Cited by: §1.
  • [95] J. Yang, C. Sar, and D. Engler (2006) Explode: a lightweight, general system for finding serious storage system errors. In Proceedings of the 7th symposium on Operating systems design and implementation, pp. 131–146. Cited by: §1, §2.1, §2.1.
  • [96] J. Yoo, Y. Won, J. Hwang, S. Kang, J. Choi, S. Yoon, and J. Cha (2013) Vssim: virtual machine based ssd simulator. In 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–14. Cited by: §3.1, §6.
  • [97] A. Zeller (2002) Isolating cause-effect chains from computer programs. In Proceedings of the 10th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT ’02/FSE-10, New York, NY, USA, pp. 1–10. External Links: ISBN 1-58113-514-9, Link Cited by: §3.4.2, §5.
  • [98] X. Zhang, R. Gupta, and Y. Zhang (2003) Precise dynamic slicing algorithms. In ICSE ’03: Proceedings of the 25th International Conference on Software Engineering, Cited by: §5.
  • [99] M. Zheng, J. Tucek, D. Huang, F. Qin, M. Lillibridge, E. S. Yang, B. W. Zhao, and S. Singh (2014) Torturing Databases for Fun and Profit. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO, pp. 449–464. External Links: ISBN 978-1-931971-16-4, Link Cited by: §2.1.
  • [100] M. Zheng, J. Tucek, F. Qin, M. Lillibridge, B. W. Zhao, and E. S. Yang (2017) Reliability Analysis of SSDs under Power Fault. In ACM Transactions on Computer Systems (TOCS), External Links: Link Cited by: Table 1, §1, §4, §6.
  • [101] M. Zheng, J. Tucek, F. Qin, and M. Lillibridge (2013) Understanding the robustness of SSDs under power fault. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13), Cited by: §1, §1.