TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction

09/14/2017
by Miryeong Kwon, et al.
Penn State University
Berkeley Lab

Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true north due to a lack of runtime contexts, such as user idle periods and system delays, which are fundamentally linked to the characteristics of the target storage hardware. In this work, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of existing block traces by keeping most of their execution contexts and user scenarios while adjusting them with new system information. Specifically, our TraceTracker's software evaluation model can infer CPU burst times and user idle periods from old storage traces, whereas its hardware evaluation method remasters the storage traces by incorporating the inferred time information and updates all inter-arrival times by making them aware of the target storage system. We apply the proposed co-evaluation model to 577 traces, which were collected from servers at different institutions and locations a decade ago, and revive the traces on a high-performance flash-based storage array. The evaluation results reveal that the execution contexts reconstructed by TraceTracker detect, on average, 99% of system delays and idle periods and secure 96% of the total idle periods of a real execution.


I Introduction

Tracing block accesses is a long-established method to extract and tabulate various system parameters. A set of collected I/O instructions, referred to as a block trace, can provide valuable insights into design tradeoffs and can be used for the implementation of various software subsystems and hardware components in storage stacks. Therefore, many proposals utilize a wide spectrum of block traces for system characterizations, model verifications, and design analyses [10, 15, 18]. Nevertheless, it is non-trivial and ever-challenging to appropriately record block accesses on various large servers. Thus, open-license block traces, collected at different institutions and server locations, are extensively used in the computer and system communities [27, 9, 16, 11].

While these traces include detailed block access information, they can also lead to wrong results and conclusions in some simulation-based analyses and design studies. Specifically, the time information (i.e., inter-arrival times) in traces is intrinsically connected to the performance characteristics of the target storage. Since modern storage systems are undergoing significant technology shifts, the different performance exhibited by new hardware can result in different I/O timing and user application behaviors. Furthermore, open-license block traces were collected on old systems that employed many hard disk drives (HDDs) designed a decade ago, which in turn can make system analyses and evaluations based on such block traces significantly different from the actual results that reflect the real characteristics of modern systems.

Even though this limited timing information matters for system research, it is extremely challenging to collect comprehensive information on a variety of servers and large-scale computing systems while incorporating many important (but unpredictable) user scenarios. For example, Microsoft's exchange server workloads [9], which are among the most popular block traces in the systems community, recorded detailed I/O patterns across multiple production clusters generated by 5,000 users. Even if one tried to retrace the workloads by constructing such servers with modern storage like solid state drives (SSDs), it would be difficult to capture all the system delays, idle operations, and non-deterministic timing behaviors generated by thousands of users. To address these challenges, some replay methods statically accelerate the old traces to study peak performance [25, 8, 30]. However, these overly-simplified "Acceleration" methods are too imprecise to remaster the time information of the workloads. There also exist dynamic approaches that revise the block traces by issuing actual I/Os to a real system [14, 32, 4]. These "Revision" methods can make the inter-arrival times of workloads more realistic, but they can also lose other important runtime contexts such as user idle periods and system delays.

Figure 1: Cumulative distribution function (CDF) for inter-arrival times observed by different methods and systems.

To be precise, we evaluate the inter-arrival times generated by an acceleration method (Acceleration [8]) and a revision method (Revision [4]); the results are shown in Figure 1 in the form of a cumulative distribution function (CDF). In this evaluation, we generated 70 million instructions whose I/O patterns are the same as those of a Microsoft network storage file server [9], and issued them to an HDD-based system node (OLD) and an SSD-based system node (NEW), respectively. Specifically, 14 million instructions are issued in an asynchronous fashion, and we injected user idle operations that account for 20% of the total instructions to make the I/O accesses more realistic. The same patterns are collected from both OLD and NEW for a fair comparison. The traces on OLD are used for Acceleration, while Revision is implemented by reconstructing the workloads through replaying them on the SSD-based storage node. As shown in the figure, the first half of the distribution curve of Acceleration exhibits inter-arrival times that are shorter than those of the actual target system (NEW) by 88% on average, while losing 98% of the user idle times, compared to the target system. Even though the timing trend of Revision appears similar to that of NEW, it still exhibits inter-arrival times in the first half of the CDF curve that are longer than those of NEW by 16%, on average. More importantly, Revision fails to capture 18% of the user idle operations and 69% of the total idle periods observed in the real system, NEW.

In this paper, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of existing block traces by keeping most of their execution contexts and user behaviors while adjusting them to the new system information. Specifically, our proposed TraceTracker's software evaluation model can infer CPU burst times and user idle periods from old-fashioned block traces, whereas its hardware evaluation method remasters the block traces by incorporating the inferred time information and renews all inter-arrival times by making them aware of the target storage system. The proposed software and hardware co-evaluation methods can be implemented by using publicly-available benchmark tools such as FIO [2].

The main contributions can be summarized as follows:

Reviving the timing information for diverse workloads.

There are several workloads that provide no specific information or description of the underlying storage trace collection environment. In this work, we analyze a diverse set of large-scale workloads and provide an inference model that estimates the relative time costs of an I/O request service. This inference model extracts realistic idle times that capture the system and user behaviors from traditional block traces by dividing each inter-arrival time into a channel delay, a device time, and an idle time. It decomposes the I/O subsystem latency by analyzing the probability density functions and cumulative distribution functions of the inter-arrival times, while being aware of the given request sizes and operation types.

Inference automation and hardware/software co-evaluation. Analyzing extensive out-of-date block traces is non-trivial, and reconstructing the traces is not a one-shot process, as the target system will keep shifting its underlying storage technology. In this work, we reify the proposed inference model by automating our graph classification method and steepness analysis, each of which is used to examine massive trace data and speculate the underlying I/O subsystem latency. With the timing information deduced by the proposed inference automation, TraceTracker simulates the old system behaviors and emulates I/O services on a real target system. TraceTracker also performs post-processing to revive asynchronous/synchronous information in the newly emulated traces. To verify the proposed trace reconstruction method, we introduce several verification metrics, such as user idle detection and idle length. Even when no runtime information is available for the trace collections, our TraceTracker can appropriately detect 99% of system delays and idle periods and secure the corresponding idle periods by 96% of a real execution, on average.

Massive trace reconstruction and analysis. In this work, we reconstructed 577 traces (all the traces collected for this paper are available for download from http://trace.camelab.org) that cover a diverse set of I/O workloads of large-scale computing systems, such as web services, data mining, and network file system servers, and performed a comprehensive analysis of the reconstructed block traces. While previous work [8] claimed that 50% of write requests have time intervals that are 2x longer than the effective device operation latency, even after accelerating the block traces (of the same workloads that we tested) by 100x, we observed that the number of time intervals that contain idle periods is less than 39% of the total number of I/O requests. Note that the majority of idle periods in all the block traces fall within 1 millisecond, which is also 10% shorter than the one reported by the prior study [8].

II Background

(a) Storage stack.
(b) Block request timing example.
Figure 2: Storage-level I/O information.

In this section, we first explain the storage-level I/O information from the perspective of a storage stack and block request timing sequence. We then introduce the existing trace revision methods and discuss their limitations.

II-A Storage-Level I/O Information

Storage stack. Figure 2(a) illustrates a typical storage stack from an application to the underlying storage. Once the application makes an I/O request, it is required to switch from the user mode to the kernel mode and jump into an entry point of a virtual file system (VFS). VFS then copies the corresponding target data from a user buffer to a kernel buffer (referred to as the page cache [6]) and forwards the request to the underlying file system. During this time, the mode switch consumes CPU cycles for handling system calls and storing task states, in addition to copying the buffers. The file system then looks up the physical locations indicated by the request and submits this information to the block layer. Finally, the block layer partitions the translated information, including the logical block address and request size (in terms of the number of sectors), into multiple packets (or transactions). Note that, before submitting the actual information to the underlying storage, the multiple layers in the storage stack consume CPU cycles for mode switches, data copies, and address translations. Note also that open-license block traces are typically collected underneath the block layer. In cases where there are no system delays or application idleness, the user/kernel-specific CPU bursts can overlap with storage bursts, which makes the computational cycles that upper software modules consume hide behind the inter-arrival times of multiple I/O requests in the block traces.

Block request timing. Figure 2(b) shows the timing diagram of block requests, which can be captured from underneath the block layer. There are three requests, denoted by Req1, Req2, and Req3. In this example, Req1 is issued asynchronously, whereas all other requests are issued synchronously. Since an asynchronous block request does not need to wait for the response from the underlying device, it experiences only the delay caused by the storage interface (i.e., channel) movement of the corresponding data packet; this channel delay is referred to as T_chnl. In this work, the I/O subsystem latency, called T_lat, consists of T_chnl and the actual device time taken by the storage to service the request, denoted by T_dvce. The request Req2 is ready to be submitted to the storage at ➀, and therefore, it is issued at ➁. Even though Req2 is finished at ➂, the user/kernel consume some computation cycles before the next request becomes available, which in turn delays the issue of Req3 until ➃. Once the request is prepared by the upper layers, it can be served with its T_chnl and T_dvce. Note that, in addition to this kind of system delay, the idle time, T_idle, also covers the periods during which a user or application does nothing. In this example of block request timings, the inter-arrival time, called T_itv, is defined as the time period between ➁ and ➃.
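To make this timing decomposition concrete, the following minimal sketch (our illustration, not part of the paper's tooling) computes T_lat and T_idle from the quantities defined above; the function name and example values are hypothetical.

# Minimal sketch (illustrative): decompose an inter-arrival time into
# I/O subsystem latency and idle time, per the definitions above.
def decompose_interarrival(t_itv, t_chnl, t_dvce):
    """T_lat = T_chnl + T_dvce; T_idle is whatever remains of T_itv."""
    t_lat = t_chnl + t_dvce            # I/O subsystem latency
    t_idle = max(0.0, t_itv - t_lat)   # system delay plus user idle time
    return t_lat, t_idle

# Example: a 2.0 ms issue gap, a 0.1 ms channel delay, and a 0.6 ms
# device time yield T_lat = 0.7 ms and T_idle = 1.3 ms.
t_lat, t_idle = decompose_interarrival(2.0e-3, 0.1e-3, 0.6e-3)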

II-B Trace Revision

Even though existing block traces can cover different kinds of system configurations and various user scenarios, most publicly-available conventional block traces [1, 27, 9, 16, 11] were collected on HDD-based storage systems around a decade ago. Since then, however, storage systems have changed dramatically, as modern servers have started to adopt flash-based storage to boost performance, and most server workloads have significantly changed as well. Since these block traces are intensively studied and used for demonstrating the effectiveness and performance impacts of many system research proposals [31, 8, 17, 23], they need to be mapped to new block traces that consider the new storage system characteristics.

Several approaches exist to reconstruct traditional block traces [25, 8, 30, 14, 32, 4]. First, the acceleration methods (Acceleration) [25, 8, 30] can artificially shorten inter-arrival times to compensate for the low throughput exhibited by HDD-based storage. However, since these methods only resize the inter-arrival times without considering the block request timings, they can remove critical information, such as T_idle and T_lat, from the traces. For example, if the average T_itv of a workload is 50 ms and the acceleration factor is 100, the reconstructed trace exhibits 500 us for its average T_itv. This removes most of the T_idle and T_lat information, and can even make T_itv unrealistic, as there is no context for the target device, system, and user behaviors. Instead of simply accelerating inter-arrival times, there are also revision methods (Revision) that revise target workloads by replaying the corresponding block traces on a real system [14, 32, 4]. While these yield more realistic T_chnl and T_dvce, they cannot appropriately accommodate T_idle, which varies across all I/O instructions in the trace.

(a) Acceleration.
(b) Revision.
Figure 3: Differences of inter-arrival times observed by reconstructed traces and real system traces.

To be precise, we also compare the T_itv observed in the SSD-based system node (NEW) with the ones generated by Revision and Acceleration, respectively. The evaluation environment and scenario are the same as the test conditions described in Section I, and Acceleration leverages the acceleration degree that [8] uses. We examine the different T_itv values by executing five open-license block traces (MSNFS, webusers, Exchange, homes, wdev), which are widely used in the storage community [24], and the results are shown in Figure 3. One can observe from Figure 3(a) that 98.6% of the T_itv reconstructed by Acceleration are shorter than the actual T_itv observed in NEW. In contrast, as shown in Figure 3(b), the T_itv of Revision is on average 17.8% accurate (i.e., 'equal' in the figure). However, most of them (77.8% of the total T_itv, on average) are shorter than the actual T_itv, which means Revision loses important system delays and user idle periods, T_idle. Note that it also exhibits T_itv that are, on average, 4.3% longer than the actual ones. This is because replaying traces drops the mode contexts (i.e., asynchronous/synchronous), which fails to capture the block request timing of the asynchronous request (as shown in Figure 2(b)). Since it is difficult to capture all system delays, idle operations, and non-deterministic user behaviors (to the best of our knowledge, there is no block trace that offers all such information), block trace reconstruction with limited information is non-trivial and challenging work.

III Timing Inference for I/O Subsystems

Figure 4: High-level view of TraceTracker.
(a) Global maxima.
(b) Chunky middle.
(c) Multi maxima.
Figure 5: Types of CDF distribution.

Overview of TraceTracker. For individual I/O instructions, it is non-trivial to extract the idle time (i.e., T_idle) from old block traces, since T_idle is affected by multiple unknown system parameters and non-deterministic user behaviors at the time of trace collection. Even though the old block traces have no runtime information, T_idle, including the user behaviors, can be inferred if we can estimate the I/O subsystem latency (i.e., T_lat), which is composed of the channel delay (i.e., T_chnl) and the storage device time (i.e., T_dvce). Generally speaking, T_idle can be simply obtained by subtracting T_lat from T_itv. Estimating T_lat would be relatively easy if most inter-arrival times (i.e., T_itv) were similar to each other, which would make the graph representing the CDF of T_itv steeper. As shown in Figure 5(a), such a graph rapidly rises in the middle and exhibits a single maximum on its derivative. Because almost the entire range of T_itv sits in the middle of the domain and is not affected by its tail, the T_itv at the global maximum of the CDF's slope can be considered as T_lat. However, there are many block traces whose CDF exhibits a much smoother slope (e.g., a chunky middle) and/or multiple maxima on its derivative. This is because the T_itv of each instruction is affected by different runtime contexts on the target system, which often makes the values vary significantly. Considering Figure 5(c) as an example, this workload exhibits at least two maxima on the derivative of its CDF, which renders such a simple differential analysis unable to appropriately predict the T_lat of the corresponding trace.

Figure 4 summarizes the operation of our proposed TraceTracker. In this work, as shown in the software simulation on the left of Figure 4, we classify all the I/O instructions traced by a workload into multiple groups based on the request size and operation type. For these groups, we create multiple CDFs of T_itv, and estimate the relative time costs of T_chnl and T_dvce in block request timings by taking into account the different request sizes and types. This relative time cost estimation, in turn, enables us to individually calculate T_lat for all I/O instructions, thereby extracting T_idle from the target traditional block trace. Once we secure T_idle, which varies based on user and system timing behaviors, T_itv can be re-evaluated by taking into account the target storage system. Specifically, we emulate the new system by regenerating each request and issuing it on the new storage with the estimated T_idle. After the target trace emulation, we perform a simple post-processing on the trace, which overrides the I/O timing behaviors for asynchronous mode operations by considering the old block trace and the regenerated new trace. In short, while the inference logic of TraceTracker extracts the timing behaviors affected by non-deterministic user behaviors and unknown system parameters from the old block trace, the hardware emulation and post-processing parts mimic the system delays and user idle periods on a real (target) system to generate the new block trace.
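As an illustration of this grouping step, the sketch below (ours; the record layout is an assumption) bins traced I/O instructions by operation type and request size and builds an empirical CDF of T_itv for each group.

# Sketch (assumed record layout): group I/O instructions by (op, size)
# and build an empirical CDF of inter-arrival times for each group.
from collections import defaultdict
import numpy as np

def group_and_cdf(records):
    """records: time-sorted (timestamp, op, size_sectors) tuples."""
    groups = defaultdict(list)
    prev_ts = None
    for ts, op, size in records:
        if prev_ts is not None:
            groups[(op, size)].append(ts - prev_ts)  # T_itv of this request
        prev_ts = ts
    cdfs = {}
    for key, itvs in groups.items():
        x = np.sort(np.asarray(itvs))
        y = np.arange(1, len(x) + 1) / len(x)        # empirical CDF values
        cdfs[key] = (x, y)
    return cdfs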

Figure 6: Finding the coefficients of T_dvce (α_r or α_w).

Inference model. If there is no user idle period or system delay caused by the host-side software modules, T_itv can be similar to, or even the same as, T_lat. In other words, wherever T_itv is greater than T_lat, T_idle can be simply inferred by subtracting T_lat from T_itv. Even though the specific information captured by T_dvce is also often not recorded or offered by the old block traces, in contrast to T_itv, it can be speculated by analyzing the distribution of T_itv. As described earlier, there is only one CDF of T_itv if all I/O instructions in the target workload exhibit a uniform request size; this in turn allows us to simply speculate T_lat by referring to the T_itv at the global maximum of the CDF's slope. In cases where the target workload exhibits a wide spectrum of request sizes and types, we speculate T_lat by inferring T_chnl and T_dvce separately. Since T_dvce mainly depends on the underlying storage performance, we assume that it follows a linear model for sequential accesses; T_dvce is inferred by α_r·S if the type of the request is a read, and otherwise it is speculated by α_w·S. Here, α_r and α_w are coefficient values, which will be explained shortly, and S denotes the size of a request. On the other hand, the T_dvce of random accesses can be slightly longer than that of sequential accesses, as it includes a moving delay time, referred to as T_mov; T_mov typically captures the seek time and rotational latency of the underlying disk [21].

To model T_mov, we replay ten FIU workloads [27, 11] on an enterprise disk [29] and measure T_mov by calculating the difference between the inter-arrival times observed on the real disk when executing random I/O accesses and those generated by our linear model for sequential I/O accesses. We consider this difference as T_mov, and the results (for each workload) are plotted in Figure 7(a). As shown in the figure, each CDF exhibits a similar magnitude of gradient change with the transition of T_mov. Motivated by this, we use the T_mov at the maximum slope of the CDF as the representative difference between random and sequential accesses. Consequently, in this work, the T_dvce for random reads and writes can be expressed by α_r·S + T_mov and α_w·S + T_mov, respectively. Using this inference model, we speculate T_lat, which in turn allows us to infer T_idle for each I/O instruction. The specific estimation methods for each relative cost in block request timings are described below.
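A minimal sketch of this device-time model is shown below (ours; the coefficients alpha_r, alpha_w, and t_mov are per-workload estimates produced by the decomposition described next, not constants).

# Sketch of the linear device-time model: alpha * S for sequential
# accesses, plus T_mov (seek + rotational delay) for random accesses.
def t_dvce(op, size_sectors, is_random, alpha_r, alpha_w, t_mov):
    alpha = alpha_r if op == "read" else alpha_w
    t = alpha * size_sectors
    if is_random:
        t += t_mov
    return t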

Decomposition of I/O subsystem latency. For each workload, the coefficients of T_dvce in our inference model, α_r and α_w, can be estimated by using the following disintegration analysis. First, we group all I/O instructions of the workload under reconstruction along three dimensions: i) sequentiality (sequential vs. random), ii) operation type (read vs. write), and iii) request size (in terms of sectors). We then create a CDF of T_itv for each request size observed among the sequential reads and writes. The proposed inference model then examines the global maximum of the slope for each CDF; thus, there can be n maxima, where n is the number of different I/O request sizes observed in a target workload. It then chooses the two steepest CDFs, which have the two highest magnitudes of gradient change among the maxima. Let us denote these two steepest functions as F_1 and F_2, where the request size used for F_1 is greater than that used for F_2. As shown in Figure 6, we can derive the difference curve of F_1 and F_2 for reads and writes separately, and calculate the maximum of its slope for each.

We can then obtain the representative inter-arrival time differences at those maxima for reads and writes, which are referred to as Δt_r and Δt_w, respectively. Let us denote the two request sizes used for creating F_1 and F_2 as S_1 and S_2, respectively. We can estimate α_r and α_w by calculating Δt_r/(S_1−S_2) and Δt_w/(S_1−S_2), respectively. Let us further denote the inter-arrival times at the maximum slope of F_1 for reads and writes as t_r and t_w, respectively; t_r and t_w are the best values to explain the T_lat of the target workload, since they exclude most of the timing effects caused by system delays and user idle periods. Next, we can obtain the channel delays for reads and writes, C_r and C_w, by calculating t_r − α_r·S_1 and t_w − α_w·S_1, respectively.
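Ignoring noise, this decomposition reduces to two-point arithmetic per operation type; a sketch under that simplification (the example numbers are hypothetical):

# Sketch: estimate the per-sector coefficient alpha and the channel
# delay C from the representative inter-arrival times (t1, t2) of the
# two steepest CDFs, whose request sizes are S1 > S2.
def fit_device_model(t1, s1, t2, s2):
    alpha = (t1 - t2) / (s1 - s2)  # per-sector device time
    c = t1 - alpha * s1            # size-independent channel delay
    return alpha, c

# e.g., reads: 0.9 ms at 128 sectors, 0.6 ms at 8 sectors
alpha_r, c_r = fit_device_model(0.9e-3, 128, 0.6e-3, 8)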

(a) CDF distribution of T_mov.
(b) Average period of T_chnl.
Figure 7: The time components of T_lat (FIU).

Figure 7(b) shows the actual channel delays (T_chnl) of the FIU workloads that we observed on the disk for each access pattern. One can observe from this figure that, while some difference in T_chnl between reads and writes exists (e.g., ikki and madmax), the difference between random and sequential access patterns is not significant (less than 8% and 6%, respectively). Noting that this difference in T_chnl is smaller than T_dvce by many orders of magnitude, we believe that estimating the channel delay based on the operation type of I/O requests is reasonable.

Lastly, to estimate the relative time cost of T_mov, we also need to find the CDF that has the highest magnitude of gradient change with a transition of T_itv among the multiple CDFs in the group of random accesses, and estimate the inter-arrival time, t_rand, at the maximum of its slope. Then, T_mov can be simply calculated by subtracting α_r·S + C_r (or α_w·S + C_w) from the estimated t_rand.

IV Implementation for Inference Automation

Analyzing multiple CDF graphs is important to reconstruct old traces. While categorizing requests based on their types and sizes can be easily automated, the autonomous analysis of CDFs is non-trivial due to their discreteness. In this section, we detail the implementation of our proposed inference model and explain how to emulate traces with the inferred system delays and idle periods (i.e., T_idle) to reconstruct the traces.

Figure 8: Checking steepness of CDF distribution.

Graph classification. Since each block trace can exhibit multiple CDFs to be examined by the proposed inference model, it is time consuming to detect the two steepest graphs, namely F_1 and F_2, among them. In addition, representing a CDF as a mathematical expression can require a polynomial of high degree, which also renders the process of finding F_1 and F_2 for the read and write instruction sets of each trace difficult.

1:  /** Step 1: Calculate PDF of inter-arrival times (T_itv) */
2:  for each t in T_itv do
3:      PDF(t) := num(t) / num(request)
4:  end for
5:  /** Step 2: Least-squares regression */
6:  slope := std(PDF) / std(T_itv)
7:  intercept := mean(PDF) - slope * mean(T_itv)
8:  f(x) := slope * x + intercept
9:  /** Step 3: Find outliers */
10: margin := var(PDF) / 2
11: for each t in T_itv do
12:     distance := PDF(t) - f(t)
13:     if distance > margin then
14:         outliers.append(t)
15:     end if
16: end for
17: /** Step 4: Calculate CDF steepness */
18: t_max := argmax over t in outliers of (PDF(t) - f(t))
19: steepness := PDF(t_max) - f(t_max)

Algorithm 1: CDF steepness examination.

One simple but effective method to check the steepness of each graph is to analyze the probability density function (PDF), instead of examining the derivative of the CDF for a target trace. As shown in Figure 8, the CDF's highest magnitude of gradient change with a transition of T_itv can be obtained by identifying the utmost outlier on the corresponding PDF. Algorithm 1 outlines how to examine the steepness of the target CDF's curve through the corresponding PDF. It first calculates the PDF of T_itv (cf. lines 1–4). After that, the algorithm finds the best-fitting straight line, f(x), through the set of PDF points by using linear least-squares regression (cf. lines 5–8). In this algorithm, if a PDF point is farther from the best-fitting straight line than a margin, we refer to it as an outlier; note that, as the margin increases, the number of outliers decreases. As the final goal of this PDF analysis is to find the utmost outlier, we set the margin to half the variance (cf. lines 9–16). This PDF analysis visits all T_itv values and collects the outliers for all the categorized I/O instruction sets described in Section III (e.g., read/write and request sizes). Among the outliers, it looks for the one with the maximum distance (denoted by t_max) and returns the steepness, which is the difference between the PDF value at the utmost outlier and the f(x) value of the straight line there (cf. lines 17–19). The classification then compares the steepness values observed for each CDF and selects the two graphs that have the top two highest values.
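For reference, a compact Python rendering of Algorithm 1 is shown below (our sketch; scipy's linregress stands in for the least-squares step).

# Sketch of Algorithm 1: find the steepness of one group's CDF through
# the utmost outlier of its PDF.
import numpy as np
from scipy.stats import linregress

def cdf_steepness(itvs):
    vals, counts = np.unique(np.asarray(itvs), return_counts=True)
    pdf = counts / counts.sum()              # Step 1: PDF of T_itv
    fit = linregress(vals, pdf)              # Step 2: least squares
    f = fit.intercept + fit.slope * vals
    margin = np.var(pdf) / 2                 # Step 3: outlier margin
    outliers = np.where(pdf - f > margin)[0]
    if len(outliers) == 0:
        return 0.0, None
    i = outliers[np.argmax(pdf[outliers] - f[outliers])]  # utmost outlier
    return pdf[i] - f[i], vals[i]            # Step 4: steepness, its T_itv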

Steepness analysis. It is a challenge to find the highest gradient change with a transition of T_itv by analyzing a group of I/O requests with their CDF. Since the CDF of T_itv is a non-differentiable function due to its discontinuity, the discrete results of the two I/O instruction groups selected by the aforementioned graph classification algorithm must be converted into continuous ones. While one could perform a curve fitting on the CDFs of the two groups to achieve a differentiable function, there is no perfect function that represents all the variance observed in the data. To address this challenge, we interpolate the CDF with piecewise nonlinear curve fitting; two interpolation methods are widely used: i) a special type of piecewise polynomial interpolation (called spline) and ii) piecewise cubic Hermite interpolating polynomial (called pchip) interpolation. As shown in Figure 9, spline evaluates the coefficients for each interval of data and has two continuous derivatives, whereas pchip has just one derivative, which preserves shape more faithfully than spline. Among the CDFs of all the old block traces we tested, pchip exhibits the desired smoothness without the oscillation and under/overfitting issues that spline has. Once we interpolate the CDF with pchip, we can differentiate the result of the interpolation and find the maximum of the derivative, which is the point of the highest magnitude of gradient change with a transition of T_itv. Note that the analysis of the difference curves described earlier can be processed by the same curve fitting and differential calculation methods applied to the CDFs.
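A sketch of this steepness analysis using scipy's PchipInterpolator (our illustration; the grid resolution is arbitrary):

# Sketch: interpolate an empirical CDF with pchip and locate the
# inter-arrival time at the maximum of its derivative.
import numpy as np
from scipy.interpolate import PchipInterpolator

def steepest_point(x, y):
    """x: sorted unique inter-arrival times; y: empirical CDF values."""
    cdf = PchipInterpolator(x, y)       # shape-preserving interpolation
    grid = np.linspace(x[0], x[-1], 10000)
    slope = cdf.derivative()(grid)
    return grid[np.argmax(slope)]       # T_itv at the steepest CDF point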

Figure 9: Different types of interpolations that we tested.

Hardware emulation and post-processing. Once the relative time costs are estimated, we can derive the T_dvce equations, which infer the different device times under the execution of sequential reads/writes and random reads/writes. In cases where there are n I/O instructions traced in the target workload, we can denote the idle time, inter-arrival time, and I/O subsystem latency of the i-th instruction (1 ≤ i ≤ n) as T_idle(i), T_itv(i), and T_lat(i), respectively. We then visit each I/O instruction of an old trace and perform the following trace reconstruction procedure. First, we check the operation type and request size of the old trace's i-th instruction and estimate T_lat(i) using the model (cf. Section III). We also calculate T_itv(i) by checking the difference between the time stamps of the i-th and (i+1)-th instructions, which are given by the old block trace. Thus, T_idle(i) exists if T_itv(i) is greater than T_lat(i) (i.e., T_idle(i) = T_itv(i) − T_lat(i)). We then delay by T_idle(i) using sleep() and issue the I/O instruction (composed of the same information as the old block trace) to the underlying brand-new device. We iterate this process for all I/O instructions. During this phase, we collect the new block trace using blktrace, which is a standard block-tracing tool in Linux [3]. While this hardware emulation mimics the user behaviors, including system delays and idle periods, and incorporates actual channel delays and device times on the real target system, it is not feasible to inject synchronous/asynchronous mode information into each I/O request. Thus, we check the old trace and record all the indices of the instructions whose T_itv is shorter than T_lat. We then examine all the instructions of the new (but still interim) trace. In this post-processing, if the index of the instruction we are examining is among the instruction indices extracted from the old block trace, we subtract the new device time (measured by blktrace) from the corresponding inter-arrival time and update the next instruction based on the result. Note that if a workload provides the mode information, we can skip the inference phase and immediately perform the hardware emulation and post-processing after finding the short T_itv values.
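A simplified, synchronous-only sketch of this emulation loop is shown below (ours; the real harness also runs blktrace alongside and issues direct I/O with aligned buffers, and the trace layout is an assumption).

# Sketch of the replay loop: sleep for the inferred idle time, then
# reissue each request to the new device. Error handling omitted.
import os, time

def replay(trace, dev_path, t_lat_model):
    """trace: time-sorted (timestamp, op, lba, size_bytes) tuples;
    t_lat_model(op, size) -> inferred old-system I/O subsystem latency."""
    fd = os.open(dev_path, os.O_RDWR)  # real harness: O_DIRECT, aligned bufs
    prev = None
    for ts, op, lba, size in trace:
        if prev is not None:
            t_itv = ts - prev[0]                              # old-trace gap
            time.sleep(max(0.0, t_itv - t_lat_model(prev[1], prev[2])))
        if op == "read":
            os.pread(fd, size, lba * 512)
        else:
            os.pwrite(fd, b"\x00" * size, lba * 512)
        prev = (ts, op, size)
    os.close(fd)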

V Experimental Results

Workload sets      | Microsoft Production Server (MSPS)            | FIU SRCMap
Published year     | 2007                                          | 2008
Workloads          | 24HR 24HRS BS CFS DADS DAP DDR MSNFS          | ikki madmax online topgun webmail casa webresearch webusers
# of block traces  | 18 18 96 36 48 48 24 36                       | 20 20 20 20 20 20 28 28
Avg data size (KB) | 8.27 28.79 20.73 9.71 28.66 74.42 24.78 10.71 | 4.64 4.11 4.00 3.87 4.00 4.04 4.00 4.20
Total size (GB)    | 21.2 178.6 331.2 43.6 44.6 84 44 317.9        | 25.4 3.8 22.8 9.4 31.2 80.4 13.7 33.6

Workload sets      | FIU IODedup       | MSR Cambridge (MSRC)
Published year     | 2009              | 2008
Workloads          | mail+online homes | mds prn proj prxy rsrch src1 src2 stg web wdev usr hm ts
# of block traces  | 21 21             | 2 2 5 2 3 3 3 2 4 4 3 1 1
Avg data size (KB) | 4.0 5.23          | 33.0 15.4 29.6 8.6 8.4 35.7 40.9 26.2 7 34 38.65 15.16 9.0
Total size (GB)    | 57.1 84.6         | 208.4 568.8 4780.1 4353 27.63 6516.5 230.6 226.4 625.4 23.7 5506.1 9.24 16.2

Table I: Important characteristics of the publicly-available conventional block traces that we reconstructed.

In this evaluation, we focus on answering the following questions: i) How accurate is our inference model? ii) How realistic are the inter-arrival times (T_itv) that our hardware/software co-evaluation generates, compared to conventional approaches? and iii) What are the system implications of the revised T_itv?

Evaluation node. For the target system on which we reconstruct block traces, we built a storage node that employs an all-flash array grouping four NVM Express (NVMe) SSDs [7]. The storage capacity of each SSD is 400 GB, and a single device consists of 18 channels, 36 dies, and 72 planes. Our storage node can exhibit different levels of parallelism, ranging from the array level down to channels and dies, which in turn offers read and write bandwidths of as much as 9 GB/s and 4 GB/s, respectively. The all-flash array is connected to the node's north bridge via four PCIe 3.0 slots, each containing four lanes [20].

Target block traces. We reconstruct three workload categories: i) Florida International University (FIU) [27, 11], ii) Microsoft Production Server (MSPS) [9], and iii) Microsoft Research Cambridge (MSRC) [16]. Together, FIU, MSPS, and MSRC contain a total of 577 block traces, which are used for a wide spectrum of simulation-based studies [31, 8, 17, 23]. FIU workloads offer university-scale production server characteristics and consist of two different types of sub-workloads: SRCMap and IODedup. While SRCMap workloads are collected for an application that optimizes system energy by virtualizing storage, IODedup workloads are collected from department-level virtual machines for web services and mail, file, and version control servers. On the other hand, MSPS provides eight different kinds of production server scenarios, and MSRC provides thirteen kinds of data center server scenarios. In MSRC, all workloads contain specific device-level information, such as the type of RAID, while the same information for most workloads in MSPS is unknown. In addition, MSPS and MSRC workloads are collected by using an event-based kernel-level tracing facility [19], which can capture detailed information such as issue and completion time stamps; these timestamps are captured when requests are issued from a device driver to the target disk and when the disk completes the I/O operations, respectively. Note that, even though all the traces in the three workload categories discussed above include various system configurations and a wide range of user scenarios, they were all collected around 2007–2009 on disk-based systems. The important characteristics of the traces, including the size and the number of traces per workload, are listed in Table I.

Reconstruction techniques. We evaluate five different block reconstruction methods:


  • Acceleration: Reconstruction by shortening T_itv [8].

  • Revision: Replaying block traces on an all-flash array [4].

  • Fixed-th: An advanced revision method that infers T_idle with a fixed threshold.

  • Dynamic: Reconstruction using our inference model, but with no post-processing.

  • TraceTracker: Hardware/software co-evaluation for trace reconstruction.

We leverage the acceleration factor (i.e., 100) that a simulation-based SSD study uses [8] for Acceleration. On the other hand, Fixed-th considers the worst-case device latency of old storage as a fixed threshold value and uses it for inferring T_idle. To select a reasonable threshold, we performed a separate set of evaluations on an HDD-based node with various thresholds, ranging from 10 ms to 100 ms, and selected 10 ms as Fixed-th's optimal threshold. In contrast, Dynamic injects a different T_idle per I/O instruction by speculating it with our inference model, but without the post-processing component of TraceTracker.
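For clarity, the two idle-inference baselines differ only in what they subtract from each inter-arrival time; a sketch (function names are ours):

# Sketch: Fixed-th subtracts one fixed worst-case HDD latency (10 ms,
# per the text), while Dynamic subtracts the per-request inferred T_lat.
def idle_fixed_th(t_itv, threshold=10e-3):
    return max(0.0, t_itv - threshold)

def idle_dynamic(t_itv, op, size, t_lat_model):
    return max(0.0, t_itv - t_lat_model(op, size))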

(a) known traces.
(b) unknown traces.
Figure 10: Verification results, Len(TP).
(a) known traces.
(b) unknown traces.
Figure 11: Verification results, Len(FP).

V-A Verification

Metrics. The results of this verification can be either positive or negative, and each result may be true or false. If the inference model speculates that there is a T_idle, the result is classified as positive; otherwise, it is negative. Being negative or positive is tested per I/O instruction. On the other hand, if the existence of T_idle is the same in both the target and the reconstructed block traces, we call the result true; otherwise, it is false. Therefore, the results of the inference model test can be represented by four different statistics: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). For verification, we use four functions: i) the number of TPs over the number of injected idle times, ii) the number of FPs over the total number of I/O instructions, iii) Len(TP), and iv) Len(FP), which are computed from the idle times that were injected into the target block traces and those speculated by our inference model, respectively. The first two functions capture the ratios of the number of TPs/FPs to the number of corresponding I/O instructions, whereas Len(TP) and Len(FP) indicate how accurately our inference model speculates T_idle in the case of a prediction hit or miss, respectively. Note that Len(TP) is the ratio of the speculated idle periods to the actual idle periods, whereas Len(FP) is the actual period that the inference model mispredicts.
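A sketch of these four functions (ours), assuming parallel per-instruction arrays of injected and speculated idle times in which 0.0 encodes "no idle":

# Sketch: compute the TP/FP ratios and the Len(TP)/Len(FP) statistics.
import numpy as np

def verify(injected, speculated):
    inj, spec = np.asarray(injected), np.asarray(speculated)
    tp = (inj > 0) & (spec > 0)
    fp = (inj == 0) & (spec > 0)
    tp_ratio = tp.sum() / max(1, (inj > 0).sum())  # over injected idles
    fp_ratio = fp.sum() / len(inj)                 # over all instructions
    len_tp = spec[tp].sum() / inj[tp].sum() if tp.any() else 0.0
    len_fp = spec[fp].mean() if fp.any() else 0.0  # mispredicted period
    return tp_ratio, fp_ratio, len_tp, len_fp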

Results. Since the block traces have no ground-truth information on T_idle, we inject T_idle at random places with various idle periods, ranging from 100 us to 100 ms. In this evaluation, the injected T_idle accounts for 10% of the total I/O instructions of the target block traces. We then compare the injected T_idle with the T_idle predicted by our inference model. We select two different groups of traces: one includes the traces that contain no timing information (e.g., FIU), and the other has I/O submission and completion time information (e.g., MSPS), which can be considered the actual I/O subsystem latency. In this evaluation, we denote the former and the latter as known traces and unknown traces, respectively.

Figure 10 shows the results for the two trace groups that TraceTracker reconstructed. If the injected T_idle is longer than 1 ms, TraceTracker shows 90.5% and 97.3% accuracy of TP for the known traces and unknown traces, respectively. If the injected T_idle is close to 100 us, the accuracy of TP declines, compared to the other cases, by 46.5% and 73.9% for the known traces and unknown traces, respectively. This is because the injected T_idle is in the range of the latency that the new storage (in our case, Intel NVMe 750) exhibits. While this blurred boundary can make it difficult for our inference model to distinguish between device latency and idle time, most of the actual microsecond-scale system delays and idle periods are revived by our inference model. In addition, we observed that Len(TP) is in the range of 82.2%–99.7% across all the block traces that TraceTracker built. On the other hand, the FP ratio is, on average, 6% and 26%, while Len(FP) is, on average, 7 us and 6.4 ms, for the known traces and unknown traces, respectively. However, the distribution of Len(FP) observed in the reconstructed traces tells a different story. As shown in Figure 11, more than 98% of the Len(FP) for the known traces and unknown traces are shorter than 1 ms and 6 ms, respectively. Considering the high accuracy of TP and the low impact of FP, we can conclude that TraceTracker reconstructs T_idle within a reasonable confidence interval.

(a) Unaware of T_idle.
(b) Aware of T_idle.
Figure 12: CDF distribution of T_itv (MSNFS).
Figure 13: T_itv differences among the different kinds of trace reconstruction techniques and the TraceTracker method.

Comparisons. In this section, we analyze the accuracy of TraceTracker compared to the other reconstruction methods by inspecting the details of T_itv. To this end, we compare TraceTracker's CDF of T_itv with two different groups of methods, one unaware of T_idle and the other aware of it; the results are shown in Figures 12(a) and 12(b), respectively. In these figures, Target shows the CDF of T_itv brought by the original block traces collected on HDD-based nodes. One can observe from Figure 12(a) that Acceleration just shifts the CDF of Target from right to left by as much as the acceleration factor indicates (e.g., 100x), which eliminates all the information useful for simulating target systems. On the other hand, Revision reflects the characteristics of the underlying new storage. However, compared to TraceTracker, it loses around 70% of the total idle periods and 30% of the idle operations. As shown in Figure 12(b), while Fixed-th and Dynamic behave more realistically than Acceleration, Fixed-th unfortunately loses 65% of the T_idle, and Dynamic exhibits T_itv that are 30% longer than TraceTracker's, as it also loses the asynchronous/synchronous mode information and is unable to capture T_idle appropriately.

Figure 13 plots the average difference between TraceTracker and the other trace reconstruction methods in terms of T_itv for all the workloads we tested. One can observe from the figure that Acceleration and Revision, which possess no T_idle information to reconstruct traces, differ from TraceTracker by 7.08 and 7.15 seconds, respectively. Considering the worst-case latency of the underlying SSD accesses (around 2 ms), losing such idle times, which include system delays and user behaviors, can have a great impact on diverse simulation-based studies. While Fixed-th and Dynamic show smaller differences than Acceleration or Revision, the difference between their T_itv and TraceTracker's is still as high as 1.3 ms and 0.035 ms, respectively. This means that, even though Fixed-th and Dynamic can capture the underlying storage characteristics, the actual time behaviors, including system delays and user idle periods, are partially omitted. As a result, they can exhibit different system behaviors with inaccurate T_idle values.

V-B System implications

Figure 14: T_itv differences between the target block traces and TraceTracker traces.
(a) CFS (MSPS).
(b) ikki (FIU).
Figure 15: Distribution differences between the target block traces and TraceTracker traces.

Overall analysis of inter-arrival times. The top and bottom of Figure 14 plot the average and maximum T_itv differences between the target block traces and the traces reconstructed by TraceTracker. As shown in the figure, the T_itv of the TraceTracker traces is, on average, 0.677 ms shorter than that of the target block traces. This implies that system analysis and evaluation studies that use the T_itv of the target block traces should consider the TraceTracker traces instead, since the time budget to perform foreground/background tasks can tighten when the storage system is changed. For example, the ts workload (MSRC) has an average T_itv that is 3 ms shorter in the TraceTracker traces than in the target block traces. In addition, the median values of T_itv are 2 ms and 0.02 ms for the target block traces and TraceTracker, respectively. Note that the average T_itv difference varies among the 31 workloads because of specific workload characteristics such as request size and type.

To analyze the differences between the two traces in detail, we plot the CDF distribution of T_itv, as shown in Figure 15, only for CFS (MSPS) and ikki (FIU), which have the maximum differences within their workload categories (MSPS, FIU). As shown in the figures, the distribution of the TraceTracker traces leans towards short time periods, and the average differences are 1 ms and 0.823 ms, respectively. For instance, 50% of the T_itv in the target block traces are less than 17 ms, while the corresponding value for TraceTracker is 0.601 ms in Figure 15(a). In addition, as shown in Figure 15(b), 1% of the T_itv in the target block traces are less than 0.228 ms, while 90% of the T_itv are less than the same value in the TraceTracker traces.

Figure 16: Average time period of T_idle.
Figure 17: Breakdown of T_idle.

Details of idle times. T_idle can be a representative workload characteristic, and the estimated T_idle is injected when the traces are reconstructed on the target storage system. Since the idle periods should remain the same in the reconstructed traces, the T_idle that we estimated can be immediately reused for other conventional block traces. Figure 16 shows the average T_idle period estimated by TraceTracker. As shown in the figure, the average T_idle of MSPS is 0.27 s, and that of FIU is 2.80 s, excluding the madmax workload, which has the longest T_idle (20.5 s) among the FIU workloads. MSRC has an average value of 2.25 s, except for rsrch and wdev, which have 69.2 s and 403.1 s of T_idle, respectively.

To check the detailed T_idle patterns of the 31 workloads, we analyze the breakdown of the total duration by grouping the requests into a no-idle group and three idle groups: (0–10 ms), (10–100 ms), and (longer than 100 ms). The top and bottom parts of Figure 17 focus on the frequency and the period, respectively. The frequency refers to the total number of requests per group, while the period means the total time duration of each group. As shown in the figure, the MSPS workloads have a larger breakdown of idle-containing requests in terms of frequency, compared to the other workloads; the average breakdown is 70%, 31%, and 26% for MSPS, FIU, and MSRC, respectively. In contrast to the frequency, the average period breakdown of the idle groups per workload category is 87%, 99.8%, and 99.2% for MSPS, FIU, and MSRC, respectively. In other words, although the FIU and MSRC workloads have a low idle frequency, most of their total duration (around 90% or more) is idle. In addition, as shown in the figure, most of the T_idle is longer than 100 ms in the FIU and MSRC workloads. Similar to the average period shown in Figure 16, the breakdown pattern of the MSPS workloads varies compared to the other workloads. In MSPS, the average frequency is 30%, 47.7%, 15%, and 6.7% for each group, while the average period breakdown is 12.6%, 18.3%, 26%, and 42.7%, respectively. Since the MSPS workloads have short T_idle, it is harder to utilize their inter-arrival times, compared to the other workloads.
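A sketch of this breakdown (ours), under one plausible reading in which a group's period is its share of the total inter-arrival duration:

# Sketch: bucket per-request idle times into the four groups above and
# compute the frequency and period breakdowns.
import numpy as np

def idle_breakdown(t_itv, t_idle):
    t_itv, t_idle = np.asarray(t_itv, float), np.asarray(t_idle, float)
    groups = {
        "no idle":   t_idle == 0,
        "0-10 ms":   (t_idle > 0) & (t_idle <= 10e-3),
        "10-100 ms": (t_idle > 10e-3) & (t_idle <= 100e-3),
        ">100 ms":   t_idle > 100e-3,
    }
    freq = {k: m.mean() for k, m in groups.items()}   # share of requests
    total = t_itv.sum()
    period = {k: (t_itv[m].sum() / total if total else 0.0)
              for k, m in groups.items()}             # share of duration
    return freq, period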

VI Related Works

There exist many prior studies that proposed to modify conventional block traces to adjust them to new storage systems [8, 25, 30, 4, 5, 14, 32]. For example, [8, 25, 30] tried to simply accelerate the inter-arrival times with a fixed scaling factor. On the other hand, [4, 5, 14, 32] replayed I/O requests on a real storage system by injecting an extra delay, or none at all (no idle), between two consecutive requests. Among them, [14] was aware of the behaviors of parallel applications, which are widely used in scientific or business environments, and reflected these onto the target block traces by injecting different idle times per I/O instruction. As there are multiple nodes that execute parallel applications, this work calculates the duration of an extra delay by taking into account the computing time and the synchronization time required for each node to ensure data synchronization; the input data of one node is another node's output. While none of the above methods consider user behaviors and the I/O execution mode, TraceTracker can reconstruct old block traces irrespective of application types and can classify the inter-arrival times into I/O subsystem latency and extra delay (idle times) by including both system and user behaviors. In TraceTracker, the idle times are decided by modeling the performance of the target trace's storage system.

Early studies on storage performance modeling [12, 13, 22, 26, 28] try to capture the performance of new storage systems by identifying the target workload's characteristics. For example, [28] used Classification and Regression Trees (CART), which is a learning-based black-box modeling technique. However, CART does not understand the input features and generates a multidimensional function as its model. Thus, for storage performance, [28] utilized the request information (e.g., inter-arrival times, logical block number, request type, and data size) as features of the CART algorithm. Unfortunately, the main problem of machine-learning-based modeling is that it is hard to explain how the model is obtained. While [28] creates a performance model without understanding the inter-arrival times, our TraceTracker analytically models storage performance by decomposing the inter-arrival times and detecting the short inter-arrival times that indicate asynchronous I/O execution.

VII Acknowledgement

This research is mainly supported by NRF 2016R1C1B2015312. This work is also supported in part by DOE DE-AC02-05CH11231, IITP-2017-2017-0-01015, NRF-2015M3C4A7065645, and a MemRay grant (2015-11-1731). Kandemir is supported in part by NSF grants 1439021, 1439057, 1409095, 1626251, 1629915, 1629129, and 1526750, and a grant from Intel. Myoungsoo Jung is the corresponding author.

VIII Conclusion

TraceTracker is a new approach that reconstructs existing block traces into new traces that are aware of the target storage system, using only the inter-arrival time information of the target workloads. To maintain important workload characteristics, such as system and user behaviors, in the new traces, TraceTracker estimates the idle times by automatically inferring the performance of the storage system from the target block traces. On average, it detects 99% of system delays and idle periods appropriately and secures the corresponding idle periods by 96% of a real execution.

References

  • [1] UMass Trace Repository, "OLTP application I/O," technical report, http://traces.cs.umass.edu, 2007.
  • [2] J. Axboe, “Flexible i/o tester,” Freshmeat project website, 2011.
  • [3] A. D. Brunelle, “Block i/o layer tracing: blktrace,” 2006.
  • [4] S. Chen, A. Ailamaki, M. Athanassoulis, P. B. Gibbons, R. Johnson, I. Pandis, and R. Stoica, “Tpc-e vs. tpc-c: Characterizing the new tpc-e benchmark via an i/o comparison study,” ACM SIGMOD Record, 2011.
  • [5] G. R. Ganger and Y. N. Patt, “Using system-level models to evaluate i/o subsystem designs,” IEEE TC, 1998.
  • [6] K. Harty and D. R. Cheriton, “Application-controlled physical memory using external page-cache management,” ACM ASPLOS, 1992.
  • [7] Intel, “Intel ssd 750 series,” URL:http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-750-series.html, 2015.
  • [8] J. Jeong, S. S. Hahn, S. Lee, and J. Kim, “Lifetime improvement of nand flash-based storage systems using dynamic program and erase scaling.” in FAST, 2014.
  • [9] S. Kavalanekar, B. Worthington, Q. Zhang, and V. Sharda, “Characterization of storage workload traces from production windows servers,” in IISWC, 2008.
  • [10] Y. Kim, R. Gunasekaran, G. M. Shipman, D. A. Dillow, Z. Zhang, and B. W. Settlemyer, “Workload characterization of a leadership class storage cluster,” in PDSW, 2010.
  • [11] R. Koller and R. Rangaswami, “I/o deduplication: Utilizing content similarity to improve i/o performance,” TOS, 2010.
  • [12] A. Merchant and P. S. Yu, “Analytic modeling of clustered raid with mapping based on nearly random permutation,” TC, 1996.
  • [13] M. Mesnier, M. Wachs, B. Salmon, and G. R. Ganger, “Relative fitness models for storage,” SIGMETRICS, 2006.
  • [14] M. P. Mesnier, M. Wachs, R. R. Simbasivan, J. Lopez, J. Hendricks, G. R. Ganger, and D. R. O’hallaron, “//trace: parallel trace replay with approximate causal events,” in FAST, 2007.
  • [15] V. Mohan, T. Siddiqua, S. Gurumurthi, and M. R. Stan, “How i learned to stop worrying and love flash endurance.” in HotStorage, 2010.
  • [16] D. Narayanan, A. Donnelly, and A. Rowstron, “Write off-loading: Practical power management for enterprise storage,” TOS, 2008.
  • [17] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron, “Migrating server storage to ssds: analysis of tradeoffs,” in EuroSys, 2009.
  • [18] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. I. Rowstron, “Everest: Scaling down peak loads through i/o off-loading.” in OSDI, 2008.
  • [19] I. Park and M. K. Raghuraman, “Server diagnosis using request tracking,” in DSN, 2003.
  • [20] PCI-SIG, “Pci express base 3.0 specification,” Addison-Wesley Publishing Company, 2010.
  • [21] C. Ruemmler and J. Wilkes, “An introduction to disk drive modeling,” Computer, 1994.
  • [22] E. Shriver, A. Merchant, and J. Wilkes, “An analytic behavior model for disk drives with readahead caches and request reordering,” in SIGMETRICS, 1998.
  • [23] G. Soundararajan, V. Prabhakaran, M. Balakrishnan, and T. Wobber, “Extending ssd lifetimes with disk-based write caches.” in FAST, 2010.
  • [24] Storage Networking Industry Association, “Snia trace repository.”
  • [25] B. Trushkowsky, P. Bodík, A. Fox, M. J. Franklin, M. I. Jordan, and D. A. Patterson, “The scads director: Scaling a distributed storage system under stringent performance requirements.” in FAST, 2011.
  • [26] M. Uysal, G. A. Alvarez, and A. Merchant, "A modular, analytical throughput model for modern disk arrays," in MASCOTS, 2001.
  • [27] A. Verma, R. Koller, L. Useche, and R. Rangaswami, “Srcmap: Energy proportional storage using dynamic consolidation.” in FAST, 2010.
  • [28] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, and G. R. Ganger, “Storage device performance prediction with cart models,” in MASCOTS, 2004.
  • [29] WD, “Western digital (wd) blue,” URL:https://www.wdc.com/wd-blue-pc-desktop-hard-drive.html, 2011.
  • [30] C. Weddle, M. Oldham, J. Qian, A.-I. A. Wang, P. Reiher, and G. Kuenning, “Paraid: A gear-shifting power-aware raid,” TOS, 2007.
  • [31] Y. Zhang, G. Soundararajan, M. W. Storer, L. N. Bairavasundaram, S. Subbiah, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Warming up storage-level caches with bonfire,” in FAST, 2013.
  • [32] N. Zhu, J. Chen, T.-C. Chiueh, and D. Ellard, “Tbbt: scalable and accurate trace replay for file server evaluation,” in SIGMETRICS, 2005.