Human errors have a significant impact on the availability of information systems [1, 2, 3]; some field studies report that 19% of system failures are caused by human errors [4, 3]. In large data centers with exabyte (EB) storage capacity, which employ more than one million disk drives, one should expect at least one disk failure per hour. Despite mechanisms such as automatic fail-over in modern data centers, in many cases the involvement of human agents is inevitable. Meanwhile, the probability of human error, even with precautionary mechanisms such as checklists and with highly educated and highly trained staff, is between 0.001 and 0.1 [5, 6, 7, 8]. Such statistics imply that an exascale data center will face multiple human errors a day.
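As a rough back-of-the-envelope check of these figures, the sketch below converts a disk population and a human error probability into expected daily event counts. The per-disk MTTF of 1,000,000 hours, the constant failure rate, and the assumption of exactly one manual replacement per disk failure are illustrative choices for this sketch, not values stated above.

```python
# Back-of-the-envelope estimate; all inputs are illustrative assumptions.
HOURS_PER_DAY = 24


def expected_daily_events(num_disks, mttf_hours, hep):
    """Return (disk failures per day, human errors per day)."""
    failures_per_hour = num_disks / mttf_hours  # constant failure rate assumed
    failures_per_day = failures_per_hour * HOURS_PER_DAY
    errors_per_day = failures_per_day * hep     # one manual replacement per failure
    return failures_per_day, errors_per_day


for hep in (0.001, 0.01, 0.1):
    failures, errors = expected_daily_events(1_000_000, 1_000_000, hep)
    print(f"hep={hep}: {failures:.0f} failures/day, {errors:.3f} human errors/day")
```

Under these assumptions the error count scales linearly with both the failure rate and the human error probability, so the daily figure is dominated by the upper end of the probability range and by how many manual operations each failure actually requires.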
Disk drives are among the most vulnerable components in a Data Storage System (DSS). Disk failures and Latent Sector Errors (LSEs) are among the main sources of data loss in a disk subsystem. Several studies have investigated the effect of these two types of incidents on the reliability of single disks and disk arrays [9, 10, 11, 12]. In particular, the failure root-cause breakdown in previous studies [4, 3] shows that human error is of great importance.
In this paper, we propose an availability model for the disk subsystem of a backed-up data storage system that considers the effect of disk failures and human errors. By a backed-up system we mean a storage system that keeps an updated backup of data, for example on tape; in such a system, we assume that data loss can be recovered using the backup and therefore results only in unavailability. While an incorrect repair service can have many different root causes and can happen under many different conditions, in this work we consider only the incorrect disk replacement service, which we call Wrong Disk Replacement. In our analysis, disk subsystems both with and without automatic disk fail-over are considered. The proposed analytical technique is based on Markov models and hence requires the assumption of exponential distributions for time-to-failure and time-to-restore. Furthermore, to cope with other probability distribution functions such as Weibull, which describes disk failure behavior more realistically, we have developed a model based on Monte-Carlo (MC) simulations. This model has also been used as a reference to validate the proposed Markov model when using exponential distributions.
By incorporating the impact of human errors on the availability of disk subsystem, several important observations are obtained. First, it is shown that overlooking the impact of incorrect repair service will result in a considerable underestimation (up to 263X) of the system downtime. Second, it is observed that in the presence of human errors, conventional assumptions on the availability ranking of different Redundant Array of Independent Disks (RAID) configurations can be contradicted. Third, it is demonstrated that automatic disk fail-over can significantly improve the overall system availability when on-line rebuild is provided by using spare disks.
The remainder of this paper is organized as follows. Section II presents background on human errors. Section III elaborates the Monte-Carlo simulation-based model. Section IV presents the proposed Markov models, considering the impact of human errors. Section V provides simulation results and the corresponding findings. Lastly, Section VI concludes the paper.
II-A Human Error in Safety-Critical Applications
To better understand and quantify human errors in a non-benign system, Human Reliability Assessment (HRA) techniques have been developed, whose major focus is the quantification of Human Error Probability (HEP), simply defined as the fraction of error cases observed over the opportunities for human errors. By collecting values reported by NASA, EUROCONTROL, and NUREG, we found that human error usually has a probability in the range of to depending on the application and situation. This probability mainly varies from up to in enterprise and safety-critical applications [6, 7, 8, 5].
II-B Human Errors in Data Storage Systems
While human errors in data storage systems can happen in very different situations, in this work we focus on one of the most prevalent cases, wrong disk replacement. In a RAID array with no spare disk (RAID5, for example), the disk fail-over process can start after the failed disk is replaced with a brand-new disk. Consider the case in which the operator mistakenly removes one of the operating disks, rather than the failed one, and replaces it with the brand-new disk. In this case, two disks are inaccessible (the failed disk and the wrongly removed operating disk), making the entire data unavailable. However, detecting the human error and undoing the wrong disk replacement makes the array available again without any data loss.
In the next section, we describe a simulation-based reference model that evaluates the availability of data storage systems, considering disk failures and the effect of human errors in the disk replacement process.
III Availability of a Backed-up Disk Subsystem by Monte-Carlo Simulation
In the MC model, failure and repair events are generated from the desired distributions, such as Weibull and exponential. Fig. 1 illustrates an example of the MC simulation for a RAID5 array. If two disk failures happen in the same array and the second failure occurs before recovery from the first, a data loss event happens (as shown in Fig. 1). Since we assume that the data storage system is backed up, we treat the data loss event as data unavailability, whose duration is the data loss recovery time (tape recovery in our example). In the case of a single disk failure, the failed disk is replaced by a human agent. However, a human error in the disk replacement process makes another working disk unavailable, resulting in the unavailability of the entire array (in the case of RAID5) for a duration that depends on the human error recovery time. The overall availability of the disk subsystem is calculated by dividing the total disk subsystem uptime by the overall simulation time. The error of the MC simulation is proportional to the Student's t coefficient for the target confidence level and inversely proportional to the square root of the number of iterations.
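The procedure described above can be sketched as follows. This is a simplified illustration rather than the simulator used for the reported results: all parameter names and values are hypothetical, disks are treated as new at the start of every failure cycle, repeated human errors during recovery are ignored, and the normal quantile is used in place of the Student's t coefficient (the two are close for a large number of iterations).

```python
import math
import random
from statistics import NormalDist, mean, stdev


def simulate_once(rng, n_disks, mission_h, draw_ttf, mu_repair, mu_he, mu_ddf, hep):
    """Simulate one RAID5 array history and return its availability."""
    t, downtime = 0.0, 0.0
    while t < mission_h:
        # Time until the first of n_disks fails (disks treated as new each cycle).
        t += min(draw_ttf(rng) for _ in range(n_disks))
        if t >= mission_h:
            break
        repair = rng.expovariate(mu_repair)  # replacement time of the failed disk
        second = min(draw_ttf(rng) for _ in range(n_disks - 1))
        if second < repair:
            # Second failure before recovery: data loss, restored from backup.
            t += second
            outage = rng.expovariate(mu_ddf)
        elif rng.random() < hep:
            # Wrong disk replacement: unavailable until the error is undone.
            t += repair
            outage = rng.expovariate(mu_he)
        else:
            t += repair
            outage = 0.0
        outage = min(outage, max(mission_h - t, 0.0))  # clip at end of mission
        downtime += outage
        t += outage
    return 1.0 - downtime / mission_h


def estimate_availability(iterations, confidence=0.99, seed=1, **params):
    """Return (mean availability, confidence-interval half-width)."""
    rng = random.Random(seed)
    samples = [simulate_once(rng, **params) for _ in range(iterations)]
    # Normal quantile as a stand-in for the Student's t coefficient.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(samples) / math.sqrt(iterations)
    return mean(samples), half_width


# Hypothetical parameters: 3+1 array, ten-year mission, Weibull time-to-failure.
params = dict(
    n_disks=4, mission_h=87_600.0,
    draw_ttf=lambda rng: rng.weibullvariate(1.0e5, 1.2),  # scale (hours), shape
    mu_repair=0.1, mu_he=1.0, mu_ddf=0.03, hep=0.01,
)
avail, err = estimate_availability(2_000, **params)
print(f"availability = {avail:.6f} +/- {err:.6f}")
```

Passing the time-to-failure sampler as a function keeps the simulation independent of the chosen distribution, so an exponential sampler such as `lambda rng: rng.expovariate(1e-5)` can be swapped in when comparing against a Markov model.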
IV Availability of a Backed-up Disk Subsystem Using Markov Model
In this section, we propose a Markov model for the availability of a backed-up disk subsystem; assuming exponential distributions for both failure and repair times, it corroborates the Monte Carlo reference model. We then extend the Markov model to a disk subsystem with automatic fail-over.
IV-A Markov Model of RAID5 in Presence of Human Errors
Fig. 2 shows the proposed Markov model for the availability of a backed-up disk subsystem, considering the effect of disk failures and human errors. In this model, disk failure rate, disk repair rate, double disk failure recovery rate from primary backup, and Human Error Probability are shown by , , , and , respectively. Upon the first disk failure, the system moves from the operational state to the exposed state. While the system is in the exposed state, a second disk failure leads to a Data Loss (DL) event, whereas a human error during disk replacement leads to a Data Unavailability (DU) event. If the human agent successfully replaces the failed disk, the array returns to the operational state.
When the array is in the DU state, a human error during the fail-over process keeps the array in the DU state. Otherwise, if no human error happens in the fail-over process, the array state transits to the state. In the DU state, if the wrongly replaced disk crashes, the array switches to DL. Finally, when the array is in the DL state, it can be recovered by the rate of .
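To make the balance equations behind such a model concrete, the following minimal sketch computes the steady-state probabilities of a four-state chain and sums the probabilities of the two states in which data is accessible. The state labels OP and EXP, the parameter names (`lam`, `mu`, `mu_he`, `mu_ddf`, `hep`), the individual transition rates, and in particular the choice that a successful recovery from DU returns the array to the operational state are illustrative assumptions made for this sketch, not transitions read from Fig. 2.

```python
def steady_state(rates, states):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small CTMC (Gauss-Jordan)."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate
    # Row i is the balance equation of state i; the last one is replaced
    # by the normalization condition.
    a = [[q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    a[-1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return {s: a[i][n] / a[i][i] for s, i in idx.items()}


def raid5_availability(n_disks, lam, mu, mu_he, mu_ddf, hep):
    """Availability of an n-disk RAID5 array under the assumed transitions."""
    rates = {
        ("OP", "EXP"): n_disks * lam,        # first disk failure
        ("EXP", "OP"): (1.0 - hep) * mu,     # correct disk replacement
        ("EXP", "DU"): hep * mu,             # wrong disk replacement
        ("EXP", "DL"): (n_disks - 1) * lam,  # second disk failure
        ("DU", "OP"): (1.0 - hep) * mu_he,   # error undone (assumed destination)
        ("DU", "DL"): lam,                   # wrongly removed disk crashes
        ("DL", "OP"): mu_ddf,                # recovery from backup
    }
    pi = steady_state(rates, ["OP", "EXP", "DU", "DL"])
    return pi["OP"] + pi["EXP"]


for hep in (0.0, 0.001, 0.01):
    a = raid5_availability(4, 1e-5, 0.1, 1.0, 0.03, hep)
    print(f"hep={hep}: unavailability={1 - a:.3e}")
```

A human error that keeps the array in DU is a self-loop and therefore does not appear as a separate rate; its effect is captured by scaling the DU exit rate with `1 - hep`.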
IV-B Markov Model of RAID5 With Automatic Fail-over
Here, we study the effect of automatic disk fail-over when the on-line rebuild process is performed using hot-spare disks. In the conventional disk replacement policy, a failed disk may be replaced by a new disk before the completion of the on-line rebuild process. In the automatic fail-over policy, by contrast, the replacement process should start only after the completion of the on-line rebuild process. The automatic fail-over policy also assumes that a hot-spare disk is available within the array while the system is in the operational state.
Fig. 3 shows the Markov model of a RAID5 array employing the automatic fail-over policy. The system is in the state when all disks work properly and a spare disk is present. In the case of a disk failure, the array state switches to . In the state, the system goes to either the state by another disk failure or the state if the failed disk is rebuilt into the available spare disk. In the state, all disks of the array work properly but no spare is present. Automatic fail-over paradigm forbids the operator to replace the affected disk before the completion of on-line rebuild process. Hence, disk replacement can be performed at the states other than the state and there is no possibility of human error in the state. When the system is in the state, a disk failure switches the system state to . If the failed disk is successfully replaced by the new disk, the array returns to the state. Otherwise, if a human error happens in the disk replacement process, the array switches to .
In the state, the array has a failed disk and no spare. In this state, a successful disk fail-over process changes the array state to . Upon the successful replacement of the failed disk in the state, the array switches to . However, if a human error happens during disk fail-over or during replacement of the failed disk, the array switches to . Upon a disk failure when the array is in the state, the array switches to .
In the state, one of the operational disks has been wrongly replaced by the new disk due to human error. To correct this error, the wrongly removed disk should be put back and the failed disk should be removed instead. If this process succeeds, the array goes back to . Otherwise, if another human error happens while recovering from the first one, the array switches to . In the state, if the wrongly removed disk crashes, the array switches to . A disk failure while the array is in the state switches the array to .
In the state, the user data is totally lost due to a Double Disk Failure (DDF) and a hot spare disk is available. In this case, DDF could be recovered by the rate of . Similarly in the state, the user data is totally lost due to a DDF but no spare is available. Here, recovery from DDF changes the array state to . In the state, if one of the failed disks is successfully replaced by the new spare disk, the array switches to the state.
In the state, the array is totally unavailable due to a disk failure and a human error. In this state, the successful recovery of human error changes the array state to . However, if the wrongly removed disk crashes, the array switches to . In the state, performing the disk fail-over by using the disk array is not possible as the user data is unavailable due to the human error. In this case, performing the disk fail-over before recovering the human error is similar to the case of recovering DDF by the rate of . In the state, if the failed disk is successfully replaced by the new spare disk, the array switches to the state.
In the state, the array data is totally unavailable due to the occurrence of two human errors. In this case, the array can switch to if one of the human errors is successfully recovered. However, if one of the wrongly removed disks crashes, the array switches to .
is similar to except the point that the hot spare disk is available in this state. Similarly, the and states are similar to and , respectively, except the point that the hot spare disk is available in and .
Comparing the Markov model of a RAID5 array employing the automatic fail-over (Fig. 3) and a RAID5 array using the conventional disk replacement policy (Fig. 2) shows a longer path from to state when the automatic fail-over is performed in the system. Hence, it can be realized that the probability of being in the state significantly decreases by using the automatic fail-over policy. Detailed numerical results of this model and comparison with a RAID array employing the conventional replacement policy will be presented in Section V-D.
V Experimental Setup and Simulation Results
V-A Validation of Markov Model with Simulation-Based Model
Fig. 4 shows the comparison of the proposed MC simulation results (for iterations and 99% confidence level) and the availability values obtained by the Markov model. As shown in this figure, the availability values obtained by the Markov model are within the error interval of the results obtained by the MC simulations for both and .
V-B Availability Estimation in Presence of Human Error
Fig. 5 reports the availability results of a RAID5 array in the presence of human errors for different disk failure rates. The availability of the disk subsystem has been reported for the traditional availability model (assuming =) as well as two different human error probabilities ( and ). We consider typical values for the repair rate in our experiments. In particular, we use values of 0.1 and 0.03 for and , respectively. We also consider , , and . Considering a constant failure rate (for example, ), it is observed that the availability of a disk subsystem decreases as the human error probability increases. As the results show, with the human error probability equal to 0.001, the availability of the disk subsystem drops by one to two orders of magnitude.
V-C Availability Comparison of RAID Configurations with Equivalent Usable Capacity
Fig. 6 compares the availability of three different RAID configurations including , , and with equivalent usable (logical) capacity, in the presence of human errors (=, =, and =), assuming exponential failure distribution (). Comparing the three RAID configurations by assuming no human errors () shows that results in a higher availability compared to and . However, by considering , the availability of all RAID configurations dramatically decreases, and our results show a more significant decrease in the configuration, making its availability slightly lower than that of both the and configurations. This can be explained by the higher Effective Replication Factor (ERF), defined as the ratio of storage physical size to the logical (usable) size, of () compared to () and (), which mandates employing a higher number of disks for a specific usable capacity, increasing the chance of disk failure and, consequently, of human errors. By considering higher values (e.g., ), we observe a larger gap between RAID configurations, where both and show lower availability compared to , which can again be explained by the lower ERF of .
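The relation between ERF and disk count can be illustrated with a short calculation. The three layouts below (mirroring 1+1 and single-parity 3+1 and 7+1) and the usable capacity of 21 disks are hypothetical examples chosen for this sketch; they are not necessarily the configurations compared in Fig. 6.

```python
import math


def replication_factor_and_disks(data_disks, redundancy_disks, usable_disks_needed):
    """Return (ERF, total physical disks) for a target usable capacity in disks."""
    # ERF = physical size / logical (usable) size of one array.
    replication_factor = (data_disks + redundancy_disks) / data_disks
    arrays = math.ceil(usable_disks_needed / data_disks)
    return replication_factor, arrays * (data_disks + redundancy_disks)


# Hypothetical layouts: (name, data disks, redundancy disks) per array.
for name, d, r in [("1+1", 1, 1), ("3+1", 3, 1), ("7+1", 7, 1)]:
    factor, disks = replication_factor_and_disks(d, r, usable_disks_needed=21)
    print(f"{name}: ERF={factor:.2f}, physical disks={disks}")
```

A layout with a higher ERF needs more physical disks for the same usable capacity, which means more disk failures and therefore more replacement operations during which a human error can occur.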
V-D Effect of Automatic Disk Fail-over Policy
In this section, we report the effect of the automatic fail-over policy when on-line rebuild process is being performed using hot-spare disks. Fig. 7 compares the availability of two RAID5 arrays, performing conventional and automatic fail-over in the presence of human errors. As the results show, using automatic fail-over policy can significantly moderate the effect of human errors. For example, assuming , automatic fail-over increases the system availability by two orders of magnitude as compared to the conventional disk replacement policy. The results reported in Fig. 7 also demonstrate that the delayed replacement policy shows higher availability improvement when has greater values.
VI Conclusion and Future Work
In this paper, we investigated the effect of incorrect disk replacement service on the availability of a backed-up disk subsystem by using Monte Carlo simulations and Markov models. Taking the effect of incorrect disk replacement service into account, we showed that a small percentage of human errors (e.g., ) can increase the system unavailability by more than one order of magnitude. Using the proposed models, we also showed that in some cases the dependability ranking of RAID configurations differs from the conventional ranking. Additionally, we showed that automatic fail-over can increase the system availability by two orders of magnitude. Such observations can be used by both designers and system administrators to enhance the overall system availability.
References
[1] D. Oppenheimer, A. Ganapathi, and D. A. Patterson, “Why do internet services fail, and what can be done about it?” in USENIX Symposium on Internet Technologies and Systems, vol. 67. Seattle, WA, 2003.
[2] A. Brown and D. A. Patterson, “To err is human,” in EASY, 2001.
[3] D. Oppenheimer, “The importance of understanding distributed system configuration,” in Conference on Human Factors in Computer Systems workshop, 2003.
[4] E. Haubert, “Threats of human error in a high-performance storage system: Problem statement and case study,” Computing Research Repository, vol. abs/cs/041, 2004.
[5] M. P. A. B. E. M. P. M. Faith Chandler, I. Addison Heard, “NASA human error analysis,” Tech. Rep., 2010. [Online]. Available: www.hq.nasa.gov/office/codeq/rm/docs/hra.pdf
[6] W. Gibson, B. Hickling, and B. Kirwan, “Feasibility study into the collection of human error probability data,” EEC Note, vol. 2, 2006.
[7] U.S. Nuclear Regulatory Commission, Reactor Safety Study: An Assessment of Accident Risks in US Commercial Nuclear Power Plants, 1975, vol. 2.
[8] A. D. Swain and H. E. Guttmann, “Handbook of human-reliability analysis with emphasis on nuclear power plant applications. Final report,” Sandia National Labs., Albuquerque, NM (USA), Tech. Rep., 1983.
[9] B. Schroeder, S. Damouras, and P. Gill, “Understanding latent sector errors and how to protect against them,” Transactions on Storage (TOS), vol. 6, pp. 9:1–9:23, September 2010.
[10] J. Elerath and M. Pecht, “A highly accurate method for assessing reliability of redundant arrays of inexpensive disks (RAID),” Transactions on Computers (TC), vol. 58, no. 3, pp. 289–299, March 2009.
[11] K. M. Greenan, J. S. Plank, and J. J. Wylie, “Mean time to meaningless: MTTDL, Markov models, and storage system reliability,” HotStorage, pp. 1–5, 2010.
[12] B. Schroeder and G. A. Gibson, “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” FAST, pp. 1–16, 2007.
[13] A. D. Swain, “Human reliability analysis: Need, status, trends and limitations,” Reliability Engineering & System Safety, vol. 29, no. 3, pp. 301–313, 1990.
[14] K. L. Lange, R. J. Little, and J. M. Taylor, “Robust statistical modeling using the t distribution,” Journal of the American Statistical Association, vol. 84, no. 408, pp. 881–896, 1989.
[15] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang et al., “f4: Facebook’s warm blob storage system,” in OSDI, 2014, pp. 383–398.