Statistical Analysis of Modern Reliability Data

by Yueyao Wang, et al.

Traditional reliability analysis has relied on time-to-event data, degradation data, and recurrent event data, while the associated covariates tend to be simple and constant over time. Over the past years, the rapid development of sensor and wireless technology has enabled us to track how a product is used and the environmental conditions under which it is used. Nowadays, we can collect richer covariate information, which provides opportunities for better reliability predictions. In this chapter, we first review recent developments in statistical methods for reliability analysis. We then focus on several specific methods developed for different types of reliability data with covariate information, illustrated with examples from industry. Test planning is also an important part of reliability analysis. In addition to data analysis, we provide a brief review of recent developments in test planning and then focus on illustrating the sequential Bayesian design with an example of fatigue testing for polymer composites. The chapter concludes with some discussion and remarks.




1 Introduction

1.1 Background

Traditional reliability data analysis mainly uses time-to-event data, degradation data, and recurrent event data to make reliability predictions [1]. The covariate information involved in the analysis is usually time-invariant, and the number of covariates is typically small. For time-to-event data, parametric models such as the Weibull and lognormal distributions are popular, and accelerated failure time models are often used to incorporate covariate information on accelerating factors. For degradation data, general path models and stochastic process models are the common choices, and covariate information is often incorporated through regression-type models. Recurrent event data are often modeled by event intensity models or mean cumulative functions, with regression-type models used to incorporate covariates.

With technological advances, new covariate information has become available. Products and systems can be equipped with sensors and smart chips that keep track of various information on field usage, such as the number of transfers, and on environmental conditions such as temperature and humidity. Such covariate information often changes over time, so we refer to it as dynamic covariate information. Because dynamic covariates often come in large volume and variety, they present big-data opportunities and challenges in the area of reliability analysis (e.g., [2] and [3]). Dynamic covariate data can be used for reliability modeling and prediction because units under heavy usage often fail sooner than those under light usage. In recent years, more statistical methods have been developed to make use of this new type of covariate data.

Another important area of reliability analysis is test planning, which focuses on how to efficiently collect various types of data to make better reliability predictions. For accelerated life tests (ALTs), it is especially challenging to collect sufficient failure data in a timely manner, because data collection is time-consuming and often requires expensive equipment for testing units under elevated stress conditions. In some laboratories, there are typically only one or two machines available for testing a certain material, which makes it impractical to test multiple samples simultaneously and therefore limits the total obtainable sample size. Another challenge with traditional test planning is that it typically relies on a single set of best guesses for the parameter values, which may lead to suboptimal designs when the specified values are inaccurate. Due to these challenges, sequential designs have become popular, in which earlier test results are used to determine the test conditions for later runs. In addition, Bayesian methods can leverage prior information from expert knowledge or related historical data to inform the test planning. The objective of this chapter is to review current developments and then introduce the statistical methods for dynamic covariates and the sequential Bayesian design (SBD) for ALTs.

1.2 Related Literature

In lifetime data analysis, product usage information has been used to improve reliability models. Lawless et al. [4] consider the warranty prediction problem using product usage information on returned units. Constant usage information is used in [5] and [6], and averaged product use-rate information is used in [7]. Nelson [8] and Voiculescu et al. [9] use the cumulative exposure model in ALT and reliability analysis. Hong and Meeker [10] use the cumulative exposure model to incorporate dynamic covariates and apply it to the Product D2 application.

In degradation data analysis, stochastic process models are widely used. The Wiener process ([11, 12, 13]), gamma process ([14]), and inverse Gaussian process ([15, 16]) are popular models in this class. The general path models are also widely used, including [17, 18, 19, 20, 21]. For accelerated destructive degradation tests, typical work includes [22, 23, 24]. Hong et al. [25] and Xu et al. [26] develop degradation models within the general path model framework to incorporate dynamic covariate information.

For recurrent event data, the nonhomogeneous Poisson process (NHPP) and the renewal process (RP) are widely used (e.g., [27, 28]). Kijima [29] introduces virtual age models, which can describe imperfect repairs. Pham and Wang [30] develop a quasi-renewal process, and Doyen and Gaudoin [31] propose models for imperfect repairs. The trend-renewal process (TRP) proposed in [32] includes the NHPP and RP as special cases and has been used in [33, 34, 35], among other places. Xu et al. [36] develop a multi-level trend-renewal process (MTRP) model for recurrent events with dynamic covariates.

For test planning, optimum designs in the traditional test planning framework are developed using non-Bayesian approaches (e.g., [37, 38]), in which the true parameters are assumed to be known. Bayesian test planning for lifetime data is developed in [39, 40, 41]. King et al. [42] develop optimum test plans for fatigue testing of polymer composites. Lee et al. [43] develop SBD test planning for polymer composites, and Lu et al. [44] extend it to test planning with dual objectives.

1.3 Overview

The rest of this chapter is organized as follows. Section 2 describes an application on time-to-event data with dynamic covariates. Section 3 illustrates the modeling of degradation with dynamic covariates. Section 4 describes the MTRP model for describing recurrent event data with dynamic covariates. Section 5 introduces SBD strategies for ALTs. Section 6 contains some concluding remarks.

2 Time to Event Data Analysis

In this section, we briefly introduce the application of using dynamic covariates for time to event prediction as described in Hong and Meeker [10].

2.1 Background and Data

A general method was developed by Hong and Meeker [10] to model failure-time data with dynamic covariates. The work was motivated by the Product D2 application, a machine used in offices and residences. Product D2 is similar to a high-end copy machine, for which the number of pages used can be recorded dynamically. For this product, the use-rate data (cycles/week) were collected weekly as a time series for those units connected to the network; this information could be downloaded automatically in addition to the failure-time data. In the Product D2 dataset, data were observed within a 70-week period, and 69 of 1800 units failed during the study period. Figure 1 illustrates the event plot of the failure-time data and the use-rate over time for a subset of the data.

(a) Failure-time Data (b) Use-rate processes
Figure 1: (a) The event plot for a subset of the Product D2 failure-time data and (b) the corresponding plot of the use-rate trajectories. Figure reproduced with permission.

2.2 Model for Time to Event and Parameter Estimation

Three sets of observable random variables are considered: the failure time, the censoring indicator, and the dynamic covariate process, denoted by $T_i$, $\delta_i$, and $X_i(t)$, respectively. The observed data are described by $\{t_i, \delta_i, x_i(t), 0 \le t \le t_i\}$ for $i = 1, \dots, n$. Here $n$ denotes the number of units in the dataset, $t_i$ is the failure time or time in service, and $\delta_i$ is the observed censoring indicator (i.e., it equals 1 if unit $i$ fails and 0 otherwise). The $x_i(t)$ is the observed dynamic covariate information of unit $i$ from time 0 to $t_i$, where $x_i(t_{ij})$ is the observed covariate value at time $t_{ij}$ for unit $i$. Particularly for Product D2, we use the logarithm of the use rate relative to a baseline, $x_i(t) = \log[R_i(t)/R_0]$, as the form of the covariate in the model, where the baseline use rate $R_0$ is chosen to be a typical constant use rate.

The cumulative exposure model in [45] is used to model the failure-time data with dynamic covariates. The cumulative exposure is defined as

$$u_i(t) = \int_0^t \exp[\beta\, x_i(s)]\, ds,$$

where $\beta$ represents the influence of the covariate on the exposure. When the cumulative exposure of a unit reaches a random threshold $U_i$ at time $T_i$, the unit fails. This establishes a relationship between $T_i$ and $U_i$, that is, $T_i = u_i^{-1}(U_i)$.
Under the above model and the covariate history $\{x_i(s), 0 \le s \le t\}$, the cumulative distribution function (cdf) of the failure time $T_i$ is

$$F(t; x_i) = \Pr(T_i \le t) = F_0[u_i(t)],$$

and the probability density function (pdf) is

$$f(t; x_i) = f_0[u_i(t)]\, \exp[\beta\, x_i(t)].$$

Here $F_0(\cdot)$ is the baseline cdf of the cumulative exposure threshold $U_i$ and $f_0(\cdot)$ is the pdf of $U_i$. In the Product D2 application, the baseline cumulative exposure distribution was modeled by the Weibull distribution, of which the cdf and pdf are

$$F_0(u) = \Phi_{\mathrm{sev}}\!\left[\frac{\log(u) - \mu_0}{\sigma_0}\right], \qquad f_0(u) = \frac{1}{\sigma_0 u}\, \phi_{\mathrm{sev}}\!\left[\frac{\log(u) - \mu_0}{\sigma_0}\right].$$

In the above expressions, $\Phi_{\mathrm{sev}}(z) = 1 - \exp[-\exp(z)]$ and $\phi_{\mathrm{sev}}(z) = \exp[z - \exp(z)]$ are the standard smallest-extreme-value cdf and pdf, and $\mu_0$ and $\sigma_0$ are the location and scale parameters. Lognormal and other log-location-scale distributions can also be used if they are considered appropriate for certain applications.
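The cumulative exposure model above can be sketched numerically. This is a hedged illustration, not the paper's implementation: the trapezoidal integration, the constant log-use-rate covariate path, and all parameter values (`beta`, `mu0`, `sigma0`) are illustrative assumptions.

```python
import numpy as np

def cumulative_exposure(t_grid, x_vals, beta):
    """u(t) = integral_0^t exp(beta * x(s)) ds, via the trapezoidal rule."""
    integrand = np.exp(beta * x_vals)
    increments = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t_grid)
    return np.concatenate([[0.0], np.cumsum(increments)])

def sev_cdf(z):
    """Standard smallest-extreme-value cdf (Weibull on the log scale)."""
    return 1.0 - np.exp(-np.exp(z))

def failure_time_cdf(t_grid, x_vals, beta, mu0, sigma0):
    """F(t; x) = F0[u(t)] with a Weibull baseline exposure threshold."""
    u = cumulative_exposure(t_grid, x_vals, beta)
    with np.errstate(divide="ignore"):       # log(0) at t = 0 gives F = 0
        z = (np.log(u) - mu0) / sigma0
    return sev_cdf(z)

# Example: a unit running at twice the baseline use rate, x(t) = log 2.
t = np.linspace(0.0, 70.0, 701)              # weeks in service
x = np.full_like(t, np.log(2.0))
F = failure_time_cdf(t, x, beta=1.0, mu0=4.0, sigma0=0.5)
```

A heavier-usage path (larger $x_i(t)$) accumulates exposure faster, shifting the failure-time cdf toward earlier times, which is the mechanism the model uses to link usage to reliability.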

2.3 Model for Covariates

To model the covariate process, we use a linear mixed-effects model. In particular, $X_i(t_{ij})$ is modeled as

$$X_i(t_{ij}) = \eta + w_{0i} + w_{1i}\, t_{ij} + \varepsilon_{ij}. \qquad (2)$$

In model (2), $\eta$ is the constant mean, and the term $w_{0i} + w_{1i} t_{ij}$ is used to model variation at the individual level. Here $w_i = (w_{0i}, w_{1i})'$ is the vector of random effects for the initial covariate level at time 0 and the changing rate for unit $i$. It is assumed that $w_i \sim \mathrm{N}(0, \Sigma_w)$ with covariance matrix $\Sigma_w$, and $\varepsilon_{ij} \sim \mathrm{N}(0, \sigma^2)$ is the error term.
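A minimal simulation of this linear mixed-effects covariate model may help fix ideas. This is a hedged sketch: the parameter values (`eta`, `Sigma_w`, `sigma_eps`) are illustrative assumptions, not estimates from the Product D2 data.

```python
import numpy as np

rng = np.random.default_rng(2023)

def simulate_covariates(n_units, t_grid, eta, Sigma_w, sigma_eps):
    """X_i(t) = eta + w0_i + w1_i * t + noise; one row per unit."""
    w = rng.multivariate_normal(np.zeros(2), Sigma_w, size=n_units)
    return (eta + w[:, [0]] + w[:, [1]] * t_grid[None, :]
            + rng.normal(0.0, sigma_eps, (n_units, t_grid.size)))

t = np.arange(0.0, 70.0)          # weekly observations over 70 weeks
X = simulate_covariates(1800, t, eta=0.2,
                        Sigma_w=np.array([[0.04, 0.001],
                                          [0.001, 0.0004]]),
                        sigma_eps=0.1)
```

Each simulated trajectory has its own random intercept and slope, mimicking the unit-to-unit variation in use-rate paths seen in Figure 1(b).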

The parameter estimation is conducted in two parts using the maximum likelihood (ML) method, since the parameters $\theta_T$ in the failure-time model and $\theta_X$ in the covariate process model are separate. The joint likelihood for $\theta_T$ and $\theta_X$ is

$$L(\theta_T, \theta_X) = L(\theta_T)\, L(\theta_X). \qquad (3)$$

The first component of (3) is the likelihood function of the failure-time data, which is

$$L(\theta_T) = \prod_{i=1}^{n} f(t_i; x_i)^{\delta_i}\, [1 - F(t_i; x_i)]^{1 - \delta_i}.$$

The second component of (3) is the likelihood of the covariate data, which is

$$L(\theta_X) = \prod_{i=1}^{n} \int \left\{ \prod_{j} \frac{1}{\sigma}\, \phi\!\left[\frac{x_i(t_{ij}) - \eta - w_{0i} - w_{1i} t_{ij}}{\sigma}\right] \right\} \phi_2(w_i; \Sigma_w)\, dw_i.$$

In the above equation, $\phi$ is the pdf of a univariate standard normal distribution and $\phi_2(\,\cdot\,; \Sigma_w)$ is the pdf of a bivariate normal distribution.

2.4 Reliability Prediction

In order to predict future field failures, the distribution of the remaining life (DRL) is considered in the prediction procedure. The DRL describes the distribution of $T_i - t_i$ for unit $i$, given $T_i > t_i$ and the covariate history up to $t_i$. In particular, the probability that unit $i$ fails within the next $s$ time units, given that it has survived to time $t_i$, is

$$\rho_i(s; \theta) = \Pr(t_i < T_i \le t_i + s \mid T_i > t_i, x_i),$$

where $\theta$ denotes all parameters. Then $\rho_i(s; \theta)$ can be further expressed as

$$\rho_i(s; \theta) = \mathrm{E}\!\left\{ \frac{F[t_i + s;\, x_i, X_i]\, -\, F(t_i; x_i)}{1 - F(t_i; x_i)} \right\},$$

where the expectation is taken with respect to $X_i$, the future covariate process in $(t_i, t_i + s]$. Since model (2) is assumed for the covariates, multivariate normal distribution theory can be used to obtain the conditional distribution of the future covariate path given the observed one.

Monte Carlo simulation is used to evaluate $\rho_i(s; \theta)$ since an analytical expression is unavailable. The following procedure is used to compute the estimate $\hat{\rho}_i(s)$:

  1. Substitute $\theta$ with the ML estimates in the conditional distribution of the future covariate process, and draw a realization from that distribution.

  2. Let $x_i^{*m}$ be the simulated covariate process in the time interval $(t_i, t_i + s]$.

  3. Compute the DRL given $x_i^{*m}$ and the ML estimates of $\theta$, denoted by $\rho_i^{*m}$.

  4. Repeat steps 1-3 $M$ times to obtain $\rho_i^{*1}, \dots, \rho_i^{*M}$.

  5. The estimate is computed by $\hat{\rho}_i(s) = (1/M) \sum_{m=1}^{M} \rho_i^{*m}$.
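The Monte Carlo procedure above can be sketched as follows. This is a hedged toy version: the random-walk continuation of the covariate path and the cumulative-exposure/Weibull pieces (`beta`, `mu0`, `sigma0`) are illustrative stand-ins, not the paper's fitted model.

```python
import numpy as np

rng = np.random.default_rng(7)

def cdf_given_path(t_grid, x_path, beta, mu0, sigma0):
    """Failure-time cdf induced by a covariate path (cumulative exposure)."""
    inc = np.exp(beta * x_path)
    u = np.concatenate([[0.0], np.cumsum(0.5 * (inc[1:] + inc[:-1])
                                         * np.diff(t_grid))])
    with np.errstate(divide="ignore"):
        z = (np.log(u) - mu0) / sigma0
    return 1.0 - np.exp(-np.exp(z))

def drl_estimate(t_i, s, x_obs, t_obs, M=200, beta=1.0, mu0=4.0, sigma0=0.5):
    """rho_i(s): P(fail in (t_i, t_i+s] | survived t_i), averaged over
    M simulated continuations of the covariate process."""
    t_fut = np.linspace(t_i, t_i + s, 50)
    rhos = np.empty(M)
    for m in range(M):
        # steps 1-2: simulate a future covariate path (random-walk stand-in)
        x_fut = x_obs[-1] + np.cumsum(rng.normal(0.0, 0.02, t_fut.size))
        t_all = np.concatenate([t_obs, t_fut])
        x_all = np.concatenate([x_obs, x_fut])
        F = cdf_given_path(t_all, x_all, beta, mu0, sigma0)
        F_ti, F_tis = F[t_obs.size - 1], F[-1]
        rhos[m] = (F_tis - F_ti) / (1.0 - F_ti)   # step 3
    return rhos.mean()                            # steps 4-5
```

Averaging over the simulated covariate continuations is what distinguishes this prediction from one that simply freezes the covariate at its last observed value.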

The confidence intervals (CIs) for $\hat{\rho}_i(s)$ can be obtained through the following procedure:

  1. Draw $\theta_T^*$ and $\theta_X^*$ from the large-sample approximate distributions of the ML estimators $\hat{\theta}_T$ and $\hat{\theta}_X$, respectively.

  2. Let $\theta^* = (\theta_T^{*\prime}, \theta_X^{*\prime})'$ and obtain $\hat{\rho}_i^*(s)$ following the above algorithm.

  3. Repeat steps 1-2 $B$ times to obtain $\hat{\rho}_i^{*1}(s), \dots, \hat{\rho}_i^{*B}(s)$.

  4. The $100(1-\alpha)\%$ CI is computed from the $[B\alpha/2]$th and $[B(1-\alpha/2)]$th ordered values of the $\hat{\rho}_i^{*b}(s)$'s, where $[\cdot]$ is the function for rounding to the nearest integer.

Figure 2 shows the estimated DRL for two representative units. One unit has a higher use rate that increases quickly over time, and the other has a lower use rate that increases slowly over time. The trends in the plot are consistent with our expectation that units with higher use rates tend to have higher failure risk.

Figure 2: The estimated DRLs and the 95% pointwise CIs for two representative at-risk units. Figure reproduced with permission.

To assess the prediction variability, one may also want to calculate the prediction interval (PI) for the remaining life of an individual unit, denoted by $S_i$. A PI of the remaining lifetime can be obtained by using the method introduced in [46], which calibrates the interval obtained by inverting the estimated DRL. Here the endpoints are the appropriate lower and upper quantiles of the estimated DRL of unit $i$, and the required quantiles can be obtained through a Monte Carlo simulation.

In real applications, obtaining the predicted cumulative number of failures is also important, because it supports decisions on warranty cost control and long-term production planning. Suppose $N(s)$ is the total failure count at time $s$ after the data freeze date (DFD), that is, $N(s) = \sum_{i \in \mathrm{RS}} I_i(s)$, where $\mathrm{RS}$ is the risk set at the DFD and $I_i(s)$ is the binary indicator for unit $i$ failing within $s$ time units after the DFD, with $\Pr[I_i(s) = 1] = \rho_i(s)$. Because $N(s)$ is a sum of independent but non-identically distributed Bernoulli random variables, its cdf can be computed in explicit form using a discrete Fourier transformation [47]. The PI for $N(s)$ can be calculated similarly to the individual predictions.
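The distribution of such a sum of non-identical Bernoulli indicators is Poisson-binomial, and its pmf can be computed exactly with the characteristic-function/DFT approach. A hedged sketch (the failure probabilities below are illustrative, and this is a generic implementation of the DFT method, not the code of [47]):

```python
import numpy as np

def poisson_binomial_pmf(p):
    """Exact pmf of N = sum of independent Bernoulli(p_i), via the DFT
    of the characteristic function evaluated at the (n+1) roots of unity."""
    n = len(p)
    ell = np.arange(n + 1)
    omega = np.exp(2j * np.pi * ell / (n + 1))       # roots of unity
    char = np.prod(1.0 - p[None, :] + p[None, :] * omega[:, None], axis=1)
    pmf = np.real(np.fft.fft(char)) / (n + 1)        # invert the transform
    return np.clip(pmf, 0.0, 1.0)                    # clean tiny negatives

rho = np.array([0.02, 0.05, 0.10, 0.03])   # illustrative P[I_i(s) = 1]
pmf = poisson_binomial_pmf(rho)
cdf = np.cumsum(pmf)                       # cdf of the failure count N(s)
```

Quantiles of this cdf give the PI endpoints for the predicted cumulative number of failures.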

Figure 3: The point predictions and the pointwise 90% PIs for the cumulative number of failures after the DFD out of the 1731 units at risk. Figure reproduced with permission.

For the Product D2 application, 1731 units remained at risk after 69 of the 1800 installed units failed. The point predictions and the pointwise 90% PIs for the total number of failures after the DFD are shown in Figure 3.

3 Degradation Data Analysis

In this section, we briefly introduce how to leverage the dynamic covariates for modeling degradation data as described in Hong et al. [25].

3.1 Background and Data

Hong et al. [25] develop a general path model utilizing shape-restricted splines with random effects to model the degradation process with dynamic covariates. The paper considers an application of modeling the photodegradation process of organic coating materials due to exposure to time-varying ultraviolet (UV) radiation and outdoor environmental conditions. In this work, to study the service life of a coating material, scientists at NIST placed 36 specimens in an outdoor setting, with the UV spectrum and intensity, temperature, and relative humidity (RH) recorded over an approximately 5-year period. The specimens started at different time points to allow different degradation paths to be observed. For each specimen, degradation measurements were taken periodically using Fourier transform infrared (FTIR) spectroscopy. Since a particular compound or structure is tied to a peak at a certain wavelength on the FTIR spectrum, the change in the height of the peak was used to measure the decrease in the concentration of the compound. One of the compounds of interest for the NIST data was the C-O stretching of aryl ether, measured at wavenumber 1250 cm$^{-1}$. Figure 4(a) shows the degradation paths of nine representative specimens with varied starting times in the study. We can observe very different trajectories, with the degradation rate varying over time as well as among specimens. Figures 4(b)-(d) show the dynamic covariate information on the daily UV dosage, RH, and temperature, together with fitted smooth lines showing the mean process for one specimen over the study period. The vertical lines mark time windows of six months. We can observe both a seasonal pattern and random oscillation in the daily records of each covariate, with stronger seasonal patterns for the UV dosage and temperature than for the RH. There also appears to be larger variation in the daily observations in the summer than in the winter, indicating a varying degree of variability of the covariates over different time periods.

3.2 Model for Degradation Paths and Parameter Estimation

The general additive model for degradation data with dynamic covariates is given in the form

$$y_{ij} = \mathcal{D}(t_{ij}; x_i, w_i) + \varepsilon_{ij}, \qquad (9)$$

where $y_{ij}$, for $i = 1, \dots, n$ and $j = 1, \dots, n_i$, is the degradation measurement at time $t_{ij}$ for unit $i$, $\varepsilon_{ij}$ denotes the measurement error, and $x_i(t)$ is the vector containing the dynamic covariate information at time $t$. The actual degradation level at $t_{ij}$ is modeled by $\mathcal{D}(t_{ij}; x_i, w_i)$ as the sum of a fixed component and a random component. The fixed component $G(t; x_i)$ is the population common degradation path, modeled in a cumulative damage form given by

$$G(t; x_i) = \beta_0 + \sum_{l=1}^{p} \int_0^t f_l[x_{il}(\tau); \beta_l]\, d\tau. \qquad (10)$$

This model incorporates the dynamic covariates through the covariate-effect functions $f_l(\cdot)$ for $l = 1, \dots, p$. Here, $\beta_0$ is the initial degradation, $f_l[x_{il}(\tau); \beta_l]$ is the $l$th covariate effect on the degradation process at time $\tau$, and the integral is the cumulative effect of $x_{il}$ up to time $t$. The random component, $w_{0i} + w_{1i} t$, includes the random-effect terms for modeling the unit-to-unit variation. Here, $w_i = (w_{0i}, w_{1i})'$ is the vector of random effects for the initial degradation and the growth rate over time, and it is assumed to follow a bivariate normal distribution with mean zero and covariance matrix $\Sigma_w$. We also use $\theta_w$ to denote all distinct parameters included in $\Sigma_w$.

(a) Degradation paths (b) Daily UV dosage
(c) Daily temperature (d) Daily RH
Figure 4: Plots of (a) nine representative degradation paths and (b)-(d) dynamic covariate information on the daily UV dosage, temperature and relative humidity for a single sample. The black dots connected by green lines show the daily values. The vertical lines show the time windows by every 6 months from January 2002. The red smooth curves are the estimated mean process. Figure reproduced with permission.

The ML method is used for estimating the parameters. Since the degradation measurements and the dynamic covariates are observed at discrete time points, a discrete version of the degradation path model is used for computing the likelihood, replacing the integral in (10) by the cumulative sum

$$G(t_{ij}; x_i) = \beta_0 + \sum_{l=1}^{p} \sum_{k=1}^{j} f_l[x_{il}(t_{ik}); \beta_l]\, (t_{ik} - t_{i,k-1}),$$

where $t_{ik}$ is the $k$th time point at which the degradation and covariates are measured for unit $i$ and $t_{i0} = 0$. Let $\theta$ denote all the model parameters. Then the likelihood is

$$L(\theta) = \prod_{i=1}^{n} \int \left\{ \prod_{j=1}^{n_i} \frac{1}{\sigma}\, \phi\!\left[\frac{y_{ij} - G(t_{ij}; x_i) - w_{0i} - w_{1i} t_{ij}}{\sigma}\right] \right\} \phi_2(w_i; \Sigma_w)\, dw_i, \qquad (12)$$

where $\phi$ and $\phi_2(\,\cdot\,; \Sigma_w)$ are the pdfs of a standard normal distribution and a bivariate normal distribution, respectively.

Considering that there was not sufficient knowledge about a sensible form for the covariate-effect functions, the paper chose to estimate the $f_l(\cdot)$ using linear combinations of spline bases. To leverage the physical understanding of the relationships between the degradation process and the covariates, shape-restricted splines [48] were used to ensure monotonically decreasing bases (I-splines) for the UV dosage and temperature and concave bases (C-splines) for the RH. Let $B_{lm}(\cdot)$, for $m = 1, \dots, M_l$, denote the spline bases for covariate $l$; then the covariate-effect function is modeled as

$$f_l(x) = \sum_{m=1}^{M_l} \gamma_{lm} B_{lm}(x), \qquad (11)$$

where the $\gamma_{lm}$'s are the spline coefficients. With $f_l$ given in (11), the model in (9) can be written as a linear mixed-effects model of the form $y = X\gamma + Zw + \varepsilon$, where $X$ and $Z$ are design matrices built from the spline bases and time points, and the coefficient vector is $\gamma = (\gamma_u', \gamma_c')'$, with $\gamma_u$ and $\gamma_c$ denoting the unconstrained and constrained parameters, respectively.

The following algorithm was proposed in [25] to obtain the ML estimate that maximizes (12):

  1. Initialize $\gamma$ and the variance components by fitting a linear mixed-effects model with no constraints.

  2. Compute the transformed (whitened) responses and design matrix based on the current estimates of the variance components.

  3. Use the mixed primal-dual bases algorithm in [49] to estimate $\gamma_c$, that is, to minimize the resulting least-squares criterion subject to the shape constraints on $\gamma_c$.

  4. Fit a linear mixed-effects model with $\gamma_c$ fixed to obtain updated estimates of $\gamma_u$ and the variance components.

  5. Repeat steps 2 to 4 until the estimated parameters converge.

With the shape-restricted splines, the ML estimates of some parameters might lie on the boundary of the parameter space. In this case, the bootstrap method is useful for assessing variability and making inference about the parameters. An adjusted bootstrap procedure by [50] was applied to resample the residuals and the estimated random effects when constructing bootstrap resamples of the original data, to avoid underestimating variability and producing overly narrow CIs. Then the bias-corrected bootstrap CIs were constructed by obtaining the ML estimates of the model parameters, using the above-mentioned algorithm, for a large number of bootstrap samples.

3.3 Model for Covariates

To predict the degradation process and reliability, it is necessary to describe the dynamic covariate process through a parametric model. Hong et al. [25] propose the following model:

$$X_{il}(t) = \mu_l(t) + S_l(t) + e_{il}(t), \qquad (13)$$

where $\mu_l(t)$ models the long-term trend of the covariate process for the $l$th covariate, $S_l(t)$ captures the seasonal pattern, and $e_{il}(t)$ depicts the random error, which is assumed to be a stationary process. For the NIST outdoor weathering data, there was no significant long-term trend observed, and hence $\mu_l(t)$ was taken to be a constant $\mu_l$ for each covariate. However, the seasonal pattern was quite prominent, and seasonal effects were observed in both the mean and the variance of the process. So sine functions with a period of one year were included in both the seasonal term and the standard deviation of the error term (except for RH, for which no seasonal effect was assumed for the variance of the process, based on Figure 4).
To capture the autocorrelation within and among the covariate processes, a lag-2 vector autoregressive model [i.e., VAR(2)] was used for the error term:

$$e_i(t) = Q_1\, e_i(t-1) + Q_2\, e_i(t-2) + a_i(t). \qquad (14)$$

In the above equation, $Q_1$ and $Q_2$ are matrices of regression coefficients, and the $a_i(t)$'s are multivariate normal random errors whose distribution does not change over time.

The parameters in models (13) and (14) are estimated in two steps. First, the ML estimates of the seasonal effects in the process mean and variance structures are obtained by ignoring the autocorrelation in the error terms. Then the VAR model is fitted to the residuals calculated from the first step using the multivariate least squares approach [51]. The bootstrap method is used for obtaining the CIs of the parameters in the dynamic covariate process.
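The second step, fitting the VAR(2) error model by multivariate least squares, can be sketched as follows. This is a hedged illustration on simulated residuals; the coefficient matrices and noise level are arbitrary, not estimates from the NIST data.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_var2(E):
    """E: (T, k) residual series. Multivariate least-squares fit of
    E_t = Q1 E_{t-1} + Q2 E_{t-2} + a_t; returns (Q1, Q2)."""
    k = E.shape[1]
    Y = E[2:]                                   # responses
    Z = np.hstack([E[1:-1], E[:-2]])            # lag-1 and lag-2 regressors
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # stacked (2k, k) coefficients
    return B[:k].T, B[k:].T

# Simulate a small stable VAR(2) and recover its coefficients.
Q1_true = np.array([[0.5, 0.1], [0.0, 0.3]])
Q2_true = np.array([[0.2, 0.0], [0.1, 0.1]])
E = np.zeros((5000, 2))
for t in range(2, 5000):
    E[t] = Q1_true @ E[t - 1] + Q2_true @ E[t - 2] + rng.normal(0, 0.1, 2)
Q1_hat, Q2_hat = fit_var2(E)
```

Because the regressors are simply the lagged residuals, ordinary least squares applied column by column is exactly the multivariate least-squares estimator used in this two-step approach.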

3.4 Reliability Prediction

To predict the failure time and reliability, let $\mathcal{D}_f$ denote the threshold for a soft failure. For any fixed covariate path $x(t)$ and random effects $w$, the degradation path is fixed, and the failure time can be determined by

$$t_D = \min\{t : \mathcal{D}(t; x, w) \text{ reaches } \mathcal{D}_f\}. \qquad (15)$$

However, for a random unit, the covariate process $X(t)$ and the random effects $w$ are random. Hence, the cdf of the failure time, $F_T(t; \theta) = \Pr(T \le t)$, can be defined as the probability that the random degradation path reaches $\mathcal{D}_f$ by time $t$, with $\theta$ denoting all the unknown parameters. There is usually no closed-form expression for $F_T(t; \theta)$. Hence, the cdf evaluated at the estimate $\hat{\theta}$ is obtained through the Monte Carlo simulation outlined in the following steps [25]:

  1. Simulate the covariate process based on the estimated parameters of the covariate model.

  2. Simulate the random effects $w$ from the fitted bivariate normal distribution.

  3. Compute the degradation path based on the simulated covariate process and random effects.

  4. For the degradation path in step 3, determine the failure time by Eqn. (15).

  5. Repeat steps 1 to 4 $M$ times to obtain the simulated failure times $t_1^*, \dots, t_M^*$. Then $F_T(t; \hat{\theta})$ is estimated by the empirical cdf of the simulated failure times.
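The five steps above can be sketched with a toy model. This is a hedged stand-in, not the fitted NIST model: the lognormal random degradation rate plays the role of the unit random effects, the uniform multiplier plays the role of the covariate effect, and the threshold is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_failure_times(M, t_grid, threshold):
    """Steps 1-5: simulate covariates and random effects, build each
    decreasing degradation path, and record its first threshold crossing."""
    dt = t_grid[1] - t_grid[0]
    times = np.full(M, np.inf)                       # inf = never crossed
    for m in range(M):
        rate = np.exp(rng.normal(-3.0, 0.3))         # unit random effect
        usage = rng.uniform(0.5, 1.5, t_grid.size)   # stand-in covariate
        path = -np.cumsum(rate * usage * dt)         # decreasing degradation
        crossed = np.nonzero(path <= threshold)[0]
        if crossed.size:
            times[m] = t_grid[crossed[0]]
    return times

t_grid = np.linspace(0.0, 300.0, 3001)
ft = simulate_failure_times(2000, t_grid, threshold=-10.0)
cdf_at_150 = np.mean(ft <= 150.0)     # empirical cdf estimate at t = 150
```

The empirical cdf of the simulated crossing times is exactly the estimator described in step 5; units whose paths never reach the threshold within the grid simply contribute to the right tail.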

Using the bootstrap approach, the point estimates and the CIs of $F_T(t)$ can be calculated from the sample mean and quantiles of a large number of bootstrap estimates of $F_T(t)$. Based on Monte Carlo simulations and bootstrap samples, Figure 5 shows the predicted $F_T(t)$ and its 95% pointwise CIs for the NIST coating degradation data. We can see that, for a population of units with random starting times between 161 and 190 days, a majority of the population will fail between 50 and 150 days in service.

Figure 5: The estimated cdf and corresponding 95% pointwise CIs for a population of units with random starting time between 161 and 190 days. Figure reproduced with permission.

4 Recurrent Event Data Analysis

In this section, we briefly introduce the multi-level trend renewal process (MTRP) model and its application on the Vehicle B data as described in Xu et al. [36].

4.1 Background and Data

Xu et al. [36] consider the modeling and analysis of the Vehicle B data, which consist of recurrent event data from a batch of industrial systems. Vehicle B is a two-level repairable system. During its life span, Vehicle B may experience events at the subsystem level (e.g., engine failures) and/or events at the component level (e.g., oil pump failures). The field data cover a 110-month observation period, during which 219 component events and 44 subsystem events were observed. Figure 6(a) shows the event plot for ten randomly selected units. We also have the cumulative usage information available for each unit, which is a dynamic covariate. The cumulative usage information is shown in Figure 6(b). The goal is to predict the cumulative number of component event occurrences at a future time.

(a) Recurrent event processes for a subset of system units (b) Cumulative usage processes for a subset of system units
Figure 6: Plots of (a) the event processes and (b) the cumulative usage processes for ten randomly selected units in the Vehicle B fleet. Figure reproduced with permission.

We need some notation to introduce the MTRP model. Suppose there are $n$ units under observation from time 0 to $\tau_i$. Let $x_i(t)$ be the time-dependent covariate at time $t$ for system $i$. Let $N_i^s(t)$ be the number of subsystem events and $N_i^c(t)$ be the number of component events up to time $t$. The total number of replacement events is $N_i(t) = N_i^s(t) + N_i^c(t)$. The subsystem event times are sorted as $t_{i1}^s < t_{i2}^s < \cdots$, and the component event times are sorted as $t_{i1}^c < t_{i2}^c < \cdots$. Let $t_{i1} < t_{i2} < \cdots$ be the replacement event times, regardless of their types.

4.2 The MTRP Model and Parameter Estimation

For a two-level repairable system, Xu et al. [36] propose the MTRP model to describe events occurring at the component level, specified through the event intensity function conditional on the historical information $\mathcal{H}_{it}$. In this multi-level model framework, the effect of subsystem events on the component event process is modeled through an event-adjustment function, and the renewal distribution function describes the distribution of the gap times under the transformed time scale. The baseline function models the intensity of the component process when there is no event adjustment, and the cumulative event intensity functions are obtained by integrating the corresponding intensity functions. The model in (18) can be extended to incorporate dynamic covariates and random effects.

To incorporate the dynamic covariates, the intensity function can be extended by multiplying the baseline trend function by a regression term in the covariates with coefficient vector $\gamma$; in the Vehicle B application, the cumulative usage is used as the covariate. To incorporate random effects, the intensity function can be further extended with a unit-specific frailty term $w_i$, where the $w_i$'s are independent and identically distributed random effects. The MTRP with random effects is referred to as the HMTRP, in which HMTRP stands for heterogeneous MTRP.

To estimate the model parameters, one needs to construct the likelihood function. The component-event data can be denoted by the event times $t_{ij}$ together with the component-event indicators $\delta_{ij}$, and the event history up to time $t$ is denoted by $\mathcal{H}_{it}$. The likelihood takes the standard point-process form

$$L(\theta) = \prod_{i=1}^{n} \left\{ \prod_{j} \lambda_i^c(t_{ij} \mid \mathcal{H}_{i t_{ij}})^{\delta_{ij}} \right\} \exp\!\left[-\int_0^{\tau_i} \lambda_i^c(u \mid \mathcal{H}_{iu})\, du\right],$$

where $\lambda_i^c(\cdot \mid \mathcal{H}_{it})$ is the component-event intensity. Xu et al. [36] use Bayesian methods with diffuse priors to estimate the model parameters. The Metropolis-within-Gibbs algorithm is used to obtain the posterior distributions, and inference is carried out using the Markov chain Monte Carlo (MCMC) samples from the posterior distributions.
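A Metropolis-within-Gibbs sampler updates one parameter (or block) at a time with a univariate random-walk Metropolis step. The hedged sketch below targets a toy correlated bivariate normal posterior, not the MTRP posterior; the step size and target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_post(theta):
    """Toy log-posterior: bivariate normal with correlation 0.6."""
    x, y = theta
    return -0.5 * (x**2 + y**2 - 1.2 * x * y) / (1.0 - 0.36)

def metropolis_within_gibbs(log_post, theta0, n_iter=6000, step=1.0):
    theta = np.array(theta0, dtype=float)
    draws = np.empty((n_iter, theta.size))
    lp = log_post(theta)
    for it in range(n_iter):
        for j in range(theta.size):              # one coordinate at a time
            prop = theta.copy()
            prop[j] += step * rng.normal()
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
                theta, lp = prop, lp_prop
        draws[it] = theta
    return draws

draws = metropolis_within_gibbs(log_post, [0.0, 0.0])[1000:]  # drop burn-in
```

Scanning the coordinates keeps each proposal low-dimensional, which is convenient when different parameter blocks (e.g., regression coefficients versus frailty terms) need different proposal scales.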

4.3 Prediction for Component Events

To make predictions of component events, let $\theta$ denote the vector of all parameters and let $x_i(t, t+s]$ be the covariate information between times $t$ and $t+s$. The prediction of the count of component events at time $t+s$ is the conditional expectation of the cumulative number of component events in $(t, t+s]$, given the observed history and the future covariate information. Here $\hat{N}_i(t+s)$ denotes the prediction for unit $i$. Because there is no closed-form expression for this expectation in (22), Monte Carlo simulation is used.
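One generic way to carry out such a simulation is thinning (Lewis-Shedler): propose events from a homogeneous process whose rate dominates the intensity, then keep each proposal with probability intensity/bound. This is a hedged sketch; the power-law intensity below is an arbitrary stand-in for the fitted MTRP/HMTRP intensity, and the horizon mimics predicting 30 months past month 95.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_events(intensity, t_start, t_end, lam_max):
    """Thinning: propose from a homogeneous process at rate lam_max and
    accept each proposal with probability intensity(t) / lam_max."""
    times, t = [], t_start
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t > t_end:
            break
        if rng.uniform() < intensity(t) / lam_max:
            times.append(t)
    return np.array(times)

intensity = lambda t: 0.05 * (t / 50.0) ** 0.5     # mildly increasing trend
counts = [simulate_events(intensity, 95.0, 125.0, lam_max=0.08).size
          for _ in range(1000)]
mean_count = np.mean(counts)       # Monte Carlo estimate of E[N(125)-N(95)]
```

Averaging the simulated counts over many replications gives the point prediction, and the empirical quantiles of `counts` give the corresponding pointwise prediction intervals.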

When fitting the MTRP model to the Vehicle B data using Bayesian estimation, one needs to specify the prior distributions for the unknown parameters. The Weibull distribution was used as the renewal distribution. To check the prediction performance, the last 15 months of the Vehicle B data were first held back, and only the first 95 months of data were used to estimate the MTRP model and generate predictions for the last 15 months. Figure 7(a) shows the prediction of the cumulative number of component events for the last 15 months based on the earlier data. One can see that the actual observed cumulative numbers of component events lie close to the predicted values and are well bounded within the pointwise prediction intervals. Figure 7(b) shows the predicted future events for the next 30 months given all the observed data, which indicates that the total number of component events is expected to range between 62 and 90 with 95% confidence.

(a) Back test based on an early subset of the data (b) Prediction of future events
Figure 7: Plots of the predicted cumulative number of component events for Vehicle B for (a) the last 15 months based on the earlier 95 months of data and (b) the future 30 months based on all observed data. Figure reproduced with permission.

5 Sequential Test Planning of Accelerated Life Tests

In this section, we briefly introduce the sequential Bayesian design (SBD) for fatigue test experiments described in Lee et al. [43].

5.1 Background and Historical Data

A sequential Bayesian test planning strategy for accelerated life tests was proposed by Lee et al. [43]. The fatigue test for glass fiber, a composite material, is considered to illustrate the sequential design strategy. In the test, a tensile/compressive stress (positive/negative value) is applied to the test unit and the material's lifetime is observed under that stress. In this work, 14 observations of E-glass are available, including 11 failed and 3 right-censored units. The historical data are shown in Figure 8. Several other important factors in the test are set as follows. Let $R = \sigma_{\min}/\sigma_{\max}$ denote the stress ratio, where $\sigma_{\min}$ is the minimum stress and $\sigma_{\max}$ is the maximum stress. The value of $R$ reflects the test type, and it is set for a tension-tension loading test in this application. The ultimate stress $\sigma_u$, at which the material breaks at the first cycle, is set to 1339.67 MPa. The frequency of the cyclic stress testing ($f$) is set at 2 Hz, and the angle ($\alpha$) between the testing direction and the material is set at 0.

(a) Stress-life relationship (b) Log of cycles vs. stress levels
Figure 8: The plots show the historical data from fatigue testing of the glass fiber, with the fitted stress-life relationship. Figure reproduced with permission.

5.2 Lifetime Model

Consider using a log-location-scale distribution to model the cycles to failure, $N$. The cdf and pdf are given as

$$F(n) = \Phi\!\left[\frac{\log(n) - \mu}{\sigma}\right], \qquad f(n) = \frac{1}{\sigma n}\, \phi\!\left[\frac{\log(n) - \mu}{\sigma}\right],$$

where $\Phi$ and $\phi$ are the standard cdf and pdf, respectively. The lognormal and Weibull distributions are common choices. In the ALT modeling, we assume a constant scale parameter $\sigma$, and the location parameter is $\mu = \mu(s)$, where $s$ is the stress level. The nonlinear stress-life model for composite materials proposed in [52] is used to describe $\mu(s)$; its parameters capture effects from the environment and the material, together with the stress ratio $R$, the frequency $f$, and the angle $\alpha$ described above. These parameters, together with $\sigma$, are collected in $\theta$, which denotes the unknown parameters in the ALT modeling.

The lower quantiles of the cycles-to-failure distribution are of interest, as they contain material life information. The log of the $p$th quantile is

$$\log(n_p) = \mu(s) + z_p\, \sigma,$$

where $n_p$ is the $p$th quantile of the cycles-to-failure distribution at the use condition and $z_p = \Phi^{-1}(p)$ is the $p$th quantile of the standard distribution. Our goal is to propose test planning under multiple use conditions to approximate real use scenarios. The use stress profile consists of a set of use levels $s_1, \dots, s_K$ with weights $\omega_1, \dots, \omega_K$ satisfying $\sum_{k=1}^{K} \omega_k = 1$.
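The quantile formula, averaged over the use-stress profile, can be sketched as follows. This is a hedged illustration: the linear `mu_of_s`, the stress levels, the weights, and $\sigma$ are placeholder assumptions standing in for the nonlinear model of [52] and the lognormal case of $\Phi$.

```python
from statistics import NormalDist

def log_quantile(mu, sigma, p):
    """log(n_p) = mu(s) + z_p * sigma, with z_p from the standard normal."""
    return mu + NormalDist().inv_cdf(p) * sigma

def weighted_log_quantile(stress_levels, weights, mu_of_s, sigma, p):
    """Average the log p-th quantile over the use-stress profile."""
    assert abs(sum(weights) - 1.0) < 1e-9   # weights must sum to one
    return sum(w * log_quantile(mu_of_s(s), sigma, p)
               for s, w in zip(stress_levels, weights))

mu_of_s = lambda s: 20.0 - 0.012 * s        # placeholder stress-life line
val = weighted_log_quantile([250.0, 300.0, 350.0], [0.5, 0.3, 0.2],
                            mu_of_s, sigma=0.8, p=0.05)
```

In the sequential design, a quantity of this form (or the variance of its estimator) serves as the planning criterion that each new test run is chosen to improve.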

Let $d_j$ denote the data for the $j$th testing unit, where