In the past decade, prognostics has emerged as one of the key enablers for industrial systems to become more reliable, operationally available, and economically maintained . Prognostics technologies aim to monitor the performance of a system (or a component), assess the health status, and predict the remaining useful life (RUL). Based on the predicted future performance, informed asset management strategies can be better planned to reduce operational risks and costs. Prognostics has been used for various engineering systems, such as engines [2, 3, 4], batteries [5, 6], electronics [7, 8], and bearings [9, 10, 11, 12]. In this paper, we propose a new and flexible prognostics framework based on a higher order hidden semi-Markov model (HOHSMM) to assess the health state and estimate the RUL of a system using condition monitoring data. We are particularly motivated by applications where the health state cannot be directly observable and the transition dynamics of the hidden health state are complex (e.g., depending on distant past, non-geometric sojourn time in each state). For example, turbofan engines typically have complex failure mechanisms and unobservable health conditions, which can only be inferred from sensor measurements, and health state transitions are often history-dependent, violating the first order Markovian assumption. Novel techniques that can model and predict such complex transition behaviors are needed.
Prognostics approaches can generally be classified into two categories: model-based approach[12, 13, 14, 15, 11, 16, 17, 18], and data-driven approach [10, 19, 20, 21, 22, 23, 24, 25, 26, 27]. There are also hybrid models [28, 29, 5, 30] that attempt to combine the strengths of model-based and data-driven approaches by fusing the results from both approaches. The model-based approaches require a good understanding of system physics-of-failure mechanisms. Most of model-based approaches deal with crack, wearing, and corrosion phenomena. For example, Li et al.  propose to use the Paris–Erdogan model to predict a bearing’s crack propagation and estimate the crack size. Similarly, Paris–Erdogan equation is used to model fatigue crack growth and a stochastic filtering technique is applied for real-time RUL estimation . Daigle and Goebel  develop a model-based prognostics framework for a centrifugal pump that includes models of the most significant damage progression processes such as impeller wear and bearing wear. Damage processes are characterized by a set of parameterized functions (e.g., erosive wear equation, friction coefficient equation), which describe how damage variables evolve in time. More model-based prognostics methods can be found in [11, 16, 15, 17, 18]. Model-based approaches are built on the knowledge of the processes and failure mechanisms occurring in the system of concern, and therefore the approaches allow for identification of the nature and extent of the fault. However, the limitations of model-based approaches are (1) specific domain knowledge for developing physical models is required but may not always be available, and (2) it is usually a challenging task to create dynamic models representing multiple physical processes occurring in complex systems.
With the rapid development of sensor technologies, it has become much easier and less costly to obtain condition monitoring data, including operational and environmental loads as well as performance conditions of the monitored system (e.g., temperature, vibration, pressure, voltage, current) 
. Advancements in modern sensor instruments have greatly facilitated data-driven prognostics. The monitoring data provide useful information for building a behavior model to characterize the evolution of system performance. Data-driven approaches use several tools, most of which originate from machine learning and statistical domains
. Among different machine learning techniques, neural networks and neuro-fuzzy networks are the most commonly used ones. Huang et al.
propose a prognostics framework for ball bearings based on self-organizing map (SOM) and neural network. The degradation indicator is extracted using SOM and the residual life is predicted based on back propagation neural networks. Gebraeel and Lawley develop a degradation model based on dynamic wavelet neural network to compute and continuously update residual life distributions of partially degraded component using condition-based sensory signals. A neuro-fuzzy network is used to predict the future health state of a gear in . Furthermore, Wang 
develops an adaptive predictor based on neuro-fuzzy approach to forecast the behavior of dynamic systems, where the forecasting is performed by fuzzy logic and the fuzzy system parameters are trained by using neural networks. Various statistical tools have been used for prognostics, including time series analysis models, Kalman and particle filters, and Markov chains. Yan et al.
employ a logistic regression model to compute the failure probability given condition variables and then use an autoregressive moving average model to predict future performance and estimate the RUL. Swanson et al.
use Kalman filter to track the time evolution of a crack in a tensioned steel band. More general than Kalman filter, particle filter can be used to perform nonlinear projection of features, which is exploited for RUL estimation of a mechanical component subject to fatigue crack growth.
Most of the existing strategies estimate RUL by predicting the degradation level (e.g., crack size), which is either directly observable or can be quantified based on sensor signals. In practice, the damage level of many engineering systems (e.g., turbofan engine) cannot be easily quantified due to complex failure mechanisms. Hidden Markov models (HMMs) are commonly used to infer the hidden health state directly from the observed data (e.g., sensor measurements) and predict the RUL. An HMM is defined as a statistical model that is used to represent stochastic processes, where the states are not directly observed but can emit corresponding observations . Bunks et al.  illustrate the applications of HMMs by using the Westland helicopter gearbox data set and show that HMMs can provide a natural framework for both health diagnostics and prognostics. Tobon-Mejia et al. 
develop a mixture of Gaussians hidden Markov model (MoG-HMM) to predict RUL of bearings. They use wavelet packet decomposition technique to extract continuous features from the monitoring signals and then use the features as observations to train MoG-HMMs. The learned MoG-HMMs are then exploited to assess the current condition of a bearing and estimate its RUL. However, standard HMMs have two inherent limitations. One is the assumption of first order Markovian dynamics of the hidden state process. The other is that the state duration (i.e., sojourn time) implicitly follows a geometric distribution. The first order assumption can be restrictive as the health state of complex systems usually evolves depending on its more distant history, not just the current state. Moreover, the duration time in one state does not always follow a geometric distribution. To provide a more adequate representation of temporal structure, hidden semi-Markov model (HSMM) extends HMM by assuming that the state duration is generally distributed. Moghaddass and Zuo propose an HSMM-based prognostic framework to assess the health conditions and estimate the RULs for gradually degraded systems. They demonstrate the proposed model by a case study on NASA turbofan engines, where principle component analysis (PCA) is used to extract features from multiple sensor measurements. However, the HSMM in  still makes the first order Markovian dynamics assumption.
In this paper, we propose a new prognostics framework based on HOHSMMs for systems with unobservable health state and complex transition dynamics. In the HOHSMM-based framework, the important features extracted from the monitoring data are used as observations and the underlying health status of the concerned system is represented in the form of hidden states, which evolve depending not only on the current state but also on its more distant history. The sojourn time in each state is generally distributed and is assumed to follow an explicit distribution. We design an effective Gibbs sampling algorithm for model inference and conduct a simulation experiment to evaluate the performance of the proposed HOHSMM sampler. The learned HOHSMM is then exploited to assess the current health state of a functioning system in operation and predict its RUL. Decoding algorithm is developed for health state assessment using the learned model. The RUL is estimated using a simulation approach by generating paths from the current health state to the failure state. Furthermore, we demonstrate the practical utility of the proposed prognostics framework by conducting a case study on NASA turbofan engines. The main contributions of this paper are two-fold.
Develop a new and advanced HOHSMM-based prognostics framework to assess hidden health state and predict the RUL for complex systems. The proposed HOHSMM includes the HMM and HSMM as two special cases.
Design efficient algorithms for HOHSMM inference, hidden state decoding, and RUL prediction. A Gibbs sampling algorithm is developed for HOHSMM inference and the simulation experiment shows that the designed HOHSMM sampler is effective for learning model parameters from observations. Based on the learned model, a decoding algorithm is developed for hidden health state assessment and an RUL estimation algorithm is developed for prognostics. The case study on NASA turbofan engines shows that the HOHSMM-based prognostics framework provides satisfactory hidden health state assessment and RUL estimation for complex systems.
The remainder of this paper is organized as follows. Section 2 provides preliminaries on higher order hidden Markov model (HOHMM). In Section 3, we develop an HOHSMM and design an effective sampling algorithm for statistical inference. Section 4 presents the hidden state decoding procedure using the learned model. The RUL is predicted using a simulation approach in Section 5. We conduct a simulation experiment to evaluate the performance of the proposed HOHSMM sampler in Section 6. A case study on NASA turbofan engines is demonstrated in Section 7. Section 8 discusses the concluding remarks and future work.
2 Preliminaries on Higher Order Hidden Markov Model
An HOHMM consists of two processes: a hidden process , which evolves according to a higher order Markov chain with discrete state space, and a potentially multivariate observed process observed sequentially over a set of discrete time points . HOHMMs extend the idea of basic HMMs by allowing the hidden state sequence to depend on its more distant past. An HOHMM of maximal order makes the following set of conditional independence assumptions:
Note that an HOHMM is said to be of maximal order if the distribution of only depends on a subset of . If the distribution of actually varies with the values at all the previous time points, the HOHMM is considered to be of full order .
While the HOHMM relaxes the restrictive first order assumption of the basic HMM, it also brings significant dimensionality challenge. For known state space , the transition probabilities are now indexed by different possible values of the lags and involve a total number of parameters, which increases exponentially with the order . To address this issue, latent allocation variables for and are introduced to shrink the total number of parameters. The allocation variable , taking values from , is the respective latent class that a particular state of is allocated into. The total number of the latent classes () then determines the inclusion of the lag . If , it means that is not an important lag for . If for all , the HOHMM is of full order . Based on the allocation variable , the hidden states are conditionally independent as shown in Figure 1.
We denote the probability that the lag is allocated into latent class by , i.e., . Given the combination of allocated latent classes , the state transition probability is denoted by for ,
Then the transition probability can be structured through the following hierarchical formulation
The parameters and are all non-negative and satisfy the constraints:
, for each combination ;
, for each pair .
In such a factorization, the number of parameters is reduced to , which is much smaller than if .
Marginalizing out the latent class indicators , the transition probability has an equivalent form as
where for all . Thus, the -step transition probability can be obtained as
Efficient two-stage Gibbs sampling algorithms have been designed in  for HOHMM inference. First, a hierarchical Dirichlet prior is assigned on ,
The dimension of varies with . Independent priors on the ’s are assigned as
Finally, the following independent priors are assigned on ’s
where . The prior assigns increasing probabilities to smaller values of as the lag becomes more distant, reflecting the natural belief that increasing lags have diminishing influence on the distribution of . The generic form of the emission distribution is expressed as , where represents parameters indexed by the hidden states.
The joint distribution of, and admits the following factorization
where , representing the history state of . The conditional independence relationships encoded in the factorization are used in deriving MCMC algorithms to draw samples from the posteriors. Detailed sampling algorithms are referred to .
3 Higher Order Hidden Semi-Markov Model
In this paper, we extend an HOHMM to a higher order hidden semi-Markov model (HOHSMM), where the hidden state sequence is governed by a semi-Markov chain. The HOHSMM is more flexible since it incorporates additional temporal structure by allowing the state duration to be generally distributed, rather than implicitly following a geometric distribution as in an HOHMM.
3.1 Model development
We first give a brief description of the HSMM and then develop the HOHSMM. There exist several specific models of HSMM which make different assumptions regarding the dependence between state transition and duration, for example, residential time HMMs and explicit duration HMMs . A residential time HMM assumes that the current state and its duration time are determined by the previous state, and independent to the duration of the previous state. An explicit duration HMM assumes that a transition to the current state is independent to the duration of the previous state and the duration is only conditional on the current state. We consider the explicit duration setting in our HOHSMM. Both HSMMs and HOHSMMs with explicit duration exclude state self-transitions because the duration distribution can not fully capture a state’s possible duration time if self-transitions are allowed.
An explicit duration HMM assumes that the underlying stochastic process is governed by a semi-Markov chain . Each state has a variable duration that follows an explicit state-specific distribution and a number of corresponding observations are produced while in the state (illustrated in Figure 2).
The observation sequence is produced segmentally from the emission distribution indexed by the hidden super-state sequence , where is the number of segments. Observations are assumed to be collected discretely by a unit time, and therefore the number of observations produced in each super-state represents the state duration. For the segment, the state duration is denoted by and denotes the produced observations, where , . In the last segment, the observations may be truncated, and we have .
In the proposed HOHSMM with the explicit duration setting, the hidden super-state sequence is assumed to be governed by a higher order Markov chain and the state duration follows an explicit distribution (e.g., Poisson distribution), denoted bywith the parameters indexed by the specific hidden super-state . An explicit-duration HOHSMM of maximal order is constructed as follows
Figure 3 illustrates a second order HSMM. In this example, the distribution of the hidden super-state depends on its previous two states and , and the duration time in each super-state is generally distributed, following an explicit state-specific distribution.
The hierarchical Dirichlet prior assigned for transition distribution parameter in Equations (8) and (9) does not exclude self-transitions. In order to exclude self-transitions in the super-state sequence for an HOHSMM, a modified hierarchical Dirichlet prior is assigned as 
Equation (15) ensures that the self-transition probabilities are zeros. Note that is the latent class the hidden super-state (the immediate precedent super-state of ) is allocated into and is the state of . Therefore, to have a valid comparison between and and exclude self-transitions, each possible state of must be allocated to a distinct latent class. In other words, each state of has its own latent class. To do so, we let and , where , and . For the remaining lags, the independent priors on the allocation distribution are assigned as
The transition probability is modeled as
Similarly, by introducing latent allocation variables for and with , the hidden super-states are conditionally independent and the model can be represented through the following hierarchical formulation
3.2 Model inference
We use the MCMC sampling method for explicit-duration HOHSMM inference. The sampler is designed based on the two-stage Gibbs sampling algorithms for HOHMM . There are additional challenges due to explicit temporal structure, excluding self-transitions, and multiple observed trajectories in the training data.
The first challenge is brought by incorporating explicit temporal structure (i.e., duration distribution), which requires additional sampling to determine the number of segments (i.e., the number of hidden super-states) and the duration time in each state. Existing sampling inference methods for HSMMs often use a message-backwards and sample-forwards technique to address this problem 
. However, it is not applicable for an HOHSMM since the backwards messages are extremely difficult to define and compute when higher order transitions present. The reversible jump MCMC provides a statistical inference strategy for Bayesian model determination, where the dimensionality of the parameter vector is typically not fixed (e.g, the multiple change-point problem for Poisson processes)
. However, it cannot be used to sample change-points of a sequence in an HOHSMM since there is no appropriate mechanism to update the hidden super-states affected by the moves of change-points (e.g., birth of a change-point, death of a change-point). The second challenge is brought by excluding self-transitions. A Dirichlet distribution is assigned as the conjugate prior for transition probability parameters. However, the conjugacy does not exist after setting self-transition probabilities to zeros. A mechanism to recover the conjugacy for updating transition probability parameters is needed. In addition, in many real-world applications, several identical units are typically monitored at the same time to collect sensor data. How to leverage all information provided by multiple observed trajectories (i.e., observation sequences) instead of using just one sequence is the third challenge. We address these difficulties in the following two sections.
3.2.1 Update segmentation
We denote run-to-failure observation sequences by and the observation sequence by , where is the observed length and . These sequences are assumed to be independent. To address the first challenge, we introduce a jump size threshold () to identify change-points. For each observation sequence, if the Euclidean distance between a point and its immediate previous point is greater than , this point is identified as a change-point. The prior of
is assigned to be a uniform distribution with support, where and are the and percentile values obtained
from the distances between two adjacent observed data points in all observation sequences, respectively. We then update the segmentation of the observation sequences and initialize the hidden super-state sequences iteratively by sampling the jump size threshold .
In each iteration of the HOHSMM sampler, we propose a new threshold from . For each observation sequence, we mark change-points based on the computed distances (illustrated in Figure 4(a)) and the sequence is segmented accordingly. After the initial segmentation, we compute the center of the observed data points for each segment and label the segments by clustering the centers. The hidden super-states are initialized by using the clustering labels. To exclude self-transitions, if two adjacent segments have the same clustering label, we merge these two segments. For example, the first two segments in Figure 4(a) share the same label 1, and these two segments are merged into one. After the clustering and merging processes, we obtain the final segmentation and the initialized hidden super-states of an observation sequence for a given jump size threshold (illustrated in Figure 4(b)). Based on the segmentation results, we also obtain the number of segments and the state duration for each observation sequence, denoted as and , respectively.
The hidden super-state sequence , latent allocation variables , and other parameters , , , , , are updated using the two-stage Gibbs sampling algorithm for the HOHSMM. The first stage is to identify the important lags by sampling from the posterior. Given the determined , we collect the samples of other parameters in the second stage. The obtained samples will be used to compute the acceptance probability for updating jump size threshold . In general MCMC sampling, the acceptance probability can be computed as
Since the prior of is a uniform distribution and is also proposed from the uniform distribution, it is obvious that the prior ratio and the proposal ratio are equal to 1. The posterior mean of the likelihood can be approximated using the obtained samples , which is provided as
where is a set of posterior samples generated from their posterior distributions.
Given all the collected samples of , the most likely jump size threshold is determined by computing the average value of the samples after burn-in. We then use to update segmentation and repeat the two-stage Gibbs sampling process to obtain the final segmentation and samples for all parameters. Given an explicit distribution for each super-state’s duration, the MLEs for parameters can be easily obtained using the final segmentation result.
3.2.2 The two-stage Gibbs sampling algorithm for HOHSMMs
Given the segmentation, we modify the two-stage Gibbs sampling algorithm in  to determine the values of , , , , , , and in the HOHSMM. The first stage is to determine the values of , which is the important lag indicator. Given determined values of , we update other model parameters , , , , , latent allocation variables , and hidden super-state sequence in the second stage. The joint distribution of , and for the sequence can be presented as
where , and , , . To address the third challenge of multiple trajectories, we use the joint distribution of all observation sequences. Based on the assumption that all sequences are independent, the joint distribution can be obtained as follows
The conditional independence relationships encoded in the joint distribution are used in deriving the two-stage Gibbs sampling algorithm for HOHSMMs.
Specifically, in the first stage, we identify important lags and the corresponding number of latent classes by sampling . In this stage, we use an approximated model which forces hard allocation of ’s instead of soft allocation. Hard allocation means that, partition the state space into clusters for the lag, then each cluster corresponds to its own latent class. In other words, each state is allocated into one class with probability 1. For example, partition the states into clusters with and for the lag, hard allocation means that , and 3 will be allocated to the first latent class and , and 6 will be allocated to the second one with probability 1. In soft allocation, one state can be allocated into several possible classes with specific probabilities. The mixture probabilities in the approximated model are denoted by , indicating hard clustering while indicates soft allocation.
Based on the approximated model, samples of the parameters are drawn from their respective conditional posteriors following the pre-specified order. We first examine the posteriors of the transition distributions and . There exist computational machineries of sampling from the posteriors in hierarchical Dirichlet process (HDP) models . In the HOHMM, the Dirichlet distribution is the conjugate prior of transition probability parameter, so it is straightforward to update the parameters . However, in our HOHSMM, the method used to exclude self-transitions makes the model not fully conjugate. Specifically, let , which counts the number of transitions from the latent allocation classs to state among all observation sequences where and for . Because of no self-transitions, we have . We consider the posterior distribution of ,
Because of the extra terms from the likelihood by excluding self-transitions, we cannot reduce this expression to the Dirichlet form over the components of . Therefore, the model is not fully conjugate and new posteriors need to be derived. To recover conjugacy, we introduce auxiliary variables , where . Each is independently drawn from a geometric distribution with specific success parameter . We adjust the sampling algorithm by updating transition parameters from the posterior distribution
Then we compute from Equation (15) and update .
Since the observation sequences are independent, and are updated sequence by sequence. For each sequence, and are sampled by applying a Metropolis-Hastings step and using simulated annealing to facilitate the convergence. The full conditionals of will depend on the choice of the emission distribution. Finally, a stochastic search variable selection (SSVS) method  is used to sample from their posteriors and are updated by the latent allocation cluster mapping. In the first stage, important lags can be determined and the number of latent classes for each important lag can be derived based the samples of .
The second stage, given the important lag inclusion result, is to sample parameters , , , , , and iteratively. Given the segmentation, the elements of , and have either multinomial or Dirichlet full conditionals and can be straightforwardly updated. Sampling of , , and emission parameters is the same as described in first stage. Details of the HOHSMM inference method are summarized in Algorithm 1.
4 Health States Decoding
The ultimate purpose of a prognostics framework is to assess the current condition of a system (or component) and to make inferences regarding the remaining useful life (RUL). In this section, we first present how to use the HOHSMM-based prognostics framework to decode the hidden super-states. For an operating system with observation sequence , Equation (22) provides the joint distribution of , and