1 Motivation and Introduction
The hidden Markov model (HMM) is a widely used statistical learning tool for modeling uncertain dynamical systems [5], where the associated temporal data are used to infer a Markov chain with unobserved states. In this setting, the learning task is to infer the states and the corresponding parameters of the Markov chain. In addition to HMMs, several other nonlinear techniques have been proposed for Markov modeling of time-series data. Symbolic time-series-analysis-based Markov modeling is a recently proposed technique [24] in which the states of a Markov chain are represented as a collection of words (i.e., symbol blocks, also referred to as memory words) of different lengths, which can be identified from the time-series data on a discrete space with finite cardinality [28, 24, 18, 7]. The symbols are created from the continuously varying time-series data by projecting the data onto a set with finite cardinality. A common ground among all these tools for Markov modeling of discrete sequences is that the Markov chain is induced by a probabilistic representation of a deterministic finite state automaton (DFSA), often called a probabilistic finite state automaton (PFSA) [31]. While PFSA-based inference provides a consistent, deterministic graph structure for learning, the deterministic algebraic structure is generally not a very compact representation and may lead to a large number of states in the induced Markov model. To circumvent this problem, attempts have been made to reduce the state space by merging statistically similar states of the model [18]. The problem, however, is that because these models are constructed by partitioning the phase space of the dynamical system, merging statistically similar states leads to algebraic inconsistency. On the other hand, if the states are merged so as to preserve algebraic consistency, the final models suffer from statistical impurity (i.e., states with different statistics may be merged together). Other approaches for state aggregation in Markov chains can be found in [9, 32, 35]. However, these papers do not consider inference of the Markov model from data, which makes them less suitable for the analysis of data-driven systems [8]. The state space of Markov models created by symbolic analysis increases exponentially with the memory, or order, of the symbolic sequence. Estimating the right memory is critical for temporal modeling of the patterns observed in sequential data.
However, some of the states may be statistically similar, and merging them can therefore reduce the size of the state space. This paper presents reduced-order Markov modeling of time-series data to capture temporal patterns. We estimate the size of the temporal memory of the symbolic data using the spectral properties of a PFSA whose states are words of length one [29, 13]. The constraint of a deterministic algebraic structure is not imposed by the end objective but by the choice of the data representation model. We therefore propose to merge the states and remove the constraint of the deterministic algebraic properties associated with a PFSA; the states of the Markov chain are now collections of words over its alphabet, of the length estimated in the previous step. This state aggregation induces nondeterminism in the finite state model. The parameters of the reduced-order Markov model are estimated from the parameters of the higher-order Markov model by a Bayesian inference technique. The final model for data representation is selected using information-theoretic criteria, which gives a unique stopping point for the state-merging procedure. We also present a bound on the distortion of the predictive capability of the models upon reduction of the state space. The final model is a generative model for the data; however, some predictive capability is lost when the deterministic algebraic structure of a DFSA is removed.
The proposed state-merging technique is inspired by time-critical applications, where it is imperative to arrive at a reliable decision quickly because the dynamics of the monitored process are very fast. In such applications, there are strict constraints on accuracy as well as on the time needed to reach a decision. In this paper, we illustrate the concepts using two different datasets. We discuss in detail the example of combustion instability, a highly nonlinear and complex phenomenon that results in severe structural degradation in jet turbine engines. Surveys of the current understanding of the mechanisms of combustion instability can be found in [21, 27, 6, 11, 17]. Active combustion instability control (ACIC) with fuel modulation has proven to be an effective approach for reducing pressure oscillations in combustors [4, 3]. Based on the work available in the literature, one can conclude that the performance of ACIC is primarily limited by the large delay in the feedback loop and the limited actuator bandwidth [4, 3]. Early detection of combustion instability can potentially alleviate the delay in the ACIC feedback loop and thus improve performance. Some recent work on detection and prediction of combustion instabilities can be found in [12, 33, 25, 20, 19]. While the results in these papers are encouraging, they offer no interpretation of the changes in the data-driven model that could be observed as the operating regime of the underlying process changes. In contrast to the work reported in the literature, we present an overall picture of the changes in the underlying stochastic model structure and parameters during the complex instability phenomenon.
Contributions. This paper presents a technique for Markov modeling of time-series data using a PFSA with a nondeterministic algebraic structure. Nondeterminism is induced by merging states of a PFSA with a deterministic algebraic structure inferred from discrete sequential data, which in turn allows a very compact representation of temporal data. In contrast to the approach in [18], we present a method to use information-theoretic criteria to arrive at a consistent stopping criterion for model selection. The resulting reduced-order model has fewer parameters to estimate; this in turn leads to faster convergence rates and thus faster decisions during test (or operation). We also present a bound on the distortion of the predictive capability of the models due to state-space reduction, using the Hamming distance between the sequences generated by the original and final models. The algorithms presented in the paper are validated on two different datasets: pressure data obtained from a swirl-stabilized combustor to monitor thermoacoustic instability, and a public dataset for bearing prognostics. We show changes in the complexity of the pressure data as the process moves from stable to unstable through the transient phase, which is then used to arrive at a criterion that provides perfect class separability. Apart from the results on Markov modeling, the results on combustion instability may be of independent interest to the combustion community.
2 Background and Mathematical Preliminaries
Symbolic analysis of time-series data is a recent approach in which continuous sensor data are converted to symbol sequences via partitioning of the continuous domain [14, 24]. The dynamics of the symbol sequences are then modeled as a probabilistic finite state automaton (PFSA), which is defined as follows:
Definition 2.1 (PFSA)
A probabilistic finite state automaton (PFSA) is a tuple (Q, Σ, δ, M), where

Q is a finite set of states of the automaton;

Σ is a finite alphabet of symbols;

δ : Q × Σ → Q is the state transition function;

M is the |Q| × |Σ| emission matrix. The matrix M is row stochastic, and M(q, σ) is the probability of generating symbol σ from state q.
Remark 2.1
The PFSA defined above has a deterministic algebraic structure governed by the transition function δ; thus, a symbol emission from a particular state leads to a fixed next state. The symbol emissions themselves, however, are probabilistic (represented by the emission matrix M). In contrast, the transition function of a nondeterministic finite state automaton is a map δ : Q × Σ → 2^Q, where 2^Q denotes the power set of Q, i.e., the set of all subsets of Q. The idea is also presented in Figure 1, where we show that the same symbol can lead to multiple states, in a probabilistic fashion. This allows more flexibility in modeling at the expense of some predictive accuracy.
For symbolic analysis of time-series data, a class of PFSAs called Markov machines has been proposed [24] as a suboptimal but computationally efficient approach to encoding the dynamics of symbol sequences as finite state machines.
Definition 2.2
(Markov Machine [24, 18]) A Markov machine is a statistically stationary stochastic process (modeled by a PFSA in which each state is represented by a finite history of D symbols), where the probability of occurrence of a new symbol depends only on the last D symbols, i.e.,

Pr(s_k | s_{k-1}, s_{k-2}, ...) = Pr(s_k | s_{k-1}, ..., s_{k-D}),

where D is called the depth of the Markov machine.
A Markov machine is thus a D-th order Markov approximation of the discrete symbolic process. For most stable and controlled engineering systems, which tend to forget their initial conditions, a finite-length memory assumption is reasonable. The Markov machine is represented as a PFSA whose states are words over the alphabet Σ of length D (or less); the state transitions are described by a sliding block code with memory D and anticipation length of one [15].
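As an illustration of the sliding-block construction, the following sketch estimates the states and emission matrix of a D-Markov machine from a symbol sequence; the function name and interface are our own, not part of any cited implementation.

```python
import numpy as np
from collections import defaultdict

def d_markov_machine(symbols, alphabet, D):
    """Estimate the states and emission matrix of a D-Markov machine:
    states are words of length D, and M[i, j] is the probability of
    emitting alphabet[j] from the i-th state (sliding-block code)."""
    counts = defaultdict(lambda: defaultdict(int))
    for k in range(D, len(symbols)):
        state = tuple(symbols[k - D:k])     # last D symbols
        counts[state][symbols[k]] += 1
    states = sorted(counts)
    M = np.zeros((len(states), len(alphabet)))
    for i, s in enumerate(states):
        for j, a in enumerate(alphabet):
            M[i, j] = counts[s][a]
        M[i] /= M[i].sum()
    return states, M
```

For the alternating sequence 0, 1, 0, 1, ... with D = 1, the two states (0) and (1) each emit the opposite symbol with probability one.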
For systems with fading memory it is expected that the predictive influence of a symbol progressively diminishes. In this context, depth is defined as follows.
Definition 2.3 (Depth)
Let ... s_1 s_2 ... s_k ... be the observed symbol sequence, where each s_i ∈ Σ. Then, the depth of the process generating the sequence is defined as the length D such that:

(1) Pr(s_k | s_{k-1}, ..., s_{k-D}) = Pr(s_k | s_{k-1}, ..., s_{k-D-j}) for all j > 0
An accurate estimate of the depth of the symbolic dynamical process is required for precise modeling of the underlying dynamics of the discrete sequence. Next, we introduce an information-theoretic metric that is used later for merging the states of the Markov model.
Definition 2.4 (KullbackLeibler Divergence)
[10]
The Kullback-Leibler (KL) divergence of a discrete probability distribution P from another distribution Q is defined as

D_KL(P || Q) = Σ_x P(x) log ( P(x) / Q(x) ).

It is noted that the KL divergence is not a proper distance, as it is not symmetric. However, to treat it as a distance it is generally symmetrized as d(P, Q) = D_KL(P || Q) + D_KL(Q || P), which we refer to as the KL distance between the distributions P and Q.
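The symmetrized divergence above can be computed directly; the small epsilon guard against zero-probability entries is our own implementation detail for empirically estimated distributions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) of discrete distribution p from q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0) in empirically estimated distributions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def kl_distance(p, q):
    """Symmetrized KL divergence used as the state-to-state distance."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```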
This distance is used to uncover structure in the set of states of the PFSA-based Markov model, whose states are words, over the alphabet of the PFSA, of length equal to the depth estimated for the discretized sequence.
3 Technical Approach
In this section, we present the details of the proposed approach for inferring a Markov model from time-series data. As discussed earlier, the first step is the discretization of the time-series data to generate a discrete symbol sequence. While it is possible to optimize the symbolization of a time series under some optimization criterion, we do not discuss such techniques here. The data are discretized using the unbiased principle of entropy maximization of the discrete sequence, i.e., maximum entropy partitioning (MEP) [23]. The proposed approach for Markov modeling then consists of the following four steps:

Estimate the approximate size of temporal memory (or order) of the symbol sequence.

Cluster the states of the high-order Markov model.

Estimate the parameters of the reducedorder Markov model (i.e., the transition matrix).

Select the final model using information-theoretic scores (described in Section 3C).
The memory of the discrete sequence is estimated using a recently introduced method based on the spectral analysis of the depth-one Markov model induced by a PFSA [29, 13]. It is noted that these steps are followed during training to estimate an approximate model for the data; during test, the parameters are estimated only for the reduced-order model. The key ideas behind these steps are explained in the next section.
3A Estimation of the Reduced-Order Markov Model
The depth of a symbol sequence has been redefined in [29] as the number of time steps after which the probability of the current symbol is independent of any individual past symbol, i.e.:

(2) D = min { d : Pr(s_k | s_{k-j}) = Pr(s_k) for all j > d }
Note that dependence in the proposed definition (eq. 2) is evaluated on individual past symbols using Pr(s_k | s_{k-j}), as opposed to assessing dependence on words of length D using Pr(s_k | s_{k-1}, ..., s_{k-D}). It is shown in [29] that if the observed process is forward causal, then observing any additional intermediate symbols cannot induce a dependence between s_k and s_{k-j} if it did not exist at the individual level.
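A minimal empirical check of this single-symbol dependence can be sketched as follows; the deviation measure (maximum absolute difference between the conditional and marginal symbol distributions) is our own choice of test statistic, not the one used in [29].

```python
import numpy as np

def symbol_dependence(symbols, lag, alphabet):
    """Max absolute deviation between Pr(s_k | s_{k-lag}) and Pr(s_k),
    estimated empirically; near zero when the lag carries no information."""
    symbols = np.asarray(symbols)
    base = np.array([np.mean(symbols == a) for a in alphabet])
    dev = 0.0
    for j in alphabet:
        mask = symbols[:-lag] == j
        if mask.sum() == 0:
            continue   # symbol j never occurs at this lag
        cond = np.array([np.mean(symbols[lag:][mask] == a) for a in alphabet])
        dev = max(dev, float(np.max(np.abs(cond - base))))
    return dev
```

For a perfectly alternating binary sequence the deviation at lag one is large, while it vanishes for a constant (or memoryless) sequence.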
Let Π be the one-step transition probability matrix of the PFSA constructed from this symbol sequence, i.e.,

(3) Π(i, j) = Pr(s_{k+1} = σ_j | s_k = σ_i), σ_i, σ_j ∈ Σ
Then, using the distance of the transition matrix after D steps from its stationary limit, the depth can be defined as a length D such that

(4) Σ_{i=1}^{K} |λ_i|^D < ε,

where the λ_i are the non-unity eigenvalues of Π, K is the number of such nonzero eigenvalues, and ε > 0 is a specified tolerance.
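The spectral criterion can be sketched as follows; the default tolerance, the cutoffs for discarding the unit and numerically zero eigenvalues, and the cap on the search are our own assumptions.

```python
import numpy as np

def estimate_depth(Pi, eps=0.05, max_depth=100):
    """Smallest d such that sum_i |lambda_i|^d < eps over the non-unity,
    nonzero eigenvalues of the one-step transition matrix Pi."""
    lam = np.abs(np.linalg.eigvals(np.asarray(Pi, dtype=float)))
    # Discard the unit eigenvalue (stationary mode) and numerically zero ones.
    lam = lam[(lam < 1.0 - 1e-9) & (lam > 1e-9)]
    for d in range(1, max_depth + 1):
        if np.sum(lam ** d) < eps:
            return d
    return max_depth
```

For a rapidly mixing chain the surviving eigenvalues are small and the estimated depth is short; slow eigenvalue decay yields a longer memory, consistent with the discussion above.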
Thus, the depth of the symbol sequence is estimated for a chosen tolerance ε by estimating the stochastic matrix of the one-step PFSA. Next, another pass over the data is made to estimate the parameters of the PFSA whose states are words over Σ of length D; this step is critical for modeling accuracy. The states of the reduced-order Markov model are then estimated by partitioning the set of words over Σ of length D obtained in the previous step. This is done using an agglomerative hierarchical clustering approach, which has the added advantage of helping visualize the structure of the original state set under an appropriate metric. Agglomerative hierarchical clustering is a bottom-up approach [34] that generates a sparse network (e.g., a binary tree) over the state set Q by successive addition of edges between the elements of Q. Initially, each state is in its own cluster, and these singletons form the initial set of clusters of the hierarchical cluster tree. The distance between any two states in Q is measured by the KL distance between the symbol emission probabilities conditioned on them, i.e.,
(5) d(q, q') = D_KL(Pr(· | q) || Pr(· | q')) + D_KL(Pr(· | q') || Pr(· | q)),

where Pr(· | q) denotes the symbol emission distribution conditioned on state q, i.e., the row of the emission matrix corresponding to q.
In terms of the distance measured by eq. (5), the pair of clusters that are nearest to each other are merged, and this step is repeated until only one cluster is left. The tree structure displays the order of splits in the state set of the higher-order Markov model and is used to aggregate states that are close to each other. To clarify the presentation, we show an example of a Markov chain on a simplex plane in Figure 2, where each red pentagon on the simplex represents one row of the symbol emission matrix. Hierarchical clustering is used to find the structure of the state set on the simplex plane using the KL distance. The set of states clustered together can then be read off the tree for any desired number of states in the final Markov model.
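The agglomerative merging described above can be sketched with a naive single-linkage implementation over the rows of the emission matrix; a real implementation would use a library clustering routine, and the linkage choice here is our own assumption.

```python
import numpy as np

def kl_dist(p, q, eps=1e-12):
    """Symmetrized KL divergence between two emission distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def agglomerate(M, n_final):
    """Merge rows of the emission matrix M (one row per state) until
    n_final clusters remain, using single linkage on the KL distance."""
    clusters = [[i] for i in range(len(M))]
    while len(clusters) > n_final:
        best, pair = float("inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(kl_dist(M[i], M[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters
```

States whose emission rows are statistically similar are merged first, mirroring the order of splits displayed by the cluster tree.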
The overall algorithm is presented as pseudocode in Algorithm 1. This algorithm is used to find the parameters of the models during training; the parameters during test are estimated using the clustering map, as discussed in the next section. In later sections, we show how an information-theoretic criterion can be used to terminate the state-merging algorithm, i.e., to select a final model from the set of reduced-order models. Through numerical experiments on two different datasets, we also illustrate the main motivation of this work: although the right memory is required for accurate modeling of the symbolic process, the state space need not consist of all words of that memory length, and sufficiently high predictive accuracy can be achieved even with a smaller state space. This trade-off between model complexity and predictive accuracy is achieved using the information-theoretic criteria.
3B Parameter Estimation of the Reduced-Order Markov Model
The parameters of the Markov model obtained after clustering the states of the original PFSA are computed from the parameters of the original PFSA using a Bayesian inference technique. In this approach, the state transition matrix, the emission matrix M, and the state probability vector of the original PFSA model are available, along with the deterministic assignment map that sends each state of the original model to one of the states of the reduced-order model. Since the reduced-order model is represented by its state set together with a state transition matrix, we employ Bayesian inference to infer the individual transition probabilities between the merged states.

Let Q_t be the random variable denoting the state of the PFSA model at some time step t, and let S_t denote the symbol emitted from that state; this probabilistic emission process is governed by the emission matrix M. The state of the reduced-order model is obtained from a deterministic mapping of the state of the PFSA model; thus, the state of this model is also a random variable, denoted by Z_t. The Bayesian network representing the dependencies between these variables is shown in recursive as well as unrolled form in Figure 3. The conditional density Pr(Z_t | Q_t) can be evaluated by checking whether the state q belongs to the state cluster z, assigning the value 1 if true and 0 otherwise. Since the clusters partition the set Q, this conditional density is well-defined. Thus, it can be written as

(6) Pr(Z_t = z | Q_t = q) = 1{q ∈ z},

where 1{·} is the indicator function, equal to 1 if the element q belongs to the set z and 0 otherwise. The derivation of the reduced Markov model from the emission matrix M, the stationary probability vector, and the assignment map is shown ahead.
We can obtain from Bayes’ rule as
(11) 
Following the steps to obtain (10), we also derive
(12) 
We can obtain from Bayes’ rule as
(13) 
Note that, for these distributions, we use the stationary probability vector. Using equations (10), (11), (12), and (13) together, one can obtain the desired state transition matrix of the reduced-order model. Once the state cluster set and the state transition matrix are available, the reduced-order model is completely defined.
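One consistent way to realize the aggregation described above is to average the rows of the original transition matrix within each cluster, weighted by the stationary probabilities; the sketch below follows that construction, with all names our own.

```python
import numpy as np

def reduce_markov(P, pi, clusters):
    """Aggregate a Markov chain: P is the full transition matrix, pi its
    stationary distribution, and clusters a partition of the state indices.
    Each reduced row is the stationary-weighted average over its cluster."""
    P, pi = np.asarray(P, float), np.asarray(pi, float)
    n = len(clusters)
    Pr = np.zeros((n, n))
    for a, za in enumerate(clusters):
        w = pi[za] / pi[za].sum()            # Pr(q | z) within cluster za
        for b, zb in enumerate(clusters):
            # Probability of jumping from cluster za into any state of zb.
            Pr[a, b] = w @ P[np.ix_(za, zb)].sum(axis=1)
    return Pr
```

Each row of the reduced matrix sums to one by construction, since every row of the full model does.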
3C Model Selection Using Information-Theoretic Criteria
In this section, we describe the model selection process used during the underlying state-merging procedure. We compute "penalized" likelihood estimates for the different models; the model with the lowest score is then selected as the optimal model.
The (unpenalized) loglikelihood of a symbol sequence given a Markov model is computed as follows:
(14) 
where the effects of the initial state are ignored because they become negligible for long, statistically stationary symbol sequences. It is noted that with a finite symbol sequence, the log-likelihood is always finite. Furthermore, with the Markov models considered in this paper, the sum simplifies to the following form.
(15) 
As discussed earlier, the states are merged using hierarchical clustering; thus, for every desired number of final states, we obtain the deterministic map that determines how the original states are partitioned. This map is known for every terminal number of states, and thus we can find the log-likelihood of the symbol sequence using the following relationship.
(16) 
where z denotes a state of the reduced model and q a state of the original full-order model.
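The log-likelihood computation above can be sketched as follows, assuming a dictionary mapping each word of length D to its cluster index and a reduced emission matrix; these names and the interface are our own.

```python
import numpy as np

def log_likelihood(symbols, D, cluster_of, M_reduced, sym_index):
    """Log-likelihood of a symbol sequence under the reduced model: map
    each length-D word to its cluster and accumulate the log emission
    probability of the following symbol."""
    ll = 0.0
    for k in range(D, len(symbols)):
        z = cluster_of[tuple(symbols[k - D:k])]
        ll += np.log(M_reduced[z, sym_index[symbols[k]]])
    return ll
```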
In the next step of the model selection process, a "complexity penalty" is added to the log-likelihood estimates, thereby balancing goodness of fit against the complexity of the model (and hence guarding against overfitting). We apply two widely used model selection functions, namely the Akaike information criterion (AIC) [2] and the Bayesian information criterion (BIC) [26]:

AIC = 2k - 2 ln L, where k is the number of free parameters and L is the maximized likelihood;

BIC = k ln(n) - 2 ln L, where n is the number of observations.
The free parameters to be estimated from the data are the symbol emission probabilities of the reduced model. It is noted that this allows model selection for individual symbol sequences: the criterion provides a terminal condition for state merging, but different symbol sequences may yield different models. The model with the minimum score is selected as the best model. The results presented in the next sections illustrate that most of the temporal and predictive capability can be preserved by models with a very small number of states compared to the original model.
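A minimal selection loop over candidate reduced models follows; each candidate is summarized by its number of free parameters and its log-likelihood, and the standard AIC and BIC formulas are applied.

```python
import numpy as np

def select_model(candidates, n_obs, criterion="BIC"):
    """candidates: list of (num_free_params, log_likelihood) pairs, one per
    reduced-order model. Returns the index of the minimum-score model."""
    scores = []
    for k, ll in candidates:
        if criterion == "AIC":
            scores.append(2 * k - 2 * ll)
        else:  # BIC
            scores.append(k * np.log(n_obs) - 2 * ll)
    return int(np.argmin(scores))
```

A model with many fewer parameters can win even at a slightly lower likelihood, which is exactly the trade-off exploited by the state-merging procedure.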
Remark 3.1
The final Markov model is a finite-depth approximation of the original time-series data. However, compared to the PFSA-based D-Markov machines in [24, 18], the aggregated model has a nondeterministic algebraic structure, i.e., the same symbol emitted from a state can lead to different states. While this entails some loss in predictive capability compared to the models in [24, 18], it allows us to compress the size of the model as required by the application at hand. Fewer parameters must be estimated from the data, which allows faster convergence of the symbol emission probabilities and can lead to faster decisions during testing.
In the rest of the paper, we present a Hamming distance-based bound on the distortion of the predictive capabilities of the reduced models and demonstrate the utility of these models in practical problems of fault and anomaly detection from time-series data.
4 Analysis of the Proposed Algorithm
In this section, we present a bound on the distortion of the model due to the reduction of the state space of the Markov model, expressed via the Hamming distance between two symbol sequences. We first present Pinsker's inequality [10], which relates the information divergence to the variational distance between probability measures defined on arbitrary spaces. This is followed by another theorem that can be used to derive Hamming distance bounds from the informational divergence.
Theorem 4.1 (Pinsker’s inequality)
[10] Let P and Q be two probability distributions on a measurable space. Then, the following holds:

(17) δ(P, Q) ≤ sqrt( (1/2) D_KL(P || Q) ),

where δ(P, Q) = sup_A |P(A) - Q(A)| is the total variation distance.
Theorem 4.2
[16] Let Σ be a countable set, and denote by x_1^n the sequence (x_1, ..., x_n) ∈ Σ^n. Let P be a Markov measure on Σ^n, that is, P(x_1^n) = P(x_1) Π_{i=2}^{n} P(x_i | x_{i-1}). Then, for any probability measure Q on Σ^n, the following holds:

(18)

where d_n denotes the normed Hamming distance on Σ^n,

(19) d_n(x_1^n, y_1^n) = (1/n) Σ_{i=1}^{n} 1{x_i ≠ y_i},

with 1{x_i ≠ y_i} equal to 1 if x_i ≠ y_i and 0 otherwise. The distance between P and Q is

(20) d̄(P, Q) = inf E[ d_n(X_1^n, Y_1^n) ],

where the infimum is taken over all joint distributions with marginals P and Q, and E denotes the expectation operator. The above theorem provides a way to bound the Hamming distance between sequences generated by two different distributions. Thus, using this theorem, we can bound the Hamming distance between the symbol sequences generated by the reduced-order Markov model and the original model by estimating the KL distance between the measures these models induce on symbol sequences. An approximate estimate of the KL distance between the original and a reduced model can be expressed and estimated as follows.
Let the original D-Markov model be denoted by P and the reduced-order model by P̃. The Markov measure on the probability space of sequences of length n over the alphabet Σ can be estimated using the symbol emission probabilities; more explicitly, the measure of a sequence induced by the model is the probability of the initial word of length D multiplied by the subsequent symbol emission probabilities (where D is the depth of the model). Then, the KL divergence between P and P̃ is given by the following expression.
(21) D_KL(P || P̃) = Σ_{x_1^n ∈ Σ^n} P(x_1^n) log ( P(x_1^n) / P̃(x_1^n) )
The above expression can be simplified as follows,

where z is the merged state and q the original state. The expression on the right can then be further bounded using the Lipschitz constant of the logarithm function, under the assumption that the symbol emission probabilities of both models are bounded away from zero.
(22)  
(23)  
(24) 
where Δ denotes the maximum perturbation of the symbol emission probabilities. In the above inequalities, equation (23) is obtained from equation (22) by observing that the emission probabilities of a merged state differ from those of an original state q by at most the perturbation incurred when q is clustered into the new state z. Hence, the KL distance in equation (21) can be bounded by the following term.
(25) 
Thus, a uniform bound on the Hamming distance between sequences generated by the original and the final model can be obtained as follows,
(26) 
The above inequality allows us to compare models with different state spaces based on the predictive accuracy of a reduced model relative to the original model. Compared to the earlier information-theoretic criteria, which were based on the efficiency of data compression by different models, the inequality in (26) compares models based on their symbol emission statistics and is thus computationally efficient. It is possible to find a tighter bound in an expected sense by using the stationary distributions of the two Markov chains to obtain an expected bound on the Hamming distance; this is left for future work. Using the above bound for model selection can be more efficient than the information-theoretic metrics (as it can be estimated from the symbol emission probabilities instead of the penalized likelihoods); however, finding a penalized version of the bound for model selection is also left as a future exercise.
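As a rough numerical illustration of the quantities entering the bound, the sketch below computes the largest emission-probability perturbation incurred by merging, together with the normed Hamming distance of eq. (19); the exact constants of (25)-(26) are not reproduced here, and all names are our own.

```python
import numpy as np

def max_emission_perturbation(M, M_reduced, cluster_of_state):
    """Largest change in any symbol-emission probability when a state q is
    merged into its cluster; the distortion bound grows with this quantity."""
    M = np.asarray(M, float)
    return max(float(np.max(np.abs(M[q] - M_reduced[cluster_of_state[q]])))
               for q in range(M.shape[0]))

def normed_hamming(x, y):
    """Normed Hamming distance between two equal-length symbol sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)
```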
5 Description of Experimentation and Data Sets
In this section, we briefly describe the two datasets used in this paper to illustrate and validate the proposed concepts. Specifically, we describe the experiments conducted at Penn State to investigate instability in lean-premixed combustion, and a benchmark dataset for anomaly detection in bearings. An important point to note is that the numerical experiments presented in the following sections are intended to show that the reduced-order models obtained by the proposed algorithms achieve a trade-off between predictive accuracy and model complexity. The further results on classification and anomaly detection illustrate that the proposed model-learning approach still achieves good performance on the machine learning objectives of class separability and anomaly detection.
5A Combustion
A swirlstabilized, leanpremixed, laboratoryscale combustor was used to perform the experimental study. Tests were conducted at a nominal combustor pressure of 1 atm over a range of operating conditions, as listed in Table I.
Parameters  Value

Equivalence Ratio  0.525, 0.55, 0.60, 0.65
Inlet Velocity  25-50 m/s, in fixed increments
Combustor Length  25-59 inch, in 1 inch increments
In each test, the combustion chamber dynamic pressure and the global OH and CH chemiluminescence intensity were measured to study the mechanisms of combustion instability. The measurements were made simultaneously at a sampling rate of 8192 Hz (per channel), and data were collected for 8 seconds, for a total of 65536 measurements (per channel). Samples of data were collected from all the tests; in every test, the combustion process was driven from stable to unstable by changing the operating condition (e.g., the equivalence ratio). However, as an accurate model of the process is not available, an accurate label for the transition of the process to the unstable phase is not available. It is noted that the data cover the behavior of the process over a large number of operating conditions and thus provide a rich set of data to test the efficacy of the algorithm in detecting the classes irrespective of the underlying operating conditions.
5B Bearing Prognostic Data
This test data is taken from NASA's prognostics data repository [1, 30]; a detailed description of the experiments can be found in [22]. The bearing test rig hosts four test bearings on one shaft, which is driven by an AC motor at a constant speed. A constant force is applied to each of the bearings, and accelerometer data are collected at every bearing at a fixed sampling rate. The tests are carried out for days, until a significant amount of debris is found in the magnetic plug of the test bearing. A defect in at least one of the bearings is found at the end of every test. In this paper, we use the data from a bearing that shows anomalous behavior in the later parts of a test. In particular, out of the three data sets, we use set one, in which an inner race fault occurred on one of the bearings; our analysis uses the data from that bearing.
6 Markov Modeling
In this section, we present results for the modeling and analysis of the time-series data described above.
6A Combustion
The time-series data are first normalized by subtracting the mean and dividing by the standard deviation; this step corresponds to bias removal and variance normalization. Data from engineering systems are typically oversampled to ensure that the underlying dynamics can be captured, as was the case in the current experiments. Due to the coarse-graining of the symbolization process, an oversampled time series may mask the true nature of the system dynamics in the symbolic domain (e.g., through self-loops and spurious transitions in the Markov chain). The time series is therefore first downsampled: the lag of the first minimum of the autocorrelation function of the observed time series is used to identify temporally uncorrelated samples, and the data are downsampled by this lag. The autocorrelation function for the time-series data in the unstable case is shown in Figure 4, where the downsampling lag is marked by a red rectangle. To avoid discarding a significant amount of data in downsampling, the downsampled sequences obtained from different initial offsets are concatenated. Further details of this preprocessing can be found in [29]. The continuous time-series data are then partitioned using maximum entropy partitioning (MEP), in which information-rich regions of the data are partitioned finely and sparse regions coarsely. In essence, each cell of the partition contains (approximately) the same number of data points under MEP. A ternary alphabet (|Σ| = 3) is used to symbolize the continuous combustion instability data. As discussed in Section 5, we analyze data sets from different phases as the process moves from stable through transient to the unstable regime (the ground truth is established using the RMS values of pressure).
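The preprocessing pipeline described above (normalization, autocorrelation-based downsampling, and equal-frequency MEP binning) can be sketched as follows; the lag search window and the quantile-based realization of MEP are our own simplifications.

```python
import numpy as np

def preprocess_and_symbolize(x, n_symbols=3, max_lag=200):
    """Normalize, downsample at the first local minimum of the
    autocorrelation function, then symbolize by maximum entropy
    partitioning (approximately equal counts per cell)."""
    x = (x - np.mean(x)) / np.std(x)
    # Autocorrelation for lags 1..max_lag-1.
    acf = np.array([np.mean(x[:len(x) - l] * x[l:]) for l in range(1, max_lag)])
    lag = 1
    for l in range(1, len(acf)):
        if acf[l] > acf[l - 1]:   # lag l is the first local minimum
            lag = l
            break
    y = x[::lag]
    # MEP: cell boundaries at empirical quantiles of the data.
    edges = np.quantile(y, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return np.digitize(y, edges), lag
```

Because the partition boundaries are placed at quantiles, each symbol occurs with roughly equal frequency, which is the maximum-entropy property exploited in the text.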
In panel (a) of the corresponding figure, we show the observed changes in the behavior of the data as the combustion operating condition changes from stable to unstable: the empirical distribution of the data changes from unimodal to bimodal. We selected samples of pressure data from the stable and unstable phases to analyze and compare. First, we compare the expected size of the temporal memory during the two stages of operation. There are clear differences in the eigenvalue decay rate of the one-step stochastic matrix calculated from the data during stable and unstable behavior, irrespective of the combustor length and inlet velocity. Under stable conditions, the eigenvalues go to zero much more quickly than under the unstable operating condition (see panel (b)). This suggests that the size of the temporal memory of the discretized data increases as we move to the unstable operating condition. It also indicates that under stable operation the discretized data behave as symbolic noise: the predictive power of the Markov models remains unaffected even if the order of the model is increased. Under unstable operation, in contrast, the predictive power of the Markov models can be increased by increasing their order, indicating more deterministic behavior. A tolerance ε is chosen to estimate the depth of the Markov models for both the stable and unstable phases, and the corresponding depths are calculated for the stable and unstable conditions (see Figure 5). The corresponding depth D is used to construct the Markov models: first, a PFSA whose states are words over Σ of length D is created, and the corresponding maximum-likelihood parameters (the transition and emission matrices) are estimated. Then, the hierarchical clustering algorithm with the KL distance is used to cluster and aggregate the states.
It is noted that we create individual models for every sample, i.e., every sample is partitioned individually, so the symbols have different meanings for every sample (i.e., they represent different regions in the measurement space of the signals). Consequently, each sample has a different state space when viewed in the continuous domain. Thus, we do not show the mean behavior of the samples in any operating regime, as the state spaces would be inconsistent (even though their cardinalities could be the same).
In Figure 6, we show the hierarchical cluster tree, which details the structure of the state space of the PFSA with the estimated depth for a typical sample during stable and unstable behavior. The cluster tree also suggests the symbolic-noise behavior of the data during the stable regime (the states are very close to each other in KL distance). In contrast, a coarse clustering of states in the model of the unstable behavior would lead to significant information loss (as the states are statistically different). However, to compare the two Markov models, we keep the cardinality of the final models the same: the algorithm is terminated at the same number of states in the final Markov model for the stable as well as the unstable regime, and the final aggregated states are the three clusters depicted in Figure 6. Once the final aggregated states are obtained, we estimate the parameters of the model using the Bayesian inference discussed in section 3B.
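The state-aggregation step can be sketched as greedy agglomerative clustering under the symmetric KL distance. The single-linkage rule below is our assumption; the paper's hierarchical clustering may use a different linkage.

```python
import numpy as np
from itertools import combinations

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL distance between two emission distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def aggregate_states(emission, n_clusters):
    """Greedy agglomerative (single-linkage) clustering of PFSA states."""
    clusters = [[i] for i in range(len(emission))]
    def dist(c1, c2):
        return min(sym_kl(emission[i], emission[j]) for i in c1 for j in c2)
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

# four states forming two statistically distinct groups
emission = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05],
            [0.10, 0.10, 0.80], [0.05, 0.15, 0.80]]
clusters = aggregate_states(emission, n_clusters=2)
```

Statistically similar states end up in the same cluster, which is exactly the behavior the cluster tree in Figure 6 encodes.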
Next, we present results for model selection using the information-theoretic criteria discussed earlier in section 3C. BIC and AIC are used to select the model that achieves the minimum score. As seen in Figures (a) through (b), a model with a small number of states is selected for the stable as well as the unstable case (whereas the original models for the stable and unstable classes had much larger state spaces at their respective depths). In contrast to cross-validation, the two criteria provide an unsupervised way of selecting the model. Thus a much smaller state space suffices to preserve the temporal statistics of the data, and AIC and BIC provide a technique to select the compact model.
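The two criteria can be computed from the log-likelihood of the symbol sequence under a candidate model: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, with k free parameters and n scored observations. The sketch below scores full depth-D models on a toy sequence (the paper scores the aggregated models); the counting and smoothing details are our assumptions.

```python
import numpy as np

def markov_scores(symbols, depth, alphabet_size):
    """Log-likelihood, AIC and BIC of a depth-D Markov model of the sequence."""
    n_states = alphabet_size ** depth
    counts = np.ones((n_states, alphabet_size))  # additive smoothing
    def state(i):  # word of the last `depth` symbols, encoded as an integer
        return int(sum(symbols[i - d - 1] * alphabet_size ** d for d in range(depth)))
    for i in range(depth, len(symbols)):
        counts[state(i), symbols[i]] += 1
    P = counts / counts.sum(axis=1, keepdims=True)
    logL = float(sum(np.log(P[state(i), symbols[i]])
                     for i in range(depth, len(symbols))))
    k = n_states * (alphabet_size - 1)  # free parameters per row sum constraint
    n = len(symbols) - depth
    return logL, 2 * k - 2 * logL, k * np.log(n) - 2 * logL

rng = np.random.default_rng(0)
seq = rng.integers(0, 3, 3000)  # i.i.d. symbols: no temporal memory
logL1, aic1, bic1 = markov_scores(seq, depth=1, alphabet_size=3)
logL2, aic2, bic2 = markov_scores(seq, depth=2, alphabet_size=3)
```

For memoryless data, the deeper model gains almost no likelihood but pays a larger parameter penalty, so the criteria favor the smaller state space, consistent with the selection behavior described above.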
In Figure 8, we show the Hamming distance between the sequences generated by the original model and the reduced models for a typical sample each from stable and unstable combustion. The box plots are generated by simulating the original model and the reduced-order model from different initial states to generate symbol sequences of a fixed length, and computing the Hamming distance between them. A bound on the Hamming distance between the sequences generated by the original and final models is also calculated using inequality (26); the results are shown in Figure 8. It is possible to use the proposed Hamming distance metric to select a final model; however, it measures the distance between the distributions induced by the Markov models, and model selection using it is left as future work. It is noted that the bounds on the Hamming distance can provide a computationally convenient surrogate for model scores, as they can be obtained from the symbol emission probabilities of the model instead of explicitly evaluating the predictive capability through the likelihoods of symbol sequences.
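The box-plot computation can be sketched as follows. The two transition matrices are toy stand-ins for the original and reduced models, not values from the paper, and the simulation treats states directly as emitted symbols for brevity.

```python
import numpy as np

def simulate(P, start, length, rng):
    """Symbol sequence of given length from a row-stochastic matrix P."""
    seq, s = [], start
    for _ in range(length):
        s = rng.choice(len(P), p=P[s])
        seq.append(s)
    return np.array(seq)

def hamming(a, b):
    """Normalized Hamming distance between equal-length symbol strings."""
    return float(np.mean(a != b))

rng = np.random.default_rng(0)
P_full = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])   # "original"
P_red = np.array([[0.7, 0.15, 0.15], [0.15, 0.7, 0.15], [0.15, 0.15, 0.7]])  # "reduced"
d = np.mean([hamming(simulate(P_full, s, 200, rng),
                     simulate(P_red, s, 200, rng)) for s in range(3)])
```

Repeating the simulation from every initial state and collecting the per-pair distances yields exactly the kind of distribution summarized by the box plots in Figure 8.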
6B Bearing
The same procedure of downsampling and depth estimation described in the previous section for combustion is followed for the analysis of the bearing data. A ternary alphabet is again chosen to discretize the continuous data after downsampling, and maximum entropy partitioning is used to find the partitions. Using the spectral method, the depth and the corresponding number of states are estimated for the chosen threshold (we skip the spectral decomposition plot for brevity). The BIC and AIC scores for the different models are shown in Figure 9, and the model with five states is selected using the obtained scores (marked with a black rectangle). In Figure 10, we show the Hamming distance between the sequences generated by the original model and the reduced models, and the corresponding bounds obtained from inequality (26).
7 Classification and Anomaly Detection Results
In this section, we present results for anomaly detection and classification using the pressure time-series data to infer the underlying reduced-order Markov model. Since, as discussed in section 5A, the exact transition point of the system from stable to unstable operation is unknown, we first present results on anomaly detection and clustering of the data into different clusters, which can then be associated with the stable and unstable classes. We present two different metrics for anomaly detection that allow models with different state spaces and structures to be compared. It is noted that the word metric is used here in a loose sense; it is meant to be a distance that can be used to compare two different Markov models.
7A Anomaly Detection
As individual time-series have different state spaces, we define metrics to compare them. These metrics reflect changes in the information complexity of the Markov models and reveal different behaviors of the combustion process based on changes in the inferred data model. In particular, the following two metrics are defined.

Cluster Divergence: This measure is defined for an individual Markov model based on the cluster structure of its state space. Physically, it represents the maximum statistical difference between the states of the Markov model, measured using the KL distance. It is calculated for a particular model as follows:
(27)  M = max over state pairs (q_i, q_j) of d(q_i, q_j), where d(·,·) is the symmetric KL distance between states defined in section 3A.
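Under this definition, the measure can be computed directly from the rows of the symbol emission matrix. A minimal sketch (the additive-smoothing constant is our assumption):

```python
import numpy as np
from itertools import combinations

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL distance between two emission distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def cluster_divergence(emission):
    """Maximum pairwise symmetric KL distance between the model's states."""
    return max(sym_kl(emission[i], emission[j])
               for i, j in combinations(range(len(emission)), 2))

noisy_states = [[0.34, 0.33, 0.33], [0.33, 0.34, 0.33]]  # stable-like: near-identical
distinct_states = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]     # unstable-like: far apart
m_low = cluster_divergence(noisy_states)
m_high = cluster_divergence(distinct_states)
```

A value near zero flags symbolic-noise behavior (all states statistically alike), while a large value flags statistically distinct states.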

Discrepancy Statistics: We measure the discrepancy between the i.i.d. statistics and the Markov statistics of the discretized data. This can also be interpreted as the information gain of the Markov model, and it likewise represents the information complexity of the data. If the i.i.d. statistics and the Markov statistics are very close, then the data has no temporal statistics; an increase in this measure indicates the information gained by creating a temporal Markov model for the data. It is measured by the following equation:
(28)  D = sum over states q of P(q) d(P(σ | q), P(σ)), where P(σ | q) is the symbol emission probability conditioned on a state q of the Markov model, P(σ) is the marginal symbol emission probability, and d(·,·) is the symmetric KL distance between the two distributions.
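Assuming the discrepancy is the state-occupancy-weighted symmetric KL between the conditional and marginal emission distributions (our reading of equation (28)), it can be sketched as:

```python
import numpy as np

def discrepancy(emission, state_probs):
    """State-occupancy-weighted symmetric KL between conditional and marginal
    symbol emission distributions (zero iff emission is state-independent)."""
    emission = np.asarray(emission, dtype=float)
    state_probs = np.asarray(state_probs, dtype=float)
    marginal = state_probs @ emission  # P(symbol) = sum_q P(q) P(symbol | q)
    def skl(p, q):
        return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
    return sum(w * skl(row, marginal) for w, row in zip(state_probs, emission))

iid_like = discrepancy([[0.3, 0.3, 0.4], [0.3, 0.3, 0.4]], [0.5, 0.5])
markov_like = discrepancy([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]], [0.5, 0.5])
```

When every state emits the same distribution, the conditional and marginal statistics coincide and the measure vanishes; state-dependent emissions drive it up, signaling genuine temporal structure.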
In Figure 10, we present results showing the behavior of the cluster divergence measure with increasing pressure fluctuations. It is noted that every model has been created in an unsupervised fashion by first discretizing the data and then estimating the memory of the discrete sequence. As seen in Figure (a), three distinct behaviors can be associated with the measure. With low pressure fluctuations, the metric is very close to zero, indicating that the states of the model are statistically very similar. This persists up to a certain sample, after which there is a gradual change to a point where the measure saturates once the process becomes unstable. With this gradual trend under increasing pressure fluctuations, we associate different behaviors with the process. However, as seen in Figure (a), the transition from stable to unstable behavior is not clearly defined, and it is very difficult to label during the experiments as the process is very fast. We show the pressure signals from the three different clusters in Figure (b), where one of the samples can be seen to approach an approximate limit-cycle behavior (and thus could be loosely classified as the transient stage). An important point is that this measure is independent of the operating conditions and depends only on the stability (or instability) of the process; this metric is thus used for anomaly detection. In Figure (c), we show the statistics of the cluster divergence measure for reduced models with four states. There is some loss of information upon merging states in the unstable class, while the stable cluster remains unchanged, implying that its states are statistically similar and the model distortion upon merging is insignificant. Thus, the measure can be reliably used to detect departure from stable behavior. The statistics of the discrepancy measure for the full-state models are shown in Figure 11; the plot agrees qualitatively with the earlier results on cluster divergence. From these plots, we can infer that the Markov statistics of the stable cluster are very similar to the i.i.d. statistics: the data is very nearly independently distributed, and conditioning on the inferred states of the Markov model does not improve the predictability (or information complexity) of the temporal model. These two measures therefore help infer changes in the behavior of the data during the combustion process and are useful for anomaly detection.
To see the changes in the underlying models more explicitly, the models during the stable and unstable phases are visualized in the information space. To do this, we reduce the state space of the models to just two states and estimate the corresponding emission parameters. As the models have three symbols, the emission matrix has two rows, and each row contains the symbol emission probabilities conditioned on one of the two states. These rows, for several cases each from the stable and the unstable phases, are plotted on a single simplex plane, shown in Figure 12. The figure shows that the stable and unstable cases form separate clusters in the information space, even for models with only two states. This shows that there is a structured change in the temporal dynamics of the data between the two phases and that the inferred Markov models are able to capture this change. Furthermore, the distinctive features of the models are sufficiently retained even after significant reduction of the state space.
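The simplex visualization amounts to mapping each 3-dimensional emission row to planar (barycentric) coordinates. The two example rows below are hypothetical, chosen only to illustrate the stable (near-uniform) versus unstable (peaked) contrast described above.

```python
import numpy as np

def to_simplex_xy(p):
    """Map a 3-dim probability vector to planar coordinates on the 2-simplex
    with vertices (0, 0), (1, 0) and (0.5, sqrt(3)/2)."""
    p = np.asarray(p, dtype=float)
    return np.array([p[1] + 0.5 * p[2], (np.sqrt(3) / 2) * p[2]])

# hypothetical emission rows for the two phases
stable_row = [0.35, 0.33, 0.32]    # near-uniform: symbolic noise
unstable_row = [0.80, 0.15, 0.05]  # peaked: more deterministic emission
xy_stable = to_simplex_xy(stable_row)
xy_unstable = to_simplex_xy(unstable_row)
```

Near-uniform rows land near the centroid of the simplex and peaked rows land near a vertex, which is why the stable and unstable models separate visually in Figure 12.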
7B Classification
These models are then used to train classifiers based on support vector machines (SVM) and decision trees (DT) [5]. The rationale behind using multiple classifiers is to show that the performance of the Markov models is independent of the classification technique (i.e., it works equally well with maximum-margin classifiers and decision-tree classifiers). The SVM classifier is trained using a radial basis function kernel, while the decision tree is trained using the standard Euclidean distance. The classifiers are trained with a subset of data points from each class and are tested on the remaining data. The tests are repeated for different train and test splits of the total data. The classification accuracies are listed in Table II. Both the SVM and the DT classifiers achieve low error using models with a small number of states, which provides another way of selecting the final model for state merging in a supervised learning setting. It is noted that the original models contain
states for the stable and for the unstable class.

Number of States  Classifier  Classification Error (%)

SVM  
DT  
SVM  
DT  
SVM  
DT  
SVM  
DT  
SVM  
DT  
SVM  
DT  
SVM  
DT  
SVM  
DT 
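The classification setup can be sketched with scikit-learn. The features below are synthetic stand-ins for the model-derived feature vectors (e.g., flattened emission rows), so the resulting error values carry no experimental meaning.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# synthetic stand-ins for feature vectors of the two classes
X = np.vstack([rng.normal(0.33, 0.05, size=(50, 6)),   # "stable" models
               rng.normal(0.60, 0.05, size=(50, 6))])  # "unstable" models
y = np.array([0] * 50 + [1] * 50)

# train on alternate samples, test on the rest
svm = SVC(kernel="rbf").fit(X[::2], y[::2])
dt = DecisionTreeClassifier(random_state=0).fit(X[::2], y[::2])
svm_err = float(np.mean(svm.predict(X[1::2]) != y[1::2]))
dt_err = float(np.mean(dt.predict(X[1::2]) != y[1::2]))
```

Using both a maximum-margin and a tree-based classifier on the same features mirrors the robustness check described above: comparable errors across the two indicate that the discriminative information lives in the Markov-model features rather than in the classifier.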
8 Summary, Conclusions and Future Work
In recent times, the idea of representation learning has become very popular in the machine learning literature, as it allows the data representation used for model learning to be decoupled from end objectives such as classification or clustering. In this paper, we presented a technique for Markov modeling of time-series data using concepts of symbolic dynamics, which allows inference of the model structure as well as its parameters for compact data representation. In the proposed technique, we first estimate the memory size of the discretized time-series data using the spectral decomposition properties of the one-step Markov model created from the symbol sequence. A second pass over the data is then made to infer the model with the right memory, and the corresponding symbol emission matrix is estimated. Equivalence classes of states are then estimated by hierarchical clustering of the states of the Markov model based on the KL distance between them. The proposed concepts were validated on two different data sets: combustion instability and bearing. Modeling of combustion instability remains a puzzle in the combustion community, and the Markov modeling technique was used to analyze this problem on experimental data from a swirl-stabilized combustor used to study unstable thermoacoustic phenomena during the combustion process. The proposed approach allows us to infer the complexity of the time-series data based on the inferred Markov model. Two different metrics were proposed for anomaly detection and for classification of the stable and unstable classes. The results presented in this paper are encouraging, as the inferred models are able to identify the stable and unstable phases independently of any other operating condition.
Simultaneous optimization of discretization and memory estimation for model inference is a topic of future research. While the results obtained with Markov modeling for the combustion instability problem are inspiring, further investigation with transient data is required for better characterization of the process. A more thorough comparison of the proposed models with HMMs of similar state-space size is also an important topic of future work.
Acknowledgments
The authors would like to thank Professor Domenic Santavicca and Mr. Jihang Li of Center for Propulsion, Penn State for kindly providing the experimental data for combustion used in this work.
References
 [1] Prognostic data repository: Bearing data set, NSF I/UCRC Center for Intelligent Maintenance Systems, 2010. [Online]. Available: http://ti.arc.nasa.gov/tech/dash/pcoe/prognosticdatarepository/
 [2] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, Dec 1974.
 [3] A. Banaszuk, P. G. Mehta, and G. Hagen, “The role of control in design: From fixing problems to the design of dynamics,” Control Engineering Practice, vol. 15, no. 10, pp. 1292–1305, 2007.
 [4] A. Banaszuk, P. G. Mehta, C. A. Jacobson, and A. I. Khibnik, “Limits of achievable performance of controlled combustion processes,” Control Systems Technology, IEEE Transactions on, vol. 14, no. 5, pp. 881–895, 2006.
 [5] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
 [6] S. Candel, D. Durox, T. Schuller, J.F. Bourgouin, and J. P. Moeck, “Dynamics of swirling flames,” Annual review of fluid mechanics, vol. 46, pp. 147–173, 2014.
 [7] I. Chattopadhyay and H. Lipson, “Abductive learning of quantized stochastic processes with probabilistic finite automata,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 371, no. 1984, p. 20110543, 2013.
 [8] F. Darema, “Dynamic data driven applications systems: New capabilities for application simulations and measurements,” in 5th International Conference on Computational Science  ICCS 2005, Atlanta, GA; United States, 2005.
 [9] B. C. Geiger, T. Petrov, G. Kubin, and H. Koeppl, “Optimal kullback–leibler aggregation via information bottleneck,” Automatic Control, IEEE Transactions on, vol. 60, no. 4, pp. 1010–1022, 2015.
 [10] R. M. Gray, Entropy and information. Springer, 1990.
 [11] Y. Huang and V. Yang, “Dynamics and stability of leanpremixed swirlstabilized combustion,” Progress in Energy and Combustion Science, vol. 35, no. 4, pp. 293–364, 2009.

 [12] D. K. Jha, A. Srivastav, and A. Ray, “Temporal learning in video data using deep learning and Gaussian processes,” in Workshop on Machine Learning for Prognostics and Health Management at 2016 KDD, San Francisco, CA, 2016.
 [13] D. K. Jha, A. Srivastav, K. Mukherjee, and A. Ray, “Depth estimation in Markov models of time-series data via spectral analysis,” in American Control Conference (ACC), 2015. IEEE, 2015, pp. 5812–5817.
 [14] J. Lin, E. Keogh, L. Wei, and S. Lonardi, “Experiencing SAX: a novel symbolic representation of time series,” Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107–144, October 2007.
 [15] D. Lind and B. Marcus, An introduction to symbolic dynamics and coding. Cambridge University Press, 1995.
 [16] K. Marton, “Bounding distance by informational divergence: a method to prove measure concentration,” The Annals of Probability, vol. 24, no. 2, pp. 857–866, 1996.
 [17] J. P. Moeck, J.F. Bourgouin, D. Durox, T. Schuller, and S. Candel, “Nonlinear interaction between a precessing vortex core and acoustic oscillations in a turbulent swirling flame,” Combustion and Flame, vol. 159, no. 8, pp. 2650–2668, 2012.
 [18] K. Mukherjee and A. Ray, “State splitting and merging in probabilistic finite state automata for signal representation and analysis,” Signal Processing, vol. 104, pp. 105–119, 2014.
 [19] M. Murugesan and R. Sujith, “Combustion noise is scalefree: transition from scalefree to order at the onset of thermoacoustic instability,” Journal of Fluid Mechanics, vol. 772, pp. 225–245, 2015.
 [20] V. Nair, G. Thampi, and R. Sujith, “Intermittency route to thermoacoustic instability in turbulent combustors,” Journal of Fluid Mechanics, vol. 756, pp. 470–487, 2014.
 [21] J. O’Connor, V. Acharya, and T. Lieuwen, “Transverse combustion instabilities: Acoustic, fluid mechanic, and flame processes,” Progress in Energy and Combustion Science, vol. 49, pp. 1–39, 2015.
 [22] H. Qiu, J. Lee, J. Lin, and G. Yu, “Wavelet filterbased weak signature detection method and its application on rolling element bearing prognostics,” Journal of sound and vibration, vol. 289, no. 4, pp. 1066–1090, 2006.
 [23] V. Rajagopalan and A. Ray, “Symbolic time series analysis via waveletbased partitioning,” Signal Processing, vol. 86, no. 11, pp. 3309–3320, 2006.
 [24] A. Ray, “Symbolic dynamic analysis of complex systems for anomaly detection,” Signal Processing, vol. 84, no. 7, pp. 1115–1130, July 2004.
 [25] S. Sarkar, S. R. Chakravarthy, V. Ramanan, and A. Ray, “Dynamic datadriven prediction of instability in a swirlstabilized combustor,” International Journal of Spray and Combustion Dynamics, p. 1756827716642091, 2016.
 [26] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461–464, 03 1978.
 [27] S. Ducruix, T. Schuller, D. Durox, and S. Candel, “Combustion dynamics and instabilities: Elementary coupling and driving mechanisms,” Journal of Propulsion and Power, vol. 19, no. 5, pp. 722–734, 2003.

 [28] C. R. Shalizi and K. L. Shalizi, “Blind construction of optimal nonlinear recursive predictors for discrete sequences,” in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, ser. UAI ’04, 2004, pp. 504–511.
 [29] A. Srivastav, “Estimating the size of temporal memory for symbolic analysis of time-series data,” in American Control Conference (ACC), Portland, OR, USA, June 2014, pp. 1126–1131.
 [30] D. A. TobonMejia, K. Medjaher, N. Zerhouni, and G. Tripot, “A datadriven failure prognostics method based on mixture of gaussians hidden Markov models,” IEEE Transactions on reliability, vol. 61, no. 2, pp. 491–503, 2012.
 [31] E. Vidal, F. Thollard, C. De La Higuera, F. Casacuberta, and R. C. Carrasco, “Probabilistic finitestate machinespart i,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 7, pp. 1013–1025, 2005.
 [32] M. Vidyasagar, “A metric between probability distributions on finite sets of different cardinalities and applications to order reduction,” Automatic Control, IEEE Transactions on, vol. 57, no. 10, pp. 2464–2477, 2012.
 [33] N. Virani, D. K. Jha, and A. Ray, “Sequential hypothesis tests using Markov models of time series data,” in Workshop on Machine Learning for Prognostics and Health Management at 2016 KDD, San Francisco, CA, 2016.
 [34] R. Xu and D. Wunsch, “Survey of clustering algorithms,” Neural Networks, IEEE Transactions on, vol. 16, no. 3, pp. 645–678, 2005.
 [35] Y. Xu, S. M. Salapaka, and C. L. Beck, “Aggregation of graph models and Markov chains by deterministic annealing,” Automatic Control, IEEE Transactions on, vol. 59, no. 10, pp. 2807–2812, 2014.