Data-Driven Learning of the Number of States in Multi-State Autoregressive Models

06/06/2015 ∙ by Jie Ding, et al. ∙ 0

In this work, we consider the class of multi-state autoregressive processes that can be used to model non-stationary time-series of interest. In order to capture different autoregressive (AR) states underlying an observed time series, it is crucial to select the appropriate number of states. We propose a new model selection technique based on the Gap statistics, which uses a null reference distribution on the stable AR filters to check whether adding a new AR state significantly improves the performance of the model. To that end, we define a new distance measure between AR filters based on mean squared prediction error (MSPE), and propose an efficient method to generate random stable filters that are uniformly distributed in the coefficient space. Numerical results are provided to evaluate the performance of the proposed approach.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Modeling and forecasting time series is of fundamental importance in various applications. There may be occasional changes of behavior in a time series. Some examples are the changes in the stock market due to the financial crisis, or the variations of an EEG signal caused by the mode change in the brain. In the econometrics literature, this kind of time series is referred to as regime-switching model [1, 2]. In regime switching models, the time series is assumed to have states, and if belongs to state (

), the probability density function (pdf) of

conditioning on its past is in the form of . The autoregressive (AR) model, one of the commonly used techniques to model stationary time series [2], is usually used to model each state. The autoregression of state is given by where

are independent and identically distributed (i.i.d.) noises with zero mean and variance

. Here , ,

is a real-valued vector of length

that characterizes state . A more detailed survey on this model can be found in [3]. We refer to this model as a multi-state AR model and to as the AR filter or AR coefficients of state . The above model with was first analyzed by Lindgren [4] and Baum et al. [5]. The model with general is widely studied in the speech recognition literature [6]

. The multi-state AR model is a general statistical model that can be used to fit data in many real world applications. It was shown that the model is capable of representing non-linear and non-stationary time series with multimodal conditional distributions and with heteroscedasticity


. There are two basic underlying assumptions in this model: 1. Autoregression assumption, which is reasonable if the observations are obtained sequentially in time; 2. Multi-state assumption, which is reasonable if the stochastic process exhibits different behaviors in different time epochs. For example, stock prices may have dramatic while not permanent changes in the case of business cycles or financial crises, and those dynamics can be described by stochastic transitions among different states.

Despite the wide applications of the multi-state AR model, there are few results on how to estimate the number of states

in a time series. Obviously, different values of produce a nested family of models and models with larger ’s fit the observed data better. The drawback of using complex models with a large

is the over-fitting problem which decreases the predictive power of the model. Hence, a proper model selection procedure that identifies the appropriate number of states is vital. It is tempting to test the null hypothesis that there are

states against the alternative of . Unfortunately, the likelihood ratio test of this hypothesis fails to satisfy the usual regularity conditions since some parameters of the model are unidentified under the null hypothesis. An alternative is to apply Akaike information criterion (AIC) [8] or Bayesian information criterion (BIC) [9] to introduce a penalty on the complexity of the model in the model selection procedure. However, in general AIC and BIC are shown to be inaccurate in estimating the number of states [10].

In this paper, we propose a model selection criterion inspired by the work of Tibshirani et al. [11] who studied the clustering of i.i.d. points under Euclidean distance. The idea is to identify by comparing the goodness of fit for the observed data with its expected value under a null reference distribution. To that end, we first draw a reference curve which plots the “goodness of fit” versus based on the most non-informative distributed data, and describes how much adding new AR states improves the goodness of fit. We then draw a similar curve based on the observed data. In this work we choose the “goodness of fit” measure to be the mean squared prediction error (MSPE). Finally, the point at which the gap between the two curves is maximized is chosen as the estimated .

Besides the simplicity and effectiveness, another benefit of the proposed model selection criterion is that it is adaptive to the underlying characteristics of AR processes. The criterion for the processes of little dependency, i.e., the roots of whose characteristic polynomial are small, is different from the criterion for those of large dependency. In this sense, it takes into account the characteristics behind the observed data in an unsupervised manner, even though no domain knowledge or prior information is given.

The remainder of the paper is outlined below. In Section II, we propose the Gap statistics for estimating the number of AR states in a time series. Section III

formulates a specific class of the multi-state AR model, where the transitions between the states are assumed to be a first order Markov process. We emphasize that this parametric model is considered primarily for simplicity and the proposed Gap statistics can be applied to general multi-state AR processes. A new initialization approach is also proposed that can effectively reduce the impact of a bad initialization on the performance of the expectation-maximization (EM) algorithm. Section 

IV presents some numerical results to evaluate the performance of the proposed approach. Experiments show that the accuracy of the proposed approach in estimating the number of AR states surpasses those of AIC and BIC.

Ii Gap Statistics

This section describes our proposed criterion for selecting the number of states in a multi AR process, inspired by [11]. We draw a reference curve, which is the expected value of MSPE under a null reference distribution versus , and use its difference with the MSPE of the observed data to identify the number of states, . We show that computing each point of the reference curve turns out to be a clustering problem in the space of AR coefficients of a fixed size, where the distance measure for clustering is derived from the increase in MSPE when a wrong model is specified. We derive the distance measure in closed form, introduce an approach to generate stable AR filters that are uniformly distributed, and apply the -medoids algorithm to approximate the optimal solution for the clustering problem. We first outline our proposed model selection criterion in Subsection II-A, and then elaborate on the distance measure in Subsections II-B and the generation of random AR filters in Subsections II-C.

Ii-a The Model Selection Criterion

We use superscript to represent the data at time step , and

to denote the normal distribution with mean

and variance . Symbols in bold face represent vectors or matrices. We start from a simple scenario where the data is generated using a single stable AR filter : , where , , , and are i.i.d. . Suppose we are at time step and we want to predict the value at time . If is used for prediction, the MSPE is . But if another AR filter is used for prediction instead of , i.e., , the MSPE becomes . The difference of the two MSPE is defined by


It is easy to observe that is always nonnegative, which means that using the mismatch filter for prediction increases MSPE. We refer to as the mismatch distance between two filters and , though it is not a metric. When the data generated from has zero mean, i.e., , we let also represents of length (with constant term omitted) with a slight abuse of notation, and we use in the same manner.

As has been mentioned in Section I, our model selection criterion is based on a reference curve that describes how much adding a new state increases the goodness of fit in the most non-informative or the “worst” case. To that end, we consider an -state zero mean AR process where at each time step , nature chooses random mismatch filters (with zero constants) for prediction. In such a worst scenario, the filters that minimize the average mismatch distances to the random filters are naturally believed to be the true data generating filters, and that minimal value, which is the average MSPE, is plotted as the reference curve. This leads to the following clustering problem in the space of stable AR filters , where

Clustering of Stable Filters: For a fixed , let , , , be a set of uniformly generated stable filters of a given length . We cluster into disjoint clusters , and define the within cluster sum of distances to be


where is defined in (1) and will be further simplified in (4), (5) and (6). By computing for , we obtain the reference curve. The optimization problem (2) can be solved by the -medoids algorithm [12].

The model selection criterion is outlined in Table 1. We note that the bound for the roots is determined by the estimated filters, and thus the reference is data-dependent. Intuitively, if the process has less dependency, or in other words a point has less influence on its future points, the roots of the characteristic polynomials of each AR process are closer to zero and the MSPE curve will have smaller values. Thus, the filters from which the reference curve is calculated should also be drawn from a smaller bounded space.

1:, (which is assumed to contain the “correct” number of states)
2:The number of AR states .
3:for   do
4:     Fit a multi-state AR model to the data (for instance using the EM algorithm described in Algorithm 4 )
5:     Compute the MSPE based on the estimated model.
6:end for
7:Plot , referred to as the “observed MSPE curve”
8:Compute the largest absolute value of the roots of each estimated AR filter for the case , denoted by . Let
9:for  (number of iterations)  do
10:     Run Algorithm 3 (to be introduced in Subsection II-C) to generate (e.g. ) independent and uniformly distributed stable filters from .
11:     for   do
12:          Run Algorithm 2 to approximate the optimum of (2), and obtain .
13:     end for
14:end for
15:Let Plot as the reference curve (see Fig. 2 for an example).
16:Choose to be the smallest that satisfies if there exists any; otherwise .
Algorithm 1 Model Selection Based on Gap Statistics
1:A set of stable filters , , , the number of desired clusters , a number (used for the stopping criterion).
2:The minimum within-cluster sum of distances (WCSD) and that approximate the centers.
3:Generate a matrix whose elements are pairwise distances between filters: .
4:Initialize clusters characterized by centers and associated sets of indices () that form a partition of .
5:Compute . Let (for initialization purpose).
6:while   do
7:     , .
8:     for   do
9:          Suppose that and let .
10:          while   do
11:               Consider the candidates for the new centers, , where and .
12:               For each , let if .
13:               Compute the WCSD given the new clusters: .
14:               if  then
15:                    , , , .
16:               else
17:                    .
18:               end if
19:          end while
20:     end for
21:end while
Algorithm 2 Clustering Stable AR filters via “-medoids” Algorithm

Ii-B Distance Measure for Autoregressive Processes

In this subsection, we provide the explicit formula for the distance in Equation (1). Assume that the data is generated by a stable filter of length . Let be the characteristic polynomial of , and let denote the roots of , i.e., , where lie inside the unit circle (). Similarly define for . The value in (1) can be computed using the power spectral density and Cauchy’s integral theorem as:




for , where denotes the complex conjugate of . For the degenerate cases when or , reduces to or .

Remark 1.

For now we assume that at each state has zero mean by default, unless explicitly pointed out. We use in Identity (4) instead of in Identity (3) to compute the reference curve. The derived reference curve can be applied to the general case. The reason is that it is more difficult to detect two AR states with the same mean than those that have different means. Therefore, the reference curves for the zero mean case (the “worst” case) can be used in general.

The distance measure defined in Equation (4) is proportional to . We consider which results in a constant in the computation of in (2). Since it is the same for different ’s, we set without loss of generality.

The distance between two AR filters can be explicitly expressed in terms of the coefficients. This is computationally desirable if the filters are random samples generated in the coefficient domain, as will be discussed in Subsection II-C.


Notations: Consider two polynomials of nonnegative powers and respectively of degrees and . Let respectively denote the reciprocal polynomial of , and the multiplication of and , i.e., , . Let be the resultant of and . Define and , where are the roots of .

Lemma 1.

The values of and can be computed as polynomials of the coefficients of and .

The proof follows from the fact that the resultant of and is given by the determinant of their associated Sylvester matrix [13], and that for any , can be computed as polynomials in the coefficients of via Newton’s identities. We further provide the following result.

Lemma 2.

Let , . The value of in Equation (4) (with ) can be computed in terms of the coefficients of and as in Equation (5) (on the top of the next page), where , , and the function is defined as

for , for , and for , where denotes the determinant of a square matrix.

Another simple way to compute the distance measure is given by the following lemma.

Lemma 3.

Let be the true filter of an autoregression with zero mean. The variance , the correlations , and the covariance matrix of the autoregression are respectively defined to be Define , for and , if and otherwise . Then and can be computed by

where is determined by . The value of in terms of and can be computed by


Ii-C Generating Uniformly Distributed Filters with Bounded Roots

As mentioned before, Gap statistics requires a reference curve that is calculated by clustering the filters randomly chosen from a reference distribution. In some scenarios we need to generate sample filters from , where is calculated from the observed data. Inspired by the work of Beadle and Djurić [14], we provide the following result on how to generate a random point in with uniform distribution

Lemma 4.

Generation of an independent uniform sample of can be achieved by the following procedure:
1. Draw uniformly on the interval ;
2. For , suppose that we have obtained that is uniformly distributed in . Draw independently from a pdf proportional to the following function on the interval




We prove by induction. The pdf of is proportional to one. For , suppose that the pdf of is proportional one inside and zero elsewhere. Suppose that and are determined by (8). The Levinson-Durbin recursion in (8) automatically enforces the stability constraint that falls inside . The pdf of can be computed as

where is the Jacobian from to () taking to be given. Therefore, if is proportional to the value given by (7), the joint pdf of is proportional to one in and zero elsewhere. ∎

Remark 2.

The technique presented in Lemma 4 can be equivalently formulated in a simple way summarized in the following lemma. The procedure is also described in Algorithm 3.

Lemma 5.

A sample of that is uniformly distributed in can be generated by the recursion , where and are independently generated.

3:for   do
4:     Draw

independently from the beta distribution

5:     Let and
6:end for
Algorithm 3 Generating a uniform sample within

Fig. 1 illustrates the filters randomly generated from with . The centers of a two-clustering obtained using Algorithm 2 are also shown in this figure. These centers are calculated based on the average of 20 random instances, each with 1000 samples. Fig. 2 shows the reference curves for and .

Fig. 1: 10000 independent and uniformly distributed filters of and the centers of two clusters, with .
Fig. 2: The reference curves for , , which are obtained based on (see Algorithm 1).

Iii Model

A popular way to describe the switching behavior between different states is to assume that the transition between the states follows a first-order Markov process. In this section, we adopt this assumption to formulate a parametric multi-state AR model for illustration purpose, even though the model selection criterion proposed in Section II is applicable to other multi-state AR models.

Iii-a Notations and Formulations

Let denote the set of data points that are generated from state . Suppose that are fixed and known. Let and be a sequence of missing (unobserved) indicators, where is a matrix, is a vector, and

Clearly, . We note that is a binary vector of length containing a unique “”; with a slight abuse of notation is the location of that “”. We assume that

is a Markov chain with transition probability matrix

, where , and is drawn from , where denotes the family of multinomial distributions. In other words, the assumed data generating process (given a fixed ) is:


Let be the set of unknown parameters to be estimated, where is of length (including the constant term). Though computing the maximum-likelihood estimation (MLE) of the above probabilistic model (10) is not tractable, it can be approximated by a local maximum via the EM algorithm [15]. The EM algorithm produces a sequence of estimates by the recursive application of E-step and M-step to the complete log-likelihood until a predefined convergence criterion is achieved. The complete log-likelihood can be written as


For brevity, we provide the EM formulas below without derivation. In the E-step, we obtain a function of unknown parameters by taking the expectation of (11) with respect to the missing data and given the most updated parameters,




can be computed recursively. We note that the parameters involved in the right-hand side of (13) take values from the last update. In the M-step, we use the coordinate ascent algorithm to obtain the following local maximum. The “old” superscriptions are omitted for brevity.


Iii-B Initialization of EM

The convergence speed of the EM algorithm strongly depends on the initialization and an improper initialization can cause it to converge to a local maximum which is far away from the global optimum. A routine technique is to use multiple random initializations and choose the output with the largest likelihood [16], but this can be significantly time consuming. Here, we use a new initialization technique to get a fast and reliable convergence for the EM algorithm. This technique is based on the fact that for time series obtained in most practical areas, the self-transition probability of the states is usually close to one, i.e., . By adopting this assumption, we propose the initialization method in Algorithm 4, which is shown empirically to produce more reliable and efficient EM results. We note that the “split” style rule that appears in line 5 of Algorithm 4 is used elsewhere (e.g. s[17]).

2:The initial parameters .
3:for  do
4:     for   do
5:          Estimate the AR filter and the noise variance from the sequence via the least squares method.
6:     end for
7:     Cluster into cluster using -means algorithm and obtain the centers with the corresponding noise variances . Pick up such () that maximize the sum of Euclidean distances to .
8:     if  then
9:          Let .
10:     else