# Data-Driven Learning of the Number of States in Multi-State Autoregressive Models

In this work, we consider the class of multi-state autoregressive processes that can be used to model non-stationary time-series of interest. In order to capture different autoregressive (AR) states underlying an observed time series, it is crucial to select the appropriate number of states. We propose a new model selection technique based on the Gap statistics, which uses a null reference distribution on the stable AR filters to check whether adding a new AR state significantly improves the performance of the model. To that end, we define a new distance measure between AR filters based on mean squared prediction error (MSPE), and propose an efficient method to generate random stable filters that are uniformly distributed in the coefficient space. Numerical results are provided to evaluate the performance of the proposed approach.


## I Introduction

Modeling and forecasting time series is of fundamental importance in many applications. A time series may exhibit occasional changes of behavior; examples include changes in the stock market due to a financial crisis, or variations of an EEG signal caused by a mode change in the brain. In the econometrics literature, this kind of time series is referred to as a regime-switching model [1, 2]. In regime-switching models, the time series is assumed to have multiple states, and the probability density function (pdf) of an observation conditioned on its past takes a state-specific form. The autoregressive (AR) model, one of the most commonly used techniques for modeling stationary time series [2], is usually used to describe each state: within a state, the next observation is a linear combination of its recent past plus independent and identically distributed (i.i.d.) noise with zero mean and a state-specific variance. Each state is thus characterized by a real-valued coefficient vector of a given length. A more detailed survey of this model can be found in [3]. We refer to this model as a multi-state AR model, and to the coefficient vector as the AR filter or AR coefficients of a state. A special case of the above model was first analyzed by Lindgren [4] and Baum et al. [5], and the general model is widely studied in the speech recognition literature [6].

The multi-state AR model is a general statistical model that can be used to fit data in many real-world applications. It has been shown to be capable of representing non-linear and non-stationary time series with multimodal conditional distributions and with heteroscedasticity [7]. There are two basic assumptions underlying this model: 1) the autoregression assumption, which is reasonable if the observations are obtained sequentially in time; and 2) the multi-state assumption, which is reasonable if the stochastic process exhibits different behaviors in different time epochs. For example, stock prices may undergo dramatic but not permanent changes during business cycles or financial crises, and those dynamics can be described by stochastic transitions among different states.

Despite the wide applications of the multi-state AR model, there are few results on how to estimate the number of states in a time series. Clearly, different numbers of states produce a nested family of models, and models with more states fit the observed data better. The drawback of using a complex model with many states is over-fitting, which decreases the predictive power of the model. Hence, a proper model selection procedure that identifies the appropriate number of states is vital. It is tempting to test the null hypothesis of a given number of states against the alternative of one additional state. Unfortunately, the likelihood ratio test of this hypothesis fails to satisfy the usual regularity conditions, since some parameters of the model are unidentified under the null hypothesis. An alternative is to apply the Akaike information criterion (AIC) [8] or the Bayesian information criterion (BIC) [9], which penalize model complexity in the selection procedure. However, AIC and BIC have generally been shown to be inaccurate in estimating the number of states [10].

In this paper, we propose a model selection criterion inspired by the work of Tibshirani et al. [11], who studied the clustering of i.i.d. points under Euclidean distance. The idea is to identify the number of states by comparing the goodness of fit for the observed data with its expected value under a null reference distribution. To that end, we first draw a reference curve which plots the "goodness of fit" versus the number of states based on the most non-informative distribution of the data, and which describes how much adding new AR states improves the goodness of fit. We then draw a similar curve based on the observed data. In this work we choose the "goodness of fit" measure to be the mean squared prediction error (MSPE). Finally, the point at which the gap between the two curves is maximized is chosen as the estimated number of states.
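The selection rule itself is simple once both curves are available. The following minimal sketch (function and variable names are ours, not the paper's) picks the number of states at which the gap between the reference MSPE curve and the observed MSPE curve is largest:

```python
import numpy as np

def select_num_states(observed_mspe, reference_mspe):
    """Gap-style selection: observed_mspe[k-1] is the MSPE of a fitted
    k-state model, reference_mspe[k-1] is its expected value under the
    null reference distribution. Returns the k maximizing the gap."""
    gap = np.asarray(reference_mspe, float) - np.asarray(observed_mspe, float)
    return int(np.argmax(gap)) + 1  # states are 1-indexed

# toy curves: the gap [0.10, 0.30, 0.27] is largest at k = 2
print(select_num_states([0.90, 0.60, 0.58], [1.00, 0.90, 0.85]))  # prints 2
```

In practice the two curves would come from fitting multi-state AR models to the data and from the clustering-based reference construction of Section II, respectively.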

Besides its simplicity and effectiveness, another benefit of the proposed model selection criterion is that it adapts to the underlying characteristics of the AR processes. The criterion for processes with little dependency, i.e., processes whose characteristic polynomials have roots of small magnitude, differs from the criterion for processes with large dependency. In this sense, it takes the characteristics behind the observed data into account in an unsupervised manner, even when no domain knowledge or prior information is given.

The remainder of the paper is outlined below. In Section II, we propose the Gap statistics for estimating the number of AR states in a time series. Section III formulates a specific class of the multi-state AR model in which the transitions between the states are assumed to follow a first-order Markov process. We emphasize that this parametric model is considered primarily for simplicity, and the proposed Gap statistics can be applied to general multi-state AR processes. A new initialization approach is also proposed that can effectively reduce the impact of a bad initialization on the performance of the expectation-maximization (EM) algorithm. Section IV presents numerical results evaluating the performance of the proposed approach. Experiments show that the accuracy of the proposed approach in estimating the number of AR states surpasses those of AIC and BIC.

## II Gap Statistics

This section describes our proposed criterion for selecting the number of states in a multi-state AR process, inspired by [11]. We draw a reference curve, which is the expected value of the MSPE under a null reference distribution versus the number of states, and use its difference from the MSPE of the observed data to identify the number of states. We show that computing each point of the reference curve reduces to a clustering problem in the space of AR coefficients of a fixed size, where the distance measure for clustering is derived from the increase in MSPE when a wrong model is specified. We derive the distance measure in closed form, introduce an approach to generate stable AR filters that are uniformly distributed, and apply the k-medoids algorithm to approximate the optimal solution of the clustering problem. We first outline our proposed model selection criterion in Subsection II-A, and then elaborate on the distance measure in Subsection II-B and on the generation of random AR filters in Subsection II-C.

### II-A The Model Selection Criterion

We use a superscript (n) to denote the data at time step n, and N(μ, σ²) to denote the normal distribution with mean μ and variance σ². Symbols in boldface represent vectors or matrices. We start from a simple scenario where the data is generated by a single stable AR filter ψ_A of length L, driven by i.i.d. N(0, σ_A²) noise, with 𝐱^(n) denoting the vector of the L most recent past samples. Suppose we are at time step n and we want to predict the value at time n+1. If ψ_A is used for prediction, the MSPE is σ_A². But if another AR filter ψ_B is used for prediction instead of ψ_A, the MSPE becomes E{[x^(n) + ψ_B^T 𝐱^(n)]²}. The difference of the two MSPEs is defined by

 D(ψ_A, ψ_B) = E{[x^(n) + ψ_B^T 𝐱^(n)]²} − σ_A² = E{[(ψ_A − ψ_B)^T 𝐱^(n)]²}.   (1)

It is easy to observe that D(ψ_A, ψ_B) is always nonnegative, which means that using the mismatched filter for prediction increases the MSPE. We refer to D(ψ_A, ψ_B) as the mismatch distance between the two filters ψ_A and ψ_B, though it is not a metric. When the data generated from ψ_A has zero mean, we let ψ_A also represent, with a slight abuse of notation, the filter of length L with the constant term omitted, and we use ψ_B in the same manner.
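The mismatch distance can be checked numerically. The sketch below (our own illustration for the zero-mean case, with the sign convention x_t = −ψ_A^T(x_{t−1}, …, x_{t−L}) + e_t) estimates D(ψ_A, ψ_B) by simulating a long realization of the ψ_A process:

```python
import numpy as np

def mismatch_distance_mc(psi_a, psi_b, sigma2=1.0, n=100_000, seed=0):
    """Monte Carlo estimate of D(psi_a, psi_b) = E{[(psi_a - psi_b)^T x]^2}
    for the zero-mean AR process x_t = -psi_a^T (x_{t-1},...,x_{t-L}) + e_t."""
    rng = np.random.default_rng(seed)
    L = len(psi_a)
    x = np.zeros(n + L)
    e = rng.normal(0.0, np.sqrt(sigma2), size=n + L)
    for t in range(L, n + L):
        past = x[t - L:t][::-1]          # (x_{t-1}, ..., x_{t-L})
        x[t] = -np.dot(psi_a, past) + e[t]
    d = np.asarray(psi_a, float) - np.asarray(psi_b, float)
    burn = n // 10                        # discard transient samples
    acc = 0.0
    for t in range(L + burn, n + L):
        acc += np.dot(d, x[t - L:t][::-1]) ** 2
    return acc / (n - burn)

# AR(1) check: the stationary variance is 1/(1 - 0.25) = 4/3,
# so the exact distance is 0.3^2 * 4/3 = 0.12
print(mismatch_distance_mc([0.5], [0.2]))
```

The estimate agrees with the closed-form value obtained later from Equations (4)-(6).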

As has been mentioned in Section I, our model selection criterion is based on a reference curve that describes how much adding a new state increases the goodness of fit in the most non-informative or "worst" case. To that end, we consider a multi-state zero-mean AR process where at each time step, nature chooses random mismatch filters (with zero constants) for prediction. In such a worst-case scenario, the filters that minimize the average mismatch distances to the random filters are naturally believed to be the true data-generating filters, and that minimal value, which is the average MSPE, is plotted as the reference curve. This leads to the following clustering problem in the space of stable AR filters R_L(r), where

 R_L(r) = { [λ_1, …, λ_L]^T | z^L + Σ_{ℓ=1}^{L} λ_ℓ z^{L−ℓ} = ∏_{ℓ=1}^{L} (z − a_ℓ), λ_ℓ ∈ ℝ, |a_ℓ| < r }.

Clustering of Stable Filters: For a fixed number of states M, let ψ_1, …, ψ_T be a set of uniformly generated stable filters of a given length L. We cluster them into M disjoint clusters C_1, …, C_M, and define the within-cluster sum of distances to be

 W(M) = min_{C_1,…,C_M} Σ_{m=1}^{M} min_{ψ̃_m ∈ R_L(r)} Σ_{ψ ∈ C_m} D(ψ̃_m, ψ),   (2)

where D is defined in (1) and will be further simplified in (4), (5) and (6). By computing W(M) for M = 1, 2, …, we obtain the reference curve. The optimization problem (2) can be solved approximately by the k-medoids algorithm [12].

The model selection criterion is outlined in Table 1. We note that the bound for the roots is determined by the estimated filters, and thus the reference is data-dependent. Intuitively, if the process has less dependency, or in other words a point has less influence on its future points, the roots of the characteristic polynomials of each AR process are closer to zero and the MSPE curve will have smaller values. Thus, the filters from which the reference curve is calculated should also be drawn from a smaller bounded space.
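For completeness, here is a bare-bones alternating k-medoids pass over a precomputed distance matrix (a simplified stand-in for the algorithm of [12]; in our setting dist[i, j] would be the cost of assigning filter i to a cluster whose medoid is filter j, e.g. the mismatch distance with the medoid in the role of ψ_A):

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Cluster n items given an n-by-n distance matrix; returns the
    medoid indices and the within-cluster sum of distances W."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)   # nearest medoid
        new_medoids = medoids.copy()
        for m in range(k):
            members = np.flatnonzero(labels == m)
            if members.size:   # keep the old medoid for an empty cluster
                costs = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[m] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    within = float(dist[np.arange(n), medoids[labels]].sum())
    return medoids, within
```

Running this for M = 1, 2, … on a batch of uniformly generated stable filters, with the pairwise mismatch distances as dist, traces out one reference curve.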

### II-B Distance Measure for Autoregressive Processes

In this subsection, we provide an explicit formula for the distance in Equation (1). Assume that the data is generated by a stable filter ψ_A of length L. Let Ψ_A(z) be the characteristic polynomial of ψ_A, and let a_1, …, a_L denote the roots of Ψ_A(z), which lie inside the unit circle (|a_ℓ| < 1). Similarly define Ψ_B(z) and b_1, …, b_L for ψ_B. The value of D(ψ_A, ψ_B) in (1) can be computed using the power spectral density and Cauchy's integral theorem as:

 D(ψ_A, ψ_B) = D_0(ψ_A, ψ_B) + ( ψ_B0 − (1 + Σ_{ℓ=1}^{L} ψ_Bℓ) / (1 + Σ_{ℓ=1}^{L} ψ_Aℓ) · ψ_A0 )²   (3)

where

 D_0(ψ_A, ψ_B) = (σ_A² / 2π) ∫_{−π}^{π} |Ψ_A(e^{jω}) − Ψ_B(e^{jω})|² / |1 + Ψ_A(e^{jω})|² dω
  = σ_A² Σ_{k=1}^{L} [ ∏_{ℓ=1}^{L} (a_k − b_ℓ) / ( a_k ∏_{ℓ=1, ℓ≠k}^{L} (a_k − a_ℓ) ) ] ( ∏_{ℓ=1}^{L} (1 − a_k b_ℓ*) / ∏_{ℓ=1}^{L} (1 − a_k a_ℓ*) − 1 ),   (4)

where b_ℓ* denotes the complex conjugate of b_ℓ. The degenerate cases, e.g., repeated or zero roots of Ψ_A(z), can be handled by taking limits in (4).

###### Remark 1.

For now we assume that the process at each state has zero mean by default, unless explicitly pointed out. We use D_0 in Identity (4) instead of D in Identity (3) to compute the reference curve. The derived reference curve can nevertheless be applied to the general case. The reason is that it is more difficult to detect two AR states with the same mean than two states with different means; therefore, the reference curve for the zero-mean case (the "worst" case) can be used in general.

The distance measure defined in Equation (4) is proportional to σ_A². We take the noise variance to be the same for all reference filters, so that it contributes only a constant factor in the computation of W(M) in (2). Since this factor is the same for different numbers of states, we set σ_A² = 1 without loss of generality.

The distance between two AR filters can be explicitly expressed in terms of the coefficients. This is computationally desirable if the filters are random samples generated in the coefficient domain, as will be discussed in Subsection II-C.

Notations: Consider two polynomials P(z) and Q(z) of nonnegative powers with given degrees. Let P̃(z) denote the reciprocal polynomial of P(z), and let (PQ)(z) denote the product of P(z) and Q(z). Let Res(P, Q) be the resultant of P(z) and Q(z). Define the power sums s_h = Σ_k a_k^h and t_h = Σ_k b_k^h, where a_k and b_k are the roots of Ψ_A(z) and Ψ_B(z), respectively.

###### Lemma 1.

The values of Res(P, Q) and of the power sums s_h and t_h can be computed as polynomials in the coefficients of P(z) and Q(z).

The proof follows from the fact that the resultant of P(z) and Q(z) is given by the determinant of their associated Sylvester matrix [13], and that for any h, the power sum s_h can be computed as a polynomial in the coefficients of P(z) via Newton's identities. We further provide the following result.
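As an illustration of the Sylvester-matrix route (our own sketch, not code from the paper), the resultant can be computed directly from the coefficient arrays:

```python
import numpy as np

def sylvester(p, q):
    """Sylvester matrix of polynomials p and q, given as coefficient
    lists in decreasing powers (p has degree len(p) - 1)."""
    m, n = len(p) - 1, len(q) - 1
    S = np.zeros((m + n, m + n))
    for i in range(n):            # n shifted copies of p
        S[i, i:i + m + 1] = p
    for i in range(m):            # m shifted copies of q
        S[n + i, i:i + n + 1] = q
    return S

def resultant(p, q):
    """Res(p, q) = det of the Sylvester matrix; it is a polynomial in
    the coefficients and vanishes iff p and q share a root."""
    return float(np.linalg.det(sylvester(p, q)))

# (z^2 - 1) and (z - 2): Res = q(1) * q(-1) = (-1) * (-3) = 3
print(round(resultant([1, 0, -1], [1, -2])))  # prints 3
```

This makes Lemma 1 concrete: the determinant is an explicit polynomial in the coefficients, with no root-finding required.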

###### Lemma 2.

The value of D_0(ψ_A, ψ_B) in Equation (4) (with σ_A² = 1) can be computed in terms of the coefficients of ψ_A and ψ_B as in Equation (5), where the function f_h of the power sums s_1, …, s_h and t_1, …, t_h is defined as

 f_h = (1/h!) det ⎛ s_1−t_1       1          0       ⋯       0
                   s_2−t_2     s_1−t_1       2       ⋱       ⋮
                     ⋮            ⋮          ⋱       ⋱       0
                   s_{h−1}−t_{h−1}  ⋯        ⋱    s_1−t_1   h−1
                   s_h−t_h   s_{h−1}−t_{h−1} ⋯    s_2−t_2  s_1−t_1 ⎞

for h ≥ 1, with the conventions f_0 = 1 and f_h = 0 for h < 0, where det(·) denotes the determinant of a square matrix.

Another simple way to compute the distance measure is given by the following lemma.

###### Lemma 3.

Let ψ_A be the true filter of an autoregression with zero mean and unit noise variance. The variance γ_0 = E[(x^(n))²], the correlations ρ_j = E[x^(n) x^(n−j)] / γ_0 (j = 1, …, L), and the covariance matrix Γ = γ_0 [ρ_{|i−j|}]_{i,j=1}^{L} (with ρ_0 = 1) of the autoregression are defined accordingly. Define the L×L matrix Φ with entries Φ_{ji} = ψ_{A,j−i} + ψ_{A,j+i}, where ψ_{A,0} = 1, and ψ_{A,ℓ} = 0 if ℓ < 0 or ℓ > L. Then ρ = (ρ_1, …, ρ_L)^T and γ_0 can be computed by

 ρ = −Φ^{−1} ψ_A,   γ_0 = (1 + ρ^T ψ_A)^{−1},

where Γ is determined by γ_0 and ρ. The value of D_0(ψ_A, ψ_B) in terms of ψ_A and ψ_B can then be computed by

 D_0(ψ_A, ψ_B) = (ψ_A − ψ_B)^T Γ (ψ_A − ψ_B).   (6)
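Lemma 3 is straightforward to implement: solve the Yule-Walker equations for the autocovariances of the ψ_A process, assemble Γ, and evaluate the quadratic form. A sketch (our own code) under the sign convention x_t = −ψ_A^T(x_{t−1}, …, x_{t−L}) + e_t with unit noise variance:

```python
import numpy as np

def ar_autocov(psi, sigma2=1.0):
    """Autocovariances gamma_0..gamma_{L-1} of the stable AR process
    x_t = -psi^T (x_{t-1},...,x_{t-L}) + e_t, obtained by solving the
    Yule-Walker linear system in the unknowns gamma_0..gamma_L."""
    a = -np.asarray(psi, dtype=float)     # conventional AR coefficients
    L = len(a)
    A = np.zeros((L + 1, L + 1))
    b = np.zeros(L + 1)
    A[0, 0] = 1.0                         # gamma_0 - sum_l a_l gamma_l = sigma2
    for l in range(1, L + 1):
        A[0, l] -= a[l - 1]
    b[0] = sigma2
    for j in range(1, L + 1):             # gamma_j - sum_l a_l gamma_{|j-l|} = 0
        A[j, j] += 1.0
        for l in range(1, L + 1):
            A[j, abs(j - l)] -= a[l - 1]
    gamma = np.linalg.solve(A, b)
    return gamma[:L]

def d0(psi_a, psi_b, sigma2=1.0):
    """D_0(psi_a, psi_b) = (psi_a - psi_b)^T Gamma (psi_a - psi_b), Eq. (6)."""
    g = ar_autocov(psi_a, sigma2)
    L = len(psi_a)
    Gamma = np.array([[g[abs(i - j)] for j in range(L)] for i in range(L)])
    d = np.asarray(psi_a, float) - np.asarray(psi_b, float)
    return float(d @ Gamma @ d)

# AR(1): gamma_0 = 1/(1 - 0.25) = 4/3, so D_0 = 0.3^2 * 4/3 = 0.12
print(d0([0.5], [0.2]))
```

The value matches the Monte Carlo estimate of the mismatch distance given earlier for the same pair of filters.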

### II-C Generating Uniformly Distributed Filters with Bounded Roots

As mentioned before, the Gap statistics requires a reference curve that is calculated by clustering filters randomly chosen from a reference distribution. In some scenarios we need to generate sample filters from R_L(r), where the root bound r is calculated from the observed data. Inspired by the work of Beadle and Djurić [14], we provide the following result on how to generate a random point in R_L(r) with uniform distribution.

###### Lemma 4.

Generation of an independent uniform sample of R_L(r) can be achieved by the following procedure:
1. Draw λ_{1,1} uniformly on the interval (−r, r);
2. For k = 2, …, L, suppose that we have obtained [λ_{1,k−1}, …, λ_{k−1,k−1}]^T that is uniformly distributed in R_{k−1}(r). Draw λ_{k,k} independently from a pdf proportional to the following function on the interval (−r_k, r_k), where r_k = r^k:

 (1 + λ_{k,k}/r_k)^{⌊k/2⌋} (1 − λ_{k,k}/r_k)^{⌊(k−1)/2⌋},   (7)

and update the remaining coordinates by

 λ_{i,k} = λ_{i,k−1} + λ_{k,k} λ_{k−i,k−1} / r^{2(k−i)}   (i = 1, …, k−1).   (8)
###### Proof.

We prove by induction. The pdf of λ_{1,1} is proportional to one. For k ≥ 2, suppose that the pdf of [λ_{1,k−1}, …, λ_{k−1,k−1}]^T is proportional to one inside R_{k−1}(r) and zero elsewhere, and that λ_{1,k}, …, λ_{k−1,k} are determined by (8). The Levinson-Durbin recursion in (8) automatically enforces the stability constraint that [λ_{1,k}, …, λ_{k,k}]^T falls inside R_k(r). The pdf of [λ_{1,k}, …, λ_{k,k}]^T can be computed as

 p(λ_{1,k}, …, λ_{k,k}) = p(λ_{k,k}) p(λ_{1,k}, …, λ_{k−1,k} | λ_{k,k})
  = p(λ_{k,k}) p(λ_{1,k−1}, …, λ_{k−1,k−1}) |J_k|^{−1}
  ∝ p(λ_{k,k}) (1 + λ_{k,k}/r_k)^{−⌊k/2⌋} (1 − λ_{k,k}/r_k)^{−⌊(k−1)/2⌋},

where J_k is the Jacobian of the map from (λ_{1,k−1}, …, λ_{k−1,k−1}) to (λ_{1,k}, …, λ_{k−1,k}) with λ_{k,k} taken as given. Therefore, if p(λ_{k,k}) is proportional to the value given by (7), the joint pdf of [λ_{1,k}, …, λ_{k,k}]^T is proportional to one in R_k(r) and zero elsewhere. ∎

###### Remark 2.

The technique presented in Lemma 4 can be equivalently formulated in a simpler way, summarized in the following lemma. The procedure is also described in Algorithm 3.

###### Lemma 5.

A sample of [λ_1, …, λ_L]^T that is uniformly distributed in R_L(r) can be generated by the recursion of (8), where the coefficients λ_{k,k} are independently generated from the densities in (7).
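For the unit root bound r = 1, the procedure reduces to drawing each reflection-type coefficient λ_{k,k} from a rescaled Beta distribution and applying a Levinson-style update. The sketch below (our own implementation of this special case) produces filters whose characteristic polynomial z^L + Σ λ_ℓ z^{L−ℓ} has all roots inside the unit circle:

```python
import numpy as np

def random_stable_filter(L, seed=None):
    """One draw, uniform over the coefficient region R_L(1).
    A density proportional to (1+x)^{floor(k/2)} (1-x)^{floor((k-1)/2)}
    on (-1, 1) is a Beta(floor(k/2)+1, floor((k-1)/2)+1) law rescaled
    from (0, 1) to (-1, 1)."""
    rng = np.random.default_rng(seed)
    coef = np.zeros(0)
    for k in range(1, L + 1):
        u = rng.beta(k // 2 + 1, (k - 1) // 2 + 1)
        lam = 2.0 * u - 1.0                    # lambda_{k,k} in (-1, 1)
        nxt = np.empty(k)
        nxt[:k - 1] = coef + lam * coef[::-1]  # Levinson-style update, cf. (8)
        nxt[k - 1] = lam
        coef = nxt
    return coef

# sanity check: every generated characteristic polynomial is stable
for s in range(50):
    c = random_stable_filter(4, seed=s)
    roots = np.roots(np.concatenate(([1.0], c)))
    assert np.max(np.abs(roots)) < 1.0
print("all stable")
```

Since |λ_{k,k}| < 1 at every step, stability follows from the Schur-Cohn criterion; for a general bound r the coefficients would additionally be rescaled as in (7)-(8).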

Fig. 1 illustrates filters randomly generated from the stability region for a given root bound, together with the centers of a two-clustering obtained using Algorithm 2. These centers are calculated based on the average of 20 random instances, each with 1000 samples. Fig. 2 shows the corresponding reference curves.

## III Model

A popular way to describe the switching behavior between different states is to assume that the transitions between the states follow a first-order Markov process. In this section, we adopt this assumption to formulate a parametric multi-state AR model for illustration purposes, even though the model selection criterion proposed in Section II is applicable to other multi-state AR models.

### III-A Notations and Formulations

Let S_m denote the set of data points that are generated from state m. Suppose that the number of states M and the filter length L are fixed and known. Let z^(n) and y^(n) be sequences of missing (unobserved) indicators, where z^(n) is an M×M matrix, y^(n) is a vector of length M, and

 z^(n)_{mm′} = 1 if x^(n−1) ∈ S_m and x^(n) ∈ S_{m′}, and 0 otherwise;
 y^(n)_m = 1 if x^(n) ∈ S_m, and 0 otherwise.

Clearly, y^(n) is determined by z^(n). We note that y^(n) is a binary vector of length M containing a unique "1"; with a slight abuse of notation, y^(n) also denotes the location of that "1". We assume that {y^(n)} is a Markov chain with transition probability matrix T, where T_{mm′} = P(y^(n) = m′ | y^(n−1) = m), and y^(1) is drawn from M(α_1, …, α_M), where M(·) denotes the family of multinomial distributions. In other words, the assumed data generating process (given a fixed x^(1)) is:

 y^(n) ∼ M(α_1, …, α_M) if n = 1, and y^(n) ∼ M(T_{y^(n−1),1}, …, T_{y^(n−1),M}) otherwise,   (9)
 x^(n) ∼ N(−γ_{y^(n)}^T 𝐱^(n), σ²_{y^(n)}),   n = 2, …, N.   (10)

Let Θ be the set of unknown parameters to be estimated, where each γ_m is of length L+1 (including the constant term). Though computing the maximum-likelihood estimate (MLE) of the above probabilistic model (9)-(10) is not tractable, it can be approximated by a local maximum via the EM algorithm [15]. The EM algorithm produces a sequence of estimates by the recursive application of the E-step and the M-step to the complete log-likelihood until a predefined convergence criterion is met. The complete log-likelihood can be written as

 Σ_{n=1}^{N} log p(x^(n) | 𝐱^(n)) = Σ_{n=1}^{N} Σ_{m,m′=1}^{M} z^(n)_{mm′} ( log( T_{mm′} / (√(2π) σ_{m′}) ) − (x^(n) + γ_{m′}^T 𝐱^(n))² / (2σ²_{m′}) ).   (11)

For brevity, we provide the EM formulas below without derivation. In the E-step, we obtain a function of the unknown parameters by taking the expectation of (11) with respect to the missing indicators given the most recent parameter estimates,

 Q(Θ | X, Θ^old) = Σ_{n=1}^{N} Σ_{m,m′=1}^{M} w^(n)_{mm′} ( log( T_{mm′} / (√(2π) σ_{m′}) ) − (x^(n) + γ_{m′}^T 𝐱^(n))² / (2σ²_{m′}) ),   (12)

where

 w^(n)_{mm′} = E( z^(n)_{mm′} | X, Θ^old ) = P( y^(n−1) = m, y^(n) = m′ | X, Θ^old )   (13)

can be computed recursively. We note that the parameters involved in the right-hand side of (13) take their values from the last update. In the M-step, we use the coordinate ascent algorithm to obtain the following local maximum; the "old" superscripts are omitted for brevity.

 γ_m = −( Σ_{n=1}^{N} Σ_{m′=1}^{M} w^(n)_{m′m} 𝐱^(n) (𝐱^(n))^T )^{−1} ( Σ_{n=1}^{N} Σ_{m′=1}^{M} w^(n)_{m′m} 𝐱^(n) x^(n) ),   (14)
 σ²_m = Σ_{n=1}^{N} Σ_{m′=1}^{M} w^(n)_{m′m} (x^(n) + γ_m^T 𝐱^(n))² / Σ_{n=1}^{N} Σ_{m′=1}^{M} w^(n)_{m′m},   (15)
 T_{mm′} = Σ_{n=1}^{N} w^(n)_{mm′} / Σ_{m′′=1}^{M} Σ_{n=1}^{N} w^(n)_{mm′′}.   (16)
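The updates (14)-(15) are just weighted least squares per state. A sketch for a single state in the zero-mean case (variable names are ours; w holds the summed posterior weights Σ_{m′} w^(n)_{m′m} from the E-step):

```python
import numpy as np

def m_step_state(x_lag, x, w):
    """Weighted least-squares update of (gamma_m, sigma_m^2) as in
    (14)-(15). x_lag is an N-by-L matrix whose rows are the lag vectors,
    x is the length-N vector of current samples, and w the length-N
    posterior weights for this state."""
    W = np.asarray(w, float)
    A = (x_lag * W[:, None]).T @ x_lag        # weighted Gram matrix
    b = (x_lag * W[:, None]).T @ x
    gamma_m = -np.linalg.solve(A, b)          # Eq. (14)
    resid = x + x_lag @ gamma_m               # x(n) + gamma_m^T x(n)
    sigma2_m = float((W * resid**2).sum() / W.sum())   # Eq. (15)
    return gamma_m, sigma2_m

# noiseless AR(1) data x_t = 0.8 x_{t-1}: the update recovers
# gamma ~ -0.8 and sigma^2 ~ 0 under our sign convention
x_full = 0.8 ** np.arange(5)
g, s2 = m_step_state(x_full[:-1, None], x_full[1:], np.ones(4))
print(g, s2)
```

The transition update (16) needs no solver: it simply renormalizes the summed pairwise weights row by row.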

### III-B Initialization of EM

The convergence speed of the EM algorithm strongly depends on the initialization, and an improper initialization can cause it to converge to a local maximum far away from the global optimum. A routine technique is to use multiple random initializations and choose the output with the largest likelihood [16], but this can be significantly time-consuming. Here, we use a new initialization technique to obtain fast and reliable convergence of the EM algorithm. This technique is based on the fact that for time series arising in most practical areas, the self-transition probability of each state is usually close to one. By adopting this assumption, we propose the initialization method in Algorithm 4, which is shown empirically to produce more reliable and efficient EM results. We note that the "split"-style rule that appears in line 5 of Algorithm 4 has also been used elsewhere (e.g., [17]).