# Learning the Number of Autoregressive Mixtures in Time Series Using the Gap Statistics

Using a proper model to characterize a time series is crucial for making accurate predictions. In this work we use a time-varying autoregressive (TVAR) process to describe a non-stationary time series, modeling it as a mixture of multiple stable autoregressive (AR) processes. We introduce a new model selection technique based on the Gap statistics to learn the appropriate number of AR filters needed to model a time series. We define a new distance measure between stable AR filters and draw a reference curve that measures how much adding a new AR filter improves the performance of the model, and then choose the number of AR filters that has the maximum gap with the reference curve. To that end, we propose a new method for generating uniformly distributed random stable AR filters in the root domain. Numerical results demonstrate the performance of the proposed approach.


## I Introduction

Modeling time series has been of great interest for a long time. A good model not only describes the observed data well, but also avoids over-fitting, which can reduce the predictive power of the model. Autoregressive (AR) models are among the most commonly used techniques to model stationary time series [1]. A time-varying autoregressive (TVAR) model is a generalized form of an AR model that is used to describe non-stationary time series [2]. One example of a TVAR model is the regime-switching model [1], which assumes that a non-stationary stochastic process is composed of different epochs/regimes, each of which is a stationary process, and that the regimes switch according to a Markov process. Another example is the model proposed in [3], which uses a mixture of Gaussian AR models to describe time series and an expectation-maximization (EM) algorithm to determine the parameters of the model. Wong and Li [3] used the Akaike information criterion (AIC) [4] and the Bayesian information criterion (BIC) [5] to introduce a penalty on the complexity of the model and estimate the number of AR filters. In general, AIC and BIC have been shown to be suboptimal for determining the number of modes [6].

Tibshirani et al. [7] introduced an intuitive technique for determining the appropriate number of clusters using Gap statistics. The general idea of the Gap statistics is to identify the number of clusters in a data set by comparing the goodness of fit for the observed data with its expected value under a reference distribution. In this work we extend the Gap statistics to time-series in order to identify the number of AR filters needed to describe a time-series. We use a reference curve to measure how much adding a new AR filter improves the model under reference distribution, and then choose the number of filters that has the maximum gap with that reference curve.

In order to derive a reference curve, it is important to first clarify the meaning of “goodness of fit”. In [7], the goodness of fit is measured by the sum of squared Euclidean distances of each data point from the center of the cluster it belongs to. A different measure, however, is needed for time series in order to evaluate the performance of each model. In this work, our goal is to select models that have higher predictive power, and thus we define the goodness-of-fit measure to be the mean squared prediction error (MSPE). We use the MSPE to define a new distance measure between two stable filters, in accordance with our need for a reference curve in the Gap statistics. Our proposed distance measure differs from previous distances (e.g., the cepstral distance [8], the discrete Fourier transform (DFT) [9], [10], and the discrete wavelet transform (DWT) [11]) in that it arises naturally from the MSPE of the one-step prediction.

Computing each point on the reference curve using the new distance measure turns out to be a clustering problem in the space of stable AR filters of a fixed size, which we solve using a Monte Carlo approach. To that end, we introduce an approach to generate uniformly distributed random stable filters of equal size, and apply the $k$-medoids algorithm to approximate the optimal solution of the clustering problem. Numerical simulations show that the accuracy of the proposed Gap statistics in estimating the number of AR filters surpasses that of AIC and BIC.

The remainder of the paper is organized as follows. Section II describes the model considered in this paper. In Section III we propose the Gap statistics to estimate the number of AR filters in a time-series. Section IV presents some numerical results to evaluate the performance of the proposed approach. We make our conclusions in Section V.

## II Model Assumption and Its Estimation

In order to model a given time series $x_1,\dots,x_N$, we assume that each data point depends only on the $L$ previous points, and that $L$ is known in this paper. We use a time-varying autoregressive (TVAR) model to describe the value at time step $n$ as follows:

$$x_n=\sum_{\ell=1}^{L}\phi_{n\ell}\,x_{n-\ell}+\varepsilon_n, \qquad (1)$$

where the $\phi_{n\ell}$'s are real numbers and the $\varepsilon_n$'s are independent Gaussian noise. Assume that the first $L$ points of a sample set, $x_1,\dots,x_L$, are known. The vector form of this equation can be written as

$$x_n=\boldsymbol{\phi}_n^{T}\mathbf{x}_n+\varepsilon_n, \qquad (2)$$

where $\boldsymbol{\phi}_n=(\phi_{n1},\dots,\phi_{nL})^{T}$ and $\mathbf{x}_n=(x_{n-1},\dots,x_{n-L})^{T}$. In real scenarios, $\boldsymbol{\phi}_n$ is a time-varying vector, and modeling its variations can be complicated. For simplicity, we assume that $\boldsymbol{\phi}_n\in\{\boldsymbol{\gamma}_1,\dots,\boldsymbol{\gamma}_M\}$ for $n=L+1,\dots,N$, where $M$ is the number of AR filters used to describe the series, and that each $\boldsymbol{\gamma}_m$ is a filter of size $L$. We refer to each $\boldsymbol{\gamma}_m$ as a mode and call this model the multi-mode AR model.
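As an illustration, the multi-mode AR model above can be simulated directly. The sketch below is our own (function names and parameter values are illustrative, not from the paper); it draws an i.i.d. mode label at each step and applies the corresponding filter:

```python
import numpy as np

def simulate_multimode_ar(gammas, alphas, N, sigma=1.0, seed=0):
    """Simulate x_n = gamma_m^T (x_{n-1},...,x_{n-L}) + eps_n, where the
    active mode m is drawn i.i.d. with probabilities alphas."""
    rng = np.random.default_rng(seed)
    gammas = np.asarray(gammas, dtype=float)   # shape (M, L): one row per mode
    alphas = np.asarray(alphas, dtype=float)
    M, L = gammas.shape
    x = np.zeros(N)
    x[:L] = rng.normal(0.0, sigma, L)          # arbitrary initialization
    for n in range(L, N):
        m = rng.choice(M, p=alphas)            # pick the active mode
        past = x[n-L:n][::-1]                  # (x_{n-1}, ..., x_{n-L})
        x[n] = gammas[m] @ past + rng.normal(0.0, sigma)
    return x

# two stable AR(2) modes, mixed with weights (0.6, 0.4)
x = simulate_multimode_ar([[0.5, -0.3], [-0.4, 0.2]], [0.6, 0.4], N=500)
```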

Clearly, $M$ can be seen as a parameter for a nested family of models, and a larger $M$ will fit the observed data better. But as mentioned before, the predictive power of the model drops if $M$ is too large. Hence, a model selection procedure that identifies the appropriate number of modes is important. To that end, we evaluate the MSPE for a range of values of $M$ which is assumed to contain the true number of modes. We first estimate the parameters of each of the candidate models, and then select the number of AR modes according to the Gap statistics developed in Section III. For simplicity, we further assume the model with the following log-likelihood:

$$\log p(X\mid\Theta)=\sum_{n=1}^{N}\log p(x_n\mid\mathbf{x}_n)=\sum_{n=1}^{N}\log\left(\sum_{m=1}^{M}\alpha_m\,\mathcal{N}\bigl(x_n\mid\boldsymbol{\gamma}_m^{T}\mathbf{x}_n,\sigma^2\bigr)\right), \qquad (3)$$

where $\alpha_m\ge 0$ and $\sum_{m=1}^{M}\alpha_m=1$ for any fixed $M$, and $\mathcal{N}(x\mid\mu,\sigma^2)$ denotes the density of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$ evaluated at $x$, i.e., $\mathcal{N}(x\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\bigl(-\frac{(x-\mu)^2}{2\sigma^2}\bigr)$.

Let $\Theta=\{\alpha_1,\dots,\alpha_M,\boldsymbol{\gamma}_1,\dots,\boldsymbol{\gamma}_M,\sigma^2\}$ be the set of unknown parameters to be estimated. Next, we briefly describe how to approximate the maximum-likelihood estimate (MLE) of $\Theta$. Though computing the MLE in (3) is not tractable, it can be approximated by a local maximum via the EM algorithm [12]. Let $Z=\{z_{nm}\}$ be the membership labels with $\sum_{m=1}^{M}z_{nm}=1$, where

$$z_{nm}=\begin{cases}1 & \text{if }\boldsymbol{\phi}_n=\boldsymbol{\gamma}_m,\\ 0 & \text{otherwise.}\end{cases}$$

The joint probability of $X$ and $Z$ can be written as a product of conditional probabilities as

$$p(X,Z\mid\Theta)=\prod_{n=1}^{N}\prod_{m=1}^{M}\bigl(\alpha_m\,p(x_n\mid\mathbf{x}_n)\bigr)^{z_{nm}}.$$

Thus the complete log-likelihood is

$$\log p(X,Z\mid\Theta)=\sum_{n=1}^{N}\sum_{m=1}^{M}z_{nm}\log\bigl(\alpha_m\,p(x_n\mid\mathbf{x}_n)\bigr). \qquad (4)$$

The EM algorithm produces a sequence of estimates by recursive application of the E-step and M-step until some convergence criterion is met. For brevity, we provide the EM formulas below without detailed derivations. We note that the M-step uses a coordinate ascent algorithm to find a local maximum.

E-Step: We take the expectation of (4) with respect to the missing data $Z$ given the most recent estimates of the unknown parameters, and obtain the following function (also referred to as the “Q function”):

$$Q(\Theta\mid X,Z,\Theta^{\text{old}})=\sum_{n=1}^{N}\sum_{m=1}^{M}w_{nm}\log\bigl(\alpha_m\,p(x_n\mid\mathbf{x}_n)\bigr),$$

where

$$w_{nm}=\frac{\alpha_m\,\mathcal{N}\bigl(x_n\mid\boldsymbol{\gamma}_m^{T}\mathbf{x}_n,\sigma^2\bigr)}{\sum_{m'=1}^{M}\alpha_{m'}\,\mathcal{N}\bigl(x_n\mid\boldsymbol{\gamma}_{m'}^{T}\mathbf{x}_n,\sigma^2\bigr)}.$$

We note that the parameters on the right-hand side of the equation above take values from the last update; the “old” superscripts are omitted for brevity. In other words, the E-step replaces the “missing data” $z_{nm}$ in Equation (4) by its expected value $w_{nm}$.
M-Step: Setting the derivatives of the Q function to zero leads to a coupled non-linear system that has no closed-form solution. Thus, our best hope is an approximation to the solution; for this we use the coordinate ascent algorithm to obtain a local maximum. For each $\alpha_m$, we apply the Lagrange method with the constraint $\sum_{m=1}^{M}\alpha_m=1$ to obtain

$$\alpha_m=\frac{\sum_{n=1}^{N}w_{nm}}{\sum_{m=1}^{M}\sum_{n=1}^{N}w_{nm}}=\frac{\sum_{n=1}^{N}w_{nm}}{N},\qquad m=1,\dots,M.$$

Taking the partial derivatives of the Q function with respect to $\boldsymbol{\gamma}_m$ and $\sigma^2$ and setting them to zero, we obtain the following local maximum:

$$\boldsymbol{\gamma}_m=\left(\sum_{n=1}^{N}w_{nm}\mathbf{x}_n\mathbf{x}_n^{T}\right)^{-1}\left(\sum_{n=1}^{N}w_{nm}\mathbf{x}_n x_n\right),\qquad m=1,\dots,M,$$

$$\sigma^2=\frac{1}{N}\sum_{m=1}^{M}\sum_{n=1}^{N}w_{nm}\bigl(x_n-\boldsymbol{\gamma}_m^{T}\mathbf{x}_n\bigr)^2.$$
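The E- and M-step formulas above translate directly into code. The sketch below is our own illustrative implementation (not the paper's code); for simplicity the sums run over $n=L+1,\dots,N$, since the first $L$ points serve as initial conditions:

```python
import numpy as np

def em_step(x, alphas, gammas, sigma2):
    """One EM iteration for the M-mode AR mixture.
    x: 1-D series; alphas: (M,); gammas: (M, L); sigma2: scalar."""
    alphas, gammas = np.asarray(alphas, float), np.asarray(gammas, float)
    M, L = gammas.shape
    N = len(x)
    # design matrix: row i holds (x_{n-1}, ..., x_{n-L}) for n = L + i
    X = np.column_stack([x[L-1-k:N-1-k] for k in range(L)])
    y = x[L:]
    # E-step: responsibilities w_{nm}, computed in the log domain for stability
    means = X @ gammas.T                                    # (N-L, M)
    log_w = np.log(alphas) - 0.5*np.log(2*np.pi*sigma2) \
            - (y[:, None] - means)**2 / (2*sigma2)
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: alpha_m, then weighted least squares for each gamma_m, then sigma^2
    alphas_new = w.mean(axis=0)
    gammas_new = np.stack([
        np.linalg.solve((X * w[:, [m]]).T @ X, (X * w[:, [m]]).T @ y)
        for m in range(M)])
    resid = y[:, None] - X @ gammas_new.T
    sigma2_new = float((w * resid**2).sum() / len(y))
    return alphas_new, gammas_new, sigma2_new

# demo on a toy deterministic series
x = np.sin(0.5 * np.arange(300)) + 0.1 * np.cos(1.3 * np.arange(300))
alphas, gammas, sigma2 = em_step(x, [0.5, 0.5], [[0.3, 0.1], [-0.2, 0.05]], 1.0)
```

In practice this step would be iterated until the log-likelihood stops improving, with several random restarts as described in Section IV.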

## III Gap Statistics to Determine the Number of Modes

In this work we use the Gap statistics [7] to estimate the number of AR filters in a time series. In this technique, a data set $B$ is clustered into $M$ disjoint clusters by minimizing the following within-cluster sum of distances (WCSD):

$$W_M\triangleq\min_{\mu_1,\dots,\mu_M}\left\{\sum_{m=1}^{M}\sum_{y\in C_m}d(\mu_m,y)\right\}, \qquad (5)$$

where $d(\mu_m,y)$ is the distance of $y$ from $\mu_m$, the center of the $m$th cluster. Each cluster is defined based on the centers as

$$C_m=\bigl\{y: y\in B,\ d(\mu_m,y)\le d(\mu_{m'},y)\ \ \forall m'\ne m\bigr\}. \qquad (6)$$

After computing $W_M$ for $M=1,\dots,M_{\max}$, where $M_{\max}$ is assumed to be the largest possible number of clusters, the curve of $W_M$ is standardized by comparing it to its expectation under a non-informative reference distribution, which can be chosen to be a uniform distribution. The point that has the largest difference from the reference curve is selected as the estimated number of clusters.

Let $d$ be the squared Euclidean distance (the most commonly used distance measure for clustering purposes). With this distance, $W_M$ becomes the within-cluster sum of squares. However, for clustering AR filters, Euclidean distances have been shown to be ineffective [8]. Hence, in Sec. III-A we introduce a new distance measure that is well suited for AR clustering.

### III-A Distance Measure for Autoregressive Processes

In order to find a reference curve for the Gap statistics, we derive the distance between two filters based on the MSPE. We assume that the data is generated by a stable filter $\boldsymbol{\psi}_A$ of size $L$, i.e.,

$$x_n=\boldsymbol{\psi}_A^{T}\mathbf{x}_n+\varepsilon_n,\qquad \varepsilon_n\sim\mathcal{N}(0,\sigma^2). \qquad (7)$$

For now we assume that $x_n$ has zero mean. Let $\Psi_A(z)$ be the characteristic polynomial of $\boldsymbol{\psi}_A$, and let $a_1,\dots,a_L$ denote the roots of $1-\Psi_A(z)$, i.e.,

$$1-\Psi_A(z)=\prod_{\ell=1}^{L}\bigl(1-a_\ell z^{-1}\bigr). \qquad (8)$$

If $\boldsymbol{\psi}_A$ is stable, the roots lie inside the unit circle ($|a_\ell|<1$). When we use $\boldsymbol{\psi}_A$ at time step $n$ to predict the value at time $n$, i.e., $\hat{x}_n=\boldsymbol{\psi}_A^{T}\mathbf{x}_n$, the MSPE is

$$\mathbb{E}\{(x_n-\hat{x}_n)^2\}=\sigma^2. \qquad (9)$$

Suppose instead that we use a filter other than $\boldsymbol{\psi}_A$ to predict the value at time $n$; the mis-specified filter is denoted by $\boldsymbol{\psi}_B$. Then the MSPE becomes

$$\mathbb{E}\{(x_n-\hat{x}_n)^2\}=\mathbb{E}\bigl\{\bigl((\boldsymbol{\psi}_A^{T}-\boldsymbol{\psi}_B^{T})\mathbf{x}_n\bigr)^2\bigr\}+\sigma^2. \qquad (10)$$

Motivated by Equations (9) and (10), we define the distance between filters $\boldsymbol{\psi}_A$ and $\boldsymbol{\psi}_B$ by

$$D(\boldsymbol{\psi}_A,\boldsymbol{\psi}_B)=\mathbb{E}\bigl\{\bigl((\boldsymbol{\psi}_A^{T}-\boldsymbol{\psi}_B^{T})\mathbf{x}_n\bigr)^2\bigr\}, \qquad (11)$$

which can be calculated using the power spectral density of $\{x_n\}$ as

$$D(\boldsymbol{\psi}_A,\boldsymbol{\psi}_B)=\frac{\sigma^2}{2\pi}\int_{-\pi}^{\pi}\frac{\bigl|\Psi_A(e^{-j\omega})-\Psi_B(e^{-j\omega})\bigr|^2}{\bigl|1-\Psi_A(e^{-j\omega})\bigr|^2}\,d\omega. \qquad (12)$$

Substituting $z=e^{-j\omega}$ in (12), we get

$$D(\boldsymbol{\psi}_A,\boldsymbol{\psi}_B)=\frac{\sigma^2}{2\pi j}\oint_{C}\frac{\bigl(\Psi_A(z)-\Psi_B(z)\bigr)\bigl(\Psi_A(z^{-1})-\Psi_B(z^{-1})\bigr)}{\bigl(1-\Psi_A(z)\bigr)\bigl(1-\Psi_A(z^{-1})\bigr)}\,\frac{dz}{z}, \qquad (13)$$

where $C$ is the unit circle. Using Cauchy's integral theorem, Equation (13) can be written in terms of the roots $a_\ell$ and $b_\ell$ (the roots associated with $\boldsymbol{\psi}_B$) as

$$D(\boldsymbol{\psi}_A,\boldsymbol{\psi}_B)=\sigma^2\sum_{k=1}^{L}\frac{\prod_{\ell=1}^{L}(a_k-b_\ell)}{a_k\prod_{\substack{\ell=1\\ \ell\ne k}}^{L}(a_k-a_\ell)}\left(\frac{\prod_{\ell=1}^{L}\bigl(1-a_k b_\ell^{*}\bigr)}{\prod_{\ell=1}^{L}\bigl(1-a_k a_\ell^{*}\bigr)}-1\right) \qquad (14)$$

for distinct nonzero roots, where $c^{*}$ is the conjugate of a complex number $c$; the degenerate cases (zero or repeated roots) are handled by taking the corresponding limits of (14). We note that the distance (14) is not symmetric, i.e., $D(\boldsymbol{\psi}_A,\boldsymbol{\psi}_B)\ne D(\boldsymbol{\psi}_B,\boldsymbol{\psi}_A)$ in general. The distance measure defined in (14) is proportional to $\sigma^2$, which results in a constant factor in the computation of $W_M$ for the reference curve. Since this factor is the same for different $M$, without loss of generality we can set $\sigma^2=1$.
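For intuition, and as a numerical cross-check of the closed form (14), the distance can also be evaluated directly from the integral form (12). The sketch below is our own illustration; it approximates the integral by a grid average over $[-\pi,\pi)$:

```python
import numpy as np

def ar_transfer(psi, w):
    """Psi(e^{-j w}) = sum_{l=1}^{L} psi_l e^{-j w l} for filter psi."""
    psi = np.asarray(psi, dtype=float)
    lags = np.arange(1, len(psi) + 1)
    return np.exp(-1j * np.outer(w, lags)) @ psi

def ar_distance(psi_a, psi_b, sigma2=1.0, K=8192):
    """Approximate Eq. (12): (sigma^2 / 2 pi) * integral over [-pi, pi).
    psi_a must be stable so that the integrand stays bounded."""
    w = np.linspace(-np.pi, np.pi, K, endpoint=False)
    Pa, Pb = ar_transfer(psi_a, w), ar_transfer(psi_b, w)
    integrand = np.abs(Pa - Pb)**2 / np.abs(1.0 - Pa)**2
    return sigma2 * integrand.mean()   # (1/2pi) * integral ~ grid average

# two stable AR(2) filters; note the asymmetry D(A,B) != D(B,A)
d_ab = ar_distance([0.5, -0.3], [-0.2, 0.1])
d_ba = ar_distance([-0.2, 0.1], [0.5, -0.3])
```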

### III-B Generation of Uniformly Distributed Random Filters

As mentioned before, the Gap statistics requires a reference curve that is generated by clustering data sampled from a reference distribution, which is usually chosen to be uniform. Therefore, uniform random generation of stable filters (uniform in the coefficient domain) is needed. Since the roots are used in Equation (14), we use an approach that generates samples of roots corresponding to uniform samples of coefficients. To simplify the notation, we let $\Lambda(z)$ denote the polynomial $z^L+\sum_{\ell=1}^{L}\lambda_\ell z^{L-\ell}$.

A polynomial is stable (also referred to as Schur stable) if all of its roots lie inside the unit circle. For a polynomial of order $L$, we use $c$ and $L-2c$ respectively to denote its number of pairs of complex roots and its number of real roots. Let

$$\mathcal{R}_L=\Bigl\{(\lambda_1,\dots,\lambda_L)\ \Big|\ z^L+\textstyle\sum_{\ell=1}^{L}\lambda_\ell z^{L-\ell}\ \text{is stable}\Bigr\}\subset\mathbb{R}^L$$

be the coefficient space of all stable polynomials of degree $L$, and let $\mathcal{R}_L^{(c)}\subset\mathcal{R}_L$ correspond to the polynomials that have $c$ pairs of complex roots. We call $\mathcal{R}_L^{(c)}$ a configuration of $\mathcal{R}_L$ with parameter $c$. Clearly, the $\mathcal{R}_L^{(c)}$ are bounded subspaces of $\mathbb{R}^L$ and $\mathcal{R}_L=\bigcup_{c=0}^{\lfloor L/2\rfloor}\mathcal{R}_L^{(c)}$.

In this section we propose an approach to generate uniformly distributed polynomials using their roots. We first present the following theorem, which helps us relate the distribution of the coefficients of a polynomial to that of its roots.

###### Theorem 1.

The determinant of the Jacobian matrix of the coefficients of a polynomial with respect to its roots is the Vandermonde polynomial of the roots, i.e.,

$$\det\left(\frac{\partial[\lambda_1,\dots,\lambda_L]}{\partial[a_1,\dots,a_L]}\right)=\prod_{1\le u<v\le L}(a_u-a_v). \qquad (15)$$

Furthermore, the volume of $\mathcal{R}_L^{(c)}$ is

$$\mathrm{Vol}\bigl(\mathcal{R}_L^{(c)}\bigr)=\int_{(x_1,y_1)\in\mathbb{C}_1}\!\!\cdots\int_{(x_c,y_c)\in\mathbb{C}_1}\int_{a_{2c+1}\in\mathbb{S}_1}\!\!\cdots\int_{a_L\in\mathbb{S}_1}\frac{1}{c!\,(L-2c)!}\,\Bigl|\prod_{1\le u<v\le L}(a_u-a_v)\Bigr|\,dx_1\,dy_1\cdots da_L, \qquad (16)$$

where $a_{2k-1}=x_k+jy_k$ and $a_{2k}=x_k-jy_k$ for $k=1,\dots,c$, $\mathbb{C}_1$ denotes the open unit disk, and $\mathbb{S}_1=(-1,1)$.

###### Proof.

Let $J(a_1,\dots,a_L)$ be the value of the left-hand side of Equation (15), which is a polynomial of $a_1,\dots,a_L$. For any positive integers $k$ and $u$, it is clear that

$$\frac{\partial\lambda_k}{\partial a_u}=(-1)^k\sum_{\substack{1\le\ell_1<\cdots<\ell_{k-1}\le L\\ \ell_1,\dots,\ell_{k-1}\ne u}}a_{\ell_1}\cdots a_{\ell_{k-1}}.$$

Thus $J$ changes sign under any transposition of the $a_u$'s by the properties of the determinant, i.e., $J$ is an alternating polynomial of $a_1,\dots,a_L$. This implies that $J$ is divisible by the Vandermonde polynomial $V=\prod_{1\le u<v\le L}(a_u-a_v)$ [13]. Furthermore, both $J$ and $V$ are homogeneous polynomials of degree $L(L-1)/2$ with matching leading coefficients. Therefore, we obtain $J=V$.

In order to compute the volume of $\mathcal{R}_L^{(c)}$, consider the space $\mathcal{T}=\mathbb{C}_1^{c}\times\mathbb{S}_1^{L-2c}$. Each point $\bigl((x_1,y_1),\dots,(x_c,y_c),a_{2c+1},\dots,a_L\bigr)$ in $\mathcal{T}$ corresponds to a set of roots $x_1\pm jy_1,\dots,x_c\pm jy_c,a_{2c+1},\dots,a_L$ of a stable polynomial in $\mathcal{R}_L^{(c)}$, and thus there is a $2^c\,c!\,(L-2c)!$-to-one mapping onto $\mathcal{R}_L^{(c)}$ (due to the permutation of roots). Therefore,

$$\mathrm{Vol}\bigl(\mathcal{R}_L^{(c)}\bigr)=\int_{(x_1,y_1)\in\mathbb{C}_1}\!\!\cdots\int_{a_L\in\mathbb{S}_1}\frac{1}{2^c\,c!\,(L-2c)!}\,\left|\det\left(\frac{\partial[\lambda_1,\lambda_2,\dots,\lambda_{2c+1},\dots,\lambda_L]}{\partial[x_1,y_1,\dots,a_{2c+1},\dots,a_L]}\right)\right|dx_1\,dy_1\cdots da_L. \qquad (17)$$

Since $a_{2k-1}=x_k+jy_k$ and $a_{2k}=x_k-jy_k$ for $k=1,\dots,c$ implies that

$$\det\left(\frac{\partial[a_1,a_2,\dots,a_{2c+1},\dots,a_L]}{\partial[x_1,y_1,\dots,a_{2c+1},\dots,a_L]}\right)=(-2j)^c,$$

we obtain Equation (16) by combining Equations (15) and (17). ∎

Following Theorem 1, it is not difficult to obtain the following result. We omit the proof for brevity.

###### Corollary 1.

Generating a sample uniformly from $\mathcal{R}_L$ can be done via the following three-step procedure:

1. Randomly draw

$$c\sim\text{Multinomial}\left(\frac{\mathrm{Vol}\bigl(\mathcal{R}_L^{(0)}\bigr)}{\mathrm{Vol}(\mathcal{R}_L)},\dots,\frac{\mathrm{Vol}\bigl(\mathcal{R}_L^{(\lfloor L/2\rfloor)}\bigr)}{\mathrm{Vol}(\mathcal{R}_L)}\right);$$

2. Generate a random sample $(x_1,y_1),\dots,(x_c,y_c),a_{2c+1},\dots,a_L$ from $\mathbb{C}_1^{c}\times\mathbb{S}_1^{L-2c}$ according to the (unnormalized) density

$$\prod_{1\le u<v\le L}|a_u-a_v|;$$

3. Obtain $(\lambda_1,\dots,\lambda_L)$ by computing

$$\lambda_k=(-1)^k\sum_{1\le\ell_1<\cdots<\ell_k\le L}a_{\ell_1}\cdots a_{\ell_k},\qquad k=1,\dots,L.$$

In practical implementations, the parameters of the multinomial distribution in the first step require only a one-time computation, and the second step can be realized by a sequence of one-dimensional rejection samplings.
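For small $L$, a simpler (if less scalable) way to obtain exactly uniform samples from $\mathcal{R}_L$ is rejection sampling in the coefficient domain: since all roots lie in the unit disk, the coefficients satisfy $|\lambda_k|\le\binom{L}{k}$, so one can propose uniformly in that box and keep the Schur-stable draws. The sketch below is our own illustration, not the root-domain procedure of Corollary 1:

```python
import numpy as np
from math import comb

def sample_stable_coeffs(L, n, seed=0):
    """Uniform samples from R_L via coefficient-domain rejection.
    |lambda_k| <= C(L, k) bounds the stable region, because lambda_k is
    (up to sign) the k-th elementary symmetric polynomial of roots that
    all lie in the unit disk. Acceptance decays quickly as L grows."""
    rng = np.random.default_rng(seed)
    bounds = np.array([comb(L, k) for k in range(1, L + 1)], dtype=float)
    samples = []
    while len(samples) < n:
        lam = rng.uniform(-bounds, bounds)
        # roots of z^L + lambda_1 z^{L-1} + ... + lambda_L
        roots = np.roots(np.concatenate(([1.0], lam)))
        if np.all(np.abs(roots) < 1.0):
            samples.append(lam)
    return np.array(samples)

coeffs = sample_stable_coeffs(L=3, n=50)
```

The AR filter of (1) is recovered as $\psi_\ell=-\lambda_\ell$ (multiply (8) by $z^L$). For large $L$ the acceptance probability vanishes, which is precisely why the root-domain sampler above is preferable.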

### III-C Calculating the Reference Curve

Based on the discussions above, the following procedure describes how to derive the reference curve and Gap statistics.

1. Generate $F$ uniform random stable filters $\psi_1,\dots,\psi_F$ of a given size $L$, using the technique introduced in Section III-B.

2. Suppose that $\{1,\dots,M_{\max}\}$ is the candidate set of numbers of modes. For each $M$, cluster $\psi_1,\dots,\psi_F$ into $M$ disjoint clusters by minimizing

$$W_M\triangleq\min_{\gamma_1,\dots,\gamma_M}\left\{\frac{1}{F}\sum_{m=1}^{M}\sum_{\psi\in C_m}D(\gamma_m,\psi)\right\}+1, \qquad (19)$$

where $D$ has been defined in Section III-A and the clusters $C_m$ are defined as in (6). (To make (19) consistent with the MSPE in (10), we add a “1” on the right-hand side compared with (5); it represents the normalized noise variance $\sigma^2=1$.) For this step, we first generate the matrix whose elements are the pairwise distances between the sampled filters, and then run the $k$-medoids algorithm [14] to approximate the optimum of (19).

3. Plot the reference curve, i.e., $W_M$ against $M$ for $M=1,\dots,M_{\max}$. We note that the reference curve is model independent.

4. Plot the empirical curve given the MSPE for $M=1,\dots,M_{\max}$, using the observed data, the postulated model, and the model fitting approach. For example, the postulated model in this paper is the AR mixture introduced in Section II, and the model fitting approach is the EM algorithm.

5. Finally, select the number of AR mixtures that corresponds to the largest gap between the two curves.
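Given the two curves (the reference $W_M$ from $k$-medoids and the empirical MSPE from the fitted mixtures), the final selection step reduces to an argmax. A minimal sketch with toy numbers of our own:

```python
import numpy as np

def select_num_modes(mspe_emp, w_ref):
    """Step 5: return the candidate M (indexed from 1) whose gap between
    the reference curve and the empirical MSPE curve is largest."""
    gap = np.asarray(w_ref, float) - np.asarray(mspe_emp, float)
    return int(np.argmax(gap)) + 1   # +1 because M starts at 1

# toy curves for M = 1..5: the empirical MSPE stops improving after M = 3
w_ref    = [2.00, 1.50, 1.20, 1.00, 0.90]   # reference W_M from clustering
mspe_emp = [1.90, 1.30, 0.70, 0.68, 0.67]   # MSPE of fitted mixtures
print(select_num_modes(mspe_emp, w_ref))    # -> 3
```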

Fig. 1 illustrates the sampled coefficients of stable filters randomly generated using the technique described in Section III-B. The centers of the clusters (which are approximated by some of the generated filter samples) obtained using the $k$-medoids algorithm are also shown in this figure for several values of $M$. These centers are calculated based on the average of 20 random instances. Fig. 2 shows the reference curves for several filter sizes; similar to Fig. 1, the reference curves are plotted based on 20 random instances.

## IV Numerical Results

We generated zero-mean time series using the following three models to evaluate the performance of the proposed model selection technique.

Scenario 1: 4 AR filters. The AR filter at each time instance is randomly drawn from a multinomial distribution with parameters (0.25, 0.25, 0.25, 0.25).

Scenario 2: 2 AR filters. The AR filter at each time instance is randomly drawn from a Bernoulli distribution with parameters (0.4, 0.6).

Scenario 3: 7 AR filters. The time series is divided into 7 equal parts, and a single AR filter is used to draw the data points in each part.

For each scenario, the time series are generated using uniformly distributed random AR filters. For each time series, EM is run several times with different random initializations to increase the chance that the final estimates are close to the global optimum. Table I shows the estimated number of AR filters using AIC, BIC, and the Gap statistics, with the true number of filters for each scenario highlighted. As can be observed, the Gap statistics outperforms AIC and BIC in all three scenarios. In the third scenario, AIC and BIC are not able to estimate the number of AR filters correctly and severely underestimate the number of modes, while the Gap statistics finds the correct number of AR filters most of the time.

For a small filter size $L$ and a large number of modes $M$, the uniformly generated AR filters are more likely to be close to one another, and thus identifying them as separate modes becomes more challenging. Nevertheless, the Gap statistics still outperforms AIC and BIC in this regime. As $L$ increases, the volume of the space of stable filters grows rapidly and the average distance (defined in (12)) between two randomly chosen filters becomes larger, which makes mode separation much easier for large $L$. In that case, the gain of our approach is more pronounced.

## V Conclusions

In this work, we introduced a new model selection technique based on the Gap statistics to estimate the number of stable AR mixtures for modeling a given time series. The Gap statistics was extended to stable filters using a new distance measure between stable AR filters, which in turn was derived from the mean squared prediction error (MSPE). We also proposed a method to generate uniform random stable AR filters in the root domain in order to compute the reference curve, which may be of independent interest in its own right. Simulation results demonstrate the performance of our proposed approach.

## Acknowledgment

This research was funded by the Defense Advanced Research Projects Agency (DARPA) under grant number W911NF-14-1-0508.

## References

• [1] J. Hamilton, Time Series Analysis.   Princeton University Press, 1994.
• [2] Y. Abramovich, N. Spencer, and M. Turley, “Time-varying autoregressive (TVAR) models for multiple radar observations,” IEEE Trans. Signal Process, vol. 55, no. 4, pp. 1298–1311, Apr 2007.
• [3] C. S. Wong and W. K. Li, “On a mixture autoregressive model,” J. Roy. Statist. Soc. Ser. B, vol. 62, no. 1, pp. 95–115, Sep 2000.
• [4] H. Akaike, “Information theory and an extension of the maximum likelihood principle,” 2nd Int. Sym. Info. Theory, pp. 267–281, Sep 1973.
• [5] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461–464, Mar 1978.
• [6] G. Celeux and G. Soromenho, “An entropy criterion for assessing the number of clusters in a mixture model,” J. Classification, vol. 13, no. 2, pp. 195–212, 1996.
• [7] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” J. Roy. Statist. Soc. Ser. B, vol. 63, no. 2, pp. 411–423, 2001.
• [8] K. Kalpakis, D. Gada, and V. Puttagunta, “Distance measures for effective clustering of ARIMA time-series,” Proc. IEEE Int. Conf. Data Mining (ICDM), pp. 273–280, 2001.
• [9] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast subsequence matching in time-series databases,” Proc. of Special Interest Group on Management of Data (SIGMOD), vol. 23, no. 2, pp. 419–429, May 1994.
• [10] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, “Mining the stock market: Which measure is best?” Proc. Conf. Knowl. Discovery Data Mining (KDD), pp. 487–496, 2000.
• [11] Z. Struzik and A. Siebes, “Measuring time series similarity through large singular features revealed with wavelet transformation,” Proc. 10th Int. Workshop Database and Expert Systems Applications, pp. 162–166, 1999.
• [12] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. Ser. B, pp. 1–38, 1977.
• [13] K. Alladi, Surveys in Number Theory.   Springer, 2008.
• [14] L. Kaufman and P. Rousseeuw, Clustering by means of medoids.   North-Holland, 1987.