I Introduction
Modeling time series has been of great interest for a long time. A good model not only describes the observed data well, but also avoids overfitting, which can reduce the predictive power of the model. Autoregressive (AR) models are among the most commonly used techniques for modeling stationary time series [1]. A time-varying autoregressive (TVAR) model is a generalized form of an AR model used to describe nonstationary time series [2]. One example of a TVAR model is the regime-switching model [1], which assumes that a nonstationary stochastic process is composed of different epochs/regimes, each of which is a stationary process, and that the regimes switch according to a Markov process. Another example is the model proposed in [3], which uses a mixture of Gaussian AR models to describe a time series and an expectation-maximization (EM) algorithm to determine the parameters of the model. Wong and Li [3] used the Akaike information criterion (AIC) [4] and the Bayesian information criterion (BIC) [5] to introduce a penalty on the complexity of the model and estimate the number of AR filters. In general, AIC and BIC have been shown to be suboptimal for determining the number of modes [6]. Tibshirani et al. [7] introduced an intuitive technique for determining the appropriate number of clusters using Gap statistics. The general idea of Gap statistics is to identify the number of clusters in a data set by comparing the goodness of fit for the observed data with its expected value under a reference distribution. In this work we extend Gap statistics to time series in order to identify the number of AR filters needed to describe a time series. We use a reference curve to measure how much adding a new AR filter improves the model under the reference distribution, and then choose the number of filters that has the maximum gap with that reference curve.
In order to derive a reference curve, it is important to first clarify the meaning of “goodness of fit”. In [7], the “goodness of fit” is measured by the sum of squared Euclidean distances of each data point from the center of the cluster it belongs to, but a different measure is needed to evaluate the performance of time-series models. In this work, our goal is to select models with higher predictive power, and thus we define the “goodness of fit” measure to be the mean squared prediction error (MSPE). We use the MSPE to define a new distance measure between two stable filters, in accordance with our need for a reference curve in Gap statistics. Our proposed distance measure differs from previous distances (e.g., the cepstral distance [8], discrete Fourier transform (DFT) [9], principal component analysis (PCA) [10], and discrete wavelet transform (DWT) [11]) in that it arises naturally from the MSPE of the one-step prediction. Computing each point on the reference curve using the new distance measure turns out to be a clustering problem in the space of stable AR filters of a fixed size, which we solve using a Monte Carlo approach. To that end, we introduce an approach to generate uniform random stable filters of equal sizes, and apply the medoids algorithm to approximate the optimal solution of the clustering problem. Numerical simulations show that the accuracy of the proposed Gap statistics in estimating the number of AR filters surpasses that of AIC and BIC.
The remainder of the paper is organized as follows. Section II describes the model considered in this paper. In Section III we propose the Gap statistics to estimate the number of AR filters in a time series. Section IV presents numerical results to evaluate the performance of the proposed approach. Section V concludes the paper.
II Model Assumption and Its Estimation
In order to model a given time series , we assume that each data point depends only on previous points and that is known in this paper. We use a time-varying autoregressive (TVAR) model to describe the value at time step as follows
(1) 
where ’s are real numbers, and are independent Gaussian noise. Assume that the first points of a sample set,
, are known. The vector form of this equation can be written as
(2) 
where , and , . In real scenarios, is a time-varying vector, and modeling the variations of can be complicated. For simplicity, we assume that , for , where is the number of AR filters used to describe , and that each is a filter of size . We refer to each as a mode and call this model the multimode AR model.
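For concreteness, the multimode AR model can be simulated with a short sketch. This is a minimal illustration only: the function name, the concrete filter values, and the i.i.d. mode-switching assumption are ours, since the paper's symbols are elided in this version.

```python
import numpy as np

def simulate_multimode_ar(filters, weights, n, sigma=1.0, seed=0):
    """Simulate n points of a multimode AR series: at each time step one
    of the candidate filters is drawn i.i.d. with the given mixing
    weights (an illustrative assumption; names are ours)."""
    rng = np.random.default_rng(seed)
    filters = np.asarray(filters)
    p = filters.shape[1]
    x = np.zeros(n + p)                      # p zeros serve as initial history
    for t in range(p, n + p):
        a = filters[rng.choice(len(filters), p=weights)]
        # AR recursion: x_t = a_1 x_{t-1} + ... + a_p x_{t-p} + noise
        x[t] = a @ x[t - p:t][::-1] + sigma * rng.normal()
    return x[p:]

# hypothetical two-mode example with stable AR(2) filters
series = simulate_multimode_ar([[0.5, -0.3], [-0.4, 0.2]], [0.4, 0.6], n=500)
```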
Clearly, can be seen as a parameter for a nested family of models, and a larger will fit the observed data better. But, as mentioned before, the predictive power of the model drops if is too large. Hence, a model selection procedure that identifies the appropriate number of modes is important. To that end, we evaluate the MSPE for a range of values of that is assumed to contain the true number of modes. We first estimate the parameters of each candidate model, and then select the number of AR modes according to the Gap statistics developed in Section III. For simplicity, we further assume the model with the following log-likelihood:
(3) 
where for any fixed , and
denotes the density of Gaussian distribution of mean
and variance
evaluated at , i.e., . Let be the set of unknown parameters to be estimated. Next, we briefly describe how to approximate the maximum-likelihood estimate (MLE) of . Though computing the MLE in (3) is not tractable, it can be approximated by a local maximum via the EM algorithm [12]. Let be the membership labels with , where
The joint probability of
and can be written as a product of conditional probabilities. Thus the complete log-likelihood is
(4) 
The EM algorithm produces a sequence of estimates by recursive application of the E-step and M-step until a convergence criterion is met.
For brevity, we provide the EM formulas below without detailed derivations. We note that the M-step uses a coordinate ascent algorithm to find a local maximum.
E-Step: We take the expectation of (4) with respect to the missing data given the most recently estimated parameters, and obtain the following function (also referred to as the “Q function”)
where
We note that the parameters involved in the righthand side of the equation above take values from the last update.
The “old” superscripts are omitted for brevity.
In other words, the E-step replaces the “missing data” in Equation (4) by its expected values
for .
M-Step: Setting the derivatives of the Q function to zero leads to a coupled nonlinear system
that has no closedform solution.
Thus, our best hope is an approximation to the solution; for this we use the coordinate ascent algorithm to obtain a local maximum.
For each , we apply the Lagrange method with constraint to obtain
Taking the partial derivatives of the Q function with respect to and , we obtain the following local maximum
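The E-step above can be sketched numerically. The snippet below computes the posterior responsibility of each mode for each data point and the resulting M-step update for the mixing weights; all names are ours, a single shared noise variance is a simplifying assumption, and the paper's exact update formulas are not reproduced.

```python
import numpy as np

def e_step(x, filters, weights, sigma2):
    """E-step sketch for the mixture-of-AR model: posterior
    responsibility of each mode k for each point x_t, t >= p (symbol
    names and the shared noise variance sigma2 are our simplifications)."""
    x, filters = np.asarray(x, float), np.asarray(filters, float)
    K, p = filters.shape
    n = len(x)
    resp = np.zeros((n - p, K))
    for k in range(K):
        pred = np.array([filters[k] @ x[t - p:t][::-1] for t in range(p, n)])
        # mixing weight times the (unnormalized) Gaussian likelihood
        resp[:, k] = weights[k] * np.exp(-(x[p:] - pred) ** 2 / (2 * sigma2))
    resp /= resp.sum(axis=1, keepdims=True)   # normalize over modes
    return resp

# responsibilities on synthetic data; the M-step update for the mixing
# weights is then simply the average responsibility per mode
x = np.random.default_rng(1).normal(size=200)
resp = e_step(x, [[0.5, -0.3], [-0.4, 0.2]], [0.5, 0.5], 1.0)
new_weights = resp.mean(axis=0)
```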
III Gap Statistics to Determine the Number of Modes
In this work we use Gap statistics [7] to estimate the number of AR filters in a time series. In this technique, a data set is clustered into disjoint clusters by minimizing the following within-cluster sum of distances (WCSD)
(5) 
where is the distance of from , the center of the th cluster. Each cluster is defined based on as
(6) 
After computing for , where is assumed to be the largest possible number of clusters, the graph of is standardized by comparing it to its expectation under a noninformative reference distribution, which can be chosen to be uniform. The point that has the largest difference from the reference curve is selected as the estimated number of clusters.
Let be the squared Euclidean distance (the most commonly used distance measure for clustering). With this distance, becomes the within-cluster sum of squares. However, for clustering AR filters, Euclidean distances have been shown to be ineffective [8]. Hence, in Sec. III-A we introduce a new distance measure that is well-suited for AR clustering.
III-A Distance Measure for Autoregressive Processes
In order to find a reference curve for Gap statistics, we derive the distance between two filters based on the MSPE. We assume that the data is generated by a stable filter of size , i.e.,
(7) 
For now we assume that has zero mean. Let be the characteristic polynomial of , and denote the roots of , i.e.,
(8) 
If is stable, the roots lie inside the unit circle (). When we use at time step to predict the value at time , i.e., , the MSPE is
(9) 
Suppose that we use a filter other than to predict the value at time . The misspecified filter is denoted by . Then the MSPE becomes
(10) 
Motivated by Equations (9) and (10), we define the distance between filters and by
(11) 
which can be calculated using the power spectral density of as
(12) 
Using (), we get
(13) 
Using Cauchy’s integral theorem, Equation (13) can be written in terms of the roots and as
(14) 
for , where is the conjugate of a complex number . For the degenerate cases when or , reduces to or . We note that the distance (14) is not symmetric, i.e., . The distance measure defined in (14) is proportional to , which results in a constant in the computation of for the reference curve. Since it is the same for different , without loss of generality we can set .
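The closed form in (14) relies on the filter roots, which are elided in this version of the text. As a sanity check, the MSPE-based distance defined via (10)–(11) can also be estimated by direct simulation, with the noise variance normalized to 1 as in the remark above; the function name and implementation details are ours.

```python
import numpy as np

def mspe_distance(a, b, n=50_000, seed=0):
    """Monte Carlo estimate of the MSPE-based distance: the excess
    one-step prediction error of a (possibly misspecified) filter b on
    data generated by the stable filter a, with the noise variance
    normalized to 1 (implementation details are ours)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    p = len(a)
    x = np.zeros(n + p)
    for t in range(p, n + p):                  # simulate the true filter a
        x[t] = a @ x[t - p:t][::-1] + rng.normal()
    pred = np.array([b @ x[t - p:t][::-1] for t in range(p, n + p)])
    return np.mean((x[p:] - pred) ** 2) - 1.0  # subtract the noise floor

a, b = np.array([0.5, -0.2]), np.array([-0.3, 0.1])
d_aa = mspe_distance(a, a)   # close to 0: the true filter attains the floor
d_ab = mspe_distance(a, b)   # strictly positive for a mismatched filter
```

The simulation also makes the asymmetry noted above visible: estimating the distance with the roles of the two filters exchanged generally gives a different value.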
III-B Generation of Uniformly Distributed Random Filters
As mentioned before, Gap statistics requires a reference curve generated by clustering data sampled from a reference distribution, which is usually chosen to be uniform. Therefore, uniform random generation of stable filters (uniform in the coefficient domain) is needed. Since the roots are used in Equation (14), we use an approach that generates samples of roots corresponding to uniform samples of coefficients. To simplify the notation, we let denote the polynomial .
A polynomial is stable (also referred to as Schur stable) if all of its roots lie inside the unit circle. For a polynomial of order , we use and respectively to denote its number of pairs of complex roots and number of real roots. Let
be the coefficient space of all stable polynomials of degree , and let correspond to the polynomials that have pairs of complex roots. We call a configuration of with parameters . Clearly, are bounded subspaces of and .
In this section we propose an approach to generate uniformly distributed polynomials using roots. We first present the following theorem, which helps us to find the relation between the distribution of coefficients of a polynomial and its roots.
Theorem 1.
The determinant of the Jacobian matrix of the coefficients of a polynomial with respect to its roots is the Vandermonde polynomial of the roots, i.e.,
(15) 
Furthermore, the volume of is
Vol  
(16) 
where .
Proof.
Let be the value of the left-hand side of Equation (15), which is a polynomial of . For any positive integers and , it is clear that
Thus changes sign under any transposition of and by properties of the determinant, i.e., is an alternating polynomial of . This implies that is divisible by the Vandermonde polynomial [13]. Furthermore, both and are homogeneous polynomials of degree , and the coefficients of the term are both . Therefore, we obtain .
In order to compute the volume of , consider the space . Each point in corresponds to a set of roots of a stable polynomial in , and thus there is a mapping from to (due to the permutation of roots). Therefore,
Vol  
(17) 
Since implies that
we obtain Equation (16) by combining Equations (15) and (17). ∎
Following Theorem 1, it is not difficult to obtain the following result. We omit the proof for brevity.
Corollary 1.
Generating a sample uniformly from can be done via the following threestep procedure:

Randomly draw

Generate a random sample from according to the (unnormalized) density
(18) 
Obtain by computing
In practical implementations, the parameters of the multinomial distribution in the first step require only a one-time computation, and the second step can be realized by a sequence of one-dimensional rejection sampling steps.
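The root-domain procedure above is the method of this section. As a simple baseline for comparison (not the paper's sampler), one can also draw uniformly over the stability region directly in the coefficient domain by rejection, using the fact that a monic degree-p polynomial with all roots in the unit disk has its k-th coefficient bounded by C(p, k) in magnitude; all names here are ours.

```python
import numpy as np
from math import comb

def sample_stable_filters(p, m, seed=0):
    """Rejection sampler, uniform over the stability region in
    coefficient space: propose uniformly in the box |a_k| <= C(p, k)
    and keep only the stable proposals (a baseline alternative to the
    root-domain sampler; names are ours)."""
    rng = np.random.default_rng(seed)
    bounds = np.array([comb(p, k) for k in range(1, p + 1)], dtype=float)
    out = []
    while len(out) < m:
        a = rng.uniform(-bounds, bounds)
        # the filter a is stable iff all roots of
        # z^p - a_1 z^(p-1) - ... - a_p lie inside the unit circle
        if np.all(np.abs(np.roots(np.concatenate(([1.0], -a)))) < 1.0):
            out.append(a)
    return np.array(out)

draws = sample_stable_filters(p=3, m=100)
```

The acceptance rate of such a baseline shrinks rapidly as the filter size grows, since the stability region occupies a vanishing fraction of the bounding box, which is one motivation for sampling in the root domain instead.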
III-C Calculating the Reference Curve
Based on the discussions above, the following procedure describes how to derive the reference curve and Gap statistics.

Generate uniform random stable filters , , , with a given size , using the technique introduced in Section IIIB.

Suppose that is the candidate set of numbers of modes. For each , cluster into disjoint clusters by minimizing:
(19) where has been defined in Section III-A and the clusters are defined similarly to (6). (To make this consistent with the MSPE in (10), definition (19) includes an additional “1” on the right-hand side compared with (5); the “1” represents the noise variance , which contributes only a linear term in and can thus be neglected.) For this step, we first generate the matrix whose elements are the pairwise distances between the sampled filters, and then run the medoids algorithm [14] to approximate the optimum of (19).

Plot the reference curve, i.e., against for . We note that the reference curve is model-independent.

Plot the empirical curve given the MSPE for , using the observed data, the postulated model, and the model fitting approach. For example, the postulated model in this paper is the mixture of AR models introduced in Section II, and the model fitting approach is the EM algorithm.

Finally, the number of AR mixtures that corresponds to the largest gap between the two curves is selected.
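Steps 2 and 5 above can be sketched as follows: a plain k-medoids pass over a precomputed (possibly asymmetric) distance matrix, followed by the gap-based selection of the number of modes. Function names are ours; [14] describes the medoids algorithm itself.

```python
import numpy as np

def k_medoids(D, K, iters=100, seed=0):
    """Plain k-medoids on a precomputed (possibly asymmetric) distance
    matrix D[i, j] = d(x_i, x_j), approximating the minimizer of the
    within-cluster sum of distances."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=K, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest-medoid assignment
        new = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if members.size:                        # keep old medoid if empty
                # pick the member minimizing sum_i d(x_i, candidate center)
                new[k] = members[np.argmin(D[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)

def select_num_modes(reference, empirical):
    """Return the K (1-indexed) with the largest gap between the
    reference curve and the empirical MSPE curve."""
    return int(np.argmax(np.asarray(reference) - np.asarray(empirical))) + 1
```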
Fig. 1 illustrates the sampled coefficients of stable filters of size randomly generated using the technique described in Section III-B. The centers of the clusters (approximated by some of the generated filter samples) obtained using the medoids algorithm are also shown in this figure for . These centers are calculated based on the average of 20 random instances, each with samples. Fig. 2 shows the reference curves for and 4. As in Fig. 1, the reference curves are plotted based on random instances, each with samples.
IV Numerical Results
We generated zero-mean time series of size using the following three different models to evaluate the performance of the proposed model selection technique.
Scenario 1: 4 AR filters with lengths . The AR filter at each time instant is randomly drawn from a multinomial distribution with parameters (0.25, 0.25, 0.25, 0.25).
Scenario 2: 2 AR filters with lengths . The AR filter at each time instant is randomly drawn from a Bernoulli distribution with parameters (0.4, 0.6).
Scenario 3: 7 AR filters with lengths . The time series is divided into 7 equal parts and only one AR filter is used to draw the data points in each part.
For each scenario, time series of length are generated using uniformly distributed random AR filters. For each time series, EM is run times with different random initializations to increase the chance that the final estimates are close to the global optimum. Table I shows the estimated number of AR filters using AIC, BIC, and Gap statistics. The true number of filters for each scenario is highlighted. As can be observed, Gap statistics outperforms AIC and BIC in all three scenarios. In the third scenario, AIC and BIC are unable to estimate the number of AR filters correctly and severely underestimate the number of modes, while the Gap statistics finds the correct number of AR filters of the time.
For small and large , the uniformly generated AR filters are more likely to be close to one another, and thus identifying them as separate modes becomes more challenging. Nevertheless, Gap statistics still outperforms AIC and BIC in this scenario. As increases, the volume of the space of stable filters explodes and the average distance (defined in (12)) between two randomly chosen filters becomes larger, which makes mode separation much easier for large . In that case, the gain of our approach is more pronounced.
V Conclusions
In this work, we introduced a new model selection technique based on Gap statistics to estimate the number of stable AR mixtures for modeling a given time series. The Gap statistics was extended to stable filters using a new distance measure between stable AR filters, which in turn was derived from the mean squared prediction error (MSPE). We also proposed a method to generate uniform random stable AR filters in the root domain in order to compute the reference curve; this may be of independent interest. Simulation results demonstrated the performance of the proposed approach.
Acknowledgment
This research was funded by the Defense Advanced Research Projects Agency (DARPA) under grant number W911NF-14-1-0508.
References
 [1] J. Hamilton, Time Series Analysis. Princeton University Press, 1994.
 [2] Y. Abramovich, N. Spencer, and M. Turley, “Time-varying autoregressive (TVAR) models for multiple radar observations,” IEEE Trans. Signal Process., vol. 55, no. 4, pp. 1298–1311, Apr 2007.

 [3] C. S. Wong and W. K. Li, “On a mixture autoregressive model,” J. Roy. Statist. Soc. Ser. B, vol. 62, no. 1, pp. 95–115, Sep 2000.
 [4] H. Akaike, “Information theory and an extension of the maximum likelihood principle,” 2nd Int. Symp. Info. Theory, pp. 267–281, Sep 1973.
 [5] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461–464, Mar 1978.
 [6] G. Celeux and G. Soromenho, “An entropy criterion for assessing the number of clusters in a mixture model,” J. Classification, vol. 13, no. 2, pp. 195–212, 1996.
 [7] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” J. Roy. Statist. Soc. Ser. B, vol. 63, no. 2, pp. 411–423, 2001.
 [8] K. Kalpakis, D. Gada, and V. Puttagunta, “Distance measures for effective clustering of ARIMA time-series,” Proc. IEEE Int. Conf. Data Mining (ICDM), pp. 273–280, 2001.
 [9] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast subsequence matching in time-series databases,” Proc. of Special Interest Group on Management of Data (SIGMOD), vol. 23, no. 2, pp. 419–429, May 1994.
 [10] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, “Mining the stock market: Which measure is best?” Proc. Conf. Knowl. Discovery Data Mining (KDD), pp. 487–496, 2000.
 [11] Z. Struzik and A. Siebes, “Measuring time series similarity through large singular features revealed with wavelet transformation,” Proc. 10th Int. Workshop Database and Expert Systems Applications, pp. 162–166, 1999.
 [12] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. Ser. B, pp. 1–38, 1977.
 [13] K. Alladi, Surveys in Number Theory. Springer, 2008.
 [14] L. Kaufman and P. Rousseeuw, Clustering by means of medoids. NorthHolland, 1987.