On Selecting Stable Predictors in Time Series Models

05/18/2019 ∙ by Avleen S. Bijral, et al. ∙ 0

We extend the feature selection methodology to dependent data and propose a novel time series predictor selection scheme that accommodates statistical dependence in a more typical i.i.d sub-sampling based framework. Furthermore, the machinery of mixing stationary processes allows us to quantify the improvements of our approach over any base predictor selection method (such as lasso) even in a finite sample setting. Using the lasso as a base procedure we demonstrate the applicability of our methods to simulated and several real time series datasets.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Variable or Predictor selection is an important problem in machine learning and statistics and has received a lot of attention in the literature. These methods are valuable in various applications in biology and finance for both the very large

(samples) or the high dimensional () setting. In many cases of interest the objective of variable selection is to uncover an underlying causal mechanism and hence the method in question must exhibit robustness to noise in the data.

Most predictor selection methods discussed in the literature hold only for i.i.d data. The topic is relatively unexplored in the case of time series problems where variables correspond to the choice of exogenous predictors or even the endogenous lags to include in the regression model111We use predictors to mean either a (possibly lagged) exogenous predictor or a endogenous lag.. As time series data becomes pervasive with the advent of sensor enabled devices and networks of these devices (for e.g. Yu et al. (2017)), such predictor selection is poised to become widely applicable. Some recent applications already employ external time series of web searches to predict disease prevalence such as Influenza (GFT of Ginsberg et al. (2009), ARGO of Yang et al. (2015)) and twitter based sentiment time series for stock prediction Si et al. (2013). These approaches attempt to mitigate the problem of searching through millions of candidate exogenous time series that may be predictive of the underlying metric. In such applications selecting the correct time series is particularly crucial as with incorrect predictors, disease or other forecasts can deviate significantly from reality (see Lazer et al. (2014) for a discussion). Another important application lies in the domain of causal inference (Brodersen et al. (2015)

), wherein the impact of an intervention on a metric is evaluated using a Bayesian structural time series model. This is achieved by forecasting, using metric history and exogenous predictors (pseudo-controls), beyond the time of intervention. Finally, comparing the difference between the forecasts and what really happened, the event’s impact is estimated. It is evident that the quality of the forecast is crucial in this task and that the choice of the pseudo-controls or the correct lags is very important and so we would like to have sufficient confidence in this choice.

1.1 Contribution

In this paper we propose and analyze novel and efficient stable procedures for predictor selection in time series inspired by the framework developed in Shah and Samworth (2013). To that end we first describe block sampling for time series and then propose stability measures called block pair average (BPA) and simultaneous block selection (SBS) with corresponding error control bounds. We empirically validate our procedures on several real time series and show both qualitative and quantitative improvements over competing methods. To the best of our knowledge these are the first predictor selection methods with finite sample guarantees that have applications to interpretable forecasting, classification and other time series domains.

1.2 Related Work

The foundation of our work is based on extensions to sub-sampling based methods for i.i.d data. In this setting Meinshausen and Bühlmann (2010) proposed Stability Selection (SS) as a repeated sub-sampling based methodology to improve the performance of any variable selection technique and also provided bounds on the number of selected false positives. This method is an improvement over standard variable selection techniques as it is usually non-trivial to provide error control bounds for methods run on the complete data. More recently Shah and Samworth (2013)

proposed Complementary Pairs Stability Selection (CPSS) procedures that provide significant improvements over SS. Their performance bounds do not explicitly depend on signal and noise variables (that are usually unknown) but instead depends on the number of low selection probability variables that are included and on the number of high selection probability variables that are excluded by the CPSS procedures. This approach doesn’t require any dependence on restrictive and unverifiable assumptions such as exchangeability which is required for the analysis in

Meinshausen and Bühlmann (2010). The central idea of stability selection using both SS and CPSS procedures is repeated execution of a base procedure (e.g. lasso - Tibshirani (1996) ) on subsamples of data points to identify variables that show up often in the selected set. In time series applications, the error control yielded by these stability procedures does not hold as the sub-sampling does not account for the underlying dependence.

Most existing predictor selection methods in time series are largely based on heuristics,

Ng et al. (2013) or simply use plain lasso Yang et al. (2015); Buncic and Tischhauser (2017)

on the entire data and it is non-trivial to provide guarantees for such methods. For the specific case of vector auto regression (VAR) models,

Song and Bickel (2011) propose a grouped penalty based approach that provably identifies relevant lags and predictors in the asymptotic (number of time series) and (number of lags) regime. Our method is of a fundamentally distinct flavor in that we provide quantifiable improvement over any base predictor selection method, including the method in Song and Bickel (2011), even in the finite data (sample or dimension) setting. Moreover our approach also works for the more general VAR-X (VAR with exogenous) model and in general is independent of the base predictor selection mechanism.

2 Preliminaries

In this section we establish notation and preliminary details for the models we work with. Let and denote strictly stationary sequences. Before we present an example of variable selection procedure in the time series domain first consider the general VAR-X model Ltkepohl (2007)

(1)

where each and . Model (1

) captures the effect of endogenous and exogenous variables at different lags on the response variable and has been a staple in econometrics with more recent applications in genetics

Larvie et al. (2016) and renewable energy forecasting Cavalcante et al. (2016).

This model can be -regularized and thus employed as a variable selection procedure, for example to determine stable cross sectional relations in or to identify predictive exogenous series in .

(2)

where is a convex loss and each and are stacked such that

(3)

Since and is a stationary mixing sequence it follows that is also stationary (See Bradley (2005)). Unless stated we are not assuming Model 2 as the underlying data generating process (DGP). The model and its sparse estimation only serves as a means for predictor selection as described in the later sections.

3 Stable Predictor Selection

Consider the stationary sequence in . In the case of Model 2, . Let us define a variable selection procedure , as an estimator of the set of signal variables , that takes as input a dependent random sample of length 222Corresponds to the proportion of the entire time series length () used in sampling. Specified later. and takes values in all subsets of . Define the selection probability of a variable index as

(4)

The improved stability framework as presented in Shah and Samworth (2013) executes the base selection procedure on i.i.d random samples of size to then combine the variable selection estimates. In our setting because the stationary sequence is dependent standard stability sub-sampling doesn’t apply, so instead using the independent block technique from Yu (1994) we create “almost" independent blocks and then transfer the stability error control to i.i.d blocks that have the same distribution as the original blocks.

Divide the sequence into blocks of length and assume that without any loss of generality. Let and

be the sets that denote the indices in the odd and even blocks respectively such that

(5)
Figure 1: We construct a blockwise independent sequence (right) from a dependent sequence (left) such that the points in the block are dependent but the blocks are independent. The odd blocks in both the sequences have the same distribution.

The intuition behind this approach is that for an appropriately mixing sequence the odd blocks are roughly independent, provided is large enough (See Figure 1). And as we show later in the analysis, if we create “independent" blocks with each new block having the same distribution as its dependent counterpart we can use this duality to work with the original dependent time series.

The block construction described leads to measures of stability for predictors that we define next.

3.1 Definitions

3.1.1 Block Pair Average (BPA)

Consider a randomly selected (without replacement) subset of the sequence of odd blocks .

Definition 1.

Block Pair Average: Let be the set of randomly selected pairs of sequence of blocks such that . For some the block average selection estimator is then defined as where

(6)

Algorithm 1 shows the use of the BPA measure in the context of lasso base predictor.

3.1.2 Simultaneous Block Selection (SBS)

Definition 2.

Simultaneous Block Selection: Let be the set of randomly selected pairs of sequence of blocks such that . For some the simultaneous selection estimator is then defined as where

(7)

Note that measures (6) and (7) are dependent data analogues of the i.i.d measures presented in Shah and Samworth (2013)

and serve the same purpose of reducing variance (via averaging) as estimators of

. For these we have .

Finally to analyze these measures we need the following definitions

Definition 3.

For some , define be the set of low selection probability features under and denote the set of features that have high selection probability and .

The quantities and denote the expected number of low (noise) and high (signal) probability predictors that are included and excluded by our procedure. We analyze these measures against a base predictor to quantify the improvements that our procedures yield.

3.2 Assumptions

We will refer to the following assumptions when necessary

  1. Assumption 1: Probability of selection under is better than random. Thus where is the number of signal variables.

  2. Assumption 2: The process is algebraically -mixing (See Mohri and Rostamizadeh (2009)) implying there exists some constant such that for .

  3. Assumption 3: The process is geometrically -mixing (See Bradley (2005)) implying there exists some constant such that .

  4. Assumption 4: The distribution of the sequence is unimodal for each . This assumption is discussed in Section of Shah and Samworth (2013) and reflects the empirical observation that for many different DGPs the distribution of is consistently unimodal and leads to a sharper version of Markov inequality employed in the analysis.

There are several equivalent definitions of -mixing in the literature but we refer to Definition- in Yu (1994).

4 Block Pair Averaging - Analysis

For the BPA measure we have the following

Theorem 1.

If and , then under Assumptions 1, 2 and 4 and when

(8)

where

(9)

then we have for the block average procedure (6) and with

  1. Remark 1- is the constant in the improved Markov inequality that was developed for a finite grid () in Shah and Samworth (2013).

  2. Remark 2 When and then we get , showing that the BPA measure selects only at most of the low selection probability predictors selected by the base procedure.

  3. Remark 3- For geometric mixing sequences (Assumption 3) we require for the Theorem 1 to hold.

  4. Remark 4- The improved empirical performance obtained using the -concavity assumption also carries over to our case but for simplicity and generality we retain only the unimodal assumption. See Shah and Samworth (2013) for more details.

  5. Remark 5 We can quite easily extend the BPA measure to the case of -BPA wherein we sample blocks without replacement to get more conservative measures. This becomes obvious in the steps leading upto (15) in the proof of Theorem 1. However, for simplicity we retain only the discussion for . The i.i.d analogue of this approach seems non-trivial.

  6. Remark 6 The SBS measure is used in the proof of Theorem 1 and may also be used in practice besides BPA and its analysis is almost exactly similar.

1:  Input: , , , , , , a sequence of regularizers.
2:  Initialize: ,
3:  q-Estimate: Solve and set to be the smallest that returns active entries of .
4:  for  to  do
5:     Sample: Sequence of blocks from the set of odd blocks () without replacement and set .
6:     Set: For , .
7:     BPA: , .
8:  end for
9:  Output:
Algorithm 1 BPA Stable Selection

5 Empirical Results

To convince the reader of the utility of our methods we evaluate them on simulated and real time series data using lasso as a base predictor. For the real time series we evaluate forecast errors on a hold-out test period using the selected predictors from Algorithm 1 and compare them to other predictor selection methods such as standard lasso, elastic net Zou and Hastie (2005), adaptive lasso Zou (2006) and AIC based maximum lag selection for both and (See Model 1). For these methods we use the standard method of cross-validation. Note that the standard method of CV works for auto-regressive models as long as the errors are assumed to be uncorrelated. See Bergmeir et al. (2018) for a discussion.

The post selection training is restricted to least squares regression. The train-test split is and the predictions are made in a rolling fashion with each test data point added to the training set and re-trained to forecast the next step. On the simulated data we test the robustness of our method by adding varying degree of noise to the simulated data and reporting true positive and false positive rates (TPR/FPR). We de-trend the real data in case of any obvious violations of stationarity.

Before we get into the details of the empirical results we first discuss the choice of parameters and how to use Theorem 1 for the case of lasso as a base selection method.

5.1 Theory in Practice

Since is not known in practice we approximate it by . Setting is equivalent to choosing the regularization parameter , we can vary it until lasso selects predictors. Once is set we use to denote the irrelevant variables as those having less than average selection probability.

The appropriate threshold can then be determined based on Theorem 1 by solving for

where is pre-specified. In general we can specify any two of , and to get the stable predictor set. We will present results with different values of for some threshold . is set to and was set to some multiple of seasonality in the data for all experiments. This reflects the empirical assumption that across seasons the data points have a weak dependence. In the absence of seasonality we didn’t see a significant difference against other choices of such as or . See Algorithm 1 for a complete recipe of our method using the lasso base predictor. In the algorithm and correspond to the data points from the corresponding block sequence (see also the proof of Theorem 1). The regularizer sequence is the commonly used in the glmnet package Friedman et al. (2010). Note that in the proofs we require of the data to be used for the comparison against the base procedure, however in Algorithm 1 and empirically we use all the data for determining the -estimated active set.

Figure 2: a) The TPR/FPR reveals that Algorithm 1 is much less sensitive to added noise and hence more robust. The competing procedures are either too conservative (adaptive lasso) or too accepting. b) BPA ranked predictors. Dew point is the top predictor and there is a causal mechanismm for this. Defined as the temperature at which dew droplets start to form and so not surprisingly this impacts the PM2.5 measurement significantly.
Method BeijingPM2.5 ShanghaiPM2.5 IndoorTemp AirQuality
-BPA
-Lasso
-BPA
-Lasso
Lasso
ElasticNet
AdapLasso
AIC
Table 1: RMSE/MAPE on the test period compared to competing methods. Here and .

5.2 Simulated Data

We simulate a synthetic time series with from the following auto-regressive with exogenous variables (AR(3)-X) model where the exogenous time series are simulated from a simple AR() model with Gaussian innovations.

(10)

We also simulate time series (AR()) for the dataset along with i.i.d noise time series to create a more realistic setting of several candidates with varying possibility of being included in the stable set. Our objective is to recover a stable set (lags and exogenous) using Algorithm 1 for a AR(3)-X model with

exogenous time series. The parameters are normally distributed,

for all .

We vary the amount of noise added to the predictors and the response and plot the TPR/FPR with

-standard deviation error bars. To estimate the stable predictors using the BPA measure we set

and . Figure 2a) shows that the stable set is more robust to growing noise levels compared to the competing approaches. The TPR for lasso and Elastic net is higher since it selects many more predictors and consequently also has a much higher FPR. In contrast adaptive lasso is too conservative and Algorithm 1 has an FPR comparable to it but much better TPR. Note that for only less than half of the original predictors in Model (10) show up in the selected set, since a pure least squares estimation of the model reveals that only a fraction of the predictors are statistically significant.

6 Real Data

6.1 Chinese Cities PM2.5

The Chinese cities PM2.5 dataset, PM2 is an hourly sampled measurement of particulate matter concentration alongside several other exogenous meteorological time series. Our goal is to forecast the PM2.5 concentration of Beijing and Shanghai using exogenous and endogenous signals. The data specific parameters settings are , (thrice the daily seasonality), and (max endogenous and exogenous lags to select from).

6.2 Indoor Temperature

The indoor temperature dataset SML is a one minute sampled ( minute smoothed) set of measurements by a monitoring system that captures various attributes such as CO2 concentration, humidity, lighting, outdoor conditions etc. The idea behind this dataset is to aid in accurate forecasting of the indoor temperature to better regulate the energy consumption of the HVAC system. For this analysis we only consider the impact of the exogenous signal on the temperature so that and . Other parameters are , .

6.3 Air Quality - Temperature Prediction

The air quality dataset air is an hourly sampled set of measurements by chemical sensors that capture various attributes such as CO, NO2 and O3 concentration. While the original purpose of this dataset is to estimate sensor estimation quality, we repurpose it to model temperature as a function of these concentrations. For this analysis we set and . Other parameters are , .

6.4 Results and Discussion

Table 1 reveals that the BPA measure performs better on all but one datasets. For the BeijingPM2.5 dataset it appears that all variables contribute to the improved RMSE. Note that the choice of does make a difference to the results, as too small a value would exclude signal variables (appears to be the case for BeijingPM2.5) and a large would add noisy variables to the selected set. Some prior knowledge of the domain can be useful here, but in its absence a largish (for example ) seems to be a safe choice (due to the guaranteed error control for the BPA measure) as the results indicate that for most datasets RMSE only degrades a little. From a causality perspective Figure 2b) reveals that dew point (temperature at which dew droplets form) and precipitation are important variables. This is corroborated by meteorological studies such as Liang et al. (2015).

7 Conclusion

Filling the gap in the feature selection literature for dependent data, we proposed novel stable predictor selection techniques in the time series setting. For robustness scores that can be used with any selection method, we provided theoretical results guaranteeing error control. The stable predictors selected by our method were shown to have superior predictive performance on several real datasets. In the future we intend to explore such a scheme and measures for nonstationary and heteroscedastic time series.

8 Appendix

We gather all the proofs in this section

8.1 Proof of Theorem 1

Proof.

Denote the sequence of random variables that correspond to the

and indices as

(11)

Consider a sequence of i.i.d blocks where such that the sequence is independent of and each block has the same distribution as a block from the original sequence ().

Let be a bounded measurable function on the set of selected blocks and be the corresponding i.i.d sequence of blocks from then

(12)

where the expectation is w.r.t to the distribution of the original and constructed sequences. For a measurable and bounded function such that we have from Lemma 4.1 of Yu (1994)

(13)

Next we have

(14)

Using (13) for , using the unimodal Markov inequality for the simultaneous selector on the constructed independent blocks , from Theorem 3 of (Shah and Samworth, 2013) and finally using the fact that the sequence is stationary (all blocks have the same distribution)

(15)

where

(16)

Using the assumption (Assumption 1) it is easy to see that when

(17)

This implies from (15) that when (17) holds

(18)

Following the proof of Theorem in Shah and Samworth (2013) and using (18) we have

(19)

For the second part replace by as an estimator of , the set of noise variables. Let and correspond to the noise variable estimators. Let

(20)

Just like in the proof of the first part when

(21)
(22)

References

  • [1] Chinese pm2.5. https://archive.ics.uci.edu/ml/datasets/PM2.5+Data+of+Five+Chinese+Cities.
  • [2] Smart house data. https://archive.ics.uci.edu/ml/datasets/SML2010.
  • [3] Air quality. http://archive.ics.uci.edu/ml/datasets/Air+quality.
  • Bergmeir et al. [2018] Christoph Bergmeir, Rob J Hyndman, and Bonsoo Koo. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120:70–83, 2018.
  • Bradley [2005] Richard C. Bradley. Basic properties of strong mixing conditions. a survey and some open questions. Probab. Surveys, 2:107–144, 2005. doi: 10.1214/154957805100000104. URL http://dx.doi.org/10.1214/154957805100000104.
  • Brodersen et al. [2015] Kay H Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, Steven L Scott, et al. Inferring causal impact using bayesian structural time-series models. The Annals of Applied Statistics, 9(1):247–274, 2015.
  • Buncic and Tischhauser [2017] Daniel Buncic and Martin Tischhauser. Macroeconomic factors and equity premium predictability. International Review of Economics & Finance, 2017.
  • Cavalcante et al. [2016] Laura Cavalcante, Ricardo J Bessa, Marisa Reis, and Jethro Dowell. Sparse structures for very short-term wind power forecasting. working paper, 2016.
  • Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.
  • Ginsberg et al. [2009] Jeremy Ginsberg, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, 2009.
  • Larvie et al. [2016] Joy Edward Larvie, Mohammad Gorji Sefidmazgi, Abdollah Homaifar, Scott H Harrison, Ali Karimoddini, and Anthony Guiseppi-Elie. Stable gene regulatory network modeling from steady-state data. Bioengineering, 3(2):12, 2016.
  • Lazer et al. [2014] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The parable of google flu: traps in big data analysis. Science, 343(6176):1203–1205, 2014.
  • Liang et al. [2015] Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. Assessing beijing’s pm2. 5 pollution: severity, weather impact, apec and winter heating. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2182):20150257, 2015.
  • Ltkepohl [2007] Helmut Ltkepohl. New Introduction to Multiple Time Series Analysis. Springer Publishing Company, Incorporated, 2007. ISBN 3540262393, 9783540262398.
  • Meinshausen and Bühlmann [2010] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
  • Mohri and Rostamizadeh [2009] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary -mixing and -mixing processes. Journal of Machine Learning Research, 4:1–26, 2009.
  • Ng et al. [2013] Serena Ng et al. Variable selection in predictive regressions. Handbook of economic forecasting, 2(Part B):752–789, 2013.
  • Shah and Samworth [2013] Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80, 2013.
  • Si et al. [2013] Jianfeng Si, Arjun Mukherjee, Bing Liu, Qing Li, Huayi Li, and Xiaotie Deng. Exploiting topic based twitter sentiment for stock prediction. ACL (2), 2013:24–29, 2013.
  • Song and Bickel [2011] Song Song and Peter J Bickel. Large vector auto regressions. arXiv preprint arXiv:1106.3915, 2011.
  • Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • Yang et al. [2015] Shihao Yang, Mauricio Santillana, and Samuel C Kou. Accurate estimation of influenza epidemics using google search data via argo. Proceedings of the National Academy of Sciences, 112(47):14473–14478, 2015.
  • Yu [1994] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. Ann. Probab., 22(1):94–116, 01 1994. doi: 10.1214/aop/1176988849. URL http://dx.doi.org/10.1214/aop/1176988849.
  • Yu et al. [2017] Ruoxi Yu, Tony WC Mak, Ruikai Zhang, Sunny H Wong, Yali Zheng, James YW Lau, and Carmen CY Poon. Smart healthcare: Cloud-enabled body sensor networks. In Wearable and Implantable Body Sensor Networks (BSN), 2017 IEEE 14th International Conference on, pages 99–102. IEEE, 2017.
  • Zou [2006] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.
  • Zou and Hastie [2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.