High dimensionality is an increasingly common characteristic of data being collected in fields as diverse as genetics, neuroscience, astronomy, finance, and macroeconomics. In these fields, we frequently encounter situations in which the number of candidate predictors () is much larger than the number of samples (), and statistical inference is made possible by relying on the assumption of sparsity. The sparsity assumption, which states that only a small number of covariates contributes to the response, has led to a wealth of theoretical results and methods available for identifying important predictors in this high dimensional setting. These methods broadly fall into two classes: screening methods and penalized likelihood methods, and we focus on the screening approach in this work. For the case where is much larger than , screening is more computationally feasible as a first stage method, which can be followed by a second stage method, such as penalized likelihood approaches, on the reduced subset of predictors selected at the screening stage.
Fan and Lv (2008) proposed Sure Independence Screening (SIS) for the linear model, and it is based on ranking the magnitudes of the marginal Pearson correlations of the covariates with the response. A large amount of work has been done since then to generalize the procedure to various other types of models including: generalized linear models (Fan and Song, 2010), nonparametric additive models (Fan et al., 2011), Cox proportional hazards model (Fan et al., 2010)
, linear quantile models(Ma et al., 2017), and varying coefficient models (Fan et al., 2014). Model-free screening methods, which do not assume any particular model a priori, have also been developed. Some examples include: a distance correlation based method in Li et al. (2012), the fused Kolmogorov filter in Mai et al. (2015), a conditional distance correlation method in Liu and Wang (2017), a method based on maximum correlation in Huang and Zhu (2016), a martingale difference based approach in Shao and Zhang (2014), and a smoothing bandwidth based method in Feng et al. (2017). For a partial survey of screening methods, one can consult Liu et al. (2015). The main theoretical result of these methods is the so called “sure screening property”, which states that under appropriate conditions we can reduce the dimension of the feature space from size to a far smaller size
, while retaining all the relevant predictors with probability approaching 1.
Although there has been a large amount of interest in developing screening methods, it is surprising to see that almost all of the works operate under the assumption of independent observations. This is even more surprising given the ubiquity of time dependent data in many scientific disciplines. Data in fields such as climate science, neuroscience, political science, economics, and finance are frequently observed over time and/or space thereby exhibiting serial dependence. A specific example is in forecasting low frequency macroeconomic indicators such as GDP or inflation rate, where we can have a large number of macroeconomic and financial time series and their lags as possible covariates. Another example is identification of brain connectivity networks, where we have data from thousands of voxels collected over a relatively small number of time periods (Valdés-Sosa et al., 2005). These examples, amongst others, highlight the importance of developing screening methods for time dependent data.
In creating a screening method for time series data, we aim to account for some of the unique features of time series data such as:
A prior belief that a certain number of lags of the response variable are to be in the model.
An ordered structure of the covariates, in which lower order lags of covariates are thought to be more informative than higher order lags.
The frequent occurrence of multivariate response models such linear or nonlinear VAR models.
Additionally, we aim to have a model free screening approach which can handle continuous, discrete or grouped time series. Using a model free approach makes our methods robust to model misspecification at the screening stage, and gives us full flexibility when considering a second stage procedure. The few works which have relaxed the assumption of independent observations include Cheng et al. (2014), and Xu et al. (2014) which dealt with longitudinal data. However, the dependence structure of longitudinal data is too restrictive to cover the type of dependence present in most time series. To the best of our knowledge there have been only two works, Chen et al. (2017) and Yousuf (2018), dealing with the issue in a general stationary time series setting. The former work extended the nonparametric independence screening approach used for independent observations to the time series setting. However, the method does not utilize the serial dependence in the data, or account for the unique properties of time series data we outlined. The latter work (Yousuf, 2018) extended the theory of SIS to heavy tailed and/or dependent data as well as proposing a GLS based screening method to correct for serial correlation. However, this work is limited to the linear model and the other unique qualities of time series data outlined above are ignored. Additionally both of these works are only applicable to models with a univariate response.
In order to account for the unique characteristics of time series data mentioned above, and correct some of the limitations in previous works, we will introduce several distance correlation based screening procedures. Distance correlation (DC) was introduced by Székely et al. (2007)
, for measuring dependence and testing independence between two random vectors. The consistency, and weak convergence of sample distance correlation has been established for stationary time series inZhou (2012) and Davis et al. (2016). DC has a number of useful properties such as:
The distance correlation of two random vectors equals to zero if and only if these two random vectors are independent.
Ability to handle discrete time series, as well as grouped predictors.
An easy to compute partial distance correlation has also been developed, allowing us to control for the effects of a multivariate random vector (Székely and Rizzo, 2014).
The first property allows us to develop a model free screening approach, which is robust to model misspecification. The second property is useful when dealing with linear or nonlinear VAR models for discrete or continuous data. The third property will allow us to account for the first two unique features of time series data mentioned previously.
Compared to the previous works on screening using distance correlation based methods (Li et al., 2012; Liu and Wang, 2017), our work differs in a number of ways. First, our work deals with the time series setting, where both the covariates and response are stationary time series, and can be heavy tailed. Second, our screening procedures are developed specifically in order to account for certain unique features in time series data mentioned previously. Lastly, we choose to rely on partial DC, instead of conditional DC, when controlling for confounding variables. Partial DC is a DC based procedure which can be easily computed using pairwise distance correlations, whereas the computation of conditional DC is more involved and involves the choice of a bandwidth parameter, which can be difficult to choose.
Broadly speaking, we will be dealing with two types of models: univariate response models, some examples of which include linear or nonlinear autoregressive models with exogenous predictors (NARX), and multivariate response models such as linear or nonlinear VAR models. In both settings, we rely on partial distance correlation to build our screening procedures. Partial distance correlation produces a rich family of screening methods by taking different choices for the conditioning vector. In many applications, it is usually the case that researchers have prior knowledge that a certain subset of predictors is relevant to the response. Utilizing this prior knowledge usually enhances the screening procedure, as shown in the case of generalized linear models in Barut et al. (2016). Therefore our procedure can be viewed as a model free adaption of this principle to the time series setting. We discuss approaches for choosing the conditioning vector of each predictor, and we usually assume at least a few lags of the response variable are part of the conditioning vector of each predictor. We also discuss ways in which we can leverage the ordered structure of our lagged covariates to add additional variables to our conditioning vectors.
To motivate the multivariate response setting, consider a linear VAR(1) model: , where is a
-variate random vector. The number of parameters to estimate in this model is, which can quickly become computationally burdensome even for screening procedures. In many cases however, there exists a certain group structure amongst the predictors, which is known to researchers in advance, along with a sparse conditional dependency structure between these groups (Basu et al., 2015). For example, in macroeconomics or finance, different sectors of the economy can be grouped into separate clusters. Using this group structure, we can apply the partial distance correlation to screen relationships at the group level, thereby quickly reducing the number of variables for a second stage procedure.
The rest of the paper is organized as follows. Section 2 reviews the functional dependence measure and mixing coefficients, as well as comparisons between the two frameworks. We also discuss the assumptions placed on structure of the covariate and response processes. Section 3 introduces our screening procedures with their sure screening properties for models with a univariate response. Section 4 presents screening algorithms for multivariate response models. Section 5 covers simulation results, and a real data application is presented in Section 6. The concluding remarks are in Section 7. Lastly, the proofs for all theorems, along with additional simulations and data analysis results for sections 5 and 6 are placed in the supplementary material.
2 Dependence Measures
In order to establish asymptotic properties, we rely on two widely used dependence measures, the functional dependence measure and -mixing coefficients. We first start with an overview of the functional dependence measure framework, before proceeding to -mixing processes. For univariate processes, , we assume is a causal, strictly stationary, ergodic process with the following form:
where is a real valued measurable function, and are iid random variables. And for multivariate processes, such as the covariate process , we assume the following representation:
Where , are iid random vectors, , , and .
Processes having these representations are sometimes known as Bernoulli shift processes (Wu, 2009)
, and include a wide range of stochastic processes such as linear processes with their nonlinear transforms, Volterra processes, Markov chain models, nonlinear autoregressive models such as threshold auto-regressive (TAR), bilinear, GARCH models, among others(Wu, 2011, 2005). These representations allow us to quantify dependence using a functional dependence measure introduced in Wu (2005). The functional dependence measure for a univariate process and multivariate processes is defined respectively as:
where with being iid. And for the multivariate case, with being iid. Since we are replacing by , we can think of this as measuring the dependency of on , since we are keeping all other inputs the same. We assume the cumulative functional dependence measures are finite:
This short range dependence condition implies, by the proof of theorem 1 in Wu and Pourahmadi (2009), the auto-covariances are absolutely summable.
There are many equivalent definitions given for -mixing, and we use the one provided by Doukhan (1994):
Given a stationary multivariate process, , for each positive integer , the coefficient of absolute regularity or -mixing coefficient, , is:
Where is the total variation norm, and
is the joint distribution of the blocks. A stochastic process is said to be -mixing if
We note that compared to functional dependence measures, -mixing coefficients can be defined for any stochastic processes, and are not limited to Bernoulli shift processes. On other hand, functional dependence measures are easier to interpret and compute since they are related to the data generating mechanism of the underlying process. In many cases using the functional dependence measure also requires less stringent assumptions (see Wu and Wu (2016), Yousuf (2018) for details). Although there is no direct relationship between these two dependence frameworks, fortunately there are a large number of commonly used time series processes which are -mixing and satisfy (4). For example, under appropriate conditions, linear processes, ARMA, GARCH, ARMA-ARCH, threshold autoregressive, Markov chain models, amongst others, can be shown to be -mixing (see Pham and Tran (1985), Carrasco and Chen (2002), An and Huang (1996), Lu (1998) for details).
3 Partial Distance Correlation Screening for Univariate Response Models
We start with a brief overview of the distance covariance, distance correlation, and partial distance correlation measures.
Given this choice of weight function, by Székely et al. (2007), we have a simpler formula for the distance covariance. Let be iid, each with joint distribution , and let:
Then, by remark 3 in Székely et al. (2007), We can now estimate this quantity using moment based methods. Suppose we observe , the sample estimates for the distance covariance and distance correlation are:
Partial distance correlation (PDC) was introduced in Székely and Rizzo (2014), as a means of measuring nonlinear dependence between two random vectors and while controlling for the effects of a third random vector . We refer to the vector as the conditioning vector. Additionally, Székely and Rizzo (2014) showed that the PDC can be evaluated using pairwise distance correlations. Specifically, the PDC between and , controlling for , is defined as:
if , otherwise . For more details and an interpretation of PDC, one can consult Székely and Rizzo (2014).
3.2 Screening Algorithm I: PDC-SIS
We first review some basic ingredients of screening procedures. Let be our response time series, and let denote the predictor series at time . Given that lags of these predictor series are possible covariates, we let denote the length vector of covariates, where . Now we denote our set of active covariates as:
is the conditional cumulative distribution function of. The value represents the maximum lag order we are considering for our response and predictor series. This value can be decided beforehand by the user, or can be selected using a data driven method. Variable selection methods aim to recover exactly, which can be a very difficult goal both computationally and theoretically, especially when . In contrast, variable screening methods have a less ambitious goal, and aim to find a set such that as . Ideally we would also hope that , thereby significantly reducing the dimension of the feature space for a second stage method.
When developing screening algorithms for time series data, we would like to account for some of its unique properties as mentioned in the introduction. For models with a response, these would be:
A prior belief that a certain number of lags of the response variable are to be in the model.
An ordered structuring of the covariates, in which lower order lags of covariates are thought to be more informative than higher order lags.
The first property can be easily accounted for using partial distance correlation, while there are many different ways to account for the second property. In this section we present two partial distance correlation based screening algorithms, which attempt to account for the ordered structure of our covariates. In our first algorithm, PDC-SIS, we define the conditioning vector for the lag of predictor series as:
where . Since we are assuming a priori that a certain number of lags of are to be included in the model, is part of the conditioning vector for all possible covariates. Our conditioning vector also includes all lower order lags for each lagged covariate we are considering. By including the lower order lags in the conditioning vector, our method tries to shrink towards sub-models with lower order lags. To illustrate this, consider the case where is strongly dependent on even while controlling for the effects of . Under this scenario, if has strong serial dependence, higher order lags of can be mistakenly selected by our screening procedure even if they are not in our active set of covariates.
For convenience, let denote our set of conditioning vectors; where is the conditioning vector for covariate . Our screened sub-model is:
To establish sure screening properties, we introduce the following conditions.
Assume for and .
Assume the process is -mixing, with mixing rate , for some .
Condition 3.1 is a standard population level assumption which allows covariates in the active set to be detected by our screening procedure. Condition 1.2 is similar to the one used in Yousuf (2018) and Wu and Wu (2016), and assumes both the response and covariate processes are causal Bernoulli shift processes. Additionally it presents the dependence and moment conditions on these processes, where higher values of indicate weaker temporal dependence. Examples of response processes which satisfy condition 1.2 include stationary, causal, finite order ARMA, GARCH, ARMA-GARCH, bilinear, and threshold autoregressive processes, all of which have exponentially decaying functional dependence measures (see Wu (2011) for details). For the covariate process, assume is a vector linear process: . where are coefficient matrices and are iid random vectors with . For simplicity, assume are identically distributed, then , where is the column of . If for , then . Other examples include stable VAR processes, and multivariate ARCH processes which have exponentially decaying cumulative functional dependence measures (Wu and Wu, 2016; Yousuf, 2018). Condition 1.3 strengthens the moment requirements of condition 1.2, and requires that all moments of the covariate and response processes are finite. To illustrate the role of the constants and , consider the example where is a linear process: with iid and , then . If we assume is sub-Gaussian, then , since . Similarly, if is sub-exponential, we have .
To understand the inclusion of condition 1.4, consider the -statistic:
which aims to estimate . When are iid, the
-statistic is an unbiased estimator of, however for the -statistic is no longer unbiased if is serially dependent. Since our sample distance correlation estimate can be written as a sum of -statistics (Li et al., 2012), condition 1.4 is needed to control the rate at which the above bias vanishes as . Conditions 1.2 and 1.4 are frequently used when dealing with time series data (Wu and Pourahmadi, 2009; Xiao and Wu, 2012; Davis et al., 2016).
Throughout this paper, let , and , if , otherwise . Let if , otherwise , and let , if , otherwise . Additionally, let , and . Given condition 1.3, it follows that . Let , be the maximum dimension of the conditional vectors. We define , . Lastly, for ease of presentation, let , , where , . In addition, let
For simplicity and convenience of presentation, we assume , and one can consult the proof for the general case. The following theorem presents the sure screening properties of PDC-SIS for both the heavy tailed and light tailed settings.
From the above theorem, we observe that the range of depends on the temporal dependence in both the covariate and the response processes, the strength of the signal (), and the moment conditions. We also have two cases for finite polynomial moments, one for and one for . This is due to our proof technique which relies on both Nagaev and Rosenthal type inequalities. For the case of low moments, we obtain a better bound using a Rosenthal type inequality combined with the Markov inequality, whereas for higher moments Nagaev type inequalities lead to a better bound; more details can be found in the proof which is provided in the supplementary file.
For example, if we assume only finite polynomial moments with and , then . If we assume and , . The constants and , which are related to the cumulative functional dependence measures, represent the effect of temporal dependence on our bounds when . However, when using Nagaev type inequalities, there is an additional effect in the case of stronger dependence in the response or covariate process (i.e. ). For instance, if and , the range for is reduced by a factor of in the case of stronger dependence. For the case of exponentially decaying tails however, there is no level shift in the decay rate of our bounds due to the dependence of the response or covariate processes. We observe that if the response and covariates are sub-Gaussian, , and if they are sub-exponential, .
By choosing an empty conditional set for all the variables, our procedure reduces to the distance correlation screening (DC-SIS) introduced in Li et al. (2012) for the iid setting. Assuming sub-Gaussian response and covariates, Li et al. (2012) obtained for DC-SIS, which matches our rate. In the iid setting with finite polynomial moments, we can use the truncation method in their proof and combined with the Markov inequality to obtain . Our results, which rely on a different proof strategy than the truncation method, provide a better bound even in this setting.
3.3 Screening Algorithm II: PDC-SIS+
As we have seen, the time ordering of the covariates allows us some additional flexibility in selecting the conditioning vector compared to iid setting. Our previous algorithm attempted to utilize the time series structure of our data by conditioning on previous lags of the covariate. However, rather than simply conditioning only on the previous lags of a covariate, we can condition on additional information available from previous lags of other covariates as well. One way to attempt this, and to potentially improve our algorithm, is to identify strong conditional signals at each lag level and add them to the conditioning vector for all higher order lag levels. By utilizing this conditioning scheme we can pick up on hidden significant variables in more distant lags, and also shrink toward models with lower order lags by controlling for false positives resulting from high autocorrelation, and cross-correlation.
We now give a formal description of PDC-SIS+. The conditioning vector for the first lag level of predictor series is: which coincides with the conditioning vector for the first lag level of PDC-SIS. Using the representation , we denote the strong conditional signal set for the first lag level as:
We then use this information to form our next conditioning vector:
where is a sub-vector of which is formed by extracting the indices contained in . We note that any duplicates which result from overlap between and are deleted. For convenience, we define as our vector of estimated conditional sets. We then use to compute the strong conditional signal set for the lag level:
Repeating this procedure we obtain:
We can also vary the threshold for each lag level; for simplicity we leave it the same for each of our levels here. Our sub-model obtained from this procedure is:
The asymptotic properties of this procedure are similar to PDC-SIS, and we present them in the supplementary material. To show the asymptotic properties associated with this algorithm, we denote
as the population level counterpart to . In addition, let and
represent the population level strong conditional signal set and the population level set of conditioning vectors, respectively. One of the difficulties in proving uniform convergence of our estimated partial distance correlations in this algorithm is the presence of an estimated conditioning set . This issue becomes compounded as we estimate the conditioning vector for higher lag levels, since these rely on estimates of the conditioning vectors for lower ones. To overcome this, we first denote the collection of strong signals from lag 1 to as: . We will assume the following condition:
For any , assume
, where .
Condition 3.5 assumes the variables in the strong conditional signal set, , are easily identifiable from the rest of the covariates. This separation in the signal strength will allow us to ensure with high probability that our estimated conditional sets match their population level counterparts. The assumption , is introduced to ensure . Although the hope is that , this is not required to prove sure screening properties of our algorithm. Additionally, as seen in Barut et al. (2016) for the case of generalized linear models, conditioning on irrelevant variables could also enhance the power of a screening procedure. We will discuss how to choose the threshold for in section 3.4. In practice we would prefer not to condition on too many variables, therefore the threshold for adding a variable to would be high.
The sure screening properties for PDC-SIS+ are similar to PDC-SIS, but for the sake of completeness, we state the theorem in full.
Now, we have presented two classes of PDC screening methods. In the first class of methods, the conditional set of each covariate is known as a priori, while in the second class the conditional set is estimated from the data. We can easily modify our algorithms for both procedures depending on the situation; for example we can screen groups of lags at a time for certain covariates in PDC-SIS. Additionally, for either procedure we can condition on a small number of lags of , and leave the higher order lags of as possible covariates in our screening procedure.
3.4 Threshold Selection: PDC-SIS+
In order to implement PDC-SIS+, we need to select a threshold parameter . For simplicity we will only use a single threshold for all lag levels, and it is selected as follows: we first generate independent AR(1) variables, where , , and are independent of our response. We set , and estimate:
We then set the 99th percentile of as our estimated threshold parameter . In order to avoid conditioning on too many variables, following Weng et al. (2017), we set an upper bound of variables which can be added to our conditioning vector at each lag level. Given that and are independent, we have that is a statistical estimate of zero. This procedure is similar to the random decoupling approach used in Weng et al. (2017) and Barut et al. (2016) for the iid setting. We note that we could also use a more computationally intensive alternative approach which involves forming a pseudo-sample , where is formed using a moving block bootstrap (Kunsch, 1989) or a stationary bootstrap (Politis and Romano, 1994). We then use the procedure outlined in Barut et al. (2016), using the pseudo-sample , to select our threshold. We choose to go with the first approach due to computational efficiency.
4 Screening for Multivariate Time Series Models
Multivariate time series models, such as linear VAR models, are commonly used in fields such as macroeconomics (Lütkepohl, 2005), finance, and more recently neuroscience (Valdés-Sosa et al., 2005), and genomics. VAR models provide a convenient framework for forecasting, investigating Granger causality, and modeling the temporal and cross-sectional dependence for large numbers of series. Since the number of parameters grows quadratically with the number of component series, VAR models have traditionally been restricted to situations where the number of component series is small. One way to overcome this limitation is by assuming a sparse structure in our VAR process, and using penalized regression methods such as the Lasso and adaptive Lasso (Zou, 2006) to estimate the model. Examples of works which pursue this direction include Basu and Michailidis (2015), Basu et al. (2015), Kock and Callot (2015), and Nicholson et al. (2016). However, due to the quadratically increasing nature of the parameter space, penalized regression methods can quickly become computationally burdensome when we have a large panel of component series. For example, in a VAR() process: , where , , the number of parameters to estimate is . Additionally, these methods are restricted to linear VAR models, whereas there is considerable evidence of non-linear effects such as the existence of thresholds, smooth transitions, regime switching, and varying coefficients in fields such as macroeconomics and finance (Kilian and Lütkepohl, 2017).
Screening approaches can be used in this setting, and one option would be to screen separately for each of the series. This can be computationally prohibitive since it requires estimating correlations. However, if we assume a group structure in the component series and a sparse conditional dependency structure between these groups, we can quickly reduce the feature space by screening at the group level using distance correlation based methods. To be more precise, let be a non-linear VAR() process:
For simplicity, we let all groups be of size , let denote the total number of groups for a given lag level, and denote our groups . To get a sense of the computational benefits of screening on the group level, assume for example, , and we have 25 groups all of size . For this linear VAR model, when , we note it takes about 350 times longer to compute all pairwise distance correlations vs. computing all group pairwise distance correlations. After the group screening, examples of second stage procedures include: screening at the individual series level using partial distance correlations, or using a group lasso type procedure (Yuan and Lin, 2006) which can handle sparsity between groups and within groups for a linear VAR model (Basu et al., 2015).
We now present the details of our group PDC-SIS procedure. We decide to condition on only one lag of the grouped response in our procedure, however this number can also be selected using a data driven procedure. Let
, refer to the set of possible group connections for