# Detecting Nonlinear Causality in Multivariate Time Series with Sparse Additive Models

We propose a nonparametric method for detecting nonlinear causal relationships within a set of multidimensional discrete time series, using sparse additive models (SpAMs). We show that, when the input to the SpAM is a β-mixing time series, the model can be fitted by first approximating each unknown function with a linear combination of B-spline bases, and then solving a group-lasso-type optimization problem with nonconvex regularization. Theoretically, we characterize the oracle statistical properties of the proposed sparse estimator in function estimation and model selection. Numerically, we propose an efficient pathwise iterative shrinkage thresholding algorithm (PISTA), which tames the nonconvexity and guarantees linear convergence towards the desired sparse estimator with high probability.


## 1 Introduction

Detecting causal relationships within a set of coupled stochastic processes is an important problem that arises broadly across modern science and engineering, including finance (Atanasov and Black, 2016), computer science (Rodriguez et al., 2011), epidemiology (Robins et al., 2000), neuroscience (Quinn et al., 2011), and climatology (Runge et al., 2015). A typical causal-discovery problem involves a vast network of agents, each generating a discrete time series. As the backbone that captures the evolutionary dynamics, the underlying causal relationships, often depicted by a causal graph (Quinn et al., 2015), can provide guidance for predicting the values of the time series in the near future.

In this paper, we focus on learning Granger causality from a set of discrete time series when (i) the underlying causal graph is sparse, and (ii) the underlying causal relationships are potentially nonlinear. We accomplish this goal by fitting the observed time series to a nonparametric sparse additive model (SpAM) of the form

$$ X_i[t+1] = \sum_{j=1}^{p} f_{ij}(X_j[t]) + \epsilon_i[t+1], \qquad (1.1) $$

where $X_j[t]$ is the $j$-th dimension of the $t$-th observation, and $\epsilon_i[t+1]$ is the additive noise. The sparsity refers to the fact that, for each $i$, only a few $f_{ij}$'s are nonzero. We define the parent set of the $i$-th process as $\mathcal{P}_i = \{j : f_{ij} \neq 0\}$.
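To make the setup concrete, the following sketch simulates a toy instance of (1.1). The dimensions, horizon, noise level, and the particular nonlinear components $f_{ij}$ are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 500                              # dimensions and time horizon

# Sparse causal graph: f_ij is nonzero only for a few (i, j) pairs.
# These particular functions are hypothetical, bounded nonlinearities.
f = {(0, 1): np.sin,
     (1, 2): lambda x: 0.5 * np.tanh(2 * x),
     (3, 0): lambda x: 0.4 * np.cos(x)}

X = np.zeros((n, p))
for t in range(n - 1):
    eps = rng.uniform(-0.1, 0.1, size=p)   # bounded additive noise
    for i in range(p):
        X[t + 1, i] = sum(g(X[t, j]) for (ii, j), g in f.items() if ii == i)
        X[t + 1, i] += eps[i]
```

Because each component function and the noise are bounded, the simulated processes stay in a bounded range, consistent with the assumptions made below.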

### 1.1 Motivations

We motivate this work from several aspects: why we focus on the notion of Granger causality, why we consider nonlinear causal relationships, and why we model these relationships with SpAMs.

Why Granger causality? Granger causality (Granger, 1969) is the most widely used notion of causality for time series observations. It captures the information flow between different dimensions (Kamitake et al., 1984; Rissanen and Wax, 1987). In (1.1), the existence of Granger causality can be directly interpreted in the sense of model selection: assuming that the density of $X_j$ is positive on its support, the conditional directed information from $X_j$ to $X_i$ given all other processes is nonzero if and only if $f_{ij} \neq 0$.

Why nonlinear causality with SpAMs? In many applications, nonlinear causal relationships arise naturally: in financial markets, it is well known that a nonlinear causal relationship exists between return and volume (Hiemstra and Jones, 1994). In climatology, the causal relationships between adjacent as well as far-apart geological regions have been shown to be nonlinear (Runge et al., 2017). In such applications, models that only encode linear causality may incur estimation errors caused by model mismatch. More importantly, the limitation of linear models restricts our understanding of the nonlinear dynamics within the evolution of the time series. Therefore, revealing nonlinear causal relationships behind a time series dataset is of great importance.

Most existing works consider linear causal relationships within a generative model, such as the vector autoregressive model (VAR), $X[t+1] = A X[t] + \epsilon[t+1]$, while nonlinear causal relationships are often described by models with parametric forms (Fan and Yao, 2008). Fitting fully nonparametric models, such as the nonparametric regression model $X_i[t+1] = f_i(X_1[t], \ldots, X_p[t]) + \epsilon_i[t+1]$, suffers from the curse of dimensionality (Fan and Yao, 2008). To the best of our knowledge, inferring nonlinear causality under nonparametric statistical models remains largely an open problem.

Among nonparametric statistical models, SpAMs are less general but easier to fit than the full nonparametric regression model (Ravikumar et al., 2009). As a high dimensional extension of nonparametric additive models (Hastie and Tibshirani, 1990), SpAMs serve as an immediate generalization of VARs with direct implications for Granger causality. As such, we consider fitting SpAMs using time series data in this work.

### 1.2 Related Works

There exist several lines of research closely related to our work. In this section, we briefly introduce them, including the inference of Granger causality in large networks of stochastic processes and the works on fitting SpAMs.

Inferring Granger causality. In a large network, two typical approaches are used for detecting Granger causality. (i) Granger causality is measured by information theoretic quantities, such as directed (or conditional directed) information (Quinn et al., 2015; Rissanen and Wax, 1987; Runge et al., 2017). While estimators with performance guarantees exist for simple cases (Jiao et al., 2013), the intrinsic difficulty of computing conditional directed information hinders the design of efficient algorithms that scale to large networks. (ii) Alternatively, Granger causality is detected under the assumption that the observations are generated from a statistical model. Widely used models include the VAR (Han and Liu, 2013; Etesami and Kiyavash, 2014; Materassi and Innocenti, 2010), time series models with independent noise (TiMINo, Peters et al. (2013)), and time series linear non-Gaussian acyclic models (TS-LiNGAM, Hyvärinen et al. (2008)).

Fitting SpAMs. SpAMs are direct extensions of nonparametric additive models (Hastie and Tibshirani, 1990) to the high dimensional regime. Common ways to fit SpAMs include: (i) the component selection and smoothing operator (COSSO, Lin et al. (2006)), which solves a group Lasso formulation in a reproducing kernel Hilbert space (RKHS), and (ii) methods based on basis expansions, which expand each function over a growing number of basis functions to reduce the problem to a group Lasso formulation (Huang et al., 2010). An estimator based on kernel expansion is proposed in Zhou and Raskutti (2018). Compared to Zhou and Raskutti (2018), our work directly generalizes SpAMs and the group Lasso with nonconvex regularization to time series data. The basis expansion approach is preferable to kernel expansion in that sparse multiple kernel learning typically requires solving an optimization problem with $np$ parameters, with $p$ being the dimension and $n$ being the sample size; this is a significant bottleneck when analyzing large-scale, high dimensional data. As we will explain later, our proposed method requires solving an optimization problem with only $pq$ parameters, where $q \ll n$.

### 1.3 Our Contribution

Our contribution in this paper is three-fold. First, we generalize the fitting technique for SpAMs to the realm of time series data. By a simple basis expansion, we construct a finite dimensional parametric approximation to each $f_{ij}$, and then estimate the parameters by solving a finite dimensional group-lasso-type optimization problem with nonconvex group regularization (Fan and Li, 2001; Yuan and Lin, 2006; Zhang, 2010). For the proposed estimator, we provide statistical guarantees in terms of the oracle property: (I) for each $X_i$, we correctly recover the set of $X_j$'s that have a causal effect on it; (II) the statistical rate of convergence in function estimation is as small as that of directly performing low-dimensional estimation on the true support. Second, from a computational aspect, we propose an efficient pathwise iterative shrinkage thresholding algorithm (PISTA), which guarantees linear convergence to the desired estimator with high probability. Last, we contribute to the field of causal inference by introducing a model for estimating nonlinear causality in the high dimensional regime with both theoretical guarantees and fast numerical convergence. To the best of our knowledge, such models have only been evaluated empirically by Peters et al. (2013) on low dimensional datasets.

Notations: Given a vector $v$, we denote the number of nonzero entries in $v$ by $\|v\|_0$, and by $v_{\setminus j}$ the subvector of $v$ with the $j$-th entry removed. Let $A$ be an index set; we use $A^c$ to denote its complement. Given a matrix $M$, we use $\rho_{\max}(M)$ and $\rho_{\min}(M)$ to denote the largest and smallest eigenvalues of $M$. Finally, to distinguish our model from SpAMs with i.i.d. input, we refer to the model in (1.1) as the SpAM with time series input (TS-SpAM).

## 2 Fitting SpAMs for Time Series

Note that, for a $p$-dimensional $X[t]$, fitting the unknown functions $f_{ij}$ can be performed in parallel by $p$ systems, each fitting (1.1) for a particular $i$. Therefore, for notational simplicity, we fix the subscript $i$ in the rest of the paper and denote $X_i[t+1]$ by $Y[t]$ and $f_{ij}$ by $f_j$. We also denote the parent set by $\mathcal{P} = \{j : f_j \neq 0\}$. Consider model (1.1):

Assumption 2 (General setup for TS-SpAM):
(A1) The cardinality of $\mathcal{P}$ is uniformly upper bounded by a constant $C_\mathcal{P}$.

(A2) For each $j \in \mathcal{P}$, $f_j$ is bounded on its domain and belongs to a Sobolev space of smoothness $d$.

(A3) The noise $\epsilon[t]$ is bounded and i.i.d. across time, with $\mathbb{E}[\epsilon[t]] = 0$ and $\mathrm{Var}(\epsilon[t]) = \sigma^2$ for all $t$.

(A4) Let $d = l + \alpha$, where $\alpha \in (0, 1]$ and $l$ is a nonnegative integer. The $l$-th derivative $f_j^{(l)}$ exists and satisfies a Lipschitz condition of order $\alpha$ for some constant $C$:

$$ |f_j^{(l)}(s) - f_j^{(l)}(t)| \leq C|s - t|^{\alpha}, \qquad \forall s, t \in [a, b], $$

where $[a, b]$ is the interval in which $X_j[t]$ takes values for any $t$. The boundedness of $Y[t]$ follows from (A1) to (A3).

(A5) For each $j \in \mathcal{P}$, $\|f_j\|_2 \geq c_f$ for some constant $c_f > 0$.

(A6) For each $j$, $X_j[t]$ has a distribution $Q_{j,t}$ whose density is positive over its support, and is bounded between two constants $0 < c_1 \leq c_2 < \infty$ for all $j$ and $t$.

(A7) The functions $f_j$ and the distributions of $X_j[t]$ satisfy the identifiability condition $\mathbb{E}[f_j(X_j[t])] = 0$ for all $j$ and $t$.

(A8) The process $\{X[t]\}$ is a geometrically $\beta$-mixing time series whose mixing rate is a constant independent of $n$. Furthermore, we assume that $\sup_t \|X[t]\|_2$ is bounded, where $\|\cdot\|_2$ denotes the standard Euclidean norm.

Most of the above assumptions are inherited from SpAMs. The last assumption (A8) is common among works on time series. For simplicity, we restrict our attention to noise distributions with bounded support; the results can be easily extended to more general families of noise distributions, such as those with subgaussian tails. The boundedness assumptions on both the $X_j[t]$'s and the noise allow us to work with a set of bounded discrete processes, which greatly simplifies our analysis.

We fit the TS-SpAM by solving a regularized optimization problem of the following form:

$$ \min_{\mathbf{f} \in \mathcal{F}^p} \; \frac{1}{n} \sum_{t=1}^{n} \Big( Y[t] - \sum_{j=1}^{p} f_j(X_j[t]) \Big)^2 + \mathcal{R}_\lambda(\mathbf{f}), \qquad (2.1) $$

where $\mathcal{F}$ consists of all functions satisfying the constraints specified in Assumption 2, and we denote the first term as the loss function $\mathcal{L}(\mathbf{f})$. Below, we introduce the detailed procedure for approximating each $f_j$ with splines, reducing the problem to a group-lasso-type optimization. The number of spline bases used is selected so that the approximation error balances the estimation error introduced when solving the group Lasso problem.

### 2.1 Reduction to Group Lasso

Given $n$ samples, we construct a finite dimensional set of splines to approximate the $f_j$'s. Recall that $X_j[t] \in [a, b]$ by assumption. Partition $[a, b]$ into $K_n$ sub-intervals $I_k = [s_{k-1}, s_k)$ for $k = 1, \ldots, K_n$ (with the last interval closed). The end points of the partition satisfy $a = s_0 < s_1 < \cdots < s_{K_n} = b$ and are evenly located, so that $s_k - s_{k-1} = (b - a)/K_n$. Let $\mathcal{S}_n$ be the set of polynomial splines that are $l - 1$ times continuously differentiable on $[a, b]$ and are degree-$l$ polynomials over each sub-interval $I_k$. The constant $l$ is defined in Assumption (A4).

Under such settings, there exists a set of B-spline basis functions $\{\phi_k\}_{k=1}^{q}$ of $\mathcal{S}_n$, where $q = K_n + l$. Such a set of bases can be constructed by recursively taking divided differences of the Haar basis. (See the appendix of Chapter 5 of Friedman et al. (2001) for a detailed description.) As additive models are prone to identifiability issues, i.e., the $f_j$'s are only identifiable up to additive constants, we center the splines so that each basis function averages to zero over the samples:

$$ \psi_{jk}(x) := \phi_k(x) - \frac{1}{n} \sum_{t=1}^{n} \phi_k(X_j[t]). $$

Following the notation of Huang et al. (2010), we denote the space spanned by the centered spline bases by $\mathcal{S}_n^0$.

At this point, we can approximate each $f_j$ in model (1.1) by functions in this centered spline space:

$$ f_j(x) \approx \sum_{k=1}^{q} \beta_{jk} \psi_{jk}(x), \quad \forall j \in \mathcal{P}. \qquad (2.2) $$

Upon this approximation, the functional optimization problem (2.1) immediately turns into the following parametric estimation problem:

$$ \hat{\boldsymbol{\beta}} \in \mathop{\mathrm{argmin}}_{\boldsymbol{\beta} \in \mathbb{R}^{pq}} \; \frac{1}{2n} \sum_{t=1}^{n} \Big( Y[t] - \sum_{j=1}^{p} \sum_{k=1}^{q} \beta_{jk} \psi_{jk}(X_j[t]) \Big)^2 + \mathcal{R}_\lambda(\boldsymbol{\beta}), \qquad (2.3) $$

where $\boldsymbol{\beta} \in \mathbb{R}^{pq}$ is the vector that concatenates the $p$ column vectors $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_p$, each of dimension $q$. More compactly,

$$ \hat{\boldsymbol{\beta}} = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta} \in \mathbb{R}^{pq}} \; \frac{1}{2n} \|y - Z\boldsymbol{\beta}\|_2^2 + \mathcal{R}_\lambda(\boldsymbol{\beta}), \qquad (2.4) $$

where $y = (Y[1], \ldots, Y[n])^\top$ and $Z = [Z_1, \ldots, Z_p] \in \mathbb{R}^{n \times pq}$. Each block $Z_j$ is an $n \times q$ matrix with entries $(Z_j)_{tk} = \psi_{jk}(X_j[t])$; that is,

$$ Z_j = \begin{bmatrix} \psi_{j1}(X_j[1]) & \cdots & \psi_{jq}(X_j[1]) \\ \vdots & \ddots & \vdots \\ \psi_{j1}(X_j[n]) & \cdots & \psi_{jq}(X_j[n]) \end{bmatrix}. $$

In (2.3), each group of coefficients $\boldsymbol{\beta}_j$ determines a function $f_j$. When $f_j = 0$, the recovered $\boldsymbol{\beta}_j$ should be the zero vector. The imposed regularizer $\mathcal{R}_\lambda$ is responsible for inducing group sparsity of the additive components.
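As a sanity check of the construction above, the following sketch builds one design block $Z_j$ from centered B-spline features via the Cox-de Boor recursion. The interval $[0, 1]$, the sample size, and the choice of $q = 6$ cubic basis functions are illustrative assumptions.

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at the
    points x via the Cox-de Boor recursion (clamped knot vector)."""
    x = np.atleast_1d(x)
    # degree-0 indicator functions of the knot intervals
    B = np.array([(x >= knots[k]) & (x < knots[k + 1])
                  for k in range(len(knots) - 1)], dtype=float).T
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), B.shape[1] - 1))
        for k in range(B.shape[1] - 1):
            left = knots[k + d] - knots[k]
            right = knots[k + d + 1] - knots[k + 1]
            term = 0.0
            if left > 0:
                term = (x - knots[k]) / left * B[:, k]
            if right > 0:
                term = term + (knots[k + d + 1] - x) / right * B[:, k + 1]
            B_new[:, k] = term
        B = B_new
    return B                                   # shape (len(x), q)

rng = np.random.default_rng(1)
n, q, degree = 200, 6, 3
Xj = rng.uniform(0, 1, size=n)                 # one input series X_j
inner = np.linspace(0, 1, q - degree + 1)
knots = np.r_[[0.0] * degree, inner, [1.0] * degree]
Phi = bspline_basis(Xj, knots, degree)         # raw basis phi_k(X_j[t])
Zj = Phi - Phi.mean(axis=0, keepdims=True)     # centered basis psi_jk
```

By construction, every column of `Zj` averages to zero over the samples, which is exactly the centering that removes the constant-shift ambiguity of the additive components.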

Finally, we characterize the approximation error of (2.2), which parallels the approximation-error result for i.i.d. samples in Huang et al. (2010).

Lemma (Approximation error with B-splines). Under Assumption 2, for each $j$ there exists $\tilde{f}_j \in \mathcal{S}_n^0$ such that

$$ \|f_j - \tilde{f}_j\|_2 = \mathcal{O}_P\big( q^{-d} + q^{1/2} n^{-1/2} \big). $$

Given such a result, a natural way to choose $q$ is to balance the two terms above. Choosing $q \asymp n^{1/(2d+1)}$ gives $\|f_j - \tilde{f}_j\|_2 = \mathcal{O}_P(n^{-d/(2d+1)})$.

### 2.2 Nonconvex Penalties

We apply a decomposable nonconvex penalty to (2.4). That is,

$$ \mathcal{R}_\lambda(\boldsymbol{\beta}) = \sum_{j=1}^{p} r_\lambda(\boldsymbol{\beta}_j). $$

In particular, we focus on the nonconvex group MCP regularizer (Huang et al., 2012) of the form

$$ r_\lambda(\boldsymbol{\beta}_j) = \lambda \Big( \|\boldsymbol{\beta}_j\|_2 - \frac{\|\boldsymbol{\beta}_j\|_2^2}{2\lambda\gamma} \Big) \cdot \mathbb{1}(\|\boldsymbol{\beta}_j\|_2 < \lambda\gamma) + \frac{\lambda^2 \gamma}{2} \cdot \mathbb{1}(\|\boldsymbol{\beta}_j\|_2 \geq \lambda\gamma), \qquad (2.5) $$

where $\lambda$ is the tuning parameter and $\lambda\gamma$ is the threshold on $\|\boldsymbol{\beta}_j\|_2$ beyond which the penalty becomes constant. Note that $r_\lambda(\boldsymbol{\beta}_j) = \lambda\|\boldsymbol{\beta}_j\|_2 + w_\lambda(\boldsymbol{\beta}_j)$, with $w_\lambda$ being

$$ w_\lambda(\boldsymbol{\beta}_j) = \Big( \frac{\lambda^2 \gamma}{2} - \lambda\|\boldsymbol{\beta}_j\|_2 \Big) \cdot \mathbb{1}(\|\boldsymbol{\beta}_j\|_2 \geq \lambda\gamma) - \frac{\|\boldsymbol{\beta}_j\|_2^2}{2\gamma} \cdot \mathbb{1}(\|\boldsymbol{\beta}_j\|_2 < \lambda\gamma). \qquad (2.6) $$

Hence, the overall regularization term takes the form

$$ \mathcal{R}_\lambda(\boldsymbol{\beta}) = \sum_{j=1}^{p} \big[ \lambda\|\boldsymbol{\beta}_j\|_2 + w_\lambda(\boldsymbol{\beta}_j) \big]. $$
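A direct transcription of (2.5) and (2.6) for a single coefficient group, useful for checking the decomposition $r_\lambda(\boldsymbol{\beta}_j) = \lambda\|\boldsymbol{\beta}_j\|_2 + w_\lambda(\boldsymbol{\beta}_j)$; the function names below are ours.

```python
import numpy as np

def group_mcp(beta_j, lam, gamma):
    """Group MCP penalty r_lam of (2.5), a function of ||beta_j||_2."""
    t = np.linalg.norm(beta_j)
    if t < lam * gamma:
        return lam * t - t ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2              # penalty saturates at a constant

def w_lam(beta_j, lam, gamma):
    """Concave part w_lam of (2.6): r_lam = lam * ||beta_j||_2 + w_lam."""
    t = np.linalg.norm(beta_j)
    if t < lam * gamma:
        return -t ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2 - lam * t
```

The saturation for $\|\boldsymbol{\beta}_j\|_2 \geq \lambda\gamma$ is what distinguishes MCP from the group Lasso: large groups incur no additional shrinkage.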

### 2.3 Connection Between Time Series Observations and Samples

In this section, we show that one can plug the time series data into the TS-SpAM as if they were i.i.d. samples. The key to making such a connection resides in proving a restricted eigenvalue property for the blocks $U_i = Z_i^\top Z_i / n$. This result also fuels our subsequent analysis of the oracle properties of the estimator with nonconvex regularization.

For strong mixing time series, the following Bernstein’s inequality guarantees that the empirical average concentrates tightly around its expectation.

Lemma (Bernstein's inequality under strong mixing conditions, Merlevède et al. (2009)). Let $\{X_i\}_{i=1}^{n}$ be a sequence of centered random variables bounded by a constant $M$ and satisfying the strong mixing condition with parameter $\tilde{\mu}$. Then, there exists a constant $C_{\tilde{\mu}}$ depending on $\tilde{\mu}$, such that for all $n$ and $x \geq 0$,

$$ \mathbb{P}\Big( \Big| \sum_{i=1}^{n} X_i \Big| \geq x \Big) \leq 2 \exp\Big( - \frac{C_{\tilde{\mu}} x^2}{n M^2 + M x (\log n)(\log\log n)} \Big). $$

The above lemma is an analogue of Bernstein's inequality for i.i.d. random variables. It helps us prove the restricted eigenvalue condition for the design matrix $Z$, which is the key to extending the theoretical results to the time series scenario. We state this restricted eigenvalue condition for time series in the following lemma:

Let $U_i = Z_i^\top Z_i / n$ for each $i \in [p]$, where $Z_i$ is defined as in (2.4) and $q \asymp n^{1/(2d+1)}$. Then, with high probability, we have

$$ C_{\min} q^{-1} \leq \min_{i \in [p]} \rho_{\min}(U_i) \leq \max_{i \in [p]} \rho_{\max}(U_i) \leq C_{\max} q^{-1}, $$

where $C_{\min}$ and $C_{\max}$ are constants depending only on the mixing parameter $\tilde{\mu}$ defined in Lemma 2.3.
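The $q^{-1}$ eigenvalue scaling can be checked empirically. The sketch below uses degree-1 B-splines (hat functions) on i.i.d. uniform draws rather than the paper's exact construction and mixing setup; it is a simplified analogue meant only to illustrate that both extreme eigenvalues of $U_i$ scale like $1/q$.

```python
import numpy as np

def hat_basis(x, q):
    """Evaluate q piecewise-linear (degree-1 B-spline) basis functions
    with evenly spaced knots on [0, 1] at the points x."""
    knots = np.linspace(0.0, 1.0, q)
    h = knots[1] - knots[0]
    return np.maximum(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0)

rng = np.random.default_rng(0)
n, q = 5000, 8
x = rng.uniform(0, 1, size=n)
Phi = hat_basis(x, q)
U = Phi.T @ Phi / n                   # analogue of U_i = Z_i^T Z_i / n
eigs = np.linalg.eigvalsh(U)          # sorted in ascending order
# Both rho_min(U) and rho_max(U) are of order 1/q, as in the lemma.
```

Since each hat function is supported on an interval of width of order $1/q$, the Gram matrix entries, and hence its eigenvalues, scale as $1/q$.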

We briefly sketch the proof of Lemma 2.3 below. A detailed proof can be found in the Appendix.

###### Proof Sketch.

We relate the eigenvalues of $U_i$ to the maximum and minimum values of $v^\top U_i v$, where $v$ lies on the surface of the $q$-dimensional unit sphere $\mathbb{S}^{q-1}$.

Upper bound. The upper bound for $\rho_{\max}(U_i)$ can be obtained in three steps.

(i) We first derive an exponential tail bound for $v^\top Z_i^\top Z_i v$ for a fixed $v$. This can be obtained by first noticing that

$$ v^\top Z_i^\top Z_i v = \sum_{t=1}^{n} \Big( \sum_{j=1}^{q} \phi_j(X_i[t]) v_j \Big)^2, $$

and then applying Lemma 2.3. The Bernstein inequality requires that the summand sequence be strong mixing, which holds since $\{X_i[t]\}$ is strong mixing.

(ii) We upper bound the expectation $\mathbb{E}[v^\top Z_i^\top Z_i v]$, which holds by the properties of the B-splines. Together with the first step, one has

$$ \mathbb{P}\Big( \frac{|S_n(v)|}{n} \geq \delta \Big) \leq 2 \exp\Big( - \frac{C_{\tilde{\mu}} n \delta^2}{M^2 + M \delta \log n \log\log n} \Big), $$

where $S_n(v)$ is shorthand for the centered quantity $v^\top Z_i^\top Z_i v - \mathbb{E}[v^\top Z_i^\top Z_i v]$.

(iii) We apply the standard $\epsilon$-net argument: choosing a proper value of $\delta$ and taking a union bound over an $\epsilon$-net of $\mathbb{S}^{q-1}$, the probability that the supremum of $|S_n(v)|/n$ over $\mathbb{S}^{q-1}$ exceeds $\delta$ converges to 0. Hence, with high probability, $\rho_{\max}(U_i) \leq C_{\max} q^{-1}$.

Lower bound. To obtain the lower bound, one needs a lower bound on $\mathbb{E}[\|Z_i v\|_2^2]$. This can be done following the argument of Lemma 6.1 in Zhou et al. (1998), where

$$ \mathbb{E}[\|Z_i v\|_2^2] = \sum_{t=1}^{n} \int_{a}^{b} s_t^2(x) \, dQ_{i,t}(x) \geq c \, n \, q^{-1} \|v\|_2^2, $$

where $c$ is some constant. ∎

Since the cardinality of $\mathcal{P}$ is bounded by $C_\mathcal{P}$, the following result holds by the triangle inequality and Lemma 3 of Stone (1985):

[Restricted eigenvalues] For an index set $A \subset \{1, \ldots, p\}$, let $Z_A$ concatenate the blocks $Z_j$ with $j \in A$, and define $U_A = Z_A^\top Z_A / n$. Then, with high probability, we have

$$ C'_{\min} q^{-1} \leq \min_{A \subset \{1, \ldots, p\},\, |A| \leq C_\mathcal{P}} \rho_{\min}(U_A) \quad \text{and} \quad C'_{\max} q^{-1} \geq \max_{A \subset \{1, \ldots, p\},\, |A| \leq C_\mathcal{P}} \rho_{\max}(U_A), $$

where $C'_{\min}$ and $C'_{\max}$ are constants depending only on $\tilde{\mu}$ and $C_\mathcal{P}$.

## 3 Main Results: Oracle Properties

We now introduce the oracle properties of the group Lasso problem: with nonconvex regularization, the convergence rate of the estimate of $\boldsymbol{\beta}$ is as good as the rate attainable when the support set $\mathcal{P}$ is given by an oracle. The result presented below not only applies to the analysis of TS-SpAM, but is also general enough to be of independent interest.

We first define the oracle solution of the group Lasso problem. [Oracle solution] For a given parent set $\mathcal{P}$, let $\boldsymbol{\beta}_\mathcal{P}$ be the concatenation of the $\boldsymbol{\beta}_j$'s with $j \in \mathcal{P}$. The oracle solution $\hat{\boldsymbol{\beta}}^o$ is defined such that $\hat{\boldsymbol{\beta}}^o_j = 0$ for $j \notin \mathcal{P}$ and

$$ \hat{\boldsymbol{\beta}}^o_\mathcal{P} \in \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}_\mathcal{P} \in \mathbb{R}^{|\mathcal{P}|q}} \; \frac{1}{2n} \Big\| y - \sum_{j \in \mathcal{P}} Z_j \boldsymbol{\beta}_j \Big\|_2^2. \qquad (3.1) $$

Equivalently, an oracle provides the support of $\boldsymbol{\beta}$, so the regression reduces to estimating $\boldsymbol{\beta}_j$ for $j \in \mathcal{P}$, a low dimensional problem since $|\mathcal{P}|$ is fixed and does not grow with $n$ or $p$.

For regularizers of the form defined in (2.5), such as group SCAD and group MCP (Huang et al., 2012), the following regularity conditions on the concave function $w_\lambda$, defined in (2.6), guarantee consistent estimation from time series observations. We require $w_\lambda$, viewed as a function of $t = \|\boldsymbol{\beta}_j\|_2$, to satisfy the following regularities:

• $w_\lambda$ is symmetric, i.e., $w_\lambda(-t) = w_\lambda(t)$.

• $w_\lambda$ and its derivative $w'_\lambda$ pass through the origin, i.e., $w_\lambda(0) = w'_\lambda(0) = 0$.

• There is some $\gamma > 0$ such that, for all $t \geq 0$, $w'_\lambda(t) = -\lambda$ if $t \geq \gamma\lambda$, and $w'_\lambda(t) \in (-\lambda, 0]$ otherwise.

• $w'_\lambda$ is monotonically decreasing and Lipschitz continuous, i.e., for $t' > t$, there are positive real numbers $\zeta_-$ and $\zeta_+$ such that $-\zeta_- \leq \big( w'_\lambda(t') - w'_\lambda(t) \big)/(t' - t) \leq -\zeta_+$.

• $w_\lambda(t)$ has bounded difference with respect to $\lambda$, i.e., $|w_\lambda(t) - w_{\lambda'}(t)| \leq |\lambda - \lambda'| \, |t|$ for all $t$.

Under the regularity assumptions on the penalty function and the general assumptions we made on TS-SpAM, the following oracle property holds for time series input.

Given the regularity conditions above, suppose that there exists a universal constant $C$ such that

$$ \min_{j \in \mathcal{P}} \|\boldsymbol{\beta}^*_j\|_2 \geq C \sigma \Big( \sqrt{\frac{q}{n}} + \sqrt{\frac{\log p}{n}} \Big). \qquad (3.2) $$

If we set the regularization coefficient $\lambda$ of the same order as the right-hand side of (3.2), then for large enough $n$, we have

$$ \mathbb{P}\big( \hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}^o \big) \geq 1 - p^{-1}, $$

while

$$ \|\hat{f}_j - f_j\|_2^2 = \mathcal{O}_P\big( n^{-2d/(2d+1)} \big). $$

Theorem 3 provides statistical guarantees on the probability of successful support recovery as well as on the function estimation error. This result provides a theoretical basis for inferring Granger causality in nonlinear autoregressive models.

## 4 Computational Algorithm

Nonconvex optimization problems are computationally intractable in general, as has been extensively studied by worst-case analyses in classical optimization theory. However, the nonconvex optimization problem arising here is not constructed adversarially, and thus one can naturally expect better behavior than the worst-case performance. In this section, we develop an efficient pathwise iterative shrinkage thresholding algorithm (PISTA), which guarantees linear convergence to the desired sparse local solution with high probability. PISTA contains two loops: (i) the inner loop, an iterative shrinkage thresholding algorithm, and (ii) the outer loop, a pathwise optimization scheme.

For notational convenience, we define the augmented loss function as $\tilde{\mathcal{L}}_\lambda(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) + \sum_{j=1}^{p} w_\lambda(\boldsymbol{\beta}_j)$, where $w_\lambda$ is defined in (2.6). With this notation, we rewrite the objective function as

$$ F_\lambda(\boldsymbol{\beta}) = \tilde{\mathcal{L}}_\lambda(\boldsymbol{\beta}) + \lambda \|\boldsymbol{\beta}\|_{1,2}. $$

### 4.1 Iterative Shrinkage Thresholding Algorithm

We first derive the iterative shrinkage thresholding algorithm (ISTA), the inner loop of PISTA. Suppose that we have the solution $\boldsymbol{\beta}^{(t)}$ at the $t$-th iteration. We take the following quadratic approximation of $F_\lambda$ at $\boldsymbol{\beta}^{(t)}$:

$$ H_{\lambda,\eta_{t+1}}(\boldsymbol{\beta}, \boldsymbol{\beta}^{(t)}) = \tilde{\mathcal{L}}_\lambda(\boldsymbol{\beta}^{(t)}) + (\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)})^\top \nabla \tilde{\mathcal{L}}_\lambda(\boldsymbol{\beta}^{(t)}) + \frac{\eta_{t+1}}{2} \|\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_{1,2}, $$

where $\eta_{t+1}$ is the step size parameter for the $(t+1)$-th iteration. The ISTA algorithm then takes

$$ \boldsymbol{\beta}^{(t+1)} = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} H_{\lambda,\eta_{t+1}}(\boldsymbol{\beta}, \boldsymbol{\beta}^{(t)}) = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} \frac{\eta_{t+1}}{2} \|\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}}^{(t+1)}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_{1,2}, \qquad (4.1) $$

where

$$ \tilde{\boldsymbol{\beta}}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \frac{1}{\eta_{t+1}} \nabla \tilde{\mathcal{L}}_\lambda(\boldsymbol{\beta}^{(t)}). $$

By group soft thresholding, (4.1) admits a closed form solution $\boldsymbol{\beta}^{(t+1)} = (\boldsymbol{\beta}^{(t+1)\top}_1, \ldots, \boldsymbol{\beta}^{(t+1)\top}_p)^\top$, where

$$ \boldsymbol{\beta}^{(t+1)}_j = \tilde{\boldsymbol{\beta}}^{(t+1)}_j \cdot \max\Big( 1 - \frac{\lambda}{\eta_{t+1} \|\tilde{\boldsymbol{\beta}}^{(t+1)}_j\|_2}, \; 0 \Big). $$

Once again, for notational simplicity, we write this update as $\boldsymbol{\beta}^{(t+1)} = \mathcal{T}_{\lambda,\eta_{t+1}}(\boldsymbol{\beta}^{(t)})$. The step size in (4.1) can be obtained by a backtracking line search: we start with a small enough $\eta_0$, and choose the minimum nonnegative integer $k$ such that $\eta_{t+1} = 2^k \eta_0$ satisfies $F_\lambda(\boldsymbol{\beta}^{(t+1)}) \leq H_{\lambda,\eta_{t+1}}(\boldsymbol{\beta}^{(t+1)}, \boldsymbol{\beta}^{(t)})$. We terminate ISTA when the following approximate KKT condition holds:

$$ \kappa_\lambda(\boldsymbol{\beta}^{(t+1)}) = \min_{\xi \in \partial \|\boldsymbol{\beta}^{(t+1)}\|_{1,2}} \big\| \nabla \tilde{\mathcal{L}}_\lambda(\boldsymbol{\beta}^{(t+1)}) + \lambda \xi \big\|_{\infty,2} \leq \epsilon \lambda, $$

where $\epsilon$ is the target precision.
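The inner loop can be sketched in a few lines. This is a schematic of the update (4.1) with a generic gradient oracle and a fixed step size; the function names and the omission of backtracking are our own simplifications.

```python
import numpy as np

def group_soft_threshold(b, thresh, q):
    """Apply b_j <- max(1 - thresh / ||b_j||_2, 0) * b_j to each
    consecutive coefficient group of size q."""
    out = b.copy()
    for j in range(0, len(b), q):
        norm = np.linalg.norm(b[j:j + q])
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0 else 0.0
        out[j:j + q] = scale * b[j:j + q]
    return out

def ista_step(beta, grad, eta, lam, q):
    """One ISTA update: gradient step on the augmented loss, then
    groupwise soft thresholding with threshold lam / eta."""
    beta_tilde = beta - grad(beta) / eta
    return group_soft_threshold(beta_tilde, lam / eta, q)

# Tiny check with a zero gradient: small groups are zeroed, large ones shrink.
beta = np.array([0.05, 0.0, 1.0, 0.0])
out = ista_step(beta, lambda b: np.zeros_like(b), eta=1.0, lam=0.1, q=2)
```

Here the first group (norm 0.05) falls below the threshold 0.1 and is set to zero, while the second group is shrunk by a factor of 0.9.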

### 4.2 Pathwise Optimization Scheme

We next derive the pathwise optimization scheme, the outer loop of PISTA. The pathwise optimization scheme is natural for sparse optimization problems, since we always need to tune the regularization parameter over a refined grid to achieve good empirical performance. The pathwise scheme takes advantage of this tuning procedure and requires no additional computational effort. Particularly, it considers a decreasing sequence of regularization parameters $\{\lambda_m\}_{m=0}^{M}$, with $\lambda_m < \lambda_{m-1}$ for $m = 1, \ldots, M$. We can verify that when we choose $\lambda_0 = \|Z^\top y / n\|_{\infty,2}$, $\boldsymbol{\beta} = 0$ is a local solution, i.e., $\kappa_{\lambda_0}(0) = 0$. Then, for $m = 1, \ldots, M$, we run the ISTA algorithm with the solution for $\lambda_{m-1}$ as the initial value, and solve the nonconvex optimization problem with respect to $\lambda_m$. We summarize the PISTA algorithm in Algorithm 1.
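A minimal sketch of the outer loop on a synthetic group-sparse regression. For brevity it uses the convex group-Lasso penalty (the concave $w_\lambda$ term would enter only through the gradient of the augmented loss), a fixed step size, and a geometric $\lambda$ grid; all problem sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 100, 20, 3
Z = rng.standard_normal((n, p * q))
y = Z[:, :q] @ np.array([1.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(n)

eta = np.linalg.eigvalsh(Z.T @ Z / n).max()    # valid step size parameter
# lambda_0 makes beta = 0 a solution; then decrease lambda geometrically.
lam0 = max(np.linalg.norm((Z.T @ y / n)[j:j + q]) for j in range(0, p * q, q))
lams = lam0 * 0.8 ** np.arange(10)

beta = np.zeros(p * q)                         # exact solution at lambda_0
for lam in lams[1:]:                           # warm starts along the path
    for _ in range(200):                       # inner ISTA loop
        b = beta - (Z.T @ (Z @ beta - y) / n) / eta
        for j in range(0, p * q, q):           # group soft thresholding
            norm = np.linalg.norm(b[j:j + q])
            b[j:j + q] *= max(1.0 - (lam / eta) / norm, 0.0) if norm > 0 else 0.0
        beta = b

group_norms = [np.linalg.norm(beta[j:j + q]) for j in range(0, p * q, q)]
```

On this toy instance the first group, which carries the true signal, dominates the recovered coefficient norms at the end of the path.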

We establish the global convergence of PISTA in the next theorem. Under the same assumptions as in Theorem 3, for large enough $n$, the following results hold with high probability:

(1) Throughout all iterations, every iterate in our algorithm is sufficiently sparse, with the number of active groups bounded on the order of $|\mathcal{P}|$.

(2) PISTA converges to a unique sparse local solution $\bar{\boldsymbol{\beta}}^{(m)}$ for each $\lambda_m$, satisfying

$$ \|\bar{\boldsymbol{\beta}}^{(M)}\|_0 \leq C_\mathcal{P}/2 \quad \text{and} \quad \kappa_{\lambda_M}\big( \bar{\boldsymbol{\beta}}^{(M)} \big) = 0. $$

(3) For each $\lambda_m$, the ISTA iterates converge linearly (geometrically) to $\bar{\boldsymbol{\beta}}^{(m)}$.

(4) To compute the entire solution path, the total number of ISTA iterations is at most of order $M \log(1/\epsilon)$.

Theorem 4.2 extends Theorem 3 of Wang et al. (2014) to the group Lasso setting. In terms of performance, Theorem 4.2 guarantees that PISTA attains global linear convergence to the unique sparse oracle solution for the nonconvex structured causal inference problem with high probability. Note that $\bar{\boldsymbol{\beta}}^{(M)}$ in Theorem 4.2 is exactly $\hat{\boldsymbol{\beta}}$ in Theorem 3, since they correspond to the same value of $\lambda$.

## 5 Numerical Results

In this section, we evaluate the proposed algorithm on both synthetic and real data. For the synthetic dataset, we verify the support recovery performance on a nonlinear autoregressive model with the $f_{ij}$'s designed as polynomial splines. For the real dataset, we infer causal relationships from the daily stock prices of 452 companies listed on the NYSE, and compare the prediction errors of PISTA and the Lasso (realized using glmnet (Friedman et al., 2010)). In particular, we demonstrate on the real data the superiority of PISTA over the Lasso when the input data contains nonlinear causal relationships.

### 5.1 Synthetic Data

Consider the model (1.1) with $p = 300$. (We set the number of spline bases for each dimension to be 3, so that the design matrix has fewer samples than predictors.) We focus on estimating the $f_{1j}$'s, and we assume that 10 out of the remaining 299 $f_{1j}$'s are nonzero, chosen uniformly at random. We set each nonzero $f_{1j}$ to be a polynomial spline whose coefficient vector is standardized to guarantee the mixing of the process. The noise is bounded, as in Assumption (A3). Finally, for $i \neq 1$, we set $f_{ii}$ in the same fashion as the nonzero $f_{1j}$'s, and set $f_{ij} = 0$ for $j \neq i$.

We randomly generate 100 training sets, and test PISTA's performance for different $\lambda$'s. (Here $\lambda$ refers to the maximum $\lambda$ specified as the input to PISTA.) We set the number of splines to be 3 and adopt the group MCP penalty. (As shown in Appendix 7.12, increasing it does not significantly affect PISTA's performance in this case.)

The computed precision and recall statistics are shown in Figure 1. As $\lambda$ increases, the precision increases at the expense of lowering the recall. The balance of the two that maximizes the F1 score is achieved at a moderate $\lambda$ for PISTA. The selected $\lambda$ generalizes from the training set to the test set, so that the optimal $\lambda$ can be selected via cross validation. In Appendix 7.12, we perform a side-by-side comparison between the precision and recall performance of PISTA and that of the group Lasso. The results show that PISTA with the group MCP penalty achieves much better support recovery performance than the group Lasso.

### 5.2 Real Data: Inferring Causality From Stock Market Data

We demonstrate and interpret the results of TS-SpAM applied to a stock market dataset. The dataset, collected from Zhao et al. (2012), includes 452 companies' stock prices over a period of 1258 days between Jan 1st, 2003 and Jan 1st, 2008. In our experiment, we consider the stocks that belong to the “information technology” sector, and we use the first 100 samples to interpret the causal relationships.

We preprocess the data by taking log returns. The log return $r_i[t]$ of a stock with price history $\{X_i[t]\}$ is defined as

$$ r_i[t] = \log\Big( 1 + \frac{X_i[t] - X_i[t-1]}{X_i[t-1]} \Big). \qquad (5.1) $$

For this dataset, the log return of each stock can be viewed as a stationary time series.
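The preprocessing step (5.1) is a one-liner; note that $\log(1 + (X_i[t] - X_i[t-1])/X_i[t-1]) = \log(X_i[t]/X_i[t-1])$. A minimal sketch:

```python
import numpy as np

def log_returns(prices):
    """Log returns r[t] = log(X[t] / X[t-1]), per equation (5.1)."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])

# Toy price history (hypothetical values, for illustration only).
r = log_returns([100.0, 101.0, 99.0, 99.0])
```

The output has one fewer entry than the price series, and a flat price yields a log return of exactly zero.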

We then input the processed stock prices to PISTA with the number of B-spline bases set to 3. We also input the processed stock prices to glmnet (Friedman et al., 2010), which we use as our benchmark. For both algorithms, we set the regularization parameter sequence to span a wide range, so that the resulting estimated support can be sparse. For each stock, we record the three companies that have the most significant impact on it. The complete estimated relationships can be found in Table 3, and we manually choose 24 dimensions to visualize in the following graphs.

The selected companies' labels are given in Table 1. The majority of the selected companies are related to the manufacturing of semiconductors or storage devices, including NVDA, MU, AMD, INTC, QCOM, BRCM, SNDK, and WDC. A few companies are related to the manufacturing of computers and computer software, including DELL, MSFT, and HPQ. The results of PISTA and glmnet have a lot in common. For example, SNDK, NVDA, QCOM, and WDC are widely connected in the results of both algorithms. However, PISTA also captures a few relationships that are not present in the glmnet result. For example, PISTA finds that the price of MU causally affects the prices of NVDA and AMD, both popular stocks in the semiconductor industry, while glmnet only captures the impact on INTC. (In our results, we keep the tuning the same for glmnet and PISTA.) Besides INTC, glmnet also recovers Molex and Paychex as the stocks significantly impacted by MU. While Molex is involved in the manufacturing of electronics and fiber, Paychex is not directly related to the semiconductor industry.

Another interesting observation from Figure 2 is that glmnet does not recover any causal influence on the price of A among the companies in Table 1, while PISTA recovers a causal relationship between A and HPQ. A closer look into company history reveals that Agilent was spun off from Hewlett-Packard in 1999 (agi, 2018), and the prices of the two companies share a similar trend to this day.

## 6 Conclusions

In this paper, we studied sparse additive models for discrete time series (TS-SpAMs) in high dimensions. We proposed an estimator that fits this model by exploiting basis expansion and nonconvex regularization techniques. We provided both theoretical guarantees and a computationally efficient algorithm, PISTA, which tames the nonconvexity while guaranteeing linear convergence towards the desired sparse solution with high probability. Our results have direct implications for detecting nonlinear causality in large time series datasets, and we compared the causal relationships recovered by PISTA and glmnet on a stock market dataset.

## References

• agi (2018) (2018). Agilent company website. [online; accessed 2018-03-10].
• Atanasov and Black (2016) Atanasov, V. and Black, B. (2016). Shock-based causal inference in corporate finance and accounting research.
• Etesami and Kiyavash (2014) Etesami, J. and Kiyavash, N. (2014). Directed information graphs: A generalization of linear dynamical graphs. In American Control Conference (ACC), 2014. IEEE.
• Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96 1348–1360.
• Fan and Yao (2008) Fan, J. and Yao, Q. (2008). Nonlinear time series: nonparametric and parametric methods. Springer Science & Business Media.
• Friedman et al. (2001) Friedman, J., Hastie, T. and Tibshirani, R. (2001). The elements of statistical learning, vol. 1. Springer series in statistics New York.
• Friedman et al. (2010) Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 1.
• Granger (1969) Granger, C. W. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society 424–438.
• Han and Liu (2013) Han, F. and Liu, H. (2013). Transition matrix estimation in high dimensional time series. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
• Hastie and Tibshirani (1990) Hastie, T. and Tibshirani, R. (1990). Generalized additive models. Wiley Online Library.
• Hiemstra and Jones (1994) Hiemstra, C. and Jones, J. D. (1994). Testing for linear and nonlinear Granger causality in the stock price-volume relation. The Journal of Finance 49 1639–1664.
• Huang et al. (2012) Huang, J., Breheny, P. and Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical science: a review journal of the Institute of Mathematical Statistics 27.
• Huang et al. (2010) Huang, J., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric additive models. Annals of Statistics 38 2282–2313.
• Hyvärinen et al. (2008) Hyvärinen, A., Shimizu, S. and Hoyer, P. O. (2008). Causal modelling combining instantaneous and lagged effects: an identifiable model based on non-Gaussianity. In Proceedings of the 25th International Conference on Machine Learning. ACM.
• Jiao et al. (2013) Jiao, J., Permuter, H. H., Zhao, L., Kim, Y.-H. and Weissman, T. (2013). Universal estimation of directed information. Information Theory, IEEE Transactions on 59 6220–6242.
• Kamitake et al. (1984) Kamitake, T., Harashima, H. and Miyakawa, H. (1984). A time-series analysis method based on the directed transinformation. Electronics and Communications in Japan (Part I: Communications) 67 1–9.
• Lin et al. (2006) Lin, Y., Zhang, H. H. et al. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics 34 2272–2297.
• Materassi and Innocenti (2010) Materassi, D. and Innocenti, G. (2010). Topological identification in networks of dynamical systems. Automatic Control, IEEE Transactions on 55 1860–1871.
• Merlevède et al. (2009) Merlevède, F., Peligrad, M., Rio, E. et al. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In High dimensional probability V: the Luminy volume. Institute of Mathematical Statistics, 273–292.
• Peters et al. (2013) Peters, J., Janzing, D. and Schölkopf, B. (2013). Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems.
• Quinn et al. (2011) Quinn, C. J., Coleman, T. P., Kiyavash, N. and Hatsopoulos, N. G. (2011). Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. Journal of computational neuroscience 30 17–44.
• Quinn et al. (2015) Quinn, C. J., Kiyavash, N. and Coleman, T. P. (2015). Directed information graphs. Information Theory, IEEE Transactions on 61 6887–6909.
• Ravikumar et al. (2009) Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 1009–1030.
• Rissanen and Wax (1987) Rissanen, J. and Wax, M. (1987). Measures of mutual and causal dependence between two time series (corresp.). IEEE Transactions on information theory 33 598–601.
• Robins et al. (2000) Robins, J. M., Hernan, M. A. and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology.
• Rodriguez et al. (2011) Rodriguez, M. G., Balduzzi, D. and Schölkopf, B. (2011). Uncovering the temporal dynamics of diffusion networks. arXiv preprint arXiv:1105.0697 .
• Rudelson and Vershynin (2013) Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. ArXiv e-prints.
• Runge et al. (2015) Runge, J., Petoukhov, V., Donges, J. F., Hlinka, J., Jajcay, N., Vejmelka, M., Hartman, D., Marwan, N., Paluš, M. and Kurths, J. (2015). Identifying causal gateways and mediators in complex spatio-temporal systems. Nature communications 6 8502.
• Runge et al. (2017) Runge, J., Sejdinovic, D. and Flaxman, S. (2017). Detecting causal associations in large nonlinear time series datasets. arXiv preprint arXiv:1702.07007 .
• Stone (1985) Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics 689–705.
• van Handel (2014) van Handel, R. (2014). Probability in high dimension. Tech. rep., DTIC Document.
• Wang et al. (2014) Wang, Z., Liu, H. and Zhang, T. (2014). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of statistics 42 2164.
• Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 49–67.
• Zhang (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 894–942.
• Zhao et al. (2012) Zhao, T., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research 13 1059–1062.
• Zhou and Raskutti (2018) Zhou, H. and Raskutti, G. (2018). Non-parametric sparse additive auto-regressive network models. arXiv preprint arXiv:1801.07644 .
• Zhou et al. (1998) Zhou, S., Shen, X., Wolfe, D. et al. (1998). Local asymptotics for regression splines and confidence regions. The annals of statistics 26 1760–1782.

## 7 Appendix: Proofs of the Major Theoretical Results

To prove the main theorems, we first need the following technical lemmas.

### 7.1 Some Basic Definitions and Lemmas

For a loss function $\mathcal{L}$ that is twice differentiable, and a given sparsity level $s$, define

$$\rho_+(s)=\sup_{\|z\|_{0,2}\le s}\frac{z^\top \nabla^2 \mathcal{L}(\beta)\, z}{\|z\|_2^2}, \qquad \rho_-(s)=\inf_{\|z\|_{0,2}\le s}\frac{z^\top \nabla^2 \mathcal{L}(\beta)\, z}{\|z\|_2^2}. \tag{7.1}$$

Under this definition, $\rho_+(s)$ and $\rho_-(s)$ are the largest and smallest $s$-block-sparse eigenvalues of $\nabla^2 \mathcal{L}(\beta)$, respectively. For all $\beta, \beta'$, we say that a function $h$ is $a$-strongly convex ($a$-SC) if there is a positive real number $a$ such that

$$h(\beta')-h(\beta)-(\beta'-\beta)^\top \nabla h(\beta)\ \ge\ \frac{a}{2}\,\|\beta'-\beta\|_2^2.$$

We say that $h$ is $b$-smooth ($b$-S) if there is a positive real number $b$ such that

$$h(\beta')-h(\beta)-(\beta'-\beta)^\top \nabla h(\beta)\ \le\ \frac{b}{2}\,\|\beta'-\beta\|_2^2.$$

When the above inequalities hold only on a specific subset of $\mathbb{R}^d$, we say that $h$ is $a$-restricted strongly convex ($a$-RSC) or $b$-restricted smooth ($b$-RS), respectively.
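As a sanity check on the definition in (7.1), note that for a quadratic loss the Hessian is constant, so the $s$-block-sparse eigenvalues can be computed exactly by enumerating all supports of at most $s$ blocks and taking extreme eigenvalues of the corresponding principal submatrices. The sketch below is purely illustrative (the toy Hessian and block partition are assumptions, not part of the paper's method), and this brute-force enumeration is feasible only for a small number of blocks:

```python
import itertools

import numpy as np


def block_sparse_eigenvalues(H, blocks, s):
    """Exact rho_+(s), rho_-(s) for a constant Hessian H by support enumeration.

    blocks: list of integer index arrays, one per coordinate group.
    Illustrates definition (7.1); not the paper's algorithm.
    """
    rho_plus, rho_minus = -np.inf, np.inf
    for k in range(1, s + 1):
        for support in itertools.combinations(range(len(blocks)), k):
            idx = np.concatenate([blocks[j] for j in support])
            # Rayleigh quotients of z restricted to this support are governed
            # by the eigenvalues of the principal submatrix H[idx, idx].
            eig = np.linalg.eigvalsh(H[np.ix_(idx, idx)])  # ascending order
            rho_plus = max(rho_plus, eig[-1])
            rho_minus = min(rho_minus, eig[0])
    return rho_plus, rho_minus


# Toy example: 6 coordinates split into 3 blocks of 2, positive definite Hessian.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 6))
H = A.T @ A / 10 + 0.5 * np.eye(6)
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
rho_p, rho_m = block_sparse_eigenvalues(H, blocks, s=2)
```

By eigenvalue interlacing, $\rho_-(s)$ can only shrink and $\rho_+(s)$ can only grow as $s$ increases, with the full-eigenvalue range recovered at $s$ equal to the number of blocks.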

Suppose that $\mathcal{L}$ is twice differentiable. Then, for any $\beta, \beta'$ such that $\|\beta'-\beta\|_{0,2}\le s$, we have:

1. $\mathcal{L}$ is $\rho_-(s)$-RSC and $\rho_+(s)$-RS:

$$\frac{\rho_-(s)}{2}\,\|\beta'-\beta\|_2^2\ \le\ \mathcal{L}(\beta')-\mathcal{L}(\beta)-(\beta'-\beta)^\top \nabla \mathcal{L}(\beta)\ \le\ \frac{\rho_+(s)}{2}\,\|\beta'-\beta\|_2^2.$$

2. $\tilde{\mathcal{L}}_\lambda$ is $\tilde{\rho}_-(s)$-RSC and $\rho_+(s)$-RS:

$$\frac{\tilde{\rho}_-(s)}{2}\,\|\beta'-\beta\|_2^2\ \le\ \tilde{\mathcal{L}}_\lambda(\beta')-\tilde{\mathcal{L}}_\lambda(\beta)-(\beta'-\beta)^\top \nabla \tilde{\mathcal{L}}_\lambda(\beta)\ \le\ \frac{\rho_+(s)}{2}\,\|\beta'-\beta\|_2^2,$$

with $\tilde{\rho}_-(s)$ denoting the smallest $s$-block-sparse eigenvalue of $\nabla^2 \tilde{\mathcal{L}}_\lambda(\beta)$, as in (7.1).

3. For all $\lambda \ge 0$, the regularized objective $\tilde{\mathcal{L}}_\lambda(\beta) + \lambda\|\beta\|_{1,2}$ is $\tilde{\rho}_-(s)$-RSC.
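To make the sandwich bound in item 1 concrete: for a quadratic loss $\mathcal{L}(\beta)=\frac{1}{2}\beta^\top H\beta$, the Bregman divergence $\mathcal{L}(\beta')-\mathcal{L}(\beta)-(\beta'-\beta)^\top\nabla\mathcal{L}(\beta)$ equals $\frac{1}{2}d^\top H d$ with $d=\beta'-\beta$, so whenever $d$ is supported on at most $s$ blocks it must lie between $\frac{\rho_-(s)}{2}\|d\|_2^2$ and $\frac{\rho_+(s)}{2}\|d\|_2^2$. A minimal numerical check, assuming a toy Hessian and block partition (all names here are illustrative):

```python
import itertools

import numpy as np

# Toy quadratic loss L(b) = 0.5 * b^T H b with positive definite Hessian H.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 6))
H = A.T @ A / 20 + 0.1 * np.eye(6)
L = lambda b: 0.5 * b @ H @ b
grad_L = lambda b: H @ b

blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
s = 2

# rho_{+/-}(s): enumerate supports of exactly s blocks (constant Hessian => exact;
# smaller supports are subsumed by eigenvalue interlacing).
sub_eigs = []
for sup in itertools.combinations(range(len(blocks)), s):
    idx = np.concatenate([blocks[j] for j in sup])
    sub_eigs.append(np.linalg.eigvalsh(H[np.ix_(idx, idx)]))
rho_plus = max(e[-1] for e in sub_eigs)
rho_minus = min(e[0] for e in sub_eigs)

# Difference vector d supported on 2 blocks; check the sandwich at a random point.
d = np.zeros(6)
d[[0, 1, 4, 5]] = rng.standard_normal(4)  # support = blocks 0 and 2
b = rng.standard_normal(6)
bregman = L(b + d) - L(b) - d @ grad_L(b)  # equals 0.5 * d^T H d here
lower = rho_minus / 2 * (d @ d)
upper = rho_plus / 2 * (d @ d)
```

For the non-quadratic losses treated in the paper, the same bounds follow because the Hessian at every point has its restricted eigenvalues inside $[\rho_-(s), \rho_+(s)]$ by definition (7.1).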