Traditional methods for forecasting stationary multivariate time series from their own past are derived from the classical linear ARMA modelling. In these, the prediction of the next point in the future of the series is constructed as a linear function of the past observations. The use of linear functions as the predictors is in part based on the Wold representation theorem (e.g. 
) and in part, probably more importantly, on the fact that the linear predictor is the best predictor (in the mean-square-error sense) in case the time series is Gaussian.
The Gaussian assumption is therefore often adopted in the analysis of time series to justify the simple linear modelling. However, it is indeed a simplifying assumption since for non-Gaussian series the best predictor may very well be a non-linear function of the past observations. A number of parametric non-linear models has been proposed in the literature, each adapted to capture specific sources of non-linearity (for example multiple forms of regime-switching models, e.g. ).
In this paper we adopt an approach that does not rely on such prior assumptions for the function form. We propose to learn the predictor as a general vector-valued function that takes as input the past observations of the multivariate series and outputs the forecast of the unknown next value (vector).
We have two principal requirements on the function . The first is the standard prediction accuracy requirement. That is, the function shall be such that we can expect its outputs to be close (in the squared error sense) to the true future observations of the process. The second requirement is that the function shall have a structure that will enable the analysis of the relationships amongst the subprocesses of the multivariate series. Namely, we wish to understand how parts of the series help in forecasting other parts of the multivariate series, a concept known in the time-series literature as graphical Granger modelling [11, 9].
To learn such a function we employ the framework of regularised learning of vector-valued functions in the reproducing kernel Hilbert space (RKHS) . Learning methods based on the RKHS theory have previously been considered for time series modelling (e.g. [10, 19, 15]). Though, as Pillonetto et al. note in their survey , their adoption for the dynamical system analysis is not a commonplace.
A critical step in kernel-based methods for learning vector-valued functions is the specification of the operator-valued kernel that exploits well the relationships between the inputs and the outputs. A convenient and well-studied class of operator-valued kernels (e.g. in [7, 8, 12]) are those decomposable into a product of a scalar kernel on the input space (input kernel) and a linear operator on the output space (output kernel).
The kernel uniquely determines the function space within which the function is learned. It thus has significant influence on both our objectives described above. Instead of having to choose the input and the output kernels a priori, we introduce a method for learning the input and output kernels from the data together with learning the vector-valued function .
Our method combines in a novel way the multiple-kernel learning (MKL) approach  with learning the output kernels within the space of positive semidefinite linear operators on the output space . MKL methods for operator-valued kernels have recently been developed in  and . The first learns a convex combination of a set of operator-valued kernels fixed in advance, the second combines a fixed set of input kernels with a single learned output kernel. To the best of our knowledge, ours is the first method in which the operator-valued kernel is learned by combining a set of input kernels with a set of multiple learned output kernels.
In accordance with our second objective stated above, we impose specific structural constraints on the function search space so that the learned function supports the graphical Granger analysis. We achieve this by working with matrix-valued kernels operating over input partitions restricted to single input scalar series (similar input partitioning has recently been used in ).
We impose diagonal structure on the output kernels to control the model complexity. Though this has a cost in the inability to model contemporaneous relationships, it addresses the strong over-parametrisation in a principled manner. It also greatly simplifies the final structure of the problem, which, in result, suitably decomposes into a set of smaller independent problems solvable in parallel.
We develop two forms of sparsity-promoting regularisation approaches for learning the output kernels. These are based on the and norms respectively and are motivated by the search for Granger-causality relationships. As to our knowledge, the latter has not been previously used in the context of MKL.
Finally, we confirm on experiments the benefits our methods can bring to forecasting non-Gaussian series in terms of improved predictive accuracy and the ability to recover hidden dynamic dependency structure within the time series systems. This makes them valid alternatives to the state-of-the-art graphical Granger techniques.
We use bold upper case and lower case letters for matrices and vectors respectively, and the plain letters with subscripts for their elements. For any matrix or vector the superscript denotes its transpose. Vectors are by convention column-wise so that is the -dimensional vector . are the sets of real scalars, -dimensional vectors, and dimensional matrices. is the set of non-negative matrices, the set of positive semi-definite matrices and the set of non-negative diagonal matrices. is the set of positive integers . For any vectors , are the standard inner product, and norms in the real Hilbert spaces. For any square matrix , denotes the trace. For any two matrices , is the Frobenius inner product and the Frobenius norm. and are the inner product and norm in the Hilbert space .
2 Problem Formulation
Given a realisation of a discrete stationary multivariate time series process , our goal is to learn a vector-valued function that takes as input the past observations of the process and predicts its future vector value (one step ahead). The function shall be such that
we can expect the prediction to be near (in the Euclidean distance sense) the unobserved future value
For notational simplicity, from now on we indicate the output of the function as and the input as (bearing in mind that is in fact the -th order Cartesian product of and that the inputs and outputs are the past and future observations of the same -dimensional series). We also align the time indexes so that our data sample consists of input-output data pairs .
Following the standard function learning theory, we will learn by minimising the regularised empirical squared-error risk (with a regularization parameter )
Here is the reproducing kernel Hilbert space (RKHS) of -valued functions endowed with the norm and the inner product . The RKHS is uniquely associated with a symmetric positive-semidefinite matrix-valued kernel with the reproducing property
where the coefficients are the solutions of the system of linear equations
where if and is zero otherwise.
2.1 Granger-causality Analysis
To study the dynamical relationships in time series processes, Granger  proposed a practical definition of causality based on the accuracy of least-squares predictor functions. In brief, for two time series processes and , is said to Granger-cause () if given all the other relevant information we can predict the future of better (in the mean-square-error sense) using the history of than without it.
Though the concept seems rather straightforward, there are (at least) three points worth considering. First, the notion is purely technical based on the predictive accuracy of functions with differing input sets; it does not seek to understand the underlying forces driving the relationships. Second, in practice the conditioning set of information needs to be reduced to all the available information instead of all the relevant information. Third, it only considers relationships between pairs of (sub-)processes and not the interactions amongst a set of series.
extended the concept to multivariate analysis through graphical models. The discussion in the paper focuses on the notion of Granger non-causality rather than causality and describes the specific Markov properties (conditional non-causality) encoded in the graphs of Granger-causal relationships. In this sense, the absence of a variable in a set of inputs is more informative of the Granger (non-)causality than its presence. In result, graphical Granger methods are typically based on (structured) sparse modelling.
3 Function Space and Kernel Specification
The function space within which is learned is fully determined by the reproducing kernel . Its specification is therefore critical for achieving the two objectives for the function defined in Sect. 2. We focus on the class of matrix-valued kernels decomposable into the product of input kernels, capturing the similarities in the inputs, and output kernels, encoding the relationships between the outputs.
To analyse the dynamical dependencies between the series, we need to be able to discern within the inputs of the learned function the individual scalar series. Therefore we partition the elements of the input vectors according to the source scalar time series. In result, instead of a single kernel operating over the full vectors, we work with multiple partition-kernels, each of them operating over a single input series. We further propose to learn the partition-kernels by combining the MKL techniques with output kernel learning within the cone of positive semi-definite matrices.
More formally, the kernel we propose to use is constructed as a sum of kernels , where is the number of the individual scalar-valued series in the multivariate process (dimensionality of the output space ). Each is a matrix-valued kernel that determines its own RKHS of vector-valued functions. The domains are sets of vectors constructed by selecting from the inputs only the coordinates that correspond to the past of a single scalar time series .
Further, instead of choosing the individual matrix-valued functions , we propose to learn them. We construct each again as a sum of kernels of possibly uneven number of summands of matrix-valued kernels . For this lowest level we focus on the family of decomposable kernels . Here, the input kernels capturing the similarity between the inputs are fixed in advance from a dictionary of valid scalar-valued kernels (e.g. Gaussian kernels with varying scales). The set of output kernels encoding the relations between the outputs is learned within the cone of symmetric positive semidefinite matrices .
3.1 Kernel Learning and Function Estimation
Learning all the output kernels as full PSD matrices implies learning more than parameters. To improve the generalization capability, we reduce the complexity of the problem drastically by restricting the search space for ’s to PSD diagonal matrices . This essentially corresponds to the assumption of no contemporaneous relationships between the series. We return to this point in Sect. 5.
As explained in Sect. 2.1, Granger (non-)causality learning typically searches for sparse models. We bring this into our methods by imposing a further sparsity inducing regularizer on the set of the output kernels . We motivate and elaborate suitable forms of in Sect. 3.2.
The joint learning of the kernels and the function can now be formulated as the problem of finding the minimising solution and ’s of the regularised functional
where is the regularised risk from (1). By calling on the properties of the RKHS, we reformulate this as a finite dimensional problem that can be addressed by conventional finite-dimensional optimisation approaches. We introduce the gram matrices such that for all , the output data matrix such that , and the coefficient matrix such that .
3.2 Sparse Regularization
The construction of the kernel and the function space described in Sect. 3 imposes on the function the necessary structure that allows the Granger-causality analysis (as per our 2nd objective set-out in Sect. 2). As explained in Sect. 2.1, the other ingredient we need to identify the Granger non-causalities is sparsity within the structure of the learned function.
In our methods, the sparsity is introduced by the regularizer . By construction of the function space, we can examine the elements of the output kernels (their diagonals) to make statements about the Granger non-causality. We say the -th scalar time series is non-causal for the series (given all the remaining series in the process) if for all .
Essentially, any of the numerous regularizers that exist for sparse or structured sparse learning  could be used as , possibly based on some prior knowledge about the underlying dependencies within the time-series process.
We elaborate here two cases that do not assume any special structure in the dependencies as the base scenarios. The first is the entry-wise norm across all the output kernels so that
The second is the grouped norm
After developing the learning strategy for these in Sect(s). 4.1 and 4.2, we provide some more intuition of their effects on the models and link to some other known graphical Granger techniques in Sect. 5.
4 Learning Strategy
First of all, we simplify the final formulation of the problem (7) in Sect. 3.1. Rather than working with a set of diagonal matrices , we merge the diagonals into a single matrix . We then re-formulate the problem with respect to this single matrix in place of the set and show how this reformulation can be suitably decomposed into smaller independent sub-problems.
We develop fit-to-purpose approaches for our two regularisers in Sect(s). 4.1 and 4.2. The first - based on the decomposition of the kernel matrices into the corresponding empirical features and on the variational formulation of norms  - shows the equivalence of the problem with group lasso [22, 23]. The second proposes a simple alternating minimisation algorithm to obtain the two sets of parameters.
We introduce the non-negative matrix such that
(each row in corresponds to the diagonal of one output kernel; if for all we have ). Using this change of variable, the optimisation problem (7) can be written equivalently as
and is the equivalent of so that
From (12)-(14) we observe that, with both of our regularizers, problem (11) is conveniently separable along into the sum of smaller independent problems, one per scalar output series. These can be efficiently solved in parallel, which makes our method scalable to very large multivariate systems. The final complexity depends on the choice of the regulariser and the appropriate algorithm. The overhead cost can be significantly reduced by precalculating the gram matrices in a single preprocessing step and sharing these in between the parallel tasks.
4.1 Learning with Norm
To unclutter notation we replace the bracketed double superscripts by a single superscript . We also drop the regularization parameter (fix it to ) as it is easy to show that any other value can be absorbed into the rescaling of and the and matrices. For each of the parallel tasks we indicate , and so that the individual problems are the minimisations with respect to and of
We decompose (for example by eigendecomposition) each of the gram matrices as , where is the matrix of the empirical features, and we introduce the variables and the set . Using these we rewrite111We extend the function to the point by taking the convention . equation (15)
We first find the closed form of the minimising solution for as for all . Plugging this back to (16) we obtain
Seen as a minimisation with respect to the set this is the classical group-lasso formulation with the empirical features as inputs. Accordingly, it can be solved by any standard method for group-lasso problems such as the proximal gradient descent method, e.g. , which we employ in our experiments. After solving for we can directly recover from the above minimising identity and then obtain the parameters from the set of linear equations
The algorithm outlined above takes advantage of the convex group-lasso reformulation (17) and has the standard convergence and complexity properties of proximal gradient descent. The empirical features can be pre-calculated and shared amongst the tasks to reduce the overhead cost.
4.2 Learning with Norm
For the regularization, we need to return to the double indexation to make clear how the groups are created. As above, for each of the parallel tasks we use the vectors and . However, for vector we will keep the notation for its elements. The individual problems are the minimisations with respect to and of
We propose to use the alternating minimisation with a proximal gradient step. At each iteration, we alternatively solve for and . For fixed we obtain from the set of linear equations (18). With fixed , problem (19) is a group lasso for with groups defined by the sub-index within the double indexation of the elements of . Here, the proximal gradient step takes place to move along the descend direction for . Though convex in and individually, the problem (19) is jointly non-convex and therefore can converge to local minima.
5 Interpretation and Crossovers
To help the understanding of the inner workings of our methods and especially the effects of the two regularizers, we discuss here the crossovers to other existing methods for MKL and Granger modelling.
The link to group-lasso demonstrated in Sect. 4.1 is not in itself too surprising. The formulation in (15) can be recognised as a sparse multiple kernel learning problem which has been previously shown to relate to group-lasso (eg. , ). We derive this link in Sect. 4.1 using the empirical feature representation to i) provide better intuition for the structure of the learned function , ii) develop an efficient algorithm for solving problem (15).
The re-formulation in terms of the empirical features creates an intuitive bridge to the classical linear models. Each can be seen as a matrix of features generated from a subset of the input coordinates relating to the past of a single scalar time series . The group-lasso regularizer in equation (17) has a sparsifying effect at the level of these subsets zeroing out (or not) the whole groups of parameters . In the context of linear methods, this approach is known as the grouped graphical Granger modelling .
Within the non-linear approaches to time series modelling, Sindhwani et al.  recently derived a similar formulation. There the authors followed a strategy of multiple kernel learning from a dictionary of input kernels combined with a single learned output kernel (as opposed to our multiple output kernels). They obtain their IKL model, which is in its final formulation equivalent to problem (15), by fixing the output kernel to identity.
Though we initially formulate our problem quite differently, the diagonal constraint we impose on the output kernels essentially prevents the modelling of any contemporaneous relationships between the series (as does the identity output kernel matrix in IKL). What remains in our methods are the diagonal elements, which are non-constant and sparse, and which can be interpreted as the weights of the input kernels in the standard MKL setting.
The more complex regularisation discussed in Sect. 4.2 is to the best of our knowledge novel in the context of multiple kernel learning. It has again a strong motivation and clear interpretation in terms of the graphical Granger modelling. The norm has a sparsifying effect not only at the level of the individual kernels but at the level of the groups of kernels operating over the same input partitions . In this respect our move from the to the norm has a parallel in the same move in linear graphical Granger techniques. The norm Lasso-Granger method  imposes the sparsity on the individual elements of the parameter matrices in a linear model, while the of the grouped-Lasso-Granger  works with groups of the corresponding parameters of a single input series across the multiple lags .
To document the performance of our method, we have conducted a set of experiments on real and synthetic datasets. In these we simulate real-life forecasting exercise by splitting the data into a training and a hold-out set which is unseen by the algorithm when learning the function and is only used for the final performance evaluation.
We compare our methods with the output kernel regularization (NVARL1) and and
(NVARL12) with simple baselines (which nevertheless are often hard to beat in practical time series forecasting) as well as with the state-of-the-art techniques for forecasting and Granger modelling. Namely, we compare with simple mean and univariate linear autoregressive models (LAR), multivariate linear vector autoregressive model withpenalty (LVARL2), the group-lasso Granger method  (LVARL1), and a sparse MKL without the input partitioning (NVAR). Of these, the last two are the most relevant competitors. LVARL1, similarly to our methods, aims at recovering the Granger structure but is strongly constrained to linear modelling only. NVAR has no capability to capture the Granger relationships but, due to the lack of structural constraints, it is the most flexible of all the models.
We evaluate our results with respect to the two objectives for the function defined in Sect. 2. We measure the accuracy of the one-step ahead forecasts by the mean square error (MSE) for the whole multivariate process averaged over 500 hold-out points. The structural objective allowing the analysis of dependencies between the sub-processes is wired into the method itself (see Sect(s). 3 and 2.1) and is therefore satisfied by construction. We produce adjacency matrices of the graphs of the learned dependencies, compare these with the ones produced by the linear Granger methods and comment on the observed results.
6.1 Technical Considerations
For each experiment we preprocessed the data by removing the training sample mean and rescaling with the training sample standard deviation. We fix the number of kernels for each input partition to six (for all ) and use the same kernel functions for all experiments: a linear, 2nd order and 3rd polynomial, and Gaussian kernels with width 0.5, 1 and 2. We normalise the kernels so that the training Gram matrices have trace equal to the size of the training sample.
We search for the hyper-parameter by a 5-fold cross-validation within a 15-long logarithmic grid , where is the training sample size and is the number of kernels or groups (depending on the method). In each grid search, we use the previous parameter values as warm starts. We do not perform an exhaustive search for the optimal lag for each of the scalar input series by some of the classical testing procedures (based on AIC, BIC etc.). We instead fix it to for all series in all experiments and rely on the regularization to control any excess complexity.
We implemented our own tools for all the tested methods based on variations of proximal gradient descent with ISTA line search . The full Matlab code is available at https://bitbucket.org/dmmlgeneva/nonlinear-granger
6.2 Synthetic Experiments
We have simulated data from a five dimensional non-Gaussian time-series process generated through a linear filter of a 5-dimensional i.i.d. exponential white noise
with identity covariance matrix (re-centered to zero and re-scaled to unit variance). The matrixin the filter is such that the process consists of two independent internally interrelated sub-processes, one composed of the first 3 scalar series, the other of the remaining two series. This structural information, though known to us, is unknown to the learning methods (not considered in the learning process).
We list in Table 1 the predictive performance of the tested methods in terms of the average hold-out MSE based on training samples of varying size. Our methods clearly outperform all the linear models. The functionally strongly constrained linear LVARL1 performs roughly on par with our methods for the small sample sizes. But for larger sample sizes, the higher flexibility of the function space in our methods yields significantly more accurate forecasts (as much as 10% MSE improvement).
In brackets is the average standard deviation (std) of the MSEs. Results for NVARL1 and NVARL12 in bold are significantly better than all the linear competitors, in italics
are significantly better than the non-linear NVAR (using one-sided paired-sample t-test at 10% significance level).
The structural constraints in our methods also help the performance when competing with the unstructured NVAR method, which has mostly less accurate forecasts. At the same time, as illustrated in Fig. 1, our methods are able to correctly recover the Granger-causality structure (splitting the process into the two independent subprocesses by the zero off-diagonal blocks), which NVAR by construction cannot.
6.3 Real Data Experiments
We use data on water physical discharge publicly available from the website of the Water Services of the US geological survey (http://www.usgs.gov/). Our dataset consists of 9 time series of daily rates of year-on-year growth at measurement sites along the streams of Connecticut and Columbia rivers.
The prediction accuracy of the tested methods is listed in Table 2. Our non-linear methods perform on par with the state-of-the-art linear models. This on one hand suggests that for the analysed dataset the linear modelling seems sufficient. On the other hand, it confirms that our methods, which in general have the ability to learn more complex relationships by living in a richer functional space, are well behaved and can capture simpler dependencies as well. The structure encoded into our methods, however, benefits the learning since the unstructured NVAR tends to perform less accurately.
In brackets is the average standard deviation (std) of the MSEs.
The learned dynamical dependence structure of the time series is depicted in Fig. 1. In the dataset (and the adjacency matrices), the first 4 series are the Connecticut measurement sites starting from the one highest up the stream and moving down to the mouth of the river. The next 5 our the Columbia measurement sites ordered in the same manner.
From inspecting the learned adjacency matrices, we observe that all the sparse methods recover similar Granger-causal structures. Since we do not know the ground truth in this case, we can only speculate about the accuracy of the structure recovery. Nevertheless, it seems plausible that there is little dynamical cross-dependency between the Connecticut and Columbia measurements as the learned graphs suggest (the two rivers are at the East and West extremes of the US).
We have developed a new method for forecasting and Granger-causality modelling in multivariate time series that does not rely on prior assumptions about the shape of the dynamical dependencies (other than being sparse). The method is based on learning a combination of multiple operator-valued kernels in which the multiple output kernels are learned as sparse diagonal matrices. We have documented on experiments that our method outperforms linear competitors in the presence of strong non-linearities and is able to correctly recover the Granger-causality structure which non-structured kernel methods cannot do.
This work was partially supported by the research projects HSTS (ISNET) and RAWFIE #645220 (H2020). We thank Francesco Dinuzzo for helping to form the initial ideas behind this work through fruitful discussions while visiting in IBM Research, Dublin.
-  Arnold, A., Liu, Y., Abe, N.: Temporal causal modeling with graphical granger methods. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’07 (2007)
Bach, F.: Consistency of the group lasso and multiple kernel learning. The Journal of Machine Learning Research (2008)
-  Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning (2012)
-  Bahadori, M., Liu, Y.: An Examination of Practical Granger Causality Inference. SIAM Conference on Data Mining (2013)
-  Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications (2009)
-  Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer Science+Business Media, LLC, 2nd edn. (2006)
-  Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal Multi-Task Kernels. Machine Larning Research (2008)
-  Dinuzzo, F., Ong, C.: Learning output kernels with block coordinate descent. In: International Conference on Machine Learning (ICML) (2011)
Eichler, M.: Graphical modelling of multivariate time series. Probability Theory and Related Fields (2012)
-  Franz, M.O., Schölkopf, B.: A unifying view of wiener and volterra theory and polynomial kernel regression. Neural computation (2006)
-  Granger, C.W.J.: Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica: Journal of the Econometric Society (1969)
-  Jawanpuria, P., Lapin, M., Hein, M., Schiele, B.: Efficient Output Kernel Learning for Multiple Tasks. In: NIPS (2015)
-  Kadri, H., Rakotomamonjy, A., Bach, F., Preux, P.: Multiple Operator-valued Kernel Learning. In: NIPS (2012)
-  Lanckriet, G.G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research (2004)
-  Lim, N., D’Alché-Buc, F., Auliac, C., Michailidis, G.: Operator-valued Kernel-based Vector Autoregressive Models for Network Inference. Machine Learning (2014)
-  Lozano, A.C., Abe, N., Liu, Y., Rosset, S.: Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics (Oxford, England) (2009)
-  Micchelli, C.A., Pontil, M.: On learning vector-valued functions. Neural computation (2005)
Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., Ljung, L.: Kernel methods in system identification, machine learning and function estimation: A survey. Automatica (2014)
-  Sindhwani, V., Minh, H.Q., Lozano, A.: Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality. In: UAI (2013)
-  Turkman, K.F., Scotto, M.G., de Zea Bermudez, P.: Non-Linear Time Series. Springer (2014)
-  Xu, Z., Jin, R., Yang, H., King, I., Lyu, M.R.: Simple and efficient multiple kernel learning by group lasso. International Conference on Machine Learning (ICML) (2010)
-  Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2006)
-  Zhao, P., Rocha, G.: Grouped and hierarchical model selection through composite absolute penalties (2006)