1 Introduction
Traditional methods for forecasting stationary multivariate time series from their own past are derived from classical linear ARMA modelling. In these, the prediction of the next point in the future of the series is constructed as a linear function of the past observations. The use of linear functions as the predictors is in part based on the Wold representation theorem (e.g. [6]) and in part, probably more importantly, on the fact that the linear predictor is the best predictor (in the mean-square-error sense) in case the time series is Gaussian.
The Gaussian assumption is therefore often adopted in the analysis of time series to justify the simple linear modelling. However, it is indeed a simplifying assumption, since for non-Gaussian series the best predictor may very well be a nonlinear function of the past observations. A number of parametric nonlinear models have been proposed in the literature, each adapted to capture specific sources of nonlinearity (for example the multiple forms of regime-switching models, e.g. [20]).
In this paper we adopt an approach that does not rely on such prior assumptions about the functional form. We propose to learn the predictor as a general vector-valued function f that takes as input the past observations of the multivariate series and outputs the forecast of the unknown next value (vector).
We have two principal requirements on the function f. The first is the standard prediction accuracy requirement: the function shall be such that we can expect its outputs to be close (in the squared-error sense) to the true future observations of the process. The second requirement is that the function shall have a structure that enables the analysis of the relationships amongst the sub-processes of the multivariate series. Namely, we wish to understand how parts of the series help in forecasting other parts of the multivariate series, a concept known in the time-series literature as graphical Granger modelling [11, 9].
To learn such a function we employ the framework of regularised learning of vector-valued functions in a reproducing kernel Hilbert space (RKHS) [17]. Learning methods based on RKHS theory have previously been considered for time series modelling (e.g. [10, 19, 15]), though, as Pillonetto et al. note in their survey [18], their adoption for dynamical system analysis is not yet commonplace.
A critical step in kernel-based methods for learning vector-valued functions is the specification of an operator-valued kernel that exploits well the relationships between the inputs and the outputs. A convenient and well-studied class of operator-valued kernels (e.g. in [7, 8, 12]) are those decomposable into a product of a scalar kernel on the input space (input kernel) and a linear operator on the output space (output kernel).
The kernel uniquely determines the function space within which the function f is learned; it thus has significant influence on both our objectives described above. Instead of having to choose the input and output kernels a priori, we introduce a method for learning the input and output kernels from the data together with learning the vector-valued function f.
Our method combines in a novel way the multiple-kernel learning (MKL) approach [14] with learning the output kernels within the space of positive semidefinite linear operators on the output space [12]. MKL methods for operator-valued kernels have recently been developed in [13] and [19]. The first learns a convex combination of a set of operator-valued kernels fixed in advance; the second combines a fixed set of input kernels with a single learned output kernel. To the best of our knowledge, ours is the first method in which the operator-valued kernel is learned by combining a set of input kernels with a set of multiple learned output kernels.
In accordance with our second objective stated above, we impose specific structural constraints on the function search space so that the learned function supports graphical Granger analysis. We achieve this by working with matrix-valued kernels operating over input partitions restricted to single input scalar series (similar input partitioning has recently been used in [19]).
We impose a diagonal structure on the output kernels to control the model complexity. Though this comes at the cost of being unable to model contemporaneous relationships, it addresses the strong over-parametrisation in a principled manner. It also greatly simplifies the final structure of the problem, which, as a result, suitably decomposes into a set of smaller independent problems solvable in parallel.
We develop two forms of sparsity-promoting regularisation for learning the output kernels, based on the ℓ1 and ℓ1/ℓ2 norms respectively, both motivated by the search for Granger-causality relationships. To our knowledge, the latter has not previously been used in the context of MKL.
Finally, we confirm in experiments the benefits our methods can bring to forecasting non-Gaussian series in terms of improved predictive accuracy and the ability to recover hidden dynamic dependency structure within the time series systems. This makes them valid alternatives to the state-of-the-art graphical Granger techniques.
Notation
We use bold upper-case and lower-case letters for matrices and vectors respectively, and plain letters with subscripts for their elements. For any matrix or vector, the superscript ⊤ denotes its transpose. Vectors are by convention column-wise. ℝ, ℝ^n and ℝ^{n×n} are the sets of real scalars, n-dimensional vectors, and n×n matrices. ℝ_+^{n×n} is the set of non-negative matrices, S_+^n the set of positive semidefinite matrices, and D_+^n the set of non-negative diagonal matrices. For any vectors u, v, ⟨u, v⟩, ‖·‖_1 and ‖·‖_2 are the standard inner product and the ℓ1 and ℓ2 norms in the real Hilbert spaces. For any square matrix A, tr(A) denotes the trace. For any two matrices A, B of matching dimensions, ⟨A, B⟩_F is the Frobenius inner product and ‖A‖_F the Frobenius norm. ⟨·, ·⟩_H and ‖·‖_H are the inner product and norm in the Hilbert space H.
2 Problem Formulation
Given a realisation of a discrete stationary multivariate time series process, our goal is to learn a vector-valued function f that takes as input the past observations of the process and predicts its future vector value (one step ahead). The function shall be such that we can expect the prediction to be near (in the Euclidean distance sense) the unobserved future value.
For notational simplicity, from now on we indicate the output of the function as y and the input as x (bearing in mind that the input space is in fact the p-th order Cartesian product of the output space and that the inputs and outputs are the past and future observations of the same m-dimensional series). We also align the time indexes so that our data sample consists of the input-output data pairs (x_t, y_t), t = 1, …, n.
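The alignment of inputs and outputs can be made concrete with a short sketch (the helper name is ours, not from the paper's implementation): each input x_t stacks the p most recent observations and the target y_t is the next vector value.

```python
import numpy as np

def make_lagged_pairs(series, p):
    """Form input-output pairs (x_t, y_t) from an (n, m) multivariate
    sample: x_t stacks the p past observations, y_t is the next value."""
    n, m = series.shape
    X = np.stack([series[t - p:t].ravel() for t in range(p, n)])  # (n - p, p * m)
    Y = series[p:]                                                # (n - p, m)
    return X, Y
```

The per-series coordinates of x_t remain contiguous in this layout, which is what the input partitioning of Sect. 3 relies on.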
Following standard function learning theory, we learn f by minimising the regularised empirical squared-error risk (with a regularisation parameter λ > 0)

(1)  f̂ = argmin_{f ∈ H} Σ_{t=1}^n ‖y_t − f(x_t)‖₂² + λ ‖f‖_H²
Here H is the reproducing kernel Hilbert space (RKHS) of ℝ^m-valued functions endowed with the norm ‖·‖_H and the inner product ⟨·, ·⟩_H. The RKHS is uniquely associated with a symmetric positive-semidefinite matrix-valued kernel H(·, ·) with the reproducing property
⟨f(x), y⟩ = ⟨f, H_x y⟩_H, where the map H_x : ℝ^m → H is the kernel section of H centred at x, such that (H_x y)(x') = H(x', x) y for all x'. From the classical result in [17], the unique solution of the variational problem (1) admits a finite-dimensional representation
(2)  f̂(x) = Σ_{t=1}^n H(x, x_t) c_t
where the coefficients c_t ∈ ℝ^m are the solutions of the system of linear equations

(3)  Σ_{s=1}^n (H(x_t, x_s) + λ δ_{ts} I) c_s = y_t,  t = 1, …, n,

where δ_{ts} = 1 if t = s and is zero otherwise.
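For intuition, the system (3) can be solved efficiently in the special case of a single decomposable kernel H(x, x') = k(x, x') L, using the eigendecompositions of the scalar Gram matrix and of L. The sketch below is a simplification we add for illustration (the paper's kernel is a sum of such terms, handled differently in Sect. 4):

```python
import numpy as np

def fit_decomposable(K, L, Y, lam):
    """Solve the system (3) for a single decomposable kernel
    H(x, x') = k(x, x') L: find the coefficient matrix C with
    K C L + lam * C = Y, via K = U diag(s) U^T and L = V diag(d) V^T."""
    s, U = np.linalg.eigh(K)
    d, V = np.linalg.eigh(L)
    Yt = U.T @ Y @ V
    Ct = Yt / (np.outer(s, d) + lam)   # elementwise: (s_i d_j + lam) Ct_ij = Yt_ij
    return U @ Ct @ V.T

def predict(K_new, L, C):
    """Stacked predictions f(x) = sum_t k(x, x_t) L c_t for the rows of K_new."""
    return K_new @ C @ L
```

In the transformed coordinates the n·m coupled equations decouple into scalar divisions, so the cost is dominated by the two eigendecompositions.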
2.1 Grangercausality Analysis
To study the dynamical relationships in time series processes, Granger [11] proposed a practical definition of causality based on the accuracy of least-squares predictor functions. In brief, for two time series processes x and y, x is said to Granger-cause y (x → y) if, given all the other relevant information, we can predict the future of y better (in the mean-square-error sense) using the history of x than without it.
Though the concept seems rather straightforward, there are (at least) three points worth considering. First, the notion is purely technical, based on the predictive accuracy of functions with differing input sets; it does not seek to understand the underlying forces driving the relationships. Second, in practice the conditioning set of information needs to be reduced to all the available information instead of all the relevant information. Third, it only considers relationships between pairs of (sub)processes and not the interactions amongst a set of series.
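To make the definition concrete, a minimal linear illustration (the helper is ours; Granger's definition is model-agnostic) compares the in-sample MSE of predicting y from its own past against predicting it from the joint past of y and x:

```python
import numpy as np

def granger_improvement(x, y, p=2):
    """MSE of predicting y_t from its own p-lag past minus the MSE of
    predicting it from the past of both y and x, using ordinary least
    squares as a stand-in for the least-squares predictor functions."""
    n = len(y)
    rows_own = [y[t - p:t] for t in range(p, n)]
    rows_both = [np.concatenate([y[t - p:t], x[t - p:t]]) for t in range(p, n)]
    target = y[p:]
    def ols_mse(rows):
        A = np.column_stack([np.ones(len(rows)), np.array(rows)])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.mean((target - A @ beta) ** 2)
    return ols_mse(rows_own) - ols_mse(rows_both)  # clearly > 0 suggests x -> y
```

When y is driven by lagged x, the improvement is large in the x → y direction and negligible in the reverse, which is exactly the asymmetry the definition exploits.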
Eichler [9] extended the concept to multivariate analysis through graphical models. The discussion in that paper focuses on the notion of Granger non-causality rather than causality and describes the specific Markov properties (conditional non-causality) encoded in the graphs of Granger-causal relationships. In this sense, the absence of a variable in a set of inputs is more informative of the Granger (non-)causality than its presence. As a result, graphical Granger methods are typically based on (structured) sparse modelling
[4].
3 Function Space and Kernel Specification
The function space within which f is learned is fully determined by the reproducing kernel H. Its specification is therefore critical for achieving the two objectives for the function defined in Sect. 2. We focus on the class of matrix-valued kernels decomposable into the product of input kernels, capturing the similarities in the inputs, and output kernels, encoding the relationships between the outputs.
To analyse the dynamical dependencies between the series, we need to be able to discern within the inputs of the learned function the individual scalar series. We therefore partition the elements of the input vectors according to the source scalar time series. As a result, instead of a single kernel operating over the full vectors, we work with multiple partition-kernels, each of them operating over a single input series. We further propose to learn the partition-kernels by combining MKL techniques with output kernel learning within the cone of positive semidefinite matrices.
More formally, the kernel we propose to use is constructed as a sum of kernels H = Σ_{j=1}^m H_j, where m is the number of individual scalar-valued series in the multivariate process (the dimensionality of the output space ℝ^m). Each H_j is a matrix-valued kernel that determines its own RKHS of vector-valued functions. The domains of the H_j are sets of vectors constructed by selecting from the inputs only the coordinates x^(j) that correspond to the past of the single scalar time series j.
Further, instead of choosing the individual matrix-valued kernels H_j, we propose to learn them. We construct each H_j again as a sum, with a possibly uneven number c_j of summands, of matrix-valued kernels H_jl. For this lowest level we focus on the family of decomposable kernels H_jl = k_jl L_jl. Here, the input kernels k_jl capturing the similarity between the inputs are fixed in advance from a dictionary of valid scalar-valued kernels (e.g. Gaussian kernels with varying scales). The set of output kernels L_jl encoding the relations between the outputs is learned within the cone of symmetric positive semidefinite matrices:
(4)  H(x, x') = Σ_{j=1}^m Σ_{l=1}^{c_j} k_jl(x^(j), x'^(j)) L_jl
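A direct transcription of (4) can be sketched as follows (the container layout is an assumption of ours: a list of coordinate-index arrays for the partitions, and per-partition lists of input kernel functions and output kernel matrices):

```python
import numpy as np

def kernel_value(x, xp, partitions, input_kernels, output_kernels):
    """Evaluate the matrix-valued kernel of (4): a sum over input
    partitions j and dictionary entries l of k_jl(x_j, x'_j) * L_jl.
    `partitions[j]` holds the coordinate indices of input series j."""
    m = len(partitions)
    H = np.zeros((m, m))
    for j, idx in enumerate(partitions):
        for k_fn, L in zip(input_kernels[j], output_kernels[j]):
            H += k_fn(x[idx], xp[idx]) * L
    return H
```

Since every summand is a symmetric scalar kernel times a symmetric PSD matrix, the resulting matrix-valued kernel inherits symmetry and positive semidefiniteness.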
3.1 Kernel Learning and Function Estimation
Learning all the output kernels L_jl as full PSD matrices implies learning a number of parameters quadratic in m for every kernel. To improve the generalisation capability, we reduce the complexity of the problem drastically by restricting the search space for the L_jl's to PSD diagonal matrices. This essentially corresponds to the assumption of no contemporaneous relationships between the series. We return to this point in Sect. 5.
As explained in Sect. 2.1, Granger (non-)causality learning typically searches for sparse models. We bring this into our methods by imposing a further sparsity-inducing regulariser Ω on the set of the output kernels L_jl. We motivate and elaborate suitable forms of Ω in Sect. 3.2.
The joint learning of the kernels and the function can now be formulated as the problem of finding the minimising f and L_jl's of the regularised functional

(5)  R(f) + τ Ω({L_jl}),

where R(f) is the regularised risk from (1) and τ > 0 is a regularisation parameter. By calling on the properties of the RKHS, we reformulate this as a finite-dimensional problem that can be addressed by conventional finite-dimensional optimisation approaches. We introduce the Gram matrices K_jl such that (K_jl)_{ts} = k_jl(x_t^(j), x_s^(j)) for all t, s, the output data matrix Y such that Y_{ti} = (y_t)_i, and the coefficient matrix C such that C_{ti} = (c_t)_i.
3.2 Sparse Regularization
The construction of the kernel and the function space described in Sect. 3 imposes on the function f the necessary structure for Granger-causality analysis (as per our second objective set out in Sect. 2). As explained in Sect. 2.1, the other ingredient we need to identify the Granger non-causalities is sparsity within the structure of the learned function.
In our methods, the sparsity is introduced by the regulariser Ω. By construction of the function space, we can examine the elements of the output kernels (their diagonals) to make statements about Granger non-causality. We say the j-th scalar time series is non-causal for the i-th series (given all the remaining series in the process) if (L_jl)_{ii} = 0 for all l.
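This readout rule translates directly into code; the sketch below (data layout hypothetical) builds the Granger adjacency matrix from the learned diagonals:

```python
import numpy as np

def granger_adjacency(output_kernel_diags, tol=1e-8):
    """Build the m x m Granger adjacency matrix from learned diagonal
    output kernels. `output_kernel_diags[j]` lists the diagonals
    (length-m vectors) of the kernels L_jl acting on input series j.
    Entry (i, j) is 1 iff series j is found causal for series i."""
    m = len(output_kernel_diags)
    adj = np.zeros((m, m), dtype=int)
    for j, diags in enumerate(output_kernel_diags):
        active = np.any(np.abs(np.array(diags)) > tol, axis=0)  # any l with (L_jl)_ii != 0
        adj[active, j] = 1
    return adj
```

A tolerance is needed in practice because numerical solvers return small nonzero values rather than exact zeros.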
Essentially, any of the numerous regularisers that exist for sparse or structured-sparse learning [3] could be used as Ω, possibly based on some prior knowledge about the underlying dependencies within the time-series process.
We elaborate here two cases that do not assume any special structure in the dependencies as the base scenarios. The first is the entry-wise ℓ1 norm across all the output kernels, so that

(8)  Ω_1({L_jl}) = Σ_{j,l} Σ_i (L_jl)_{ii}
The second is the grouped ℓ1/ℓ2 norm

(9)  Ω_{1/2}({L_jl}) = Σ_j Σ_i ( Σ_l (L_jl)_{ii}² )^{1/2}
After developing the learning strategies for these in Sects. 4.1 and 4.2, we provide more intuition about their effects on the models and link them to other known graphical Granger techniques in Sect. 5.
4 Learning Strategy
First of all, we simplify the final formulation of the problem (7) in Sect. 3.1. Rather than working with the set of diagonal matrices L_jl, we merge their diagonals into a single matrix A. We then reformulate the problem with respect to this single matrix in place of the set and show how this reformulation suitably decomposes into smaller independent subproblems.
We develop fit-to-purpose approaches for our two regularisers in Sects. 4.1 and 4.2. The first, based on the decomposition of the kernel matrices into the corresponding empirical features and on the variational formulation of norms [3], shows the equivalence of the problem with group lasso [22, 23]. The second proposes a simple alternating minimisation algorithm to obtain the two sets of parameters.
We introduce the non-negative matrix A such that

(10)  a^(jl) = diag(L_jl)

(each row a^(jl) of A corresponds to the diagonal of one output kernel L_jl). Using this change of variable, the optimisation problem (7) can be written equivalently as
(11)  min_{C, A ≥ 0} Σ_{i=1}^m J_i(c_i, a_i) + τ Ω̃(A)

where

(12)  J_i(c_i, a_i) = ‖y_i − K(a_i) c_i‖₂² + λ c_i^⊤ K(a_i) c_i,  with K(a_i) = Σ_{(jl)} A_{(jl),i} K_jl,

and Ω̃ is the equivalent of Ω so that

(13)  Ω̃_1(A) = Σ_i Σ_{(jl)} A_{(jl),i}

and

(14)  Ω̃_{1/2}(A) = Σ_i Σ_j ( Σ_l A_{(jl),i}² )^{1/2}
In equations (12) and (14) we somewhat abuse the notation by using the double subscript (jl) to indicate the sum across the rows of the matrix A.
From (12)-(14) we observe that, with both of our regularisers, problem (11) conveniently separates along i into a sum of m smaller independent problems, one per scalar output series. These can be efficiently solved in parallel, which makes our method scalable to very large multivariate systems. The final complexity depends on the choice of the regulariser and the appropriate algorithm. The overhead cost can be significantly reduced by precalculating the Gram matrices in a single preprocessing step and sharing these between the parallel tasks.
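The decomposition lends itself to a trivially parallel driver; a sketch with a hypothetical per-series solver is below (for CPU-bound numerical solvers a process pool would typically replace the thread pool):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_all(subproblems, solve_subproblem, workers=4):
    """Solve the m independent per-series subproblems of (11) in parallel.
    `solve_subproblem` stands for the chosen per-series routine,
    e.g. the group-lasso solver of Sect. 4.1."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(solve_subproblem, subproblems))
```

The precomputed Gram matrices can be passed to every task by reference, so the sharing mentioned above costs nothing extra in memory.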
4.1 Learning with the ℓ1 Norm
To unclutter notation we replace the bracketed double subscripts (jl) by a single subscript l. We also drop the regularisation parameters (fix them to 1), as it is easy to show that any other values can be absorbed into a rescaling of the remaining matrices. For each of the parallel tasks we write y, c and a, so that the individual problems are the minimisations with respect to c and a ≥ 0 of
(15)  ‖y − Σ_l a_l K_l c‖₂² + c^⊤ (Σ_l a_l K_l) c + Σ_l a_l
We decompose (for example by eigendecomposition) each of the Gram matrices as K_l = S_l S_l^⊤, where S_l is the matrix of the empirical features, and we introduce the variables w_l = a_l S_l^⊤ c and the set W = {w_l}. Using these we rewrite^1 equation (15) as

^1 We extend the function (w, a) → ‖w‖²/a to the point a = 0 by taking the convention 0/0 = 0.
(16)  ‖y − Σ_l S_l w_l‖₂² + Σ_l ‖w_l‖₂²/a_l + Σ_l a_l
We first find the closed form of the minimising solution for a as a_l = ‖w_l‖₂ for all l. Plugging this back into (16) we obtain
(17)  ‖y − Σ_l S_l w_l‖₂² + 2 Σ_l ‖w_l‖₂
Seen as a minimisation with respect to the set W, this is the classical group-lasso formulation with the empirical features as inputs. Accordingly, it can be solved by any standard method for group-lasso problems, such as the proximal gradient descent method, e.g. [3], which we employ in our experiments. After solving for W we can directly recover a from the minimising identity a_l = ‖w_l‖₂ and then obtain the coefficients c from the set of linear equations
(18)  (Σ_l a_l K_l + I) c = y
The algorithm outlined above takes advantage of the convex group-lasso reformulation (17) and has the standard convergence and complexity properties of proximal gradient descent. The empirical features can be precalculated and shared amongst the tasks to reduce the overhead cost.
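A compact sketch of a plain ISTA solver for (17) follows, assuming the empirical feature blocks S_l have already been computed; the constant step size is set from the Lipschitz constant of the smooth part (this is our illustration, not the paper's exact implementation, which uses a line search):

```python
import numpy as np

def group_lasso_ista(S_blocks, y, reg=1.0, n_iter=1000):
    """ISTA for the group-lasso problem (17):
    min_w ||y - sum_l S_l w_l||^2 + 2 * reg * sum_l ||w_l||."""
    S = np.hstack(S_blocks)
    bounds = np.cumsum([0] + [B.shape[1] for B in S_blocks])
    step = 1.0 / (2.0 * np.linalg.norm(S, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(S.shape[1])
    for _ in range(n_iter):
        z = w - step * 2.0 * S.T @ (S @ w - y)       # gradient step on the fit term
        for a, b in zip(bounds[:-1], bounds[1:]):    # block soft-thresholding (prox)
            nrm = np.linalg.norm(z[a:b])
            z[a:b] = 0.0 if nrm <= 2.0 * reg * step else (1 - 2.0 * reg * step / nrm) * z[a:b]
        w = z
    return [w[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```

The block soft-thresholding step is what zeroes out entire feature groups, which is the sparsity pattern the Granger readout of Sect. 3.2 relies on.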
4.2 Learning with the ℓ1/ℓ2 Norm
For the ℓ1/ℓ2 regularisation, we need to return to the double indexation (jl) to make clear how the groups are created. As above, for each of the parallel tasks we use the vectors y, c and a; for the vector a we keep the notation a_(jl) for its elements. The individual problems are the minimisations with respect to c and a ≥ 0 of
(19)  ‖y − Σ_{(jl)} a_(jl) K_jl c‖₂² + c^⊤ (Σ_{(jl)} a_(jl) K_jl) c + Σ_j ( Σ_l a_(jl)² )^{1/2}
We propose to use alternating minimisation with a proximal gradient step. At each iteration, we alternately solve for c and a. For fixed a we obtain c from the set of linear equations (18). With fixed c, problem (19) is a group lasso for a with groups defined by the sub-index j within the double indexation of the elements of a. Here, the proximal gradient step takes place to move along the descent direction for a. Though convex in c and a individually, problem (19) is jointly non-convex and can therefore converge to local minima.
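The alternation can be sketched as below: a simplified single-task illustration with a fixed step size and unit kernel-ridge weight (the paper's implementation uses an ISTA line search instead, so treat this as a schematic, not the reference algorithm):

```python
import numpy as np

def alt_min(K_list, groups, y, lam=1.0, tau=0.1, n_iter=200, step=1e-2):
    """Alternating minimisation sketch for (19): exact update of c via the
    linear system (18) for fixed a, then one projected proximal-gradient
    step on a >= 0 for fixed c. `groups[j]` lists the kernel indices
    belonging to input series j."""
    n = len(y)
    a = np.ones(len(K_list))
    for _ in range(n_iter):
        Ka = sum(a_l * K for a_l, K in zip(a, K_list))
        c = np.linalg.solve(Ka + lam * np.eye(n), y)               # system (18)
        r = y - Ka @ c
        # gradient of ||y - K(a) c||^2 + c^T K(a) c with respect to a
        g = np.array([-2.0 * r @ (K @ c) + c @ (K @ c) for K in K_list])
        z = np.maximum(a - step * g, 0.0)                          # step + projection
        for idx in groups:                                         # group soft-threshold
            nrm = np.linalg.norm(z[idx])
            z[idx] = 0.0 if nrm <= step * tau else (1 - step * tau / nrm) * z[idx]
        a = z
    Ka = sum(a_l * K for a_l, K in zip(a, K_list))
    c = np.linalg.solve(Ka + lam * np.eye(n), y)
    return a, c
```

Because c is recomputed exactly at each iteration, the scheme stays numerically stable even when the gradient step on a is crude; only the quality of the local minimum depends on the step size.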
5 Interpretation and Crossovers
To help the understanding of the inner workings of our methods and especially the effects of the two regularizers, we discuss here the crossovers to other existing methods for MKL and Granger modelling.
ℓ1 Norm
The link to group lasso demonstrated in Sect. 4.1 is not in itself too surprising. The formulation in (15) can be recognised as a sparse multiple kernel learning problem, which has previously been shown to relate to group lasso (e.g. [2], [21]). We derive this link in Sect. 4.1 using the empirical feature representation to i) provide better intuition for the structure of the learned function f, and ii) develop an efficient algorithm for solving problem (15).
The reformulation in terms of the empirical features creates an intuitive bridge to the classical linear models. Each S_l can be seen as a matrix of features generated from a subset of the input coordinates relating to the past of a single scalar time series. The group-lasso regulariser in equation (17) has a sparsifying effect at the level of these subsets, zeroing out (or not) whole groups of parameters w_l. In the context of linear methods, this approach is known as grouped graphical Granger modelling [16].
Within the nonlinear approaches to time series modelling, Sindhwani et al. [19] recently derived a similar formulation. There the authors followed a strategy of multiple kernel learning from a dictionary of input kernels combined with a single learned output kernel (as opposed to our multiple output kernels). They obtain their IKL model, which in its final formulation is equivalent to problem (15), by fixing the output kernel to the identity.
Though we initially formulate our problem quite differently, the diagonal constraint we impose on the output kernels essentially prevents the modelling of any contemporaneous relationships between the series (as does the identity output kernel matrix in IKL). What remains in our methods are the diagonal elements, which are non-constant and sparse, and which can be interpreted as the weights of the input kernels in the standard MKL setting.
ℓ1/ℓ2 Norm
The more complex ℓ1/ℓ2 regularisation discussed in Sect. 4.2 is, to the best of our knowledge, novel in the context of multiple kernel learning. It again has a strong motivation and clear interpretation in terms of graphical Granger modelling. The ℓ1/ℓ2 norm has a sparsifying effect not only at the level of the individual kernels but at the level of the groups of kernels operating over the same input partitions. In this respect our move from the ℓ1 to the ℓ1/ℓ2 norm has a parallel in the same move in linear graphical Granger techniques. The ℓ1-norm Lasso-Granger method [1] imposes the sparsity on the individual elements of the parameter matrices in a linear model, while the ℓ1/ℓ2 norm of the grouped-Lasso-Granger [16] works with groups of the corresponding parameters of a single input series across the multiple lags.
6 Experiments
To document the performance of our methods, we conducted a set of experiments on real and synthetic datasets. In these we simulate a real-life forecasting exercise by splitting the data into a training set and a hold-out set, which is unseen by the algorithm when learning the function and is only used for the final performance evaluation.
We compare our methods with the ℓ1 output-kernel regularisation (NVARL1) and the ℓ1/ℓ2 regularisation (NVARL12) with simple baselines (which nevertheless are often hard to beat in practical time series forecasting) as well as with state-of-the-art techniques for forecasting and Granger modelling. Namely, we compare with the simple mean, univariate linear autoregressive models (LAR), a multivariate linear vector autoregressive model with an ℓ2 penalty (LVARL2), the group-lasso Granger method [16] (LVARL1), and a sparse MKL without the input partitioning (NVAR). Of these, the last two are the most relevant competitors. LVARL1, similarly to our methods, aims at recovering the Granger structure but is strongly constrained to linear modelling only. NVAR has no capability to capture the Granger relationships but, due to the lack of structural constraints, it is the most flexible of all the models.
We evaluate our results with respect to the two objectives for the function defined in Sect. 2. We measure the accuracy of the one-step-ahead forecasts by the mean square error (MSE) for the whole multivariate process averaged over 500 hold-out points. The structural objective, allowing the analysis of dependencies between the sub-processes, is wired into the method itself (see Sects. 3 and 2.1) and is therefore satisfied by construction. We produce adjacency matrices of the graphs of the learned dependencies, compare these with the ones produced by the linear Granger methods, and comment on the observed results.
6.1 Technical Considerations
For each experiment we preprocessed the data by removing the training sample mean and rescaling with the training sample standard deviation. We fix the number of kernels for each input partition to six (c_j = 6 for all j) and use the same kernel functions for all experiments: a linear kernel, 2nd and 3rd order polynomial kernels, and Gaussian kernels with widths 0.5, 1 and 2. We normalise the kernels so that the training Gram matrices have trace equal to the size of the training sample.
We search for the regularisation hyper-parameter by 5-fold cross-validation within a 15-point logarithmic grid whose scale accounts for the training sample size and the number of kernels or groups (depending on the method). In each grid search, we use the previous parameter values as warm starts. We do not perform an exhaustive search for the optimal lag for each of the scalar input series by the classical testing procedures (based on AIC, BIC, etc.). We instead fix the lag to the same value for all series in all experiments and rely on the regularisation to control any excess complexity.
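The kernel dictionary and the trace normalisation can be reproduced along these lines (the unit offset in the polynomial kernels and the interpretation of "width" as the Gaussian scale σ are assumptions on our part):

```python
import numpy as np

def kernel_dictionary(X):
    """Six input kernels per partition: linear, 2nd and 3rd order
    polynomial, and Gaussian with widths 0.5, 1 and 2; each Gram
    matrix is normalised to have trace equal to the sample size."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
    lin = X @ X.T
    grams = [lin, (1 + lin) ** 2, (1 + lin) ** 3] + \
            [np.exp(-d2 / (2 * w ** 2)) for w in (0.5, 1.0, 2.0)]
    return [len(X) * G / np.trace(G) for G in grams]
```

The common trace puts the six Gram matrices on a comparable scale, so the learned weights a are not biased towards kernels that merely have larger magnitudes.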
We implemented our own tools for all the tested methods, based on variations of proximal gradient descent with ISTA line search [5]. The full Matlab code is available at https://bitbucket.org/dmmlgeneva/nonlineargranger.
6.2 Synthetic Experiments
We simulated data from a five-dimensional non-Gaussian time-series process generated through a linear filter of a 5-dimensional i.i.d. exponential white noise with identity covariance matrix (recentred to zero and rescaled to unit variance). The mixing matrix in the filter is such that the process consists of two independent, internally interrelated sub-processes: one composed of the first 3 scalar series, the other of the remaining two series. This structural information, though known to us, is unknown to the learning methods (it is not considered in the learning process).
We list in Table 1 the predictive performance of the tested methods in terms of the average hold-out MSE based on training samples of varying size. Our methods clearly outperform all the linear models. The functionally strongly constrained linear LVARL1 performs roughly on par with our methods for the small sample sizes. But for larger sample sizes, the higher flexibility of the function space in our methods yields significantly more accurate forecasts (as much as a 10% MSE improvement).
Table 1. Average hold-out MSE in the synthetic experiments.
Train size  300  700  1000
Mean  0.925  (0.047)  0.923  (0.047)  0.923  (0.047) 
LAR  0.890  (0.045)  0.890  (0.044)  0.890  (0.044) 
LVAR  0.894  (0.045)  0.836  (0.041)  0.763  (0.035) 
LVARL1  0.787  (0.037)  0.737  (0.031)  0.722  (0.030) 
NVAR  0.835  (0.041)  0.735  (0.032)  0.719  (0.030) 
NVARL1  0.754  (0.034)  0.706  (0.030)  0.679  (0.028) 
NVARL12  0.808  (0.040)  0.710  (0.031)  0.684  (0.029) 
Train size  1500  2000  3000  
Mean  0.923  (0.047)  0.922  (0.047)  0.922  (0.047) 
LAR  0.888  (0.045)  0.889  (0.045)  0.888  (0.045) 
LVAR  0.751  (0.034)  0.741  (0.033)  0.687  (0.028) 
LVARL1  0.710  (0.029)  0.701  (0.028)  0.693  (0.028) 
NVAR  0.699  (0.028)  0.682  (0.027)  0.662  (0.026) 
NVARL1  0.654  (0.026)  0.640  (0.025)  0.626  (0.025) 
NVARL12  0.659  (0.027)  0.685  (0.028)  0.657  (0.027) 
In brackets is the average standard deviation (std) of the MSEs. Results for NVARL1 and NVARL12 in bold are significantly better than all the linear competitors; those in italics are significantly better than the nonlinear NVAR (using a one-sided paired-sample t-test at the 10% significance level).
The structural constraints in our methods also help the performance when competing with the unstructured NVAR method, which mostly produces less accurate forecasts. At the same time, as illustrated in Fig. 1, our methods are able to correctly recover the Granger-causality structure (splitting the process into the two independent sub-processes by the zero off-diagonal blocks), which NVAR by construction cannot.
6.3 Real Data Experiments
We use data on water physical discharge publicly available from the website of the Water Services of the US Geological Survey (http://www.usgs.gov/). Our dataset consists of 9 time series of daily rates of year-on-year growth at measurement sites along the streams of the Connecticut and Columbia rivers.
The prediction accuracy of the tested methods is listed in Table 2. Our nonlinear methods perform on par with the state-of-the-art linear models. On the one hand, this suggests that for the analysed dataset linear modelling seems sufficient. On the other hand, it confirms that our methods, which in general have the ability to learn more complex relationships by living in a richer functional space, are well behaved and can capture simpler dependencies as well. The structure encoded into our methods, however, benefits the learning, since the unstructured NVAR tends to perform less accurately.
Table 2. Average hold-out MSE in the real-data experiments.
Train size  300  700  1000
Mean  0.780  (0.053)  0.795  (0.054)  0.483  (0.026) 
LAR  0.330  (0.023)  0.340  (0.024)  0.152  (0.013) 
LVARL2  0.302  (0.021)  0.311  (0.022)  0.140  (0.012) 
LVARL1  0.310  (0.022)  0.310  (0.023)  0.140  (0.012) 
NVAR  0.328  (0.023)  0.316  (0.023)  0.148  (0.012) 
NVARL1  0.308  (0.023)  0.317  (0.024)  0.140  (0.012) 
NVARL12  0.321  (0.023)  0.322  (0.024)  0.141  (0.012) 
Train size  1500  2000  3000  
Mean  0.504  (0.03)  0.464  (0.027)  0.475  (0.017) 
LAR  0.181  (0.015)  0.179  (0.013)  0.187  (0.008) 
LVARL2  0.167  (0.014)  0.164  (0.013)  0.170  (0.007) 
LVARL1  0.165  (0.014)  0.163  (0.013)  0.170  (0.008) 
NVAR  0.169  (0.014)  0.166  (0.012)  0.173  (0.007) 
NVARL1  0.164  (0.014)  0.161  (0.013)  0.167  (0.007) 
NVARL12  0.162  (0.014)  0.160  (0.012)  0.166  (0.007) 
In brackets is the average standard deviation (std) of the MSEs.
The learned dynamical dependence structure of the time series is depicted in Fig. 1. In the dataset (and the adjacency matrices), the first 4 series are the Connecticut measurement sites, starting from the one highest up the stream and moving down to the mouth of the river. The next 5 are the Columbia measurement sites ordered in the same manner.
From inspecting the learned adjacency matrices, we observe that all the sparse methods recover similar Granger-causal structures. Since we do not know the ground truth in this case, we can only speculate about the accuracy of the structure recovery. Nevertheless, it seems plausible that there is little dynamical cross-dependency between the Connecticut and Columbia measurements, as the learned graphs suggest (the two rivers are at the eastern and western extremes of the US).
7 Conclusions
We have developed a new method for forecasting and Granger-causality modelling in multivariate time series that does not rely on prior assumptions about the shape of the dynamical dependencies (other than their being sparse). The method is based on learning a combination of multiple operator-valued kernels in which the multiple output kernels are learned as sparse diagonal matrices. We have documented in experiments that our method outperforms linear competitors in the presence of strong nonlinearities and is able to correctly recover the Granger-causality structure, which non-structured kernel methods cannot do.
Acknowledgements
This work was partially supported by the research projects HSTS (ISNET) and RAWFIE #645220 (H2020). We thank Francesco Dinuzzo for helping to form the initial ideas behind this work through fruitful discussions during a visit to IBM Research, Dublin.
References
[1] Arnold, A., Liu, Y., Abe, N.: Temporal causal modeling with graphical Granger methods. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '07 (2007)

[2] Bach, F.: Consistency of the group lasso and multiple kernel learning. The Journal of Machine Learning Research (2008)
 [3] Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsityinducing penalties. Foundations and Trends in Machine Learning (2012)
 [4] Bahadori, M., Liu, Y.: An Examination of Practical Granger Causality Inference. SIAM Conference on Data Mining (2013)
 [5] Beck, A., Teboulle, M.: Gradientbased algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications (2009)
 [6] Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer Science+Business Media, LLC, 2nd edn. (2006)
[7] Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal Multi-Task Kernels. Journal of Machine Learning Research (2008)
 [8] Dinuzzo, F., Ong, C.: Learning output kernels with block coordinate descent. In: International Conference on Machine Learning (ICML) (2011)

[9] Eichler, M.: Graphical modelling of multivariate time series. Probability Theory and Related Fields (2012)
[10] Franz, M.O., Schölkopf, B.: A unifying view of Wiener and Volterra theory and polynomial kernel regression. Neural Computation (2006)
 [11] Granger, C.W.J.: Investigating Causal Relations by Econometric Models and Crossspectral Methods. Econometrica: Journal of the Econometric Society (1969)
 [12] Jawanpuria, P., Lapin, M., Hein, M., Schiele, B.: Efficient Output Kernel Learning for Multiple Tasks. In: NIPS (2015)
 [13] Kadri, H., Rakotomamonjy, A., Bach, F., Preux, P.: Multiple Operatorvalued Kernel Learning. In: NIPS (2012)
 [14] Lanckriet, G.G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research (2004)
 [15] Lim, N., D’AlchéBuc, F., Auliac, C., Michailidis, G.: Operatorvalued Kernelbased Vector Autoregressive Models for Network Inference. Machine Learning (2014)
 [16] Lozano, A.C., Abe, N., Liu, Y., Rosset, S.: Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics (Oxford, England) (2009)
 [17] Micchelli, C.A., Pontil, M.: On learning vectorvalued functions. Neural computation (2005)

[18] Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., Ljung, L.: Kernel methods in system identification, machine learning and function estimation: A survey. Automatica (2014)
 [19] Sindhwani, V., Minh, H.Q., Lozano, A.: Scalable Matrixvalued Kernel Learning for Highdimensional Nonlinear Multivariate Regression and Granger Causality. In: UAI (2013)
 [20] Turkman, K.F., Scotto, M.G., de Zea Bermudez, P.: NonLinear Time Series. Springer (2014)
 [21] Xu, Z., Jin, R., Yang, H., King, I., Lyu, M.R.: Simple and efficient multiple kernel learning by group lasso. International Conference on Machine Learning (ICML) (2010)
 [22] Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2006)
 [23] Zhao, P., Rocha, G.: Grouped and hierarchical model selection through composite absolute penalties (2006)