Kernel methods (Hofmann et al., 2008) have proved useful for handling and analyzing structured data. A non-exhaustive list of such data types includes images (Chapelle et al., 1999; Grauman and Darrell, 2005; Cuturi and Fukumizu, 2007; Harchaoui and Bach, 2007), graphs (Kashima et al., 2003; Mahe et al., 2005; Vishwanathan et al., 2008; Shervashidze and Borgwardt, 2009), texts (Joachims, 2002; Moschitti and Zanzotto, 2007) and strings on finite alphabets (Leslie et al., 2002; Cortes et al., 2004; Vert et al., 2004; Cuturi and Vert, 2005; Sonnenburg et al., 2007), all of which have drawn much attention in recent years. Time series, although ubiquitous in science and engineering, have comparatively received less attention in the kernel literature.
Numerous similarity measures and distances for time series have been proposed in the past decades (Schreiber and Schmitz, 1997). These similarities are not, however, always well suited to the kernel methods framework. First, most available similarity measures are not positive definite (Haasdonk and Bahlmann, 2004); likewise, most distances are not negative definite. The positive definiteness of similarity measures (alternatively, the negative definiteness of distances) is needed to use the convex optimization algorithms that underlie most kernel machines. Positive definiteness is also the cornerstone of the reproducing kernel Hilbert space (RKHS) framework which supports these techniques (Berlinet and Thomas-Agnan, 2003). Second, most similarity measures are only defined for ‘standard’ multivariate time series, that is, time series of finite-dimensional vectors. Yet some of the main application fields of kernel methods include bioinformatics, natural language processing and computer vision, where the analysis of time series of structured objects (images, texts, graphs) remains a very promising field of study. Ideally, a useful kernel for time series should be both positive definite and able to handle time series of structured data. An oft-quoted example (Bahlmann et al., 2002; Shimodaira et al., 2002) of a non-positive definite similarity for time series is the Dynamic Time Warping (DTW) score (Sakoe and Chiba, 1978), arguably the most popular similarity score for variable-length multivariate time series (Rabiner and Juang, 1993, §4.7). Hence the DTW score can only be used in a kernel machine if it is altered through ad hoc modifications such as diagonal regularizations (Zhou et al., 2010). Some extensions of the DTW score have addressed this issue: Hayashi et al. (2005) propose to embed time series in Euclidean spaces such that the distance between such representations approximates the distance induced by the DTW score. Cuturi et al. (2007) consider the soft-max of the alignment scores of all possible alignments to compute a positive definite kernel for two time series. This kernel can also be used on two time series $x$ and $x'$ of structured objects, since the kernel between $x$ and $x'$ can be expressed as a function of the Gram matrix $[\kappa(x_i,x'_j)]_{i,j}$, where $\kappa$ is a given kernel on the structured objects of interest.
A few alternative kernels have been proposed for multivariate time series. Kumara et al. (2008)
consider a non-parametric approach to interpolate time series using splines, and define kernels directly on these interpolated representations. In a paragraph of their broad work on probability product kernels, Jebara et al. (2004, §4.5) briefly mention the idea of using the Bhattacharyya distance between representations of two time series as normal densities obtained using state-space models. Vishwanathan et al. (2007) as well as Borgwardt et al. (2006) use the family of Binet-Cauchy kernels (Vishwanathan and Smola, 2004), originally defined by the coefficients of the characteristic polynomial of kernel matrices such as the matrix described above. Unlike the other techniques listed above, these two proposals rely on a probabilistic modeling of the time series to define a kernel. Namely, in both the probability product and Binet-Cauchy approaches, the kernel value is the result of a two-step computation: each time series is first mapped onto a set of parameters that summarizes its dynamic behavior; the kernel is then defined as a kernel between these two sets of parameters.
The kernels we propose in this paper also rely on a probabilistic modeling of time series to define kernels, but do away with the two-step approach detailed above. This distinction is discussed later in the paper in Remark 2. Our contribution builds upon the covariance kernel framework proposed by Seeger (2002), whose approach can be traced back to the work of Haussler (1999) and Jaakkola et al. (1999), who advocate the use of probabilistic models to extract features from structured objects. Given a measurable space $\mathcal{X}$ and a model, that is, a parameterized family of probability distributions on $\mathcal{X}$ of the form $\{p_\theta,\ \theta \in \Theta\}$, a kernel for two objects $x, x'$ of $\mathcal{X}$ can be defined as
$$k(x,x') = \int_{\Theta} p_\theta(x)\, p_\theta(x')\, \omega(d\theta),$$
where $\omega$ is a prior on the parameter space $\Theta$. Our work can be broadly characterized as the implementation of this idea for time series data, using the VAR model to define the space of densities and a specific prior $\omega$, the matrix-normal inverse-Wishart prior. This conjugate prior allows us to obtain a closed-form expression for the kernel, which admits useful properties.
The rest of this paper is organized as follows. Section 2 starts with a brief review of the Bayesian approach to dynamic linear modeling for multivariate stochastic processes, which provides the main tool to define autoregressive kernels on multivariate time series. We then detail a few of the appealing properties of autoregressive kernels for multivariate time series, namely their infinite divisibility and their ability to handle high-dimensional time series of short length. We show in Section 3 that autoregressive kernels can be used not only on multivariate time series but also on time series taking values in any set endowed with a kernel. The kernel is then computed as a function of the Gram matrices of subsets of shorter time series found in $x$ and $x'$. This computation itself requires the computation of large Gram matrices, in addition to a number of operations that grows cubically with the lengths of $x$ and $x'$. We propose in Section 3.2 to circumvent this computational burden by using low-rank factorizations of these Gram matrices. We present in Section 4 experimental results on toy and real-life datasets.
2 Autoregressive Kernels
We introduce in this section the crux of our contribution, that is, a family of kernels that can handle variable-length multivariate time series. Sections 2.1, 2.2 and 2.3 detail the construction of such kernels, while Sections 2.4 and 2.5 highlight some of their properties.
2.1 Autoregressive Kernels as an Instance of Covariance Kernels
A vector autoregressive model of order $p$, henceforth abbreviated as VAR($p$), is a family of densities for $\mathbb{R}^d$-valued stochastic processes parameterized by $p$ matrices $A_1,\dots,A_p$ of size $d \times d$ and a positive definite matrix $V$ of size $d \times d$. Given a parameter $\theta = (A_1,\dots,A_p,V)$, the conditional probability density that an observed time series $x = (x_1,\dots,x_T)$ has been drawn from model $\theta$ given the first $p$ observations $x_1,\dots,x_p$, assuming $T > p$, is equal to
$$p(x\,|\,\theta) \;=\; \prod_{t=p+1}^{T} \frac{1}{(2\pi)^{d/2}\,|V|^{1/2}} \exp\Big(-\tfrac{1}{2}\,\big\|\,x_t - \textstyle\sum_{i=1}^{p} A_i\, x_{t-i}\,\big\|^2_{V^{-1}}\Big),$$
where for a vector $u$ and a positive definite matrix $\Sigma$ the Mahalanobis norm is equal to $\|u\|^2_{\Sigma} = u^\top \Sigma\, u$. We write $|x|$ for the length of the time series $x$, $|x| = T$ in the example above. We abbreviate the conditional probability as $p(x\,|\,\theta)$ and take for granted that the first $p$ observations of the time series are not taken into account to compute the probability of $x$. We consider the set of parameters
where $S_d^{++}$ is the cone of positive definite matrices of size $d \times d$, to define a kernel that computes a weighted sum of the products of features of two time series $x$ and $x'$ as $\theta$ varies in $\Theta$,
$$k(x,x') = \int_{\Theta} p(x\,|\,\theta)^{\frac{1}{|x|}}\; p(x'\,|\,\theta)^{\frac{1}{|x'|}}\; \omega(d\theta), \qquad (1)$$
where the exponential weight factors $\tfrac{1}{|x|}$ and $\tfrac{1}{|x'|}$ are used to normalize the probabilities by the lengths of the considered time series.
The feature map $\theta \mapsto p(x\,|\,\theta)$ produces strictly positive features, which are naturally closer to 0 for longer time series. We follow in this work the common practice in the kernel methods literature, as well as in the signal processing literature, of normalizing these features so that their magnitude is independent of the size of the input object, namely the length of the input time series. We propose to do so by normalizing the probabilities by the lengths of the considered series; the weights introduced in Equation (1) are defined to this effect in Section 2.3.
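To make the construction above concrete, the following sketch approximates a covariance kernel of the form $k(x,x') = \int p(x\,|\,\theta)^{1/|x|}\, p(x'\,|\,\theta)^{1/|x'|}\, \omega(d\theta)$ by crude Monte Carlo for a toy scalar AR(1) model with a uniform prior on the coefficient. All names, the toy model and the prior are illustrative choices for this sketch, not the paper's closed-form construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_loglik(x, a, s2):
    """Conditional log-likelihood of a scalar AR(1) series x given
    coefficient a and noise variance s2 (first observation conditioned on)."""
    resid = x[1:] - a * x[:-1]
    n = len(resid)
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * np.sum(resid ** 2) / s2

def covariance_kernel(x, xp, n_samples=2000):
    """Monte Carlo estimate of the covariance kernel, with the
    1/length exponents mimicking the paper's length normalization.
    Toy prior: a ~ Uniform(-1, 1), noise variance fixed to 1."""
    vals = []
    for _ in range(n_samples):
        a = rng.uniform(-1.0, 1.0)
        la = ar1_loglik(x, a, 1.0) / (len(x) - 1)
        lb = ar1_loglik(xp, a, 1.0) / (len(xp) - 1)
        vals.append(np.exp(la + lb))
    return float(np.mean(vals))

x  = rng.standard_normal(50)
xp = rng.standard_normal(60)
k_xx  = covariance_kernel(x, x)
k_xxp = covariance_kernel(x, xp)
```

Being a mixture of products of strictly positive features, the estimate is strictly positive and handles series of different lengths by construction.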
The formulation of Equation (1)
is somewhat orthogonal to kernels that map structured objects, in this
case time series, to a single density in a given model and compare
directly these densities using a kernel between densities, such as
probability product kernels (Jebara et al., 2004), Binet-Cauchy
kernels (Vishwanathan et al., 2007), structural kernels (Hein and Bousquet, 2005)
or information diffusion kernels (Lebanon, 2006). Indeed, such kernels rely first on the estimation of a parameter $\hat\theta_x$ in a parameterized model class to summarize the properties of an object $x$, and then compare two objects $x$ and $x'$ by using a proxy kernel on $\hat\theta_x$ and $\hat\theta_{x'}$. These approaches require that the model class, the VAR model in the context of this paper, properly models the considered objects, namely that $x$ is well summarized by $\hat\theta_x$. The kernels we propose do not suffer from this key restriction: the VAR model considered in this work is never used to infer likely parameters for a time series $x$ but is used instead to generate an infinite family of features.
The kernel is mainly defined by the prior $\omega$ on the parameter space $\Theta$. We present in the next section a possible choice for this prior, the matrix-normal inverse-Wishart prior, which has been extensively studied in the framework of Bayesian linear regression applied to dynamic linear models.
2.2 The Matrix-Normal Inverse-Wishart Prior in the Bayesian Linear Regression Framework
Consider the regression model $y = A^\top u + \varepsilon$ between a vector of explanatory variables $u \in \mathbb{R}^m$ and an output variable $y \in \mathbb{R}^d$, where $A$ is an $m \times d$ coefficient matrix and $\varepsilon$ is a centered Gaussian noise with covariance matrix $V$. Accordingly, $y$ conditionally follows the normal density $\mathcal{N}(A^\top u, V)$.
Given $n$ pairs $(u_i, y_i)$ of explanatory and response vectors weighted by nonnegative coefficients $(w_i)_{i=1}^{n}$ such that $\sum_{i=1}^{n} w_i = 1$, the weighted likelihood of this sample of observations is defined by the expression below, where the matrices $U$ and $Y$ stand respectively for $[u_1,\dots,u_n]$ and $[y_1,\dots,y_n]$, and $W$ for $\operatorname{diag}(w_1,\dots,w_n)$.
The matrix-normal inverse-Wishart joint distribution (West and Harrison, 1997, §16.4.3) is a natural choice to model the randomness of $(A, V)$. The prior assumes that the matrix $A$ is distributed following a centered matrix-normal density with left variance matrix parameter $\Omega$ and right variance matrix parameter $V$, where $\Omega$ is a positive definite matrix. Using these notations, we can integrate out $A$ to obtain
The matrix-normal inverse-Wishart prior for $(A, V)$ also assumes that $V$ is distributed with an inverse-Wishart density of given inverse scale matrix. The posterior obtained by multiplying this prior by Equation (2) is itself proportional to an inverse-Wishart density, which can be integrated to obtain the marginal weighted likelihood,
Using for the prior , for the matrix , and discarding all constants independent of and yields the expression
Note that the matrix in the denominator is known as the hat matrix of the orthogonal least-squares regression of $Y$ versus $U$. The right term in the denominator can be interpreted as the determinant of a weighted cross-covariance matrix involving the regression residuals, regularized by the identity matrix.
2.3 Bayesian Averaging over VAR Models
Given a VAR($p$) model, we represent a time series $x = (x_1,\dots,x_T)$ as a sample of pairs of explanatory variables in $\mathbb{R}^{dp}$, obtained by stacking $p$ successive lagged observations, and response variables in $\mathbb{R}^{d}$, namely the pairs $\big((x_{t-1},\dots,x_{t-p}),\, x_t\big)$ for $t = p+1,\dots,T$. Following a standard practice in the study of VAR models (Lütkepohl, 2005, §3), this set is better summarized by matrices
Analogously, we use the corresponding notations for a second time series $x'$ of length $T'$. Using these notations, the two samples are aggregated in joint explanatory and response matrices
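The lag-matrix construction described above can be sketched as follows; the function name and array shapes are illustrative conventions for this sketch.

```python
import numpy as np

def var_design_matrices(x, p):
    """Given a time series x of shape (d, T) and an order p, build the
    explanatory matrix X of shape (d*p, T-p), whose column indexed by t
    stacks the p lagged observations preceding x[:, t], and the response
    matrix Y of shape (d, T-p)."""
    d, T = x.shape
    cols = []
    for t in range(p, T):
        # stack x_{t-1}, ..., x_{t-p} into a single explanatory vector
        cols.append(np.concatenate([x[:, t - i] for i in range(1, p + 1)]))
    X = np.stack(cols, axis=1)
    Y = x[:, p:]
    return X, Y

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 20))   # d = 3, T = 20
X, Y = var_design_matrices(x, p=2)
```

For two series, the corresponding matrices would simply be concatenated column-wise to form the joint sample.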
Note that with these notations, the integrand that appears in Equation (1) can be cast as the following probability,
Integrating out $\theta$ using the matrix-normal inverse-Wishart prior yields the following definition:
Given two time series $x$ and $x'$, and using the notations introduced in Equation (3), the autoregressive kernel of order $p$ and given degrees of freedom is defined as
2.4 Variance and Gram Based Formulations
We show in this section that the autoregressive kernel can be reformulated in terms of Gram matrices of subsequences of $x$ and $x'$ rather than variance-covariance matrices. For two square matrices $A$ and $B$, we write $A \asymp B$ when the spectra of $A$ and $B$ coincide, taking into account multiplicity, except for the eigenvalue $0$. Recall first the following trivial lemma.
For two matrices $A$ and $B$ in $\mathbb{R}^{m \times n}$ and $\mathbb{R}^{n \times m}$ respectively, $AB \asymp BA$, and as a consequence $\det(I_m + AB) = \det(I_n + BA)$.
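The lemma can be checked numerically; the following sketch verifies on random matrices that the nonzero spectra of $AB$ and $BA$ coincide and that the two determinants agree.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 7))
B = rng.standard_normal((7, 4))

# Nonzero eigenvalues of AB (4x4) and BA (7x7) coincide up to multiplicity...
eig_ab = np.linalg.eigvals(A @ B)
eig_ba = np.linalg.eigvals(B @ A)
nonzero_ba = eig_ba[np.abs(eig_ba) > 1e-8]

# ...so the determinants of I + AB and I + BA agree (Sylvester's identity).
det_small = np.linalg.det(np.eye(4) + A @ B)
det_big   = np.linalg.det(np.eye(7) + B @ A)
```

The identity is what allows trading a determinant of a small variance-type matrix for a determinant of a large Gram-type matrix, and vice versa.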
Based on this lemma, it is possible to establish the following result.
With the notations above, the autoregressive kernel of order $p$ given in Equation (4) is equal to
Taking a closer look at the denominator, the matrix inversion lemma yields the equality
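For reference, the matrix inversion lemma invoked here is the standard Woodbury identity, stated with generic symbols (not necessarily the paper's notation):

```latex
(A + UCV)^{-1} \;=\; A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1},
\qquad
\det(I_m + UV) \;=\; \det(I_n + VU),
```

where the companion determinant identity on the right (Sylvester's identity) is the one used to pass between the small and large determinants.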
Factoring in these two results, we obtain Equation (5).
We call Equation (4) the Variance formulation and Equation (5) the Gram formulation of the autoregressive kernel, as the latter only depends on Gram matrices computed from $x$ and $x'$. Although the Variance and Gram formulations of the kernel are equal, their computational costs differ, as detailed in the remark below.
In a high-$n$, low-$d$ setting, where $n$ stands for the cumulated length of the two time series, the Variance formulation is preferable: forming the denominator's matrices costs a number of operations linear in $n$, and inverting them a number of operations cubic in $dp$. This is reasonable for applications where the cumulated time series length is much larger than the dimension of these time series, such as speech signals or EEG data. In a low-$n$, high-$d$ setting, which frequently appears in bioinformatics or video processing applications, autoregressive kernels can be computed using Equation (5) at a cost cubic in $n$ instead.
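The interchangeability underlying this remark can be illustrated on generic matrices (not the paper's exact kernel expression): the same log-determinant is obtained from a small $d \times d$ variance-type matrix and from a large $T \times T$ Gram-type matrix, so one picks whichever is cheaper.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 5, 200                      # low-dimensional, long series
Z = rng.standard_normal((d, T))

# Variance formulation: determinant of a small d x d matrix, cheap when d << T.
log_det_var = np.linalg.slogdet(np.eye(d) + Z @ Z.T)[1]

# Gram formulation: the same quantity from the large T x T Gram matrix,
# preferable when d >> T (e.g. video frames, gene expression profiles).
log_det_gram = np.linalg.slogdet(np.eye(T) + Z.T @ Z)[1]
```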
2.5 Infinite Divisibility of Autoregressive Kernels
We recall that a positive definite kernel function $k$ is infinitely divisible if for all $n \in \mathbb{N}$, $k^{1/n}$ is also positive definite (Berg et al., 1984, §3.2.6). We prove in this section that under certain conditions on the degrees-of-freedom parameter of the inverse-Wishart law, the autoregressive kernel is infinitely divisible. This result builds upon (Cuturi et al., 2005, Proposition 3).
Proving the infinite divisibility of $k$ is useful for the following two reasons. First, following a well-known result (Berg et al., 1984, §3.2.7), the infinite divisibility of $k$ implies the negative definiteness of $-\log k$. Using Berg et al. (1984, §3.3.2) for instance, there exists a mapping of time series onto a Hilbert space such that $-\log k$ can be written as a squared Hilbertian distance between such mappings, and hence $-\log k$ defines a Hilbertian metric for time series which can be used with distance-based tools such as nearest neighbors.
Second, on a more practical note, the exponent in Equation (5) is numerically problematic when it is large: in such a situation, the kernel matrices produced by $k$ would be diagonally dominant. This is analogous to selecting a bandwidth parameter which is too small when using the Gaussian kernel on high-dimensional data. By proving the infinite divisibility of $k$, this exponent can be removed and substituted with any arbitrary positive exponent.
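The numerical effect can be illustrated with the Gaussian kernel, which is itself infinitely divisible: raising a Gaussian kernel matrix elementwise to a large power makes it nearly diagonal, whereas a small fractional power keeps it positive semidefinite while preserving mass off the diagonal. All parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
pts = rng.standard_normal((8, 30))                       # 8 points in dimension 30
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
K = np.exp(-0.05 * sq)                                   # Gaussian kernel matrix

K_hard = K ** 40    # large exponent: matrix close to the identity (diagonally dominant)
K_soft = K ** 0.1   # fractional power: still PSD since the Gaussian kernel is
                    # infinitely divisible (equivalent to a larger bandwidth)

off_diag_hard = K_hard[~np.eye(8, dtype=bool)].max()
min_eig_soft = np.linalg.eigvalsh(K_soft).min()
```

A diagonally dominant kernel matrix carries almost no information about the relative similarity of distinct points, which is what removing the exponent avoids.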
To establish the infinite divisibility result, we need a few additional notations. Let $\mathcal{M}$ be the set of positive measures on $\mathbb{R}^d$ with finite second moment. This set is a semigroup (Berg et al., 1984) when endowed with the usual addition of measures.
For two measures $\mu$ and $\mu'$ of $\mathcal{M}$, the kernel
is an infinitely divisible positive definite kernel.
The following identity is valid for any positive definite matrix $\Sigma$ of size $d$:
$$\det(\Sigma)^{-1/2} = (2\pi)^{-d/2} \int_{\mathbb{R}^d} e^{-\frac{1}{2}\, u^\top \Sigma\, u}\, du.$$
Given a measure $\mu$ with finite second moment on $\mathbb{R}^d$, we thus have
where $\langle \cdot, \cdot \rangle$ stands for the Frobenius dot-product between matrices and $\rho$ is the standard multivariate normal density. In the formalism of Berg et al. (1984), the integral of Equation (6) is an integral of bounded semicharacters on the semigroup $\mathcal{M}$ equipped with autoinvolution. Each semicharacter $\chi_u$ is indexed by a vector $u$ as
To verify that $\chi_u$ is a semicharacter, notice that $\chi_u(0) = 1$, where $0$ is the zero measure, and that $\chi_u(\mu + \mu') = \chi_u(\mu)\,\chi_u(\mu')$ for two measures $\mu, \mu'$ of $\mathcal{M}$. Now, using the fact that the multivariate normal density is a stable distribution, one has for any $n \in \mathbb{N}$,
where $\rho^{*n}$ is the $n$-th convolution of the density $\rho$, which proves the result.
For a suitable range of the degrees-of-freedom parameter, $-\log k$ is a negative definite kernel.
$k$ is infinitely divisible as the product of two infinitely divisible kernels computed on two different representations of $x$ and $x'$: first as empirical measures with locations enumerated in the columns of the explanatory matrices, and second as empirical measures with locations enumerated in the columns of the stacked matrices. The sets of weights for both representations are uniform weights. The negative definiteness of $-\log k$ follows from, and is equivalent to, the infinite divisibility of $k$ (Berg et al., 1984, §3.2.7).
The infinite divisibility of the joint matrix-normal inverse-Wishart distribution would be a sufficient condition to obtain directly the infinite divisibility of $k$, using for instance Berg et al. (1984, §3.3.7). Unfortunately, we have neither been able to prove this property nor found it in the literature. The inverse-Wishart distribution alone is known not to be infinitely divisible in the general case (Lévy, 1948). Nor do we know whether $k$ can be proved to be infinitely divisible outside the range of parameters considered above; this condition also plays an important role in Proposition 4.
In the next sections, we will usually refer to the (negative definite) autoregressive kernel given below, defined up to a constant, rather than considering $k$ itself.
3 Extension to Time Series Valued in a Set Endowed with a Kernel
We show in Section that autoregressive kernels can be extended quite simply to time series valued in arbitrary spaces by considering Gram matrices. This extension is interesting but can be very computationally expensive. We propose a way to mitigate this computational cost by using low-rank matrix factorization techniques in Section 3.2
3.1 Autoregressive Kernels Defined Through Arbitrary Gram Matrices
Using again notation introduced in Equation (3), we write for the Gram matrix of all explanatory variables contained in the joint sample and for the Gram matrix of all outputs of the local regression formulations. As stated in Equation (7) above, and by extension can be defined as a function of the Gram matrices and . To alleviate notations, we introduce two functions and defined respectively on the cone of positive semidefinite matrices and on :
Using these notations, we have
which highlights the connection between the kernel and the Gram matrices. In the context of kernel methods, the natural question brought forward by this reformulation is whether the linear dot-product Gram matrices can be replaced by arbitrary kernel matrices between the vectors enumerated in the samples, with the resulting quantity still being a valid positive definite kernel between $x$ and $x'$. More generally, suppose that $x$ and $x'$ are time series of structured objects, graphs for instance. In such a case, can Equation (7) be used to define a kernel between time series of graphs $x$ and $x'$ by directly using Gram matrices that measure the similarities between the graphs observed in $x$ and $x'$? We prove here that this is possible.
Let us introduce some notations to establish this result. Given an $n$-tuple of points taken in an arbitrary set $\mathcal{X}$, and a positive kernel $\kappa$ on $\mathcal{X}$, we write $K$ for the Gram matrix of pairwise evaluations of $\kappa$ on these points.
For two lists $u$ and $u'$, we write for the concatenation of $u$ and $u'$. Recall that an empirical measure on a measurable set is a finite sum of weighted Dirac masses, $\sum_{i} w_i\, \delta_{z_i}$, where the $z_i$ are the locations and the $w_i$ the weights of such masses.
For two empirical measures $\mu$ and $\mu'$ defined on a set $\mathcal{X}$ by their locations and weights respectively, the function
is a negative definite kernel.
We follow the approach of the proof of Cuturi et al. (2005, Theorem 7). Consider $N$ measures $\mu_1,\dots,\mu_N$ and real weights $c_1,\dots,c_N$ such that $\sum_{i=1}^{N} c_i = 0$. We prove that the quantity
$\sum_{i,j=1}^{N} c_i c_j\, \psi(\mu_i,\mu_j)$, where $\psi$ denotes the function above, is necessarily non-positive. Consider the finite set of all locations in $\mathcal{X}$ enumerated in all measures $\mu_i$. For each point $z$ in this set, we consider the function $\kappa(\cdot, z)$ in the reproducing kernel Hilbert space of functions defined by $\kappa$. Let $H$ be the finite-dimensional subspace spanned by all such images. For each empirical measure $\mu_i$ we consider its counterpart, the empirical measure in $H$ with the same weights and with locations defined as the mapped locations of $\mu_i$ in $H$. Since for two points $z, z'$ the reproducing property yields $\kappa(z,z') = \langle \kappa(\cdot,z), \kappa(\cdot,z')\rangle$, the quantity is unchanged when $\kappa$ is cast as a positive definite kernel on the Euclidean space $H$. Hence the left-hand side of Equation (9) is non-positive by negative definiteness of $\psi$ on empirical measures over a Euclidean space.
We now consider two time series $x$ and $x'$ taking values in an arbitrary space $\mathcal{X}$. For any sequence, we write accordingly for its subsequences of consecutive values. To summarize the transitions enumerated in $x$ and $x'$, we consider the sequences of subsequences
Considering now a p.d. kernel on subsequences of length $p$ and another on subsequences of length $p+1$, we can build the Gram matrices,
Given two time series in $\mathcal{X}$, the autoregressive negative definite kernel of order $p$, with given parameter and base kernels as above, defined as
is negative definite.
is negative definite as the sum of three negative definite kernels: is an additive function in and and is thus trivially negative definite. The term can be cast as times the negative definite kernel defined on measures of with kernel while the term is times the negative definite kernel defined on measures of with kernel .
3.2 Approximations Using Low-Rank Factorizations
We consider in this section matrix factorization techniques to approximate the kernel matrices used to compute the autoregressive kernel by low-rank matrices. Theorem 5 provides a useful tool to control the tradeoff between the accuracy and the computational speed of this approximation.
3.2.1 Computing using low-rank matrices
Consider two low-rank matrices whose products with their own transposes approximate the two kernel matrices. Namely, such that the Frobenius norms of the differences
are small, where the Frobenius norm of a matrix $M$ is $\|M\|_F = \big(\sum_{i,j} M_{ij}^2\big)^{1/2}$.
Computing the kernel from such factorizations requires far fewer operations. Techniques to obtain such low-rank matrices
range from standard truncated eigenvalue decompositions, such as the power method, to incomplete Cholesky decompositions (Fine and Scheinberg, 2002; Bach and Jordan, 2005) and Nyström methods (Williams and Seeger, 2001; Drineas and Mahoney, 2005), which are arguably the most popular in the kernel methods literature. The analysis we propose below is valid for any factorization method.
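A minimal sketch of one such factorization, using a truncated eigendecomposition of a PSD Gram matrix; incomplete Cholesky and Nyström methods would avoid forming the full matrix, and the function name and shapes here are illustrative.

```python
import numpy as np

def low_rank_psd(K, r):
    """Rank-r factor L (n x r) of a PSD matrix K such that L @ L.T
    approximates K, obtained by keeping the top r eigenpairs."""
    w, V = np.linalg.eigh(K)                 # ascending eigenvalues
    idx = np.argsort(w)[::-1][:r]            # indices of the r largest
    w_top = np.clip(w[idx], 0.0, None)       # guard against tiny negatives
    return V[:, idx] * np.sqrt(w_top)        # scale eigenvectors columnwise

rng = np.random.default_rng(5)
Z = rng.standard_normal((50, 6))
K = Z @ Z.T                                  # PSD Gram matrix of exact rank 6
L = low_rank_psd(K, 6)
err = np.linalg.norm(K - L @ L.T)            # Frobenius norm of the residual
```

When K has rapidly decaying spectrum, a rank far below n already yields a small residual, which is the regime in which the approximation bounds below are useful.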
Let $f$ be the function defined in Equation (8); then $f$ is a strictly concave function which is strictly increasing, in the sense that $f(A) < f(B)$ if $B - A$ is positive semidefinite and $A \neq B$.
The gradient of $f$ is a positive definite matrix; as a consequence, $f$ is a strictly increasing function. The Jacobian of this gradient defines, for any matrix direction, the Hessian of $f$ as a quadratic form, which is negative for any positive definite matrix. Hence the Hessian of $f$ is minus a positive definite quadratic form, and $f$ is thus strictly concave.
We use a first-order argument to bound the difference between the approximation and the true value of the kernel:
Given two time series x,x’, for any low rank approximations and in such that and we have that
Proof. Immediate, given that the function defined in Equation (8) is concave and increasing.
3.2.2 Early stopping criterion
Incomplete Cholesky decomposition and Nyström methods can iteratively build a series of low-rank factors whose products increase towards the kernel matrices as the rank grows. These series can be obtained without having to compute explicitly the whole of the kernel matrices, except for their diagonals.
The iterative computations can be halted whenever an upper bound on the norms of the residuals goes below an approximation threshold.
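A sketch of a pivoted incomplete Cholesky factorization with such an early-stopping rule; the stopping quantity is the trace of the residual diagonal, which upper-bounds the Frobenius norm of the (positive semidefinite) residual. The full-matrix access below is for brevity; practical implementations only query the diagonal and the selected columns.

```python
import numpy as np

def incomplete_cholesky(K, tol):
    """Pivoted incomplete Cholesky of a PSD matrix K. Stops once the trace
    of the residual diagonal falls below tol; since the residual K - G @ G.T
    stays PSD, its Frobenius norm is bounded by that trace."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()      # diagonal of the current residual
    G = np.zeros((n, 0))
    while d.sum() > tol and G.shape[1] < n:
        j = int(np.argmax(d))                # pivot: largest residual diagonal entry
        g = (K[:, j] - G @ G[j, :]) / np.sqrt(d[j])   # next column of the factor
        G = np.column_stack([G, g])
        d = np.clip(d - g ** 2, 0.0, None)   # rank-one update of the residual diagonal
    return G

rng = np.random.default_rng(6)
Z = rng.standard_normal((40, 5))
K = Z @ Z.T                                  # PSD matrix of exact rank 5
G = incomplete_cholesky(K, tol=1e-8)
residual = np.linalg.norm(K - G @ G.T)
```

On an exactly low-rank matrix, as here, the procedure stops after as many pivots as the rank; on full-rank matrices, the tolerance trades rank against accuracy.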
Theorem 5 can be used to produce such a stopping criterion by a rule which combines an upper bound on the residual norms and the exact norms of the gradients of $f$ at the current iterates. This would require computing Frobenius norms of matrices that can be updated iteratively using rank-one updates. A simpler alternative, which we consider, is to use a uniform bound obtained from the inequality
which yields the following bound: