Kernels for sequentially ordered data

We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample (cross-)moments; it allows to obtain a "sequentialized" version of any static kernel. The sequential kernels are efficiently computable for discrete sequences and are shown to approximate a continuous moment form in a sampling sense. A number of known kernels for sequences arise as "sequentializations" of suitable static kernels: string kernels may be obtained as a special case, and alignment kernels are closely related up to a modification that resolves their open non-definiteness issue. Our experiments indicate that our signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows to avoid extensive manual pre-processing.


1 Introduction

Sequentially ordered data are ubiquitous

in modern science, occurring as time series, location series, or, more generally, sequentially observed samples of numbers, vectors, and structured objects. They occur frequently in structured machine learning tasks: in supervised classification and regression, in forecasting, and in unsupervised learning.

Three stylized facts make learning with sequential data an ongoing challenge:

1. Sequential data is usually very diverse, with wildly different features being useful. In the state of the art, this is usually addressed by manual extraction of hand-crafted features, the combination of which is often very specific to the application at hand and does not transfer easily.

2. Sequential data often occurs as sequences of structured objects, such as letters in text, images in video, graphs in network evolution, or heterogeneous combinations of all of these, say in database or internet applications. This is usually handled by ad-hoc approaches adapted to the specific structure.

3. Sequential data is often large, with sequences easily reaching lengths in the hundreds, thousands, or millions. Especially when there are one or more sequences per data point, data sets quickly become huge, and with them the computational cost.

In this paper, we present a novel approach to learning with sequential data that joins the theory of signatures/rough paths with that of kernels/Gaussian processes, addressing the points above:

1. The signature of a path is a (large) collection of canonical features that can be intuitively described as an ordered version of sample moments. Signature features (provably) describe a sequence of vectors completely, and make sequences of different size and length comparable. The use of signature features is therefore a straightforward way of avoiding manual feature extraction (Section 3).

2. Combining signatures with the kernel trick, by considering the signature map as a feature map, yields a kernel for sequences. It also allows learning with sequences of structured objects for which non-sequential kernels exist; consequently, we call the process of obtaining a sequence kernel from a kernel for structured objects "kernel sequentialization" (Section 4).

3. The sequentialized kernel can be computed efficiently via dynamic programming ideas similar to those known for string kernels (Section 5). The kernel formalism further makes the computations amenable to low-rank speed-ups in kernel and Gaussian process learning, such as Nyström-type and Cholesky approximations or inducing point methods (Section 8).

To sum up, we provide a canonical construction that transforms any kernel k on a set X into a kernel k⊕ for sequences in X^+, where we have denoted by X^+ the set of arbitrary-length sequences in X. We call k⊕ the sequentialization of k. This sequentialization is canonical in the sense that it converges to an inner product of ordered moments, the signature, when sequences in X^+ converge to functions in a meaningful way. We will see that existing kernels for sequences such as string or alignment kernels are closely related to this construction.

We explore the practical use of the sequential kernel in experiments which show that sequentialization of non-linear kernels may be beneficial, and that the sequential kernel we propose can beat the state-of-the-art in sequence classification while avoiding extensive pre-processing. Below we give an informal overview of the main ideas, and a summary of related prior art.

1.1 Signature features, and their universality for sequences

Signatures are universal features for sequences, characterizing sequential structure by quantifying dependencies in their change, similar to sample moments. We showcase how to obtain such signature features for the simple example of a two-dimensional, smooth series

 x:[0,1]→R2,t↦(a(t),b(t))⊤,

whose argument we interpret as "time". As with sample moments or sample cumulants, there are signature features of different degree: first degree, second degree, and so on. The first degree part of the signature is the average change in the series, that is, the expectation

 S1(x) := Et[˙x(t)] = ∫0≤t≤1 dx(t) = x(1)−x(0),

where we have written ˙x(t) = (˙a(t),˙b(t))⊤ for the derivative, and t is sampled uniformly from [0,1]. The second degree part of the signature is the (non-centered) covariance of changes at two subsequent time points, that is, the expectation

 S2(x) := (1/2!) Et1<t2[˙x(t1)˙x(t2)⊤] = ∫0≤t1≤t2≤1 dx(t1) dx(t2)⊤,

where the expectation in the first expression is uniformly over time points in chronological order (that is, (t1,t2) is the order statistic of two points sampled from the uniform distribution on [0,1]). This is equivalent to integration over the so-called 2-order-simplex in the second expression, up to a factor of 1/2! corresponding to the uniform density (we put it in front of the expectation and not its inverse in front of the integral to obtain an exponential generating function later on).

Note that the second order signature is different from the second order moment matrix of the infinitesimal changes, due to the chronological order imposed in the expectation. Similarly, one defines the degree-M part of the signature as the M-th order moment tensor of the infinitesimal changes, where the expectation is taken over M chronologically ordered time points (which yields a tensor of degree M). A basis-free definition over an arbitrary RKHS is given in Section 3.2. Note that the signature tensors are not symmetric, similarly to the second order matrix, in which the number arising from the ˙a(t1)˙b(t2) term is in general different from the number obtained from the ˙b(t1)˙a(t2) term.
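The analogy with sample moments can be made concrete for a discretely sampled path, replacing the infinitesimal changes ˙x(t)dt with increments. The sketch below is our illustration (function names are ours, and the half-weight on the diagonal terms is one common discrete convention); it also checks the identity S2 + S2⊤ = S1·S1⊤, which is what remains of symmetry once chronological order is imposed.

```python
import numpy as np

def discrete_signature_moments(x):
    """First and second degree signature features of a discretely sampled path.

    x: array of shape (T, d), samples of a path.  Increments dx_i = x[i+1] - x[i]
    stand in for the infinitesimal changes.  Returns S1 (d,) and S2 (d, d).
    """
    dx = np.diff(x, axis=0)              # increments, shape (T-1, d)
    S1 = dx.sum(axis=0)                  # telescopes to x[-1] - x[0]
    cum = np.cumsum(dx, axis=0)
    d = x.shape[1]
    S2 = np.zeros((d, d))
    for j in range(1, len(dx)):          # ordered sum over i < j of dx_i dx_j^T
        S2 += np.outer(cum[j - 1], dx[j])
    S2 += 0.5 * sum(np.outer(inc, inc) for inc in dx)  # half-weight diagonal
    return S1, S2
```

For the axis-parallel path (0,0) → (1,0) → (1,1), the entry S2[0,1] = 1 while S2[1,0] = 0, reflecting that the a-coordinate changes before the b-coordinate: the chronological order is visible in the asymmetry.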

The signature features are in close mathematical analogy to moments and thus polynomials on the domain of multi-dimensional series. Namely, one can show:

• A (sufficiently regular) series is (almost) uniquely determined by its signature; this is not true for higher order moments or cumulants without the order structure (recall that these almost uniquely determine the distribution of values, without order structure).

• Any (sufficiently regular, real-valued) function of a series can be approximated arbitrarily well by a function linear in the signature features; that is, for a non-linear functional f of our two-dimensional path x,

 f(x) ≈ α + ∑ βi1,…,iM ∫ΔM dxi1(t1)⋯dxiM(tM),

where the sum runs over M ∈ N and i1,…,iM ∈ {1,2}, and we denote x1 := a, x2 := b. Note that x is the indeterminate here, ∫ΔM dxi1(t1)⋯dxiM(tM) is a degree-M part of the signature, and the approximation is over a compact set of paths x. The exact statements are given in Section 3.

From a methodological viewpoint, these assertions mean that not only are signature features rich enough to capture all relevant features of the sequential data, but also that any practically relevant feature can be expressed linearly in the signature, addressing point 1 in the sense of a universal methodological approach.

Unfortunately, native signature features in the form above are only practical in low dimension and low degree: already in the example above of a two-dimensional path, there are 2^M (scalar) signature features of degree M, and in general the computation of such a large number of signature features is infeasible (point 3). Further, real data are discrete sequences, not continuous paths, and possibly consist of objects which are not real vectors (point 2).

1.2 The sequential kernel and sequentialization

The two issues mentioned can be addressed by the kernel trick — more precisely, by the kernel trick applied twice: once, to cope with the combinatorial explosion of signature features, akin to the polynomial kernel which prevents computation of an exponential number of polynomial features; a second time, to allow treatment of sequences of arbitrary objects. This double kernelization addresses point 2, and also the combinatorial explosion of the feature space. An additional discretization-approximation, which we discuss in the next paragraph below, makes the so-defined kernel amenable to efficient computation.

We describe the two kernelization steps. The first kernelization step addresses the combinatorial explosion. It simply consists of taking the scalar product of signature features as the kernel, and then observing that this scalar product of integrals is an integral of scalar products. More precisely, this kernel, called the signature kernel, can be defined as follows, continuing the example with two two-dimensional paths x, ¯x as above:

 K⊕(x,¯x):=⟨S(x),S(¯x)⟩=1+⟨S1(x),S1(¯x)⟩+⟨S2(x),S2(¯x)⟩+⋯.

The scalar product of the S1-s (vectors in R^2) is the Euclidean scalar product on R^2, the scalar product of the S2-s (matrices in R^{2×2}) is the trace product on R^{2×2}, and so on (with higher degree tensor trace products). The "1" is an "S0"-contribution (for mathematical reasons becoming apparent in paragraph 1.3 below).

For the first degree contribution to the signature kernel, one now notes that

 ⟨S1(x),S1(¯x)⟩ = Es,t[˙a(s)⋅˙¯a(t) + ˙b(s)⋅˙¯b(t)] = Es,t[⟨˙x(s),˙¯x(t)⟩].

In analogy, one computes that the second degree contribution to the signature kernel evaluates to

 ⟨S2(x),S2(¯x)⟩ = 1/(2!)^2 Es1<s2, t1<t2[⟨˙x(s1),˙¯x(t1)⟩⋅⟨˙x(s2),˙¯x(t2)⟩].

Similarly, for a higher degree M, one obtains a product of M scalar products in the expectation.
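The algebraic fact used here, that a scalar product of iterated integrals is an iterated integral (here: an ordered sum) of scalar products, can be checked numerically in the discrete setting. The sketch below is our illustration (increments stand in for derivatives, function names are ours): it computes the degree-2 contribution once as a trace product of tensors and once as a double ordered sum of scalar products.

```python
import numpy as np
from itertools import combinations

def level2_tensor(dx):
    """Discrete degree-2 feature: sum over i < j of the outer product dx_i ⊗ dx_j."""
    d = dx.shape[1]
    S2 = np.zeros((d, d))
    for i, j in combinations(range(len(dx)), 2):
        S2 += np.outer(dx[i], dx[j])
    return S2

def level2_kernel(dx, dy):
    """The same trace product, computed as a double ordered sum of scalar products."""
    val = 0.0
    for i1, i2 in combinations(range(len(dx)), 2):
        for j1, j2 in combinations(range(len(dy)), 2):
            val += np.dot(dx[i1], dy[j1]) * np.dot(dx[i2], dy[j2])
    return val
```

The two computations agree because ⟨a⊗b, c⊗d⟩ = ⟨a,c⟩⟨b,d⟩ entrywise, which is exactly the step that later lets the scalar products be replaced by an arbitrary kernel.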

The presentation is not only reminiscent of the polynomial kernel in how it copes with the combinatorial explosion, it also directly suggests the second kernelization to cope with sequences of arbitrary objects: since the sequential kernel is now entirely expressed in scalar products on R^2, the scalar products in the expectation may be replaced by any kernel k on a set X of arbitrary objects, yielding a sequential kernel, now for sequences in X, given as

 k⊕(x,¯x) := 1 + 1/(1!)^2 Es,t[k(˙x(s),˙¯x(t))] + 1/(2!)^2 Es1<s2, t1<t2[k(˙x(s1),˙¯x(t1))⋅k(˙x(s2),˙¯x(t2))] + ⋯ (1.1)

(for expository convenience we assume here that derivatives ˙x are defined in X, which in general is untrue; see Section 4.3 for the general statement). Note that (1.1) can be seen as a process that takes any kernel k on X and makes it into a kernel k⊕ on X-sequences; we therefore term it the "sequentialization" of the kernel k. This addresses point 2, and can be found in more detail in Section 4.
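In discrete form, the sequentialization (1.1) can be written down directly, with expectations replaced by sums over ordered index tuples and derivatives by increments. The following brute-force sketch is our illustration of the formula (the 1/(m!)^2 normalisation mirrors the continuous expression, and the function name is ours); its cost grows rapidly with the truncation degree, which is exactly what the dynamic programming of Section 1.3 avoids.

```python
import numpy as np
from itertools import combinations
from math import factorial

def sequentialize_bruteforce(k, x, xbar, M=3):
    """Truncated sequentialization of a static kernel k, by direct summation.

    Expectations over chronologically ordered time points become sums over
    ordered index tuples; derivatives become increments.  For illustration
    only: the enumeration cost explodes with M and the sequence lengths.
    """
    dx, dy = np.diff(x, axis=0), np.diff(xbar, axis=0)
    kappa = np.array([[k(a, b) for b in dy] for a in dx])
    val = 1.0
    for m in range(1, M + 1):
        level = 0.0
        for I in combinations(range(len(dx)), m):
            for J in combinations(range(len(dy)), m):
                level += np.prod([kappa[i, j] for i, j in zip(I, J)])
        val += level / factorial(m) ** 2
    return val
```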

1.3 Efficient computation and discretization

An efficient way of evaluating the sequential kernel is suggested by a second observation, closely related to (and generalizing) Horner’s method of evaluating polynomials. Note that the sequential kernel can be written as an iterated conditional expectation

 k⊕(x,¯x) = (1 + Es1,t1[k(˙x(s1),˙¯x(t1))⋅(1 + 1/2^2 Es2>s1, t2>t1[k(˙x(s2),˙¯x(t2))⋅(1 + ⋯)])]).

The iterated expectation directly suggests a discretization by replacing expectations by sums, such as

 (1 + 1/n^2 ∑s1,t1 k(˙x(s1),˙¯x(t1))⋅(1 + 1/(2n)^2 ∑s2>s1, t2>t1 k(˙x(s2),˙¯x(t2))⋅(1 + ⋯))),

where the sums range over a discrete set of n support points s1,s2,… resp. t1,t2,… in [0,1]. A reader familiar with string kernels will immediately notice the similarity: the sequential kernel can in fact be seen as an infinitesimal limit of a string kernel, and the (vanilla) string kernel can be obtained as a special case (see Section 6). As a final subtlety, we note that the derivatives of x will not be known in observations, therefore one needs to replace k(˙x(si),˙¯x(tj)) by the discrete difference approximation

 k(x(si+1),¯x(tj+1)) + k(x(si),¯x(tj)) − k(x(si),¯x(tj+1)) − k(x(si+1),¯x(tj)),

where si, si+1 resp. tj, tj+1 denote adjacent support values of the discretization.
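The difference approximation is easy to compute from the Gram matrix of the static kernel; for the Euclidean kernel k(a,b) = ⟨a,b⟩ it reduces, by bilinearity, to the inner product of increments ⟨x(si+1)−x(si), ¯x(tj+1)−¯x(tj)⟩. A minimal sketch (function name ours):

```python
import numpy as np

def increment_kernel_matrix(k, x, xbar):
    """Difference approximation kappa[i, j] of k(dx/dt(s_i), dxbar/dt(t_j)):

    kappa[i, j] = k(x[i+1], xbar[j+1]) + k(x[i], xbar[j])
                - k(x[i], xbar[j+1]) - k(x[i+1], xbar[j]).
    """
    G = np.array([[k(a, b) for b in xbar] for a in x])   # Gram matrix of k
    return G[1:, 1:] + G[:-1, :-1] - G[:-1, 1:] - G[1:, :-1]
```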

Our theoretical results in Section 5 show that the discretization, as described above, converges to the continuous kernel, with a convergence order linear in the sampling density. Moreover, similarly to the Horner scheme for polynomials (or fast string kernel techniques), the iterated sum-product can be efficiently evaluated by dynamic programming techniques on arrays of dimension three, as outlined in Section 8. The computational complexity is quadratic in the length of the sequences and linear in the degree of approximation, and can be further reduced to linear complexity in both with low-rank techniques.
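The iterated sum-product structure translates into a short dynamic program: chains of length m are built from chains of length m−1 via suffix sums over strictly larger index pairs, at cost O(M · len(x) · len(¯x)) instead of exponential enumeration. The sketch below is our illustration of this idea, not the exact algorithm of Section 8; normalisation factors are omitted (they can be absorbed into kappa).

```python
import numpy as np

def sequential_kernel_dp(kappa, M=5):
    """Truncated sequential kernel from the increment-kernel matrix kappa.

    Evaluates 1 + sum_{m=1}^{M} sum_{i1<...<im, j1<...<jm} prod_l kappa[i_l, j_l]
    by dynamic programming in O(M * kappa.size).
    """
    B = kappa.copy()                 # B[i, j]: sum over chains starting at (i, j)
    val = 1.0 + B.sum()
    for _ in range(2, M + 1):
        # T[i, j] = sum of B over the block i' >= i, j' >= j (inclusive suffix sums)
        T = np.cumsum(np.cumsum(B[::-1, ::-1], axis=0), axis=1)[::-1, ::-1]
        S = np.zeros_like(B)         # strict suffix sums: shift by one in both axes
        S[:-1, :-1] = T[1:, 1:]
        B = kappa * S                # extend every chain by one earlier pair
        val += B.sum()
    return val
```

A brute-force enumeration over ordered index tuples gives the same value on small inputs, which is a convenient correctness check.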

This addresses the remaining point 3 and therefore yields an efficiently computable, canonical and universal kernel for sequences of arbitrary objects.

1.4 Prior art

Prior art relevant to learning with sequential data may be found in three areas:

1. dynamic programming algorithms for sequence comparison, in the engineering community;

2. kernel learning and Gaussian processes, in the machine learning community;

3. rough paths, in the stochastic analysis community.

The dynamic programming literature (1) from the 70's and 80's has inspired some of the progress in kernels (2) for sequential data over the last decade, but to our knowledge no connections have so far been made between these two and (3), even though (3) pre-dates the kernel literature for sequences by more than a decade. Beyond the above, we are not aware of literature in statistics of time series that deals with sequence-valued data points other than by first identifying one-dimensional sequences with real vectors of the same size, or even forgetting the sequence structure entirely and replacing the sequences with (order-agnostic) aggregates such as cumulants, quantiles or principal component scores (this is equally true for forecasting methods). Simple as such a reduction to more classic situations may be, it constitutes an important baseline, since only by comparison can one infer whether the ordering was informative.

Dynamic programming for sequence comparison.

The earliest work in which the genuine order information in sequences is used for learning can probably be found in the work of Sakoe et al. [30, 29], which introduces the idea of using editing or distortion distances to compare sequences of different length, and of efficiently determining such distances via dynamic programming strategies. These distances are then employed for classification by maximum-similarity/minimum-distance principles. Through theoretical appeal and efficient computability, sequence comparison methods, later synonymously called dynamic time warping methods, have become one of the standard methods for comparing sequential data [16, 9].

It should be said, though, that sequence comparison methods in their pure form, namely as an efficiently computable distance between sequences, have remained somewhat restricted: by definition, they can be directly adapted only to relatively heuristic distance-based learning algorithms. This may be one of the reasons why sequence comparison/dynamic time warping methods have not given rise to a closed learning theory, and why in their original practical application, speech recognition and speech classification, they were later superseded as gold standard by Hidden Markov Models [26] and more recently by neural network/deep learning methodologies [15].

A possible solution of the above-mentioned shortcomings has been demonstrated in kernel learning literature.

Kernels for sequences.

Kernel learning is a relatively young field, providing a general framework to make non-linear data of arbitrary kind amenable to classical and scalable linear algorithms, such as regression or the support vector machine, in a unified way, by using a non-linear scalar product: this strategy is called "kernelization"; see [32] or [33]. Mathematically, there are close relations to Gaussian process theory [27], which is often considered a complementary viewpoint to kernels, and to aspects of spatial geostatistics [4], particularly Kriging, an interpolation/prediction method from the 60's [23] which was re-discovered 30 years later in the form of Gaussian process regression [35]. In all three incarnations, coping with a specific kind of data practically reduces to finding a suitable kernel (= covariance function), or a family of kernels, for the type of data at hand, after which one can apply a ready arsenal of learning theory and non-linear methodology to such data. In this sense, providing suitable kernels for sequences has proved to be one of the main strategies for removing the shortcomings of the sequence comparison approach.

Kernels for strings, that is, sequences of symbols, were among the first to be considered [14]. Fast dynamic programming algorithms to compute string kernels were obtained a few years later [19, 17]. Almost in parallel and somewhat separately, kernels based on the above-mentioned dynamic time warping approach were developed, for sequences of arbitrary objects [1, 24]. A re-formulation/modification led to the so-called global alignment kernels [6], for which later fast dynamic programming algorithms were found [5] as well. An interesting subtle issue common to both strains was that initially, the dynamic programming algorithms found were quadratic in the length of the sequences, and only later linear complexity algorithms were devised: for string kernels, the transition was made in [17], while for the sequence matching strain, this became only possible after passing to the global alignment kernels [5].

Looking from a general perspective: while in hindsight all of the mentioned kernels can be viewed from Haussler’s original, visionary relation-convolution kernel framework, and all above-mentioned kernels for sequences, in some form, admit fast dynamic programming algorithms, existing literature provides no unifying view on kernels for sequences: the exact relation between string kernels and dynamic time warping/global alignment kernels, or to the classical theory of time series has remained unclear; further, the only known kernels for sequences of arbitrary objects, the dynamic time warping/global alignment kernels, suffer from the fact that they are not proper kernels, failing to be positive definite.

In this paper, we attempt to resolve these issues. More precisely, the string kernel will arise as a special case, and the global alignment kernel as a deficient version of our new signature kernel, built on the theory of signatures and rough paths from stochastic analysis.

Iterated integrals, signatures and rough paths.

Series of iterated integrals are a classic mathematical object that plays a fundamental role in many areas like control theory, combinatorics, homotopy theory, Feynman–Dyson–Schwinger theory in physics and more recently probability theory. We refer to

[20, Section “Historic papers”, p. 97] for a bibliography of influential articles. This series, or certain aspects of it, is treated under various names in the literature, such as “Magnus expansion”, “time ordered exponential”, or the one we chose, which comes from Lyons' rough path theory: “the signature of a path”. The reason we work in the rough path setting is that it provides a concise mathematical framework that clearly separates analytic and algebraic aspects, applies in infinite dimensional spaces like our RKHS, and is robust under noise; we refer to [20, 22, 8] as introductions. The role of the signature as a “non-commutative exponential” serves as a guiding principle for many recent developments in stochastic analysis, though it might be less known outside this community.

The major application of rough path theory was and still is to provide a robust understanding of differential equations that are perturbed by noise, going beyond classic Ito calculus; Hairer's work on regularity structures [12], which was recently awarded a Fields medal, can be seen as a vast generalization of such ideas. The interpretation of the signature in a statistical context is more recent: work of Papavasiliou and Ladroue [25] applies it to SDE parameter estimation; work of Gyurko, Hao, Lyons, and Oberhauser [11, 18, 21] applies it to forecasting and classification of financial time series using linear and logistic regression; work of Diehl [7] and Graham [10] uses signature features for handwritten digit recognition; see also [36] for more recent state-of-the-art results.

The interpretation of the signature as an expectation already occurs as a technical Lemma 3.9 in [13]. A scalar product formula for the norm, somewhat reminiscent of that of the sequential kernel, can be found in the same Lemma. We would also like to mention the Computational Rough Paths package [3], which contains C++ code to compute signature features directly for vector-valued paths. However, it does not contain specialized code to calculate inner products of signature features directly. The Horner-type algorithm we describe in Section 1.3 already gives significant speed improvements when it is applied to paths in finite dimensional linear spaces (that is, the sequentialization of the Euclidean inner product; see Remark 8.1).

Remark 1.1.

The above overview contains a substantial number of examples of independent or parallel development of ideas related to sequential data, possibly due to researchers being unaware of similar ideas in communities socially far from their own. We are aware that this could, unintendedly, also be the case for this very paper. We are thus especially thankful for any pointers from readers of this pre-print version that help us give credit where credit is due.

Acknowledgement

HO is grateful for support by the Oxford-Man Institute of Quantitative Finance.

2 Notation for ordered data

We introduce recurring notation.

Definition 2.1.

The set of natural numbers, including 0, will be denoted by N.

Definition 2.2.

Let A be a set and M ∈ N. We denote

1. the set of integers 1, …, M by [M],

2. the set of ordered M-tuples in A as usual by A^M,

3. the set of such tuples, of arbitrary but finite length, by A^+ := ⋃_{M∈N} A^M, where by convention A^0 contains only the empty tuple.

Moreover, we use the following index notation for a tuple t ∈ A^+,

1. ,

2. the count of the most frequent item in ,

3. for ,

4. for ,

5. for ,

6. if consists of different elements in and denote the number of times they occur in ,

7. for .

In the case that A ⊆ R, we can define subsets of A^M that consist of increasing tuples. These tuples play an important role in calculations with the signature features.

Definition 2.3.

Let A ⊆ R and M ∈ N. We denote the set of monotonously ordered M-tuples in A by

 ΔM(A) :={u∈AM:u[1]≤u[2]≤⋯≤u[M]}.

We denote the union of such tuples by Δ(A) := ⋃_{M∈N} ΔM(A), where again by convention Δ0(A) contains only the empty tuple. We call ΔM(A) the order simplex of degree M on A, and Δ(A) the order simplex on A. A monotonously ordered tuple u is called strict if u[i] < u[i+1] for all i. The index notation of Definition 2.2 applies also to Δ(A), understanding that Δ(A) ⊆ A^+.
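For a finite ordered set A, the order simplex ΔM(A) and its strict variant can be enumerated directly: in Python, monotonously ordered tuples are exactly combinations with replacement, and strict ones are plain combinations (a small illustration; function names ours).

```python
from itertools import combinations, combinations_with_replacement

def order_simplex(A, M):
    """All monotonously ordered M-tuples u with u[1] <= ... <= u[M], entries in A."""
    return list(combinations_with_replacement(sorted(A), M))

def strict_order_simplex(A, M):
    """All strictly increasing M-tuples with entries in A."""
    return list(combinations(sorted(A), M))
```

The counts are the familiar binomial expressions: |ΔM(A)| = C(|A|+M−1, M) with repetition allowed, and C(|A|, M) for strict tuples.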

Remark 2.4.

The above definition of the order simplex is slightly different from the usual one, in which one takes A = [0,1] and the counting starts at 0.

The following notation is less standard, but becomes very useful for the algorithms and recursions that we discuss.

Definition 2.5.

Let , . We use the following notation:

• if there is a such that ,

• if there is strictly ordered tuple such that ,

• if there is a such that and .

For these we also use the notation

1. if ,

2. if and is strict (that is for all ).

3 Signature features: ordered moments for sequential data

In this section, the signature features are introduced in a mathematically precise way, and the properties which make them canonical features for sequences are derived. We refer the interested reader to [20] for further properties of signatures and their use in stochastic analysis.

As outlined in the introductory Section 1.1, the signature can be understood as an ordered and hence non-commutative variant of sample moments. If we model ordered data by a function x : [0,1] → R^n, the signature features are obtained as iterated integrals of x, and the M-times iterated integral can be interpreted as an M-th ordered moment; examples of such features for M = 1, 2 are the (non-commutative) integral moments

 S1 : [[0,1]→R^n] → R^n, x ↦ ∫_0^1 dx(t), or S2 : [[0,1]→R^n] → R^{n×n}, x ↦ ∫_0^1 ∫_0^{t2} dx(t1) dx(t2)^⊤

(where the domain of these maps will be restricted so that the integrals are well-defined). The more general idea of signature features, made mathematically precise, is to consider the integral moments

 SM(x) = (∫_{ΔM} dx_{i1}(t1)⋯dx_{iM}(tM))_{i1,…,iM ∈ {1,…,n}} ∈ R^{n×⋯×n} (3.1)

where the integration is over the M-order-simplex ΔM([0,1]), i.e., over all ordered tuples 0 ≤ t1 ≤ ⋯ ≤ tM ≤ 1, and the choice of the index (i1,…,iM) parametrises the features. The features are all elements of a (graded) linear space, and a kernel for sequential data may then be obtained by taking the scalar product of the signature features.

This section is devoted to a description and characterization of these signature features and the kernel obtained from them. This is done in a basis-free form which allows the treatment of ordered data x : [0,1] → H, where H is a Hilbert space (over R) not necessarily identical to R^n. This will allow considering ordered versions of data whose non-ordered variant may already be encoded via a reproducing kernel Hilbert space H.

For illustration, the reader is invited to keep the prototypical case H = R^n in mind.

3.1 Paths of bounded variation

The main object modelling a (generative) datum will be a continuous sequence, a path x : [0,1] → H, where H is a Hilbert space which is fixed in the following. For reasons of regularity (integrals need to converge), we will restrict ourselves to paths which are continuous and of bounded length (called "variation", as defined below).¹

¹Considering bounded variation paths is slightly restrictive as it excludes samples from stochastic process models such as the Wiener process/Brownian motion. The theory may be extended to such sequences at the cost of an increase in technicality. For reading convenience, this will be done only at a later stage.

The variation of a path is defined as the supremum of variations of discrete sub-sequences taken from the path:

Definition 3.1.

For an ordered tuple t ∈ Δ([a,b]) of length ℓ(t) and a path x : [a,b] → H we define:

1. ∇x(t) as the sequence of increments, ∇x(t)[i] := x(t[i+1]) − x(t[i]) for i = 1, …, ℓ(t)−1,

2. mesh(t) := max_i (t[i+1] − t[i]),

3. V(x(t)) := ∑_{i=1}^{ℓ(t)−1} ∥x(t[i+1]) − x(t[i])∥.

We call ∇x(t) the first difference sequence of x(t), we call mesh(t) the mesh of t, and V(x(t)) the variation of x(t).

Definition 3.2.

We define the variation of a path x : [0,1] → H as

 V(x):=supt∈Δ([0,1])V(x(t)).

The mapping x is said to be of bounded variation if V(x) < ∞.

A bounded variation path, as the name says, is a path with bounded variation.
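For discretely sampled paths, the quantities of Definitions 3.1 and 3.2 are straightforward to compute; the sketch below is our illustration (function names ours). Note that the variation of any sampled tuple only lower-bounds V(x), since Definition 3.2 takes a supremum over all sub-sequences.

```python
import numpy as np

def first_differences(x_t):
    """First difference sequence of sampled path values x(t[1]), ..., x(t[l])."""
    return np.diff(np.asarray(x_t, dtype=float), axis=0)

def mesh(t):
    """Largest gap between consecutive time points of the ordered tuple t."""
    return float(np.max(np.diff(t)))

def variation(x_t):
    """V(x(t)): sum of norms of the first differences."""
    return float(np.sum(np.linalg.norm(first_differences(x_t), axis=1)))
```

For a straight-line path the variation equals the distance between the endpoints, while any back-and-forth movement strictly increases it, which is the intuition behind "bounded length".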

Definition 3.3.

Let a < b. We denote the set of H-valued paths of bounded variation on [a,b] by

 BV([a,b],H) := {x ∈ C([a,b],H) : V(x) < ∞}.

When [a,b] = [0,1], we often omit the qualifier and write BV(H) := BV([0,1],H).

Definition 3.4.

Let W be a linear space and denote by L(H,W) the set of continuous linear maps from H to W. Given x ∈ BV([a,b],H) and y : [a,b] → L(H,W), the Riemann–Stieltjes integral of y over x is defined as the element of W given as

 ∫_a^b y dx := lim_{t∈Δ([a,b]), mesh(t)→0} ∑_{i=1}^{ℓ(t)−1} y(t[i])(∇x(t)[i]).

We also use the shorter notation ∫ y dx when the integration domain is clear from the context.

As in the finite-dimensional case, the above is indeed well-defined, that is, the Riemann sums converge and the limit is independent of the sequence of partitions t; see [20, Theorem 1.16] for a proof of a more general result. Note that the integral itself is in general not a scalar, but an element of the space W.
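A left-point Riemann sum as in Definition 3.4 is easy to implement for sampled paths. The sketch below is our illustration (function name ours), with y acting simply by scalar multiplication; for smooth inputs the sum converges at rate O(mesh).

```python
import numpy as np

def riemann_stieltjes(y, x, t):
    """Left-point Riemann-Stieltjes sum  sum_i y(t[i]) * (x(t[i+1]) - x(t[i])).

    y: time -> scalar (acting by multiplication), x: time -> scalar or vector,
    t: increasing grid of time points.
    """
    t = np.asarray(t, dtype=float)
    xs = np.array([x(s) for s in t])
    total = 0.0
    for i in range(len(t) - 1):
        total = total + y(t[i]) * (xs[i + 1] - xs[i])
    return total
```

For y ≡ 1 the sum telescopes exactly to x(b) − x(a), the first iterated integral of the next subsection.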

3.2 Signature integrals in Hilbert spaces

With the Riemann–Stieltjes integral, we can define the signature integrals in a basis-free way. For this, note that the Riemann–Stieltjes integral of a bounded variation path is again a bounded variation path.

Definition 3.5.

Let x ∈ BV([a,b],H) and M ∈ N. We define:

1. ∫_{Δ1([a,b])} dx^{⊗1} := ∫_a^b dx = x(b) − x(a),

2. ∫_{ΔM([a,b])} dx^{⊗M} := ∫_a^b (∫_{ΔM−1([a,t])} dx^{⊗(M−1)}) ⊗ dx(t) for integers M ≥ 2.

We call ∫_{ΔM([a,b])} dx^{⊗M} the M-th iterated integral of x on [a,b]. When [a,b] = [0,1], we often omit the qualifier and write ∫_{ΔM} dx^{⊗M} for ∫_{ΔM([0,1])} dx^{⊗M}.

In the prototypical case H = R^n, the first iterated integral is a vector in R^n, the second iterated integral is a matrix in R^{n×n}, the third iterated integral is a tensor in R^{n×n×n}, and so on. For the case of arbitrary Hilbert spaces, we need to introduce some additional notation to write down the space in which the iterated integrals live:

Definition 3.6.

We denote by H ⊗ H the tensor product (= outer product) of H with itself. Instead of H ⊗ ⋯ ⊗ H (M times), we also write H^{⊗M}. By convention, H^{⊗0} = R.

The M-th iterated integral of x is an element of H^{⊗M}, a tensor of degree M. The natural space containing the iterated integrals of all degrees together is the tensor power algebra over the Hilbert space H:

Definition 3.7.

The tensor power algebra over H is defined as T(H) := ∏_{M∈N} H^{⊗M}; addition +, multiplication ∗ and an inner product ⟨⋅,⋅⟩ are defined by canonical extension from H, that is:

 (g0,g1,…,gM,…) + (h0,h1,…,hM,…) := (g0+h0, g1+h1, …, gM+hM, …),
 (g0,g1,…,gM,…) ∗ (h0,h1,…,hM,…) := (g0h0, g0⊗h1 + g1⊗h0, …, ∑_{i=0}^{M} gi⊗h_{M−i}, …),
 ⟨(g0,g1,…,gM,…),(h0,h1,…,hM,…)⟩_{T(H)} := ⟨g0,h0⟩_R + ⟨g1,h1⟩_H + ⋯ + ⟨gM,hM⟩_{H^{⊗M}} + ⋯.

The elements of the tensor power algebra with finite norm are denoted by T(H).² Further, we canonically identify H^{⊗M} with the sub-Hilbert space of T(H) that contains exactly the elements of the type (0,…,0,gM,0,…), which we call homogenous (of degree M). Under this identification, for g = (g0,g1,…,gM,…) ∈ T(H), we will also write, for reading convenience, g = g0 + g1 + ⋯ + gM + ⋯, where we may opt to omit zeros in the sum. We adopt an analogous use of sum and product signs ∑, ∏.

²This is slightly non-standard notation: usually T(H) equals the direct sum ⊕_{M∈N} H^{⊗M}.

Example 3.8.

We work out the tensor algebra multiplication and scalar product in the case of the prototypical example H = R^n. Consider homogenous elements of degree 2: it holds that H^{⊗2} ≅ R^{n×n}, and the tensor product of two vectors v, w ∈ R^n is v ⊗ w = vw^⊤ ∈ R^{n×n}. The trace product on R^{n×n} is induced by the Euclidean scalar product, since ⟨v⊗w, v′⊗w′⟩ = ⟨v,v′⟩⋅⟨w,w′⟩. Homogenous elements of degree 3 are similar: it holds that H^{⊗3} ≅ R^{n×n×n}, and the tensor product of three vectors u, v, w ∈ R^n is a tensor T = u ⊗ v ⊗ w of degree 3, where T[i,j,k] = u[i]v[j]w[k] for all i, j, k. The scalar product on R^{n×n×n} is the tensor trace product, ⟨S,T⟩ = ∑_{i,j,k} S[i,j,k]T[i,j,k].
An element is of the form

 g=c+v+M+T+…,wherec∈R,v∈Rn,M∈Rn×n,T∈Rn×n×n,etc.

As an example of tensor algebra multiplication, it holds that

 g∗g = c^2 + 2cv + (2cM + vv^⊤) + (2cT + v⊗M + M⊗v) + ….

Note the difference between v⊗M and M⊗v: it holds that (v⊗M)[i,j,k] = v[i]M[j,k] while (M⊗v)[i,j,k] = M[i,j]v[k].

One checks that T(H) is indeed an algebra:

Proposition 3.9.

T(H) is an associative R-algebra with

 (1,0,…,0,…) resp. (0,…,0,…)

as multiplicative neutral element resp. additive neutral element. In general, the tensor algebra multiplication is not commutative.

Proof.

Verifying the axioms of an associative R-algebra is a series of non-trivial but elementary calculations. To see that ∗ is not commutative, consider the counter-example g = v, h = w with v, w ∈ H linearly independent. Then g∗h = v⊗w ≠ w⊗v = h∗g (in the case of H = R^n, this is vw^⊤ ≠ wv^⊤). ∎

Remark 3.10.

We further emphasize the following points, also to a reader who may already be familiar with the tensor product/outer product:

• The Cartesian product is different from the tensor product in the same way as is different from and from

• In general, g∗h is different from the formal tensor product/outer product g⊗h of elements, since for g,h∈T(H), the formal tensor product is an element of T(H)⊗T(H), while the tensor power algebra product is an element of T(H).

• Under the identification introduced above, there is one case where ∗ coincides with the tensor product, namely when g and h are homogeneous. Identifying g∈H⊗M and h∈H⊗N, it holds that g∗h may be identified with g⊗h∈H⊗(M+N), which is also homogeneous. No equivalence of this kind holds when g and h are not homogeneous.

3.3 The signature as a canonical feature set

We are now ready to define the signature features.

Definition 3.11.

We call the mapping

 S:BV([0,1],H)→T(H), x↦(∫Δm([0,1])dx⊗m)m≥0

the signature map of H-valued paths, and we refer to the coordinates of S(x) as the signature features of x. Similarly, we define the (level-M-)truncated signature map as

 S≤M:BV([0,1],H)→T(H), x↦M∑m=0∫Δm([0,1])dx⊗m.

The map S above is well-defined since the signature can be shown to have finite norm:

Lemma 3.12.

Let x∈BV([0,1],H). Then:

(i) ∥∫Δm([0,1])dx⊗m∥H⊗m ≤ ∥x∥m1/m!,

(ii) ∥S(x)∥T(H) ≤ exp(∥x∥1),

where ∥x∥1 denotes the variation norm of x.

Proof.

(i) is classical in the literature on bounded variation paths; it is also proven in Lemma B.4 of the appendix. (ii) follows from (i) by the triangle inequality, then substituting (i) and the Taylor expansion of exp. ∎
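For piecewise-linear paths the first signature levels are exact finite sums over the increments, which makes the norm bound of the lemma easy to check numerically. The following is a minimal sketch under that assumption (helper names are ours, not the paper's):

```python
import numpy as np
from math import factorial

def signature_levels_12(increments):
    """Levels 1 and 2 of the signature of a piecewise-linear path,
    given its increments Delta_i as rows. For such paths:
      S1 = sum_i Delta_i
      S2 = sum_{i<j} Delta_i (x) Delta_j + (1/2) sum_i Delta_i (x) Delta_i
    """
    inc = np.asarray(increments, dtype=float)
    d = inc.shape[1]
    S1 = inc.sum(axis=0)
    S2 = np.zeros((d, d))
    running = np.zeros(d)
    for step in inc:
        S2 += np.outer(running, step) + 0.5 * np.outer(step, step)
        running += step
    return S1, S2

rng = np.random.default_rng(1)
inc = rng.standard_normal((10, 3)) * 0.1
S1, S2 = signature_levels_12(inc)

# Lemma 3.12 (i): ||S_m|| <= ||x||_1^m / m!, where the variation norm of a
# piecewise-linear path is the sum of the Euclidean lengths of its segments.
var_norm = np.linalg.norm(inc, axis=1).sum()
assert np.linalg.norm(S1) <= var_norm / factorial(1) + 1e-12
assert np.linalg.norm(S2) <= var_norm**2 / factorial(2) + 1e-12
```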

There are several reasons why the signature features are (practically and theoretically) attractive, which we summarize before we state the results exactly:

1. the signature features are a mathematically faithful representation of the underlying sequential data x: the map x↦S(x) is essentially one-to-one.

2. the signature features are sequentially ordered analogues of polynomial features and moments. The tensor algebra T(H) has the natural grading by degree m, with m designating the “polynomial degree”. It is further canonically compatible with natural operations on paths.

3. linear combinations of signature features approximate continuous functions of sequential data arbitrarily well. This is in analogy with classic polynomial features and implies that signature features are as rich a class for learning purposes as one can hope.

Theorem 1 (Signature uniqueness).

Let x,y∈BV([0,1],H). Then S(x)=S(y) if and only if x and y are equal up to tree-like equivalence333We call x,y tree-like equivalent if the concatenation of x with the reversal of y is tree-like. A path z is called tree-like if there exists a continuous map h:[0,1]→[0,∞) with h(0)=h(1)=0 and ∥z(s)−z(t)∥≤h(s)+h(t)−2 inf u∈[s,t] h(u) for all s≤t..

Proof.

This is [20, Theorem 2.29 (ii)]. ∎

Remark 1.

Not being tree-like equivalent is a very weak requirement: e.g., if x and y have a strictly increasing coordinate, they are not tree-like equivalent. All the data we encounter in the experiments is not tree-like. Even if presented with a tree-like path, simply adding time as an extra coordinate (that is, working with t↦(t,x(t)) instead of x) guarantees that the assumptions of the above Theorem are met.

Remark 2.

The above Theorem extends to paths of unbounded variation, cf. [2].

Secondly, the signature features are analogous to polynomial features: the tensor algebra T(H) has a natural grading, with the degree m designating the “polynomial degree”.

Theorem 2 (Chen’s Theorem).

Let x∈BV([0,1],H) and M,N∈N; then

 ∫ΔM([0,1])dx⊗M ⊗ ∫ΔN([0,1])dx⊗N = ∑σ∈OSM,N σ(∫ΔM+N([0,1])dx⊗(M+N)).

Here the sum is taken over all ordered shuffles

 σ∈OSM,N={σ: σ permutation of {1,…,M+N}, σ(1)<⋯<σ(M), σ(M+1)<⋯<σ(M+N)},

and σ acts on H⊗(M+N) by permuting the tensor coordinates.
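The smallest instance of Chen's theorem, M=N=1, can be checked numerically: the only ordered shuffles of {1,2} are the identity and the swap, so the identity reads S1⊗S1 = S2 + S2⊤. A sanity-check sketch for a piecewise-linear path (our own construction, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
inc = rng.standard_normal((20, 4))  # increments of a piecewise-linear path

# Exact levels 1 and 2 of the signature of a piecewise-linear path.
d = inc.shape[1]
S1 = inc.sum(axis=0)
S2 = np.zeros((d, d))
running = np.zeros(d)
for step in inc:
    S2 += np.outer(running, step) + 0.5 * np.outer(step, step)
    running += step

# Chen's theorem for M = N = 1: the two ordered shuffles of {1,2} give
# S1 (x) S1 = S2 + (S2 with tensor coordinates swapped) = S2 + S2^T.
assert np.allclose(np.outer(S1, S1), S2 + S2.T)
```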

Finally, as a direct consequence of the above, and again in analogy with classic polynomial features, linear combinations of signature features approximate continuous functions of sequential(!) data arbitrarily well.

Theorem 3 (Linear approximations).

Let P be a compact subset of BV([0,1],H) of paths that are pairwise not tree-like equivalent. Let f:P→R be continuous in variation norm. Then for any ϵ>0, there exists a w∈T(H) such that

 supx∈P∣∣f(x)−⟨w,S(x)⟩T(H)∣∣<ϵ.
Proof.

The statement follows from the Stone–Weierstraß theorem if the set

 {x↦⟨w,S(x)⟩T(H) : w∈T(H)} (3.2)

forms a point-separating algebra. However, this is a direct consequence of the above: by Chen's theorem, Theorem 2, the set (3.2) is an algebra, and by signature uniqueness, Theorem 1, it separates points. ∎

The proof shows more than stated: for a fixed ONB of T(H), there exists a finite subset of this ONB such that w can be found in the linear span of this finite set.
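A toy illustration of Theorem 3 (our own construction, not from the paper): the functional f(x) = (total increment of coordinate 0) × (total increment of coordinate 1) is continuous, and by the shuffle identity it equals the linear combination S2[0,1]+S2[1,0] of level-2 signature features, so a linear fit on truncated signature features recovers it exactly.

```python
import numpy as np

def sig_features_12(inc):
    """Signature features up to level 2 of a piecewise-linear path,
    flattened into one vector (1, S1, vec(S2))."""
    d = inc.shape[1]
    S1 = inc.sum(axis=0)
    S2 = np.zeros((d, d))
    running = np.zeros(d)
    for step in inc:
        S2 += np.outer(running, step) + 0.5 * np.outer(step, step)
        running += step
    return np.concatenate(([1.0], S1, S2.ravel()))

rng = np.random.default_rng(5)
paths = [rng.standard_normal((12, 2)) for _ in range(50)]
X = np.stack([sig_features_12(p) for p in paths])
y = np.array([p.sum(axis=0)[0] * p.sum(axis=0)[1] for p in paths])

# Least-squares fit: since f is exactly linear in the level-<=2 features,
# the residual vanishes (up to numerical error).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(X @ w, y)
```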

4 Kernelized signatures and sequentialized kernels

Our goal is to construct a kernel for sequential data of arbitrary type, to enable learning with such data. We proceed in two steps and first discuss the case where the sequential data are paths in the Hilbert space H (for example H=Rn). In this scenario, the properties of the signature presented in Section 3 suggest, as a kernel, the scalar product of the signature features. This yields the following kernels:

Definition 4.1.

Fix . We define the kernels

 K⊕: BV(H)×BV(H)→R,(x,y)↦⟨S(x),S(y)⟩T(H), K⊕≤M: BV(H)×BV(H)→R,(x,y)↦⟨S≤M(x),S≤M(y)⟩T(H).

We refer to K⊕ as the signature kernel and to K⊕≤M as the signature kernel truncated at level M.
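At truncation level 2 and for piecewise-linear paths, the truncated signature kernel is a plain inner product of explicit features, so the resulting Gram matrix is symmetric positive semi-definite by construction. A minimal sketch (helper names are ours):

```python
import numpy as np

def sig12(inc):
    # Exact level-1/level-2 signature of a piecewise-linear path.
    d = inc.shape[1]
    S1 = inc.sum(axis=0)
    S2 = np.zeros((d, d))
    running = np.zeros(d)
    for step in inc:
        S2 += np.outer(running, step) + 0.5 * np.outer(step, step)
        running += step
    return S1, S2

def K_trunc2(x_inc, y_inc):
    """<S_{<=2}(x), S_{<=2}(y)>_{T(H)} = 1 + <S1,S1'> + <S2,S2'>_Frobenius."""
    Sx1, Sx2 = sig12(x_inc)
    Sy1, Sy2 = sig12(y_inc)
    return 1.0 + Sx1 @ Sy1 + np.sum(Sx2 * Sy2)

rng = np.random.default_rng(3)
paths = [rng.standard_normal((8, 2)) for _ in range(5)]
G = np.array([[K_trunc2(a, b) for b in paths] for a in paths])

# Symmetric and positive semi-definite, since it is an explicit feature map.
assert np.allclose(G, G.T)
assert np.linalg.eigvalsh(G).min() > -1e-10
```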

To make these kernels practically meaningful, we need to verify a number of points:

(i) That they are well-defined, positive (semi-)definite kernels. Note that checking finiteness of the scalar product is not immediate (but follows from well-known estimates for the signature features).

(ii) That they are efficiently computable. A naive evaluation is infeasible, due to the combinatorial explosion of the number of signature features. However, we show that K⊕ and K⊕≤M can be expressed entirely in terms of integrals of inner products ⟨dx,dy⟩H. The formula can be written as an efficient recursion, similar to the Horner scheme for efficient evaluation of polynomials.

(iii) That they are robust under discretization: the issue is that paths are never directly observed, since all real observations are discrete sequences. The subsequent Section 5 introduces discretizations of the signature and of the two kernelization steps above, which are canonical and consistent with the continuous constructions, in a sampling sense of discrete sequences converging to a bounded variation path.

(iv) That they are robust under noise: in most situations our measurements of the underlying path are perturbed by random perturbations and noise. We discuss the common situation of additive white noise/Brownian motion in Section 7.

We refer to the above procedure as “kernel trick one” and discuss it below in Section 4.1 and Section 4.2.

In a second step, to which we refer as “kernel trick two”, we show that the above is also meaningful for sequential data in an arbitrary set X. This second step yields, for any primary kernel k on the (static) set X, a sequential kernel k⊕ on (a sufficiently regular subset of) paths in X. We thus call this procedure, which transforms a static kernel on X into a kernel on sequences in X, the sequentialization of the kernel k.

Definition 4.2.

Fix and , . We define444For and we denote with the path .

 k⊕: P(X)×P(X)→R,(σ,τ)↦⟨S(ϕ(σ)),S(ϕ(τ))⟩T(H) k⊕≤M: P(X)×P(X)→R,(σ,τ)↦⟨S≤M(ϕ(σ)),S≤M(ϕ(τ))⟩T(H).

We refer to k⊕ as the sequentialization of the kernel k and to k⊕≤M as the sequentialization of k truncated at level M.

As we have done for K⊕ and K⊕≤M, we again need to verify points (i)–(iv) for k⊕ and k⊕≤M. Point (i) follows, under appropriate regularity assumptions on k, immediately from the corresponding statement for K⊕ and K⊕≤M. For point (ii), note that although the data enters K⊕ and K⊕≤M in the recursion formula only in the form of scalar products of differentials in H, it is mathematically somewhat subtle to replace these by evaluations of k over an arbitrary set X, due to the differential operators involved. However, we will do so by identifying the kernel k with a signed measure.

Points (i) and (ii) will be discussed in this section, point (iii) will be the main topic of Section 5, and point (iv) is discussed in Section 7.

4.1 Well-definedness of the signature kernel

For (i), well-definedness, note: K⊕ and K⊕≤M are positive definite kernels, since they are explicitly defined as a scalar product of features. Also, these scalar products are always finite (thus well-defined) for paths of bounded variation, as the following Lemma 4.3 shows.

Lemma 4.3.

Let x,y∈BV([0,1],H). Then it holds that

 |K⊕(x,y)| ≤ exp(∥x∥1)⋅exp(∥y∥1), and likewise |K⊕≤M(x,y)| ≤ exp(∥x∥1)⋅exp(∥y∥1).

Proof.

This follows from Lemma 3.12 and the Cauchy–Schwarz inequality. ∎

Hence, K⊕ and K⊕≤M are well-defined (positive definite) kernels.

4.2 Kernel trick number one: kernelizing the signature

The kernel trick consists of defining a kernel which is (ii) efficiently computable. For this, we show that K⊕ can be entirely expressed in terms of H-scalar products:

Let x,y∈BV([0,1],H). Then:
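For discrete (piecewise-linear) data at truncation level 2, the reduction to scalar products can be sketched directly (our toy illustration of "kernel trick one", not the paper's recursion): the truncated signature kernel depends on the increments only through the Gram matrix G[i,j] = ⟨Δx_i, Δy_j⟩_H, as the following check confirms.

```python
import numpy as np

def sig12(inc):
    # Exact level-1/level-2 signature of a piecewise-linear path.
    d = inc.shape[1]
    S1 = inc.sum(axis=0)
    S2 = np.zeros((d, d))
    running = np.zeros(d)
    for step in inc:
        S2 += np.outer(running, step) + 0.5 * np.outer(step, step)
        running += step
    return S1, S2

def K2_features(x_inc, y_inc):
    # Direct feature computation: 1 + <S1, S1'> + <S2, S2'>_Frobenius.
    Sx1, Sx2 = sig12(x_inc)
    Sy1, Sy2 = sig12(y_inc)
    return 1.0 + Sx1 @ Sy1 + np.sum(Sx2 * Sy2)

def K2_from_gram(G):
    # Same kernel, computed from increment scalar products only:
    # S2 = sum_{i<=k} w(i,k) Dx_i (x) Dx_k with w = 1 (i < k), 1/2 (i == k),
    # so <S2, S2'> = sum w(i,k) w(j,l) G[i,j] G[k,l].
    w = lambda a, b: 1.0 if a < b else (0.5 if a == b else 0.0)
    I, J = G.shape
    level2 = sum(w(i, k) * w(j, l) * G[i, j] * G[k, l]
                 for i in range(I) for k in range(I)
                 for j in range(J) for l in range(J))
    return 1.0 + G.sum() + level2

rng = np.random.default_rng(4)
x_inc = rng.standard_normal((6, 3))
y_inc = rng.standard_normal((7, 3))
G = x_inc @ y_inc.T  # G[i, j] = <Delta x_i, Delta y_j>

assert np.isclose(K2_features(x_inc, y_inc), K2_from_gram(G))
```

In practice the quadruple sum is evaluated with a Horner-type recursion, as the text explains; the brute-force sums here only serve to make the equivalence transparent.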