Proper Scoring Rules, Gradients, Divergences, and Entropies for Paths and Time Series

by Patric Bonnier et al.
University of Oxford

Many forecasts are not point predictions but concern the evolution of quantities. For example, a central bank might predict the interest rates during the next quarter, an epidemiologist might predict trajectories of infection rates, a clinician might predict the behaviour of medical markers over the next day, etc. The situation is further complicated since these forecasts sometimes only concern the approximate "shape of the future evolution" or "order of events". Formally, such forecasts can be seen as probability measures on spaces of equivalence classes of paths modulo time-parametrization. We leverage the statistical framework of proper scoring rules with classical mathematical results to derive a principled approach to decision making with such forecasts. In particular, we introduce notions of gradients, entropy, and divergence that are tailor-made to respect the underlying non-Euclidean structure.








1 Introduction

Scoring rules provide a principled approach to forming and evaluating probabilistic predictions. The earliest applications go back at least to the evaluation of weather forecasts [1], but scoring rules have since developed into a rich theoretical framework that plays a central part in modern statistical inference. We refer to [2] and [3] for general background. The theoretical underpinnings of scoring rules are well-developed, but nearly all of the literature focuses on the prediction of scalar- or vector-valued quantities. The aim of this article is to develop a scoring rule framework for an important class of non-Euclidean data, namely sequential data, both in discrete and continuous time.

The Drawbacks of (Naïve) Vectorization.

Given a dataset consisting of multivariate time series (TS), a common approach is to flatten each TS into a long vector and then use a standard pipeline for vector-valued data. However, this approach has several drawbacks. Firstly, it runs into trouble whenever the different TS are irregularly sampled or are of different lengths, since this embeds the different TS in Euclidean spaces of different dimension. Typically this is addressed by ad-hoc approaches such as adding synthetic data by interpolation or dropping data points. Secondly, the relevant information is often independent of the time-parametrization (“time-warping invariance”), at least to a large degree; for instance, the meaning of a spoken word or of an object being filmed is independent of how fast or slowly the audio signal or video is presented. Finally, many models are naturally formulated in continuous time rather than discrete time; for example, stochastic differential equations form a popular class of models in many applications, and it is unclear how to evaluate such continuous-time models in a scoring rule framework other than by the above naive vectorization on an arbitrary time grid.
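As a minimal numpy sketch of the first drawback (with made-up data), the same underlying trajectory observed on two different sampling grids flattens into vectors of different dimensions:

```python
import numpy as np

# The same straight-line trajectory from (0, 0) to (1, 1),
# observed on two different (hypothetical) sampling grids.
grid_a = np.linspace(0.0, 1.0, 5)   # 5 observation times
grid_b = np.linspace(0.0, 1.0, 8)   # 8 observation times

ts_a = np.stack([grid_a, grid_a], axis=1)    # time series of shape (5, 2)
ts_b = np.stack([grid_b, grid_b], axis=1)    # time series of shape (8, 2)

flat_a, flat_b = ts_a.ravel(), ts_b.ravel()
print(flat_a.shape, flat_b.shape)   # (10,) (16,)
```

Any vector-space pipeline now has to reconcile a 10-dimensional with a 16-dimensional representation of what is arguably the same object.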

A Non-Euclidean Data Domain.

Key to our approach is that classic tools from pure mathematics faithfully capture the non-Euclidean structure of the space of (unparametrized) paths. While there is no linear structure that allows for addition of paths of different length, any two paths can be concatenated into one path and any path can be run backwards. Both these operations – concatenation and reversal – are independent of the choice of parametrization, hence they also apply to equivalence classes of paths under reparametrization. We refer to an unparametrized path – that is, an equivalence class of paths under reparametrization – as a track, as it is defined uniquely by the track it carves out in the space where it evolves. A classical result [4] is that there is a “feature map” from the set of tracks into a linear space that is functorial and universal; the former means that operations on tracks (concatenation and reversal) turn into algebraic operations in feature space, the latter means that any function of tracks can be approximated as a linear functional of this map. Moreover, this map is given as a series of iterated integrals, which makes it amenable to computation, and we refer to it as the signature map. In fact, the co-domain of this feature map (“the feature space”) is not only a linear space but forms a so-called Hopf algebra, and it is the Hopf algebra structure that captures operations on tracks as algebraic operations. The third mathematical ingredient that we use is gradients of functions of tracks: the usual definition of linear (Fréchet) differentiability can sometimes be unsuitable for such functions due to the above lack of linear structure. However, Pansu generalized classical differentiation to a special class of groups, and we leverage this to define gradients of functions of (unparametrized) paths. We show that the usual guarantees of gradient descent algorithms apply, which allows us to compute quantities associated with our scoring rule framework (even in the case of Euclidean data, many such quantities are not given in closed form but can be found by first-order methods).


Section 2 recalls the basic definitions and general theory of scoring rules. Section 3 contains the theoretical background; it formalizes the structure of the space of (unparametrized) paths and introduces the signature feature map and the Hopf algebra structure of its co-domain (the feature space). Section 4 then shows how these quantities lead to natural scoring rules on the non-Euclidean space of tracks; in particular, the so-called antipode of the Hopf algebra plays a key role in relating the scoring rule framework to structural properties of tracks. From general results this then immediately leads to definitions of entropy, divergence, and mutual information that – unlike the (naïve) vectorization approach outlined above – are compatible with the structure of probability measures on spaces of (unparametrized) paths. Section 5 then utilizes the fact that one may identify a track with a “group-like” element and shows that the concept of Pansu differentiability leads to a natural notion of gradient descent on the space of paths resp. tracks. Finally, Section 6 demonstrates that despite this approach being motivated by pure mathematics, it leads to efficiently computable quantities with advantageous properties compared to other methods with similar invariances.

1.1 Related Work

One of the earliest empirical insights for time series data was that time-parametrization (“time warping”) invariance is of key importance [5, 6]. Arguably the most popular way to address this invariance is via the classical dynamic time warping distance (DTW) and its many variations, which introduce a distance between time series by searching over time changes. For example, [7, 8] introduce a regularised version of DTW, so-called soft DTW (s-DTW), which addresses the fact that the DTW distance is not differentiable, making it viable for use in deep learning pipelines. In the process, however, it loses the invariance that DTW enjoys, introducing a trade-off between smoothness and invariance. A more general point is that DTW and its variations do not aim to provide the full forecasting framework of scoring rules (divergences between measures, entropy, mutual information of TS), and although DTW approaches successfully deal with time-parametrization, they ignore other structural properties such as concatenation and reversal of TS. Another drawback is that while the focus of DTW is on discrete time, it can be formulated in continuous time (the so-called Fréchet distance), but the computation scales quadratically in the number of sequence entries, which makes it too expensive for many sources of high-frequency data; our distance, in contrast, can be computed in linear time at the price of higher complexity in the state-space dimension. Ultimately, the reason for this gain in efficiency is that the time-warpings are never explicitly computed or exhibited.

Another directly related area is kernel learning. Any kernel with a reproducing kernel Hilbert space induces a scoring rule, see [9, Section 5], and several kernels for sequences have been developed in the literature. Most relevant to our approach is the “signature kernel” introduced in [10]. However, for any scoring rule given by a kernel, the (generalized) divergence becomes simply the maximum mean discrepancy and the entropy simply the variance in the RKHS. While kernels give rise to a powerful class of scoring rules, the success and popularity of non-kernel-based scoring rules is motivation enough to look for other interesting, non-linear scoring rules.

The technical key to our approach comes from mathematics, where iterated integrals, so-called signatures, and non-commutative algebras are used to represent paths. This goes back at least to the seminal work of Chen [4] in algebraic topology and subsequent applications in control theory [11, 12] and, more recently, rough path theory [13]. These results have been influential in stochastic analysis [14, 15] and have only more recently started to be explored in a statistical and machine learning context. We mention pars-pro-toto [16, 17, 18] for inference about laws of stochastic processes; [10, 19, 20] for kernel learning; [21, 22, 23] for Bayesian approaches; [24, 25, 26] for generative modelling; [27, 28, 29] for applications in topology; and [30, 31, 32] for algebraic perspectives.

Finally, we mention that the two topics central to us – invariances and non-Euclidean structure – have been considered in different contexts in scoring rule frameworks. For example, [33] studies equi- and in-variances for scoring rules for Euclidean data; non-vector-valued data such as sets, contours, intervals, and quantiles have also received attention [34, 35, 36].

2 Proper Scoring Rules, Entropies, and Divergences

We briefly recall general background on scoring rules, following closely the notation in [3]; see also [2, 37, 38, 39]. Let be a measurable space (the outcome space), be a set (the action space), and be a function (the loss function). Further, let be a -valued random variable. We consider the following game between a decision maker and Nature: the task of the decision maker is to choose an action , after which Nature reveals the outcome , given by sampling . The decision maker then suffers the loss .

Given a set of probability measures on , a principled probabilistic (Bayesian) approach to this decision problem is to proceed as follows:

  1. Associate with every its Bayes act defined by


    (assuming that a minimum exists; if it is not unique, choose arbitrarily among the minimizers).

  2. Use the Bayes act to define the scoring rule on given by

  3. Use the scoring rule to define the (generalised) entropy , the (generalised) divergence , and the (generalised) mutual information as


    where denotes a space of probability measures (not necessarily on ) and denotes the law of conditioned on the random variable .

The above definitions and nomenclature are justified as follows: firstly, it is an instructive exercise to check that for standard choices of state space , action space , and loss function , the above reduce to classical definitions of entropy, divergence, and mutual information (e.g. if is the set of densities on , then the log score from [40] yields the usual Shannon entropy, Kullback–Leibler divergence, and mutual information); see [3] for more examples. Secondly, Theorem 2.1 below shows that the characteristic properties hold in the full generality of the above setup:

Theorem 2.1 ([3]).

Let , , and be as above. Further, denote with , , and the associated (generalized) scoring rule, entropy, and divergence. Then

  1. the scoring rule (2) is proper, that is


    is minimized at .

  2. is concave,

  3. is affine in for every ,

  4. with equality for ,

  5. with equality for .

Different application areas demand different scoring rules. Classical choices for the Euclidean case are, besides the already mentioned log score, the Brier score, the Tsallis score, the Bregman score, the Hyvärinen score, etc.; see [3] for details. The aim of the remainder of this article is to study the case when is a space of paths or a space of equivalence classes of paths (under reparametrization).
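As a small numerical sanity check of the log-score case mentioned above (with an arbitrary made-up discrete distribution), the generalized entropy is the Shannon entropy, the generalized divergence is the Kullback–Leibler divergence, and the expected score decomposes as entropy plus divergence:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # Nature's distribution (made up for illustration)
q = np.array([0.4, 0.4, 0.2])   # the decision maker's forecast

# Expected log score of forecast q under p: E_p[-log q(X)]
expected_score = -np.sum(p * np.log(q))

shannon_entropy = -np.sum(p * np.log(p))    # generalized entropy H(p)
kl_divergence = np.sum(p * np.log(p / q))   # generalized divergence d(p, q)

# Propriety: d(p, q) >= 0, with expected score = H(p) + d(p, q),
# so the expected score is minimized by the honest forecast q = p.
print(kl_divergence >= 0.0)                                         # True
print(np.isclose(expected_score, shannon_entropy + kl_divergence))  # True
```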

A Toy Example: From Feature Maps to Bayes Actions.

To motivate our scoring rule for paths, let us first revisit the vector-valued case, . To arrive at a proper scoring rule, the space of Bayes actions should be large enough to characterize any (sufficiently nice) probability measure on . A classic way to characterize a probability measure is to consider its sequence of moments,


that is, is the mean vector, is the covariance matrix, etc. (We tacitly assume that the sequence of moments is well-defined and decays quickly enough so that it characterizes the measure; see Remark 2.2.) The sequence (7) is an element of the set


of sequences of tensors of increasing degree . In fact, this set forms a vector space under element-wise addition of tensors of the same degree. We define the “feature map”


With the above notation, the moment sequence (7) is simply the mean of ,


Using a well-known characterization of the mean as minimizer of a quadratic we can introduce the loss function


which associates with a probability measure on the Bayes action


It follows from general principles that the resulting scoring rule on the state and action space ,


is proper. Despite the elementary nature of this example, it gives us a simple way to associate with any “feature map” a Bayes action and a scoring rule, and already simple variations lead to interesting questions: for example, if the quadratic loss function is replaced by the absolute value, one ends up with medians of moments as the Bayes act, and many other choices are possible. Such questions fall under the framework of “elicitation” of properties of probability measures with scoring rules, which is an active research area already in the classical vector-valued (even scalar) case; see [41] and [42, 43, 44] for some recent advances.
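The construction above can be sketched numerically. The following toy implementation (truncating the moment sequence at degree two and using synthetic Gaussian samples; all names are choices of the sketch) illustrates that the empirical mean of the feature map, i.e. the moment sequence, minimizes the expected quadratic loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # synthetic samples in R^3

def phi(x):
    """Moment feature map Phi(x) = (1, x, x (x) x), truncated at degree 2, flattened."""
    return np.concatenate([[1.0], x, np.outer(x, x).ravel()])

feats = np.array([phi(x) for x in X])

# Bayes act under the quadratic loss: the mean of the feature map,
# i.e. the empirical moment sequence (1, mean vector, second-moment matrix).
bayes_act = feats.mean(axis=0)

def expected_loss(a):
    return np.mean(np.sum((feats - a) ** 2, axis=1))

# Perturbing the Bayes act can only increase the expected quadratic loss.
perturbed = bayes_act + 0.1 * rng.normal(size=bayes_act.shape)
print(expected_loss(bayes_act) < expected_loss(perturbed))   # True
```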

Remark 2.2.

The question of which probability measures on are characterized by the moment sequence (7) is classical but quite subtle in general; for compactly supported measures, however, it holds trivially. Our focus will soon shift to probability measures on pathspace, but since spaces of paths are generically not even locally compact, compactness is too strong an assumption; in fact, important examples of measures on pathspace such as geometric Brownian motion are not characterized by the “signature moments” that we will use in Section 3 and Section 4. However, one can replace the moment sequence (7) by a normalized moment sequence that characterizes any probability measure on , and this extends to path space and signature moments, see [17] for details. Hence, for simplicity, we assume throughout that the probability measures are characterized by their expected feature map (since this is possible by a slight modification of the feature map).

3 Structure of the Space of (Unparametrized) Paths

We review classic mathematical results about spaces of paths going back to seminal work of Chen [4]. The main result is the existence of a “feature map”


that has as domain the set consisting of equivalence classes of paths that evolve in , and as co-domain the linear space .

We already encountered in Section 2, where it arose as the vector space of sequences of tensors of increasing degree , see (8). However, is not only a vector space but a so-called Hopf algebra: we can multiply elements of and take the “inverse” of elements of . One of the well-known and attractive properties of the map (14) is that these two algebraic operations (multiplication and inversion) capture the natural operations on (concatenation and reversal). Exploiting this correspondence will be essential for our main results in Section 4.

The Domain of Paths.

For simplicity we focus on (equivalence classes of) bounded variation paths; all results immediately extend to paths with much less regularity, such as trajectories of stochastic differential equations or (fractional) Brownian motion, by replacing the iterated Riemann–Stieltjes integrals with stochastic integrals or rough path integrals [13]. A bounded variation path in is a continuous map


For we denote with the set of all continuous bounded variation paths that start at and end in ,


and by


the set of all bounded variation paths in . Although is not a linear space, it has a rich structure given by concatenation and time reversal. Informally, this says that if one can go from to and from to then one can go from to and that if one can go from to then one can go from to . Formally, concatenation and reversal are defined as

  1. For , , their concatenation is defined by

  2. For any there exists an inverse path defined as

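For piecewise linear paths stored as arrays of sample points, the two operations can be sketched as follows (a toy implementation; the second path is translated so that it starts where the first one ends):

```python
import numpy as np

def concat(x, y):
    """Concatenation: translate y so it starts at x's endpoint, then join."""
    y_shifted = y - y[0] + x[-1]
    return np.vstack([x, y_shifted[1:]])

def reverse(x):
    """Time reversal: run the path backwards."""
    return x[::-1].copy()

x = np.array([[0.0, 0.0], [1.0, 0.0]])   # a path from (0,0) to (1,0)
y = np.array([[0.0, 0.0], [0.0, 1.0]])   # a path from (0,0) to (0,1)

xy = concat(x, y)
print(xy[-1])            # endpoint of the concatenation: [1. 1.]
print(reverse(x)[-1])    # the reversed path ends where x started: [0. 0.]
```

Note that concatenating a path with its reversal returns to the starting point, which foreshadows the group structure exploited later.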

The Domain of Tracks

As discussed above, we often want to ignore the time-parametrization; hence the fundamental object we care about is not the set of paths but the set of equivalence classes of paths. It turns out that it is useful to work with a slightly more general equivalence relation, namely tree-like equivalence . We define


With slight abuse of notation, we use the same notation for an element of and an element of , but emphasize that an element is a whole equivalence class of paths. We give the precise definition of the equivalence relation in Appendix A and only note here that if two paths differ by a time-parametrization, that is, for every and an increasing function , then . However, in addition to time-parametrization, tree-like equivalence also identifies paths that backtrack all their excursions, see Appendix A. We invite readers to think of elements of as animal tracks in nature: they reveal shape and direction but not the speed at which the track was made. In particular, we note that the above operations of concatenation and reversal are well-defined for elements of ; after all, they do not depend on the time-parametrization. So, again with slight abuse of notation, we have the concatenation and reversal maps,


The co-domain .

Our first encounter with was in Section 2, as the state space of the moment map (8). However, a more abstract way to introduce is to identify it as the free algebra over . Informally, this means that we want to keep the vector space structure of , but we would also like to have a multiplication, introduced in the most general way possible. Formally, this means is the free algebra over . Despite this abstract characterization as a free object, the space has a very concrete form, which we take as its definition,


(one can then directly verify that this is indeed the free algebra; see [45]). That is, an element of is a sequence of tensors of increasing degree , where by convention . The vector space structure of is simply given by element-wise addition: addition of is defined as


and their multiplication is defined by , i.e.


where denotes the usual tensor (outer) product. Like matrix multiplication, this multiplication is associative but in general not commutative, and it has as multiplicative unit ,


The existence of a unit for multiplication naturally leads to the question of the existence of inverses: that is, for , can one find another element of , denoted by , such that


This is true whenever , and moreover, has the explicit formula


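These operations are easy to realize concretely on the truncated algebra. The sketch below (truncation depth 3 and state dimension 2 are assumptions of the sketch) represents an element as a list of numpy tensors of increasing degree, multiplies degree-wise via tensor products, and inverts an element with unit degree-0 term by the Neumann series 1 − t + t² − …, which terminates at the truncation depth:

```python
import numpy as np

DEPTH, DIM = 3, 2   # truncation depth and state-space dimension (assumed)

def unit():
    """The multiplicative unit 1 = (1, 0, 0, ...)."""
    return [np.ones(()) if n == 0 else np.zeros((DIM,) * n) for n in range(DEPTH + 1)]

def mul(a, b):
    """Degree-n term of the product: sum over k of a_k (x) b_{n-k}."""
    return [sum(np.tensordot(a[k], b[n - k], axes=0) for k in range(n + 1))
            for n in range(DEPTH + 1)]

def inverse(a):
    """Inverse of a = 1 + t (requires a_0 = 1); the series 1 - t + t^2 - ...
    terminates at the truncation depth since t has no degree-0 term."""
    t = [np.zeros(())] + [an.copy() for an in a[1:]]
    result, power, sign = unit(), unit(), 1.0
    for _ in range(DEPTH):
        power, sign = mul(power, t), -sign
        result = [r + sign * p for r, p in zip(result, power)]
    return result

a = unit()
a[1] = np.array([1.0, 0.5])          # an arbitrary element with a_0 = 1
prod = mul(a, inverse(a))
print(all(np.allclose(p, u) for p, u in zip(prod, unit())))   # True: a . a^{-1} = 1
```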
The Feature map .

Definition 3.1.

For , define


It is known that if two paths are tree-like equivalent, , then for every , see [46]. In fact, for the case of reparametrization this follows immediately from the change of variables formula. With slight abuse of notation we now define for .

Definition 3.2.

For , define


where is in the equivalence class of and is as in Definition 3.1.

By [46], is well-defined for in the sense that the choice of does not matter. We refer to the resulting map as the signature map (it is also known as the Chen–Fliess series or the chronological exponential).

Definition 3.3.

We call


the signature map.

A well-known key property of the map is that concatenation and reversal in correspond to multiplication and inversion in . Further, the map is universal (up to fixing the starting point of the track, which is why we fix the starting point and restrict to the domain ) in the sense that it linearizes continuous functions on . We summarize all this in Theorem 3.4 below.

Theorem 3.4.

For every the map


is injective and

  1. for every , there exists a linear functional such that


    uniformly in on compacts.


This is a folk theorem in algebraic topology and control theory; see [4] and [11]. What is less standard is that we use the tree-like equivalence from [46]. ∎

Remark 3.5.

The space is graded by the tensor degree , and decays exponentially fast in , that is


(on the right hand side denotes the bounded variation (semi-)norm, on the left-hand side it denotes the norm on ). Hence, in practice one only needs to compute the first iterated integrals of . For piecewise linear tracks – which is how we identify time series – the first entries of the map can be computed in computational steps: if a track is given by piecewise linear segments then


where . Hence, for a low-dimensional state space, the map can be computed approximately in time that scales linearly in the length of the path.
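A minimal numpy sketch of this computation (truncation depth 3 and the toy path are assumptions of the sketch): each linear segment contributes the tensor exponential of its increment, segments are combined with the tensor product (Chen's identity), so the cost is linear in the number of segments:

```python
import numpy as np

DEPTH = 3   # truncation depth (assumed)

def mul(a, b):
    """Truncated tensor product: degree-n term is sum over k of a_k (x) b_{n-k}."""
    return [sum(np.tensordot(a[k], b[n - k], axes=0) for k in range(n + 1))
            for n in range(DEPTH + 1)]

def seg_sig(inc):
    """Signature of a linear segment: the tensor exponential (inc^(x)n / n!)."""
    sig = [np.ones(())]
    for n in range(1, DEPTH + 1):
        sig.append(np.tensordot(sig[-1], inc, axes=0) / n)
    return sig

def signature(path):
    """Truncated signature of a piecewise linear path, one segment at a time."""
    d = path.shape[1]
    sig = [np.ones(())] + [np.zeros((d,) * n) for n in range(1, DEPTH + 1)]
    for inc in np.diff(path, axis=0):
        sig = mul(sig, seg_sig(inc))
    return sig

path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
sig = signature(path)
print(sig[1])   # degree-1 term = total increment: [1. 1.]

# Inserting a redundant sample point (a reparametrization) changes nothing:
path2 = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(all(np.allclose(a, b) for a, b in zip(sig, signature(path2))))   # True
```

The last check illustrates the time-warping invariance of the signature on a toy example.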

The Antipode in .

The two operations of addition and multiplication turn into a (non-commutative) algebra . However, comes with a bit more structure, namely the so-called antipode map


which is defined as the linear function given by linear extensions of the map


There is an important subset of defined by the property


It turns out that in fact forms a group; it will play an important role in our Bayes acts for the simple reason that the feature map takes values in . We summarize this, along with some facts about that we use later, in the following lemma.

Lemma 3.6.

Let be the antipode on . Then

  1. ,

  2. ,

  3. If , then ,

  4. For a power series , it holds that for any ,

  5. If is invertible, then ,

  6. Let have the form


    where we identify the degree- tensor of with its coordinates . If for every and , then


Items 1 and 2 follow from the definition. Item 3 is well known; see for instance [47, Section 5]. To see Item 4, note that


by Item 2 and linearity. Item 5 follows since the inverse map has the power series expansion


see [48, Lemma 7.16], which together with Item 4 shows the claim. For Item 6, we note that


A simple example of a function that satisfies the requirements of Item 6 in Lemma 3.6 is the sum of squares


which will be used to construct a loss function for tracks in Section 6.
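The antipode has a concrete form on the truncated algebra: reverse the index order of each degree-n tensor and multiply by (−1)^n. The self-contained sketch below (toy truncated-signature code; depth and path are assumptions) checks numerically that applying the antipode to the signature of a path gives the signature of the time-reversed path:

```python
import numpy as np

DEPTH = 3   # truncation depth (assumed)

def mul(a, b):
    """Truncated tensor product of two series."""
    return [sum(np.tensordot(a[k], b[n - k], axes=0) for k in range(n + 1))
            for n in range(DEPTH + 1)]

def seg_sig(inc):
    """Tensor exponential of a segment increment."""
    sig = [np.ones(())]
    for n in range(1, DEPTH + 1):
        sig.append(np.tensordot(sig[-1], inc, axes=0) / n)
    return sig

def signature(path):
    d = path.shape[1]
    sig = [np.ones(())] + [np.zeros((d,) * n) for n in range(1, DEPTH + 1)]
    for inc in np.diff(path, axis=0):
        sig = mul(sig, seg_sig(inc))
    return sig

def antipode(a):
    """Antipode: reverse each degree-n tensor's indices and scale by (-1)^n."""
    return [np.asarray(an) if n == 0 else
            ((-1.0) ** n) * np.transpose(an, axes=tuple(range(n - 1, -1, -1)))
            for n, an in enumerate(a)]

path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
# The antipode of the signature equals the signature of the reversed path:
lhs = antipode(signature(path))
rhs = signature(path[::-1])
print(all(np.allclose(a, b) for a, b in zip(lhs, rhs)))   # True
```

Since the signature is group-like, the antipode also acts as the multiplicative inverse, which is how reversal of tracks becomes an algebraic operation in the feature space.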

From Discrete Time to Continuous Time.

This section has so far focused on paths [resp. tracks], that is, evolutions [resp. equivalence classes of evolutions] in continuous time . However, in practice one typically only has access to discrete-time observations along some grid , that is, a time series. But any TS can be identified with the piecewise linear path


and hence also with an element of after forgetting the parametrization. Working in continuous time when the original data is discrete might look cumbersome and unnecessary at first sight, but it has several advantages. Firstly, all TS are embedded into the same space , respectively , even if the sampling grid varies from TS to TS, which would not be the case if one identified TS with vectors. Secondly, this automatically ensures consistency in terms of high-frequency limits as the grid gets finer, that is, as for a sequence . Finally, many popular models are naturally formulated in continuous time rather than discrete time.

From Tracks to Paths.

Our guiding philosophy is that the fundamental object is the set rather than the set , since the former allows one to ignore the time-parametrization; note that the set of time-parametrizations is infinite-dimensional, since every continuous function can be used to reparametrize a path to , hence working with factors out an infinite-dimensional class of invariances. Nevertheless, for certain applications the parametrization matters, at least to a certain degree. However, this can be easily addressed by adding time as an additional coordinate: to emphasize the dimension of the state space in which the paths evolve, we write (instead of just , as we used until now); similarly for the set of equivalence classes of . Given we embed


That is, a path evolving in is turned into a path in by simply adding an additional coordinate that is time itself. This makes the parametrization part of the “shape” of the trajectory, which in turn is exactly the information that distinguishes tracks; hence


This injection shows that any scoring rule for tracks induces a scoring rule for paths.
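Adding time as an extra coordinate is a one-liner for time series; in the sketch below (made-up scalar series, default unit-spaced grid as an assumption), the resulting path retains its sampling times as part of its shape:

```python
import numpy as np

def add_time(ts, times=None):
    """Embed a d-dimensional time series as a path in R^(1+d) by
    prepending the observation times as an extra coordinate."""
    if times is None:
        times = np.arange(len(ts), dtype=float)   # default: unit-spaced grid
    return np.column_stack([times, ts])

ts = np.array([[1.0], [3.0], [2.0]])   # a scalar time series (made up)
path = add_time(ts)
print(path.shape)   # (3, 2): three sample points in R^2
```

Two series with the same values but different time stamps now trace out different tracks, so the parametrization is no longer factored out.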

4 Scoring Rules For Tracks and Paths

Motivated by the toy example in Section 2 with the moment feature map for data in , we now follow the analogous reasoning on the non-Euclidean space of tracks, using the feature map instead of . Note that neither the domain nor the image of is a linear space, as its image is the group embedded into the linear space . Recall from Section 3 that it is exactly this group structure that captures the concatenation and reversal structure of the space of tracks. This motivates us to


Our first main result, Theorem 4.2, shows that this indeed leads to a proper scoring rule on the space of tracks and that operations on turn into algebraic operations in decision space. Consequences of this result are Proposition 4.3 and Corollary 4.4, which show that the associated entropy on the space of tracks is invariant to time-reversal and how it behaves under conditioning on the past.

A Scoring Rule for Tracks.

We need to introduce an additional space wedged between and defined as the space of all elements starting with a one, formally


Unlike , is not a vector space or a Hopf algebra, but it is a group like , while also being convex as a subset of and topologically closed – unlike the set of invertible elements of . We have the following sequence of inclusions:

Definition 4.1.

Let be convex with a unique minimum at the unit of . Define the left loss function as


Applying step (I) from the scoring rule framework of Section 2, the left Bayes’ act is defined as


Applying step (II) yields the proper scoring rule


Applying step (III) yields the (generalised) entropy, divergence, and mutual information


on the output space and the action space . Analogously, we define the right loss function , the right Bayes act , and the right scoring rule , as well as the right entropy, divergence, and mutual information.

The scoring rule framework of Definition 4.1 turns operations on tracks into algebraic operations in the decision space.

Theorem 4.2.

Let the output space be , the action space and


the loss functions from Definition 4.1. The following properties hold

  1. If is coercive, that is whenever , then for any Borel measure such that is -integrable, both and exist. If is strictly convex, then they are unique.

  2. If then

  3. The Bayes’ acts satisfy


    where denotes the pushforward measure of under and by we denote the law of where and .

  4. If satisfies , then


where denotes the measure given by running samples from backwards in time (formally, is then defined as the law of ).


We give the proofs for the right Bayes’ act, as the proofs for the left Bayes’ act are similar.

(1) We equip with its norm,


which makes it into a separable Hilbert Space.

Fix some measure on and define the map by


We want to show that is convex, coercive, and lower semicontinuous on , as this guarantees the existence of a minimiser: the unit ball of is weakly compact, hence we may choose some weakly compact and convex set such that outside of ; since is convex and lower semicontinuous, it is also weakly lower semicontinuous, and therefore achieves a minimum on , which must be a global minimum. Note that its minimiser must be . It follows that if is strictly convex, then the minimiser is unique.

Note that if is (strictly) convex, then


hence is also (strictly) convex.

Note that is a Banach algebra, that is . By taking multiplicative inverses, this implies that


for any invertible element . As is Borel, and is a Polish space, is a Radon measure by [49, Theorem 7.1.7] and we may choose a compact set such that , and define to be . Then on , , and since


and is coercive, so is .

To see that is lower semicontinuous, note that for a sequence


by Fatou's lemma; the assertion follows.

(2) Since is minimised at the unit it is clear that for , is optimal since .

(3) For we have


(4) If satisfies , then


Entropy, Divergence, and Mutual Information on the Space of Tracks.

We now focus on the (generalized) entropy, divergence and mutual information for probability measures on tracks that results from Definition 4.1.

Proposition 4.3.

For any two probability measures and on it holds that

  1. and

  2. If satisfies , then






The other equalities follow. ∎

Corollary 4.4.

If satisfies and a measure is reversible, that is, and are equal up to their starting distribution, then


In the experiments we will simulate sample paths of Brownian motion, and since this is a reversible process it will not matter whether we use the left or the right entropy.

Remark 4.5.

An alternative to using the group structure in Definition 4.1 of the Bayes act is to use that the group is embedded into the linear space and use this linear structure. That is, we define a Bayes act as . It is easy to show that this gives a proper scoring rule and that . However, this scoring rule relies on the embedding of the group into its ambient vector space and does not account for or respect the group structure. Moreover, the resulting divergence and entropy reduce to just the Euclidean distance and usual variance. The same remark extends to (signature) kernel based scoring, where linear methods are used in an RKHS; see the discussion about non-kernel based scoring in the introduction.

Remark 4.6.

We identify a stochastic process as a path- or sequence-valued random variable, possibly even ignoring its parametrization. However, for some applications the filtration of a stochastic process matters, and one could ask to extend the scoring rule framework to it. A kernel that captures the filtration was introduced in [50], and a kernel algorithm and new applications are given in [19]. To get a non-kernel scoring rule, one could try to replace in Definition 4.1 by the higher-rank signature from [50].

5 Gradient Descent on the Space of Tracks

Given a smooth function , the simplest update rule for gradient descent is


and under additional regularity of , the resulting sequence converges to a minimizer of . Our interest lies in minimizing functions . In accordance with our guiding theme, we do not identify these domains as linear spaces where classical gradient descent could be applied. However, we have seen that provides an isomorphism between the space of tracks and the free group (up to forgetting the starting point of the track)


Hence, the minimization problem for a function of tracks can be reformulated as a minimization problem for a function on the free group . That is, the general problem we try to solve is to find


for any in a class of sufficiently “smooth” real-valued functions on .

There have been many attempts to generalize gradient descent to non-linear domains. Arguably, the case of Riemannian manifolds [51] is the most well-developed among these. However, the group does not come with a Riemannian structure (to wit, only a sub-Riemannian structure [52]). We follow here a somewhat different approach, inspired by the work of Pierre Pansu [53], that directly uses the group structure to define gradients. We show that this gradient in turn allows us to give a straightforward generalization of the gradient update rule (79) from to , so that the resulting sequence converges to the minimizer.

Pansu Derivatives.

The derivative of a function


at a point is a linear functional of that can be defined as the limit


Identifying as the additive group , one can regard the difference quotient that appears in the limit as applying the group operation to and . Hence, if we also have a generalization of multiplication by the scalar , then the above difference quotient makes sense for groups other than the additive group . To formalize scalar multiplication, it turns out that the right notion is that of a Carnot group: a Carnot group is a Lie group that carries a left-invariant geodesic distance and, for each , a bijection


However, our focus is on the free group , and here the scaling by a scalar has an explicit form:

Proposition 5.1.

The group equipped with the geodesic distance and


forms a limit of Carnot groups.

We now have all we need to define the (Pansu) derivative. We denote by the topological dual of .

Definition 5.2.

Let . We define as


whenever this limit exists, and call the Pansu derivative of at in direction . If has a Pansu derivative for all , then we say that is Pansu differentiable. Analogously, we define the spaces of -times Pansu differentiable functions.
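Within a toy truncated-tensor-algebra representation (all code, the depth, and the chosen functional are assumptions of this self-contained sketch), the Pansu derivative can be approximated numerically: dilate the direction by a small t (scaling the degree-n term by t^n), multiply on the right, and form the difference quotient. Here f is a coordinate functional and the direction is the group-like element exp(e1):

```python
import numpy as np

DEPTH, DIM = 3, 2   # truncation depth and state dimension (assumed)

def mul(a, b):
    return [sum(np.tensordot(a[k], b[n - k], axes=0) for k in range(n + 1))
            for n in range(DEPTH + 1)]

def seg_sig(inc):
    sig = [np.ones(())]
    for n in range(1, DEPTH + 1):
        sig.append(np.tensordot(sig[-1], inc, axes=0) / n)
    return sig

def signature(path):
    sig = [np.ones(())] + [np.zeros((DIM,) * n) for n in range(1, DEPTH + 1)]
    for inc in np.diff(path, axis=0):
        sig = mul(sig, seg_sig(inc))
    return sig

def dilate(a, lam):
    """Carnot dilation: scale the degree-n tensor by lam**n."""
    return [lam ** n * an for n, an in enumerate(a)]

def pansu_derivative(f, x, v, t=1e-6):
    """Difference-quotient approximation of the Pansu derivative of f at x
    in the group-like direction v."""
    return (f(mul(x, dilate(v, t))) - f(x)) / t

x = signature(np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]))
v = seg_sig(np.array([1.0, 0.0]))   # direction: exp(e1)
f = lambda a: a[2][0, 0]            # a linear (coordinate) functional

# f(x . delta_t(v)) = x_2[0,0] + t * x_1[0] + t^2/2, so the derivative is x_1[0] = 1.
print(round(pansu_derivative(f, x, v), 3))   # 1.0
```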

The Pansu derivative behaves very similarly to the classical linear gradient. For example, for the proof of convergence of gradient descent on , we make use of the following “Taylor expansion”.

Lemma 5.3.

If , then


Consider the function


then is and by a (classical linear) Taylor expansion we may write


which translates into the asserted equation since is contained in . ∎

Remark 5.4.

A popular approach to differentiating functions of paths is to use a Fréchet derivative as in Malliavin calculus, i.e. one identifies the space of paths as a linear space, see [54]. However, the above Pansu derivative is of a very different nature and – by construction – respects the non-Euclidean structure of paths resp. tracks.

Gradient Descent on .

The idea of gradient descent is to take a step in a direction that minimises in a neighbourhood . In our (Lie) group , a natural choice is to choose some vector and use an exponential neighbourhood of , where denotes the exponential map from the Lie algebra to the Lie group. Hence, the question becomes how to choose