Fast, asymptotically efficient, recursive estimation in a Riemannian manifold

05/17/2018 · Jialun Zhou et al. · Université de Franche-Comté, Université de Bordeaux

Stochastic optimisation in Riemannian manifolds, especially the Riemannian stochastic gradient method, has attracted much recent attention. The present work applies stochastic optimisation to the task of recursive estimation of a statistical parameter which belongs to a Riemannian manifold. Roughly, this task amounts to stochastic minimisation of a statistical divergence function. The following problem is considered: how to obtain fast, asymptotically efficient, recursive estimates, using a Riemannian stochastic optimisation algorithm with decreasing step sizes? In solving this problem, several original results are introduced. First, without any convexity assumptions on the divergence function, it is proved that, with an adequate choice of step sizes, the algorithm computes recursive estimates which achieve a fast non-asymptotic rate of convergence. Second, the asymptotic normality of these recursive estimates is proved, by employing a novel linearisation technique. Third, it is proved that, when the Fisher information metric is used to guide the algorithm, these recursive estimates achieve an optimal asymptotic rate of convergence, in the sense that they become asymptotically efficient. These results, while relatively familiar in the Euclidean context, are here formulated and proved, for the first time, in the Riemannian context. In addition, they are illustrated with a numerical application to the recursive estimation of elliptically contoured distributions.

1 Introduction

Over the last five years, the data science community has devoted significant attention to stochastic optimisation in Riemannian manifolds. This interest was spurred by Bonnabel, who proved the convergence of the Riemannian stochastic gradient method [7]. Later on [32], the rate of convergence of this method was studied in detail, under various convexity assumptions on the cost function. More recently, asymptotic efficiency of the averaged Riemannian stochastic gradient method was proved in [29]. Previously, for the specific problem of computing Riemannian means, several results on the convergence and asymptotic normality of Riemannian stochastic optimisation methods had been obtained [4][3].

The present work moves in a different direction, focusing on recursive estimation in Riemannian manifolds. While recursive estimation is a special case of stochastic optimisation, it has its own geometric structure, given by the Fisher information metric. Here, several original results will be introduced, which show how this geometric structure can be exploited to design Riemannian stochastic optimisation algorithms which compute fast, asymptotically efficient, recursive estimates of a statistical parameter which belongs to a Riemannian manifold. For the first time in the literature, these results extend, from the Euclidean context to the Riemannian context, the classical results of [23][13].

The mathematical problem, considered in the present work, is formulated in Section 2. This involves a parameterised statistical model P = {P_θ ; θ ∈ Θ} of probability distributions, where the statistical parameter θ belongs to a Riemannian manifold Θ. Given independent observations, with distribution P_{θ*} for some θ* ∈ Θ, the aim is to estimate the unknown parameter θ*. In principle, this is done by minimising a statistical divergence function D(θ), which measures the dissimilarity between P_θ and P_{θ*}. Taking advantage of the observations, there are two approaches to minimising D(θ): stochastic minimisation, which leads to recursive estimation, and empirical minimisation, which leads to classical techniques, such as maximum-likelihood estimation [9][10].

The original results, obtained in the present work, are stated in Section 3. In particular, these are Propositions 2, 4, and 5. Overall, these propositions show that recursive estimation, which requires fewer computational resources than maximum-likelihood estimation, can still achieve the same optimal performance, characterised by asymptotic efficiency [18][31].

To summarise these propositions, consider a sequence of recursive estimates θ_n, computed using a Riemannian stochastic optimisation algorithm with decreasing step sizes (n is the number of observations already processed by the algorithm). Informally, under assumptions which guarantee that θ* is an attractive local minimum of D(θ), and that the algorithm is neither too noisy nor too unstable in the neighborhood of θ*,
Proposition 2 states that, with an adequate choice of step sizes, the θ_n achieve a fast non-asymptotic rate of convergence to θ*. Precisely, the expectation of the squared Riemannian distance between θ_n and θ* is O(1/n). This is called a fast rate, because it is the best achievable for any step sizes proportional to n^(-q) with q ∈ (1/2, 1] [5][13]. Here, this rate is obtained without any convexity assumptions, for twice differentiable D(θ). It would still hold for non-differentiable, but strongly convex, D(θ) [32].
Proposition 4 states that the distribution of the θ_n becomes asymptotically normal, centred at θ*, when n grows increasingly large, and also characterises the corresponding asymptotic covariance matrix. This proposition is proved using a novel linearisation technique, which also plays a central role in [29].
Proposition 5 states that, if the Riemannian manifold Θ is equipped with the Fisher information metric of the statistical model, then Riemannian gradient descent with respect to this information metric, when used to minimise D(θ), computes recursive estimates which are asymptotically efficient, achieving the optimal asymptotic rate of convergence, given by the Cramér-Rao lower bound. This is illustrated with a numerical application to the recursive estimation of elliptically contoured distributions, in Section 4.

The end result of Proposition 5 is asymptotic efficiency, achieved using the Fisher information metric. In [29], an alternative route to asymptotic efficiency is proposed, using the averaged Riemannian stochastic gradient method. This method does not require any prior knowledge of the Fisher information metric, but has an additional computational cost, which comes from computing on-line Riemannian averages.

The proofs of Propositions 2, 4, and 5 are detailed in Section 5 and in Appendices A and B. Necessary background, about the Fisher information metric (in short, this will be called the information metric), is recalled in Appendix C. Before going on, the reader should note that the summation convention of differential geometry is used throughout the following, when working in local coordinates.

2 Problem statement

Let P = {P_θ ; θ ∈ Θ} be a statistical model, with parameter space Θ and sample space X. To each θ ∈ Θ, the model P associates a probability distribution P_θ on X. Here, Θ is a Riemannian manifold, of dimension d, and X is any measurable space. The Riemannian metric of Θ will be denoted ⟨·, ·⟩, with its Riemannian distance d(·, ·). In general, the metric ⟨·, ·⟩ is not the information metric of the model P.

Let (Ω, F, ℙ) be a complete probability space, and let (x_n ; n = 1, 2, …) be i.i.d. random variables on Ω, with values in X. While the distribution of the x_n is unknown, it is assumed to belong to the model P. That is, x_n ∼ P_{θ*} for some θ* ∈ Θ, to be called the true parameter.

Consider the following problem: how to obtain fast, asymptotically efficient, recursive estimates of the true parameter θ*, based on observations of the random variables x_n? The present work proposes to solve this problem through a detailed study of the decreasing-step-size algorithm, which computes

θ_{n+1} = Exp_{θ_n}( γ_{n+1} u(θ_n, x_{n+1}) )    (1a)

starting from an initial guess θ_0.

This algorithm has three ingredients. First, Exp denotes the Riemannian exponential map of the metric of Θ [26]. Second, the step sizes γ_n are strictly positive, decreasing, and verify the usual conditions for stochastic approximation [23][20]

Σ_n γ_n = +∞    and    Σ_n γ_n² < +∞    (1b)

Third, u(θ, x) is a continuous vector field on Θ for each x ∈ X, which generalises the classical concept of score statistic [18][16]. It will become clear, from the results given in Section 3, that the solution of the above-stated problem depends on the choice of each one of these three ingredients.
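To fix ideas, here is a minimal sketch of how the recursion (1a) might be organised in code, assuming the three ingredients are supplied as callables. The names riemannian_exp and score_field are hypothetical placeholders, not part of the paper, and the step sizes γ_n = a/n are just one admissible choice satisfying (1b).

```python
import numpy as np

def recursive_estimate(theta0, observations, riemannian_exp, score_field, a=1.0):
    """Minimal sketch of the decreasing-step-size recursion (1a).

    theta0         : initial guess on the manifold (any representation).
    observations   : iterable of i.i.d. samples x_1, x_2, ...
    riemannian_exp : callable (theta, v) -> Exp_theta(v), the exponential map.
    score_field    : callable (theta, x) -> u(theta, x), a tangent vector at theta.
    a              : constant in the step sizes gamma_n = a / n.
    """
    theta = theta0
    for n, x in enumerate(observations, start=1):
        gamma = a / n                             # decreasing step sizes, as in (1b)
        v = score_field(theta, x)                 # stochastic direction u(theta_{n-1}, x_n)
        theta = riemannian_exp(theta, gamma * v)  # move along the geodesic
    return theta
```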

A priori knowledge about the model P is injected into Algorithm (1a) using a divergence function D(θ) = D(P_{θ*}, P_θ). As defined in [2], this is a positive function, equal to zero if and only if P_θ = P_{θ*}, and with positive definite Hessian at θ = θ*. Since one expects that minimising D(θ) will lead to estimating θ*, it is natural to require that

E_{θ*} [ u(θ, x) ] = −∇D(θ)    (1c)

In other words, that u(θ, x) is an unbiased estimator of minus the Riemannian gradient of D(θ). With u(θ, x) given by (1c), Algorithm (1a) is a Riemannian stochastic gradient descent, of the form considered in [7][32][29]. However, as explained in Remark 2, (1c) may be replaced by the weaker condition (8), without affecting the results in Section 3. In this sense, Algorithm (1a) is more general than Riemannian stochastic gradient descent.

In practice, a suitable choice of D(θ) is often the Kullback-Leibler divergence [27],

D(θ) = E_{θ*} [ −log L(θ) ]    (2a)

where P_θ is absolutely continuous with respect to P_{θ*}, with Radon-Nikodym derivative L(θ) = dP_θ/dP_{θ*}. Indeed, if D(θ) is chosen to be the Kullback-Leibler divergence, then (1c) is satisfied by

u(θ, x) = ∇ log L(θ)(x)    (2b)

which, in many practical situations, can be evaluated directly, without any knowledge of θ*.
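As an illustration of (2b), the update direction for the Kullback-Leibler divergence is the gradient of the log-likelihood of the current observation, which involves no knowledge of θ*. A minimal sketch, under the simplifying assumption of a Euclidean parameter space (so that the Riemannian gradient reduces to the ordinary one) and with a hypothetical log_density function supplied by the user:

```python
import numpy as np

def score_field(theta, x, log_density, eps=1e-6):
    """u(theta, x) = gradient of log p(x; theta) with respect to theta,
    approximated by central finite differences (illustration only)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (log_density(x, theta + step) - log_density(x, theta - step)) / (2 * eps)
    return grad
```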

3 Main results

The motivation of the following Propositions 1 to 5 is to provide general conditions, which guarantee that Algorithm (1a) computes fast, asymptotically efficient, recursive estimates of the true parameter θ*. In the statement of these propositions, it is implicitly assumed that conditions (1b) and (1c) are verified. Moreover, the following assumptions are considered.
(d1) the divergence function D(θ) has an isolated stationary point at θ = θ*, and Lipschitz gradient in a neighborhood of this point.

(d2) this stationary point is moreover attractive: D(θ) is twice differentiable at θ = θ*, with positive definite Hessian at this point.

(u1) in a neighborhood of θ*, the function E_{θ*} ‖u(θ, x)‖² is uniformly bounded.

(u2) in a neighborhood of θ*, the function E_{θ*} ‖u(θ, x)‖⁴ is uniformly bounded.
For Assumption (d1), the definition of a Lipschitz vector field on a Riemannian manifold may be found in [22]. For Assumptions (u1) and (u2), ‖·‖ denotes the Riemannian norm.
Let Θ* be a neighborhood of θ* which verifies (d1), (u1), and (u2). Without loss of generality, it is assumed that Θ* is compact and convex (see the definition of convexity in [26][30]). Then, Θ* admits a system of normal coordinates (θ^α ; α = 1, …, d) with origin at θ*. With respect to these coordinates, denote the components of u(θ, x) by u^α(θ, x), and let Σ* be the matrix with entries

Σ*_{αβ} = E_{θ*} [ u^α(θ*, x) u^β(θ*, x) ]    (3a)

When (d2) is verified, denote the components of the Hessian of D(θ) at θ = θ* by H_{αβ},

H_{αβ} = ∂²D/∂θ^α ∂θ^β evaluated at θ = θ*    (3b)

Then, the matrix H = (H_{αβ}) is positive definite [1]. Denote by λ its smallest eigenvalue.

Propositions 1 to 5 require the condition that the recursive estimates θ_n are stable, which means that all the θ_n lie in Θ*, almost surely. The need for this condition is discussed in Remark 3. Note that, if θ_n lies in Θ*, then θ_n is determined by its normal coordinates (θ^α_n).

Proposition 1 (consistency)

assume (d1) and (u1) are verified, and the recursive estimates are stable. Then, θ_n → θ* almost surely.

Proposition 2 (mean-square rate)

assume (d1), (d2) and (u1) are verified, the recursive estimates are stable, and γ_n = a/n where 2aλ > 1. Then

E [ d²(θ_n, θ*) ] = O(1/n)    (4)

Proposition 3 (almost-sure rate)

assume the conditions of Proposition 2 are verified. Then,

(5)

Proposition 4 (asymptotic normality)

assume the conditions of Proposition 2, as well as (u2), are verified. Then, the distribution of the re-scaled coordinates (n^(1/2) θ^α_n) converges to a centred d-variate normal distribution, with covariance matrix Σ given by Lyapunov's equation

A Σ + Σ A = a² Σ*    (6)

where A is the matrix with entries A_{αβ} = a H_{αβ} − (1/2) δ_{αβ}, with Σ* given by (3a) (here, δ_{αβ} denotes Kronecker's delta).
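Assuming equation (6) takes the classical form reconstructed above, A Σ + Σ A = a² Σ* with A = aH − (1/2)I, it can be solved numerically as a standard continuous-time Lyapunov equation. A sketch using SciPy (the matrices H and Σ* would come from the model at hand):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def asymptotic_covariance(H, Sigma_star, a):
    """Solve A Sigma + Sigma A^T = a^2 Sigma_star with A = a H - (1/2) I."""
    A = a * H - 0.5 * np.eye(H.shape[0])   # stable when 2 a lambda_min(H) > 1
    return solve_continuous_lyapunov(A, a ** 2 * Sigma_star)

# Sanity check: with H = Sigma_star = I and a = 1, Sigma is the identity,
# in agreement with the asymptotic efficiency statement of Proposition 5 (iii).
Sigma = asymptotic_covariance(np.eye(3), np.eye(3), a=1.0)
assert np.allclose(Sigma, np.eye(3))
```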

Proposition 5 (asymptotic efficiency)

assume the Riemannian metric of Θ coincides with the information metric of the model P, and let D(θ) be the Kullback-Leibler divergence (2a). Further, assume (d1), (d2), (u1) and (u2) are verified, the recursive estimates are stable, and γ_n = a/n where 2a > 1. Then,

(i) the rates of convergence (4) and (5) hold true.
(ii) the distribution of the re-scaled coordinates (n^(1/2) θ^α_n) converges to a centred d-variate normal distribution, with covariance matrix Σ given by Lyapunov's equation (6), in which H is the identity matrix.
(iii) if a = 1, and u(θ, x) is given by (2b), then Σ is the identity matrix, and the recursive estimates θ_n are asymptotically efficient.
(iv) the following rates of convergence also hold
(7a)
(7b)

The following remarks are concerned with the scope of Assumptions (d1), (d2), (u1), and (u2), and with the applicability of Propositions 1 to 5.

Remark 1

(d2), (u1) and (u2) do not depend on the Riemannian metric of Θ. Precisely, if they are verified for one Riemannian metric on Θ, then they are verified for any Riemannian metric on Θ. Moreover, if the function D(θ) is C², then the same is true for (d1). In this case, Propositions 1 to 5 apply for any Riemannian metric on Θ, so that the choice of the metric is a purely practical matter, to be decided according to applications.

Remark 2

the conclusion of Proposition 1 continues to hold, if (1c) is replaced by

(8)

Then, it is even possible to preserve Propositions 2, 3, and 4, provided (d2) is replaced by the assumption that the mean vector field h(θ) = E_{θ*} [ u(θ, x) ] has an attractive stationary point at θ = θ*. This generalisation of Propositions 1 to 4 can be achieved following essentially the same approach as laid out in Section 5. However, in the present work, it will not be carried out in detail.

Remark 3

the condition that the recursive estimates θ_n are stable is standard in all prior work on stochastic optimisation in manifolds [7][32][29]. In practice, this condition can be enforced by replacing Algorithm (1a) with a so-called projected, or truncated, algorithm (a sketch is given after this remark). This is identical to (1a), except that θ_{n+1} is projected back onto the neighborhood Θ* of θ*, whenever it falls outside of this neighborhood [23][20]. On the other hand, if the θ_n are not required to be stable, but (d1) and (u1) are replaced by global assumptions,
(d1') D(θ) has compact level sets and globally Lipschitz gradient.
(u1') E_{θ*} ‖u(θ, x)‖² ≤ C (1 + d²(θ, θ*)) for some constant C and for all θ ∈ Θ.
then, applying the same arguments as in the proof of Proposition 1, it follows that the θ_n converge to the set of stationary points of D(θ), almost surely.
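A minimal sketch of one step of such a projected (truncated) variant, assuming hypothetical helpers riemannian_exp, riemannian_log and dist for the exponential map, its inverse, and the Riemannian distance; the projection used here (pulling the point back along the geodesic issued from a reference point) is only one possible choice:

```python
def projected_update(theta, gamma, v, riemannian_exp, riemannian_log, dist,
                     theta_ref, radius):
    """One step of a truncated variant of (1a): perform the update, then, if the
    new point leaves the geodesic ball of the given radius around theta_ref,
    pull it back onto that ball along the geodesic issued from theta_ref."""
    theta_new = riemannian_exp(theta, gamma * v)
    r = dist(theta_ref, theta_new)
    if r > radius:
        w = riemannian_log(theta_ref, theta_new)          # tangent vector at theta_ref
        theta_new = riemannian_exp(theta_ref, (radius / r) * w)
    return theta_new
```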

Remark 4

from (ii) and (iii) of Proposition 5, it follows that the distribution of n d²(θ_n, θ*) converges to a χ²-distribution with d degrees of freedom. This provides a practical means of confirming the asymptotic efficiency of the recursive estimates θ_n.
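A hedged sketch of the check suggested by this remark: given an array t_values containing the quantities n d²(θ_n, θ*) collected over Monte Carlo runs (a hypothetical input), compare their empirical distribution with the χ²-distribution with d degrees of freedom:

```python
from scipy import stats

def check_asymptotic_efficiency(t_values, dim, alpha=0.05):
    """Kolmogorov-Smirnov test of the values n * d^2(theta_n, theta*) against
    the chi-squared distribution with `dim` degrees of freedom."""
    statistic, p_value = stats.kstest(t_values, "chi2", args=(dim,))
    return p_value > alpha  # True: no evidence against asymptotic efficiency
```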

4 Application: estimation of ECD

Here, the conclusion of Proposition 5 is illustrated, by applying Algorithm (1a) to the estimation of elliptically contoured distributions (ECD) [14][28]. Precisely, in the notation of Section 2, let Θ be the space of p × p positive definite matrices, and X = ℝ^p. Moreover, let each P_Σ (Σ ∈ Θ) have probability density function

p(x ; Σ) ∝ det(Σ)^(-1/2) exp[ h( x^T Σ^(-1) x ) ]    (9)

where the function h is fixed, has negative values, and is decreasing, and ^T denotes the transpose. Then, P_Σ is called an ECD with scatter matrix Σ. To begin, let x_1, x_2, … be i.i.d. random vectors in ℝ^p, with distribution P_{Σ*} given by (9), and consider the problem of estimating the true scatter matrix Σ*. The standard approach to this problem is based on maximum-likelihood estimation [24][28]. An original approach, based on recursive estimation, is now introduced using Algorithm (1a).

As in Proposition 5, the parameter space Θ will be equipped with the information metric of the statistical model just described. In [6], it is proved that this information metric is an affine-invariant metric on Θ. In other words, it is of the general form [21]

⟨V, V⟩_Σ = α tr[ (Σ^(-1) V)² ] + β tr²( Σ^(-1) V )    (10a)

parameterised by constants α and β, where tr denotes the trace and tr² the squared trace. Precisely [6], for the information metric of the model P,

(10b)

where the remaining quantity is a further constant, given by the expectation

(10c)

with I the identity matrix, and h′ the derivative of h. This expression of the information metric can now be used to specify Algorithm (1a).
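A direct transcription of the general affine-invariant form (10a), as reconstructed above, with the constants α and β left as inputs; their model-specific values are those given by (10b) and (10c):

```python
import numpy as np

def affine_invariant_inner(Sigma, V, W, alpha, beta):
    """Inner product of (10a): alpha * tr(Sigma^{-1} V Sigma^{-1} W)
    + beta * tr(Sigma^{-1} V) * tr(Sigma^{-1} W)."""
    S_inv = np.linalg.inv(Sigma)
    return (alpha * np.trace(S_inv @ V @ S_inv @ W)
            + beta * np.trace(S_inv @ V) * np.trace(S_inv @ W))
```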

First, since the information metric is affine-invariant, it is enough to recall that all affine-invariant metrics on Θ have the same Riemannian exponential map [25][28],

Exp_Σ(V) = Σ^(1/2) exp( Σ^(-1/2) V Σ^(-1/2) ) Σ^(1/2)    (11a)

where exp denotes the matrix exponential. Second, as in (ii) of Proposition 5, choose the sequence of step sizes

(11b)

Third, as in (iii) of Proposition 5, let u(Σ, x) be the vector field on Θ given by (2b),

u(Σ, x) = ∇ log L(Σ)(x)    (11c)

where ∇ denotes the gradient with respect to the information metric, and L(Σ) is the likelihood ratio, equal to p(· ; Σ) divided by p(· ; Σ*). Now, replacing (11) into (1a) defines an original algorithm for recursive estimation of the true scatter matrix Σ*.
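The following sketch puts these ingredients together in NumPy/SciPy: the affine-invariant exponential map (11a), and one geodesic step per observation. The step sizes γ_n = a/n are taken for illustration (the paper's specific choice is (11b)), and information_gradient is a hypothetical placeholder for the evaluation of u(Σ, x) described by (12) below, not the paper's code.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def affine_invariant_exp(Sigma, V):
    """Exp_Sigma(V) = Sigma^{1/2} expm(Sigma^{-1/2} V Sigma^{-1/2}) Sigma^{1/2}, as in (11a)."""
    S_half = np.real(sqrtm(Sigma))
    S_half_inv = np.linalg.inv(S_half)
    M = S_half_inv @ V @ S_half_inv
    M = 0.5 * (M + M.T)                 # symmetrise against round-off
    return S_half @ expm(M) @ S_half

def recursive_scatter_estimate(Sigma0, samples, information_gradient, a=1.0):
    """Recursion (1a) specialised to scatter matrices: one geodesic step per observation."""
    Sigma = Sigma0
    for n, x in enumerate(samples, start=1):
        V = information_gradient(Sigma, x)              # u(Sigma, x) of (11c), a symmetric matrix
        Sigma = affine_invariant_exp(Sigma, (a / n) * V)
    return Sigma
```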

To apply this algorithm in practice, one may evaluate u(Σ, x) via the following steps. Denote by ∇₀ log L(Σ) the gradient of log L(Σ) with respect to the affine-invariant metric of [25], which corresponds to α = 1 and β = 0 in (10a). By direct calculation from (9), this is given by

(12a)

Moreover, introduce two further constants, depending on α and β. Then, u(Σ, x) can be evaluated,

(12b)

from the orthogonal decomposition of ∇₀ log L(Σ),

(12c)

Figures 1 and 2 below display numerical results from an application to Kotz-type distributions, which correspond to a particular choice of h in (9) and of the constant in (10c) [14][6]. These figures were generated from Monte Carlo runs of the algorithm defined by (1a) and (11), with random initialisation, for specific values of the model parameters. Essentially the same numerical results could be observed for any other values.

Figure 1 confirms the fast non-asymptotic rate of convergence (4), stated in (i) of Proposition 5. On a log-log scale, it shows the empirical mean of d²(Σ_n, Σ*) over the Monte Carlo runs, as a function of n. This decreases with a constant negative slope equal to −1, after an initial number of observations. Here, the Riemannian distance d(·, ·) induced by the information metric (10) is given by [21]

d²(Σ, Σ̂) = α tr[ ( log(Σ^(-1/2) Σ̂ Σ^(-1/2)) )² ] + β tr²[ log(Σ^(-1/2) Σ̂ Σ^(-1/2)) ]    (13)

where log denotes the symmetric matrix logarithm [17]. Figure 2 confirms the asymptotic efficiency of the recursive estimates Σ_n, stated in (iii) of Proposition 5, using Remark 4. It shows a kernel density estimate of n d²(Σ_n, Σ*), computed for a large fixed value of n over the Monte Carlo runs (solid blue curve). This agrees with a χ²-distribution with d degrees of freedom (dotted red curve), where d = p(p + 1)/2 is indeed the dimension of the parameter space Θ of p × p scatter matrices.
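For reference, a sketch of the squared distance (13) in the form reconstructed above, that is, the affine-invariant distance associated with the metric (10); the constants α and β are left as inputs, since their values depend on the model through (10b) and (10c):

```python
import numpy as np
from scipy.linalg import logm, sqrtm

def squared_information_distance(Sigma1, Sigma2, alpha, beta):
    """d^2(Sigma1, Sigma2) = alpha * tr(L^2) + beta * (tr L)^2,
    with L = log(Sigma1^{-1/2} Sigma2 Sigma1^{-1/2}) the symmetric matrix logarithm."""
    S_half_inv = np.linalg.inv(np.real(sqrtm(Sigma1)))
    L = np.real(logm(S_half_inv @ Sigma2 @ S_half_inv))
    return alpha * np.trace(L @ L) + beta * np.trace(L) ** 2
```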

Figure 1: fast non-asymptotic rate of convergence
Figure 2: asymptotic efficiency (optimal rate of convergence)

5 Proofs of main results

5.1 Proof of Proposition 1

the proof is a generalisation of the original proof in [7], itself modeled on the proof for the Euclidean case in [8]. Throughout the following, let F_n be the σ-field generated by the observations x_1, …, x_n [27]. Recall that the x_n are i.i.d. with distribution P_{θ*}. Therefore, by (1a), θ_n is F_n-measurable and x_{n+1} is independent from F_n. Thus, using elementary properties of conditional expectation [27],

E [ u(θ_n, x_{n+1}) | F_n ] = −∇D(θ_n)    (14a)

E [ ‖u(θ_n, x_{n+1})‖² | F_n ] ≤ K    (14b)

where (14a) follows from (1c), and (14b) from (u1). Let L_D be a Lipschitz constant for ∇D, and K be an upper bound on E_{θ*} ‖u(θ, x)‖², for θ ∈ Θ*. The following inequality is now proved, for any positive integer n,

(15)

once this is done, Proposition 1 is obtained by applying the Robbins-Siegmund theorem [13].
Proof of (15): let c(t) be the geodesic connecting θ_n to θ_{n+1}, with equation

c(t) = Exp_{θ_n}( t γ_{n+1} u(θ_n, x_{n+1}) ),  t ∈ [0, 1]    (16a)

From the fundamental theorem of calculus,

(16b)

Since the recursive estimates are stable, θ_n and θ_{n+1} both lie in Θ*. Since Θ* is convex, the whole geodesic c(t) lies in Θ*. Then, since ∇D is Lipschitz on Θ*, it follows from (16b),

(16c)

Taking conditional expectations in this inequality, and using (14a) and (14b),

(16d)

so (15) follows, since (u1) guarantees that K is finite.
Conclusion: by the Robbins-Siegmund theorem, inequality (15) implies that, almost surely,

(17a)

In particular, from the first condition in (1b), convergence of the sum in (17a) implies

(17b)

Now, since the sequence of recursive estimates θ_n lies in the compact set Θ*, it has at least one point of accumulation in this set, say θ_∞. If (θ_{n_k}) is a subsequence of (θ_n), converging to θ_∞,

where the third equality follows from (17b). This means that θ_∞ is a stationary point of D(θ) in Θ*. Thus, (d1) implies that θ* is the unique point of accumulation of (θ_n). In other words, θ_n → θ* almost surely.

5.2 Proof of Proposition 2

the proof is modeled on the proofs for the Euclidean case, given in [23][5]. It relies on the following geometric Lemmas 1 and 2. Lemma 1 will be proved in Appendix A. On the other hand, Lemma 2 is the same as the trigonometric distance bound of [32]. For Lemma 1, recall that λ denotes the smallest eigenvalue of the matrix H defined in (3b).

Lemma 1

for any λ′ < λ, there exists a neighborhood Θ_{λ′} of θ*, contained in Θ*, with

(18a)

Lemma 2

let κ be a lower bound on the sectional curvature of Θ in Θ*, and let C = C(κ, δ) be the corresponding constant of the trigonometric distance bound, where δ is the diameter of Θ*. For θ′ = Exp_θ(v), where θ, θ′ ∈ Θ*,

(18b)

Proof of (4): let γ_n = a/n with 2aλ′ > 1 for some λ′ < λ, and let Θ_{λ′} be the neighborhood corresponding to λ′ in Lemma 1. By Proposition 1, the θ_n converge to θ* almost surely. Without loss of generality, it can be assumed that all the θ_n lie in Θ_{λ′}, almost surely. Then, (1a) and Lemma 2 imply, for any positive integer n,

(19a)

Indeed, this follows by replacing θ = θ_n and v = γ_{n+1} u(θ_n, x_{n+1}) in (18b). Taking conditional expectations in (19a), and using (14a) and (14b),
Then, by (u1) and (18a) of Lemma 1,

(19b)

where K is an upper bound on E_{θ*} ‖u(θ, x)‖², for θ ∈ Θ*. By further taking expectations,

(19c)

Using (19c), the proof reduces to an elementary reasoning by recurrence. Indeed, replacing γ_n = a/n into (19c), it follows that

(20a)

On the other hand, if the constant b is sufficiently large, then

(20b)

Let n be sufficiently large, so that (20b) is verified and E[ d²(θ_n, θ*) ] ≤ b/n for some b > 0. Then, by recurrence, using (20a) and (20b), one also has that E[ d²(θ_m, θ*) ] ≤ b/m for all m ≥ n. In other words, (4) holds true.

5.3 Proof of Proposition 3

the proof is modeled on the proof for the Euclidean case in [23]. To begin, let W_n be the stochastic process given by

(21a)

The idea is to show that this process is a positive supermartingale, for sufficiently large n. By the supermartingale convergence theorem [27], it then follows that W_n converges to a finite limit, almost surely. In particular, this implies

(21b)

Then, this limit must be equal to zero, since the exponent appearing in (21a) is arbitrary in its admissible interval. Precisely, for any such exponent,

It remains to show that W_n is a supermartingale, for sufficiently large n. To do so, note that by (19b) from the proof of Proposition 2,

Here, the first term on the right-hand side is negative, and the third term dominates the second one for sufficiently large n. Thus, for sufficiently large n, the right-hand side is negative, and W_n is a supermartingale.

5.4 Proof of Proposition 4

the proof relies on the following geometric Lemmas 3 and 4, which are used to linearise Algorithm (1a), in terms of the normal coordinates θ^α. This idea of linearisation in terms of local coordinates also plays a central role in [29].

Lemma 3

let θ_{n+1} be given by (1a), with θ_n ∈ Θ*. Then, in a system of normal coordinates with origin at θ*,

(22a)

where u^α are the components of u(θ_n, x_{n+1}).

Lemma 4

let θ ∈ Θ*. Then, in a system of normal coordinates with origin at θ*,

(22b)

where (∇D)^α are the components of ∇D(θ) and the H_{αβ} were defined in (3b).

Linearisation of (1a): assume θ_n ∈ Θ* for all n. Then, it follows from (22a) and (22b),

(23a)

Denote the re-scaled coordinates by X^α_n = n^(1/2) θ^α_n, and recall that γ_n = a/n. Then, using a first-order expansion of the re-scaling factor, it follows from (23a) that

(23b)

where the matrix A and the inputs ξ_n are identified below, in (24a). Equation (23b) is a first-order, inhomogeneous, linear difference equation, for the "vector" X_n of components X^α_n.

Study of equation (23b): switching to vector-matrix notation, equation (23b) is of the general form

(24a)

where I denotes the identity matrix, A has matrix elements A_{αβ}, and (ξ_n) is a sequence of inputs. The general solution of this equation is [23][19]

(24b)

where the transition matrix is given by

(24c)

By the condition on the step sizes, the matrix A is stable. This can be used to show that [23][19]

(24d)

where ‖·‖ denotes the Euclidean vector norm. Then, it follows from (24d) that the corresponding term converges to zero in probability, in each one of the three cases

Indeed, in the first two cases, the condition required in (24d) can be verified using (4), whereas in the third case, it follows immediately from the estimate of the remainder in (22a).
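Although the exact coefficients of (24a)-(24c) are not reproduced here, the following small numerical check illustrates the transition-matrix (variation-of-constants) solution of a linear difference equation of this type, assuming for the sketch the generic form X_{n+1} = (I − γ_{n+1} A) X_n + γ_{n+1} ξ_{n+1} with γ_n = a/n:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, a = 3, 200, 1.0
A = np.diag([1.0, 2.0, 3.0])        # any stable (positive definite) matrix, standing in for A of (24a)
X1 = rng.standard_normal(d)         # initial condition
X_direct = X1.copy()
Phi = np.eye(d)                     # transition matrix, as in (24c)
forced = np.zeros(d)                # accumulated forced response, as in (24b)

for n in range(1, N + 1):
    gamma = a / (n + 1)
    xi = rng.standard_normal(d)     # inputs xi_{n+1}
    X_direct = (np.eye(d) - gamma * A) @ X_direct + gamma * xi
    Phi = (np.eye(d) - gamma * A) @ Phi
    forced = (np.eye(d) - gamma * A) @ forced + gamma * xi

# direct iteration agrees with the transition-matrix representation of the solution
assert np.allclose(X_direct, Phi @ X1 + forced)
```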
Conclusion: by linearity of (23b), it is enough to consider the case of a zero initial condition in (24a). Then, according to (24b), X_{n+1} has the same limit distribution as the sums

(25)


By (14), (ξ_n) is a sequence of square-integrable martingale differences. Therefore, to conclude that the limit distribution of X_n is a centred d-variate normal distribution, with covariance matrix Σ given by (6), it is enough to verify the conditions of the martingale central limit theorem [15],
(26a)
(26b)
(26c)

where Γ_n is the conditional covariance matrix

(27)

Conditions (26) are verified in Appendix B, which completes the proof.

5.5 Proof of Proposition 5

denote by ∂_α the coordinate vector fields of the normal coordinates θ^α. Since the Riemannian metric of Θ coincides with the information metric of the model P, it follows from (3b) and (41),

H_{αβ} = ⟨∂_α, ∂_β⟩ at θ = θ*    (28a)

However, by the definition of normal coordinates [26], the ∂_α are orthonormal at θ*. Therefore,

H_{αβ} = δ_{αβ}    (28b)

Thus, the matrix H is equal to the identity matrix, and its smallest eigenvalue is λ = 1.
Proof of (i): this follows directly from Propositions 2 and 3. Indeed, since λ = 1, the conditions of these propositions are verified as soon as 2a > 1. Therefore, (4) and (5) hold true.
Proof of (ii): this follows from Proposition 4. The conditions of this proposition are verified as soon as 2a > 1. Therefore, the distribution of the re-scaled coordinates (n^(1/2) θ^α_n) converges to a centred d-variate normal distribution, with covariance matrix Σ given by Lyapunov's equation (6). Since (28b) implies that H is the identity matrix, this covariance matrix is the one stated in (ii), as required.
For the following proof of (iii), the reader may wish to recall that the summation convention is used throughout the present work. That is [26], summation is implicitly understood over any repeated subscript or superscript from the Greek alphabet, taking the values 1, …, d.
Proof of (iii): let