
# Stein's Method for Probability Distributions on 𝕊^1

In this paper, we propose a modification of the density approach to Stein's method for intervals, adapted to the unit circle 𝕊^1 and motivated by the difference between the geometry of 𝕊^1 and that of Euclidean space. We provide an upper bound on the Wasserstein metric for circular distributions and exhibit a variety of bounds between distributions; in particular, between the von-Mises and wrapped normal distributions, and between the wrapped normal and wrapped Cauchy distributions.

02/16/2019


## 1 Introduction

Since its inception in 1972 [18], Stein's method has been utilised in many research areas to provide a foundation for distributional comparison and approximation. The main objective of such a problem is to bound the (pseudo-)metric

$$ d_{\mathcal{H}}(P,Q) := \sup_{h\in\mathcal{H}}\left|\int h\,dP - \int h\,dQ\right| $$

between an arbitrary probability measure $P$ and a target probability measure $Q$, both defined on the state space $\Omega$, with respect to some class $\mathcal{H}$ of test functions on $\Omega$. Stein's approach to this problem has three critical steps:

1. Find an operator $\mathcal{L}$ such that, for a random variable $X$ distributed according to $Q$, $\mathbb{E}[\mathcal{L}f(X)] = 0$ for all $f$ in a class of functions $\mathcal{F}(\mathcal{L})$.

2. Formulate and solve the so-called Stein equation

$$ \mathcal{L}f_h(x) = h(x) - \mathbb{E}[h(X)]. $$

3. Attempt to bound the left-hand side of the Stein equation using the vast number of tools that the literature has to offer. This includes, but is not limited to, exchangeable pairs, size-biasing, zero-biasing, and operator comparison.

Depending on the approach taken, step 2 may not involve solving the Stein equation. We point the reader towards the exposition [16], which provides a comprehensive guide to Stein's method.
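As an aside, step 1 is easy to check numerically in the classical setting. The sketch below is a standard textbook example, not taken from this paper: for the standard normal distribution, the operator $\mathcal{L}f(x) = f'(x) - xf(x)$ satisfies $\mathbb{E}[\mathcal{L}f(X)] = 0$ for suitably smooth $f$, which we verify by quadrature (the helper name is ours).

```python
import numpy as np

# Standard textbook example (not specific to this paper): for X ~ N(0,1),
# the Stein operator Lf(x) = f'(x) - x f(x) satisfies E[Lf(X)] = 0 for
# suitably smooth f. We check this with f = sin by simple quadrature.

def normal_stein_expectation(f, fprime, n=200_001, half_width=10.0):
    """Approximate E[f'(X) - X f(X)] for X ~ N(0,1) with a Riemann sum."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    density = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    integrand = (fprime(x) - x * f(x)) * density
    return np.sum(integrand) * dx

val = normal_stein_expectation(np.sin, np.cos)
print(abs(val) < 1e-6)  # the Stein identity holds up to quadrature error
```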

More recently, Stein’s method has greatly expanded in scope to many distributional types; general univariate [9], multivariate [10, 15], and manifold valued [7, 19]. One can split Stein’s method into two distinct approaches: the classical Stein density approach, and the diffusion approach. In short, the density approach constructs the operator from the density of the target distribution itself, whereas the diffusion approach uses the infinitesimal generator of an over-damped Langevin diffusion.

The focus of this paper is to motivate and develop a Stein method for the manifold $\mathbb{S}^1$ and, in particular, to provide examples of bounds between popular distributions within directional statistics. We shall take a kernel-based route similar to that of [8], which avoids the need to directly bound the solution to the Stein equation. We motivate this choice below.

### 1.1 Motivation and Background

As described in the Introduction, recent developments within the area of Stein's method [7] and [19] have led to the ability to extend the diffusion method originally presented in [10] to general manifolds. This particular construction of Stein's method, however, relies upon a sufficient condition which links the geometry of the manifold with the probability density $p \propto e^{-\phi}$, namely that

$$ \mathrm{Ric} + \mathrm{Hess}\,\phi \ge 2\kappa g \tag{1} $$

for some $\kappa > 0$, with $\mathrm{Ric}$ the Ricci curvature on the manifold and $\mathrm{Hess}\,\phi$ the Hessian of $\phi$. For $\mathbb{R}^d$, $\mathrm{Ric} = 0$ everywhere, and so this condition is simplified to a convexity requirement on $\phi$ (a log-concavity condition on $p$).

###### Remark.

This is the sufficient condition initially presented in [7]. An identical condition is put forward in [19], which redefines the Bakry-Émery-Ricci tensor.

However, many popular distributions used in directional statistics do not satisfy this sufficient condition for any choice of their canonical parameters; e.g. the von-Mises, Bingham, uniform, cardioid and wrapped distributions. Therefore, the diffusion approach will not be utilised for our study. This motivates the need to use classical methods in order to construct a Stein method for distributions on $\mathbb{S}^1$.
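To see concretely why condition (1) fails in such cases, note that on $\mathbb{S}^1$ the Ricci curvature vanishes, so (1) reduces to a convexity requirement on $\phi = -\log p$. A minimal numerical sketch for the von-Mises case (the concentration value is an arbitrary illustration of ours):

```python
import numpy as np

# On S^1 the Ricci curvature is identically zero, so condition (1) asks for
# Hess(phi) >= 2*kappa*g with kappa > 0, where phi = -log p. For the
# von-Mises density p(x) ∝ exp(kappa_vm * cos(x)) we get
#   phi''(x) = kappa_vm * cos(x),
# which is negative near x = ±pi, so no positive kappa can work.
kappa_vm = 2.0  # arbitrary illustrative concentration parameter

x = np.linspace(-np.pi, np.pi, 10_001)
hess_phi = kappa_vm * np.cos(x)

print(hess_phi.min())  # strictly negative, so (1) fails for every kappa > 0
```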

One may think of applying already known density methods on an interval and identifying this with the circle. However, one must appreciate that in order to equate the circle with an interval, it is necessary to assign a wrapping at the endpoints of the interval. This means one cannot simply employ the general density methods discussed in, for example, [2]. Boundary conditions on $f$ and $p$ must be obtained for these methods to be applicable; particularly, $f(\pi)p(\pi) = f(-\pi)p(-\pi) = 0$. The von-Mises, Bingham and uniform distributions do not satisfy this boundary condition unless one restricts the function space on which the operator is defined. Instead, we will modify the density approach to accommodate the geometry of $\mathbb{S}^1$, and we shall see that, by the definition of continuous functions on $\mathbb{S}^1$, the boundary identity $f(\pi)p(\pi) = f(-\pi)p(-\pi)$ is always satisfied for absolutely continuous $f$ and $p$. Even though one may use an interval of arbitrary length to construct the circle, we shall, in this paper, use intervals of length $2\pi$ for simplicity.

The Stein kernel, which shall be defined later, has been shown [8, 5] to provide another way to construct analytic bounds on the Wasserstein metric between two known distributions. For this reason, we shall be utilising the Stein kernel to bound the Wasserstein metric. This avoids the need to bound the solution to the Stein equation directly, which can sometimes yield large bounds. For example, when looking at the von-Mises distribution $\mathrm{VM}(\mu,\kappa)$, one will find that the solution to the Stein equation has the form (up to a constant multiple of $1/p(x)$)

$$ f_h(x) = e^{-\kappa\cos(x)}\int_{-\pi}^{x}\big(h(u) - \mathbb{E}[h(X)]\big)e^{\kappa\cos(u)}\,du. $$

This cannot be bounded via conventional means, i.e. using properties of the CDF, due to the oscillatory nature of the cosine function in the exponent. Bounding it directly using the Lipschitz continuity of $h$ will result in very large upper bounds. In addition to this, one usually relies upon applying one or more of exchangeable pairs, size-biasing, sums of variables and zero-biasing in order to bound the Wasserstein metric. This will not be needed when working with the kernel, as our method, reminiscent of [8], shall directly compare the operators to obtain an upper bound. As we shall see, there are limitations on which distributions we can compare, which is the price to pay for using this kernel-based method.

### 1.2 Main Results

In this paper, we present a formulation for a Stein's method on $\mathbb{S}^1$. We develop a new Stein kernel, named the circular Stein kernel, that is invariant to the choice of coordinate system on $\mathbb{S}^1$. Let $X$ be a random variable on $\mathbb{S}^1$ with density function $p$. The circular Stein operator of $p$ is defined in a similar fashion to its Euclidean counterpart, $T_p f = (fp)'/p$. We then go on to define the circular Stein kernel of $X$, which is defined differently as

$$ \tau^c(x) := T_p^{-1}\sin(\mu-\mathrm{Id})(x) = \frac{1}{p(x)}\int_{\mu-\pi}^{x}\sin(\mu-y)\,p(y)\,dy $$

where $\mu$ is the mean angle of $X$, defined as $\mu := \mathrm{Arg}\,\mathbb{E}[e^{iX}]$. The first notable result is an upper bound on the Wasserstein metric:
Theorem 4.2. Let $X$ and $Y$ be random variables on $\mathbb{S}^1$ with Lebesgue densities $p_1$ and $p_2$ respectively, and define $\pi_0 := p_2/p_1$. Furthermore, let $\mu$ be the mean angle of $X$ and $\tau^c$ be the circular Stein kernel of $X$. Assume that $p_1$ and $p_2$ are differentiable everywhere on $\mathbb{S}^1$. Then we have the following bounds on the Wasserstein metric between $X$ and $Y$:

$$ \big|\mathbb{E}[\tau^c(X)\pi_0'(X)]\big| \le d_W(Y,X) \le \mathbb{E}\big[\big|\alpha(X)\pi_0'(X)\tau^c(X)\big|\big], $$

where

$$ \alpha(x) = \frac{\int_{\mu-\pi}^{x}(\mathbb{E}[X]-y)\,p_1(y)\,dy}{\int_{\mu-\pi}^{x}\sin(\mu-y)\,p_1(y)\,dy}. $$

This is the main method of comparison that we employ for random variables on $\mathbb{S}^1$. The second, a direct consequence of the above theorem, is a comparison between the von-Mises (VM) and wrapped normal (WN) distributions:
Corollary 4.7. Let $X \sim \mathrm{VM}(\mu,\kappa)$ and $Z \sim \mathrm{WN}(\mu,\sigma^2)$. Then the Wasserstein distance between $Z$ and $X$ is bounded above,

$$ d_W(Z,X) \le 2\pi^3\kappa\sigma^4 + 2\pi. $$

This bound on the Wasserstein metric follows from an application of Theorem 4.2. The result itself provides a relatively weak comparison between the two distributions; however, such a result is still of interest. In contrast to the von-Mises distribution, the wrapped normal distribution is typically non-trivial to simulate from, and so comparisons between these two distributions have been a central point of discussion within directional statistics [11, 1].

Notation and conventions. Throughout the paper we shall be using the following notation: $P$ is a probability measure on $\mathbb{S}^1$ with continuous Lebesgue density $p$. $L^1(P)$ denotes the space of absolutely integrable functions on $\mathbb{S}^1$ under $P$; we write $L^1(dx)$ when $P$ is the Lebesgue measure, unless explicitly stated otherwise. For simplicity, we shall assume that the support of $p$ is a connected subset of $\mathbb{S}^1$. Any reference to standard coordinates of $\mathbb{S}^1$ means that we associate $\mathbb{S}^1$ with the interval $[-\pi,\pi]$ alongside the equivalence relation $-\pi \sim \pi$, meaning that the two endpoints are identified. We prescribe $\mathbb{S}^1$ with its canonical Riemannian metric.

## 2 The Circular Stein Operator

This initial section is dedicated to establishing the framework necessary to formulate the Stein equation on $\mathbb{S}^1$. We begin by defining the canonical Stein operator for a probability measure $P$ and further define its inverse operator. In lieu of a diffusion approach, we shall pursue a modified density approach which draws inspiration from Döbler [4].

Before we start, we recall two definitions from analysis that shall aid us:

###### Definition 2.1.

A function $f : [a,b] \to \mathbb{R}$ is absolutely continuous on $[a,b]$ if $f$ has a derivative $f'$ almost everywhere, and one can write

$$ f(x) = f(a) + \int_a^x f'(y)\,dy, \qquad a \le x \le b. $$
###### Definition 2.2.

Let $g \in L^1(a,b)$ be a given function. We say that $g$ has weak derivative $g'$ if, for every $\phi \in C_c^\infty\big((a,b)\big)$,

$$ \int_a^b \phi(x)g'(x)\,dx = -\int_a^b \phi'(x)g(x)\,dx. $$

### 2.1 The Canonical Density Operator

To begin, we adapt the general density method to $\mathbb{S}^1$:

###### Definition 2.3.

Let $P$ be a probability measure on $\mathbb{S}^1$ with Lebesgue density $p$, with the assumption that $p$ is absolutely continuous. Define $I := \{x \in \mathbb{S}^1 : p(x) > 0\}$. The Stein class $\mathcal{F}(P)$ of $P$ is the collection of functions $f$ such that

1. $f$ is differentiable everywhere on $\mathbb{S}^1$,

2. $f' \in L^1(dx)$,

3. $\int_{\mathbb{S}^1}(fp)'(x)\,dx = 0$.

Since we assume $p$ is absolutely continuous, it is immediate that for any $f \in \mathcal{F}(P)$ the product $fp$ is absolutely continuous on $\mathbb{S}^1$, since $f$ is also absolutely continuous by items i) and ii). Because of the Lebesgue integrability assumption, constant functions are always in $\mathcal{F}(P)$, and hence $\mathcal{F}(P)$ is always non-empty.

###### Definition 2.4.

The Stein operator of a probability measure $P$ on $\mathbb{S}^1$ is the mapping

$$ T_p : \mathcal{F}(P) \to L^1(P) $$

given by

$$ T_p f(x) = \begin{cases} \dfrac{(fp)'(x)}{p(x)} & x \in I, \\[4pt] f(x) & x \notin I. \end{cases} \tag{2} $$

Furthermore, the double $(T_p, \mathcal{F}(P))$ is called the Stein pair.

If one were to compare these definitions with their Euclidean counterparts (for example in [9]), one would see immediate differences. Instead of requiring $f$ to be absolutely continuous, we are able to restrict $\mathcal{F}(P)$ so that $fp$ is absolutely continuous. In other words, we need not demand absolute continuity in the definition, since it is a consequence of Definition 2.4. Stating that $f$ is differentiable everywhere allows us to write down, explicitly, the Stein operator as a differential operator in terms of $f$; see below. The key difference is that, for the definition of the Stein class on $\mathbb{R}$, the third condition is required (cf. [9]). In the case of the circle, if $f \in \mathcal{F}(P)$, this condition is automatically satisfied: if we identify $\mathbb{S}^1$ with the interval $[-\pi,\pi]$ alongside the equivalence relation $-\pi \sim \pi$, then $f(\pi)p(\pi) = f(-\pi)p(-\pi)$ and so

$$ \int_{\mathbb{S}^1}(fp)'\,dx = (fp)\Big|_{-\pi}^{\pi} = f(\pi)p(\pi) - f(-\pi)p(-\pi) = 0. $$
###### Lemma 2.5.

Let $P$ be a probability measure on $\mathbb{S}^1$ with Lebesgue density $p$ and Stein class $\mathcal{F}(P)$. For all $f \in \mathcal{F}(P)$, $\mathbb{E}_P[T_p f] = 0$.

###### Proof.

This statement is evident from Definition 2.3. For $f \in \mathcal{F}(P)$ and $I = \{p > 0\}$,

$$ \mathbb{E}_P[T_p f] = \int_I \frac{(fp)'}{p}\,p\,dx + \int_{I^c} f\,p\,dx = 0. \qquad\blacksquare $$

###### Example 2.6.

We now give some examples of Stein operators for circular distributions, with $x \in \mathbb{S}^1$ throughout;

1. Uniform measure with $p(x) = \frac{1}{2\pi}$;

$$ T_p f(x) = f'(x). \tag{3} $$

In this particular instance, the Stein class is

$$ \mathcal{F}(P) = \{f \in C^0(\mathbb{S}^1) : f' \in L^1(dx)\}. $$

2. von-Mises distribution $\mathrm{VM}(\mu,\kappa)$ with $p(x) = \frac{1}{2\pi I_0(\kappa)}e^{\kappa\cos(x-\mu)}$, where $I_0$ is the modified Bessel function of the first kind, $I_0(\kappa) = \frac{1}{2\pi}\int_{-\pi}^{\pi}e^{\kappa\cos(\theta)}\,d\theta$;

$$ T_p f(x) = f'(x) - \kappa\sin(x-\mu)f(x). $$

3. Bingham distribution with $p(x) = \frac{1}{2\pi e^{\kappa/2}I_0(\kappa/2)}e^{\kappa\cos^2(x-\mu)}$;

$$ T_p f(x) = f'(x) - \kappa\sin\big(2(x-\mu)\big)f(x). $$

4. Cardioid distribution with $p(x) = \frac{1}{2\pi}\big(1 + 2\rho\cos(x-\mu)\big)$ and $|\rho| < \frac{1}{2}$;

$$ T_p f(x) = f'(x) - \frac{2\rho\sin(x-\mu)}{2\rho\cos(x-\mu)+1}f(x). $$
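The mean-zero property of Lemma 2.5 can be sanity-checked numerically for these operators. A sketch for the von-Mises operator of item 2, with illustrative parameters and the smooth periodic test function $f = \cos$:

```python
import numpy as np

mu, kappa = 0.3, 1.5          # illustrative von-Mises parameters
n = 200_000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)  # periodic grid
dx = 2 * np.pi / n

p = np.exp(kappa * np.cos(x - mu))
p /= p.sum() * dx             # normalised von-Mises density

f = np.cos(x)                 # a differentiable periodic test function
f_prime = -np.sin(x)
Tf = f_prime - kappa * np.sin(x - mu) * f   # operator from item 2

mean_Tf = np.sum(Tf * p) * dx
print(abs(mean_Tf) < 1e-8)    # E_P[T_p f] = 0, as Lemma 2.5 asserts
```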
###### Definition 2.7.

Let $X$ be a circular random variable. The mean angle $\mu$ is defined as $\mu := \mathrm{Arg}\,\mathbb{E}[e^{iX}]$ for $\mathbb{E}[e^{iX}] \neq 0$, where $\mathrm{Arg}$ is the complex argument function.

This quantity originates from the first circular moment $\mathbb{E}[e^{iX}]$, which is decomposed as $\rho e^{i\mu}$ in the standard coordinate system of $\mathbb{S}^1$; $\rho$ is the mean resultant length, and $\mu$ is also known as the mean direction [13]. Before calculating $\mu$, it is paramount to determine what coordinate system one is working with. The first moment is not necessarily invariant to the choice of coordinate system, and $\mu$ is only defined to have support in the standard coordinates of $\mathbb{S}^1$. One will have to convert to standard coordinates before proceeding to calculate $\mu$. Under the standard coordinates, one particular property $\mu$ has is that $\mathbb{E}[\sin(X-\mu)] = 0$. This fact will be important in the following section.

We shall be using this parameter as a foundation from which we shall construct a coordinate system on $\mathbb{S}^1$ for the purpose of integration. This procedure is as follows. For any point $x \in \mathbb{S}^1$ which is not $\hat\mu$, the antipodal point to $\mu$, there is a unique tangent vector $v \in T_\mu\mathbb{S}^1$ with $|v| < \pi$ such that $\exp_\mu(v) = x$. From this, the map $\exp_\mu^{-1}$ determines a local coordinate system covering $\mathbb{S}^1\setminus\{\hat\mu\}$. Furthermore, this mapping identifies $\mathbb{S}^1\setminus\{\hat\mu\}$ with the open interval $(-\pi,\pi)$. Under this new coordinate system, $\mu$ is identified with the origin, and $\hat\mu$ is simply $\pm\pi$. Then, by mapping $\hat\mu$ to $\pi$, we in effect identify $\mathbb{S}^1$ with $[-\pi,\pi]$, with the understanding that the two endpoints are wrapped together; $-\pi$ is identified with $\pi$. In the case where $\mu$ is not unique, for example with the uniform measure on $\mathbb{S}^1$, we take (any) one of the valid values for $\mu$ and form the corresponding identification as described above. Hence, our chosen coordinate system of $\mathbb{S}^1$ is dependent upon $\mu$. In the sequel, any reference to the $\mu$-coordinate system will directly refer to this construction.

### 2.2 The Inverse Operator

The next objective is to define the inverse of the Stein operator (2), from which we can define the Stein kernel. Under the $\mu$-coordinate system, since we have identified $\mu$ with $0$, we have $\mathbb{E}[\sin(X)] = 0$ for a random variable $X$ on $\mathbb{S}^1$. Moreover, $p$ is now centred at $0$. For example, if $X$ is von-Mises with density $\mathrm{VM}(\mu,\kappa)$ in standard coordinates, the density in the $\mu$-coordinate system changes to $\mathrm{VM}(0,\kappa)$.

###### Definition 2.8.

Let $h \in L^1(P)$ and define the operator $T_p^{-1}$ by

$$ T_p^{-1}h(x) := \begin{cases} \dfrac{1}{p(x)}\displaystyle\int_{-\pi}^{x}h(y)p(y)\,dy + \dfrac{h(-\pi)p(-\pi)}{p(x)} & \text{if } p(x) \neq 0, \\[6pt] h(x) & \text{if } p(x) = 0, \end{cases} \tag{4} $$

in which the parameters of $p$ are in terms of the $\mu$-coordinate system.

###### Proposition 2.9.

The operator is the inverse of .

###### Proof.

There are two sections to this proof: the first being the case where $p(x) \neq 0$ and the second being the case where $p(x) = 0$. We begin with the first case. First, let us check that, for a function $h \in L^1(P)$, $T_p(T_p^{-1}h) = h$:

$$\begin{aligned} T_p(T_p^{-1}h)(x) &= \frac{\big((T_p^{-1}h)(x)\,p(x)\big)'}{p(x)} \\ &= \frac{1}{p(x)}\frac{\partial}{\partial x}\left(\int_{-\pi}^{x}h(y)p(y)\,dy + h(-\pi)p(-\pi)\right) \\ &= \frac{1}{p(x)}\,h(x)p(x) \\ &= h(x). \end{aligned}$$

Now to show the other way, let $h \in \mathcal{F}(P)$. Since $\mathbb{E}_P[T_p h] = 0$ by Lemma 2.5, it is clear that $T_p h \in L^1(P)$. Then,

$$\begin{aligned} T_p^{-1}(T_p h)(x) &= \frac{1}{p(x)}\int_{-\pi}^{x}T_p h(y)\,p(y)\,dy + \frac{h(-\pi)p(-\pi)}{p(x)} \\ &= \frac{1}{p(x)}\int_{-\pi}^{x}\frac{(h(y)p(y))'}{p(y)}\,p(y)\,dy + \frac{h(-\pi)p(-\pi)}{p(x)} \\ &= \frac{1}{p(x)}\int_{-\pi}^{x}(h(y)p(y))'\,dy + \frac{h(-\pi)p(-\pi)}{p(x)} \\ &= h(x). \end{aligned}$$

For the case where $p(x) = 0$, let $h \in L^1(P)$. Then

$$ T_p(T_p^{-1}h)(x) = (T_p^{-1}h)(x) = h(x). $$

For the other way, since $x \notin I$ it is clear that $T_p h(x) = h(x)$, and therefore

$$ T_p^{-1}(T_p h)(x) = T_p^{-1}h(x) = h(x). \qquad\blacksquare $$

A special quantity is obtained when we select $h = \nu - \mathrm{Id}$ in (4), where $\nu := \mathbb{E}[X]$. Applying the inverse operator to this particular $h$, we generate the classical Stein kernel: when $p(x) \neq 0$,

$$\begin{aligned} \tau(x) &:= T_p^{-1}(\nu - \mathrm{Id})(x) \\ &= \frac{1}{p(x)}\int_{-\pi}^{x}(\nu - y)p(y)\,dy + \frac{(\nu+\pi)p(-\pi)}{p(x)} \tag{5} \\ &= -\frac{1}{p(x)}\int_{x}^{\pi}(\nu - y)p(y)\,dy + \frac{(\nu+\pi)p(\pi)}{p(x)}. \end{aligned}$$

Again, we must define $\tau(x)$ when $p(x) = 0$; however, this follows from Definition 2.8.

The role of the constant $h(-\pi)p(-\pi)/p(x)$ is to ensure that $T_p^{-1}$ is injective. If one does not include it, then $T_p^{-1}h(-\pi) = 0$, and hence $T_p^{-1}(T_ph)(-\pi) = 0$, which is not necessarily the case for general $h$. However, owing to the definition of the Stein operator, we also have the following:

###### Corollary 2.10.

Fix any $c \in \mathbb{R}$. Define $g(x) := T_p^{-1}h(x) + \frac{c}{p(x)}$ for $p(x) \neq 0$ and $g(x) := h(x)$ for $p(x) = 0$. Then,

$$ T_p g(x) = T_p(T_p^{-1}h)(x). $$
###### Example 2.11.

Let $P$ be the uniform measure on $\mathbb{S}^1$. Then $\nu = \mathbb{E}[X] = 0$, and the boundary constant is $(\nu+\pi)p(-\pi)/p(x) = \pi$. The Stein kernel of this distribution is

$$ \tau(x) = 2\pi\int_{-\pi}^{x}\frac{-y}{2\pi}\,dy + \pi = \frac{\pi^2 - x^2}{2} + \pi. \tag{6} $$
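As a quick numerical check of (6), one can compare the closed form against a direct quadrature of (5) with the boundary constant $\pi$ included (the helper name is ours):

```python
import numpy as np

def tau_uniform_quad(x0, n=200_001):
    """Classical Stein kernel of the uniform measure via quadrature of (5)."""
    y = np.linspace(-np.pi, x0, n)
    dy = y[1] - y[0]
    vals = (0.0 - y) / (2 * np.pi)          # (nu - y) p(y) with nu = 0
    integral = dy * (vals.sum() - 0.5 * (vals[0] + vals[-1]))  # trapezoid rule
    return 2 * np.pi * integral + np.pi      # divide by p = 1/(2*pi), add constant

for x0 in (-2.0, 0.5, 3.0):
    closed = (np.pi**2 - x0**2) / 2 + np.pi
    assert abs(tau_uniform_quad(x0) - closed) < 1e-8
print("closed form (6) confirmed")
```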

One of the main uses of the Stein kernel is to be able to construct bounds on the Wasserstein distance between distributions on $\mathbb{R}$ (cf. Theorem 3.1 in [8]). If we wish to adapt this theorem onto $\mathbb{S}^1$ for a circular distribution, say a von-Mises distribution, we will have to compute the Stein kernel. So far we have only looked at examples with the uniform measure on $\mathbb{S}^1$ due to its simple Lebesgue density. However, for the von-Mises distribution in particular, one will quickly find that obtaining a closed-form solution of the kernel is impossible. For simplicity, let $X \sim \mathrm{VM}(\mu,\kappa)$; then

$$ \tau(x) = T_p^{-1}(\mathbb{E}[X] - \mathrm{Id})(x) = e^{-\kappa\cos(x)}\int_{-\pi}^{x}(\mu - y)e^{\kappa\cos(y)}\,dy + \frac{C}{p(x)}. $$

There are two problems with this. The first is that the integral is intractable. We can obtain bounds on $\tau$, but these bounds do not particularly aid in bounding the Wasserstein metric, as they will be large; this is akin to bounding the solution to the Stein equation, which is what we wanted to avoid. The second is the definition of $\nu$. In Example 2.11 we used $\nu = \mathbb{E}[X]$, but for directional data analysis this is not used as a parameter of location, since the standard mean is not well defined on $\mathbb{S}^1$.

A different approach is to redefine $\nu$ to be the extrinsic mean of $X$, $\mathbb{E}[e^{iX}]$. This does, however, require us to completely redefine the kernel to ensure that it remains well defined. This is precisely the route that we shall take in the next section, except we shall not use the full extrinsic mean, but rather the mean angle as defined in Definition 2.7. We shall see that this particular choice of parameter coincides well with the von-Mises and other common directional distributions.

## 3 The Circular Stein Kernel

The end of the previous section motivated the need to redefine the Stein kernel for certain circular distributions. We dedicate this section to constructing this new kernel, as well as computing it for a handful of distributions. Similarly to the classical Stein kernel, we shall utilise the inverse Stein operator to define it.

###### Definition 3.1.

Let $X$ be a circular random variable with distribution $P$, mean angle $\mu$ and $\mu$-centred Lebesgue density $p$; then $\mathbb{E}[\sin(X)] = 0$ in the $\mu$-coordinate system. The circular Stein kernel of $X$ is defined as

$$\begin{aligned} \tau^c(x) &:= T_p^{-1}\sin(-\mathrm{Id})(x) \\ &= -\frac{1}{p(x)}\int_{-\pi}^{x}\sin(y)p(y)\,dy \\ &= \frac{1}{p(x)}\int_{x}^{\pi}\sin(y)p(y)\,dy. \end{aligned}$$

By a $\mu$-centred density of a circular random variable $X$ with density $\tilde p$, we mean $p(x) = \tilde p(x+\mu)$ in standard coordinates. In Example 3.2, we shall see that this $\mu$-centred density is precisely the density of the random variable $X - \mu$.

We distinguish the circular Stein kernel from the classical Stein kernel of Section 2.2 with the superscript $c$.

###### Example 3.2.

Let $X \sim \mathrm{VM}(\mu,\kappa)$ with Lebesgue density $p(x) = \frac{1}{2\pi I_0(\kappa)}e^{\kappa\cos(x-\mu)}$, with $\mu \in \mathbb{S}^1$ and $\kappa > 0$. To calculate the mean angle, we first introduce the special function $I_1(\kappa) := \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos(\theta)e^{\kappa\cos(\theta)}\,d\theta$, the modified Bessel function of the first kind of order one. It turns out that dividing the first moment by $e^{i\mu}$ aids in its calculation:

$$\begin{aligned} \mathbb{E}[e^{i(X-\mu)}] &= \frac{1}{2\pi I_0(\kappa)}\int_{\mathbb{S}^1}e^{i(x-\mu)}e^{\kappa\cos(x-\mu)}\,dx \\ &= \frac{1}{I_0(\kappa)}\left(\frac{1}{2\pi}\int_{\mathbb{S}^1}\cos(x-\mu)e^{\kappa\cos(x-\mu)}\,dx + \frac{i}{2\pi}\int_{\mathbb{S}^1}\sin(x-\mu)e^{\kappa\cos(x-\mu)}\,dx\right) \\ &= \frac{I_1(\kappa)}{I_0(\kappa)}, \end{aligned}$$

since $\sin(x-\mu)e^{\kappa\cos(x-\mu)}$ is anti-symmetric about the origin. Therefore, because $I_1(\kappa)/I_0(\kappa) > 0$, it must be that the mean angle is $\mu$. Then we calculate the circular Stein kernel by switching to $\mu$-coordinates,

$$\begin{aligned} \tau^c(x) &= \exp(-\kappa\cos(x))\int_{-\pi}^{x}-\sin(y)\exp(\kappa\cos(y))\,dy \\ &= \frac{1}{\kappa} - \frac{1}{\kappa}\exp\big(\kappa(-1-\cos(x))\big). \end{aligned}$$

Notably, we have the bounds

$$ 0 \le \tau^c(x) \le \frac{1}{\kappa}\big(1 - e^{-2\kappa}\big) \le \frac{1}{\kappa} \tag{7} $$

which achieve the minimum at $x = \pm\pi$ and the maximum at $x = 0$. This particular bound on $\tau^c$ for the von-Mises distribution will be of use to us later on.
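Both the closed form above and the bound (7) can be verified numerically; a sketch with an arbitrary illustrative $\kappa$ (function names are ours):

```python
import numpy as np

kappa = 2.0   # illustrative concentration

def tau_c_closed(x):
    """Closed form from Example 3.2."""
    return (1 - np.exp(kappa * (-1 - np.cos(x)))) / kappa

def tau_c_quad(x0, n=200_001):
    """Definition 3.1 evaluated by quadrature (normalising constants cancel)."""
    y = np.linspace(-np.pi, x0, n)
    dy = y[1] - y[0]
    vals = -np.sin(y) * np.exp(kappa * np.cos(y))
    integral = dy * (vals.sum() - 0.5 * (vals[0] + vals[-1]))  # trapezoid rule
    return np.exp(-kappa * np.cos(x0)) * integral

for x0 in (-2.5, 0.0, 1.2):
    assert abs(tau_c_quad(x0) - tau_c_closed(x0)) < 1e-6

grid = np.linspace(-np.pi, np.pi, 4_001)
vals = tau_c_closed(grid)
assert vals.min() >= 0 and vals.max() <= (1 - np.exp(-2 * kappa)) / kappa + 1e-12
print("closed form and bound (7) confirmed")
```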

###### Example 3.3.

Let $X$ be a one-dimensional Bingham random variable, with Lebesgue density

$$ p(x) = \frac{1}{2\pi e^{\kappa/2} I_0(\kappa/2)}\exp\big(\kappa\cos^2(x-\mu)\big), \qquad x \in \mathbb{S}^1. $$

One can deduce that the mean angle is $\mu$, due to the fact that $p$ is symmetric about $\mu$, and so in standard coordinates $\mathbb{E}[\sin(X-\mu)] = 0$. In order to calculate the circular Stein kernel of this random variable, we must first compute the integral

$$\begin{aligned} \int_{-\pi}^{x}-\sin(y)\,e^{\kappa\cos^2(y)}\,dy &= \int_{-\sqrt{\kappa}}^{\sqrt{\kappa}\cos(x)}\frac{e^{z^2}}{\sqrt{\kappa}}\,dz \\ &= \frac{\sqrt{\pi}}{2\sqrt{\kappa}}\Big(\mathrm{erfi}\big(\sqrt{\kappa}\cos(x)\big) + \mathrm{erfi}\big(\sqrt{\kappa}\big)\Big), \end{aligned}$$

using the substitution $z = \sqrt{\kappa}\cos(y)$. Here, $\mathrm{erfi}$ is the imaginary error function, which relates to the error function via $\mathrm{erfi}(x) = -i\,\mathrm{erf}(ix)$. Whence,

$$ \tau^c(x) = \frac{\sqrt{\pi}}{2\sqrt{\kappa}}\,e^{-\kappa\cos^2(x)}\Big(\mathrm{erfi}\big(\sqrt{\kappa}\cos(x)\big) + \mathrm{erfi}\big(\sqrt{\kappa}\big)\Big). $$
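Since $\mathrm{erfi}$ is available as `scipy.special.erfi`, this closed form is also easy to check against a direct quadrature of Definition 3.1 (illustrative $\kappa$, working in $\mu$-coordinates; function names are ours):

```python
import numpy as np
from scipy.special import erfi

kappa = 1.5   # illustrative concentration

def tau_c_closed(x):
    """Closed form for the Bingham circular Stein kernel (Example 3.3)."""
    c = np.cos(x)
    return (np.sqrt(np.pi) / (2 * np.sqrt(kappa))) * np.exp(-kappa * c**2) \
        * (erfi(np.sqrt(kappa) * c) + erfi(np.sqrt(kappa)))

def tau_c_quad(x0, n=200_001):
    """Definition 3.1 by quadrature; normalising constants cancel."""
    y = np.linspace(-np.pi, x0, n)
    dy = y[1] - y[0]
    vals = -np.sin(y) * np.exp(kappa * np.cos(y)**2)
    integral = dy * (vals.sum() - 0.5 * (vals[0] + vals[-1]))  # trapezoid rule
    return np.exp(-kappa * np.cos(x0)**2) * integral

for x0 in (-3.0, -1.0, 0.7, 2.0):
    assert abs(tau_c_quad(x0) - tau_c_closed(x0)) < 1e-6
print("Bingham kernel confirmed")
```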
###### Example 3.4.

Let $P$ be the uniform measure on $\mathbb{S}^1$, which has Lebesgue density $p(x) = \frac{1}{2\pi}$, $x \in \mathbb{S}^1$, and choose $\mu = 0$ as in Example 2.11. Then

$$ \tau^c(x) = 2\pi\int_{-\pi}^{x}\frac{-\sin(y)}{2\pi}\,dy = \cos(x) + 1. $$

One can also obtain this kernel by taking the limit as $\kappa \to 0$ in Example 3.2;

$$ \lim_{\kappa\to 0}\frac{1 - e^{\kappa(-1-\cos x)}}{\kappa} = \lim_{\kappa\to 0}(1+\cos x)\,e^{\kappa(-1-\cos x)} = 1 + \cos x. $$

Similar to the classical Stein kernel, the circular Stein kernel also satisfies the following integration by parts property.

###### Lemma 3.5.

Define $X$ to be a random variable on $\mathbb{S}^1$ with corresponding circular Stein kernel $\tau^c$ and mean angle $\mu$, and let $\phi$ be absolutely continuous with weak derivative $\phi'$. Then, in the $\mu$-coordinate system, we have that

$$ \mathbb{E}[\sin(X)\phi(X)] = \mathbb{E}[\tau^c(X)\phi'(X)]. $$
###### Proof.

Let $X$ have Lebesgue density $p$ on $\mathbb{S}^1$; then

$$ \mathbb{E}[\tau^c(X)\phi'(X)] = -\int_{\mathbb{S}^1}\int_{-\pi}^{x}\sin(y)p(y)\,dy\;\phi'(x)\,dx. $$

Using integration by parts, with $u(x) = \int_{-\pi}^{x}\sin(y)p(y)\,dy$ and $v(x) = \phi(x)$, we obtain

$$\begin{aligned} \mathbb{E}[\tau^c(X)\phi'(X)] &= -\phi(x)\int_{-\pi}^{x}\sin(y)p(y)\,dy\;\Big|_{-\pi}^{\pi} + \int_{\mathbb{S}^1}\sin(x)\phi(x)p(x)\,dx \\ &= \mathbb{E}[\sin(X)\phi(X)]. \end{aligned}$$

In the second equality we have used the continuity of $\phi$ and the fact that $\mathbb{E}[\sin(X)] = 0$ in the $\mu$-coordinate system. ∎
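A numerical sanity check of Lemma 3.5 for the von-Mises case (arbitrary illustrative $\kappa$, $\mu$-coordinates), taking $\phi(x) = x$ so that $\phi' \equiv 1$:

```python
import numpy as np

kappa = 1.5
n = 400_000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)  # periodic grid
dx = 2 * np.pi / n

p = np.exp(kappa * np.cos(x))
p /= p.sum() * dx                  # VM(0, kappa) density in mu-coordinates

tau_c = (1 - np.exp(kappa * (-1 - np.cos(x)))) / kappa   # Example 3.2
phi = x                            # absolutely continuous with phi' = 1

lhs = np.sum(np.sin(x) * phi * p) * dx   # E[sin(X) phi(X)]
rhs = np.sum(tau_c * p) * dx             # E[tau_c(X) phi'(X)]
print(abs(lhs - rhs) < 1e-6)
```

Note that $\phi(x) = x$ is admissible even though it is not periodic: the boundary term in the proof vanishes because the inner integral is $0$ at both endpoints.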

## 4 Bounding of the Wasserstein Distance

Let $\mathcal{W}$ be the set of Lipschitz continuous functions with a Lipschitz constant of 1. The Wasserstein distance between two probability measures $P_1$ and $P_2$ on a measurable space $\Omega$ is defined as

$$ d_W(P_1,P_2) = \sup_{h\in\mathcal{W}}\left|\int_\Omega h\,dP_1 - \int_\Omega h\,dP_2\right|. $$

Using the Stein operator, we may also construct the renowned Stein equation for $P$;

$$ T_p f_h(x) = h(x) - \mathbb{E}[h(X)] \tag{8} $$

with $T_p$ defined in Definition 2.4. Clearly $h - \mathbb{E}[h(X)] \in L^1(P)$, since $h$ is Lipschitz continuous and hence bounded on $\mathbb{S}^1$. It is now evident that we can apply the inverse Stein operator to both sides of the Stein equation (8) in order to find its solution. However, by Corollary 2.10, we may choose the constant so that we can define the solution

$$ f_h(x) := \frac{1}{p(x)}\int_{-\pi}^{x}\big(h(y) - \mathbb{E}[h(X)]\big)p(y)\,dy. \tag{9} $$
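One can check numerically that (9) indeed solves the Stein equation (8); a sketch for a von-Mises target with an arbitrary Lipschitz test function (parameters are illustrative):

```python
import numpy as np

kappa = 2.0
n = 200_000
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = 2 * np.pi / n

p = np.exp(kappa * np.cos(x))
p /= p.sum() * dx                       # VM(0, kappa) density

h = np.sin(2 * x)                       # an arbitrary Lipschitz test function
Eh = np.sum(h * p) * dx

fh_p = np.cumsum((h - Eh) * p) * dx     # f_h(x) p(x) = ∫_{-pi}^x (h - E[h]) p dy
lhs = np.gradient(fh_p, dx) / p         # T_p f_h = (f_h p)' / p

err = np.max(np.abs(lhs - (h - Eh))[10:-10])   # trim np.gradient edge effects
print(err < 1e-3)
```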

### 4.1 Main Theorem

We next turn our attention to the use of the Stein kernel to bound the Wasserstein distance for distributions on $\mathbb{S}^1$. We shall take a similar approach to that of [8], with modifications of the kernel that are discussed in [4], since this does not involve bounding the solution to the Stein equation directly.

###### Lemma 4.1.

Let $\tau^c$ be the circular Stein kernel of a circular random variable $X$ with Lebesgue density $p$ and mean angle $\mu$. Define the solution to the Stein equation $f_h$ by (9), and further define $g_h := f_h/\tau^c$. Then, for any Lipschitz continuous test function $h$, we have

$$ |g_h(x)| \le \|h'\|_\infty\,\frac{\int_{\mu-\pi}^{x}(\mathbb{E}[X]-y)p(y)\,dy}{\big|\int_{\mu-\pi}^{x}\sin(\mu-y)p(y)\,dy\big|}. $$
###### Remark.

This result was formulated by Döbler in [4], Proposition 3.13 a). In particular, Döbler looked at a general kernel on an interval of $\mathbb{R}$ with closure $[a,b]$. This kernel took the form

$$ \eta(x) = \frac{1}{p(x)}\int_a^x \gamma(t)p(t)\,dt. $$

In the proposition, Döbler imposed conditions on the $\gamma$ that can be used; in particular, that $\gamma$ is decreasing. However, this condition is not necessary for part a) of the relevant proposition, which instead relies upon properties of $\gamma$ and the CDF of $P$. Therefore, this lemma is easily translated from an interval onto the circle.

###### Theorem 4.2.

Let $X$ and $Y$ be circular random variables with Lebesgue densities $p_1$ and $p_2$ respectively, and define $\pi_0 := p_2/p_1$. Furthermore, let $\mu$ be the mean angle of $X$ and $\tau^c$ be the circular Stein kernel of $X$. Assume that $p_1$ and $p_2$ are differentiable everywhere on $\mathbb{S}^1$. Then we have the following bounds on the Wasserstein metric between $X$ and $Y$:

$$ \big|\mathbb{E}[\tau^c(X)\pi_0'(X)]\big| \le d_W(Y,X) \le \mathbb{E}\big[\big|\alpha(X)\pi_0'(X)\tau^c(X)\big|\big], $$

where

$$ \alpha(x) = \frac{\int_{\mu-\pi}^{x}(\mathbb{E}[X]-y)\,p_1(y)\,dy}{\int_{\mu-\pi}^{x}\sin(\mu-y)\,p_1(y)\,dy}. $$

Note that $\alpha$ is expressed in $\mu$-coordinates.

###### Proof.

We begin by proving the lower bound.

First, note that since sine is a Lipschitz continuous function with a Lipschitz constant of 1,

$$ \big|\mathbb{E}[\sin(Y)] - \mathbb{E}[\sin(X)]\big| \le d_W(Y,X). $$

Moreover, since $\mu$ is the mean angle of $X$, the second expectation on the left-hand side is 0 in the $\mu$-coordinate system. For the first expectation,

$$ \mathbb{E}[\sin(Y)] = \int_{\mathbb{S}^1}\sin(x)p_2(x)\,dx = \int_{\mathbb{S}^1}\sin(x)\frac{p_2(x)}{p_1(x)}\,p_1(x)\,dx = \mathbb{E}[\sin(X)\pi_0(X)]. $$

Then, by applying Lemma 3.5 with $\phi = \pi_0$, we obtain the lower bound.

For the upper bound, let $(T_1,\mathcal{F}(P_1))$ and $(T_2,\mathcal{F}(P_2))$ be the Stein pairs of $X$ and $Y$ respectively. Then, by the definition of the Stein equation, one clearly sees that $\mathbb{E}[T_2(f_h)(Y)] = 0$ whenever $f_h \in \mathcal{F}(P_2)$. We need to verify that $f_h \in \mathcal{F}(P_2)$: first, $f_h$ is differentiable everywhere on $\mathbb{S}^1$, because $p_1$ and $p_2$ are differentiable everywhere and $h$ is continuous. Furthermore, $f_h(\pi)p_2(\pi) = f_h(-\pi)p_2(-\pi)$ by continuity. Whence, we can conclude that $f_h \in \mathcal{F}(P_2)$, and more importantly $\mathbb{E}[T_2(f_h)(Y)] = 0$. Using this fact, we wish to relate the Stein operators of $X$ and $Y$, $T_1 f = \frac{(fp_1)'}{p_1}$ and $T_2 f = \frac{(fp_2)'}{p_2}$. One can clearly see that both operators share a common term of $f'$, and so

$$ T_1(f_h) - T_2(f_h) = -(\log\pi_0)'\,f_h. \tag{10} $$

Now, by definition of the Stein equation (9),

$$\begin{aligned} \mathbb{E}[h(Y)] - \mathbb{E}[h(X)] &= \mathbb{E}[T_1(f_h)(Y)] \\ &= \mathbb{E}[T_1(f_h)(Y)] - \mathbb{E}[T_2(f_h)(Y)] \\ &= -\mathbb{E}\big[f_h(Y)(\log\pi_0)'(Y)\big] \\ &= -\mathbb{E}\left[\tau^c(Y)\frac{f_h(Y)}{\tau^c(Y)}(\log\pi_0)'(Y)\right]. \end{aligned} \tag{11}$$

The second equality is due to the fact that $\mathbb{E}[T_2(f_h)(Y)] = 0$, since $f_h \in \mathcal{F}(P_2)$, and in the third equality we have used Equation (10). Define the quantity $g_h := f_h/\tau^c$.

Now, using Lemma 4.1,

$$ |g_h(x)| \le \|h'\|_\infty\,|\alpha(x)|. \tag{12} $$

Compiling (11) and (12) together, we obtain the upper bound,

 dW(Y,X) ≤suph