 # Convergence rate of optimal quantization grids and application to empirical measure

We study the convergence rate of optimal quantization for a sequence of probability measures $(\mu_n)_{n\in\mathbb{N}^*}$ on $\mathbb{R}^d$ which converges in the Wasserstein distance, in two aspects: the first is the convergence rate of an optimal grid $x^{(n)}\in(\mathbb{R}^d)^K$ of $\mu_n$ at level $K$; the other is the convergence rate of the distortion function evaluated at $x^{(n)}$, called the "performance" of $x^{(n)}$. Moreover, we study the performance of the optimal grid of the empirical measure of a distribution $\mu$ with finite second moment but possibly unbounded support. As an application, we show that the mean performance of the empirical measure of the multidimensional normal distribution $\mathcal{N}(m,\Sigma)$ and of distributions with hyper-exponential tails behaves like $O(\log n/\sqrt{n})$.


## 1 Introduction

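Let $|\cdot|$ denote the Euclidean norm on $\mathbb{R}^d$ induced by an inner product $\langle\cdot\,|\,\cdot\rangle$, and let the distance between a point $\xi$ and a set $A$ in $\mathbb{R}^d$ be defined by $d(\xi,A)=\inf_{a\in A}|\xi-a|$.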

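For $p\in[1,+\infty)$, let $\mathcal{P}_p(\mathbb{R}^d)$ denote the set of all probability measures on $\mathbb{R}^d$ with a finite $p$-th moment. Let $X$ be an $\mathbb{R}^d$-valued random variable defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, with probability distribution $\mu$. The (quadratic) quantization procedure of $X$ (or of $\mu$) at level $K$ consists in finding a discrete approximating grid $x=(x_1,\dots,x_K)\in(\mathbb{R}^d)^K$ whose quantization error achieves the optimal quantization error $e^{*}_{K,\mu}$ for the distribution $\mu$ at level $K$, defined as follows,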

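$$
e^{*}_{K,\mu}=\inf_{y=(y_1,\dots,y_K)\in(\mathbb{R}^d)^K}\Big[\mathbb{E}\min_{1\le i\le K}|X-y_i|^2\Big]^{\frac12}=\inf_{y=(y_1,\dots,y_K)\in(\mathbb{R}^d)^K}\Big[\int_{\mathbb{R}^d}\min_{1\le i\le K}|\xi-y_i|^2\,\mu(d\xi)\Big]^{\frac12}. \tag{1}
$$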

If a grid $x\in(\mathbb{R}^d)^K$ attains the infimum in (1), we call $x$ an optimal grid (also called an optimal cluster center) of $X$ (or of $\mu$) at level $K$. We denote by $G_K(\mu)$ the set of all optimal quantization grids at level $K$ of $\mu$.

[Footnote 2] In many references, the quantization grid at level $K$ is defined as a set of points $\Gamma\subset\mathbb{R}^d$ with cardinality $\mathrm{card}(\Gamma)\le K$, and the quadratic quantization error function is defined on such sets. However, for every such $\Gamma$, one can always find a $K$-tuple $x$ (by repeating some elements of $\Gamma$) with the same quantization error. For example, if $\Gamma=\{y_1,\dots,y_k\}$ with $k<K$ (the $y_i$ being pairwise distinct), one may set $x=(y_1,\dots,y_k,y_k,\dots,y_k)$, among many other possibilities. In [Theorem 4.12], the authors have proved that if the cardinality of the support of $\mu$ is at least $K$, an optimal grid at quantization level $K$ consists of $K$ pairwise distinct points. Hence both conventions lead to the same optimal quantization error. Therefore, in this paper, with a slight abuse of notation, we will mostly use $K$-tuples but also use sets (in Section 1.1) to represent a quantization grid at level $K$.

The distortion function is often used to describe the quantization error at a grid $x$. It is defined as follows.

###### Definition 1.1 (Distortion function).

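Let $K\in\mathbb{N}^*$ be the quantization level. Let $X$ be an $\mathbb{R}^d$-valued random variable and let $\mu$ denote its probability distribution. We assume that $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $\mathrm{card}(\mathrm{supp}(\mu))\ge K$. The (quadratic) distortion function $D_{K,\mu}$ of $\mu$ at level $K$ is defined on $(\mathbb{R}^d)^K$ by

$$
x=(x_1,\dots,x_K)\mapsto D_{K,\mu}(x)=\mathbb{E}\min_{1\le k\le K}|X-x_k|^2=\int_{\mathbb{R}^d}\min_{1\le i\le K}|\xi-x_i|^2\,\mu(d\xi). \tag{2}
$$

It is clear from (1) and (2) that for any grid $x$, $D_{K,\mu}(x)\ge(e^{*}_{K,\mu})^2$, with equality when $x$ is an optimal grid. Sometimes we drop the subscript $K$ of $D_{K,\mu}$ if the quantization level is fixed in the context.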
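For an empirical measure, the integral in (2) is a plain average, which makes the definition easy to experiment with. The following is a minimal sketch, assuming a sample given as a list of coordinate tuples; the function name `distortion` is an illustrative choice, not notation from the paper.

```python
def distortion(sample, grid):
    """Quadratic distortion of the empirical measure of `sample` at `grid`:
    the average squared Euclidean distance to the nearest grid point."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sum(min(sq_dist(xi, g) for g in grid) for xi in sample) / len(sample)

# Four points on the real line, quantization level K = 2.
sample = [(0.0,), (2.0,), (10.0,), (12.0,)]
good_grid = [(1.0,), (11.0,)]   # every point is at distance 1 from its nearest center
poor_grid = [(6.0,), (6.0,)]    # a K-tuple may repeat components
print(distortion(sample, good_grid), distortion(sample, poor_grid))
```

The second grid shows that repeated components are legitimate in a $K$-tuple; they simply waste quantization points.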

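Let $\Pi(\mu,\nu)$ denote the set of all probability measures on $\mathbb{R}^d\times\mathbb{R}^d$ with marginals $\mu$ and $\nu$. For $p\ge1$, the Wasserstein distance $W_p$ on $\mathcal{P}_p(\mathbb{R}^d)$ is defined by

$$
W_p(\mu,\nu)=\Big(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^d\times\mathbb{R}^d}d(x,y)^p\,\pi(dx,dy)\Big)^{\frac1p}. \tag{3}
$$

$\mathcal{P}_p(\mathbb{R}^d)$ equipped with the Wasserstein distance $W_p$ is a separable and complete space (see ). If $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, then for any $p\in[1,2]$, $W_p(\mu,\nu)\le W_2(\mu,\nu)$.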
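On the real line, the infimum in (3) between two empirical measures with the same number of equally weighted atoms is attained by matching sorted samples, a classical fact about one-dimensional optimal transport. The sketch below relies on that fact and is restricted to this special case; the name `wasserstein_1d` is hypothetical.

```python
import math

def wasserstein_1d(xs, ys, p=2):
    """W_p between two empirical measures on R with the same number of atoms,
    each atom carrying mass 1/n, using the monotone (sorted) coupling."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(xs)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)

xs, ys = [0.0, 0.0], [0.0, 2.0]
w1 = wasserstein_1d(xs, ys, p=1)
w2 = wasserstein_1d(xs, ys, p=2)
print(w1, w2)  # W_1 does not exceed W_2
```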

The target measure $\mu_\infty$ for the optimal quantization is sometimes unknown. In this case, in order to obtain the optimal grid of $\mu_\infty$, we apply the optimal quantization to a known sequence of distributions $(\mu_n)_{n\ge1}$ which converges (in the Wasserstein distance) to $\mu_\infty$, and we look for the limiting points of the optimal grids of $\mu_n$. For $n\in\mathbb{N}^*$, let $x^{(n)}$ denote an optimal grid of $\mu_n$. The consistency of $x^{(n)}$, i.e. $d\big(x^{(n)},G_K(\mu_\infty)\big)\to0$, has been proved by D. Pollard in [see Theorem 9]. Therefore, a further question is: at which rate does the optimal grid of $\mu_n$ converge to an optimal grid of $\mu_\infty$?

In the literature, there are two perspectives to study the convergence rate of optimal grids:

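1. The convergence rate of $d\big(x^{(n)},G_K(\mu_\infty)\big)$;

2. The convergence rate of the distortion function of $\mu_\infty$ evaluated at $x^{(n)}$: $D_{K,\mu_\infty}(x^{(n)})-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu_\infty}(x)$.

The latter quantity is also called the “performance” of $x^{(n)}$, since it describes how close the quantization error of $x^{(n)}$, considered as a quantization grid for $\mu_\infty$, is to the optimal quantization error of $\mu_\infty$ (even though $x^{(n)}$ is obviously not “optimal” for $\mu_\infty$).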

A typical example of what is described above is the quantization of the empirical measure. Let $X_1,\dots,X_n$ be i.i.d. $\mathbb{R}^d$-valued observations of $X$ with an unknown probability distribution $\mu$; then the empirical measure is defined by

$$
\mu_n^{\omega}=\frac1n\sum_{i=1}^{n}\delta_{X_i(\omega)}, \tag{4}
$$

where $\delta_a$ denotes the Dirac mass at $a$. The convergence of the empirical measure $\mu_n^{\omega}$ to $\mu$ has been proved in many references, for example [Theorem 7] and [Theorem 1], so that the optimal grids of $\mu_n^{\omega}$ are consistent. Moreover, as far as we know, most existing convergence-rate results for optimal grids concern the empirical measure. A first example is . In that paper, the author has proved that if denotes the unique limiting point of , the convergence rate (convergence in law) of is . For the second perspective, it is proved in a recent work that if $\mu$ has a support contained in $B(0,R)$, where $B(0,R)$ denotes the ball in $\mathbb{R}^d$ centered at $0$ with radius $R$, then .
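In practice, a grid for the empirical measure (4) is usually searched for numerically. The sketch below uses Lloyd's algorithm (the classical k-means iteration), which is a swapped-in technique and not a method taken from this paper: it only reaches a stationary grid, which need not be an optimal grid in the sense of (1). The names `nearest` and `lloyd` are illustrative.

```python
def nearest(point, grid):
    """Index of the grid point closest to `point` (lowest index on ties)."""
    return min(range(len(grid)),
               key=lambda k: sum((p - g) ** 2 for p, g in zip(point, grid[k])))

def lloyd(sample, grid, n_iter=50):
    """Alternate nearest-neighbour assignment and centroid update."""
    grid = [tuple(g) for g in grid]
    for _ in range(n_iter):
        cells = [[] for _ in grid]
        for xi in sample:
            cells[nearest(xi, grid)].append(xi)
        # Each center moves to the mean of its cell; empty cells stay put.
        grid = [
            tuple(sum(c) / len(cell) for c in zip(*cell)) if cell else g
            for cell, g in zip(cells, grid)
        ]
    return grid

sample = [(0.0,), (2.0,), (10.0,), (12.0,)]
grid = lloyd(sample, [(0.0,), (12.0,)])
print(grid)
```

Different initial grids can lead to different stationary grids, which is one reason the theoretical results are stated for optimal grids rather than for the output of a particular algorithm.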

In this paper, we will generalise these two precedent works:

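1. In Section 2, we study the general case, that is, the convergence rate of $d\big(x^{(n)},G_K(\mu_\infty)\big)$ and of the performance for any sequence of probability distributions $(\mu_n)_{n\ge1}$ which converges in Wasserstein distance to $\mu_\infty$. We obtain that, if $\mu_\infty$ has a radial-controlled tail (Definition 1.3) and the Hessian matrix of the distortion function $D_{K,\mu_\infty}$ is positive definite at all points $x\in G_K(\mu_\infty)$, then for $n$ large enough,

$$
d\big(x^{(n)},G_K(\mu_\infty)\big)^2\le C^{(n)}_{\mu_\infty}\cdot\Big(D_{K,\mu_\infty}(x^{(n)})-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu_\infty}(x)\Big)\le\widetilde{C}^{(n)}_{\mu_\infty}\cdot W_2(\mu_n,\mu_\infty),
$$

where $C^{(n)}_{\mu_\infty}$ and $\widetilde{C}^{(n)}_{\mu_\infty}$ are both bounded by a constant depending only on $\mu_\infty$. If , we also establish a non-asymptotic upper bound for the performance: for every $n$, there exist a constant $C_{\mu_\infty,d,\eta}$ depending on $\mu_\infty$, $d$ and $\eta$, and a constant $\widetilde{C}_{d,\eta}$ depending on $d$ and $\eta$, such that

$$
D_{K,\mu_\infty}(x^{(n)})-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu_\infty}(x)\le W_2(\mu_n,\mu_\infty)\Big[\frac{C_{\mu_\infty,d,\eta}}{K^{1/d}}+2\,W_2(\mu_n,\mu_\infty)+\frac{\widetilde{C}_{d,\eta}}{K^{1/d}}\,W_{2+\eta}(\mu_n,\mu_\infty)\Big],
$$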

under the condition that for some and .

2. In Section 3, we generalise the mean performance result for the empirical measure, established in  for distributions with bounded support, to any measure with finite second moment. We obtain

$$
\mathbb{E}\,D_{K,\mu}\big(x^{(n),\omega}\big)-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu}(x)\le\frac{2K}{\sqrt{n}}\Big[r_{2n}^2+\rho_K(\mu)^2+2r_1\big(r_{2n}+\rho_K(\mu)\big)\Big], \tag{5}
$$

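where $r_n:=\big\|\max_{1\le i\le n}|X_i|\big\|_2$ and $\rho_K(\mu)$ is the maximum radius of $L^2$-optimal grids, defined by

$$
\rho_K(\mu):=\max\Big\{\max_{1\le k\le K}|x_k^{*}|,\ \{x_1^{*},\dots,x_K^{*}\}\text{ is an optimal grid of }\mu\Big\}. \tag{6}
$$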

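In particular, we give a precise upper bound for $\mu=\mathcal{N}(m,\Sigma)$, the multidimensional normal distribution:

$$
\mathbb{E}\,D_{K,\mu}\big(x^{(n),\omega}\big)-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu}(x)\le C_\mu\cdot\frac{2K}{\sqrt{n}}\Big[1+\log n+\gamma_K\log K\Big(1+\frac2d\Big)\Big], \tag{7}
$$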

where and . If , .

We will start our discussion with a brief review on the properties of optimal grid and the distortion function.

### 1.1 Properties of optimal grid and the distortion function

Let $X$ be an $\mathbb{R}^d$-valued random variable with probability distribution $\mu$ such that $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $\mathrm{card}(\mathrm{supp}(\mu))\ge K$. Let $G_K(\mu)$ denote the set of all optimal quantization grids at level $K$ of $\mu$ and let $e^{*}_{K,\mu}$ denote the optimal quantization error of $\mu$ defined in (1). The properties below recall some classical background on the optimal quantization of probability measures.

###### Proposition 1.2.

Let . Let and .

1. (Decreasing of ) .

2. (Existence and boundedness of optimal grids) $G_K(\mu)$ is a nonempty compact set, so that $\rho_K(\mu)$ defined in (6) is finite for any fixed $K$. Moreover, if is an optimal grid of , then . In particular, if , then and vice versa.

3. If $\mu$ has a compact support and if the norm on $\mathbb{R}^d$ is Euclidean, derived from an inner product $\langle\cdot\,|\,\cdot\rangle$, then all the optimal grids are contained in the closure of the convex hull of $\mathrm{supp}(\mu)$.

For the proof of Proposition 1.2-(i) and (ii), we refer to [see Theorem 4.12] and for the proof of (iii) to Appendix A.

###### Theorem.

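(Non-asymptotic Zador’s theorem) Let $\eta>0$. If $\mu\in\mathcal{P}_{2+\eta}(\mathbb{R}^d)$, then there exists a constant $C_{d,\eta}\in(0,+\infty)$, which depends only on $d$ and $\eta$, such that for every quantization level $K\in\mathbb{N}^*$,

$$
e^{*}_{K,\mu}\le C_{d,\eta}\cdot\sigma_{2+\eta}(\mu)\,K^{-1/d}, \tag{8}
$$

where for $r\in(0,+\infty)$, $\sigma_r(\mu)=\inf_{a\in\mathbb{R}^d}\big[\int_{\mathbb{R}^d}|\xi-a|^r\,\mu(d\xi)\big]^{1/r}$.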
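As an added illustration of the rate $K^{-1/d}$ (an example not taken from the paper), consider the uniform distribution $\mu=U([0,1])$, so that $d=1$. It is a classical fact that the optimal grid at level $K$ consists of the midpoints of the $K$ intervals of length $1/K$, and a direct computation gives

$$
x_i^{*}=\frac{2i-1}{2K},\qquad D_{K,\mu}(x^{*})=\sum_{i=1}^{K}\int_{\frac{i-1}{K}}^{\frac{i}{K}}\big(\xi-x_i^{*}\big)^2\,d\xi=K\cdot\frac{1}{12K^3}=\frac{1}{12K^2},\qquad e^{*}_{K,\mu}=\frac{1}{2\sqrt{3}}\,K^{-1}.
$$

In this case the optimal quantization error decays exactly at the rate $K^{-1/d}$ appearing on the right-hand side of (8).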

For the proof of the non-asymptotic Zador’s theorem, we refer to  and [see Theorem 5.2]. When $\mu$ has an unbounded support, we know from  that $\lim_{K}\rho_K(\mu)=+\infty$. The same paper also gives an asymptotic upper bound of $\rho_K(\mu)$ when $\mu$ has a polynomial tail or a hyper-exponential tail. We first give the definitions of the different tails of a probability measure.

###### Definition 1.3.

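Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ be absolutely continuous with respect to the Lebesgue measure $\lambda_d$ on $\mathbb{R}^d$ and let $f$ denote its density function.

1. A distribution $\mu$ has a $k$-th radial-controlled tail if there exist $A>0$ and a function $g:\mathbb{R}_+\to\mathbb{R}_+$ such that

$$
\forall\,\xi\in\mathbb{R}^d,\ |\xi|\ge A,\quad f(\xi)\le g(|\xi|)\quad\text{and}\quad\int_{\mathbb{R}_+}x^{k}g(x)\,dx<+\infty.
$$
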
2. A distribution has a -th polynomial tail if there exists and such that .

3. A distribution has a -hyper-exponential tail if there exists and such that .

The purpose of the definition of the radial-controlled tail is to control the rate at which the density function $f(\xi)$ converges to 0 when $\xi$ goes to infinity in every direction. Note that the $c$-th polynomial tail, with $c$ large enough relative to $k$, and the hyper-exponential tail are sufficient conditions for the $k$-th radial-controlled tail. A typical example of a hyper-exponential tail is the multidimensional normal distribution $\mathcal{N}(m,\Sigma)$.

###### Theorem.

([see Theorem 1.2]) Assume that

1. Polynomial tail. For $p\ge1$, if $\mu$ has a $c$-th polynomial tail with $c>d+p$, then

$$
\lim_{K}\frac{\log\rho_K}{\log K}=\frac{p+d}{d\,(c-p-d)}. \tag{9}
$$

2. Hyper-exponential tail. If has a -hyper-exponential tail, then

 (10)

Furthermore, if , .

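Quantization theory has a close connection with Voronoï partitions. Let $x=(x_1,\dots,x_K)$ be a grid at level $K$ and let $|\cdot|$ be any norm on $\mathbb{R}^d$. The Voronoï cell (or Voronoï region) generated by $x_i$ is defined by

$$
V_{x_i}(x)=\Big\{\xi\in\mathbb{R}^d:\ |\xi-x_i|=\min_{1\le j\le K}|\xi-x_j|\Big\}, \tag{11}
$$

and $\big(V_{x_i}(x)\big)_{1\le i\le K}$ is called the Voronoï diagram of $x$, which is a locally finite covering of $\mathbb{R}^d$. A Borel partition $\big(C_{x_i}(x)\big)_{1\le i\le K}$ is called a Voronoï partition of $\mathbb{R}^d$ induced by $x$ if

$$
\forall\,i\in\{1,\dots,K\},\quad C_{x_i}(x)\subset V_{x_i}(x). \tag{12}
$$

We also define the open Voronoï cell generated by $x_i$ by

$$
V^{o}_{x_i}(x)=\Big\{\xi\in\mathbb{R}^d:\ |\xi-x_i|<\min_{1\le j\le K,\ j\ne i}|\xi-x_j|\Big\}. \tag{13}
$$

Since we mostly discuss the Euclidean norm on $\mathbb{R}^d$, we know from [Proposition 1.3] that $\mathring{V}_{x_i}(x)=V^{o}_{x_i}(x)$, where $\mathring{A}$ denotes the interior of a set $A$. Moreover, if we denote by $\lambda_d$ the Lebesgue measure on $\mathbb{R}^d$, we have $\lambda_d\big(\partial V_{x_i}(x)\big)=0$, where $\partial A$ denotes the boundary of $A$ (see [Theorem 1.5]). If $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ and $x^{*}$ is an optimal grid of $\mu$, even if $\mu$ is not absolutely continuous with respect to $\lambda_d$, we have $\mu\big(\partial V_{x_i^{*}}(x^{*})\big)=0$ for all $i\in\{1,\dots,K\}$ (see [Theorem 4.2]).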

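For any $K$-tuple $x=(x_1,\dots,x_K)\in(\mathbb{R}^d)^K$ such that $x_i\ne x_j$ for $i\ne j$, one can rewrite the distortion function with the definition of the Voronoï partition as follows,

$$
D_{K,\mu}(x)=\sum_{i=1}^{K}\int_{C_{x_i}(x)}|\xi-x_i|^2\,\mu(d\xi). \tag{14}
$$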
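For an empirical measure, (14) can be checked directly: assigning every sample point to one nearest center yields a Voronoï partition of the sample, and summing cell by cell recovers the distortion computed with the minimum in (2). This is a minimal sketch under the assumption that ties are sent to the lowest index, which is one admissible choice of Borel partition; the function names are illustrative.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def voronoi_partition(sample, grid):
    """Split the sample into cells C_1, ..., C_K; a point equidistant from
    several centers goes to the lowest index, so the cells are disjoint."""
    cells = [[] for _ in grid]
    for xi in sample:
        k = min(range(len(grid)), key=lambda j: sq_dist(xi, grid[j]))
        cells[k].append(xi)
    return cells

def distortion_by_cells(sample, grid):
    """Right-hand side of (14) for the empirical measure of `sample`."""
    cells = voronoi_partition(sample, grid)
    return sum(sq_dist(xi, g) for cell, g in zip(cells, grid) for xi in cell) / len(sample)

def distortion_by_min(sample, grid):
    """Definition (2) for the empirical measure of `sample`."""
    return sum(min(sq_dist(xi, g) for g in grid) for xi in sample) / len(sample)

sample = [(0.0,), (1.0,), (2.0,)]
grid = [(0.0,), (2.0,)]          # the point 1.0 lies on the cell boundary
print(voronoi_partition(sample, grid))
print(distortion_by_cells(sample, grid), distortion_by_min(sample, grid))
```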

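If $x^{*}\in G_K(\mu)$, we know from Proposition 1.2 that the components of $x^{*}$ are pairwise distinct, and we have $\mu\big(\partial V_{x_i^{*}}(x^{*})\big)=0$ for every $i$. In this case, $D_{K,\mu}$ is differentiable at $x^{*}$ (see [Chapter 5]) and its gradient is given by

$$
\nabla D_{K,\mu}(x^{*})=2\Big[\int_{C_i(x^{*})}(x_i^{*}-\xi)\,\mu(d\xi)\Big]_{i=1,\dots,K}. \tag{15}
$$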
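For an empirical measure, the integral in (15) becomes a finite sum over each cell, so the gradient vanishes exactly when every center is the mean of its cell. The sketch below evaluates the same expression at an arbitrary grid, which is an assumption beyond the statement above: it is only meaningful when no sample point lies on a cell boundary. The name `distortion_gradient` is hypothetical.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distortion_gradient(sample, grid):
    """Expression (15) for the empirical measure of `sample`:
    component i is (2/n) * sum over the cell of x_i of (x_i - xi)."""
    n, d = len(sample), len(grid[0])
    grad = [[0.0] * d for _ in grid]
    for xi in sample:
        # Nearest center; lowest index on ties.
        k = min(range(len(grid)), key=lambda j: sq_dist(xi, grid[j]))
        for c in range(d):
            grad[k][c] += 2.0 * (grid[k][c] - xi[c]) / n
    return grad

sample = [(0.0,), (2.0,), (10.0,), (12.0,)]
print(distortion_gradient(sample, [(1.0,), (11.0,)]))  # centers are the cell means
print(distortion_gradient(sample, [(0.0,), (11.0,)]))  # first center is off its cell mean
```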

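For $\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)$, denote by $D_{K,\mu}$ the distortion function of $\mu$ and by $D_{K,\nu}$ the distortion function of $\nu$. Then, for every $K\in\mathbb{N}^*$,

$$
\big\|D^{1/2}_{K,\mu}-D^{1/2}_{K,\nu}\big\|_{\sup}:=\sup_{x\in(\mathbb{R}^d)^K}\Big|D^{1/2}_{K,\mu}(x)-D^{1/2}_{K,\nu}(x)\Big|\le W_2(\mu,\nu), \tag{16}
$$

by a simple application of the triangle inequality for the $L^2$-norm (see  Formula (4.4) and Lemma 3.4). Hence, if $(\mu_n)_{n\ge1}$ is a sequence in $\mathcal{P}_2(\mathbb{R}^d)$ converging for the $W_2$-distance to $\mu_\infty\in\mathcal{P}_2(\mathbb{R}^d)$, then for every $K\in\mathbb{N}^*$,

$$
\big\|D^{1/2}_{K,\mu_n}-D^{1/2}_{K,\mu_\infty}\big\|_{\sup}\le W_2(\mu_n,\mu_\infty)\xrightarrow[n\to+\infty]{}0. \tag{17}
$$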
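Inequality (16) can be observed numerically on the real line, where $W_2$ between two empirical measures with the same number of atoms is obtained by matching sorted samples. The following sketch draws two samples and checks the inequality on a few random grids; the helper names, sample sizes and distribution parameters are illustrative assumptions.

```python
import math
import random

def sqrt_distortion(sample, grid):
    """Square root of the quadratic distortion of a 1-D empirical measure."""
    return math.sqrt(sum(min((s - g) ** 2 for g in grid) for s in sample) / len(sample))

def w2_1d(xs, ys):
    """W_2 between two 1-D empirical measures with equally many atoms."""
    n = len(xs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / n)

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [random.gauss(0.3, 1.2) for _ in range(200)]
w2 = w2_1d(xs, ys)

for _ in range(5):
    grid = [random.uniform(-3.0, 3.0) for _ in range(4)]  # level K = 4
    gap = abs(sqrt_distortion(xs, grid) - sqrt_distortion(ys, grid))
    assert gap <= w2 + 1e-12  # (16), up to floating-point rounding
```

The bound holds uniformly over grids, which is what makes it usable at the data-dependent grid $x^{(n)}$.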

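We can also define the quantization error function $e_{p,K,\mu}$ (resp. the distortion function $D_{p,K,\mu}$) for any order $p$ as follows,

$$
\forall\,x\in(\mathbb{R}^d)^K,\quad e_{p,K,\mu}(x):=\Big[\int_{\mathbb{R}^d}\min_{1\le k\le K}|\xi-x_k|^p\,\mu(d\xi)\Big]^{1/p},\qquad D_{p,K,\mu}(x):=\int_{\mathbb{R}^d}\min_{1\le k\le K}|\xi-x_k|^p\,\mu(d\xi)=e^{p}_{p,K,\mu}(x).
$$

For $p\in[1,2]$ and for every $K\in\mathbb{N}^*$, we have an inequality similar to (16):

$$
\big\|e_{p,K,\mu}-e_{p,K,\nu}\big\|_{\sup}=\big\|D^{1/p}_{p,K,\mu}-D^{1/p}_{p,K,\nu}\big\|_{\sup}\le W_2(\mu,\nu). \tag{18}
$$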

Let $\mu_n,\mu_\infty\in\mathcal{P}_2(\mathbb{R}^d)$, $n\in\mathbb{N}^*$, be such that $W_2(\mu_n,\mu_\infty)\to0$. For a fixed quantization level $K$, the consistency of optimal grids was first established by D. Pollard, who used

$$
\mu_K\in\mathcal{P}(K):=\big\{\nu\in\mathcal{P}_2(\mathbb{R}^d)\ \text{such that}\ \mathrm{card}\big(\mathrm{supp}(\nu)\big)\le K\big\}
$$

to represent a quantization “grid” at level $K$; such a $\mu_K$ is called “optimal” for a probability measure $\mu$ if $W_2(\mu_K,\mu)=e^{*}_{K,\mu}$. We state the consistency theorem differently, letting a $K$-tuple $x\in(\mathbb{R}^d)^K$ represent the optimal grid (of course we still call the theorem “Pollard’s Theorem”), and we defer the proof of Pollard’s Theorem in this representation to Appendix B.

###### Theorem (Pollard’s Theorem).

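Let $K\in\mathbb{N}^*$ be the quantization level. Let $\mu_n,\mu_\infty\in\mathcal{P}_2(\mathbb{R}^d)$ be such that $\mathrm{card}(\mathrm{supp}(\mu_n))\ge K$ for every $n\in\mathbb{N}^*\cup\{\infty\}$. Assume $W_2(\mu_n,\mu_\infty)\to0$ as $n\to+\infty$. For $n\in\mathbb{N}^*$, let $x^{(n)}$ be an optimal grid for $\mu_n$ at level $K$. Then the grid sequence $(x^{(n)})_{n\ge1}$ is bounded in $(\mathbb{R}^d)^K$ and any limiting point of $(x^{(n)})_{n\ge1}$, denoted by $x^{(\infty)}$, is an optimal grid of $\mu_\infty$.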

## 2 General case

### 2.1 Convergence rate of optimal grid sequence

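Let $\mu_n,\mu_\infty\in\mathcal{P}_2(\mathbb{R}^d)$ be such that $W_2(\mu_n,\mu_\infty)\to0$ as $n\to+\infty$. Fix a quantization level $K$ throughout this section. For every $n\in\mathbb{N}^*$, let $x^{(n)}\in G_K(\mu_n)$, which is nonempty by Proposition 1.2 - (ii), be an optimal quantization grid of $\mu_n$ at level $K$.

Recall that a probability distribution $\mu$ with density $f$ has a $k$-th radial-controlled tail (Definition 1.3) if there exist $A>0$ and a function $g:\mathbb{R}_+\to\mathbb{R}_+$ such that

$$
\forall\,\xi\in\mathbb{R}^d,\ |\xi|\ge A,\quad f(\xi)\le g(|\xi|)\quad\text{and}\quad\int_{\mathbb{R}_+}x^{k}g(x)\,dx<+\infty.
$$

Under the radial-controlled tail assumption, the convergence rate of the optimal grids and of their performance can be bounded by the convergence rate of the sequence of probability measures in the Wasserstein distance, multiplied by a constant, as described in the following theorem.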

###### Theorem 2.1.

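Let $K\in\mathbb{N}^*$ be the quantization level. Let $\mu_n,\mu_\infty\in\mathcal{P}_2(\mathbb{R}^d)$ with $\mathrm{card}(\mathrm{supp}(\mu_n))\ge K$ for all $n\in\mathbb{N}^*\cup\{\infty\}$. Assume that $W_2(\mu_n,\mu_\infty)\to0$ as $n\to+\infty$. For $n\in\mathbb{N}^*$, let $x^{(n)}$ be an optimal quantization grid of $\mu_n$.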

1. If , suppose that

1. has a -th radial-controlled tail,

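2. For any $x\in G_K(\mu_\infty)$, the Hessian matrix of $D_{K,\mu_\infty}$ evaluated at $x$, denoted by $H_{D_{K,\mu_\infty}}(x)$, is a positive definite matrix.

Consider the smallest eigenvalue of all the matrices $H_{D_{K,\mu_\infty}}(x)$, $x\in G_K(\mu_\infty)$. Then for $n$ large enough,

$$
d\big(x^{(n)},G_K(\mu_\infty)\big)^2\le K^{(1)}_n\Big(D_{K,\mu_\infty}(x^{(n)})-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu_\infty}(x)\Big)\le K^{(2)}_n\cdot W_2(\mu_n,\mu_\infty),
$$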

where and .

2. Non-asymptotic upper bound for the performance. If , suppose that for some such that . Then for any

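$$
D_{K,\mu_\infty}(x^{(n)})-\inf_{x\in(\mathbb{R}^d)^K}D_{K,\mu_\infty}(x)\le W_2(\mu_n,\mu_\infty)\Big[\frac{C_{\mu_\infty,d,\eta}}{K^{1/d}}+2\,W_2(\mu_n,\mu_\infty)+\frac{\widetilde{C}_{d,\eta}}{K^{1/d}}\,W_{2+\eta}(\mu_n,\mu_\infty)\Big],
$$

where $C_{\mu_\infty,d,\eta}$ is a constant depending on $\mu_\infty$, $d$ and $\eta$, and $\widetilde{C}_{d,\eta}$ depends on $d$ and $\eta$.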
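As a sanity check (an added example, not part of the original argument), the case $K=1$ can be worked out by hand. Writing $m_\infty$ and $m_n$ for the means of $\mu_\infty$ and $\mu_n$, the distortion function is

$$
D_{1,\mu_\infty}(x)=\int_{\mathbb{R}^d}|\xi-x|^2\,\mu_\infty(d\xi)=\int_{\mathbb{R}^d}|\xi-m_\infty|^2\,\mu_\infty(d\xi)+|x-m_\infty|^2,
$$

so that $G_1(\mu_\infty)=\{m_\infty\}$, $x^{(n)}=m_n$ and the Hessian matrix equals $2I_d$ everywhere. Hence

$$
d\big(x^{(n)},G_1(\mu_\infty)\big)^2=|m_n-m_\infty|^2=D_{1,\mu_\infty}(x^{(n)})-\inf_{x\in\mathbb{R}^d}D_{1,\mu_\infty}(x)\le W_2(\mu_n,\mu_\infty)^2,
$$

where the last inequality follows from $|m_n-m_\infty|=\big|\int(x-y)\,\pi(dx,dy)\big|\le\big(\int|x-y|^2\,\pi(dx,dy)\big)^{1/2}$ for every coupling $\pi\in\Pi(\mu_n,\mu_\infty)$. In this simple case the performance is therefore controlled by the Wasserstein distance, in line with the structure of the bounds above.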

The proof of Theorem 2.1 relies on the following lemma.

###### Lemma 2.2.

Let $\mu\in\mathcal{P}_2(\mathbb{R}^d)$ be absolutely continuous with respect to the Lebesgue measure $\lambda_d$ on $\mathbb{R}^d$. If $\mu$ has a -th radial-controlled tail, then every entry of the Hessian matrix $H_{D_{K,\mu}}$ of the distortion function $D_{K,\mu}$ is a continuous function. As a consequence, if the Hessian matrix $H_{D_{K,\mu}}$ is positive definite at some point $x$, then it is positive definite in a neighbourhood of $x$.

The proof of Lemma 2.2 is in Appendix C.

###### Proof of Theorem 2.1.

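(a) Since the quantization level $K$ is fixed throughout the proof, we drop the subscripts $K$ and $\mu$ of the distortion function and denote by $D_n$ (respectively, $D_\infty$) the distortion function of $\mu_n$ (resp. $\mu_\infty$).

By Pollard’s theorem in Section 1.1, $(x^{(n)})_{n\ge1}$ is bounded and any limiting point of $(x^{(n)})_{n\ge1}$ is in $G_K(\mu_\infty)$. We may assume that, up to a subsequence of $(x^{(n)})_{n\ge1}$, still denoted by $(x^{(n)})_{n\ge1}$, we have $x^{(n)}\to x^{(\infty)}\in G_K(\mu_\infty)$. Hence $d\big(x^{(n)},G_K(\mu_\infty)\big)\le|x^{(n)}-x^{(\infty)}|$.

It follows from (15) that $D_\infty$ is differentiable at $x^{(\infty)}$. Hence, the Taylor expansion of $D_\infty$ at $x^{(\infty)}$ reads:

$$
D_\infty(x^{(n)})=D_\infty(x^{(\infty)})+\big\langle\nabla D_\infty(x^{(\infty)})\,\big|\,x^{(n)}-x^{(\infty)}\big\rangle+\frac12\,H_{D_\infty}(\zeta^{(n)})\big(x^{(n)}-x^{(\infty)}\big)^{\otimes2},
$$

where $H_{D_\infty}$ denotes the Hessian matrix of $D_\infty$, $\zeta^{(n)}$ lies in the geometric segment $(x^{(n)},x^{(\infty)})$, and for a matrix $A$ and a vector $u$, $Au^{\otimes2}$ stands for $u^{T}Au$.

As $x^{(\infty)}$ minimizes $D_\infty$ and $D_\infty$ is differentiable at $x^{(\infty)}$, one has $\nabla D_\infty(x^{(\infty)})=0$ by applying Fermat’s theorem on stationary points. Hence

$$
D_\infty(x^{(n)})-D_\infty(x^{(\infty)})=\frac12\,H_{D_\infty}(\zeta^{(n)})\big(x^{(n)}-x^{(\infty)}\big)^{\otimes2}. \tag{19}
$$

Since $x^{(n)}$ minimizes $D_n$, we have $D_n(x^{(n)})\le D_n(x^{(\infty)})$, and it follows that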

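$$
H_{D_\infty}(\zeta^{(n)})\big(x^{(n)}-x^{(\infty)}\big)^{\otimes2}=2\big(D_\infty(x^{(n)})-D_\infty(x^{(\infty)})\big)\le2\big(D_\infty(x^{(n)})-D_n(x^{(n)})+D_n(x^{(\infty)})-D_\infty(x^{(\infty)})\big)
$$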