Multi-Scale Vector Quantization with Reconstruction Trees

07/08/2019 ∙ by Enrico Cecini, et al. ∙ 2

We propose and study a multi-scale approach to vector quantization. We develop an algorithm, dubbed reconstruction trees, inspired by decision trees. Here the objective is parsimonious reconstruction of unsupervised data, rather than classification. In contrast to more standard vector quantization methods, such as K-means, the proposed approach leverages a family of given partitions to quickly explore the data in a coarse-to-fine (multi-scale) fashion. Our main technical contribution is an analysis of the expected distortion achieved by the proposed algorithm, when the data are assumed to be sampled from a fixed unknown distribution. In this context, we derive both asymptotic and finite sample results under suitable regularity assumptions on the distribution. As a special case, we consider the setting where the data generating distribution is supported on a compact Riemannian sub-manifold. Tools from differential geometry and concentration of measure are useful in our analysis.




1 Introduction

Dealing with large high-dimensional data-sets is a hallmark of modern signal processing and machine learning. In this context, finding parsimonious representations from unlabeled data is often key to both reliable estimation and efficient computations, and more generally to exploratory data analysis. A classical approach to this problem is principal component analysis (PCA), relying on the assumption that data are well represented by a linear subspace. Starting from PCA, a number of developments can be considered to relax the linearity assumption. For example, kernel PCA is based on performing PCA after a suitable nonlinear embedding [26]. Sparse dictionary learning tries to find a set of vectors on which the data can be written as sparse linear combinations [21]. Another line of work assumes the data to be sampled from a distribution supported on a manifold and includes isomap [29], Hessian eigenmaps [8], Laplacian eigenmaps [3] and related developments such as diffusion maps [20]. A more recent and original perspective has been proposed in [14], and called geometric multi-resolution analysis (GMRA). Here the idea is to borrow and generalize ideas from multi-resolution analysis and wavelet theory [17], to derive locally linear representations organized in a multi-scale fashion. The corresponding algorithm is based on a cascade of local PCAs and is reminiscent of classical decision trees for function approximation, see e.g. [12]. In this paper we further explore the ideas introduced in GMRA, which is our main reference.

Indeed, we consider these ideas in the context of vector quantization, which is another classical, and extreme, example of parsimonious representation. Here, a set of centers and a corresponding partition are considered, and all data points in each cell of the partition are represented by the corresponding center. The most classical approach in this context is probably K-means, where a set of centers (means) is defined by a non-convex optimization problem over all possible partitions. Our approach offers an alternative to K-means, by following the basic idea of GMRA and decision trees, but considering local means rather than local PCA. In this view, our approach can be seen as a zero-th order version of GMRA, hence providing a piece-wise constant data approximation. Compared to K-means, the search for a partition is performed through a coarse-to-fine recursive procedure, rather than by global optimization, a strategy that we call reconstruction tree. As a byproduct the corresponding vector quantization is multi-scale, and naturally yields a multi-resolution representation of the data. Our main technical contribution is a theoretical analysis of the above multi-scale vector quantization procedure. We consider a statistical learning framework, where the data are assumed to be sampled according to some fixed unknown distribution, and measure performance according to the so called expected distortion, which measures the reconstruction error with respect to the whole data distribution. Our main result is a corresponding finite sample bound in terms of natural geometric assumptions.

The rest of the paper is organized as follows. After describing the basic ideas in the context of vector quantization in Section 2, we present the algorithm we study in Section 3. In Section 4, we introduce the basic theoretical assumptions needed in our analysis, and illustrate them considering the case where the data are samples from a manifold. In Section 5, we present and discuss our main results, and detail the main steps in their proofs. All other proofs are deferred to the Appendix.

2 Vector quantization & distortion

We next introduce the problem of interest and comment on its connections with related questions.

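A vector quantization (VQ) procedure is defined by a set of code vectors/centers and an associated partition of the data space. The idea is that compression can be achieved by replacing all points in a given cell of the partition by the corresponding code vector.

More precisely, assuming that the data space is $\mathcal X\subset\mathbb R^D$, consider a set of code vectors $\{c_I\}_{I\in\Lambda}\subset\mathbb R^D$, where the set of cells $\Lambda$ defines a partition of $\mathcal X$. Then, a nonlinear projection $P_\Lambda:\mathcal X\to\mathbb R^D$ can be defined by

$$P_\Lambda(x)=\sum_{I\in\Lambda}c_I\,1_I(x) \tag{1}$$

for all $x\in\mathcal X$, with $1_I$ the characteristic function of the cell $I$. Given a set of points $x_1,\dots,x_n\in\mathcal X$ the error (distortion) incurred by this nonlinear projection can be defined as

$$\hat{\mathcal E}(P_\Lambda)=\frac1n\sum_{i=1}^n\|x_i-P_\Lambda(x_i)\|^2,$$

where $\|\cdot\|$ is the Euclidean norm in $\mathbb R^D$. If we consider the data to be independent and identically distributed samples of a random variable $X$ in $\mathcal X$, then the following error measure can also be considered

$$\mathcal E(P_\Lambda)=\mathbb E\big[\|X-P_\Lambda(X)\|^2\big]. \tag{2}$$

The above error measure is the expected distortion associated to the quantization defined by $P_\Lambda$.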
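The nonlinear projection and the empirical distortion can be sketched in a few lines. This is only an illustration: the names `project`, `empirical_distortion`, `cell_of` and the toy data are assumptions made here, and the partition is passed in as a function returning the index of the cell that contains a point.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def project(x: Point, cell_of: Callable[[Point], int], centers: List[Point]) -> Point:
    """Nonlinear projection: map x to the code vector of the cell containing it."""
    return centers[cell_of(x)]


def empirical_distortion(
    points: List[Point], cell_of: Callable[[Point], int], centers: List[Point]
) -> float:
    """Average squared Euclidean distance between each point and its code vector."""
    total = 0.0
    for x in points:
        c = project(x, cell_of, centers)
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / len(points)


# Toy example (hypothetical data): two cells on the real line, split at 0.5.
points = [(0.0,), (0.5,), (1.0,)]
centers = [(0.25,), (0.75,)]


def cell_of(x: Point) -> int:
    return 0 if x[0] < 0.5 else 1


distortion = empirical_distortion(points, cell_of, centers)
print(distortion)  # 0.0625
```

The expected distortion replaces the average over the sample by an expectation over the data distribution, so it cannot be computed from the data alone; the empirical quantity above is its natural estimate.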

In the following we are interested in deriving VQ schemes with small expected distortion given a dataset $x_1,\dots,x_n$ of $n$ samples of $X$. Before describing the algorithm we propose, we add two remarks.

Remark 1 (Comparison to supervised learning).

Classical supervised learning is concerned with the problem of inferring a functional relationship $f:\mathcal X\to\mathcal Y$ given a set of input-output pairs $(x_1,y_1),\dots,(x_n,y_n)$. A classical error measure is the least squares loss $\|y-f(x)\|^2$ (if the outputs are vector valued). A parallel between the above setting and supervised learning can be seen by considering the case where the input and output spaces coincide, so that the least squares loss would be $\|x-f(x)\|^2$. Clearly, in this case an optimal solution is given by the identity map, unless further constraints are imposed. Following the above remark, we can view the nonlinear projection as a piece-wise constant approximation of the identity map, possibly providing a parsimonious representation.

Remark 2 (Vector quantization as Dictionary Learning).

A dictionary is a set of vectors $a_1,\dots,a_p$ (called atoms) that can be used to approximately decompose each point $x$ in the data space, i.e.

$$x\approx\sum_{j=1}^p\beta_j a_j,$$

with $\beta=(\beta_1,\dots,\beta_p)$ a coefficients vector. Given a set of points $x_1,\dots,x_n$, dictionary learning is the problem of estimating a dictionary, as well as a set of coefficients vectors. Vector quantization can be viewed as a form of dictionary learning where the code vectors are the atoms, and the coefficients vectors are binary and have at most one non zero component [21].

The above remarks provide different views on the problem under study. We next discuss how ideas from decision trees in supervised learning can be borrowed to define a novel VQ approach.

3 Multi-scale vector quantization via reconstruction trees

We next describe our approach to Multi-Scale Vector Quantization (MSVQ), based on a recursive procedure that we call reconstruction trees, since it is inspired by decision trees for function approximation. The key ingredient in the proposed approach is a family of partitions organized in a tree. The partition at the root of the tree has the largest cells, while partitions with cells of decreasing size are found at deeper levels. This partition tree provides a multi-scale description of the data space: the deeper the level, the finer the scale. The idea is to use data to identify a subset of cells, and a corresponding partition, providing a VQ solution (1) with low expected distortion (2). We next describe this idea in detail.

3.1 Partition trees and subtrees

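We begin by introducing the definition of a partition tree. In the following we denote by $\mathcal X$ the data space endowed with its natural Borel $\sigma$-algebra and by $\sharp A$ the cardinality of a set $A$.

Definition 3.

A partition tree $\mathbb T$ is a denumerable family $\{\Lambda_j\}_{j\in\mathbb N}$ of partitions of $\mathcal X$ satisfying

  1. $\Lambda_0=\{\mathcal X\}$ is the root of the tree;

  2. each family $\Lambda_j$ is a finite partition of $\mathcal X$ into Borel subsets, i.e.

    $$\mathcal X=\bigcup_{I\in\Lambda_j}I,\qquad I\cap J=\emptyset\ \text{ for all } I,J\in\Lambda_j,\ I\neq J,\qquad \sharp\Lambda_j<+\infty;$$

  3. for each $I\in\Lambda_j$, there exists a family $\mathcal C(I)\subset\Lambda_{j+1}$ such that

    $$I=\bigcup_{J\in\mathcal C(I)}J,\qquad \sharp\,\mathcal C(I)\le a, \tag{3}$$

    where $a\in\mathbb N$ is a constant depending only on $\mathbb T$.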

Note that we allow the partition tree to have arbitrary, possibly infinite, depth, which is needed to derive asymptotic results.

Further, notice that, since $\sharp\Lambda_j\le a^j$, the constant $a$ characterizes how the cardinality of each partition increases at finer scales. The case $a=2$ corresponds to dyadic trees.

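We add some further definitions. For any $j\in\mathbb N$ and $I\in\Lambda_j$, the depth of $I$ is $j$ and is denoted by $j_I$. The cells in $\mathcal C(I)\subset\Lambda_{j+1}$ are the children of $I$; the unique cell $J\in\Lambda_{j-1}$ such that $I\in\mathcal C(J)$ is the parent of $I$ and is denoted by $\mathcal P(I)$ (by definition $\mathcal P(\mathcal X)=\mathcal X$). We regard $\mathbb T$ as a set of nodes where each node is defined by a cell $I$ with its parent $\mathcal P(I)$ and its children $\mathcal C(I)$. The following definition will be crucial.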

Definition 4.

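A (proper) subtree of $\mathbb T$ is a family $\mathcal T\subset\mathbb T$ of cells such that $\mathcal P(I)\in\mathcal T$ for all $I\in\mathcal T$, and

$$\Lambda(\mathcal T)=\{I\in\mathbb T : I\notin\mathcal T,\ \mathcal P(I)\in\mathcal T\}$$

denotes the set of outer leaves.

It is important in what follows that $\Lambda(\mathcal T)$ is a partition of $\mathcal X$ if $\mathcal T$ is finite, see Lemma 11.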
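As a concrete illustration of the two definitions, the following sketch uses the dyadic partition tree of the interval [0, 1), where every cell has two children. The encoding of a cell as a pair `(depth, index)` and all function names are assumptions made for this example only.

```python
from fractions import Fraction
from typing import List, Set, Tuple

# A cell (j, k) encodes the dyadic interval [k / 2**j, (k + 1) / 2**j).
Cell = Tuple[int, int]
ROOT: Cell = (0, 0)


def children(cell: Cell) -> List[Cell]:
    """The two halves of a dyadic interval, one level deeper."""
    j, k = cell
    return [(j + 1, 2 * k), (j + 1, 2 * k + 1)]


def parent(cell: Cell) -> Cell:
    """Parent cell; by convention the root is its own parent."""
    j, k = cell
    return ROOT if cell == ROOT else (j - 1, k // 2)


def is_subtree(cells: Set[Cell]) -> bool:
    """A non-empty family of cells closed under taking parents."""
    return bool(cells) and all(parent(c) in cells for c in cells)


def outer_leaves(tree: Set[Cell]) -> Set[Cell]:
    """Cells outside the subtree whose parent belongs to the subtree."""
    return {c for p in tree for c in children(p) if c not in tree}


def length(cell: Cell) -> Fraction:
    return Fraction(1, 2 ** cell[0])


tree = {ROOT, (1, 0)}
leaves = outer_leaves(tree)
total_length = sum(length(c) for c in leaves)
print(sorted(leaves))  # [(1, 1), (2, 0), (2, 1)]
print(total_length)    # 1
```

The outer leaves here are [1/2, 1), [0, 1/4) and [1/4, 1/2); their lengths add up to 1, consistent with the fact that the outer leaves of a finite subtree partition the data space.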

3.2 Reconstruction trees

We next discuss a data driven procedure to derive a suitable partition and a corresponding nonlinear projection. To this end, we need a few definitions depending on an available dataset $x_1,\dots,x_n$.

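For each cell $I$, we fix an arbitrary point $\hat x^*_I\in\mathcal X$ and define the corresponding cardinality and center of mass, respectively, as

$$n_I=\sum_{i=1}^n 1_I(x_i),\qquad \hat c_I=\begin{cases}\dfrac{1}{n_I}\displaystyle\sum_{i=1}^n x_i\,1_I(x_i) & \text{if } n_I\neq 0,\\[6pt] \hat x^*_I & \text{if } n_I=0,\end{cases}$$

where $1_I$ is the characteristic function of $I$, i.e.

$$1_I(x)=\begin{cases}1 & \text{if } x\in I,\\ 0 & \text{if } x\notin I.\end{cases}$$

If $0\in\mathcal X$, a typical choice is $\hat x^*_I=0$ for all cells $I$. While $\hat c_I$ depends on the choice of $\hat x^*_I$, our bounds hold true for all choices. We point out that it is more convenient to choose $\hat x^*_I\in I$, as this (arbitrary) choice produces an improvement for free, in particular for cells that have positive probability but contain no data points.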

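Using this quantity we can define a local error measure for each cell $I$,

$$\hat{\mathcal E}_I=\frac1n\sum_{x_i\in I}\|x_i-\hat c_I\|^2,$$

as well as the potential error difference induced by considering a refinement,

$$\hat\epsilon_I^{\,2}=\hat{\mathcal E}_I-\sum_{J\in\mathcal C(I)}\hat{\mathcal E}_J=\frac1n\sum_{J\in\mathcal C(I)}n_J\,\|\hat c_J-\hat c_I\|^2,$$

where the second equality is a consequence of the between-within decomposition of the variance. Following [4], we first truncate the partition tree at a given depth, depending on the size of the data set. More precisely, given , we set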


Deeper trees are considered as data size grows.
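The between-within decomposition behind the refinement gain can be checked numerically. The sketch below assumes that local errors are normalized by the total sample size n, takes a parent cell split into two children along the first coordinate, and uses made-up data; the helper names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def mean(points: List[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(u: Point, v: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def local_error(points: List[Point], n_total: int) -> float:
    """Within-cell error: squared distances to the cell mean, divided by n_total."""
    if not points:
        return 0.0
    c = mean(points)
    return sum(sq_dist(p, c) for p in points) / n_total


# Hypothetical data in the parent cell [0, 1) x R, split at 0.5 on the first coordinate.
data = [(0.1, 0.0), (0.3, 1.0), (0.6, 0.5), (0.9, 0.2), (0.7, 0.8)]
n = len(data)
left = [p for p in data if p[0] < 0.5]
right = [p for p in data if p[0] >= 0.5]

# Gain written as parent error minus the sum of the children errors.
gain_as_difference = local_error(data, n) - local_error(left, n) - local_error(right, n)

# Gain written as weighted squared distances between child means and parent mean.
c_parent = mean(data)
gain_as_between = sum(len(ch) * sq_dist(mean(ch), c_parent) for ch in (left, right)) / n
```

The two quantities agree up to floating point error, and the second form makes it evident that the gain is non-negative: refining a cell never increases the empirical distortion.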

As a second step, we select the cells such that $\hat\epsilon_I\ge\eta$. Since $\hat\epsilon_I$ is not a decreasing function of the depth of the tree, this requires some care (see Remark 10 for an alternative construction). Indeed, for a threshold $\eta>0$, we define the subtree


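and $\hat\Lambda_\eta$ is defined as the set of outer leaves of $\hat{\mathcal T}_\eta$, i.e. $\hat\Lambda_\eta=\Lambda(\hat{\mathcal T}_\eta)$, see Figure 1 below. Note that $\hat{\mathcal T}_\eta$ is finite, so that by Lemma 11 $\hat\Lambda_\eta$ is a partition of $\mathcal X$ such that for all .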

The code vectors are the centers of mass of the cells of the above empirical partition, and the corresponding nonlinear projection is

$$\hat P_\eta(x)=\sum_{I\in\hat\Lambda_\eta}\hat c_I\,1_I(x). \tag{7}$$
We add a few comments. The above vector quantization procedure, which we call reconstruction tree, is recursive and depends on the threshold $\eta$. Different quantizations and corresponding distortions are achieved by different choices of $\eta$. Smaller values of $\eta$ correspond to vector quantizations with smaller distortion. It is clear that the empirical distortion becomes zero for a suitably small $\eta$, corresponding to having a single point in each cell. Understanding the behaviour of the expected distortion as a function of $\eta$ and the number of points $n$ is our main theoretical contribution. Before discussing these results we discuss the connection of the above approach to related ideas. A similar construction is given in [14]. However, there the thresholding criterion depends on the scale, see Section 2.3 of the cited reference.
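The whole procedure can be sketched as follows, under assumptions that are not taken from the paper: the partition tree is the family of dyadic cubes of [0, 1)^d, the truncation depth is passed directly as `max_depth`, a cell is kept when it or one of its descendants up to `max_depth` has refinement gain at least the threshold `eta`, the root is always kept, and empty cells use their midpoint as code vector. All names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Cell = Tuple[int, Tuple[int, ...]]  # (depth j, integer corner k) of a dyadic cube in [0, 1)^d


def mean(points: List[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(u: Point, v: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def children(cell: Cell) -> List[Cell]:
    j, k = cell
    return [(j + 1, tuple(2 * km + b for km, b in zip(k, bits)))
            for bits in product((0, 1), repeat=len(k))]


def contains(cell: Cell, x: Point) -> bool:
    j, k = cell
    return all(int(xm * 2 ** j) == km for xm, km in zip(x, k))


def midpoint(cell: Cell) -> Point:
    j, k = cell
    return tuple((km + 0.5) / 2 ** j for km in k)


def reconstruction_tree(data: List[Point], eta: float, max_depth: int) -> Dict[Cell, Point]:
    """Return the outer leaves of the thresholded subtree with their code vectors."""
    n = len(data)
    leaves: Dict[Cell, Point] = {}

    def grow(cell: Cell, pts: List[Point]) -> bool:
        # True if the cell belongs to the selected subtree.
        if cell[0] > max_depth or not pts:
            return False
        center = mean(pts)
        kids = [(ch, [x for x in pts if contains(ch, x)]) for ch in children(cell)]
        # Refinement gain: weighted squared distances between child and parent means.
        eps2 = sum(len(p) * sq_dist(mean(p), center) for _, p in kids if p) / n
        kept = [grow(ch, p) for ch, p in kids]
        in_tree = cell[0] == 0 or eps2 >= eta ** 2 or any(kept)
        if in_tree:
            for (ch, p), child_kept in zip(kids, kept):
                if not child_kept:  # the child is an outer leaf
                    leaves[ch] = mean(p) if p else midpoint(ch)
        return in_tree

    root: Cell = (0, (0,) * len(data[0]))
    grow(root, data)
    return leaves


def quantize(x: Point, leaves: Dict[Cell, Point]) -> Point:
    """Nonlinear projection: code vector of the leaf containing x."""
    return next(c for cell, c in leaves.items() if contains(cell, x))


# Hypothetical one-dimensional data with two tight clusters.
data = [(0.10,), (0.12,), (0.14,), (0.80,), (0.82,), (0.84,)]
coarse = reconstruction_tree(data, eta=0.1, max_depth=3)
fine = reconstruction_tree(data, eta=0.005, max_depth=3)
```

With the larger threshold only the root is refined, giving one code vector per cluster. With the smaller threshold the tree grows deeper, even through cells whose own gain is zero but which have a descendant with a large enough gain; this is the non-monotonicity that makes the selection step require some care.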

3.3 Comparison with related Topics

The above approach can be compared with a number of different ideas.

Decision and Regression Trees

We call the above procedure Reconstruction Tree, since its definition is formally analogous to that of decision trees for supervised learning; see for example [13], Chapter 8. In particular our construction and analysis follows closely that of tree based estimators studied in [4], in the context of least square regression. As commented in Remark 1 our problem can actually be interpreted as a special instance of regression. More precisely, referring to the notation of [4], the definition of lacks a natural analogue in our setting; this can be overcome by defining the regression function as the identity function from to . From this point on the two formalisms overlap, in that becomes . Since the tree based estimators considered in [4] are piece-wise constant, the expected square loss cannot vanish and its analysis is non trivial. Despite the formal similarity, the two settings do exhibit distinct features. For example, the analysis in [4] is specifically formulated for scalar functions, while our analysis is necessarily vectorial in nature. In [4] a uniform bound is imposed, while in our setting we can assume a local bound for free; namely, if is constant on a cell then for all . The present setting finds a natural instance in the case of a probability measure supported on a smooth manifold isometrically embedded in (see Section 4), while this case is hardly addressed explicitly in the literature about least square regression. For example, the manifold case is actually discussed, in the context of classification through Decision Trees, in [28]. One last point is that the present work contains explicit quantitative results about the approximation error, see Section 5.1, while similar results are not available in the setting of [4], that aspect being typically addressed indirectly in the corresponding literature.

Empirical risk minimization

Again in analogy to supervised learning, as in [4], one can consider the minimization problem:

$$\min_{f\in\mathcal F_\Lambda}\frac1n\sum_{i=1}^n\|x_i-f(x_i)\|^2,$$

where $\mathcal F_\Lambda$ is the (finite-dimensional) vector space of the vector fields $f:\mathcal X\to\mathbb R^D$ which are piecewise constant on a given partition $\Lambda$. There correspond a number of independent minimization problems, one for each cell in $\Lambda$, so that, when $\Lambda$ is the empirical partition, the projection from (7) is easily shown to be a minimizer. The minimizer is not unique, since the value of $f$ is irrelevant on cells $I$ such that $n_I=0$. Similar considerations hold for the expected distortion as well, in which case the value of $f$ is irrelevant on cells of zero probability. See also Section 3.2, Section 5.1 and Lemma 21.
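The reduction to independent problems, one per cell, can be spelled out in two lines. The notation below is chosen for illustration: $v_I$ is the constant value of $f$ on the cell $I$, while $n_I$ and $\hat c_I$ are the number of data points in $I$ and their mean.

```latex
\begin{align*}
\frac1n\sum_{i=1}^n \|x_i - f(x_i)\|^2
  &= \frac1n\sum_{I\in\Lambda}\ \sum_{x_i\in I}\|x_i - v_I\|^2,
  \qquad f=\sum_{I\in\Lambda} v_I\,1_I, \\
\sum_{x_i\in I}\|x_i - v_I\|^2
  &= \sum_{x_i\in I}\|x_i - \hat c_I\|^2 + n_I\,\|\hat c_I - v_I\|^2
  \qquad (n_I>0).
\end{align*}
```

The cross term vanishes because the deviations $x_i-\hat c_I$ sum to zero over the cell, so each non-empty cell is minimized exactly by $v_I=\hat c_I$, while empty cells do not contribute to the objective.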

One could consider minimization over a wider class of functions, piece-wise constant on different partitions, for example on all the partitions with a given number of cells that are induced by proper subtrees of a given partition tree. This would be a combinatorial optimization problem. The algorithm defined in (6) overcomes this issue by providing a one-parameter coarse-to-fine class of partitions, such that each refinement carries local improvements that are uniformly bounded. As observed in [4], such a strategy is inspired by wavelet thresholding.

Geometric multi-resolution analysis (GMRA)

A main motivation for our work was the algorithm GMRA [1, 16, 14], which introduces the idea of learning multi-scale dictionaries by geometric approximation. The main difference between GMRA and Reconstruction Trees is that the former represents data through a piece-wise linear approximation, while the latter through a piece-wise constant approximation. More precisely, rather than considering the center of mass of the data in each cell (see Section 3.2), a linear approximation is obtained by (local) Principal Component Analysis, so that the data belonging to a cell are sent to a linear subspace of suitable dimension, the latter approach being particularly natural in the case of data supported on a manifold. Another difference is in the thresholding strategy: unlike [4] and our work, in [14] the local improvement is scaled depending on its depth in the tree. One of our purposes was to check whether either of these choices affects the learning rates significantly. We provide more quantitative comparisons later in Section 5.


A main motivation for GMRA was extending ideas from wavelets and multi-resolution analysis to the context of machine learning, where, given the potential high dimensionality, non-regular partitions need to be considered; this point is discussed in [1, 16, 14] and references therein. Indeed partition trees generalize the classic notion of dyadic partitions. In this view, given the piece-wise constant nature of reconstruction trees, a parallel can be drawn between the latter and classical Haar wavelets.


Our procedure being substantially a vector quantization algorithm, a comparison with the most common approach to vector quantization, namely K-means, is in order. In K-means, a set of $K$ code vectors $c_1,\dots,c_K$ is derived from the data and used to define a corresponding partition via the Voronoi diagram

$$V_j=\{x\in\mathcal X : \|x-c_j\|\le\|x-c_i\|\ \text{for all } i=1,\dots,K\},\qquad j=1,\dots,K.$$

Code vectors are defined by the minimization of the following empirical objective

$$\min_{c_1,\dots,c_K}\ \frac1n\sum_{i=1}^n\ \min_{j=1,\dots,K}\|x_i-c_j\|^2.$$

This minimization problem is non convex and is typically solved by alternating minimization, a procedure referred to as Lloyd’s algorithm [15]. The inner iteration assigns each point to a center, hence to a corresponding Voronoi cell. The outer minimization can be easily shown to update the code vectors by computing the center of mass, the mean, of each Voronoi cell. In general the algorithm is guaranteed not to increase the objective function and to converge in finite time to a local minimum. Clearly, the initialization is important, and initializations exist yielding some stronger convergence guarantees. In particular, K-means++ is a random initialization providing on average an $O(\log K)$ approximation to the global minimum [2].

Compared to K-means, reconstruction trees restrict the search for a partition to a prescribed family defined by the partition tree. In turn, they allow a fast multi-scale exploration of the data, while K-means requires solving a new optimization problem each time $K$ is changed. Indeed it can be shown that a solution for the $(K-1)$-means problem leads to a bad initialization for the $K$-means problem. In other words, unlike reconstruction trees, the partitions found by K-means at different scales (different values of $K$) are generally unrelated, and cannot be seen as refinements of one another.
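For comparison, the alternating minimization described above can be sketched in plain Python. This is a minimal illustration with a fixed, hand-picked initialization rather than K-means++; the function names and the toy data are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def mean(points: List[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(u: Point, v: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def objective(points: List[Point], centers: List[Point]) -> float:
    """Empirical K-means objective: mean squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points) / len(points)


def lloyd(points: List[Point], init_centers: List[Point], max_iter: int = 100) -> List[Point]:
    centers = list(init_centers)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center (Voronoi cell).
        clusters: List[List[Point]] = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)), key=lambda m: sq_dist(x, centers[m]))
            clusters[j].append(x)
        # Update step: each center becomes the mean of its cell (kept if the cell is empty).
        new_centers = [mean(cl) if cl else c for cl, c in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


data = [(0.10,), (0.12,), (0.14,), (0.80,), (0.82,), (0.84,)]
init = [(0.0,), (1.0,)]
centers = lloyd(data, init)
```

Changing the number of centers means rerunning the whole loop from a new initialization, whereas a reconstruction tree moves between scales by changing the threshold on a fixed family of nested cells.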

Hierarchical clustering

Lastly, our coarse-to-fine approach can be compared with hierarchical clustering, in particular with the so called Ward’s method, which proceeds in the opposite way. Indeed, this algorithm produces a coarser partition of the data starting from a finer one. It starts with a Voronoi partition having all the data as centers, and at each step it merges a pair of cells that have the smallest so called between cluster inertia [30]. Interestingly, this definition has an analogue in our algorithm. Our local error $\hat{\mathcal E}_I$ corresponds to the within cluster inertia of a cell $I$, while the refinement gain $\hat\epsilon_I^{\,2}$ corresponds to the between cluster inertia (up to a normalization factor) of the cells that merge into $I$. Nevertheless the obtained partitions will not in general coincide, unless very specific choices are made ad hoc.
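The analogy can be checked numerically in the simplest case of a cell with two children. The sketch below uses the standard form of Ward's merge cost for two clusters and compares it with the weighted squared distances between the child means and the parent mean; the data and names are made up, and the match is shown only for binary splits.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def mean(points: List[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(u: Point, v: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ward_cost(a: List[Point], b: List[Point]) -> float:
    """Increase in within-cluster sum of squares when clusters a and b are merged."""
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist(mean(a), mean(b))


left = [(0.1, 0.0), (0.3, 1.0)]
right = [(0.6, 0.5), (0.9, 0.2), (0.7, 0.8)]
merged = left + right

# Unnormalized refinement gain of the merged cell with children left and right.
between = sum(len(ch) * sq_dist(mean(ch), mean(merged)) for ch in (left, right))
```

Here `between` equals `ward_cost(left, right)` up to rounding, so for a binary split the refinement gain is Ward's merge cost divided by the sample size when local errors are normalized by n.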

4 General assumptions and manifold setting

In this section, we introduce our main assumptions and then discuss a motivating example where data are sampled at random from a manifold.

We consider a statistical learning framework, in the sense that we assume the data to be random samples from an underlying probability measure. More precisely, we assume the available data to be a realization of independent and identically distributed random vectors taking values in a bounded subset and we denote by the common law. Up to a rescaling and a translation, we assume that and


Our main assumption relates the distribution underlying the data to the partition tree to be used to derive an MSVQ via reconstruction trees. To state it, we recall the notion of essential diameter of a cell , namely

Assumption 5.

There exists and such that for all


where and are fixed constants depending only on .

To simplify the notation, we write for a constant depending only on and we write if there exists a constant such that .

Given the partition tree , the parameters and define a class of probability measures and for this class we are able to provide a finite sample bound on the distortion error of our estimator , see (13). In the context of supervised machine learning is an a-priori class of distributions defining an upper learning rate, see (15a). It is an open problem to provide a lower minimax learning rate.

Clearly, (9b) is implied by the distribution-independent assumption that


i.e. the diameter of the cells goes to zero exponentially with their depth. This assumption ensures that the reconstruction error goes to zero and, in supervised learning, it corresponds to the assumption that the hypotheses space is rich enough to approximate any regression function, compare with condition (A4) in [14].

Eq. (9a) is a sort of regularity condition on the shape of the cells and, if it holds true, (9b) is implied by the following condition


which states that the volume of the cells goes to zero exponentially with their depth.

In [14], following ideas from [4], a suitable model class is introduced, see Definition 5, in terms of the decay of the approximation error, compare Eq. (7) of [14] with (18) below. This important point is further discussed in Section 5.1.

In many cases the parameter is related to the intrinsic dimension of the data. For example, if is the unit cube and is given by

where is the Lebesgue measure of and the density is bounded from above and away from zero, see (12b) below, it is easy to check that the family of dyadic cubes

is a partition tree satisfying Assumption 5 with and a suitable . The construction of dyadic cubes can be extended to more general settings, see [7, 9] and references therein, by providing a large class of other examples, as shown by the following result. The proof is deferred to Section 6.
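For the dyadic cubes of the unit cube the geometry is explicit, which is what makes this example easy to check. The following sketch, with hypothetical function names, computes the cube containing a point at a given depth together with its Euclidean diameter and Lebesgue volume in dimension d.

```python
import math
from typing import Tuple


def dyadic_cube(x: Tuple[float, ...], j: int) -> Tuple[int, ...]:
    """Integer corner k of the depth-j dyadic cube of [0, 1)^d containing x."""
    return tuple(int(xm * 2 ** j) for xm in x)


def diameter(j: int, d: int) -> float:
    """Euclidean diameter of a depth-j dyadic cube in dimension d."""
    return math.sqrt(d) * 2.0 ** (-j)


def volume(j: int, d: int) -> float:
    """Lebesgue volume of a depth-j dyadic cube in dimension d."""
    return 2.0 ** (-j * d)


cube = dyadic_cube((0.3, 0.8), 2)
print(cube)  # (1, 3)
```

Each cube has 2^d children, the diameter decays geometrically with the depth, and the diameter equals the square root of d times the volume raised to the power 1/d. When the density is bounded from above and away from zero, the probability of a cell is comparable to its volume, so relations of this kind carry over from the Lebesgue measure to the data distribution.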

Proposition 6.

Assume that the support of is a connected submanifold of and the distribution is given by


where is the Riemannian volume element of , then there exists a partition tree of satisfying Assumption 5 with , where is the intrinsic dimension of .

We recall that, as a submanifold of , becomes a compact Riemannian manifold with Riemannian distance and Riemannian volume element . We stress that the construction of the dyadic cubes only depends on . Proposition 6 has to be compared with Proposition 3 and Lemma 6 in [14].

By inspecting the proof of the above result, it is possible to show that a partition tree satisfying Assumption 5 always exists if there are a metric and a Borel measure on such that is an Ahlfors regular metric measure space [10, page 413], has density with respect to satisfying (12b) and the embedding of into is a Lipschitz function.

5 Main result

In this section we state and discuss our main results, characterizing the expected distortion of reconstruction trees. The proofs are deferred to Section 6. Our first result is a probabilistic bound for any given threshold . Recall that is defined by   and by (6) and (7).

Theorem 7.

Fix as in  and , for any


where and depends on the partition tree .

As shown in Remark 19, it is possible to set up to an extra logarithmic factor.

Next, we show how it allows us to derive the best choice for as a function of the number of examples, and a corresponding expected distortion bound.

Corollary 8.

Fix , and set


where . Then for any

where is a constant depending on the partition tree . Furthermore
where is the family of distributions such that Assumption 5 holds true.

If is chosen large enough so that , then bound (15a) reads as

where and are suitable constants depending on . This bound can be compared with Theorem 8 in [14] under the assumption that is a compact manifold. Eq. (15a) with gives a convergence rate of the order for any , whereas the GMRA algorithm has a rate of the order , see also Proposition 3 of [14]. Hence, up to a logarithmic factor our estimator has the same convergence rate as the GMRA algorithm. However, it should be noted that our algorithm works with a cheaper representation; indeed, given the adaptive partition , it only requires computing and storing the centers of mass .

In a similar setting, in [6], it is shown that the K-means algorithm, with a suitable choice of depending on , has a convergence rate of the order and the K-flats algorithm of the order .

The proof of Theorem 7 relies on splitting the error in several terms. In particular, it requires studying the stability to random sampling and the approximation properties of reconstruction trees. This latter result is relevant in the context of quantization of probability measures, hence of interest in its own right. We present this result first.

Towards this end, we need to introduce the infinite sample version of the reconstruction tree. For any cell , denote the volume of the cell by

the center of mass of the cell by

where is an arbitrary point in , and the local expected distortion in a cell by


Given the threshold , define the subtree


and let be the corresponding outer leaves. Lemma 14 shows that is finite, so that by Lemma 11 is a partition and the corresponding nonlinear projection is


so that the code vectors are the centers of mass of the cells.

Comparing the definition of and , we observe that is truncated at the depth given by (5), whereas is not truncated, but its maximal depth is bounded by Lemma 17.

Given the above definitions, we have the following result.

Proposition 9.

Given , for all


Note that the bound is meaningful only if . Indeed for and , see Remark 15.

5.1 Approximation Error

The quantity

is called approximation error, by analogy with the corresponding definition in statistical learning theory, and it plays a special role in our analysis.

The problem of approximating a probability measure with a cloud of points is related to the so called optimal quantization [11]. The cost of an optimal quantizer is defined as:

where . An optimal quantizer corresponds to a set of points attaining the infimum, with the corresponding Voronoi-Dirichlet partition of . One can interpret the approximation error as the quantization cost associated with the (suboptimal) quantizer given by the partition as defined in (16) with the corresponding centers , and .

This point of view is also taken in the analysis of K-means given in [6], optimal quantizers corresponding in fact to absolute minimizers of the K-means problem. Asymptotic estimates for the optimal quantization cost are available, see [6] and references therein. In the special case of , being a smooth -dimensional manifold isometrically embedded in , they read:


where is a constant depending only on . We underline that the result provided by Proposition 9 is actually a non-asymptotic estimate for the quantization cost, when the quantizer is given by the outcome of our algorithm. The quantization cost is strictly higher than the optimal one, since, for instance, an optimal quantizer always corresponds to a Voronoi-Dirichlet partition [11]. Nevertheless, as observed in Section 3.3, a Voronoi quantizer is not suitable for multiscale refinements, whereas ours is. Proposition 9 does not directly compare with (19), as it depends on a different parameter quantifying the complexity of the partition, namely instead of . However, by carefully applying (38b), in the manifold case we get:

so that the bound is in fact optimal up to a logarithmic factor. Furthermore, it is worth observing that Assumption 5 together with Proposition 9 provides a more transparent understanding of the approximation part of the analysis, as compared to what is provided in [4] and [14]. Therein, the approximation error is essentially addressed by defining the class of probability measures as those for which a certain approximation property holds; see Definition 5 in [14] and Definition 5 in [4], the thresholding algorithm being explicitly used in both. On the other hand Assumption 5 does not depend on the thresholding algorithm, but only on the mutual regularity of and . Lastly we notice that, while for the sake of clarity none of the constants appear explicitly in our results, the proofs allow in principle to estimate them.

6 Proofs

In this section we collect some of the proofs of the above results. The more technical proofs are postponed to the appendix.

Proof of Prop. 6.

We first observe that it is enough to show that there exists a partition tree for . Indeed, by adding to each partition , the cell , we get a partition of , which satisfies Assumption 5, since .

Since is bounded, then is a connected compact manifold and, hence, is an Ahlfors regular metric measure space [10, page 413], i.e.

where is the ball of center and radius with respect to the Riemannian metric . By (12b)


where is the intrinsic dimension of . Since is an Ahlfors regular metric measure, too, there exists a family of dyadic cubes, i.e for each there is a family of open subsets of such that


where and are given constants [7, Theorem 11]. As noted in [9], it is always possible to redefine each cell by adding a suitable portion of its boundary in such a way that


and (12a)–(21e) still hold true, possibly with different constants. Since is compact, there exists such that for some . Hence, possibly redefining , and , we can assume that and, as a consequence of (21b), (21c) and (22), the family is a partition tree for where the bound in (3) is a consequence of the following standard volume argument. Fix large enough such that for all , , then given

where the third and the fourth inequalities are consequences of (21d) and (20).

On the other hand, by (21e) and (20),

so that

Bound (3) holds true by setting

We now show that (9b) holds true. Indeed, since is a Riemannian submanifold of it holds that


see [22, Corollary 2 and Proposition 21, Chapter 5]. Given , by (21e),

so that (9b) holds true with and . To show (9a), given , by (21d) and (20)


so that (9a) holds true with and . ∎

The proof of Theorem 7 borrows ideas from [4, 14] and combines a number of intermediate results given in Appendix A. For sake of clarity let .

Proof of Thm. 7.

Consider the following decomposition

which holds for all . Since

it holds that

We bound the four terms.

  1. Since , is a partition finer than , then

    where the last inequality is a consequence of (18).

  2. Bound (49a) implies that the term is zero with probability greater than where

  3. Since and , by (40) term C is bounded by with probability greater than with


    where the second inequality is a consequence of (38a) and (38b), and is a suitable constant depending on the partition tree .

  4. By (49b) term D is zero with probability greater than where

It follows that with probability greater than


which gives (13). ∎

Proof of Cor. 8.

Since , then bound (13) gives (15a) since

where . Eq. (15b) is clear. ∎

Proof of Prop. 9.

Given , by (33) and (9b),

so that

and, by taking the limit in (31),

Set for all , then