1 Introduction
An essential challenge in machine learning is to develop scalable techniques able to cope with largescale training collections comprised of numerous elements of high dimension. To achieve this goal, it is necessary to come up with approximate learning schemes which perform the learning task with a fair precision while reducing the memory and computational requirements compared to standard techniques. A possible way to achieve such savings is to compress data beforehand, as illustrated in Figure
1. Instead of the more classical individual compression of each element of the database with random projections [2, 53, 24, 84, 72] the framework we consider in this paper corresponds to the top right scheme: a large collection of training data is compressed into a fixedsize representation called sketch. The dimension of the sketch –and therefore the cost of infering/learning the parameters of interest from this sketch– does not depend on the number of data items in the initial collection. A complementary path to handling largescale collections is online learning [30]. Sketching, which leverages ideas originating from streaming algorithms [41], can trivially be turned into an online version and is amenable to distributed computing.1.1 From compressive sensing to sketched learning.
As we will see, compressing a training collection into a sketch before learning is reminiscent of –and indeed inspired by– compressive sensing (CS) [52] and streaming algorithms [40, 41]. The main goal of CS is to find a dimensionalityreducing linear operator
such that certain highdimensional vectors (or signals) can be reconstructed from their observations by
. Initial work on CS [29, 46] showed that such a reconstruction is possible for sparse signals of dimension ; from only linear measurements by (theoretically) solving an intractable NPcomplete problem ([52], chap. 2), and in practice from linear measurements by solving a convex relaxation of this problem. Matriceswith such reconstruction guarantees can be obtained as typical draws of certain random matrix ensembles. This reconstruction paradigm from few random measurements has subsequently been considered and proven to work for more general signal models
[17]. Examples of such models include lowrank matrices [25], cosparse vectors [75] and dictionary models [26]. Reconstruction from compressive measurements for these models is made possible, among other properties, by the fact that they correspond to unions of subspaces [12] which have a much lower dimension than the ambient dimension.Lowdimensional models also intervene in learning procedures, where one aims at fitting a model of moderate “dimension” to some training data in order to prevent overfitting and ensure good generalization properties. In this paper, we consider mixture models comprising probability distributions on the set of the form
(1) 
where the ’s are probability distributions taken in a certain set and the ’s, , , are the weights of the mixture. Such a model can be considered as a generalized sparse model in the linear space of finite measures over the set .
Similarly to compressive sensing, one can define a linear compressive operator which computes generalized moments [60] of a measure :
(2) 
where the ’s are wellchosen functions on and the constant is used for normalization purposes. In the case where is a probability measure , the integrals are the expectations of with .
Given some training data drawn from , the corresponding empirical distribution is
(3) 
where is a unit mass at . A practical sketch of the data can then be defined^{1}^{1}1Any other unbiased empirical estimator of the moments, for example using the empirical median, can be envisioned. and computed as
(4) 
Denoting the set of probability distributions satisfying (1), fitting a probability mixture to the training collection in a compressive fashion can be expressed as the following optimization problem
(5) 
which corresponds to the search for the probability mixture in the model set whose sketch is closest to the empirical data sketch . By analogy with sparse reconstruction, we propose an iterative greedy reconstruction algorithm to empirically address this problem, and exemplify our framework on the estimation of GMMs.
1.2 Related works
A large set of the existing literature on random projections for dimension reduction in the context of learning focuses on the scheme represented on the bottom left of Figure 1 : each item of the training collection is individually compressed with random projections [2, 53] prior to learning for classification [24, 84] or regression [72], or to fitting a GMM [44]. In contrast, we consider here a framework where the whole training collection is compressed to a fixedsize sketch, corresponding to the top right scheme in Figure 1. This framework builds on work initiated in [19][20]. Compared to [19][20]
, the algorithms we propose here: a) are inspired by Orthogonal Matching Pursuits rather than Iterative Hard Thresholding; b) can handle GMMs with arbitrary diagonal variances; c) yield much better empirical performance (see Section
5 for a thorough experimental comparison).Our approach bears connections with the Generalized Method of Moments (GeMM) [60]
, where parameters are estimated by matching empirical generalized moments computed from the data with theoretical generalized moments of the distribution. Typically used in practice when the considered probability models do not yield explicit likelihoods, the GeMM also provides mathematical tools to study the identifiability and learnability of parametric models such as GMMs
[10, 9, 63]. Using the empirical characteristic function is a natural way of obtaining moment conditions
[49, 50, 101]. Following developments of GeMM with a continuum of moments instead of a finite number of them [31], powerful estimators can be derived when the characteristic function is available at all frequencies simultaneously [32, 107, 33]. Yet, these estimators are rarely implementable in practice.This is naturally connected with a formulation of mixture model estimation as a linear inverse problem. In [22, 11] for example, this is addressed by considering a finite and incoherent dictionary of densities and the unknown density is reconstructed from its scalar products with every density of the dictionary. These scalar products can be interpreted as generalized moments of the density of interest. Under these assumptions, the authors provide reconstruction guarantees for their cost functions. In our framework, we consider possibly infinite and even uncountable dictionaries, and only collect a limited number of “measurements”, much smaller than the number of elements in the dictionary.
Sketching is a classical technique in the database literature [41]. A sketch is a fixedsize data structure which is updated with each element of a data stream, allowing one to perform some tasks without keeping the data stored. Applications include the detection of frequent elements, called heavy hitters [39] and simple statistical estimations on the data [55]. The sketches used in these works typically involve quantization steps which we do not perform in our work. We also consider the density estimation problem, which is closer to machine learning than the typical application of sketches. Closer to our work, the authors in [100]
consider the estimation of 2dimensional histograms from random linear sketches. Even though this last method is theoretically applicable in higher dimensions, the complexity would grow exponentially with the dimension of the problem. Such a “curse of dimensionality” is also found in
[22, 11]: the size of the dictionary grows exponentially with the dimension of the data vectors, and naturally impacts the cost of the estimation procedure. In our work, we rather consider parametric dictionaries that are described by a finite number of parameters. This enables us to empirically leverage the structure of iterative algorithms from sparse reconstruction and compressive sensing to optimize with respect to these parameters, offering better scalability to higher dimensions. This is reminiscent of generalized moments methods for the reconstruction of measures supported on a finite subset of the real line [34], and can be applied to much more general families of probability measures.The particular sketching operator that we propose to apply on GMMs (see Section 3 and further) is obtained by randomly sampling the (empirical) characteristic function of the distribution of the training data. This can be seen as a combination between two techniques from the Reproducing Kernel Hilbert Space (RKHS) literature, namely embedding of probability distributions in RKHS using a feature map referred to as the “Mean Map” [16, 94, 98], and Random Fourier Features (RFFs) for approximating translationinvariant reproducing kernels [82].
Embedding of probability distributions in RKHS with the Mean Map has been successfully used for a large variety of tasks, such as twosample test [57], classification [74] or even performing algebraic operations on distributions [92]. In [96], the estimation of a mixture model with respect to the metric of the RKHS is considered with a greedy algorithm. The proposed algorithm is however designed to approximate the target distribution by a large mixture with many components, resulting in an approximation error that decreases as the number of components increases, while our approach considers (1) as a “sparse” combination of a fixed, limited number of components which we aim at identifying. Furthermore, unlike our method that uses RFFs to obtain an efficient algorithm, the algorithm proposed in [96] does not seem to be directly implementable.
Many kernel methods can be performed efficiently using finitedimensional, nonlinear embeddings that approximate the feature map of the RKHS [82, 103]. A popular method approximates translationinvariant kernels with RFFs [82]. There has been since many variants of RFFs that are faster [69, 108, 35], more precise [106], or designed for a different type of kernel [103]. Similar to our sketching operator, structures combining RFFs with Mean Map embedding of probability distributions have been recently used by the kernel community [15, 99, 77] to accelerate methods such as classification with the socalled Support Measure Machine [74, 99, 77] or twosample test [110, 36, 65, 78].
Our point of view, i.e. that of generalized compressive sensing, is sensibly different: we consider the sketch as a compressed representation of the probability distribution, and demonstrate that it contains enough information to robustly recover the distribution from it, resulting in an effective “compressive learning” alternative to usual mixture estimation algorithms.
1.3 Contributions and outline
Our main contributions can be summarized as follows:

Inspired by Orthogonal Matching Pursuit (OMP) and its variant OMP with Replacement (OMPR), we design in Section 2 an algorithmic framework for general compressive mixture estimation.

In the specific context of GMMs, we design in Section 3.2 an algorithm that scales better with the number of mixture components.

Inspired by random Fourier sampling for compressive sensing, we consider sketching operators defined in terms of random sampling of the characteristic function [19, 20]. However we show that the sampling pattern of [19, 20] is not adapted in high dimension. In Section 3.3, in the specific case of GMMs we propose a new heuristic and devise a practical scheme for randomly drawing the frequencies that define . This is empirically demonstrated to yield significantly improved performance in Section 5.

Extensive tests on synthetic data in Section 5 demonstrate that our approach matches the estimation precision of a stateoftheart C++ implementation of the EM algorithm while enabling significant savings in time and memory.

In the context of hypothesis testingbased speaker verification, we also report in Section 6 results on real data, where we exploit a corpus of 1000 hours of speech at scales inaccessible to traditional methods, and match using a very limited number of measurements the results obtained with EM.

We provide preliminary theoretical results (Theorem 3 in Section 7) on the information preservation guarantees of the sketching operator. The proofs of these results (Appendices B and C) introduce a new variant of the Restricted Isometry Property (RIP) [28], connected here to kernel mean embedding and Random Features. Compared to usual guarantees in the GeMM literature, our results have less of a “statistics” flavor and more of a “signal processing” one, such as robustness to modeling error, i.e.the true distribution of the data is not exactly a GMM but close to one.
2 A Compressive Mixture Estimation Framework
In classical compressive sensing [52], a signal is encoded with a measurement matrix into a compressed representation :
(6) 
and the goal is to recover from those linear measurements. Often the system is underdetermined () and recovery can only be done with additional assumptions, typically sparsity. The vector is said to be sparse if it has only a limited number of nonzero coefficients. Its support is the set of indices of nonzero entries: . The notation (resp. ) denotes the restriction of matrix (resp. vector ) to columns (resp. entries) with indices in .
A sparse vector can be seen as a combination of few basis elements: , where is the canonical basis of . The measurement vector is thus expressed as a combination of few atoms, corresponding to the columns of the measurement matrix: . The set of all atoms is referred to as a dictionary .
2.1 Mixture model and generalized compressive sensing
Let = be a space of signed finite measures over a measurable space , and the set of probability distributions over , . In our framework, a distribution is encoded with a linear measurement operator (or sketching operator) :
(7) 
As in classical compressive sensing, we define a “sparse” model in . As mentioned in the introduction, it is here assimilated to a mixture model (1), generated by combining elements from some given set of basic distributions indexed by a parameter . For some finite , a distribution is thus said to be Ksparse (in ) if it is a convex combination of elements from :
(8) 
with , for all ’s, and . We name support of the representation^{2}^{2}2Note that this representation might not be unique. of such a sparse distribution the set of parameters .
The measurement vector of a sparse distribution is a combination of atoms selected from the dictionary indexed by the parameter . Table 1 summarizes the notations used in the context of compressive mixture estimation and their correspondence with more classical notions from compressive sensing.
Usual compressive sensing  Compressive mixture estimation  

Signal  
Model  sparse vectors  sparse mixtures 
Measurement operator  
Support  
Dictionary of atoms 
2.2 Principle for reconstruction: moment matching
As mentioned in Section 1, usual reconstruction algorithms (also known as decoders [37, 17]) in generalized compressive sensing are designed with the purpose of minimizing the measurement error while enforcing sparsity [13], as formulated in equation (5). Here it also corresponds to traditional parametric optimization in the Generalized Method of Moments (GeMM) [60]:
(9) 
where is the empirical sketch. This problem is usually highly nonconvex and does not allow for an efficient direct optimization, nevertheless we show in Section 7 that in some cases it yields a decoder robust to modeling errors and empirical estimation errors, with high probability.
Convex relaxations of (9) based on sparsitypromoting penalization terms [22, 11, 34, 80] can be envisioned in certain settings, however their direct adaptation to general uncountable dictionaries of atoms (e.g., with GMMs) seems difficult. The main alternative is greedy algorithms. Using an algorithm inspired by Iterative Hard Thresholding (IHT) [14], Bourrier et al. [19]
estimate mixtures of isotropic Gaussian distributions with fixed variance, using a sketch formed by sampling the empirical characteristic function. As will be shown in Section
5.6, this IHTlike algorithm often yields an unsatisfying local minimum of (9) when the variance is estimated. Instead, we propose a greedy approach similar to Orthogonal Matching Pursuit (OMP) [73, 79] and its extension OMP with Replacement (OMPR) [64].Another intuitive solution would be to discretize the space of the parameter to obtain a finite dictionary of atoms and apply the classic convex relaxation or greedy methods mentioned above. However, one quickly encounters the wellknown curse of dimensionality: for a grid with precision and a parameter of dimension , the size of the dictionary is as , which is intractable even for moderate . Initial experiments for learning GMMs in dimension with diagonal covariance (i.e.the dimension of the parameter is ) show that this approach is extremely long and has a very limited precision. Instead, in the next section we propose an adaptation of OMPR directly in the continuous domain.
2.3 Inspiration: OMPR for classical compressive sensing
Matching Pursuit [73] and Orthonormal Matching Pursuit [79] deal with general sparse approximation problems. They gradually extend the sparse support by selecting atoms most correlated with the residual signal, until the desired sparsity is attained.
An efficient variation of OMP called OMP with Replacement (OMPR) [64] exhibits better reconstruction guarantees. Inspired by IHT[14], and similar to CoSAMP or Subspace Pursuit [52], it increases the number of iterations of OMP and extends the size of the support further than the desired sparsity before reducing it with Hard Thresholding to suppress spurious atoms.
2.4 Proposed algorithms: Compressive Learning OMP/OMPR
Adapting OMPR to the considered compressive mixture estimation framework requires several modifications. We detail them below, and summarize them in Algorithm LABEL:algo:CLOMP.
algocf[htbp]
Several aspects of this framework must be highlighted:

Nonnegativity. Unlike classical compressive sensing, the compressive mixture estimation framework imposes a nonnegativity constraint on the weights , that we enforce at each iteration. Thus Step 1 is modified compared to classical OMPR by replacing the modulus of the correlation by its real part, to avoid negative correlation between atom and residual. Similarly, in Step 4 we perform a NonNegative LeastSquares (NNLS) [68] instead of a classical LeastSquares.

Normalization. Note that we do not enforce normalization at each iteration. Instead, a normalization of is performed at the end of the algorithm to obtain a valid distribution. Enforcing the normalization constraint at each iteration was found on initial experiments to have a negligible effect while increasing computation time.

Continuous dictionary. The set of elementary distributions is often continuously indexed (as with GMMs, see Section 3.2) and cannot be exhaustively searched. Instead we propose to replace the maximization in Step 1 of classical OMPR with a randomly initialized gradient descent, denoted by a call to a subroutine , leading to a – local – maximum of the correlation between atom and residual. Note that the atoms are normalized during the search, as is often the case with OMP.

Global optimization step to handle coherent dictionaries. Compared to classical OMPR, the proposed algorithm includes a new step at each iteration (Step 5), which further reduces the cost function with a gradient descent initialized with the current parameters . This is denoted by a call to the subroutine . The need for this additional step stems from the lack of incoherence between the elements of the uncountable dictionary. For instance, in the case of GMM estimation, a GMM approximation of a distribution cannot be directly derived from a GMM by simply adding a Gaussian. This is reminiscent of a similar problem handled in High Resolution Matching Pursuit (HRMP) [59], which uses a multiscale decomposition of atoms, while we handle here the more general case of a continuous dictionary using a global gradient descent that adjusts all atoms. Experiments show that this step is the most timeconsuming of the algorithm, but that it is necessary.
Similar to classical OMPR, Algorithm LABEL:algo:CLOMP yields two algorithms depending on the number of iterations:

Compressive Learning OMP (CLOMP) if run without Hard Thresholding (i.e. with );

CLOMPR (with Replacement) if run with iterations.
Learning the number of components ?
In the proposed framework, the number of components is known in advance and provided by the user. However, it is known that greedy approaches such as OMP are convenient to derive stopping conditions, that could be readily applied to CLOMP: when the residual falls below a fixed (or adaptive) threshold, stop the algorithm (additional strategies would be required for CLOMPR however). In this paper however, we only compare the proposed method with classical approaches such as EM for Gaussian Mixture Models, that consider the number of components known in advance. We leave for future work the implementation of a stopping condition for CLOMP(R) and comparison with existing methods for model selection.
2.5 Complexity of CLOMP(R).
Just as OMP, which complexity scales quadratically with the sparsity parameter , proposed greedy approaches CLOMP or CLOMPR have a computational cost of the order of , where is the number of iterations, resulting in a quadratic cost with respect to the number of components .
This is potentially a limiting factor for the estimation of mixtures of many basic distributions (large ). In classical sparse approximation, approximate least squares approaches such as Gradient Pursuit [13] or LocOMP [71] have been developed to overcome this computational bottleneck. One could probably get inspiration from these approaches to further scale up compressive mixture estimation, however in the context of GMMs we propose in Section 3.2 to rather leverage ideas from existing fast ExpectationMaximization (EM) algorithms that are specific to GMMs.
2.6 Sketching by randomly sampling the characteristic function
Let us now assume . The reader will notice that in classical compressive sensing, the compressed object is a vector , while in this context, a training collection of vectors is considered as an empirical version of some probability distribution which is the compressed object.
The proposed algorithms CLOMP(R) are suitable for any sketching operator and any mixture model of parametric densities , as long as the optimization schemes in Steps 1 and 5 can be performed. In the case of a continuous dictionary the optimization steps can be performed with simple gradient descents implicitly represented by calls to the subroutines and , provided and its gradient with respect to have a closedform expression.
In many important applications such as medical imaging (MRI and tomography), astrophysics or geophysical exploration, one wishes to reconstruct a signal from incomplete samples of its discrete Fourier transform. Random Fourier sampling was therefore one of the first problems to give rise to the classical notions of compressive sensing
[28, 27, 29]. Indeed, a random uniform selection of rows of the full Fourier matrix, i.e. a random selection of frequencies, forms a partial Fourier matrix that satisfies a certain Restricted Isometry Property (RIP) with overwhelming probability [29], and is therefore appropriate for the encoding and recovery of sparse signals. For more details, we refer the reader to [52, 29, 28, 27, 6] and references therein. Inspired by this methodology, we form the sketch by sampling the characteristic function (i.e. the Fourier transform) of the probability distribution .The characteristic function of a finite measure is defined as:
(10) 
For a sparse distribution (in some given set of basic distributions ), we also denote for simplicity.
The characteristic function of a probability distribution
is a wellknown object with many desirable properties. It completely defines any distribution with no ambiguity and often has a closedform expression (even for distributions which may not have a probability density function with closedform expression,
e.g., for stable distributions [91]), which makes it a suitable choice to build a sketch used with CLOMP(R). It has been used as an estimation tool at an early stage [50] as well as in more recent developments on GeMM [32].The proposed sketching operator is defined as a sampling of the characteristic function. Given frequencies in , we define generalized moment functions:
(11) 
and the sketching operator (2) is therefore expressed as
(12) 
Given a training collection in , we denote the empirical characteristic function^{3}^{3}3Other more robust estimators can be envisioned such as the empirical median. The empirical average allows more easy streaming or distributed computing.. The empirical sketch is
(13) 
To fully specify the sketching operator (12), one needs to choose the frequencies that define it. In the spirit of Random Fourier Sampling, we propose to define a probability distribution to draw . Choosing this distribution is a problem of its own that will be discussed in details in Section 3.3.
Connections with Random Neural Networks.
It is possible to draw connections between the proposed sketching operation and neural networks. Denote
and . To derive the sketch, one needs to compute the matrix , take the complex exponential of each individual entry where is the pointwise application of the function and finally pool the columns of to obtain the empirical sketch . This procedure indeed shares similarities with a layer neural network: the output of such a network can be expressed as , where is the input signal, is a pointwise nonlinear function and is some weighting matrix. Therefore, in the proposed framework, such a layer network is applied to many inputs , then the empirical average of the outputs is taken to obtain a representation of the underlying distribution of the .Neural networks with weights chosen at random rather than learned on training data – as is done with the frequencies – have been studied in the socalled Random Kitchen Sinks [83] or in the context of Deep Neural Networks (DNN) with Gaussian weights [56]. In the latter, they have been shown to perform a stable embedding of the input when it lives on a lowdimensional manifold. Similar to [56], we show in Section 7 that with high probability the sketching procedure is a stable embedding of the probability distribution of when this distribution belongs to, e.g., a compact manifold.
2.7 Complexity of sketching
The main computational load of the sketching operation is the computation of the matrix , which theoretically scales in . Large multiplications by random matrices are indeed a wellknown computational bottleneck in Compressive Sensing, and some methods circumvent this issue by using approximated fast transforms [45, 70]. Closer to our work (see Section 4), fast transforms have also been used in the context of Random Fourier Features and kernel methods [69, 108]. We leave the study of a possible adaptation of these acceleration methods for future work, and focus on simple practical remarks about the computation of the sketch.
GPU computing.
Distributed/online computing.
The computation of the sketch can also be performed in a distributed manner. One can divide the database in subsets containing items respectively, after which individual computing units can compute the sketches of each subset in parallel, using the same frequencies. Those sketches are then easily merged^{4}^{4}4A similar strategy can also be used on a single machine when the matrix is too large to be stored as . The cost of computing the sketch is thus divided by the number of units . Similarly, this simple observation allows the sketch to be computed in an online fashion.
3 Sketching Gaussian Mixture Models
In this section, we instantiate the proposed framework in the context of Gaussian Mixture Models (GMMs). A more scalable algorithm specific to GMMs is first introduced as a possible alternative to CLOMP(R). We then focus on the design of the sketching operator , i.e. on the design of the frequency distribution (see Section 2.6).
3.1 Gaussian Mixture Models
In the GMM framework, the basic distributions are Gaussian distributions with density functions denoted :
(14) 
where represents the parameters of the Gaussian with mean and covariance . A Gaussian is said to be isotropic, or spherical, if its covariance matrix is proportional to the identity: .
Densities of GMMs are denoted . A GMM is then naturally parametrized by weight vector and parameters .
Compressive density estimation with mixtures of isotropic Gaussians with fixed, known variance was considered in [19]. In this work, we consider mixtures of Gaussians with diagonal covariances, which is known to be sufficient for many applications [86, 111] and is the default framework in wellknown toolboxes such as VLFeat [102]. We denote . Depending on the context, we equivalently denote where is the diagonal of the covariance of the th Gaussian.
The characteristic function of a Gaussian has a closedform expression:
(15) 
from which we can easily derive the expression of the gradients necessary to perform the optimization schemes , in CLOMP(R), with the sketching operator introduced in Section 2.6.
3.2 A faster algorithm specific to GMM estimation
To handle mixtures of many Gaussians (large ), the fast hierarchical EM used in [90] alternates between binary splits of each Gaussian along its first principal component (in the case of Gaussians with diagonal covariance, this is the dimension with the highest variance ) and a few EM iterations.
Our compressive adaptation is summarized in Algorithm LABEL:algo:BSCGMM. The binary split is performed by calling the function Split, and the EM steps are replaced by Step 5 of Algorithm LABEL:algo:CLOMP to adjust all Gaussians. In the case where the targeted sparsity level is not a power of , we split the GMM until the support reaches a size , then reduce it with a Hard Thresholding (Step 3 of Algorithm LABEL:algo:CLOMP), similar to CLOMPR.
Since the number of iterations in Algorithm LABEL:algo:BSCGMM is , the computational cost of this algorithm scales in , which is much faster for large than the quadratic cost of CLOMPR.
In practice, Algorithm LABEL:algo:BSCGMM is seen to sometimes yield worse results than CLOMPR (see Section 5.6), but be welladapted to other tasks such as, e.g., GMM estimation for largescale speaker verification (see Section 6).
algocf[htbp]
algocf[htbp]
3.3 Designing the frequency sampling pattern
A key ingredient in designing the sketching operator is the choice of a probability distribution to draw the frequency sampling pattern as defined in Section 2.6. We show in Section 4 that this corresponds to the design of a translationinvariant kernel in the data domain. Interestingly, working in the Fourier domain seems to make the heuristic design strategy more direct. Literature on designing kernels in this context usually focus on maximizing the distance between the sketch of two distributions [97, 77], which cannot be readily applied in our context since we sketch only a single distribution. However, as we will see, the proposed approach follows the general idea of maximizing the capacity of the sketch to distinguish this distribution from others, which amounts to maximizing the variations of the sketch with respect to the parameters of the GMM at the selected frequencies.
3.3.1 Oracle frequency sampling pattern for a single known Gaussian
We start by designing a heuristic for choosing frequencies adapted to the estimation of a single Gaussian , assuming the parameters are available – which is obviously not the case in practice. We will deal in due time with mixtures, and with unknown parameters.
Gaussian frequency distribution.
Recall the expression (15) of the characteristic function of the Gaussian . It is an oscillating function with Gaussian amplitude of inverted variance with respect to the original Gaussian. Given that , choosing a Gaussian frequency distribution is a possible intuitive choice [19, 20] to sample the characteristic function . It concentrates frequencies in the regions where the sampled characteristic function has high amplitude.
However, points drawn from a highdimensional Gaussian concentrate on an ellipsoid which moves away from the origin as the dimension increases. Such a Gaussian sampling therefore “undersamples” low or even middle frequencies. This phenomenon has long been one of the reasons for using dimensionality reduction for GMM estimation [44]. Hence, in high dimension the amplitude of the characteristic function becomes negligible (with high probability) at all selected frequencies.
Folded Gaussian radial frequency distribution.
In light of the problem observed with the Gaussian frequency distribution, we propose to draw frequencies from a distribution allowing for an accurate control of the quantity , and thus of the amplitude of the characteristic function. This is achieved by drawing
(16) 
where
is uniformly distributed on the
unit sphere , and is a radius chosen independently from with a distribution we will now specify.With the decomposition (16), the characteristic function is now expressed as
where is the characteristic function of a onedimensional Gaussian with mean and variance . We thus consider the estimation of a onedimensional Gaussian , with , as our baseline to design a radius distribution .
In this setting, we no longer suffer from unwanted concentration phenomena and can resort to the intuitive Gaussian radius distribution to sample . It corresponds to a radius density function (i.e. Gaussian with absolute value, referred to as folded Gaussian). Using this radius distribution with the decomposition (16) yields a frequency distribution referred to as Folded Gaussian radius frequency distribution. Note that, similar to the Gaussian frequency distribution, the Folded Gaussian radius distribution only depends on the (oracle) covariance of the sketched distribution .
Adapted radius distribution
Though we will see it yields decent results in Section 5, the Folded Gaussian radius frequency distribution somehow produces too many frequencies with a low radius . These carry a limited quantity of information about the original distribution, since all characteristic functions equal at the origin^{5}^{5}5In a way, numerous measures of the characteristic function near the origin essentially measure its derivatives at various orders, which are associated to classical polynomial moments.. We now present a heuristics that may avoid this “waste” of frequencies.
Intuitively, the chosen frequencies should properly discriminate Gaussians with different parameters, at least in the neighborhood of the true parameter . This corresponds to promoting frequencies leading to a large difference for parameters close to . A way to achieve this is to promote frequencies where the norm of the gradient is large at .
Recall that for a onedimensional Gaussian . The norm of the gradient is expressed as:
and therefore since . This expression still has a Gaussian decrease (up to polynomial factors), and indeed avoids very low frequencies. It can be normalized to a density function:
(17) 
with some normalization constant. Using this radius distribution with the decomposition (16) yields a distribution referred to as Adapted radius frequency distribution. Once again, this distribution only depends on the covariance .
3.3.2 Oracle frequency sampling pattern for a known mixture of Gaussians
Any frequency distribution selected for sampling the characteristic function of a single known Gaussian can be immediately and naturally extended to a frequency distribution to sample the characteristic function of a known GMM , by mixing the frequency distributions corresponding to each Gaussian:
(18) 
Each component has the same weight than its corresponding Gaussian . Indeed, a Gaussian with a high weight must be precisely estimated, as its influence on the overall reconstruction error (e.g.
in terms of KullbackLeibler divergence) is more important than the components with low weights. Thus more frequencies adapted to this Gaussian are selected.
The draw of frequencies with an oracle distribution is summarized in Function DrawFreq.
3.3.3 Choosing the frequency sampling pattern in practice
In practice the parameters of the GMM to be estimated are obviously unknown beforehand, so the oracle frequency distributions introduced in the previous section cannot be computed. We propose a simple method to obtain an approximate distribution that yields good results in practice. The reader must also keep in mind that it is very easy to integrate some prior knowledge in this design, especially since the proposed frequency distributions only take into account the variances of the GMM components, not their means.
The idea is to estimate the average variance of the components in the GMM – note that this parameter may be significantly different from the global variance of the data, for instance in the case of wellseparated components with small variances. This estimation is performed using a light sketch with frequencies, computed on a small subset of items from the database, then a frequency distribution corresponding to a single isotropic Gaussian is selected.
Indeed, if the variances ’s are not too different from each other, the amplitude of the empirical characteristic function approximately follows with high oscillations, allowing for a very simple amplitude estimation process: assuming the frequencies used to compute the sketch are ordered by increasing Euclidean radius, the sketch is divided into consecutive blocks, maximal peaks of its modulus are identified within each block forming a curve that approximately follows , then a simple regression is used to estimate . This process is illustrated in Figure 2.
To cope with the fact that the "range" of frequencies that must be considered to compute is also not known beforehand, we initialize and reiterate this procedure several times, each time drawing frequencies adapted to the current estimation of , i.e. with some choice of frequency distribution , and update at each iteration. In practice, the procedure quickly converges in three iterations.
The entire process is summarized in detail in Function EstimMeanSigma and Algorithm LABEL:algo:drawfreqpractice.
algocf[htbp]
algocf[htbp]
algocf[htbp]
3.4 Summary
At this point, all procedures necessary for compressive GMM estimation have been defined. Given a database , a number of measurements and a number of components , the entire process is as follow:

In the absence of prior knowledge, draw frequencies using Algorithm LABEL:algo:drawfreqpractice on (a fraction of) the dataset. The proposed Adapted radius frequency distribution is recommended, other parameters have default values (see Section 5.3), the effect of modifying them has been found to be negligible.

Compute the empirical sketch . One may use GPU and/or distributed/online computing.

Estimate a GMM using CLOMP(R) or the more scalable but less precise Algorithm LABEL:algo:BSCGMM.
Connections with Distilled sensing.
The reader may note that designing a measurement operator adapted to some particular data does not fit the classical paradigm of compressive sensing.
The twostage approaches used to choose the frequency distribution presented above can be related to a line of work referred to as adaptive (or distilled) sensing [61], in which a portion of the computational budget is used to crudely design the measurement operator while the rest is used to actually measure the signal. Most often these methods are extended to multistage approaches, where the measurement operator is refined at each iteration, and have been used in machine learning [38] or signal processing [7]. Allocating the resources and choosing between exploration (designing the measurement operator) and precision
(actually measuring the signal) is a classic tradeoff in areas such as reinforcement learning or game theory.
4 Kernel design and sketching
It turns out that sketching operators as (12) are intimately related to RKHS embedding of probability distributions [94, 98] and RFFs [82]. The proposed, carefullydesigned choice of frequency distribution appears as an innovative method to design a reproducing kernel, which is faster and provides better results than traditional choices in the kernel literature [99], as experimentally shown in Section 5.4. Additionally, approximate RKHS embedding of distributions turns out to be an appropriate framework to derive the information preservation guarantees of Section 7.
4.1 Reproducing Kernel Hilbert Spaces (RKHS)
We refer the reader to Appendix A for definitions related to positive definite (p.d.) kernels and measures.
Let be a p.d. kernel. By MooreAronszajn Theorem [4], to this kernel is associated a unique Hilbert space that satisfies the following properties: for any the function belongs to , and the kernel satisfies the reproducing property . The space is referred to as the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel . We denote the scalar product of the RKHS . We refer the reader to [62] and references therein for a review of RKHS and kernel methods.
We focus here on the space and on translationinvariant kernels of the form
(19) 
where is a positive definite function. Translationinvariant positive definite kernels are characterized by the Bochner theorem:
Theorem 1 (Bochner [88], Thm. 1.4.3).
A continuous function is positive definite if and only if it is the Fourier transform of a finite (i.e., ) nonnegative measure on , that is:
(20) 
This expression implies the normalization . Hence, without loss of generality, up to a scaling factor, we suppose . This means in particular that is a probability distribution on .
4.2 Hilbert Space Embedding of probability distributions
Embeddings of probability distributions [94, 98] are obtained by defining the Mean Map :
(21) 
Using the Mean Map we can define the following p.d. kernel over the set of probability distributions . Note that this expression is also equivalent to the following definition [98, 74]: .
We denote the pseudometric induced by the Mean Map on the set of probability distributions, often referred to as Maximum Mean Discrepancy (MMD) [99]. Fukumizu et al. introduced in [54] the concept of characteristic kernel, that is, a kernel for which the map is injective and the pseudometric is a true metric. Translationinvariant characteristic kernels are characterized by the following Theorem [98].
Theorem 2 (Sriperumbudur et al. [98]).
Assume that where is a positive definite function on . Then is characteristic if and only if , where is defined as (20).
Many classical translationinvariant kernels (Gaussian, Laplacian….) indeed exhibit a Fourier transform with a support that is equal to , and are therefore characteristic. This is also the case for all kernels corresponding to the proposed frequency distributions that we defined in Section 3.3.
4.3 From Kernel design to the design of a frequency sampling pattern
In the case of a translationinvariant kernel, the metric can be expressed as where by abuse of notation, for a given frequency distribution , we introduce
(22) 
where we recall that is the characteristic function of . The proof of Theorem 2 is actually based on this reformulation [98].
Hence, given frequencies and the corresponding sketching operator (12), we can expect that for large enough , with high probability,
(23) 
Building on this relation between the metric and the sketching operators considered in this paper, we next derive some connections with kernel design. We further exploit these connections to draw a theoretical analysis of the information preservation guarantees of the sketching operator in Section 7.
Traditional kernel design.
Given some learning task, the selection of an appropriate kernel –known as kernel design– is a difficult, open problem, usually done by crossvalidation. Even when using RFFs to leverage the computational gain of explicit embeddings (compared to potentially costly implicit kernels), the considered frequency distribution is usually derived from a translationinvariant kernel chosen in a parametric family endowed with closedform expressions for both and [82, 99]. A typical example is the Gaussian kernel (chosen for the simplicity of its Fourier transform), which bandwidth is often selected by crossvalidation [97, 99].
Spectral kernel design: designing the frequency sampling pattern.
Learning an appropriate kernel by directly deriving a spectral distribution of frequencies for Random Fourier Features has been an increasingly popular idea in the last few years. In the field of usual reproducing kernels on finitedimensional objects, researchers have explored the possibility of modifying the matrix of frequencies to obtain a better approximation quality [106] or to accelerate the computation of the kernel [69]. Both ideas have been exploited for learning an appropriate frequency distribution, often modeled as a mixture of Gaussians [105, 108, 76] or by optimizing weights over a combination of many distributions [93].
In the context of using the Mean Map with Random Features, learning a kernel has been often explored for the twosample test problem, mainly based on the idea of maximizing the discriminative power of the MMD [97]. Similar to our approach, such methods often divide the database in two parts to learn the kernel then perform the hypothesis test [58, 65], or are done in a streaming context [78]. Variants of the Random Fourier Features have also been used [36].
Compared to these methods, the approach proposed in Section 3.3 is relatively simple, fast to perform, and based on an approximate theoretical expression of the MMD for clustered data instead of a thorough statistical analysis of the problem. In the spirit of our sketching framework, it only requires a tiny portion of the database and is performed in a completely unsupervised manner, which is quite different from most existing literature. Furthermore, in this paper we only consider traditional Random Fourier Features for translationinvariant kernels.
Using more exotic kernels and/or adapting more advanced learning methods to our framework is left for future work. Still, in Section 5.4, we empirically show that the estimation of (Function ) is much faster than estimation by crossvalidation, and that the Adapted radius distribution performs better than a traditional Gaussian kernel with optimized bandwidth.
5 Experiments with synthetic data
To validate the proposed framework, the algorithms CLOMP(R) (Algorithm LABEL:algo:CLOMP) are first extensively tested against synthetic problems for which the true parameters of the GMM are known. Experiments on real data will be conducted in Section 6, with the additional analysis of Algorithm LABEL:algo:BSCGMM. The full Matlab code is available at [67].
5.1 Generating the data
When dealing with synthetic data for various settings, it is particularly difficult to generate problems that are “equally difficult” when varying the dimension and the level of sparsity . We opted for a simple heuristic to draw Gaussians neither too close nor to far from one another, given their variances.
Since an dimensional ball with radius has volume (with a constant depending on ), we consider that any Gaussian which is approximately isotropic with variance (i.e., with ) “occupies” a volume where is a reference volume for .
Variances are randomly drawn uniformly from to , so that . The means are chosen so that they lie in a ball sufficiently large to accommodate for a volume . We therefore choose with that verifies , which yields i.e., by considering the expected value of . In practice this choice seems to offer problems that are neither too elementary nor too difficult.
5.2 Evaluation measure
To evaluate reconstruction performance when the true distribution of the data is known, one can resort to the classic KullbackLeibler divergence [42]. A symmetric version of the KLdivergence is more comfortable to work with: , where is the true distribution and is the estimated one (we still refer to this measure as “KLdivergence”). In our framework, and are GMMs with density functions denoted and , hence as in [19] to estimate the KLdivergence in practice we draw with and compute
(24) 
5.3 Basic Setup
The basic setup is the same for all following experiments, given data and parameters . We suppose the data to be approximately centered (see Section 5.1).
First, unless specified otherwise (e.g., in Section 5.4 ), we draw frequencies according to an Adapted radius distribution , using Algorithm LABEL:algo:drawfreqpractice with parameters , , , . The empirical sketch is then computed.
The compressive algorithms are performed with their respective number of iterations. For Step 1 of CLOMP(R), the gradient descent is initialized with a centered isotropic Gaussian with random variance^{6}^{6}6After trying many variants of initialization strategy, we came to the conclusion that the algorithm is very robust to initialization. This is one of the possible choices. . Furthermore, during all optimization steps in CLOMP(R) or Algorithm LABEL:algo:BSCGMM, we constrain all variances to be larger than a small for numerical stability. All continuous optimization schemes in CLOMP(R) are performed with Stephen Becker’s adaptation of the LBFGSB algorithm [23] in C, with Matlab wrappers [8].
We compare our results with an adaptation of the previous IHT algorithm [19, 20] for isotropic Gaussians with fixed variances, in which all optimization steps have been straightforwardly adapted for our nonisotropic framework with relaxed variances. The IHT is performed with iterations, which is the default value in the original paper.
We use the VLFeat toolbox [102] to perform the EM algorithm with diagonal variances. The algorithm is repeated 10 times with random initializations and the result yielding the best loglikelihood is kept. Each run is limited to 100 iterations.
All following reconstruction results are computed by taking the geometric mean over 50 experiments (
i.e. the regular mean in logarithmic scale).5.4 Results: role of the choice of frequency distribution
(G)  (FGr)  (Ar)  
In this section we compare different choices of frequency distributions. We draw items in two settings, respectively low and high dimensional: and . In each setting we construct the sketch with frequencies (see Section 5.5). Reconstruction is performed with CLOMPR.
In Table 2, we compare the three frequency distributions introduced in Section 3, both with the oracle frequency distribution (i.e. using Function with the true parameters of the GMM) – we remind the reader that this setting unrealistically assumes that the variances and weights of the GMM are known beforehand –, and with the approximate one (Algorithm LABEL:algo:drawfreqpractice). The results show that the Gaussian frequency distribution indeed yields poor reconstruction results in high dimension (), while the Adapted radius frequency distribution outperforms the two others. The use of the approximate instead of the oracle is shown to have little effect.
Reconstruction result  Est. time  
Adapted radius (proposed)  ()  
Gaussian kernel [82, 99]  ()  
Adapted radius (proposed)  ()  
Gaussian kernel [82, 99]  () 
In Table 3 we compare the proposed Adapted radius frequency distribution with a frequency distribution corresponding to a Gaussian kernel [82, 99], where the bandwidth is selected among values exponentially spread from to to yield the best reconstruction results in each setting. It is seen that the Adapted radius distribution outperforms the Gaussian kernel in each case, and the estimation of the parameter is significantly faster than the tedious learning of the bandwidth .
We conduct an experiment to determine how robust is the isotropic distribution of frequencies when treating strongly nonisotropic data. We first generate a GMM in dimension with components, with the process described in Section 5.3 but with identity covariances. Then the first five dimensions of the parameters are divided by a factor , meaning that: for and , we do and . It simulates data which do not have the same range in each dimension (Fig. 3, top). In Fig. 3 (bottom), for increasing values of we compare CLOMPR with the learned frequency distribution , CLOMPR using the ideal frequency distribution , and EM. It is indeed seen that, in terms of KLdivergence (left), the results produced by CLOMPR with the learned frequency distribution deteriorates as the nonisotropy of the GMM increases. On the contrary, CLOMPR with the oracle frequency distribution and EM do not seem to be affected. Hence in the first case the problem lies with the learned frequency distribution and not the CLOMPR algorithm itself. To further confirm this fact, we examine the MMD with the corresponding frequency distribution in each case. It is seen that CLOMPR with the learned frequency distribution performs as well as the oracle one, meaning that the MMD defined with an isotropic choice of frequencies is indeed not adapted for strongly nonisotropic problems, and that despite the ability of CLOMPR to approximately minimize the cost function (9) (which approximates the MMD, see (23)), it does not produce good results.
In that particular case, the problem could be potentially resolved by a whitening of the data, e.g. by computing the empirical covariance of the fraction of the database used for the frequency distribution design phase, and multiplying each datapoint by its inverse during the sketching phase. However there are of course cases where the proposed methods would be further challenged, for instance if components are flat in a dimension but far apart: the global covariance of the data along that dimension would be large, even if the variance of each Gaussian components is small (these cases are however rarely encountered in practice). As mentioned before (see Sec. 4.3), another solution would be to use more advanced methods for designing the MMD than the proposed simple one. Overall, we leave treatment of strongly nonisotropic data for future work.
Nevertheless, from now on, all experiments are performed with an approximate Adapted radius distribution .
5.5 Results: number of measurements
We now evaluate the quality of the reconstruction with respect to the number of frequencies, for varying dimension and number of components in the GMM (Figure 4).
It is seen that the KLdivergence exhibits a sharp phasetransition with respect to
, whose occurrence seems to be proportional to the “dimension” of the problem, i.e. the number of free parameters . This phenomenon is akin to usual compressive sensing. In light of this observation, the number of frequencies in all following experiments is chosen as , beyond the empirically obtained phase transition.5.6 Results: comparison of the algorithms
We compare the compressive algorithms and EM in terms of reconstruction, computation time and memory usage, with respect to the number of samples .
Precision of the reconstruction
In Figure 5, KLdivergence reconstruction results are shown for EM and all compressed algorithms: IHT [19], CLOMP (Algorithm LABEL:algo:CLOMP with ), CLOMPR (Algorithm LABEL:algo:CLOMP with ) and Algorithm LABEL:algo:BSCGMM. We consider two settings, one with low (left) and one with high (right), in dimension . The number of measurements is set at in each case.
With few Gaussians (), all compressive algorithms yield similar results, close to those achieved by EM. The precision of the reconstruction is seen to improve steadily with the size of the database. With more Gaussians (), CLOMPR clearly outperforms the other compressive algorithms, and even outperforms EM for very large .
The IHT algorithm [19] adapted to nonisotropic variances often fails to recover a satisfying GMM. Indeed, IHT fills the support of the GMM at the very first iteration, and is seen to converge toward a local minimum of (9), in which all Gaussians in the GMM are equal to the same large Gaussian that encompasses all data. Note that this situation could not happen in [19], where all Gaussians have fixed, known variance.
Computation time and memory
In Figure 6, computation time and memory usage of the compressive methods and EM are presented with respect to the database size , using an Intel Core i74600U 2.1 GHz CPU with of RAM. In terms of time complexity (resp. memory usage), the EM algorithm scales in for a fixed number of iterations (resp. ). The CLOMP or CLOMPR algorithms scale in (resp. ), while Algorithm LABEL:algo:BSCGMM scales in (resp. the same ).
At large the EM algorithm indeed becomes substantially slower than all compressive methods (Fig 6, left). We also keep in mind that we compare a MATLAB implementation of the compressive methods with a stateoftheart C++ implementation of EM[102]. Similarly, at large the compressive algorithms outperform EM by several orders of magnitude in terms of memory usage (Fig 6, right).
6 Largescale proof of concept: speaker verification
Gaussians Mixture Models are popular for their capacity to smoothly approximate any distribution [87] by a large number of Gaussians. This is often the case with real data, and the problem of fitting a large GMM to data drawn from some distribution is somewhat different from that of clustering data and identifying reasonably well separated components, as presented in the previous section. In order to try out compressive methods on this challenging task, we test them on a speaker verification problem, with a classical approach requiring GMM referred to as Universal Background Model (GMMUBM) [86].
6.1 Overview of Speaker Verification
Given a fragment of speech and a candidate speaker, the goal is to assess if the fragment was indeed spoken by that person.
We quickly describe GMMUBM in this section. For more details we refer the reader to the original paper [86]. Similar to many speech processing tasks, this approach uses Mel Frequency Cepstrum Coefficients (MFCC) and their derivatives (MFCC) as features
. Those features have been often modeled with GMMs or more advanced Markov models. However, in our framework we do
not use MFCC; indeed those coefficients typically have a negligible range in dynamic compared to the MFCC, which results in a difficult and unstable choice of frequencies. As mentioned before, this problem may be potentially solved by a prewhitening of the data, which we leave for future work. In this configuration, the speaker verification results will indeed not be stateoftheart, but our goal is mainly to test our compressive approach on a different type of problem than that of clustering synthetic data, for which we have already observed excellent results.In the GMMUBM model, each speaker is represented by one GMM . The key point is the introduction of a model that represents a “generic” speaker, referred to as Universal Background Model (UBM). Given speech data and a candidate speaker , the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model:
Comments
There are no comments yet.