Compressive Learning for Semi-Parametric Models

by Michael P. Sheehan, et al.

In compressive learning theory, instead of solving a statistical learning problem from the input data, a so-called sketch is computed from the data prior to learning. The sketch has to capture enough information to solve the problem directly from it, allowing the dataset to be discarded from memory. This is useful when dealing with large datasets, as the size of the sketch does not scale with the size of the database. In this paper, we reformulate the original compressive learning framework to explicitly cater for the class of semi-parametric models. The reformulation takes into account the inherent topology and structure of semi-parametric models, creating an intuitive pathway to the development of compressive learning algorithms. We apply our developed framework to both the semi-parametric models of independent component analysis and subspace clustering, demonstrating the robustness of the framework and explicitly showing when a compression in complexity can be achieved.




1 Introduction

In the current era, it is common practice to have available very large datasets with millions of individual entries and hundreds of features. This poses a big challenge for large scale machine and statistical learning, as computational and memory demands scale poorly with the dimensions of the dataset. The compressive learning framework [3] was developed to tackle this issue and alleviate some of the complexity constraints. The principle of the framework is based on finding a compact representation, a so-called sketch, of the data prior to learning, such that enough information is preserved to minimise a form of risk associated to the learning problem. In general, the sketch does not scale with the size of the dataset but is driven by the complexity of the problem, making it amenable to large scale learning. The framework has been successfully applied to various parametric models, including Gaussian mixture models and $K$-means clustering [5][6], where the authors exploit explicit structural assumptions, residing in the probability space, to recover a risk from the sketch.

Semi-parametric models form an interesting class of models which are used extensively in the fields of machine learning, statistics and signal processing. One calls a statistical learning problem semi-parametric when the following two conditions are met: there are no parametric constraints on the data distribution, and the learning problem can be entirely solved using a statistic of the data distribution. For instance, one of the oldest semi-parametric models is principal component analysis (PCA), which can be solved by taking the eigenvalue decomposition of the covariance matrix of the data. The covariance matrix is an identifiable statistic sufficient to solve the PCA problem. The distinction between parametric and semi-parametric models is that we do not have access to a parametrized probability space for semi-parametric models, due to their inherent topology and structure. Consequently, the original compressive learning framework does not naturally cater for semi-parametric models, nor does it provide a pathway to design compressive learning algorithms.
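To make the notion of an identifiable statistic concrete, the following minimal sketch (our illustration with NumPy and synthetic data, not code from the paper) solves PCA entirely from the empirical covariance matrix, without revisiting individual samples once the statistic is formed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data concentrated near a k-dimensional subspace of R^D.
D, k, n = 5, 2, 10_000
basis = np.linalg.qr(rng.standard_normal((D, k)))[0]   # orthonormal basis of the subspace
X = rng.standard_normal((n, k)) @ basis.T + 0.01 * rng.standard_normal((n, D))

# The identifiable statistic: the empirical covariance matrix of the data.
Sigma = np.cov(X, rowvar=False)

# PCA is solved entirely from the statistic, via its eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(Sigma)
components = eigvecs[:, -k:]                           # top-k principal directions

# Cosines of the principal angles between recovered and true subspaces (all ~1).
overlap = np.linalg.svd(components.T @ basis, compute_uv=False)
print(overlap)
```

Any distribution with the same covariance matrix would yield the same principal components, which is exactly the many-to-one behaviour discussed below.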

In this paper, we reformulate the original framework and apply it to semi-parametric models, leading to insights on creating compressive learning algorithms. Our main contribution is to recast the compressive learning framework to explicitly sketch the models identifiable statistics, exploiting structural assumptions in the statistic space, and show how this reformulation paves the way to creating compressive learning algorithms for both subspace clustering (SC) and independent component analysis (ICA).

2 Background

2.1 Compressive Learning


Let $x_1, \dots, x_n$ be independent and identically distributed samples from an unknown probability distribution $\pi$ on $(\mathcal{X}, \mathcal{B})$, where $\mathcal{X} \subseteq \mathbb{R}^D$ is some Euclidean space and $\mathcal{B}$ some Borel $\sigma$-field. Classically, $\pi$ is parametrized by some parameters denoted by $\theta$. A statistical learning problem can be formalised as follows: find a hypothesis $h$ from a hypothesis class $\mathcal{H}$ that best matches the probability distribution $\pi$ over the training collection $X = \{x_1, \dots, x_n\}$. Given a loss function $\ell : \mathcal{X} \times \mathcal{H} \to \mathbb{R}$, this is equivalent to minimising the risk defined as

$$\mathcal{R}(h, \pi) = \mathbb{E}_{x \sim \pi}\, \ell(x, h). \tag{1}$$

Moreover, we define the model set associated to the hypothesis class as:

$$\mathfrak{S}_{\mathcal{H}} = \{ \pi : \exists\, h \in \mathcal{H}, \ \mathcal{R}(h, \pi) = 0 \}. \tag{2}$$

In other words, this is the set containing all distributions that are perfectly modeled by the hypothesis class $\mathcal{H}$. In practice, we generally do not have access to the true distribution $\pi$, so we instead minimise the empirical risk. As a consequence, this means we have to store all the data in memory.

In compressive learning, we find a compact representation, or so-called sketch, that encodes some statistical properties of the data. Its size is ideally chosen relative to the intrinsic complexity of the problem, making it possible to work with arbitrarily large datasets while storing in memory an object of fixed size. Given a feature function $\Phi : \mathcal{X} \to \mathbb{R}^m$, such that $\Phi$ is integrable with respect to any $\pi \in \mathfrak{S}_{\mathcal{H}}$, define a linear operator $\mathcal{A}$ by

$$\mathcal{A}(\pi) = \mathbb{E}_{x \sim \pi}[\Phi(x)]. \tag{3}$$

We define our sketch in (3) to be the expectation of some features of the data distribution $\pi$. We want to choose $\Phi$ so that the sketch captures relevant statistical information of our data, so that we are able to solve our learning problem from these observations directly. The goal of compressive learning is therefore to find a small $m$ that captures enough information to retrieve an estimated risk which is close to the true risk with high probability [3]. In practice, we use the empirical distribution $\hat{\pi}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$ and form an empirical sketch defined as

$$\hat{z} = \mathcal{A}(\hat{\pi}_n) = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i), \tag{4}$$

denoting by $\delta_{x}$ the Dirac distribution on $x$. Due to the law of large numbers, $\hat{z} \to \mathcal{A}(\pi)$ as $n \to \infty$, and the empirical sketch can be formed directly from our data.
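A minimal numerical illustration of the empirical sketch, assuming random Fourier features as the feature map $\Phi$ (a common choice in the compressive learning literature; any integrable feature function would do):

```python
import numpy as np

rng = np.random.default_rng(1)

D, m, n = 2, 64, 100_000                 # data dimension, sketch size, dataset size
Omega = rng.standard_normal((m, D))      # random frequencies defining Phi

def sketch(X, Omega):
    """Empirical sketch: the average of random Fourier features exp(i w^T x)."""
    return np.exp(1j * X @ Omega.T).mean(axis=0)

X = rng.standard_normal((n, D))          # i.i.d. samples from pi = N(0, I)
z_hat = sketch(X, Omega)

# For a standard Gaussian, the sketch of pi is the characteristic function
# exp(-||w||^2 / 2), so the empirical sketch concentrates around it as n grows.
z_true = np.exp(-0.5 * (Omega ** 2).sum(axis=1))
print(z_hat.shape, np.abs(z_hat - z_true).max())
```

The sketch occupies $m = 64$ numbers regardless of $n$: once it is computed, the $100{,}000$ samples can be discarded.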

Once the sketch has been computed, one can discard the dataset from memory, reducing the memory complexity of the learning task. One can design a decoder that exploits the structural assumptions of the model set to recover a risk from the sketch. Consequently, we can find the best hypothesis by minimising the risk. The sketching operator and the decoder form the pair that define the compressive learning algorithm for a specific learning problem. A schematic diagram summarises the compressive learning framework in figure 1.


Figure 1: A schematic diagram of the compressive learning framework.

Gribonval and Keriven pioneered the method of compressive learning [3] and successfully applied their framework to parametric models including Gaussian mixture models (GMM) and $K$-means clustering. In particular, they showed that the sketching algorithm reduces the memory complexity of GMM estimation, removing the dependency on the number of data points $n$ [5]. We say that $\mathfrak{S}$ is a parametric model if it is a subset of the collection of all probability measures on $(\mathcal{X}, \mathcal{B})$ which is fully described by a map $\theta \mapsto \pi_{\theta}$ with $\theta$ ranging over a parameter space $\Theta$. In general, the parameter space is finite dimensional and the map is smooth for parametric models. In fact, parametric models have the inherent property that the map $\theta \mapsto \pi_{\theta}$ is a bijection, which means that each $\theta$ corresponds to exactly one distribution. As we will see, this is not the case for semi-parametric models.

2.2 Semi-Parametric Models

Semi-parametric models contain model sets that are a substantially large, if not infinite, subset of the collection $\mathcal{P}$ of all probability measures on $(\mathcal{X}, \mathcal{B})$. These models are described by a parameter $\theta$, together with a function $s$ mapping distributions to statistics, such that the model is specified by the set $\mathfrak{S} \subseteq \mathcal{P}$ and the parametrization given by [1]:

$$\theta = s(\pi), \qquad \pi \in \mathfrak{S}.$$

We define a semi-parametric model by the pair

$$(\mathfrak{S}, s).$$

In general, the map $s$ is not bijective, and therefore one statistic corresponds to many distributions. This is the clear distinction between parametric and semi-parametric models. In most cases the function $s$ is not known, or is not sufficiently smooth, to be expressed explicitly in a concise parametrized way such that inference can be done. In numerous instances, we can use some statistics of the data which enable one to solve the semi-parametric task. As discussed, we can use the covariance matrix as a statistic to solve the PCA problem. Throughout this discussion we will term such statistics, which are used to solve the semi-parametric problem, identifiable statistics.

3 Related Works

Recall that the covariance matrix acts as an identifiable statistic for the PCA problem, i.e. the principal components can be found through the eigenspectrum of the covariance matrix. Given $x_1, \dots, x_n \in \mathbb{R}^D$, sampled from a probability distribution $\pi$ with covariance matrix $\Sigma$, we find the $k$-dimensional subspace that best matches the data. This defines a hypothesis class for the PCA problem and a corresponding model set that is defined by the distributions that produce a covariance matrix of rank $k$:

$$\mathfrak{S}_k = \{ \pi : \operatorname{rank}(\Sigma_{\pi}) \leq k \}.$$

Gribonval et al. in [3] show that one can take a sketch of the covariance matrix, $\mathcal{A}(\Sigma)$, and decode the sketch to return an estimated risk by using a matrix completion algorithm that exploits the low-rankness of the covariance matrix. By doing so, one can reduce the memory complexity of the PCA task to the order of the degrees of freedom of a rank-$k$ matrix. Sketched PCA does not succinctly fit into the compressive learning framework highlighted in figure 1. Firstly, notice that the sketched method, discussed above, encodes and decodes a statistic, which does not define a single probability distribution but in fact an infinite number of distributions having the same covariance matrix. Secondly, we are not directly using structural assumptions on the model set to make the decoding step possible. Instead, we exploit structural assumptions from the intermediary set of identifiable statistics. In the next section we develop the framework to address these issues.
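The decoding idea can be illustrated with a toy stand-in (our sketch, not the authors' algorithm): a rank-$k$ covariance matrix is recovered from $m < D^2$ random linear measurements by alternating projections between the measurement-consistent affine set and the rank-$k$ matrices.

```python
import numpy as np

rng = np.random.default_rng(2)

D, k = 12, 1                        # ambient dimension, covariance rank
u = rng.standard_normal((D, k))
Sigma = u @ u.T                     # ground-truth rank-k covariance (D^2 = 144 entries)

m = 80                              # number of linear measurements, below D^2 = 144
Q = np.linalg.qr(rng.standard_normal((D * D, D * D)))[0]
A = Q[:, :m].T                      # m orthonormal measurement vectors (A A^T = I)
z = A @ Sigma.ravel()               # the sketch: m linear observations of Sigma

def project_rank(M, k):
    """Project a symmetric matrix onto the rank-k matrices via its eigendecomposition."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    top = np.argsort(np.abs(w))[-k:]
    return (V[:, top] * w[top]) @ V[:, top].T

# Alternating projections: onto the measurement-consistent affine set, then rank-k.
X = np.zeros((D, D))
for _ in range(1000):
    X = X - (A.T @ (A @ X.ravel() - z)).reshape(D, D)   # exact affine projection
    X = project_rank(X, k)

rel_err = np.linalg.norm(X - Sigma) / np.linalg.norm(Sigma)
print(rel_err)
```

The low-rank structure is what makes $m < D^2$ measurements sufficient; without it, the affine projection alone would leave $D^2 - m$ degrees of freedom undetermined.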

4 Compressive Semi-Parametric Learning

In section 3, we showed that compressive PCA does not succinctly fit the compressive learning framework nor provide any intuition on how to create a compressive PCA algorithm. The covariance matrix corresponds to infinitely many data distributions and therefore it is almost impossible to decode a single distribution $\pi$. Indeed, this is the case for all semi-parametric models. With an abuse of notation, let $s(\pi)$ denote the identifiable statistic associated to an arbitrary semi-parametric model. An equivalence exists between distributions in the model set $\mathfrak{S}$ and the set of identifiable statistics $\mathcal{S}$. Formally, let $\sim$ be the equivalence relation defined by

$$\pi_1 \sim \pi_2 \iff s(\pi_1) = s(\pi_2).$$

As a result, there exists a many-to-one mapping that maps each equivalence class in $\mathfrak{S}$ to a single point in $\mathcal{S}$. Both the equivalence class structure and the mapping are illustrated in Figure 2.

Figure 2: A schematic diagram of the probability equivalence class where many distributions collapse down to one point in the statistic set.

Due to the equivalence class structure inherent in semi-parametric models, we lose the luxury of injectivity, found in parametric models, for the mapping $\pi \mapsto s(\pi)$. The consequence is that a single distribution cannot be decoded, and therefore the original framework does not cater explicitly for semi-parametric models. Below, we define a reformulation of the compressive learning framework to tackle such models and provide a pathway to develop compressive learning algorithms amenable to semi-parametric models.

4.1 Reformulated Framework

We reformulate the framework by assuming that we know a statistic set $\mathcal{S}$ that can be used to define the risk function. That means that instead of having one risk function per distribution as before, here we have one risk function per equivalence class. This is possible when there exists a map $g$ satisfying

$$\mathcal{R}(h, \pi) = g(h, s(\pi)) \qquad \text{for all } h \in \mathcal{H}, \ \pi \in \mathfrak{S}.$$

It turns out that the parameterization of the probability distributions is not needed anymore. Indeed, it suffices to have a parameterization of the statistic set to search for a sketch. Note that the size of the set containing the statistics may be smaller than the size of the model set, as many probability distributions have the same statistic. In accordance, we define the new sketch as

$$z = \mathcal{A}(s(\pi)),$$

where $\mathcal{A}$ is a linear operator on $\mathcal{S}$, built from a given feature function $\Phi$. As we are encoding a statistic from finite samples, the empirical sketch is defined as $\hat{z} = \mathcal{A}(\hat{s}_n)$, where $\hat{s}_n$ is the empirical statistic computed from the samples. As ever, the law of large numbers applies, such that $\hat{s}_n \to s(\pi)$ as $n \to \infty$. Once the sketch is formed, we use a decoder $\Delta$ that recovers a statistic $\tilde{s}$ such that $\tilde{s}$ and $s(\pi)$ are uniformly close. The decoder is designed specifically to exploit the structural assumptions of the set $\mathcal{S}$. Consequently, we can find the best hypothesis by minimising the risk. Assuming that $s(\pi)$ is our identifiable statistic associated with a semi-parametric model, a schematic diagram of the reformulated framework is highlighted in figure 3.


Figure 3: A schematic diagram of the new compressive semi-parametric learning framework.

Our new formulation of the compressive learning framework provides a far more intuitive and explicit pathway, enabling one to identify statistics associated with a given semi-parametric model in order to create compressive learning algorithms. Furthermore, the framework allows one to explicitly design a decoder by exposing the structural assumptions of semi-parametric models, something the original framework severely lacked. In the next section, we shall demonstrate the importance of our reformulation by applying the framework to two well-known semi-parametric models.

5 Case Studies

In this section we apply our compressive semi-parametric framework to two well-known, yet complex, semi-parametric models: independent component analysis and subspace clustering. For consistency and comparison, we will use the notation $s(\pi)$ to denote the identifiable statistic for each model.

5.1 Compressive Independent Component Analysis

We start the discussion with ICA, a semi-parametric model that decomposes data into components of maximum independence via a linear transformation. For the sake of simplicity and brevity, we assume the data has identity covariance and zero mean, and therefore the task of ICA is to find an orthogonal matrix $W$ such that $y = Wx$, where $y$ has statistically independent entries: $p(y) = \prod_{i=1}^{D} p_i(y_i)$. Each $p_i$ denotes the distribution of an independent component, and as we do not know the nature of the densities in advance, we cannot reduce their estimation to a finite parameter set. Consequently, the estimation of the $p_i$ is non-parametric, and coupled with the parametric part of estimating the orthogonal matrix $W$, results in ICA belonging to the class of semi-parametric models.

We resort to higher order statistics, specifically kurtosis, to solve the problem [4]. In general, kurtosis is a measure of independence for sources of different densities: minimising the cross-kurtosis of the entry-wise sources maximises the independence of the system [2]. In the setting described, each point-wise kurtosis cumulant, defined for zero-mean data by

$$\kappa_{ijkl}(x) = \mathbb{E}[x_i x_j x_k x_l] - \mathbb{E}[x_i x_j]\mathbb{E}[x_k x_l] - \mathbb{E}[x_i x_k]\mathbb{E}[x_j x_l] - \mathbb{E}[x_i x_l]\mathbb{E}[x_j x_k],$$

forms an entry of the $4^{\text{th}}$ order kurtosis cumulant tensor $\mathcal{K}(x) \in \mathbb{R}^{D \times D \times D \times D}$. The goal of cumulant based ICA is therefore to find an orthogonal transformation $W$ such that

$$\mathcal{K}(y) = \mathcal{K}(Wx) \ \text{is diagonal},$$

resulting in zero cross cumulants, $\kappa_{ijkl}(y) = 0$ whenever the indices are not all equal. Consequently, the sources will be independent and the cumulant tensor will be diagonal. The set of diagonal cumulant tensors can be defined as

$$\mathcal{F} = \{ \mathcal{K} \in \mathbb{R}^{D \times D \times D \times D} : \kappa_{ijkl} = 0 \ \text{unless} \ i = j = k = l \}.$$

By doing so, we can define the model set of the ICA model:

$$\mathfrak{S} = \{ \pi : \mathcal{K}_{\pi} = \mathcal{D} \times_1 W \times_2 W \times_3 W \times_4 W, \ \mathcal{D} \in \mathcal{F} \},$$

where $W$ is the parameter of interest and $\times_n$ denotes the matrix-tensor product.
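The cumulant tensor can be estimated directly from samples. A small numerical check (our illustration, using a naive $\mathcal{O}(D^4)$ estimator) confirms that the tensor is approximately diagonal for independent sources:

```python
import numpy as np
from itertools import product

def kurtosis_tensor(X):
    """Naive O(D^4) estimate of the 4th-order cumulant tensor of zero-mean data."""
    n, D = X.shape
    C = X.T @ X / n                       # second-order moments E[x_i x_j]
    K = np.zeros((D, D, D, D))
    for i, j, k, l in product(range(D), repeat=4):
        K[i, j, k, l] = (np.mean(X[:, i] * X[:, j] * X[:, k] * X[:, l])
                         - C[i, j] * C[k, l] - C[i, k] * C[j, l] - C[i, l] * C[j, k])
    return K

rng = np.random.default_rng(3)
n, D = 100_000, 3
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, D))  # independent unit-variance sources

K_hat = kurtosis_tensor(S)
diag = np.array([K_hat[i, i, i, i] for i in range(D)])
off = K_hat.copy()
for i in range(D):
    off[i, i, i, i] = 0.0

print(diag)                 # each ~ -1.2, the excess kurtosis of a uniform source
print(np.abs(off).max())    # ~ 0: cross cumulants vanish for independent sources
```

Mixing the sources with a non-trivial orthogonal $W$ would populate the off-diagonal entries, which is precisely what a cumulant-based ICA contrast penalises.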

The new formulation of compressive learning for semi-parametric models described in section 4.1 shows we must look for structural assumptions on the statistic set in order to sketch $\mathcal{K}$. In the case of ICA, the assumption that the cumulant tensor (formed from data $x$) can be diagonalised by an orthogonal transformation results in the solution living on a manifold, denoted $\mathcal{M}$, whose dimension is of order $D^2$, compared to the order $D^4$ dimension of the statistic set [7]. More precisely, it is sufficient to take $m = \mathcal{O}(D^2)$ random linear projections of $\mathcal{K}$. A compressive ICA algorithm can be defined by the encoding-decoding pair $(\mathcal{A}, \Delta)$:

$$\mathcal{A}(\mathcal{K}) = \{ \langle \mathbf{A}_j, \mathcal{K} \rangle \}_{j=1}^{m}, \qquad \Delta(\hat{z}) = \mathop{\arg\min}_{W} f(W ; \hat{z}),$$

where the $\mathbf{A}_j$ are random measurement tensors and $f$ defines any independence contrast function defined over cumulant tensors [2].
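A minimal numerical illustration of the encoder (our sketch, assuming Gaussian measurement tensors; the decoder, which searches over orthogonal transformations for maximal independence, is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)

D = 8
K = rng.standard_normal((D, D, D, D))    # stand-in for an estimated cumulant tensor

m = 3 * D * D                            # sketch size of order D^2, far below D^4
A = rng.standard_normal((m, D ** 4))     # random measurement tensors, flattened

z = A @ K.ravel()                        # the sketch: m linear projections of K

print(z.shape, D ** 4, m / D ** 4)       # (192,) 4096 0.046875
```

The compression ratio $m / D^4$ shrinks as $D$ grows, since the manifold dimension $\mathcal{O}(D^2)$ falls ever further below the ambient dimension $D^4$.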

Figure 4 shows the ratio of compression as the dimension $D$ grows. The figure illustrates clearly that the framework has enabled us to identify a statistic that lives in a set with strong structural assumptions and that can be sketched to vastly reduce the order of memory complexity.

Figure 4: Compressive ICA learning: a graph showing the compression ratio as $D$ grows.

5.2 Compressive Subspace Clustering

The subspace clustering problem consists of finding the best union of subspaces that matches the given data [9]. It can be formalised by the hypothesis class

$$\mathcal{H} = \Big\{ h = \bigcup_{k=1}^{K} S_k : S_k \ \text{a subspace of} \ \mathbb{R}^D \Big\},$$

which forms the corresponding model set

$$\mathfrak{S}_{\mathcal{H}} = \{ \pi : \operatorname{supp}(\pi) \subseteq h, \ h \in \mathcal{H} \}.$$

In the literature, we assume that the number of subspaces $K$ and the dimension of each subspace are known in advance, to make the problem well-posed. Subspace clustering can be thought of as a $K$-mixture model with data sampled from unknown probability distributions $\pi_1, \dots, \pi_K$, one per subspace. As we do not know the form of these distributions, we cannot reduce their estimation to a finite parameter set. Similar to the ICA case, estimating the $\pi_k$ is non-parametric, and coupled with the parametric form of the mixture coefficients, results in subspace clustering fitting into the semi-parametric class of models.

The subspace clustering problem can be solved through a generalised principal component analysis (GPCA) approach [8]. By assuming the data lies within a union of $K$ subspaces, we denote by $\nu_K(x)$, or $\nu(x)$ for simplicity, the vector having components equal to all the monomials of degree $K$ in the components of the data point $x$. For instance, when $K = 2$ and $D = 2$, $\nu(x) = [x_1^2, \ x_1 x_2, \ x_2^2]^{\top}$. The embedded point belongs to $\mathbb{R}^{M_K(D)}$ with

$$M_K(D) = \binom{D + K - 1}{K}.$$
For any union of $K$ subspaces, we can find polynomials, linear in the Veronese embedding, that vanish on the union of subspaces:

$$p(x) = c^{\top} \nu_K(x) = 0 \qquad \text{for all } x \in \bigcup_{k=1}^{K} S_k,$$

by computing the null space of the matrix $V = [\nu(x_1), \dots, \nu(x_n)]^{\top}$. Indeed, the null space of $V$ can be easily deduced by finding the eigendecomposition of the correlation matrix of the embedded data:

$$\hat{\Sigma}_{\nu} = \frac{1}{n} \sum_{i=1}^{n} \nu(x_i)\, \nu(x_i)^{\top}.$$
The correlation matrix of the Veronese embeddings is therefore the identifiable statistic associated to subspace clustering, and we can therefore apply the compressive semi-parametric framework to it. As expected, the framework motivates us to seek structural assumptions on the statistic set. In the situation of GPCA, the correlation matrix has rank between 1 and $M_K(D)$, depending on the geometric makeup of the subspaces. In certain cases, the rank $r$ of the correlation matrix is in fact very small, and therefore the degrees of freedom are far fewer than the dimension of the statistic set. In such situations, we know that only $m = \mathcal{O}(r\, M_K(D))$ measurements are needed for recovery, and therefore it is sufficient to take $m$ rank-one projections of $\hat{\Sigma}_{\nu}$ to enable stable recovery. A compressive GPCA algorithm can be defined by the encoding-decoding pair $(\mathcal{A}, \Delta)$:

$$\mathcal{A}(\hat{\Sigma}_{\nu}) = \{ a_j^{\top} \hat{\Sigma}_{\nu}\, a_j \}_{j=1}^{m}, \qquad \Delta(\hat{z}) = \mathop{\arg\min}_{\operatorname{rank}(\tilde{s}) \leq r} \| \mathcal{A}(\tilde{s}) - \hat{z} \|_2,$$

where the $a_j$ are random measurement vectors.
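An illustrative encoder for this scheme (our sketch, assuming Gaussian measurement vectors $a_j$; the rank-constrained least squares decoder is omitted):

```python
import numpy as np

rng = np.random.default_rng(6)

M, r = 30, 1                            # statistic dimension M_K(D), correlation rank
b = rng.standard_normal((M, r))
Sigma_v = b @ b.T                       # stand-in low-rank correlation matrix

m = 4 * r * M                           # O(r M_K(D)) measurements vs M^2 = 900 entries
A = rng.standard_normal((m, M))         # measurement vectors a_j as rows

z = np.einsum('ij,jk,ik->i', A, Sigma_v, A)   # rank-one projections a_j^T Sigma a_j

print(z.shape, m / M ** 2)              # (120,) 0.13333...
```

Note that each measurement is non-negative here, since $a_j^{\top} \Sigma a_j = \| b^{\top} a_j \|^2$ for a positive semi-definite statistic.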
Figure 5 shows a phase transition for the ratio of memory compression with respect to the dataset size as the dimension $D$ and the number of subspaces $K$ grow, when the correlation matrix is of low rank $r$. The green region shows when compression occurs, while the red region shows when compression is not possible in comparison to storing the whole dataset in memory. The reformulated compressive learning framework illustrates that compression is only possible for modest dimensions.

Figure 5: Compressive GPCA learning: the compression ratio of the sketch size to the data length as the model dimensions $D$ and $K$ grow. The data length is fixed.

6 Conclusion

Compressive learning for parametric models achieves successful compression in memory complexity, as the sketch is commensurate with the model dimensions. In this paper, our case studies have shown that this is not always the case for semi-parametric models, as the identifiable statistic can scale well (ICA) or poorly (SC) with the underlying model dimensions. Importantly, our developed framework allows the user to identify exactly when memory compression is possible given an identifiable statistic, something the existing framework lacked. An interesting research direction arising from this work is: given an identifiable statistic associated with a semi-parametric model, is it of minimal dimensionality?


  • [1] P. J. Bickel and C. A. Klaassen (1998) Efficient and adaptive estimation for semiparametric models. Vol. 4, Springer. Cited by: §2.2.
  • [2] P. Comon (1994) Independent component analysis, a new concept?. Signal processing 36 (3), pp. 287–314. Cited by: §5.1, §5.1.
  • [3] R. Gribonval, G. Blanchard, N. Keriven, and Y. Traonmilin (2017) Compressive statistical learning with random feature moments. arXiv preprint arXiv:1706.07180. Cited by: §1, §2.1, §2.1, §3.
  • [4] A. Hyvärinen and E. Oja (2000) Independent component analysis: algorithms and applications. Neural networks 13 (4-5), pp. 411–430. Cited by: §5.1.
  • [5] N. Keriven, A. Bourrier, R. Gribonval, and P. Pérez (2017) Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA 7 (3), pp. 447–508. Cited by: §1, §2.1.
  • [6] N. Keriven, N. Tremblay, Y. Traonmilin, and R. Gribonval (2017) Compressive k-means. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373. Cited by: §1.
  • [7] M. P. Sheehan, M. S. Kotzagiannidis, and M. E. Davies (2019) Compressive independent component analysis. In European Signal Processing Conference (EUSIPCO), Spain. Cited by: §5.1.
  • [8] R. Vidal, Y. Ma, and S. Sastry (2005) Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12), pp. 1945–1959. Cited by: §5.2.
  • [9] R. Vidal (2011) Subspace clustering. IEEE Signal Processing Magazine 28 (2), pp. 52–68. Cited by: §5.2.