1 Introduction
In the current era, it is common practice to have available very large datasets with millions of individual entries and hundreds of features. This poses a big challenge for large-scale machine and statistical learning, because computational and memory demands scale poorly with the dimensions of the dataset. The compressive learning framework [3]
was developed to tackle this issue and alleviate some of the complexity constraints. The principle of the framework is to find a compact representation, a so-called sketch, of the data prior to learning, such that enough information is preserved to minimise a form of risk associated with the learning problem. In general, the sketch does not scale with the size of the dataset but is driven by the complexity of the problem, making it amenable to large-scale learning. The framework has been successfully applied to various parametric models, including Gaussian mixture models and k-means clustering [5][6], where the authors exploit explicit structural assumptions, residing in the probability space, to recover a risk from the sketch.
Semiparametric models form an interesting class of models which are used extensively in the fields of machine learning, statistics and signal processing. One calls a statistical learning problem semiparametric when the following two conditions are met: there are no parametric constraints on the data distribution, and the learning problem can be entirely solved thanks to a statistic of the data distribution. For instance, one of the oldest semiparametric models is principal component analysis (PCA), which can be solved by taking the eigenvalue decomposition of the covariance matrix of the data. The covariance matrix is an identifiable statistic sufficient to solve the PCA problem. The distinction between parametric and semiparametric models is that we do not have access to a parametrized probability space for semiparametric models, due to their inherent topology and structure. Consequently, the original compressive learning framework does not naturally cater for semiparametric models, nor does it provide a pathway to design compressive learning algorithms.
In this paper, we reformulate the original framework and apply it to semiparametric models, leading to insights on creating compressive learning algorithms. Our main contribution is to recast the compressive learning framework to explicitly sketch the model's identifiable statistics, exploiting structural assumptions in the statistic space, and to show how this reformulation paves the way to creating compressive learning algorithms for both subspace clustering (SC) and independent component analysis (ICA).
2 Background
2.1 Compressive Learning
Let $x_1, \dots, x_n$ be independent and identically distributed samples from an unknown probability distribution $P$ on $(\mathcal{X}, \mathcal{B})$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is some Euclidean space and $\mathcal{B}$ some Borel field. Classically, $P$ is parametrized by some parameters denoted by $\theta$. A statistical learning problem can be formalised as follows: find a hypothesis $h$ from a hypothesis class $\mathcal{H}$ that best matches the probability distribution $P$ over the training collection. Given a loss function $\ell : \mathcal{X} \times \mathcal{H} \to \mathbb{R}$, this is equivalent to minimising the risk defined as

$$\mathcal{R}(h, P) = \mathbb{E}_{x \sim P}\,[\ell(x, h)]. \qquad (1)$$
Moreover, we define the model set associated to the hypothesis class $\mathcal{H}$ as

$$\mathfrak{S}_{\mathcal{H}} = \{ P_h : h \in \mathcal{H} \}, \qquad (2)$$

in other words, the set containing all distributions that are perfectly modelled by some hypothesis $h \in \mathcal{H}$. In practice, we generally do not have access to the true distribution $P$, so we instead minimise the empirical risk. As a consequence, this means we have to store all the data in memory.
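As a simple illustration of the cost of empirical risk minimisation, the snippet below evaluates an empirical risk for a hypothetical k-means-style loss, where the hypothesis is a set of centroids; the loss, data and centroids are all illustrative choices, not tied to any example in the text:

```python
import numpy as np

def empirical_risk(X, h, loss):
    # Monte Carlo estimate of the true risk: average the pointwise
    # loss over the n stored samples.
    return np.mean([loss(x, h) for x in X])

# Hypothetical example: the hypothesis h is a set of centroids and the
# loss is the squared distance from a point to its nearest centroid.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                  # n samples from an unknown P
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
loss = lambda x, C: np.min(np.sum((C - x) ** 2, axis=1))
risk = empirical_risk(X, centroids, loss)
```

Evaluating the empirical risk touches every one of the $n$ samples, so all of them must be kept in memory; this is exactly the cost that sketching removes.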
In compressive learning, we find a compact representation, or so-called sketch, that encodes some statistical properties of the data. Its size is ideally chosen relative to the intrinsic complexity of the problem, making it possible to work with arbitrarily large datasets while storing in memory an object of fixed size. Given a feature function $\Phi : \mathcal{X} \to \mathbb{C}^m$, such that $\Phi$ is integrable with respect to any $P \in \mathfrak{S}_{\mathcal{H}}$, define a linear operator $\mathcal{A}$ by

$$\mathcal{A}(P) = \mathbb{E}_{x \sim P}\,[\Phi(x)]. \qquad (3)$$
We define our sketch in (3) to be the expectation of some features of the data distribution $P$. We want to choose $\Phi$ so that the sketch captures the relevant statistical information of our data, enabling us to solve the learning problem from these observations directly. The goal of compressive learning is therefore to find a small sketch size $m$ that captures enough information to retrieve an estimated risk which is uniformly close to the true risk with high probability [3]. In practice, we use the empirical distribution $\hat{P}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$, denoting by $\delta_x$ the Dirac distribution on $x$, and form an empirical sketch defined as

$$\hat{z} = \mathcal{A}(\hat{P}_n) = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i). \qquad (4)$$

By the law of large numbers, $\hat{z} \to \mathcal{A}(P)$, and the empirical sketch can be formed directly from our data. Once the sketch has been computed, one can discard the dataset from memory, reducing the memory complexity of the learning task. One can design a decoder $\Delta$ that exploits the structural assumptions of the model set to recover a risk from the sketch. Consequently, we can find the best hypothesis by minimising the risk. The sketching operator and the decoder form the pair $(\mathcal{A}, \Delta)$ that defines the compressive learning algorithm for a specific learning problem. A schematic diagram summarising the compressive learning framework is given in figure 1.
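As a minimal sketch of the empirical sketch in (4), assuming random Fourier features as the feature function $\Phi$ (one common choice in the compressive learning literature; the frequencies and the sketch size here are arbitrary):

```python
import numpy as np

def empirical_sketch(X, Omega):
    # Empirical sketch: the average of the features Phi(x) = exp(i w^T x)
    # over the samples, i.e. the empirical characteristic function of the
    # data evaluated at the m frequency vectors in Omega.
    return np.mean(np.exp(1j * X @ Omega), axis=0)

rng = np.random.default_rng(1)
d, m, n = 2, 64, 10_000
Omega = rng.normal(size=(d, m))     # m random frequency vectors
X = rng.normal(size=(n, d))         # the dataset; can be discarded afterwards
z_hat = empirical_sketch(X, Omega)  # fixed-size summary, independent of n
```

The sketch `z_hat` has $m$ entries regardless of $n$, and by the law of large numbers it converges to the true sketch $\mathcal{A}(P)$, here the characteristic function of $P$ at the chosen frequencies.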
Gribonval and Keriven pioneered the method of compressive learning [3] and successfully applied their framework to parametric models including Gaussian mixture models (GMM) and k-means clustering. In particular, they showed that the sketching algorithm reduces the memory complexity of GMM learning to the size of the sketch [5], removing the dependency on the number of data points $n$. We say that $\mathfrak{S}$ is a parametric model if it is a subset of the collection $\mathcal{P}$ of all probability measures on $(\mathcal{X}, \mathcal{B})$ which is fully described by a map $\theta \mapsto P_\theta$ with $\theta$ ranging over a parameter space $\Theta$. In general, the parameter space is finite-dimensional and the map is smooth for parametric models. In fact, parametric models have the inherent property that the map $\theta \mapsto P_\theta$ is a bijection, which means that each $\theta$ corresponds to exactly one distribution. As we will see, this is not the case for semiparametric models.
2.2 Semiparametric Models
Semiparametric models are substantially large, if not infinite-dimensional, subsets of $\mathcal{P}$ on $(\mathcal{X}, \mathcal{B})$. These models are described by a finite-dimensional parameter $\theta \in \Theta$, together with a function $\eta \in \mathcal{H}$, such that the model is specified by the set $\Theta \times \mathcal{H}$ and the parametrization given by [1]:

$$(\theta, \eta) \mapsto P_{\theta, \eta}. \qquad (5)$$

We define a semiparametric model by

$$\mathfrak{S} = \{ P_{\theta, \eta} : \theta \in \Theta,\ \eta \in \mathcal{H} \}. \qquad (6)$$

In general, the map from distributions to statistics is not injective, and therefore one statistic corresponds to many distributions. This is the clear distinction between parametric and semiparametric models. In most cases the function $\eta$ is not known, or is not sufficiently smooth to be explicitly expressed in a concise parametrized way such that inference can be done. In numerous instances, we can instead use some statistics of the data which enable one to solve the semiparametric task. As discussed, we can use the covariance matrix as a statistic to solve the PCA problem. Throughout this discussion we will term such statistics, which are used to solve the semiparametric problem, identifiable statistics.
3 Related Works
Recall that the covariance matrix acts as an identifiable statistic for the PCA problem, i.e. the principal components can be found through the eigenspectrum of the covariance matrix. Given samples $x_1, \dots, x_n \in \mathbb{R}^d$ sampled from a probability distribution $P$ with covariance matrix $\Sigma$, we find the $k$-dimensional subspace that best matches the data. This defines a hypothesis class for the PCA problem and a corresponding model set that is defined by the distributions that produce a covariance matrix of rank at most $k$:

$$\mathfrak{S} = \{ P \in \mathcal{P} : \operatorname{rank}(\Sigma_P) \le k \}.$$
Gribonval et al. in [3] show that one can take a sketch of the covariance matrix, $z = \mathcal{A}(\Sigma)$, and decode the sketch to return an estimated risk by using a matrix completion algorithm that exploits the rank-$k$ structure of the covariance matrix. By doing so, one can reduce the memory complexity of the PCA task so that it no longer scales with the number of data points $n$. Sketched PCA does not, however, fit succinctly into the compressive learning framework highlighted in figure 1. Firstly, notice that the sketched method discussed above encodes and decodes a statistic, which does not define a single probability distribution but in fact an infinite number of distributions having the same covariance matrix. Secondly, we are not directly using structural assumptions on the model set to make the decoding step possible. Instead, we exploit structural assumptions from the intermediary set of identifiable statistics. In the next section we develop the framework to address these issues.
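As a toy numerical illustration of this pipeline (not the algorithm of [3]): the snippet below sketches a rank-$k$ empirical covariance with random Gaussian measurements and decodes it with a simple iterative-hard-thresholding recovery; the sketch size `m`, the step-size rule and the measurement ensemble are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 24, 2, 5000

# Data lying in a k-dimensional subspace, so the empirical covariance
# (the identifiable statistic) has rank k.
U = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = rng.normal(size=(n, k)) @ U.T
Sigma = X.T @ X / n

# Encoder: m random Gaussian linear measurements of Sigma, with m
# (an arbitrary choice) well below the d*d entries of the statistic.
m = 6 * k * d
A = rng.normal(size=(m, d * d)) / np.sqrt(m)
z = A @ Sigma.ravel()

# Decoder: iterative hard thresholding -- a line-search gradient step
# on the least-squares fit, then projection onto rank-k matrices.
S = np.zeros((d, d))
for _ in range(500):
    g = (A.T @ (A @ S.ravel() - z)).reshape(d, d)
    g = (g + g.T) / 2                        # keep iterates symmetric
    Ag = A @ g.ravel()
    denom = (Ag ** 2).sum()
    step = (g ** 2).sum() / denom if denom > 0 else 0.0
    S = S - step * g
    w, V = np.linalg.eigh(S)                 # rank-k projection
    top = np.argsort(np.abs(w))[-k:]
    S = (V[:, top] * w[top]) @ V[:, top].T

rel_err = np.linalg.norm(S - Sigma) / np.linalg.norm(Sigma)
```

The decoded `S` is recovered from $m = 6kd$ numbers rather than the $nd$ needed to store the data, and its top-$k$ eigenvectors solve the PCA problem.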
4 Compressive Semiparametric Learning
In section 3, we showed that compressive PCA does not fit succinctly into the compressive learning framework, nor does the framework provide any intuition on how to create a compressive PCA algorithm. The covariance matrix corresponds to infinitely many data distributions and therefore it is almost impossible to decode a single distribution $P$. Indeed, this is the case for all semiparametric models. With an abuse of notation, let $s(P)$ denote the identifiable statistic associated to an arbitrary semiparametric model. An equivalence exists between distributions in the model set $\mathfrak{S}$ and the set of identifiable statistics $\mathcal{S}$. Formally, let $\sim$ be the equivalence relation defined by

$$P \sim Q \iff s(P) = s(Q). \qquad (7)$$
As a result, there exists a many-to-one mapping that maps each equivalence class in $\mathfrak{S}$ to a single point in $\mathcal{S}$. Both the equivalence class structure and the mapping are illustrated in Figure 2.
Due to the equivalence class structure inherent in semiparametric models, we lose the luxury of injectivity found in parametric models for the mapping $P \mapsto s(P)$. The consequence is that a single distribution cannot be decoded, and therefore the original framework does not cater explicitly for semiparametric models. Below, we define a reformulation of the compressive learning framework to tackle such models and provide a pathway to develop compressive learning algorithms amenable to semiparametric models.
4.1 Reformulated Framework
We reformulate the framework by assuming that we know a statistic set $\mathcal{S}$ that can be used to define the risk function. This means that instead of having one risk function per distribution as before, here we have one risk function per equivalence class. This is possible when there exists a map $g$ satisfying

$$\mathcal{R}(h, P) = g(h, s(P)) \quad \text{for all } P \in \mathfrak{S}. \qquad (8)$$
It turns out that the parameterization of the probability distributions is not needed any more. Indeed, it suffices to have a parameterization of the statistic set to search for a sketch. Note that the size of the set containing the statistics may be smaller than the size of the model set, as many probability distributions share the same statistic. Accordingly, we define the new sketch as

$$z = \mathcal{A}(s(P)), \qquad (9)$$

where $\mathcal{A}$ is a linear operator on the statistic set $\mathcal{S}$, built from a given feature function $\Phi$. As we are encoding a statistic from finite samples, the empirical sketch is defined as $\hat{z} = \mathcal{A}(\hat{s}_n)$, where $\hat{s}_n$ is the empirical statistic computed from the samples. As ever, the law of large numbers applies, such that $\hat{s}_n \to s(P)$. Once the sketch is formed, we use a decoder $\Delta$ that recovers a statistic $\tilde{s}$ such that $\tilde{s}$ and $s(P)$ are uniformly close. The decoder is designed specifically to exploit the structural assumptions of the set $\mathcal{S}$. Consequently, we can find the best hypothesis by minimising the risk. Assuming that $s$ is the identifiable statistic associated with a semiparametric model, a schematic diagram of the reformulated framework is given in figure 3.
Our new formulation of the compressive learning framework provides a far more intuitive and explicit pathway, enabling one to identify statistics associated with a given semiparametric model in order to create compressive learning algorithms. Furthermore, the framework allows one to explicitly design a decoder by exposing the structural assumptions of semiparametric models, an aspect in which the original framework was severely lacking. In the next section, we demonstrate the importance of our reformulation by applying the framework to two well-known semiparametric models.
5 Case Studies
In this section we apply our compressive semiparametric framework to two well-known, yet complex, semiparametric models: independent component analysis and subspace clustering. To be consistent and for comparison, we will use the notation $s$ to denote the identifiable statistic for each model.
5.1 Compressive Independent Component Analysis
We start the discussion with ICA, a semiparametric model that decomposes data into hyperplanes of maximum independence via a linear transformation. For the sake of simplicity and brevity, we assume the data has identity covariance and zero mean, and therefore the task of ICA is to find an orthogonal matrix $W$ such that $s = W x$, where $s$ has statistically independent entries $s_i$: $p(s) = \prod_{i=1}^{d} p_i(s_i)$. Each $p_i$ denotes the distribution of an independent component, and as we do not know the nature of these densities in advance, we cannot reduce their estimation to a finite parameter set. Consequently, the estimation of the $p_i$ is nonparametric and, coupled with the parametric part of estimating the orthogonal matrix $W$, results in ICA belonging to the class of semiparametric models. We resort to higher-order statistics, specifically kurtosis, to solve the problem [4]. In general, kurtosis is a measure of independence for sources of different densities; minimising the cross-kurtosis between entrywise sources maximises the independence of the system [2]. In our setting, each pointwise kurtosis, defined (for zero-mean data) by

$$\operatorname{cum}(x_i, x_j, x_k, x_l) = \mathbb{E}[x_i x_j x_k x_l] - \mathbb{E}[x_i x_j]\mathbb{E}[x_k x_l] - \mathbb{E}[x_i x_k]\mathbb{E}[x_j x_l] - \mathbb{E}[x_i x_l]\mathbb{E}[x_j x_k], \qquad (10)$$
forms a fourth-order kurtosis cumulant tensor $\mathcal{K}(x) \in \mathbb{R}^{d \times d \times d \times d}$. The goal of cumulant-based ICA is therefore to find an orthogonal transformation $W$:

$$s = W x \quad \text{such that} \quad \operatorname{cum}(s_i, s_j, s_k, s_l) = 0 \text{ whenever } i, j, k, l \text{ are not all equal}, \qquad (11)$$

i.e. resulting in zero cross-cumulants. Consequently, the sources will be independent and the cumulant tensor will be diagonal. The set of diagonal cumulant tensors can be defined as
$$\mathcal{D} = \{ \mathcal{K} \in \mathbb{R}^{d \times d \times d \times d} : \mathcal{K}_{ijkl} = 0 \text{ unless } i = j = k = l \}. \qquad (12)$$
By doing so, we can define the model set of the ICA model:
$$\mathfrak{S} = \{ P : \mathcal{K}(P) = \mathcal{D}_s \times_1 W^\top \times_2 W^\top \times_3 W^\top \times_4 W^\top,\ \mathcal{D}_s \in \mathcal{D} \}, \qquad (13)$$

where $W$ is the parameter of interest and $\times_i$ denotes the mode-$i$ matrix-tensor product.
The new formulation of compressive learning for semiparametric models described in section 4.1 shows we must look for structural assumptions on the statistic set in order to sketch $\mathcal{K}$. In the case of ICA, the assumption that the cumulant tensor (formed from the data $x$) can be diagonalised by an orthogonal transformation means that the solution lives on a manifold, denoted $\mathcal{K}_{\mathcal{D}}$, whose dimension grows as $\mathcal{O}(d^2)$, compared to the $\mathcal{O}(d^4)$ dimension of the statistic set [7]. More precisely, it is sufficient to take a number of random linear projections of $\mathcal{K}$ proportional to the dimension of this manifold. A compressive ICA algorithm can be defined by the encoding-decoding pair $(\mathcal{A}, \Delta)$:
$$z = \mathcal{A}(\mathcal{K}), \qquad \Delta(z) = \operatorname*{arg\,min}_{W \in O(d)} f(W; z), \qquad (14)$$

where $\mathcal{A}$ is a random linear projection operator and $f$ defines any independence contrast function defined over cumulant tensors [2].
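To make the statistic concrete, the following sketch (a hypothetical construction, not the operator of [7]) estimates the fourth-order cumulant tensor of whitened data via equation (10) and compresses it with random linear projections; the sketch size `m` is an illustrative choice:

```python
import numpy as np

def cumulant_tensor(X):
    # Fourth-order cumulant tensor of zero-mean, whitened data:
    # cum_ijkl = E[x_i x_j x_k x_l] - d_ij d_kl - d_ik d_jl - d_il d_jk,
    # which is equation (10) with E[x_i x_j] = delta_ij after whitening.
    n, d = X.shape
    M4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / n
    I = np.eye(d)
    return (M4 - np.einsum('ij,kl->ijkl', I, I)
               - np.einsum('ik,jl->ijkl', I, I)
               - np.einsum('il,jk->ijkl', I, I))

rng = np.random.default_rng(3)
n, d = 20_000, 3
# Independent, zero-mean, unit-variance uniform sources: the tensor
# should be (approximately) diagonal, with negative kurtosis diagonal.
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, d))
K = cumulant_tensor(S)

# Compressive step: m random linear projections of the d^4 entries,
# with m of order d^2 (an illustrative choice) rather than d^4.
m = 6 * d * d
P = rng.normal(size=(m, d ** 4)) / np.sqrt(m)
z = P @ K.ravel()
```

For uniform sources the diagonal entries sit near the excess kurtosis $-6/5$, while the cross-cumulants vanish up to sampling noise; the sketch `z` keeps only $\mathcal{O}(d^2)$ numbers.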
Figure 4 shows the compression ratio as $d$ grows. The figure clearly illustrates that the framework has enabled us to identify a statistic living in a set with strong structural assumptions, which can be sketched to vastly reduce the order of memory complexity.
5.2 Compressive Subspace Clustering
The subspace clustering problem consists of finding the union of subspaces that best matches the given data [9]. It can be formalised by the hypothesis class

$$\mathcal{H} = \Big\{ h = \bigcup_{i=1}^{k} S_i \Big\}, \qquad (15)$$

where each $S_i$ is a low-dimensional subspace, which forms the corresponding model set

$$\mathfrak{S} = \{ P : \operatorname{supp}(P) \subseteq h,\ h \in \mathcal{H} \}. \qquad (16)$$
In the literature, the number of subspaces $k$ and the dimension of each subspace are assumed to be known in advance, to make the problem well-posed. Subspace clustering can be thought of as a mixture model with data sampled from unknown probability distributions $P_i$, one per subspace. As we do not know the form of these distributions, we cannot reduce their estimation to a finite parameter set. Similar to the ICA case, estimating the $P_i$ is nonparametric and, coupled with the parametric form of the mixture coefficients, results in subspace clustering fitting into the semiparametric class of models.
The subspace clustering problem can be solved through a generalised principal component analysis (GPCA) approach [8]. Assuming the data lies within a union of subspaces, we denote by $\nu_n(x)$, or $\nu(x)$ for simplicity, the vector whose components are all the monomials of degree $n$ in the components of the data point $x \in \mathbb{R}^D$. For instance, when $D = 2$ and $n = 2$, $\nu_2(x) = [x_1^2, x_1 x_2, x_2^2]^\top$. The embedded point belongs to $\mathbb{R}^{M_n}$ with

$$M_n = \binom{n + D - 1}{n}. \qquad (17)$$
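The Veronese map is straightforward to implement; the following sketch enumerates the degree-$n$ monomials with `itertools` (the monomial ordering is one conventional choice):

```python
from itertools import combinations_with_replacement
from math import prod

import numpy as np

def veronese(x, n):
    # All monomials of degree n in the entries of x, listed in the
    # order produced by combinations_with_replacement.
    D = len(x)
    return np.array([prod(x[i] for i in idx)
                     for idx in combinations_with_replacement(range(D), n)])

x = np.array([2.0, 3.0])
v = veronese(x, 2)   # [x1^2, x1*x2, x2^2] -> [4., 6., 9.]
```

The length of the embedding is the number of degree-$n$ monomials in $D$ variables, i.e. the binomial coefficient $\binom{n + D - 1}{n}$ of equation (17).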
For any union of subspaces, we can find polynomials $p(x) = c^\top \nu_n(x)$ that vanish on the union of subspaces:

$$p(x) = c^\top \nu_n(x) = 0 \quad \forall x \in \bigcup_{i=1}^{k} S_i, \qquad (18)$$

by computing the null space of the matrix of embedded data points $V = [\nu(x_1), \dots, \nu(x_n)]^\top$. Indeed, the null space of $V$ can easily be deduced by finding the eigendecomposition of the correlation matrix of the embedded data:

$$s = \frac{1}{n} \sum_{i=1}^{n} \nu(x_i)\, \nu(x_i)^\top. \qquad (19)$$
The correlation matrix of the Veronese embeddings is therefore the identifiable statistic associated to subspace clustering, and we can apply the compressive semiparametric framework to it. As expected, the framework motivates us to seek structural assumptions on the statistic set. In the situation of GPCA, the correlation matrix has rank $r$ between $1$ and $M_n$, depending on the geometric makeup of the subspaces. In certain cases, the rank of the correlation matrix is in fact very small and therefore the degrees of freedom are far fewer than the dimension of the statistic set. In such situations, we know that a number of measurements proportional to the degrees of freedom of a rank-$r$ matrix suffices, and it is therefore sufficient to take rank-one projections of $s$ to enable stable recovery. A compressive GPCA algorithm can be defined by the encoding-decoding pair $(\mathcal{A}, \Delta)$:

$$\mathcal{A}(s) = \big( a_i^\top s\, a_i \big)_{i=1}^{m}, \qquad \Delta(z) = \operatorname*{arg\,min}_{\operatorname{rank}(\tilde{s}) \le r} \| \mathcal{A}(\tilde{s}) - z \|_2^2. \qquad (20)$$
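To see when such a sketch actually compresses, we can compare the sketch size against the cost of storing the raw dataset. Writing $M_n$ for the dimension of the Veronese embedding, the snippet below takes the sketch size to be a small multiple of the degrees of freedom of a rank-$r$ symmetric $M_n \times M_n$ matrix; the constant `c = 3` is a hypothetical rule of thumb, not a bound from the text:

```python
from math import comb

def sketch_size(D, k, r, c=3):
    # Veronese embedding dimension for degree-k monomials in R^D.
    M_n = comb(k + D - 1, k)
    # Hypothetical sketch size: c times the degrees of freedom
    # r * (2 * M_n - r) of a rank-r symmetric M_n x M_n matrix.
    return c * r * (2 * M_n - r), M_n

N = 100_000                               # number of data points
m_small, _ = sketch_size(D=10, k=3, r=5)  # modest dimensions
m_large, _ = sketch_size(D=30, k=6, r=5)  # larger dimensions

compresses_small = m_small < N * 10       # sketch beats storing the data
compresses_large = m_large < N * 30       # M_n grows combinatorially in D, k
```

Because $M_n$ grows combinatorially with $D$ and $k$, the sketch beats storing the dataset only for modest dimensions, which is exactly the phase-transition behaviour discussed next.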
Figure 5 shows a phase transition for the ratio of memory compression with respect to the dataset size as the dimension and number of subspaces grow, when the correlation matrix is of low rank $r$. The green region shows when compression occurs, while the red region shows when compression is not possible in comparison to storing the whole dataset in memory. The reformulated compressive learning framework illustrates that compression is only possible for modest dimensions.

6 Conclusion
Compressive learning for parametric models achieves successful compression in memory complexity, as the sketch is commensurate with the model dimensions. In this paper, our case studies have shown that this is not always the case for semiparametric models, as the identifiable statistic can scale well (ICA) or poorly (SC) with the underlying model dimensions. Importantly, our developed framework allows the user to identify exactly when memory compression is possible given an identifiable statistic, something the existing framework lacked. An interesting research direction arising from this work is: “Given an identifiable statistic associated with a semiparametric model, is it of minimal dimensionality?”
References
[1] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998) Efficient and adaptive estimation for semiparametric models. Vol. 4, Springer.

[2] P. Comon (1994) Independent component analysis, a new concept?. Signal Processing 36 (3), pp. 287–314.

[3] R. Gribonval, G. Blanchard, N. Keriven, and Y. Traonmilin (2017) Compressive statistical learning with random feature moments. arXiv preprint arXiv:1706.07180.

[4] A. Hyvärinen and E. Oja (2000) Independent component analysis: algorithms and applications. Neural Networks 13 (4-5), pp. 411–430.

[5] N. Keriven, A. Bourrier, R. Gribonval, and P. Pérez (2017) Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA 7 (3), pp. 447–508.

[6] N. Keriven, N. Tremblay, Y. Traonmilin, and R. Gribonval (2017) Compressive K-means. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373.

[7] M. P. Sheehan, M. S. Kotzagiannidis, and M. E. Davies (2019) Compressive independent component analysis. In European Signal Processing Conference (EUSIPCO), Spain.

[8] R. Vidal, Y. Ma, and S. Sastry (2005) Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12), pp. 1945–1959.

[9] R. Vidal (2011) Subspace clustering. IEEE Signal Processing Magazine 28 (2), pp. 52–68.