1 Introduction
In recent years, the size of datasets has grown exponentially as a result of advances in technology, signal acquisition, and the sophistication of modern-day mobile phones and devices. This has enabled researchers, statisticians and machine learning practitioners to build increasingly accurate models as a consequence of larger sample sizes and feature dimensions. Nevertheless, this poses a fundamental challenge to large-scale learning as (i) traditional algorithms have computational complexity that scales with the order of the dataset dimensions, (ii) the whole dataset has to be stored or transferred onto local RAM as optimisation methods need to return to the data (or a random subset of the data) at subsequent iterations, and (iii) one is vulnerable to malicious attacks on potentially sensitive and personal information as the data needs to be stored or transferred locally. Compressive learning (CL)
[gribonval2021compressive, gribonval2021sketching] partially addresses these fundamental challenges by severely compressing the whole dataset into a random representation of fixed size, a so-called sketch, in a single (or limited) pass over the data prior to learning. Once the sketch is formed, the parameters of the model are inferred solely from the sketch; hence a CL algorithm, for a given task or model, never needs to return to the original dataset, which can therefore be deleted from memory. At the core of the CL framework [gribonval2021compressive, keriven2018sketching] is that, in general, the size of the sketch does not scale with the dimensions of the dataset, or indeed the data’s underlying dimensionality, but is instead driven by the complexity or dimensionality of the task or model of interest. In theory, one can work with datasets of arbitrary length, as the dimension of the sketch is fixed throughout, making CL especially amenable to large-scale learning. Inferring the parameters of a model solely from the sketch is an underdetermined inverse problem. As a result, we need regularity assumptions to make the problem well-posed. These assumptions come in the form of a low-dimensional model set that the solution to the inference problem lies on or close to. The reader may notice this is reminiscent of compressive sensing, where one assumes the signal of interest is sparse in some domain, and therefore the solution lies on or close to a union of subspaces forming a low-dimensional model set. The sparse regularity assumption allows one to take a limited number of measurements to recover the signal of interest and reduce the complexity and cost of acquisition. In later sections, we take inspiration from compressive sensing to develop and analyse our CL algorithms.

In this paper, we develop a CL framework, including theory and practical algorithms, for independent component analysis (ICA).
ICA is an unsupervised learning task that attempts to find the linear transformation that separates some given data into components of maximal independence. It is used extensively in the machine learning and signal processing communities, for example as a dimensionality reduction tool
[hyvarinen2000independent], to uncover underlying factors that affect the price movements of a collection of stocks [882456], and to detect independent sources in the brain through EEG signals [841330]. As will be discussed in Section 2, the ICA problem can be solved directly from the data or through some higher order statistics of the data, such as the kurtosis. Given that the number of independent sources is denoted by
and the signal or data length is denoted by , the memory complexity typically scales either with or , depending on the method of choice. As one can see, this becomes infeasible for large-scale datasets. In this paper, we show theoretically and empirically that it is possible to design a CL ICA algorithm where the sketch dimension, and therefore the memory complexity, scales as , which can be orders of magnitude smaller than current approaches.
1.1 Contributions and Outline
Below, we highlight the main contributions of the paper:

Focusing on the higher order statistics method of ICA, we show that an independent component model set exists in the space of 4th order cumulant tensors, and we state the set’s dimensionality.

We prove that the well-known restricted isometry property (RIP) holds for projections of the cumulant tensor on the model set with a sketch size , which is order optimal with respect to the dimension of the model set. Given a sketch constructed from random Gaussian ensembles, we show that the reconstruction error is stable when measuring a signal which lives close to, but not on, the ICA model set, through the existence of an instance optimal decoder.

Two inherently different CICA algorithms are proposed, in the form of an alternating steepest descent algorithm and an iterative projection gradient algorithm, and we show that they successfully recover the ICA mixing matrix with overwhelming probability provided that the sketch size is sufficiently large. We analyse the proposed CICA algorithms on synthetic and real datasets, showing that the CICA scheme achieves substantial memory savings over existing ICA methods whilst retaining competitive estimation accuracy. 
We analyse the tradeoff between the compression rate of the sketch and the overall statistical efficiency of the sketch estimate.
1.2 Related Works
1.2.1 Existing Compressive Learning Models
The framework of CL has been successfully applied to a host of learning tasks and models with the desired outcome of reducing the complexities associated with signal acquisition, computation and memory storage. In [keriven2018sketching], Keriven et al.
proposed a CL framework for mixture models, in particular the mixture of Gaussian distributions and k-means learning tasks. In both cases, a sketch is constructed by randomly sampling the characteristic function of the mixture model, which can be equivalently seen as taking random Fourier features of the data
[rahimi2008random]. The compact representational sketch of each mixture model scales as , where is the number of mixtures in the model and is the feature space dimension of the data. As a result, a compressive mixture model algorithm was proposed that had both computational and space complexities scaling independently of the number of data points . In [gribonval2021compressive], a compressive principal component analysis (PCA) framework was proposed. As will be discussed in Section
2.2, the compressive PCA methodology aligns closely with our compressive ICA framework. Distinct from compressive mixture models, the compressive PCA method is left distribution free, and it is assumed that the data lives on, or can be approximately modelled by, a low-dimensional subspace. As a result, a sketch of size can be computed by taking a random projection of the covariance matrix of the data, hence reducing the memory complexity of storing either the data of size or the covariance matrix of size .
1.2.2 Generalised Method of Moments
Compressive learning is similar to the technique of Generalised Method of Moments (GeMM)
[hansenGeMM, gemmhall], where the parameters of interest are estimated by matching a collection of generalised moments of the distribution with their empirical counterparts calculated from the data. In most cases it is used instead of maximum likelihood estimation when calculating the likelihood is not tractable. CL differs from much of the GeMM literature as the goal is fundamentally different: in compressive learning one attempts to construct a compact representation of the data with the aim of reducing complexity constraints (computation, memory, acquisition), whilst in GeMM the goal is primarily to estimate the parameters when the model is either partially specified or the likelihood does not have a closed-form solution. Moreover, the selected generalised moments may be a function of the parameter being estimated, hence not providing a one-off sketch.
1.2.3 Streaming Methods
Closely related to CL is the collection of streaming methods [cormode2012synopses, tropp2019streamingScientific], where data items are seen and queried only once by the user and then discarded. This is of particular interest when the summary statistic of choice is updated and maintained in real time, for example in the online learning setting [1198387], to reduce space complexities. Notably, the count-min sketch [cormodea2005improved]
was developed to query data in an online fashion, with the application of maintaining histograms of quantiles. However, these methods in general focus on discrete collections of objects and database queries, while in CL the framework is applied to machine learning tasks where the signal in question is typically continuous. Tropp
et al. [tropp2017practical] proposed a streaming framework for large scale PCA. In particular, in [tropp2019streamingScientific], the authors design random sketches for on-the-fly compression of data matrices associated with large scale scientific simulations. Here the data matrix of interest can be decomposed into a sequence
(1) 
where it is assumed each has some structural redundancy, for example sparsity or low rank. These methods have a subtle yet fundamental difference from CL: in CL the structural assumptions which are exploited to form the sketch arise from the model or distribution itself, while in these streaming methods the structural assumptions come directly from the data. Moreover, several passes over the data may be required to reduce the low-rank approximation error [tropp2017practical].
1.2.4 Other Compression Techniques
Coresets are a popular method used to compress a database into a summary statistic used for inferring the parameters of a given model, and have been used primarily for subspace clustering based tasks [har2004coresets, feldman2020turning]
. In a similar vein to CL, the compact data representation has a size that typically scales independently of the number of input items and the dimension of the input feature space. However, coresets are constructed in a hierarchical manner, possibly resulting in multiple passes over the data, and are therefore not naturally amenable to online or distributed learning. Projections, including both random projections and feature selections
[boutsidis2010random, calderbank2009compressed], are used widely to reduce the dimensionality of the data. In [calderbank2009compressed], datasets were randomly projected into a compressed domain using both random Gaussian and Bernoulli matrices. In a similar vein to compressive sensing [Donoho_CS], the data was assumed to be sparse, and therefore the dependency on the feature space dimension was removed from the space and acquisition complexities. In contrast to random projections, more structured projections have also been proposed. In [tang2014feature], different feature selection techniques for classification are reviewed, including structured graph methods and the use of embedded models. The well-known PCA method is a popular preprocessing technique that projects the dataset onto a low-dimensional subspace of maximal variance
[jolliffe2016principal]. In both random and structured projections, the methods discussed only tackle the dependency on the feature space dimension and do not address the challenges posed by a large data size . Subsampling methods are also popular for dimensionality reduction, whereby a subset of the original dataset is used for learning. As discussed previously, the method of coresets [har2004coresets, feldman2020turning] is a subsampling technique that attempts to subselect dominant items that well approximate the structure of the dataset. Other subsampling techniques include random and adaptive subsampling [cormode2012synopses]. The disadvantage of subsampling techniques is the risk of discarding important information relating to non-sampled data items. Moreover, these techniques only tackle the constraint on the number of data items and do not combat the complexity issues posed by the feature space dimension .

Specifically for ICA compression, Sela et al. [sela2016randomized] used kernel approximation techniques to reduce the dimensions of the kernel ICA method proposed by Bach [bach2002kernel]. Random Fourier features are used to approximate the kernel, reducing the memory complexity from to , where is the number of random Fourier weights used. Despite the reduction in memory complexity, the algorithm still has storage demands which scale linearly with . In comparison, within our framework we remove the dependency on the data length completely when estimating the ICA mixing matrix.
2 Background
2.1 Compressive Learning
Let
be independent and identically distributed samples from an unknown probability distribution
on where is some Euclidean space and is a Borel field. Classically, is parametrized by some parameters denoted by . A statistical learning problem can be formalised as follows: find a hypothesis from a hypothesis class that best matches the probability distribution over the training collection, given some data fidelity term. Given a loss function
, this is equivalent to minimizing the risk defined as
(2) 
Formally, the model set associated to the hypothesis class can be defined as:
(3) 
In other words, it is the set containing all distributions for which zero risk is achievable. As a result, the model set has a dimension which is intrinsic to the hypothesis class of the model. In practice, one cannot minimize the true risk as we generally do not have access to the true distribution ; instead, one minimizes the empirical risk with respect to finite samples from the true distribution, which may require all the data to be stored in memory.
In CL [keriven2018sketching, gribonval2021compressive, gribonval2021sketching], we find a compact representation, or a so-called sketch, that encodes some statistical properties of the data. Its size is ideally chosen relative to the intrinsic complexity of the problem, making it possible to work with arbitrarily large datasets while storing in memory an object of fixed size. Given a feature function , such that is integrable with respect to any , define a linear operator by
(4) 
The sketch defined in (4) can be seen as taking the expectation of some particular features of the distribution , which is similar to the field of kernel mean embedding [muandet2017kernel] where one uses feature maps to embed probability distributions. Therefore, we would like to construct so that
captures sufficiently relevant information of the data to allow us to infer the parameters of the model directly from the sketch. As a trivial example, if we seek to infer only the mean of a normal distribution
, the construction where would constitute a trivial yet sufficient sketch. In reality, CL is applicable to much more complex models where the feature function is non-trivial and the model may not necessarily possess a finite dimensional sufficient statistic independent of the data. The goal of CL is therefore to construct a sketch of size that captures enough information to recover an estimated risk which is close to the true risk with high probability [gribonval2021compressive]. In practice, as in the kernel mean embedding literature [muandet2017kernel], the empirical distribution is used to form an empirical sketch defined as
(5) 
denoting by the Dirac distribution on
, and therefore the empirical sketch can be formed directly from the data. Due to the law of large numbers,
. Once the sketch has been computed, one can discard the dataset from memory. As a result, CL reduces to solving an inverse problem of the form
(6) 
where is a cost function designed for the specific learning task at hand. Through a compressive sensing lens, we can exploit structural assumptions on the model set and the associated parameter space , e.g. sparsity, low-rankness, or low-dimensional manifold properties, to make (6) well-posed and finding a solution tractable. As such, one can design a decoder that exploits the structural assumptions of the model set to recover the parameters of the model from the sketch whilst minimizing the risk. The sketching operator and the decoder form the pair that defines the CL algorithm for a specific learning problem. It should be noted that minimizing (6) plays the role of a proxy for minimizing the empirical risk as, by definition of the model set in (3), any has zero loss in expectation [gribonval2021compressive].
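To make the sketching mechanism concrete, the following minimal Python/NumPy sketch computes an empirical sketch in a single pass, recovering the trivial mean example discussed above. The function names, the choice of feature function, and the synthetic dataset are purely illustrative and are not part of our framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_sketch(X, phi):
    """Empirical sketch (5): the feature function averaged over the samples.

    X   : (N, d) data matrix, one sample per row.
    phi : feature function mapping one sample to an m-dimensional vector.
    """
    # A single pass over the data; the dataset can be discarded afterwards.
    return np.mean([phi(x) for x in X], axis=0)

# Trivial example from the text: with phi(x) = x, the sketch is the sample
# mean, a sufficient sketch for the mean of a normal distribution.
X = rng.normal(loc=3.0, scale=1.0, size=(100_000, 2))
z = empirical_sketch(X, lambda x: x)
```

Once `z` is formed, `X` is no longer needed: a decoder acting on `z` alone estimates the parameter of interest, here the mean.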
Learning Task   Means  GMM  PCA 

Model set  
Feature func.  
Sketch cost  s.t  
Sketch Size 
2.2 Compressive Principal Component Analysis
In Section 2.1, the framework of CL was discussed in a general manner without specific consideration of the distributional form of the model. As will be discussed in Section 2.3, the PCA and ICA models are similar in nature in that the model is often left distribution free. In other words, the distribution of the sampled data is left unspecified. In Table 1, it is shown that the compressive PCA model set [gribonval2021compressive] is defined as
(7) 
Due to the distribution free assumption of the PCA model, we seek structural assumptions that are manifested within some intermediary statistic space to make computing a sketch possible [sheehan2019compressive]. In the case of compressive PCA, the space of covariance matrices is leveraged as an intermediary statistic space, where the rank of the covariance matrices is exploited. Figure 1 depicts a geometric viewpoint of both compressive parametric learning (e.g. k-means, GMM) and distribution free compressive learning (e.g. PCA, ICA). In general, distribution free CL poses distinct challenges and advantages compared with the typical parametric CL framework [sheehan2019compressiveSPM]. Challenges arise when choosing an intermediary statistic space , for instance: (1) What set of intermediate statistics can we use? (2) How do the structural assumptions of the model set manifest within the intermediate statistic? Equally, there are many advantages. Specifically, by leveraging some set of intermediate statistics we have implicitly mapped the problem from an infinite dimensional probability space to a typically finite dimensional statistic space. As a result, we can utilise a host of existing techniques from the compressive sensing literature to design encoder and decoder pairs
. Moreover, it also allows us to use a more flexible semiparametric model that is only partially specified. As will be discussed in Section
4, the compressive ICA framework follows a similar convention, where the space of 4th order cumulant tensors is used as an intermediary statistic space to exploit structural assumptions of the model set and form a sketch.
2.3 Independent Component Analysis
ICA is used frequently in the machine learning and signal processing communities to identify latent variables that are mutually independent. Consider a data vector
, then the problem of ICA concerns finding a mixing matrix (here we assume that ) such that
(8) 
where and the components are statistically independent:
(9) 
The data point is only one realisation of a data matrix or signal of length , so we attempt to infer with the collective set of linear equations , where is a realisation of . There are many techniques and methods in the literature to solve the ICA problem. The simplest is to assume the distributional form of each of the independent components and then solve the ICA problem through a maximum likelihood approach [BellICA]. In practice, the distributions are not known a priori, therefore in most methods the distributions are left unspecified. As a result, practitioners and researchers often resort to maximizing a given contrast function to solve the ICA problem.
2.3.1 Prewhitening
A popular preprocessing trick for ICA is to prewhiten the data beforehand. This involves the process of finding the matrix such that
(10) 
where
has identity covariance matrix. This initial preprocessing step, which can be executed through singular value decomposition based techniques, uncorrelates the mixed components and handles the case where there are more mixture observations than independent components
. Moreover, it has the advantage that the matrix to be found is necessarily orthogonal and square. For the sake of presentation, we will consider the whitened version of the data for the remainder of this section, with the corresponding whitened ICA equation
(11) 
In Section 4.3, we propose two equivalent sketching frameworks that can incorporate either prewhitened or unwhitened data.
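As a concrete illustration of the prewhitening step (10), the following sketch whitens zero-mean data via a symmetric eigendecomposition of the sample covariance, one of several SVD-based options mentioned above. The variable names and the synthetic Laplace sources are illustrative assumptions.

```python
import numpy as np

def prewhiten(X):
    """Prewhiten data so the whitened samples have identity covariance.

    X : (N, d) data matrix, one sample per row.
    Returns the whitened data Z and the whitening matrix W, with Z = X_c W^T.
    """
    Xc = X - X.mean(axis=0)                    # centre the data
    cov = Xc.T @ Xc / len(Xc)                  # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
    W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # cov^{-1/2}
    return Xc @ W.T, W

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2))                    # mixing matrix
S = rng.laplace(size=(50_000, 2))              # independent sources
Z, W = prewhiten(S @ A.T)                      # whitened mixtures
```

After whitening, the remaining unknown linear transform is orthogonal, which is exactly the property exploited by the cumulant methods below.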
2.3.2 Cumulant Based ICA
Tensorial or cumulant based methods are a group of techniques used to solve the ICA problem and are of particular interest in this paper. Statistical properties of the data instance can be described by its cumulants . In the multivariate setting, cumulants give rise to tensors, denoted for a cumulant tensor of order . Assuming the data has zero mean, the first four cumulants are defined [de1997signal] as
(12)  
where is the expectation operator. Given the model in (11), equating to , the following multilinear property holds for the associated cumulant tensors:
(13) 
where represents the mode tensor-matrix product and represents the order cumulant tensor of the independent source signals [de1997signal]. In this paper we will only consider 4th order cumulant tensors (e.g. ) and, for the sake of simplified notation, we shall drop the superscript in (13) for the rest of the discussion. We denote by the space of 4th order cumulant tensors which account for the symmetry in (12), where each cumulant tensor has a maximum of
unique entries (degrees of freedom)
[COMONTensorDiag]. The diagonal entries are the autocumulants of , while the off-diagonal entries are the cross-cumulants. If the variables are statistically independent then, as seen from (12), the cross-cumulants vanish, resulting in a strictly diagonal cumulant tensor. In other words, independence implies diagonality. It is shown in [comon2009tensor] that under mild conditions the converse is also true, i.e. diagonality implies independence. Cumulant based ICA can therefore be seen as finding a linear transformation such that the resulting cumulant tensor
(14) 
is strictly diagonal. We can define the following ICA model set:
(15) 
where is the set of diagonal cumulant tensors, defined formally as
(16) 
Here, we have the additional requirement^{1} that each diagonal cumulant is greater than or equal to a small constant . The expected cumulant tensor is typically not known, owing to finite data length approximations and non-Gaussian additive noise [de2000introduction2ICA], and so in general cannot be fully diagonalized by a linear transform. As a result, contrast functions are used to approximately diagonalize and maximize the independence of the system.
^{1} A standard requirement in ICA is that at most one diagonal cumulant can be zero, which arises from the ICA assumption that at most one source signal is Gaussian [hyvarinen2000independent]. Here we make the slightly stronger assumption that all source signals are non-Gaussian.
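For concreteness, the 4th order cumulant tensor in (12) can be estimated from zero-mean data with the standard moment-based formula; the following is a direct, unoptimized NumPy sketch with illustrative names (the unit-variance uniform sources are an assumption used only to exercise the estimator).

```python
import numpy as np

def cumulant4(X):
    """Empirical 4th order cumulant tensor of zero-mean data X of shape (N, d):

    cum(x_i, x_j, x_k, x_l) = E[x_i x_j x_k x_l]
                              - E[x_i x_j] E[x_k x_l]
                              - E[x_i x_k] E[x_j x_l]
                              - E[x_i x_l] E[x_j x_k].
    """
    N, _ = X.shape
    M4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / N   # 4th order moment
    C2 = X.T @ X / N                                      # covariance
    return (M4
            - np.einsum('ij,kl->ijkl', C2, C2)
            - np.einsum('ik,jl->ijkl', C2, C2)
            - np.einsum('il,jk->ijkl', C2, C2))

rng = np.random.default_rng(2)
# Independent unit-variance uniform sources (kurtosis -1.2).
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, 2))
C = cumulant4(S)
# Cross-cumulants are near zero; diagonal entries approach the source kurtosis.
```

For independent columns, the off-diagonal (cross) entries of `C` vanish up to sampling error, illustrating that independence implies diagonality.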
2.3.3 Contrast Functions
A contrast function is a mapping from the space of distributions to the real line and is a measure of the statistical independence between latent variables in a linear system [comon1994independent, hyvarinen2000independent]. For a function to be a contrast function, it must be permutation and scaling invariant, and also maximized if and only if the distributions are statistically independent [comon1994independent]. For instance, the negative mutual information satisfies the contrast conditions, although it can be difficult to estimate in practice. Typically, contrast functions are tractable approximations of information-theoretic measures such as negative mutual information, maximum likelihood and negentropy. Comon proposed various cumulant based contrast functions in [comon1994independent] that are Edgeworth expansions of information-theoretic measures. The simplest is given by
(17) 
where is the order cumulant tensor corresponding to the variable . When is maximal, the components of are independent, giving and . A 4th order cumulant tensor that has a decomposition as given in (14), and is therefore a member of the model set , maximizes any given cumulant based contrast function [comon1994independent], and hence minimizes the associated information-theoretic measure and the risk in (2). For further details, comprehensive reviews of cumulants and tensors can be found in [comon1994independent, de1997signal].
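The simple cumulant contrast (17), restricted to the auto-cumulants, can be evaluated directly from data. The following illustrative sketch (variable names and the Laplace sources are assumptions) shows that remixing independent sources by a rotation lowers the contrast, consistent with the contrast being maximized at independence.

```python
import numpy as np

def contrast(Y):
    """Cumulant-based contrast (17): sum of squared auto-cumulants of Y's columns.

    For zero-mean data, cum(y_i, y_i, y_i, y_i) = E[y_i^4] - 3 E[y_i^2]^2.
    """
    m2 = np.mean(Y ** 2, axis=0)
    m4 = np.mean(Y ** 4, axis=0)
    return np.sum((m4 - 3 * m2 ** 2) ** 2)

rng = np.random.default_rng(3)
S = rng.laplace(size=(100_000, 2)) / np.sqrt(2)   # independent, unit-variance sources
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation that remixes the sources
# contrast(S) exceeds contrast(S @ Q.T): the separating rotation maximizes it.
```

Maximizing this quantity over orthogonal transforms of the whitened data is exactly the approximate diagonalization described above.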
As discussed in Section 2.2, the 4th order cumulant tensor will act as an intermediary statistic space (). It is well documented through identifiability results in the cumulant based ICA literature [comon1994independent, de2000introduction2ICA] that the parameters of the ICA model, namely the mixing matrix , can be estimated solely from the 4th order cumulant tensor. As such, the 4th order cumulant tensor can be seen in its own right as a sketch, albeit inefficient with respect to compression being . In the next section, we motivate the principles behind sketching the 4th order cumulant tensor to form a compact representational sketch that has size .
3 Compressive Learning Principles for Cumulant ICA
It was discussed in Section 2.3.2 that the model set of the ICA problem defined in (15) maximizes any given cumulant based contrast function [comon1994independent]. The model set is itself a low-dimensional space residing in the space of cumulant tensors . Specifically, can be described as the product set of the set of orthogonal matrices, denoted , and the set of diagonal cumulant tensors defined in (16). We can therefore initially count the degrees of freedom of the model set :

 A maximum of degrees of freedom on the leading diagonal.

 A maximum of degrees of freedom [Szarek_OrthogonalCovering].
In total, the model set has degrees of freedom. In comparison, the space of 4th order cumulant tensors , in which the model set resides, has degrees of freedom. As the model set is of low complexity, in principle we could form a sketch of the 4th order cumulant tensor and estimate the parameters of the ICA model, namely the mixing matrix , solely from the sketch. The sketch of the 4th order cumulant tensor is defined by
(18) 
where w denotes that the sketch acts on the whitened data . The computation of the sketch is closely related to the sketching method of compressive PCA highlighted in Table 1. Akin to compressive PCA, the sketching operator acts on the finite dimensional space of 4th order cumulant tensors instead of the infinite dimensional probability space, which is left unspecified due to the nature of the ICA model. The ICA sketch defined in (18) draws strong connections to finite dimensional compressive sensing [candes2008introduction, Donoho_CS], where limited (random) measurements of a finite dimensional sparse vector are taken to reduce the complexities associated with signal acquisition. Throughout the compressive sensing literature [candes2008introduction, Donoho_CS, candes2011tight], the restricted isometry property (RIP) is a fundamental tool used extensively to show that a sketching operator stably embeds elements of the model set into a compressive domain , provided that the sketch dimension is of sufficient size. In other words, given a sketching operator , it shows that the distance between every pair of signals in the model set is approximately preserved under the action of the sketch, therefore providing a near isometry. In the case of compressive ICA, given two elements of the ICA model set and an RIP constant , then
(19) 
provided that the sketch size is of sufficient dimension. In many cases, it suffices for the sketch size to be of the order of the degrees of freedom of the model set. In [bourrier2014fundamental, BlumensathIPG11], it is proved that if the lower RIP (LRIP) holds for a given sketching operator , e.g. the left-hand side of (19), then there exists a robust decoder that recovers a signal from the model set in a manner that is stable with respect to noise and to signals that lie close to the model set. Moreover, it is proved in [bourrier2014fundamental] that if the LRIP holds for the sketching operator on the model set, then the decoder can be the constrained optimization, for instance
(20) 
In principle, if the RIP can be proved for a sketching operator on the ICA model set , then we have an optimization strategy for solving the compressive ICA problem.
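The near-isometry principle in (19) can be illustrated numerically. The following hedged NumPy sketch (all dimensions and names are illustrative assumptions) builds two model-set-style elements as in (15), applies a Gaussian sketching operator to their vectorizations as in (18), and checks that the distance between them is roughly preserved.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6          # number of sources (illustrative)
D = d ** 4     # ambient dimension of the vectorized cumulant tensor
m = 500        # sketch size, with m << D for compression

# Gaussian sketching operator: m random projections of vec(C).
Phi = rng.normal(scale=1 / np.sqrt(m), size=(m, D))

def sketch(C):
    """Apply the sketching operator A(C) = Phi vec(C)."""
    return Phi @ C.reshape(-1)

def model_element(Q, kurt):
    """A model-set-style element: a diagonal cumulant tensor with diagonal
    `kurt`, multilinearly transformed by the orthogonal matrix Q."""
    C = np.zeros((d, d, d, d))
    for i in range(d):
        C[i, i, i, i] = kurt[i]
    return np.einsum('ai,bj,ck,dl,ijkl->abcd', Q, Q, Q, Q, C)

Q1, _ = np.linalg.qr(rng.normal(size=(d, d)))
Q2, _ = np.linalg.qr(rng.normal(size=(d, d)))
C1 = model_element(Q1, rng.uniform(1, 2, d))
C2 = model_element(Q2, rng.uniform(1, 2, d))

# RIP-style check: the sketch approximately preserves the pairwise distance.
ratio = np.linalg.norm(sketch(C1) - sketch(C2)) / np.linalg.norm(C1 - C2)
```

Here a single pair is checked for illustration; the RIP of Theorem 4.1 is the much stronger uniform statement over all pairs in the model set.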
4 Compressive Independent Component Analysis Theory
We begin by explicitly defining the sketching operator as
(21) 
where and vec denotes the vectorization operator. Here we assume is some random measurement matrix whose entries are sampled according to some distributional law . In this paper, we consider two randomized linear dimension reduction maps, namely the Gaussian map and the subsampled randomized Hadamard transform (SRHT), stated below. The CICA RIP, our main result stated in Theorem 4.1, is proved using the Gaussian map; however, fast Johnson-Lindenstrauss transforms (FJLTs), for instance the SRHT, still work in practice, as will be discussed in Section 6.
4.0.1 Gaussian Maps
The most traditional randomized linear dimension reduction map is the subgaussian matrix, which has been used extensively in the CS literature [candes2008introduction, Donoho_CS]. The subgaussian matrix has entries that follow
(22) 
Gaussian maps typically require in memory and exhibit a computational complexity of .
4.0.2 Subsampled Randomized Hadamard Transform
The SRHT is an instance of an FJLT that approximates the properties of the full Gaussian map [krahmer2011new]. Here is defined as
(23) 
where

is a diagonal matrix whose elements are independent random signs .

is a normalised WalshHadamard matrix.

is a matrix consisting of a subset of randomly sampled rows from the identity matrix.
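The construction above can be sketched in a few lines. The following is an illustrative implementation using a fast Walsh-Hadamard transform; the sqrt(n/m) rescaling that makes the map a near isometry is a common convention and is an assumption here, as are the dimensions.

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform (unnormalised); length must be a power of two."""
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v

def srht(x, rows, signs):
    """Apply the SRHT (23): random sign flips (D), a normalised Walsh-Hadamard
    transform (H), then subsampling of m rows (R), with sqrt(n/m) rescaling."""
    n = len(x)
    y = fwht(signs * x) / np.sqrt(n)         # normalised Hadamard transform of Dx
    return np.sqrt(n / len(rows)) * y[rows]  # subsample and rescale

rng = np.random.default_rng(5)
n, m = 1024, 256
signs = rng.choice([-1.0, 1.0], size=n)        # diagonal of D
rows = rng.choice(n, size=m, replace=False)    # rows kept by R
x = rng.normal(size=n)
y = srht(x, rows, signs)                       # ||y|| approximates ||x||
```

Because the transform is applied in place via butterflies, the Hadamard matrix is never stored explicitly, which is the source of the memory and runtime savings discussed next.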
The SRHT is considerably cheaper to compute and store than the Gaussian map. As we do not explicitly store , the SRHT only requires in memory [tropp2011improved]. In addition, the computational complexity of computing the sketch reduces to , in comparison to when using the Gaussian map [ailon2006approximate, tropp2011improved]. Below we state the main result of the paper.
Theorem 4.1 (Compressive ICA RIP).
Corollary 4.1 (Information Preservation).
Proof.
Given the LRIP in Theorem 4.1, we use Theorem 7 in [bourrier2014fundamental] to obtain our result. ∎
The proof of Theorem 4.1 uses covering numbers and nets of the normalized secant set of .
Definition 4.1 (Secant Set).
The secant set of a set is defined as
(26) 
Definition 4.2 (Normalised Secant Set).
The normalized secant set of a set is defined as
(27) 
where defines the zero tensor.
Definition 4.3.
(Covering number) Let . The covering number of a set is the minimum number of closed balls of radius , with respect to the norm , with centres in , needed to cover . The set of centres of these balls is a minimal net for .
Lemma 4.1 (Covering number of ).
The covering number of with respect to the Frobenius norm is
(28) 
where is some constant.
Proof.
See Appendix A.1. ∎
Definition 4.4.
(Upper box counting dimension) The upper box counting dimension of a set is defined as
(29) 
4.1 Proof of Theorem 4.1
Proof.
To prove that a RIP holds for the ICA model set using the sketching operator defined in (21), we follow a similar line of argument to [candes2011tight, rauhut2017low] by using a covering of to extend the concentration results of the random Gaussian matrix uniformly over the whole low-dimensional set. Specifically, we use the framework proposed by Puy et al. [puyrecipes] to formulate the compressive ICA RIP proof. The proof proceeds by showing that the following assumptions hold:

The normalised secant set, denoted , has finite upper box counting dimension which is strictly bounded by ,

The sketching operator satisfies concentration inequalities [puyrecipes].
We begin with Assumption (A1). Using Lemma 4.1 and the definition of the upper box counting dimension in Definition 4.4, it can be seen that , so for any we satisfy Assumption (A1). To prove Assumption (A2), we have the following definition.
Definition 4.5.
(Sub-Gaussian random variable) A sub-Gaussian random variable $X$
is a random variable that satisfies $\mathbb{P}(|X| > t) \le \exp(1 - t^2/K^2)$ for all $t \ge 0$, with $K > 0$. The sub-Gaussian norm of $X$, denoted by $\|X\|_{\psi_2}$, is the smallest $K$ for which this property holds, i.e., $\|X\|_{\psi_2} = \inf\{ K > 0 : \mathbb{P}(|X| > t) \le \exp(1 - t^2/K^2) \text{ for all } t \ge 0 \}.$
Let denote the th row of the random Gaussian matrix . Then we use the fact [vershynin2010introduction, puyrecipes] that
(30) 
for all , where is an absolute constant. Therefore , and Assumption (A2) is satisfied. Finally, using Theorem 8 of [puyrecipes], we obtain the desired RIP result in Theorem 4.1. ∎
4.2 Finite Sample Effects
In practice, the sketch is constructed from a finite set of data such that
(31) 
where is the feature function discussed in Section 2.1 acting on the whitened data . For compressive ICA we can explicitly define the feature function, acting on the whitened data, as
(32) 
for , where are the rows of a Gaussian matrix and denotes the Frobenius inner product. Furthermore, for shorthand we denote . In other words, the feature function takes random quartic moments of the data point . Note that the empirical sketch is equivalent to , as specified in (5), where is the finite-data approximation of the 4th order cumulant tensor defined by
(33)  
In this case, the error defined in Theorem 4.1 can be attributed to the finite sample effects of approximating the true 4th order cumulant tensor from finite data. We now state our final result of this section.
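As an illustration of this finite-data approximation, the empirical 4th-order cumulant tensor can be computed directly from a sample; below is a minimal sketch assuming the standard cumulant formula for zero-mean data (the function name is ours, not the paper's).

```python
import numpy as np

def empirical_cumulant4(X):
    """Empirical 4th-order cumulant tensor of data X (rows are samples).

    For zero-mean data:
    C[i,j,k,l] = E[x_i x_j x_k x_l] - E[x_i x_j] E[x_k x_l]
                 - E[x_i x_k] E[x_j x_l] - E[x_i x_l] E[x_j x_k]."""
    X = X - X.mean(axis=0)                 # centre the data
    n, d = X.shape
    M4 = np.einsum('ti,tj,tk,tl->ijkl', X, X, X, X) / n   # 4th-order moment
    C2 = X.T @ X / n                                       # sample covariance
    return (M4
            - np.einsum('ij,kl->ijkl', C2, C2)
            - np.einsum('ik,jl->ijkl', C2, C2)
            - np.einsum('il,jk->ijkl', C2, C2))
```

For Gaussian data all 4th-order cumulants vanish, so the estimator should return a tensor whose entries shrink toward zero as the sample grows, consistent with the finite-sample error discussed above.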
Theorem 4.2 (Finite Sample Effects).
Assume the independent components have finite nonzero kurtosis and have bounded support such that . Consider the ICA model with corresponding 4th order cumulant tensor as in (13). By considering any draw of and associated 4th order cumulant tensor approximation , we have
(34) 
with probability at least on the drawing of both the ’s and the random matrix .
Proof.
See Appendix B. ∎
4.3 Discussion
The results in this section are all based on proving a RIP on the model set defined in (15), where it is assumed the data has been prewhitened to reduce the ICA model to as discussed in Section 2.3. The prewhitening stage removes some of the degrees of freedom within the ICA inference task, as it is then only necessary to estimate an orthogonal mixing matrix . In some sketching settings we may only see the data once, for example in the streaming context [tropp2019streamingScientific], and therefore prewhitening may not be possible. Estimating an arbitrary mixing matrix instead of an orthogonal mixing matrix increases the degrees of freedom from to . As a result, we must sketch the unwhitened moment tensor such that
(35) 
where and is a random matrix as defined in (21). Here u denotes that the sketch is acting on the unwhitened data . In addition, the feature function for the unwhitened data can be defined as
(36) 
for , where are the rows of the matrix . Note that the feature function for the unwhitened data now includes quadratic moments (one could further reduce the size of the unwhitened sketch by instead computing random quadratic moments; however, the reduction in complexity is minimal, so we leave this for future work), as well as random quartic moments, which are needed to estimate the mixing matrix with its extra degrees of freedom. Recall from (10) that the mixing matrix has the following decomposition [de2000introduction2ICA]
(37) 
where is the eigendecomposition of the covariance matrix and , are an orthogonal matrix and a diagonal matrix, respectively.
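A minimal numerical illustration of this decomposition: the eigendecomposition of a covariance matrix yields a whitening matrix, and applying it to the mixing matrix produces an orthogonal matrix, consistent with the reduction to an orthogonal mixing matrix discussed above. The function name is ours; this is a sketch under the idealised assumption that the population covariance equals the mixing matrix times its transpose (unit-variance sources).

```python
import numpy as np

def whitening_matrix(C):
    """Whitening matrix W = Lambda^{-1/2} E^T from the eigendecomposition
    C = E Lambda E^T of a covariance matrix, so that W C W^T = I."""
    lam, E = np.linalg.eigh(C)
    return np.diag(1.0 / np.sqrt(lam)) @ E.T

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # unknown mixing matrix
C = A @ A.T                       # population covariance of x = A s, Cov(s) = I
W = whitening_matrix(C)
Q = W @ A                         # whitened mixing matrix, orthogonal
```

Since W C W^T = I by construction and C = A A^T, the product Q = W A satisfies Q Q^T = I, i.e. whitening reduces the estimation problem to an orthogonal matrix.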
5 CICA Algorithms
In this section we propose two distinct compressive ICA algorithms to estimate the mixing matrix for both the whitened and unwhitened case.
5.1 Iterative Projection Gradient
Iterative projection gradient (IPG) descent is a popular optimization scheme that enforces low-dimensional structure (e.g. sparsity, rank) by projecting the object of interest onto the model set after each gradient step. An iterative hard thresholding scheme was proposed in sparsity-based compressive sensing [blumensath2008iterative2, blumensath2009iterative], where the smallest absolute entries are thresholded to zero to enforce the sparsity constraint and project the object onto the sparse model set. Blumensath [BlumensathIPG11] shows that the thresholding operator is an orthogonal projection onto the sparse set, thereby projecting to an element of the model set at minimal distance. For compressive ICA, we likewise seek an orthogonal projection onto the ICA model set . Formally, we can define an orthogonal projection operator of a 4th order cumulant tensor as
(38) 
In other words, projects the object onto the element in the model set that is of minimum distance w.r.t the Frobenius norm. In practice it is often difficult to find a projection operator that is both orthogonal and tractable in terms of computation. In [cardoso1, cardoso2], Cardoso showed that the ICA model set where is the set of rank tensors defined as
(39) 
where is the matrix formed by rearranging the elements of the tensor into a Hermitian matrix and where rank defines the standard matrix rank [cardoso1]. Moreover, is the set of supersymmetric tensors defined by
(40) 
where defines all permutations of the index . In fact, Cardoso proved in [cardoso2] that locally the converse is true, for instance within some neighbourhood of the following holds:
(41) 
Therefore, within some neighbourhood of , projecting onto the ICA model set is equivalent to projecting onto . Moreover, in [Cadzow1988SignalEC], Cadzow proved that alternating projections onto and are guaranteed to converge onto the intersection (in general, rank forcing destroys symmetry while symmetrization destroys the rank property, so alternating projections are needed until convergence). Fundamentally, the projections onto (rank approximation) and (averaging over permutations), denoted by and respectively, are both simple to compute and are orthogonal. Alternating orthogonal projections onto and ensure a stable projection onto [Cadzow1988SignalEC] which, locally, results in an orthogonal projection onto the ICA model set . Formally, we define the orthogonal projection in Algorithm 1. In practice, Algorithm 1 converges to below a small tolerance in very few iterations ( iterations). We can now state our full CICA IPG algorithm, detailed in Algorithm 2. Here the step size is computed optimally to guarantee convergence [BlumensathIPG11, blumensath2009iterative], denotes the adjoint sketching operator, and is a fixed shrinking step size parameter.
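The alternating projection just described can be sketched as follows, assuming the supersymmetric projection averages over all 4! index permutations and the rank projection truncates the eigendecomposition of the d² × d² unfolding; function names and the fixed iteration count are our own choices, not the paper's Algorithm 1.

```python
import itertools
import numpy as np

def sym_project(T):
    """Orthogonal projection onto supersymmetric 4th-order tensors:
    average over all 4! index permutations."""
    return sum(np.transpose(T, p) for p in itertools.permutations(range(4))) / 24.0

def rank_project(T, n):
    """Projection onto tensors whose d^2 x d^2 unfolding has rank at most n,
    via a truncated eigendecomposition of the (symmetric) unfolding."""
    d = T.shape[0]
    lam, V = np.linalg.eigh(T.reshape(d * d, d * d))
    idx = np.argsort(np.abs(lam))[-n:]                 # keep n dominant eigenvalues
    M_n = (V[:, idx] * lam[idx]) @ V[:, idx].T
    return M_n.reshape(d, d, d, d)

def cadzow_project(T, n, iters=20):
    """Alternate the two orthogonal projections (Cadzow-style) until the
    iterate settles near the intersection of the two sets."""
    T = sym_project(T)                                 # symmetrise so the unfolding is symmetric
    for _ in range(iters):
        T = sym_project(rank_project(T, n))
    return T
```

Note that, as the footnote above observes, rank truncation breaks symmetry and symmetrisation breaks the rank property, which is why the two projections are alternated rather than applied once each.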
5.1.1 Unwhitened IPG
As discussed in Section 4.3, it is often convenient, from an online processing point of view, to directly sketch the unwhitened data . Using the properties of the matrix-tensor product [de1997signal], it can be seen that
(42) 
where . As defined in (36), the unwhitened feature function includes the second order moment of , namely . The empirical sketch therefore includes the sample covariance , which can be used to estimate an approximation of , denoted , via the eigenvalue decomposition of [comon1994independent] at the beginning of Algorithm 2. Denoting , the gradient step in Algorithm 2 can be replaced by
(43) 
together with the associated step size and stopping criteria. As a result, the CICA IPG algorithm proceeds as before, employing the original orthogonal projection .
5.2 Alternating Steepest Descent
Our second proposed algorithm, alternating steepest descent (ASD), is inherently different from the IPG scheme previously discussed. To see why, it is insightful to rewrite (20) in terms of the elements of the product set and O:
(44) 
where we have used the multilinear property discussed in (13). As the optimization problem is now explicitly defined by the mixing matrix and a sparse diagonal tensor , it is sufficient to optimise with respect to these parameters in an alternating steepest descent scheme. This approach contrasts with the IPG scheme: once we initialise the mixing matrix and the diagonal cumulant tensor appropriately, we can optimise directly on the model set . We first state the ASD steps:
Note that the diagonal cumulant tensor can be reformulated as a sparse vector with known support, so one can perform element-wise differentiation on the entries for . The second step requires more attention as we have the constraint (i.e. ). The set of orthogonal matrices is an instance of a Stiefel manifold [wotaoStiefel]; therefore, is minimised directly on the Stiefel manifold.
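A minimal sketch of this reparametrisation, mapping between the diagonal 4th-order tensor and its vector of free parameters on the known support (function names are ours):

```python
import numpy as np

def diag_tensor4(kappa):
    """Embed a kurtosis vector kappa into a diagonal 4th-order tensor:
    D[i,i,i,i] = kappa[i], all cross entries zero."""
    n = len(kappa)
    D = np.zeros((n, n, n, n))
    idx = np.arange(n)
    D[idx, idx, idx, idx] = kappa
    return D

def diag_entries(D):
    """Inverse map: read the n diagonal entries back off the tensor."""
    idx = np.arange(D.shape[0])
    return D[idx, idx, idx, idx]
```

Because the support is fixed, a gradient step on the diagonal tensor reduces to an element-wise update of the n entries returned by `diag_entries`.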
5.2.1 Stiefel Manifold Optimisation
Given a feasible matrix $X$ (i.e. $X^{\top} X = I$) and the gradient $G$ of the objective at $X$, define the skew-symmetric matrix
(45) $W := G X^{\top} - X G^{\top}.$
The update on the Stiefel manifold is determined by the Crank-Nicolson scheme [crank1947practical], denoted
(46) $Y(\tau) = \big(I + \tfrac{\tau}{2} W\big)^{-1} \big(I - \tfrac{\tau}{2} W\big) X,$
where $\tau \ge 0$ is the step size. The matrix $\big(I + \tfrac{\tau}{2} W\big)^{-1} \big(I - \tfrac{\tau}{2} W\big)$ is referred to as the Cayley transform [wotaoStiefel] of $W$. The descent curve $Y(\tau)$ has the following useful features:

$Y(\tau)$ is smooth on $\mathbb{R}$,

$Y(0) = X$,

$Y(\tau)^{\top} Y(\tau) = X^{\top} X$ for all $\tau$.
As a result, we perform a steepest descent on with line search along the descent curve with respect to . For more details on optimisation methods constrained to the Stiefel manifold refer to [wotaoStiefel]. We can now state our second proposed CICA algorithm in Algorithm 3.
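A minimal sketch of the Cayley-transform update of Wen and Yin [wotaoStiefel], as used in the inner step of the ASD scheme (the function name and the toy dimensions are our own):

```python
import numpy as np

def cayley_step(X, G, tau):
    """One Cayley-transform descent step on the Stiefel manifold:
    W = G X^T - X G^T is skew-symmetric, and
    Y(tau) = (I + tau/2 W)^{-1} (I - tau/2 W) X stays orthogonal."""
    n = X.shape[0]
    W = G @ X.T - X @ G.T
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * tau * W, (I - 0.5 * tau * W) @ X)

rng = np.random.default_rng(2)
X, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # a feasible orthogonal point
G = rng.standard_normal((5, 5))                    # an arbitrary gradient
Y = cayley_step(X, G, 0.1)                         # remains on the manifold
```

Because the Cayley factor is orthogonal for any skew-symmetric W, the iterate stays exactly on the Stiefel manifold at every step size, which is what permits the line search over tau described above.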
5.2.2 Practicalities
We start by stating the computational complexity of each proposed CICA algorithm. Here we assume that a fast SRHT, as discussed in Section 4.0.2, is used to compute the sketch. For the IPG scheme, the symmetry projection costs flops through averaging over all index permutations. A rank approximation of a general matrix costs flops [WOOLFE2008335], so the rank projection operator costs a total of flops. The gradient step in Algorithm 2 costs a total of flops due to the use of the sketching operator at each iteration, giving the IPG algorithm a total cost of flops. In the second proposed ASD algorithm, the gradient step in terms of the diagonal tensor in Algorithm 3 again has a cost of flops. The line search costs a total of flops [wotaoStiefel], resulting in the ASD algorithm having a computational complexity of . Note that both proposed CICA algorithms have computational complexity that is independent of the length of the data , which can be extremely large in modern-day applications.
As is the case for the general ICA problem, the compressive ICA optimisation problem is nonconvex and both proposed algorithms may be prone to converging to local minima. As a result, we allow for restarts from random initialisations to obtain a good solution. We also consider a proxy projection operator that uses a Givens rotation scheme, popular in many ICA algorithms (see [comon1994independent, cardoso1993blind]), which approximately diagonalises the cumulant tensor with respect to some contrast function, followed by thresholding the cross-cumulants of the approximately diagonalised tensor to zero [sheehan2019compressive]. We have observed in practice that this proxy projection operator is less sensitive to the nonconvex landscape of the optimization problem, which could be explained by the robustness of Givens rotations [comon1994independent]; hence multiple restarts are rarely required. The proxy projection operator, which we denote by , costs flops for the Givens rotation scheme to approximately diagonalise the cumulant tensor [comon1994independent], and flops for the thresholding of the cross-cumulants. In total, the proxy IPG algorithm therefore has approximately the same computational complexity as the original IPG algorithm.
6 Empirical Results
6.1 Phase Transition
Phase transitions are used frequently in the compressive sensing literature [phase_tran_Lotz] to show a sharp change in the probability of successful reconstruction of the low-dimensional object as the sketch size increases. The location at which the phase transition occurs can provide a tight bound on the sketch size required for a given number of independent components, and further consolidates the theoretical bound of the RIP derived in Section 4. To set up the phase transition experiment, we constructed the expected cumulant tensor of Laplacian sources and transformed the tensor with an orthogonal mixing matrix using the multilinear property in (13), resulting in an expected cumulant tensor . For each number of independent components , 250 Monte Carlo simulations on the mixing matrix were executed for increasing sketch sizes between 2 and 700. A reconstruction was deemed successful if the Amari error (used widely in the ICA literature as it is both scale and permutation invariant, the two inherent ambiguities of ICA inference) [amari1996new] between the true mixing matrix and the estimated mixing matrix , defined by
(47) 
was smaller than , where . The probability of successful reconstruction was estimated as the fraction of successful reconstructions over the 250 Monte Carlo tests. We use the IPG version of the CICA algorithm for these results, although the ASD version produces nearly identical results. It is insightful to begin by fixing the number of sources, here , to highlight the sharp transition shown in Figure 2. We highlight some important bounds, including 2 and 4 times the dimension of the model set , depicted by the orange lines. For comparison, the dimension of the space of cumulant tensors , in other words the size of the cumulant tensor, is shown by the red line. The phase transition occurs between 2 and 4 times the model set dimension, indicating that choosing would be sufficient to successfully infer the mixing matrix with high probability.
Figure 3 generalises the single phase transition result for numbers of independent components varying between and . Once again, the important bounds of the model set dimension (green), 2 and 4 times the model set dimension (orange) and the dimension of the space of cumulant tensors (red) are shown. Figure 3 shows that the phase transition empirically occurs between and , providing a tight practical lower bound of on the sketch size for successful inference of the mixing matrix with high probability. Recall that in Theorem 4.1, the RIP holds when . The location of the phase transition in the empirical results therefore further consolidates the theoretical result. For a given number of independent components , the ratio between the upper orange line (4 times the model set dimension) and the red line (dimension of the space of cumulant tensors) provides a realistic compression rate relative to using the whole cumulant tensor, which many ICA techniques use. Importantly, as the number of independent components increases, the ratio between these two lines decreases, resulting in further compression.
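Since the Amari error drives the success criterion in these experiments, a short sketch may help. The following implements one common normalisation of the Amari error [amari1996new]; the exact normalisation and threshold used in the paper may differ, and the function name is ours.

```python
import numpy as np

def amari_error(A, A_hat):
    """A common form of the Amari error between mixing matrices: it is zero
    iff A_hat equals A up to column permutation and scaling, the two
    inherent ambiguities of ICA."""
    P = np.abs(np.linalg.inv(A_hat) @ A)
    n = P.shape[0]
    row = (P.sum(axis=1) / P.max(axis=1) - 1).sum()   # row-wise concentration
    col = (P.sum(axis=0) / P.max(axis=0) - 1).sum()   # column-wise concentration
    return (row + col) / (2 * n)

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
perm = np.eye(4)[rng.permutation(4)]               # permutation ambiguity
scale = np.diag(rng.uniform(0.5, 2.0, size=4))     # scale ambiguity
# amari_error(A, A @ perm @ scale) vanishes up to floating point
```

The invariance to permutation and scaling is exactly why this error, rather than a plain matrix norm, is used to declare a reconstruction successful.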
6.2 Statistical Efficiency
As shown in Section 6.1, the potential compression rates of sketching the cumulant tensor are high, which can lead to significantly reduced memory requirements. In this section we numerically analyse the trade-off between the sketch size and the loss of information. Statistical efficiency is a measure of the variability or quality of an unbiased estimator [FisherEfficiency]; the Cramér-Rao bound provides a lower bound on the variability of an estimator and gives a best-case scenario. For a fair comparison, we instead use the variability of an estimator inferred by an algorithm that, like the proposed CICA algorithms, explicitly makes use of the cumulant information to infer the mixing matrix estimate. As such, we use Comon's ICA algorithm, detailed in [comon1994independent], which minimises a kurtosis-based contrast function using a sequence of Givens rotations on pairwise cumulants, as the approximate full-data bound (i.e. no compression). We could equivalently have used the well-known Joint Approximate Diagonalization of Eigenmatrices (JADE) algorithm [cardoso1993blind], or any other cumulant-based ICA algorithm, as the approximate bound, with similar results. To this end, we make use of the relative efficiency, defined as
(48) 
where is the Amari error defined in (47) and is the true mixing matrix. Denoting and as the mixing matrix estimates of Comon's ICA algorithm (full data) and the proposed CICA algorithm, respectively, we expect