Compressive Independent Component Analysis: Theory and Algorithms

Compressive learning forms the exciting intersection between compressed sensing and statistical learning, where one exploits forms of sparsity and structure to reduce the memory and/or computational complexity of the learning task. In this paper, we look at the independent component analysis (ICA) model through the compressive learning lens. In particular, we show that solutions to the cumulant based ICA model have particular structure that induces a low dimensional model set that resides in the cumulant tensor space. By showing that a restricted isometry property holds for random projections of cumulants, e.g. Gaussian ensembles, we prove the existence of a compressive ICA scheme. Thereafter, we propose two algorithms for compressive ICA, an iterative projection gradient (IPG) algorithm and an alternating steepest descent (ASD) algorithm, where the order of compression asserted by the restricted isometry property is realised through empirical results. We provide analysis of the CICA algorithms, including the effects of finite samples. The effects of compression are characterised by a trade-off between the sketch size and the statistical efficiency of the ICA estimates. By considering synthetic and real datasets, we show the substantial memory gains achieved over well-known ICA algorithms by using one of the proposed CICA algorithms. Finally, we conclude the paper with open problems, including interesting challenges from the emerging field of compressive learning.

10/22/2019
1 Introduction

In recent years, the size of datasets has grown exponentially as a result of advances in technology, signal acquisition, and the sophistication of modern-day mobile phones and devices. This has enabled researchers, statisticians and machine learning practitioners to build increasingly accurate models as a consequence of larger sample sizes and feature dimensions. Nevertheless, this poses a fundamental challenge to large scale learning as (i) traditional algorithms have computational complexity that scales with the order of the dataset dimensions, (ii) the whole dataset has to be stored on, or transferred to, local RAM as optimisation methods need to return to the data (or a random subset of the data) at subsequent iterations, and (iii) one is vulnerable to malicious attacks on potentially sensitive and personal information as the data needs to be stored or transferred locally. Compressive learning (CL)

[gribonval2021compressive, gribonval2021sketching] partially addresses these fundamental challenges by severely compressing the whole dataset into a random representation of fixed size, a so-called sketch, in a single (or limited number of) passes of the data prior to learning. Once the sketch is formed, the parameters of the model are inferred solely from the sketch; hence a CL algorithm, for a given task or model, never needs to return to the original dataset, which can therefore be deleted from memory. At the core of the CL framework [gribonval2021compressive, keriven2018sketching] is that, in general, the size of the sketch does not scale with the dimensions of the dataset, or indeed the data’s underlying dimensionality, but is instead driven by the complexity or dimensionality of the task or model of interest. In theory, one can work with datasets of arbitrary length, as the dimension of the sketch is fixed throughout, making CL especially amenable to large scale learning. Inferring the parameters of a model solely from the sketch is an underdetermined inverse problem. As a result, we need regularity assumptions to make the problem well-posed. These assumptions come in the form of a low dimensional model set that the solution to the inference problem lies on or close to. The reader may notice this is reminiscent of compressive sensing, where one assumes the signal of interest is sparse in some domain, so that the solution lies on or close to a union of low dimensional subspaces forming a low dimensional model set. The sparse regularity assumption allows one to take a limited number of measurements to recover the signal of interest, reducing the complexity and cost of acquisition. In later sections, we take inspiration from compressive sensing to develop and analyse our CL algorithms.

In this paper, we develop a CL framework, including theory and practical algorithms, for independent component analysis (ICA). ICA is an unsupervised learning task that attempts to find the linear transformation that separates some given data into components of maximal independence. It is used extensively in the machine learning and signal processing communities, for example as a dimensionality reduction tool [hyvarinen2000independent], to uncover underlying factors that affect the price movements of a collection of stocks [882456], and to detect independent sources in the brain through EEG signals [841330]. As will be discussed in Section 2, the ICA problem can be solved directly from the data or through some higher order statistics of the data, such as the kurtosis. Denoting the number of independent sources by n and the signal or data length by N, the memory complexity typically scales either with O(nN) or O(n^4) depending on the method of choice. As one can see, this becomes infeasible for large scale datasets. In this paper, we show theoretically and empirically that it is possible to design a compressive ICA algorithm where the sketch dimension, and therefore the memory complexity, scales as O(n^2), which can be orders of magnitude smaller than current approaches.
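As a rough illustration of these scaling regimes, the snippet below compares the number of stored scalar values under each law; the sizes n and N are hypothetical example values, not taken from the paper.

```python
# Illustrative storage counts (number of scalar values) for the memory
# regimes discussed above; n and N are hypothetical example sizes.
n, N = 10, 1_000_000

data_entries = n * N                  # storing the raw signal: O(nN)
cumulant_entries = n ** 4             # a 4th order cumulant tensor: O(n^4)
sketch_entries = 2 * n * (n + 1)      # a sketch scaling as O(n^2)

print(data_entries, cumulant_entries, sketch_entries)
# → 10000000 10000 220
assert sketch_entries < cumulant_entries < data_entries
```

Even at this modest n, the O(n^2) sketch is orders of magnitude smaller than either storing the data or the full cumulant tensor.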

1.1 Contributions and Outline

Below, we highlight the main contributions of the paper:

• Focusing on the higher order statistics method of ICA, we show that an independent component model set exists in the space of 4th order cumulant tensors and we state the set’s dimensionality.

• We prove that the well-known restricted isometry property (RIP) holds for projections of the cumulant tensor on the model set with a sketch size m = O(n^2), which is of the order of the dimension of the model set. Given a sketch constructed from random Gaussian ensembles, we show that the reconstruction error is stable when measuring a signal which lives close to, but not on, the ICA model set, through the existence of an instance optimal decoder.

• We propose two inherently different CICA algorithms, an alternating steepest descent algorithm and an iterative projection gradient algorithm, and show that they successfully recover the ICA mixing matrix with overwhelming probability provided that the sketch size scales as O(n^2). We analyse the proposed CICA algorithms on synthetic and real datasets, showing that the CICA scheme achieves substantial memory savings over existing ICA methods whilst retaining competitive estimation accuracy.

• We analyse the tradeoff between the compression rate of the sketch and the overall statistical efficiency of the sketch estimate.

1.2 Related Works

1.2.1 Existing Compressive Learning Models

The framework of CL has been successfully applied to a host of learning tasks and models with the desired outcome of reducing the complexities associated with signal acquisition, computation and memory storage. In [keriven2018sketching], Keriven et al. proposed a CL framework for mixture models, in particular the mixture of Gaussian distributions and k-means learning tasks. In both cases, a sketch is constructed by randomly sampling the characteristic function of the mixture model, which can equivalently be seen as taking random Fourier features of the data [rahimi2008random]. The compact representational sketch of each mixture model scales as O(kd), where k is the number of mixture components in the model and d is the dimension of the data’s feature space. As a result, a compressive mixture model algorithm was proposed that had both computational and space complexities that scale independently of the number of data points N. In [gribonval2021compressive], a compressive principal component analysis (PCA) framework was proposed. As will be discussed in Section 2.2, the compressive PCA methodology aligns closely with our compressive ICA framework. Distinct from compressive mixture models, the compressive PCA method is left distribution free, and it is assumed that the data lives on, or can be approximately modelled by, a k-dimensional subspace. As a result, a sketch of size O(kd) can be computed by taking a random projection of the covariance matrix of the data, hence reducing the memory complexities of storing either the data of size O(Nd) or the covariance matrix of size O(d^2).
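To make the compressive PCA idea concrete, the following toy sketch compresses a rank-k covariance matrix into m < d^2 random Gaussian measurements and recovers it with a crude normalised iterative hard thresholding decoder. The sizes d, k, m and the decoder are our own illustrative choices, not the construction used in [gribonval2021compressive].

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 16, 2, 5000        # hypothetical feature dim, rank, sample count

# Data lying exactly on a k-dimensional subspace, so the empirical
# covariance Sigma is exactly rank k.
U = np.linalg.qr(rng.standard_normal((d, k)))[0]
X = rng.standard_normal((N, k)) @ U.T
Sigma = X.T @ X / N

# Sketch: m random Gaussian measurements of Sigma, with m < d^2 = 256.
m = 200
A = rng.standard_normal((m, d * d)) / np.sqrt(m)
y = A @ Sigma.ravel()

# Toy decoder: normalised iterative hard thresholding onto rank-k matrices.
Z = np.zeros((d, d))
for _ in range(300):
    g = (A.T @ (y - A @ Z.ravel())).reshape(d, d)
    g = (g + g.T) / 2                       # keep the iterate symmetric
    mu = np.sum(g * g) / (np.linalg.norm(A @ g.ravel()) ** 2 + 1e-12)
    w, V = np.linalg.eigh(Z + mu * g)
    idx = np.argsort(np.abs(w))[-k:]        # keep the k dominant eigenpairs
    Z = (V[:, idx] * w[idx]) @ V[:, idx].T

err = np.linalg.norm(Z - Sigma) / np.linalg.norm(Sigma)
print(err)
assert err < 0.5
```

The point is qualitative: m numbers, rather than the d^2 entries of the covariance matrix, suffice to approximately identify a low-rank covariance.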

1.2.2 Generalised Method of Moments

Compressive learning is similar to the technique of the Generalised Method of Moments (GeMM) [hansenGeMM, gemmhall], where the parameters of interest are estimated by matching a collection of generalised moments of the distribution with the empirical counterparts calculated from the data. In most cases it is used instead of maximum likelihood estimation when calculating the likelihood is not tractable. CL differs from much of the GeMM literature as the goal is fundamentally different: in compressive learning one attempts to construct a compact representation of the data with the aim of reducing complexity constraints (computation, memory, acquisition), whilst in GeMM the goal is primarily to estimate the parameters when the model is either partially specified or the likelihood does not have a closed form. Moreover, the selected generalised moments may be a function of the parameter being estimated, hence not providing a one-off sketch.

1.2.3 Streaming Methods

Closely related to CL is the collection of streaming methods [cormode2012synopses, tropp2019streamingScientific], where data items are seen and queried only once by the user and then discarded. This is of particular interest when the summary statistic of choice is updated and maintained in real time, for example in the online learning setting [1198387], to reduce space complexities. Notably, the count-min sketch [cormodea2005improved] was developed to query data in an online fashion, with the application of maintaining histograms of quantiles. However, these methods in general focus on discrete collections of objects and database queries, while in CL the framework is applied to machine learning tasks where typically the signal in question is continuous. Tropp et al. [tropp2017practical] proposed a streaming framework for large scale PCA. In particular, in [tropp2019streamingScientific], the authors design random sketches for on-the-fly compression of data matrices associated with large scale scientific simulations. Here the data matrix of interest can be decomposed into a sequence

 A=H1+H2+H3+… (1)

where it is assumed each H_i has some structural redundancy, for example sparsity or low-rankness. These methods have a subtle yet fundamental difference from CL: in CL the structural assumptions exploited to form the sketch arise from the model or distribution itself, while in these streaming methods the structural assumptions come directly from the data. Moreover, several passes of the data may be required to reduce the low-rank approximation error [tropp2017practical].

1.2.4 Other Compression Techniques

Coresets are a popular method for compressing a database into a summary statistic used for inferring the parameters of a given model, and have been used primarily for subspace clustering based tasks [har2004coresets, feldman2020turning]. In a similar vein to CL, the compact data representation has a size that typically scales independently of the number of input items and the dimension of the input feature space. However, coresets are constructed in a hierarchical manner, possibly resulting in multiple passes of the data, and are therefore not naturally amenable to online or distributed learning. Projections, including both random projections and feature selection

[boutsidis2010random, calderbank2009compressed], are used widely to reduce the dimensionality of the data. In [calderbank2009compressed], datasets were randomly projected into a compressed domain using both random Gaussian and Bernoulli matrices. In a similar vein to compressive sensing [Donoho_CS], the data was assumed to be sparse, thereby removing the dependency on the feature space dimension within the space and acquisition complexities. In contrast to random projections, more structured projections have also been proposed. In [tang2014feature], different feature selection techniques for classification are reviewed, including structured graph methods and the use of embedded models. The well-known PCA method is a popular preprocessing technique that projects the dataset onto a k-dimensional subspace of maximal variance [jolliffe2016principal]. In both random and structured projections, the methods discussed only tackle the dependency on the feature space dimension and do not address the challenges posed by a large data size N. Sub-sampling methods are also popular for dimensionality reduction, whereby a subset of the original dataset is used for learning. As discussed previously, the method of coresets [har2004coresets, feldman2020turning] is a sub-sampling technique that attempts to sub-select dominant items that well approximate the structure of the dataset. Other sub-sampling techniques include random and adaptive sub-sampling [cormode2012synopses]. The disadvantage of sub-sampling techniques is the risk of discarding important information relating to non-sampled data items. Moreover, these techniques only tackle the constraint on the number of data items N and do not combat the complexity issues posed by the feature space dimension d.

Specific to ICA compression, Sela et al. [sela2016randomized] used kernel approximation techniques to reduce the dimensions of the kernel ICA method proposed by Bach [bach2002kernel]. Random Fourier features are used to approximate the kernel, reducing the memory complexity from O(N^2) to O(ND), where D is the number of random Fourier weights used. Despite the reduction in memory complexity, the algorithm still has storage demands which scale linearly with the data length N. In comparison, within our framework we remove the dependency on the data length completely when estimating the ICA mixing matrix.

2 Background

2.1 Compressive Learning

Let x_1, …, x_N be independent and identically distributed samples from an unknown probability distribution π on (X, B), where X is some Euclidean space and B is a Borel σ-field. Classically, π is parametrized by some parameters denoted by θ ∈ Θ. A statistical learning problem can be formalised as follows: find a hypothesis h from a hypothesis class H that best matches the probability distribution π over the training collection, given some data fidelity term. Given a loss function l : X × H → R, this is equivalent to minimizing the risk defined as

 h* = argmin_{h∈H} R(π, h) = argmin_{h∈H} E_{x∼π} l(x, h). (2)

Formally, the model set associated to the hypothesis class can be defined as:

 S_H := {π ∈ P(X) : ∃ h ∈ H, R(π, h) = 0}. (3)

In other words, the model set contains all distributions for which zero risk is achievable. As a result, the model set has a dimension which is intrinsic to the hypothesis class of the model. In practice, one cannot minimize the true risk, as we generally do not have access to the true distribution π. Instead, one can minimize the empirical risk with respect to the finite samples of the true distribution; as a result, all the data may be required to be stored in memory.

In CL [keriven2018sketching, gribonval2021compressive, gribonval2021sketching], we find a compact representation, or so-called sketch, that encodes some statistical properties of the data. Its size is ideally chosen relative to the intrinsic complexity of the problem, making it possible to work with arbitrarily large datasets while storing in memory an object of fixed size. Given a feature function Φ : X → R^m (or C^m), such that Φ is integrable with respect to any π ∈ P(X), define a linear operator A by

 A(π) := E_{x∼π} Φ(x). (4)

The sketch defined in (4) can be seen as taking the expectation of some particular features of the distribution π, which is similar to the field of kernel mean embeddings [muandet2017kernel], where one uses feature maps to embed probability distributions. Therefore, we would like to construct Φ so that A(π) captures sufficiently relevant information of the data to allow us to infer the parameters of the model directly from the sketch. As a trivial example, if we seek to infer only the mean μ of a normal distribution, the construction Φ(x) = x, for which A(π) = E_{x∼π}[x] = μ, would constitute a trivial yet sufficient sketch. In reality, CL is applicable to much more complex models where the feature function is non-trivial and the model may not necessarily possess a finite dimensional sufficient statistic independent of the data. The goal of CL is therefore to construct a sketch of size m that captures enough information to recover an estimated risk which is close to the true risk with high probability [gribonval2021compressive]. In practice, as in the kernel mean embedding literature [muandet2017kernel], the empirical distribution is used to form an empirical sketch defined as

 ŷ = A(π_N)  where  π_N := (1/N) ∑_{i=1}^N δ_{x_i}, (5)

denoting by δ_x the Dirac distribution on x, and therefore the empirical sketch can be formed directly from the data. By the law of large numbers, A(π_N) → A(π) as N → ∞. Once the sketch has been computed, one can discard the dataset from memory. As a result, CL reduces to solving an inverse problem of the form

 θ̂ = argmin_{θ∈Θ} C(θ ∣ ŷ), (6)

where C is a cost function designed for the specific learning task at hand. In a compressive sensing light, we can exploit structural assumptions on the model set and the associated parameter space Θ, e.g. sparsity, low-rankness, or low dimensional manifold properties, to make (6) well-posed and finding a solution tractable. As such, one can design a decoder Δ that exploits the structural assumptions of the model set to recover the parameters of the model from the sketch whilst minimizing the risk. The sketching operator and the decoder form the pair (A, Δ) that defines the CL algorithm for a specific learning problem. It should be noted that minimizing (6) plays the role of a proxy for minimizing the empirical risk as, by definition of the model set in (3), any π ∈ S_H has zero loss in expectation [gribonval2021compressive].
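As a minimal numerical illustration of (4) and (5), the snippet below uses the trivial feature map Φ(x) = x mentioned above, so the empirical sketch is simply the sample mean and converges to A(π) by the law of large numbers; the distribution and sample size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sketch: for a Gaussian with unknown mean, Phi(x) = x is a
# sufficient feature map, so the empirical sketch (eq. (5)) is the
# sample mean and A(pi) (eq. (4)) is the true mean mu.
mu = np.array([2.0, -1.0])
X = rng.normal(mu, 1.0, size=(100_000, 2))

def sketch(X):
    # Empirical sketch: y_hat = (1/N) sum_i Phi(x_i), with Phi(x) = x.
    return X.mean(axis=0)

y_hat = sketch(X)
print(y_hat)   # close to the true mean by the law of large numbers
assert np.allclose(y_hat, mu, atol=0.05)
```

Once y_hat is stored, the 100,000 samples can be discarded; the mean is then inferred from two numbers instead of the full dataset.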

2.2 Compressive Principal Component Analysis

In Section 2.1, the framework of CL was discussed in a general manner without specific consideration of the distributional form of the model. As will be discussed in Section 2.3, the PCA and ICA models are similar in nature in that the model is often left distribution free. In other words, the distribution of the sampled data is left unspecified. In Table 1, it is shown that the compressive PCA model set [gribonval2021compressive] is defined as

 S_H = {π ∣ rank(Σ_π) ≤ k}. (7)

Due to the distribution free assumption of the PCA model, we seek structural assumptions that are manifested within some intermediary statistic space to make computing a sketch possible [sheehan2019compressive]. In the case of compressive PCA, the space of covariance matrices is leveraged as an intermediary statistic space where the low rank of the covariance matrices is exploited. Figure 1 depicts a geometric viewpoint of both compressive parametric learning (e.g. k-means, GMM) and distribution free compressive learning (e.g. PCA, ICA). In general, distribution free CL poses distinct challenges and advantages compared with the typical parametric CL framework [sheehan2019compressiveSPM]. Challenges arise when choosing an intermediary statistic space, for instance: (1) what set of intermediate statistics can we use? (2) how do the structural assumptions of the model set manifest within the intermediate statistic? Equally, there are many advantages. Specifically, by leveraging some set of intermediate statistics we have implicitly mapped the problem from an infinite dimensional probability space to a typically finite dimensional statistic space. As a result, we can utilise a host of existing techniques from the compressive sensing literature to design encoder and decoder pairs (A, Δ). Moreover, it also allows us to use a more flexible semi-parametric model that is only partially specified. As will be discussed in Section 4, the compressive ICA framework follows a similar convention, where the space of 4th order cumulant tensors is used as an intermediary statistic space to exploit structural assumptions of the model set to form a sketch.

2.3 Independent Component Analysis

ICA is used frequently in the machine learning and signal processing communities to identify latent variables that are mutually independent of one another. Consider a data vector x ∈ R^n; then the problem of ICA concerns finding a mixing matrix M (here we assume that M is square, i.e. there are as many mixtures as sources) such that

 x=Ms, (8)

where s ∈ R^n and the components s_i are statistically independent:

 p(s_1, s_2, …, s_n) = ∏_{i=1}^n p_i(s_i). (9)

The data point x is only one realisation of a data matrix or signal of length N, so we attempt to infer M from the collective set of linear equations X = MS, where X and S collect the N realisations of x and s respectively. There are many techniques and methods in the literature to solve the ICA problem. The simplest method is to assume the distributional form of each of the independent components and then solve the ICA problem through a maximum likelihood approach [BellICA]. In practice, the distributions are not known a priori, so in most methods the distributions are left unspecified. As a result, practitioners and researchers often resort to optimizing a given contrast function to solve the ICA problem.
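The generative model x = Ms can be simulated in a few lines; the mixing matrix and the uniform sources below are hypothetical choices for illustration. Mixing leaves the sources' independence hidden: the observed coordinates become correlated with covariance MM^T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 50_000

# Independent, unit-variance, non-Gaussian sources (uniform distribution).
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, N))

# A hypothetical (illustrative) square mixing matrix, as in x = Ms.
M = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])
X = M @ S

# The sources are uncorrelated, but the observed mixtures are not:
# Cov(x) = M Cov(s) M^T = M M^T has non-zero off-diagonal entries.
assert np.allclose(np.cov(S), np.eye(n), atol=0.05)
assert np.allclose(np.cov(X), M @ M.T, atol=0.05)
```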

2.3.1 Prewhitening

A popular preprocessing trick for ICA is to prewhiten the data beforehand. This involves finding the matrix P such that

 z = P^{−1} x, (10)

where z has the identity covariance matrix. This initial preprocessing step, which can be executed through singular value decomposition based techniques, uncorrelates the mixed components and handles the case where there are more mixture components than independent sources. Moreover, it has the advantage that the matrix to be found is necessarily orthogonal and square. For the sake of presentation, we will consider the whitened version of the data for the remainder of this section, with the corresponding whitened ICA equation

 z=Qs. (11)

In Section 4.3, we propose two equivalent sketching frameworks that can incorporate either prewhitened or unwhitened data.
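Prewhitening can be sketched numerically via an eigendecomposition of the empirical covariance; the mixing setup below is the same hypothetical toy model as above. Note how, after whitening, the remaining unknown Q = P^{-1}M is approximately orthogonal, as claimed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 3, 100_000

# Toy mixing setup (hypothetical M, unit-variance uniform sources).
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, N))
M = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])
X = M @ S

# Prewhitening: eigendecompose the empirical covariance C = E diag(w) E^T
# and set P^{-1} = diag(w^{-1/2}) E^T, so that z = P^{-1} x has Cov(z) = I.
w, E = np.linalg.eigh(np.cov(X))
P_inv = np.diag(w ** -0.5) @ E.T
Z = P_inv @ X

assert np.allclose(np.cov(Z), np.eye(n), atol=1e-6)

# The matrix still to be found, Q = P^{-1} M, is (approximately) orthogonal.
Q = P_inv @ M
assert np.allclose(Q @ Q.T, np.eye(n), atol=0.1)
```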

2.3.2 Cumulant Based ICA

Tensorial or cumulant based methods are a group of techniques used to solve the ICA problem and are of particular interest in this paper. Statistical properties of the data instance z can be described by its cumulants. In the multivariate setting, cumulants give rise to tensors, denoted Z^K for a cumulant tensor of order K. Assuming the data has zero mean, the first four cumulants are defined [de1997signal] as

 Z^1_i = 0
 Z^2_{ij} = E[z_i z_j]
 Z^3_{ijk} = E[z_i z_j z_k]
 Z^4_{ijkl} = E[z_i z_j z_k z_l] − E[z_i z_j] E[z_k z_l] − E[z_i z_k] E[z_j z_l] − E[z_i z_l] E[z_j z_k] (12)

where E is the expectation operator. Given the model in (11), z = Qs, the following multilinear property holds for the associated cumulant tensors:

 Z^K = S^K ×_1 Q ×_2 Q ×_3 ⋯ ×_K Q, (13)

where ×_k represents the k-mode tensor-matrix product and S^K represents the order-K cumulant tensor of the independent source signals [de1997signal]. In this paper we will only consider 4th order cumulant tensors (K = 4), and for the sake of simplified notation we shall drop the superscript in (13) for the rest of the discussion. We denote by Z the space of 4th order cumulant tensors which account for the symmetry in (12), where each cumulant tensor has a maximum of n(n+1)(n+2)(n+3)/24 unique entries (degrees of freedom) [COMONTensorDiag]. The diagonal entries Z_iiii are the auto-cumulants of z, while the off-diagonal entries are the cross-cumulants. If the variables are statistically independent then, as seen from (12), the cross-cumulants vanish, resulting in a strictly diagonal cumulant tensor. In other words, independence implies diagonality. It is shown in [comon2009tensor] that under mild conditions the converse is also true, i.e. diagonality implies independence. Cumulant based ICA can therefore be seen as finding a linear transformation W such that the resulting cumulant tensor

 S := Z ×_1 W ×_2 W ×_3 W ×_4 W (14)

is strictly diagonal. We can define the following ICA model set:

 S_H := {Z ∣ Z = S ×_1 Q ×_2 Q ×_3 Q ×_4 Q, Q ∈ O(n), S ∈ D}, (15)

where D is the set of diagonal cumulant tensors, defined formally as

 D := {S ∣ S_ijkl = 0 ∀ ijkl ≠ iiii and S_iiii ≥ ε_S}. (16)

Here, we have the additional requirement that each diagonal cumulant is greater than or equal to a small constant ε_S > 0.¹ The expected cumulant tensor is typically not known exactly owing to finite data length approximations and non-Gaussian additive noise [de2000introduction2ICA], and so in general Z cannot be fully diagonalized by a linear transform. As a result, contrast functions are used to approximately diagonalize Z and maximize the independence of the system.

¹A standard requirement in ICA is that at most one diagonal cumulant can be zero, which arises from the ICA assumption that at most one source signal is Gaussian [hyvarinen2000independent]. Here we have the slightly stronger assumption that all source signals are non-Gaussian.
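The definitions in (12) can be checked numerically. The sketch below estimates the empirical 4th order cumulant tensor of hypothetical independent uniform sources (chosen because their excess kurtosis is exactly −6/5) and verifies that independence makes the cross-cumulants vanish.

```python
import numpy as np

def cumulant4(Z):
    """Empirical 4th order cumulant tensor (eq. (12)) of zero-mean data Z (n x N)."""
    n, N = Z.shape
    m2 = Z @ Z.T / N                                      # E[z_i z_j]
    m4 = np.einsum('ia,ja,ka,la->ijkl', Z, Z, Z, Z) / N   # E[z_i z_j z_k z_l]
    return (m4
            - np.einsum('ij,kl->ijkl', m2, m2)
            - np.einsum('ik,jl->ijkl', m2, m2)
            - np.einsum('il,jk->ijkl', m2, m2))

rng = np.random.default_rng(2)
n, N = 3, 200_000

# Independent unit-variance uniform sources: excess kurtosis is -6/5.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, N))
K = cumulant4(S)

# Independence implies (approximate) diagonality of the cumulant tensor.
diag = np.array([K[i, i, i, i] for i in range(n)])
cross = K.copy()
for i in range(n):
    cross[i, i, i, i] = 0.0
assert np.allclose(diag, -1.2, atol=0.05)   # auto-cumulants near -6/5
assert np.abs(cross).max() < 0.05           # cross-cumulants near zero
```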

2.3.3 Contrast Functions

A contrast function is a mapping from the space of distributions to the real line and is a measure of the statistical independence between latent variables in a linear system [comon1994independent, hyvarinen2000independent]. For a function to be a contrast function it must be permutation and scaling invariant, and also maximized if and only if the distributions are statistically independent [comon1994independent]. For instance, the negative mutual information satisfies the contrast conditions, although it can be difficult to estimate in practice. Typically, contrast functions are tractable approximations of information theoretic measures such as negative mutual information, maximum likelihood and negentropy. Comon proposed various cumulant based contrast functions in [comon1994independent] that are Edgeworth expansions of information theoretic measures. The simplest is given by

 Ψ(W) = ∑_{i=1}^n (Ŝ_iiii)². (17)

where Ŝ is the 4th order cumulant tensor corresponding to the variable ẑ = Wz. When Ψ is maximized, the components of ẑ are independent, giving ẑ = s and Ŝ = S up to permutation and scaling. A 4th order cumulant tensor that has a decomposition as given in (14), and is therefore a member of the model set S_H, maximizes any given cumulant based contrast function [comon1994independent], and hence minimizes the associated information theoretic measure and the risk in (2). For further details, comprehensive reviews of cumulants and tensors can be found in [comon1994independent, de1997signal].
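The behaviour of a cumulant contrast can be seen in a small experiment: rotating independent sources mixes them and strictly lowers the contrast. The squared-auto-cumulant form below and the toy uniform sources are our illustrative choices (the squared form is insensitive to the sign of the kurtosis), not necessarily the exact contrast analysed in the paper.

```python
import numpy as np

def auto_cumulants(Z):
    # Auto-cumulants kappa_i = E[z_i^4] - 3 E[z_i^2]^2 for zero-mean rows.
    m2 = (Z ** 2).mean(axis=1)
    m4 = (Z ** 4).mean(axis=1)
    return m4 - 3 * m2 ** 2

def contrast(Z):
    # Squared-auto-cumulant contrast (illustrative, sign-agnostic variant).
    return np.sum(auto_cumulants(Z) ** 2)

rng = np.random.default_rng(5)
n, N = 2, 200_000
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, N))   # kurtosis -1.2 each

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Independent sources: contrast ~ 2 * 1.2^2 = 2.88. A 45-degree rotation
# halves each kurtosis in magnitude, dropping the contrast to ~0.72.
assert abs(contrast(S) - 2.88) < 0.1
assert contrast(S) > contrast(Q @ S)
```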

As discussed in Section 2.2, the space of 4th order cumulant tensors will act as an intermediary statistic space. It is well documented through identifiability results in the cumulant based ICA literature [comon1994independent, de2000introduction2ICA] that the parameters of the ICA model, namely the mixing matrix M, can be estimated solely from the 4th order cumulant tensor. As such, the 4th order cumulant tensor can be seen in its own right as a sketch, albeit an inefficient one with respect to compression, being of size O(n^4). In the next section, we motivate the principles behind sketching the 4th order cumulant tensor to form a compact representational sketch of size O(n^2).

3 Compressive Learning Principles for Cumulant ICA

It was discussed in Section 2.3.2 that the model set of the ICA problem defined in (15) maximizes any given cumulant based contrast function [comon1994independent]. The model set is itself a low-dimensional space residing in the space of cumulant tensors Z. Specifically, S_H can be described as the product set of the set of orthogonal matrices, denoted O(n), and the set of diagonal cumulant tensors D defined in (16). We can therefore count the degrees of freedom of the model set S_H:

• D — a maximum of n degrees of freedom on the leading diagonal.

• O(n) — a maximum of n(n−1)/2 degrees of freedom [Szarek_OrthogonalCovering].

In total, the model set has n(n+1)/2 degrees of freedom. In comparison, the space of 4th order cumulant tensors, in which the model set resides, has O(n^4) degrees of freedom. As the model set is of low complexity, in principle we can form a sketch of the 4th order cumulant tensor and estimate the parameters of the ICA model, namely the mixing matrix M, solely from the sketch. The sketch of the 4th order cumulant tensor is defined by

 y_w = A(Z), (18)

where w denotes that the sketch is acting on the whitened data z. The computation of the sketch is closely related to the sketching method of compressive PCA highlighted in Table 1. Akin to compressive PCA, the sketching operator acts on the finite dimensional space of 4th order cumulant tensors instead of the infinite dimensional probability space, which is left unspecified due to the nature of the ICA model. The ICA sketch defined in (18) draws strong connections to finite dimensional compressive sensing [candes2008introduction, Donoho_CS], where limited (random) measurements of a finite dimensional sparse vector are taken to reduce the complexities associated with signal acquisition. Throughout the compressive sensing literature [candes2008introduction, Donoho_CS, candes2011tight], the restricted isometry property (RIP) is a fundamental tool that is extensively used to show that a sketching operator stably embeds elements of the model set into a compressive domain, provided that the sketch dimension is of sufficient size. In other words, given a sketching operator A, it proves that the distance between every pair of signals in the model set is approximately preserved under the action of the sketch, therefore providing a near isometry. In the case of compressive ICA, given two elements Z_1, Z_2 of the ICA model set and an RIP constant δ ∈ (0, 1), then

 (1−δ)‖Z_1−Z_2‖² ≤ ‖A(Z_1−Z_2)‖² ≤ (1+δ)‖Z_1−Z_2‖² (19)

provided that the sketch size is of sufficient dimension. In many cases, it suffices for the sketch size to be of the order of the degrees of freedom of the model set. In [bourrier2014fundamental, BlumensathIPG11], it is proved that if the lower RIP (LRIP) holds for a given sketching operator A, i.e. the left-hand inequality of (19), then there exists a robust decoder that recovers a signal from the model set in a manner stable with respect to noise and to signals that lie close to the model set. Moreover, it is proved in [bourrier2014fundamental] that if the LRIP holds for the sketching operator on the model set, then the decoder can be taken to be the constrained optimization

 Δ(y_w, A) ∈ argmin_{Z∈S_H} ‖y_w − A(Z)‖_2. (20)

In principle, if the RIP can be proved for a sketching operator on the ICA model set S_H, then we have an optimization strategy for solving the compressive ICA problem.
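The near isometry in (19) can be probed empirically: draw random elements of the ICA model set (a positive diagonal tensor multiplied in every mode by a random orthogonal matrix; the parameter ranges below are our illustrative choices), project their differences with a Gaussian map of m = O(n^2) rows, and check that norms are approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
p = n ** 4

def random_model_tensor(rng):
    # Random element of the ICA model set (eq. (15)): a positive diagonal
    # tensor multiplied in every mode by a random orthogonal matrix.
    Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
    S = np.zeros((n, n, n, n))
    for i in range(n):
        S[i, i, i, i] = rng.uniform(0.5, 2.0)
    return np.einsum('abcd,ia,jb,kc,ld->ijkl', S, Q, Q, Q, Q).ravel()

# Gaussian sketching operator with m = O(n^2) rows; the compression
# ratio m / n^4 improves as n grows, since m/p scales like n^{-2}.
m = 8 * n * (n + 1)
A = rng.standard_normal((m, p)) / np.sqrt(m)

ratios = []
for _ in range(100):
    diff = random_model_tensor(rng) - random_model_tensor(rng)
    ratios.append(np.linalg.norm(A @ diff) / np.linalg.norm(diff))

# Near isometry: the ratios concentrate around 1, as in (19).
assert 0.7 < min(ratios) and max(ratios) < 1.3
```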

4 Compressive Independent Component Analysis Theory

We begin by explicitly defining the sketching operator as

 A(Z)=Avec(Z), (21)

where A ∈ R^{m×n⁴} and vec denotes the vectorization operator. Here we assume A is some random measurement matrix whose entries are sampled according to some distributional law. In this paper, we consider two randomized linear dimension reduction maps, namely the Gaussian map and the subsampled randomized Hadamard transform (SRHT), stated below. The CICA RIP, our main result stated in Theorem 4.1, is proved using the Gaussian map; however, fast Johnson–Lindenstrauss transforms (FJLTs), for instance the SRHT, still work in practice, as will be discussed in Section 6.

4.0.1 Gaussian Maps

The most traditional randomized linear dimension reduction map is the subgaussian matrix, which has been used extensively in the CS literature [candes2008introduction, Donoho_CS]. In the Gaussian case, the entries follow

 Aij∼N(0,m−1/2). (22)

Gaussian maps typically require O(mn4) in memory, as well as exhibiting a computational complexity of O(mn4) to compute the sketch.
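Concretely, the Gaussian sketch in (21) is a single matrix-vector product with the vectorized cumulant tensor. The following is a minimal NumPy illustration; the dimensions and the variance-1/m scaling (chosen so that squared norms are preserved in expectation) are our example choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 60                              # example sizes: n sources, sketch size m

# an arbitrary 4th order tensor standing in for a cumulant tensor
Z = rng.standard_normal((n, n, n, n))

# Gaussian map: m x n^4 matrix with i.i.d. zero-mean entries of variance 1/m
A = rng.standard_normal((m, n ** 4)) / np.sqrt(m)

# the sketch A(Z) = A vec(Z) is a single matrix-vector product
y = A @ Z.ravel()
```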

4.0.2 Subsampled Randomized Hadamard Transform

The SRHT is an instance of an FJLT that approximates the properties of the full Gaussian map [krahmer2011new]. Here A is defined as

 A=(1/√mp)DHSsub, (23)

where

• D is a diagonal matrix whose elements are independent random signs ±1.

• H is a normalised Walsh-Hadamard matrix.

• Ssub is a matrix consisting of a subset of randomly sampled rows from the identity matrix.
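The SRHT can be applied without ever forming the matrix explicitly, by combining the random signs, a fast Walsh-Hadamard transform and row subsampling. Below is a minimal illustrative implementation (our own sketch; padding to a power of two and the √(d/m) rescaling are standard choices that preserve squared norms in expectation):

```python
import numpy as np

def fwht(x):
    """Unnormalised fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def srht(v, m, rng):
    """Apply an m-row SRHT-style sketch to v without forming the matrix."""
    d = len(v)
    d_pad = 1 << (d - 1).bit_length()                  # pad to a power of 2
    x = np.zeros(d_pad)
    x[:d] = v
    signs = rng.choice([-1.0, 1.0], size=d_pad)        # diagonal D of random signs
    x = fwht(signs * x) / np.sqrt(d_pad)               # normalised Hadamard transform H
    rows = rng.choice(d_pad, size=m, replace=False)    # subsampling S_sub
    return np.sqrt(d_pad / m) * x[rows]
```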

The SRHT is considerably cheaper to compute and store in comparison to the Gaussian map. As we do not explicitly store A, the SRHT only requires O(n4) in memory [tropp2011improved]. In addition, the computational complexity of computing the sketch reduces to O(n4 log n), in comparison to the O(mn4) cost of the Gaussian map [ailon2006approximate, tropp2011improved]. Below we state the main result of the paper.

Theorem 4.1 (Compressive ICA RIP).

Let δ, ξ∈(0,1) and let A denote the Gaussian map sketching operator defined in (21) with entries distributed as in (22). Then A satisfies the RIP in (19) with constant δ and probability at least 1−ξ provided that

 m≥(C/δ2)max{2n(n+1)log(C5/δ),log(6/ξ)}, (24)

where C is an absolute constant and C5 is a constant defined in Lemma 4.1.

The proof of Theorem 4.1 is detailed in Section 4.1.

Corollary 4.1 (Information Preservation).

Let Z∗ be an arbitrary 4th order cumulant tensor and denote yw=A(Z∗)+e, where e is some additive noise. Furthermore, let Δ(yw,A) denote the solution to (20). Given that A satisfies the RIP in Theorem 4.1, then with probability at least 1−ξ

 ∥Z∗−Δ(yw,A)∥F≤C1 dF(Z∗,SH)+C2∥e∥2+ε, (25)

where dF(Z∗,SH) denotes the Frobenius distance from Z∗ to the model set, C1 and C2 are constants depending on the RIP constant, and ε is a small positive constant.

Proof.

Given the LRIP in Theorem 4.1, we use Theorem 7 in [bourrier2014fundamental] to obtain our result. ∎

The proof of Theorem 4.1 uses covering numbers and ϵ-nets of the normalized secant set of SH.

Definition 4.1 (Secant Set).

The secant set of a set SH is defined as

 SH−SH:={Y=Z1−Z2∣Z1,Z2∈SH}. (26)
Definition 4.2 (Normalised Secant Set).

The normalized secant set of a set SH is defined as

 N(SH):={Y/∥Y∥F∣Y∈(SH−SH)∖{0}}, (27)

where 0 defines the zero tensor.

Definition 4.3.

(Covering number) Let ϵ>0. The covering number CN(S,∥⋅∥,ϵ) of a set S is the minimum number of closed balls of radius ϵ, with respect to the norm ∥⋅∥, with centres in S needed to cover S. The set of centres of these balls is a minimal ϵ-net for S.

Lemma 4.1 (Covering number of N(SH−SH)).

The covering number of N(SH−SH) with respect to the Frobenius norm is

 CN(N(SH−SH),∥⋅∥F,ϵ)≤(C5/ϵ)2n(n+1), (28)

where C5 is some constant.

Proof.

See Appendix A.1. ∎

Definition 4.4.

(Upper box counting dimension) The upper box counting dimension of a set S is defined as

 dimB(S):=limsupϵ→0log[CN(S,∥⋅∥,ϵ)]/log[1/ϵ]. (29)

4.1 Proof of Theorem 4.1

Proof.

To prove that a RIP holds for the ICA model set using the sketching operator defined in (21), we follow a similar line of argument to [candes2011tight, rauhut2017low] by using an ϵ-covering of the normalised secant set to extend the concentration results of the random Gaussian matrix uniformly over the whole low-dimensional set. Specifically, we use the recipe framework proposed by Puy et al. [puyrecipes] to formulate the compressive ICA RIP proof. The proof proceeds by showing that the following assumptions hold:

• The normalised secant set, denoted N(SH), has finite upper box counting dimension which is strictly bounded by some s>0 (A1),

• The sketching operator A satisfies certain concentration inequalities (A2) [puyrecipes].

We begin with Assumption (A1). Using Lemma 4.1 and the definition of the upper box counting dimension in Definition 4.4, it can be seen that dimB(N(SH))≤2n(n+1), so for any s>2n(n+1) we satisfy Assumption (A1). To prove Assumption (A2), we have the following definition.

Definition 4.5.

(Subgaussian random variable) A subgaussian random variable X is a random variable that satisfies

 (E|X|q)1/q≤C1√q for all q≥1,

with C1>0. The subgaussian norm of X, denoted by ∥X∥Ψ2, is the smallest C1 for which this property holds, i.e.,

 ∥X∥Ψ2:=supq≥1 q−1/2(E|X|q)1/q.

Let Ai denote the ith row of the random Gaussian matrix A. Then we use the fact [vershynin2010introduction, puyrecipes] that

 ∥ATivec(Z)∥Ψ2≤D∥Z∥F (30)

for all Z, where D is an absolute constant. Therefore the required concentration inequalities hold and Assumption (A2) is satisfied. Finally, using Theorem 8 of [puyrecipes], we get the desired RIP result in Theorem 4.1. ∎

4.2 Finite Sample Effects

In practice, the sketch is constructed from a finite set of data {zi}Ni=1 such that

 ^yw=1NN∑i=1Φw(zi), (31)

where Φw is the feature function discussed in Section 2.1 acting on the whitened data z. For compressive ICA we can explicitly define the feature function, acting on the whitened data, as

 Φw(z)=⟨Aj,z⊗4⟩F, (32)

for j=1,…,m, where the Aj are the rows of the Gaussian matrix A reshaped into n×n×n×n tensors and ⟨⋅,⋅⟩F denotes the Frobenius inner product. In other words, the feature function takes random quartics of the data point z. Note that the empirical sketch is equivalent to Avec(^Z), as specified in (5), where ^Z is the finite data approximation of the 4th order cumulant tensor defined by

 ^Z4ijkl=(1/N)∑Nt=1 zti ztj ztk ztl − ^Cij^Ckl − ^Cik^Cjl − ^Cil^Cjk, where ^Cij=(1/N)∑Nt=1 zti ztj. (33)
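This finite-sample estimate can be computed directly with einsum; a small illustrative NumPy implementation (our own, assuming zero-mean whitened data, with O(Nn⁴) cost):

```python
import numpy as np

def empirical_cumulant4(Z):
    """Empirical 4th order cumulant tensor of zero-mean data Z of shape (N, n)."""
    N, n = Z.shape
    M4 = np.einsum('ti,tj,tk,tl->ijkl', Z, Z, Z, Z) / N   # 4th order moments
    C = Z.T @ Z / N                                       # sample covariance
    return (M4
            - np.einsum('ij,kl->ijkl', C, C)
            - np.einsum('ik,jl->ijkl', C, C)
            - np.einsum('il,jk->ijkl', C, C))

# example: independent +/-1 sources have kurtosis E[z^4] - 3 = -2
Z = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
K = empirical_cumulant4(Z)
# K[0, 0, 0, 0] = 1 - 3 = -2, and the cross cumulants such as K[0, 0, 1, 1] vanish
```

The corresponding empirical sketch is then one matrix-vector product, `A @ K.ravel()`, as in (21).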

In this case, the additive error e in Corollary 4.1 can be attributed to the finite sample effects of approximating the true 4th order cumulant tensor from finite data. We now state our final result of this section.

Theorem 4.2 (Finite Sample Effects).

Assume the independent components have finite non-zero kurtosis and have bounded support, so that the features are bounded by some constant R>0. Consider the ICA model with corresponding 4th order cumulant tensor Z as in (13). By considering any draw of the data {zi}Ni=1 and the associated 4th order cumulant tensor approximation ^Z, we have

 ∥Avec(Z)−Avec(^Z)∥2≤R√(2(1+δ)log(1/ξ))/√N (34)

with probability at least 1−ξ on the drawing of both the zi's and the random matrix A.

Proof.

See Appendix B. ∎

4.3 Discussion

The results in this section are all based on proving a RIP on the model set SH defined in (15), where it is assumed the data has been prewhitened to reduce the ICA model to one with an orthogonal mixing matrix, as discussed in Section 2.3. The prewhitening stage removes some of the degrees of freedom within the ICA inference task, as it is then only necessary to estimate an orthogonal mixing matrix Q. In some sketching cases, we may only see the data once, for example in the streaming context [tropp2019streamingScientific], and therefore prewhitening may not be possible. The fact that we are now estimating an arbitrary mixing matrix instead of an orthogonal mixing matrix increases the degrees of freedom from n(n−1)/2 to n2. As a result, we must sketch the unwhitened moment tensor X such that

 yu=A(X), (35)

where X∈Rn×n×n×n denotes the 4th order moment tensor of the unwhitened data and A is a random matrix as defined in (21). Here the subscript u denotes that the sketch acts on the unwhitened data x. In addition, the feature function for the unwhitened data can be defined as

 Φu(x)=[⟨Aj,x⊗4⟩F ; x⊗2], (36)

for j=1,…,m, where the Aj are the rows of the matrix A. Note that the feature function for the unwhitened data now includes quadratic moments (one could further reduce the size of the unwhitened sketch by instead computing random quadratic moments; however, the reduction in complexity is minimal and therefore we leave this for future work), as well as random quartic moments, which are needed to estimate the mixing matrix with its extra degrees of freedom. Recall from (10) that the mixing matrix M has the following decomposition [de2000introduction2ICA]

 M=PQ (37)

where P is obtained from the eigendecomposition of the covariance matrix, and Q, Λ are an orthogonal and a diagonal matrix, respectively.

5 CICA Algorithms

In this section we propose two distinct compressive ICA algorithms to estimate the mixing matrix in both the whitened and unwhitened cases.

5.1 Iterative Projection Gradient

Iterative projection gradient (IPG) descent is a popular optimization scheme which enforces low dimensional structure, e.g. sparsity, rank, etc., by projecting the object of interest onto the model set after each gradient step. An iterative hard thresholding scheme was proposed in sparsity based compressive sensing [blumensath2008iterative2, blumensath2009iterative], where the smallest absolute entries are thresholded to zero to enforce the sparsity constraint and project the object onto the sparse model set. Blumensath [BlumensathIPG11] shows that the thresholding operator is an orthogonal projection onto the sparse set, thereby projecting to an element of the model set that is of minimal distance. For the case of compressive ICA, we also seek an orthogonal projection onto the ICA model set SH. Formally, we can define an orthogonal projection operator PSH of a 4th order cumulant tensor Z∗ as

 PSH(Z∗)∈argminZ∈SH∥Z∗−Z∥F. (38)

In other words, PSH projects the object Z∗ onto the element in the model set that is of minimum distance w.r.t. the Frobenius norm. In practice it is often difficult to find a projection operator that is both orthogonal and computationally tractable. In [cardoso1, cardoso2], Cardoso showed that the ICA model set satisfies SH⊆R∩L, where R is the set of rank-n tensors defined as

 R:={Z∈Rn×n×n×n∣rank(¯Z)=n}, (39)

where ¯Z∈Rn2×n2 is the matrix formed by rearranging the elements of the tensor Z into a Hermitian matrix and where rank defines the standard matrix rank [cardoso1]. Moreover, L is the set of super-symmetric tensors defined by

 L:={Z∈Rn×n×n×n∣Zq(ijkl)=Zijkl}, (40)

where q(ijkl) defines all permutations of the index (i,j,k,l). In fact, Cardoso proved in [cardoso2] that locally the converse is true, for instance within some neighbourhood of SH the following holds:

 R∩L⊆SH. (41)

Therefore, within some neighbourhood of SH, projecting onto the ICA model set is equivalent to projecting onto R∩L. Moreover, in [Cadzow1988SignalEC], Cadzow proved that alternating projections onto R and L are guaranteed to converge onto the intersection (in general, rank forcing destroys symmetry while symmetrization destroys the rank-n property, therefore alternate projections are needed until convergence). Fundamentally, the projections onto R (rank-n approximation) and L (averaging over permutations), denoted by PR and PL respectively, are both simple to compute and are orthogonal. Alternating orthogonal projections onto R and L ensure a stable projection onto R∩L [Cadzow1988SignalEC] which, locally, results in an orthogonal projection onto the ICA model set SH. Formally, we define the orthogonal projection below in Algorithm 1. In practice, Algorithm 1 converges to below a small tolerance in very few iterations. We can now state our full CICA IPG algorithm, detailed in Algorithm 2. Here the step size μj is computed optimally to guarantee convergence [BlumensathIPG11, blumensath2009iterative], AT denotes the adjoint sketching operator, and a fixed shrinking step size parameter is employed.
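The alternating projection of Algorithm 1 can be sketched as follows (an illustrative NumPy implementation of our own; a full SVD is used here in place of the randomized rank-n approximation one would use in practice):

```python
import numpy as np
from itertools import permutations

def proj_symmetric(Z):
    """P_L: orthogonal projection onto super-symmetric tensors (average over all index permutations)."""
    return sum(np.transpose(Z, p) for p in permutations(range(4))) / 24.0

def proj_rank(Z, n):
    """P_R: project so that the n^2 x n^2 unfolding of Z has rank n (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Z.reshape(n * n, n * n))
    return ((U[:, :n] * s[:n]) @ Vt[:n]).reshape(n, n, n, n)

def proj_model_set(Z, n, tol=1e-10, max_iter=50):
    """Cadzow-style alternating projections onto R ∩ L, in the spirit of Algorithm 1."""
    for _ in range(max_iter):
        Z_new = proj_symmetric(proj_rank(Z, n))
        if np.linalg.norm(Z_new - Z) < tol:
            break
        Z = Z_new
    return Z_new
```

Tensors already in R∩L, such as a diagonal cumulant tensor, are fixed points of this iteration.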

5.1.1 Unwhitened IPG

It was discussed in Section 4.3 that it is often convenient, from an online processing point of view, to directly sketch the unwhitened data x. Using the properties of the matrix-tensor product [de1997signal], it can be seen that

 Avec(X)=A¯Pvec(Z), (42)

where ¯P=P⊗P⊗P⊗P. As defined in (36), the unwhitened feature function includes the second order moment of x, namely x⊗2. The empirical sketch therefore includes the sample covariance, which can be used to estimate an approximation of P, denoted ^P, by using its eigenvalue decomposition [comon1994independent] at the beginning of Algorithm 2. By denoting ^¯P=^P⊗^P⊗^P⊗^P, the gradient step in Algorithm 2 can be replaced by

 Zj+12=Zj+μjAT(yu−A^¯P(Zj)), (43)

as well as the associated step size and stopping criteria. As a result, the CICA IPG algorithm proceeds as normal by employing the original orthogonal projection PSH.

5.2 Alternating Steepest Descent

The second proposed algorithm, alternating steepest descent (ASD), is inherently different from the IPG scheme previously discussed. To see why, it is insightful to rewrite (20) in terms of the elements of the product set of diagonal tensors D and orthogonal matrices O:

 minQTQ=I, S∈D F(S,Q)=∥yw−A(S×1Q×2Q×3Q×4Q)∥22, (44)

where we have used the multilinear property discussed in (13). As the optimization problem is now explicitly defined by the mixing matrix Q and a sparse diagonal tensor S, it is sufficient to optimise with respect to these parameters in an alternating steepest descent scheme. This approach contrasts with the IPG scheme, as once we initialise the mixing matrix and the diagonal cumulant tensor appropriately, we can optimise directly on the model set SH. We first state the ASD steps:

Note that the diagonal cumulant tensor can simply be reformulated as an n-sparse vector with known support, therefore one can perform element-wise differentiation on the diagonal entries. The second step requires more attention as we have the orthogonality constraint QTQ=I (i.e. Q∈O). The set of orthogonal matrices is an instance of a Stiefel manifold [wotaoStiefel], therefore F is minimized directly on the Stiefel manifold.
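The multilinear transform appearing in (44) is a single einsum, so the objective is cheap to evaluate; a minimal sketch (function names are ours):

```python
import numpy as np

def multilinear(S, Q):
    """Compute Z = S x1 Q x2 Q x3 Q x4 Q for a 4th order tensor S."""
    return np.einsum('abcd,ia,jb,kc,ld->ijkl', S, Q, Q, Q, Q)

def asd_objective(y_w, A, S, Q):
    """F(S, Q) = || y_w - A vec(S x1 Q ... x4 Q) ||_2^2, with A an m x n^4 matrix."""
    r = y_w - A @ multilinear(S, Q).ravel()
    return float(r @ r)
```

With Q the identity the transform leaves S unchanged, and the objective vanishes at the true (S, Q) in the noiseless case.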

5.2.1 Stiefel Manifold Optimisation

Given a feasible matrix Q and the gradient ∇QF, define a skew-symmetric matrix B as

 B=∇QF QT−Q(∇QF)T. (45)

The update on the Stiefel manifold is determined by a Crank-Nicolson-like scheme [crank1947practical], denoted

 Y(τ)=Q−(τ/2)B(Q+Y(τ)), (46)

where τ denotes the step size. The matrix Y(τ)=(I+(τ/2)B)−1(I−(τ/2)B)Q is referred to as the Cayley transform [wotaoStiefel] of Q. The descent curve Y(τ) has the following useful features:

• Y(τ) is smooth in τ, with Y(0)=Q.

• Y(τ)TY(τ)=I for all τ∈R, i.e. the curve remains on the Stiefel manifold.

As a result, we perform a steepest descent on F with a line search along the descent curve with respect to τ. For more details on optimisation methods constrained to the Stiefel manifold, refer to [wotaoStiefel]. We can now state our second proposed CICA algorithm in Algorithm 3.
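Solving (46) for Y(τ) gives the closed form Y(τ)=(I+τ/2·B)⁻¹(I−τ/2·B)Q, so one Cayley update can be sketched as (our own minimal illustration):

```python
import numpy as np

def cayley_step(Q, G, tau):
    """One update along the Cayley-transform descent curve on the Stiefel manifold.

    Q: current orthogonal iterate, G: Euclidean gradient of F at Q, tau: step size.
    """
    B = G @ Q.T - Q @ G.T                # skew-symmetric matrix of (45)
    I = np.eye(Q.shape[0])
    # closed form of (46): Y(tau) = (I + tau/2 B)^{-1} (I - tau/2 B) Q
    return np.linalg.solve(I + (tau / 2) * B, (I - (tau / 2) * B) @ Q)
```

Because B is skew-symmetric, the update preserves orthogonality exactly for any step size, which is what makes the line search over τ safe.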

5.2.2 Practicalities

We start by stating the computational complexity of each proposed CICA algorithm. Here we assume that a fast SRHT, as discussed in Section 4.0.2, is used to compute the sketch. For the IPG scheme, the symmetry projection costs O(n4) flops through averaging along all index permutations, while a randomized rank-n approximation of a general n2×n2 matrix costs O(n4 log n) flops [WOOLFE2008335], which dominates the cost of the rank projection operator. The gradient step in Algorithm 2 is dominated by the O(n4 log n) cost of applying the sketching operator at each iteration, so the IPG algorithm has a per-iteration cost of O(n4 log n) flops. In the second proposed ASD algorithm, the gradient step in terms of the diagonal tensor in Algorithm 3 again costs O(n4 log n) flops, which dominates the cost of the line search [wotaoStiefel], resulting in the ASD algorithm also having a per-iteration computational complexity of O(n4 log n). Note that both proposed CICA algorithms have computational complexity that is independent of the length of the data N, which can be extremely large for modern day applications.

As is the case for the general ICA problem, the compressive ICA optimisation problem is non-convex, and both proposed algorithms may be prone to converging to local minima. As a result, we consider the option of restarts at random initialisations to obtain a good solution. We also consider a proxy projection operator that uses a Givens rotation scheme, popular in many ICA algorithms (see [comon1994independent, cardoso1993blind]), that approximately diagonalises the cumulant tensor with respect to some contrast function, followed by thresholding the cross cumulants of that approximately diagonalised tensor to zero [sheehan2019compressive]. We have observed in practice that this proxy projection operator is less sensitive to the non-convex landscape of the optimization problem, which could be explained by the robustness of Givens rotations [comon1994independent]; hence multiple restarts are rarely required. The Givens rotation scheme used to approximately diagonalise the cumulant tensor [comon1994independent] and the thresholding of the cross-cumulants are comparable in cost to the alternating projection they replace, therefore in total the proxy IPG algorithm has approximately the same computational complexity as our previous IPG algorithm.

6 Empirical Results

6.1 Phase Transition

Phase transitions are an integral part of the analysis used frequently in the compressive sensing literature [phase_tran_Lotz] to show a sharp change in the probability of successful reconstruction of the low dimensional object as the sketch size increases. The location at which the phase transition occurs can provide a tight bound on the required sketch size given the number of independent components, and further consolidates the theoretical bound of the RIP derived in Section 4. To set up the phase transition experiment, we constructed the expected cumulant tensor of n Laplacian sources and transformed the tensor with an orthogonal mixing matrix using the multilinear property in (13), resulting in an expected cumulant tensor Z. For each number of independent components n, 250 Monte Carlo simulations on the mixing matrix were executed for increasing sketch size m between 2 and 700. A successful reconstruction was determined if the Amari error (used widely in the ICA literature as it is both scale and permutation invariant, which are the two inherent ambiguities of ICA inference) [amari1996new] between the true mixing matrix M and the estimated mixing matrix ^M, defined by

 d(M,^M)=(1/2n)∑ni=1(∑nj=1|bij|/maxj|bij|−1)+(1/2n)∑nj=1(∑ni=1|bij|/maxi|bij|−1), (47)

was smaller than a fixed tolerance, where the bij are the entries of B=M−1^M. The probability of successful reconstruction was given by the proportion of successful reconstructions within the 250 Monte Carlo tests. We use the IPG version of the CICA algorithm for these results, although the ASD version provides nearly identical results. It is insightful to begin by fixing the number of sources to highlight the sharp transition, as shown in Figure 2. We highlight some important bounds, including 2 and 4 times the dimension of the model set, depicted by the orange lines. For comparison, the dimension of the space of cumulant tensors, in other words the size of the cumulant tensor, is shown by the red line. The phase transition occurs between 2 and 4 times the model set dimension, indicating that choosing a sketch size of 4 times the model set dimension would be sufficient to successfully infer the mixing matrix with high probability.
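For reference, (47) is straightforward to implement; a small NumPy version (taking the bij to be the entries of M⁻¹M̂, the standard choice in the ICA literature):

```python
import numpy as np

def amari_error(M, M_hat):
    """Amari error (47): invariant to the scaling/permutation ambiguities of ICA."""
    n = M.shape[0]
    B = np.abs(np.linalg.inv(M) @ M_hat)   # b_ij taken as entries of M^{-1} M_hat
    row_term = (B.sum(axis=1) / B.max(axis=1) - 1).sum()
    col_term = (B.sum(axis=0) / B.max(axis=0) - 1).sum()
    return (row_term + col_term) / (2 * n)
```

The error is zero exactly when M̂ equals M up to column permutation and scaling, which is why it is used as the success criterion here.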

Figure 3 generalises the single phase transition result to a range of numbers of independent components n. Once again, the important bounds of the model set dimension (green), 2 and 4 multiples of the model set dimension (orange) and the dimension of the space of cumulant tensors (red) are shown. Figure 3 explicitly shows that the phase transition empirically occurs between 2 and 4 times the model set dimension, providing us with a tight practical lower bound on the sketch size for successful inference of the mixing matrix with high probability. Recall that in Theorem 4.1, the RIP holds when the sketch size scales with the model set dimension, up to constants and logarithmic factors. The location of the phase transition in the empirical results therefore further consolidates the theoretical result. For a given number of independent components n, the ratio between the upper orange line (4 times the model set dimension) and the red line (the dimension of the space of cumulant tensors) provides a realistic compression rate in comparison to using the whole cumulant tensor, as many ICA techniques do. Importantly, as the number of independent components increases, the ratio between these two lines decreases, resulting in further compression.

6.2 Statistical Efficiency

As was shown in Section 6.1, the potential compression rates of sketching the cumulant tensor are high, which can lead to a significantly reduced memory requirement. In this section we numerically analyse the trade-off between the sketch size and the loss of information. Statistical efficiency is a measure of the variability or quality of an unbiased estimator [FisherEfficiency], where the Cramér-Rao bound provides a lower bound on the variability of an estimator and gives a best case scenario. For fair comparison, we instead use the variability of an estimator inferred by an algorithm that explicitly makes use of the cumulant information, just as the proposed CICA algorithms do, to infer the mixing matrix estimate. As such, we use Comon's ICA algorithm, detailed in [comon1994independent], which minimizes a kurtosis based contrast function using a sequence of Givens rotations on pairwise cumulants, as the approximate full data bound (i.e. no compression). We could have equivalently used the well-known Joint Approximate Diagonalization of Eigen-matrices (JADE) algorithm [cardoso1993blind] or any other cumulant based ICA algorithm as the approximate bound, which gives similar results. To this end, we make use of the relative efficiency, defined as

 e(M1,M2)=var(d(Mθ,M1))var(d(Mθ,M2)), (48)

where d is the Amari error defined in (47) and Mθ is the true mixing matrix. Denoting M1 and M2 as the mixing matrix estimates of Comon's ICA algorithm (full data) and the proposed CICA algorithm, respectively, we expect