The Incomplete Rosetta Stone Problem: Identifiability Results for Multi-View Nonlinear ICA

05/16/2019 ∙ by Luigi Gresele, et al.

We consider the problem of recovering a common latent source with independent components from multiple views. This applies to settings in which a variable is measured with multiple experimental modalities, and where the goal is to synthesize the disparate measurements into a single unified representation. We consider the case that the observed views are a nonlinear mixing of component-wise corruptions of the sources. When the views are considered separately, this reduces to nonlinear Independent Component Analysis (ICA) for which it is provably impossible to undo the mixing. We present novel identifiability proofs that this is possible when the multiple views are considered jointly, showing that the mixing can theoretically be undone using function approximators such as deep neural networks. In contrast to known identifiability results for nonlinear ICA, we prove that independent latent sources with arbitrary mixing can be recovered as long as multiple, sufficiently different noisy views are available.




1 Introduction

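We consider the setting described by the following generative model:

$$\mathbf{x}_1 = \mathbf{f}_1(\mathbf{s}), \tag{1}$$

$$\mathbf{x}_2 = \mathbf{f}_2(\mathbf{s}), \tag{2}$$

$$p(\mathbf{s}) = \prod_i p_i(s_i), \tag{3}$$

where $\mathbf{f}_1$ and $\mathbf{f}_2$ are arbitrary smooth and invertible transformations of the latent variable $\mathbf{s}$ with mutually independent components. The goal is to recover $\mathbf{s}$, undoing the mixing induced by the $\mathbf{f}_i$, in the case where only observations of $\mathbf{x}_1$ and $\mathbf{x}_2$ are available.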
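To make the setting concrete, the following is a minimal simulation sketch of two views $\mathbf{x}_1 = \mathbf{f}_1(\mathbf{s})$ and $\mathbf{x}_2 = \mathbf{f}_2(\mathbf{s})$ of a common source with independent components. The Laplace source distribution, the layered construction of the mixing, and the helper names `sample_sources` and `make_invertible_mixing` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_sources(n_samples, dim, rng):
    """Draw latent sources with mutually independent components.
    The Laplace distribution is an arbitrary illustrative choice."""
    return rng.laplace(size=(n_samples, dim))

def make_invertible_mixing(dim, rng, n_layers=3):
    """Build a smooth invertible map as a composition of invertible pieces:
    an orthogonal linear map followed by the strictly increasing
    element-wise function h -> h + 0.5 * tanh(h)."""
    mats = [np.linalg.qr(rng.normal(size=(dim, dim)))[0] for _ in range(n_layers)]

    def f(s):
        h = s
        for q in mats:
            h = h @ q.T                 # invertible linear step
            h = h + 0.5 * np.tanh(h)    # invertible nonlinearity (derivative > 0)
        return h

    return f

rng = np.random.default_rng(0)
s = sample_sources(1000, 3, rng)        # latent, never observed
f1 = make_invertible_mixing(3, rng)     # hypothetical mixing for view 1
f2 = make_invertible_mixing(3, rng)     # hypothetical mixing for view 2
x1, x2 = f1(s), f2(s)                   # the two observed views
```

Each row of `x1` and `x2` is generated from the same row of `s`, which is the pairing that the multi-view analysis relies on.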

The two decoupled problems defined by considering pairs of Equations 1, 3 and 2, 3 separately are instances of Independent Component Analysis (ICA). This unsupervised learning method aims at providing a principled approach to disentanglement of independent latent components, blind source separation, and feature extraction [21]. Its applications are ubiquitous, including neuroimaging [28], signal processing [34], text mining [17], astronomy [31] and financial time series analysis [32]. An ICA problem is identifiable when it is provably possible to simultaneously undo the mixing and recover the sources up to tolerable ambiguities. Proofs of identifiability are crucial for the characterization of reliable ICA methods; in absence of these, we cannot be confident that a method successfully retrieves the true sources, even within controlled settings.

The case in which $\mathbf{f}$ is a linear function, called linear ICA, has been shown to be identifiable if at most one of the latent components is Gaussian [10, 38, 9]. This triggered the development of algorithms and encouraged their application. In contrast, the nonlinear ICA problem was shown to be provably unidentifiable without further assumptions on the data generating process [22]. Much research in this field has thus attempted to characterize the assumptions under which identifiability holds. Such assumptions may be grouped into two main categories: (i) those regarding properties of the sources (e.g., non-stationarity or time correlation in time series settings [6, 37]); and (ii) those restricting the functional form of the mixing functions (e.g., post-nonlinear mixing [40]).

A recent breakthrough was to leverage contrastive learning, a technique that recasts the problem of unsupervised learning as a supervised one [15, 19, 20, 23]. This is a powerful proof technique, which additionally provides algorithms that can be practically implemented using modern deep learning frameworks. The setup in [19, 20, 23] makes strong assumptions on the data generating mechanism, but allows for arbitrary nonlinear mixing of the sources. However, the unconditional independence assumption of the sources (Equation 3) is replaced by a conditional independence statement, and the method requires observations of the additional variable being conditioned on.

In this paper, we employ contrastive learning to address the setting specified by Equations 1-3, where in contrast to [23], no observations of parent variables of the sources are available. This corresponds to cases in which multiple recordings of the same process, acquired with different instruments and possibly different modalities, are available, and the goal is to find an unambiguous representation of the latent state common to all. Multi-view settings of this sort are common in large biomedical and neuroimaging datasets [1, 30, 41, 36], motivating the need for reliable statistical tools enabling simultaneous handling of multiple sets of variables.

As a metaphor for such a setting, consider the story of the Rosetta Stone, a stele discovered during Napoleon’s campaign in Egypt in 1799, inscribed with three versions of a decree issued at Memphis in 196 BC. The realization that the stone reported the same text translated into three different languages led the French philologist Champollion to succeed in translating two unknown languages (Ancient Egyptian, in hieroglyphic script and Demotic script) by exploiting a known one (Ancient Greek). In contrast to that story, we consider the radically unsupervised task in which, given a Rosetta Stone with only two texts, both in unknown languages, we want to learn an unambiguous common representation for both of them.

The main contribution of this paper is to show that jointly addressing multiple demixing problems allows for identifiability with assumptions which do not directly refer to the sources, nor to restriction of the class of mixing functions, but rather to the conditional probability distribution of one observation given the other. This provides identifiability results in a novel setting, with assumptions entailing a different interpretation - namely, that the views have to be sufficiently diverse.

The remainder of this paper is organized as follows. In Section 2 we provide background information about the technique of contrastive learning for ICA and briefly review recent work that employs it. In Section 3 we present our main results, providing identifiability for different multi-view settings. In Section 4 we discuss other relevant works in the literature. Finally, we summarize and discuss our results in Section 5.

2 Nonlinear ICA With Contrastive Learning

Consider the nonlinear ICA setting, where observations of a variable $\mathbf{x} = \mathbf{f}(\mathbf{s})$ are available, and $\mathbf{f}$ is an arbitrary nonlinear invertible mixing. The proof of non-identifiability for the general case with unconditionally independent sources was an important negative result [22]. We review it briefly in Appendix A.

A proposed modification of this setting [23] involves an auxiliary observed variable $\mathbf{u}$ and a change in the independence properties. If the unconditional independence is substituted with a conditional independence given the auxiliary variable $\mathbf{u}$, i.e.

$$\log p(\mathbf{s} \mid \mathbf{u}) = \sum_i q_i(s_i, \mathbf{u}) \tag{4}$$

for some functions $q_i$, the model becomes identifiable. The conditional independence statement in Equation 4 can be interpreted as positing that $\mathbf{u}$ is a parent of the sources $\mathbf{s}$. A further assumption on the effect of variations in $\mathbf{u}$ on $\mathbf{s}$, called variability in [23], is required. Intuitively, it demands that $\mathbf{u}$ has a sufficiently diverse influence on $\mathbf{s}$.

In the setting described above, a constructive proof of identifiability is attained by exploiting contrastive learning [15]. This technique transforms a density ratio estimation problem into one of supervised function approximation. This idea has a long history [12], and has attracted attention in machine learning in recent years [14, 15]. We recapitulate the method in Appendix B.
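The link between classification and density ratios can be made explicit. The following is a sketch under the standard assumptions of balanced classes and a logistic output; the symbols $\mathbf{u}$ for the auxiliary variable, $C$ for the class label and $r$ for the regression function are notation assumed here.

```latex
% Class C = 1: (x, u) drawn from the joint p(x, u).
% Class C = 0: (x, u*) drawn from the product of marginals p(x) p(u).
P(C = 1 \mid \mathbf{x}, \mathbf{u})
  = \frac{p(\mathbf{x}, \mathbf{u})}
         {p(\mathbf{x}, \mathbf{u}) + p(\mathbf{x})\, p(\mathbf{u})}
  = \frac{1}{1 + \exp\bigl(-r(\mathbf{x}, \mathbf{u})\bigr)}
\quad \Longrightarrow \quad
r(\mathbf{x}, \mathbf{u})
  = \log p(\mathbf{x}, \mathbf{u}) - \log p(\mathbf{x})\, p(\mathbf{u}).
```

A classifier that reaches the Bayes-optimal posterior therefore has a logit equal to the log density ratio, which is why constraining the form of $r$ constrains what the learned features can represent.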

In the setting of nonlinear ICA with auxiliary variables, contrastive learning can be exploited by training a classifier to distinguish between a tuple sampled from the joint distribution, which we denote as $(\mathbf{x}, \mathbf{u})$, and one where $\mathbf{u}^*$ is a sample generated from the marginal independently of $\mathbf{x}$, denoted $(\mathbf{x}, \mathbf{u}^*)$. Intuitively, tuples drawn from the former distribution correspond to the same sources $\mathbf{s}$, and thus share information, while tuples from the latter correspond to different sources and thus do not share information. Since the marginals of both distributions are equal, the classifier must learn to distinguish between them based on the common information shared by $\mathbf{x}$ and $\mathbf{u}$; that is, ultimately, $\mathbf{s}$.

With this method, the reconstruction of $\mathbf{s}$ is only possible up to an invertible scalar “gauge” transformation. This is due to a fundamental ambiguity in the setup of nonlinear ICA and does not represent a limitation of their results; it can therefore be considered a trivial one. We further comment on this in Appendix A.3.

3 Nonlinear ICA With Multiple Views

We have described how naively splitting Equations 1, 2 and 3 into two separate nonlinear ICA problems renders both problems non-identifiable, unless strong assumptions are made on the $\mathbf{f}_i$ or the distribution of $\mathbf{s}$.

In the Rosetta Stone story, awareness that different texts reported on the stele were linked by a common topic helped solve the translation problem; similarly, in our setting, matched observations of the two views are linked through the shared latent variable $\mathbf{s}$. Thus the central question we investigate is whether these assumptions can be relaxed by exploiting the structure of the generative model; that is, whether jointly observing $\mathbf{x}_1$ and $\mathbf{x}_2$ provides sufficient constraints to the inverse problem, thus removing the ambiguities present in the vanilla nonlinear ICA setting. We consider a contrastive learning task in which a classifier is trained to distinguish between pairs $(\mathbf{x}_1, \mathbf{x}_2)$ corresponding to the same $\mathbf{s}$ and pairs $(\mathbf{x}_1, \mathbf{x}_2^*)$ corresponding to different realizations of $\mathbf{s}$. As discussed in Section 2, the classifier will be forced to employ the information shared by the simultaneous views in order to distinguish the two classes. As we show, this ultimately results in recovering $\mathbf{s}$ (up to unavoidable ambiguities).
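The contrastive task above can be sketched as a dataset construction step. This is a minimal illustration, assuming i.i.d. samples stored row-wise; the function name `make_contrastive_dataset`, the cyclic-shift trick for producing mismatched pairs, and the toy element-wise views are assumptions made for the example.

```python
import numpy as np

def make_contrastive_dataset(x1, x2, rng):
    """Build a binary classification dataset from two matched views.
    Label 1: rows of x1 and x2 generated from the same source realization.
    Label 0: x2 is shifted cyclically so each row is paired with a view
    generated from a different (independent) source realization."""
    n = x1.shape[0]
    shift = int(rng.integers(1, n))              # 1 <= shift <= n - 1, so no row stays matched
    x2_star = np.roll(x2, shift, axis=0)
    positives = np.concatenate([x1, x2], axis=1)
    negatives = np.concatenate([x1, x2_star], axis=1)
    inputs = np.concatenate([positives, negatives], axis=0)
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    return inputs, labels

rng = np.random.default_rng(0)
s = rng.laplace(size=(500, 2))
x1 = np.tanh(s)                                  # toy noiseless view (illustrative only)
x2 = s ** 3 + 0.1 * rng.normal(size=s.shape)     # toy noisy view (illustrative only)
inputs, labels = make_contrastive_dataset(x1, x2, rng)
```

Because both classes share the same marginals for each view, a classifier trained on `inputs` and `labels` can only succeed by using the dependence between the two views.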

For technical reasons discussed in Appendix B, our method requires some stochasticity in the relationship between $\mathbf{s}$ and at least one of the $\mathbf{x}_i$. However, this is not a significant constraint in practice; in most real settings observations are corrupted by noise, and a truly deterministic relationship between $\mathbf{s}$ and the $\mathbf{x}_i$ would be unrealistic. We will consider a component-wise independent corruption of our sources, i.e. $\mathbf{x}_1 = \mathbf{f}_1(\mathbf{g}_1(\mathbf{s}, \mathbf{n}_1))$ with $g_{1,i}(\mathbf{s}, \mathbf{n}_1) = g_{1,i}(s_i, n_{1,i})$, where the components of $\mathbf{n}_1$ are mutually independent, and similarly for $\mathbf{x}_2$. The noise variables $\mathbf{n}_1$, $\mathbf{n}_2$ and the sources $\mathbf{s}$ are assumed to be mutually independent. Note that this only puts constraints on the way the signal is corrupted by the noise, namely $\mathbf{g}_i$, and not on the mixing $\mathbf{f}_i$. We will refer to such $\mathbf{g}_i$ as component-wise corrupter throughout, and to its output as corruption. In the vanilla ICA setting, inverting the mixing and recovering the sources are equivalent; in the setting that we consider, the inversion of the mixing only implies recovering the sources up to the effect of the corrupter $\mathbf{g}_i$.
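As an illustration of what a component-wise corrupter looks like, the sketch below gives two hypothetical examples, one additive and one multiplicative. Both are assumptions chosen for the example; the only property taken from the text is that the $i$-th output depends on the $i$-th source component and the $i$-th noise component alone.

```python
import numpy as np

def additive_corrupter(s, n):
    """z_i = s_i + n_i: each output component uses only (s_i, n_i)."""
    return s + n

def multiplicative_corrupter(s, n):
    """z_i = s_i * exp(n_i): another element-wise, hence component-wise, choice."""
    return s * np.exp(n)

rng = np.random.default_rng(0)
s = rng.laplace(size=(100, 3))
n = 0.1 * rng.normal(size=(100, 3))      # noise with mutually independent components
z_add = additive_corrupter(s, n)
z_mul = multiplicative_corrupter(s, n)

# Perturbing source component 0 leaves the other corrupted components unchanged.
s_perturbed = s.copy()
s_perturbed[:, 0] += 1.0
z_add_perturbed = additive_corrupter(s_perturbed, n)
```

A corrupter that mixed components, for example by applying a dense matrix to `s` before adding noise, would not satisfy this restriction; such mixing belongs in $\mathbf{f}_i$ instead.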

We will consider three instances of the general setting, providing identifiability results for each.

  1. First we consider the case that only one of the observations, $\mathbf{x}_2$, is corrupted with noise. This corresponds, for instance, to a setting in which one accurate measurement device is supplemented with a second noisy device. We show that in this setting it is possible to fully reconstruct $\mathbf{s}$ using the noiseless variable (Section 3.1).

  2. Next, we consider the case that both variables are corrupted with noise. In this setting, it is possible to recover $\mathbf{s}$ up to the corruptions. Furthermore, we show that $\mathbf{s}$ can be recovered with arbitrary precision in the limit that the corruptions go to zero (Section 3.2).

  3. Finally, we consider the case of having $N$ simultaneous views of the source rather than just two. When considering the limit $N \to \infty$, we prove sufficient conditions under which it is possible to reconstruct $\mathbf{s}$ even if each observation is corrupted by noise (Section 3.3).

To the best of our knowledge, no result of identifiability of latent sources in the case in which only corrupted, mixed versions are observed has been given before.

3.1 One Noiseless View

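Consider the generative model

$$\mathbf{x}_1 = \mathbf{f}_1(\mathbf{s}), \tag{5}$$

$$\mathbf{z} = \mathbf{g}(\mathbf{s}, \mathbf{n}), \tag{6}$$

$$\mathbf{x}_2 = \mathbf{f}_2(\mathbf{z}), \tag{7}$$

where $\mathbf{f}_1$ and $\mathbf{f}_2$ are invertible, $\mathbf{g}$ is a component-wise corrupter, and $\mathbf{x}_1$ and $\mathbf{x}_2$ are observed. This is represented in Figure 1.

Figure 1: The setting considered in Section 3.1. Two views of the sources are available, one of which, $\mathbf{x}_1$, is not corrupted by noise. In this and all other figures, each node is a deterministic function of all its parents in the graph.

Subject to some assumptions, it is possible to recover $\mathbf{s}$ up to the component-wise invertible ambiguity.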

The difference of the log joint probability and log product of marginals of the observed variables in the generative model specified by Equations 5-7 admits the following factorisation:


where , , and is the Jacobian of the transformation (note that the introduced Jacobians cancel). Suppose that

  1. satisfies the Sufficiently Distinct Views assumption (see after this theorem).

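  2. We train a classifier to discriminate between $(\mathbf{x}_1, \mathbf{x}_2)$ and $(\mathbf{x}_1, \mathbf{x}_2^*)$, where $(\mathbf{x}_1, \mathbf{x}_2)$ correspond to the same realization of $\mathbf{s}$ and $(\mathbf{x}_1, \mathbf{x}_2^*)$ correspond to different realizations of $\mathbf{s}$.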

  3. The classifier is constrained to use a regression function of the form

    where are invertible, smooth and have smooth inverse.

Then, in the limit of infinite data and with universal approximation capacity, inverts in the sense that the recover the independent components of up to component-wise invertible transformations. The proof can be found in Appendix D.1. The assumption of invertibility for could be satisfied by, e.g., the use of normalizing flows [33, 8] or deep invertible networks [24].
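The structural constraint on the classifier can be sketched in code. The example below assumes a regression function of the form $r(\mathbf{x}_1, \mathbf{x}_2) = \sum_i \psi_i(h_i(\mathbf{x}_1), \mathbf{x}_2)$, by analogy with the auxiliary-variable construction of [23]; since the exact expression is not reproduced in this text, treat that form as an assumption. Invertibility of $\mathbf{h}$ is obtained here with additive coupling layers, one simple flow-style construction, and all weights are random and untrained.

```python
import numpy as np

def init_mlp(rng, n_in, n_hidden, n_out):
    return (rng.normal(scale=0.5, size=(n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(scale=0.5, size=(n_hidden, n_out)), np.zeros(n_out))

def mlp(params, x):
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2

def init_flow(rng, dim, n_layers=4, n_hidden=16):
    """Additive coupling layers; requires dim >= 2."""
    k = dim // 2
    layers = []
    for j in range(n_layers):
        n_in, n_out = (k, dim - k) if j % 2 == 0 else (dim - k, k)
        layers.append(init_mlp(rng, n_in, n_hidden, n_out))
    return layers

def flow_forward(layers, x):
    k = x.shape[1] // 2
    a, b = x[:, :k], x[:, k:]
    for j, p in enumerate(layers):
        if j % 2 == 0:
            b = b + mlp(p, a)      # update second block from the first
        else:
            a = a + mlp(p, b)      # update first block from the second
    return np.concatenate([a, b], axis=1)

def flow_inverse(layers, h):
    k = h.shape[1] // 2
    a, b = h[:, :k], h[:, k:]
    for j in reversed(range(len(layers))):
        if j % 2 == 0:
            b = b - mlp(layers[j], a)
        else:
            a = a - mlp(layers[j], b)
    return np.concatenate([a, b], axis=1)

def regression_function(layers, psi_params, x1, x2):
    """r(x1, x2) = sum_i psi_i(h_i(x1), x2) with h an invertible map (assumed form)."""
    h = flow_forward(layers, x1)
    total = np.zeros(x1.shape[0])
    for i, p in enumerate(psi_params):
        total += mlp(p, np.concatenate([h[:, i:i + 1], x2], axis=1))[:, 0]
    return total

rng = np.random.default_rng(0)
dim = 3
flow = init_flow(rng, dim)
psis = [init_mlp(rng, 1 + dim, 16, 1) for _ in range(dim)]
x1, x2 = rng.normal(size=(5, dim)), rng.normal(size=(5, dim))
r = regression_function(flow, psis, x1, x2)
```

In a full implementation the output `r` would be passed through a logistic loss on the contrastive labels and all parameters trained jointly; after training, `flow_forward` applied to $\mathbf{x}_1$ would play the role of the learned demixing.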

We remark that at several points in this paper we consider the difference between two log-probabilities. In all of these cases, the Jacobians introduced by a change of variables cancel out as in Equation 8. For brevity we omit explanation of this fact in the rest of the results.
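The cancellation can be spelled out for the model of Section 3.1. The following is a sketch in notation assumed here rather than taken from the source: $\mathbf{z} = \mathbf{g}(\mathbf{s}, \mathbf{n})$ denotes the corruption, so that $\mathbf{x}_1 = \mathbf{f}_1(\mathbf{s})$ and $\mathbf{x}_2 = \mathbf{f}_2(\mathbf{z})$, and $\mathbf{J}_1$, $\mathbf{J}_2$ denote the Jacobians of $\mathbf{f}_1^{-1}$ and $\mathbf{f}_2^{-1}$ evaluated at $\mathbf{x}_1$ and $\mathbf{x}_2$.

```latex
\begin{aligned}
\log p(\mathbf{x}_1, \mathbf{x}_2)
  &= \log p(\mathbf{s}, \mathbf{z})
     + \log\lvert\det \mathbf{J}_1\rvert + \log\lvert\det \mathbf{J}_2\rvert, \\
\log p(\mathbf{x}_1)\, p(\mathbf{x}_2)
  &= \log p(\mathbf{s}) + \log p(\mathbf{z})
     + \log\lvert\det \mathbf{J}_1\rvert + \log\lvert\det \mathbf{J}_2\rvert, \\
\log p(\mathbf{x}_1, \mathbf{x}_2) - \log p(\mathbf{x}_1)\, p(\mathbf{x}_2)
  &= \log p(\mathbf{z} \mid \mathbf{s}) - \log p(\mathbf{z})
   = \sum_i \log p(z_i \mid s_i) - \sum_i \log p(z_i).
\end{aligned}
```

The last equality uses the component-wise form of the corrupter together with the independence of the source and noise components, which makes both $p(\mathbf{z} \mid \mathbf{s})$ and $p(\mathbf{z})$ factorize over components.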

The Sufficiently Distinct Views (SDV) assumption specifies in a technical sense that the two views available are sufficiently different from one another, resulting in more information being available in totality than from each view individually. In the context of Theorem 3.1, it is an assumption about the log-probability of the corruption conditioned on the source. Informally, it demands that the probability distribution of the corruption should vary significantly as a result of conditioning on different values of the source.

[Sufficiently Distinct Views] Let , be functions of two arguments. Denote by

the vector of functions and define


We say that satisfies the assumption of Sufficiently Distinct Views (SDV) if for any value of , there exist distinct values , such that the vectors are linearly independent.
This is closely related to the Assumption of Variability in [23]. We provide simple cases of conditional log-probability density functions satisfying and violating the SDV assumption in the Appendix.
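A linear-independence condition of this kind can be probed numerically. The sketch below assumes, by analogy with the Assumption of Variability in [23], that the vector in the definition stacks the first and second derivatives of each function with respect to its first argument, and that $2D$ evaluation points are required in dimension $D$; both are assumptions, as are the two example log-densities (a Gaussian-shaped and a quartic one, each up to additive constants).

```python
import numpy as np

def sdv_rank(d1, d2, y, ts):
    """Rank of the matrix whose rows are w(y, t_j) = (d1(y, t_j), d2(y, t_j)),
    where d1 and d2 return the element-wise first and second derivatives
    with respect to y (assumed form of the vector in the definition)."""
    rows = [np.concatenate([d1(y, t), d2(y, t)]) for t in ts]
    return int(np.linalg.matrix_rank(np.stack(rows)))

# Gaussian-shaped log-density: -(y - t)^2 / 2
gauss_d1 = lambda y, t: -(y - t)
gauss_d2 = lambda y, t: -np.ones_like(y)

# Quartic log-density: -(y - t)^4 / 4
quart_d1 = lambda y, t: -(y - t) ** 3
quart_d2 = lambda y, t: -3.0 * (y - t) ** 2

rng = np.random.default_rng(0)
dim = 3
y = rng.normal(size=dim)
ts = rng.normal(size=(2 * dim, dim))     # 2D candidate values of the second argument

rank_gauss = sdv_rank(gauss_d1, gauss_d2, y, ts)
rank_quart = sdv_rank(quart_d1, quart_d2, y, ts)
```

Under these assumptions the Gaussian-shaped example cannot reach full rank for `dim >= 2`, because its second derivatives do not depend on the conditioning value, while the quartic example generically does. A full-rank result at one `y` is only a spot check, since the definition quantifies over all values.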


Theorem 3.1 shows that by jointly considering the two views, it is possible to recover , in contrast to the single-view setting. This result can be extended to learn the inverse of up to component-wise invertible functions. Consider the setting of Theorem 3.1, and the alternative factorization of the log joint probability given by


Suppose that satisfies the SDV assumption. Replacing the regression function with

results in inverting in the sense that the recover the independent components of up to component-wise invertible transformations. The proof can be found in Appendix D.2. These two results together mean that it is possible to learn inverses and of and , and therefore to recover and , up to component-wise invertible functions. Note, however, that doing so requires running two separate algorithms. Furthermore, there is no guarantee that the learned inverses and are ‘aligned’ in the sense that for each the components and correspond to the same components of .

This problem of misalignment can be resolved by changing the form of the regression function.

Consider the settings of Theorem 3.1 and Corollary 1. Suppose that both and satisfy the SDV assumption. Replacing the regression function with


results in , inverting , in the sense that the and recover the independent components of and up to two different component-wise invertible transformations. Furthermore, the two representations are aligned, i.e. for ,

The proof can be found in Appendix E.

Note that the aligned result above is not a generalisation of the two preceding results (the first theorem of this section and its corollary), since it makes stricter assumptions by imposing the SDV assumption on both factorisations. In contrast, each of the two preceding results requires the SDV assumption to hold for only one of them.

For cases in which finding aligned representations for both views is desired, the aligned result should be applied. If the only goal is recovery of $\mathbf{s}$, the assumptions of the first theorem are simpler to verify.

In practical applications, the multi-view scenario is useful in multimodal datasets where one of the two acquisition modalities has much higher signal to noise ratio than the other one (e.g., in neuroimaging, when simultaneous fMRI and Optical Imaging recordings are compared). In such cases, jointly exploiting the multiple modalities would help to discern a meaningful and identifiable latent representation which could not be attained through analysis of the more reliable modality alone.

3.1.1 Equivalence with Permutation Contrastive Learning for Time Dependent Sources

Note that the analysis of Theorem 3.1 covers the case of temporally dependent stationary sources analyzed in [20]. Indeed, if it is further assumed that and are uniformly dependent [20], they can be seen as a pair of subsequent time points of an ergodic stationary stochastic process for which the analysis of Theorem 1 of [20] would hold. In other words, we can define a stochastic process as . Note that while the two formulations are theoretically equivalent, our view offers a wider applicability as it covers the asynchronous sensing of , provided that multiple measurements (i.e. ) are available; additionally, our Sufficiently Distinct Views assumption does not necessarily imply uniform dependency. Furthermore, while [20] considers a generative model of the form , thus constraining the mixing function to be the same for any two data points , , in our setting we consider two different mixing functions, $\mathbf{f}_1$ and $\mathbf{f}_2$, for the two different views. Finally, we study this setting as an intermediate step for the following two sections, in which no deterministic function of the sources is observed; there, learning to invert any of the $\mathbf{f}_i$ can only recover $\mathbf{s}$ up to the corruption operated by $\mathbf{g}_i$.

3.2 Two Noisy Views

Figure 2: Setting with two views of the sources $\mathbf{s}$, both corrupted by noise.

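We next consider the setting in which both variables are corrupted by noise. Consider the following generative model (represented in Figure 2):

$$\mathbf{x}_1 = \mathbf{f}_1(\mathbf{g}_1(\mathbf{s}, \mathbf{n}_1)), \qquad \mathbf{x}_2 = \mathbf{f}_2(\mathbf{g}_2(\mathbf{s}, \mathbf{n}_2)),$$

where all variables take value in $\mathbb{R}^D$, $\mathbf{f}_1$ and $\mathbf{f}_2$ are nonlinear, invertible, deterministic functions, $\mathbf{g}_1$ and $\mathbf{g}_2$ are component-wise corrupters, and $\mathbf{s}$ and the $\mathbf{n}_i$ are independent with independent components. This class of models generalizes the setting of Section 3.1, since by taking $\mathbf{g}_1(\mathbf{s}, \mathbf{n}_1) = \mathbf{s}$ we reduce to the case of one noiseless observation.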

The difference $\log p(\mathbf{x}_1, \mathbf{x}_2) - \log p(\mathbf{x}_1)\, p(\mathbf{x}_2)$ admits similar factorizations to those given in Equations 8 and 12:


Since we only have access to corrupted observations, exact recovery of is not possible. Nonetheless, a generalization of Theorem 3.1 holds showing that the can be inverted and recovered up to the corruptions induced by the via . Suppose that and satisfy the SDV assumption. The algorithm described in Theorem 3.1 with regression function specified in Equation 13 results in and inverting and in the sense that the and recover the independent components of and up to two different component-wise invertible transformations. Furthermore, the two representations are aligned, i.e. for ,

The proof can be found in Appendix E.

We can thus recover the common source up to the corruptions . In the limit of the magnitude of one of the noise variables going to zero, the reconstruction of the sources attained through the corresponding view is exact up to the component-wise invertible functions, as stated in the following corollary.

Let for , where is a fixed random variable, and be a random variable that does not depend on . Let be the output of the algorithm specified by Theorem 3.2 with noise variables and .

Suppose that the corrupters satisfy the following two criteria:

  1. s.t. for all

  2. s.t.

Then, denoting by the set of all scalar, invertible functions, we have that

The proof can be found in Appendix F.

Corollary 3.2 implies that in the limit of small noise, the sources can be recovered exactly. Condition i) upper bounds the influence of the noise on the corruption: we cannot hope to retrieve $\mathbf{s}$ if the corruption contains too little signal. Condition ii) ensures that the corrupter is invertible with respect to $\mathbf{s}$ when the noise is equal to zero. If this were not satisfied, some information about $\mathbf{s}$ would be washed out by the corrupter even in the absence of noise. This would make recovery of $\mathbf{s}$ trivially impossible.

3.3 Multiple Noisy Views

The results of Section 3.2 state that in the two noisy view setting, $\mathbf{s}$ can be recovered up to the corruptions. In the limit that the magnitude of the noises goes to zero, the uncorrupted $\mathbf{s}$ can be recovered. The intuition is that the less noise there is, the more information each observation provides about $\mathbf{s}$.

Figure 3: Setting with $N$ corrupted views of the sources.

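In this section we consider the multi-view setting, where $N$ distinct noisy views of $\mathbf{s}$ are available,

$$\mathbf{x}_i = \mathbf{f}_i(\mathbf{g}_i(\mathbf{s}, \mathbf{n}_i)), \qquad i = 1, \dots, N,$$

and the noise variables $\mathbf{n}_i$ are mutually independent, as represented in Figure 3. Since each view provides additional information about $\mathbf{s}$, we ask: in the limit as $N \to \infty$, is it possible to reconstruct $\mathbf{s}$ exactly?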

By applying Theorem 3.2 to the pair it is possible to recover such that the components are aligned, but up to different component-wise invertible functions and . Running the algorithm on a different pair will result in recovery up to different component-wise invertible functions and .

Note that these will not necessarily result in and being aligned with each other. However, the components of and are the same, up to permutation and component-wise invertible functions. This permutation can therefore be undone by performing independence testing between each pair of components. Components that are ‘different’ will be independent; those that are the same will be deterministically related. Therefore, they can be used as a reference to permute the components of and make it aligned with .
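The permutation step can be sketched as follows. Instead of a formal independence test, this example uses the absolute rank correlation between components as a simple dependence score, which is a swapped-in technique chosen for brevity: it detects deterministic relations through continuous invertible scalar functions (which are monotone), and is close to zero for independent components. The function names are hypothetical.

```python
import numpy as np

def abs_rank_correlation(za, zb):
    """Matrix of absolute Spearman correlations between columns of za and zb."""
    def standardized_ranks(z):
        r = np.argsort(np.argsort(z, axis=0), axis=0).astype(float)
        r -= r.mean(axis=0)
        return r / r.std(axis=0)
    ra, rb = standardized_ranks(za), standardized_ranks(zb)
    return np.abs(ra.T @ rb) / za.shape[0]

def match_components(za, zb):
    """Return `order` such that zb[:, order[i]] is the component of zb that
    depends on za[:, i]. Raises if the best matches are not a permutation."""
    c = abs_rank_correlation(za, zb)
    order = c.argmax(axis=1)
    if len(set(order.tolist())) != len(order):
        raise ValueError("ambiguous matching; a proper dependence test is needed")
    return order

rng = np.random.default_rng(0)
s = rng.laplace(size=(2000, 3))
true_perm = np.array([2, 0, 1])
za = np.tanh(s)                          # reference representation
zb = np.exp(-s[:, true_perm])            # permuted, with different scalar transforms
order = match_components(za, zb)
zb_aligned = zb[:, order]                # column i now corresponds to za[:, i]
```

Here `zb_aligned[:, i]` and `za[:, i]` are functions of the same source component, so the second representation has been permuted into alignment with the first.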

The problem is then how to combine the information from each aligned to more precisely identify . The fact that the components are recovered up to different scalar invertible functions makes combining information from different views non-trivial.

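As a first step in this direction, we consider the special case that each $\mathbf{g}_i$ acts additively, each $\mathbf{n}_i$ is zero mean, and $\mathbf{s}$ and the $\mathbf{n}_i$ are independent with independent components:

$$\mathbf{x}_i = \mathbf{f}_i(\mathbf{s} + \mathbf{n}_i), \qquad i = 1, \dots, N.$$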
Suppose to begin with that we are able to recover each without the usual component-wise invertible functions. Then, writing to denote all of the , it is possible to estimate as

Subject to mild conditions on the rate of growth of the variances

as , Kolmogorov’s strong law implies that is a good approximation to as in the sense that . This implies moreover that it is possible to reconstruct the by considering the residue .
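The averaging argument can be illustrated numerically. The sketch below assumes the idealized case described above, in which each corrupted source $\mathbf{s} + \mathbf{n}_i$ is available directly, without the component-wise ambiguities; the Gaussian noise with equal variance across views and the helper names are assumptions made for the example.

```python
import numpy as np

def average_views(z):
    """z has shape (n_views, n_samples, dim); returns the mean over views."""
    return z.mean(axis=0)

def rms_error(n_views, s, rng, noise_std=1.0):
    """Root-mean-square error of the view average as an estimate of s,
    with independent zero-mean additive noise in every view."""
    noise = rng.normal(scale=noise_std, size=(n_views,) + s.shape)
    z = s[None, :, :] + noise            # z_i = s + n_i for each view i
    return float(np.sqrt(np.mean((average_views(z) - s) ** 2)))

rng = np.random.default_rng(0)
s = rng.laplace(size=(200, 3))
err_10 = rms_error(10, s, rng)           # roughly 1 / sqrt(10)
err_1000 = rms_error(1000, s, rng)       # roughly 1 / sqrt(1000)
```

With unit-variance noise the error shrinks at the usual $1/\sqrt{N}$ rate, and subtracting the average from each `z[i]` would give an estimate of the corresponding noise realization.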

In the presence of the unknown functions , we would be able to reconstruct and the if we were able to identify the inverses for each . For any component-wise invertible functions , define

is something we can choose and is the output of the algorithm, and hence and are random variables with known distributions. Subject to mild conditions, the dependence of these quantities on most or all of the becomes increasingly small as grows and disappears in the limit .

Suppose that the sequence converges as for almost all , and write

Suppose further that there exists such that for all . Then

The proof can be found in Appendix G. Given some choice of , we can think of and as our putative candidates for and respectively. As discussed earlier, if we could identify , then we would have and , and thus and would satisfy the same independences and other statistical properties as and respectively. Can we use these properties as criteria to identify good choices of ?

The following theorem gives a set of sufficient conditions under which each inverts up to some affine ambiguity which is the same for every .

Suppose there exists such that for all and let s.t.



where denotes the element-wise product with the scalar elements of . If , then , and so is non-empty for sufficiently large. The proof can be found in Appendix H. It follows that it is possible to recover and up to and via and .

We remark that each of the conditions 17-21 can be verified from known information. We conjecture that condition 22 can be relaxed to assuming the verifiable condition of independence between and for all along with additional regularity assumptions on the functional form of (e.g. smoothness).

To conclude, Theorem 8 provides sufficient conditions under which it is possible to fully reconstruct $\mathbf{s}$ with corrupted views. In contrast to previous results in Sections 3.1 and 3.2, this result leverages infinitely many corrupted views rather than vanishingly small corruption of finitely many views.

4 Related Work

A central concept in our work is that of multiple simultaneous views and joint extraction of features from them. We briefly review some related work considering similar settings.

4.1 Canonical Correlation Analysis

Given two (or more) random variables, the goal of Canonical Correlation Analysis (CCA) [18] is to find a corresponding pair of linear subspaces that have high cross-correlation, so that each component within one of the subspaces is correlated with a single component from the other subspace [5]. In dealing with correlation instead of independence, CCA is more closely related to Principal Component Analysis (PCA) than to ICA.

CCA can be interpreted probabilistically [4] and is equivalent to maximum likelihood estimation in a graphical model which is a special case of that depicted in Figure 2. The differences compared to our setting are (i) the latent components retrieved in CCA are forced to be uncorrelated, whereas our method retrieves independent components; (ii) in CCA, mappings between the sources and the observations are linear, whereas our method allows for nonlinear mappings.
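For comparison with the nonlinear setting, linear CCA itself can be sketched in a few lines. This is a standard textbook construction (whitening each view and taking the singular value decomposition of the whitened cross-covariance), not code from the paper; the small ridge term `reg` and the toy two-view data are assumptions for the example.

```python
import numpy as np

def linear_cca(x, y, reg=1e-8):
    """Return projection matrices (a, b) and canonical correlations."""
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    n = x.shape[0]
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    lx, ly = np.linalg.cholesky(cxx), np.linalg.cholesky(cyy)
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}
    m = np.linalg.solve(lx, cxy) @ np.linalg.inv(ly).T
    u, corrs, vt = np.linalg.svd(m, full_matrices=False)
    a = np.linalg.solve(lx.T, u)         # canonical directions for x
    b = np.linalg.solve(ly.T, vt.T)      # canonical directions for y
    return a, b, corrs

rng = np.random.default_rng(0)
n = 2000
shared = rng.laplace(size=n)             # one-dimensional common source
x = np.outer(shared, [1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=(n, 3))
y = np.outer(shared, [0.7, 1.3]) + 0.1 * rng.normal(size=(n, 2))
a, b, corrs = linear_cca(x, y)
proj_x = (x - x.mean(axis=0)) @ a[:, 0]
proj_y = (y - y.mean(axis=0)) @ b[:, 0]
```

The first canonical pair captures the shared linear source, while the remaining canonical correlations stay near zero; nothing in this procedure addresses nonlinear mixing or statistical independence.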

At a high level, the model we consider in Section 3.2 is to CCA as nonlinear ICA is to PCA. Nonlinear extensions of the basic CCA framework have been proposed [26, 13, 3, 29], but identifiability results in the sense we consider in this paper are lacking.

4.2 Multi-View Latent Variable Models

Bearing a strong resemblance to our considered setting, [27] proposes a sequence of diffusion maps to find the common source of variability captured by multiple sensors, discarding irrelevant sensor-specific effects. It computes the distance among the samples measured by different sensors to form a similarity matrix for the measurements of each sensor; each similarity matrix is then associated to a diffusion operator, which is a Markov matrix by construction. A Markov chain is then run by alternately applying these Markov matrices on the initial state. During these Markovian dynamics, sensor specific information will eventually vanish, and the final state will only contain information on the common source. While the method focuses on recovering the common information in the form of a parametrization of the common variable, our method both inverts the mixing mechanisms of each view and recovers the common latent variables.

[39] proves identifiability for multi-view, latent variable models, unifying previously proposed spectral techniques [2]. However, while the setting is similar to the one considered in this work, both the objectives and the employed methods are different. The paper considers the setting in which variables , are observed; additionally, there exists an unobserved latent variable , such that conditional distributions are independent. While the setting bears obvious similarities with our multi-view ICA, the method proposed in [39] is aimed at learning the mixture parameters, rather than the exact realization of latent variables. Their method is based on the mean embedding of distributions in a Reproducing Kernel Hilbert Space and a result of identifiability for the parameters of the mean embeddings of and is proved.

Another related field of study is multi-view clustering, which considers a multi-view setting and aims at performing clustering on a given dataset, see e.g. [11] and [25]. While related to our setting, this line of work differs from it in two key ways. Firstly, clustering can be thought of as assigning a discrete latent label per datapoint. In contrast, our setting seeks to recover a continuous latent vector per datapoint. Secondly, since no underlying generative model with discrete latent variable is assumed, identifiability results are not given.

4.3 Half-Sibling Regression

Half-sibling regression [35] is a method to reconstruct a source from noisy observations by exploiting other sources that are affected by the same noise process but otherwise independent from it.

Suppose that a latent variable of interest $Q$ is not directly available, and that we can only observe corrupted versions of it, denoted as $X$, where the corruption is due to a noise $N$. Without knowledge of $N$, it is impossible to reconstruct $Q$. However, if one or more additional variables $Y$, also influenced by $N$, are observed, we can exploit them to model the effect of $N$ on $X$ by regressing $X$ on $Y$.

Subtracting this from the observed $X$ recovers the latent variable $Q$ up to a constant offset, provided that (1) the additivity assumption $X = Q + f(N)$ holds, and (2) that $Y$ contains sufficient information about $f(N)$. Analogous to our aim of recovering $\mathbf{s}$, the goal of half-sibling regression is not to infer only the distribution of $Q$, but rather the random variable itself (almost surely).
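The procedure can be sketched on synthetic data. This example assumes a linear effect of the noise and uses ordinary least squares in place of a general regression method; the variable names follow the usual half-sibling regression notation ($Q$ for the quantity of interest, $N$ for the noise, $X$ and $Y$ for the observations), and the data-generating choices are assumptions made for the illustration.

```python
import numpy as np

def half_sibling_regression(x, y):
    """Estimate the latent signal as x minus a linear fit of x on y.
    The intercept absorbs the mean, so the result is defined up to an offset."""
    design = np.column_stack([np.ones(len(x)), y])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef

rng = np.random.default_rng(0)
n = 5000
q = rng.uniform(-1.0, 1.0, size=n)       # latent variable of interest
noise = rng.normal(size=n)               # shared noise process
x = q + 2.0 * noise                      # corrupted observation, additive in f(noise)
y = noise + 0.05 * rng.normal(size=n)    # half-sibling: sees the noise, not q
q_hat = half_sibling_regression(x, y)

corr_before = np.corrcoef(x, q)[0, 1]
corr_after = np.corrcoef(q_hat, q)[0, 1]
```

Because `y` is independent of `q`, the regression can only explain the noise-driven part of `x`, and the residual `q_hat` tracks `q` far more closely than the raw observation does.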

5 Discussion and Conclusion

We presented identifiability results in a novel setting by extending the formalism of nonlinear ICA. We have investigated different scenarios of multi-view latent variable models and provided theoretical proofs on the possibility of inverting the mixing function and recovering the sources in each case. Our results thus extend the scarce literature on identifiability for nonlinear ICA models.

In the classical noiseless ICA setting, the deterministic relationship between the sources and observations means that inverting the mixing function and recovering the sources are equivalent. In contrast, we consider views of corrupted versions of the common sources, resulting in the decoupling of the demixing and retrieval of the sources. Remarkably, Theorem 3.3 points towards the possibility of simultaneously solving the two problems in the limit of infinitely many views.

Classical nonlinear ICA is provably non-identifiable because a single view is not sufficiently informative to resolve non-trivial ambiguities when recovering the sources. While many papers in the ICA literature have explored placing restrictions either on the source distribution or on the form of the mixing to resolve these ambiguities, in this paper we consider exploiting additional views to constrain the inverse problem. Clearly, if a second view is identical to the first, then nothing is gained by its observation. Hence, in order for the second view to assist in resolving ambiguity, it must be sufficiently different from the first. This is the intuition behind the technical assumption of sufficiently distinct views.

Typically, noise is a nuisance variable that one would prefer to be absent. In our setting, however, the noise variables acting on the sources are a crucial component, without which the contrastive learning approach could not be applied. Furthermore, the assumption of sufficiently distinct views is ultimately an assumption about the complexity of the joint distribution of the (corrupted) sources corresponding to each view. Without the noise variables the sufficiently distinct views assumption could not hold.

Our setting is relevant to a number of practical real-world applications, namely to all datasets that include multiple distinct measurements of related phenomena. In practice, it may be better to think of the noise variables as intrinsic sources of variability specific to each view. The requirement that such variability be present is therefore unlikely to be a significant limitation in most practical applications, given the prevalence of stochasticity in real-world systems.

An exemplary application of our method can be found in the field of neuroimaging. Consider a study involving a cohort of subjects (perceivers), measuring their response to the presentation of the same stimulus. One of the key problems in the field is how to extract a shared response from all subjects despite high inter-subject variability and complex nonlinear mappings between latent source and observation [7, 16]. Our results provide principled ways to extract and decompose the components of the shared response. In particular, the setting described in our model is suited to account for the high variability of the responses throughout the cohort, since the measurement corresponding to each subject is given by a combination of individual variability and shared response.

Looking to the future, we note that Theorem 3.3 builds on the setting of Theorem 3.2, which only makes use of pairwise information from the observations. A natural extension of this work would be to investigate algorithms that explicitly make use of all views jointly, which we conjecture would allow relaxation of the additivity assumption on the corruptions. Furthermore, Theorem 3.3 provides results that hold only in the asymptotic limit as the number of views becomes large. Another extension of this result would be an analysis of the case of finitely many views.


Thanks to Krikamol Muandet for providing his office for fruitful discussions, to Matthias Bauer and Manuel Wüthrich for proofreading and to Lucia Busso for interesting input about linguistics.


  • [1] Naomi Allen, Cathie Sudlow, Paul Downey, Tim Peakman, John Danesh, Paul Elliott, John Gallacher, Jane Green, Paul Matthews, Jill Pell, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, 2012.
  • [2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
  • [3] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning, pages 1247–1255, 2013.
  • [4] Francis R Bach and Michael I Jordan. A probabilistic interpretation of canonical correlation analysis. 2005.
  • [5] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
  • [6] Jean-François Cardoso. The three easy routes to independent component analysis; contrasts and geometry. In Proc. ICA, volume 2001, 2001.
  • [7] Po-Hsuan Cameron Chen, Janice Chen, Yaara Yeshurun, Uri Hasson, James Haxby, and Peter J Ramadge. A reduced-dimension fMRI shared response model. In Advances in Neural Information Processing Systems, pages 460–468, 2015.
  • [8] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6572–6583, 2018.
  • [9] Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
  • [10] George Darmois. Analyse générale des liaisons stochastiques: etude particulière de l’analyse factorielle linéaire. Revue de l’Institut international de statistique, pages 2–8, 1953.
  • [11] Virginia R De Sa. Spectral clustering with two views. In ICML workshop on learning with multiple views, pages 20–27, 2005.
  • [12] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • [13] Kenji Fukumizu, Francis R Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(Feb):361–383, 2007.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [15] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
  • [16] James V Haxby, J Swaroop Guntupalli, Andrew C Connolly, Yaroslav O Halchenko, Bryan R Conroy, M Ida Gobbini, Michael Hanke, and Peter J Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404–416, 2011.
  • [17] Timo Honkela, Aapo Hyvärinen, and Jaakko J Väyrynen. WordICA-emergence of linguistic representations for words by independent component analysis. Natural Language Engineering, 16(3):277–308, 2010.
  • [18] Harold Hotelling. Relations between two sets of variates. In Breakthroughs in statistics, pages 162–190. Springer, 1992.
  • [19] Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pages 3765–3773, 2016.
  • [20] Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of Temporally Dependent Stationary Sources. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 460–469, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
  • [21] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • [22] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
  • [23] Aapo Hyvarinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 859–868. PMLR, 16–18 Apr 2019.
  • [24] Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-RevNet: Deep Invertible Networks. In ICLR 2018 - International Conference on Learning Representations, Vancouver, Canada, April 2018.
  • [25] Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In Advances in neural information processing systems, pages 1413–1421, 2011.
  • [26] Pei Ling Lai and Colin Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05):365–377, 2000.
  • [27] Roy R Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
  • [28] Martin J McKeown and Terrence J Sejnowski. Independent component analysis of fMRI data: examining the assumptions. Human brain mapping, 6(5-6):368–372, 1998.
  • [29] Tomer Michaeli, Weiran Wang, and Karen Livescu. Nonparametric canonical correlation analysis. In International Conference on Machine Learning, pages 1967–1976, 2016.
  • [30] Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nature neuroscience, 19(11):1523, 2016.
  • [31] Danielle Nuzillard and Albert Bijaoui. Blind source separation and analysis of multispectral astronomical images. Astronomy and Astrophysics Supplement Series, 147(1):129–138, 2000.
  • [32] Erkki Oja, Kimmo Kiviluoto, and Simona Malaroiu. Independent component analysis for financial time series. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pages 111–116. IEEE, 2000.
  • [33] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.
  • [34] Hiroshi Sawada, Ryo Mukai, and Shoji Makino. Direction of arrival estimation for multiple source signals using independent component analysis. In Seventh International Symposium on Signal Processing and Its Applications, 2003. Proceedings., volume 2, pages 411–414. IEEE, 2003.
  • [35] Bernhard Schölkopf, David W Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. Modeling confounding by half-sibling regression. Proceedings of the National Academy of Sciences, 113(27):7391–7398, 2016.
  • [36] Meredith A Shafto, Lorraine K Tyler, Marie Dixon, Jason R Taylor, James B Rowe, Rhodri Cusack, Andrew J Calder, William D Marslen-Wilson, John Duncan, Tim Dalgleish, et al. The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurology, 14(1):204, 2014.
  • [37] Amit Singer and Ronald R Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008.
  • [38] Viktor Pavlovich Skitovich. Linear forms of independent random variables and the normal distribution law. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 18(2):185–200, 1954.
  • [39] Le Song, Animashree Anandkumar, Bo Dai, and Bo Xie. Nonparametric estimation of multi-view latent variable models. In International Conference on Machine Learning, pages 640–648, 2014.
  • [40] Anisse Taleb and Christian Jutten. Source separation in post-nonlinear mixtures. IEEE Transactions on signal Processing, 47(10):2807–2820, 1999.
  • [41] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The WU-Minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.

Appendix A On the Unidentifiability of Nonlinear ICA

The purpose of this section is to briefly review the proof of the unidentifiability of nonlinear ICA given in [22]. We assume the most general conventional form of nonlinear ICA, in which the generative model is

$$\mathbf{x} = \mathbf{f}(\mathbf{s}),$$

where $\mathbf{s}$ are the independent sources and $\mathbf{x}$ are the mixed signals. In the following, we show how to construct a function $\mathbf{g}$ so that the components of $\mathbf{y} = \mathbf{g}(\mathbf{x})$ are independent. More importantly, we show that this construction is by no means unique.

A.1 Existence

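The method proposed in [22] is a generalization of the well-known Gram-Schmidt orthogonalization. Given $m$ independent variables $y_1, \dots, y_m$ and a variable $x$, one constructs a new variable $y_{m+1} = g(y_1, \dots, y_m, x)$ so that the set $y_1, \dots, y_{m+1}$ is mutually independent. The construction is defined recursively as follows. Assume we have $m$ independent random variables $y_1, \dots, y_m$ with uniform distribution in $[0, 1]^m$, let $x$ be any random variable, and let $a_1, \dots, a_m, b$ be some nonrandom scalars. Next, we define

$$g(a_1, \dots, a_m, b; p_{y,x}) = P(x \le b \mid y_1 = a_1, \dots, y_m = a_m) = \frac{\int_{-\infty}^{b} p_{y,x}(a_1, \dots, a_m, \xi)\, d\xi}{p_y(a_1, \dots, a_m)},$$

where $p_{y,x}$ is the joint density of $(y_1, \dots, y_m, x)$ and $p_y$ is the marginal density of $(y_1, \dots, y_m)$. Theorem 1 of [22] states that the random variable defined as $y_{m+1} = g(y_1, \dots, y_m, x; p_{y,x})$ is independent of $y_1, \dots, y_m$, and that $y_1, \dots, y_{m+1}$ are uniformly distributed in the unit cube $[0, 1]^{m+1}$.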
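A single step of this construction can be checked numerically. The following is only a minimal sketch under assumptions that are not in the paper: the new variable is jointly Gaussian with a latent Gaussian that defines the existing uniform variable, so that the conditional CDF is available in closed form. The names `std_normal_cdf`, `conditional_cdf_step` and the correlation parameter `rho` are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(v):
    """Standard normal CDF applied element-wise (uses only math.erf)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

def conditional_cdf_step(n_samples=50_000, rho=0.8, seed=0):
    """One recursion step: build y2 from (y1, x) via the conditional CDF of x given y1.

    Assumption for illustration: x and the latent Gaussian u behind y1 are
    jointly Gaussian with correlation rho, so x | u ~ N(rho * u, 1 - rho^2).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n_samples)
    y1 = std_normal_cdf(u)  # already-constructed uniform variable
    # New variable x that depends on y1 through u.
    x = rho * u + math.sqrt(1.0 - rho**2) * rng.standard_normal(n_samples)
    # Conditional CDF of x given y1, evaluated at the observed x.
    y2 = std_normal_cdf((x - rho * u) / math.sqrt(1.0 - rho**2))
    return y1, x, y2

y1, x, y2 = conditional_cdf_step()
corr_before = np.corrcoef(y1, x)[0, 1]   # x is strongly dependent on y1
corr_after = np.corrcoef(y1, y2)[0, 1]   # y2 should be uncorrelated with y1
print(f"corr(y1, x)  = {corr_before:.3f}")
print(f"corr(y1, y2) = {corr_after:.3f}")
print(f"mean(y2) = {y2.mean():.3f}, var(y2) = {y2.var():.4f} (uniform: 0.5, 1/12)")
```

The vanishing correlation is only a sanity check, not a proof of independence; in this Gaussian toy case independence holds by construction, since `y2` is a function of the fresh noise alone.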

A.2 Non-Uniqueness

In the previous section, it was shown that there exists a mapping $\mathbf{g}$ that transforms any random vector $\mathbf{x}$ into a uniformly distributed random vector $\mathbf{y} = \mathbf{g}(\mathbf{x})$. Here, we show that the construction of $\mathbf{g}$ is not unique and that this non-uniqueness can arise from several sources.

  • A linear transformation $\mathbf{x}' = \mathbf{L}\mathbf{x}$ can precede the nonlinear map: one then computes the independent components $\mathbf{y}' = \mathbf{g}'(\mathbf{x}')$, where $\mathbf{g}'$ is computed as described in the previous section. The new map gives a new decomposition of $\mathbf{x}$ into independent components which cannot be trivially reduced to $\mathbf{y}$.

  • An element-wise function $\mathbf{h}$ can first be applied to the independent sources to give new sources $\mathbf{s}'$ such that $s'_i = h_i(s_i)$. Constructing the solution for these rescaled versions of the sources gives a new decomposition into independent components.

  • Consider a class of measure-preserving automorphisms $\mathbf{a}: [0,1]^n \to [0,1]^n$. Such a mapping does not change the probability distribution of a random variable uniformly distributed in the $n$-dimensional hypercube. The composition $\mathbf{a} \circ \mathbf{g}$ gives another solution to nonlinear ICA. Therefore, the class of measure-preserving automorphisms gives a parameterization of the solutions to nonlinear ICA, introducing a class of non-trivial indeterminacies.

Figure 4: The Rosetta Stone, a stele found in 1799, inscribed with three versions of a decree issued at Memphis, Egypt in 196 BC. The top and middle texts are in Ancient Egyptian using hieroglyphic script and Demotic script, respectively, while the bottom is in Ancient Greek. (Source: Wikipedia)

If only independence among the components matters, it is possible to construct a mapping $\mathbf{g}$ such that each component $y_i$ of $\mathbf{y} = \mathbf{g}(\mathbf{x})$ is independent of $y_j$ for $i \neq j$ and $\mathbf{y}$ is uniformly distributed in $[0,1]^n$. This shows that at least one solution exists. The non-uniqueness of the solution can be shown by parameterizing a class of infinitely many solutions. Once a $\mathbf{g}$ satisfying the above conditions is found, any measure-preserving automorphism $\mathbf{a}$ can be used to obtain a new solution $\mathbf{a} \circ \mathbf{g}$, so there are infinitely many solutions to nonlinear ICA whose relations are nontrivial.
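One concrete measure-preserving automorphism of the unit square is obtained by mapping uniform variables to Gaussians, rotating, and mapping back. The sketch below illustrates this with samples; it is an assumed example rather than the construction used in [22], and the names `rotated_uniform_pair` and `angle` are hypothetical. To avoid an explicit inverse CDF, both solutions are generated from the same underlying Gaussian sample.

```python
import math
import numpy as np

def std_normal_cdf(v):
    """Standard normal CDF applied element-wise (uses only math.erf)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

def rotated_uniform_pair(n_samples=50_000, angle=math.pi / 4, seed=0):
    """Return two uniform vectors with independent components, related by a
    nontrivial measure-preserving map (Gaussianize, rotate, un-Gaussianize)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, 2))          # isotropic Gaussian
    c, s = math.cos(angle), math.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    y = std_normal_cdf(z)                            # first solution
    y_rot = std_normal_cdf(z @ rotation.T)           # second solution
    return y, y_rot

y, y_rot = rotated_uniform_pair()
# Components of the second solution remain uncorrelated (independent by construction) ...
within = np.corrcoef(y_rot[:, 0], y_rot[:, 1])[0, 1]
# ... yet each of them mixes both components of the first solution.
cross_0 = np.corrcoef(y_rot[:, 0], y[:, 0])[0, 1]
cross_1 = np.corrcoef(y_rot[:, 0], y[:, 1])[0, 1]
print(f"corr within rotated solution: {within:.3f}")
print(f"corr with original components: {cross_0:.3f}, {cross_1:.3f}")
```

Because a rotated isotropic Gaussian is again isotropic, both outputs are uniform on the square with independent components, but neither is a component-wise function of the other.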

A.3 The Scalar Invertible Function Gauge

Another indeterminacy arises from invertible element-wise functions applied to the individual components, which adds a further dimension of ambiguity. Non-Gaussianity cannot help here, since we can construct any marginal distribution by combining the CDF of the observed variable with the inverse CDF of the target marginal distribution. This indeterminacy is in some sense unavoidable and is related to the fact that in linear ICA recovery of the sources is possible only up to a scalar multiplicative ambiguity.
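The CDF and inverse-CDF composition mentioned above can be illustrated with a short sketch. The choice of an exponential source and a standard logistic target is an assumption made here because both have closed-form CDFs; the function name `exponential_to_logistic` is hypothetical.

```python
import numpy as np

def exponential_to_logistic(x):
    """Map Exp(1) samples to standard-logistic samples, element-wise.

    Step 1: apply the CDF of Exp(1), giving a uniform variable on (0, 1).
    Step 2: apply the inverse CDF (logit) of the standard logistic distribution.
    """
    u = 1.0 - np.exp(-x)
    return np.log(u) - np.log1p(-u)

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
y = exponential_to_logistic(x)

print(f"mean(y) = {y.mean():.3f} (standard logistic: 0)")
print(f"var(y)  = {y.var():.3f} (standard logistic: pi^2 / 3 = {np.pi**2 / 3:.3f})")
```

The map is strictly increasing and hence invertible, so it changes the marginal distribution of a component without affecting its independence from the other components.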

Appendix B Why Does Classification Result in the Log Ratio?

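Let us suppose that a variable $\mathbf{x}$ is drawn with equal probability from two distributions $P_0$ and $P_1$ with densities $p_0$ and $p_1$ respectively. We train a classifier $D$ to estimate the posterior probability that a particular realization of $\mathbf{x}$ was drawn from $P_0$ with the cross entropy loss, i.e. the parameters of $D$ are chosen to minimize

$$\mathcal{L}(D) = -\mathbb{E}_{\mathbf{x} \sim P_0}\left[\log D(\mathbf{x})\right] - \mathbb{E}_{\mathbf{x} \sim P_1}\left[\log\left(1 - D(\mathbf{x})\right)\right].$$

As shown in, for instance, [14], the global optimum of this loss occurs when $D(\mathbf{x}) = p_0(\mathbf{x}) / \left(p_0(\mathbf{x}) + p_1(\mathbf{x})\right)$, which can be rewritten as

$$D(\mathbf{x}) = \frac{1}{1 + \exp\left(-\log\frac{p_0(\mathbf{x})}{p_1(\mathbf{x})}\right)}.$$

Recall that in our setting, the function $r$ is trained to classify between the two cases that $(\mathbf{x}_1, \mathbf{x}_2)$ is drawn from the joint distribution (class 0) or the product of marginals (class 1). $r$ is trained so that $1 / \left(1 + \exp(-r(\mathbf{x}_1, \mathbf{x}_2))\right)$ estimates the posterior probability of belonging to class 0. By comparing to Equation 26, it can be seen that, at the optimum,

$$r(\mathbf{x}_1, \mathbf{x}_2) = \log p(\mathbf{x}_1, \mathbf{x}_2) - \log p(\mathbf{x}_1)\, p(\mathbf{x}_2).$$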
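The link between cross-entropy classification and the log-ratio can be checked on a toy problem. The sketch below is an assumed example, not the paper's experiment: the two classes are the Gaussians N(+1, 1) and N(-1, 1), for which the log density ratio is exactly 2x, and the classifier is a one-dimensional logistic regression fitted by plain gradient descent. The name `fit_logistic_1d` and all hyperparameters are hypothetical.

```python
import numpy as np

def fit_logistic_1d(x, labels, lr=0.5, n_steps=2000):
    """Fit D(x) = sigmoid(w * x + b) by gradient descent on the cross-entropy loss."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        grad = p - labels              # derivative of the loss w.r.t. the logit
        w -= lr * np.mean(grad * x)
        b -= lr * np.mean(grad)
    return w, b

rng = np.random.default_rng(0)
n = 20_000
x_a = rng.normal(loc=1.0, scale=1.0, size=n)    # samples from the first density
x_b = rng.normal(loc=-1.0, scale=1.0, size=n)   # samples from the second density
x = np.concatenate([x_a, x_b])
labels = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = drawn from the first density

w, b = fit_logistic_1d(x, labels)
# The learned logit w * x + b should approximate the true log-ratio, which is 2 * x here.
print(f"learned logit: {w:.3f} * x + {b:.3f}")
```

The logit of the trained classifier, rather than its probability output, is the quantity that approximates the log-ratio.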

Note that in order for the classification trick of contrastive learning to be useful, the variables $\mathbf{x}_1$ and $\mathbf{x}_2$ cannot be deterministically related. If they are, the log-ratio is everywhere either $+\infty$ or $-\infty$ and hence the learned features are not useful.

To see why this is the case, suppose that $\mathbf{x}_1$ and $\mathbf{x}_2$ are each $N$-dimensional vectors. If they are deterministically related, $p(\mathbf{x}_1, \mathbf{x}_2)$ puts mass on an $N$-dimensional submanifold of a $2N$-dimensional space. On the other hand, $p(\mathbf{x}_1)\, p(\mathbf{x}_2)$ puts mass on a $2N$-dimensional manifold, since it is the product of two distributions each of which is $N$-dimensional.

In this case, the joint distribution and the product of marginals are not absolutely continuous with respect to one another, and thus the log-ratio is ill-defined: the density ratio is infinite at any point at which the joint puts mass, and zero at points where the product of marginals puts mass and the joint does not.
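The degeneracy can be seen by letting the noise level shrink in a toy model with a closed-form log-ratio. The sketch below assumes scalar variables with x1 ~ N(0, 1) and x2 = x1 + sigma * n, n ~ N(0, 1); this model and the name `gaussian_log_ratio` are illustrative assumptions, not the paper's setting.

```python
import math

def gaussian_log_ratio(x1, x2, sigma):
    """log p(x1, x2) - log p(x1) p(x2) for x1 ~ N(0, 1), x2 = x1 + sigma * n, n ~ N(0, 1).

    Uses log p(x1, x2) - log p(x1) p(x2) = log p(x2 | x1) - log p(x2).
    """
    # p(x2 | x1) is Gaussian with mean x1 and variance sigma^2.
    log_cond = (-0.5 * ((x2 - x1) / sigma) ** 2
                - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    # p(x2) is Gaussian with mean 0 and variance 1 + sigma^2.
    var2 = 1.0 + sigma ** 2
    log_marg = (-0.5 * x2 ** 2 / var2
                - 0.5 * math.log(var2) - 0.5 * math.log(2.0 * math.pi))
    return log_cond - log_marg

sigmas = [1.0, 0.1, 0.01, 0.001]
on_manifold = [gaussian_log_ratio(0.5, 0.5, s) for s in sigmas]   # point with x2 == x1
off_manifold = [gaussian_log_ratio(0.5, 0.6, s) for s in sigmas]  # point with x2 != x1
for s, on, off in zip(sigmas, on_manifold, off_manifold):
    print(f"sigma={s:<6} on-manifold: {on:10.3f}   off-manifold: {off:12.3f}")
```

As sigma decreases, the log-ratio grows without bound on the line x2 = x1 and eventually diverges to large negative values away from it, which is the deterministic limit described above.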

Appendix C The Sufficiently Distinct Views Assumption

We give the following two examples to provide intuition about the Sufficiently Distinct Views (SDV) assumption: one in which it does not hold, and another in which it does.

A simple case in which the assumption does not hold is when the conditional probability of given is Gaussian, as in


where is the normalization factor, . Since taking second derivatives of the log-probability with respect to results in constants, it can be easily shown that there is no way to find vectors , , such that the corresponding (see Definition 1) are linearly independent.
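To make the "second derivatives are constants" step concrete, the following worked equations consider a single scalar component. The symbols $u$, $v$, $\sigma$ and $Z$ are placeholders introduced here for illustration and are not necessarily the paper's notation.

```latex
\begin{align}
\log p(u \mid v) &= -\frac{(u - v)^2}{2\sigma^2} - \log Z,
  \qquad Z = \sqrt{2\pi}\,\sigma, \\
\frac{\partial^2 \log p(u \mid v)}{\partial u^2} &= -\frac{1}{\sigma^2},
  \qquad
\frac{\partial^2 \log p(u \mid v)}{\partial u\,\partial v} = \frac{1}{\sigma^2},
  \qquad
\frac{\partial^2 \log p(u \mid v)}{\partial v^2} = -\frac{1}{\sigma^2}.
\end{align}
```

None of these second derivatives depends on the evaluation point, so any vector assembled from them takes the same value wherever it is evaluated. A collection of identical vectors cannot be linearly independent, which is consistent with the failure of the assumption in the Gaussian case.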

The fact that the assumption breaks down in this case is reminiscent of the breakdown in the case of Gaussianity for linear ICA. Interestingly, in our work, the true latent sources are allowed to be Gaussian. In fact, the distribution of the sources does not enter the expression above.

An example in which the SDV assumption does hold is a conditional pdf given by


where is again a normalization function. Proving that this distribution satisfies the SDV assumption requires a few lines of computation. The idea is that can be written as the product of a matrix and vector which are functions only of and respectively. Once written in this form, it is straightforward to show that the columns of the matrix are linearly independent for almost all values of and that linearly independent vectors can be realized by different choices of .

Appendix D Proof of Theorem 3.1 and Corollary 1

D.1 Proof of Theorem 3.1

This proof is mainly inspired by the techniques employed in [23].


We have to show that, upon convergence, are s.t.

We start by writing the difference in log-densities of the two classes:

We now make the change of variables

and rewrite the first equation in the following form:


We take derivatives with respect to , , , of the LHS and RHS of equation 39. Adopting the conventions in 9 and 10 and


we have

where taking derivatives w.r.t. and for makes the LHS equal to zero, since the LHS has functions which depend on only one each. If we now rearrange our variables by defining vectors collecting all entries , , , and vectors with the variables , , , the above equality can be rewritten as