1 Introduction
Linear Independent Component Analysis (ICA) has become an important data analysis technique; for example, it is routinely used for blind source separation in a wide range of signals. The objective of ICA is to identify a linear transformation such that, after the projection, the components of the dataset are independent. More formally, the aim is to find an unmixing matrix that transforms the observed data into maximally independent components with respect to some measure of independence. Commonly, independence is approximated using a measure of non-gaussianity, such as kurtosis (Hyvärinen, 1999; Bell and Sejnowski, 1995). An obvious drawback of ICA is the restriction to linear transformations. Unfortunately, in many practical applications this linearity assumption does not hold, which motivates research into Nonlinear ICA (NICA) (Hyvarinen and Morioka, 2016; Hirayama et al., 2017). One of the key challenges in developing a nonlinear variant of ICA is devising an efficient measure of independence. The currently most popular approach is to constrain the transformation so that independence can be efficiently estimated (Tan et al., 2001; Almeida, 2003, 2004; Dinh et al., 2014; Zhang and Chan, 2008). Another approach is to learn the independence measure. This can be achieved using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). In ANICA (Adversarial Nonlinear ICA; Brakel and Bengio, 2017) the authors demonstrate the efficacy of using a GAN to learn an independence measure. They show that a GAN-based independence measure, combined with an autoencoder architecture, can be used to solve nonlinear blind source separation problems.
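To make the linear case concrete, below is a minimal sketch of the kurtosis-based non-gaussianity proxy mentioned above; the sample sizes, source distributions, and thresholds are illustrative assumptions of ours, not values from the text:

```python
import numpy as np

def excess_kurtosis(z):
    """Excess kurtosis of a 1-D sample; it is zero for Gaussian data,
    so its absolute value serves as a simple non-gaussianity proxy."""
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)
laplace = rng.laplace(size=100_000)   # a typical leptokurtic source
# A Gaussian sample has excess kurtosis near 0; a Laplace source near 3.
assert abs(excess_kurtosis(gaussian)) < 0.1
assert excess_kurtosis(laplace) > 1.0
```

Classical linear ICA algorithms such as FastICA search for projections maximizing exactly this kind of statistic.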
Unfortunately, the use of adversarial training in ANICA comes at the cost of added instability, as also noted by the authors. Our main contribution is an effective independence measure that does not require adversarial training, yet matches ANICA's performance. In other words, we find that adversarial training is not the key contributor to the efficacy of ANICA, and based on this insight we develop a simpler, closed-form independence measure. We demonstrate its efficacy on standard blind source separation problems.
This paper is structured as follows. We start by discussing related work in Section 2. In Section 3 we describe the key contribution: an independence measure based on the Cramer-Wold metric. ICA based on the introduced independence measure is described in Section 4. Finally, we report experimental results in Section 5.
2 Related work
The fundamental problem in solving NICA is that the solution is in principle non-identifiable. Without any constraints on the space of mixing functions, there exists an infinite number of solutions (Hyvärinen and Pajunen, 1999). To illustrate, consider that there is an infinite number of possible nonlinear decompositions of a random vector into independent components, and those decompositions are not related to each other in any trivial way. A related problem is that measuring true independence between distributions is often intractable. While linear ICA can be efficiently solved using approximate independence measures, such as kurtosis, these approaches do not transfer to the nonlinear scenario.
Perhaps the most common approach to solving NICA, which addresses both of these problems, is to pose a constraint on the nonlinear transformation (Lee et al., 1997; Tan et al., 2001; Almeida, 2003, 2004; Dinh et al., 2014). One of the first attempts was to generalize ICA by introducing nonlinear mixing models for which the solution is still identifiable (Lee et al., 1997). In (Le et al., 2011) the authors propose Reconstruction ICA (RICA), which requires the mixing matrix to be as close as possible to an orthonormal one. Thanks to such constraints, one can directly apply independence measures from classical linear ICA.
The aforementioned approaches are arguably limited in their expressive power. In a more recent attempt (Dinh et al., 2014), the authors propose a neural density model called Nonlinear Independent Components Estimation (NICE). The network is parameterized so that it is fully invertible and the output distribution is fully factorized (independent). However, the model learns the unmixing function by maximum likelihood, which requires specifying a prior density family.
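For intuition, the invertibility NICE relies on can be illustrated with an additive coupling layer, a minimal sketch of the construction in Dinh et al. (2014); the shift function `m` below is an arbitrary placeholder, not the network used in the paper:

```python
import numpy as np

def coupling_forward(x, m):
    """Additive coupling layer: split the input, leave one half
    unchanged, and shift the other half by an arbitrary function m
    of the first half.  Invertible with unit Jacobian determinant."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1, x2 + m(x1)], axis=-1)

def coupling_inverse(y, m):
    """Exact inverse of coupling_forward: subtract the same shift."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    return np.concatenate([y1, y2 - m(y1)], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
m = lambda h: np.tanh(h)   # any (even non-invertible) map works here
y = coupling_forward(x, m)
assert np.allclose(coupling_inverse(y, m), x)
```

Stacking such layers (alternating which half is shifted) yields the fully invertible networks that make the maximum-likelihood objective tractable.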
Our work is most closely related to the recently introduced Adversarial Nonlinear ICA model (ANICA) (Brakel and Bengio, 2017). In contrast to the previous methods, ANICA does not make any strong explicit assumptions on the transformation function. Instead, a clever adversarial measure for estimating and optimizing independence efficiently is proposed. In this work we take a closer look at this measure, and argue that its basic premise permits the construction of an effective non-parametrized independence measure.
Finally, let us note that large progress has been made in learning factorized representations using deep neural networks (Burgess et al., 2018; Chen et al., 2016). What separates ANICA and our method from this line of work is the direct encouragement of independence in the latent space. A similar path was also taken by (Kim and Mnih, 2018), where the VAE loss function is augmented with a cost term directly encouraging disentanglement.
3 Independence measure by Cramer-Wold distance
In this section we develop an efficient independence measure which, contrary to the ANICA model, does not require adversarial training. Our approach can be effectively used to solve nonlinear ICA, in contrast to many other metrics applicable only in the context of linear ICA.
In the following we discuss three independence metrics. First, we consider the distance correlation and the adversarial metric used in ANICA. We then introduce our Cramer-Wold-based independence metric.
Distance correlation
One of the most well-known measures of independence of random vectors is the distance correlation (dCor) (Székely et al., 2007), which is applied in (Matteson and Tsay, 2017) to solve the linear ICA problem. Importantly, the distance correlation of two random vectors equals zero if and only if they are independent. Moreover, it has a closed-form estimator.
However, to ensure the independence of all components of a given random vector, one has to compute dCor between every subset of coordinates and its complement (except for the trivial cases where either the subset or its complement is empty). As this procedure has exponential complexity with respect to the number of dimensions, we decided to use a simplified version which enforces only pairwise independence, summing dCor over all pairs of components.
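A sketch of this pairwise index follows; the dCor estimator is implemented from scratch following Székely et al. (2007), and the test data are our own illustrative choices:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples, computed
    from double-centred pairwise distance matrices (V-statistic)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                      # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

def pairwise_dcor(Z):
    """Simplified independence index: sum of dCor over all pairs of
    components, enforcing only pairwise independence."""
    d = Z.shape[1]
    return sum(distance_correlation(Z[:, i], Z[:, j])
               for i in range(d) for j in range(i + 1, d))

rng = np.random.default_rng(0)
ind = rng.normal(size=(2000, 3))              # independent columns
dep = ind.copy(); dep[:, 1] = dep[:, 0] ** 2  # nonlinearly dependent pair
assert pairwise_dcor(ind) < pairwise_dcor(dep)
```

Note that dCor detects the nonlinear dependence between a variable and its square, which an ordinary Pearson correlation would miss.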
Adversarial-based independence metric
Now let us describe the adversarial approach used in ANICA. The basic idea is to leverage the fact that randomly permuting each feature of a sample independently produces a sample from a distribution with independent components. More precisely, given a sample from the joint density, one draws a sample from the product of the marginal densities by choosing, independently for each coordinate, a random permutation of the sample indices and rearranging that coordinate accordingly.
The resampled data then come from the product of the marginals, which by construction has independent components. Consequently, if the original and resampled distributions are close, the original distribution has (approximately) independent components. ANICA uses adversarial training to reduce the distance between the original and resampled samples.
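The resampling step above can be sketched directly; the dependent toy sample below is our own illustration:

```python
import numpy as np

def resample_independent(X, rng):
    """Shuffle each column of the sample independently: the result is a
    sample from the product of the marginals, i.e. it has independent
    components by construction."""
    return np.column_stack([rng.permutation(X[:, j])
                            for j in range(X.shape[1])])

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
X = np.hstack([x, x])                 # perfectly dependent columns
Xp = resample_independent(X, rng)
# The permutation destroys the dependence between the columns.
assert abs(np.corrcoef(X[:, 0], X[:, 1])[0, 1]) > 0.99
assert abs(np.corrcoef(Xp[:, 0], Xp[:, 1])[0, 1]) < 0.2
```

Each column keeps its marginal distribution exactly, which is what makes the resampled batch a valid "independent" reference sample.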
Cramer-Wold independence metric
The application of adversarial training in ANICA can lead to instability, as discussed by the authors, and to slower training. In this paper we propose an alternative independence measure. Our main idea is to compute the distance between the latent sample and its resampled version directly, without resorting to adversarial training.
In order to achieve this, one could choose commonly used metrics, such as the Kullback-Leibler divergence (Kingma and Welling, 2014) or the Wasserstein distance (Tolstikhin et al., 2017). Instead, due to its simplicity, we have decided to use the recently introduced Cramer-Wold distance (Tabor et al., 2018), which has the additional advantage of a closed-form expression for the distance between two samples. Its bandwidth is a hyperparameter, which may be set according to the one-dimensional Silverman's rule of thumb; the kernel function appearing in the closed form is evaluated with an asymptotic formula. As a final step, we normalize each component of the latent sample, so that Silverman's rule of thumb remains well calibrated, and define our independence metric as the Cramer-Wold distance between the normalized latents and their resampled counterpart:
(1)
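The sample distance can be sketched as follows. The bandwidth rule and the asymptotic kernel below follow our reading of Tabor et al. (2018) and should be treated as assumptions: the exact constants in the original closed form may differ.

```python
import numpy as np

def cramer_wold_distance(X, Y, gamma=None):
    """Squared Cramer-Wold distance between two equally sized samples,
    using an asymptotic approximation of the closed-form kernel.
    The Silverman-style bandwidth and phi below are assumed forms."""
    n, d = X.shape
    if gamma is None:
        gamma = (4.0 / (3.0 * n)) ** 0.4      # Silverman-style rule (assumed)

    def phi(s):
        # Asymptotic approximation of the dimension-d kernel (assumed).
        return (1.0 + 4.0 * s / (2.0 * d - 3.0)) ** -0.5

    def block(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return phi(sq / (4.0 * gamma)).mean()

    c = 1.0 / (2.0 * np.sqrt(np.pi * gamma))
    return c * (block(X, X) + block(Y, Y) - 2.0 * block(X, Y))

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 4))
b = rng.normal(size=(500, 4))
c = rng.normal(loc=3.0, size=(500, 4))
# Samples from the same distribution are closer than shifted ones.
assert cramer_wold_distance(a, b) < cramer_wold_distance(a, c)
```

Like an MMD, the expression is a quadratic form over all pairs of points, so it is differentiable and can be minimized by plain gradient descent, with no discriminator involved.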
4 Algorithm
We are now ready to define CwICA, a nonlinear ICA model based on the Cramer-Wold independence metric. Following ANICA, we use an autoencoder (AE) architecture.
An autoencoder is a model consisting of an encoder function and a complementary decoder function, trained to produce a coding of the input data that minimizes the reconstruction error:
(2) 
The goal of our method is to train an encoder network which maps the data to informative, statistically independent features. To achieve this, we introduce an independence measure on the latent space by taking advantage of the independence index defined in (1). We denote this model CwICA (Cramer-Wold Independent Component Analysis).
To obtain a procedure invariant to a possible rescaling of the data, we have decided to use a multiplicative model instead of an additive one:
(3) 
Contrary to ANICA, we do not use an adversarial objective, proposing instead a closed-form criterion based on the independence index. However, enforcing independence by itself does not guarantee that the mapping from the observed signals to the predicted sources is informative about the input. Therefore, as proposed in ANICA, the decoder constrains the encoder.
As explained earlier, in the case of the Cramer-Wold index it is important to normalize the resampled (permuted) latents, which additionally prevents the encoder's output from vanishing or exploding in magnitude.
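Putting the pieces together, the training cost can be sketched as below. The multiplicative combination is an assumption on our part, since the exact form of Eq. (3) is not recoverable from this copy, and the independence index is passed in as a function so any of the metrics above can be plugged in:

```python
import numpy as np

def normalize_columns(Z, eps=1e-8):
    """Component-wise normalization of the latents, keeping the
    encoder output from vanishing or exploding in magnitude."""
    return (Z - Z.mean(0)) / (Z.std(0) + eps)

def cw_ica_cost(x, x_rec, z, independence_index, rng):
    """Total cost: reconstruction error combined multiplicatively with
    the independence index between the normalized latents and their
    column-wise permutation (multiplicative form is an assumption)."""
    rec = np.mean((x - x_rec) ** 2)
    zn = normalize_columns(z)
    zp = np.column_stack([rng.permutation(zn[:, j])
                          for j in range(zn.shape[1])])
    return rec * independence_index(zn, zp)
```

With a perfect reconstruction the cost vanishes, so in practice the independence term only steers the encoder within the set of near-lossless codes.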
ANICA  CwICA  PNLMISEP  dCorICA  PearsonICA  icafast exp  icaimax ext  jade  

0.0027  0.0017  —  0.0000  0.0017  0.0017  0.0017  0.0017  
0.9835  0.9697  —  0.3033  0.8969  0.8926  0.8940  0.9414  
0.0516  0.0332  —  0.1475  —  —  —  — 
ANICA  CwICA  PNLMISEP  dCorICA  PearsonICA  icafast exp  icaimax ext  jade  

0.0027  0.0175  0.0080  0.0000  0.0038  0.0038  0.0038  0.0038  
0.8913  0.7805  0.9012  0.2514  0.9997  0.9997  0.9998  0.9984  
0.0333  0.0094  —  0.1746  —  —  —  — 
2  5  10  20  2  5  10  20  
ANICA  0.78  0.67  0.69  0.7  0.17  0.13  0.10  0.14 
CwICA  0.79  0.69  0.66  0.68  0.22  0.19  0.15  0.12 
PNLMISEP  0.77  0.71  –  –  0.18  0.15  –  – 
dCorICA  0.79  0.68  0.73  0.67  0.24  0.20  0.20  0.23 
PearsonICA  0.73  0.61  0.59  0.57  0.29  0.21  0.12  0.10 
icafast  0.75  0.59  0.59  0.57  0.21  0.19  0.10  0.09 
icaimax  0.75  0.60  0.59  0.57  0.21  0.19  0.10  0.09 
jade  0.74  0.59  0.59  0.57  0.25  0.20  0.11  0.10 
baseline  0.70  0.60  0.61  0.59  0.36  0.24  0.15  0.11 
In addition, we implement another AE-based nonlinear model, which follows the same architecture as CwICA but substitutes the Cramer-Wold independence index by the pairwise distance-correlation index, computed on the component-wise normalized encodings. We refer to this method as dCorICA.
5 Experiments
We evaluate our method on a synthetic dataset and on mixed images. For comparison we use the nonlinear methods ANICA (Brakel and Bengio, 2017) and PNL-MISEP (Zheng et al., 2007), an extension of the MISEP method (Almeida, 2003, 2004). It should be noted that PNL-MISEP is designed specifically for post-nonlinear mixing, not for the more general nonlinear mixing functions used in the presented experiments. We also report the results obtained on the same datasets by four selected linear models: the popular FastICA algorithm (Hyvärinen, 1999), the Information-Maximization (Infomax) approach (Bell and Sejnowski, 1995), the Joint Approximate Diagonalization of Eigenmatrices (JADE) (Cardoso and Souloumiac, 1993), and PearsonICA (Karvanen et al., 2000), based on the Pearson system (Stuart et al., 1968). We use the implementations of the linear models in the R packages ica (Helwig, 2015) and PearsonICA (Karvanen, 2008).
5.1 Comparison with ANICA
The CwICA and dCorICA models follow an architecture similar to ANICA's, but use a closed-form independence measure on the latent variables instead of the adversarial approach. We compare our algorithms with the ANICA model using the synthetic signals dataset defined in (Brakel and Bengio, 2017).
The dataset in the nonlinear setting consists of observations obtained by applying a nonlinear mixing function to the independent sources, where the mixing matrices are sampled uniformly and the elementwise nonlinearity is the hyperbolic tangent. We select the first part of the samples as the test dataset and train on the remaining samples. We fit ANICA using the best hyperparameter setting for this dataset reported by (Brakel and Bengio, 2017). For CwICA we perform a grid search on the learning rate and bandwidth, and choose the model with the smallest total loss on a validation dataset drawn from the same distribution as the train and test sets. All other model hyperparameters are set as in ANICA. We also run a similar grid search on the learning rate and batch size for dCorICA. We do not execute PNL-MISEP, as its implementation is not suitable for input data of this dimensionality.
We also report the performance of the nonlinear methods on linear data. The linear dataset is obtained from the same independent sources by a linear transformation defined by a random matrix. We train the models using the same configuration as in the nonlinear experiment.
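A sketch of the nonlinear benchmark generator follows; the source distribution and the sampling range of the mixing matrices are our own assumptions, since the exact values are elided in this copy:

```python
import numpy as np

def make_nonlinear_mixture(n_samples, dim, rng, low=-0.5, high=0.5):
    """Synthetic benchmark in the style of Brakel & Bengio (2017):
    x = A2 @ tanh(A1 @ s) with random mixing matrices A1, A2.
    The Laplace sources and the (low, high) range are assumptions."""
    s = rng.laplace(size=(n_samples, dim))      # assumed independent sources
    A1 = rng.uniform(low, high, size=(dim, dim))
    A2 = rng.uniform(low, high, size=(dim, dim))
    x = np.tanh(s @ A1.T) @ A2.T                # elementwise tanh between mixes
    return s, x

rng = np.random.default_rng(0)
s, x = make_nonlinear_mixture(1000, 5, rng)
assert x.shape == (1000, 5)
```

The linear variant is recovered by dropping the tanh and one of the matrices.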
2  5  10  20  2  5  10  20  
ANICA  0.90  0.74  0.73  0.7  0.16  0.11  0.09  0.14 
CwICA  0.85  0.73  0.74  0.68  0.23  0.15  0.11  0.10 
PNLMISEP  0.87  0.74  –  –  0.14  0.09  –  – 
dCorICA  0.89  0.74  0.76  0.57  0.30  0.15  0.11  0.28 
PearsonICA  0.91  0.82  0.8  0.67  0.25  0.14  0.11  0.18 
icafast  0.92  0.83  0.82  0.75  0.22  0.15  0.10  0.08 
icaimax  0.91  0.84  0.82  0.77  0.24  0.14  0.10  0.10 
jade  0.93  0.84  0.79  0.68  0.23  0.14  0.10  0.09 
baseline  0.85  0.72  0.71  0.65  0.33  0.16  0.11  0.09 
We evaluate the methods on test data using the mean distance between all possible pairs of the unraveled latent independent factors. In addition, we compute the mean maximum correlation between the sources and the recovered signals. As ICA extracts the source signals only up to a permutation, we consider all possible pairings of the predicted signals with the source signals and report only the highest value. Before computing the correlations, the latents are normalized. The results are presented in Tables 1 and 2. The original sources and the signals recovered by CwICA are presented in Figure 3.
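The maximum-over-pairings correlation score can be sketched as follows; the exhaustive search over permutations is our own illustrative choice, practical only for the small dimensionalities used here:

```python
from itertools import permutations
import numpy as np

def mean_max_correlation(S, Z):
    """Mean absolute correlation between true sources S and recovered
    signals Z under the best one-to-one matching.  ICA recovers sources
    only up to permutation (and sign), so we maximize over pairings."""
    d = S.shape[1]
    C = np.abs(np.corrcoef(S.T, Z.T)[:d, d:])   # d x d cross-correlations
    return max(np.mean([C[i, p[i]] for i in range(d)])
               for p in permutations(range(d)))

rng = np.random.default_rng(0)
S = rng.normal(size=(500, 3))
Z = S[:, [2, 0, 1]] * np.array([1.0, -1.0, 1.0])  # permuted, sign-flipped
assert mean_max_correlation(S, Z) > 0.999
```

For larger dimensionalities the exhaustive search should be replaced by the Hungarian assignment algorithm, which maximizes the same objective in polynomial time.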
dim  ANICA  CwICA  dCorICA 

2  0.5839  0.0097  0.6041 
5  0.5811  0.0181  0.5491 
10  0.5146  0.0389  0.4616 
20  0.5299  0.2748  0.5079 
CwICA behaves very well on the nonlinear dataset, achieving a value similar to ANICA's on one criterion while outperforming it on the other two. This makes the method the best choice when a balanced solution is desired.
In addition, we run the ANICA and CwICA models multiple times with different seeds. We pick the best model and summarize the metrics reported on the validation dataset during training in Figure 1. In this experiment both models were trained using the same batch size.
In the linear synthetic data experiments all nonlinear models perform worse than the classical ICA algorithms. This supports the claim that if the linear character of the mixing function is known beforehand, dedicated linear methods are the most effective choice.
The dCorICA algorithm, as expected, achieves the lowest cost in both the linear and nonlinear settings; however, it fails to recover the original sources. This may suggest that the model focuses on minimizing the independence loss while disregarding the information in the input.
5.2 Comparison on image dataset
One of the most popular applications of ICA is the separation of images. We conduct experiments on a dataset composed of images from the USC-SIPI Image Database, scaled down and mixed by a nonlinear function applied elementwise to linear combinations of the original sources, with the mixing matrices sampled uniformly. In addition, we prepare a linear dataset, where the mixing function is defined by a random matrix sampled uniformly. The components of the source vectors are separate, flattened, grayscale images chosen at random from the dataset. The observations are normalized before being passed to the algorithms. For each dimensionality we test the ANICA, CW, dCor, PNL-MISEP, FastICA, Infomax, JADE and PearsonICA algorithms. All nonlinear models are trained using the same configurations as in the previous subsection. We report the mean scores on both metrics for each method in Table 3 (nonlinearly mixed data) and Table 4 (linearly mixed data). We also report the reconstruction loss for the autoencoders (ANICA, CW, dCor) in Table 5.
CwICA achieves a high correlation on the nonlinearly mixed data, comparable to the other nonlinear ICA algorithms (in fact, CwICA obtains the best results among all ICA algorithms for some of the tested dimensionalities), and strongly outperforms the ANICA and dCorICA separations on reconstruction loss.
Additionally, dCorICA gives satisfactory results in the nonlinear setting only for low-dimensional data. For higher dimensions dCorICA still manages to compete with the other models on one of the metrics, but evidently obtains the worst results on the other, despite the fact that it minimizes this measure directly. This disproportion can be especially observed in Figure 2, which presents the mean rank of the methods based on the two metrics.
For higher dimensions, the nonlinear methods perform better on one of the metrics; however, they fail to surpass the classical algorithms in terms of the other. An opposite trend may be observed in the linear data experiments for the lower dimensions. In general, the linear methods achieve a much better score on one metric and worse (higher) values on the other. For the largest dimensionalities, in both the nonlinear and linear settings, the results obtained by the autoencoders are even worse than the baseline scores.
6 Conclusions
In this paper we have proposed a closed-form independence measure and applied it to the problem of nonlinear ICA. The resulting model, CwICA, achieves results comparable to ANICA while, thanks to its closed-form formula, avoiding the pitfalls of adversarial training. Future work could focus on scaling these approaches up to higher-dimensional datasets and on applying the developed independence metric in other contexts. Finally, we found that nonlinear methods generally underperform on linearly mixed signals, which could also be addressed in future work.
References

Almeida [2003] Luís B. Almeida. MISEP: linear and nonlinear ICA based on mutual information. Journal of Machine Learning Research, 4(Dec):1297–1318, 2003.
Almeida [2004] Luís B. Almeida. Linear and nonlinear ICA based on mutual information: the MISEP method. Signal Processing, 84(2):231–245, 2004.
 Bell and Sejnowski [1995] Anthony J Bell and Terrence J Sejnowski. An informationmaximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
 Brakel and Bengio [2017] Philemon Brakel and Yoshua Bengio. Learning independent features with adversarial nets for nonlinear ica. arXiv preprint arXiv:1710.05050, 2017.
 Burgess et al. [2018] Christopher P. Burgess, Irina Higgins, Arka Pal, Loïc Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in vae. CoRR, abs/1804.03599, 2018.
 Cardoso and Souloumiac [1993] JeanFrancois Cardoso and Antoine Souloumiac. Blind beamforming for nongaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993.
 Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016.
 Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Nonlinear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Goodfellow et al. [2014] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 Helwig [2015] Nathaniel E. Helwig. ica: Independent Component Analysis, 2015. R package version 1.01.

Hyvarinen and Morioka [2016] Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3765–3773. Curran Associates, Inc., 2016.
Hyvärinen and Pajunen [1999] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
 Hyvärinen [1999] Aapo Hyvärinen. Fast and robust fixedpoint algorithms for independent component analysis. Neural Networks, IEEE Transactions on, 10(3):626–634, 1999.
 ichiro Hirayama et al. [2017] Jun ichiro Hirayama, Aapo Hyvärinen, and Motoaki Kawanabe. SPLICE: Fully tractable hierarchical extension of ICA with pooling. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1491–1500, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Karvanen et al. [2000] Juha Karvanen, Jan Eriksson, and Visa Koivunen. Pearson system based method for blind separation. In Proceedings of Second International Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000), Helsinki, Finland, pages 585–590, 2000.
 Karvanen [2008] J. Karvanen. PearsonICA, 2008. R package version 1.23.
 Kim and Mnih [2018] H. Kim and A. Mnih. Disentangling by Factorising. ArXiv eprints, February 2018.
 Kingma and Welling [2014] D.P. Kingma and M. Welling. Autoencoding variational bayes. arXiv:1312.6114, 2014.
 Le et al. [2011] Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. Ica with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.
 Lee et al. [1997] TeWon Lee, BU Koehler, and Reinhold Orglmeister. Blind source separation of nonlinear mixing models. In Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop, pages 406–415. IEEE, 1997.
 Matteson and Tsay [2017] David S Matteson and Ruey S Tsay. Independent component analysis via distance covariance. Journal of the American Statistical Association, pages 1–16, 2017.
 Stuart et al. [1968] Alan Stuart, Maurice G Kendall, et al. The advanced theory of statistics. Charles Griffin, 1968.
 Székely et al. [2007] Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The annals of statistics, 35(6):2769–2794, 2007.
 Tabor et al. [2018] Jacek Tabor, Szymon Knop, Przemysław Spurek, Igor Podolak, Marcin Mazur, and Stanislaw Jastrzebski. Cramerwold autoencoder. arXiv preprint arXiv:1805.09235, 2018.

Tan et al. [2001] Ying Tan, Jun Wang, and Jacek M. Zurada. Nonlinear blind source separation using a radial basis function network. IEEE Transactions on Neural Networks, 12(1):124–134, 2001.
Tolstikhin et al. [2017] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv:1711.01558, 2017.
 Zhang and Chan [2008] Kun Zhang and Laiwan Chan. Minimal nonlinear distortion principle for nonlinear independent component analysis. Journal of Machine Learning Research, 9(Nov):2455–2487, 2008.
 Zheng et al. [2007] ChunHou Zheng, DeShuang Huang, Kang Li, George Irwin, and ZhanLi Sun. Misep method for postnonlinear blind source separation. Neural computation, 19:2557–78, 2007.