Non-linear ICA based on Cramer-Wold metric

03/01/2019 ∙ by Przemysław Spurek, et al. ∙ Jagiellonian University

Non-linear source separation is a challenging open problem with many applications. We extend the recently proposed Adversarial Non-linear ICA (ANICA) model and introduce Cramer-Wold ICA (CW-ICA). In contrast to ANICA, we use a simple, closed-form optimization target instead of a discriminator-based independence measure. Our results show that CW-ICA achieves results comparable to ANICA, while forgoing the need for adversarial training.


1 Introduction

Linear Independent Components Analysis (ICA) has become an important data analysis technique. For example, it is routinely used for blind source separation in a wide range of signals. The objective of ICA is to identify a linear transformation such that after the projection the components of the dataset are independent. More formally, the aim is to find an unmixing matrix that transforms the observed data into maximally independent components with respect to some measure of independence. Commonly, independence is approximated using a measure of non-Gaussianity such as kurtosis (Hyvärinen, 1999; Bell and Sejnowski, 1995).
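As an illustration of the linear setting, the following sketch separates a toy two-source mixture with the off-the-shelf FastICA implementation from scikit-learn; the toy signals and the mixing matrix are our own choices and are not related to the experiments in this paper.

    import numpy as np
    from sklearn.decomposition import FastICA

    # Two independent, non-Gaussian sources: a sine wave and a sawtooth.
    t = np.linspace(0, 8, 2000)
    sources = np.c_[np.sin(3 * t), (2 * t) % 1 - 0.5]

    # Observed signals are an unknown linear mixture of the sources.
    mixing = np.array([[1.0, 0.5],
                       [0.4, 1.2]])
    observed = sources @ mixing.T

    # FastICA estimates an unmixing transformation by maximizing the
    # non-Gaussianity of the recovered components (a proxy for independence).
    ica = FastICA(n_components=2, random_state=0)
    recovered = ica.fit_transform(observed)   # sources up to permutation and scale
    unmixing = ica.components_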

An obvious drawback of ICA is the restriction to linear transformations. Unfortunately, in many practical applications this linearity assumption does not hold, which motivates research into Nonlinear ICA (NICA) (Hyvarinen and Morioka, 2016; Hirayama et al., 2017).

One of the key challenges in developing a nonlinear variant of ICA is devising an efficient measure of independence. The currently most popular approach is to constrain the transformation so that independence can be efficiently estimated (Tan et al., 2001; Almeida, 2003, 2004; Dinh et al., 2014; Zhang and Chan, 2008). Another approach is to learn the independence measure, which can be achieved using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). In Adversarial Non-linear ICA (ANICA) (Brakel and Bengio, 2017) the authors demonstrate the efficacy of using a GAN to learn an independence measure, and show that a GAN-based independence measure combined with an autoencoder architecture can be used to solve nonlinear blind source separation problems.

Unfortunately, the use of adversarial training in ANICA comes at the cost of added instability, as also noted by the authors. Our main contribution is developing an effective independence measure that does not require adversarial training, and matches ANICA performance. In other words, we found that the adversarial training is not the key contributor to the efficacy of ANICA, and based on this insight we developed a simpler, closed-form independence measure. We demonstrate its efficacy on standard blind source separation problems.

This paper is structured as follows. We start by discussing related work in Section 2. In Section 3 we describe the key contribution: the independence measure based on the Cramer-Wold metric. ICA based on the introduced independence measure is described in Section 4. Finally, we report experimental results in Section 5.

2 Related work

The fundamental problem in solving NICA is that the solution is in principle non-identifiable. Without any constraints on the space of the mixing functions, there exist infinitely many solutions (Hyvärinen and Pajunen, 1999). To illustrate, consider that there is an infinite number of possible nonlinear decompositions of a random vector into independent components, and those decompositions are not related to each other in any trivial way. A related problem is that measuring true independence between distributions is often intractable. While ICA can be efficiently solved using approximate independence measures, such as kurtosis, these approaches do not transfer to the nonlinear scenario.

Perhaps the most common approach to solving NICA, which addresses both of these problems, is to pose a constraint on the nonlinear transformation (Lee et al., 1997; Tan et al., 2001; Almeida, 2003, 2004; Dinh et al., 2014). One of the first attempts was to generalize ICA by introducing nonlinear mixing models for which the solution is still identifiable (Lee et al., 1997). In (Le et al., 2011) the authors propose Reconstruction ICA (RICA), which requires the mixing matrix to be as close as possible to an orthonormal one. Thanks to such constraints, one can directly apply independence measures from classical ICA methods.

The aforementioned approaches are arguably limited in their expressive power. In a more recent attempt (Dinh et al., 2014), the authors propose a neural model for modeling densities called Nonlinear Independent Component Estimation (NICE). They parameterize the neural network so that it is fully invertible and the output distribution is fully factorized (independent). However, the model learns the unmixing function using maximum likelihood, which requires specifying a prior density family.

Our work is most closely related to the recently introduced Adversarial Non-linear ICA model (ANICA) (Brakel and Bengio, 2017). In contrast to the previous methods, ANICA does not make any strong explicit assumptions on the transformation function. Instead, it proposes a clever adversarial measure for estimating and optimizing independence efficiently. In this work we take a closer look at this measure, and argue that its basic premise permits the construction of an effective non-parametrized independence measure.

Finally, let us note that large progress has been made in learning factorized representations using deep neural networks (Burgess et al., 2018; Chen et al., 2016). What separates ANICA and our method from this previous work is the direct encouragement of independence in the latent space. A similar path was also taken by (Kim and Mnih, 2018), where the VAE loss function is augmented with a cost term directly encouraging disentanglement.

3 Independence measure by Cramer-Wold distance

In this section we develop an efficient independence measure which, contrary to the ANICA model, does not require adversarial training. Our approach can be effectively used to solve nonlinear ICA, in contrast to many other metrics used solely in the context of linear ICA.

In the following we discuss three independence metrics. First, we consider the distance correlation and the adversarial metric used in ANICA. In the last part we introduce our Cramer-Wold based independence metric.

Distance correlation

One of the most well-known measures of independence of random vectors X and Y is the distance correlation dCor(X, Y) (Székely et al., 2007), which is applied in (Matteson and Tsay, 2017) to solve the linear ICA problem. Importantly, dCor(X, Y) equals zero if and only if the random vectors X and Y are independent. Moreover, dCor has a closed-form estimator.

However, to ensure the independence of the components of a given random vector Z in R^D, one has to compute dCor(Z_I, Z_I') for every subset I of the indexes {1, ..., D} (except the trivial cases when either I or its complement is empty), where I' denotes the complement of I and Z_I is the restriction of Z to the coordinates given by I. As this procedure has exponential complexity with respect to the number of dimensions, we decided to use a simplified version which enforces only pairwise independence of the components:

    dcor_pair(Z) = Σ_{i<j} dCor(Z_i, Z_j),

where Z_i denotes the i-th component of Z.
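A minimal numpy sketch of this pairwise simplification is given below; the sample distance-correlation estimator follows Székely et al. (2007), while the function names are ours.

    import numpy as np
    from itertools import combinations

    def dcor(x, y):
        # Sample distance correlation between two one-dimensional samples.
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        a = np.abs(x - x.T)                                    # pairwise distances
        b = np.abs(y - y.T)
        A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()      # double centering
        B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
        dcov2 = max((A * B).mean(), 0.0)                       # squared distance covariance
        denom = np.sqrt((A * A).mean() * (B * B).mean())
        return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

    def dcor_pair(Z):
        # Pairwise independence index: sum of dCor over all coordinate pairs.
        d = Z.shape[1]
        return sum(dcor(Z[:, i], Z[:, j]) for i, j in combinations(range(d), 2))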

Adversarial-based independence metric

Now let us describe the adversarial approach used in ANICA. The basic idea is to leverage the fact that a random permutation of each feature in a sample produces samples that come from a distribution with independent components. More precisely, let Z = (Z_1, ..., Z_D) be a random vector with probability density function f, and let z_1, ..., z_n be a sample from f, where z_k = (z_{k,1}, ..., z_{k,D}). We will describe how to draw a sample from the product density f_1 ⊗ ... ⊗ f_D, where f_1, ..., f_D are the marginal densities of f. To do this, simply choose random maps σ_1, ..., σ_D from {1, ..., n} into itself and consider the resampled points

    z~_k = (z_{σ_1(k),1}, ..., z_{σ_D(k),D}),  k = 1, ..., n.

Then z~_k comes (approximately) from the product density f_1 ⊗ ... ⊗ f_D, which has independent components. Consequently, if the distributions of the original and the resampled samples are close, then f is close to the product of its marginals, and therefore Z has (approximately) independent components. In ANICA, adversarial training is used to reduce the distance between the original and the resampled samples.
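In code, the resampling step amounts to shuffling each column of the sample independently; a minimal numpy sketch (our naming):

    import numpy as np

    def resample_independent(Z, rng=None):
        # Shuffle each column of Z independently across rows; the rows of the
        # result approximately follow the product of the marginal distributions
        # of Z, i.e. a distribution with independent components.
        rng = np.random.default_rng() if rng is None else rng
        Z_tilde = np.empty_like(Z)
        for j in range(Z.shape[1]):
            Z_tilde[:, j] = Z[rng.permutation(Z.shape[0]), j]
        return Z_tilde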

Cramer-Wold independence metric

The application of adversarial training in ANICA can lead to instability, as discussed by the authors, and to slower training. In this paper we propose an alternative independence measure. Our main idea is to compute the distance between the original sample and its resampled version without resorting to adversarial training.

In order to achieve this, one could choose a commonly used metric, such as the Kullback-Leibler divergence (Kingma and Welling, 2014) or the Wasserstein distance (Tolstikhin et al., 2017). Instead, due to its simplicity, we have decided to use the recently introduced Cramer-Wold distance d²_cw (Tabor et al., 2018), which also possesses the advantage of having a closed form for the distance between two samples. The bandwidth γ appearing in this closed form is a hyperparameter, which may be set according to the one-dimensional Silverman's rule of thumb, and the special function appearing in the formula is evaluated with an asymptotic approximation.

As a final step, we normalize each component of the latent sample so that the one-dimensional Silverman's rule of thumb is appropriate, and define our independence metric as

    cw_index(Z) = d²_cw(norm(Z), norm(Z~)),    (1)

where norm(Z) is the componentwise normalization of Z and Z~ is its resampled (permuted) version.
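The sketch below implements this index in numpy. Since the closed-form expression did not survive into this copy of the text, the normalizing constants, the asymptotic approximation of the special function, and the Silverman bandwidth follow our reading of Tabor et al. (2018) and should be treated as assumptions rather than the authors' exact formula.

    import numpy as np

    def phi(s, D):
        # Asymptotic approximation of the special function in the CW closed form
        # (assumed form, after Tabor et al., 2018).
        return 1.0 / np.sqrt(1.0 + 4.0 * s / (2.0 * D - 3.0))

    def silverman_gamma(n):
        # One-dimensional Silverman rule of thumb for unit-variance data (assumption).
        return (4.0 / (3.0 * n)) ** 0.4

    def cw_distance(X, Y, gamma):
        # Closed-form Cramer-Wold distance between two equally sized samples
        # (normalizing constants assumed, after Tabor et al., 2018).
        n, D = X.shape
        def block(A, B):
            sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
            return phi(sq / (4.0 * gamma), D).sum()
        total = block(X, X) + block(Y, Y) - 2.0 * block(X, Y)
        return total / (2.0 * n * n * np.sqrt(np.pi * gamma))

    def normalize(Z, eps=1e-8):
        # Componentwise normalization to zero mean and unit variance.
        return (Z - Z.mean(0)) / (Z.std(0) + eps)

    def cw_index(Z, rng=None):
        # Independence index (1): CW distance between the normalized latents and
        # their column-wise resampled (independent-component) counterpart.
        rng = np.random.default_rng() if rng is None else rng
        Z_tilde = np.column_stack(
            [Z[rng.permutation(Z.shape[0]), j] for j in range(Z.shape[1])])
        return cw_distance(normalize(Z), normalize(Z_tilde), silverman_gamma(Z.shape[0]))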

Figure 1: The reported metrics as a function of the number of training iterations (left, middle and right panels) for ANICA (black) and CwICA (red). Please note that some of the results are plotted on a logarithmic scale on the y-axis. This experiment is separate from the one presented in Table 1, therefore the results may differ slightly.

4 Algorithm

We are now ready to define CW-ICA, a nonlinear ICA model based on the Cramer-Wold independence metric. Following ANICA, we use an Auto-Encoder (AE) architecture.

Let X = (x_1, ..., x_n) denote the input data. An Auto-Encoder is a model consisting of an encoder function E and a complementary decoder function D, aiming to produce a coding of the input variables that minimizes the reconstruction error:

    MSE(X) = (1/n) Σ_i ||x_i − D(E(x_i))||².    (2)

The goal of our method is to train an encoder network E which maps the data X to informative, statistically independent features Z = E(X). In order to achieve this we introduce an independence measure on the latent space, by taking advantage of the independence index defined in (1). We denote this model as CW-ICA (Cramer-Wold Independent Component Analysis).

To obtain a procedure that is independent of a possible rescaling of the data, we have decided to use a multiplicative model instead of an additive one:

    cost(X) = MSE(X) · cw_index(E(X)).    (3)
Figure 2: The ranking (lower is better) of the algorithms based on the mean maximum correlation between the latent variables and the sources (left-hand side) and on dCor (right-hand side).

Contrary to ANICA, we do not use an adversarial objective, proposing instead a closed-form loss based on the independence index. However, enforcing independence by itself does not guarantee that the mapping from the observed signals to the predicted sources is informative about the input. Therefore, as proposed in ANICA, the decoder constrains the encoder.

As explained earlier, in the case of the Cramer-Wold index it is important to normalize the resampled (permuted) latents, which additionally prevents the encoder's output from vanishing or exploding in magnitude.

  input
      data X, with each sample in a separate row
      encoder E, decoder D
  repeat
     sample a batch B of size n from X
     apply the encoder: Z := E(B)
     resample Z to obtain Z':
     for each latent dimension j do
         for each row i do
             sample a row index k uniformly from {1, ..., n}
             Z'[i, j] := Z[k, j]
         end for
     end for
     normalize componentwise: Zn := norm(Z), Zn' := norm(Z')
     loss := MSE(B, D(Z)) · d²_cw(Zn, Zn')
     update E and D to minimize the loss
  until converged
Algorithm 1: CwICA training loop
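For concreteness, a hypothetical PyTorch rendering of this loop is sketched below. The network sizes, the optimizer, and the exact multiplicative combination of the losses are our assumptions and not the authors' published configuration; cw_distance mirrors the numpy sketch from Section 3.

    import torch
    import torch.nn as nn

    def cw_distance(X, Y, gamma):
        # Torch version of the closed-form CW distance sketched in Section 3
        # (constants assumed, after Tabor et al., 2018).
        n, D = X.shape
        def block(A, B):
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return (1.0 / torch.sqrt(1.0 + sq / (gamma * (2.0 * D - 3.0)))).sum()
        total = block(X, X) + block(Y, Y) - 2.0 * block(X, Y)
        return total / (2.0 * n * n * (torch.pi * gamma) ** 0.5)

    def train_cwica(X, latent_dim, steps=5000, batch_size=256, lr=1e-3):
        n, d = X.shape
        encoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, d))
        opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
        X = torch.as_tensor(X, dtype=torch.float32)
        gamma = (4.0 / (3.0 * batch_size)) ** 0.4        # Silverman rule of thumb (assumption)

        for _ in range(steps):
            batch = X[torch.randperm(n)[:batch_size]]
            z = encoder(batch)
            # Resample: shuffle each latent dimension independently across the batch.
            z_perm = torch.stack([z[torch.randperm(len(z)), j] for j in range(z.shape[1])], dim=1)
            # Componentwise normalization of the latents and the resampled latents.
            zn = (z - z.mean(0)) / (z.std(0) + 1e-8)
            zpn = (z_perm - z_perm.mean(0)) / (z_perm.std(0) + 1e-8)
            recon = ((decoder(z) - batch) ** 2).mean()
            loss = recon * cw_distance(zn, zpn, gamma)   # multiplicative objective, cf. Eq. (3)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return encoder, decoder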
              ANICA    CwICA    PNLMISEP  dCorICA  PearsonICA  icafast exp  icaimax ext  jade
dCor          0.0027   0.0017   --        0.0000   0.0017      0.0017       0.0017       0.0017
max. corr.    0.9835   0.9697   --        0.3033   0.8969      0.8926       0.8940       0.9414
recon. (MSE)  0.0516   0.0332   --        0.1475   --          --           --           --
Table 1: Results on nonlinear synthetic data.
              ANICA    CwICA    PNLMISEP  dCorICA  PearsonICA  icafast exp  icaimax ext  jade
dCor          0.0027   0.0175   0.0080    0.0000   0.0038      0.0038       0.0038       0.0038
max. corr.    0.8913   0.7805   0.9012    0.2514   0.9997      0.9997       0.9998       0.9984
recon. (MSE)  0.0333   0.0094   --        0.1746   --          --           --           --
Table 2: Results on linear synthetic data.
Figure 3: The original sources (left) and the independent components predicted by CwICA (right) obtained from nonlinear mixtures.
              max. corr.                    dCor
              2      5      10     20       2      5      10     20
ANICA         0.78   0.67   0.69   0.70     0.17   0.13   0.10   0.14
CwICA         0.79   0.69   0.66   0.68     0.22   0.19   0.15   0.12
PNLMISEP      0.77   0.71   --     --       0.18   0.15   --     --
dCorICA       0.79   0.68   0.73   0.67     0.24   0.20   0.20   0.23
PearsonICA    0.73   0.61   0.59   0.57     0.29   0.21   0.12   0.10
icafast       0.75   0.59   0.59   0.57     0.21   0.19   0.10   0.09
icaimax       0.75   0.60   0.59   0.57     0.21   0.19   0.10   0.09
jade          0.74   0.59   0.59   0.57     0.25   0.20   0.11   0.10
baseline      0.70   0.60   0.61   0.59     0.36   0.24   0.15   0.11
Table 3: Results on the nonlinear image dataset; columns correspond to the data dimension. For dimensions 10 and 20 PNLMISEP did not converge.

In addition we implement another AE-based nonlinear model, which follows the same architecture as CwICA but substitutes the Cramer-Wold index by the pairwise dCor index computed on the component-wise normalized encodings. We refer to this method as dCorICA; from this point onwards, and in all figures and tables, we also write CwICA for CW-ICA for simplicity.

5 Experiments

We evaluate our method on mixed images and on a synthetic dataset. For comparison we use the nonlinear methods ANICA (Brakel and Bengio, 2017) and PNLMISEP (Zheng et al., 2007), an extension of the MISEP method (Almeida, 2003, 2004). It should be noted that PNLMISEP is designed specifically for post-nonlinear mixtures, not for the more general nonlinear mixing functions used in the presented experiments. We also report the results obtained on the same datasets by four selected linear models. We choose the popular FastICA algorithm (Hyvärinen, 1999), the Information-Maximization (Infomax) approach (Bell and Sejnowski, 1995), the Joint Approximate Diagonalization of Eigenmatrices (JADE) (Cardoso and Souloumiac, 1993), and the Pearson-system-based PearsonICA (Stuart et al., 1968; Karvanen et al., 2000). We use the implementations of the linear models from the R packages ica (Helwig, 2015) and PearsonICA (Karvanen, 2008).

5.1 Comparison with ANICA

The CwICA and dCorICA models follow an architecture similar to ANICA, but use a closed-form independence measure on the latent variables instead of the adversarial approach. We compare our algorithms with the ANICA model using the synthetic signals dataset defined in (Brakel and Bengio, 2017).

The dataset in the nonlinear setting consists of observations obtained by applying a nonlinear mixing function, composed of randomly sampled mixing matrices and the hyperbolic tangent, to the independent sources. We select the first part of the samples as the test dataset and train on the remaining samples. We fit ANICA using the best hyper-parameter setting for this dataset reported by (Brakel and Bengio, 2017). For CwICA we perform a grid search on the learning rate and bandwidth, and choose the model with the smallest total loss on the validation dataset, which is drawn from the same distribution as the train and test sets. All other model hyper-parameters are set as in ANICA. We also ran a similar grid search on the learning rate and batch size for dCorICA. We do not run PNLMISEP, as the implementation of this method is not suitable for input data of this dimensionality.

We also report the performance of the nonlinear methods on linear data. The linear dataset is obtained from the same independent sources by a linear transformation defined by a randomly sampled mixing matrix. We train the models using the same configuration as in the nonlinear experiment.
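To make the setup concrete, the following sketch generates toy linear and tanh-based nonlinear mixtures in the spirit of this benchmark; the source signals, the sample counts and the exact form of the nonlinear mixing are our assumptions, not the protocol of (Brakel and Bengio, 2017).

    import numpy as np

    def make_synthetic_mixtures(n_samples=4000, seed=0):
        rng = np.random.default_rng(seed)
        t = np.arange(n_samples)
        # A few independent, non-Gaussian source signals (our choice).
        sources = np.c_[np.sin(0.02 * t),
                        np.sign(np.sin(0.05 * t)),      # square wave
                        (0.013 * t) % 1 - 0.5,          # sawtooth
                        rng.laplace(size=n_samples)]    # heavy-tailed noise
        d = sources.shape[1]
        A1 = rng.uniform(-1, 1, size=(d, d))
        A2 = rng.uniform(-1, 1, size=(d, d))
        x_linear = sources @ A1.T                       # linear mixture
        x_nonlinear = np.tanh(sources @ A1.T) @ A2.T    # tanh-based nonlinear mixture
        return sources, x_linear, x_nonlinear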

              max. corr.                    dCor
              2      5      10     20       2      5      10     20
ANICA         0.90   0.74   0.73   0.70     0.16   0.11   0.09   0.14
CwICA         0.85   0.73   0.74   0.68     0.23   0.15   0.11   0.10
PNLMISEP      0.87   0.74   --     --       0.14   0.09   --     --
dCorICA       0.89   0.74   0.76   0.57     0.30   0.15   0.11   0.28
PearsonICA    0.91   0.82   0.80   0.67     0.25   0.14   0.11   0.18
icafast       0.92   0.83   0.82   0.75     0.22   0.15   0.10   0.08
icaimax       0.91   0.84   0.82   0.77     0.24   0.14   0.10   0.10
jade          0.93   0.84   0.79   0.68     0.23   0.14   0.10   0.09
baseline      0.85   0.72   0.71   0.65     0.33   0.16   0.11   0.09
Table 4: Results on the linear image dataset; columns correspond to the data dimension. For dimensions 10 and 20 PNLMISEP did not converge.

We evaluate the methods on test data using the mean dCor over all pairs of the predicted latent factors. In addition, we compute the mean maximum correlation (denoted as max. corr.) between the sources and the recovered signals. As ICA extracts the source signals only up to a permutation, we consider all possible pairings of the predicted signals with the source signals and report only the highest value. Before computing the correlation, the latents are normalized. The results are presented in Tables 1 and 2. The original sources and the signals recovered by CwICA are presented in Figure 3.
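A sketch of these two evaluation metrics is given below. We interpret the best pairing as the permutation maximizing the total absolute correlation (solved with the Hungarian algorithm); the paper's exact implementation may differ, and dcor refers to the helper sketched in Section 3.

    import numpy as np
    from itertools import combinations
    from scipy.optimize import linear_sum_assignment

    def mean_max_correlation(sources, recovered):
        # Mean absolute Pearson correlation under the best one-to-one pairing of
        # recovered components with sources (ICA recovers sources only up to
        # permutation and sign).
        S = (sources - sources.mean(0)) / sources.std(0)
        R = (recovered - recovered.mean(0)) / recovered.std(0)
        corr = np.abs(S.T @ R) / len(S)                 # |correlation|, sources x recovered
        rows, cols = linear_sum_assignment(-corr)       # maximize total correlation
        return corr[rows, cols].mean()

    def mean_pairwise_dcor(Z):
        # Mean dCor over all pairs of predicted components (uses dcor from Section 3).
        pairs = list(combinations(range(Z.shape[1]), 2))
        return float(np.mean([dcor(Z[:, i], Z[:, j]) for i, j in pairs]))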

dim ANICA CwICA dCorICA
2 0.5839 0.0097 0.6041
5 0.5811 0.0181 0.5491
10 0.5146 0.0389 0.4616
20 0.5299 0.2748 0.5079
Table 5: Reconstruction loss (MSE) for auto-encoders on the nonlinear image dataset.

CwICA behaves very well on the nonlinear dataset, achieving a maximum correlation similar to ANICA while at the same time outperforming it on the dCor and reconstruction criteria. This makes the method the best choice if a balanced solution is desired.

In addition, we run the ANICA and CwICA models several times with different seeds. We pick the best model and summarize the metrics reported on the validation dataset during training in Figure 1. In this experiment both models were trained with the same batch size.

In the linear synthetic data experiments all nonlinear models perform worse than the classical ICA algorithms. This supports the claim that if the mixing function is known beforehand to be linear, dedicated linear methods are the most effective choice.

The dCorICA algorithm, as expected, achieves the lowest dCor cost in both the linear and nonlinear settings; however, it fails to recover the original sources. This may suggest that the model focuses on minimizing the independence loss while disregarding the information in the input.

5.2 Comparison on image dataset

One of the most popular applications of ICA is the separation of images. We conduct experiments on a dataset composed of images from the USC-SIPI Image Database, rescaled to a common resolution and mixed with a nonlinear function built from randomly sampled mixing matrices and an element-wise nonlinearity. In addition, we prepare a linear dataset, where the mixing function is defined by the transformation imposed by a random matrix with uniformly sampled entries. The components of the source vectors are separate, flattened, gray-scale images chosen at random from the database. The observations are normalized before being passed to the algorithms, and the number of distinct observation examples depends on the dimension.

For each dimension we test the ANICA, CwICA, dCorICA, PNLMISEP, FastICA, Infomax, JADE and PearsonICA algorithms. All the nonlinear models are trained using the same configurations as in the previous subsection. We report the mean maximum correlation and dCor for each method in Table 3 (nonlinearly mixed data) and in Table 4 (linearly mixed data). We also report the reconstruction loss of the auto-encoders (ANICA, CwICA, dCorICA) in Table 5.

CW-ICA achieves a high maximum correlation on the nonlinearly mixed data, comparable to the other nonlinear ICA algorithms (in fact CwICA obtains the best results among all ICA algorithms in some settings), and strongly outperforms the ANICA and dCorICA separations on reconstruction loss.

Additionally, dCorICA gives satisfactory results in the nonlinear setting only for low-dimensional data. For higher dimensions dCorICA still manages to compete with the other models in maximum correlation, but evidently obtains the worst results in dCor, despite the fact that it minimizes this measure directly. This disproportion can especially be observed in Figure 2, which presents the mean rank of the methods based on the two metrics.

For higher dimensions, the nonlinear methods perform better in maximum correlation; however, they fail to surpass the classical algorithms in terms of dCor. An opposite trend may be observed in the linear data experiments for the lower dimensions. In general, the linear methods achieve a much better maximum correlation and a worse (higher) dCor. In both the nonlinear and linear settings, for some dimensions the results obtained by the auto-encoders are even worse than the baseline scores.

6 Conclusions

In this paper we have proposed a closed-form independence measure and applied it to the problem of nonlinear ICA. The resulting model, CwICA, achieves results comparable to ANICA while, thanks to its closed-form formula, avoiding the pitfalls of adversarial training. Future work could focus on scaling these approaches up to higher-dimensional datasets and on applying the developed independence metric in other contexts. Finally, we found that nonlinear methods generally underperform on linearly mixed signals, which could be addressed in future work.

References

  • Almeida [2003] Luís B Almeida. MISEP – linear and nonlinear ICA based on mutual information. Journal of Machine Learning Research, 4(Dec):1297–1318, 2003.
  • Almeida [2004] Luís B Almeida. Linear and nonlinear ica based on mutual information—the misep method. Signal Processing, 84(2):231–245, 2004.
  • Bell and Sejnowski [1995] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
  • Brakel and Bengio [2017] Philemon Brakel and Yoshua Bengio. Learning independent features with adversarial nets for non-linear ica. arXiv preprint arXiv:1710.05050, 2017.
  • Burgess et al. [2018] Christopher P. Burgess, Irina Higgins, Arka Pal, Loïc Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. CoRR, abs/1804.03599, 2018.
  • Cardoso and Souloumiac [1993] Jean-Francois Cardoso and Antoine Souloumiac. Blind beamforming for non-gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993.
  • Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016.
  • Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Helwig [2015] Nathaniel E. Helwig. ica: Independent Component Analysis, 2015. R package version 1.0-1.
  • Hyvarinen and Morioka [2016] Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3765–3773. Curran Associates, Inc., 2016.
  • Hyvärinen and Pajunen [1999] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
  • Hyvärinen [1999] Aapo Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. Neural Networks, IEEE Transactions on, 10(3):626–634, 1999.
  • Hirayama et al. [2017] Jun-ichiro Hirayama, Aapo Hyvärinen, and Motoaki Kawanabe. SPLICE: Fully tractable hierarchical extension of ICA with pooling. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1491–1500, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Karvanen et al. [2000] Juha Karvanen, Jan Eriksson, and Visa Koivunen. Pearson system based method for blind separation. In Proceedings of Second International Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000), Helsinki, Finland, pages 585–590, 2000.
  • Karvanen [2008] J. Karvanen. PearsonICA, 2008. R package version 1.2-3.
  • Kim and Mnih [2018] H. Kim and A. Mnih. Disentangling by Factorising. ArXiv e-prints, February 2018.
  • Kingma and Welling [2014] D.P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2014.
  • Le et al. [2011] Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. Ica with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.
  • Lee et al. [1997] Te-Won Lee, B-U Koehler, and Reinhold Orglmeister. Blind source separation of nonlinear mixing models. In Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop, pages 406–415. IEEE, 1997.
  • Matteson and Tsay [2017] David S Matteson and Ruey S Tsay. Independent component analysis via distance covariance. Journal of the American Statistical Association, pages 1–16, 2017.
  • Stuart et al. [1968] Alan Stuart, Maurice G Kendall, et al. The advanced theory of statistics. Charles Griffin, 1968.
  • Székely et al. [2007] Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The annals of statistics, 35(6):2769–2794, 2007.
  • Tabor et al. [2018] Jacek Tabor, Szymon Knop, Przemysław Spurek, Igor Podolak, Marcin Mazur, and Stanislaw Jastrzebski. Cramer-wold autoencoder. arXiv preprint arXiv:1805.09235, 2018.
  • Tan et al. [2001] Ying Tan, Jun Wang, and Jacek M Zurada. Nonlinear blind source separation using a radial basis function network. IEEE Transactions on Neural Networks, 12(1):124–134, 2001.
  • Tolstikhin et al. [2017] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv:1711.01558, 2017.
  • Zhang and Chan [2008] Kun Zhang and Laiwan Chan. Minimal nonlinear distortion principle for nonlinear independent component analysis. Journal of Machine Learning Research, 9(Nov):2455–2487, 2008.
  • Zheng et al. [2007] Chun-Hou Zheng, De-Shuang Huang, Kang Li, George Irwin, and Zhan-Li Sun. Misep method for postnonlinear blind source separation. Neural computation, 19:2557–78, 2007.