Optimization and Testing in Linear Non-Gaussian Component Analysis

12/23/2017
by Ze Jin, et al.

Independent component analysis (ICA) decomposes multivariate data into mutually independent components (ICs). The ICA model is subject to a constraint that at most one of these components is Gaussian, which is required for model identifiability. Linear non-Gaussian component analysis (LNGCA) generalizes the ICA model to a linear latent factor model with any number of both non-Gaussian components (signals) and Gaussian components (noise), where observations are linear combinations of independent components. Although the individual Gaussian components are not identifiable, the Gaussian subspace is identifiable. We introduce an estimator along with its optimization approach in which non-Gaussian and Gaussian components are estimated simultaneously, maximizing the discrepancy of each non-Gaussian component from Gaussianity while minimizing the discrepancy of each Gaussian component from Gaussianity. When the number of non-Gaussian components is unknown, we develop a statistical test to determine it based on resampling and the discrepancy of estimated components. Through a variety of simulation studies, we demonstrate the improvements of our estimator over competing estimators, and we illustrate the effectiveness of the test to determine the number of non-Gaussian components. Further, we apply our method to real data examples and demonstrate its practical value.

1 Introduction

Independent component analysis (ICA) finds a representation of multivariate data based on mutually independent components (ICs). As an unsupervised learning method, ICA has been developed for applications including blind source separation, feature extraction, brain imaging, and many others.

Hyvärinen et al. (2004) provided an overview of ICA approaches for measuring the non-Gaussianity and estimating the ICs.

Let $\mathbf{X} = (X_1, \ldots, X_p)^{\top}$ be a random vector of observations. Assume that $\mathbf{X}$ has a nonsingular, continuous distribution $F_{\mathbf{X}}$, with $E(X_i) = 0$ and $E(X_i^2) < \infty$, $i = 1, \ldots, p$. Let $\mathbf{S} = (S_1, \ldots, S_p)^{\top}$ be a random vector of latent components. Without loss of generality, $\mathbf{S}$ is assumed to be standardized such that $E(S_i) = 0$ and $E(S_i^2) = 1$, $i = 1, \ldots, p$. A static linear latent factor model to estimate the components from the observations is given by

$$\mathbf{X} = \mathbf{A} \mathbf{S}, \qquad \mathbf{S} = \mathbf{W} \mathbf{X},$$

where $\mathbf{A}$ is a constant, nonsingular mixing matrix, and $\mathbf{W} = \mathbf{A}^{-1}$ is a constant, nonsingular unmixing matrix.

Pre-whitened random variables are uncorrelated and thus easier to work with from both practical and theoretical perspectives. Let $\Sigma_{\mathbf{X}}$ be the covariance matrix of $\mathbf{X}$, and $\mathbf{H} = \Sigma_{\mathbf{X}}^{-1/2}$ be an uncorrelating matrix. Let $\mathbf{Z} = \mathbf{H} \mathbf{X}$ be a random vector of uncorrelated observations, such that $\mathrm{Cov}(\mathbf{Z}) = \mathbf{I}_p$, the identity matrix. The ICA model further assumes that the components $S_1, \ldots, S_p$ are mutually independent, in which case the number of Gaussian components is at most one. Then the relationship between $\mathbf{Z}$ and $\mathbf{S}$ in the ICA model is

$$\mathbf{S} = \mathbf{U} \mathbf{Z}, \qquad \mathbf{Z} = \mathbf{U}^{-1} \mathbf{S}, \qquad (1)$$

where $\mathbf{U}$ is a constant, nonsingular unmixing matrix, and $\mathbf{U}^{-1}$ is a constant, nonsingular mixing matrix. Given that $\mathbf{Z}$ are uncorrelated observations, $\mathbf{U}$ is an orthogonal matrix, and $\mathbf{U}^{-1} = \mathbf{U}^{\top}$ is an orthogonal matrix as well. Thus, we have $\mathbf{W} = \mathbf{U} \mathbf{H}$ and $\mathbf{A} = \mathbf{H}^{-1} \mathbf{U}^{\top}$.

Many methods have been proposed for estimating the ICA model in the literature, including the fourth-moment diagonalization of FOBI (Cardoso, 1989) and JADE (Cardoso and Souloumiac, 1993), the information criterion of Infomax (Bell and Sejnowski, 1995), maximizing negentropy in FastICA (Hyvärinen and Oja, 1997), the maximum likelihood principle of ProDenICA (Hastie and Tibshirani, 2003), and the mutual dependence measures of dCovICA (Matteson and Tsay, 2017) and MDMICA (Jin and Matteson, 2017). Most of them use optimization to obtain components with maximal non-Gaussianity under the constraint that they are uncorrelated. The goal is to use $\mathbf{Z}$ to estimate both $\mathbf{U}$ and $\mathbf{S} = \mathbf{U} \mathbf{Z}$, by maximizing the non-Gaussianity of the components in $\mathbf{S}$, according to a particular measure of non-Gaussianity.

To overcome the limitation of the ICA model that at most one Gaussian component exists, the NGCA (non-Gaussian component analysis) model was proposed by Blanchard et al. (2006). Beginning with (1), the components $\mathbf{S}$ are decomposed into signals $\mathbf{S}_N$ (of dimension $q$) and noise $\mathbf{S}_G$ (of dimension $p - q$), $\mathbf{U}$ is decomposed into $\mathbf{U}_N$ and $\mathbf{U}_G$, and $\mathbf{U}^{-1}$ into $\mathbf{A}_N$ and $\mathbf{A}_G$ correspondingly. The components in $\mathbf{S}_N$ are assumed to be non-Gaussian, while the components in $\mathbf{S}_G$ are assumed to be Gaussian. The NGCA model further assumes that the non-Gaussian components $\mathbf{S}_N$ are independent of the Gaussian components $\mathbf{S}_G$, and that the components in $\mathbf{S}_G$ are mutually independent and thus multivariate normal, although the components in $\mathbf{S}_N$ may remain mutually dependent. Then the relationship between $\mathbf{Z}$ and $\mathbf{S}$ in the NGCA model is

$$\mathbf{Z} = \mathbf{A}_N \mathbf{S}_N + \mathbf{A}_G \mathbf{S}_G, \qquad \mathbf{S}_N = \mathbf{U}_N \mathbf{Z}, \qquad \mathbf{S}_G = \mathbf{U}_G \mathbf{Z}, \qquad (2)$$

where $\mathbf{A}_N$ has rank $q$, $\mathbf{A}_G$ has rank $p - q$, $\mathbf{U}_N$ has rank $q$, and $\mathbf{U}_G$ has rank $p - q$. The goal is to estimate the non-Gaussian subspace spanned by the rows in $\mathbf{U}_N$ corresponding to $\mathbf{S}_N$, as the Gaussian subspace corresponding to $\mathbf{S}_G$ is uninteresting. Kawanabe et al. (2007) developed an improved algorithm based on radial kernel functions. Theis et al. (2011) proved a necessary and sufficient condition for the uniqueness of the non-Gaussian subspace from projection methods. Bean (2014) developed theory for an approach based on characteristic functions. Sasaki et al. (2016) introduced a least-squares NGCA (LSNGCA) algorithm based on least-squares estimation of log-density gradients and eigenvalue decomposition, and Shiino et al. (2016) proposed a whitening-free variant of LSNGCA. Nordhausen et al. (2017) developed asymptotic and bootstrap tests for the dimension of the non-Gaussian subspace based on the FOBI method.

To incorporate nice characteristics from both the ICA model and the NGCA model, we consider the LNGCA (linear non-Gaussian component analysis) model proposed in Risk et al. (2017) as a special case of the NGCA model, which is the same as the NGICA model in Virta et al. (2016). In the form of (2), the LNGCA model further assumes that the components $S_1, \ldots, S_p$ are mutually independent, and allows any number of both non-Gaussian components and Gaussian components among them. Similarly, we have $\mathbf{S}_N = \mathbf{U}_N \mathbf{Z}$ and $\mathbf{S}_G = \mathbf{U}_G \mathbf{Z}$. Then the relationship between $\mathbf{Z}$ and $\mathbf{S}$ in the LNGCA model is

$$\mathbf{Z} = \mathbf{A}_N \mathbf{S}_N + \mathbf{A}_G \mathbf{S}_G,$$

where $\mathbf{A}_N$ has rank $q$, $\mathbf{A}_G$ has rank $p - q$, $\mathbf{U}_N$ has rank $q$, and $\mathbf{U}_G$ has rank $p - q$. Risk et al. (2017) presented a parametric LNGCA using the logistic density and a semi-parametric LNGCA using tilted Gaussians with cubic B-splines to estimate this model. Virta et al. (2016) used projection pursuit to extract the non-Gaussian components and separate the corresponding signal and noise subspaces, where the projection index is a convex combination of squared third and fourth cumulants.

In this paper, we study the LNGCA model by taking advantage of its flexibility in the number of Gaussian components and its assumption of mutual independence between all components. With pre-whitening, the Gaussian contribution to the model likelihood is invariant to linear transformations that preserve unit variance, as shown in Risk et al. (2017). Thus, an alternative framework is necessary in order to leverage the information in the Gaussian subspace. This motivates our novel objective function, which estimates the unmixing matrix by maximizing the discrepancy from Gaussianity for the non-Gaussian components and minimizing the discrepancy for the Gaussian components, thereby explicitly estimating the Gaussian subspace to improve upon constrained maximum likelihood approaches. The rest of this paper is organized as follows. In Section 2, we introduce the discrepancy functions used to measure the distance from Gaussianity. In Section 3, we propose a framework for LNGCA estimation given the number of non-Gaussian components $q$. In Section 4, we introduce a sequence of statistical tests to determine the number of non-Gaussian components when it is unknown. We present the simulation results in Section 5, followed by real data examples in Section 6. Finally, Section 7 summarizes our work.

The following notation will be used throughout this paper. Let $\mathcal{O}^{p \times q}$ denote the set of $p \times q$ matrices whose columns are orthonormal. Let $\mathcal{P}$ denote the set of signed permutation matrices. Let $\|\mathbf{M}\|_F$ denote the Frobenius norm of a matrix $\mathbf{M}$.

2 Discrepancy

2.1 Population Discrepancy Measures

In order to find the best estimate for the LNGCA model, we need a criterion to measure the discrepancy between the estimated components and their underlying assumptions, i.e., $\mathbf{S}_N$ should be far from Gaussianity and $\mathbf{S}_G$ should be close to Gaussianity. Specifically, we choose a general class of functions $D(\cdot)$ that measure the discrepancy between each component and Gaussianity.

Hastie and Tibshirani (2003) proposed the expected log-likelihood tilt function to measure the discrepancy from Gaussianity in the estimation of the ICA model. Suppose the density of $S_j$ is $f_j$, $j = 1, \ldots, p$, and each of the densities is represented by an exponentially tilted Gaussian density

$$f_j(s) = \phi(s)\, e^{g_j(s)},$$

where $\phi$ is the standard univariate Gaussian density, and $g_j$ is a smooth function. The log-tilt function $g_j$ represents departures from Gaussianity, and the expected log-likelihood ratio between $f_j$ and the Gaussian density is

$$\mathrm{GPois}(S_j) = E\left[\log \frac{f_j(S_j)}{\phi(S_j)}\right] = E\left[g_j(S_j)\right].$$
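To make the expected log-likelihood ratio concrete, the following minimal NumPy sketch (not from the paper) evaluates $E[g(S)]$ by Monte Carlo for a hypothetical non-Gaussian component with a unit-variance Laplace density; the tilt $g$ is simply taken as $\log f - \log \phi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_phi(s):
    """Log of the standard univariate Gaussian density."""
    return -0.5 * s**2 - 0.5 * np.log(2.0 * np.pi)

# Hypothetical non-Gaussian component: a unit-variance Laplace distribution.
b = 1.0 / np.sqrt(2.0)               # scale chosen so that the variance 2*b^2 equals 1
def log_f(s):
    return -np.abs(s) / b - np.log(2.0 * b)

def g(s):
    """Log-tilt g = log f - log phi, so that f(s) = phi(s) * exp(g(s))."""
    return log_f(s) - log_phi(s)

# GPois-type discrepancy: the expected log-likelihood ratio E_f[g(S)], i.e. the
# Kullback-Leibler divergence of the tilted density from the standard Gaussian.
s = rng.laplace(0.0, b, size=200_000)
print("Monte Carlo estimate of E[g(S)]:", g(s).mean())   # positive for a non-Gaussian f
```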

Virta et al. (2015, 2016) proposed the use of the Jarque-Bera (JB) test statistic (Jarque and Bera, 1987) to measure the discrepancy from Gaussianity in the estimation of ICA and LNGCA models, where

$$\mathrm{JB}(S_j) = \mathrm{Skew}(S_j) + \tfrac{1}{4}\,\mathrm{Kurt}(S_j), \qquad \mathrm{Skew}(S_j) = \left[E(S_j^3)\right]^2, \qquad \mathrm{Kurt}(S_j) = \left[E(S_j^4) - 3\right]^2$$

are the squared skewness and squared excess kurtosis. In fact, Virta et al. (2015, 2016) studied a linear combination of Skew and Kurt, i.e., $\alpha\,\mathrm{Skew} + (1 - \alpha)\,\mathrm{Kurt}$ with $\alpha \in [0, 1]$, and advised the choice $\alpha = 0.8$, which corresponds to JB. This takes deviations in both skewness and kurtosis into account, while Skew and Kurt are valid discrepancy functions as well. Notice that JB, Skew, and Kurt are simplified due to the standardized $S_j$.

2.2 Empirical Discrepancy Measures

Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be an i.i.d. sample of observations from $\mathbf{X}$, and let $\mathbf{s}_1, \ldots, \mathbf{s}_n$ be the corresponding i.i.d. sample of observations from $\mathbf{S}$, such that $\mathbf{x}_i = \mathbf{A} \mathbf{s}_i$, $i = 1, \ldots, n$. Let $\widehat{\Sigma}_{\mathbf{X}}$ be the sample covariance matrix of $\mathbf{x}_1, \ldots, \mathbf{x}_n$, and $\widehat{\mathbf{H}} = \widehat{\Sigma}_{\mathbf{X}}^{-1/2}$ be the estimated uncorrelating matrix. Although the covariance $\Sigma_{\mathbf{X}}$ is unknown in practice, the sample covariance is a consistent estimate under the finite second-moment assumption. Let $\mathbf{z}_i = \widehat{\mathbf{H}} \mathbf{x}_i$ be the estimated uncorrelated observations, such that their sample covariance is $\mathbf{I}_p$, and $\widehat{\mathbf{H}} \to \mathbf{H}$ as $n \to \infty$.

To simplify notation, we assume that an uncorrelated i.i.d. sample $\mathbf{Z} \in \mathbb{R}^{n \times p}$ with mean zero and unit variance is given, whose rows are $\mathbf{z}_1, \ldots, \mathbf{z}_n$. Let $\mathbf{S} = \mathbf{Z} \mathbf{U}^{\top} = (s_{ij})$ be the corresponding sample of components, and let $\mathbf{S}_j$ be the sample of the $j$th component, i.e., the $j$th column in $\mathbf{S}$. Similarly, we can define $\mathbf{S}_N$ and $\mathbf{S}_G$ as the columns corresponding to the non-Gaussian and Gaussian components. Notice that each $\mathbf{S}_j$ has sample mean 0 and sample variance 1.
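The centering and pre-whitening step described above can be sketched as follows. This is a minimal NumPy illustration with our own function names, not the authors' implementation; it uses the symmetric inverse square root of the sample covariance as the uncorrelating matrix.

```python
import numpy as np

def prewhiten(X):
    """Center the n x p data matrix X and whiten it, so the sample covariance
    of the returned Z is (approximately) the identity matrix."""
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)               # p x p sample covariance
    vals, vecs = np.linalg.eigh(Sigma)             # eigendecomposition of Sigma
    H = vecs @ np.diag(vals ** -0.5) @ vecs.T      # Sigma^{-1/2}, an uncorrelating matrix
    return Xc @ H, H

# Mix one uniform, one Laplace, and one Gaussian source, then whiten.
rng = np.random.default_rng(1)
S = np.column_stack([rng.uniform(-np.sqrt(3), np.sqrt(3), 1000),
                     rng.laplace(0.0, 1.0 / np.sqrt(2.0), 1000),
                     rng.standard_normal(1000)])
A = rng.standard_normal((3, 3))                    # random nonsingular mixing matrix
Z, H_hat = prewhiten(S @ A.T)
print(np.round(np.cov(Z, rowvar=False), 2))        # approximately the identity
```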

We obtain the empirical discrepancy by replacing expectations with sample averages. The empirical GPois is given by

$$\widehat{\mathrm{GPois}}(\mathbf{S}_j) = \frac{1}{n} \sum_{i=1}^{n} \widehat{g}_j(s_{ij}),$$

where $\widehat{g}_j$ is estimated by maximum penalized likelihood, maximizing the criterion

$$\frac{1}{n} \sum_{i=1}^{n} \left[ \log \phi(s_{ij}) + g_j(s_{ij}) \right] - \lambda_j \int \left\{ g_j''(s) \right\}^2 \, ds$$

subject to

$$\int \phi(s)\, e^{g_j(s)} \, ds = 1,$$

where $g_j$ is estimated by a smoothing spline, and $\lambda_j$ is selected by controlling the degrees of freedom of the smoothing spline, which is 6 by default in the R package ProDenICA (Hastie and Tibshirani, 2010).

The empirical JB is given by

$$\widehat{\mathrm{JB}}(\mathbf{S}_j) = \widehat{\mathrm{Skew}}(\mathbf{S}_j) + \tfrac{1}{4}\,\widehat{\mathrm{Kurt}}(\mathbf{S}_j),$$

where

$$\widehat{\mathrm{Skew}}(\mathbf{S}_j) = \left( \frac{1}{n} \sum_{i=1}^{n} s_{ij}^3 \right)^2 \quad \text{and} \quad \widehat{\mathrm{Kurt}}(\mathbf{S}_j) = \left( \frac{1}{n} \sum_{i=1}^{n} s_{ij}^4 - 3 \right)^2$$

are the empirical Skew and empirical Kurt. We will see that JB (joint use of skewness and kurtosis) performs much better than either Skew (skewness only) or Kurt (kurtosis only) alone in the simulations of Section 5, which was also shown in Virta et al. (2016).
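A minimal NumPy sketch of the empirical Skew, Kurt, and JB discrepancies for a standardized component, assuming the Skew + Kurt/4 combination written above:

```python
import numpy as np

def skew_disc(s):
    """Empirical Skew: squared sample skewness of a standardized component."""
    return np.mean(s**3) ** 2

def kurt_disc(s):
    """Empirical Kurt: squared sample excess kurtosis of a standardized component."""
    return (np.mean(s**4) - 3.0) ** 2

def jb_disc(s):
    """Empirical JB: skewness and excess kurtosis combined as Skew + Kurt / 4."""
    return skew_disc(s) + kurt_disc(s) / 4.0

rng = np.random.default_rng(2)
gauss = rng.standard_normal(5000)
expo = rng.exponential(1.0, 5000)
expo = (expo - expo.mean()) / expo.std()           # standardize, as the model assumes
print(jb_disc(gauss), jb_disc(expo))               # near zero vs. clearly larger
```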

3 Optimization Strategy

Using $\widehat{D}(\cdot)$ to measure the difference between a component and Gaussianity, we seek an optimal $\mathbf{U}$ such that $\mathbf{S} = \mathbf{Z} \mathbf{U}^{\top}$ is most likely to fit the underlying model with independent components.

For the ICA model, a classical ICA estimator of $\mathbf{U}$ used in FastICA (Hyvärinen and Oja, 1997) and ProDenICA (Hastie and Tibshirani, 2003) is defined by

$$\widehat{\mathbf{U}} = \arg\max_{\mathbf{U} \in \mathcal{O}^{p \times p}} \sum_{j=1}^{p} \widehat{D}(\mathbf{S}_j).$$

We can naturally extend the ICA estimator to an LNGCA estimator given as

$$\widehat{\mathbf{U}} = \arg\max_{\mathbf{U} \in \mathcal{O}^{p \times p}} \sum_{j=1}^{q} \widehat{D}(\mathbf{S}_j), \qquad (3)$$

which is named the max estimator, as we maximize the discrepancy between non-Gaussian components and Gaussianity. The algorithm for the max estimator is described in Alg. 1, where the fixed point algorithm is elaborated in Hastie and Tibshirani (2003). The objective function used in Spline-LCA from Risk et al. (2017) is the same as that of the max estimator when $\widehat{D}$ is GPois, but the optimization differs, which will be explored in Section 5.

Given the estimated unmixing matrix $\widehat{\mathbf{U}}$, the estimated non-Gaussian components are $\widehat{\mathbf{S}}_N = \mathbf{Z} \widehat{\mathbf{U}}_N^{\top}$.

  1. Initialize $\mathbf{U} \in \mathcal{O}^{p \times p}$.
  2. Alternate until convergence of $\mathbf{U}$, using the Frobenius norm:
   (a) Given $\mathbf{U}$, estimate the discrepancy $\widehat{D}(\mathbf{S}_j)$ of component $\mathbf{S}_j$ for each $j = 1, \ldots, q$.
   (b) Given $\widehat{D}(\mathbf{S}_j)$, $j = 1, \ldots, q$, perform one step of the fixed point algorithm towards finding the optimal $\mathbf{U}$.
Algorithm 1 LNGCA algorithm for the max estimator

Since any rotation of a Gaussian distribution leads to the same Gaussian distribution, the individual Gaussian components $\mathbf{S}_G$ are not identifiable. However, we can benefit from estimating the Gaussian subspace for the LNGCA model, since the column space of $\mathbf{A}_G$ is identifiable. Taking $\mathbf{S}_G$ into account by optimizing $\mathbf{S}_N$ and $\mathbf{S}_G$ simultaneously in the objective function, we expect to recognize the Gaussian subspace, which helps shape the non-Gaussian subspace because the non-Gaussian subspace is the complement of the Gaussian subspace. Motivated by this optimization idea, we propose a new LNGCA estimator given as

$$\widehat{\mathbf{U}} = \arg\max_{\mathbf{U} \in \mathcal{O}^{p \times p}} \left[ \sum_{j=1}^{q} \widehat{D}(\mathbf{S}_j) - \sum_{j=q+1}^{p} \widehat{D}(\mathbf{S}_j) \right], \qquad (4)$$

which is named the max-min estimator for the LNGCA model, as we maximize the discrepancy between non-Gaussian components and Gaussianity while simultaneously minimizing the discrepancy between Gaussian components and Gaussianity. The algorithm for the max-min estimator is described in Alg. 2, where the fixed point algorithm is elaborated in Hastie and Tibshirani (2003). We will see that the max-min estimator (joint optimization of $\mathbf{S}_N$ and $\mathbf{S}_G$) performs much better than the max estimator (optimization of $\mathbf{S}_N$ only) in the simulations of Section 5.

  1. Initialize $\mathbf{U} \in \mathcal{O}^{p \times p}$.
  2. Alternate until convergence of $\mathbf{U}$, using the Frobenius norm:
   (a) Given $\mathbf{U}$, estimate the discrepancy $\widehat{D}(\mathbf{S}_j)$ of component $\mathbf{S}_j$ for each $j = 1, \ldots, p$.
   (b) Sort the components by discrepancy in decreasing order.
   (c) Flip the sign of the discrepancy of the last $p - q$ components.
   (d) Given the signed discrepancies, $j = 1, \ldots, p$, perform one step of the fixed point algorithm towards finding the optimal $\mathbf{U}$.
  3. Sort the components by discrepancy in decreasing order.
Algorithm 2 LNGCA algorithm for the max-min estimator

Given the estimated unmixing matrix $\widehat{\mathbf{U}}$, the estimated non-Gaussian and Gaussian components are contained in $\widehat{\mathbf{S}} = \mathbf{Z} \widehat{\mathbf{U}}^{\top}$. However, it is not clear which components in $\widehat{\mathbf{S}}$ belong to $\widehat{\mathbf{S}}_N$ or $\widehat{\mathbf{S}}_G$, since $\widehat{\mathbf{S}}_N$ and $\widehat{\mathbf{S}}_G$ are obtained together instead of $\widehat{\mathbf{S}}_N$ only. The solution is to sort the independent components by discrepancy value in decreasing order and obtain the ordered independent components. Given that there are $q$ non-Gaussian components, it is natural to take the first $q$ ordered components as $\widehat{\mathbf{S}}_N$ and the remaining $p - q$ as $\widehat{\mathbf{S}}_G$, based on the discrepancy function measuring non-Gaussianity. As the non-Gaussian components in $\mathbf{S}$ have the $q$ largest discrepancy values among all components, the estimated non-Gaussian components in $\widehat{\mathbf{S}}_N$ are expected to have the $q$ largest empirical discrepancy values among the components of $\widehat{\mathbf{S}}$.

Nevertheless, we cannot sort by the empirical discrepancy once at the beginning to determine which components belong to $\widehat{\mathbf{S}}_N$ or $\widehat{\mathbf{S}}_G$, stick to that order throughout the iterative algorithm, and read off the membership at the end, since the optimization depends on the initialization and the order of components may change after each iteration. Instead, we repeatedly sort by empirical discrepancy value and adaptively determine the components in $\widehat{\mathbf{S}}_N$ and $\widehat{\mathbf{S}}_G$ at the end of each iteration in Alg. 2. Finally, when the algorithm converges, we sort the estimated components by empirical discrepancy value and obtain the ordered estimated components. Then we take the first $q$ ordered components as $\widehat{\mathbf{S}}_N$ and the last $p - q$ as $\widehat{\mathbf{S}}_G$. Accordingly, we decompose $\widehat{\mathbf{U}}$ into $\widehat{\mathbf{U}}_N$ and $\widehat{\mathbf{U}}_G$, and $\widehat{\mathbf{U}}^{-1}$ into $\widehat{\mathbf{A}}_N$ and $\widehat{\mathbf{A}}_G$.
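The max-min objective and the sort-and-flip bookkeeping in Alg. 2 can be made concrete with a small NumPy sketch. The random-restart search below is only a stand-in for the fixed point updates of Hastie and Tibshirani (2003); the function names and the search strategy are our own assumptions, not the authors' algorithm.

```python
import numpy as np

def random_orthogonal(p, rng):
    """Random initial point: orthogonalize a Gaussian matrix via its QR decomposition."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    return Q

def maxmin_objective(U, Z, q, disc):
    """Sort the component discrepancies in decreasing order, then add the top q
    (non-Gaussian candidates) and subtract the remaining p - q (Gaussian candidates)."""
    S = Z @ U.T                                    # columns are the candidate components
    d = np.sort([disc(S[:, j]) for j in range(S.shape[1])])[::-1]
    return d[:q].sum() - d[q:].sum()

def maxmin_random_search(Z, q, disc, n_starts=200, rng=None):
    """Crude random search over orthogonal matrices; a stand-in for the fixed point
    updates used by the actual max-min algorithm."""
    if rng is None:
        rng = np.random.default_rng()
    best_U, best_val = None, -np.inf
    for _ in range(n_starts):
        U = random_orthogonal(Z.shape[1], rng)
        val = maxmin_objective(U, Z, q, disc)
        if val > best_val:
            best_U, best_val = U, val
    return best_U, best_val
```

For instance, with the empirical JB discrepancy from the earlier sketch one could call maxmin_random_search(Z, q=2, disc=jb_disc) on whitened data Z.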

4 Testing and Subspace Estimation

In practice, the number of non-Gaussian components $q$ is unknown. Following the convention of ordering components with respect to non-Gaussianity, we introduce a sequence of statistical tests to decide $q$. The main idea is that, for any $j$, the $j$th ordered component is more likely to be non-Gaussian than the $(j+1)$th in terms of discrepancy value. If there are $q$ non-Gaussian independent components, then the first $q$ ordered components are non-Gaussian, and the last $p - q$ are Gaussian.

Based on this heuristic, we propose a sequence of hypotheses for searching $q$ as

$$H_0: q = q_0 \quad \text{versus} \quad H_1: q > q_0, \qquad q_0 = 0, 1, \ldots, p - 1,$$

which is equivalent to testing whether there are exactly $q_0$ non-Gaussian components or at least $q_0 + 1$ non-Gaussian components.

Under $H_0: q = q_0$, we first run the optimization on $\mathbf{Z}$ using the max-min estimator with $q = q_0$, in which we estimate $\widehat{\mathbf{U}}$ and $\widehat{\mathbf{S}}$ from the sample data. One thing worth mentioning is that $\widehat{\mathbf{U}}$ depends on $q_0$, as the optimization depends on $q_0$, although we suppress this in the notation here.

Next we repeat the following resampling procedure $R$ times: in the $r$th repetition, we randomly generate $p - q_0$ independent Gaussian components with the same number of observations as $\mathbf{Z}$, and combine them with the estimated non-Gaussian components $\widehat{\mathbf{S}}_N$ to construct pseudo components. Based on the estimated unmixing matrix $\widehat{\mathbf{U}}$, we use the estimated mixing matrix $\widehat{\mathbf{U}}^{-1}$ to construct pseudo observations. Then we run the optimization on the pseudo observations using the max-min estimator with $q = q_0$, and we estimate the unmixing matrix and components from the pseudo data.

At last, we calculate an approximate p-value by comparing the empirical discrepancy of the $(q_0 + 1)$th ordered component estimated from the sample data to that estimated from the pseudo data, or by comparing the sum of the empirical discrepancies of the last $p - q_0$ ordered components from the sample data to that from the pseudo data:

$$\text{p-value} = \frac{1}{R} \sum_{r=1}^{R} \mathbb{1}\left\{ T^{(r)} \geq T \right\}, \qquad (5)$$

where $T$ and $T^{(r)}$ denote the chosen test statistic computed from the sample data and from the $r$th pseudo data set, respectively. We name these two choices the current method and the cumulative method, respectively.

Our test shares the resampling technique with Nordhausen et al. (2017). However, there are two major differences. On the one hand, our test does not need to bootstrap $\mathbf{Z}$, which saves remarkable computational cost, and we will show that it accurately estimates the number of components. On the other hand, our test is more flexible in the test statistic, as the statistic does not need to match what is used in the objective function of the optimization. The algorithm for our sequential test is summarized in Alg. 3 below.

  1. Estimate $\widehat{\mathbf{U}}$ and $\widehat{\mathbf{S}}$ from $\mathbf{Z}$ using the max-min estimator with $q = q_0$.
  2. Estimate the test statistic $T$ from $\widehat{\mathbf{S}}$.
  3. Repeat the following procedure $R$ times:
   (a) Generate $p - q_0$ independent Gaussian components.
   (b) Construct pseudo components from $\widehat{\mathbf{S}}_N$ and the generated Gaussian components.
   (c) Construct pseudo observations using the estimated mixing matrix $\widehat{\mathbf{U}}^{-1}$.
   (d) Estimate the unmixing matrix and components from the pseudo observations using the max-min estimator with $q = q_0$.
   (e) Estimate the test statistic $T^{(r)}$ from the estimated pseudo components.
  4. Calculate the p-value using the current or cumulative method in (5).
Algorithm 3 The algorithm for the sequential test
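A minimal sketch of the p-value computation in (5), assuming the statistics described above (the discrepancy of the $(q_0+1)$th ordered component for the current method, and the tail sum for the cumulative method); the array layout is our own convention.

```python
import numpy as np

def sequential_test_pvalue(d_obs, d_resampled, q0, method="current"):
    """Approximate p-value for H0: exactly q0 non-Gaussian components.

    d_obs       : length-p array of estimated discrepancies, sorted in decreasing order.
    d_resampled : (R, p) array of discrepancies from R resampled data sets in which
                  the last p - q0 components were replaced by independent Gaussians.
    """
    if method == "current":                  # statistic: discrepancy of the (q0+1)th component
        t_obs = d_obs[q0]
        t_res = d_resampled[:, q0]
    else:                                    # "cumulative": sum over components q0+1, ..., p
        t_obs = d_obs[q0:].sum()
        t_res = d_resampled[:, q0:].sum(axis=1)
    return float(np.mean(t_res >= t_obs))    # proportion of resamples at least as extreme
```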

The proposed procedure involves a sequence of tests, but the number of tests can be dramatically reduced by using a binary search. This approach quickly narrows in on the selected $q$ because we focus on the boundary at which the p-value crosses a given significance level. As we expect no more than $\lceil \log_2 p \rceil$ tests, it makes sense to apply the Bonferroni correction. Note that even for fairly large $p$, the number of tests remains reasonable, e.g., a dimension in the thousands still implies fewer than fourteen tests. Multiple testing in this setting of sequential testing may become more problematic as the dimension or search space grows, though the sequential search works well in the simulations of Section 5. Issues with multiple testing are an important direction for future research.
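Assuming the p-values behave monotonically in $q_0$ (rejection for $q_0 < q$, non-rejection for $q_0 \geq q$), the binary search can be sketched as follows; run_test is a hypothetical callable wrapping one run of Alg. 3.

```python
import math

def select_q(run_test, p, alpha=0.05):
    """Binary search for the number of non-Gaussian components.

    run_test(q0) is a hypothetical callable returning the p-value of
    H0: exactly q0 non-Gaussian components (one run of Alg. 3). A Bonferroni
    correction spreads alpha over the at most ceil(log2(p)) tests performed."""
    n_tests = max(1, math.ceil(math.log2(p)))
    level = alpha / n_tests
    lo, hi = 0, p - 1                        # candidate range for q0
    while lo < hi:
        mid = (lo + hi) // 2
        if run_test(mid) < level:            # rejected: more than mid non-Gaussian components
            lo = mid + 1
        else:                                # not rejected: at most mid non-Gaussian components
            hi = mid
    return lo
```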

5 Simulation Study

5.1 Sub- and Super-Gaussian Densities

In this section, we evaluate the performance of the max-min estimator by performing simulations similar to Matteson and Tsay (2017) for the LNGCA model, and compare it to that of the max estimator using several discrepancy functions including Skew, Kurt, JB, GPois, and Spline. Moreover, we elaborate on the implementation and performance measure of the LNGCA model.

We generate the non-Gaussian independent components from 18 distributions using rjordan in the R package ProDenICA (Hastie and Tibshirani, 2010) with sample size $n$ and dimension $q$. See Figure 1 for the density functions of the 18 distributions. We also generate the Gaussian independent components with sample size $n$ and dimension $p - q$. Together these form the underlying components $\mathbf{S}$ of interest. We simulate a mixing matrix $\mathbf{A}$ with condition number between 1 and 2 using mixmat in the R package ProDenICA (Hastie and Tibshirani, 2010) and obtain the observations $\mathbf{X} = \mathbf{S} \mathbf{A}^{\top}$, which are centered by their sample mean and then pre-whitened by their sample covariance to obtain uncorrelated observations $\mathbf{Z}$. Finally, we estimate $\widehat{\mathbf{U}}$ and $\widehat{\mathbf{S}}$ based on $\mathbf{Z}$ via the max estimator or the max-min estimator. Therefore, $\widehat{\mathbf{W}} = \widehat{\mathbf{U}} \widehat{\mathbf{H}}$, and we evaluate the estimation by comparing the estimated unmixing matrix to the ground truth, i.e., comparing $\widehat{\mathbf{W}}_N$ to $\mathbf{W}_N$, where $\mathbf{W} = \mathbf{A}^{-1}$.

The optimization problems associated with the max estimator in (3) and the max-min estimator in (4) are non-convex, which requires an initialization step and makes the result sensitive to the initial point. Risk et al. (2014) demonstrated strong sensitivity to the initialization matrix in various ICA algorithms for the eighteen distributions considered in the experiments below. To mitigate the effect of local maxima, we explore two options, one with a single initial point and another with multiple initial points, where each initial point is generated by orthogonalizing a matrix with random Gaussian elements, as sketched below. We suggest that the number of initial points should grow with the dimension $p$.
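A minimal sketch of the multiple-initial-point strategy; optimize_from is a hypothetical placeholder for one run of Alg. 2 from a given starting matrix and is assumed to return the estimate together with its objective value.

```python
import numpy as np

def best_over_starts(optimize_from, p, n_starts, rng=None):
    """Run a local optimizer from several random orthogonal initial points and keep
    the solution with the largest objective value. optimize_from(U0) is assumed to
    return a pair (U_hat, objective_value)."""
    if rng is None:
        rng = np.random.default_rng()
    best = None
    for _ in range(n_starts):
        Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal start
        candidate = optimize_from(Q)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best
```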

Each method returns an estimate of the mixing matrix. To jointly measure the uncertainty associated with both pre-whitening the observations and estimating the non-Gaussian components, we introduce an error measure between $\widehat{\mathbf{W}}_N$ and $\mathbf{W}_N$,

$$\mathrm{err}\left(\widehat{\mathbf{W}}_N, \mathbf{W}_N\right) = \inf_{\mathbf{P} \in \mathcal{P}} \left\| \mathbf{P} \widehat{\mathbf{W}}_N - \mathbf{W}_N \right\|_F,$$

which is similar to the measures in Ilmonen et al. (2010), Risk et al. (2017), and Miettinen et al. (2017). The infimum above is taken such that the measure is invariant to the sign and order of the components, with respect to the ambiguities associated with the LNGCA model, and the optimal $\mathbf{P}$ is found by the Hungarian method (Papadimitriou and Steiglitz, 1982).
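A hedged sketch of this error measure: the Hungarian method (here scipy.optimize.linear_sum_assignment) matches rows of the estimated and true unmixing matrices up to sign, which is one way to realize the signed-permutation minimization above; any normalization used in the paper may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def perm_invariant_error(W_hat, W_true):
    """Frobenius error between two unmixing matrices, minimized over signed
    permutations of the rows; the optimal matching is found by the Hungarian method."""
    q = W_true.shape[0]
    cost = np.zeros((q, q))
    for i in range(q):
        for k in range(q):
            cost[i, k] = min(np.sum((W_hat[k] - W_true[i]) ** 2),
                             np.sum((W_hat[k] + W_true[i]) ** 2))
    rows, cols = linear_sum_assignment(cost)   # optimal row-to-row assignment
    return float(np.sqrt(cost[rows, cols].sum()))
```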

We compare the max-min estimator to the max estimator with various distributions, dimensions of components, and discrepancy functions in Experiments 1 and 2 below.

Experiment 1 (Different distributions of components).

We sample the non-Gaussian components from one of the 18 distributions, with the sample size, number of non-Gaussian components, and dimension held fixed. See Figure 2 for the error measures of 100 trials, with both multiple initial points and a single initial point.

For both multiple initial points and a single initial point, the error measure of the max-min estimator is much lower than that of the max estimator for most distributions and discrepancy functions. Therefore, the max-min estimator improves the performance of estimation over the max estimator, no matter whether a single initial point or multiple initial points is used in optimization.

For both the max-min estimator and the max estimator, the error measure with multiple initial points is much lower than that with a single initial point for most of the distributions and discrepancy functions, which illustrates the advantage of using multiple initial points over a single initial point. Moreover, the max-min estimator and multiple initial points turn out to be a powerful combination, since the error measure of the max estimator with multiple initial points can be further reduced by replacing the max estimator with the max-min estimator.

The error measure of JB is much lower than that of Skew and Kurt for most of the distributions, which justifies the joint use of moments. In addition, GPois is equal to and often better than the other discrepancy functions for all the distributions, especially with multiple initial points.

Experiment 2 (Different dimensions of components).

We sample the non-Gaussian components from randomly selected distributions among the 18 distributions, varying the dimension of the components. See Figure 3 for the error measures of 100 trials, with both multiple initial points and a single initial point.

As in the previous experiment, the max-min estimator improves the estimation performance over the max estimator, and the error measure with multiple initial points is much lower than that with a single initial point in most cases. In addition, GPois performs best for some of the dimensions considered, while JB and GPois perform similarly for the others with the max-min estimator and multiple initial points.

Since GPois turns out to be more robust to different distributions than Spline in the simulations, and it shares the same idea with Spline, we omit the results of Spline in the following simulation experiments and data examples.

We compare the current method to the cumulative method for selecting $q$, with various sample sizes of components and discrepancy functions, using the max-min estimator in Experiment 3 below.

Experiment 3 (Selecting $q$ with varying sample size).

We sample the non-Gaussian components from randomly selected distributions among the 18 distributions, varying the sample size. See Tables 2 and 3 for the empirical size and power of 100 trials, with significance level 5%, and both multiple initial points and a single initial point.

For both multiple initial points and a single initial point, the empirical power of the current method is much higher than that of the cumulative method, while both methods have empirical size around 5% or even lower, for all the sample sizes and discrepancy functions. Hence, the current method outperforms the cumulative method in testing, no matter whether a single initial point or multiple initial points is used in optimization.

For both the current method and cumulative method, the empirical size and power with multiple initial points are similar to those with a single initial point, for all the sample sizes and discrepancy functions, which implies no remarkable effect in testing from using multiple initial points or a single initial point in estimation. This suggests that the estimate of the rank of the subspace is less sensitive to initialization than estimates of the individual components.

The empirical power of JB is much higher than that of Skew and Kurt, for all the sample sizes, which justifies the joint use of moments. In addition, GPois outperforms the other discrepancy functions, for all the sample sizes.

5.2 Image Data

Fulfilling a task of unmixing vectorized images similar to Virta et al. (2016), we consider three gray-scale images from the test images of the Computer Vision Group at the University of Granada, depicting a cameraman, a clock, and a leopard respectively. Each image is represented by a matrix in which each element indicates the intensity value of a pixel. Three noise images of the same size are simulated with independent standard Gaussian pixels. We standardize the six images such that the intensity values across all the pixels in each image have mean zero and unit variance. Then we vectorize each image into a column vector and combine the vectors from all six images into a data matrix, so that each row contains the intensity values of a single pixel across all images, and each column contains the intensity values of a single image.

Then we simulate a mixing matrix using mixmat in the R package ProDenICA (Hastie and Tibshirani, 2010), and mix the six images to obtain the observations, which are centered by their sample mean and then pre-whitened by their sample covariance to obtain uncorrelated observations. We aim to infer the number of true images, and then estimate their intensity values.
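A small sketch of the data construction for this example, with a plain Gaussian random matrix standing in for mixmat and hypothetical image arrays; centering and pre-whitening then proceed as in the earlier sketch.

```python
import numpy as np

def mix_images(images, rng):
    """Standardize each image, vectorize it into a column of S, and mix the columns
    with a random nonsingular matrix (a stand-in for mixmat)."""
    cols = []
    for img in images:                         # each img: 2-D array of pixel intensities
        v = img.astype(float).ravel()
        cols.append((v - v.mean()) / v.std())
    S = np.column_stack(cols)                  # rows: pixels, columns: images
    A = rng.standard_normal((S.shape[1], S.shape[1]))
    return S, A, S @ A.T                       # X = S A^T, to be centered and pre-whitened
```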

First, we run the sequential test to infer the number of true images. See Table 1 for the p-values corresponding to each candidate $q_0$ with a single initial point. Both the current method and the cumulative method correctly select $q = 3$ at the 5% significance level, for all the discrepancy functions.

Second, we estimate the intensity values with $q = 3$ and multiple initial points. See Figures 4 and 5 for the recovered images and error images, where the Euclidean norm of the vectorized error image is used to evaluate the accuracy of estimation. The max-min estimator outperforms the max estimator for Kurt: the max-min estimator recovers the second image, while the second image recovered by the max estimator is masked by noise, and the max-min estimator also has a much lower error than the max estimator on the first recovered image. This illustrates the advantage of the max-min estimator over the max estimator, especially when the max estimator does not perform well. For the other discrepancy functions, both the max-min estimator and the max estimator nicely recover the true images. The estimation of JB is more accurate than that of Skew and Kurt, as its recovered images are mixed with less noise, indicated by both the estimated images and the error images. In addition, JB and GPois have similar performance, as JB achieves the lowest error on the first image while GPois achieves the lowest error on the second image.

6 EEG Data

There are 24 subjects in the EEG data from the Human Ecology Department at Cornell University, where each subject receives 20 trials. In each trial, 128 EEG channels (3 unused) were collected with 1024 sample points over a few seconds. We study the first trial of the first subject, so the data of interest form a 1024 x 125 matrix. Here, we estimate the number of non-Gaussian signals and examine their time series. Since the max-min estimator and the current method with GPois perform best in the estimation and testing simulations, we only use the max-min estimator and the current method with GPois in this application.

First, we conduct the sequential test to estimate the number of non-Gaussian signals. Using a binary search over $q_0 \in \{0, 1, \ldots, 124\}$, we expect to have at most 7 tests. Hence, we correct the significance level to 0.714% from the original level of 5%. See Figure 6 for the test statistic values (empirical discrepancy) and the critical values at significance levels 0.714%, 5%, and 10% (i.e., the 99.286%, 95%, and 90% quantiles of the resampled statistics) corresponding to the values of $q_0$ chosen by the binary search with a single initial point. The current method rejects the null hypothesis that there are exactly 114 non-Gaussian components (p-value below the corrected level) and fails to reject the null hypothesis that there are exactly 115 non-Gaussian components (p-value above the corrected level), thus selecting $q = 115$.

We also iterate over all $q_0$ and provide the complete testing results for reference. See Figure 7 for the test statistic values and critical values at each significance level corresponding to each $q_0$ with a single initial point. The dashed lines pinpoint where the test statistic values meet the critical values, indicating that the corresponding component is taken to be Gaussian because we cannot reject the null hypothesis.

Second, we estimate the true signals with $q = 115$ and multiple initial points. See Figure 8 for the estimated signals. The max-min estimator successfully extracts meaningful first and second components, which may be artifacts related to eye blinks in the middle and at the end of the trial. The 115th and 116th components are likely to be Gaussian, as they lie on the boundary of the test at the corrected level of 0.714%. The 125th (last) component is fairly close to Gaussian, compared to Gaussian noise randomly generated with the same sample size as a reference distribution.

7 Conclusion

In this paper, we study the LNGCA model as a generalization of the ICA model, which can have any number of non-Gaussian components and Gaussian components, given that all components are mutually independent. Our contributions are the following:

(1) We propose a new max-min estimator, maximizing the discrepancy of each non-Gaussian component from Gaussianity and minimizing the discrepancy of each Gaussian component from Gaussianity simultaneously. In contrast, the existing max estimator only maximizes the discrepancy of each non-Gaussian component from Gaussianity, as used in the ICA model (Hastie and Tibshirani, 2003) and the LNGCA model (Risk et al., 2017). Our approach may seem unintuitive because the individual Gaussian components are not identifiable. However, the Gaussian subspace is identifiable, and joint estimation of the non-Gaussian components and Gaussian components balances the non-Gaussian subspace against the Gaussian subspace. This helps shape the non-Gaussian subspace, and thus improves the accuracy of estimating the non-Gaussian components.

(2) In practice, we need to choose the number of non-Gaussian components. We introduce a sequence of statistical tests based on generating Gaussian components and ordering estimated components by empirical discrepancy, which is computationally efficient with a binary search to reduce the actual number of tests. Two methods with different test statistics are proposed, where the current method considers the discrepancy value of the component under investigation, while the cumulative method considers the total discrepancy value of all the components from the first one up to the one under investigation. Although our test shares some characteristics with that of Nordhausen et al. (2017), it has less computational burden with no bootstrap needed and is more flexible in choosing the test statistics.

We evaluate the performance of our methods in simulations, demonstrating that the max-min estimator outperforms the max estimator given the number of non-Gaussian components, for different discrepancy functions, dimensions, and distributions of the components, whether a single initial point or multiple initial points are used in optimization. When the number of non-Gaussian components is unknown, our statistical test successfully finds the correct number with different discrepancy functions and sample sizes, where the current method is more powerful than the cumulative method.

In the task of recovering true images from mixed image data, our test determines the correct number of true images, and we illustrate the advantage of the max-min estimator over the max estimator for several discrepancy functions. Specifically, the max-min estimator nicely recovers the images while the max estimator fails using the same discrepancy function, and the estimation error of the max-min estimator is equal to and sometimes lower than that of the max estimator.

In the task of exploring EEG data, our test finds a large number of non-Gaussian signals, and the first two estimated non-Gaussian components may correspond to eye-blink artifacts. The distributions of the estimated signals tend to become more Gaussian as their empirical discrepancy values decrease. There are a large number of non-Gaussian components in this data set, so applying a preliminary data reduction step using principal component analysis (PCA) would likely remove non-Gaussian signals. This underscores the importance of a flexible estimation and testing procedure.

There are two directions for future research. One is to find a better way to address the multiple testing issue when searching for a suitable $q$. The other is to better understand the improvements of the max-min estimator from a theoretical perspective. Our intuition is that the contributions of the non-Gaussian components to the asymptotic variances would equal zero. Therefore, it would be valuable to gain additional insight into the statistical versus computational advantages of the max-min estimator.

References

  • Bach and Jordan [2002] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002.
  • Bean [2014] D. M. Bean. Non-Gaussian component analysis. PhD thesis, University of California, Berkeley, 2014.
  • Bell and Sejnowski [1995] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
  • Blanchard et al. [2006] G. Blanchard, M. Sugiyama, M. Kawanabe, V. Spokoiny, and K.-R. Müller. Non-Gaussian component analysis: a semi-parametric framework for linear dimension reduction. In Advances in Neural Information Processing Systems, pages 131–138, 2006.
  • Cardoso [1989] J.-F. Cardoso. Source separation using higher order moments. In Proceedings of the 1989 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-89), pages 2109–2112. IEEE, 1989.
  • Cardoso and Souloumiac [1993] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 362–370. IET, 1993.
  • Hastie and Tibshirani [2003] T. Hastie and R. Tibshirani. Independent components analysis through product density estimation. In Advances in Neural Information Processing Systems, pages 665–672, 2003.
  • Hastie and Tibshirani [2010] T. Hastie and R. Tibshirani. ProDenICA: Product density estimation for ICA using tilted Gaussian density estimates. R package version 1, 2010.
  • Hyvärinen and Oja [1997] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.
  • Hyvärinen et al. [2004] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.
  • Ilmonen et al. [2010] P. Ilmonen, K. Nordhausen, H. Oja, and E. Ollila. A new performance index for ICA: properties, computation and asymptotic analysis. In Latent Variable Analysis and Signal Separation, pages 229–236, 2010.
  • Jarque and Bera [1987] C. M. Jarque and A. K. Bera. A test for normality of observations and regression residuals. International Statistical Review / Revue Internationale de Statistique, pages 163–172, 1987.
  • Jin and Matteson [2017] Z. Jin and D. S. Matteson. Independent component analysis via energy-based mutual dependence measures. Under review, 2017.
  • Kawanabe et al. [2007] M. Kawanabe, M. Sugiyama, G. Blanchard, and K.-R. Müller. A new algorithm of non-Gaussian component analysis with radial kernel functions. Annals of the Institute of Statistical Mathematics, 59(1):57–75, 2007.
  • Matteson and Tsay [2017] D. S. Matteson and R. S. Tsay. Independent component analysis via distance covariance. Journal of the American Statistical Association, 112(518):623–637, 2017.
  • Miettinen et al. [2017] J. Miettinen, K. Nordhausen, and S. Taskinen. Blind source separation based on joint diagonalization in R: The packages JADE and BSSasymp. Journal of Statistical Software, 76, 2017.
  • Nordhausen et al. [2017] K. Nordhausen, H. Oja, D. E. Tyler, and J. Virta. Asymptotic and bootstrap tests for the dimension of the non-Gaussian subspace. IEEE Signal Processing Letters, 24(6):887–891, 2017.
  • Papadimitriou and Steiglitz [1982] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., 1982.
  • Risk et al. [2014] B. B. Risk, D. S. Matteson, D. Ruppert, A. Eloyan, and B. S. Caffo. An evaluation of independent component analyses with an application to resting-state fMRI. Biometrics, 70(1):224–236, 2014.
  • Risk et al. [2017] B. B. Risk, D. S. Matteson, and D. Ruppert. Linear non-Gaussian component analysis via maximum likelihood. Journal of the American Statistical Association, 2017. To appear.
  • Sasaki et al. [2016] H. Sasaki, G. Niu, and M. Sugiyama. Non-Gaussian component analysis with log-density gradient estimation. In Artificial Intelligence and Statistics, pages 1177–1185, 2016.
  • Shiino et al. [2016] H. Shiino, H. Sasaki, G. Niu, and M. Sugiyama. Whitening-free least-squares non-Gaussian component analysis. arXiv preprint arXiv:1603.01029, 2016.
  • Theis et al. [2011] F. J. Theis, M. Kawanabe, and K.-R. Müller. Uniqueness of non-Gaussianity-based dimension reduction. IEEE Transactions on Signal Processing, 59(9):4478–4482, 2011.
  • Virta et al. [2015] J. Virta, K. Nordhausen, and H. Oja. Joint use of third and fourth cumulants in independent component analysis. arXiv preprint arXiv:1505.02613, 2015.
  • Virta et al. [2016] J. Virta, K. Nordhausen, and H. Oja. Projection pursuit for non-Gaussian independent components. arXiv preprint arXiv:1612.05445, 2016.