The information sieve for discrete variables.
Measuring the relationship between any pair of variables is a rich and active area of research that is central to scientific practice. In contrast, characterizing the common information among any group of variables is typically a theoretical exercise with few practical methods for high-dimensional data. A promising solution would be a multivariate generalization of the famous Wyner common information, but this approach relies on solving an apparently intractable optimization problem. We leverage the recently introduced information sieve decomposition to formulate an incremental version of the common information problem that admits a simple fixed point solution, fast convergence, and complexity that is linear in the number of variables. This scalable approach allows us to demonstrate the usefulness of common information in high-dimensional learning problems. The sieve outperforms standard methods on dimensionality reduction tasks, solves a blind source separation problem that cannot be solved with ICA, and accurately recovers structure in brain imaging data.
One of the most fundamental measures of the relationship between two random variables, $X_1, X_2$, is given by the mutual information, $I(X_1;X_2)$. While mutual information measures the strength of a relationship, the “common information” provides a concrete representation, $Y$, of the information that is shared between two variables. According to [Wyner1975], if $Y$ contains the common information between $X_1, X_2$, then we should have $I(X_1;X_2|Y)=0$, i.e., $Y$ makes the variables conditionally independent. We can extend this idea to many variables using the multivariate generalization of mutual information called total correlation [Watanabe1960], so that conditional independence is equivalent to the condition $TC(X_1,\ldots,X_n|Y)=0$ [Xu et al.2013]. The most succinct $Y$ that has this property represents the multivariate common information in $X$, but finding such a $Y$ in general is a challenging, unsolved problem.
The main contribution of this paper is to show that the concept of common information, long studied in information theory for applications like distributed source coding and cryptography [Kumar et al.2014], is also a useful concept for machine learning. Machine learning applications have been overlooked due to the intractability of recovering common information for high-dimensional problems. We propose a concrete and tractable algorithmic approach to extracting common information by exploiting a connection with the recently introduced “information sieve” decomposition [Ver Steeg and Galstyan2016]. The sieve decomposition works by searching for a single latent factor that reduces the conditional dependence in the data as much as possible. Then the data is transformed to remove this dependence and the “remainder information” trickles down to the next layer. The process is repeated until all the dependence has been extracted and the remainder contains nothing but independent noise. Thm. 3.3 connects the latent factors extracted by the sieve to a measure of common information.
Our second contribution is to show that under the assumptions of linearity and Gaussianity this optimization has a simple fixed-point solution (Eq. 6) with fast convergence and computational complexity linear in the number of variables. Although our final algorithm is limited to the linear case, extracting common information is an unsolved problem and our approach represents a logical first step in exploring the value of common information for machine learning. We offer suggestions for generalizing the method.
Our final contribution is to validate the usefulness of our approach on some canonical machine learning problems. While PCA finds components that explain the most variation, the sieve discovers components that explain the most dependence, making it a useful complement for exploratory data analysis. Common information can be used to solve a natural class of blind source separation problems that are impossible to solve using independent component analysis (ICA) due to the presence of Gaussian sources. Finally, we show that common information outperforms standard approaches for dimensionality reduction and recovering structure in fMRI data.
Using standard notation [Cover and Thomas2006], capital $X_i$ denotes a continuous random variable whose instances are denoted in lowercase, $x_i$. We abbreviate multivariate random variables, $X \equiv X_1,\ldots,X_n$, with an associated probability density function, $p_X(X_1=x_1,\ldots,X_n=x_n)$, which is typically abbreviated to $p(\mathbf{x})$, with vectors in bold. We will index different groups of multivariate random variables with superscripts, $X^k$, as defined in Fig. 1. We let $X^0$ denote the original observed variables and we omit the superscript for readability when no confusion results.
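Entropy is defined as $H(X) \equiv \langle \log 1/p(\mathbf{x}) \rangle$, where we use brackets for expectation values. Conditional multivariate mutual information, or conditional total correlation, is defined as the Kullback-Leibler divergence between the joint distribution, and the one that is conditionally independent.
$$TC(X|Y) \equiv D_{KL}\left( p(\mathbf{x}|y) \,\Big\|\, \prod_{i=1}^{n} p(x_i|y) \right)$$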
This quantity is non-negative and zero if and only if all the $X_i$’s are independent conditioned on $Y$. $TC(X)$ can be obtained by dropping the conditioning on $Y$ in the expression above. In other words, $TC(X)=0$ if and only if the variables are (unconditionally) independent. If $Y$ were the hidden source of all dependence in $X$, then $TC(X|Y)=0$. Therefore, we consider the problem of searching for a factor $Y$ that minimizes $TC(X|Y)$. In the statement of the theorems we make use of shorthand notation, $TC(X;Y) \equiv TC(X) - TC(X|Y)$, which is the reduction of TC after conditioning on $Y$. This notation mirrors the definition of mutual information between two groups of random variables, $X$ and $Y$, as the reduction of uncertainty in one variable, given information about the other, $I(X;Y) = H(X) - H(X|Y)$.
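For jointly Gaussian variables, total correlation has a closed form, $TC(X) = \frac{1}{2}\left(\sum_i \log \Sigma_{ii} - \log\det\Sigma\right)$, where $\Sigma$ is the covariance matrix. This is a standard identity rather than something stated above, and the sketch below uses it only to illustrate that TC vanishes exactly when the variables are independent. The function names are illustrative, and the code is pure standard-library Python.

```python
import math


def log_det_spd(cov):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(cov)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def gaussian_tc(cov):
    """Total correlation (in nats) of a Gaussian with covariance matrix `cov`."""
    marginal_term = sum(math.log(cov[i][i]) for i in range(len(cov)))
    return 0.5 * (marginal_term - log_det_spd(cov))


# Independent variables: TC is zero. Correlated pair: TC equals the mutual information.
tc_independent = gaussian_tc([[1.0, 0.0], [0.0, 2.0]])
tc_correlated = gaussian_tc([[1.0, 0.8], [0.8, 1.0]])
```

For two variables the result reduces to the familiar Gaussian mutual information, $-\frac{1}{2}\log(1-\rho^2)$.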
For $Y$ to contain the common information in $X$, we need $TC(X|Y)=0$. Instead of enforcing the condition that $TC(X|Y)=0$ and looking for the most succinct $Y$ that satisfies this condition, as Wyner does [Wyner1975], we consider the dual formulation where we minimize $TC(X|Y)$ subject to constraints on the size of the state space of $Y$ [Op’t Veld and Gastpar2016a]. This optimization can be written equivalently as follows.
We will show in Thm. 3.3 that an upper bound for this objective is obtained by solving a sequence of optimization problems of the following form, indexed by $k$.
The definition of the layer-indexed variables $X^k$ is discussed next, but the high level idea is that we have reduced the difficult optimization over many latent factors in Eq. 2 to a sequence of optimizations with a single latent factor in Eq. 3. Each optimization gives us a tighter upper bound on our original objective, Eq. 2.
We begin with some input data, $X$, and then construct $Y_1$ to minimize $TC(X|Y_1)$. After doing so, we would like to transform the original data into the remainder information, $X^1$, so that we can use the same optimization to learn a factor, $Y_2$, that extracts more common information that was not already captured by $Y_1$. We diagram this construction at layer $k$ in Fig. 1 and show in Thm 3.1 the requirements for constructing the remainder information. The result of this procedure is encapsulated in Cor. 3.2 which says that we can iterate this procedure and $TC(X^k)$ will be reduced at each layer until it reaches zero and $Y$ captures all the common information.
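Theorem 3.1 (Incremental decomposition of common information). For $Y$ a function of $X$, the following decomposition holds,
$$TC(X) = TC(\bar{X}) + TC(X;Y), \qquad (4)$$
if the remainder information $\bar{X} \equiv \bar{X}_1,\ldots,\bar{X}_n, Y$ satisfies two properties.
1. Invertibility: there exist functions $g, h$ so that $x_i = g(\bar{x}_i, y)$ and $\bar{x}_i = h(x_i, y)$.
2. Remainder contains no information about $Y$: $\forall i,\ I(\bar{X}_i;Y) = 0$.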
We have to show that this quantity equals zero under the assumptions specified. First, we multiply the fraction by one by putting terms in the numerator and denominator. After applying condition (2) that $\forall i,\ I(\bar{X}_i;Y)=0$, we can remove two terms leaving the following.
If condition (1) of the theorem is satisfied, then, conditioned on $y$, $x_i$ and $\bar{x}_i$ are related by a deterministic formula. We can see from applying the change of variables formula for probability distributions that the terms in this expression cancel, leaving us with zero, as we intended to prove. ∎
The decomposition above was originally introduced for discrete variables as the “information sieve” [Ver Steeg and Galstyan2016]; the continuous formulation we introduce here replaces the first condition used in the original statement with an analogous one that is appropriate for continuous variables. Note that because we can always find non-negative solutions for $TC(X;Y)$, it must be that $TC(\bar{X}) \le TC(X)$. In other words, the remainder information is more independent than the input data. This is consistent with the intuition that the sieve is sifting out the common information at each layer.
Corollary 3.2 (Iterative decomposition of TC). With a hierarchical representation where each $Y_k$ is a function of $X^{k-1}$ and $X^k$ is the remainder information as defined in Thm 3.1,
$$TC(X) = TC(X^r) + \sum_{k=1}^{r} TC(X^{k-1};Y_k).$$
This follows from repeated application of Eq. 4. $TC(X)$ is a constant that depends on the data. For high-dimensional data, it is impossible to measure $TC(X)$, but by learning latent factors extracting progressively more dependence, we get a sequence of better bounds.
Theorem 3.3 (Decomposition of common information). For the sieve decomposition, the following bound holds.
The equality comes from Cor. 3.2.
The first line follows from the change of variables formula for the transformation connecting layer to the input layer. On the second line we multiply by 1 and re-arrange, collecting terms in the next two lines. The last inequality follows from non-negativity of TC and mutual information. ∎
It remains to solve the optimization in Eq. 3. For now we drop the index $k$ and focus on minimizing $TC(X|Y)$ for a single factor $Y$. To get a simple and tractable solution to this non-convex problem, we consider a further simplification where $X$ is Gaussian with covariance matrix $\Sigma$ and inverse covariance $\Sigma^{-1}$. If $X$ is Gaussian and $Y$’s dependence on $X$ is linear and Gaussian, the joint distribution over $X, Y$ will also be Gaussian. We write out the optimization in Eq. 3 under this condition.
Two immediate simplifications are apparent. First, this objective is invariant to scaling of $Y$. Any solution is therefore equivalent to a rescaled solution in which the noise in $Y$ has unit variance, so without loss of generality we set the noise variance to one. Second, we invoke Bayes rule to see $H(X|Y) = H(Y|X) + H(X) - H(Y)$ where the first two terms on the right hand side are constants with respect to the optimization. We re-write the optimization accordingly.
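The objective is invariant to translation of the marginals, so w.l.o.g. we also set $\langle X_i \rangle = 0$. Define a nonlinear change of variables in terms of the correlation coefficient, $\rho_i = \langle X_i Y\rangle / \sqrt{\langle X_i^2\rangle \langle Y^2\rangle}$. To translate between $w$ and $\rho$, we also note, $\langle X_i Y\rangle = \sum_j \Sigma_{ij} w_j$, and $\langle Y^2\rangle = 1 + \sum_{i,j} w_i \Sigma_{ij} w_j$. This leads to the following optimization, neglecting some constants.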
Next, we set derivatives with respect to each $\rho_i$ to zero.
Now we use the identities to translate back to a fixed-point equation in terms of $w$ and rearrange.
Interestingly, we arrive at a novel nonlinear twist on the classic Hebbian learning rule [Baldi and Sadowski2015]. If $X_i$ and $Y$ “fire together they wire together” (i.e. correlations lead to stronger weights), but this objective strongly prefers correlations that are nearly maximal, in which case the denominator becomes small and the weight becomes large. This optimization of $TC(X|Y)$ for continuous random variables $X$ and $Y$ is, to the best of our knowledge, the first tractable approach except for a special case discussed by [Op’t Veld and Gastpar2016a]. Also note that although we used $\Sigma$ and its inverse in the derivation, the solution does not require us to calculate these computationally intensive quantities.
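A minimal sketch of such a fixed-point iteration follows. It assumes the update of Eq. 6 takes the form $w_i \leftarrow \langle X_i Y\rangle / (\langle X_i^2\rangle\langle Y^2\rangle - \langle X_i Y\rangle^2)$ with $Y = w\cdot X + \epsilon$ and unit-variance noise $\epsilon$. That form is an assumption: it is the stationarity condition obtained by carrying out the steps described above, and it matches the remark that the denominator shrinks as the correlation becomes maximal. The function names, the initialization scale, and the fixed iteration count are all illustrative choices.

```python
import math
import random


def make_single_source_data(n_samples, n_vars, rng):
    """Noisy copies of one unit-variance Gaussian source; returns (source, centered data)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    data = [[zt + rng.gauss(0.0, 1.0) for _ in range(n_vars)] for zt in z]
    means = [sum(row[i] for row in data) / n_samples for i in range(n_vars)]
    return z, [[row[i] - means[i] for i in range(n_vars)] for row in data]


def fit_sieve_layer(data, n_iter=100, seed=0):
    """Iterate the assumed fixed-point update on zero-mean data (list of rows)."""
    rng = random.Random(seed)
    n_samples, n_vars = len(data), len(data[0])
    x_sq = [sum(row[i] ** 2 for row in data) / n_samples for i in range(n_vars)]
    # Random start, scaled by each marginal's standard deviation (assumed scale).
    w = [rng.gauss(0.0, 1.0) / math.sqrt(n_vars * x_sq[i]) for i in range(n_vars)]
    for _ in range(n_iter):
        s = [sum(wi * xi for wi, xi in zip(w, row)) for row in data]
        y_sq = sum(v * v for v in s) / n_samples + 1.0  # + unit noise variance
        xy = [sum(row[i] * v for row, v in zip(data, s)) / n_samples
              for i in range(n_vars)]
        w = [xy[i] / (x_sq[i] * y_sq - xy[i] ** 2) for i in range(n_vars)]
    return w


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))


rng = random.Random(1)
source, data = make_single_source_data(1000, 5, rng)
weights = fit_sieve_layer(data)
factor = [sum(wi * xi for wi, xi in zip(weights, row)) for row in data]
correlation = abs(pearson(factor, source))  # sign of the factor is arbitrary
```

Each iteration needs only the moments $\langle X_i Y\rangle$, $\langle X_i^2\rangle$ and $\langle Y^2\rangle$, so the cost per step is linear in the number of variables and the full covariance matrix is never formed.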
In the discrete formulation of the sieve, constructing remainder information is a major problem that ultimately imposes a bottleneck on its usefulness because the state space of remainder information can grow quickly. In the linear case, however, the construction of remainder information is a simple linear transformation reminiscent of incremental PCA. We define the remainder information with a linear transformation, $\bar{X}_i \equiv X_i - \frac{\langle X_i Y\rangle}{\langle Y^2\rangle} Y$. This transformation is clearly invertible (condition (1)), and it can be checked that $\langle \bar{X}_i Y\rangle = 0$ which implies $I(\bar{X}_i;Y)=0$ (condition (2)).
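The two conditions can be checked numerically. The sketch below assumes the remainder is the regression residual $\bar{x}_i = x_i - (\langle x_i y\rangle / \langle y^2\rangle)\, y$ computed with sample averages, and it uses arbitrary illustrative weights rather than learned ones.

```python
import random


def remainder(data, y):
    """Return (remainder rows, coefficients) with x_bar_i = x_i - c_i * y."""
    n_samples, n_vars = len(data), len(data[0])
    y_sq = sum(v * v for v in y) / n_samples
    coef = [sum(row[i] * v for row, v in zip(data, y)) / n_samples / y_sq
            for i in range(n_vars)]
    rem = [[row[i] - coef[i] * v for i in range(n_vars)] for row, v in zip(data, y)]
    return rem, coef


rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(500)]
w = [0.5, -1.0, 2.0]  # arbitrary illustrative weights, not learned
y = [sum(wi * xi for wi, xi in zip(w, row)) for row in data]
rem, coef = remainder(data, y)

# Condition (2): each remainder variable is uncorrelated with y.
residual_cov = [sum(r[i] * v for r, v in zip(rem, y)) / len(y) for i in range(3)]
# Condition (1): the transformation is inverted by adding c_i * y back.
restored = [[r[i] + coef[i] * v for i in range(3)] for r, v in zip(rem, y)]
max_restore_error = max(abs(a - b)
                        for ra, rb in zip(restored, data) for a, b in zip(ra, rb))
```

Zero correlation implies zero mutual information only under the joint Gaussian assumption, which is why this construction is specific to the linear, Gaussian case.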
The solution for the linear, Gaussian case is more flexible than it looks. We do not actually have to require that the data, $X$, is drawn from a jointly normal distribution to get meaningful results. It turns out that if each of the individual marginals is Gaussian, then the expression for mutual information for Gaussians provides a lower bound for mutual information [Foster and Grassberger2011]. Also, the objective (Eq. 2) is invariant under invertible transformations of the marginals [Cover and Thomas2006]. Therefore, to ensure that the optimization that we solved (Eq. 5) is a lower bound for the optimization of interest, we should transform the marginals to be individually Gaussian distributed. Several nonlinear, parametric methods to Gaussianize one-dimensional data exist, including a recent method that works well for long-tailed data [Goerg2014]. Alternatively, a nonparametric approach is to Gaussianize data based on the rank statistics [Van der Waerden1952]. Finally, [Singh and Póczos2017] study information measures for a large family of distributions that can be nonparametrically transformed into normal distributions.
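A minimal sketch of rank-based Gaussianization of one marginal: each value is replaced by the standard normal quantile of its rank, $\Phi^{-1}(r/(N+1))$, which is the usual form of Van der Waerden scores. The function name is illustrative, and ties are broken by input order, which is a simplification.

```python
from statistics import NormalDist


def gaussianize_by_rank(values):
    """Map each value to the standard-normal quantile of its rank, r / (N + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda t: values[t])  # indices from smallest to largest
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf(rank / (n + 1))
    return scores


skewed = [0.1, 5.0, 0.3, 40.0, 1.2, 0.7, 2.5]  # long-tailed toy sample
scores = gaussianize_by_rank(skewed)
```

The transform is monotone, so it preserves the ordering of each marginal while giving it an approximately normal shape. It would be applied to each variable separately before running the optimization.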
A concrete implementation of one layer of the sieve transformation is straightforward and the algorithm is summarized in Alg. 1. Our implementation is available online [Ver Steeg2016]. The minimal preprocessing of the data is to subtract the mean of each variable. Optionally, further Gaussianizing preprocessing can be applied. Our fixed point optimization requires us to start with some weights, $w$, and we iteratively update $w$ using Eq. 6 until we reach a fixed point. This only guarantees that we find a local optimum so we typically run the optimization 10 times and take the solution with the highest value of the objective. We initialize $w_i$ to be drawn from a normal with zero mean and scale . We scale each $w_i$ by the standard deviation of each marginal so that one variable does not strongly dominate the random initialization.
The iteration proceeds by estimating marginals and then applying Eq. 6. Estimating the covariance at each step is the main computational burden, but the steps are all linear. If we have $N$ samples and $n$ variables, then we calculate labels for each data point, $y = w\cdot x$, which amounts to $N$ dot products of vectors with length $n$. Then we calculate the covariance, $\langle X_i Y\rangle$, which amounts to $n$ dot products of vectors of length $N$. These are the most intensive steps and could be easily sped up using GPUs or mini-batches if $N$ is large. Convergence is determined by checking when changes in the objective of Eq. 5 fall below a certain threshold, in our experiments.
After training one layer of the sieve, it is trivial to take the remainder information and feed it again through Alg. 1. While our optimization in Eq. 5 formally involved a probabilistic function, we take the final learned function to be deterministic, $y = w\cdot x$, as required by Thm. 3.1. Each layer contributes $TC(X^{k-1};Y_k)$ in our decomposition of $TC(X)$, so we can stop when these contributions become negligible. This occurs when the variables in $X^k$ become independent. In that case, $TC(X^k)=0$ and since $TC(X^k) \ge TC(X^k;Y_{k+1})$, we get no more positive contributions from optimizing $TC(X^k;Y_{k+1})$.
We begin with some benchmark results on a synthetic model. We use this model to show that the sieve can uniquely recover the hidden sources, while other methods fail to do so.
For the synthetic examples, we consider data generated according to a model defined in Fig. 2. We have sources, each with unit variance, $Z_j \sim \mathcal{N}(0,1)$. Each source has children and the children are not overlapping. Each channel is an additive white Gaussian noise (AWGN) channel defined as $X_i = Z_j + \epsilon_i$, where $Z_j$ is the parent of $X_i$. The noise has some variance that may be different for each observed variable, $\epsilon_i \sim \mathcal{N}(0,\sigma_i^2)$. Each channel can be characterized as having a capacity, $C_i = \frac{1}{2}\log(1 + 1/\sigma_i^2)$ [Cover and Thomas2006], and we define the total capacity, $C = \sum_i C_i$. For experiments, we set $C$ to be some constant, and we set the noise so that the fraction, $C_i/C$, allocated to each variable, $X_i$, is drawn from the uniform distribution over the simplex.
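A minimal sketch of this generative model under stated assumptions: capacities are measured in nats, the total capacity is split separately among the children of each source, and the uniform distribution over the simplex is sampled by normalizing independent exponential draws (a standard construction). All function and parameter names are illustrative.

```python
import math
import random


def sample_noise_variances(n_children, total_capacity, rng):
    """Split `total_capacity` uniformly over the simplex; return (capacities, variances)."""
    draws = [rng.expovariate(1.0) for _ in range(n_children)]
    total = sum(draws)
    capacities = [total_capacity * d / total for d in draws]
    # Unit-variance source: C = 0.5 * log(1 + 1 / sigma^2)  =>  sigma^2 = 1 / (exp(2C) - 1)
    variances = [1.0 / math.expm1(2.0 * c) for c in capacities]
    return capacities, variances


def sample_dataset(n_samples, n_sources, n_children, total_capacity, rng):
    """Each source has `n_children` non-overlapping noisy copies (AWGN channels)."""
    variances = []
    for _ in range(n_sources):
        variances += sample_noise_variances(n_children, total_capacity, rng)[1]
    sources, observed = [], []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_sources)]
        x = [z[i // n_children] + rng.gauss(0.0, math.sqrt(v))
             for i, v in enumerate(variances)]
        sources.append(z)
        observed.append(x)
    return sources, observed


rng = random.Random(0)
capacities, variances = sample_noise_variances(4, 2.0, rng)
sources, observed = sample_dataset(50, 2, 3, 2.0, rng)
```

Because the fractions are drawn uniformly from the simplex, some children receive almost no capacity and are therefore almost pure noise, which is what makes recovery harder as the number of children grows.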
We examine how quickly the objective converges by plotting the error at the $t$-th iteration. The error is defined as the difference between TC at each iteration and the final TC. We take the final value of TC to be the value obtained when the magnitude of changes falls below . We set for these experiments. In Fig. 3, we look at convergence for a few different settings of the generative model and see linear rates of convergence (where error is plotted on a log scale, as is conventional for convergence plots), with a coefficient that seems to depend on problem details. The slowest rate of convergence comes from data where each $X_i$ is generated from an independent normal distribution (i.e., there is no common information).
As a first test of performance, we consider a simple version of the model in Fig. 2 in which we have just a single source and we have observed variables that are noisy copies of the source. For this experiment, we set total capacity to . By varying the number of observed variables, we are spreading this capacity across a larger number of noisier variables. We use the sieve to recover a single latent factor, $Y$, that captures as much of the dependence as possible (Eq. 2), and then we test how close this factor is to the true source, $Z$, using Pearson correlation. We also compare to various other standard methods: PCA [Halko et al.2011], ICA [Hyvärinen and Oja2000], Non-Negative Matrix Factorization (NMF) [Lin2007], Factor Analysis (FA) [Cattell1952], Local Linear Embedding (LLE) [Roweis and Saul2000], Isomap [Tenenbaum et al.2000], Restricted Boltzmann Machines (RBMs) [Hinton and Salakhutdinov2006], and k-Means [Sculley2010]. All methods were run using implementations in the scikit-learn library [Pedregosa et al.2011].
Looking at the results in Fig. 4(a), we see that for a small number of variables almost any technique suffices to recover the source. As the number of variables rises, however, intuitively reasonable methods fail and only the sieve maintains high performance. The first component of PCA, for instance, is the projection with the largest variance but it can be shown that by changing the scale of the noise in different directions, this component can be made to point in any direction. Unlike PCA, the sieve is invariant under scale transformations of each variable. Error bars are produced by looking at the standard deviation of results over 10 randomly generated datasets. Some error bars are smaller than the plot markers. Besides being the most accurate method, the sieve also has the smallest variance.
In the generative model in Fig. 2, we have independent sources that are each Gaussian distributed. We could imagine applying an orthonormal rotation, $R$, to the vector of sources and call these $\tilde{Z} = RZ$. Because of the Gaussianity of the original sources, $\tilde{Z}$ also represent independent Gaussian sources. We can write down an equivalent generative model for the $X_i$’s, but each $X_i$ now depends on all the $\tilde{Z}_j$ (i.e., ). From a generative model perspective, our original model is unidentifiable and therefore independent component analysis cannot recover it [Hyvärinen and Oja2000]. On the other hand, the original generating model is special because the common information about the $X_i$’s is localized in individual sources, while in the rotated model, you need to combine information from all the sources to predict any individual $X_i$. The sieve is able to uniquely recover the true sources because they represent the optimal way to sift out common information.
To measure our ability to recover the independent sources in our model, we consider a model with sources and varying numbers of noisy observations. The results are shown in Fig. 4(b). We learn 10 layers of the sieve and check how well the learned factors recover the true sources. We also specify 10 components for the other methods shown for comparison. As predicted, ICA does not recover the independent sources. While the generative model is in the class described by Factor Analysis (FA), there are many FA models that are equally good generative models of the data. In other words, FA suffers from an identifiability problem that makes it impossible to uniquely pick out the correct model [Shalizi2013]. In contrast, common information provides a simple and effective principle for uniquely identifying the true sources.
The first component of PCA explains the most variance in the data, and the weights of the first component are often used in exploratory analysis to understand the semantics of discovered factors. Analogously, the first component of the sieve extracts the largest source of common information. In Fig. 5 we compare the top components learned by the sieve on the Olivetti faces dataset to those learned by PCA. The sieve may be more practical for extracting components if data is high dimensional since its complexity is linear in the number of variables while PCA is quadratic. Like PCA, we can also use the sieve for reconstructing data from a small number of learned factors. Note that the sieve transform is invertible so that $X^0_i = X^1_i + \frac{\langle X^0_i Y_1\rangle}{\langle Y_1^2\rangle} Y_1$. If we have a sieve transformation with $r$ layers, then we can continue this expansion as follows.
If we knew the remainder information, $X^r_i$, this reconstruction would be perfect. However, we can simply set the $X^r_i = 0$ and we will get a prediction for $X_i$ based only on the learned factors, $Y$, as in Fig. 5.
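The reconstruction can be sketched as follows, assuming each layer removes the regression of every variable on that layer's factor, so that $x_i = x^r_i + \sum_k c_{k,i}\, y_k$ with $c_{k,i} = \langle x^{k-1}_i y_k\rangle / \langle y_k^2\rangle$. The layer weights here are arbitrary illustrative choices rather than learned ones. As a simplification, each factor is computed from the remainder variables only, without passing earlier factors forward.

```python
import random


def sieve_forward(data, weight_list):
    """Apply fixed linear layers; return (factors, coefficients, final remainder)."""
    rem = [row[:] for row in data]
    n_samples = len(rem)
    factors, coefs = [], []
    for w in weight_list:
        y = [sum(wi * xi for wi, xi in zip(w, row)) for row in rem]
        y_sq = sum(v * v for v in y) / n_samples
        c = [sum(row[i] * v for row, v in zip(rem, y)) / n_samples / y_sq
             for i in range(len(w))]
        rem = [[row[i] - c[i] * v for i in range(len(w))] for row, v in zip(rem, y)]
        factors.append(y)
        coefs.append(c)
    return factors, coefs, rem


def reconstruct(factors, coefs, remainder_rows=None):
    """Rebuild the data from the factors; a missing remainder is treated as zero."""
    n_samples, n_vars = len(factors[0]), len(coefs[0])
    if remainder_rows is not None:
        out = [row[:] for row in remainder_rows]
    else:
        out = [[0.0] * n_vars for _ in range(n_samples)]
    for y, c in zip(factors, coefs):
        for t in range(n_samples):
            for i in range(n_vars):
                out[t][i] += c[i] * y[t]
    return out


rng = random.Random(0)
data = []
for _ in range(300):
    z = rng.gauss(0.0, 1.0)
    data.append([z + 0.5 * rng.gauss(0.0, 1.0) for _ in range(4)])

layer_weights = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.0, 0.0]]  # illustrative only
factors, coefs, rem = sieve_forward(data, layer_weights)
exact = reconstruct(factors, coefs, rem)
approx = reconstruct(factors, coefs)

n_values = len(data) * len(data[0])
max_exact_error = max(abs(a - b) for ra, rb in zip(exact, data) for a, b in zip(ra, rb))
mse_factors_only = sum((a - b) ** 2
                       for ra, rb in zip(approx, data) for a, b in zip(ra, rb)) / n_values
data_power = sum(v * v for row in data for v in row) / n_values
```

With the remainder kept, the data is recovered up to floating-point error. With the remainder set to zero, the error equals the mean square of the discarded remainder, which is what a lossy reconstruction from the factors alone trades away.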
To demonstrate that our approach is practical for blind source separation in a more realistic scenario, we applied the sieve to recover spatial brain components from fMRI data. This data is generated according to a synthetic but biologically motivated model that incorporates realistic spatial modes and heterogeneous temporal signals [Erhardt et al.2012]. We show in Fig. 6(b) that we recover components that match well with the true spatial components. For comparison, we show ICA’s performance in Fig. 6(c) which looks qualitatively worse. ICA’s poor performance for recovering spatial MRI components is known and various extensions have been proposed to remedy this [Allen et al.2012]. This preliminary result suggests that the concept of “common information” may be a more useful starting point than “independent components” as an underlying principle for brain imaging analysis.
The sieve can be viewed as a dimensionality reduction (DR) technique. Therefore, we apply various DR methods to two standard datasets and use a Support Vector Machine with a Gaussian kernel to compare the classification accuracy after dimensionality reduction. The two datasets we studied were GISETTE and MADELON and consist of 5000 and 500 dimensions respectively. For each method and dataset, we learn a low-dimensional representation on training data and then transform held-out test data and report the classification accuracy on that. The results are summarized in Fig. 7.
For the GISETTE dataset, we see factor analysis, the sieve, and PCA performing the best, producing low dimensional representations with similar quality using a relatively small number of dimensions. For the MADELON dataset, the sieve representation gives the best accuracy with factor analysis and PCA resulting in accuracy drops of about five and ten percent respectively. Interestingly, all three techniques peak at five dimensions, which was intended to be the correct number of latent factors embedded in this dataset [Guyon et al.2004].
Although the sieve is linear, the information objective that is optimized is nonlinear so the sieve substantially differs from methods like PCA. Superficially, the sieve might seem related to methods like Canonical Correlation Analysis (CCA) that seek to find a $Y$ that makes $X$ and $Z$ independent, but that method requires some set of labels, $Z$. One possibility would be to make $Z$ a copy of $X$, so that $Y$ is reducing dependence between $X$ and a copy of itself [Wang et al.2010]. However, this objective differs from common information as can be seen by considering the case where $X$ consists of independent variables. In that case the common information within $X$ is zero, but $X$ and its copy still have dependence. The concept of “common information” has largely remained restricted to information-theoretic contexts [Xu et al.2013, Wyner1975, Kumar et al.2014, Op’t Veld and Gastpar2016a, Op’t Veld and Gastpar2016b]. The common information in $X$ that is about some variable, $Z$, is called intersection information and is also an active area of research [Griffith et al.2014].
Insofar as the sieve reduces the dependence in the data, it can be seen as an alternate approach to independent component analysis [Comon1994] that is more directly comparable to “least dependent component analysis” [Stögbauer et al.2004]. As an information theoretic learning framework, the sieve could be compared to the information bottleneck [Tishby et al.2000], which also has an interesting Gaussian counterpart [Chechik et al.2005]. The bottleneck requires labeled data to define its objective. In contrast, the sieve relies on an unsupervised objective that fits more closely into a recent program for decomposing information in high-dimensional data [Ver Steeg and Galstyan2014, Ver Steeg and Galstyan2015, Ver Steeg and Galstyan2016], except that work focused on discrete latent factors.
The sieve could be viewed as a new objective for projection pursuit [Friedman1987] based on common information. The sieve stands out from standard pursuit algorithms in two ways. First, an information based “orthogonality” criterion for subsequent projections naturally emerges and, second, new factors may depend on factors learned at previous layers (note that in Fig. 1 each learned latent factor is included in the remainder information that is optimized over in the next step). More broadly, the sieve can be viewed as a new approach to unsupervised deep representation learning [Bengio et al.2013, Hinton and Salakhutdinov2006]. In particular, our setup can be directly viewed as an auto-encoder with a novel objective [Bengio et al.2007]. From that point of view, it is clear that the sieve can also be directly leveraged for unsupervised density estimation [Dinh et al.2014].
We introduced a new scheme for incrementally extracting common information from high-dimensional data. The foundation of our approach is an efficient information theoretic optimization that finds latent factors that capture as much information about multivariate dependence in the data as possible. With a practical method for extracting common information from high-dimensional data, we were able to explore new applications of common information in machine learning. Besides promising applications for exploratory data analysis and dimensionality reduction, common information seems to provide a compelling approach to blind source separation.
While the results here relied on assumptions of linearity and Gaussianity, the invariance of the objective under nonlinear marginal transforms, a common ingredient in deep learning schemes, suggests a straightforward path to generalization that we leave to future work. The greedy nature of the sieve construction may be a limitation so another potential direction would be to jointly optimize several latent factors at once [Ver Steeg and Galstyan2017]. Sifting out common information in high-dimensional data provides a practical and distinctive new principle for unsupervised learning.
GV thanks Sanjoy Dasgupta, Lawrence Saul, and Yoav Freund for encouraging exploration of the linear, Gaussian case for decomposing multivariate information. This work was supported in part by DARPA grant W911NF-12-1-0034 and IARPA grant FA8750-15-C-0071.
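[Guyon et al.2004] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2004.
[Hinton and Salakhutdinov2006] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.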