Common Information Components Analysis

02/03/2020 ∙ by Michael Gastpar, et al. ∙ EPFL

We give an information-theoretic interpretation of Canonical Correlation Analysis (CCA) via (relaxed) Wyner's common information. CCA makes it possible to extract from two high-dimensional data sets low-dimensional descriptions (features) that capture the commonalities between the data sets, using a framework of correlations and linear transforms. Our interpretation first extracts the common information up to a pre-selected resolution level, and then projects this back onto each of the data sets. In the case of Gaussian statistics, this procedure precisely reduces to CCA, where the resolution level specifies the number of CCA components that are extracted. This also suggests a novel algorithm, Common Information Components Analysis (CICA), with several desirable features, including a natural extension beyond just two data sets.


I Introduction

Understanding relations between two (or more) sets of variates is key to many tasks in data analysis and beyond. To approach this problem, it is natural to reduce each of the sets of variates separately in such a way that the reduced descriptions, or features, fully capture the commonality between the two sets, while suppressing aspects that are individual to each of the sets. This makes it possible to understand the relation between the data sets without obfuscation.

A popular framework to accomplish this task follows the classical viewpoint of dimensionality reduction and is referred to as Canonical Correlation Analysis (CCA) [1]. CCA seeks the best linear extraction, i.e., we consider linear projections of the original variates. In this case, the quality of the extraction is assessed via the resulting correlation coefficient. The result can be expressed directly via the singular value decomposition. Via the so-called kernel trick, this can be extended to cover arbitrary (fixed) function classes.

An alternative framework is built around the concept of maximal correlation. Here, one seeks arbitrary (not necessarily linear) remappings of the original data in such a way as to maximize their correlation coefficient. This perspective culminates in the well-known alternating conditional expectation  (ACE) algorithm [2], but the problem does not admit a compact solution.
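To make the alternating-conditional-expectation idea concrete, here is a minimal numpy sketch for discrete (integer-coded) samples. It is only an illustrative sketch, not the ACE implementation of [2]; the function name and toy data are ours.

```python
import numpy as np

def ace_discrete(x, y, iters=50):
    """Minimal alternating-conditional-expectation sketch for integer-coded samples.

    Alternates f(x) <- E[g(Y) | X = x] and g(y) <- E[f(X) | Y = y], standardizing
    after each update, so that corr(f(X), g(Y)) approaches the maximal correlation.
    """
    rng = np.random.default_rng(0)
    f = rng.standard_normal(x.max() + 1)  # one score per symbol of X
    g = rng.standard_normal(y.max() + 1)  # one score per symbol of Y
    for _ in range(iters):
        for a in np.unique(x):            # f(a) <- E[g(Y) | X = a]
            f[a] = g[y[x == a]].mean()
        f = (f - f[x].mean()) / f[x].std()
        for b in np.unique(y):            # g(b) <- E[f(X) | Y = b]
            g[b] = f[x[y == b]].mean()
        g = (g - g[y].mean()) / g[y].std()
    return f, g

# Toy usage: Y is a noisy copy of a uniform 4-ary X.
rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=5000)
y = np.where(rng.random(5000) < 0.8, x, rng.integers(0, 4, size=5000))
f, g = ace_discrete(x, y)
print(np.corrcoef(f[x], g[y])[0, 1])  # estimate of the maximal correlation
```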

In both approaches, the commonality between variates is measured by correlation. By contrast, in this paper, we consider a different approach that measures commonality between variates via (relaxed Wyner’s) Common Information [3, 4], a variant of a mutual information measure.

I-A Contributions

The main contributions of our work are:

  • The introduction of a novel algorithm, referred to as Common Information Components Analysis (CICA), to separately reduce each set of variates in such a way as to retain the commonalities between the sets of variates while suppressing their individual features. A conceptual sketch is given in Figure 1.

  • The proof that for the special case of Gaussian variates, CICA reduces to CCA. Thus, CICA is a strict generalization of CCA.

Fig. 1: Common Information Components Analysis (for the case of two data sets): For two (high-dimensional) data sets (sources) X and Y, we first determine their (Wyner's) common information at level γ. The Common Information Components are then obtained by projecting the common information back onto the two data sources, respectively. The parameter γ is the compression level: a larger γ means coarser common information.

I-B Related Work

Connections between CCA and Wyner's common information have been explored in the past. It is well known that for Gaussian vectors, (standard, non-relaxed) Wyner's common information is attained by all of the CCA components together, see [5]. This has been further interpreted, see e.g. [6]. To put our work into context, we note that it is only the relaxed Wyner's common information [3, 4] that makes it possible to conceptualize the sequential, one-by-one recovery of the CCA components, and thus, the spirit of dimensionality reduction.

Information measures have played a role in earlier considerations with some connections to dimensionality reduction and feature extraction. This includes independent component analysis (ICA) [7] and the information bottleneck [8, 9], amongst others. Finally, we note that an interpretation of CCA as a (Gaussian) probabilistic model was presented in [10].

I-C Notation

A bold capital letter such as X denotes a random vector, and x its realization. A non-bold capital letter such as A denotes a (fixed) matrix, and A^H its Hermitian transpose. Specifically, K_X denotes the covariance matrix of the random vector X, and K_XY denotes the covariance matrix between the random vectors X and Y.

II Relaxed Wyner’s Common Information

The main framework and underpinning of the proposed algorithm is Wyner’s common information and its extension, which is briefly reviewed in the sequel, along with its key properties.

II-A Wyner’s Common Information

Wyner’s common information is defined for two random variables (or random vectors) X and Y of arbitrary fixed joint distribution p(x, y).

Definition 1 (from [11]).

For random variables X and Y with joint distribution p(x, y), Wyner’s common information is defined as

C(X; Y) = min_{p(w|x,y): X − W − Y form a Markov chain} I(X, Y; W).   (1)

Basic properties are stated below in Lemma 1 (setting γ = 0). We note that explicit formulas for Wyner’s common information are known only for a small number of special cases. The case of the doubly symmetric binary source is solved completely in [11] and can be written as

C(X; Y) = 1 + h_b(a_0) − 2 h_b(a_1),  with  a_1 = (1 − √(1 − 2 a_0)) / 2,   (2)

where a_0 denotes the probability that the two sources are unequal (assuming without loss of generality a_0 ≤ 1/2) and h_b(·) denotes the binary entropy function. Further special cases of discrete-alphabet sources appear in [12].

Moreover, when X and Y are jointly Gaussian with correlation coefficient ρ, then C(X; Y) = (1/2) log ((1 + |ρ|)/(1 − |ρ|)). Note that for this example, I(X; Y) = (1/2) log (1/(1 − ρ²)). This case was solved in [13, 14] using a parameterization of conditionally independent distributions. We note that an alternative proof follows from the arguments presented in [3, 4].
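As a quick numerical illustration of these two closed-form expressions (as reconstructed above), the following snippet evaluates both for ρ = 0.9; taking logarithms base 2 (bits) is our assumption here.

```python
import numpy as np

# Scalar Gaussian pair with correlation rho: compare Wyner's common information
# C(X;Y) = 0.5*log2((1+rho)/(1-rho)) with I(X;Y) = 0.5*log2(1/(1-rho**2)),
# using the closed forms quoted above (logarithms base 2, i.e., in bits).
rho = 0.9
C = 0.5 * np.log2((1 + rho) / (1 - rho))   # ~2.12 bits
I = 0.5 * np.log2(1 / (1 - rho**2))        # ~1.20 bits
print(C, I, C >= I)                        # common information exceeds mutual information
```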

II-B Relaxed Wyner’s Common Information

Definition 2 (from [3]).

For random variables X and Y with joint distribution p(x, y), the relaxed Wyner’s common information is defined as (for γ ≥ 0)

C_γ(X; Y) = min_{p(w|x,y)} I(X, Y; W)  subject to  I(X; Y | W) ≤ γ.   (3)

Lemma 1 (from [4]).

The relaxed Wyner’s common information satisfies the following properties:

  1. For discrete X and Y, the cardinality of W may be restricted to |X| · |Y| + 2.

  2. C_γ(X; Y) ≥ 0, with equality if and only if γ ≥ I(X; Y).

  3. Data processing inequality: If X − Y − Z form a Markov chain, then C_γ(X; Z) ≤ C_γ(X; Y).

  4. C_γ(X; Y) is a convex and continuous function of γ for γ ≥ 0.

  5. If f and g are one-to-one functions, then C_γ(f(X); g(Y)) = C_γ(X; Y).

  6. For discrete X, we have C_γ(X; X) = (H(X) − γ)⁺.

  7. Let (X_1, Y_1), …, (X_n, Y_n) be independent pairs of random variables. Then

    C_γ(X_1, …, X_n; Y_1, …, Y_n) = min_{γ_1, …, γ_n ≥ 0: Σ_i γ_i = γ}  Σ_{i=1}^n C_{γ_i}(X_i; Y_i).   (4)

Explicit formulas for the relaxed Wyner’s common information are not currently known for most joint distributions. A notable exception is when X and Y are jointly Gaussian random vectors of length n. Denote the covariance matrices of the vectors X and Y by K_X and K_Y, respectively, and the covariance matrix between X and Y by K_XY. Then (see [4]),

C_γ(X; Y) = min_{γ_1, …, γ_n ≥ 0: Σ_i γ_i = γ}  Σ_{i=1}^n (1/2) log⁺ ( (1 + ρ_i)(1 − α(γ_i)) / ((1 − ρ_i)(1 + α(γ_i))) ),   (5)

where

α(γ_i) = √(1 − 2^{−2 γ_i}),   (6)

and ρ_i (for i = 1, 2, …, n) are the singular values of K_X^{−1/2} K_XY K_Y^{−H/2}. By contrast, for the doubly symmetric binary source, the relaxed Wyner’s common information is currently unknown (a bound and conjecture appear in [4]).
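As a small computational aid, the following numpy sketch computes the singular values ρ_i entering Equations (5)-(6) from the three covariance matrices. The function name is ours, and the covariance matrices are assumed symmetric positive definite (real-valued case).

```python
import numpy as np

def canonical_correlations(K_X, K_XY, K_Y):
    """Singular values rho_i of K_X^{-1/2} K_XY K_Y^{-1/2} (real-valued case)."""
    def inv_sqrt(K):
        # inverse matrix square root of a symmetric positive definite matrix
        w, V = np.linalg.eigh(K)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    R = inv_sqrt(K_X) @ K_XY @ inv_sqrt(K_Y)
    return np.linalg.svd(R, compute_uv=False)  # ordered rho_1 >= rho_2 >= ...

# Example: two 2-dimensional vectors with one strong and one weak common direction.
K_X, K_Y = np.eye(2), np.eye(2)
K_XY = np.diag([0.9, 0.2])
print(canonical_correlations(K_X, K_XY, K_Y))  # -> [0.9, 0.2]
```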

III The Algorithm

In this section, we present the proposed algorithm in the idealized setting of unlimited data. Specifically, for the proposed algorithm, this means that we assume perfect knowledge of the data distribution p(x, y).

III-A High-level Description

The idea of the proposed algorithm is to estimate the relaxed Wyner’s common information of Equation (3) between the information sources (data sets) X and Y at the chosen level γ. This estimate will come with an associated conditional distribution p(w | x, y). Obtaining the dimension-reduced versions then can be thought of as a type of projection of the resulting random variable W back onto X and Y, respectively. For the case of Gaussian statistics, this can be made precise.

III-B Main Steps of the Algorithm

The algorithm proposed here starts from the joint distribution of the data, p(x, y). Estimates of this distribution can be obtained from data samples of X and Y via standard techniques. The main steps of the procedure can then be described as follows:

Algorithm 1 (CICA).
  1. Select a real number γ, where 0 ≤ γ ≤ I(X; Y). This is the compression level: A low value of γ represents low compression, and thus, many components are retained. A high value of γ represents high compression, and thus, only a small number of components are retained.

  2. Solve the relaxed Wyner’s common information problem,

    C_γ(X; Y) = min_{p(w|x,y): I(X; Y | W) ≤ γ} I(X, Y; W),   (7)

    leading to an associated conditional distribution p(w | x, y). (We note that this minimizer is not generally unique. For example, if W is a minimizer, then so is f(W) for any one-to-one mapping f.)

  3. The dimension-reduced data sets are as follows (a sketch of this step for the discrete case is given right after the algorithm):

    1. Version 1: MAP (maximum a posteriori): u = argmax_w p(w | x), v = argmax_w p(w | y).

    2. Version 2: Conditional Expectation: u = E[W | X = x], v = E[W | Y = y].

    3. Version 3: Marginal Integration.
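The sketch below illustrates Step 3 (Versions 1 and 2 only) in the finite-alphabet case, assuming that a minimizing conditional distribution p(w|x, y) from Step 2 is already available as a numpy array. Function and array names are ours; the Y side is obtained symmetrically.

```python
import numpy as np

def reduce_x(p_xy, p_w_given_xy, w_values=None):
    """Step 3 of Algorithm 1 (discrete case), for the X side; the Y side is symmetric.

    p_xy[x, y]           : joint pmf of (X, Y), with p(x) > 0 for every x
    p_w_given_xy[w, x, y]: a minimizing conditional distribution from Step 2
    w_values             : optional numeric values of W (needed for Version 2)
    """
    p_x = p_xy.sum(axis=1)                                      # p(x)
    p_w_x = np.einsum('wxy,xy->wx', p_w_given_xy, p_xy) / p_x   # p(w | x)
    u_map = p_w_x.argmax(axis=0)                                # Version 1: argmax_w p(w|x)
    u_ce = w_values @ p_w_x if w_values is not None else None   # Version 2: E[W | x]
    return u_map, u_ce
```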

III-C A binary toy example

Let us illustrate the proposed algorithm via a simple toy example. Consider the vector (A, B, C, D) of binary random variables. Suppose that (A, B) are a doubly symmetric binary source (i.e., A is uniform, and B is the result of passing A through a binary symmetric (“bit-flipping”) channel), while C and D are independent binary uniform random variables (also independent of (A, B)). We will then form the vectors X and Y as

X = (A ⊕ C, C)   (8)

and

Y = (B ⊕ D, D),   (9)

where ⊕ denotes modulo-2 addition, as usual. Observe that any pair amongst the four entries in these two vectors are (pairwise) independent binary uniform random variables. Hence, the overall covariance matrix of the merged random vector (X, Y) is merely a scaled identity matrix, implying that CCA does not do anything.

By contrast, for the CICA algorithm (with a suitably small γ and using the MAP version), an optimal solution is to reduce X to A and Y to B. This captures all the dependence between the vectors X and Y, which appears to be the most desirable outcome.
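A short numerical check of this example, under the construction in Equations (8)-(9) as written above: the empirical covariance of the merged vector is (approximately) a scaled identity, while the bits recovered from X and from Y remain strongly correlated.

```python
import numpy as np

# Toy example check: (A, B) a doubly symmetric binary source with P(A != B) = a0,
# C, D independent uniform bits, X = (A xor C, C), Y = (B xor D, D).
rng = np.random.default_rng(0)
n, a0 = 200_000, 0.1
A = rng.integers(0, 2, n)
B = A ^ (rng.random(n) < a0).astype(int)
C, D = rng.integers(0, 2, n), rng.integers(0, 2, n)
X = np.stack([A ^ C, C])
Y = np.stack([B ^ D, D])

# Covariance of the merged vector (X, Y): close to 0.25 * identity,
# so CCA finds no correlated directions.
print(np.round(np.cov(np.vstack([X, Y])), 3))

# Yet X determines A = X[0] xor X[1] and Y determines B, and these are dependent:
print(np.corrcoef(X[0] ^ X[1], Y[0] ^ Y[1])[0, 1])  # approximately 1 - 2*a0
```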

IV For Gaussian, CICA is CCA

In this section, we consider the proposed CICA algorithm in the idealized setting where the data distribution is known exactly. Specifically, we establish that if p(x, y) is a (multivariate) Gaussian distribution, then the classic CCA is a solution to all versions of the proposed CICA algorithm. This is the main technical contribution of the present work.

CCA is perhaps best described by first changing coordinates,

X̃ = K_X^{−1/2} X,   (10)
Ỹ = K_Y^{−1/2} Y.   (11)

With this, the covariance matrix of the vector X̃ is the identity matrix, and so is the covariance matrix of the vector Ỹ. CCA is then easily described by considering the covariance matrix between these two vectors,

K_{X̃ Ỹ} = K_X^{−1/2} K_XY K_Y^{−H/2}.   (12)

A brief overview is given in Appendix A. Let us denote the singular value decomposition of this matrix by

K_{X̃ Ỹ} = U Λ V^H,   (13)

where Λ contains, on its diagonal, the ordered singular values of this matrix, denoted by ρ_1 ≥ ρ_2 ≥ ⋯ ≥ ρ_n. CCA then performs the dimensionality reduction

U_k^H X̃,   (14)
V_k^H Ỹ,   (15)

where the matrix U_k contains the first k columns of U (that is, the left singular vectors corresponding to the k largest singular values), and the matrix V_k the respective right singular vectors. We refer to these as the “top k CCA components.”
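For concreteness, here is a minimal numpy sketch of this whitening-plus-SVD reduction (real-valued case, population covariances assumed known); the function and variable names are ours.

```python
import numpy as np

def cca_top_k(K_X, K_XY, K_Y, k):
    """Return the matrices mapping x and y to their top-k CCA components,
    following Equations (10)-(15): whiten each vector, take the SVD of the
    cross-covariance of the whitened vectors, and keep the top k directions."""
    def inv_sqrt(K):
        w, V = np.linalg.eigh(K)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Wx, Wy = inv_sqrt(K_X), inv_sqrt(K_Y)
    U, s, Vh = np.linalg.svd(Wx @ K_XY @ Wy)
    Uk, Vk = U[:, :k], Vh[:k, :].T        # top-k left / right singular vectors
    return Uk.T @ Wx, Vk.T @ Wy           # apply these to x and to y, respectively

# Usage sketch with zero-mean sample matrices Xs (n x dx) and Ys (n x dy):
#   K_X, K_Y, K_XY = Xs.T @ Xs / n, Ys.T @ Ys / n, Xs.T @ Ys / n
#   Ax, Ay = cca_top_k(K_X, K_XY, K_Y, k=2)
#   features_x, features_y = Xs @ Ax.T, Ys @ Ay.T
```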

Theorem 1.

Let X and Y be jointly Gaussian random vectors. Then, the top k CCA components are a solution to all three versions of Algorithm 1, and γ controls the number k as follows:

k(γ) = number of nonzero terms in the sum in Equation (5) at the optimal allocation of γ_1, …, γ_n,   (16)

where the optimal allocation is the water-filling solution discussed in Appendix B.

Remark 1.

Note that k(γ) is a decreasing, integer-valued function of γ.

This theorem is a consequence of the main result in [3]. A proof outline is provided in Appendix B.

As mentioned earlier, the connection between CCA and (standard, non-relaxed) Gaussian Wyner’s common information is well known [5]. What is new in the present paper is the extension of this insight to relaxed Wyner’s common information. This extension makes it possible to extract the CCA components one-by-one via the compression parameter γ. Evidently, the CICA algorithm only makes sense because we can tune how much common information we wish to extract. In this sense, the choice γ = 0 (the non-relaxed case) is not interesting, since it amounts to a one-to-one transform of the original data (up to completely independent portions), and thus, fails to capture the spirit of “dimensionality reduction.”

V Extension to More Than Two Sources

It is unclear how one would extend CCA to more than two databases. By contrast, for CICA, this extension is conceptually straightforward. The definition of relaxed Wyner’s common information is readily extended to the general case:

Definition 3 (Relaxed Wyner’s Common Information for K variables).

For a fixed probability distribution p(x_1, x_2, …, x_K), we define

C_γ(X_1; X_2; ⋯; X_K) = min I(X_1, X_2, …, X_K; W)   (17)

such that Σ_{k=1}^K H(X_k | W) − H(X_1, X_2, …, X_K | W) ≤ γ, where the minimum is over all probability distributions p(w, x_1, x_2, …, x_K) with marginal p(x_1, x_2, …, x_K).

Hence, to extend CICA (Algorithm 1) to the case of K databases, it now suffices to replace Step 2) with Definition 3. In Step 3), for all three versions, it is immediately clear how they can be extended. For example, for Version 1), we use

u_k = argmax_w p(w | x_k)   (18)

for k = 1, 2, …, K.

It will be shown elsewhere how one can obtain the analogs of Equations (5)-(6) for this generalized case, and thus, an extended version of Theorem 1.

VI Concluding Remarks and Future Work

In a practical setting, one does not have access to the correct data distribution p(x, y). A first approach is to simply work with an estimate of this distribution, based on the data available. But a more interesting implementation is to combine the estimation step with the optimization step. A fast algorithmic implementation will be presented elsewhere.

Appendix A CCA

A brief review of CCA [1] is presented, mostly in view of the proof of Theorem 1, given below in Appendix B. Let X and Y be zero-mean real-valued random vectors with covariance matrices K_X and K_Y, respectively. Moreover, let K_XY = E[X Y^H]. Let us first form

X̃ = K_X^{−1/2} X,   (19)
Ỹ = K_Y^{−1/2} Y.   (20)

With this, the covariance matrix of the vector X̃ is the identity matrix, and so is the covariance matrix of the vector Ỹ. CCA seeks to find vectors a and b such as to maximize the correlation between a^H X̃ and b^H Ỹ, that is,

max_{a, b}  E[a^H X̃ Ỹ^H b] / √( E[(a^H X̃)²] E[(b^H Ỹ)²] ),   (21)

which can be rewritten as

max_{a, b}  a^H R b / (‖a‖ ‖b‖),   (22)

where

R = K_X^{−1/2} K_XY K_Y^{−H/2}.   (23)

Note that this expression is invariant to arbitrary (separate) scaling of a and b. To obtain a unique solution, we could choose to impose that both vectors be unit vectors,

‖a‖ = ‖b‖ = 1.   (24)

From Cauchy-Schwarz, for a fixed b, the maximizing (unit-norm) a is given by

a = R b / ‖R b‖,   (25)

or equivalently, for a fixed a, the maximizing (unit-norm) b is given by

b = R^H a / ‖R^H a‖.   (26)

Plugging in the latter, we obtain

max_{a}  a^H R R^H a / (‖a‖ ‖R^H a‖),   (27)

or, dividing through,

max_{a}  ‖R^H a‖ / ‖a‖.   (28)

The solution to this problem is well known: a is the left singular vector corresponding to the largest singular value of the matrix R, and b is then the corresponding right singular vector. Restarting again from Equation (21), but restricting to vectors that are orthogonal to the optimal choices of the first round leads to the second CCA components, and so on.
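The alternating updates (25)-(26) are simply a power iteration on R, as the following small numerical check illustrates (real-valued, randomly generated R; the setup is purely illustrative).

```python
import numpy as np

# Alternating (25)-(26): a <- R b / ||R b||, b <- R^T a / ||R^T a||.
# This is a power iteration, converging to the top singular-vector pair of R.
rng = np.random.default_rng(0)
R = rng.standard_normal((4, 3))
b = rng.standard_normal(3)
for _ in range(200):
    a = R @ b
    a /= np.linalg.norm(a)
    b = R.T @ a
    b /= np.linalg.norm(b)

U, s, Vh = np.linalg.svd(R)
print(abs(a @ U[:, 0]), abs(b @ Vh[0]))  # both close to 1 (up to sign)
print(a @ R @ b, s[0])                   # achieved correlation equals sigma_1
```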

Appendix B Proof Outline for Theorem 1

In the case of Gaussian vectors, the solution to the optimization problem in Equation (3) is most easily described in two steps. First, we apply the change of basis indicated in Equations (19)-(20) (followed by the rotations U^H and V^H from the singular value decomposition in Equation (13)). This is a one-to-one transform, leaving all information expressions in Equation (3) unchanged. In the new basis, we have independent pairs. When X and Y consist of independent pairs, the solution to the optimization problem in Equation (3) can be reduced to separate scalar optimizations, see [4, Theorem 3] (also quoted above in Lemma 1, Item 7). The remaining crux then is solving the scalar Gaussian version of the optimization problem in Equation (3). This is done in [4, Theorem 4] via an argument of factorization of the convex envelope. The full solution to the optimization problem is given in Equations (5)-(6). The remaining allocation problem over the non-negative numbers γ_1, …, γ_n can be shown to lead to a water-filling solution, see [4, Section IV]. More explicitly, to understand this solution, start by setting γ = I(X; Y). Then, the corresponding C_γ(X; Y) = 0, and the optimizing distribution trivializes. Now, as we lower γ, the various terms in the sum in Equation (5) start to become non-zero, starting with the term with the largest correlation coefficient ρ_1. Hence, an optimizing distribution can be expressed via W = U_k^H X̃ + V_k^H Ỹ + Z, where the matrices U_k and V_k are precisely the top k CCA components (see Equations (14)-(15) and the following discussion), and Z is additive Gaussian noise with mean zero, independent of X and Y.

For the algorithm, we need the corresponding conditional marginals, p(w | x) and p(w | y). By symmetry, it suffices to prove one formula. Changing basis as in Equations (19)-(20), we can write

E[W | X = x] = E[ U_k^H X̃ + V_k^H Ỹ + Z | X = x ]   (29)
= U_k^H E[X̃ | X = x] + V_k^H E[Ỹ | X = x]   (30)
= U_k^H x̃ + V_k^H E[Ỹ | X̃ = x̃]   (31)
= U_k^H x̃ + V_k^H K_{Ỹ X̃} x̃   (32)
= U_k^H x̃ + V_k^H R^H x̃,   (33)

where x̃ = K_X^{−1/2} x, R is the matrix defined in Equation (23), and we used that Z is zero-mean and independent of X. Finally, note that Equation (25) can be read as

a = c R b   (34)

for some real-valued constant c. Thus, combining the top k CCA components,

U_k = R V_k C,   (35)

where C is a diagonal matrix. Hence,

E[W | X = x] = (I_k + C^{−1}) U_k^H x̃   (36)
= D U_k^H K_X^{−1/2} x,   (37)

where D is the diagonal matrix

D = I_k + C^{−1}.   (38)

This is precisely the top k CCA components (note that the solution to the CCA problem (21) is only specified up to a scaling). This establishes the theorem for the case of Version 2) of the proposed algorithm. Clearly, it also establishes that p(w | x) is a Gaussian distribution with mean given by (37), thus establishing the theorem for Version 1) of the proposed algorithm. The proof for Version 3) follows along similar lines and is thus omitted.

Acknowledgment

This work was supported in part by the Swiss National Science Foundation under Grant 169294 and Grant P2ELP2_165137.

References

  • [1] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, December 1936.
  • [2] L. Breiman and J. H. Friedman, “Estimating optimal transformations for multiple regression and correlation,” J. Am. Stat. Assoc., vol. 80, no. 391, pp. 580–598, September 1985.
  • [3] M. Gastpar and E. Sula, “Relaxed Wyner’s common information,” in Proceedings of the 2019 IEEE Information Theory Workshop, Visby, Sweden, 2019.
  • [4] E. Sula and M. Gastpar, “Relaxed Wyner’s common information,” CoRR, vol. abs/1912.07083, 2019. [Online]. Available: http://arxiv.org/abs/1912.07083
  • [5] S. Satpathy and P. Cuff, “Gaussian secure source coding and Wyner’s common information,” in 2015 IEEE International Symposium on Information Theory (ISIT), June 2015, pp. 116–120.
  • [6] S.-L. Huang, G. W. Wornell, and L. Zheng, “Gaussian universal features, canonical correlations, and common information,” in 2018 IEEE Information Theory Workshop (ITW), November 2018.
  • [7] P. Comon, “Independent component analysis,” in Internat. Signal Processing Workshop on High-Order Statistics, Chamrousse, France, 1991, pp. 111–120.
  • [8] H. S. Witsenhausen and A. D. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Transactions on Information Theory, vol. 21, no. 5, pp. 493–501, September 1975.
  • [9] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in The 37th annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, U.S.A., Sep. 1999, pp. 368–377.
  • [10] F. Bach and M. Jordan, “A Probabilistic Interpretation of Canonical Correlation Analysis,” University of California, Berkeley, Department of Statistics, Tech. Rep. 688, April 2005.
  • [11] A. Wyner, “The common information of two dependent random variables,” IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 163–179, March 1975.
  • [12] H. S. Witsenhausen, “Values and bounds for the common information of two discrete random variables,” SIAM J. Appl. Math, vol. 31, no. 2, pp. 313–333, September 1976.
  • [13] G. Xu, W. Liu, and B. Chen, “Wyner’s common information for continuous random variables - a lossy source coding interpretation,” in Annual Conference on Information Sciences and Systems, March 2011.
  • [14] ——, “A lossy source coding interpretation of Wyner’s common information,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 754–768, 2016.