 # Gaussian Lower Bound for the Information Bottleneck Limit

The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables, with respect to the target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-off between representation complexity and its predictive power. Specifically, it is achieved by minimizing the level of mutual information (MI) between the representation and the original variables, subject to a minimal level of MI between the representation and the target. This problem is shown to be in general NP hard. One important exception is the multivariate Gaussian case, for which the Gaussian IB (GIB) is known to obtain an analytical closed form solution, similar to Canonical Correlation Analysis (CCA). In this work we introduce a Gaussian lower bound to the IB curve; we find an embedding of the data which maximizes its "Gaussian part", on which we apply the GIB. This embedding provides an efficient (and practical) representation of any arbitrary data-set (in the IB sense), which in addition holds the favorable properties of a Gaussian distribution. Importantly, we show that the optimal Gaussian embedding is bounded from above by non-linear CCA. This allows a fundamental limit for our ability to Gaussianize arbitrary data-sets and solve complex problems by linear methods.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The problem of extracting the relevant aspects of complex data is a long standing staple in statistics and machine learning. The Information Bottleneck (IB) method, presented by

Tishby et al. (1999)

, approaches this problem by extending its classical notion to a broader information-theoretic setup. Specifically, given the joint distribution of a set of explanatory variables

and a target variable (which may also be of a higher dimension), the IB method strives to find the most compressed representation of , while preserving information about . Thus, implicitly regulates the compression of , so that its compressed representation maintains a level of relevance as explanatory variables with regards to . The IB problem is formally defined as follows:

 minP(T––|X––) I(X––;T––) (1) subject to I(T––;Y––)≥IY

where is the compressed representation of and the minimization is over the mapping of to

, defined by the conditional probability

. Here, is a constant parameter that sets the level of information to be preserved between the compressed representation and the target. Solving this problem for a range of values defines the IB curve – a continuous concave curve which demonstrates the optimal trade-off between representation complexity (regarded as ) and predictive power ().

The IB method showed to be a powerful tool in a variety of machine learning domains and related areas (Slonim and Tishby, 2000; Friedman et al., 2001; Sinkkonen and Kaski, 2002; Slonim et al., 2005; Hecht et al., 2009). It is also applicable to other fields such as neuroscience (Schneidman et al., 2001) and optimal control (Tishby and Polani, 2011). Recently, Tishby and Zaslavsky (2015) and Shwartz-Ziv and Tishby (2017)

demonstrated its abilities in analyzing and optimizing the performance of deep neural networks.

Generally speaking, solving the IB problem (1) for an arbitrary joint distribution is not a simple task. In the introduction of the IB method, Tishby et al. (1999) defined a set of self-consistent equations which formulate the necessary conditions for the optimal solution of (1). Further, they provide an iterative Arimoto– Blahut like algorithm which shows to converge to local optimum. In general, these equations do not hold a tractable solution and are usually approximated by different means (Slonim, 2002). An extensive attention was given to the simpler categorical setup, where the IB curve is somewhat easier to approximate. Here, and take values on a finite set and represents (soft and informative) clusters of (REF). Naturally, the IB problem also applies for continuous variables. In this case, approximating the solution to the self-consistent equations is even more involved. A special exception is the Gaussian case, where and

are assumed to follow a jointly normal distribution and the

Gaussian IB

problem (GIB) is analytically solved by linear projections to the canonical correlation vector space

(Chechik et al., 2005)

. However, evaluating the IB curve for arbitrary continuous random variables is still considered a highly complicated task where most attempts focus on approximating or bounding it

(Rey and Roth, 2012; Chalk et al., 2016). A detailed discussion regarding currently known methods is provided in the following section.

In this work we present a novel Gaussian lower bound to the IB curve, which applies to all types of random variables (continuous, nominal and categorical). Our bound strives to maximize the “jointly Gaussian part” of the data and apply the analytical GIB to it. Specifically, we seek for two transformations, and so that and are highly correlated and “as jointly Gaussian as possible”. In addition, we ask that the transformations preserve as much information as possible between and . This way, we maximize the portion of the data that can be explained by linear means, , specifically using the GIB.

In fact, our results goes beyond the specific context of the information bottleneck. In this work we tackle the fundamental question of linearizing non-linear problems. Specifically, we ask ourselves whether it is possible to “push” all the information in the data to its second moments. This problem has received a great amount of attention over the years. For example,

Schneidman et al. (2006)

discuss this problem in the context of neural networks; they provide preliminary evidence that in the vertebrate retina, weak pairwise correlations may describe the collective (non-linear) behavior of neurons. In this work, we provide both fundamental limits and constructive algorithms for maximizing the part of the data that can be optimally analyzed by linear means. This basic property holds both theoretical and practical implications, as it defines the maximal portion which allows favorable analytical properties in many applications. Interestingly, we show that even if we allow the transformations

and to increase the dimensions of and , our ability to linearize the problem is still limited, and governed by the non-linear canonical correlations (Breiman and Friedman, 1985) of the original variables.

Our suggested approach may also be viewed as an extension of the Shannon lower bound (Cover and Thomas, 2012), for evaluating the mutual information. In his seminal work, Shannon provided an analytical Gaussian lower bound for the generally involved rate distortion function. He showed that the rate distortion function can be bounded from below by where is the compressed source, is its corresponding deferential entropy and is the differential entropy of an independent Gaussian noise with a maximal distortion level . This bound holds some favorable theoretical properties (Cover and Thomas, 2012) and serves as one of the most basic tools for approximating the rate distortion function to this very day. In this work, we use a somewhat similar rationale and derive a Gaussian lower bound for the mutual information of two random variables, which holds an analytical expression just like the Shannon’s bound. We then extend our result to the entire IB curve and discuss its theoretical properties and practical considerations.

The rest of this manuscript is organized as follows: In Section 2 we review previous work on the IB method for continuous random variables. Section 3 defines our suggested lower bound and formulates it as an optimization problem. We then propose a set of solutions and bounds to this problem, as we distinguish between the easier univariate case (Section 4) and the more involved multivariate case (Section 5). Finally, in Section 6 we extend our results to the entire IB curve.

## 2 Related work

As discussed in the previous section, solving the IB problem for continuous variables is in general a difficult task. A special exception is where and follow a jointly normal distribution. Chechik et al. (2005) show that in this case, the Gaussian IB problem (GIB) is solved by a noisy linear projection, . Specifically, assume that and are of dimensions and respectively and denote the covariance matrix of as while the conditional covariance matrix of is . Then, is a Gaussian random vector with a zero mean and a unit covariance matrix, independent of . The matrix is defined as follows:

 A=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩[0T;…;0T]0≤β≤βC1[a1vT1;0T;…;0T]βC1≤β≤βC2[a1vT1;a2vT2;0T;…;0T]βC2≤β≤βC3⋮⋮⎫⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎬⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎭. (2)

where

are the left eigenvectors of

, sorted by their corresponding ascending eigenvalues

, are the critical values, are defined by . and is an row vector of zeros. Notice that the critical values correspond to the slope of the IB curve, as they represent the Lagrange multipliers of the IB problem.

Unfortunately, this solution is limited to jointly Gaussian random variables. In fact, it can be shown that a closed form analytical solution (for continuous random variables) may only exist under quite restrictive assumptions on the underlaying distribution. Moreover, as the IB curve is so challenging to evaluate in the general case, most known attempts either focus on extending the GIB to other distributions under varying assumptions, or approximate the IB curve by different means.

Rey and Roth (2012) reformulate the IB problem in terms of probabilistic copulas. They show that under a Gaussian copula assumption, an analytical solution (which extends the GIB) applies to joint distributions with arbitrary marginals. This formulation provides several interesting insights on the IB problem. However, its practical implications are quite limited as the Gaussian copula assumption is very restrictive. In fact, it implicitly requires that the joint distribution would maintain a Gaussian structure. As we show in the following sections, this assumption makes the problem significantly easier and does not hold in general.

Chalk et al. (2016)

provide a lower bound to the IB curve by using an approximate variational scheme, analogous to variational expectation maximization. Their method relaxes the IB problem by restricting the class of distributions,

and

to a set of parametric models. This way, the relaxed IB problem may be solved in EM-like steps; their suggested algorithm iteratively maximize the objective over the mappings (for fixed parameters) and then maximize the set of parameters, for fixed mappings.

Chalk et al. (2016) show that this method can be effectively applied to “sparse” data in which and are generated by sparsely occurring latent features. However, in the general case, their suggested bound strongly depends on the assumption that the chosen parametric models provide reasonable approximations for the optimal distributions. This assumption is obviously quite restrictive. Moreover, it is usually difficult to validate, as the optimal distributions are unknown. Kolchinsky et al. (2017) take a somewhat similar approach, as they suggest a variational upper bound to the IB curve. The main difference between the two methods relies on the variational approximation of objective, . However, they are both prune to the same difficulties stated above.

Alemi et al. (2016)

propose an additional variational inference method to construct a lower bound to the IB curve. Here, they re-parameterize the IB problem followed by Monte Carlo sampling, to get an unbiased estimate of the IB objective gradient. This allows them to apply deep neural networks in order to parameterize any given distribution. However, this method fails to provide guarantees on the obtained bound, as a result of the suggested stochastic gradient decent optimization approach.

Achille and Soatto (2016) relax the bottleneck problem by introducing an additional total correlation (TC) regularization term that strives to maximize the independence among the components of the representation . They show that under the assumption that the Lagrange multipliers of the TC and MI constraints are identical, the relaxed problem may be solved by adding auxiliary variables. However, this assumption is usually invalid, and the suggested method fails to provide guarantees on difference between the obtained objective and original IB formulation.

In this work we suggest a novel lower bound to the IB curve which provides both theoretical and practical guarantees. In addition, we introduce upper and lower bounds for our suggested solution that are very easy to attain. This way we allow immediate benchmarks to the IB curve using common off-shelf methods.

## 3 Problem formulation

Throughout this manuscript we use the following standard notation: underlines denote vector quantities, where their respective components are written without underlines but with index. For example, the components of the -dimensional vector are . Random variables are denoted with capital letters while their realizations are denoted with the respective lower-case letters. The mutual information of two random variables is defined as where is the differential entropy of and

is its probability density function.

We begin by introducing a Gaussian lower bound to the mutual information . We then extend our result to the entire IB curve.

### 3.1 Problem statement

Let

be two multivariate random vectors with a joint cumulative distribution function (CDF)

and mutual information . In the following sections we focus on bounding from below with an analytical expression. Let and be two transformations of and , respectively. Assume that and are separately normal distributed. This means that and but the vector is not necessarily normal distributed. This allows us to derive the following fundamental inequality

 I(X––,Y––)≥ I(U––,V––)=h(U––)+h(V––)−h(U––,V––)≥ (3) h(U––)+h(V––)−h(U––jg,V––jg)=12log(∣∣C[U––,V––]∣∣|CU––||CV––|)

where the first inequality follows from the Data Processing lemma (Cover and Thomas, 2012) and the second inequality follows from being jointly Gaussian (jg) distributed with the same covariance matrix as , , so that (Cover and Thomas, 2012). Notice that (3) can also be derived from an information geometry (IG) view point, as shown by Cardoso (2003).

Equality is attained in (3) iff (no information is lost in the transformation) and , are jointly normally distributed. In other words, in order to preserve all the information we must find and that capture all the mutual information, and at the same time make and jointly normal. This is obviously a complicated task as and only operate on and separately. Therefore, we are interested in maximizing this lower bound as much as possible:

 maxϕ,ψ log(∣∣C[U––,V––]∣∣|CU––||CV––|) (4) subject to U––=ϕ(X––)∼N(0,CU––) V––=ψ(Y––)∼N(0,CV––)

In other words, we would like to maximize Cardoso (2003) IG bound by applying two transformations, and , to the original variables. This would allow us to achieve a tighter result.

Notice that our objective is invariant to the means of so they are chosen to be zero. In addition, it is easy to show that our objective is invariant to linear scaling of . This means we can equivalently assume that are identity covariance matrices.As shown by Kay (1992) and others (Klami and Kaski, 2005; Chechik et al., 2005), maximizing the objective of (4) is equivalent to maximizing the canonical correlations, . Therefore, our problem can be written as

 maxϕ,ψ k∑i=1E(U––iV––i) (5) subject to U––=ϕ(X––)∼N(0,I) V––=ψ(Y––)∼N(0,I)

where . This problem may also be viewed as a variant of the well-known CCA problem (Hotelling, 1936), where we optimize over nonlinear transformations and , and impose additional normality constraints. As in CCA, this problem can be solved iteratively by gradually finding the the optimal canonical components in each step (subject to the normality constraint), while maintaining orthogonality with the components that were previously found. For simplicity of the presentation we begin by solving (5) in the univariate (-D) case. Then, we generalize to the multivariate case. In each of these setups we present a solution to the problem, followed by simpler upper and lower bounds.

## 4 The univariate case

In the univariate case we assume that . We would like to find such that

 maxϕ,ψ ρ=E(U,V) (6) subject to U=ϕ(X)∼N(0,1) V=ψ(Y)∼N(0,1)

As a first step towards this goal, let us relax our problem by replacing the normality constraint with simpler second order statistics constraints,

 maxϕ,ψ ρ=E(U,V) (7) subject to U=ϕ(X),E(U)=0,E(U2)=1 V=ψ(Y),E(V)=0,E(V2)=1

As mentioned above, this problem is a non-linear extension of CCA, which traces back to early work by Lancaster (1963). As this problem is also a relaxed version of our original task (6), it may serve us as an upper bound. This means that the optimum of (7), denoted as , necessarily bound from above , the optimum of (6).

### 4.1 Alternation Conditional Expectation (ACE)

Breiman and Friedman (1985) show that the optimal solution to (7) is achieved by a simple alternating conditional expectation procedure, named ACE. Assume that is fixed, known and satistfies the constraints. Then, we optimize (7) only over and by Cauchy-Schwarz inequality, we have that

 E(ϕ(X)ψ(Y))=Ex(ϕ(X)E(ψ(Y)|X))≤√var(ϕ(X))√var(E(ψ(Y)|X))

with equality iff . Therefore, choosing the constant

to satisfy the unit variance constraint we achieve

. In the same manner we may fix and attain . These coupled equations are in fact necessary conditions for the optimality of and , leading to an alternating procedure in which at each step we fix one transformation and optimize the other. Breiman and Friedman (1985) prove that this procedure convergences to the global optimum using Hilbert space algebra. They show that the transformations and may be represented in a zero-mean and finite variance Hilbert space, while the conditional expectation projection is linear, closed, and shown to be self-adjoint and compact under mild assumptions. Then, the coupled equations may be formulate as an eigen problem in the Hilbert space, for which there exists a unique and optimal solution.

The following lemma defines a strict connection between the non-linear canonical correlations and the Gaussinized IB problem.

Let be the solution to (7). If , then there are no transformations such that and are jointly normally distributed and preserve all of the mutual information, .

Let be the solution to (6). As mentioned above, . Therefore, . This means that the inequality (3) cannot be achieved with equality. Hence, there are no transformations and so that and are jointly normal and preserve all of the mutual information, . Lemma 4.1 suggests that if the optimal transformations of the relaxed problem (which can be obtained by ACE) fails to capture all the mutual information between and , then there are no transformations that can project and onto jointly normal variables without losing information. Moreover, notice that the maximal level of correlation cannot be further increased, even if we allow and to reside in higher spaces. This means that Lemma 4.1 holds for any and , such that .

### 4.2 Alternating Gaussinized Conditional Expectations (AGCE)

Let us go back to our original problem, which strives to maximize the correlation between and , subject to marginal normality constraints (6). Here we follow Breiman and Friedman (1985), and suggest an alternating optimization procedure.

Let us fix and optimize (6) with respect to . As before, we can write the correlation objective as . Since is constrained to be equal to , while is fixed, maximizing is equivalent to minimizing . For simplicity, denote . Then, our optimization problem can be reformulated as

 minϕ E(ϕ(¯X)−¯X)2 (8) subject to ¯X∼F¯X ϕ(¯X)∼N(0,1)

where is the (fixed) CDF of . Notice that is necessarily a function of alone (as opposed to ), for simple optimization considerations. Assuming that and are two separable metric spaces such that any probability measure on (or ) is a Radon measure (i.e. they are Radon spaces), then (8) is simply an optimal transportation problem (Monge, 1781) with a strictly convex cost function (mean square error). We refer to that minimizes (8) as the optimal map.

The optimal transportation problem was presented by Monge (1781) and has generated an important branch of mathematics. The problem originally studied by Monge was the following: assume we are given a pile of sand (in ) and a hole that we have to completely fill up with that sand. Clearly the pile and the hole must have the same volume and different ways of moving the sand will give different costs of the operation. Monge wanted to minimize the cost of this operation. Formally, the optimal transportation problem is defined as

 inf{∫¯Xc(¯X,ϕ(¯X))dμ(¯X)∣∣ϕ∗(μ)=ν}

where and are the probability measures of and respectively, is some cost function and denotes the push forward of by the map . Clearly, (8) is a special case of the optimal transportation problem where the , is a standard normal distribution and the cost function is the euclidean distance between the two.

Assume that has finite moments for and a strictly continuous CDF, (that is is a strictly continuous random variable). Then, Rachev and Rüschendorf (1998) show that the optimal map (which minimizes (8)) is exactly where is the inverse CDF of a standard normal distribution. As shown by Rachev and Rüschendorf (1998), the optimal map is unique and achieves

 E((ϕ∗(¯X)−¯X)2)=∫10(F¯X(s)−ΦN(s))2ds. (9)

Notice that the optimal map may be generalized to the multivariate case, as discussed in the next Section. The solution to the optimal transportation problem is in fact the “optimal projection” of our problem (8). Further, it allows us to quantify how much we lose from imposing the marginal normality constraint, compared with ACE’s optimal projection.

Notice that the optimal map, , is simply marginal Gaussianization of : applying

’s CDF to itself results in a uniformly distributed random variable, while

shapes this uniform distribution into a standard normal. In other words, while the optimal projection of on is its conditional expectation, the optimal projection under a normality constraint is simply a Gaussianization of the conditional expectation. The uniqueness of the optimal map leads to the following necessary conditions for an optimal solution to (6),

 ϕ(X)=Φ−1N∘FE(ψ(Y)|X)(E(ψ(Y)|X)) (10) ψ(Y)=Φ−1N∘FE(ϕ(X)|Y)(E(ϕ(X)|Y))

As in ACE, these necessary conditions imply an alternating projection algorithm, namely, the Alternating Gaussinized Conditional Expectation (AGCE). Here, we begin by randomly choosing a transformation that only satisfies the normality constraint . Then, we iterate by fixing one of the transformation while optimizing the other, according to (10). We terminate once fails to increase, which means that we converged to a set of transformations that satisfy the necessary conditions for optimal solution. Notice that in every step of our procedure, we may either:

1. Increase our objective value, as a result of the optimal map for (8).

2. Maintain with the same objective value and with the same transformation that was found in of the previous iteration, as we converged to (10).

This means that our alternating method generates a monotonically increasing sequence of objective values. Moreover, as shown in Section 4, this sequence is bounded from above by the optimal correlation given by ACE. Therefore, according to the monotone convergence theorem, our suggested method converges to a local optimum.

Unfortunately, as opposed to ACE, our projection operator is not linear and we cannot claim for global optimality. We see that for different random initializations we converge to (a limited number) of local optima. Yet, AGCE provides an effective tool for finding local maximizers of (4), which together with MCMC (Gilks, 2005) initializations (or any other random search mechanisms) is capable of finding the global optimum.

### 4.3 Off-shelf lower bound

Although the AGCE method provides a (locally) optimal solution to (4), we would still like to consider a simpler “off-shelf” mechanism that is easier to implement and gives a lower bound to the best we can hope for. Here, we tackle (4) in two phases. In the first phase we would like to maximize the correlation objective, , subject to the relaxed second order statistics constraints (as defined in (7)). Then, we enforce the marginal normality constraints by simply applying separate Gaussianization to the outcome of the first phase. In other words, we first apply ACE to increase our objective as much as possible, and then separately Gaussianize the results to meet the normality constraints, hoping this process does not reduce our objective “too much”. Notice that in this univariate case, separate Gaussianization is achieved according to Theorem 4.3: Let be any random variable and be statistically independent of it. In order to shape to a normal distribution the following applies:

1. Assume is a non-atomic distribution ( is strictly increasing) then

2. Assume

is discrete or a mixture probability distribution then

The proof of this theorem can be located in Appendix of (Shayevitz and Feder, 2011). Theorem 4.3 implies that if is strictly continuous then we may achieve a normal distribution by applying to it, as discussed in the previous section. Otherwise, we shall handle its CDF’s singularity points by randomly scattering them in a uniform manner, followed by applying to the random variable we achieved. Notice that this process do not allow any flexibility in the Gaussianization process. However, we show that in the multivariate case (Section (5.3)) the equivalent process is quite flexible and allows us to control the correlation objective.

Further, notice that this lower bound is by no means a candidate for an optimal solution to (6), as it does not meet the necessary conditions described in (10). Yet, by finding both an upper and lower bounds (through ACE, and then separately Gaussianizing the result of ACE) we may immediately achieve the range in which the optimal solution necessarily resides. Assuming this range is not too large, one may settle for a sub-optimal solution without a need to apply AGCE at all.

### 4.4 Illustrative example

We now demonstrate our suggested methodology with a simple illustrative example. Let , and be three normally distributed random variables, all independent of each other. Let

be a Bernoulli distributed random variable with a parameter

, independent of and . Define as:

Then, is a balanced Gaussian mixture with parameters

 θy={μ1=0,σ21=1+ϵ2,μ2=μz,σ22=1}.

The joint probability density function of and is also a balanced two-dimensional Gaussian mixture with parameters

Let us further assume that is large enough, and is small enough, so that the overlap between the two Gaussian is negligible. For example, we set and . The correlation between and is easily shown to be . The mutual information between and is defined as

 I(X;Y)=h(X)+h(Y)−h(X;Y)

Since we assume that the Gaussians in the mixture practically do not overlap, we have that

 h(Y)= −∫fY(y)logfY(y)dy≈14log(2πe(1+ϵ2))+14log(2πe)+1 (11)

In the same manner,

 h(X,Y)= −∫fX,Y(x,y)logfX,Y(x,y)dxdy≈ (12) 14log((2πe)2|C1|)+14log((2πe)2|C2|)+1

Plugging and we have that

 I(X;Y)= h(X)+h(Y)−h(X;Y)≈1.66bits. (13)

The scatter plot on the left of Figure 1 illustrates independent draws of and , where the blue circles corresponds to the “correlated samples” () while the blue crosses are the “noise” ().

Before we proceed to apply our suggested methods, let us first examine two benchmark options for separate Gaussianization. As an immediate option, we may always apply separate Gaussianization, directly to and , denoted as and respectively. This corresponds to Cardoso (2003) information geometry bound. Since is already normally distributed we may set and only apply Gaussinization to . Let be the Gaussianization of . This means that

 Va=Φ−1N(FY(Y))=Φ−1N(ΦGM(θy)(Y))

where is the cumulative distribution function a Gaussian Mixture with the parameters described above. Therefore,

 ρua,va=E(XV)=12E(XΦ−1N(ΦGM(θy)(X+W))).

Although it is not possible to obtain a closed form solution to this expectation, it may be numerically evaluated quite easily, as and are independent. Assuming and we get that and our lower bound on the mutual information, as appears in (3), is . The middle scatter plot of Figure 1 presents this separate marginal Gaussianization of the previously drawn samples of and . Notice that the marginal Gaussianization is a monotonic transformation, so that the samples are not being shuffled and maintain the separation between the two parts of the mixture. While the red circles are now “half Gaussian”, the blue crosses are shaped in a curvy manner, so that their marginal distribution (projected on the axis) is also a “half Gaussian”, leading to a normal marginal distribution of . We notice that while the mutual information between and is bits, the lower bound attained by this naive Gaussianization approach is close to zero. This is obviously an unsatisfactory result.

A second benchmark alternative for separate Gaussianization is to take advantage of the Gaussian mixture properties. Since we assume that the two Gaussians of are practically separable, we may distinguish between observations from the two Gaussians. Therefore, we can simply reduce from the samples (the red circles), and normalize the observations of . This way the transformed becomes a Gaussian mixture of two co-centered standard Gaussians, and no further Gaussianization is necessary. For and , this leads to a correlation of

 ρub,vb=12E(1√1+ϵ2(X+W)X)=121√1+ϵ2=0.497 (14)

and a corresponding mutual information lower bound of . However, notice that the suggested transformation is not invertible and may cause a reduction in mutual information. Specifically, we now have that the joint distribution of and

follows a Gaussian mixture model with parameters:

Therefore,

 h(Ub,Vb)= −∫fUb,Vb(u,v)log(fUb,Vb(u,v))dudv= (15) −∫ϕGN(θub,vb)(u,v)logϕGN(θub,vb)(u,v)dudv≈3.1384bits

where is the probability density function of a Gaussian mixture with the parameters described above, and the last approximation step is due to numerical integration. This leads to bits.

To conclude, although the mutual information is reduced from bits to bits, the suggested bound increased quite dramatically, from bits to bits. The right plot of Figure 1 demonstrates this customized separate Gaussianization (as it only applies for this specific setup) to the previously sampled and . Again, we emphasis that this solution is not applicable in general, and is only feasible due to the specific nature of this Gaussian mixture model. Figure 1: Naive Univariate Gaussianization: Left: scatter of X and Y, as described in the text. Middle: naive separate Gaussianization to X and Y. Right: separate Gaussianization which considers the separable Gaussian Mixture model of X and Y, as described in the text.

Let us now turn to our suggested methods, as described in detail in the previous sections. We begin by applying the ACE procedure (Section 4.1), to attain an upper bound on our problem (6). Not surprisingly, ACE converges to a solution in which the samples of that are independent of (the ones that come from ) are set to zero, while the rest are normalized to achieve an unit variance. Therefore, the resulting correlation is . This results further implies that we can never find a Gaussianization procedure that will capture all the information between and , as bits, according to Lemma 4.1. The left scatter plot of Figure 2 demonstrates the outcome of the ACE procedure, applied to the drawn samples of and .

Next, we apply our suggested AGCE routine, described in Section 4.2. As discussed above, the AGCE only converges to a local optimum. Therefore, we initialize it with several random transformation (including the ACE solution that we just found). We notice that the number of convergence points are very limited and result in almost similar maxima. The middle scatter plot of Figure 4.2 shows the best result we achieve, leading to a correlation coefficient of and a lower bound on a corresponding Gaussian lower bound (3) of bits. This result demonstrates the power of our suggested approach, as it significantly improves the benchmarks, even compared with the that considers the separable Gaussian mixture nature of our samples.

Finally, we evaluate a lower bound for (6), as described in Section 4.3. Here, we simply apply separate Gaussianization to the outcome of the ACE procedure. This results in and a corresponding . The right scatter plot of Figure 2 shows the Gaussianized samples the we achieve. We notice that this lower bound is not significantly lower than AGCE, suggesting that in some case we may settle for this less involved method.

To conclude, our suggested solution surpasses the benchmarks quite easily, as we increase the lower bound from bits using the custom Gaussianization procedure to bits using our general solution. We notice that all of the discussed procedures result in a joint distribution that are quite far from normal. This is not surprising, since and were highly “non-normal” to begin with. Specifically, in all suggested procedures we loose information, compared with the original . However, our suggested solution minimizes this loss, and may be considered “more jointly normal” than others, in this regards. Figure 2: Our Suggested Univariate Gaussianization Schemes: Left: upper bound by ACE. Middle: (local) optimal solution by AGCE. Right: lower bound by separate Guassianization to ACE.

## 5 The multivariate case

Let us now consider the multivariate case where both and are random vectors with a joint CDF . One of the fundamental differences from the univariate case is that Gaussianizing each of these vectors (even separately) is not a simple task. In other words, finding a transformation such that is normally distributed may be theoretically straight-forward but practically involved.

For simplicity of the presentation, assume that is a two dimensional, strictly continuous, random vector. Then, Gaussianization may be achieved in two steps: first, apply marginal Gaussianization to , so that . Then, apply marginal Gaussianization on , conditioned on each possible realization of the previous component, . This results in a jointly normally distributed vector . While this procedure is theoretically simple, it is quite problematic to apply in practice, as it requires Gaussianizing each and every conditional CDF. This is obviously impossible, given a finite number of samples. Yet, it gives us a constructive method, assuming that all the CDF’s are known. In the following sections we shall present several alternatives for Gaussianization in finite sample size setup.

### 5.1 Upper bound by ACE

As in the univariate case, we begin our analysis by relaxing the normality constraints with softer second order statistics constraints. This leads to a straight forward multivariate generalization of the ACE procedure:

We begin by extracting the first canonical pair, which satisfies and . As in the univariate case, is a normalization coefficient (the square root of the variance of the conditional expectation), and the optimization is done by alternating projections. Then, we shall extract the second pair of canonical components, subject to an orthogonality constraint with the first pair. It is easy to show that if is orthogonal to , then is orthogonal to , and obviously maximizes the correlation with . Therefore, we may extract the second canonical pair by first randomly assigning a zero-mean and unit variance that is also orthogonal to (by Gram-Schmidt procedure, for example), followed by alternating conditional expectations with respect to and , in the same manner as we did with the first pair. We continue this way for the rest of the canonical pairs. As in the univariate case, convergence to a global maximum is guaranteed from the same Hilbert space arguments. As before, the multivariate ACE sets an upper bound to (5) as it maximizes a relaxed version of this problem.

Let be the outcome of multivariate ACE procedure (the canonical vectors). Assuming that , there are no transformations such that and follow a jointly normal distribution and preserve all of the mutual information, . The proof of Lemma 5.1 follows exactly the proof of Lemma 4.1. Here again, the multivariate ACE objective, , cannot be further increased by artificially inflating the dimension of the problem. Therefore, Lemma 5.1 holds for any and , such that .

### 5.2 multivariate AGCE

As with the multivariate ACE, we propose a generalized multivariate procedure for AGCE. We begin by extracting the first pair, in the same manner as we did in the univariate case. That is, we find a pair and that satisfies

 U1=Φ−1N∘FE(U1|X––)(E(U1|X––)) (16) V1=Φ−1N∘FE(V1|Y––)(E(V1|Y––))

by applying the alternating optimization scheme. As we proceed to the second pair, we require that is both orthogonal and jointly normally distributed with (same goes for with respect to ). This means that the second pair needs not only to be orthogonal, but also statistically independent with the first pair. In other words, assuming is fixed, our basic projection step is

 maxϕ2 E(ϕ2(X––)V2) (17) subject to ϕ2(X––)∼N(0,1) ϕ2(X––)\rotatebox[origin=c]90.0$⊨$ϕ1(X––)

Let us denote a subspace that is statistically independent of . Then, the problem of maximizing subject to is again solved by the optimal map, . Therefore, the remaining task is to find the “best” subspace , so that is maximal, when plugging the optimal map.

Let be the value (realization) of . Let be a subspace of , independent of . If is an invertible function with respect to given , then is an optimal subspace for maximizing subject to .

Assume there exists a different subspace so that

 maxϕ′2E(ϕ′2(~X––′)V2)>maxϕ2E(ϕ2(~X––)V2)

subject to the normality constraint. Since is invertible we have that . Therefore, . Plugging this to the inequality above leads to

 maxϕ′2E(ϕ′2(f(~X––,u1))V2)>maxϕ2E(ϕ2(~X––)V2)

which obviously contradicts the optimality of maximization over . Therefore, we are left with finding that is a subspace of , independent of and invertible with respect to given . For simplicity of th presentation, let us first assume that is univariate. Then, the function is independent of (as it holds the same (uniform) distribution, regardless to the value of ), and invertible given (assuming that the conditional CDF’s are continuous for every ). Going back to the multivariate , we may follow the same rationale by choosing a single -dimensional distribution that all the conditional CDF’s, will be transformed to. For simplicity we choose a -dimensional uniform distribution, denoted by its CDF as . Then, , where refers to a mapping that pushes forward the distribution into , given x. Specifically, if and are the corresponding density functions of the (absolutely continuous) CDF’s and

respectively, then we know from basic probability theory that the push forward transformation

satisfies

 p(w)=q(S(w))|JS(S(w))|

where is the Jacoby operator of the map .

To conclude, in order to construct that is independent of and invertible given , we need to push forward all the conditional CDF’s into a predefined distribution (say, uniform). Then, the optimal map that maximizes subject to