 # On Conditional Correlations

The Pearson correlation, correlation ratio, and maximal correlation have been well studied in the literature. In this paper, we study the conditional versions of these quantities. We extend the most important properties of the unconditional versions to the conditional versions, and also derive some new properties.


## 1 Introduction

In the literature, there are various measures available to quantify the strength of dependence between two random variables, including the Pearson correlation coefficient, the correlation ratio, and the maximal correlation coefficient. The Pearson correlation coefficient is a well-known measure that quantifies the linear dependence between two real-valued random variables. For real-valued random variables $X$ and $Y$, it is defined as

$$\rho(X;Y)=\begin{cases}\dfrac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)}\sqrt{\mathrm{var}(Y)}}, & \mathrm{var}(X)\,\mathrm{var}(Y)>0,\\[2mm] 0, & \mathrm{var}(X)\,\mathrm{var}(Y)=0.\end{cases}$$

The correlation ratio was introduced by Pearson (see e.g. ), and studied by Rényi [15, 14]. For a real-valued random variable $X$ and an arbitrary random variable $Y$, the correlation ratio of $X$ on $Y$ is defined by

$$\theta(X;Y)=\sup_{g}\rho(X;g(Y)),$$

where the supremum is taken over all Borel-measurable real-valued functions such that . It was shown that

$$\theta(X;Y)=\sqrt{\frac{\mathrm{var}(\mathbb{E}[X|Y])}{\mathrm{var}(X)}}=\sqrt{1-\frac{\mathbb{E}[\mathrm{var}(X|Y)]}{\mathrm{var}(X)}}.$$
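
The two definitions above can be compared on a small example. The following is a minimal sketch, assuming a finite joint pmf stored as a Python dictionary; the function name `pearson_and_ratio` and the data layout are illustrative choices, not part of the paper. It evaluates $\rho(X;Y)$ from its definition and $\theta(X;Y)$ from the variance-ratio formula, for a pair in which $X$ is a nonlinear function of $Y$.

```python
from math import sqrt


def pearson_and_ratio(pmf):
    """pmf: dict mapping (x, y) -> probability, with real-valued x and y.

    Returns (rho, theta): the Pearson correlation rho(X;Y) and the
    correlation ratio theta(X;Y) = sqrt(var(E[X|Y]) / var(X)).
    Both are set to 0 when the relevant variance vanishes.
    """
    ex = sum(p * x for (x, y), p in pmf.items())
    ey = sum(p * y for (x, y), p in pmf.items())
    var_x = sum(p * (x - ex) ** 2 for (x, y), p in pmf.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in pmf.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in pmf.items())

    # Marginal of Y and the unnormalised conditional means sum_x P(x, y) * x.
    p_y, s_y = {}, {}
    for (x, y), p in pmf.items():
        p_y[y] = p_y.get(y, 0.0) + p
        s_y[y] = s_y.get(y, 0.0) + p * x
    # var(E[X|Y]) = sum_y P(y) * (E[X|Y=y] - E[X])^2
    var_cond_mean = sum(
        q * (s_y[y] / q - ex) ** 2 for y, q in p_y.items() if q > 0
    )

    rho = cov / sqrt(var_x * var_y) if var_x * var_y > 0 else 0.0
    theta = sqrt(var_cond_mean / var_x) if var_x > 0 else 0.0
    return rho, theta


# Y uniform on {-1, 0, 1} and X = Y^2: no linear dependence,
# but X is completely determined by Y.
pmf = {(1, -1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}
rho, theta = pearson_and_ratio(pmf)
```

In this example the Pearson correlation vanishes while the correlation ratio equals one, which is the kind of nonlinear dependence the correlation ratio is designed to capture.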

Another related dependence measure is the Hirschfeld-Gebelein-Rényi maximal correlation (or simply maximal correlation), which measures the maximum possible (Pearson) correlation between square integrable real-valued random variables generated by each of two random variables. For two arbitrary random variables $X$ and $Y$, the maximal correlation of $X$ and $Y$ is defined by

$$\rho_m(X;Y)=\sup_{f,g}\rho(f(X);g(Y)),$$

where the supremum is taken over all Borel-measurable real-valued functions such that . This measure was first introduced by Hirschfeld  and Gebelein , then studied by Rényi , and recently it has been exploited to study some interesting problems in information theory, such as measuring non-local correlations , maximal correlation secrecy , deriving a converse result for distributed communication , etc. Furthermore, the maximal correlation also indicates the existence of Gács-Körner’s or Wyner’s common information [7, 17].

## 2 Definition

Let

be a probability space. Let

be a real-valued random vector, where

denotes the Borel $\sigma$-algebra on . For a random variable (or random vector) , we denote the probability measure (a.k.a. distribution) as . If is discrete, then we use to denote the probability mass function (pmf). If is absolutely continuous (the distribution is absolutely continuous with respect to the Lebesgue measure), then we use

to denote the probability density function (pdf).

In the following, we define several conditional correlations, including the conditional (Pearson) correlation, conditional correlation ratio, and conditional maximal correlation.

###### Definition 1.

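The conditional (Pearson) correlation of $X$ and $Y$ given $U$ is defined by

$$\rho(X;Y|U)=\begin{cases}\dfrac{\mathbb{E}[\mathrm{cov}(X,Y|U)]}{\sqrt{\mathbb{E}[\mathrm{var}(X|U)]}\sqrt{\mathbb{E}[\mathrm{var}(Y|U)]}}, & \mathbb{E}[\mathrm{var}(X|U)]\,\mathbb{E}[\mathrm{var}(Y|U)]>0,\\[2mm] 0, & \mathbb{E}[\mathrm{var}(X|U)]\,\mathbb{E}[\mathrm{var}(Y|U)]=0.\end{cases}$$

Here $U$ does not need to be real-valued, but for brevity we assume it is. Similarly, in the following, the variables other than $X$ need not be real-valued in the definition of the conditional correlation ratio, and none of the variables need be real-valued in the definition of the conditional maximal correlation.
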
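
Definition 1 averages the conditional covariance and the conditional variances over $U$ before taking the ratio. The sketch below evaluates it for a finite joint pmf; the function name `conditional_pearson` and the dictionary layout are assumptions made for illustration only.

```python
from math import sqrt


def conditional_pearson(pmf):
    """pmf: dict mapping (x, y, u) -> probability, with real-valued x and y.

    Returns rho(X;Y|U) = E[cov(X,Y|U)] / sqrt(E[var(X|U)] * E[var(Y|U)]),
    with the convention that the value is 0 when the denominator is 0.
    """
    # Group the joint pmf by the value of U.
    by_u = {}
    for (x, y, u), p in pmf.items():
        by_u.setdefault(u, []).append((x, y, p))

    e_cov = e_var_x = e_var_y = 0.0
    for rows in by_u.values():
        p_u = sum(p for _, _, p in rows)
        if p_u <= 0:
            continue
        mean_x = sum(p * x for x, _, p in rows) / p_u  # E[X | U = u]
        mean_y = sum(p * y for _, y, p in rows) / p_u  # E[Y | U = u]
        # Summing P(x, y, u) * (...) over the rows gives P_U(u) times the
        # conditional moment, i.e. this value of u's share of the expectation.
        e_cov += sum(p * (x - mean_x) * (y - mean_y) for x, y, p in rows)
        e_var_x += sum(p * (x - mean_x) ** 2 for x, _, p in rows)
        e_var_y += sum(p * (y - mean_y) ** 2 for _, y, p in rows)

    denom = e_var_x * e_var_y
    return e_cov / sqrt(denom) if denom > 0 else 0.0


# Given U = 0 we have X = Y; given U = 1 we have X = 1 - Y.
pmf_mixed = {(0, 0, 0): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25}
rho_mixed = conditional_pearson(pmf_mixed)

# X = Y for every value of U.
pmf_equal = {(0, 0, 0): 0.25, (1, 1, 0): 0.25, (2, 2, 1): 0.25, (3, 3, 1): 0.25}
rho_equal = conditional_pearson(pmf_equal)
```

In the first example the correlations given $U=0$ and given $U=1$ are $+1$ and $-1$, yet the conditional correlation is $0$ because the conditional covariances cancel in the numerator. In the second example the conditional correlation is $1$.
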
###### Definition 2.

The conditional correlation ratio of $X$ on $Y$ given $U$ is defined by

$$\theta(X;Y|U)=\sup_{g}\rho(X;g(Y,U)|U),$$

where the supremum is taken over all Borel-measurable real-valued functions such that .

###### Definition 3.

The conditional maximal correlation of $X$ and $Y$ given $U$ is defined by

$$\rho_m(X;Y|U)=\sup_{f,g}\rho(f(X,U);g(Y,U)|U),$$

where the supremum is taken over all Borel-measurable real-valued functions such that , .

###### Remark 1.

If $U$ is degenerate, then these three conditional correlations reduce to the unconditional versions.

###### Remark 2.

Note that $\rho(X;Y|U)=\rho(Y;X|U)$ and $\rho_m(X;Y|U)=\rho_m(Y;X|U)$, but in general $\theta(X;Y|U)\neq\theta(Y;X|U)$. That is, the conditional correlation and conditional maximal correlation are symmetric, but the conditional correlation ratio is not.

By the definitions, it is easy to verify that

$$\rho_m(X;Y|U)=\sup_{f}\theta(f(X,U);Y|U), \qquad (1)$$

where the supremum is taken over all Borel-measurable real-valued functions such that .

Note that the unconditional versions of correlation coefficient, correlation ratio, and maximal correlation were well studied in the literature; see [15, 14]. The conditional version of maximal correlation was first introduced by Ardestanizadeh et al. . Later Beigi and Gohari  used it to study the problem of non-local correlations. In this paper, we study these conditional correlations, especially the conditional maximal correlation, and derive some useful properties. Furthermore, to state our results clearly, we also need to define event conditional correlations as follows.

###### Definition 4.

Given an event , denote the conditional distribution of given as . Assume is a pair of random variables satisfying . Then we define as the event conditional correlations of and given , where and denotes the corresponding unconditional correlation of and .

Obviously, event conditional correlations are special cases of corresponding conditional correlations. Moreover, if the distribution of is the same as the conditional distribution of given , then the unconditional correlations of respectively equal the corresponding event conditional correlations of given , i.e., where . If the distribution of satisfies for some , then the conditional correlations of given respectively equal the corresponding event conditional correlations of given , i.e., where .

## 3 Properties

### 3.1 Basic Properties: Other Characterizations, Continuity, and Concavity

In this subsection, we provide other characterizations for the conditional correlation ratio and conditional maximal correlation, and then study continuity (or discontinuity) and concavity properties of the conditional maximal correlation. First by the definitions, we have the following basic properties.

###### Theorem 1.

For any random variables $X, Y, Z, U$, the following inequalities hold.

$$\theta(X;Y,Z|U)\ge\theta(X;Y|U);\qquad \rho_m(X;Y,Z|U)\ge\rho_m(X;Y|U).$$

Next we characterize the conditional correlation ratio and conditional maximal correlation by ratios of variances.

###### Theorem 2.

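(Characterization by the ratio of variances). For any random variables $X, Y, U$, the following properties hold.

$$
\begin{aligned}
\theta(X;Y|U) &= \sqrt{\frac{\mathbb{E}[\mathrm{var}(\mathbb{E}[X|Y,U]\,|\,U)]}{\mathbb{E}[\mathrm{var}(X|U)]}} = \sqrt{1-\frac{\mathbb{E}[\mathrm{var}(X|Y,U)]}{\mathbb{E}[\mathrm{var}(X|U)]}}; && (2)\\
\rho_m(X;Y|U) &= \sup_{f}\sqrt{\frac{\mathbb{E}[\mathrm{var}(\mathbb{E}[f(X,U)|Y,U]\,|\,U)]}{\mathbb{E}[\mathrm{var}(f(X,U)|U)]}} = \sup_{f}\sqrt{1-\frac{\mathbb{E}[\mathrm{var}(f(X,U)|Y,U)]}{\mathbb{E}[\mathrm{var}(f(X,U)|U)]}}. && (3)
\end{aligned}
$$

###### Remark 3.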

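The correlation ratio is also closely related to the minimum mean square error (MMSE). The optimal MMSE estimator of $X$ given $(Y,U)$ is $\mathbb{E}[X|Y,U]$, hence the MMSE for estimating $X$ given $(Y,U)$ is $\mathbb{E}[\mathrm{var}(X|Y,U)]=\left(1-\theta^{2}(X;Y|U)\right)\mathbb{E}[\mathrm{var}(X|U)]$.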

The unconditional version of Theorem 2 was proven by Rényi . Theorem 2 can be proven by a proof similar to that in , and hence omitted here. Next we characterize conditional correlations by event conditional versions.
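
The two expressions in (2) agree because of the law of total variance, and this can be checked numerically on a finite example. The sketch below is illustrative only: the function name `conditional_correlation_ratio`, the dictionary layout, and the example pmf are assumptions, not taken from the paper. It also returns $\mathbb{E}[\mathrm{var}(X|Y,U)]$, the MMSE of estimating $X$ from $(Y,U)$, which by (2) equals $(1-\theta^2(X;Y|U))\,\mathbb{E}[\mathrm{var}(X|U)]$.

```python
from math import sqrt


def conditional_correlation_ratio(pmf):
    """pmf: dict mapping (x, y, u) -> probability, with real-valued x.

    Returns (theta_a, theta_b, mmse, e_var_x_u), where theta_a and theta_b
    are the two expressions in (2), mmse = E[var(X|Y,U)] and
    e_var_x_u = E[var(X|U)].
    """
    p_u, s_u, p_yu, s_yu = {}, {}, {}, {}
    for (x, y, u), p in pmf.items():
        p_u[u] = p_u.get(u, 0.0) + p
        s_u[u] = s_u.get(u, 0.0) + p * x            # sum of P(x,y,u) * x per u
        p_yu[(y, u)] = p_yu.get((y, u), 0.0) + p
        s_yu[(y, u)] = s_yu.get((y, u), 0.0) + p * x  # same, per (y, u)

    e_var_x_u = 0.0  # E[var(X|U)]
    mmse = 0.0       # E[var(X|Y,U)]
    for (x, y, u), p in pmf.items():
        if p > 0:
            e_var_x_u += p * (x - s_u[u] / p_u[u]) ** 2
            mmse += p * (x - s_yu[(y, u)] / p_yu[(y, u)]) ** 2

    # E[var(E[X|Y,U] | U)] = sum_{y,u} P(y,u) * (E[X|y,u] - E[X|u])^2
    e_var_mean = sum(
        q * (s_yu[(y, u)] / q - s_u[u] / p_u[u]) ** 2
        for (y, u), q in p_yu.items() if q > 0
    )

    if e_var_x_u <= 0:
        return 0.0, 0.0, mmse, e_var_x_u
    theta_a = sqrt(e_var_mean / e_var_x_u)
    theta_b = sqrt(max(0.0, 1.0 - mmse / e_var_x_u))
    return theta_a, theta_b, mmse, e_var_x_u


# An arbitrary joint pmf on (x, y, u), chosen only for illustration.
pmf = {
    (0, 0, 0): 0.2, (1, 0, 0): 0.1, (1, 1, 0): 0.2,
    (0, 0, 1): 0.1, (2, 1, 1): 0.3, (1, 1, 1): 0.1,
}
theta_a, theta_b, mmse, e_var_x_u = conditional_correlation_ratio(pmf)
```

The two values `theta_a` and `theta_b` coincide up to rounding, as (2) predicts.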

###### Theorem 3.

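(Characterization by event conditional correlations). For any random variables $X, Y, U$,

$$
\begin{aligned}
\rho(X;Y|U) &\le \operatorname*{ess\,sup}_{u}\rho(X;Y|U=u), && (4)\\
\operatorname*{ess\,inf}_{u}\theta(X;Y|U=u) \le \theta(X;Y|U) &\le \operatorname*{ess\,sup}_{u}\theta(X;Y|U=u), && (5)\\
\rho_m(X;Y|U) &= \operatorname*{ess\,sup}_{u}\rho_m(X;Y|U=u), && (6)
\end{aligned}
$$

where $\operatorname*{ess\,sup}_{u}$ and $\operatorname*{ess\,inf}_{u}$ respectively denote the essential supremum and the essential infimum with respect to the distribution of $U$.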

###### Remark 4.

It is worth noting that does not hold in general. This can be seen from the following example. Assume are three numbers such that . Suppose that is a pair of random variables such that and . (It is obvious that there are many random variable pairs satisfying the conditions.) Denote the distribution of as . Now we consider a triple of random variables such that and and . Then we have . Hence . However, . Hence for this example.

###### Remark 5.

If $U$ is a discrete random variable, then

$$\rho_m(X;Y|U)=\sup_{u:P_U(u)>0}\rho_m(X;Y|U=u), \qquad (7)$$

where $P_U$ denotes the pmf of $U$. Note that Beigi and Gohari  defined the conditional maximal correlation via (7). This theorem implies the equivalence between the conditional maximal correlation defined by us and that defined by Beigi and Gohari.

###### Remark 6.

If $U$ is an absolutely continuous random variable, then

$$\rho_m(X;Y|U)=\inf_{q_U:\,q_U=p_U\ \text{a.e.}}\ \sup_{u:\,q_U(u)>0}\rho_m(X;Y|U=u),$$

where $p_U$ denotes the pdf of $U$.

###### Proof.

We first prove (4). Denote and . Hence for any ; and for any . It means that . Therefore, to show (4), we only need to show . We can upper bound as follows.

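$$
\begin{aligned}
\rho(X;Y|U) &= \frac{\mathbb{E}[\mathrm{cov}(X,Y|U)]}{\sqrt{\mathbb{E}[\mathrm{var}(X|U)]}\sqrt{\mathbb{E}[\mathrm{var}(Y|U)]}} && (8)\\
&\le \frac{\mathbb{E}[\mathrm{cov}(X,Y|U)]}{\mathbb{E}\sqrt{\mathrm{var}(X|U)\,\mathrm{var}(Y|U)}} && (9)\\
&= \inf_{\lambda>\lambda^*}\frac{\mathbb{E}\left[\mathrm{cov}(X,Y|U)\cdot 1\{U\in\mathbb{R}\setminus A_\lambda\}\right]}{\mathbb{E}\left[\sqrt{\mathrm{var}(X|U)\,\mathrm{var}(Y|U)}\cdot 1\{U\in\mathbb{R}\setminus A_\lambda\}\right]} && (10)\\
&\le \inf_{\lambda>\lambda^*}\ \sup_{u\in\mathbb{R}\setminus A_\lambda}\rho(X;Y|U=u)\\
&\le \inf_{\lambda>\lambda^*}\lambda\\
&= \lambda^*, && (11)
\end{aligned}
$$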

where (9) follows by the Cauchy-Schwarz inequality, and (10) follows from Theorem 15.2 of  and the fact for any .

From (2), (5) follows immediately.

Finally, we prove (6). Similarly as in the proof above, we denote and . Hence for any ; for any ; and . Therefore, to prove (6), we only need to show . On one hand, by derivations similar as (8)-(11), we can upper bound as follows.

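$$
\begin{aligned}
\rho_m(X;Y|U) &= \sup_{f}\sqrt{\frac{\mathbb{E}[\mathrm{var}(\mathbb{E}[f(X,U)|Y,U]\,|\,U)]}{\mathbb{E}[\mathrm{var}(f(X,U)|U)]}}\\
&= \sup_{f}\inf_{\lambda>\lambda^*}\sqrt{\frac{\mathbb{E}\left[\mathrm{var}(\mathbb{E}[f(X,U)|Y,U]\,|\,U)\cdot 1\{U\in\mathbb{R}\setminus A_\lambda\}\right]}{\mathbb{E}\left[\mathrm{var}(f(X,U)|U)\cdot 1\{U\in\mathbb{R}\setminus A_\lambda\}\right]}}\\
&\le \sup_{f}\inf_{\lambda>\lambda^*}\ \sup_{u\in\mathbb{R}\setminus A_\lambda}\sqrt{\frac{\mathbb{E}[\mathrm{var}(\mathbb{E}[f(X,U)|Y,U]\,|\,U=u)]}{\mathbb{E}[\mathrm{var}(f(X,U)|U=u)]}}\\
&\le \inf_{\lambda>\lambda^*}\ \sup_{u\in\mathbb{R}\setminus A_\lambda}\ \sup_{f}\sqrt{\frac{\mathbb{E}[\mathrm{var}(\mathbb{E}[f(X,U)|Y,U]\,|\,U=u)]}{\mathbb{E}[\mathrm{var}(f(X,U)|U=u)]}}\\
&\le \inf_{\lambda>\lambda^*}\lambda\\
&= \lambda^*.
\end{aligned}
$$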

On the other hand, we assume is a function such that for each , where and . The existence of follows from the definition of . According to the definition of , we have that , and for each ,

$$\frac{\mathrm{var}(\mathbb{E}[\tilde{f}(X,U)|Y,U=u]\,|\,U=u)}{\mathrm{var}(\tilde{f}(X,U)|U=u)}\ge(\alpha\lambda)^{2}. \qquad (12)$$

Set . Then

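$$
\begin{aligned}
\rho_m(X;Y|U) &\ge \sqrt{\frac{\mathbb{E}[\mathrm{var}(\mathbb{E}[f(X,U)|Y,U]\,|\,U)]}{\mathbb{E}[\mathrm{var}(f(X,U)|U)]}}\\
&= \sqrt{\frac{\mathbb{E}\left[\mathrm{var}(\mathbb{E}[\tilde{f}(X,U)|Y,U]\,|\,U)\cdot 1\{U\in A_\lambda\}\right]}{\mathbb{E}\left[\mathrm{var}(\tilde{f}(X,U)|U)\cdot 1\{U\in A_\lambda\}\right]}}\\
&\ge \sqrt{\inf_{u\in A_\lambda}\frac{\mathrm{var}(\mathbb{E}[\tilde{f}(X,U)|Y,U=u]\,|\,U=u)}{\mathrm{var}(\tilde{f}(X,U)|U=u)}}\\
&\ge \alpha\lambda, && (13)
\end{aligned}
$$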

where (13) follows from (12). Since and are arbitrary, we have .

Combining the two points above, we have . ∎

For discrete with finite supports, without loss of generality, the supports of and are assumed to be and , respectively. For this case, denote

$\lambda_2(u)$ as the second largest singular value of the matrix $Q_u$ with entries

$$Q_u(x,y):=\frac{P(x,y|u)}{\sqrt{P(x|u)P(y|u)}}=\frac{P(x,y,u)}{\sqrt{P(x,u)P(y,u)}}.$$

For absolutely-continuous , denote as the second largest singular value of the bivariate function , where denotes a conditional pdf of respect to . Then we have the following singular value characterization of conditional maximal correlation.

###### Theorem 4.

(Singular value characterization). Assume are discrete random variables with finite supports, or absolutely-continuous random variables such that a.s. Then

$$\rho_m(X;Y|U)=\operatorname*{ess\,sup}_{u}\lambda_2(u). \qquad (14)$$

###### Remark 7.

This property is consistent with the one of the unconditional version by setting to a constant, i.e.,

###### Proof.

The unconditional version of this theorem was proven in . That is, for discrete with finite supports, equals the second largest singular value of the matrix with entries ; for absolutely-continuous such that with denoting a pdf of , equals the second largest singular value of the bivariate function . Combining this with Theorem 3, we have (14). ∎
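
For finite supports, the singular value characterization gives a direct way to compute the conditional maximal correlation. The following NumPy sketch is an illustration under stated assumptions: the joint pmf is stored as an array `p_xyu[x, y, u]`, and the function names `maximal_correlation` and `conditional_maximal_correlation` and the helper `dsbs` are hypothetical. Since $Q_u$ is unchanged when $P(x,y,u)$ is rescaled by a constant, each slice is passed without dividing by $P_U(u)$.

```python
import numpy as np


def maximal_correlation(p_xy):
    """Second largest singular value of Q(x, y) = P(x, y) / sqrt(P(x) P(y)).

    p_xy may be unnormalised; Q is invariant to a positive rescaling.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    # Drop zero-probability symbols so that the normalisation is well defined.
    keep_x, keep_y = p_x > 0, p_y > 0
    q = p_xy[np.ix_(keep_x, keep_y)] / np.sqrt(np.outer(p_x[keep_x], p_y[keep_y]))
    s = np.linalg.svd(q, compute_uv=False)  # singular values, descending
    return float(s[1]) if s.size > 1 else 0.0


def conditional_maximal_correlation(p_xyu):
    """p_xyu[x, y, u] = P(x, y, u); maximises lambda_2(u) over u with P_U(u) > 0."""
    p_xyu = np.asarray(p_xyu, dtype=float)
    p_u = p_xyu.sum(axis=(0, 1))
    return max(
        maximal_correlation(p_xyu[:, :, u])
        for u in range(p_xyu.shape[2])
        if p_u[u] > 0
    )


def dsbs(a):
    """Doubly symmetric binary source with crossover probability a."""
    return np.array([[(1 - a) / 2, a / 2], [a / 2, (1 - a) / 2]])


# U is uniform on {0, 1}; given U = u, (X, Y) is a DSBS with crossover 0.1 or 0.3.
p_xyu = np.stack([0.5 * dsbs(0.1), 0.5 * dsbs(0.3)], axis=2)
rho_m = conditional_maximal_correlation(p_xyu)
```

For a doubly symmetric binary source with crossover probability $a$, the matrix $Q$ has singular values $1$ and $|1-2a|$, so the two slices give $0.8$ and $0.4$ and the conditional maximal correlation is $0.8$.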

Note that, is a mapping that maps a distribution to a real number in . Now we study the concavity of such a mapping.

###### Corollary 1.

(Concavity). Given , is concave in . That is, for any distributions and , and any , , where denotes the conditional maximal correlation under distribution .

###### Proof.

This theorem directly follows from the characterization in (6). ∎

For a discrete random variable, the distribution is uniquely determined by its pmf. Therefore, for discrete random variables , can be also seen as a mapping that maps a pmf to a real number in . Assume are three finite sets. Denote as the set of pmfs defined on (i.e., the dimensional probability simplex). Consider as a mapping . Now we study the continuity (or discontinuity) of such a mapping.

###### Corollary 2.

(Continuity and discontinuity). For finite sets , is continuous (under the total variation distance) on . But in general, is discontinuous at such that .

###### Proof.

For a pmf , . On the other hand, singular values are continuous in the matrix (see [9, Corollary 8.6.2]), hence is continuous in . Furthermore, since , in the total variation distance sense implies and . Therefore, , where denotes the conditional maximal correlation under distribution . However, if there exists such that and . Then letting in a direction such that always holds, we have that always holds. This implies is discontinuous at . ∎
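
The discontinuity can be seen on a small family of pmfs. In the sketch below, which is an illustrative construction rather than one taken from the paper, $U=1$ occurs with probability $\epsilon$ and forces $X=Y$, while $U=0$ makes $X$ and $Y$ independent. All function names and the array layout `p_xyu[x, y, u]` are assumptions.

```python
import numpy as np


def maximal_correlation(p_xy):
    """Second largest singular value of Q(x, y) = P(x, y) / sqrt(P(x) P(y))."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    keep_x, keep_y = p_x > 0, p_y > 0
    q = p_xy[np.ix_(keep_x, keep_y)] / np.sqrt(np.outer(p_x[keep_x], p_y[keep_y]))
    s = np.linalg.svd(q, compute_uv=False)
    return float(s[1]) if s.size > 1 else 0.0


def conditional_maximal_correlation(p_xyu):
    """Maximum of the event conditional maximal correlations over u with P_U(u) > 0."""
    p_xyu = np.asarray(p_xyu, dtype=float)
    p_u = p_xyu.sum(axis=(0, 1))
    return max(
        maximal_correlation(p_xyu[:, :, u])
        for u in range(p_xyu.shape[2])
        if p_u[u] > 0
    )


def p_eps(eps):
    """Joint pmf with P_U(1) = eps: independent bits given U = 0, X = Y given U = 1."""
    independent = np.full((2, 2), 0.25)
    equal = np.array([[0.5, 0.0], [0.0, 0.5]])
    return np.stack([(1 - eps) * independent, eps * equal], axis=2)


values = {eps: conditional_maximal_correlation(p_eps(eps)) for eps in (1e-2, 1e-6, 0.0)}
```

The value stays at $1$ for every $\epsilon>0$, because the supremum ignores how small $P_U(1)$ is, and drops to $0$ at $\epsilon=0$, even though the pmfs converge in total variation.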

###### Theorem 5.

1) For any random variables $X, Y, U$,

$$0\le|\rho(X;Y|U)|\le\theta(X;Y|U)\le\rho_m(X;Y|U)\le 1.$$

2) Moreover, if and only if and are conditionally independent given . Furthermore, for discrete random variables with finite supports, if and only if for some functions and such that (i.e., and have conditional Gács-Körner common information given ), where denotes the conditional entropy of given .

###### Proof.

The statement 1) follows from the definitions of these three conditional correlations. The statement 2) with degenerate (unconditional version) was proven by Rényi . The statement 2) (conditional version) follows by combining Rényi’s results and the characterization in (6). ∎

Next we show that the three conditional correlations are equal for the Gaussian case.

###### Theorem 6.

(Gaussian case). For jointly Gaussian random variables $X, Y, U$, we have

$$|\rho(X;Y|U)|=\theta(X;Y|U)=\theta(Y;X|U)=\rho_m(X;Y|U). \qquad (15)$$

###### Proof.

The unconditional version of (15) was proven in [16, Sec. IV, Lem. 10.2]. On the other hand, given

also follows jointly Gaussian distribution, and

for any . Hence

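$$
\begin{aligned}
\rho_m(X;Y|U) &= \operatorname*{ess\,sup}_{u}\rho_m(X;Y|U=u) && (16)\\
&= \operatorname*{ess\,sup}_{u}|\rho(X;Y|U=u)| && (17)\\
&= |\rho(X;Y|U)|,
\end{aligned}
$$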

where (16) follows from Theorem 3, and (17) follows from [16, Sec. IV, Lem. 10.2].

Furthermore, both $\theta(X;Y|U)$ and $\theta(Y;X|U)$ are between $|\rho(X;Y|U)|$ and $\rho_m(X;Y|U)$. Hence (15) holds. ∎
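
In the jointly Gaussian case the conditional covariance of $(X,Y)$ given $U=u$ does not depend on $u$ and is given by a Schur complement of the joint covariance matrix (a standard fact about Gaussian vectors, not stated in the text above). The sketch below uses this to evaluate $\rho(X;Y|U)$; by (15) its absolute value is also the conditional correlation ratio and the conditional maximal correlation. The function name and the ordering of the coordinates are assumptions.

```python
import numpy as np


def gaussian_conditional_correlation(cov):
    """cov: covariance matrix of (X, Y, U_1, ..., U_k), assumed jointly Gaussian
    with an invertible covariance block for U.

    Returns rho(X;Y|U), computed from the conditional covariance of (X, Y)
    given U, i.e. the Schur complement of the U block.
    """
    cov = np.asarray(cov, dtype=float)
    a = cov[:2, :2]   # covariance of (X, Y)
    b = cov[:2, 2:]   # cross-covariance between (X, Y) and U
    c = cov[2:, 2:]   # covariance of U
    cond = a - b @ np.linalg.solve(c, b.T)
    denom = cond[0, 0] * cond[1, 1]
    return float(cond[0, 1] / np.sqrt(denom)) if denom > 0 else 0.0


# Unit variances with correlations r_xy = 0.5, r_xu = 0.6, r_yu = 0.4.
cov = np.array([
    [1.0, 0.5, 0.6],
    [0.5, 1.0, 0.4],
    [0.6, 0.4, 1.0],
])
rho_cond = gaussian_conditional_correlation(cov)
```

For scalar $U$ this reduces to the familiar partial correlation formula $(r_{xy}-r_{xu}r_{yu})/\sqrt{(1-r_{xu}^2)(1-r_{yu}^2)}$.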

### 3.2 Other Properties: Tensorization, DPI, Correlation ratio equality, and Conditioning reducing covariance gap

The tensorization property and the data processing inequality for the unconditional maximal correlation were proven in [17, Thm. 1] and [20, Lem. 2.1] respectively. Here we extend them to the conditional case.

###### Theorem 7.

(Tensorization). Assume given with and , is a sequence of pairs of conditionally independent random variables, then we have

$$\rho_m(X^n;Y^n|U)=\max_{1\le i\le n}\rho_m(X_i;Y_i|U).$$

###### Proof.

The unconditional version $\rho_m(X^n;Y^n)=\max_{1\le i\le n}\rho_m(X_i;Y_i)$ for a sequence of pairs of independent random variables is proven in [17, Thm. 1]. Hence the result for the event conditional maximal correlation also holds. Using this result and Theorem 3, we have

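$$
\begin{aligned}
\rho_m(X^n;Y^n|U) &= \operatorname*{ess\,sup}_{u}\rho_m(X^n;Y^n|U=u)\\
&= \operatorname*{ess\,sup}_{u}\max_{1\le i\le n}\rho_m(X_i;Y_i|U=u)\\
&= \max_{1\le i\le n}\operatorname*{ess\,sup}_{u}\rho_m(X_i;Y_i|U=u) && (18)\\
&= \max_{1\le i\le n}\rho_m(X_i;Y_i|U),
\end{aligned}
$$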

where (18) follows by the following lemma.

###### Lemma 1.

Assume is a countable set, and is an arbitrary distribution on . Then for any function , we have

$$\operatorname*{ess\,sup}_{u}\sup_{i\in\mathcal{I}}f(i,u)=\sup_{i\in\mathcal{I}}\operatorname*{ess\,sup}_{u}f(i,u).$$

This lemma follows from the following two points. For a number , assume satisfies that . Then for any function ,

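$$\operatorname*{ess\,sup}_{u}\sup_{i\in\mathcal{I}}f(i,u)\ge\operatorname*{ess\,sup}_{u}f(i^*,u)=\sup_{i\in\mathcal{I}}\operatorname*{ess\,sup}_{u}f(i,u)-\epsilon.$$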

Since is arbitrary, we have .

On the other hand, denote . Then by the definition of , we have for all . Hence by the union bound, we have . Furthermore, for a number , implies that there exists an such that . Hence

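$$P_U\Big(u:\sup_{i\in\mathcal{I}}f(i,u)>\sup_{i\in\mathcal{I}}\lambda_i^*+\epsilon\Big)\le P_U\big(u:\exists\, i\in\mathcal{I}\ \text{s.t.}\ f(i,u)>\lambda_i^*\big)=0.$$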

By the definition of , we have . Since is arbitrary, we have . ∎
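
The tensorization property can be checked numerically in the unconditional case (degenerate $U$), from which the conditional statement follows by maximising over $u$. The sketch below assumes finite supports and uses the fact that the joint pmf of two independent pairs, viewed as a matrix indexed by $(x_1,x_2)$ and $(y_1,y_2)$, is the Kronecker product of the two individual pmf matrices. The function name and the example pmfs are hypothetical.

```python
import numpy as np


def maximal_correlation(p_xy):
    """Second largest singular value of Q(x, y) = P(x, y) / sqrt(P(x) P(y))."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    keep_x, keep_y = p_x > 0, p_y > 0
    q = p_xy[np.ix_(keep_x, keep_y)] / np.sqrt(np.outer(p_x[keep_x], p_y[keep_y]))
    s = np.linalg.svd(q, compute_uv=False)
    return float(s[1]) if s.size > 1 else 0.0


# Two independent pairs (X1, Y1) and (X2, Y2) with different dependence levels.
p1 = np.array([[0.45, 0.05],
               [0.05, 0.45]])
p2 = np.array([[0.3, 0.1, 0.1],
               [0.1, 0.1, 0.3]])

# Rows index (x1, x2), columns index (y1, y2):
# P(x1, x2, y1, y2) = P1(x1, y1) * P2(x2, y2).
p_joint = np.kron(p1, p2)

rho_joint = maximal_correlation(p_joint)
rho_max = max(maximal_correlation(p1), maximal_correlation(p2))
```

The normalised matrix of the joint pmf is the Kronecker product of the two normalised matrices, so its singular values are pairwise products. Since each factor has largest singular value $1$, the second largest product is the larger of the two individual second singular values, which is what `rho_joint` and `rho_max` compare.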

###### Theorem 8.

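(Data processing inequality). If random variables $X, Y, Z, U$ form a Markov chain $X\to(Z,U)\to Y$ (i.e., $X$ and $Y$ are conditionally independent given $(Z,U)$), then

$$
\begin{aligned}
|\rho(X;Y|U)| &\le \theta(X;Z|U)\,\theta(Y;Z|U), && (19)\\
\theta(X;Y|U) &\le \theta(X;Z|U)\,\rho_m(Y;Z|U), && (20)\\
\rho_m(X;Y|U) &\le \rho_m(X;Z|U)\,\rho_m(Y;Z|U). && (21)
\end{aligned}
$$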

Moreover, equalities hold in (19)-(21) if and

have the same joint distribution.
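
Inequality (21) can be illustrated numerically with degenerate $U$, in which case the Markov condition means that $X$ and $Y$ are conditionally independent given $Z$. The sketch below draws a random finite Markov chain with a fixed seed and compares both sides; the function name, alphabet sizes, and the random construction are assumptions made for illustration.

```python
import numpy as np


def maximal_correlation(p_xy):
    """Second largest singular value of Q(x, y) = P(x, y) / sqrt(P(x) P(y))."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    keep_x, keep_y = p_x > 0, p_y > 0
    q = p_xy[np.ix_(keep_x, keep_y)] / np.sqrt(np.outer(p_x[keep_x], p_y[keep_y]))
    s = np.linalg.svd(q, compute_uv=False)
    return float(s[1]) if s.size > 1 else 0.0


rng = np.random.default_rng(0)
n_x, n_y, n_z = 3, 4, 3

p_z = rng.dirichlet(np.ones(n_z))                    # P(z)
p_x_given_z = rng.dirichlet(np.ones(n_x), size=n_z)  # row z holds P(x | z)
p_y_given_z = rng.dirichlet(np.ones(n_y), size=n_z)  # row z holds P(y | z)

p_zx = p_z[:, None] * p_x_given_z   # P(z, x)
p_zy = p_z[:, None] * p_y_given_z   # P(z, y)
# X and Y are conditionally independent given Z:
# P(x, y) = sum_z P(z) P(x | z) P(y | z).
p_xy = p_zx.T @ p_y_given_z

lhs = maximal_correlation(p_xy)
rhs = maximal_correlation(p_zx) * maximal_correlation(p_zy)
```

Because the maximal correlation is symmetric in its two arguments, passing the $(z,x)$ and $(z,y)$ matrices gives $\rho_m(X;Z)$ and $\rho_m(Y;Z)$, and `lhs` should not exceed `rhs`.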

###### Proof.

Consider

 E[cov(X,Y|U)] =E[(X−E[X|U])(Y−E[Y|U])] =E[E[(X−E[X|U])(Y−E[Y|U])|Z,U]] (22) =E[(E[X|Z,U]−E[X|U])(E[Y|Z,U]−E[Y|U])] ≤√E[(E[X|Z,U]−E[X|U])2]E[(E[Y|Z,U]−E[Y|U])2] (23) =√E[var(E[X|Z,U]|U)]E[var(E[Y|Z,U]