# Conditional Mutual Information Neural Estimator

Several recent works in communication systems have proposed to leverage the power of neural networks in the design of encoders and decoders. In this approach, these blocks can be tailored to maximize the transmission rate based on aggregated samples from the channel. Motivated by the fact that, in many communication schemes, the achievable transmission rate is determined by a conditional mutual information, this paper focuses on neural-based estimators for this information-theoretic quantity. Our results are based on variational bounds for the KL-divergence and, in contrast to some previous works, we provide a mathematically rigorous lower bound. However, additional challenges with respect to the unconditional mutual information emerge due to the presence of a conditional density function; this is also addressed here.

## 1 Introduction

Although originally conceived to address communication problems, information-theoretic measures have been insightful in many fields including statistics, signal processing, and even neuroscience. Entropy, KL-divergence, and mutual information (MI) are extensively used to explain the behavior of, and relations among, random variables. For example, MI and its extensions are used to characterize the capacity of communication channels [1] as well as to define notions of causality [2, 3]. Information-theoretic quantities have also made their way into machine learning and deep neural networks [4]. They have been adopted to regularize optimizations in neural networks [5, 6], or to express the flow of information in the layers of a deep network [7]. In a generative adversarial network (GAN), in which a neural network is optimized to perform against the best-trained adversary, the relative entropy plays an eminent role [8, 9].

On the other hand, neural networks have also been applied in communication setups as part of the encoder/decoder blocks [10, 11, 12]. Learning the end-to-end communication system is challenging, though, since it requires knowledge of the channel model, which is not available in many applications. One solution to this problem is to train a GAN model to mimic the channel, which can later be used to learn the whole system [13, 14]. Alternatively, one can design encoders and optimize them to achieve the maximum rate (capacity), which is characterized by the MI. In [15], the authors exploit the neural network estimator of the mutual information proposed in [16] to optimize their encoders. Although estimating the MI has been studied extensively (see e.g., [17, 18]), the capacity in many communication setups (such as the relay channel, channels with random state, and wiretap channels) is described by the conditional mutual information (CMI), which requires specialized estimators. Extending the existing MI estimators for this purpose is not trivial and is the main focus of this paper. Motivated by [15], we investigate estimators based on artificial neural networks.

The main challenges in estimating the CMI stem from empirically computing the conditional density function and from the curse of dimensionality. The latter is, in fact, a common problem in many data-driven estimators for MI and CMI. A conventional estimator in the literature for MI is based on the $k$-nearest neighbors ($k$-NN) method [19], which has been extensively studied [20] and extended to estimate the CMI [21, 22, 23]. In a recent work, the authors of [16] propose a new approach to estimate the MI using a neural network, based on the Donsker–Varadhan representation of the KL-divergence [24]. They show improved performance in estimating the MI between high-dimensional variables compared to the $k$-NN method. Several other works also take advantage of variational bounds [25] to estimate the MI and show similar improvements [26, 27]. For estimating the CMI, the authors of [28] introduce a neural network classifier that categorizes the density of the input into two classes (joint and product); they then show that, by training such a classifier, they can optimize a variational bound for the relative entropy and correspondingly for the MI and the CMI. Furthermore, they suggest the use of a GAN model, a variational autoencoder, or $k$-NN to generate or select samples with the appropriate conditional density.

Although the trained classifier proposed in [28] asymptotically converges to the optimal function for the variational bound, a relatively high variance can be observed in the final output. Consequently, the final estimation is an average over several Monte Carlo trials. However, the variational bound used in [28] contains non-linearities, which results in a biased estimation when taking a Monte Carlo average. A similar problem is pointed out in [27], where the authors suggest a looser but linear variational bound. In this paper, we take advantage of the classifier technique applied to a linearized variational bound to avoid the bias problem. For simplicity, the $k$-NN method is used to collect samples from the conditional density. In Section 2 the variational bounds are explained and we shed light on the Monte Carlo bias problem. Afterwards, the proposed technique to train the classifier and estimate the CMI is stated in Section 3. We then investigate the problem of estimating the secrecy capacity of the degraded wiretap channel, which is characterized by the CMI. Finally, the paper is concluded in Section 4.

## 2 Preliminaries

For random variables $(X, Y, Z)$ jointly distributed according to $p(x,y,z)$, the CMI is defined as:

$$I(X;Y|Z) = D_{\mathrm{KL}}\big(p(x,y,z)\,\|\,p(y,z)\,p(x|z)\big), \tag{1}$$

where $D_{\mathrm{KL}}(p\|q)$ is the KL-divergence between the probability density functions (pdfs) $p$ and $q$,

$$D_{\mathrm{KL}}(p\|q) = \int p(u)\log\frac{p(u)}{q(u)}\,du. \tag{2}$$

The CMI characterizes the capacity of communication systems such as channels with random states and network communication models like the relay channel or the degraded wiretap channel (DWTC) [1]. For example, in the DWTC (see Fig. 1), the secrecy capacity is:

$$C_s = \max_{p(x)} I(X;Y|Z). \tag{3}$$

By similar steps as in [25], we can bound (1) from below:

$$I(X;Y|Z) \ge \mathbb{E}_{p(x,y,z)}\!\left[\log\frac{q(x|y,z)}{p(x|z)}\right], \tag{4}$$

where $q(x|y,z)$ is an arbitrary pdf; the bound holds since the relative entropy is non-negative. The lower bound in (4) is tight if the conditional density functions $q(x|y,z)$ and $p(x|y,z)$ are equal. As discussed in [27], by choosing an energy-based density, one can obtain lower bounds for the CMI, as stated in the following proposition.

###### Proposition 1.

For any function $f(x,y,z)$, let $M(y,z) \coloneqq \mathbb{E}_{p(x|z)}[\exp f(x,y,z)]$; then the following bound holds and is tight when $\exp f(x,y,z)/M(y,z) = p(x|y,z)/p(x|z)$:

$$I(X;Y|Z) \ge \mathbb{E}_{p(x,y,z)}[f(x,y,z)] - \mathbb{E}_{p(y,z)}[\log M(y,z)]. \tag{5}$$

###### Proof.

Choosing $q(x|y,z) = p(x|z)\exp f(x,y,z)/M(y,z)$ and substituting in (4) yields the desired bound. ∎

It is worth noting that computing the optimal $f$ is non-trivial since we may not have access to the joint pdf. Using Proposition 1 and Jensen's inequality, a variant of the Donsker–Varadhan bound [24] can be obtained as follows:

$$I(X;Y|Z) \ge \mathbb{E}_{p(x,y,z)}[f(x,y,z)] - \log\mathbb{E}_{p(x|z)p(y,z)}[\exp f(x,y,z)]. \tag{6}$$

Hereafter, let the r.h.s. of (6) be denoted as $I_{\mathrm{DV}}$. The bound is tight when $f(x,y,z) = \log\frac{p(x,y,z)}{p(x|z)\,p(y,z)}$ (up to an additive constant).

In order to estimate $I_{\mathrm{DV}}$ from samples, an empirical average may be taken with respect to $p(x,y,z)$ and $p(x|z)p(y,z)$. Let $B^b_{\mathrm{joint}}$ and $B^b_{\mathrm{prod}}$ be random batches of $b$ triples $(x,y,z)$ sampled respectively from $p(x,y,z)$ and $p(x|z)p(y,z)$; then the estimated $I_{\mathrm{DV}}$ for an arbitrary choice of $f$ is:

$$\hat{I}^b_{\mathrm{DV}} = \frac{1}{b}\sum_{(x,y,z)\in B^b_{\mathrm{joint}}} f(x,y,z) \;-\; \log\frac{1}{b}\sum_{(x,y,z)\in B^b_{\mathrm{prod}}} \exp f(x,y,z). \tag{7}$$
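Once $f$ has been evaluated on the two batches, the estimator in (7) reduces to a couple of array operations. The sketch below is our own illustration (the name `dv_estimate` and the array interface are not from the paper):

```python
import numpy as np

def dv_estimate(f_joint, f_prod):
    """Empirical DV lower bound, as in (7).

    f_joint: values of f(x,y,z) on a batch drawn from p(x,y,z).
    f_prod:  values of f(x,y,z) on a batch drawn from p(x|z)p(y,z).
    """
    return np.mean(f_joint) - np.log(np.mean(np.exp(f_prod)))
```

As a quick sanity check, a constant $f \equiv c$ makes the two terms cancel and the bound evaluates to zero.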

Since the outcomes of the previous estimator exhibit a relatively high variance [27, 28, 29], the authors of [28] averaged the estimated results over several trials. However, due to the concavity of the logarithm in the second term of (7), a Monte Carlo average of different instances of $\hat{I}^b_{\mathrm{DV}}$ creates an upper bound for $I_{\mathrm{DV}}$, while $I_{\mathrm{DV}}$ is itself a lower bound of the CMI. This issue can be resolved by taking a looser bound in which the logarithmic term is linearized. A similar bound was obtained by Nguyen, Wainwright, and Jordan [30], and was adopted in [16, 28] to estimate the MI, where it was denoted f-MINE (since it corresponds to a variational representation of the f-divergence). The corresponding bound for the CMI is:

$$I(X;Y|Z) \ge \mathbb{E}_{p(x,y,z)}[f(x,y,z)] - e^{-1}\,\mathbb{E}_{p(x|z)p(y,z)}[\exp f(x,y,z)], \tag{8}$$

where we use the inequality $\log u \le u/e$ in (6). In this paper, let $I_{\mathrm{NWJ}}$ refer to the lower bound (8). The bound is tight for $f(x,y,z) = 1 + \log\frac{p(x,y,z)}{p(x|z)\,p(y,z)}$, while a Monte Carlo average of several instances of the estimator is unbiased and, consequently, justified for estimating a lower bound on the CMI.

Although the optimal choice for $f$ is known, it is non-trivial to compute since the true joint density is unknown. Several approaches have been proposed to estimate $f$ for the MI and the CMI, including [16, 27] and [28]. In [16], the search space is restricted to functions generated via a neural network; then, using minibatch gradient descent, the r.h.s. of (8) is optimized. Even though training the network can be unstable [27], their method is shown to scale better with dimension than the $k$-NN estimator for mutual information in [19]. Alternatively, [27] discusses estimators for the log density ratio to be used in variational bounds, denoted there as the Jensen–Shannon estimator. The motivation originates in the form of the functions that optimize these bounds. Similarly, to estimate $f$, [28] introduces a binary classifier using a neural network to discriminate samples from $p(x,y,z)$ and $p(x|z)p(y,z)$, and shows that by using a cross-entropy loss, the density ratio present in the optimal $f$ can be estimated. The authors then applied this classifier to estimate the CMI.

Although the variational bound for the CMI resembles the one for the MI, in practice the empirical estimation cannot be immediately extended due to the appearance of the conditional density when preparing the batches. A workaround proposed in [28] is to compute the CMI as a difference between two MIs, i.e., $I(X;Y|Z) = I(X;Y,Z) - I(X;Z)$. However, given that both estimated MIs are lower bounds of the true values, it is not clear whether their difference is a lower or an upper bound of the desired CMI.

Nonetheless, the authors of [28] do propose several methods to construct batches, including a GAN, a variational autoencoder, and the $k$-NN method. The classifier is then trained on these batches to distinguish samples from either $p(x,y,z)$ or $p(x|z)p(y,z)$. In this work, due to limited space, we focus only on the $k$-NN method to create the batches, and we train the classifier similarly to [28]. The main advantage of our work is the use of $\hat{I}^b_{\mathrm{NWJ}}$ to estimate the CMI, instead of $\hat{I}^b_{\mathrm{DV}}$ as applied in [28], which resolves the Monte Carlo bias problem.

## 3 Main results

In this section we explain the steps to estimate the CMI using neural networks. The estimator is subsequently exploited to compute the secrecy capacity of a degraded wiretap channel.

### 3.1 Constructing the batches

A major challenge in estimating the CMI for continuous random variables is dealing with the aforementioned conditional density function. Suppose we have a dataset of triples $(x_i, y_i, z_i)$, $i = 1, \dots, n$, generated from a joint density $p(x,y,z)$. In order to construct $B_{\mathrm{joint}}$, we can take random triples from the dataset. However, in order to build $B_{\mathrm{prod}}$, it is not straightforward to take a sample according to $p(x|z)$ for a chosen pair $(y,z)$ distributed by $p(y,z)$. Here, to make this tractable, we exploit the notion of $k$-NN as follows. Consider a pair $(y_i, z_i)$ chosen randomly from the dataset, which represents a sample from $p(y,z)$. In order to sample $x$, this technique tells us to choose the $x_j$ corresponding to a triple $(x_j, y_j, z_j)$ such that $z_j$ is one of the $k$ nearest neighbors of $z_i$.
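The substitution step above can be sketched in a few lines of numpy. This is our own illustrative implementation (brute-force distances, `z` stored as an `(n, d)` array, and the function name `knn_product_batch` are assumptions, not the paper's code):

```python
import numpy as np

def knn_product_batch(x, y, z, k, seed=0):
    """Approximate a batch from p(x|z)p(y,z): for each pair (y_i, z_i),
    swap x_i for the x of a random triple whose z is among the k nearest
    neighbors of z_i (excluding i itself)."""
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    # pairwise squared distances between the z samples
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude the sample itself
    nbrs = np.argsort(d2, axis=1)[:, :k]    # k nearest neighbors of each z_i
    picks = nbrs[np.arange(n), rng.integers(0, k, size=n)]
    return x[picks], y, z
```

For large datasets a KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the quadratic distance matrix, but the resampling logic stays the same.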

### 3.2 Classifier

Let $c = 1$ denote the class of samples distributed according to $p(x,y,z)$, while $c = 0$ represents the class of samples randomly taken from $p(x|z)p(y,z)$, and the classes are equally probable. Consider a neural network with samples of $(x,y,z)$ as input and corresponding output $\omega$, where the objective is to classify the samples. Using a sigmoid function $\sigma(\omega) = 1/(1 + e^{-\omega})$, one can map $\omega$ to a real value in $(0,1)$, interpreted as the estimated probability of the class $c = 1$. Let $\hat{c}$ be the classifier's decision and assume the cross-entropy loss, defined as

$$\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Big[c_i\log\sigma(\omega_i) + (1 - c_i)\log\big(1 - \sigma(\omega_i)\big)\Big], \tag{9}$$

is to be minimized, where $\theta$ denotes the network parameters. With a sufficient number of samples, the loss function converges to its expected value by the law of large numbers. Thus, for a sufficiently trained network, the output $\omega$ is close to the optimal $\omega^*$, which satisfies:

$$\lambda(\omega^*) \coloneqq \frac{\sigma(\omega^*)}{1 - \sigma(\omega^*)} = \frac{p(x,y,z)}{p(x|z)\,p(y,z)}. \tag{10}$$

In conclusion, by minimizing the cross-entropy loss, the trained network generates the output $\omega^*$, which determines the density ratio between the pdfs for a particular triple $(x,y,z)$.
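The quantities in (9) and (10) are straightforward to express in code. The sketch below is our own (with `w` standing for the logit $\omega$ and `c` for the 0/1 labels); it is not the paper's implementation:

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

def cross_entropy(c, w):
    """Cross-entropy loss as in (9): c holds 0/1 class labels, w the logits."""
    s = sigmoid(w)
    return -np.mean(c * np.log(s) + (1.0 - c) * np.log(1.0 - s))

def density_ratio(w):
    """Likelihood ratio as in (10): sigma/(1 - sigma), which equals exp(w)."""
    s = sigmoid(w)
    return s / (1.0 - s)
```

Note that $\sigma(\omega)/(1 - \sigma(\omega)) = e^{\omega}$, so a well-trained classifier directly yields $\log\lambda(\omega^*) = \omega^*$.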

### 3.3 Lower bound on I(X;Y|Z)

We choose to estimate the CMI with $\hat{I}^b_{\mathrm{NWJ}}$ since it is unbiased and shown to be tight. We train a neural-network–based classifier as previously described and compute (10); this yields the optimal $f$ for (8), i.e.,

$$f^*(x,y,z) = 1 + \log\lambda(\omega^*). \tag{11}$$

Hence, the empirical estimation is computed as:

$$\hat{I}^b_{\mathrm{NWJ}} = 1 + \frac{1}{b}\sum_{B^b_{\mathrm{joint}}}\log\lambda(\omega^*) - \frac{1}{b}\sum_{B^b_{\mathrm{prod}}}\lambda(\omega^*). \tag{12}$$

The steps of the estimation are stated in Algorithm 1.
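Given the ratio estimates $\lambda(\omega^*)$ on the two batches, (12) is again a one-liner; a sketch under the same array conventions as before (the name `nwj_estimate` is ours):

```python
import numpy as np

def nwj_estimate(lam_joint, lam_prod):
    """Empirical NWJ lower bound as in (12), from estimated density ratios.

    lam_joint: lambda(omega*) evaluated on the batch from p(x,y,z).
    lam_prod:  lambda(omega*) evaluated on the batch from p(x|z)p(y,z).
    """
    return 1.0 + np.mean(np.log(lam_joint)) - np.mean(lam_prod)
```

When $X$ and $Y$ are conditionally independent given $Z$, the true ratio is identically 1 and the estimate evaluates to $1 + 0 - 1 = 0$, matching $I(X;Y|Z) = 0$.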

### 3.4 Numerical results

#### Wiretap channel

As motivated in Section 1, estimators for the CMI can be adopted to compute the capacity in communication systems and to optimize encoders accordingly. One example is the DWTC (Fig. 1), where a source transmits a message to a legitimate receiver while keeping it secret from an eavesdropper who has access to a degraded signal. The secrecy capacity (3) can be estimated if samples of $(X, Y, Z)$ are available. Consider the two channel stages to have additive Gaussian noise with variances $\sigma_1^2$ and $\sigma_2^2$, respectively; then the model simplifies as shown in Fig. 2. So for any input with $\mathbb{E}[X^2] \le P$, $I(X;Y|Z)$ can be computed as:

$$I(X;Y|Z) = I(X;Y) - I(X;Z) \le \frac{1}{2}\log\left(1 + \frac{P}{\sigma_1^2}\right) - \frac{1}{2}\log\left(1 + \frac{P}{\sigma_1^2 + \sigma_2^2}\right), \tag{13}$$

with equality when the input is Gaussian. For our estimation, we consider the input to be zero-mean Gaussian with variance $P$; we collect samples of $(X, Y, Z)$ according to the described model and create batches of size $b$. The neural network in our experiment has two layers with 64 hidden ReLU activation units, and we use the Adam optimizer with a fixed learning rate for 100 epochs. Final estimations are averages over several Monte Carlo trials.
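The closed-form value on the r.h.s. of (13) serves as ground truth for the experiments. A small helper, our own sketch (values in nats; names are assumptions):

```python
import numpy as np

def gaussian_dwtc_cmi(P, s1, s2):
    """Right-hand side of (13) in nats: I(X;Y|Z) for a Gaussian input with
    power P, main-channel noise variance s1, and degrading noise variance s2."""
    return 0.5 * np.log(1.0 + P / s1) - 0.5 * np.log(1.0 + P / (s1 + s2))
```

For `s2 = 0` the eavesdropper observes $Y$ itself and the secrecy rate vanishes, as expected.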

The estimated CMI is depicted with respect to $\sigma_2^2$ in Fig. 3, for different choices of the number of neighbors $k$. For $\sigma_2^2 = 0$, i.e., when the eavesdropper has access to the same signal as the legitimate receiver, the secrecy capacity is zero. It can be observed that increasing $k$ results in a better estimation, since the $k$-NN density estimator is known to be asymptotically consistent when $k \to \infty$ and $k/n \to 0$.

#### Monte Carlo bias

To give some insight into the discussion in Section 2, we compare the estimators $\hat{I}^b_{\mathrm{DV}}$ and $\hat{I}^b_{\mathrm{NWJ}}$ for the previously defined DWTC (fixing $\sigma_2^2$) in Fig. 4. The classifier is trained with $n$ samples and $k$ neighbors; to compute the lower bound, the batches $B^b_{\mathrm{joint}}$ and $B^b_{\mathrm{prod}}$ are chosen with size $b$ and the results are averaged over several Monte Carlo trials. This procedure is repeated multiple times to produce the boxplots.

It can be observed that, for small values of $b$, averaging the lower bound (7) over trials actually yields an overestimation of the true CMI far more often than with (12). This is a joint effect of the Monte Carlo bias and the small sample size, the latter affecting both estimators. Although not shown here, increasing the number of trials reduces the variance of the estimators, but the mean of $\hat{I}^b_{\mathrm{DV}}$ still remains above the true CMI for small $b$. This justifies our decision to use $\hat{I}^b_{\mathrm{NWJ}}$.
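The Jensen effect behind this bias can be reproduced in isolation with a toy example of our own, unrelated to the channel model: for $G \sim \mathcal{N}(0,1)$ we have $\mathbb{E}[e^G] = e^{1/2}$, so the second DV term should equal $-\log e^{1/2} = -0.5$, yet Monte Carlo averaging it over small batches overshoots:

```python
import numpy as np

rng = np.random.default_rng(1)
trials, b = 1000, 10
g = rng.standard_normal((trials, b))

# second DV term, computed per small batch and then Monte Carlo averaged
per_trial = -np.log(np.exp(g).mean(axis=1))
mc_average = per_trial.mean()

# by Jensen's inequality, E[-log M_hat] >= -log E[M_hat] = -0.5, so the
# averaged term sits strictly above the true value -0.5 for small b
print(mc_average)
```

Since the bias enters with a positive sign, averaging $\hat{I}^b_{\mathrm{DV}}$ over trials pushes the "lower bound" upward, which is exactly the overestimation visible in Fig. 4.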

## 4 Conclusion

In this paper, we investigated the problem of estimating the conditional mutual information using a neural network. This was motivated by its application in learning encoders in communication systems. Since the conventional methods to estimate information-theoretic quantities do not scale well with dimension, recent works have proposed to estimate them utilizing neural networks. Although not shown in this work due to lack of space, our estimator also exhibits a better scaling with dimension than non–neural-based estimators.

Challenges in extending estimators of mutual information to the conditional case have been discussed. Additionally, we argued for the advantages of our method in terms of estimation bias and demonstrated its performance in estimating the secrecy transmission rate of a degraded wiretap channel. As a future direction, this method can be applied to other communication schemes and coupled with an optimizer for the encoders.