1 Introduction
Although originally conceived to address communication problems, information-theoretic measures have been insightful in many fields including statistics, signal processing, and even neuroscience. Entropy, KL divergence, and mutual information (MI) are extensively used to explain the behavior of and relations among random variables. For example, MI and its extensions are used to characterize the capacity of communication channels
[1] as well as to define notions of causality [2, 3]. Information-theoretic quantities have also made their way into machine learning and deep neural networks
[4]. They have been adopted to regularize optimization in neural networks [5, 6], or to express the flow of information through the layers of a deep network [7]. In a generative adversarial network (GAN), in which a neural network is optimized to perform well against the best-trained adversary, the relative entropy plays an eminent role [8, 9]. On the other hand, neural networks have also been applied in communication setups as part of the encoder/decoder blocks [10, 11, 12]. Learning the end-to-end communication system is challenging, though, and requires knowledge of the channel model, which is not available in many applications. One solution to this problem is to train a GAN model to mimic the channel, which can later be used to learn the whole system [13, 14]. Alternatively, one can design encoders and optimize them to achieve the maximum rate (capacity), which is characterized by the MI. In [15], the authors exploit a neural network estimator of the mutual information, proposed in [16], to optimize their encoders. Although estimating the MI has been studied extensively (see, e.g., [17, 18]), the capacity in many communication setups (such as the relay channel, channels with random state, and wiretap channels) is described by the conditional mutual information (CMI), which requires specialized estimators. Extending the existing estimators of MI for this purpose is not trivial and is the main focus of this paper. Motivated by [15], we investigate estimators using artificial neural networks.
The main challenges in estimating the CMI stem from empirically computing the conditional density function and from the curse of dimensionality. The latter is, in fact, a common problem in many data-driven estimators for MI and CMI. A conventional estimator in the literature for MI is based on the k-nearest neighbors (kNN) method [19], which has been extensively studied [20] and extended to estimate the CMI [21, 22, 23]. In a recent work, the authors of [16] propose a new approach to estimate the MI using a neural network, based on the Donsker–Varadhan representation of the KL divergence [24]. Improvements are shown in the performance of estimating the MI between high-dimensional variables compared to the kNN method. Several other works also take advantage of variational bounds [25] to estimate the MI and show similar improvements [26, 27]. For estimating the CMI, the authors of [28] introduce a neural network classifier to categorize the density of the input into two classes (joint and product); they then show that by training such a classifier, they can optimize a variational bound for the relative entropy and, correspondingly, for the MI and the CMI. Furthermore, they suggest the use of a GAN model, a variational autoencoder, and kNN to generate or select samples with the appropriate conditional density. Although the trained classifier proposed in [28]
asymptotically converges to the optimal function for the variational bound, a relatively high variance can be observed in the final output. Consequently, the final estimation is an average over several Monte Carlo trials. However, the variational bound used in
[28] contains nonlinearities, which result in a biased estimation when taking a Monte Carlo average. A similar problem is pointed out in [27], where the authors suggest a looser but linear variational bound. In this paper, we take advantage of the classifier technique applied to a linearized variational bound to avoid the bias problem. For simplicity, the kNN method is used to collect samples from the conditional density. In Section 2, the variational bounds are explained and we shed light on the Monte Carlo bias problem. Afterwards, the proposed technique to train the classifier and estimate the CMI is stated in Section 3. Then, we investigate the problem of estimating the secrecy capacity of the degraded wiretap channel, which is characterized by the CMI. Finally, the paper is concluded in Section 4.

2 Preliminaries
For random variables $(X, Y, Z)$ jointly distributed according to $p_{XYZ}$, the CMI is defined as:

$$I(X;Y|Z) = \mathbb{E}_{p_{Z}}\big[D\big(p_{XY|Z} \,\|\, p_{X|Z}\,p_{Y|Z}\big)\big], \tag{1}$$

where $D(p\|q)$ is the KL divergence between the probability density functions (pdfs) $p$ and $q$,

$$D(p\|q) = \mathbb{E}_{p}\left[\log\frac{p}{q}\right]. \tag{2}$$
The CMI characterizes the capacity of communication setups such as channels with random states and network communication models like the relay channel or the degraded wiretap channel (DWTC) [1]. For example, in the DWTC (see Fig. 1), the secrecy capacity is:

$$C_s = \max_{p_X} I(X;Y|Z). \tag{3}$$
By similar steps as in [25], we can bound (1) from below:

$$I(X;Y|Z) = \mathbb{E}_{p_{XYZ}}\!\left[\log\frac{q_{Y|XZ}}{p_{Y|Z}}\right] + \mathbb{E}_{p_{XZ}}\big[D\big(p_{Y|XZ}\,\|\,q_{Y|XZ}\big)\big] \;\ge\; \mathbb{E}_{p_{XYZ}}\!\left[\log\frac{q_{Y|XZ}}{p_{Y|Z}}\right], \tag{4}$$

where $q_{Y|XZ}$ is any arbitrary conditional pdf; the last step holds true since the relative entropy is non-negative. The lower bound in (4) is tight if the conditional density functions $p_{Y|XZ}$ and $q_{Y|XZ}$ are equal. As discussed in [27], by choosing an energy-based density, one can obtain lower bounds for the CMI, as stated in the following proposition.
Proposition 1.
For any function $f(x,y,z)$, let $q_{Y|XZ} = \dfrac{p_{Y|Z}\, e^{f}}{\mathbb{E}_{p_{Y|Z}}\!\left[e^{f}\right]}$; then the following bound holds and is tight when $f = \log\frac{p_{Y|XZ}}{p_{Y|Z}}$:

$$I(X;Y|Z) \;\ge\; \mathbb{E}_{p_{XYZ}}[f] \;-\; \mathbb{E}_{p_{XZ}}\!\left[\log \mathbb{E}_{p_{Y|Z}}\!\left[e^{f}\right]\right]. \tag{5}$$
Proof.
Choosing $q_{Y|XZ}$ as above and substituting in (4) yields the desired bound. ∎
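Writing $f$ for the function in Proposition 1 and $q_{Y|XZ} = p_{Y|Z}\,e^{f}/\mathbb{E}_{p_{Y|Z}}[e^{f}]$ for the energy-based density (our notation), the substitution can be spelled out as:

```latex
\mathbb{E}_{p_{XYZ}}\!\left[\log\frac{q_{Y|XZ}}{p_{Y|Z}}\right]
  = \mathbb{E}_{p_{XYZ}}\!\left[\log\frac{p_{Y|Z}\,e^{f}}
        {p_{Y|Z}\,\mathbb{E}_{p_{Y|Z}}\!\left[e^{f}\right]}\right]
  = \mathbb{E}_{p_{XYZ}}[f]
    - \mathbb{E}_{p_{XZ}}\!\left[\log\mathbb{E}_{p_{Y|Z}}\!\left[e^{f}\right]\right],
```

where the inner expectation depends only on $(x,z)$, so the outer average reduces to one over $p_{XZ}$; this is exactly the r.h.s. of (5).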
It is worth noting that computing the optimal $f$ is non-trivial since we may not have access to the joint pdf. Using Proposition 1 and Jensen’s inequality, a variant of the Donsker–Varadhan bound [24] can be obtained as follows:

$$I(X;Y|Z) \;\ge\; \mathbb{E}_{p_{XYZ}}[f] \;-\; \log \mathbb{E}_{p_{XZ}\,p_{Y|Z}}\!\left[e^{f}\right]. \tag{6}$$

Hereafter, let the r.h.s. of (6) be denoted as $I_{\mathrm{DV}}(f)$. The bound is tight when $f = \log\frac{p_{Y|XZ}}{p_{Y|Z}}$.
In order to estimate $I_{\mathrm{DV}}(f)$ from samples, an empirical average may be taken with respect to $p_{XYZ}$ and $p_{XZ}\,p_{Y|Z}$. Let $\mathcal{B}$ and $\mathcal{B}'$ be random batches of $b$ and $b'$ triples sampled respectively from $p_{XYZ}$ and $p_{XZ}\,p_{Y|Z}$; then the estimated $I_{\mathrm{DV}}$ for an arbitrary choice of $f$ is:

$$\hat{I}_{\mathrm{DV}}(f) = \frac{1}{b}\sum_{(x,y,z)\in\mathcal{B}} f(x,y,z) \;-\; \log\left(\frac{1}{b'}\sum_{(x,y,z)\in\mathcal{B}'} e^{f(x,y,z)}\right). \tag{7}$$
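As a concrete sketch of (7), the empirical estimate can be computed from two arrays of critic values (a minimal, hypothetical NumPy helper; the function $f$ is assumed to have been evaluated on the two batches already):

```python
import numpy as np

def dv_estimate(f_joint, f_prod):
    """Empirical Donsker--Varadhan-type estimate of the CMI lower bound (7).

    f_joint: values of f(x, y, z) on triples drawn from p(x, y, z).
    f_prod:  values of f(x, y, z) on triples drawn from p(x, z) p(y|z).
    """
    # First term: empirical mean of f over the joint batch (size b).
    # Second term: log of the empirical mean of e^f over the product batch (size b').
    return np.mean(f_joint) - np.log(np.mean(np.exp(f_prod)))
```

For instance, a critic that is identically zero yields an estimate of exactly zero, as expected from the bound.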
Since the outcomes of the previous estimator exhibit a relatively high variance [27, 28, 29], the authors of [28] averaged the estimated results over several trials. However, due to the concavity of the logarithm in the second term of $\hat{I}_{\mathrm{DV}}$, a Monte Carlo average of different instances of $\hat{I}_{\mathrm{DV}}$ creates an upper bound for $I_{\mathrm{DV}}$, while $I_{\mathrm{DV}}$ is itself a lower bound of the CMI. This issue can be resolved by taking a looser bound where the logarithmic term is linearized. A similar bound was obtained by Nguyen, Wainwright, and Jordan [30], and was also adopted in [16, 28] to estimate the MI, where it was denoted f-MINE (since it corresponds to a variational representation of the f-divergence). The corresponding bound for the CMI is:
$$I(X;Y|Z) \;\ge\; \mathbb{E}_{p_{XYZ}}[f] \;-\; \frac{1}{e}\,\mathbb{E}_{p_{XZ}\,p_{Y|Z}}\!\left[e^{f}\right], \tag{8}$$

where we use $\log u \le u/e$ in (6). In this paper, let $I_{\mathrm{NWJ}}(f)$ refer to the lower bound (8). The bound is tight for $f = 1 + \log\frac{p_{Y|XZ}}{p_{Y|Z}}$, while a Monte Carlo average of several instances of the estimator is unbiased and, consequently, justified for estimating a lower bound on the CMI.
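The Monte Carlo bias discussed above can be checked numerically. Since the logarithm is concave, the negated log term of the DV-type estimate, averaged over many small batches, sits above its exact value, so averaged small-batch estimates overshoot the bound. A toy NumPy sketch (the lognormal distribution is an arbitrary stand-in for $e^{f}$ under the product density; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lognormal samples stand in for e^f; for G ~ N(0, 1), E[e^G] = e^{1/2}.
true_mean = np.exp(0.5)
trials, batch = 2000, 20

# Monte Carlo average of -log(batch mean), mimicking an average of
# several small-batch evaluations of the second term of the DV estimate.
avg_of_logs = np.mean([
    -np.log(np.mean(rng.lognormal(0.0, 1.0, batch)))
    for _ in range(trials)
])

# The term appearing in the exact bound: -log of the true mean.
log_of_avg = -np.log(true_mean)

# By Jensen's inequality, avg_of_logs exceeds log_of_avg:
# averaging the concave log term overestimates the exact one.
```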
Although the optimal choice for $f$ is known, it is non-trivial to compute since the true joint density is unknown. Several approaches have been proposed to estimate $f$ for MI and CMI, including [16, 27] and [28]. In [16], the search space is restricted to functions generated via a neural network; then, using mini-batch gradient descent, the r.h.s. of (8) is optimized. Even though training the network can be unstable [27], their method is shown to scale better with dimension than the kNN estimator for mutual information in [19]. Alternatively, [27] discusses estimators for the log density ratio to be used in variational bounds, denoted there as the Jensen–Shannon estimator. The motivation originates in the form of the functions that optimize these bounds. Similarly, to estimate $f$, [28] introduces a binary classifier using a neural network to discriminate samples from $p_{XYZ}$ and $p_{XZ}\,p_{Y|Z}$, and shows that with the cross-entropy loss, the density ratio present in the optimal $f$ can be estimated. The authors then applied this classifier to estimate the CMI using $\hat{I}_{\mathrm{DV}}$.
Despite the variational bound for the CMI resembling that for the MI, in practice the empirical estimation cannot be immediately extended, due to the appearance of the conditional density in preparing the batches. A proposed workaround in [28] is to compute the CMI as a difference of two MIs, i.e., $I(X;Y|Z) = I(X;Y,Z) - I(X;Z)$. However, given that both estimated MIs are lower bounds of the true values, it is not clear whether their difference is a lower or an upper bound of the desired CMI.
Nonetheless, the authors of [28] do propose several methods to construct the batches, including a GAN, a variational autoencoder, and the kNN method. The classifier is then trained on these batches to distinguish samples from either $p_{XYZ}$ or $p_{XZ}\,p_{Y|Z}$. In this work, due to limited space, we focus only on the kNN method to create the batches, and we train the classifier similarly to [28]. The main advantage of our work is the use of $\hat{I}_{\mathrm{NWJ}}$ to estimate the CMI, instead of $\hat{I}_{\mathrm{DV}}$ as applied in [28], which resolves the Monte Carlo bias problem.
3 Main results
In this section we explain the steps to estimate the CMI using neural networks. The estimator is subsequently exploited to compute the secrecy capacity of a degraded wiretap channel.
3.1 Constructing the batches
A major challenge in estimating the CMI for continuous random variables is dealing with the aforementioned conditional density function. Suppose we have a dataset $\{(x_i, y_i, z_i)\}_{i=1}^{n}$ generated from a joint density $p_{XYZ}$. In order to construct the batch of samples from $p_{XYZ}$, we can take random choices of triples from the dataset. However, in order to build the batch of samples from $p_{XZ}\,p_{Y|Z}$, it is not straightforward to take a sample according to $p_{Y|Z}$ for a chosen pair $(x_i, z_i)$ distributed by $p_{XZ}$. Here, to make this tractable, we exploit the notion of kNN as follows. Consider a pair $(x_i, z_i)$ chosen randomly from the dataset, which represents a sample from $p_{XZ}$. In order to sample from $p_{Y|Z}$, this technique tells us to choose the $y_j$ corresponding to a triple $(x_j, y_j, z_j)$ such that $z_j$ is one of the $k$ nearest neighbors of $z_i$.

3.2 Classifier
Let $\mathcal{C}_1$ denote the class of samples distributed according to $p_{XYZ}$, while $\mathcal{C}_2$ represents the class of samples randomly taken from $p_{XZ}\,p_{Y|Z}$, and assume the classes are equally probable. Consider a neural network with samples of $(x,y,z)$ as input and corresponding output $\Gamma(x,y,z)$, where the objective is to classify the samples. Using a sigmoid function $\sigma(\cdot)$, one can map the output to a real value in $(0,1)$. Let $\hat{c} = \sigma(\Gamma(x,y,z))$ be the classifier’s decision, and assume the cross-entropy loss, defined as

$$\mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m}\Big[c_i \log\hat{c}_i + (1 - c_i)\log\big(1 - \hat{c}_i\big)\Big], \tag{9}$$

to be minimized, where $c_i = 1$ if sample $i$ belongs to $\mathcal{C}_1$ and $c_i = 0$ otherwise; it is conventional to interpret $\hat{c}_i$ as the estimated posterior probability of class $\mathcal{C}_1$. With a sufficient number of samples and by the law of large numbers, the loss function converges to its expected value. Thus, for a sufficiently trained network, the output $\hat{c}$ is close to the optimal classifier, which verifies that:

$$\frac{\hat{c}(x,y,z)}{1 - \hat{c}(x,y,z)} \approx \frac{p_{XYZ}(x,y,z)}{p_{XZ}(x,z)\,p_{Y|Z}(y|z)}. \tag{10}$$
In conclusion, by minimizing the cross-entropy loss, the trained network generates an output which determines the density ratio between the pdfs for a particular triple $(x,y,z)$.
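The density-ratio recovery in (10) can be reproduced in a toy setting. In the sketch below (all choices are illustrative assumptions), one-dimensional Gaussians $\mathcal{N}(1,1)$ and $\mathcal{N}(0,1)$ stand in for the joint and product densities, so the true log density ratio is $x - 1/2$, and a plain logistic regression trained with the cross-entropy loss replaces the neural network; its learned weight and bias should approach $(1, -1/2)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy classes: "joint" samples ~ N(1, 1) with label 1,
# "product" samples ~ N(0, 1) with label 0 (equal priors).
n = 20000
x = np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n)])
c = np.concatenate([np.ones(n), np.zeros(n)])

# Minimal logistic regression trained by gradient descent on the
# cross-entropy loss; c_hat = sigmoid(w * x + b).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    c_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= lr * np.mean((c_hat - c) * x)   # gradient of the loss w.r.t. w
    b -= lr * np.mean(c_hat - c)         # gradient of the loss w.r.t. b

# c_hat / (1 - c_hat) = exp(w * x + b), so (w, b) close to (1, -0.5)
# recovers the true density ratio, in line with (10).
```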
3.3 Lower bound on the CMI
We choose to estimate the CMI with $\hat{I}_{\mathrm{NWJ}}$ since its Monte Carlo average is unbiased and the bound is tight at the optimum. We train a neural-network-based classifier as previously described and compute (10); this allows us to obtain the optimal $f$ for (8), i.e.,

$$f^{*}(x,y,z) = 1 + \log\frac{\hat{c}(x,y,z)}{1 - \hat{c}(x,y,z)}. \tag{11}$$

Hence, the empirical estimation is computed as:

$$\hat{I}_{\mathrm{NWJ}} = \frac{1}{b}\sum_{(x,y,z)\in\mathcal{B}} f^{*}(x,y,z) \;-\; \frac{1}{e\,b'}\sum_{(x,y,z)\in\mathcal{B}'} e^{f^{*}(x,y,z)}, \tag{12}$$

where $\mathcal{B}$ and $\mathcal{B}'$ are batches of size $b$ and $b'$ drawn from $p_{XYZ}$ and $p_{XZ}\,p_{Y|Z}$, respectively.
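Given classifier outputs on the two batches, (11) and (12) amount to a few lines (a hypothetical NumPy helper; `c_joint` and `c_prod` denote the classifier probabilities on the joint and product batches):

```python
import numpy as np

def nwj_from_classifier(c_joint, c_prod):
    """CMI estimate via (11)-(12) from classifier probabilities in (0, 1)."""
    # Optimal critic for the linearized bound: f* = 1 + log(c_hat / (1 - c_hat)).
    f_joint = 1.0 + np.log(c_joint / (1.0 - c_joint))
    f_prod = 1.0 + np.log(c_prod / (1.0 - c_prod))
    # Linearized (NWJ-type) bound: mean of f* minus (1/e) times mean of e^{f*}.
    return np.mean(f_joint) - np.mean(np.exp(f_prod)) / np.e
```

An uninformative classifier (probability 1/2 everywhere) gives an estimate of exactly zero.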
The steps of the estimation are stated in Algorithm 1.
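The batch-construction step of Section 3.1, which feeds the estimation above, might be sketched as follows (a hypothetical helper; one-dimensional $z$ and Euclidean distance are simplifying assumptions):

```python
import numpy as np

def knn_product_batch(x, y, z, k, rng):
    """Build a batch approximately distributed as p(x, z) p(y|z).

    For each randomly chosen pair (x_i, z_i), the y-coordinate is replaced
    by the y of a triple whose z is among the k nearest neighbors of z_i.
    """
    n = len(z)
    idx = rng.integers(0, n, size=n)           # random pairs (x_i, z_i)
    batch = []
    for i in idx:
        d = np.abs(z - z[i])                   # distances in z (1-D case)
        neighbors = np.argsort(d)[1:k + 1]     # k nearest, excluding i itself
        j = rng.choice(neighbors)              # pick one neighbor at random
        batch.append((x[i], y[j], z[i]))       # y_j acts as a sample of p(y|z_i)
    return batch
```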
3.4 Numerical results
Wiretap channel
As motivated in Section 1, estimators for the CMI can be adopted to compute the capacity of communication systems and to optimize encoders accordingly. One example is the DWTC (Fig. 1), where a source transmits a message to a legitimate receiver while keeping it secret from an eavesdropper who has access to a degraded signal. The secrecy capacity (3) can be estimated if samples of $(X, Y, Z)$ are available. Consider the channels $X \to Y$ and $Y \to Z$ to have additive Gaussian noise with variances $\sigma_1^2$ and $\sigma_2^2$, respectively; then the model simplifies as shown in Fig. 2. So, for any input $X$ with $\mathbb{E}[X^2] \le P$, the CMI can be bounded as:

$$I(X;Y|Z) \le \frac{1}{2}\log\!\left(1 + \frac{P}{\sigma_1^2}\right) - \frac{1}{2}\log\!\left(1 + \frac{P}{\sigma_1^2 + \sigma_2^2}\right), \tag{13}$$

with equality when $X \sim \mathcal{N}(0, P)$. For our estimation, we consider the input to be zero-mean Gaussian with variance $P$; we collect samples of $(X, Y, Z)$ according to the described model and create the batches accordingly. The neural network in our experiment has two layers with 64 hidden ReLU activation units, and we use the Adam optimizer with 100 epochs. Final estimations are averages over Monte Carlo trials.

The estimated CMI is depicted in Fig. 3 with respect to $\sigma_2^2$ and for different choices of the number of neighbors $k$. For $\sigma_2^2 = 0$, i.e., when the eavesdropper has access to the same signal as the legitimate receiver, the secrecy capacity is zero. It can be observed that increasing $k$ results in a better estimation, since the kNN density estimator is shown to be asymptotically consistent when $k \to \infty$ and $k/n \to 0$.
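For reference, the analytic benchmark (13) under the Gaussian model can be evaluated directly (a small sketch; the argument names for the input power and the two noise variances are our own):

```python
import numpy as np

def dwtc_secrecy_capacity(P, var1, var2):
    """Secrecy capacity (in nats) of the degraded Gaussian wiretap channel,
    Y = X + N1 and Z = Y + N2 with N1 ~ N(0, var1), N2 ~ N(0, var2)."""
    # C_s = I(X;Y) - I(X;Z) for a Gaussian input with power P, cf. (13).
    return 0.5 * np.log(1.0 + P / var1) - 0.5 * np.log(1.0 + P / (var1 + var2))
```

Consistent with Fig. 3, the capacity vanishes when `var2 = 0` (the eavesdropper sees the legitimate signal) and grows with the degradation `var2`.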
Monte Carlo bias
To give some insight into the discussion in Section 2, we compare the estimators $\hat{I}_{\mathrm{DV}}$ and $\hat{I}_{\mathrm{NWJ}}$ for the previously defined DWTC (fixing $\sigma_2^2$) in Fig. 4. The classifier is trained as described above; to compute the lower bounds, the batches $\mathcal{B}$ and $\mathcal{B}'$ are drawn and the results are averaged over several Monte Carlo trials. This procedure is repeated to produce the boxplots.
It can be observed that, for small batch sizes, averaging the lower bound (7) over trials actually yields an overestimation of the true CMI far more often than with (12). This is a joint effect of the Monte Carlo bias and the small sample size, the latter affecting both estimators. Although not shown here, increasing the batch size reduces the variance of the estimators, but the mean of $\hat{I}_{\mathrm{DV}}$ still remains above the true CMI for small batches. This justifies our decision to use $\hat{I}_{\mathrm{NWJ}}$.
4 Conclusion
In this paper, we investigated the problem of estimating the conditional mutual information using a neural network. This was motivated by its application in learning encoders in communication systems. Since the conventional methods to estimate information-theoretic quantities do not scale well with dimension, recent works have proposed to estimate them utilizing neural networks. Although not shown in this work due to lack of space, our estimator also exhibits better scaling with dimension than non-neural estimators.
The challenges of extending estimators of mutual information to the conditional case have been discussed. Additionally, we argued for the advantages of our method in terms of estimation bias and showed its performance in estimating the secrecy transmission rate of a degraded wiretap channel. As a future direction, this method can be applied to other communication schemes and coupled with an optimizer for encoders.
References
 [1] A. El Gamal and Y.-H. Kim, Network information theory. Cambridge University Press, 2011.
 [2] C. J. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, “Estimating the directed information to infer causal relationships in ensemble neural spike train recordings,” J. Comput. Neurosci., vol. 30, no. 1, pp. 17–44, Feb. 2011.
 [3] S. Molavipour, G. Bassi, and M. Skoglund, “Testing for directed information graphs,” in 2017 55th Annual Allerton Conf. on Comm., Control, Comput. (Allerton), Oct. 2017, pp. 212–219.
 [4] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv:physics/0004057, Apr. 2000.
 [5] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv:1808.06670, Aug. 2018.
 [6] T. Tanaka, H. Sandberg, and M. Skoglund, “Transfer-entropy-regularized Markov decision processes,” arXiv:1708.09096, Aug. 2017.
 [7] M. Gabrié, A. Manoel, C. Luneau, N. Macris, F. Krzakala, L. Zdeborová et al., “Entropy and mutual information in models of deep neural networks,” in Adv. Neural. Inf. Process. Syst., 2018, pp. 1821–1831.
 [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. Neural. Inf. Process. Syst., 2014, pp. 2672–2680.
 [9] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Adv. Neural. Inf. Process. Syst., 2016, pp. 271–279.

 [10] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. on Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
 [11] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 132–143, Feb. 2018.
 [12] R. Fritschek, R. F. Schaefer, and G. Wunder, “Deep learning for the Gaussian wiretap channel,” in 2019 IEEE Int. Conf. Comm. (ICC), May 2019, pp. 1–6.
 [13] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, “Channel agnostic end-to-end learning based communication systems with conditional GAN,” in 2018 IEEE Globecom Workshops, Dec. 2018, pp. 1–5.
 [14] T. J. O’Shea, T. Roy, N. West, and B. C. Hilburn, “Physical layer communications system design over-the-air using adversarial networks,” in 2018 26th European Signal Process. Conf. (EUSIPCO), Sep. 2018, pp. 529–532.
 [15] R. Fritschek, R. F. Schaefer, and G. Wunder, “Deep learning for channel coding via neural mutual information estimation,” arXiv:1903.02865, Mar. 2019.
 [16] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “MINE: Mutual Information Neural Estimation,” in 35th Int. Conf. Mach. Learn. (ICML), Jul. 2018, pp. 531–540.
 [17] Q. Wang, S. R. Kulkarni, and S. Verdú, “Universal estimation of information measures for analog sources,” Foundations and Trends® in Communications and Information Theory, vol. 5, no. 3, pp. 265–353, 2009.
 [18] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, Jun. 2003.
 [19] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, no. 6, p. 066138, Jun. 2004.
 [20] W. Gao, S. Oh, and P. Viswanath, “Demystifying fixed nearest neighbor information estimators,” IEEE Trans. Inf. Theory, vol. 64, no. 8, pp. 5629–5661, Aug. 2018.
 [21] J. Runge, “Conditional independence testing based on a nearestneighbor estimator of conditional mutual information,” arXiv:1709.01447, Sep. 2017.
 [22] M. Vejmelka and M. Paluš, “Inferring the directionality of coupling with conditional mutual information,” Physical Review E, vol. 77, no. 2, p. 026214, Feb. 2008.
 [23] S. Frenzel and B. Pompe, “Partial mutual information for coupling analysis of multivariate time series,” Phys. Rev. Lett., vol. 99, no. 20, p. 204101, Nov. 2007.
 [24] M. D. Donsker and S. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. IV,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, Mar. 1983.
 [25] D. Barber and F. V. Agakov, “The IM algorithm: a variational approach to information maximization,” in Adv. Neural. Inf. Process. Syst., 2003, pp. 201–208.
 [26] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv:1807.03748, Jul. 2018.
 [27] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker, “On variational lower bounds of mutual information,” in NeurIPS Workshop on Bayesian Deep Learning, Dec. 2018.
 [28] S. Mukherjee, H. Asnani, and S. Kannan, “CCMI: Classifier based Conditional Mutual Information Estimation,” arXiv:1906.01824, Jun. 2019.
 [29] D. McAllester and K. Stratos, “Formal limitations on the measurement of mutual information,” arXiv:1811.04251, Nov. 2018.
 [30] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Trans. Inf. Theory, vol. 56, no. 11, pp. 5847–5861, Nov. 2010.