# Generalized Bregman Divergence and Gradient of Mutual Information for Vector Poisson Channels

We investigate connections between information-theoretic and estimation-theoretic quantities in vector Poisson channel models. In particular, we generalize the gradient of mutual information with respect to key system parameters from the scalar to the vector Poisson channel model. We also propose, as another contribution, a generalization of the classical Bregman divergence that offers a means to encapsulate under a unifying framework the gradient of mutual information results for scalar and vector Poisson and Gaussian channel models. The so-called generalized Bregman divergence is also shown to exhibit various properties akin to the properties of the classical version. The vector Poisson channel model is drawing considerable attention in view of its application in various domains: as an example, the availability of the gradient of mutual information can be used in conjunction with gradient descent methods to effect compressive-sensing projection designs in emerging X-ray and document classification applications.


## I Introduction

There has been a recent emergence of intimate connections between various quantities in information theory and estimation theory. Perhaps the most prominent connections reveal the interplay between two notions with operational relevance in each of the domains: mutual information and conditional mean estimation.

In particular, Guo, Shamai and Verdú [1] have expressed the derivative of mutual information in a scalar Gaussian channel via the (non-linear) minimum mean-squared error (MMSE), and Palomar and Verdú [2] have expressed the gradient of mutual information in a vector Gaussian channel in terms of the MMSE matrix. The connections have also been extended from the scalar Gaussian to the scalar Poisson channel model, which has been ubiquitously used to model optical communications [3, 4]. Recently, parallel results for scalar binomial and negative binomial channels have been established [5, 6]. Inspired by the Liptser-Shiryaev formula [7], it has been demonstrated that it is often easier to investigate the gradient of mutual information than mutual information itself [3]. Further, it has been shown that the derivative of mutual information with respect to key system parameters also relates to the conditional mean estimator [3].

This paper also pursues this overarching theme. One of the goals is to generalize the gradient of mutual information from scalar to vector Poisson channel models. This generalization is relevant not only from the theoretical but also from the practical perspective, in view of the numerous emerging applications of the vector Poisson channel model in X-ray systems [8] and document classification systems (based on word counts) [9]. The availability of the gradient then provides the means to optimize the mutual information with respect to specific system parameters via gradient descent methods.

The other goal is to encapsulate under a unified framework the gradient of mutual information results for scalar Gaussian channels, scalar Poisson channels and their vector counterparts.

This encapsulation is inspired by recent results that express the derivative of mutual information in scalar Poisson channels as the average value of the Bregman divergence, associated with a particular loss function, between the input and the conditional mean estimate of the input [10]. It is made possible by constructing a generalization of the classical Bregman divergence from the scalar to the vector case. To the best of our knowledge, this generalization of the Bregman divergence appears to be new. The gradients of mutual information of the vector Poisson model and the vector Gaussian model, as well as their scalar counterparts, are then expressed, akin to [10], in terms of the average value of the so-called generalized Bregman divergence, associated with a particular (vector) loss function, between the input vector and the conditional mean estimate of the input vector.

We also study in detail various properties of the generalized Bregman divergence: the properties of the proposed divergence are shown to mimic closely those of the classical Bregman divergence.

The generalized Bregman divergence framework is of interest not only from the theoretical but also the practical standpoint: for example, it has been shown that re-expressing results via a Bregman divergence can often lead to enhancements to the speed of various optimization algorithms [11].

This paper is organized as follows: Section II introduces the channel model. Section III derives the gradient of mutual information with respect to key system parameters for vector Poisson channel models. Section IV introduces the notion of a generalized Bregman divergence and its properties. Section V re-derives the gradient of mutual information of vector Poisson and Gaussian channel models in the light of the proposed Bregman divergence. A possible application of the theoretical results in an emerging domain is succinctly described in Section VI. Section VII concludes the paper.

## II The Vector Poisson Channel

We define the vector Poisson channel model via the random transformation:

$$P(Y|X)=\prod_{i=1}^{m}P(Y_i|X)=\prod_{i=1}^{m}\mathrm{Pois}\big((\Phi X)_i+\lambda_i\big) \tag{1}$$

where the random vector $X$ represents the channel input, the random vector $Y=(Y_1,\dots,Y_m)$ represents the channel output, the matrix $\Phi$ represents a linear transformation whose role is to entangle the different inputs, and the vector $\lambda=(\lambda_1,\dots,\lambda_m)$ represents the dark current. $\mathrm{Pois}(z)$ denotes a standard Poisson distribution with parameter $z$.
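As a minimal illustration of the channel in (1), the sketch below draws channel outputs with NumPy. The dimensions, the gamma-distributed input and the numerical values of `Phi` and `lam` are assumptions made only for this example; the model itself does not prescribe them.

```python
import numpy as np

def sample_vector_poisson_channel(Phi, lam, X, rng):
    """Draw one output vector per row of X from the channel in (1).

    Phi: (m, n) non-negative matrix, lam: (m,) dark current,
    X: (num_samples, n) non-negative inputs.
    """
    rates = X @ Phi.T + lam          # row k holds (Phi x_k)_i + lam_i
    return rng.poisson(rates)        # independent Poisson draw per entry

rng = np.random.default_rng(0)
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])    # hypothetical 2 x 3 scaling matrix
lam = np.array([0.1, 0.2])           # hypothetical dark current
X = rng.gamma(shape=2.0, scale=1.0, size=(20_000, 3))  # assumed input law
Y = sample_vector_poisson_channel(Phi, lam, X, rng)
```

Each row of `Y` is a vector of counts whose conditional mean, given the corresponding row of `X`, is the matching row of `rates`.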

This vector Poisson channel model, with arbitrary input and output dimensions, is a generalization of the standard scalar Poisson model, in which both dimensions equal one, given by [3, 10]:

$$P(Y|X)=\mathrm{Pois}(\phi X+\lambda) \tag{2}$$

where the scalar random variables $X$ and $Y$ are associated with the input and output of the scalar channel, respectively, $\phi$ is a scaling factor, and $\lambda$ is associated with the dark current. (Footnote 1: Except for the scaling matrix and the scaling factor, we use identical notation for the scalar Poisson channel and the vector Poisson channel. The context defines whether we are dealing with scalar or vector quantities.)

The generalization of the scalar Poisson model in (2) to the vector one in (1) offers the means to address relevant problems in various emerging applications, most notably in X-ray and document classification applications as discussed in the sequel [9, 12].

The goal is to define the gradient of mutual information between the input and the output of the vector Poisson channel with respect to the scaling matrix, i.e.

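$$\nabla_\Phi I(X;Y)=\big[\nabla_\Phi I(X;Y)_{ij}\big] \tag{3}$$

where $\nabla_\Phi I(X;Y)_{ij}$ represents the $(i,j)$-th entry of the matrix $\nabla_\Phi I(X;Y)$, and with respect to the dark current, i.e.

$$\nabla_\lambda I(X;Y)=\big[\nabla_\lambda I(X;Y)_{i}\big] \tag{4}$$

where $\nabla_\lambda I(X;Y)_{i}$ represents the $i$-th entry of the vector $\nabla_\lambda I(X;Y)$.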

We will also be concerned with drawing connections between the gradient result for the vector Poisson channel and the gradient result for the Gaussian counterpart in the sequel. In particular, we will consider the vector Gaussian channel model given by:

$$Y=\Phi X+N \tag{5}$$

where $X$ represents the vector-valued channel input, $Y$ represents the vector-valued channel output, $\Phi$ represents the channel matrix, and $N$ represents white Gaussian noise.

It has been established that the gradient of mutual information between the input and the output of the vector Gaussian channel model in (5) with respect to the channel matrix obeys the simple relationship [2]:

$$\nabla_\Phi I(X;Y)=\Phi\mathbf{E}, \tag{6}$$

where

$$\mathbf{E}=\mathbb{E}\big[(X-\mathbb{E}(X|Y))(X-\mathbb{E}(X|Y))^{T}\big] \tag{7}$$

denotes the MMSE matrix.
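The relationship in (6) and (7) can be checked numerically in a case where everything is available in closed form. The sketch below assumes a zero-mean Gaussian input with covariance `Sigma` and unit-covariance noise, which are choices made for this illustration only; under these assumptions the mutual information is $\tfrac{1}{2}\log\det(I+\Phi\Sigma\Phi^{T})$ and the MMSE matrix is $(\Sigma^{-1}+\Phi^{T}\Phi)^{-1}$.

```python
import numpy as np

def gaussian_mi(Phi, Sigma):
    """I(X;Y) in nats for X ~ N(0, Sigma), Y = Phi X + N, N ~ N(0, I)."""
    m = Phi.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(m) + Phi @ Sigma @ Phi.T)
    return 0.5 * logdet

def mmse_matrix(Phi, Sigma):
    """MMSE matrix of the Gaussian input: (Sigma^{-1} + Phi^T Phi)^{-1}."""
    return np.linalg.inv(np.linalg.inv(Sigma) + Phi.T @ Phi)

def numerical_gradient(f, A, h=1e-6):
    """Entry-wise central finite differences of a scalar function of a matrix."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        step = np.zeros_like(A)
        step[idx] = h
        grad[idx] = (f(A + step) - f(A - step)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
Phi = rng.normal(size=(2, 3))             # hypothetical channel matrix
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)               # hypothetical input covariance

closed_form = Phi @ mmse_matrix(Phi, Sigma)                       # right side of (6)
finite_diff = numerical_gradient(lambda P: gaussian_mi(P, Sigma), Phi)
```

Under the stated assumptions, `closed_form` and `finite_diff` agree up to the finite-difference error.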

## III Gradient of Mutual Information for Vector Poisson Channels

We now introduce the gradient of mutual information with respect to the scaling matrix and with respect to the dark current for vector Poisson channel models. In particular, we assume throughout that the regularity conditions necessary to interchange freely the order of integration and differentiation hold, i.e., that the order of the differential operators $\partial/\partial\Phi_{ij}$ and $\partial/\partial\lambda_i$ and the expectation operator $\mathbb{E}[\cdot]$ can be exchanged. (Footnote 2: For convenience, we consider natural logarithms throughout the paper.)

###### Theorem 1.

Consider the vector Poisson channel model in (1). Then, the gradient of mutual information between the input and output of the channel with respect to the scaling matrix is given by:

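$$\big[\nabla_\Phi I(X;Y)_{ij}\big]=\Big[\mathbb{E}\big[X_j\log((\Phi X)_i+\lambda_i)\big]-\mathbb{E}\big[\mathbb{E}[X_j|Y]\,\log\mathbb{E}[(\Phi X)_i+\lambda_i|Y]\big]\Big], \tag{8}$$

and with respect to the dark current is given by:

$$\big[\nabla_\lambda I(X;Y)_{i}\big]=\Big[\mathbb{E}\big[\log((\Phi X)_i+\lambda_i)\big]-\mathbb{E}\big[\log\mathbb{E}[(\Phi X)_i+\lambda_i|Y]\big]\Big], \tag{9}$$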

irrespective of the input distribution provided that the regularity conditions hold.

It is clear that Theorem 1 represents a multi-dimensional generalization of Theorems 1 and 2 in [3]. The scalar result follows immediately from the vector counterpart by taking the input and output dimensions both equal to one.
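Theorem 1 can be sanity-checked numerically when the input has finite support, because the expectations in (8) and (9) then reduce to finite sums over the input support and a truncated output grid. The sketch below is an illustration only: the support points `xs`, the probabilities `px`, the values of `Phi` and `lam`, and the truncation level `ymax` are all assumptions, and the comparison is against central finite differences of the (truncated) mutual information.

```python
import itertools
import numpy as np

def _tables(Phi, lam, xs, px, ymax):
    """Rates, log-likelihoods and joint pmf P(x_k, y) on a truncated output grid."""
    m = Phi.shape[0]
    rates = xs @ Phi.T + lam                                           # (K, m)
    ys = np.array(list(itertools.product(range(ymax + 1), repeat=m)))  # (G, m)
    logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, ymax + 1)))))
    # log P(y | x_k) = sum_i [ y_i log r_ki - r_ki - log(y_i!) ]
    loglik = (ys[None] * np.log(rates)[:, None] - rates[:, None]
              - logfact[ys][None]).sum(axis=2)                         # (K, G)
    return rates, loglik, px[:, None] * np.exp(loglik)

def mutual_information(Phi, lam, xs, px, ymax=40):
    """I(X;Y) in nats, with the output alphabet truncated at ymax per coordinate."""
    _, loglik, joint = _tables(Phi, lam, xs, px, ymax)
    py = joint.sum(axis=0)
    return float(np.sum(joint * (loglik - np.log(py))))

def gradients_from_theorem(Phi, lam, xs, px, ymax=40):
    """Evaluate the right-hand sides of (8) and (9)."""
    rates, _, joint = _tables(Phi, lam, xs, px, ymax)
    py = joint.sum(axis=0)                 # P(y)
    post = joint / py                      # P(x_k | y), shape (K, G)
    xhat = post.T @ xs                     # E[X | Y = y], shape (G, n)
    rhat = post.T @ rates                  # E[(Phi X) + lam | Y = y], shape (G, m)
    grad_phi = ((px[:, None] * np.log(rates)).T @ xs
                - (py[:, None] * np.log(rhat)).T @ xhat)
    grad_lam = px @ np.log(rates) - py @ np.log(rhat)
    return grad_phi, grad_lam

def finite_difference(f, A, h=1e-5):
    """Entry-wise central finite differences of a scalar function of an array."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        step = np.zeros_like(A)
        step[idx] = h
        grad[idx] = (f(A + step) - f(A - step)) / (2 * h)
    return grad

# Hypothetical two-input, two-output example with a three-point input distribution.
xs = np.array([[0.5, 1.0], [2.0, 0.3], [1.0, 2.5]])
px = np.array([0.2, 0.5, 0.3])
Phi = np.array([[1.0, 0.5], [0.2, 1.5]])
lam = np.array([0.3, 0.4])

g_phi, g_lam = gradients_from_theorem(Phi, lam, xs, px)
fd_phi = finite_difference(lambda A: mutual_information(A, lam, xs, px), Phi)
fd_lam = finite_difference(lambda v: mutual_information(Phi, v, xs, px), lam)
```

In this setting `g_phi` matches `fd_phi` and `g_lam` matches `fd_lam` up to finite-difference and truncation error.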

###### Corollary 1.

Consider the scalar Poisson channel model in (2). Then, the derivative of mutual information between the input and output of the channel with respect to the scaling factor is given by:

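$$\frac{\partial}{\partial\phi}I(X;Y)=\mathbb{E}\big[X\log(\phi X+\lambda)\big]-\mathbb{E}\big[\mathbb{E}[X|Y]\,\log\mathbb{E}[\phi X+\lambda|Y]\big], \tag{10}$$

and with respect to the dark current is given by:

$$\frac{\partial}{\partial\lambda}I(X;Y)=\mathbb{E}\big[\log(\phi X+\lambda)\big]-\mathbb{E}\big[\log\mathbb{E}[\phi X+\lambda|Y]\big], \tag{11}$$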

irrespective of the input distribution provided that the regularity conditions hold.
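One immediate consequence of (11), added here as a short worked remark, is that the mutual information cannot increase with the dark current. Applying Jensen's inequality to the concave logarithm, conditionally on $Y$, gives

$$\mathbb{E}\big[\log(\phi X+\lambda)\,\big|\,Y\big]\le\log\mathbb{E}\big[\phi X+\lambda\,\big|\,Y\big]\quad\Longrightarrow\quad\frac{\partial}{\partial\lambda}I(X;Y)=\mathbb{E}\Big[\mathbb{E}\big[\log(\phi X+\lambda)\,\big|\,Y\big]-\log\mathbb{E}\big[\phi X+\lambda\,\big|\,Y\big]\Big]\le 0.$$

The same argument applies entry by entry to the vector expression in (9).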

It is also of interest to note that the gradient of mutual information for vector Poisson channels appears to admit an interpretation akin to that of the gradient of mutual information for vector Gaussian channels in (6) and (7) (see also [2]): Both gradient results can be expressed in terms of the average of a multi-dimensional measure of the error between the input vector and the conditional mean estimate of the input vector under appropriate loss functions. This interpretation can be made precise – as well as unified – by constructing a generalized notion of Bregman divergence that encapsulates the classical one.

## IV Generalized Bregman Divergences: Definitions and Properties

The classical Bregman divergence was originally constructed to determine common points of convex sets [13]. It was later discovered that the Bregman divergence induces numerous well-known metrics and has a bijection to the exponential family [14].

###### Definition 1 (Classical Bregman Divergence [13]).

Let $F$ be a continuously-differentiable, real-valued and strictly convex function defined on a closed convex set. The Bregman divergence between two points $x$ and $y$ of this set is defined as follows:

$$D_F(x,y):=F(x)-F(y)-\langle\nabla F(y),x-y\rangle. \tag{12}$$

Note that different choices of the function $F$ induce different metrics. For example, squared Euclidean distance, Kullback-Leibler divergence, Mahalanobis distance and many other widely-used distances are specializations of the Bregman divergence associated with different choices of the function $F$ [14].
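To make the specializations concrete, the following sketch evaluates (12) for two standard choices of $F$: the squared norm, which yields the squared Euclidean distance, and the negative entropy, which yields the Kullback-Leibler divergence on the probability simplex. The function names and test vectors are ours and purely illustrative.

```python
import numpy as np

def bregman(F, grad_F, x, y):
    """Classical Bregman divergence D_F(x, y) as in (12)."""
    return F(x) - F(y) - np.dot(grad_F(y), x - y)

# F(v) = ||v||^2 gives the squared Euclidean distance.
sq_norm = lambda v: np.dot(v, v)
grad_sq_norm = lambda v: 2.0 * v

# F(p) = sum_i p_i log p_i (negative entropy) gives the KL divergence
# when both arguments are probability vectors.
neg_entropy = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(sq_norm, grad_sq_norm, x, y)
d_kl = bregman(neg_entropy, grad_neg_entropy, x, y)
```

Here `d_euclid` equals `np.sum((x - y) ** 2)` and `d_kl` equals `np.sum(x * np.log(x / y))`.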

There exist several generalizations of the classical Bregman divergence, including the extension to functional spaces [15] and the sub-modular extension [16]. However, such generalizations aim to extend the domain rather than the range of the Bregman divergence. This renders such generalizations unsuitable to problems where the “error” term is multi-dimensional rather than uni-dimensional, e.g. the MMSE matrix in (7).

We now construct a generalization that extends the range of a Bregman divergence from scalar to matrix spaces (viewed as multi-dimensional vector spaces) to address the issue. We start by reviewing several notions that are useful for the definition of the generalized Bregman divergence.

###### Definition 2 (Generalized Inequality [17]).

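Let $F$ be a continuously-differentiable function defined on a convex subset of its domain space. Let $K$ be a proper cone in the range space of $F$, i.e., $K$ is convex, closed, pointed and has non-empty interior. We define a partial ordering $\preceq_K$ on the range space as follows:

$$x\preceq_K y\iff y-x\in K, \tag{13}$$

$$x\prec_K y\iff y-x\in\operatorname{int}(K), \tag{14}$$

where $\operatorname{int}(\cdot)$ denotes the interior of the set. We write $x\succeq_K y$ and $x\succ_K y$ if $y\preceq_K x$ and $y\prec_K x$, respectively.

We define $F$ to be $K$-convex if and only if:

$$F(\theta x+(1-\theta)y)\preceq_K\theta F(x)+(1-\theta)F(y) \tag{15}$$

for all $x$, $y$ in the domain of $F$ and all $\theta\in[0,1]$.

We define $F$ to be strictly $K$-convex if and only if:

$$F(\theta x+(1-\theta)y)\prec_K\theta F(x)+(1-\theta)F(y) \tag{16}$$

for all $x\neq y$ in the domain of $F$ and all $\theta\in(0,1)$.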
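As a worked example of $K$-convexity (our illustration, with $K$ assumed to be the cone of positive semidefinite matrices and $F(x)=xx^{T}$ for a vector $x$), expanding the outer products gives

$$\theta F(x)+(1-\theta)F(y)-F(\theta x+(1-\theta)y)=\theta(1-\theta)\,(x-y)(x-y)^{T}.$$

The right-hand side is positive semidefinite for every $\theta\in[0,1]$, so (15) holds and $F$ is $K$-convex for this cone. Since the matrix has rank at most one, it does not lie in the interior of the cone when the dimension exceeds one, so this particular pair $(F,K)$ satisfies (15) but not the strict inequality (16).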

###### Definition 3 (Fréchet Derivative [18]).

Let $V$ and $Z$ be Banach spaces with norms $\|\cdot\|_V$ and $\|\cdot\|_Z$, respectively, and let $F$ be a function defined on an open subset of $V$ with values in $Z$. $F$ is called Fréchet differentiable at a point $x$ of that subset if there exists a bounded linear operator $DF(x)(\cdot)$ from $V$ to $Z$ such that

$$\lim_{\|h\|_V\to 0}\frac{\|F(x+h)-F(x)-DF(x)(h)\|_Z}{\|h\|_V}=0. \tag{17}$$

$DF(x)$ is called the Fréchet derivative of $F$ at $x$.

Note that the Fréchet derivative corresponds to the usual derivative of matrix calculus for finite dimensional vector spaces. However, by employing the Fréchet derivative, it is also possible to make extensions from finite to infinite dimensional spaces such as $L^p$ spaces.

We are now in a position to offer a definition of the generalized Bregman divergence.

###### Definition 4.

Let $K$ be a proper cone and let $F$ be a Fréchet-differentiable, strictly $K$-convex function defined on a convex subset of a Banach space. The generalized Bregman divergence between two points $x$ and $y$ of this subset is defined as follows:

$$D_F(x,y):=F(x)-F(y)-DF(y)(x-y), \tag{18}$$

where $DF(y)$ is the Fréchet derivative of $F$ at $y$.
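The definition in (18) can be evaluated numerically in finite dimensions by approximating the Fréchet derivative with a directional finite difference. The sketch below is a minimal illustration under assumptions of our own: the matrix-valued function $F(x)=xx^{T}$ and the cone of positive semidefinite matrices are example choices, and `frechet_directional` is a hypothetical helper rather than a library routine.

```python
import numpy as np

def frechet_directional(F, y, h, eps=1e-6):
    """Central-difference approximation of the Fréchet derivative DF(y)(h)."""
    return (F(y + eps * h) - F(y - eps * h)) / (2 * eps)

def generalized_bregman(F, x, y):
    """Generalized Bregman divergence D_F(x, y) as in (18)."""
    return F(x) - F(y) - frechet_directional(F, y, x - y)

outer = lambda v: np.outer(v, v)          # example choice F(x) = x x^T

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
D = generalized_bregman(outer, x, y)      # matrix-valued divergence
min_eig = np.linalg.eigvalsh((D + D.T) / 2).min()
```

For this choice of $F$ the divergence equals $(x-y)(x-y)^{T}$, so `D` is positive semidefinite and `min_eig` is zero up to rounding.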

This notion of a generalized Bregman divergence is able to incorporate various previous extensions depending on the choices of the proper cone $K$ and the Banach space. For example, if we choose $K$ to be the first quadrant (all coordinates are non-negative), we have the entry-wise convexity extension. If we choose $K$ to be the space of positive definite bounded linear operators, we have the positive definiteness extension. By choosing the Banach space to be an $L^p$ space, the definition becomes similar to that in [15].

The generalized Bregman divergence also inherits various properties akin to the properties of the classical Bregman divergence, which have led to its wide utilization in optimization and computer vision problems [11, 12].

###### Theorem 2.

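Let $K$ be a proper cone and let $F$ and $G$ be Fréchet-differentiable, strictly $K$-convex functions defined on a convex subset of a Banach space. Then the generalized Bregman divergence exhibits the following properties:

1. Non-negativity: $D_F(x,y)\succeq_K 0$.

2. Linearity: $D_{c_1F+c_2G}(x,y)=c_1D_F(x,y)+c_2D_G(x,y)$ for constants $c_1,c_2>0$.

3. Convexity: $D_F(\cdot,y)$ is $K$-convex for any $y$.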

The generalized Bregman divergence also exhibits a duality property similar to the duality property of the classical Bregman divergence, which may be useful for many optimization problems [12, 19].

###### Theorem 3.

Let $F$ be a strictly $K$-convex function defined on a convex subset. Choose $K$ to be the first quadrant (the space formed by matrices with all entries positive). Let $F^\star$, $x^\star$ and $y^\star$ be the Legendre transforms of $F$, $x$ and $y$, respectively. Then, we have that:

$$D_F(x,y)=D_{F^\star}(y^\star,x^\star). \tag{19}$$

Via this theorem, it is possible to simplify the calculation of the Bregman divergence in scenarios where the dual form is easier to calculate than the original form. Mirror descent methods, which have been shown to be computationally efficient for many optimization problems [12, 20], leverage this idea.
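The duality in (19) is easiest to see in the classical scalar case. The sketch below is an illustration under our own choice of function: $F(x)=x\log x-x$ on the positive reals, whose Legendre transform is $F^\star(s)=e^{s}$ and whose dual point is $x^\star=F'(x)=\log x$. It does not exercise the matrix-valued setting of the theorem.

```python
import numpy as np

def bregman_1d(G, dG, a, b):
    """Scalar Bregman divergence D_G(a, b) = G(a) - G(b) - G'(b) (a - b)."""
    return G(a) - G(b) - dG(b) * (a - b)

F = lambda x: x * np.log(x) - x      # primal function on x > 0
dF = lambda x: np.log(x)             # dual point: x* = F'(x)
F_star = lambda s: np.exp(s)         # Legendre transform of F
dF_star = lambda s: np.exp(s)        # its derivative

x, y = 2.0, 0.7
primal = bregman_1d(F, dF, x, y)                    # D_F(x, y)
dual = bregman_1d(F_star, dF_star, dF(y), dF(x))    # D_{F*}(y*, x*)
```

Both `primal` and `dual` evaluate to $x\log(x/y)-x+y$, with the arguments swapped in the dual form exactly as in (19).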

The generalized Bregman divergence also exhibits another property akin to that of the classical Bregman divergence. In particular, it has been shown that for a metric that can be expressed in terms of the classical Bregman divergence then the optimal error relates to the conditional mean estimator [21]. Similarly, it can also be shown that for a metric that can be expressed in terms of a generalized Bregman divergence the optimal error also relates to the conditional mean estimator. However, this generalization from the scalar to the vector case requires the partial order interpretation of the minimization.

###### Theorem 4.

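Consider a probability space. Let $F$ be strictly $K$-convex as before, defined on a convex subset of a Banach space. Let $X$ be a random variable taking values in this subset for which the expectations below are well defined, and let $s_1$ be a sub-$\sigma$-algebra of the probability space. Then, for any $s_1$-measurable random variable $Y$, we have that:

$$\arg\min_{Y}\mathbb{E}\big[D_F(X,Y)\big]=\mathbb{E}[X|s_1], \tag{20}$$

where the minimization is interpreted in the partial ordering sense, i.e., if there exists $Y'$ such that $\mathbb{E}[D_F(X,Y')]\preceq_K\mathbb{E}[D_F(X,\mathbb{E}[X|s_1])]$, then $Y'=\mathbb{E}[X|s_1]$.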
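The partial-order minimization in (20) can be illustrated in the simplest setting, where the sub-$\sigma$-algebra is trivial so that the estimate is a constant and the conditional mean reduces to the mean. The sketch below assumes $F(x)=xx^{T}$, for which $D_F(x,c)=(x-c)(x-c)^{T}$, and the positive semidefinite cone; the sample and the competing estimate `c` are hypothetical.

```python
import numpy as np

def expected_divergence(X, c):
    """Empirical average of D_F(x, c) = (x - c)(x - c)^T over the rows of X."""
    R = X - c
    return R.T @ R / len(X)

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 3))      # hypothetical sample of the input
mu = X.mean(axis=0)                      # mean of the empirical distribution
c = mu + np.array([0.3, -0.1, 0.2])      # any other constant estimate

# Excess error of c over the mean, measured in the matrix (Loewner) order.
excess = expected_divergence(X, c) - expected_divergence(X, mu)
min_eig = np.linalg.eigvalsh(excess).min()
```

The excess equals $(c-\mu)(c-\mu)^{T}$, which is positive semidefinite, so no constant estimate achieves an expected divergence below that of the mean in this ordering.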

## V Gradient of Mutual Information: A Generalized Bregman Divergence Perspective

We now re-visit the gradient of mutual information for vector Poisson channel models and for vector Gaussian channel models with respect to the scaling/channel matrix, under the light of the generalized Bregman divergence.

The interpretation of the gradient results for vector Poisson and vector Gaussian channels as the average of a multi-dimensional generalization of the error between the input vector and the conditional mean estimate of the input vector under appropriate loss functions, together with the properties of the generalized Bregman divergence, paves the way to the unification of the various theorems. In particular, we offer two theorems revealing that the gradient of mutual information for vector Poisson and vector Gaussian channels admits a representation that involves the average of the generalized Bregman divergence between the channel input and the conditional mean estimate of the channel input under appropriate choices of the vector-valued loss functions.

###### Theorem 5.

The gradient of mutual information with respect to the scaling matrix for the vector Poisson channel model in (1) can be represented as follows:

$$\nabla_\Phi I(X;Y)=\mathbb{E}\big[D_F(X,\mathbb{E}[X|Y])\big], \tag{21}$$

where $D_F(\cdot,\cdot)$ is a generalized Bregman divergence associated with the function

$$F(x)=x\,(\log(\Phi x+\lambda))^{T}-[x,\dots,x]+[1,\dots,1]^{T}, \tag{22}$$

where .

###### Theorem 6.

The gradient of mutual information with respect to the channel matrix for the vector Gaussian channel model in (5) can be represented as follows:

$$\nabla_\Phi I(X;Y)=\mathbb{E}\big[D_F(X,\mathbb{E}[X|Y])\big], \tag{23}$$

where $D_F(\cdot,\cdot)$ is a generalized Bregman divergence associated with the function

$$F(x)=\Phi xx^{T}. \tag{24}$$
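The link between Theorem 6 and the MMSE-matrix expression in (6) and (7) can be made explicit with a short calculation, sketched here for convenience. For $F(x)=\Phi xx^{T}$ the Fréchet derivative at $y$ acts as $DF(y)(h)=\Phi(hy^{T}+yh^{T})$, so (18) gives

$$D_F(x,y)=\Phi\big(xx^{T}-yy^{T}-(x-y)y^{T}-y(x-y)^{T}\big)=\Phi\,(x-y)(x-y)^{T}.$$

Setting $y=\mathbb{E}[X|Y]$ and averaging then yields

$$\mathbb{E}\big[D_F(X,\mathbb{E}[X|Y])\big]=\Phi\,\mathbb{E}\big[(X-\mathbb{E}[X|Y])(X-\mathbb{E}[X|Y])^{T}\big],$$

which is the right-hand side of (6) with the MMSE matrix defined in (7).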

Atar and Weissman [10] have also recognized that the derivative of mutual information with respect to the scaling for the scalar Poisson channel could also be represented in terms of a (classical) Bregman divergence. Such a result applicable to the scalar Poisson channel as well as a result applicable to the scalar Gaussian channel can be seen to be Corollaries to Theorems 5 and 6, respectively, in view of the fact that the classical Bregman divergence is a specialization of the generalized one.

###### Corollary 2.

The derivative of mutual information with respect to the scaling factor for the scalar Poisson channel model is given by:

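$$\frac{\partial}{\partial\phi}I(X;Y)=\mathbb{E}\big[D_F(X,\mathbb{E}[X|Y])\big], \tag{25}$$

where $F(x)=x\log(\phi x+\lambda)-x+1$.

###### Proof.

By Theorem 5, specializing (22) to the scalar case, we have $F(x)=x\log(\phi x+\lambda)-x+1$. It is straightforward to verify that this $F$ induces the scalar gradient result. ∎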
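The verification can be sketched as follows, assuming the scalar specialization of (22), namely $F(x)=x\log(\phi x+\lambda)-x+1$, and writing $\hat{X}=\mathbb{E}[X|Y]$. From the classical definition (12),

$$D_F(X,\hat{X})=X\log(\phi X+\lambda)-\hat{X}\log(\phi\hat{X}+\lambda)-(X-\hat{X})-F'(\hat{X})\,(X-\hat{X}).$$

Since $\hat{X}$ is a function of $Y$ and $\mathbb{E}[X-\hat{X}\,|\,Y]=0$, the last two terms vanish in expectation, leaving

$$\mathbb{E}\big[D_F(X,\hat{X})\big]=\mathbb{E}\big[X\log(\phi X+\lambda)\big]-\mathbb{E}\big[\hat{X}\log(\phi\hat{X}+\lambda)\big].$$

Because $\phi\hat{X}+\lambda=\mathbb{E}[\phi X+\lambda\,|\,Y]$, this coincides with the right-hand side of (10).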

###### Corollary 3.

The derivative of mutual information with respect to the scaling factor for the scalar Gaussian channel model is given by:

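$$\frac{\partial}{\partial\phi}I(X;Y)=\mathbb{E}\big[D_F(X,\mathbb{E}[X|Y])\big], \tag{26}$$

where $F(x)=\phi x^{2}$.

###### Proof.

By Theorem 6, $F(x)=\phi x^{2}$. (26) follows from a simple calculation and the result from [2], i.e. the scalar form of (6) and (7), that $\frac{\partial}{\partial\phi}I(X;Y)=\phi\,\mathbb{E}\big[(X-\mathbb{E}[X|Y])^{2}\big]$. ∎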

Theorems 5 and 6 suggest a deep connection between the gradient and the generalized Bregman divergence. Moreover, if a gradient is given in terms of a generalized Bregman divergence, it is possible to simplify optimization algorithms based on gradient descent. Rather than calculating the gradient itself, one may work directly on its dual form, provided that the dual function is easier to calculate. This idea underlies mirror descent methods, which have been shown to be very computationally efficient [12, 20].
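For readers unfamiliar with mirror descent, the following sketch shows the generic idea on a toy problem. It is not the algorithm of [12] or [20] and is unrelated to the mutual information objective: it assumes the negative-entropy mirror map on the probability simplex, for which the update becomes a multiplicative step followed by normalization, and it minimizes a hypothetical quadratic objective.

```python
import numpy as np

def entropic_mirror_descent(grad, p0, step=0.5, iters=500):
    """Mirror descent on the simplex with the negative-entropy mirror map."""
    p = p0.copy()
    for _ in range(iters):
        p = p * np.exp(-step * grad(p))   # gradient step taken in the dual (log) coordinates
        p /= p.sum()                      # Bregman (KL) projection back onto the simplex
    return p

# Hypothetical objective: f(p) = 0.5 * ||p - target||^2 with target in the simplex.
target = np.array([0.2, 0.3, 0.5])
grad = lambda p: p - target
p_star = entropic_mirror_descent(grad, np.full(3, 1.0 / 3.0))
```

The iterate never leaves the simplex, and for this toy objective `p_star` converges to `target`.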

## VI Applications: Document Classification

The practical relevance of the vector Poisson channel model relates to its numerous applications in various domains. We now briefly shed some light on how our results link to one emerging application that involves classification of documents.

Let the random vector $X$ model the Poisson rates of count measurements, e.g. the Poisson rates of the counts of words in a document for a given vocabulary/dictionary of words.

It turns out that, in view of its compressive nature, it may be preferable to use the model $Y\sim\mathrm{Pois}(\Phi X)$, where $\Phi$ is a binary matrix with fewer rows than columns, rather than the conventional model $Y\sim\mathrm{Pois}(X)$ [9], as the basis for document classification. In particular, each row of $\Phi$ defines a set of words (those with row elements equal to one) that characterize a certain topic. The corresponding count relates to the number of times words in that set are manifested in a document.

The problem then relates to the determination of the “most informative” set of topics, i.e. the matrix $\Phi$. The availability of the gradient of mutual information with respect to the scaling matrix, which has been unveiled in this work, then offers a means to tackle this problem via gradient descent methods.
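A heavily simplified sketch of such a design loop is given below. It is not the procedure of the paper or of [9]: it assumes a single topic (a one-row $\Phi$), relaxes the binary entries to the box $[0,1]$, uses a small hypothetical set of document classes `xs` with probabilities `px`, adds a dark current `lam` to keep all rates positive, and performs projected gradient ascent with backtracking, using (8) for the gradient.

```python
import numpy as np

def mi_and_grad(phi, lam, xs, px, ymax=60):
    """Mutual information (nats) and its gradient via (8), for a one-row Phi."""
    rates = xs @ phi + lam                              # (K,): Phi x_k + lam
    y = np.arange(ymax + 1)
    logfact = np.concatenate(([0.0], np.cumsum(np.log(y[1:]))))
    loglik = y[None, :] * np.log(rates)[:, None] - rates[:, None] - logfact[None, :]
    joint = px[:, None] * np.exp(loglik)                # P(x_k, y)
    py = joint.sum(axis=0)                              # P(y)
    mi = float(np.sum(joint * (loglik - np.log(py))))
    post = joint / py                                   # P(x_k | y)
    xhat = post.T @ xs                                  # E[X | Y = y]
    rhat = post.T @ rates                               # E[Phi X + lam | Y = y]
    grad = (px * np.log(rates)) @ xs - (py * np.log(rhat)) @ xhat
    return mi, grad

def design_projection(phi0, lam, xs, px, step=0.5, iters=50):
    """Projected gradient ascent on the box [0, 1]^n with backtracking."""
    phi = phi0.copy()
    mi, grad = mi_and_grad(phi, lam, xs, px)
    history = [mi]
    for _ in range(iters):
        t = step
        while t > 1e-8:
            cand = np.clip(phi + t * grad, 0.0, 1.0)    # ascent step, then projection
            cand_mi, cand_grad = mi_and_grad(cand, lam, xs, px)
            if cand_mi >= mi:                           # accept only non-decreasing steps
                break
            t /= 2
        else:
            break                                       # no improving step was found
        phi, mi, grad = cand, cand_mi, cand_grad
        history.append(mi)
    return phi, history

# Hypothetical word-rate vectors for three document classes over three words.
xs = np.array([[3.0, 0.5, 0.2], [0.4, 2.5, 0.3], [0.3, 0.4, 3.0]])
px = np.full(3, 1.0 / 3.0)
phi, history = design_projection(np.full(3, 0.5), 0.2, xs, px)
```

The backtracking step guarantees that the entries of `history` never decrease; a binary design could then be obtained by thresholding `phi`, which is a further heuristic not covered by the results above.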

## VII Conclusion

The focus has been on the generalization of connections between information-theoretic and estimation-theoretic quantities from the scalar to the vector Poisson channel model. In doing so, we have revealed that the connection between the gradient of mutual information with respect to key system parameters and conditional mean estimation is an overarching theme that traverses not only the scalar but also the vector counterparts of the Gaussian and Poisson channels.

By constructing a generalized version of the classical Bregman divergence, we have also established further intimate links between the gradient of mutual information in vector Poisson channel models and the gradient of mutual information in vector Gaussian channels. This generalized notion, which aims to extend the range of the conventional Bregman divergence from scalar to vector domains, has been shown to exhibit various properties akin to the properties of the classical notion, including non-negativity, linearity, convexity and duality.

By revealing the gradient of mutual information with respect to key system parameters of the vector Poisson model, including the scaling matrix and the dark current, it will be possible to use gradient-descent methods to address several problems, including generalizations of compressive-sensing projection designs from the Gaussian [22] to the Poisson model, that are known to be relevant in emerging applications (e.g. in X-ray and document classification).

## References

• [1] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, April 2005.
• [2] D.P. Palomar and S. Verdú, “Gradient of mutual information in linear vector Gaussian channels,” IEEE Transactions on Information Theory, vol. 52, no. 1, pp. 141–154, Jan. 2006.
• [3] D. Guo, S. Shamai, and S. Verdú, “Mutual information and conditional mean estimation in Poisson channels,” IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 1837–1849, May 2008.
• [4] S. Verdú, “Poisson communication theory,” Invited talk in the International Technion Communication Day in honor of Israel Bar-David, May 1999.
• [5] C.G. Taborda and F. Perez-Cruz, “Mutual information and relative entropy over the binomial and negative binomial channels,” in IEEE International Symposium on Information Theory Proceedings (ISIT). IEEE, 2012, pp. 696–700.
• [6] D. Guo, “Information and estimation over binomial and negative binomial models,” arXiv preprint arXiv:1207.7144, 2012.
• [7] R.S. Liptser and A.N. Shiryaev, Statistics of Random Processes: II. Applications, vol. 2, Springer, 2000.
• [8] I.A. Elbakri and J.A. Fessler, “Statistical image reconstruction for polyenergetic X-ray computed tomography,” IEEE Transactions on Medical Imaging, vol. 21, no. 2, pp. 89–99, Feb. 2002.
• [9] M. Zhou, L. Hannah, D. Dunson, and L. Carin, “Beta-negative binomial process and Poisson factor analysis,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
• [10] R. Atar and T. Weissman, “Mutual information, relative entropy, and estimation in the Poisson channel,” IEEE Transactions on Information Theory, vol. 58, no. 3, pp. 1302–1318, March 2012.
• [11] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2010.
• [12] A. Ben-Tal, T. Margalit, and A. Nemirovski, “The ordered subsets mirror descent optimization method with applications to tomography,” SIAM Journal on Optimization, vol. 12, no. 1, pp. 79–108, Jan. 2001.
• [13] L.M. Bregman, “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming,” USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200–217, March 1967.
• [14] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.
• [15] B.A. Frigyik, S. Srivastava, and M.R. Gupta, “Functional Bregman divergence and Bayesian estimation of distributions,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 5130–5139, Nov. 2008.
• [16] R. Iyer and J. Bilmes, “Submodular-Bregman and the Lovász-Bregman divergences with applications,” in Advances in Neural Information Processing Systems, 2012, pp. 2942–2950.
• [17] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
• [18] G.B. Folland, Real Analysis: Modern Techniques and Their Applications, Wiley New York, 1999.
• [19] A. Agarwal, P.L. Bartlett, P. Ravikumar, and M.J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, May 2012.
• [20] A.S. Nemirovsky and D.B. Yudin, Problem Complexity and Method Efficiency in Optimization, Wiley, 1983.
• [21] A. Banerjee, X. Guo, and H. Wang, “On the optimality of conditional expectation as a Bregman predictor,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2664–2669, July 2005.
• [22] W.R. Carson, M. Chen, M.R.D. Rodrigues, R. Calderbank, and L. Carin, “Communications-inspired projection design with application to compressive sensing,” SIAM J. Imaging Sciences, vol. 5, no. 4, pp. 1185–1212, Oct. 2012.