I Introduction
There has been a recent emergence of intimate connections between various quantities in information theory and estimation theory. Perhaps the most prominent connections reveal the interplay between two notions with operational relevance in each of the domains: mutual information and conditional mean estimation.
In particular, Guo, Shamai and Verdú [1] have expressed the derivative of mutual information in a scalar Gaussian channel via the (nonlinear) minimum mean-squared error (MMSE), and Palomar and Verdú [2] have expressed the gradient of mutual information in a vector Gaussian channel in terms of the MMSE matrix. The connections have also been extended from the scalar Gaussian to the scalar Poisson channel model, which has been ubiquitously used to model optical communications [3, 4]. Recently, parallel results for scalar binomial and negative binomial channels have been established [5, 6]. Inspired by the Liptser-Shiryaev formula [7], it has been demonstrated that it is often easier to investigate the gradient of mutual information than mutual information itself [3]. Further, it has been shown that the derivative of mutual information with respect to key system parameters also relates to the conditional mean estimator [3].
This paper pursues the same overarching theme. One of the goals is to generalize the gradient of mutual information from scalar to vector Poisson channel models. This generalization is relevant not only from the theoretical but also from the practical perspective, in view of the numerous emerging applications of the vector Poisson channel model in X-ray systems [8] and document classification systems (based on word counts) [9]. The availability of the gradient then provides the means to optimize the mutual information with respect to specific system parameters via gradient descent methods.
The other goal is to encapsulate under a unified framework the gradient of mutual information results for scalar Gaussian channels, scalar Poisson channels and their vector counterparts.
This encapsulation is inspired by recent results that express the derivative of mutual information in scalar Poisson channels as the average value of the Bregman divergence, associated with a particular loss function, between the input and the conditional mean estimate of the input [10]. It is made possible by constructing a generalization of the classical Bregman divergence from the scalar to the vector case. To the best of our knowledge, this generalization of the Bregman divergence is new. The gradients of mutual information of the vector Poisson model and the vector Gaussian model, as well as their scalar counterparts, are then expressed, akin to [10], in terms of the average value of the so-called generalized Bregman divergence, associated with a particular (vector) loss function, between the input vector and the conditional mean estimate of the input vector. We also study in detail various properties of the generalized Bregman divergence: the properties of the proposed divergence are shown to closely mimic those of the classical Bregman divergence.
The generalized Bregman divergence framework is of interest not only from the theoretical but also from the practical standpoint: for example, it has been shown that re-expressing results via a Bregman divergence can often speed up various optimization algorithms [11].
This paper is organized as follows: Section II introduces the channel model. Section III derives the gradient of mutual information with respect to key system parameters for vector Poisson channel models. Section IV introduces the notion of a generalized Bregman divergence and its properties. Section V re-derives the gradient of mutual information of vector Poisson and Gaussian channel models in light of the proposed Bregman divergence. A possible application of the theoretical results in an emerging domain is succinctly described in Section VI. Section VII concludes the paper.
II The Vector Poisson Channel
We define the vector Poisson channel model via the random transformation:
(1) 
where the random vector represents the channel input, the random vector represents the channel output, the matrix represents a linear transformation whose role is to entangle the different inputs, and the vector represents the dark current. Here, denotes a standard Poisson distribution with parameter . This vector Poisson channel model associated with arbitrary and is a generalization of the standard scalar Poisson model associated with given by [3, 10]:
(2) 
where the scalar random variables and are associated with the input and output of the scalar channel, respectively, is a scaling factor, and is associated with the dark current.
Footnote 1: We use, except for the scaling matrix and the scaling factor, identical notation for the scalar Poisson channel and the vector Poisson channel. The context defines whether we are dealing with scalar or vector quantities.
The generalization of the scalar Poisson model in (2) to the vector one in (1) offers the means to address relevant problems in various emerging applications, most notably in X-ray and document classification applications as discussed in the sequel [9, 12].
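As a rough illustration of the channel model described above, the following sketch draws one output of a vector Poisson channel. It assumes that, given the input, the output entries are independent Poisson variables whose rates are the entries of the linearly transformed input plus the dark current. The names `Phi`, `x`, `lam`, `sample_poisson` and `vector_poisson_channel` are illustrative choices and do not come from the text, and the simple sampler used here is only suitable for moderate rates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate via Knuth's multiplication method (moderate rates only)."""
    if rate < 0:
        raise ValueError("Poisson rate must be nonnegative")
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    # Multiply uniforms until the running product drops to exp(-rate) or below.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def vector_poisson_channel(Phi, x, lam, rng):
    """Return one output y, with y[i] drawn as Poisson((Phi x)[i] + lam[i])."""
    rates = [sum(p * xj for p, xj in zip(row, x)) + li for row, li in zip(Phi, lam)]
    return [sample_poisson(r, rng) for r in rates]


rng = random.Random(0)
Phi = [[1.0, 0.5, 0.0],   # hypothetical 2 x 3 nonnegative scaling matrix
       [0.0, 2.0, 1.0]]
x = [3.0, 1.0, 2.0]       # hypothetical nonnegative input vector
lam = [0.1, 0.2]          # hypothetical dark current
y = vector_poisson_channel(Phi, x, lam, rng)
print(y)
```

Setting `Phi` to a single entry and `x`, `lam` to single-element lists gives the scalar special case under the same assumptions.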
The goal is to define the gradient of mutual information between the input and the output of the vector Poisson channel with respect to the scaling matrix, i.e.
(3) 
where represents the th entry of the matrix , and with respect to the dark current, i.e.
(4) 
where represents the ith entry of the vector .
We will also be concerned with drawing connections between the gradient result for the vector Poisson channel and the gradient result for the Gaussian counterpart in the sequel. In particular, we will consider the vector Gaussian channel model given by:
(5) 
where represents the vectorvalued channel input, represents the vectorvalued channel output, represents the channel matrix, and represents white Gaussian noise.
III Gradient of Mutual Information for Vector Poisson Channels
We now introduce the gradient of mutual information with respect to the scaling matrix and with respect to the dark current for vector Poisson channel models. In particular, we assume in the sequel that the regularity conditions necessary to freely interchange the order of integration and differentiation hold, i.e., the order of the differential operators , and the expectation operator .
Footnote 2: We consider for convenience natural logarithms throughout the paper.
Theorem 1.
Consider the vector Poisson channel model in (1). Then, the gradient of mutual information between the input and output of the channel with respect to the scaling matrix is given by:
(8) 
and with respect to the dark current is given by:
(9) 
irrespective of the input distribution provided that the regularity conditions hold.
It is clear that Theorem 1 represents a multidimensional generalization of Theorems 1 and 2 in [3]. The scalar result follows immediately from the vector counterpart by taking .
Corollary 1.
Consider the scalar Poisson channel model in (2). Then, the derivative of mutual information between the input and output of the channel with respect to the scaling factor is given by:
(10) 
and with respect to the dark current is given by:
(11) 
irrespective of the input distribution provided that the regularity conditions hold.
It is also of interest to note that the gradient of mutual information for vector Poisson channels appears to admit an interpretation akin to that of the gradient of mutual information for vector Gaussian channels in (6) and (7) (see also [2]): Both gradient results can be expressed in terms of the average of a multidimensional measure of the error between the input vector and the conditional mean estimate of the input vector under appropriate loss functions. This interpretation can be made precise – as well as unified – by constructing a generalized notion of Bregman divergence that encapsulates the classical one.
IV Generalized Bregman Divergences: Definitions and Properties
The classical Bregman divergence was originally constructed to determine common points of convex sets [13]. It was later discovered that the Bregman divergence induces numerous well-known distance measures and is in bijection with the exponential family [14].
Definition 1 (Classical Bregman Divergence [13]).
Let be a continuously-differentiable real-valued and strictly convex function defined on a closed convex set . The Bregman divergence between is defined as follows:
(12) 
Note that different choices of the function induce different distance measures. For example, the (squared) Euclidean distance, the Kullback-Leibler divergence, the Mahalanobis distance and many other widely-used distances are specializations of the Bregman divergence associated with different choices of the function [14].
There exist several generalizations of the classical Bregman divergence, including the extension to functional spaces [15] and the submodular extension [16]. However, such generalizations aim to extend the domain rather than the range of the Bregman divergence. This renders such generalizations unsuitable for problems where the “error” term is multidimensional rather than unidimensional, e.g. the MMSE matrix in (7).
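The specializations of the classical Bregman divergence mentioned above can be checked numerically. The sketch below uses the standard form of the divergence, the function value at the first point minus its first-order expansion around the second point, and evaluates it for two textbook generating functions: the squared Euclidean norm and the negative entropy. The function names and the test points are illustrative assumptions; the two points are chosen on the probability simplex so that the second divergence coincides with the Kullback-Leibler divergence.

```python
import math


def bregman(phi, grad_phi, x, y):
    """Classical Bregman divergence: phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


def sq_norm(v):
    return sum(t * t for t in v)


def sq_norm_grad(v):
    return [2.0 * t for t in v]


def neg_entropy(v):
    return sum(t * math.log(t) for t in v)


def neg_entropy_grad(v):
    return [math.log(t) + 1.0 for t in v]


x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_euclid = bregman(sq_norm, sq_norm_grad, x, y)       # squared Euclidean distance
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)   # KL divergence on the simplex

print(d_euclid, sum((a - b) ** 2 for a, b in zip(x, y)))
print(d_kl, sum(a * math.log(a / b) for a, b in zip(x, y)))
```

In both cases the output is a single number, which is the limitation the generalization in this section is meant to remove.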
We now construct a generalization that extends the range of a Bregman divergence from scalar to matrix spaces (viewed as multidimensional vector spaces) to address the issue. We start by reviewing several notions that are useful for the definition of the generalized Bregman divergence.
Definition 2 (Generalized Inequality [17]).
Let be a continuously-differentiable function, where is a convex subset. Let be a proper cone, i.e., is convex, closed, pointed, and has nonempty interior. We define a partial ordering on as follows:
(13) 
(14) 
where denotes the interior of the set. We write and if and , respectively.
We define to be K-convex if and only if:
(15) 
for .
We define to be strictly K-convex if and only if:
(16) 
for and .
Definition 3 (Fréchet Derivative [18]).
Let and be Banach spaces with norms and , respectively, and be open. is called Fréchet differentiable at , if there exists a bounded linear operator such that
(17) 
is called the Fréchet derivative of at .
Note that the Fréchet derivative corresponds to the usual derivative of matrix calculus for finite dimensional vector spaces. However, by employing the Fréchet derivative, it is also possible to make extensions from finite to infinite dimensional spaces such as spaces.
We are now in a position to offer a definition of the generalized Bregman divergence.
Definition 4.
Let be a proper cone and be a convex subset in a Banach space . is a Fréchet-differentiable strictly convex function. The generalized Bregman divergence between is defined as follows:
(18) 
where is the Fréchet derivative of at .
This notion of a generalized Bregman divergence is able to incorporate various previous extensions depending on the choices of the proper cone and the Banach space . For example, if we choose to be the first quadrant (all coordinates are nonnegative), we have the entrywise convexity extension. If we choose to be the space of positive definite bounded linear operators, we have the positive definiteness extension. If we choose to be an space, the definition is similar to that in [15].
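To make the positive definiteness extension concrete, the sketch below evaluates a matrix-valued divergence for one simple choice of function. It assumes that the generalized divergence mirrors the classical form, that is, the function value at the first point, minus the value at the second point, minus the Fréchet derivative at the second point applied to their difference. The choice of the outer product map `F(x) = x x^T`, the positive semidefinite cone, and all names are illustrative assumptions. For this choice the divergence reduces to the outer product of the error vector with itself, which is positive semidefinite.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def F(x):
    """Hypothetical matrix-valued function F(x) = x x^T."""
    return outer(x, x)


def dF(y, h):
    """Fréchet derivative of F at y applied to the direction h: y h^T + h y^T."""
    A, B = outer(y, h), outer(h, y)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def generalized_bregman(F, dF, x, y):
    """Matrix-valued divergence F(x) - F(y) - dF(y)[x - y] (assumed form)."""
    h = [xi - yi for xi, yi in zip(x, y)]
    Fx, Fy, lin = F(x), F(y), dF(y, h)
    return [[fx - fy - l for fx, fy, l in zip(rx, ry, rl)]
            for rx, ry, rl in zip(Fx, Fy, lin)]


x = [1.0, 2.0]
y = [0.5, -1.0]
D = generalized_bregman(F, dF, x, y)

# For F(x) = x x^T the divergence equals (x - y)(x - y)^T.
err = [xi - yi for xi, yi in zip(x, y)]
expected = outer(err, err)
print(D)
print(expected)
```

Averaging such an outer product over a joint distribution of an input and its estimate yields a matrix-valued error of the same kind as the MMSE matrix, which a scalar-valued divergence cannot represent.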
The generalized Bregman divergence also inherits various properties akin to those of the classical Bregman divergence, which have led to its wide use in optimization and computer vision problems [11, 12].
Theorem 2.
Let be a proper cone and be a convex subset in a Banach space . are Fréchet-differentiable strictly convex functions. Then the generalized Bregman divergence associated with the function exhibits the properties:

.

for constants .

is convex for any .
The generalized Bregman divergence also exhibits a duality property similar to the duality property of the classical Bregman divergence, that may be useful for many optimization problems [12, 19].
Theorem 3.
Let be a strictly convex function, where is a convex subset. Choose to be the space of first quadrant (space formed by matrices with all entries positive). Let be the Legendre transform of . Then, we have that:
(19) 
Via this theorem, it is possible to simplify the calculation of the Bregman divergence in scenarios where the dual form is easier to calculate than the original form. Mirror descent methods, which have been shown to be computationally efficient for many optimization problems [12, 20], leverage this idea.
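As an illustration of how working in the dual form can simplify an algorithm, the sketch below implements entropic mirror descent (also known as exponentiated gradient) on the probability simplex. This is a standard textbook instance of mirror descent with the negative entropy as the mirror map, chosen here for illustration; it is not an algorithm taken from this paper. The objective, the step size and all names are assumptions.

```python
import math


def entropic_mirror_descent(grad, p0, step, iters):
    """Mirror descent on the simplex with the negative-entropy mirror map."""
    p = list(p0)
    for _ in range(iters):
        g = grad(p)
        # Gradient step in the dual (log) domain, then map back by exponentiating.
        w = [pi * math.exp(-step * gi) for pi, gi in zip(p, g)]
        total = sum(w)
        # Normalizing is the Bregman (KL) projection back onto the simplex.
        p = [wi / total for wi in w]
    return p


# Hypothetical objective: f(p) = 0.5 * ||p - target||^2 with target on the simplex.
target = [0.2, 0.3, 0.5]


def grad(p):
    return [pi - ti for pi, ti in zip(p, target)]


p_star = entropic_mirror_descent(grad, [1.0 / 3.0] * 3, step=0.5, iters=500)
print(p_star)
```

The update never forms a Euclidean projection: the simplex constraint is handled by a single normalization, which is the kind of simplification that the dual form can provide.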
The generalized Bregman divergence also exhibits another property akin to that of the classical Bregman divergence. In particular, it has been shown that, for a loss that can be expressed in terms of the classical Bregman divergence, the optimal error relates to the conditional mean estimator [21]. Similarly, it can be shown that, for a loss that can be expressed in terms of a generalized Bregman divergence, the optimal error also relates to the conditional mean estimator. However, this generalization from the scalar to the vector case requires the partial order interpretation of the minimization.
Theorem 4.
Consider a probability space . Let be strictly convex as before, and let be a convex subset in a Banach space . Let be a random variable with and . Let be a sub-σ-algebra. Then, for any measurable random variable , we have that:
(20) 
where the minimization is interpreted in the partial ordering sense, i.e., if such that , then .
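The scalar, classical counterpart of this optimality property [21] is easy to check numerically. The sketch below uses a small hypothetical joint distribution of an input and an observation, together with the Bregman divergence generated by the function t log t - t, and compares the expected loss of the conditional mean with that of perturbed estimators. The joint distribution, the perturbation factors and all names are illustrative assumptions.

```python
import math


def poisson_loss(x, a):
    """Bregman divergence generated by phi(t) = t log t - t: x log(x/a) - x + a."""
    return x * math.log(x / a) - x + a


# Hypothetical joint pmf over (input value, observation value).
joint = {(1.0, 0): 0.3, (4.0, 0): 0.1, (1.0, 1): 0.1, (4.0, 1): 0.5}


def conditional_mean(joint, y):
    """E[X | Y = y] for the discrete joint pmf."""
    mass = sum(p for (xv, yv), p in joint.items() if yv == y)
    return sum(xv * p for (xv, yv), p in joint.items() if yv == y) / mass


def expected_loss(joint, estimator):
    """Expected loss of an estimator given as a dict: observation -> estimate."""
    return sum(p * poisson_loss(xv, estimator[yv]) for (xv, yv), p in joint.items())


cm = {y: conditional_mean(joint, y) for y in (0, 1)}
best = expected_loss(joint, cm)

# Rescale the conditional mean estimates and compare the resulting expected losses.
perturbed = [expected_loss(joint, {0: cm[0] * s, 1: cm[1] * t})
             for s in (0.8, 1.0, 1.25) for t in (0.8, 1.0, 1.25)]
print(cm, best, min(perturbed))
```

None of the perturbed estimators attains a smaller expected loss than the conditional mean in this example; in the vector-valued setting of Theorem 4 the comparison is made in the partial ordering rather than between real numbers.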
V Gradient of Mutual Information: A Generalized Bregman Divergences Perspective
We now revisit the gradient of mutual information for vector Poisson channel models and for vector Gaussian channel models with respect to the scaling/channel matrix, under the light of the generalized Bregman divergence.
The interpretation of the gradient results for vector Poisson and vector Gaussian channels as the average of a multidimensional generalization of the error between the input vector and the conditional mean estimate of the input vector under appropriate loss functions, together with the properties of the generalized Bregman divergence, paves the way to the unification of the various theorems. In particular, we offer two theorems revealing that the gradient of mutual information for vector Poisson and vector Gaussian channels admits a representation that involves the average of the generalized Bregman divergence between the channel input and the conditional mean estimate of the channel input under appropriate choices of the vector-valued loss functions.
Theorem 5.
The gradient of mutual information with respect to the scaling matrix for the vector Poisson channel model in (1) can be represented as follows:
(21) 
where is a generalized Bregman divergence associated with the function
(22) 
where .
Theorem 6.
The gradient of mutual information with respect to the channel matrix for the vector Gaussian channel model in (5) can be represented as follows:
(23) 
where is a generalized Bregman divergence associated with the function
(24) 
Atar and Weissman [10] have also recognized that the derivative of mutual information with respect to the scaling for the scalar Poisson channel could also be represented in terms of a (classical) Bregman divergence. Such a result applicable to the scalar Poisson channel as well as a result applicable to the scalar Gaussian channel can be seen to be Corollaries to Theorems 5 and 6, respectively, in view of the fact that the classical Bregman divergence is a specialization of the generalized one.
Corollary 2.
The derivative of mutual information with respect to the scaling factor for the scalar Poisson channel model is given by:
(25) 
where .
Proof.
By Theorem 5, we have . It is straightforward to verify that induces the scalar gradient result. ∎
Corollary 3.
The derivative of mutual information with respect to the scaling factor for the scalar Gaussian channel model is given by:
(26) 
where .
V-A Algorithmic Advantages
Theorems 5 and 6 suggest a deep connection between the gradient and the generalized Bregman divergence. Moreover, if a gradient is given in terms of a generalized Bregman divergence, it is possible to simplify optimization algorithms based on gradient descent. Rather than calculating the gradient itself, one may work directly on its dual form, provided that the dual function is easier to calculate. This idea is the essence of mirror descent methods, which have been shown to be computationally efficient [12, 20].
VI Applications: Document Classification
The practical relevance of the vector Poisson channel model relates to its numerous applications in various domains. We now briefly shed some light on how our results link to one emerging application that involves classification of documents.
Let the random vector model the Poisson rates of count measurements, e.g. the Poisson rates of the counts of words in a document for a vocabulary/dictionary of words.
It turns out that – in view of its compressive nature – it may be preferable to use the model , where with , rather than the conventional model [9], as the basis for document classification. In particular, each row of defines a set of words (those with row elements equal to one) that characterize a certain topic. The corresponding count relates to the number of times words in that set are manifested in a document.
The problem then relates to the determination of the “most informative” set of topics, i.e. the matrix . The availability of the gradient of mutual information with respect to the scaling matrix, which has been unveiled in this work, then offers a means to tackle this problem via gradient descent methods.
VII Conclusion
The focus has been on the generalization of connections between information-theoretic and estimation-theoretic quantities from the scalar to the vector Poisson channel model. In doing so, we have revealed that the connection between the gradient of mutual information with respect to key system parameters and conditional mean estimation is an overarching theme that traverses not only the scalar but also the vector counterparts of the Gaussian and Poisson channels.
By constructing a generalized version of the classical Bregman divergence, we have also established further intimate links between the gradient of mutual information in vector Poisson channel models and the gradient of mutual information in vector Gaussian channels. This generalized notion, which aims to extend the range of the conventional Bregman divergence from scalar to vector domains, has been shown to exhibit various properties akin to the properties of the classical notion, including nonnegativity, linearity, convexity and duality.
By revealing the gradient of mutual information with respect to key system parameters of the vector Poisson model, including the scaling matrix and the dark current, it will be possible to use gradient descent methods to address several problems, including generalizations of compressive-sensing projection designs from the Gaussian [22] to the Poisson model, that are known to be relevant in emerging applications (e.g. in X-ray and document classification).
References
[1] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, April 2005.
[2] D.P. Palomar and S. Verdú, “Gradient of mutual information in linear vector Gaussian channels,” IEEE Transactions on Information Theory, vol. 52, no. 1, pp. 141–154, Jan. 2006.
[3] D. Guo, S. Shamai, and S. Verdú, “Mutual information and conditional mean estimation in Poisson channels,” IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 1837–1849, May 2008.
[4] S. Verdú, “Poisson communication theory,” Invited talk in the International Technion Communication Day in honor of Israel Bar-David, May 1999.
[5] C.G. Taborda and F. Perez-Cruz, “Mutual information and relative entropy over the binomial and negative binomial channels,” in IEEE International Symposium on Information Theory Proceedings (ISIT). IEEE, 2012, pp. 696–700.
[6] D. Guo, “Information and estimation over binomial and negative binomial models,” arXiv preprint arXiv:1207.7144, 2012.
[7] R.S. Liptser and A.N. Shiryaev, Statistics of Random Processes: II. Applications, vol. 2, Springer, 2000.
[8] I.A. Elbakri and J.A. Fessler, “Statistical image reconstruction for polyenergetic X-ray computed tomography,” IEEE Transactions on Medical Imaging, vol. 21, no. 2, pp. 89–99, Feb. 2002.
[9] M. Zhou, L. Hannah, D. Dunson, and L. Carin, “Beta-negative binomial process and Poisson factor analysis,” International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
[10] R. Atar and T. Weissman, “Mutual information, relative entropy, and estimation in the Poisson channel,” IEEE Transactions on Information Theory, vol. 58, no. 3, pp. 1302–1318, March 2012.
[11] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[12] A. Ben-Tal, T. Margalit, and A. Nemirovski, “The ordered subsets mirror descent optimization method with applications to tomography,” SIAM Journal on Optimization, vol. 12, no. 1, pp. 79–108, Jan. 2001.
[13] L.M. Bregman, “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming,” USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200–217, March 1967.
[14] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.
[15] B.A. Frigyik, S. Srivastava, and M.R. Gupta, “Functional Bregman divergence and Bayesian estimation of distributions,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 5130–5139, Nov. 2008.
[16] R. Iyer and J. Bilmes, “Submodular-Bregman and the Lovász-Bregman divergences with applications,” in Advances in Neural Information Processing Systems, 2012, pp. 2942–2950.
[17] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[18] G.B. Folland, Real Analysis: Modern Techniques and Their Applications, Wiley New York, 1999.
[19] A. Agarwal, P.L. Bartlett, P. Ravikumar, and M.J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, May 2012.
[20] A.S. Nemirovsky and D.B. Yudin, Problem Complexity and Method Efficiency in Optimization, Wiley, 1983.
[21] A. Banerjee, X. Guo, and H. Wang, “On the optimality of conditional expectation as a Bregman predictor,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2664–2669, July 2005.
[22] W.R. Carson, M. Chen, M.R.D. Rodrigues, R. Calderbank, and L. Carin, “Communications-inspired projection design with application to compressive sensing,” SIAM J. Imaging Sciences, vol. 5, no. 4, pp. 1185–1212, Oct. 2012.