Convexity of mutual information along the Ornstein-Uhlenbeck flow

05/03/2018 · Andre Wibisono and Varun Jog · University of Wisconsin-Madison, Georgia Institute of Technology

We study the convexity of mutual information as a function of time along the flow of the Ornstein-Uhlenbeck process. We prove that if the initial distribution is strongly log-concave, then mutual information is eventually convex, i.e., convex for all large time. In particular, if the initial distribution is sufficiently strongly log-concave with respect to the target Gaussian measure, then mutual information is always a convex function of time. We also prove that if the initial distribution is either bounded or has finite fourth moment and Fisher information, then mutual information is eventually convex. Finally, we provide counterexamples to show that mutual information can be nonconvex at small time.


I Introduction

The Ornstein-Uhlenbeck (OU) process is the simplest stochastic process after Brownian motion, and it is the most general stochastic process for which we know the exact solution, which is an exponential interpolation of Gaussian noise. The OU process plays an important role in theory and applications. It provides an interpolation between any distribution and a Gaussian along a constant-covariance path. This property has been a vital tool to prove the optimality of the Gaussian in various information-theoretic inequalities, including the entropy power inequality [1, 2, 3].

In applications, the OU process appears as the continuous-time limit of basic Gaussian autoregressive models. It also appears as the approximate dynamics of stochastic algorithms after linearization around stationary points [4]. The OU process is a model example for general stochastic processes, and serves as a testbed to examine and test conjectures for the general case. Indeed, our approach to the OU process in this paper is motivated by a desire to understand how various information-theoretic quantities such as entropy, mutual information, and Fisher information evolve in general stochastic processes. This builds on our previous works [5, 6], where we carried out a similar analysis for the heat flow, and complements recent investigations along closely related lines [7, 8, 9, 10].

Both the heat diffusion and the OU process are examples of Fokker-Planck processes. The Fokker-Planck process is the sampling analogue of the usual gradient flow for optimization; indeed, we can view the Fokker-Planck process as the gradient flow of the relative entropy functional in the space of measures with the Wasserstein metric [11, 12, 13, 14]. This interpretation provides valuable information on the behavior of relative entropy. For example, if the target measure is log-concave, then relative entropy is decreasing in a convex manner along the Fokker-Planck process. Such a result also follows from the seminal work of Bakry and Émery [15, 16], where diffusion processes have been examined in exquisite detail.

The behavior of mutual information, on the other hand, is not as well understood. By the data processing inequality, mutual information is decreasing along the Fokker-Planck process, which means its first time derivative is nonpositive. Furthermore, we can derive identities relating information and estimation quantities [5], which generalize the I-MMSE relationship for the Gaussian channel [17]. At the level of the second time derivative, however, the behavior of mutual information is more complicated. In [6] we studied the basic case of the heat flow, or the Brownian motion, and we showed that mutual information exhibits interesting nonconvex behavior, especially at small time.

In this paper we study the corresponding questions for the OU process. Notice that the OU process may be obtained by rescaling time and space in the heat flow, and one could hope that convexity results for the heat flow [6] carry over with little work. However, this does not appear to be the case, since rescaling does not preserve the signs of higher-order derivatives, a fact also noted in [8] with regard to the derivatives of entropy obtained in [7]. For simplicity, in this paper we treat the case when the target Gaussian measure has isotropic covariance, but our technique extends to the general case. We show that the results for the heat flow qualitatively extend to the OU process, but now with a subtle interplay with the size of the covariance of the target Gaussian measure.

Our first main result states that if the initial distribution is strongly log-concave, then mutual information is eventually convex. In particular, if the initial distribution is sufficiently strongly log-concave compared to the target Gaussian measure, then mutual information is always convex. We also prove that if the initial distribution is either bounded, or has finite fourth moment and Fisher information, then mutual information is eventually convex, with a time threshold that depends on the target covariance. We also provide counterexamples to show that mutual information can be nonconvex at some small time. In particular, when the initial distribution is a mixture of point masses, we show that mutual information along the OU process initially starts at the discrete entropy, which is the same behavior as seen in the heat flow. In the limit of infinite target covariance, in which case the OU process becomes the Brownian motion, most of our results recover the corresponding results for the heat flow from [6].

II Background and problem setup

II-A The Ornstein-Uhlenbeck (OU) process

Let $\nu = \mathcal{N}(\mu, \sigma^2 I)$ be the Gaussian measure on $\mathbb{R}^n$ with mean $\mu \in \mathbb{R}^n$ and isotropic covariance $\sigma^2 I$, for some $\sigma > 0$. The Ornstein-Uhlenbeck (OU) process in $\mathbb{R}^n$ with target measure $\nu$ is the stochastic differential equation

$$dX_t = -\frac{1}{\sigma^2}\,(X_t - \mu)\,dt + \sqrt{2}\,dW_t, \qquad (1)$$

where $(X_t)_{t \ge 0}$ is a stochastic process in $\mathbb{R}^n$ and $(W_t)_{t \ge 0}$ is the standard Brownian motion in $\mathbb{R}^n$ starting at $W_0 = 0$. The OU process admits a closed-form solution in terms of an Itô integral: $X_t = \mu + e^{-t/\sigma^2}(X_0 - \mu) + \sqrt{2}\int_0^t e^{-(t-s)/\sigma^2}\,dW_s$. At each time $t \ge 0$, the solution satisfies the equality in distribution

$$X_t \overset{d}{=} \mu + e^{-t/\sigma^2}(X_0 - \mu) + \sqrt{\sigma^2\big(1 - e^{-2t/\sigma^2}\big)}\,Z, \qquad (2)$$

where $Z \sim \mathcal{N}(0, I)$ is independent of $X_0$. As $t \to \infty$, $X_t \to \nu$ in distribution exponentially fast, so indeed $\nu$ is the target stationary measure for the OU process.
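As a quick numerical illustration of (2), the following is a minimal Python sketch (included for concreteness; the function name and parameter values are ours, and the normalization is the one written in (1)–(2) above) that samples $X_t$ directly from the distributional identity and shows the moments approaching those of $\nu$:

```python
import numpy as np

def ou_sample(x0, t, mu=0.0, sigma=1.0, rng=None):
    """Sample X_t given X_0 = x0 via the distributional identity (2):
    X_t = mu + e^{-t/sigma^2} (x0 - mu) + sqrt(sigma^2 (1 - e^{-2t/sigma^2})) Z."""
    rng = np.random.default_rng() if rng is None else rng
    decay = np.exp(-t / sigma**2)
    noise_std = np.sqrt(sigma**2 * (1.0 - decay**2))
    return mu + decay * (np.asarray(x0) - mu) + noise_std * rng.standard_normal(np.shape(x0))

rng = np.random.default_rng(0)
x0 = rng.uniform(-5.0, 5.0, size=100_000)       # an arbitrary initial distribution
for t in [0.5, 2.0, 20.0]:
    xt = ou_sample(x0, t, mu=1.0, sigma=2.0, rng=rng)
    # moments approach those of nu = N(mu, sigma^2) = N(1, 4) exponentially fast
    print(f"t = {t:5.1f}:  mean = {xt.mean():+.3f},  var = {xt.var():.3f}")
```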

In the space of probability measures, the OU process (1) corresponds to the following partial differential equation (the Fokker-Planck equation):

$$\frac{\partial \rho}{\partial t} = \nabla \cdot \Big(\frac{1}{\sigma^2}\,\rho\,(x - \mu)\Big) + \Delta \rho. \qquad (3)$$

Here $\rho = \rho(x, t)$ is a probability density over space $x \in \mathbb{R}^n$ for each time $t \ge 0$, $\nabla \cdot$ is the divergence, and $\Delta$ is the Laplacian operator. Concretely, if the random variable $X_t$ evolves following the OU process (1), then its probability density function $\rho_t$ evolves following the equation (3) in the space of measures.

From the solution (2) in terms of random variables, we also have the following solution for (3) in the space of measures:

$$\rho_t = (E_t)_{\#}\rho_0 * \mathcal{N}\big(0,\ \sigma^2(1 - e^{-2t/\sigma^2})\,I\big), \qquad (4)$$

where $(E_t)_{\#}\rho_0$ is the pushforward of $\rho_0$ under the map $E_t(x) = \mu + e^{-t/\sigma^2}(x - \mu)$ for $x \in \mathbb{R}^n$, and $*$ denotes convolution. One may also directly verify that (4) satisfies the equation (3). We refer to the flow of (3), or the exact solution (4) above, as the Ornstein-Uhlenbeck (OU) flow in the space of measures. Note that as $\sigma \to \infty$, the drift in (1) vanishes, and the Ornstein-Uhlenbeck process above recovers the Brownian motion, or the heat flow.
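The closed-form solution (4) can be checked against the partial differential equation (3) numerically. The following is an illustrative sketch for the one-dimensional, point-mass instance of (4) written above (the grid and step sizes are arbitrary choices of ours):

```python
import numpy as np

def rho(x, t, x0=1.5, mu=0.0, sigma=1.0):
    """Density of rho_t from (4) when rho_0 is a point mass at x0 (one dimension)."""
    m = mu + np.exp(-t / sigma**2) * (x0 - mu)
    v = sigma**2 * (1.0 - np.exp(-2.0 * t / sigma**2))
    return np.exp(-((x - m) ** 2) / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

# Finite-difference check of the 1D Fokker-Planck equation (3):
#   d rho / dt = d/dx [ rho (x - mu) / sigma^2 ] + d^2 rho / dx^2
mu, sigma, t, h, dt = 0.0, 1.0, 0.7, 1e-3, 1e-4
x = np.linspace(-4.0, 4.0, 2001)
lhs = (rho(x, t + dt) - rho(x, t - dt)) / (2.0 * dt)
div_term = (rho(x + h, t) * (x + h - mu) - rho(x - h, t) * (x - h - mu)) / (2.0 * h * sigma**2)
laplacian = (rho(x + h, t) - 2.0 * rho(x, t) + rho(x - h, t)) / h**2
print("max |lhs - rhs| =", np.max(np.abs(lhs - (div_term + laplacian))))  # small (finite-difference error)
```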

II-B Derivatives of relative entropy along the OU flow

For a reference probability measure $\nu$ on $\mathbb{R}^n$, let

$$H_\nu(\rho) = \int_{\mathbb{R}^n} \rho(x) \log \frac{\rho(x)}{\nu(x)}\,dx$$

denote the relative entropy of a probability measure $\rho$ with respect to $\nu$. This is also known as the Kullback-Leibler (KL) divergence. Relative entropy has the property that $H_\nu(\rho) \ge 0$, and $H_\nu(\rho) = 0$ if and only if $\rho = \nu$.

The relative Fisher information of $\rho$ with respect to $\nu$ is

$$J_\nu(\rho) = \int_{\mathbb{R}^n} \rho(x) \left\| \nabla \log \frac{\rho(x)}{\nu(x)} \right\|^2 dx.$$

The relative second-order Fisher information of $\rho$ with respect to $\nu$ is

$$K_\nu(\rho) = \int_{\mathbb{R}^n} \rho(x) \left\| \nabla^2 \log \frac{\rho(x)}{\nu(x)} \right\|_{\mathrm{HS}}^2 dx,$$

where $\|A\|_{\mathrm{HS}} = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$ is the Hilbert-Schmidt (or Frobenius) norm of a symmetric matrix $A$. For a random variable $X \sim \rho$, we also write $H_\nu(X)$, $J_\nu(X)$, and $K_\nu(X)$ in place of $H_\nu(\rho)$, $J_\nu(\rho)$, and $K_\nu(\rho)$, respectively.
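To make these definitions concrete, here is a small numerical check (an illustrative sketch, not part of the paper; the closed form is the standard one-dimensional Gaussian computation): for $\rho = \mathcal{N}(m, s^2)$ and $\nu = \mathcal{N}(\mu, \sigma^2)$, the relative score is $\nabla \log\frac{\rho}{\nu}(x) = \frac{x-\mu}{\sigma^2} - \frac{x-m}{s^2}$, and a Monte Carlo estimate of the defining integral for $J_\nu(\rho)$ should match the resulting closed form.

```python
import numpy as np

# For 1D Gaussians rho = N(m, s^2) and nu = N(mu, sigma^2), the relative score is
#   grad log(rho/nu)(x) = (x - mu)/sigma^2 - (x - m)/s^2,
# which gives J_nu(rho) = (m - mu)^2 / sigma^4 + s^2 (1/sigma^2 - 1/s^2)^2.
def relative_fisher_gaussian(m, s, mu, sigma):
    return (m - mu) ** 2 / sigma**4 + s**2 * (1.0 / sigma**2 - 1.0 / s**2) ** 2

rng = np.random.default_rng(0)
m, s, mu, sigma = 0.7, 0.5, 0.0, 1.3
x = m + s * rng.standard_normal(1_000_000)       # samples from rho
score = (x - mu) / sigma**2 - (x - m) / s**2     # grad log(rho/nu) at the samples
print("Monte Carlo estimate of J_nu(rho):", np.mean(score**2))
print("closed form                      :", relative_fisher_gaussian(m, s, mu, sigma))
```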

We recall the interpretation of the OU flow as the gradient flow of the relative entropy functional $\rho \mapsto H_\nu(\rho)$ in the space of probability measures with the Wasserstein metric; see for example [13, 14, 12], or Appendix A. This allows us to translate general gradient flow relations to obtain the following identities for the time derivatives of relative entropy along the OU flow.

Lemma 1.

Along the OU flow for $\rho_t$,

$$\frac{d}{dt} H_\nu(\rho_t) = -J_\nu(\rho_t), \qquad \frac{d^2}{dt^2} H_\nu(\rho_t) = 2K_\nu(\rho_t) + \frac{2}{\sigma^2}\,J_\nu(\rho_t).$$

Note that $J_\nu(\rho_t) \ge 0$ and $K_\nu(\rho_t) \ge 0$. So by Lemma 1, the first time derivative of relative entropy is nonpositive while the second is nonnegative, which means relative entropy is decreasing in a convex manner along the OU flow.
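As a sanity check on the identities in Lemma 1 (an illustrative sketch using standard one-dimensional Gaussian closed forms; the parameter values are arbitrary): for a Gaussian initial distribution $\rho_0 = \mathcal{N}(m_0, s_0^2)$, (2) gives $\rho_t = \mathcal{N}(m_t, s_t^2)$ with $m_t = \mu + e^{-t/\sigma^2}(m_0 - \mu)$ and $s_t^2 = e^{-2t/\sigma^2} s_0^2 + \sigma^2(1 - e^{-2t/\sigma^2})$, so both identities can be verified by finite differences.

```python
import numpy as np

def flow(t, m0, s0, mu, sigma):
    """Mean and variance of rho_t when rho_0 = N(m0, s0^2), from (2)."""
    e = np.exp(-t / sigma**2)
    return mu + e * (m0 - mu), e**2 * s0**2 + sigma**2 * (1.0 - e**2)

def H(t, m0, s0, mu, sigma):      # relative entropy H_nu(rho_t) (1D Gaussian KL divergence)
    m, v = flow(t, m0, s0, mu, sigma)
    return 0.5 * (np.log(sigma**2 / v) + (v + (m - mu) ** 2) / sigma**2 - 1.0)

def J(t, m0, s0, mu, sigma):      # relative Fisher information J_nu(rho_t)
    m, v = flow(t, m0, s0, mu, sigma)
    return (m - mu) ** 2 / sigma**4 + v * (1.0 / sigma**2 - 1.0 / v) ** 2

def K(t, m0, s0, mu, sigma):      # relative second-order Fisher information K_nu(rho_t)
    _, v = flow(t, m0, s0, mu, sigma)
    return (1.0 / sigma**2 - 1.0 / v) ** 2

p = (1.5, 0.4, 0.0, 1.2)          # (m0, s0, mu, sigma)
t, dt = 0.8, 1e-4
dH = (H(t + dt, *p) - H(t - dt, *p)) / (2.0 * dt)
d2H = (H(t + dt, *p) - 2.0 * H(t, *p) + H(t - dt, *p)) / dt**2
print(dH, -J(t, *p))                                       # first identity:  dH/dt   = -J
print(d2H, 2.0 * K(t, *p) + (2.0 / p[3] ** 2) * J(t, *p))  # second identity: d2H/dt2 = 2K + (2/sigma^2) J
```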

II-C Derivatives of mutual information along the OU flow

Recall that given a functional $\rho_X \mapsto F(\rho_X)$ of (the distribution of) a random variable $X$, we can define its mutual version for a joint random variable $(X, Y)$ by

$$F(X; Y) = \mathbb{E}_X\big[F(\rho_{Y \mid X})\big] - F(\rho_Y), \qquad (5)$$

where $\mathbb{E}_X\big[F(\rho_{Y \mid X})\big]$ is the expected value of $F$ on the conditional random variables $Y \mid \{X = x\}$, averaged over $x \sim \rho_X$. The mutual version picks up only the nonlinear part, so two functionals that differ by a linear function have the same mutual version.

For example, for any $\rho$, the relative entropy $H_\nu(\rho)$ differs from the negative entropy $-H(\rho) = \int \rho \log \rho\,dx$ by the term $\int \rho \log\frac{1}{\nu}\,dx$, which is linear in $\rho$. Therefore, the mutual version of relative entropy is equal to the mutual version of negative entropy, which is mutual information: $H_\nu(X; Y) = I(X; Y)$.

We apply this definition to the joint random variable $(X_0, X_t)$, where $X_t$ is the OU flow from $X_0 \sim \rho_0$. By the linearity of the OU channel, Lemma 1 yields the following identities for the time derivatives of mutual information along the OU flow.

Lemma 2.

Along the OU flow for $(X_0, X_t)$,

$$\frac{d}{dt} I(X_0; X_t) = -J_\nu(X_0; X_t), \qquad \frac{d^2}{dt^2} I(X_0; X_t) = 2K_\nu(X_0; X_t) + \frac{2}{\sigma^2}\,J_\nu(X_0; X_t).$$

We discuss the signs of the first and second derivatives.

II-C1 First derivative of mutual information

We recall that for a general joint random variable $(X, Y)$, the mutual relative Fisher information $J_\nu(X; Y)$ is equal to the backward Fisher information, which is manifestly nonnegative. Furthermore, we also recall that along the OU flow, the backward Fisher information is proportional to the conditional variance of $X_0$ given $X_t$; see for example [5, III-D.2], or Appendix E.

Lemma 3.

Along the OU flow for $(X_0, X_t)$,

$$J_\nu(X_0; X_t) = \frac{e^{-2t/\sigma^2}}{\sigma^4\big(1 - e^{-2t/\sigma^2}\big)^2}\,\mathrm{mmse}(X_0 \mid X_t),$$

where $\mathrm{mmse}(X_0 \mid X_t) = \mathbb{E}\big[\|X_0 - \mathbb{E}[X_0 \mid X_t]\|^2\big]$ is the minimum mean-square error of estimating $X_0$ from $X_t$.

Combining Lemma 3 with the first identity of Lemma 2 expresses the time derivative of mutual information along the OU flow in terms of the minimum mean-square error (mmse) of estimating $X_0$ from $X_t$; this generalizes the I-MMSE relationship of [17] from the Gaussian channel (the heat flow) to the OU flow.
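As a concrete check of this relation (an illustrative sketch, assuming the proportionality constant stated in Lemma 3 above; the parameter values are ours): for a Gaussian input $X_0 \sim \mathcal{N}(0, s_0^2)$ in one dimension, both $I(X_0; X_t)$ and $\mathrm{mmse}(X_0 \mid X_t)$ have closed forms, so the identity $\frac{d}{dt} I(X_0; X_t) = -J_\nu(X_0; X_t)$ can be verified by finite differences.

```python
import numpy as np

def snr(t, sigma):
    """Effective signal-to-noise ratio of the OU channel X_0 -> X_t implied by (2)."""
    e2 = np.exp(-2.0 * t / sigma**2)
    return e2 / (sigma**2 * (1.0 - e2))

def mutual_info_gaussian(t, s0, sigma):
    """I(X_0; X_t) in nats for X_0 ~ N(0, s0^2) in one dimension."""
    return 0.5 * np.log(1.0 + snr(t, sigma) * s0**2)

def mmse_gaussian(t, s0, sigma):
    """mmse(X_0 | X_t) for X_0 ~ N(0, s0^2) in one dimension."""
    return s0**2 / (1.0 + snr(t, sigma) * s0**2)

s0, sigma, t, dt = 1.7, 1.3, 0.9, 1e-5
dI = (mutual_info_gaussian(t + dt, s0, sigma) - mutual_info_gaussian(t - dt, s0, sigma)) / (2.0 * dt)
e2 = np.exp(-2.0 * t / sigma**2)
factor = e2 / (sigma**4 * (1.0 - e2) ** 2)          # proportionality constant appearing in Lemma 3
print(dI, -factor * mmse_gaussian(t, s0, sigma))    # should agree: dI/dt = -J_nu(X_0; X_t) = -factor * mmse
```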

II-C2 Second derivative of mutual information

Unlike $J_\nu(X_0; X_t)$, which is always nonnegative, the mutual relative second-order Fisher information $K_\nu(X_0; X_t)$ can in general be negative. However, if $\rho_t$ is sufficiently log-concave compared to the reference measure $\nu$, then we can lower bound $K_\nu(X_0; X_t)$ and use the additional term $\frac{2}{\sigma^2}J_\nu(X_0; X_t)$ in the second identity of Lemma 2 to offset it. Altogether, we have the following result.

We say a probability density $\rho$ on $\mathbb{R}^n$ is $\alpha$-strongly log-concave for some $\alpha > 0$ if $-\nabla^2 \log \rho(x) \succeq \alpha I$ for all $x$.

Lemma 4.

Along the OU flow for $(X_0, X_t)$, if $\rho_t$ is sufficiently strongly log-concave, with an explicit log-concavity threshold depending only on $\sigma$, then mutual information is convex at time $t$.

Lemma 4 provides a sufficient condition for mutual information to be convex along the OU flow. Since the target measure $\nu$ is $\frac{1}{\sigma^2}$-strongly log-concave, we expect any initial distribution to eventually be transformed into a distribution that is strongly log-concave enough to satisfy the condition of Lemma 4. We prove this in two cases: when the initial distribution is strongly log-concave, or when the initial distribution is bounded (or a convolution of the two cases). First, we have the following classical estimate.

Lemma 5.

Along the OU flow, if $\rho_0$ is $\alpha$-strongly log-concave for some $\alpha > 0$, then $\rho_t$ is $\alpha_t$-strongly log-concave with $\alpha_t = \Big(\frac{e^{-2t/\sigma^2}}{\alpha} + \sigma^2\big(1 - e^{-2t/\sigma^2}\big)\Big)^{-1}$.

We say $\rho$ is $D$-bounded for some $D < \infty$ if the support of $\rho$ is contained in a ball of diameter $D$. Then we also have the following.

Lemma 6.

Along the OU flow, if $\rho_0$ is $D$-bounded, then $\rho_t$ is $\beta_t$-strongly log-concave with $\beta_t = \frac{1}{v_t} - \frac{D^2 e^{-2t/\sigma^2}}{4 v_t^2}$, where $v_t = \sigma^2\big(1 - e^{-2t/\sigma^2}\big)$, for all $t \ge \frac{\sigma^2}{2}\log\big(1 + \frac{D^2}{4\sigma^2}\big)$.

The time threshold in Lemma 6 is chosen so that the log-concavity constant $\beta_t$ is nonnegative. It is possible to combine Lemmas 5 and 6 to handle the case when the initial distribution is a convolution of a strongly log-concave distribution and a bounded distribution (for example, a mixture of Gaussians), but with a more complicated threshold; for simplicity, we omit it here.

III Convexity of mutual information

We present our main results on the convexity of mutual information along the OU flow. Throughout, let $\rho_t$ denote the OU flow from $\rho_0$, with $X_t \sim \rho_t$ and $X_0 \sim \rho_0$.

III-A Eventual convexity when initial distribution is strongly log-concave

By combining Lemmas 4 and 5, we establish the following result for strongly log-concave distributions:

Theorem 1.

Suppose $\rho_0$ is $\alpha$-strongly log-concave for some $\alpha > 0$. Along the OU flow for $(X_0, X_t)$, mutual information is convex for all $t \ge t_1$, where $t_1 = t_1(\alpha, \sigma) \ge 0$ is an explicit time threshold depending on $\alpha$ and $\sigma$; in particular, $t_1 = 0$ when $\alpha$ is sufficiently large relative to $\frac{1}{\sigma^2}$.

Theorem 1 above proves the eventual convexity of mutual information for any strongly log-concave initial distribution. However, the threshold $t_1$ is not tight. For example, if $\rho_0 = \mathcal{N}(m_0, \Sigma_0)$ is Gaussian, then $I(X_0; X_t) = \frac{1}{2}\sum_{i=1}^n \log\big(1 + \mathrm{snr}_t\,\lambda_i\big)$, where $\mathrm{snr}_t = \frac{e^{-2t/\sigma^2}}{\sigma^2(1 - e^{-2t/\sigma^2})}$ and $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $\Sigma_0$. Then $\rho_0$ is $(1/\lambda_{\max})$-strongly log-concave, where $\lambda_{\max} = \max_i \lambda_i$, but one can verify that in this case mutual information is always convex for all $t \ge 0$, for any $\Sigma_0$. Ultimately this gap is due to the fact that the sufficient condition in Lemma 4 is not necessary, and it would be interesting to see how to tighten it.

III-B Eventual convexity when initial distribution is bounded

By combining Lemmas 4 and 6, we establish the following result for distributions with bounded support.

Theorem 2.

Suppose $\rho_0$ is $D$-bounded for some $D < \infty$. Along the OU flow for $(X_0, X_t)$, mutual information is convex for all $t \ge t_2$, where $t_2 = t_2(D, \sigma)$ is an explicit time threshold depending on $D$ and $\sigma$.

Note that as $\sigma \to \infty$, the threshold $t_2$ in Theorem 2 above converges to the corresponding threshold for the heat flow, thus recovering the result from [6, Theorem 2]. As noted previously, we can combine Theorems 1 and 2 to show the eventual convexity of mutual information when the initial distribution is a convolution of a strongly log-concave distribution and a bounded distribution, but for simplicity we omit it here.

III-C Eventual convexity when Fisher information is finite

We now investigate the convexity of mutual information in general, regardless of the log-concavity of the distributions. We can show that if the initial distribution has finite fourth moment and Fisher information, then mutual information is eventually convex along the OU flow.

For $p \ge 1$, let $M_p = \mathbb{E}\big[\|X - \bar{x}\|^p\big]$ denote the $p$-th central moment of a random variable $X$ with mean $\bar{x}$. Let $J(X) = \int \rho(x)\,\|\nabla \log \rho(x)\|^2\,dx$ denote the (absolute) Fisher information of $X \sim \rho$. Then we have the following.

Theorem 3.

Suppose $X_0 \sim \rho_0$ has $M_4 < \infty$ and $J(X_0) < \infty$. Along the OU flow for $(X_0, X_t)$, mutual information is convex for all $t \ge t_3$, where $t_3$ is an explicit time threshold depending on $M_4$, $J(X_0)$, and $\sigma$.

Theorem 3 above proves the eventual convexity of mutual information under rather general conditions. However, in the limit $\sigma \to \infty$ the time threshold $t_3$ does not recover the corresponding result for the heat flow from [6, Theorem 3]. This is because in the proof we use a simple bound to estimate the root of a cubic polynomial (see Appendix K), and it would be interesting to refine the analysis to obtain a better estimate.

IV Nonconvexity of mutual information

We now provide counterexamples to show that mutual information can be concave at small time along the OU flow. Throughout, for $t \ge 0$, let $v_t = \sigma^2\big(1 - e^{-2t/\sigma^2}\big)$ denote the variance of the Gaussian noise in (2). For $m \in \mathbb{R}$ and $v > 0$, let $\mathcal{N}(m, v)$ denote the one-dimensional Gaussian random variable with mean $m$ and variance equal to $v$.

IV-A Mixture of two point masses

Let $\rho_0 = \frac{1}{2}\big(\delta_{-a} + \delta_{a}\big)$ be a uniform mixture of two point masses in $\mathbb{R}$ centered at $-a$ and $a$, for some $a > 0$. Along the OU flow for $(X_0, X_t)$, $\rho_t$ is a uniform mixture of two Gaussians.

By direct calculation, the mutual information is

$$I(X_0; X_t) = \gamma_t - \mathbb{E}\big[\log\cosh\big(\gamma_t + \sqrt{\gamma_t}\,Z\big)\big], \qquad \gamma_t = \frac{a^2 e^{-2t/\sigma^2}}{v_t},$$

where $Z \sim \mathcal{N}(0, 1)$.

The behavior is depicted in Fig. 1(a). We see that mutual information is not convex at small time. It starts at the value $\log 2$, which follows from the general result in Theorem 4, then stays flat for a while before decreasing and becoming convex.
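The computation behind Fig. 1(a) can be reproduced numerically. The following is an illustrative sketch under the symmetric choice of centers $\pm a$ used above, with arbitrary parameter values; the expectation over $Z$ is evaluated by Gauss-Hermite quadrature.

```python
import numpy as np

def mutual_info_two_point(t, a=1.0, sigma=1.0, nodes=200):
    """I(X_0; X_t) in nats for X_0 uniform on {-a, +a}, via the log-cosh formula above."""
    e2 = np.exp(-2.0 * t / sigma**2)
    gamma = a**2 * e2 / (sigma**2 * (1.0 - e2))          # effective SNR of the OU channel at time t
    z, w = np.polynomial.hermite_e.hermegauss(nodes)     # quadrature for expectations over Z ~ N(0, 1)
    w = w / np.sqrt(2.0 * np.pi)
    x = gamma + np.sqrt(gamma) * z
    # log cosh(x) evaluated stably as |x| + log(1 + e^{-2|x|}) - log 2
    logcosh = np.abs(x) + np.log1p(np.exp(-2.0 * np.abs(x))) - np.log(2.0)
    return gamma - np.sum(w * logcosh)

ts = np.linspace(0.01, 3.0, 300)
vals = np.array([mutual_info_two_point(t) for t in ts])
print("I at t = 0.01         :", vals[0], " (log 2 =", np.log(2.0), ")")
print("min second difference :", np.diff(vals, 2).min())   # negative => nonconvex at some small time
```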

IV-B Mixture of two Gaussians

Let $\rho_0 = \frac{1}{2}\big(\mathcal{N}(-a, s^2) + \mathcal{N}(a, s^2)\big)$ be a uniform mixture of two Gaussians in $\mathbb{R}$ with the same variance $s^2$ for some $s > 0$, centered at $-a$ and $a$ for some $a > 0$. Along the OU flow for $(X_0, X_t)$, $\rho_t$ is also a uniform mixture of two Gaussians.

By direct calculation, the mutual information is

$$I(X_0; X_t) = h(\rho_t) - \frac{1}{2}\log\big(2\pi e\,v_t\big),$$

where $h(\rho_t)$ is the differential entropy of the two-component Gaussian mixture $\rho_t$.

The behavior is depicted in Fig. 1(b). We see that now mutual information starts at $+\infty$, but it decreases quickly and flattens out for a while before decreasing again. Therefore, mutual information is still not convex at some small time.

Fig. 1: Mutual information $t \mapsto I(X_0; X_t)$ along the OU flow. (a) Left: mixture of two point masses (Section IV-A). (b) Right: mixture of two Gaussians (Section IV-B).
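The curve in Fig. 1(b) can be reproduced in the same way. The following is an illustrative sketch under the symmetric choice of centers $\pm a$ and common variance $s^2$ used above, with arbitrary parameter values: since $X_t$ given $X_0$ is Gaussian with variance $v_t$, we have $I(X_0; X_t) = h(\rho_t) - \frac{1}{2}\log(2\pi e\,v_t)$, and the differential entropy $h(\rho_t)$ of the two-component Gaussian mixture can be evaluated by numerical integration.

```python
import numpy as np

def mutual_info_two_gaussians(t, a=1.0, s=0.3, sigma=1.0, grid=20001):
    """I(X_0; X_t) in nats for X_0 ~ (1/2) N(-a, s^2) + (1/2) N(+a, s^2), one dimension."""
    e = np.exp(-t / sigma**2)
    v_noise = sigma**2 * (1.0 - e**2)                 # conditional variance of X_t given X_0, i.e. v_t
    v_mix = e**2 * s**2 + v_noise                     # variance of each mixture component of rho_t
    c = a * e                                         # component means of rho_t are +-c
    half_width = c + 8.0 * np.sqrt(v_mix)
    x = np.linspace(-half_width, half_width, grid)
    comp = lambda m: np.exp(-((x - m) ** 2) / (2.0 * v_mix)) / np.sqrt(2.0 * np.pi * v_mix)
    p = 0.5 * comp(+c) + 0.5 * comp(-c)               # density of rho_t
    h = -np.sum(p * np.log(np.maximum(p, 1e-300))) * (x[1] - x[0])   # differential entropy h(rho_t)
    return h - 0.5 * np.log(2.0 * np.pi * np.e * v_noise)            # I = h(X_t) - h(X_t | X_0)

for t in [0.01, 0.1, 0.5, 1.0, 2.0]:
    print(t, mutual_info_two_gaussians(t))   # large at small t, then decreases toward 0
```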

IV-C General mixture of point masses

Let $\rho_0 = \sum_{i=1}^k p_i\,\delta_{a_i}$ be a mixture of $k$ point masses centered at distinct points $a_1, \dots, a_k \in \mathbb{R}^n$, with mixture probabilities $p_i > 0$, $\sum_{i=1}^k p_i = 1$. Along the OU flow for $(X_0, X_t)$, $\rho_t$ is a mixture of $k$ Gaussians.

By adapting the estimates from [6, IV-C], we can show that mutual information along the OU flow starts at a finite value, equal to the discrete entropy of the mixture probabilities, and that it is exponentially concentrated around this value at small time. This is the same phenomenon as in the heat flow case, except that the bound on the small time now depends on $\sigma$.

Let $D = \max_{i \ne j} \|a_i - a_j\|$ and $d = \min_{i \ne j} \|a_i - a_j\|$ denote the diameter and the minimum separation of the centers. Let $H(p) = -\sum_{i=1}^k p_i \log p_i$ denote the discrete entropy.

Theorem 4.

Along the OU flow for $(X_0, X_t)$, for all sufficiently small $t > 0$ (below an explicit threshold depending on $d$, $D$, and $\sigma$), the gap $H(p) - I(X_0; X_t)$ is nonnegative and exponentially small in $1/t$.

Theorem 4 above implies that $\lim_{t \to 0^+} I(X_0; X_t) = H(p)$. Thus, for a discrete initial distribution, the initial value of mutual information depends only on the mixture proportions, and not on the locations of the centers, as long as they are distinct. This is the same interesting behavior as in the heat flow case, and it shows that we can obtain discontinuities of mutual information at the origin with respect to the initial distribution by moving the centers and merging them.

Furthermore, if a function converges exponentially fast, then all its derivatives converge to zero exponentially fast. In our case of a discrete initial distribution, this gives the following.

Corollary 1.

For all $k \ge 1$, $\lim_{t \to 0^+} \frac{d^k}{dt^k}\, I(X_0; X_t) = 0$.

In particular, the mutual relative Fisher information $J_\nu(X_0; X_t)$ also starts at $0$. Since the initial distribution, a mixture of point masses, is bounded, by Theorem 2 we know mutual information is eventually convex, which means $J_\nu(X_0; X_t)$ is eventually decreasing. Since $J_\nu(X_0; X_t)$ is always nonnegative, it must initially increase before it can decrease. During the period in which $J_\nu(X_0; X_t)$ is increasing, mutual information is concave. This is similar to the behavior for the mixture of two point masses observed in IV-A. Moreover, by the continuity of the mutual relative first- and second-order Fisher information with respect to the initial distribution, this suggests that mutual information can also be concave at some small time when the initial distribution is a mixture of Gaussians, similar to the observation in IV-B.

V Discussion and future work

In this paper we have studied the convexity of mutual information along the Ornstein-Uhlenbeck flow. We considered the gradient flow interpretation of the Ornstein-Uhlenbeck process in the space of measures, and derived formulae for the various derivatives of relative entropy and mutual information. We have shown that mutual information is eventually convex under rather general conditions on the initial distribution. We have also shown examples in which mutual information is concave at some small time. These results generalize the behaviors seen in the heat flow [6].

For simplicity in this paper we have treated only the case when the target Gaussian distribution has isotropic covariance. It is possible to extend our results to handle the case of a general covariance matrix. In this case, extra caution needs to be exercised since matrices in general do not commute. In the simple case when the covariance matrices of the initial and target distributions commute, our results extend naturally and the various thresholds are now controlled by the eigenvalues of the matrices.

As noted in the introduction, our interest in studying this problem is to better understand the general Fokker-Planck process. Indeed, there is an interesting dichotomy: we understand the Ornstein-Uhlenbeck process in intricate detail because we have an explicit solution, whereas we know comparatively little about the general Fokker-Planck process. The gradient flow interpretation applies to the general Fokker-Planck process and provides information on the convexity properties of relative entropy when the target measure is log-concave. However, much remains unknown, even about mutual information. Hence in this paper we have attempted to settle the case of the Ornstein-Uhlenbeck process; even here, some of our results are not tight and can be improved.

Some interesting future directions are to understand the convexity property of the solution to the Fokker-Planck process, even in the nice case when the target measure is strongly log-concave. For example, does the Fokker-Planck process preserve log-concavity relative to the target measure? Furthermore, is self-information (mutual information at initial time) for discrete initial distribution still equal to the discrete entropy for the Fokker-Planck process? In general, it is interesting to bridge the gap in our understanding between the Ornstein-Uhlenbeck process and the general Fokker-Planck process. One avenue to do that may be to study a perturbation of the Ornstein-Uhlenbeck process, when the target distribution is a small perturbation of the Gaussian.

References

  • [1] N. Blachman, “The convolution inequality for entropy powers,” IEEE Transactions on Information Theory, vol. 11, no. 2, pp. 267–271, 1965.
  • [2] O. Rioul, “Information theoretic proofs of entropy power inequalities,” IEEE Transactions on Information Theory, vol. 57, no. 1, pp. 33–55, 2011.
  • [3] M. Madiman and A. Barron, “Generalized entropy power inequalities and monotonicity properties of information,” IEEE Transactions on Information Theory, vol. 53, no. 7, pp. 2317–2329, 2007.
  • [4] S. Mandt, M. Hoffman, and D. Blei, “A variational analysis of stochastic gradient algorithms,” in International Conference on Machine Learning, 2016, pp. 354–363.
  • [5] A. Wibisono, V. Jog, and P. Loh, “Information and estimation in Fokker-Planck channels,” in 2017 IEEE International Symposium on Information Theory, ISIT 2017, Aachen, Germany, 2017, pp. 2673–2677.
  • [6] A. Wibisono and V. Jog, “Convexity of mutual information along the heat flow,” in 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, USA, 2018.
  • [7] D. Guo, Y. Wu, S. S. Shitz, and S. Verdú, “Estimation in Gaussian noise: Properties of the minimum mean-square error,” IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2371–2385, 2011.
  • [8] F. Cheng and Y. Geng, “Higher order derivatives in Costa’s entropy power inequality,” IEEE Transactions on Information Theory, vol. 61, no. 11, pp. 5892–5905, 2015.
  • [9] G. Toscani, “A concavity property for the reciprocal of Fisher information and its consequences on Costa’s EPI,” Physica A: Statistical Mechanics and its Applications, vol. 432, pp. 35–42, 2015.
  • [10] X. Zhang, V. Anantharam, and Y. Geng, “Gaussian optimality for derivatives of differential entropy using linear matrix inequalities,” Entropy, vol. 20, no. 3, p. 182, 2018.
  • [11] R. Jordan, D. Kinderlehrer, and F. Otto, “The variational formulation of the Fokker–Planck equation,” SIAM Journal on Mathematical Analysis, vol. 29, no. 1, pp. 1–17, January 1998.
  • [12] F. Otto and C. Villani, “Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality,” Journal of Functional Analysis, vol. 173, no. 2, pp. 361–400, 2000.
  • [13] C. Villani, Topics in optimal transportation.   American Mathematical Society, 2003, no. 58.
  • [14] ——, Optimal Transport: Old and New, ser. Grundlehren der mathematischen Wissenschaften.   Springer Berlin Heidelberg, 2008, vol. 338.
  • [15] D. Bakry and M. Émery, “Diffusions hypercontractives,” in Séminaire de Probabilités XIX 1983/84.   Springer, 1985, pp. 177–206.
  • [16] D. Bakry, I. Gentil, and M. Ledoux, Analysis and geometry of Markov diffusion operators.   Springer, 2013, vol. 348.
  • [17] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
  • [18] A. Wibisono and V. Jog, “Convexity of mutual information along the Ornstein-Uhlenbeck flow,” arXiv preprint arXiv:1805.01401, 2018.
  • [19] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, Jan. 2008.
  • [20] A. Saumard and J. A. Wellner, “Log-concavity and strong log-concavity: A review,” Statistics Surveys, vol. 8, p. 45, 2014.
  • [21] A. Dembo, T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Transactions on Information Theory, vol. 37, no. 6, pp. 1501–1518, 1991.