I. Introduction
The Ornstein-Uhlenbeck (OU) process is the simplest stochastic process after Brownian motion, and it is the most general stochastic process for which the exact solution is known: an exponential interpolation of Gaussian noise. The OU process plays an important role in both theory and applications. In theory, the OU process provides an interpolation between an arbitrary distribution and a Gaussian along a constant-covariance path. This property has been a vital tool for proving the optimality of the Gaussian in various information-theoretic inequalities, including the entropy power inequality [1, 2, 3]. In applications, the OU process appears as the continuous-time limit of basic Gaussian autoregressive models. It also appears as the approximate dynamics of stochastic algorithms after linearization around stationary points
[4]. The OU process is a model example for general stochastic processes, and serves as a testbed for examining conjectures about the general case. Indeed, our approach to the OU process in this paper is motivated by a desire to understand how various information-theoretic quantities, such as entropy, mutual information, and Fisher information, evolve along general stochastic processes. This builds on our previous works [5, 6], where we carried out a similar analysis for the heat flow, and complements recent investigations along closely related lines [7, 8, 9, 10].

Both the heat flow and the OU process are examples of Fokker-Planck processes. The Fokker-Planck process is the sampling analogue of the usual gradient flow for optimization; indeed, we can view the Fokker-Planck process as the gradient flow of the relative entropy functional in the space of measures with the Wasserstein metric [11, 12, 13, 14]. This interpretation provides valuable information on the behavior of relative entropy. For example, if the target measure is log-concave, then relative entropy decreases in a convex manner along the Fokker-Planck process. Such a result also follows from the seminal work of Bakry and Émery [15, 16], where diffusion processes have been examined in exquisite detail.
The behavior of mutual information, on the other hand, is not as well understood. By the data processing inequality, mutual information is decreasing along the Fokker-Planck process, which means the first time derivative is negative. Furthermore, we can derive identities relating information and estimation quantities [5], which generalize the I-MMSE relationship for the Gaussian channel [17]. At the level of the second time derivative, however, the behavior of mutual information is more complicated. In [6] we studied the basic case of the heat flow, i.e., the Brownian motion, and showed that mutual information can exhibit interesting nonconvex behavior, especially at small time.

In this paper we study the corresponding questions for the OU process. Note that the OU process may be obtained from the heat flow by rescaling time and space, so one could hope that the convexity results for the heat flow [6] carry over with little work. However, this does not appear to be the case, since rescaling does not preserve the signs of higher-order derivatives, a fact also noted in [8] with regard to the derivatives of entropy obtained in [7]. For simplicity, in this paper we treat the case when the target Gaussian measure has isotropic covariance, but our technique extends to the general case. We show that the results for the heat flow qualitatively extend to the OU process, but now with a subtle interplay with the size of the covariance of the target Gaussian measure.
Our first main result states that if the initial distribution is strongly log-concave, then mutual information is eventually convex. In particular, if the initial distribution is sufficiently strongly log-concave compared to the target Gaussian measure, then mutual information is convex for all time. We also prove that if the initial distribution is either bounded, or has finite fourth moment and Fisher information, then mutual information is eventually convex, with a time threshold that depends on the target covariance. Conversely, we provide counterexamples showing that mutual information can be nonconvex at some small time. In particular, when the initial distribution is a mixture of point masses, we show that mutual information along the OU process initially starts at the discrete entropy, which is the same behavior as seen in the heat flow. In the limit of infinite target covariance, in which case the OU process becomes the Brownian motion, most of our results recover the corresponding results for the heat flow from [6].
II. Background and problem setup
II-A The Ornstein-Uhlenbeck (OU) process
Let $\nu = \mathcal{N}(0, \frac{1}{\alpha} I)$ be the Gaussian measure on $\mathbb{R}^n$ with mean $0$ and isotropic covariance $\frac{1}{\alpha} I$, for some $\alpha > 0$. The Ornstein-Uhlenbeck (OU) process on $\mathbb{R}^n$ with target measure $\nu$ is the stochastic differential equation

$$dX_t = -\alpha X_t \, dt + \sqrt{2} \, dW_t, \qquad (1)$$

where $(X_t)_{t \ge 0}$ is a stochastic process on $\mathbb{R}^n$ and $(W_t)_{t \ge 0}$ is the standard Brownian motion on $\mathbb{R}^n$ starting at $W_0 = 0$. The OU process admits a closed-form solution in terms of an Itô integral: $X_t = e^{-\alpha t} X_0 + \sqrt{2} \int_0^t e^{-\alpha (t-s)} \, dW_s$. At each time $t \ge 0$, the solution has the equality in distribution

$$X_t \stackrel{d}{=} e^{-\alpha t} X_0 + \sqrt{\tfrac{1 - e^{-2\alpha t}}{\alpha}} \, Z, \qquad Z \sim \mathcal{N}(0, I), \qquad (2)$$

where $Z$ is independent of $X_0$. As $t \to \infty$, $X_t \to \nu$ in distribution exponentially fast, so $\nu$ is indeed the target stationary measure for the OU process.
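As a quick numerical illustration (not from the paper; NumPy with illustrative parameter values, and assuming the parameterization $\nu = \mathcal{N}(0, \frac{1}{\alpha} I)$ and SDE $dX_t = -\alpha X_t\, dt + \sqrt{2}\, dW_t$ adopted here), the equality in distribution above yields an exact simulation scheme with no discretization error in the transition:

```python
import numpy as np

# Numerical sketch (not from the paper): simulate the OU process
# dX_t = -alpha X_t dt + sqrt(2) dW_t via its exact transition
#   X_{t+h} = e^{-alpha h} X_t + sqrt((1 - e^{-2 alpha h}) / alpha) Z,
# and check that the empirical variance approaches the target 1/alpha.
rng = np.random.default_rng(0)
alpha, h, n_steps, n_paths = 2.0, 0.01, 2000, 50_000

a = np.exp(-alpha * h)                              # per-step contraction
s = np.sqrt((1 - np.exp(-2 * alpha * h)) / alpha)   # exact per-step noise std

x = np.full(n_paths, 3.0)                           # deterministic start X_0 = 3
for _ in range(n_steps):
    x = a * x + s * rng.standard_normal(n_paths)

print(np.var(x))   # close to 1/alpha = 0.5
```

Because the per-step coefficients are exact, the empirical variance matches the stationary value $1/\alpha$ up to Monte Carlo error only.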
In the space of probability measures, the OU process (1) corresponds to the following partial differential equation:

$$\frac{\partial \rho_t}{\partial t} = \nabla \cdot (\alpha x \rho_t) + \Delta \rho_t. \qquad (3)$$

Here $\rho_t = \rho(x, t)$ is a probability density over the space variable $x \in \mathbb{R}^n$ for each time $t \ge 0$, $\nabla \cdot$ is the divergence, and $\Delta = \nabla \cdot \nabla$ is the Laplacian operator. Concretely, if the random variable $X_t$ evolves following the OU process (1), then its probability density function $\rho_t$ evolves following the equation (3) in the space of measures. From the solution (2) in terms of random variables, we also have the following solution for (3) in the space of measures:

$$\rho_t = (e^{-\alpha t})_{\#} \rho_0 * \mathcal{N}(0, s_t^2 I), \qquad (4)$$

where $(e^{-\alpha t})_{\#} \rho_0$ denotes the law of $e^{-\alpha t} X_0$ for $X_0 \sim \rho_0$, $*$ denotes convolution, and $s_t^2 = \frac{1 - e^{-2\alpha t}}{\alpha}$ for $t \ge 0$. One may also directly verify that (4) satisfies the equation (3). We refer to the flow of (3), or the exact solution (4) above, as the Ornstein-Uhlenbeck (OU) flow in the space of measures. Note that as $\alpha \to 0$ we have $s_t^2 \to 2t$, and the Ornstein-Uhlenbeck process above recovers the Brownian motion, i.e., the heat flow.
II-B Derivatives of relative entropy along the OU flow
For a reference probability measure $\nu$ on $\mathbb{R}^n$, let

$$H_\nu(\rho) = \int_{\mathbb{R}^n} \rho(x) \log \frac{\rho(x)}{\nu(x)} \, dx$$

denote the relative entropy of a probability measure $\rho$ with respect to $\nu$. This is also known as the Kullback-Leibler (KL) divergence. Relative entropy has the property that $H_\nu(\rho) \ge 0$, and $H_\nu(\rho) = 0$ if and only if $\rho = \nu$.

The relative Fisher information of $\rho$ with respect to $\nu$ is

$$J_\nu(\rho) = \int_{\mathbb{R}^n} \rho(x) \left\| \nabla \log \frac{\rho(x)}{\nu(x)} \right\|^2 dx.$$

The relative second-order Fisher information of $\rho$ with respect to $\nu$ is

$$K_\nu(\rho) = \int_{\mathbb{R}^n} \rho(x) \left\| \nabla^2 \log \frac{\rho(x)}{\nu(x)} \right\|_{\mathrm{HS}}^2 dx,$$

where $\|A\|_{\mathrm{HS}} = \sqrt{\sum_{i,j} A_{ij}^2}$ is the Hilbert-Schmidt (or Frobenius) norm of a symmetric matrix $A$. For $\nu$ the Lebesgue measure, we also write $H(\rho)$, $J(\rho)$, and $K(\rho)$ in place of $H_\nu(\rho)$, $J_\nu(\rho)$, and $K_\nu(\rho)$, respectively.
We recall the interpretation of the OU flow as the gradient flow of the relative entropy $H_\nu$ in the space of probability measures with the Wasserstein metric; see for example [13, 14, 12], or Appendix A. This allows us to translate general gradient flow relations into the following identities for the time derivatives of relative entropy along the OU flow.
Lemma 1.
Along the OU flow $(\rho_t)_{t \ge 0}$ for $t > 0$,

$$\frac{d}{dt} H_\nu(\rho_t) = -J_\nu(\rho_t), \qquad \frac{d^2}{dt^2} H_\nu(\rho_t) = 2 K_\nu(\rho_t) + 2\alpha J_\nu(\rho_t).$$
Note that $J_\nu(\rho_t) \ge 0$ and $K_\nu(\rho_t) \ge 0$. So by Lemma 1, the first derivative of $H_\nu(\rho_t)$ is negative while the second derivative is positive, which means relative entropy is decreasing in a convex manner along the OU flow.
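For intuition, the decrease and convexity of relative entropy can be checked in the Gaussian-to-Gaussian case, where everything is in closed form. The following sketch (not from the paper; illustrative parameters, one dimension, and assuming the target $\nu = \mathcal{N}(0, 1/\alpha)$ used here) evaluates the KL divergence along the flow and takes finite differences:

```python
import numpy as np

# Sketch (Gaussian-to-Gaussian case): along the OU flow with target
# nu = N(0, 1/alpha), a Gaussian initial law N(0, s0^2) stays Gaussian with
#   var(t) = e^{-2 alpha t} s0^2 + (1 - e^{-2 alpha t}) / alpha,
# and KL(N(0, v) || N(0, 1/alpha)) = 0.5 * (alpha v - 1 - log(alpha v)).
# Finite differences then exhibit the decrease and convexity of Lemma 1.
alpha, s0_sq = 2.0, 4.0

def kl(t):
    v = np.exp(-2 * alpha * t) * s0_sq + (1 - np.exp(-2 * alpha * t)) / alpha
    return 0.5 * (alpha * v - 1 - np.log(alpha * v))

ts = np.linspace(0.01, 3.0, 300)
H = kl(ts)
dH = np.diff(H)          # first differences: all nonpositive
d2H = np.diff(H, n=2)    # second differences: all nonnegative
print(dH.max() <= 1e-12, d2H.min() >= -1e-12)
```

The first differences are all nonpositive and the second differences all nonnegative, matching the signs predicted by Lemma 1.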
II-C Derivatives of mutual information along the OU flow
Recall that given a functional $F$ of (the law of) a random variable $X$, we can define its mutual version for a joint random variable $(X, Y)$ by

$$I_F(X; Y) = \mathbb{E}_{y \sim \rho_Y}\big[ F(\rho_{X \mid Y = y}) \big] - F(\rho_X), \qquad (5)$$

where the first term is the expected value of $F$ on the conditional random variables $X \mid Y = y$, averaged over $y \sim \rho_Y$. The mutual version picks up only the nonlinear part of $F$, so two functionals that differ by a linear function have the same mutual version.

For example, for any $\nu$, the relative entropy $H_\nu(\rho)$ differs from the negative entropy $-H(\rho)$ by the term $-\int \rho \log \nu \, dx$, which is linear in $\rho$. Therefore, the mutual version of relative entropy is equal to the mutual version of negative entropy, which is mutual information:

$$I_{H_\nu}(X; Y) = I_{-H}(X; Y) = I(X; Y).$$
We apply this definition to the joint random variable $(X_0, X_t)$, where $(\rho_t)_{t \ge 0}$ is the OU flow from $\rho_0$. By the linearity of the OU channel (2), Lemma 1 yields the following identities for the time derivatives of mutual information along the OU flow.
Lemma 2.
Along the OU flow $(\rho_t)_{t \ge 0}$ for $t > 0$,

$$\frac{d}{dt} I(X_0; X_t) = -I_{J_\nu}(X_0; X_t), \qquad \frac{d^2}{dt^2} I(X_0; X_t) = 2\, I_{K_\nu}(X_0; X_t) + 2\alpha\, I_{J_\nu}(X_0; X_t),$$

where $I_{J_\nu}(X_0; X_t) = \mathbb{E}_{x_0 \sim \rho_0}\big[ J_\nu(\rho_{t \mid 0}(\cdot \mid x_0)) \big] - J_\nu(\rho_t)$ is the mutual version (5) of the relative Fisher information, and $I_{K_\nu}$ is defined analogously from $K_\nu$.
We discuss the signs of the first and second derivatives.
II-C1 First derivative of mutual information
We recall that for a general joint random variable $(X, Y)$, the mutual relative Fisher information $I_{J_\nu}(X; Y)$ is equal to a backward Fisher information, which is manifestly nonnegative. Furthermore, we also recall that along the OU flow, the backward Fisher information is proportional to the conditional variance of $X_0$ given $X_t$; see for example [5, III-D.2], or Appendix E.

Lemma 3.

Along the OU flow for $t > 0$,

$$I_{J_\nu}(X_0; X_t) = \frac{e^{-2\alpha t}}{s_t^4}\, \mathbb{E}\big[ \| X_0 - \mathbb{E}[X_0 \mid X_t] \|^2 \big], \qquad \text{where } s_t^2 = \frac{1 - e^{-2\alpha t}}{\alpha}.$$
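The proportionality between the mutual relative Fisher information and the conditional variance can be sanity-checked numerically. The sketch below (not from the paper; a binary initial law $X_0 = \pm 1$ in one dimension, an illustrative $\alpha$, and the parameterization $s_t^2 = (1 - e^{-2\alpha t})/\alpha$ used here) computes $I(X_0; X_t)$ by quadrature and compares $-dI/dt$, obtained by a central difference, against $(e^{-2\alpha t}/s_t^4)\, \mathbb{E}[\mathrm{Var}(X_0 \mid X_t)]$, using $\mathbb{E}[X_0 \mid X_t = y] = \tanh(e^{-\alpha t} y / s_t^2)$:

```python
import numpy as np

# Numerical sketch: for X_0 = +-1 uniform, the OU channel at time t is
# X_t = a X_0 + s Z with a = e^{-alpha t}, s^2 = (1 - e^{-2 alpha t}) / alpha.
# We check that -dI/dt equals (a^2 / s^4) * E[ Var(X_0 | X_t) ].
alpha = 1.0
y = np.linspace(-12, 12, 4001)
dy = y[1] - y[0]

def stats(t):
    a = np.exp(-alpha * t)
    s2 = (1 - np.exp(-2 * alpha * t)) / alpha
    phi = lambda m: np.exp(-(y - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    p = 0.5 * (phi(a) + phi(-a))                     # density of X_t
    h = -np.sum(p * np.log(np.maximum(p, 1e-300))) * dy
    mi = h - 0.5 * np.log(2 * np.pi * np.e * s2)     # I(X_0; X_t)
    mmse = np.sum(p * (1 - np.tanh(a * y / s2) ** 2)) * dy  # E[Var(X_0|X_t)]
    return mi, (a ** 2 / s2 ** 2) * mmse

t, eps = 0.5, 1e-4
mi_minus, _ = stats(t - eps)
mi_plus, _ = stats(t + eps)
_, rhs = stats(t)
lhs = -(mi_plus - mi_minus) / (2 * eps)   # -dI/dt by central difference
print(lhs, rhs)                           # the two agree to several digits
```

The two quantities agree up to quadrature and finite-difference error, consistent with the first identity of Lemma 2 combined with Lemma 3.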
II-C2 Second derivative of mutual information
Unlike $I_{J_\nu}$, which is always nonnegative, the mutual relative second-order Fisher information $I_{K_\nu}$ can in general be negative. However, if $\rho_t$ is sufficiently log-concave compared to the reference measure $\nu$, then we can lower bound $I_{K_\nu}$ and use the additional term $2\alpha I_{J_\nu}$ in the second identity of Lemma 2 to offset it. Altogether, we have the following result.
We say $\rho$ is $\beta$-strongly log-concave for some $\beta > 0$ if $-\nabla^2 \log \rho(x) \succeq \beta I$ for all $x \in \mathbb{R}^n$.
Lemma 4.
Along the OU flow for $t > 0$, if $\rho_t$ is sufficiently strongly log-concave (with a log-concavity constant depending on $\alpha$ and $t$), then mutual information $t \mapsto I(X_0; X_t)$ is convex at time $t$.
Lemma 4 provides a sufficient condition for mutual information to be convex along the OU flow. Since the target measure $\nu$ is strongly log-concave, we expect any initial distribution to eventually be transformed into a distribution that is sufficiently strongly log-concave. We prove this in two cases: when the initial distribution is strongly log-concave, and when it is bounded (or a convolution of the two cases). First, we have the following classical estimate.
Lemma 5.
Along the OU flow for $t > 0$, if $\rho_0$ is $\beta$-strongly log-concave for some $\beta > 0$, then $\rho_t$ is $\beta_t$-strongly log-concave with

$$\beta_t = \left( \frac{e^{-2\alpha t}}{\beta} + \frac{1 - e^{-2\alpha t}}{\alpha} \right)^{-1}.$$
We say $\rho$ is $D$-bounded for some $D > 0$ if the support of $\rho$ is contained in a ball of diameter $D$. Then we also have the following.
Lemma 6.
Along the OU flow for $t > 0$, if $\rho_0$ is $D$-bounded, then $\rho_t$ is $\beta_t$-strongly log-concave for $t \ge t^* = \frac{1}{2\alpha} \log(1 + \alpha D^2)$, where

$$\beta_t = \frac{s_t^2 - e^{-2\alpha t} D^2}{s_t^4}, \qquad s_t^2 = \frac{1 - e^{-2\alpha t}}{\alpha}.$$
The threshold $t^*$ in Lemma 6 is chosen so that the log-concavity constant $\beta_t$ is nonnegative. It is possible to combine Lemmas 5 and 6 to handle the case when the initial distribution is a convolution of a strongly log-concave and a bounded distribution (for example, a mixture of Gaussians), but with a more complicated threshold. For simplicity, we omit it here.
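The two log-concavity estimates can be tabulated directly. The following sketch (not from the paper; illustrative parameters, and assuming the formulas $\beta_t = (e^{-2\alpha t}/\beta + s_t^2)^{-1}$ for a $\beta$-strongly log-concave start and $\beta_t = (s_t^2 - e^{-2\alpha t} D^2)/s_t^4$ for a $D$-bounded start, with $s_t^2 = (1 - e^{-2\alpha t})/\alpha$) shows that the first constant tends to $\alpha$ as $t \to \infty$, and that the second crosses zero exactly at $t^* = \frac{1}{2\alpha} \log(1 + \alpha D^2)$:

```python
import numpy as np

# Sketch of the two log-concavity estimates (Lemmas 5 and 6), with
# illustrative parameters; s_t^2 = (1 - e^{-2 alpha t}) / alpha.
alpha, beta0, D = 2.0, 0.3, 1.5

def s2(t):
    return (1 - np.exp(-2 * alpha * t)) / alpha

def beta_slc(t):
    # Lemma 5: a beta0-strongly log-concave start stays strongly log-concave
    return 1.0 / (np.exp(-2 * alpha * t) / beta0 + s2(t))

def beta_bounded(t):
    # Lemma 6: a D-bounded start; the constant is nonnegative once t >= t_star
    return (s2(t) - np.exp(-2 * alpha * t) * D ** 2) / s2(t) ** 2

t_star = np.log(1 + alpha * D ** 2) / (2 * alpha)  # zero of beta_bounded
print(beta_slc(10.0))        # close to alpha: the flow forgets the initial law
print(beta_bounded(t_star))  # essentially 0 at the threshold
```

Both behaviors are consistent with the picture above: the flow's log-concavity constant converges to that of the target, and boundedness of the initial law only helps after the threshold $t^*$.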
III. Convexity of mutual information
We present our main results on the convexity of mutual information along the OU flow. Throughout, let $(\rho_t)_{t \ge 0}$ denote the OU flow started from $\rho_0$, with $X_t \sim \rho_t$.
III-A Eventual convexity when the initial distribution is strongly log-concave
By combining Lemmas 4 and 5, we establish the following result for strongly log-concave initial distributions.
Theorem 1.
Suppose $\rho_0$ is $\beta$-strongly log-concave for some $\beta > 0$. Along the OU flow for $t > 0$, mutual information is convex for all $t \ge t_1(\alpha, \beta)$, where $t_1(\alpha, \beta)$ is a finite threshold determined by Lemmas 4 and 5.
Theorem 1 above proves the eventual convexity of mutual information for any strongly log-concave initial distribution. However, the threshold $t_1$ is not tight. For example, if $\rho_0 = \mathcal{N}(0, \Sigma)$, then $\rho_0$ is $\beta$-strongly log-concave with $\beta = 1/\max_i \lambda_i$, where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $\Sigma$. One can verify that in this case mutual information is in fact convex for all $t > 0$, for any $\Sigma$. Ultimately this gap is due to the fact that the sufficient condition in Lemma 4 is not necessary, and it would be interesting to see how to tighten it.

III-B Eventual convexity when the initial distribution is bounded
By combining Lemmas 4 and 6, we establish the following result for initial distributions with bounded support.
Theorem 2.
Suppose $\rho_0$ is $D$-bounded for some $D > 0$. Along the OU flow for $t > 0$, mutual information is convex for all $t \ge t_2(\alpha, D)$, where $t_2(\alpha, D)$ is a finite threshold determined by Lemmas 4 and 6.
Note that as $\alpha \to 0$, the threshold in Theorem 2 above recovers the corresponding result for the heat flow from [6, Theorem 2]. As noted previously, we can combine Theorems 1 and 2 to show the eventual convexity of mutual information when the initial distribution is a convolution of a strongly log-concave and a bounded distribution, but for simplicity we omit it here.
III-C Eventual convexity when Fisher information is finite
We now investigate the convexity of mutual information in general, without log-concavity assumptions on the distributions. We show that if the initial distribution has finite fourth moment and finite Fisher information, then mutual information is eventually convex along the OU flow.
For $p \ge 1$, let $M_p(X) = \mathbb{E}[\| X - \mathbb{E}[X] \|^p]$ denote the $p$-th central moment of a random variable $X$. Let $J(X)$ denote the (absolute) Fisher information of $X$. Then we have the following.
Theorem 3.
Suppose $\rho_0$ has finite fourth moment $M_4(X_0) < \infty$ and finite Fisher information $J(X_0) < \infty$. Along the OU flow for $t > 0$, mutual information is convex for all $t \ge t_3$, where $t_3$ is a finite threshold depending on $\alpha$, $M_4(X_0)$, and $J(X_0)$.
Theorem 3 above proves the eventual convexity of mutual information under rather general conditions. However, in the limit $\alpha \to 0$ the time threshold does not recover the corresponding result for the heat flow from [6, Theorem 3]. This is because in the proof we use a simple bound to estimate the root of a cubic polynomial (see Appendix K), and it would be interesting to refine the analysis to obtain a better estimate.
IV. Nonconvexity of mutual information
We now provide counterexamples showing that mutual information can be concave at small time along the OU flow. Throughout this section we work in dimension $n = 1$. For $\mu \in \mathbb{R}$ and $\sigma^2 \ge 0$, let $\mathcal{N}(\mu, \sigma^2)$ denote the one-dimensional Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
IV-A Mixture of two point masses
Let $\rho_0$ be a uniform mixture of two point masses centered at $-a$ and $a$, for some $a \neq 0$. Along the OU flow for $t > 0$, $\rho_t$ is a uniform mixture of the two Gaussians $\mathcal{N}(\pm e^{-\alpha t} a,\, s_t^2)$, where $s_t^2 = (1 - e^{-2\alpha t})/\alpha$.
By direct calculation, the mutual information $I(X_0; X_t)$ admits a closed-form expression as a one-dimensional Gaussian integral. The behavior is depicted in Figure 1(a). We see that mutual information is not convex at small time. It starts at the value $\log 2$, which follows from the general result in Theorem 4, then stays approximately flat for a while before decreasing and becoming convex.
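For readers who wish to reproduce the curve, the following sketch (not from the paper; illustrative $\alpha$ and center $a$) computes $I(X_0; X_t)$ for the two-point initial law by one-dimensional quadrature, as the entropy of the Gaussian mixture minus the entropy of the Gaussian noise:

```python
import numpy as np

# Numerical sketch for IV-A (illustrative alpha, a0): I(X_0; X_t) for
# X_0 uniform on {-a0, +a0}, computed by 1-d quadrature. At small t the
# curve sits near log 2 (the discrete entropy); at large t it decays to 0.
alpha, a0 = 1.0, 1.0
y = np.linspace(-15, 15, 6001)
dy = y[1] - y[0]

def mi(t):
    c = np.exp(-alpha * t)
    s2 = (1 - np.exp(-2 * alpha * t)) / alpha
    phi = lambda m: np.exp(-(y - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    p = 0.5 * (phi(c * a0) + phi(-c * a0))           # mixture density of X_t
    h = -np.sum(p * np.log(np.maximum(p, 1e-300))) * dy
    return h - 0.5 * np.log(2 * np.pi * np.e * s2)   # I = h(X_t) - h(X_t|X_0)

print(mi(0.01), np.log(2))  # near log 2 = 0.693... at small time
print(mi(5.0))              # near 0 at large time
```

Evaluating `mi` on a fine time grid reproduces the initial plateau near $\log 2$ followed by the convex decay to $0$.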
IV-B Mixture of two Gaussians
Let $\rho_0$ be a uniform mixture of two Gaussians with the same variance $\sigma^2 > 0$, centered at $-a$ and $a$ for some $a \neq 0$. Along the OU flow for $t > 0$, $\rho_t$ is also a uniform mixture of two Gaussians.
By direct calculation, the mutual information $I(X_0; X_t)$ can again be computed via one-dimensional Gaussian integrals. The behavior is depicted in Figure 1(b). We see that now mutual information starts at infinity (since $X_0$ is continuous), but it decreases quickly and flattens out for a while before decreasing again. Therefore, mutual information is still not convex at some small time.
IV-C General mixture of point masses
Let $\rho_0 = \sum_{i=1}^k p_i\, \delta_{x_i}$ be a mixture of $k$ point masses centered at distinct points $x_1, \dots, x_k$, with mixture probabilities $p_i > 0$, $\sum_{i=1}^k p_i = 1$. Along the OU flow for $t > 0$, $\rho_t$ is a mixture of Gaussians.
By adapting the estimates from [6, IV-C], we can show that mutual information along the OU flow starts at a finite value equal to the discrete entropy of the mixture probabilities, and that it is exponentially concentrated near this value at small time. This is the same phenomenon as in the heat flow case, except that the bound on the small time now depends on $\alpha$.
Let $D = \max_{i \neq j} \|x_i - x_j\|$ and $d = \min_{i \neq j} \|x_i - x_j\|$ denote the largest and smallest separations of the centers. Let $H(p) = -\sum_{i=1}^k p_i \log p_i$ denote the discrete entropy of the mixture probabilities.
Theorem 4.
Along the OU flow for $t > 0$, for all sufficiently small $t$, the mutual information satisfies $|I(X_0; X_t) - H(p)| \le \varepsilon(t)$, where $\varepsilon(t) \to 0$ exponentially fast as $t \to 0$, at a rate depending on $\alpha$, the separations of the centers, and the mixture probabilities.
Theorem 4 above implies that $\lim_{t \to 0^+} I(X_0; X_t) = H(p)$. Thus, for a discrete initial distribution, the initial value of mutual information depends only on the mixture proportions, and not on the locations of the centers as long as they are distinct. This is the same interesting behavior as in the heat flow case, and shows that we can obtain discontinuities of the mutual information at the origin with respect to the initial distribution by moving the centers and merging them.
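The limit above can be observed numerically for a non-uniform mixture. The sketch below (not from the paper; illustrative atoms and weights) computes $I(X_0; X_t)$ at a small time by quadrature and compares it with the discrete entropy $H(p)$ of the weights:

```python
import numpy as np

# Sketch of the small-time limit in Theorem 4 (atoms and weights are
# illustrative): for X_0 a mixture of point masses, I(X_0; X_t) tends to
# H(p), the discrete entropy of the weights, regardless of the atom locations.
alpha = 1.0
atoms = np.array([-2.0, 0.5, 3.0])
p_w = np.array([0.5, 0.3, 0.2])
y = np.linspace(-20, 20, 40001)
dy = y[1] - y[0]

def mi(t):
    c = np.exp(-alpha * t)
    s2 = (1 - np.exp(-2 * alpha * t)) / alpha
    dens = sum(w * np.exp(-(y - c * m) ** 2 / (2 * s2))
               / np.sqrt(2 * np.pi * s2) for m, w in zip(atoms, p_w))
    h = -np.sum(dens * np.log(np.maximum(dens, 1e-300))) * dy
    return h - 0.5 * np.log(2 * np.pi * np.e * s2)

H_discrete = -np.sum(p_w * np.log(p_w))   # discrete entropy of the weights
print(mi(0.001), H_discrete)              # the two values nearly coincide
```

Moving the atoms (while keeping them distinct) leaves the small-time value unchanged, illustrating that only the mixture proportions matter in the limit.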
Furthermore, if a function converges exponentially fast, then (under suitable regularity) all its derivatives converge to zero exponentially fast. In our case of a discrete initial distribution, this gives the following.
Corollary 1.
For all $k \ge 1$, $\lim_{t \to 0^+} \frac{d^k}{dt^k} I(X_0; X_t) = 0$.
In particular, the mutual relative Fisher information $I_{J_\nu}(X_0; X_t)$ also starts at $0$. Since the initial distribution, a mixture of point masses, is bounded, by Theorem 2 we know mutual information is eventually convex, which means $I_{J_\nu}$ is eventually decreasing. Since $I_{J_\nu}$ is always nonnegative and starts at $0$, it must initially increase before it can decrease. During this period in which $I_{J_\nu}$ is increasing, mutual information is concave. This is similar to the behavior for the mixture of two point masses observed in IV-A. Moreover, by the continuity of the mutual relative first- and second-order Fisher information with respect to the initial distribution, this suggests that mutual information can also be concave at some small time when the initial distribution is a mixture of Gaussians, similar to the observation in IV-B.
V. Discussion and future work
In this paper we have studied the convexity of mutual information along the Ornstein-Uhlenbeck flow. We considered the gradient flow interpretation of the Ornstein-Uhlenbeck process in the space of measures, and derived formulae for the derivatives of relative entropy and mutual information. We have shown that mutual information is eventually convex under rather general conditions on the initial distribution. We have also exhibited examples in which mutual information is concave at some small time. These results generalize the behaviors seen in the heat flow [6].
For simplicity, in this paper we have treated only the case when the target Gaussian distribution has isotropic covariance. It is possible to extend our results to handle a general covariance matrix. In this case, extra caution needs to be exercised since matrices in general do not commute. In the simple case when the covariance matrices of the initial and target distributions commute, our results extend naturally, and the various thresholds are then controlled by the eigenvalues of the matrices.
As noted in the introduction, our interest in this problem stems from a desire to better understand the general Fokker-Planck process. Indeed, there is an interesting dichotomy: we understand the intricate properties of the Ornstein-Uhlenbeck process because we have an explicit solution, whereas we know very little about the general Fokker-Planck process. The gradient flow interpretation applies to the general Fokker-Planck process and yields convexity properties of the relative entropy when the target measure is log-concave. However, much remains unknown, even about mutual information. Hence in this paper we have attempted to settle the case of the Ornstein-Uhlenbeck process. Even in this case, some of our results are not tight and can be improved.
Some interesting future directions are to understand the convexity properties of the solution to the Fokker-Planck process, even in the nice case when the target measure is strongly log-concave. For example, does the Fokker-Planck process preserve log-concavity relative to the target measure? Furthermore, is the self-information (mutual information at initial time) for a discrete initial distribution still equal to the discrete entropy for the Fokker-Planck process? In general, it is interesting to bridge the gap in our understanding between the Ornstein-Uhlenbeck process and the general Fokker-Planck process. One avenue may be to study a perturbation of the Ornstein-Uhlenbeck process, in which the target distribution is a small perturbation of the Gaussian.
References
 [1] N. Blachman, “The convolution inequality for entropy powers,” IEEE Transactions on Information Theory, vol. 11, no. 2, pp. 267–271, 1965.
 [2] O. Rioul, “Information theoretic proofs of entropy power inequalities,” IEEE Transactions on Information Theory, vol. 57, no. 1, pp. 33–55, 2011.
 [3] M. Madiman and A. Barron, “Generalized entropy power inequalities and monotonicity properties of information,” IEEE Transactions on Information Theory, vol. 53, no. 7, pp. 2317–2329, 2007.

 [4] S. Mandt, M. Hoffman, and D. Blei, “A variational analysis of stochastic gradient algorithms,” in International Conference on Machine Learning, 2016, pp. 354–363.
 [5] A. Wibisono, V. Jog, and P. Loh, “Information and estimation in Fokker-Planck channels,” in 2017 IEEE International Symposium on Information Theory, ISIT 2017, Aachen, Germany, 2017, pp. 2673–2677.
 [6] A. Wibisono and V. Jog, “Convexity of mutual information along the heat flow,” in 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, USA, 2018.
 [7] D. Guo, Y. Wu, S. S. Shitz, and S. Verdú, “Estimation in Gaussian noise: Properties of the minimum mean-square error,” IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2371–2385, 2011.
 [8] F. Cheng and Y. Geng, “Higher order derivatives in Costa’s entropy power inequality,” IEEE Transactions on Information Theory, vol. 61, no. 11, pp. 5892–5905, 2015.
 [9] G. Toscani, “A concavity property for the reciprocal of Fisher information and its consequences on Costa’s EPI,” Physica A: Statistical Mechanics and its Applications, vol. 432, pp. 35–42, 2015.
 [10] X. Zhang, V. Anantharam, and Y. Geng, “Gaussian optimality for derivatives of differential entropy using linear matrix inequalities,” Entropy, vol. 20, no. 3, p. 182, 2018.
 [11] R. Jordan, D. Kinderlehrer, and F. Otto, “The variational formulation of the Fokker–Planck equation,” SIAM Journal on Mathematical Analysis, vol. 29, no. 1, pp. 1–17, January 1998.
 [12] F. Otto and C. Villani, “Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality,” Journal of Functional Analysis, vol. 173, no. 2, pp. 361–400, 2000.
 [13] C. Villani, Topics in optimal transportation. American Mathematical Society, 2003, no. 58.
 [14] ——, Optimal Transport: Old and New, ser. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008, vol. 338.
 [15] D. Bakry and M. Émery, “Diffusions hypercontractives,” in Séminaire de Probabilités XIX 1983/84. Springer, 1985, pp. 177–206.
 [16] D. Bakry, I. Gentil, and M. Ledoux, Analysis and geometry of Markov diffusion operators. Springer, 2013, vol. 348.
 [17] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
 [18] A. Wibisono and V. Jog, “Convexity of mutual information along the Ornstein-Uhlenbeck flow,” arXiv preprint arXiv:1805.01401, 2018.
 [19] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, Jan. 2008.
 [20] A. Saumard and J. A. Wellner, “Log-concavity and strong log-concavity: A review,” Statistics Surveys, vol. 8, p. 45, 2014.
 [21] A. Dembo, T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Transactions on Information Theory, vol. 37, no. 6, pp. 1501–1518, 1991.