We investigate the notion of well-posedness of a Bayesian statistical inference. For a given conditional probability distribution, we refer to well-posedness as a continuity property with respect to the conditioning variable. Indeed, we aim at quantitative estimates of the discrepancy between two inferences in terms of the distance between the corresponding observations. Our problem can be compared with Bayesian sensitivity analysis, by specifying that we work under a fixed prior and statistical model, the imprecision being concerned only with the data.
Few general results are available on this topic, even if it naturally arises, sometimes as a technical tool, in connection with different Bayesian procedures, such as consistency [DF], [GV, Chapters 6-9], mixture approximations [W, RS], deconvolution [E], inverse problems [S] and computability [AFR]. While the pioneering paper [Z], essentially inspired by foundational questions, dealt with the qualitative definition of continuity for conditional distributions, more recent studies highlight the relevance of modulus-of-continuity estimates. We refer, for instance, to the well-posedness theory developed in [S] and to the different results found in [DS, CDRS, ILS, L1]. Our contribution moves in the same direction.
In this work, we confine ourselves to dealing with the posterior distribution. We introduce two measurable spaces $(\mathbb{X}, \mathscr{X})$ and $(\Theta, \mathscr{T})$, representing the space of the observations and of the parameters, respectively. We further introduce a probability measure $\pi$ on $(\Theta, \mathscr{T})$, the prior distribution, and a probability kernel $\mu(\cdot\,|\,\cdot)$, called statistical model. We assume that:
$\mathbb{X}$ is a metric space with distance $d_{\mathbb{X}}$ and $\mathscr{X}$ coincides with the Borel $\sigma$-algebra on $\mathbb{X}$;
$\Theta$ is a Polish space and $\mathscr{T}$ coincides with its Borel $\sigma$-algebra;
the model is dominated: $\mu(\cdot\,|\,\theta) \ll \lambda$, $\theta \in \Theta$, for some $\sigma$-finite measure $\lambda$ on $(\mathbb{X}, \mathscr{X})$.
For any $\theta \in \Theta$, we consider a density $f(\cdot\,|\,\theta)$ of $\mu(\cdot\,|\,\theta)$ w.r.t. $\lambda$ and we put $\rho(x) := \int_\Theta f(x\,|\,\theta)\,\pi(\mathrm{d}\theta)$, since $(x, \theta) \mapsto f(x\,|\,\theta)$ proves to be $(\mathscr{X} \otimes \mathscr{T})$-measurable. See [K, Chapter 5] for details. In this framework, the well-known Bayes theorem provides an explicit form of the posterior distribution, namely
\[
\pi_x(B) = \frac{\int_B f(x\,|\,\theta)\,\pi(\mathrm{d}\theta)}{\rho(x)}
\]
for any $B \in \mathscr{T}$ and $x \in \mathbb{X}$ such that $\rho(x) > 0$. Thus, the Bayes mapping $x \mapsto \pi_x$ can be seen as a measurable function from $\mathbb{X}$ into the space $\mathscr{P}(\Theta)$ of all probability measures on $(\Theta, \mathscr{T})$, endowed with the topology of weak convergence. Our main task is to find sufficient conditions on $\pi$ and $f$ such that the Bayes mapping satisfies a uniform continuity condition of the following form: given a modulus of continuity $w$ and a set $A \subseteq \mathbb{X}$, there exists a constant $K > 0$ such that
\[
d_{TV}(\pi_{x_1}, \pi_{x_2}) \le K\, w(d_{\mathbb{X}}(x_1, x_2)) \quad \text{for all } x_1, x_2 \in A \tag{1}
\]
holds, where $d_{TV}$ denotes the total variation distance.
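To fix ideas, an estimate like (1) can be probed numerically on a discretized parameter space. The snippet below is a minimal sketch, with a hypothetical Gaussian location model and a uniform prior on a grid (both choices are illustrative assumptions, not taken from the text): it computes the Bayes map $x \mapsto \pi_x$ and the total variation distance between two posteriors.

```python
import math

def posterior(x, thetas, prior, loglik):
    # Bayes formula on a finite grid: pi_x(theta) is proportional to f(x|theta) * pi(theta)
    w = [p * math.exp(loglik(x, t)) for p, t in zip(prior, thetas)]
    z = sum(w)
    return [wi / z for wi in w]

def tv(p, q):
    # total variation distance between two discrete probability vectors
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# hypothetical model: N(theta, 1) likelihood, uniform prior on a parameter grid
thetas = [-3.0 + 0.01 * i for i in range(601)]
prior = [1.0 / len(thetas)] * len(thetas)
loglik = lambda x, t: -0.5 * (x - t) ** 2

d_near = tv(posterior(0.0, thetas, prior, loglik), posterior(0.1, thetas, prior, loglik))
d_far = tv(posterior(0.0, thetas, prior, loglik), posterior(1.0, thetas, prior, loglik))
```

Here `d_near < d_far`: the discrepancy between the two inferences is controlled by the distance between the observations, which is the qualitative content of (1).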
In order to motivate the study of a property like (1), let us briefly discuss some of its applications to Bayesian inference. By itself, uniform continuity is of interest in the theory of regular conditional distributions [T, Sections 9.6-9], [P]. Indeed, a natural approximation of the posterior is
\[
\pi_x(B) \approx \frac{\mathbb{P}[\theta \in B,\ X \in U_x]}{\mathbb{P}[X \in U_x]},
\]
where $U_x$ stands for a suitable neighborhood of $x$. Thus, (1) would express the corresponding approximation error. In any case, the main applications are concerned with the theory of exchangeable observations, where $\mathbb{X} = \mathbb{Y}^n$ and the model is in product form, by de Finetti's representation theorem. The main advantage of an estimate like (1) arises when the distance between observations is re-expressed in terms of a sufficient statistic (e.g., the empirical measure), so that the asymptotic behavior of the posterior for large $n$ can be studied by resorting to the asymptotic behavior of such a statistic. We believe that uniform continuity would represent a new technique to prove asymptotic properties of the posterior distribution, like Bayesian consistency. See Sections 3.2 and 3.5 below. Finally, uniform continuity would also represent a powerful technical tool to solve the problem of approximating the posterior by mixtures, on the basis of a discretization of the sample space. See [RS] and, in particular, Proposition 2 therein, where an estimate like (1) would allow one to determine quantitatively how fine the discretization should be in order to achieve a desired degree of approximation.
2. Continuous dependence on data
In the sequel, we refer to a modulus of continuity as a continuous, strictly increasing function $w : [0, \infty) \to [0, \infty)$ such that $w(0) = 0$, and we consider the space $C^w_b(A)$ of bounded $w$-continuous functions. In particular, we say that $g$ belongs to $C^w_b(A)$ if $g$ is bounded and
\[
|g(x_1) - g(x_2)| \le C\, w(d_{\mathbb{X}}(x_1, x_2)) \quad \text{for all } x_1, x_2 \in A
\]
holds for some constant $C > 0$. If $w(r) = r^\beta$, $\beta \in (0, 1]$, we get the class of Hölder continuous functions.
First of all, by assumption (2), and since $\pi(\Theta) = 1$, there holds
\[
|\rho(x_1) - \rho(x_2)| \le \int_\Theta |f(x_1\,|\,\theta) - f(x_2\,|\,\theta)|\,\pi(\mathrm{d}\theta) \le \overline{L}\, w(d_{\mathbb{X}}(x_1, x_2))
\]
for any $x_1, x_2 \in A$, so that $\rho \in C^w_b(A)$; here, $\overline{L}$ denotes the $\pi$-integral of the $w$-modulus coefficient of $f(\cdot\,|\,\theta)$ appearing in assumption (2), and $c := \inf_{x \in A} \rho(x) > 0$. The dual formulation of the total variation (see, e.g., [GS]) reads
\[
d_{TV}(\pi_{x_1}, \pi_{x_2}) = \frac{1}{2} \sup_{\varphi} \left| \int_\Theta \varphi\,\mathrm{d}\pi_{x_1} - \int_\Theta \varphi\,\mathrm{d}\pi_{x_2} \right|
\]
for any $x_1, x_2 \in A$, the supremum being taken among all continuous functions $\varphi : \Theta \to \mathbb{R}$ such that $|\varphi(\theta)| \le 1$ for any $\theta \in \Theta$. For any such $\varphi$, define $g_\varphi(x) := \int_\Theta \varphi(\theta)\, f(x\,|\,\theta)\,\pi(\mathrm{d}\theta)$ and note that the Bayes formula entails $\int_\Theta \varphi\,\mathrm{d}\pi_x = g_\varphi(x)/\rho(x)$. We shall prove the $w$-continuity of the map $x \mapsto g_\varphi(x)/\rho(x)$ on $A$. First of all, this map satisfies $|g_\varphi(x)/\rho(x)| \le 1$ for any $x \in A$ since $|\varphi| \le 1$. Then, for $x_1, x_2 \in A$, there holds
\[
|g_\varphi(x_1) - g_\varphi(x_2)| \le \int_\Theta |f(x_1\,|\,\theta) - f(x_2\,|\,\theta)|\,\pi(\mathrm{d}\theta) \le \overline{L}\, w(d_{\mathbb{X}}(x_1, x_2)),
\]
yielding $g_\varphi \in C^w_b(A)$. Since $\rho \ge c$ on $A$, $|g_\varphi| \le \rho$, and $\rho \in C^w_b(A)$, for any $x_1, x_2 \in A$ we get
\[
\left| \frac{g_\varphi(x_1)}{\rho(x_1)} - \frac{g_\varphi(x_2)}{\rho(x_2)} \right| \le \frac{|g_\varphi(x_1) - g_\varphi(x_2)|}{\rho(x_1)} + \frac{|g_\varphi(x_2)|}{\rho(x_2)} \cdot \frac{|\rho(x_2) - \rho(x_1)|}{\rho(x_1)} \le \frac{2\overline{L}}{c}\, w(d_{\mathbb{X}}(x_1, x_2)).
\]
The latter estimate being uniform with respect to $\varphi$, we conclude that
\[
d_{TV}(\pi_{x_1}, \pi_{x_2}) \le \frac{\overline{L}}{c}\, w(d_{\mathbb{X}}(x_1, x_2))
\]
holds for any $x_1, x_2 \in A$, proving the theorem. ∎
If $\inf_{x \in \mathbb{X}} \rho(x) > 0$ and (2) holds on the whole of $\mathbb{X}$, we can take $A = \mathbb{X}$ in Theorem 2.1. If, moreover, $w(r) = r^\beta$, we get $\beta$-Hölder continuity on the whole of $\mathbb{X}$ for the map $x \mapsto \pi_x$, w.r.t. $d_{TV}$.
Some examples may also be treated within the following simple corollary.
Let us now consider the Euclidean case, letting $\mathbb{X} \subseteq \mathbb{R}^d$ have nonempty interior and $A \subseteq \mathbb{X}$ be an open set with Lipschitz boundary. Usually, Sobolev regularity is simpler to verify, the Hölder regularity then following by Sobolev embedding. For instance, for $p > d$, by the Morrey inequality there exists a constant $C > 0$ such that
\[
|g(x_1) - g(x_2)| \le C\, \|g\|_{W^{1,p}(A)}\, |x_1 - x_2|^{1 - d/p}
\]
holds for any $g \in W^{1,p}(A)$ and $x_1, x_2 \in A$. More generally, if $s \in (0, 1)$ and $sp > d$, the fractional Sobolev embedding (see, e.g., [DD]) states that
\[
|g(x_1) - g(x_2)| \le C\, \|g\|_{W^{s,p}(A)}\, |x_1 - x_2|^{s - d/p} \tag{6}
\]
holds with a suitable constant $C > 0$, for any $g \in W^{s,p}(A)$ and $x_1, x_2 \in A$. We readily obtain the following corollary.
3. Examples and applications
3.1. Exponential models
A remarkably interesting statistical model is the exponential family, which includes many popular distributions, such as the Gaussian, the exponential and the gamma. For terminology and basic results about this family, see, e.g., [B]. For the sake of definiteness, we consider a $\sigma$-finite reference measure $\lambda$ on $\mathbb{R}^m$ and a measurable map $T : \mathbb{R}^m \to \mathbb{R}^k$ such that the interior of the convex hull of the support of $\lambda \circ T^{-1}$ is nonempty and
\[
\Theta := \Big\{ \theta \in \mathbb{R}^k \,:\, \int_{\mathbb{R}^m} e^{\theta \cdot T(x)}\,\lambda(\mathrm{d}x) < \infty \Big\}
\]
is a nonempty open subset of $\mathbb{R}^k$. As for the model, we resort to the so-called canonical parametrization, by which
\[
M(\theta) := \log \int_{\mathbb{R}^m} e^{\theta \cdot T(x)}\,\lambda(\mathrm{d}x), \quad \theta \in \Theta,
\]
and, for any $\theta \in \Theta$,
\[
f(x\,|\,\theta) = \exp\{\theta \cdot T(x) - M(\theta)\}
\]
is a probability density function w.r.t. $\lambda$. Now, given a prior $\pi$ on $\Theta$ and a set $A$ compactly contained in the interior of $\mathbb{X}$, we observe that
\[
\inf_{x \in A} \rho(x) \ge \int_\Theta \exp\Big\{ \inf_{x \in A} \theta \cdot T(x) - M(\theta) \Big\}\,\pi(\mathrm{d}\theta),
\]
where the last term is positive if $T$ is continuous. On the other hand, we have
\[
|f(x_1\,|\,\theta) - f(x_2\,|\,\theta)| \le |\theta|\, e^{\sup_{x \in A} \theta \cdot T(x) - M(\theta)}\, |T(x_1) - T(x_2)|.
\]
Therefore, if we suppose $T \in C^w_b(A)$ and that the integral $\int_\Theta |\theta|\, e^{\sup_{x \in A} \theta \cdot T(x) - M(\theta)}\,\pi(\mathrm{d}\theta)$ is finite, we can invoke Theorem 2.1 to obtain the $w$-continuity on $A$ of the posterior distribution.
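As a concrete sanity check of the canonical parametrization, one may take the Bernoulli family, a hypothetical instance with $\lambda$ the counting measure on $\{0, 1\}$ and $T(x) = x$, for which $M(\theta) = \log(1 + e^\theta)$ and the density integrates to one for every $\theta$:

```python
import math

def M(theta):
    # log-normalizer of the Bernoulli family: log of the sum of e^{theta*x} over x in {0, 1}
    return math.log(1.0 + math.exp(theta))

def f(x, theta):
    # canonical exponential-family density: exp(theta*T(x) - M(theta)), with T(x) = x
    return math.exp(theta * x - M(theta))

# "integration" w.r.t. the counting measure is a sum over the support
total = f(0, 0.7) + f(1, 0.7)
```

Here `total` equals one up to rounding, confirming that $M$ is exactly the normalizing exponent of the family.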
We finally notice that, in connection with an exponential model, it is natural to choose a conjugate prior, yielding an explicit form of the posterior [DY]. Thus, the LHS of (1) can be directly computed, inviting a fair comparison with the RHS. Actually, nothing seems lost at the level of the modulus of continuity, though our constant $K$ is usually sub-optimal. To illustrate this phenomenon, we can take the Gaussian model $f(x\,|\,\theta) = \frac{1}{\sqrt{2\pi}} e^{-(x - \theta)^2/2}$, with $\mathbb{X} = \Theta = \mathbb{R}$ and $A = [-a, a]$. Chosen a conjugate (Gaussian) prior, we observe that (1) holds with $w(r) = r$, for any $x_1, x_2 \in A$, and a constant $K = K(a)$. But our constant behaves asymptotically like $e^{c a^2}$, for some $c > 0$, for large $a$, whilst the optimal one remains bounded as $a$ grows.
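For the Gaussian model, the LHS of (1) is indeed explicit. The sketch below assumes a standard normal prior (a hypothetical choice for the illustration) and uses the closed form $d_{TV}(N(m_1, v), N(m_2, v)) = \mathrm{erf}(|m_1 - m_2| / (2\sqrt{2v}))$ for equal-variance Gaussians, together with $\mathrm{erf}(z) \le 2z/\sqrt{\pi}$, to exhibit a Lipschitz bound in the data:

```python
import math

def tv_posteriors(x1, x2, s0sq=1.0):
    # model N(theta, 1) with prior N(0, s0sq): the posterior given x is
    # N(s0sq*x/(s0sq+1), s0sq/(s0sq+1)); equal variances make d_TV explicit
    v = s0sq / (s0sq + 1.0)
    dm = abs(x1 - x2) * s0sq / (s0sq + 1.0)
    return math.erf(dm / (2.0 * math.sqrt(2.0 * v)))

def lipschitz_bound(x1, x2, s0sq=1.0):
    # erf(z) <= 2z/sqrt(pi) yields d_TV <= |x1 - x2| * sqrt(s0sq/(2*pi*(s0sq+1)))
    return abs(x1 - x2) * math.sqrt(s0sq / (2.0 * math.pi * (s0sq + 1.0)))
```

The constant in this direct bound does not depend on where $x_1, x_2$ lie, which is the "optimal" behavior mentioned above.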
3.2. Exponential models for exchangeable observations
Here, we adapt the result of Section 3.1 to the $n$-observations setting, assuming exchangeability. In this case $\mathbb{X} = (\mathbb{R}^m)^n$ and the statistical model is of the form $f(x\,|\,\theta) = \prod_{i=1}^n g(x_i\,|\,\theta)$ for some density $g$. If $g$ belongs to the exponential family considered in Section 3.1, we have
\[
f(x\,|\,\theta) = \exp\Big\{ \theta \cdot \sum_{i=1}^n T(x_i) - n M(\theta) \Big\}.
\]
By Neyman's factorization lemma, we rewrite the model as
\[
f(x\,|\,\theta) = \exp\{ n\, \theta \cdot S_n(x) - n M(\theta) \}, \qquad S_n(x) := \frac{1}{n} \sum_{i=1}^n T(x_i),
\]
with $S_n$ symmetric and taking values in $\mathbb{R}^k$, $k$ standing for the dimension of the parameter. We can recast the statistical model by considering $S_n(x)$ as the new observable. Indeed, we introduce
\[
\tilde f(s\,|\,\theta) := \exp\{ n\, \theta \cdot s - n M(\theta) \},
\]
where $s \in \mathbb{R}^k$ and $\tilde\lambda_n := \lambda^{\otimes n} \circ S_n^{-1}$ is the reference measure on $\mathbb{R}^k$. Letting $s = S_n(x)$, we have that $\tilde f(\cdot\,|\,\theta)$ is a probability density with respect to $\tilde\lambda_n$, parametrized by $\theta \in \Theta$. Given a prior $\pi$ on $\Theta$, the posterior reads
\[
\pi_x(B) = \frac{\int_B \tilde f(S_n(x)\,|\,\theta)\,\pi(\mathrm{d}\theta)}{\int_\Theta \tilde f(S_n(x)\,|\,\theta)\,\pi(\mathrm{d}\theta)},
\]
so that Theorem 2.1, applied to the recast model, yields a bound in terms of $|S_n(x_1) - S_n(x_2)|$ rather than of the product distance $d_n(x_1, x_2)$. We stress that a bound in terms of $|S_n(x_1) - S_n(x_2)|$ is statistically more meaningful than a bound in terms of $d_n(x_1, x_2)$, as the former agrees with the symmetry assumption coming from exchangeability.
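The practical content of the recast model is that the posterior depends on the sample only through the sufficient statistic. A minimal sketch, assuming a hypothetical Bernoulli model with $T(x) = x$ and a uniform prior on a three-point grid: two samples of the same size with the same value of the statistic produce exactly the same posterior.

```python
import math

def posterior(data, thetas, prior):
    # product exponential model: f(x|theta) = exp(theta*sum(x) - n*M(theta)),
    # with M(theta) = log(1 + e^theta), i.e. the Bernoulli family in canonical form
    n, s = len(data), sum(data)
    w = [p * math.exp(t * s - n * math.log(1.0 + math.exp(t)))
         for p, t in zip(prior, thetas)]
    z = sum(w)
    return [wi / z for wi in w]

thetas = [-1.0, 0.0, 1.0]
prior = [1.0 / 3.0] * 3
p1 = posterior([1, 1, 0, 0], thetas, prior)  # n = 4, sum = 2
p2 = posterior([0, 1, 0, 1], thetas, prior)  # same sufficient statistic
```

Since the likelihood is a function of $(n, \sum_i x_i)$ alone, `p1` and `p2` coincide exactly.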
3.3. Global regularity for models with $A = \mathbb{X}$.
When $\mathbb{X} \subseteq \mathbb{R}^d$, we can check whether Theorem 2.1 holds with $A = \mathbb{X}$, yielding a global uniform continuity. We discuss the case with $\lambda = \mathscr{L}^d$, the $d$-dimensional Lebesgue measure, forcing $\mathbb{X}$ to be bounded. Many popular models do not satisfy the assumptions of Theorem 2.1 with $A = \mathbb{X}$, but only with some $A$ compactly contained in the interior of $\mathbb{X}$. This is the case of the Beta and Dirichlet models. On the other hand, a model that fits the assumptions of Corollary 2.3 is the Bradford distribution, given by
\[
f(x\,|\,\theta) = \frac{\theta}{\log(1 + \theta)\,(1 + \theta x)}
\]
with $x \in \mathbb{X} = [0, 1]$ and $\theta \in \Theta = (0, \infty)$. Such a model is used for the description of the occurrences of references in a set of documents on the same subject [L2]. Choosing a prior supported on a compact subset of $(0, \infty)$, the assumptions can be checked directly.
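The Bradford density is explicit enough to verify the relevant features numerically. The snippet below is a sketch (the quadrature step is an assumption of the illustration): it checks the normalization $\int_0^1 f(x\,|\,\theta)\,\mathrm{d}x = 1$ and the fact that the density is decreasing in $x$, hence bounded away from zero on $[0, 1]$ for each fixed $\theta$, as a global estimate requires.

```python
import math

def bradford(x, theta):
    # Bradford density on [0, 1]
    return theta / (math.log(1.0 + theta) * (1.0 + theta * x))

def integral_on_unit_interval(theta, n=100000):
    # midpoint-rule quadrature of x -> bradford(x, theta) over [0, 1]
    h = 1.0 / n
    return h * sum(bradford((i + 0.5) * h, theta) for i in range(n))

# the density is decreasing in x, so its minimum over [0, 1] sits at x = 1
min_density = bradford(1.0, 2.0)
```

The antiderivative of $\theta / (1 + \theta x)$ is $\log(1 + \theta x)$, which is why the quadrature returns one for every $\theta > 0$.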
3.4. Infinite-dimensional models
One of the merits of our approach is that it can also handle complex statistical models of nonparametric type. Two noteworthy examples in Bayesian analysis are the infinite-dimensional exponential family and infinite mixture models. See [GN] for a comprehensive treatment.
As for the first model, keeping in mind the Karhunen–Loève theorem [K, Chapter 13], we confine ourselves to considering densities of the form
\[
f(x\,|\,\theta) = \frac{e^{\theta(x)}}{\int_0^1 e^{\theta(y)}\,\mathrm{d}y}
\]
with $x \in \mathbb{X} = [0, 1]$ and $\theta$ ranging in $\Theta = C([0, 1])$. See also [L3]. After fixing a prior $\pi$ on $\Theta$, we show how Theorem 2.1 can be applied. First, we deal with the condition on the infimum of $\rho$. In fact, we have $f(x\,|\,\theta) \ge e^{-R(\theta)}$, where $R(\theta) := \max_{[0,1]} \theta - \min_{[0,1]} \theta$ denotes the range of the (random) trajectory $\theta$. Whence,
\[
\rho(x) \ge \int_\Theta e^{-R(\theta)}\,\pi(\mathrm{d}\theta) = \int_0^\infty e^{-r} q(r)\,\mathrm{d}r > 0,
\]
where $q$ stands for the density of $R$ with respect to the Lebesgue measure; see [F]. To check (2), we consider a Hölder condition with exponent $\gamma \in (0, 1)$. Thus, we note that
\[
|f(x_1\,|\,\theta) - f(x_2\,|\,\theta)| \le e^{R(\theta)}\, |\theta(x_1) - \theta(x_2)|.
\]
By Hölder's inequality, for conjugate exponents $p$ and $p'$, we write
\[
\int_\Theta e^{R(\theta)} |\theta(x_1) - \theta(x_2)|\,\pi(\mathrm{d}\theta) \le \Big( \int_\Theta e^{p' R(\theta)}\,\pi(\mathrm{d}\theta) \Big)^{1/p'} \Big( \int_\Theta |\theta(x_1) - \theta(x_2)|^p\,\pi(\mathrm{d}\theta) \Big)^{1/p}.
\]
By the fractional Sobolev inequality (6), for $sp > 1$ we have
\[
|\theta(x_1) - \theta(x_2)|^p \le C^p\, \|\theta\|^p_{W^{s,p}(0,1)}\, |x_1 - x_2|^{(s - 1/p)p},
\]
and the expectation of the Gagliardo seminorm can be computed from the two-times density of the process $\theta$. Typically, the Kolmogorov–Chentsov condition [K, Chapter 3]
\[
\mathbb{E}\big[ |\theta(t) - \theta(u)|^p \big] \le C\, |t - u|^{1 + \beta}, \quad t, u \in [0, 1],
\]
is available, yielding $\int_\Theta \|\theta\|^p_{W^{s,p}(0,1)}\,\pi(\mathrm{d}\theta) < \infty$ for all $s < (1 + \beta)/p$. Therefore, we obtain Hölder continuity of the posterior distribution for any exponent $\gamma < \beta/p$.
The second model of interest, namely the so-called infinite-dimensional mixture model, is based on a family of densities of the form
\[
f(x\,|\,\theta) = \int_{\mathbb{Y}} K(x, y)\,\theta(\mathrm{d}y),
\]
with $x \in \mathbb{R}$ and $\Theta$ equal to the space of all probability measures on $\mathbb{Y}$. The kernel $K$ consists of a family of densities (in the $x$-variable) parametrized by $y \in \mathbb{Y}$. A noteworthy case of interest is the Gaussian kernel $K(x, y) = \frac{1}{\sqrt{2\pi}} e^{-(x - y)^2/2}$. Now, after fixing a prior of nonparametric type (e.g., the Ferguson–Dirichlet prior), the application of Theorem 2.1 is straightforward. First, for kernels of the form $K(x, y) = k(x - y)$, condition (2) holds even independently of $\theta$, provided that $k \in C^w_b(\mathbb{R})$. For the condition $\inf_{x \in A} \rho(x) > 0$, for some compact $A$, it is enough to assume that $\inf_{x \in A,\, y \in B} K(x, y) > 0$ and $\int_\Theta \theta(B)\,\pi(\mathrm{d}\theta) > 0$, where $B$ is a suitable compact subset of $\mathbb{Y}$.
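For the Gaussian kernel, the uniformity in the mixing measure can be seen concretely: the Lipschitz constant of $x \mapsto K(x, y)$ does not depend on $y$, hence it transfers to any mixture. A minimal sketch with a hypothetical discrete mixing measure (the atoms and weights are illustrative assumptions):

```python
import math

def K(x, y):
    # Gaussian kernel: standard normal density centered at y
    return math.exp(-0.5 * (x - y) ** 2) / math.sqrt(2.0 * math.pi)

def mixture(x, atoms, weights):
    # f(x|theta) for the discrete mixing measure theta = sum_j weights[j] * delta_{atoms[j]}
    return sum(w * K(x, y) for y, w in zip(atoms, weights))

# sup over x of |d/dx K(x, y)| equals 1/sqrt(2*pi*e), independently of y;
# since the weights sum to one, the same constant bounds any mixture
L = 1.0 / math.sqrt(2.0 * math.pi * math.e)

atoms, weights = [-1.0, 0.0, 2.0], [0.2, 0.5, 0.3]
gap = abs(mixture(0.3, atoms, weights) - mixture(0.9, atoms, weights))
```

By the mean value theorem, `gap` is at most `L * 0.6`, whatever mixing measure is chosen.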
3.5. Application to Bayesian consistency
We have seen in Section 3.2 that, in the presence of exchangeable observations, the posterior can be written as
\[
\pi_x = \tilde\pi_{e_n(x)},
\]
where $x = (y_1, \dots, y_n)$ and $e_n(x) := \frac{1}{n} \sum_{i=1}^n \delta_{y_i}$ denotes the empirical measure. In the theory of Bayesian consistency, one fixes $\theta_0 \in \Theta$ and generates from $\mu(\cdot\,|\,\theta_0)$ a sequence $\{Y_i\}_{i \ge 1}$ of i.i.d. random variables. The objective is to prove that the posterior piles up near the true value, i.e. $\mathcal{D}(\pi_{(Y_1, \dots, Y_n)}, \delta_{\theta_0}) \to 0$ in probability, for some weak distance $\mathcal{D}$ (e.g., Prokhorov or bounded-Lipschitz metric [GS]) between probability measures on $\Theta$, possibly with an estimation of the convergence rate. To establish a link with our theory, we introduce the probability kernel
\[
\tilde\pi_\nu(B) := \frac{\int_B \exp\big\{ n \int_{\mathbb{Y}} \log g(y\,|\,\theta)\,\nu(\mathrm{d}y) \big\}\,\pi(\mathrm{d}\theta)}{\int_\Theta \exp\big\{ n \int_{\mathbb{Y}} \log g(y\,|\,\theta)\,\nu(\mathrm{d}y) \big\}\,\pi(\mathrm{d}\theta)},
\]
where $\nu$ ranges in a subset of probability measures on $\mathbb{Y}$ containing in its closure both the empirical measures and $\mu_0 := \mu(\cdot\,|\,\theta_0)$. In this notation, $\pi_x = \tilde\pi_{e_n(x)}$. Whence,
\[
\mathcal{D}(\pi_x, \delta_{\theta_0}) \le \mathcal{D}(\tilde\pi_{\mu_0}, \delta_{\theta_0}) + \mathcal{D}(\tilde\pi_{e_n(x)}, \tilde\pi_{\mu_0}).
\]
As for the first term on the RHS, convergence to zero is well-known with explicit rates, as a consequence of the so-called Kullback–Leibler property [GV, Definition 6.15]. The second term on the RHS can be studied under expectation, by splitting it as follows:
\[
\mathbb{E}\big[ \mathcal{D}(\tilde\pi_{e_n}, \tilde\pi_{\mu_0}) \big] = \mathbb{E}\big[ \mathcal{D}(\tilde\pi_{e_n}, \tilde\pi_{\mu_0})\,\mathbb{1}\{ d_w(e_n, \mu_0) \le \epsilon_n \} \big] + \mathbb{E}\big[ \mathcal{D}(\tilde\pi_{e_n}, \tilde\pi_{\mu_0})\,\mathbb{1}\{ d_w(e_n, \mu_0) > \epsilon_n \} \big] \tag{9}
\]
with $\mu_0 = \mu(\cdot\,|\,\theta_0)$, where $\{\epsilon_n\}_{n \ge 1}$ is a vanishing sequence of positive numbers and $d_w$ a weak distance (e.g., Prokhorov or bounded-Lipschitz metric [GS]) between probability measures on $\mathbb{Y}$. If the distance $\mathcal{D}$ is bounded, the second term in (9) is handled in terms of $\mathbb{P}[d_w(e_n, \mu_0) > \epsilon_n]$, and hence by resorting to well-known large deviations inequalities for empirical processes [K, Chapter 27]. Finally, if $\mathcal{D}$ is dominated by the total variation distance (see [GS]), we can study the first term in (9) by applying Theorem 2.1, with $w(r) = r^\gamma$ for some $\gamma \in (0, 1]$ and $A = \{ \nu \,:\, d_w(\nu, \mu_0) \le \epsilon_n \}$. The role of the local Hölder continuity is now functional to reducing the analysis of the first term in (9) to that of $\mathbb{E}[d_w(e_n, \mu_0)^\gamma]$, whose rates of contraction are well-known [FG].
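The contraction phenomenon behind this decomposition can be illustrated on a toy example. The sketch below assumes a hypothetical Bernoulli model with a finite parameter grid; the data are deterministic, with empirical frequency exactly $\theta_0$, to keep the illustration reproducible. It shows the posterior mass piling up near the true value as $n$ grows:

```python
import math

def posterior_mass_near(data, thetas, center, radius):
    # posterior over a finite grid for the Bernoulli model, then the mass
    # assigned to the ball {theta : |theta - center| <= radius}
    n, s = len(data), sum(data)
    logw = [s * math.log(t) + (n - s) * math.log(1.0 - t) for t in thetas]
    m = max(logw)  # log-sum-exp shift for numerical stability
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)
    return sum(wi / z for wi, t in zip(w, thetas) if abs(t - center) <= radius)

theta0 = 0.3
thetas = [i / 100.0 for i in range(1, 100)]  # uniform grid in (0, 1)
small_sample = [1] * 15 + [0] * 35           # n = 50, empirical frequency 0.3
large_sample = [1] * 600 + [0] * 1400        # n = 2000, empirical frequency 0.3
mass_small = posterior_mass_near(small_sample, thetas, theta0, 0.05)
mass_large = posterior_mass_near(large_sample, thetas, theta0, 0.05)
```

Here `mass_large` exceeds `mass_small`: as $n$ grows, the posterior concentrates near $\theta_0$, which is the qualitative content of the consistency statement.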
ED received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257. This research was also supported by the Italian Ministry of Education, University and Research (MIUR) under “PRIN project” grant No 2017TEXA3H, and “Dipartimenti di Eccellenza Program” (2018–2022) - Dept. of Mathematics “F. Casorati”, University of Pavia.
- [AFR] N. L. Ackerman, C. E. Freer, D. M. Roy: On computability and disintegration. Math. Struct. in Comp. Science 27 (2017), 1287–1314.
- [B] L. D. Brown: Fundamentals of Statistical Exponential Families with Application in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 9. Hayward, California (1986).
- [CDRS] S. L. Cotter, M. Dashti, J. C. Robinson and A. M. Stuart: Bayesian inverse problems for functions and applications to fluid mechanics. Inverse Problems 25 (2009), 115008.
- [DS] M. Dashti and A.M. Stuart: The Bayesian Approach to Inverse Problems. In: Ghanem R., Higdon D., Owhadi H. (eds) Handbook of Uncertainty Quantification. Springer, Cham (2017).
- [DF] P. Diaconis and D. Freedman: On consistency of Bayes estimates. Ann. Statist. 14 (1986), 1–26.
- [DY] P. Diaconis and D. Ylvisaker: Conjugate priors for exponential families. Ann. Statist. 7 (1979), 269–281.
- [ILS] M. A. Iglesias, K. Lin and A. M. Stuart: Well-posed Bayesian geometric inverse problems arising in subsurface flow. Inverse Problems 30 (2014), 114001.
- [DD] F. Demengel and G. Demengel: Functional spaces for the theory of elliptic partial differential equations. Universitext. Springer, London; EDP Sciences, Les Ulis (2012).
- [E] B. Efron: Empirical Bayes deconvolution estimates. Biometrika 103 (2016), 1–20.
- [F] W. Feller: The asymptotic distribution of the range of sums of independent random variables. Ann. Math. Statistics 22 (1951), 427–432.
- [FG] N. Fournier and A. Guillin: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields 162 (2015), 707–738.
- [GV] S. Ghosal and A. van der Vaart: Fundamentals of nonparametric Bayesian inference. Cambridge University Press, Cambridge (2017).
- [GS] A. L. Gibbs and F. E. Su: On choosing and bounding probability metrics. International Statistical Review 70 (2002), 419–435.
- [GN] E. Giné and R. Nickl: Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press, Cambridge (2016).
- [K] O. Kallenberg: Foundations of Modern Probability. Second ed. Springer-Verlag, New York (2002).
- [L1] J. Latz: On the well-posedness of Bayesian inverse problems. Preprint arXiv:1902.10257.
- [L2] F. F. Leimkuhler: Bradford’s distribution. Journal of Documentation 23 (3) (1967), 197–207.
- [L3] P.J. Lenk: Towards a practicable Bayesian nonparametric density estimator. Biometrika 78 (1991), 531–543.
- [P] J. Pfanzagl: Conditional distributions as derivatives. Ann. Probab. 7 (1979), 1046–1050.
- [RS] E. Regazzini and V. V. Sazonov: Approximation of laws of random probabilities by mixtures of Dirichlet distributions with applications to nonparametric Bayesian inference. Theory Probab. Appl. 45 (2001), 93–110.
- [S] A.M. Stuart: Inverse problems: a Bayesian perspective. Acta Numerica, 19 (2010) 451–559.
- [T] T. Tjur: Probability based on Radon measures. Wiley Series in Probability and Mathematical Statistics. Wiley, Chichester (1980).
- [W] M. West: Approximating posterior distributions by mixtures. J. R. Statist. Soc. B 55 (1993), 409–422.
- [Z] S. L. Zabell: Continuous versions of regular conditional distributions. Ann. Probab. 7 (1979), 159–165.