1 Introduction
Robustness has been of ongoing interest in both the Bayesian [DF61, BMP94] and frequentist [Tuk60, Hub64, Hub73a] settings since being introduced by George Box in 1953 [Box53]. The goal is to capture the sensitivity of inferential procedures to the presence of outliers in the data and to misspecifications in the modelling assumptions, and to mitigate overly large sensitivity. The Bayesian approach has focused on capturing possible anomalies in the observed data via the model and on choosing priors that have minimal effect on inferences. The frequentist approach, on the other hand, has focused on the development of estimators that identify and guard against outliers in the data. We refer the reader to [Hub11, Chap. 15] for a comprehensive discussion.
The focus on model robustness
in Bayesian statistics is implemented via sensitivity studies to understand effects of misspecification of the prior distribution
[BMP94, MSLD17] and its propagation towards the posterior [Hub73b]. There is, however, little in the way of a comprehensive formal finite-sample framework for Bayesian robustness. Huber asked why there is no finite-sample Bayesian robustness theory, and Kong suggested that such a theory is infeasible in full generality, arguing that it is computationally infeasible to carry out the necessary calculations even in finite spaces. We address this issue by providing a formal framework for studying Bayesian robustness and by proposing a robust inferential procedure with finite-sample guarantees. We address issues of computational infeasibility by refraining from modelling outlier data explicitly. Instead, we posit that the collected data contain a small fraction of observations which are not explained by the modelling assumptions. This corruption model, termed the $\epsilon$-contamination model, was first studied by Huber in 1964 [Hub64] and has been the subject of recent computationally focused work in the frequentist setting [DKK16, LRV16, PSBR18].
Given data corrupted according to an $\epsilon$-contamination model, our goal is to sample from the clean posterior distribution: the posterior distribution conditioned only on the uncorrupted ("clean") data. Our key idea is to leverage a robust approach for estimating the mean in the context of gradient-based sampling techniques. Our overall procedure is a robust variant of the Unadjusted Langevin Algorithm (ULA) that we refer to as "RobULA." The underlying ULA algorithm and its variants have been used for efficient large-scale Bayesian posterior sampling [RT96, WT11] and their convergence analysis has been a recent topic of interest [Dal17, DM17, CB18, DCWY18, MCJ18]; see Section 2.2 for a detailed overview. Informally, our main result shows that after a sufficient number of iterations of RobULA, the iterate has a distribution that is close to the clean posterior, up to an error term controlled by $\epsilon$, the fraction of corrupted points in the data set.
The remainder of the paper is organized as follows: Section 2 contains a discussion of the related literature, Section 3 discusses relevant background as well as the formal problem setup, and Section 4 describes the proposed algorithm RobULA and states our main theorem regarding its convergence. In Section 5
we discuss the fundamental problems of Bayesian mean estimation and linear regression in the robust setting. Section
6 consists of an experimental evaluation of RobULA on synthetic as well as real-world data sets, and we conclude in Section 7.
2 Related Work
In this section, we review related work on robust estimation procedures in both the Bayesian and frequentist settings. We also discuss work on using Langevin dynamics to sample from distributions over continuous spaces.
2.1 Robust statistical procedures
There are many threads in the literature on robust estimation and outlier detection
[Hub73a, Box53, DF61]. In the frequentist parameter estimation setting, the most commonly studied model is Huber's classical $\epsilon$-contamination model. There has also been a recent focus on an adversarial paradigm that is devoted to developing computationally efficient, problem-dependent estimators for mean estimation [LRV16, DKK16], linear regression [KKM18, BJK15, BJKK17, SBRJ19], and general risk minimization problems [PSBR18, DKK19]. Particularly relevant to our setup are [PSBR18] and [DKK19], which utilize the robust mean estimators of [LRV16, DKK16] to robustify gradient-based procedures for empirical risk minimization.
The study of robustness in the Bayesian framework has focused primarily on developing models and priors that have minimal impact on inference. An important line of work has focused on the sensitivity of the posterior distribution to the choice of prior and has led to the development of noninformative priors [BMP94, MSLD17, MD18]. These methods are orthogonal to those considered in the current paper, as they do not aim to robustify inferential procedures against corruptions in the observed data set. In principle, a complete Bayesian model would have the capacity to explain the outliers present in the data, but this would require performing an integral over all models with a proper prior. Such an approach would generally be computationally intractable.
An important class of procedures for handling outliers in the data set focuses on reweighting the data points via a transformation of the likelihood function. Huber [Hub11] considers assigning binary weights to data points and identifies model-dependent procedures to identify outliers. In contrast, Wang et al. [WKB17] treat these weights as latent variables and infer them along with the parameters of the model. These methods are susceptible to the choice of priors over the weighting variables. An alternative set of robust procedures is based on the idea of localization [DF61, WB18]: each data point is allowed to depend on its own version of the latent variable, and these individual latent variables are tied together with the use of a hyperparameter, often fitted using empirical Bayes estimation. Although these methods build on intuitive notions of outlier removal, there is little theoretical understanding of the kinds of outliers that they can tolerate.
2.2 Sampling methods
Turning to sampling methods for posterior inference, various zeroth-order [LS93, MT96] and first-order methods [Erm75, RR98, Nea11] have been proposed for sampling from distributions over continuous spaces. Our focus in this paper is on the overdamped Langevin MCMC method, which was first proposed by Ermak [Erm75] in the context of molecular dynamics applications. Its non-asymptotic convergence (in total variation distance) was first studied by Dalalyan [Dal17] for log-smooth and strongly log-concave distributions. Cheng and Bartlett [CB18] extended the analysis to obtain a similar convergence result in Kullback-Leibler divergence. Such non-asymptotic convergence guarantees are essential to understanding the robustness of computational procedures, as they simultaneously capture the dependence of the sampling error on the number of iterations $k$, the dimensionality $d$, and the contamination level $\epsilon$.
3 Background
In this section we briefly review relevant background on Bayesian computation and formally describe our problem setup.
3.1 Bayesian modelling
Given parameters $\theta \in \mathbb{R}^d$ and a data set $X = \{x_1, \ldots, x_n\}$, we assume that the statistician has specified a prior distribution, $\pi(\theta)$, and a likelihood, $p(x_i \mid \theta)$. We can then form the posterior distribution, $p(\theta \mid X)$, as follows:
$$p(\theta \mid X) \propto \pi(\theta) \prod_{i=1}^{n} p(x_i \mid \theta).$$
We are generally concerned with the estimation of some test function $\phi$ under the posterior distribution, which is accomplished in the Bayesian setting by computing a posterior mean:
$$\bar{\phi} = \mathbb{E}_{\theta \sim p(\cdot \mid X)}[\phi(\theta)] = \int \phi(\theta)\, p(\theta \mid X)\, d\theta.$$
In practice, one way of computing this posterior mean is to use a Monte Carlo algorithm to generate a sequence of samples $\theta_1, \ldots, \theta_K$ and form an estimate $\hat{\phi}$:
$$\hat{\phi} = \frac{1}{K} \sum_{k=1}^{K} \phi(\theta_k).$$
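As a minimal illustration, the Monte Carlo estimate above is simply an average of the test function over the sampled iterates. The function name below is ours, not the paper's:

```python
import numpy as np

def posterior_mean_estimate(samples, g):
    """Monte Carlo estimate of the posterior mean of a test function g.

    samples: array of shape (K, d), one row per posterior sample.
    g: test function mapping a parameter vector to a scalar.
    """
    return np.mean([g(theta) for theta in samples])
```

For example, passing `g = lambda th: th[0]` recovers the posterior mean of the first coordinate of the parameter vector.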
3.2 Unadjusted Langevin Algorithm
We consider a specific Monte Carlo algorithm, the Unadjusted Langevin Algorithm (ULA), for sampling from probability distributions over continuous spaces. Generically, we consider distributions over $\mathbb{R}^d$ of the form
$$p^*(\theta) \propto \exp(-f(\theta)),$$
for a class of functions $f$ for which this density is well defined. The ULA algorithm starts from some initial point $\theta_0$ and defines a sequence of parameters, $(\theta_k)_{k \ge 0}$, according to the following update equation:
$$\theta_{k+1} = \theta_k - h_{k+1} \nabla f(\theta_k) + \sqrt{2 h_{k+1}}\, \xi_{k+1}, \qquad (1)$$
where $(h_k)_{k \ge 1}$ denotes a sequence of step sizes and $(\xi_k)_{k \ge 1}$ are i.i.d. standard Gaussian vectors. The Markov chain in Equation (1) is the Euler discretization of a continuous-time diffusion process known as the Langevin diffusion. The stochastic differential equation governing the Langevin diffusion is given by
$$d\theta_t = -\nabla f(\theta_t)\, dt + \sqrt{2}\, dB_t, \qquad (2)$$
where $B_t$ represents a $d$-dimensional Brownian motion. Denoting the distribution of $\theta_k$ by $p_k$, Cheng and Bartlett [CB18] showed that $p_k$ converges to $p^*$ in Kullback-Leibler divergence after polynomially many steps for functions $f$ which are smooth and strongly convex. Specializing to the Bayesian modelling setup, we rewrite the posterior distribution as:
$$p(\theta \mid x_1, \ldots, x_n) \propto \exp\Big(-\sum_{i=1}^{n} f_i(\theta) + \log \pi(\theta)\Big),$$
where $f_i(\theta) := -\log p(x_i \mid \theta)$ is the negative log-likelihood corresponding to the $i$-th data point. The ULA algorithm can then be used to form an approximation to the posterior as follows:
$$\theta_{k+1} = \theta_k - h_{k+1} \Big( \sum_{i=1}^{n} \nabla f_i(\theta_k) - \nabla \log \pi(\theta_k) \Big) + \sqrt{2 h_{k+1}}\, \xi_{k+1},$$
where $(h_k)$ and $(\xi_k)$ are the step-size sequence and independent Gaussian noise, respectively.
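The ULA iteration described above can be sketched in a few lines; this is a minimal illustration assuming a generic gradient oracle `grad_f` for the negative log-posterior (all names here are ours, not the paper's):

```python
import numpy as np

def ula_sample(grad_f, theta0, step_sizes, rng=None):
    """Unadjusted Langevin Algorithm.

    Iterates theta_{k+1} = theta_k - h_{k+1} * grad_f(theta_k)
    + sqrt(2 * h_{k+1}) * xi_{k+1}, with xi standard Gaussian.

    grad_f: gradient of the negative log-density to sample from.
    step_sizes: iterable of step sizes h_k.
    Returns an array of all iterates, one per row.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for h in step_sizes:
        noise = rng.standard_normal(theta.shape)
        theta = theta - h * grad_f(theta) + np.sqrt(2.0 * h) * noise
        samples.append(theta.copy())
    return np.array(samples)
```

For the Bayesian setup, `grad_f(theta)` would be the sum of the per-point negative log-likelihood gradients minus the gradient of the log prior.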
3.3 Problem Setup
We turn to a formal treatment of the robustness problem in the Bayesian setting. We consider the $\epsilon$-contamination model introduced by Huber [Hub64] and let the collection of data points $X = \{x_1, \ldots, x_n\}$ be obtained from the following mixture distribution:
$$x_i \overset{\text{i.i.d.}}{\sim} (1 - \epsilon) P^* + \epsilon Q, \qquad (3)$$
where $P^*$ denotes the true underlying generative distribution while $Q$ is an arbitrary distribution. A data set drawn from such a mixture distribution has each data point corrupted adversarially with probability $\epsilon$. We denote by $X_{\mathrm{cl}}$ the subset of data points in $X$ sampled from the true distribution $P^*$ and similarly let $X_{\mathrm{out}}$ denote the subset of data sampled from $Q$. Given data points $X$, the likelihood function $p(\cdot \mid \theta)$ and the prior $\pi$, we aim to form a clean posterior distribution given by:
$$p^*(\theta) := p(\theta \mid X_{\mathrm{cl}}) \propto \pi(\theta) \prod_{x_i \in X_{\mathrm{cl}}} p(x_i \mid \theta).$$
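To make the contamination model concrete, here is a hedged sketch of drawing a data set from the mixture; `sample_clean` and `sample_outlier` stand in for the true distribution and the adversarial distribution, and all names are ours:

```python
import numpy as np

def sample_huber_contamination(n, eps, sample_clean, sample_outlier, rng=None):
    """Draw n points from the mixture (1 - eps) * P + eps * Q.

    Each point is independently an outlier with probability eps.
    Returns the data array and a boolean mask marking outliers.
    """
    rng = np.random.default_rng() if rng is None else rng
    is_outlier = rng.random(n) < eps
    data = np.array([sample_outlier(rng) if o else sample_clean(rng)
                     for o in is_outlier])
    return data, is_outlier
```

Note that the *fraction* of corrupted points in a realized data set is random; it concentrates around eps, which is what the Hoeffding argument in Section 4.1 exploits.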
Accordingly, as in Section 3.1, we would like to robustly estimate the mean of the test function $\phi$ under the uncorrupted posterior $p^*$:
$$\bar{\phi} = \mathbb{E}_{\theta \sim p^*}[\phi(\theta)],$$
which we approximate via an estimate:
$$\hat{\phi} = \frac{1}{K} \sum_{k=1}^{K} \phi(\theta_k).$$
In the following section we present an algorithm, RobULA, for generating the sequence of samples and provide theoretical guarantees on its convergence properties. The main idea is to exploit gradient-based iterative sampling methods, and to leverage a robust mean estimation procedure to compute a robust estimate of the gradient at each iteration.
4 RobULA: Robust Unadjusted Langevin Algorithm
We turn to our proposed algorithm, RobULA (Algorithm 1), which aims to solve the robust posterior inference problem defined in Section 3.3. RobULA is a simple modification of the ULA algorithm described in Section 3.2: in each iteration, instead of using the complete set of data points to compute the gradient, we construct a robust estimate of the gradient and update the parameter using this estimate. The robust estimator ensures that the outlier data points do not exert too much influence on the gradient, and allows RobULA to obtain samples from a distribution close to the clean posterior distribution:
$$p^*(\theta) \propto \pi(\theta) \prod_{x_i \in X_{\mathrm{cl}}} p(x_i \mid \theta), \qquad (4)$$
where $X_{\mathrm{cl}}$ denotes the subset of uncorrupted ("clean") data points.
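The overall loop — robustly aggregate per-point negative log-likelihood gradients, rescale by the sample size, add the prior gradient, and take a Langevin step — can be sketched as follows. This is our illustrative sketch, not the paper's pseudocode; `robust_mean` is a placeholder for the estimator of Algorithm 2, and all other names are ours:

```python
import numpy as np

def robula(grad_f_i, grad_prior, data, eps, theta0, step_sizes,
           robust_mean, rng=None):
    """Sketch of a RobULA-style loop.

    grad_f_i(theta, x): gradient of one point's negative log-likelihood.
    grad_prior(theta):  gradient of the negative log prior.
    robust_mean(grads, eps): robust estimate of the mean gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    samples = []
    for h in step_sizes:
        grads = np.array([grad_f_i(theta, x) for x in data])  # shape (n, d)
        # Robustly estimate the mean gradient, then rescale by n.
        g_hat = n * robust_mean(grads, eps) + grad_prior(theta)
        theta = theta - h * g_hat + np.sqrt(2.0 * h) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())
    return np.array(samples)
```

Swapping `robust_mean` for the plain average recovers ordinary ULA on the full (possibly corrupted) data set.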
Before proceeding to establish the convergence guarantees for RobULA, we present the robust gradient estimation procedure.
4.1 Robust Gradient Estimation
Algorithm 2 describes our robust gradient estimation procedure. Based on the robust mean estimator of Lai et al. [LRV16], it takes as input the gradients of the negative log-likelihoods and outputs a robust estimate of the mean of these gradient vectors, assuming an $\epsilon$ fraction of them are arbitrarily corrupted. Algorithm 1 then scales this gradient estimate by the number of samples $n$ to obtain a robust estimate of the gradient of the total negative log-likelihood.
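Algorithm 2 itself is the spectral estimator of Lai et al. [LRV16]; as a much simpler illustrative stand-in (not the estimator analyzed in this paper), a coordinate-wise trimmed mean also limits the influence of an extreme fraction of points:

```python
import numpy as np

def trimmed_mean(vectors, eps):
    """Coordinate-wise trimmed mean.

    In each coordinate, discard the ceil(eps * n) largest and smallest
    values, then average the remainder. A simplified stand-in for the
    spectral robust mean estimator of Lai et al. [LRV16]; it lacks that
    estimator's dimension-dependent guarantees.
    """
    v = np.sort(np.asarray(vectors, dtype=float), axis=0)
    n = v.shape[0]
    k = int(np.ceil(eps * n))
    if 2 * k >= n:
        raise ValueError("eps too large: all points would be trimmed")
    return v[k:n - k].mean(axis=0)
```

A single gross outlier among a handful of points is entirely discarded, whereas it would shift the plain average arbitrarily.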
Note that the model described in Section 3.3 assumes that each data point is sampled i.i.d. from the mixture distribution $(1 - \epsilon) P^* + \epsilon Q$, where $P^*$ represents the true generative distribution and $Q$ can be any arbitrary distribution. An application of the Hoeffding bound for Bernoulli random variables shows that, with probability at least $1 - \delta$, the fraction $\hat{\epsilon}$ of corrupted points in the sampled data set satisfies
$$\hat{\epsilon} \le \epsilon + \sqrt{\frac{\log(1/\delta)}{2n}} =: \tilde{\epsilon}. \qquad (5)$$
For the remainder of the paper, we condition on this high-probability event and state our results assuming this event holds. Following the proof strategy of [LRV16] and [PSBR18], we derive a bound on the estimation error of the true average log-likelihood gradient, uniformly over the value of the iterate $\theta$, in the following lemma. We let $\bar{\nabla} f(\theta)$ denote the true value of this average log-likelihood gradient, i.e., the average of the gradients $\nabla f_i(\theta)$ over the clean data points.
Lemma 1 (Robust Gradient Estimation)
Let $U$ denote the uniform distribution over the clean data points and let $U_\theta$ denote the corresponding distribution over the gradients $\{\nabla f_i(\theta)\}$, with mean given by $\bar{\nabla} f(\theta)$, covariance $\Sigma_\theta$, and bounded fourth moments. There exists a positive constant for which the robust mean estimator, when instantiated with the contamination level $\tilde{\epsilon}$, returns, with high probability, an estimate $\hat{g}(\theta)$ such that, for all $\theta$, the error $\lVert \hat{g}(\theta) - \bar{\nabla} f(\theta) \rVert_2$ is bounded in terms of $\tilde{\epsilon}$, the covariance $\Sigma_\theta$, and the fourth-moment bound.
Remark.
Note that Proposition 1 of Prasad et al. [PSBR18] presents a high-probability bound similar to ours which is applicable for a fixed parameter $\theta$. Such a bound, however, does not suffice to ensure convergence of RobULA, because the additive Gaussian noise at every iterate requires us to obtain a uniform high-probability recovery error bound (see Section 4.2 for details). Lemma 1 establishes such a uniform bound for the specific distribution $U_\theta$, which is uniform over the clean data. This restriction of the distribution also allows us to avoid sample-splitting at every iteration of Algorithm 1, which was essential for both [LRV16] and [PSBR18].
In addition, Lemma 1 indicates that, irrespective of the sample size $n$, one can estimate the mean of the gradient robustly only up to an error that scales with the contamination level. This implies that at each iteration of RobULA we incur a bias proportional to $n$ times this error, since we scale the average gradient estimate by $n$ during the update. In Theorem 2 we show how, with an appropriate choice of step size $h$, one can control the propagation of this bias in the convergence analysis for RobULA.
4.2 Convergence Analysis
In this section, we study the convergence of the proposed algorithm, RobULA. For ease of notation, we write $f_{\mathrm{cl}}$ and $f_{\mathrm{out}}$ for the sums of the negative log-likelihoods over the clean and corrupted subsets of the data, respectively. The objective of the robust Bayesian posterior inference problem is then to obtain samples from the clean posterior distribution given by $p^*(\theta) \propto \pi(\theta) \exp(-f_{\mathrm{cl}}(\theta))$. For clarity of exposition, we drop the dependence of the posterior distribution on the data set as well as the hyperparameters and simply write $p^*$.
We quantify the convergence of a distribution $p_t$ following a stochastic process to the stationary distribution $p^*$ through the Kullback-Leibler divergence:
$$\mathrm{KL}(p \,\|\, q) = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, d\theta.$$
We define the Wasserstein distance between a pair of distributions as:
$$W_2(p, q) = \Big( \inf_{\gamma \in \Gamma(p, q)} \int \lVert \theta - \theta' \rVert_2^2 \, d\gamma(\theta, \theta') \Big)^{1/2},$$
where $\Gamma(p, q)$ denotes the set of joint distributions such that the first set of coordinates has marginal $p$ and the second set has marginal $q$. We begin by making the following assumptions on the function $f_{\mathrm{cl}}$, the sum of negative log-likelihoods over the clean data:

Assumption 1 (Lipschitz smoothness). The function $f_{\mathrm{cl}}$ is $L$-Lipschitz smooth and its Hessian exists for all $\theta$. That is,
$$\lVert \nabla f_{\mathrm{cl}}(\theta) - \nabla f_{\mathrm{cl}}(\theta') \rVert_2 \le L \lVert \theta - \theta' \rVert_2 \quad \text{for all } \theta, \theta' \in \mathbb{R}^d.$$

Assumption 2 (Strong convexity). The function $f_{\mathrm{cl}}$ is $m$-strongly convex. That is,
$$f_{\mathrm{cl}}(\theta') \ge f_{\mathrm{cl}}(\theta) + \langle \nabla f_{\mathrm{cl}}(\theta), \theta' - \theta \rangle + \frac{m}{2} \lVert \theta' - \theta \rVert_2^2 \quad \text{for all } \theta, \theta' \in \mathbb{R}^d.$$
We further denote the condition number of the function as $\kappa := L/m$.
The assumptions of Lipschitz smoothness and strong convexity are standard in both the sampling and optimization literatures. In addition to the assumptions, we define the average Lipschitz constant $\bar{L} := L/n$ and the average strong convexity $\bar{m} := m/n$ of $f_{\mathrm{cl}}$. We now state our main theorem concerning the convergence guarantees for RobULA.
Theorem 2 (Main Result)
Let $p^*(\theta) \propto \exp(-f_{\mathrm{cl}}(\theta))$, where $f_{\mathrm{cl}}$ satisfies Assumptions 1 and 2. Further, assume that the gradient estimates satisfy the uniform recovery guarantee of Lemma 1, where $\Sigma_\theta$ is the covariance of the uniform distribution on the gradients induced by the clean data set and $\tilde{\epsilon}$ is the high-probability bound on the fraction of corrupted points. Then the iterates of RobULA, when initialized with $\theta_0$ (with corresponding density $p_0$) and run with an appropriately chosen step size $h$, satisfy a bound on $\mathrm{KL}(p_k \,\|\, p^*)$ comprising an exponentially decaying initialization term, a discretization term, and a bias term depending on $\tilde{\epsilon}$, where $p_k$ represents the distribution of the $k$-th iterate.
Remarks.
Before proceeding to the proof of this theorem, a few comments are in order. First, observe that the error term consists of three different components: a) term (I) comprises an exponentially decaying dependence (in the number of time steps $k$) on the initial error; b) term (II) is a discretization error term; and c) term (III) captures the dependence on the fraction of corrupted points and vanishes as $\tilde{\epsilon}$ goes to zero.
For any given accuracy, if the step size $h$ and the number of iterations $K$ are chosen appropriately, then terms (I) and (II) can each be driven below the target accuracy, so that the overall error is dominated by the bias term (III). As we show in Section 5, for problems such as Bayesian linear regression and Bayesian mean estimation, the average strong convexity parameter $\bar{m}$ scales independently of the sample size $n$. While the accuracy can be set to arbitrarily small values, at the cost of a corresponding increase in the number of time steps, there remains a bias term depending on the contamination level $\epsilon$ which cannot be reduced by either increasing the sample size or increasing the number of iterations. This is consistent with results in the frequentist literature [BJK15, DKK16, LRV16, PSBR18], which show that such inconsistency is a result of the adversarial corruptions and in general cannot be avoided.
Proof of Theorem 2
We proceed to a proof of our main convergence result, Theorem 2. We begin by considering the process described by RobULA as a discretization of the Langevin dynamics given by Equation (2), with the following gradient estimate:
$$\hat{g}_t := \hat{g}\big(\theta_{\lfloor t/h \rfloor h}\big), \qquad (6)$$
where $\theta_{\lfloor t/h \rfloor h}$ represents the random variable describing the process at the most recent iterate when using step size $h$. This is equivalent to defining the following stochastic differential equation:
$$d\theta_t = -\hat{g}\big(\theta_{\lfloor t/h \rfloor h}\big)\, dt + \sqrt{2}\, dB_t. \qquad (7)$$
We next state a lemma which provides a bound on the variance of the distribution of the iterates.
Lemma 3
For $\theta_t$ following Eq. (7), if the step size $h$ is chosen sufficiently small, then for all $t \ge 0$ the second moment $\mathbb{E}[\lVert \theta_t - \bar{\theta} \rVert_2^2]$ remains bounded, up to a universal constant, where $\bar{\theta}$ is the fixed point satisfying $\nabla f_{\mathrm{cl}}(\bar{\theta}) = 0$.
The proof of this lemma is deferred to Appendix B. Treating Lemma 3 as given, we proceed to the proof of Theorem 2.
We consider the dynamics in Equation (7) within the time range $t \in [kh, (k+1)h)$. From the Girsanov theorem [Oks03], we have that $\theta_t$ admits a density function $p_t$ with respect to the Lebesgue measure. This density function can also be represented as
where $p_t$ is the weak solution to the corresponding Kolmogorov forward equation, with $p_t$ and its derivatives defined weakly as functionals over the space of smooth bounded functions on $\mathbb{R}^d$ (we refer the reader to [SP14] for more details). As shown by Cheng and Bartlett [CB18], the time derivative of the KL divergence $\mathrm{KL}(p_t \,\|\, p^*)$ along $p_t$ can be computed explicitly, where the expectation appearing in this derivative is taken with respect to the joint distribution of $\theta_t$ and $\theta_{kh}$. Hence
(8)
where the first step follows from the definition of the gradient estimate $\hat{g}$. We first focus on the second term in the above expression, which can be bounded as:
(9)
where the first step follows by an application of Young's inequality and the second follows from the uniform bound on the gradient-estimation error. We define new constants to absorb the resulting factors.
Next, we proceed to bound the remaining term using Lemma 3. Let us define the deviation variable $\Delta_t := \theta_t - \theta_{kh}$ and bound the term as:
(10)
where the first step follows from Assumption 1 (Lipschitz smoothness). Plugging the bounds obtained in Equations (9) and (10) into Equation (8), we get, for $t \in [kh, (k+1)h)$, a differential inequality for the KL divergence, where one step follows from an application of the log-Sobolev inequality, with $\lambda$ being the log-Sobolev constant, and another follows from substituting the constants defined above.
Finally, using Grönwall's inequality, we obtain a one-step progress inequality for the KL divergence over the interval $[kh, (k+1)h)$. Repeated application of this progress inequality leads us to the claimed bound on $\mathrm{KL}(p_k \,\|\, p^*)$. The final result of the theorem can then be obtained by using Talagrand's inequality [OV00], which states that for probability distributions $p$ and $p^*$, with $p^*$ satisfying a log-Sobolev inequality with constant $\lambda$,
$$W_2(p, p^*) \le \sqrt{\frac{2\, \mathrm{KL}(p \,\|\, p^*)}{\lambda}},$$
which concludes the proof of the theorem.
5 Consequences for Mean Estimation and Regression
In this section, we study the fundamental problems of Bayesian mean estimation (Section 5.1) and Bayesian linear regression (Section 5.2) under the Huber contamination model.
5.1 Robust Bayesian mean estimation
We begin with the robust Bayesian mean estimation (RBME) problem and instantiate the convergence guarantees for RobULA (Algorithm 1) for this problem. For simplicity, we study the setup in which the likelihood is Gaussian:
$$p(x \mid \theta) = \mathcal{N}(x;\, \theta, \Sigma),$$
for a mean vector $\theta$ and a fixed positive definite covariance matrix $\Sigma$. We consider the corresponding conjugate prior over $\theta$, given by
$$\pi(\theta) = \mathcal{N}(\theta;\, \mu_0, \Sigma_0),$$
where $\mu_0$ is the mean and $\Sigma_0$ is the covariance matrix; the set of hyperparameters is $(\mu_0, \Sigma_0)$. Given $n$ data points sampled from the Huber contamination model $(1 - \epsilon)\, \mathcal{N}(\theta^*, \Sigma) + \epsilon\, Q$, with $Q$ being an arbitrary adversarially chosen distribution, the objective of the RBME problem is to sample from the posterior induced by the uncorrupted data points,
$$p^*(\theta) \propto \pi(\theta) \prod_{x_i \in X_{\mathrm{cl}}} \mathcal{N}(x_i;\, \theta, \Sigma), \qquad (11)$$
where $X_{\mathrm{cl}}$ represents the subset of points in $X$ sampled from the true distribution $\mathcal{N}(\theta^*, \Sigma)$. We note that for data sampled from the Huber contamination model, we have from Equation (5) that, with probability at least $1 - \delta$,
$$\hat{\epsilon} \le \epsilon + \sqrt{\frac{\log(1/\delta)}{2n}}. \qquad (12)$$
Let us denote by $\bar{n}_{\mathrm{out}}$ and $\underline{n}_{\mathrm{out}}$ the corresponding upper and lower bounds on the number of corrupted data points. Following the notation in Section 4, the function $f$ for this problem is given by:
$$f(\theta) = \frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^\top \Sigma^{-1} (x_i - \theta) + \frac{1}{2} (\theta - \mu_0)^\top \Sigma_0^{-1} (\theta - \mu_0). \qquad (13)$$
We also define the corresponding clean-data function $f_{\mathrm{cl}}$ (the analogous sum restricted to the clean points), as well as the sample mean and covariance of the clean data points.
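For the Gaussian model just described, the per-point and prior gradients needed by a RobULA-style iteration have simple closed forms; this is a sketch under the stated model, with all function and argument names ours:

```python
import numpy as np

def gaussian_nll_grad(theta, x, Sigma_inv):
    """Gradient in theta of one observation's negative log-likelihood,
    f_i(theta) = (1/2) (x - theta)^T Sigma^{-1} (x - theta),
    up to theta-independent constants.
    """
    return Sigma_inv @ (theta - x)

def gaussian_prior_grad(theta, mu0, Sigma0_inv):
    """Gradient of the negative log of the conjugate Gaussian prior,
    (1/2) (theta - mu0)^T Sigma0^{-1} (theta - mu0), up to constants.
    """
    return Sigma0_inv @ (theta - mu0)
```

Since each per-point gradient is an affine function of the observation $x_i$, robustly estimating the mean gradient here is essentially robust mean estimation of the (whitened) data points themselves.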
The following corollary instantiates the guarantees of Theorem 2 for the specific function defined for the RBME problem.
Corollary 4
Consider the RBME problem with the posterior given by Equation (11) and data sampled from the Huber model. Then the iterates of RobULA, with an appropriately chosen step size and number of iterations, satisfy a convergence guarantee of the form given in Theorem 2, instantiated with the smoothness and strong-convexity constants of the function $f$ in Equation (13), with a bias term scaling with the contamination level.