Bayesian Robustness: A Nonasymptotic Viewpoint

07/27/2019 · Kush Bhatia et al. (UC Berkeley)

We study the problem of robustly estimating the posterior distribution for the setting where observed data can be contaminated with potentially adversarial outliers. We propose Rob-ULA, a robust variant of the Unadjusted Langevin Algorithm (ULA), and provide a finite-sample analysis of its sampling distribution. In particular, we show that after T = Õ(d/ε_acc) iterations, we can sample from p_T such that dist(p_T, p^*) ≤ ε_acc + Õ(ϵ), where ϵ is the fraction of corruptions. We corroborate our theoretical analysis with experiments on both synthetic and real-world data sets for mean estimation, regression and binary classification.


1 Introduction

Robustness has been of ongoing interest in both the Bayesian [DF61, BMP94] and frequentist [Tuk60, Hub64, Hub73a] settings since the notion was introduced by George Box in 1953 [Box53]. The goal is to capture the sensitivity of inferential procedures to the presence of outliers in the data and to misspecifications of the modelling assumptions, and to mitigate overly large sensitivity. The Bayesian approach has focused on capturing possible anomalies in the observed data via the model and on choosing priors that have minimal effect on inferences. The frequentist approach, on the other hand, has focused on the development of estimators that identify and guard against outliers in the data. We refer the reader to [Hub11, Chap. 15] for a comprehensive discussion.

The focus on model robustness in Bayesian statistics is implemented via sensitivity studies that aim to understand the effects of misspecification of the prior distribution [BMP94, MSLD17] and its propagation to the posterior [Hub73b]. There is, however, little in the way of a comprehensive formal finite-sample framework for Bayesian robustness. Huber asked "Why is there no finite-sample Bayesian robustness theory?" and Kong suggested that such a theory is infeasible in full generality, arguing that it is computationally infeasible to carry out the necessary calculations even in finite spaces.

We address this issue by providing a formal framework for studying Bayesian robustness and by proposing a robust inferential procedure with finite-sample guarantees. We sidestep issues of computational infeasibility by refraining from modelling outlier data explicitly. Instead, we posit that the collected data contains a small fraction of observations which are not explained by the modelling assumptions. This corruption model, termed an ϵ-contamination model, was first studied by Huber in 1964 [Hub64] and has been the subject of recent computationally-focused work in the frequentist setting [DKK16, LRV16, PSBR18].

Given data corrupted according to an ϵ-contamination model, our goal is to sample from the clean posterior distribution: the posterior distribution obtained by conditioning only on the uncorrupted ("clean") data. Our key idea is to leverage a robust approach to mean estimation in the context of gradient-based sampling techniques. Our overall procedure is a robust variant of the Unadjusted Langevin Algorithm (ULA) that we refer to as "Rob-ULA." The underlying ULA algorithm and its variants have been used for efficient large-scale Bayesian posterior sampling [RT96, WT11], and their convergence analysis has been a recent topic of interest [Dal17, DM17, CB18, DCWY18, MCJ18]; see Section 2.2 for a detailed overview. Informally, our main result shows that after T = Õ(d/ε_acc) iterations of Rob-ULA, the iterate has a distribution p_T such that dist(p_T, p^*) ≤ ε_acc + Õ(ϵ), where ϵ is the fraction of corrupted points in the data set.

The remainder of the paper is organized as follows: Section 2 contains a discussion of the related literature, Section 3 discusses relevant background as well as the formal problem setup, and Section 4 describes the proposed algorithm, Rob-ULA, and states our main theorem regarding its convergence. In Section 5 we discuss the fundamental problems of Bayesian mean estimation and linear regression in the robust setting. Section 6 consists of an experimental evaluation of Rob-ULA on synthetic as well as real-world data sets, and we conclude in Section 7.

2 Related Work

In this section, we review related work on robust estimation procedures in both the Bayesian and frequentist settings. We also discuss work on using Langevin dynamics to sample from distributions over continuous spaces.

2.1 Robust statistical procedures

There are many threads in the literature on robust estimation and outlier detection [Hub73a, Box53, DF61]. In the frequentist parameter estimation setting, the most commonly studied model is Huber's classical ϵ-contamination model. There has also been a recent focus on an adversarial paradigm devoted to developing computationally efficient, problem-dependent estimators for mean estimation [LRV16, DKK16], linear regression [KKM18, BJK15, BJKK17, SBRJ19], and general risk minimization problems [PSBR18, DKK19]. Particularly relevant to our setup are [PSBR18] and [DKK19], which utilize the robust mean estimators of [LRV16, DKK16] to robustify gradient-based procedures for empirical risk minimization.

The study of robustness in the Bayesian framework has focused primarily on developing models and priors that have minimal impact on inference. An important line of work has studied the sensitivity of the posterior distribution to the choice of prior and has led to the development of noninformative priors [BMP94, MSLD17, MD18]. These methods are orthogonal to those considered in the current paper, as they do not aim to robustify inferential procedures against corruptions in the observed data set. In principle, a complete Bayesian model would have the capacity to explain the outliers present in the data, but this would require performing an integral over all models with a proper prior. Such an approach would generally be computationally intractable.

An important class of procedures for handling outliers in the data set focuses on reweighting the data points via a transformation of the likelihood function. Huber [Hub11] considers assigning binary weights to data points and develops model-dependent procedures to identify outliers. In contrast, Wang et al. [WKB17] treat these weights as latent variables and infer them along with the parameters of the model. These methods are sensitive to the choice of priors over the weighting variables. An alternative set of robust procedures is based on the idea of localization [DF61, WB18]: each data point is allowed to depend on its own version of the latent variable, and these individual latent variables are tied together through a hyperparameter, often fitted using empirical Bayes estimation. Although these methods build on intuitive notions of outlier removal, there is little theoretical understanding of the kinds of outliers that they can tolerate.

2.2 Sampling methods

Turning to sampling methods for posterior inference, various zeroth-order [LS93, MT96] and first-order methods [Erm75, RR98, Nea11] have been proposed for sampling from distributions over continuous spaces. Our focus in this paper is on the overdamped Langevin MCMC method, which was first proposed by Ermak [Erm75] in the context of molecular dynamics applications. Its nonasymptotic convergence (in total variation distance) was first studied by Dalalyan [Dal17] for log-smooth and log-strongly-concave distributions. Cheng and Bartlett [CB18] extended the analysis to obtain a similar convergence result in Kullback-Leibler divergence. Such nonasymptotic convergence guarantees are essential to understanding the robustness of computational procedures, as they simultaneously capture the dependence of the sampling error on the number of iterations T, the dimensionality d, and the contamination level ϵ.

3 Background

In this section we briefly review relevant background on Bayesian computation and formally describe our problem setup.

3.1 Bayesian modelling

Given parameters θ ∈ ℝ^d and a data set X = {x_1, …, x_n}, we assume that the statistician has specified a prior distribution, p(θ), and a likelihood, p(x | θ). We can then form the posterior distribution, p(θ | X), as follows:

p(θ | X) ∝ p(θ) ∏_{i=1}^n p(x_i | θ).

We are generally concerned with the estimation of some test function g under the posterior distribution, which is accomplished in the Bayesian setting by computing the posterior mean:

ḡ = ∫ g(θ) p(θ | X) dθ.

In practice, one way of computing this posterior mean is to use a Monte Carlo algorithm to generate a sequence of samples θ_1, …, θ_T and form the estimate:

ĝ = (1/T) ∑_{t=1}^T g(θ_t).
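For concreteness, the short sketch below (ours, not from the paper) computes this Monte Carlo average; `samples` stands for draws produced by any approximate posterior sampler and `test_fn` plays the role of the test function g.

import numpy as np

def monte_carlo_posterior_mean(samples, test_fn):
    """Estimate the posterior mean of test_fn by averaging over (approximate) posterior draws.

    samples: array of shape (T, d) of sampler output; test_fn: map from R^d to R.
    """
    return np.mean([test_fn(theta) for theta in samples])

# Example: estimate the posterior mean of the first coordinate.
rng = np.random.default_rng(0)
fake_samples = rng.normal(size=(1000, 3))          # stand-in for sampler output
print(monte_carlo_posterior_mean(fake_samples, lambda th: th[0]))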

3.2 Unadjusted Langevin Algorithm

We consider a specific Monte Carlo algorithm, the Unadjusted Langevin Algorithm (ULA), for sampling from probability distributions over continuous spaces. Generically, we consider distributions over ℝ^d of the form

p*(θ) ∝ exp(−f(θ)),

for a class of functions f which are square integrable. The ULA algorithm starts from some initial point θ_0 ∈ ℝ^d and defines a sequence of iterates, {θ_k}, according to the following update equation:

θ_{k+1} = θ_k − η_k ∇f(θ_k) + √(2 η_k) ξ_k,     (1)

where {η_k} denotes a sequence of step sizes and the ξ_k are i.i.d. standard Gaussian vectors. The Markov chain in Equation (1) is the Euler discretization of a continuous-time diffusion process known as the Langevin diffusion. The stochastic differential equation governing the Langevin diffusion is given by

dθ_t = −∇f(θ_t) dt + √2 dB_t,     (2)

where B_t represents a d-dimensional Brownian motion. Denoting the distribution of θ_k by p_k, Cheng and Bartlett [CB18] showed that KL(p_k ‖ p*) ≤ ε after k = Õ(d/ε) steps for functions f which are smooth and strongly convex. Specializing to the Bayesian modelling setup, we rewrite the posterior distribution as

p(θ | X) ∝ exp(−∑_{i=1}^n f_i(θ) − f₀(θ)),

where f_i(θ) = −log p(x_i | θ) is the negative log-likelihood corresponding to the i-th data point and f₀(θ) = −log p(θ) is the negative log-prior. The ULA algorithm can then be used to form an approximation to the posterior as follows:

θ_{k+1} = θ_k − η_k (∑_{i=1}^n ∇f_i(θ_k) + ∇f₀(θ_k)) + √(2 η_k) ξ_k,

where {η_k} and {ξ_k} are the step-size sequence and independent Gaussian noise vectors, respectively.
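To fix ideas, here is a minimal sketch of the ULA update (1) for a generic potential f, assuming access to its gradient; the function names and the Gaussian example are ours, not part of the paper.

import numpy as np

def ula(grad_f, theta0, step_size, n_steps, rng=None):
    """Unadjusted Langevin Algorithm: theta <- theta - eta * grad_f(theta) + sqrt(2 * eta) * xi."""
    rng = rng or np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    iterates = []
    for _ in range(n_steps):
        xi = rng.standard_normal(theta.shape)   # i.i.d. standard Gaussian noise
        theta = theta - step_size * grad_f(theta) + np.sqrt(2.0 * step_size) * xi
        iterates.append(theta.copy())
    return np.array(iterates)

# Example: sample from N(0, I) in 2 dimensions, where f(theta) = ||theta||^2 / 2.
samples = ula(grad_f=lambda th: th, theta0=np.zeros(2), step_size=0.05, n_steps=5000)
print(samples[1000:].mean(axis=0), samples[1000:].var(axis=0))  # roughly 0 and 1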

3.3 Problem Setup

We turn to a formal treatment of the robustness problem in the Bayesian setting. We consider the ϵ-contamination model introduced by Huber [Hub64] and let the collection of data points X = {x_1, …, x_n} be obtained from the following mixture distribution:

x_i ~ (1 − ϵ) P* + ϵ Q,     (3)

where P* denotes the true underlying generative distribution while Q is an arbitrary distribution. A data set drawn from such a mixture distribution has each data point corrupted adversarially with probability ϵ. We denote by X_cl the subset of data points in X sampled from the true distribution P* and similarly let X_out denote the subset of data sampled from Q. Given the data points X, the likelihood function p(x | θ), and the prior p(θ), we aim to form the clean posterior distribution, given by

p*(θ) := p(θ | X_cl) ∝ p(θ) ∏_{x_i ∈ X_cl} p(x_i | θ).

Accordingly, as in Section 3.1, we would like to robustly estimate the mean of the test function g under the uncorrupted posterior p*,

ḡ = ∫ g(θ) p*(θ) dθ,

which we approximate via the estimate

ĝ = (1/T) ∑_{t=1}^T g(θ_t).

In the following section we present an algorithm, Rob-ULA, for generating the sequence of samples and provide theoretical guarantees on its convergence properties. The main idea is to exploit gradient-based iterative sampling methods, and to leverage a robust mean estimation procedure to compute a robust estimate of the gradient at each iteration.
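As a small illustration (our own, with hypothetical choices of P* and Q), the following sketch draws an ϵ-contaminated data set according to the Huber model (3):

import numpy as np

def huber_contaminated_sample(n, eps, mu_true, rng=None):
    """Draw n points where each is, independently, an outlier with probability eps.

    Clean points come from P* = N(mu_true, I); outliers come from an (arbitrary, here
    adversarially shifted) distribution Q. Returns the data and the mask of clean points.
    """
    rng = rng or np.random.default_rng(0)
    d = len(mu_true)
    is_clean = rng.random(n) > eps
    data = np.where(is_clean[:, None],
                    rng.normal(mu_true, 1.0, size=(n, d)),          # draws from P*
                    rng.normal(mu_true + 10.0, 0.1, size=(n, d)))   # draws from Q
    return data, is_clean

X, clean_mask = huber_contaminated_sample(n=500, eps=0.1, mu_true=np.zeros(2))
print(X[clean_mask].mean(axis=0), X.mean(axis=0))  # the naive mean is pulled toward the outliers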

Input: Data set X, step-size sequence {η_k}, initial covariance scaling, timesteps T, prior distribution p(θ), hyperparameters, likelihood function p(x | θ), corruption level ϵ̃
Sample θ₀ from a Gaussian initial distribution with the given covariance scaling
for k = 0, …, T − 1 do
       Compute the per-point gradients ∇f_i(θ_k) = −∇_θ log p(x_i | θ_k) for i = 1, …, n
       g_k ← RobustGradientEstimate({∇f_i(θ_k)}_{i=1}^n, ϵ̃, d)   (Algorithm 2)
       θ_{k+1} ← θ_k − η_k (n·g_k + ∇f₀(θ_k)) + √(2 η_k) ξ_k, with ξ_k ~ N(0, I) and f₀ = −log p(θ)
Output: Iterates θ₁, …, θ_T
Algorithm 1 Rob-ULA: Robust Unadjusted Langevin Algorithm

4 Rob-ULA: Robust Unadjusted Langevin Algorithm

We turn to our proposed algorithm, Rob-ULA (Algorithm 1), which aims to solve the robust posterior inference problem defined in Section 3.3. Rob-ULA is a simple modification of the ULA algorithm described in Section 3.2: in each iteration, instead of using the complete set of data points to compute the gradient, we construct a robust estimate of the gradient and update the parameter using this estimate. This robust estimator ensures that the outlier data points do not exert too much influence on the gradient and allows Rob-ULA to obtain samples from a distribution close to the clean posterior distribution:

p*(θ) ∝ p(θ) ∏_{x_i ∈ X_cl} p(x_i | θ).     (4)
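To make the structure of the update concrete, the sketch below implements this loop in Python, with a crude coordinate-wise trimmed mean standing in for the recursive estimator of Algorithm 2; all function and variable names (rob_ula, trimmed_mean, grad_nll) are ours, not the paper's.

import numpy as np

def trimmed_mean(vectors, eps):
    """Crude robust mean: drop an eps fraction from each tail of every coordinate, then average."""
    n = vectors.shape[0]
    k = int(np.floor(eps * n))
    sorted_v = np.sort(vectors, axis=0)
    return sorted_v[k:n - k].mean(axis=0)

def rob_ula(data, grad_nll, grad_neg_log_prior, eps, step_size, n_steps, rng=None):
    """Rob-ULA sketch: a Langevin step driven by a robustly aggregated mean gradient.

    grad_nll(theta, x) returns the gradient of the negative log-likelihood of one point x.
    The per-point gradients are robustly averaged, rescaled by n, combined with the prior
    gradient, and plugged into the ULA update of Equation (1).
    """
    rng = rng or np.random.default_rng(0)
    n, d = data.shape
    theta = np.zeros(d)
    iterates = []
    for _ in range(n_steps):
        per_point = np.array([grad_nll(theta, x) for x in data])    # shape (n, d)
        g_hat = trimmed_mean(per_point, eps)                         # robust average gradient
        drift = n * g_hat + grad_neg_log_prior(theta)
        theta = theta - step_size * drift + np.sqrt(2 * step_size) * rng.standard_normal(d)
        iterates.append(theta.copy())
    return np.array(iterates)

For instance, in the Gaussian mean-estimation setting of Section 5.1 with unit covariances, grad_nll can be taken as lambda th, x: th - x and grad_neg_log_prior as lambda th: th.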

Before proceeding to establish the convergence guarantees for Rob-ULA, we present the robust gradient estimation procedure.

4.1 Robust Gradient Estimation

Input: Sample gradients {g_1, …, g_n}, corruption level ϵ̃, dimension d
S = OutlierTruncation({g_i}, ϵ̃, d)
if d = 1 then
       ĝ ← mean(S)
else
       Σ̂ ← sample covariance of S
       Let V be the span of the top ⌈d/2⌉ principal components of Σ̂ and W its orthogonal complement, with P_V and P_W the projections onto V and W
       ĝ_V ← RobustGradientEstimate(P_V S, ϵ̃, dim(V))
       ĝ_W ← mean(P_W S)
       ĝ ← ĝ_V + ĝ_W
Output: Estimate of robust gradient: ĝ
Algorithm 2 RobustGradientEstimate
Input: Sample gradients {g_1, …, g_n}, corruption level ϵ̃, dimension d
if d = 1 then
       B ← smallest interval containing a (1 − ϵ̃) fraction of the points
else
       for j = 1, …, d do
              Let S_j be the samples restricted to coordinate j
              a_j ← RobustGradientEstimate(S_j, ϵ̃, 1)
       Let B be the ball of smallest radius centred at a = (a_1, …, a_d) containing a (1 − ϵ̃) fraction of the points
Output: Points after outlier removal: {g_i : g_i ∈ B}
Algorithm 3 Outlier Truncation

Algorithm 2 describes our robust gradient estimation procedure. Based on the robust mean estimator of Lai et al. [LRV16], it takes as input the gradients of the negative log-likelihoods ∇f_i(θ) and outputs an estimate of the robust mean of these gradient vectors (g_k in Algorithm 1), assuming that an ϵ̃ fraction of them are arbitrarily corrupted. Algorithm 1 then scales this gradient estimate by the number of samples n to obtain a robust estimate of the gradient of the total negative log-likelihood.
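The sketch below gives a simplified, non-recursive reading of the truncation step in Algorithms 2 and 3: coordinate-wise shortest-interval estimates provide a rough centre, points far from that centre are discarded, and the survivors are averaged. It is a stand-in for intuition only, not the full principal-components-based estimator of Lai et al. [LRV16], and all names are ours.

import numpy as np

def shortest_interval_mean(x, eps):
    """1-d robust estimate: mean of the shortest interval containing a (1 - eps) fraction of x."""
    x = np.sort(x)
    n = len(x)
    k = max(int(np.ceil((1 - eps) * n)), 1)
    widths = x[k - 1:] - x[:n - k + 1]
    start = int(np.argmin(widths))
    return x[start:start + k].mean()

def truncated_gradient_mean(grads, eps):
    """Simplified outlier truncation followed by averaging, for per-point gradients of shape (n, d).

    Coordinate-wise shortest-interval estimates give a rough centre; the points farthest from
    that centre are discarded (keeping a (1 - eps) fraction) and the survivors are averaged.
    """
    n, d = grads.shape
    centre = np.array([shortest_interval_mean(grads[:, j], eps) for j in range(d)])
    dists = np.linalg.norm(grads - centre, axis=1)
    k = max(int(np.ceil((1 - eps) * n)), 1)
    keep = np.argsort(dists)[:k]
    return grads[keep].mean(axis=0)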

Note that the model described in Section 3.3 assumes that each data point is sampled i.i.d. from the mixture distribution (1 − ϵ) P* + ϵ Q, where P* represents the true generative distribution and Q can be any arbitrary distribution. An application of the Hoeffding bound for Bernoulli random variables shows that, with probability at least 1 − δ, the fraction ϵ̃ of corrupted points in the sampled data set satisfies

|ϵ̃ − ϵ| ≤ √(log(2/δ) / (2n)).     (5)

For the remainder of the paper, we condition on this high-probability event and state our results assuming that it holds. Following the proof strategy of [LRV16] and [PSBR18], we derive, in the following lemma, a bound on the estimation error of the true average log-likelihood gradient that holds uniformly over all values of the iterate θ. We let ḡ(θ) denote the true value of this average log-likelihood gradient, that is, the average of the gradients ∇f_i(θ) over the clean data points.

Lemma 1 (Robust Gradient Estimation)

Let p_cl denote the uniform distribution over the clean data points X_cl, and let the corresponding distribution over the gradients {∇f_i(θ) : x_i ∈ X_cl} have mean ḡ(θ), covariance Σ(θ), and bounded fourth moments. There exists a positive constant c for which the robust mean estimator of Algorithm 2, when instantiated with the contamination level ϵ̃, returns, with probability at least 1 − δ, an estimate ĝ(θ) such that for all θ we have that

Remark.

Note that Proposition 1 of Prasad et al. [PSBR18] presents a high-probability bound similar to ours which is applicable for a fixed parameter θ. Such a bound, however, does not suffice to ensure convergence of Rob-ULA, because the additive Gaussian noise at every iteration requires a recovery-error bound that holds uniformly over θ with high probability (see Section 4.2 for details). Lemma 1 establishes such a uniform bound for the specific distribution p_cl, which is uniform over the clean data X_cl. This restriction of the distribution also allows us to avoid the sample splitting at every iteration of Algorithm 1 that was essential in both [LRV16] and [PSBR18].

In addition, Lemma 1 indicates that, irrespective of the sample size n, the mean of the gradient can be estimated robustly only up to an error that does not vanish with n. This implies that each iteration of Rob-ULA incurs a bias proportional to n times this estimation error, since we scale the average gradient estimate by n during the update. In Theorem 2 we show how, with an appropriate choice of step size η_k, one can control the propagation of this bias in the convergence analysis of Rob-ULA.

The detailed proof of Lemma 1 can be found in Appendix A.

4.2 Convergence Analysis

In this section, we study the convergence of the proposed algorithm, Rob-ULA. For ease of notation, we let f_cl and f denote the clean and corrupted versions of the potential, defined respectively by the uncorrupted subset X_cl and the full data set X. The objective of the robust Bayesian posterior inference problem is then to obtain samples from the clean posterior distribution, given by p*(θ) ∝ exp(−f_cl(θ)). For clarity of exposition, we drop the dependence of the posterior distribution on the data set as well as the hyperparameters.

We quantify the convergence of a distribution p_k, following a stochastic process, to the stationary distribution p* through the Kullback-Leibler (KL) divergence,

KL(p ‖ p*) = ∫ p(θ) log (p(θ) / p*(θ)) dθ.

We also define the Wasserstein distance between a pair of distributions p and q as

W₂(p, q) = ( inf_{γ ∈ Γ(p, q)} ∫ ‖θ − θ′‖₂² dγ(θ, θ′) )^{1/2},

where Γ(p, q) denotes the set of joint distributions over pairs (θ, θ′) such that the first set of coordinates has marginal p and the second set has marginal q.
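For intuition about the convergence metric, the KL divergence between two Gaussians has a closed form; the snippet below (ours, not from the paper) evaluates it, for example to quantify the effect of a small shift in the stationary distribution.

import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form."""
    d = len(mu0)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# Example: a sampler whose stationary distribution is shifted by 0.1 in each of d = 5 coordinates.
d = 5
print(kl_gaussians(np.full(d, 0.1), np.eye(d), np.zeros(d), np.eye(d)))  # equals 0.5 * d * 0.1**2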

We begin by making the following assumptions on the function f:

  • Assumption 1 (Lipschitz smoothness). The function f is L-Lipschitz smooth and its Hessian ∇²f(θ) exists for all θ. That is, ‖∇f(θ₁) − ∇f(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ for all θ₁, θ₂ ∈ ℝ^d.

  • Assumption 2 (Strong convexity). The function f is m-strongly convex. That is, f(θ₂) ≥ f(θ₁) + ⟨∇f(θ₁), θ₂ − θ₁⟩ + (m/2) ‖θ₂ − θ₁‖₂² for all θ₁, θ₂ ∈ ℝ^d.

    We further denote the condition number of the function f by κ := L/m.

The assumptions of Lipschitz smoothness and strong convexity are standard in both the sampling and optimization literatures. In addition to these assumptions, we define the average Lipschitz constant L̄ := L/n and the average strong convexity m̄ := m/n of the function f. We now state our main theorem concerning the convergence guarantees for Rob-ULA.
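As a small worked example (ours, not from the paper), for a quadratic potential, such as the Gaussian setups of Section 5, the Hessian is a constant matrix and the constants L, m, and κ can be read off from its extreme eigenvalues:

import numpy as np

# For f(theta) = 0.5 * theta^T A theta + b^T theta with A positive definite, the Hessian is A,
# so L is its largest eigenvalue, m its smallest, and kappa = L / m.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 0.5 * np.eye(4)            # a positive definite Hessian

eigvals = np.linalg.eigvalsh(A)           # eigenvalues in ascending order
m, L = eigvals[0], eigvals[-1]            # strong-convexity and smoothness constants
kappa = L / m                             # condition number
print(f"m = {m:.3f}, L = {L:.3f}, kappa = {kappa:.2f}")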

Theorem 2 (Main Result)

Let p*(θ) ∝ exp(−f(θ)), where f satisfies Assumptions 1 and 2. Further, assume that the gradient estimates satisfy the uniform recovery guarantee of Lemma 1, where Σ(θ) denotes the covariance of the uniform distribution over gradients induced by the clean data set X_cl and ϵ̃ denotes the fraction of corrupted points. Then the iterates of Rob-ULA, when initialized at θ₀ (with corresponding density p₀) and run with step size η, satisfy a bound of the form

KL(p_k ‖ p*) ≤ (I) + (II) + (III),

where p_k represents the distribution of the k-th iterate θ_k and the three terms are made precise below.

Remarks.

Before proceeding to the proof of this theorem, a few comments are in order. First, observe that the error term consists of three different components, (I), (II), and (III), where a) term (I) comprises an exponentially decaying dependence (in the number of time steps k) on the initial error KL(p₀ ‖ p*), b) term (II) is a discretization error term, and c) term (III) captures the dependence on the fraction ϵ̃ of corrupted points and vanishes as ϵ̃ goes to zero.

For any given accuracy ε_acc, if the step size η and the number of iterations T are chosen appropriately, with T = Õ(d/ε_acc), then the error in convergence can be bounded as ε_acc + Õ(ϵ̃).

As we show in Section 5, for problems such as Bayesian linear regression and Bayesian mean estimation, the average strong convexity parameter m̄ scales independently of the sample size n. This implies that the resulting error can be bounded by ε_acc plus a bias term of order Õ(ϵ̃). While the accuracy ε_acc can be set to an arbitrarily small value, at the cost of a corresponding increase in the number of time steps, the bias term depending on the contamination level ϵ̃ cannot be reduced either by increasing the sample size or by increasing the number of iterations. This is consistent with results in the frequentist literature [BJK15, DKK16, LRV16, PSBR18], which show that such inconsistency is a consequence of the adversarial corruptions and in general cannot be avoided.
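A small simulation (ours, not from the paper) illustrates why this bias is unavoidable with naive estimators: under contamination, the error of the plain sample mean plateaus at roughly ϵ times the outlier offset no matter how large n grows, whereas the error computed on the clean points alone vanishes.

import numpy as np

rng = np.random.default_rng(0)
eps, shift = 0.05, 5.0                        # contamination level and outlier offset

for n in [100, 1_000, 10_000, 100_000]:
    outliers = rng.random(n) < eps
    x = rng.normal(0.0, 1.0, size=n) + shift * outliers
    naive_error = abs(x.mean())               # bias of about eps * shift, does not shrink with n
    clean_error = abs(x[~outliers].mean())    # error of the mean over clean points only
    print(f"n={n:>6}  naive |error|={naive_error:.3f}  clean-only |error|={clean_error:.3f}")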

Lemma 13 in Appendix B presents a bound on the initial error KL(p₀ ‖ p*).

Proof of Theorem 2

We proceed to a proof of our main convergence result, Theorem 2. We begin by considering the process described by Rob-ULA as a discretization of the Langevin dynamics given by Equation 2, with the following gradient estimate:

(6)

where θ_k represents the random variable describing the process at the k-th iterate when using step size η. This is equivalent to defining the following stochastic differential equation

(7)

We next state a lemma which provides a bound on the variance of the distribution of the iterate θ_t.

Lemma 3

For θ_t following Eq. (7), if the step size η, the contamination level ϵ̃, and the initial distribution satisfy the conditions of Theorem 2, then for all t ≥ 0,

where C is a universal constant and θ* is the fixed point of the dynamics in Eq. (7).

The proof of this lemma is deferred to Appendix B. Treating Lemma 3 as given, we proceed to the proof of Theorem 2.

We consider the dynamics in Equation (7) within a single discretization interval. From the Girsanov theorem [Oks03] we have that θ_t admits a density function p_t with respect to the Lebesgue measure. This density function can also be represented as

where p_t is the weak solution to the following Kolmogorov forward equation:

where the density and its derivatives are defined, via duality, as functionals over the space of smooth bounded functions on ℝ^d (we refer the reader to [SP14] for more details). As shown by Cheng and Bartlett [CB18], the time derivative of the KL divergence along p_t is given by:

where the expectation is taken with respect to the joint distribution of the current point θ_t and the most recent iterate. Hence,

(8)

where the equality follows from the definition of the dynamics in Equation (7). We first focus on the second term in the above expression, which can be bounded as:

(9)

where the first inequality follows by an application of Young's inequality and the second follows from the point-wise assumption on the gradient estimates. We define new constants to simplify the resulting expression.

Next, we proceed to bound the remaining term using Lemma 3. To this end, we introduce an auxiliary variable and bound the term as follows:

(10)

where the inequality follows from the smoothness assumption (Assumption 1) and we define the corresponding constants. Plugging the bounds obtained in Equations (9) and (10) into Equation (8), we get, for a sufficiently small step size η:

where the first step follows from an application of the log-Sobolev inequality (with the associated log-Sobolev constant) and the second follows from the choice of step size and the corresponding substitution.

Finally, using Gronwall’s inequality we have the following one-step progress equation:

Repeated application of this progress inequality leads us to

The final result of the theorem can then be obtained by using Talagrand's inequality [OV00], which states that for probability distributions p and p* (with p* strongly log-concave with parameter m) we have that

W₂(p, p*) ≤ √(2 KL(p ‖ p*) / m),

which concludes the proof of the theorem.

5 Consequences for Mean Estimation and Regression

In this section, we study the fundamental problems of Bayesian mean estimation (Section 5.1) and Bayesian linear regression (Section 5.2) under the Huber ϵ-contamination model.

5.1 Robust Bayesian mean estimation

We begin with the robust Bayesian mean estimation (RBME) problem and instantiate the convergence guarantees of Rob-ULA (Algorithm 1) for this problem. For simplicity, we study the setup in which the likelihood is Gaussian,

p(x | θ) = N(x; θ, Σ),

for a mean vector θ and a fixed positive definite covariance matrix Σ. We consider the corresponding conjugate prior over θ, given by

p(θ) = N(θ; μ₀, Σ₀),

where μ₀ is the mean and Σ₀ is the covariance matrix; the pair (μ₀, Σ₀) constitutes the hyperparameters. Given data points X = {x_1, …, x_n} sampled from the Huber contamination model (3), with Q an arbitrary, adversarially chosen distribution, the objective of the RBME problem is to sample from the posterior induced by the uncorrupted data points,

p*(θ) ∝ p(θ) ∏_{x_i ∈ X_cl} p(x_i | θ),     (11)

where X_cl represents the subset of points in X sampled from the distribution P*. We note that for data sampled from the Huber contamination model, we have from Equation (5) that, with probability at least 1 − δ,

|ϵ̃ − ϵ| ≤ √(log(2/δ) / (2n)).     (12)

Let us denote by the corresponding quantities the upper and lower bounds on the number of corrupted data points implied by Equation (12). Following the notation in Section 4, the function f for this problem is given (up to an additive constant) by:

f(θ) = (1/2)(θ − μ₀)ᵀ Σ₀⁻¹ (θ − μ₀) + (1/2) ∑_{i=1}^n (x_i − θ)ᵀ Σ⁻¹ (x_i − θ).     (13)

Also, we define the corresponding clean-data version of this function, as in Section 3, as well as the sample mean and covariance of the clean data points.
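Under the Gaussian likelihood and conjugate Gaussian prior above, the gradients fed to Algorithm 2 take a simple linear form; the following sketch (ours, with hypothetical names) computes the per-point likelihood gradients and the prior gradient.

import numpy as np

def rbme_per_point_grads(theta, data, cov_inv):
    """Per-point gradients of the Gaussian negative log-likelihood: grad f_i(theta) = Sigma^{-1} (theta - x_i)."""
    return (theta - data) @ cov_inv          # cov_inv is symmetric; data has shape (n, d), result (n, d)

def rbme_prior_grad(theta, mu0, prior_cov_inv):
    """Gradient of the negative log of the conjugate Gaussian prior N(mu0, Sigma0)."""
    return prior_cov_inv @ (theta - mu0)

# Example: these are exactly the vectors a robust gradient estimator would aggregate at each step.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
g_i = rbme_per_point_grads(np.zeros(3), X, np.eye(3))
print(g_i.shape, rbme_prior_grad(np.zeros(3), np.zeros(3), np.eye(3)))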

The following corollary instantiates the guarantees of Theorem 2 for the specific function f defined in Equation (13) for the RBME problem.

Corollary 4

Consider the RBME problem with posterior given by Equation (11) and data sampled from the Huber model. Then the iterates of Rob-ULA, with an appropriately chosen step size η, satisfy: