Modern statistical and machine learning algorithms typically operate under limited human supervision, therefore robustness - the ability of algorithms to properly handle atypical or corrupted inputs - is an important and desirable property. Robustness of the most basic algorithms, such as estimation of the mean and covariance structure that serve as “building blocks” of more complex methods, have received significant attention in the mathematical statistics and theoretical computer science communities; the survey papers by Lugosi and Mendelson (2019a); Diakonikolas and Kane (2019) provide excellent overview of the recent contributions of these topics as well as applications to a variety of statistical problems. The key defining characteristics of modern robust methods are (a) their ability to operate under minimal model assumptions; (b) ability to handle high-dimensional inputs and (c) computational tractability. However, many algorithms that provably admit strong theoretical guarantees are not computationally efficient. In this work, we deal with a class of methods that can be broadly viewed as risk minimization: the output (or the solution) provided by such methods is always a minimizer of the properly defined risk, or cost function. For example, estimation of the mean
of a square-integrable random variablecan be viewed as minimization of the risk over . Since the risk involves the expectation with respect to the unknown distribution, its exact computation is impossible. Instead, risk minimization methods introduce a robust data-dependent “proxy” of the risk function, and attempt to minimize it instead. The robust empirical risk minimization method by Brownlees, Joly and Lugosi brownlees2015empirical, the “median of means tournaments” developed by Lugosi and Mendelson (2019b)
and a closely related method of Lecué and Lerasle lecue2020robust2 are the prominent examples of this approach. Unfortunately, the resulting problems are computationally hard as they typically involve minimization of general non-convex functions. In this paper, we propose a Bayesian analogue of robust empirical risk minimization that allows one to replace non-convex loss minimization by sampling that can be readily handled by many existing MCMC algorithms. Moreover, we show that for the parametric models, our approach preserves one of the key benefits of Bayesian methods - the “build-in” quantification of uncertainty - and leads to valid confidence sets.
Next, we introduce the mathematical framework required for the rest of the exposition. Let be a random variable with values in some measurable space and unknown distribution . Suppose that are the training data – N i.i.d. copies of . We assume that the sample has been modified in the following way: a random set of observations were replaced by arbitrary values, possibly depending on the sample. Only the corrupted values are observed.
Suppose that has a density with respect to a -finite measure (for instance, the Lebesgue measure or the counting measure). We will assume that belongs to a family of density functions , where is a compact subset, and that for some unknown in the interior of . We will also make the necessary identifiability assumption stating that is the unique minimizer of over , where is the negative log-likelihood, that is, . Clearly, an approach based on minimizing the classical empirical risk of leads to familiar maximum likelihood estimator (MLE) . At the same time, the main object of interest in the Bayesian approach is the posterior distribution
, which is a random probability measure ondefined via
for all measurable sets . Here, is the prior distribution with density with respect to the Lebesgue measure. The following result, known as the Bernstein-von Mises (BvM) theorem that is due to L. Le Cam in its present form (see Van der Vaart van2000asymptotic for its proof and discussion), is one of the main bridges connecting the frequentist and Bayesian approaches.
Theorem (Bernstein-von Mises).
Under the appropriate regularity assumptions on the family ,
where is the MLE, stands for the total variation distance, is the Fisher Information matrix and denotes convergence in probability (with respect to the distribution of the sample ).
In essence, BvM theorem implies that for a given the credible set, i.e. the set of posterior probability, coincides asymptotically with the set of probability under the distribution , which is well-known to be an asymptotically valid (frequentist) confidence interval for , again under minimal regularity assumptions rigorously defined for instance in the book by Van der Vaart (2000). It is well known however that the standard posterior distribution is, in general, not robust: if the sample contains even one corrupted observation (referred to as an “outlier” in what follows), the posterior distribution can concentrate arbitrarily far from the “true” parameter that defines the data-generating distribution. We will further illustrate this fact by giving concrete examples below. Many existing works are devoted to robustness of Bayesian methods, such as the papers by Doksum and Lo (1990) and Hoff (2007) that investigate methods based on conditioning on partial information, as well as a more recent paper by Miller and Dunson (2015) who introduced the notion of the “coarsened” posterior; however, its behavior in the presence of outliers is not explicitly addressed. In other works, contamination is typically accommodated by either assuming heavy-tailed likelihoods, like the Student’s t-distribution, on the outliers Svensen and Bishop (2005), or by attempting to identify and remove them Bayarri and Berger (1994).
Here, we propose an approach that instead relies on the ideas inspired by the median of means principle to construct a robust proxy for the log-likelihood of the data and, consequently, a robust version of the posterior distribution. The original MOM estimator was proposed by Nemirovsky and Yudin (1983) and later rediscovered by Jerrum et al. (1986); Alon et al. (1999). Its versions and extensions were studied more recently by many authors including Lerasle and Oliveira (2011); Lugosi and Mendelson (2019b); Lecué and Lerasle (2020); Minsker (2020); we refer the reader to the surveys mentioned in the introduction for a more detailed literature overview.
The main advantage of the proposed approach is the fact that the resulting posterior distribution satisfies a version of the Bernstein-von Mises theorem similar to the one above. One of the main implications of this result is the fact that asymptotically, the “credible sets” obtained from the proposed posterior distribution are valid confidence intervals that are also robust to sample contamination. To the best of our knowledge, related results in the literature either do not study asymptotic behavior of the credible sets, or the obtained results are suboptimal (Minsker et al., 2017). The notable exception is the recent work by Baraud et al. (2020), however, the so-called -Bayes posterior distribution introduced in that work is not robust to outliers of arbitrary nature, unlike the analogue defined below.
1.1 Proposed approach.
Let be an arbitrary fixed point in the relative interior of ; for simplicity, we will suppress the dependence on in the notation. Observe that the density of the posterior distribution is proportional to ; indeed, this is evident from equation 1 once the numerator and the denominator are divided by . The key idea is to replace the average by its robust proxy denoted and defined below, which gives rise to the robust posterior distribution
defined for all measurable sets .
While it is possible to work with the log-likelihood directly, it is often easier and more natural to deal with the increments . For instance, in the Gaussian regression model with , with likelihood and , which is more manageable than itself.
Note that the density of is maximized for . For instance, if the prior
is the uniform distribution over, then corresponds exactly to the robust risk minimization problem which, as we’ve previously mentioned, is hard due to non-convexity of the function . At the same time, sampling from is possible, making the “maximum a posteriori” (MAP) estimator as well as the credible sets associated with accessible. The robust risk estimator employed in this work is based on the ideas related to the median of means principle. The original MOM estimator was proposed by Nemirovsky and Yudin (1983) and later by Jerrum et al. (1986); Alon et al. (1999). Its versions and extensions were studied more recently by many researchers including Audibert et al. (2011); Lerasle and Oliveira (2011); Brownlees et al. (2015); Lugosi and Mendelson (2019b); Lecué and Lerasle (2020); Minsker (2020). Let be a positive integer and be disjoint subsets (“blocks”) of of equal cardinality , . For every , define the block average
which is the (increment of) empirical log-likelihood corresponding to the subsample indexed by . Next, let be a convex, even, strictly increasing smooth function with bounded fist derivative; for instance, a smooth (e.g. convolved with an infinitely differentiable kernel) version of the Huber’s loss is an example of such function. Furthermore, let be a non-decreasing sequence such that and . Finally, define
which is clearly a solution to the convex optimization problem. Robustness and non-asymptotic performance of can be quantified as follows. Let , and ; then for all such that for some absolute constant ,
with probability at least , where as . Put simply, under very mild assumptions on , admits sub-Gaussian deviations around , moreover, it can absorb the number of outliers that is of order . We refer the reader to Theorem 3.1 in Minsker (2018) for a uniform over version of this bound as well as more technical details. We end this section with two technical remarks.
The classical MOM estimator corresponds to the choice which is not smooth but is scale-invariant, in a sense that the resulting estimator does not depend on the choice of . While the latter property is often desirable, we conjecture that the posterior distribution based on such “classical” MOM estimator does not satisfy the Bernstein-von Mises theorem, and that smoothness of is not just a technical assumption. This, perhaps surprising, conjecture is so far only supported by our simulation results.
Let be i.i.d. with normal distribution . The key observation is that the posterior distributions are often multimodal and skewed, unlike the expected bell shape.
be i.i.d. with normal distribution, and the prior distribution for is . Furthermore, let . We sample from the robust posterior distribution for the values of and . The resulting plots are presented in Figure 1
. The key observation is that the posterior distributions are often multimodal and skewed, unlike the expected bell shape.
Let us mention that the posterior distribution is a valid probability measure, meaning that . Indeed, note that for all , hence
Therefore, a sufficient condition for being finite is a.s. This is guaranteed by the fact that for any , for in the support of .
2 Main results.
We are ready to state the main theoretical results for the robust posterior distribution . First, we will state them in a way that avoids technical assumptions which can be found in the latter part of the section. Recall that where , and let and . The following theorem characterizes robustness properties of the mode of the posterior defined as
Under the appropriate regularity conditions on the loss function
Under the appropriate regularity conditions on the loss function, prior and the family , with probability at least ,
as long as for some absolute constants . Here, is a function that converges to as .
In particular, stated inequality implies that as long as the number of blocks containing outliers is not too large, the effect of these outliers on the squared estimation error is limited, regardless of their nature and magnitude. While the fact that the mode of is a robust estimator of is encouraging, one has to address the ability of the method to quantify uncertainty to fully justify the title of the “posterior distribution.” This is exactly the content of the following result. Let ; it can be viewed as a mode of the posterior distribution corresponding to the uniform prior on .
Assume the outlier-free framework. Under appropriate regularity conditions on the prior and the family ,
The message of this result is that in the ideal, outlier-free scenario, the robust posterior distribution asymptotically behaves like the usual posterior distribution , and that the credible sets associated with it are asymptotically valid confidence intervals. Technical requirements include a condition on the growth of the number of blocks of data, namely for some defined below. The main novelty here is the first, “BvM part” of the theorem, while asymptotic normality of has been previously established by Minsker (2020).
We finish this section by listing the complete list of regularity conditions that are required for the stated results to hold. The norm refers to the standard Euclidean norm everywhere below.
The function is convex, even, and such that
for and for .
is bounded and Lipschitz continuous.
One example of such is the smoothed Huber’s loss: let
Moreover, set . Then , where denotes the convolution, satisfies assumption 1. Condition (iii) on the higher-order derivatives is technical in nature and can likely be avoided at least in some examples; in numerical simulations, we did not notice the difference between results based on the usual Huber’s loss and its smooth version. Next assumption is a standard requirement related to the local convexity of the loss function at its global minimum .
The Hessian exists and is strictly positive definite.
In particular, this assumption ensures that in a sufficiently small neighborhood of , for some constants . The following two conditions allow one to control the “complexity” of the class .
For every , the map is differentiable at for -almost all (where the exceptional set of measure can depend on ), with derivative . Moreover, , the envelope function of the class satisfies for some and a sufficiently small .
An immediate implication of this assumption is the fact that the function is locally Lipschitz. It other words, for any , there exists a ball of radius such that for all . In particular, this condition suffices to prove consistency of the estimator .
The following condition is related to the modulus of continuity of the empirical process indexed by the gradients . It is similar to the typical assumptions required for the asymptotic normality of the MLE, such as Theorem 5.23 in the book by Van der Vaart (2000). Define
The following relation holds:
Moreover, the number of blocks satisfies as .
Limitation on the growth of is needed to ensure that the bias of the estimator is of order , a fact that we rely on in the proof the Bernstein-von Mises theorem. Finally, we state a mild requirement imposed on the prior distribution.
The density of the prior distribution is positive and bounded on , and is continuous on the set for some positive constant .
Most commonly used families of distributions satisfy assumptions 2-4. For example, this is easy to check for the normal, Laplace or Poisson families in the location model where . Other examples covered by our assumptions include the linear regression with Gaussian or Laplace-distributed noise. The main work is usually required to verify assumption 4; it relies on the standard tools for the bounds on empirical processes for classes that are Lipschitz in parameter or have finite Vapnik-Chervonenkis dimension. Examples can be found in the books by Van der Vaart (2000) and Van Der Vaart et al. (1996).
3 Numerical examples and applications.
We will illustrate our theoretical findings by presenting numerical examples below. The loss function that we use is Huber’s loss defined before. While, strictly speaking, it does not satisfy the smoothness requirements, we found that it did not make a difference in our simulations. Algorithm for sampling from the posterior distributions was based on the “No-U-Turn sampler” variant of Hamiltonian Monte Carlo method (Hoffman and Gelman, 2014). Robust estimator of the log-likelihood are approximated via the gradient descent algorithm at every . Our first example demonstrates that using Huber’s loss in the framework of Example 1.
We consider two scenarios: in the first one, the data are i.i.d. copies of random variables. In the second scenario, data are generated in the same way except that randomly chosen observations are replaced with i.i.d. copies of distributed random variables. Results are presented in figures 3 and 3, where the usual posterior distribution is plotted as well for comparison purposes. The main takeaway from this simple example is that the proposed method behaves as expected: as long as the number of blocks is large enough, robust posterior distribution concentrates most of its “mass” near the true value of the unknown parameter, while the usual posterior distribution is negatively affected by the outliers. At the same time, in the outlier-free regime, both posteriors perform similarly.
In this example, we consider a non-synthetic dataset in the linear regression framework. The dataset in question was provided by Cortez et al. (2009) and describes the qualities of different samples of red and white wine. It contains “subjective” features such as fixed acidity, pH, alcohol, etc., and one “objective” feature, the scoring of wine quality; white wine samples are selected to perform the linear regression where the objective feature is the response and the subjective features are the regressors. It is assumed that the data is sampled from the model
where is the intercept, is the response, are the chosen regressors (see detailed variable names in Table 1) along with the corresponding coefficients , and is the random error with distribution. Here we remark that, for simplicity only out of “subjective” features are selected such that this model agrees with the OLS linear regression model generated by best subset selection with minimization of BIC.
In the second experiment, randomly chosen response variables are replaced with
randomly chosen response variables are replaced withrandom outliers. In both cases, the priors for are set to be for every , and the prior for is the uniform distribution on . The block size is set to be and the number of blocks is . The MAP estimates of ’s and , as well as the two end points of the credible intervals are reported in Table 1. These plots yet again demonstrate that the posterior , unlike its standard version, shows stable behavior when the input data are corrupted.
MAP estimates of the intercept, regression coefficients and the standard deviation, left and right end points of credible intervals in parentheses.
4.1 Proof of Theorem 1.
In view of assumption 2, whenever is sufficiently small. Hence, if we show that satisfies this requirement, we would only need to estimate . To this end, denote , and observe that
If , the last term above can be dropped without changing the inequality. On the other hand, is bounded and , , whence the last term is at most . Given , assumption 2 implies that there exists such that . Let be large enough so that , whence . It follows from Lemma 2 in Minsker (2020) (see also Theorem 3.1 in Minsker (2018)) that under the stated assumptions,
with probability at least as long as are large enough and is sufficiently small. Here, is a function that tends to as . This shows consistency of . Next, we will provide the required explicit upper bound on . As we’ve demonstrated above, it suffices to find an upper bound for . We will apply the result of Theorem 2.1 in Mathieu and Minsker (2021) to deduce that for large enough, with probability at least . To see this, it suffices to notice that in view of Lemma 6 in the supplementary material, where may depend on and the class . We conclude by remarking that the term can be improved under additional assumptions, for instance by using Theorem 2.3 in Mathieu and Minsker (2021) instead of Theorem 2.1. However, we avoid these technicalities as our main goal is to control the dependence of the error on the number of outliers.
4.2 Proof of Theorem 2 (sketch).
The complete proof is rather long and technical. Here, we will outline several reduction steps that are needed to transform the problem into an easier one while the missing details are included in the supplementary material. First, in view of the well-known property of the total variation distance, .
Next, let us introduce the new variable , multiply the numerator and the denominator on the posterior by , and set
and . The total variation distance can then be equivalently written as . The function can be viewed a pdf of a new probability measure . Thus it suffices to show that
Since is the unique minimizer of , . Next, define ; it is twice differentiable since both and are. It is shown in the proof of Lemma 4 in Minsker (2020) that with high probability. Therefore, a unique mapping exists around the neighborhood of and so do and . Denote . The following result, proven in the supplementary material, essentially establishes stochastic differentiability of at .
The following relation holds:
In view of the lemma, the total variation distance between the normal laws and converges to in probability. Hence one only needs to show that . Let
and observe that as long as one can establish that
we will be able to conclude that
so that . This further implies that
and the desired result would follow. Therefore, it suffices to establish that relation (6) holds. Moreover, since outside of a compact set , it is equivalent to showing that
where . Note that
Let us divide into 3 regions: , and where is a sufficiently small positive number and is a sufficiently large so that contains . Finally, is chosen such that , and that satisfies an additional growth condition specified in Lemma 8 in the supplement. The remainder of the proof is technical and is devoted to proving that each part of the integral (7) corresponding to converges to . Details are presented in the supplementary material.
The space complexity of approximating the frequency moments. Journal of Computer and system sciences 58 (1), pp. 137–147. Cited by: §1.1, §1.
- Robust linear least squares regression. The Annals of Statistics 39 (5), pp. 2766–2794. Cited by: §1.1.
- Robust bayes-like estimation: rho-bayes estimation. Annals of Statistics 48 (6), pp. 3699–3720. Cited by: §1.
Robust Bayesian bounds for outlier detection. Recent Advances in Statistics and Probability, pp. 175–190. Cited by: §1.
- Empirical risk minimization for heavy-tailed losses. Annals of Statistics 43 (6), pp. 2507–2536. Cited by: §1.1.
- Modeling wine preferences by data mining from physicochemical properties. Decision support systems 47 (4), pp. 547–553. Cited by: Example 3.
- Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911. Cited by: §1.
- Consistent and robust Bayes procedures for location based on partial information. The Annals of Statistics, pp. 443–453. Cited by: §1.
- Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, pp. 265–283. Cited by: §1.
- The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo.. J. Mach. Learn. Res. 15 (1), pp. 1593–1623. Cited by: §3.
- Random generation of combinatorial structures from a uniform distribution. Theoretical computer science 43, pp. 169–188. Cited by: §1.1, §1.
- Robust machine learning by median-of-means: theory and practice. Annals of Statistics 48 (2), pp. 906–931. Cited by: §1.1, §1.
- Robust empirical mean estimators. arXiv preprint arXiv:1112.3914. Cited by: