1 Introduction
Rank aggregation refers to the task of recovering the total order over a set of items, given a collection of pairwise/partial/full preferences over the items [23]. Compared to rating items, the preference is a more natural expression of user opinions which can provide more consistent results [31]. Therefore, rank aggregation is a practical and useful approach to summarize user preferences [11]. Furthermore, the preferences could arise not only by explicitly querying users, but also through passive data collection, i.e., by observing user purchasing behavior [2], clicks on search engine results [12], etc. The flexible collection of preferences enables successful application of rank aggregation in various fields, from image rating [22] to peer grading [31], and bioinformatics [19].
A basic assumption underlying most rank aggregation models is that all preferences are provided by homogeneous users, sharing the same annotation accuracy and agreeing with the single ground truth ranking [12, 21]. However, the above homogeneous data assumption is rarely satisfied due to the flexible data construction and the complex real situation [14, 20, 26]. Therefore, rank aggregation methods usually suffer from model misspecification, namely the inconsistency between the collected ranking data and the homogeneous data assumption [29].
To alleviate this inconsistency, existing methods usually augment the ranking model to account for additional perturbation in the observed preferences [15]. In particular, user reliability is considered in [10, 15], which study rank aggregation in a crowdsourcing environment for pairwise/trinary preferences. [31] introduced a general framework to aggregate ordinal peer gradings from users while considering user reliability. However, each user usually provides only one preference in real applications, which causes overfitting because the reliability must then be estimated w.r.t. each preference [32]. In fact, these previous attempts simply amount to convolving the original ranking model with a preassumed corruption mechanism. This leads to a new model that has a few more parameters but is just as bound to be misspecified w.r.t. agnostic noise in the real world.
In this paper, we present a novel rank aggregation approach, called CoarsenRank. The main idea of CoarsenRank is to perform rank aggregation over a neighborhood of the noisy ranking data, which enables CoarsenRank to be robust against model misspecification [35, 9]. However, it is intractable to perform inference directly over the neighborhood of the noisy ranking data because of the unlimited number of samples involved. The particularity of ranking data also rules out the gradient-based solutions adopted for distributional robustness in the optimization community [5, 13]. Considering that no generalization test is required for rank aggregation, the relative entropy is adopted as the divergence metric, which gives tractability with an accuracy guarantee [3, 27]. We further introduce a prior distribution for the unknown size of the neighborhood to avoid parameter tuning. Then, a computationally efficient formula is derived for CoarsenRank. Unfortunately, direct posterior inference is still inefficient due to the rank structure. A data augmentation trick is therefore introduced to ensure a closed-form solution for the approximated posterior.
More precisely, we make the following key contributions:

We introduce a novel robust rank aggregation method called CoarsenRank. To the best of our knowledge, CoarsenRank is the first rank aggregation method against model misspecification and enjoys distributional robustness.

We obtain a computationally efficient formula for CoarsenRank, which introduces only one extra hyperparameter to vanilla ranking models. Further, we derive a tractable closed-form solution and introduce a data-driven criterion for choosing the hyperparameter.

We successfully apply CoarsenRank to four real-world datasets. Empirical results demonstrate the superior reliability and efficiency of CoarsenRank over other baselines.
The remainder of this paper is organized as follows. Section 2 discusses the noisy rank aggregation setting and introduces the main idea of CoarsenRank. In Sect. 3, we illustrate how CoarsenRank enables us to perform robust rank aggregation against model misspecification and obtain a simple formula for CoarsenRank. Section 4 presents a closed-form EM algorithm for CoarsenRank and discusses the strategy for hyperparameter selection. Section 5 demonstrates the efficacy of CoarsenRank through empirical results on four real-world datasets. Section 6 concludes the paper and outlines future work.
2 Rank aggregation under model misspecification
In this paper, we focus on noisy rank aggregation. The term “noisy” here refers to the general situation in which the observed preferences are inconsistent with the assumptions of the rank aggregation model.
2.1 Problem statement
Considering an item set , the observed dataset denotes a collection of preferences over subsets of , i.e., , and . Let denote the latent consistent counterpart of the collected preferences . The notation denotes item is preferred over item . The goal of rank aggregation is to aggregate the collected preferences into a total order over all items in .
In this paper, we focus on the probability ranking model. In particular, it assumes a generative model for the collected preferences . However, in practice, the real data generation model may not be perfectly consistent with the adopted generative model . The inconsistency can arise when preferences are not collected from a homogeneous user community, because the single total order assumption is then no longer satisfied. This is the noisy rank aggregation problem: we want to aggregate the preferences in a way that is robust to most potential uncertainties caused by the adopted misspecified ranking model .
2.2 Previous attempts: Correction at the sample level
When encountering inconsistent preferences, previous approaches usually augment the ranking model to account for additional error/noise/uncertainty at the sample level (see the upper panel of Fig. 1). There are essentially two ways of doing this:

One intuitive approach is to correct each noisy preference by preassuming some corruption distribution, i.e., . However, this simply amounts to convolving the original model distribution with the chosen corruption distribution, leading to a new model that has a few more parameters but is just as bound to be misspecified w.r.t. other unlisted perturbations.

The second approach would be to model the joint distribution , which needs to allow for most potential perturbations. Essentially, it needs to be a nonparametric model for , which makes it computationally expensive.
Since the corruption patterns underlying the noisy preferences vary from setting to setting, it is impossible to design a general practice that suits most settings. Therefore, in this paper, we aggregate the noisy preferences from another perspective.
2.3 CoarsenRank: Rectifying the underlying data distribution
Note that in many situations, it is impractical to correct the model, and these are the situations our method is intended to address. We are concerned with a broader class of noisy rank aggregation problems in general, not just one particular kind of perturbation considered in previous methods.
Motivated by the recent advances in robust Bayesian inference [25, 35], we assume that the observed noisy preferences lie in a neighborhood of their latent consistent counterpart , i.e., . denotes some divergence measure between two datasets. Then, rank aggregation over the coarsened ranking “zone” would be robust to a distributional perturbation from the adopted ranking model (see the lower panel of Fig. 1). We refer to our model as coarsened rank aggregation, or CoarsenRank for short. Compared with the previous methods, CoarsenRank is robust to most potential perturbations while retaining high computational efficiency.
3 Coarsened rank aggregation
In this section, we illustrate how CoarsenRank enables us to perform rank aggregation in a way that is robust to model misspecification. In particular, CoarsenRank needs to perform rank aggregation over a coarsened ranking “zone” , instead of over the original ranking data directly. Fig. 2 illustrates the pipeline by which we derive our CoarsenRank model.
3.1 Approximate posterior conditioning on
According to our motivation above, we first define the empirical distribution for the preferences and , namely and , where is the Dirac delta function [36]. Assuming that the empirical distribution converges to the corresponding data generating distribution, namely and when , we come to Theorem 1, our approximate posterior.
Theorem 1
(Lemma S3.1 in [25]) Suppose is an almost surely (a.s.) consistent estimator of , namely , where and when . Here, an event happens almost surely if it happens with probability one. Assume and , then we have
(1) 
for any such that .
Theorem 1 is a general conclusion in robust Bayesian inference [25]. It justifies our motivation to pursue robustness in a distributional sense. In what follows, we extend Theorem 1 to some variants which possess nice properties for robust rank aggregation.
Level of distributional robustness
The value of parameter is usually difficult to set without sufficient prior knowledge. We treat it as a random variable and introduce a prior on it. Then, we have the following conclusion.
Corollary 1
Assume all conditions for Theorem 1 are satisfied and , the approximate posterior can be further simplified, namely
(2) 
when the random variable follows an exponential prior.
Indeed, a very large class of distributions can be adopted as prior . A case of particular interest arises when , since it leads to a computationally simple formula via maintaining exponential formulation. The efficacy of the exponential prior is verified in our experiment (See Sect. 5).
Inspired by the exponential formulation of the posterior derived in Eq. (2), we give the following derivation (Eq. (3)) to explain why the standard posterior lacks robustness.
(3)  
where holds because the empirical data distribution . indicates Monte Carlo approximation. holds due to the incorporation of the entropy term. The standard posterior (Eq. (3)) diverges to infinity as , while the approximate posterior (Eq. (2)) remains stable.
Types of distributional robustness and tractability
The choice of in Eq. (1) affects both the richness of the robustness types we wish to cover and the tractability of the resulting optimization problem. The Wasserstein metric is a popular option in previous approaches to distributional robustness [5, 13, 35], which exhibits strong tolerance to adversarially corrupted outliers [8, 9] and also allows robustness to unseen data [1, 33]. Meanwhile, [3, 27] adopted f-divergences in pursuit of tractable optimization approaches. The rank aggregation task has two particularities: 1) no generalization test is required; 2) the ranking model has high complexity. We therefore consider the Kullback–Leibler (KL) divergence, since it exhibits robustness to most types of perturbations and allows inference following standard algorithms with no additional computational burden.
Corollary 2
If is an almost surely (a.s.) consistent estimator of , and is subject to an exponential prior, i.e., , then we obtain the following approximation to the posterior:
(4) 
where denotes the distribution on the left is approximately equal to the distribution proportional to the expression on the right, and .
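As an illustration of Corollary 2, the sketch below tempers a log-likelihood by a single exponent. It assumes the power-likelihood form of [25], in which an exponential prior with rate `alpha` on the neighborhood size yields the exponent `zeta = alpha / (alpha + n)` for `n` observations. The names `coarsened_log_posterior`, `alpha` and `zeta` are chosen here for illustration and are not taken from the original text.

```python
import numpy as np


def coarsened_log_posterior(log_prior, log_liks, alpha):
    """Unnormalized log-posterior with a tempered (power) likelihood.

    Assumption: the exponent is zeta = alpha / (alpha + n), the form used
    in the coarsening framework of [25]. log_liks holds one log-likelihood
    value per observed preference.
    """
    log_liks = np.asarray(log_liks, dtype=float)
    n = log_liks.size
    zeta = alpha / (alpha + n)
    return log_prior + zeta * float(log_liks.sum()), zeta


# Ten observations, each with log-likelihood -1, and alpha = 10.
value, zeta = coarsened_log_posterior(0.0, [-1.0] * 10, alpha=10.0)
print(zeta, value)
```

Under this assumption, `zeta` approaches one as `alpha` grows, so the expression reduces to the standard log-posterior, while a small `alpha` down-weights the data relative to the prior.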
3.2 Coarsened probability ranking model
Here we instantiate with the Plackett-Luce (PL) model [30, 24], which is one of the most popular and frequently applied parametric distributions for analyzing rankings of a finite set of items. Note that our technique is not limited to the PL model and is also applicable to other probability ranking models. For a ranking list , the Plackett-Luce model assumes
(5) 
where is the positive support parameters associated with each item.
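For concreteness, a minimal sketch of the Plackett-Luce log-likelihood of a single ranking list is given below. It assumes the standard sequential-choice form of the model, in which the item at each position is chosen with probability proportional to its support parameter among the items not yet ranked. Items are encoded as integer indices into a score vector `theta`; the function name and this encoding are illustrative assumptions.

```python
import numpy as np


def pl_log_likelihood(ranking, theta):
    """Log-probability of one ranking (best item first) under Plackett-Luce.

    ranking: sequence of distinct item indices, most preferred first.
    theta:   positive support parameters, one per item.
    """
    scores = np.asarray(theta, dtype=float)[np.asarray(ranking)]
    # tail_sums[j] = total score of the items still available at stage j.
    tail_sums = np.cumsum(scores[::-1])[::-1]
    return float(np.sum(np.log(scores) - np.log(tail_sums)))


# With equal scores, every ordering of three items has probability 1/6.
print(np.exp(pl_log_likelihood([0, 1, 2], [1.0, 1.0, 1.0])))
```

The last stage always contributes a factor of one, so a partial ranking over a subset of items is handled by passing only the ranked items.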
According to Corollary 2, a simple formulation of our CoarsenRank can be represented as follows:
(6) 
where . denotes the length of each preference. denotes the observed preferences, while represents the latent preferences, which are consistent with the adopted ranking model .
Remark 1 (Connection between CoarsenRank and the standard posterior)
Since , we have denoting the expected discrepancy of the observed preferences w.r.t. . Further, approaches zero as , which means the misspecification no longer exists. Meanwhile, the robust posterior Eq. (6) degenerates to the standard posterior as approaches when .
Remark 2 (Optimization intractability and our strategy)
The main inferential issue related to Eq. (6) concerns the presence of the normalization terms , which do not permit direct maximization of the posterior. Further, the nonnegativity constraint on the model parameters rules out the direct application of gradient-based optimization approaches.
Motivated by [7], we introduce the data augmentation trick to address the above difficulty. Considering the fact that the Gumbel distribution is employed as a distribution of the support parameters and the conjugacy of the Gamma density with the Gumbel distribution, we follow [7] and introduce an auxiliary Gamma random variable for each normalization term, which leads to a joint distribution without suffering from the annoying normalization terms. Therefore, a simple and effective solution is derived accordingly for the approximated posterior (Eq. (6)).
4 Efficient Bayesian inference
According to our discussion in Remark 2, we resort to the data augmentation trick to eliminate the annoying normalization terms peculiar to the ranking model Eq.(5), which helps to deduce an efficient inference method for Eq. (6).
4.1 Data augmentation
First, we reformulate Eq. (6) as follows:
(7) 
where . Regarding each normalization term in Eq. (7), we introduce an auxiliary variable , and . According to Remark 2, let each be subject to a Gamma distribution, namely . Here is the gamma function evaluated at . More specifically, we define the posterior distribution of as follows,
(8) 
Now, we can deal with the joint distribution directly, which leads to significant simplifications for optimization. Further, we utilize a Gamma prior to instantiate the prior distribution , which naturally satisfies the nonnegativity constraint on , i.e., . is initialized to in this paper, . Therefore, the full likelihood of our CoarsenRank model can be further formulated as follows,
(9) 
where . denotes the observed preferences. denotes the introduced auxiliary variables. is the discrepancy between the observed preferences and their latent consistent counterpart , measured in Kullback–Leibler divergence. Note that denotes the length of each preference, which is usually not fixed in real-world applications.
4.2 EM algorithm with closed-form updating rules
Given the presence of the introduced auxiliary variables , we resort to the Expectation-Maximization (EM) framework, which is a standard tool for computing the maximum-likelihood solution or the maximum a posteriori estimate in the presence of latent variables.
Expectation step (E-step)
In the expectation step, we calculate the expectation of each auxiliary variable w.r.t. its posterior distribution :
(10) 
where and . Then, the expectation of the complete-data log-likelihood function w.r.t. the posterior of the introduced auxiliary variables can be represented as follows:
(11) 
Maximization step (M-step)
In the maximization step, we maximize the objective function Eq. (11) by setting its gradient w.r.t. to zero and obtain the following estimates for :
(12) 
Calibration for real application (C-step)
In real applications, the number of items involved in partial comparisons usually varies significantly. Some items may appear frequently in the ranking lists due to their popularity, while others appear only a few times because they are more specialized. In such cases, the final ranking may not be unique, or the algorithm may even fail to converge. To ensure a unique solution and to avoid overfitting, regularization may be used. To ensure the nonnegativity of the parameter , we perform normalization over . Namely, where is the tunable regularization parameter depending on the number of items. We fixed in our experiment for simplicity.
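The E-step and M-step above can be sketched as follows. Because the update formulas did not survive in this text, the sketch is an assumption-laden reconstruction rather than the authors' exact rule: it follows the Gamma data augmentation of [7] for the Plackett-Luce model, tempers every likelihood factor by a hypothetical exponent `zeta`, and uses a Gamma prior with hypothetical shape `a` and rate `b`. The calibration step is not included.

```python
import numpy as np


def coarsened_pl_em(rankings, n_items, zeta=1.0, a=2.0, b=1.0,
                    n_iter=500, tol=1e-10):
    """MAP estimate of PL scores under a tempered likelihood (sketch).

    Assumes a > 1 and b > 0 so that every update stays positive.
    rankings: lists of distinct item indices, most preferred first.
    """
    rankings = [np.asarray(r) for r in rankings if len(r) >= 2]
    # wins[k]: number of stages at which item k is the chosen item
    # (every position except the last one in a ranking).
    wins = np.zeros(n_items)
    for r in rankings:
        wins[r[:-1]] += 1.0
    theta = np.ones(n_items)
    for _ in range(n_iter):
        rate = np.full(n_items, float(b))
        for r in rankings:
            # Sum of scores over the items still available at each stage.
            tail_sums = np.cumsum(theta[r][::-1])[::-1][:-1]
            # E-step: mean of a Gamma(zeta, tail_sum) auxiliary variable.
            expected_z = zeta / tail_sums
            # An item at position m belongs to the choice sets of stages 0..m.
            seen = np.cumsum(expected_z)
            rate[r[:-1]] += seen
            rate[r[-1]] += seen[-1]
        # M-step: closed-form mode of the Gamma-shaped objective.
        new_theta = (a - 1.0 + zeta * wins) / rate
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


data = [[0, 1, 2]] * 20 + [[1, 0, 2]] * 5
print(coarsened_pl_em(data, n_items=3, zeta=0.5))
```

In this sketch, setting `zeta=1.0` recovers the EM updates for the vanilla model with a Gamma prior, and `zeta=0.0` returns the prior mode, which makes the role of the single extra hyperparameter easy to inspect.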
4.3 A data-driven strategy for choosing 
If we have no prior basis for choosing the parameter in Eq. (2), then the following diagnostic curve can help to make a data-driven choice. Let be a measure of fit to the data and be a measure of model complexity. Following [34], we use the posterior expected log-likelihood for , and the difference between the log-likelihood evaluated at the posterior mean of the parameters and the posterior expected log-likelihood for . Specifically, we define and , where is an approximate posterior distribution for and is the posterior expectation of . Therefore, the adopted Deviance Information Criterion (DIC) can be represented as [34]. As ranges from to , DIC traces out a curve in , and the technique is to choose with the lowest DIC or where DIC levels off.
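A Monte Carlo version of this criterion can be sketched as below. The sketch assumes the usual DIC of [34], written with the two quantities described above as minus twice the fit plus twice the complexity; the function name, the argument names and the toy log-likelihood are illustrative assumptions.

```python
import numpy as np


def dic_from_samples(theta_samples, log_lik):
    """Estimate fit, complexity and DIC from posterior samples.

    theta_samples: array of shape (n_samples, n_params).
    log_lik:       function mapping one parameter vector to the data
                   log-likelihood.
    """
    theta_samples = np.asarray(theta_samples, dtype=float)
    log_liks = np.array([log_lik(t) for t in theta_samples])
    # Fit: posterior expected log-likelihood.
    fit = float(log_liks.mean())
    # Complexity: log-likelihood at the posterior mean minus the fit.
    complexity = float(log_lik(theta_samples.mean(axis=0))) - fit
    dic = -2.0 * fit + 2.0 * complexity
    return dic, fit, complexity


# Toy example: Gaussian-shaped log-likelihood centered at one.
toy_log_lik = lambda t: -0.5 * float(np.sum((t - 1.0) ** 2))
print(dic_from_samples([[0.0], [2.0]], toy_log_lik))
```

Repeating this computation over a grid of hyperparameter values yields the diagnostic curve, from which the value with the lowest DIC, or the point where the curve levels off, can be read.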
5 Experimental evaluation
In this section, we verify the efficacy of the proposed CoarsenRank on noisy rank aggregation against state-of-the-art approaches. The experiments are carried out on four real-world noisy ranking datasets.
5.1 Experimental setting
Regarding the performance metric, we consider Kendall’s distance [17], which is one of the most common measures of similarity between rankings. counts the pairwise agreements between items from two rankings and . denotes the total number of pairs. ranges from (worst) to (best).
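A pairwise-agreement measure of this kind can be sketched as follows. Since the exact normalization is not recoverable from this text, the sketch assumes the classical rank correlation coefficient of [17], namely the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs, so that the value is 1 for identical rankings and -1 for reversed ones. The function name is an illustrative choice.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    Each ranking lists distinct items, most preferred first. Items missing
    from either ranking are ignored.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    common = [item for item in rank_a if item in pos_b]
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # Positive product: both rankings order the pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("need at least two common items")
    return (concordant - discordant) / total


print(kendall_tau([0, 1, 2], [0, 2, 1]))
```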
As for baselines, we first consider the vanilla Plackett-Luce model [30, 24]. For the sake of fair comparison, we optimize it with two optimization approaches, i.e., gradient descent (PL) [6] and EM using data augmentation (PLEM) [7]. Then, we compare the results with PeerGrader [31], which is a variation of the Plackett-Luce model for partial preferences that incorporates a user reliability estimation module. We also compare with the popular noisy ranking model CrowdBT [10]. Since CrowdBT was originally designed for pairwise preferences, we generalize CrowdBT to partial preferences following rank-breaking [28]. Namely, we first break each partial preference into a set of pairwise comparisons and then apply CrowdBT to each pairwise comparison independently.
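The rank-breaking step can be illustrated with a short sketch. It assumes full breaking, in which every pair of items in a partial preference is turned into one pairwise comparison; the text does not state which breaking scheme was used, so this choice and the function name are assumptions.

```python
from itertools import combinations


def full_rank_breaking(ranking):
    """Break one ranking (best first) into (winner, loser) pairs."""
    # combinations preserves input order, so the first element of each
    # pair is always ranked above the second.
    return list(combinations(ranking, 2))


print(full_rank_breaking([3, 1, 2]))
```

A preference over k items produces k(k-1)/2 pairs under this scheme, which grows quickly for long preferences.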
As with CoarsenRank, we calibrate the baselines to avoid overfitting and guarantee a unique solution. For PLEM, we adopt the same calibration method as for CoarsenRank. The formulations of the other baselines differ slightly from ours, so our calibration method cannot be applied to them. Following CrowdBT, we use virtual node regularization [10]. Specifically, it augments the original dataset with , which consists of the pairwise comparisons between all items and a virtual item , namely .
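The augmentation can be sketched as below, under the assumption that each real item both wins once and loses once against the virtual item, which is how we read the regularization of [10]. The function name and the label used for the virtual item are hypothetical.

```python
def add_virtual_item(pairs, items, virtual="virtual"):
    """Return the (winner, loser) pairs augmented with a virtual item.

    Assumption: every real item wins once and loses once against the
    virtual item, so no item ends up with only wins or only losses.
    """
    extra = []
    for item in items:
        extra.append((item, virtual))
        extra.append((virtual, item))
    return list(pairs) + extra


print(add_virtual_item([(0, 1)], items=[0, 1], virtual=-1))
```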
We conduct our experiments on four real-world datasets introduced in previous research, whose statistics are summarized in Table 1. The Readlevel dataset [10] contains English text excerpts whose reading difficulty level is annotated by workers from a crowdsourcing platform. The total order over all excerpts is provided by the domain expert. The SUSHI dataset, introduced in [16], consists of customers’ preferences over types of sushi. Following [18], we generate the total order using the vanilla PL over the entire set of preferences. The BabyFace dataset [15] consists of the evaluations of workers from Amazon Mechanical Turk on images of children’s facial micro-expressions from happy to angry. A total order over all micro-expressions is provided as ground truth by the agreement of most workers after the experiment. The PeerGrading dataset [32] consists of assessments, i.e., self-grading and peer grading, from students over group submissions. We created the ordinal gradings by merging the self-grading and peer grading of the same assignment provided by each student. The TA gradings (following a linear order) provided by six teaching assistants over all submissions are considered as ground truth.
5.2 Deviance Information Criterion (DIC) for choosing the hyperparameter
Following Sect. 4.3, we adopt the DIC to choose the hyperparameter for the different datasets. Since it is intractable to calculate the posterior expectation in DIC analytically, we implement a Gibbs sampling procedure (see the appendix for details). In particular, the posterior distribution of follows a Gamma distribution according to our definition (Eq. (8)). The full conditional distributions of can be easily derived according to Eq. (12). Therefore, we can collect samples from and calculate the Monte Carlo estimate of DIC for different . The number of samples is set to in our experiment. The diagnostic curves of on the four datasets are plotted in Fig. 3(a)-(d), respectively.
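A Gibbs sampler of this type can be sketched as follows. The full conditionals are not shown in this text, so the sketch relies on assumptions: it follows the Gamma data augmentation of [7], tempers each likelihood factor by a hypothetical exponent `zeta`, and uses a Gamma prior with hypothetical shape `a` and rate `b`. Under these assumptions, the auxiliary variables are conditionally Gamma with shape `zeta`, and the scores are conditionally Gamma as well.

```python
import numpy as np


def gibbs_coarsened_pl(rankings, n_items, zeta=1.0, a=2.0, b=1.0,
                       n_samples=500, seed=0):
    """Draw posterior samples of PL scores by Gibbs sampling (sketch).

    rankings: lists of distinct item indices, most preferred first.
    Returns an array of shape (n_samples, n_items).
    """
    rng = np.random.default_rng(seed)
    rankings = [np.asarray(r) for r in rankings if len(r) >= 2]
    # wins[k]: number of stages at which item k is the chosen item.
    wins = np.zeros(n_items)
    for r in rankings:
        wins[r[:-1]] += 1.0
    theta = np.ones(n_items)
    samples = np.empty((n_samples, n_items))
    for t in range(n_samples):
        rate = np.full(n_items, float(b))
        for r in rankings:
            tail_sums = np.cumsum(theta[r][::-1])[::-1][:-1]
            # Auxiliary variables: Gamma(shape=zeta, rate=tail_sum).
            z = rng.gamma(shape=zeta, scale=1.0 / tail_sums)
            seen = np.cumsum(z)
            rate[r[:-1]] += seen
            rate[r[-1]] += seen[-1]
        # Scores: Gamma(shape=a + zeta * wins, rate=rate), elementwise.
        theta = rng.gamma(shape=a + zeta * wins, scale=1.0 / rate)
        samples[t] = theta
    return samples


draws = gibbs_coarsened_pl([[0, 1, 2]] * 30, n_items=3, n_samples=300)
print(draws[100:].mean(axis=0))
```

The returned samples can be passed to a Monte Carlo estimate of the DIC after discarding an initial burn-in portion.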
The results show that the DIC decreases dramatically at first when is small; the curve then reaches a cusp and levels off, with more modest increases/decreases in fit as becomes larger. In our experiment, is chosen at the point with the lowest DIC or where the DIC levels off, marked as “ ” in each figure.
5.3 The efficacy of CoarsenRank in four real applications
Fig. 3(e) shows the performance improvement of all methods over PLEM on the four datasets. It can be observed that: (1) CoarsenRank achieves consistent improvement over the other rank aggregation baselines. This demonstrates the great potential of CoarsenRank in real applications, where model misspecification widely exists. (2) The accuracy of PL is comparable with that of PLEM on all datasets, which rules out the possibility that the EM algorithm itself leads to the performance improvement. (3) CrowdBT and PeerGrader achieve superior performance on BabyFace because of the sufficient annotations (over ) from each user and the trinary preference setting in BabyFace. (4) The performance of CrowdBT and PeerGrader varies significantly across datasets. The reason is that their preassumed corruption patterns may not be consistent with the unknown noise in each dataset. (5) Only marginal improvement is achieved by CrowdBT and PeerGrader on Readlevel, SUSHI and PeerGrading, where each user provides almost one preference. Points & are the model misspecification cases that CoarsenRank is intended to address.
Fig. 3(f) shows the computation cost of all methods on the four real datasets. (1) CoarsenRank achieves a much lower computation cost than the other robust rank aggregation baselines. This shows that CoarsenRank is promising for deployment in large-scale environments, where both reliability and efficiency are required. (2) The computation costs of CoarsenRank and PLEM are comparable because the only difference between CoarsenRank and PLEM lies in the choice of the parameter (see Eq. (12)). (3) PeerGrader suffers from significant inefficiency since it needs to optimize its parameters alternately. (4) CrowdBT replaces the inefficient alternating optimization with online Bayesian moment matching and achieves a lower computation cost than PeerGrader. However, it is still inefficient on the SUSHI dataset because of the lack of an efficient rank-breaking method for long preferences.
6 Conclusion
CoarsenRank performs imprecise inference conditioned on a neighborhood of the noisy preferences, which opens a new door to robust rank aggregation against model misspecification. Experiments on four real applications demonstrate that imprecise inference on the neighborhood of the noisy preferences, instead of on the original dataset, can improve model reliability. This shows that CoarsenRank has great potential in real applications, e.g., social choice, information retrieval and recommender systems, where errors/noises/uncertainties widely exist. A promising direction is to explore other divergence metrics that target other statistical properties of rank aggregation.
Acknowledgements
IWT was supported by ARC FT130100746, DP180100106 and LP150100671. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study. WC is supported by the National Natural Science Foundation of China (Nos. 61603338 and 61703370).
References

[1] Soroosh Shafieezadeh Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576–1584, 2015.
[2] Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems, pages 119–126. ACM, 2010.
[3] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[4] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
[5] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust wasserstein profile inference and applications to machine learning. arXiv preprint arXiv:1610.05627, 2016.
[6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[7] Francois Caron and Arnaud Doucet. Efficient bayesian inference for generalized Bradley–Terry models. Journal of Computational and Graphical Statistics, 21(1):174–196, 2012.
[8] Ruidi Chen and Ioannis Paschalidis. Outlier detection using robust optimization with uncertainty sets constructed from risk measures. ACM SIGMETRICS Performance Evaluation Review, 45(3):174–179, 2018.
[9] Ruidi Chen and Ioannis Ch Paschalidis. A robust learning approach for regression models based on distributionally robust optimization. The Journal of Machine Learning Research, 19(1):517–564, 2018.
[10] Xi Chen, Paul N Bennett, Kevyn Collins-Thompson, and Eric Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 193–202. ACM, 2013.
[11] Jean C de Borda. Mémoire sur les élections au scrutin. 1781.
[12] Cynthia Dwork, Ravi Kumar, Moni Naor, and Dandapani Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th international conference on World Wide Web, pages 613–622. ACM, 2001.
[13] Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.
[14] Isobel Claire Gormley and Thomas Brendan Murphy. Exploring heterogeneity in irish voting data: A mixture modelling approach. Technical report, Technical Report 05/09, Department of Statistics, Trinity College Dublin, 2005.
[15] Bo Han, Yuangang Pan, and Ivor W Tsang. Robust plackett–luce model for k-ary crowdsourced preferences. Machine Learning, 107(4):675–702, 2018.
[16] Toshihiro Kamishima. Nantonac collaborative filtering: recommendation based on order responses. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 583–588. ACM, 2003.
[17] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[18] Ashish Khetan and Sewoong Oh. Data-driven rank breaking for efficient rank aggregation. The Journal of Machine Learning Research, 17(1):6668–6721, 2016.
[19] Minji Kim, Farzad Farnoud, and Olgica Milenkovic. Hydra: gene prioritization via hybrid distance-score rank aggregation. Bioinformatics, 31(7):1034–1043, 2014.
[20] Raivo Kolde, Sven Laur, Priit Adler, and Jaak Vilo. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics, 28(4):573–580, 2012.
[21] Xue Li, Xinlei Wang, and Guanghua Xiao. A comparative study of rank aggregation methods for partial and top ranked lists in genomic applications. Briefings in bioinformatics, 20(1):178–189, 2017.
[22] Lucy Liang and Kristen Grauman. Beyond comparing image pairs: Setwise active learning for relative attributes. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 208–215, 2014.
[23] Shili Lin. Rank aggregation methods. Wiley Interdisciplinary Reviews: Computational Statistics, 2(5):555–570, 2010.
[24] RD Luce. Individual choice theory: A theoretical analysis, 1959.
[25] Jeffrey W Miller and David B Dunson. Robust bayesian inference via coarsening. Journal of the American Statistical Association, pages 1–13, 2018.
[26] Cristina Mollica and Luca Tardella. Bayesian plackett–luce mixture models for partially ranked data. psychometrika, 82(2):442–458, 2017.
[27] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
[28] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266–287, 2016.
[29] Yuangang Pan, Bo Han, and Ivor W. Tsang. Stagewise learning for noisy k-ary preferences. Machine Learning, 107(8-10):1333–1361, 2018.
[30] Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
[31] Karthik Raman and Thorsten Joachims. Methods for ordinal peer grading. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1037–1046. ACM, 2014.
[32] Mehdi SM Sajjadi, Morteza Alamgir, and Ulrike von Luxburg. Peer grading in a course on algorithms and data structures: Machine learning algorithms do not improve over simple baselines. In Proceedings of the third (2016) ACM conference on Learning@ Scale, pages 369–378. ACM, 2016.
[33] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
[34] David J Spiegelhalter, Nicola G Best, and Bradley P Carlin. Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Technical report, 1998.
[35] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pages 5334–5344, 2018.
[36] Eric W Weisstein. Delta function. delta, 29:30, 2004.
Appendix A Detailed Proof for Theorem 1, Corollary 1 and Corollary 2
Theorem 1
(Lemma S3.1 in [25]) Suppose is an a.s.-consistent estimator of , namely , where and when . Here, an event happens almost surely (a.s.) if it happens with probability one. Assume and , then we have
for any such that .
Proof
Since , we have , where denotes the indicator function, which returns one when is true and zero otherwise. Then we have , which holds since . According to the dominated convergence theorem [4], we have .
Similarly, we have . Since , , . Therefore, we have , according to the dominated convergence theorem and . Above all, we have
Corollary 1
Assume all conditions for Theorem 1 are satisfied and , the approximate posterior can be further simplified, namely
when the random variable follows an exponential prior.
Proof
Note that since , we have
where the second equation holds because the exponential prior is independent of . Substituting the in Eq. (2) with finishes the proof.
Before giving the proof for Corollary 2, we first introduce Lemma 1 which contains some preliminary results in previous research [25].
Lemma 1
Let . Let . We argue that if i.i.d. and for , then for near ,
where .
Corollary 2
Suppose is an a.s.-consistent estimator of , and follows an exponential prior, i.e., . We obtain the following approximation to the posterior:
where denotes the distribution on the left is approximately equal to the distribution proportional to the expression on the right, and .
Proof
According to Theorem 1, we have
where is according to Bayesian theory but omitting the normalization constant with respect to .
Further, we have
where . follows Corollary 1.