The essence of science is the search for objective truth, yet scientific work is typically evaluated through peer review, a notoriously subjective process. (Footnote 1: Even papers about peer review are subject to peer review, the irony of which has not escaped us.) The former editor of the British Medical Journal writes [Smith2006]:
“People have a great many fantasies about peer review, and one of the most powerful is that it is a highly objective, reliable, and consistent process. I regularly received letters from authors who were upset that the BMJ rejected their paper and then published what they thought to be a much inferior paper on the same subject. Always they saw something underhand. They found it hard to accept that peer review is a subjective and, therefore, inconsistent process.”
Of course, this criticism is equally relevant to computer science in general, and AI in particular, but AI researchers have been playing a leading role in using their craft to propose and test new approaches to peer review [Roos, Rothe, and Scheuermann2011, Charlin and Zemel2013, Ge, Welling, and Ghahramani2013, Lawrence and Cortes2014, Kurokawa et al.2015, Shah et al.2017].
We contribute to this effort by examining one of the sources of inconsistency in peer review: the mapping from criteria scores to overall recommendation. Specifically, most conferences ask reviewers to score papers on criteria such as novelty, significance, and technical quality. But there is rarely any guidance as to how these criteria should be aggregated in order to arrive at a recommendation in a principled way. As a consequence, the acceptance decision on each paper is governed by the subjective opinions of a handful of reviewers (who review that paper) regarding the relative importance of various criteria. Furthermore, these aggregation rules are not consistent across different papers.
A fascinating exception, which serves as a case in point, is the 27th AAAI Conference on Artificial Intelligence (AAAI 2013). Reviewers were asked to score papers, on a scale of 1–6, according to the following criteria: technical quality, experimental analysis, formal analysis, clarity/presentation, novelty of the question, novelty of the solution, breadth of interest, and potential impact. The admirable goal of the program chairs was to select “exciting but imperfect papers” over “safe but solid” papers, and, to this end, they provided detailed instructions on how to map the foregoing criteria to an overall recommendation. For example, the preimage of ‘strong accept’ is “a 5 or 6 in some category, no 1 in any category,” that is, reviewers were instructed to strongly accept a paper that has a 5 or 6 in, say, clarity, but is below average according to each and every other criterion (i.e., a clearly boring paper). Some reviewers chose not to follow these instructions, as noted by the program chairs in an email to a subset of the program committee:
“Note that the reviewer forms have directions for how to assign the summary score. Ideally, this score would have been automated, but it was not. Some reviewers have not properly applied the rules.”
Although the specific issue we described can be alleviated via a revised mapping of criteria to recommendations, that hand-crafted revised mapping is likely to have its own quirks. In summary, the AAAI-13 anecdote highlights the difficulty of devising a principled mapping from criteria scores to recommendations.
To address this problem, we propose an approach based on machine learning and social choice, which provides a mapping from criteria scores to recommendations capturing the opinion of the entire (reviewer) community. From a machine learning perspective, the examples are reviews, each consisting of criteria scores (the input point) and an overall recommendation (the label). We make the innocuous assumption that each reviewer has a monotonic mapping in mind, in the sense that a paper whose scores are at least as high as those of another paper on every criterion would receive an overall recommendation that is at least as high; the reviews submitted by a particular reviewer can be seen as observations of that mapping. Given these data, we find a single monotonic mapping that minimizes a loss function (which we will discuss momentarily). We can then apply this mapping to the criteria scores associated with each review, to obtain new overall recommendations, which replace the original ones.
In the foregoing framework, the overall recommendation initially given by a reviewer only serves to express an opinion about the desired mapping in one particular case; we achieve consistency across reviews by discarding these initial recommendations and applying a single mapping, which is the output of the learning process, exclusively to criteria scores. Therefore, this process lies in the realm of computational social choice [Brandt et al.2016]: it can be seen as aggregating individual opinions over mappings into a consensus mapping. From this viewpoint, it is natural to select the loss function so that the resulting aggregation method satisfies socially desirable properties, such as consensus (if all reviewers agree then the aggregate mapping should coincide with their recommendations), efficiency (if one paper dominates another then its overall recommendation should be at least as high), and strategyproofness (reviewers cannot pull the aggregate mapping closer to their own recommendations by misreporting them).
Specifically, we focus on the family of $L(p,q)$ loss functions, for $p, q \in [1, \infty]$, which resembles the $L_{p,q}$ norm of a matrix. Our question, then, is:
For what values of $p$ and $q$ does our aggregation method, which chooses a monotonic mapping that minimizes the $L(p,q)$ loss on the given reviews, satisfy consensus, efficiency, and strategyproofness?
Our main theoretical result is a characterization theorem that gives a decisive answer to this question: the three properties are satisfied if and only if $p = q = 1$. This result singles out an instantiation of our approach that we view as particularly attractive and well grounded.
We also provide empirical results, which analyze properties of our approach when applied to a dataset of reviews from IJCAI 2017. One vignette is that the papers selected by $L(1,1)$ aggregation have a […] overlap with the actual list of accepted papers, suggesting that our approach makes a significant difference compared to the status quo (arguably for the better).
2 Our Framework
Suppose there are $n$ reviewers $R = \{1, \dots, n\}$, and a set $P$ of $m$ papers, denoted using letters such as $a, b, c$. Each reviewer $i$ reviews a subset of papers, denoted by $P(i) \subseteq P$. Conversely, let $R(a)$ denote the set of all reviewers who review paper $a$. Each reviewer assigns scores to each of their papers on $d$ different criteria, such as novelty, experimental analysis, and technical quality, and also gives an overall recommendation. We denote the criteria scores given by reviewer $i$ to paper $a$ by $x_{ia}$, and the corresponding overall recommendation by $y_{ia}$. Let $X_1, \dots, X_d$ denote the domains of the criteria scores, and let $X = X_1 \times \cdots \times X_d$. Also, let $Y$ denote the domain of the overall recommendations. For concreteness, we assume that each $X_k$ as well as $Y$ is the real line. However, our results hold more generally, even if these domains are non-singleton intervals in $\mathbb{R}$, for instance.
We further assume that each reviewer has a monotonic function in mind that they use to compute the overall recommendation for a paper from its criteria scores. By a monotonic function, we mean that given any two score vectors $x$ and $x'$, if $x$ is greater than or equal to $x'$ on all coordinates, then the function's value on $x$ must be at least as high as its value on $x'$. Formally, for each reviewer $i$, there exists $g_i^\star \in \mathcal{F}$ such that $y_{ia} = g_i^\star(x_{ia})$ for all $a \in P(i)$, where
$$\mathcal{F} = \{ f : X \to Y \mid \forall x, x' \in X,\ x \ge x' \Rightarrow f(x) \ge f(x') \}$$
is the set of all monotonic functions.
2.1 Loss Functions
Recall that our goal is to use all criteria scores, and their corresponding overall recommendations, to learn an aggregate function $\hat{f}$ that captures the opinions of all reviewers on how criteria scores should be mapped to recommendations. We do this by computing the function in $\mathcal{F}$ that minimizes the $L(p,q)$ loss on the data. In more detail, given $p, q \in [1, \infty]$, we compute
$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \left[ \sum_{i \in R} \left( \sum_{a \in P(i)} \left| y_{ia} - f(x_{ia}) \right|^{p} \right)^{q/p} \right]^{1/q}. \tag{1}$$
In words, for a function $f$, the $L(p,q)$ loss is the $L_q$ norm taken over the loss associated with individual reviewers, where the latter loss is defined as the $L_p$ norm computed on the error of $f$ with respect to the reviewer's overall recommendations. This definition closely resembles the $L_{p,q}$ norm of a matrix, which has many applications in machine learning [Ding et al.2006, Kong, Ding, and Huang2011, Nie et al.2010]. We refer to aggregation by minimizing the $L(p,q)$ loss as defined in Equation (1) as $L(p,q)$ aggregation.
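As a minimal sketch of the nested-norm structure just described, the following evaluates an $L(p,q)$ loss for a candidate mapping on a toy set of reviews. The function name `lpq_loss`, the dictionary layout, and the toy mapping `mean_score` are illustrative assumptions, not part of the paper; the sketch only evaluates the loss and does not perform the minimization over monotonic functions.

```python
import math

def lpq_loss(reviews, f, p, q):
    """Evaluate an L(p,q) loss of mapping f on the given reviews.

    reviews: dict mapping a reviewer id to a list of
             (criteria_scores, overall_recommendation) pairs (hypothetical layout).
    p: norm over the papers of one reviewer (inner norm).
    q: norm over reviewers (outer norm). Either may be math.inf.
    """
    def norm(values, k):
        values = [abs(v) for v in values]
        if not values:
            return 0.0
        if k == math.inf:
            return max(values)
        return sum(v ** k for v in values) ** (1.0 / k)

    # Inner norm: one loss value per reviewer.
    per_reviewer = [
        norm([y - f(x) for x, y in revs], p) for revs in reviews.values()
    ]
    # Outer norm: combine the per-reviewer losses.
    return norm(per_reviewer, q)

# Toy data: two reviewers, two criteria per review.
reviews = {
    "r1": [((3, 4), 5.0), ((1, 2), 2.0)],
    "r2": [((3, 4), 4.0)],
}

def mean_score(x):
    # A simple monotonic candidate mapping: the average criterion score.
    return sum(x) / len(x)

loss_11 = lpq_loss(reviews, mean_score, 1, 1)
loss_inf1 = lpq_loss(reviews, mean_score, math.inf, 1)
loss_22 = lpq_loss(reviews, mean_score, 2, 2)
```

With $p = q$ the nested norms collapse into a single norm over all pooled reviews, which is why `loss_22` equals the Euclidean norm of the three individual errors.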
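Equation (1) does not specify how to break ties between multiple minimizers. For concreteness, we select the minimizer $\hat{f}$ with minimum empirical $L_2$ norm. Mathematically, let
$$\hat{\mathcal{F}} = \operatorname*{argmin}_{f \in \mathcal{F}} \left[ \sum_{i \in R} \left( \sum_{a \in P(i)} \left| y_{ia} - f(x_{ia}) \right|^{p} \right)^{q/p} \right]^{1/q}$$
be the set of all $L(p,q)$ loss minimizers. We break ties by choosing
$$\hat{f} \in \operatorname*{argmin}_{f \in \hat{\mathcal{F}}} \sum_{i \in R} \sum_{a \in P(i)} f(x_{ia})^2. \tag{2}$$
Observe that since the $L(p,q)$ loss and constraint set are convex, $\hat{\mathcal{F}}$ is also a convex set. Hence, $\hat{f}$ as defined by Equation (2) is unique. We emphasize that although we use minimum $L_2$ norm for tie-breaking, all of our results hold under any reasonable tie-breaking method, such as the minimum $L_k$ norm for any $k \in (1, \infty)$.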
Once the function $\hat{f}$ has been computed, it can be applied to every review (for all reviewers $i$ and papers $a$) to obtain a new overall recommendation $\hat{f}(x_{ia})$. There is a separate, almost orthogonal, question of how to aggregate the overall recommendations of several reviewers on a paper into a single recommendation (typically this is done by taking the average). In our theoretical results we are agnostic to how this additional aggregation step is performed, but we return to it in our experiments in Section 4.
We remark that an alternative approach would be to learn a monotonic function for each reviewer (which best captures their recommendations), and then aggregate these functions into a single function $\hat{f}$. We chose not to pursue this approach, because in practice there are very few examples per reviewer, so it is implausible that we would be able to accurately learn the reviewers' individual functions.
2.2 Axiomatic Properties
In social choice theory, the most common approach, primarily attributed to Arrow [Arrow1951], for comparing different aggregation methods is to determine which desirable axioms they satisfy. We take the same approach in order to determine the values of $p$ and $q$ for the $L(p,q)$ aggregation in Equation (1).
We stress that axioms are defined for aggregation methods and not aggregate functions. Informally, an aggregation method is a function that takes as input all the reviews $\{(x_{ia}, y_{ia})\}_{i \in R,\, a \in P(i)}$, and outputs an aggregate function $\hat{f} : X \to Y$. We do not define an aggregation method formally to avoid introducing cumbersome notation that will largely be useless later. It is clear that for any $p, q \in [1, \infty]$, $L(p,q)$ aggregation (with tie-breaking as defined by Equation 2) is an aggregation method.
Social choice theory essentially relies on counterfactual reasoning to identify scenarios where it is clear how an aggregation method should behave. To give one example, the Pareto efficiency property of voting rules states that if all voters prefer alternative $a$ to alternative $b$, then $b$ should not be elected; this situation is extremely unlikely to occur, yet Pareto efficiency is obviously a property that any reasonable voting rule must satisfy. With this principle in mind, we identify a setting in our problem where the requirements are very clear, and then define our axioms in that setting.
For all of our axioms, we restrict attention to scenarios where every reviewer reviews every paper, that is, $P(i) = P$ for every $i \in R$. Moreover, we assume that the papers have 'objective' criteria scores, that is, the criteria scores given to a paper are the same across all reviewers, so the only source of disagreement is how the criteria scores should be mapped to an overall recommendation. We can then denote the criteria scores of a paper $a$ simply as $x_a$, as opposed to $x_{ia}$, since they are independent of $i$.
An axiom is satisfied by an aggregation method if its statement holds for every possible number of reviewers $n$ and number of papers $m$, and for all possible criteria scores and overall recommendations. We start with the simplest axiom, consensus, which informally states that if there is a paper such that all reviewers give it the same overall recommendation, then $\hat{f}$ must agree with the reviewers; this axiom is closely related to the unanimity axiom in social choice.
Axiom 1 (Consensus).
For any paper $a \in P$, if all reviewers report identical overall recommendations $y_{1a} = y_{2a} = \cdots = y_{na} = r$ for some $r \in Y$, then $\hat{f}(x_a) = r$.
Before presenting the next axiom, we require another definition: we say that paper $a \in P$ dominates paper $b \in P$ if there exists a bijection $\sigma : R \to R$ such that for all $i \in R$, $y_{ia} \ge y_{\sigma(i) b}$. Equivalently (and less formally), paper $a$ dominates paper $b$ if the sorted overall recommendations given to $a$ pointwise-dominate the sorted overall recommendations given to $b$. Intuitively, in this situation, $a$ should receive a (weakly) higher overall recommendation than $b$, which is exactly what the axiom requires; it is similar to the classic Pareto efficiency axiom mentioned above.
Axiom 2 (Efficiency).
For any pair of papers $a, b \in P$, if $a$ dominates $b$, then $\hat{f}(x_a) \ge \hat{f}(x_b)$.
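The sorted-vector form of the dominance definition is easy to check mechanically. The sketch below is illustrative only; the helper name `dominates` and the sample recommendation lists are assumptions made for the example, and it presumes every reviewer reviews both papers (as in the axiom setting).

```python
def dominates(recs_a, recs_b):
    """Return True if paper a dominates paper b.

    recs_a, recs_b: overall recommendations from the same set of reviewers,
    in any order. A bijection sigma with y_ia >= y_sigma(i)b exists exactly
    when the sorted lists compare pointwise.
    """
    if len(recs_a) != len(recs_b):
        raise ValueError("both papers must be reviewed by the same reviewers")
    return all(ya >= yb for ya, yb in zip(sorted(recs_a), sorted(recs_b)))

# Hypothetical recommendations from three reviewers.
a_dominates_b = dominates([1, 2, 10], [2, 3, 0])   # sorted: [1,2,10] vs [0,2,3]
b_dominates_a = dominates([2, 3, 0], [1, 2, 10])
```

Note that dominance does not require any single reviewer to prefer paper $a$: in the example above two of the three reviewers score $b$ higher, yet $a$ still dominates $b$ after sorting.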
Our final axiom is strategyproofness, a game-theoretic property that plays a major role in social choice theory [Moulin1983]. Intuitively, strategyproofness means that reviewers have no incentive to misreport their overall recommendations: they cannot bring the aggregate recommendations (the community's consensus about the relative importance of various criteria) closer to their own through strategic manipulation.
Axiom 3 (Strategyproofness).
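For each reviewer $i$, and all possible manipulated recommendations $y'_i \in Y^m$, if $y_i = (y_{ia})_{a \in P}$ is replaced with $y'_i$, then
$$\left\| \big(\hat{f}(x_a)\big)_{a \in P} - y_i \right\|_2 \le \left\| \big(\hat{g}(x_a)\big)_{a \in P} - y_i \right\|_2,$$
where $\hat{f}$ and $\hat{g}$ are the aggregate functions obtained from the original and manipulated reviews, respectively.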
3 Theoretical Results
In Section 2 we introduced $L(p,q)$ aggregation as a family of rules for aggregating individual opinions towards a consensus mapping from criteria scores to recommendations. But that definition, in and of itself, leaves open the question of how to choose the values of $p$ and $q$ in a way that leads to the most socially desirable outcomes. The axioms of Section 2.2 allow us to give a satisfying answer to this question. Specifically, our main theoretical result is a characterization of $L(1,1)$ aggregation in terms of the three axioms.
Theorem 1. $L(p,q)$ aggregation, where $p, q \in [1, \infty]$, satisfies consensus, efficiency, and strategyproofness if and only if $p = q = 1$.
We remark that for $p = q = 1$, Equation (1) does not distinguish between different reviewers, that is, the aggregation method pools all reviews together. We find this interesting, because the $L(p,q)$ aggregation framework does have enough power to make that distinction, but the axioms guide us towards a specific solution, $L(1,1)$, which does not.
Turning to the proof of the theorem, in a nutshell, we establish the 'only if' direction in three steps: strategyproofness is violated by $L(p,q)$ aggregation for all $q > 1$ (Lemma 1), consensus is violated by $p = \infty$ and $q = 1$ (Lemma 2), and efficiency is violated by $p \in (1, \infty)$ and $q = 1$ (Lemma 3). Together, the three lemmas leave $L(1,1)$ as the only option, and, for the 'if' direction, Lemma 4 then shows that $L(1,1)$ aggregation does indeed satisfy the three axioms. Below we state the lemmas and give some proof ideas; the theorem's full proof is relegated to Appendix A.
Lemma 1. $L(p,q)$ aggregation with $q \in (1, \infty]$ violates strategyproofness.
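We prove the lemma via a simple construction with just one paper and two reviewers, who give the paper overall recommendations of $0$ and $1$, respectively. For $q \in (1, \infty)$, the aggregate score is
$$\hat{f} = \operatorname*{argmin}_{f \in \mathbb{R}} \left( |f|^q + |1 - f|^q \right),$$
and for $q = \infty$, it is
$$\hat{f} = \operatorname*{argmin}_{f \in \mathbb{R}} \max\{ |f|, |1 - f| \}.$$
Either way, the unique minimum is obtained at an aggregate score of $1/2$. If reviewer 1 reported an overall recommendation of $-1$, however, the aggregate score would be $0$, which matches her 'true' recommendation, thereby violating strategyproofness.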
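The manipulation in this construction can be reproduced numerically. The sketch below assumes, for illustration, that the two honest recommendations are 0 and 1 and that the manipulated report is -1; the function name `aggregate_one_paper` and the use of ternary search are choices made for the example and are not taken from the paper. With a single paper the inner norm is irrelevant, so only $q$ matters.

```python
import math

def aggregate_one_paper(recs, q, iters=200):
    """Minimize the outer-norm loss over a single aggregate score.

    recs: overall recommendations, one per reviewer, for the only paper.
    q: outer norm, a number > 1 or math.inf.
    The objective is convex in f, so ternary search over the range of the
    recommendations converges to the minimizer.
    """
    def loss(f):
        errs = [abs(y - f) for y in recs]
        return max(errs) if q == math.inf else sum(e ** q for e in errs)

    lo, hi = min(recs), max(recs)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) <= loss(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

honest = aggregate_one_paper([0.0, 1.0], q=2)        # reviewer 1 reports truthfully
manipulated = aggregate_one_paper([-1.0, 1.0], q=2)  # reviewer 1 misreports -1
honest_inf = aggregate_one_paper([0.0, 1.0], q=math.inf)
```

Under these assumed values the honest aggregate sits halfway between the two reviewers, while the misreport pulls it onto reviewer 1's true recommendation, which is the kind of gain strategyproofness rules out.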
Lemma 2. $L(p,q)$ aggregation with $p = \infty$ and $q = 1$ violates consensus.
Lemma 2 is established via another simple construction: two papers, two reviewers, and a matrix of overall recommendations $(y_{ia})$, where $y_{ia}$ denotes the overall recommendation given by reviewer $i$ to paper $a$. Crucially, the two reviewers agree on the overall recommendation for one of the two papers, hence by consensus the aggregate score of this paper must equal that common recommendation. But we show that $L(\infty, 1)$ aggregation would not return that aggregate score for the paper.
In our view, the next lemma presents the most interesting and counter-intuitive result in the paper.
Lemma 3. $L(p,q)$ aggregation with $p \in (1, \infty)$ and $q = 1$ violates efficiency.
It is quite surprising that such reasonable loss functions violate the simple requirement of efficiency. In what follows we attempt to explain this phenomenon via a connection of our problem with the notion of the 'Fermat point' of a triangle [Spain1996]. The explanation provided here demonstrates the negative result for $L(2,1)$ aggregation. The complete proof of the lemma for general values of $p \in (1, \infty)$ is quite involved, and is relegated to Appendix A.
Consider a setting with $3$ reviewers and $2$ papers, where each reviewer reviews both papers. We let $x_1$ and $x_2$ denote the respective objective criteria scores of the two papers. Assume that no score in $\{x_1, x_2\}$ is pointwise greater than or equal to the other score in that set. Let the overall recommendations given by the reviewers be $y_{11}$, $y_{21}$, $y_{31}$ to the first paper and $y_{12}$, $y_{22}$ and $y_{32}$ to the second paper. Under these scores, let $\hat{f}$ denote the aggregate function that minimizes the $L(2,1)$ loss.
The Fermat point of a triangle is a point such that the sum of its (Euclidean) distances from all three vertices is minimized. Consider a triangle in $\mathbb{R}^2$ with vertices $(y_{11}, y_{12})$, $(y_{21}, y_{22})$ and $(y_{31}, y_{32})$. Setting […], one can use known algorithms to compute the Fermat point of this triangle as […]. More generally, when the vertex […] is moved away from the rest of the triangle (by increasing […]), the Fermat point paradoxically biases towards the other (second) coordinate.
Connecting back to our original problem, by definition, the Fermat point of this triangle is exactly $(\hat{f}(x_1), \hat{f}(x_2))$. When […], paper 1 receives scores […] in sorted order, which dominates the sorted scores of paper 2. However, the aggregate score of paper 1 is strictly smaller than that of paper 2, thereby violating efficiency for the $L(2,1)$ loss.
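To make the Fermat-point connection concrete, here is a small numerical sketch. The three reviewer points used below are hypothetical values chosen for illustration and are not the ones used in the paper; the Fermat point is approximated with Weiszfeld's iteration, a standard method for the geometric median, which is likewise a choice made for this example.

```python
import math

def fermat_point(points, iters=5000, eps=1e-12):
    """Approximate the point minimizing the sum of Euclidean distances
    to the given 2-D points, using Weiszfeld's iteration."""
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        wsum = xs = ys = 0.0
        for px, py in points:
            # Guard against division by zero if the iterate hits a vertex.
            d = max(math.hypot(x - px, y - py), eps)
            w = 1.0 / d
            wsum += w
            xs += w * px
            ys += w * py
        x, y = xs / wsum, ys / wsum
    return x, y

# Hypothetical data: each reviewer is a point (rec. for paper 1, rec. for paper 2).
reviewers = [(10.0, 0.0), (1.0, 2.0), (2.0, 3.0)]
agg_paper1, agg_paper2 = fermat_point(reviewers)

paper1_sorted = sorted(r[0] for r in reviewers)  # [1.0, 2.0, 10.0]
paper2_sorted = sorted(r[1] for r in reviewers)  # [0.0, 2.0, 3.0]
paper1_dominates = all(a >= b for a, b in zip(paper1_sorted, paper2_sorted))
```

In this assumed instance paper 1 dominates paper 2 in the sense of the efficiency axiom, yet the minimizer of the summed Euclidean distances assigns paper 1 the lower aggregate score, because the far-away vertex (10, 0) exerts only a unit-length pull.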
Finally, we now turn to the positive result.
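Lemma 4. $L(p,q)$ aggregation with $p = q = 1$ satisfies consensus, efficiency and strategyproofness.
When each reviewer reviews every paper and the papers have objective criteria scores, $L(1,1)$ aggregation, with tie-breaking as in Equation (2), satisfies
$$\hat{f}(x_a) = \operatorname{left-med}\big(\{ y_{ia} \}_{i \in R}\big) \quad \text{for every } a \in P,$$
where $\operatorname{left-med}(\cdot)$ of a set of points in $\mathbb{R}$ is their left median. Using this observation, we can show that all three properties are satisfied, thereby proving Lemma 4.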
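The left-median characterization lends itself to a short sanity check. The sketch below is an illustration under the axiom setting (every reviewer reviews the paper); the helper name `left_median` and the sample values are assumptions made for the example, and the brute-force loop probes strategyproofness on a single paper only.

```python
def left_median(values):
    """Return the left (lower) median: for an even number of points,
    the smaller of the two middle values."""
    s = sorted(values)
    return s[(len(s) - 1) // 2]

# Consensus: identical recommendations are returned unchanged.
consensus_value = left_median([5, 5, 5])

# Efficiency on hypothetical data: paper a dominates paper b.
agg_a = left_median([1, 2, 10])
agg_b = left_median([0, 2, 3])

# Strategyproofness probe: reviewer 0 truly recommends 0; others say 4 and 6.
truth = 0.0
honest_gap = abs(left_median([truth, 4.0, 6.0]) - truth)
best_manipulated_gap = min(
    abs(left_median([report / 2.0, 4.0, 6.0]) - truth)
    for report in range(-40, 41)  # reports from -20 to 20 in steps of 0.5
)
```

In this toy instance no misreport on the grid brings the aggregate closer to reviewer 0's true recommendation than honest reporting does, matching the intuition that a median cannot be dragged past the other reviewers' values.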
It is worth noting that, although we have presented the lemmas as components in the proof of Theorem 1, they also have standalone value (some more than others). For example, if one decided that only strategyproofness is important, then Lemma 1 would give significant guidance on choosing an appropriate method.
4 Implementation and Experimental Results
In order to empirically analyze different aspects of our approach, we employ a dataset of reviews from the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), which was made available to us by the program chair, Carles Sierra. To our knowledge, we are the first to use this dataset.
At submission time, authors were asked if review data for their paper could be included in an anonymized dataset, and, similarly, reviewers were asked whether their reviews could be included; the dataset provided to us consists of all reviews for which permission was given. Each review is tagged with a reviewer ID and paper ID, which are anonymized for privacy reasons. The criteria used in the conference are 'originality', 'relevance', 'significance', 'quality of writing' (which we call 'writing'), and 'technical quality' (which we call 'technical'), and each is rated on a scale from […] to […]. Overall recommendations are also on a scale from […] to […]. In addition, information about which papers were accepted and which were rejected is included in the dataset.
The number of papers in the dataset is […], of which […] were accepted, which amounts to 27.27%. This is a large subset of the […] submissions to the conference, of which […] were accepted, for an actual acceptance rate of […]. The number of reviewers in the dataset is […], and the number of reviews is […]. Tables 1 and 2 show the distribution of the number of reviews received by papers, and the number of papers reviewed by reviewers.
We apply $L(1,1)$ aggregation (i.e., $p = q = 1$), as given in Equation (1), to this dataset to learn the aggregate function. Let us denote that function by $\tilde{f}$. The optimization problem in Equation (1) is convex, and standard optimization packages can efficiently compute the minimizer. Hence, importantly, computational complexity is a nonissue in terms of implementing our approach.
Once we compute the aggregate function $\tilde{f}$, we calculate the aggregate overall recommendation of each paper $a$ by taking the median of the aggregate reviewer scores for that paper obtained by applying $\tilde{f}$ to the objective scores:
$$y_{\tilde{f}}(a) = \operatorname{median}\big(\{ \tilde{f}(x_{ia}) \}_{i \in R(a)}\big).$$
Recalling that 27.27% of the papers in the dataset were actually accepted to the conference, in our experiments we define the set of papers accepted by the aggregate function $\tilde{f}$ as the top 27.27% of papers according to their respective $y_{\tilde{f}}(a)$ values.
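The per-paper median and top-fraction selection described above can be sketched as follows. The function name `accepted_set`, the input layout, and the tie-breaking by paper identifier are assumptions made for illustration; the scores in the example stand in for values already produced by the learned aggregate function.

```python
from statistics import median

def accepted_set(scores_by_paper, fraction):
    """Select the top `fraction` of papers by median aggregate score.

    scores_by_paper: dict mapping a paper id to the list of aggregate
    reviewer scores for that paper (hypothetical layout).
    Ties are broken by paper id, which is an arbitrary choice here.
    """
    aggregate = {a: median(v) for a, v in scores_by_paper.items()}
    k = round(fraction * len(aggregate))
    ranked = sorted(aggregate, key=lambda a: (-aggregate[a], a))
    return set(ranked[:k])

# Toy example with four papers and a 50% acceptance rate.
toy_scores = {
    "a": [6.0, 7.0, 8.0],
    "b": [3.0, 4.0],
    "c": [9.0, 5.0, 5.0],
    "d": [2.0, 2.0, 2.0],
}
toy_accepted = accepted_set(toy_scores, 0.5)
```

Using the median rather than the mean means a single outlying review (the 9.0 for paper "c") does not move that paper's aggregate.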
We now present the specific experiments we ran, and their results.
4.1 Varying Number of Reviewers
In our first experiment, for each value of a parameter $k$, we subsampled $k$ distinct reviews for each paper uniformly at random from the set of all reviews for that paper (if the paper had fewer than $k$ reviews to begin with then we retained all the reviews). We then computed an aggregate function, $\tilde{f}_k$, via $L(1,1)$ aggregation applied only to these subsampled reviews. Next, we found the set of top 27.27% papers as given by $\tilde{f}_k$ applied to the subsampled reviews. Finally, we compared the overlap of this set of top papers for every value of $k$ with the set of top 27.27% papers as dictated by the overall aggregate function $\tilde{f}$.
The results from this experiment are plotted in Figure 1, and lead to several observations. First, the incremental overlap from […] to […] is very small because there are very few papers that had […] or more reviews (Table 1). Second, we see that the amount of overlap monotonically increases with the number of reviewers per paper $k$, thereby serving as a sanity check on the data as well as our methods. Third, we observe the overlap to be quite high ([…]) even with a single reviewer per paper.
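The subsampling step and the overlap measure in this experiment can be sketched as below. Function names, the data layout, and the choice to report overlap as a fraction of the reference set are illustrative assumptions; the step that learns an aggregate function from the subsample is omitted.

```python
import random

def subsample_reviews(reviews_by_paper, k, rng):
    """Keep at most k reviews per paper, drawn uniformly without replacement.

    Papers with k or fewer reviews keep all of them.
    """
    return {
        paper: (list(revs) if len(revs) <= k else rng.sample(revs, k))
        for paper, revs in reviews_by_paper.items()
    }

def overlap(selected, reference):
    """Fraction of the reference set that also appears in the selected set."""
    return len(selected & reference) / len(reference)

# Toy usage with hypothetical review identifiers.
rng = random.Random(0)
toy_reviews = {"a": ["r1", "r2", "r3", "r4"], "b": ["r5"], "c": ["r6", "r7"]}
sampled = subsample_reviews(toy_reviews, 2, rng)
toy_overlap = overlap({"a", "b", "c"}, {"b", "c", "d"})
```

Passing an explicit `random.Random` instance keeps the subsample reproducible across runs.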
4.2 Loss Per Reviewer
Next, we look at the loss of different reviewers, under $\tilde{f}$ (obtained by $L(1,1)$ aggregation). In order for the losses to be on the same scale, we normalize each reviewer's loss by the number of papers reviewed by them. Formally, the normalized loss of reviewer $i$ (for $p = q = 1$) is
$$\frac{1}{|P(i)|} \sum_{a \in P(i)} \left| y_{ia} - \tilde{f}(x_{ia}) \right|.$$
The normalized loss averaged across reviewers is found to be […], and the standard deviation is […]. Figure 2 shows the distribution of the normalized loss of all the reviewers. Note that the normalized loss of a reviewer can fall in the range […]. These results thus indicate that the function $\tilde{f}$ is indeed at least a reasonable representation of the mapping of the broader community.
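A minimal sketch of the per-reviewer normalized loss follows, assuming the absolute-error form that corresponds to $p = q = 1$. The function name, the data layout, and the toy mapping are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant was reported.

```python
from statistics import mean, pstdev

def normalized_losses(reviews, f):
    """Average absolute error of mapping f per reviewer.

    reviews: dict mapping a reviewer id to a list of
             (criteria_scores, overall_recommendation) pairs.
    """
    return {
        i: sum(abs(y - f(x)) for x, y in revs) / len(revs)
        for i, revs in reviews.items()
    }

def mean_score(x):
    # Stand-in for the learned aggregate mapping.
    return sum(x) / len(x)

toy_reviews = {
    "r1": [((3, 4), 5.0), ((1, 2), 2.0)],
    "r2": [((3, 4), 4.0)],
}
losses = normalized_losses(toy_reviews, mean_score)
avg_loss = mean(losses.values())
std_loss = pstdev(losses.values())
```

Dividing by the number of reviews puts a reviewer with two papers and a reviewer with ten on the same scale, which is the purpose of the normalization in the text.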
4.3 Visualizing the Community Aggregate Mapping
Our framework is not only useful for computing an aggregate mapping to help in acceptance decisions, but also for understanding the preferences of the community for use in subsequent modeling and research. We illustrate this application by providing some visualizations and interpretations of the aggregate function $\tilde{f}$ obtained from $L(1,1)$ aggregation on the IJCAI review data.
The function $\tilde{f}$ lives in a $5$-dimensional space, making it hard to visualize the entire aggregate function at once. Hence we instead fix the values of $3$ criteria at a time and plot the function in terms of the remaining two criteria. In all of the visualization and interpretation below, the fixed criteria are set to their respective (marginal) modes: for 'quality of writing' the mode is […] (715 reviews), for 'originality' it is […] (826 reviews), for 'relevance' it is […] (888 reviews), for 'significance' it is […] (800 reviews), and for 'technical quality' it is […] (702 reviews). We present the three most representative plots in Figure 3, and relegate the rest to Appendix B.
The key takeaways from this experiment are as follows:
Writing and relevance do not have a significant influence (Figure 3(a)). Really bad writing or relevance is a significant downside, excellent writing or relevance is appreciated, but everything else in between is irrelevant.
Technical quality and significance exert a high influence (Figure 3(b)). Moreover, the influence is approximately linear.
Originality exerts a moderate influence (Figure 3(c)).
Linear models (i.e., models that are linear in the criteria) are quite popular in machine learning. Our empirical observations reveal that linear models are partially applicable to conference review data — for some criteria one may indeed assume a linear model, but not for all.
4.4 Overlaps with Other Aggregates
We also compute the overlap between the set of top papers selected by $L(1,1)$ aggregation with the actual top accepted papers. It is important to emphasize that we believe the set of papers selected by our method is better than any hand-crafted or rule-based decision using the scores, since this aggregate represents the opinion of the community. Hence, to be clear, we do not have a goal of maximizing the overlap. Nevertheless, a very small overlap would mean that our approach is drastically different from standard practice, which would potentially be disturbing. We find that the overlap is […], which we think is quite fascinating: our approach does make a significant difference, but the difference is not so drastic as to be disconcerting.
Finally, out of intellectual curiosity, we also computed the pairwise overlaps of the papers accepted by $L(p,q)$ aggregation, for several values of $p$ and $q$. We find that, interestingly, the choice of the reviewer-norm $q$ has significantly more influence as compared to the paper-norm $p$; we refer the reader to Appendix B for details.
A primary challenge in the analysis of peer review is the absence of ground truth. This challenge is compounded by the paucity of models for problems such as subjective mappings. We address these challenges by leveraging tools from social choice theory and taking an axiomatic viewpoint, and then making only the weak assumption that the mapping for any reviewer is monotonic.
In particular, one can think of the theoretical results of Section 3 as supporting $L(1,1)$ aggregation using the tools of social choice theory, whereas the empirical results of Section 4 focus on studying its behavior on real data (as well as phenomena related to the data itself). Understanding this helps clear up another possible source of confusion: are we not overfitting by training on a set of reviews, and then applying the aggregate function to the same reviews? The answer is negative, because the process of learning the function $\hat{f}$ amounts to an aggregation of opinions about how criteria scores should be mapped to overall recommendations. Applying it to the data yields recommendations in $Y$, whereas this function from $X$ to $Y$ lives in a different space.
That said, it is of intellectual interest to understand the statistical aspects of estimating the community's consensus mapping function, assuming the existence of a ground truth. In more detail, suppose that each reviewer's true function $g_i^\star$ is a noisy version of some underlying function $f^\star$ that represents the community's beliefs. Then can $L(1,1)$ aggregation recover the function $f^\star$ (in the statistical consistency sense)? If so, then with what sample complexity? At a conceptual level, this non-parametric estimation problem is closely related to problems in isotonic regression [Shah et al.2016, Gao and Wellner2007, Chatterjee, Guntuboyina, and Sen2018]. The key difference is that the observations in our setting consist of evaluations of multiple functions, where each such function is a noisy version of the original monotonic function. In contrast, isotonic regression is primarily concerned with noisy evaluations of a common function. Nevertheless, the insights from isotonic regression suggest that the naturally occurring monotonicity assumption of our setting can yield attractive (and sometimes near-parametric [Shah et al.2016, Shah, Balakrishnan, and Wainwright2018]) rates of estimation.
Finally, our work focuses on learning one representative aggregate mapping for the entire community of reviewers. Instead, the program chairs of a conference may wish to allow for multiple mappings that represent the aggregate opinions of different sub-communities (e.g., theoretical or applied researchers). In this case, one can modify our framework to also learn this (unknown) partition of reviewers and/or papers into multiple sub-communities with different mapping functions, and frame the problem in terms of learning a mixture model. The design of computationally efficient algorithms for aggregation under such a mixture model is a challenging open problem.
Shah was supported in part by NSF grants CRII-CCF-1755656 and CCF-1763734. Noothigattu and Procaccia were supported in part by NSF grants IIS-1350598, IIS-1714140, CCF-1525932, and CCF-1733556; by ONR grants N00014-16-1-3075 and N00014-17-1-2428; and by a Sloan Research Fellowship and a Guggenheim Fellowship. We are grateful to Francisco Cruz for compiling the IJCAI 2017 review dataset, and to Carles Sierra for making it available to us.
- [Arrow1951] Arrow, K. 1951. Social Choice and Individual Values. Wiley.
- [Brandt et al.2016] Brandt, F.; Conitzer, V.; Endriss, U.; Lang, J.; and Procaccia, A. D., eds. 2016. Handbook of Computational Social Choice. Cambridge University Press.
- [Charlin and Zemel2013] Charlin, L., and Zemel, R. 2013. The Toronto paper matching system: An automated paper-reviewer assignment system. Manuscript.
- [Chatterjee, Guntuboyina, and Sen2018] Chatterjee, S.; Guntuboyina, A.; and Sen, B. 2018. On matrix estimation under monotonicity constraints. Bernoulli 24(2):1072–1100.
- [Ding et al.2006] Ding, C.; Zhou, D.; He, X.; and Zha, H. 2006. R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 281–288.
- [Gao and Wellner2007] Gao, F., and Wellner, J. A. 2007. Entropy estimate for high-dimensional monotonic functions. Journal of Multivariate Analysis 98(9):1751–1764.
- [Ge, Welling, and Ghahramani2013] Ge, H.; Welling, M.; and Ghahramani, Z. 2013. A Bayesian model for calibrating conference review scores. Manuscript.
- [Kong, Ding, and Huang2011] Kong, D.; Ding, C.; and Huang, H. 2011. Robust nonnegative matrix factorization using L21-norm. In Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM), 673–682.
- [Kurokawa et al.2015] Kurokawa, D.; Lev, O.; Morgenstern, J.; and Procaccia, A. D. 2015. Impartial peer review. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 582–588.
- [Lawrence and Cortes2014] Lawrence, N., and Cortes, C. 2014. The NIPS Experiment. http://inverseprobability.com/2014/12/16/the-nips-experiment.
- [Moulin1983] Moulin, H. 1983. The Strategy of Social Choice, volume 18 of Advanced Textbooks in Economics. North-Holland.
- [Nie et al.2010] Nie, F.; Huang, H.; Cai, X.; and Ding, C. H. 2010. Efficient and robust feature selection via joint L2,1-norms minimization. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), 1813–1821.
- [Roos, Rothe, and Scheuermann2011] Roos, M.; Rothe, J.; and Scheuermann, B. 2011. How to calibrate the scores of biased reviewers by quadratic programming. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI), 255–260.
- [Shah, Balakrishnan, and Wainwright2018] Shah, N. B.; Balakrishnan, S.; and Wainwright, M. J. 2018. Low permutation-rank matrices: Structural properties and noisy completion. arXiv:1709.00127.
- [Shah et al.2016] Shah, N. B.; Balakrishnan, S.; Guntuboyina, A.; and Wainwright, M. J. 2016. Stochastically transitive models for pairwise comparisons: Statistical and computational issues. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 11–20.
- [Shah et al.2017] Shah, N. B.; Tabibian, B.; Muandet, K.; Guyon, I.; and von Luxburg, U. 2017. Design and analysis of the NIPS 2016 review process. arXiv:1708.09794.
- [Smith2006] Smith, R. 2006. Peer review: A flawed process at the heart of science and journals. Journal of the Royal Society of Medicine 99:178–182.
- [Spain1996] Spain, P. 1996. The Fermat point of a triangle. Mathematics Magazine 69(2):131–133.
Appendix A Proof of Theorem 1
Recall that the proof of Theorem 1 consists of four lemmas: three for the ‘only if’ direction and one for the ‘if’ direction. We give the proofs of these lemmas below.
A.1 Proof of Lemma 1
Consider aggregation with arbitrary . We show that strategyproofness is violated. The construction for this is as follows. Suppose there is one paper and two reviewers. The first reviewer gives the paper an overall recommendation of and the second reviewer gives it an overall recommendation of . Let be the (objective) criteria scores of this paper.
Let us first consider . For a function , all we care about in this example is its value at . Hence, for simplicity, let denote the value of function at , i.e., . Then our aggregation becomes
We claim that is the unique minimizer. Observe that if , then the value of our objective is when . On the other hand, if or if then the value of our objective is at least . Hence . By symmetry, we can restrict attention to the range since if there is a minimizer in then there must also be a minimizer in . Consequently, we rewrite the optimization problem as
Consider the function defined by . This function is strictly convex (the second derivative is strictly positive in the domain) whenever . Hence from the definition of strict convexity, we have
whenever . Consequently, the objective value of (5) is greater at than at . We conclude that whenever .
When , we equivalently write the optimization problem as
This objective has a value of if and strictly greater if . Hence, for as well.
The true overall recommendation of reviewer differs from the aggregate by (in every norm). However, if reviewer reported an overall recommendation of , then an argument identical to that above shows that the minimizer is . Reviewer has thus successfully brought down the difference between her own true overall recommendation and the aggregate to . We conclude that strategyproofness is violated whenever .
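The manipulation in this proof can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not the paper's construction verbatim: it assumes that with a single paper the loss reduces to an r-norm (across the two reviewers) of the absolute deviations from the aggregate, and the scores 1 and 0 and the misreport 2 are hypothetical values chosen for illustration. The names `loss`, `aggregate`, and `r` are likewise hypothetical.

```python
import math

def loss(f, reports, r):
    """r-norm, across reviewers, of |report - f| for a single paper (assumed form)."""
    devs = [abs(y - f) for y in reports]
    if math.isinf(r):
        return max(devs)
    return sum(d ** r for d in devs) ** (1.0 / r)

def aggregate(reports, r, iters=200):
    """Ternary search for the minimizer of the (unimodal) loss between the reports."""
    lo, hi = min(reports), max(reports)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1, reports, r) < loss(m2, reports, r):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return (lo + hi) / 2

true_reports = [1.0, 0.0]  # hypothetical true recommendations of reviewers 1 and 2
misreport = [2.0, 0.0]     # reviewer 1 exaggerates her recommendation

for r in (1.5, 2.0, math.inf):
    honest = aggregate(true_reports, r)
    manipulated = aggregate(misreport, r)
    # Distance between reviewer 1's true score and the aggregate, before and after.
    print(r, abs(true_reports[0] - honest), abs(true_reports[0] - manipulated))
```

For each exponent greater than 1 in this sketch, the honest aggregate is the midpoint of the two reports, so the exaggerated report moves the aggregate onto reviewer 1's true score.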
A.2 Proof of Lemma 2
The construction showing that aggregation violates consensus is as follows. Suppose there are two papers, two reviewers and both reviewers review both papers. Assume that the papers have objective criteria scores and , and that neither of these scores is pointwise greater than or equal to the other. Let the overall recommendations of the reviewers for the papers be given by the matrix
where denotes the overall recommendation of reviewer for paper . Since both reviewers give the same overall recommendation of to paper , any aggregation method that satisfies consensus must also give paper an aggregate score of . We show that this is not the case under aggregation.
Let denote the value of function on paper , i.e. . And let denote the aggregate score of paper . Since we are minimizing loss, the aggregate function satisfies:
We claim that is a minimizer of . The objective function value at this point is
For arbitrary , we have
where the first inequality holds because the maximum of two elements is at least as large as the first, and the second inequality holds by the triangle inequality. Therefore, is a minimizer of . The norm of this minimizer is . On the other hand, any minimizer with would have an norm of at least . It follows that such a minimizer will not be selected. In other words, aggregation would select a minimizer for which the aggregate score of paper is not , violating consensus.
(Footnote 2: Observe that even if we used any norm with for tie-breaking, the norm of would be , while the norm of any minimizer would still be at least , violating consensus.)
Complete picture of minimizers.
For completeness, we look at the set of all minimizers of . This is given by
Pictorially, this set is given by the shaded square in Figure 4. It is the square with vertices at , , and .
This shows that almost all minimizers violate consensus. For the specific tie-breaking considered, the minimizer chosen is the one with minimum norm, i.e., the projection of onto this square. This gives us , violating consensus.
Observe that tie-breaking using minimum norm, for , also chooses as the aggregate function, violating consensus. For , all points on the line segment () would be tied winners, almost all of which violate consensus. Further, even if one uses other reasonable tie-breaking schemes like maximum norm, they suffer from the same issue, i.e., there is a tied winner which violates consensus.
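The tie-breaking argument can be illustrated with a small grid search. The sketch below is a minimal illustration under stated assumptions: it assumes the loss is the sum, over reviewers, of the maximum absolute deviation over papers (which matches the "maximum of two elements" and triangle-inequality steps above), and the recommendation matrix `y` is a hypothetical example rather than the matrix used in the proof. Ties are broken by minimum Euclidean norm.

```python
from itertools import product

# Hypothetical recommendations y[i][a] of reviewer i for paper a.
# Both reviewers agree on paper 2 (score 1.0) but disagree on paper 1.
y = [[2.0, 1.0],
     [0.0, 1.0]]

def loss(f):
    """Sum over reviewers of the largest absolute deviation across the two papers."""
    return sum(max(abs(row[a] - f[a]) for a in range(2)) for row in y)

# Candidate aggregate scores (f1, f2) on a grid with step 0.25; all values are
# exactly representable in floating point, so equality tests below are exact.
grid = [k / 4 for k in range(9)]  # 0.0, 0.25, ..., 2.0
candidates = list(product(grid, grid))

min_value = min(loss(f) for f in candidates)
minimizers = [f for f in candidates if loss(f) == min_value]

# Tie-break: pick the minimizer with the smallest Euclidean norm.
chosen = min(minimizers, key=lambda f: f[0] ** 2 + f[1] ** 2)

print("minimum loss:", min_value)
print("number of tied minimizers on the grid:", len(minimizers))
print("chosen aggregate scores:", chosen)
print("consensus score for paper 2:", y[0][1])
```

In this hypothetical instance the point (1.0, 1.0), which respects consensus on paper 2, is among the tied minimizers, but the minimum-norm rule selects a different point whose score for paper 2 is not 1.0.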
A.3 Proof of Lemma 3
Consider aggregation with an arbitrary . We show that efficiency is violated. The construction for this is as follows. There are papers, reviewers and each reviewer reviews both papers. Assume that the papers have objective criteria scores and , and that neither of these scores is pointwise greater than or equal to the other. Let the overall recommendations by the reviewers for the papers be defined by the matrix
where is a constant strictly bigger than and denotes the overall recommendation by reviewer to paper . Observe that paper dominates paper . But, we will show that there exists a value such that the aggregate score of paper is strictly smaller than the aggregate score of paper .
Let denote the value of function on paper , i.e. . And let denote the aggregate score of paper ; observe that we write it as a function of because the aggregate score of each paper would depend on the chosen score . Since we are minimizing loss, the aggregate function satisfies:
For the overall proof to be easier to follow, proofs of all claims are given at the end of this proof. Also, just to re-emphasize, the whole proof assumes .
Claim 1. is a strictly convex objective function.
Claim 1 states that is strictly convex, implying that it has a unique minimizer . Hence, there is no need to consider tie-breaking.
Claim 2. and are bounded. In particular, and .
Claim 2 states that the aggregate score of both papers lies in the interval irrespective of the value of . This allows us to restrict ourselves to the region when computing the minimizer of (9). Hence, for the rest of the proof, we only consider the space . In this region, the optimization problem (8) can be rewritten as
To start off, we analyze the objective function as we take the limit of going to infinity. Later, we show that the observed property holds even for a sufficiently large finite .
For the limit to exist, redefine the objective function as , i.e.,
For any value of , the function has the same minimizer as , that is,
Claim 3. For any (fixed) ,
The proof proceeds by analyzing some important properties of the limiting function .
Claim 4. The function is convex in . Moreover, the function is strictly convex for and .
is minimized at , where
Observe that Claim 6 is the desired result, but for the limiting objective function . The remainder of the proof proceeds to show that this result holds even for the objective function , when the score is large enough. Define . We first show that (i) there exists such that , and then (ii) show that in this case, we have .
To prove part (i), we first analyze how functions and relate to each other. Using Claim 3, for any fixed , by definition of the limit, for any , there exists (which could be a function of ) such that, for all , we have
For a given , denote the corresponding value of by . And, let denote the set of all values of for which Equation (13) holds for .
for every .
i.e. is in an -band around throughout this region. And observe that this band gets smaller as is decreased (which is achieved at a larger value of ).
To bound the distance between , the minimizer of , and , the minimizer of , we bound the distance between the objective function values at these points.
Although does not minimize , Claim 8 says that the objective value at cannot be more than larger than its minimum, . We use this to bound the distance between and the minimizer . Observe that falls in the -level set of . So, we next look at a specific level set of .
Observe that a minimum exists (infimum is not required) for the minimization in (16) because we are minimizing over the closed set and is continuous.
For any fixed , Equation (12) shows that is bounded away from . Hence, Claim 4 shows that is strictly convex at and in the region around . Further, is convex everywhere else. Coupling this with the fact that (16) minimizes along points not arbitrarily close to the minimizer , we have .
Define the level set of with respect to :
For every , we have
Define , and set . Then, set as before. Applying Claim 8, we obtain
In other words, . And applying Claim 9, we obtain , completing part (i).
This implies that , which means
Using these properties, we have
where the first inequality holds because of the first part of (17), the equality holds because and the second inequality holds because of the second part of (17). Therefore, for , the aggregate scores of the two papers are such that
Proof of Claim 1.
Take arbitrary with , and let . We show that . For this, we will first show that either (i) is not parallel to , (ii) is not parallel to or (iii) is not parallel to . For the sake of contradiction, assume that this is not true. That is, assume is parallel to , is parallel to , and is parallel to . This implies that
where . (Footnote 3: A boundary case not captured here is when is exactly one of the points or , leading to or being zero, respectively. But for this case, it is easy to prove that the other two pairs of vectors cannot be parallel unless .) Note that none of can be because . The second equation tells us that and the third one tells us that . So, either or . But from the first equation, . So if