Peer review serves as an effective solution for quality evaluation in reviewing processes, especially in academic paper review (Dörfler et al., 2017; Shah et al., 2017) and massive open online courses (MOOCs) (Díez Peláez et al., 2013; Piech et al., 2013; Shah et al., 2013). However, despite its scalability, competitive peer review faces the serious challenge of being vulnerable to strategic manipulations (Anderson et al., 2007; Thurner and Hanel, 2011; Alon et al., 2011; Kurokawa et al., 2015; Kahng et al., 2017). By giving lower scores to competitive submissions, reviewers may be able to increase the chance that their own submissions get accepted. For instance, a recent experimental study (Balietti et al., 2016) on peer review of art, published in the Proceedings of the National Academy of Sciences (USA), concludes
“…competition incentivizes reviewers to behave strategically, which reduces the fairness of evaluations and the consensus among referees.”
As noted by Thurner and Hanel (2011), even a small number of selfish, strategic reviewers can drastically reduce the quality of the scientific standard. In the context of conference peer review, Langford (2008) calls academia inherently adversarial:
“It explains why your paper was rejected based on poor logic. The reviewer wasn’t concerned with research quality, but rather with rejecting a competitor.”
Langford states that a number of people agree with this viewpoint. Thus the importance of peer review in academia and its considerable influence over the careers of researchers significantly underscores the need to design peer review systems that are insulated from strategic manipulations.
In this work, we present a higher-level framework to address the problem of strategic behavior in conference peer review. We present an informal description of the framework here and formalize it later in the paper. The problem setting comprises a number of submitted papers and a number of reviewers. We are given a graph, which we term the “conflict graph”. The conflict graph is a bipartite graph with the reviewers and papers as the two partitions of vertices, and an edge between any reviewer and paper if that reviewer has a conflict with that paper. Conflicts may arise due to authorship (the reviewer is an author of the paper) or for other reasons such as institutional conflicts. Given this conflict graph, there are two design steps in the peer review procedure: (i) assigning each paper to a subset of reviewers for review, and (ii) aggregating the reviews provided by the reviewers to give a final evaluation of each paper. Under our framework, the goal is to design these two steps so that the peer-review procedure satisfies two properties: strategyproofness and efficiency.
The first goal is to design peer-review procedures that are strategyproof with respect to the given conflict graph. A peer-review procedure is said to be strategyproof if no reviewer can change the outcome for any papers with which she/he has a conflict. This definition is formalized later in the paper. Strategyproofness not only reassures the authors that the review process is fair, but also ensures that the authors receive proper feedback for their work. We note that a strategyproof peer-review procedure alone is inadequate with respect to any practical requirements – simply giving out a fixed, arbitrary evaluation makes the peer-review procedure strategyproof.
Consequently, in addition to requiring strategyproofness, our framework measures the peer-review procedure with another yardstick – that of efficiency. Informally, the efficiency of a peer-review procedure is a measurement of how well the final outcome reflects reviewers’ assessments of the quality of the submissions, or a measurement of the accuracy in terms of the final acceptance decisions. There are several ways to define efficiency – from a social choice perspective or a statistical perspective. In this paper, we consider efficiency in terms of the notion of unanimity in social choice theory: an agreement among all reviewers must be reflected in the final aggregation.
In addition to the conceptual contribution based on this framework, we make several technical contributions towards this important problem. We first design a peer review algorithm which theoretically guarantees strategyproofness along with a notion of efficiency that we term “group unanimity”. Our result requires only a mild assumption on the conflict graph of the peer-review design task. We show that this assumption indeed holds in practice via an empirical analysis of the submissions made to the 2017 International Conference on Learning Representations (ICLR-17; https://openreview.net/group?id=ICLR.cc/2017/conference). Our algorithm is based on the popular partitioning method, and our positive results can be regarded as generalizing it to the setting of conference peer review. We further demonstrate a simple trick to make the partitioning method more practically appealing for conference peer review and validate it on the ICLR-17 data.
We then complement our positive results with negative results showing that one cannot expect to meet requirements that are much stronger than that provided by our algorithm. In particular, we show that under mild assumptions on the authorships, there is no algorithm that can be both strategyproof and “pairwise unanimous”. Pairwise unanimity is a stronger notion of efficiency than group unanimity, and is also known as Pareto efficiency in the literature of social choice (Brandt et al., 2016). We show that our negative result continues to hold even when the notion of strategyproofness is made extremely weak. We then provide a conjecture and insightful results on the impossibility when the assignment satisfies a simple “connectivity” condition. Finally, we connect back to the traditional settings in social choice theory, and show an impossibility when every reviewer reviews every paper. These negative results highlight the intrinsic hardness in designing strategyproof conference review systems.
2 Related Work
As early as the 1970s, Gibbard and Satterthwaite recognized the importance of strategyproof voting rules in the setting of social choice (Gibbard, 1973; Satterthwaite, 1975). More recently, the fact that prominent peer review mechanisms, such as the one used by the National Science Foundation (Hazelrigg, 2013) and the one used for telescope time allocation (Merrifield and Saari, 2009), are manipulable has further called for strategyproof peer review mechanisms.
Our work is most closely related to a series of works on strategyproof peer selection (De Clippel et al., 2008; Alon et al., 2011; Holzman and Moulin, 2013; Fischer and Klimm, 2015; Kurokawa et al., 2015; Aziz et al., 2016; Kahng et al., 2017), where agents cannot benefit from misreporting their preferences over other agents. (Some past literature refers to this requirement as ensuring that agents are “impartial”. However, the term “impartial” also has connotations of (possibly implicit) biases due to extraneous factors such as some features of the agents (Hojat et al., 2003). In this paper, we deliberately use the term “strategyproof” to make the scope of our contribution clear: we do not address implicit biases.) De Clippel et al. (2008) consider strategyproof decision making under the setting where a divisible resource is shared among a set of agents. Later, Alon et al. (2011) and Holzman and Moulin (2013) consider strategyproof peer approval voting, where each agent nominates a subset of agents and the goal is to select one agent with many approvals. Alon et al. (2011) propose a randomized strategyproof mechanism using partitioning that achieves a provable approximation guarantee relative to the deterministic but non-strategyproof mechanism that simply selects the agent with maximum approvals. Bousquet et al. (2014) and Fischer and Klimm (2015) further extended and analyzed this mechanism to provide an optimal approximation ratio in expectation. Although the first partitioning-based mechanism partitions all the voters into two disjoint subsets, this has recently been extended to a $k$-partition by Kahng et al. (2017). In all these works, each agent is essentially required to evaluate all the other agents except herself. This is impractical for conference peer review, where each reviewer has only limited time and energy to review a small subset of submissions. In light of such constraints, Kurokawa et al. (2015) propose an impartial mechanism (Credible Subset) and provide associated approximation guarantees for a setting in which each agent is only required to review a few other agents. Credible Subset is a randomized mechanism that outputs a subset of $k$ agents, but with non-zero probability it returns an empty set. Building on the work of De Clippel et al. (2008), Aziz et al. (2016) propose a mechanism for peer selection, termed Dollar Partition, which is strategyproof and satisfies a natural monotonicity property. Empirically, the authors show that Dollar Partition outperforms Credible Subset consistently and, in the worst case, is better than the partition-based approach. However, even if the target output size is $k$, Dollar Partition may return a subset of size strictly larger than $k$. Our positive results, specifically our Divide-and-Rank algorithm presented subsequently, borrow heavily from this line of literature. That said, our work addresses the application of conference peer review, which is significantly more general and challenging than the settings considered in past works.
Our setting of conference peer review is more challenging as compared to these past works as each reviewer may author multiple papers and moreover each paper may have multiple authors as reviewers. Specifically, the conflict graph under conference peer review is a general bipartite graph, where conflicts between reviewers and papers can arise not only because of authorships, but also advisor-advisee relationships, institutional conflicts, etc. In contrast, past works focus on applications of peer-grading and grant proposal review, and hence consider only one-to-one conflict graphs (that is, where every reviewer is conflicted with exactly one paper).
Apart from the most important difference mentioned above, there are a couple of other differences of this work as compared to some past works. In this paper we focus on ordinal preferences where each reviewer is asked to give a total ranking of the assigned papers, as opposed to providing numeric ratings. We do so inspired by past literature (Barnett, 2003; Stewart et al., 2005; Douceur, 2009; Tsukida and Gupta, 2011; Shah et al., 2013, 2016) which highlights the benefits of ordinal data in terms of avoiding biases and miscalibrations as well as allowing for a more direct comparison between papers. Secondly, while most previous mechanisms either output a single paper or a subset of papers, we require our mechanism to output a total ranking over all papers. We consider this requirement since this automated output in practice will be used by the program chairs as a guideline to make their decisions, and this more nuanced data comprising the ranking of the papers can be more useful towards this goal.
A number of other papers study various other aspects of conference peer review. The works Hartvigsen et al. (1999); Charlin and Zemel (2013); Garg et al. (2010); Stelmakh et al. (2018) design algorithms for assigning reviewers to papers under various objectives, and these objectives and algorithms may in fact be used as alternative definitions of the objective of “efficiency” studied in the present paper. The papers Roos et al. (2011); Ge et al. (2013); Wang and Shah (2018) consider review settings where reviewers provide scores to each paper, with the aim of addressing the problems of biases and miscalibrations in these scores. Finally, experiments and empirical evaluations of conference peer reviews can be found in Lawrence and Cortes (2014); Mathieus (2008); Connolly et al. (2014); Shah et al. (2017); Tomkins et al. (2017).
3 Problem setting
In this section, we first give a brief introduction to the setting of our problem, and then introduce the notation used in the paper. Finally, we formally define the concepts and properties discussed in the subsequent sections.
The modern review process is governed by four key steps: (i) a number of papers are submitted for review; (ii) each paper is assigned to a number of reviewers; (iii) reviewers provide their feedback on the papers they are reviewing; and (iv) the feedback from all reviewers is aggregated to make final decisions on the papers. Let $m$ be the number of reviewers and $n$ be the number of submitted papers. Define $\mathcal{R} := \{r_1, \ldots, r_m\}$ to be the set of reviewers and $\mathcal{P} := \{p_1, \ldots, p_n\}$ to be the set of submitted papers.
The review process must deal with a number of conflicts of interest. The most common form of conflict is the authorship conflict: a large number of reviewers are also authors of submitted papers. Additional sources of conflicts may include advisor-advisee relationships between reviewers and authors of papers, institutional conflicts, etc. To characterize conflicts of interest, we use a bipartite graph $\mathcal{C}$ with vertices $(\mathcal{R}, \mathcal{P})$, where there is an edge between a reviewer $r$ and a paper $p$ if there exists some conflict of interest between reviewer $r$ and paper $p$. In this graph we omit the authors of papers who are not reviewers. Reviewers who do not have conflicts of interest with any paper are nodes with no edges. Given the set of submitted papers and reviewers, this graph is fixed and cannot be controlled. Note that the conflict graph defined above can be viewed as a generalization of the authorship graph in the previously studied settings (Merrifield and Saari, 2009; Alon et al., 2011; Holzman and Moulin, 2013; Fischer and Klimm, 2015; Kurokawa et al., 2015; Aziz et al., 2016; Kahng et al., 2017) of peer grading and grant proposal review, where each reviewer (paper) is connected to at most one paper (reviewer).
The review process is modeled by a second bipartite graph $\mathcal{G}$, termed the review graph, that also has the reviewers and papers as its vertices. This review graph has an edge between a reviewer and a paper if that reviewer reviews that paper. For every reviewer $r_i$, $i \in [m]$ (we use the standard notation $[\kappa]$ to represent the set $\{1, \ldots, \kappa\}$ for any positive integer $\kappa$), we let $\mathcal{P}_i \subseteq \mathcal{P}$ denote the set of papers assigned to this reviewer for review, or in other words, the neighborhood of node $r_i$ in the bipartite graph $\mathcal{G}$. The program chairs of the conference are free to choose this graph, subject to certain constraints and preferences. To ensure balanced workloads across reviewers, we require that every reviewer is assigned at most $\mu$ papers for some integer $\mu$. In other words, every node in $\mathcal{R}$ has at most $\mu$ neighbors (in $\mathcal{P}$) in graph $\mathcal{G}$. Additionally, each paper must be reviewed by a certain minimum number of reviewers, and we denote this minimum number as $\lambda$. Thus every node in the set $\mathcal{P}$ must have at least $\lambda$ neighbors (in $\mathcal{R}$) in the graph $\mathcal{G}$. For any (directed or undirected) graph $\mathcal{H}$, we let the notation $E_{\mathcal{H}}$ denote the set of (directed or undirected, respectively) edges in graph $\mathcal{H}$.
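The two degree constraints on the review graph can be checked mechanically. The following is a minimal sketch, assuming the review graph is given as a list of (reviewer, paper) pairs; the function name and the parameter names `mu` (maximum papers per reviewer) and `lam` (minimum reviewers per paper) are illustrative choices, not notation taken from the paper.

```python
def satisfies_load_constraints(review_edges, reviewers, papers, mu, lam):
    """Check the two degree constraints on a review graph.

    review_edges: iterable of (reviewer, paper) pairs.
    mu:  assumed name for the maximum number of papers per reviewer.
    lam: assumed name for the minimum number of reviewers per paper.
    """
    load = {r: 0 for r in reviewers}      # papers assigned to each reviewer
    coverage = {p: 0 for p in papers}     # reviewers assigned to each paper
    for r, p in set(review_edges):        # ignore duplicate edges
        load[r] += 1
        coverage[p] += 1
    return (all(k <= mu for k in load.values())
            and all(k >= lam for k in coverage.values()))


reviewers = ["r1", "r2"]
papers = ["p1", "p2"]
edges = [("r1", "p1"), ("r2", "p1"), ("r1", "p2"), ("r2", "p2")]
ok = satisfies_load_constraints(edges, reviewers, papers, mu=2, lam=2)
too_tight = satisfies_load_constraints(edges, reviewers, papers, mu=1, lam=2)
```

In this toy instance every reviewer reads both papers, so the check passes with a cap of two papers per reviewer and fails with a cap of one.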
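At the end of the reviewing period, each reviewer provides a total ranking of the papers that she/he reviewed. For any set of papers $\mathcal{P}' \subseteq \mathcal{P}$, we let $\Pi(\mathcal{P}')$ denote the set of all permutations of papers in $\mathcal{P}'$. Furthermore, for any paper $p_j \in \mathcal{P}'$ and any permutation $\pi \in \Pi(\mathcal{P}')$, we let $\pi_j$ denote the position of paper $p_j$ in the permutation $\pi$. At the end of the reviewing period, each reviewer $r_i$ submits a total ranking $\pi^{(i)} \in \Pi(\mathcal{P}_i)$ of the papers in $\mathcal{P}_i$. We define a (partial) ranking profile $\boldsymbol{\pi} := (\pi^{(1)}(\mathcal{P}_1), \ldots, \pi^{(m)}(\mathcal{P}_m))$ as the collection of rankings from all the reviewers. When the assignment of papers to reviewers is fixed, we use the shorthand $(\pi^{(1)}, \ldots, \pi^{(m)})$ for profile $\boldsymbol{\pi}$. For any subset of papers $\mathcal{P}' \subseteq \mathcal{P}$, we let $\boldsymbol{\pi}_{\mathcal{P}'}$ denote the restriction of $\boldsymbol{\pi}$ to only the induced rankings on $\mathcal{P}'$. Finally, when the ranking under consideration is clear from context, we use the notation $p_x \succ p_y$ to say that paper $p_x$ is ranked higher than paper $p_y$ in the ranking.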
Under this framework, the goal is to jointly design: (a) a paper-reviewer assignment scheme, that is, the edges of the review graph $\mathcal{G}$, and (b) an associated review aggregation rule $f$ which maps the ranking profile $\boldsymbol{\pi}$ to an aggregate total ranking of all papers. (To be clear, the function $f$ is tied to the assignment graph $\mathcal{G}$. The graph $\mathcal{G}$ specifies the sets $\mathcal{P}_1, \ldots, \mathcal{P}_m$, and then the function $f$ takes permutations of these sets of papers as its inputs. We omit this from the notation for brevity.) For any aggregation function $f$, we let $f_j(\boldsymbol{\pi})$ be the position of paper $p_j$ when the input to $f$ is the profile $\boldsymbol{\pi}$.
We consider two factors for designing the review process $(\mathcal{G}, f)$. The first factor is efficiency: the output ranking should reflect the opinions of most reviewers. We capture this notion of efficiency by a function $E$ that takes a review graph $\mathcal{G}$ and an aggregation rule $f$ as inputs and outputs a measure of efficiency (with higher values indicating a higher efficiency). This function may be chosen by the program chairs of the conference; we discuss below some choices which we use in this paper. The second factor we consider is strategyproofness: we want to make sure that no reviewer can benefit from misreporting her preferences. Then the goal is to solve the following optimization problem:
$$\max_{(\mathcal{G}, f)} \; E(\mathcal{G}, f) \quad \text{subject to} \quad (\mathcal{G}, f) \text{ is strategyproof.} \tag{1}$$
We denote the optimal value of (1) by $E^*$. In what follows we define the notions of strategyproofness and efficiency that any conference review mechanism should satisfy under our paper-review setting. Inspired by the theory of social choice, in this paper we define the notion of efficiency via two variants of “unanimity”, and we also discuss two natural notions of strategyproofness. That said, we emphasize that the formulation in (1) is a general framework that can additionally incorporate other notions of efficiency such as statistical efficiency. For any choice of the efficiency function $E$ under our framework, the goal is to maximize the efficiency of the review mechanism under the constraint that it should also be strategyproof.
3.1 Efficiency (unanimity)
In this paper, we consider efficiency of a peer-review process in terms of the notion of unanimity. Unanimity is one of the most prevalent and classic properties to measure the efficiency of a voting system in the theory of social choice (Fishburn, 2015).
At a colloquial level, unanimity states that when there is a common agreement among all reviewers, then the aggregation of their opinions must also respect this agreement. In this paper we discuss two kinds of unanimity, termed group unanimity (GU) and pairwise unanimity (PU). Both kinds of unanimity impose requirements on the aggregation function for any given reviewer assignment. Specifically, both notions of unanimity are represented by efficiency functions which are binary and set as 1 if the respective notion of unanimity is satisfied and 0 otherwise; we denote the efficiency function $E$ as $E^{\mathrm{PU}}$ when considering pairwise unanimity and as $E^{\mathrm{GU}}$ when considering group unanimity.
We first define group unanimity:
Definition 3.1 (Group Unanimity, GU).
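We define $(\mathcal{G}, f)$ to be group unanimous (GU) if the following condition holds for every possible profile $\boldsymbol{\pi}$. If there is a non-empty set of papers $\mathcal{P}' \subset \mathcal{P}$ such that every reviewer ranks the papers she reviewed from $\mathcal{P}'$ higher than those she reviewed from $\mathcal{P} \setminus \mathcal{P}'$, then $f(\boldsymbol{\pi})$ must have $p_x \succ p_y$ for every pair of papers $p_x \in \mathcal{P}'$ and $p_y \in \mathcal{P} \setminus \mathcal{P}'$ such that at least one reviewer has reviewed both $p_x$ and $p_y$. The efficiency objective $E^{\mathrm{GU}}(\mathcal{G}, f) = 1$ if $(\mathcal{G}, f)$ is group unanimous, and $0$ otherwise.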
Intuitively, group unanimity says that if papers can be partitioned into two sets such that every reviewer who has reviewed papers from both sets agrees that the papers she has reviewed from the first set are better than what she reviewed from the second set, then the final output ranking should respect this agreement.
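To make the condition concrete, here is a small sketch that tests it for a single profile and a single candidate set of "better" papers. It is only a partial check under stated assumptions: the definition quantifies over every profile and every such set, whereas this helper (whose name and list-based ranking representation are hypothetical) looks at one instance. Rankings are lists ordered from best to worst.

```python
def group_unanimity_violations(profile, output, top_set):
    """Check group unanimity for one profile and one candidate set.

    profile: dict reviewer -> list of reviewed papers, best first.
    output:  aggregate ranking of all papers, best first.
    top_set: candidate set of papers claimed to dominate the rest.
    Returns None if some reviewer disagrees with the split (the definition
    then imposes nothing), otherwise a sorted list of violating pairs.
    """
    top = set(top_set)
    for ranking in profile.values():
        seen_outside = False
        for paper in ranking:
            if paper not in top:
                seen_outside = True
            elif seen_outside:          # a top paper ranked below an outside paper
                return None
    position = {p: i for i, p in enumerate(output)}
    violations = set()
    for ranking in profile.values():    # only co-reviewed pairs are constrained
        for a in ranking:
            for b in ranking:
                if a in top and b not in top and position[a] > position[b]:
                    violations.add((a, b))
    return sorted(violations)


profile = {"r1": ["p1", "p3"], "r2": ["p2", "p4"]}
respected = group_unanimity_violations(profile, ["p1", "p3", "p2", "p4"], {"p1", "p2"})
broken = group_unanimity_violations(profile, ["p3", "p1", "p2", "p4"], {"p1", "p2"})
```

Note that the first output places p3 above p2 without any violation, because no reviewer reviewed both of those papers.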
Our second notion of unanimity, termed pairwise unanimity, is a local refinement of group unanimity. This notion is identical to the classical notion of unanimity stated in Arrow’s impossibility theorem (Arrow, 1950), with one difference: the classical unanimity considers every reviewer to review all papers (that is, $\mathcal{P}_i = \mathcal{P}$ for every reviewer $r_i$), whereas our notion is also defined for settings where reviewers may review only subsets of papers.
Definition 3.2 (Pairwise Unanimity, PU).
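We define $(\mathcal{G}, f)$ to be pairwise unanimous (PU) if the following condition holds for every possible profile $\boldsymbol{\pi}$ and every pair of papers $p_x, p_y \in \mathcal{P}$: if at least one reviewer has reviewed both $p_x$ and $p_y$, and all the reviewers that have reviewed both $p_x$ and $p_y$ agree on $p_x \succ p_y$, then $f(\boldsymbol{\pi})$ must also have $p_x \succ p_y$. The efficiency objective $E^{\mathrm{PU}}(\mathcal{G}, f) = 1$ if $(\mathcal{G}, f)$ is pairwise unanimous, and $0$ otherwise.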
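The pairwise condition can likewise be tested on one profile at a time. The sketch below is illustrative only: the helper name is hypothetical, rankings are assumed to be lists ordered from best to worst, and passing this check for one profile does not establish pairwise unanimity, which must hold for every profile.

```python
def pairwise_unanimity_violations(profile, output):
    """List pairs (a, b) where every reviewer who reviewed both a and b
    ranks a above b, yet the aggregate ranking places b above a.

    profile: dict reviewer -> list of reviewed papers, best first.
    output:  aggregate ranking of all papers, best first.
    """
    position = {p: i for i, p in enumerate(output)}
    violations = []
    for a in output:
        for b in output:
            if a == b:
                continue
            # One vote per reviewer who reviewed both papers.
            votes = [r.index(a) < r.index(b)
                     for r in profile.values() if a in r and b in r]
            if votes and all(votes) and position[a] > position[b]:
                violations.append((a, b))
    return violations


profile = {"r1": ["p1", "p2"], "r2": ["p1", "p2", "p3"]}
consistent = pairwise_unanimity_violations(profile, ["p1", "p2", "p3"])
inconsistent = pairwise_unanimity_violations(profile, ["p2", "p1", "p3"])
```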
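An important property is that pairwise unanimity is stronger than group unanimity: $E^{\mathrm{PU}}(\mathcal{G}, f) \le E^{\mathrm{GU}}(\mathcal{G}, f)$, that is, if $(\mathcal{G}, f)$ is pairwise unanimous, then $(\mathcal{G}, f)$ is also group unanimous.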
We now move on to our other requirement in peer review, that of strategyproofness.
Intuitively, strategyproofness means that a reviewer cannot benefit from being dishonest; in the context of conference review, this means that a reviewer cannot change the position of her conflicting papers by changing her own ranking. Strategyproofness is defined with respect to a given conflict graph, which we denote by $\mathcal{C}$; we recall the notation $E_{\mathcal{C}}$ as the set of edges of graph $\mathcal{C}$.
Definition 3.3 (Strategyproofness, SP).
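A review process $(\mathcal{G}, f)$ is called strategyproof with respect to a conflict graph $\mathcal{C}$ if for every reviewer $r_i$ and paper $p_j$ such that $(r_i, p_j) \in E_{\mathcal{C}}$ the following condition holds: for every pair of profiles (under assignment $\mathcal{G}$) that differ only in the ranking given by reviewer $r_i$, the position of $p_j$ is unchanged. Formally, for all profiles $\boldsymbol{\pi}$ and $\boldsymbol{\pi}'$ that agree on the ranking of every reviewer other than $r_i$, it must be that $f_j(\boldsymbol{\pi}) = f_j(\boldsymbol{\pi}')$.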
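For very small instances, the definition can be verified by exhaustive enumeration. The sketch below is a brute-force illustration, not a practical procedure: it enumerates every profile, so its cost grows factorially. The function names, the dictionary-based assignment, and the toy Borda-style aggregator are all assumptions introduced for this example.

```python
from itertools import permutations, product


def is_strategyproof(assignment, conflicts, aggregate):
    """Brute-force check of strategyproofness on a tiny instance.

    assignment: dict reviewer -> tuple of papers she reviews.
    conflicts:  list of (reviewer, paper) edges of the conflict graph.
    aggregate:  function mapping a profile (dict reviewer -> ranking,
                best first) to a list ranking all papers, best first.
    """
    reviewers = sorted(assignment)
    options = [list(permutations(assignment[r])) for r in reviewers]
    for combo in product(*options):
        base = dict(zip(reviewers, combo))
        base_out = aggregate(base)
        for r, p in conflicts:
            for alternative in permutations(assignment[r]):
                changed = dict(base)
                changed[r] = alternative      # only reviewer r deviates
                if aggregate(changed).index(p) != base_out.index(p):
                    return False
    return True


def borda(profile):
    """Toy aggregator: sum of (list length - index), ties broken by name."""
    points = {}
    for ranking in profile.values():
        for i, paper in enumerate(ranking):
            points[paper] = points.get(paper, 0) + len(ranking) - i
    return sorted(points, key=lambda paper: (-points[paper], paper))


assignment = {"r1": ("p2", "p3"), "r2": ("p1", "p3")}
conflicts = [("r1", "p1"), ("r2", "p2")]
borda_is_sp = is_strategyproof(assignment, conflicts, borda)
constant_is_sp = is_strategyproof(assignment, conflicts, lambda _: ["p1", "p2", "p3"])
```

The constant aggregator passes, in line with the earlier remark that a fixed, arbitrary evaluation is trivially strategyproof, while the score-based aggregator fails because reviewer r1 can shift the position of her own paper p1 by reordering p2 and p3.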
Having established these preliminaries, we now move on to the main results of this paper.
4 Positive results: Group unanimity and strategyproofness
In this section we consider the design of reviewer assignments and aggregation rules for strategyproofness and group unanimity (efficiency). It is not hard to see that strategyproofness and group unanimity cannot be simultaneously guaranteed for arbitrary conflict graphs $\mathcal{C}$, for instance, when $\mathcal{C}$ is a fully connected bipartite graph. Prior works on this topic consider a specific class of conflict graphs, those with one-to-one relations between papers and reviewers, which do not capture conference peer review settings. We consider a more general class of conflict graphs and present an algorithm based on the partitioning method (Alon et al., 2011), which we show achieves both strategyproofness and group unanimity. We then empirically demonstrate, using submission data from the ICLR-17 conference, that this class of conflict graphs is indeed representative of peer review settings. In addition to feasibility, we present a simple trick to significantly improve the practical appeal of our algorithm (and more generally the partitioning method) for conference peer review.
4.1 The Divide-and-Rank Algorithm
We now present our algorithm “Divide-and-Rank” for reviewer assignment and rank aggregation. At a high level, our algorithm performs a partition of the reviewers and papers for assignment, and aggregates the reviews by computing a ranking which is consistent with any group agreements. The Divide-and-Rank algorithm works for a general conflict graph as long as the conflict graph can be divided into two reasonably sized disconnected components (we verify this assumption in the next section). The algorithm is simple yet flexible in that the assignment within each partition and the aggregation among certain groups of papers can be done using any existing algorithm for assignment and aggregation respectively. This flexibility is useful as it allows one to further optimize various other metrics in addition to strategyproofness and unanimity.
The Divide-and-Rank assignment algorithm and Divide-and-Rank aggregation algorithm are formally presented as Algorithm 1 and Algorithm 2 respectively, and we discuss the details in the next two paragraphs.
The Divide-and-Rank assignment algorithm begins by partitioning the conflict graph $\mathcal{C}$ into two disconnected components that meet the requirements specified by $\mu$ and $\lambda$. This is achieved using the subroutine Partition. Partition first runs a breadth-first search (BFS) to partition the original conflict graph into $K$ connected components, where the $k$th connected component contains $m_k$ reviewers and $n_k$ papers. Next, the algorithm uses dynamic programming to compute all the possible subset sums achievable by the connected components. Here the table entry $T[k, m', n'] = 1$ means that there exists a partition of the first $k$ components such that one side of the partition has $m'$ reviewers and $n'$ papers, and $T[k, m', n'] = 0$ otherwise. The last step is to check whether there exists a subset satisfying the requirement and, if so, to run a standard backtracking algorithm along the table to find the actual subset. Clearly, Partition runs in $O(Kmn)$ time, and since $K \le m + n$, it runs in polynomial time in the size of the input conflict graph $\mathcal{C}$.
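The Partition subroutine described above can be sketched as follows. This is an illustrative reading of the prose, not the paper's Algorithm 1: the function names are hypothetical, the table of reachable sums is stored sparsely in dictionaries, and the feasibility test (each side's reviewer capacity `mu * #reviewers` must cover `lam` reviews for every paper on the other side) is an assumption standing in for the requirement stated in the paper.

```python
from collections import deque


def connected_components(reviewers, papers, conflict_edges):
    """BFS over the bipartite conflict graph; nodes are ("r", id) or ("p", id)."""
    adj = {("r", r): [] for r in reviewers}
    adj.update({("p", p): [] for p in papers})
    for r, p in conflict_edges:
        adj[("r", r)].append(("p", p))
        adj[("p", p)].append(("r", r))
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def partition(reviewers, papers, conflict_edges, mu, lam):
    """Return (side_a, side_b) as lists of tagged nodes, or None if no split qualifies."""
    components = connected_components(reviewers, papers, conflict_edges)
    sizes = [(sum(kind == "r" for kind, _ in c), sum(kind == "p" for kind, _ in c))
             for c in components]
    m, n = len(reviewers), len(papers)
    # reach[k] maps each (reviewers, papers) total achievable with the first k
    # components to True if component k-1 is taken to reach it, else False.
    reach = [{(0, 0): False}]
    for cr, cp in sizes:
        current = {}
        for r, p in reach[-1]:
            current.setdefault((r, p), False)
            current.setdefault((r + cr, p + cp), True)
        reach.append(current)
    for r, p in reach[-1]:
        # Assumed feasibility test (a counting condition, not taken from the source).
        if mu * r >= lam * (n - p) and mu * (m - r) >= lam * p:
            chosen = set()
            for k in range(len(sizes), 0, -1):      # backtrack through the table
                if reach[k][(r, p)]:
                    chosen.add(k - 1)
                    r, p = r - sizes[k - 1][0], p - sizes[k - 1][1]
            side_a = [v for k, c in enumerate(components) if k in chosen for v in c]
            side_b = [v for k, c in enumerate(components) if k not in chosen for v in c]
            return side_a, side_b
    return None
```

Because whole components are moved to one side or the other, no conflict edge can cross the returned split. A dense three-dimensional table would match the prose more literally; the dictionaries store the same information while skipping unreachable entries.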
Then the algorithm assigns papers to reviewers in a fashion that guarantees that each paper is reviewed by at least $\lambda$ reviewers and each reviewer reviews at most $\mu$ papers. The assignment of papers in any individual component (to reviewers in the other component) can be done using any assignment algorithm (taken as an input) as long as the algorithm can satisfy the $(\mu, \lambda)$-requirements. Possible choices for the algorithm include the popular Toronto paper matching system (Charlin and Zemel, 2013) or others (Hartvigsen et al., 1999; Garg et al., 2010; Stelmakh et al., 2018).
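As a placeholder for the pluggable assignment step, the sketch below uses plain round-robin assignment across the two sides. This is a deliberately simple stand-in, not the Toronto paper matching system or any algorithm from the cited works, and it ignores reviewer expertise entirely; the function name and the parameters `mu` and `lam` are assumptions for illustration.

```python
def cross_assign(reviewers_a, papers_a, reviewers_b, papers_b, mu, lam):
    """Assign each side's papers to the other side's reviewers, round-robin.

    Returns a dict reviewer -> list of assigned papers. Raises ValueError
    when a side lacks the capacity to give every paper `lam` distinct
    reviewers without exceeding `mu` papers per reviewer.
    """
    assignment = {r: [] for r in list(reviewers_a) + list(reviewers_b)}
    for pool, batch in ((list(reviewers_a), list(papers_b)),
                        (list(reviewers_b), list(papers_a))):
        if not batch:
            continue
        if len(pool) < lam or mu * len(pool) < lam * len(batch):
            raise ValueError("reviewer pool too small for the requested mu and lam")
        slot = 0
        for paper in batch:
            # lam consecutive slots hit lam distinct reviewers since lam <= len(pool)
            for _ in range(lam):
                assignment[pool[slot % len(pool)]].append(paper)
                slot += 1
    return assignment


plan = cross_assign(["r1", "r2"], ["p1", "p2"], ["r3", "r4"], ["p3", "p4"], mu=2, lam=2)
```

Since reviewers only ever receive papers from the opposite side of the partition, no reviewer is assigned a paper with which she has a conflict.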
We now move to the aggregation procedure in Algorithm 2. At a high level, the papers in each component are aggregated separately using the subroutine Contract-and-Sort. This aggregation in Contract-and-Sort is performed by identifying sets of papers that dominate one another, ensuring that any set of papers is necessarily ranked higher than any set which it dominates, and finally ranking the papers within each set using an arbitrary aggregation algorithm (taken as an input). Possible choices for the algorithm include the modified Borda count (Emerson, 2013), Plackett-Luce aggregation (Hajek et al., 2014), or others (Caragiannis et al., 2017). Moving back to the main algorithm, the two rankings returned by Contract-and-Sort for the two components are simply interlaced to obtain a total ranking over all the papers.
The following theorem now shows that Divide-and-Rank satisfies group unanimity and is strategyproof.
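Theorem 4.1.
Suppose the vertices of the conflict graph $\mathcal{C}$ can be partitioned into two groups such that there are no edges in $\mathcal{C}$ across the groups and that the two groups meet the size requirements specified by $\mu$ and $\lambda$. Then $E^* = 1$, that is, Divide-and-Rank is group unanimous and strategyproof.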
Recall that $E^*$ is the optimal value of (1), here with group unanimity as the efficiency function and strategyproofness as the constraint. The assignment (Algorithm 1) in Divide-and-Rank ensures strategyproofness, while the aggregation (Algorithm 2) gives it the unanimity property.
The Divide-and-Rank algorithm aptly handles the various nuances of real-world conference peer review, which render other algorithms inapplicable. These include the features that each reviewer can write multiple papers and each paper can have multiple authors, and furthermore that each reviewer may review only a subset of papers. Even under this challenging setting, our algorithm guarantees that no reviewer can influence the ranking of her own paper via strategic behavior, and it is efficient from a social choice perspective.
In the remainder of this subsection, we delve a little deeper into the interleaving step (Step 7) of the aggregation algorithm. At first glance, this interleaving, performed independently of the reviewers’ reports, may be a cause for concern. Indeed, assuming there is some ground truth ranking of all papers, and even under the assumption that the outputs of the Contract-and-Sort procedure are consistent with this ranking, the worst-case scenario is one where the interleaving causes papers to be placed at positions that are far from their respective positions in the true ranking. We show, however, that such a worst-case scenario is unlikely to arise when the ground truth ranking is independent of the conflict graph.
Suppose satisfies the conditions given in Theorem 4.1 and there exists constant such that . Assume the ground-truth ranking is chosen uniformly at random from all permutations in independent of , and that the two partial outputs of Contract-and-Sort in Algorithm 2 respect . Let the output ranking of Divide-and-Rank be . Then for every , for any , with probability at least , we have:
Proposition 4.2 shows that, with high probability, the maximum deviation between the aggregated ranking and the ground truth ranking is of lower order than the number of papers $n$. Hence, for $n$ large enough, such deviation is negligible when program chairs of conferences need to make accept/reject decisions, where the number of accepted papers usually scales linearly with $n$.
4.2 Analysis of ICLR-17 submissions
Table 1: Statistics of the authorship conflict graph for ICLR-17 submissions.
| Statistic | Value |
| Number of submitted papers | 489 |
| Number of distinct authors | 1417 |
| Average # papers written per author | 1.27 |
| Maximum # papers written by an author | 14 |
| Number of connected components | 253 |
| #authors, #papers in largest connected component | 371, 133 |
| #authors, #papers in second largest connected component | 65, 20 |
Our Divide-and-Rank algorithm is based on the partitioning method which relies on a partition of the set of authors and papers such that there is no conflict across the partition. The most prominent type of conflicts is authorships, and here we restrict attention to the authorship conflict graph. In this section, we empirically verify that the partitioning conditions indeed hold in a conference peer-review setting using data from the ICLR-17 conference. We then empirically demonstrate how to make the partitioning method more appealing for conference peer review. In particular, we show that removing only a small number of reviewers can result in a dramatic reduction in the size of the largest component in the conflict graph thereby providing great flexibility towards partitioning the papers and authors.
We analyzed all papers submitted to the ICLR-17 conference, with the given authorship relationship as the conflict graph. ICLR-17 received 489 submissions by 1,417 authors; we believe this dataset is a good representative of a medium-sized modern conference. In the analysis of this dataset, we instantiate the conflict graph as the authorship graph. It is important to note that we consider only the set of authors as the entire reviewer pool (since we do not have access to the actual reviewer identities). Adding reviewers from outside the set of authors would only improve the results, since these additional reviewers will have no edges in the (authorship) conflict graph.
We first investigate the existence of (moderately sized) components in the conflict graph. Our analysis shows that the authorship graph is not only disconnected, but also has more than 250 components. The largest connected component contains 133 (that is, about 27%) of all papers, and the second largest connected component is much smaller. We tabulate the results from our analysis in Table 1. These statistics indeed verify our assumption in Theorem 4.1 that the conflict graph is disconnected and can be divided into two disconnected parts of similar size.
The partitioning method has previously been considered for the problem of peer grading (Kahng et al., 2017). The peer grading setting is quite homogeneous in that each reviewer (student) goes through the same course and hence any paper (homework) can be assigned to any reviewer. In peer review, however, different reviewers typically have different areas of expertise, and hence their ability to review any paper varies with the area of the paper. In order to accommodate this diversity in areas of expertise in peer review, one must have greater flexibility in assigning papers to reviewers. In our analysis in Table 1 we saw that the largest connected component comprises 371 authors and 133 papers. It is reasonable to expect that a large number of reviewers with the expertise required to review these 133 papers would also fall in the same connected component, meaning that a naïve application of Divide-and-Rank to this data would assign these 133 papers to reviewers who may have significantly lower expertise for these papers. This is indeed a concern, and in what follows, we discuss a simple yet effective way to ameliorate this problem.
We show empirically using the ICLR-17 data that by removing only a small number of authors from the reviewer pool, we can make the conflict graph much sparser, allowing for a significantly more flexible application of our algorithm Divide-and-Rank (or, more generally, any partition-based algorithm). In more detail, we remove a small fraction of authors from the reviewer pool, using the simple heuristic of removing the authors with the maximum degree in the (authorship) conflict graph. We then study the statistics of the resulting conflict graph (with all papers but only the remaining reviewers) in terms of the numbers and sizes of the connected components. We present the results in Table 2. We see that on removing only a small fraction of authors (50 authors, which is only about 3.5% of all authors), the number of papers in the largest connected component reduces by 86% to just 18. Likewise, the number of authors in the largest connected component reduces to as few as 55 from 371 originally. These numbers thus demonstrate that, despite all the idiosyncrasies of conference peer review, Divide-and-Rank and the partitioning method can be made practically applicable for peer review.
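The max-degree removal heuristic can be sketched as follows. This is an illustrative sketch only: the function name, the tie-breaking rule (by author id), and the toy data are assumptions, and "largest" is measured here by number of papers and then by number of reviewers.

```python
def largest_component_after_removal(authorship, num_removed):
    """Drop the `num_removed` highest-degree authors from the reviewer pool.

    Returns ((papers, reviewers) in the largest component, removed authors).
    All papers are kept; only reviewer vertices are deleted.
    """
    degree = {}
    for authors in authorship.values():
        for author in authors:
            degree[author] = degree.get(author, 0) + 1
    # Ties are broken by author id so that the result is deterministic.
    ranked = sorted(degree, key=lambda a: (-degree[a], a))
    removed = set(ranked[:num_removed])

    adjacency = {}
    for paper, authors in authorship.items():
        paper_node = ("paper", paper)
        adjacency.setdefault(paper_node, set())
        for author in authors:
            if author in removed:
                continue
            author_node = ("author", author)
            adjacency[paper_node].add(author_node)
            adjacency.setdefault(author_node, set()).add(paper_node)

    seen, best = set(), (0, 0)
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, counts = [start], {"paper": 0, "author": 0}
        while stack:  # depth-first sweep of one component
            node = stack.pop()
            counts[node[0]] += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, (counts["paper"], counts["author"]))
    return best, removed


# Hypothetical toy data: one prolific author ("hub") ties three papers together.
toy_authorship = {
    "p1": ["hub", "a1"],
    "p2": ["hub", "a2"],
    "p3": ["hub", "a3"],
    "p4": ["a3", "a4"],
}
before, _ = largest_component_after_removal(toy_authorship, 0)
after, dropped = largest_component_after_removal(toy_authorship, 1)
print(before, after, dropped)
```

In the toy data, removing the single highest-degree author splits one component into three, which mirrors in miniature the effect reported for the ICLR-17 graph.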
Table 2: Numbers and sizes of the connected components of the conflict graph, by number of authors removed from the reviewer pool.
5 Negative Results
The positive results in the previous section focus on group unanimity, which is weaker than the conventional notion of unanimity (which we refer to as pairwise unanimity). Moreover, the algorithm had a disconnected review graph whereas the review graphs of (not strategyproof) conferences today are typically connected (Shah et al., 2017). It is thus natural to wonder about the extent to which these results can be strengthened. Can a peer-review system with a connected review graph satisfy these properties? Can a strategyproof peer-review system be pairwise unanimous? In this section we present some negative results toward these questions, thereby highlighting the critical impediments towards (much) stronger results.
Before we go to our results, we give another notion of strategyproofness, which is significantly weaker than the notion of strategyproofness (Definition 3.3), and is hence termed as weak strategyproofness. As compared to strategyproofness which is defined with respect to a given conflict graph, weak strategyproofness only requires the existence of a conflict graph (with non-zero reviewer-degrees) for which the review process is strategyproof.
Definition 5.1 (Weak Strategyproofness, WSP).
A review process is called weakly strategyproof, if for every reviewer , there exists some paper such that for every pair of distinct profiles (under assignment ) and , it is guaranteed that .
In other words, weak strategyproofness requires that for each reviewer there is at least one paper (not necessarily authored by this reviewer) whose ranking cannot be influenced by the reviewer. As the name suggests, strategyproofness is strictly stronger than weak strategyproofness, when each reviewer has at least one paper of conflict.
We define the notion of weak strategyproofness mainly for theoretical purposes; obviously WSP is too weak to be useful for practical applications. However, we show that even this extremely weak requirement is impossible to satisfy in situations of practical interest.
|Unanimity|Strategyproofness|Requirement on review graph|Possible?|Reference|
|Pairwise|None|Mild (see Corollary 5.2)|No|Theorem 5.1|
|Group|Weak|Mild (connected review graph)|Conjecture: No|Proposition 5.3|
We summarize our results in Table 3. Recall that we showed that Divide-and-Rank is group unanimous and strategyproof; as the first direction of possible extension, we show in Theorem 5.1 that the slightly stronger notion of pairwise unanimity is impossible to satisfy under mild assumptions, even without strategyproofness constraints. Then, in Section 5.2, we explore the second direction of extension by requiring a connected review graph; we give conjectures and insights suggesting that group unanimity and weak strategyproofness cannot both hold in this setting. Finally, in Theorem 5.4 we revert to the traditional setting of social choice, where every reviewer gives a total ranking of the set of all papers; we show that in this setting it is impossible for any review process to be pairwise unanimous and weakly strategyproof.
5.1 Impossibility of Pairwise Unanimity
We show in this section that pairwise unanimity is too strong to satisfy under mild assumptions. These assumptions are mild in the sense that a violation of the assumptions leads to severely limited and somewhat impractical choices of .
In order to precisely state our result, we first introduce the notion of a review-relation graph . Given a paper-review assignment , the review-relation graph is an undirected graph with as its vertices and where any two papers and are connected iff there exists at least one reviewer who reviews both the papers. With this preliminary in place, we are now ready to state the main result of this section:
Theorem 5.1.
There is no review process that is pairwise unanimous (that is, for every ), when the following condition holds: the review-relation graph contains a cycle of length 3 or more such that no single reviewer reviews all the papers in the cycle.
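To make the condition of Theorem 5.1 concrete, the sketch below builds the review-relation graph from an assignment and searches for a cycle of length 3 or more that no single reviewer covers. It is a brute-force illustration for small instances only; the function names and toy assignments are hypothetical, and paper ids are assumed to be mutually comparable (for example, strings).

```python
from itertools import combinations


def review_relation_graph(assignment):
    """Undirected graph on papers: p and q are adjacent iff some reviewer reviews both.

    `assignment` maps a reviewer id to the set of papers that reviewer reviews.
    """
    adjacency = {}
    for papers in assignment.values():
        for paper in papers:
            adjacency.setdefault(paper, set())
        for p, q in combinations(sorted(papers), 2):
            adjacency[p].add(q)
            adjacency[q].add(p)
    return adjacency


def find_uncovered_cycle(assignment):
    """Return a cycle (list of papers) of length >= 3 that no single reviewer
    reviews in full, or None if no such cycle exists. Exponential time."""
    adjacency = review_relation_graph(assignment)
    review_sets = [set(papers) for papers in assignment.values()]

    def extend(path, visited):
        for nxt in sorted(adjacency[path[-1]]):
            if nxt == path[0] and len(path) >= 3:
                if not any(set(path) <= s for s in review_sets):
                    return list(path)
            # Only visit papers larger than the start so that each simple
            # cycle is enumerated from its smallest vertex.
            if nxt not in visited and nxt > path[0]:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in sorted(adjacency):
        found = extend([start], {start})
        if found:
            return found
    return None


# Three reviewers each share one pair of papers: the triangle is uncovered.
triangle = {"r1": {"p1", "p2"}, "r2": {"p2", "p3"}, "r3": {"p1", "p3"}}
# One reviewer covers the whole triangle, so the condition is not met.
covered = {"r1": {"p1", "p2", "p3"}, "r2": {"p3", "p4"}}
print(find_uncovered_cycle(triangle), find_uncovered_cycle(covered))
```

In the first toy assignment, each reviewer can rank her own pair so that the three pairwise preferences form a cycle, which is the kind of profile no pairwise-unanimous ranking can respect.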
In the corollary below we give some direct implications of the condition in Theorem 5.1 when , that is, when every reviewer ranks the same number of papers.
Corollary 5.2.
Suppose . If is pairwise unanimous, the following conditions hold:
1. The review-relation graph does not contain any cycles of length or more.
2. The set of papers reviewed by any pair of reviewers and must satisfy the condition . In words, if a pair of reviewers review more than one paper in common, then they must review exactly the same set of papers.
3. The number of distinct sets in is at most .
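The second condition of the corollary is easy to check mechanically for a given assignment. The following is a small sketch, with hypothetical function names and toy data, that lists the reviewer pairs violating it and counts the distinct review sets referred to in the third condition.

```python
from itertools import combinations


def overlap_violations(assignment):
    """Pairs of reviewers who share more than one paper without reviewing
    exactly the same set. `assignment` maps reviewer id -> set of papers."""
    reviewers = sorted(assignment)
    return [
        (r1, r2)
        for r1, r2 in combinations(reviewers, 2)
        if len(assignment[r1] & assignment[r2]) > 1
        and assignment[r1] != assignment[r2]
    ]


def distinct_review_sets(assignment):
    """The distinct sets of papers that appear as some reviewer's review set."""
    return {frozenset(papers) for papers in assignment.values()}


# Hypothetical toy assignment with three papers per reviewer.
toy_assignment = {
    "r1": {"p1", "p2", "p3"},
    "r2": {"p1", "p2", "p4"},
    "r3": {"p1", "p2", "p3"},
}
violations = overlap_violations(toy_assignment)
print(violations, len(distinct_review_sets(toy_assignment)))
```

Here reviewers r1 and r3 are compatible with the condition because their sets are identical, whereas r2 shares two papers with each of them without matching either set.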
Remarks. In modern conferences like NIPS (Shah et al., 2017), each reviewer usually reviews around 3 to 6 papers. If we make the review process pairwise unanimous, by point 3 of Corollary 5.2 the number of distinct review sets is much smaller than the number of reviewers; this severely limits the design of review sets, since many reviewers would have to review identical sets of papers. Point 2 is a related, strong requirement, since the specialization of reviewers might not allow for such limiting of the intersection of review sets. For instance, in NIPS 2016 there are a large number of pairs of reviewers who review more than one common paper, but no pair reviews exactly the same set of papers (Shah et al., 2017). In general, Theorem 5.1 and Corollary 5.2 show that it is difficult to satisfy pairwise unanimity, even without considering strategyproofness. This justifies our choice of group unanimity in the positive results.
5.2 Group Unanimity and Strategyproof for a Connected Review Graph
Having shown that pairwise unanimity is too strong a requirement to satisfy, we now consider another direction for extension – conditions on the review graph . A natural question follows: Under what condition on the review graph are both group unanimity and strategyproofness possible? Although we will leave the question of finding the exact condition open, we conjecture that if we require to be connected, then group unanimity and strategyproofness cannot be simultaneously satisfied.
To show our insights, we analyze an extremely simplified review setting. We show that even in this very simple case, for every weakly strategyproof .
Proposition 5.3.
Consider any and suppose , where are disjoint nonempty sets of papers. Consider a review graph with reviewers, where reviewer reviews , reviews , and reviews . Then for every such that is weakly strategyproof.
Proposition 5.3 thus shows that for the simple review graph considered in the statement, group unanimity and weak strategyproofness cannot hold at the same time. We conjecture that such a negative result may hold for more general connected review graphs, and such a negative result may be proved by identifying a component of the general review graph that meets the condition of Proposition 5.3. This shows that our design process of the review graph in Section 4 is quite essential for ensuring those important properties.
5.3 Pairwise Unanimity and Strategyproof under Total Ranking
Throughout the paper so far, motivated by the application of conference peer review, we considered a setting where every reviewer reviews a (small) subset of the papers. In contrast, a bulk of the classical literature in social choice theory considers a setting where each reviewer ranks all candidates or papers (Arrow, 1950; Satterthwaite, 1975). Given this long line of literature, intellectual curiosity drives us to study the case of all reviewers reviewing all papers for our conference peer-review setting.
We now consider our notions of pairwise unanimity and weak strategyproofness under this total-ranking setting, where . In this case, the review graph is always a complete bipartite graph, and it only remains to design the aggregation function . Although total rankings might not be practical for large-scale conferences, they can still be useful for smaller conferences and workshops.
Under this total ranking setting, we prove a negative result showing that pairwise unanimity and strategyproofness cannot be satisfied together, and furthermore, even the notion of weak strategyproofness (together with PU) is impossible to achieve.
Theorem 5.4.
Suppose . If , then for any weakly strategyproof .
To prove Theorem 5.4, we use Cantor’s diagonalization argument to generate a contradiction by assuming there exists that is both PU and WSP. (Note that the conditions required for Theorem 5.1 are not met in the total ranking case.)
It is interesting to note that pairwise unanimity can be easily satisfied in this setting of total rankings, by using a simple aggregation scheme such as the Borda count. However, Theorem 5.4 shows that surprisingly, even under the extremely mild notion of strategyproofness given by WSP, it is impossible to achieve pairwise unanimity and strategyproofness simultaneously.
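The tension between the two properties can be checked exhaustively on a tiny instance. The sketch below assumes 2 reviewers and 3 papers, a Borda count with alphabetical tie-breaking, and a reading of weak strategyproofness in which a paper's position in the output must not change when one reviewer alone changes her ranking. All of these are illustrative assumptions (the instance size and tie-breaking rule are not taken from the paper), and the brute-force checks are no substitute for the proof.

```python
from itertools import permutations, product

PAPERS = ("a", "b", "c")       # hypothetical toy instance
NUM_REVIEWERS = 2
RANKINGS = list(permutations(PAPERS))  # each ranking lists the best paper first


def borda(profile):
    """Borda count; ties are broken alphabetically (an assumption)."""
    score = {p: 0 for p in PAPERS}
    for ranking in profile:
        for position, paper in enumerate(ranking):
            score[paper] += len(PAPERS) - 1 - position
    return tuple(sorted(PAPERS, key=lambda p: (-score[p], p)))


def constant_rule(profile):
    """Ignores all reviews; included only as a contrasting baseline."""
    return PAPERS


def is_pairwise_unanimous(rule):
    for profile in product(RANKINGS, repeat=NUM_REVIEWERS):
        out = rule(profile)
        for x, y in permutations(PAPERS, 2):
            everyone_prefers_x = all(r.index(x) < r.index(y) for r in profile)
            if everyone_prefers_x and out.index(x) > out.index(y):
                return False
    return True


def is_weakly_strategyproof(rule):
    for i in range(NUM_REVIEWERS):
        def position_is_fixed(paper):
            # Fix the other reviewers, vary reviewer i, and watch the paper.
            for others in product(RANKINGS, repeat=NUM_REVIEWERS - 1):
                positions = {
                    rule(others[:i] + (own,) + others[i:]).index(paper)
                    for own in RANKINGS
                }
                if len(positions) > 1:
                    return False
            return True

        if not any(position_is_fixed(p) for p in PAPERS):
            return False
    return True


print(is_pairwise_unanimous(borda), is_weakly_strategyproof(borda))
print(is_pairwise_unanimous(constant_rule), is_weakly_strategyproof(constant_rule))
```

On this instance the Borda rule passes the unanimity check but fails the weak strategyproofness check, while the constant rule does the opposite, which illustrates the trade-off on a scale small enough to enumerate.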
6 Proofs
In this section, we provide the proofs of all the results from previous sections.
6.1 Proof of Proposition 3.1
Suppose is PU, and satisfies that every reviewer ranks the papers she reviewed from higher than those she reviewed from . Now for every and and reviewer such that reviews both and , must rank since otherwise the assumption of is violated. Since is PU, we know that must respect as well. This argument holds for every and that have been reviewed by at least one reviewer, and hence is also GU.
6.2 Proof of Theorem 4.1
We assume that the condition on the partitioning of the conflict graph, as stated in the statement of this theorem, is met. We begin with a lemma which shows that for any aggregation algorithm , Contract-and-Sort is group unanimous.
Lemma 6.1.
For any assignment and aggregation algorithms and , the aggregation procedure Contract-and-Sort is group unanimous.
We prove this lemma in Section 6.2.1. Under the assumptions on , and sizes of , it is easy to verify that there is a paper allocation that satisfies and in which each paper gets at least reviews. The strategyproofness of Divide-and-Rank follows from standard ideas in the literature on partitioning-based methods (Alon et al., 2011): Algorithm 1 guarantees that reviewers in do not review papers in , and reviewers in do not review papers in . Hence the strategyproofness of Divide-and-Rank follows directly from the assignment procedure, in which no reviewer reviews the papers that are in conflict with her, as specified by the conflict graph . Given that all the other reviews are fixed, the ranking of the papers in conflict with her is determined only by the other group of reviewers, and so remains fixed no matter how she changes her own ranking. On the other hand, from Lemma 6.1, since Contract-and-Sort is group unanimous, we know that and respect group unanimity w.r.t. and , respectively. Since , it follows that and also respect group unanimity w.r.t. . Finally, since no reviewer has reviewed papers from both and , the interlacing step preserves group unanimity, which completes the proof.
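The partition-based assignment idea behind the strategyproofness argument can be sketched as follows. This is not Algorithm 1 itself: it is a simplified round-robin illustration, assuming the conflict graph has already been split into two groups with no conflict edges across them, and it ignores per-reviewer load limits and reviewer expertise. The function name and toy data are hypothetical.

```python
def assign_across_groups(groups, reviews_per_paper):
    """Assign every paper to reviewers from the *other* group.

    `groups` is a pair of (reviewers, papers) tuples with no conflict edges
    between the two groups. Returns a dict: paper -> list of reviewers.
    """
    (reviewers_a, papers_a), (reviewers_b, papers_b) = groups
    assignment = {}
    for papers, reviewers in ((papers_a, reviewers_b), (papers_b, reviewers_a)):
        pool = sorted(reviewers)
        if len(pool) < reviews_per_paper:
            raise ValueError("not enough reviewers in the opposite group")
        cursor = 0
        for paper in sorted(papers):
            # Consecutive pool entries (mod pool size) are distinct because
            # reviews_per_paper <= len(pool).
            assignment[paper] = [
                pool[(cursor + k) % len(pool)] for k in range(reviews_per_paper)
            ]
            cursor += reviews_per_paper
    return assignment


# Hypothetical toy instance: conflicts stay inside each group.
toy_groups = (
    ({"r1", "r2"}, {"p1", "p2"}),
    ({"r3", "r4", "r5"}, {"p3"}),
)
toy_conflicts = {("r1", "p1"), ("r2", "p1"), ("r2", "p2"), ("r3", "p3")}
toy_assignment = assign_across_groups(toy_groups, reviews_per_paper=2)
print(toy_assignment)
```

Because every paper is reviewed only by the opposite group, no conflict edge can coincide with a review edge, which is the property the strategyproofness argument relies on.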
6.2.1 Proof of Lemma 6.1
Let , where is a preference profile. Define . Let denote the number of strongly connected components (SCCs) in . Construct a directed graph such that each of its vertices represents an SCC in , and there is an edge from one vertex to another in iff there exists an edge going from one SCC to the other in the original graph . Let be a topological ordering of the vertices in . Since is a topological ordering, edges can only go from to where . Now consider any cut in that satisfies the requirement of group unanimity, i.e., all edges in the cut are directed from to . Then there is no pair of papers and such that and are in the same strongly connected component; otherwise there would be paths both from to and from to , contradicting the fact that forms a cut in which all edges go in one direction. This shows that and also form a partition of all the vertices . Now consider any edge from to . Suppose is in component and is in component . We have , since and form a partition of all SCCs; also, it cannot happen that , since otherwise would not be a topological ordering returned by . So it must be that , and the edge is respected in the final ordering.
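The contract-then-sort construction used in this proof can be sketched in code. The version below is an illustration under stated assumptions: an edge goes from one paper to another whenever some reviewer ranks the first above the second, SCCs are found with Kosaraju's two-pass method (which emits them in a topological order of the contracted graph), and papers inside one SCC are ordered alphabetically as a placeholder for whatever within-component aggregation rule is actually used. The function name and toy profile are hypothetical.

```python
def contract_and_sort(papers, profile):
    """Order papers so that every edge between different SCCs is respected.

    `profile` is a list of partial rankings, each listing papers best first.
    """
    papers = sorted(papers)
    succ = {p: set() for p in papers}
    pred = {p: set() for p in papers}
    for ranking in profile:
        for i, better in enumerate(ranking):
            for worse in ranking[i + 1:]:
                succ[better].add(worse)
                pred[worse].add(better)

    # Pass 1: record vertices in order of depth-first completion.
    finished, seen = [], set()

    def visit(v):
        seen.add(v)
        for w in sorted(succ[v]):
            if w not in seen:
                visit(w)
        finished.append(v)

    for v in papers:
        if v not in seen:
            visit(v)

    # Pass 2: sweep the reversed graph in decreasing completion order.
    # Each sweep collects one SCC, and SCCs appear in topological order.
    order, assigned = [], set()

    def collect(v, component):
        assigned.add(v)
        component.append(v)
        for w in sorted(pred[v]):
            if w not in assigned:
                collect(w, component)

    for v in reversed(finished):
        if v not in assigned:
            component = []
            collect(v, component)
            order.extend(sorted(component))  # placeholder within-SCC rule
    return order


# Two hypothetical reviewers who disagree on b versus c but agree otherwise.
toy_profile = [["d", "b", "c"], ["c", "b", "a"]]
ranking = contract_and_sort(["a", "b", "c", "d"], toy_profile)
print(ranking)
```

In the toy profile, papers b and c form one SCC because the reviewers disagree on them, while d precedes that component and a follows it, so every one-directional cut is respected regardless of how b and c are ordered internally.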
6.3 Proof of Proposition 4.2
We first need a lemma on the positions of the papers:
Lemma 6.2.
Let , where is the largest integer that is strictly smaller than . Then and .
We prove this lemma in Section 6.3.1.
Consider any paper , and suppose its position in is . Define and . Without loss of generality assume (the other case is symmetric) and let . We discuss the following two cases depending on whether or .
Case I: If . Let be the number of papers in ranked strictly higher (better) than according to . Since the permutation is uniformly random, conditioned on this value of , the other papers’ positions in the true ranking are distributed uniformly at random in positions . Now for any paper , let be an indicator random variable that equals 1 if the position of is higher than in , and 0 otherwise. So , and when . Then using Hoeffding’s inequality without replacement, we have
for any . The last inequality is due to , which holds because with a constant . Now setting we have the bound
Now note that by Algorithm 2, the position of paper in the ranking is . Use this relationship to substitute in the above inequality, and notice that by assumption , we have
On the other hand,
where the last inequality is by the assumption that is large enough so that .
Case II: If . Again, let be the number of papers in ranked strictly higher (better) than according to . As in the analysis of Case I, we have , and . With the same analysis using Hoeffding’s inequality without replacement, with probability at least we have
Now using Lemma 6.2, the position of paper in in this case is . Using exactly the same analysis as Case I we have
Combine both Case I and Case II, and notice that is uniformly distributed in . Using a union bound over , with probability we have:
6.3.1 Proof of Lemma 6.2
We show that for every slot for which there is no such that , there exists one slot for such that , i.e., all slots that are left empty by are taken by slots of . Since the two kinds of slots number in total, this shows that there is no overlap between the two kinds of slots, thus proving the lemma.
Let . Suppose there is no such that ; then there must exist some such that
This is because there must be a multiple of in the range , but by our assumption there is no such multiple in .
Now let . By we have ; substituting in (4) we have
Thus there exists , which proves the lemma.
6.4 Proof of Theorem 5.1
The proof of Theorem 5.1 is a direct formulation of our intuition in Section 5.1. Without loss of generality let be the cycle not reviewed by a single reviewer, for . Hence there exists a partial profile such that for all the reviewers who have reviewed both and , (define ). On the other hand, since for each reviewer, at least one pair is not reviewed by her, the constructed partial profile is valid. Now assume is PU, then we must have and , which contradicts the transitivity of the ranking.
6.5 Proof of Corollary 5.2
We prove each of the conditions in order.
Proof of part 1: If there is a cycle of size , then no reviewer can review all the papers in it since it exceeds the size of review sets. So there is no such cycle.
Proof of part 2: The statement trivially holds for . For , suppose there are two reviewers and such that . Since , there exist papers and such that and . Also , and let . By definition it is easy to verify that forms a cycle that satisfies the condition in Theorem 5.1, and hence is not pairwise unanimous.
Proof of part 3: Define a “paper-relation graph” as follows: Given a paper-review assignment , the paper-relation graph is an undirected graph, whose nodes are the distinct sets in ; we connect two review sets iff they have one paper in common. Note that by part 2, each pair of distinct sets has at most one paper in common.
We first show that if is pairwise unanimous, then must necessarily be a forest. If there is a cycle in , then there is a corresponding cycle in the review-relation graph . To see this, without loss of generality suppose the shortest cycle in is . Also, without loss of generality, suppose . Then forms a cycle in by its definition. Since each reviewer reviews exactly one set in , there is no reviewer reviewing all papers in this cycle of papers in . Thus the condition in Theorem 5.1 is satisfied, and is not pairwise unanimous.
We now use this result to complete our proof. Consider the union of all sets of papers that form vertices of . We know that this union contains exactly papers since each paper is reviewed at least once. Now let denote the number of distinct review sets (that is, number of vertices of ), and let denote the vertices of . The intersection of three or more sets in is empty, since otherwise there would be a cycle in . Using this fact, we apply the inclusion-exclusion principle to obtain
Now use the inequality which arises since is a forest, to obtain the claimed bound .
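The counting step can be written out as follows. This is a sketch in assumed notation, which may differ from the paper's own symbols: $n$ is the number of papers, $S_1, \dots, S_k$ are the distinct review sets, each of the same size $\mu \ge 2$, and $E$ is the edge set of the paper-relation graph. It uses only the facts stated above: every paper is reviewed at least once, two distinct sets share at most one paper, three or more sets have an empty intersection, and the graph is a forest.

```latex
\begin{align*}
n \;=\; \Bigl|\bigcup_{i=1}^{k} S_i\Bigr|
  &= \sum_{i=1}^{k} |S_i| \;-\; \sum_{1 \le i < j \le k} |S_i \cap S_j|
  && \text{(triple intersections are empty)} \\
  &= k\mu - |E|
  && \text{($|S_i| = \mu$; $|S_i \cap S_j| = 1$ exactly on edges)} \\
  &\ge k\mu - (k - 1)
  && \text{(a forest on $k$ vertices has at most $k - 1$ edges)}.
\end{align*}
% Rearranging, for \mu \ge 2:
\[
k \;\le\; \frac{n - 1}{\mu - 1}.
\]
```

Under these assumptions, the number of distinct review sets grows at most linearly in the number of papers divided by the review-set size, which is consistent with the remark that many reviewers would have to share identical sets.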
6.6 Proof of Proposition 5.3
Fix some ranking of papers within each individual set , , and (e.g., according to the natural order of their indices). In the remainder of the proof, any ranking of all papers always considers these fixed rankings within these individual sets. With this in place, in what follows, we refer to any ranking in terms of the rankings of the four sets of papers.
Suppose there is one such that satisfies group unanimity and weak strategyproofness for , and consider the following 4 profiles: