Peer review is the backbone of academia. In order to provide high-quality peer reviews, it is of utmost importance to assign papers to the right reviewers (Thurner and Hanel, 2011; Black et al., 1998; Bianchi and Squazzoni, 2015). Even a small fraction of incorrect reviews can have significant adverse effects on the quality of the published scientific standard (Thurner and Hanel, 2011) and dominate the benefits yielded by the peer-review process that may have high standards otherwise (Squazzoni and Gandelli, 2012). Indeed, researchers unhappy with the peer review process are somewhat more likely to link their objections to the quality or choice of reviewers (Travis and Collins, 1991).
We focus on peer-review in conferences where a number of papers are submitted at once. These papers must simultaneously be assigned to multiple reviewers who have load constraints. The importance of the reviewer-assignment stage of the peer-review process cannot be overstated; quoting Rodriguez et al. (2007), “one of the first and potentially most important stage is the one that attempts to distribute submitted manuscripts to competent referees.” Given the massive scale of many conferences such as NIPS and ICML, these reviewer assignments are largely performed in an automated manner. For instance, NIPS 2016 assigned 5 out of 6 reviewers per paper using an automated process (Shah et al., 2017). This problem of automated reviewer assignments forms the focus of this paper.
Various past studies show that small changes in peer review quality can have far reaching consequences (Thorngate and Chowdhury, 2014; Squazzoni and Gandelli, 2012) not just for the papers under consideration but more generally also for the career trajectories of the researchers. These long term effects arise due to the widespread prevalence of the Matthew effect (“rich get richer”) in academia (Merton, 1968).
It is also known (Travis and Collins, 1991; Lamont, 2009) that works that are novel or not mainstream, particularly those interdisciplinary in nature, face significantly higher difficulty in gaining acceptance. A primary reason for this undesirable state of affairs is the absence of sufficiently many good “peers” to aptly review interdisciplinary research (Porter and Rossini, 1985).
These issues strongly motivate the dual goals of the reviewer assignment procedure we consider in this paper: fairness and accuracy. By fairness, we specifically consider the notion of max-min fairness, which is studied in various branches of science and engineering (Rawls, 1971; Lenstra et al., 1990; Hahne, 1991; Lavi et al., 2003; Bonald et al., 2006; Asadpour and Saberi, 2010). In our context of reviewer assignments, max-min fairness posits maximizing the review quality of the paper with the least qualified reviewers. The max-min fair assignment guarantees that no paper is discriminated against in favor of its luckier counterparts. That is, even the most ambivalent paper, with only a small number of reviewers competent enough to evaluate its merits, will receive as good a treatment as possible. The max-min fair assignment also ensures that in any other assignment there exists at least one paper whose fate is at least as bad as that of the most disadvantaged paper in the aforementioned fair assignment.
Alongside, we also consider the requirement of statistical accuracy. One of the main goals of the conference peer-review process is to select the set of “top” papers for acceptance. Two key challenges towards this goal are to handle the noise in the reviews and subjective opinions of the reviewers; we accommodate these aspects in terms of existing (Ge et al., 2013; McGlohon et al., 2010; Dai et al., 2012) and novel statistical models of reviewer behavior. Prior works on the reviewer assignment problem (Long et al., 2013; Garg et al., 2010; Karimzadehgan et al., 2008; Tang et al., 2010) offer a variety of algorithms that optimize the assignment for certain deterministic objectives, but do not study their assignments from the lens of statistical accuracy. In contrast, our goal is to design an assignment algorithm that can simultaneously achieve both the desired objectives of fairness and statistical accuracy.
We make several contributions towards this problem. We first present a novel algorithm, which we call PeerReview4All, for assigning reviewers to papers. Our algorithm is based on a construction of multiple candidate assignments, each of which is obtained via an incremental execution of max-flow algorithm on a carefully designed flow network. These assignments cater to different structural properties of the similarities and a judicious choice between them provides the algorithm appealing properties.
Our second contribution is an analysis of the fairness objective that our PeerReview4All algorithm can achieve. We show that our algorithm is optimal, up to a constant factor, in terms of the max-min fairness objective. Furthermore, our algorithm can adapt to the underlying structure of the given similarity data between reviewers and papers and in various cases yields better guarantees, including the exact optimal solution in certain scenarios. Finally, after optimizing the outcome for the worst-off paper and fixing the assignment for that paper, our algorithm aims at finding the fairest assignment for the next worst-off paper and proceeds in this manner until the assignment for each paper is fixed.
As a third contribution, we show that our PeerReview4All algorithm results in strong statistical guarantees in terms of correctly identifying the top papers that should be accepted. We consider a popular statistical model (Ge et al., 2013; McGlohon et al., 2010; Dai et al., 2012) which assumes existence of some true objective score for every paper. We provide a sharp analysis of the minimax risk in terms of “incorrect” accept/reject decisions, and show that our PeerReview4All algorithm leads to a near-optimal solution.
Fourth, noting that paper evaluations are typically subjective (Kerr et al., 1977; Mahoney, 1977; Ernst and Resch, 1994; Bakanic et al., 1987; Lamont, 2009), we propose a novel statistical model capturing subjective opinions of reviewers, which may be of independent interest. We provide a sharp minimax analysis under this subjective setting and prove that our assignment algorithm PeerReview4All is also near-optimal for this subjective-score setting.
Our fifth and final contribution is an experiment we designed and conducted on the Amazon Mechanical Turk crowdsourcing platform to objectively compare the performance of different reviewer-assignment algorithms. We design this experiment in a manner that circumvents the challenge posed by the absence of a ground truth in peer review settings, for objective evaluations of accuracy. The results of the experiment highlight the promise of PeerReview4All in practice, in addition to the theoretical benefits discussed elsewhere in the paper. The dataset pertaining to the experiment, as well as the code for our PeerReview4All algorithm, are available on the first author’s website.
The remainder of this paper is organized as follows. We discuss related literature in Section 2. In Section 3, we present the problem setting formally with a focus on the objective of fairness. In Section 4 we present our PeerReview4All algorithm. We establish deterministic approximation guarantees on the fairness of our PeerReview4All algorithm in Section 5. We analyze the accuracy of our PeerReview4All algorithm under an objective-score model in Section 6, and introduce and analyze a subjective score model in Section 7. We empirically evaluate the algorithm in Section 8 using synthetic and real-world experiments. We then provide the proofs of all the results in Section 9. We conclude the paper with a discussion in Section 10.
2 Related literature
The reviewer assignment process consists of two steps. First, a “similarity” between every (paper, reviewer) pair that captures the competence of the reviewer for that paper is computed. These similarities are computed based on various factors such as the text of the submitted paper, previous papers authored by reviewers, reviewers’ bids and other features. Second, given the notion of good assignment, specified by the program chairs, papers are allocated to reviewers, subject to constraints on paper/reviewer loads. This work focuses on the second step (assignment), assuming the first step of computing similarities as a black box. In this section, we give a brief overview of the past literature on both of the steps of the reviewer-assignment process.
Computing similarities. The problem of identifying similarities between papers and reviewers is well-studied in the data mining community. For example, Mimno and McCallum (2007) introduce a novel topic model to predict reviewers’ expertise. Liu et al. (2014) use the random walk with restarts model to incorporate both expertise of reviewers and their authority in the final similarities. Co-authorship graphs (Rodriguez and Bollen, 2008) and more general bibliographic graph-based data models (Tran et al., 2017) give appealing methods which do not require a set of reviewers to be pre-determined by the conference chair. Instead, these methods recommend reviewers to be recruited, which might be particularly useful for journal editors.
One of the most widely used automated assignment algorithms today is the Toronto Paper Matching System or TPMS (Charlin and Zemel, 2013), which also computes estimates of similarities between submitted papers and available reviewers using techniques in natural language processing. These scores might be enhanced with reviewers’ self-assessed expertise, adaptively queried from them in an automatic manner.
Our work uses these similarities as an input for our assignment algorithm, and considers the computation of these similarity values as a given black box.
Cumulative goal functions. With the given similarities, much of the past work on reviewer assignments develops algorithms to maximize the cumulative similarity, that is, the sum of the similarities across all assigned reviewers and all papers. Such an objective is pursued by the organizers of the SIGKDD conference (Flach et al., 2010) and by the widely employed TPMS assignment algorithm (Charlin and Zemel, 2013). Various other popular conference management systems such as EasyChair (easychair.org) and HotCRP (hotcrp.com) and several other papers (see Long et al. 2013; Charlin et al. 2012; Goldsmith and Sloan 2007; Tang et al. 2010 and references therein) also aim to maximize various cumulative functionals in their automated reviewer assignment procedures. In the sequel, we argue however that optimizing such cumulative objectives is not fair: in order to maximize them, these algorithms may discriminate against some subset of papers. Moreover, it is the non-mainstream submissions that are most likely to be discriminated against. With this motivation, we consider a notion of fairness instead.
Fairness. In order to ensure that no papers are discriminated against, we aim at finding a fair assignment — an assignment that ensures that the most disadvantaged paper gets as competent reviewers as possible. The issue of fairness is partially tackled by Hartvigsen et al. (1999), where they necessitate every paper to have at least one reviewer with expertise higher than certain threshold, and then maximize the value of that threshold. However, this improvement only partially solves the issue of discrimination of some papers: having assigned one strong reviewer to each paper, the algorithm may still discriminate against some papers while assigning remaining reviewers. Given that nowadays large conferences such as NIPS and ICML assign 4-6 reviewers to each paper, a careful assessment of the paper by one strong reviewer might be lost in the noise induced by the remaining weak reviews. In the present study, we measure the quality of assignment with respect to any particular paper as sum similarity over reviewers assigned to that paper. Thus, the fairness of assignment is the minimum sum similarity across all papers; we call an assignment fair if it maximizes the fairness. We note that assignment computed by our PeerReview4All algorithm is guaranteed to have at least as large max-min fairness as that proposed by Hartvigsen et al. (1999).
Benferhat and Lang (2001) discuss different approaches to selection of the “optimal” reviewer assignment. Together with considering a cumulative objective, they also note that one may define the optimal assignment as an assignment that minimizes a disutility of the most disadvantaged reviewer (paper). This approach resembles the notion of max-min fairness we study in this paper, but Benferhat and Lang (2001) do not propose any algorithm for computing the fair assignment.
The notion of max-min fairness was formally studied in context of peer-review by Garg et al. (2010). While studying a similar objective, our work develops both conceptual and theoretical novelties which we highlight here. First, Garg et al. (2010) measure the fairness in terms of reviewers’ bids — for every reviewer they compute a value of papers assigned to that reviewer based on her/his bids and maximize the minimum value across all reviewers. While satisfying reviewers is a useful practice, we consider fairness towards the papers in their review to be of utmost importance. During a bidding process reviewers have limited time resources and/or limited access to papers’ content to evaluate their relevance, and hence reviewers’ bids alone are not a good proxy towards the measure of fairness. In contrast, in this work we consider similarities — scores that are designed to represent a competence of reviewer in assessing a paper. Besides reviewers’ bids, similarities are computed based on the full text of the submissions and papers authored by reviewer and can additionally incorporate various factors such as quality of previous reviews, experience of reviewer and other features that cannot be self-assessed by reviewers.
The assignment algorithm proposed in Garg et al. (2010) works in two steps. In the first step, the problem is set up as an integer programming problem and a linear programming relaxation is solved. The second step involves a carefully designed rounding procedure that returns a valid assignment. The algorithm is guaranteed to recover an assignment whose fairness is within a certain additive factor from the best possible assignment. However, the fairness guarantees provided in Garg et al. (2010) turn out to be vacuous for various similarity matrices. As we discuss later in the paper, this is a drawback of the algorithm itself and not an artifact of their guarantees. In contrast, we design an algorithm with a multiplicative approximation factor that is guaranteed to always provide a non-trivial approximation which is at most a constant factor away from the optimal.
Next, Garg et al. (2010) consider fairness of the assignment as an eventual metric of the assignment quality. However, we note that the main goal of the conference paper reviewing process is an accurate acceptance of the best papers. Thus, in the present work we both theoretically and empirically study the impact of the fairness of the assignment on the quality of the acceptance procedure.
Finally, although Garg et al. (2010) present their algorithm for the case of discrete reviewer’s bids, we note that this assumption can be relaxed to allow real-valued similarities with a continuous range as in our setting. In this paper we refer to the corresponding extension of their algorithm as the Integer Linear Programming Relaxation (ILPR) algorithm.
Fair division. A direction of research that is relevant to our work studies the problem of fair division where max-min fairness is extensively developed. The seminal work of Lenstra et al. (1990) provides a constant factor approximation to the minimum makespan scheduling problem where the goal is to assign a number of jobs to the unrelated parallel machines such that the maximal running time is minimized. Recently Asadpour and Saberi (2010); Bansal and Sviridenko (2006) proposed approximation algorithms for the problem of assigning a number of indivisible goods to several people such that the least happy person is as happy as possible. However, we note that techniques developed in these papers cannot be directly applied to the reviewer assignment problem in peer review due to the various idiosyncratic constraints of this problem. In contrast to the classical formulation studied in these works, our problem setting requires each paper to be reviewed by a fixed number of reviewers and additionally has constraints on reviewers’ loads. Such constraints allow us to achieve an approximation guarantee that is independent of the total number of papers and reviewers, and depends only on $\lambda$, the number of reviewers required per paper, as $1/\lambda$. In contrast, the approximation factor of Asadpour and Saberi (2010) gets worse at a rate of $1/(\sqrt{m}\log^3 m)$, where $m$ is the number of persons (papers in our setting).
Statistical aspects. Different statistical aspects related to conference peer-review have been studied in the literature. McGlohon et al. (2010) and Dai et al. (2012) studied aggregation of consumers’ ratings to generate a ranking of restaurants or merchants. They come up with an objective-score model of the reviewer, which we also use in this work. Ge et al. (2013) also use a similar model of the reviewer and propose a Bayesian approach to calibrating reviewers’ scores, which makes it possible to incorporate different biases in the context of conference peer-review. Sajjadi et al. (2016) empirically compare different methods of score aggregation for peer grading of homeworks. Peer grading is a related problem to conference peer review, with the key difference that the questions and answers (“papers”) are more closed-ended and objective. They conclude that although more sophisticated methods are praised in the literature, the simple averaging algorithm demonstrates better performance in their experiment. Another interesting observation they make is an edge of cardinal grades over ordinal ones in their setup. In this work we also consider conferences with a cardinal grading scheme for submissions.
To the best of our knowledge, no prior work on conference peer-review has studied the entire pipeline, from assignment to acceptance, from a statistical point of view. In this work we take the first steps to close this gap and provide a strong minimax analysis of a naïve yet interesting procedure for determining top papers. Our findings suggest that higher fairness of the assignment leads to better quality of the acceptance procedure. We consider both the objective score model (Ge et al., 2013; McGlohon et al., 2010; Dai et al., 2012) and a novel subjective-score model that we propose in the present paper.
Coverage and Diversity. For completeness, we also discuss several related works that study reviewer assignment problem.
Li et al. (2015) present a greedy algorithm that tries to avoid assigning a group of stringent reviewers or a group of lenient reviewers to a submission, thus maintaining diversity of the assignment in terms of having different combinations of reviewers assigned to different papers.
Another way to ensure diversity of the assignment is proposed by Liu et al. (2014). Instead of designing the special assignment algorithm, they try to incentivize the diversity by special construction of similarities. Besides incorporating expertise and authority of reviewers in similarities, they add an additional term to the optimization problem which balances similarities by increasing scores for reviewers from different research areas.
Karimzadehgan et al. (2008) consider topic coverage as an objective and propose several approaches to maintain broad coverage, requiring reviewers assigned to paper being expert in different subtopics covered by the paper. They empirically verify that given a paper and a set of reviewers, their algorithms lead to better coverage of paper’s topics as compared to baseline technique that assigns reviewers based on some measure of similarity between text of submission and papers authored by reviewers, but does not do topic matching.
A similar goal is formally studied by Long et al. (2013). They measure the coverage of the assignment in terms of the total number of distinct topics of papers covered by the assigned reviewers. They propose a constant factor approximation algorithm that benefits from a sub-modular nature of the objective. As we show in Appendix C, the techniques of Long et al. (2013) can be combined with our proposed algorithm to obtain an assignment which maintains not only fairness, but also a broad topic coverage.
3 Problem setting
3.1 Preliminaries and notation
Given a collection of $m$ papers, suppose that there exists a true, unknown total ranking of the papers. The goal of the program chair (PC) of the conference is to recover the top $k$ papers, for some pre-specified value $k$. In order to achieve this goal, the PC recruits $n$ reviewers and asks each of them to read and evaluate some subset of the papers. Each reviewer can review a limited number of papers. We let $\mu$ denote the maximum number of papers that any reviewer is willing to review. Each paper must be reviewed by $\lambda$ distinct reviewers. In order to ensure this setting is feasible, we assume that $n\mu \geq m\lambda$. In practice, $\lambda$ is typically small (2 to 6) and hence should conceptually be thought of as a constant.
The PC has access to a similarity matrix $S = \{s_{ij}\} \in [0,1]^{n \times m}$, where $s_{ij}$ denotes the similarity between any reviewer $i \in [n]$ and any paper $j \in [m]$. (Footnote 1: Here, we adopt the standard notation $[\nu] = \{1, 2, \ldots, \nu\}$ for any positive integer $\nu$.) These similarities are representative of the envisaged quality of the respective reviews: a higher similarity between any reviewer and paper is assumed to indicate a higher competence of that reviewer in reviewing that paper (this assumption is formalized later). We do not discuss the design of such similarities, but often they are provided by existing systems (Charlin and Zemel, 2013; Mimno and McCallum, 2007; Liu et al., 2014; Rodriguez and Bollen, 2008; Tran et al., 2017).
Our focus is on the assignment of papers to reviewers. We represent any assignment by a matrix $A \in \{0,1\}^{n \times m}$, whose $(i,j)$-th entry is $1$ if reviewer $i$ is assigned paper $j$ and $0$ otherwise. We denote the set of reviewers who review paper $j$ under an assignment $A$ as $R_A(j)$. We call an assignment feasible if it respects the conditions on the reviewer and paper loads. We denote the set of all feasible assignments as $\mathcal{A}$:
$$\mathcal{A} = \Big\{ A \in \{0,1\}^{n \times m} \;\Big|\; \sum_{i \in [n]} A_{ij} = \lambda \ \ \forall j \in [m], \quad \sum_{j \in [m]} A_{ij} \leq \mu \ \ \forall i \in [n] \Big\}.$$
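The feasibility conditions above can be checked mechanically. The following is a minimal sketch, not the authors' code: the function names `is_feasible` and `reviewers_of` and the list-of-lists encoding of the assignment matrix are assumptions made for illustration.

```python
def is_feasible(assignment, reviews_per_paper, max_reviewer_load):
    """Check the load conditions on a 0/1 assignment matrix.

    assignment[i][j] is 1 if reviewer i is assigned paper j and 0 otherwise
    (an illustrative encoding). Every paper must receive exactly
    `reviews_per_paper` reviewers and no reviewer may exceed
    `max_reviewer_load` papers.
    """
    n_reviewers = len(assignment)
    n_papers = len(assignment[0])
    if any(entry not in (0, 1) for row in assignment for entry in row):
        return False
    papers_ok = all(
        sum(assignment[i][j] for i in range(n_reviewers)) == reviews_per_paper
        for j in range(n_papers)
    )
    reviewers_ok = all(sum(row) <= max_reviewer_load for row in assignment)
    return papers_ok and reviewers_ok


def reviewers_of(assignment, paper):
    """Return the set of reviewers assigned to `paper`."""
    return {i for i, row in enumerate(assignment) if row[paper] == 1}


if __name__ == "__main__":
    one_to_one = [[1, 0], [0, 1]]
    overloaded = [[1, 1], [0, 0]]
    print(is_feasible(one_to_one, 1, 1), is_feasible(overloaded, 1, 1))
```

Note that the paper-load condition is an equality while the reviewer-load condition is only an upper bound, which is why a reviewer with no papers is still acceptable.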
Our goal is to design a reviewer-assignment algorithm with a two-fold objective: (i) fairness to all papers, (ii) strong statistical guarantees in terms of recovering the top papers.
From a statistical perspective, we assume that when any reviewer $i$ is asked to evaluate any paper $j$, then she/he returns a score $y_{ij} \in \mathbb{R}$. The end goal of the PC is to accept or reject each paper. In this work we consider a simplified yet indicative setup. We assume that the PC wishes to accept the “top” $k$ papers from the set of $m$ submitted papers. We denote the “true” set of top $k$ papers as $\mathcal{T}_k^*$. While the PC’s decisions in practice would rely on several additional factors including the text comments by reviewers and the discussions between them, in order to quantify the quality of any assignment we assume that the top $k$ papers are chosen through some estimator $\hat{\theta}$ that operates on the scores provided by the reviewers. Such an estimator can be used in practice to serve as a guide to the program committee in order to help reduce their load. These acceptance decisions can be described by the chosen assignment $A$ and estimator $\hat{\theta}$. We denote the set of accepted papers under an assignment $A$ and estimator $\hat{\theta}$ as $\mathcal{T}_k = \mathcal{T}_k(A, \hat{\theta})$. The PC then wishes to maximize the probability of recovering the set $\mathcal{T}_k^*$ of top $k$ papers.
Although the goal of exactly recovering the top $k$ papers is appealing, given the large number of papers submitted to a conference such as ICML and NIPS, this goal might be too optimistic. An alternative is to recover the top $k$ papers allowing for a certain Hamming error tolerance $t$. For any two subsets $\mathcal{M}_1, \mathcal{M}_2$ of $[m]$, we define their Hamming distance to be the number of items that belong to exactly one of the two sets, that is,
$$D_H(\mathcal{M}_1, \mathcal{M}_2) = \big| (\mathcal{M}_1 \cup \mathcal{M}_2) \setminus (\mathcal{M}_1 \cap \mathcal{M}_2) \big|.$$
The goal of the PC under this scenario is to choose a pair $(A, \hat{\theta})$ such that for the given error tolerance parameter $t$, the probability $\mathbb{P}\{ D_H(\mathcal{T}_k(A, \hat{\theta}), \mathcal{T}_k^*) > 2t \}$ is minimized. We return to more details on the statistical aspects later in the paper.
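As a small illustration of the Hamming distance between an accepted set and the true top set, the sketch below uses Python sets; the function name `hamming_distance` and the example paper labels are hypothetical. When both sets have the same size, every wrongly accepted paper displaces exactly one paper that should have been accepted, so the distance is always even.

```python
def hamming_distance(first, second):
    """Number of items that belong to exactly one of the two sets."""
    return len(set(first) ^ set(second))  # size of the symmetric difference


if __name__ == "__main__":
    true_top = {"p1", "p2", "p3"}   # hypothetical true top-3 papers
    accepted = {"p2", "p3", "p4"}   # hypothetical accepted papers
    # One wrong acceptance (p4) and one wrong rejection (p1).
    print(hamming_distance(accepted, true_top))
```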
3.2 Fairness objective
An assignment objective that is popular in past papers (Charlin and Zemel, 2013; Charlin et al., 2012; Taylor, 2008) is to maximize the cumulative similarity over all papers. Formally, these works choose an assignment $A \in \mathcal{A}$ which maximizes the quantity
$$G^S(A) = \sum_{j \in [m]} \sum_{i \in R_A(j)} s_{ij}, \tag{2}$$
where $R_A(j)$ is the set of reviewers assigned to paper $j$ under $A$ and $s_{ij}$ is the similarity between reviewer $i$ and paper $j$.
An assignment algorithm that optimizes this objective (2) is implemented in the widely used Toronto Paper Matching System (Charlin and Zemel, 2013). We will refer to the feasible assignment that maximizes the objective (2) as $A^{\mathrm{TPMS}}$ and denote the algorithm which computes $A^{\mathrm{TPMS}}$ as TPMS.
We argue that the objective (2) does not necessarily lead to a fair assignment. The optimal assignment can discriminate against some papers in order to maximize the cumulative objective. To see this issue, consider the following example.
Consider a toy problem with and , with a similarity matrix shown in Table 1. In this example, paper is easy to evaluate, having non-zero similarities with all the reviewers, while papers and are more specific and weak reviewer has no expertise in reviewing them. Reviewer is an expert and is able to assess all three papers. Maximizing total sum of similarities (2), the TPMS algorithm will assign reviewers , , and to papers , , and respectively. Observe that under this assignment, paper is assigned a reviewer who has insufficient expertise to evaluate the paper. On the other hand, the alternative assignment which assigns reviewers , , and to papers , , and respectively ensures that every paper has a reviewer with similarity at least . This “fair” assignment does not discriminate against papers and for improving the review quality of the already benefitting paper .
With this motivation, we now formally describe the notion of fairness that we aim to optimize in this paper. Inspired by the notion of max-min fairness in a variety of other fields (Rawls, 1971; Lenstra et al., 1990; Hahne, 1991; Lavi et al., 2003; Bonald et al., 2006; Asadpour and Saberi, 2010), we aim to find a feasible assignment $A \in \mathcal{A}$ to maximize the following objective for given similarity matrix $S$:
$$\Gamma^S(A) = \min_{j \in [m]} \sum_{i \in R_A(j)} s_{ij}. \tag{3}$$
The assignment optimal for (3) maximizes the minimum sum similarity across all the papers. In other words, for every other assignment there exists some paper which has the same or lower sum similarity. Returning to our example, the objective (3) is maximized when reviewers , , and are assigned to papers , , and respectively.
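The gap between the cumulative objective (2) and the max-min objective (3) can be reproduced by brute force on a tiny instance. The sketch below is illustrative only: the similarity values are hypothetical and are not the entries of Table 1, each paper receives exactly one reviewer and each reviewer exactly one paper, and the function names are assumptions.

```python
from itertools import permutations

# Hypothetical similarities (NOT the values of Table 1).
# Rows are reviewers, columns are papers.
SIMILARITY = [
    [1.0, 1.0, 1.0],    # an expert on every paper
    [0.0, 0.0, 0.2],    # a weak reviewer with expertise on the last paper only
    [0.25, 0.25, 0.5],  # a moderately competent reviewer
]


def paper_scores(similarity, reviewer_of_paper):
    """Similarity received by each paper; reviewer_of_paper[p] reviews paper p."""
    return [similarity[r][p] for p, r in enumerate(reviewer_of_paper)]


def cumulative_similarity(similarity, reviewer_of_paper):
    """Objective (2): total similarity over all papers."""
    return sum(paper_scores(similarity, reviewer_of_paper))


def fairness(similarity, reviewer_of_paper):
    """Objective (3): similarity received by the worst-off paper."""
    return min(paper_scores(similarity, reviewer_of_paper))


# With one reviewer per paper and one paper per reviewer, the feasible
# assignments are exactly the permutations of the reviewers.
candidates = list(permutations(range(3)))
best_cumulative = max(candidates, key=lambda a: cumulative_similarity(SIMILARITY, a))
best_fair = max(candidates, key=lambda a: fairness(SIMILARITY, a))

if __name__ == "__main__":
    for name, a in [("cumulative", best_cumulative), ("max-min", best_fair)]:
        print(name, a, cumulative_similarity(SIMILARITY, a), fairness(SIMILARITY, a))
```

On this instance, maximizing the total leaves one paper with a zero-similarity reviewer, whereas the max-min assignment gives up a small amount of total similarity so that every paper has a reviewer with non-zero similarity.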
Our reviewer assignment algorithm presented subsequently guarantees the aforementioned fair assignment. Importantly, while aiming at optimizing (3), our algorithm does even more — having the assignment for the worst-off paper fixed, it finds an assignment that satisfies the second worst-off paper, then the next one and so on until all papers are assigned.
It is important to note that similarities obtained by different techniques (Charlin and Zemel, 2013; Mimno and McCallum, 2007; Rodriguez and Bollen, 2008; Tran et al., 2017) all have different meanings. Therefore, the PC might be interested in considering a slightly more general formulation and aim to maximize
$$\Gamma^S_f(A) = \min_{j \in [m]} \sum_{i \in R_A(j)} f(s_{ij}) \tag{4}$$
for some reasonable choice of monotonically increasing function $f : [0,1] \to [0,\infty]$. (Footnote 2: We allow $f(1) = \infty$. When a reviewer with similarity $1$ is assigned to a paper, she/he is able to perfectly assess the quality of the paper.) While the same effect might be achieved by redefining $s_{ij} := f(s_{ij})$ for all $i \in [n]$, $j \in [m]$, this formulation underscores the fact that the assignment procedure is not tied to any particular method of obtaining similarities. Different choices of $f$ represent different views on the meaning of similarities. As a short example, let us consider $f(s) = \mathbb{I}\{s > \zeta\}$ for some $\zeta > 0$. (Footnote 3: We use $\mathbb{I}$ to denote the indicator function, that is, $\mathbb{I}\{x\} = 1$ if $x$ is true and $\mathbb{I}\{x\} = 0$ otherwise.) This choice stratifies reviewers for each paper into strong (similarity higher than $\zeta$) and weak. The fair assignment would be such that the most disadvantaged paper is assigned to as many strong reviewers as possible. We discuss other variants of $f$ later when we come to the statistical properties of our algorithm. In what follows we refer to the problem of finding a reviewer assignment that maximizes the term (4) as the fair assignment problem.
With this motivation, in the next section we design a reviewer assignment algorithm that seeks to optimize the objective (4) and provide associated approximation guarantees. We will refer to a feasible assignment that exactly maximizes $\Gamma^S_f(A)$ as $A^{\mathrm{HARD}}_f$ and denote the algorithm that computes $A^{\mathrm{HARD}}_f$ as Hard. When the function $f$ is clear from context, we drop the subscript $f$ and denote the Hard assignment as $A^{\mathrm{HARD}}$ for brevity.
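To make the role of the transformation in objective (4) concrete, the following sketch evaluates one fixed assignment under two choices of the monotone function: the identity and a threshold indicator. The similarity values, the threshold of 0.5, and the names `generalized_fairness` and `threshold_indicator` are all hypothetical choices for illustration.

```python
def generalized_fairness(similarity, reviewers_of_paper, transform):
    """Objective (4): minimum over papers of the summed transformed similarities.

    similarity[i][j] is the similarity of reviewer i and paper j, and
    reviewers_of_paper[j] lists the reviewers assigned to paper j.
    """
    return min(
        sum(transform(similarity[i][j]) for i in reviewers)
        for j, reviewers in enumerate(reviewers_of_paper)
    )


def threshold_indicator(threshold):
    """Return a function mapping a similarity to 1 if it exceeds `threshold`, else 0."""
    return lambda s: 1 if s > threshold else 0


# Hypothetical instance: four reviewers, two papers, two reviewers per paper.
SIMILARITY = [
    [0.9, 0.2],
    [0.8, 0.6],
    [0.3, 0.7],
    [0.1, 0.4],
]
ASSIGNMENT = [[0, 1], [2, 3]]  # paper 0 gets reviewers 0 and 1; paper 1 gets 2 and 3

if __name__ == "__main__":
    # Identity transform: plain sum similarity of the worst-off paper.
    print(generalized_fairness(SIMILARITY, ASSIGNMENT, lambda s: s))
    # Indicator transform: number of strong reviewers of the worst-off paper.
    print(generalized_fairness(SIMILARITY, ASSIGNMENT, threshold_indicator(0.5)))
```

Under the indicator transform the objective simply counts strong reviewers, so the value for the worst-off paper is an integer between zero and the number of reviewers per paper.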
Finally we note that for our running example (Table 1 above), the ILPR algorithm (Garg et al., 2010), despite trying to optimize fairness of the assignment, also returns an unfair assignment which coincides with . The reason for this behavior lies in the inner-working of the ILPR algorithm: a linear programming relaxation splits reviewers and in two and makes them review both paper and paper . During the rounding stage, reviewer is assigned to either paper or paper , ensuring that the remaining paper will be reviewed by reviewer . Given that reviewer has zero similarity with both papers and , the fairness of the resulting assignment will be . Such an issue arises more generally in the ILPR algorithm and is discussed in more detail subsequently in Section 5.3 and in Appendix A.1.
4 Reviewer assignment algorithm
In this section we first describe our PeerReview4All algorithm followed by an illustrative example.
A high-level idea of the algorithm is the following. For every integer $\kappa \in [\lambda]$, where $\lambda$ is the number of reviewers required per paper, we try to assign each paper to $\kappa$ reviewers with maximum possible similarities while respecting constraints on reviewer loads. We do so via a carefully designed “subroutine” that is explained below. Continuing for that value of $\kappa$, we complement this assignment with $\lambda - \kappa$ additional reviewers for each paper. Repeating the procedure for each value of $\kappa \in [\lambda]$, we obtain $\lambda$ candidate assignments, each with $\lambda$ reviewers assigned to each paper, and then choose the one with the highest fairness. The assignment at this point ensures guarantees of worst-case fairness (4). We then also optimize for the second worst-off paper, then the third worst-off paper and so on in the following manner. In the assignment at this point, we find the most disadvantaged papers and permanently fix corresponding reviewers to these papers. Next, we repeat the procedure described above to find the most fair assignment among the remaining papers, and so on. By doing so, we ensure that our final assignment is not susceptible to bottlenecks which may be caused by irrelevant papers with small average similarities.
The higher-level idea behind the aforementioned subroutine to obtain the candidate assignment for any value of $\kappa$ is as follows. The subroutine constructs a layered flow network graph with one layer for reviewers and one layer for papers, that captures the similarities and the constraints on the paper/reviewer loads. Then the subroutine incrementally adds edges between (reviewer, paper) pairs in decreasing order of similarity and stops when the paper load constraints are met (each paper can be assigned to $\kappa$ reviewers using only edges added at this point). This iterative procedure ensures that the papers are assigned reviewers with approximately the highest possible similarities.
Subroutine. A key component of our algorithm is the sequential construction of a flow network in Subroutine 1. The subroutine takes as input, among other arguments, the set of papers that are not yet assigned and the required number of reviewers per paper . The goal of the subroutine is to assign each paper in with reviewers, respecting the reviewer load constraints, so that the minimum similarity across all paper-reviewer pairs in the resulting assignment is maximized.
The output of the subroutine is an assignment (represented by variable ) which is initially set as empty (Step 1). The subroutine begins (Step 2) with the construction of a directed acyclic graph (a “flow network”) comprising 4 layers in the following order: a source, all reviewers, all papers in , and a sink. An edge may exist only between consecutive layers. The edges between the first two layers control the reviewers’ workloads and the edges between the last two layers represent the number of reviews required by the papers. Finally, the costs of all edges in this initial construction are set to . Note that in subsequent steps, edges are added only between the second and third layers. Thus, the maximum flow in the network is at most .
The crux of the subroutine is to incrementally add edges one at a time between the layers, representing the reviewers and papers, in a carefully designed manner (Steps 3 and 4). The edges are added in order of decreasing similarities. These edges control a reviewer-paper relationship: they have a unit capacity to ensure that any reviewer can review any paper at most once and their costs are equal to the similarity between the corresponding (reviewer, paper) pair.
After adding each edge, the subroutine (Step 5) tests whether a max-flow of size is feasible. Note that a feasible flow of size corresponds to a feasible assignment: by construction of the flow network described earlier, we know that the reviewer and paper load constraints are satisfied. The capacity of each edge in our flow network is a non-negative integer, thereby guaranteeing that the max-flow is an integer, that it can be found in polynomial time, and that the flow in every edge is a non-negative integer under the max-flow. Once the max-flow of size is reached, the subroutine stops adding edges. At this point, it is ensured that the value of the lowest similarity in the resulting assignment is maximized.
Finally, the subroutine assigns each paper to reviewers, using only the “high similarity” edges added to the network so far (Steps 6 and 7). The existence of the corresponding assignment is guaranteed by the max-flow in the network being equal to . There may be more than one feasible assignment that attains the max-flow. While any of these assignments would suffice from the standpoint of optimizing the worst-case fairness objective (4), the PC may wish to make a specific choice for additional benefits and specify the heuristic to pick the max-flow in Step 6 of the subroutine. For example, if the max-flow with the maximum cost is selected, then the resulting assignment combines fairness with a high average quality of the assignment. Another choice, discussed in Appendix C, helps with broad topic coverage of the assignment. Importantly, the approximation guarantees established in Theorem 1 and Corollary 1, as well as the statistical guarantees from Sections 6 and 7, hold for any max-flow assignment chosen in Steps 6 and 7.
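The incremental construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `subroutine_sketch`, the argument names, and the node labels are hypothetical, the max-flow is found with plain breadth-first augmenting paths, and the heuristic of Step 6 (for example, choosing the max-flow of maximum cost) is not implemented, so an arbitrary max-flow is returned.

```python
from collections import deque

def subroutine_sketch(sim, papers, kappa, loads):
    """Assign kappa reviewers to each paper, maximizing the minimum similarity.

    sim[i][j] is the similarity of reviewer i and paper j; loads[i] is the
    number of papers reviewer i may still take. Returns (assignment, threshold),
    or (None, None) when no feasible assignment exists.
    """
    reviewers = range(len(sim))
    cap = {}  # residual capacities, cap[u][v]

    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)  # reverse residual edge

    for i in reviewers:                  # source -> reviewer: load constraint
        add_edge("s", ("r", i), loads[i])
    for j in papers:                     # paper -> sink: reviews required
        add_edge(("p", j), "t", kappa)

    def augment():
        """Push one unit of flow along a BFS augmenting path, if one exists."""
        parent = {"s": None}
        queue = deque(["s"])
        while queue and "t" not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if "t" not in parent:
            return False
        v = "t"
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        return True

    target, flow = kappa * len(papers), 0
    pairs = sorted(((sim[i][j], i, j) for i in reviewers for j in papers),
                   reverse=True)
    for s, i, j in pairs:                # edges in decreasing similarity
        add_edge(("r", i), ("p", j), 1)  # unit capacity: at most one review
        while flow < target and augment():
            flow += 1
        if flow == target:               # every paper has kappa reviewers
            assignment = {
                j: {i for i in reviewers if cap[("p", j)].get(("r", i), 0) == 1}
                for j in papers
            }
            return assignment, s
    return None, None
```

The returned threshold is the similarity of the last edge added, which is the lowest similarity that had to be admitted before a full assignment became feasible.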
For comparison, we note that the TPMS algorithm can equivalently be interpreted in this framework as follows. The TPMS algorithm would first connect all reviewers to all papers in layers 2 and 3 of the flow graph. It would then compute a max-flow with max cost in this fully connected flow network and make reviewer-paper assignments corresponding to the edges with unit flow between layers 2 and 3. In contrast, our sequential construction of the flow graph prevents papers from being assigned to weak reviewers and is crucial for ensuring the fairness objective.
Algorithm. The algorithm calls the subroutine iteratively and uses the outputs of these iterations in a carefully designed manner. Initially, all papers belong to a set which represents papers that are not yet assigned. The algorithm repeats Steps 2 to 7 until all papers are assigned. In every iteration, for every value of , the algorithm first calls the subroutine to assign reviewers to each paper from (Step 2b), and then adjusts reviewers’ capacities and the similarity matrix (Step 2c) to prevent any reviewer from being assigned to the same paper twice. Next, the subroutine is called again (Step 2d) to assign another reviewers to each paper. As a result, after completion of Step 2, feasible candidate assignments are constructed. Each assignment , is guaranteed (through Step 2b) to maximize the minimum similarity across pairs where and reviewer is among the strongest reviewers assigned to paper in ; and (through Steps 2d and 2e) to have each paper assigned exactly reviewers.
In Step 3, the algorithm chooses the assignment with the highest fairness (4) among the candidate assignments and the assignment from the previous iteration (empty in the first iteration). Note that since is also included in the maximizer, the fairness cannot decrease in subsequent iterations.
In the chosen assignment, the algorithm identifies the papers that are most disadvantaged, and fixes the assignment for these papers (Step 4). The assignment for these papers will not be changed in any subsequent step. The next steps (Steps 5 and 6) update the auxiliary variables to account for this fixed assignment by decreasing the corresponding reviewer capacities and removing the assigned papers from the set . Step 7 then keeps track of the present assignment for use in subsequent iterations, ensuring that fairness cannot decrease as the algorithm proceeds.
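The selection and fixing steps just described (Steps 3 to 6) can be illustrated as follows. This is a simplified sketch under assumptions: all names (`paper_scores`, `fairness`, `select_and_fix`) are hypothetical, assignments are plain dictionaries mapping a paper index to a set of reviewer indices, the transformation of similarities defaults to the identity, and the assignment kept from the previous iteration is assumed to be passed in as one more candidate.

```python
def paper_scores(assignment, sim, f=lambda s: s):
    """Sum of transformed similarities f(sim[i][j]) for each paper j."""
    return {j: sum(f(sim[i][j]) for i in revs) for j, revs in assignment.items()}

def fairness(assignment, sim, f=lambda s: s):
    """Fairness of an assignment: the score of its worst-off paper."""
    return min(paper_scores(assignment, sim, f).values())

def select_and_fix(candidates, sim, loads, f=lambda s: s):
    """Pick the fairest candidate, fix its worst-off papers, update loads."""
    best = max(candidates, key=lambda a: fairness(a, sim, f))
    scores = paper_scores(best, sim, f)
    worst = min(scores.values())
    # Papers whose score equals the fairness value are the most disadvantaged.
    fixed = {j: set(best[j]) for j in best if scores[j] == worst}
    new_loads = list(loads)
    for revs in fixed.values():          # fixed reviews consume capacity
        for i in revs:
            new_loads[i] -= 1
    remaining = set(best) - set(fixed)   # papers still to be assigned
    return best, fixed, new_loads, remaining
```

Calling `select_and_fix` repeatedly on the shrinking set of remaining papers, with candidates rebuilt each time, mirrors the outer loop described above.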
We make a few additional remarks regarding the PeerReview4All algorithm.
1. Computational cost: A naïve implementation of the PeerReview4All algorithm has a computational complexity . We give more details on implementation and computational aspects in Appendix B.
2. Variable reviewer or paper loads: More generally, the PeerReview4All algorithm allows for specifying different loads for different reviewers and/or papers. For general paper loads, we consider and define the capacity of the edge between the node corresponding to any paper and the sink as .
3. Incorporating conflicts of interest: One can easily incorporate any conflict of interest between any reviewer and paper by setting the corresponding similarity to .
4. Topic coverage: The techniques developed in Long et al. (2013) can be employed to modify our algorithm in a way that it first ensures fairness and then, among all approximately fair assignments, picks one that approximately maximizes the number of distinct topics of papers covered. We discuss this modification in Appendix C.
To provide additional intuition behind the design of the algorithm, we now present an example that we also use in the next section to explain our approximation guarantees.
Let us for a moment assume that and let be a constant close to 1. Consider the following two scenarios:
(S1) The optimal assignment is such that all the papers are assigned to reviewers with high similarity:
(S2) The optimal assignment is such that there are some “critical” papers which have assigned reviewers with similarities higher than and the remaining assigned reviewers with small similarities. All other papers are assigned to reviewers with similarity higher than .
Intuitively, the first scenario corresponds to an ideal situation since there exists an assignment such that each paper has competent reviewers (with similarity ). In contrast, in the second scenario, even in the fair assignment, some papers lack expert reviewers. Such a scenario may occur, for example, if some non-mainstream papers were submitted to a conference. This case entails identifying and treating these disadvantaged papers as well as possible. To be able to find the fair assignment in both scenarios, the assignment algorithm should distinguish between them and adapt its behavior to the structure of the similarity matrix. Let us track the inner workings of the PeerReview4All algorithm to demonstrate this behaviour.
We note that by construction, the fairness of the resulting assignment is determined in the first iteration of Steps 2 to 7 of Algorithm 1, so we restrict our attention to . First, consider scenario (S1). The subroutine called with parameter will add edges to the flow network until the maximal flow of size is reached. Since the optimal assignment is such that the lowest similarity is higher than , the last edge added to the flow network will have similarity at least , implying that the fairness of the candidate assignment , which is a lower bound for the fairness of the resulting assignment, will be at least . Given that is close to one, we conclude that in this case the algorithm is able to recover an assignment which is at least very close to optimal.
Now, let us consider scenario (S2). In this scenario, the subroutine called with may return a poor assignment. Indeed, since there is a lack of competent reviewers for critical papers, there is no way to assign each paper with reviewers having a high minimum similarity in the assignment. However, the subroutine called with parameter will find strong reviewers for each paper (including the critical papers), thereby leading to a fairness . The obtained lower bound guarantees that the assignment recovered by the PeerReview4All algorithm is also close to the optimal, because in the fair assignment some papers have only strong reviewers.
This example thus illustrates how the PeerReview4All algorithm can adapt to the structure of the similarity matrix in order to guarantee fairness, as well as other guarantees that are discussed subsequently in the paper.
5 Approximation guarantees
In this section we provide guarantees on the fairness of the reviewer-assignment by our algorithm. We first establish guarantees on the max-min fairness objective introduced earlier (Section 5.1). We subsequently show that our algorithm optimizes not only the worst-off paper but recursively optimizes all papers (Section 5.2). We then conclude this section on deterministic approximation guarantees with a comparison to past literature (Section 5.3).
5.1 Max-min fairness
We begin with some notation that will help state our main approximation guarantees. For each value of , consider the reviewer-assignment problem but where each paper requires (instead of ) reviews (each reviewer still can review up to papers). Let us denote the family of all feasible assignments for this problem as . Now define the quantities
Intuitively, for every assignment from the family , the quantity upper bounds the minimum similarity for any assigned (reviewer, paper) pair. It also means that the value is achievable by some assignment in . The value captures the value of the largest entry in the similarity matrix and gives a trivial upper bound for every feasible assignment . Likewise, the value captures the smallest entry in the similarity matrix and yields a lower bound for every feasible assignment .
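Read literally, the description above corresponds to definitions of the following shape. The symbols are assumptions introduced here for illustration, since the displayed definition (6) is not reproduced in this text: $\mathcal{A}_\kappa$ denotes the family of feasible assignments with $\kappa$ reviewers per paper, $R_A(j)$ the set of reviewers assigned to paper $j$ under $A$, and $s_{ij}$ the similarity between reviewer $i$ and paper $j$.

```latex
% Assumed notation; a sketch of the quantities described in the text.
s^{*}_{\kappa} \;=\; \max_{A \in \mathcal{A}_\kappa} \; \min_{j} \; \min_{i \in R_A(j)} s_{ij},
\qquad
s^{*}_{0} \;=\; \max_{i,j} \, s_{ij},
\qquad
s^{*}_{\infty} \;=\; \min_{i,j} \, s_{ij}.
```

Under this reading, the first quantity is attained by some assignment in the family and upper bounds the minimum assigned similarity of every other one, while the last two are the largest and smallest entries of the similarity matrix.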
We are now ready to present the main result on the approximation guarantees for the PeerReview4All algorithm as compared to the optimal assignment .
Theorem 1. Consider any feasible values of , any monotonically increasing function , and any similarity matrix . The assignment given by the PeerReview4All algorithm guarantees the following lower bound on the fairness objective (4):
The numerator of (7a) is a lower bound on the fairness of the assignment returned by our algorithm. It is important to note that if , that is, if we only need to assign one reviewer for each paper, then our PeerReview4All algorithm finds the exact solution for the problem, recovering the classical results of Garfinkel (1971) as a special case.
In practice, the number of reviewers required per paper is a small constant (typically set as ), and in that case, our algorithm guarantees a constant factor approximation. Note that the fraction on the right-hand side of (7a) can become 0/0 or ∞/∞, and in both cases it should be read as 1.
The bound (7a) can be significantly tighter than , as we illustrate in the following example.
Consider two scenarios (S1) and (S2) from Section 4.2, and consider . One can see that under scenario (S1), we have . Setting in the numerator and in the denominator of the bound (7a), and recalling that , we obtain:
where we have also used the fact that . Let us now consider the second scenario (S2) in the example of Section 4.2. In this scenario, since each paper can be assigned to strong reviewers with similarity higher than , we have . We then also have . Moreover, there are some papers which have only strong reviewers in optimal assignment , and hence we have . Setting in the numerator and in the denominator of the bound (7a), some algebraic simplifications yield the bound
We now briefly provide more intuition on the bound (7a) by interpreting it in terms of specific steps in the algorithm. Setting , let us consider the first iteration of the algorithm. Recalling the definition (6) of , the PeerReview4All subroutine called with parameter on Step 2b finds an assignment such that all the similarities are at least . This guarantee in turn implies that the fairness of the corresponding assignment is at least , thereby giving rise to the numerator of (7a). The denominator is an upper bound on the fairness of the optimal assignment . The expression for any value of is obtained by simply appealing to the definition of , which is defined in terms of the optimal assignment. By definition (6) of , for every feasible assignment there exists at least one paper such that at most of the assigned reviewers are of similarity larger than . Thus, the fairness of the optimal assignment is upper-bounded by the sum similarity of the paper that has reviewers with similarity (the highest possible similarity), and reviewers with similarity .
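The argument in this paragraph can be written out as a worked inequality. What follows is a sketch of one reading consistent with the description, not a quotation of (7a): the symbols $\Gamma$, $A^{\mathrm{PR4A}}$, $A^{\mathrm{HARD}}$, $\lambda$, $\kappa$, $\kappa'$ and the reviewer counts $\kappa$, $\lambda-\kappa$, $\kappa'-1$, $\lambda-\kappa'+1$ are assumptions, and $f$ is taken to be non-negative.

```latex
% Numerator: \kappa reviewers with similarity at least s^*_\kappa, and the
% remaining \lambda - \kappa reviewers with at least the smallest similarity.
% Denominator: some paper has at most \kappa' - 1 reviewers above s^*_{\kappa'}.
\frac{\Gamma\bigl(A^{\mathrm{PR4A}}\bigr)}{\Gamma\bigl(A^{\mathrm{HARD}}\bigr)}
\;\ge\;
\frac{\kappa \, f(s^{*}_{\kappa}) + (\lambda - \kappa)\, f(s^{*}_{\infty})}
     {(\kappa' - 1)\, f(s^{*}_{0}) + (\lambda - \kappa' + 1)\, f(s^{*}_{\kappa'})}
\qquad \text{for any } \kappa, \kappa' \in \{1, \dots, \lambda\}.

% Special case \kappa = \kappa' = 1:
\frac{f(s^{*}_{1}) + (\lambda - 1)\, f(s^{*}_{\infty})}{\lambda \, f(s^{*}_{1})}
\;\ge\; \frac{f(s^{*}_{1})}{\lambda \, f(s^{*}_{1})} \;=\; \frac{1}{\lambda}.
```

Under these assumptions, taking the best choice of $\kappa$ in the numerator and of $\kappa'$ in the denominator can only improve on the factor $1/\lambda$, which matches the constant-factor behaviour described in the text.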
Finally, one may wonder whether optimizing the objective (2) as done by prior works (Charlin and Zemel, 2013; Charlin et al., 2012) can also guarantee fairness. It turns out that this is not the case (see the example in Table 1 for intuition), and optimizing the objective (2) is not a suitable proxy towards the fairness objective (4). In Appendix A.2 we show that in general the fairness objective value of the TPMS algorithm which optimizes (2) may be arbitrarily bad as compared to that attained by our PeerReview4All algorithm.
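A toy instance makes the gap between the two objectives concrete. The sketch below is an illustration under assumptions and does not reproduce Table 1: the 2-by-2 similarity matrix is hypothetical, each paper receives exactly one reviewer and each reviewer at most one paper, the similarity transformation is the identity, and brute-force enumeration stands in for both a sum-of-similarities objective and the max-min fairness objective.

```python
from itertools import permutations

def enumerate_assignments(sim):
    """Yield tuples revs where revs[j] is the single reviewer of paper j."""
    m, n = len(sim), len(sim[0])
    yield from permutations(range(m), n)

def total_similarity(sim, revs):
    """Sum-of-similarities objective."""
    return sum(sim[i][j] for j, i in enumerate(revs))

def worst_paper(sim, revs):
    """Max-min fairness objective: similarity of the worst-off paper."""
    return min(sim[i][j] for j, i in enumerate(revs))

# Hypothetical similarities: rows are reviewers, columns are papers.
sim = [[1.0, 0.4],
       [0.3, 0.0]]

by_sum = max(enumerate_assignments(sim), key=lambda a: total_similarity(sim, a))
by_min = max(enumerate_assignments(sim), key=lambda a: worst_paper(sim, a))
```

Here the sum-maximizing assignment gives the strong reviewer to the first paper and leaves the second paper with a reviewer of zero similarity, whereas the max-min assignment swaps the reviewers so that both papers receive a non-zero similarity.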
In Appendix A.3 we show that the analysis of the approximation factor of our algorithm is tight in the sense that there exists a similarity matrix for which the bound (7b) is met with equality. That said, the approximation factor of our PeerReview4All algorithm can be much better than for various other similarity matrices, as demonstrated in examples (S1) and (S2).
5.2 Beyond worst case
The previous section established guarantees for the PeerReview4All algorithm on the fairness of the assignment in terms of the worst-off paper. In this section we formally show that the algorithm does more: having the assignment for the worst-off paper fixed, the algorithm then satisfies the second worst-off paper, and so on.
Recall that Algorithm 1 iteratively repeats Steps 2 to 7. In fact, the first time that Step 3 is executed, the resulting intermediate assignment achieves the max-min guarantees of Theorem 1. However, the algorithm does not terminate at this point. Instead, it finds the most disadvantaged papers in the selected assignment and fixes them in the final output (Step 4), attributing these papers to reviewers according to . Then it repeats the entire procedure (Steps 2 to 7) again to identify and fix the assignment for the most disadvantaged papers among the remaining papers, and so on until all papers are assigned in . We denote the total number of iterations of Steps 2 to 7 in Algorithm 1 as . For any iteration , we let be the set of papers which the algorithm, in this iteration, fixes in the resulting assignment. We also let denote the assignment selected in Step 3 of the iteration. Note that eventually all the papers are fixed in the final assignment , and hence we must have .
Once papers are fixed in the final output , the assignment for these papers is not changed anymore. Thus, at the end of each iteration of Steps 2 to 7, the algorithm deletes (Step 6) the columns of the similarity matrix that correspond to the papers fixed in this iteration. For example, at the end of the first iteration, the columns which correspond to are deleted from . For each iteration , we let denote the similarity matrix at the beginning of the iteration. Thus, we have , because at the beginning of the first iteration, no papers are fixed in the final assignment .
Moving forward, we are going to show that for every iteration , the sum similarity of the worst-off papers (which coincides with the fairness of ) is close to the best possible, given the assignment for all the papers fixed in the previous iterations. As in Theorem 1, we will compare the fairness with the fairness of the optimal assignment that the Hard algorithm would return if called at the beginning of the iteration. We stress that for every , the Hard algorithm assigns papers and respects the constraints on reviewers’ loads, adjusted for the assignment of papers in . We denote the corresponding assignment as . Note that . The following corollary summarizes the main result of this section:
where values , are defined with respect to the similarity matrix and constraints on reviewers’ loads adjusted for the assignment of papers in .
The corollary guarantees that each time the algorithm fixes the assignment for some papers in , the sum similarity for these papers (which is smallest among papers from ) is close to the optimal fairness, where optimal fairness is conditioned on the previously assigned papers. In case , the bound (8) coincides with the bound (7) from Theorem 1. Hence, once the assignment for the worst-off papers is fixed, the PeerReview4All algorithm adjusts the maximum reviewers’ loads and looks for the fairest assignment of the remaining papers.
5.3 Comparison to past literature
In this section we discuss how the approximation results established in previous sections relate to the past literature.
First, we note that the assignment , computed in Step 2 in the first iteration of Steps 2 to 7 of Algorithm 1, recovers the assignment of Hartvigsen et al. (1999), thus ensuring that our algorithm is at least as fair as theirs. Second, if the goal is to assign only one reviewer () to each of the papers, then our PeerReview4All algorithm finds the optimally fair assignment and recovers the classical result of Garfinkel (1971).
In the remainder of this section, we provide a comparison between the guarantees of the PeerReview4All algorithm established in Theorem 1 and the guarantees of the ILPR algorithm (Garg et al., 2010). Rewriting the results of Garg et al. (2010) in our notation, we have the bound:
Note that our bound (7) for our PeerReview4All algorithm is multiplicative and the bound for the ILPR algorithm is additive, which makes them incomparable in the sense that neither one dominates the other. However, we stress the following differences. First, if we assume to be upper-bounded by one, then the assignment satisfies the bound
This bound gives a nice additive approximation factor — for a large value of the optimal fairness , the constant additive factor is negligible. However, if the optimal fairness is small, which can happen if some papers do not have a sufficient number of high-expertise reviewers, then the lower bound on the fairness of the ILPR assignment (10) becomes negative, making the guarantees vacuous as any arbitrary assignment will achieve a non-negative fairness. Note that this issue is not an artifact of the analysis but is inherent in the ILPR algorithm itself, as we demonstrate in the example presented in Table 1 and in Appendix A.1. In contrast, our algorithm in the worst case has a multiplicative approximation factor ensuring that it always returns a non-trivial assignment.
This discrepancy becomes more pronounced if the function is allowed to be unbounded, and the similarities are significantly heterogeneous. Suppose there is some reviewer and paper such that . Then the bound (9) for the ILPR algorithm again becomes vacuous, while the bound (7) for the PeerReview4All algorithm continues to provide a non-trivial approximation guarantee.
6 Objective-score model
We now turn to establishing statistical guarantees for our PeerReview4All algorithm from Section 4. We begin by considering an “objective” score model which we borrow from past works.
6.1 Model setup
The objective-score model assumes that each paper has a true, unknown quality and each reviewer assigned to paper gives her/his estimate of . The eventual goal is to estimate the top papers according to the true qualities . Following the line of works by Ge et al. (2013); McGlohon et al. (2010); Dai et al. (2012); Sajjadi et al. (2016), we assume the score given by any reviewer to any paper to be independently and normally distributed around the true paper qualities:
Past work additionally imposes a condition on the noise variances which implies that the variance of the reviewers’ scores depends only on the reviewer, but not on the paper reviewed. We claim that this assumption is not appropriate for our peer-review problem: conferences today (such as ICML and NIPS) cover a wide spectrum of research areas and it is not reasonable to expect a reviewer to be equally competent in all of the areas.
In our analysis, we assume that the noise variances are some function of the underlying computed similarities. (Recall that the similarities can capture not only affinity in research areas but may also incorporate the bids or preferences of reviewers, past history of review quality, etc.) We assume that for any and , the noise variance
for some monotonically decreasing function . We assume that this function is known; this assumption is reasonable as the function can, in principle, be learned from the data from the past conferences.
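The score model described here can be simulated directly. This is a minimal sketch under assumptions: the function name `simulate_scores` and its arguments are hypothetical, and the variance function used in the usage note, h(s) = 1 - s, is only an illustrative monotonically decreasing choice, not the function used in the paper.

```python
import random

def simulate_scores(theta, sim, assignment, h, rng):
    """Draw one Gaussian score per assigned (reviewer, paper) pair.

    theta[j] is the true quality of paper j, sim[i][j] the similarity of
    reviewer i and paper j, assignment[j] the reviewers of paper j, and
    h maps a similarity to a noise variance (assumed non-negative).
    """
    scores = {}
    for j, revs in assignment.items():
        for i in revs:
            sigma = h(sim[i][j]) ** 0.5      # standard deviation from variance
            scores[(i, j)] = rng.gauss(theta[j], sigma)
    return scores
```

For instance, `simulate_scores([0.5, -1.0], [[1.0, 0.3], [0.2, 1.0]], {0: [0], 1: [1]}, lambda s: 1.0 - s, random.Random(0))` returns noiseless scores, because both assigned pairs have similarity 1 and hence zero variance under this illustrative h.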
We note that the model (11) does not consider reviewers’ biases. However, some reviewers might be more stringent while others are more lenient. This difference results in the score of any reviewer for any paper being centered not at , but at . A common approach to reducing biases in reviewers’ scores is post-processing. For example, Ge et al. (2013) compared different statistical models of reviewers in an attempt to calibrate the biases; the techniques developed in that work may be extended to the reviewer model (11). Thus, we leave that bias term out for simplicity.
Given a valid assignment , the goal of an estimator is to recover the top papers. A natural way to do so is to compute estimates of the true paper scores and return the top papers with respect to these estimated scores. The described estimation procedure is a significantly simplified version of what happens in real-world conferences. Nevertheless, this fully-automated procedure may serve as a guideline for area chairs, providing a first-order estimate of the total ranking of submitted papers. In what follows, we refer to any estimator as and to the estimated score of any paper as . Specifically, we consider the following two estimators:
Mean score estimator (MEAN)
The mean score estimator is convenient in practice because it is not tied to the assumed statistical model, and in the past has been found to be predictive of final acceptance decisions in peer-review settings such as National Science Foundation grant proposals (Cole et al., 1981) and homework grading (Sajjadi et al., 2016). This observation is supported by John Langford, the program chair of ICML 2012, who notes in his blog (Langford, 2012) that in ICML 2012 the acceptance decisions were “surprisingly uniform as a function of average score in reviews”.
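The two estimation strategies can be contrasted in code. The sketch below is written under assumptions: the function names are hypothetical, and the maximum likelihood estimator is implemented as the inverse-variance weighted average, which is the standard Gaussian MLE of a common mean when the noise variances are known and positive; the paper's own displayed formulas are not reproduced in this text.

```python
def mean_estimate(scores):
    """MEAN: plain average of the scores a paper received."""
    return sum(scores) / len(scores)

def mle_estimate(scores, variances):
    """Gaussian MLE with known positive variances: inverse-variance weighting."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, scores)) / sum(weights)

def top_k(estimates, k):
    """Indices of the k papers with the largest estimated scores."""
    order = sorted(range(len(estimates)), key=lambda j: estimates[j], reverse=True)
    return set(order[:k])
```

With scores `[1.0, 3.0]` and variances `[1.0, 3.0]`, the weighted estimate moves toward the more reliable first score (about 1.5), while the plain mean stays at 2.0. The two estimators coincide when all variances are equal.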
Here we present statistical guarantees for both the MLE and MEAN estimators, and for both exact top recovery and recovery under a Hamming error tolerance.
6.3.1 Exact top recovery
Let us use and to denote the indices of the papers that are respectively ranked and according to their true qualities. Similar to the past work by Shah and Wainwright (2015) on top item recovery, a central quantity in our analysis is a -separation threshold defined as:
Intuitively, if the difference between and papers is large enough, it should be easy to recover top papers. To formalize this intuition, for any value of a parameter , consider a family of papers’ scores
For the first half of this section, we assume that the function is bounded, that is, . (More generally, we could consider a bounded function with range for some . Without loss of generality, we set which can always be achieved by appropriate scaling.) This assumption implicitly means that every reviewer can provide a minimum level of expertise while reviewing any paper even if she/he has zero similarity with that paper.
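The separation threshold introduced above can be computed in a couple of lines. This sketch assumes, as the surrounding discussion suggests, that the threshold is the gap between the k-th and (k+1)-th largest true qualities; the function name and the indexing convention are hypothetical, since the displayed definition is not reproduced in this text.

```python
def separation_threshold(theta, k):
    """Gap between the k-th and (k+1)-th largest qualities (1 <= k < len(theta))."""
    ranked = sorted(theta, reverse=True)
    return ranked[k - 1] - ranked[k]
```

For qualities `[0.2, 0.9, 0.5, 0.4]` and `k = 2`, the gap is between 0.5 and 0.4, so top-2 recovery hinges on resolving a difference of roughly 0.1.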
In addition to the gap , the hardness of the problem also depends on the similarities between reviewers and papers. For instance, if all reviewers have near-zero similarity with all the papers, then recovery is impossible unless the gap is extremely large. In order to quantify the tractability of the problem in terms of the similarities we introduce the following set of families of similarity matrices parameterized by a non-negative value :
In words, if similarity matrix belongs to , then the fairness of the optimally fair (with respect to ) assignment is at least .
Finally, we define a quantity that captures the quality of approximation provided by PeerReview4All:
Note that Theorem 1 gives lower bounds on the value of .
Having defined all the necessary notation, we are ready to present the first result of this section on recovering the set of top papers .
Theorem 2. (a) For any , and any monotonically decreasing , if , then for
(b) Conversely, for any continuous strictly monotonically decreasing and any , there exists a universal constant such that if and , then
1. The PeerReview4All assignment algorithm thus leads to a strong minimax guarantee on the recovery of the top papers: the upper and lower bounds differ by at most a term in the requirement on and a constant pre-factor. Also note that as discussed in Section 5.1, the approximation factor of the PeerReview4All algorithm can be much better than for various similarity matrices.
2. In addition to quantifying the performance of PeerReview4All, an important contribution of Theorem 2 is a sharp minimax analysis of the performance of every assignment algorithm. Indeed, the approximation ratio (17) can be defined for any assignment algorithm by substituting the corresponding assignment in place of . For example, if one has access to the optimal assignment (e.g., by using PeerReview4All if ), then we will have the corresponding approximation ratio , thereby yielding bounds that are sharp up to constant pre-factors.
3. While on the one hand the MLE estimator is preferred over the MEAN estimator when model (11) is correct, on the other hand, if , then the MEAN estimator is more robust to model mismatches.
4. The technical assumption is made without loss of generality, because values of outside this range are vacuous. In more detail, for any similarity matrix , it must be that . Moreover, the co-domain of the function comprises only non-negative real values, implying that for any similarity matrix .
5. The upper bound of the theorem holds for a slightly more general model of reviewers — reviewers with sub-Gaussian noise. Formally, in addition to the Gaussian noise model (11), the proof of Theorem 2(a) also holds for the following class of distributions of the score :
is an arbitrary mean zero sub-Gaussian random variable with scale parameter.
The conditions of Theorem 2 require the function to be bounded. We now relax our earlier boundedness assumption on and consider .
In what follows we restrict our attention to the MLE estimator, which represents the paradigm that reviewers with higher similarity should have more weight in the final decision. In order to demonstrate that our PeerReview4All algorithm is able to adapt to different structures of similarity matrices (from hard cases when the optimal assignment provides only one strong reviewer for some of the papers, to ideal cases when there are strong reviewers for every paper), let us consider the following set of families of similarity matrices parametrized by a non-negative value and integer parameter :
Here is as defined in (6).
In words, the parameter defines the notion of a strong reviewer while the parameter denotes the maximum number of strong (with similarity higher than ) reviewers that can be assigned to each paper without violating the conditions.
Then the following adaptive analogue of Theorem 2 holds:
Corollary 2. (a) For any , , and any monotonically decreasing , if , then
(b) Conversely, for any continuous strictly monotonically decreasing , any , and any , there exists a universal constant such that if and , then
1. Observe that there is no approximation factor in the upper bound. Thus, the PeerReview4All algorithm together with are simultaneously minimax optimal up to a constant pre-factor in classes of similarity matrices for all , .
3. Corollary 2 together with Theorem 2 show that our PeerReview4All algorithm produces an assignment which is simultaneously minimax (near-)optimal for various classes of similarity matrices. We thus see that our PeerReview4All algorithm is able to adapt to the underlying structure of the similarity matrix in order to construct an assignment in which even the most disadvantaged paper gets reviewers with sufficient expertise to estimate the true quality of the paper.
6.3.2 Approximate recovery under Hamming error
Although our ultimate goal is to recover the set of top papers exactly, we note that the scores of boundary papers are often close to each other, so it may be impossible to distinguish between the and papers in the total ranking. Thus, a more realistic goal would be to accept papers such that the set of accepted papers is in some sense “close” to the set . In this work we consider the standard notion of Hamming distance (1) as a measure of closeness. We are interested in minimizing the quantity:
for some user-defined value of .
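The closeness measure can be illustrated on sets of paper indices. This sketch assumes that the Hamming distance between two sets is the size of their symmetric difference, which is the standard set-based notion; the function names are hypothetical and definition (1) itself is not reproduced in this section.

```python
def hamming_distance(accepted, true_top):
    """Number of papers that belong to exactly one of the two sets."""
    return len(set(accepted) ^ set(true_top))

def within_tolerance(accepted, true_top, t):
    """True when the accepted set is within Hamming distance t of the true top set."""
    return hamming_distance(accepted, true_top) <= t
```

For example, swapping a single borderline paper changes the accepted set from `{1, 2, 3}` to `{2, 3, 4}`, which is at Hamming distance 2 from the original because one paper is wrongly excluded and one is wrongly included.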
Similar to the exact recovery setup, the key role in the analysis is played by a generalized separation threshold (compare with equation (14)):
where and are indices of papers that take and positions respectively in the underlying total ranking. For any value of we consider the following generalization of the set defined in (15):
Also recall the family of matrices from (16) and the approximation factor from (17) for any parameter . With this notation in place, we now present the analogue of Theorem 2 in case of approximate recovery under the Hamming error.
(a) For any , , , and any monotonically decreasing , if , then for