In the absence of a well-chosen incentive structure, experts are not necessarily honest when reporting their opinions. For example, when reporting subjective probabilities, experts who have a reputation to protect might tend to produce forecasts near the most likely group consensus, whereas experts who have a reputation to build might tend to overstate the probabilities of outcomes they feel will be understated in a possible consensus (Nakazono, 2013). Hence, an important question when eliciting experts’ opinions is how to incentivize honest reporting.
Proper scoring rules (Winkler and Murphy, 1968) are traditional devices that incentivize honest reporting of subjective probabilities, i.e., experts maximize their expected scores by honestly reporting their opinions. However, proper scoring rules rely on the assumption that there is an observable future outcome, which is not always a reasonable assumption. For example, when market analysts provide sales forecasts on a potential new product, there is no guarantee that the product will ever be produced. Hence, the actual number of sales may never be observed.
In this paper, we propose a scoring method for promoting honest reporting amongst a group of experts when future outcomes are unobservable. In particular, we are interested in settings where experts observe signals from a multinomial distribution with an unknown parameter. Honest reporting then means that experts report exactly the signals that they observed. Our scoring method is built on proper scoring rules. However, unlike what is traditionally assumed in the proper-scoring-rules literature, our method does not assume that there is an observable future outcome. Instead, scores are determined based on pairwise comparisons between experts’ reported opinions.
The proposed method may be used in a variety of settings, e.g., strategic planning, reputation systems, peer review, etc. When applied to strategic planning, the proposed method may induce honest evaluation of different strategic plans. A strategic plan is a systematic and coordinated way to develop a direction and a course for an organization, which includes a plan to allocate the organization’s resources (Argenti, 1968). After a candidate strategic plan is discarded, it becomes nearly impossible to observe what would be the consequences of that plan because strategic plans are long-term in nature. Hence, a method to incentivize honest evaluations of candidate strategic plans cannot assume that the result of a strategic plan is observable in the future.
Our method can also be applied to reputation systems to elicit honest feedback. In reputation systems, individuals rate a product/service after experiencing it, e.g., customer product reviews on Amazon.com are one such reputation system. Due to the subjective nature of this task, incentives for honest feedback should not be based on the assumption that an absolute rating exists.
For ease of exposition, we introduce our scoring method by illustrating its application to a domain where traditionally there are no observable outcomes: the peer-review process. Peer review is a process in which an expert’s output is scrutinized by a number of other experts with relevant expertise in order to ensure quality control and/or to provide credibility. Peer review is commonly used when there is no objective way to measure the output’s quality, i.e., when quality is a subjective matter. Peer review has been widely used in several professional fields, e.g., accounting (AICPA: American Institute of CPAs, 2012), law (LSC: Legal Services Commission, 2005), health care (Dans, 1993), etc.
Currently, a popular application of the peer-review process is in online education. Recent years have seen a surge of massive open online courses, i.e., free online academic courses aimed at large-scale participation. Some of these courses have attracted tens of thousands of students (Pappano, 2012). One of the biggest challenges that this massive number of students poses to online educators is grading, since the available resources (personnel, time, etc.) are often insufficient. Auto-grading by computers is not always feasible, e.g., in courses whose assignments consist of essay-style questions and/or questions that do not have clear right/wrong answers. Peer review has been used by some companies, such as Coursera (http://www.coursera.org/), as a way to overcome this issue.
For simplicity’s sake, we focus on peer review as used in modern scientific communication. The process, as we consider it in this paper, can be described as follows: when a manuscript arrives at the editorial office of an academic journal, it is first examined by the editor, who might reject it immediately because it is either out of scope or of unacceptable quality. Manuscripts that pass this first stage are then sent out to experts with relevant expertise, who are usually asked to classify the manuscript as publishable immediately, publishable after some revisions, or not publishable at all. Traditionally, the manuscript’s authors do not know the reviewers’ identities, but the reviewers may or may not know the identity of the authors.
In other words, peer review can be seen as a decision-making process where the reviewers serve as cognitive inputs that help a decision maker (chair, editor, course instructor, etc.) judge the quality of a peer’s output. A crucial point in this process is that it greatly depends on the reviewers’ honesty. In the canonical peer-review process, reviewers have no direct incentives for honestly reporting their reviews. Several potential problems have been discussed in different research areas, e.g., bias against female authors, authors from minor institutions, and non-native English writers (Bornmann et al., 2007; Wenneras and Wold, 1997; Primack and Marrs, 2008; Newcombe and Bouton, 2009).
In order to illustrate the application of our method to peer review, we start by modeling the peer-review process as a Bayesian model so as to take the uncertainty regarding the quality of the manuscript into account. We then introduce our scoring method to evaluate reported reviews. We assume that the scores received by reviewers are somehow coupled with relevant incentives, be they social-psychological, such as praise or visibility, or material rewards through prizes or money. Hence, we naturally assume that reviewers seek to maximize their expected scores and that there are no external incentives. We show that reviewers strictly maximize their expected scores by honestly disclosing their reviews under the additional assumptions that they are Bayesian decision-makers and that they cannot influence the reviews of other reviewers.
Honesty is intrinsically related to accuracy in our peer-review model: as the number of honest reviews increases, the distribution of the reported reviews converges to the probability distribution that represents the quality of the manuscript. We performed peer-review experiments to validate the model and to test the efficiency of the proposed scoring method. Our experimental results corroborate our theoretical model by showing that the act of encouraging honest reporting through the proposed scoring method creates more accurate reviews than the traditional peer-review process, where reviewers have no direct incentives for expressing their true reviews.
In addition to our method for inducing honest reporting, we also propose a method to aggregate opinions that uses information from experts’ scores. Our aggregation method is general in a sense that it can be used in any decision-making setting where experts report probability distributions over the outcomes of a discrete random variable. The proposed method works as if the experts were continuously updating their opinions in order to accommodate the expertise of others. Each updated opinion takes the form of a linear opinion pool, where the weight that an expert assigns to a peer’s opinion is inversely related to the distance between their opinions. In other words, experts are assumed to prefer opinions that are close to their own opinions, where closeness is defined by an underlying proper scoring rule. We provide conditions under which consensus is achieved under our aggregation method and discuss a behavioral foundation of it. Using data from our peer-review experiments, we find that the consensual review resulting from the proposed aggregation method is consistently more accurate than the canonical average review.
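The iterative pooling idea can be sketched in a few lines. The sketch below is an illustrative simplification, not the paper's exact rule: a generic 1/(1 + L1 distance) weight stands in for the scoring-rule-based weights, and the opinion vectors are made up. Each step replaces every expert's distribution with a weighted average of all distributions, weighting peers inversely to the distance between opinions.

```python
# Illustrative sketch of distance-weighted iterative opinion pooling.
# The weight 1/(1 + L1 distance) is a stand-in assumption: closer
# opinions receive larger weights, as described in the text.

def pool_step(opinions):
    pooled = []
    for p in opinions:
        # Weight each peer's opinion inversely to its distance from p.
        w = [1.0 / (1.0 + sum(abs(a - b) for a, b in zip(p, q)))
             for q in opinions]
        total = sum(w)
        # Linear opinion pool: normalized weighted average of all opinions.
        pooled.append([sum(w[j] * opinions[j][k] for j in range(len(opinions))) / total
                       for k in range(len(p))])
    return pooled

opinions = [[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]]
for _ in range(50):                      # iterate toward consensus
    opinions = pool_step(opinions)
spread = max(abs(opinions[0][k] - opinions[1][k]) for k in range(2))
assert spread < 1e-6                     # opinions have (nearly) merged
```

Because every weight is bounded away from zero (the L1 distance between distributions is at most 2), each step is a strict contraction, so the opinions converge to a single consensual distribution.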
2 Related Work
In recent years, two prominent methods to induce honest reporting without the assumption of observable future outcomes were proposed: the Bayesian truth serum (BTS) method (Prelec, 2004) and the peer-prediction method (Miller et al., 2005).
The BTS method works on a single multiple-choice question with a finite number of alternatives. Each expert is requested to endorse the answer most likely to be true and to predict the empirical distribution of the endorsed answers. Experts are evaluated by the accuracy of their predictions as well as by how surprisingly common their answers are. The surprisingly-common criterion promotes truthfulness by exploiting the false-consensus effect, i.e., the general tendency of experts to overestimate the degree of agreement that others have with them.
The score received by an expert from the BTS method has two major components. The first one, called the information score, evaluates the answer endorsed by the expert according to the log-ratio of its actual-to-predicted endorsement frequencies. The second component, called the prediction score, is a penalty proportional to the relative entropy between the empirical distribution of answers and the expert’s prediction of that distribution. Under the BTS scoring method, collective honest reporting is a Bayes-Nash equilibrium.
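The two components can be sketched as follows. This is a simplified illustration rather than Prelec's exact mechanism: the pooled population prediction `predicted` stands in for the geometric mean of all experts' predictions, the penalty weight is fixed at 1, and all names are ours.

```python
import math

# Simplified sketch of the two BTS components described above.

def bts_score(answer, prediction, empirical, predicted):
    """answer     -- index of the alternative endorsed by the expert
    prediction -- expert's predicted distribution of endorsed answers
    empirical  -- actual empirical distribution of endorsed answers
    predicted  -- pooled population prediction (stand-in for the
                  geometric mean of all experts' predictions)
    """
    # Information score: log-ratio of actual-to-predicted endorsement
    # frequency of the endorsed answer ("surprisingly common" when > 0).
    information = math.log(empirical[answer] / predicted[answer])
    # Prediction score: relative entropy (KL divergence) between the
    # empirical distribution and the expert's prediction of it.
    penalty = sum(e * math.log(e / p)
                  for e, p in zip(empirical, prediction) if e > 0)
    return information - penalty     # the penalty enters negatively
```

For instance, an expert who endorses an answer that turns out to be more common than the population predicted, and who predicts the empirical distribution exactly, receives a strictly positive score.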
The BTS method has been used to promote honest reporting in many different domains, e.g., when sharing rewards amongst a set of experts (Carvalho and Larson, 2011) and in policy analysis (Weiss, 2009). However, the BTS method has two major drawbacks. First, it requires the population of experts to be large. Second, besides reporting their opinions, experts must also make predictions about how their peers will report their opinions. While the artificial intelligence community has recently addressed the former issue (Witkowski and Parkes, 2012; Radanovic and Faltings, 2013), the latter issue remains an intrinsic requirement of the BTS method.
The drawbacks of the BTS method are not shared by the peer-prediction method (Miller et al., 2005). In the peer-prediction method, a number of experts experience a product and rate its quality. A mechanism then collects the ratings and makes payments based on those ratings. The peer-prediction method makes use of the stochastic correlation between the signals observed by the experts from the product to achieve a Bayes-Nash equilibrium where every expert reports honestly.
A major problem with the peer-prediction method is that it depends on historical data. For example, when applied to a peer-review setting, after a reviewer $i$ reports his review, say $\hat{x}_i$, the mechanism estimates reviewer $i$'s prediction of the review reported by another reviewer $j$, $\Pr(\hat{x}_j \mid \hat{x}_i)$, which is then evaluated and rewarded using a proper scoring rule and reviewer $j$'s actual reported review. The mechanism needs a history of previously reported reviews to compute $\Pr(\hat{x}_j \mid \hat{x}_i)$, which is not always a reasonable assumption, e.g., when the evaluation criteria change from review to review or when the peer-review process is being used for the first time. In other words, the peer-prediction method is prone to cold-start problems.
Carvalho and Larson (2012) addressed this issue by making the extra assumption that experts have uninformative prior knowledge about the distribution of the observed signals. Given this assumption, honest reporting is induced by simply making pairwise comparisons between reported opinions and rewarding agreements. In this paper, we extend the method by Carvalho and Larson (2012) in several ways. First, we show that the assumption of uninformative priors is unnecessary as long as experts have common prior distributions and this fact is common knowledge. Moreover, we provide stronger conditions with respect to the underlying proper scoring rule under which pairwise comparisons induce honest reporting.
Another contribution of our work is a method to aggregate the reported opinions into a single consensual opinion. Over the years, both behavioral and mathematical methods have been proposed to establish consensus (Clemen and Winkler, 1999). Behavioral methods attempt to generate agreement through interaction and exchange of knowledge. Ideally, the sharing of information leads to a consensus. However, behavioral methods usually provide no conditions under which experts can be expected to reach an agreement. On the other hand, mathematical aggregation methods consist of processes or analytical models that operate on the reported opinions in order to produce a single aggregate opinion. DeGroot (1974) proposed a model which describes how a group of experts can reach agreement on a consensual opinion by pooling their individual opinions. A drawback of DeGroot’s method is that it requires each expert to explicitly assign weights to the opinions of other experts. In this paper, we propose a method to set these weights directly which takes the scores received by the experts into account. We also provide a behavioral interpretation of the proposed aggregation method.
A related method for finding consensus was proposed by Carvalho and Larson (2013). Under the assumption that experts prefer probability distributions close to their own distributions, where closeness is measured by the root-mean-square deviation, the authors showed that a consensus is always achieved. Moreover, if risk-neutral experts are rewarded using the quadratic scoring rule, then the assumption that experts prefer probability distributions that are close to their own distributions follows naturally. The approach in this paper is more general because the underlying proper scoring rule can be any bounded proper scoring rule.
From an empirical perspective, we investigate the efficiency of both our scoring method and our method for finding consensus in a peer-review experiment. Formal experiments involving peer review are still relatively scarce. Even though the application of the peer-review process to scientific communication can be traced back almost 300 years, it was not until the early 1990s that research on this matter became more intensive and formalized (van Rooyen, 2001). Scientists in the biomedical domain have been in the forefront of research on the peer-review process due to the fact that dependable quality-controlled information can literally be a matter of life and death in this research field. In particular, the staff of the renowned BMJ, formerly British Medical Journal, have been studying the merits and limitations of peer review over a number of years (Lock, 1985; Godlee et al., 2003). Most of their work has focused on defining and evaluating review quality (van Rooyen et al., 1999), and examining the effect of specific interventions on the quality of the resulting reviews (van Rooyen, 2001).
One mechanism used to prevent bias in the peer-review process is double-blind review, which consists of hiding both the authors’ and the reviewers’ identities. Indeed, it has been reported that this practice reduces bias against female authors (Budden et al., 2008). However, it can be argued that knowing the authors’ identities makes it easier for reviewers to compare the new manuscript with previously published papers, and that it encourages reviewers to disclose conflicts of interest. Another argument that undermines the benefits of double-blind reviewing is that the authorship of a manuscript is often obvious to a knowledgeable reader from the context, e.g., self-referencing, research topic, writing style, working-paper repositories, seminars, etc. (Falagas et al., 2006; Justice et al., 1998; Yankauer, 1991). Furthermore, this mechanism does not protect against certain types of bias, e.g., when a reviewer rejects new evidence or new knowledge because it contradicts established norms, beliefs, or paradigms.
Some work has focused on the calibration aspect of peer review. Roos et al. (2011) proposed a maximum likelihood method for calibrating reviews by estimating both the bias of each reviewer and the unknown ideal score of the manuscript. Bias is treated as the general rigor of a reviewer across all his reviews. Hence, Roos et al.’s method does not attempt to prevent bias by rewarding honest reporting. Instead, it adjusts reviews a posteriori so that they can be globally comparable.
Instead of calibrating reviews a posteriori, Robinson (2001) suggested to “calibrate” reviewers a priori. Reviewers are first asked to review short texts that have gold-standard reviews, i.e., reviews of high quality provided by experts with relevant expertise. Thereafter, they receive calibration scores, which are later used as weighting factors to determine how well their future reviews will be considered. This approach, however, does not guarantee that reviewers will report honestly after the calibration phase, when gold-standard reviews are no longer available.
To the best of our knowledge, our peer-review experiments are the first to investigate the use of incentives for honest reporting in a peer-review task. When objective verification is not possible, as in the peer-review process, economic measures may be used to encourage experts to honestly disclose their opinions. The proposed scoring method does so by making pairwise comparisons between reported reviews and rewarding agreements.
Rewarding experts based on pairwise comparisons has been empirically proven to be an effective incentive technique in other domains. Shaw et al. (2011) measured the effectiveness of a collection of social and financial incentive schemes for motivating experts to conduct a qualitative content analysis task. The authors found that treatment conditions that provided financial incentives and asked experts to prospectively think about the responses of their peers produced more accurate responses. Huang and Fu (2013) showed that informing the experts that their rewards will be based on how similar their responses are to other experts’ responses produces more accurate responses than telling the experts that their rewards will be based on how similar their responses are to gold-standard responses. Our work adds to the existing body of literature by theoretically and empirically showing that pairwise comparisons make the peer-review process more accurate.
3 The Basic Model
In our proposed peer-review process, a manuscript is reviewed by a set of reviewers $N = \{1, \dots, n\}$, with $n \geq 2$. The quality of the manuscript is represented by a multinomial distribution (we use the term multinomial distribution to refer to the generalization of the Bernoulli distribution to discrete random variables with any constant number of outcomes; the parameter of this distribution is a probability vector that specifies the probability of each possible outcome) with unknown parameter $p = (p_1, \dots, p_m)$, where $m$ represents the best evaluation score that the manuscript can receive and $p_k$ is the probability assigned to the evaluation score being equal to $k$, for $k \in \{1, \dots, m\}$.
Each reviewer is modeled as possessing a privately observed draw (signal) from this multinomial distribution. Hence, our model captures the reviewers’ uncertainty regarding the quality of the manuscript. We extend the model to multiple observed signals in Section 5. We denote the honest review of each reviewer $i$ by $x_i$, where $x_i \in \{1, \dots, m\}$. Honest reviews are independent and identically distributed, i.e., $\Pr(x_i = k \mid p) = p_k$ for every reviewer $i$. We say that reviewer $i$ is reporting honestly when his reported review $\hat{x}_i$ is equal to his honest review, i.e., $\hat{x}_i = x_i$.
Reviews are elicited and aggregated by a trusted entity referred to as the center (we refer to a single reviewer as “he” and to the center as “she”), which is also responsible for rewarding the reviewers. Let $u_i$ be reviewer $i$’s review score after he reports $\hat{x}_i$. We discuss how $u_i$ is determined in Section 4. Review scores are somehow coupled with relevant incentives, be they social-psychological, such as praise or visibility, or material rewards through prizes or money. We make four major assumptions in our model:
Autonomy: Reviewers cannot influence other reviewers’ reviews, i.e., they do not know each other’s identities and they are not allowed to communicate with each other during the reviewing process.
Risk Neutrality: Reviewers behave so as to maximize their expected review scores.
Dirichlet Priors: There exists a common prior distribution over the unknown parameter $p$. We assume that this prior is a Dirichlet distribution and that this fact is common knowledge.
Rationality: After observing his signal $x_i$, every reviewer $i$ updates his belief about $p$ by applying Bayes’ rule to the common prior.
The first assumption describes how peer review is traditionally done in practice. The second assumption means that reviewers are self-interested and that no external incentives exist for any reviewer. The third assumption means that reviewers have common prior knowledge about the quality of the manuscript, a natural assumption in the peer-review process. We discuss the formal meaning of this assumption in the following subsection. The fourth assumption implies that the posterior distributions are consistent with Bayesian updating, i.e.:

$$\Pr(p \mid x_i = k) = \frac{\Pr(x_i = k \mid p)\,\Pr(p)}{\Pr(x_i = k)} = \frac{p_k \Pr(p)}{\Pr(x_i = k)}.$$
The last three assumptions imply that reviewers are Bayesian decision-makers. We note that different modeling choices could have been used, e.g., models based on games of incomplete information. Unlike our model, an incomplete-information game is often used when experts do not know each other’s beliefs. To find strategic equilibria in such incomplete-information models, one would need information about experts’ beliefs about each other’s private information. A Bayesian structure could be used to model each expert’s beliefs about the others, and it would permit the calculation of experts’ expected scores, which are maximized at equilibrium. However, the natural autonomy assumption makes such a Bayesian structure unrealistic.
3.1 Dirichlet Distributions
An important assumption in our model is that reviewers have Dirichlet priors over distributions of evaluation scores. The Dirichlet distribution can be seen as a continuous distribution over parameter vectors of a multinomial distribution. Since $p$ is the unknown parameter of the multinomial distribution that describes the quality of the manuscript, it is natural to consider a Dirichlet distribution as a prior for $p$. Given a vector of positive integers $\alpha = (\alpha_1, \dots, \alpha_m)$ that determines the shape of the Dirichlet distribution, the probability density function of the Dirichlet distribution over $p$ is:

$$f(p; \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{m} \alpha_k\right)}{\prod_{k=1}^{m} \Gamma(\alpha_k)} \prod_{k=1}^{m} p_k^{\alpha_k - 1}. \quad (1)$$
Figure 1 shows the above probability density when $m = 3$ for some parameter vectors $\alpha$. For the Dirichlet distribution in (1), the expected value of $p_k$ is $\alpha_k / \sum_{j=1}^{m} \alpha_j$. The probability vector $\left(\alpha_1 / \sum_{j} \alpha_j, \dots, \alpha_m / \sum_{j} \alpha_j\right)$ is called the expected distribution regarding $p$.
Given a Dirichlet prior over $p$ and multinomially distributed observations, the posterior distribution of $p$ is itself a Dirichlet distribution. This relationship is often used in Bayesian statistics to estimate hidden parameters of multinomial distributions. To illustrate this point, suppose that reviewer $i$ observes the signal $x_i = k$, for some $k \in \{1, \dots, m\}$. After applying Bayes’ rule, reviewer $i$’s posterior distribution is a Dirichlet distribution with parameter $(\alpha_1, \dots, \alpha_k + 1, \dots, \alpha_m)$. Consequently, the new expected distribution is:

$$\left(\frac{\alpha_1}{1 + \sum_{j=1}^{m} \alpha_j}, \dots, \frac{\alpha_k + 1}{1 + \sum_{j=1}^{m} \alpha_j}, \dots, \frac{\alpha_m}{1 + \sum_{j=1}^{m} \alpha_j}\right). \quad (2)$$
We call the probability vector in (2) reviewer $i$’s posterior predictive distribution regarding $p$ because it provides the distribution of future outcomes given the observed signal $x_i$. With this perspective, we regard the values $\alpha_1, \dots, \alpha_m$ as “pseudo-counts” from “pseudo-data”, where each $\alpha_k$ can be interpreted as the number of times that the evaluation score $k$ has been observed before.
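The pseudo-count reading of the conjugate update can be sketched in a few lines; `posterior_predictive` and its argument names are ours, not part of the model's notation.

```python
# Sketch of the Dirichlet pseudo-count update described above: alpha
# holds the prior pseudo-counts, observing signal k increments alpha[k],
# and the posterior predictive distribution is the normalized vector of
# updated counts.

def posterior_predictive(alpha, signal):
    """Posterior predictive distribution after observing one signal.

    alpha  -- list of positive pseudo-counts (one per evaluation score)
    signal -- index of the observed evaluation score
    """
    updated = list(alpha)
    updated[signal] += 1          # conjugate count update (Bayes' rule)
    total = sum(updated)
    return [c / total for c in updated]

# Example: uniform prior over 3 evaluation scores, then one observation
# of the third score (index 2).
print(posterior_predictive([1, 1, 1], 2))  # [0.25, 0.25, 0.5]
```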
Throughout this paper, we assume that reviewers have common prior Dirichlet distributions and that this fact is common knowledge, i.e., the value of $\alpha$ is initially the same for all reviewers. A practical interpretation of this assumption is that reviewers have common prior knowledge about the quality of the manuscript, i.e., reviewers have a common expectation regarding the quality of arriving manuscripts.
By using Dirichlet distributions as priors, belief updating can be expressed as an updating of the parameters of the prior distribution. (We note that other priors could have been used. However, the inference process would not necessarily be analytically tractable. In general, tractability can be obtained through conjugate distributions. Hence, another modeling choice is to consider that evaluation scores follow a normal distribution with unknown parameters. Assuming exchangeability, we can then use either the normal-gamma distribution or the normal-scaled inverse gamma distribution as the conjugate prior (Bernardo and Smith, 1994). The major drawback of this approach is that continuous evaluation scores might bring extra complexity to the reviewing process.) Furthermore, the assumption of common knowledge allows the center to estimate reviewers’ posterior distributions based solely on their reported reviews, a point which is exploited by our proposed scoring method. Due to its attractive theoretical properties, the Dirichlet distribution has been used to model uncertainty in a variety of different scenarios, e.g., when experts are sharing a reward based on peer evaluations (Carvalho and Larson, 2012) and when experts are grouped based on their individual differences (Navarro et al., 2006).
4 Scoring Method
In this section, we propose a scoring method to induce honest reporting of reviews. The proposed method is built on proper scoring rules (Winkler and Murphy, 1968).
4.1 Proper Scoring Rules
Consider an uncertain quantity with $m$ possible outcomes $1, \dots, m$, and a probability vector $q = (q_1, \dots, q_m)$, where $q_k$ is the probability value associated with the occurrence of outcome $k$. A scoring rule $R(q, k)$ provides a score for the assessment $q$ upon observing the outcome $k$, for $k \in \{1, \dots, m\}$. A scoring rule is called strictly proper when an expert receives his maximum expected score if and only if his stated assessment corresponds to his true assessment (Winkler and Murphy, 1968). The expected score of a reported assessment $q'$ when outcomes are distributed according to $q$, for a real-valued scoring rule $R$, is:

$$\sum_{k=1}^{m} q_k\, R(q', k). \quad (3)$$
Proper scoring rules have been used as a tool to promote honest reporting in a variety of domains, e.g., when sharing rewards amongst a set of experts based on peer evaluations (Carvalho and Larson, 2010, 2012), to incentivize experts to accurately estimate their own efforts to accomplish a task (Bacon et al., 2012), in prediction markets (Hanson, 2003), in weather forecasting (Gneiting and Raftery, 2007), etc. Some of the best-known strictly proper scoring rules, together with their scoring ranges, are:

logarithmic: $R(q, k) = \ln(q_k)$, with range $[-\infty, 0]$;
quadratic: $R(q, k) = 2 q_k - \sum_{j=1}^{m} q_j^2$, with range $[-1, 1]$;
spherical: $R(q, k) = q_k \big/ \sqrt{\sum_{j=1}^{m} q_j^2}$, with range $[0, 1]$.
All the above scoring rules are symmetric, i.e., $R(q, k) = R(q_\sigma, \sigma(k))$ for all probability vectors $q$, for all permutations $\sigma$ on $m$ elements, and for all outcomes $k$, where $q_\sigma$ denotes the vector whose $\sigma(j)$-th element is $q_j$. We say that a scoring rule is bounded if $-\infty < R(q, k) < \infty$ for all probability vectors $q$ and all outcomes $k$. For example, the logarithmic scoring rule is not bounded because it returns $-\infty$ whenever the probability assigned to the observed outcome is equal to zero, whereas both the quadratic and the spherical scoring rules are always bounded. A well-known property of strictly proper scoring rules is that they remain strictly proper under positive affine transformations (Gneiting and Raftery, 2007), i.e., $a R + b$ is strictly proper whenever $R$ is strictly proper, $a > 0$, and $b \in \mathbb{R}$.
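The three classical rules can be written down and their strict propriety spot-checked numerically. The function names are ours, and the check below compares the honest report against one arbitrary dishonest report rather than proving propriety in general.

```python
import math

# The logarithmic, quadratic, and spherical scoring rules, scoring a
# reported distribution q when outcome k is observed.

def logarithmic(q, k):
    return math.log(q[k])                           # range (-inf, 0]

def quadratic(q, k):
    return 2 * q[k] - sum(p * p for p in q)         # range [-1, 1]

def spherical(q, k):
    return q[k] / math.sqrt(sum(p * p for p in q))  # range [0, 1]

def expected_score(rule, true_p, reported_q):
    """Expected score of report q when outcomes are drawn from p."""
    return sum(true_p[k] * rule(reported_q, k) for k in range(len(true_p)))

# Strict propriety: the honest report earns a strictly higher expected
# score than a (here, arbitrarily chosen) dishonest report.
p = [0.7, 0.2, 0.1]
dishonest = [0.5, 0.3, 0.2]
for rule in (logarithmic, quadratic, spherical):
    assert expected_score(rule, p, p) > expected_score(rule, p, dishonest)
```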
Proposition 1. If $R$ is a strictly proper scoring rule, then a positive affine transformation of $R$, i.e., $a R + b$ for $a > 0$ and $b \in \mathbb{R}$, is also strictly proper.
4.2 Review Scores
If we knew reviewers’ honest reviews a priori, we could then compare the honest reviews to the reported reviews and reward agreement. However, due to the subjective nature of the peer-review process, we face a situation where this objective truth is practically unknowable. Our solution is to induce honest reporting through pairwise comparisons of reported reviews. The first step towards computing each reviewer $i$’s review score is to estimate his posterior predictive distribution shown in (2) based on his reported review $\hat{x}_i$. Let $\hat{q}_i = (\hat{q}_{i,1}, \dots, \hat{q}_{i,m})$ be such an estimation, where:

$$\hat{q}_{i,k} = \begin{cases} \dfrac{\alpha_k + 1}{1 + \sum_{j=1}^{m} \alpha_j} & \text{if } \hat{x}_i = k,\\[6pt] \dfrac{\alpha_k}{1 + \sum_{j=1}^{m} \alpha_j} & \text{otherwise.} \end{cases} \quad (4)$$
Recall that the elements of reviewer $i$’s true posterior predictive distribution $q_i = (q_{i,1}, \dots, q_{i,m})$ are defined as:

$$q_{i,k} = \begin{cases} \dfrac{\alpha_k + 1}{1 + \sum_{j=1}^{m} \alpha_j} & \text{if } x_i = k,\\[6pt] \dfrac{\alpha_k}{1 + \sum_{j=1}^{m} \alpha_j} & \text{otherwise.} \end{cases}$$
Clearly, $\hat{q}_i = q_i$ if and only if reviewer $i$ is reporting honestly, i.e., when he reports $\hat{x}_i = x_i$. The review score of reviewer $i$ is determined as follows:

$$u_i = \sum_{j \in N \setminus \{i\}} \left( a_j\, R(\hat{q}_i, \hat{x}_j) + b_j \right), \quad (5)$$
where $a_j$ and $b_j$ are constants, with $a_j > 0$ and $b_j \in \mathbb{R}$, and $R$ is a strictly proper scoring rule. Scoring rules require an observable outcome, or a “reality”, in order to score an assessment. Intuitively, we treat each review reported by a reviewer other than reviewer $i$ as an observed outcome, i.e., as the evaluation score deserved by the manuscript, and we then score reviewer $i$’s estimated posterior predictive distribution in (4) as an assessment of that value.
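The pairwise scoring method can be sketched as follows, here instantiated with the quadratic rule, a uniform prior, and illustrative names and numbers of our own choosing: the center rebuilds reviewer i's posterior predictive distribution from his reported review and scores it against each peer's report, treated as the observed outcome.

```python
# Minimal sketch of the pairwise-comparison review score.

def quadratic(q, k):
    return 2 * q[k] - sum(p * p for p in q)

def review_score(reports, i, alpha, rule=quadratic, a=1.0, b=0.0):
    """Review score of reviewer i.

    reports -- list with each reviewer's reported evaluation score
    alpha   -- common Dirichlet pseudo-counts (the prior)
    rule    -- strictly proper scoring rule, rule(q, k)
    a, b    -- affine constants (a > 0), here equal for all peers
    """
    counts = list(alpha)
    counts[reports[i]] += 1            # pseudo-count update from i's report
    total = sum(counts)
    q_i = [c / total for c in counts]  # estimated posterior predictive (4)
    # Score q_i against every peer's report, treated as an "outcome".
    return sum(a * rule(q_i, reports[j]) + b
               for j in range(len(reports)) if j != i)

# Three reviewers, uniform prior over 3 evaluation scores.
print(review_score([2, 2, 0], 0, [1, 1, 1]))  # 0.75
```

Reviewer 0 agrees with reviewer 1 and disagrees with reviewer 2, so his score (0.625 + 0.125) is higher than reviewer 2's would be.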
Proposition 2. Each reviewer $i$ strictly maximizes his expected review score if and only if $\hat{x}_i = x_i$.
Proof. Let $i \in N$ and $j \in N \setminus \{i\}$. By the autonomy assumption, reviewers cannot affect their peers’ reviews. Hence, we can restrict ourselves to showing that each reviewer $i$ strictly maximizes the expected value of $a_j R(\hat{q}_i, \hat{x}_j) + b_j$, for every $j \neq i$, if and only if $\hat{x}_i = x_i$.
(If part) Since $R$ is a strictly proper scoring rule, from Proposition 1 we have that $a_j R + b_j$ is also strictly proper. Hence, because reviewer $i$’s belief about $\hat{x}_j$ is given by his true posterior predictive distribution $q_i$:

$$\sum_{k=1}^{m} q_{i,k} \left( a_j R(q_i, k) + b_j \right) > \sum_{k=1}^{m} q_{i,k} \left( a_j R(\hat{q}_i, k) + b_j \right), \quad \text{for all } \hat{q}_i \neq q_i.$$
If $\hat{x}_i = x_i$, then by construction $\hat{q}_i = q_i$, i.e., the estimated posterior predictive distribution in (4) is equal to the true posterior predictive distribution in (2). Consequently, honest reporting strictly maximizes reviewers’ expected review scores.
(Only-if part) Using a similar argument, given that $R$ is a strictly proper scoring rule, from Proposition 1 we have that any report $\hat{x}_i \neq x_i$ yields $\hat{q}_i \neq q_i$ and, hence, a strictly lower expected review score. ∎
Another way to interpret the above result is to imagine that each reviewer is betting on the review deserved by the manuscript. Since the most relevant information available to him is the observed signal, then the strategy that maximizes his expected review score is to bet on that signal, i.e., to bet on his honest review. When this happens, the true posterior predictive distribution in (2) is equal to the estimated posterior predictive distribution in (4) and, consequently, the expected score resulting from a strictly proper scoring rule is strictly maximized when the expectation is taken with respect to the true posterior predictive distribution.
It is important to observe that, by incentivizing honest reporting, the scoring function in (5) also incentivizes accuracy, since honest reviews are draws from the distribution that represents the true quality of the manuscript. In other words, the center indirectly observes these draws when reviewers report honestly. Consequently, by the law of large numbers, the distribution of the reported reviews converges to the distribution that represents the true quality of the manuscript as the number of honestly reported reviews increases. Our experimental results in Section 7 show that there indeed exists a strong correlation between honesty and accuracy.
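The convergence claim can be illustrated with a quick simulation; the quality distribution below is made up, and the tolerance is deliberately loose.

```python
import random

# Illustrative simulation: as more honest reviews arrive, their
# empirical distribution approaches the true quality distribution p.

random.seed(0)
p = [0.1, 0.3, 0.6]                      # hypothetical true quality
n = 10000                                # number of honest reviews
reviews = random.choices(range(3), weights=p, k=n)
empirical = [reviews.count(k) / n for k in range(3)]
max_gap = max(abs(e, ) if False else abs(e - t) for e, t in zip(empirical, p))
assert max_gap < 0.02                    # empirical distribution is close to p
```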
Different interpretations of the scoring method in (5) arise depending on the underlying strictly proper scoring rule and the hyperparameter $\alpha$. In the following subsections, we discuss two such interpretations: 1) when $R$ is a symmetric and bounded strictly proper scoring rule and reviewers’ prior distributions are non-informative; and 2) when $R$ is a strictly proper scoring rule sensitive to distance.
4.3 Rewarding Agreement
Assume that reviewers’ prior distributions are non-informative, i.e., all the elements making up the hyperparameter have the same value. This happens when reviewers have no relevant prior knowledge about the quality of the manuscript. Consequently, the elements of reviewers’ true and estimated posterior predictive distributions can take on only two possible values (see equations (2) and (4) for ).
Moreover, if is a symmetric scoring rule, then the term in (5) can take on only two possible values because a permutation of elements with similar values does not change the score of a symmetric scoring rule. When is also strictly proper, it means that , when , and , when , where . Consequently, each term of the summation in (5) can be written as:
When is also bounded, we can then set and , and the above values become, respectively, and . Hence, the resulting review scores do not depend on parameters of the model. Moreover, we obtain an intuitive interpretation of the scoring method in (5): whenever two reported reviews are equal to each other, the underlying reviewers are rewarded by one payoff unit. Thus, in practice, our scoring method works by simply comparing reported reviews and rewarding agreements whenever is a symmetric and bounded strictly proper scoring rule and reviewers have no informative prior knowledge about the quality of the manuscript.
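In this setting the method reduces to simply counting matches between reported reviews. A minimal sketch of this counting view (the function name and example reviews are illustrative, with the per-agreement reward normalized to one payoff unit):

```python
def agreement_scores(reviews, reward=1.0):
    """Score each reviewer by rewarding one payoff unit for every peer
    whose reported review matches the reviewer's own (hypothetical sketch
    of the agreement-rewarding special case)."""
    scores = []
    for i, r_i in enumerate(reviews):
        # count peers (excluding the reviewer himself) reporting the same review
        matches = sum(1 for j, r_j in enumerate(reviews) if j != i and r_j == r_i)
        scores.append(reward * matches)
    return scores

# e.g., four reviewers reporting evaluation scores on a 1..4 scale
print(agreement_scores([3, 3, 1, 4]))  # → [1.0, 1.0, 0.0, 0.0]
```

Reviewers 1 and 2 agree and each earn one unit; the two lone reviews earn nothing.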
Another interesting point is that the center can reward different agreements in different ways, i.e., reviewers are not necessarily equally valued. For example, if the center knows a priori that a particular reviewer is reliable (respectively, unreliable), then she can increase (respectively, decrease) the reward of reviewers whose reviews are in agreement with reviewer ’s reported review. Formally, this means that for different reviewers and , the center can use different values for and in (5). Proposition 2 is not affected by this as long as , , and their values are independent of the reported reviews. Hence, by having a few reliable reviewers, this approach might help to eliminate the hypothetical scenario where a set of reviewers learn over time to report similar reviews. A similar idea was proposed by Jurca and Faltings (2009) to prevent collusions in reputation systems.
4.4 Strictly Proper Scoring Rules Sensitive to Distance
Pairwise comparisons, as defined in the previous subsection, might work well for small values of , the best evaluation score that the manuscript can receive, but they can be too restrictive and, to some degree, unfair when the best evaluation score is high. For example, when and the review used as the observed outcome is also equal to , a reported review equal to seems to be more accurate than a reported review equal to . One effective way to deal with these issues is to use strictly proper scoring rules in (5) that are sensitive to distance.
Using the notation of Section 4.1, recall that is some reported probability distribution. Given that the outcomes are ordered, we denote the cumulative probabilities by capital letters: . We first define the notion of distance between two probability vectors as proposed by Staël von Holstein (1970). We say that a probability vector is more distant from the th outcome than a probability vector if:
Intuitively, the above definition means that can be obtained from by successively moving probability mass towards the th outcome from other outcomes (Staël von Holstein, 1970). A scoring rule is said to be sensitive to distance if whenever is more distant from for all . Epstein (1969) introduced the ranked probability score (RPS), a strictly proper scoring rule that is sensitive to distance. Using the formulation of Epstein’s result proposed by Murphy (1970), we have for a probability vector and an observed outcome :
Figure 2 illustrates the scores returned by (6) for different reported reviews and values for when reviewers’ prior distributions are non-informative. When using RPS as the strictly proper scoring rule in (5), reviewers are rewarded based on how close their reported reviews are to the reviews taken as observed outcomes. For example, when the review used as the observed outcome is equal to (see the dotted line with squares in Figure 2), the returned score monotonically decreases as the reported review increases. Since RPS is strictly proper, Proposition 2 is still valid for any hyperparameter , i.e., each reviewer strictly maximizes his expected review score by reporting honestly. The scoring range of RPS is . Hence, review scores are always non-negative when using and in (5).
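The RPS is computed from cumulative probabilities. The sketch below uses one standard, negatively oriented formulation (the sum of squared differences between the cumulative forecast and the cumulative outcome indicator), where lower values mean the forecast is closer to the outcome; the positively oriented score in (6) is, up to an affine transformation, the same quantity. Function name and 0-indexed outcomes are conventions of the sketch:

```python
import numpy as np

def rps(prob, outcome):
    """Ranked probability score (negatively oriented sketch): sum of squared
    differences between the cumulative forecast distribution and the
    cumulative indicator of the observed outcome (0-indexed)."""
    p = np.cumsum(prob)                                    # cumulative forecast
    o = (np.arange(len(prob)) >= outcome).astype(float)    # cumulative outcome
    return float(np.sum((p - o) ** 2))

# a forecast concentrated near the outcome is "closer" (smaller RPS)
print(rps([0.1, 0.7, 0.2], 1) < rps([0.7, 0.1, 0.2], 1))  # → True
```

Because the score depends only on cumulative probabilities, moving probability mass away from the observed outcome strictly worsens it, which is exactly the sensitivity-to-distance property.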
4.5 Numerical Example
Consider four reviewers () and the best evaluation score being equal to four (). Suppose that reviewers have non-informative Dirichlet priors with , and that reviewers 1, 2, 3, and 4 report, respectively, , , , and . From (4), the resulting estimated posterior predictive distributions are, respectively, , , , and . In what follows, we illustrate the scores returned by (5) when using a symmetric and bounded strictly proper scoring rule and when using RPS.
4.5.1 Rewarding Agreements
Assume that in (5) is the quadratic scoring rule shown in (3), which in turn is symmetric, bounded, and strictly proper. Consequently, as discussed in Section 4.3, the term in (5) can take on only two values:
Hence, by setting and , the above values are equal to, respectively, and . Using the scoring method in (5), we obtain the following review scores: and . That is, the review scores received by reviewers and are similar due to the fact that . Reviewer and ’s review scores are equal to because there is no match between their reported reviews and others’ reported reviews.
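A sketch of how such review scores are computed under the pairwise scoring method with the quadratic rule. The reported reviews, the non-informative prior, and the constants a and b below are hypothetical placeholders (the example's actual numbers do not survive this extraction), and evaluation scores are 0-indexed for convenience:

```python
import numpy as np

def quadratic_score(p, k):
    """Quadratic (Brier-type) strictly proper scoring rule: 2*p[k] - sum(p^2)."""
    p = np.asarray(p, dtype=float)
    return 2.0 * p[k] - float(np.sum(p ** 2))

def review_scores(reviews, alpha, a=0.0, b=1.0):
    """Pairwise scoring sketch: reviewer i's estimated posterior predictive
    (Dirichlet pseudo-counts alpha updated with i's single reported signal)
    is scored against every peer's reported review; a and b are the affine
    constants of the scoring method (illustrative values)."""
    alpha = np.asarray(alpha, dtype=float)
    n = len(reviews)
    scores = []
    for i in range(n):
        counts = alpha.copy()
        counts[reviews[i]] += 1           # update common prior with i's report
        p_hat = counts / counts.sum()     # estimated posterior predictive
        s = sum(a + b * quadratic_score(p_hat, reviews[j])
                for j in range(n) if j != i)
        scores.append(s)
    return scores

# four reviewers, evaluation scores 0..3, non-informative prior
print(review_scores([1, 1, 0, 3], alpha=[1, 1, 1, 1]))
```

As in the example above, the two agreeing reviewers receive identical (and higher) review scores, while unmatched reviewers receive lower ones.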
4.5.2 Taking Distance into Account
Now, assume that in (5) is the RPS rule shown in (6). In order to ensure non-negative review scores, let and . Using the scoring method in (5), we obtain the following review scores: , and . The review score of reviewer is the lowest because his reported review is the most different review, i.e., it has the largest distance between it and all of the other reviews.
5 Multiple Criteria
In our basic model, reviewers observe only one signal from the distribution that represents the quality of the manuscript. However, manuscripts are often evaluated under multiple criteria, e.g., relevance, clarity, originality, etc., meaning that in practice reviewers might observe multiple signals and report multiple evaluation scores. Under the assumption that these signals are independent, each reported evaluation score can be scored individually using the same scoring method proposed in the previous section. Clearly, Proposition 2 is still valid, i.e., honest reporting still strictly maximizes reviewers’ expected review scores.
The assumption that different criteria are unrelated is not always reasonable. A modeling choice that takes the relationship between observed signals into account, and that is also consistent with our basic model, is to assume that the quality of the manuscript is still represented by a multinomial distribution, but that reviewers may now observe several signals from that distribution. Formally, let be the number of draws from the distribution that represents the quality of the manuscript, where each signal represents an evaluation score related to a criterion. Instead of a single number, each reviewer ’s private information is now a vector: , where , for . The basic assumptions (autonomy, risk neutrality, Dirichlet priors, and rationality) remain the same. For ease of exposition, we denote reviewer ’s true posterior predictive distribution in this section by . Under this new model, each reviewer ’s posterior predictive distribution is now defined as:
where is an indicator function:
Assuming that each reported review is a vector of evaluation scores, i.e., , where , for and , the center estimates each reviewer ’s posterior predictive distribution by applying Bayes’ rule to the common prior. The resulting estimated posterior predictive distribution , referred to as for ease of exposition, is:
Thereafter, the center rewards by using a strictly proper scoring rule and other reviewers’ reported reviews as observed outcomes:
where is some function used by the center to summarize each reviewer ’s reported review in a single number, and whose image is equal to the set . For example, can be a function that returns the median or the mode of the reported evaluation scores. Honest reporting, i.e., , maximizes reviewers’ expected review scores under this setting.
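The estimated posterior predictive distribution in (8) is a straightforward Dirichlet count update: prior pseudo-counts plus the reported signal counts, normalized. A minimal sketch (the prior values and reported signals are illustrative, with 0-indexed evaluation scores):

```python
import numpy as np

def posterior_predictive(alpha, signals):
    """Dirichlet-multinomial posterior predictive sketch: prior pseudo-counts
    alpha updated with the reported signal counts, then normalized."""
    counts = np.asarray(alpha, dtype=float).copy()
    for s in signals:
        counts[s] += 1          # one count per reported evaluation score
    return counts / counts.sum()

# three reported evaluation scores under a non-informative prior
print(posterior_predictive([1, 1, 1, 1], [2, 2, 3]))
```

Note that the result depends only on the counts of the reported scores, not on their order, which is why permuted reports yield the same estimated distribution (and hence the same review score), as discussed below.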
When observing and reporting multiple signals, each reviewer maximizes his expected review score when .
Due to the autonomy assumption, we restrict ourselves to show that each reviewer maximizes , for , when . Given that is a strictly proper scoring rule, from Proposition 1 we have that:
When observing and reporting multiple evaluation scores, a reviewer can weakly maximize his expected review score by reporting a review different from his true review as long as the estimated posterior predictive distributions are the same. For example, when reviewer reports , the resulting estimated posterior predictive distribution is the same as when he reports , and, consequently, reviewer receives the same review score in both cases. This implies that the scoring method in (9) is more suitable for a peer-review process where all criteria are equally weighted since honest reporting weakly maximizes expected review scores.
5.1 Summarizing Signals when Prior Distributions are Non-Informative
When reviewers report multiple evaluation scores, the intuitive interpretation of review scores as rewards for agreements that arises when using symmetric and bounded strictly proper scoring rules (see Section 4.3) is lost because the elements of the estimated posterior predictive distribution in (8) can take on more than two different values.
A different approach that preserves the aforementioned intuitive interpretation when reviewers’ prior distributions are non-informative is to ask the reviewers to summarize their observed signals into a single value before reporting it, instead of the center doing it on their behalf. Hence, each reviewer is now reporting honestly when , where is some function suggested by the center whose image is equal to the set . This new model can be interpreted as if the reviewers were reviewing the manuscript under several criteria and reporting the manuscript’s overall evaluation score by reporting the value .
Since reviewers are reporting only one value, we can use the original scoring method in (5) to promote honest reporting. We prove below that for any symmetric and bounded strictly proper scoring rule, honest reporting strictly maximizes reviewers’ expected review scores under the scoring method in (5) if and only if is the mode of the observed signals, i.e., when . Ties between observed signals are broken randomly.
When observing multiple signals and reporting , each reviewer with non-informative prior strictly maximizes his expected review score under the scoring method in (5), for a symmetric and bounded strictly proper scoring rule , if and only if .
Recall that since each reviewer observes multiple signals, his true posterior predictive distribution is equal to as shown in (7). Due to Proposition 1 and since reviewers cannot affect their peers’ reviews because of the autonomy assumption, we restrict ourselves to show that each reviewer maximizes , for , if and only if , where .
Let be the most common signal observed by reviewer . Hence, reviewer ’s subjective probability associated with is greater than his subjective probability associated with any other signal , i.e., .
(If part) Given that is a symmetric and strictly proper scoring rule and that each reviewer is reporting only one evaluation score, the resulting score from can take on only two possible values: , if , and otherwise (see discussion in Section 4.3). When reporting , reviewer ’s expected review score is . Given that for any , and , this expected review score is maximized. Thus, reporting maximizes reviewer ’s expected review score.
(Only-if part) Recall that all the elements making up the hyperparameter have the same value because reviewers’ prior distributions are non-informative. Let be reviewer ’s estimated posterior predictive distribution computed according to the original scoring method in (5) when reviewer is reporting , i.e.:
For contradiction’s sake, suppose that reviewer maximizes his expected review score by misreporting his review and reporting . Let be reviewer ’s estimated posterior predictive distribution when he is misreporting his review, i.e.:
As discussed in Section 4.3, the term can take on only two possible values whenever is a symmetric scoring rule. Consequently, for . A consequence of our assumption that reviewer maximizes his expected review score by misreporting his review is that . Assuming that is a symmetric and bounded strictly proper scoring rule, this inequality becomes:
The second line follows from the fact that for . Regarding the last line, we have by construction that , and . Consequently, we obtain that . As we stated before, since is the most common signal observed by reviewer , then . Thus, we have a contradiction. So, , i.e., reviewer maximizes his expected review score only if he reports . ∎
In other words, the above proposition says that each reviewer should report the evaluation score most likely to be deserved by the manuscript when their prior distributions are non-informative and they are rewarded according to the scoring method in (5). Any other evaluation score has a lower associated subjective probability and, consequently, reporting it results in a lower expected review score. To summarize, Proposition 4 implies that the scoring method proposed in (5) induces honest reporting by rewarding agreements whenever reviewers’ prior distributions are non-informative and the center is interested in the mode of each reviewer’s observed signals. It is noteworthy that Proposition 4 does not assume that the center knows a priori the number of observed signals , thus providing more flexibility for practical applications of our method.
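The summarization required by Proposition 4 (the mode of the observed signals, with ties broken randomly) can be sketched as follows; the function name is illustrative:

```python
import random
from collections import Counter

def summarize_mode(signals, rng=None):
    """Summarize observed signals by their mode; ties between equally
    frequent signals are broken uniformly at random, as in Proposition 4."""
    rng = rng or random
    counts = Counter(signals)
    top = max(counts.values())
    candidates = [s for s, c in counts.items() if c == top]
    return rng.choice(candidates)

print(summarize_mode([3, 1, 3, 2]))  # → 3
```

Each reviewer would report this single value, after which the original scoring method in (5) applies unchanged.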
6 Finding a Consensual Review
After reviewers report their reviews and receive their review scores, there is still the question of how the center will use the reported reviews in making a suitable decision. Since reviewers are not always in agreement, belief aggregation methods must be used to combine the reported reviews into a single representative review. The traditional average method is not necessarily the best approach since unreliable reviewers might have a large impact on the aggregate review. Moreover, a consensual review is desirable because it represents a review that is acceptable to all.
In this section, we propose an adaptation of a classical mathematical method to find a consensual review. Intuitively, it works as if reviewers were constantly updating their reviews in order to aggregate knowledge from others. The scoring concepts introduced in previous sections are incorporated by the reviewers when updating their reviews. In what follows, for the sake of generality, we assume that reviewers evaluate the manuscript under criteria, i.e., each reviewer observes signals from the underlying distribution that represents the quality of the manuscript and reports a vector of evaluation scores, where for all . The center then estimates each reviewer ’s posterior predictive distribution , referred to as for ease of exposition, as in (8). We relax our basic model by allowing the evaluation scores in the aggregate review to take on any real value between and the best evaluation score .
6.1 DeGroot’s Model
DeGroot (1974) proposed a model that describes how a group might reach a consensus by pooling their individual opinions. When applying this model to a peer-review setting, each reviewer is first informed of others’ reported reviews. In order to accommodate the information and expertise of the rest of the group, reviewer then updates his own review as follows:
where is a weight that reviewer assigns to reviewer ’s reported review when he carries out this update. It is assumed that , for every reviewer and , and . In this way, each updated review takes the form of a linear combination of reported reviews, also known as a linear opinion pool. The weights must be chosen on the basis of the relative importance that reviewers assign to their peers’ reviews. The whole updating process can be written in a more general form using matrix notation: , where:
Since all the original reviews have changed, the reviewers might wish to update their new reviews in the same way as they did before. If there is no basis for the reviewers to change their assigned weights, the whole updating process after revisions can then be represented as follows:
Let be reviewer ’s review after revisions, i.e., it denotes the th row of the matrix . We say that a consensus is reached if and only if , for every reviewer and , when .
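DeGroot's updating process can be simulated directly by repeated multiplication with the weight matrix. A sketch with a hypothetical row-stochastic weight matrix (the weights and reviews below are illustrative, not derived from the scoring method yet):

```python
import numpy as np

def degroot_consensus(weights, reviews, n_iter=100):
    """Iterate the DeGroot update X <- T X; T is the row-stochastic weight
    matrix and each row of X is a reviewer's current (vector of) review(s)."""
    T = np.asarray(weights, dtype=float)
    X = np.asarray(reviews, dtype=float)
    for _ in range(n_iter):
        X = T @ X          # every reviewer revises via a linear opinion pool
    return X

# hypothetical weights (rows sum to one) and single-score reviews
T = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
reviews = [[4.0], [2.0], [3.0]]
print(degroot_consensus(T, reviews))  # rows converge to a common value
```

With all weights strictly positive, the rows converge to the same review, which is the consensus described above.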
6.2 Review Scores as Weights
The original method proposed by DeGroot (1974) does not encourage honesty in the sense that reviewers can assign weights to their peers’ reviews however they wish, so long as the weights are consistent with the construction previously defined. Furthermore, it requires the disclosure of reported reviews to the whole group when reviewers are weighting others’ reviews, a fact which might be troublesome when the reviews are of a sensitive nature.
A possible way to circumvent the aforementioned problems is to derive weights from the original reported reviews by taking into account review scores. In particular, we assume the weight that a reviewer assigns to a peer’s review is directly related to how close their estimated posterior predictive distributions are, where closeness is defined by an underlying proper scoring rule. We provide behavioral foundations for such an assumption in the following subsection. Formally, the weight that reviewer assigns to reviewer ’s reported review is computed as follows:
that is, the weight is proportional to the expected review score that reviewer would receive if he had reported the review reported by reviewer , where the expectation is taken with respect to reviewer ’s estimated posterior predictive distribution. Consequently, the weight that each reviewer indirectly assigns to his own review is always the highest because is a strictly proper scoring rule, i.e., . We assume that , for every reviewer and . As long as is bounded, this assumption can be met by appropriately setting the value of . Consequently, , for every . Moreover, because the denominator of the fraction in (11) normalizes the weights so they sum to one.
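A sketch of this weight construction, using the quadratic scoring rule as the underlying strictly proper rule; the constants a and b and the example distributions are illustrative. Each row is normalized to sum to one, and the diagonal entry (the self-weight) is always the largest because the rule is strictly proper:

```python
import numpy as np

def quadratic_score(p, k):
    """Quadratic strictly proper scoring rule: 2*p[k] - sum(p^2)."""
    return 2.0 * p[k] - float(np.sum(np.asarray(p) ** 2))

def score_weights(p_hats, a=1.0, b=1.0):
    """Weight w[i][j] proportional to the expected score reviewer i would
    get for reporting j's distribution, with the expectation taken under
    i's own estimated posterior predictive; rows normalized to sum to one.
    The offset a keeps all weights strictly positive (illustrative value)."""
    p_hats = np.asarray(p_hats, dtype=float)
    n, m = p_hats.shape
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            exp_score = sum(p_hats[i, k] * quadratic_score(p_hats[j], k)
                            for k in range(m))
            W[i, j] = a + b * exp_score
    return W / W.sum(axis=1, keepdims=True)

# hypothetical estimated posterior predictive distributions of three reviewers
p_hats = [[0.4, 0.2, 0.2, 0.2],
          [0.2, 0.4, 0.2, 0.2],
          [0.2, 0.2, 0.2, 0.4]]
W = score_weights(p_hats)
print(np.argmax(W, axis=1))  # each reviewer weights himself highest → [0 1 2]
```

The resulting matrix can then be fed to the DeGroot iteration of the previous subsection.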
In the interest of reaching a consensus, DeGroot’s method in (10) is applied to the original reported reviews using the weights as defined in (11). We show that a consensus is always reached under this proposed method whenever the review scores are positive.
If , for every reviewer , then when .
Due to the assumption that , for every reviewer and , all the elements of the matrix in (10) are strictly greater than zero and strictly less than one. Moreover, the sum of the elements in any row is equal to one. Consequently, can be regarded as a stochastic matrix, or a one-step transition probability matrix of a Markov chain with states and stationary probabilities. Furthermore, the underlying Markov chain is aperiodic and irreducible. Therefore, a standard limit theorem of Markov chains applies in this setting, namely given an aperiodic and irreducible Markov chain with transition probability matrix , every row of the matrix converges to the same probability vector when (Ross, 1995). ∎
Recall that , where is a probability vector that incorporates all the previous weights. This equality implies that the consensual review can be represented as an instance of the linear opinion pool. Hence, an interpretation of the proposed method is that reviewers reach a consensus regarding the weights in (10). When in the above equality, the underlying linear opinion pool becomes the average of the reported evaluation scores. A drawback of the averaging approach is that it does not take into account the scoring concepts introduced in the previous sections, a fact which might favor unreliable reviewers. Moreover, disparate reviews might have a large impact on the resulting aggregate review. On the other hand, under our approach to find , reviewers down-weight reviews far from their own, which implies that the proposed method might be less influenced by disparate reviews. A numerical example in Subsection 6.4 illustrates this point. The experimental results discussed in Section 7 show that our method to find a consensual review is consistently more accurate than the traditional average method.
6.3 Behavioral Foundation
The major assumption regarding our method for finding a consensual review is that reviewers assign weights according to (11). An interesting interpretation of (11) arises when the proper scoring rule is effective with respect to a metric . Formally, given a metric that assigns a real number to any pair of probability vectors, which can be seen as the shortest distance between the two probability vectors, we say that a scoring rule is effective with respect to if the following relation holds for all probability vectors , , and (Friedman, 1983):
Thus, when is effective with respect to a metric , the higher the weight one reviewer assigns to a peer’s review in (11), the closer their estimated posterior predictive distributions are according to the metric . In other words, when using effective scoring rules, reviewers naturally prefer reviews close to their own reported reviews, and the weight that each reviewer assigns to his own review is always the highest one. Hence, in spirit, the resulting learning model in (10) can be seen as a model of anchoring (Tversky and Kahneman, 1974) in the sense that a reviewer’s own review is an “anchor”, and subsequent updates are biased towards reviews close to that anchor.
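For instance, the quadratic scoring rule in (3) is effective with respect to the Euclidean metric. Writing that rule in one standard form, $Q(q, k) = 2q_k - \sum_j q_j^2$, a short calculation makes this explicit:

```latex
\mathbb{E}_{p}\!\left[Q(q, \cdot)\right]
  = \sum_{k} p_k \Big( 2 q_k - \sum_{j} q_j^2 \Big)
  = 2\, p \cdot q - \lVert q \rVert^2
  = \lVert p \rVert^2 - \lVert p - q \rVert^2 .
```

Hence the expected score under $p$ is strictly decreasing in the Euclidean distance $\lVert p - q \rVert$, which is precisely the effectiveness property above (Friedman, 1983).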