# Truthful Peer Grading with Limited Effort from Teaching Staff


## 1 Introduction

The current research on peer grading mechanisms can be broadly classified into three classes. The first class of literature treats the grades from the peers as their best effort and uses multiple graders' independent scores of a paper to statistically arrive at a final score (Hamer et al., 2005; Cho and Schunn, 2007; Piech et al., 2013; Shah et al., 2013; Paré and Joordens, 2008; Kulkarni et al., 2014; De Alfaro and Shavlovsky, 2014; Raman and Joachims, 2014; Caragiannis et al., 2015; Wright et al., 2015). These approaches assume that the graders reveal their true observed scores or invest effort to find the true scores. But without a clear incentive to do so, these mechanisms are vulnerable to strategic manipulations.

The final class is a hybrid approach in which partial ground truth is accessible and is augmented with the peer prediction method. The ground truth can be obtained either by having the teaching staff grade a subset of the papers or by selectively verifying the grades of certain peer-graded answerscripts. Using this information, schemes can be devised that reward a grader for agreement with the trusted report (Jurca and Faltings, 2005; Dasgupta and Ghosh, 2013; Gao et al., 2016). Gao et al. (2016) present the spot-checking mechanism, where some selected papers are graded by a trusted authority and a payment (which can be given to a grader as peer-grading bonus scores) is made based on agreement with those trusted scores, while for the non-spot-checked papers the payment is a constant. This scheme ensures truthfulness in dominant strategies. However, it (a) penalizes the inherent inaccuracy of a grader in the same way as a manipulation, and (b) needs a number of ground-truth papers that increases linearly with the number of students.
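For intuition, the spot-checking idea can be sketched as a simple per-paper payment rule: agreement-based pay on checked papers, a constant on the rest. This is a minimal illustration under our own assumptions (the function name, the linear disagreement penalty, and the constants are placeholders, not Gao et al.'s exact mechanism):

```python
def spot_check_payment(reported, truth, checked, base=1.0, penalty_scale=1.0):
    """Pay a grader per paper: agreement-based if the paper was
    spot-checked by a trusted authority, a constant otherwise.

    reported: dict paper -> score given by the grader
    truth:    dict paper -> trusted score (only needed for checked papers)
    checked:  set of spot-checked papers
    """
    total = 0.0
    for paper, score in reported.items():
        if paper in checked:
            # reward decreases with disagreement from the trusted score
            total += base - penalty_scale * abs(score - truth[paper])
        else:
            total += base  # constant payment for unchecked papers
    return total
```

Note how an honest but inaccurate grader is penalized exactly like a manipulator on checked papers, which is drawback (a) discussed above.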

### Our Approach and Results

Our approach exploits the fact that, in practice, a peer grading mechanism can extract more information from the students, who are also graders. For example, the mechanism (a) can have some pre-graded answerscripts with which to measure the quality of the graders, and (b) can allow students to report an incorrect grading and have it corrected with the help of the teaching staff. Hence, eventually all ‘true’ grades are obtained, and this information is used to deter peer graders from deliberately underperforming. We present the mechanism TRUthful Peer Evaluation with Quality Assurance (TRUPEQA) in Algorithm 1, which provides the following theoretical and experimental features and findings. TRUPEQA


• estimates the accuracy of the graders (Alg. 1, Step 4) using a constant number of papers that are graded by the teaching staff (we call such papers probes).

• incentivizes the graders to grade at the level of the estimated accuracy (Theorem 1). This method ensures that the mechanism accounts for the inaccuracies of the graders and penalizes only manipulation and not inaccuracy.

• ensures voluntary participation (Theorem 2).

• minimizes the total cost of revealing the true grades among all mechanisms that use a fixed number of probes, i.e., mechanisms that use partial ground truth (Observation 2).

• on synthetic data, even when graders are non-strategic, it interestingly yields a statistically significantly lower RMS error than the Gibbs sampling mechanism built on the model of grader bias and reliability described in Piech et al. (2013), as well as the mean and median mechanisms. It also receives a smaller fraction of regrading requests than those mechanisms (§6.1.1).

• quite naturally, it performs significantly better when the graders are strategic (§6.1.2).

• both TRUPEQA and Gibbs need knowledge of the prior over scores. If the prior used by the mechanism differs from the true prior, the performance of both mechanisms is affected. However, Gibbs turns out to be far more sensitive to it, with its error increasing as reliability increases, while TRUPEQA continues to perform better as reliability increases (§6.1.3).

• on a real dataset with a discrete model of scores and accuracies, TRUPEQA performs better on RMS error (§6.2), even though it is difficult to ascertain whether the graders manipulated.

## 2 Model

Let $N$ denote the set of candidates writing the test that will be peer-graded. Since every candidate also acts as a grader, the set of answerscripts (or papers) and the set of graders are both denoted by $N$. We use $i$ as the index for a grader and $j$ as the index for a paper. The set of graders assigned paper $j$ is denoted by $G(j)$, where $G(j) \subseteq N$. The true score of paper $j$ is denoted by $y_j$ and the score observed by grader $i$ is denoted by $\tilde{y}^{(i)}_j$. The set of papers graded by grader $i$ is given by $\{j : i \in G(j)\}$. Both the true scores and the observed scores belong to the set of scores $S$, which can be continuous or discrete. Our analyses primarily assume a continuous score set; however, with a little adaptation they extend to discrete cases as well. The true scores are drawn i.i.d. from $S$ according to a distribution $\mathcal{F}$, which is common knowledge. The observation of $y_j$ by the graders is governed by the error model $f(\tilde{y} \mid y; q)$ (for a discrete $S$, this is the probability mass function), which is the density of $\tilde{y}$ given $y$, where $q$ is the parameter of the error model, which we will call accuracy. Let $Q$ denote the set of all possible accuracies.

We consider the premise of peer-grading mechanisms in which a few papers are checked by the teaching staff, and the grades provided are assumed to be the ground truth. Denote the set of such ground-truth papers by $P$, and call it the set of probe papers. The true grades of these probe papers are known to the mechanism designer (e.g., the instructor of the course), but these scores are not revealed ex-ante to the peer-graders. Each grader is given the probe and non-probe papers in two batches. The mechanisms we consider estimate the accuracy of a grader from her performance on the probe papers in batch one, and use it to predict the scores of the non-probe papers in batch two. The mechanisms we consider allow the graders to learn the identities of the probe papers and their true scores ex-post their grading, at which point their estimated accuracies are also released. With such information released, the mechanism design goal is to ensure that a rational grader continues to operate at the same estimated accuracy and reports her observations truthfully.[^1]

[^1]: However, in a more realistic scenario, such complete information about batch one may not be released, which further limits the possibility of manipulation by a grader. We show that even with this information available, truthful mechanisms can be designed, which also helps distinguish error from manipulation.

An estimation rule is given by the vector $e = (e_1, \ldots, e_{|N|})$, with $e_i : S^{|P_i|} \times S^{|P_i|} \to Q$, which is the function that estimates the accuracy of grader $i$ from the probe papers she corrects. Hence, if agent $i$ reports the scores $\tilde{y}^{(i)}_{P_i}$ for her probe papers while the true scores of those papers are $y_{P_i}$, her accuracy is given by $q_i = e_i(\tilde{y}^{(i)}_{P_i}, y_{P_i})$. We assume that $|G(j)| \geqslant 1$ for all $j$, and $|P_i| \geqslant 1$ and $|NP_i| \geqslant 1$ for all $i$, i.e., every paper is graded by at least one grader, and every grader grades at least one probe and one non-probe paper. We consider the estimation rule to be standard and publicly announced beforehand; therefore, it is not part of the mechanism. However, we will see that its role is crucial when defining the notions of truthfulness and voluntary participation.
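For concreteness, one simple estimation rule is the maximum-likelihood estimate of a grader's noise precision from her probe papers, assuming zero-mean Gaussian grading noise. This is a sketch under our own assumptions (the Gaussian form and the function name are illustrative, not the paper's prescription):

```python
def estimate_accuracy(reported, true_scores):
    """MLE of the precision q_i = 1/variance of (reported - true) errors
    on the probe papers, assuming zero-mean Gaussian grading noise."""
    assert len(reported) == len(true_scores) and reported
    errs = [r - t for r, t in zip(reported, true_scores)]
    mse = sum(e * e for e in errs) / len(errs)  # ML estimate of the variance
    # a grader who matched every probe exactly has (nominally) infinite precision
    return float('inf') if mse == 0 else 1.0 / mse
```

Larger returned values correspond to more accurate graders; any estimator mapping probe performance to $Q$ could be substituted here.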

For notational simplicity, we will use the following shorthands for grader $i$: $P_i$ for the probe papers assigned to $i$, $NP_i$ for the non-probe papers assigned to $i$, $q_i$ for her estimated accuracy, and $CG_i$ for the set of co-graders who grade at least one common non-probe paper with $i$. The vector of accuracies is denoted by $q_N$.

We assume that the true score vector is eventually observed at a cost. Define $C(x_j, y_j)$ to be the cost of observing the true score $y_j$ when a score of $x_j$ was (possibly incorrectly) given to paper $j$. We argue that this is quite natural in the context of peer-grading. Even though a paper might be incorrectly graded, the student has the option to appeal, in which case the teaching staff gets involved and grades it correctly; this happens at an additional cost, captured by the function $C$. An example of $C$ is one that increases in the difference $|x_j - y_j|$ and is zero when $x_j = y_j$.[^2] The reward to the designer is denoted by the function $R$, and is defined as the negative of the cost, i.e., $R(x_j, y_j) = -C(x_j, y_j)$. The total reward for a collection of papers is additive over individual rewards. With a little abuse of notation, we denote the reward for a set of papers with $R$ as well, e.g., $R(x_{N \setminus P}, y_{N \setminus P}) = \sum_{j \in N \setminus P} R(x_j, y_j)$.

[^2]: If a student's true score is far from the given score, then the teaching staff needs to check a larger portion of the answerscript, leading to a larger cost.
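A concrete instance of such a cost, meeting the stated requirements (zero at the true score, increasing in the gap), is a quadratic; this is only an illustrative choice on our part, since the model requires just the monotonicity properties:

```python
def cost(given, true):
    """Regrading cost: zero when the peer-assigned score is correct,
    growing with the size of the error."""
    return (given - true) ** 2

def reward(given, true):
    """Designer's reward is the negative of the regrading cost."""
    return -cost(given, true)
```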

A peer-grading mechanism for a given estimation rule $e$ is therefore given by the tuple $M = \langle G, r, t \rangle$, where


• $G$ is the grader-set assignment function, as defined before.

• $r = (r_j)_{j \in N \setminus P}$, with $r_j : S^{|G(j)|} \times Q^{|G(j)|} \to S$, is the function that computes the scores of the non-probe papers from the scores reported by the graders and their accuracies. Hence, the score decided by the mechanism for paper $j$ is $r_j(\tilde{y}^{G(j)}_j, q_{G(j)})$ when the graders submit $\tilde{y}^{G(j)}_j$ and the estimated accuracy vector is $q_{G(j)}$.

• $t = (t_i)_{i \in N}$ denotes the functions that yield the transfer (or payment) to each grader $i$, as a function of the given and true scores of the papers and the accuracies of the graders. The transfer is a real number computed from the scores $r_j(\cdot)$ given to the papers by the mechanism, their true scores, and the accuracies. For continuous scores, a scaled value of the transfer can be directly added to the total score as bonus marks for peer-grading, while for discrete scores the transfer has to be awarded separately.

It is reasonable to assume $i \notin G(i)$, as the usual practice in peer-grading is not to assign a student her own paper for grading. Therefore the score received by agent $i$, i.e., $r_i(\tilde{y}^{G(i)}_i, q_{G(i)})$, is independent of her reported grades. The transfer is the only factor influenced by her report, and one of our mechanism design goals is to choose this transfer such that it incentivizes the agent to report her observations truthfully.

We assume that every student cares only about her score in the examination given by the mechanism and the transfer (e.g., the bonus marks) from peer grading. Among these, the score received is independent of her grading, and hence non-manipulable. Therefore, we consider only the potentially manipulable part of the payoff, the transfer, in our analysis. For agent $i$ in mechanism $M$, this is given by

$$u^M_i(\tilde{y}^{CG_i}_{NP_i}, y_{NP_i}, q_{CG_i}) = t_i\big((r_j(\tilde{y}^{G(j)}_j, q_{G(j)}) : j \in NP_i),\, y_{NP_i},\, q_{CG_i}\big). \tag{1}$$

The payoff captures the fact that a grader is paid only for the non-probe papers she checks, and the payment is dependent on the score given by the mechanism to those papers, the true score of the papers, and the accuracies of the co-graders. The score given by the mechanism to a paper is also dependent on the reported scores of the other graders who grade the same paper.

Note that there can be a cost of grading the papers for agent $i$, which we ignore in our model for simplicity. However, in certain setups the cost of grading a paper is typically known and can be added to the transfer; all our analyses continue to hold in that setting. The setting where the cost depends on the accuracy of a grader is a different research question, which we leave as future work.

The reported score vector of grader $i$, given by $\tilde{y}^{(i)}$, is decomposable into probe and non-probe components, i.e., $\tilde{y}^{(i)} = (\tilde{y}^{(i)}_{P_i}, \tilde{y}^{(i)}_{NP_i})$. Since the estimation function is publicly known, agent $i$ knows that her accuracy will be estimated as $q_i = e_i(\tilde{y}^{(i)}_{P_i}, y_{P_i})$ when the true scores of the probes, $y_{P_i}$, are made public by the designer. The strategic question for agent $i$ is whether to report the scores of the non-probe papers according to the distribution given by the same accuracy level, i.e., according to $f(\cdot \mid y_j; q_i)$, or something different. Our mechanism design goal is to ensure that the graders continue grading the non-probe papers with the same accuracy level estimated via the probes. This is captured in the following definition of truthfulness.

###### Definition 1 (Equal Intensity Incentive Compatibility (EIIC))

Consider grader $i$. Let $q_i = e_i(\tilde{y}^{(i)}_{P_i}, y_{P_i})$, where $\tilde{y}^{(i)}_{P_i}$ is her reported score vector on the probe papers. A mechanism $M$ is equal intensity incentive compatible (EIIC) if for all $\hat{y}^{(i)}_{NP_i} \in S^{|NP_i|}$,

$$\mathbb{E}_{\tilde{y}^{(i)}_{NP_i},\, \tilde{y}^{CG_i \setminus \{i\}}_{NP_i},\, y_{NP_i} \,\mid\, q_{CG_i}}\; u^M_i\big((\tilde{y}^{(i)}_{NP_i}, \tilde{y}^{CG_i \setminus \{i\}}_{NP_i}), y_{NP_i}, q_{CG_i}\big) \;\geqslant\; \mathbb{E}_{\tilde{y}^{(i)}_{NP_i},\, \tilde{y}^{CG_i \setminus \{i\}}_{NP_i},\, y_{NP_i} \,\mid\, q_{CG_i}}\; u^M_i\big((\hat{y}^{(i)}_{NP_i}, \tilde{y}^{CG_i \setminus \{i\}}_{NP_i}), y_{NP_i}, q_{CG_i}\big),$$

where $\tilde{y}^{(i)}_{NP_i}$ is drawn from the distribution $f(\cdot \mid y_j; q_i)$, and $\tilde{y}^{CG_i \setminus \{i\}}_{NP_i}$ is drawn from the distributions $f(\cdot \mid y_j; q_k)$, $k \in CG_i \setminus \{i\}$.

Note that the inequality of Definition 1 should hold for every reported score vector on the probe papers of grader $i$. The expectation of the utility is taken w.r.t. the accuracy estimated from this reported score vector. The definition says that for a mechanism to be EIIC, every agent should receive her maximum expected payoff if she grades the non-probe papers assigned to her with the same accuracy as estimated from the probe papers, which takes into account that the graders may be inaccurate. Note that this definition also assumes that the other co-graders of $i$ draw their scores according to the accuracies estimated from their scores on their respective probe papers. Therefore, this notion of incentive compatibility is close in spirit to the ex-post incentive compatibility notion in the literature (Mezzetti, 2004; Nath and Zoeter, 2013; Bhat et al., 2014), which is the best achievable truthfulness guarantee in this setting of interdependent valuations due to the impossibility result of Jehiel et al. (2006).

EIIC is subtly different from an expectation computed w.r.t. the graders' true accuracies. It is unlikely that a grader perfectly knows her true accuracy, i.e., the probability of observing $\tilde{y}$ when the true score is $y$. Rather, if the performance of a grader is evaluated on some training set and shown to the grader, she can decide whether or not to continue at the same level of estimated accuracy (by putting in similar effort for grading). EIIC ensures that working at the level of estimated accuracy on the non-probe papers is the better option in expectation.

Also, EIIC ensures a slightly stronger guarantee than is required in an actual peer-grading setting (see paragraph 2 of Section 2 and Footnote 1). In practice, the true scores of the probe papers and the estimated accuracies of the graders may not be released before the graders check the non-probe papers, leading to an even more restricted set of manipulation strategies.[^3]

[^3]: This is not unusual in mechanism design; e.g., dominant strategy incentive compatible mechanisms are designed even though agents may never know the reported types of the other agents.

Our next definition ensures that every peer-grader willingly participates in the mechanism, as the expected payoff from participating truthfully is non-negative.

###### Definition 2 (Ex-Post Individual Rationality (EPIR))

Consider grader $i$. Let $q_i = e_i(\tilde{y}^{(i)}_{P_i}, y_{P_i})$, where $\tilde{y}^{(i)}_{P_i}$ is her reported score vector on the probe papers. A mechanism $M$ is ex-post individually rational (EPIR) if for every such reported score vector,

$$\mathbb{E}_{\tilde{y}^{(i)}_{NP_i},\, \tilde{y}^{CG_i \setminus \{i\}}_{NP_i},\, y_{NP_i} \,\mid\, q_{CG_i}}\; u^M_i\big((\tilde{y}^{(i)}_{NP_i}, \tilde{y}^{CG_i \setminus \{i\}}_{NP_i}), y_{NP_i}, q_{CG_i}\big) \;\geqslant\; 0. \tag{2}$$

where $\tilde{y}^{(i)}_{NP_i}$ is drawn from the distribution $f(\cdot \mid y_j; q_i)$, and $\tilde{y}^{CG_i \setminus \{i\}}_{NP_i}$ is drawn from the distributions $f(\cdot \mid y_j; q_k)$, $k \in CG_i \setminus \{i\}$.

The goal of the instructor in a peer-grading context is to minimize the expected cost, which is equivalent to maximizing the expected reward. Ideally, the cost is zero if the score given by the mechanism equals the true score for each paper. This can be interpreted as the paper not coming back to the teaching staff for regrading, so that the peer-graded score is the final score for the paper. Therefore, the following property is essential for any reasonable peer-grading mechanism.

###### Definition 3 (Expected Reward Maximizer (ERM))

Let $q_i = e_i(\tilde{y}^{(i)}_{P_i}, y_{P_i})$ for every grader $i$. The score computing function $r$ of a mechanism is expected reward maximizer (ERM) if, for every reported score vector $\tilde{y}^N_{N \setminus P}$ for the non-probe papers and every estimated accuracy vector $q_N$, it maximizes the expected reward, i.e.,

$$r^*(\tilde{y}^N_{N \setminus P}, q_N) \in \operatorname*{arg\,max}_{x_{N \setminus P} \in S^{|N \setminus P|}} \mathbb{E}_{y_{N \setminus P} \,\mid\, \tilde{y}^N_{N \setminus P};\, q_N}\, R(x_{N \setminus P}, y_{N \setminus P}). \tag{3}$$

Since the reward is limited to the non-probe papers, the expectation is taken over the true scores of those papers given the scores reported for them. Let the ERM given by Equation 3 be denoted by $r^*$. Note that, since the reward function is additive over the papers, i.e., $R(x_{N \setminus P}, y_{N \setminus P}) = \sum_{j \in N \setminus P} R(x_j, y_j)$, the decision problem of Equation 3 is also decomposable. Hence the ERM score for every paper $j$ is given by

$$r^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}) \in \operatorname*{arg\,max}_{x_j \in S} \mathbb{E}_{y_j \,\mid\, \tilde{y}^{G(j)}_j;\, q_{G(j)}}\, R(x_j, y_j). \tag{4}$$

The reward at the ERM score for paper $j$ when the true score is $y_j$ is denoted by

$$W^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}, y_j) = R\big(r^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}),\, y_j\big). \tag{5}$$

Define the ERM score of paper $j$ in the absence of agent $i$ to be

$$r^{(-i)*}_j(\tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j) \setminus \{i\}}) \in \operatorname*{arg\,max}_{x_j \in S} \mathbb{E}_{y_j \,\mid\, \tilde{y}^{G(j) \setminus \{i\}}_j;\, q_{G(j) \setminus \{i\}}}\, R(x_j, y_j). \tag{6}$$

The reward at the ERM score for paper $j$ in the absence of agent $i$ when the true score is $y_j$ is denoted by

$$W^{(-i)*}_j(\tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j) \setminus \{i\}}, y_j) = R\big(r^{(-i)*}_j(\tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j) \setminus \{i\}}),\, y_j\big). \tag{7}$$

We will use the shorthands $W^*_j$ and $W^{(-i)*}_j$ for the above two expressions when their arguments are clear from the context.
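For intuition about the per-paper ERM of Equation 4: with a quadratic cost and a Gaussian error model, the reward-maximizing score has a closed form, the posterior mean, which under a flat prior on $y_j$ is simply the accuracy-weighted average of the reports. This is a sketch under those stated assumptions; the general ERM also incorporates the prior over scores:

```python
def erm_score(reports, accuracies):
    """Precision-weighted average of the reported scores. For Gaussian
    grading noise with precision q_i and a flat prior on y_j, this is the
    posterior mean, which maximizes E[R(x, y_j)] = -E[(x - y_j)^2]."""
    assert len(reports) == len(accuracies) and reports
    total_q = sum(accuracies)
    return sum(q * y for q, y in zip(accuracies, reports)) / total_q
```

Reports from more accurate graders (larger $q_i$) pull the chosen score proportionally harder, which is exactly how the mechanism accounts for heterogeneous grader quality.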

We are now in a position to present the central mechanism of this paper.

## 3 The Trupeqa mechanism

In this section, we present our mechanism TRUPEQA (TRUthful Peer Evaluation with Quality Assurance), that (1) decides the assignment of the papers to the graders, (2) selects the score for every paper, and (3) decides the transfer to every grader. Algorithm 1 shows the details of the steps, while Figure 1 shows the dependencies of different variables graphically using a multi-agent influence diagram (MAID) (Koller and Milch, 2003).

The operating principle of this mechanism is to assign an equal number of papers to the graders and pick the score of a paper such that the expected cost (w.r.t. the estimated accuracies of all the graders who graded this paper) is minimized. Finally, the transfer to a grader is her marginal contribution towards minimizing this cost. In the next section, we show that these features of TRUPEQA satisfy some desirable properties of peer grading.
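The marginal-contribution transfer, $W^*_j - W^{(-i)*}_j$ per paper, can be sketched as follows. We instantiate it with an illustrative quadratic reward and a precision-weighted ERM for concreteness (those choices, and the function names, are our assumptions; TRUPEQA itself is defined for any reward and ERM satisfying the model):

```python
def transfer(i, reports, accuracies, true_score):
    """Per-paper transfer t_ij = W*_j - W^(-i)*_j: the reward at the ERM
    score with grader i's report included, minus the reward at the ERM
    score computed without grader i."""
    def erm(rs, qs):
        # illustrative ERM: precision-weighted average of reports
        return sum(q * y for q, y in zip(qs, rs)) / sum(qs)
    def reward(x, y):
        # illustrative reward: negative quadratic regrading cost
        return -(x - y) ** 2
    others_r = [r for k, r in enumerate(reports) if k != i]
    others_q = [q for k, q in enumerate(accuracies) if k != i]
    w_all = reward(erm(reports, accuracies), true_score)
    w_without = reward(erm(others_r, others_q), true_score)
    return w_all - w_without
```

A report that moves the aggregate toward the eventually revealed true score earns a positive transfer, and one that drags it away earns a negative one, which is the alignment between the designer's objective and the graders' payoffs described above.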

## 4 Properties of Trupeqa

By the observation leading to Equation 4, the joint problem of finding the score vector of all the papers to maximize the expected reward (Equation 3) is equivalent to the problem of scoring the papers individually with the same objective. Similarly, since the transfer to grader $i$ in TRUPEQA is additive over all the papers she grades (Step 6 of Alg. 1), the questions of incentive compatibility and individual rationality are completely decomposable into questions at the individual-paper level.

###### Observation 1 (Decomposability)

If TRUPEQA satisfies the inequalities given by Definitions 1 and 2 individually for every paper $j$, i.e., they hold when the set $NP_i$ is replaced by a single paper $j$ and the utility by the corresponding per-paper transfer, it is sufficient to conclude that TRUPEQA is EIIC and EPIR.

With the above observation, we proceed to proving the main results of this section.

###### Theorem 1 (EIIC)

TRUPEQA is EIIC.

Proof:  Pick an arbitrary grader $i$ and a paper $j \in NP_i$. Define $t_{ij}$ to be the transfer to grader $i$ for grading paper $j$ under TRUPEQA. Since $j$ is arbitrary, by Observation 1 it is sufficient to show that the inequality of Definition 1 holds after replacing the utility with the transfer term $t_{ij}$ and the set $NP_i$ with the single paper $j$. Formally, we need to show that for all $\hat{y}^{(i)}_j \in S$,

$$\mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\; t_{ij}\big((\tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j), y_j, q_{G(j)}\big) \;\geqslant\; \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\; t_{ij}\big((\hat{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j), y_j, q_{G(j)}\big),$$

where $\tilde{y}^{(i)}_j$ is drawn from the distribution $f(\cdot \mid y_j; q_i)$, and $\tilde{y}^{G(j) \setminus \{i\}}_j$ is drawn from the distributions $f(\cdot \mid y_j; q_k)$, $k \in G(j) \setminus \{i\}$, while $\hat{y}^{(i)}_j$ is any arbitrary score.

It is easy to see that $W^{(-i)*}_j$ is independent of the report of agent $i$ by definition. Therefore, to prove the above inequality it is sufficient to show that

$$\mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\; W^*_j(\tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}, y_j) \;\geqslant\; \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\; W^*_j(\hat{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}, y_j). \tag{8}$$

Consider the LHS of the above inequality

$$\begin{aligned}
\mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\, W^*_j(\tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}, y_j) &= \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\, R\big(r^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}), y_j\big) \\
&= \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j \,\mid\, q_{G(j)}} \Big( \mathbb{E}_{y_j \,\mid\, \tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}}\, R\big(r^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}), y_j\big) \Big) \\
&\geqslant \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j \,\mid\, q_{G(j)}} \Big( \mathbb{E}_{y_j \,\mid\, \tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}}\, R\big(r^*_j(\hat{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}), y_j\big) \Big) \\
&= \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\, R\big(r^*_j(\hat{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}), y_j\big) \\
&= \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\, W^*_j(\hat{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}, y_j).
\end{aligned}$$

The first equality is obtained by substituting the value from Equation 5. The second equality holds due to the chain rule of probability. The inequality holds by the definition of $r^*_j$ (Equation 4): the term within the parentheses is maximized when $r^*_j$ operates on $\tilde{y}^{(i)}_j$, which is generated from the actual distribution perceived by agent $i$, i.e., $f(\cdot \mid y_j; q_i)$; hence the inequality holds for any $\hat{y}^{(i)}_j$ on the RHS. The last two equalities hold by reorganizing the chain rule and by the definition of $W^*_j$. Hence, we have obtained Equation 8 and the proof is complete.

Our next result shows that every grader has a non-negative utility from participating in the peer-grading exercise.

###### Theorem 2 (EPIR)

TRUPEQA is EPIR.

Proof:  Pick an arbitrary grader $i$ and a paper $j \in NP_i$. Define $t_{ij}$ to be the transfer to grader $i$ for grading paper $j$ under TRUPEQA. Since $j$ is arbitrary, by Observation 1 it is sufficient to show that the inequality of Equation 2 holds after replacing the utility with the transfer term $t_{ij}$ and the set $NP_i$ with the single paper $j$. Formally, we need to show that

$$\mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\; t_{ij}\big((\tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j), y_j, q_{G(j)}\big) \;\geqslant\; 0. \tag{9}$$

where $\tilde{y}^{(i)}_j$ is drawn from the distribution $f(\cdot \mid y_j; q_i)$, and $\tilde{y}^{G(j) \setminus \{i\}}_j$ is drawn from the distributions $f(\cdot \mid y_j; q_k)$, $k \in G(j) \setminus \{i\}$.

Substituting for $t_{ij}$ the expressions of $W^*_j$ and $W^{(-i)*}_j$ from Equations 5 and 7, we get the first equality below.

$$\begin{aligned}
&\mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}}\; t_{ij}\big((\tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j), y_j, q_{G(j)}\big) \\
&= \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j,\, y_j \,\mid\, q_{G(j)}} \Big( R\big(r^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}), y_j\big) - R\big(r^{(-i)*}_j(\tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j) \setminus \{i\}}), y_j\big) \Big) \\
&= \mathbb{E}_{\tilde{y}^{(i)}_j,\, \tilde{y}^{G(j) \setminus \{i\}}_j \,\mid\, q_{G(j)}} \Big( \mathbb{E}_{y_j \,\mid\, \tilde{y}^{(i)}_j, \tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j)}} \big( R\big(r^*_j(\tilde{y}^{G(j)}_j, q_{G(j)}), y_j\big) - R\big(r^{(-i)*}_j(\tilde{y}^{G(j) \setminus \{i\}}_j, q_{G(j) \setminus \{i\}}), y_j\big) \big) \Big) \\
&\geqslant 0.
\end{aligned}$$

The second equality holds by the chain rule of probability. The inequality holds by the definition of $r^*_j$ (Equation 4): the first term within the inner parentheses is maximized when $r^*_j$ operates on the full report vector $\tilde{y}^{G(j)}_j$, generated from the actual distributions of the graders, so the expected reward at $r^*_j$ is no smaller than that at any other score, in particular at $r^{(-i)*}_j$. Hence, we have obtained Equation 9 and the proof is complete.

EIIC and EPIR make sure that reporting the observed scores according to the estimated accuracy is a best response of a grader if the other graders behave similarly. These guarantees are less vulnerable than the uninformative-equilibria problem of peer prediction mechanisms, because (a) TRUPEQA accounts for the inaccuracies of the graders via probe papers, and (b) the payment is contingent on the true grades that are finally revealed. These two factors, which are beyond the graders' control and closely track the ground truth, keep the graders' options for manipulation restricted. The following observation follows from the fact that the mechanism chooses the score to minimize the expected cost (Step 5).

###### Observation 2 (Expected Cost Minimizer)

Among all mechanisms using a fixed number of probe papers, TRUPEQA minimizes the total cost of rechecking.

## 5 Relationship with some classic mechanisms

The transfer term of TRUPEQA, given by Step 6 of Algorithm 1, resembles the payments of certain classic mechanisms in the independent and interdependent valuation setups. The first of them is the pivotal payment of the Vickrey-Clarke-Groves (VCG) mechanism (Vickrey, 1961; Clarke, 1971; Groves, 1973), and we explain here why TRUPEQA is quite a different mechanism from VCG. In the classical quasi-linear model, the decision of the designer affects the utility of the agents via the valuation function. But in this setting, the valuation, which is the cost of grading and is assumed to be zero, has no dependency on the decision made by the designer; the decision here is the final score of a graded paper. For the same reason, TRUPEQA is also different from the two-stage mechanism of Mezzetti (2004), which applies to the setting of interdependent valuations. The decision of the mechanism, i.e., the ERM score (Step 5 of Algorithm 1), affects the payoff of a grader via the transfer. The transfer is designed in such a way that the objective of the planner is aligned with the payoffs of the graders; in other words, the designer makes the graders partners in his goal of maximizing the expected reward. The term $W^{(-i)*}_j$ is subtracted to ensure that the bonus scores for peer grading are not too high while individual rationality is still satisfied. The QUEST mechanism (Bhat et al., 2014) is the closest to TRUPEQA in the structure of its decision and transfers, and applies to the setting of truthful crowdsourcing. The major differences of TRUPEQA from QUEST are that (a) in the latter, the set of agents assigned to a task is also decided by solving an optimization problem, while in the former it is fixed, and (b) QUEST assumes that the agents know their qualities perfectly and report them to the designer, while TRUPEQA runs only on the reported scores and the qualities are only estimated, which also leads to a different incentive compatibility guarantee for this mechanism.

## 6 Empirical evaluation

Two other closely related peer-grading mechanisms are the spot checking (SC) (Gao et al., 2016) and the correlated agreement (CA) (Shnayder et al., 2016) mechanisms. The SC mechanism randomly picks a group of peer-graded papers and verifies their scores, a method called spot checking. The graders are rewarded or penalized depending on their agreement on the spot-checked papers; for other papers, the payment is constant. This mechanism satisfies truthfulness in dominant strategies. The CA mechanism uses a peer prediction method (discussed in §1) that incentivizes truth-telling in an equilibrium. The novelty of both these algorithms lies entirely in the truthfulness aspect, as they omit how the peer scores are aggregated into the final scores. Therefore, it may be assumed that they use one of the standard aggregation techniques against which we already compare our mechanism.

### 6.1 Simulation with continuous scores

For the generation of the true scores and the error model of the peer-graders, we use the model of grader bias and reliability described in Piech et al. (2013), which is a widely used model for continuous scores. This model assumes that the true score $y_j$ of paper $j$ is distributed as $\mathcal{N}(\mu_0, 1/\gamma_0)$ for all $j$, and that the reliability $\tau_i$ and bias $b_i$ of grader $i$ are distributed as $\mathcal{G}(\alpha_0, \beta_0)$ and $\mathcal{N}(0, 1/\eta_0)$ respectively for all $i$, where $\mathcal{G}$ and $\mathcal{N}$ are the gamma and normal distributions respectively. The observed score of paper $j$ by grader $i$, given by $\tilde{y}^{(i)}_j$, is distributed as $\mathcal{N}(y_j + b_i, 1/\tau_i)$. The model parameters are chosen appropriately to reflect a realistic peer grading scenario.
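The generative process of this bias-and-reliability model can be sketched directly (a minimal simulation; the default hyperparameter values and the function name are placeholders of ours, not the values used in the experiments):

```python
import random

def simulate(n_papers, n_graders, mu0=5.0, sd0=1.0,
             alpha0=2.0, beta0=2.0, bias_sd=0.5, seed=0):
    """Draw true scores y_j ~ N(mu0, sd0^2), biases b_i ~ N(0, bias_sd^2),
    and reliabilities tau_i from a gamma distribution; the observed score
    of paper j by grader i is then N(y_j + b_i, 1/tau_i)."""
    rng = random.Random(seed)
    y = [rng.gauss(mu0, sd0) for _ in range(n_papers)]
    bias = [rng.gauss(0.0, bias_sd) for _ in range(n_graders)]
    tau = [rng.gammavariate(alpha0, beta0) for _ in range(n_graders)]
    obs = [[rng.gauss(y[j] + bias[i], (1.0 / tau[i]) ** 0.5)
            for j in range(n_papers)] for i in range(n_graders)]
    return y, bias, tau, obs
```

Note that `random.gammavariate` takes shape and scale parameters, so the gamma parameterization here may differ from the one used in the paper's experiments.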

In traditional examinations, the scores typically lie within a fixed range. The parameters of the above model are chosen in our experiment to compress this score spread into a narrower width. The values of $\mu_0$ and $\gamma_0$ determine the prior of the true score, ensuring that 95% of the score values lie within a fixed interval around the mean. Since bias is defined as the constant shift from the true score, $\eta_0$ is chosen so that 95% of the bias values lie in a small range around zero, i.e., about a modest maximum shift of marks in a traditional examination. Reliability is defined as the inverse of the variance of the noise in the score observed by a grader after the bias is added to the true score. The reliability parameters $\alpha_0$ and $\beta_0$ are chosen such that 95% of the noise lies within a small band around the mean of the reliabilities; the spread of the noise shrinks as reliability increases.

In the simulations, the true grade $y_j$ of paper $j$ and agent $i$'s observation $\tilde{y}^{(i)}_j$ are generated according to the above model for all $i$ and $j$. For every grader $i$, the accuracy parameters are estimated via maximum likelihood on her probe papers. The reward function is given by $R(x_j, y_j)$, where $x_j$ is the score decided by an algorithm and $y_j$ is the true score. In the following sections, we consider three different assumptions on the graders' behavior and the mechanism's knowledge of the prior, which impact the accuracy of the mechanism and the number of regrading requests.