Truthful Peer Grading with Limited Effort from Teaching Staff

07/31/2018 ∙ by Jatin Jindal, et al. ∙ Indian Institute of Technology Kanpur

Massive open online courses pose a massive challenge for grading answerscripts at high accuracy. Peer grading is often viewed as a scalable solution to this challenge, but it largely depends on the altruism of the peer graders. Some approaches in the literature treat peer grading as a 'best-effort service' of the graders and statistically correct their inaccuracies before awarding the final scores, but ignore the graders' strategic behavior. A few other approaches incentivize non-manipulative actions of the peer graders but do not make use of certain additional information that is potentially available in a peer grading setting, e.g., that the true grade can eventually be observed at an additional cost. This cost can be thought of as the additional effort of the teaching staff if they had to take a final look at the corrected papers after peer grading. In this paper, we use such additional information and introduce a mechanism, TRUPEQA, that (a) uses a constant number of instructor-graded answerscripts to quantitatively measure the accuracies of the peer graders and corrects the scores accordingly, (b) ensures truthful revelation of their observed grades, (c) penalizes manipulation, but not inaccuracy, and (d) reduces the total cost of arriving at the true grades, i.e., the additional person-hours of the teaching staff. We show that this mechanism outperforms several standard peer grading techniques used in practice, even when the graders are non-manipulative.


1 Introduction

Peer evaluation of academic publications has been a standard practice in large scientific communities for many years (Campanario, 1998). In recent years, massive open online courses (MOOCs), which have revolutionized classroom teaching, have also started adopting it for two major reasons: (a) it saves the instructor's time and yields faster feedback to the students, and (b) studies have shown that students learn better by checking each others' answerscripts (Sadler and Good, 2006), among other behavioral and cognitive benefits. These benefits have led a majority of MOOC platforms to resort to peer grading for large classes. There is, however, a degree of skepticism about the accuracy and ethics of this method. Studies have found that incorrectly designed peer grading schemes may exhibit grade inflation, where peer graders consistently give higher grades than an instructor would (Strong et al., 2004). Naturally, a body of research has been devoted to making peer grading accurate and non-manipulable.

The current research on peer grading mechanisms can be broadly classified into three classes. The first class of literature considers the grades from the peers as their best effort and uses multiple graders' independent scores of a paper to statistically arrive at a final score (Hamer et al., 2005; Cho and Schunn, 2007; Piech et al., 2013; Shah et al., 2013; Paré and Joordens, 2008; Kulkarni et al., 2014; De Alfaro and Shavlovsky, 2014; Raman and Joachims, 2014; Caragiannis et al., 2015; Wright et al., 2015). These approaches assume that the graders reveal their true observed scores or invest effort to find the true scores. But without a clear incentive to do so, these mechanisms stand vulnerable to strategic manipulation.

Strategic manipulation is natural to expect in a large-scale peer grading system that demands time and effort from the peer graders, and it has also been observed in practice, as discussed earlier. An effective peer grading system must protect against such manipulations, and research efforts have been pursued to that end. Hence, the second class of literature in peer grading addresses truthful peer grading through peer prediction mechanisms (Prelec, 2004; Miller et al., 2005; Jurca and Faltings, 2009; Faltings et al., 2012; Witkowski et al., 2013; Dasgupta and Ghosh, 2013; Witkowski and Parkes, 2013; Waggoner and Chen, 2014; Shnayder et al., 2016). In peer prediction mechanisms, every agent is asked to reveal her observed signals, and rewards are designed by comparing each agent's report with those of her peers. The reward design ensures truth-telling in equilibrium, i.e., it creates a situation such that every agent has an incentive to invest effort in finding out the correct signal and to reveal it truthfully if she believes that the other agents will also do so. However, even when truthful revelation of private signals is an equilibrium, peer prediction methods induce other uninformative equilibria as well (Jurca and Faltings, 2009; Waggoner and Chen, 2014): if no agent follows the 'put-effort-and-report-observation' strategy, it is not beneficial for any single agent to follow it. The operating principle of peer prediction mechanisms is that they reward coordination, not accuracy, and therefore their suitability for settings where a ground truth does exist is limited. In particular, when the information is costly to obtain, it is generally easier for the agents to resort to an uninformative equilibrium. In more recent developments, efforts have been undertaken to make the truthful equilibrium Pareto dominant in peer prediction mechanisms, i.e., to make the truthful equilibrium (weakly) more rewarding to every agent than any other equilibrium (Dasgupta and Ghosh, 2013; Witkowski and Parkes, 2013; Kamble et al., 2015; Radanovic and Faltings, 2015; Shnayder et al., 2016). However, Gao et al. (2016) show that such arguments rely critically on the assumption that every agent has access to only one private signal per object, which is untrue in the context of peer grading.

The final class is a hybrid approach where partial ground truth is accessible and is augmented with the peer prediction method. The ground truth can be found either by having the teaching staff grade a section of the papers or by selectively verifying the grades of certain peer-graded answerscripts. Using this information, schemes can be devised that reward a grader for agreement with the trusted report (Jurca and Faltings, 2005; Dasgupta and Ghosh, 2013; Gao et al., 2016). Gao et al. (2016) present the spot checking mechanism, where some selected papers are graded by a trusted authority and a payment (which can be given to a grader as peer grading bonus scores) is made based on the agreement with those trusted scores, while for the non-spot-checked papers the payment is a constant. This scheme ensures truthfulness in dominant strategies. However, it (a) penalizes the inherent inaccuracy of a grader in the same way as a manipulation, and (b) needs a number of ground truth papers that grows linearly with the number of students.

Our approach in this paper is closest to the third strand of literature, and we make use of one additional piece of information, motivated by the following observation. If a student finds that her answerscript has been wrongly graded, she always has the option to appeal to the instructor. Naturally, no instructor can ignore such a request, since the reduction of the grading burden via peer grading cannot be prioritized over accurate grading. Hence, if such a request comes, the instructor or the teaching assistants have to look into it and provide the correct grade; however, this takes additional labor from the staff. So, eventually all true grades will be revealed, but at a cost. Therefore, a reasonable goal of a peer grading mechanism is to learn these true grades at the minimum cost. Our mechanism is presented in this premise, and it ensures accurate grades along with incentives for the peer graders to reveal their observations truthfully. The mechanism is capable of distinguishing manipulation from inaccuracy, and it does not penalize the latter. It also needs only a constant number of ground truth papers. We provide a brief overview of our approach and results in the following section.

Our Approach and Results

Our approach exploits the fact that, in practice, a peer grading mechanism can extract more information from the students, who are also the graders. For example, the mechanism (a) can have some pre-graded answerscripts to measure the quality of the graders, and (b) can allow students to report an incorrect grading and have it corrected with the help of the teaching staff. Hence, eventually all 'true' grades will be obtained, and this information is used to deter peer graders from deliberately underperforming. We present the mechanism TRUthful Peer Evaluation with Quality Assurance (TRUPEQA) in Algorithm 1, which provides the following theoretical and experimental features and findings. TRUPEQA


  • estimates the accuracy of the graders (Alg. 1, Step 4) using a constant number of papers that are graded by the teaching staff (we call such papers probes).

  • incentivizes the graders to grade at the level of the estimated accuracy (Theorem 1). This ensures that the mechanism accounts for the inaccuracies of the graders and penalizes only manipulation, not inaccuracy.

  • ensures voluntary participation (Theorem 2).

  • minimizes the total cost of revealing the true grades among all mechanisms that use a fixed number of probes, i.e., mechanisms that use partial ground truth (Observation 2).

  • on synthetic data, even when graders are non-strategic, it interestingly yields a statistically significantly lower RMS error than the Gibbs sampling mechanism on the model of grader bias and reliability described in Piech et al. (2013), as well as the mean and median mechanisms. It also receives a smaller fraction of regrading requests than those mechanisms (§6.1.1).

  • quite naturally, it performs significantly better when the graders are strategic (§6.1.2).

  • both TRUPEQA and Gibbs need knowledge of the priors of the scores. If the prior used by the mechanism is different from the true prior, the performance of both mechanisms is affected. However, Gibbs turns out to be far more sensitive to it, and its error increases with increasing reliability, while TRUPEQA continues to perform better as reliability increases (§6.1.3).

  • on a real dataset with a discrete model of scores and accuracies, TRUPEQA performs better on the RMS error (§6.2), even though it is difficult to ascertain whether the graders manipulated or not.

2 Model

Let $[k]$ denote the set $\{1, \ldots, k\}$ for any natural number $k$. Let $N = [n]$ be the set of candidates writing the test that will be peer-graded. Therefore, the set of answerscripts (or papers) and the set of graders are both denoted by $N$. We use $i$ as the index for a grader and $j$ as the index for a paper. The set of graders assigned paper $j$ is denoted by $G_j$, where $G_j \subseteq N$. The true score of paper $j$ is denoted by $y_j$, and the score observed by a grader $i \in G_j$ is denoted by $x_{ij}$. The set of papers graded by grader $i$ is given by $P_i = \{j \in N : i \in G_j\}$. Both the true scores and the observed scores belong to the set of scores $\mathcal{X}$, which can be continuous or discrete. Our analyses primarily assume a continuous score set; however, with a little adaptation, the analyses extend to discrete cases as well. The scores are drawn i.i.d. from $\mathcal{X}$ according to a distribution $f$, which is common knowledge. The observation of $y_j$ by the graders is governed by the error model $g_{t_i}(x_{ij} \mid y_j)$ (for a discrete $\mathcal{X}$, this is the probability mass function), which is the density of $x_{ij}$ given $y_j$; here $t_i$ is the parameter of the error model, which we will call the accuracy of grader $i$. Let $\mathcal{T}$ denote the set of all possible accuracies.

We consider the premise of peer-grading mechanisms where $m$ papers are checked by the teaching staff, and the grades provided are assumed to be the ground truth. Denote the set of such ground truth papers by $Q$, and call it the set of probe papers. The true grades of these probe papers are known to the mechanism designer (e.g., the instructor of the course), but these scores are not ex-ante revealed to the peer-graders. Each grader is given the probe and non-probe papers in two batches. The mechanisms we consider estimate the accuracy from the performance of the graders on these probe papers in batch one, and use it to predict the scores of the non-probe papers in batch two. The mechanisms we consider allow the graders to know the identities of the probe papers and the true scores ex-post their grading, at which point their estimated accuracies are also released. With such information released, the mechanism design goal is to ensure that a rational grader continues to operate at the same estimated accuracy and reports her observations truthfully.¹ We show that even with this information available, truthful mechanisms can be designed, which also helps distinguish error from manipulation. (¹However, in a more realistic scenario, such complete information about batch one may not be released, which further limits the possibility of manipulation by the grader.)
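The two-batch structure is straightforward to realize in code. Below is a minimal sketch of the assignment step, with function and parameter names of our own choosing; for simplicity every grader here grades the whole probe set, and a real deployment would add balancing constraints on how often each non-probe paper is graded.

```python
def assign_papers(n, probe_ids, k):
    """Give each of the n graders every probe paper (batch one) plus k
    non-probe papers (batch two), never assigning a grader her own paper.

    Cyclic shifts keep the load equal across graders and spread the
    non-probe papers roughly evenly; exact balance needs more care."""
    probe_set = set(probe_ids)
    nonprobe_ids = [j for j in range(n) if j not in probe_set]
    m = len(nonprobe_ids)
    position = {j: idx for idx, j in enumerate(nonprobe_ids)}
    assignment = {}
    for i in range(n):
        # start at the grader's own paper (if it is non-probe) and step
        # forward cyclically, so she never receives her own answerscript
        start = position.get(i, i % m)
        batch_two = [nonprobe_ids[(start + s) % m] for s in range(1, k + 1)]
        assignment[i] = {"probe": list(probe_ids), "nonprobe": batch_two}
    return assignment

# Example: 8 students, papers 0 and 1 instructor-graded as probes.
print(assign_papers(8, probe_ids=[0, 1], k=3))
```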

An estimation rule is given by the vector $E = (E_i)_{i \in N}$, with $E_i : \mathcal{X}^{|P_i \cap Q|} \times \mathcal{X}^{|P_i \cap Q|} \to \mathcal{T}$, which is the function that estimates the accuracy of grader $i$ from the probe papers she corrects. Hence, if agent $i$ reports the scores $\hat{x}_{i, P_i \cap Q}$ for the papers in $P_i \cap Q$ while the true scores of those papers are $y_{P_i \cap Q}$, her accuracy is given by $\hat{t}_i = E_i(\hat{x}_{i, P_i \cap Q}, y_{P_i \cap Q})$. We assume that $G_j \neq \emptyset$ for all $j \in N$, and $P_i \cap Q \neq \emptyset$ and $P_i \setminus Q \neq \emptyset$ for all $i \in N$, i.e., every paper is graded by at least one grader, and every grader grades at least one probe and one non-probe paper. We consider the estimation rule to be standard and publicly announced beforehand; therefore, it is not part of the mechanism. However, we will see that its role is crucial in defining the notions of truthfulness and voluntary participation.

For notational simplicity, we will use the following shorthands for grader $i$: $P_i^{\mathrm{pr}} = P_i \cap Q$ for the probe papers assigned to $i$, $P_i^{\mathrm{np}} = P_i \setminus Q$ for the non-probe papers assigned to $i$, $\hat{t}_i$ for the estimated accuracy, and $C_i$ for the set of co-graders who grade at least one common non-probe paper with $i$. The profile of estimated accuracies is denoted by $\hat{t} = (\hat{t}_i)_{i \in N}$.

We assume that the true score vector is eventually observed at a cost. Define $c(y_j, s_j)$ to be the cost of observing the true score $y_j$ when a score of $s_j$ was (possibly incorrectly) given to paper $j$. We argue that this is quite natural in the context of peer-grading. Even though a paper might be incorrectly graded, the student has the option to appeal, in which case the teaching staff gets involved and does a correct grading, but this happens at an additional cost, captured by the function $c$. An example of $c$ is one that increases in the difference $|y_j - s_j|$ and is zero when $s_j = y_j$.² The reward to the designer is denoted by the function $r$, and is defined as the negative of the cost, i.e., $r(y_j, s_j) = -c(y_j, s_j)$. The total reward for a collection of papers is additive over the individual rewards. With a little abuse of notation, we denote the reward for the papers in a set $A$ with $r$ as well, i.e., $r(y_A, s_A) = \sum_{j \in A} r(y_j, s_j)$. (²If a student's true score is far from the given score, then the teaching staff needs to check a larger portion of the answerscript, leading to a larger cost.)
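For instance, a quadratic appeal cost has both properties just mentioned (zero for an exact grade, growing with the gap); the functional form and scale below are our own illustrative choices, not the paper's:

```python
def cost(true_score: float, given_score: float, scale: float = 1.0) -> float:
    """Cost of the staff re-check when `given_score` was awarded to a paper
    whose true score is `true_score`; zero iff the peer grade was exact."""
    return scale * (true_score - given_score) ** 2

def reward(true_score: float, given_score: float) -> float:
    """Designer's reward: the negative of the re-checking cost."""
    return -cost(true_score, given_score)

def total_reward(true_scores, given_scores):
    """Reward is additive over a collection of papers."""
    return sum(reward(y, s) for y, s in zip(true_scores, given_scores))

# Example: an exact grade costs nothing; a 2-mark gap costs 4 units.
print(cost(7.0, 7.0), cost(5.0, 7.0))
```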

A peer-grading mechanism for a given estimation rule $E$ is therefore given by the tuple $\langle G, s, \pi \rangle$, where


  • $G = (G_j)_{j \in N}$ is the grader set assignment function, as defined before.

  • $s = (s_j)_{j \in N \setminus Q}$, with $s_j : \mathcal{X}^{|G_j|} \times \mathcal{T}^{|G_j|} \to \mathcal{X}$, which is the function that computes the scores of the non-probe papers from the scores reported by the graders and their accuracies. Hence, the score decided by the mechanism for paper $j$ is $s_j(\hat{x}_{G_j}, \hat{t}_{G_j})$ when the graders submit $\hat{x}$ and the estimated accuracy vector is $\hat{t}$.

  • $\pi = (\pi_i)_{i \in N}$, with $\pi_i : \mathcal{X}^{|P_i^{\mathrm{np}}|} \times \mathcal{X}^{|P_i^{\mathrm{np}}|} \times \mathcal{T}^n \to \mathbb{R}$, which denotes the function that yields the transfer (or payment) to grader $i$, and is a function of the given and true scores of the papers and the accuracies of the graders. Hence, the transfer $\pi_i(s_{P_i^{\mathrm{np}}}, y_{P_i^{\mathrm{np}}}, \hat{t})$ is a real number, where $s_j$ is the score given to paper $j$ by the mechanism. For continuous scores, a scaled value of the transfer can be directly added to the total score as bonus marks for peer-grading, while for discrete scores, the transfer has to be awarded separately.
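For concreteness, the tuple $\langle G, s, \pi \rangle$ maps directly onto a data structure; a minimal typed sketch (all field names are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Score = float
Accuracy = float  # the error-model parameter t_i of a grader

@dataclass
class PeerGradingMechanism:
    # G: paper id -> list of grader ids assigned to that paper
    grader_assignment: Dict[int, List[int]]
    # s_j: (reported scores of G_j, accuracies of G_j) -> decided score
    score_fn: Callable[[Sequence[Score], Sequence[Accuracy]], Score]
    # pi_i: (decided scores, true scores, accuracy profile) -> transfer
    transfer_fn: Callable[[Sequence[Score], Sequence[Score], Sequence[Accuracy]], float]
```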

It is reasonable to assume $i \notin G_i$, as the usual practice in peer-grading is to not assign a student her own paper for grading. Therefore, the score received by agent $i$, i.e., $s_i(\hat{x}_{G_i}, \hat{t}_{G_i})$, is independent of her reported grades $\hat{x}_i$. The transfer is the only factor influenced by her report, and one of our mechanism design goals is to choose this transfer such that it incentivizes the agent to report her observations truthfully.

We assume that every student cares only about her score in the examination given by the mechanism and the transfer (e.g., the bonus marks) from peer grading. Among these, the score received is independent of her own grading, and hence non-manipulable. Therefore, we consider only the potentially manipulable part of the payoff, the transfer, for our analysis. For agent $i$ in mechanism $M = \langle G, s, \pi \rangle$, this is given by

$$u_i(\hat{x}, y, \hat{t}) = \pi_i\big( s_{P_i^{\mathrm{np}}}(\hat{x}, \hat{t}),\, y_{P_i^{\mathrm{np}}},\, \hat{t} \big). \qquad (1)$$

The payoff captures the fact that a grader is paid only for the non-probe papers she checks, and the payment is dependent on the score given by the mechanism to those papers, the true score of the papers, and the accuracies of the co-graders. The score given by the mechanism to a paper is also dependent on the reported scores of the other graders who grade the same paper.

Note that there can be a cost of grading the papers for agent $i$, which we have ignored in our model for simplicity. However, in certain setups, the cost of grading a paper is typically known and can be added to the transfer. All our analyses will continue to hold even in that setting. The setting where the cost depends on the accuracy of a grader is a different research question, which we leave as future work.

The reported scores of grader $i$, given by $\hat{x}_i$, are decomposable into the probe and non-probe components, i.e., $\hat{x}_i = (\hat{x}_{i,P_i^{\mathrm{pr}}}, \hat{x}_{i,P_i^{\mathrm{np}}})$. Since the estimation function is publicly known, agent $i$ knows that her accuracy will be estimated as $\hat{t}_i = E_i(\hat{x}_{i,P_i^{\mathrm{pr}}}, y_{P_i^{\mathrm{pr}}})$ when the true scores of the probes, $y_{P_i^{\mathrm{pr}}}$, are made public by the designer. The strategy agent $i$ can consider is whether to report the scores of the non-probe papers according to the distribution given by the same accuracy level, i.e., according to $g_{\hat{t}_i}$, or something different. Our mechanism design goal is to ensure that the graders continue grading the non-probe papers at the same accuracy level estimated via the probes. This is captured in the definition of truthfulness as follows.

Definition 1 (Equal Intensity Incentive Compatibility (EIIC))

Consider grader $i$. Let $\hat{t}_i = E_i(\hat{x}_{i,P_i^{\mathrm{pr}}}, y_{P_i^{\mathrm{pr}}})$, where $\hat{x}_{i,P_i^{\mathrm{pr}}}$ is her reported score vector on the probe papers. A mechanism is equal intensity incentive compatible (EIIC) if for all $\hat{x}_{i,P_i^{\mathrm{pr}}} \in \mathcal{X}^{|P_i^{\mathrm{pr}}|}$ and all $t' \in \mathcal{T}$,

$$\mathbb{E}_{g_{\hat{t}_i}}\big[ u_i(\hat{x}, y, \hat{t}) \big] \;\geq\; \mathbb{E}_{g_{t'}}\big[ u_i(\hat{x}, y, \hat{t}) \big],$$

where $\hat{x}_{i,P_i^{\mathrm{np}}}$ is drawn from the distribution $g_{\hat{t}_i}(\cdot \mid y)$ on the LHS (and from $g_{t'}(\cdot \mid y)$ on the RHS), and $\hat{x}_{-i}$ is drawn from the distribution $g_{\hat{t}_{-i}}(\cdot \mid y)$ on both sides.

Note that the inequality of Definition 1 should hold for every reported score vector on the probe papers of $i$. The expectation of the utility is taken w.r.t. the accuracy estimated from this reported score vector. The definition says that for a mechanism to be EIIC, every agent should receive the maximum expected payoff if she grades the non-probe papers assigned to her with the same accuracy as estimated from the probe papers, which takes into account that the graders could be inaccurate. Note that this definition also assumes that the other co-graders of $i$ draw their scores according to the accuracies estimated from their scores on their respective probe papers. Therefore, this notion of incentive compatibility is close in spirit to the ex-post incentive compatibility notion in the literature (Mezzetti, 2004; Nath and Zoeter, 2013; Bhat et al., 2014), which is the best achievable truthfulness guarantee in this setting of interdependent valuations due to the impossibility result of Jehiel et al. (2006).

EIIC is subtly different from a notion where the expectation is computed w.r.t. the graders' true accuracies. It is unlikely that a grader will perfectly know her true accuracy, i.e., the probability of observing a score $x$ when the true score is $y$. Rather, if the performance of a grader is evaluated on some training set and is shown to the grader, she can decide whether to continue at the same level of estimated accuracy (by putting similar effort into grading) or not. EIIC ensures that working at the level of estimated accuracy on the non-probe papers is the better option in expectation.

Also, EIIC ensures a slightly stronger guarantee than is required in an actual peer grading setting (see paragraph 2 of Section 2 and footnote 1). In practice, the true scores of the probe papers and the estimated accuracies of the graders may not be released before they check the non-probe papers, leading to a more restricted set of manipulation strategies for the graders.³ (³This is not unusual in mechanism design; e.g., dominant strategy incentive compatible mechanisms are designed even though agents may never know the reported types of the other agents.)

Our next definition ensures that every peer-grader willingly participates in this mechanism, as the expected payoff from participating truthfully is non-negative.

Definition 2 (Ex-Post Individual Rationality (EPIR))

Consider grader $i$. Let $\hat{t}_i = E_i(\hat{x}_{i,P_i^{\mathrm{pr}}}, y_{P_i^{\mathrm{pr}}})$, where $\hat{x}_{i,P_i^{\mathrm{pr}}}$ is her reported score vector on the probe papers. A mechanism is ex-post individually rational (EPIR) if for all $\hat{x}_{i,P_i^{\mathrm{pr}}} \in \mathcal{X}^{|P_i^{\mathrm{pr}}|}$,

$$\mathbb{E}_{g_{\hat{t}_i}}\big[ u_i(\hat{x}, y, \hat{t}) \big] \;\geq\; 0, \qquad (2)$$

where $\hat{x}_{i,P_i^{\mathrm{np}}}$ is drawn from the distribution $g_{\hat{t}_i}(\cdot \mid y)$, and $\hat{x}_{-i}$ is drawn from the distribution $g_{\hat{t}_{-i}}(\cdot \mid y)$.

The goal of the instructor in a peer-grading context is to minimize the expected cost, which is equivalent to maximizing the expected reward. Ideally, the cost is zero if the score given by a mechanism is equal to the true score for each paper. This can be interpreted as the paper never coming back to the teaching staff for regrading, so the peer-graded score is the final score for the paper. Therefore, the following property is essential for any reasonable peer-grading mechanism.

Definition 3 (Expected Reward Maximizer (ERM))

Let $\hat{t}_i = E_i(\hat{x}_{i,P_i^{\mathrm{pr}}}, y_{P_i^{\mathrm{pr}}})$, where $\hat{x}_{i,P_i^{\mathrm{pr}}}$ is the reported score vector of grader $i$ on her probe papers. The score computing function $s$ of a mechanism is an expected reward maximizer (ERM) if, for every reported score vector $\hat{x}$ for the non-probe papers and for every estimated accuracy vector $\hat{t}$, it maximizes the expected reward, i.e.,

$$s^*(\hat{x}, \hat{t}) \in \operatorname*{argmax}_{s \in \mathcal{X}^{|N \setminus Q|}} \; \mathbb{E}_{y}\big[ r(y, s) \mid \hat{x}, \hat{t} \big]. \qquad (3)$$

Since the reward is only limited to the non-probe papers, the expectation is calculated based on the scores reported and the true scores of those papers. Let the ERM given by Equation 3 be denoted by $s^*(\hat{x}, \hat{t})$. Note that, since the reward function is additive over the papers, i.e., $r(y, s) = \sum_{j \in N \setminus Q} r(y_j, s_j)$, the decision problem of Equation 3 is also decomposable. Hence, the ERM score for every paper $j \in N \setminus Q$ is given by

$$s_j^*(\hat{x}_{G_j}, \hat{t}_{G_j}) \in \operatorname*{argmax}_{s \in \mathcal{X}} \; \mathbb{E}_{y_j}\big[ r(y_j, s) \mid \hat{x}_{G_j}, \hat{t}_{G_j} \big]. \qquad (4)$$

The reward at the ERM score for paper $j$ when the true score is $y_j$ is denoted by

$$R_j(y_j, \hat{x}_{G_j}, \hat{t}_{G_j}) := r\big(y_j, s_j^*(\hat{x}_{G_j}, \hat{t}_{G_j})\big). \qquad (5)$$

Define the ERM score of paper $j$ in the absence of agent $i$ to be

$$s_j^{*,-i}(\hat{x}_{G_j \setminus \{i\}}, \hat{t}_{G_j \setminus \{i\}}) \in \operatorname*{argmax}_{s \in \mathcal{X}} \; \mathbb{E}_{y_j}\big[ r(y_j, s) \mid \hat{x}_{G_j \setminus \{i\}}, \hat{t}_{G_j \setminus \{i\}} \big]. \qquad (6)$$

The reward at the ERM score for paper $j$ in the absence of agent $i$ when the true score is $y_j$ is denoted by

$$R_j^{-i}(y_j, \hat{x}_{G_j \setminus \{i\}}, \hat{t}_{G_j \setminus \{i\}}) := r\big(y_j, s_j^{*,-i}(\hat{x}_{G_j \setminus \{i\}}, \hat{t}_{G_j \setminus \{i\}})\big). \qquad (7)$$

We will use the shorthands $R_j$ and $R_j^{-i}$ for the above two expressions when the arguments of such functions are clear from the context.
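Equation 4 is a one-dimensional optimization per paper, so it can be computed by discretizing the score set and maximizing the posterior expected reward on a grid. The sketch below assumes a Gaussian prior and unbiased Gaussian grader noise with the estimated accuracy as precision; these modeling choices, and all names, are ours:

```python
import numpy as np

def erm_score(reports, precisions, mu0, gamma0, grid, reward):
    """Decomposed ERM (Eq. 4): pick the score s maximizing the posterior
    expected reward E[r(y, s) | reports, accuracies] over a score grid.

    reports    : np.array of scores reported for this paper by its graders
    precisions : np.array of estimated accuracies (Gaussian precisions)
    mu0, gamma0: mean and precision of the N(mu0, 1/gamma0) prior on y
    grid       : 1-D np.array discretizing the score set
    reward     : function reward(y, s) accepting a vector y and scalar s
    """
    # Gaussian posterior over the true score y given the reports
    post_prec = gamma0 + np.sum(precisions)
    post_mean = (gamma0 * mu0 + np.dot(precisions, reports)) / post_prec
    post = np.exp(-0.5 * post_prec * (grid - post_mean) ** 2)
    post /= post.sum()
    # expected reward of awarding each candidate score s
    exp_reward = np.array([np.dot(post, reward(grid, s)) for s in grid])
    return grid[np.argmax(exp_reward)]

# With quadratic cost, the maximizer is the posterior mean
# (about 6.43 here, up to the grid resolution).
grid = np.linspace(0.0, 10.0, 201)
quad = lambda y, s: -(y - s) ** 2
print(erm_score(np.array([6.0, 7.0]), np.array([1.0, 2.0]),
                mu0=5.0, gamma0=0.5, grid=grid, reward=quad))
```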

We are now in a position to present the central mechanism of this paper.

3 The TRUPEQA mechanism

In this section, we present our mechanism TRUPEQA (TRUthful Peer Evaluation with Quality Assurance), that (1) decides the assignment of the papers to the graders, (2) selects the score for every paper, and (3) decides the transfer to every grader. Algorithm 1 shows the details of the steps, while Figure 1 shows the dependencies of different variables graphically using a multi-agent influence diagram (MAID) (Koller and Milch, 2003).

1:  Input: reported scores $\hat{x}_{i,P_i^{\mathrm{pr}}}$ of the graders on the probe papers (reported after Step 3), and reported scores $\hat{x}_{i,P_i^{\mathrm{np}}}$ on the non-probe papers (reported after Step 4)
2:  Given: the size $m$ of the probe set $Q$, the estimation function $E$, the priors on $y$
3:  $G$ part: every grader is assigned $d$ (an even number of) papers to grade, of which $q$ are probe papers and the rest are non-probe, in such a way that every non-probe paper is assigned to exactly $k$ graders. The assignment of papers to graders also ensures that a grader does not get her own paper assigned to her.
4:  The accuracies $\hat{t}_i = E_i(\hat{x}_{i,P_i^{\mathrm{pr}}}, y_{P_i^{\mathrm{pr}}})$ are estimated by applying $E$ on the probe papers, and are revealed to the graders.
5:  $s$ part: the scores of the papers are given by the ERM (Equation 3), which is equivalent to the scores given by the decomposed ERM (Equation 4) for every $j \in N \setminus Q$.
6:  $\pi$ part: the transfer to grader $i$ for grading paper $j$ is given by $\pi_i^j = R_j - R_j^{-i}$, and the total transfer to grader $i$ is therefore $\pi_i = \sum_{j \in P_i^{\mathrm{np}}} \pi_i^j$.
Algorithm 1 TRUPEQA

The operating principle of this mechanism is to assign an equal number of papers to the graders and to pick the score of a paper such that the expected cost (w.r.t. the estimated accuracies of all the graders who graded this paper) is minimized. Finally, the transfer is the marginal contribution of the grader towards minimizing this cost. In the next section, we show that these features of TRUPEQA satisfy some desirable properties of peer grading.
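Step 6 thus makes each grader's payment her marginal contribution to the reward at the eventually revealed true score. A sketch, reusing the `erm_score` helper from the previous section (same Gaussian assumptions, names ours):

```python
import numpy as np

def trupeqa_transfer(i, reports, precisions, true_score,
                     mu0, gamma0, grid, reward):
    """Step 6 of Algorithm 1 for one paper: pi_i^j = R_j - R_j^{-i}.

    `reports` and `precisions` are np.arrays indexed by the graders of
    the paper; `i` is the position of the grader within those arrays.
    Assumes the erm_score helper sketched in Section 2."""
    s_with = erm_score(reports, precisions, mu0, gamma0, grid, reward)
    others = np.arange(len(reports)) != i
    s_without = erm_score(reports[others], precisions[others],
                          mu0, gamma0, grid, reward)
    # both rewards are evaluated at the eventually revealed true score
    return reward(true_score, s_with) - reward(true_score, s_without)
```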

Figure 1: Multi-agent influence diagram for TRUPEQA.

4 Properties of TRUPEQA

By Equation 4, the joint problem of finding the score vector of all the papers to maximize the expected reward (Equation 3) is equivalent to the problem of scoring the papers individually with the same objective. Similarly, since the transfer to grader $i$ in TRUPEQA is additive over all the papers she grades (Step 6 of Alg. 1), the questions of incentive compatibility and individual rationality are completely decomposable into questions at the level of individual papers.

Observation 1 (Decomposability)

If TRUPEQA satisfies the inequalities given by Definitions 1 and 2 individually for every paper $j \in P_i^{\mathrm{np}}$, i.e., if they hold when the utility $u_i$ is replaced by the per-paper transfer $\pi_i^j$ and the set $P_i^{\mathrm{np}}$ is replaced by the singleton $\{j\}$, then it is sufficient to conclude that TRUPEQA is EIIC and EPIR.

With the above observation, we proceed to prove the main results of this section.

Theorem 1 (EIIC)

TRUPEQA is EIIC.

Proof:  Pick arbitrary $i \in N$ and $j \in P_i^{\mathrm{np}}$. Define $\pi_i^j = R_j - R_j^{-i}$ to be the transfer to grader $i$ for grading paper $j$ under TRUPEQA. Since $j$ is arbitrary, by Observation 1 it is sufficient to show that the inequality given by Definition 1 holds after replacing the utility $u_i$ with the transfer term $\pi_i^j$ and the set $P_i^{\mathrm{np}}$ with the single paper $j$. Formally, we need to show that, for all $t' \in \mathcal{T}$,

$$\mathbb{E}_{g_{\hat{t}_i}}\big[ R_j - R_j^{-i} \big] \;\geq\; \mathbb{E}_{g_{t'}}\big[ R_j - R_j^{-i} \big],$$

where $\hat{x}_{ij}$ is drawn from the distribution $g_{\hat{t}_i}(\cdot \mid y_j)$ on the LHS and from $g_{t'}(\cdot \mid y_j)$ on the RHS, the co-graders' reports are drawn from the distributions $g_{\hat{t}_{-i}}(\cdot \mid y_j)$, and $t'$ is any arbitrary accuracy.

It is easy to see that $R_j^{-i}$ is independent of the report of agent $i$ by definition. Therefore, in order to prove the inequality above, it is sufficient to show that

$$\mathbb{E}_{g_{\hat{t}_i}}\big[ R_j \big] \;\geq\; \mathbb{E}_{g_{t'}}\big[ R_j \big]. \qquad (8)$$

Consider the LHS of the above inequality and expand it. The first equality in the expansion is obtained by substituting the value of $R_j$ from Equation 5, and the second equality holds due to the chain rule of probability. The inequality then holds by the definition of $s_j^*$ (Equation 4): the term within the parentheses is maximized when $s_j^*$ operates on a report $\hat{x}_{ij}$ that is generated from the actual distribution perceived by agent $i$, i.e., $g_{\hat{t}_i}$. The inequality holds for any $t'$ on the RHS, in particular for the deviation $t'$ under consideration. The last two equalities hold by reorganizing the chain rule and applying the definition of $R_j$. Hence, we have obtained Equation 8 and the proof is complete. ∎
Our next result shows that every grader has a non-negative utility from participating in the peer-grading exercise.

Theorem 2 (EPIR)

TRUPEQA is EPIR.

Proof:  Pick arbitrary $i \in N$ and $j \in P_i^{\mathrm{np}}$. Define $\pi_i^j = R_j - R_j^{-i}$ to be the transfer to grader $i$ for grading paper $j$ under TRUPEQA. Since $j$ is arbitrary, by Observation 1 it is sufficient to show that the inequality given by Equation 2 holds after replacing the utility $u_i$ with the transfer term $\pi_i^j$ and the set $P_i^{\mathrm{np}}$ with the single paper $j$. Formally, we need to show that

$$\mathbb{E}_{g_{\hat{t}_i}}\big[ R_j - R_j^{-i} \big] \;\geq\; 0, \qquad (9)$$

where $\hat{x}_{ij}$ is drawn from the distribution $g_{\hat{t}_i}(\cdot \mid y_j)$ and the co-graders' reports are drawn from the distributions $g_{\hat{t}_{-i}}(\cdot \mid y_j)$.

Substituting for $\pi_i^j$ and the expressions of $R_j$ and $R_j^{-i}$ from Equations 5 and 7 gives the first equality in the expansion. The second equality holds by the chain rule of probability. The inequality then holds by the definition of $s_j^*$ (Equation 4): the first term within the inner parentheses is maximized when $s_j^*$ operates on reports generated from the actual distribution perceived by agent $i$, i.e., $g_{\hat{t}_i}$. The expected reward at $s_j^*$ is therefore no smaller than that at any other score, in particular at $s_j^{*,-i}$. Hence, we have obtained Equation 9 and the proof is complete. ∎
EIIC and EPIR make sure that reporting the observed scores according to the estimated accuracy is a best response of a grader if the other graders behave in a similar fashion. These guarantees are less vulnerable to the uninformative-equilibria problem of peer prediction mechanisms, because (a) TRUPEQA accounts for the inaccuracies of the graders via the probe papers, and (b) the payment is contingent on the true grades that are finally revealed. These two anchors, which are beyond the graders' control and have close proximity to the ground truth, keep the graders' options for manipulation restricted. The following observation follows from the fact that the mechanism chooses the score to minimize the expected cost (Step 5).
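The incentive claim can also be checked numerically. Under the Gaussian setting used in our sketches with a quadratic reward (where the ERM score is the posterior mean in closed form), a grader who injects extra noise into her reports, while the mechanism still trusts her estimated precision, sees her expected transfer shrink and eventually turn negative. All parameters below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, gamma0 = 5.0, 0.5             # prior N(mu0, 1/gamma0) on true scores
prec = np.array([1.0, 1.0, 1.0])   # estimated precisions of the 3 graders

def expected_transfer(extra_noise_sd, trials=200_000):
    """Average transfer pi_0 = R_j - R_j^{-0} when grader 0 secretly adds
    zero-mean noise of s.d. `extra_noise_sd` to her reports."""
    y = rng.normal(mu0, 1.0 / np.sqrt(gamma0), size=trials)
    x = y[:, None] + rng.normal(0.0, 1.0 / np.sqrt(prec), size=(trials, 3))
    x[:, 0] += rng.normal(0.0, extra_noise_sd, size=trials)
    # ERM score (posterior mean) with and without grader 0's report
    s_with = (gamma0 * mu0 + x @ prec) / (gamma0 + prec.sum())
    s_wo = (gamma0 * mu0 + x[:, 1:] @ prec[1:]) / (gamma0 + prec[1:].sum())
    # quadratic reward r(y, s) = -(y - s)^2, evaluated at the true score
    return float(np.mean(-(y - s_with) ** 2 + (y - s_wo) ** 2))

for sd in [0.0, 1.0, 2.0]:
    print(f"extra noise s.d. {sd}: E[transfer] = {expected_transfer(sd):+.4f}")
# positive without manipulation; it decreases as the injected noise grows
```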

Observation 2 (Expected Cost Minimizer)

Among all mechanisms using a fixed number of probe papers, TRUPEQA minimizes the total cost of rechecking.

5 Relationship with some classic mechanisms

The transfer term of TRUPEQA, given by Step 6 of Algorithm 1, resembles the payments of certain classic mechanisms in the independent and interdependent valuations setups. The first of them is the pivotal payment of the Vickrey-Clarke-Groves (VCG) mechanism (Vickrey, 1961; Clarke, 1971; Groves, 1973), yet TRUPEQA is quite a different mechanism from VCG. In the classical quasi-linear model, the decision of the designer affects the utility of the agents via the valuation function. But in this setting, the valuation, which is the cost of grading and is assumed to be zero, has no dependency on the decision made by the designer; the decision here is the final score of a graded paper. For the same reason, TRUPEQA is also different from the two-stage mechanism of Mezzetti (2004), which applies to the setting of interdependent valuations. The decision of the mechanism, i.e., the ERM score $s^*$ (Step 5 of Algorithm 1), affects the payoff of a grader via the transfer. The transfer is designed in such a way that the objective of the planner is aligned with the payoff of the graders. In other words, the designer makes the graders partners in her goal of maximizing the expected reward. The term $R_j^{-i}$ is subtracted to ensure that the bonus scores for peer grading are not too high while individual rationality is still satisfied. The QUEST mechanism (Bhat et al., 2014) is the closest to TRUPEQA in the structure of its decision and transfers, and applies to the setting of truthful crowdsourcing. The major differences of TRUPEQA from QUEST are that (a) in the latter, the choice of agents assigned to a task is also decided by solving an optimization problem, while in the former it is fixed, and (b) QUEST assumes that the agents know their qualities perfectly and report them to the designer, while in TRUPEQA the mechanism runs only on the reported scores and the qualities are only estimated, which also leads to a different incentive compatibility guarantee for this mechanism.

6 Empirical evaluation

Section 4 shows that TRUPEQA satisfies two very desirable properties in the context of peer-grading: truthfulness and voluntary participation. However, the grading accuracy, i.e., how close the peer-decided score is to the true score, is an important metric for measuring the efficacy of a mechanism. This metric is also correlated with the cost of rechecking the answerscripts and the number of answerscripts that ask for regrading. To complement our theoretical investigation of TRUPEQA, in this section we empirically evaluate the mechanism on the aspect of grading accuracy and compare it with some standard peer-grading protocols on both synthetic and real datasets. We also consider the strategic aspects of peer grading and conduct a separate evaluation for the case where the graders manipulate. The standard peer grading protocols assume that the graders submit their grades truthfully, and our experiments show that their accuracy suffers when graders are strategic.

The rest of this section is organized into two parts. In part one, we consider continuous scores, generate the data from a given prior, and use a well-known error model for the peer-graders. In part two, we consider a real dataset from a peer-graded course (Vozniuk et al., 2014) and adapt the error model to such a setting. In both cases, we consider the following metrics: (1) grading accuracy, i.e., the root mean square (RMS) error of the grades, and (2) the fraction of answerscripts requesting regrading, i.e., the fraction of papers for which the peer-graded score is not within a threshold of the true score, and compare them with those of the standard mechanisms. While the RMS error gives an aggregate view of the inaccuracy of a peer-grading mechanism, the regrading fraction gives a per-student view of the inaccuracy. The mechanisms considered (besides TRUPEQA) are (a) mean, where every answerscript is graded by multiple peers and the mean is taken as the peer-graded score, (b) median, which is similar to (a) except that the median is taken as the peer-graded score, and (c) Gibbs, i.e., Gibbs sampling on the model of grader bias and reliability as described in Piech et al. (2013).
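For reference, the two simplest baselines and both metrics are a few lines each; the array shape and the NaN convention for unassigned grader-paper pairs are our own (`scores[i, j]` is grader i's score for paper j):

```python
import numpy as np

def mean_mechanism(scores):
    """Per-paper mean of the peer scores (NaN marks unassigned pairs)."""
    return np.nanmean(scores, axis=0)

def median_mechanism(scores):
    """Per-paper median of the peer scores."""
    return np.nanmedian(scores, axis=0)

def rms_error(decided, true):
    """Aggregate view of a mechanism's inaccuracy."""
    return float(np.sqrt(np.mean((decided - true) ** 2)))

def regrade_fraction(decided, true, threshold):
    """Per-student view: fraction of papers whose decided score is off
    by more than `threshold`, i.e., papers expected to come back."""
    return float(np.mean(np.abs(decided - true) > threshold))
```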

Two other closely related peer-grading mechanisms are the spot checking (SC) (Gao et al., 2016) and the correlated agreement (CA) (Shnayder et al., 2016) mechanisms. The SC mechanism randomly picks a group of peer-graded papers and verifies their scores, a method called spot checking. The graders are rewarded or penalized depending on their agreement on the spot-checked papers. For the other papers, the payment is constant. This mechanism satisfies truthfulness in dominant strategies. The CA mechanism uses a peer prediction method (discussed in §1) that incentivizes truth-telling in an equilibrium. The novelty of both these algorithms lies entirely in the truthfulness aspect, as they leave open how the peer scores are aggregated into the final scores. Therefore, it may be assumed that they use one of the standard aggregation techniques against which we already compare our mechanism.

6.1 Simulation with continuous scores

For the generation of the true scores and the error model of the peer-graders, we use the model of grader bias and reliability as described in Piech et al. (2013), which is a widely used model for continuous scores. This model assumes that the true score for paper $j$ is distributed as $y_j \sim \mathcal{N}(\mu_0, 1/\gamma_0)$, for all $j \in N$; the reliability $\tau_i$ and the bias $b_i$ for a grader $i$ are distributed as $\tau_i \sim \mathcal{G}(\alpha_0, \beta_0)$ and $b_i \sim \mathcal{N}(0, 1/\eta_0)$ respectively, for all $i \in N$, where $\mathcal{G}$ and $\mathcal{N}$ are gamma and normal distributions respectively. The observed score of paper $j$ by grader $i$, given by $x_{ij}$, is distributed as $x_{ij} \sim \mathcal{N}(y_j + b_i, 1/\tau_i)$. The model parameters are chosen appropriately to reflect a realistic peer grading scenario.
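Sampling from this generative model is direct; the hyperparameter values below are arbitrary illustrations, not the paper's choices:

```python
import numpy as np

def sample_piech_model(n_papers, n_graders, mu0=5.0, gamma0=0.5,
                       alpha0=4.0, beta0=4.0, eta0=4.0, seed=0):
    """Sample true scores, grader reliabilities/biases, and observed
    scores from the grader bias-and-reliability model above."""
    rng = np.random.default_rng(seed)
    y = rng.normal(mu0, 1.0 / np.sqrt(gamma0), size=n_papers)  # true scores
    tau = rng.gamma(alpha0, 1.0 / beta0, size=n_graders)       # reliabilities
    b = rng.normal(0.0, 1.0 / np.sqrt(eta0), size=n_graders)   # biases
    # observed score of paper j by grader i: N(y_j + b_i, 1 / tau_i)
    noise = rng.normal(size=(n_graders, n_papers)) / np.sqrt(tau)[:, None]
    x = y[None, :] + b[:, None] + noise
    return y, b, tau, x

y, b, tau, x = sample_piech_model(n_papers=100, n_graders=100)
```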

In traditional examinations, the scores typically lie within a fixed range. The parameters of the above model are chosen in our experiment to compress this score spread into a narrow width. The values of $\mu_0$ and $\gamma_0$ imply that the true score comes from a prior which ensures that 95% of the score values lie in an interval centered at the mean $\mu_0$. Since bias is defined as the constant shift from the true score, the value of $\eta_0$ ensures that 95% of the bias values lie in a range with mean $0$, i.e., about a maximum shift of a few marks in a traditional examination. Reliability is defined as the inverse of the variance of the noise of the score observed by a grader after the bias is added to the true score. The reliability parameters $\alpha_0$ and $\beta_0$ are chosen such that 95% of the noise stays within a band determined by the mean reliability $\alpha_0/\beta_0$; the variation of the noise is comparable to a variation of a few marks in a traditional examination, and it shrinks as the reliability increases.

In the simulations, we consider $n$ students. The true grade of paper $j$, given by $y_j$, and agent $i$'s observation $x_{ij}$ are generated according to the above model, for all $i, j \in N$. For every grader $i$, the accuracy parameters $(b_i, \tau_i)$ are estimated via maximum likelihood given $(x_{ij}, y_j)$ for all $j \in P_i^{\mathrm{pr}}$. The reward function is given by $r(y_j, s_j)$, where $s_j$ is the score decided by an algorithm and $y_j$ is the true score. In the following sections, we consider three different assumptions on the graders' behavior and the mechanism's knowledge of the prior that impact the accuracy of the mechanism and the number of regrading requests.
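Under this model, the maximum likelihood estimates of a grader's bias and reliability from her probe papers are just the mean residual and the inverse of the residual variance; a small sketch of our own:

```python
import numpy as np

def estimate_accuracy(reported, true):
    """MLE of (bias, reliability) for one grader from her probe papers.

    reported : scores the grader gave to her probe papers
    true     : instructor scores of the same papers
    Under x = y + b + N(0, 1/tau), the MLE bias is the mean residual and
    the MLE reliability is the inverse mean squared deviation."""
    residuals = np.asarray(reported) - np.asarray(true)
    b_hat = residuals.mean()
    tau_hat = 1.0 / np.mean((residuals - b_hat) ** 2)
    return b_hat, tau_hat

# example: a grader who shifts scores up by about 1 with small noise
rep = [7.1, 6.0, 8.2, 5.9]; tru = [6.0, 5.0, 7.0, 5.0]
print(estimate_accuracy(rep, tru))   # bias near +1, high reliability
```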

6.1.1 Graders are truthful

For TRUPEQA, we use $m$ papers as probes. Papers are distributed such that each student gets $d$ papers to grade, of which $q$ are probe and $d - q$ are non-probe. Each non-probe paper is also graded by at least $k$ graders. The ERM is calculated according to Equation 4. Finally, the root mean squared (RMS) error between the decided scores $s_j^*$ and the true scores $y_j$ is computed. We provide the calculations to find the expressions of the estimated accuracies, the posterior of $y_j$ given the observations, and the decomposed ERM in Appendix A for a cleaner presentation.

The Gibbs setup uses Gibbs sampling on the model of grader bias and reliability as described in Piech et al. (2013). The grades given by the mechanisms mean and median are the mean and the median, respectively, of $\{x_{ij} : i \in G_j\}$, for each paper $j$.

We choose a threshold $\theta$ for a regrading request, i.e., paper $j$ comes back for regrading if the score $s_j$ decided by a mechanism satisfies $|s_j - y_j| > \theta$.

For every mean reliability, true grades are randomly generated several times, and for each true grade vector, observations according to the parameters are repeated several times. The average performances of the four algorithms discussed above are shown in Figures 1(a) and 1(b) (RMS error) for two different parameter settings, and in Figures 2(a) and 2(b) (fraction of papers requested for regrading), with 95% confidence intervals.
