1 Introduction
Educational platforms such as Coursera, Udacity, and EdX provide easy access to high level education to everyone who has a decent Internet access. In September , these platforms had more than million users —essentially, students attending the offered courses— and this number is expected to further increase in the near future. The term “massive open online course”, or simply MOOC, is very descriptive of the service these platforms offer. A MOOC is the result of their partnership with a faculty member in a top university, whose role is to design the course and organize the course material so that it takes advantage of the most popular Internet apps that the platform utilizes. Courses offered include literally everything.
Even though the service provided is certainly useful, the viability of MOOCs will strongly depend on their sources of revenues. Currently, investments from VCs have secured their survival for a short term, but what about their future? A feature that could be the main source of revenue for MOOCs is the socalled verified certificate which the students can get at a cost of a few dozens of dollars. The verified certificate keeps information about the performance of a student in a course (or in a chain of courses) and can be used to justify a student’s quality to potential employers. So, the verified certificate should have reliable information about the student performance in the courses she has participated in. Even though the means to guarantee this in the traditional University system is wellestablished, achieving this in a MOOC is a challenge.
The big issue is in the massive student participation. Of course, the Internet provides tools so that organizing exams with huge numbers of students is possible. But what about assessment and grading? As the most popular courses attract students or more and the vision of MOOCs enthusiasts is for millions of students per course, is grading of assignments or exams possible? No doubt, professional graders would be extremely costly. Organizing the material using multiplechoice questions and answers that could be graded automatically cannot be an option when the students are asked to prepare an essay or a formal mathematical proof or express their critical thinking over some issue. Grading is a typical example of a human computation [11] task in these cases.
The only solution that seems consistent to the MOOCs vision is known as peer grading [10, 13, 16], according to which the grading task is outsourced to the students that participated to the exam themselves. This approach has been already implemented in some MOOCs and standalone experimental tools such as crowdgrader.org [8], peergrading.org [14], and our own corank^{1}^{1}1Available at corank.ceid.upatras.gr. [3]
are already available. Even though the approach seems straightforward, there are subtle implementation issues. For example, allowing the students to use cardinal scores is problematic, since they participate both in the exam and in grading and they may have incentives to assign low grades in order to improve their personal relative performance. Even if we assume that they grade honestly, their experience in doing so is very limited and the result will most probably be unreliable.
In this paper, we focus our attention on ordinal peer grading
, which has recently received attention in the AI and machine learning community
[4, 14, 15]. Following the setting that we considered in our previous work [4], each student gets a bundle of a small number (our favourite number that we have extensively used recently is ) of exam papers so that each exam paper is given to the same number of students. Each student has to rank the exam papers in her bundle (in terms of quality) and an aggregation rule will then combine the (partial) rankings submitted by the students and come up with a final ranking of all exam papers; this will be the grading outcome. Information about the position of a student in the final ranking (e.g., top out of students) can be included in her verified certificate.In [4], we formally proved that a simple aggregation rule, inspired from Borda’s rule from social choice theory [2], recovers correctly an expected fraction of of the pairwise relations in the underlying ground truth ranking, when bundles of size are used and students make no mistakes when grading. The assumption for a ground truth and the comparison of the grading outcome to it is similar in spirit to recent approaches that combine voting and learning [5, 6, 7]. The new aspect in [4] (as well as in the current paper) is the relaxed requirement of recovering the ground truth only approximately. Experimental results show that this rule has very good performance in an imperfect grading scenario inspired by a noisy model of generating random rankings that has been proposed by Mallows [12]. Note that, unlike other studies [14, 15], we investigate the potential of applying ordinal peer grading exclusively, without involving any professionals in grading.
In this paper, we follow a different approach. To explain our rationale, we remark that theoretical analysis requires to handle with extra care dependencies between several random variables that appear due to the distribution of exam papers to bundles. The analysis of Borda was possible only due to its particular definition; we have not managed to extend the analysis to any other aggregation rule. Also, the
notation in the theoretical guarantee for Borda above hides large constant terms that constitute the bound of theoretical interest only. We would like to develop a “theory” for determining the performance of Borda with the highest possible accuracy. And, of course, why should we restrict our study only to Borda?Instead, we define and study a large class of simple aggregation rules, which we call typeordering aggregation rules; the class includes Borda. We present a theoretical framework for assessing the performance of each member of this class with respect to a series of performance objectives. A crucial step in our study is that we have completely neglected the dependencies between the random variables that make the rigorous analysis difficult. This sacrifice of mathematical rigor is formally incorrect unless the number of students tends to infinity; this can be justified by the massive participation in MOOCs. But the best justification of our approach is that the theoretical predictions of performance are experimentally shown to be exact, which means that the dependencies have no positive or negative impact on performance. Furthermore, once (statistical) information about the grading behaviour of students and the desired performance objectives are known, our framework can serve as an optimization toolkit for selecting the optimal typeordering aggregation rule. This requires an exact solution to an instance of the feedback arc set problem which, albeit NPhard in general, can be solved exactly for the instances that do arise.
Our theoretical framework allows us to obtain a series of results. For example, we establish that Borda is the optimal typeordering aggregation rule when students act as perfect graders. This is rather surprising, since Borda is among the simplest aggregation rules in the class we consider. Even though it was not observed to be optimal in any other scenario we considered, its performance is always extremely close to optimality. Furthermore, as mentioned above, the optimization task of deciding the optimal aggregation rule strongly depends on the information about grading behaviour. We study how inaccuracies of this information affect the choice of the optimal aggregation rule and its performance for the Mallows model. The results suggest a very minor impact and, essentially, a tiny sample of a student population is enough for building a fairly accurate model of grading behaviour.
Overall, our approach combines theory, simulations, and experimentation and is presented graphically in Figure 1
. The lower chain of the figure describes what one would expect from a simulated exam. There is a student population and some of them participate in an exam. The preparation level of the participating students that determines their performance in the exam is a random variable following a uniform probability distribution. After the exam, each student acts as the grader of a small number of exam papers submitted by other students. The grading performance can depend on the preparation level as well. The grades are combined using the aggregation rule and the final ranking is compared to the ground truth to come up with the observed performance.
The most interesting part of Figure 1 is the upper chain. First, a field experiment can be used to extract information about the student population, translated into a noise model. We have performed such a field experiment with students in our home institution; we describe it in detail and present the collected data later in the paper. These data are used to build a noise model which, together with the desired performance objective, are then given as input to the optimization engine which constructs the optimal aggregation rule for the particular noise model and performance objective, and reports a theoretical prediction about the performance the rule is expected to have. The optimal aggregation rule can also be applied to the grades from our simulated exams (hence, the vertical arrow in Figure 1) and a comparison of the theoretically predicted performance with the observed performance of the simulated exam will validate our theory.
2 Preliminaries
We assume that students have participated in an exam and have submitted their exam papers. Our approach to ordinal peer grading has three distinct tasks: the distribution of papers to students, the grading task by each student, and the aggregation of the grades into a final result. We describe these tasks in detail here and give definitions that will be useful later.
2.1 Distributing the exam papers
All students that participated in the exam will have to participate in grading as well. The goal of the first task is to balance their grading load. This is done by distributing (copies of) each paper to the students so that each exam paper is given to exactly students and each student receives exactly (distinct) exam papers. The papers that a student receives form her bundle. These are the exam papers which the student has to grade. Crucially, the bundle of a student should not contain her own exam paper.
A regular bipartite graph with nodes on each side of the bipartition (called an bundle graph in [4]) can be used to represent the distribution of exam papers to students. Each node of sets and represents a student. An edge of the graph between a node and a node indicates that the exam paper of the student corresponding to node is in the bundle of the student corresponding to node . The restriction on the degree of the nodes of set means that each exam paper is given to exactly students and the restriction on the degree of the nodes of means that all bundles have size .
In our previous work, we considered bundle graphs that satisfy a particular structural property, namely they contain no cycle of length . This was a technical constraint that was required only in our theoretical analysis. Experimental results in that paper indicate that uniformly random regular bipartite graphs are almost as good as bundle graphs. These are the bundle graphs we considered in the current work. A random regular graph can be built as follows. Starting from the complete bipartite graph with node sets and , first remove the edges between nodes corresponding to the same students in and . Then, draw a perfect matching uniformly at random among all perfect matchings of that do not include previously removed edges. Repeat this step times; this is possible due to Hall’s matching theorem since, at the beginning of every step, the edges that have not previously been removed form a regular bipartite graph. The union of the edges in the perfect matchings forms the bundle graph.
2.2 Modelling the grading task
Throughout the paper, we assume that there is an underlying strict ranking of the exam papers, the ground truth, which we aim to recover. As it will shortly become apparent, the setting we consider is so restrictive that we should not expect to recover the ground truth exactly. Instead, we would like to recover the ground truth approximately.
A restriction of our setting is that each student is given only exam papers to grade. Another restriction is that the grading task for each student is simply to rank the exam papers in her bundle, in decreasing order of quality. We consider different scenarios for the grading behaviour of the students. In a first scenario, assume that, after the end of the exam, the instructor announces indicative solutions and gives detailed instructions that the students can use during grading. Here, we assume that students will act as perfect graders. Admittedly, this is an unrealistic assumption but we include it as an extreme case in our study together with many others.
Our most realistic scenario poses another challenging restriction. In general, we assume that students make mistakes when grading. For modelling purposes, we assume that a student grades as follows. When she receives a bundle of papers, she draws a random ranking according to a probability distribution that characterizes the grading behaviour of all students participating in the exam. So, the behaviour of a student in an imperfect grading scenario is characterized by a noise matrix , where denotes the probability that the exam paper with correct rank among the exam papers in a bundle is ranked at position by the grader. Clearly, a noise matrix is doubly stochastic, i.e., the sum of the entries in any column and any row is equal to . Observe that the corresponding noise matrix for perfect grading is the identity matrix. The term “noise model” is often used as a synonym for the term “noise matrix”.
Note that a noise matrix provides only aggregate information over all students of a population. Actually, it is not hard to see that a doubly stochastic matrix may correspond to many different probability distributions over rankings. This kind of information will be the only tool we will use in our theoretical analysis. On the other hand, in our simulations, we will use more refined models of grader behaviour, assuming that it strongly depends on the level of preparation of the student for the exam and her success in it. A particular such model is the following. Each student has a quality drawn uniformly at random from the interval
and affects her position in the ground truth and her ability to grade as well. The ground truth is the ranking of the students in decreasing order of quality. A student of quality performs the grading task as follows: she considers every pair of exam papers and in her bundle, such that appears ahead of in the ground truth, and temporarily determines with probability and with probability ; the pairwise relation will evolve into her ranking of the exam papers in her bundle. If, after considering all pairs of exam papers in the bundle, the pairwise relation is cyclic, the whole process is repeated from scratch. Otherwise, the ranking of the exam papers in the bundle induced by is the grading outcome of student .The above process has been implemented and is used in our simulations with bundles of size . In our theoretical analysis, we have computed the corresponding noise matrix by sampling students with uniform qualities and simulating the grading behaviour described above. The resulted noise matrix is presented here for ease of reference.
(1) 
Due to its similarities with the wellknown Mallows model [12] for generating random rankings, we refer to this grading behaviour as Mallows grading.
2.3 Aggregation rules
The third important task is to aggregate the partial rankings provided by the graders into a final output ranking. This is done using an aggregation rule. A simple but very compelling aggregation rule is inspired by the Borda count voting rule. In our context, Borda computes a score for each exam paper by examining the positions it has in the rankings of the graders that have this exam paper in their bundles. A first position by an exam paper contributes points to its score, a second position contributes points, and so on. The outcome of Borda is a ranking of the exam papers in nonincreasing order of their Borda scores. When we use Borda, we assume that ties are broken uniformly at random but other tiebreaking schemes could be considered as well. More generally, a positional scoring aggregation rule
could use a different score vector
with (and ) in order to increase the score of each exam paper by whenever it is ranked th by a grader.In our previous paper [4]
, we also considered several other aggregation rules such as a rule that we call Random Serial Dictatorship (RSD) as well as rules that are based on appropriately defined Markov chains. RSD is very slow in the computation of the final outcome and, even though it performs remarkably well with perfect graders, it has a poor performance in simulated exams with Mallows graders. We will not consider it in the current paper; actually, applying it with input from
graders, which is the typical scenario we consider in this paper, is a computational challenge. The aggregation rules that are based on Markov chains were defined in an unsuccessful attempt to distinguish between high and low quality graders and put more weight on the partial rankings of the former. These ideas are not considered in this work either. Instead, we focus on much simpler aggregation rules.3 Typeordering aggregation rules
We will use the term type to refer to the grading result of an exam paper. The grading result for it consists of the ranks it gets from the graders that have it in their bundles. So, the type is a vector of integers from . We follow the convention that the entries in types appear in monotone nondecreasing order. We use
to denote the set of all types for bundle size . It is not hard to see that contains different types.
As an example with , an exam paper of type is ranked first by one of its graders, second by four graders, and fifth by one grader. Now, consider another exam paper of type and observe that Borda would give to both exam papers the same Borda score of . Is there some particular reason for which these two exam papers should be very close in the final ranking? Now, consider the two types and of Borda scores and , respectively. Borda indicates that an exam paper with the second type is better. But looking carefully at the ranks, we could come up with the following interpretation. The first exam paper is very good (and most probably in one of the two top positions in any bundle) and the two low ranks are due to poor judgement by the graders. In contrast, the second exam paper is just above average and this is reflected in all grades. Of course, such interpretations are valid only when they can be supported by information about the graders. But, certainly, there are cases where such interpretations are indeed valid.
So, it seems that Borda is restrictive; then, one would think that this is due to the particular scores that Borda uses. We will not consider the task of investigating whether different scores could yield better results but will instead define a much broader class of aggregation rules. A typeordering aggregation rule uses a strict ordering of the types in . Then, the final ranking of the exam papers follows the ordering of their types, breaking ties uniformly at random. In general, rules of this class seem to be very powerful. Compared to Borda which partitions the set of exam papers into only different scores, a typeordering aggregation rule can distinguish between exponentially many (in terms of ) different types. In the following, we use the term Borda ordering to refer to any ordering of the types in nonincreasing order of Borda score. We also use to denote the Borda score of an exam paper with type . Clearly, .
We remark that the use of types in the definition of a broad class of aggregation rules has been possible due to the regularity that we imposed on the bundles and the distribution of exam papers to them. Of course, this creates issues related to the theoretical analysis of these rules (such as dependencies between the random variables involved in the distribution to bundles and in grading) and it should be expected that their rigorous analysis will be much more involved than the analysis of Borda (which is already quite complicated; see [4]). In the next section, we discuss how to overcome such issues.
3.1 A framework for theoretical analysis
For the analysis of typeordering aggregation rules, we will assume an infinite number of students. This is close to the vision of MOOCs with huge numbers of enrolled students and is the important assumption that constitutes the theoretical analysis possible. So, the positions of students in the ground truth ranking can be thought of as occupying the continuum of the interval . We will usually identify an exam paper as a real number , i.e., by its rank in the ground truth ranking. Furthermore, we will assume that in each of the bundles to which exam paper belongs, the remaining exam papers are selected uniformly at random from the student population. Our assumption of infinitely many students allows us to ignore subtleties such as the requirement that all students in a bundle should be distinct and also different than the student that acts as the grader of the bundle (the probability that this requirement will not be satisfied in some bundle is zero).
Consider an aggregation rule that uses an ordering of the types defined by bundles of size and is applied to partial rankings provided by graders whose behaviour follows the noise model . Let us focus on computing the expected number of pairwise relations in the ground truth ranking that are correctly recovered in the outcome of the rule. It suffices to consider every pair of exam papers with (i.e., exam paper has a better rank in the ground truth compared to exam paper ) and add one point if has a better type than according to the ordering , and half a point if both exam papers have the same type. In this last case, the tie is resolved uniformly at random and the probability that the correct pairwise relation will be recovered is . Hence, denoting the expected fraction of pairwise relations recovered by the rule by (we will refine this notation in a while) and the event that exam paper gets type after grading by , we have
The first sum runs over all pairs of different types with order and the second sum runs over all types. The scary (at first glance) double integral can be hidden under the notation to obtain
(2) 
We will use the term weight to refer to the quantity . Our assumption for an infinite number of students nullifies any dependencies between the rank vectors that exam papers and get after grading. So, the events and are independent and the definition of the weight becomes
(3) 
Let us now compute the probability that exam paper gets type . By considering all ways to distribute the entries of the type vector as ranks of an exam paper by the graders that handle it (ignoring symmetries), there are
ways that the exam paper can get type , where is the number of graders that have the exam paper ranked th. Again, due to our assumption for infinitely many students and the uniform inclusion of them into bundles, the quality of each exam paper included in a bundle does not affect the quality of other exam papers (in the same or different bundles). Clearly, the grading by different students is performed without dependencies either. Denoting by the event that exam paper is ranked th in a bundle, the probability that is of type is
To compute , it suffices to consider all possible true ranks that exam paper may have in a bundle and account for the probability of having such a rank and being ranked th by the grader handling the bundle. Let us denote by the event that the true rank of in a bundle is . Then,
Now, the probability is equal to the number of ways we can choose exam papers to be ahead of times the probability that all of them will indeed be ahead of in the bundle times the probability that the rest exam papers in the bundle will have true ranks worse than . We use to denote the set of all entry vectors with and, for compactness of notation, we abbreviate by . We have
where the second equality is obtained by exchanging the sum and product operators. Using the fact that for , we obtain
(5)  
Interestingly, is a univariate polynomial of degree . Then, the double integral can be computed analytically. The computation is tedious but straightforward. In Appendix A, we show how to compute such integrals.
3.2 Computing optimal typeordering aggregation rules
The approach in Section 3.1 suggests a general way of evaluating the performance of any typeordering aggregation rule. In order to compute the expected number of correctly recovered pairwise relations, it suffices to use equations (2), (3), and (5). Equation (5) can be used to obtain , which is then used in equation (3) to compute the weights (for any possible pair of types and ). Finally, equation (2) returns the expected number of correctly recovered pairwise relations.
Of course, the expected number of correctly recovered pairwise relations is not the only performance objective one would like to measure. For example, we could simply ignore exam papers that are very close in the ground truth ranking. The ground truth ranking is mostly a modelling assumption and it should not be very restrictive in the evaluation of an aggregation rule. So, we could just measure the expected number of correctly recovered pairwise relations between pairs of exam papers with ranks in the ground truth that differ by at least (for small values such as ). Another possibility would be to ignore pairwise relations between pairs of exam papers that have both very low rank in the ground truth. For example, why is it important to recover correctly the pairwise relation between the students that have true ranks and ? A general objective in this direction would be to measure the correctly recovered relations between pairs of exam papers that involve one with true rank in the top (e.g., ).
Our theoretical framework can be easily extended to handle such cases. In general, a performance objective is defined by a bivariate function which returns the importance of measuring a correctly recovered relation between two students and with . In the presentation of our framework in Section 3.1, we have assumed such a function with for every student pair. The two scenarios of the previous paragraph can be captured by the function (i) when and otherwise, and (ii) when (and ) and otherwise. Many other performance objectives can be defined including ones in which the function returns fractional values between and .
The only modification in the computation of Section 3.1 is in the computation of the weights which should become
(6) 
In order to capture the generality of the scenarios considered, we overload the notation for the performance measure to specify the bundle size , the aggregation rule , the noise matrix describing the grading behaviour, and the performance objective .
Theorem 1.
Consider a typeordering aggregation rule that is applied on sized partial rankings from an infinite population of students with grading behaviour that follows a noise matrix . Then, the fraction of correctly recovered pairwise relations that satisfy the performance objective given by the bivariate function is
(7) 
Note that the weights do not depend on the aggregation rule at all. They depend on the grading behaviour and the performance objective. Instead, the aggregation rule determines only the particular weights that should be summed up in order to compute . This means that, once we have information about the bundle size, the grading behaviour, and the desired performance objective, we can seek for the typeordering aggregation rule that is optimal for this particular scenario. All we have to do is to compute the typeordering aggregation rule that maximizes which, actually, translates to computing an ordering of the types so that the leftmost summation in the definition (7) is maximized.
It is not hard to see that the problem is equivalent to solving the feedback arc set problem on an edgeweighted complete directed graph. In particular, the input is a complete directed graph that has a node for each type . A directed edge from a node corresponding to type towards a node corresponding to type has weight . Now, the objective is to find an ordering of the nodes so that the total weight of “consistently directed” edges from a node to a node of higher rank in the ordering is maximized. The next statement should now be obvious.
Theorem 2.
Computing the optimal typeordering aggregation rule for a scenario involving specific bundle size, grading behaviour, and desired performance objective is equivalent to solving feedback arc set on an edgeweighted complete directed graph.
Feedback arc set (FAS) is NPhard even in its very simple variant on unweighted tournaments [1]. The particular weighted version we consider here admits a PTAS [9]. Unfortunately, the solutions that such a PTAS can guarantee in reasonable time are sufficiently far from optimality and the resulting typeordering aggregation rule will consequently have highly suboptimal performance. Fortunately, the FAS instances that we had to solve in order to compute optimal rules have a very nice structure for all the scenarios considered. This structure allows us to compute the optimal FAS solution (almost) exactly by a straightforward algorithm that we present in the following. We strongly believe that this nice property holds in any scenario that can appear in practice.
Let us assume that we would like to solve FAS on an edgeweighted complete directed graph and to compute an ordering of the nodes of so that the total weight of edges in the direction that is consistent to the ordering is as high as possible. First observe that if two opposite directed edges have the same weight, the ordering of its endpoints does not affect the contribution of the consistently directed edge. So, the decision about the relative order of such noncritical node pairs can be postponed until the very end of the algorithm and any decision about them will be just fine. Now, consider two nodes and of such that ; then, the consistently directed edge that we would like to have in the final solution is . We will call such pairs of nodes critical pairs. Decisions about the ordering of critical pairs of nodes have to be taken first. An ideal situation would be if after deciding the critical node pairs, we came up with a partial ordering of all nodes that participate in at least one critical pair. The ordering could then be completed by appropriate decisions about noncritical node pairs. And, luckily, this process would have resulted in an optimal solution for FAS since every pair of nodes has the maximum possible contribution to the objective. Of course, things are not as easy in general since the decisions about critical pairs may lead to cycles of nodes, which cannot be part of the final ordering.
Our algorithm proceeds as follows. It takes as input an edgeweighted complete directed graph with as the node set and weight (computed using (6)) for every directed edge from type to type . Our algorithm builds an auxiliary unweighted directed graph again over the types. For every critical pair of types with , the auxiliary graph has a directed edge from type to type . The next step is to compute all strongly connected components of ; two types and belong to the same strongly connected component if contains a directed path from to and a directed path from to . This computation can be easily done by computing breadth first search trees rooted at every node of . After this step, the ordering of the types in different strongly connected components is irrevocably decided. In order to decide the ordering of types within the same strongly connected component, we use brute force on the corresponding subgraph of . If the size of a strongly connected component is so large that brute forcing is prohibitive, we just order the types within the component according to a Borda ordering (breaking ties uniformly at random). As a final step, we decide the order of noncritical node pairs.
The approach to use Borda ordering when brute forcing is very costly in terms of running time might give the impression that the outcome of the above algorithm is always very close to a Borda ordering. Surprisingly, our algorithm returns Borda orderings (or orderings that are very close to Borda) only when this is absolutely necessary. One such situation is presented in the next section where we show that Borda is indeed the optimal aggregation rule in all scenarios that involve perfect graders. For imperfect graders, brute forcing has been proved extremely useful as the vast majority of strongly connected components are very small. We report statistical information from the size distribution of strongly connected components in Section 4.
3.3 Borda is optimal for perfect graders
We will now exploit our theoretical framework to obtain our first concrete result.
Theorem 3.
For every scenario that involves perfect graders, Borda (with any tie breaking) is the optimal typeordering aggregation rule.
Proof.
Assume that we have a scenario with a bundle size of , perfect grading (i.e., a identity noise matrix), and a bivariate function that represents the performance objective.
We first compute the probability that exam paper gets type using (3.1) and the fact that if and otherwise. Hence,
Now, consider two exam papers with ranks and in the ground truth such that and let and be two types. Using the above equality, we obtain
(8) 
Since , it is also and the right hand side of the last equation is above, equal, or below if and only if the quantity is positive, zero, or negative. Hence, the quantity
and the Borda score difference between the two types and have the same sign. Now, let be the signum function. Then, we have that
This implies that any Borda ordering of the types maximizes the quantity and, hence, the quantity , and the theorem follows. ∎
4 Validation of our framework
4.1 Building a realistic noise model using a field experiment
We have run a field experiment with the students that attended the course on Computational Complexity in our home institution during the Spring 2015 semester. This is a course that the first author teaches during the last few years and usually includes an optional mid term exam. As it is typically the case in Greek universities, cardinal integer and halfinteger scores between and are used in such exams and they represent how correct the answers of the students to the exam questions are. Hence, these cardinal scores represent the success of the students in the exam in absolute terms.
In our experiment, our goal has been to investigate how effective the students can be in ordinal grading. For this purpose, we created a hypothetical exam with three questions and prepared several answers for them. In particular, we prepared 16 different answers to question 1, 12 answers to question 2, and 8 answers to question 3. Combinations of these answers into all different ways resulted in a pool of 1536 different exam papers. We created bundles of size from this pool. Each student was given a bundle of exam papers which was asked to rank (for a bonus grade). Note that the selection of papers in each bundle was not arbitrary. The answers for the questions belonged to different levels of correctness and included excellent ones, almost excellent ones with a minor issue not fully resolved, answers in the right direction but with sloppy writeup, completely incorrect answers, no answer at all, etc. Specifically, we had 7, 6, and 5 different levels of correctness for the answers in questions 1, 2, and 3, respectively. When bundles were formed, we imposed the following constraint for any pair of exam papers and in a bundle: if the correctness level of paper in an answer is strictly higher than that in paper , then paper cannot have a strictly higher correctness level than in any other answer. Furthermore, there was at least one question for which the answers had different levels of correctness. This guaranteed a strict ranking of the exam papers in each bundle and, furthermore, that this ranking would be welldefined and independent of any assumptions about the importance of the different questions.
In addition to this ranking exercise, the students also participated in the traditional mid term exam. This allowed us to quantify the correlation between their grading behaviour and their success in the traditional exam. The results are depicted in Figure 2
(a). Even though students that were excellent in the traditional exam did very well as graders, the grading performance of the majority of the remaining students seems to be uniformly distributed between average and excellent, with just a few underperforming outliers.
^{2}^{2}2An explanation for this grading behaviour is that, even though the students have participated in many exams like the mid term in the past and have a very good idea of what they are expected to do, this was the very first time they were asked to rank. We plan to make the whole information (questions, answers, indicative bundles, etc.) available to the students attending our Computational Complexity course during the current semester (Spring 2016) so that they are wellprepared for the next field experiment. It will be very interesting to see how this preparation will affect their grading behaviour.For comparison, we have plotted the same small number of randomly chosen students according to the Mallows distribution in Figure 2(b). Here, the correlation between student quality (to be thought of as equivalent to the success in the traditional exam) and grading performance is clear. The data depicted in Figure 2(a) have been used extensively in “realistic” exams. In these exams, we simulate a large population of students, whose quality and grading behaviour is determined by randomly drawing a bubble from those in Figure 2(a) with probability proportional to the area of the bubbles. Essentially, each of our students in the experiment serves as the support of the realistic distribution while the quality is slightly perturbed to result in a strict ground truth ranking.
The above information has also been distilled in the noise matrix with
(9) 
which represents a “realistic” noise model. The information in the matrix was obtained by measuring the frequency that the th ranked exam paper by students should be correctly ranked at position . For example, out of students ranked third an exam paper in their bundles which should have been ranked fourth; thus, . This matrix is used in the computation of the optimal typeordering aggregation rules for realistic scenarios according to the methodology developed in Sections 3.1 and 3.2. Observe that matrix does not include any information about the correlation between the grading behaviour and the quality of the student that acts as grader; actually, this is a feature that our framework completely neglects. Surprisingly, the experimental results that we present in the following indicate that storing information about this relation is not necessary for making accurate performance predictions.
4.2 On the accuracy of theoretical performance predictions
Our next step was to compute the optimal typeordering aggregation rules for several scenarios. We have kept the same bundle size of in all scenarios and distinguish between the realistic and Mallows noise models by using the corresponding matrices and defined in equations (9) and (1), respectively. As performance objectives, we have considered the following:

all2all: the total number of all correctly recovered pairwise relations;

th10% and th50%: the total number of correctly recovered relations between pairs that include an exam paper that is ranked in the top and top in the ground truth, respectively;

acc2% and acc5%: the total number of correctly recovered relations between pairs with positions that differ by at least and in the ground truth, respectively.
We have used the theoretical framework presented in Sections 3.1 and 3.2; of course, computations have been automated. All the computational results that we report in the following have been obtained using an Intel 12core i7 machine with 32Gb of RAM running Windows 7. Our methods have been implemented in C using the GNU Multiple Precision Arithmetic Library (GMP) and in Matlab R2013a. In particular, high precision is absolutely necessary in order to compute the weights even for bundles of size since, by inspecting equations (5) and (6) carefully, we can see that there are products with more than factors and factorials of integers up to that are involved in the computations.
A first remark is that the algorithm that we have used to solve FAS exactly is very fast. This is due to the fact that strongly connected components have very small size. In all cases we considered, among the 462 different types that we can have for bundles of size , more than 97% of them form singleton components and the maximum component size never exceeded 26 (for the sample approximation of Mallows that we discuss in Section 4.3). Their distribution for scenarios with realistic and Mallows noise models for all performance objectives is depicted in Table 1.


realistic model  mallows model  


size  all2all  th50%  acc2%  acc5%  all2all  th50%  acc2%  acc5% 


1  448  460  449  451  453  459  449  449 
3–7  13  2  12  10  6  3  10  12 
8–11  1  0  1  1  2  0  2  0 
12  0  0  0  0  1  0  1  1 


max  10  3  10  10  20  4  20  20 

Recall that our theoretical framework assumes an infinite number of students and, hence, neglects apparent dependencies between random variables that certainly exist in scenarios with finite number of students. So, the information in Table 2 is rather surprising and shows that, besides our assumptions, our theory provides extremely accurate predictions for the performance of aggregation rules in practice. Note that the values in Table 2 are percentages and we never observed differences beyond the second decimal point between the theoretically predicted value and the experimental one. Also, note that we have used students in our experiments. This number is lower than the vision for the most popular courses that will be offered by MOOCs in the near future; the predictions become even more accurate for higher numbers of students.


noise  perfect grading  realistic grading  mallows grading  
setting  theory  theory  theory  
method  borda  borda  opt  borda  opt  borda  opt  borda  opt  borda 


all2all  92.01  92.02  80.01  79.57  80.09  79.57  85.15  84.38  85.16  84.39 
th10%  96.94  96.95  87.61  87.18  87.60  87.17  92.05  90.52  92.07  90.54 
th50%  94.13  94.14  83.62  83.43  83.62  83.43  88.39  87.80  88.40  87.81 
acc2%  93.57  93.57  81.27  80.73  81.27  80.74  86.52  85.72  86.52  85.73 
acc5%  95.47  95.47  82.97  82.42  82.97  82.42  88.42  87.61  88.42  87.62 

Table 2 contains the expected values (from 1000 simulations) of the performance measure for each grading scenario and corresponding optimal aggregation rule (as well as for Borda). Even though Borda is never optimal in any imperfect grading scenario that we considered, its performance is always very close to the optimal rule. In particular, Borda is never more than worse than the optimal rule in realistic scenarios (for the acc5% performance objective); such differences can be up to in scenarios with Mallows noise (and the th10% objective). Figure 3 reports detailed information for all these simulations. Clearly, the performance of the aggregation rules for all the objectives that we considered is sharply concentrated around the expected values. The clouds of points are very close to the diagonal (corresponding to values with equal coordinates) in the realistic scenarios and only marginally further from it in the Mallows scenarios.
A final comment on the performance of the optimal typeordering aggregation rules is that they are extremely robust. Even though they have been optimized with respect to a particular performance objective, they perform very well with respect to other objectives as well. Figure 4 shows measurements of properties that cannot be expressed as performance objectives under our framework. Each plot shows data about Borda and the optimal (under the all2all objective) typeordering aggregation rules in scenarios with perfect, realistic, and Mallows grading. Borda in the perfect grading scenario achieves the best performance with respect to these objectives as well. Actually, its performance in this scenario can serve as the optimistic barrier for every typeordering aggregation rule in any (imperfect) grading scenario. More interestingly, Borda has performance that is very close to the optimal rule for realistic grading (the corresponding curves almost coincide in Figure 4) and slightly worse for Mallows grading. These results are in sync to those in Table 2, Figure 3, and Figure 4 and showcases the robustness of Borda in different scenarios.
4.3 The effect of inaccuracies in the noise model
The realistic noise model that we built in Section 4.1 is, by definition, an approximation of the students in our home institution. Besides limitations that have to do with our modelling assumptions, it has the obvious drawback that it has been built using a very small fraction of our students. So far, the reader should have been convinced that the aggregation rules we have built are indeed optimal for a large population that inherits the quality and grading performance of this small fraction of students; this has been the focus of our experiments with realistic grading. What is far from clear is whether these aggregation rules will perform equally well for the whole population of the students in our home institution. To see the importance of this question, imagine it in the planetary scale that MOOCs envision. Can we make safe predictions for huge student populations by sampling a tiny fraction of them, building a noise model as we did in Section 4.1, and then select optimal typeordering aggregation rules as we did in Section 4.2?
We give a positive answer to this question by considering Mallows grading scenarios. With Mallows, we have the luxury of a welldefined noise model for the grading behaviour of a huge student population and we have used it in order to compute optimal aggregation rules for this model. This information will be used only for assessing the approach presented in the following. Instead, we will pretend that no information about grading behaviour is available and all we can do is to apply (actually, to simulate) a field experiment like the one we presented in Section 4.1 on a tiny fraction of the students in order to come up with a noise matrix. In this way, we will compute an approximation of the true noise model.
We have followed this approach using samples of Mallows graders of size and . The two noise model matrices we obtained are as follows:
and
The matrices have been used to compute the optimal typeordering aggregation rules for the five performance objectives. Interestingly, the instances of FAS that we had to solve were slightly harder now, particularly for the sample noise model, where strongly connected components of size up to emerged in the auxiliary graph. Still, our methodology was applied smoothly and allowed us to compute optimal rules.


# samples  100  1000  Mallows  
setting  theory  theory  theory  


all2all  84.95  84.95  85.14  85.15  85.15  85.16 
th10%  91.82  91.85  92.05  92.04  92.05  92.07 
th50%  88.21  88.21  88.39  88.38  88.39  88.40 
acc2%  86.31  86.31  86.51  86.51  86.52  86.52 
acc5%  88.19  88.20  88.41  88.41  88.42  88.42 

Table 3 shows the theoretical prediction and observed performance values of these rules (by applying the rules on 1000 simulated exams with Mallows students). The performance of the aggregation rules that were computed using the sample approximation of Mallows are already amazingly close to those for Mallows. For the rules that we computed using the sample approximation, it is almost impossible to distinguish them from the Mallowsoptimal ones in terms of performance.
A more refined graphical representation of these findings is given in Figure 5 (best viewed in color). Each plot contains a blue and a red cloud of points each corresponding to a single simulated exam with students. The blue points (respectively, red points) measure the performance of the optimal rule for the sample (respectively, sample) Mallows approximation versus the Mallowsoptimal rule. The blue cloud almost coincides with the diagonal in each plot, indicating an optimal approximation of the Mallowsoptimal rule. The red cloud is distinct (there is a negligible “intersection” of blue and red points only for the th10% performance objective) but very close. To realize how close the two clouds are, almost the whole cloud of points for Borda (from Figures 3(d), 3(e), and 3(f)) would be located outside the plot area of Figure 5 (if we attempted to plot it).
We conclude the presentation of our experimental results with a comparison of the type orderings of the optimal aggregation rules for Mallows and its sample and sample approximations; these are presented in Table 4. Therein, we can see that the optimal rule for the sample noise model according to the all2all performance objective is very close (but not identical) to the Mallowsoptimal rule. The optimal rule for the sample noise model is substantially different (this difference is more apparent in lower positions of the ordering which cannot be included here). An interesting characteristic of optimal rules for Mallows and its approximations is that the orderings of types are not monotonic. For example, type is always ahead of . This justifies our decision to study typeordering aggregation rules and ignore positional scoring rules; clearly, no positional scoring rule would come up with nonmonotonic orderings of types.



pos.  mallows  sample  sample 
approx.  approx.  


1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  

References
 [1] N. Alon. Ranking tournaments. SIAM Journal on Discrete Mathematics, 20(1):137–142, 2006.
 [2] F. Brandt, V. Conitzer, U. Endriss, J. Lang, and A. D. P. (Eds.). Handbook of Computational Social Choice. Cambridge University Press, 2016.

[3]
I. Caragiannis, G. A. Krimpas, M. Panteli, and A. A. Voudouris.
corank: an online tool for collectively deciding efficient rankings
among peers.
In
Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI)
, 2016, forthcoming.  [4] I. Caragiannis, G. A. Krimpas, and A. A. Voudouris. Aggregating partial rankings with applications to peer grading in massive online open courses. In Proceedings of the 14th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 675–683, 2015.
 [5] I. Caragiannis, A. D. Procaccia, and N. Shah. When do noisy votes reveal the truth? In Proceedings of the 14th ACM conference on Electronic commerce (EC), pages 143–160, 2013.
 [6] I. Caragiannis, A. D. Procaccia, and N. Shah. Modal ranking: A uniquely robust voting rule. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014.
 [7] F. Chierichetti and J. M. Kleinberg. Voting with limited information and many alternatives. SIAM Journal on Computing, 43(5):1615–1653, 2014.
 [8] L. de Alfaro and M. Shavlovsky. CrowdGrader: A tool for crowdsourcing the evaluation of homework assignments. In SIGCSE, pages 415–420, 2014.

[9]
C. KenyonMathieu and W. Schudy.
How to rank with few errors.
In
Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC)
, pages 95–103, 2007.  [10] C. Kulkarni, K. P. Wei, H. Le, D. Chia, K. Papadopoulos, J. Cheng, D. Koller, and S. R. Klemmer. Peer and self assessment in massive online classes. ACM Transactions on ComputerHuman Interaction, 20(6), 2013.
 [11] E. Law and L. von Ahn. Human Computation. Synthesis Lecture on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2011.
 [12] C. L. Mallows. Nonnull ranking models. Biometrika, 44:114–130, 1957.
 [13] C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller. Tuned models of peer assessment in MOOCs. In Proceedings of the 6th International Conference on Educational Data Mining (EDM), pages 153–160, 2013.
 [14] K. Raman and T. Joachims. Methods for ordinal peer grading. In Proceedings of the 20th ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 1037–1046, 2014.
 [15] N. B. Shah, J. K. Bradley, A. Parekh, M. Wainwright, and K. Ramchandran. A case for ordinal peerevaluation in MOOCs. In Neural Information Processing Systems (NIPS): Workshop on Data Driven Education, 2013.
 [16] T. Walsh. The PeerRank method for peer assessment. In Proceedings of the 21st European Conference on Artificial Intelligence (ECAI), pages 909–914, 2014.
Appendix A Computing the weights
We now elaborate on how to analytically compute the weight ; the following computations are implemented in Algorithm 1. Recall that is given by equation (6), i.e.,
and that the performance objective bivariate function indicates whether a correctly recovered pairwise relation between two students and (with ) should be accounted for or not. All perfomance objectives we consider in this paper can generically be described by such a function with when and for appropriate values of , and otherwise. In particular, all2all can be expressed with the tuple , th10% and th50% with the tuples and , and acc2% and acc5% with the tuples and .
Then, is equal to
(10)  
Now, recall that is given by equation (5) and is a univariate polynomial of degree . Hence, it can be written as
(11) 
where the coefficient with are computed by equation (5), and have been included for completeness in the first part of Algorithm 1.
The inner integral in equation (10) is computed as follows:
Using the fact that for , , and , we obtain
(12) 
Observe that the inner integral is also a univariate polynomial of degree and it can be written as
(13) 
where the coefficient with are computed by equation (12) at the second part of Algorithm 1.
Comments
There are no comments yet.