Aggregating partial rankings with applications to peer grading in massive online open courses

11/17/2014 ∙ by Ioannis Caragiannis, et al. ∙ 0

We investigate the potential of using ordinal peer grading for the evaluation of students in massive online open courses (MOOCs). According to such grading schemes, each student receives a few assignments (by other students) which she has to rank. Then, a global ranking (possibly translated into numerical scores) is produced by combining the individual ones. This is a novel application area for social choice concepts and methods where the important problem to be solved is as follows: how should the assignments be distributed so that the collected individual rankings can be easily merged into a global one that is as close as possible to the ranking that represents the relative performance of the students in the assignment? Our main theoretical result suggests that using very simple ways to distribute the assignments so that each student has to rank only k of them, a Borda-like aggregation method can recover a 1-O(1/k) fraction of the true ranking when each student correctly ranks the assignments she receives. Experimental results strengthen our analysis further and also demonstrate that the same method is extremely robust even when students have imperfect capabilities as graders. We believe that our results provide strong evidence that ordinal peer grading can be a highly effective and scalable solution for evaluation in MOOCs.


1 Introduction

Massive online open courses (MOOCs) such as Coursera and EdX have recently become a trend and have attracted significant funding from VCs and support from leading academics. Their vision is to use the Internet to provide (to huge numbers of students) an educational experience that is typical of courses targeted to small audiences in top-class universities. Whether MOOCs will become the next big business over the Internet strongly depends on whether they will satisfy the fundamental need for easy and cheap access to high-quality education without restrictions. An apparent bottleneck for their full deployment and success is the fact that assessment and grading by classical means is extremely costly. A typical approach is to use closed-type questions in exams or assignments so that grading can be done automatically. This is highly unsatisfactory when, as part of a course, one would like to evaluate the students’ ability to prove a mathematical statement, to express their critical thinking over an issue, or even to demonstrate their creative writing skills. Evaluating this ability is inherently a human computation task [14].

An approach that has been proposed is to outsource the grading task to the students participating in the exam or assignment themselves; for example, they can be required to grade (a small number of) their peers’ assignments as part of their own assignment [22]. Of course, allowing the students to grade using cardinal scores is risky; they are not experienced in assessing their peers’ performance in absolute terms (this is in contrast to the main assumption behind the reviewing systems that are used in academic conferences) and they may have strong incentives to assign low scores to everybody in order to increase their own relative success in the assignment. An alternative that sounds feasible is to ask each student to provide a ranking of a small number of her peers’ assignments and then compute a global ranking by merging the partial ones; this is known as ordinal peer grading (e.g., see [23, 24]). Can this global ranking be in accordance with the objective comparison of students in terms of their performance in the assignment? Which methods are necessary for this computation? And how accurate can this global ranking be? In this paper, we address these questions and provide both conceptual and technical answers.

Merging individual rankings into a global one is the main goal of voting rules from social choice theory, where a set of voters provide full rankings over the available alternatives and a voting rule has to transform this input into a winning alternative or an aggregate ranking of the alternatives. At first glance, ordinal peer grading seems to be a natural application area for classical voting theory. Interestingly, its particular characteristics deviate from those usually studied in the voting literature. First, each voter is also an alternative. This is a rare assumption in social choice, found in works that focus mostly on incentive issues (e.g., see [2, 13]). Second, the input consists of partial rankings over small subsets of alternatives. The closest such approach in social choice is known as preference elicitation [7] where simple queries are asked to each voter about their preferences; for example, in top-k elicitation [10], each voter provides the partial ranking of the k alternatives she likes the most. The (complexity) effects of using only partial rankings in voting have been studied under the possible and necessary winner problems (e.g., see [27]). An important characteristic of ordinal peer grading is that the partial rankings have the same size and that each assignment is given to the same number of graders. And finally, there is an objective way to assess the ordinal peer grading outcome by comparing it to the objective comparison of the students in terms of performance in the assignment. This is close in spirit to recent approaches that use voting in order to learn a ground truth [5, 6], such as a winning alternative or an underlying true ranking. In our work, we deviate from these studies as well since we aim to learn the ground truth only approximately. So, ordinal peer grading is a setting where ideas and analysis techniques from human computation, voting, and learning are blended together in novel ways.

In particular, our model uses a grading scheme that asks each student to rank the assignments of k other students. For fairness reasons, we restrict ourselves to grading schemes that distribute each assignment to exactly k students. Unlike recent studies [23, 24], we investigate the potential of applying ordinal peer grading exclusively, i.e., without involving any professionals in grading. We assume that there is an underlying true (strict) ranking of the assignments (the ground truth) and we would like to correctly recover as high a fraction of it as possible using input from the students. We have two scenarios that determine the input. In the first one, we assume that, after the students have submitted their assignments, the instructor announces indicative solutions and grading instructions. Here, we make the simplifying assumption that each student grades the assignments in her bundle consistently with the ground truth (perfect grading). In a second scenario that is also assumed in [23], we assume that grading is performed without any guidance by the instructor. Here, the natural assumption is that the quality of a student determines both her performance in the assignment and her grading ability. We have mostly focused on simple rank aggregation rules such as the adaptation of the classical Borda count [4], where the partial ranking provided by each grader is interpreted as follows: k points are given to the assignment ranked first, k-1 points to the one ranked second, and so on. The global ranking is then computed by ordering the assignments in decreasing order of these Borda scores. We have also considered more aggregation rules, which are described in detail in Sections 2 and 4.

Our technical contributions can be summarized as follows. In Section 3, we present a theoretical analysis of Borda when the partial rankings on input are consistent with the ground truth. We prove that using any way to distribute k assignments per student, Borda recovers correctly an expected fraction of 1-O(1/√k) of the pairwise relations in the ground truth. If the distribution of the assignments has some particularly desired simple structure, an even better guarantee of 1-O(1/k) is obtained. The independence of these results from the number of students is rather surprising. Our proofs exploit the beautiful theory of martingales in order to cope with dependencies between random variables that are involved in the analysis. To the best of our knowledge, this is the first application of martingales in social choice. We also present extensive experiments with Borda and other aggregation rules (in Section 4). Our findings further justify the robustness of Borda, even in the scenario of imperfect grading. For example, Borda is shown to recover more than of the ground truth by distributing assignments per student (with students having highly varying grading capabilities). Here, we borrow ideas from recent studies on voting and learning (e.g., [5]) and use noise models for the generation of random partial rankings whose distance from the ground truth depends probabilistically on the quality of the graders. En route, we provide some intuition about the problem (in Section 2). We conclude with a discussion of (the many) possible extensions of our work in Section 5.

2 Problem statement, terminology and notation

Consider a universe of n elements. A collection of subsets of the universe is called a grading scheme with parameters n and k (or (n,k)-grading scheme) if it consists of n subsets called bundles, each bundle has size k, and each element of the universe belongs to exactly k bundles. To see the relation to peer grading, we can view the elements of the universe as the papers of n students participating in an assignment. Each bundle contains k papers that will be graded by a distinct student. Of course, we require that no student will grade her own paper. This can be easily achieved by a matching computation. (Indeed, for every student, there are n-k bundles that do not contain her paper. Then, the bipartite graph that represents the information about the bundles that a student is allowed to grade is regular and, by Hall’s matching theorem, has a perfect matching. This matching can be used to assign bundles of papers to students.)
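The matching computation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that papers are numbered 0 to n-1, that paper i was written by student i, and it uses a plain augmenting-path search; the function name `assign_bundles` is hypothetical.

```python
def assign_bundles(bundles):
    """Give each student one bundle that does not contain her own paper.

    Assumption: paper i belongs to student i, and bundles[j] lists the
    papers in bundle j. Returns a dict mapping student -> bundle index.
    """
    n = len(bundles)
    owner = [None] * n  # owner[j]: student currently holding bundle j

    def try_assign(student, visited):
        # Augmenting-path step: look for a free bundle, or free one up
        # by re-assigning the student who currently holds it.
        for j in range(n):
            if student in bundles[j] or j in visited:
                continue
            visited.add(j)
            if owner[j] is None or try_assign(owner[j], visited):
                owner[j] = student
                return True
        return False

    for student in range(n):
        if not try_assign(student, set()):
            raise ValueError("no valid assignment of bundles exists")
    return {owner[j]: j for j in range(n)}


if __name__ == "__main__":
    # Seven 3-sized bundles over papers 0..6 (each paper in 3 bundles).
    bundles = [[0, 1, 2], [0, 3, 4], [0, 5, 6], [1, 3, 5],
               [1, 4, 6], [2, 3, 6], [2, 4, 5]]
    print(assign_bundles(bundles))
```

The recursive search is adequate for small instances; for very large n an iterative matching algorithm would be a more robust choice.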

Alternatively, we can represent the (n,k)-grading scheme with a bipartite graph which we will call the (n,k)-bundle graph. One side of the bipartition has n nodes and contains a distinct node for each element of the universe. The other side has n nodes too and contains a node for each bundle. An edge connects an element node with a bundle node if and only if the corresponding element belongs to the corresponding bundle. Clearly, an (n,k)-bundle graph is k-regular. Actually, every k-regular bipartite graph has the same number of nodes on both sides of the bipartition and can be used as an (n,k)-bundle graph.

A partial ranking associated with a bundle is simply a ranking of the elements that the bundle contains; it is undefined for elements not belonging to the bundle. A profile is simply the collection that contains the partial ranking for each bundle of the grading scheme. An aggregation rule takes as input a profile of partial rankings and computes a complete ranking of all elements. A typical example is the following rule that extends Borda count from classical voting theory. Each element gets a score from each appearance in a partial ranking. The Borda score of an element is then the sum of the scores from all partial rankings. Within each partial ranking, a score of k is given to the element that is ranked first, a score of k-1 to the element that is ranked second, and so on. The final complete ranking is computed by sorting the elements in decreasing order of their Borda scores. We will use the term Borda to refer to this aggregation rule. Even though one can think of several different ways to resolve ties, we simply ignore ties in our theoretical analysis (Section 3) and use uniformly random tie-breaking in our experiments (Section 4).
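The Borda rule described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: each partial ranking is a list with the best element first, a ranking of size k awards k points to its first element down to 1 point to its last, and ties are broken uniformly at random. The function name `borda_aggregate` is hypothetical.

```python
import random
from collections import defaultdict


def borda_aggregate(profile, rng=None):
    """Merge partial rankings (best element first) into a complete ranking.

    Returns (ranking, scores), where ranking lists the elements in
    decreasing order of Borda score.
    """
    rng = rng or random.Random(0)
    scores = defaultdict(int)
    for ranking in profile:
        k = len(ranking)
        for position, element in enumerate(ranking):
            scores[element] += k - position  # k, k-1, ..., 1
    # Shuffle first, then sort stably: equal scores end up in random order.
    elements = list(scores)
    rng.shuffle(elements)
    elements.sort(key=lambda e: -scores[e])
    return elements, dict(scores)


if __name__ == "__main__":
    profile = [["a", "b", "c"], ["b", "c", "d"],
               ["a", "c", "d"], ["a", "b", "d"]]
    ranking, scores = borda_aggregate(profile)
    print(ranking)  # ['a', 'b', 'c', 'd']
    print(scores)   # {'a': 9, 'b': 7, 'c': 5, 'd': 3}
```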

We have also considered another aggregation rule which we call Random Serial Dictatorship (RSD). The term is inspired by the well-known mechanism for house allocation markets [1]. A complete ranking is computed gradually starting from an initially empty one. In a first serial phase, the partial rankings are considered in a random order. When considering a partial ranking, we copy to the global one all the pairwise relations that do not contradict (i.e., do not form cycles with) relations copied earlier. When all partial rankings have been considered, the global partial ranking is augmented by the pairwise relations implied due to transitivity (e.g., the pairwise relations a ≻ b and b ≻ c copied from two partial rankings imply that a ≻ c as well). Then, we use a second random completion phase to complete the global ranking as follows. In each step, we pick a random pair of elements whose relation has not been decided so far. We make this decision randomly and update all pairwise relations that this decision and the existing ones imply due to transitivity. We continue this way until all pairwise relations have been decided.
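A compact sketch of the two RSD phases follows. It is one possible reading of the description, not the authors' Matlab code: the set of decided relations is kept transitively closed at all times (so a relation forms a cycle exactly when its reverse is already recorded), and the completion phase visits the pairs in a random order and skips those already decided. All names are hypothetical.

```python
import itertools
import random


def rsd_aggregate(elements, profile, rng=None):
    """Random Serial Dictatorship over partial rankings (best first)."""
    rng = rng or random.Random(0)
    better = {e: set() for e in elements}  # better[x]: elements ranked above x
    worse = {e: set() for e in elements}   # worse[x]: elements ranked below x

    def add_relation(a, b):
        # Record "a above b" together with everything transitivity implies.
        uppers = better[a] | {a}
        lowers = worse[b] | {b}
        for u in uppers:
            worse[u] |= lowers
        for v in lowers:
            better[v] |= uppers

    # Serial phase: consider the partial rankings in random order.
    order = list(profile)
    rng.shuffle(order)
    for ranking in order:
        for a, b in itertools.combinations(ranking, 2):  # a is ranked above b
            if a not in worse[b]:  # skip relations that would form a cycle
                add_relation(a, b)

    # Random completion phase: decide the remaining pairs at random.
    pairs = list(itertools.combinations(elements, 2))
    rng.shuffle(pairs)
    for a, b in pairs:
        if b in worse[a] or a in worse[b]:
            continue  # already decided
        if rng.random() < 0.5:
            a, b = b, a
        add_relation(a, b)

    # In a complete strict order, position = number of elements ranked above.
    return sorted(elements, key=lambda e: len(better[e]))


if __name__ == "__main__":
    bundles = [[0, 1, 2], [0, 3, 4], [0, 5, 6], [1, 3, 5],
               [1, 4, 6], [2, 3, 6], [2, 4, 5]]
    print(rsd_aggregate(list(range(7)), bundles))  # [0, 1, 2, 3, 4, 5, 6]
```

In the usage example every pair of elements shares a bundle and all partial rankings agree with the order 0, 1, ..., 6, so the serial phase alone decides every relation.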

We are now ready to give the statement of the problem that we consider more formally. In general, we would like to use the grading schemes and aggregation rules in order to learn an unknown ground truth, i.e., a ranking of the elements representing their relative quality. A first question is whether the ground truth can be learnt with certainty when the partial rankings are consistent with it. In other words, we ask for an order-revealing grading scheme (and a corresponding order-revealing bundle graph) which defines the bundles in such a way that the partial rankings contain enough information so that all pairwise relations in the ground truth can be recovered with certainty. Unfortunately, order-revealing grading schemes have severe limitations. In particular, they should have the following overly demanding property: for every pair of elements, there should be some bundle that contains both of them. (This property essentially asks for a k-regular bipartite graph of diameter at most 3. Our order-revealing bundle graphs are known as Moore bipartite graphs, i.e., they are the smallest bipartite graphs of degree at least k and of diameter at most 3; see [18] for a detailed survey on the degree-diameter problem.) Indeed, consider an order-revealing grading scheme and assume that there are two elements a and b such that no bundle contains both of them. Now, consider a ranking that has a and b in the first two positions and a second ranking that differs from the first only in the order of a and b. Clearly, the partial rankings within the bundles are identical in both cases and, as a result, there is no way to identify which of the two rankings is the ground truth. Notice that the above property implies that RSD combined with order-revealing grading schemes recovers the ground truth with certainty (and does not have to run the random completion phase). This is not the case for Borda unless any two elements co-exist in the same number of bundles (like in the bundle graphs constructed below).

Clearly, the maximum number of elements that share a bundle with a given element is k(k-1), and this number should be at least n-1 if we want the element to share some bundle with every other element. This immediately implies that order-revealing grading schemes should have bundles of size Ω(√n). In sharp contrast to this disappointing observation, we will see that the goal of approximate order-revealing grading schemes is a very feasible one and leads to effective and scalable grading solutions in theory and practice. Interestingly, many of our findings make use of bundle graphs that are order-revealing; this is why we have included the following explicit construction of order-revealing grading schemes for particular values of the parameters n and k here.

Let q be a prime and consider a universe with n = q^2+q+1 elements. We will construct a grading scheme in which each bundle has size exactly k = q+1. Observe that these values for n and k satisfy the lower-bound condition mentioned above with equality. Rename the elements of as and define the bundles of as follows:

  • ;

  • For , ;

  • For and , .
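A scheme with these parameters can be built for any prime q as the incidence structure of a projective plane of order q, which has the same three-part shape as the list above (one bundle, then q bundles, then q^2 bundles). The sketch below is a standard construction offered as an illustration; the element numbering and the helper names are assumptions and need not coincide with the authors' labelling.

```python
from itertools import combinations


def projective_plane_bundles(q):
    """Bundles of a (q*q+q+1, q+1)-grading scheme for a prime q.

    Assumed numbering: element 0 is a special element, elements 1..q are
    "slope" elements, and the remaining q*q elements form a q-by-q grid.
    """
    def slope(s):
        return 1 + s

    def grid(x, y):
        return 1 + q + x * q + y

    # One bundle with the special element and all slope elements.
    bundles = [[0] + [slope(s) for s in range(q)]]
    # q bundles: the special element plus one grid column.
    for x in range(q):
        bundles.append([0] + [grid(x, y) for y in range(q)])
    # q*q bundles: slope element s plus the grid line y = s*x + t (mod q).
    for s in range(q):
        for t in range(q):
            line = [grid(x, (s * x + t) % q) for x in range(q)]
            bundles.append([slope(s)] + line)
    return bundles


def is_order_revealing(n, bundles):
    """True if every pair of the n elements shares at least one bundle."""
    covered = set()
    for bundle in bundles:
        covered.update(combinations(sorted(bundle), 2))
    return len(covered) == n * (n - 1) // 2


if __name__ == "__main__":
    for q in (2, 3, 5):
        n = q * q + q + 1
        print(q, n, is_order_revealing(n, projective_plane_bundles(q)))
```

For q = 2 this yields seven bundles of size three over seven elements.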

An order-revealing -bundle graph is depicted in Figure 1; it represents the following grading scheme . The underlying universe is and has the following seven -sized bundles: , , , , , , and . The numbering of nodes in set indicates an assignment of bundles to students for grading and, hence, nodes with the same number are not adjacent.

Figure 1: An order-revealing -bundle graph.

We prove the correctness of our construction using basic facts from number theory.

Lemma 1.

The above construction yields an order-revealing grading scheme.

Proof.

Clearly, the above grading scheme consists of bundles of size . Also, observe that each element belongs to exactly bundles. Indeed, element belongs to sets and for . Element belongs to sets and for . Element belongs to sets and such that .

We complete the proof by showing that for every pair , there exists a bundle that contains both and . This is clearly true if one of and is or if both and belong to or to some , for . So, there are two more cases to be considered. First, assume that for and . Then, there exists an such that and, hence, both and belong to set . It remains to consider the case where and with . Then, there exists a unique such that . This follows from the facts that is prime and that any linear equation of the form has solutions if and only if divides . Now, set and observe that and . Hence, both and belong to and the proof is complete. ∎

We now relax our requirements and seek an approximate order-revealing grading scheme. Our aim is to use a bundle graph of simple structure and of very low (i.e., independent of n) degree and still be able to correctly recover a high fraction of the pairwise relations in the ground truth. Our grading schemes will be randomized in the sense that we will always randomly permute the elements before associating them with the element nodes of the bundle graph; we refer to this random assignment as the bijection (or permutation). Sometimes, in our experiments, the bundle graphs we use are themselves random. Much of our work (i.e., our theoretical analysis in Section 3 as well as the first among the two sets of experiments reported in Section 4) has focused on the scenario where the partial rankings are consistent with the ground truth. Our second set of experiments in Section 4 uses partial rankings that deviate from the ground truth according to a noise model.

3 Analysis of Borda

In this section, we present our theoretical results. We assume that the -bundle graph has and . These are technical assumptions that do not affect the applicability of our results; recall that, in practice, we would like and to be huge and very small, respectively. Surprisingly, Borda correctly recovers a very large fraction of the ground truth as the next statement suggests.

Theorem 2.

When Borda is applied on partial rankings that are consistent with the ground truth, the expected fraction of correctly recovered pairwise relations is at least 1-O(1/k) when the (n,k)-bundle graph has girth at least 6, and at least 1-O(1/√k) in general.

We prove this theorem by relating the performance of Borda only to the degree and on a quantity that characterizes the structure of the bundle graph. For the definition of , we need some notation; this will be heavily used throughout this section. Given two nodes of , we use to denote their common neighbourhood in , i.e., . Observe that since is -regular. Also, we define the quantity as . Then,

where the sum runs over all ordered pairs of

in .

Intuitively, the quantity is small when, on average, the common neighbourhood between pairs of nodes is small. The extreme case is when the common neighbourhood consists of a single node; in this case, the graph has girth at least 6 (the girth of a graph is the length of its smallest cycle). The next lemma provides upper bounds on this quantity that will be useful later.

Lemma 3.

For every -regular bipartite graph , . Every -regular bipartite graph of girth at least has .

Proof.

Consider two nodes of an arbitrary -regular bipartite graph. We will show that is at most . Consider the sets of nodes , , and , and the edges connecting these nodes to . Each edge from a node of to a node contributes to the quantity , which can be up to . Hence, each edge from a node of to a node contributes at most to the quantity and there are such edges. Similarly, each edge from a node of and to a node contributes to the quantity , which can be up to . Hence, each edge from a node of or to a node contributes at most to the quantity and there are such edges. So, is bounded by times the total contributions to quantities by the edges between and , i.e., by .

Now, assume that the graph has girth at least ; this means that for any node , otherwise would be in a -cycle with either or . We will show that . Each node can be adjacent to either one node of or (exclusive) to at most one node of and at most one node of . Among the nodes in , denote by the ones that are adjacent to one node from and to one node from . So, any node that is among the neighbours of in or belongs to has . Any node among the remaining nodes of has . Hence, . The second part of the lemma follows by observing that the quantity is the number of nodes of with which cannot be higher than . ∎

The important step in the proof of Theorem 2 is to focus on two elements at given ranks (positions) in the ground truth and to bound from above the probability that the difference in their Borda scores is inconsistent with their rank difference. This requires taking care of several subtle dependencies among the random variables involved. We will do so by exploiting the beautiful theory of martingales and a well-known tail inequality about them. The necessary background from martingale theory is presented below; the interested reader can refer to the textbooks [19] and [20] for an introduction to martingales and their applications.

Definition 4.

A sequence of random variables $Z_0, Z_1, \ldots$ is a martingale with respect to a second sequence of random variables $X_1, X_2, \ldots$ if for every $i \ge 0$, it holds that $\mathrm{E}[Z_{i+1} \mid X_1, \ldots, X_i] = Z_i$.

The next definition provides a general way to define martingales associated with any random variable and was first used by Doob [8].

Definition 5.

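Consider a random variable $W$ and a sequence of random variables $X_1, \ldots, X_\ell$. Then, the sequence of random variables $Z_0, Z_1, \ldots, Z_\ell$ such that $Z_0 = \mathrm{E}[W]$ and $Z_i = \mathrm{E}[W \mid X_1, \ldots, X_i]$ for every $i = 1, \ldots, \ell$ is a martingale, called a Doob martingale.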

We can now present a powerful tail inequality for martingales that is known as Azuma-Hoeffding inequality (see Azuma [3] and Hoeffding [12]).

Lemma 6 (Azuma-Hoeffding inequality).

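Let $Z_0, Z_1, \ldots, Z_\ell$ be a martingale with $|Z_i - Z_{i-1}| \le c_i$ for $i = 1, \ldots, \ell$. Then, for all $t \ge 0$, it holds that $\Pr[Z_\ell - Z_0 \ge t] \le \exp\left(-t^2 / \left(2 \sum_{i=1}^{\ell} c_i^2\right)\right)$.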

We are now ready to show that the probability that the Borda score of a high-rank element is larger than the Borda score of a low-rank element is small. Importantly, it turns out that this probability decreases exponentially in terms of the rank difference. We will first study such phenomena under particular conditions on our bijection .

Lemma 7.

Let , and consider the two elements with ranks in the ground truth. Let be the random variable denoting the difference of the Borda score of minus the Borda score of and let be the event that and . Then,

and

Proof.

We begin the proof by computing the expectation of the Borda scores. Element gets one point for each bundle it belongs to plus one additional point for each appearance of an element with rank higher than in the bundles belongs to. Assuming that and , there are appearances of in the bundles of and appearances of elements different than and ; each of them has probability to have higher rank than . Hence, the expected Borda score of element is . Similarly, element gets one point for each bundle it belongs to plus one additional point for each appearance of an element with rank higher than . There are appearances of elements different than and in bundles of and each of them has rank higher than with probability . Hence, the expected Borda score of element is , and the expectation of the difference is indeed

Given , define to be the set of nodes in that are at distance exactly from or (not including and ); notice that . Now, consider an arbitrary ordering of the nodes of and let be the random variable denoting the rank of the element . Using the random variables and the random variable , we define the Doob martingale such that and (hence, given , ). The next technical lemma bounds the difference for .

Lemma 8.

For every , it holds that .

Proof.

Throughout this proof, all random variables and probabilities are conditioned on the event , even if, in order to simplify notation, we do not explicitly write so.

For every node , denote by the number of common neighbours between , , and . We can now express using the following observations: the Borda score difference

  • increases for each appearance of element in the same bundle with ,

  • for each appearance of element in a bundle containing but not provided that the rank of is higher than , and

  • for each appearance of an element in a bundle containing both and provided that the rank of is between and , and

  • decreases for each appearance of element in a bundle containing but not provided that the rank of is higher than .

Using our notation and , we have

Denoting by the sequence , we have that the difference is

(1)

Once the values of are determined, let and be the number of available ranks from that are between and and higher than , respectively. Hence, for , we have

and for , we have

Now, (1) yields

The second and fourth parenthesis in the above expression are obviously between and . Recall that and . Also, by the definition of , , for every . Combined with our assumption that , these properties imply that the first parenthesis is between and , and the third one is between and . The lemma follows since . ∎

Lemma 7 then follows by applying the Azuma-Hoeffding inequality (Lemma 6) with and using Lemma 8 to bound the difference . ∎

The proof of Theorem 2 can now be completed using Lemmas 3 and 7.

Proof of Theorem 2.

Consider the pair of elements with true ranks and so that . The correct pairwise relation between the two elements will be recovered when the Borda score of the low-rank element is higher than the Borda score of the high-rank one (there is the additional case where the two elements are tied and the tie is resolved in favour of the low-rank element but we will ignore this case; this will only make our result stronger). Again, will be the random variable denoting the difference between the Borda scores of the low- and high-rank elements. Then, by Lemma 7 the probability that the relation between the elements with ranks and is correctly recovered is

where , , and . Now, denoting the expected number of correctly recovered pairwise relations by , we have

We will estimate the (Gaussian) integral using the following claim.

Claim 9.

Let and . Then, .

Proof.

Denote by the error function. Then, we can verify by tedious calculations that

where the inequality follows since the error function takes values in when . ∎

Now, we use Claim 9 and the facts and to obtain

Now, the theorem follows by Lemma 3. Recall that is at most for every -regular bipartite graph and at most when has girth at least . Using the assumption that , we obtain that the rightmost parenthesis in the above expression becomes at least and , respectively. ∎

4 Experimental evaluation

We now describe two sets of experiments that we have conducted. (All experiments presented in this section have been conducted on an Intel 12-core i7 machine with 32 GB of RAM running Windows 7. Our methods have been implemented in Matlab R2013a.) In the first one, we have studied perfect grading with Borda and RSD. We have considered three different types of bundle graphs. The first type is that of random k-regular bipartite graphs. We build these graphs by picking k perfect matchings in the complete bipartite graph K_{n,n} as follows. For each node on, say, the upper node side (consider the graph with a bipartition into an upper and a lower set of nodes, as in Figure 1), we select one edge among its incident ones uniformly at random. We remove this edge from the complete bipartite graph and continue with the remaining nodes; this defines a random perfect matching. We repeat the above procedure k times. If a node at the upper side becomes isolated before the completion of the above procedure, we repeat from scratch. Otherwise, the set of edges that have been removed constitutes the bundle graph. The second type of graphs consists of many components of small girth-6 graphs. For k = q+1, where q is a prime, we use the k-regular bipartite graph with q^2+q+1 nodes per side whose construction is described in Section 2 and which was proved to be order-revealing in Lemma 1. The bundle graph consists of multiple disconnected copies of this graph. Similarly, the third type of bundle graphs contains copies of the complete bipartite graph K_{k,k} (possibly containing one small non-complete k-regular bipartite graph if k does not divide n). The selection of highly disconnected bundle graphs is intentional; these graphs are in a sense extreme (within their category) and can challenge our methods.
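The random bundle graphs of the first type can be generated along the lines below. This is a sketch of one reading of the procedure rather than the authors' Matlab code: within each round, an upper node chooses uniformly among its unused edges whose lower endpoint is still unmatched in that round (so that each round is a perfect matching), and the whole construction restarts whenever a node has no such edge. Treating upper nodes as elements and lower nodes as bundles, as well as all names and the restart cap, are assumptions.

```python
import random


def random_regular_bundle_graph(n, k, rng=None, max_restarts=10_000):
    """Return n bundles (lists of elements) of a random k-regular bundle graph."""
    rng = rng or random.Random(0)
    for _ in range(max_restarts):
        # remaining[u]: lower nodes still joined to upper node u by an unused edge
        remaining = [set(range(n)) for _ in range(n)]
        bundles = [[] for _ in range(n)]  # bundles[v]: elements of lower node v
        failed = False
        for _ in range(k):                 # one random perfect matching per round
            matched = set()                # lower nodes used in this round
            for u in range(n):
                choices = sorted(remaining[u] - matched)
                if not choices:            # node u is stuck: start from scratch
                    failed = True
                    break
                v = rng.choice(choices)
                remaining[u].discard(v)
                matched.add(v)
                bundles[v].append(u)
            if failed:
                break
        if not failed:
            return bundles
    raise RuntimeError("could not build the graph within max_restarts attempts")


if __name__ == "__main__":
    for bundle in random_regular_bundle_graph(12, 3, random.Random(1)):
        print(bundle)
```

Each successful run yields n bundles of k distinct elements in which every element appears exactly k times.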

Table 1 depicts the data (percentage of correctly recovered pairwise relations) from the execution of Borda and RSD on distinct triplets of graph type and values for the parameters n and k. (In all experiments reported here, n equals or is very close to . This is because the results are essentially identical when significantly higher values of n are used (up to ) and since the value of has allowed us to complete our experiments in a reasonable time frame.) The data in the column labelled “random k-regular” show the average performance of Borda and RSD using random bundle graphs. A different random permutation is used each time in order to assign elements to nodes. For graphs of the second and third type, one graph is used for each pair of values for n and k. For example, the data entries in the columns labelled “girth-6” and “copies of K_{k,k}” in the line with and correspond to the performance of Borda and RSD on a girth-6 bundle graph which consists of copies of the (7,3)-bundle graph of Figure 1, and on a third-type graph that consists of copies of K_{3,3} and one more 3-regular graph with nodes per side. Again, the data are average performance values from executions; in each execution, a different random assignment of the elements to the nodes of the bundle graph is used.

Table 1: Performance of Borda and RSD with perfect grading on different bundle graphs of similar size. For each of the three graph types (random k-regular, girth-6, copies of K_{k,k}), the table reports one Borda column and one RSD column.
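The performance measure reported in Table 1, the fraction of correctly recovered pairwise relations, and the perfect-grading input can be computed as in the following sketch. The function names are hypothetical and rankings are assumed to be lists with the best element first.

```python
from itertools import combinations


def perfect_profile(bundles, ground_truth):
    """Partial rankings that order each bundle exactly as the ground truth does."""
    rank = {e: i for i, e in enumerate(ground_truth)}
    return [sorted(bundle, key=rank.__getitem__) for bundle in bundles]


def fraction_recovered(ground_truth, output):
    """Fraction of pairs that the output ranking orders as in the ground truth."""
    position = {e: i for i, e in enumerate(output)}
    pairs = list(combinations(ground_truth, 2))  # (a, b): a is truly above b
    correct = sum(1 for a, b in pairs if position[a] < position[b])
    return correct / len(pairs)


if __name__ == "__main__":
    truth = [0, 1, 2, 3]
    print(perfect_profile([[3, 0, 2], [1, 3, 2]], truth))  # [[0, 2, 3], [1, 2, 3]]
    print(fraction_recovered(truth, [0, 1, 2, 3]))          # 1.0
    print(fraction_recovered(truth, [3, 2, 1, 0]))          # 0.0
```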

The results for Borda complement our theoretical analysis from Section 3. Indeed, the Borda-columns with bundle graphs of the second and third type indicate that the fraction of correctly recovered pairwise relations follows patterns of and , respectively. Interestingly, the constants hidden in the notation are significantly smaller than the theoretical constants and , respectively. The results from the execution of Borda on random bundle graphs show a pattern of as well, albeit with a slightly higher constant hidden in the notation. We believe that this can be proved by extending our analysis in Section 3. Even though we have not managed to prove that the quantity is for these graphs, we strongly believe that this is the case.

RSD has poor performance on bundle graphs of the second and third type. This can be easily explained by recalling that these bundle graphs consist of small connected components. Even though all pairwise relations between elements assigned to nodes of the same component are correctly recovered, the vast majority of the pairwise relations are between elements that are assigned to different components. Such a relation is decided at random in the completion phase, so the probability that it will be recovered correctly is only 1/2. This explains the small percentages in the second and third RSD-columns.

In contrast, the first RSD-column (for random bundle graphs) shows a very interesting pattern. RSD is clearly worse than Borda for values of up to and becomes better as increases further. Actually, this is more apparent in Figure 2 where Borda and RSD are compared in -bundle graphs for all values of from up to (and ). Each data point in Figure 2 corresponds to the average performance among executions. Here, we can again recognize the pattern for Borda that was observed in Table 1 and we further conjecture an even better pattern of for RSD. Proving such a statement formally seems to be a challenging task.

Figure 2: Borda vs. RSD with perfect grading and bundle size ranging from to .

In a second set of experiments, we have studied imperfect grading. Now, we do not assume that the partial rankings are consistent to the ground truth any more. Instead, we have implemented generators of noisy rankings that may differ from the ground truth. In particular, we assume that each student has a quality that affects her position in the ground truth but also her ability to grade. First, the ground truth is the ranking of the elements in decreasing order of quality. Then, the ability of a student to rank the elements in a bundle depends on her quality and is modelled by the following process. For every pair of elements and in the bundle that is ranked as in the ground truth, decide the correct pairwise relation with probability