Large-Scale Educational Question Analysis with Partial Variational Auto-encoders

03/12/2020 · Zichao Wang et al. · University of Cambridge, Microsoft, Universität Wien, Rice University

Online education platforms enable teachers to share a large number of educational resources, such as questions, to form exercises and quizzes for students. With large volumes of such crowd-sourced questions, quantifying their properties is of great importance to enable both teachers and students to find high-quality and suitable resources. In this work, we propose a framework for large-scale question analysis. We utilize a state-of-the-art Bayesian deep learning method, the partial variational auto-encoder, to analyze real-world educational data, and we develop novel objectives to quantify question quality and difficulty. We apply our proposed framework to a real-world cohort with millions of question-answer pairs from an online education platform. Our framework not only demonstrates promising results in terms of statistical metrics but also agrees closely with domain expert evaluation.




1 Introduction

Online education platforms are transforming current education systems by providing new opportunities such as democratizing high-quality educational resources and personalizing learning experiences. In recent years, many online education platforms have been developed, in particular those that crowd-source a large volume of questions and exercises. The availability of such questions and exercises is a key advantage of these platforms:

Students can utilize them to learn and practice, while teachers can utilize them to customize quizzes to best understand and improve students' learning. All of these potentially lead to improved learning outcomes. In this work, we focus on educational resources in the form of multiple-choice questions, which are among the most common quiz formats in online education.

With such a massive pool of crowd-sourced questions, choosing which ones to use is challenging because both teachers and students have limited time. Understanding the quality and difficulty of the questions helps teachers and students select which ones to use. Requiring human experts to manually provide quality and difficulty labels for all the questions, which may number in the millions, is impractical; an AI solution that automatically obtains these insights is therefore desirable.

Therefore, we aim to develop a machine learning solution for large-scale online educational data analysis that provides insight into the difficulty and quality of each question. This task involves several challenges. First, online educational data is massive: both the number of questions and the number of students are extremely large. Second, the data exhibits severe missingness, since each student can answer only an extremely small fraction of all available questions. Lastly, we need to design objectives to quantify and extract the desired insights, such as question quality and difficulty. Overall, we need a solution that is efficient, handles highly sparse data, and automatically acquires educationally meaningful insights about questions.

In this work, we use real-world online educational data in the form of students’ answers to multiple-choice questions and develop a machine learning framework to analyze the difficulty and quality of each question. We briefly summarize our framework below:

  • We develop a novel framework for educational question analysis based on the partial variational auto-encoder (p-VAE) [ma2018partial, eddi] to efficiently handle partially observed data at a large scale. p-VAE models students' existing answers and predicts their potential answers to unseen questions in a probabilistic manner.

  • We design a novel information-theoretic metric to quantify the quality of each question based on the observed data and p-VAE’s predictions. We also define a difficulty metric to quantify the difficulty of each question.

  • We evaluate our results not only using standard quantitative machine learning metrics but also with human experts. We empirically show that our framework is able to identify the quality and difficulty of the questions as consistently as human experts.

2 Cohort

We analyze data from a real-world online education platform. This platform offers crowd-sourced multiple-choice questions to students from primary to high school (roughly between 7 and 18 years old). Each question has 4 possible answer choices, among which exactly one is correct. Currently, the platform mainly focuses on math questions. Figure 1 shows an example question from the platform. We use the data collected from the most recent school year (September 2018 to May 2019). We organize the data in matrix form, where each row represents a student and each column represents a question. Each entry contains a number representing a student's answer to a question (i.e., 0 represents A, 1 represents B, etc.). We also know the correct answer to each question; thus, we can obtain binary-valued information on whether a student answered a question correctly (0 for a wrong answer, 1 for a correct answer). Each student has answered only a tiny fraction of all questions, and hence the matrix is extremely sparse. We therefore removed questions with fewer than 100 answers and students who answered fewer than 100 questions. When a student has multiple answer records for the same question, we keep the latest record. These preprocessing steps lead to a final data matrix consisting of roughly 6.3 million answer records from 27,219 students (rows) over 13,369 questions (columns). The final data matrix remains highly sparse, with only about 1.7% of all entries observed.
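The filtering step described above can be sketched as follows. A single pass of dropping sparse questions can re-expose students who now fall below the answer threshold, so this sketch repeats the filtering until both thresholds hold; the iteration scheme and the dense-with-NaN data layout are our assumptions, not details from the paper.

```python
import numpy as np

def filter_matrix(answers, min_per_question=100, min_per_student=100):
    """Repeatedly drop sparse rows/columns until both thresholds hold.

    `answers` is a dense (students x questions) array with np.nan marking
    unanswered entries; the thresholds of 100 follow Section 2.
    Returns the filtered matrix plus the surviving row/column indices.
    """
    students = np.arange(answers.shape[0])
    questions = np.arange(answers.shape[1])
    while True:
        observed = ~np.isnan(answers)
        keep_q = observed.sum(axis=0) >= min_per_question
        keep_s = observed.sum(axis=1) >= min_per_student
        if keep_q.all() and keep_s.all():
            return answers, students, questions
        answers = answers[keep_s][:, keep_q]
        students, questions = students[keep_s], questions[keep_q]
```

At the paper's scale a sparse representation (e.g., a COO list of answer records) would be preferable; the dense layout here only keeps the sketch short.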

Figure 1: An example question from the education platform where the data we analyze is collected.

The dataset also contains additional metadata about questions. Each question may be linked to one or more topics. Each topic covers an area of mathematics, and the topics are hierarchically organized into levels of increasing granularity. A question may also belong to one or more schemes. A scheme divides the school year into a series of topic units. Within each topic unit there are two quizzes: one assigned at the end of the topic unit and another assigned three weeks later. Teachers may change the order and length of topic units.

3 Method

To gain insights into such real-world educational data, we first need a model that predicts the missing data (questions students did not answer) with uncertainty estimation. We then need to design different objectives to quantify question quality and difficulty.

We formulate the first step above as a missing data imputation (a.k.a. matrix completion) problem. We have a data matrix $X$ of size $N \times M$, where $N$ is the total number of students and $M$ is the total number of questions. Each entry $x_{nm}$ can be either binary, indicating whether student $n$ has answered question $m$ correctly, or categorical, giving the answer choice that the student selected. The data matrix is only partially observed; we denote the observed part of the data matrix as $X_O$. We then aim to predict the missing entries as accurately as possible in a probabilistic manner. To this end, we use the partial variational auto-encoder (p-VAE) [ma2018partial], a state-of-the-art method for this imputation task.

The second step is to quantify the quality and difficulty of each question based on p-VAE's predictions. Specifically, we quantify question quality by measuring the value of information that each question carries, using an information-theoretic objective. We can also use the completed matrix, with missing entries replaced by p-VAE's predictions, to gain insight into question difficulty. We present these two steps in detail in the remainder of this section.

3.1 Partial Variational Auto-encoder (p-VAE) for Students’ Answers Prediction

p-VAE is a deep latent variable model that extends traditional VAEs [kingma2013auto, Rezende2014StochasticBA, zhang2017advances]. VAEs assume that the data matrix $X$ is generated from a set of local latent variables $z_n$:

$$p(X) = \prod_{n=1}^{N} \int p_\theta(x_n \mid z_n)\, p(z_n)\, dz_n,$$

where $x_n$ is the $n$-th student's answers and $x_{nm}$ is the $n$-th student's answer to the $m$-th question. We use a deep neural network for the generative model $p_\theta(x_n \mid z_n)$ because of its expressive power. Of course, $X$ contains missing entries because each student answers only a small fraction of all questions. Unfortunately, VAEs can only model fully observed data. To model partially observed data, p-VAE extends traditional VAEs by exploiting the fact that, given $z_n$, the likelihood $p_\theta(x_n \mid z_n)$ is fully factorized. Thus, the unobserved data entries can be predicted given the inferred $z_n$'s. Concretely, p-VAE optimizes the following partial evidence lower bound (ELBO):

$$\log p(X_O) \ge \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z_n \mid x_{n,O})}\left[ \log p_\theta(x_{n,O} \mid z_n) \right] - \mathrm{KL}\left( q_\phi(z_n \mid x_{n,O}) \,\|\, p(z_n) \right),$$

which has the same form as the ELBO for VAEs but is evaluated only over the observed part of the data.
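As a minimal numerical sketch, a one-sample estimate of this partial ELBO only needs the observation mask applied to the reconstruction term; the `decode` callable stands in for p-VAE's generative network and is our own hypothetical interface, not the paper's code.

```python
import numpy as np

def partial_elbo(x, mask, mu, log_var, decode, rng=None):
    """One-sample Monte Carlo estimate of the partial ELBO.

    x           : (N, M) binary answers (values at unobserved entries are ignored)
    mask        : (N, M), 1 for observed entries, 0 for missing
    mu, log_var : (N, D) Gaussian posterior parameters from the encoder
    decode      : maps (N, D) latent samples to (N, M) Bernoulli means
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Reparameterized sample z ~ q(z | x_O).
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    p = np.clip(decode(z), 1e-6, 1 - 1e-6)
    # Reconstruction term: Bernoulli log-likelihood over observed entries only.
    log_lik = (mask * (x * np.log(p) + (1 - x) * np.log(1 - p))).sum()
    # KL(q(z | x_O) || N(0, I)) in closed form for diagonal Gaussians.
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var).sum()
    return log_lik - kl
```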

The challenge is to approximate the posterior of the $z_n$'s using a partially observed data vector. p-VAE uses a set-based inference network in the style of set functions [zaheer2017deep, qi2017pointnet], where the input is the observed subset of answers for a student. The approximate posterior $q_\phi(z \mid x_O)$ is assumed to be Gaussian; concretely, the mean and variance of the posterior of the latent variable are inferred as

$$(\mu, \sigma^2) = f\!\left( g\!\left( \{ s_m \}_{m \in O} \right) \right),$$

where we have dropped the student index for notational simplicity. Here, $s_m$ is the observed answer value $x_m$ augmented by a learned location embedding of question $m$; $g$ is a permutation-invariant transformation, such as summation, which outputs a fixed-size vector; and $f$ is a regular feedforward neural network. In this paper, we construct $s_m$ from $x_m$ and two learnable per-question parameter vectors that embed the identity of each question.

Note that, in p-VAE, some parameters have natural interpretations. For example, the per-question parameters can collectively be interpreted as an embedding of each question, and the per-student latent variable can be interpreted as an embedding of each student.
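The set-based encoder above can be sketched as follows. The exact rule for combining an answer value with its question embedding is our assumption (the paper only specifies that two learnable per-question parameters are involved); the key property the sketch demonstrates is permutation invariance of the aggregation.

```python
import numpy as np

def encode_partial(x_obs, idx, E, b, f):
    """PointNet-style encoding of a partially observed answer vector.

    x_obs : observed answer values for one student, shape (K,)
    idx   : question indices of those answers, shape (K,)
    E, b  : (M, D) learnable per-question embedding parameters (assumed form)
    f     : feedforward net mapping the aggregated code to posterior parameters
    """
    # Combine each observed value with its question's location embedding
    # (the multiplicative/additive form here is an illustrative assumption).
    s = E[idx] * x_obs[:, None] + b[idx]
    c = s.sum(axis=0)  # permutation-invariant aggregation over the observed set
    return f(c)        # -> e.g. (mu, log_var) of the Gaussian posterior
```

Because summation ignores the order of the observed answers, any permutation of the same observed set yields the same posterior parameters.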

3.2 Question Analysis Using Predicted Student Performance

Using the student performance predicted by p-VAE, we design a novel information-theoretic objective to quantify question quality and use statistical tools to analyze question difficulty.

Question Quality Quantification.

Working closely with an education expert, we found that high-quality questions are considered to be those that best differentiate students' abilities. When a question is too simple, almost all students answer it correctly. When a question is badly formulated or too difficult, all students provide incorrect answers or random guesses. In either case, the question neither helps the teacher gain insight into students' abilities nor helps students learn well. Thus, high-quality questions are the ones that can differentiate students' abilities.

We thus formulate the following information-theoretic objective to quantify the quality of question $j$:

$$Q_j = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{p(x_{ij} \mid x_{i,O})}\, \mathrm{KL}\left( q(z_i \mid x_{ij}, x_{i,O}) \,\|\, q(z_i \mid x_{i,O}) \right), \tag{2}$$

where $j$ is the question index and $x_{ij}$ is the $i$-th student's answer to the $j$-th question, which can be either binary (indicating whether the student answered correctly) or categorical (the student's answer choice for this question). $z_i$ is the latent embedding of student $i$; the $i$-th student's ability can be determined by the student's possible performance on all questions, which can be inferred given $z_i$.

This objective measures the information gained about a student's ability by conditioning on the answer to question $j$. Thus, when $Q_j$ is large, the question is more informative for differentiating student ability, and it is therefore considered a high-quality question.

In practice, we compute Eq. 2 with Monte Carlo integration, using samples $\hat{x}_{ij}^{(k)} \sim p(x_{ij} \mid x_{i,O})$ drawn from p-VAE's predictive distribution [eddi, gong2019icebreaker]:

$$Q_j \approx \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathrm{KL}\left( q(z_i \mid \hat{x}_{ij}^{(k)}, x_{i,O}) \,\|\, q(z_i \mid x_{i,O}) \right), \tag{3}$$

where $K$ denotes the number of Monte Carlo samples. The KL term can be computed easily in closed form thanks to the Gaussian assumptions in VAEs, i.e., both $q(z_i \mid \hat{x}_{ij}^{(k)}, x_{i,O})$ and $q(z_i \mid x_{i,O})$ are Gaussian distributions.

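A sketch of this Monte Carlo estimator for a single student follows. The `sample_answer` and `posterior` callables are hypothetical stand-ins for p-VAE's predictive sampler and encoder; only the closed-form Gaussian KL is standard.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )

def quality_score(sample_answer, posterior, j, n_samples=50):
    """Monte Carlo estimate of the information gain for question j.

    sample_answer(j)      -> one sampled answer to question j from the predictive
    posterior(extra=None) -> (mu, var) of q(z | x_O [, extra]) for one student
    """
    mu_o, var_o = posterior()                    # q(z | x_O)
    kls = []
    for _ in range(n_samples):
        x_j = sample_answer(j)                   # x_j ~ p(x_j | x_O)
        mu_u, var_u = posterior(extra=(j, x_j))  # q(z | x_j, x_O)
        kls.append(gaussian_kl(mu_u, var_u, mu_o, var_o))
    return float(np.mean(kls))
```

If conditioning on a question's answer never moves the posterior, the score is zero, matching the intuition that such a question carries no information about ability.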

Question Difficulty Quantification.

For a group of questions answered by the same group of students, the difficulty of each question can be quantified by the rate of incorrect answers. However, in real-world online education data, every question is answered by only a small fraction of students, and by different subsets of students with different educational backgrounds. Directly comparing question difficulty from observational data alone is therefore inaccurate: an easy question answered only by weak students may appear difficult. Thanks to p-VAE, we can predict whether each student would answer an unseen question correctly. We define the difficulty score of question $j$ as $D_j = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_{ij}$, where $\hat{x}_{ij}$ denotes the observed or predicted binary correctness of student $i$'s answer, so a higher score indicates an easier question.
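A minimal sketch of this computation, assuming a dense observation mask and a matrix of predicted correctness probabilities from the model:

```python
import numpy as np

def question_difficulty_score(x_obs, mask, x_pred):
    """Per-question difficulty score from the p-VAE-completed matrix.

    Observed entries keep their recorded correctness; missing entries use the
    model's predicted probability of a correct answer. Following the paper's
    convention, a higher score means an easier question.
    """
    completed = np.where(mask.astype(bool), x_obs, x_pred)
    return completed.mean(axis=0)
```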

4 Experiments

In this section, we demonstrate the applicability of our framework on the real-world educational dataset introduced in Section 2. We first describe the experiment setup, including the human evaluation procedure, then characterize the model's prediction performance, and finally showcase the resulting analytics, including question quality and difficulty.

Experiment Setup.

We split the students (rows of the data matrix) into train, validation, and test sets with an 80:10:10 ratio; students in the test set are therefore never seen during training. We train the model on the train set for 25 epochs using the Adam optimizer with a learning rate of 0.001. We train p-VAE on binary students' answer records (correct or incorrect answers). To evaluate imputation performance, we supply the trained p-VAE model with a subset of each test student's answers as input and compute the model's prediction accuracy on the remaining answers. We use prediction accuracy as the evaluation metric. To compute question difficulty, we use all available data in binary format for p-VAE to perform imputation.

method Accuracy
Mean imputation 0.660
Majority imputation 0.660
Linear MICE 0.667
Missforest 0.571
ExtraTree 0.576
p-VAE imputation 0.734
Table 1: Imputation performances of various models. p-VAE beats baselines by a large margin.

Human Evaluation Procedure.

We compare our model's analytics with those of an education expert to examine the degree of agreement between our model's outputs and the expert's judgements. Our evaluator is a senior and highly respected math teacher who has no prior information about this work. We ask him to evaluate both question quality and question difficulty. For question quality, we resort to pairwise comparison: we give the evaluator a pair of questions and ask for a preference on which question is of higher quality. We then count the number of times our model agrees with the expert's choice. Because there are more than ten thousand questions, and thus far more possible pairs, we sample 40 pairs in which the model assigns the two questions clearly different quality scores; this ensures that the two questions presented to the evaluator at the same time are not of the same quality.

For question difficulty, we ask the evaluator to rank the difficulty of topics and schemes, respectively. We then compute the Spearman correlation coefficient as a measure of the level of agreement between the model’s and the expert’s difficulty rankings.
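The agreement measure above can be sketched with the classic tie-free Spearman formula (in practice a library routine such as `scipy.stats.spearmanr` also handles ties):

```python
import numpy as np

def spearman(rank_a, rank_b):
    """Spearman correlation between two rankings of the same items.

    Assumes no ties, so rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies,
    where d is the per-item difference between the two rank positions.
    """
    a, b = np.asarray(rank_a), np.asarray(rank_b)
    n = len(a)
    d = a - b
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```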

4.1 Students’ Answer Prediction

Table 1 shows the accuracy of p-VAE trained on binary answer records compared with various baselines, including mean imputation, majority imputation, Linear MICE [JSSv045i03], Missforest [missforest], and an ExtraTree variation of Missforest. We did not compare to regular VAEs because, as mentioned earlier, they cannot handle a partially observed data matrix. Given the same split of the data as described earlier, these baselines (except for mean/majority imputation) are no longer practical because of their high computational complexity: even for a linear method, the cost grows rapidly with the number of questions M (recall that we have more than 10 thousand questions). Thus, we further downsample the questions for these baselines to make the comparison feasible. We can see from Table 1 that p-VAE clearly outperforms all baselines by a significant margin. Moreover, among all prediction-based methods, p-VAE is the only one that scales to data of this size, thanks to its efficient, amortized inference.

Figure 2: Two examples of high-quality questions (top row) and two examples of low-quality questions (bottom row) determined by the model. For each pair, the left image shows the actual question, and the right image shows the stacked portion plot indicating the percentage of students who answered this question correctly.

4.2 Question Quality Quantification

We compute question quality according to Eq. 3. Fig. 2 shows two examples of high-quality questions (top row) and two examples of low-quality questions (bottom row) as determined by the model. For each pair, the left image shows the actual question and the right image shows the stacked portion plot, which gives the percentage of students in different ability ranges (computed using the complete matrix imputed by p-VAE) who answered the question correctly. The correct answer choice is always at the top (the red part of the plot), and the remaining colors are the three incorrect answer choices. The stacked portion plot is produced using the observed students' answer choices to the questions (i.e., A, B, C, or D).

In addition to the question content itself, we can gain insight by examining and comparing the stacked portion plots. For example, we can see that high-quality questions better reveal the variability in students' abilities: fewer students with a lower ability score answer them correctly, whereas more students with a higher ability score do. This pattern is absent for lower-quality questions, where most students, regardless of ability score, tend to answer correctly.

Domain expert evaluation.

We further confirm the above intuition about high- and low-quality questions by comparing our model's ranking of questions by quality score with the expert's ranking. Fig. 3 illustrates the evaluation interface that we presented to the evaluator. Among the 40 pairs of questions that we gave the expert to rank, the expert's choice agrees with the model's in 32 cases, yielding 80% agreement. Although the sample size is rather small, the high agreement between the model's and expert's rankings is encouraging and shows our framework's promise for real-world educational scenarios.

Figure 3: Illustration of question quality evaluation interface for the human evaluator. We randomly sample 40 pairs of questions and ask the expert evaluator as well as our model to choose which one is of higher quality. Our model agrees with the evaluator’s choice 80% of the time.

4.3 Question Difficulty Quantification

With the complete data matrix imputed by p-VAE, we compute question difficulty by averaging all students' answers, both observed and predicted. Instead of reporting the difficulty score of each individual question, we show the difficulties of all schemes and topics covering the questions in the dataset, which allows for better visualization and interpretation. The difficulty score of each scheme or topic is computed by simply averaging the difficulty scores of all questions belonging to that scheme or topic.
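Since a question can belong to more than one scheme (Section 2), the averaging step can be sketched as follows; the dictionary-based data layout is an assumption for illustration:

```python
from collections import defaultdict

def aggregate_difficulty(question_difficulty, memberships):
    """Average per-question difficulty scores over each scheme (or topic).

    question_difficulty : maps question id -> difficulty score
    memberships         : maps question id -> list of scheme/topic names;
                          a question contributes to every group it belongs to.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for q, groups in memberships.items():
        for g in groups:
            totals[g] += question_difficulty[q]
            counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}
```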

method Spearman Correlation (Scheme Rank / Topic Rank)
random 0.22 ± 0.11 / 0.09 ± 0.11
majority imp. 0.115 / 0.395
using obs. 0.225 / 0.523
p-VAE imp. 0.659 / 0.758


Figure 4: Question difficulty evaluation results. Fig. 4 shows the difficulty of collections of questions, obtained by averaging the model-computed question difficulty within each collection; a higher score indicates an easier collection. Table 2 shows Spearman correlation coefficients between the human expert's scheme difficulty rankings and our model's predictions, compared with several baselines. Our model achieves much better agreement in ranking question difficulties than the baselines.

Fig. 4 shows the scheme difficulties, arranged from the most to the least difficult schemes from left to right. The difficulty trend agrees with intuition. For example, collections with the word “Higher” in their names are intended for more advanced students (i.e., high school students) and they appear on the left side of the plot (i.e., more difficult). Collections with the word “foundation” are intended for less advanced students and they appear in the middle and right side of the plot (i.e., less difficult).

Domain expert evaluation.

Table 2 shows the Spearman correlation coefficients comparing the expert's topic and scheme rankings, respectively, to those of our model and several baselines. The baselines include a random ordering, using majority imputation to fill the data matrix, and using the observed data alone. Our model's ranking closely matches the human expert's, while the baselines do not produce rankings that come anywhere close. These results showcase the potential applicability of our model, since it can produce analytics close to the expert's judgments.

5 Related Work

Our work appears to be most similar to [vae_aied_2019], but there are several major differences. First and foremost, the methods differ: the partial VAE used in our work is designed to handle partially observed data, which is the primary reason we apply it to educational data. In contrast, [vae_aied_2019] simply used existing, standard (variational) auto-encoders that cannot effectively handle partially observed data. Second, our work considers a more realistic and more difficult dataset and experimental setting. Experiments in [vae_aied_2019] are all based on a small, simulated dataset with only 10,000 students and 28 questions, whereas our experiments are based on a large-scale, real-world dataset with more than 25,000 students and more than 10,000 questions. Lastly, our work produces more relevant and applicable results, because all of our results are based on real data while those in [vae_aied_2019] are based on simulated data.

Another line of work also considers the problem of question quality assessment, classifying questions into one of four categories ranging from very shallow to very deep using a human-labeled dataset of roughly 4000 questions. In contrast, our method infers question quality directly from students' performance on questions and thus needs no annotations or labels.

[question_quality_aied2018] and [question_quality_aied2017] both focus on question quality assessment. However, their work targets questions used in specialized systems, while our method can potentially be applied to and integrated into generic educational platforms. Thus, our method complements prior work and is potentially more practical in real-world scenarios. Our work is also similar in technical approach to [vae_edm_2017], but our problem setting is distinct, and we develop a new VAE framework and an information-theoretic metric for question quality that [vae_edm_2017] does not consider.

Some work attempts to mine question insights from data. [inferring_faq_edm_2017] develops a method to extract frequently asked questions from question-answer forum data. [authentic_questions_edm2018] studies teachers' use of authentic questions. [las2019_1] also performs assessment of questions (assignments) using NLP techniques. Our work differs from the above in that we obtain question quality analytics, a different form of information about questions, and we use the variational auto-encoder framework rather than NLP techniques. Thus, our work complements prior work on obtaining question insights.

6 Conclusion

In this paper, we develop a framework to analyze questions in online education platforms at a large scale. Our framework combines the recently proposed partial variational auto-encoder (p-VAE), which efficiently processes large-scale, partially observed educational datasets, with novel information-theoretic objectives that automatically produce a suite of actionable insights about quiz questions. We demonstrate the applicability of our framework on a large-scale, real-world educational dataset, showcasing the rich and interpretable information, including question difficulty and quality, that our framework uncovers from millions of students' answer records to multiple-choice questions.

Further improving our framework is part of ongoing research. One extension is to customize the information-theoretic criteria such that they can be flexibly designed to extract various other information of interest. Another extension is to adapt the p-VAE model for time-series data, where we can work with the more realistic yet challenging assumption that students’ states of knowledge change over time.


Appendix 0.A Additional Results

We provide additional results in the appendix. These results provide more information on the data set and further verification of the effectiveness of our framework.

0.a.1 Additional data set statistics

Figures 5 and 6 show the number of questions in each topic (level 1) and each scheme, respectively. Figure 7 shows the number of students in each year group. Note that these statistics are computed only over students with a year group label, because not all students are assigned one.

Figure 5: Number of questions in each Level-1 topic. In this data set, a question belongs to only one topic.
Figure 6: Number of questions in each Scheme. A question can belong to more than one scheme.
Figure 7: Number of students in each Year Group.

0.a.2 Full Scheme and Topic Rankings

Figure 8: Difficulty of each math topic by averaging the model-computed difficulty of all questions under each topic. A higher score indicates easier questions.

Tables 2 and 3 compare the complete rankings of scheme difficulty and topic difficulty, respectively, from the education expert and from our model. We can see from both tables that our model's rankings have a strong correlation with the expert's rankings; by simple inspection, the expert's and model's rankings in both tables show a very similar trend in increasing difficulty of topics and schemes.

expert’s ranking model’s ranking
White Rose Maths Hub White Rose Maths Hub
OCR Foundation AQA Foundation
Edexcel Foundation Eedi Maths GCSE Foundation
Eedi Maths GCSE Foundation Eedi Maths iGCSE Core
AQA Foundation CCEA
Eedi Maths iGCSE Core OCR Foundation
OCR Higher Edexcel Foundation
Edexcel Higher Eedi Maths GCSE Higher
Eedi Maths GCSE Higher Eedi Maths iGCSE Extension
AQA Higher Edexcel Higher
Eedi Maths iGCSE Extension OCR Higher
Table 2: The complete rankings of all question schemes comparing expert’s (left column) and model’s (right column) ranking. The two rankings agree quite nicely. Both rank “AET” and “White Rose Maths Hub” exactly the same. Even though some of the schemes are not exactly ranked the same, the two rankings agree on the general trend. For example, both rank “foundation” schemes to be easier and “higher” schemes to be more difficult.
expert’s ranking model’s ranking
symmetry number-others
coordinates basic arithmetic
2d names and properties money
3d shapes decimals
rounding and estimating symmetry
basic arithmetic negative numbers
calculator use calculator use
unites of measurement algebra-others
data collection factors, multiples and primes
data processing unites of measurement
decimals 3d shapes
money fractions
factors, multiples and primes fractions, decimals and percentage equivalence
perimeter and area coordinates
percentages angles
number-others indices powers and roots
writing and simplifying expressions data representation
negative numbers rounding and estimating
expanding brackets sequences
angles perimeter and area
circles circles
algebra-others ratio
sequences writing and simplifying expressions
factorising 2d names and properties
solving equations data processing
formula expanding brackets
fractions, decimals and percentage equivalence probability
ratio proportion
volumn and surface area
construction, loci and scale drawing solving equations
data representation similarity and congruency
fractions factorising
pythagoras data collection
indices powers and roots formula
basic vectors compound measures
proportion construction, loci and scale drawing
probability pythagoras
basic trigonometry surds
similarity and congruency basic trigonometry
straight line graphs inequalities
inequalities volumn and surface area
quadratic graphs basic vectors
functions other graphs
other graphs straight line graphs
compound measures forces and motion
transformation of functions quadratic graphs
surds functions
algebraic fractions algebraic fractions
forces and motion transformation of functions
Table 3: The complete rankings of all question topics comparing expert’s (left column) and model’s (right column) ranking. Even though the exact matches of the two rankings are rare, the general trend remains the same. For example, both rank “symmetry” and “basic arithmetic” as easier topics, and “surds” and “transformation of functions” as more difficult topics.

0.a.3 Additional results on question quality

Figures 9 and 10 show additional examples of high- and low-quality questions as determined by our model. Comparing the stacked portion plots of high-quality questions with those of low-quality questions, we can see that high-quality questions more effectively reveal differences in students' mastery of knowledge. For high-quality questions, more capable students answer correctly whereas less capable students do not; this can be seen from the diminishing red portion at the top of each plot, which indicates the fraction of students answering correctly, from right (more capable students) to left (less capable students). In contrast, for lower-quality questions, almost all students answer correctly, so these questions cannot effectively diagnose students' capabilities.

Figure 9: Additional examples of high quality questions.
Figure 10: Additional examples of lower quality questions.

Appendix 0.B Experimental settings

We include additional experiment settings.

0.b.1 Model Setting

Below, we present the exact model architecture that we use for all of our experiments. In the model description below, the words in parentheses are the identifiers for each module; in particular, encoder is the inference model and decoder is the generation model. We use PyTorch for the implementation.

P_VAE(
  (encoder): encoder(
    (enc): Linear(in_features=12, out_features=100, bias=True)
    (linear1): Linear(in_features=100, out_features=1024, bias=True)
    (linear2): Linear(in_features=1024, out_features=256, bias=True)
    (linear3): Linear(in_features=256, out_features=20, bias=True)
    (bn_feat1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
    (bn_feat2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
    (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
    (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
    (bn_out): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)
    (relu): ELU(alpha=1.0, inplace)
  )
  (decoder): decoder(
    (linear1): Linear(in_features=10, out_features=256, bias=True)
    (linear2): Linear(in_features=256, out_features=1024, bias=True)
    (linear6): Linear(in_features=1024, out_features=13369, bias=True)
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
    (bn2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
    (bn_out): BatchNorm1d(13369, eps=1e-05, momentum=0.1, affine=True)
    (relu): ELU(alpha=1.0, inplace)
  )
)

0.b.2 Additional Training Settings

In addition to the settings in the main text, we also evaluate the model performance every epoch on the validation set and anneal the learning rate by a factor of 0.7 if the validation loss does not reduce for 5 consecutive epochs.
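This annealing rule corresponds to PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.7` and `patience=5`. A framework-free replay of the logic (our reconstruction, not the authors' code; edge cases such as cooldown periods are ignored):

```python
def anneal_lr(val_losses, lr0=1e-3, factor=0.7, patience=5):
    """Multiply the learning rate by `factor` whenever the validation loss
    has not improved for `patience` consecutive epochs.

    Returns the learning rate in effect at each epoch.
    """
    lr, best, since_best = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                lr *= factor
                since_best = 0
        history.append(lr)
    return history
```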

0.b.3 Human Evaluation Settings

We include selected slides from the deck that we showed to the education expert to perform the human evaluations, covering both the difficulty and quality rankings. Below, we present the slides for each task and explain what the evaluator was asked to do.

Difficulty rankings.

Figure 11 and 12 show the slides that we present to the evaluator to rank the difficulty of schemes and topics, respectively. On the left side of each figure is a slide with the instructions. On the right side of each figure is a slide with boxes containing scheme or topic names. These boxes are originally ordered alphabetically. The evaluator is instructed (see the instruction slide) to drag the boxes and reorder them in increasing difficulty from top to bottom and from left to right. We find that dragging is an intuitive and user-friendly way for the evaluator to perform the ranking tasks.

Quality rankings.

Figure 13 shows the slides that we present to the evaluator to rank the quality of pairs of questions. On the left side of the figure is a page with the instructions. On the right side of the figure is an example slide with two questions (there are 40 such slides; see main text for details). The evaluator is instructed to choose, for each slide, a question among the two that has higher quality where quality is defined as “effectiveness of distinguishing good and not so good students” (see the instruction slide).

Figure 11: Instructions and interface for scheme difficulty evaluation presented to the human evaluator.
Figure 12: Instructions and interface for topic difficulty evaluation presented to the human evaluator.
Figure 13: Instructions and interface for question quality evaluation presented to the human evaluator.