1 Introduction
Online education platforms are transforming current education systems by providing new opportunities such as democratizing high-quality educational resources and personalizing learning experiences. In recent years, many online education platforms have been developed, in particular those that crowdsource a large volume of questions and exercises. The availability of such questions and exercises is a key advantage of these platforms:
students can use them to learn and practice, while teachers can use them to customize quizzes to best understand and improve students’ learning. All of these potentially lead to improved learning outcomes. In this work, we focus on educational resources in the form of multiple-choice questions, which are one of the most common forms of quizzes in online education.
With such a massive pool of crowdsourced questions, choosing which ones to use is challenging because both teachers and students have limited time. Understanding the quality and difficulty levels of the questions will help teachers and students select which questions to use. It is impractical to require human experts to manually provide quality and difficulty labels for all the questions, which can number in the millions; an AI solution that automatically obtains the above insights is desired.
Therefore, we aim to develop a machine learning solution for large-scale online educational data analysis that provides insight into the difficulty and quality of each question. This task involves manifold challenges. First, online educational data is massive; both the number of questions and the number of students are extremely large. Second, there exists severe missingness in the data since each student can only answer an extremely small fraction of all available questions. Lastly, we need to design objectives to quantify and extract desired insights such as question quality and difficulty. Overall, we need a solution that is efficient, handles highly sparse data, and automatically acquires educationally meaningful insights about questions.
In this work, we use real-world online educational data in the form of students’ answers to multiple-choice questions and develop a machine learning framework to analyze the difficulty and quality of each question. We briefly summarize our framework below:

We develop a novel framework for educational question analysis based on partial variational autoencoder (pVAE) [ma2018partial, eddi] to efficiently handle partially observed data at a large scale. pVAE models existing students’ answers and predicts the potential answer to unseen questions in a probabilistic manner.

We design a novel information-theoretic metric to quantify the quality of each question based on the observed data and pVAE’s predictions. We also define a difficulty metric to quantify the difficulty of each question.

We evaluate our results not only using standard quantitative machine learning metrics but also with human experts. We empirically show that our framework is able to identify the quality and difficulty of the questions as consistently as human experts.
2 Cohort
We analyze data from a real-world online education platform. This platform offers crowdsourced multiple-choice questions to students from primary to high school (roughly between 7 and 18 years old). Each question has 4 possible answer choices, exactly one of which is correct. Currently, the platform mainly focuses on math questions. Figure 1 shows an example question from the platform. We use the data collected from the most recent school year (from September 2018 to May 2019). We organize the data in matrix form, where each row represents a student and each column represents a question. Each entry contains a number that represents a student’s answer to a question (i.e., 0 represents A, 1 represents B, etc.). We also know the correct answer to each of the questions; thus, we can obtain binary-valued information on whether the student has answered a question correctly (i.e., 0 represents a wrong answer and 1 represents a correct answer). Each student has only answered a tiny fraction of all questions and hence the matrix is extremely sparse. We thus removed questions that have fewer than 100 answers and students who have answered fewer than 100 questions. In addition, when a student has multiple answer records for the same question, we keep the latest answer record. The above preprocessing steps lead to a final data matrix that consists of roughly 6.3 million students’ answer records with 27,219 students (rows) and 13,369 questions (columns). The final data matrix remains highly sparse, with only about 1.7% of all entries in the matrix observed (6.3 million out of 27,219 × 13,369 entries).
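The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the record layout `(student_id, question_id, timestamp, answer)`, the function names, and the order of the filtering steps (deduplicate, then filter questions, then filter students, in a single pass) are all assumptions.

```python
from collections import Counter


def preprocess(records, min_count=100):
    """Filter raw answer records (hypothetical layout, see lead-in).

    records: iterable of (student_id, question_id, timestamp, answer).
    Returns {(student_id, question_id): answer} after
      1) keeping only the latest record per (student, question) pair,
      2) dropping questions with fewer than `min_count` answers,
      3) dropping students with fewer than `min_count` answered questions.
    """
    latest = {}
    for student, question, timestamp, answer in records:
        key = (student, question)
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, answer)

    per_question = Counter(q for _, q in latest)
    kept = {k: v for k, v in latest.items() if per_question[k[1]] >= min_count}

    per_student = Counter(s for s, _ in kept)
    return {k: ans for k, (_, ans) in kept.items() if per_student[k[0]] >= min_count}


def observed_fraction(num_records, num_students, num_questions):
    """Fraction of entries of the student-by-question matrix that are observed."""
    return num_records / (num_students * num_questions)


# Sparsity implied by the (rounded) dataset statistics reported above.
sparsity = observed_fraction(6.3e6, 27219, 13369)
```

With the rounded figures reported in this section, `sparsity` evaluates to roughly 0.017, i.e., fewer than 2 in 100 entries of the matrix are observed.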
The dataset also contains additional metadata about questions. Each question may be linked to one or more topics. Each topic covers an area of mathematics and the topics are hierarchically organized into levels of increasing granularity. The questions may also belong to one or more schemes. A scheme divides the school year into a series of topic units. Within each topic unit, there are two quizzes, one of which is assigned at the end of the topic unit and the other three weeks later. Teachers may change the order and length of topic units.
3 Method
To gain insights into such real-world educational data, we first need a model that predicts the missing data (questions students did not answer) with uncertainty estimation. We then need to design different objectives to quantify question quality and difficulty.
We formulate the first step above as a missing data imputation (a.k.a. matrix completion) problem. We have a data matrix $X$ of size $N$ by $M$, where $N$ is the total number of students and $M$ is the total number of questions. Each entry $x_{ij}$ can either be binary, indicating whether student $i$ has answered question $j$ correctly, or categorical, giving the answer choice that the student selects. The data matrix is only partially observed; we denote the observed part of the data matrix as $X_O$. We then predict the missing entries as accurately as possible in a probabilistic manner. To this end, we use the partial variational autoencoder (pVAE) [ma2018partial], which is the state-of-the-art method for the above imputation task.
The second step is to quantify the quality and difficulty of each question based on pVAE’s predictions. Specifically, we quantify question quality by measuring the value of information that each question carries using an information-theoretic objective.
We can also use the complete matrix, with missing entries replaced by pVAE’s predictions, to gain insights on question difficulty. We present these two steps in detail in the remainder of this section.
3.1 Partial Variational Autoencoder (pVAE) for Students’ Answers Prediction
pVAE is a deep latent variable model that extends traditional VAEs [kingma2013auto, Rezende2014StochasticBA, zhang2017advances]. VAEs assume that the data matrix $X$ is generated from a number of local latent variables $z_i$:
$$p(X) = \prod_{i} \int p_\theta(x_i \mid z_i)\, p(z_i)\, dz_i = \prod_{i} \int \prod_{j} p_\theta(x_{ij} \mid z_i)\, p(z_i)\, dz_i,$$
where $x_i$ is the $i$-th student’s answers and $x_{ij}$ is the $i$-th student’s answer to the $j$-th question. We use a deep neural network for the generative model $p_\theta(x_i \mid z_i)$ because of its expressive power. Of course, $x_i$ contains missing entries because each student only answers a small fraction of all questions. Unfortunately, VAEs can only model fully observed data. To model partially observed data, pVAE extends traditional VAEs by exploiting the fact that, given $z_i$, $x_i$ is fully factorized. Thus, the unobserved data entries can be predicted given the inferred $z_i$’s. Concretely, pVAE optimizes the following partial evidence lower bound (ELBO):
$$\log p(X_O) \geq \mathbb{E}_{z \sim q(z \mid X_O)}\big[\log p_\theta(X_O \mid z) + \log p(z) - \log q(z \mid X_O)\big],$$
where $X_O$ denotes the observed entries. This bound has the same form as the ELBO for VAEs but is taken only over the observed part of the data.
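To illustrate what "only over the observed part of the data" means in practice, the following NumPy sketch evaluates a single-sample estimate of the partial ELBO for binary answers. It assumes a Bernoulli likelihood, a diagonal Gaussian posterior, and a standard normal prior; the function name and argument layout are illustrative and not taken from the implementation used in our experiments.

```python
import numpy as np


def partial_elbo(x, mask, logits, mu, log_var):
    """Single-sample partial ELBO per student (illustrative sketch).

    x:       (N, M) binary answers; values where mask == 0 are ignored
    mask:    (N, M) 1 where the answer is observed, 0 otherwise
    logits:  (N, M) decoder outputs for one sampled z per student
    mu, log_var: (N, D) parameters of the Gaussian posterior q(z | x_O)
    """
    # Bernoulli log-likelihood in a numerically stable form:
    # x * logit - softplus(logit)
    log_lik = x * logits - np.logaddexp(0.0, logits)
    # Only observed entries contribute to the reconstruction term.
    recon = (mask * log_lik).sum(axis=1)
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(axis=1)
    return recon - kl
```

Because the likelihood is multiplied by the mask, whatever placeholder value is stored for an unanswered question has no effect on the objective.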
The challenge is to approximate the posterior of the $z_i$’s using a partially observed data vector. pVAE uses a set-based inference network $q_\phi(z_i \mid x_{O_i})$, where $x_{O_i}$ is the observed subset of answers for student $i$ [zaheer2017deep, qi2017pointnet]. $q_\phi(z_i \mid x_{O_i})$ is assumed to be Gaussian; concretely, the mean and variance of the posterior of the latent variable are inferred as
$$[\mu_\phi(x_O), \sigma^2_\phi(x_O)] = f_\phi\big(g(\{s_j : j \in O\})\big), \qquad (1)$$
where we have dropped the student index for notational simplicity. $s_j$ is the observed answer value augmented by its location embedding, which is learned; $g$ is a permutation-invariant transformation, such as summation, that outputs a fixed-size vector; and $f_\phi$ is a regular feedforward neural network. In this paper, $s_j$ is built from the answer value $x_j$ and two learnable per-question parameters that embed the identity of each question.
Note that, in pVAE, some parameters have natural interpretations. For example, the per-question parameters can be collectively interpreted as a question embedding for each question $j$. The per-student latent variable $z_i$ can be interpreted as a student embedding for each student $i$.
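The set-based inference step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the trained model: the sizes, the random (untrained) weights, the single-layer maps, and in particular the way each observed answer is combined with its question embedding (simple concatenation here) are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, E, H, D = 6, 4, 8, 3  # questions, embedding, hidden, latent sizes (illustrative)

question_emb = rng.normal(size=(M, E))      # would be learned in a real model
W_h = rng.normal(size=(E + 1, H)) * 0.1     # per-element feature map
W_out = rng.normal(size=(H, 2 * D)) * 0.1   # maps pooled features to (mu, log_var)


def encode(answers, observed_idx):
    """Return (mu, log_var) computed from the observed answers only."""
    idx = np.asarray(observed_idx, dtype=int)
    # Each observed answer is augmented with its question embedding
    # (concatenation is an assumed form of this augmentation).
    s = np.concatenate([answers[idx][:, None], question_emb[idx]], axis=1)
    h = np.maximum(s @ W_h, 0.0)
    pooled = h.sum(axis=0)   # permutation-invariant summation over the set
    out = pooled @ W_out     # feedforward map (a single layer in this sketch)
    return out[:D], out[D:]
```

Since the pooling is a sum over the observed set, the output does not depend on the order in which answers are presented, and the input may contain any number of observed answers.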
3.2 Question Analysis Using Predicted Student Performance
Using the student performance predicted by pVAE, we design a novel information-theoretic objective to quantify question quality and use statistical tools to analyze question difficulty.
Question Quality Quantification.
Working closely with an education expert, we found that high-quality questions are considered to be those that best differentiate students’ abilities. When a question is simple, almost all students will answer it correctly. When a question is badly formulated or too difficult, all students will provide incorrect answers or random guesses. In either case, the question neither helps the teacher gain insights about the students’ abilities nor helps students learn well. Thus, high-quality questions are the ones that can differentiate the students’ abilities.
We thus formulate the following information-theoretic objective to quantify the quality of question $j$:
$$R(j) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{x_{ij}}\Big[ D_{\mathrm{KL}}\big( p(z_i \mid x_{ij}) \,\|\, p(z_i) \big) \Big], \qquad (2)$$
where $j$ is the question index and $x_{ij}$ is the $i$-th student’s answer to the $j$-th question, which can be either binary, indicating whether the student has answered it correctly, or categorical, giving the student’s answer choice for this question. $z_i$ is the latent embedding of student $i$. The $i$-th student’s ability can be determined by the student’s possible performance on all questions, which can be inferred given $z_i$.
This objective measures the information gain of estimating the student’s ability by conditioning on the answer to question $j$. Thus, when $R(j)$ is large, the question is more informative for differentiating student ability, and it is therefore considered a high-quality question.
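As a concrete illustration of an expected-KL score of this kind, the sketch below estimates, for one student and one question, the average divergence between the posterior over the student embedding after seeing a sampled answer and a reference Gaussian. All inputs are hypothetical stand-ins: in a real system the per-answer posteriors would come from the inference network and the answer probabilities from the model's predictions. Diagonal Gaussians are assumed throughout.

```python
import numpy as np


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) )."""
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )


def expected_information_gain(posterior_given_answer, answer_probs, prior,
                              num_samples=100, seed=0):
    """Monte Carlo estimate of E_x[ KL( q(z | x) || prior ) ].

    posterior_given_answer: {answer: (mu, var)}  (hypothetical inputs)
    answer_probs:           {answer: predicted probability}, summing to 1
    prior:                  (mu, var) of the reference Gaussian
    """
    rng = np.random.default_rng(seed)
    answers = list(answer_probs)
    probs = np.array([answer_probs[a] for a in answers])
    draws = rng.choice(len(answers), size=num_samples, p=probs)
    kls = [gaussian_kl(*posterior_given_answer[answers[d]], *prior) for d in draws]
    return float(np.mean(kls))
```

A question whose answer leaves the posterior unchanged receives a score of zero, whereas a question whose answer moves the posterior away from the reference distribution receives a positive score.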
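In practice, we compute Eq. 2 using $q(z_i \mid x_{ij})$ to approximate $p(z_i \mid x_{ij})$ with Monte Carlo integration [eddi, gong2019icebreaker]:
$$R(j) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} D_{\mathrm{KL}}\big( q(z_i \mid \hat{x}_{ij}^{(k)}) \,\|\, p(z_i) \big), \qquad (3)$$
where $K$ denotes the number of Monte Carlo samples and $\hat{x}_{ij}^{(k)}$ is the $k$-th sampled answer. The KL term can be computed easily in closed form thanks to the Gaussian assumptions in VAEs, i.e., the distributions $q(z_i \mid x_{ij})$ and $p(z_i)$ are in the form of Gaussian distributions [kingma2013auto].
Question Difficulty Quantification.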
For a group of questions answered by the same group of students, the difficulty level of the questions can be quantified by the incorrect rate of all students’ answers. However, in real-world online education data, every question is answered only by a small fraction of students, and by different subsets of students with different educational backgrounds. Thus, directly comparing the difficulty levels of questions from observational data is not accurate: an easy question that happens to be answered only by weak students may appear difficult if only observational data is used. Thanks to pVAE, we can predict whether a student can correctly answer an unseen question. We use these predictions to define the difficulty level of question $j$ as $d_j = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_{ij}$, where $\hat{x}_{ij}$ denotes the predicted student’s answer (or the observed answer when it is available).
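The effect of filling in unobserved answers before averaging can be illustrated with a short NumPy sketch. It assumes binary correctness data and that predicted probabilities are thresholded at 0.5 before averaging; the threshold, the function name, and the argument layout are illustrative assumptions. Under the binary encoding used here (1 for a correct answer), a lower average indicates a harder question.

```python
import numpy as np


def mean_correctness(x, mask, pred_prob, threshold=0.5):
    """Per-question average of observed and predicted correctness.

    x:         (N, M) binary correctness; ignored where mask == 0
    mask:      (N, M) 1 where the answer is observed
    pred_prob: (N, M) predicted probability of a correct answer
    """
    predicted = (pred_prob >= threshold).astype(float)
    # Keep observed answers and fill the remaining entries with predictions.
    completed = np.where(mask == 1, x, predicted)
    return completed.mean(axis=0)
```

For example, if a question was answered (incorrectly) by a single weak student but the model predicts that two other students would answer it correctly, the observed correct rate is 0 while the completed average is 2/3.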
4 Experiments
In this section, we demonstrate the applicability of our framework on the real-world educational dataset introduced in Section 2. We first describe the experiment setup, including the human evaluation procedure, then characterize the model’s prediction performance, and finally showcase the suite of analytics, including the quality and difficulty of questions.
Experiment Setup.
We split the students (rows of the data matrix) into train, validation and test sets with an 80:10:10 ratio. Therefore, students that are in the test set are never seen in the training set. We train the model on the train set for 25 epochs using the Adam optimizer [kingma2014adam] with a learning rate of 0.001. We train pVAE on binary students’ answer records (correct or incorrect answers). To evaluate imputation performance, we supply the trained pVAE model with a subset of the test set as input and compute the model’s prediction accuracy on the rest of the test set. We use prediction accuracy as the evaluation metric. To compute question difficulty, we use all available data in binary format for pVAE to perform imputation.
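The evaluation protocol can be sketched as follows. This is an illustration only: the random seed, the fraction of each test student's answers that is held out (0.5 here), and the 0.5 decision threshold are assumptions, since they are not specified above.

```python
import numpy as np


def split_students(num_students, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition student indices (rows) into train/val/test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_students)
    n_train = int(ratios[0] * num_students)
    n_val = int(ratios[1] * num_students)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


def holdout_mask(mask, holdout_frac=0.5, seed=0):
    """Split observed entries into model inputs and held-out targets."""
    rng = np.random.default_rng(seed)
    observed = mask == 1
    held = (rng.random(mask.shape) < holdout_frac) & observed
    return observed & ~held, held


def accuracy(x, pred_prob, target_mask):
    """Fraction of held-out binary answers predicted correctly."""
    pred = pred_prob >= 0.5
    return float((pred == (x == 1))[target_mask].mean())
```

Splitting by rows rather than by individual entries is what ensures that test students are entirely unseen during training.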
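Table 1: Prediction accuracy of pVAE and the baselines on binary answer records.
method              | Accuracy
--------------------|---------
Mean imputation     | 0.660
Majority imputation | 0.660
Linear MICE         | 0.667
Missforest          | 0.571
ExtraTree           | 0.576
pVAE imputation     | 0.734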
Human Evaluation Procedure.
We compare our model’s analytics with those of an education expert to examine the degree of agreement between our model’s outputs and the human expert’s judgements. Our evaluator is a senior and highly respected math teacher who has no prior information about this work. We ask him to evaluate both question quality and question difficulty. For question quality, we resort to pairwise comparison, i.e., we give the evaluator a pair of questions and ask the evaluator which question is of higher quality. We then compute the number of times our model agrees with the expert’s choice of quality. Because there are more than ten thousand questions and thus many more possible pairs, we sample 40 pairs. The sampling ensures that the two questions presented to the evaluator at the same time are not of the same quality.
For question difficulty, we ask the evaluator to rank the difficulty of topics and schemes, respectively. We then compute the Spearman correlation coefficient as a measure of the level of agreement between the model’s and the expert’s difficulty rankings.
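Both agreement measures are simple to compute. The sketch below is illustrative: it uses the textbook Spearman formula for rankings without ties, and the data layout for the pairwise comparison (a pair of model scores plus the index of the question preferred by the expert) is an assumption.

```python
import numpy as np


def spearman_no_ties(rank_a, rank_b):
    """Spearman correlation of two rankings without ties:
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    a = np.asarray(rank_a, dtype=float)
    b = np.asarray(rank_b, dtype=float)
    n = len(a)
    return 1.0 - 6.0 * np.sum((a - b) ** 2) / (n * (n ** 2 - 1))


def ranks_from_order(order, names):
    """Rank of each name in `names` according to the ordered list `order`."""
    position = {name: r for r, name in enumerate(order)}
    return [position[name] for name in names]


def pairwise_agreement(model_scores, expert_choices):
    """model_scores: list of (score_of_first, score_of_second) per pair;
    expert_choices: index (0 or 1) of the question the expert prefers."""
    hits = sum(
        int((s1 > s0) == (choice == 1))
        for (s0, s1), choice in zip(model_scores, expert_choices)
    )
    return hits / len(expert_choices)
```

To compare two orderings of the same items, both can first be converted to ranks with `ranks_from_order` and then passed to `spearman_no_ties`.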
4.1 Students’ Answer Prediction
Table 1 shows the accuracy of pVAE trained on binary answer records compared to various baselines, including mean imputation, majority imputation, Linear MICE [JSSv045i03], Missforest [missforest], and an ExtraTree variation of Missforest. We did not compare to regular VAEs because, as mentioned earlier, regular VAEs do not handle a partially observed data matrix. Given the same split of the data as described earlier, these baselines (except for mean/majority imputation) are no longer practical because of their high computational complexity; even with a linear method, the cost grows quickly with the number of questions $M$ (recall that we have more than 10 thousand questions). Thus, we further downsample the questions for these baselines to make the comparison feasible. We can see from Table 1 that pVAE clearly outperforms all baselines by a significant margin. Moreover, among all prediction-based methods, pVAE is the only method that scales to data of this size, thanks to its efficient, amortized inference.
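For reference, the two simplest baselines can be sketched as below. The sketch assumes that imputation is done per question (column-wise) on binary correctness data; whether the baselines in Table 1 were computed per question or globally is not stated, so this is one plausible reading rather than the exact procedure.

```python
import numpy as np


def column_mean_impute(x, mask):
    """Per-question fraction of correct answers among observed entries.
    Questions without any observed answer default to 0.5."""
    counts = mask.sum(axis=0)
    sums = np.where(mask == 1, x, 0.0).sum(axis=0)
    return np.divide(sums, counts, out=np.full(x.shape[1], 0.5), where=counts > 0)


def majority_impute(x, mask):
    """Per-question majority answer (1 if at least half are correct)."""
    return (column_mean_impute(x, mask) >= 0.5).astype(int)
```

With binary data, thresholding the column mean at 0.5 yields the same prediction as the majority answer, which is consistent with the two baselines obtaining the same accuracy.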
4.2 Question Quality Quantification
We compute question quality according to Eq. 3. Fig. 2 shows two examples of high-quality questions (top row) and two examples of low-quality questions (bottom row) determined by the model. For each pair, the left image shows the actual question, and the right image shows the stacked portion plot. The stacked portion plot shows the percentage of students in different ability ranges (computed using the complete matrix imputed by pVAE) who have answered the question correctly (the correct answer choice is always at the top, i.e., the red part of the plot, and the remaining colors are the three incorrect answer choices). The stacked portion plot is produced using the observed students’ answer choices to the questions (i.e., A, B, C, or D choices).
In addition to the question content itself, we can gain some insights by examining and comparing the stacked portion plots. For example, we can see that high-quality questions better test the variability in students’ abilities because fewer students with a lower ability score can answer them correctly, whereas more students with a higher ability score can answer them correctly. This phenomenon is not present in lower-quality questions, where most of the students, regardless of their ability score, tend to answer them correctly.
Domain expert evaluation.
We further confirm the above intuition about high- and low-quality questions by comparing our model’s ranking of questions in terms of their quality scores with the expert’s ranking. Fig. 3 illustrates the evaluation interface that we presented to the evaluator. Among the 40 pairs of questions that we gave to the expert to rank, 32 of the expert’s rankings agree with the model’s rankings, yielding an 80% agreement. Although the sample size is rather small, the high agreement between the model’s and the expert’s rankings is encouraging and shows our framework’s promise for real-world educational scenarios.
4.3 Question Difficulty Quantification
With the complete data matrix imputed by pVAE, we compute question difficulty by taking the average of all students’ answers including observed and predicted answers. Instead of reporting the difficulty score of each individual question, we show difficulties of all schemes and topics that cover all questions in the data set. This allows for better visualization and interpretation. The difficulty score of each scheme or topic is computed by simply averaging the difficulty scores of all questions that belong to the same scheme or topic.
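The aggregation from questions to schemes or topics can be sketched with the standard library alone. The dictionary-based data layout and function names are illustrative assumptions. Because the per-question score here is an average of correctness values, the sketch treats a lower score as more difficult when ordering groups; this direction is an assumption that follows from the binary encoding (1 for a correct answer).

```python
from collections import defaultdict


def group_scores(question_scores, question_groups):
    """Average the per-question scores within each scheme or topic.

    question_scores: {question_id: score}
    question_groups: {question_id: [group names]}; a question may belong
                     to several groups and then contributes to each.
    """
    buckets = defaultdict(list)
    for question, score in question_scores.items():
        for group in question_groups.get(question, []):
            buckets[group].append(score)
    return {group: sum(v) / len(v) for group, v in buckets.items()}


def most_to_least_difficult(scores):
    """Order groups from lowest to highest average correctness."""
    return sorted(scores, key=scores.get)
```

For instance, with scores `{"q1": 0.2, "q2": 0.4, "q3": 0.9}` and groups `{"q1": ["a"], "q2": ["a", "b"], "q3": ["b"]}`, group `a` averages 0.3 and group `b` averages 0.65, so `a` is ranked as the more difficult of the two.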
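Table 2: Spearman correlation between the expert’s rankings and each method’s rankings of scheme and topic difficulty.
method        | Scheme Rank | Topic Rank
--------------|-------------|------------
random        | 0.22 ± 0.11 | 0.09 ± 0.11
majority imp. | 0.115       | 0.395
using obs.    | 0.225       | 0.523
pVAE imp.     | 0.659       | 0.758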
Fig. 4 shows the scheme difficulties, arranged from the most to the least difficult schemes from left to right. The difficulty trend agrees with intuition. For example, collections with the word “Higher” in their names are intended for more advanced students (i.e., high school students) and they appear on the left side of the plot (i.e., more difficult). Collections with the word “foundation” are intended for less advanced students and they appear in the middle and right side of the plot (i.e., less difficult).
Domain expert evaluation.
Table 2 shows the Spearman correlation coefficients comparing the expert’s topic and scheme rankings, respectively, to the rankings of our model and of the baselines. The baselines include random ordering, using majority imputation to fill the data matrix, and using the observed data alone. We see that our model’s ranking closely matches the human expert’s ranking, while the baselines do not produce rankings that are anywhere close to the expert’s ranking. These results showcase the potential applicability of our model because it can produce analytics that are close to the expert’s judgments.
5 Related Work
Our work appears to be most similar to [vae_aied_2019], but there are several major differences. First and foremost, the methods used in our work and in [vae_aied_2019] are different. The partial VAE used in our work is designed to handle partially observed data, which is the primary reason we apply it to educational data. In contrast, [vae_aied_2019] simply used existing, standard (variational) autoencoders that are incapable of effectively handling partially observed data. Second, our work considers a more realistic and more difficult data set and experiment setting than [vae_aied_2019]. Experiments in [vae_aied_2019] are all based on a small, simulated data set with only 10000 students and 28 questions. In contrast, our experiments are based on a large-scale, real-world data set with more than 25000 students and more than 10000 questions. Lastly, our work produces more relevant and applicable results than [vae_aied_2019] because all of our results are based on real data, while the results in [vae_aied_2019] are based on simulated data.
[question_quality_aied2017] also considers the problem of question quality assessment, where the authors classify the questions into one of four categories ranging from very shallow to very deep using a human-labeled data set of roughly 4000 questions. In contrast, our method infers question quality directly from students’ performance on questions and thus does not need any annotations or labels.
[question_quality_aied2018] and [question_quality_aied2017] both focus on question quality assessment. However, these works focus on questions used in specialized systems, while our method can potentially be applied to and integrated into generic educational platforms. Thus, our method complements prior work and is potentially more practical and applicable in real-world scenarios. Our work also appears to be similar in technical approach to [vae_edm_2017], but the problem setting in our work is distinct from that in [vae_edm_2017]. Our work also develops a new VAE framework and an information-theoretic metric for question quality, which [vae_edm_2017] does not consider.
Some work attempts to mine question insights from data. [inferring_faq_edm_2017] develops a method to extract frequently asked questions from question-answer forum data. [authentic_questions_edm2018] discovers teachers’ use of authentic questions. [las2019_1] also performs assessment of questions (assignments) using NLP techniques. Our work differs from the above in that we obtain question quality analytics, which are a different form of information about questions. Furthermore, we use the variational autoencoder framework instead of resorting to NLP techniques. Thus, our work complements prior work that obtains question insights.
6 Conclusion
In this paper, we develop a framework to analyze questions in online education platforms on a large scale. Our framework combines the recently proposed partial variational autoencoder (pVAE), for efficiently processing large-scale, partially observed educational datasets, with novel information-theoretic objectives for automatically producing a suite of actionable insights about quiz questions. We demonstrate the applicability of our framework on a large-scale, real-world educational dataset, showcasing the rich and interpretable information, including question difficulty and quality, that our framework uncovers from millions of students’ answer records to multiple-choice questions.
Further improving our framework is part of ongoing research. One extension is to customize the information-theoretic criteria such that they can be flexibly designed to extract various other information of interest. Another extension is to adapt the pVAE model for time-series data, where we can work with the more realistic yet challenging assumption that students’ states of knowledge change over time.
References
Appendix 0.A Additional Results
We provide additional results in the appendix. These results provide more information on the data set and further verification of the effectiveness of our framework.
0.A.1 Additional data set statistics
0.A.2 Full Scheme and Topic Rankings
Tables 2 and 3 compare the complete rankings of scheme difficulty and topic difficulty, respectively, from the education expert and our model. We can see from both tables that our model’s rankings have a strong correlation with the expert’s rankings; by simple inspection, the expert’s and the model’s rankings in both tables show a very similar trend in terms of increasing difficulty of topics and schemes.
expert’s ranking            | model’s ranking
----------------------------|----------------------------
AET                         | AET
White Rose Maths Hub        | White Rose Maths Hub
OCR Foundation              | AQA Foundation
Edexcel Foundation          | Eedi Maths GCSE Foundation
Eedi Maths GCSE Foundation  | Eedi Maths iGCSE Core
AQA Foundation              | CCEA
Eedi Maths iGCSE Core       | OCR Foundation
OCR Higher                  | Edexcel Foundation
Edexcel Higher              | Eedi Maths GCSE Higher
Eedi Maths GCSE Higher      | Eedi Maths iGCSE Extension
AQA Higher                  | Edexcel Higher
Eedi Maths iGCSE Extension  | OCR Higher
CCEA                        | AQA Higher
expert’s ranking | model’s ranking
-----------------|----------------
symmetry | number-others
coordinates | basic arithmetic
2d names and properties | money
3d shapes | decimals
rounding and estimating | symmetry
basic arithmetic | negative numbers
calculator use | calculator use
units of measurement | algebra-others
data collection | factors, multiples and primes
data processing | units of measurement
decimals | 3d shapes
money | fractions
factors, multiples and primes | fractions, decimals and percentage equivalence
perimeter and area | coordinates
percentages | angles
number-others | indices powers and roots
writing and simplifying expressions | data representation
negative numbers | rounding and estimating
expanding brackets | sequences
angles | perimeter and area
circles | circles
algebra-others | ratio
sequences | writing and simplifying expressions
factorising | 2d names and properties
solving equations | data processing
formula | expanding brackets
fractions, decimals and percentage equivalence | probability
ratio | proportion
 | percentages
construction, loci and scale drawing | solving equations
data representation | similarity and congruency
fractions | factorising
pythagoras | data collection
indices powers and roots | formula
basic vectors | compound measures
proportion | construction, loci and scale drawing
probability | pythagoras
basic trigonometry | surds
similarity and congruency | basic trigonometry
straight line graphs | inequalities
inequalities | volume and surface area
quadratic graphs | basic vectors
functions | other graphs
other graphs | straight line graphs
compound measures | forces and motion
transformation of functions | quadratic graphs
surds | functions
algebraic fractions | algebraic fractions
forces and motion | transformation of functions
0.A.3 Additional results on question quality
Figures 9 and 10 show additional images of high- and low-quality questions as determined by our model. By comparing the stacked portion plots of high-quality questions with those of low-quality questions, we can see that high-quality questions more effectively reveal differences in mastery of knowledge among students. For example, for high-quality questions, more capable students can answer them correctly whereas less capable students cannot. This can be seen from the diminishing portion of the top red part, which indicates the portion of students who answer the question correctly, from the right (more capable students) to the left (less capable students). In contrast, for lower-quality questions, almost all students can answer the question correctly. Thus, these lower-quality questions cannot effectively diagnose students’ capabilities.
Appendix 0.B Experimental settings
We include additional experiment settings.
0.B.1 Model Setting
Below, we present the exact model architecture that we use for all of our experiments. In the model description below, the words in parentheses are the identifiers for each module in the model; in particular, encoder is the inference model and decoder is the generation model. We use PyTorch (https://pytorch.org) for the actual implementation.

    P_VAE(
      (encoder): encoder(
        (enc): Linear(in_features=12, out_features=100, bias=True)
        (linear1): Linear(in_features=100, out_features=1024, bias=True)
        (linear2): Linear(in_features=1024, out_features=256, bias=True)
        (linear3): Linear(in_features=256, out_features=20, bias=True)
        (bn_feat1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
        (bn_feat2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
        (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
        (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
        (bn_out): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)
        (relu): ELU(alpha=1.0, inplace)
      )
      (decoder): decoder(
        (linear1): Linear(in_features=10, out_features=256, bias=True)
        (linear2): Linear(in_features=256, out_features=1024, bias=True)
        (linear6): Linear(in_features=1024, out_features=13369, bias=True)
        (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
        (bn2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
        (bn_out): BatchNorm1d(13369, eps=1e-05, momentum=0.1, affine=True)
        (relu): ELU(alpha=1.0, inplace)
      )
    )
0.B.2 Additional Training Settings
In addition to the settings in the main text, we also evaluate the model performance every epoch on the validation set and anneal the learning rate by a factor of 0.7 if the validation loss does not reduce for 5 consecutive epochs.
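The annealing rule can be sketched in plain Python as follows. The class name and the exact counting convention (the learning rate is multiplied by 0.7 as soon as 5 consecutive epochs pass without a new best validation loss, after which the counter resets) are assumptions made for illustration. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` offers similar behavior through its `factor` and `patience` arguments, although its patience counting differs slightly from this sketch.

```python
class PlateauAnnealer:
    """Multiply the learning rate by `factor` after `patience`
    consecutive epochs without a new best validation loss."""

    def __init__(self, lr=1e-3, factor=0.7, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Starting from a learning rate of 0.001, five consecutive epochs without improvement would reduce it to 0.0007 under this rule.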
0.B.3 Human Evaluation Settings
We include selected slides from the deck that we showed to the education expert to perform the human evaluations, including the difficulty and quality rankings. We present the slides for each task and explain what the evaluator’s tasks are.
Difficulty rankings.
Figures 11 and 12 show the slides that we present to the evaluator to rank the difficulty of schemes and topics, respectively. On the left side of each figure is a slide with the instructions. On the right side of each figure is a slide with boxes containing scheme or topic names. These boxes are originally ordered alphabetically. The evaluator is instructed (see the instruction slide) to drag the boxes and reorder them in increasing difficulty from top to bottom and from left to right. We find that dragging is an intuitive and user-friendly way for the evaluator to perform the ranking tasks.
Quality rankings.
Figure 13 shows the slides that we present to the evaluator to rank the quality of pairs of questions. On the left side of the figure is a page with the instructions. On the right side of the figure is an example slide with two questions (there are 40 such slides; see main text for details). The evaluator is instructed to choose, for each slide, a question among the two that has higher quality where quality is defined as “effectiveness of distinguishing good and not so good students” (see the instruction slide).