Online education platforms are transforming current education systems by providing new opportunities such as democratizing high-quality educational resources and personalizing learning experiences. In recent years, many online education platforms have been developed, in particular those that crowd-source a large volume of questions and exercises. The availability of such questions and exercises is a key advantage of these platforms:
Students can utilize them to learn and practice, while teachers can utilize them to customize quizzes to best understand and improve students’ learning. All of these potentially lead to improved learning outcomes. In this work, we focus on educational resources in the form of multiple-choice questions, which are one of the most common forms of quizzes in online education.
With such massive crowd-sourced questions, how to choose the ones to use is challenging because both teachers and students have limited time. Understanding the quality and difficulty levels of the questions will help teachers and students select which questions to use. It is challenging to require human experts to manually provide quality and difficulty labels to all the questions, which can be millions; an AI solution that automatically obtains the above insights is desired.
Therefore, we aim to develop a machine learning solution for large-scale online educational data analysis that provides insight into the difficulty and quality of each question. This task involves several challenges. First, online educational data is massive; both the number of questions and the number of students are extremely large. Second, the data has severe missingness, since each student can only answer an extremely small fraction of all available questions. Lastly, we need to design objectives to quantify and extract desired insights such as question quality and difficulty. Overall, we need a solution that is efficient, handles highly sparse data, and automatically acquires educationally meaningful insights about questions.
In this work, we use real-world online educational data in the form of students’ answers to multiple-choice questions and develop a machine learning framework to analyze the difficulty and quality of each question. We briefly summarize our framework below:
We develop a novel framework for educational question analysis based on partial variational auto-encoder (p-VAE) [ma2018partial, eddi] to efficiently handle partially observed data at a large scale. p-VAE models existing students’ answers and predicts the potential answer to unseen questions in a probabilistic manner.
We design a novel information-theoretic metric to quantify the quality of each question based on the observed data and p-VAE’s predictions. We also define a difficulty metric to quantify the difficulty of each question.
We evaluate our results not only using standard quantitative machine learning metrics but also with human experts. We empirically show that our framework is able to identify the quality and difficulty of the questions as consistently as human experts.
We analyze data from a real-world online education platform. This platform offers crowd-sourced multiple-choice questions to students from primary to high school (roughly between 7 and 18 years old). Each question has 4 possible answer choices, among which one answer choice is the correct answer. Currently, the platform mainly focuses on math questions. Figure 1 shows an example question from the platform. We use the data collected from the most recent school year (from September 2018 to May 2019). We organize the data in a matrix form where each row represents a student and each column represents a question. Each entry contains a number that represents a student’s answer to a question (i.e., 0 represents A, 1 represents B, etc.). We also know the correct answer to each of the questions; thus, we can obtain binary-valued information on whether the student has answered a question correctly (i.e., 0 represents a wrong answer and 1 represents a correct answer). Each student has only answered a tiny fraction of all questions and hence the matrix is extremely sparse. We thus removed questions that have fewer than 100 answers and students who have answered fewer than 100 questions. In addition, when a student has multiple answer records for the same question, we keep the latest answer record. The above preprocessing steps lead to a final data matrix that consists of roughly 6.3 million students’ answer records with 27,219 students (rows) and 13,369 questions (columns). The final data matrix remains highly sparse, with only roughly 1.7% of all entries in the matrix observed (6.3 million out of 27,219 × 13,369 ≈ 364 million entries).
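The preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout `(student, question, answer, timestamp)`, the function names, and the order of the two filters (questions first, then students, in a single pass) are all hypothetical choices that the text does not specify.

```python
from collections import Counter


def preprocess(records, min_count=100):
    """Build a sparse answer matrix from raw answer records.

    `records` is an iterable of (student, question, answer, timestamp)
    tuples; this record layout is an assumption made for illustration.
    """
    # Keep only the latest answer for each (student, question) pair.
    latest = {}
    for student, question, answer, timestamp in records:
        key = (student, question)
        if key not in latest or timestamp > latest[key][1]:
            latest[key] = (answer, timestamp)

    # Drop questions with fewer than `min_count` answers.
    per_question = Counter(question for _, question in latest)
    kept = {k: v for k, v in latest.items() if per_question[k[1]] >= min_count}

    # Then drop students with fewer than `min_count` remaining answers
    # (single pass; the filter order is an assumption).
    per_student = Counter(student for student, _ in kept)
    kept = {k: v for k, v in kept.items() if per_student[k[0]] >= min_count}

    # Sparse matrix as {(student, question): answer}; absent keys are missing.
    return {key: answer for key, (answer, _) in kept.items()}


def observed_fraction(num_records, num_students, num_questions):
    """Fraction of matrix entries that are observed."""
    return num_records / (num_students * num_questions)
```

With the figures reported above, `observed_fraction(6.3e6, 27219, 13369)` evaluates to a little over 0.017, which illustrates how sparse the matrix remains after filtering.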
The dataset also contains additional metadata about questions. Each question may be linked to one or more topics. Each topic covers an area of mathematics and the topics are hierarchically organized into levels of increasing granularity. The questions may also belong to one or more schemes. A scheme divides the school year into a series of topic units. Within each topic unit, there are two quizzes, one of which is assigned at the end of the topic unit and the other three weeks later. Teachers may change the order and length of topic units.
To gain insights into such real-world educational data, we first need a model that predicts the missing data (questions students did not answer) with uncertainty estimation. We then need to design different objectives to quantify question quality and difficulty.
We formulate the first step above as a missing data imputation (aka matrix completion) problem. We have a data matrix $X$ of size $N \times M$, where $N$ is the total number of students and $M$ is the total number of questions. Each entry $x_{ij}$ can be either binary, indicating whether student $i$ has answered question $j$ correctly, or categorical, indicating the answer choice that student $i$ selects for question $j$. The data matrix is only partially observed; we denote the observed part of the data matrix as $X_O$. We then want to predict the missing entries as accurately as possible in a probabilistic manner. To this end, we use the partial variational auto-encoder (p-VAE) [ma2018partial], which is the state-of-the-art method for the above imputation task.
The second step is to quantify the quality and difficulty of each question based on p-VAE’s predictions. Specifically, we quantify question quality by measuring the value of information that each question carries using an information-theoretic objective.
We can also use the complete matrix, with missing entries replaced by p-VAE’s predictions, to gain insights on question difficulty. We present these two steps in detail in the remainder of this section.
3.1 Partial Variational Auto-encoder (p-VAE) for Students’ Answers Prediction
p-VAE is a deep latent variable model that extends traditional VAEs [kingma2013auto, Rezende2014StochasticBA, zhang2017advances]. VAEs assume that the data matrix $X$ is generated from a number of local latent variables $z_i$:
$$p(X) = \prod_{i} p(x_i) = \prod_{i} \int p(z_i) \prod_{j} p_\theta(x_{ij} \mid z_i) \, dz_i,$$
where $x_i$ is the $i$-th student’s answers and $x_{ij}$ is the $i$-th student’s answer to the $j$-th question. We use a deep neural network for the generative model $p_\theta(x_{ij} \mid z_i)$ because of its expressive power. Of course, $x_i$ contains missing entries because each student only answers a small fraction of all questions. Unfortunately, VAEs can only model fully observed data. To model partially observed data, p-VAE extends traditional VAEs by exploiting the fact that, given $z_i$, $x_i$ is fully factorized. Thus, the unobserved data entries can be predicted given the inferred $z_i$’s. Concretely, p-VAE optimizes the following partial evidence lower bound (ELBO) on the observed answers $x_{O_i}$ of each student $i$:
$$\log p(x_{O_i}) \geq \mathbb{E}_{z_i \sim q_\phi(z_i \mid x_{O_i})}\big[\log p_\theta(x_{O_i} \mid z_i)\big] - \mathrm{KL}\big[q_\phi(z_i \mid x_{O_i}) \,\|\, p(z_i)\big],$$
which is in the same form as the ELBO for VAE but only over the observed part of the data.
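To make the "observed part only" aspect of the partial ELBO concrete, the sketch below evaluates a single-sample estimate of the bound for one student with binary answers. It assumes a Bernoulli likelihood, a standard normal prior, and a diagonal Gaussian posterior parameterized by a mean and a log-variance; the function names and the use of `None` to mark missing answers are illustrative choices, not taken from the paper.

```python
import math


def bernoulli_log_lik(x, p, eps=1e-7):
    """log p(x | z) for a binary answer x given the decoder probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return x * math.log(p) + (1 - x) * math.log(1.0 - p)


def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def partial_elbo(x, probs, mu, log_var):
    """Single-sample estimate of the partial ELBO for one student.

    x       : answers, 0/1 if observed and None if missing
    probs   : decoder outputs p(x_j = 1 | z) for one sampled z
    mu, log_var : parameters of the Gaussian posterior over z
    """
    # Only observed entries contribute to the reconstruction term.
    recon = sum(
        bernoulli_log_lik(xj, pj) for xj, pj in zip(x, probs) if xj is not None
    )
    return recon - kl_to_standard_normal(mu, log_var)
```

Because missing entries are skipped in the reconstruction term, the decoder's output for an unanswered question has no effect on the bound, which is what allows training on a highly sparse matrix.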
The challenge is to approximate the posterior of the $z_i$’s using a partially observed data vector. p-VAE uses a set-based inference network $q_\phi(z_i \mid x_{O_i})$, where $x_{O_i}$ is the observed subset of answers for student $i$ [zaheer2017deep, qi2017pointnet]. $q_\phi(z_i \mid x_{O_i})$ is assumed to be Gaussian. Concretely, the mean and variance of the posterior of the latent variable are inferred as
$$[\mu, \sigma^2] = f\big(g(\{s_j : j \in O\})\big),$$
where we have dropped the student index for notational simplicity. $s_j$ is the observed answer value augmented by its location embedding, which is learned; $g(\cdot)$ is a permutation-invariant transformation, such as summation, which outputs a fixed-size vector; and $f(\cdot)$ is a regular feedforward neural network. In this paper, $s_j$ combines the answer value $x_j$ with $e_j$ and $b_j$, which are learnable parameters that embed the identity of each question.
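The set-based inference step can be illustrated with a small sketch. Everything specific here is an assumption: the augmented input is taken to be the concatenation of the answer value, the answer-scaled embedding, and a per-question bias, the pooling is a plain sum, and a single linear layer stands in for the feedforward network. The point of the sketch is only that the output does not depend on the order in which the observed answers are presented.

```python
import random


def make_question_params(num_questions, emb_dim, seed=0):
    """Random stand-ins for the learnable per-question embedding and bias."""
    rng = random.Random(seed)
    emb = [[rng.gauss(0, 1) for _ in range(emb_dim)] for _ in range(num_questions)]
    bias = [rng.gauss(0, 1) for _ in range(num_questions)]
    return emb, bias


def encode_observed(observed, emb, bias, weights):
    """Map a set of observed answers to posterior parameters.

    observed : list of (question_index, answer_value) pairs
    weights  : rows of a linear layer standing in for the feedforward net;
               each row has length emb_dim + 2
    """
    pooled = [0.0] * (len(emb[0]) + 2)
    for j, x in observed:
        # Augment the answer with question identity (assumed form).
        s_j = [x] + [x * e for e in emb[j]] + [bias[j]]
        # Permutation-invariant pooling: element-wise sum.
        pooled = [p + v for p, v in zip(pooled, s_j)]
    out = [sum(w * p for w, p in zip(row, pooled)) for row in weights]
    half = len(out) // 2
    return out[:half], out[half:]  # (mean, log-variance)
```

Because the pooled vector has a fixed size regardless of how many questions a student answered, the same network handles students with very different numbers of observed answers.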
Note that, in p-VAE, some parameters have natural interpretations. For example, the per-question parameters $e_j$ and $b_j$ can be collectively interpreted as a question embedding for each question $j$. The per-student latent variable $z_i$ can be interpreted as a student embedding for each student $i$.
3.2 Question Analysis Using Predicted Student Performance
Using the student performance predicted by p-VAE, we design a novel information-theoretic objective to quantify question quality and use statistical tools to analyze question difficulty.
Question Quality Quantification.
Working closely with an education expert, we found that high-quality questions are considered to be those that best differentiate the student’s ability. When a question is simple, almost all students will answer it correctly. When a question is badly formulated or too difficult, all students will provide incorrect answers or random guesses. In any of these cases, the question neither helps the teacher gain insights about the students’ abilities nor helps students learn well. Thus, high-quality questions are the ones that can differentiate the students’ abilities.
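We thus formulate the following information-theoretic objective to quantify the quality of question $j$:
$$R(j) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{x_{ij} \sim p(x_{ij} \mid x_{O_i})} \Big[ \mathrm{KL}\big[ p(z_i \mid x_{O_i}, x_{ij}) \,\|\, p(z_i \mid x_{O_i}) \big] \Big], \tag{2}$$
where $j$ is the question index; $x_{O_i}$ denotes the $i$-th student’s observed answers; $x_{ij}$ is the $i$-th student’s answer to the $j$-th question, which can be either binary, indicating whether the student has answered it correctly, or categorical, indicating the student’s answer choice for this question; and $z_i$ is the latent embedding of the $i$-th student. The $i$-th student’s ability can be determined by the student’s possible performance on all questions, which can be inferred given $z_i$.
This objective measures the information gain of estimating the student’s ability by conditioning on the answer to question $j$. Thus, when $R(j)$ is large, the question is more informative for differentiating student ability, and thus it is considered a high-quality question.
In practice, we compute Eq. 2 using $q_\phi(z_i \mid \cdot)$ to approximate $p(z_i \mid \cdot)$ with Monte Carlo integration [eddi, gong2019icebreaker]:
$$R(j) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \mathrm{KL}\big[ q_\phi(z_i \mid x_{O_i}, \hat{x}_{ij}^{(k)}) \,\|\, q_\phi(z_i \mid x_{O_i}) \big], \quad \hat{x}_{ij}^{(k)} \sim \hat{p}(x_{ij} \mid x_{O_i}), \tag{3}$$
where $K$ denotes the number of Monte Carlo samples and $\hat{p}$ is p-VAE’s predictive distribution. The KL term can be computed easily in closed form thanks to the Gaussian assumptions in VAEs, i.e., the distributions $q_\phi(z_i \mid x_{O_i}, \hat{x}_{ij}^{(k)})$ and $q_\phi(z_i \mid x_{O_i})$ are Gaussian [kingma2013auto].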
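The Monte Carlo estimate of question quality can be sketched as below. The closed-form KL divergence between diagonal Gaussians is standard; the rest is a hypothetical interface: `encode` stands for the trained inference network (returning a mean and a variance), `sample_answer` stands for drawing an answer from the model's predictive distribution, and averaging over students is an assumption about how per-student information gains are aggregated.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))] in closed form."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def question_quality(j, students, encode, sample_answer, num_samples=10):
    """Monte Carlo estimate of the information gain of question j.

    students      : list of dicts {question_index: answer} (observed answers)
    encode        : callable, dict -> (mean, variance) of the posterior over z
    sample_answer : callable, (dict, j) -> sampled answer to question j
    """
    total = 0.0
    for observed in students:
        mu_p, var_p = encode(observed)  # posterior before seeing question j
        for _ in range(num_samples):
            x_hat = sample_answer(observed, j)
            # Posterior after additionally conditioning on the sampled answer.
            mu_q, var_q = encode({**observed, j: x_hat})
            total += kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
    return total / (len(students) * num_samples)
```

A question whose sampled answers barely move the posterior yields a value near zero, while a question whose answers shift or sharpen the posterior yields a larger value.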
Question Difficulty Quantification.
For a group of questions answered by the same group of students, the difficulty level of the questions can be quantified by the incorrect rate of all students’ answers. However, for real-world online education data, every question is answered only by a small fraction of students and by different subsets of students with different educational backgrounds. Thus, directly comparing the difficulty levels of the questions from observational data is not accurate, because an easy question that is answered only by weak students may appear to be difficult if only observational data is used. Thanks to p-VAE, we can predict whether a student can correctly answer an unseen question. We achieve this by defining the difficulty level of question $j$ as $D(j) = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_{ij}$, where $\hat{x}_{ij}$ denotes the predicted student’s answer.
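The selection effect described above can be seen in a toy example. The sketch below fills the missing entries of a binary correctness matrix with predictions and compares per-question averages before and after completion. The function names and the `None` convention for missing answers are illustrative; the score is the average correctness, so under the binary encoding (1 for correct) a lower value corresponds to a harder question, and one minus the score gives the incorrect rate.

```python
def complete_matrix(observed, predicted):
    """Keep observed answers and fill missing ones (None) with predictions."""
    return [
        [p if o is None else o for o, p in zip(obs_row, pred_row)]
        for obs_row, pred_row in zip(observed, predicted)
    ]


def mean_correctness(matrix):
    """Average each column over the entries that are not None."""
    scores = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix if row[j] is not None]
        scores.append(sum(column) / len(column) if column else None)
    return scores


# Toy data: question 1 was attempted only by the weakest student.
observed = [[1, None], [1, None], [0, 0]]
predicted = [[1, 1], [1, 1], [0, 0]]  # hypothetical model predictions

observed_only = mean_correctness(observed)
completed = mean_correctness(complete_matrix(observed, predicted))
```

Using observed answers alone, question 1 looks as hard as possible because its only respondent answered incorrectly; after completion, its score matches that of question 0, since the two stronger students are predicted to answer it correctly.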
In this section, we demonstrate the applicability of our framework on the real-world educational dataset that we have introduced in Section 2. We first describe the experiment setup, including the human evaluation procedure, then characterize the model’s prediction performance, and finally showcase the suite of analytics, including the quality and difficulty of questions.
We split the students (rows of the data matrix) into train, validation and test sets with an 80:10:10 ratio. Therefore, students that are in the test set are never seen in the training set. We train the model on the train set for 25 epochs using the Adam optimizer [kingma2014adam] with a learning rate of 0.001. We train p-VAE on binary students’ answer records (correct or incorrect answers). To evaluate imputation performance, we supply the trained p-VAE model a subset of the test set as input and compute the model’s prediction accuracy on the rest of the test set. We use prediction accuracy as the evaluation metric. To compute question difficulty, we use all available data in binary format for p-VAE to perform imputation.
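The row-wise split and the held-out evaluation can be sketched as follows. The 80:10:10 ratio comes from the text; the random seed, the fraction of each test student's answers that is revealed to the model (`input_fraction`), and the `predict` callable are assumptions, since the text does not specify them.

```python
import random


def split_students(student_ids, seed=0):
    """Shuffle students and split them 80:10:10 into train/validation/test."""
    ids = list(student_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 80 // 100
    n_val = len(ids) * 10 // 100
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def holdout_accuracy(test_students, predict, input_fraction=0.5, seed=0):
    """Reveal part of each test student's answers and score the rest.

    test_students : list of dicts {question: 0 or 1}
    predict       : callable, (revealed answers, question) -> 0 or 1
    """
    rng = random.Random(seed)
    correct = total = 0
    for answers in test_students:
        questions = list(answers)
        rng.shuffle(questions)
        n_in = int(input_fraction * len(questions))
        revealed = {q: answers[q] for q in questions[:n_in]}
        for q in questions[n_in:]:
            correct += int(predict(revealed, q) == answers[q])
            total += 1
    return correct / total if total else float("nan")
```

Splitting by rows rather than by individual entries is what guarantees that test students are entirely unseen during training.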
Human Evaluation Procedure.
We compare our model’s analytics with those of an education expert to examine the degree of agreement between our model’s outputs and the human expert’s judgements. Our evaluator is a senior and highly respected math teacher who has no prior information about this work. We ask him to evaluate both question quality and question difficulty. For question quality, we resort to pairwise comparison, i.e., we give the evaluator a pair of questions and ask the evaluator to give a preference on which question is of higher quality. We then compute the number of times our model agrees with the expert’s choice of quality. Because there are more than ten thousand questions and thus many more possible pairs, we sample 40 pairs. The sampling is designed to ensure that the two questions presented to the evaluator at the same time are not of the same quality.
For question difficulty, we ask the evaluator to rank the difficulty of topics and schemes, respectively. We then compute the Spearman correlation coefficient as a measure of the level of agreement between the model’s and the expert’s difficulty rankings.
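For reference, the Spearman correlation between two complete rankings without ties reduces to a simple formula on rank differences. The sketch below computes it directly from two orderings of the same items; the function name and input format are illustrative, and libraries such as SciPy offer a general implementation that also handles ties.

```python
def spearman_from_orderings(order_a, order_b):
    """Spearman correlation between two tie-free orderings of the same items.

    Each argument lists the same items, ordered from least to most difficult.
    """
    if len(order_a) < 2:
        raise ValueError("need at least two items")
    if len(set(order_a)) != len(order_a) or set(order_a) != set(order_b) \
            or len(order_a) != len(order_b):
        raise ValueError("orderings must contain the same distinct items")
    n = len(order_a)
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    # Sum of squared rank differences.
    d2 = sum((rank - rank_b[item]) ** 2 for rank, item in enumerate(order_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Identical orderings give a coefficient of 1, fully reversed orderings give -1, and values near 0 indicate no monotonic relationship between the two rankings.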
4.1 Students’ Answer Prediction
Table 1 shows the accuracy of p-VAE trained on binary answer records compared to various baselines, including mean imputation, majority imputation, Linear MICE [JSSv045i03], Missforest [missforest], and an ExtraTree variation of Missforest. We did not compare to regular VAEs because, as mentioned earlier, regular VAEs do not handle partially observed data matrices. Given the same split of the data as described earlier, these baselines (except for mean/majority imputation) are no longer practical because of their high computational complexity; even with a linear method, the computational cost grows quickly with the number of questions $M$ (recall that we have more than 10 thousand questions). Thus, we further downsample the questions for the correspondingly marked methods in Table 1 to make the comparison possible. We can see from Table 1 that p-VAE clearly outperforms all baselines by a significant margin. Moreover, regarding the size of the data, among all prediction-based methods, p-VAE is the only method that scales to such data size thanks to its efficient, amortized inference.
4.2 Question Quality Quantification
We compute question quality according to Eq. 3. Fig. 2 shows two examples of high-quality questions (top row) and two examples of low-quality questions (bottom row) determined by the model. For each pair, the left image shows the actual question, and the right image shows the stacked portion plot. The stacked portion plot shows the percentage of students in different ability ranges (computed using the complete matrix imputed by p-VAE) who have answered the question correctly (the correct answer choice is always at the top, i.e., the red part of the plot, and the remaining colors are the remaining three incorrect answer choices). The stacked portion plot is produced using the observed students’ answer choices to the questions (i.e., A, B, C, or D choices).
In addition to the question content itself, we can gain some insights by examining and comparing the stacked portion plots. For example, we can see that high-quality questions can better test the variability in students’ abilities because fewer students with a lower ability score can answer them correctly, whereas more students with a higher ability score can answer them correctly. This phenomenon is not present in lower-quality questions, where most of the students, regardless of their ability score, tend to answer them correctly.
Domain expert evaluation.
We further confirm the above intuition about high- and low-quality questions by comparing our model’s ranking of questions in terms of their quality scores with the expert’s ranking. Fig. 3 illustrates the evaluation interface that we presented to the evaluator. Among the 40 pairs of questions that we give to the expert to rank, 32 of the expert’s rankings agree with the model’s rankings, yielding an 80% agreement. Although the sample size is rather small, the high agreement between the model’s and expert’s rankings is encouraging and shows our framework’s promise for application to real-world educational scenarios.
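The agreement rate reported above is the fraction of pairs for which the model's preferred question matches the expert's. A minimal sketch follows, assuming the model prefers whichever question in a pair has the higher quality score; the data layout and function name are hypothetical.

```python
def pairwise_agreement(quality, pairs):
    """Fraction of pairs where the model's preference matches the expert's.

    quality : dict {question: model quality score}
    pairs   : list of (question_a, question_b, expert_choice) tuples
    """
    agree = 0
    for q_a, q_b, expert_choice in pairs:
        # The model prefers the question with the higher quality score.
        model_choice = q_a if quality[q_a] > quality[q_b] else q_b
        agree += int(model_choice == expert_choice)
    return agree / len(pairs)
```

With 32 matching pairs out of 40, this ratio is 32 / 40 = 0.8, the 80% agreement stated in the text.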
4.3 Question Difficulty Quantification
With the complete data matrix imputed by p-VAE, we compute question difficulty by taking the average of all students’ answers including observed and predicted answers. Instead of reporting the difficulty score of each individual question, we show difficulties of all schemes and topics that cover all questions in the data set. This allows for better visualization and interpretation. The difficulty score of each scheme or topic is computed by simply averaging the difficulty scores of all questions that belong to the same scheme or topic.
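The aggregation from questions to schemes or topics can be sketched as a grouped average. The dictionaries below are hypothetical stand-ins for the platform's metadata; a question may belong to several groups, in which case its score contributes to each of them. The ranking helper assumes that the score is an average correctness, so that lower scores indicate harder groups.

```python
from collections import defaultdict


def group_difficulty(question_scores, question_groups):
    """Average per-question scores within each scheme or topic.

    question_scores : dict {question: difficulty score}
    question_groups : dict {question: list of scheme or topic names}
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for question, score in question_scores.items():
        for group in question_groups.get(question, []):
            totals[group] += score
            counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


def rank_hardest_first(group_scores):
    """Order groups from hardest to easiest, assuming the score is an
    average correctness (lower score = harder)."""
    return sorted(group_scores, key=group_scores.get)
```

Averaging within groups trades per-question detail for a view that is easier to visualize and to compare against an expert's ranking of the same groups.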
Table 2: Spearman correlation coefficients between the expert’s rankings and each method’s rankings.
|Method||Scheme Rank||Topic Rank|
|random||0.22 ± 0.11||0.09 ± 0.11|
Fig. 4 shows the scheme difficulties, arranged from the most to the least difficult schemes from left to right. The difficulty trend agrees with intuition. For example, collections with the word “Higher” in their names are intended for more advanced students (i.e., high school students) and they appear on the left side of the plot (i.e., more difficult). Collections with the word “foundation” are intended for less advanced students and they appear in the middle and right side of the plot (i.e., less difficult).
Domain expert evaluation.
Table 2 shows the Spearman correlation coefficients comparing the expert’s topic and scheme rankings, respectively, to the rankings of our model and of the baselines. The baselines include random ordering, using majority imputation to fill the data matrix, and using the observed data alone. We see that our model’s ranking closely matches the human expert’s ranking, while the baselines do not produce rankings that are anywhere close to the expert’s ranking. These two tables showcase the potential applicability of our model because our model can produce analytics that are close to the expert’s judgments.
5 Related Work
Our work appears to be most similar to [vae_aied_2019] but there are several major differences. First and foremost, the methods used in our work and that in [vae_aied_2019] are different. The partial VAE introduced in our work is designed to handle partially observed data which is the primary reason we apply partial VAE to educational data. In contrast, [vae_aied_2019] simply used existing, standard (variational) auto-encoders that are incapable of effectively handling partially observed data. Second, our work considers a more realistic and more difficult data set and experiment setting than [vae_aied_2019]. Experiments in [vae_aied_2019] are all based on small, simulated data set with only 10000 students and 28 questions. In contrast, our paper’s experiments are based on large-scale, real-world data set with more than 25000 students and more than 10000 questions. Lastly, our work produces more relevant and applicable results than [vae_aied_2019] because all of our results are based on real data while results in [vae_aied_2019] are based on simulated data.
Prior work also considers the problem of question quality assessment, where the authors classify the questions into one of four categories ranging from very shallow to very deep using a human-labeled data set of roughly 4000 questions. In contrast, our method infers question quality directly from students’ performances on questions and thus does not need any annotations or labels. [question_quality_aied2018] and [question_quality_aied2017] both focus on question quality assessment. However, their works focus on questions used in specialized systems, while our method can potentially be applied and integrated into generic educational platforms. Thus, our method complements prior work and is potentially more practical and applicable in real-world scenarios. Our work also appears to be similar in technical approach to [vae_edm_2017], but the problem setting in our work is distinct from that in [vae_edm_2017]. Our work also develops a new VAE framework and an information-theoretic metric for question quality, which [vae_edm_2017] does not consider.
Some work attempts to mine question insights from data. [inferring_faq_edm_2017] develops a method to extract frequently asked questions from question-answer forum data. [authentic_questions_edm2018] discovers teachers’ use of authentic questions. [las2019_1] also performs assessment on questions (assignments) using NLP techniques. Our work differs from the above in that we obtain question quality analytics, which is a different form of information about questions. Furthermore, we use the variational auto-encoder framework instead of resorting to NLP techniques. Thus, our work complements prior work that obtains question insights.
In this paper, we develop a framework to analyze questions in online education platforms on a large scale. Our framework combines the recently proposed partial variational auto-encoder (p-VAE) for efficiently processing large-scale, partially observed educational datasets, and novel information-theoretic objectives for automatically producing a suite of actionable insights about quiz questions. We demonstrate the applicability of our framework on a large-scale, real-world educational dataset, showcasing the rich and interpretable information, including question difficulty and quality, that our framework uncovers from millions of students’ answer records to multiple-choice questions.
Further improving our framework is part of ongoing research. One extension is to customize the information-theoretic criteria such that they can be flexibly designed to extract various other information of interest. Another extension is to adapt the p-VAE model for time-series data, where we can work with the more realistic yet challenging assumption that students’ states of knowledge change over time.
Appendix A Additional Results
We provide additional results in the appendix. These results provide more information on the data set and further verification of the effectiveness of our framework.
A.1 Additional data set statistics
A.2 Full Scheme and Topic Rankings
Tables 2 and 3 compare the complete rankings of scheme difficulty and topic difficulty, respectively, from the education expert and our model. We can see from both tables that our model’s rankings have a strong correlation with the expert’s rankings; by simple inspection, the expert’s and the model’s rankings in both tables show a very similar trend in terms of increasing difficulty of topics and schemes.
|expert’s ranking||model’s ranking|
|White Rose Maths Hub||White Rose Maths Hub|
|OCR Foundation||AQA Foundation|
|Edexcel Foundation||Eedi Maths GCSE Foundation|
|Eedi Maths GCSE Foundation||Eedi Maths iGCSE Core|
|Eedi Maths iGCSE Core||OCR Foundation|
|OCR Higher||Edexcel Foundation|
|Edexcel Higher||Eedi Maths GCSE Higher|
|Eedi Maths GCSE Higher||Eedi Maths iGCSE Extension|
|AQA Higher||Edexcel Higher|
|Eedi Maths iGCSE Extension||OCR Higher|
|expert’s ranking||model’s ranking|
|2d names and properties||money|
|rounding and estimating||symmetry|
|basic arithmetic||negative numbers|
|calculator use||calculator use|
|units of measurement||algebra-others|
|data collection||factors, multiples and primes|
|data processing||units of measurement|
|factors, multiples and primes||fractions, decimals and percentage equivalence|
|perimeter and area||coordinates|
|number-others||indices powers and roots|
|writing and simplifying expressions||data representation|
|negative numbers||rounding and estimating|
|angles||perimeter and area|
|sequences||writing and simplifying expressions|
|factorising||2d names and properties|
|solving equations||data processing|
|fractions, decimals and percentage equivalence||probability|
|construction, loci and scale drawing||solving equations|
|data representation||similarity and congruency|
|indices powers and roots||formula|
|basic vectors||compound measures|
|proportion||construction, loci and scale drawing|
|similarity and congruency||basic trigonometry|
|straight line graphs||inequalities|
|inequalities||volume and surface area|
|quadratic graphs||basic vectors|
|other graphs||straight line graphs|
|compound measures||forces and motion|
|transformation of functions||quadratic graphs|
|algebraic fractions||algebraic fractions|
|forces and motion||transformation of functions|
A.3 Additional results on question quality
Figures 9 and 10 show additional images of high- and low-quality questions as determined by our model. By comparing the stacked portion plots of high-quality questions with those of low-quality questions, we can see that high-quality questions more effectively tell the difference in mastery of knowledge among students. For example, for high-quality questions, more capable students can answer them correctly whereas less capable students cannot. This can be seen from the diminishing portion of the top red part, which indicates the portion of students who answer the question correctly, from right (more capable students) to left (less capable students). On the contrary, for lower-quality questions, almost all students can answer the question correctly. Thus, these lower-quality questions cannot effectively diagnose students’ capabilities.
Appendix B Experimental settings
We include additional experiment settings.
B.1 Model Setting
Below, we present the exact model architecture that we use for all of our experiments. In the model description below, the words in parentheses are the identifiers for each module in the model; in particular, encoder is the inference model and decoder is the generation model. We use PyTorch (https://pytorch.org) for the actual implementation.
P_VAE(
  (encoder): encoder(
    (enc): Linear(in_features=12, out_features=100, bias=True)
    (linear1): Linear(in_features=100, out_features=1024, bias=True)
    (linear2): Linear(in_features=1024, out_features=256, bias=True)
    (linear3): Linear(in_features=256, out_features=20, bias=True)
    (bn_feat1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
    (bn_feat2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True)
    (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
    (bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
    (bn_out): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)
    (relu): ELU(alpha=1.0, inplace)
  )
  (decoder): decoder(
    (linear1): Linear(in_features=10, out_features=256, bias=True)
    (linear2): Linear(in_features=256, out_features=1024, bias=True)
    (linear6): Linear(in_features=1024, out_features=13369, bias=True)
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
    (bn2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
    (bn_out): BatchNorm1d(13369, eps=1e-05, momentum=0.1, affine=True)
    (relu): ELU(alpha=1.0, inplace)
  )
)
B.2 Additional Training Settings
In addition to the settings in the main text, we also evaluate the model performance every epoch on the validation set and anneal the learning rate by a factor of 0.7 if the validation loss does not reduce for 5 consecutive epochs.
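The annealing rule can be sketched in a few lines of plain Python. The factor of 0.7, the 5-epoch window, and the initial learning rate of 0.001 come from the text; comparing each validation loss against the best loss seen so far and resetting the counter after each reduction are assumptions about details the text leaves open.

```python
def anneal_learning_rate(val_losses, lr=0.001, factor=0.7, patience=5):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the validation loss has not
    improved on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0  # start a fresh window after annealing
        history.append(lr)
    return history
```

PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau`, which implements a similar plateau-based rule with `factor` and `patience` arguments; whether the authors used it is not stated.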
B.3 Human Evaluation Settings
We include selected slides from the deck that we showed to the education expert to perform the human evaluations, including the difficulty and quality rankings. We present the slides for each task and explain what the evaluator’s tasks are.
Figures 11 and 12 show the slides that we present to the evaluator to rank the difficulty of schemes and topics, respectively. On the left side of each figure is a slide with the instructions. On the right side of each figure is a slide with boxes containing scheme or topic names. These boxes are originally ordered alphabetically. The evaluator is instructed (see the instruction slide) to drag the boxes and reorder them in increasing difficulty from top to bottom and from left to right. We find that dragging is an intuitive and user-friendly way for the evaluator to perform the ranking tasks.
Figure 13 shows the slides that we present to the evaluator to rank the quality of pairs of questions. On the left side of the figure is a page with the instructions. On the right side of the figure is an example slide with two questions (there are 40 such slides; see main text for details). The evaluator is instructed to choose, for each slide, a question among the two that has higher quality where quality is defined as “effectiveness of distinguishing good and not so good students” (see the instruction slide).