1 Introduction
The rapid development of large-scale learning platforms (e.g., MOOCs (edx.org, coursera.org) and OpenStax Tutor (openstaxtutor.org)) has enabled not only access to high-quality learning resources for a large number of students, but also the collection of student data at very large scale. The scale of this data presents a great opportunity to revolutionize education by using machine learning algorithms to automatically deliver personalized analytics and feedback to students and instructors in order to improve the quality of teaching and learning.
1.1 Detecting misconceptions from student-response data
The predominant form of student data, their responses to assessment questions, contains rich information on their knowledge. Analyzing why a student answers a question incorrectly is crucial for delivering timely and effective feedback. Among the possible causes of an incorrect answer, exhibiting one or more misconceptions is critical: once a misconception is detected, targeted feedback can be provided to help the student correct it in a timely manner. Examples of using misconceptions to improve teaching include incorporating misconceptions to design better distractors for multiple-choice questions [12], implementing a dialogue-based tutor to detect misconceptions and provide corresponding feedback to help students self-practice [23], preparing prospective instructors by examining the causes of common misconceptions among students [22], and incorporating misconceptions into item response theory (IRT) for learning analytics [21].
The conventional way of leveraging misconceptions is to rely on a set of predefined misconceptions provided by domain experts [6, 12, 22, 23]. However, this approach is not scalable, since it requires a large amount of human effort and is domain-specific. With the large scale of student data at our disposal, a more scalable approach is to automatically detect misconceptions from data.
Recently, researchers have developed approaches for data-driven misconception detection; most of these approaches analyze students’ responses to multiple-choice questions. Examples of these approaches include detecting misconceptions in multiple-choice mathematics questions and modeling students’ progress in correcting them [11] via the additive factor model [3], and clustering students’ responses across a number of multiple-choice physics questions [24]. However, multiple-choice questions have been shown to be inferior to open-response questions in terms of pedagogical value [9]. Indeed, students’ responses to open-response questions can offer deeper insights into their knowledge state.
To date, detecting misconceptions from students’ responses to openresponse questions has largely remained an unexplored problem. A few recent developments work exclusively with structured responses, e.g., sketches [20], short mathematical expressions [13], group discussions in a chemistry class [19], and algebra with simple syntax [4].
1.2 Contributions
In this paper, we propose a natural language processing framework that detects students’ common misconceptions from their textual responses to open-response, short-answer questions. This problem is very difficult, since the responses are, in general, unstructured.
Our proposed framework consists of the following steps. First, we transform students’ textual responses to a number of short-answer questions into low-dimensional textual feature vectors using several well-known word-vector embeddings. These tools include the popular Word2Vec embedding [14], the GLOVE embedding [18], and an embedding based on the long short-term memory (LSTM) neural network [17, 7]. We then propose a new statistical model that jointly models both the transformed response textual feature vectors and expert labels on whether a response exhibits one or more misconceptions; these labels identify only whether or not a response exhibits one or more misconceptions but not which misconception it exhibits. Our model uses a series of latent variables: the feature vectors corresponding to the correct response to each question, the feature vectors corresponding to each misconception, the tendency of each student to exhibit each misconception, and the confusion level of each question on each misconception. We develop a Markov chain Monte Carlo (MCMC) algorithm for parameter inference under the proposed statistical model. We experimentally validate the proposed framework on a real-world educational dataset collected from high school classes on AP Biology.
Our experimental results show that the proposed framework excels at classifying whether a response exhibits one or more misconceptions compared to standard classification algorithms and significantly outperforms a baseline random forest classifier. We also compare the prediction performance across all three embeddings. More importantly, we show examples of common misconceptions detected from our dataset and discuss how this information can be used to deliver targeted feedback to help students correct their misconceptions.
2 Dataset and preprocessing
In this section, we first detail our short-answer response dataset, and then detail our preprocessing approach to convert responses into vectors using word-to-vector embeddings.
2.1 Dataset
Our dataset consists of students’ textual responses to short-answer questions in high school classes on AP Biology administered on OpenStax Tutor [16]. Every response was labeled by an expert grader as to whether it exhibited one or more misconceptions. Each student responded to a subset of the questions, and each response was manually labeled by one or multiple expert graders. Since there is no clear rubric defining what a misconception is, graders might not necessarily agree on what label to assign to each response. Therefore, we trim the dataset to keep only responses that were labeled by multiple graders who all assigned the same label. We further trim the dataset by filtering out students who responded to fewer than 5 questions and questions with fewer than 5 responses.
The questions in our dataset are drawn from the OpenStax AP biology textbook; we divide the full dataset into smaller subsets corresponding to each of the first four units [15], since different units correspond to entirely different subareas in biology. These units cover the following topics:

Unit 1: The Chemistry of Life, Chapters 1–3

Unit 2: The Cell, Chapters 4–10

Unit 3: Genetics, Chapters 11–17

Unit 4: Evolutionary Processes, Chapters 18–20
To summarize, we show the dimensions of the subsets of the data corresponding to each unit in Table 1. Since not every student was assigned to every question, the dataset is sparsely populated; Table 1 also shows the portion of responses that are observed in the trimmed data subsets, denoted as “sparsity”.
          Students   Questions   Sparsity
Unit 1    47         77          0.280
Unit 2    101        104         0.243
Unit 3    73         91          0.236
Unit 4    43         75          0.315
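The trimming rule and the sparsity figure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the `(student, question)` pair representation are hypothetical, and repeating the filter until nothing changes is an assumption (the text does not say whether the filter is applied once or iteratively).

```python
from collections import Counter


def trim_responses(responses, min_count=5):
    """Keep (student, question) pairs whose student and question each have
    at least `min_count` responses. Repeating until stable is an assumption."""
    kept = set(responses)
    while True:
        per_student = Counter(s for s, _ in kept)
        per_question = Counter(q for _, q in kept)
        filtered = {
            (s, q) for s, q in kept
            if per_student[s] >= min_count and per_question[q] >= min_count
        }
        if filtered == kept:
            return kept
        kept = filtered


def observed_fraction(responses):
    """Fraction of the student-by-question grid that holds a response."""
    students = {s for s, _ in responses}
    questions = {q for _, q in responses}
    return len(responses) / (len(students) * len(questions))
```

As a consistency check on the table, the Unit 4 row gives 43 × 75 = 3225 possible responses, and the 1016 responses reported for that subset later in the paper correspond to 1016 / 3225 ≈ 0.315.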
2.2 Response embeddings
We first perform a preprocessing step by transforming each textual student response into a corresponding real-valued vector via three different word-vector embeddings. Our first embedding uses the Word2Vec embedding [14] trained on the OpenStax Biology textbook (an approach also mentioned in [2]), to learn embeddings that put more emphasis on the technical vocabulary specific to each subject. We create the feature vector for each response by mapping each individual word in the response to its corresponding feature vector, and then adding them together. Concretely, we take the collection of words in the textual response of a student to a question (excluding common stopwords), map each word to its corresponding feature vector using the trained Word2Vec model, and compute the student response feature vector as the sum of these word vectors.
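The summed-embedding construction can be sketched in a few lines. The stopword list, the toy three-dimensional word vectors, and the choice to skip out-of-vocabulary words are all assumptions made for illustration; a trained Word2Vec model would supply the real vectors.

```python
# Hypothetical stopword list and toy vectors, for illustration only.
STOPWORDS = {"if", "then", "the", "a", "to"}
TOY_VECTORS = {"x": [1.0, 0.0, 2.0], "y": [0.0, 1.0, -1.0]}


def embed_response(text, word_vectors, stopwords=STOPWORDS):
    """Sum the vectors of all non-stopword tokens in a response."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in text.lower().split():
        if token in stopwords:
            continue
        vec = word_vectors.get(token)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are skipped
        total = [t + v for t, v in zip(total, vec)]
    return total
```

Because the response vector is a plain sum, it does not depend on the order of the words: `embed_response("If X then Y", TOY_VECTORS)` and `embed_response("If Y then X", TOY_VECTORS)` return the same vector.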
Our second word-vector embedding is a pretrained GLOVE embedding [18]. The GLOVE embedding is very similar to the Word2Vec embedding, with the main difference being that it takes corpus-level word co-occurrence statistics into account. Moreover, the quality of the GLOVE embedding for common words is likely higher since it is pretrained on a huge corpus (compared to only the OpenStax Biology textbook for Word2Vec).
Neither the Word2Vec embedding nor the GLOVE embedding takes word ordering into account, and for misconception classification, this drawback can lead to problems. For example, the responses “If X then Y” and “If Y then X” may have completely different meanings depending on the context, and it is possible for one to exhibit a common misconception while the other does not. Using the Word2Vec and GLOVE embeddings, these responses will be embedded to the same feature vector, making them indistinguishable from each other. Our third word-vector embedding is therefore based on the long short-term memory (LSTM) neural network, a recurrent neural network that excels at capturing long-term dependencies in sequential data. It can thus take word ordering into account, a feature that we believe is critical for misconception detection. We implement a 2-layer LSTM network with 10 hidden units and train it on the OpenStax Biology textbook. For each student response, we feed the text character by character into the LSTM network and use the last layer’s hidden unit activation values (stacked in a vector) as its textual feature vector.
3 Statistical Model
We now detail our statistical model; its graphical model is visualized in Figure 1. Concretely, let there be a total of $N$ students, $Q$ questions, and $K$ misconceptions. Let $y_{i,j} \in \{0, 1\}$ denote the binary-valued misconception label on the response of student $j$ to question $i$ provided by an expert grader, with $i \in \{1, \ldots, Q\}$ and $j \in \{1, \ldots, N\}$, where $y_{i,j} = 1$ represents the presence of (one or more) misconceptions, and $y_{i,j} = 0$ represents no misconceptions.
We transform the raw text of student $j$’s response to question $i$ into a $D$-dimensional real-valued feature vector, denoted by $\mathbf{f}_{i,j} \in \mathbb{R}^D$, via a preprocessing step (detailed in the previous section). Let $\Omega \subseteq \{1, \ldots, Q\} \times \{1, \ldots, N\}$ denote the subset of student responses that are labeled, since every student only responds to a subset of the questions.
We denote the tendency of student $j$ to exhibit misconception $k$, with $k \in \{1, \ldots, K\}$, as $c_{k,j}$, and the confusion level of question $i$ on misconception $k$ as $d_{i,k}$. Then, let $z_{i,j,k} \in \{0, 1\}$ denote the binary-valued latent variable that represents whether student $j$ exhibits misconception $k$ in their response to question $i$, with $z_{i,j,k} = 1$ denoting that the misconception is present and $z_{i,j,k} = 0$ otherwise. We model $z_{i,j,k}$ as a Bernoulli random variable

$$z_{i,j,k} \sim \mathrm{Ber}\big(\Phi(c_{k,j} + d_{i,k})\big), \tag{1}$$

where $\Phi(\cdot)$ denotes the inverse probit link function (the cumulative distribution function of the standard normal random variable). Given $z_{i,j,k}$ for all $k$, we model the observed misconception label $y_{i,j}$ as

$$y_{i,j} = \begin{cases} 1 & \text{if } \sum_{k=1}^{K} z_{i,j,k} > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

In words, a response is labeled as having a misconception if one or more misconceptions is present (given by the latent misconception exhibition variables $z_{i,j,k}$). Given $z_{i,j,k}$, the textual response feature vector that corresponds to student $j$’s response to question $i$, $\mathbf{f}_{i,j}$, is modeled as

$$\mathbf{f}_{i,j} \sim \mathcal{N}\Big(\boldsymbol{\theta}_i + \sum_{k=1}^{K} z_{i,j,k}\, \boldsymbol{\phi}_k,\; \boldsymbol{\Sigma}\Big), \tag{3}$$

where $\boldsymbol{\theta}_i$ denotes the feature vector that corresponds to the correct response to question $i$, $\boldsymbol{\phi}_k$ denotes the feature vector that corresponds to misconception $k$, and $\boldsymbol{\Sigma}$ denotes the covariance matrix of the multivariate normal distribution characterizing the feature vectors. In other words, the feature vector of each response is a mixture of the feature vectors corresponding to the correct response to the question and each misconception the student exhibits.
In the next section, we develop an MCMC inference algorithm to infer the values of the latent variables $z_{i,j,k}$, $\boldsymbol{\theta}_i$, $\boldsymbol{\phi}_k$, $\boldsymbol{\Sigma}$, $c_{k,j}$, and $d_{i,k}$, given observed data $y_{i,j}$ and $\mathbf{f}_{i,j}$.
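To make the generative process concrete, the sketch below draws the latent misconception indicators, the label, and the feature vector for a single student-question pair. It rests on stated assumptions rather than on the authors' implementation: the Bernoulli probability is taken to be the probit function applied to the sum of the student's tendency and the question's confusion level, the mean of the feature vector is taken to be the correct-response vector plus the vectors of the exhibited misconceptions, and the full covariance matrix is simplified to independent noise with a common standard deviation. All names are hypothetical.

```python
import math
import random


def probit(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate_response(tendency, confusion, correct_vec,
                      misconception_vecs, noise_std, rng):
    """Draw (z, y, f) for one student-question pair.

    tendency[k]  : student's tendency toward misconception k
    confusion[k] : question's confusion level on misconception k
    """
    # Latent indicators: one Bernoulli draw per misconception.
    z = [1 if rng.random() < probit(c + d) else 0
         for c, d in zip(tendency, confusion)]
    # Label: 1 if at least one misconception is present.
    y = 1 if any(z) else 0
    # Feature mean: correct response plus every exhibited misconception.
    mean = list(correct_vec)
    for z_k, phi_k in zip(z, misconception_vecs):
        if z_k:
            mean = [m + p for m, p in zip(mean, phi_k)]
    # Assumption: isotropic noise instead of a full covariance matrix.
    f = [rng.gauss(m, noise_std) for m in mean]
    return z, y, f
```

With strongly negative tendencies the indicators are all zero and the feature vector concentrates around the correct-response vector; with strongly positive tendencies every misconception vector is added to the mean.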
4 Parameter Inference
We use a Gibbs sampling algorithm [5] for parameter inference under the proposed statistical model. We place prior distributions on the latent variables, one of which is an inverse-Wishart distribution; the parameters of these priors are hyperparameters.
We start by randomly initializing the values of the latent variables by sampling from their prior distributions. Then, in each iteration of our Gibbs sampling algorithm, we sample the value of each random variable from its full conditional posterior distribution. Specifically, in each iteration, we perform the following steps:

We first sample each latent misconception indicator variable from its full conditional posterior distribution; the terms in this distribution are given by (1) and (3).

We then sample the feature vector that corresponds to the correct response to each question from its posterior distribution.

We then sample the feature vector that corresponds to each misconception from its posterior distribution.

We then sample the covariance matrix from its posterior distribution.

In order to sample the student tendency and question confusion-level variables, we first sample the value of an auxiliary variable (following the standard approach proposed in [1]) from a truncated normal distribution, truncated to the positive side when the corresponding misconception is present and to the negative side otherwise. We then sample each student tendency variable from its posterior distribution, and then each question confusion-level variable from its posterior distribution.
We run the iterations detailed above for a number of total iterations with a certain burn-in period, and use the samples of each latent variable to approximate their posterior distributions.
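The auxiliary-variable step follows the probit data-augmentation idea of [1], which can be sketched as below. The function name is hypothetical, and the unit variance and the interpretation of `mean` as the sum of the student's tendency and the question's confusion level are assumptions based on the standard construction, not on details given here. Sampling uses the inverse-CDF method.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def sample_auxiliary(mean, z, rng):
    """Draw from N(mean, 1) truncated to (0, inf) if z == 1, else (-inf, 0)."""
    # P(mean + eps < 0) for eps ~ N(0, 1): the split point in CDF space.
    cut = _STD_NORMAL.cdf(-mean)
    lo, hi = (cut, 1.0) if z == 1 else (0.0, cut)
    u = lo + (hi - lo) * rng.random()
    # Keep inv_cdf's argument strictly inside (0, 1); this slightly
    # distorts the extreme tails when |mean| is very large.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + _STD_NORMAL.inv_cdf(u)
```

Conditioned on the auxiliary variables, the tendency and confusion-level variables enter a linear Gaussian model, which is what makes their conditional posteriors easy to sample.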
Label switching
Parameter inference under our model suffers from the label-switching issue that is common in mixture models [5], meaning that the mixture components might be permuted between iterations. We employ a post-processing step to resolve this issue. Specifically, we first calculate the augmented data likelihood at each iteration.
Then, we identify the iteration with the largest augmented data likelihood, and permute the misconception-indexed variables at every other iteration so that they best match the corresponding variables at that iteration. After this post-processing step, we can simply calculate the posterior means of each one of these sets of variables by taking averages of their values across non-burn-in iterations.
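A minimal sketch of this relabeling step is given below. The matching criterion is an assumption: the text says only that the permuted variables should "best match" the reference iteration, so the sketch uses the total squared Euclidean distance between misconception feature vectors and searches all permutations by brute force, which is practical only for a small number of misconceptions. Function names are hypothetical.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_permutation(reference, sample):
    """Return perm such that sample[perm[k]] is matched to reference[k]."""
    n = len(reference)
    return min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(reference[k], sample[perm[k]])
                             for k in range(n)),
    )


def relabel(samples, log_likelihoods):
    """Align each iteration's misconception vectors to the iteration
    with the largest (log) likelihood."""
    ref_index = max(range(len(samples)), key=log_likelihoods.__getitem__)
    reference = samples[ref_index]
    aligned = []
    for sample in samples:
        perm = best_permutation(reference, sample)
        aligned.append([sample[k] for k in perm])
    return aligned
```

The permutation found for an iteration would also be applied to the other misconception-indexed variables of that iteration before averaging.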
5 Experiments
We experimentally validate the efficacy of the proposed framework using our AP Biology class dataset. We first compare the proposed framework against a baseline random forest (RF) classifier that classifies whether a student response exhibits one or more misconceptions. We then show common misconceptions detected in our datasets and discuss how the proposed framework can use this information to deliver meaningful targeted feedback to students that helps them correct their misconceptions.
5.1 Experimental setup
We run our experiments with a fixed number of latent misconceptions and fixed hyperparameters, for a fixed total number of iterations with an initial burn-in period. We compare the proposed framework against a baseline random forest (RF) classifier with 200 decision trees, using the textual response feature vectors to classify the binary-valued misconception label. (The RF classifier achieves the best performance among a number of off-the-shelf baseline classifiers, e.g., logistic regression, support vector machines, etc. Therefore, we do not compare the proposed framework against other baseline classifiers.)
We randomly partition each dataset into 5 folds and use 4 folds as the training set and the other fold as the test set. We then train the proposed framework and RF on the training set and evaluate their performance on the test set, using two metrics: i) prediction accuracy (ACC), i.e., the portion of correct predictions, and ii) area under curve (AUC), i.e., the area under the receiver operating characteristic (ROC) curve of the resulting binary classifier [8]. Both metrics take values in [0, 1], with larger values corresponding to better prediction performance. We repeat our experiments over multiple random partitions of the folds.
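The two metrics can be computed directly from the true labels and the predicted probabilities, as in the sketch below. The 0.5 decision threshold for ACC is an assumption, and AUC is computed through its rank interpretation: the probability that a randomly chosen positive response receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Portion of correct predictions after thresholding the scores."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

For labels `[1, 0, 1, 0]` and scores `[0.9, 0.4, 0.6, 0.7]`, three of the four thresholded predictions are correct and three of the four positive-negative pairs are ordered correctly, so both metrics equal 0.75.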
For the proposed framework, the predictive probability that a response exhibits a misconception given its feature vector is the probability that at least one of the latent misconception exhibition state variables takes the value of one, i.e., one minus the posterior probability that none of the misconceptions is present. For RF, the predictive probability is given by the fraction of decision trees that classify the response as exhibiting a misconception given its feature vector.
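One way to evaluate this predictive probability, given point estimates of the model parameters, is to enumerate all patterns of the latent indicators and apply Bayes' rule. The sketch below is an illustration under assumptions, not the authors' procedure: it assumes an additive mean (correct-response vector plus exhibited misconception vectors), isotropic Gaussian noise in place of a full covariance matrix, and prior indicator probabilities supplied directly in `prior_probs`. All names are hypothetical.

```python
import math
from itertools import product


def prob_misconception(f, correct_vec, misconception_vecs,
                       prior_probs, noise_std):
    """P(at least one indicator is 1 | f), enumerating all 2^K patterns."""
    log_weights = {}
    for z in product((0, 1), repeat=len(prior_probs)):
        mean = list(correct_vec)
        log_w = 0.0
        for z_k, phi_k, p_k in zip(z, misconception_vecs, prior_probs):
            log_w += math.log(p_k if z_k else 1.0 - p_k)  # prior term
            if z_k:
                mean = [m + v for m, v in zip(mean, phi_k)]
        # Gaussian log-likelihood up to a constant shared by all patterns.
        sq_err = sum((a - b) ** 2 for a, b in zip(f, mean))
        log_weights[z] = log_w - sq_err / (2.0 * noise_std ** 2)
    shift = max(log_weights.values())  # guards against underflow
    weights = {z: math.exp(lw - shift) for z, lw in log_weights.items()}
    none_present = weights[(0,) * len(prior_probs)]
    return 1.0 - none_present / sum(weights.values())
```

The enumeration grows exponentially with the number of misconceptions, so it is only a reasonable choice when that number is small.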
5.2 Results and discussions
The number of latent misconceptions is an important parameter controlling the granularity of the misconceptions that we aim to detect. Figure 2 shows the comparison between the proposed framework using different numbers of latent misconceptions and RF using the ACC metric with the LSTM embedding. We see an obvious trend that, as the number of latent misconceptions increases, the prediction performance decreases. The likely cause of this trend is that the proposed framework tends to overfit as the number of latent misconceptions grows very large, since some of our datasets do not contain very rich misconception types. Moreover, the number of common misconceptions varies across different units, with some units likely containing more misconception types than others.
We then compare the performance of the proposed framework against RF on misconception label classification using all three embeddings. Tables 2–4 show comparisons of the proposed framework against RF using both the ACC and AUC metrics on all three different word embeddings. The proposed framework significantly outperforms RF (1–4% using the ACC metric and 4–18% using the AUC metric) on almost all 4 data subsets using every embedding. The only case where the proposed framework does not outperform RF is on Unit 1 using the GLOVE embedding. We postulate that the reason for this result is that this unit is about chemistry and has a lot of responses with more chemical molecular expressions than words; therefore, the proposed framework does not have enough textual information to exhibit its advantages (grouping responses that share the same misconceptions into clusters) over the simple classifier RF.
Both the proposed framework and RF perform much better using the GLOVE and LSTM embeddings than the Word2Vec embedding. This result is likely due to the fact that these embeddings are more advanced than the Word2Vec embedding: the GLOVE embedding considers additional word co-occurrence statistics compared to the Word2Vec embedding, is trained on a much larger corpus, and has a higher dimension, while the LSTM embedding is the only embedding that takes word ordering into account. Moreover, both algorithms perform best on Unit 4, which is likely due to two reasons: i) the Unit 4 subset has a larger portion of its responses labeled, and ii) Unit 4 is about evolution, which results in responses that are much longer and thus contain richer textual information.
Table 2: Performance comparison on misconception label classification of a textual response in terms of the prediction accuracy (ACC) and area under the receiver operating characteristic curve (AUC) of the proposed framework against a random forest (RF) classifier, using the AP Biology dataset and the Word2Vec embedding. Rows: proposed framework and RF; columns: ACC and AUC for each of Units 1 to 4.
Tables 3 and 4: The same comparison, with the same rows and columns, for the other two embeddings.
5.3 Uncovering common misconceptions
We emphasize that, in addition to the proposed framework’s significant improvement over RF in terms of misconception label classification, it features great interpretability since it identifies common misconceptions from data. As an illustrative example, the following responses from multiple students across two questions are identified to exhibit the same misconception in the Unit 4 subset using the Word2Vec embedding:
Question 1: People who breed domesticated animals try to avoid inbreeding even though most domesticated animals are indiscriminate. Evaluate why this is a good practice.
Correct Response: A breeder would not allow close relatives to mate, because inbreeding can bring together deleterious recessive mutations that can cause abnormalities and susceptibility to disease.
Student Response 1: Inbreeding can cause a rise in unfavorable or detrimental traits such as genes that cause individuals to be prone to disease or have unfavorable mutations.
Student Response 2: Interbreeding can lead to harmful mutations.
Question 2: When closely related individuals mate with each other, or inbreed, the offspring are often not as fit as the offspring of two unrelated individuals. Why?
Correct Response: Inbreeding can bring together rare, deleterious mutations that lead to harmful phenotypes.
Student Response 3: Leads to more homozygous recessive genes thus leading to mutation or disease.
Student Response 4: When related individuals mate it can lead to harmful mutations.
Although these responses are from different students to different questions, they exhibit one common misconception, that inbreeding leads to harmful mutations. Once this misconception is identified, course instructors can deliver the targeted feedback that inbreeding only brings together harmful mutations, leading to issues like abnormalities, rather than directly leading to harmful mutations.
Moreover, the proposed framework can automatically discover common misconceptions that students exhibit without input from domain experts, especially when the numbers of students and questions are very large. Specifically, in the example above, we are able to detect a common misconception that 4 responses exhibit by analyzing the 1016 responses in the AP Biology Unit 4 dataset; however, it would likely not be detected if the number of responses were smaller and fewer students exhibited the misconception. This feature makes the framework an attractive data-driven aid to domain experts in designing educational content to address student misconceptions.
We show another example in which the proposed framework automatically groups student responses according to the misconceptions they exhibit. The example shows two detected common misconceptions among students’ responses to a single question in one of the unit subsets using the LSTM embedding:
Question: What is the primary energy source for cells?
Correct response: Glucose.
Student responses with the first misconception:
sunlight
sun
The sun
he sun?
Student responses with the second misconception:
ATP
adenosine triphosphate
ATPPPPPPPPPPPPP
atp mitochondria
We see that the proposed framework has successfully identified two common misconception groups, with incorrect responses that list “sun” and “ATP” as the primary energy source for cells. Note that the LSTM embedding enables it to assign the full and abbreviated form of the same entity (“adenosine triphosphate” and “ATP”) into the same misconception cluster, without employing any preprocessing on the raw textual response data. The likely reason for this result is that our LSTM embedding is trained on a character-by-character level on the OpenStax Biology textbook, where these terms appear together frequently, thus enabling the LSTM to transform them into similar vectors. This observation highlights the importance of using good, information-preserving word-vector embeddings for the proposed framework to maximize its capability of detecting common misconceptions.
6 Conclusions and Future Work
In this paper, we have proposed a natural language processing-based framework for detecting and classifying common misconceptions in students’ textual responses. Our proposed framework first transforms their textual responses into low-dimensional feature vectors using three existing word-vector embedding techniques, and then estimates the feature vectors characterizing each misconception, among other latent variables, using a proposed mixture model that leverages information provided by expert human graders. Our experiments on a real-world educational dataset consisting of students’ textual responses to short-answer questions showed that the proposed framework excels at classifying whether a response exhibits one or more misconceptions. Our proposed framework is also able to group responses with the same misconceptions into clusters, enabling the data-driven discovery of common misconceptions without input from domain experts.
Possible avenues of future work include i) automatically generating the appropriate feedback to correct each misconception, ii) leveraging additional information, such as the text of the correct response to each question, to further improve the performance on predicting misconception labels, iii) exploring the relationship between the dimension of the word-vector embeddings and prediction performance, and iv) developing embeddings for other types of responses, e.g., mathematical expressions [10] and chemical equations.
References
 [1] J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc., 88(422):669–679, June 1993.
 [2] S. Bhatnagar, M. Desmarais, N. Lasry, and E. S. Charles. Text classification of student self-explanations in college physics questions. In Proc. 9th Intl. Conf. Educ. Data Min., pages 571–572, July 2016.
 [3] H. Cen, K. R. Koedinger, and B. Junker. Learning factors analysis – A general method for cognitive model evaluation and improvement. In Proc. 8th. Intl. Conf. Intell. Tutoring Syst., pages 164–175, June 2006.
 [4] M. Elmadani, M. Mathews, A. Mitrovic, G. Biswas, L. H. Wong, and T. Hirashima. Data-driven misconception discovery in constraint-based intelligent tutoring systems. In Proc. 20th Int. Conf. Comput. in Educ., pages 1–8, Nov. 2012.
 [5] A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin. Bayesian Data Analysis. CRC press, 2013.
 [6] A. K. Griffiths and K. R. Preston. Grade-12 students’ misconceptions relating to fundamental characteristics of atoms and molecules. J. Res. in Sci. Teaching, 29(6):611–628, Aug. 1992.
 [7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
 [8] H. Jin and C. X. Ling. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng., 17(3):299–310, Mar. 2005.
 [9] S. Kang, K. McDermott, and H. Roediger III. Test format and corrective feedback modify the effect of testing on long-term retention. Eur. J. Cogn. Psychol., 19(4–5):528–558, July 2007.
 [10] A. S. Lan, D. Vats, A. E. Waters, and R. G. Baraniuk. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. In Proc. 2nd ACM Conf. on Learning at Scale, pages 167–176, Mar. 2015.
 [11] R. Liu, R. Patel, and K. R. Koedinger. Modeling common misconceptions in learning process data. In Proc. 6th Intl. Conf. on Learn. Analyt. & Knowl., pages 369–377, Apr. 2016.
 [12] J. K. Maass and P. I. Pavlik Jr. Modeling the influence of format and depth during effortful retrieval practice. In Proc. 9th Intl. Conf. Educ. Data Min., pages 143–149, July 2016.
 [13] T. McTavish and J. Larusson. Discovering and describing types of mathematical errors. In Proc. 7th Intl. Conf. Educ. Data Min., pages 353–354, July 2014.
 [14] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, Sep. 2013.
 [15] OpenStax Biology. https://openstax.org/details/biology, 2016.
 [16] OpenStax Tutor. https://openstaxtutor.org/, 2016.
 [17] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 24:694–707, Apr. 2016.
 [18] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proc. ACM SIGDAT Conf. Emp. Method. Nat. Lang. Process., pages 1532–1543, Oct. 2014.
 [19] H. J. Schmidt. Students’ misconceptions—Looking for a pattern. Sci. Educ., 81(2):123–135, Apr. 1997.
 [20] A. Smith, E. N. Wiebe, B. W. Mott, and J. C. Lester. SketchMiner: Mining learner-generated science drawings with topological abstraction. In Proc. 7th Intl. Conf. Educ. Data Min., pages 288–291, July 2014.
 [21] K. K. Tatsuoka. Rule space: An approach for dealing with misconceptions based on item response theory. J. Educ. Meas., 20(4):345–354, Dec. 1983.
 [22] D. Tirosh. Enhancing prospective teachers’ knowledge of children’s conceptions: The case of division of fractions. J. Res. Math. Educ., 31(1):5–25, Jan. 2000.
 [23] K. VanLehn, P. W. Jordan, C. P. Rosé, D. Bhembe, M. Böttner, A. Gaydos, M. Makatchev, U. Pappuswamy, M. Ringenberg, A. Roque, S. Siler, and R. Srivastava. The architecture of Why2Atlas: A coach for qualitative physics essay writing. In Proc. 6th Intl. Conf. on Intell. Tutoring Syst., pages 158–167, June 2002.
 [24] G. Zheng, S. Kim, Y. Tan, and A. Galyardt. Soft clustering of physics misconceptions using a mixed membership model. In Proc. 9th Intl. Conf. Educ. Data Min., pages 658–659, July 2016.