Automated Personalized Feedback Improves Learning Gains in an Intelligent Tutoring System

05/05/2020, by Ekaterina Kochmar et al.

We investigate how automated, data-driven, personalized feedback in a large-scale intelligent tutoring system (ITS) improves student learning outcomes. We propose a machine learning approach to generating personalized feedback that takes the individual needs of students into account. We utilize state-of-the-art machine learning and natural language processing techniques to provide students with personalized hints, Wikipedia-based explanations, and mathematical hints. Our model is deployed in Korbit, a large-scale dialogue-based ITS launched in 2019 with thousands of enrolled students, and we demonstrate that the personalized feedback leads to considerable improvement in student learning outcomes and in the subjective evaluation of the feedback.




1 Introduction

Intelligent Tutoring Systems (ITS) [anderson1985intelligent, nye2014autotutor] attempt to mimic personalized tutoring in a computer-based environment and are a low-cost alternative to human tutors. Over the past two decades, many ITS have been successfully deployed to enhance teaching and improve students’ learning experience in a number of domains [AbuEl, Agha, Nakhal, Rekhawi, budenbender2002, goguadze2005interactivity, leelawong2008designing, melis2004, passier2006, Qwaider], not only providing feedback and assistance but also addressing individual student characteristics [Graesser] and cognitive processes [Wu]. Many ITS consider the development of a personalized curriculum and personalized feedback [Aldahdooh, Nakhal, albacete2019impact, chi2011instructional, Lin, munshi2019personalization, rus2014macro, rus2014deeptutor], with dialogue-based ITS being some of the most effective tools for learning [ahn2018adaptive, graesser2005autotutor, graesser2001intelligent, nye2014autotutor, ventura2018preliminary], as they simulate a familiar learning environment of student–tutor interaction, thus helping to improve student motivation. The main bottleneck is the ability of ITS to address the multitude of possible scenarios in such interactions, and this is where methods of automated, data-driven feedback generation are of critical importance.

Our paper has two major contributions. Firstly, we describe how state-of-the-art machine learning (ML) and natural language processing (NLP) techniques can be used to generate automated, data-driven personalized hints and explanations, Wikipedia-based explanations, and mathematical hints. Feedback generated this way takes the individual needs of students into account, does not require expert intervention or hand-crafted rules, and is easily scalable and transferable across domains. Secondly, we demonstrate that the personalized feedback leads to substantially improved student learning gains and improved subjective feedback evaluation in practice. To support our claims, we utilize our feedback models in Korbit, a large-scale dialogue-based ITS.

2 Korbit Learning Platform

Figure 1: An example illustrating how the Korbit ITS inner-loop system selects the pedagogical intervention. The student gives an incorrect solution and receives a text hint.


Korbit is a large-scale, open-domain, mixed-interface, dialogue-based ITS, which uses ML, NLP, and reinforcement learning to provide interactive, personalized learning online. Currently, the platform has thousands of students enrolled and is capable of teaching topics related to data science, machine learning, and artificial intelligence.

Students enroll based on courses or skills they would like to study. Once a student has enrolled, Korbit tutors them by alternating between short lecture videos and interactive problem-solving. During the problem-solving sessions, the student may attempt to solve an exercise, ask for help, or skip it. If the student attempts to solve the exercise, their solution attempt is compared against the expectation (i.e., the reference solution) using an NLP model. If the solution is classified as incorrect, the inner-loop system (see Fig. 1) will activate and respond with one of a dozen different pedagogical interventions, which include hints, mathematical hints, elaborations, explanations, concept tree diagrams, and multiple-choice quiz answers. The pedagogical intervention is chosen by an ensemble of machine learning models from the student's zone of proximal development (ZPD) [cazden1979peekaboo], based on the student's profile and last solution attempt.
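The paper does not detail the NLP model used for the comparison step; as a rough illustration only, the sketch below stands in for it with a TF-IDF cosine-similarity check, where the is_correct helper and the 0.7 threshold are purely hypothetical assumptions.

```python
# Hypothetical stand-in for Korbit's solution-grading step: TF-IDF cosine
# similarity between the attempt and the reference, with an illustrative
# threshold. Not the actual production model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_correct(attempt, expectation, threshold=0.7):
    vectors = TfidfVectorizer().fit_transform([attempt, expectation])
    score = cosine_similarity(vectors[0], vectors[1])[0, 0]
    return score >= threshold

# If the attempt is judged incorrect, the inner-loop system responds with a
# pedagogical intervention (hint, explanation, quiz, ...) instead.
print(is_correct("a model with high bias is underfitting",
                 "A model is underfitting when it has a high bias."))
```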

3 Automatically Generated Personalized Feedback

In this paper, we present experiments on the Korbit learning platform with actual students. These experiments vary the text hints and explanations based on how they are generated and how they are adapted to each individual student.

3.0.1 Personalized Hints and Explanations

are generated using NLP techniques applied by a 3-step algorithm to all expectations (i.e., reference solutions) in our database: (1) keywords, including nouns and noun phrases, are identified within the question (e.g., overfitting and underfitting in Table 1); (2) an appropriate sentence span that does not include the keywords is identified in a reference solution using state-of-the-art dependency parsing with spaCy (e.g., A model is underfitting is filtered out, while it has a high bias is considered as a candidate for a hint); and (3) a grammatically correct hint is generated using discourse-based modifications (e.g., Think about the case) combined with the partial hint from step (2) (e.g., when it has a high bias).

Question: What is the difference between overfitting and underfitting?
Expectation: A model is underfitting when it has a high bias.
Generated hint: Think about the case when it has a high bias.
Table 1: Hint generation example; the keywords identified in the question are "overfitting" and "underfitting".
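A minimal sketch of the three-step algorithm follows, assuming spaCy's en_core_web_sm model; the clausal-modifier heuristic in step (2) and the single discourse template in step (3) are simplifications of the production rules.

```python
# Sketch of 3-step hint generation (simplified; assumes en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def generate_hint(question, expectation):
    # Step 1: keywords = roots of noun phrases in the question.
    keywords = {chunk.root.lemma_.lower() for chunk in nlp(question).noun_chunks}
    # Step 2: use the dependency parse to find a clause in the reference
    # solution that does not contain any of the question keywords.
    doc = nlp(expectation)
    for token in doc:
        if token.dep_ in ("advcl", "acl", "relcl"):
            span = doc[token.left_edge.i : token.right_edge.i + 1]
            if not any(t.lemma_.lower() in keywords for t in span):
                # Step 3: attach a discourse-based opener to form a full hint.
                return f"Think about the case {span.text}"
    return None

print(generate_hint(
    "What is the difference between overfitting and underfitting?",
    "A model is underfitting when it has a high bias."))
# expected: "Think about the case when it has a high bias"
```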

Next, hints are ranked according to their linguistic quality as well as past student–system interactions. We employ a Random Forest classifier using two broad sets of features: (1) linguistic quality features, which assess the quality of the hint from the linguistic perspective only (e.g., the length of the hint/explanation, keyword and topic overlap between the hint/explanation and the question, etc.) and are used by the baseline model only; and (2) performance-based features, which additionally take into account the student's past interaction with the system. Among the latter, the shallow personalization model includes features related to the number of attempted questions, the proportion of correct and incorrect answers, etc., and the deep personalization model additionally includes linguistic features pertaining to up to a fixed number of previous student–system interaction turns. The three types of feedback models are trained and evaluated on a collection of previously recorded student–system interactions.
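The sketch below illustrates, under simplified assumptions, how such a ranker could be trained; the feature functions, profile fields, and toy labels are stand-ins for the paper's actual feature sets.

```python
# Illustrative hint ranker; features and labels are simplified stand-ins for
# the paper's linguistic-quality and performance-based features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def features(hint, question, profile, deep=False):
    hint_words = set(hint.lower().split())
    overlap = len(hint_words & set(question.lower().split()))
    feats = [len(hint.split()), overlap]                        # linguistic quality
    feats += [profile["n_attempted"], profile["prop_correct"]]  # shallow personalization
    if deep:
        feats.append(profile["prev_turn_len"])                  # stand-in for deep features
    return feats

# Toy training data: label 1 = the hint preceded a correct next attempt.
profiles = [{"n_attempted": 10, "prop_correct": 0.8, "prev_turn_len": 12},
            {"n_attempted": 3, "prop_correct": 0.3, "prev_turn_len": 4}]
X = [features("think about the case of high bias",
              "what is underfitting?", p, deep=True) for p in profiles]
y = [1, 0]

ranker = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# At serving time, each candidate hint is scored and the top-ranked one shown.
print(ranker.predict_proba(np.array(X))[:, 1])
```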

3.0.2 Wikipedia-Based Explanations

provide alternative ways of helping students to understand and remember concepts. We generate such explanations using another multi-stage pipeline: first, we use a 2 GB dataset on "Machine learning" crawled from Wikipedia and extract all relevant domain keywords from the reference questions and solutions using spaCy. Next, we use the first sentence of each article as an extracted Wikipedia-based explanation and the rest of the article to generate candidate explanations. A Decision Tree classifier, trained on a dataset of positive and negative examples, evaluates the quality of a Wikipedia-based explanation using a number of linguistically motivated features. This model is then applied to identify the most appropriate Wikipedia-based explanations among the generated candidates.
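As an illustration, such a quality filter might look as follows; the three features and the toy training pairs are simplified assumptions standing in for the linguistically motivated features mentioned above.

```python
# Illustrative Decision Tree filter for candidate Wikipedia explanations;
# features and labels are simplified stand-ins, not the production ones.
from sklearn.tree import DecisionTreeClassifier

def explanation_features(sentence, concept):
    words = sentence.lower().split()
    return [len(words),                                  # explanation length
            int(concept.lower() in sentence.lower()),    # mentions the concept
            sentence.count(",")]                         # rough clause count

train = [("Overfitting occurs when a model fits noise in the training data.",
          "overfitting", 1),
         ("See also: bias-variance tradeoff.", "overfitting", 0)]
X = [explanation_features(s, c) for s, c, _ in train]
y = [label for _, _, label in train]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
candidate = "Overfitting means the model memorizes its training examples."
print(clf.predict([explanation_features(candidate, "overfitting")]))
```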

3.0.3 Mathematical Hints

are either provided by Korbit in the form of suggested equations with gapped mathematical terms for the student to fill in, or in the form of a hint on what the student needs to change if they input an incorrect equation. Math equations are particularly challenging because equivalent expressions can have different representations: for example, in f(x+1), f could be a function applied to (x+1) or a term multiplied by (x+1). To evaluate student equations, we first convert their LaTeX string into multiple parse trees, where each tree represents a possible interpretation, and then use a classifier to select the most likely parse tree and compare it to the expectation. Our generated feedback is fully automated, which differentiates Korbit from other math-oriented ITS, where feedback is generated from hand-crafted test cases [budenbender2002, hennecke1999online].
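The ambiguity can be made concrete with SymPy (our illustrative choice; the paper does not name Korbit's parser): the string f(x+1) yields two parse trees, and only comparison against the expectation disambiguates them.

```python
# Two readings of the string "f(x+1)", written out with SymPy (SymPy is an
# assumption here; the paper does not specify Korbit's parsing library).
import sympy
from sympy.abc import x

f_sym = sympy.Symbol("f")
as_product = f_sym * (x + 1)               # reading 1: f times (x + 1)
as_function = sympy.Function("f")(x + 1)   # reading 2: f applied to (x + 1)

# Comparing each parse tree against the expectation disambiguates them:
expectation = f_sym * x + f_sym
print(sympy.simplify(as_product - expectation) == 0)   # True: equivalent
print(sympy.simplify(as_function - expectation) == 0)  # False: not equivalent
```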

4 Experimental Results and Analysis

Our preliminary experiments with the baseline, shallow, and deep personalization models, run on historical data using k-fold cross-validation, strongly suggested that the deep personalization model selects the most appropriate personalized feedback. To support our claims, we ran experiments involving annotated student–system interactions, collected from students enrolled for free and studying the machine learning course on the Korbit platform between January and February 2020. First, a hint or explanation was selected uniformly at random from one of the personalized feedback models whenever a student gave an incorrect solution. Afterwards, the student learning gain was measured as the proportion of instances where a student provided a correct solution after receiving a personalized hint or explanation. Since it is possible for the ITS to provide several pedagogical interventions for a given exercise, we separate the learning gains observed for all students from those for students who received a personalized hint or explanation before their second attempt at the exercise. Table 2 presents the results, showing that the deep personalization model leads to the highest student learning gains, followed by the shallow personalization model and then the baseline model, across all attempts. The difference between the learning gains of the deep personalization model and the baseline model for students before their second attempt is statistically significant at the 95% confidence level based on a z-test (p = 0.03005). These results support the hypothesis that automatically generated personalized hints and explanations lead to substantial improvements in student learning gains.

Model                           All Attempts       Before Second Attempt
                                Mean (95% C.I.)    Mean (95% C.I.)
Baseline (No Personalization)
Shallow Personalization
Deep Personalization
Table 2: Student learning gains for personalized hints and explanations, with 95% confidence intervals (C.I.).
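For reference, a two-proportion z-test of the kind reported above can be computed as follows; the counts are placeholders rather than the study's data, and a one-sided test is assumed.

```python
# Two-proportion z-test, as used above to compare learning gains; the counts
# below are placeholders, not the study's data; one-sided test assumed.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)        # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 1 - norm.cdf(z)                             # one-sided p-value

# Placeholder: deep personalization vs. baseline, before the second attempt.
z, p = two_proportion_z_test(success_a=60, n_a=100, success_b=45, n_b=100)
print(f"z = {z:.3f}, p = {p:.5f}")
```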

Experiments on the Korbit platform confirm that extracted and generated Wikipedia-based explanations lead to comparable student learning gains. Students rated either or both types of explanations as helpful the majority of the time. This shows that automatically generated Wikipedia-based explanations can be included in the set of interventions used to personalize feedback. Moreover, two domain experts independently analyzed a set of student–system interactions with Korbit in which the student's solution attempt contained an incorrect mathematical equation. The results showed that the majority of the mathematical hints would be considered either "very useful" or "somewhat useful".

In conclusion, our experiments strongly support the hypothesis that personalized hints and explanations, as well as Wikipedia-based explanations, significantly improve student learning outcomes. Preliminary results also indicate that the mathematical hints are useful. Future work should investigate how and which types of Wikipedia-based explanations and mathematical hints improve student learning outcomes, as well as their interplay with student learning profiles and knowledge gaps.