Adaptive Forgetting Curves for Spaced Repetition Language Learning

04/23/2020 ∙ by Ahmed Zaidi, et al. ∙ University of Cambridge

The forgetting curve has been extensively explored by psychologists, educationalists and cognitive scientists alike. In the context of Intelligent Tutoring Systems, modelling the forgetting curve for each user and knowledge component (e.g. vocabulary word) should enable us to develop optimal revision strategies that counteract memory decay and ensure long-term retention. In this study we explore a variety of forgetting curve models incorporating psychological and linguistic features, and we use these models to predict the probability of word recall by learners of English as a second language. We evaluate the impact of the models and their features using data from an online vocabulary teaching platform and find that word complexity is a highly informative feature which may be successfully learned by a neural network model.




1 Introduction

Optimal human learning techniques have been extensively studied by researchers in psychology [3] and computer science [13, 18, 7, 17]. The impact of learning techniques can be measured by how they affect the long-term retention of the learning materials. Measuring retention requires a model of the human forgetting curve, which plots the probability of recall over time. The first version of the forgetting curve was defined by Ebbinghaus [4] but has since been developed further by many researchers who have incorporated additional psychologically grounded variations to the model [15, 11, 8, 2, 12]. The ideal forgetting curve should adapt to learning materials as well as user meta-features (including current ability). In this study we examine the task of vocabulary learning. We investigate a range of linguistically motivated features, meta-features, and a variety of models in order to predict the probability a given learner will correctly recall a particular word.

2 Method

We use the Duolingo spaced repetition dataset [14] to train and evaluate our features and models. The dataset is filtered for English language learners, which yields approximately 4.28 million learner-word datapoints. Our models are modifications of the half-life regression model proposed by Settles & Meeder [13].

2.1 Half-Life Regression (HLR)

The half-life regression model is defined as follows:

$$p = 2^{-\Delta / h}$$

where $p$ is the probability of recall, $\Delta$ is the time since last seen (days) and $h$ is the half-life or strength of the learner’s memory. We denote the estimated half-life by $\hat{h}_\Theta$, and it is defined as:

$$\hat{h}_\Theta = 2^{\Theta \cdot x}$$

$\Theta$ is a vector of weights for the features $x$. The features of the model are made up of lexeme tags, one tag for each word in the vocabulary (e.g. the lexeme tag for word camera is camera.N.SG). The aim of these features is to capture the inherent difficulty of the word.
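As a minimal sketch of the half-life regression formulation of Settles & Meeder [13], the snippet below computes an estimated half-life as 2 raised to the dot product of weights and features, and then the recall probability as 2 raised to minus the elapsed time divided by the half-life. The function names, the weight vector and the feature values are hypothetical and chosen only for illustration.

```python
def estimated_half_life(theta, x):
    """Estimated half-life in days: 2 ** (theta . x)."""
    dot = sum(w * f for w, f in zip(theta, x))
    return 2.0 ** dot


def recall_probability(delta_days, half_life):
    """Probability of recall after delta_days: 2 ** (-delta / h)."""
    return 2.0 ** (-delta_days / half_life)


# Hypothetical weights and features (not taken from the paper).
theta = [1.0, 0.5]
x = [1.0, 2.0]

h_hat = estimated_half_life(theta, x)       # theta . x = 2.0, so h_hat = 4 days
p_hat = recall_probability(4.0, h_hat)      # one half-life has elapsed
print(h_hat, p_hat)
```

When the elapsed time equals the half-life, the predicted recall probability is exactly one half, which is the sense in which the half-life measures memory strength.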

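The HLR model is trained using the following loss function:

$$\ell(\langle p, \Delta, x \rangle; \Theta) = (p - \hat{p}_\Theta)^2 + \alpha (h - \hat{h}_\Theta)^2 + \lambda \lVert \Theta \rVert_2^2$$

where $\hat{p}_\Theta = 2^{-\Delta / \hat{h}_\Theta}$ is the predicted probability of recall. In practice, it was found that optimising for both $h$ and $p$ in the loss function improved the model. The true value of $h$ is defined as $h = -\Delta / \log_2(p)$.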
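The loss used for half-life regression in Settles & Meeder [13] combines a squared error on recall probability, a weighted squared error on half-life and an L2 penalty on the weights. The sketch below evaluates that loss for a single data point. The function names, the clamping of p (added here so that the logarithm stays finite and non-zero) and all numeric values are assumptions made for illustration, not settings from the paper.

```python
import math


def true_half_life(p, delta_days, eps=1e-4):
    """Observed half-life h = -delta / log2(p).

    p is clamped to [eps, 1 - eps]; this clamp is an assumption of the sketch.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -delta_days / math.log2(p)


def hlr_loss(p, delta_days, theta, x, alpha, lam):
    """Squared error on p, plus alpha * squared error on h, plus L2 penalty."""
    h_hat = 2.0 ** sum(w * f for w, f in zip(theta, x))
    p_hat = 2.0 ** (-delta_days / h_hat)
    h = true_half_life(p, delta_days)
    l2 = sum(w * w for w in theta)
    return (p - p_hat) ** 2 + alpha * (h - h_hat) ** 2 + lam * l2


# Hypothetical data point: observed recall 0.5 after 4 days.
loss = hlr_loss(0.5, 4.0, theta=[1.0, 0.5], x=[1.0, 2.0], alpha=0.01, lam=0.1)
print(loss)
```

In this example the prediction matches the observation exactly, so only the regularisation term contributes to the loss.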

2.2 HLR with Linguistic/Psychological Features (HLR+)

We now expand on the HLR model by adding linguistic, psychological and meta-features to the feature vector $x$. We refer to this model as HLR+. The features include word complexity scores estimated by a pre-trained model [5], mean concreteness scores and percent known based on human judgements [1], SUBTLEX word frequencies [16] and user ids.

The motivation for including complexity as a feature is the intuition that the more complex the word, the harder it is to remember. Concreteness is included based on previous work showing that concrete words are easier to remember than abstract words because they activate perceptual memory codes in addition to verbal codes [9]. SUBTLEX is the relative frequency of an English word based on a corpus of 201.3 million words: we hypothesise that more frequent words are more likely to be encountered and reinforced during the time since last seen, $\Delta$. Similarly, we expect that ‘percent known’ (the proportion of respondents familiar with each word based on survey data) will correlate with probability of recall. Lastly, we include user id to capture latent behavioural aspects of the learners.

2.3 Complexity-based Half-Life Regression (C-HLR+)

In addition to adding new features, we now describe a new model that directly incorporates word complexity into the forgetting curve. Gooding and Kochmar [5] derived word complexity scores that express perceived difficulty. We hypothesise that this will correlate with probability of recall: as the complexity of the word rises, the forgetting curve will become steeper. Therefore, the new model is as follows:


where the complexity term is the mean complexity score for the word. We define the estimated half-life as $\hat{h}_\Theta = 2^{\Theta \cdot x}$, where $x$ is a vector composed of all of the features described in Section 2.2.
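To illustrate the intuition that higher complexity should steepen the forgetting curve, the sketch below multiplies the elapsed time by a complexity score inside the exponent. This functional form is an assumption made purely for illustration and is not necessarily the form used by the authors. The function name and all values are hypothetical.

```python
def complexity_scaled_recall(delta_days, half_life, complexity):
    """Assumed form: 2 ** (-(delta * complexity) / h).

    A larger complexity score makes recall decay faster for the same half-life.
    """
    return 2.0 ** (-(delta_days * complexity) / half_life)


# Same elapsed time and half-life, two hypothetical complexity scores.
p_simple = complexity_scaled_recall(4.0, 4.0, complexity=1.0)
p_complex = complexity_scaled_recall(4.0, 4.0, complexity=2.0)
print(p_simple, p_complex)
```

Under this assumed form, doubling the complexity score has the same effect on predicted recall as doubling the time since the word was last seen.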

2.4 Neural Half-Life Regression (N-HLR+)

Motivated by the recent success of neural networks, we now describe the N-HLR+ model, which replaces the half-life estimate $\hat{h}_\Theta$ with a neural network. The network can be described as follows:


where the network contains a single hidden layer. Here $x$ is a vector of input features, $W_1$ is the weight matrix between the inputs and the hidden layer and $W_2$ is the weight matrix between the hidden layer and the output. We use the same loss function as HLR, which optimises for both $h$ and $p$.
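A network with a single hidden layer that maps a feature vector to a half-life can be sketched as below. The ReLU activation on the hidden layer and the base-2 exponential on the output (used here to keep the half-life positive) are assumptions of this sketch, as are the function names and the weight values. The source does not state which activations were used.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def neural_half_life(x, w1, w2):
    """Single-hidden-layer network producing a positive half-life in days."""
    hidden = relu(matvec(w1, x))       # inputs -> hidden layer
    output = matvec(w2, hidden)[0]     # hidden layer -> scalar output
    return 2.0 ** output               # assumed output transform


# Hypothetical weights: two inputs, two hidden units, one output.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[2.0, 3.0]]
h_hat_nn = neural_half_life([1.0, -1.0], w1, w2)
print(h_hat_nn)
```

The resulting half-life can be plugged into the same recall formula and loss as the linear model, so only the estimator changes.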

2.5 Evaluation and Implementation

We use mean absolute error (MAE) of the probability of recall for a lexical item as our evaluation metric, in line with previous work [13]. MAE is defined as $\mathrm{MAE} = \frac{1}{D} \sum_{i=1}^{D} |p_i - \hat{p}_i|$, where $D$ is the total number of data instances, and $p_i$ and $\hat{p}_i$ are the true and model-estimated probabilities of recall for instance $i$, respectively.
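Mean absolute error is the average of the absolute differences between true and predicted recall probabilities. A minimal sketch follows. The function name and the example values are hypothetical.

```python
def mean_absolute_error(p_true, p_pred):
    """Average absolute difference between true and predicted probabilities."""
    if len(p_true) != len(p_pred) or not p_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - q) for t, q in zip(p_true, p_pred)) / len(p_true)


# Hypothetical true and predicted recall probabilities for two instances.
mae = mean_absolute_error([1.0, 0.5], [0.5, 0.5])
print(mae)
```

Lower values indicate that predicted recall probabilities lie closer to the observed ones.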

We divided the Duolingo English data into 90% training and 10% test. We trained all non-neural models (e.g. HLR, HLR+, C-HLR+) using a learning rate, $\alpha$ and $\lambda$ that were tuned on the first 500k data points. For all neural models (e.g. N-HLR+), we fixed the learning rate, the number of epochs and the hidden dimension.

3 Results and Discussion

We can see in Table 1 that HLR+ did not perform much better than HLR. By modifying the loss function to include complexity as a parameter in the C-HLR+ model, we considerably improved the performance of our model, reducing MAE from 0.129 (HLR+) to 0.109. This is in line with our hypothesis that more complex words are forgotten faster, which makes complexity an important feature in modelling the forgetting curve.

The N-HLR+ model provided a further improvement over the C-HLR+ model (MAE 0.105 versus 0.109). This is because neural models are better at capturing non-linear relationships between the features and the expected output. Furthermore, including complexity in the loss function of the neural model (CN-HLR+) provides no clear improvement over N-HLR+. This is because the network already learns to place more importance on the complexity feature. We confirm this by analysing the average weights in the hidden layer of the model, as seen in Figure 1. The model learns to give greater importance to word complexity, percent known and concreteness, in that order. It does not, however, learn much from the user id and SUBTLEX. This is probably because a single dimension is not sufficient to capture user behaviour and because SUBTLEX does not adequately represent learners’ experience with English as a second language.

Model               MAE
Pimsleur [10]       0.396
Leitner [6]         0.214
Linear Regression   0.196
HLR [13]            0.195
HLR-lex [13]        0.130
HLR+                0.129
C-HLR+              0.109
N-HLR+              0.105
CN-HLR+             0.105

Table 1: Evaluation of forgetting curve models. Pimsleur and Leitner are previous methods of modelling the forgetting curve.

Figure 1: A heatmap showing the average hidden-layer weights of the N-HLR+ model, rescaled to a common range. Features are in the following order: user id, concreteness, percent known, SUBTLEX, complexity.

4 Conclusion

We present a new model for adaptively learning a forgetting curve for language learning using a modified HLR loss function and a neural network. We incorporate linguistically and psychologically motivated features and show that word complexity is an important feature in predicting probability of recall for a vocabulary item. Furthermore, we illustrate that neural networks can capture the importance of word complexity while a simple HLR fails to take advantage of that signal. This work lays the foundation for neural approaches to understanding language learning over time. Future work in this area includes incorporating high-dimensional user embeddings to capture user-specific signals that might influence the forgetting curve.

References


  • [1] M. Brysbaert, A. B. Warriner, and V. Kuperman (2014) Concreteness ratings for 40 thousand generally known English word lemmas. Behavior research methods 46 (3), pp. 904–911. Cited by: §2.2.
  • [2] B. Choffin, F. Popineau, Y. Bourda, and J. Vie (2019) DAS3H: modeling student learning and forgetting for optimally scheduling distributed practice of skills. In Proceedings of The 12th International Conference on Educational Data Mining (EDM), Cited by: §1.
  • [3] J. Dunlosky, K. A. Rawson, E. J. Marsh, M. J. Nathan, and D. T. Willingham (2013) Improving students’ learning with effective learning techniques: promising directions from cognitive and educational psychology. Psychological Science in the Public Interest 14 (1), pp. 4–58. Cited by: §1.
  • [4] H. Ebbinghaus (1885) Ueber das gedächtnis. Cited by: §1.
  • [5] S. Gooding and E. Kochmar (2019) Complex word identification as a sequence labelling task. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1148–1153. Cited by: §2.2, §2.3.
  • [6] S. Leitner (1972) So lernt man lernen: angewandte lernpsychologie–ein weg zum erfolg. Herder. Cited by: Table 1.
  • [7] R. Moore, A. Caines, M. Elliott, A. Zaidi, A. Rice, and P. Buttery (2019) Skills embeddings: a neural approach to multicomponent representations of students and tasks. In Proceedings of The 12th International Conference on Educational Data Mining (EDM), Vol. 360, pp. 365. Cited by: §1.
  • [8] M. C. Mozer, M. Wiseheart, and T. P. Novikoff (2019) Artificial intelligence to support human instruction. Proceedings of the National Academy of Sciences 116 (10), pp. 3953–3955. Cited by: §1.
  • [9] A. Paivio (2013) Imagery and verbal processes. Psychology Press. Cited by: §2.2.
  • [10] P. Pimsleur (1967) A memory schedule. The Modern Language Journal 51 (2), pp. 73–75. Cited by: Table 1.
  • [11] S. Reddy, S. Levine, and A. Dragan (2017) Accelerating human learning with deep reinforcement learning. In NeurIPS workshop: teaching machines, robots, and humans, Cited by: §1.
  • [12] D. C. Rubin and A. E. Wenzel (1996) One hundred years of forgetting: a quantitative description of retention. Psychological Review 103 (4), pp. 734. Cited by: §1.
  • [13] B. Settles and B. Meeder (2016) A trainable spaced repetition model for language learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1848–1858. Cited by: §1, §2.5, §2, Table 1.
  • [14] Cited by: §2.
  • [15] B. Tabibian, U. Upadhyay, A. De, A. Zarezade, B. Schölkopf, and M. Gomez-Rodriguez (2019) Enhancing human learning via spaced repetition optimization. Proceedings of the National Academy of Sciences 116 (10), pp. 3988–3993. Cited by: §1.
  • [16] W. J. Van Heuven, P. Mandera, E. Keuleers, and M. Brysbaert (2014) SUBTLEX-UK: a new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology 67 (6), pp. 1176–1190. Cited by: §2.2.
  • [17] A. H. Zaidi, A. Caines, C. Davis, R. Moore, P. Buttery, and A. Rice (2019) Accurate modelling of language learning tasks and students using representations of grammatical proficiency. In Proceedings of The 12th International Conference on Educational Data Mining (EDM), Cited by: §1.
  • [18] A. H. Zaidi, R. Moore, and T. Briscoe (2017) Curriculum Q-learning for visual vocabulary acquisition. In Proceedings of Visually-Grounded Interaction and Language (ViGIL), NeurIPS, Cited by: §1.