# Optimizing Human Learning

Spaced repetition is a technique for efficient memorization which uses repeated, spaced review of content to improve long-term retention. Can we find the optimal reviewing schedule to maximize the benefits of spaced repetition? In this paper, we introduce a novel, flexible representation of spaced repetition using the framework of marked temporal point processes and then address the above question as an optimal control problem for stochastic differential equations with jumps. For two well-known human memory models, we show that the optimal reviewing schedule is given by the recall probability of the content to be learned. As a result, we can then develop a simple, scalable online algorithm, Memorize, to sample the optimal reviewing times. Experiments on both synthetic and real data gathered from Duolingo, a popular language-learning online platform, show that our algorithm may be able to help learners memorize more effectively than alternatives.

## 1 Introduction

Our ability to remember a piece of information depends critically on the number of times we have reviewed it and the time elapsed since the last review, as first shown by a seminal study by Ebbinghaus [10]. The effect of these two factors has been extensively investigated in the experimental psychology literature [9, 16], particularly in second language acquisition research [2, 5, 7, 21]. Moreover, these empirical studies have motivated the use of flashcards, small pieces of information a learner repeatedly reviews following a schedule determined by a spaced repetition algorithm [6], whose goal is to ensure that learners spend more (less) time working on forgotten (recalled) information.

In recent years, spaced repetition software and online platforms such as Mnemosyne, Synap, SuperMemo, or Duolingo have become increasingly popular, often replacing the use of physical flashcards. The promise of such software and online platforms is that automated fine-grained monitoring and a greater degree of control will result in more effective spaced repetition algorithms. However, most of these algorithms are simple rule-based heuristics with a few hard-coded parameters [6]; principled data-driven models and algorithms with provable guarantees have been largely missing until very recently [19, 22]. Among these recent notable exceptions, the work most closely related to ours is by Reddy et al. [22], who proposed a queueing network model for a particular spaced repetition method (the Leitner system [12] for reviewing flashcards) and then developed a heuristic approximation for scheduling reviews. However, their heuristic does not have provable guarantees, it does not adapt to the learner’s performance over time, and it is specifically designed for Leitner systems.

In this paper, we first introduce a novel, flexible representation of spaced repetition using the framework of marked temporal point processes [1]. For two well-known human memory models, we use this representation to express the dynamics of a learner’s forgetting rates and recall probabilities for the content to be learned by means of a set of stochastic differential equations (SDEs) with jumps. Then, we can find the optimal reviewing schedule for spaced repetition by solving a stochastic optimal control problem for SDEs with jumps [11, 25, 26]. In doing so, we need to introduce a proof technique of independent interest (refer to Appendices C and D).

The solution uncovers a linear relationship between the optimal reviewing intensity and the recall probability of the content to be learned, which allows for a simple, scalable online algorithm, which we name Memorize, to sample the optimal reviewing times (Algorithm 1). Finally, we experiment with both synthetic and real data gathered from Duolingo, a popular language-learning online platform, and show that our algorithm may be able to help learners memorize more effectively than alternatives. To facilitate research in this area within the machine learning community, we are releasing an implementation of our algorithm at http://learning.mpi-sws.org/memorize/.

Further related work. There is a rich literature which tries to ascertain which model of human memory predicts performance best [23, 24]. Our aim in this work is to provide a methodology to derive an optimal reviewing schedule given a choice of human memory model. Hence, we apply our methodology to two of the most popular memory models from the literature—exponential and power-law forgetting curve models.

The task of designing reviewing schedules also has a rich history, starting with the Leitner system itself [12]. In this context, Metzler-Baddeley et al. [18] have recently shown that adaptive reviewing schedules perform better than non-adaptive ones using data from SuperMemo. In doing so, they proposed an algorithm that schedules reviews just as the learner is about to forget an item, i.e., when the probability of recall falls below a threshold. Lindsey et al. [14] have also used a similar idea for scheduling reviews, albeit with a model of recall inspired by ACT-R and the Multiscale Context Model [20]. In this work, we use this heuristic, which does not have theoretical guarantees, as a baseline (“Threshold”) in our experiments.

Finally, another line of research has pursued locally optimal scheduling by identifying which item would benefit the most from a review. Pavlik et al. [21] have used the ACT-R model to make locally optimal decisions about which item to review by greedily selecting the item which is closest to its maximum learning rate as a heuristic. Mettler et al. [17] have also employed a similar heuristic (ARTS system) to arrive at a reviewing schedule by taking response time into account. In this work, our goal is devising strategies which are globally optimal and allow for explicit bounds on the rate of reviewing.

## 2 Problem Formulation

In this section, we first briefly revisit two popular memory models we will use in our work. Then, we describe how to represent spaced repetition using the framework of marked temporal point processes. Finally, we conclude with a statement of the spaced repetition problem.

Modeling human memory. Following previous work in the psychology literature [3, 10, 15, 24], we consider the exponential and the power-law forgetting curve models with binary recalls (i.e., a user either completely recalls or forgets an item).

The probability of recalling item $i$ at time $t$ for the exponential forgetting curve model is given by

$$m_i(t) := \mathbb{P}(r_i(t) = 1) = \exp\big(-n_i(t)\,(t - t_r)\big), \tag{1}$$

where $t_r$ is the time of the last review and $n_i(t)$ is the forgetting rate at time $t$, which may depend on many factors, e.g., item difficulty and number of previous (un)successful recalls of the item. (Previous works often use the inverse of the forgetting rate, referred to as memory strength or half-life [22, 23]. However, it will be more tractable for us to work with the forgetting rates.) The probability of recalling item $i$ at time $t$ for the power-law forgetting curve model is given by

$$m_i(t) := \mathbb{P}(r_i(t) = 1) = \big(1 + \omega\,(t - t_r)\big)^{-n_i(t)}, \tag{2}$$

where $t_r$ is the time of the last review, $n_i(t)$ is the forgetting rate and $\omega$ is a time scale parameter. Remarkably, despite their simplicity, the above functional forms have been recently shown to provide accurate quantitative predictions at a user-item level in large scale web data [22, 23].
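As a minimal sketch, the two forgetting curves of Eq. 1 and Eq. 2 can be evaluated directly. The function and argument names below are illustrative assumptions, not part of the released implementation.

```python
import math

def recall_exponential(n, t, t_last):
    """Eq. 1: recall probability under the exponential forgetting curve.

    n is the forgetting rate, t the query time, t_last the last review time.
    """
    return math.exp(-n * (t - t_last))

def recall_power_law(n, t, t_last, omega):
    """Eq. 2: recall probability under the power-law forgetting curve.

    omega is the time scale parameter.
    """
    return (1.0 + omega * (t - t_last)) ** (-n)
```

Both functions return 1 immediately after a review (`t == t_last`) and decay towards 0 as time passes, faster for larger forgetting rates.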

In the remainder of the paper, for ease of exposition, we derive the optimal reviewing schedule and report experimental results only for the exponential forgetting curve model. Appendix E contains the derivation of the optimal reviewing schedule for the power-law forgetting curve model as well as its experimental validation. Other more complex models of memory can also be expressed using SDEs, e.g., see Appendix F for the MCM model of memory [20]. In this paper, we find a solution to the optimization problem for the simpler exponential and power-law models of memory, leaving optimization with more complex models for future work.

Modeling spaced repetition. Given a learner who wants to memorize a set of items using spaced repetition, i.e., repeated, spaced reviews of the items, we represent each reviewing event as a triplet

$$e := (\underbrace{i}_{\text{item}},\; \underbrace{t}_{\text{time}},\; \underbrace{r}_{\text{recall}}),$$

which means that the learner reviewed item $i$ at time $t$ and either recalled it ($r = 1$) or forgot it ($r = 0$). Here, note that each reviewing event includes the outcome of a test (i.e., a recall) and this is a key difference from the paradigm used by several laboratory studies [7, 8], which consider a sequence of reviewing events followed by a single test. In other words, our data consists of test/review-…-test/review sequences, whereas the data in those studies consists of review-…-review-test sequences. (In most spaced repetition software and online platforms such as Mnemosyne, Synap, or Duolingo, the learner is tested in each review, i.e., the learner follows test/review-…-test/review sequences.)

In the above representation, we model the recall $r$ using the memory model defined by Eq. 1, i.e., $r \sim \text{Bernoulli}(m_i(t))$, and we keep track of the reviewing times using a multidimensional counting process $\mathbf{N}(t)$, in which the $i$-th entry, $N_i(t)$, counts the number of times the learner has reviewed item $i$ up to time $t$. Following the literature on temporal point processes [1], we characterize these counting processes using their corresponding intensities, i.e., $\mathbb{E}[d\mathbf{N}(t)] = \mathbf{u}(t)\,dt$, and think of the recall $r$ as their binary marks. Moreover, every time a learner reviews an item, the recall has been experimentally shown to have an effect on the forgetting rate of the item [9, 22, 23]. In particular, using large scale web data from Duolingo, Settles et al. [23] have provided strong empirical evidence that (un)successful recalls of an item $i$ during a review have a multiplicative effect on the forgetting rate $n_i(t)$: a successful recall at time $t_r$ changes the forgetting rate by a factor $(1 - \alpha_i)$, i.e., $n_i(t) = (1 - \alpha_i)\, n_i(t_r)$, while an unsuccessful recall changes the forgetting rate by a factor $(1 + \beta_i)$, i.e., $n_i(t) = (1 + \beta_i)\, n_i(t_r)$, where $\alpha_i$ and $\beta_i$ are item specific parameters which can be found using historical data. In this context, the initial forgetting rate, $n_i(0)$, captures the difficulty of the item, with more difficult items having higher initial forgetting rates compared to easier items.

Hence, we express the dynamics of the forgetting rate for each item using the following stochastic differential equation (SDE) with jumps:

$$dn_i(t) = -\alpha_i\, n_i(t)\, r_i(t)\, dN_i(t) + \beta_i\, n_i(t)\, (1 - r_i(t))\, dN_i(t), \tag{3}$$

where $N_i(t)$ is the corresponding counting process and $r_i(t) \in \{0, 1\}$ indicates whether item $i$ has been successfully recalled at time $t$. Here, we would like to highlight that: (i) the forgetting rate, as defined above, is a Markov process and this will be useful in the derivation of the optimal reviewing schedule; (ii) the Leitner system [12] with exponential spacing can also be cast using this formulation with particular choices of $\alpha_i$ and $\beta_i$ and the same initial forgetting rate for all items; and, (iii) several laboratory studies, in which learners follow review-…-review-test sequences, suggest the parameters $\alpha_i$ and $\beta_i$ should be time-varying since the retention rate follows an inverted U-shape [8]. However, we found that in our dataset, in which learners follow test/review-…-test/review sequences, considering constant $\alpha_i$ and $\beta_i$ is a valid approximation (refer to Appendix I).
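The jump in Eq. 3 amounts to a multiplicative update of the forgetting rate at every review. The following sketch illustrates it; the function name and the parameter values in the usage note are hypothetical.

```python
def update_forgetting_rate(n, recalled, alpha, beta):
    """Apply the jump of Eq. 3 at a review.

    A successful recall multiplies the forgetting rate by (1 - alpha),
    an unsuccessful one multiplies it by (1 + beta).
    """
    if recalled:
        return (1.0 - alpha) * n
    return (1.0 + beta) * n

def forgetting_rate_after(n0, recall_pattern, alpha, beta):
    """Forgetting rate after a sequence of recall outcomes (1 = recalled)."""
    n = n0
    for r in recall_pattern:
        n = update_forgetting_rate(n, bool(r), alpha, beta)
    return n
```

For example, with `n0 = 1.0`, `alpha = 0.5` and `beta = 1.0`, the recall pattern `[0, 1, 1]` takes the forgetting rate from 1.0 to 2.0, then to 1.0, and finally to 0.5.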

Given the above definition, one can also express the dynamics of the recall probability $m_i(t)$, defined by Eq. 1, by means of an SDE with jumps using the following Proposition (proven in Appendix A): Given an item $i$ with reviewing intensity $u_i(t)$, the recall probability $m_i(t)$, defined by Eq. 1, is a Markov process whose dynamics can be defined by the following SDE with jumps:

$$dm_i(t) = -n_i(t)\, m_i(t)\, dt + (1 - m_i(t))\, dN_i(t), \tag{4}$$

where $N_i(t)$ is the counting process associated to the reviewing intensity $u_i(t)$. Expressing the dynamics of the forgetting rates and recall probabilities as SDEs with jumps will be very useful for the design of our stochastic optimal control algorithm for spaced repetition.
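The full proof is in Appendix A; the following is only an informal sketch of why Eq. 4 has this form, assuming the forgetting rate stays constant between reviews (as Eq. 3 implies) and that the last review time in Eq. 1 is reset at each review.

```latex
% Between reviews: n_i(t) is constant and t_r is fixed, so differentiating Eq. 1 gives
\frac{d m_i(t)}{dt} = -n_i(t)\, e^{-n_i(t)\,(t - t_r)} = -n_i(t)\, m_i(t).
% At a review (dN_i(t) = 1): t_r is reset to t, so the recall probability jumps to 1
m_i(t^{+}) - m_i(t^{-}) = 1 - m_i(t^{-}).
% Combining the drift between reviews with the jump at reviews yields Eq. 4
d m_i(t) = -n_i(t)\, m_i(t)\, dt + \big(1 - m_i(t)\big)\, dN_i(t).
```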

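The spaced repetition problem. Given a set of items $\mathcal{I}$, our goal is to find the optimal item reviewing intensities $\mathbf{u}(t)$ that minimize the expected value of a particular convex loss function $\ell(\mathbf{m}(t), \mathbf{n}(t), \mathbf{u}(t))$ of the recall probability of the items, $\mathbf{m}(t)$, the forgetting rates, $\mathbf{n}(t)$, and the intensities themselves, $\mathbf{u}(t)$, over a time window $(t_0, t_f]$, i.e.,

$$\begin{aligned} \underset{\mathbf{u}(t_0, t_f]}{\text{minimize}} \quad & \mathbb{E}_{(\mathbf{N}, \mathbf{r})(t_0, t_f]}\left[\phi(\mathbf{m}(t_f), \mathbf{n}(t_f)) + \int_{t_0}^{t_f} \ell(\mathbf{m}(\tau), \mathbf{n}(\tau), \mathbf{u}(\tau))\, d\tau\right] \\ \text{subject to} \quad & \mathbf{u}(t) \ge 0 \quad \forall t \in (t_0, t_f), \end{aligned} \tag{5}$$

where $\mathbf{u}(t_0, t_f]$ denotes the item reviewing intensities from $t_0$ to $t_f$, the expectation is taken over all possible realizations of the associated counting processes and (item) recalls, denoted as $(\mathbf{N}, \mathbf{r})(t_0, t_f]$, the loss function is nonincreasing (nondecreasing) with respect to the recall probabilities (forgetting rates and intensities) so that it rewards long-lasting learning while limiting the number of item reviews, and $\phi(\mathbf{m}(t_f), \mathbf{n}(t_f))$ is an arbitrary penalty function. Finally, note that the forgetting rates $\mathbf{n}(t)$ and recall probabilities $\mathbf{m}(t)$, defined by Eq. 3 and Eq. 4, depend on the reviewing intensities $\mathbf{u}(t)$ we aim to optimize since $\mathbb{E}[d\mathbf{N}(t)] = \mathbf{u}(t)\, dt$.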

## 3 The Memorize Algorithm

In this section, we tackle the spaced repetition problem defined by Eq. 5 from the perspective of stochastic optimal control of jump SDEs [11]. More specifically, we first derive a solution to the problem considering only one item, provide an efficient practical implementation of the solution, and then generalize it to the case of multiple items.

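Optimizing for one item. Given an item $i$ with reviewing intensity $u_i(t) = u(t)$ and associated counting process $N_i(t) = N(t)$, recall outcome $r_i(t) = r(t)$, recall probability $m_i(t) = m(t)$ and forgetting rate $n_i(t) = n(t)$, we can rewrite the spaced repetition problem defined by Eq. 5 as:

$$\begin{aligned} \underset{u(t_0, t_f]}{\text{minimize}} \quad & \mathbb{E}_{(N, r)(t_0, t_f]}\left[\phi(m(t_f), n(t_f)) + \int_{t_0}^{t_f} \ell(m(\tau), n(\tau), u(\tau))\, d\tau\right] \\ \text{subject to} \quad & u(t) \ge 0 \quad \forall t \in (t_0, t_f), \end{aligned} \tag{6}$$

where, using Eq. 3 and Eq. 4, the forgetting rate and recall probability are defined by the following two coupled jump SDEs:

$$\begin{aligned} dn(t) &= -\alpha\, n(t)\, r(t)\, dN(t) + \beta\, n(t)\, (1 - r(t))\, dN(t) \\ dm(t) &= -n(t)\, m(t)\, dt + (1 - m(t))\, dN(t) \end{aligned}$$

with initial conditions $n(t_0)$ and $m(t_0)$.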

Next, we will define an optimal cost-to-go function $J$ for the above problem, use Bellman’s principle of optimality to derive the corresponding Hamilton-Jacobi-Bellman (HJB) equation [4], and exploit the unique structure of the HJB equation to find the optimal solution to the problem. The optimal cost-to-go $J$ is defined as the minimum of the expected value of the cost of going from state $(m(t), n(t))$ at time $t$ to the final state at time $t_f$:

$$J(m(t), n(t), t) = \min_{u(t, t_f]} \mathbb{E}_{(N, r)(t, t_f]}\left[\phi(m(t_f), n(t_f)) + \int_{t}^{t_f} \ell(m(\tau), n(\tau), u(\tau))\, d\tau\right]. \tag{7}$$

Now, we use Bellman’s principle of optimality, which the above definition allows (it readily follows from the Markov property of the recall probability $m(t)$ and forgetting rate $n(t)$), to break the problem into smaller subproblems, and rewrite Eq. 7 as:

$$\begin{aligned} J(m(t), n(t), t) &= \min_{u(t, t+dt]} \mathbb{E}\big[J(m(t+dt), n(t+dt), t+dt)\big] + \ell(m(t), n(t), u(t))\, dt \\ 0 &= \min_{u(t, t+dt]} \mathbb{E}\big[dJ(m(t), n(t), t)\big] + \ell(m(t), n(t), u(t))\, dt, \end{aligned} \tag{8}$$

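where $dJ(m(t), n(t), t) = J(m(t+dt), n(t+dt), t+dt) - J(m(t), n(t), t)$. Then, we differentiate $J$ with respect to time $t$, $m(t)$ and $n(t)$ using the following Lemma (proven in Appendix B). Let $x(t)$ and $y(t)$ be two jump-diffusion processes defined by the following jump SDEs:

$$\begin{aligned} dx(t) &= f(x(t), y(t), t)\, dt + g(x(t), y(t), t)\, z(t)\, dN(t) + h(x(t), y(t), t)\, (1 - z(t))\, dN(t) \\ dy(t) &= p(x(t), y(t), t)\, dt + q(x(t), y(t), t)\, dN(t) \end{aligned}$$

where $N(t)$ is a jump process and $z(t) \in \{0, 1\}$. If the function $F(x(t), y(t), t)$ is once continuously differentiable in $x(t)$, $y(t)$ and $t$, then,

$$dF(x, y, t) = (F_t + f F_x + p F_y)(x, y, t)\, dt + \big[F(x + g, y + q, t)\, z(t) + F(x + h, y + q, t)\, (1 - z(t)) - F(x, y, t)\big]\, dN(t),$$

where for notational simplicity we dropped the arguments of the functions $f$, $g$, $h$, $p$, $q$ and the argument of the state variables. Specifically, consider $x(t) = n(t)$, $y(t) = m(t)$, $z(t) = r(t)$ and $F = J$ in the above Lemma, so that $f = 0$, $g = -\alpha n$, $h = \beta n$, $p = -nm$ and $q = 1 - m$; then,

$$dJ(m, n, t) = J_t(m, n, t)\, dt - n m\, J_m(m, n, t)\, dt + \big[J(1, (1 - \alpha) n, t)\, r + J(1, (1 + \beta) n, t)\, (1 - r) - J(m, n, t)\big]\, dN(t).$$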

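Then, if we substitute the above equation in Eq. 8, use that $\mathbb{E}[dN(t)] = u(t)\, dt$ and $\mathbb{E}[r(t)] = m(t)$, and rearrange terms, the HJB equation follows:

$$0 = J_t(m, n, t) - n m\, J_m(m, n, t) + \min_{u(t, t+dt]} \Big\{ \ell(m, n, u) + \big[J(1, (1 - \alpha) n, t)\, m + J(1, (1 + \beta) n, t)\, (1 - m) - J(m, n, t)\big]\, u(t) \Big\}. \tag{9}$$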

To solve the above equation, we need to define the loss $\ell$. Following the literature on stochastic optimal control [4], we consider the following quadratic form, which is nonincreasing (nondecreasing) with respect to the recall probabilities (intensities) so that it rewards learning while limiting the number of item reviews:

$$\ell(m(t), n(t), u(t)) = \frac{1}{2}(1 - m(t))^2 + \frac{1}{2}\, q\, u^2(t), \tag{10}$$

where $q$ is a given parameter, which trades off recall probability and number of item reviews. This particular choice of loss function does not directly place a hard constraint on the number of reviews. Instead, it limits the number of reviews by penalizing high reviewing intensities.

Under these definitions, we can find the relationship between the optimal intensity and the optimal cost by taking the derivative with respect to $u(t)$ in Eq. 9:

$$u^*(t) = q^{-1}\big[J(m(t), n(t), t) - J(1, (1 - \alpha) n(t), t)\, m(t) - J(1, (1 + \beta) n(t), t)\, (1 - m(t))\big]_+.$$
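The intermediate step can be spelled out as follows. This is a sketch in which $\Delta$ and $g$ are shorthand introduced here for illustration (they do not appear in the paper), and $[\cdot]_+$ denotes $\max(0, \cdot)$.

```latex
% Shorthand: \Delta is the bracketed jump term that multiplies u(t) in Eq. 9
\Delta := J(1,(1-\alpha)n,t)\,m + J(1,(1+\beta)n,t)\,(1-m) - J(m,n,t)
% Objective inside the minimization of Eq. 9 with the quadratic loss of Eq. 10
g(u) = \tfrac{1}{2}(1-m)^2 + \tfrac{1}{2}\,q\,u^2 + \Delta\,u
% First-order condition (g is strictly convex in u for q > 0)
g'(u) = q\,u + \Delta = 0 \;\Longrightarrow\; u = -q^{-1}\,\Delta
% Enforcing the constraint u(t) \ge 0
u^*(t) = q^{-1}\,[-\Delta]_+
```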

Finally, we plug in the above equation in Eq. 9 and find that the optimal cost-to-go needs to satisfy the following nonlinear differential equation:

$$\begin{aligned} 0 = {} & J_t(m(t), n(t), t) - n(t)\, m(t)\, J_m(m(t), n(t), t) + \frac{1}{2}(1 - m(t))^2 \\ & - \frac{1}{2}\, q^{-1}\big[J(m(t), n(t), t) - J(1, (1 - \alpha) n(t), t)\, m(t) - J(1, (1 + \beta) n(t), t)\, (1 - m(t))\big]_+^2, \end{aligned} \tag{11}$$

with $J(m(t_f), n(t_f), t_f) = \phi(m(t_f), n(t_f))$ as terminal condition. To continue further, we rely on a technical Lemma (refer to Appendix C), which derives the optimal cost-to-go for a general family of losses. Using this Lemma, the optimal reviewing intensity is readily given by the following Theorem (proven in Appendix D): Given a single item, the optimal reviewing intensity for the spaced repetition problem, defined by Eq. 6, under quadratic loss, defined by Eq. 10, is given by $u^*(t) = q^{-1/2}(1 - m(t))$.

Note that the optimal intensity only depends on the recall probability, whose dynamics are given by Eqs. 3 and 4, and thus allows for a very efficient procedure to sample reviewing times. Algorithm 1 summarizes our sampling method, which we name Memorize. Within the algorithm, one routine returns the recall outcome $r$ of an item review at a given time, where $r = 1$ indicates the item was recalled successfully and $r = 0$ indicates it was not recalled, and another routine samples from an inhomogeneous Poisson process with a given intensity and returns the sampled time. In practice, we sample from an inhomogeneous Poisson process using a standard thinning algorithm [13].
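The sampling procedure can be sketched in Python as follows. This is not the released implementation: it assumes the exponential forgetting curve of Eq. 1, the intensity $u(t) = q^{-1/2}(1 - m(t))$, an item first seen at time 0 (so that $m(0) = 1$), and a caller-supplied hypothetical callback `review_item(t)` that reports the recall outcome. Since $1 - m(t) \le 1$, the constant $q^{-1/2}$ bounds the intensity and can serve as the proposal rate for thinning.

```python
import math
import random

def sample_next_review(n, t_last, q, rng, horizon):
    """Sample the next review time after t_last by thinning.

    Candidates arrive at the constant rate u_max = 1 / sqrt(q) and each is
    accepted with probability u(t) / u_max = 1 - m(t).
    Returns None if no review is accepted before `horizon`.
    """
    u_max = 1.0 / math.sqrt(q)
    t = t_last
    while True:
        t += rng.expovariate(u_max)          # next candidate time
        if t >= horizon:
            return None
        m = math.exp(-n * (t - t_last))      # recall probability, Eq. 1
        if rng.random() < 1.0 - m:           # thinning acceptance step
            return t

def memorize(n0, alpha, beta, q, t_f, review_item, seed=0):
    """Schedule reviews of one item on (0, t_f) and return (time, recalled) pairs."""
    rng = random.Random(seed)
    n, t_last, history = n0, 0.0, []
    while True:
        t = sample_next_review(n, t_last, q, rng, horizon=t_f)
        if t is None:
            return history
        recalled = bool(review_item(t))      # outcome reported by the learner
        # Forgetting rate jump of Eq. 3
        n = (1.0 - alpha) * n if recalled else (1.0 + beta) * n
        history.append((t, recalled))
        t_last = t
```

Because the recall probability is reset to 1 at every review, the intensity drops to zero right after a review and grows back as the item is forgotten, so reviews are naturally spaced further apart as the forgetting rate decreases.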

Optimizing for multiple items. Given a set of items $\mathcal{I}$ with reviewing intensities $\mathbf{u}(t)$ and associated counting processes $\mathbf{N}(t)$, recall outcomes $\mathbf{r}(t)$, recall probabilities $\mathbf{m}(t)$ and forgetting rates $\mathbf{n}(t)$, we can solve the spaced repetition problem defined by Eq. 5 similarly as in the case of a single item.

More specifically, consider the following quadratic form for the loss $\ell$:

$$\ell(\mathbf{m}(t), \mathbf{n}(t), \mathbf{u}(t)) = \frac{1}{2} \sum_{i \in \mathcal{I}} (1 - m_i(t))^2 + \frac{1}{2} \sum_{i \in \mathcal{I}} q_i\, u_i^2(t),$$

where $\{q_i\}_{i \in \mathcal{I}}$ are given parameters, which trade off recall probability and number of item reviews and may favor the learning of one item over another. Then, one can exploit the assumption of independence among items to derive the optimal reviewing intensity for each item, proceeding similarly as in the case of a single item: Given a set of items $\mathcal{I}$, the optimal reviewing intensity for each item $i \in \mathcal{I}$ in the spaced repetition problem, defined by Eq. 5, under quadratic loss is given by $u_i^*(t) = q_i^{-1/2}(1 - m_i(t))$. Finally, note that we can easily sample item reviewing times simply by running $|\mathcal{I}|$ instances of Memorize (Algorithm 1), one per item.

## 4 Experiments

### 4.1 Experiments on synthetic data

In this section, our goal is analyzing the performance of Memorize under a controlled setting using metrics and baselines that we cannot compute in the real data we have access to.

Experimental setup. We evaluate the performance of Memorize using two quality metrics: the recall probability at a given time in the future and the forgetting rate. Here, by looking further (less far) into the future, we can assess long-term (short-term) retention. Moreover, we compare the performance of our method with three baselines: (i) a uniform reviewing schedule, which sends item(s) for review at a constant rate; (ii) a last minute reviewing schedule, which only sends item(s) for review during a limited period, at a constant rate therein; and (iii) a threshold based reviewing schedule, which increases the reviewing intensity of an item when its recall probability reaches a threshold. The threshold baseline is similar to the heuristics proposed by Metzler-Baddeley et al. [18] and Lindsey et al. [14]. We do not compare with the algorithm proposed by Reddy et al. [22] because, as it is specially designed for the Leitner system, it assumes a discrete set of forgetting rate values and, as a consequence, is not applicable to our (more general) setting. Unless otherwise stated, we set the parameters of the baselines and our method such that the total number of reviewing events during the time window is equal.

Solution quality. For each method, we run independent simulations and compute the above quality metrics over time. Figure 5 summarizes the results, which show that our model: (i) consistently outperforms all the baselines in terms of both quality metrics; (ii) is more robust across runs both in terms of quality metrics and reviewing schedule; and (iii) reduces the reviewing intensity as time goes by and the recall probability improves, as one could have expected.

Learning effort. The value of the parameter $q$ controls the learning effort required by Memorize: the lower its value, the higher the number of reviewing events. Intuitively, one may also expect the learning effort to influence how quickly a learner memorizes a given item: the lower the value of $q$, the quicker a learner will memorize it. Figure (a) confirms this intuition by showing the average forgetting rate and number of reviewing events at several times for different $q$ values.

Aptitude of the learner and item difficulty. The parameters $\alpha$ and $\beta$ capture the aptitude of a learner and the difficulty of the item to be learned: the higher (lower) the value of $\alpha$ ($\beta$), the quicker a learner will memorize the item. In Figure (b), we quantitatively evaluate this effect by means of the average time the learner takes to reach a fixed target forgetting rate using Memorize for different parameter values.

### 4.2 Experiments on real data

In this section, our goal is to evaluate how well each reviewing schedule spaces the reviews leveraging a real dataset. Unlike in the synthetic experiments, we cannot intervene and determine what would have happened if a user had followed Memorize or any of the baselines in the real dataset. As a consequence, measuring the performance of different algorithms is more challenging. We overcome this difficulty by relying on likelihood comparisons to determine how closely a (user, item) pair followed a particular reviewing schedule and compute quality metrics that do not depend on the choice of memory model. (Note that it is not the objective of this paper to evaluate the predictive power of the underlying memory models; we rely on previous work for that [23, 24]. However, for completeness, we provide a series of benchmarks and evaluations for the models we used in this paper in Appendix H.)

Datasets description. We use data gathered from Duolingo, a popular language-learning online platform (the dataset is available at https://github.com/duolingo/halflife-regression). This dataset consists of millions of sessions of study, involving 5.3 million unique (user, word) pairs, collected over a period of two weeks. In a single session, a user answers multiple questions, each of which contains multiple words. Each word maps to an item $i$ and the fraction of correct recalls of sentences containing a word $i$ in the session is used as an estimate of its recall probability at the time of the session, as in previous work [23]. If a word is recalled perfectly during a session then it is considered as a successful recall, i.e., $r_i(t) = 1$, and otherwise it is considered as an unsuccessful recall, i.e., $r_i(t) = 0$. Since we can only expect the estimation of the model parameters to be accurate for users and items with a sufficient number of reviewing events, we only consider users and words with at least a minimum number of reviewing events. After this preprocessing step, our dataset consists of 5.2 million unique (user, word) pairs.

Experimental setup and methodology. As pointed out previously, we cannot intervene in the real datasets and thus rely on likelihood comparisons to determine how closely a (user, item) pair followed a particular reviewing schedule. In more detail, we proceed as follows.

First, we estimate the parameters $\alpha$ and $\beta$ using half-life regression (half-life is the inverse of our forgetting rate multiplied by a constant), where we fit a single set of parameters for all items, but a different initial forgetting rate per item (refer to Appendix H for more details). Then, for each user, we use maximum likelihood estimation to fit the parameter $q$ in Memorize and the rate parameter of the uniform reviewing schedule. For the threshold based reviewing schedule, we fit one set of parameters for each sequence of review events, using maximum likelihood estimation for one parameter and grid search for the other.

Then, we compute the likelihood of the times of the reviewing events for each (user, item) pair under the intensity given by Memorize, i.e., $u(t) = q^{-1/2}(1 - m(t))$, the intensity given by the uniform schedule, i.e., a constant intensity, and the intensity given by the threshold based schedule. The log-likelihood of a set of reviewing events $\{t_i\}$ given an intensity function $u(t)$ can be computed as follows [1]:

$$LL(\{t_i\}) = \sum_i \log u(t_i) - \int_0^T u(t)\, dt.$$
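As an illustration, the log-likelihood above can be approximated numerically for any intensity function. The sketch below uses the trapezoidal rule for the integral; the function name, the `steps` parameter and the example values are assumptions made here for illustration.

```python
import math

def log_likelihood(event_times, intensity, T, steps=10_000):
    """Approximate LL = sum_i log u(t_i) - integral_0^T u(t) dt.

    `intensity` is a callable u(t) that must be strictly positive at the
    event times; the integral is computed with the trapezoidal rule.
    """
    log_term = sum(math.log(intensity(t)) for t in event_times)
    h = T / steps
    integral = 0.5 * (intensity(0.0) + intensity(T))
    integral += sum(intensity(k * h) for k in range(1, steps))
    return log_term - h * integral

# Uniform schedule with a constant, hypothetical rate of 2 reviews per unit time
ll_uniform = log_likelihood([0.5, 1.0, 2.5], lambda t: 2.0, T=3.0)
```

For a constant intensity the result agrees with the closed form, here $3 \log 2 - 2 \cdot 3$. For an intensity that jumps at the reviews, such as the one given by Memorize, the intensity at each event time should be taken as its value just before the review, and a finer grid may be needed around the jumps.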

This allows us to determine how closely a (user, item) pair follows a particular reviewing schedule, as shown in Figure 9. (Duolingo uses hand-tuned spaced repetition algorithms, which propose reviewing times to the users. However, since users often do not perform reviews exactly at the recommended times, some pairs will be closer to uniform than to threshold or Memorize and vice versa.) The distribution of the likelihood values under each reviewing schedule is provided in Appendix G. We do not compare to the last minute baseline since in Duolingo there is no terminal time which users target. Additionally, in many (user, item) pairs, the first review takes place close to the start of the observation window and thus the last minute baseline is equivalent to the uniform reviewing schedule.

Finally, since measurements of the future recall probability are not forthcoming and depend on the memory model of choice, we concentrate on the following alternative quality metrics, which do not depend on the particular choice of memory model:

• Effort: for each (user, item) pair, we measure the effort by means of the empirical estimate of the inverse of the total reviewing period. The lower the effort, the less burden on the user, allowing her to learn more items simultaneously.

• Empirical forgetting rate: for each (user, item) pair, we compute an empirical estimate of the forgetting rate by the time of the last reviewing event. Here, note that the estimate of the forgetting rate only depends on the observed data (not on model or method parameters). For a fairer comparison across items, we normalize each empirical forgetting rate using the average empirical initial forgetting rate of the corresponding item at the beginning of the observation window.

Given a particular recall pattern, the lower the above quality metrics, the more effective the reviewing schedule.
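The two metrics can be sketched as follows. The exact estimators are not recoverable from the text above, so this sketch makes two labeled assumptions: the total reviewing period is the time between the first and last review, and the empirical forgetting rate is obtained by inverting the exponential forgetting curve of Eq. 1 using the empirical recall estimate at the last review. The function names are hypothetical.

```python
import math

def effort(review_times):
    """Inverse of the total reviewing period (assumed: last minus first review)."""
    return 1.0 / (review_times[-1] - review_times[0])

def empirical_forgetting_rate(m_hat_last, t_last, t_prev):
    """Forgetting rate implied by Eq. 1 at the last review (an assumption).

    m_hat_last is the empirical recall estimate at the last review and must
    lie in (0, 1]; t_prev is the time of the review before the last one.
    """
    return -math.log(m_hat_last) / (t_last - t_prev)
```

For instance, three reviews at times 0, 2 and 10 give an effort of 0.1, and an empirical recall of $e^{-1.5}$ observed 3 time units after the previous review gives an empirical forgetting rate of 0.5. The normalization by the item's average initial forgetting rate described above is omitted here.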

Results. We first group (user, item) pairs by their recall pattern, i.e., the sequence of successful ($r = 1$) and unsuccessful ($r = 0$) recalls over time. If two pairs have the same recall pattern, then they have the same number of reviews and the same changes in their forgetting rates $n(t)$. For each recall pattern in our observation window, we pick the top 25% pairs in terms of likelihood for each method and compute the average effort and empirical forgetting rate, as defined above. Figure 11 summarizes the results for the most common recall patterns (results are qualitatively similar for other recall patterns), where we report the ratio between the effort and empirical forgetting rate values achieved by the uniform and threshold based reviewing schedules and the values achieved by Memorize. That means, if the reported value is smaller than $1$, Memorize is more effective for the corresponding pattern. We find that both in terms of effort and empirical forgetting rate, Memorize outperforms the uniform and threshold based reviewing schedules for all recall patterns. For example, for the recall pattern consisting of two unsuccessful recalls followed by two successful recalls (red-red-green-green), Memorize achieves lower effort and lower empirical forgetting rate than the second competitor.

Next, we group (user, item) pairs by the number of reviews during a fixed period of time, i.e., we control for the effort, pick the top 25% pairs in terms of likelihood for each method and compute the average empirical forgetting rate. Figure 12 summarizes the results for sequences with up to seven reviews since the beginning of the observation window, where lower values indicate better performance. The results show that Memorize offers a competitive advantage with respect to the other baselines, which is statistically significant.

## 5 Conclusions

In this paper, we have first introduced a novel representation of spaced repetition using the framework of marked temporal point processes and SDEs with jumps and then designed a framework that exploits this novel representation to cast the design of spaced repetition algorithms as a stochastic optimal control problem of such SDEs. For ease of exposition, we have considered only two memory models, exponential and power-law forgetting curves, and a quadratic loss function. However, our framework is agnostic to these particular modeling choices and it provides a set of novel techniques to find reviewing schedules that are optimal under a given choice of memory model and loss. We experimented on both synthetic and real data gathered from Duolingo, a popular language-learning online platform, and showed that our framework may be able to help learners memorize more effectively than alternatives.

There are many interesting directions for future work. For example, it would be interesting to perform large scale interventional experiments to assess the performance of our algorithm in comparison with existing spaced repetition algorithms deployed by, e.g., Duolingo. Moreover, in our work, we consider a particular quadratic loss. However, it would be useful to derive optimal reviewing intensities for other (non-quadratic) losses capturing particular learning goals. We assumed that, by reviewing an item, one can only influence its recall probability and forgetting rate. However, items may be dependent and thus, by reviewing an item, one can influence the recall probabilities and forgetting rates of several items. Finally, it would be very interesting to allow for reviewing events to be composed of groups of items, some reviewing times to be preferable over others, and then derive both the optimal reviewing schedule and optimal grouping of items.

## References

• [1] O. Aalen, O. Borgan, and H. K. Gjessing. Survival and event history analysis: a process point of view. Springer, 2008.
• [2] R. C. Atkinson. Optimizing the learning of a second-language vocabulary. Journal of Experimental Psychology, 96(1):124, 1972.
• [3] L. Averell and A. Heathcote. The form of the forgetting curve and the fate of memories. Journal of Mathematical Psychology, 55(1), 2011.
• [4] D. P. Bertsekas. Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995.
• [5] K. C. Bloom and T. J. Shuell. Effects of massed and distributed practice on the learning and retention of second-language vocabulary. The Journal of Educational Research, 74(4):245–248, 1981.
• [6] G. Branwen. Spaced repetition.
• [7] N. J. Cepeda, H. Pashler, E. Vul, J. T. Wixted, and D. Rohrer. Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3):354, 2006.
• [8] N. J. Cepeda, E. Vul, D. Rohrer, J. T. Wixted, and H. Pashler. Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19(11):1095–1102, 2008.
• [9] F. N. Dempster. Spacing effects and their implications for theory and practice. Educational Psychology Review, 1(4):309–330, 1989.
• [10] H. Ebbinghaus. Memory: a contribution to experimental psychology. Teachers College, Columbia University, 1885.
• [11] F. B. Hanson. Applied stochastic processes and control for jump-diffusions: modeling, analysis, and computation. SIAM, 2007.
• [12] S. Leitner. So lernt man lernen. Herder, 1974.
• [13] P. A. Lewis and G. S. Shedler. Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 26(3):403–413, 1979.
• [14] R. V. Lindsey, J. D. Shroyer, H. Pashler, and M. C. Mozer. Improving students’ long-term knowledge retention through personalized review. Psychological Science, 25(3):639–647, 2014.
• [15] G. R. Loftus. Evaluating forgetting curves. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(2), 1985.
• [16] A. W. Melton. The situation with respect to the spacing of repetitions and memory. Journal of Verbal Learning and Verbal Behavior, 9(5):596–606, 1970.
• [17] E. Mettler, C. M. Massey, and P. J. Kellman. A comparison of adaptive and fixed schedules of practice. Journal of Experimental Psychology: General, 145(7):897, 2016.
• [18] C. Metzler-Baddeley and R. J. Baddeley. Does adaptive training work? Applied Cognitive Psychology, 23(2):254–266, 2009.
• [19] T. P. Novikoff, J. M. Kleinberg, and S. H. Strogatz. Education of a model student. PNAS, 109(6):1868–1873, 2012.
• [20] H. Pashler, N. Cepeda, R. V. Lindsey, E. Vul, and M. C. Mozer. Predicting the optimal spacing of study: A multiscale context model of memory. In Advances in Neural Information Processing Systems, pages 1321–1329, 2009.
• [21] P. I. Pavlik and J. R. Anderson. Using a model to compute the optimal schedule of practice. Journal of Experimental Psychology: Applied, 14(2):101, 2008.
• [22] S. Reddy, I. Labutov, S. Banerjee, and T. Joachims. Unbounded human learning: Optimal scheduling for spaced repetition. In KDD, 2016.
• [23] B. Settles and B. Meeder. A trainable spaced repetition model for language learning. In ACL, 2016.
• [24] J. T. Wixted and S. K. Carpenter. The Wickelgren power law and the Ebbinghaus savings function. Psychological Science, 18(2):133–134, 2007.
• [25] A. Zarezade, A. De, H. Rabiee, and M. Gomez-Rodriguez. Cheshire: An online algorithm for activity maximization in social networks. arXiv preprint arXiv:1703.02059, 2017.
• [26] A. Zarezade, U. Upadhyay, H. Rabiee, and M. Gomez-Rodriguez. RedQueen: An online algorithm for smart broadcasting in social networks. In WSDM, 2017.

## Appendix A Proof of Proposition 2

According to Eq. 1, the recall probability $m(t)$ depends on the forgetting rate, $n(t)$, and the time elapsed since the last review, $D(t)$. Moreover, we can readily write the differential of $D(t)$ as $dD(t) = dt - D(t)\,dN(t)$.

We define the vector $\mathbf{X}(t) := [n(t), D(t)]^{\top}$. Then, we use Eq. 3 and Itô’s calculus [11] to compute its differential:

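$$
\begin{aligned}
d\mathbf{X}(t) &= \mathbf{f}(\mathbf{X}(t), t)\,dt + \mathbf{h}(\mathbf{X}(t), t)\,dN(t), \\
\mathbf{f}(\mathbf{X}(t), t) &= \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \\
\mathbf{h}(\mathbf{X}(t), t) &= \begin{bmatrix} -\alpha n(t) r(t) + \beta n(t)(1 - r(t)) \\ -D(t) \end{bmatrix}.
\end{aligned}
$$

Finally, using again Itô’s calculus and the above differential, we can compute the differential of the recall probability $m(t) = F(\mathbf{X}(t)) = e^{-D(t) n(t)}$ as follows:

$$
\begin{aligned}
dF(\mathbf{X}(t)) &= F(\mathbf{X}(t + dt)) - F(\mathbf{X}(t)) \\
&= F(\mathbf{X}(t) + d\mathbf{X}(t)) - F(\mathbf{X}(t)) \\
&= \left(\mathbf{f}^{\top} F_{\mathbf{X}}(\mathbf{X}(t))\right) dt + F\big(\mathbf{X}(t) + \mathbf{h}(\mathbf{X}(t), t)\,dN(t)\big) - F(\mathbf{X}(t)) \\
&= \left(\mathbf{f}^{\top} F_{\mathbf{X}}(\mathbf{X}(t))\right) dt + \big(F(\mathbf{X}(t) + \mathbf{h}(\mathbf{X}(t), t)) - F(\mathbf{X}(t))\big)\,dN(t) \\
&= \left(e^{-(D(t) - D(t))\, n(t)\,(1 - \alpha r(t) + \beta (1 - r(t)))} - e^{-D(t) n(t)}\right) dN(t) - n(t)\, e^{-D(t) n(t)}\,dt \\
&= -n(t)\, e^{-D(t) n(t)}\,dt + \left(1 - e^{-D(t) n(t)}\right) dN(t) \\
&= -n(t)\, F(\mathbf{X}(t))\,dt + \big(1 - F(\mathbf{X}(t))\big)\,dN(t) \\
&= -n(t)\, m(t)\,dt + (1 - m(t))\,dN(t).
\end{aligned}
$$

## Appendix B Proof of Lemma 3

According to the definition of differential,

$$
\begin{aligned}
dF := dF(x(t), y(t), t) &= F(x(t + dt), y(t + dt), t + dt) - F(x(t), y(t), t) \\
&= F(x(t) + dx(t), y(t) + dy(t), t + dt) - F(x(t), y(t), t).
\end{aligned}
$$

Then, using Itô’s calculus, we can write

$$
\begin{aligned}
dF = {} & F(x + f\,dt + g,\, y + p\,dt + q,\, t + dt)\,dN(t)\,z + F(x + f\,dt + h,\, y + p\,dt + q,\, t + dt)\,dN(t)\,(1 - z) \\
& + F(x + f\,dt,\, y + p\,dt,\, t + dt)\,(1 - dN(t)) - F(x, y, t),
\end{aligned}
\tag{12}
$$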

where, for notational simplicity, we drop the arguments of all functions except $F$ and $dN$. Then, we expand the first three terms:

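$$
\begin{aligned}
F(x + f\,dt + g,\, y + p\,dt + q,\, t + dt) &= F(x + g, y + q, t) + F_x(x + g, y + q, t)\,f\,dt + F_y(x + g, y + q, t)\,p\,dt \\
&\quad + F_t(x + g, y + q, t)\,dt, \\
F(x + f\,dt + h,\, y + p\,dt + q,\, t + dt) &= F(x + h, y + q, t) + F_x(x + h, y + q, t)\,f\,dt + F_y(x + h, y + q, t)\,p\,dt \\
&\quad + F_t(x + h, y + q, t)\,dt, \\
F(x + f\,dt,\, y + p\,dt,\, t + dt) &= F(x, y, t) + F_x(x, y, t)\,f\,dt + F_y(x, y, t)\,p\,dt + F_t(x, y, t)\,dt,
\end{aligned}
$$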

using that the bilinear differential form $dt\,dN(t) = 0$. Finally, by substituting the above three equations into Eq. 12, we conclude that:

$$
\begin{aligned}
dF(x(t), y(t), t) = {} & (F_t + f F_x + p F_y)(x(t), y(t), t)\,dt \\
& + \big[F(x + g, y + q, t)\,z(t) + F(x + h, y + q, t)\,(1 - z(t)) - F(x, y, t)\big]\,dN(t).
\end{aligned}
$$
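As a numerical sanity check of this kind of jump differential, the recall-probability dynamics obtained in Appendix A, $dm(t) = -n(t)\,m(t)\,dt + (1 - m(t))\,dN(t)$, can be integrated with a simple Euler scheme and compared against the closed form $m(t) = e^{-n(t) D(t)}$. The sketch below is illustrative only: the function name, the constant reviewing rate, the parameter values, and the assumption that the item was just reviewed at $t = 0$ are choices made here, not part of the paper's experiments.

```python
import math
import random


def max_sde_error(alpha=0.5, beta=0.2, n0=1.0, review_rate=1.0,
                  horizon=10.0, dt=1e-3, seed=0):
    """Largest gap between the Euler-integrated m(t) and exp(-n(t) D(t))."""
    rng = random.Random(seed)
    n, D, m = n0, 0.0, 1.0  # assumption: the item was just reviewed at t = 0
    worst = 0.0
    for _ in range(int(horizon / dt)):
        if rng.random() < review_rate * dt:       # a review happens: dN(t) = 1
            r = 1.0 if rng.random() < m else 0.0  # recall succeeds w.p. m(t)
            n += -alpha * n * r + beta * n * (1.0 - r)  # first entry of h
            D += -D                                      # second entry of h
            m += 1.0 - m                                 # jump term (1 - m) dN
        else:                                            # no review: dN(t) = 0
            D += dt                                      # drift f = [0, 1]
            m += -n * m * dt                             # drift term -n m dt
        worst = max(worst, abs(m - math.exp(-n * D)))
    return worst


if __name__ == "__main__":
    print(max_sde_error())
```

The remaining gap comes only from the Euler discretization of the decay between reviews, so it should shrink as `dt` is reduced.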

## Appendix C Lemma C

Consider the following family of losses with parameter $d$,

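$$
\begin{aligned}
\ell_d(m(t), n(t), u(t)) &= h_d(m(t), n(t)) + g_d^2(m(t), n(t)) + \frac{1}{2} q\, u(t)^2, \\
g_d(m(t), n(t)) &= 2^{-1/2} \left[ \frac{c_2 \log(d)}{-m(t)^2 + 2m(t) - d} - \frac{c_2 \log(d)}{1 - d} + c_1 m(t) \log\!\left(\frac{1 + \beta}{1 - \alpha}\right) - c_1 \log(1 + \beta) \right]_+, \\
h_d(m(t), n(t)) &= -\sqrt{q}\, m(t)\, n(t)\, c_2\, \frac{(-2m(t) + 2) \log(d)}{(-m(t)^2 + 2m(t) - d)^2},
\end{aligned}
\tag{13}
$$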

where $c_1$ and $c_2$ are arbitrary constants. Then, the cost-to-go $J_d$ that satisfies the HJB equation, defined by Eq. 9, is given by:

$$
J_d(m(t), n(t), t) = \sqrt{q} \left( c_1 \log(n(t)) + \frac{c_2 \log(d)}{-m(t)^2 + 2m(t) - d} \right)
\tag{14}
$$

and the optimal intensity is given by:

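$$
u_d^*(t) = q^{-1/2} \left[ \frac{c_2 \log(d)}{-m(t)^2 + 2m(t) - d} - \frac{c_2 \log(d)}{1 - d} + c_1 m(t) \log\!\left(\frac{1 + \beta}{1 - \alpha}\right) - c_1 \log(1 + \beta) \right]_+.
$$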
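The closed form for the optimal intensity can be checked numerically against the general expression $q^{-1}\,[J_d(m, n, t) - J_d(1, (1-\alpha)n, t)\,m - J_d(1, (1+\beta)n, t)\,(1-m)]_+$ evaluated with the cost-to-go of Eq. 14. The following is a minimal sketch of that check; the function names and the parameter values used in the example (in particular $d = 2$, chosen here so that no denominator vanishes for $m \in [0, 1]$) are illustrative assumptions.

```python
import math


def cost_to_go(m, n, d, q, c1, c2):
    """Cost-to-go J_d(m, n, t) of Eq. 14 (it has no explicit time dependence)."""
    return math.sqrt(q) * (c1 * math.log(n)
                           + c2 * math.log(d) / (-m ** 2 + 2 * m - d))


def intensity_from_cost_to_go(m, n, d, q, c1, c2, alpha, beta):
    """q^{-1} [J(m, n) - J(1, (1 - alpha) n) m - J(1, (1 + beta) n) (1 - m)]_+."""
    diff = (cost_to_go(m, n, d, q, c1, c2)
            - cost_to_go(1.0, (1 - alpha) * n, d, q, c1, c2) * m
            - cost_to_go(1.0, (1 + beta) * n, d, q, c1, c2) * (1 - m))
    return max(diff, 0.0) / q


def intensity_closed_form(m, d, q, c1, c2, alpha, beta):
    """Closed-form optimal intensity stated in the lemma."""
    inner = (c2 * math.log(d) / (-m ** 2 + 2 * m - d)
             - c2 * math.log(d) / (1 - d)
             + c1 * m * math.log((1 + beta) / (1 - alpha))
             - c1 * math.log(1 + beta))
    return max(inner, 0.0) / math.sqrt(q)


if __name__ == "__main__":
    # Illustrative parameter values (assumptions, not taken from the paper).
    args = dict(d=2.0, q=4.0, c1=1.0, c2=1.0, alpha=0.5, beta=1.0)
    print(intensity_from_cost_to_go(0.5, 3.0, **args))
    print(intensity_closed_form(0.5, **args))
```

The closed form does not depend on $n(t)$: the $\log(n(t))$ terms contributed by the three evaluations of $J_d$ cancel, which is why `intensity_closed_form` takes no `n` argument.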

Consider the family of losses defined by Eq. 13 and the functional form for the cost-to-go defined by Eq. 14. Then, for any parameter value $d$, the optimal intensity is given by

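$$
\begin{aligned}
u_d^*(t) &= q^{-1} \big[ J_d(m(t), n(t), t) - J_d(1, (1 - \alpha) n(t), t)\, m(t) - J_d(1, (1 + \beta) n(t), t)\, (1 - m(t)) \big]_+ \\
&= q^{-1/2} \left[ \frac{c_2 \log(d)}{-m(t)^2 + 2m(t) - d} - \frac{c_2 \log(d)}{1 - d} + c_1 m(t) \log\!\left(\frac{1 + \beta}{1 - \alpha}\right) - c_1 \log(1 + \beta) \right]_+,
\end{aligned}
$$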

and the HJB equation is satisfied:

 ∂Jd(m,n,t)∂t−mn∂Jd(m,n,t)∂m+hd(m,n)+g2d(m,n)−12q−1(Jd(m,n,t) −Jd(1,(1−α)n,t)m−Jd(1,(1+β)n,t)(1−m))2+ =√qmnc2(−2m+2)log(d)(−m2+2m−d)2+hd(m,n)+g2d(m,n)−1