1 Introduction
In a traditional classroom, a teacher uses the same learning material (e.g., textbook, instruction pace) for all students. However, the selected material may be too hard for some students and too easy for others, and some students may need more time to learn than others. Such a learning process may not be efficient. These issues can be solved if the teacher makes an individualized learning plan for each student: select appropriate learning material according to each student’s ability and let each student learn at her/his own pace. Because this requires a very low student-teacher ratio, such an individualized adaptive learning plan may be too expensive to apply to all students. Adaptive learning systems are therefore developed to provide individualized adaptive learning for all students/learners. In particular, with the fast growth of digital platforms, globally integrated resources, and machine learning algorithms, adaptive learning systems are becoming increasingly affordable, applicable, and efficient (zhang2016smart).
An adaptive learning system, also referred to as a personalized/individualized learning system or intelligent tutoring system, aims to provide a learner with an optimal and individualized learning experience or instructional materials, so that the learner can reach a certain achievement level in the shortest time or reach as high an achievement level as possible in a fixed period of time. First, a learner’s historical data are used to estimate her/his proficiency. Then, according to that proficiency level, the system selects the most appropriate learning material for the learner. After the learner finishes the learning material, an assessment is given, and her/his proficiency level is updated and used by the adaptive learning system to choose the next most appropriate learning material. This process repeats until the learner achieves a certain proficiency level.
In previous studies, the proficiencies or latent traits were typically characterized as vectors of binary latent variables (chen2018recommendation; li2018optimal; tang2019reinforcement). However, it is important to consider the granularity of the latent traits in a complicated learning and assessment environment in which a knowledge domain consists of several fine-grained abilities. In some cases, modeling learners’ abilities as mastery or non-mastery is too simple. For example, when an item is designed to measure several latent traits, a learner regarded as having mastered all traits related to the item is still not assured of answering the item correctly. A possible reason is that the so-called mastery is not full mastery of a latent trait. By measuring learners’ traits on continuous scales, the adaptive learning system can be designed to help learners learn and improve until they reach the target levels of certain abilities, so that they can achieve target scores in assessments. In practice especially, most assessments are designed to measure learners’ latent traits (mcglohen2008combining). In such scenarios, it is better to use a continuous scale to measure the latent traits, as item response theory (IRT) does. In this paper, we develop an adaptive learning system that estimates learners’ abilities using measurement models in order to provide them with the most appropriate materials for further improvement.
Existing research studies have focused on modeling learners’ learning paths (chen2018hidden; wang2018tracking), accelerating learners’ memory speed (reddy2017accelerating), providing model-based sequence recommendation (chen2018recommendation; lan2016contextual; xu2016personalized), tracing learners’ concept knowledge state transitions over time (lan2014time), and selecting materials for learners optimally based on model-free algorithms (li2018optimal; tang2019reinforcement). However, explicit models are typically needed to characterize learners’ learning progress in these studies.
While there exist research studies that aim to find the optimal learning strategy/plan (called the policy in the rest of the paper), which chooses the most appropriate learning materials for learners using model-free algorithms, they all assume discrete latent traits. In addition, these algorithms are not applicable when the number of learners is too small for the system to learn an optimal policy. This paper studies the important yet less addressed adaptive learning problem of finding the optimal learning policy based on continuous latent traits, and applies machine learning algorithms to tackle challenges such as having only a small number of learners available in the historical data.
In this paper, we formulate the adaptive learning problem as a Markov decision process (MDP), in which the state is the (continuous) latent traits of a learner and the action is the (discrete) learning material given to the learner. However, the state transition model is unknown in practice, which makes the MDP unsolvable using conventional model-based algorithms such as the value iteration algorithm (sutton2018reinforcement). To address this issue, we apply a model-free deep reinforcement learning (DRL) algorithm, the so-called deep Q-learning algorithm, to search for the optimal learning policy. Model-free DRL algorithms are a class of machine learning algorithms that solve an MDP by learning an optimal policy, represented by neural networks, directly from a sequence of state transitions when the transition model itself is unknown (franccois2018introduction). DRL algorithms have been widely applied to problems in many different fields, such as playing Atari games (mnih2015human), bidding and pricing in electricity markets (xu2019deep), manipulating robots (gu2017deep), and localizing objects (caicedo2015active). We refer interested readers to franccois2018introduction for a detailed review of the theories and applications of DRL. The adaptive learning system thus embeds well-developed measurement models and the model-free DRL algorithm so as to be more flexible.
However, a deep Q-learning algorithm typically requires a large amount of state transition data to find an optimal policy, and such data may be difficult to obtain in practice. To cope with the challenge of insufficient state transition data, we develop a transition model estimator that emulates the learner’s learning process using neural networks. The transition model, fitted using the available historical transition data, can be used in the deep Q-learning algorithm to further improve its performance at no additional cost.
The purpose of this paper is to develop a fully adaptive learning system in which (i) the learning material given to a learner is based on her/his continuous latent traits that indicate the levels of certain abilities, and (ii) the learning policy that maps the learner’s latent traits to the learning materials is found adaptively with minimal assumptions on the learners’ learning process. First, an MDP formulation of the adaptive learning problem that represents latent traits on a continuum is developed. Second, a model-free DRL algorithm, the deep Q-learning algorithm, is applied to the adaptive learning problem, to the best of our knowledge for the first time. Third, a neural-network-based transition model estimator is developed, which can greatly improve the performance of the deep Q-learning algorithm when the number of learners is inadequate. Last, simulation studies are conducted to serve as demonstration cases for the development of adaptive learning systems.
The remainder of this paper is organized as follows. In the Preliminaries section, we briefly review measurement models and make some assumptions on the adaptive learning problem. In the Adaptive Learning Problem section, we introduce conventional adaptive learning systems and develop an MDP formulation for the adaptive learning problem. Then, we apply the deep Q-learning algorithm to solve the MDP in the Optimal Learning Policy Discovery Algorithm section, where a transition model estimator that emulates the learners is also developed. Two simulation studies are conducted in the Numerical Simulation section, and some concluding remarks are made at the end of the paper.
2 Preliminaries
In this section, we give a brief introduction to measurement models for continuous latent traits, which are an important component of adaptive learning systems. The representation of learners’ latent traits and the assumptions on them are also presented.
2.1 Measurement Models
In an adaptive learning system, a test is given to a learner/student after each learning cycle. The learner’s responses to the test items are collected by the system and her/his latent traits are estimated using measurement models, specifically IRT models (rash1960probabilistic; lord1968statistical).
An appropriate IRT model needs to be chosen based on the test’s features, such as its dimensional structure (zhang2013procedure) and its response categories. To be more specific, when item responses are recorded as binary values indicating correct or incorrect answers, a test that evaluates only one latent trait will use unidimensional IRT models (rash1960probabilistic; birnbaum1968some; lord1980application), whereas tests that involve more than one trait will use multidimensional item response theory (MIRT) models (reckase1972development; mulaik1972mathematical; sympson1978model; whitely1980multicomponent). When item responses have more than two categories, polytomous IRT models such as the partial credit model (masters1982rasch), the generalized partial credit model (muraki1992generalized), and the graded response model (samejima1969estimation) are used for the unidimensional case. Their extensions can be applied in multidimensional cases.
The basic representation of an IRT model is expressed as
$$P(U = u \mid \boldsymbol{\theta}) = f(\boldsymbol{\theta}, \boldsymbol{\eta}, u), \tag{1}$$
where $P$ denotes probability, $U$ is a random variable representing the score on the test item, $u$ is a possible value of $U$, $\boldsymbol{\theta}$ is a vector of parameters describing the learner’s latent traits, $\boldsymbol{\eta}$ is a vector of parameters indicating the characteristics of the item, and $f$ denotes a function that maps $(\boldsymbol{\theta}, \boldsymbol{\eta}, u)$ to a probability in $[0, 1]$. As pointed out in ackerman2003using, many educational tests are inherently multidimensional. Therefore, we will use MIRT as the intrinsic model to build up the adaptive learning system. As an illustration, the multidimensional two-parameter logistic IRT (M2PL) model is given by
$$P(U_{ij} = 1 \mid \boldsymbol{\theta}_i) = \frac{\exp(\boldsymbol{a}_j^{\top} \boldsymbol{\theta}_i + d_j)}{1 + \exp(\boldsymbol{a}_j^{\top} \boldsymbol{\theta}_i + d_j)}, \tag{2}$$
where $U_{ij}$ is the response given by test taker $i$ to item $j$, $\boldsymbol{\theta}_i$ is a vector in $\mathbb{R}^K$ describing a set of $K$ latent traits, $\boldsymbol{a}_j$ is a vector of $K$ discrimination parameters for item $j$, indicating the relative importance of each trait in correctly answering the item, and the intercept parameter $d_j$ is a scalar for item $j$. An applicable item takes each element of $\boldsymbol{a}_j$ to be nonnegative. Therefore, as the value of each element of $\boldsymbol{\theta}_i$ increases, the probability of a correct response increases.
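As a minimal sketch of the M2PL response probability described above, the following Python function evaluates the logistic of a weighted sum of traits plus an intercept. The function name, argument names, and the example item parameters are hypothetical and chosen only for illustration.

```python
import math

def m2pl_probability(theta, a, d):
    """Probability of a correct response under an M2PL-style model.

    theta: list of latent trait values for one learner
    a:     list of discrimination parameters for one item (same length)
    d:     scalar intercept of the item
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical item that loads mostly on the first trait.
a_item, d_item = [1.2, 0.4], -0.5
p_low = m2pl_probability([0.0, 0.0], a_item, d_item)
p_high = m2pl_probability([1.0, 1.0], a_item, d_item)
```

With nonnegative discrimination parameters, raising any trait cannot lower the probability, which is why `p_high` exceeds `p_low` here.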
With an online calibration design, an accurately calibrated item bank can be acquired for an adaptive learning system using previous learners’ response data, without large pretest subject pools (makransky2014automatic; zhang2016smart). After item parameters are pre-calibrated, a variety of latent trait estimation methods can be applied to estimate learners’ abilities. Conventional methods such as maximum likelihood estimation (lord1968statistical), weighted likelihood estimation, and Bayesian methods (e.g., expected a posteriori (EAP) estimation and maximum a posteriori (MAP) estimation) can accurately estimate latent traits in MIRT models. Their variations have also been extended to estimate latent traits in multiple dimensions. Many latent trait estimation methods result in a bias as small as $O(n^{-1})$, where $n$ denotes test length, while approaches that further reduce the bias as well as the variance of estimates have also been proposed (firth1993bias; tseng2001multidimensional; wang2015latent; warm1989weighted; zhang2011investigating).
2.2 Assumptions
Denote by $\boldsymbol{\theta}^t = (\theta_1^t, \ldots, \theta_K^t)$ the learner’s latent traits at time step $t$, where $K$ is the number of dimensions. Throughout this paper, we make the following simplifying yet practical assumptions:
A1. No retrogression exists in latent traits. That is, $\theta_k^{t+1} \geq \theta_k^t$ for all $k$ and $t$.
A2. The number of learning materials is finite.
3 Adaptive Learning Problem
In this section, we first describe the adaptive learning problem and then formulate this problem as an MDP.
3.1 Problem Statement
A conventional adaptive learning system is illustrated in Figure 1. Such an adaptive learning system is typical in traditional classrooms and in online courses such as Massive Open Online Courses (MOOCs) (lan2016contextual). In the adaptive learning system, the learner takes some learning materials to improve her/his latent traits. After the learner finishes learning the materials, a test or homework is assigned to the learner. Then, the learner’s latent traits are estimated. Based on the estimated latent traits, the learning system adaptively determines the next learning material for the learner, which may take one of many forms, including a textbook chapter, a lecture video, an interactive task, instructor support, or an instruction pace. This cyclic learning process continues until the learner’s latent traits reach or are close to prespecified levels of proficiency.
The tests in an adaptive learning system can be administered as computerized adaptive testing (CAT). CAT is a test mode that administers tests adapted to test takers’ trait levels (chang2015psychometrics). It provides more accurate trait estimates with a much smaller number of items (weiss1982improving) by sequentially selecting and administering items tailored to each individual learner. Therefore, a relatively short test can assess learners’ latent traits with high accuracy.
Conventionally, the learning policy (or plan) is provided by a teacher, as illustrated in Figure 1. As mentioned above, however, it is too expensive for teachers to make an individualized adaptive learning policy for each learner. In this paper, we use a DRL algorithm to search for an optimal individualized adaptive learning policy for each learner. The algorithm selects the most appropriate learning material among all available materials for each learner based on her/his provisional estimated latent traits, which are obtained from her/his learning history and test performance. The adaptive selection of learning materials is intended to let the learner reach a prespecified proficiency level in the smallest number of learning cycles, or reach as high a proficiency level as possible in a fixed number of learning cycles. That is, instead of resorting to an experienced teacher for the construction of a learning policy as illustrated in Figure 1, we develop a systematic method that enables the adaptive learning system to discover an optimal learning policy from the data that have been collected, including historical learning materials, test responses, and estimated latent traits.
3.2 Markov Decision Process Formulation
3.2.1 Primer on Markov Decision Process
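Before presenting the formulation for the adaptive learning problem, we first briefly review MDPs. An MDP is characterized by a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{P}$ is a Markovian transition model, $\mathcal{R}$ is a reward function, and $\gamma \in [0, 1]$ is a discount factor (sutton2018reinforcement). A transition sample is defined as $(\boldsymbol{s}, a, r, \boldsymbol{s}')$, where $\boldsymbol{s}, \boldsymbol{s}' \in \mathcal{S}$ and $a \in \mathcal{A}$, and $r = \mathcal{R}(\boldsymbol{s}, a, \boldsymbol{s}')$ is the scalar reward received when the state transitions into $\boldsymbol{s}'$ from $\boldsymbol{s}$ after taking action $a$.
Let $S^t$ and $A^t$ denote the state and action at time step $t$, respectively, and let $R^t$ denote the reward obtained after taking action $A^t$ at state $S^t$. Note that $S^t$, $A^t$, and $R^t$ are random variables. When both $\mathcal{S}$ and $\mathcal{A}$ are finite, the transition model can be represented by a conditional probability, that is,
$$\mathcal{P}^t(\boldsymbol{s}' \mid \boldsymbol{s}, a) = \Pr(S^{t+1} = \boldsymbol{s}' \mid S^t = \boldsymbol{s}, A^t = a). \tag{3}$$
The Markovian property of the transition model is that, for any time step $t$,
$$\Pr(S^{t+1} \mid S^t, A^t, S^{t-1}, A^{t-1}, \ldots, S^0, A^0) = \Pr(S^{t+1} \mid S^t, A^t). \tag{4}$$
Essentially, the Markovian property requires that a future state is independent of all past states given the current state. Assume $\mathcal{P}^t$ is time-homogeneous, i.e., for any two time steps $t_1$ and $t_2$,
$$\mathcal{P}^{t_1}(\boldsymbol{s}' \mid \boldsymbol{s}, a) = \mathcal{P}^{t_2}(\boldsymbol{s}' \mid \boldsymbol{s}, a). \tag{5}$$
Then, we can drop the superscript and write the transition model as $\mathcal{P}(\boldsymbol{s}' \mid \boldsymbol{s}, a)$. Note that when $\mathcal{S}$ is continuous, the transition model can be represented by a conditional probability density function.
Let $\pi: \mathcal{S} \to \mathcal{A}$ denote a deterministic policy for the MDP defined above. The action-value function for the MDP under policy $\pi$ is defined as follows:
$$Q^{\pi}(\boldsymbol{s}, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R^{t+k} \,\middle|\, S^t = \boldsymbol{s}, A^t = a; \pi\right], \tag{6}$$
where $\mathbb{E}$ denotes the expectation. The action-value function is the expected cumulative discounted reward when the system starts from state $\boldsymbol{s}$, takes action $a$, and follows policy $\pi$ thereafter. The maximum action-value function over all policies is defined as $Q^*(\boldsymbol{s}, a) = \max_{\pi} Q^{\pi}(\boldsymbol{s}, a)$. A policy $\pi^*$ is said to be optimal if $Q^{\pi^*}(\boldsymbol{s}, a) = Q^*(\boldsymbol{s}, a)$ for any $\boldsymbol{s} \in \mathcal{S}$ and $a \in \mathcal{A}$. In particular, the greedy policy with respect to $Q^*$, defined as $\pi^*(\boldsymbol{s}) = \arg\max_{a \in \mathcal{A}} Q^*(\boldsymbol{s}, a)$, is an optimal policy (sutton2018reinforcement). The MDP is solved if we find $Q^*$. The optimal action-value function satisfies the Bellman optimality equation (bertsekas1996neuro):
$$Q^*(\boldsymbol{s}, a) = \mathbb{E}\left[R^t + \gamma \max_{a' \in \mathcal{A}} Q^*(S^{t+1}, a') \,\middle|\, S^t = \boldsymbol{s}, A^t = a\right]. \tag{7}$$
Furthermore, $Q^*$ is the only function that solves the Bellman optimality equation. The Bellman optimality equation is of central importance to solving the MDP. When both $\mathcal{S}$ and $\mathcal{A}$ are finite and $\mathcal{P}$ is known, model-based algorithms such as the value iteration algorithm can be applied to solve the MDP (sutton2018reinforcement).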
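To make the finite, known-model case concrete, here is a minimal sketch of value iteration that repeatedly applies the Bellman optimality backup to a tabular action-value function. The two-state MDP, its action names ("slow", "fast"), and all probabilities and rewards are hypothetical and serve only as an illustration.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-10, max_iter=10_000):
    """Apply the Bellman optimality backup until Q stops changing.

    P[s][a] is a dict {next_state: probability}; R[s][a][s2] is the reward
    received when moving from s to s2 under action a.
    """
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            for a in actions:
                new_q = sum(
                    p * (R[s][a][s2] + gamma * max(Q[s2].values()))
                    for s2, p in P[s][a].items()
                )
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:
            break
    # Greedy policy with respect to the converged Q.
    policy = {s: max(actions, key=lambda a: Q[s][a]) for s in states}
    return Q, policy

# Hypothetical toy MDP: state 0 = not yet proficient, state 1 = proficient (absorbing).
states, actions = [0, 1], ["slow", "fast"]
P = {0: {"slow": {0: 0.8, 1: 0.2}, "fast": {0: 0.5, 1: 0.5}},
     1: {"slow": {1: 1.0}, "fast": {1: 1.0}}}
R = {0: {"slow": {0: -1.0, 1: 0.0}, "fast": {0: -1.0, 1: 0.0}},
     1: {"slow": {1: 0.0}, "fast": {1: 0.0}}}
Q, policy = value_iteration(states, actions, P, R, gamma=0.9)
```

In this toy problem the greedy policy picks "fast" in state 0, since that action reaches the absorbing state sooner and therefore accumulates fewer negative rewards.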
3.2.2 Adaptive Learning Problem as MDP
We next formulate the adaptive learning problem as an MDP as follows.
State Space: Define the vector of parameters describing the learner’s latent traits as the state, i.e., $\boldsymbol{s} = \boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, which has $K$ continuous variables, where $K$ represents the dimension of the latent traits. To simplify the algorithm construction below, the state space is defined as $\mathcal{S} = [0, 1]^K$, so that each element of $\boldsymbol{\theta}$ satisfies $0 \leq \theta_k \leq 1$, where a smaller value of $\theta_k$ indicates a lower ability and a larger value indicates a higher ability. Although a latent trait variable is typically defined on $(-\infty, \infty)$ in IRT, a closed interval, say $[\theta_{\min}, \theta_{\max}]$, is used as the range of a latent trait variable in practice. Let $\bar{\theta}_k$ be the prespecified target proficiency level of the $k$-th latent trait, which is the level the learners try to reach, where $\theta_{\min} < \bar{\theta}_k \leq \theta_{\max}$. Because there is a bijection between $[\theta_{\min}, \bar{\theta}_k]$ and $[0, 1]$, an estimated trait can be directly transformed onto the scale of $[0, 1]$. Thus, without loss of generality, we consider the state space to be $[0, 1]^K$.
Action Space: Let the learning materials available in the adaptive learning system be indexed by $1, 2, \ldots, M$. The action in the adaptive learning system is the index of a learning material, which is discrete, and the action space is $\mathcal{A} = \{1, 2, \ldots, M\}$.
Reward Function: Recall that the objective of the adaptive learning system is to minimize the number of learning steps taken before a learner’s latent traits reach the maximum, i.e., for $\boldsymbol{s}$ to reach $\boldsymbol{1}_K$, where $\boldsymbol{1}_K$ is an all-ones vector in $\mathbb{R}^K$. As such, the reward function is defined as follows:
$$r = \mathcal{R}(\boldsymbol{s}, a, \boldsymbol{s}') = \begin{cases} -1, & \text{if } \|\boldsymbol{1}_K - \boldsymbol{s}'\|_{\infty} > \delta, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$
where $\|\cdot\|_{\infty}$ indicates the infinity norm and $\delta$ is a small tolerance. Intuitively, the sum of rewards over one episode (the entire learning process of a learner) is the negative of the number of steps a learner takes before all of her/his latent traits are very close to $1$, which indicates that the learner has reached the target levels of all prespecified abilities.
Transition Model: The probability distributions of the latent traits as well as of the changes in the traits are unknown. As a result, the transition model $\mathcal{P}$ is not known a priori.
Based on this MDP formulation, the adaptive learning problem is essentially to find an optimal learning policy, denoted by $\pi^*$, that determines the action (learning material selection) based on the state (latent traits), such that the expected cumulative discounted reward is maximized. Note that the larger the expected cumulative discounted reward is, the fewer learning steps a learner takes to reach the target level(s) of the ability/abilities. Since the transition model is unknown, the MDP cannot be solved using model-based algorithms such as the value iteration algorithm. We resort to a model-free DRL algorithm to solve it in the next section.
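The reward structure can be illustrated with a short sketch. It assumes that the reward is -1 for every learning step whose resulting state is still farther than a small tolerance from the all-ones target (measured in the infinity norm) and 0 once the target is reached. The tolerance name `tol`, its value, and the example path are hypothetical.

```python
def reward(next_state, tol=0.05):
    """Return -1 while any trait is farther than `tol` from the target 1, else 0."""
    distance = max(abs(1.0 - s) for s in next_state)  # infinity norm of 1 - s'
    return -1.0 if distance > tol else 0.0

def episode_return(visited_states, gamma=1.0, tol=0.05):
    """Discounted sum of rewards along the states reached after each learning step."""
    total, discount = 0.0, 1.0
    for s_next in visited_states:
        total += discount * reward(s_next, tol)
        discount *= gamma
    return total

# Hypothetical learning path in a two-trait state space [0, 1]^2.
path = [(0.3, 0.1), (0.6, 0.4), (0.9, 0.8), (0.97, 0.96)]
undiscounted = episode_return(path)
```

Here the first three steps each earn -1 and the fourth step reaches the target region, so a shorter path to the target yields a larger (less negative) return.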
4 Optimal Learning Policy Discovery Algorithm
In this section, we solve the adaptive learning problem using the deep Q-learning algorithm, which can learn the action-value function directly from historical transition data without knowing the underlying transition model. To utilize the available transition information more efficiently, we further develop a transition model estimator and use it when running the deep Q-learning algorithm.
4.1 Action-Value Function as Deep Q-Network
Recall that the optimal learning policy can be readily obtained if we know the action-value function. When the state is continuous and the action is discrete, which is the case in the adaptive learning problem, the action-value function cannot be exactly represented in a tabular form. In such cases, the action-value function can be approximated by functions such as linear functions (sutton2018reinforcement) or artificial neural networks (simply referred to as neural networks) (mnih2015human). In the former case, the approximate action-value function is represented as an inner product of a parameter vector and a feature vector that is constructed from the state. It is important to point out that the choice of features is critical to the performance of the approximate action-value function. Neural networks, in contrast, are capable of extracting useful features from the state directly, and have stronger representation power than linear functions (goodfellow2016deep).
As an example of a neural network, Fig. 2 shows an illustrative neural network that consists of an input layer with $n_0$ units, a hidden layer with $n_1$ units, and an output layer with $n_2$ units. Let $\boldsymbol{x}$, $\boldsymbol{h}$, and $\boldsymbol{y}$ denote the vectors that come out of the input layer, the hidden layer, and the output layer, respectively. In the neural network, the output of one layer is the input to the next layer. To be more specific, $\boldsymbol{h}$ can be computed from $\boldsymbol{x}$, and $\boldsymbol{y}$ can be computed from $\boldsymbol{h}$, as follows:
$$\boldsymbol{h} = \sigma(\boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{b}_1), \tag{9}$$
$$\boldsymbol{y} = \boldsymbol{W}_2 \boldsymbol{h} + \boldsymbol{b}_2, \tag{10}$$
where $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$ are two weight matrices, $\boldsymbol{b}_1$ and $\boldsymbol{b}_2$ are two bias vectors, and $\sigma$ is the so-called activation function, which is applied to its argument elementwise. A popular choice of the activation function is the rectifier, i.e., $\sigma(z) = \max(0, z)$. Conceptually, we can write the output as a function of $\boldsymbol{x}$, i.e., $\boldsymbol{y} = f(\boldsymbol{x}; \boldsymbol{w})$, where $f$ is parameterized by $\boldsymbol{W}_1$, $\boldsymbol{W}_2$, $\boldsymbol{b}_1$, and $\boldsymbol{b}_2$, which can be collectively denoted as a parameter vector $\boldsymbol{w}$. Given a set of input-output values denoted by $\{(\boldsymbol{x}_i, \boldsymbol{y}_i)\}_{i=1}^{N}$, the optimal value of $\boldsymbol{w}$ can be found by solving the following problem:
$$\min_{\boldsymbol{w}} \sum_{i=1}^{N} \left\| \boldsymbol{y}_i - f(\boldsymbol{x}_i; \boldsymbol{w}) \right\|_2^2, \tag{11}$$
where $\|\cdot\|_2$ is the $\ell_2$ norm. Problem (11) can be solved using the gradient descent algorithm or its variants, in which the gradient of the objective function with respect to $\boldsymbol{w}$ can be computed using the backpropagation technique. Neural networks can also be trained using a variety of other optimization algorithms such as Adam and RMSProp (see goodfellow2016deep). Note that there may be several hidden layers; the more hidden layers there are, the deeper the neural network is. We refer interested readers to goodfellow2016deep for more comprehensive details about neural networks.
Recall that in the adaptive learning problem, the state is continuous in $[0, 1]^K$, while the action is discrete in $\{1, \ldots, M\}$. The approximate action-value function, denoted by $\hat{Q}(\boldsymbol{s}, a)$, can be represented using a neural network as follows. The input layer is the state $\boldsymbol{s}$, or equivalently, the latent trait vector $\boldsymbol{\theta}$, which has $K$ units. The output layer has $M$ units, each of which corresponds to the action-value for one action. To be more specific, the $a$-th unit in the output layer gives $\hat{Q}(\boldsymbol{s}, a)$, i.e., the action-value for state $\boldsymbol{s}$ and action $a$. The number of hidden layers and the number of units in each hidden layer can be determined through simulation, which is detailed in the numerical simulation section. Such a neural network is also referred to as a deep Q-network (DQN) (mnih2013playing). Let $\boldsymbol{w}$ denote the parameter vector of the DQN, which includes all weights and biases in the DQN. To emphasize that $\hat{Q}$ is parameterized by $\boldsymbol{w}$, we write $\hat{Q}(\boldsymbol{s}, a)$ as $\hat{Q}(\boldsymbol{s}, a; \boldsymbol{w})$.
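The forward pass of such a network can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the architecture used in the paper: it has a single hidden layer with a rectifier activation and a linear output layer, and the layer sizes, the random initialization range, and all function names are hypothetical.

```python
import random

def relu(v):
    """Rectifier applied elementwise."""
    return [max(0.0, x) for x in v]

def affine(W, b, x):
    """Compute W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def q_values(state, params):
    """Forward pass: state -> hidden layer (ReLU) -> one Q-value per action."""
    (W1, b1), (W2, b2) = params
    hidden = relu(affine(W1, b1, state))
    return affine(W2, b2, hidden)

def greedy_action(state, params):
    """Index of the action with the largest approximate action-value."""
    q = q_values(state, params)
    return max(range(len(q)), key=lambda a: q[a])

rng = random.Random(0)
n_traits, n_hidden, n_actions = 2, 8, 3  # hypothetical sizes
params = [init_layer(n_hidden, n_traits, rng), init_layer(n_actions, n_hidden, rng)]
q = q_values([0.2, 0.1], params)
best = greedy_action([0.2, 0.1], params)
```

Producing all action-values in one pass means the greedy action for a state is just the index of the largest output unit.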
4.2 Learning Policy Discovery with Deep Q-Learning
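The parameters of the DQN can be learned from the sequence of latent traits and learning materials using the deep Q-learning algorithm proposed by mnih2013playing. The optimal value of the parameter vector of the DQN, $\boldsymbol{w}^*$, can be found by minimizing the mean squared error between the approximate action-value function and the true action-value function:
$$\boldsymbol{w}^* = \arg\min_{\boldsymbol{w}} \mathbb{E}\left[\left(Q^*(\boldsymbol{s}, a) - \hat{Q}(\boldsymbol{s}, a; \boldsymbol{w})\right)^2\right]. \tag{12}$$
However, solving (12) is extremely difficult, if not impossible, since both $Q^*$ and the transition model $\mathcal{P}$ are unknown, and thus the expectation of the squared error cannot be computed. The deep Q-learning algorithm adopts two measures to cope with these challenges. First, the expectation is replaced with the sample average computed from a set of historical transitions, denoted by $\mathcal{D}$, where $|\cdot|$ denotes the cardinality of a set. That is, (12) is replaced by the following problem:
$$\min_{\boldsymbol{w}} \frac{1}{|\mathcal{D}|} \sum_{(\boldsymbol{s}, a, r, \boldsymbol{s}') \in \mathcal{D}} \left(Q^*(\boldsymbol{s}, a) - \hat{Q}(\boldsymbol{s}, a; \boldsymbol{w})\right)^2. \tag{13}$$
At time step $t$, the parameter vector is updated using the gradient descent algorithm as follows:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} + \frac{\alpha}{|\mathcal{D}|} \sum_{(\boldsymbol{s}, a, r, \boldsymbol{s}') \in \mathcal{D}} \left(Q^*(\boldsymbol{s}, a) - \hat{Q}(\boldsymbol{s}, a; \boldsymbol{w}^{t})\right) \nabla_{\boldsymbol{w}} \hat{Q}(\boldsymbol{s}, a; \boldsymbol{w}^{t}), \tag{14}$$
where $\alpha$ is the learning rate (with the constant factor from differentiating the square absorbed into it) and $\boldsymbol{w}^{t}$ denotes the value of $\boldsymbol{w}$ at time step $t$. Second, the unknown $Q^*(\boldsymbol{s}, a)$ is further substituted by $r + \gamma \max_{a'} \hat{Q}(\boldsymbol{s}', a'; \boldsymbol{w}^{t})$ based on the Bellman optimality equation in (7). Note that when $r = 0$, which indicates the learning process has ended, no future reward follows and the substitute reduces to $r$. Therefore, (14) now becomes
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} + \frac{\alpha}{|\mathcal{D}|} \sum_{(\boldsymbol{s}, a, r, \boldsymbol{s}') \in \mathcal{D}} \left(y - \hat{Q}(\boldsymbol{s}, a; \boldsymbol{w}^{t})\right) \nabla_{\boldsymbol{w}} \hat{Q}(\boldsymbol{s}, a; \boldsymbol{w}^{t}), \tag{15}$$
where
$$y = \begin{cases} r, & \text{if } r = 0, \\ r + \gamma \max_{a'} \hat{Q}(\boldsymbol{s}', a'; \boldsymbol{w}^{t}), & \text{otherwise}. \end{cases} \tag{16}$$
The detailed deep Q-learning algorithm that is used to search for the optimal parameter vector of the DQN is presented in Algorithm 1, where one episode represents a complete learning process of one learner and the number of episodes is the number of learners. In order to obtain a good approximate action-value function, the state-action space needs to be sufficiently explored. To achieve this, the so-called $\epsilon$-greedy exploration is adopted in the deep Q-learning algorithm. Specifically, at time step $t$, a random action is selected with probability $\epsilon$, and a greedy action is selected with probability $1 - \epsilon$. In this paper, we adaptively decay $\epsilon$ from an initial value to a final value over a fixed number of time steps. In addition, the parameter vector is updated at each time step using a set of transitions $\tilde{\mathcal{D}}$ that is resampled from the historical transitions $\mathcal{D}$, with $\tilde{\mathcal{D}} \subseteq \mathcal{D}$, so as to reduce the bias that may be caused by the samples.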
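The three ingredients described above (exploration with a decaying probability, the bootstrapped target, and resampling from stored transitions) can be sketched as small helper functions. This is an illustrative sketch only: the linear decay schedule, the default values, and all function names are assumptions, since the source does not specify them here.

```python
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps steps.

    The linear schedule and the default values are assumptions for illustration.
    """
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def td_target(r, next_q_row, gamma, done):
    """Target y = r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    return r if done else r + gamma * max(next_q_row)

def sample_minibatch(replay, batch_size, rng):
    """Resample transitions uniformly (without replacement) from the stored history."""
    return rng.sample(replay, min(batch_size, len(replay)))

rng = random.Random(0)
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)  # purely greedy
y = td_target(-1.0, [-3.0, -2.0, -5.0], gamma=0.9, done=False)
batch = sample_minibatch(list(range(10)), 4, rng)
```

A gradient step would then move the network output for the taken action toward `y` for each transition in `batch`.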
4.3 Transition Model Estimator
The deep Q-learning algorithm requires a sufficiently large set of historical transition data in order to find a good approximation of the action-value function, from which the learning policy is then derived. However, we may not be able to obtain adequate transitions for several reasons, including the lack of adequate learners and the long time it takes to acquire an individual learner’s learning path (transitions). Thus, it is desirable to develop an adaptive learning system that can efficiently discover the optimal learning policy after training on a relatively small number of learners. To this end, we develop a transition model estimator that emulates the learning behavior of learners. Specifically, the transition model estimator takes a state $\boldsymbol{s}$ and an action $a$ as inputs, and outputs the next state $\boldsymbol{s}'$. This can be cast as a supervised learning task (a regression task), which can be solved using neural networks. The input layer of the neural network that represents the transition model is a pair of state and action, and the output layer is the next state. The number of hidden layers can be adjusted through the parameter tuning process (see, e.g., goodfellow2016deep, for more details).
Conceptually, we can write the neural network that represents the transition model as $\hat{\mathcal{P}}(\boldsymbol{s}, a; \boldsymbol{\phi})$, the parameter vector of which is denoted by $\boldsymbol{\phi}$. The optimal value of $\boldsymbol{\phi}$ can be found by solving the following problem using the backpropagation algorithm:
$$\min_{\boldsymbol{\phi}} \sum_{(\boldsymbol{s}, a, r, \boldsymbol{s}') \in \mathcal{D}} \left\| \boldsymbol{s}' - \hat{\mathcal{P}}(\boldsymbol{s}, a; \boldsymbol{\phi}) \right\|_2^2, \tag{17}$$
where $\mathcal{D}$ is the set of historical transitions.
The adaptive learning system with the DQN and a transition model estimator is shown in Fig. 4, where the DQN is trained against the transition model, instead of the actual learners.
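The regression view of the transition model estimator can be sketched as follows. To keep the example short and dependency-free, a linear model trained by stochastic gradient descent is swapped in for the neural network described above, and the training history is a hypothetical, noise-free toy data set in which each material adds a fixed increment. All names and values are illustrative assumptions.

```python
import random

def one_hot(action, n_actions):
    return [1.0 if k == action else 0.0 for k in range(n_actions)]

def features(state, action, n_actions):
    """Input vector: the state followed by a one-hot code of the action."""
    return list(state) + one_hot(action, n_actions)

def predict(W, state, action, n_actions):
    """Predicted next state under the fitted linear model."""
    x = features(state, action, n_actions)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_transition_model(transitions, n_traits, n_actions, lr=0.1, epochs=300, seed=0):
    """Fit next_state ~ W @ features(state, action) by SGD on the squared error.

    A linear stand-in for the neural network estimator, used only for illustration.
    """
    rng = random.Random(seed)
    n_in = n_traits + n_actions
    W = [[0.0] * n_in for _ in range(n_traits)]
    data = list(transitions)
    for _ in range(epochs):
        rng.shuffle(data)
        for s, a, s_next in data:
            x = features(s, a, n_actions)
            for d in range(n_traits):
                err = sum(w * xi for w, xi in zip(W[d], x)) - s_next[d]
                for i in range(n_in):
                    W[d][i] -= lr * err * x[i]
    return W

# Hypothetical noise-free history: material a adds a fixed increment to the traits.
INCREMENTS = {0: (0.10, 0.00), 1: (0.00, 0.10), 2: (0.05, 0.05)}
data_rng = random.Random(1)
history = []
for _ in range(90):
    s = (data_rng.uniform(0.0, 0.9), data_rng.uniform(0.0, 0.9))
    a = data_rng.randrange(3)
    history.append((s, a, (s[0] + INCREMENTS[a][0], s[1] + INCREMENTS[a][1])))

W_hat = fit_transition_model(history, n_traits=2, n_actions=3)
pred = predict(W_hat, (0.5, 0.5), 0, 3)
```

Once fitted, `predict` can stand in for real learners: the Q-learning loop queries it for the next state instead of waiting for new learner data.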
5 Numerical Simulation
In this section, we show the performance of the adaptive learning system with and without the transition model estimator, and also investigate the impacts of latent trait estimation errors through two simulation studies.
5.1 Simulation Overview
Consider a group of learners in a two-dimensional assessment and a learning environment with three sets of learning materials. We model the group of learners as a homogeneous MDP. Let the random vector $\boldsymbol{\theta}^t = (\theta_1^t, \theta_2^t)$ denote a learner’s state at time step $t$, which represents the latent traits in our study. Consider three sets of learning materials regarding the two-dimensional latent trait levels, that is, . Each set of learning materials contains content regarding different latent traits. Denote the change in the latent traits from time step $t$ to $t+1$ by $\Delta\boldsymbol{\theta}^t = \boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t$. The probability of transitioning from state $\boldsymbol{s}$ to $\boldsymbol{s}'$ can be represented as
(18) 
where is the index of the set to which the selected learning material belongs. In the following notation, we only consider the set to which the selected learning material belongs, denoted as . Assume $\boldsymbol{\theta}^t \in [0, 1]^2$, where the value of $0$ indicates extremely low ability on the corresponding dimension and the value of $1$ indicates the target ability.
In addition, under Assumption A1 of no retrogression, we have $\Delta\theta_1^t \geq 0$ and $\Delta\theta_2^t \geq 0$. As we model the transition of the latent traits as a continuous-state MDP, the changes in $\theta_1^t$ and $\theta_2^t$ depend only on the current latent traits $\boldsymbol{\theta}^t$ and the selected learning material. Therefore, we let $\Delta\theta_1^t$ and $\Delta\theta_2^t$
follow Beta distributions such that
, where , and , where . when and when , which means the first set of materials only helps improving while the second set is only related to . Parameters of and in the Beta distribution are calculated by(19) 
and
(20) 
An intuitive example is how a learner learns addition “+” and subtraction “–”. A learning process usually takes a long time, and thus a monotonically decreasing, zero-concentrated distribution is adopted to simulate the ability increase. In that case, each learning step will most likely lead to a small increase in the ability/abilities. Besides, in the distribution , the larger is, the more the curve approaches , which results in a higher chance of generating a smaller . This implies that the higher the learner’s ability on either dimension, the harder it is for her/him to further improve the corresponding ability. Thus, and have positive coefficients in front of and , respectively. Meanwhile, we assume that a higher ability on one dimension helps to increase the ability on the other dimension, which results in a negative coefficient ahead of in and a negative coefficient ahead of in . In particular, assume the third learning material contains content related to both abilities, and especially helps learners with an intermediate or high level of addition ability to improve further on subtraction. This assumption is included in calculating when in equation (20). In addition, if the learner makes big progress in mastering addition, there is a higher chance for the learner to improve more in learning subtraction. Thus, the coefficient of in is negative, which gives a curve that is less zero-concentrated as increases. Consequently, has a higher possibility of increasing more when is large. Note that the transition model is not required by the adaptive learning system. The simulation serves as an example for validating that the model-free deep Q-learning algorithm can discover the optimal learning policy.
Estimation errors ranging from to are also added to the estimated latent traits to evaluate their impact on the adaptive learning system. Denote the estimation error vector by $\boldsymbol{e} = (e_1, e_2)$, where $e_1$ and $e_2$ are generated by the same normal distribution such that $e_1, e_2 \sim \mathcal{N}(0, \sigma^2)$. As a result, of lie in the range of . In the simulation, the estimated latent traits are calculated as the sum of the true latent traits and the estimation errors, i.e., $\hat{\boldsymbol{\theta}} = \boldsymbol{\theta} + \boldsymbol{e}$. For instance, if the standard deviation $\sigma$ is , the observation is , where , and of lie in the range of .
Two simulation cases are studied. In the first case, the DQN is trained against actual learners whose ability changes follow the MDP with the kernel distributions described above. In this case, it is presumed that the optimal learning policy can be trained on a sufficient number of learners. The resulting optimal learning policy is compared with a heuristic learning policy, which selects the next learning material that can improve the not-fully-mastered ability, and with a random learning policy, which selects any material randomly from the set of three. The impact of different estimation errors is also assessed. In the second case, the DQN is trained against an estimated transition model that is obtained using a small group of learners. The resulting optimal learning policy is compared with that obtained by training against actual learners.
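The simulation setup can be sketched as a small learner simulator. Everything numeric here is a placeholder: the Beta shape parameters, the clipping at 1, the tolerance, and the concrete heuristic rule (teach the weaker observed trait) are hypothetical stand-ins, not the values or rules defined by equations (19) and (20) or used in the paper. Materials are indexed 0, 1, 2 in the code.

```python
import random

def simulate_step(theta, action, rng):
    """One hypothetical learning step on two traits in [0, 1].

    Increments follow Beta(1, b), whose density decreases monotonically and is
    concentrated near zero; b grows with the current trait, so progress slows
    as ability rises. The shape values are placeholders.
    """
    t1, t2 = theta
    # Materials 0 and 2 can raise trait 1; materials 1 and 2 can raise trait 2.
    d1 = rng.betavariate(1.0, 4.0 + 4.0 * t1) if action in (0, 2) else 0.0
    d2 = rng.betavariate(1.0, 4.0 + 4.0 * t2) if action in (1, 2) else 0.0
    return (min(1.0, t1 + d1), min(1.0, t2 + d2))

def observe(theta, sigma, rng):
    """Estimated traits = true traits + independent N(0, sigma^2) errors."""
    return tuple(t + rng.gauss(0.0, sigma) for t in theta)

def heuristic_policy(obs):
    """Illustrative heuristic: teach the trait that currently looks weaker."""
    return 0 if obs[0] <= obs[1] else 1

def run_episode(policy, rng, sigma=0.0, tol=0.05, max_steps=500):
    """Count learning steps until both traits are within `tol` of the target 1."""
    theta, steps, path = (0.0, 0.0), 0, [(0.0, 0.0)]
    while max(1.0 - theta[0], 1.0 - theta[1]) > tol and steps < max_steps:
        action = policy(observe(theta, sigma, rng))
        theta = simulate_step(theta, action, rng)
        path.append(theta)
        steps += 1
    return steps, path

steps, path = run_episode(heuristic_policy, random.Random(0))
```

Replacing `heuristic_policy` with a uniformly random choice or a trained Q-network's greedy action gives the other policies compared in the text, and setting `sigma > 0` adds estimation error to what the policy sees.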
5.2 Simulation Study I
Assume all learners are beginners on the two latent traits when using the adaptive learning system, i.e. . The DQN has two hidden layers, the first of which has units and the second of which has units. The DQN is trained against learners that are simulated according to the method discussed earlier, i.e. . Other parameters are chosen as follows: , , , , , . The Adam algorithm is adopted for the training of the DQN.
Figure 5 presents the smoothed reward under the deep Q-learning algorithm across the first episodes with a smoothing window of . It can be seen that the reward converges to after episodes, which indicates that the optimal learning policy is found after the DQN is trained using learners.
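One common way to produce such a smoothed reward curve is a trailing moving average over the per-episode rewards. The sketch below assumes that form of smoothing; the exact smoothing method behind Figure 5 is not stated, and the function name `smooth` is illustrative.

```python
def smooth(values, window):
    """Trailing moving average over the last `window` values.

    For the first few entries, the average is taken over the values seen
    so far, so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out
```

A larger window removes more episode-to-episode noise at the cost of reacting more slowly to changes in the underlying reward level.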
Methods | DQN | Heuristic | Random
--- | --- | --- | ---
Reward mean | 13.49 | 21.55 | 24.85
Reward SD | 4.59 | 4.76 | 5.59
Figure 6 and Table 1 compare smoothed rewards across new learners, labeled as episodes in Figure 6, with a smoothing window of , among three learning policies: the optimal learning policy found by the deep Q-learning algorithm after being trained in episodes (referred to as the DQN learning policy), the heuristic learning policy, and the random learning policy. The larger the reward is, the fewer learning steps a learner takes to fully master the two latent traits; in other words, the better the learning policy is. The rewards obtained by the deep Q-learning algorithm have a higher mean and a smaller standard deviation (SD) than those obtained by the heuristic learning policy and the random learning policy. These results show that the learning policy found by the deep Q-learning algorithm is much better than the other two.
Figure 7 presents an example of a state transition path that shows how the latent traits change with a sequence of actions taken under the DQN learning policy obtained without considering estimation error. Take the addition and subtraction test as an example. The first learning material is repeatedly selected to improve the learner’s ability of addition at the beginning. Then the third material related to both addition and subtraction is selected. In the last few steps, the second learning material is chosen to further improve the learner’s ability of subtraction.
Figure 8 compares rewards under the DQN and the heuristic learning policies when estimation errors with various standard deviations () exist. It shows that the mean rewards obtained by the DQN learning policy are consistently higher than those of the heuristic learning policy at every level of estimation error. That is, the DQN learning policy still outperforms the heuristic learning policy in the presence of estimation errors, which demonstrates that the deep Q-learning algorithm remains reliable and stable in finding an optimal learning policy under such errors.
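The way estimation errors enter the simulation, with the observed latent traits equal to the true latent traits plus normally distributed errors, can be sketched as below. Zero-mean, independent errors with a common standard deviation `sd` are assumed here, and the value 0.1 used in the usage note is hypothetical.

```python
import random

def observe(true_traits, sd, rng=random):
    """Return estimated latent traits: true traits plus independent N(0, sd^2) errors."""
    return tuple(t + rng.gauss(0.0, sd) for t in true_traits)

def share_within(sd, width=1.96, n=10000, seed=0):
    """Empirical share of errors that fall within +/- width * sd of zero."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(0.0, sd)) <= width * sd for _ in range(n))
    return hits / n
```

For a normal distribution, about 95% of the errors fall within 1.96 standard deviations of zero, so `share_within(0.1)` should be close to 0.95; with `sd = 0` the observation coincides with the true latent traits.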
5.3 Simulation Study II
Next, we show the performance of the adaptive learning system with a transition model estimator, which is represented using a neural network with one hidden layer that has units. The prediction accuracy indices are presented in Table 2. The train and test scores are defined as the coefficient of determination in the training and test sets respectively, calculated by
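$$R^2 = 1 - \frac{\sum_{i=1}^{n} (s_i - \hat{s}_i)^2}{\sum_{i=1}^{n} (s_i - \bar{s})^2}, \tag{21}$$
where $s_i$ is the true state, $\bar{s}$ is the average value of the true state, $\hat{s}_i$ is the predicted state using the previous state and the action taken, and $n$ is the number of transitions. The best possible score is 1. The root mean square error (RMSE) is calculated by
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (s_i - \hat{s}_i)^2}. \tag{22}$$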
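The two accuracy indices, the coefficient of determination and the RMSE, can be computed as in the sketch below. It uses the standard definitions and treats each state as a scalar (one latent trait at a time), which is a simplifying assumption; the function names are illustrative.

```python
import math

def r2_score(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

def rmse(true, pred):
    """Root mean square error between true and predicted states."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))
```

A perfect predictor gives a score of 1 and an RMSE of 0, while a predictor that always outputs the mean of the true states gives a score of 0.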
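No. of learners | 10 | 20 | 30 | 40 | 50 | 100 | 150 | 200 | 2000
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Train Score | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97
Test Score | 0.95 | 0.97 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97
RMSE | 0.11 | 0.08 | 0.09 | 0.09 | 0.08 | 0.08 | 0.08 | 0.08 | 0.08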
A DQN is trained on episodes against the estimated transition model that is fitted using a certain number of actual learners; the learning policy corresponding to this DQN is referred to as the virtual DQN learning policy. For comparison, another DQN is trained on the same number of actual learners; the learning policy corresponding to this DQN is referred to as the actual DQN learning policy. Essentially, these two learning policies differ in how the same set of actual learners is utilized. The actual learners are simulated according to the method discussed in the “Simulation Overview” section and are used to train the actual DQN learning policy directly. In contrast, for the virtual DQN learning policy, these actual learners are first used to fit a transition model, which is then used to train the policy; this allows the virtual DQN learning policy to be trained over as many episodes as it needs. Figure 9 compares rewards obtained by the two DQN learning policies when various numbers of actual learners are utilized. It shows that with no more than actual learners, the utilization of the transition model can significantly improve the performance of the learning policy, generating much larger mean rewards than the algorithm that does not use the transition model. When the number of learners is large enough, both approaches find optimal learning policies and yield similar rewards.
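The data flow behind the virtual DQN learning policy, fitting a transition model once on a few actual learners and then generating as many virtual episodes as needed, can be sketched as follows. For brevity, the neural-network estimator is replaced here by a much simpler stand-in (the mean state increment per action), and no DQN is trained; all function names and the `(state, action, next_state)` tuple format are assumptions.

```python
from collections import defaultdict

def fit_transition_model(transitions):
    """Estimate the mean state increment per action.

    `transitions` is a list of (state, action, next_state) tuples collected
    from actual learners, with two-dimensional states.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for state, action, next_state in transitions:
        for d in range(2):
            sums[action][d] += next_state[d] - state[d]
        counts[action] += 1
    return {a: tuple(s / counts[a] for s in sums[a]) for a in sums}

def simulate_step(model, state, action):
    """Predict the next state by adding the fitted mean increment."""
    return tuple(s + inc for s, inc in zip(state, model[action]))

def virtual_episode(model, policy, start, target, max_steps=1000):
    """Roll out one virtual learner and return the number of learning steps
    needed to reach the target level on every trait."""
    state = start
    for step in range(max_steps):
        if all(s >= t for s, t in zip(state, target)):
            return step
        state = simulate_step(model, state, policy(state))
    return max_steps
```

Because `virtual_episode` only queries the fitted model, it can be called arbitrarily often without involving further actual learners, which is the property that allows the virtual policy to be trained over as many episodes as required.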
6 Concluding Remarks and Future Directions
In this paper, we developed an MDP formulation for an adaptive learning system by describing learners’ latent traits as continuous instead of simply classifying learners as mastery or non-mastery of certain skills. The objective of the system is to improve learners’ abilities to the prespecified target levels. We developed a deep Q-learning algorithm, a model-free DRL algorithm that can effectively find the optimal learning policy from data on learners’ learning process without knowing the transition model of the learner’s latent traits. To cope with the challenge of insufficient state transition data, which may result in poor performance of the deep Q-learning algorithm, we developed a transition model estimator that emulates the learner’s learning process using neural networks and can be used to further train the DQN and improve its performance.
The two simulation studies presented in the paper verified that the proposed methodology is very efficient in finding a good learning policy for adaptive learning systems without any help from a teacher. The optimal learning policy found by the DQN algorithm outperformed the heuristic and random methods with much higher rewards or, equivalently, far fewer learning steps/cycles for learners to reach the target levels of all prespecified abilities. In particular, with the aid of a transition model estimator, the adaptive learning system can find a good learning policy efficiently after training on only a few learners.
There are several directions for extending this research. First, the adaptive learning system can be applied to actual learners to further assess the efficiency of the proposed methodology. Both the DQN algorithm and the transition model estimator can be adopted and evaluated through real data analysis on an online learning platform. Second, the adaptive learning system here consists of a latent trait estimator, which uses measurement models to estimate latent traits, and a learning policy. Some studies instead construct the system assuming that learning materials influence learners’ responses to test items directly, without incorporating the latent trait estimator (lan2014time; lan2016contextual). As such, learners’ learning process is modeled and traced directly, and model-free algorithms can be proposed to find the optimal learning policy. Third, because each group of learners is assumed to follow a homogeneous MDP, further research can be conducted to classify learners into groups before they use the adaptive learning system in order to find the optimal learning policy for each group. Fourth, machine learning algorithms for recommendation systems (e.g., collaborative filtering, matrix decomposition, etc.) can be incorporated with the DRL algorithm to better recommend not only optimal but also preferred materials to learners (li2010contextual). Fifth, it would be interesting to further examine how much better the DQN learning policy is than the random and heuristic learning policies under different scenarios and constraints. Sixth, it would also be interesting to formulate the adaptive learning problem as a partially observable Markov decision process (POMDP) and explore solutions to the problem. Finally, future studies can include the modeled learning paths (chen2018hidden; wang2018tracking) as learners’ historical data to search for the optimal learning policy more efficiently.