Curriculum Learning for Reinforcement Learning has been an increasingly active area of research; its core principle is to train an agent on a sequence of intermediate tasks, called a Curriculum, to increase the agent’s performance and learning speed. Previous work has focused on either gradually modifying the agent’s experience within a given task [2, 11], or on scheduling sequences of different and increasingly complex tasks [10, 4].
We introduce a framework for Curriculum Learning that encompasses both paradigms, centred around the concept of task complexity and its progression as the agent becomes increasingly competent. The framework enables ample flexibility as the tasks can be selected from an infinite set and, most significantly, the task difficulty can be modified at each time step.
The framework is based on two components: a progression function calculating the complexity of the task for the agent at any given time, and a mapping function modifying the environment according to the required complexity. The progression function encompasses task selection and sequencing, while the mapping function is responsible for task generation. Generation, selection and sequencing are the central components of Curriculum Learning, and they are seamlessly integrated into the proposed framework.
Different definitions of the progression and mapping functions result in different curriculum learning algorithms. In this paper, we focus on the introduction of two families of progression functions: fixed and adaptive progressions. The former produces a curriculum that is predetermined before execution, while the latter results in a dynamic curriculum, adjusted based on the online performance of the agent. The progression function, besides providing task selection and sequencing, also determines how long the agent should train on each task. In most existing methods the agent learns intermediate tasks until convergence; however, it has been suggested that this is unnecessary. Progression functions address this problem by providing an implicit way to determine when to stop training on a task and advance to the next one.
We demonstrate the effectiveness of the curriculum learning algorithms, derived from the proposed framework, on two experimental domains from the literature, and compare them with state-of-the-art methods for task-level curricula and experience ordering within the same task.
In the remainder of the paper we first lay out the mathematical framework behind our approach. We then introduce the two classes of progression functions, fixed and adaptive, based on whether the performance of the agent is taken into account. Furthermore, we propose an online and adaptive progression for the ordering and selection of the tasks in the curriculum. Lastly, we validate our approach by comparing it to two other state-of-the-art Curriculum Learning algorithms.
We model tasks as episodic Markov Decision Processes (MDPs). An MDP is a tuple $\langle S, A, T, R, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, $T : S \times A \times S \to [0, 1]$ is the transition function, $R : S \times A \times S \to \mathbb{R}$ is the reward function and $\gamma \in [0, 1]$ is the discount factor. Episodic tasks have absorbing states, which cannot be left and from which the agent only receives a reward of $0$.
For each time step $t$, the agent receives an observation of the state $s_t \in S$ and takes an action $a_t \in A$ according to a policy $\pi : S \to A$. The aim of the agent is to find the optimal policy $\pi^*$ that maximizes the expected discounted return $G_0 = \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r_t\right]$, where $H$ is the maximum length of the episode.
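As a small illustration, the discounted return of a finite episode can be computed by folding the rewards backwards:

```python
def discounted_return(rewards, gamma):
    """Discounted return G = sum_t gamma^t * r_t for one finite episode,
    computed backwards so each reward is discounted exactly once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```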
We use two learning algorithms and function approximators, one for each domain used in the experiments.
The first is a learning algorithm that takes advantage of an estimate of the value function $Q^{\pi}(s, a)$. We represent the value function with a linear function approximator, so that the learning algorithm computes an estimate of $Q^{\pi}$ as a linear combination of features: $\hat{Q}(s, a) = \theta^{\top} \phi(s, a)$.
Proximal Policy Optimization (PPO) is a policy gradient algorithm that optimizes the loss function defined as $L(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ and $\hat{A}_t$ is the advantage function. In this case, $\theta$ are the parameters of a neural network used as function approximator for this learning algorithm.
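For concreteness, the clipped surrogate objective can be written in a few lines of NumPy; this is a sketch of the standard PPO-clip loss, not the paper's implementation:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective: mean over the batch of
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))
```

In practice this quantity is maximized (or its negation minimized) with respect to the network parameters that produce `ratio`.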
2.1 Curriculum Learning
Let $\mathcal{T}$ be a (possibly infinite) set of MDPs that are candidate tasks for the curriculum, and $m_f \in \mathcal{T}$ a final task. The final task is the task the designer wants the agent to learn more efficiently through the curriculum. We define a curriculum as a sequence of tasks in $\mathcal{T}$:
[Curriculum] Given a set of tasks $\mathcal{T}$, a curriculum over $\mathcal{T}$ of length $n$ is a sequence of tasks $c = \langle m_1, m_2, \ldots, m_n \rangle$ where each $m_i \in \mathcal{T}$.
The tasks in the curriculum are called the intermediate tasks. The aim of the curriculum is to reach the optimal policy of the final task as quickly as possible, that is, with the smallest possible number of learning steps.
3 Related Work
Learning through a curriculum consists of modifying the experience of the agent so that each learning step is the most beneficial. The general principle underlying curriculum learning spans a vast gamut of methods: from reordering the generated transition samples within a given task, such as in Prioritized Experience Replay, to learning a Curriculum MDP over the space of all possible policies.
While a curriculum may be constructed on the transition samples within a single MDP (as mentioned above for Prioritized Experience Replay), this work belongs to the category of task-level curriculum methods, where the agent is presented with a sequence of different MDPs over time. We impose no restrictions on the differences between the MDPs in a curriculum.
The problem of Curriculum Learning for Reinforcement Learning has been formalized by Narvekar et al., who also provided heuristics for the creation of the intermediate tasks. Most task-level methods that impose no restrictions on the MDPs focus on the selection and sequencing of the tasks, assuming that a set of candidate intermediate tasks exists. In all existing methods such a set is finite, whilst our framework allows for a potentially infinite set of candidate tasks.
Previous work has considered the automatic generation of a graph of tasks based on a measure of transfer potential, later extended to an object-oriented framework which also enables random task generation. In both of these works the curriculum is generated before execution (with online adjustments in the latter case), whilst in our framework the curriculum can be entirely generated online. Narvekar and Stone introduce an online curriculum generation method based on the concept of Curriculum MDP (CMDP). A CMDP is defined according to the knowledge representation of the agent (in particular, according to the parameter vector of the policy); for any current policy of the agent, a policy over the CMDP returns the next task to train on. The framework has the theoretical guarantees of MDPs, but suffers from the same practical limitations currently imposed by the use of non-linear function approximation, where convergence to a globally optimal policy cannot be guaranteed.
Teacher-Student Curriculum Learning (TSCL) is based on a Partially Observable MDP formulation. The teacher component selects tasks for the student while the student is learning, favouring tasks in which the student is making the most progress or which the student appears to have forgotten. The aim is for the agent to be able to solve all the tasks in the curriculum. Conversely, in our problem definition, the intermediate tasks are stepping stones towards the final task, and our agent aims at minimizing the time required to learn the optimal policy for the final task.
The formulation of sequencing as a combinatorial optimization problem over the intermediate tasks lends itself to the design of globally optimal sequencing algorithms. One such algorithm is Heuristic Task Sequencing for Cumulative Return (HTS-CR), a complete anytime algorithm that converges to the optimal curriculum up to a maximum length. Due to this guarantee of optimality, we use HTS-CR as one of the baselines to evaluate our sequencing method.
Another category of methods is designed with specific restrictions over the set of MDPs in the curriculum, such as the case in which only the reward function changes. In this category, Reverse Curriculum Generation (RCG) gradually moves the start state away from the goal using a uniform sampling of random actions. The pace of this sampling is influenced by the performance of the agent, and is akin to our progression functions. For this reason, we use RCG as our second baseline. The main assumption that RCG makes is that the given goal state is reachable from all of the start states (and vice versa) when performing a uniform sampling of random actions. Whilst the possibility to influence only the start state is more limited compared to our approach, and the assumption above restricts the domains where RCG can be applied, RCG has the advantage of requiring minimal domain knowledge to be effective.
4 Curriculum Learning with a Progression Function
This paper aims to answer the question of how the difficulty of the task should change during the training of a Reinforcement Learning agent. We begin by defining our framework for Curriculum Learning, and then we introduce two families of progression functions.
As previously introduced, our curriculum learning framework is composed of two elements: a progression function, and a mapping function.
We define a progression function as follows:

$$P : \mathbb{N} \times V \to [0, 1],$$

where $c_t = P(t, v)$ is the complexity factor at time $t$, and $v \in V$ is a set of domain-dependent inputs which affect the complexity progression over time. The value $c_t$ reflects the difficulty of the MDP on which the agent trains at time $t$. The MDP corresponding to $c_t = 1$ is the final task, and the smaller the value of $c_t$, the easier the corresponding task is.
In order to appropriately adjust the complexity of the environment, we introduce the mapping function:

$$M : [0, 1] \to \mathcal{T},$$

which maps a specific value of $c_t$ to a Markov Decision Process created from the set $\mathcal{T}$. We assume that one such function exists, for instance when provided by a human expert.
At any change of the value of $c_t$ according to the progression function, a new intermediate task is added to the curriculum of the agent. Given a mapping function $M$, changing $c_t$ at time $t_i$ defines a new intermediate task $m_i = M(c_{t_i})$. Specifying the way $c_t$ changes through a progression function selects which tasks are added to the curriculum, and in which order, resulting in the curriculum:

$$c = \langle M(c_{t_1}), M(c_{t_2}), \ldots, M(c_{t_k}) \rangle,$$

where $k$ is the number of updates to the value of $c_t$.
The co-domain of the mapping function contains the set of all candidate intermediate tasks, and progression functions choose which tasks are to be learnt by the agent, in which order, and at which time step.
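The interaction between the two components can be sketched as a simple training loop; `progression`, `mapping`, and `train_step` are placeholders for the progression function, the mapping function, and one step of the learning algorithm:

```python
def run_curriculum(progression, mapping, train_step, total_steps):
    """Drive training with a curriculum: at each step, query the
    progression function for the complexity factor and rebuild the
    task through the mapping function whenever the factor changes."""
    curriculum = []            # complexity factors actually used
    prev_c, task = None, None
    for t in range(total_steps):
        c = progression(t)
        if c != prev_c:        # complexity changed: new intermediate task
            task = mapping(c)
            curriculum.append(c)
            prev_c = c
        train_step(task)
    return curriculum
```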
4.2 Progression Functions
As mentioned above, progression functions determine how $c_t$ changes throughout the training, and can be considered the generator of the curriculum. In this section we make a distinction between two types of progression functions: Fixed and Adaptive. This distinction is based on whether or not the performance of the agent is taken into account when calculating the complexity factor.
4.2.1 Fixed progression
The first class of progression functions used in this paper is called “Fixed progression”. It includes progression functions that do not take into account information regarding the performance of the agent to derive the value of $c_t$.
The simplest example of one such progression function is the linear progression. The only parameter in this function is the time step at which the progression should come to an end, $t_{end}$. The equation of this progression function is as follows:

$$c_t = \min\left(\frac{t}{t_{end}},\ 1\right)$$
When using this progression function, the complexity of the environment increases linearly until reaching a value of 1 at $t = t_{end}$. This function has the advantage of having only one parameter and being easy to implement; on the other hand, it lacks the flexibility of more complex progression functions.
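As a sketch, the linear progression is a one-liner:

```python
def linear_progression(t, t_end):
    """Complexity factor rises linearly from 0 to 1, then stays at 1."""
    return min(t / t_end, 1.0)
```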
We also introduce a second progression function, the exponential progression, with equation:

$$c_t = 1 - \frac{e^{-t/\lambda} - e^{-t_{end}/\lambda}}{1 - e^{-t_{end}/\lambda}}$$
Like the previous function, it includes a parameter $t_{end}$ to indicate when the progression should end, and also takes a parameter $\lambda$ which influences the slope of the progression. If $\lambda$ is positive, the smaller $\lambda$, the steeper the first part of the progression; if $\lambda$ is negative, the smaller the absolute value of $\lambda$, the shallower the initial part of the progression. When the absolute value of $\lambda$ is large, the progression converges to a linear progression. For this reason, only the exponential progression will be included in the results section.
In Equation 6, $e^{-t/\lambda}$ is the key factor responsible for the progression, starting at 1 when $t = 0$ and then decaying as $t$ increases. The reason why $e^{-t_{end}/\lambda}$ is subtracted in the numerator is that, given a certain number of time steps $t_{end}$, the numerator will be equal to 0 at $t = t_{end}$, and it is clipped to 0 as $t$ keeps increasing. Lastly, the denominator is necessary to enforce the constraint that $c_0 = 0$, as without the denominator $c_0 = e^{-t_{end}/\lambda} \neq 0$.
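A sketch of the exponential progression under our reading of Equation 6, with the fraction clipped to $[0, 1]$ so that the complexity stays at 1 past $t_{end}$ for either sign of $\lambda$:

```python
import math

def exponential_progression(t, t_end, lam):
    """Complexity factor of the exponential progression: 0 at t = 0,
    1 at t = t_end and beyond, with lam controlling the slope."""
    num = math.exp(-t / lam) - math.exp(-t_end / lam)
    den = 1.0 - math.exp(-t_end / lam)
    ratio = min(max(num / den, 0.0), 1.0)  # clip past t_end
    return 1.0 - ratio
```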
The advantage of using a function from the Fixed Progression class is that it is easier to implement than adaptive functions and performs better than not using a curriculum. However, the agent learns differently in each run, so using the same progression function can sometimes hinder the training performance.
4.2.2 Adaptive progression
The second class of progression functions is composed of all the functions that take into account the online performance of the agent and are therefore called adaptive.
In order to utilise the performance of the agent as part of the progression, we define a performance function $f_P : \mathbb{N} \to \mathbb{R}$, which specifies how the agent is performing in a given domain at a given time. This function could be the return, or any other environment-specific function, such as the average probability of success from a uniform sampling of the state space in a navigation domain.
An example of a function in this class is the Friction-based progression, which models a body sliding on a plane and uses its speed to determine the progression. The equation representing this progression function is:

$$c_t = 1 - \mathcal{U}(v_{min},\ v_t),$$

where $v_t$ is the speed of the object at time $t$, $v_{min}$ is the minimum speed of the object in the time interval from $0$ to $t$, and $\mathcal{U}(a, b)$ denotes a uniform sample from the interval $[a, b]$. The principle behind this progression function is that the friction coefficient between the body and the plane is adjusted based on the chosen performance function. As shown in Equation 8, the speed of the object is then used to determine the value of the complexity factor.
The speed of the body is initialised at 1 and, as above, can be changed by modifying the friction coefficient between the body and the plane. The weight of the body is another parameter, which influences how much the friction affects the object. The core concept behind this progression function is to slow down the body when the agent is improving, resulting in an increase in difficulty. In order to assess any improvement in the performance of the agent, we use the average value of the first derivative of the performance function over a set interval; the length of this interval influences how sudden the updates to $c_t$ are. Given a differentiable performance function $f_P$ and an interval $\Delta t$, the value of the friction, $\mu_t$, is:

$$\mu_t = \frac{1}{\Delta t} \int_{t - \Delta t}^{t} \frac{\mathrm{d}f_P(x)}{\mathrm{d}x}\,\mathrm{d}x = \frac{f_P(t) - f_P(t - \Delta t)}{\Delta t}$$
After every time step, the value of $\mu_t$ is updated, and a simulation of one second is performed on the sliding object to extract its resulting speed. Assuming the mass of the sliding object is $m$, the speed of the object, $v_t$, will be:

$$v_t = \mathrm{clip}\left(v_{t-1} - \frac{\mu_t}{m},\ 0,\ 1\right)$$
As the speed of the body is initialised at 1 and decreases to 0, and the value of $c_t$, conversely, starts at 0 and increases until it reaches 1, the general intuition is that $c_t = 1 - v_t$. However, it should be noted that this model is not strictly physically accurate, as it allows for a “negative friction”, under which the speed of the object increases. In the case where there is negative friction and $v_t > v_{min}$, the value of the complexity factor is a uniform sample between the current value of the speed and the minimum value of the speed over the training, as seen previously in Equation 8. This causes the average complexity to drop if the agent is not able to solve the current task, and ensures that the agent does not get stuck on a task too difficult to solve, which could result in a deterioration of the agent’s policy. In some domains, however, a backwards progression may not be possible, such as when transferring between different tasks. In this case it is appropriate to use the following version of the progression function:

$$c_t = 1 - v_{min},$$
which has the advantage of being a monotonically increasing function.
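The friction-based update can be sketched as follows; the exact discretisation (one friction update per time step, speed clipped to $[0, 1]$, friction computed as the averaged derivative of the performance, history padded so the derivative is defined from the start) is our assumption about the mechanics described above:

```python
import random

class FrictionProgression:
    """Sketch of the friction-based adaptive progression:
    friction = averaged derivative of the performance over `interval`,
    speed   <- clip(speed - friction / mass, 0, 1),
    c       = 1 - Uniform(v_min, speed)   (or 1 - v_min if monotonic)."""

    def __init__(self, mass, interval, initial_perf=0.0, monotonic=False):
        self.mass = mass
        self.interval = interval
        # Pad the history so the averaged derivative is defined from t = 0.
        self.history = [initial_perf] * interval
        self.speed = 1.0      # the body starts sliding at speed 1
        self.v_min = 1.0      # minimum speed observed so far
        self.monotonic = monotonic

    def update(self, performance):
        """Record the latest performance value and return c_t."""
        self.history.append(performance)
        # Average derivative of the performance over the last interval:
        friction = (self.history[-1] - self.history[-1 - self.interval]) / self.interval
        self.speed = min(max(self.speed - friction / self.mass, 0.0), 1.0)
        self.v_min = min(self.v_min, self.speed)
        if self.monotonic:
            return 1.0 - self.v_min
        return 1.0 - random.uniform(self.v_min, self.speed)
```

Negative friction (a drop in performance) makes the speed rise again, which widens the sampling interval and lowers the average complexity, as described above.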
A problem arises when averaging the derivative of the performance function over an interval $\Delta t$: how to handle the progression when $t < \Delta t$. One solution to this problem is to suspend the progression until $t \geq \Delta t$; however, this creates two main issues. Firstly, it restricts the magnitude of the interval, as the progression would be suspended for the first part of the training, which does not utilise the full potential of the algorithm. The second issue is that the progression becomes heavily dependent on the performance of the agent in the first time steps. As proven in Appendix A, the progression ends at time $T$ when:

$$\bar{f}_{[T - \Delta t,\, T]} - \bar{f}_{[0,\, \Delta t]} = m,$$

where $\bar{f}_{[a, b]}$ is the average value of the performance function between $a$ and $b$. This equation implies that the only values of $f_P$ that influence when a progression ends are the first $\Delta t$ and the last $\Delta t$ ones; therefore, using the agent's performance for the first $\Delta t$ values would result in an unpredictable progression. Our approach starts the training at time step $\Delta t$ and sets the first $\Delta t$ values of the performance function to the reward that the agent would experience if it did not perform any action (usually $-1$ or $0$). With this method, Equation 12 can also be solved for $m$ and used to determine the appropriate value of $m$ such that the progression ends once a certain average value of the performance function is reached for $\Delta t$ time steps.
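Under our reading of the end condition (Equation 12), where the average performance over the last interval must exceed the fixed average over the first interval by exactly the mass $m$, solving for $m$ is immediate:

```python
def mass_for_target(target_avg_perf, initial_perf):
    """Assumed rearrangement of Equation 12: the progression ends when
    avg(last interval) - avg(first interval) = m, so choosing
    m = target - initial makes it end at the desired performance level."""
    return target_avg_perf - initial_perf
```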
If it is possible to train with multiple instances of the domain at the same time, such as when using a parallel implementation of PPO, each instance of the domain can have its own independent progression function. This allows the magnitude of the interval to be varied between instances, resulting in the agent training on a range of complexities of the domain. When using this technique, an interval range should be defined, and the values of $\Delta t$ should be distributed evenly within that range based on the number of processes available.
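The even distribution of interval magnitudes across workers can be sketched as follows (the rounding to whole time steps is our assumption):

```python
def interval_magnitudes(low, high, n_workers):
    """Spread interval lengths evenly across parallel environment
    instances, so each worker runs its own progression with a
    different degree of smoothing."""
    if n_workers == 1:
        return [low]
    step = (high - low) / (n_workers - 1)
    return [round(low + i * step) for i in range(n_workers)]
```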
An example of this technique can be seen in Figure 4, where the complexity factor is plotted for two distinct runs in one of our test domains (Point Maze). The blue interval in the figure represents the range of values of the complexity factor. It is important to note that the value of $c_t$ is not a uniform sample between the upper and lower bound of the area; in this case there are 8 independent values of $c_t$ within the interval. The progression functions with a larger interval give a more stable progression, while the progression functions with a smaller interval quickly adapt to any change in the performance of the agent.
The space complexity of this algorithm is $O(\Delta t)$, as every value of the performance function in the interval needs to be stored. The time complexity of one iteration of the algorithm is $O(1)$, as the only operations executed are the ones in Equations 8, 9 and 10, and the value of $f_P$ is stored after every iteration. This implies that no further time complexity is added to the RL cycle. Nevertheless, in order to accurately analyse the time complexity of our approach, one needs to take into account the complexity of the mapping function, which is domain-dependent.
5 Experimental Setting
This section describes the two domains used to test the benefits of our approach against current state-of-the-art algorithms. It also includes details on the implementation of the algorithms on the two tasks.
5.1 Grid World
The first test domain used in this paper is a grid-world domain where the agent needs to reach a treasure while avoiding fires and pits. The agent can navigate this domain by moving North, East, South or West, and the actions taken in this domain are deterministic. The reward function used for this domain is as follows: 200 for reaching the cell with the treasure, -2500 for entering a pit, -500 for entering a fire, -250 for entering one of the four adjacent cells to a fire, and -1 otherwise. In this domain the episodes terminate when the agent reaches the treasure, falls into a pit or performs 50 actions.
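The reward structure above can be sketched as a small function; the cell labels are hypothetical, and the environment's actual state encoding may differ:

```python
def grid_reward(cell, adjacent_to_fire):
    """Reward function of the grid-world domain: 200 for the treasure,
    -2500 for a pit, -500 for a fire, -250 next to a fire, -1 otherwise."""
    if cell == "treasure":
        return 200
    if cell == "pit":
        return -2500
    if cell == "fire":
        return -500
    if adjacent_to_fire:
        return -250
    return -1
```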
5.2 Point Maze
The second domain used is a MuJoCo environment where an agent must navigate through a G-shaped maze to a goal area. The model controlled by the agent, referred to as point, is a sphere with a cube added to one end as a directional marker. The point has two degrees of freedom: it accelerates in the direction of the marker, and rotates around its own axis. The agent receives a reward of 1 when entering the goal area and of 0 otherwise. The episodes terminate when the agent reaches the goal, or after 500 time steps.
5.3 Implementation Details
For both domains we use the implementation of the Friction-based progression described in Equation 8, which includes a uniform sampling in case the agent’s performance drops. In the Grid World domain the standard implementation was used, with an interval of 3000 time steps. In the Point Maze domain, on the other hand, the parallel variant of the algorithm was used. The number of parallel processes was 8, and the set of magnitudes of the interval used was episodes. Note that, as the learning algorithm used was a parallel implementation of PPO, this is equivalent to episodes in the linear version of the algorithm.
For both domains the mapping function used is $M(c_t) = m_{c_t}$, where $m_{c_t}$ is the MDP obtained by uniformly sampling start states at a normalised distance $c_t$ from the goal. In the Grid World domain the distance metric used was the Manhattan distance, while in the Point Maze domain the distance was intended as the approximate length of the shortest path to the goal.
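A minimal sketch of how such a mapping function might be implemented, assuming the start states have been pre-bucketed by normalised distance from the goal (the bucket structure and names are illustrative, not the actual implementation):

```python
import random

def sample_start_state(c, states_by_distance):
    """Given a complexity factor c in [0, 1], sample a start state
    uniformly among those whose normalised distance from the goal is
    closest to c. `states_by_distance` maps normalised distances to
    lists of candidate start states."""
    distances = sorted(states_by_distance)
    target = min(distances, key=lambda d: abs(d - c))
    return random.choice(states_by_distance[target])
```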
The performance function chosen for the Grid World domain was the reward function, whereas in the Point Maze domain the performance function had a value of 1 if the agent successfully reached the goal, and 0 otherwise. In both domains the mass of the object was calculated using Equation 12. In the Grid World domain, the initial value of the performance function was set to -1, and the progression was set to end once the average value of the performance function reached 150, which is the lowest reward the agent can obtain while still reaching the goal within the time limit. In the Point Maze domain, on the other hand, the initial value of the performance function was set to 0, and the progression was set to end once the agent reached the goal successfully 50 percent of the time.
When using HTS-CR as one of our baselines, we used transfer between tasks. In the Grid World domain, the transfer method is the one outlined in the original paper; in the Point Maze domain, the transfer method consisted of directly transferring the neural network from one task to the next. In both domains, the intermediate tasks used by HTS-CR were a subset of the tasks used by our approach; this gives some insight into the quality of the solution found by our approach compared to what is possible with a standard curriculum. HTS-CR was modified to optimise the performance on the final task rather than the cumulative return; this does not violate the guarantee of optimality, and is the more appropriate metric for this comparison.
The version of Reverse Curriculum Generation our approach is compared to is the best-performing one, which samples the start states used to generate new starts from the set of states that the agent has not yet mastered, but from which the agent can reach the goal using its current policy. Finally, the replay buffer used for this algorithm took into account the performance of the agent over the course of the training.
6 Results
This section starts with the comparison between the Exponential progression and the Friction-based progression. It then compares the performance of our approach with the two other state-of-the-art Curriculum Learning algorithms outlined in the Related Work section: Reverse Curriculum Generation (Florensa et al.) and HTS-CR (Foglino et al.).
6.1 Progression Functions
In order to show the benefits of using the Friction-based progression, we compared it against the best-performing Exponential progression. This comparison is meant to ascertain whether an adaptive progression can be beneficial over a simpler (but fixed) progression function. Figure 6 reports the averages and confidence intervals over 500 and 50 runs, respectively, on the two test domains.
As evident from Figure 6, the performance of the two progression functions is similar when the number of training episodes is low, but when the algorithms are given more time the Friction-based progression clearly outperforms the Exponential progression.
In the Point Maze domain, as can be seen in Figure 6, the Exponential progression is once again surpassed by the Friction-based progression. While the initial performance is similar, the performance of the Exponential progression starts dropping in the later stages of the training. This happens because at times the progression is too fast for the agent, causing the environment’s complexity to increase even when the agent is not able to solve the current task. This is highlighted by the fact that some runs converge to the optimal policy, while in others the performance drops as the agent fails to keep up with the progression. This effect is amplified by the Point Maze environment, as the agent needs to learn to turn corners: if the agent fails to learn how to turn a corner and the progression continues further, the agent experiences no reward, resulting in an eventual deterioration of its policy. The Friction-based progression, on the other hand, mitigates this problem by using the agent’s performance as an indication of when to progress.
These results suggest that the ability of the Friction-based progression to adapt to the performance of the agent is highly advantageous, as the performance obtained by using the Exponential progression is significantly lower, especially in the more complex testing domain.
6.2 Comparison with State-of-the-art
In order to show the potential of our framework, we compared the Friction-based progression with two other state-of-the-art Curriculum Learning algorithms: HTS-CR and Reverse Curriculum Generation. Figure 8 and Table 1 report the averages and 95 percent confidence intervals on both test domains.
Table 1: Performance (mean ± 95 percent confidence interval) of the Curriculum Learning algorithms on both test domains.

| Algorithm | Grid World | Point Maze |
|---|---|---|
| Friction-Based | 186.09 ± 2.967 | 98.06 ± 1.20 |
| Exponential | 169.27 ± 6.06 | 67.92 ± 5.59 |
| HTS-CR final | 190.82 ± 0.32 | 88.24 ± 1.86 |
| HTS-CR | 179.36 ± 1.69 | 62.28 ± 3.15 |
| Reverse Curriculum | 169.74 ± 4.24 | 55.58 ± 2.84 |
In Figure 8 the performance of all three algorithms is compared on the Grid World domain. The x-axis represents the maximum number of episodes each algorithm was allowed to run for, whilst the y-axis represents the average maximum return obtained in a training run. The curves for the Friction-based progression and Reverse Curriculum are averages over 500 runs, whereas the HTS-CR curve is an average over 50 runs. HTS-CR was allowed to run for 30000 episodes in order to allow enough time to find a good curriculum. This plot highlights how the Friction-based progression has a steep increase in performance over the first part of the training, eventually converging to a policy close to the optimum. In this domain, HTS-CR converges to the optimal policy, and Reverse Curriculum converges to a sub-optimal policy close to the value obtained when using an Exponential progression. In this domain, 40 percent of the best curricula found by HTS-CR had the intermediate tasks sorted in ascending order with respect to the distance of the starting state from the goal. This suggests that the mapping function used might not be the optimal one.
Figure 8 reports the results on the Point Maze domain. In this domain, HTS-CR was allowed to run for 20 times the amount of time allocated to the other two algorithms, and its final performance is plotted as a green dashed line. Consistent with the experiments conducted in the previous domain, the Friction-based progression converges to a policy close to the optimum, surpassing the performance of HTS-CR and Reverse Curriculum. HTS-CR outperforms Reverse Curriculum throughout the training, and converges to a value below the optimum, highlighting the need for more tasks in the curriculum. We also found that in 58 percent of the runs of HTS-CR, the best curriculum found had the intermediate tasks sorted in ascending order with respect to the distance of the starting state from the goal. This is in line with the mapping function used in this task, which uses distance as a metric for complexity.
Table 1 reports the performance of the different Curriculum Learning algorithms on both testing domains, where “HTS-CR final” is the final performance obtained by HTS-CR, and “HTS-CR” is the performance obtained within the same time frame as the other algorithms. It is evident that the framework proposed in this paper provides an advantage over existing approaches, especially in more complex domains, where using a traditional curriculum over a subset of the intermediate tasks fails to converge to an optimal solution. Moreover, the performance of the Exponential progression is equivalent to that of Reverse Curriculum Generation in one of our test domains, and higher in the other.
7 Conclusions and Outlooks
In this paper we proposed a novel curriculum learning framework that, by using progression functions and mapping functions, can build an online curriculum tailored to the agent. We performed a comparative evaluation against two state-of-the-art curriculum learning techniques on two domains. The results show that our approach of curriculum learning with progression functions outperforms these baselines.
- da Silva, F. L. and Costa, A. H. R. (2018) Object-oriented curriculum generation for reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- Florensa, C., Held, D., Wulfmeier, M., Zhang, M. and Abbeel, P. (2017) Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning.
- Foglino, F., Christakou, C. C. and Leonetti, M. (2019) Curriculum learning for cumulative return maximization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2308–2314.
- Foglino, F., Christakou, C. C. and Leonetti, M. (2019) An optimization framework for task sequencing in curriculum learning. In 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 207–214.
- Hill, A. et al. (2018) Stable Baselines. GitHub. https://github.com/hill-a/stable-baselines
- Matiisen, T., Oliver, A., Cohen, T. and Schulman, J. (2017) Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems.
- Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E. and Stone, P. (2020) Curriculum learning for reinforcement learning domains: a framework and survey. arXiv preprint arXiv:2003.04960.
- Narvekar, S., Sinapov, J., Leonetti, M. and Stone, P. (2016) Source task creation for curriculum learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 566–574.
- Narvekar, S., Sinapov, J. and Stone, P. (2017) Autonomous task sequencing for customized curriculum design in reinforcement learning. In The 2017 International Joint Conference on Artificial Intelligence (IJCAI).
- Narvekar, S. and Stone, P. (2019) Learning curriculum policies for reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 25–33.
- Riedmiller, M. et al. (2018) Learning by playing: solving sparse reward tasks from scratch. In Proceedings of the 35th International Conference on Machine Learning, pp. 4344–4353.
- Schaul, T., Quan, J., Antonoglou, I. and Silver, D. (2016) Prioritized experience replay. In International Conference on Learning Representations (ICLR).
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Svetlik, M., Leonetti, M., Sinapov, J., Shah, R., Walker, N. and Stone, P. (2017) Automatic curriculum graph generation for reinforcement learning agents. In Thirty-First AAAI Conference on Artificial Intelligence.
- Todorov, E., Erez, T. and Tassa, Y. (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
Appendix A. Proof: end of an adaptive progression relative to a performance function

The progression ends at time $T$ when the speed of the body reaches 0, that is, when the accumulated friction equals the mass of the body:

$$\sum_{t=\Delta t}^{T} \mu_t = \sum_{t=\Delta t}^{T} \frac{f_P(t) - f_P(t - \Delta t)}{\Delta t} = m$$

Let $u = t - \Delta t$. As we assume that $f_P$ is only defined for $t \geq 0$, we start calculating $\mu_t$ using this formula only when $t \geq \Delta t$. The expression above can therefore be simplified by cancelling the terms that appear in both $\sum_{t=\Delta t}^{T} f_P(t)$ and $\sum_{u=0}^{T - \Delta t} f_P(u)$:

$$\frac{1}{\Delta t}\left(\sum_{t=T-\Delta t+1}^{T} f_P(t) - \sum_{t=0}^{\Delta t - 1} f_P(t)\right) = m$$

Let $\bar{f}_{[a, b]}$ be the average value of the performance function between $a$ and $b$. From this equation it follows that $v_T = 0$ (the progression ends) when:

$$\bar{f}_{[T - \Delta t,\, T]} - \bar{f}_{[0,\, \Delta t]} = m$$