1 Introduction
Modern reinforcement learning (RL) algorithms are able to learn impressive skills, such as playing chess or Go, often with higher competency than humans (silver2017masteringchess; silver2017mastering). Nonetheless, the learned skills are highly specific. Given a different context, such as collecting as many opponent pieces as possible in the next 5 moves instead of winning the full game, the algorithms have to learn their behavior from scratch, which is time consuming. In contrast, humans quickly adapt to such changes and still show high performance. The field of transfer learning (taylor2009transfer; lazaric2012transfer) investigates how artificial agents can become more adaptive to such changes. One area where humans are very adaptive, and which is the focus of this paper, is the presence of changing time restrictions for a task. Take, for example, one of our daily routines: going out for lunch. Several restaurants might exist in our neighborhood. Each requires a different amount of time to reach and provides food of a different quality. The general goal is to learn the shortest way to a good restaurant, but our specific objective might change from day to day. One day we have a lot of time for lunch and want to go to the best restaurant. Another day we are under stress and want to go to the best restaurant reachable within a fixed time limit.
Standard model-free RL algorithms such as Q-learning (watkins1989learning; watkins1992q) would learn such a task by defining a specific reward function for each objective. For example, to learn the way to the best restaurant, the reward function would simply return the food quality of a reached restaurant. Thus, an agent will learn the shortest way to the best restaurant. If the agent is under stress and has to trade off food quality against the invested time, the reward function could include a punishment (a negative reward) for each performed step until a restaurant is reached. Thus, the agent might not go to the best restaurant, but to one that is closer. With this approach, each objective represents a different task for which a new policy has to be learned. But learning takes time, and an objective might change from one episode to another. Moreover, the number of possible objectives can be infinite. As a result, classical agents might not have time to learn an appropriate policy for each objective.
We formalize this new type of adaptation scenario in the form of Time Adaptive MDPs (TAMDPs), which are extensions of standard MDPs (sutton1998reinforcement). Moreover, we propose two modular, model-free algorithms, the Independent γ-Ensemble (IGE) and the n-Step Ensemble (NSE), to solve such scenarios. Both learn several behaviors in parallel via their modules, each for a different timescale, resulting in a behavioral library. Given a change in the objective, the most appropriate behavior from the library can be selected without the need for relearning. Both algorithms are inspired by neuroscientific findings about human decision-making which suggest that humans learn behaviors not only for a specific timescale, but for several timescales in parallel (tanaka2007serotonin).
2 Time Adaptive MDPs
We formalize time adaptive reinforcement learning tasks in the form of Time Adaptive MDPs:
TAMDPs extend standard MDPs by a finite or infinite set of objective functions $\mathcal{O} = \{o_1, o_2, \ldots\}$ with $o : \mathbb{R} \times \mathbb{N} \to \mathbb{R}$. Each objective function $o(R, T)$ evaluates the agent's performance in regard to its total collected reward $R$ and the number of time steps $T$ until it reached a terminal state for a single episode. During each episode, one objective $o \in \mathcal{O}$ is active and the goal is to maximize the expectation over it while using a minimum number of steps:
$\pi^*_o = \arg\max_\pi \; \mathbb{E}\left[ o(R, T) \mid \pi \right]$   (1)
where $\pi$ is the policy used for objective $o$. Objective functions are monotonically increasing with reward $R$ and decreasing with time $T$, i.e. getting more reward and using less time is better. The agent knows which objective function is active and its mathematical expression. The agent's goal is to learn a policy to optimize this objective.
Our restaurant selection example can be formalized as a TAMDP. The agent's location defines the state space and its actions are different movement directions. Restaurants are terminal states. The reward function indicates the food quality of a restaurant after reaching it. The two time restrictions are represented as objective functions $\mathcal{O} = \{o_1, o_2\}$. The first objective is to go to the restaurant with the highest food quality: $o_1(R, T) = R$. The second objective sets a strict time limit of five steps and gives a large punishment $c$ if more steps are needed: $o_2(R, T) = R$ if $T \le 5$, otherwise $R - c$. Depending on the active objective, the optimal policy, i.e. where to eat, changes.
The major challenge of TAMDPs is that the number of objective functions can be infinite and that each episode can have a different objective. Thus, an agent might experience a certain objective only once. As a result, it needs to immediately adapt to it.
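The objective functions of the restaurant example can be illustrated with a short sketch. The numeric values (food quality, time limit, penalty constant) are hypothetical and only serve to show how the same episode outcome is scored differently under different objectives:

```python
# Sketch of two TAMDP objective functions o(R, T) for the restaurant example.
# R is the total collected reward (food quality), T the episode length.
# The penalty constant is an assumption, not from the paper.

def o1(R, T):
    """Maximize food quality, no time restriction."""
    return R

def o2(R, T, limit=5, penalty=100.0):
    """Maximize food quality under a strict time limit."""
    return R if T <= limit else R - penalty

# The same trajectory is scored differently depending on the active objective:
print(o1(10, 8))  # 10
print(o2(10, 8))  # -90.0
print(o2(6, 4))   # 6
```

Under $o_1$ the distant high-quality restaurant wins; under $o_2$ a closer restaurant with lower food quality becomes optimal.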
3 Algorithms
We propose two algorithms to solve TAMDPs. Both are modular and learn a set of policies $\Pi = \{\pi_1, \ldots, \pi_K\}$. The policies represent optimal behaviors on different time scales. Given the active objective $o$ of the episode, one of the policies is selected at the start of the episode and used throughout it. The goal is to select the most appropriate policy from the set $\Pi$. To accomplish this, both algorithms learn for each of their policies $\pi_k$ the expected total return $\mathbb{E}[R_k]$ and the expected number of steps $\mathbb{E}[T_k]$ until a terminal state is reached. Based on these expectations, the policy $\pi_{k^*}$ is selected at the start of the episode which maximizes the active objective while minimizing the number of steps (Eq. 1):
$k^* = \arg\max_k \; o\left( \mathbb{E}[R_k], \mathbb{E}[T_k] \right)$   (2)
An important restriction of the method is that the selection of the policy depends on an approximation of the expected outcome for the objective (Eq. 1), using the expectations over the total return and number of steps as input to the objective function:

$\mathbb{E}\left[ o(R, T) \right] \approx o\left( \mathbb{E}[R], \mathbb{E}[T] \right)$

This approximation is not correct for all types of objective functions, but it often provides a good heuristic to select an appropriate policy. The proposed algorithms are introduced in the next sections. They differ in the way they learn the policy set $\Pi$.
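The selection rule of Eq. 2 can be sketched as follows. The per-module expectations here are hypothetical numbers, not learned values, and the tie-breaking toward fewer steps reflects the "while minimizing the number of steps" condition:

```python
# Minimal sketch of the policy-selection rule (Eq. 2): at episode start,
# pick the module whose learned expectations maximize the active objective.
# Module statistics are hypothetical placeholders.

def select_module(modules, objective):
    """modules: list of (expected_return, expected_steps) per policy.
    Returns k* = argmax_k objective(E[R_k], E[T_k]);
    ties are broken in favor of fewer expected steps."""
    best, best_val = None, None
    for k, (exp_R, exp_T) in enumerate(modules):
        val = objective(exp_R, exp_T)
        if best is None or val > best_val or (val == best_val and exp_T < modules[best][1]):
            best, best_val = k, val
    return best

modules = [(2.0, 2.0), (6.0, 5.0), (9.0, 9.0)]  # (E[R_k], E[T_k]) per module
print(select_module(modules, lambda R, T: R))                         # 2
print(select_module(modules, lambda R, T: R if T <= 5 else R - 100))  # 1
```

Switching the objective changes only this selection step; no policy has to be relearned.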
3.1 The Independent γ-Ensemble (IGE)
The Independent γ-Ensemble (IGE) is composed of several modules (Alg. 1). The modules are independent Q-functions $Q_\gamma$ with different discount factors $\gamma \in \Gamma = \{\gamma_1, \ldots, \gamma_K\}$ with $0 \le \gamma_k < 1$, similar to the Horde architecture (modayil2014multi):
$Q_\gamma(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a \right]$   (3)
The factor $\gamma$ defines how strongly future reward is discounted. For low $\gamma$'s the discounting is strong and the optimal behavior is to maximize rewards that can be reached on a short time scale, whereas for high $\gamma$'s the discounting is weak, resulting in the maximization of rewards on longer time scales. As a result, each Q-function defines a different policy $\pi_\gamma$, and the IGE learns a set of policies via its modules: $\Pi = \{\pi_\gamma : \gamma \in \Gamma\}$. The values of each module are learned by Q-learning. Because Q-learning is off-policy, the values of all modules can be updated from the same observations.
In addition to the Q-values, each module learns the expected total return $R_\gamma$ and the number of steps $T_\gamma$ to reach a terminal state under its policy. The expectations are used to select the appropriate policy at the beginning of an episode (Eq. 2). Both expectations can be formulated incrementally, similar to the Q-function, and learned in a similar manner. After an observation $(s_t, a_t, r_t, s_{t+1})$, the expectations of all modules for which $a_t$ is the greedy action ($a_t = \arg\max_a Q_\gamma(s_t, a)$) are updated by:
$R_\gamma(s_t, a_t) \leftarrow R_\gamma(s_t, a_t) + \alpha \left( r_t + R_\gamma(s_{t+1}, a^*_{t+1}) - R_\gamma(s_t, a_t) \right)$   (4)

$T_\gamma(s_t, a_t) \leftarrow T_\gamma(s_t, a_t) + \alpha \left( 1 + T_\gamma(s_{t+1}, a^*_{t+1}) - T_\gamma(s_t, a_t) \right)$   (5)

where $a^*_{t+1} = \arg\max_a Q_\gamma(s_{t+1}, a)$ is the greedy action in the successor state and $\alpha$ is the learning rate parameter also used for the updates of the Q-values.
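A single IGE update step can be sketched as below, assuming tabular Q-learning. The class and its names are ours; the point is that one shared transition updates every module's Q-function off-policy, while the $R$/$T$ expectations (Eqs. 4 and 5) are only updated along each module's greedy policy:

```python
# Sketch of one IGE update step with tabular Q-learning. All module
# Q-functions learn from the same transition (off-policy); the R/T
# expectations are only updated when the taken action is the module's
# greedy action. Class and variable names are illustrative assumptions.
from collections import defaultdict

class IGEModule:
    def __init__(self, gamma, n_actions, alpha=0.1):
        self.gamma, self.alpha, self.n_actions = gamma, alpha, n_actions
        self.Q = defaultdict(float)   # (s, a) -> Q-value
        self.R = defaultdict(float)   # (s, a) -> expected total reward
        self.T = defaultdict(float)   # (s, a) -> expected steps to terminal

    def greedy(self, s):
        return max(range(self.n_actions), key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s2, terminal):
        g = self.greedy(s2)
        q_next = 0.0 if terminal else self.Q[(s2, g)]
        self.Q[(s, a)] += self.alpha * (r + self.gamma * q_next - self.Q[(s, a)])
        if a == self.greedy(s):  # update expectations only along the greedy policy
            r_next = 0.0 if terminal else self.R[(s2, g)]
            t_next = 0.0 if terminal else self.T[(s2, g)]
            self.R[(s, a)] += self.alpha * (r + r_next - self.R[(s, a)])
            self.T[(s, a)] += self.alpha * (1 + t_next - self.T[(s, a)])

# All modules learn from the same observation:
modules = [IGEModule(g, n_actions=2) for g in (0.5, 0.9, 0.99)]
for m in modules:
    m.update(s=0, a=0, r=1.0, s2=1, terminal=True)
print([round(m.Q[(0, 0)], 2) for m in modules])  # [0.1, 0.1, 0.1]
```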
The IGE has the restriction that it is not guaranteed to learn the Pareto optimal set of policies with regard to the expected total reward $\mathbb{E}[R]$ and number of steps $\mathbb{E}[T]$. The MDP in Fig. 1 shows such a case.
[Figure 1: (a) Example MDP, (b) Choices, (c) Curve]
3.2 The n-Step Ensemble (NSE)
We propose a second algorithm (Alg. 2), the n-Step Ensemble (NSE), to overcome this restriction of the IGE. It is able to learn the set of Pareto optimal policies. Similar to the IGE, the NSE also consists of several modules. Each module is responsible for optimizing the expected total reward for $n$ steps into the future, with $n \in \{1, \ldots, N\}$. Each module learns a value function $Q_n$ representing the optimal total reward that can be reached in $n$ steps:
$Q_n(s, a) = \max_\pi \mathbb{E}\left[ \sum_{t=0}^{n-1} r_t \mid s_0 = s, a_0 = a, \pi \right]$   (6)
One extra condition is that a terminal state should be reached within $n$ steps. If this is not possible, the policy is learned which reaches a terminal state with a minimal number of steps. Similar to the standard Q-function, the values can be defined incrementally as the sum of the immediate expected reward and the Q-value for the optimal action of the next state $s'$. The difference is that the Q-value for the next state comes from the module responsible for optimizing the total reward for $n-1$ steps (harada1997time):
$Q_n(s, a) = \mathbb{E}\left[ r + \max_{a'} Q_{n-1}(s', a') \right]$   (7)
The Q-values $Q_n$, expected reward $R_n$, and number of steps $T_n$ of each module are learned by Q-learning, similar to the IGE. After a transition is observed, all modules are updated in parallel.
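The recursion of Eq. 7 can be illustrated on a small known MDP. The sketch below computes the $Q_n$ values by dynamic programming instead of sampled Q-learning updates, to keep the example short; the chain and its rewards are hypothetical:

```python
# Sketch of the NSE value recursion (Eq. 7) on a hypothetical deterministic
# chain, computed by dynamic programming. Q[n][s] is the best total reward
# obtainable in n steps from s, bootstrapping from module n-1 at the
# successor state.

# 4-state chain: s0 -> s1 -> s2 -> s3 (terminal); reward 1 for entering s3.
transitions = {0: 1, 1: 2, 2: 3}      # state -> next state (single action)
rewards = {0: 0.0, 1: 0.0, 2: 1.0}    # reward for leaving each state
N = 3

Q = {0: {s: 0.0 for s in transitions}}  # module 0: no steps left
for n in range(1, N + 1):
    Q[n] = {}
    for s, s2 in transitions.items():
        nxt = Q[n - 1].get(s2, 0.0)     # terminal successor contributes 0
        Q[n][s] = rewards[s] + nxt

print(Q[1][0], Q[2][0], Q[3][0])  # 0.0 0.0 1.0
```

From the start state, only the module with $n = 3$ "sees" the terminal reward, matching the intuition that each module optimizes a different time horizon.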
4 Experimental evaluation
The IGE and NSE were compared to classical Q-learning in a stochastic gridworld environment (Fig. 1). It consists of 7 terminal states. For each terminal state there exists an optimal path from the start state, for which the agent receives a punishment per step (a different punishment otherwise). Reaching a terminal state results in a positive reward, where more distant goals yield a higher reward. Agents had to adapt to 9 different objectives (see Fig. 4 for their formulation). For each objective, a fixed number of episodes was performed to evaluate how long the agents need to adapt before the task switched to the next objective.
The classical Q-learning algorithm learned an independent Q-function for each objective. Its reward function was defined by the outcome of the active objective function after the agent reached a terminal state. This formulation does not fulfill the Markov assumption because the reward for reaching a terminal state depends on the whole trajectory. As a result, the MDP is partially observable for the agent. To reduce this problem, the current time step was used as an extra state dimension for the classical Q-learning agent, improving its performance.
The results show that the IGE and NSE outperformed classical Q-learning in terms of their adaptation to new objectives (Fig. 2 and 4). They were able to adapt immediately to a new objective after they had learned their set of policies during the initial phase. Q-learning had to learn the task for each objective from scratch, needing approximately 3000 episodes per objective.
[Figure 2: Learning curves (Fig. 4 in the appendix lists all results). The plots show the mean and standard deviation over 100 runs per algorithm.]

5 Conclusion
We introduced Time Adaptive MDPs. They confront RL agents with the problem of quickly adapting to changing objectives in terms of time restrictions. Two algorithms are proposed (IGE and NSE) which learn a behavioral library of policies that are optimal on different time scales. The agents can switch immediately between these policies to adapt to new and unseen objectives, allowing zero-shot adaptation. The NSE has the advantage over the IGE of learning the Pareto-optimal set of policies in terms of expected reward and time. Nonetheless, the NSE depends on discrete time steps, whereas the IGE can be used for continuous-time MDPs (doya2000reinforcement). Although we used tabular Q-learning for both algorithms to learn the values of their modules, the general scheme of the methods is independent of this choice. The algorithms can also be combined with actor-critic or policy search algorithms and different function approximators such as deep networks. This allows the methods to tackle various problem scenarios, which we plan to show in future research.
Acknowledgments
I want to thank Kenji Doya and Eiji Uchibe for their helpful supervision of this project. Moreover, I want to thank Pierre-Yves Oudeyer and Clément Moulin-Frier for their helpful comments.
References
Appendix A Algorithmic Details
A.1 The Independent γ-Ensemble (IGE)
The IGE (Alg. 1) learns a set of policies via independent modules, called γ-modules. Each module is comprised of a Q-function $Q_\gamma$ (Eq. 3) with a distinct discount factor $\gamma \in \Gamma$. The number of modules and their discount parameters are metaparameters of the algorithm. The Q-values of each module are learned by Q-learning, i.e. after an observation of $(s_t, a_t, r_t, s_{t+1})$ the values of all modules in $\Gamma$ are updated by:
$Q_\gamma(s_t, a_t) \leftarrow Q_\gamma(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q_\gamma(s_{t+1}, a) - Q_\gamma(s_t, a_t) \right)$   (8)
where $\alpha$ is the learning rate. Because of Q-learning's off-policy nature, the values of all modules can be updated from the same observations.
Moreover, each module learns for its policy $\pi_\gamma$ the expected total reward:

$R_\gamma(s, a) = \mathbb{E}\left[ \sum_{t=0}^{T} r_t \mid s_0 = s, a_0 = a, \pi_\gamma \right]$

and the expected number of steps until a terminal state is reached:

$T_\gamma(s, a) = \mathbb{E}\left[ T \mid s_0 = s, a_0 = a, \pi_\gamma \right]$
Both are learned via a Robbins-Monro approach for stochastic approximation (robbins1951stochastic), as defined in Eqs. 4 and 5.
The policy that the IGE follows is defined by one of its modules $\gamma^*$. It is selected at the beginning of an episode with the goal of maximizing the objective function according to Eq. 1. The module's policy is then used for the action selection during the whole episode. An ε-greedy approach is used for exploration.
The IGE has two restrictions. First, it is not guaranteed to learn the Pareto optimal set of policies. Second, it cannot handle episodic environments where the agent is able to collect positive rewards on circular trajectories that do not end in a terminal state.
The first restriction of the IGE is that it is not guaranteed to learn the Pareto optimal set with regard to the expected reward and time. The MDP used for the experiments shows such a case (Fig. 1, a). The optimal trajectory to one of the goals is part of the Pareto optimal set (Fig. 1, b), but it is not part of the IGE policy set (Fig. 1, c). As a result, the IGE cannot find the optimal policy for an objective that has this trajectory as its optimal solution. Nonetheless, optimality can be guaranteed for a subset of goal formulations. The IGE converges to the optimal policy for objectives that maximize the exponentially discounted reward sum $\mathbb{E}[\sum_t \gamma^t r_t]$. Depending on the discount factor $\gamma$, the Q-values of the corresponding module will converge to the optimal value function (watkins1992q; tsitsiklis1994asynchronous). This also includes the case of maximizing the expected total reward sum $\mathbb{E}[\sum_t r_t]$, because this is the same objective as for $\gamma = 1$. Most interestingly, it is possible to prove the convergence of the IGE to the optimal policy for the average reward in MDPs which are deterministic and where non-negative reward is only given when a terminal state is reached (reinke2017average). For other objectives, the IGE can be viewed as a heuristic that does not guarantee optimality, but that often produces good results with the ability to immediately adapt to a new objective.
[Figure 3: (a) Circular, positive reward MDP. (b) Q-values and expected rewards and number of steps for each state and action.]
A second restriction of the IGE exists in environments where cyclic trajectories maximize the discounted reward sum without ending in a terminal state. Fig. 3 illustrates such an MDP. The agent starts in state $s_0$ and can stay in this state, receiving a positive reward for every step; it can go 2 steps left to end in a terminal state and receive no reward; or it can go 2 steps right to finish in a terminal state, receiving a reward of 1. For this MDP, the optimal policy for each $\gamma$ is to choose the stay action and remain in $s_0$. Therefore, the IGE cannot learn a policy that reaches a terminal state.
A.2 The n-Step Ensemble (NSE)
We propose the n-Step Ensemble (NSE) (Algorithm 2) to overcome the restrictions of the IGE. It is able to learn the set of Pareto optimal policies, and to learn policies that reach terminal states in circular environments such as the MDP in Fig. 3.
Similar to the IGE, the NSE consists of several modules. Each module is responsible for optimizing the expected total reward for $n$ steps into the future (Eqs. 6 and 7), with $n \in \{1, \ldots, N\}$. To handle circular environments such as in Fig. 3, an extra condition is added: the agent should also reach a terminal state within $n$ steps. If this is not possible, the policy is learned which reaches a terminal state with a minimal number of steps. This is accomplished by defining the optimal action not simply as the action maximizing the Q-value of a state. Instead, it is defined using information about the total reward $R_n(s, a)$ and the number of steps $T_n(s, a)$ until a terminal state is reached under the module's policy.
The state values of $R_n$ and $T_n$ are defined via the optimal action: $R_n(s) = R_n(s, a^*_n(s))$ and $T_n(s) = T_n(s, a^*_n(s))$.
Based on $Q_n$, $R_n$, and $T_n$, the optimal action is defined by:
$A_1 = \{ a : T_n(s, a) \le n \}$, or $A_1 = \arg\min_a T_n(s, a)$ if this set is empty
$A_2 = \arg\max_{a \in A_1} Q_n(s, a)$
$A_3 = \arg\min_{a \in A_2} T_n(s, a)$
$a^*_n(s) = \arg\max_{a \in A_3} R_n(s, a)$   (9)
where $A_1$, $A_2$, and $A_3$ are sets of actions. This definition of the greedy action allows the NSE to handle cyclic positive-reward environments. Moreover, it allows the selection of the best action in situations where more or fewer steps are necessary to end an episode than the number of steps $n$ of a module.
To determine the greedy action in Eq. 9, the actions are first limited to the set $A_1$. It is comprised of all actions resulting in trajectories that need the same or a smaller number of steps than the current module optimizes for. If none of the actions leads to such a trajectory, the actions that minimize the number of steps are considered. This restriction guarantees that the NSE learns policies that end in a terminal state in cyclic environments. For example, for the modules in the MDP of Fig. 3, the greedy action according to the Q-value alone is to stay in the start state. Therefore, the agent would not learn a policy ending in a terminal state if the greedy action were selected according to the Q-values only. By restricting the possible optimal actions to the set $A_1$, only the actions going left and right are allowed, because they minimize the number of steps. As a result, the NSE learns policies that end in terminal states.
The next restriction is according to the Q-values. All actions from the set $A_1$ that maximize the Q-value are considered as optimal actions ($A_2$). This restriction selects the trajectories resulting in the highest reward within $n$ steps.
Although these actions maximize the reward within $n$ steps, there can be situations where the resulting trajectory requires more or fewer steps. In these situations, the Q-value does not inform about the total reward and needed time. For example, for module $n = 1$ the Q-values for going left and going right are both zero, because the Q-values of this module only look one step into the future. Nonetheless, the final trajectory for going left results in 2 steps and a reward of 0, whereas going right results in 2 steps and a reward of 1. In this situation it would be more desirable to go right to collect some reward. Therefore, if several actions maximize the Q-values, then from those actions the ones resulting in the minimal number of steps to a terminal state are considered, as defined by the set $A_3$. Among those, the final optimal action is one that has the maximum total reward $R_n$. If several actions fulfill this, one of them is chosen randomly.
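The staged filtering described above can be sketched directly. The per-action statistics below are hypothetical values mimicking module $n = 1$ in the cyclic MDP of Fig. 3:

```python
# Sketch of the NSE greedy-action rule (Eq. 9): filter actions in three
# stages - reachability within n steps (else minimal steps), maximal
# Q-value, minimal steps - then pick the maximal expected total reward.
# The per-action statistics are hypothetical.

def nse_greedy(actions, Q, T, R, n):
    """actions: list of ids; Q, T, R: dicts action -> value for module n."""
    a1 = [a for a in actions if T[a] <= n]
    if not a1:  # no action reaches a terminal state in time: minimize steps
        t_min = min(T[a] for a in actions)
        a1 = [a for a in actions if T[a] == t_min]
    q_max = max(Q[a] for a in a1)
    a2 = [a for a in a1 if Q[a] == q_max]
    t_min = min(T[a] for a in a2)
    a3 = [a for a in a2 if T[a] == t_min]
    return max(a3, key=lambda a: R[a])

# Module n=1 in the cyclic MDP: left and right both have Q = 0 one step
# ahead, but the full trajectories differ in total reward.
Q = {'left': 0.0, 'right': 0.0, 'stay': 0.1}
T = {'left': 2, 'right': 2, 'stay': float('inf')}
R = {'left': 0.0, 'right': 1.0, 'stay': 0.0}
print(nse_greedy(['left', 'right', 'stay'], Q, T, R, n=1))  # right
```

The cycle-inducing stay action is filtered out by the first stage, and the reward-based tie-break then prefers going right over going left.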
The final policy of the NSE is given by selecting, at the beginning of an episode, the module $n^*$ that optimizes the currently active objective according to Eq. 2. The optimal action of this module is used in the first step of the episode. Because each module is responsible for maximizing the total reward for a certain number of steps, for the next time step the optimal action of the module responsible for one step less ($n-1$) is used. This procedure is repeated until the first module is reached. The policy therefore depends on the current time step $t$ during an episode: $\pi(s_t) = a^*_{\max(n^* - t,\, 1)}(s_t)$.
During learning, the agent does not always choose the greedy action. Instead, an ε-greedy action selection is used to allow exploration.
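The countdown execution scheme can be sketched as follows. The environment, policies, and start budget are hypothetical placeholders; the essential part is that the module index decreases by one per step and never drops below the 1-step module:

```python
# Sketch of NSE execution: the selected module's step budget counts down
# during the episode, so at time t the module responsible for n* - t steps
# chooses the action. All names and the toy environment are assumptions.

def run_episode(policies, n_start, step_env, s0, max_steps=50):
    """policies: dict n -> (state -> action) for each module;
    step_env(s, a) -> (next_state, reward, done)."""
    s, n, total = s0, n_start, 0.0
    for _ in range(max_steps):
        n_eff = max(n, 1)               # never go below the 1-step module
        a = policies[n_eff](s)
        s, r, done = step_env(s, a)
        total += r
        n -= 1
        if done:
            break
    return total

# Toy chain: action +1 moves right, terminal at state 3 with reward 1.
def step_env(s, a):
    s2 = s + a
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

policies = {n: (lambda s: 1) for n in range(1, 4)}
print(run_episode(policies, n_start=3, step_env=step_env, s0=0))  # 1.0
```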
The Q-values $Q_n$, expected reward $R_n$, and number of steps $T_n$ of each module are learned in a similar way to Q-learning. After a transition is observed, all modules are updated in parallel according to a Robbins-Monro update.
A.3 Time-dependent Q-Learning
The IGE and NSE were compared to a classical Q-learning approach that learned an independent Q-function for each objective (Alg. 3). Its reward function was defined by the outcome of the active objective function after the agent reached a terminal state. For every other transition a reward of zero was given:

$\tilde{r}_t = o(R, T)$ if $s_{t+1}$ is terminal, and $\tilde{r}_t = 0$ otherwise

where $R$ is the collected total reward according to the reward function of the MDP and $T$ is the number of steps of the current episode.
A problem with this formulation of the reward is that it does not fulfill the Markov assumption, i.e. that the outcome of an action depends only on the current state. Instead, the reward for reaching a terminal state depends on the whole trajectory, taking into account the total reward sum $R$ and the number of steps $T$. As a result, the MDP is partially observable for the agent, because the state, i.e. the agent's position, does not inform about the collected return or how many steps were needed. Therefore, Q-learning is not guaranteed to converge to the optimal policy. To reduce this problem, the current time step was used as an extra state dimension for the Q-learning agent, improving its performance. Although the time step information improves the performance of the Q-learning agent, it does not fully resolve the partially observable nature of the problem, because the collected reward sum is still missing. Adding this information to the state as well would create a huge state space for which learning is impractical.
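The time-augmented state of the baseline can be sketched in a few lines. The names are ours; the point is simply that the tabular Q-function is indexed by (position, time step) pairs, so values can differ by time:

```python
# Sketch of the time-augmented state used by the baseline: the tabular
# Q-function is indexed by (position, t), which restores part of the
# Markov property for trajectory-dependent objectives. Names are
# illustrative assumptions.
from collections import defaultdict

Q = defaultdict(float)  # ((position, t), action) -> Q-value

def augmented_state(position, t):
    """Fold the episode's time step into the tabular state."""
    return (position, t)

# The same grid position at different time steps maps to distinct states,
# so a time-dependent objective can assign them different values:
Q[(augmented_state((2, 3), 1), 'up')] = 0.5
Q[(augmented_state((2, 3), 6), 'up')] = -0.2
print(Q[(((2, 3), 1), 'up')], Q[(((2, 3), 6), 'up')])  # 0.5 -0.2
```

The cost of this fix is the enlarged table, which is one reason the baseline needs far more exploration than the IGE and NSE.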
Appendix B Experiments
B.1 Experimental Procedure
The IGE, NSE, and the time-dependent Q-learning agent were evaluated in the stochastic MDP illustrated in Fig. 1. With a certain probability, the agent moved in a random direction instead of the intended one. Agents had to adapt to 9 different objectives, which are listed in Fig. 4. The first objective is to receive the maximum reward in an episode. The second also maximizes reward, but a punishment for each step is given after 3 steps. The third gives an exponentially increasing punishment for more than 3 steps. The goal of the fourth is to find the shortest path to the closest terminal state. For the fifth, the shortest path to a terminal state that gives at least a certain minimum reward is optimal; reaching a terminal state with less reward results in a strong punishment. For the sixth, the goal is to find the highest reward with a maximum of 7 steps. For the seventh, the agent has only a maximum of 5 steps. The eighth objective is to maximize the average reward. The final objective also maximizes the average reward, but the agent has to reach at least a certain minimum reward. The locations of the terminal states and their rewards were chosen such that they represent different solutions for the objective functions.
For each algorithm, 100 runs were performed to measure the average learning performance. Each run consisted of a fixed number of episodes divided into 9 phases. In each phase, the agents had to adapt to a different objective function. The objectives did not change during the episodes of a phase, to evaluate how long an agent needs to adapt to each objective. The performance was measured by the outcome of the objective function that the agent received for each episode during the learning process.
B.2 Learning Parameters
The learning parameters of all algorithms were manually optimized to yield a high asymptotic performance while keeping a high learning speed. The learning rate parameter $\alpha$ of all algorithms was set to a high value at the beginning of learning to allow a faster convergence of the values, and was reduced over the course of learning. The IGE and NSE kept the initial learning rate for 500 episodes and reduced it linearly until episode 1000. The learning rate then stayed at its final value for the rest of Phase 1 and for all following phases. The Q-learning approach needed a longer learning time to reach its asymptotic performance in each phase. Moreover, it needed to learn a new policy for each phase. Its learning rate was kept at the initial value for 750 episodes and linearly reduced until episode 3000 of each phase.
All algorithms used ε-greedy action selection. Similar to the learning rate, the exploration rate $\epsilon$ was high at the start of learning and then reduced. The IGE's and NSE's exploration rate was kept at its initial value for the first 500 episodes of Phase 1 and then reduced until episode 1000. It stayed at its final value for the rest of Phase 1 and all successive phases. Q-learning used its initial exploration rate for the first 750 episodes. Afterward, it was linearly reduced until episode 3000 of each phase.
The IGE used a set of 45 modules. The discount factors were chosen with a stronger concentration of higher factors, which allows different policies for longer trajectories to be learned. First, 14 modules were chosen according to a fixed schedule. Then 2 modules with equal spacing were added between each pair of the 14 factors, and between the last factor and 1. The NSE used 20 modules. The discount factor of Q-learning was set to a fixed value.
Appendix C Experimental Results
The results show that the IGE and NSE performed better than Q-learning in terms of adaptation to new objectives, asymptotic performance, and learning speed (Fig. 4). After the initial learning phase with the first objective, the IGE and NSE were able to adapt immediately to new objectives, whereas Q-learning needed to learn new policies for those phases.
Moreover, the IGE and NSE outperformed Q-learning in their asymptotic performance on 5 of the 9 objectives. All algorithms had a similar final performance for 3 objectives. Q-learning could slightly outperform the IGE on one objective, but not the NSE. Comparing the IGE and the NSE, the NSE had a slightly better final performance on one objective and a clearly stronger performance on another. The low asymptotic performance of Q-learning on one of the objectives was the result of the negative outcome values that this objective gives. Q-values were initialized to 0. Because all outcomes for this objective are negative, the agent performs an optimistic exploration (osband2017optimistic): it explores every possible state-action pair for every possible time step, because it has to learn that their initial Q-values of 0 are not optimal. As a result, Q-learning would need more episodes to learn a good policy for this objective.
Regarding learning speed, the IGE and NSE outperformed Q-learning, which is visible in the first phase. Q-learning needed at least 3000 episodes to reach its final asymptotic performance in each phase, whereas the IGE and NSE needed fewer than 1000 episodes to reach their asymptotic performance in the first phase. Q-learning needed longer due to the extra time step information in its state space; more exploration was necessary to learn the optimal policies.
[Figure 4: The 9 objective functions and the learning results for each phase.]