
Time Adaptive Reinforcement Learning

04/18/2020
by   Chris Reinke, et al.
Inria

Reinforcement learning (RL) makes it possible to solve complex tasks such as Go, often with stronger performance than humans. However, the learned behaviors are usually fixed to specific tasks and unable to adapt to different contexts. Here we consider the case of adapting RL agents to different time restrictions, such as finishing a task within a given time limit that might change from one task execution to the next. We define such problems as Time Adaptive Markov Decision Processes and introduce two model-free, value-based algorithms: the Independent Gamma-Ensemble and the n-Step Ensemble. In contrast to classical approaches, they allow zero-shot adaptation between different time restrictions. The proposed approaches are general mechanisms for handling time adaptive tasks, making them compatible with many existing RL methods, algorithms, and scenarios.


1 Introduction

Modern reinforcement learning (RL) algorithms are able to learn impressive skills, such as playing Chess or Go, often with a higher competency than humans (silver2017masteringchess; silver2017mastering). Nonetheless, the learned skills are highly specific. Given a different context, such as collecting as many opponent pieces as possible in the next 5 moves instead of winning the full game, the algorithms have to learn their behavior from scratch, which is time-consuming. In contrast, humans quickly adapt to such changes and still show a high performance. The field of transfer learning (taylor2009transfer; lazaric2012transfer) investigates how artificial agents can become more adaptive to such changes. One area where humans are very adaptive, and which is the focus of this paper, is the presence of changing time restrictions for a task. Take, for example, one of our daily routines: going out for lunch. Several restaurants might exist in our neighborhood. Each requires a different amount of time to reach and provides food of different quality. The general goal is to learn the shortest way to a good restaurant, but our specific objective might change from day to day. One day we have a lot of time for lunch and want to go to the best restaurant. Another day we are under stress and want to go to the best restaurant reachable within a fixed time limit.

Standard model-free RL algorithms such as Q-learning (watkins1989learning; watkins1992q) would learn such a task by defining a specific reward function for each objective. For example, to learn the way to the best restaurant, the reward function would simply return the food quality of the reached restaurant. Thus, an agent will learn the shortest way to the best restaurant. If the agent is under stress and has to trade off food quality against the invested time, the reward function could include a punishment (a negative reward) for each performed step until a restaurant is reached. Thus, the agent might not go to the best restaurant, but to one that is closer. With this approach, each objective represents a different task for which a new policy has to be learned. But learning takes time, and the objective might change from one episode to the next. Moreover, the number of possible objectives can be infinite. As a result, classical agents might not have time to learn an appropriate policy for each objective.

We formalize this new type of adaptation scenario in the form of Time Adaptive MDPs, which are extensions of standard MDPs (sutton1998reinforcement). Moreover, we propose two modular, model-free algorithms, the Independent γ-Ensemble (IGE) and the n-Step Ensemble (NSE), to solve such scenarios. Both learn several behaviors in parallel via their modules, each for a different time scale, resulting in a behavioral library. Given a change in the objective, the most appropriate behavior can be selected from the library without relearning. Both algorithms are inspired by neuroscientific findings about human decision-making which suggest that humans learn behaviors not only for a specific time scale, but for several time scales in parallel (tanaka2007serotonin).

2 Time Adaptive MDPs

We formalize time adaptive reinforcement learning tasks in the form of Time Adaptive MDPs (TA-MDPs):

TA-MDPs extend standard MDPs with a finite or infinite set of objective functions $O = \{o_1, o_2, \dots\}$ with $o : \mathbb{R} \times \mathbb{N} \to \mathbb{R}$. Each objective function $o(R, N)$ evaluates the agent's performance with regard to its total collected reward $R = \sum_t r_t$ and the number of time steps $N$ until it reaches a terminal state in a single episode. During each episode, one objective $o \in O$ is active, and the goal is to maximize the expectation over it while using a minimum number of steps:

$\pi_o^* = \arg\max_{\pi} \mathbb{E}\big[\, o(R, N) \mid \pi \,\big]$   (1)

where $\pi$ is the policy used for objective $o$. Objective functions are monotonically increasing with the reward $R$ and decreasing with the time $N$, i.e., getting more reward and using less time is better. The agent knows which objective function is active and its mathematical expression. The agent's goal is to learn a policy that optimizes this objective.

Our restaurant selection example can be formalized as a TA-MDP. The agent's location defines the state space and its actions are different movement directions. Restaurants are terminal states. The reward function indicates the food quality of a restaurant after reaching it. The two time restrictions are represented as objective functions. The first objective is to go to the restaurant with the highest food quality: $o_1(R, N) = R$. The second objective sets a strict time limit of five steps and gives a large punishment if more steps are needed: $o_2(R, N) = R$ if $N \le 5$, and $o_2(R, N) = R - c$ with a large punishment $c$ otherwise. Depending on the active objective, the optimal policy, i.e. where to eat, changes.
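As an illustration, such objective functions can be written as plain callables over the episode outcome $(R, N)$. The following is a minimal sketch in Python; the penalty constant and function names are illustrative placeholders, not values from the paper.

```python
# Hypothetical sketch: objectives as callables o(R, N) over the episode's
# total reward R and its number of steps N.

def o_best_restaurant(R, N):
    """Objective 1: only the food quality (total reward) matters."""
    return R

def o_time_limit(R, N, limit=5, penalty=100.0):
    """Objective 2: strict time limit of `limit` steps; a large punishment
    is subtracted if more steps were needed (penalty value is assumed)."""
    return R if N <= limit else R - penalty
```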

The major challenge of TA-MDPs is that the number of objective functions can be infinite and that each episode can have a different objective. Thus, an agent might experience a certain objective only once. As a result, it needs to immediately adapt to it.

3 Algorithms

We propose two algorithms to solve TA-MDPs. Both are modular and learn a set of policies $\Pi$. The policies represent optimal behaviors on different time scales. Given the active objective $o$ of the episode, one of the policies is selected at the start of the episode and used throughout it. The goal is to select the most appropriate policy from the set $\Pi$. To accomplish this, both algorithms learn for each of their policies $\pi$ the expected total return $R^\pi$ and the expected number of steps $N^\pi$ until a terminal state is reached. Based on these expectations, the policy that maximizes the active objective while minimizing the number of steps (Eq. 1) is selected at the start of the episode:

$\pi^* = \arg\max_{\pi \in \Pi} o\big(R^\pi, N^\pi\big)$, with ties broken by the smallest $N^\pi$   (2)

An important restriction of the method is that the selection of the policy depends on an approximation of the expected outcome for the objective (Eq. 1), obtained by using the expectations over the total return and the number of steps as input to the objective function:

$\mathbb{E}\big[\, o(R, N) \mid \pi \,\big] \approx o\big(R^\pi, N^\pi\big)$

This approximation is not correct for all types of objective functions, but it often provides a good heuristic to select an appropriate policy. The proposed algorithms are introduced in the next sections. They differ in the way they learn the policy set $\Pi$.
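The selection step in Eq. 2 amounts to a search over the behavioral library. Below is a minimal sketch, assuming each module exposes its learned estimates as attributes R (expected total reward) and N (expected number of steps); these names are illustrative.

```python
def select_policy(modules, objective):
    """Pick the module whose expected outcome maximizes the active objective;
    ties are broken in favor of the smaller expected number of steps."""
    return max(modules, key=lambda m: (objective(m.R, m.N), -m.N))
```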

3.1 The Independent γ-Ensemble (IGE)

The Independent γ-Ensemble (IGE) is composed of several modules (Alg. 1). The modules are independent Q-functions with different discount factors $\gamma \in \Gamma = \{\gamma_1, \dots, \gamma_M\}$ with $\gamma_1 < \dots < \gamma_M$, similar to the Horde architecture (modayil2014multi):

$Q_\gamma(s, a) = \mathbb{E}\Big[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s,\ a_0 = a,\ \pi_\gamma \Big]$   (3)

The factor $\gamma$ defines how strongly future reward is discounted. For low $\gamma$'s the discounting is strong and the optimal behavior is to maximize rewards that can be reached on a short time scale, whereas for high $\gamma$'s the discounting is weak, resulting in the maximization of rewards on longer time scales. As a result, each Q-function defines a different policy, and the IGE learns a set of policies via its modules: $\Pi = \{\pi_\gamma : \gamma \in \Gamma\}$. The values of each module are learned by Q-learning. Because Q-learning is off-policy, the values of all modules can be updated from the same observations.
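A minimal tabular sketch of these parallel off-policy updates: every observed transition updates the Q-table of every γ-module. The grid of discount factors, state and action counts, and the learning rate below are illustrative, not the paper's settings.

```python
import numpy as np

gammas = np.linspace(0.1, 0.99, 10)          # illustrative gamma grid
n_states, n_actions, alpha = 50, 4, 0.1      # illustrative sizes and learning rate
Q = {g: np.zeros((n_states, n_actions)) for g in gammas}

def update_all_modules(s, a, r, s_next, terminal):
    """One observed transition updates every gamma-module (off-policy)."""
    for g in gammas:
        target = r if terminal else r + g * Q[g][s_next].max()
        Q[g][s, a] += alpha * (target - Q[g][s, a])
```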

In addition to the values, each module learns the expected total return $R_\gamma$ and the expected number of steps $N_\gamma$ to reach a terminal state for its policy. The expectations are used to select the appropriate policy at the beginning of an episode (Eq. 2). Both expectations can be formulated incrementally, similar to the Q-function, and are also learned in a similar manner. After an observation $(s, a, r, s')$, the expectations of all modules which have $a$ as their greedy action ($a = \arg\max_{a'} Q_\gamma(s, a')$) are updated by:

$R_\gamma(s) \leftarrow R_\gamma(s) + \alpha \big( r + R_\gamma(s') - R_\gamma(s) \big)$   (4)
$N_\gamma(s) \leftarrow N_\gamma(s) + \alpha \big( 1 + N_\gamma(s') - N_\gamma(s) \big)$   (5)

with $R_\gamma(s') = N_\gamma(s') = 0$ if $s'$ is a terminal state, and where $\alpha$ is the learning rate parameter that is also used for the updates of the Q-values.
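A sketch of this bookkeeping, continuing the tabular setting of the previous snippet; the zero bootstrap at terminal states is an assumption consistent with the reconstruction of Eqs. 4 and 5.

```python
R_exp = {g: np.zeros(n_states) for g in gammas}  # expected total reward per module
N_exp = {g: np.zeros(n_states) for g in gammas}  # expected steps until termination

def update_expectations(s, a, r, s_next, terminal):
    """Update R and N only for the modules whose greedy action in s was taken."""
    for g in gammas:
        if a != Q[g][s].argmax():
            continue
        r_boot = 0.0 if terminal else R_exp[g][s_next]
        n_boot = 0.0 if terminal else N_exp[g][s_next]
        R_exp[g][s] += alpha * (r + r_boot - R_exp[g][s])
        N_exp[g][s] += alpha * (1.0 + n_boot - N_exp[g][s])
```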

The IGE has the restriction that it is not guaranteed to learn the Pareto-optimal set of policies with regard to the expected total reward $R^\pi$ and number of steps $N^\pi$. The MDP in Fig. 1 shows such a case.

(a) Example MDP (b) Choices (c) γ-Curve
Figure 1: (a) 2D grid-world with 7 terminal states and one start state. The agent has 4 actions (move north, east, south, west) and moves in a random direction with a certain probability. Transitions result in negative reward until a terminal state is reached. (b) If the agent starts in the start state, it has several choices, each represented by the expected reward and number of steps of the optimal policy to reach the respective terminal state. All choices except one are part of the Pareto-optimal choice set. (c) Discounted values for each choice (broken lines). The solid line represents the values the IGE learns. The IGE does not learn policies to reach two of the terminal states.

3.2 The n-Step Ensemble (NSE)

We propose a second algorithm (Alg. 2), the n-Step Ensemble (NSE), to overcome the restriction of the IGE. It is able to learn the set of Pareto-optimal policies. Similar to the IGE, the NSE also consists of several modules. Each module is responsible for optimizing the expected total reward for $n$ steps into the future, with $n \in \{1, \dots, M\}$. Each module learns a value function $Q_n$ representing the optimal total reward that can be reached in $n$ steps:

$Q_n(s, a) = \max_{\pi} \mathbb{E}\Big[\, \sum_{t=0}^{n-1} r_{t+1} \,\Big|\, s_0 = s,\ a_0 = a,\ \pi \Big]$   (6)

One extra condition is that a terminal state should be reached within $n$ steps. If this is not possible, then the policy is learned that reaches a terminal state with a minimal number of steps. Similar to the standard Q-function, the values can be defined incrementally by the sum of the immediate expected reward and the Q-value for the optimal action in the next state $s'$. In contrast, the Q-value for the next state is taken from the module responsible for optimizing the total reward for $n-1$ steps (harada1997time):

$Q_n(s, a) = \mathbb{E}\big[\, r + \max_{a'} Q_{n-1}(s', a') \,\big]$, with $Q_0(\cdot, \cdot) = 0$   (7)

The learning of the Q-values $Q_n$, the expected reward $R_n$, and the number of steps $N_n$ for each module is done by Q-learning, similar to the IGE. After a transition $(s, a, r, s')$ is observed, all modules are updated in parallel.
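A tabular sketch of these parallel updates as implied by the recursion in Eq. 7; the array shapes and learning rate are illustrative, and module 0 acts as an all-zero bootstrap.

```python
import numpy as np

n_modules, n_states, n_actions, alpha = 20, 50, 4, 0.1  # illustrative values
# Q_nse[n] estimates the optimal total reward reachable in n steps; Q_nse[0] stays zero.
Q_nse = np.zeros((n_modules + 1, n_states, n_actions))

def update_nse(s, a, r, s_next, terminal):
    """One observed transition updates every n-step module in parallel."""
    for n in range(1, n_modules + 1):
        bootstrap = 0.0 if terminal else Q_nse[n - 1, s_next].max()
        Q_nse[n, s, a] += alpha * (r + bootstrap - Q_nse[n, s, a])
```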

4 Experimental evaluation

The IGE and NSE were compared to classical Q-learning in a stochastic grid-world environment (Fig. 1). It consists of 7 terminal states. For each terminal state there exists an optimal path from the start state along which the agent receives a per-step punishment (a different punishment applies otherwise). Reaching a terminal state results in a positive reward, where more distant goals give a higher reward. Agents had to adapt to 9 different objectives (see Fig. 4 for their formulation). For each objective, a fixed number of episodes was performed to evaluate how long the agents need to adapt before the task switched to the next objective.

The classical Q-learning algorithm learned an independent Q-function for each objective. Its reward function was defined by the outcome of the active objective function after the agent reached a terminal state. This formulation does not fulfill the Markov assumption, because the reward for reaching a terminal state depends on the whole trajectory. As a result, the MDP is partially observable for the agent. To mitigate this problem, the current time step was added as an extra state dimension for the classical Q-learning agent, improving its performance.

The results show that the IGE and NSE outperformed classical Q-learning in terms of their adaptation to new objectives (Fig. 2 and 4). They were able to adapt immediately to a new objective after they had learned their set of policies during the initial phase. Q-learning had to learn the task for each objective from scratch, needing approximately 3000 episodes per objective.

Figure 2: Results for the outcome of the objective function per episode show that the NSE and IGE adapt immediately to new objectives after the initial learning phase. 3 of the 9 phases are shown, where each phase has a different objective function (Fig. 4 in the appendix lists all results). The plots show the mean and standard deviation over 100 runs per algorithm.

5 Conclusion

We introduced Time Adaptive MDPs. They confront RL agents with the problem of quickly adapting to changing objectives in terms of time restrictions. Two algorithms are proposed (IGE and NSE) which learn a behavioral library of policies that are optimal on different time scales. The agents can switch immediately between these policies to adapt to new and unseen objectives, allowing zero-shot adaptation. The NSE has the advantage over the IGE of learning the Pareto-optimal set of policies in terms of expected reward and time. Nonetheless, the NSE depends on discrete time steps, whereas the IGE can also be used for continuous-time MDPs (doya2000reinforcement). Although we used tabular Q-learning for both algorithms to learn the values of their modules, the general scheme of the methods is independent of this choice. The algorithms can also be combined with actor-critic or policy search algorithms and different function approximators such as deep networks. This allows the methods to tackle various problem scenarios, which we plan to show in future research.

Acknowledgments

I want to thank Kenji Doya and Eiji Uchibe for their helpful supervision of this project. Moreover, I want to thank Pierre-Yves Oudeyer and Clément Moulin-Frier for their helpful comments.

References

Appendix A Algorithmic Details

A.1 The Independent γ-Ensemble (IGE)

Input:
  Discount factors: Γ = {γ_1, ..., γ_M} with γ_1 < ... < γ_M
  Learning rate: α
  Exploration rate: ε
initialize Q_γ, R_γ, and N_γ to zero for all γ ∈ Γ
repeat (for each episode)
       initialize state s and objective o
       // select as active module the module γ* that maximizes o(R_γ(s), N_γ(s))
       repeat (for each step in episode)
             // choose action ε-greedy with respect to Q_γ*
             a ← ε-greedy(Q_γ*(s, ·)), take action a, observe outcome (r, s')
             forall γ ∈ Γ do
                   Q_γ(s, a) ← Q_γ(s, a) + α ( r + γ max_a' Q_γ(s', a') − Q_γ(s, a) )
                   // update R_γ and N_γ only if the greedy action was used
                   if a = argmax_a' Q_γ(s, a') then
                         R_γ(s) ← R_γ(s) + α ( r + R_γ(s') − R_γ(s) )
                         N_γ(s) ← N_γ(s) + α ( 1 + N_γ(s') − N_γ(s) )
                   end if
             end forall
             s ← s'
       until s is terminal-state
until termination
Algorithm 1 Independent γ-Ensemble (IGE)

The IGE (Alg. 1) learns a set of policies via independent modules called γ-modules. Each module is comprised of a Q-function $Q_\gamma$ (Eq. 3) with a distinct discount factor $\gamma$. The number of modules and their discount parameters are meta-parameters of the algorithm. The Q-values of each module are learned by Q-learning, i.e., after an observation of $(s, a, r, s')$ the values of all modules in $\Gamma$ are updated by:

$Q_\gamma(s, a) \leftarrow Q_\gamma(s, a) + \alpha \big( r + \gamma \max_{a'} Q_\gamma(s', a') - Q_\gamma(s, a) \big)$   (8)

where $\alpha$ is the learning rate. Because of Q-learning's off-policy nature, the values of all modules can be updated from the same observations.

Moreover, each module learns for its policy the expected total reward:

$R_\gamma(s) = \mathbb{E}\Big[\, \sum_{t} r_t \,\Big|\, s,\ \pi_\gamma \Big]$

and the expected number of steps until a terminal state is reached:

$N_\gamma(s) = \mathbb{E}\big[\, N \mid s,\ \pi_\gamma \big]$

Both are learned via a Robbins-Monro approach for stochastic approximation (robbins1951stochastic) as defined in Eqs. 4 and 5.

The policy that the IGE follows is defined by one of its modules $\gamma^*$. It is selected at the beginning of an episode with the goal of maximizing the objective function according to Eq. 1. The module's policy is then used for the action selection during the whole episode. An ε-greedy approach is used for exploration.

The IGE has two restrictions. First, it is not guaranteed to learn the Pareto-optimal set of policies. Second, it cannot handle episodic environments where the agent is able to collect positive rewards by circular trajectories that do not end in a terminal state.

The first restriction of the IGE is that it is not guaranteed to learn the Pareto-optimal set regarding the expected reward and time. The MDP used for the experiments shows such a case (Fig. 1, a). The optimal trajectory to one of the goals is part of the Pareto-optimal set (Fig. 1, b), but it is not part of the IGE policy set (Fig. 1, c). As a result, the IGE cannot find the optimal policy for an objective that has this goal as its optimal solution. Nonetheless, optimality can be guaranteed for a subset of goal formulations. The IGE converges to the optimal policy for objectives that maximize the exponentially discounted reward sum $\mathbb{E}[\sum_t \gamma^t r_t]$. For the corresponding discount factor $\gamma \in \Gamma$, the Q-values of the respective module converge to the optimal value function (watkins1992q; tsitsiklis1994asynchronous). This also includes the case of maximizing the expected total reward sum $\mathbb{E}[\sum_t r_t]$, because this is the same objective as for $\gamma = 1$. Most interestingly, it is possible to prove the convergence of the IGE to the optimal policy for the average reward $R/N$ in MDPs which are deterministic and where non-negative reward is only given when a terminal state is reached (reinke2017average). For other objectives, the IGE can be viewed as a heuristic that does not guarantee optimality, but that often produces good results with the ability to immediately adapt to a new objective.

(a) Circular, positive-reward MDP. (b) Q-values and expected rewards and numbers of steps for each state and action.
Figure 3: MDP example for which the IGE is not able to learn policies that reach the terminal states. The agent starts in the middle state. It can either go left, go right, or stay. Going left results in a final reward of 0, whereas going right results in a reward of 1. Both ways need 2 steps to reach their terminal state. For certain numbers of steps, the action to stay has the maximum Q-value. The NSE uses the information from $R_n$ and $N_n$ to identify going right as the greedy action.

A second restriction of the IGE exists in environments where cyclic trajectories maximize the discounted reward sum and do not end in a terminal state. Fig. 3 illustrates such an MDP. The agent starts in the middle state and can stay in this state, receiving a positive reward for every step; it can go 2 steps to the left to end in a terminal state and receive no reward; or it can go 2 steps to the right to finish in a terminal state and receive a reward of 1. For this MDP, the optimal policy for each $\gamma$ is to choose the stay action and remain in the start state. Therefore, the IGE cannot learn a policy to reach a terminal state.

A.2 The n-Step Ensemble (NSE)

We propose the n-Step Ensemble (NSE) (Algorithm 2) to overcome the restrictions of the IGE. It is able to learn the set of Pareto optimal policies, and to learn policies that reach terminal states in circular environments such as the MDP in Fig. 3.

Similar to the IGE, the NSE consists of several modules. Each module $n$ is responsible for optimizing the expected total reward for $n$ steps into the future (Eqs. 6 and 7), with $n \in \{1, \dots, M\}$. To handle circular environments such as the one in Fig. 3, an extra condition is added: the agent should also reach a terminal state within $n$ steps. If this is not possible, then the policy is learned that reaches a terminal state with a minimal number of steps. This is accomplished by not defining the optimal action simply as the action maximizing the Q-value of a state. Instead, it is defined using information about the total reward $R_n$ and the number of steps $N_n$ until a terminal state is reached:

The state values of $R_n$ and $N_n$ are defined as the values of the module's greedy action: $R_n(s) = R_n(s, a^*_n(s))$ and $N_n(s) = N_n(s, a^*_n(s))$.

Based on $Q_n$, $R_n$, and $N_n$, the optimal action is defined by:

$A_1 = \{\, a : N_n(s, a) \le n \,\}$ if this set is non-empty, otherwise $A_1 = \arg\min_{a} N_n(s, a)$;
$A_2 = \arg\max_{a \in A_1} Q_n(s, a)$;
$A_3 = \arg\min_{a \in A_2} N_n(s, a)$;
$a^*_n(s) \in \arg\max_{a \in A_3} R_n(s, a)$   (9)

where $A_1$, $A_2$, and $A_3$ are sets of actions. This definition of the greedy action allows the NSE to handle cyclic positive-reward environments. Moreover, it allows the selection of the best action in situations where more or fewer steps are necessary to end an episode than the number of steps the module is responsible for.

To determine the greedy action in Eq. 9, the actions are first limited to the set $A_1$. It comprises all actions resulting in trajectories that need the same or a smaller number of steps than the current module should optimize for. If no action leads to such a trajectory, then the actions that minimize the number of steps are considered. This restriction guarantees that the NSE learns policies that end in a terminal state in cyclic environments. For example, for a module in the MDP of Fig. 3, the greedy action according to the Q-value alone is to stay in the start state. Therefore, the agent would not learn a policy ending in a terminal state if the greedy action were selected according to the Q-values only. By restricting the possible optimal actions to the set $A_1$, only going left and going right are allowed, because they minimize the number of steps. As a result, the NSE learns policies that end in terminal states for its modules.

The next restriction is according to the Q-values: all actions from the set $A_1$ that maximize the Q-value are considered as optimal actions, forming the set $A_2$. This restriction selects the trajectories resulting in the highest reward within $n$ steps.

Although the actions in $A_2$ maximize the reward within $n$ steps, there can be situations where the resulting trajectory requires more or fewer steps. In these situations, the Q-value does not inform about the total reward and the needed time. For example, for module $n = 1$ the Q-values for going left and going right are both zero, because the Q-values of this module only look one step into the future. Nonetheless, the final trajectory for going left results in 2 steps and a reward of 0, whereas going right results in 2 steps and a reward of 1. In this situation, it would be more desirable to go right to collect some reward. Therefore, if several actions maximize the Q-values, then from those actions the ones resulting in the minimal number of steps to a terminal state are considered, as defined by the set $A_3$. Among those, the final optimal action is one that has the maximum total reward $R_n$. If several actions fulfill this, one of them is chosen randomly.
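A sketch of this greedy-action rule (Eq. 9), assuming the Q-values, expected rewards, and expected step counts for the current state and module are available as 1-D arrays over actions; names and array layout are illustrative.

```python
import numpy as np

def nse_greedy_action(Q_n, R_n, N_n, n):
    """Apply the restrictions A1 -> A2 -> A3 and pick a max-reward action."""
    actions = np.arange(len(Q_n))
    # A1: actions whose trajectories terminate within n steps,
    # otherwise the actions that minimize the number of steps.
    within = actions[N_n <= n]
    A1 = within if len(within) > 0 else actions[N_n == N_n.min()]
    # A2: among A1, actions maximizing the Q-value of module n.
    A2 = A1[Q_n[A1] == Q_n[A1].max()]
    # A3: among A2, actions minimizing the expected number of steps.
    A3 = A2[N_n[A2] == N_n[A2].min()]
    # Final: among A3, pick an action with maximal expected total reward.
    best = A3[R_n[A3] == R_n[A3].max()]
    return int(np.random.choice(best))
```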

Input:
  Number of modules: M
  Learning rate: α
  Exploration rate: ε
initialize Q_n, R_n, and N_n to zero for all n ∈ {1, ..., M}
repeat (for each episode)
       initialize state s and objective o
       // select as active module the module n* that maximizes o(R_n(s), N_n(s))
       repeat (for each step in episode)
             // choose action ε-greedy with respect to module n*
             a ← ε-greedy(a*_n*(s)), take action a, observe outcome (r, s')
             forall n ∈ {1, ..., M} do
                   update Q_n(s, a), R_n(s, a), and N_n(s, a) (Eq. 7, Robbins-Monro update)
             end forall
             // use appropriate module for the next step
             n* ← max(n* − 1, 1)
             s ← s'
       until s is terminal-state
until termination
Algorithm 2 n-Step Ensemble (NSE)

The final policy of the NSE is given by selecting, at the beginning of an episode, the module $n^*$ that optimizes the currently active objective according to Eq. 2. The optimal action of this module is used in the first step of the episode. Because each module is responsible for maximizing the total reward for a certain number of steps, the optimal action of the module responsible for one step less ($n^* - 1$) is used for the next time step. This procedure is repeated until the first module ($n = 1$) is reached. The policy therefore depends on the current time step $t$ of the episode:

$\pi_{n^*}(s, t) = a^*_{\max(n^* - t,\, 1)}(s)$

During learning, the agent does not always choose the greedy action. Instead, an ε-greedy action selection is used to allow exploration.
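A sketch of this execution scheme: pick the module maximizing the active objective, then step down one module per time step. The environment and module interfaces (reset/step, greedy_action, R, N) are assumed names for illustration.

```python
def run_nse_episode(env, modules, objective):
    """Execute the NSE policy for one episode (greedy, no exploration).
    `modules[k]` is assumed to expose greedy_action(s), R(s), and N(s)."""
    s = env.reset()
    # select the module whose expected outcome maximizes the active objective
    n = max(range(len(modules)),
            key=lambda k: objective(modules[k].R(s), modules[k].N(s)))
    done = False
    while not done:
        a = modules[n].greedy_action(s)
        s, r, done = env.step(a)   # assumed to return (state, reward, done)
        n = max(n - 1, 0)          # module responsible for one step less
```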

The learning of the Q-values $Q_n$, the expected reward $R_n$, and the number of steps $N_n$ for each module is done in a way similar to Q-learning. After a transition $(s, a, r, s')$ is observed, all modules are updated in parallel according to a Robbins-Monro update.

A.3 Time-dependent Q-Learning

The IGE and NSE were compared to a classical Q-learning approach that learned an independent Q-function for each objective (Alg. 3). Its reward function was defined by the outcome of the active objective function after the agent reached a terminal state. For every other transition, a reward of zero was given:

$\tilde r_t = o(R, N)$ if $s_{t+1}$ is a terminal state, and $\tilde r_t = 0$ otherwise,

where $R$ is the collected total reward according to the reward function of the MDP and $N$ is the number of steps of the current episode.

Input:
  Learning rate: α
  Discount factor: γ
initialize Q to zero
repeat (for each episode)
       initialize state s, time step t ← 0, and objective o
       // start with an empty reward history
       R ← 0
       repeat (for each step in episode)
             choose an action a for (s, t) derived from Q (e.g. ε-greedy)
             // take action and save reward in history
             take action a, observe outcome (r, s'), R ← R + r
             // if a terminal state is reached, use the outcome for the
             // objective as the basis for the Q-function
             if s' is terminal-state then
                   r̃ ← o(R, t + 1)
             else
                   r̃ ← 0
             end if
             Q(s, t, a) ← Q(s, t, a) + α ( r̃ + γ max_a' Q(s', t+1, a') − Q(s, t, a) )
             s ← s', t ← t + 1
       until s is terminal-state
until termination
Algorithm 3 Time-dependent Q-learning for general TA-MDPs

A problem with this formulation of the reward is that it does not fulfill the Markov assumption, i.e. that the outcome of an action depends only on the current state. Instead, the reward for reaching a terminal state depends on the whole trajectory, taking into account the total reward sum $R$ and the number of steps $N$. As a result, the MDP is partially observable for the agent, because the state, i.e. the agent's position, does not inform about the collected return or how many steps were needed. Therefore, Q-learning is not guaranteed to converge to the optimal policy. To reduce this problem, the current time step was used as an extra state dimension for the Q-learning agent, improving its performance. Although the time-step information improves the performance of the Q-learning agent, it does not fully resolve the partially observable nature of the problem, because the collected reward sum is still missing. Adding this information to the state as well would create a huge state space for which learning is impractical.
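An illustrative sketch of this time-dependent Q-learning baseline, with the time step as an extra state dimension and the objective outcome as the only non-zero reward; the environment API, table layout, and hyperparameters are assumptions, not the paper's exact setup.

```python
import numpy as np
from collections import defaultdict

n_actions = 4
Q_td = defaultdict(lambda: np.zeros(n_actions))  # keyed by (state, time step)

def run_episode(env, objective, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One learning episode of the time-dependent Q-learning baseline."""
    s, t, R, done = env.reset(), 0, 0.0, False
    while not done:
        q = Q_td[(s, t)]
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(q.argmax())
        s_next, r, done = env.step(a)   # assumed to return (state, reward, done)
        R += r
        # objective outcome only on termination, zero reward otherwise
        shaped = objective(R, t + 1) if done else 0.0
        target = shaped + (0.0 if done else gamma * Q_td[(s_next, t + 1)].max())
        q[a] += alpha * (target - q[a])
        s, t = s_next, t + 1
```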

Appendix B Experiments

B.1 Experimental Procedure

The IGE, NSE, and the time-dependent Q-learning agent were evaluated in the stochastic MDP illustrated in Fig. 1. With a certain probability, the agent moved in a random direction instead of the intended one. Agents had to adapt to 9 different objectives, which are listed in Fig. 4. The first objective, $o_1$, is to receive the maximum reward in an episode. The second, $o_2$, also maximizes reward, but a punishment is given for each step after the third. $o_3$ gives an exponentially increasing punishment for more than 3 steps. The goal of $o_4$ is to find the shortest path to the closest terminal state. For $o_5$, the shortest path to a terminal state that gives at least a certain minimum reward is optimal; reaching a terminal state with less reward results in a strong punishment. For $o_6$, the goal is to find the highest reward within a maximum of 7 steps. For $o_7$, the agent has a maximum of only 5 steps. In $o_8$, the goal is to maximize the average reward. The final objective, $o_9$, maximizes the average reward, but the agent has to reach at least a certain minimum reward. The locations of the terminal states and their rewards were chosen so that they represent different solutions for the objective functions.

For each algorithm, 100 runs were performed to measure its average learning performance. Each run consisted of a sequence of episodes divided into 9 phases. In each phase, the agents had to adapt to a different objective function. The objectives did not change during the episodes of a phase in order to evaluate how long an agent needs to adapt to each objective. The performance was measured by the outcome of the objective function that the agent received for each episode during the learning process.

B.2 Learning Parameters

The learning parameters of all algorithms were manually optimized to yield a high asymptotic performance while learning quickly. The learning rate of all algorithms was set to a high value at the beginning of learning to allow a faster convergence of the values and was reduced over the course of learning. The IGE and NSE kept the initial learning rate for 500 episodes and reduced it linearly until episode 1000. The learning rate then stayed at its final value for the rest of Phase 1 and for all following phases. The Q-learning approach needed a longer learning time to reach its asymptotic performance in each phase. Moreover, it needed to learn a new policy for each phase. Its learning rate was kept at the initial value for 750 episodes and linearly reduced until episode 3000 of each phase.

All algorithms used ε-greedy action selection. Similar to the learning rate, the exploration rate was high at the start of learning and was then reduced. The exploration rate of the IGE and NSE was kept at its initial value for the first 500 episodes of Phase 1 and then reduced until episode 1000. It stayed at its final value for the rest of Phase 1 and all successive phases. Q-learning used its initial exploration rate for the first 750 episodes; afterward, it was linearly reduced until episode 3000 of each phase.

The IGE used a set of 45 γ-modules. The discount factors were chosen to be more densely concentrated towards higher values, which allows learning different policies for longer trajectories. First, 14 discount factors were chosen according to a fixed schedule. Then 2 equally spaced factors were added between each pair of the 14 factors and between the last factor and 1. The NSE used 20 modules; Q-learning used a single fixed discount factor.

Appendix C Experimental Results

The results show that the IGE and NSE performed better than Q-learning in terms of adaptation to new objectives, asymptotic performance, and learning speed (Fig. 4). After the initial learning phase with objective $o_1$, the IGE and NSE were able to adapt immediately to new objectives, whereas Q-learning needed to learn new policies for those phases.

Moreover, the IGE and NSE outperformed Q-learning in asymptotic performance on 5 of the 9 objectives. All algorithms had a similar final performance for 3 objectives. Q-learning could slightly outperform the IGE for one objective, but not the NSE. Comparing the IGE and the NSE, the NSE had a slightly better final performance for one objective and a clearly stronger performance for another. The low performance of Q-learning for that objective was the result of the negative outcome values that it gives. Q-values were initialized to 0. Because all outcomes for this objective are negative, the agent performs an optimistic exploration (osband2017optimistic): it explores every possible state-action pair for every possible time step, because it has to learn that the initial Q-values of 0 are not optimal. As a result, Q-learning would need more episodes to learn a good policy for this objective.

In terms of learning speed, the IGE and NSE outperformed Q-learning, which is visible in the first phase. Q-learning needed at least 3000 episodes to reach its final asymptotic performance in each phase, whereas the IGE and NSE needed fewer than 1000 episodes to reach their asymptotic performance in the first phase. Q-learning needed longer due to the extra time-step information in its state space, which made more exploration necessary to learn the optimal policies.

Figure 4: The IGE and NSE immediately adapted to new objective functions compared to the time-dependent Q-learning approach in the stochastic task of Fig. 1. Performance was measured by the outcome of the objective function per episode, where $R$ is the reward sum of the agent's trajectory and $N$ is its length. Each of the 9 phases has a different objective function. The plots show the mean and standard deviation over 100 runs per algorithm. The minimal shown outcome per episode was clipped to make the plots more readable, because some goal formulations can result in large negative rewards during exploration.