Reinforcement learning based methods have recently shown great success in many domains, including Atari games [Volodymyr Mnih2013], Go [David Silver2017], and autonomous vehicles [Shashua2016, Ahmad El Sallab2017]. However, the bias-variance trade-off in n-step TD algorithms, such as n-step SARSA, n-step Expected SARSA and n-step Tree Backup, has received little attention in recent research. This trade-off is inherent to the stochastic nature of Markov Decision Processes (MDPs) and does not depend on the specific reinforcement learning algorithm. It states that a small n leads to a large bias, whereas a large n results in a high variance of the n-step bootstrap update of the Q-function [Richard Sutton2017, Kristopher De Asis2018]. The variance is an inherent property of the n-step update, which contains a sum of n individual rewards: the variance of a sum of random variables grows with the number of summands. Conversely, the bias, which comes from the possibly biased current estimate of the Q-function in the target state, is a decreasing function of n. This can be seen by considering the extreme case of the maximal possible n, which is equivalent to Monte Carlo updates; these updates are completely unbiased since they do not involve any estimates of the Q-function.
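The n-step target described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `n_step_return`, its arguments and the discount factor `gamma` are assumptions introduced here. The reward sum is the part that contributes variance (n random summands), while the bootstrap term is the only place where a possibly biased estimate enters.

```python
def n_step_return(rewards, bootstrap_value, gamma=1.0):
    """Return the n-step target: discounted sum of n rewards plus a bootstrap term.

    rewards         -- the n rewards observed after the updated state (hypothetical input)
    bootstrap_value -- current Q estimate at the state reached after n steps;
                       pass 0.0 if that state is terminal (the Monte Carlo case)
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        # Each additional reward is one more random summand in the target.
        target += (gamma ** k) * reward
    # The bootstrap term is where the bias of the current estimate enters.
    target += (gamma ** len(rewards)) * bootstrap_value
    return target
```

With `bootstrap_value=0.0` and all rewards up to the end of the episode, the function reduces to the Monte Carlo return.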
Currently there is very little guidance for choosing the best value of n in the reinforcement learning literature. We address this challenge by taking inspiration from the way humans teach each other. One of the ways in which a person can assist another in learning is by indicating which situations are more critical and therefore require more attention. If, for example, a student driver approaches an obstacle in the road, her teacher may tell her to watch out, without suggesting exactly which action to take (e.g. slowing down or turning the wheel right or left). If the car later hits that obstacle, the student will understand that she probably took a wrong action back when the teacher warned her.
Translated into the formal language of reinforcement learning, this observation motivates the concept of criticality. We can think of the criticality of a state as a measure of how much the choice of action in that state influences the return. The criticality of a given state could be provided either by a rule of thumb or explicitly by a human trainer. Technically speaking, we can think of the criticality of a state as being proportional to the variance of the optimal Q-function with respect to the possible action choices in that state.
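To make the last statement concrete, the following sketch computes a criticality value from a list of Q-values for one state. It is only an illustration under stated assumptions: the name `criticality_from_q`, the proportionality constant `scale` and the uniform weighting of actions are not taken from the paper, and in practice the optimal Q-function is unknown, which is why the paper relies on a human trainer or a rule of thumb instead.

```python
from statistics import pvariance


def criticality_from_q(q_values, scale=1.0):
    """Criticality of one state as scale * Var_a[Q(s, a)].

    q_values -- Q(s, a) for every action a available in the state (hypothetical input)
    scale    -- assumed proportionality constant
    """
    if len(q_values) < 2:
        # With a single available action the agent has no choice to make.
        return 0.0
    # Population variance over the actions, all actions weighted equally.
    return scale * pvariance(q_values)
```

A state whose actions all lead to the same return gets a criticality of zero, while a state in which the actions differ strongly gets a high value.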
2 Related Work
Multi-step TD methods, such as n-step SARSA and SARSA(λ), create a spectrum of algorithms with one-step TD learning at one end and Monte Carlo methods at the other. All of these algorithms use the n-step return, which is subject to the bias-variance trade-off. Various approaches have been developed to tackle this challenge.
De Asis [Kristofer De Asis2018] addresses this problem for off-policy n-step TD methods, such as n-step Expected SARSA, by introducing so-called control variates. These terms act as an expectation correction and can therefore be used to decrease the bias of the n-step return.
Jiang et al. [Nan Jiang2015] propose an alternative solution to this problem for the prediction task (not the optimal control task). They introduce an unbiased estimator which corrects the current estimate of the value function. This estimator is robust in the sense that it remains unbiased even when the function class for the value function is inappropriate.
Sutton et al. [Richard Sutton2015] suggest an improvement of TD(λ) which achieves an effective bias reduction for the updates. This effect is a consequence of specific weights assigned to each update of the value function. The proposed variant of TD(λ) is particularly useful for off-policy learning, where ordinary TD(λ) suffers from instability.
Unlike all of the above-mentioned approaches, our method does not manipulate the updates of the (action-)value function a posteriori; instead, it chooses the appropriate stepnumber for the update a priori. This is done using the criticality function, which is closely related to the variance of the update. Our technique therefore speeds up learning by controlling the variance of the updates.
3 The relation between criticality and the stepnumber
All prominent n-step RL algorithms, such as n-step SARSA, n-step Expected SARSA and n-step Tree Backup, use a fixed stepnumber n for bootstrapping, which stays constant both in the course of an episode and during the complete learning process. In our approach we use a varying stepnumber which is specific to each state encountered during an episode. We believe that the concept of the criticality of a state allows the determination of the optimal n for a given state.
The intuition behind this idea is very straightforward. To develop it we present a simple example. Let us assume that in our environment most of the states have only one available action, and that there is no randomness in the MDP, that is, a given state action pair determines the next state. Let us assume, that during the learning process the agent encounters some sequence of states of which only has multiple actions available. In this situation, obviously should be assigned a criticality of (since the agent has no choice, and therefore its “choice” has no influence on the final reward) whereas for simplicity we will assign to a criticality of . Clearly, whenever the agent arrives at , the next states it visits will always be (). We would like to determine which of the states ( or ) should be used as the update target for . Consider the simple 1-step SARSA. This algorithm will update towards and in the next step towards . These updates will be repeated in each episode where these states are being visited so it is easy to see that asymptotically will be updated towards . Therefore there is no benefit from selecting as the update target for versus selecting , and the selection of may speed up the convergence. Using the same argument we can conclude that is a better update target than .
The presented example may lead to the conclusion that the update target for a given state should be the next future state which has a criticality of 1. However, how will this idea work if none of the states has a criticality of 1? Our actual approach, which is more robust, uses cumulative criticality: we sum up the criticality over the encountered states and postpone the update until the criticality accumulates to one. In the above example this results in exactly the same update target as the simple strategy that does not use accumulation. This method produces large stepnumbers in uncritical regions of the state space, and therefore we can expect a speed-up in learning. We call this algorithm "Criticality-based Varying Stepnumber" (CVS).
5 Evaluation of CVS in the Road-Tree environment
In this section we introduce the Road-Tree environment, which is particularly well suited for understanding the benefits of CVS. We test the algorithm against a number of widely used reinforcement learning algorithms in order to demonstrate its efficiency. Unless specified otherwise, we do not discount the reward (γ = 1) and our initial Q-function is constant over the state-action space. Our default values are , and .
The Road-Tree environment
In order to test CVS, we construct a simple environment, named Road-Tree, which has a natural criticality function. Road-Tree has a tree-like structure. The agent starts at the root and always moves in one direction: downward. There are two types of states. In a simple state there is only one possible action. In a junction state the agent needs to choose between multiple roads. The reward upon stepping onto a simple state is always zero. The reward is nonzero only upon reaching a junction or a terminal state, and it may vary across junctions and terminal states. Figure 1 illustrates a simple Road-Tree environment. The numbers in the junctions represent the rewards. The numbers on the edges show the distance between the two corresponding junctions, that is, the number of simple states between them (a distance of indicates simple states).
The very natural criticality function which we are going to use in the Road-Tree environment assigns zero to a simple state and one to a junction or terminal state.
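A Road-Tree and its criticality function can be represented quite compactly. The sketch below is an illustration only: the `Junction` class, the `rollout` helper and the tuple layout are hypothetical, and it assumes that an edge with distance d contains exactly d simple states, which may differ from the convention used in the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Junction:
    """A junction or terminal state; terminal states have no children."""
    reward: float
    # Each child is a (distance, Junction) pair; `distance` is the assumed
    # number of simple states on the road leading to that child.
    children: list = field(default_factory=list)


def rollout(root, choices):
    """Follow `choices` (child indices) downward from the root.

    Returns one (kind, reward, criticality) tuple per visited state,
    not counting the root itself.
    """
    trace = []
    node = root
    for choice in choices:
        distance, child = node.children[choice]
        for _ in range(distance):
            # Simple states: zero reward and zero criticality.
            trace.append(("simple", 0.0, 0.0))
        kind = "junction" if child.children else "terminal"
        # Junctions and terminal states: criticality one.
        trace.append((kind, child.reward, 1.0))
        node = child
    return trace
```

For example, `rollout(Junction(0.0, [(2, Junction(5.0))]), [0])` visits two simple states and then a terminal state with reward 5.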
CVS vs. Q-Learning
We now compare the performance of CVS against that of Q-Learning in the 2-level Road-Tree from fig. 1. Clearly, the optimal policy is to go left first and then right, ending up at the terminal state that has a reward of . In Q-Learning, due to the relatively large distance between the intermediate junction that has a reward of and the optimal terminal state, the optimal reward ( ) will be backpropagated to the intermediate junction very slowly. The other intermediate junction, which has a reward of , will be much more attractive to the agent, and the agent might therefore remain on that non-optimal path for a long period of time. Conversely, the CVS agent will backpropagate the optimal reward from the terminal state to the intermediate junction immediately after the first visit and should therefore quickly converge to the optimal policy.
The plot in fig. 2 confirms this reasoning. The Q-Learning agent needs about 7000 episodes to converge to the optimal policy; the CVS agent, by contrast, converges almost immediately.
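The slow backpropagation of one-step updates along a long road can be reproduced with a toy calculation. The sketch below is not the experiment from the paper: it applies plain one-step TD value updates (no action choice, no exploration) to a single deterministic road, and the function name and parameters are hypothetical. It counts how many episodes pass before the reward at the end of the road first influences the value at its start.

```python
def episodes_until_propagated(num_states, reward=1.0, alpha=1.0):
    """Count episodes until the first state's value becomes nonzero.

    num_states -- number of states on the road before the rewarded end
    reward     -- reward received when leaving the last state
    alpha      -- learning rate of the one-step update
    """
    values = [0.0] * num_states
    episodes = 0
    while values[0] == 0.0:
        episodes += 1
        # One episode: visit the states in order and update each one
        # towards its immediate successor (one-step bootstrapping).
        for s in range(num_states):
            if s == num_states - 1:
                target = reward
            else:
                target = values[s + 1]
            values[s] += alpha * (target - values[s])
    return episodes
```

Under these assumptions the reward information moves back by one state per episode, so a road of d states needs d episodes. An update whose target is the end of the road, as chosen by the criticality-based rule, would need a single episode.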
CVS vs. Watkins's Q(λ)
We now compare the performance of CVS against that of Watkins's Q(λ) in a very simple Road-Tree environment (see Figure 3). Watkins's Q(λ) is a very popular state-of-the-art algorithm in reinforcement learning which makes use of eligibility traces.
In our simple Road-Tree example (fig. 3) the right branch has a higher terminal reward and is therefore clearly the optimal choice. However, we can easily see that Q(λ) for this case, , might have a hard time finding this optimal policy. For a given λ the number of episodes required to backpropagate the terminal reward to the root is an increasing function of the road's length. Therefore the higher reward of the optimal branch will be backpropagated towards the root much more slowly than the lower reward of the non-optimal branch.
This leads to two possible scenarios for Q(λ). The first one, which is very improbable, occurs when, before the non-optimal branch is visited for the first time, the optimal branch has been visited so often that its terminal reward has been backpropagated to the root to a degree sufficient to outweigh a one-time visit of the non-optimal branch. The second one, which is much more probable, is that Q(λ) will be stuck in the non-optimal branch once it has traveled it. Due to epsilon-greedy exploration, Q(λ) will eventually converge to the optimal policy; however, for a small ε this might take a very long time.
In order to validate this reasoning, we applied both learning algorithms, Q(λ) and CVS, to learn the optimal policy for our Road-Tree environment. The initial Q-function and the value of were set to: . For each episode we recorded the return. In order to smooth the plot, we applied a running average with a width of 10 episodes to the return. The plot (fig. 4) shows the smoothed return vs. the episode number. The plot supports our initial assumption. CVS instantly chooses the optimal policy; the variations in its return are caused only by the ongoing exploration. Q(λ), however, takes about 40 episodes until its policy becomes optimal.
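The smoothing step can be sketched as a trailing running average. The exact windowing used for the plot is not specified, so the function name `running_average` and the handling of the first episodes (a shorter window until `width` values are available) are assumptions.

```python
def running_average(values, width=10):
    """Smooth a sequence of returns with a trailing window of `width` episodes."""
    smoothed = []
    window_sum = 0.0
    for i, value in enumerate(values):
        window_sum += value
        if i >= width:
            # Drop the value that has just left the window.
            window_sum -= values[i - width]
        # Early episodes are averaged over the values seen so far.
        smoothed.append(window_sum / min(i + 1, width))
    return smoothed
```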
We now compare the performance of CVS against that of Q(λ) in a slightly more challenging environment, in which the distances between the root and the terminal states are equal (fig. 5).
Intuitively, as soon as the CVS agent sees the optimal policy, the corresponding Q-value will become equal to which basically terminates the learning process. We validated this intuition experimentally via 20 simulations, each containing a training session of 200 episodes, and averaged the returns over the simulations. As expected, the plot in fig. 6 confirms our guess: CVS converges to the optimal policy much more quickly than Q(λ).
CVS vs Monte Carlo
In the previous section we presented an example in which CVS outperformed Q(λ). An interesting observation is that in that particular example CVS functioned exactly like Monte Carlo (MC); the update targets were always the terminal states. This observation immediately raises the question of whether CVS is able to outperform MC. We are going to show that there is in fact a situation in which this is the case.
In the current example (fig. 7) we have a slightly more complex Road-Tree than in the previous section: a 3-level tree. On the second level there are two junctions with different rewards. The junction with the higher reward, , has many children. All of these children, with the exception of one, have negative rewards. The only child with a positive reward corresponds to the optimal policy. The second junction on level 2, which we denote with , has a lower reward. It has two children with nonnegative rewards.
Let us consider how a Monte Carlo agent will act in this environment. The first time the agent visits the higher-reward junction, it will most likely continue to a low-reward child, simply because most of the branches have a low reward. Therefore, the total return of the episode will be negative. Once this negative return has been backpropagated to the root, which happens instantly in the case of MC, the agent will avoid exploring policies which pass through that junction. Some exploration will still take place, but only due to the ε-greedy exploration. Therefore, it might take a long time before the agent visits the optimal trajectory, which passes through the higher-reward junction. It might take even longer until it has visited this trajectory sufficiently many times for the Q-value at the root of the right action (towards the higher-reward junction) to become higher than that of the left action (towards the lower-reward junction).
CVS should learn faster in this Road-Tree. Consider the trajectories which contain . The Q-value for at the root will have as the update target. Therefore, it will most of the time choose this branch. Now let us assume the worst case, in which it takes a large amount of episodes until the agent sees the high-reward child of for the first time. Certainly this might lead to a situation where for all possible actions . However because the learning rate is small the difference will grow slowly. Since the reward upon reaching is higher than that of , as long as the aforementioned difference is not too large, the agent will prefer . Therefore, it should take CVS less episodes to learn the optimal trajectory.
A comparison between MC and CVS is shown in fig. 8. From the plot we can infer that, as expected, MC visits the lower-reward junction most of the time and as a consequence fails to identify the optimal policy. In contrast, CVS visits the higher-reward junction much more frequently. The plot shows that the optimal policy is executed for the first time after about 60 episodes, and from there on CVS mostly keeps following it.
6 CVS vs. Q-Learning in the Shooter environment
In this section we describe the performance of CVS versus Q-Learning in a different environment: the Shooter environment. Just like the Road-Tree environment, the Shooter environment can be naturally associated with a simple criticality measure.
The Shooter environment
The Shooter environment is located on a rectangular playing field of 10x20 (width x length) parcels. This playing field contains multiple objects: a gun, which is located in the first column and whose random position may change from game to game; a bullet, which is initially located at the gun's position; and a moving target, which is located in the last column. Each of these objects occupies exactly one parcel. Furthermore, there is an obstacle with a size of 3 parcels in the 8th column. At the beginning of the game the target has a random position in the last column of the field and a random direction of movement, which can be either up or down. In every step the target moves by exactly one parcel inside the last column. The direction of movement is inherited from the previous step, except when the target hits the wall; in that case the direction is reflected. The agent controls the gun. In any given state of the game the agent can choose one of four actions: either not shoot at all, or shoot in one of three possible directions (diagonally up, diagonally down or horizontally). The three shooting actions shoot a bullet only if the agent has a bullet to shoot; otherwise these actions are equivalent to doing nothing. At any given step the bullet moves by one parcel in the direction in which it was shot. When it hits a wall, its vertical direction is reflected. If it hits the obstacle, the game terminates with a reward of -1. If it reaches the last column, the game terminates with a reward of +1 if it hits the target, or -1 if it does not.
There exists a rather natural criticality measure for the Shooter environment. The agent's actions are relevant only before the shot. Moreover, before the shot all states can be considered equally critical. Therefore the most obvious criticality measure is binary: it assigns a criticality of 1.0 to any state in which the shot has not yet taken place, and a criticality of 0.0 to any state that occurs after the shot.
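The reflecting movement of the target (and of the vertical component of the bullet) can be sketched as below. The exact reflection convention is not spelled out in the text, so this is one plausible reading: the helper name `step_with_reflection` is hypothetical, the field is assumed to have 10 rows, and an object that would leave the field reverses its direction and moves one parcel the other way in the same step.

```python
def step_with_reflection(row, direction, num_rows=10):
    """Move one parcel vertically and reflect at the walls.

    row       -- current row index, 0 <= row < num_rows
    direction -- -1 (up), +1 (down) or 0 (no vertical movement)
    Returns the new (row, direction) pair.
    """
    if not 0 <= row + direction < num_rows:
        # The next parcel would lie outside the field: reverse the direction.
        direction = -direction
    return row + direction, direction
```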
The performance of CVS vs. Q-Learning
In order to compare CVS to Q-Learning, we implemented a tabular Q-Learning agent and a tabular CVS agent. For both agents, we initialized the Q-function to a value of in every state. The exploration parameter was set to a value of 0.1 and remained constant throughout the learning process. The performance of both agents, which was monitored by averaging the scores over the last 100 episodes, is plotted in fig. 10. As the plot shows, CVS clearly outperforms Q-Learning. The Q-Learning agent struggles to make any progress during the first 500 episodes, and it takes about 1400 episodes to reach an average score of 0.0. Conversely, the CVS agent reaches an average score of 0.0 after only about 100 episodes, and after 200 episodes it converges to a performance level of 0.4.
7 CVS vs DDQN in the Tennis environment
The Tennis environment
In this section we test the performance of CVS in the context of Deep Q-Learning. For this purpose we implemented the Tennis environment, which can be associated with a binary criticality function in a very natural way. The Tennis environment consists of two rackets (one controlled by the agent and the other by a computer opponent), a ball, and a playing field which has a size of 20x40 (width x length) pixels. On this field both the agent's and the opponent's racket occupy one pixel each, in the second and second to last columns. The movements of each racket are defined by three primitive actions (up, down, stay), which either move the racket by one pixel in the corresponding direction or let it remain at the same position. If the racket is located at the wall and is therefore not able to move in one of the two directions, executing the corresponding action is equivalent to staying at the same position. The ball occupies a single pixel and can move in six directions: horizontally, diagonally up or diagonally down, either towards the agent or towards the opponent. If the ball hits either a wall or a racket, its direction of movement is reflected. The opponent's policy in the Tennis environment is a noisy variant of the optimal policy. In any given state the opponent chooses the optimal action with a probability of or some random action with the probability of . Each game consists of a single point. The agent receives a reward of +1 when it scores, and a reward of -1 when the opponent scores. The starting position of the ball is always at the center of the field. The starting direction is always towards the agent. The exact direction (horizontally, diagonally up, diagonally down) is random.
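The noisy opponent can be sketched as follows. The probability of playing the optimal action is left as a parameter because its value is not given here; the names `noisy_opponent_action` and `p_optimal` are hypothetical, and drawing the random action uniformly from all three actions (so that it may coincide with the optimal one) is an assumption.

```python
import random

ACTIONS = ("up", "down", "stay")


def noisy_opponent_action(optimal_action, p_optimal, rng=random):
    """Return the optimal action with probability p_optimal, else a random action.

    rng -- any object offering random() and choice(), e.g. random.Random(seed)
    """
    if rng.random() < p_optimal:
        return optimal_action
    # Assumption: the random action is drawn uniformly from all actions.
    return rng.choice(ACTIONS)
```

Passing a seeded `random.Random` instance as `rng` makes the opponent's behavior reproducible across runs.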
The DDQN algorithm
We implemented the DDQN algorithm (double DQN) that has been proposed by van Hasselt et al. [Hado van Hasselt2016]. The main benefit of this approach over plain DQN is that the second neural net improves the stability of the learning procedure. Our strategy for the exploration vs. exploitation challenge consists of three learning periods. The first 2000 games are an "exploration-only period". Afterwards we perform a linear decay of the exploration parameter ε, which starts at the value and is decreased to the value of by the 12000th game. In the final learning period ε is constant. Our learning rate is and our reward decay parameter is . Our neural net takes the 20x40 image as the input and has an output layer whose size equals the number of possible actions (in our case three). It has a compact architecture with only three hidden layers: two convolutional layers and one fully connected layer. The exact structure is [(Conv,32),(Conv,64),(FC,256)].
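The three-period exploration schedule can be written as a small function. Since the start and end values of the decay are not given here, they are left as the hypothetical parameters `eps_start` and `eps_end`; the assumption that the exploration-only period corresponds to an exploration parameter of 1.0 is also ours.

```python
def epsilon_schedule(game, eps_start, eps_end, explore_games=2000, decay_end=12000):
    """Exploration parameter for a given game index (0-based).

    Period 1: exploration only (assumed to mean epsilon = 1.0).
    Period 2: linear decay from eps_start to eps_end.
    Period 3: constant eps_end.
    """
    if game < explore_games:
        return 1.0
    if game >= decay_end:
        return eps_end
    # Fraction of the decay period that has already elapsed.
    fraction = (game - explore_games) / (decay_end - explore_games)
    return eps_start + fraction * (eps_end - eps_start)
```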
The criticality measure
Since the agent's actions are irrelevant when the ball is moving towards the opponent, it is a rather straightforward strategy to set the criticality of these states to zero. When the ball is moving towards the agent there is a variety of options for a meaningful criticality measure; we simply assign a criticality of 1.0 to these states.
Performance of CVS vs DDQN
The difference between CVS and DDQN is reflected in the update targets. Whereas in DDQN each example in the experience buffer has the next state as the update target, in CVS the update target is chosen according to the CVS algorithm. We chose to monitor the learning procedure for each algorithm by looking at the average score over the last 100 points. In fig. 11 the performance boost of CVS in comparison to DDQN is clearly recognizable: after the first 5000 games CVS has only a tiny lead, by game 10000 the lead is already clearly visible, and after 15000 games it becomes significant. The most important observation is that CVS reaches machine-level performance about twice as fast as DDQN.
All experiments presented in this paper show that CVS outperforms, in terms of convergence speed, all baselines which do not take the concept of criticality into account. However, obtaining a criticality function is not always a trivial task. While a human trainer may provide criticality levels for some states, obtaining a function that evaluates the criticality of all states in a satisfactory manner may be domain dependent. Furthermore, it may not always be obvious which states should be considered critical and which non-critical. For example, a car driving on a straight road with no traffic may seem to be in a non-critical state. However, a driver who suddenly turns the wheel right (or left) may hit a wall and receive a negative reward, which implies that the state was in fact critical. Therefore, in some domains it may be necessary to refine the concept of criticality in order to improve agent performance. One such refinement may give a higher weight to more plausible actions, perhaps by taking the current behavior of the agent into account.
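The difference in update targets can be made explicit with the double Q-learning target, in which the online network selects the action and the second network evaluates it. The sketch below is an assumption about how the two variants could share one formula, not the authors' implementation: the function name and arguments are hypothetical, and the Q-values are passed in as plain lists rather than network outputs.

```python
def double_q_target(reward_sum, discount, q_online_next, q_target_next, terminal):
    """Double Q-learning target for one example from the experience buffer.

    reward_sum    -- (discounted) rewards collected up to the target state
    discount      -- gamma ** n, where n is the number of steps to the target state
    q_online_next -- online-network Q-values at the target state, one per action
    q_target_next -- second-network Q-values at the same state
    terminal      -- True if the target state ends the game
    """
    if terminal:
        return reward_sum
    # The online network selects the action ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... and the second network evaluates it.
    return reward_sum + discount * q_target_next[best_action]
```

In this reading, plain DDQN always calls the function with n = 1 and the immediate successor state, whereas the criticality-based variant passes the reward sum and the Q-values of the state selected by the criticality rule.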
9 Conclusions and future work
In recent years most human-aided reinforcement learning approaches improved the agent’s performance by either integrating human demonstrations into the learning procedure or augmenting the reward function with human feedback. In this paper we introduce a novel idea which opens an alternative way of human assistance in the agent’s learning process: the concept of the criticality of a state. We implemented the Criticality-based Varying Stepnumber (CVS) agent which uses the concept of criticality in order to locally choose the appropriate stepnumber for the update of the Q-function.
We tested the CVS agent in multiple environments, including Road-Tree, the Shooter game and the Tennis game. The experiments showed that CVS is able to outperform popular reinforcement learning algorithms such as Q-Learning, Deep Q-Networks and Monte Carlo learning.
There are a number of promising research directions in the area of criticality-based algorithms. The first is the development of interfaces which would enable the human trainer to communicate criticality-related information to the agent. Such a step would be crucial for a practical realization of criticality-based learning. An alternative research direction arises from the fact that our experiments were limited to criticality functions which were provided by the human trainer for each state of the MDP. In more complex environments it might be interesting to have the human provide the criticality function only on a certain portion of the states and to generalize it to all other states by supervised learning techniques. Such an approach could significantly reduce the effort required from the human trainer and consequently make criticality-based learning much more attractive.
- [Ahmad El Sallab2017] Ahmad El Sallab, Mohammed Abdou, E. P. 2017. Deep reinforcement learning framework for autonomous driving. arXiv:1704.02532.
- [David Silver2017] David Silver, Julian Schrittwieser, K. S. 2017. Mastering the game of go without human knowledge. Nature.
- [Hado van Hasselt2016] Hado van Hasselt, Arthur Guez, D. S. 2016. Deep reinforcement learning with double q-learning. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).
- [Kristofer De Asis2018] Kristofer De Asis, R. S. 2018. Per-decision multi-step temporal difference learning with control variates. arXiv:1807.01830.
- [Kristopher De Asis2018] Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Z. H. 2018. Multi-step reinforcement learning: A unifying algorithm. arXiv:1703.01327.
- [Nan Jiang2015] Nan Jiang, L. L. 2015. Doubly robust off-policy value evaluation for reinforcement learning. arXiv:1511.03722.
- [Richard Sutton2015] Richard Sutton, A. R. M. 2015. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv:1503.04269.
- [Richard Sutton2017] Richard Sutton, A. B. 2017. Reinforcement Learning: An Introduction.
- [Shashua2016] Shashua, S. S.-S. S. S. A. 2016. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv:1610.03295.
- [Volodymyr Mnih2013] Volodymyr Mnih, Koray Kavukcuoglu, D. S. 2013. Playing atari with deep reinforcement learning.