1 Motivation and overview
Traditional Reinforcement Learning (RL) has focused on problems involving many states and few actions, such as simple grid worlds. Most real-world problems, however, are of the opposite type, involving few relevant states and many actions. For example, to return home from a conference, humans identify only a few subgoal states, such as the lobby, a taxi, and the airport. Each valid behavior connecting two such states can be viewed as an action, and there are trillions of them. Assuming the subgoal-identification problem is already solved by a method outside the scope of this paper, the quality of any RL method in real-world settings depends less on how well it scales with the number of states than on how well it scales with the number of actions.
Likewise, when we humans reach an unfamiliar state, we generally resist testing every possible action before determining the good states to transition to. We can, for example, observe the state transitions that other humans pass through while accomplishing the same task, or reach some rewarding state by happenstance. Then we can focus on reproducing that sequence of states. That is, we are able to identify a task before acquiring the skills to perform it reliably. Take, for example, the task of walking along a balance beam. To traverse the length of the beam without falling, a precise action must be chosen at every step from a very large set of possibilities. The probability of failure is high because almost all actions at every step lead to imbalance and falling, so a good deal of training is required to learn the precise movements that reliably take one across. Throughout this process, however, the desired trajectory of states is well understood; the more difficult part is achieving it reliably.
Reinforcement-learning methods that learn action values, such as Q-learning and Sarsa, are guaranteed to converge to the optimal value function provided all state-action pairs in the underlying MDP are visited infinitely often (TD(0) carries an analogous guarantee for state values). These methods can therefore converge extremely slowly in environments with large action spaces.
This paper introduces an elegant new algorithm that automatically focuses search in action space by learning state-transition values independent of action. We call the method T-learning; it represents a novel off-policy approach to reinforcement learning. T-learning is a temporal-difference (TD) method, and as such it has much in common with other TD methods, especially action-value methods such as Sarsa and Q-learning [8, 9]. But it is quite different: instead of learning the values of state-action pairs as action-value methods do, it learns the values of state-state pairs (here referred to as transitions).
The value of the transitions between states is recorded explicitly, rather than the value of the states themselves or the value of state-action pairs. The learning task is decomposed into two separate and independent components: (1) learning the transition values, and (2) learning the optimal actions. The transition-value function allows high-payoff transitions to be easily identified, enabling a focused search in action space to discover the actions that make the valuable transitions reliably.
Agents that learn the values of state transitions can exhibit markedly different behavior from those that learn state-action pairs. Action-value methods are particularly suited to tasks with small action spaces, where learning about all state-action pairs is not much more cumbersome than learning about the states alone. However, as the size of the action space increases, such methods become less feasible. Furthermore, action-value methods have no explicit mechanism for identifying valuable state transitions and focusing learning there. They lack an important real-world bias: that valuable state transitions can often be achieved with high reliability. As a result, in these common situations, action-value methods require extensive and undue search before converging to an optimal policy. T-learning, on the other hand, has an initial bias: it presumes the existence of reliable actions that will achieve any valuable transition observed so far. This bias enables the valuable transitions to be easily identified and search to be focused there. As a result, the difficulties induced by large action spaces are significantly reduced.
2 Environments requiring precision
Consider the transition graph of an MDP, where the vertices of the graph are the states of the environment and the edges represent transitions between states. Define a function $h$ that maps each state $s$ to the neighboring vertex whose value under the optimal policy, $V^*$, is the highest of all the neighbors of $s$, where $V^*$ is calculated for a given discount factor as though the agent had actions available in every state that could move it deterministically along the graph of the environment.
The class of MDPs for which T-learning is particularly suited can be described formally as follows: if $r(s, h(s))$ is large, then

(1) the transition $(s, h(s))$ occurs with non-negligible probability under typical (e.g., random) action selection, and

(2) $P(h(s) \mid s, a) \geq 1 - \varepsilon$ for some action $a$,

where $\varepsilon$ is a small positive value.
These environments are those where specific skills can accomplish tasks reliably. Walking across a balance beam, for example, requires specific skills. The first constraint ensures that the rewarding transitions are likely to be observed. The second constraint ensures that the transitions associated with large reward signals can be achieved by finding a specific skill, i.e., a reliable action. Without this guarantee, one might never attempt to acquire certain skills because the average outcome during learning may be undesirable.
Consider the example of Figure 1(a). This MDP has two parts, one requiring high skill (which yields a large reward) and one requiring low skill (which yields a small reward). Episodes begin in a single start state and end in one of three terminal states. The transition table is defined as follows: from the start state, one group of actions takes the agent deterministically to the low-skill branch, a second group takes it deterministically to the high-skill branch, and one action takes it to either branch with equal probability. Every action in the low-skill branch ends the episode with a small reward. In the high-skill branch, all actions but one move the agent to either the high-reward terminal state or a less rewarding terminal state with equal probability, while the one remaining action (the skilled action) moves the agent to the high-reward terminal state deterministically.
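As a concrete sketch, the small MDP can be written as a step function. The state names, reward magnitudes, the number of actions, and the 50/50 failure probability below are illustrative assumptions standing in for the figure's elided values, not the paper's exact numbers.

```python
import random

# Sketch of the two-branch MDP described above. State names, rewards, and
# action counts are illustrative assumptions; the paper's figure is elided.
N_ACTIONS = 10                    # assumed; the paper varies this
SKILLED_ACTION = N_ACTIONS - 1    # the one reliable action in the risky branch

def step(state, action, rng=random):
    """Return (next_state, reward, done) for the sketch MDP."""
    if state == "start":
        if action < N_ACTIONS // 2:
            return "safe", 0.0, False          # deterministic, low-skill branch
        elif action < N_ACTIONS - 1:
            return "risky", 0.0, False         # deterministic, high-skill branch
        else:
            return rng.choice(["safe", "risky"]), 0.0, False
    if state == "safe":
        return "end_small", 1.0, True          # every action yields the small reward
    if state == "risky":
        if action == SKILLED_ACTION:
            return "end_big", 10.0, True       # the skill: reliably reach the big reward
        # unskilled actions succeed only half the time (assumed probability)
        if rng.random() < 0.5:
            return "end_big", 10.0, True
        return "end_zero", 0.0, True
    raise ValueError(f"unknown state {state!r}")
```

Under this sketch, random behavior still stumbles onto the rewarding transition often, while only the skilled action makes it reliably.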
This example meets the criteria given above. The rewarding transition is likely to be observed even before the reliable action is discovered. Temporal-difference (TD) methods will find the optimal policy when every state-action pair is visited infinitely often. Q-learning, for example, will eventually, through exploration, discover the reliable action. However, before the optimal policy is found, and after only a few episodes, the agent will select actions that take it into the low-skill branch. Such an agent has no bias towards discovering the action that represents the skill required to move reliably to the rewarding state.
Agents that assign values to state-action pairs and then determine their policies from these values cannot explicitly search for an action that reliably makes a particular transition; rather, the rewarding state-action pair has to be discovered as a unit.
The next section describes T-learning in detail. This algorithm biases the behavior of the agent towards finding the actions that make the most valuable transitions at each state.
3 State transition functions
The general reward for an MDP is a function of three variables, $r(s, a, s')$: the current state, the action taken, and the resulting state.
Most environments considered in practice, however, take reward functions that depend on a single variable, usually only the state of the agent. For reasons discussed below, we consider rewards as functions of state transitions, independent of the action taken; i.e., $r(s, s')$.
We denote the restricted function by $r(s, s')$, and the more general reward function by $r(s, a, s')$.
Recall TD(0), which learns a state-value function $V$ using the following update rule:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right].$
For a fixed policy $\pi$ this function converges to $V^\pi$, which is given recursively by

$V^\pi(s) = \sum_{s'} P^\pi(s, s') \left[ r(s, s') + \gamma V^\pi(s') \right],$

where $P^\pi(s, s')$ is the probability of the transition $(s, s')$ under $\pi$.
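As a reference point for the transition-value rules that follow, the TD(0) backup can be sketched in a few lines of tabular code (the helper name is our own):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = defaultdict(float)
td0_update(V, "s", 1.0, "s'")   # V["s"] becomes 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```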
The next two sections present two separate learning rules. Both learn functions $T(s, s')$ that assign values to state transitions. The first is on-policy and is essentially equivalent to TD(0). The second is entirely off-policy and is analogous to Q-learning.
3.1 An on-policy learning rule
In the remainder of the paper, the term transition function refers to a function $T(s, s')$ defined on pairs of states. Its values are called transition values or T-values. Consider the following update rule:

$T(s_t, s_{t+1}) \leftarrow T(s_t, s_{t+1}) + \alpha \left[ r_{t+1} + \gamma T(s_{t+1}, s_{t+2}) - T(s_t, s_{t+1}) \right].$
The value $T^\pi(s, s')$ represents the reward for the transition from $s$ to $s'$ plus the cumulative expected discounted future reward under the given policy. Convergence is therefore implied by the theorem that establishes the convergence of the state-value function learned by TD(0). For a fixed policy $\pi$ this function converges to $T^\pi$, which is given recursively by

$T^\pi(s, s') = r(s, s') + \gamma \sum_{s''} P^\pi(s', s'') \, T^\pi(s', s'').$
The recursive relation can also be given when the reward is a function of three variables. (The term $r(s, s')$ would be replaced by the expected reward for making the transition from $s$ to $s'$ under the given policy; to find this value one would have to find the likelihood of each action given the transition, which would depend on the policy, $s$, and $s'$.) Using the restricted reward function we have the following relation between $V^\pi$ and $T^\pi$:

$V^\pi(s) = \sum_{s'} P^\pi(s, s') \, T^\pi(s, s').$
One can use, for example, a one-step lookahead to select actions based on $T$. A deterministic policy could be given by

$\pi(s) = \arg\max_a \sum_{s'} \hat{P}(s' \mid s, a) \, T(s, s'),$

where $\hat{P}$ is a learned model of the transition probabilities. This is similar to determining a policy from state values,

$\pi(s) = \arg\max_a \sum_{s'} \hat{P}(s' \mid s, a) \left[ \hat{r}(s, s') + \gamma V(s') \right],$
where it is necessary to have a model of the reward function as well.
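The one-step lookahead over T-values can be sketched as follows; the function name and the dictionary layout of the learned model are illustrative choices, not the paper's implementation.

```python
def greedy_action(s, actions, P_hat, T):
    """Pick the action maximizing sum over s' of P_hat(s'|s,a) * T(s, s').

    P_hat[(s, a)] is a dict mapping next states to estimated probabilities;
    T[(s, s')] is a learned transition value (0 for unseen transitions).
    """
    def expected_T(a):
        return sum(p * T.get((s, sp), 0.0) for sp, p in P_hat[(s, a)].items())
    return max(actions, key=expected_T)
```

Note that, unlike the state-value lookahead, no reward model is needed: the reward is already folded into the T-values.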
This learning rule is on-policy. The values, as we have shown, are related to those learned by TD(0). The formulation of the rule itself is similar to the learning rule used in Sarsa. Next we introduce T-learning, a TD prediction method that is off-policy and analogous to Q-learning.
3.2 An off-policy learning rule

Now consider a function $T$ which is learned as follows:

$T(s_t, s_{t+1}) \leftarrow T(s_t, s_{t+1}) + \alpha \left[ r_{t+1} + \gamma \max_{s''} T(s_{t+1}, s'') - T(s_t, s_{t+1}) \right],$

where the maximum is taken over the successor states $s''$ observed so far from $s_{t+1}$.
We call this learning rule T-learning. This rule captures the value associated with the best transition available. When the agent's behavior is determined by these values, it becomes possible to search the action space, at the valuable states, to discover the reliable actions. Moreover, this can be done in a straightforward and natural way. Taking the maximum over the possible state transitions is reminiscent of Q-learning; but rather than capture the ideal action associated with each state, T-learning caches the topology of the ideal transitions. The ideal transitions between states can be determined without having to use a model: at state $s$ the ideal transition is simply $\arg\max_{s'} T(s, s')$.
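A minimal tabular sketch of this backup, tracking the set of successors observed so far (the class and method names are our own, not the paper's):

```python
from collections import defaultdict

class TLearner:
    """Sketch of the T-learning backup: maintains T(s, s') over observed
    transitions, bootstrapping on the best transition out of the next state."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.T = defaultdict(float)
        self.successors = defaultdict(set)  # successor states seen from each state
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, r, s_next, done=False):
        self.successors[s].add(s_next)
        best_next = 0.0
        if not done and self.successors[s_next]:
            # max over transitions observed so far out of s_next
            best_next = max(self.T[(s_next, s2)] for s2 in self.successors[s_next])
        target = r + self.gamma * best_next
        self.T[(s, s_next)] += self.alpha * (target - self.T[(s, s_next)])
```

Note that the update never mentions the action taken: any action that happens to produce a transition contributes to that transition's value.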
In the example given in Figure 1(a), the first time the rewarding transition is made, regardless of the action selected, the agent learns the value associated with that transition. A subsequent backup will make the value of the transition into the high-skill branch greater than the value of the transition into the low-skill branch. The agent's policy then shifts, preferring the high-skill branch. All of this can happen before the agent discovers the reliable action.
3.2.1 Appropriate environments for T-learning
The environment needs to satisfy some niceness properties for T-learning to be successful. Sufficient conditions were given in Section 2 and occur in many real-world environments. These restrictions can, however, be relaxed. We denote the function that T-learning converges to by $T^*$, and the T-values of the optimal policy by $T^{\pi^*}$. For each state $s$ the MDP needs to satisfy the following property: for every $\varepsilon > 0$ there exists an action $a$ such that $P(s^* \mid s, a) \geq 1 - \varepsilon$, where $s^* = \arg\max_{s'} T^*(s, s')$.
We christen this criterion the precision property because it guarantees that the valuable transitions can be made with as high a reliability as needed. These reliable actions may be rare among all possible actions, and may be considered skilled actions or behaviors. Said differently, these are MDPs where actions are available that make the highest-value paths on the state-transition graph (as described in Section 2) worth attempting.
As an example of how the learning rule can fail, consider the environment introduced in Figure 1(a) with the reliable action removed. T-learning will still prefer the high-skill branch, since the rule is biased towards the payoff associated with the rewarding transition. This happens because the value is calculated independently of the specific actions available: the learning rule tacitly assumes that transitions in the transition graph can be made with arbitrarily high reliability. With the reliable action removed from its repertoire, this assumption does not lead to the optimal policy. For this reason we restrict our discussion to environments satisfying the precision property.
It is important to realize that the skilled action need not deterministically make the rewarding transition. Given the same reward function, the action need only make the transition with high enough probability that the expected reward for entering the high-skill branch exceeds the reward received in the low-skill branch, to ensure that the converged T-values can be used to calculate the optimal policy.
4 Experiments

We compare T-learning to Q-learning in a model of a balance beam environment (TD(0) and other methods are discussed in Section 5). The transition graph of the MDP is given in Figure 1(b), and the transitions are similar to those of the smaller version of this environment. We vary the number of actions throughout the experiments. From the start state, one group of actions moves the agent deterministically into a safe chain, a second group moves it deterministically onto the balance beam, and one further action moves it to either branch with equal probability. Along the safe chain, all actions advance the agent deterministically, and the agent receives a small reward for reaching the end of the chain. The odd-numbered states represent the balance beam. Each transition along the beam can be made deterministically by one skilled action; the other actions, with equal probability, either advance the agent along the beam, moving it to the next odd state, or cause it to fall off the beam. The agent receives no reward for falling and a large reward for reaching the end of the beam. Episodes end at the terminal states and begin at the start state.
T-learning is a form of TD prediction, and as such it requires a separate module to generate a policy. We chose a simple model-based approach for clarity of exposition: we are interested in what T-learning learns compared to what Q-learning learns, and ideal control (e.g., model-based versus model-free actor-critic methods) does not fall within the scope of this paper. The policy for T-learning is derived from a model of the transition matrix and the T-values. The algorithm uses a one-step lookahead, where new actions are selected in favor of those that have failed to make the rewarding transitions reliably. One counter records the number of times each state-action pair has been taken, and another records the number of times each transition was observed under a specific action. From these counts it is easy to generate a basic estimate of the transition matrix. Actions are selected, in each state, from among those whose values equal the maximum; actions that have yet to be taken in a state are optimistically biased towards the preferred next state with a fixed transition probability. See Algorithm 1 for details. The same learning rate, discount factor, and exploration rate were used for both T-learning and Q-learning. We ran 50 trials for each of the experiments; each trial lasted until the policy converged.
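The counting scheme can be sketched as follows; the class name and the `optimistic_p` prior are our own, the latter standing in for the paper's elided bias parameter for untried actions.

```python
from collections import defaultdict

class TransitionModel:
    """Empirical transition-probability estimate from visit counts, as used to
    turn T-values into a policy. optimistic_p is an assumed stand-in for the
    paper's bias parameter: untried actions are presumed to reach the desired
    next state with this probability."""

    def __init__(self, optimistic_p=0.9):
        self.n_sa = defaultdict(int)    # times (s, a) was taken
        self.n_sas = defaultdict(int)   # times (s, a) led to s'
        self.optimistic_p = optimistic_p

    def observe(self, s, a, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def p_hat(self, s, a, s_next):
        if self.n_sa[(s, a)] == 0:
            # untried actions are biased towards the desired next state
            return self.optimistic_p
        return self.n_sas[(s, a, s_next)] / self.n_sa[(s, a)]
```

Combined with a one-step lookahead over the T-values, this bias drives the agent to keep trying fresh actions at states whose best transition has not yet been made reliably.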
5 Results and Discussion
Averaged over trials, Q-learning required roughly 25 times as many steps (actions executed) as T-learning for the policy to converge in this relatively small environment.
In Figure 2 we see how the number of episodes to convergence relates to the size of the action space; T-learning yields arbitrarily large speedup factors over Q-learning as the action space grows.
Figure 3 illustrates the key differences in behavior that result from what each algorithm is learning. In the early learning stages the two algorithms exhibit the same behavior; at this point they are equally likely to traverse the beam. After a number of chance successes, the T-learning algorithm propagates the state-transition values back to the start of the beam. At this point its behavior departs from that of the Q-learning algorithm: the agent begins to favor the beam over the safe chain. Unlike T-learning, the Q-learning algorithm cannot independently identify the task and then acquire the skill to succeed at it; see Figure 3(a). Rather, and this represents the fundamental weakness of Q-learning in this environment, its behavior is such that it always prefers the safe chain until it has found the skilled action in each of the odd-numbered states. Moreover, this action must be rediscovered in each odd state after the value of the next odd state has become positive; otherwise, after the learning step, the value will remain non-positive and the action will be valued as suboptimal. The sampling along the beam achieved by Q-learning's exploration factor is significantly less than the sampling rate that T-learning enjoys through its change in policy.
There is a horizontal asymptote in the number of episodes required for the convergence of the T-values with respect to the number of actions; see Figure 3(b). The behavior of the agent will shift, preferring the beam to the safe chain, after roughly the same number of episodes regardless of the number of actions. This is a remarkable feature of the T-learning algorithm and might play an important role in the so-called options framework, discussed below.
Algorithms inherit the character of the learning rule they are built around. For example, Dyna-Q, which learns a model and takes advantage of planning to speed learning, still needs first to find the actions that represent the skilled movements before learning is sped up, because it inherits the disadvantages of Q-learning discussed above. However, once a well-informed model is learned (i.e., once the skilled actions have been tried), the values will quickly produce the optimal policy, and the bottleneck of having to discover the skilled action at the first beam state before valuing the skilled actions at the other odd states is removed. That said, planning methods can be used with T-learning as well, and would decrease the time needed for the T-values to converge: the agent would only have to make it across the beam a single time before shifting its policy. Using TD(0) under the hood of Dyna or Prioritized Sweeping does not address the fundamental problem either. TD(0) is an on-policy method, and as such it learns values based on the distribution of its samples; TD(0) produces results similar to Q-learning's in the balance beam environment. Models and planning will speed up Q-learning, but they do not address the fundamental problem of learning in MDPs with huge action spaces.
An optimistic initialization of Q-values does not address the heart of the matter either. For tiny action spaces, optimistic initialization puts Q-learning roughly on par with T-learning. In moderately sized spaces, optimistic initialization reduced convergence time to a fraction of the original. With larger action spaces the number of episodes to convergence was affected less by optimistic initialization. In general, when there are thousands upon thousands of actions, it is a bad idea to have to try them all in each state. Similarly, using optimistic state values with TD(0), with either a one-step lookahead or an actor-critic method to generate the policy, also fails to make learning significantly faster: the value of the beam-entry state will decrease faster than the value of the safe state, immediately nullifying the optimistic initialization; the results would be similar to those reported herein.
Even with a large state space, learning functions on $S \times S$ is not as daunting as it seems. Typically the state space is far from fully connected, so the sampling needed is nowhere near quadratic in the size of the state space. Further, a large action space does not affect the difficulty of learning a function whose domain is $S \times S$. However, when learning functions on $S \times A$, sampling must be done at all state-action pairs. Learning transition functions has theoretical advantages over learning state-value functions: (1) transition functions give more information, since values are assigned to the transitions between states rather than to the states themselves; and (2) they implicitly contain a model of the environment. As a result of (1), real-world RL function approximation may prove more powerful for transition functions (regardless of the learning rule) than for state functions, since there are more relationships to generalize from.
Robots can be initialized with certain T-values, say learned in simulation, and then left to learn the control that traverses the valuable states. This is not equivalent to using properly initialized state values: since the T-values come with an implicit model of the environment, the robot in any given state has a goal state. The deviation from this goal state after taking an action can be used to learn the relationships between the actions, both for the current transition and for other transitions. This is a natural framework for transfer. More generally, T-learning is biologically plausible in that it allows a goal state to be valued highly before the agent possesses the skills needed to reach that goal. After seeing someone ride a unicycle for the first time, it is clear that this is a skill that can be learned; we can value the difficult goal of balancing on a unicycle before having ever tried it.
Learning transition values is also very attractive in non-Markovian environments. A hidden state may dramatically affect the control required to achieve specific state transitions without altering the values of those transitions. For example, a strong wind may dramatically change the control required to fly a plane without affecting the desired flight path. Preliminary work has shown that agents relying on transition values are robust in environments whose dynamics are non-stationary in the way suggested above, because learning is invariant with respect to these changes in the transition tables.
On a more abstract level, rather than focusing on single actions, we can consider subgoals or behaviors that transition the agent between relevant states. As soon as an agent has discovered a state transition, it can assign it a value, regardless of whether the behavior that initially made the transition is reliable. These values can then be used to drive the agent to interesting or valuable states, where methods from the options framework can be employed to learn how to reliably reach other valuable states or rewards.
The T-learning algorithm learns values fundamentally different from those of Q-learning, allowing an agent to quickly identify the valuable transitions in an environment regardless of the size of the action space. The behavior exhibited by T-learning allows an agent to sample from the environment in a way that amounts to the focused learning of a skill. As a result, T-learning can, in realistic scenarios, behave arbitrarily better than Q-learning.
-  B. Bakker and J. Schmidhuber. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In F. G. et al., editor, Proc. 8th Conference on Intelligent Autonomous Systems IAS-8, pages 438–445, Amsterdam, NL, 2004. IOS Press.
-  G. Konidaris and A. Barto. Building portable options: skill transfer in reinforcement learning. In Proceedings of the 20th international joint conference on Artifical intelligence, pages 895–900, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
-  A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. In Machine Learning, pages 103–130, 1993.
-  M. Stolle and D. Precup. Learning options in reinforcement learning. In Lecture Notes in Computer Science, pages 212–223, 2002.
-  M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, 2009.
-  R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
-  R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull., 2:160–163, July 1991.
-  C. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, May 1989.
-  C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.