1 Introduction
Reinforcement Learning (RL) problems are often formulated with the agent blind to the task reward of the environment. However, for many sparse reward problems, including goal-directed tasks such as point-to-point navigation, pick-and-place manipulation, assembly, etc., endowing the agent with knowledge of the reward function is both feasible and practical for learning generalizable behavior. In general, developers of these problems often know what the task goals are, but not necessarily how to solve them. In this paper, we will describe how we can leverage our knowledge of goals to enable learning of behaviors in these regions before the agent even reaches them. This formulation may be easier to solve than approaches that initialize learning from the start alone. For example, if we know the desired location, pose, or configuration of a task, then we can reverse the actions that brought us there, rather than forcing the agent to solve these difficult problems solely through random discovery.
In this paper, we introduce Forward-Backward Reinforcement Learning (FBRL), which uses backward induction to enable our agent to reason backwards in time. Through an iterative process, we explore both forwards from the start position and backwards from the target/goal. To achieve this, we learn a backwards dynamics model that explores in reverse from known goal states and updates values within this local neighborhood. This has the effect of “spreading out” sparse rewards so that they are easier to discover, thus accelerating the learning process.
Standard model-based approaches aim to reduce the amount of experience necessary to learn good policies by imagining steps forward and using these hallucinated events to augment the training data. However, there is no guarantee that the projected states will lead to the goal, so these rollouts may be inadequate. The ability to predict the result of an action does not necessarily provide guidance about which actions lead to the goal. In contrast, FBRL takes a more guided approach since, given an accurate model, we have confidence that each state visited in a backwards step has a path to the goal.
In the rest of the paper, we will describe the relevant background and related works. We will then formally introduce FBRL, followed by an empirical section in which we evaluate our approach in Gridworld and Towers of Hanoi, and show that it yields better results than standard Deep Double Q-Learning (DDQN) (Van Hasselt et al., 2016). Finally, we will conclude with discussions for future work.
2 Background
Reinforcement Learning (RL) problems are specified through a Markov Decision Process (MDP) $\langle S, A, R, T \rangle$ (Sutton & Barto, 1998). Here, $S$ describes the states in the environment, $A$ defines the actions the agent can take, $R(s)$ refers to the rewards an agent receives within state $s$, and $T(s, a, s')$ is a transition model that specifies the probability of entering state $s'$ after taking action $a$ in $s$. A policy $\pi(s, a)$ estimates the probability of taking action $a$ in state $s$, and we are typically interested in learning an optimal policy $\pi^*$ that maximizes the expected long-term discounted return. Model-free approaches do not have access to $T$, and rather learn an action-value function $Q(s, a)$ that predicts the return after experiencing samples in the environment:
$$L = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)^2\right] \qquad (1)$$
Here, $D$ is a replay buffer that stores experiences (Mnih et al., 2015) and $\gamma$ is the discount factor. This loss aims to minimize the TD-error, or the difference between the expected return and current prediction.
Learning Q-values often requires a large quantity of samples. Rather than directly experiencing the states, an alternative method is to jointly use model-based planning to predict values. DYNA-Q (Sutton, 1990) makes updates to values by using imagined experiences. In this case, the parameters $(s, a, r, s')$ from Equation 1 may also be obtained from imagined experiences.
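To make the TD loss of Equation 1 concrete, the following is a minimal sketch that uses a tabular Q-function in place of a neural network. The `done` flag, the discount value, the learning rate, and the two-transition buffer are illustrative assumptions that do not come from the paper; the point is only that real and imagined tuples are consumed by the same update, as in DYNA-Q.

```python
import random
from collections import defaultdict

GAMMA = 0.9  # hypothetical discount factor, chosen for illustration


def td_error(q, transition, actions, gamma=GAMMA):
    """TD-error for one (s, a, r, s_next, done) tuple under a tabular Q."""
    s, a, r, s_next, done = transition
    # Bootstrapped target: reward plus discounted best next-state value.
    best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
    return r + gamma * best_next - q[(s, a)]


def td_loss(q, batch, actions, gamma=GAMMA):
    """Mean squared TD-error over a batch, mirroring Equation 1."""
    return sum(td_error(q, t, actions, gamma) ** 2 for t in batch) / len(batch)


def q_update(q, batch, actions, lr=0.5, gamma=GAMMA):
    """Move each sampled Q-value a step toward its TD target."""
    for t in batch:
        s, a = t[0], t[1]
        q[(s, a)] += lr * td_error(q, t, actions, gamma)


# Real and imagined transitions share one buffer and one update rule.
actions = ["left", "right"]
q = defaultdict(float)
replay = [
    (0, "right", 0.0, 1, False),  # stands in for a real experience
    (1, "right", 1.0, 2, True),   # stands in for an imagined experience
]
loss_before = td_loss(q, replay, actions)
for _ in range(50):
    q_update(q, random.sample(replay, k=len(replay)), actions)
loss_after = td_loss(q, replay, actions)
```

With a function approximator, the in-place update in `q_update` would be replaced by a gradient step on `td_loss`.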
3 Related Work
When we have access to the true dynamics model, purely model-based approaches such as dynamic programming can be used to compute values over all states (Sutton & Barto, 1998). However, when the state space is large or continuous, it may be intractable to iterate over the entire state space. Q-Learning is a model-free approach that updates values in an online manner by directly visiting states, and function approximation techniques such as Deep Q-Learning enable generalizing to unseen ones (Mnih et al., 2015). Hybrid approaches that combine model-based and model-free information can also be used. DYNA-Q (Sutton, 1990), for example, was an early approach that used imagined rollouts to update the Q-values as if they had been experienced in the true environment. More recent approaches include NAF (Gu et al., 2016) and I2A (Weber et al., 2017), but these only use forward imagination.
A similar approach to our own does value iteration in reverse (Zang et al., 2007), but this is a purely model-based approach, and it does not learn a reverse model. A related approach performs bidirectional search from the start and goal (Baldassarre, 2003), but that work learns values only, whereas we aim to learn action-values. Another comparable work solves problems by using a reverse curriculum near goal states (Florensa et al., 2017). However, that approach assumes the agent can be initialized near the goal. We do not make this assumption, as knowing what the goal state is does not mean that we know how to get to it.
Many works have used domain knowledge to help speed up learning, for example through reward shaping (Ng et al., 1999). Another approach is to more efficiently use the experiences from the replay buffer. Prioritized experience replay (Schaul et al., 2015) aims to replay samples that have high TD-error. Hindsight experience replay treats each state in an environment as a potential goal so that the system can learn even when it fails to reach the desired target.
The concept of using reverse dynamics is similar to inverse dynamics (Agrawal et al., 2016; Pathak et al., 2017). In those approaches, a system predicts the dynamics that yielded a transition between two states. In our approach, we use the state and action to predict the previous state. The purpose of this function is to reverse an action and use this unraveling to learn values near the goal.
4 Approach
We now introduce our approach, Forward-Backward Reinforcement Learning (FBRL). In this work, we utilize both imagined and real experiences to learn values. A forward step uses samples of real experiences originating from the start state to update Q-values, and a backward step uses imagined states that are asynchronously predicted in reverse from known goal states. We hypothesize that this approach will improve our model of values in the vicinity of the goal, and thus expedite learning. We now describe the preliminaries for our approach.
4.1 Preliminaries
We specify FBRL problems through a modified MDP $\langle S, A, R, G \rangle$. As before, $S$ corresponds to the states in the environment, $A$ are the actions the agent can take, and $R(s)$ represents the rewards an agent receives in $s$. We assume that $R$ does not distinguish between real and imagined inputs and can be queried at any time. Finally, $G$ is a distribution of goal states from which we can sample uniformly.
4.2 Backwards model
We aim to learn a backward transition model $B(s_{t+1}, a_t)$ that captures what happens if we undo an action in a state. We use a tuple of experience $(s_t, a_t, s_{t+1})$ to learn the model. Rather than predicting the previous state directly, we aim to learn the difference between the two: $\Delta = s_{t+1} - s_t$. This allows the model to learn how states will change, rather than absolute positional information. It reduces the expected range of output values and generally centers them around zero, resulting in a more stable estimate. This formulation is appropriate since we are using states from the start of the problem to learn the backwards model, which is used near goal states that will initially have little training data.
The backwards model is a neural network that is trained to predict $\hat{\Delta} = B(s_{t+1}, a_t)$, where $\hat{\Delta}$ is the estimate of $\Delta$. Now, we can predict the previous state as $\hat{s}_t = s_{t+1} - \hat{\Delta}$. The loss for the backward model then is: $L_B = H(\hat{s}_t, s_t)$, where $H$ denotes a Huber loss.
In some environments, it may be impossible to learn an accurate deterministic backward model, even if the problem has deterministic actions. For example, if an agent is next to a wall, we might not know if it previously bumped into the wall or if it took a step towards it. Additionally, for discrete-valued problems, it may be difficult to learn a network that can predict discrete values. These issues are compounded further in stochastic settings. To address this we formulate the problem using a variational approach. If we know the distribution over $\Delta$, then we can predict a distribution over potential outcomes. In this formulation, $B$ will represent a probability distribution for each state variable that can be trained using a cross-entropy loss from the true distribution.
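The delta formulation can be illustrated with a small sketch. Here a per-action running mean of state differences stands in for the neural network, and the difference is assumed to be the next state minus the previous state; the class name `MeanDeltaBackwardModel`, the Huber threshold, and the sample transitions are all hypothetical. The sketch is meant to show why predicting changes rather than absolute states lets a model fitted near the start be queried near the goal.

```python
from collections import defaultdict


def huber(residual, threshold=1.0):
    """Huber loss of a scalar residual: quadratic near zero, linear beyond."""
    r = abs(residual)
    if r <= threshold:
        return 0.5 * r * r
    return threshold * (r - 0.5 * threshold)


class MeanDeltaBackwardModel:
    """Toy stand-in for the network: per-action running mean of deltas."""

    def __init__(self, state_dim):
        self.state_dim = state_dim
        self.sums = defaultdict(lambda: [0.0] * state_dim)
        self.counts = defaultdict(int)

    def update(self, s_prev, action, s_next):
        # The regression target is the change in state, not the state itself.
        for i in range(self.state_dim):
            self.sums[action][i] += s_next[i] - s_prev[i]
        self.counts[action] += 1

    def predict_delta(self, s_next, action):
        # This toy model ignores s_next; a learned network would condition on it.
        n = max(self.counts[action], 1)
        return [total / n for total in self.sums[action]]

    def predict_previous(self, s_next, action):
        delta = self.predict_delta(s_next, action)
        return [s - d for s, d in zip(s_next, delta)]


def backward_loss(model, s_prev, action, s_next):
    """Mean Huber loss between the predicted and the true previous state."""
    s_hat = model.predict_previous(s_next, action)
    return sum(huber(p - t) for p, t in zip(s_hat, s_prev)) / len(s_prev)


model = MeanDeltaBackwardModel(state_dim=2)
# Experience gathered near the start state (bottom left of a grid).
for s_prev, action, s_next in [((0, 0), "right", (1, 0)),
                               ((1, 0), "right", (2, 0)),
                               ((2, 0), "up", (2, 1))]:
    model.update(s_prev, action, s_next)

# The same model is then queried far from its training data, near a goal.
previous = model.predict_previous((9, 9), "right")
loss = backward_loss(model, (8, 9), "right", (9, 9))
```

A model that regressed absolute coordinates would have no basis for a prediction at `(9, 9)`, whereas the learned offset for `"right"` transfers directly.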
4.3 Action sampling
Another important consideration is how to sample actions that lead to useful updates. Our approach either randomly samples actions or uses a more greedy step that aims to direct the rollouts towards the start by moving to states with high Q-values: .
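One possible reading of the two sampling modes is sketched below. The mixing probability `epsilon`, the scoring rule (the Q-value of the imagined predecessor under the candidate action), and the toy chain are assumptions made for illustration, since the exact selection rule is not recoverable from the text above.

```python
import random


def sample_backward_action(s_next, actions, backward_model, q_value,
                           epsilon=0.5, rng=random):
    """Pick the action to undo at s_next during a backward rollout.

    With probability epsilon the action is uniformly random; otherwise it is
    the action whose imagined predecessor has the highest Q-value.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)

    def score(action):
        s_prev = backward_model(s_next, action)  # imagined predecessor
        return q_value(s_prev, action)

    return max(actions, key=score)


# Toy 1-D chain: each action shifts the position by +1 or -1.
actions = [+1, -1]
q_table = {(4, +1): 0.9, (6, -1): 0.2}  # hypothetical Q-values


def backward_model(s_next, action):
    """Exact inverse of the chain dynamics, used here in place of a learned model."""
    return s_next - action


def q_value(state, action):
    return q_table.get((state, action), 0.0)


greedy_action = sample_backward_action(5, actions, backward_model, q_value, epsilon=0.0)
random_action = sample_backward_action(5, actions, backward_model, q_value, epsilon=1.0)
```

Setting `epsilon` between the two extremes trades off broad coverage around the goal against rollouts that follow the current value estimates.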
4.4 Backwards Imagination
Algorithm 1 shows the pseudocode for our approach. In the forward step, we train the agent using experiences from the replay buffer, according to whichever learning paradigm we choose. In this work, we use DDQN. We additionally use real experiences to update the backward model.
The backward step takes place asynchronously. During this process, we use backward imagination for a limited number of steps. Starting from the goal state, the approach samples an action, uses the model to imagine backwards, and then repeats the process from the resulting state. These imagined experiences are used to augment the replay buffer.
It is important to note that initially the backwards model is unlikely to accurately predict the true dynamics. The model is first trained on experience near the starting region, and the dynamics outside of this initial region will often differ significantly, especially near the goal. For example, consider a maze navigation task in which the layout beyond the explored region is unknown, or the difference in dynamics between a humanoid lying down and one standing up.
While the model may start out being inaccurate, it provides a constantly improving signal that helps formulate the value function, which is then used to guide exploration. In this way, it acts like an intrinsic reward that provides a predicted direction for exploration. Consider again the navigation problem, where the model in the immediate region will learn a factored representation for locomotion, but cannot predict the walls of the maze further away. The hallucinated experience will likely predict movement through walls. While this is inaccurate, it does provide a shape for the value function that will encourage traveling towards the goal until a wall is discovered. Once discovered, the model will update and the value function will shift to anticipate the presence of the wall. As training progresses, the system will capture larger regional dynamics and start to predict potential global dynamics, e.g., presence of walls beyond what has been directly observed. As the system approaches the goal, the backward model will converge to the real model.
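The backward step described in this section can be sketched as a short rollout. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the lattice is unbounded, the backward model is an exact inverse instead of a learned network, actions are sampled uniformly, the reward values are hypothetical, and the reward is taken to be the one received on entering the successor state.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL = (4, 4)


def reward_fn(state):
    # Hypothetical values: queried the same way for real and imagined states.
    return 1.0 if state == GOAL else -0.01


def backward_model(s_next, action):
    # Exact inverse on an unbounded lattice; a learned model would go here.
    dx, dy = MOVES[action]
    return (s_next[0] - dx, s_next[1] - dy)


def imagine_backward(goal, num_steps, rng=random):
    """Roll the backward model out from the goal for num_steps steps."""
    transitions = []
    s_next = goal
    for _ in range(num_steps):
        action = rng.choice(list(MOVES))
        s_prev = backward_model(s_next, action)
        done = s_next == goal
        # Stored in forward order so the usual TD update can consume it.
        transitions.append((s_prev, action, reward_fn(s_next), s_next, done))
        s_next = s_prev  # continue imagining from the predicted predecessor
    return transitions


replay_buffer = []  # in a full agent this would also hold real experience
replay_buffer.extend(imagine_backward(GOAL, num_steps=5, rng=random.Random(0)))
```

Because each tuple is stored as `(s_prev, action, reward, s_next, done)`, the forward learner needs no changes to train on imagined data.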
5 Experiments


The purpose of our experiments is to demonstrate that FBRL can significantly speed up learning in environments with sparse rewards. We evaluate our approach in Gridworld and Towers of Hanoi, illustrated in Figure 1. For comparison, we implement FBRL by augmenting DDQN and compare it against a standard DDQN baseline.
5.1 Gridworld




We first evaluate our approach in an x Gridworld. We use this environment as it allows us to easily show the benefits of our approach as the reward becomes more sparse. The agent’s actions are to move up, down, left, and right by a single unit, and its state consists of its Cartesian coordinates. The agent is initialized in the bottom left corner of the grid, and receives a reward of when it reaches the top right. It receives a step cost of per timestep. The inputs to the backward model are and it must learn to predict . The model architecture is a fully-connected network with outputs followed by ReLU, followed by another fully-connected network with outputs, one for each state dimension. For FBRL, we used steps of imagination with asynchronous stream.
Figure 2 shows the results for running different size gridworlds. The results show that as we increase the size of the grid, i.e., as the goal gets further away, there is a clear advantage for using reverse imagination. The gap between the performance of DDQN compared to FBRL increases as the size gets larger. This suggests the approach is better suited for longer horizon sparse reward environments, but still does not degrade performance for short horizon tasks.
5.2 Towers of Hanoi


The next environment we evaluate in is disc Towers of Hanoi. In this problem, the agent needs to move discs from the first to the third pillar, but it is only able to place a disc on top of another one if it is smaller than it. The actions are to move each disc to the first, second, or third pillar. It receives a reward of when all discs are in the third pillar and a step cost of per timestep. The inputs to the backward model are bit strings indicating which pillars each disc is on. For example, the environment in Figure 1 has a representation of since the small disc is on the first pillar and the large disc is on the third pillar. The backward model predicts a distribution for each bit over possible values: . The model architecture is a fully-connected network with outputs followed by ReLU, followed by another fully-connected network with outputs, representing the distribution over each bit. For FBRL, we used steps of imagination with asynchronous streams.
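As an illustration of a bit-string state, the sketch below assumes a one-hot encoding with three bits per disc, ordered from the smallest disc to the largest. This layout is a guess made for the example and may differ from the encoding used in the experiments. It also shows that the difference between two such encodings takes one of three values per bit, which is what makes a per-bit categorical prediction natural.

```python
NUM_PILLARS = 3


def encode(pillars):
    """Encode a state; pillars[i] is the 0-based pillar of disc i (smallest first)."""
    bits = []
    for pillar in pillars:
        bits.extend(1 if k == pillar else 0 for k in range(NUM_PILLARS))
    return bits


def decode(bits):
    """Invert encode(): recover the pillar index of every disc."""
    return [bits[i:i + NUM_PILLARS].index(1)
            for i in range(0, len(bits), NUM_PILLARS)]


def bit_delta(prev_bits, next_bits):
    """Per-bit change between two encoded states (next minus previous)."""
    return [n - p for p, n in zip(prev_bits, next_bits)]


# Small disc on the first pillar, large disc on the third pillar.
state = encode([0, 2])

# Moving the small disc from the first to the second pillar.
delta = bit_delta(encode([0, 2]), encode([1, 2]))
```

Under this assumed layout, a backward model with a three-way output per bit can represent every possible `delta` exactly.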
Figure 3 shows the results for running Towers of Hanoi with a different number of discs. We again see an advantage for using FBRL as the goal gets further away. When we increase the number of discs, FBRL outperforms DDQN. We did find though that the performance of FBRL degraded for discs, which may be due to overfitting.
6 Conclusion
In this paper, we have introduced an approach for speeding up learning in problems with sparse rewards. We introduced FBRL, which takes imagined steps in reverse from the goal. We demonstrated that this approach can perform better than DDQN in Gridworld and Towers of Hanoi. There are many directions for extending this work. We were interested in evaluating a backward planner, but we could also train using both forward and backward imagination. Another improvement would be to refine the planning policy. We used both an exploratory and a greedy approach, but did not evaluate how to balance the two. We could also use prioritized sweeping (Moore & Atkeson, 1993), which chooses actions that lead to states with high TD-error.
7 Acknowledgements
We thank Anoop Korattikara, Himanshu Sahni, Sergio Guadarrama, and Shixiang Gu for useful discussions and feedback about this work.
References
 Agrawal et al. (2016) Agrawal, Pulkit, Nair, Ashvin V, Abbeel, Pieter, Malik, Jitendra, and Levine, Sergey. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pp. 5074–5082, 2016.
 Baldassarre (2003) Baldassarre, Gianluca. Forward and bidirectional planning based on reinforcement learning and neural networks in a simulated robot. In Anticipatory behavior in adaptive learning systems, pp. 179–200. Springer, 2003.
 Florensa et al. (2017) Florensa, Carlos, Held, David, Wulfmeier, Markus, and Abbeel, Pieter. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.

 Gu et al. (2016) Gu, Shixiang, Lillicrap, Timothy, Sutskever, Ilya, and Levine, Sergey. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Moore & Atkeson (1993) Moore, Andrew W and Atkeson, Christopher G. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1):103–130, 1993.
 Ng et al. (1999) Ng, Andrew Y, Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
 Pathak et al. (2017) Pathak, Deepak, Agrawal, Pulkit, Efros, Alexei A, and Darrell, Trevor. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
 Schaul et al. (2015) Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 Sutton (1990) Sutton, Richard S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning, pp. 216–224, 1990.
 Sutton & Barto (1998) Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Van Hasselt et al. (2016) Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pp. 2094–2100, 2016.
 Weber et al. (2017) Weber, Théophane, Racanière, Sébastien, Reichert, David P, Buesing, Lars, Guez, Arthur, Rezende, Danilo Jimenez, Badia, Adria Puigdomènech, Vinyals, Oriol, Heess, Nicolas, Li, Yujia, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
 Zang et al. (2007) Zang, Peng, Irani, Arya, and Isbell Jr, Charles Lee. Horizon-based value iteration. Technical report, Georgia Institute of Technology, 2007.
Appendix A Experimental setup
For each experiment, we used a batch size of . The discount factor was . The exploration parameter was initialized to and decayed to . The replay memory had a size of and we collected initial samples before training DDQN. The architectures for the backwards models are described in the main text.
A.1 Gridworld
The learning rate for DDQN was . The architecture for DDQN was a fully-connected network with outputs followed by ReLU, followed by another fully-connected network with outputs, one for each action. We updated the target network every steps. FBRL had the same settings except we increased the learning rate to .
A.2 Towers of Hanoi
The learning rate for DDQN was . The architecture for DDQN was a fully-connected network with outputs followed by ReLU, followed by another fully-connected network with outputs, one for each action. We updated the target network every steps. As with Gridworld, FBRL used the same architecture as DDQN, but we found we obtained better results when the learning rate was reduced to .