1 Introduction
A reinforcement learning agent learns by interacting with its environment and uses the reward observed for each action as a feedback signal to improve its policy. In some environments, reward signals are dense: for example, the game score when training an agent to play Pong (an Atari game), or the distance travelled when training a robot to run. In such environments, the agent continuously receives constructive reward feedback, which provides strong signals and gradients for training the agent’s underlying model.
However, in other environments, desired outcomes are rare, and the agent receives a reward only when one occurs. For instance, in the Atari game Montezuma’s Revenge, the agent receives a reward only for picking up a key, which requires performing a series of tasks successfully. The agent can start improving its model only after it stumbles upon a successful sequence of actions by acting randomly. Because the probability of this is extremely low, training usually requires an extremely large number of episodes, especially in the beginning, which can be very costly in real-world environments.
One way to combat this problem is to design algorithms that explore environments faster and more thoroughly. In DQN, the agent typically uses an $\epsilon$-greedy policy to decide between exploitation and exploration, and chooses a uniformly random action during exploration, which is extremely inefficient in environments with sparse rewards. Therefore, we propose an improved version of DQN that performs one-step planning during exploration, increasing the chance of discovering unseen states.
2 Background
Unlike supervised and unsupervised learning, which learn from data given upfront, reinforcement learning feeds the rewards observed through interactions with the environment back into its model in order to improve. Delayed rewards and interaction with the underlying environment are the two major characteristics of reinforcement learning [3]. Reinforcement learning consists of a sequence of interactions in which the agent takes an action and observes the reward and the next state, as illustrated in Figure 1. This process can be formally defined as a Markov decision process (MDP).
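Definition 2.1 (Markov decision process)
Defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$ and a policy $\pi$:
$\mathcal{S}$: set of possible states
$\mathcal{A}$: set of possible actions
$\mathcal{R}$: distribution of reward given a state and action pair
$\mathbb{P}$: transition probability
$\gamma$: discount factor
$\pi$: a function from $\mathcal{S}$ to $\mathcal{A}$ that tells which action to take in each state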
However, in many environments, the underlying dynamics, e.g. the transition probability, are not known. Algorithms that can learn without knowing the dynamics are called model-free, and there are two main approaches: Q-learning and policy gradients.
2.1 Q-Learning
In Q-learning, the agent learns a Q-value function that gives the expected total return for a state and action pair. At each time step, the agent acts with a greedy policy $\pi(s) = \arg\max_{a} Q(s, a)$, picking the action that maximizes the Q function.
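Definition 2.2 (Q-value function)
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]$$
The Q-value function $Q^{*}(s, a)$ of a state and action pair is optimal when the agent uses the greedy policy. The Q-value function involves the expectation of the return over all future time steps, which is hard to learn directly. One can apply the Bellman operator to convert the equation into a recursive one:
$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]$$
We can then apply the value iteration algorithm to obtain an iterative update formula for learning the Q-values:
$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q_{i}(s', a') \,\middle|\, s, a\right]$$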
This method works well when the state and action spaces are small enough that a table can keep track of all the state-action pairs. However, when the state-action space becomes large, it is infeasible to compute the optimal Q-value function exactly. Thus, in practice, Q-learning uses a function approximator instead.
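The tabular case can be illustrated on a toy problem. The following is a minimal sketch, assuming a hypothetical three-state deterministic chain (the arrays `next_state` and `reward` are invented for illustration), so the expectation in the value iteration update reduces to a single term.

```python
import numpy as np

def q_value_iteration(next_state, reward, gamma=0.9, n_iters=200):
    """Iterate Q_{i+1}(s, a) = r(s, a) + gamma * max_a' Q_i(s', a')
    on a deterministic MDP given as lookup tables."""
    q = np.zeros(reward.shape)
    for _ in range(n_iters):
        # Value of the best action in every state under the current estimate.
        best_next = q.max(axis=1)
        # next_state[s, a] indexes the successor state of each pair.
        q = reward + gamma * best_next[next_state]
    return q

# Hypothetical chain: action 0 stays, action 1 moves right.
# State 2 is absorbing; entering it from state 1 pays a reward of 1.
next_state = np.array([[0, 1], [1, 2], [2, 2]])
reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
q_table = q_value_iteration(next_state, reward)
greedy_policy = q_table.argmax(axis=1)
```

In this toy chain the greedy policy moves right in states 0 and 1. The table has one entry per state-action pair, which is exactly what becomes infeasible for large spaces.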
Deep Q-Network (DQN)
DQN [1] uses a neural network, which can be a deep convolutional network when dealing with a high-dimensional state space such as pixels, to approximate the Q-value function. During each training step, the transition $(s_t, a_t, r_t, s_{t+1})$ is saved in an experience replay memory, and the agent draws samples from this memory to train the network, increasing sample efficiency. DQN also deploys a separate target Q-network to provide the Q-value estimates used as training targets. The target network is updated only once every fixed number of steps, which increases the stability of training.
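The two mechanisms described here, a replay memory and a periodically updated target network, can be sketched in plain Python. This is an illustrative sketch rather than the implementation of [1]: the class and function names are hypothetical, transitions are treated as opaque tuples, and network weights are stood in for by a plain list.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        # transition is assumed to be a tuple such as (s, a, r, s_next).
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def maybe_sync_target(step, online_weights, target_weights, update_interval):
    """Return a copy of the online weights every update_interval steps,
    otherwise keep the current target weights unchanged."""
    if step % update_interval == 0:
        return list(online_weights)
    return target_weights

memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.push((i, 0, 0.0, i + 1))
batch = memory.sample(2)
```

Because the target weights change only at the synchronization steps, the regression targets stay fixed between updates, which is the source of the stability benefit mentioned above.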
2.2 Policy Gradients
Policy gradient methods directly learn the optimal actions without learning the values of states. The simplest policy gradient method, REINFORCE, also known as Monte Carlo Policy Gradient [4], is described below.
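Given a class of parameterized policies $\Pi = \{\pi_\theta\}$, the expected return of a policy is defined as
$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\right]$$
where $\tau = (s_0, a_0, r_0, s_1, \ldots)$ is the trajectory and $r(\tau)$ is its return.
The gradient of $J(\theta)$ is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$$
where we can use a Monte Carlo estimate of the gradient of $J(\theta)$:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
One can optimize a policy by performing gradient ascent on $J(\theta)$ with respect to $\theta$. Relying on the reward of a single sampled trajectory can cause large variance during training; one way to reduce it is to combine Q-learning and policy gradients, an approach called Actor-Critic.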
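The Monte Carlo estimate can be made concrete for a small policy. The sketch below assumes a linear-softmax policy with parameter matrix `theta` and a single sampled trajectory, and uses the common estimator in which the discounted trajectory return multiplies the sum of the score terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$. The function names and the policy form are illustrative assumptions, not the setup of the baseline used in this paper.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """Single-trajectory policy-gradient estimate.

    theta has shape (n_features, n_actions); the policy is
    pi(a | s) = softmax(s @ theta)[a].
    """
    # Discounted return of the whole trajectory.
    trajectory_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        probs = softmax(s @ theta)
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # For a linear-softmax policy, grad log pi(a|s) = outer(s, onehot(a) - probs).
        grad += np.outer(s, one_hot - probs)
    return trajectory_return * grad

theta = np.zeros((2, 2))
grad = reinforce_gradient(theta, [np.array([1.0, 0.0])], [0], [1.0])
theta = theta + 0.02 * grad  # one gradient-ascent step
```

Since every score term is scaled by the same sampled return, a single lucky or unlucky trajectory moves all parameters at once, which illustrates the variance issue noted above.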
3 Related Work
Improving exploration and learning efficiency in environments with sparse rewards is an active area of research. Our approach falls under the category of using heuristics as guidance to make an informed exploration step instead of picking a random action. Similar ideas have been presented before. In their paper on predicting Atari game frames, Oh et al. [5] present a deep neural network architecture that can generate frames up to 100 steps ahead, conditioned on actions, with high accuracy. They use this information to guide exploration, choosing actions that lead to rarer states. The rarity of a state relative to recently visited states is computed with a Gaussian kernel. Similarly, Dilokthanakul et al. [6] proposed improved exploration in DQN through informed exploratory actions that encourage visiting states whose values have high uncertainty.
The use of intrinsic reward to provide a feedback signal is another popular approach. For example, Pathak et al. [7] introduced curiosity-driven exploration, which uses the error of a forward dynamics model’s next-state prediction against the true next state as an intrinsic reward. The agent is trained to maximize the sum of the intrinsic reward and the environmental reward.
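The intrinsic-reward idea can be illustrated in a few lines. This is a simplified sketch and not the formulation of [7]: the function names, the squared-error form, and the scaling factor `scale` are assumptions made for illustration.

```python
import numpy as np

def intrinsic_reward(predicted_next_state, true_next_state, scale=1.0):
    """Prediction error of a forward dynamics model, used as a curiosity bonus."""
    error = np.asarray(predicted_next_state, dtype=float) - np.asarray(true_next_state, dtype=float)
    # Squared L2 error: large when the forward model is surprised by the outcome.
    return 0.5 * scale * float(error @ error)

def total_reward(extrinsic, predicted_next_state, true_next_state, scale=1.0):
    """Training signal: environmental reward plus intrinsic reward."""
    return extrinsic + intrinsic_reward(predicted_next_state, true_next_state, scale)

bonus = intrinsic_reward([1.0, 2.0], [1.0, 0.0])
```

Even when the environmental reward is zero, a poorly predicted transition still yields a positive training signal, which is what makes this approach attractive under sparse rewards.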
Methods that improve the sample efficiency of RL algorithms are also helpful in environments with sparse rewards. For example, in Prioritized Experience Replay [8], Schaul et al. improved DQN by sampling from the experience replay with priority instead of uniformly. The key observation was that transitions that are more surprising, less redundant, and rarer provide more information for the agent to learn from. They showed that increasing the sampling frequency of these transitions results in faster learning. Azizzadenesheli et al. [9] proposed a novel RL algorithm that combines model-free and model-based methods to achieve better sample efficiency. They use a Generative Adversarial Network (GAN) to model the environment’s dynamics, together with a reward predictor. The algorithm uses these models for planning with Monte Carlo Tree Search (MCTS).
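Priority-based sampling can be sketched as follows, assuming the proportional variant in which transition $i$ is drawn with probability $p_i^{\alpha} / \sum_k p_k^{\alpha}$. The function names and the default `alpha` are illustrative assumptions, and the importance-sampling correction used in [8] is omitted.

```python
import numpy as np

def sampling_probabilities(priorities, alpha=0.6):
    """Turn per-transition priorities into a sampling distribution."""
    scaled = np.asarray(priorities, dtype=float) ** alpha
    return scaled / scaled.sum()

def sample_indices(priorities, batch_size, rng, alpha=0.6):
    """Draw replay indices with probability proportional to priority ** alpha."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choice(len(probs), size=batch_size, p=probs)

rng = np.random.default_rng(0)
probs = sampling_probabilities([1.0, 1.0, 2.0], alpha=1.0)
indices = sample_indices([0.0, 1.0], batch_size=5, rng=rng, alpha=1.0)
```

Setting `alpha=0` recovers uniform sampling, so the exponent controls how strongly the priorities shape the minibatch.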
4 DQN with Model-Based Exploration
The full algorithm is presented in Algorithm 1. The agent chooses between exploration and exploitation based on an $\epsilon$-greedy policy. Like the original DQN algorithm, our agent trains two Q-networks, including a target Q-network to increase stability. Likewise, we use a replay memory and clip the error terms when training the Q-network. On top of the DQN algorithm, we also train a dynamics network that predicts the next state given a state and action pair. By combining this dynamics network with an explicit model of the distribution of recently visited states, our agent is able to pick an action that increases the chance of visiting unseen states during exploration.
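The choice between exploitation and exploration can be sketched as below. This is a minimal illustration, not a transcription of Algorithm 1: the function names are hypothetical, the exploration step is passed in as a callable `explore_fn` so that either a random or a guided strategy can be plugged in, and the multiplicative decay schedule is an assumption about how the decay and minimum values listed in the appendix are applied.

```python
import numpy as np

def decay_epsilon(epsilon, decay=0.9995, epsilon_min=0.01):
    """Assumed schedule: multiply by the decay factor, never going below the minimum."""
    return max(epsilon_min, epsilon * decay)

def select_action(q_values, epsilon, explore_fn, rng):
    """With probability epsilon delegate to explore_fn, otherwise act greedily."""
    if rng.random() < epsilon:
        return explore_fn()
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9])
greedy = select_action(q_values, 0.0, lambda: 0, rng)    # epsilon = 0: always exploit
explored = select_action(q_values, 1.0, lambda: 0, rng)  # epsilon = 1: always explore
```

Keeping the exploration strategy behind a callable makes it straightforward to compare uniform random exploration with a guided alternative under the same schedule.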
4.1 Dynamics Network
In environments with sparse rewards, most if not all of the transitions in the replay memory have non-informative rewards, providing little signal for the agent to learn Q-values. However, we can still use these transitions to train a neural network that predicts the next state $s_{t+1}$ given the current state $s_t$ and an action $a_t$. This network is crucial for making the guided exploration step. We use a fully connected feedforward neural network (see Table 1). The dynamics network can be trained on the same transitions sampled from the experience replay that are used to train the Q-network. Therefore, implementing a prioritized replay memory benefits the training of the dynamics network as well.
4.2 Guided Exploration
The most common way to explore under an $\epsilon$-greedy policy is to sample uniformly from the action space. However, as shown in Figure 5 (a), exploring with random actions results in 1) most of the visited states concentrating around the initial state, and 2) a large area of the state space never being visited.
The goal of guided exploration is to use the learned dynamics of the environment to choose an action by one-step planning during exploration, such that there is a better chance of reaching rare or unseen states. At a given state, we predict the next state for each action in the action space, and we pick the action that leads to the state least similar to the states we have seen.
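Unlike Oh et al. [5], who use a Gaussian kernel as the similarity measure, we propose to evaluate the rarity of a state relative to recently visited states with a probabilistic approach. For simplicity and generality, we model the distribution of past states as a multivariate Gaussian with the empirical mean $\hat{\mu}$ and empirical covariance $\hat{\Sigma}$ of the recently visited states as the parameters:
$$p(s) = \mathcal{N}(s;\ \hat{\mu}, \hat{\Sigma}) = \frac{1}{\sqrt{(2\pi)^{d}\, \lvert \hat{\Sigma} \rvert}} \exp\left(-\tfrac{1}{2} (s - \hat{\mu})^{\top} \hat{\Sigma}^{-1} (s - \hat{\mu})\right)$$
where $d$ is the dimension of the state.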
We pick the exploratory action that leads to the next state with the lowest probability under this distribution. Explicitly modeling past states with a multivariate distribution has two advantages: 1) it takes into account the correlation between dimensions of the state (for example, in the Mountain Car environment, a higher velocity is more common when the car is at a higher position), and 2) it accounts for the variance of each component, eliminating the need for normalization. As a result, our method provides better exploration compared to measuring similarity between states simply by distance metrics.
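The guided exploration step can be sketched with NumPy. This is a minimal sketch under stated assumptions, not the authors' code: `predict_next` stands in for the trained dynamics network, `toy_dynamics` is an invented two-action model, and the state is assumed to have at least two dimensions with a non-singular empirical covariance.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log-density of the multivariate Gaussian N(mean, cov) evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + mahalanobis)

def guided_action(state, recent_states, predict_next, n_actions):
    """Pick the action whose predicted next state is least likely under a
    Gaussian fitted to the recently visited states."""
    recent = np.asarray(recent_states, dtype=float)
    mean = recent.mean(axis=0)
    cov = np.atleast_2d(np.cov(recent, rowvar=False))
    log_probs = [
        gaussian_log_density(predict_next(state, a), mean, cov)
        for a in range(n_actions)
    ]
    return int(np.argmin(log_probs))

def toy_dynamics(state, action):
    # Hypothetical model: action 0 stays put, action 1 jumps far from the origin.
    shift = np.array([5.0, 5.0]) if action == 1 else np.zeros(2)
    return state + shift

rng = np.random.default_rng(0)
recent_states = rng.normal(size=(50, 2))  # 50 recent states clustered near the origin
action = guided_action(np.zeros(2), recent_states, toy_dynamics, n_actions=2)
```

Comparing log-densities instead of densities gives the same ranking of actions while avoiding underflow for very unlikely states.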
5 Experiments
We test the proposed algorithm on two classic simulated environments with sparse rewards: Mountain Car and Lunar Lander. We use the OpenAI Gym [2] implementations (discrete-action versions) of the two environments.
1) Evaluate improvement on exploration
We run our algorithm in exploration-only mode and visualize the states visited. We compare our result to two other exploration techniques: 1) random actions, and 2) informed actions chosen by a Gaussian kernel similarity measure. Figure 5 and Figure 10 show the results for the two environments, respectively.
2) Evaluate improvement on learning speed
We evaluate the learning speed of our agent against two baselines: 1) the original DQN, and 2) Monte Carlo Policy Gradient. The running average of rewards for the two environments is plotted in Figure 6 and Figure 11, respectively.
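A running average of episode rewards can be computed as in the sketch below. The window length is an assumption, since the value used for the figures is not stated here.

```python
import numpy as np

def running_average(rewards, window):
    """Mean of each block of `window` consecutive episode rewards."""
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits.
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")

smoothed = running_average([1, 2, 3, 4], window=2)
```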
Our experiments show that the proposed algorithm achieves significantly better exploration and learning speed in Mountain Car, but shows no noticeable improvement in Lunar Lander.
6 Limitations and Future Work
Our proposed algorithm depends on two strong assumptions: 1) the dynamics of the environment can be learned with high accuracy, 2) the distribution of recently visited states follows a multivariate Gaussian distribution. Violation of either assumption can result in poor performance, which limits the application of our algorithm to certain environments.
This explains why our algorithm did not perform better than the baselines in the Lunar Lander environment. Our dynamics predictor network fails to predict the next state with high accuracy, and it is clear from Figure 10 that the explored states do not follow a Gaussian distribution.
In addition, our method is sensitive to high dimensionality in the state space. First, fitting a multivariate Gaussian to high-dimensional data has a high computational cost. Second, numerical issues become more likely as the dimension grows. For example, if a dimension of the state vector always has the same value, the covariance matrix is singular.
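The singular-covariance case is easy to reproduce. The sketch below builds states whose second component is constant and checks the rank of the empirical covariance. Adding a small multiple of the identity (`ridge`) is one common workaround and is shown here purely as an assumption; it is not part of the method described in this paper.

```python
import numpy as np

def covariance_rank(states, ridge=0.0):
    """Rank of the empirical covariance, optionally regularized by ridge * I."""
    cov = np.atleast_2d(np.cov(np.asarray(states, dtype=float), rowvar=False))
    cov = cov + ridge * np.eye(cov.shape[0])
    return int(np.linalg.matrix_rank(cov))

rng = np.random.default_rng(0)
# Second state dimension never changes, so its variance is exactly zero.
states = np.column_stack([rng.normal(size=50), np.ones(50)])
rank_plain = covariance_rank(states)
rank_regularized = covariance_rank(states, ridge=1e-6)
```

With a rank-deficient covariance the Gaussian density is undefined, because the inverse and the log-determinant do not exist, so a constant state dimension would have to be dropped or regularized before fitting.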
Future work and extensions:
1) Instead of fitting a multivariate Gaussian to recently visited states, one can adopt a distribution that fits the observed states better. This can improve the accuracy of the probability assigned to a given state, increasing the chance of finding a rarer state.
2) Our exploratory action is chosen by one-step planning. However, if the dynamics network can predict several steps ahead with high accuracy, one can instead perform N-step planning to pick an action that maximizes the chance of finding a rare state N steps into the future. This can be effective for environments where reaching certain states requires temporally extended planning.
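The N-step extension could look like the sketch below, which is one possible reading of the idea and not something evaluated in this paper. It exhaustively enumerates action sequences of length `n_steps`, rolls each one out through a hypothetical dynamics model `predict_next`, and returns the first action of the sequence whose final state is least likely. The cost grows as the number of actions raised to the power N, so it is only practical for small action spaces and short horizons.

```python
import itertools
import numpy as np

def log_density(x, mean, cov):
    """Log-density of the multivariate Gaussian N(mean, cov) evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + mahalanobis)

def n_step_guided_action(state, mean, cov, predict_next, n_actions, n_steps):
    """First action of the plan whose predicted final state is rarest."""
    best_action, best_score = None, np.inf
    for plan in itertools.product(range(n_actions), repeat=n_steps):
        predicted = state
        for action in plan:
            # Roll the (hypothetical) dynamics model forward one step.
            predicted = predict_next(predicted, action)
        score = log_density(predicted, mean, cov)
        if score < best_score:
            best_action, best_score = plan[0], score
    return best_action

def toy_dynamics(state, action):
    # Hypothetical model: action 0 stays put, action 1 moves by (1, 1).
    return state + (np.ones(2) if action == 1 else np.zeros(2))

first_action = n_step_guided_action(
    np.zeros(2), np.zeros(2), np.eye(2), toy_dynamics, n_actions=2, n_steps=3
)
```

Prediction errors compound over the rollout, so this variant depends even more heavily on the accuracy of the dynamics model than the one-step version does.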
7 Conclusion
In this paper, we proposed DQN with model-based exploration, an improved DQN algorithm that uses learned environment dynamics to guide exploration. We demonstrated that it outperforms the original DQN on Mountain Car, a classic environment with sparse rewards: our algorithm explored a wider range of states and learned faster. However, given the strong assumptions required, our method’s effectiveness is limited to certain types of environments. For example, our experiments showed that it did not perform better than the baseline algorithms in the Lunar Lander environment, where the recently visited states are not normally distributed. We presented several ways to extend or improve our method so that it can handle a more diverse set of environments.
Acknowledgments.
We used the following third-party code:
(1) Deep Q-Learning with Keras and Gym [10] as our starter code for the original DQN implementation.
(2) Reinforcement learning methods and tutorials [11] for our Monte Carlo Policy Gradient baseline.
(3) OpenAI Gym [2] for the simulated environments: Mountain Car and Lunar Lander.
References
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[3] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 2nd edition, 2018.
[4] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press.
[5] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
[6] Nat Dilokthanakul and Murray Shanahan. Deep reinforcement learning with risk-seeking exploration. In SAB, 2018.
[7] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 488–489, 2017.
[8] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
[9] Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary Chase Lipton, and Anima Anandkumar. Sample-efficient deep RL with generative adversarial tree search. CoRR, abs/1806.05780, 2018.
[10] Keon Kim. Deep Q-learning with Keras and Gym. https://keon.io/deepqlearning/, 2017.
[11] Morvan Zhou. Reinforcement learning methods and tutorials. https://github.com/MorvanZhou/Reinforcementlearningwithtensorflow, 2017.
Appendix
Hyperparameters | Value
$\epsilon$ minimum | 0.01
$\epsilon$ decay | 0.9995
Reward discount | 0.99
Learning rate (Q-network) | 0.05
Learning rate (dynamics network) | 0.02
Target Q-network update interval | 8
Initial exploration-only steps | 10,000
Minibatch size (Q-network) | 16
Minibatch size (dynamics predictor network) | 64
Number of recent states to fit probability model | 50
Q-Network (Fully Connected)
Loss | mean squared error
Hidden Layer 1, Units | 48
Hidden Layer 1, Activation | ReLU
Hidden Layer 1, Initial weights | Glorot uniform
Dynamics Predictor Network (Fully Connected)
Loss | mean squared error
Hidden Layer 1, Units | 24
Hidden Layer 1, Activation | ReLU
Hidden Layer 1, Initial weights | Glorot uniform
Hidden Layer 2, Units | 24
Hidden Layer 2, Activation | ReLU
Hidden Layer 2, Initial weights | Glorot uniform
Hyperparameters | Value
Learning rate | 0.02
Reward discount | 0.995
Neural Network (Policy)
Loss | Softmax with cross entropy
Layer 1, Units | 10
Layer 1, Activation | tanh
Layer 1, Initial weights | ,
Layer 1, Initial bias | 0.1
Layer 2, Units | dimension of action space
Layer 2, Activation | None
Layer 2, Initial weights | ,
Layer 2, Initial bias | 0.1