DQN with model-based exploration: efficient learning on environments with sparse rewards

by   Stephen Zhen Gou, et al.

We propose Deep Q-Networks (DQN) with model-based exploration, an algorithm that combines model-free and model-based approaches to explore better and to learn environments with sparse rewards more efficiently. DQN is a general-purpose, model-free algorithm that has been shown to perform well on a variety of tasks, including Atari 2600 games, since it was first proposed by Mnih et al. However, like many other reinforcement learning (RL) algorithms, DQN suffers from poor sample efficiency when rewards are sparse. In such environments, most of the transitions stored in the replay memory carry no informative reward signal and contribute little to the training and convergence of the Q-network. One insight is that these transitions can still be used to learn the dynamics of the environment as a supervised learning problem. They also provide information about the distribution of visited states. Our algorithm uses these two observations to perform one-step planning during exploration, picking the action that leads to the state least likely to have been seen, and thereby improves exploration. We demonstrate our agent's performance in two classic environments with sparse rewards in OpenAI Gym: Mountain Car and Lunar Lander.




1 Introduction

A reinforcement learning agent learns by interacting with the environment, using the reward observed for each action as a feedback signal to improve its policy. In some environments, reward signals are frequent: for example, the game score when training an agent to play Pong (an Atari game), or the distance travelled when training a robot to run. In such environments, the agent continuously receives constructive reward feedback, which provides strong signals and gradients for training the agent’s underlying model.

However, in other environments, desired outcomes are rare, and the agent only receives a reward when one occurs. For instance, in the Atari game Montezuma’s Revenge, the agent only receives a reward for picking up a key, which requires performing a series of tasks successfully. The agent can only start improving its model once it stumbles upon a successful sequence of actions by acting randomly. Because the probability of this is extremely low, training usually requires an extremely large number of episodes, especially at the beginning, which can be very costly in real-world environments.

One way to combat this problem is to design algorithms that can explore environments faster and more thoroughly. In DQN, the agent typically uses an $\epsilon$-greedy policy to decide between exploitation and exploration, and chooses a uniformly random action when exploring, which is extremely inefficient in environments with sparse rewards. Therefore, we propose an improved version of DQN that performs one-step planning during exploration, increasing the chance of discovering unseen states.

2 Background

Unlike supervised and unsupervised learning, which learn from data given upfront, reinforcement learning improves its model using rewards observed through interactions with the environment. Delayed rewards and interaction with the underlying environment are the two major characteristics of reinforcement learning.





Figure 1: Agent interacts with the environment

Reinforcement learning consists of a sequence of interactions in which the agent acts on the environment and observes the reward and the next state, as illustrated in Figure 1. This process can be formally defined as a Markov decision process (MDP).

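Definition 2.1 (Markov decision processes)

Defined by: $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$, and a policy $\pi$
$\mathcal{S}$: set of possible states
$\mathcal{A}$: set of possible actions
$\mathcal{R}$: distribution of reward given state and action pair
$\mathbb{P}$: transition probability
$\gamma$: discount factor
$\pi$: a function from $\mathcal{S}$ to $\mathcal{A}$ that tells which action to take in each state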
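As a rough illustration of the definition above, the sketch below encodes a tiny two-state MDP in plain Python and rolls out a fixed policy to compute a discounted return. The state names, the transition table, and the helper names `step` and `rollout` are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical toy MDP (two states, two actions); not from the paper.
GAMMA = 0.9  # discount factor

# TRANSITIONS[(state, action)] -> list of (probability, next_state, reward)
TRANSITIONS = {
    ("left", "stay"): [(1.0, "left", 0.0)],
    ("left", "move"): [(0.8, "right", 1.0), (0.2, "left", 0.0)],
    ("right", "stay"): [(1.0, "right", 0.0)],
    ("right", "move"): [(1.0, "left", 0.0)],
}

def step(state, action, rng):
    """Sample (next_state, reward) from the transition distribution."""
    outcomes = TRANSITIONS[(state, action)]
    u = rng.random()
    cumulative = 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    return outcomes[-1][1], outcomes[-1][2]  # guard against rounding

def rollout(policy, start, horizon, rng):
    """Follow `policy` (a dict state -> action) and return the discounted return."""
    state, total = start, 0.0
    for t in range(horizon):
        action = policy[state]
        state, reward = step(state, action, rng)
        total += (GAMMA ** t) * reward
    return total
```

For example, `rollout({"left": "move", "right": "stay"}, "left", 50, random.Random(0))` returns the discounted value of the single reward collected when the agent first reaches `"right"`.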

However, in many environments, the underlying dynamics, e.g. the transition probabilities, are not known. Algorithms that can learn without knowing the dynamics are called model-free, and there are two main approaches: Q-learning and policy gradient.

2.1 Q-Learning

In Q-learning, the agent learns a Q-value function that gives the expected total return for a given state and action pair. At each time step, the agent acts with a greedy policy $\pi(s) = \arg\max_{a} Q(s, a)$, picking the action that maximizes the Q function.

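Definition 2.2 (Q-Value function)

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]$$

where $\gamma$ is the discount factor and $r_t$ is the reward at time step $t$. The Q-value function given a state and action pair is optimal, written $Q^{*}$, when the agent uses the greedy policy. The Q-value function involves the expectation of the return over all future time steps, which is hard to learn directly. One can apply the Bellman operator to convert the equation into a recursive one:

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]$$

We can then apply the value iteration algorithm to obtain an iterative update formula for learning the Q-values:

$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q_{i}(s', a') \,\middle|\, s, a \right]$$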

This method works well when the state and action spaces are small enough that a table can keep track of all state-action pairs. However, when the state-action space becomes large, it is infeasible to compute the optimal Q-value function exactly. Thus, in practice, Q-learning uses a function approximator instead.
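To make the tabular case concrete, here is a minimal sketch of Q-value iteration on a three-state chain with deterministic dynamics, so the expectation over next states is trivial. The environment, the function names, and the number of sweeps are illustrative assumptions, not part of the paper.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # state 2 is terminal; actions: 0 = left, 1 = right

def model(state, action):
    """Hypothetical deterministic dynamics: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_value_iteration(n_sweeps=100):
    """Repeatedly apply Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_sweeps):
        new_q = [row[:] for row in q]
        for s in range(N_STATES - 1):  # the terminal state keeps Q = 0
            for a in range(N_ACTIONS):
                s2, r, done = model(s, a)
                new_q[s][a] = r + (0.0 if done else GAMMA * max(q[s2]))
        q = new_q
    return q
```

With a discount of 0.9, moving right from state 1 is worth 1, moving right from state 0 is worth 0.9, and either left move is worth 0.81, which is what the table converges to after a few sweeps.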

Deep Q-Network (DQN)

DQN [1] uses a neural network, which can be a deep convolutional network when dealing with high-dimensional state spaces such as pixels, to approximate the Q-value function. During each training step, the transition $(s_t, a_t, r_t, s_{t+1})$ is saved in an experience replay memory, and the agent draws samples from this memory to train the network, which increases sample efficiency. DQN also uses a separate target Q-network to provide the Q-value estimates used as training targets. The target network is only updated every fixed number of steps, which increases the stability of training.
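The two ingredients described above, a replay memory and targets computed from a separate target network, can be sketched as follows. This is a minimal illustration under assumptions: the class name `ReplayMemory`, the `done` flag stored with each transition, and the `target_q` callable (which stands in for the target network and returns one value per action) are all hypothetical.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Uniformly sample up to `batch_size` transitions without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)

def td_targets(batch, target_q, gamma):
    """y = r if the episode ended, else r + gamma * max_a' target_q(s')[a']."""
    return [r if done else r + gamma * max(target_q(s2))
            for (_s, _a, r, s2, done) in batch]
```

In a full agent, `target_q` would be a frozen copy of the Q-network whose weights are refreshed only periodically.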

2.2 Policy Gradients

Policy gradient methods directly learn the optimal actions without learning the values of states. The simplest policy gradient method, REINFORCE, also known as Monte Carlo Policy Gradient [4], is described below.

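Given a set of policies $\Pi = \{\pi_\theta\}$ parameterized by $\theta$, the expected return of a policy is defined as

$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[ r(\tau) \right]$$

where $\tau = (s_0, a_0, r_0, s_1, \dots)$ is the trajectory and $r(\tau)$ is its total reward.

The gradient of $J(\theta)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[ r(\tau)\, \nabla_\theta \log p(\tau; \theta) \right]$$

where we can use a Monte Carlo estimate from a sampled trajectory to find the gradient of $J(\theta)$:

$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

One can optimize a policy by performing gradient ascent on $J(\theta)$ with respect to $\theta$. Relying on the reward of a particular trajectory can cause large variance during training; one way to reduce it is to combine Q-learning and policy gradients, an approach called Actor-Critic.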
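A minimal NumPy sketch of the Monte Carlo policy gradient estimate may help here. It assumes a tabular softmax policy with one row of logits per state, so that the gradient of the log-probability of an action is the one-hot vector of that action minus the action probabilities. The function names and the trajectory format are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def grad_log_policy(theta, state, action):
    """Gradient of log pi_theta(action | state) for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    grad[state] = -softmax(theta[state])
    grad[state, action] += 1.0
    return grad

def reinforce_gradient(theta, trajectories):
    """Average over trajectories of r(tau) * sum_t grad log pi_theta(a_t | s_t).

    Each trajectory is (steps, ret): steps is a list of (state, action) pairs
    and ret is the total reward r(tau) of that trajectory.
    """
    total = np.zeros_like(theta)
    for steps, ret in trajectories:
        for state, action in steps:
            total += ret * grad_log_policy(theta, state, action)
    return total / len(trajectories)
```

A gradient ascent step is then `theta += learning_rate * reinforce_gradient(theta, trajectories)`. Because each term is scaled by the return of a whole trajectory, estimates from few trajectories can vary widely, which is the variance issue noted above.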

3 Related Work

Improving exploration and learning efficiency in environments with sparse rewards is an active area of research. Our approach falls under the category of using heuristics to make an informed exploration step instead of picking a random action. Similar ideas have been presented before. In their paper on predicting Atari game frames, Oh et al. [5] show that their deep neural network architecture can generate frames up to 100 steps ahead, conditioned on actions, with high accuracy. They use these predictions to guide exploration, choosing actions that lead to rarer states. The rarity of a state compared to recently visited states is computed with a Gaussian kernel. Similarly, Dilokthanakul et al. [6] proposed improving exploration in DQN through informed exploratory actions that encourage visiting states whose values have high uncertainty.
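The Gaussian-kernel rarity measure mentioned above can be sketched roughly as follows. This is a simplified stand-in using a plain radial basis function kernel, not the exact kernel of Oh et al. [5]; the function names and the `bandwidth` parameter are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_visit_score(candidate, recent_states, bandwidth=1.0):
    """Sum of RBF similarities between a candidate state and recent states.

    A lower score means the candidate is farther from what was visited.
    """
    recent = np.asarray(recent_states, dtype=float)
    sq_dist = np.sum((recent - np.asarray(candidate, dtype=float)) ** 2, axis=1)
    return float(np.sum(np.exp(-sq_dist / (2.0 * bandwidth ** 2))))

def least_visited(candidates, recent_states, bandwidth=1.0):
    """Index of the candidate next state with the lowest similarity score."""
    scores = [gaussian_kernel_visit_score(c, recent_states, bandwidth)
              for c in candidates]
    return int(np.argmin(scores))
```

Note that this score depends on raw distances, so state components with very different scales would typically need to be normalized first.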

The use of an intrinsic reward as a feedback signal is another popular approach. For example, Pathak et al. [7] introduced curiosity-driven exploration, which uses the error between the next state predicted by a forward dynamics model and the true next state as an intrinsic reward. The agent is trained to maximize the sum of the intrinsic and environmental rewards.

Methods that improve the sample efficiency of RL algorithms are also helpful in environments with sparse rewards. For example, in the paper Prioritized Experience Replay [8], Schaul et al. improved DQN by sampling transitions from the replay memory according to priority instead of uniformly. The key observation was that transitions that are more surprising, less redundant, and rarer provide more information for the agent to learn from. They showed that increasing the sampling frequency of these transitions results in faster learning. Azizzadenesheli et al. [9] proposed a novel RL algorithm that combines model-free and model-based methods to achieve better efficiency. They use a Generative Adversarial Network (GAN) to model the environment’s dynamics, together with a reward predictor. The algorithm uses these models for planning with Monte Carlo Tree Search (MCTS).
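As a small illustration of priority-based sampling, the sketch below draws transition indices with probability proportional to a priority raised to a power `alpha`, the proportional form described in the prioritized replay paper. The function names and the default `alpha` are illustrative choices, and the sketch omits the importance-sampling correction and the efficient data structures used in practice.

```python
import random

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i**alpha / sum_k p_k**alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Draw transition indices with replacement according to their priority."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)
```

A typical choice of priority is the magnitude of a transition's temporal-difference error, so that surprising transitions are replayed more often.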

4 DQN with Model-Based Exploration

The full algorithm is presented in Algorithm 1. The agent chooses between exploration and exploitation based on an $\epsilon$-greedy policy. Like the original DQN algorithm, our agent trains two Q-networks, one of which is a target Q-network that increases stability. Likewise, we use a replay memory and clip the error terms when training the Q-network. On top of the DQN algorithm, we also train a dynamics network that predicts the next state given a state and action pair. By combining this dynamics network with an explicit model of the distribution of recently visited states, our agent is able to pick an action that increases the chance of visiting unseen states during exploration.

4.1 Dynamics Network

In environments with sparse rewards, most if not all of the transitions in the replay memory have non-informative rewards, providing little signal for the agent to learn Q-values. However, we can use these transitions to train a neural network that predicts the next state $s_{t+1}$ given the current state $s_t$ and an action $a_t$. This network is crucial for making the guided exploration step. We use a fully connected feed-forward neural network (see Table 1). The dynamics network can be trained on the same transitions sampled from the experience replay memory that are used to train the Q-network. Therefore, implementing a prioritized replay memory would benefit the training of the dynamics network as well.
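The supervised training of a dynamics predictor can be sketched in plain NumPy as below. This is only an illustration under assumptions: it uses a single hidden ReLU layer with manual backpropagation rather than the authors' network, encodes the discrete action as a one-hot vector, and the class name `DynamicsModel`, its initialization, and its default values are hypothetical.

```python
import numpy as np

class DynamicsModel:
    """One-hidden-layer ReLU network: (state, one-hot action) -> next state."""

    def __init__(self, state_dim, n_actions, hidden=24, lr=0.02, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + n_actions
        self.n_actions, self.lr = n_actions, lr
        self.w1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, np.sqrt(2.0 / hidden), (hidden, state_dim))
        self.b2 = np.zeros(state_dim)

    def _features(self, states, actions):
        one_hot = np.eye(self.n_actions)[np.atleast_1d(actions)]
        return np.hstack([np.atleast_2d(states), one_hot])

    def predict(self, states, actions):
        """Predicted next states, shape (batch, state_dim)."""
        x = self._features(states, actions)
        return np.maximum(x @ self.w1 + self.b1, 0.0) @ self.w2 + self.b2

    def train_step(self, states, actions, next_states):
        """One gradient descent step on the mean squared error; returns the loss."""
        x = self._features(states, actions)
        pre = x @ self.w1 + self.b1
        h = np.maximum(pre, 0.0)
        err = h @ self.w2 + self.b2 - np.atleast_2d(next_states)
        loss = float(np.mean(err ** 2))
        g_out = 2.0 * err / err.size           # d loss / d prediction
        g_h = (g_out @ self.w2.T) * (pre > 0)  # backpropagate through the ReLU
        self.w2 -= self.lr * (h.T @ g_out)
        self.b2 -= self.lr * g_out.sum(axis=0)
        self.w1 -= self.lr * (x.T @ g_h)
        self.b1 -= self.lr * g_h.sum(axis=0)
        return loss
```

Each call to `train_step` can reuse the batch of `(state, action, next_state)` triples already sampled for the Q-network update, ignoring the reward.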

4.2 Guided Exploration

The most common way to explore under an $\epsilon$-greedy policy is to sample uniformly from the action space. However, as shown in Figure 5 (a), exploring with random actions results in 1) most visited states concentrating around the initial state, and 2) large areas of the state space never being visited.

The goal of guided exploration is to use the learned dynamics of the environment to choose an action by one-step planning during exploration, so that there is a better chance of reaching rare or unseen states. At a given state, we predict the next state for each action in the action space, and we pick the action that leads to the state least similar to the states we have seen.

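Unlike Oh et al. [5], who use a Gaussian kernel as the similarity measure, we propose a probabilistic approach to evaluating the rarity of a state compared to recently visited states. For simplicity and generality, we model the distribution of past states as a multivariate Gaussian whose parameters are the empirical mean $\mu$ and empirical covariance $\Sigma$ of the recently visited states:

$$p(s) = \mathcal{N}(s;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{k}\,|\Sigma|}} \exp\left(-\tfrac{1}{2}(s - \mu)^{\top} \Sigma^{-1} (s - \mu)\right)$$

where $k$ is the dimension of the state.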

We pick the exploratory action that leads to the next state with the lowest probability under this distribution. Explicitly modeling past states as a multivariate distribution has two advantages: 1) it takes into account the correlation between dimensions of the state; for example, in the Mountain Car environment, a higher velocity is more common when the car is at a higher position; 2) it accounts for the variance of each component, eliminating the need for normalization. As a result, our method provides better exploration than measuring similarity between states simply by distance metrics.
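The guided exploration step can be sketched as follows: fit a Gaussian to recently visited states, predict the next state for every action, and pick the action whose predicted state has the lowest density. This is a minimal illustration, assuming a callable `predict_next(state, action)` that stands in for the learned dynamics network; the function names are hypothetical, and the log density is used instead of the density for numerical convenience (the minimizing action is the same).

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) evaluated at x."""
    k = mean.shape[0]
    diff = x - mean
    _sign, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)

def pick_exploratory_action(state, recent_states, predict_next, n_actions):
    """One-step planning: choose the action whose predicted next state is
    least likely under a Gaussian fitted to recently visited states."""
    recent = np.asarray(recent_states, dtype=float)
    mean = recent.mean(axis=0)
    cov = np.atleast_2d(np.cov(recent, rowvar=False))  # empirical covariance
    log_probs = []
    for action in range(n_actions):
        predicted = np.asarray(predict_next(state, action), dtype=float)
        log_probs.append(gaussian_log_density(predicted, mean, cov))
    return int(np.argmin(log_probs))
```

Because the Mahalanobis term rescales each direction by the observed covariance, no separate normalization of the state components is needed, in line with the second advantage listed above.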

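Initialize replay memory M to capacity N
Initialize Q-network Q with random weights $\theta$
Initialize target Q-network $\hat{Q}$ with weights $\theta^{-} = \theta$
Initialize dynamics predictor D with random weights $\phi$
for episode = 1, E do
     for t = 1, T do
         Explore = True with probability $\epsilon$
         if Explore then
              Retrieve the last F states visited from transitions in M and store in $S_F$
              Compute mean $\mu$ and covariance $\Sigma$ of $S_F$
              Pick $a_t = \arg\min_{a} \mathcal{N}(D(s_t, a; \phi);\, \mu, \Sigma)$
         else
              Pick $a_t = \arg\max_{a} Q(s_t, a; \theta)$
         end if
         Execute $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
         Store transition $(s_t, a_t, r_t, s_{t+1})$ in M
         Sample a batch of transitions $(s_j, a_j, r_j, s_{j+1})$ uniformly from M
         Set $y_j = r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^{-})$ if episode not done, else $y_j = r_j$
         Perform gradient descent on $(y_j - Q(s_j, a_j; \theta))^2$ with respect to $\theta$ on the sampled batch
         Perform gradient descent on $\lVert s_{j+1} - D(s_j, a_j; \phi) \rVert^2$ with respect to $\phi$ on the sampled batch
         Every C steps set $\theta^{-} = \theta$
     end for
end for
Algorithm 1 DQN with Model-Based Exploration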
(a) Random
(b) Similarity by Gaussian kernel
(c) Our Method
Figure 5: Scatter plots of explored states in Mountain Car after 50 episodes of only exploration. Our method was able to explore a wider range of states. The best result out of 3 independent runs is plotted for each method. (a) Explore with random action. (b) Pick an action that leads to a state which has the least similarity (measured by a Gaussian kernel) with recently visited states. (c) Our proposed algorithm for exploration.

5 Experiments

We test the proposed algorithm on two classic simulated environments with sparse rewards: Mountain Car and Lunar Lander. We use OpenAI Gym’s [2] implementations (discrete actions version) of the two environments.

1) Evaluate improvement on exploration
We run our algorithm with only exploration and we visualize the states visited. We compare our result to two other exploration techniques: 1) random action 2) informed action by Gaussian Kernel similarity measure. Figure 5 and Figure 10 show the results for each environment respectively.

2) Evaluate improvement on learning speed
We evaluate the learning speed of our agent against two baselines: 1) original DQN, 2) Monte Carlo Policy Gradient. The running average of rewards for each environment is plotted in Figure 6 and Figure 11 respectively.
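For reference, a running average of episode rewards can be computed as in the short sketch below. The window length used for the figures is not stated in the text, so the `window` parameter and its default are purely hypothetical.

```python
def running_average(rewards, window=100):
    """Mean of the most recent `window` episode rewards at each episode."""
    averages, total = [], 0.0
    for i, r in enumerate(rewards):
        total += r
        if i >= window:
            total -= rewards[i - window]  # drop the reward leaving the window
        averages.append(total / min(i + 1, window))
    return averages
```

For the first few episodes the average is taken over the episodes seen so far, so the curve starts at the reward of the first episode.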

Our experiments showed that our proposed algorithm achieved significantly better exploration and learning speed in Mountain Car, but did not show any noticeable improvement in Lunar Lander.

Figure 6: The running average rewards over 400 training episodes for Mountain Car, comparing our method against the baselines. Our algorithm was able to improve rewards much sooner during training than the original DQN, which took at least 100 episodes to start making progress. The original DQN also failed to learn anything during one run, and Monte Carlo Policy Gradient failed to learn anything in all runs. Each plot involves 3 independent runs of each algorithm. Solid lines represent the mean, and shaded areas represent the range.

6 Limitations and Future Work

Our proposed algorithm depends on two strong assumptions: 1) the dynamics of the environment can be learned with high accuracy, 2) the distribution of recently visited states follows a multivariate Gaussian distribution. Violation of either assumption can result in poor performance, which limits the application of our algorithm to certain environments.

This is why our algorithm did not perform better than baselines in the Lunar Lander environment. Our dynamics predictor network fails to predict the next state with high accuracy, and it’s clear from Figure 10 that the explored states do not follow a Gaussian distribution.

In addition, our method is sensitive to high dimensionality in the state space. First, fitting a multivariate Gaussian to high-dimensional data has a high computational cost. Second, numerical issues become more likely as the dimension grows. For example, if certain dimensions of the state vector always have the same value, the covariance matrix will be singular.
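The singular-covariance case is easy to reproduce. The sketch below fits the empirical Gaussian and checks whether the covariance is rank deficient. The optional `jitter` term, which adds a small multiple of the identity to the covariance, is a common numerical workaround and is an assumption of this sketch; it is not something the paper proposes.

```python
import numpy as np

def fit_gaussian(states, jitter=0.0):
    """Empirical mean and covariance of the given states.

    `jitter` adds jitter * I to the covariance (a common workaround for
    singular matrices; not part of the paper's method).
    """
    states = np.asarray(states, dtype=float)
    mean = states.mean(axis=0)
    cov = np.atleast_2d(np.cov(states, rowvar=False))
    return mean, cov + jitter * np.eye(cov.shape[0])

def is_singular(cov, tol=1e-12):
    """True when the covariance matrix is numerically rank deficient."""
    return bool(np.linalg.matrix_rank(cov, tol=tol) < cov.shape[0])
```

For instance, states whose second component is always 1.0 give a covariance with a zero row and column, so the Gaussian density cannot be evaluated without some form of regularization.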

Future work and extensions:
Instead of fitting a multivariate Gaussian to recently visited states, one can adopt a distribution that fits the observed states better. This can improve accuracy of assigning probability to a given state, increasing the chance of finding a rarer state. Our exploratory action is chosen by a one-step planning. However, if the dynamics network can predict several steps ahead with high accuracy, one can instead perform an N-step planning to pick an action that maximizes the chance of finding a rare state N steps into the future. This can be effective for environments where reaching certain states requires temporally extended planning.
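The N-step extension described above could look roughly like the following exhaustive search. This is a sketch of the idea only, not an implemented or evaluated part of the method: `predict_next` stands in for the dynamics network, `log_density` for the fitted state distribution, and both names are hypothetical. The cost grows as the number of actions raised to the power `depth`, and model errors compound over the rollout.

```python
from itertools import product

def plan_n_steps(state, predict_next, log_density, n_actions, depth):
    """Roll out every action sequence of length `depth` with the learned model
    and return the first action of the sequence whose final predicted state
    has the lowest log density."""
    best_action, best_score = None, float("inf")
    for sequence in product(range(n_actions), repeat=depth):
        s = state
        for action in sequence:
            s = predict_next(s, action)
        score = log_density(s)
        if score < best_score:
            best_action, best_score = sequence[0], score
    return best_action
```

With `depth=1` this reduces to the one-step planning used in the proposed algorithm.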

(a) Random
(b) Similarity by Gaussian kernel
(c) Our Method
Figure 10: Scatter plots of explored states in Lunar Lander after 100 episodes of only exploration. The state has 8 dimensions in this environment. Only x-velocity and y-velocity are shown. Our method didn’t show any noticeable difference in the range of explored states. The best result out of 3 independent runs is plotted for each method. (a) Explore with random action. (b) Pick an action that leads to a state which has least similarity (measured by a Gaussian kernel) with recently visited states. (c) Our proposed algorithm for exploration.
Figure 11: The running average rewards of 500 training episodes for Lunar Lander, comparing our method against baselines. The policy gradient method achieved best result, while our method didn’t show sign of improvement over original DQN. Each plot involves 3 independent runs of each algorithm. Solid lines represent the mean, and shaded areas represent range.

7 Conclusion

In this paper, we proposed DQN with model-based exploration, an improved DQN algorithm that uses learned environment dynamics to guide exploration. We demonstrated that it outperformed the original DQN on Mountain Car, a classic environment with sparse rewards. Our algorithm explored a wider range of states and increased the learning speed. However, given the strong assumptions required, our method’s effectiveness is limited to certain types of environments. For example, our experiments showed that it did not perform better than the baseline algorithms in the Lunar Lander environment, where the recently visited states are not normally distributed. We presented several ways to extend or improve our method to handle a more diverse set of environments.

We used the following third-party code:

(1) Deep Q-Learning with Keras and Gym [10] as our starter code for the original DQN implementation.
(2) Reinforcement learning methods and tutorials [11] for our Monte Carlo Policy Gradient baseline.
(3) OpenAI Gym [2] for simulated environments: Mountain Car and Lunar Lander.



Hyper-parameter                                      Value
$\epsilon$ minimum                                   0.01
$\epsilon$ decay                                     0.9995
Reward discount                                      0.99
Learning rate (Q-network)                            0.05
Learning rate (dynamics network)                     0.02
Target Q-network update interval                     8
Initial exploration-only steps                       10,000
Minibatch size (Q-network)                           16
Minibatch size (dynamics predictor network)          64
Number of recent states to fit probability model     50

Q-Network (Fully Connected)
Loss                                                 mean squared error
Hidden Layer 1
  Units                                              48
  Activation                                         ReLU
  Initial weights                                    glorot uniform

Dynamics Predictor Network (Fully Connected)
Loss                                                 mean squared error
Hidden Layer 1
  Units                                              24
  Activation                                         ReLU
  Initial weights                                    glorot uniform
Hidden Layer 2
  Units                                              24
  Activation                                         ReLU
  Initial weights                                    glorot uniform

Table 1: Hyper-parameters for DQN with Model-Based Exploration (Mountain Car)
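To give a feel for the exploration schedule implied by the minimum and decay values in Table 1, the sketch below assumes that the exploration rate starts at 1.0 and is multiplied by the decay factor once per step until it reaches the minimum. Both the starting value and the per-step multiplicative rule are assumptions; the table only lists the decay factor and the minimum.

```python
def epsilon_schedule(n_steps, start=1.0, decay=0.9995, minimum=0.01):
    """Multiply epsilon by `decay` each step, never going below `minimum`."""
    eps, values = start, []
    for _ in range(n_steps):
        values.append(eps)
        eps = max(minimum, eps * decay)
    return values
```

Under these assumptions the exploration rate reaches its floor of 0.01 after roughly 9,200 steps.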

Hyper-parameter                                      Value
Learning rate                                        0.02
Reward discount                                      0.995

Neural Network (Policy)
Loss                                                 Softmax with cross entropy
Layer 1
  Units                                              10
  Activation                                         tanh
  Initial weights                                    ,
  Initial bias                                       0.1
Layer 2
  Units                                              dimension of action space
  Activation                                         None
  Initial weights                                    ,
  Initial bias                                       0.1

Table 2: Hyper-parameters for Policy Gradients