Combining Model-Free Q-Ensembles and Model-Based Approaches for Informed Exploration

Q-ensembles are a model-free approach in which input images are fed into several Q-networks and exploration is driven by the assumption that uncertainty is proportional to the variance of the output Q-values. They have been shown to perform well relative to other exploration strategies. Separately, model-based approaches such as encoder-decoder models have been used successfully for next-frame prediction given previous frames. This paper proposes to integrate model-free Q-ensembles with model-based approaches in the hope of compounding the benefits of both and achieving superior exploration as a result. Results show that a model-based trajectory-memory approach, when combined with Q-ensembles, produces better performance than using Q-ensembles alone.






1 Introduction and Related Work

Quantifying predictive uncertainty is a problem that has started to receive a lot of attention as Deep Neural Networks achieve state-of-the-art performance in a wide variety of domains such as computer vision krizhevsky2012imagenet, speech recognition hinton2012deep, natural language processing mikolov2013efficient and bio-informatics zhou2015predicting; alipanahi2015predicting, yet continue to produce overconfident estimates. These overconfident estimates can be detrimental or even harmful in practical applications. Quantifying predictive uncertainty, rather than reporting accuracy alone, is therefore an important problem. The contribution of our work is to combine an encoder-decoder model-based architecture and trajectory memory with the model-free Q-ensemble approach for the purpose of uncertainty estimation, which in the context of reinforcement learning drives exploration.

1.1 Neural Network Ensembles for Uncertainty Prediction

Current approaches to quantifying uncertainty have mostly been Bayesian: a prior distribution is specified over the parameters of the Neural Network and, using the training data, the computed posterior distribution over the parameters is used to calculate the uncertainty bernardo2009bayesian. Since this form of Bayesian inference is computationally intractable, approaches have ranged from Laplace approximation and Markov Chain Monte Carlo methods neal2012bayesian to Variational Bayesian Inference methods blundell2015weight; graves2011practical; louizos2016structured. These methods, however, suffer from bounds on computational power and over-reliance on the correctness of the prior probability distribution over the parameters; priors of convenience can in fact lead to unreasonable uncertainty estimates rasmussen2005healing. In order to overcome these challenges and produce a more robust uncertainty estimate, Lakshminarayanan et al. lakshminarayanan2017simple proposed using an ensemble of Neural Networks trained under a defined scoring rule. Compared to Bayesian approaches, this approach is much simpler, has parallelization advantages, and achieves state-of-the-art or better performance. State-of-the-art performance before this was achieved by MC-dropout, which can also in essence be considered an ensemble approach in which predictions are averaged over an ensemble of Neural Networks with parameter sharing srivastava2014dropout.

1.2 Model-Free Q-Ensembles

This ensemble approach for uncertainty estimation in Neural Networks motivated its use for exploration in Deep Reinforcement Learning in the form of Q-ensembles chen2018ucb. Specifically, an ensemble voting algorithm is proposed in which the agent acts based on a majority vote over the Q-ensemble. The exploration strategy uses the estimated confidence interval to optimistically explore in the direction of the largest confidence interval (highest uncertainty). This approach was demonstrated to improve significantly on the Atari benchmark. The Q-ensemble is an example of a model-free reinforcement learning approach, in which the agent does not need to learn a model of the environment. Model-free approaches, however, are generally high in sample complexity; sampling from a learned model of the environment can help mitigate this problem.
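As an illustration, a UCB-style selection rule over a Q-ensemble can be sketched as follows. This is a minimal sketch, not code from chen2018ucb; the variable names and the `lam` weighting of the ensemble standard deviation are our own assumptions.

```python
import statistics

def ucb_action(q_values_per_net, lam=0.1):
    """Pick the action maximizing mean Q plus an uncertainty bonus.

    q_values_per_net: one list of Q-values per ensemble member, each of
    length n_actions. `lam` weights the ensemble standard deviation,
    i.e. the width of the confidence interval used for optimism.
    """
    n_actions = len(q_values_per_net[0])
    scores = []
    for a in range(n_actions):
        qs = [q_net[a] for q_net in q_values_per_net]  # one value per member
        mean = statistics.mean(qs)
        std = statistics.pstdev(qs)  # disagreement across the ensemble
        scores.append(mean + lam * std)
    return max(range(n_actions), key=scores.__getitem__)
```

With `lam = 0` this degenerates to greedy action selection over the ensemble mean; larger `lam` biases the agent toward actions the networks disagree on.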

1.3 Model-Based Encoder-Decoder Approach to Next-Frame Prediction

Model-based approaches in the Atari environment (Arcade Learning Environment bellemare2013arcade) have been used successfully in the past for the problem of next-frame prediction. The game environments are high in complexity: the frames themselves are high-dimensional, involve tens of objects that are controlled directly or indirectly by agent actions, and include objects leaving and entering the frame. Oh et al. oh2015action described an encoder-decoder model that produces visually realistic, action-conditional next frames which can then be used for control in the game. They further described an informed exploration approach called trajectory memory that follows the ε-greedy strategy but, instead of exploring randomly, moves toward the frame that was least visited in the last n steps.

1.4 Need for Combining Model-Free and Model-Based Approaches

In order to address the issues related to modeling of the latent space, we propose taking advantage of a combination of the model-free and model-based approaches described so far. Typically, model-free approaches are very effective at learning complex policies, but convergence might require millions of trials and can end in globally sub-optimal local minima deisenroth2013survey. They can also model arbitrarily complex unknown dynamics, but are substantially less sample-efficient. On the other hand, model-based approaches have the theoretical benefit of generalizing to new tasks better and can reduce the number of trials required significantly deisenroth2015gaussian; levine2014learning, but require an environment model that is either well engineered or well learned to achieve this generalization. Prior attempts at combining the two approaches while retaining their relative advantages have met with some success bansal2017mbmf; chebotar2017combining; leibfried2016deep. We therefore combine an encoder-decoder model, and one that uses the informed trajectory-memory exploration strategy proposed by Oh et al. oh2015action, with the Q-ensemble, and report the results.

Section 2 describes the two methods we implemented for combining model-based methods and model-free Q-ensembles for exploration. Section 3 details our experimental setup. Section 4 presents the results we obtained and discusses the methods we attempted in light of our experiments and related work. Section 5 concludes the paper with directions for future work.

2 Methods

Figure 1: Combination of Auto-encoder and Q-Ensemble
(a) Ground Truth
(b) Predicted
(c) Ground Truth
(d) Predicted
Figure 2: Side-by-side Comparison of Ground Truth and Predicted Next Frames by Model-based Encoder-Decoder Approach at 744000 and 746000 Iterations Respectively.
Figure 3: Training and validation loss for frame prediction using an auto-encoder
Figure 4: DDQN and DDQN Ensemble using ε-greedy and UCB approaches

Figure 1 visually represents the combination of the model-based and model-free Q-ensemble approaches to guide exploration. The model-based encoder-decoder is first used to predict next frames for all possible actions. The Q-values associated with each predicted frame are then estimated using the Q-ensemble, and their variance is used to drive exploration. This is somewhat similar to the model predictive control (MPC) framework garcia1989model, except that instead of planning over the action tree, we simply repeat each action multiple times. Further details about the methods are provided in the following sections.

2.1 Method 1 - Feeding Auto-Encoder Images Directly To Q-Ensemble

In this method, we use the auto-encoder model to generate next-step frames for all of the actions and feed these predicted frames into a Q-ensemble. We use the variance in the predicted Q-values (for each predicted frame) to estimate how often the corresponding state has been visited during Q-learning (visit frequency). To drive exploration, we then simply pick the actions with high variance. Given a well-trained auto-encoder model, we can unroll many steps of predicted frames and select paths with high variance.
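This selection loop can be sketched as follows, assuming a hypothetical `predict_next_frame` function and a list of Q-network callables; the names, and scoring disagreement on the greedy value of each predicted frame, are illustrative assumptions rather than the exact implementation.

```python
import statistics

def most_uncertain_action(state, predict_next_frame, q_networks, n_actions):
    """Method 1 sketch: roll each action through a (hypothetical) frame
    model, score the predicted frame with every ensemble member, and
    pick the action whose Q-estimates disagree the most."""
    best_action, best_var = 0, float("-inf")
    for a in range(n_actions):
        frame = predict_next_frame(state, a)            # model-based step
        # each network returns a full Q-vector for the predicted frame;
        # disagreement is measured on the greedy value of that frame
        values = [max(q(frame)) for q in q_networks]
        var = statistics.pvariance(values)
        if var > best_var:
            best_action, best_var = a, var
    return best_action
```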

Figure 1 first shows the architecture of the encoder-decoder model, which is composed of encoding layers that extract spatio-temporal features from the input frames, action-conditional transformation layers that transform the encoded features into a next-frame prediction using action variables as an additional input, and finally decoding layers that map the high-level features back onto the pixel space.

2.1.1 Encoding, Action-Conditional Transformation, Decoding

Similar to oh2015action; leibfried2016deep, the convolutional layers use a feedforward encoding that takes a concatenated set of previous images and extracts spatio-temporal features from them. The convolutional layers are essentially a functional transformation from the pixel space to a high-level feature vector space, obtained by passing through multiple convolutional layers followed by a fully-connected layer at the end, each of which is followed by a non-linearity. The encoded feature vector is therefore

    h_enc_t = CNN(x_{t-m+1:t}),

where x_{t-m+1:t} denotes the last m frames of h x w pixels with c color channels at time t.

The encoded feature vector is now transformed using multiplicative interactions with the control variables. Ideally, the transformation would look as follows:

    h_dec_{t,i} = sum_{j,k} W_{i,j,k} a_{t,j} h_enc_{t,k} + b_i,

where h_dec_t is the action-transformed feature, a_t is the action vector at time t, W is a 3-way tensor weight and b is the bias term. The 3-way tensor allows the architecture to model a different transformation for each action, as has been demonstrated in prior work taylor2009factored; sutskever2011generating; memisevic2013learning, but computing it is not scalable. Therefore, a factored approximation of the 3-way tensor is used instead:

    h_dec_t = W_dec (W_h h_enc_t ⊙ W_a a_t) + b,

where ⊙ denotes element-wise multiplication.
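A minimal sketch of this factored action-conditional transformation, with plain-list linear algebra standing in for the learned weight matrices (all names and shapes are illustrative):

```python
def matvec(W, x):
    """Matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def action_transform(h_enc, a, W_h, W_a, W_dec, b):
    """Factored approximation of the 3-way tensor interaction:
    h_dec = W_dec * ((W_h h_enc) ⊙ (W_a a)) + b, with ⊙ element-wise.
    Because the action enters multiplicatively, each action induces a
    different effective linear transformation of the encoded features."""
    factored = [u * v for u, v in zip(matvec(W_h, h_enc), matvec(W_a, a))]
    return [o + bi for o, bi in zip(matvec(W_dec, factored), b)]
```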

De-convolutions perform the inverse operation of convolutions, transforming 1 x 1 spatial regions onto d x d regions using de-convolutional kernels. In the architecture we implemented, the de-convolution was performed as follows:

    x̂_{t+1} = Deconv(Reshape(h_dec_t)),

where Reshape is a fully-connected layer and Deconv consists of multiple de-convolution layers.

2.1.2 Modifying Q-Ensemble

For each state-action pair, the encoder-decoder model produces an image. We now need to determine which of these we are most uncertain about, in order to explore in that direction. For this purpose, the images are passed to several Q-networks. Each Q-network outputs one Q-value per action (9 in Ms. Pacman), so the total number of outputs is the number of networks (5 in our case) times the number of actions. A variance metric is then calculated over these outputs and we explore in the direction of highest variance:

    Action = argmax_{a in A} Var( { Q_i(T(s, a), a') : i = 1..N, a' in A } ),

where A is the set of actions, N is the number of Q-networks in the ensemble, T is the transition function (frame-prediction model) and Action denotes the action with highest uncertainty. Another way of estimating the variance is

    Action = argmax_{a in A} Var( { V_i(T(s, a)) : i = 1..N } ),

which converts the Q-estimates into value estimates of the predicted next state, V_i(s') = max_{a'} Q_i(s', a').

These methods of exploration can be combined with other common exploration strategies like ε-greedy, where instead of selecting random actions, we select the action with highest uncertainty.

2.2 Method 2 - Trajectory Memory and Q-Ensembles

The trajectory memory method as described in oh2015action uses an ε-greedy exploration policy. The trajectory memory measures the similarity between the predicted frame and the most recent frames, which gives the estimated visit frequency. We use the same hyper-parameter settings as described in oh2015action.
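The visit-frequency estimate can be sketched as a kernel similarity against the most recent frames. This is a minimal sketch: the Gaussian kernel form and the `sigma` bandwidth are assumptions, and frames are flattened to lists of floats.

```python
import math

def estimated_visit_frequency(pred_frame, recent_frames, sigma=1.0):
    """Trajectory-memory sketch: compare a predicted frame against the
    d most recent frames with a Gaussian similarity kernel and sum the
    similarities as a pseudo visit count."""
    def similarity(f, g):
        sq_dist = sum((x - y) ** 2 for x, y in zip(f, g))
        return math.exp(-sq_dist / sigma)
    return sum(similarity(pred_frame, f) for f in recent_frames)
```

A predicted frame close to many recent frames scores a high pseudo count; a novel frame scores near zero, which is what the informed exploration strategy seeks out.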

We can combine this exploration method with the variance in Q-ensembles to yield a more informed exploration strategy. This is similar to the previous method except that the uncertainty is calculated on the current frame instead of the predicted frames; the predicted frames are used only to estimate their visit frequency. The action selection procedure combining the two strategies is described in Algorithm 1.

1: procedure ACTION_SELECTOR(current state s_t)
2:     Initialize scores
3:     for each action a do
4:         predict the next frame ŝ = T(s_t, a)
5:         estimate the visit frequency n(ŝ) from the trajectory memory
6:         compute the ensemble uncertainty σ(a) over Q_i(s_t, a) and
7:         score(a) ← σ(a) − β · n(ŝ), where β trades off the two terms
8:     action ← argmax_a score(a)
9:     Decay ε, β
10: end procedure
Algorithm 1 Combined Exploration Strategy
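A sketch of this combined action selector in Python, with stand-in callables for the Q-ensemble, the frame model and the visit counter; the `beta` trade-off weight is an assumed hyper-parameter, not one reported here.

```python
import statistics

def combined_action(state, n_actions, q_networks, predict_next_frame,
                    visit_count, beta=1.0):
    """Combined strategy sketch: reward ensemble disagreement on the
    current state (model-free term) and penalize actions whose predicted
    next frame has been visited often (model-based term)."""
    scores = []
    for a in range(n_actions):
        qs = [q(state)[a] for q in q_networks]            # model-free
        uncertainty = statistics.pstdev(qs)
        novelty = -visit_count(predict_next_frame(state, a))  # model-based
        scores.append(uncertainty + beta * novelty)
    return max(range(n_actions), key=scores.__getitem__)
```

With a zero visit count this reduces to pure variance-driven exploration; a large pseudo count for an action's predicted frame steers the agent away from well-visited states.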

3 Experimental Setup

All our experiments are run on the Ms. Pacman Atari environment bellemare2013arcade, where exploration is challenging and important for achieving better scores. Below are the experimental settings and hyper-parameters used.

  • Network Architecture: For the Q-ensembles, we train 5 different Q-networks, each using a standard DQN architecture: Conv(32, 8, 4), Conv(64, 4, 2), Conv(64, 3, 1) and Linear(256) with ReLU activations throughout. The auto-encoder frame-prediction model is the same as the one used in oh2015action.

  • Optimization: For training the Q-Ensemble, we use Adam with a learning rate = 0.0001, weight decay = 0 and gradient norm clipped at 10 for every layer.

  • Training details: Batch size = 32, Training Frequency = 4 (train every 4 frames), Discount Factor (gamma) = 0.99, Size of Replay Memory = 10000, Target network sync frequency = 1000 (fully replaced with the weights from training network)

  • Frame Preprocessing: We use the same frame processing technique as used by mnih2015human (frame skipping, max over 4 frames, RGB to gray-scale).

  • Exploration: For ε-greedy based strategies, Initial Epsilon = 1.0, Final Epsilon = 0.01, Exploration Timesteps = 1000000. The UCB-like strategy uses an exploration coefficient λ ∈ {1.0, 0.1, 0.01, 0.001}.
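As a quick sanity check on the DQN architecture above, the feature-map sizes can be verified arithmetically, assuming the standard 84 x 84 preprocessed Atari input (our assumption; the input size is not stated here):

```python
def conv_out(size, kernel, stride):
    """Output side length of a valid (no padding) convolution."""
    return (size - kernel) // stride + 1

# Conv(32, 8, 4) -> Conv(64, 4, 2) -> Conv(64, 3, 1) on an 84 x 84 input
side = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    side = conv_out(side, kernel, stride)

flat_features = 64 * side * side  # flattened input to the Linear(256) layer
```

Under this assumption the final feature map is 7 x 7 x 64 = 3136 features feeding the Linear(256) layer.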

4 Results and Discussion

4.1 Separately Training Auto-Encoder and Q-Ensemble

We discuss the results of training the auto-encoder and Q-ensemble models separately, as reported in oh2015action and chen2018ucb respectively.

Model-Based Approach

Results from the next-frame prediction done by Oh et al.’s model-based approach oh2015action are shown in the figures. Figure 2 shows ground truth next frames and predicted next frames produced by the encoder-decoder model side-by-side. Figure 3 further shows the training and validation losses over 800000 iterations.

Model-Free Q-Ensemble

Figure 4 shows the results of training different Q-ensemble methods and the standard Double DQN implementation. Due to the inherent slowness of training ensembles, not all of the models were trained for the full 8000 epochs, but they were trained sufficiently to show that the ensemble with ε-greedy exploration outperforms the UCB and standard DDQN approaches, which replicates the results reported by Chen et al. chen2018ucb. These Q-ensemble models with different exploration strategies serve as the baselines to beat.

4.2 Method 1 - Combining Encoder-Decoder Model and Q-Ensemble

(a) Average Rewards (Every 100 episodes) (Training)
Figure 5: Loss (Training)
Figure 6: Method 1 - Feeding auto-encoder images directly

Fig 6 shows the reward and the reward averaged over every 100 episodes for the combination of the encoder-decoder and Q-ensemble models using different seeds. Fig 5 shows the loss for this method (Method 1). As can be seen, in comparison to just using the Q-ensemble, the combination hovers between the 400 and 800 reward mark without any improvement in training behavior, even after 2.5 million timesteps. Training this model is also very slow because of the multiple steps involved in action selection (generating next frames, feeding them to the Q-ensemble, calculating path uncertainties, if any). We restrict the path to just the immediate action but repeat the action multiple times (4) to obtain a reasonable difference in the next states predicted by the auto-encoder.

In our analysis we found that the frames predicted by the auto-encoder are themselves highly noisy, which can lead to poor uncertainty estimates from the Q-ensemble (the Q-ensemble has almost never seen such frames). This tends to bias the exploration in the wrong direction, where actions with noisy predicted frames are always preferred over unexplored actions. Our results for this method are quite similar to those obtained by oh2015action (Section 4.2) when they tried to replace the emulator with the frames predicted by the action-conditional model during testing.

4.3 Method 2 - Trajectory Memory and Q-Ensembles

(a) Average Rewards (Every 100 episodes) (Training)
Figure 7: Loss (Training)
Figure 8: Method 2 - Combining trajectory memory and q-ensemble estimates

Figures 6(a) and 7 show the reward and loss plots obtained during training. As seen from the figures, the combination of trajectory memory and Q-ensemble variance (UCB-like exploration) often yields better rewards and reaches higher rewards much more quickly than either of the baselines (Double DQN, Q-ensembles). However, the behavior is highly dependent on the starting seeds and can lead to a lot of variance in performance (as seen in 6(a)). Training is again slow (though not as slow as Method 1) and can take up to 12-15 hours to reach 1M timesteps. The results are encouraging and show that combining the estimated visit frequency and the variance in Q-estimates to drive exploration is much better than ε-greedy or plain UCB-like exploration.

5 Conclusion and Future Work

Intelligent exploration strategies, as opposed to dithering ones like ε-greedy, are important for Q-learning in large state-spaces. We find that combining trajectory-memory-based visit estimates with variance estimates from a Q-ensemble improves exploration and helps the agent reach better rewards much faster than other methods.

In the future, we hope to repeat these experiments on other Atari games like QBert or Seaquest, where exploration is harder. As observed, the frames predicted by the auto-encoder are themselves highly noisy, which can lead to poor uncertainty estimates from the Q-ensemble. Generator-discriminator methods that improve the quality of predicted future frames could be employed in place of the encoder-decoder models; to address the stability issues generally faced by GANs, architectures like Wasserstein GANs will also be an important direction of research. With more realistic frames, it is easier to obtain unbiased uncertainty estimates from the Q-ensemble to drive exploration. Finally, as exploration strategies become more complex, training becomes very slow, and it will be important to use computational tricks such as separating exploration from Q-network training and running multiple environments in parallel to reduce training time.