1 Introduction and Related Work
Quantifying predictive uncertainty has started to receive a lot of attention as deep neural networks [1] achieve state-of-the-art performance in a wide variety of domains such as computer vision [2], speech and language processing [3, 4], and bio-informatics [5, 6], yet continue to produce overconfident estimates. Such overconfident estimates can be detrimental or even harmful in practical applications [7]. Quantifying predictive uncertainty, and not just accuracy, is therefore an important problem. The contribution of our work is to combine an encoder-decoder model-based architecture and trajectory memory with the model-free Q-ensemble approach for uncertainty estimation, which in the context of reinforcement learning drives exploration.
1.1 Neural Network Ensembles for Uncertainty Prediction
Current approaches to quantifying uncertainty have mostly been Bayesian: a prior distribution is specified over the parameters of the neural network, and the posterior distribution computed from the training data is used to calculate the uncertainty [8]. Since exact Bayesian inference of this form is computationally intractable, approaches have ranged from the Laplace approximation [9] and Markov chain Monte Carlo methods [10] to variational Bayesian inference [11, 12, 13]. These methods, however, are limited by computational cost and rely heavily on the correctness of the prior distribution over the parameters; priors of convenience can in fact lead to unreasonable uncertainty estimates [14]. To overcome these challenges and produce more robust uncertainty estimates, Lakshminarayanan et al. [15] proposed using an ensemble of neural networks trained under a proper scoring rule. Compared to Bayesian approaches, this approach is much simpler, parallelizes easily, and achieves state-of-the-art or better performance. The previous state of the art was MC-dropout, which can itself be viewed as an ensemble approach in which predictions are averaged over an ensemble of neural networks with shared parameters [16].
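As a concrete illustration, the following is a minimal sketch (not the authors' code) of deep-ensemble uncertainty in the spirit of Lakshminarayanan et al. [15]: average the members' predictive distributions and use their disagreement as an uncertainty signal. The `models` list with a `predict_proba` method is an assumed interface.

```python
import numpy as np

def ensemble_predict(models, x):
    # Stack each member's class probabilities: shape (M, ..., num_classes).
    probs = np.stack([m.predict_proba(x) for m in models])
    mean = probs.mean(axis=0)         # ensemble predictive distribution
    disagreement = probs.var(axis=0)  # high variance signals epistemic uncertainty
    return mean, disagreement
```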
1.2 Model-Free Q-Ensembles
The success of ensembles for uncertainty estimation in neural networks motivated their use for exploration in deep reinforcement learning in the form of Q-ensembles [17]. Specifically, Chen et al. propose an ensemble voting algorithm in which the agent acts according to a majority vote over the Q-ensemble, together with a UCB-style exploration strategy that uses the estimated confidence interval to optimistically explore in the direction of the largest confidence interval (highest uncertainty). This approach was shown to improve significantly over an Atari benchmark. The Q-ensemble approach is model-free: the environment does not need to be modeled during learning. Model-free approaches, however, generally have high sample complexity, and sampling from a learned model of the environment can help mitigate this problem.
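A minimal sketch of UCB-style action selection with a Q-ensemble, in the spirit of Chen et al. [17]; `q_nets` (the ensemble) and the coefficient `lam` are placeholders for illustration, not the authors' implementation.

```python
import torch

def ucb_action(q_nets, state, lam=0.1):
    """Act optimistically: maximize mean plus lam times the ensemble std."""
    with torch.no_grad():
        # Each Q-network is assumed to return a vector of |A| Q-values.
        qs = torch.stack([q(state).flatten() for q in q_nets])  # (K, |A|)
    score = qs.mean(dim=0) + lam * qs.std(dim=0)
    return int(score.argmax().item())
```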
1.3 Model-Based Encoder-Decoder Approach to Next-Frame Prediction
Model-based approaches in the Atari environment (Arcade Learning Environment [18]) have been used successfully in the past for next-frame prediction. The game environments are complex, with high-dimensional frames, tens of objects controlled directly or indirectly by agent actions, and objects leaving and entering the frame. Oh et al. [19] described an encoder-decoder model that produces visually realistic next frames, which can then be used for action-conditional control in the game. They further described an informed exploration approach, called trajectory memory, that follows the ε-greedy strategy but, instead of exploring randomly, takes the action leading to the frame least visited in the last n steps.
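A hedged sketch of the trajectory-memory idea: estimate how often a predicted frame has been "visited" by summing its similarities to the most recent frames, and prefer the least-visited action. The Gaussian kernel and bandwidth `sigma` here are illustrative assumptions, not necessarily the exact similarity measure of [19].

```python
import numpy as np

def visit_frequency(pred_frame, recent_frames, sigma=100.0):
    """Sum of Gaussian similarities to recent frames approximates a visit count."""
    sq_dists = [np.sum((pred_frame - f) ** 2) for f in recent_frames]
    return float(sum(np.exp(-d / sigma) for d in sq_dists))

def least_visited_action(pred_frames_per_action, recent_frames):
    """Among the model-predicted next frames, choose the action least visited."""
    counts = [visit_frequency(p, recent_frames) for p in pred_frames_per_action]
    return int(np.argmin(counts))
```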
1.4 Need for Combining Model-Free and Model-Based Approaches
In order to address the issues related to modeling of the latent space, we propose taking advantage of a combination of the model-free and model-based approaches described so far. Typically, model-free approaches are very effective at learning complex policies, but convergence may require millions of trials and can end in globally sub-optimal local minima [20]. Model-based approaches, on the other hand, have the theoretical benefit of generalizing to new tasks better and significantly reducing the number of trials required [21, 22], but they need an environment model that is either well engineered or well learned to achieve this generalization. Model-free approaches can also model arbitrarily complex unknown dynamics better, but are substantially less sample efficient, as indicated earlier. Prior attempts at combining the two approaches while retaining their relative advantages have met with some success [23, 24, 25]. We therefore combine an encoder-decoder model, and one that uses the informed trajectory-memory exploration strategy proposed by Oh et al. [19], with a Q-ensemble, and report the results.
Section 2 describes the two methods we implemented for combining model-based methods with model-free Q-ensembles for exploration. Section 3 details our experimental setup. Section 4 presents our results and a discussion of the methods we attempted in light of our experiments and related work. Section 5 concludes the paper with directions for future work.
2 Methods

Figure 1 visually represents the combination of the model-based and model-free Q-ensemble approaches to guide exploration. The model-based encoder-decoder is first used to predict next frames for all possible actions. The Q-values associated with each predicted frame are then estimated using the Q-ensemble, and their variance is used to drive exploration. This is somewhat similar to the model predictive control (MPC) framework [26], except that instead of planning over the action tree, we simply repeat each action multiple times. Further details about the methods are provided in the following sections.
2.1 Method 1 - Feeding Auto-Encoder Images Directly To Q-Ensemble
In this method, we use the auto-encoder model to generate next-step frames for all of the actions and feed these predicted frames to a Q-ensemble. We use the variance in the predicted Q-values (for each predicted frame) to estimate how likely it is that the corresponding state has been visited during Q-learning (its visit frequency). To drive exploration, we then simply pick the actions for which the variance is high. Given a well-trained auto-encoder model, we can unroll many steps of predicted frames and select paths with high variance, as sketched below.
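The following is a small sketch of unrolling the frame-prediction model by repeating an action (four repeats in our experiments, see Section 4). The `model(frames, action)` interface returning the predicted next frame is an assumption for illustration.

```python
import torch

def unroll_action(model, frames, action, repeats=4):
    """Repeatedly apply the same action through the learned model.
    `frames` is assumed to be a (1, m, H, W) tensor of the m most recent frames."""
    with torch.no_grad():
        for _ in range(repeats):
            next_frame = model(frames, action)                       # (1, 1, H, W) prediction
            frames = torch.cat([frames[:, 1:], next_frame], dim=1)   # drop oldest, append newest
    return frames
```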
Figure 1 first shows the architecture of the encoder-decoder model, which is composed of encoding layers that extract spatio-temporal features from the input frames, action-conditional transformation layers that transform the encoded features into a next-frame prediction using the action variables as an additional input, and decoding layers that map the high-level features back onto pixel space.
2.1.1 Encoding, Action-Conditional Transformation, Decoding
Similar to [19, 25], the encoding uses feedforward convolutional layers that take a concatenated set of previous frames and extract spatio-temporal features from them. The encoder is essentially a functional transformation from pixel space to a high-level feature-vector space, obtained by passing the input through multiple convolutional layers followed by a fully-connected layer, each followed by a non-linearity. The encoded feature vector is therefore

$h^{enc}_t = \mathrm{CNN}(x_{t-m+1:t}),$

where $x_{t-m+1:t} \in \mathbb{R}^{(m \cdot c) \times h \times w}$ denotes $m$ frames of $h \times w$ pixels with $c$ color channels at time $t$.
The encoded feature vector is then transformed using multiplicative interactions with the control (action) variables. Ideally, the transformation would be

$h^{dec}_{t,i} = \sum_{j,l} W_{ijl} \, h^{enc}_{t,j} \, a_{t,l} + b_i,$

where $h^{dec}_t$ is the action-transformed feature, $a_t$ is the action vector at time $t$, $W$ is a 3-way tensor weight, and $b$ is the bias term. The 3-way tensor lets the architecture model a different transformation for each action, as demonstrated in prior work [27, 28, 29], but computing it directly is not scalable. Therefore, a factored approximation of the 3-way tensor is used instead:

$h^{dec}_t = W^{dec}\left(W^{enc} h^{enc}_t \odot W^{a} a_t\right) + b.$

De-convolutions perform the inverse operation of convolutions, transforming 1 x 1 spatial regions into d x d regions using de-convolutional kernels. In the architecture we implemented, the decoding is performed as

$\hat{x}_{t+1} = \mathrm{Deconv}\left(\mathrm{Reshape}(h^{dec}_t)\right),$

where Reshape is a fully-connected layer and Deconv consists of multiple deconvolution layers.
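The sketch below puts the three pieces together in PyTorch: a convolutional encoder, the factored multiplicative action interaction, and a deconvolutional decoder. Layer widths, kernel sizes, the number of stacked input frames, and the feature dimension are illustrative assumptions, not the exact published configuration of [19].

```python
import torch
import torch.nn as nn

class ActionConditionalModel(nn.Module):
    def __init__(self, in_frames=4, num_actions=9, feat_dim=2048):
        super().__init__()
        # Encoding: convolutional layers plus a fully-connected layer, ReLU after each.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames, 64, 8, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 6, stride=2), nn.ReLU(),
            nn.Conv2d(128, 128, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feat_dim), nn.ReLU(),
        )
        # Factored approximation of the 3-way tensor:
        # h_dec = W_dec (W_enc h_enc * W_a a) + b
        self.w_enc = nn.Linear(feat_dim, feat_dim, bias=False)
        self.w_act = nn.Linear(num_actions, feat_dim, bias=False)
        self.w_dec = nn.Linear(feat_dim, feat_dim)
        # Decoding: reshape (fully connected) then deconvolution layers.
        self.reshape = nn.Linear(feat_dim, 128 * 10 * 10)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 128, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 6, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 8, stride=2),
        )

    def forward(self, frames, action_onehot):
        h_enc = self.encoder(frames)                                      # (B, feat_dim)
        h_dec = self.w_dec(self.w_enc(h_enc) * self.w_act(action_onehot)) # multiplicative interaction
        x = self.reshape(h_dec).view(-1, 128, 10, 10)
        return self.decoder(x)                                            # predicted next frame
```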
2.1.2 Modifying Q-Ensemble
For each state-action pair, the encoder-decoder model produces a predicted image. We now need to determine which of these we are most uncertain about, in order to explore in that direction. For this purpose, the predicted images are passed to several Q-networks. Each Q-network outputs one Q-value per action (9 in Ms. Pacman), so the total number of outputs is the number of networks (5 in our case) times the number of actions. A variance metric is then calculated over all of these outputs, and we explore in the direction of highest variance:
$a^{*} = \arg\max_{a \in \mathcal{A}} \; \mathrm{Var}_{k \in \{1,\dots,K\},\, a' \in \mathcal{A}} \big[ Q_k(T(s, a), a') \big],$

where $|\mathcal{A}|$ is the number of actions, $K$ is the number of Q-networks in the ensemble, $T$ is the transition function (frame-prediction model), and $a^{*}$ denotes the action with highest uncertainty. Another way of estimating the variance is to first convert the Q-estimates into value estimates of the predicted next state $T(s, a)$,

$a^{*} = \arg\max_{a \in \mathcal{A}} \; \mathrm{Var}_{k \in \{1,\dots,K\}} \big[ \max_{a'} Q_k(T(s, a), a') \big].$
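A sketch of the two variance metrics above. The interfaces are assumptions for illustration: `model(state, a)` plays the role of $T(s, a)$ and returns the predicted next frame, and each Q-network returns a vector of $|\mathcal{A}|$ Q-values.

```python
import torch

def most_uncertain_action(model, q_nets, state, actions, use_value_variance=False):
    scores = []
    with torch.no_grad():
        for a in actions:
            next_frame = model(state, a)                                  # T(s, a)
            qs = torch.stack([q(next_frame).flatten() for q in q_nets])   # (K, |A|)
            if use_value_variance:
                # Variance of the value estimates V_k = max_a' Q_k(T(s, a), a')
                scores.append(qs.max(dim=1).values.var())
            else:
                # Variance over all K * |A| Q-outputs
                scores.append(qs.var())
    return actions[int(torch.stack(scores).argmax().item())]
```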
These exploration methods can be combined with other common exploration strategies such as ε-greedy, where instead of selecting random actions we select the action with the highest uncertainty.
2.2 Method 2 - Trajectory Memory and Q-Ensembles
The trajectory memory method described in [19] uses an ε-greedy exploration policy in which the similarity between the predicted frame and the most recent frames is calculated to give an estimated visit frequency. We use the same hyper-parameter settings as described in [19].
We can combine this exploration method with the variance of the Q-ensemble to yield a more informed exploration strategy. This is similar to the previous method, except that the uncertainty is calculated on the current frame instead of the predicted frames; the predicted frames are used only to estimate their visit frequency. The procedure for selecting actions using the combined strategies is described in Algorithm 1.
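Since Algorithm 1 is not reproduced here, the following is only a plausible sketch of how the two signals might be combined: per-action disagreement of the Q-ensemble on the current frame, and an estimated visit frequency of each action's predicted next frame. The weighting `beta` and the `visit_fn` / `model` interfaces are assumptions for illustration, not the authors' exact algorithm.

```python
import torch

def combined_explore_action(q_nets, model, frames, actions, visit_fn, beta=1.0):
    with torch.no_grad():
        # Ensemble Q-values on the *current* frame stack: shape (K, |A|).
        qs = torch.stack([q(frames).flatten() for q in q_nets])
        uncertainty = qs.var(dim=0)                         # per-action disagreement
        # Estimated visit frequency of each action's predicted next frame.
        visits = torch.tensor([visit_fn(model(frames, a)) for a in actions])
    score = uncertainty + beta / (1.0 + visits)             # favor uncertain, rarely visited actions
    return actions[int(score.argmax().item())]
```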
3 Experimental Setup
All our experiments are run on the Ms. Pacman Atari environment [18], where exploration is challenging and important for achieving good scores. The experimental settings and hyper-parameters are listed below.
- Network architecture: For the Q-ensembles, we train 5 different Q-networks, each using a standard DQN architecture: Conv(32, 8, 4), Conv(64, 4, 2), Conv(64, 3, 1) and Linear(256), with ReLU activations throughout (a sketch of one ensemble member is given after this list). The auto-encoder frame-prediction model is the same as the one used in [19].
- Optimization: For training the Q-ensemble, we use Adam with learning rate 0.0001, weight decay 0, and the gradient norm clipped at 10 for every layer.
- Training details: Batch size 32, training frequency 4 (train every 4 frames), discount factor (gamma) 0.99, replay memory size 10000, target-network sync frequency 1000 (the target network is fully replaced with the weights of the training network).
- Frame preprocessing: We use the same frame-processing technique as [30] (frame skipping, max over 4 frames, RGB to gray-scale).
- Exploration: For ε-greedy-based strategies, initial epsilon 1.0, final epsilon 0.01, and 1000000 exploration timesteps. The UCB-like strategy uses exploration coefficients in {1.0, 0.1, 0.01, 0.001}.
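The sketch below shows one ensemble member with the architecture and optimizer settings listed above. The input shape (4 stacked 84x84 frames, giving a 64x7x7 feature map before the fully-connected layer) follows the standard preprocessing of [30] and is an assumption here.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv(32, 8, 4)
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv(64, 4, 2)
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv(64, 3, 1)
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),                  # Linear(256)
            nn.Linear(256, num_actions),                            # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)

q_net = DQN()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4, weight_decay=0)
# During training, gradient norms can be clipped (the setup above clips at 10), e.g.:
# torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm=10)
```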
4 Results and Discussion
4.1 Separately Training Auto-Encoder and Q-Ensemble
We first discuss the results of training the auto-encoder and the Q-ensemble models separately, as reported in [19] and [17] respectively.
Model-Based Approach
Results of next-frame prediction by Oh et al.'s model-based approach [19] are shown in the figures. Figure 2 shows ground-truth next frames and the predicted next frames produced by the encoder-decoder model side by side. Figure 3 shows the training and validation losses over 800000 iterations.
Model-Free Q-Ensemble
Figure 4 shows the results of training the different Q-ensemble methods and a standard Double DQN implementation. Because training ensembles is inherently slow, not all models were trained for the full 8000 epochs, but they were trained long enough to conclude that the ensemble with ε-greedy exploration outperforms the UCB and standard DDQN approaches, replicating the results reported by Chen et al. [17]. These Q-ensemble models with different exploration strategies serve as the baselines to beat.
4.2 Method 1 - Combining Encoder-Decoder Model and Q-Ensemble
Figure 6 shows the reward and the reward averaged per 100 instances for the combination of the encoder-decoder and Q-ensemble models using different seeds, and Figure 5 shows the loss for this method (Method 1). In comparison to using the Q-ensemble alone, the combination hovers between the 400 and 800 reward mark without any improvement in training behavior even after 2.5 million timesteps. Training this model is also very slow because of the multiple steps involved in action selection (generating next frames, feeding them to the Q-ensemble, and calculating path uncertainties, if any). We restrict the path to just the immediate action, but repeat the action multiple times (4) to obtain a reasonable difference in the next states predicted by the auto-encoder.
In our analysis, we found that the frames predicted by the auto-encoder are themselves highly noisy, which can lead to poor uncertainty estimates from the Q-ensemble (the Q-ensemble has almost never seen such frames). This tends to bias exploration in the wrong direction, where actions with noisy predicted frames are always preferred over genuinely unexplored actions. Our results for this method are quite similar to those obtained by Oh et al. [19] (Section 4.2) when they tried to replace the emulator with the frames predicted by the action-conditional model during testing.
4.3 Method 2 - Trajectory Memory and Q-Ensembles
Figures 6(a) and 7 show the reward and loss plots obtained during training. As seen from the figures, the combination of trajectory memory and Q-ensemble variance (UCB-like exploration) often yields better rewards and reaches higher rewards much more quickly than either of the baselines (Double DQN, Q-ensembles). However, the behavior is highly dependent on the starting seeds and can lead to a lot of variance in performance (as seen in Figure 6(a)). Training is again slow (though not as slow as Method 1) and can take up to 12-15 hours to reach 1M timesteps. The results are encouraging and show that combining the estimated visit frequency with the variance in Q-estimates to drive exploration is much better than ε-greedy or plain UCB-like exploration.
5 Conclusion and Future Work
Intelligent exploration strategies, beyond dithering ones like ε-greedy, are important for Q-learning in large state spaces. We find that combining trajectory-memory-based visit estimates with variance estimates from a Q-ensemble improves exploration and helps the agent reach better rewards much faster than the other methods.
In the future, we hope to repeat these experiments on other Atari games such as QBert or Seaquest, where exploration is harder. As observed, the frames predicted by the auto-encoder are themselves highly noisy, which can lead to poor uncertainty estimates from the Q-ensemble. Generator-discriminator methods that improve the quality of predicted future frames could be employed in place of the encoder-decoder models. To address the stability issues generally faced by GANs, architectures such as Wasserstein GANs will also be an important direction of research. With more realistic frames, it is easier to obtain unbiased uncertainty estimates from the Q-ensemble to drive exploration. Finally, as exploration strategies become more complex, training becomes very slow, and it will be important to use computational techniques such as separating exploration from Q-network training and running multiple environments in parallel to reduce training time.
References
- [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
- [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
- [4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- [5] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10):931, 2015.
- [6] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of dna-and rna-binding proteins by deep learning. Nature biotechnology, 33(8):831, 2015.
- [7] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- [8] José M Bernardo and AF Smith. Bayesian theory, vol. 405. John Wiley & Sons, 2009.
- [9] David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
- [10] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- [11] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- [12] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
- [13] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
- [14] Carl Edward Rasmussen and Joaquin Quinonero-Candela. Healing the relevance vector machine through augmentation. In Proceedings of the 22nd international conference on Machine learning, pages 689–696. ACM, 2005.
- [15] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6405–6416, 2017.
- [16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [17] Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. Ucb exploration via q-ensembles. 2018.
- [18] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.(JAIR), 47:253–279, 2013.
- [19] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
- [20] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
- [21] Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
- [22] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
- [23] Somil Bansal, Roberto Calandra, Sergey Levine, and Claire Tomlin. Mbmf: Model-based priors for model-free reinforcement learning. arXiv preprint arXiv:1709.03153, 2017.
- [24] Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078, 2017.
- [25] Felix Leibfried, Nate Kushman, and Katja Hofmann. A deep learning approach for joint video frame and reward prediction in atari games. arXiv preprint arXiv:1611.07078, 2016.
- [26] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: theory and practice—a survey. Automatica, 25(3):335–348, 1989.
- [27] Graham W Taylor and Geoffrey E Hinton. Factored conditional restricted boltzmann machines for modeling motion style. In Proceedings of the 26th annual international conference on machine learning, pages 1025–1032. ACM, 2009.
- [28] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
- [29] Roland Memisevic. Learning to relate images. IEEE transactions on pattern analysis and machine intelligence, 35(8):1829–1846, 2013.
- [30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.