1 Introduction and Related Work
Quantifying predictive uncertainty is a problem that has started to receive a lot of attention as Deep Neural Networks achieve state-of-the-art performance [1] in a wide variety of domains such as computer vision [2], speech recognition [3], natural language processing [4] and bioinformatics [5, 6], yet continue to produce overconfident estimates. These overconfident estimates can be detrimental or even harmful for practical applications [7]. Therefore, quantifying predictive uncertainty, rather than reporting accuracy alone, is an important problem. The contribution of our work is to combine an encoder-decoder model-based architecture and trajectory memory with the model-free Q-ensemble approach for the purpose of uncertainty estimation, which in the context of reinforcement learning drives exploration.
1.1 Neural Network Ensembles for Uncertainty Prediction
Current approaches to quantifying uncertainty have mostly been Bayesian, where a prior distribution is specified over the parameters of the neural network and, using the training data, the computed posterior distribution over the parameters is used to calculate the uncertainty [8]. Since this form of Bayesian inference is computationally intractable, approaches have ranged from the Laplace approximation [9] and Markov Chain Monte Carlo methods [10] to variational Bayesian inference methods [11, 12, 13]. These methods, however, suffer from computational constraints and from over-reliance on the correctness of the prior probability distribution over the parameters. Priors chosen for convenience can in fact lead to unreasonable uncertainty estimates [14]. In order to overcome these challenges and produce more robust uncertainty estimates, Lakshminarayanan et al. [15] proposed using an ensemble of neural networks trained under a defined scoring rule. Compared to Bayesian approaches, this approach is much simpler, parallelizes easily and achieves state-of-the-art or better performance. State-of-the-art performance before this was achieved by MC-dropout, which can in essence also be considered an ensemble approach in which predictions are averaged over an ensemble of neural networks with parameter sharing [16].
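To make the ensemble recipe concrete, the following is a minimal sketch of deep-ensemble uncertainty estimation in the spirit of [15]: K independently initialized networks are trained on the same data, and the spread of their predictions serves as the uncertainty estimate. The toy data and layer sizes are our own illustrations, not taken from [15] (which additionally trains under a proper scoring rule such as the negative log-likelihood and uses adversarial examples).

```python
# Sketch: a 5-member deep ensemble on a toy 1-D regression problem.
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.linspace(-3, 3, 128).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)           # toy regression data

ensemble = [make_net() for _ in range(5)]
for net in ensemble:                                    # each member trained independently
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)        # simple squared-error stand-in for the scoring rule
        loss.backward()
        opt.step()

with torch.no_grad():
    preds = torch.stack([net(x) for net in ensemble])   # (K, N, 1)
    mean, var = preds.mean(0), preds.var(0)             # ensemble mean and disagreement (uncertainty)
```

Because each member starts from a different random initialization, the predictions agree on well-covered inputs and diverge elsewhere; this disagreement is exactly the signal exploited for exploration below.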
1.2 Model-Free Q-Ensembles
This ensemble approach to uncertainty estimation in neural networks motivated its use for estimating uncertainty for exploration in deep reinforcement learning, in the form of Q-ensembles [17].
Specifically, an ensemble voting algorithm is proposed in which the agent takes actions based on a majority vote over the Q-ensemble. The exploration strategy described uses the estimate of the confidence interval to explore optimistically in the direction of the largest confidence interval (highest uncertainty). This approach was demonstrated to improve significantly over an Atari benchmark. The Q-ensemble approach is an example of a model-free reinforcement learning approach, in which the environment does not need to be modeled during learning. Model-free approaches are, however, generally high in sample complexity; sampling from a learned model of the environment can help mitigate this problem.
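As a sketch of the two action-selection rules described in [17], assuming a hypothetical (K, A) array `q_values` of Q-estimates from K ensemble members over A actions:

```python
import numpy as np

def majority_vote_action(q_values):
    greedy = q_values.argmax(axis=1)            # each member's greedy action
    return np.bincount(greedy).argmax()         # most-voted action across the ensemble

def ucb_action(q_values, lam=0.1):
    mean = q_values.mean(axis=0)                # ensemble mean per action
    std = q_values.std(axis=0)                  # disagreement used as uncertainty
    return (mean + lam * std).argmax()          # optimism in the face of uncertainty

q = np.random.randn(5, 9)                       # e.g., 5 networks, 9 actions (Ms. Pacman)
print(majority_vote_action(q), ucb_action(q))
```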
1.3 Model-Based Encoder-Decoder Approach to Next-Frame Prediction
Model-based approaches in the Atari environment (Arcade Learning Environment [18]) have been used successfully in the past for the problem of next-frame prediction. The game environments are high in complexity: the frames themselves are high-dimensional, involve tens of objects that are controlled directly or indirectly by agent actions, and include objects leaving and entering the frame. Oh et al. [19] described an encoder-decoder model that produces visually realistic next frames, which can then be used as action-conditional controls in the game. They further described an informed exploration approach, called trajectory memory, that follows the ε-greedy strategy but, instead of exploring randomly, leads to the frame that was least visited in the last n steps.
1.4 Need for Combining Model-Free and Model-Based Approaches
In order to address the issues related to modeling of the latent space, we propose taking advantage of a combination of the model-free and model-based approaches described so far. Typically, model-free approaches are very effective at learning complex policies, but convergence might require millions of trials and can lead to globally suboptimal local minima [20]. On the other hand, model-based approaches have the theoretical benefit of generalizing to new tasks better and reducing the number of trials required significantly [21, 22], but they need an environment model that is either well engineered or well learned to achieve this generalization. Another advantage of the model-free approach is that it can model arbitrarily complex unknown dynamics better, though it is substantially less sample-efficient, as indicated earlier. Prior attempts at combining the two approaches while retaining their relative advantages have met with some success [23, 24, 25]. We therefore combine an encoder-decoder model, together with the informed trajectory memory exploration strategy proposed by Oh et al. [19], with the Q-ensemble, and report the results.
Section 2 describes the two methods we implemented for combining model-based methods and model-free Q-ensembles for exploration. Section 3 details our experimental setup. Section 4 presents the results we obtained and discusses the methods we attempted in light of our experiments and related work. Section 5 concludes the paper with pointers to work we intend to do in the future.
2 Methods
Figure 1 visually represents the combination of the model-based and model-free Q-ensemble approaches to guide exploration. The model-based encoder-decoder is first used to predict next frames over all possible actions. Given each of those actions, the Q-values associated with them are predicted using the Q-ensemble, and the variance is used to drive exploration. This is somewhat similar to the model predictive control (MPC) framework [26] (but instead of planning over the action tree, we simply repeat each action multiple times). Further details about the methods are provided in the following sections.
2.1 Method 1: Feeding Auto-Encoder Images Directly to the Q-Ensemble
In this method, we use the auto-encoder model to generate next-step frames for all of the actions and feed these predicted frames to a Q-ensemble. We use the variance in the predicted Q-values (for each predicted frame) to estimate how often that state has been visited during Q-learning (visit frequency). To drive exploration, we then simply pick the actions for which there is high variance. Given a well-trained auto-encoder model, we can unroll many steps of predicted frames and select paths with high variance.
Figure 1 also shows the architecture of the encoder-decoder model, which is composed of encoding layers that extract spatio-temporal features from the input frames, action-conditional transformation layers that transform the encoded features into a next-frame prediction by using action variables as an additional input, and finally decoding layers that map the high-level features back onto the pixel space.
2.1.1 Encoding, Action-Conditional Transformation, Decoding
Similar to [19, 25], the convolutional layers use a feedforward encoding that takes a concatenated set of previous images and extracts spatio-temporal features from them. The convolutional layers are essentially a functional transformation from the pixel space to a high-level feature vector space, implemented by passing through multiple convolutional layers followed by a fully-connected layer at the end, each of which is followed by a non-linearity. The encoded feature vector is therefore
$$h^{enc}_t = \mathrm{CNN}(x_{t-m+1:t}),$$
where $x_{t-m+1:t} \in \mathbb{R}^{(m \times c) \times h \times w}$ denotes $m$ frames of $h \times w$ pixels with $c$ color channels at time $t$.
The encoded feature vector is now transformed using multiplicative interactions with the control variables. Ideally, the transformation would look as follows:
$$h^{dec}_{t,i} = \sum_{j,l} W_{ijl}\, h^{enc}_{t,j}\, a_{t,l} + b_i,$$
where $h^{dec}_t$ is the action-transformed feature, $a_t$ is the action vector at time $t$, $W$ is a 3-way tensor weight and $b$ is the bias term. The 3-way tensor allows the architecture to model different transformations for different actions, as has been demonstrated in prior work [27, 28, 29], but computing it is not scalable. Therefore, an approximation of the 3-way tensor is used instead:
$$h^{dec}_t = W^{dec}\left(W^{enc} h^{enc}_t \odot W^{a} a_t\right) + b,$$
where $\odot$ denotes element-wise multiplication. Deconvolutions using CNNs perform the inverse operation of convolutions, transforming $1 \times 1$ spatial regions onto $d \times d$ regions using deconvolutional kernels. In the architecture we implemented, the decoding was performed as
$$\hat{x}_{t+1} = \mathrm{Deconv}\left(\mathrm{Reshape}\left(h^{dec}_t\right)\right),$$
where Reshape is a fully-connected layer and Deconv consists of multiple deconvolution layers.
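A minimal PyTorch sketch of the encoding, factored action-conditional transformation and deconvolutional decoding described above is given below; the layer sizes, hidden dimension and frame shapes are illustrative assumptions, not the exact configuration of [19].

```python
import torch
import torch.nn as nn

class ActionConditionalModel(nn.Module):
    def __init__(self, n_actions, in_frames=4, hidden=2048):
        super().__init__()
        # Encoding: convolutional stack followed by a fully-connected layer.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames, 64, 8, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 6, stride=2), nn.ReLU(),
            nn.Conv2d(128, 128, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden), nn.ReLU(),
        )
        # Factored approximation of the 3-way tensor:
        # h_dec = W_dec(W_enc h_enc * W_a a) + b.
        self.w_enc = nn.Linear(hidden, hidden, bias=False)
        self.w_act = nn.Linear(n_actions, hidden, bias=False)
        self.w_dec = nn.Linear(hidden, 128 * 8 * 8)
        # Decoding: Reshape (above) then deconvolutions back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 128, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 6, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 8, stride=2),
        )

    def forward(self, frames, action_onehot):
        h_enc = self.encoder(frames)                            # spatio-temporal features
        h_dec = self.w_enc(h_enc) * self.w_act(action_onehot)   # multiplicative interaction
        h_dec = self.w_dec(h_dec).view(-1, 128, 8, 8)           # Reshape to spatial feature maps
        return self.decoder(h_dec)                              # predicted next frame
```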
2.1.2 Modifying the Q-Ensemble
For each state-action pair, the encoder-decoder model produces a predicted next frame. We now need to determine which of these we are most uncertain about, in order to explore in that direction. For this purpose, the predicted frames are passed to several Q-networks. Each Q-network provides an estimate for every action and therefore produces a number of outputs equal to the number of actions (9 in Ms. Pacman). The total number of outputs is therefore the number of networks (5 in our case) times the number of actions. A variance metric is then calculated over all of these outputs, and we explore in the direction of highest variance:
$$\mathrm{Action} = \arg\max_{a \in \{1,\dots,A\}} \operatorname{Var}\left(\left\{Q_k\!\left(T(s_t, a), a'\right) : k \in \{1,\dots,K\},\ a' \in \{1,\dots,A\}\right\}\right),$$
where $A$ is the number of actions, $K$ is the number of Q-networks (the ensemble), $T$ is the transition function (frame prediction model) and $\mathrm{Action}$ denotes the action with highest uncertainty. Another way of estimating the variance is
$$\mathrm{Action} = \arg\max_{a \in \{1,\dots,A\}} \operatorname{Var}_{k \in \{1,\dots,K\}}\left(\max_{a'} Q_k\!\left(T(s_t, a), a'\right)\right)$$
(which converts the Q-estimates to value estimates of $T(s_t, a)$).
These variance-based exploration methods can be combined with other common exploration strategies such as ε-greedy, where instead of selecting random actions we select the action with highest uncertainty.
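The sketch below shows this action-selection procedure with both variance estimators and the ε-greedy combination; `predict_frame`, `q_nets` and `q_mean` are hypothetical stand-ins for the trained frame-prediction model $T$, the ensemble members and the ensemble-mean Q-function.

```python
import numpy as np

def uncertain_action(state, predict_frame, q_nets, n_actions, use_value_var=False):
    scores = []
    for a in range(n_actions):
        nxt = predict_frame(state, a)                   # predicted next frame T(s, a)
        q = np.stack([qn(nxt) for qn in q_nets])        # (K, A) Q-estimates on that frame
        if use_value_var:
            scores.append(q.max(axis=1).var())          # variance of per-member value estimates
        else:
            scores.append(q.var())                      # variance over all K*A outputs
    return int(np.argmax(scores))                       # action with highest uncertainty

def epsilon_uncertain(state, predict_frame, q_nets, q_mean, n_actions, eps=0.05):
    # epsilon-greedy, but the exploratory branch picks the most uncertain action
    # instead of a random one.
    if np.random.rand() < eps:
        return uncertain_action(state, predict_frame, q_nets, n_actions)
    return int(np.argmax(q_mean(state)))                # otherwise act greedily
```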
2.2 Method 2: Trajectory Memory and Q-Ensembles
The trajectory memory method described in [19] uses an ε-greedy exploration policy. The trajectory memory measures the similarity between the predicted frame and the most recent frames, which gives the estimated visit frequency. We use the same hyperparameter settings as described in [19].
We can combine this exploration method with the variance in Q-ensembles to yield a more informed exploration strategy. This is similar to the previous method, except that the uncertainty is calculated on the current frame $x_t$ instead of the predicted frames $\hat{x}_{t+1}$; the predicted frames are used only to estimate their visit frequency. The procedure used to select actions under the combined strategy is described in Algorithm 1.
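The following sketch is one plausible reading of the combined strategy: the visit frequency of each predicted frame is estimated with a Gaussian kernel over the most recent frames (as in the trajectory memory of [19]), and is mixed with the per-action Q-ensemble spread computed on the current frame. The mixing rule and the names `beta` and `predict_frame` are illustrative assumptions, not the exact form of Algorithm 1.

```python
import numpy as np

def visit_frequency(pred_frame, recent_frames, sigma=1.0):
    # Mean Gaussian-kernel similarity to the most recent frames approximates
    # how often a state like pred_frame has been visited.
    dists = [np.sum((pred_frame - f) ** 2) for f in recent_frames]
    return np.mean([np.exp(-d / sigma ** 2) for d in dists])

def combined_action(state, recent_frames, predict_frame, q_nets, n_actions, beta=1.0):
    q = np.stack([qn(state) for qn in q_nets])          # (K, A) estimates on the current frame
    bonus = q.std(axis=0)                               # per-action ensemble uncertainty
    novelty = np.array([
        -visit_frequency(predict_frame(state, a), recent_frames)
        for a in range(n_actions)                       # prefer rarely visited predicted frames
    ])
    return int(np.argmax(q.mean(axis=0) + beta * (bonus + novelty)))
```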
3 Experimental Setup
All our experiments are run on the Ms. Pacman Atari environment [18], where exploration is challenging and important for achieving good scores. Below are the experimental settings and hyperparameters used.

Network Architecture: For the Q-ensembles, we train 5 different Q-networks, each using a standard DQN architecture: Conv(32, 8, 4), Conv(64, 4, 2), Conv(64, 3, 1) and Linear(256) with ReLU activations throughout. The auto-encoder frame prediction model is the same as the one used in [19].
Optimization: For training the Q-ensemble, we use Adam with learning rate = 0.0001, weight decay = 0 and gradient norm clipped at 10 for every layer.

Training details: Batch size = 32, Training Frequency = 4 (train every 4 frames), Discount Factor (gamma) = 0.99, Size of Replay Memory = 10000, Target network sync frequency = 1000 (fully replaced with the weights from training network)

Frame Preprocessing: We use the same frame preprocessing technique as used by [30] (frame skipping, max over 4 frames, RGB to grayscale).

Exploration: For ε-greedy based strategies, initial ε = 1.0, final ε = 0.01, exploration timesteps = 1000000. The UCB-like strategy uses an exploration coefficient λ ∈ {1.0, 0.1, 0.01, 0.001}.
4 Results and Discussion
4.1 Separately Training the Auto-Encoder and Q-Ensemble
We first discuss the results of training the auto-encoder and Q-ensemble models separately, as reported in [19] and [17] respectively.
Model-Based Approach
Results from the next-frame prediction done by Oh et al.'s model-based approach [19] are shown in the figures. Figure 2 shows ground-truth next frames and predicted next frames produced by the encoder-decoder model side by side. Figure 3 further shows the training and validation losses over 800,000 iterations.
Model-Free Q-Ensemble
Figure 4 shows the results of training the different Q-ensemble methods and a standard Double DQN implementation. Due to the inherent slowness of training ensembles, not all of the models were trained for the full 8000 epochs, but they were trained sufficiently to conclude that the ensemble with ε-greedy exploration outperforms the UCB and standard DDQN approaches, which replicates the results reported by Chen et al. [17]. These Q-ensemble models with different exploration strategies serve as our baselines to beat.
4.2 Method 1: Combining the Encoder-Decoder Model and Q-Ensemble
Figure 6 shows the reward, and the reward averaged per 100 instances, for the combination of the encoder-decoder and Q-ensemble models using different seeds. Figure 5 shows the loss for this method (Method 1). As can be seen, in comparison to just using the Q-ensemble, the combination hovers between the 400 and 800 mark for rewards, without any improvement in training behavior even after 2.5 million timesteps. Training this model is also very slow because of the multiple steps involved in action selection (generating next frames, feeding them to the Q-ensemble, calculating path uncertainties, if any). We restrict the path to just the immediate action but repeat the action multiple times (4) to obtain a reasonable difference in the next states predicted by the auto-encoder.
In our analysis we found that the frames predicted by the auto-encoder are themselves highly noisy, which can lead to poor uncertainty estimates from the Q-ensemble (the Q-ensemble has almost never seen such frames). This tends to bias exploration in the wrong direction, where actions with noisy predicted frames are always preferred over unexplored actions. Our results for this method are quite similar to those obtained by [19] (Section 4.2) when they tried to replace the emulator with the frames predicted by the action-conditional model during testing.
4.3 Method 2: Trajectory Memory and Q-Ensembles
Figures 6(a) and 7 show the reward and loss plots obtained during training. As seen from the figures, the combination of trajectory memory and Q-ensemble variance (UCB-like exploration) often yields better rewards and reaches higher rewards very quickly compared to either of the baselines (Double DQN, Q-ensembles). However, the behavior is highly dependent on the starting seeds and can lead to a lot of variance in performance (as seen in Figure 6(a)). Training is again slow (though not as slow as Method 1) and can take up to 12-15 hours to reach 1M timesteps. The results are encouraging and show that combining estimated visit frequency with variance in Q-estimates to drive exploration is much better than ε-greedy or plain UCB-like exploration.
5 Conclusion and Future Work
Intelligent exploration strategies, beyond dithering ones like ε-greedy, are important for Q-learning in large state spaces. We find that combining trajectory-memory-based visit estimates with variance estimates from a Q-ensemble improves exploration and helps the agent reach better rewards much faster than other methods.
In the future, we hope to repeat these experiments on other Atari games, such as Q*bert or Seaquest, where exploration is harder. As observed, the frames predicted by the auto-encoder are themselves highly noisy, which can lead to poor uncertainty estimates from the Q-ensemble. Generator-discriminator methods that improve the quality of predicted future frames could be employed in place of the encoder-decoder models. In order to address the stability issues generally faced by GANs, architectures like Wasserstein GANs will also be an important direction of research. With more realistic frames, it is easier to obtain unbiased uncertainty estimates from the Q-ensemble to drive exploration. Also, as exploration strategies become more complex, training becomes very slow, and it will become important to use computational tricks such as separating exploration from Q-network training and running multiple environments in parallel to reduce training time.
References
 [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 [5] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10):931, 2015.
 [6] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature biotechnology, 33(8):831, 2015.
 [7] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 [8] José M Bernardo and AF Smith. Bayesian theory, vol. 405. John Wiley & Sons, 2009.
 [9] David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
 [10] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 [11] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 [12] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

 [13] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
 [14] Carl Edward Rasmussen and Joaquin Quiñonero-Candela. Healing the relevance vector machine through augmentation. In Proceedings of the 22nd international conference on Machine learning, pages 689–696. ACM, 2005.
 [15] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6405–6416, 2017.
 [16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [17] Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. UCB exploration via Q-ensembles. 2018.
 [18] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.(JAIR), 47:253–279, 2013.
 [19] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
 [20] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
 [21] Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for dataefficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
 [22] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
 [23] Somil Bansal, Roberto Calandra, Sergey Levine, and Claire Tomlin. Mbmf: Model-based priors for model-free reinforcement learning. arXiv preprint arXiv:1709.03153, 2017.
 [24] Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078, 2017.
 [25] Felix Leibfried, Nate Kushman, and Katja Hofmann. A deep learning approach for joint video frame and reward prediction in atari games. arXiv preprint arXiv:1611.07078, 2016.
 [26] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: theory and practice—a survey. Automatica, 25(3):335–348, 1989.

 [27] Graham W Taylor and Geoffrey E Hinton. Factored conditional restricted boltzmann machines for modeling motion style. In Proceedings of the 26th annual international conference on machine learning, pages 1025–1032. ACM, 2009.
 [28] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
 [29] Roland Memisevic. Learning to relate images. IEEE transactions on pattern analysis and machine intelligence, 35(8):1829–1846, 2013.
 [30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.