Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning

by   Andreas Sedlmeier, et al.
Universität München

Robustness to out-of-distribution (OOD) data is an important goal in building reliable machine learning systems. Especially in autonomous systems, wrong predictions for OOD inputs can cause safety-critical situations. As a first step towards a solution, we consider the problem of detecting such data in a value-based deep reinforcement learning (RL) setting. Modelling this problem as a one-class classification problem, we propose a framework for uncertainty-based OOD classification: UBOOD. It is based on the effect that an agent's epistemic uncertainty is reduced for situations encountered during training (in-distribution), and is thus lower than for unencountered (OOD) situations. As the framework is agnostic towards the approach used for estimating epistemic uncertainty, it can be combined with different uncertainty estimation methods, e.g. approximate Bayesian inference methods or ensembling techniques. We further present a first viable solution for calculating a dynamic classification threshold, based on the uncertainty distribution of the training data. Evaluation shows that the framework produces reliable classification results when combined with ensemble-based estimators, while the combination with concrete dropout-based estimators fails to reliably detect OOD situations. In summary, UBOOD presents a viable approach for OOD classification in deep RL settings by leveraging the epistemic uncertainty of the agent's value function.




1 Introduction

One of the main impediments to the deployment of autonomous machine learning systems in the real world is the difficulty of showing that the system will continue to reliably execute beneficial actions in all the situations it encounters in production use. One of the possible reasons for failure is so-called out-of-distribution (OOD) data, i.e. data which deviates substantially from the data encountered during training. As the fundamental problem of limited training data seems unsolvable for most cases, especially in sequential decision making tasks like reinforcement learning (RL), a possible first step towards a solution is to detect and report the occurrence of OOD data. This can prevent silent and possibly safety-critical failures of the machine learning system (caused by wrong predictions which lead to the execution of unfavorable actions), for example by handing control over to a human supervisor [1].

Recently, several different approaches were proposed that try to detect OOD samples in classification tasks [13, 20], or perform anomaly detection via generative models. While these methods show promising results in the evaluated classification tasks, we are not aware of applications to value-based RL settings where non-stationary regression targets are present. Thus, our research aims to provide a first step towards developing and evaluating suitable OOD detection methods that are applicable to changing environments in sequential decision making tasks.

We model the OOD-detection problem as a one-class classification problem with the two classes: in-distribution and out-of-distribution. Having framed the problem this way, we propose a framework for uncertainty-based OOD classification: UBOOD. It is based on the effect that epistemic uncertainty in the agent’s chosen actions is reduced for situations encountered during training (in-distribution), and is thus lower than for unencountered (OOD) situations. The framework itself is agnostic towards the approach used for estimating epistemic uncertainty. Thus, it is possible to use e.g. approximate Bayesian inference methods or ensembling techniques.

In order to evaluate the performance of any OOD classifier in an RL setting, modifiable environments which can generate OOD samples are needed. Due to a lack of publicly available RL environments that allow systematic modification, we developed two different environments: one using a gridworld-style discrete state-space, the other using a continuous state-space. Both allow modifications of increasing strength (and consequently produce OOD samples of increasing strength) after the training process. We empirically evaluated the performance of the UBOOD framework with different uncertainty estimation methods on these environments. Evaluation results show that the framework produces reliable OOD classification results when combined with ensemble-based estimators, while the combination with concrete dropout-based estimators fails to capture increased uncertainty in the OOD situations. Ensemble-based approaches also show increasing classification accuracy the stronger the OOD samples are (i.e. the more the environments differ from training), and increasing uncertainty is inversely related to the agent’s achieved return.

2 Basics

2.1 Uncertainty

When viewed from a statistical perspective, uncertainty arises whenever the outcome of a random variable cannot be known with certainty. Uncertainty measures can then be understood to describe how random the outcome of such a random variable is. This “amount of randomness” is described by the dispersion of the random variable’s probability distribution, i.e. how stretched or squeezed the probability distribution is. Measures of this dispersion are e.g. the probability distribution’s variance or standard deviation.


2.1.1 Uncertainty Estimation

In the context of this work, we are interested in the uncertainty of a neural network’s prediction, which in a value-based deep RL setting expresses how certain the agent is that its chosen action is optimal in the given situation. Different approaches exist that make it possible to estimate this uncertainty. Ensemble techniques, for example, aggregate the predictions of multiple networks, often trained on different versions of the data, and interpret the variance of the individual predictions as the uncertainty [24, 17]. An example of this approach can be seen in Figure 1, which shows the individual predictions of a Bootstrap ensemble as well as their mean and variance. These and other methods applicable to deep neural networks will be presented in more detail in Section 3.1. Besides the various ways of measuring uncertainty, it is equally important to differentiate the different sources of uncertainty.

Figure 1: Example regression of a 1-D toy-dataset showing the predictions of a Bootstrap ensemble (see Section 4.2) of size . Blue dots represent the training data. Thin red lines show the individual ensemble predictions, while the thick red line represents the mean of the predictions. The variance of the individual predictions can be interpreted as epistemic uncertainty.
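The idea behind Figure 1 can be sketched in a few lines. The following is a minimal illustration only: it swaps the neural networks for simple least-squares line fits, and the toy data, the ensemble size of 10 and all function names are assumptions made for this example. Each member is fitted on a bootstrap resample of the data, and the variance of the member predictions serves as the epistemic uncertainty estimate.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x

def bootstrap_ensemble(points, n_members, rng):
    """Fit each member on a resample (with replacement) of the data."""
    members = []
    for _ in range(n_members):
        resample = [rng.choice(points) for _ in points]
        members.append(fit_line(resample))
    return members

def predict(members, x):
    """Return the ensemble mean and variance (epistemic proxy) at x."""
    preds = [a * x + b for a, b in members]
    return statistics.fmean(preds), statistics.pvariance(preds)

rng = random.Random(0)
# Hypothetical toy data: x in [0, 1], y = 2x plus Gaussian noise.
data = [(x / 20, 2 * x / 20 + rng.gauss(0, 0.1)) for x in range(21)]
ensemble = bootstrap_ensemble(data, n_members=10, rng=rng)

_, var_inside = predict(ensemble, 0.5)   # inside the training range
_, var_outside = predict(ensemble, 5.0)  # far outside the training range
```

Because the members agree where data is available and drift apart elsewhere, `var_outside` should come out much larger than `var_inside`.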

2.1.2 Aleatoric Uncertainty

Aleatoric uncertainty models the inherent stochasticity in the system, i.e. no amount of data can explain the observed stochasticity. In other words, the uncertainty cannot be reduced by capturing more data. A reason for this might be that certain features that would be needed to explain the behaviour of the system are not part of the collected data. E.g. consider trying to model the distance different cars travel on a highway in a certain amount of time, without measuring their speed. If the speed is not part of the collected data, the randomness in the measured distances cannot be explained. It is also possible that the uncertainty is a fundamental property of the measured system, as is the case when dealing with quantum mechanics. As such, aleatoric uncertainty cannot be reduced, irrespective of how much data is collected.

2.1.3 Epistemic Uncertainty

Epistemic uncertainty by contrast arises out of a lack of sufficient data to exactly infer the underlying system’s data generating function. In this case, the features available in the data do in principle allow the explanation of the behaviour of the system. In the previous example, this would e.g. be the case if both time and speed are measured, but so far only cars traveling at the same speed had been observed. The uncertainty caused by the effect of different speed in this case is epistemic, as collecting more data could allow for a correct inference of the system’s behaviour and consequently the reduction of the uncertainty.

2.2 Markov Decision Processes

We base our problem formulation on Markov decision processes (MDPs) [26]. MDPs are defined by tuples $M = \langle S, A, P, R \rangle$. $S$ is a (finite) set of states; $s_t \in S$ being the state of the MDP at time step $t$. $A$ is the (finite) set of actions; $a_t \in A$ is the action the MDP takes at step $t$. $P(s_{t+1} \mid s_t, a_t)$ defines the transition probability function; a transition occurs by executing action $a_t$ in state $s_t$. The resulting next state $s_{t+1}$ is determined based on $P$. In this paper we focus on deterministic domains represented by deterministic MDPs, so $P(s_{t+1} \mid s_t, a_t) \in \{0, 1\}$. Finally, $R(s_t, a_t)$ is the scalar reward; for this paper we assume that $R(s_t, a_t) \in \mathbb{R}$.

The goal of the problem is to find a policy $\pi : S \rightarrow A$ in the space of all possible policies $\Pi$, which maximizes the expectation of return $G_t$ at state $s_t$ over a potentially infinite horizon:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, R(s_{t+k}, a_{t+k})$$

where $\gamma \in [0, 1]$ is the discount factor.
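As a small worked example of the discounted return, the sketch below accumulates a finite reward sequence from the back, which avoids computing explicit powers of the discount factor. The function name and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+k} over a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        # Horner-style accumulation: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Three steps with reward 1 each and gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1.75
```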

2.3 Reinforcement Learning

In order to search the policy space $\Pi$, we consider model-free reinforcement learning (RL). In this setting, an agent interacts with an environment defined as an MDP $M$ by executing a sequence of actions $a_t \in A$, $t = 0, 1, \dots$ [30]. In the fully observable case of RL, the agent knows its current state $s_t$ and the action space $A$, but not the effect of executing $a_t$ in $s_t$, i.e., $P(s_{t+1} \mid s_t, a_t)$ and $R(s_t, a_t)$. In order to find the optimal policy $\pi^*$, we focus on Q-Learning [31], a commonly used value-based approach. It is named for the action-value function $Q^\pi : S \times A \rightarrow \mathbb{R}$, which describes the expected return $Q^\pi(s_t, a_t)$ when taking action $a_t$ in state $s_t$ and then following policy $\pi$ for all states afterwards.

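The optimal action-value function $Q^*$ of policy $\pi^*$ is any action-value function that yields higher accumulated rewards than all other action-value functions, i.e., $Q^*(s_t, a_t) \geq Q^\pi(s_t, a_t)$ for all $\pi \in \Pi$. Q-Learning aims to approximate $Q^*$ by starting from an initial guess for $Q$, which is then updated via

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

It uses experience samples of the form $e_t = (s_t, a_t, s_{t+1}, r_t)$, where $r_t$ is the reward earned at time step $t$, i.e., by executing action $a_t$ when in state $s_t$. The learning rate $\alpha$ is a setup-specific parameter. The set of all experience samples taken at time steps $t = 0, \dots, T$ for some training limit $T$ is called the training set $\mathcal{T}$.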

The learned action-value function $Q$ converges to the optimal action-value function $Q^*$, which then implies an optimal policy $\pi^*(s_t) = \arg\max_{a} Q^*(s_t, a)$.
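The Q-Learning procedure described above can be illustrated with a tabular sketch. Everything specific here is an assumption made for the example: a deterministic five-state corridor, a step cost of -1, a goal reward of +10, epsilon-greedy exploration and the chosen hyperparameters. It is not the setup used in the experiments.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4 on a line; state 4 is terminal
ACTIONS = (-1, +1)      # move left / move right

def step(state, action):
    """Deterministic transition with hypothetical rewards."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 10.0 if nxt == GOAL else -1.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = rng.randrange(GOAL), False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            s2, r, done = step(s, a)
            # Temporal-difference target; no bootstrapping at the terminal state.
            target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

q = q_learning()
# Greedy policy implied by the learned action-value function.
greedy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(GOAL)}
```

In this corridor the greedy policy should move right in every non-terminal state, since that is the shortest path to the goal.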

In high-dimensional settings or when learning in continuous state-spaces, it is common to use parameterized function approximators like neural networks to approximate the action-value function: $Q(s, a) \approx Q(s, a; \theta)$ with $\theta$ specifying the weights of the neural network. When using a deep neural network as the function approximator, this approach is called deep reinforcement learning [22].

3 Related Work

3.1 Uncertainty in Deep Learning

Bayesian inference offers a systematic way of dealing with uncertainty. Its combination with neural networks in the form of Bayesian neural networks is realised by placing a probability distribution over the weight-values of the network [21]. As calculating the exact Bayesian posterior quickly becomes computationally intractable for deep models, a popular solution is to use approximate inference methods [12, 6, 10, 14, 19, 11]. Another option is the construction of model ensembles, e.g., based on the idea of the statistical bootstrap [9]. The resulting distribution of the ensemble predictions can then be used to approximate the uncertainty [24, 17].

Both approaches have been used for tasks as diverse as machine vision [16] and disease detection [18]. In the field of decision making, uncertainty is used to implicitly guide exploration, e.g. by creating an ensemble of models [24], or for learning safety predictors, e.g. predicting the probability of a collision [15]. Recently, a distributional approach to RL [2] was proposed which tries to learn the value distribution of an RL environment. Although this approach also models uncertainty, its goal of estimating the distribution of values is different from the work at hand, which tries to detect epistemic uncertainty, i.e. uncertainty in the model itself.

3.2 OOD and Novelty Detection

For the case of low-dimensional feature spaces, OOD detection (also called novelty detection) is a well-researched problem. For a survey on the topic, see e.g. [25], who distinguish between probabilistic, distance-based, reconstruction-based, domain-based and information theoretic methods. In recent years, several new methods based on deep neural networks were proposed for high-dimensional cases, mostly focusing on classification tasks, e.g. image classification. [13] propose a baseline for detecting OOD examples in neural networks, based on the predicted class probabilities of a softmax classifier. [20] improve upon this baseline by using temperature scaling and by adding perturbations to the input. [19] evaluate the performance of a proposed alpha-divergence-based variational inference technique in an image classification task of adversarial examples. This can be understood as a form of OOD detection, as the generated adversarial examples lie outside of the training image manifold and consequently far from the training data. The authors report increased epistemic uncertainty, confirming the viability of their approach for the detection of adversarial image examples. The basic idea of this uncertainty-based approach is closely related to our proposed method, but no evaluation of the performance in an RL setting with non-stationary regression targets was performed. To the best of our knowledge, none of the previously mentioned methods were evaluated regarding the epistemic uncertainty detection performance in an RL setting.

4 UBOOD: Uncertainty-Based Out-of-Distribution Classification

In this paper we propose UBOOD, an uncertainty-based OOD-classifier that can be employed in value-based deep reinforcement learning settings. It is based on the reducibility of epistemic uncertainty in the action-value function approximation.

As previously described, epistemic uncertainty arises out of a lack of sufficient data to exactly infer the underlying system’s data generating function. As such, it tends to be higher in areas of low data density. [27], who in turn refers to [4] for the initial conjecture, showed that the epistemic uncertainty $\sigma_{\text{epis}}(x)$ is approximately inversely proportional to the density $p(x)$ of the input data, for the case of generalized linear regression models as well as multi-layer neural networks:

$$\sigma_{\text{epis}}(x) \;\propto\; p(x)^{-1} \quad \text{(approximately)}$$

This also forms the basis of our approach: to use this inverse relation between epistemic uncertainty and data density in order to differentiate in- from out-of-distribution samples.

We define $U_Q(s, a)$ as the epistemic uncertainty function of a given Q-function approximation $Q$. If a suitable method for epistemic uncertainty estimation for deep neural networks is applied, the process of training the agent reduces $U_Q(s, a)$ for those state-action tuples $(s, a)$ that were used for training, i.e., there exists a successor state $s'$ and a reward $r$ so that $(s, a, s', r) \in \mathcal{T}$, where $\mathcal{T}$ is the training set. These tuples consequently define the set of in-distribution data $I$. By contrast, state-action tuples that were not encountered during training, i.e. $(s, a) \notin I$, define the set of out-of-distribution data $O$. The epistemic uncertainty of these state-action tuples is not reduced during training. Thus, epistemic uncertainty of out-of-distribution data will be higher than that of in-distribution data:

$$U_Q(o) > U_Q(i) \quad \forall\, o \in O,\; i \in I$$
UBOOD directly uses the output of the epistemic uncertainty function as the real-valued classification score. As is the case for many one-class classifiers, this real-valued score forms the input of a threshold-based decision function, which then assigns the in- or out-of-distribution class label.

4.1 Classification Threshold

As is the case for any score-based one-class classification method, the classification threshold can be adjusted to modify the behaviour of the classifier, depending on the application’s requirements. For many applications, where some amount of OOD data is intermixed with the training data and the percentage is known, this information can be used to specify the threshold. In our case, however, there are by definition no OOD samples in the training data, so such an approach is not possible. As a viable first solution, we propose the following simple algorithm to calculate a dynamic classification threshold:

  1. Calculate the average uncertainty of the in-distribution samples .

  2. Treat as a probability distribution and define the classification threshold as .

Thus, a dynamic threshold based on the uncertainty distribution is realized that adjusts over the training process as more data is gathered. Note that more complex algorithms for the threshold determination can be developed, e.g. by using multimodal probability distributions to model the uncertainties of the in-distribution samples or by making use of additional information about the available data on a per-application basis.
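A threshold-based decision function of this kind could look roughly as follows. This is only a sketch: the rule used here (mean plus `k` standard deviations of the in-distribution uncertainties) is an illustrative assumption and not necessarily the expression used in the experiments, and the function names and sample values are hypothetical.

```python
import statistics

def classification_threshold(in_dist_uncertainties, k=1.0):
    """Hypothetical rule: mean + k * std of the in-distribution uncertainties."""
    mu = statistics.fmean(in_dist_uncertainties)
    sigma = statistics.pstdev(in_dist_uncertainties)
    return mu + k * sigma

def classify(uncertainty, threshold):
    """Label a sample OOD when its uncertainty exceeds the threshold."""
    return "OOD" if uncertainty > threshold else "in-distribution"

# Uncertainties reported for samples collected during training (made-up values).
train_uncertainties = [0.10, 0.12, 0.08, 0.11, 0.09]
threshold = classification_threshold(train_uncertainties)
labels = [classify(u, threshold) for u in (0.10, 0.50)]
```

Recomputing the threshold from the most recent in-distribution uncertainties lets it follow the training process, which is the dynamic behaviour described above.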

4.2 Epistemic Uncertainty Estimation Methods

In principle, any of the epistemic uncertainty estimation methods mentioned in Section 3.1 that are applicable to the function approximator used to model the Q-function, can be used in the UBOOD framework. In this paper, we evaluate three different UBOOD versions using different methods for epistemic uncertainty estimation and their effect on the OOD classification performance, as the networks are being used by the RL agent for value estimation.

The Monte-Carlo Concrete Dropout method is based on the dropout variational inference architecture as described by [16]. Instead of default dropout layers, we use concrete dropout layers as described by [11], which do not require pre-specified dropout rates and instead learn individual dropout rates per layer. Figure 2(a) presents a schematic of the network used by this method.

(a) MCCD network

(b) Bootstrap network

(c) Bootstrap-Prior network
Figure 2: Model architectures of the evaluated networks. (a) The Monte-Carlo Concrete Dropout network. For this architecture, multiple MC samples are required to calculate the epistemic uncertainty. (b) The Bootstrap neural network with bootstrap heads, and (c) the Bootstrap-Prior neural network which adds the output of an untrainable prior network to the output of the bootstrap heads to generate posterior heads. For both bootstrap-based architectures, epistemic uncertainty is calculated as the variance of the output heads.

This concrete dropout method is of special interest in our context of reinforcement learning, as here the available data change during the training process, rendering a manual optimization of the dropout rate hyperparameter even more difficult. The model is trained by minimizing the negative log-likelihood of the predicted output distribution. Epistemic uncertainty as part of the total predictive uncertainty is then calculated as:

$$\mathrm{Var}_{\text{ep}}(y) \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t^{2} - \left( \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \right)^{2}$$

with $T$ outputs $\hat{y}_t$ of the Monte-Carlo sampling.
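The variance over the Monte-Carlo outputs can be computed directly from the sampled predictions. The sketch below assumes the stochastic forward passes have already been collected into a list of scalar outputs; the function name is hypothetical.

```python
def epistemic_variance(mc_outputs):
    """Variance of the stochastic forward passes: E[y^2] - E[y]^2."""
    t = len(mc_outputs)
    mean = sum(mc_outputs) / t
    mean_sq = sum(y * y for y in mc_outputs) / t
    return mean_sq - mean * mean

# Two passes predicting 1.0 and 3.0: mean 2.0, mean of squares 5.0, variance 1.0
spread = epistemic_variance([1.0, 3.0])
# Identical passes carry no epistemic signal.
no_spread = epistemic_variance([2.0, 2.0, 2.0, 2.0])
```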

The Bootstrap method is based on the network architecture described by [24]. It represents an efficient implementation of the bootstrap principle by sharing a set of hidden layers between all members of the ensemble. In the network, the shared, fully-connected hidden layers are followed by an output layer of size $K$, called the bootstrap heads, as can be seen in Figure 2(b). For each datapoint, a Boolean mask of length equal to the number of heads is generated, which determines the heads this datapoint is visible to. The mask’s values are set by drawing $K$ times from a masking distribution. For the work at hand, the values are independently drawn from Bernoulli distributions with either $p = 0.7$ or $p = 1.0$. In the case of $p = 1.0$, the bootstrap is reduced to a classic ensemble where all heads are trained on the complete data.
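Generating such a mask is straightforward. The following sketch draws one independent Bernoulli value per head; the function name, the number of heads and the seed are assumptions made for the example.

```python
import random

def bootstrap_mask(n_heads, p, rng):
    """Draw one Bernoulli(p) value per head; True means the head sees the datapoint."""
    return [rng.random() < p for _ in range(n_heads)]

rng = random.Random(0)
mask_07 = bootstrap_mask(10, 0.7, rng)  # each head sees the datapoint with probability 0.7
mask_10 = bootstrap_mask(10, 1.0, rng)  # every head sees every datapoint
```

During training, the loss of a head would be applied to a datapoint only where its mask entry is `True`, so a masking probability of 1.0 trains all heads on the complete data.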

The Bootstrap-Prior method is based on the extension presented in [23]. It has the same basic architecture as the Bootstrap method but with the addition of a so-called random Prior Network. Predictions are generated by adding the data-dependent output of this untrainable prior network to the output of the different bootstrap heads in order to calculate the ensemble posterior (Figure 2(c)). The authors conjecture that the addition of this randomized prior function outperforms deep ensemble-based methods without explicit priors, as for the latter, the initial weights have to act both as prior and training initializer.

For both bootstrap-based methods, epistemic uncertainty is calculated as the variance of the outputs.

5 Experimental Setup

5.1 Framework versions

We evaluate three different versions of the UBOOD framework:

  • UB-MC: UBOOD with Monte-Carlo Concrete Dropout (MCCD) network

  • UB-B: UBOOD with Bootstrap network

  • UB-BP: UBOOD with Bootstrap-Prior network

The UB-MC version’s estimator network consists of two fully-connected hidden layers with 64 neurons each, followed by two separate neurons in the output layer representing the parameters $\mu$ and $\sigma$ of a normal distribution. As concrete dropout layers are used, no dropout probability has to be specified. Model loss and epistemic uncertainty are calculated as described in Section 4.2.

The UB-B Bootstrap neural network and UB-BP Bootstrap-Prior neural network versions both consist of two fully-connected hidden layers with 64 neurons each, which are shared between all heads, followed by an output layer of bootstrap heads.

Each of these UBOOD versions is further evaluated with two parametrizations of the respective epistemic uncertainty estimation method: UB-MC40 and UB-MC80 differ with respect to the number of Monte-Carlo forward passes that are executed to approximate the epistemic uncertainty: 40 or 80 passes. UB-B and UB-BP parametrizations (UB-B07, UB-B10, UB-BP07, UB-BP10) differ with respect to the Bernoulli distribution used to determine the bootstrap mask: probability $p = 0.7$ for UB-B07 & UB-BP07 and probability $p = 1.0$ for UB-B10 & UB-BP10.

For all networks, ReLU is used as the layers’ activation function, with the exception of the output layers, where no activation function is used. The classification threshold is calculated as described in Section 4.1.

5.2 Environments

One of the problems in evaluating OOD detection for RL is the lack of datasets or environments which can be used for generating and assessing OOD samples in a controlled and reproducible way. In contrast to the field of image classification, where benchmark datasets like notMNIST [8] exist that contain OOD samples, there are no equivalent sets for RL. We apply a principled approach to develop two environments, one using a gridworld-style discrete state-space, the other using a continuous state-space. Both environments allow systematic modifications after the training process, thus producing OOD states during evaluation.

The first environment is a simple gridworld pathfinding environment. It is built on the design presented in [29] and has a discrete state-space. The basic layout consists of two rooms, separated by a vertical wall. Movement between the rooms is only possible via two hallways, as is visualised in Figure 3. The agent starts every episode at a random position on the grid (labeled S in Figure 3). Its task is to reach a specific goal position on the grid (labeled G in Figure 3), which also varies randomly every episode, by choosing one of the four possible actions.

(a) Example environment: Config 0
(b) Example environment: Config 7
Figure 3: Example initializations of the gridworld pathfinding environment using different configurations. The label S indicates the agent’s start position, while G marks the goal. Both positions are randomly set in the ranges defined by the respective configuration every episode. (a) shows a placement using environment configuration 0 as active in training. Samples collected with this configuration define the in-distribution set. (b) shows an initialization of environment configuration 7 which differs maximally from the training configuration.

The state of the environment is represented as a stack of three feature planes, with each plane representing the spatial positions of all environment objects of a specific type: agent, goal or wall. Each step of the agent incurs a cost of except the goal-reaching action, which is rewarded with and ends the episode. We evaluate the performance of the UBOOD framework on a set of environment configurations. All environment configurations have a size of and randomly vary the y-coordinate of the agent’s start position as well as the goal position every episode, in the interval . Configuration 0, the only configuration used in training, varies the x-coordinate of the agent’s start position in the interval and the goal position in the interval . Each environment configuration is then defined by shifting the start interval right by compared to the previous configuration, while the goal interval is shifted left by . E.g. configuration has start position range and goal position range . This results in environment configurations with increasing difference from the training configuration 0, as can be seen in the example shown in Figure 3(b).
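The feature-plane state representation can be sketched as follows. The grid size, the coordinates and the function name are placeholders chosen for illustration and do not reflect the dimensions used in the experiments.

```python
def encode_state(width, height, agent, goal, walls):
    """Return three binary planes (agent, goal, wall), each height x width."""
    def plane(cells):
        return [[1 if (x, y) in cells else 0 for x in range(width)]
                for y in range(height)]
    return [plane({agent}), plane({goal}), plane(set(walls))]

# Hypothetical 4x3 grid with one agent, one goal and two wall cells.
state = encode_state(4, 3, agent=(0, 1), goal=(3, 2), walls=[(2, 0), (2, 2)])
agent_plane, goal_plane, wall_plane = state
```

Each plane is indexed as `plane[y][x]`, so the stack can be passed to a network as a three-channel input.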

The continuous state-space environment is based on OpenAI’s LunarLander environment [7]. The goal is to safely land a rocket inside a defined landing pad, without crashing. This task can be understood as rocket trajectory optimization. While the original environment defines a static position for the landing pad, our modified environment allows for random placement inside specified intervals. As the original environment does not encode the landing pad’s position in the state representation, our version extends the state encoding to include the left and right x-coordinate as well as the y-coordinate of the pad. For evaluating the performance of the UBOOD framework in this continuous state-space environment, we created a set of configurations. Configuration 0, the only configuration used in training, varies the x-coordinate of the center of the landing pad in the interval and the y-coordinate in the interval , which results in the landing pad being placed in the upper left side of the environment. An example of this configuration can be seen in Figure 4(a). Each environment configuration is then defined by shifting the x-coordinate interval right by compared to the previous configuration, while the y-coordinate interval is shifted down by . This results in the pads being placed increasingly to the lower right side of the environment. As in the gridworld environment, this produces environment configurations with increasing difference from the training configuration 0.

(a) Example: Config 0
(b) Example: Config 5
Figure 4: Examples from the LunarLander environment using different configurations. (a) Example using environment configuration 0 as active in training. Samples collected with this configuration define the in-distribution set. (b) Example using environment configuration 5 which differs maximally from the training configuration.

Note that training on both environments is solely performed using the respective environment configuration 0. Evaluation runs are executed independently of the training process, based on model snapshots generated at the respective training episodes. Consequently, data collected during these evaluation runs is not used for training.

6 Performance Results

All evaluated versions learn successful policies on both the gridworld and LunarLander environments. Returns achieved by the trained policies after training episodes on different environment configurations are shown in Figure 5. As is to be expected, increasing changes to the environment (configuration ) reduce the achieved return, as the evaluation environment increasingly differs from the training environment configuration .

Figure 5: Returns achieved by the different versions on varying configurations of the LunarLander environment after training episodes on configuration 0. Environment configurations modify the environment with increasing strength as described in Section 5.2. All values shown are averages of evaluation runs.
(a) LunarLander
(b) Gridworld
Figure 6: F1-Scores of the classifier evaluated on different configurations of the LunarLander and gridworld environments. Samples collected on the training configuration of each environment are defined as negatives (in-distribution), samples from the other configurations as positives (OOD). X-Axis shows evaluations performed with samples from the training configuration and the respective environment configuration . Samples are aggregated from consecutive episode runs.

We evaluate the performance of the UBOOD framework based on the F1-Score as the harmonic mean of precision and recall. Figure 6 shows the F1-Scores achieved, dependent on the uncertainty estimation technique used in the framework. Best overall classification results on the LunarLander environment are achieved for UB-BP, i.e. using UBOOD with the Bootstrap-Prior estimator with F1-values as high as for UB-BP07 on environment configuration . F1-Scores of the UB-B and UB-BP versions on the gridworld environment are higher overall, when compared to the UB-MC versions. Here, values range between a minimum of on evaluation configuration , which is closest to the training configuration, and on configuration , which produces the strongest OOD samples. Overall, classification performance increases over environment configurations when Bootstrap-based estimators are used in the UBOOD framework. UB-MC, i.e. UBOOD combined with MCCD estimators, generates highly varying F1-scores, ranging between and on the gridworld environment and and on the LunarLander environment. In contrast to the Bootstrap-based versions, there is no relation apparent between the strength of the environment modification and the classification performance.
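For reference, the F1-Score used here can be computed from predicted and true labels as in the sketch below, with OOD as the positive class as in Figure 6. The function name and the example labels are made up for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 with OOD as the positive class (True = OOD)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Two in-distribution samples (one misclassified as OOD) and two OOD samples.
y_true = [False, False, True, True]
y_pred = [False, True, True, True]
score = f1_score(y_true, y_pred)  # precision 2/3, recall 1
```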

We further evaluate the relation between reported uncertainty and the return achieved by the agent. Figure 7 shows evaluation results of the UB-BP10 and UB-MC80 versions evaluated on different configurations of the gridworld environment. For UB-BP10 (), increases in uncertainty (caused by increasing environment modifications) are reflected in decreases of return. This behaviour was also present on the LunarLander environment and consistent for different values of . No such clear relation was visible for UB-MC80. As can be seen in the results visualised in Figure 7, the uncertainty reported by the MCCD-based version decreases strongly between configuration and , although the achieved return also decreases.

Figure 7: Uncertainty VS return of UB-BP10 and UB-MC80 evaluated on different configurations of the gridworld environment. While for the Bootstrap-based version UB-BP10, increases in uncertainty are reflected in decreases of return, a large decrease in uncertainty is visible for UB-MC80 between configuration and , although the achieved return also decreases. All values shown are averages of evaluation runs.

7 Discussion and Future Work

In this paper, we proposed UBOOD, an uncertainty-based out-of-distribution classification framework. Evaluation results show that using the epistemic uncertainty of the agent’s value function presents a viable approach for OOD classification in a deep RL setting. We find that the framework’s performance is ultimately dependent on the reliability of the underlying uncertainty estimation method, which is why good uncertainty estimates are required.

(a) UB-B07
(b) UB-MC80
Figure 8: Average uncertainties reported by (a) the Bootstrap-based version UB-B07 and (b) the Monte-Carlo Concrete Dropout-based version UB-MC80 on the gridworld environment. Env. config 0 shows uncertainties reported on the training configuration of the environment (in-distribution), Env. config 7 the uncertainties on the maximally diverging configuration. While for UB-B07 the uncertainties start diverging with progressing training, there is no such effect for UB-MC80. As a consequence, only the Bootstrap-based version allows for an increasingly better differentiation between in- and OOD samples. All values shown are averages of evaluation runs.

On both evaluation domains, UBOOD combined with ensemble-based bootstrap uncertainty estimation methods (UB-B / UB-BP) shows good results with F1-scores as high as , allowing for a reliable differentiation between in- and OOD samples. F1-scores increase as the environment configuration differs more from the training environment, i.e. the more strongly OOD the observed samples, the more reliable the classification. The addition of a prior, as done with the UB-BP version, seems to have a positive effect on the separation of in- and out-of-distribution samples, as is reflected in higher F1-scores on the LunarLander environment. By contrast, UBOOD combined with the concrete dropout-based uncertainty estimation method (UB-MC) does not produce viable results. Although increasing the number of Monte-Carlo samples improves the performance somewhat, the resulting classification performance is not on par with the Bootstrap-based versions. The reason for the large difference in performance between the Bootstrap-based and MCCD-based versions can be seen in the example shown in Figure 8. For the UB-B version, the reported uncertainties on environment configuration (training) and (strong modification) increasingly diverge with progressing training episodes (Figure 8(a)). As this is not the case for the UB-MC version (Figure 8(b)), only the Bootstrap-based version allows for an increasingly better differentiation between in- and OOD samples and consequently high F1-scores of the classifier. We found this effect to be consistent over all parametrizations of the Bootstrap- and MCCD-based versions we evaluated.
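The divergence effect can be illustrated with a minimal sketch of an ensemble-style uncertainty estimate. We assume here, for illustration only, that epistemic uncertainty is approximated as the variance of the Q-value predictions of the individual bootstrap heads for the selected action; the function name and all numbers are hypothetical.

```python
from statistics import pvariance


def ensemble_uncertainty(head_q_values):
    """Assumed estimator: variance of the Q-value predictions of all heads
    for one state-action pair. Heads that agree yield low uncertainty."""
    return pvariance(head_q_values)


# Made-up Q-value predictions of five heads, for illustration only.
q_in_distribution = [1.00, 1.02, 0.98, 1.01, 0.99]   # heads agree after training
q_out_of_distribution = [0.2, 1.5, -0.4, 2.1, 0.9]   # heads disagree on unseen states

u_in = ensemble_uncertainty(q_in_distribution)
u_ood = ensemble_uncertainty(q_out_of_distribution)
gap = u_ood - u_in  # a growing gap makes threshold-based classification easier
```

With progressing training, the heads converge on states they have seen while remaining diverse elsewhere, so the gap between the two values grows. If an estimator reports similar values for both cases, no threshold can separate in-distribution from OOD samples reliably.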

Our results match recent findings [3], where ensemble-based uncertainty estimators were compared against Monte-Carlo Dropout-based ones for the case of active learning in image classification. Results presented in that work also showed that ensembles performed better and led to more calibrated uncertainty estimates. As a possible explanation, the authors argue that the difference in performance is a result of a combination of decreased model capacity and lower diversity of the Monte-Carlo Dropout methods when compared to ensemble approaches. This effect would also explain the behaviour we observed when comparing reported uncertainty and achieved return. While there is a strong inverse relation visible when using Bootstrap-based UBOOD versions, no clear pattern emerged for the evaluated MCCD-based versions. We think that further research into the relation between epistemic uncertainty and achieved return when train- and test-environments differ could provide interesting insights relating to generalization performance in deep RL. Being able to differentiate between an agent having encountered a situation in training versus the agent generalizing its experience to new situations could provide a huge benefit in safety-critical situations.


References

  • [1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016-06) Concrete Problems in AI Safety. ArXiv e-prints. External Links: 1606.06565 Cited by: §1.
  • [2] M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458. Cited by: §3.1.
  • [3] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler (2018-06) The power of ensembles for active learning in image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §7.
  • [4] C. M. Bishop (1994-08) Novelty detection and neural network validation. IEE Proceedings - Vision, Image and Signal Processing 141 (4), pp. 217–222. External Links: Document, ISSN 1350-245X Cited by: §4.
  • [5] C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: §2.1.
  • [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015-05) Weight Uncertainty in Neural Networks. ArXiv e-prints. External Links: 1505.05424 Cited by: §3.1.
  • [7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §5.2.
  • [8] Y. Bulatov (2011) Notmnist dataset. Google (Books/OCR), Tech. Rep. [Online]. Available: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html. Cited by: §5.2.
  • [9] B. Efron (1992) Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics: Methodology and Distribution, pp. 569–593. External Links: Document, ISBN 978-1-4612-4380-9 Cited by: §3.1.
  • [10] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §3.1.
  • [11] Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3581–3590. Cited by: §3.1, §4.2.
  • [12] A. Graves (2011) Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2348–2356. Cited by: §3.1.
  • [13] D. Hendrycks and K. Gimpel (2016-10) A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. ArXiv e-prints. External Links: 1610.02136 Cited by: §1, §3.2.
  • [14] J. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. Turner (2016) Black-box α-divergence minimization. In 33rd International Conference on Machine Learning, ICML 2016, Vol. 4, pp. 2256–2273. Cited by: §3.1.
  • [15] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine (2017) Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182. Cited by: §3.1.
  • [16] A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision?. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5574–5584. Cited by: §3.1, §4.2.
  • [17] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6402–6413. Cited by: §2.1.1, §3.1.
  • [18] C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl (2017) Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports 7 (1), pp. 17816. Cited by: §3.1.
  • [19] Y. Li and Y. Gal (2017-03) Dropout Inference in Bayesian Neural Networks with Alpha-divergences. ArXiv e-prints. External Links: 1703.02914 Cited by: §3.1, §3.2.
  • [20] S. Liang, Y. Li, and R. Srikant (2017-06) Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. ArXiv e-prints. External Links: 1706.02690 Cited by: §1, §3.2.
  • [21] D. J. C. MacKay (1992) A practical Bayesian framework for backpropagation networks. Neural Computation 4, pp. 448–472. Cited by: §3.1.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §2.3.
  • [23] I. Osband, J. Aslanides, and A. Cassirer (2018-06) Randomized Prior Functions for Deep Reinforcement Learning. ArXiv e-prints. External Links: 1806.03335 Cited by: §4.2.
  • [24] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4026–4034. Cited by: §2.1.1, §3.1, §3.1, §4.2.
  • [25] M. A.F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko (2014) A review of novelty detection. Signal Processing 99, pp. 215 – 249. External Links: Document, ISSN 0165-1684 Cited by: §3.2.
  • [26] M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §2.2.
  • [27] C. S. Qazaz (1996) Bayesian error bars for regression. Ph.D. Thesis, Aston University. Cited by: §4.
  • [28] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI, Cited by: §1.
  • [29] A. Sedlmeier, T. Gabor, T. Phan, L. Belzner, and C. Linnhoff-Popien (2019) Uncertainty-based out-of-distribution detection in deep reinforcement learning. arXiv preprint arXiv:1901.02219. Cited by: §5.2.
  • [30] R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. Vol. 135, MIT press Cambridge. Cited by: §2.3.
  • [31] C. J. C. H. Watkins (1989) Learning from delayed rewards. Ph.D. Thesis, King’s College, Cambridge. Cited by: §2.3.