UBOOD: Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning
Robustness to out-of-distribution (OOD) data is an important goal in building reliable machine learning systems. Especially in autonomous systems, wrong predictions for OOD inputs can cause safety-critical situations. As a first step towards a solution, we consider the problem of detecting such data in a value-based deep reinforcement learning (RL) setting. Modelling this problem as a one-class classification problem, we propose a framework for uncertainty-based OOD classification: UBOOD. It is based on the effect that an agent's epistemic uncertainty is reduced for situations encountered during training (in-distribution), and thus lower than for unencountered (OOD) situations. Being agnostic towards the approach used for estimating epistemic uncertainty, combinations with different uncertainty estimation methods, e.g. approximate Bayesian inference or ensembling techniques, are possible. We further present a first viable solution for calculating a dynamic classification threshold, based on the uncertainty distribution of the training data. Evaluation shows that the framework produces reliable classification results when combined with ensemble-based estimators, while the combination with concrete dropout-based estimators fails to reliably detect OOD situations. In summary, UBOOD presents a viable approach for OOD classification in deep RL settings by leveraging the epistemic uncertainty of the agent's value function.
One of the main impediments to the deployment of autonomous machine learning systems in the real world is the difficulty of showing that the system will continue to reliably execute beneficial actions in all the situations it encounters in production use. One of the possible reasons for failure is so-called out-of-distribution (OOD) data, i.e. data which deviates substantially from the data encountered during training. As the fundamental problem of limited training data seems unsolvable in most cases, especially in sequential decision making tasks like reinforcement learning (RL), a possible first step towards a solution is to detect and report the occurrence of OOD data. This can prevent silent and possibly safety-critical failures of the machine learning system (caused by wrong predictions which lead to the execution of unfavorable actions), for example by handing control over to a human supervisor. Recently, several different approaches were proposed that try to detect OOD samples in classification tasks [13, 20]
, or perform anomaly detection via generative models. While these methods show promising results in the evaluated classification tasks, we are not aware of applications to value-based RL settings where non-stationary regression targets are present. Thus, our research aims to provide a first step towards developing and evaluating suitable OOD detection methods that are applicable to changing environments in sequential decision making tasks. We model the OOD detection problem as a one-class classification problem with two classes: in-distribution and out-of-distribution. Having framed the problem this way, we propose a framework for uncertainty-based OOD classification: UBOOD. It is based on the effect that epistemic uncertainty in the agent's chosen actions is reduced for situations encountered during training (in-distribution), and is thus lower than for unencountered (OOD) situations. The framework itself is agnostic towards the approach used for estimating epistemic uncertainty; it is thus possible to use e.g. approximate Bayesian inference methods or ensembling techniques. In order to evaluate the performance of any OOD classifier in a RL setting, modifiable environments which can generate OOD samples are needed. Due to a lack of publicly available RL environments that allow systematic modification, we developed two different environments: one using a gridworld-style discrete state-space, the other using a continuous state-space. Both allow modifications of increasing strength (and consequently produce OOD samples of increasing strength) after the training process. We empirically evaluated the performance of the UBOOD framework with different uncertainty estimation methods on these environments. Evaluation results show that the framework produces reliable OOD classification results when combined with ensemble-based estimators, while the combination with concrete dropout-based estimators fails to capture increased uncertainty in the OOD situations.
Ensemble-based approaches also show increasing classification accuracy the stronger the OOD samples are (i.e. the more the environments differ from training), and increasing uncertainty is inversely related to the agent's achieved return.
When viewed from a statistical perspective, uncertainty arises whenever the outcome of a random variable cannot be known with certainty. Uncertainty measures can then be understood to describe how random the outcome of such a random variable is. This “amount of randomness” is described by the dispersion of the random variable’s probability distribution, i.e. how stretched or squeezed the probability distribution is. Measures of this dispersion are e.g. the probability distribution’s variance or standard deviation.
In the context of this work, we are interested in the uncertainty of a neural network's prediction, which in a value-based deep RL setting corresponds to how certain the agent is that its chosen action is optimal in the given situation. Different approaches exist that make it possible to estimate this uncertainty. Ensemble techniques, for example, aggregate the predictions of multiple networks, often trained on different versions of the data, and interpret the variance of the individual predictions as the uncertainty [24, 17]. An example of this approach can be seen in Figure 1, which shows the individual predictions of a Bootstrap ensemble as well as their mean and variance. These and other methods applicable to deep neural networks will be presented in more detail in Section 3.1. Besides the various ways of measuring uncertainty, it is equally important to differentiate the different sources of uncertainty.
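The ensemble-variance idea described above can be sketched in a few lines (the Q-value predictions below are illustrative numbers, not taken from the paper):

```python
import numpy as np

# Hypothetical Q-value predictions of a 5-member ensemble for a single
# state-action pair; the spread across members approximates epistemic
# uncertainty.
predictions = np.array([10.2, 9.8, 10.5, 9.9, 10.1])

mean_q = predictions.mean()    # aggregated value estimate
epistemic = predictions.var()  # disagreement across members as uncertainty
```

With tight agreement between the members, the variance (and thus the estimated epistemic uncertainty) is small; it grows as the members diverge.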
Aleatoric uncertainty models the inherent stochasticity in the system, i.e. no amount of data can explain the observed stochasticity. In other words, the uncertainty cannot be reduced by capturing more data. A reason for this might be that certain features that would be needed to explain the behaviour of the system are not part of the collected data. E.g. consider trying to model the distance different cars travel on a highway in a certain amount of time, without measuring their speed. If the speed is not part of the collected data, the randomness in the measured distances cannot be explained. It is also possible that the uncertainty is a fundamental property of the measured system, as is the case when dealing with quantum mechanics. As such, aleatoric uncertainty cannot be reduced, irrespective of how much data is collected.
Epistemic uncertainty by contrast arises out of a lack of sufficient data to exactly infer the underlying system’s data generating function. In this case, the features available in the data do in principle allow the explanation of the behaviour of the system. In the previous example, this would e.g. be the case if both time and speed are measured, but so far only cars traveling at the same speed had been observed. The uncertainty caused by the effect of different speed in this case is epistemic, as collecting more data could allow for a correct inference of the system’s behaviour and consequently the reduction of the uncertainty.
We base our problem formulation on Markov decision processes (MDPs). MDPs are defined by tuples $(S, A, P, R)$. $S$ is a (finite) set of states; $s_t \in S$ is the state of the MDP at time step $t$. $A$ is the (finite) set of actions; $a_t \in A$ is the action the MDP takes at step $t$. $P(s_{t+1} \mid s_t, a_t)$ defines the transition probability function; a transition occurs by executing action $a_t$ in state $s_t$. The resulting next state $s_{t+1}$ is determined based on $P$. In this paper we focus on deterministic domains represented by deterministic MDPs, so $s_{t+1} = P(s_t, a_t)$. Finally, $R(s_t, a_t) = r_t$ is the scalar reward; for this paper we assume that $r_t \in \mathbb{R}$.
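As a minimal illustration of this formalism, a deterministic MDP can be written as explicit transition and reward tables (a hypothetical two-state toy domain, not one of the paper's environments):

```python
# Toy deterministic MDP: states {0, 1}, actions {"left", "right"}.
# P[s][a] gives the successor state, R[s][a] the scalar reward.
P = {0: {"left": 0, "right": 1},
     1: {"left": 0, "right": 1}}
R = {0: {"left": 0.0, "right": 1.0},
     1: {"left": 0.0, "right": 0.0}}

s = 0
s_next = P[s]["right"]  # deterministic transition
r = R[s]["right"]       # reward for executing "right" in state 0
```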
Goal of the problem is to find a policy $\pi^*$ in the space of all possible policies $\Pi$ which maximizes the expectation of return $G_t$ at state $s_t$ over a potentially infinite horizon: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $0 \le \gamma < 1$ is the discount factor.
In order to search the policy space $\Pi$, we consider model-free reinforcement learning (RL). In this setting, an agent interacts with an environment defined as an MDP by executing a sequence of actions $a_t \in A$. In the fully observable case of RL, the agent knows its current state $s_t$ and the action space $A$, but not the effect of executing $a_t$ in $s_t$, i.e., $P$ and $R$. In order to find the optimal policy $\pi^*$, we focus on Q-Learning, a commonly used value-based approach. It is named for the action-value function $Q^\pi(s_t, a_t)$, which describes the expected return when taking action $a_t$ in state $s_t$ and then following policy $\pi$ for all states afterwards.
The optimal action-value function of policy $\pi^*$ is any action-value function that yields higher accumulated rewards than all other action-value functions, i.e., $Q^*(s_t, a_t) = \max_\pi Q^\pi(s_t, a_t)$. Q-Learning aims to approximate $Q^*$ by starting from an initial guess for $Q$, which is then updated via $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)$.
It uses experience samples of the form $(s_t, a_t, r_t, s_{t+1})$, where $r_t$ is the reward earned at time step $t$, i.e., by executing action $a_t$ when in state $s_t$. The learning rate $\alpha$ is a setup-specific parameter. The set of all experience samples taken at time steps $t \le T$ for some training limit $T$ is called the training set $\mathcal{D}$.
The learned action-value function $Q$ converges to the optimal action-value function $Q^*$, which then implies an optimal policy $\pi^*(s_t) = \arg\max_a Q^*(s_t, a)$.
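The update rule above can be sketched in a few lines of tabular Q-learning (the learning rate and discount factor below are illustrative values, not the paper's hyperparameters):

```python
from collections import defaultdict

# Sketch of the tabular Q-learning update; alpha and gamma are
# illustrative hyperparameter values.
alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def q_update(s, a, r, s_next, actions):
    """Apply one Q-learning update for the experience sample (s, a, r, s_next)."""
    # TD target: immediate reward plus discounted best value of the successor.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```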
In high-dimensional settings or when learning in continuous state-spaces, it is common to use parameterized function approximators like neural networks to approximate the action-value function: $Q(s, a; \theta) \approx Q^\pi(s, a)$, with $\theta$ specifying the weights of the neural network. When using a deep neural network as the function approximator, this approach is called deep reinforcement learning.
When dealing with uncertainty, a systematic way is via Bayesian inference. Its combination with neural networks in the form of Bayesian neural networks is realised by placing a probability distribution over the weight-values $\theta$ of the network. As calculating the exact Bayesian posterior quickly becomes computationally intractable for deep models, a popular solution is approximate inference [12, 6, 10, 14, 19, 11]. Another option is the construction of model ensembles, e.g. based on the idea of the statistical bootstrap. The resulting distribution of the ensemble predictions can then be used to approximate the uncertainty [24, 17].
Both approaches have been used for tasks as diverse as machine vision or disease detection. In the field of decision making, uncertainty is used to implicitly guide exploration, e.g. by creating an ensemble of models, or for learning safety predictors, e.g. predicting the probability of a collision. Recently, a distributional approach to RL was proposed which tries to learn the value distribution of a RL environment. Although this approach also models uncertainty, its goal of estimating the distribution of values is different from the work at hand, which tries to detect epistemic uncertainty, i.e. uncertainty in the model itself.
For the case of low-dimensional feature spaces, OOD detection (also called novelty detection) is a well-researched problem; surveys on the topic distinguish between probabilistic, distance-based, reconstruction-based, domain-based and information-theoretic methods. In recent years, several new methods based on deep neural networks were proposed for high-dimensional cases, mostly focusing on classification tasks, e.g. image classification. One line of work proposes a baseline for detecting OOD examples in neural networks based on the predicted class probabilities of a softmax classifier; this baseline was later improved upon by using temperature scaling and by adding perturbations to the input. Another line of work evaluates an alpha-divergence-based variational inference technique in an image classification task of adversarial examples. This can be understood as a form of OOD detection, as the generated adversarial examples lie outside of the training image manifold and consequently far from the training data. The authors report increased epistemic uncertainty, confirming the viability of their approach for the detection of adversarial image examples. The basic idea of this uncertainty-based approach is closely related to our proposed method, but no evaluation of the performance in a RL setting with non-stationary regression targets was performed. To the best of our knowledge, none of the previously mentioned methods were evaluated regarding their epistemic uncertainty detection performance in a RL setting.
In this paper we propose UBOOD, an uncertainty-based OOD-classifier that can be employed in value-based deep reinforcement learning settings. It is based on the reducibility of epistemic uncertainty in the action-value function approximation.
As previously described, epistemic uncertainty arises out of a lack of sufficient data to exactly infer the underlying system's data generating function. As such, it tends to be higher in areas of low data density. For generalized linear regression models as well as multi-layer neural networks, the epistemic uncertainty has been shown to be approximately inversely proportional to the density $p(x)$ of the input data: $\sigma^2_{epistemic}(x) \propto \frac{1}{p(x)}$. This also forms the basis of our approach: to use this inverse relation between epistemic uncertainty and data density in order to differentiate in- from out-of-distribution samples.
We define $u(s, a)$ as the epistemic uncertainty function of a given Q-function approximation $Q(s, a; \theta)$. If a suitable method for epistemic uncertainty estimation for deep neural networks is applied, the process of training the agent reduces $u(s, a)$ for those state-action tuples that were used for training, i.e., for which there exists a successor state $s_{t+1}$ and a reward $r_t$ so that $(s, a, r_t, s_{t+1}) \in \mathcal{D}$. $\mathcal{D}$ consequently defines the set of in-distribution data. By contrast, state-action tuples that were not encountered during training, i.e. $(s, a) \notin \mathcal{D}$, define the set of out-of-distribution data $\mathcal{D}_{OOD}$. The epistemic uncertainty of these state-action tuples is not reduced during training. Thus, the epistemic uncertainty of out-of-distribution data will be higher than that of in-distribution data: $u(s', a') > u(s, a)$ for $(s', a') \in \mathcal{D}_{OOD}$ and $(s, a) \in \mathcal{D}$.
UBOOD directly uses the output of the epistemic uncertainty function as the real-valued classification score. As is the case for many one-class classifiers, this real-valued score forms the input of a threshold-based decision function, which then assigns the in- or out-of-distribution class label.
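The resulting decision function is a plain comparison of the uncertainty score against the threshold; a minimal sketch:

```python
def ubood_classify(uncertainty_score, threshold):
    """UBOOD decision function (sketch): the epistemic uncertainty of the
    value function's prediction serves directly as the classification
    score; scores above the threshold are labelled out-of-distribution."""
    return "OOD" if uncertainty_score > threshold else "in-distribution"
```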
As is the case for any score-based one-class classification method, the classification threshold can be adjusted to modify the behaviour of the classifier, depending on the application's requirements. For many applications, where some amount of OOD data is intermixed with the training data and the percentage is known, this information can be used to specify the threshold. As there are, by definition, no OOD samples in our training data, such an approach is not possible here. As a viable first solution, we propose the following simple algorithm to calculate a dynamic classification threshold:
Calculate the mean $\mu_u$ and standard deviation $\sigma_u$ of the uncertainties of the in-distribution samples.
Treat these statistics as describing the probability distribution of in-distribution uncertainties and define the classification threshold as $\mu_u + \sigma_u$.
Thus, a dynamic threshold based on the uncertainty distribution is realized that adjusts over the training process as more data is gathered. Please note that more complex algorithms for the threshold determination can be developed, e.g. by using multimodal probability distributions to model the uncertainty distribution, or by making use of additional information about the available data on a per-application basis.
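Such a threshold computation can be sketched as follows, assuming the uncertainty scores are modelled by a simple unimodal distribution (mean plus one standard deviation is an illustrative choice here, and the scale factor `k` is a hypothetical knob):

```python
import numpy as np

def dynamic_threshold(train_uncertainties, k=1.0):
    """Dynamic classification threshold derived from the uncertainty
    distribution of the in-distribution (training) samples. Using
    mean + k * standard deviation assumes a simple unimodal model of
    the score distribution."""
    u = np.asarray(train_uncertainties, dtype=float)
    return u.mean() + k * u.std()
```

Recomputing this threshold as training progresses lets it track the shrinking uncertainty on in-distribution data.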
In principle, any of the epistemic uncertainty estimation methods mentioned in Section 3.1 that are applicable to the function approximator used to model the Q-function can be used in the UBOOD framework. In this paper, we evaluate three UBOOD versions, each using a different method for epistemic uncertainty estimation, and their effect on the OOD classification performance as the networks are being used by the RL agent for value estimation.
The Monte-Carlo Concrete Dropout method is based on the dropout variational inference architecture of Gal and Ghahramani. Instead of default dropout layers, we use concrete dropout layers, which do not require pre-specified dropout rates and instead learn individual dropout rates per layer. Figure 1(a) presents a schematic of the network used by this method.
This concrete dropout method is of special interest in our context of reinforcement learning, as here the available data change during the training process, rendering a manual optimization of the dropout rate hyperparameter even more difficult. Model loss is calculated by minimizing the negative log-likelihood of the predicted output distribution. The epistemic part of the total predictive uncertainty is then calculated as $\hat{\sigma}^2_{epistemic} = \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t^2 - \left(\frac{1}{T}\sum_{t=1}^{T} \hat{y}_t\right)^2$, with $\hat{y}_1, \ldots, \hat{y}_T$ the outputs of the $T$ Monte-Carlo forward passes.
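Given the $T$ sampled predictions from the stochastic forward passes, this estimate amounts to the variance of the samples:

```python
import numpy as np

def epistemic_from_mc_samples(y_hat):
    """Epistemic uncertainty from T Monte-Carlo forward passes, computed
    as the variance of the sampled predictions: E[y^2] - (E[y])^2."""
    y = np.asarray(y_hat, dtype=float)
    return np.mean(y ** 2) - np.mean(y) ** 2
```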
The Bootstrap method is based on a bootstrapped network architecture that represents an efficient implementation of the bootstrap principle by sharing a set of hidden layers between all members of the ensemble. In the network, the shared, fully-connected hidden layers are followed by an output layer of size $K$, called the bootstrap heads, as can be seen in Figure 1(b). For each datapoint, a Boolean mask of length equal to the number of heads is generated, which determines the heads this datapoint is visible to. The mask's values are set by drawing $K$ times from a masking distribution. For the work at hand, the values are independently drawn from Bernoulli distributions with either $p = 0.7$ or $p = 1.0$. In the case of $p = 1.0$, the bootstrap is reduced to a classic ensemble where all heads are trained on the complete data.
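The per-datapoint masking can be sketched as follows (a NumPy-based sketch with a fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mask(n_heads, p):
    """Boolean mask deciding which of the K bootstrap heads a datapoint is
    visible to; each entry is drawn i.i.d. from a Bernoulli(p) distribution."""
    return rng.random(n_heads) < p

mask = bootstrap_mask(10, 1.0)  # with p = 1.0, every head sees the datapoint
```

With p = 1.0 the masks are all-ones and the scheme degenerates to a classic ensemble trained on the complete data, as described above.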
The Bootstrap-Prior method extends the Bootstrap method with a so-called random Prior Network. It has the same basic architecture, but predictions are generated by adding the data-dependent output of this untrainable prior network to the output of the different bootstrap heads in order to calculate the ensemble posterior (Figure 1(c)). The authors of this approach conjecture that the addition of the randomized prior function outperforms deep ensemble-based methods without explicit priors, as for the latter, the initial weights have to act both as prior and training initializer.
For both bootstrap-based methods, epistemic uncertainty is calculated as the variance of the head outputs.
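A sketch of this variance computation, covering both the plain Bootstrap and the prior variant (the prior scale factor `beta` is a hypothetical hyperparameter of this sketch, not taken from the paper):

```python
import numpy as np

def ensemble_uncertainty(head_outputs, prior_outputs=None, beta=1.0):
    """Epistemic uncertainty of a bootstrap ensemble as the variance of the
    K head outputs. For the prior variant, each head's prediction is the sum
    of its trainable output and a fixed, untrainable prior output."""
    q = np.asarray(head_outputs, dtype=float)
    if prior_outputs is not None:
        # Bootstrap-Prior: posterior = trainable head + scaled random prior.
        q = q + beta * np.asarray(prior_outputs, dtype=float)
    return q.var()
```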
We evaluate three different versions of the UBOOD framework:
UB-MC: UBOOD with Monte-Carlo Concrete Dropout (MCCD) network
UB-B: UBOOD with Bootstrap network
UB-BP: UBOOD with Bootstrap-Prior network
The UB-MC version's estimator network consists of two fully-connected hidden layers with 64 neurons each, followed by two separate neurons in the output layer representing the mean $\mu$ and variance $\sigma^2$ of a normal distribution. As concrete dropout layers are used, no dropout probability has to be specified. Model loss and epistemic uncertainty are calculated as described in Section 4.
The UB-B Bootstrap and UB-BP Bootstrap-Prior versions both consist of two fully-connected hidden layers with 64 neurons each, which are shared between all heads, followed by an output layer of $K$ bootstrap heads.
Each of these UBOOD versions is further evaluated with two parametrizations of the respective epistemic uncertainty estimation method. UB-MC40 and UB-MC80 differ in the number of Monte-Carlo forward passes executed to approximate the epistemic uncertainty: $T = 40$ or $T = 80$ passes, respectively. The UB-B and UB-BP parametrizations (UB-B07, UB-B10, UB-BP07, UB-BP10) differ in the Bernoulli distribution used to determine the bootstrap mask: probability $p = 0.7$ for UB-B07 & UB-BP07 and $p = 1.0$ for UB-B10 & UB-BP10.
One of the problems in evaluating OOD detection for RL is the lack of datasets or environments which can be used for generating and assessing OOD samples in a controlled and reproducible way. In contrast to the field of image classification, where benchmark datasets like notMNIST exist that contain OOD samples, there are no equivalent sets for RL. We apply a principled approach to develop two environments, one using a gridworld-style discrete state-space, the other using a continuous state-space. Both environments allow systematic modifications after the training process, thus producing OOD states during evaluation.
The first environment is a simple gridworld pathfinding environment with a discrete state-space, built on a previously published two-room gridworld design. The basic layout consists of two rooms, separated by a vertical wall. Movement between the rooms is only possible via two hallways, as is visualised in Figure 3. The agent starts every episode at a random position on the grid (labeled S in Figure 3). Its task is to reach a specific goal position on the grid (labeled G in Figure 3), which also varies randomly every episode, by choosing one of the four possible movement actions (up, down, left, right).
The state of the environment is represented as a stack of three feature planes, with each plane representing the spatial positions of all environment objects of a specific type: agent, goal or wall. Each step of the agent incurs a small cost, except the goal-reaching action, which receives a positive reward and ends the episode. We evaluate the performance of the UBOOD framework on a set of environment configurations. All environment configurations share the same grid size and randomly vary the y-coordinate of the agent's start position as well as the goal position every episode. The first configuration, the only one used in training, varies the x-coordinate of the agent's start position and the goal position within fixed intervals. Each further environment configuration is then defined by shifting the start interval right compared to the previous configuration, while the goal interval is shifted left. This results in environment configurations with increasing difference from the training configuration, as can be seen in the example shown in Figure 2(b).
The continuous state-space environment is based on OpenAI's LunarLander environment. The goal is to safely land a rocket inside a defined landing pad without crashing; this task can be understood as rocket trajectory optimization. While the original environment defines a static position for the landing pad, our modified environment allows for random placement inside specified intervals. As the original environment does not encode the landing pad's position in the state representation, our version extends the state encoding to include the left and right x-coordinates as well as the y-coordinate of the pad. For evaluating the performance of the UBOOD framework in this continuous state-space environment, we created a set of configurations. The first configuration, the only one used in training, varies the x- and y-coordinates of the center of the landing pad within intervals that place the pad in the upper left side of the environment. An example of this configuration can be seen in Figure 3(a). Each further environment configuration is then defined by shifting the x-coordinate interval right compared to the previous configuration, while the y-coordinate interval is shifted left, resulting in pads placed increasingly towards the lower right side of the environment. Like in the gridworld environment, this produces environment configurations with increasing difference from the training configuration.
Note that training on both environments is solely performed using the respective training configuration. Evaluation runs are executed independently of the training process, based on model snapshots generated at the respective training episodes. Consequently, data collected during these evaluation runs is not used for training.
All evaluated versions learn successful policies on both the gridworld and LunarLander environments. Returns achieved by the trained policies on the different environment configurations are shown in Figure 5. As is to be expected, increasing changes to the environment reduce the achieved return, as the evaluation environment increasingly differs from the training configuration.
We further evaluate the relation between reported uncertainty and the return achieved by the agent. Figure 7 shows evaluation results of the UB-BP10 and UB-MC80 versions evaluated on different configurations of the gridworld environment. For UB-BP10, increases in uncertainty (caused by increasing environment modifications) are reflected in decreases of return. This behaviour was also present on the LunarLander environment and was consistent across parametrizations. No such clear relation was visible for UB-MC80. As can be seen in the results visualised in Figure 7, the uncertainty reported by the MCCD-based version decreases strongly between consecutive configurations, although the achieved return also decreases.
In this paper, we proposed UBOOD, an uncertainty-based out-of-distribution classification framework. Evaluation results show that using the epistemic uncertainty of the agent's value function presents a viable approach for OOD classification in a deep RL setting. We find that the framework's performance ultimately depends on the reliability of the underlying uncertainty estimation method: good uncertainty estimates are a prerequisite.
On both evaluation domains, UBOOD combined with ensemble-based bootstrap uncertainty estimation methods (UB-B / UB-BP) shows good results with high F1-scores, allowing for a reliable differentiation between in- and OOD samples. F1-scores increase as the environment configuration differs more from the training environment, i.e. the more strongly OOD the observed samples are, the more reliable the classification. The addition of a prior, as done in the UB-BP version, seems to have a positive effect on the separation of in- and out-of-distribution samples, as reflected in higher F1-scores on the LunarLander environment. By contrast, UBOOD combined with the concrete dropout-based uncertainty estimation method (UB-MC) does not produce viable results. Although increasing the number of Monte-Carlo samples improves the performance somewhat, the resulting classification performance is not on par with the Bootstrap-based versions. The reason for the large difference in performance between the Bootstrap-based and MCCD-based versions can be seen in the example shown in Figure 8. For the UB-B version, the reported uncertainties on the training configuration and a strongly modified configuration increasingly diverge with progressing training episodes (Figure 7(a)). As this is not the case for the UB-MC version (Figure 7(b)), only the Bootstrap-based version allows for an increasingly better differentiation between in- and OOD samples and consequently high F1-scores of the classifier. We found this effect to be consistent over all parametrizations of the Bootstrap- and MCCD-based versions we evaluated.
Our results match recent findings in which ensemble-based uncertainty estimators were compared against Monte-Carlo Dropout-based ones for the case of active learning in image classification. Results presented in that work also showed that ensembles performed better and led to better-calibrated uncertainty estimates. As a possible explanation, the authors argue that the difference in performance results from a combination of decreased model capacity and lower diversity of the Monte-Carlo Dropout methods when compared to ensemble approaches. This effect would also explain the behaviour we observed when comparing reported uncertainty and achieved return. While there is a strong inverse relation visible when using Bootstrap-based UBOOD versions, no clear pattern emerged for the evaluated MCCD-based versions. We think that further research into the relation between epistemic uncertainty and achieved return when train- and test-environments differ could provide interesting insights relating to generalization performance in deep RL. Being able to differentiate between an agent having encountered a situation in training versus the agent generalizing its experience to new situations could provide a substantial benefit in safety-critical applications.
Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation 4, pp. 448–472.