Recently, there has been an increased interest in Bayesian neural networks (BNNs) and their possible use in reinforcement learning (RL) problems (Gal et al., 2016; Blundell et al., 2015; Houthooft et al., 2016). In particular, recent work has extended BNNs with a latent variable model to describe complex stochastic functions (Depeweg et al., 2016; Moerland et al., 2017). The proposed approach enables the automatic identification of arbitrary stochastic patterns such as multimodality and heteroskedasticity, without having to manually incorporate these into the model.
In model-based RL, the aforementioned BNNs with latent variables can be used to describe complex stochastic dynamics. The BNNs encode a probability distribution over stochastic functions, with each function serving as an estimate of the ground truth continuous Markov Decision Process (MDP). Such a probability distribution can then be used for policy search, by finding the optimal policy with respect to state trajectories simulated from the model. BNNs with latent variables produce improved probabilistic predictions, and these result in better-performing policies (Depeweg et al., 2016; Moerland et al., 2017).
We can identify two distinct forms of uncertainty in the class of models given by BNNs with latent variables. As described by Kendall & Gal (2017), "Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model." In particular, epistemic uncertainty arises from our lack of knowledge of the values of the synaptic weights in the network, whereas aleatoric uncertainty originates from our lack of knowledge of the value of the latent variables.
In the domain of model-based RL, epistemic uncertainty is the source of model bias (or representational bias, see e.g. Joseph et al. (2013)). When there is a high discrepancy between model and real-world dynamics, policy behavior may deteriorate. In analogy to the principle that "a chain is only as strong as its weakest link," a drastic error in estimating the ground truth MDP at a single transition step can render the complete policy useless (see e.g. Schneegass et al. (2008)).
In this work we address the decomposition of the uncertainty present in the predictions of BNNs with latent variables into its epistemic and aleatoric components. We show the usefulness of such decomposition in two different domains: active learning and risk-sensitive RL.
First we consider an active learning scenario with stochastic functions. We derive an information-theoretic objective that decomposes the entropy of the predictive distribution of BNNs with latent variables into its epistemic and aleatoric components. Building on that decomposition, we then investigate safe RL using a risk-sensitive criterion (García & Fernández, 2015) which focuses only on risk related to model bias, that is, the risk of the policy performing significantly differently at test time than predicted at training time. The proposed criterion quantifies the amount of epistemic uncertainty (model-bias risk) in the model's predictive distribution and ignores any risk stemming from the aleatoric uncertainty. Our experiments show that, by using this risk-sensitive criterion, we are able to find policies that, when evaluated on the ground truth MDP, are safe in the sense that on average they do not deviate significantly from the performance predicted by the model at training time.
We focus on the off-policy batch RL scenario, in which we are given an initial batch of data from an already-running system and are asked to find a better policy. Such scenarios are common in real-world industry settings such as turbine control, where exploration is restricted to avoid possible damage to the system.
2 Bayesian Neural Networks with Latent Variables
Given a dataset $\mathcal{D} = \{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^N$, formed by feature vectors $\mathbf{x}_n \in \mathbb{R}^D$ and targets $\mathbf{y}_n \in \mathbb{R}^K$, we assume that $\mathbf{y}_n = f(\mathbf{x}_n, z_n; \mathcal{W}) + \boldsymbol{\epsilon}_n$, where $f(\cdot, \cdot; \mathcal{W})$ is the output of a neural network with weights $\mathcal{W}$ and $K$ output units. The network receives as input the feature vector $\mathbf{x}_n$ and the latent variable $z_n \sim \mathcal{N}(0, \gamma)$. The activation functions for the hidden layers are rectifiers: $\varphi(x) = \max(x, 0)$. The activation functions for the output layers are the identity function: $\varphi(x) = x$. The network output is corrupted by the additive noise variable $\boldsymbol{\epsilon}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ with diagonal covariance matrix $\boldsymbol{\Sigma}$. The role of the latent variable $z_n$ is to capture unobserved stochastic features that can affect the network's output in complex ways. Without $z_n$, randomness is only given by the additive Gaussian observation noise $\boldsymbol{\epsilon}_n$, which can only describe limited stochastic patterns. The network has $L$ layers, with $V_l$ hidden units in layer $l$, and $\mathcal{W} = \{\mathbf{W}_l\}_{l=1}^L$ is the collection of $V_l \times (V_{l-1}+1)$ weight matrices. The $+1$ is introduced here to account for the additional per-layer biases. We approximate the exact posterior distribution $p(\mathcal{W}, \mathbf{z} \,|\, \mathcal{D})$ with the factorized Gaussian distribution

$q(\mathcal{W}, \mathbf{z}) = \Big[ \prod_{l=1}^{L} \prod_{i=1}^{V_l} \prod_{j=1}^{V_{l-1}+1} \mathcal{N}(w_{ij,l} \,|\, m^w_{ij,l}, v^w_{ij,l}) \Big] \Big[ \prod_{n=1}^{N} \mathcal{N}(z_n \,|\, m^z_n, v^z_n) \Big]\,.$
The parameters $m^w_{ij,l}$, $v^w_{ij,l}$ and $m^z_n$, $v^z_n$ are determined by minimizing a divergence between $p(\mathcal{W}, \mathbf{z} \,|\, \mathcal{D})$ and the approximation $q$. For more detail the reader is referred to the work of Hernández-Lobato et al. (2016); Depeweg et al. (2016). In all our experiments we use black-box $\alpha$-divergence minimization with $\alpha = 1.0$, as it seems to produce a better decomposition of uncertainty into its epistemic and aleatoric components, although further studies are needed to strengthen this claim.
The described BNNs with latent variables can describe complex stochastic patterns while at the same time accounting for model uncertainty. They achieve this by jointly learning $q(\mathbf{z})$, which captures the specific values of the latent variables in the training data, and $q(\mathcal{W})$, which represents uncertainty about the model parameters. The result is a principled Bayesian approach for the inference of stochastic functions.
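To make the generative process concrete, below is a minimal sketch of sampling from the predictive distribution of such a BNN. All shapes, the variational means and variances, and the noise levels are illustrative assumptions, not values produced by an actual inference run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 1-D input, one hidden layer of H units, 1-D output.
H = 20

# q(W): factorized Gaussian over the weights. The means/variances below are
# random placeholders standing in for the result of variational inference.
m_w1, v_w1 = rng.normal(size=(H, 3)), 0.01 * np.ones((H, 3))   # rows act on (x, z, bias)
m_w2, v_w2 = rng.normal(size=(1, H + 1)), 0.01 * np.ones((1, H + 1))

gamma = 1.0      # prior variance of the latent variable z
sigma_eps = 0.1  # std of the additive output noise epsilon

def sample_predictive(x, n_samples=1000):
    """Draw samples of y ~ p(y | x): sample W ~ q(W) (epistemic), then
    z ~ N(0, gamma) (aleatoric), then push (x, z) through the network."""
    ys = np.empty(n_samples)
    for i in range(n_samples):
        W1 = m_w1 + np.sqrt(v_w1) * rng.normal(size=m_w1.shape)
        W2 = m_w2 + np.sqrt(v_w2) * rng.normal(size=m_w2.shape)
        z = rng.normal(0.0, np.sqrt(gamma))
        h = np.maximum(W1 @ np.array([x, z, 1.0]), 0.0)   # ReLU hidden layer
        y = W2 @ np.append(h, 1.0)                        # identity output layer
        ys[i] = y[0] + sigma_eps * rng.normal()           # additive Gaussian noise
    return ys

samples = sample_predictive(0.5)
```

Each predictive sample thus mixes both sources of randomness; fixing `W1, W2` across samples while resampling `z` would isolate the aleatoric part.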
3 Active Learning of Stochastic Functions
Active learning is the problem of choosing which data points to incorporate next into the training data so that the resulting gains in predictive performance are as high as possible. In this section, we derive a Bayesian active learning procedure for stochastic functions. This procedure illustrates how to separate two sources of uncertainty, that is, aleatoric and epistemic, in the predictive distribution of BNNs with latent variables.
Within a Bayesian setting, active learning can be formulated as choosing data based on the expected reduction in entropy of the posterior distribution (MacKay, 1992). Hernández-Lobato & Adams (2015) apply this entropy-based approach to scalable BNNs. In (Houthooft et al., 2016), the authors use a similar approach as an exploration scheme in RL problems in which a BNN is used to represent current knowledge about the transition dynamics. These previous works only assume additive Gaussian noise and, unlike the BNNs with latent variables from Section 2, they cannot capture complex stochastic patterns.
We start by deriving the expected reduction in entropy in BNNs with latent variables. We assume a scenario in which a BNN with latent variables has been fitted to a batch of data $\mathcal{D}$ to produce a posterior approximation $q(\mathcal{W})$. We now want to estimate the expected reduction in posterior entropy for $\mathcal{W}$ when a particular data point $\mathbf{x}$ is incorporated in the training set. The expected reduction in entropy is

$\text{H}(\mathcal{W}) - \mathbf{E}_{\mathbf{y} \sim p(\mathbf{y}|\mathbf{x})}\big[\text{H}(\mathcal{W} \,|\, \mathbf{y}, \mathbf{x})\big] = \text{I}(\mathcal{W}; \mathbf{y}) \quad (6)$

$= \text{H}(\mathbf{y} \,|\, \mathbf{x}) - \mathbf{E}_{q(\mathcal{W})}\big[\text{H}(\mathbf{y} \,|\, \mathcal{W}, \mathbf{x})\big]\,, \quad (7)$

where the second line follows from the symmetry of mutual information.
Here $\text{H}(\cdot)$ denotes the entropy of a random variable, and $\text{H}(\cdot\,|\,\cdot)$ and $\text{I}(\cdot\,;\cdot)$ denote the conditional entropy of and the mutual information between two random variables. In (6) and (7) we see that the active learning objective is given by the difference of two terms. The first term is the entropy of the predictive distribution, that is, $\text{H}(\mathbf{y}\,|\,\mathbf{x})$. The second term, that is, $\mathbf{E}_{q(\mathcal{W})}[\text{H}(\mathbf{y}\,|\,\mathcal{W},\mathbf{x})]$, is a conditional entropy. To compute this term, we average the entropy $\text{H}(\mathbf{y}\,|\,\mathcal{W},\mathbf{x})$ across $\mathcal{W} \sim q(\mathcal{W})$. As shown in this expression, for fixed $\mathcal{W}$ the randomness in $\mathbf{y}$ has its origin in the latent variable $z$ (and the constant output noise $\boldsymbol{\epsilon}$, which is not shown here). Therefore, this second term can be interpreted as the 'aleatoric uncertainty' present in the predictive distribution, that is, the average entropy of $\mathbf{y}$ that originates from the latent variable $z$ and not from the uncertainty about $\mathcal{W}$. We can refer to the whole objective function in (7) as an estimate of the epistemic uncertainty: the full predictive uncertainty about $\mathbf{y}$ given $\mathbf{x}$ minus the corresponding aleatoric uncertainty.
The previous decomposition is also illustrated by the information diagram in Figure 1. The entropy of the predictive distribution, $\text{H}(\mathbf{y}\,|\,\mathbf{x})$, is composed of the blue, cyan, grey and pink areas. The blue area is constant: when both $\mathcal{W}$ and $z$ are determined, the entropy of $\mathbf{y}$ is constant and given by the entropy of the additive Gaussian noise $\boldsymbol{\epsilon}$. The aleatoric term $\mathbf{E}_{q(\mathcal{W})}[\text{H}(\mathbf{y}\,|\,\mathcal{W},\mathbf{x})]$ is given by the light and dark blue areas. The reduction in entropy is therefore given by the grey and pink areas.
The quantity in equation (7) can be approximated using standard entropy estimators, e.g. nearest-neighbor methods (Kozachenko & Leonenko, 1987; Kraskov et al., 2004; Gao et al., 2016). For that, we repeatedly sample $\mathcal{W}_i \sim q(\mathcal{W})$ and $z_j \sim \mathcal{N}(0, \gamma)$ and do forward passes through the neural network to sample $\mathbf{y}$. The resulting samples of $\mathbf{y}$ can then be used to approximate the respective entropies for each $\mathcal{W}_i$ using the nearest-neighbor approach:

$\text{H}(\mathbf{y}\,|\,\mathbf{x}) - \mathbf{E}_{q(\mathcal{W})}\big[\text{H}(\mathbf{y}\,|\,\mathcal{W},\mathbf{x})\big] \approx \hat{\text{H}}\big(\{\mathbf{y}_{i,j}\}_{i,j=1}^{M}\big) - \frac{1}{M} \sum_{i=1}^{M} \hat{\text{H}}\big(\{\mathbf{y}_{i,j}\}_{j=1}^{M}\big)\,,$

where $\hat{\text{H}}(\cdot)$ computes the nearest-neighbor estimate of the entropy given an empirical sample of points, $\mathbf{y}_{i,j} = f(\mathbf{x}, z_j; \mathcal{W}_i) + \boldsymbol{\epsilon}_j$, and $\mathcal{W}_i \sim q(\mathcal{W})$, $z_j \sim \mathcal{N}(0, \gamma)$, $\boldsymbol{\epsilon}_j \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ for $i, j = 1, \ldots, M$.
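The estimator in (7) can be sketched numerically. The minimal illustration below combines a Kozachenko-Leonenko 1-nearest-neighbor entropy estimate with the sampling scheme described above; the `sample_y_given_w` interface, the toy Gaussian example, and all sample sizes are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

EULER_GAMMA = 0.5772156649

def _digamma(n):
    # Asymptotic approximation of the digamma function, accurate for n >= 10.
    return np.log(n) - 1.0 / (2 * n) - 1.0 / (12 * n ** 2)

def kl_entropy(samples):
    """Kozachenko-Leonenko 1-nearest-neighbor entropy estimate (nats, 1-D)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    gaps = np.diff(x)                     # x[i+1] - x[i]
    r = np.empty(n)                       # distance to nearest neighbor
    r[0], r[-1] = gaps[0], gaps[-1]
    r[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    r = np.maximum(r, 1e-12)              # guard against duplicate points
    # H ~ psi(n) - psi(1) + ln(V_1) + mean(ln r), with psi(1) = -gamma, V_1 = 2
    return _digamma(n) + EULER_GAMMA + np.log(2.0) + np.mean(np.log(r))

def entropy_decomposition(sample_y_given_w, n_w=50, n_y=500):
    """Approximate H(y|x) - E_W[H(y|W,x)].  `sample_y_given_w(i, n)` returns
    n predictive samples under the i-th draw of W (a hypothetical interface
    standing in for forward passes through the BNN)."""
    per_w = [sample_y_given_w(i, n_y) for i in range(n_w)]
    total = kl_entropy(np.concatenate(per_w))                  # H(y|x)
    aleatoric = float(np.mean([kl_entropy(s) for s in per_w])) # E_W[H(y|W,x)]
    return total - aleatoric, total, aleatoric

# Toy check: y | W ~ N(mu_W, 1) with mu_W ~ N(0, 2^2).  The aleatoric part
# should be near 0.5*ln(2*pi*e) ~ 1.42 nats, the epistemic part near
# 0.5*ln(5) ~ 0.80 nats.
mus = rng.normal(0.0, 2.0, size=50)
epistemic, total, aleatoric = entropy_decomposition(
    lambda i, n: rng.normal(mus[i], 1.0, size=n))
```

In the actual procedure, the per-`W` sampler would fix a weight draw and resample only the latent variable and the output noise.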
3.1 Toy Problems
We will now illustrate the active learning procedure described in the previous section on two toy examples. In each problem we first train a BNN with 2 hidden layers and 20 units per layer on the available data. Afterwards, we approximate the information-theoretic measures as outlined in the previous section.
We first consider a toy problem given by a regression task with heteroskedastic noise. For this, we define the stochastic function $y = 7 \sin(x) + 3\,|\cos(x/2)|\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. The data availability is limited to specific regions of $x$. In particular, we sample 750 values of $x$ from a mixture of three Gaussians with mean parameters $\{-4, 0, 4\}$, each with standard deviation $2/5$ and with each Gaussian component having weight equal to $1/3$ in the mixture. Figure 8(a) shows the raw data. We have lots of points at both borders of the $x$ axis and in the center, but little data available in between.
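A sketch of a heteroskedastic toy-data generator in this spirit (the functional form and mixture constants here should be treated as illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_heteroskedastic(n=750):
    """Draw x from a three-component Gaussian mixture and then
    y = 7 sin(x) + 3 |cos(x/2)| * eps with eps ~ N(0, 1)."""
    means = np.array([-4.0, 0.0, 4.0])       # mixture means
    comp = rng.integers(0, 3, size=n)        # equal component weights 1/3
    x = rng.normal(means[comp], 2.0 / 5.0)   # component std 2/5
    y = 7.0 * np.sin(x) + 3.0 * np.abs(np.cos(x / 2.0)) * rng.normal(size=n)
    return x, y

x, y = sample_heteroskedastic()
```

The noise magnitude varies with $x$ through the $|\cos(x/2)|$ factor, which is what makes the task heteroskedastic.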
Figure 8 visualizes the respective quantities. We see that the BNN with latent variables produces an accurate decomposition of its predictive uncertainty into epistemic and aleatoric components: the approximate reduction in entropy, as shown in Figure 8(f), seems to be inversely proportional to the density used to sample the data (shown in Figure 8(b)). This makes sense, since in this toy problem the most informative data points are expected to be located in regions where data is scarce. Note that, in more complicated settings, the most informative data points may not satisfy this property.
Next we consider a toy problem given by a regression task with bimodal data. We define $x \in [-0.5, 2]$ and $y = 10 \sin(x) + \epsilon$ with probability $0.5$ and $y = 10 \cos(x) + \epsilon$, otherwise, where $\epsilon \sim \mathcal{N}(0, 1)$ and is independent of $x$. The data availability is not uniform in $x$. In particular, we sample 750 values of $x$ from an exponential distribution supported on this interval.
Figure 15 visualizes the respective quantities. The predictive distribution shown in Figure 15(c) suggests that the BNN has learned the bimodal structure in the data. The predictive distribution appears to get increasingly 'washed out' as we increase $x$. This increase in entropy as a function of $x$ is shown in Figure 15(d). The conditional entropy of the predictive distribution shown in Figure 15(e) appears to be symmetric around $x = \pi/4$. This suggests that the BNN has correctly learned to separate the aleatoric component from the full uncertainty for the problem at hand: the ground truth function is symmetric around $x = \pi/4$, the point at which it changes from a bimodal to a unimodal stochastic function (since $\sin(x) = \cos(x)$ at $x = \pi/4$). Figure 15(f) shows the estimate of the reduction in entropy for each $x$. Here we can observe two effects. First, as expected, the expected entropy reduction increases with higher $x$. Second, we see a slight decrease towards the right border of the input range. We believe the reason for this is twofold: because the data is limited to $x \in [-0.5, 2]$, we expect a higher level of uncertainty in the vicinity of both borders. Furthermore, we expect that learning a bimodal function requires more data to reach the same level of confidence than a unimodal function.
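A matching sketch of a bimodal data generator (the input-distribution parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_bimodal(n=750):
    """With probability 0.5, y = 10 sin(x) + eps; otherwise y = 10 cos(x) + eps,
    with eps ~ N(0, 1) independent of x.  Inputs are skewed toward the left
    of the interval [-0.5, 2] via a (clipped) shifted exponential."""
    x = np.clip(rng.exponential(scale=0.5, size=n) - 0.5, -0.5, 2.0)
    branch = rng.random(n) < 0.5
    y = np.where(branch, 10.0 * np.sin(x), 10.0 * np.cos(x)) + rng.normal(size=n)
    return x, y

x, y = sample_bimodal()
```

Near $x = \pi/4$ the two branches coincide and the conditional distribution of $y$ is unimodal; away from that point the two branches separate and bimodality appears.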
4 Risk-Sensitive Reinforcement Learning
In the previous section we studied how BNNs with latent variables can be used for active learning of stochastic functions. The resulting algorithm is based on a decomposition of predictive uncertainty into its aleatoric and epistemic components. In this section we build on this result to derive a new risk-sensitive objective for model-based RL that aims to minimize the effect of model bias. Our new risk criterion enforces that the learned policies, when evaluated on the ground truth system, are safe in the sense that on average they do not deviate significantly from the performance predicted by the model at training time.
Similar to Depeweg et al. (2016), we consider the domain of batch reinforcement learning. In this setting we are given a batch of state transitions $\mathcal{D} = \{(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1})\}$ formed by triples containing the current state $\mathbf{s}_t$, the action applied $\mathbf{a}_t$ and the next state $\mathbf{s}_{t+1}$. For example, $\mathcal{D}$ may be formed by measurements taken from an already-running system. In addition to $\mathcal{D}$, we are also given a cost function $c$. The goal is to obtain from $\mathcal{D}$ a policy in parametric form that minimizes $c$ on average under the system dynamics.
The aforementioned problem can be solved using model-based policy search methods. These methods include two key parts (Deisenroth et al., 2013). The first part consists in learning a dynamics model from $\mathcal{D}$. We assume that the true dynamical system can be expressed by an unknown neural network with stochastic inputs:

$\mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{a}_t, z_t; \mathcal{W})\,, \qquad z_t \sim \mathcal{N}(0, \gamma)\,, \quad (9)$
where $\mathcal{W}$ denotes the synaptic weights of the network and $\mathbf{s}_t$, $\mathbf{a}_t$ and $z_t$ are the inputs to the network. In the second part of our model-based policy search algorithm, we optimize a parametric policy given by a deterministic neural network with synaptic weights $\theta$. This parametric policy computes the action $\mathbf{a}_t$ as a function of $\mathbf{s}_t$, that is, $\mathbf{a}_t = \pi(\mathbf{s}_t; \theta)$. We optimize $\theta$ to minimize the expected cost $C = \sum_{t=1}^{T} c(\mathbf{s}_t)$ over a finite horizon $T$ with respect to our belief $q(\mathcal{W})$. This expected cost is obtained by averaging over multiple virtual roll-outs. For each roll-out we choose $\mathbf{s}_0$ randomly from the states in $\mathcal{D}$, sample $\mathcal{W} \sim q(\mathcal{W})$ and then simulate state trajectories using the model with policy $\pi(\cdot\,; \theta)$, input noise $z_t \sim \mathcal{N}(0, \gamma)$ and additive noise $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$. This procedure allows us to obtain estimates of the policy's expected cost for any particular cost function. If model, policy and cost function are differentiable, we are then able to tune $\theta$ by stochastic gradient descent over the roll-out average.
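The virtual roll-out procedure can be sketched as follows. The dynamics function, cost and policy below are illustrative stand-ins: a real implementation would replace them with the trained BNN (with weights sampled from q(W)), the task cost and the policy network:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 1-D toy system standing in for the learned model
# s' = f(s, a, z; W) + eps.  "Sampling W" is mimicked by a scalar parameter.
def sample_w():
    return rng.normal(1.0, 0.05)             # stand-in for W ~ q(W)

def f(s, a, z, w):
    return w * (0.9 * s + a) + 0.3 * z       # stand-in dynamics network

def cost(s):
    return s ** 2                            # quadratic cost c(s)

def rollout(policy, s0, w, T=20):
    """One virtual roll-out: fresh z_t and eps_t are drawn at every step."""
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)
        z = rng.normal()                     # latent input z_t ~ N(0, gamma)
        eps = 0.01 * rng.normal()            # additive output noise eps_t
        s = f(s, a, z, w) + eps
        total += cost(s)
    return total

def expected_cost(policy, start_states, n_rollouts=200):
    """Average cost over roll-outs, with a fresh draw of W per roll-out."""
    costs = [rollout(policy, rng.choice(start_states), sample_w())
             for _ in range(n_rollouts)]
    return float(np.mean(costs))

J = expected_cost(lambda s: -0.5 * s, np.array([-1.0, 0.0, 1.0]))
```

With a differentiable model, policy and cost, the same roll-out average can be differentiated with respect to the policy parameters for gradient-based training.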
Given the cost function $c$, the objective to be optimized by the policy search algorithm is

$J(\theta) = \mathbf{E}\Big[\sum_{t=1}^{T} c(\mathbf{s}_t)\Big]\,. \quad (10)$
Risk-sensitive criteria (Mihatsch & Neuneier, 2002) often use the standard deviation of the cost $C = \sum_{t=1}^{T} c(\mathbf{s}_t)$ as risk measure. High risk is associated with high variability in the cost $C$. To penalize risk, the new objective to be optimized is given by

$J_{\sigma}(\theta) = \mathbf{E}[C] + \beta\, \sigma(C)\,, \quad (11)$
where $\sigma(C)$ denotes the standard deviation of the cost $C$ and the free parameter $\beta$ determines the amount of risk-avoidance ($\beta > 0$) or risk-seeking behavior ($\beta < 0$). In this standard setting, the variability of the cost originates from two different sources: first, from the existing uncertainty over the model parameters $\mathcal{W}$ and, secondly, from the intrinsic stochasticity of the dynamics.
One of the main dangers of model-based RL is model bias: the discrepancy of policy behavior under a) the assumed model and b) the ground truth MDP. While we cannot avoid the existence of such discrepancy when data is limited, we wish to guide the policy search towards policies that stay in state spaces where the risk of model bias is low. For this, we can define the model bias as follows:

$b(\theta) = \sum_{t=1}^{T} \Big|\, \mathbf{E}_{\text{model}}\big[c(\mathbf{s}_t)\big] - \mathbf{E}_{\text{truth}}\big[c(\mathbf{s}_t)\big] \,\Big|\,, \quad (12)$
where $\mathbf{E}_{\text{truth}}[c(\mathbf{s}_t)]$ is the expected cost obtained at time $t$ when starting at the initial state $\mathbf{s}_0$ and the ground truth dynamics are evolved according to policy $\pi(\cdot\,; \theta)$. Note that we focus on having similar expectations of the individual state costs $c(\mathbf{s}_t)$ instead of having similar expectations of the final episode cost $C$. The former is a stricter criterion, since it may occur that model and ground truth diverge but both give roughly the same cost on average.
As indicated in (9), we assume that the true dynamics are determined by a neural network with latent variables and weights given by $\mathcal{W}^\star$. By using the approximate posterior $q(\mathcal{W})$, and assuming that $\mathcal{W}^\star \sim q(\mathcal{W})$, we can obtain an upper bound on the expected model bias as follows:

$\mathbf{E}\big[b(\theta)\big] \leq \sum_{t=1}^{T} \mathbf{E}_{\mathcal{W} \sim q}\Big[\, \big|\, \mathbf{E}\big[c(\mathbf{s}_t) \,|\, \mathcal{W}\big] - \mathbf{E}_{\mathcal{W}' \sim q}\big[\mathbf{E}[c(\mathbf{s}_t) \,|\, \mathcal{W}']\big] \,\big| \,\Big]\,. \quad (13)$
We note that $\mathbf{E}[c(\mathbf{s}_t) \,|\, \mathcal{W}]$ is the expected cost of a policy under the dynamics given by $\mathcal{W}$. The expectation integrates out the influence of the latent variables $z_t$ and the output noise $\boldsymbol{\epsilon}_t$. The right-hand side of (13) can thereby be interpreted as the variability of the cost that originates from our uncertainty over the dynamics, given by the distribution $q(\mathcal{W})$.
In Section 3 we showed how (7) encodes a decomposition of the entropy of the predictive distribution into its aleatoric and epistemic components. That decomposition naturally arises from an information-theoretic approach to active learning. We can express $\sigma^2(c(\mathbf{s}_t))$ in a similar way using the law of total variance:

$\sigma^2\big(c(\mathbf{s}_t)\big) = \mathbf{E}_{q(\mathcal{W})}\big[\sigma^2\big(c(\mathbf{s}_t) \,|\, \mathcal{W}\big)\big] + \sigma^2_{q(\mathcal{W})}\big(\mathbf{E}\big[c(\mathbf{s}_t) \,|\, \mathcal{W}\big]\big)\,,$

where the first term is the aleatoric and the second term the epistemic contribution to the variability of the cost.
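The law of total variance underlying this decomposition is easy to verify numerically. The sketch below simulates costs whose variability has a controlled epistemic part (spread of per-W means) and aleatoric part (within-W noise); the particular numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated costs: each row corresponds to one draw of W, each column to one
# roll-out under that draw.
M, N = 200, 200
mean_per_w = rng.normal(0.0, 2.0, size=(M, 1))           # epistemic spread of E[C|W]
costs = mean_per_w + rng.normal(0.0, 1.0, size=(M, N))   # aleatoric spread around it

total_var = costs.var()                 # sigma^2(C)
aleatoric = costs.var(axis=1).mean()    # E_W[ sigma^2(C | W) ]
epistemic = costs.mean(axis=1).var()    # sigma^2_W( E[C | W] )
```

With population variances (`ddof=0`, numpy's default) and equal numbers of roll-outs per draw of W, the identity `total_var == aleatoric + epistemic` holds exactly, up to floating-point error.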
We extend the policy search objective of (10) with a risk component given by an approximation to the model bias. Similar to Depeweg et al. (2016), we derive a Monte Carlo approximation that enables optimization by gradient descent. For this, we perform roll-outs by first sampling $\mathcal{W}_i \sim q(\mathcal{W})$ a total of $M$ times and then, for each of these samples of $\mathcal{W}$, performing $N$ roll-outs in which $\mathcal{W}_i$ is fixed and we only sample the latent variables and the additive Gaussian noise. In particular,

$\hat{J}(\theta) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{t=1}^{T} c\big(\mathbf{s}_t^{(i,j)}\big) + \beta \sum_{t=1}^{T} \hat{\sigma}_{\mathcal{W}}\Big( \frac{1}{N} \sum_{j=1}^{N} c\big(\mathbf{s}_t^{(i,j)}\big) \Big)\,, \quad (14)$
where $c(\mathbf{s}_t^{(i,j)})$ is the cost obtained at time $t$ in a roll-out generated by using a policy with parameters $\theta$, a transition function parameterized by $\mathcal{W}_i$ and latent variable values $z_t^{(i,j)}$, with additive noise values $\boldsymbol{\epsilon}_t^{(i,j)}$. $\hat{\sigma}_{\mathcal{W}}$ is an empirical estimate of the standard deviation calculated over the $M$ draws of $\mathcal{W}$.
The free parameter $\beta$ determines the importance of the risk criterion. As described above, the proposed approximation generates $M \times N$ roll-out trajectories for each starting state $\mathbf{s}_0$. For this, we sample $\mathcal{W}_i \sim q(\mathcal{W})$ for $i = 1, \ldots, M$ and, for each $\mathcal{W}_i$, we then do $N$ roll-outs with different draws of the latent variables and the additive Gaussian noise. We average across all $M \times N$ roll-outs to estimate $\mathbf{E}[c(\mathbf{s}_t)]$. Similarly, for each $\mathcal{W}_i$, we average across the corresponding $N$ roll-outs to estimate $\mathbf{E}[c(\mathbf{s}_t) \,|\, \mathcal{W}_i]$. Finally, we compute the empirical standard deviation of the resulting $M$ estimates to approximate $\hat{\sigma}_{\mathcal{W}}(\mathbf{E}[c(\mathbf{s}_t) \,|\, \mathcal{W}])$.
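A minimal sketch of the Monte Carlo risk estimate in equation (14), assuming the per-time-step roll-out costs have already been collected in an M x N x T array (the simulated costs below are arbitrary placeholders, not outputs of a trained model):

```python
import numpy as np

rng = np.random.default_rng(6)

def risk_sensitive_objective(cost_tensor, beta):
    """cost_tensor[i, j, t]: cost at time t of the j-th roll-out under the
    i-th draw of W.  Returns sum_t E[c_t] + beta * sum_t sigma_W(E[c_t | W])."""
    e_ct_given_w = cost_tensor.mean(axis=1)          # shape (M, T): E[c_t | W_i]
    expected = cost_tensor.mean(axis=(0, 1)).sum()   # sum_t E[c_t]
    risk = e_ct_given_w.std(axis=0).sum()            # sum_t sigma_W(E[c_t | W])
    return float(expected + beta * risk)

# Placeholder roll-out costs with an explicit per-W offset, so that part of
# the variability is epistemic.
M, N, T = 20, 10, 5
costs = rng.normal(1.0, 0.3, size=(M, N, T)) + rng.normal(0.0, 0.5, size=(M, 1, 1))

J0 = risk_sensitive_objective(costs, beta=0.0)   # risk-neutral value
J5 = risk_sensitive_objective(costs, beta=5.0)   # risk-averse value
```

Since the risk term is a sum of standard deviations, it is non-negative, so increasing beta can only increase the objective for a fixed set of roll-outs.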
4.1 Application: Industrial Benchmark
We now show the effectiveness of the proposed method on a stochastic dynamical system. For this we use the industrial benchmark, a high-dimensional stochastic model inspired by properties of real industrial systems. A detailed description and example experiments can be found in (Hein et al., 2016; Depeweg et al., 2016), with Python source code available at https://github.com/siemens/industrialbenchmark and https://github.com/siemens/policy_search_bb-alpha.
In our experiments, we first define a behavior policy that is used to collect data by interacting with the system. This policy is used to perform three roll-outs for each considered setpoint value. The setpoint is a hyper-parameter of the industrial benchmark that indicates the complexity of its dynamics. The setpoint is included in the state vector as a non-controllable variable that is constant throughout the roll-outs. Policies in the industrial benchmark specify changes $\Delta v$, $\Delta g$ and $\Delta s$ in three steering variables, velocity $v$, gain $g$ and shift $s$, as a function of the current state. In the behavior policy these changes are stochastic and state-dependent.
The velocity and gain can take values in $[0, 100]$. Therefore, the data collection policy will try to keep these values only in a medium range of this domain. Because of this, large parts of the state space will be unobserved. After collecting the data, the state transitions are used to train a BNN with latent variables with the same hyperparameters as in (Depeweg et al., 2016).
After this, we train different policies using the Monte Carlo approximation described in equation (14). We consider different choices of $\beta$ and use a finite horizon of $T$ steps, with $M$ samples of $\mathcal{W}$, $N$ roll-outs per sample and a fixed minibatch size.
Performance is measured using two different objectives. The first one is the expected cost $\mathbf{E}[C]$ obtained under the ground truth dynamics of the system. The second objective is the model bias as defined in equation (12). We compare with two baselines. The first one ignores any risk and, therefore, is obtained by just optimizing equation (10). The second baseline uses the standard deviation of the cost $\sigma(C)$ as risk criterion and, therefore, corresponds to equation (11), which is the standard approach in risk-sensitive RL.
In Figure 18 we show the results obtained by our method and by the second baseline when performance is evaluated under the model (Figure 18(a)) or under the ground truth (Figure 18(b)). Each plot shows empirical estimates of the model bias vs. the expected cost, for various choices of $\beta$. We also highlight the result obtained with $\beta = 0$, the first baseline.
Our novel approach for risk-sensitive reinforcement learning produces policies that attain better trade-offs at test time between expected cost and model bias. As $\beta$ increases, the policies gradually put more emphasis on reducing the expected model bias. This leads to higher costs but a lower discrepancy between model and real-world performance.
5 Conclusion

We have studied a decomposition of predictive uncertainty into its epistemic and aleatoric components when working with Bayesian neural networks with latent variables. This decomposition naturally arises in an information-theoretic active learning setting. The decomposition also inspired us to derive a novel risk objective for safe reinforcement learning that minimizes the effect of model bias in stochastic dynamical systems.
References

- Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Deisenroth et al. (2013) Deisenroth, Marc Peter, Neumann, Gerhard, Peters, Jan, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
- Depeweg et al. (2016) Depeweg, Stefan, Hernández-Lobato, José Miguel, Doshi-Velez, Finale, and Udluft, Steffen. Learning and policy search in stochastic dynamical systems with bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
- Gal et al. (2016) Gal, Yarin, McAllister, Rowan Thomas, and Rasmussen, Carl Edward. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, 2016.
- Gao et al. (2016) Gao, Weihao, Oh, Sewoong, and Viswanath, Pramod. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, pp. 2460–2468, 2016.
- García & Fernández (2015) García, Javier and Fernández, Fernando. A comprehensive survey on safe reinforcement learning. The Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Hein et al. (2016) Hein, Daniel, Hentschel, Alexander, Sterzing, Volkmar, Tokic, Michel, and Udluft, Steffen. Introduction to the "industrial benchmark". arXiv preprint arXiv:1610.03793, 2016.
- Hernández-Lobato & Adams (2015) Hernández-Lobato, José Miguel and Adams, Ryan P. Probabilistic backpropagation for scalable learning of bayesian neural networks. arXiv preprint arXiv:1502.05336, 2015.
- Hernández-Lobato et al. (2016) Hernández-Lobato, José Miguel, Li, Yingzhen, Rowland, Mark, Hernández-Lobato, Daniel, Bui, Thang, and Turner, Richard E. Black-box α-divergence minimization. In Proceedings of The 33rd International Conference on Machine Learning (ICML), 2016.
- Houthooft et al. (2016) Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
- Joseph et al. (2013) Joseph, Joshua, Geramifard, Alborz, Roberts, John W, How, Jonathan P, and Roy, Nicholas. Reinforcement learning with misspecified model classes. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 939–946. IEEE, 2013.
- Kendall & Gal (2017) Kendall, Alex and Gal, Yarin. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
- Kozachenko & Leonenko (1987) Kozachenko, LF and Leonenko, Nikolai N. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
- Kraskov et al. (2004) Kraskov, Alexander, Stögbauer, Harald, and Grassberger, Peter. Estimating mutual information. Physical review E, 69(6):066138, 2004.
- MacKay (1992) MacKay, David JC. Information-based objective functions for active data selection. Neural computation, 4(4):590–604, 1992.
- Maddison et al. (2017) Maddison, Chris J, Lawson, Dieterich, Tucker, George, Heess, Nicolas, Doucet, Arnaud, Mnih, Andriy, and Teh, Yee Whye. Particle value functions. arXiv preprint arXiv:1703.05820, 2017.
- Mihatsch & Neuneier (2002) Mihatsch, Oliver and Neuneier, Ralph. Risk-sensitive reinforcement learning. Machine learning, 49(2-3):267–290, 2002.
- Moerland et al. (2017) Moerland, Thomas M, Broekens, Joost, and Jonker, Catholijn M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.
- Schneegass et al. (2008) Schneegass, Daniel, Udluft, Steffen, and Martinetz, Thomas. Uncertainty propagation for quality assurance in reinforcement learning. In 2008 IEEE International Joint Conference on Neural Networks (IJCNN, IEEE World Congress on Computational Intelligence), pp. 2588–2595. IEEE, 2008.