1 Introduction
In modelbased reinforcement learning, an agent uses its experience to first learn a model of the environment and then uses that model to reason about what action to take next. We consider the case in which the agent observes the current state , takes some action , and then observes the next state . The problem of learning the model corresponds then to learning a stochastic transition function specifying the conditional distribution of given and . Most classic control theory texts, e.g. Bertsekas (2002), will start with the most general model of dynamical systems:
where is some deterministic function parameterized by weights that takes as input the current state , the control signal , and some stochastic disturbance .
However, to date, we have not been able to robustly learn dynamical system models to such a level of generality. Popular modes for transition functions include Gaussian processes (Rasmussen et al., 2003; Ko et al., 2007; Deisenroth & Rasmussen, 2011), fixed bases such as Laguerre functions (Wahlberg, 1991), and adaptive basis functions or neural networks (Draeger et al., 1995). All of these methods assume deterministic transition functions, perhaps with some addition of Gaussian observation noise. Thus, they are severely limited in the kinds of stochasticity—or transition noise—they can express. In many realworld scenarios stochasticity may often arise due to some unobserved environmental feature that can affect the dynamics in complex ways (such as unmeasured gusts of wind on a boat).
In this work we use Bayesian neural networks (BNNs) in conjunction with a random input noise source to express stochastic dynamics. We take advantage of a very recent inference advance based on divergence minimization (HernándezLobato et al., 2016), with , to learn with high accuracy BNN transition functions that are both scalable and expressive in terms of stochastic patterns. Previous work achieved one but not both of these two characteristics.
We focus our evaluation on the offpolicy batch reinforcement learning scenario, in which we are given an initial batch of data from an alreadyrunning system and are asked to find a better (ideally nearoptimal) policy. Such scenarios are common in realworld industry settings such as turbine control, where exploration is restricted to avoid possible damage to the system. We propose an algorithm that uses random rollouts and stochastic optimization for learning an optimal policy from the predictions of BNNs. This method produces (to our knowledge) the first modelbased solution of a 20yearold benchmark problem: the WetChicken (Tresp, 1994). We also obtain very promising results on a realworld application on controlling gas turbines and on an industrial benchmark.
2 Background
2.1 ModelBased Reinforcement Learning
We consider reinforcement learning problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to minimize a cumulative cost. We assume that our environment has some true dynamics , and we are given a cost function . In the modelbased reinforcement learning setting, our goal is to learn an approximation for the true dynamics based on collected samples . The agent then tries to solve the control problem in which is assumed to be the true dynamics.
2.2 Bayesian neural networks with stochastic inputs
Given data
, formed by feature vectors
and targets , we assume that , where is the output of a neural network with weights . The network receives as input the feature vector and the random disturbance. The activation functions for the hidden layers are rectifiers:
. The activation functions for the output layers are the identity function: . The network output is corrupted by the additive noise variable with diagonal covariance matrix . The role of the noise disturbance is to capture unobserved stochastic features that can affect the network’s output in complex ways. Without , randomness is only given by the additive Gaussian observation noise , which can only describe limited stochastic patterns. The network has layers, with hidden units in layer , and is the collection of weight matrices. The is introduced here to account for the additional perlayer biases.One could argue why is needed at all when we are already using the more flexible stochastic model based on . The reason for this is that, in practice, we make predictions with the above model by averaging over a finite number of samples of and . By using , we obtain a predictive distribution whose density is well defined and given by a mixture of Gaussians. If we eliminate , the predictive density is degenerate and given by a mixture of delta functions.
Let be an matrix with the targets and be an matrix of feature vectors . We denote by the dimensional vector with the values of the random disturbances that were used to generate the data. The likelihood function is
(1) 
The prior for each entry in is . We also specify a Gaussian prior distribution for each entry in each of the weight matrices in . That is,
(2) 
where is the entry in the th row and th column of and and
are a prior variances. The posterior distribution for the weights
and the random disturbances is given by Bayes’ rule:(3) 
Given a new input vector , we can then make predictions for using the predictive distribution
(4) 
Unfortunately, the exact computation of (4) is intractable and we have to use approximations.
2.3 divergence minimization
We approximate the exact posterior distribution
with the factorized Gaussian distribution
(5) 
The parameters , and , are determined by minimizing a divergence between and the approximation . After fitting , we make predictions by replacing with in (4) and approximating the integrals in (4) with empirical averages over samples of .
We aim to adjust the parameters of (5) by minimizing the divergence between and (Minka, 2005):
(6) 
which includes a parameter that controls the properties of the optimal . Figure 1 illustrates these properties for the onedimensional case. When , tends to cover the whole posterior distribution . When , tends to fit a local mode in . The value is expected to achieve a balance between these two tendencies. Importantly, when , the solution obtained is the same as with variational Bayes (VB) (Wainwright & Jordan, 2008).
The direct minimization of (6) is infeasible in practice for arbitrary . Instead, we follow HernándezLobato et al. (2016) and optimize an energy function whose minimizer corresponds to a local minimization of divergences, with one divergence for each of the likelihood factors in (1). Since is Gaussian and the priors and are also Gaussian, we represent as
(7) 
where
is a Gaussian factor that approximates the geometric mean of the
likelihood factors in (1) as a function of . Each is also a Gaussian factor that approximates the th likelihood factor in (1) as a function of . We adjust and the by minimizing local divergences. In particular, we minimize the energy function(8) 
(HernándezLobato et al., 2016), where and are in exponential Gaussian form and parameterized in terms of the parameters of and the priors and , that is,
(9)  
(10) 
and is the logarithm of the normalization constant of the exponential Gaussian form of :
(11) 
The scalable optimization of (8
) is done in practice by using stochastic gradient descent. For this, we subsample the sums for
in (8) and (11) using minibatches and approximate the expectations over in (8) with an average over samples drawn from . We can then use the reparametrization trick (Kingma et al., 2015) to obtain gradients from the resulting stochastic approximator to (8). The hyperparameters , and can also be tuned by minimizing (8). In practice we only tune and keep and . The latter means that the prior scale of each grows with the data dimensionality. This guarantees that, a priori, the effect of each in the neural network’s output does not diminish when more and more features are available.Minimizing (8) when is equivalent to running the method VB (HernándezLobato et al., 2016), which has recently been used to train Bayesian neural networks in reinforcement learning problems (Blundell et al., 2015; Houthooft et al., 2016; Gal et al., 2016). However, we propose to minimize (8) using , which often results in better test loglikelihood values.
We have also observed to be more robust than VB when is not fully optimized. In particular, can still capture complex stochastic patterns even when we do not learn and instead keep it fixed to the prior . By contrast, VB fails completely in this case (see Appendix A).
3 Policy search using BNNs with stochastic inputs
We now describe a gradientbased policy search algorithm that uses the BNNs with stochastic disturbances from the previous section. The motivation for our approach lies in its applicability to industrial systems: we wish to estimate a policy in parametric form, using only an available batch of state transitions obtained from an alreadyrunning system. We assume that the true dynamics present stochastic patterns that arise due to some unobserved process affecting the system in complex ways.
Modelbased policy search methods include two key parts (Deisenroth et al., 2013). The first part consists in learning a dynamics model from data in the form of state transitions , where denotes the current state, is the action applied and is the resulting state. The second part consists in learning the parameters of a deterministic policy function that returns the optimal action as function of the current state . The policy function can be a neural network with deterministic weights given by .
The first part in the aforementioned procedure is a standard regression task, which we solve by using the modeling approach from the previous section. We assume the dynamics to be stochastic with the following true transition model:
(12) 
where the input disturbances account for the stochasticity in the dynamics. When the Markov state is hidden and we are given only observations , we can use the time embedding theorem using a suitable window of length and approximate:
(13) 
The transition model in equation 12
specifies a probability distribution
that we approximate using a BNN with stochastic inputs:(14) 
where the feature vectors in our BNN are now and and the targets are given by . In this expression, the integration with respect to accounts for stochasticity arising from lack of knowledge of the model parameters, while the integration with respect to accounts for stochasticity arising from unobserved processes that cannot be modeled. In practice, these integrals are approximated by an average over samples of and .
In the second part of our modelbased policy search algorithm, we optimize the parameters of a policy that minimizes the sum of expected cost over a finite horizon with respect to our belief . This expected cost is obtained by averaging over multiple virtual rollouts. For each rollout we sample and then simulate state trajectories using the model with policy , input noise and additive noise . This procedure allows us to obtain estimates of the policy’s expected cost for any particular cost function. If model, policy and cost function are differentiable, we are then able to tune by stochastic gradient descent over the rollout average.
Given a cost function , the objective to be optimized by our policy search algorithm is
(15) 
We approximate (15) by using (14), replacing with and using sampling to approximate the expectations:
(16) 
The first line in (16) is obtained by using the assumption that the dynamics are Markovian with respect to the current state and the current action and by replacing with the righthand side of (14). In the second line, is the state that is obtained at time in a rollout generated by using a policy with parameters , a transition function parameterized by and input noise , with additive noise values . In the last line we have approximated the integration with respect to , and by averaging over samples of these variables. To sample , we draw this variable uniformly from the available transitions .
The expected cost (15) can then be optimized by stochastic gradient descent using the gradients of the Monte Carlo approximation given by the last line of (16). Algorithm 1
computes this Monte Carlo approximation. The gradients can then be obtained using automatic differentiation tools such as Theano
(Theano Development Team, ). Note that Algorithm 1 uses the BNNs to make predictions for the change in the state instead of for the next state since this approach often performs better in practice (Deisenroth & Rasmussen, 2011).4 Experiments
We now evaluate the performance of our algorithm for policy search in different benchmark problems. These problems are chosen based on two reasons. First, they contain complex stochastic dynamics and second, they represent realworld applications common in industrial settings. A theano implementation of algorithm 1 is available online^{1}^{1}1 https://github.com/siemens/policy_search_bbalpha. See the appendix B for a short introduction to all methods we compare to and appendix C for the hyperparameters used.
4.1 WetChicken benchmark
The WetChicken benchmark (Tresp, 1994) is a challenging problem for modelbased policy search that presents both bimodal and heteroskedastic transition dynamics. We use the twodimensional version of the problem (Hans & Udluft, 2009) and extend it to the continuous case.
Dataset  MLP  VB  GP  PSOP  

Wetchicken  2.71  0.09  2.67  0.10  2.37  0.01  2.42  0.01  3.05  0.06  2.34 
Turbine  0.65  0.14  0.45  0.02  0.41  0.03  0.55  0.08  0.64  0.18  NA 
Industrial  183.5  4.1  180.2  0.6  174.2  1.1  171.1  2.1  285.2  20.5  145.5 
Avg. Rank  3.6  0.3  3.1  0.2  1.5  0.2  2.3  0.3  4.5  0.3 
runs with respective standard errors. Bottom row is the average rank over all
runs.Dataset  MLP  VB  GP  

MSE  
WetChicken  1.289  0.013  1.347  0.015  1.347  0.008  1.359  0.017  1.359  0.017 
Turbine  0.16  0.001  0.21  0.003  0.192  0.002  0.237  0.004  0.492  0.026 
Industrial  0.0186  0.0052  0.0182  0.0052  0.017  0.0046  0.0171  0.0047  0.0233  0.0049 
Avg. Rank  2.0  0.34  3.1  0.24  2.4  0.23  2.9  0.36  4.6  0.23 
LogLikelihood  
WetChicken  1.755  0.003  1.140  0.033  1.057  0.014  1.070  0.011  1.722  0.011 
Turbine  0.868  0.007  0.775  0.004  0.746  0.013  0.774  0.015  2.663  0.131 
Industrial  0.767  0.047  1.132  0.064  1.328  0.108  1.326  0.098  0.724  0.04 
Avg. Rank  4.3  0.12  2.6  0.16  1.3  0.15  2.1  0.18  4.7  0.12 
In this problem, a canoeist is paddling on a twodimensional river. The canoeist’s position at time is . The river has width and length with a waterfall at the end, that is, at . The canoeist wants to move as close to the waterfall as possible because at time he gets reward . However, going beyond the waterfall boundary makes the canoeist fall down, having to start back again at the origin . At time the canoeist can choose an action that represents the direction and magnitude of his paddling. The river dynamics have stochastic turbulences and drift that depend on the canoeist’s position on the axis. The larger , the larger the drift and the smaller , the larger the turbulences. The underlying dynamics are given by the following system of equations. The drift and the turbulence magnitude are given by and , respectively. The new location is given by the current location and current action using
(17) 
where and
is a random variable that represents the current turbulence. These dynamics result in rich transition distributions depending on the position as illustrated by the plots in Figure
2. As the canoeist moves closer to the waterfall, the distribution for the next state becomes increasingly bimodal (see Figure 2) because when he is close to the waterfall, the change in the current location can be large if the canoeist falls down the waterfall and starts again at . The distribution may also be truncated uniform for states close to the borders (see Figure 2). Furthermore the system has heteroskedastic noise, the smaller the value of the higher the noise variance (compare Figure 2 with 2). Because of these properties, the WetChicken problem is especially difficult for modelbased reinforcement learning methods. To our knowledge it has only been solved using modelfree approaches after a discretization of the state and action sets (Hans & Udluft, 2009). For model training we use a batch 2500 random state transitions.The predictive distributions of different models for are shown in Figure 2 for specific choices of and . These plots show that BNNs with are very close to the groundtruth. While it is expected that Gaussian processes fail to model multimodalities in Figure 2, the FTIC approximation allows them to model the heteroskedasticity to an extent. VB captures the stochastic patterns on a global level, but often under or overestimates the true probability density in specific regions. The testloglikelihood and test MSE in dimension are reported in Table 2 for all methods. (the transitions for are deterministic given ).
After fitting the models, we train policies using Algorithm 1 with a horizon of size . Table 1 shows the average reward obtained by each method. BNNs with
perform best and produce policies that are very close to the optimal upper bound, as indicated by the performance of the particle swarm optimization policy (PSOP). In this problem VB seems to lack robustness and has much larger empirical variance across experiment repetitions than
or .Figure 3 shows three example policies, , and (Figure 3,3 and 3, respectively). The policies obtained by BNNs with random inputs (VB and ) show a richer selection of actions. The biggest differences are in the middleright regions of the plots, where the drift towards the waterfall is large and the bimodal transition for (missed by the GP) is more important.
4.2 Industrial applications
We now present results on two industrial cases. First, we focus on data generated by a real gas turbine and second, we consider a recently introduced simulator called the "industrial benchmark", with code publicly available^{2}^{2}2http://github.com/siemens/industrialbenchmark (Hein et al., 2016b). According to the authors: "The "industrial benchmark" aims at being realistic in the sense, that it includes a variety of aspects that we found to be vital in industrial applications."
4.2.1 Gas turbine data
For the experiment with gas turbine data we simulate a task with partial observability. To that end we use 40,000 observations of a 30 dimensional timeseries of sensor recordings from a real gas turbine. We are also given a cost function that evaluates the performance of the current state of the turbine. The features in the timeseries are grouped into three different sets: a set of environmental variables (e.g. temperature and measurements from sensors in the turbine) that cannot be influenced by the agent, a set of variables relevant for the cost function (e.g. the turbines current pollutant emission) and a set of steering variables that can be manipulated to control the turbine.
We first train a world model as a reflection of the real turbine dynamics. To that end we define the world model’s transitions for to have the functional form . The world model assumes constant transitions for the environmental variables: . To make fair comparisons, our world model is given by a nonBayesian neural network with deterministic weights and with additive Gaussian output noise.
We then use the world model to generate an artificial batch of data for training the different methods. The inputs in this batch are still the same as in the original turbine data, but the outputs are now sampled from the world model. After generating the artificial data, we only keep a small subset of the original inputs to the world model. The aim of this experiment is to learn policies that are robust to noise in the dynamics. This noise would originate from latent factors that cannot be controlled, such as the missing features that were originally used to generate the outputs by the world model but which are no longer available. After training the models for the dynamics, we use algorithm 1 for policy optimization. The resulting policies are then finally evaluated in the world model.
4.2.2 Industrial benchmark
In this benchmark the hidden Markov state space consists of variables, whereas the observable state is only 5 dimensional. This observable state consists of 3 adjustable steering variables : the velocity , the gain and the shift . We also observe the fatigue and consumption that together form the reward signal . Also visible is the setpoint , a constant hyperparameter of the benchmark that indicates the complexity of the dynamics.
For each setpoint we generate trajectories of length using random exploration. This batch with state transitions forms the training set. We use state transitions, consisting of trajectories for each setpoint, as test set.
For data preprocessing, in addition to the standard normalization process, we apply a log transformation to the reward variable. Because the reward is bounded in the interval
, we use a logit transformation to map this interval into the real line. We define the functional form for the dynamics as
.The test errors and loglikelihood are given in Table 2. We see that BNNs with and perform best here, whereas Gaussian processes or the MLP obtain rather poor results.
Each row in Figure 4 visualizes long term predictions of the MLP and BNNs trained with VB and in two specific cases. In the top row we see that while all three methods produce wrong predictions in expectation (compare dark blue curve to red curve). However, BNNs trained with and with exhibit a bimodal distribution of predicted trajectories, with one mode following the groundtruth very closely. By contrast, the MLP misses the upper mode completely. The bottom row shows that the VB and also produce more tight confident bands in other settings.
Next, we learn policies using the trained models. Here we use a relatively long horizon of steps. Table 1 shows average rewards obtained when applying the policies to the real dynamics. Because both benchmark and models have an autoregressive component, we do an initial warmup phase using random exploration before we apply the policies to the system and start to measure rewards.
We observe that GPs perform very poorly in this benchmark. We believe the reason for this is the long search horizon, which makes the uncertainties in the predictive distributions of the GPs become very large. Tighter confidence bands, as illustrated in Figure 4 seem to be key for learning good policies. Overall, performs best with being very close.
5 Related work
There has been relatively little attention to using Bayesian neural networks for reinforcement learning. In Blundell et al. (2015)
a Thompson sampling approach is used for a contextual bandits problem; the focus is tackling the explorationexploitation tradeoff, while the work in
Watter et al. (2015) combines variational autoencoder with stochastic optimal control for visual data. Compared to our approach the first of these contributions focusses on the exploration/exploitation dilemma, while the second one uses a stochastic optimal control approach to solve the learning problem. By contrast, our work seeks to find an optimal parameterized policy.Policy gradient techniques are a prominent class of policy search algorithms (Peters & Schaal, 2008). While modelbased approaches were often used in discrete spaces (Wang & Dietterich, 2003), modelfree approaches tended to be more popular in continuous spaces (e.g. Peters & Schaal (2006)).
Our work can be seen as a MonteCarlo modelbased policy gradient technique in continuous stochastic systems. Similar work was done using Gaussian processes (Deisenroth & Rasmussen, 2011)
and with recurrent neural networks
(Schaefer et al., 2007) . The Gaussian process approach, while restricted to a Gaussian state distribution, allows propagating beliefs over the rollout procedure. More recently Gu et al. (2016) augment a modelfree learning procedure with data generated from modelbased rollouts.6 Conclusion and future work
We have extended the standard Bayesian neural network (BNN) model with the addition of a random input noise source
. This enables principled Bayesian inference over complex stochastic functions. We have shown that our BNNs with random inputs can be trained with high accuracy by minimizing
divergences, with , which often produces better results than variational Bayes. We have also presented an algorithm that uses random rollouts and stochastic optimization for learning a parameterized policy in a batch scenario. This algorithm particular suited for industry domains.Our BNNs with random inputs have allowed us to solve a challenging benchmark problem where modelbased approaches usually fail. They have also shown promising results on industry benchmarks including realworld data from a gas turbine. In particular, our experiments indicate that a BNN trained with as divergence measure in conjunction with the presented algorithm for policy optimization is a powerful blackbox tool for policy search.
As future work we will consider safety and exploration. For safety, we believe having uncertainty over the underlaying stochastic functions will allows us to optimize policies by focusing on worst case results instead of on average performance. For exploration, having uncertainty on the stochastic functions will be useful for efficient data collection.
Acknowledgements
José Miguel HernándezLobato acknowledges support from the Rafael del Pino Foundation. The authors would like to thank Ryan P. Adams, HansGeorg Zimmermann, Matthew J. Johnson, David Duvenaud and Justin Bayer for helpful discussions.
References
 Bertsekas (2002) D.P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific optimization and computation series. 2002. ISBN 9781886529083.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, pp. 1613–1622, 2015.
 Bui et al. (2016) Thang D Bui, Daniel HernándezLobato, Yingzhen Li, José Miguel HernándezLobato, and Richard E Turner. Deep Gaussian processes for regression using approximate expectation propagation. In ICML, pp. 1472–1481, 2016.
 Deisenroth & Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. PILCO: A modelbased and dataefficient approach to policy search. In ICML, pp. 465–472, 2011.
 Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2:1–142, 2013.
 Draeger et al. (1995) A. Draeger, S. Engell, and H. Ranke. Model predictive control using neural networks. IEEE Control Systems, 15:61–66, 1995.

Gal et al. (2016)
Yarin Gal, Rowan Mcallister, and Carl Rasmussen.
Improving PILCO with Bayesian neural networks dynamics models.
In
DataEfficient Machine Learning workshop, ICML
, 2016.  Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep qlearning with modelbased acceleration. arXiv preprint arXiv:1603.00748, 2016.
 Hans & Udluft (2009) Alexander Hans and Steffen Udluft. Efficient uncertainty propagation for reinforcement learning with limited data. In ICANN, pp. 70–79. Springer, 2009.
 Hein et al. (2016a) Daniel Hein, Alexander Hentschel, Thomas A Runkler, and Steffen Udluft. Reinforcement learning with particle swarm optimization policy (PSOP) in continuous state and action spaces. IJSIR, 7:23–42, 2016a.
 Hein et al. (2016b) Daniel Hein, Alexander Hentschel, Volkmar Sterzing, Michel Tokic, and Steffen Udluft. Introduction to the" industrial benchmark". arXiv preprint arXiv:1610.03793, 2016b.
 HernándezLobato et al. (2014) José Miguel HernándezLobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of blackbox functions. In NIPS, pp. 918–926, 2014.
 HernándezLobato et al. (2016) José Miguel HernándezLobato, Yingzhen Li, Mark Rowland, Daniel HernándezLobato, Thang Bui, and Richard E Turner. Blackbox divergence minimization. In ICML, pp. 1511–1520, 2016.
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In NIPS, pp. 1109–1117, 2016.
 Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In NIPS, pp. 2575–2583, 2015.
 Ko et al. (2007) Jonathan Ko, Daniel J Klein, Dieter Fox, and Dirk Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In IEEE Robotics and Automation, pp. 742–747, 2007.
 Minka (2005) Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
 Peters & Schaal (2006) Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In IROS, pp. 2219–2225, 2006.
 Peters & Schaal (2008) Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21:682–697, 2008.
 Rasmussen et al. (2003) Carl Edward Rasmussen, Malte Kuss, et al. Gaussian processes in reinforcement learning. In NIPS, pp. 751–758, 2003.
 Schaefer et al. (2007) Anton Maximilian Schaefer, Steffen Udluft, and HansGeorg Zimmermann. The recurrent control neural network. In ESANN, pp. 319–324, 2007.
 Snelson & Ghahramani (2005) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudoinputs. In NIPS, pp. 1257–1264, 2005.
 (23) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprint abs/1605.02688.
 Tresp (1994) V. Tresp. The wet game of chicken. Technical report, 1994.
 Wahlberg (1991) Bo Wahlberg. System identification using Laguerre models. IEEE Transactions on Automatic Control, 36:551–562, 1991.
 Wainwright & Jordan (2008) M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.
 Wang & Dietterich (2003) Xin Wang and Thomas G Dietterich. Modelbased policy gradient reinforcement learning. In ICML, pp. 776–783, 2003.
 Watter et al. (2015) Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pp. 2728–2736, 2015.
Appendix A Robustness of and when is not learned
We evaluate the accuracy of the predictive distributions generated by BNNs with stochastic inputs trained by minimizing (8) for different BNNs parameterized by in two simple regression problems. The first one is characterized by a bimodal predictive distribution. The second is characterized by a heteroskedastic predictive distribution. In the latter case the magnitude of the noise in the targets changes as a function of the input features.
In the first problem and is obtained as with probability and , otherwise, where and is independent of . The plot in the top of the 1st column in Figure 5 shows a training dataset obtained by sampling 2500 values of uniformly at random. The plot clearly shows that the distribution of for a particular is bimodal. In the second problem and is obtained as . The plot in the bottom of the 1st column in Figure 5 shows a training dataset obtained with 1000 values of uniformly at random. The plot clearly shows that the distribution of is heteroskedastic, with a noise variance that is a function of .
We evaluated the predictive performance obtained by minimizing (8) using and and also by running VB. However, we do not learn and keeping it instead fixed to the prior .
We fitted a neural network with 2 hidden layers and 50 hidden units per layer using Adam with its default parameter values, with a learning rate of 0.01 in the first problem and 0.002 in the second problem. We used minibatches of size 250 and 1000 training epochs. To approximate the expectations in
8, we draw samples from .The plots in the 3rd and 4th columns of Figure 5 show the predictions obtained with and , respectively. In these cases, the predictive distribution is able to capture the bimodality in the first problem and the heteroskedasticity pattern in the second problem in both cases The plots in the 2nd column of Figure 5 show the predictions obtained with VB, which converges to suboptimal solutions in which the predictive distribution has a single mode (in the first problem) or is homoskedastic (in the second problem). Tables 3 and 4 show the average test RMSE and loglikelihood obtained by each method on each problem.
These results show that Bayesian neural networks trained with or are more robust than VB and can still model complex predictive distributions, which may be multimodal and heteroskedastic, even when is not learned and is instead kept fixed to the prior . By contrast, VB fails to capture complex stochastic patterns in this setting.
Method  RMSE  Loglikelihood 

VB  5.12  3.05 
5.14  2.10  
5.15  2.11 
Method  RMSE  Loglikelihood 

VB  1.88  2.05 
1.89  1.78  
1.94  1.98 
Appendix B Methods
In the experiments we compare to the following methods:
Standard MLP.
The standard multilayer preceptron (MLP) is equivalent to our BNNs, but does not have uncertainty over the weight and does not include any stochastic inputs. We train this method using early stopping on a subset of the training data. When we perform rollouts using algorithm 1, the predictions of the MLP are made stochastic by adding Gaussian noise to its output. The noise variance is fixed by maximum likelihood on some validation data after model training.
Variational Bayes (VB).
The most prominent approach in training modern BNNs is to optimize the variational lower bound (Blundell et al., 2015; Houthooft et al., 2016; Gal et al., 2016). This is in practice equivalent to divergence minimization when (HernándezLobato et al., 2016). In our experiments we use divergence minimization with to implement this method.
Gaussian Processes (GPs).
Gaussian Processes have recently been used for policy search under the name of PILCO (Deisenroth & Rasmussen, 2011). For each dimension of the target variables, we fit a different sparse GP using the FITC approximation (Snelson & Ghahramani, 2005). In particular, each sparse GP is trained using inducing inputs by using the method stochastic expectation propagation (Bui et al., 2016). After this training process we approximate the sparse GP by using a feature expansion with random basis functions (see supplementary material of HernándezLobato et al. 2014). This allows us to draw samples from the GP posterior distribution over functions, enabling the use of Algorithm 1
for policy training. Note that PILCO will instead momentmatch at every rollout step as it works by propagating Gaussian distributions. However, in our experiments we obtained better performance by avoiding the moment matching step with the aforementioned approximation based on random basis functions.
Particle Swarm Optimization Policy(PSOP).
We use this method to estimate an upper bound for reward performance. PSOP is a model predictive control (MPC) method that uses the true dynamics when applicable (Hein et al., 2016a). For a given state , the best action is selected using the standard receding horizon approach on the real environment. Note that this is not a benchmark method to compare to, we use it instead as an indicator of what the best possible reward can be achieved for a fixed planning horizon .
Appendix C Model Parameters
For all tasks we will use a standard MLP with two hidden layer with hidden units each as policy representation. The activation functions for the hidden units are rectifiers: . If present, bounding of the actions is realized using the
activation function on the outputs of the policy. All models based on neural network will share the same hyperparameter. We use ADAM as learning algorithm in all tasks.
WetChicken
The neural network models are set to 2 hidden layers and 20 hidden units per layer. We use 2500 random state transitions for training. We found that assuming no observation noise by setting to a constant of helped the models converge to lower energy values.
For policy training we use a horizon of size and optimize the policy network for epochs, averaging over samples in each gradient update, with minibatches of size and learning rate set to .
Turbine
The world model and the BNNs have two hidden layers with hidden units each. For policy training and worldmodel evaluation we perform a rollout with horizon . For learning the policy we use minibaches of size and draw samples from .
Industrial Benchmark
For the neural network models we use two hidden layers with hidden units.We use a horizon of , training for epochs with batches of size and samples for each rollout.
Appendix D Computational Complexity
Model Training
All models were trained using theano and a single GPU. Training the standard neural network is fast, the training time for this method was between 5  20 minutes, depending on data set size and dimensionality of the benchmark. In theano, the computational graph of the BNNs is similar to that of an ensemble of standard neural networks. The training time for the BNNs varied between 30 minutes to 5 hours depending on data size and dimensionality of benchmark. The sparse Gaussian Process was optimized using an expectation propagation algorithm and after training, it was approximated with a Bayesian linear model with fixed basis functions whose weights are initialized randomly (see Appendix B). We choose the inducing points in the GPs and the number of training epochs for these models so that the resulting training time was comparable to that of the BNNs.
Policy Search
For policy training we used a single CPU. All methods are of similar complexity as they are all trained using Algorithm 1. Depending on the horizon, data set size and network topology, training took between 20 minutes (WetChicken, ), 34 hours (Turbine, ) and 1416 hours (industrial benchmark, ).
Comments
There are no comments yet.