1 Introduction
Reinforcement learning (RL) is a principled mathematical framework for experience-based autonomous learning of control policies. Its trial-and-error learning process is one of the most distinguishing features of RL [44]. Despite many recent advances in RL [25, 43, 48], a main limitation of current RL algorithms remains their data inefficiency, i.e., the required number of interactions with the environment is impractically high. For example, many RL approaches to problems with low-dimensional state spaces and fairly benign dynamics require thousands of trials to learn. This data inefficiency makes learning in real control/robotic systems without task-specific priors impractical and prohibits RL approaches in more challenging scenarios.
A promising way to increase the data efficiency of RL without inserting task-specific prior knowledge is to learn models of the underlying system dynamics. When a good model is available, it can be used as a faithful proxy for the real environment, i.e., good policies can be obtained from the model without additional interactions with the real system. However, modelling the underlying transition dynamics accurately is challenging and inevitably leads to model errors. To account for model errors, it has been proposed to use probabilistic models [42, 10]. By explicitly taking model uncertainty into account, the number of interactions with the real system can be substantially reduced. For example, in [10, 33, 1, 7], the authors use Gaussian processes (GPs) to model the dynamics of the underlying system. The PILCO algorithm [10] propagates uncertainty through time for long-term planning and learns parameters of a feedback policy by means of gradient-based policy search. It achieves an unprecedented data efficiency for learning control policies from scratch.
While the PILCO algorithm is data efficient, it has a few shortcomings: 1) Learning closed-loop feedback policies requires optimising over the full planning horizon to stabilise the system, which results in a significant computational burden; 2) It requires us to specify a parametrised policy a priori, often with hundreds of parameters; 3) It cannot handle state constraints; 4) Control constraints are enforced by applying a differentiable squashing function to the RBF policy. This allows PILCO to explicitly take control constraints into account during planning. However, this kind of constraint handling can produce unreliable predictions near constraint boundaries [41, 26, 22].
In this paper, we develop an RL algorithm that is a) data efficient, b) does not require optimising over the full planning horizon, c) handles constraints naturally, d) does not require a parametrised policy, e) is theoretically justified. The key idea is to reformulate the optimal control problem with learned GP models as an equivalent deterministic problem, an idea similar to [27]. This reformulation allows us to exploit Pontryagin’s maximum principle to find optimal control signals while handling constraints in a principled way. We propose probabilistic model predictive control (MPC) with learned GP models, while propagating uncertainty through time. The MPC formulation allows planning ahead for relatively short horizons, which limits the computational burden and allows for infinite-horizon control applications. Our approach can find optimal trajectories in constrained settings, offers an increased robustness to model errors and an unprecedented data efficiency compared to the state of the art.
Related Work
Model-based RL: A recent survey of model-based RL in robotics [35] highlights the importance of models for building adaptable robots. Instead of a GP dynamics model with a zero prior mean (as used in this paper), an RBF network and linear mean functions have been proposed [7, 3]. This accelerates learning and facilitates transferring a learned model from simulation to a real robot. Even implicit model learning can be beneficial: the UNREAL learner proposed in [17] learns a predictive model for the environment as an auxiliary task, which helps learning.
MPC with GP transition models: GP-based predictive control was used for boiler and building control [13, 28], but the model uncertainty was discarded. In [18], the predictive variances were used within a GP-MPC scheme to actively reject periodic disturbances, although not in an RL setting. Similarly, in [31, 32], the authors used a GP prior to model additive noise, and the model is improved episodically. In [5, 28], the authors considered MPC problems with GP models, where only the GP’s posterior mean was used while the variance was ignored for planning. MPC methods with deterministic models are useful only when model errors and system noise can be neglected [19, 12].
Optimal Control: The application of optimal control theory to models based on GP dynamics typically exploits some structure in the transition model, i.e., there is an explicit assumption of control affinity [16, 33, 34, 5] and linearisation via locally quadratic approximations [5, 33]. The AICO model [47] uses approximate inference with (known) locally linear models. The probabilistic trajectories for model-free RL in [40] are obtained by reformulating the stochastic optimal control problem as KL-divergence minimisation. In contrast, we implicitly linearise the transition dynamics via the moment-matching approximation.
Contribution
The contributions of this paper are the following: 1) We propose a new ‘deterministic’ formulation for probabilistic MPC with learned GP models and uncertainty propagation for long-term planning. 2) This reformulation allows us to apply Pontryagin’s Maximum Principle (PMP) for the open-loop planning stage of probabilistic MPC with GPs. Using the PMP we can handle control constraints in a principled fashion while still maintaining necessary conditions for optimality. 3) The proposed algorithm is not only theoretically justified by optimal control theory, but also achieves state-of-the-art data efficiency in RL while maintaining the probabilistic formulation. 4) Our method can handle state and control constraints while preserving its data efficiency and optimality properties.
2 Controller Learning via Probabilistic MPC
We consider a stochastic dynamical system with states $x \in \mathbb{R}^D$ and admissible controls (actions) $u \in \mathcal{U} \subseteq \mathbb{R}^U$, where the state follows Markovian dynamics
$x_{t+1} = f(x_t, u_t) + w$   (1)
with an (unknown) transition function $f$ and i.i.d. system noise $w \sim \mathcal{N}(0, \Sigma_w)$, where $\Sigma_w = \mathrm{diag}(\sigma_{w_1}^2, \dots, \sigma_{w_D}^2)$. In this paper, we consider an RL setting where we seek control signals that minimise the expected long-term cost
$J = \mathbb{E}\big[\ell_T(x_T)\big] + \sum_{t=0}^{T-1} \mathbb{E}\big[\ell(x_t, u_t)\big]$,   (2)
where $\ell_T$ is a terminal cost and $\ell(x_t, u_t)$ the stage cost associated with applying control $u_t$ in state $x_t$. We assume that the initial state is Gaussian distributed, i.e., $p(x_0) = \mathcal{N}(\mu_0, \Sigma_0)$.
For data efficiency, we follow a model-based RL strategy, i.e., we learn a model of the unknown transition function $f$, which we then use to find open-loop¹ optimal controls $u_0^*, \dots, u_{T-1}^*$ that minimise (2). After every application of the control sequence, we update the learned model with the newly acquired experience and replan. Section 2.1 summarises the model learning step; Section 2.2 details how to obtain the desired open-loop trajectory.
¹‘Open-loop’ refers to the fact that the control signals are independent of the state, i.e., there is no state feedback incorporated.
2.1 Probabilistic Transition Model
We learn a probabilistic model of the unknown underlying dynamics to be robust to model errors [42, 10]. In particular, we use a Gaussian process (GP) as a prior over plausible transition functions $f$.
A GP is a probabilistic non-parametric model for regression. In a GP, any finite number of function values is jointly Gaussian distributed [39]. A GP is fully specified by a mean function $m(\cdot)$ and a covariance function (kernel) $k(\cdot, \cdot)$.
The inputs for the dynamics GP are given by tuples $\tilde{x}_t := (x_t, u_t)$, and the corresponding targets are the state differences $\Delta_t := x_{t+1} - x_t$. We denote the collections of training inputs and targets by $\tilde{X}$ and $y$, respectively. Furthermore, we assume a Gaussian (RBF, squared exponential) covariance function
$k(\tilde{x}_p, \tilde{x}_q) = \sigma_f^2 \exp\big(-\tfrac{1}{2}(\tilde{x}_p - \tilde{x}_q)^\top \Lambda^{-1} (\tilde{x}_p - \tilde{x}_q)\big)$,   (3)
where $\sigma_f^2$ is the signal variance and $\Lambda$ is a diagonal matrix of length-scales. The GP is trained via the standard procedure of evidence maximisation [21, 39].
We make the standard assumption that the GPs for each target dimension of the transition function are independent. For given hyper-parameters, training inputs $\tilde{X}$, training targets $y_d$ and a new test input $\tilde{x}_*$, the GP yields the predictive distribution $p(f_d(\tilde{x}_*) \mid \tilde{X}, y_d) = \mathcal{N}\big(m_d(\tilde{x}_*), \sigma_d^2(\tilde{x}_*)\big)$, where
$m_d(\tilde{x}_*) = k_d(\tilde{x}_*, \tilde{X})\, \beta_d$,   (4)
$\beta_d = (K_d + \sigma_{w_d}^2 I)^{-1} y_d$,   (5)
$\sigma_d^2(\tilde{x}_*) = k_d(\tilde{x}_*, \tilde{x}_*) - k_d(\tilde{x}_*, \tilde{X}) (K_d + \sigma_{w_d}^2 I)^{-1} k_d(\tilde{X}, \tilde{x}_*)$,   (6)
$K_d = k_d(\tilde{X}, \tilde{X})$,   (7)
for all predictive dimensions $d = 1, \dots, D$.
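As an illustration, the predictive equations above fit in a few lines of code. The following sketch (with hypothetical helper names `rbf_kernel` and `gp_predict`, not taken from the paper) computes the posterior mean and variance of one target dimension under the Gaussian kernel (3):

```python
import numpy as np

def rbf_kernel(A, B, sf2, lengthscales):
    """Gaussian (RBF) covariance, Eq. (3): sf2 * exp(-0.5 * d^T Lambda^{-1} d)."""
    diff = A[:, None, :] - B[None, :, :]                  # (n, m, D+U)
    d2 = np.sum((diff / lengthscales) ** 2, axis=-1)
    return sf2 * np.exp(-0.5 * d2)

def gp_predict(X, y, x_star, sf2, lengthscales, noise_var):
    """Posterior mean/variance (Eqs. 4-7) for one target dimension at a test input."""
    K = rbf_kernel(X, X, sf2, lengthscales) + noise_var * np.eye(len(X))
    k_star = rbf_kernel(x_star[None, :], X, sf2, lengthscales)[0]
    alpha = np.linalg.solve(K, y)                         # beta_d in Eq. (5)
    mean = k_star @ alpha                                 # Eq. (4)
    var = sf2 - k_star @ np.linalg.solve(K, k_star)       # Eq. (6)
    return mean, var
```

At a training input the posterior mean recovers the target (up to noise), while far from the data the variance reverts to the signal variance $\sigma_f^2$, which is exactly the model uncertainty the planner exploits.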
2.2 OpenLoop Control
To find the desired open-loop control sequence $u_0^*, \dots, u_{T-1}^*$, we follow a two-step procedure proposed in [10]: 1) Use the learned GP model to predict the long-term evolution of the state for a given control sequence. 2) Compute the corresponding expected long-term cost (2) and find an open-loop control sequence that minimises it. In the following, we detail these steps.
2.2.1 Longterm Predictions
To obtain the state distributions $p(x_1), \dots, p(x_T)$ for a given control sequence $u_0, \dots, u_{T-1}$, we iteratively predict
$p(x_{t+1} \mid u_t) = \iint p(x_{t+1} \mid x_t, u_t)\, p(x_t)\, p(f)\, \mathrm{d}f\, \mathrm{d}x_t$   (8)
for $t = 0, \dots, T-1$ by making a deterministic Gaussian approximation to $p(x_{t+1})$ using moment matching [12, 37, 10]. This approximation has been shown to work well in practice in RL contexts [10, 1, 7, 2, 3, 33, 34] and can be computed in closed form when using the Gaussian kernel (3).
A key property that we exploit is that moment matching allows us to formulate the uncertainty propagation in (8) as a ‘deterministic system function’
$(\mu_{t+1}, \Sigma_{t+1}) = f_{\mathrm{MM}}(\mu_t, \Sigma_t, u_t)$,   (9)
where $\mu_{t+1}, \Sigma_{t+1}$ are the mean and the covariance of $p(x_{t+1})$. For a deterministic control signal $u_t$ we further define the moments of the control-augmented distribution $p(x_t, u_t)$ as
$z_t := (\mu_t, \Sigma_t)$,   (10)
such that (9) can equivalently be written as the deterministic system equation
$z_{t+1} = F(z_t, u_t)$.   (11)
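The exact moment-matching moments for the Gaussian kernel have lengthy closed-form expressions (see [12, 10]). As a hedged illustration of the ‘deterministic system function’ idea only, the sketch below approximates the mapping $(\mu_t, \sigma_t^2) \mapsto (\mu_{t+1}, \sigma_{t+1}^2)$ for a one-dimensional nonlinearity using Gauss–Hermite quadrature in place of the closed-form kernel expectations:

```python
import numpy as np

def moment_match_1d(f, mu, var, n_points=20):
    """Deterministic mapping (mu, var) -> (mu', var') for x' = f(x), x ~ N(mu, var).
    Gauss-Hermite quadrature stands in for the closed-form RBF-kernel moments."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    w = weights / weights.sum()          # normalised weights for N(0, 1)
    fx = f(mu + np.sqrt(var) * nodes)    # evaluate f at the sigma points
    m = np.sum(w * fx)                   # output mean
    v = np.sum(w * (fx - m) ** 2)        # output variance
    return m, v
```

Whatever quadrature or closed-form rule is used, the key point is the same: the input moments fully determine the output moments, so the stochastic rollout becomes the deterministic recursion (11).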
2.2.2 Optimal OpenLoop Control Sequence
To find the optimal open-loop sequence $u_0^*, \dots, u_{T-1}^*$, we first compute the expected long-term cost in (2) using the Gaussian approximations obtained via (8) for a given open-loop control sequence. Second, we find a control sequence that minimises the expected long-term cost (2). In the following, we detail these steps.
Computing the Expected LongTerm Cost
To compute the expected long-term cost in (2), we sum up the expected immediate costs
$\mathbb{E}_{x_t}[\ell(x_t, u_t)] = \int \ell(x_t, u_t)\, \mathcal{N}(x_t \mid \mu_t, \Sigma_t)\, \mathrm{d}x_t$   (12)
for $t = 0, \dots, T-1$. We choose the stage cost $\ell$ such that this expectation and its partial derivatives with respect to $\mu_t$ and $\Sigma_t$ can be computed analytically.²
Similar to (9), this allows us to define deterministic mappings
$\mathcal{E}_t(\mu_t, \Sigma_t, u_t) := \mathbb{E}_{x_t}[\ell(x_t, u_t)]$,   (13)
$\mathcal{E}_T(\mu_T, \Sigma_T) := \mathbb{E}_{x_T}[\ell_T(x_T)]$,   (14)
that map the mean and covariance of $p(x_t)$ onto the corresponding expected costs in (2).
²Choices for $\ell$ include the standard quadratic (polynomial) cost, but also costs expressed as Fourier series expansions or radial-basis-function networks with Gaussian basis functions.
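For the common quadratic stage cost, the expectation in (12) is available in closed form: $\mathbb{E}[(x - t)^\top Q (x - t)] = (\mu - t)^\top Q (\mu - t) + \mathrm{tr}(Q \Sigma)$ for $x \sim \mathcal{N}(\mu, \Sigma)$. A minimal sketch of such a deterministic cost mapping:

```python
import numpy as np

def expected_quadratic_cost(mu, Sigma, target, Q):
    """Closed-form E[(x - t)^T Q (x - t)] for x ~ N(mu, Sigma):
    distance of the mean to the target plus an uncertainty penalty tr(Q Sigma)."""
    d = mu - target
    return d @ Q @ d + np.trace(Q @ Sigma)
```

Note that the cost depends on both the mean and the covariance, so uncertain states are penalised even when the mean sits on the target; this is one mechanism by which model uncertainty enters planning.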
Remark 1.
The open-loop optimisation turns out to be sparse [4]. However, optimisation via the value function or dynamic programming is valid only for unconstrained controls. To address this practical shortcoming, we invoke Pontryagin’s Maximum Principle, which allows us to formulate the constrained problem while maintaining the sparsity. We detail this sparse structure for the constrained GP-dynamics problem in Section 3.
2.3 Feedback Control with MPC
Thus far, we presented a way to efficiently determine an open-loop controller. However, an open-loop controller cannot stabilise the system [22]. Therefore, it is essential to obtain a feedback controller. MPC is a practical framework for this [22, 15]. While interacting with the system, MPC determines an $H$-step open-loop control trajectory $u_0^*, \dots, u_{H-1}^*$, starting from the current state.³ Only the first control signal $u_0^*$ is applied to the system. When the system transitions to the next state, we update the GP model with the newly available information, and MPC replans. This procedure turns an open-loop controller into an implicit closed-loop (feedback) controller by repeatedly re-planning $H$ steps ahead from the current state. Typically, $H \ll T$, and MPC even allows for $T = \infty$.
³A state distribution would work equivalently in our framework.
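The receding-horizon procedure can be sketched as a short loop. The function names (`step_env`, `plan_open_loop`) are illustrative placeholders, not part of the paper's implementation; the planner stands in for the open-loop optimisation of Section 2.2:

```python
def mpc_loop(step_env, plan_open_loop, x0, horizon_H, num_steps):
    """Receding-horizon control: plan H steps ahead, apply only the first control,
    record the observed transition for model updates, then replan."""
    x = x0
    data = []
    for _ in range(num_steps):
        u_seq = plan_open_loop(x, horizon_H)   # H-step open-loop plan from current state
        u = u_seq[0]                           # apply only the first control
        x_next = step_env(x, u)
        data.append((x, u, x_next))            # new experience for the dynamics model
        x = x_next
    return x, data
```

The repeated replanning is what turns the open-loop plan into implicit state feedback: any deviation of the observed state from the prediction is absorbed at the next planning step.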
In this section, we provided an algorithmic framework for probabilistic MPC with learned GP models of the underlying system dynamics, where we explicitly use the GP’s uncertainty for long-term predictions (8). In the following section, we justify this approach using optimal control theory. Additionally, we discuss how to account for constrained control signals in a principled way without the need to warp/squash control signals as in [10].
3 Theoretical Justification
Bellman’s optimality principle [1] yields a recursive formulation for calculating the total expected cost (2) and gives a sufficient optimality condition. PMP [36] provides the corresponding necessary optimality condition. PMP allows us to compute gradients of the expected long-term cost w.r.t. quantities that depend only on variables with neighbouring time indices, i.e., the gradient at time $t$ depends only on variables with indices $t$ and $t+1$. Furthermore, it allows us to explicitly deal with constraints on states and controls. In the following, we detail how to solve the optimal control problem (OCP) with PMP for learned GP dynamics and deterministic uncertainty propagation. We additionally provide a computationally efficient way to compute derivatives based on the maximum principle.
To facilitate our discussion, we first define some notation. Practical control signals are often constrained. We formally define a class of admissible controls as piecewise continuous functions defined on a compact space $\mathcal{U}$. This definition is fairly general, and commonly used zero-order-hold or first-order-hold signals satisfy this requirement. Applying admissible controls to the deterministic system dynamics defined in (11) yields a set of admissible controlled trajectories. We define the tuple $(F, \mathcal{U})$ as our control system. For a single admissible control trajectory $u_{0:T-1}$, there will be a unique trajectory $z_{0:T}$, and the pair $(z_{0:T}, u_{0:T-1})$ is called an admissible controlled trajectory [41].
We now define the control Hamiltonian [6, 41, 45] for this control system as
$\mathcal{H}_t(z_t, u_t, \lambda_{t+1}) := \mathcal{E}_t(z_t, u_t) + \lambda_{t+1}^\top F(z_t, u_t)$.   (15)
This formulation of the control Hamiltonian is the centrepiece of Pontryagin’s approach to the OCP. The vector $\lambda_{t+1}$ can be viewed as a Lagrange multiplier for the dynamics constraints associated with the OCP [6, 41].
To successfully apply PMP we need the system dynamics to have a unique solution for a given control sequence. Traditionally, this is interpreted as the system being ‘deterministic’. This interpretation has been considered a limitation of PMP [45]. In this paper, however, we exploit the fact that the moment-matching approximation (8) is a deterministic operator, similar to the projection used in EP [30, 23]. This yields the ‘deterministic’ system equations (9), (11) that map moments of the state distribution at time $t$ to moments of the state distribution at time $t+1$.
3.1 Existence/Uniqueness of a Local Solution
To apply the PMP, we need to extend some important characteristics of ODEs to our system. In particular, we need to show the existence and uniqueness of a (local) solution to our difference equation (11).
For existence of a solution, the difference equation must be satisfied pointwise over the entire horizon; for uniqueness, the system must admit only a single solution trajectory. For our discrete-time system equation (via the moment-matching approximation) in (9) we have the following
Lemma 1.
The moment-matching mapping $f_{\mathrm{MM}}$ is Lipschitz continuous for controls defined over a compact set $\mathcal{U}$.
The proof is based on bounding the gradient of $f_{\mathrm{MM}}$ and is detailed in the supplementary material. Existence and uniqueness of the trajectories for the moment-matching difference equation are given by
Lemma 2.
A solution of $z_{t+1} = F(z_t, u_t)$ exists and is unique.
Proof Sketch
Difference equations always yield a successor state for a given input; therefore, a solution trivially exists. Uniqueness follows directly from the Picard–Lindelöf theorem, which we can apply due to Lemma 1. This theorem requires the discrete-time system function to be deterministic (see Appendix B of [41]). Due to our reformulation of the system dynamics (9), this follows directly, such that the trajectories $z_{0:T}$ for a given control sequence are unique.
3.2 Pontryagin’s Maximum Principle for GP Dynamics
With Lemmas 1 and 2 and the definition of the control Hamiltonian (15), we can now state PMP for the control system as follows:
Theorem 1.
Let $(z_{0:T}^*, u_{0:T-1}^*)$ be an admissible controlled trajectory defined over the horizon $T$. If it is optimal, then there exists an adjoint vector $\lambda_t$ satisfying the following conditions:

Adjoint equation: The adjoint vector is a solution to the discrete difference equation
$\lambda_t = \dfrac{\partial \mathcal{H}_t}{\partial z_t}(z_t^*, u_t^*, \lambda_{t+1})$.   (16)
Transversality condition: At the endpoint, the adjoint vector satisfies
$\lambda_T = \dfrac{\partial \mathcal{E}_T}{\partial z_T}(z_T^*)$.   (17)
Minimum condition: For $t = 0, \dots, T-1$, we have
$\mathcal{H}_t(z_t^*, u_t^*, \lambda_{t+1}) \le \mathcal{H}_t(z_t^*, u, \lambda_{t+1})$   (18)
for all $u \in \mathcal{U}$.
Remark 2.
The minimum condition (18) can be used to find an optimal control. The Hamiltonian is minimised pointwise over the admissible control set $\mathcal{U}$: for every $t$ we find the optimal control $u_t^*$. The minimisation problem possesses additional variables $\lambda_{t+1}$. These variables can be interpreted as Lagrange multipliers for the optimisation. They capture the impact of the control over the whole trajectory and, hence, make the optimisation problem sparse [11]. For the GP dynamics we compute the multipliers in closed form, thereby significantly reducing the computational burden of minimising the expected long-term cost in (2). We detail this calculation in Section 3.3.
Remark 3.
In the optimal control problem, we aim to find an admissible control trajectory that minimises the cost subject to possibly additional constraints. PMP gives first-order optimality conditions over these admissible controlled trajectories and can be generalised to handle additional state and control constraints [45, 26, 41].
Remark 4.
For linear dynamics, the proposed method is a generalisation of iLQG [46]: the moment-matching transition implicitly linearises the transition dynamics at each time step, whereas iLQG makes an explicit local linear approximation. For a linear transition function and a quadratic cost, we recover the LQG case, as shown in Theorem 1 in [47]. If we iterate with successive corrections to the linear approximations, we obtain iLQG.
3.3 Efficient Gradient Computation
With the definition of the Hamiltonian in (15), we can efficiently calculate the gradient of the expected total cost $J$. For a time horizon $T$, we can write the accumulated cost as the Bellman recursion [1]
$V_t(z_t) = \mathcal{E}_t(z_t, u_t) + V_{t+1}(z_{t+1})$,   (19)
$V_T(z_T) = \mathcal{E}_T(z_T)$,   (20)
for $t = 0, \dots, T-1$. Since the (open-loop) control $u_t$ only impacts the future costs via $z_{t+1}$, the derivative of the total cost with respect to $u_t$ is given by
$\dfrac{\partial V_t}{\partial u_t} = \dfrac{\partial \mathcal{E}_t}{\partial u_t} + \dfrac{\partial V_{t+1}}{\partial z_{t+1}}\, \dfrac{\partial F}{\partial u_t}$.   (21)
Comparing this expression with the definition of the Hamiltonian (15), we see that if we make the substitution $\lambda_{t+1} = \partial V_{t+1} / \partial z_{t+1}$, we obtain
$\dfrac{\partial V_t}{\partial u_t} = \dfrac{\partial \mathcal{H}_t}{\partial u_t}$.   (22)
This implies that the gradient of the expected long-term cost w.r.t. $u_t$ can be efficiently computed using the Hamiltonian [45]. Next we show that the substitution is valid for the entire horizon $t = 0, \dots, T-1$. For the terminal cost this is valid by the transversality condition (17). For other time steps, we differentiate (19) w.r.t. $z_t$, which yields
$\dfrac{\partial V_t}{\partial z_t} = \dfrac{\partial \mathcal{E}_t}{\partial z_t} + \lambda_{t+1}^\top \dfrac{\partial F}{\partial z_t}$   (23)
$= \dfrac{\partial \mathcal{H}_t}{\partial z_t}$,   (24)
which is identical to the adjoint equation (16). Hence, in our setting, PMP implies that gradient descent on the Hamiltonian is equivalent to gradient descent on the total cost (2) [41, 11].
Algorithmically, in an RL setting, we find the optimal control sequence as follows:

For a given initial (random) control sequence, we follow the steps described in Section 2.2.1 to determine the corresponding trajectory $z_{0:T}$. Additionally, we compute the Lagrange multipliers $\lambda_{1:T}$ during the forward propagation. Note that, traditionally, adjoint equations are propagated backwards to find the multipliers [6, 41].

Given $z_{0:T}$, $\lambda_{1:T}$, and a cost function, we can determine the Hamiltonians $\mathcal{H}_t$. We then find a new control sequence via any gradient-descent method using (22).

Return to step 1, or exit when converged.
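The gradient computation behind these steps can be sketched for a simple linear stand-in system (not the GP dynamics), where the adjoint recursion (16) with the transversality condition (17) yields the exact gradient (22) of the total cost; the system matrices and quadratic costs below are illustrative assumptions:

```python
import numpy as np

def cost_and_grad(A, B, z0, U):
    """Total cost J and its gradient via the adjoint recursion, for the stand-in
    system z' = A z + B u with stage cost 0.5(z^T z + u^T u) and terminal 0.5 z^T z."""
    Z = [z0]
    for u in U:                          # forward pass: roll out the trajectory
        Z.append(A @ Z[-1] + B @ u)
    J = 0.5 * sum(z @ z for z in Z) + 0.5 * sum(u @ u for u in U)
    lam = Z[-1].copy()                   # transversality: lambda_T = dPhi/dz_T
    grads = [None] * len(U)
    for t in reversed(range(len(U))):
        grads[t] = U[t] + B.T @ lam      # dH_t/du_t, cf. Eq. (22)
        lam = Z[t] + A.T @ lam           # adjoint equation, cf. Eq. (16)
    return J, np.array(grads)
```

A finite-difference check confirms that the adjoint gradient matches the gradient of the rolled-out total cost, which is the property the SQP solver relies on.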
We use Sequential Quadratic Programming (SQP) with BFGS for the Hessian updates [29]. The Lagrangian of the SQP is a partially separable function [14]. In the PMP, this separation is explicit via the Hamiltonians, i.e., $\mathcal{H}_t$ is a function of variables with time index $t$ or $t+1$. This leads to a block-diagonal Hessian of the SQP Lagrangian [14]. This structure can be exploited to approximate the Hessian via block updates within BFGS [14, 4].
4 Experimental Results
We evaluate the quality of our algorithm in two ways: First, we assess whether probabilistic MPC leads to faster learning compared with PILCO, the current state of the art in terms of data efficiency. Second, we assess the impact of state constraints while performing the same task.
We consider two RL benchmark problems: the underactuated cart-pole swing-up and the fully actuated double-pendulum swing-up. In both tasks, PILCO is the most data-efficient RL algorithm to date [1].
Underactuated Cart-Pole Swing-Up
The cart-pole system is an underactuated system with a freely swinging pendulum mounted on a cart. The swing-up and balancing task cannot be solved using a linear model [38]. The cart-pole state space consists of the position of the cart, the cart velocity, the angle of the pendulum and the angular velocity. A horizontal force (in N) can be applied to the cart. Starting in a position where the pendulum hangs downwards, the objective is to automatically learn a controller that swings the pendulum up and balances it in the inverted position in the middle of the track.
Constrained Cart-Pole Swing-Up
For the state-space constraint experiment we place a wall on the track near the target, see Fig. 1(a). The wall is at 70 cm, which, together with the force limitations, requires the system to swing up from the right-hand side.
Fully Actuated Double Pendulum
The double-pendulum system is a two-link robot arm with an actuator at each joint. The state space consists of two angles and two angular velocities [1]. The torques applied at the two joints are limited. Starting from a position where both links hang downwards, the objective is to learn a control strategy that swings the double pendulum up and balances it in the inverted position.
Constrained DoublePendulum
The double pendulum has a constraint on the angle of the inner pendulum, which restricts its motion range, i.e., it cannot spin through, see Fig. 1(b). The constraint blocks all clockwise swing-ups. The system is underpowered, so it has to swing clockwise first to achieve a counter-clockwise swing-up without violating the constraints.
Trials
The general setting is as follows: all RL algorithms start off with a single short random trajectory, which is used for learning the dynamics model. As in [10, 1], the GP is used to predict state differences $\Delta_t = x_{t+1} - x_t$. The learned GP dynamics model is then used to determine a controller based on iterated moment matching (8), which is then applied to the system, starting from $x_0$. Model learning, controller learning and application of the controller to the system constitute a ‘trial’. After each trial, the hyper-parameters of the model are updated with the newly acquired experience and learning continues.
Baselines
We compare our GP-MPC approach with the following baselines: the PILCO algorithm [10, 1] and a zero-variance GP-MPC algorithm (in the flavour of [28, 5]) for RL, where the GP’s predictive variances are discarded. Due to the lack of exploration, such a zero-variance approach within PILCO (a policy-search method) does not learn anything useful, as already demonstrated in [1], and we do not include this baseline.
We average over 10 independent experiments, where every algorithm is initialised with the same first (random) trajectory. The performance differences of the RL algorithms are therefore due to different approaches to controller learning and the induced exploration.
4.1 Data Efficiency
[Figure 2: Performance of RL algorithms. Error bars represent the standard error. (a) Cart-pole; (b) double pendulum. GP-MPC (blue) consistently outperforms PILCO (red) and the zero-variance MPC approach (yellow) in terms of data efficiency. While the zero-variance MPC approach works well on the cart-pole task, it fails in the double-pendulum task. We attribute this to its inability to explore the state space sufficiently well.]
In both benchmark experiments (cart-pole and double pendulum), we use the exact saturating cost from [1], which penalises the Euclidean distance of the tip of the (outer) pendulum from the target position, i.e., we are in a setting in which PILCO performs very well.
Fig. 2(a) shows that both our MPC-based controller (blue) and the zero-variance approach successfully⁴ complete the task in fewer trials than the state-of-the-art PILCO method (red). From the repeated trials we see that GP-MPC learns faster and more reliably than PILCO. In particular, GP-MPC and the zero-variance approach can solve the cart-pole task with high probability (90%) after three trials (9 seconds), where the first trial was random. PILCO needs two additional trials. The reason why the zero-variance approach (no model uncertainties) works in an MPC context but not within a policy-search setting is that we include every observed state transition immediately in the GP dynamics model, which makes MPC fairly robust to model errors. It even allows for model-based RL with deterministic models in simple settings.
⁴We define ‘success’ if the pendulum tip is closer than a threshold distance to the target position for ten consecutive time steps.
Fig. 2(b) highlights that our proposed GP-MPC approach (blue) requires on average only six trials of experience to achieve a 90% success rate⁵, including the first random trial. PILCO requires four additional trials, whereas the zero-variance MPC approach completely fails in this RL setting. The reason for this failure is that deterministic predictions with a poor model in this complicated state space do not allow for sufficient exploration. We also observe that GP-MPC is more robust to variations amongst trials.
⁵The tip of the outer pendulum is closer than a threshold distance to the target.
In both experiments, our proposed GP-MPC requires only 60% of PILCO’s experience, such that we report an unprecedented learning speed for these benchmarks, even in settings for which PILCO performs very well.
We identify two key ingredients that are responsible for the success and learning speed of our approach: the ability to (1) immediately react to observed states by adjusting the long-term plan and (2) augment the training set of the GP model on the fly as soon as a new state transition is observed (hyper-parameters are not updated at every time step). These properties turn out to be crucial in the very early stages of learning when very little information is available. If we ignored the on-the-fly updates of the GP dynamics model, our approach would still successfully learn, although the learning efficiency would be slightly decreased.
4.2 State Constraints
A scenario in which PILCO struggles is a setting with state-space constraints. We modify the cart-pole and the double-pendulum tasks to such a setting. Both tasks are symmetric, and we impose state constraints in such a way that only one direction of the swing-up is feasible. For the cart-pole system, we place a wall near the target position of the cart, see Fig. 1(a); the double pendulum has a constraint on the angle of the inner pendulum, which restricts its motion range, i.e., it cannot spin through, see Fig. 1(b). These constraints constitute linear constraints on the state.
We use a quadratic cost that penalises the Euclidean distance between the tip of the pendulum and the target. This, together with the ‘implicit’ linearisation, makes the optimal control problem an ‘implicit’ QP. If a rollout violates the state constraint, we immediately abort that trial and move on to the next one. We use the same experimental setup as in the data-efficiency experiments.
The state constraints are implemented in two variants: as expected violations, i.e., constraints on the predicted mean only, and as chance constraints, which bound the probability of constraint violation under the predicted state distribution. Fig. 1(a) shows that our MPC-based controller with chance constraints (blue) successfully completes the task with a small, acceptable number of violations, see Table 1. The expected-violation approach, which only considers the predicted mean (yellow), fails to complete the task due to repeated constraint violations. PILCO uses its saturating cost (there is little hope for learning with a quadratic cost [8]) and has partial success in completing the task, but it struggles, especially during the initial trials, due to repeated state violations.
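Under the Gaussian approximation of the state distribution, a one-dimensional chance constraint reduces to a condition on the Gaussian CDF. A minimal sketch, assuming a scalar state marginal $\mathcal{N}(\mu, \sigma^2)$ and an upper wall position (the function name and the default $\epsilon$ are illustrative):

```python
import math

def chance_constraint_ok(mu, var, wall, epsilon=0.05):
    """Check Pr(x <= wall) >= 1 - epsilon for x ~ N(mu, var) via the Gaussian CDF.
    Equivalent to the tightened deterministic constraint mu + z_eps * sigma <= wall."""
    sigma = math.sqrt(var)
    # Pr(x > wall) = 1 - Phi((wall - mu) / sigma)
    prob_violate = 0.5 * (1.0 - math.erf((wall - mu) / (sigma * math.sqrt(2.0))))
    return prob_violate <= epsilon
```

In contrast, the expected-violation variant checks only `mu <= wall`, which ignores the variance and explains the higher violation counts for the mean-only approach in Table 1.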
Table 1: Constraint violations.

Experiment    | Cart-pole | Double Pendulum
PILCO         | 16/100    | 23/100
GP-MPC-Mean   | 21/100    | 26/100
GP-MPC-Var    | 3/100     | 11/100
One of the key observations from Table 1 is that the incorporation of uncertainty into planning is again crucial for successful learning. If we use only the predicted means to determine whether a constraint is violated, learning is not reliably ‘safe’. Incorporating the predictive variance, however, results in significantly fewer constraint violations.
5 Conclusion and Discussion
We proposed an algorithm for data-efficient RL that is based on probabilistic MPC with transition models learned with Gaussian processes. By exploiting Pontryagin’s maximum principle, our algorithm can naturally deal with state and control constraints. Key to this theoretical underpinning of a practical algorithm was the reformulation of the optimal control problem with uncertainty propagation via moment matching into a deterministic optimal control problem. MPC allows the learned model to be updated immediately, which leads to an increased robustness with respect to model inaccuracies. We provided empirical evidence that our framework is not only theoretically sound, but also extremely data efficient, while being able to learn in settings with hard state constraints.
One of the most critical components of our approach is the incorporation of model uncertainty into modelling and planning. In complex environments, model uncertainty drives targeted exploration. It additionally allows us to account for constraints in a riskaverse way, which is important in the early stages of learning.
References
[1] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1957.
[2] B. Bischoff, D. Nguyen-Tuong, T. Koller, H. Markert, and A. Knoll. Learning Throttle Valve Control Using Policy Search. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
[3] B. Bischoff, D. Nguyen-Tuong, H. van Hoof, A. McHutchon, C. E. Rasmussen, A. Knoll, J. Peters, and M. P. Deisenroth. Policy Search for Learning Robot Control using Sparse Data. In Proceedings of the International Conference on Robotics and Automation, 2014.
[4] H. G. Bock and K. J. Plitt. A Multiple Shooting Algorithm for Direct Solution of Optimal Control Problems. In Proceedings of the 9th IFAC World Congress, Budapest. Pergamon Press, 1984.
[5] J. Boedecker, J. T. Springenberg, J. Wulfing, and M. Riedmiller. Approximate Real-Time Optimal Control Based on Sparse Gaussian Process Models. In Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2014.
[6] F. H. Clarke. Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1990.
[7] M. Cutler and J. P. How. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In Proceedings of the International Conference on Robotics and Automation, 2015.
[8] M. P. Deisenroth. Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010.
[9] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
[10] M. P. Deisenroth and C. E. Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning, 2011.

[11]
M. Diehl.
Lecture Notes on Optimal Control and Estimation
. 2014.  [12] A. Girard, C. E. Rasmussen, J. QuinoneroCandela, and R. MurraySmith. Gaussian Process Priors with Uncertain InputsApplication to MultipleStep Ahead Time Series Forecasting. Advances in Neural Information Processing Systems, 2003.
 [13] A. Grancharova, J. Kocijan, and T. A. Johansen. Explicit Stochastic Predictive Control of Combustion Plants based on Gaussian Process Models. Automatica, 44(6):1621–1631, 2008.
 [14] A. Griewank and P. L. Toint. Partitioned Variable Metric Updates for Large Structured Optimization Problems. Numerische Mathematik, 39(1):119–137, 1982.
 [15] L. Grüne and J. Pannek. Stability and Suboptimality Using Stabilizing Constraints. In Nonlinear Model Predictive Control: Theory and Algorithms, pages 87–112. Springer, 2011.
 [16] P. Hennig. Optimal Reinforcement Learning for Gaussian Systems. In Advances in Neural Information Processing Systems, 2011.
 [17] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. In International Conference on Learning Representations, 2016.
 [18] E. D. Klenske, M. N. Zeilinger, B. Schölkopf, and P. Hennig. Gaussian Process-Based Predictive Control for Periodic Error Correction. Transactions on Control Systems Technology, 24(1):110–121, 2016.
 [19] J. Kocijan. Modelling and Control of Dynamic Systems Using Gaussian Process Models. Advances in Industrial Control. Springer International Publishing, 2016.
 [20] G. Lee, S. S. Srinivasa, and M. T. Mason. GP-ILQG: Data-driven Robust Optimal Control for Uncertain Nonlinear Dynamical Systems. arXiv:1705.05344, 2017.
 [21] D. J. C. MacKay. Introduction to Gaussian Processes. In Neural Networks and Machine Learning, volume 168, pages 133–165. Springer, Berlin, Germany, 1998.
 [22] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. Scokaert. Constrained Model Predictive Control: Stability and Optimality. Automatica, 36(6):789–814, 2000.

[23] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001.
 [24] T. P. Minka. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2001.
 [25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540):529–533, 2015.
 [26] D. S. Naidu. Optimal Control Systems. CRC Press, 2003.
 [27] A. Y. Ng and M. I. Jordan. PEGASUS: A Policy Search Method for Large MDPs and POMDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2000.
 [28] T. X. Nghiem and C. N. Jones. Data-driven Demand Response Modeling and Control of Buildings with Gaussian Processes. In Proceedings of the American Control Conference, 2017.
 [29] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
 [30] M. Opper and O. Winther. Tractable Approximations for Probabilistic Models: The Adaptive TAP Mean Field Approach. Physical Review Letters, 86(17):5, 2001.
 [31] C. Ostafew, A. Schoellig, T. Barfoot, and J. Collier. Learning-based Nonlinear Model Predictive Control to Improve Vision-based Mobile Robot Path Tracking. Journal of Field Robotics, 33(1):133–152, 2015.
 [32] C. Ostafew, A. Schoellig, and T. Barfoot. Robust Constrained Learning-based NMPC Enabling Reliable Mobile Robot Path Tracking. The International Journal of Robotics Research, 35(13):1547–1563, 2016.
 [33] Y. Pan and E. Theodorou. Probabilistic Differential Dynamic Programming. In Advances in Neural Information Processing Systems, 2014.
 [34] Y. Pan, E. Theodorou, and M. Kontitsis. Sample Efficient Path Integral Control under Uncertainty. In Advances in Neural Information Processing Systems, 2015.
 [35] A. S. Polydoros and L. Nalpantidis. Survey of ModelBased Reinforcement Learning: Applications on Robotics. Journal of Intelligent and Robotic Systems, 86(2):153–173, 2017.
 [36] L. S. Pontryagin, E. F. Mishchenko, V. G. Boltyanskii, and R. V. Gamkrelidze. The Mathematical Theory of Optimal Processes. Wiley, 1962.
 [37] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
 [38] T. Raiko and M. Tornio. Variational Bayesian Learning of Nonlinear Hidden State-Space Models for Model Predictive Control. Neurocomputing, 72(16–18):3702–3712, 2009.
 [39] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
 [40] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. In Proceedings of Robotics: Science and Systems, 2012.
 [41] H. Schättler and U. Ledzewicz. Geometric Optimal Control: Theory, Methods and Examples, volume 53. Springer, 2012.
 [42] J. G. Schneider. Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning. In Advances in Neural Information Processing Systems. 1997.
 [43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484–489, 2016.
 [44] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, USA, 1998.
 [45] E. Todorov. Efficient Computation of Optimal Actions. Proceedings of the National Academy of Sciences of the United States of America, 106(28):11478–11483, 2009.
 [46] E. Todorov and W. Li. A Generalized Iterative LQG Method for Locally-Optimal Feedback Control of Constrained Nonlinear Stochastic Systems. In Proceedings of the American Control Conference, 2005.
 [47] M. Toussaint. Robot Trajectory Optimization using Approximate Inference. In Proceedings of the International Conference on Machine Learning, 2009.
 [48] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine. Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search. arXiv:1610.00673, 2016.
Appendix
Lipschitz Continuity
Lemma 3. The moment matching mapping is Lipschitz continuous for controls defined over a compact set $\mathcal{U}$.
Proof: Lipschitz continuity requires that the gradient is bounded. The gradient is

$$\frac{\partial(\mu_{t+1}, \Sigma_{t+1})}{\partial u_t}. \qquad (25)$$

The derivatives can be computed analytically [1].
We first show that the derivative $\partial\mu_{t+1}/\partial u_t$ is bounded. Defining $\tilde{x}_t := (x_t, u_t)$ with mean $\tilde\mu_t$ and covariance $\tilde\Sigma_t$, from [1] we obtain for all state dimensions $a = 1, \dots, D$

$$\mu_{t+1}^a = \beta_a^\top q_a, \qquad (26)$$
$$q_{ai} = \sigma_{f_a}^2\,\big|\tilde\Sigma_t \Lambda_a^{-1} + I\big|^{-1/2} \exp\!\big(-\tfrac{1}{2}\,\nu_i^\top (\tilde\Sigma_t + \Lambda_a)^{-1} \nu_i\big), \quad i = 1, \dots, n, \qquad (27)$$
$$\nu_i := \tilde{x}_i - \tilde\mu_t, \qquad (28)$$

where $n$ is the size of the training set of the dynamics GP and $\tilde{x}_i$ the $i$-th training input. The corresponding gradient w.r.t. $u_t$ is given by the last elements (those corresponding to the controls) of

$$\frac{\partial \mu_{t+1}^a}{\partial \tilde\mu_t} = \sum_{i=1}^{n} \beta_{ai}\,\frac{\partial q_{ai}}{\partial \tilde\mu_t} \qquad (29)$$
$$= \sum_{i=1}^{n} \beta_{ai}\, q_{ai}\, \nu_i^\top (\tilde\Sigma_t + \Lambda_a)^{-1}. \qquad (30)$$

Let us examine the individual terms in the sum on the rhs in (30): For a given trained GP, $\beta_{ai}$ is constant. The definition of $q_{ai}$ in (27) contains an exponentiated negative quadratic term, which is bounded between $0$ and $1$. Since $\tilde\Sigma_t + \Lambda_a$ is positive definite, the inverse determinant is defined and bounded. Finally, $\sigma_{f_a}^2 < \infty$, which makes $q_{ai} < \infty$. The remaining term in (30) is a vector-matrix product. The matrix $\tilde\Sigma_t + \Lambda_a$ is regular, and its inverse exists and is bounded (and constant as a function of $u_t$). Since $u_t \in \mathcal{U}$ where $\mathcal{U}$ is compact, we can also conclude that the vector difference $\nu_i$ in (30) is finite, which overall proves that the moment matching mapping is (locally) Lipschitz continuous and Lemma 3.
Sequential Quadratic Programming
We can use SQP for solving nonlinear optimization problems (NLPs) of the form

$$\min_{x} \; f(x) \quad \text{subject to} \quad g(x) = 0, \quad h(x) \le 0.$$

The Lagrangian associated with the NLP is

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \lambda^\top g(x) + \nu^\top h(x), \qquad (31)$$

where $\lambda$ and $\nu$ are Lagrange multipliers. Sequential Quadratic Programming (SQP) forms a quadratic (Taylor) approximation of the objective and a linear approximation of the constraints at each iterate $x_k$:

$$\min_{\Delta x}\; \nabla f(x_k)^\top \Delta x + \tfrac{1}{2}\,\Delta x^\top \nabla_{xx}^2 \mathcal{L}(x_k, \lambda_k, \nu_k)\, \Delta x \quad \text{s.t.} \quad g(x_k) + \nabla g(x_k)^\top \Delta x = 0, \quad h(x_k) + \nabla h(x_k)^\top \Delta x \le 0. \qquad (32)$$
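To make the SQP iteration concrete, here is a minimal sketch of one step on a toy equality-constrained NLP (no inequality constraints, exact Hessian); the problem and all names are illustrative only:

```python
import numpy as np

def sqp_step(x, lam):
    """One SQP iteration for: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0."""
    grad_f = 2.0 * x                       # gradient of the objective
    H = 2.0 * np.eye(2)                    # Hessian of the Lagrangian (exact here)
    g = np.array([x[0] + x[1] - 1.0])      # equality-constraint residual
    A = np.array([[1.0, 1.0]])             # constraint Jacobian
    # KKT system of the local QP: H dx + A^T lam = -grad_f,  A dx = -g
    KKT = np.block([[H, A.T], [A, np.zeros((1, 1))]])
    sol = np.linalg.solve(KKT, np.concatenate([-grad_f, -g]))
    return x + sol[:2], sol[2:]            # new iterate and multiplier estimate

x, lam = np.array([3.0, -1.0]), np.zeros(1)
for _ in range(3):
    x, lam = sqp_step(x, lam)
# Since this objective is quadratic and the constraint linear, a single
# step already lands on the optimum x = (0.5, 0.5) with multiplier -1.
```

For a genuinely nonlinear problem, the residual, Jacobian, and Lagrangian Hessian are re-evaluated at every iterate, and the QP subproblem additionally carries the linearized inequality constraints.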
5.1 Moment Matching Approximation [1]
Following the law of iterated expectations, for target dimensions $a = 1, \dots, D$ we obtain the predictive mean

$$\mu_a^* = \mathbb{E}_{\tilde{x}}\big[\mathbb{E}_{f}[f_a(\tilde{x})] \,\big|\, \mu, \Sigma\big] = \mathbb{E}_{\tilde{x}}\big[m_{f_a}(\tilde{x}) \,\big|\, \mu, \Sigma\big] \qquad (33)$$
$$= \beta_a^\top q_a \qquad (34)$$

with $\beta_a = (K_a + \sigma_{w_a}^2 I)^{-1} y_a$ and $q_a = [q_{a1}, \dots, q_{an}]^\top$. The entries of $q_a$ are computed using standard results from multiplying and integrating over Gaussians and are given by

$$q_{ai} = \sigma_{f_a}^2\, \big|\Sigma \Lambda_a^{-1} + I\big|^{-1/2} \exp\!\big(-\tfrac{1}{2}\,\nu_i^\top (\Sigma + \Lambda_a)^{-1} \nu_i\big), \qquad (35)$$

where we define

$$\nu_i := x_i - \mu \qquad (36)$$

as the difference between the training input $x_i$ and the mean $\mu$ of the test input distribution $p(\tilde{x}) = \mathcal{N}(\mu, \Sigma)$.
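The mean computation in (34)–(36) can be sketched in a few lines; the following is a minimal illustration, assuming a squared-exponential kernel with signal variance `sf2` and length-scale matrix `Lam` (hypothetical names, not the paper's code):

```python
import numpy as np

def mm_predictive_mean(X, beta, sf2, Lam, mu, Sigma):
    """Moment-matched mean mu_a^* = beta_a^T q_a, with q_ai as in (35)."""
    D = X.shape[1]
    det_term = np.linalg.det(Sigma @ np.linalg.inv(Lam) + np.eye(D)) ** -0.5
    iSL = np.linalg.inv(Sigma + Lam)                     # (Sigma + Lambda_a)^{-1}
    nu = X - mu                                          # rows: nu_i = x_i - mu
    q = sf2 * det_term * np.exp(-0.5 * np.einsum('ij,jk,ik->i', nu, iSL, nu))
    return beta @ q

# Sanity check: for Sigma -> 0 (deterministic input), the result collapses
# to the ordinary GP posterior mean beta^T k(X, mu).
X = np.array([[0.0], [1.0], [2.0]])
beta = np.array([0.5, -0.2, 0.1])
m = mm_predictive_mean(X, beta, 1.3, np.array([[0.7]]), np.array([0.3]),
                       np.zeros((1, 1)))
```

The effect of input uncertainty enters only through the determinant prefactor and the widened quadratic form $(\Sigma + \Lambda_a)^{-1}$, which smears out the kernel evaluation around the input mean.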
Computing the predictive covariance matrix $\Sigma^* \in \mathbb{R}^{D \times D}$ requires us to distinguish between diagonal elements $\sigma_{aa}^2$ and off-diagonal elements $\sigma_{ab}^2$, $a \ne b$: Using the law of total (co-)variance, we obtain for target dimensions $a, b = 1, \dots, D$

$$\sigma_{aa}^2 = \mathbb{E}_{\tilde{x}}\big[\mathrm{var}_f[\Delta_a \,|\, \tilde{x}]\big] + \mathbb{E}_{f,\tilde{x}}[\Delta_a^2] - (\mu_a^*)^2, \qquad (37)$$
$$\sigma_{ab}^2 = \mathbb{E}_{f,\tilde{x}}[\Delta_a \Delta_b] - \mu_a^* \mu_b^*, \qquad a \ne b, \qquad (38)$$

respectively, where $\mu_a^*$ is known from (33). The off-diagonal terms do not contain the additional term $\mathbb{E}_{\tilde{x}}[\mathrm{var}_f[\Delta_a \,|\, \tilde{x}]]$ because of the conditional independence assumption of the GP models: different target dimensions do not covary for given $\tilde{x}$.
We start the computation of the covariance matrix with the terms that are common to both the diagonal and the off-diagonal entries: With $p(\tilde{x}) = \mathcal{N}(\mu, \Sigma)$ and the law of iterated expectations, we obtain

$$\mathbb{E}_{f,\tilde{x}}[\Delta_a \Delta_b] = \mathbb{E}_{\tilde{x}}\big[\mathbb{E}_f[\Delta_a \,|\, \tilde{x}]\, \mathbb{E}_f[\Delta_b \,|\, \tilde{x}]\big] = \mathbb{E}_{\tilde{x}}\big[m_{f_a}(\tilde{x})\, m_{f_b}(\tilde{x})\big] \qquad (39)$$

because of the conditional independence of $\Delta_a$ and $\Delta_b$ given $\tilde{x}$. Using the definition of the mean function, we obtain

$$\mathbb{E}_{\tilde{x}}\big[m_{f_a}(\tilde{x})\, m_{f_b}(\tilde{x})\big] = \beta_a^\top\, \mathbb{E}_{\tilde{x}}\big[k_a(X, \tilde{x})\, k_b(X, \tilde{x})^\top\big]\, \beta_b \qquad (40)$$
$$= \beta_a^\top Q\, \beta_b. \qquad (41)$$

Using standard results from Gaussian multiplications and integration, we obtain the entries $Q_{ij}$ of $Q \in \mathbb{R}^{n \times n}$

$$Q_{ij} = k_a(x_i, \mu)\, k_b(x_j, \mu)\, |R|^{-1/2} \exp\!\big(\tfrac{1}{2}\, z_{ij}^\top R^{-1} \Sigma\, z_{ij}\big), \qquad (42)$$

where we define

$$R := \Sigma\big(\Lambda_a^{-1} + \Lambda_b^{-1}\big) + I, \qquad z_{ij} := \Lambda_a^{-1} \nu_i + \Lambda_b^{-1} \nu_j,$$

with $\nu_i$ taken from (36). Hence, the off-diagonal entries of $\Sigma^*$ are fully determined by (33)–(36), (38), (40), (41), and (42).
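The common term $\beta_a^\top Q \beta_b$ from (40)–(42) can be sketched directly; the code below is a minimal illustration for squared-exponential kernels, and all names (`sf2_a`, `La`, etc.) are hypothetical:

```python
import numpy as np

def mm_cross_term(X, mu, Sigma, sf2_a, La, sf2_b, Lb, beta_a, beta_b):
    """E[m_a(x) m_b(x)] = beta_a^T Q beta_b, with Q_ij as in (42)."""
    n, D = X.shape
    iLa, iLb = np.linalg.inv(La), np.linalg.inv(Lb)
    R = Sigma @ (iLa + iLb) + np.eye(D)                  # R from (42)
    RiS = np.linalg.solve(R, Sigma)                      # R^{-1} Sigma
    nu = X - mu                                          # nu_i from (36)
    ka = sf2_a * np.exp(-0.5 * np.einsum('ij,jk,ik->i', nu, iLa, nu))
    kb = sf2_b * np.exp(-0.5 * np.einsum('ij,jk,ik->i', nu, iLb, nu))
    za, zb = nu @ iLa, nu @ iLb                          # rows: Lambda^{-1} nu_i
    detR = np.linalg.det(R) ** -0.5
    Q = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = za[i] + zb[j]                            # z_ij from (42)
            Q[i, j] = ka[i] * kb[j] * detR * np.exp(0.5 * z @ RiS @ z)
    return beta_a @ Q @ beta_b

# Sanity check: for Sigma -> 0, Q_ij factorises into k_a(x_i, mu) k_b(x_j, mu),
# so the expectation reduces to the product m_a(mu) * m_b(mu).
X = np.array([[0.0], [1.0], [2.0]])
beta_a, beta_b = np.array([0.3, -0.1, 0.2]), np.array([0.5, 0.4, -0.2])
cross = mm_cross_term(X, np.array([0.4]), np.zeros((1, 1)),
                      1.0, np.array([[0.7]]), 2.0, np.array([[1.1]]),
                      beta_a, beta_b)
```

Subtracting $\mu_a^* \mu_b^*$ from this cross term then yields the off-diagonal covariance (38); the double loop can be vectorized, but is kept explicit here to mirror the index structure of (42).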
References
 [1] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.