1 Introduction
Control may be regarded as decision making in a dynamic environment. Decisions have to be based on beliefs over the consequences of actions encoded by a model. Dealing with uncertain or changing dynamics is the realm of adaptive control. In its classical form, parametric approaches are considered (e.g. [20]) and, typically, uncertainties are modelled by Brownian motion (yielding stochastic adaptive control [11, 6]) or via set-based considerations (an approach followed by robust adaptive control [15]). In contrast, we adopt an epistemological take on probabilistic control and bring to bear Bayesian nonparametric learning methods whose introspective qualities [7] can aid in addressing the exploration-exploitation trade-offs relative to one’s subjective beliefs in a principled manner [1]. Based on these Bayesian learning methods, it is our ambition to develop adaptive controllers with probabilistic guarantees (interpreted in an epistemological sense) on control success.
In contrast to classical adaptive control, where inference has to be restricted to a finite-dimensional parameter space, the nonparametric approach affords the learning algorithms greater flexibility to identify and control systems with very few model assumptions. This is possible because these methods can perform Bayesian inference over rich, infinite-dimensional function spaces that could encode the dynamics. This property has led to a surge of interest in Bayesian nonparametrics, particularly benefiting their algorithmic advancement and application to a plethora of learning problems. Due to their favourable analytic properties, normal or Gaussian processes (GPs) [2, 16] have been the main choice of method in recent years. Among other domains, GPs have been applied to learning discrete-time dynamic systems in the context of model-predictive control [9, 10, 12, 17], learning the error of inverse kinematics models [13, 14], dual control [1] as well as reinforcement learning and dynamic programming [4, 5, 18, 8].
On the flip side, the extent of flexibility can lead to the temptation to use the approach in a black-box fashion, disregarding most structural knowledge of the underlying dynamics [9, 10, 18, 8, 12]. This can result in unnecessarily high-dimensional learning problems and slow convergence rates, and often necessitates large training corpora, typically to be collected offline. In the extreme, the latter requirement can cause slow prediction and conditioning times. Moreover, GPs have been used in combination with computationally intensive planning methods such as dynamic programming [4, 5, 18], rendering real-time applicability difficult.
In contrast to all this work, we will incorporate structural a priori knowledge of the dynamics afforded by Lagrangian mechanics (without sacrificing the flexibility afforded by the nonparametric nature). This requires, in some instances, partial departure from Gaussianity (e.g. if the sign of a function component of the dynamics is known) but improves the detail with which the system is identified and can reduce the dimensionality of the identification problem. Furthermore, our method will use the uncertainties of the models to decide upon training example incorporation and decision making.
Aside from learning, our method employs feedback linearisation [19] in an outer-loop control law to reduce the complexity of the control problem. Thereby, in expectation, the problem is reduced to controlling a double integrator via an inner-loop control law. If we combine the outer-loop controller with an inner-loop controller that has desirable guarantees (e.g. stability) for the double integrator, these properties can extend to the expected nonlinear closed-loop dynamics. The resulting approach enables rapid decision making and can be deployed online.
Our work is presented at the AMLSC Workshop at NIPS, 2013. During the review process, we were made aware of GP-MRAC [3]. The authors utilise a Gaussian process on joint state-control space to learn the error of an inversion controller in model-reference adaptive control. Under the assumption that the GP could be stated as an SDE of time, they prove stability. In contrast to this work, our method is capable of identifying the drift and control input vector fields constituting the underlying control-affine system individually, yielding a more fine-grained identification result. While this benefit requires the introduction of probing signals to the control during online learning, each of the coupled learning problems has state space dimensionality only. Moreover, our method and stability results are not limited to Gaussian processes. If the control-input vector fields are identified with a log-normal process, our controller will automatically be cautious in scarcely explored regions.
2 Method
2.1 Model
Dynamics. Let be a (usually continuous) set of times, denote the configuration space, the state space and the control space. Via the principle of least action and the resulting Euler-Lagrange equation, Lagrangian mechanics leads to the conclusion that controllable mechanical systems are of second order and can be written in control-affine form:
(1) 
Here, is a generalized coordinate of the configuration and is the control input. Functions are called drift and input functions, respectively. In the pendulum control domain we consider below, will encode joint angles and is a torque is proportional to.
Defining , , we can write the state as . The dynamics can be restated as the system of equations
(2)  
(3)  
(4) 
where and is the th row of matrix . In this work, we assume the system is fully actuated. That is, we assume that always is full-rank: . Full actuation enables us to instantaneously set the acceleration in all dimensions of . However, we do not have immediate control over joint-angle velocities. Incorporating this kind of knowledge afforded by Lagrangian mechanics is beneficial both from a principled Bayesian vantage point and in order to decompose the dimensionality of the learning task.
Epistemic uncertainty and learning. Both dynamics functions and can be uncertain a priori. That is, a priori our uncertainty is modelled by the assumption that where are stochastic processes. The processes reflect our epistemic uncertainty about the true underlying (deterministic) dynamics functions and . If data becomes available over the course of the state evolution, we can update our beliefs over the dynamics in a Bayesian fashion. That is, at time we assume where is the data recorded up to time . The process of conditioning is often referred to as (Bayesian) learning.
Data collection. We assume our controller can be called at an ordered set of times . At each time , the controller is able to observe the state and to set the control input . (Footnote 1: In fact, we can only observe and have to obtain noisy observations of as we will describe below.) The controller may choose to invoke learning at an ordered subset of times. To this end, at each time , the controller invokes a procedure explicated in Sec. 2.2 if it decides to incorporate an additional data point into data set . The decision on whether to update the data will be based on the belief over the data point’s anticipated informativeness as approximated by its variance. (Footnote 2: Variance is known to approximate entropic measures of uncertainty (cf. [1]) and is often easier to compute than entropy.) For simplicity, we assume that learning can occur every seconds and the controller is called every seconds. Continuous control takes place in the limit of infinitesimal .
2.2 Learning procedure
To enable learning, we will require derivatives of the state (that is, estimates of and ). If we do not have physical means to measure velocities and accelerations, it becomes necessary to obtain numerical estimates based on observations of . To estimate derivatives, we chose a second-order method. That is, our state derivative estimates are where is a period length with which we can observe states. In this work, we assume .
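The exact finite-difference formulas used for the derivative estimates are not legible in the text above. As a minimal sketch, assuming a standard second-order central-difference stencil over three equally spaced position observations (the function name and arguments are hypothetical), velocity and acceleration estimates could be obtained as follows:

```python
def estimate_derivatives(q_prev, q_curr, q_next, dt):
    """Central-difference estimates at the middle sample.

    Assumption: three position observations spaced dt apart; this stencil
    is an illustrative choice, not necessarily the one used in the paper.
    """
    velocity = (q_next - q_prev) / (2.0 * dt)
    acceleration = (q_next - 2.0 * q_curr + q_prev) / (dt * dt)
    return velocity, acceleration


# Example: samples of q(t) = t**2 around t = 1, where dq/dt = 2 and d2q/dt2 = 2.
dt = 0.1
samples = [(1.0 - dt) ** 2, 1.0 ** 2, (1.0 + dt) ** 2]
vel, acc = estimate_derivatives(samples[0], samples[1], samples[2], dt)
```

Both estimates are exact for quadratic trajectories up to floating-point rounding; for noisy observations the acceleration estimate amplifies noise by a factor of order 1/dt², which is one reason the observational noise level matters in the learning step.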
Assuming online learning, the data sets are found incrementally. Since it is hard to use the data to infer and simultaneously, we will have to actively decide which one we desire to learn about (and set the control accordingly – which we will then refer to as a probing control). To this end, we distinguish between the following learning components:

Learning : Assume we are at time and that we decide to learn about . This decision is made whenever our uncertainty about , encoded by , is above a certain threshold . When learning is initiated, we keep the control constant for two more time steps , to obtain a good derivative estimate as described above. To remove additional uncertainty due to ignorance about , we set probing control yielding dynamics during time interval . On the basis of a derivative estimate , we can determine a noisy estimate of unknown function value at time as per
So, is added to the data after time .

Learning : At time , we choose to learn about function whenever our uncertainty about is sufficiently small (i.e. ) and our uncertainty about is sufficiently large (). When learning is initiated, we keep the control constant for two more time steps , to obtain a good derivative estimate as described above.
Let be the th unit vector. To learn about at state , we apply a control action where . Inspecting Eq. 4, we can then see that
Since will generally be a random variable, so is having mean and variance . We obtain a noisy estimate of its derivative analogously to above. Modelling as a random variable with mean , becomes a random variable with mean
(5)
and variance
(6)
Therefore, after time , we add training point to the data set. The additional variance (as per Eq. 6) is captured by setting observational noise levels for accordingly.
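The switching logic of the two learning components can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the mode labels and the scalar setting are hypothetical, and the variance of the input-function observation is taken to be the sum of the derivative-estimate noise variance and the drift's posterior variance (i.e. the two are assumed independent), since Eqs. 5 and 6 are not legible here.

```python
def choose_mode(var_drift, var_input, thr_drift, thr_input):
    """Decide what to do at the current state from posterior variances.

    Drift learning takes priority; the input function is only probed once
    the drift is known well enough at this state.
    """
    if var_drift > thr_drift:
        return "learn_drift"
    if var_input > thr_input:
        return "learn_input"
    return "control"


def input_observation(accel_estimate, drift_mean, drift_var, accel_noise_var):
    """Noisy observation of one input-function component.

    Assumes a unit probing control, so that acceleration = drift + input
    component, and assumes independence of the two uncertainty sources.
    """
    value = accel_estimate - drift_mean
    noise_var = accel_noise_var + drift_var
    return value, noise_var
```

For example, `choose_mode(0.1, 1.0, 0.5, 0.5)` selects input-function learning because the drift variance is already below its threshold while the input-function variance is not.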
2.3 Control law
Unless the control actions are chosen to aid system identification (as described above), we will want to base our control actions on our probabilistic belief model over the dynamics. Given such an uncertain model, it remains to define an (outer-loop) control policy with desirable properties. We propose a control law that, when not learning, uses the probabilistic model to guarantee desired behaviour in expectation. In this work, our attention will be restricted to expected stability.
Let , and . Acceleration is a random variable with mean
Hence, when applying inversion control law
(7) 
we get expected closed-loop dynamics of
(8)  
(9)  
(10) 
where is an inner-loop control law.
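The idea behind the inversion control law can be illustrated for a scalar system. The sketch below assumes that the expected acceleration is the posterior mean of the drift plus the posterior mean of the input function times the control, and that the law solves this relation for the control so that the expected acceleration equals the inner-loop command. All names are hypothetical, and the exact form of Eq. 7 should be taken from the original equations.

```python
def inversion_control(u_inner, drift_mean, input_mean):
    """Outer-loop control that cancels the expected drift (scalar case).

    Assumption: input_mean is nonzero, mirroring the full-actuation
    assumption made for the input function.
    """
    if input_mean == 0.0:
        raise ValueError("expected input function must be nonzero")
    return (u_inner - drift_mean) / input_mean


def expected_acceleration(u, drift_mean, input_mean):
    """Mean acceleration under the control-affine belief model."""
    return drift_mean + input_mean * u


# With the inversion law applied, the expected acceleration equals u_inner.
u = inversion_control(u_inner=1.5, drift_mean=-4.0, input_mean=2.0)
acc = expected_acceleration(u, drift_mean=-4.0, input_mean=2.0)
```

In the multi-dimensional case the division would be replaced by solving a linear system with the expected input matrix, which is where the full-rank assumption enters.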
Theorem 2.1.
Assume we are not performing probing actions anymore. That is, we are at time such that . Let be a control law that is linear in state and that ensures the double-integrator dynamics of the form
to have as a globally asymptotically stable equilibrium point. Finally, suppose that expectation and derivative commute. That is, . Then, our control law as per Eq. 7, with inner control law , ensures is a globally asymptotically stable equilibrium of the expected dynamics. In particular, .
Proof.
(Sketch) Let denote the differential operator with respect to time. Leveraging the linearity of the differential operator, we can exchange it with the expectation operator. Thereby, we conclude from Eq. 8 and Eq. 10 that
and
where the last step follows by linearity of the control law. Defining yields the quadratic regulator problem : . By assumption, we know that ensures that is a globally asymptotically stable equilibrium point of this dynamic system. Hence, in particular, . Re-substituting the definitions of for and subsequently, of , yields the desired statement. ∎
Remark 2.2.
The requirement that differential and expectation operator can be interchanged has to be checked on a case-by-case basis and depends on the interpretation of the random differential equation or the particular kinds of processes driving the equation. Examples of where the assumption is met are white-noise limits or when the integral curve is an process with differentiable covariance function (here, is the mean-square derivative). An alternative would be to show expected stability for every Euler approximation and prove convergence in the mean of these Euler approximations to the continuous-time equation.
Consequently, we have given conditions under which our control law guarantees feedback linearisation in expectation (and of the dynamics of the mean trajectory). That is, by choosing to impose desired behaviour for the double-integrator problem (which is easy), we can reshape the dynamics such that the expected closed-loop dynamics is stable.
For instance, a simple method of guaranteeing global asymptotic convergence of the state towards a goal state would be to set the innermost control law to the proportional feedback law
(11) 
where .
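As a hedged illustration of such an inner-loop law, the sketch below applies linear state feedback to a scalar double integrator and integrates it with explicit Euler steps. The gains `k_pos` and `k_vel`, the zero goal velocity and the step size are assumptions made for the example; the gains of Eq. 11 are not legible in the text above.

```python
import math


def inner_law(q, qd, q_goal, k_pos=1.0, k_vel=2.0):
    """Linear feedback on position error and velocity (hypothetical gains)."""
    return -k_pos * (q - q_goal) - k_vel * qd


def simulate_double_integrator(q0, qd0, q_goal, dt=0.01, steps=2000):
    """Explicit-Euler rollout of qdd = inner_law(q, qd)."""
    q, qd = q0, qd0
    for _ in range(steps):
        qdd = inner_law(q, qd, q_goal)
        q += dt * qd
        qd += dt * qdd
    return q, qd


# Steer from rest at q = 0 to the goal q = pi with zero goal velocity.
q_final, qd_final = simulate_double_integrator(0.0, 0.0, math.pi)
```

With these gains the closed loop is critically damped, so the state approaches the goal without overshoot; any positive gain pair yields an asymptotically stable double integrator.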
3 Experiments – Learning to control a torque-controlled damped pendulum with a combination of normal and log-normal processes
We explored our method’s properties in simulations of a rigid pendulum with (a priori unknown) drift and constant input function . Here, are joint angle position and velocity, denotes a friction coefficient, is acceleration due to gravity, is the length and the mass of the pendulum. The control input applies a torque to the pendulum’s pivot that corresponds to joint-angle acceleration. encode the pendulum pointing downward and denotes the position in which the pendulum is upward. Given an initial configuration , we desired to steer the state to a terminal configuration .
For learning, we assumed that and had been drawn from a log-normal process (for details on normal processes see [16]). The latter assumption encodes a priori knowledge that control input function can only assume positive values (but, to demonstrate the idea of cascading processes, we had discarded the information that was a constant). During learning, the latter process was based on a standard normal process conditioned on observations of . To compute the control as per Eq. 7, we need to convert the posterior mean over into the expected value over . The required relationship is known to be as follows:
(12) 
If required the posterior variance can be obtained as
Note that the posterior mean over increases with the variance of our normal process in log-space, and that the control law as per Eq. 7 is inversely proportional to the magnitude of this mean. Hence, the resulting controller is cautious, in the sense that control output magnitude is damped exponentially in regions of high uncertainty (variance).
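The conversion from log-space moments and the resulting caution can be made concrete with the standard log-normal identities: if X is normal with mean μ and variance σ², then exp(X) has mean exp(μ + σ²/2) and variance (exp(σ²) − 1)·exp(2μ + σ²). The sketch below uses these identities together with a scalar inversion law; the function names and the scalar form of the control law are assumptions made for illustration.

```python
import math


def lognormal_moments(mu, var):
    """Mean and variance of exp(X) for X ~ Normal(mu, var)."""
    mean = math.exp(mu + 0.5 * var)
    variance = (math.exp(var) - 1.0) * math.exp(2.0 * mu + var)
    return mean, variance


def cautious_control(u_inner, drift_mean, log_input_mean, log_input_var):
    """Scalar inversion control using the expected log-normal input function."""
    input_mean, _ = lognormal_moments(log_input_mean, log_input_var)
    return (u_inner - drift_mean) / input_mean


# Same log-space mean, increasing log-space variance: the control shrinks.
u_certain = cautious_control(1.0, 0.0, 0.0, 0.1)
u_uncertain = cautious_control(1.0, 0.0, 0.0, 4.0)
```

Because the expected input function grows like exp(σ²/2), dividing by it damps the control exponentially in the log-space variance, which matches the cautious behaviour described above.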
To simulate a discrete zero-order sample-and-hold controller in a continuous environment, we simulated the dynamics between two consecutive controller calls (occurring every seconds) employing standard ODE-solving packages (i.e. Matlab’s ode45 routine).
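The sample-and-hold simulation loop can be sketched in a few lines. This is an illustration only: it swaps Matlab's adaptive ode45 for a fixed-step classical Runge-Kutta (RK4) integrator, assumes a textbook damped-pendulum drift of the form −(g/l)·sin(q) − (r/(m·l²))·q̇ with a unit input function, and uses a controller that knows the drift exactly. None of these choices is confirmed by the text above, and all names are hypothetical.

```python
import math

G = 9.81  # acceleration due to gravity


def pendulum_drift(q, qd, length=1.0, friction=1.0, mass=0.5):
    """Assumed damped-pendulum drift; q = 0 points downward."""
    return -(G / length) * math.sin(q) - friction / (mass * length ** 2) * qd


def rk4_step(q, qd, u, h):
    """One RK4 step of the pendulum with the control u held constant."""
    def f(q_, qd_):
        return qd_, pendulum_drift(q_, qd_) + u  # unit input function assumed

    k1 = f(q, qd)
    k2 = f(q + 0.5 * h * k1[0], qd + 0.5 * h * k1[1])
    k3 = f(q + 0.5 * h * k2[0], qd + 0.5 * h * k2[1])
    k4 = f(q + h * k3[0], qd + h * k3[1])
    q_new = q + h / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
    qd_new = qd + h / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
    return q_new, qd_new


def simulate(controller, q0, qd0, t_end=20.0, dt_ctrl=0.01, substeps=10):
    """Call the controller every dt_ctrl seconds and hold u in between."""
    q, qd = q0, qd0
    h = dt_ctrl / substeps
    for _ in range(int(round(t_end / dt_ctrl))):
        u = controller(q, qd)          # sampled once per controller call
        for _ in range(substeps):      # continuous dynamics between calls
            q, qd = rk4_step(q, qd, u, h)
    return q, qd


def known_model_controller(q, qd, q_goal=math.pi):
    """Inversion law with exact drift knowledge and a linear inner loop."""
    u_inner = -1.0 * (q - q_goal) - 2.0 * qd
    return u_inner - pendulum_drift(q, qd)


q_end, qd_end = simulate(known_model_controller, 0.0, 0.0)
```

Replacing `known_model_controller` with a controller that queries posterior means instead of the true drift would give the learning-based variant; the hold structure of the loop stays the same.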
We illustrated the behaviour of our controllers in a sequence of three experiments. The parameter settings are provided in Tab. 1. Recorded control energies and errors (in comparison to continuous proportional controllers) are provided in Tab. 2.
Our Bayesian controller maintains epistemic beliefs over the dynamics. These beliefs govern our control decisions (including those on when to learn). Furthermore, to keep prediction times low, beliefs are only updated when the current variance indicates a sufficient degree of uncertainty. Therefore, one would expect to observe the following properties of our controller:
(i) When the priors are chosen sensibly (could be indicated by the dynamic functions’ likelihood under the probabilistic models), we expect good control performance.
(ii) Prior training improves control performance and reduces learning, but is not necessary to reach the goal. Both properties can be observed in Exp. 1 and Exp. 3.
(iii) When the controller is ignorant of the inaccuracy of its beliefs over the dynamics (i.e. the actual dynamics are unlikely but the variances are low), control may fail since the false beliefs are not updated. An example of this is provided in Exp. 2.
(iv) We can overcome such problems in practice by employing the standard technique (see [16]) of choosing the prior hyperparameters that maximise marginal likelihood. In Exp. 3, this approach was successfully applied to the control problem of Exp. 2.
Tab. 1: Parameter settings.
       | (l,r,m)   |     |    |              |       |    |       |
Exp. 1 | (1,1,0.5) | .01 | .5 | (.001, .005) | (0,2) | () | (1,1) | 20
Exp. 2 | (1,0.5,4) | .01 | 1  | (.001, .005) | (0,2) | () | (2,2) | 15
Exp. 3 | (1,0.5,4) | .01 | 1  | (.001, .005) | (0,2) | () | (2,2) | 20
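Tab. 2: Recorded control energies and errors.
       | P1  | P100  | SP1   | SP2   | P1  | P100 | SP1 | SP2 | SP1      | SP2
Exp. 1 | 134 | 644   | 139   | 57    | 137 | 10   | 59  | 25  | (18, 20) | (23, 53)
Exp. 2 | 552 | 11942 | 14759 | 17029 | 139 | 10   | 82  | 72  | (2,1)    | (2,1)
Exp. 3 | 730 | 11942 | 3753  | 1619  | 184 | 10   | 83  | 17  | (12,2)   | (12,2)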
Experiment 1. We started with a zero-mean normal process prior over endowed with a rational quadratic kernel with automatic relevance determination (RQ-ARD) [16]. The kernel hyperparameters were fixed. Observational noise variance was set to . The log-normal process over was implemented by placing a normal process over with zero mean and RQ-ARD kernel with fixed hyperparameters and observational noise level . Note, the latter was set higher to reflect the uncertainty due to . In the future, we will consider incorporating heteroscedastic observational noise based on and the sampling rate. Also, one could incorporate knowledge about periodicity in the kernel.
Results are depicted in Fig. 1 and 2. We see that the system was accurately identified by the stochastic processes. When restarting the control task with stochastic processes pretrained from the first round, the task was solved with less learning, more swiftly and with less control energy.
Experiment 2. We investigated the impact of inappropriate magnitudes of confidence in a wrong model. We endowed the controller’s priors with zero mean functions and SE-ARD kernels [16]. Length scales of kernel were set to 20 and the output scale to . In addition to the low output scale, we set observational noise variance to a low value of 0.0001, suggesting (ill-founded) high confidence in the prior. The length scale of kernel was set to 50 with low output scales and observational noise variance of 0.5 and 0.001, respectively.
The results are depicted in Fig. 3. As is to be expected, the controller fails to realise the inadequacy of its beliefs. This results in a failure to update its beliefs and, consequently, in a failure to converge to the target state.
Of course, this could be overcome with an actor-critic approach. Such solutions will be investigated in the context of future work.
Experiment 3.
Exp. 2 was repeated. This time, however, the kernel hyperparameters were found by maximising the marginal likelihood of the data. The automated identification of hyperparameters is beneficial in practical scenarios where a good prior for the underlying dynamics may be hard to conceive.
The optimiser succeeded in finding sensible parameters that allowed good control performance. As before, the method benefited from prior training, yielding faster convergence and lower control effort. Both untrained and pretrained methods outperformed the proportional (P) controllers either in terms of control energy or convergence. Finally, the SP controllers with hyperparameter optimisation outperformed the SP controllers with fixed hyperparameters set in Exp. 2 (cf. Tab. 2).
4 Conclusions
We have applied Bayesian nonparametric methods to learn online the drift and control input functions of a fully actuated control-affine second-order dynamical system. Paired with the idea of feedback linearisation, we devised a control law that switches between probing actions for learning and control signals that drive the expected trajectory towards a given setpoint. Our simulations have illustrated our controller’s behaviour in the context of a pendulum regulator problem and that it can successfully solve the identification and control problems. They have also served as an illustration of the inherent pitfalls of Bayesian control: guarantees are stated relative to epistemological beliefs (encoded by a posterior) over the dynamical system in question. Therefore, the controller’s performance may be undermined by ignorance over the potential falsity of prior beliefs (cf. Exp. 2). However, as illustrated in Exp. 3, even the simplest model selection methods can alleviate the burden of having to conceive a good fixed prior.
In future work, we will explore how to employ a predictor-corrector approach to uncover overconfidence of our models and to initiate learning. At present, our control law achieves desired performance of the expected trajectory. We will investigate how to extend the guarantees to achieve performance guarantees in expectation and within probability bounds. Other theoretical questions under investigation are analysis of the trade-offs between the impact of probing actions (to learn), the desire to keep prediction time low, information gain and control refresh cycle length. Finally, we will assess our methods’ performance in higher-dimensional systems.
References
 [1] Tansu Alpcan. Dual Control with Active Learning using Gaussian Process Regression. Arxiv preprint arXiv:1105.2211, pages 1–29, 2011.
 [2] H. Bauer. Wahrscheinlichkeitstheorie. deGruyter, 2001.
 [3] Girish Chowdhary, HA Kingravi, JP How, and PA Vela. Bayesian Nonparametric Adaptive Control using Gaussian Processes. Technical report, MIT, 2013.
 [4] MP Deisenroth, J. Peters, and C. E. Rasmussen. Approximate dynamic programming with gaussian processes. ACC, June 2008.
 [5] M.P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian process dynamic programming. Neurocomputing, 2009.
 [6] T.E. Duncan and B. Pasik-Duncan. Adaptive control of a scalar linear stochastic system with a fractional Brownian motion. In IFAC World Congress, 2008.
 [7] H. Grimmett, R. Paul, R. Triebel, and I. Posner. Knowing when we don’t know: Introspective classification for mission-critical decision making. In ICRA, 2013.
 [8] J. Ko, D. Klein, D. Fox, and D. Haehnel. Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp. In ICRA, 2007.
 [9] J. Kocijan and R. Murray-Smith. Nonlinear Predictive Control with a Gaussian. Lecture Notes in Computer Science 3355, Springer, pages 185–200, 2005.
 [10] J. Kocijan, R. Murray-Smith, C.E. Rasmussen, and B. Likar. Predictive control with Gaussian process models. In The IEEE Region 8 EUROCON 2003. Computer as a Tool., volume 1, pages 352–356. IEEE, 2003.
 [11] P. R. Kumar. A survey of some results in stochastic adaptive control. SIAM J. Control and Optimization, 23, 1985.
 [12] Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe Girard. Gaussian Process Model Based Predictive Control. In IEEE Eurocon 2003: The International Conference on Computer as a Tool, 2003.
 [13] D. Nguyen-Tuong and J. Peters. Using model knowledge for learning inverse dynamics. In IEEE Int. Conf. on Robotics and Automation (ICRA), 2010.
 [14] D. Nguyen-Tuong, J. Peters, M. Seeger, and B. Schölkopf. Learning inverse dynamics: a comparison. In Europ. Symp. on Artif. Neural Netw., 2008.
 [15] P. Ioannou and J. Sun. Robust Adaptive Control. Prentice Hall, 1995.

 [16] C.E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 [17] Alex Rogers, Sasan Maleki, Siddhartha Ghosh, and N.R. Jennings. Adaptive Home Heating Control Through Gaussian Process Prediction and Mathematical Programming. In 2nd Int. Workshop on Agent Technology for Energy Systems (ATES 2011), 2011.
 [18] A. Rottmann and W. Burgard. Adaptive Autonomous Control using Online Value Iteration with Gaussian Processes. In ICRA, 2009.
 [19] M. W. Spong. Partial feedback linearization of underactuated mechanical systems. In Proc. IEEE Int. Conf. on Intel. Robots and Sys. (IROS), 1994.
 [20] K.Y. Volyanskyy, M.M. Haddad, and A.J. Calise. A new neuroadaptive control architecture for nonlinear uncertain dynamical systems: Beyond sigma and emodifications. In CDC, 2008.