Model-Based Reinforcement Learning for Physical Systems Without Velocity and Acceleration Measurements

by Alberto Dalla Libera, et al.
Università di Padova

In this paper, we propose a derivative-free model learning framework for Reinforcement Learning (RL) algorithms based on Gaussian Process Regression (GPR). In many mechanical systems, only positions can be measured by the sensing instruments. Then, instead of representing the system state as suggested by the physics with a collection of positions, velocities, and accelerations, we define the state as the set of past position measurements. However, the equation of motions derived by physical first principles cannot be directly applied in this framework, being functions of velocities and accelerations. For this reason, we introduce a novel derivative-free physically-inspired kernel, which can be easily combined with nonparametric derivative-free Gaussian Process models. Tests performed on two real platforms show that the considered state definition combined with the proposed model improves estimation performance and data-efficiency w.r.t. traditional models based on GPR. Finally, we validate the proposed framework by solving two RL control problems for two real robotic systems.




I Introduction

Reinforcement Learning (RL) has seen explosive growth in recent years. RL algorithms have been able to reach and exceed human-level performance in several benchmark problems, such as playing the games of chess, Go, and shogi [25]. Despite these remarkable results, the application of RL to real physical systems (e.g., robotic systems) is still a challenge, because of the large amount of experience required and the safety risks associated with random exploration.

To overcome these limitations, Model-Based RL (MBRL) techniques have been developed [2, 27, 11]. Providing an explicit or learned model of the physical system allows drastic decreases in the experience time required to converge to good solutions, while also reducing the risk of damage to the hardware during exploration and policy improvement.

Describing the evolution of physical systems is generally very challenging, and it is still an active area of research. Deriving models from first principles of physics might be very difficult, and could also introduce biases due to parameter uncertainties and unmodeled nonlinear effects. On the other hand, learning a model solely from data could be expensive, and generally suffers from insufficient generalization. Models based on Gaussian Process Regression (GPR) [19] have received considerable attention for model learning tasks in MBRL [2]. GPR allows merging prior physical information with data-driven knowledge, i.e., information inferred from analyzing the similarity between data, leading to so-called semi-parametric models [22, 21, 15].

Physical laws suggest that the state of a mechanical system can be described by the positions, velocities, and accelerations of its generalized coordinates. However, velocity and acceleration sensors are often not available, in particular in low-cost experimental setups. In such cases, velocities and accelerations are usually estimated by means of causal numerical differentiation of the positions, introducing a difference between the real and estimated signals. These signal distortions can be seen as an additional unknown input noise, which might significantly compromise the prediction accuracy of the learning algorithm. Indeed, standard GPR models do not account for noisy inputs. Several heteroscedastic GPR models have been proposed in the literature, see for example [28, 6, 13]. However, the proposed solutions might not be suitable for real-time applications, and most of the time they are more useful for improving the estimation of uncertainty than for improving the accuracy of prediction.

In this work, we propose a learning framework for model-based RL algorithms that does not need measurements of velocities and accelerations. Instead of representing the system state as a collection of positions, velocities, and accelerations, we propose to define the state as a finite past history of the position measurements. We call this representation derivative-free, to express the idea that the derivatives of position are not included in it.

The use of the past history of the state has been considered in the GP-NARX literature [13, 12, 3], as well as in the Eigensystem Realization Algorithm (ERA) and Dynamic Mode Decomposition (DMD) [10, 23]. However, these techniques do not take a derivative-free approach when dealing with physical systems: e.g., they consider the history of both positions and velocities, doubling the state dimension w.r.t. our approach (which might be a problem for MBRL), and they do not incorporate a prior physical model in the design of the covariance function. Derivative-free GPR models have already been introduced in [20], where the authors proposed derivative-free nonparametric kernels.

The proposed approach has some connections with discrete dynamics models, see for instance [17, 14]. In these works, the authors derived a discrete-time model of the dynamics of a manipulator by discretizing the Lagrangian equations. However, unlike our approach, these techniques assume complete knowledge of the dynamics parameters, typically identified in continuous time. Moreover, such models might not be sufficiently flexible to capture unmodeled behaviors like delays, backlash, and elasticity.

Contribution. The main contribution of the present work is the formulation of derivative-free GPR models capable of encoding physical prior knowledge of mechanical systems, knowledge that naturally depends on velocities and accelerations. We propose physically inspired derivative-free (PIDF) kernels, which provide better generalization properties than the nonparametric derivative-free kernel, and enable the design of semi-parametric derivative-free (SPDF) models.

The velocity and acceleration signals commonly approximated through numerical differentiation are statistics of the past raw position data that cannot be exact, in general. The proposed framework does not make these approximations, thus preserving richer information content in the inputs fed to the model. Moreover, by providing the GPR model with a sufficiently rich past history, we can capture possible higher-order unmodeled behaviors, like delays, backlash, and elasticity.

The proposed learning framework is tested on two real systems, a ball-and-beam platform and a Furuta pendulum. The experiments show that the proposed derivative-free learning framework significantly improves the estimation performance obtained by standard derivative-based models. The SPDF models are then used to solve RL-based trajectory optimization tasks: in both systems, we applied the control trajectory obtained by an iLQG [26] algorithm in an open-loop fashion. The obtained performance shows that the proposed framework accurately learns the dynamics of the two systems and is suitable for RL applications.

The paper is organized as follows. In Section II, we briefly introduce the standard learning framework adopted in model-based RL using GPR. Then, in Section III, we propose our derivative-free learning framework composed of the definition of the state and a novel derivative-free prior for GPR, based on the physical equations of motion. Finally, in the last two sections, we report the performed experiments.

II Model-Based Reinforcement Learning Using Gaussian Process Regression

In this section, we describe the standard model learning framework adopted in MBRL using GPR, and the trajectory optimization algorithm applied. An environment for RL is formally defined by a Markov Decision Process (MDP). Consider a discrete-time system

x_{k+1} = f(x_k, u_k),   (1)

subject to the Markov property, where x_k and u_k are the state vector and the input vector at the time instant k.

When considering a mechanical system with generalized coordinates q_k, the dynamics equations obtained through Rigid Body Dynamics, see [24], suggest that, in order to satisfy the Markov property, the state vector should consist of positions, velocities, and accelerations of the generalized coordinates, i.e., x_k = [q_k, q̇_k, q̈_k], or possibly of a subset of these variables, depending on the task.

Model-based RL algorithms derive the policy starting from f̂, an estimate of the system evolution f.

II-A Gaussian Process Regression

In some studies, GPR [19] has been used to learn f̂, see for instance [2]. Typically, the variables composing x_{k+1} are assumed to be conditionally independent given x_k and u_k, and each state dimension is modeled by a separate GPR. The components of f̂, denoted by f̂^i, are inferred and updated based on D = {X, y^i}, a data set of input-output noisy observations. Let N be the number of samples available, and define the set of GPR inputs as X = {x̃_1, ..., x̃_N}, where x̃_k = [x_k, u_k]. As regards the outputs y^i = [y^i_1, ..., y^i_N], two definitions have been proposed in the literature. In particular, y^i_k can be defined as x^i_{k+1}, the i-th component of the state at the next time instant, or as the difference y^i_k = x^i_{k+1} - x^i_k. In both cases, GPR models the observations as

y^i = f^i(X) + e,

where e is Gaussian i.i.d. noise with zero mean and covariance σ_n² I, and f^i(X) = [f^i(x̃_1), ..., f^i(x̃_N)]^T. The matrix K(X, X) is called the kernel matrix, and is defined through the kernel function k(·, ·), i.e., the entry in position (k, j) is equal to k(x̃_k, x̃_j). In GPR, the crucial aspect is the selection of the prior for f^i, defined by k(·, ·): f^i is usually assumed to be a zero-mean Gaussian process with covariance K(X, X). Then, see [19], the maximum a posteriori estimator is

f̂^i(x̃) = k(x̃, X) (K(X, X) + σ_n² I)^{-1} y^i.   (2)
In the following, we will refer to f̂ and k(·, ·) as one of the components f̂^i and the corresponding kernel function.
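For concreteness, the estimator (2) can be sketched numerically. The following is a minimal illustration with our own function names, using the standard Cholesky-based solve rather than an explicit matrix inverse; it is not the authors' implementation:

```python
import numpy as np

def gp_posterior_mean(K, y, sigma2, k_star):
    """MAP estimate f_hat(x*) = k(x*, X) (K + sigma2 I)^{-1} y.

    K: N x N kernel matrix, y: N noisy outputs, k_star: vector k(x*, X).
    """
    n = K.shape[0]
    # Cholesky factorization of the regularized kernel matrix for stability
    L = np.linalg.cholesky(K + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(k_star @ alpha)
```

The vector alpha depends only on the training data, so it can be precomputed once and reused for every prediction.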

Physically inspired kernels. When the physical model of the system is available, the model information might be used to identify a feature space over which the evolution of the system is linear. More precisely, assume that the model can be written in the form f^i(x̃) = φ(x̃)^T w, where φ(·) is a known nonlinear function that maps the GPR input vector onto the physically inspired feature space, and w is the vector of unknown parameters, modeled as a zero-mean Gaussian random variable, i.e., w ∼ N(0, Σ_w). The expression of the physically inspired kernel (PI) is

k_PI(x̃_k, x̃_j) = φ(x̃_k)^T Σ_w φ(x̃_j),   (3)

namely, a linear kernel in the features φ. For later convenience, we define also the homogeneous polynomial kernel of degree p in φ, which is a more general case of (3):

k_p(x̃_k, x̃_j) = (φ(x̃_k)^T Σ_w φ(x̃_j))^p.
Nonparametric kernel. When a physical model is not available, the kernel has to be chosen by the user according to their understanding of the process to be modeled [19]. A common option is the Radial Basis Function kernel (RBF):

k_RBF(x̃_k, x̃_j) = λ exp(-(1/2) ||x̃_k - x̃_j||²_Σ),   (4)

where λ is a positive constant called the scaling factor, and Σ is a positive definite matrix that defines the norm over which the distance between x̃_k and x̃_j is computed, i.e., ||x̃_k - x̃_j||²_Σ = (x̃_k - x̃_j)^T Σ (x̃_k - x̃_j). Several options to parameterize Σ have been proposed, e.g., a diagonal matrix or a full matrix defined by the Cholesky decomposition, namely, Σ = L L^T, see [19, Chp. 5], [5, Sec. 4.1].
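A sketch of (4) with the full-metric parameterization might look as follows (hypothetical helper names, not from the paper's code); factoring Σ = L L^T keeps the metric positive semidefinite for any lower-triangular L during hyperparameter optimization:

```python
import numpy as np

def rbf_kernel(x1, x2, scale, L):
    """RBF kernel with metric Sigma = L @ L.T (Cholesky parameterization).

    scale is the scaling factor lambda; L is a lower-triangular factor.
    """
    d = np.asarray(x1, float) - np.asarray(x2, float)
    sigma = L @ L.T
    return scale * np.exp(-0.5 * d @ sigma @ d)
```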

Semiparametric kernel. This approach combines the physically inspired and the nonparametric kernels. Here we define the kernel function as the sum of the two covariances:

k_SP(x̃_k, x̃_j) = k_PI(x̃_k, x̃_j) + k_NP(x̃_k, x̃_j),   (5)

where k_NP can be, for example, the RBF kernel (4).

II-B Trajectory Optimization using iLQG

The iLQG algorithm is a popular technique for trajectory optimization [26]. Given discrete time dynamics such as (1) and a cost function, the algorithm computes locally linear models and quadratic cost functions for the system along a trajectory. These linear models are then used to compute optimal control inputs and local gain matrices by iteratively solving the associated LQG problem, see [26].
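The core of each iLQG iteration is a Riccati backward pass over the locally linear models. A bare-bones, time-varying LQR sketch of that pass (our own names; full iLQG [26] adds regularization, a line search, and cost linearization around the nominal trajectory) is:

```python
import numpy as np

def lqr_backward_pass(A_seq, B_seq, Q, R, Qf):
    """Riccati recursion along a trajectory of linearized models (A_t, B_t).

    Returns time-varying feedback gains K_t such that u_t = -K_t x_t
    minimizes sum_t (x'Qx + u'Ru) with terminal cost x'Qf x.
    """
    P = Qf
    gains = []
    for A, B in zip(reversed(A_seq), reversed(B_seq)):
        # K_t = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]
```

In iLQG, these gains are recomputed around the current nominal trajectory of the learned model and used to update the control sequence, iterating until convergence.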

III Derivative-Free Framework for Reinforcement Learning Algorithms

A novel learning framework to model the evolution of a physical system is proposed, which addresses several limitations of the standard modelling approach described in Sec. II.
Numerical differentiation. The Rigid Body Dynamics of any physical system are functions of joint positions, velocities, and accelerations. However, a common issue is that often joint velocities and accelerations cannot be measured. Computing them by means of causal numerical differentiation starting from the (possibly noisy) measurements of the joint positions might introduce considerable delays and distortions of the estimated signals. This fact could severely hamper the final solution. This is a very well known and often discussed problem, see, e.g., [24, 8, 16].
Conditional Independence. The assumption of conditional independence among the components of x_{k+1} given x_k and u_k in (1) might be a very imprecise approximation of the real system's behavior, in particular when the outputs considered are the position, velocity, or acceleration of the same variable, which are correlated by nature. This has been shown to hurt estimation performance in [21], where the authors proposed to learn the acceleration function and integrate it forward in time in order to estimate position and velocity. Moreover, under this assumption, a separate GP needs to be estimated for each output, i.e., for variables that are intrinsically correlated, leading to redundant modeling design and testing work, and a waste of computational resources and time. This last aspect might be particularly relevant when considering systems with a considerable number of DoF.
Delays and nonlinearities. Finally, physical systems are often affected by intrinsic delays and nonlinear effects that have an impact on the system over several time instants, contradicting the first-order Markov assumption; an instance of such behavior is reported in section V-B.

III-A Derivative-Free State Definition

To overcome the aforementioned limitations, we define the system state¹ in a derivative-free fashion, considering as state elements the history of the last M position measurements:

x_k = [q_k, q_{k-1}, ..., q_{k-M+1}].   (6)

¹ The exact state of a physical system is usually unknown, but generally accepted to be given by position, velocity, and acceleration, according to physical first principles. With a slight abuse of notation, we refer to our derivative-free representation as the state variable.

The simple yet effective idea behind this definition is that, when velocity and acceleration measurements are not available, if M is chosen sufficiently large, then the history of the positions contains all the system information available at time k, leaving to the model-learning algorithm the possibility of estimating the state transition function. Indeed, velocities and accelerations computed through causal numerical differentiation are the outputs of digital filters with finite impulse response (or with knowledge of a finite number of past instants, for nonlinear filters), which represent a statistic of the past raw position data. Notice that these statistics cannot be exact in general, leading to a loss of information that is instead kept in the proposed derivative-free framework.

With this definition, the state transition function becomes deterministic and known (i.e., a simple shift) for all the past-position components of the state. Consequently, the problem of learning the evolution of the system is restricted to learning only the functions predicting the next positions, reducing the number of models to learn and avoiding erroneous conditional independence assumptions. Finally, the MDP has a state information rich enough to be robust to intrinsic delays and to obey the first-order Markov property.
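Maintaining such a state from streaming position measurements can be sketched with a small buffer class (illustrative, the names are ours):

```python
import numpy as np
from collections import deque

class DerivativeFreeState:
    """Keep x_k = [q_k, q_{k-1}, ..., q_{k-M+1}] from raw position measurements.

    No velocities or accelerations are computed: the raw history itself is the
    state, and the regression model extracts the relevant statistics from it.
    """
    def __init__(self, M, q0):
        q0 = np.atleast_1d(np.asarray(q0, float))
        self.buf = deque([q0] * M, maxlen=M)  # newest measurement first

    def update(self, q):
        self.buf.appendleft(np.atleast_1d(np.asarray(q, float)))
        return self.state()

    def state(self):
        return np.concatenate(list(self.buf))
```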

III-B State Transition Learning with PIDF Kernel

Derivative-free GPRs have already been introduced in [20], where the authors derived a data-driven derivative-free GPR. As pointed out in the introduction, the generalization performance of purely data-driven models might not be sufficient to guarantee robust learning performance, and exploiting any prior information coming from the physical model is crucial. To address this problem, we propose a novel Physically Inspired Derivative-Free (PIDF) kernel.

The PIDF kernel exploits the property that sums and products of kernels are still kernels, see [19]. Denote by q̄_k the history of the positions in x_k, and assume that a physical model linear in the parameters, as in Section II-A, is known, with features that are functions of positions, velocities, and accelerations. Then, we propose a set of guidelines to derive a PIDF kernel starting from these features:

PIDF Kernel Guidelines

  1. Each and every position, velocity, or acceleration term in the physical model is replaced by a distinct polynomial kernel of degree p, where p is the degree of the original term; e.g., a squared velocity q̇² is replaced by a polynomial kernel of degree 2.

  2. The input of each of the kernels in 1) is q̄_k, the history of the position corresponding to the independent variable of the substituted term; e.g., the kernel replacing q̇² takes q̄_k as input.

  3. If a state variable appears in the model transformed by a function g(·), the input to the kernel becomes the input defined at point 2) transformed by the same function g(·); e.g., a term in sin(q) yields a kernel with input sin(q̄_k), applied elementwise.

Applying these guidelines generates a kernel function k_PIDF(·, ·), which incorporates the information given by the physics without requiring velocity and acceleration measurements.
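As an illustration of the guidelines, consider a hypothetical pendulum model τ = a q̈ + b q̇ + c sin(q): each degree-1 term maps to a degree-1 polynomial kernel on the position history, with sin applied to the inputs of the gravity term. This is a sketch with our own names; in practice each summand carries its own hyperparameters, omitted here:

```python
import numpy as np

def poly_kernel(x1, x2, p):
    # homogeneous polynomial kernel of degree p
    return float(x1 @ x2) ** p

def pidf_kernel_pendulum(h1, h2):
    """PIDF kernel sketch for tau = a*qddot + b*qdot + c*sin(q).

    h1, h2 are position histories [q_k, ..., q_{k-M+1}]. Each degree-1 term is
    replaced by a degree-1 polynomial kernel on the history (guidelines 1-2);
    the sin(q) term uses the elementwise-transformed history (guideline 3).
    """
    k_acc = poly_kernel(h1, h2, 1)                    # qddot term
    k_vel = poly_kernel(h1, h2, 1)                    # qdot term
    k_grav = poly_kernel(np.sin(h1), np.sin(h2), 1)   # sin(q) term
    return k_acc + k_vel + k_grav
```

Since sums and products of kernels are kernels, the result is a valid covariance function.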

The extension to semiparametric derivative-free (SPDF) kernels is then straightforward. Combining, as described in Section II-A, the proposed k_PIDF with a derivative-free nonparametric kernel k_NPDF (e.g., an RBF with the position histories as inputs, or the kernel proposed in [20]), we obtain:

k_SPDF(·, ·) = k_PIDF(·, ·) + k_NPDF(·, ·).   (7)
These guidelines formalize the solution to the non-trivial issue of modeling real systems using physical models without measuring velocity and acceleration. Although the guidelines might not be the only possible solution, they represent a procedure with no ambiguity or arbitrary choices left to the user to convert Rigid Body Dynamics into derivative-free models.

In the next sections, we apply the proposed learning framework to the benchmark systems Ball-and-Beam (BB) and Furuta Pendulum (FP), describing in detail the kernel derivations. For both setups we show the task of controlling the system, highlighting the advantages of the proposed derivative-free framework; due to space limitations, however, we present different properties of the proposed method in each of them. In the BB case, we highlight the estimation performance of the derivative-free model over models whose velocities are computed with several filters, and the difficulty of choosing the most suitable velocity estimator. In the more complex FP system, we analyze robustness to delays, performance at step-ahead prediction, and make extensive comparisons among physically inspired, nonparametric, semiparametric derivative-free, and standard GPR models.

IV Ball-and-Beam Platform

Fig. 1: In-house built Ball-and-Beam experimental setup.

Fig. 1 shows our experimental setup for the BB system [9]. An aluminum bar is attached to a tip-tilt table (platform) constrained to have 1 degree of freedom (DoF). The platform is actuated by an inexpensive, commercial off-the-shelf HiTec HS-805BB RC servo motor that provides open-loop positioning; the platform angle is measured by an accurate absolute encoder. There is no tachometer attached to the axis, so the angular velocity is not directly measurable. A ball rolls freely in the groove. We use an RGB camera attached to a fixed frame to measure the ball's position; the ball is tracked in real time using a simple, yet fast, blob-tracking algorithm. All the communication with the camera and the servo motors driving the system is handled using ROS.


Let θ and p be the beam angle and the ball position, respectively, considered in a reference frame with origin at the beam center and oriented such that the beam end is positive. The forward dynamics of the ball are expressed by the following equation (see [7] for the details)

(m + I_b / r²) p̈ = m p θ̇² − m g sin(θ) − b ṗ,   (8)

where m, I_b, r, and b are the ball mass, inertia, radius, and friction coefficient, respectively. Starting from Eq. (8), the forward function for p is derived by integrating twice forward in time, assuming a constant acceleration between two consecutive time instants:

p_{k+1} = 2 p_k − p_{k−1} + T_s² p̈_k,   (9)

where T_s is the sampling time. In order to describe the BB system in the framework proposed in Section III, we define the derivative-free state through the histories of the last M ball positions and beam angles, x_k = [p̄_k, θ̄_k].
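The double-integration step can be sketched as a small function (our own naming): under a constant acceleration over the sampling interval, integrating twice yields the update p_{k+1} = 2 p_k − p_{k−1} + T_s² p̈_k, where the velocity is implicitly encoded by the difference of the last two positions.

```python
def next_position(p_k, p_km1, pddot, Ts):
    """Advance the ball position one step using only past positions.

    p_k, p_km1: current and previous positions; pddot: acceleration assumed
    constant over the interval; Ts: sampling time.
    """
    return 2.0 * p_k - p_km1 + Ts**2 * pddot
```

On a trajectory with exactly constant acceleration, this update is exact, which is what makes the purely position-based state sufficient for the forward model.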

Applying the guidelines defined in Section III-B to Eq. (9), the PIDF kernel obtained is


IV-A Prediction Performance

The purpose of this section is to compare the prediction performance of the GP models (2), using as prior the PIDF kernel (10) and, alternatively, the standard PI kernel obtained by applying Eq. (8) to Eq. (3). The question the standard approach raises is how to compute the velocities from the position measurements, and there is not a unique answer. We experimented with some common filters using different gains in order to find good velocity approximations:

  • Standard numerical differentiation followed by a low-pass filter to reject noise, which uses the position history. We considered three different cutoff frequencies, with one estimator per cutoff;

  • Kalman filtering, with three different process covariances, again with one estimator per setting;

  • The acausal Savitzky-Golay filter.

Acausal filters have been introduced only to provide an upper bound on prediction performance, as they cannot be applied in real-time applications. As regards the number of past time instants M considered in the derivative-free state, a fixed value was set. Both the training and the test dataset consist of 3 minutes of operation on the BB system, with control actions applied at 30 Hz, while measurements from the encoder and the camera were recorded; the two datasets hence contain the same number of samples. The control actions were generated as a sum of 10 sine waves with randomly sampled frequencies, phase shifts, and amplitudes.
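To make the noise-versus-delay trade-off concrete, a minimal causal differentiator (first differences followed by a first-order low-pass; our own sketch, not one of the filters benchmarked above) looks like:

```python
import numpy as np

def causal_velocity(q, Ts, alpha):
    """Estimate velocity from positions q sampled every Ts seconds.

    alpha in (0, 1] is the low-pass gain: alpha = 1 means no smoothing
    (noisy, no delay); smaller alpha smooths more but delays the estimate.
    """
    v_raw = np.diff(q, prepend=q[0]) / Ts  # first differences (zero at k = 0)
    v = np.empty_like(v_raw)
    acc = 0.0
    for i, x in enumerate(v_raw):
        acc = alpha * x + (1.0 - alpha) * acc  # exponential smoothing
        v[i] = acc
    return v
```

Even on a clean constant-velocity ramp, any alpha < 1 lags the true velocity for several samples, which is exactly the distortion the derivative-free state avoids.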

Fig. 2: Comparison of the prediction errors obtained in the test set with physically inspired estimators, together with a detailed plot of the evolution of the velocity computed by means of numerical differentiation and a Kalman filter.

In Fig. 2, we visualize the distribution of the absolute estimation errors in the test set through boxplots, together with the numerical values of the RMSE. Acausal filtering guarantees the best performance, whereas, among the estimators with causal inputs, the proposed approach performs best. Indeed, the RMSE obtained with the derivative-free estimator is noticeably smaller than the best RMSE obtained with the other causal estimators. As visible from the boxplots, the proposed solution also exhibits a smaller variability. Results obtained with numerical differentiation and Kalman filtering show that the technique used to compute velocities can affect prediction performance significantly. In Fig. 2, we also present a detailed plot of the velocity evolution obtained with different differentiation techniques. As expected, there is a trade-off between noise rejection and the delay introduced: for instance, increasing the cutoff frequency decreases the delay, but at the same time impairs the rejection of noise. An inspection of the prediction errors shows that cutoff frequencies that are too high or too low lead to the worst prediction performance. With our proposed approach, this tuning is not required, since the filtering coefficients are learned automatically during the GPR training.

IV-B Ball-and-Beam Control

The control task is the stabilization of the ball with zero velocity at a target position along the beam. The control trajectory is computed using the iLQG algorithm introduced in Section II-B. In order to model also the behaviors not captured by the physical equations of motion, we train a GP with a semiparametric kernel as in Eq. (7), where the NP kernel is an RBF with the matrix Σ parameterized through the Cholesky decomposition. The training data are the same described in Section IV-A. The control trajectory obtained by iLQG using this semiparametric model is applied to the physical system, and the performance is shown in Fig. 3.

Fig. 3: Top plot: comparison of the ball trajectory on the real system with the optimal trajectory computed by iLQG on the learned model. Bottom plot: comparison of the control signals computed by iLQG using the semiparametric and the physically inspired models.

In the top plot, we can observe how the trajectory optimized for the semiparametric model remains close to the ball trajectory of the real system for all the 100 steps (3.3 s), which is the chosen length of the iLQG trajectory. This result illustrates the high accuracy of the model in estimating the future evolution of the real system. Note that the control trajectory is implemented in open loop to highlight the model precision, obtaining an average deviation between the target and the final ball position of 9 mm, with a standard deviation of 5 mm over 10 runs. By adding a small proportional feedback control, the error becomes almost null. In the bottom plot, the control trajectories obtained by iLQG using either the semiparametric or the physically inspired model are shown. Two major observations can be made: the trajectory obtained with the semiparametric model approximates a bang-bang trajectory, which would be optimal for a linear system; the trajectory obtained with the physically inspired model is similar, but, since the equations of motion cannot describe all the nonlinear effects present in a real system, the control action has a final bias that makes the ball drift away from the target position.

V Furuta Pendulum: Derivative-Free Modeling and Control

The second physical system considered is the Furuta pendulum [4], a popular benchmark system in control theory. A schematic of the FP with its parameters and variables is shown in Fig. 4. We refer to "Arm-1" and "Arm-2" in Fig. 4 as the base arm and the pendulum arm, respectively, and denote their angles accordingly.



Fig. 4: A schematic diagram of the FP with the various system parameters and state variables. Arm-i, with i = 1, 2, has length l_i, mass m_i, inertia I_i, and center of mass at distance l_{ci} along the arm.

In [1], the authors presented a model of the FP. Based on that model, we obtained the expression of the pendulum arm acceleration as a linear function w.r.t. a vector of dynamics parameters, with basis functions depending on the angles of the two arms and their derivatives.

The FP considered in this work has several characteristics that differ from those typically studied in the literature. Indeed, in our FP (see Fig. 5), the base arm is held by a gripper which is rotated by the wrist joint of a robotic arm (a MELFA RV-4FL). For this reason, the rotation applied to the wrist joint is distinguished from the actual base arm angle (see Fig. 4). The control cycle of the robot is fixed, and communication with the robot and the pendulum encoder is handled by ROS.

These circumstances have several consequences. First, the robot can only be controlled in position-control mode, and we need to design a trajectory of set points considering that the manufacturer limits the maximum angular displacement of any robot joint within a control period. This constraint, together with the high performance of the robot controller, results in a quasi-deterministic evolution of the wrist angle, which we identified from data. Therefore, the forward dynamics learning problem is restricted to modeling the pendulum arm dynamics. Additionally, the 3D-printed gripper causes a significant interplay with the FP base link, due to the presence of elasticity and backlash. These facts lead to vibrations of the base arm along with the rotational motion, and to a significant delay in the actuation of the pendulum arm.

Fig. 5: The Furuta Pendulum (3D printed in green) held in the gripper at the swing-up position reached using the learned controller.

V-A Delay and Nonlinear Effects

In order to demonstrate the presence of delays in the system dynamics, we report a simple experiment in which a triangular wave in the wrist angle excites the system. The results are shown in Fig. 6 (for lack of space, the viscous friction term is not reported, as its effects are not significant). The evolution of the pendulum acceleration is characterized by a main low-frequency component, with two evident peaks at the beginning of the trajectory, and by a higher-frequency component that corrupts the main component more and more as time passes. Several insights can be obtained from these results. First, the peaks of the low-frequency component can be caused only by the contribution of the base arm motion, given that the other contributions do not exhibit these behaviors so prominently. Second, the lag between the peaks of that contribution and of the measured signal (highlighted in the figure by the vertical dashed lines) represents the delay between the control signal and its effect on the pendulum arm. Third, the high-frequency component might represent the noise generated by the vibration of the gripper, the elasticity of the base arm, and all the nonlinear effects given by the flexibility of the gripper.

Fig. 6: Evolution of the pendulum acceleration and its model-based basis functions. Derivatives are computed using the acausal Savitzky-Golay filter.

V-B FP Derivative-Free GPR Models

We used the derivative-free framework to learn a model for the evolution of the pendulum arm. The FP state vector is defined through the histories of the last M wrist and pendulum angles. From the pendulum model, following the same procedure applied in the BB application to derive Eq. (9), we obtain the discrete-time forward function for the pendulum angle. Applying the guidelines in Section III-B, we obtain the corresponding PIDF kernel


In order to model also the complex behaviors shown in Section V-A, we define a semiparametric kernel for the FP as the sum of the PIDF kernel and an NP kernel, where the NP kernel is defined as the product of two RBFs, with their matrices independently parameterized through the Cholesky decomposition. Adopting a full covariance matrix, the RBF can learn convenient transformations of the inputs, increasing the generalization ability of the predictor.

As experimentally verified in Section V-A, the evolution of the pendulum arm is characterized by a significant delay w.r.t. the dynamics of the wrist. As a consequence, positions, velocities, and accelerations at a single time instant are not sufficient to describe the FP dynamics. However, defining the state as the collection of past measured positions, and setting M properly, the GPR has a sufficiently informative input vector, and can select inputs at the proper time instants, thus inferring the system delay from the data. Note that, when considering also velocities and accelerations, a similar approach would require a considerably larger state dimension.

V-C Prediction Performance

In this section, we test the accuracy of different predictors:

  • a standard NP estimator defined as in (2), with an RBF kernel with diagonal covariance, whose input is given by the positions and their derivatives, i.e., all the positions, velocities, and accelerations over the considered history;

  • the NPDF estimator defined as in (2), with an RBF kernel on the position histories;

  • the PIDF estimator defined as in (2), with the kernel defined in (13);

  • the SPDF estimator defined as in (2), with the kernel defined in (14).

The first model is considered in order to provide the performance of a standard NP estimator based on positions and their derivatives.

Fig. 7: Top: evolution of the RMSE as a function of the experience seen by the model. Bottom: comparison of the RMSE obtained by training the estimators with all the data. In both plots, a log scale is used.

The estimators have been trained by minimizing the negative marginal log-likelihood (NMLL) over a recorded training data set. The input signal is a sum of sinusoids with random angular velocities. To deal with the considerable number of samples available, we rely on stochastic gradient descent to optimize the NMLL. Performance is measured on a test data set obtained with an input signal of the same type as the one used in training, but with a different distribution of the sinusoid frequencies, to show generalization ability. Estimators are compared both in terms of accuracy and data efficiency, and the results are in Fig. 7. In the bottom graph, we report the Root Mean Squared Error (RMSE) on the test set: all the estimators considered are able to predict the evolution of the pendulum arm with an error smaller than one degree. However, the derivative-free approaches outperform the standard estimator. Note that the SPDF estimator achieves the best performance, and that the NPDF estimator outperforms the standard NP estimator, despite both models being based on an RBF kernel; the latter fact confirms that numerical computation of the derivatives might reduce estimation accuracy. In the top graph, we report the evolution of the RMSE as a function of the seconds of training samples available. The derivative-free approaches are more accurate and data-efficient than the standard estimator, which is more accurate only in the very first seconds, after which its RMSE decreases more slowly. The use of the PI kernel is particularly helpful as regards data efficiency: after a few seconds of data, the PIDF estimator is more accurate than the purely nonparametric ones, and the RMSE of the SPDF estimator decreases faster than that of the NPDF estimator.
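Hyperparameter training amounts to minimizing the NMLL; a minimal numpy evaluation for an RBF kernel with a scalar lengthscale can be sketched as follows (our own names; the paper optimizes this objective with stochastic gradient descent, while here only the closed-form evaluation is shown):

```python
import numpy as np

def nmll(X, y, ell, sf2, sn2):
    """Negative marginal log-likelihood of a GP with an RBF kernel.

    ell: lengthscale, sf2: signal variance, sn2: noise variance.
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = sf2 * np.exp(-0.5 * d2 / ell**2) + sn2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # data-fit term + complexity (log-determinant) term + constant
    return float(0.5 * y @ alpha + np.log(np.diag(L)).sum()
                 + 0.5 * len(y) * np.log(2.0 * np.pi))
```

For small problems, a grid search over the hyperparameters of this objective already works; gradient-based (or stochastic gradient) optimization scales to larger data sets.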

V-D Rollout performance

In this section, we characterize the rollout accuracy of the derived models, namely the estimation performance in multi-step-ahead prediction.
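The rollout procedure can be sketched as a recursive one-step predictor that feeds its own predictions back into the derivative-free state; the `model` interface and the state layout below are hypothetical, for illustration only:

```python
import numpy as np

def rollout(model, q_hist, controls):
    """Recursive multi-step prediction with a derivative-free state.

    model(x) -> next position, where x stacks the k most recent positions
    (newest first) and the current control; hypothetical interface.
    """
    q_hist = list(q_hist)
    traj = []
    for u in controls:
        x = np.concatenate([q_hist, [u]])
        q_next = model(x)                    # one-step-ahead mean prediction
        traj.append(q_next)
        q_hist = [q_next] + q_hist[:-1]      # feed the prediction back as state
    return np.array(traj)

# Toy check: a "model" that copies the newest position keeps the state constant.
const = rollout(lambda x: x[0], [0.5, 0.4, 0.3], np.zeros(10))
print(const)
```

Because each prediction is reused as input, one-step errors compound over the horizon, which is exactly what the per-step error statistics in this section quantify.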

Fig. 8: Evolution of the error statistic with its confidence intervals.

For each model, we performed several rollouts. For each rollout, we randomly pick an initial instant, select the corresponding input location in the test set as initial input, and perform predictions over a fixed window of steps. For each simulation, the error is computed by subtracting the predicted trajectory from the realized one. To characterize how uncertainty evolves over time, we define the error statistic as the RMSE of the prediction at the k-th step ahead, computed across rollouts. The confidence intervals are computed assuming i.i.d. and normally distributed errors; under this assumption, each error statistic has a known distribution from which the intervals follow. The performance of the NP, PI, and SP models is reported in Fig. 8. In the initial phase, the NP error is lower than the PI error, whereas at longer horizons the NP error becomes the greater of the two. This suggests that the NP model behaves well for short-interval prediction, whereas the PI model is more suitable for long-term predictions. The SP approach combines the advantages of these two models: the evolution of its error statistic confirms this, showing that SP outperforms both NP and PI.
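A minimal sketch of the per-step error statistic and its confidence intervals, assuming, as in the text, i.i.d. zero-mean Gaussian errors at each step; the array shapes, the number of rollouts, and the chi-square construction of the intervals are our assumptions for illustration:

```python
import numpy as np
from scipy.stats import chi2

def rollout_rmse(errors, alpha=0.05):
    """Per-step RMSE across rollouts with chi-square confidence intervals.

    errors: (M, T) array, errors[m, k] = prediction error of rollout m at
    step k, assumed i.i.d. zero-mean Gaussian at each step.
    """
    M = errors.shape[0]
    rmse = np.sqrt((errors**2).mean(axis=0))   # M * rmse^2 / sigma^2 ~ chi2(M)
    lo = rmse * np.sqrt(M / chi2.ppf(1 - alpha / 2, M))
    hi = rmse * np.sqrt(M / chi2.ppf(alpha / 2, M))
    return rmse, lo, hi

rng = np.random.default_rng(0)
# Synthetic errors whose std grows with the prediction horizon.
e = rng.standard_normal((50, 200)) * np.linspace(0.01, 0.2, 200)
rmse, lo, hi = rollout_rmse(e)
print(rmse[0], rmse[-1])
```

The growing per-step RMSE mirrors the compounding of one-step errors over the prediction horizon discussed above.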

V-E Furuta Pendulum control

The learned model is used to design a controller to swing up the FP using the iLQG algorithm described in Section II-B. The model is accurate to the point that the trajectories obtained by the iLQG algorithm were applied in an open-loop fashion on the real system; the results are shown in Fig. 9. The FP swings up with near-zero velocity at the goal position; however, as expected, an open-loop control sequence cannot stabilize it. Fig. 9 reports the agreement between the trajectories obtained under the iLQG control sequence on both the learned model and the real system. The comparison shows the long-horizon predictive accuracy of the learned model. Note that the model loses accuracy around the unstable equilibrium point because of insufficient data, which are harder to collect in this area during training.

Fig. 9: Performance of the iLQG trajectory on the FP swing-up control.

VI Conclusions

In this paper, we presented a derivative-free learning framework for model-based RL, and we defined a novel physically-inspired derivative-free kernel. Experiments on two real robotic systems show that the proposed learning framework outperforms its derivative-based GPR counterpart in prediction accuracy, and that semi-parametric derivative-free models are accurate enough to solve model-based RL control problems in real-world applications. The proposed framework also exhibits robustness to delays and an ability to deal with partially observable systems, which can be further investigated in future work.


  • [1] B. S. Cazzolato and Z. Prime (2011) On the dynamics of the furuta pendulum. Journal of Control Science and Engineering. Cited by: §V.
  • [2] M. P. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In ICML, Cited by: §I, §I, §II-A.
  • [3] A. Doerr, C. Daniel, D. Nguyen-Tuong, A. Marco, S. Schaal, T. Marc, and S. Trimpe (2017) Optimizing long-term predictions for model-based policy search. In Conference on Robot Learning, pp. 227–238. Cited by: §I.
  • [4] K. Furuta, M. Yamakita, S. Kobayashi, and M. Nishimura (1992) A new inverted pendulum apparatus for education. In Advances in Control Education 1991, pp. 133–138. Cited by: §V.
  • [5] F. Girosi, M. Jones, and T. Poggio (1995) Regularization theory and neural networks architectures. Neural Computation. Cited by: §II-A.
  • [6] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop (1998) Regression with input-dependent noise: a gaussian process treatment. In NIPS 10, pp. 493–499. Cited by: §I.
  • [7] J. Hauser, S. Sastry, and P. Kokotovic (1992-03) Nonlinear control via approximate input-output linearization: the ball and beam example. IEEE Transactions on Automatic Control 37, pp. 392–398. External Links: Document, ISSN 0018-9286 Cited by: §IV.
  • [8] J. Hollerbach, W. Khalil, and M. Gautier (2008) Model identification. In Springer Handbook of Robotics, pp. 321–344. Cited by: §III.
  • [9] D. K. Jha, D. Nikovski, W. Yerazunis, and A. Farahmand (2017-11) Learning to regulate rolling ball motion. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Vol. , pp. 1–6. Cited by: §IV.
  • [10] J. N. Juang and R. Pappa (1985-11) An eigensystem realization algorithm for modal parameter identification and model reduction. Journal of Guidance, Control, and Dynamics 8. External Links: Document Cited by: §I.
  • [11] S. Levine and P. Abbeel (2014) Learning neural network policies with guided policy search under unknown dynamics. In NIPS 27, Cited by: §I.
  • [12] C. L. C. Mattos, A. Damianou, G. A. Barreto, and N. D. Lawrence (2016) Latent autoregressive gaussian processes models for robust system identification. IFAC-PapersOnLine. Cited by: §I.
  • [13] A. McHutchon and C. E. Rasmussen (2011) Gaussian process training with input noise. In Advances in Neural Information Processing Systems, Cited by: §I, §I.
  • [14] C. P. Neuman and V. D. Tourassis (1985) Discrete dynamic robot models. IEEE Transactions on Systems, Man, and Cybernetics. Cited by: §I.
  • [15] D. Nguyen-Tuong and J. Peters (2010) Using model knowledge for learning inverse dynamics. In ICRA, Cited by: §I.
  • [16] D. Nguyen-Tuong and J. Peters (2011) Model learning for robot control: a survey. Cognitive Processing 12 (4), pp. 319–340. Cited by: §III.
  • [17] S. Nicosia, P. Tomei, and A. Tornambè (1989-12-01) Discrete-time modeling and control of robotic manipulators. Journal of Intelligent and Robotic Systems 2 (4), pp. 411–423. Cited by: §I.
  • [18] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng (2009) ROS: an open-source robot operating system. In ICRA workshop on open source software, Cited by: §IV.
  • [19] C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning. The MIT Press. Cited by: §I, §II-A, §II-A, §III-B.
  • [20] D. Romeres, M. Zorzi, R. Camoriano, S. Traversaro, and A. Chiuso (2019) Derivative-free online learning of inverse dynamics models. IEEE Transactions on Control Systems Technology. Cited by: §I, §III-B, §III-B.
  • [21] D. Romeres, D. K. Jha, A. DallaLibera, B. Yerazunis, and D. Nikovski (2019) Semiparametrical gaussian processes learning of forward dynamical models for navigating in a circular maze. In International Conference on Robotics and Automation (ICRA), Cited by: §I, §III.
  • [22] D. Romeres, M. Zorzi, R. Camoriano, and A. Chiuso (2016) Online semi-parametric learning for inverse dynamics modeling. In IEEE CDC, Cited by: §I.
  • [23] P. Schmid and J. Sesterhenn (2008) Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics. Cited by: §I.
  • [24] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo (2010) Robotics: modeling, planning and control. Springer Science&Business Media. Cited by: §II, §III.
  • [25] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science. External Links: Document, ISSN 0036-8075 Cited by: §I.
  • [26] Y. Tassa, T. Erez, and E. Todorov (2012) Synthesis and stabilization of complex behaviors through online trajectory optimization. In IROS, pp. 4906–4913. Cited by: §I, §II-B.
  • [27] E. Todorov and W. Li (2005) A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC., pp. 300–306. Cited by: §I.
  • [28] C. Wang and R. M. Neal (2012) Gaussian process regression with heteroscedastic or non-gaussian residuals. CoRR abs/1212. Cited by: §I.