1 Introduction
Learning dynamical systems has received considerable attention over the last decades and is widely recognized as an important and difficult problem (schon2011system). Indeed, in the case of physical systems, sampling data often requires practically involved and time-consuming experiments. Further, sampling at informative locations of the state space is challenging, since the system is constrained by the underlying dynamics. This is one key difference to many machine learning tasks, where data can be collected anywhere. Hence, it is essential to excite the system in such a way that the generated data enables sample-efficient learning. In the case of linear time-invariant (LTI) systems, there exists a rich body of theoretical results for this problem (ljung2001system). Nonetheless, it is still an active field of research, e.g., see (simchowitz2018learning) and references therein. Convergence results are usually tightly connected to the well-established theory of persistence of excitation (Green_persistent_excitation_lin_sys), which ensures that control inputs excite the system sufficiently. However, these control inputs are not necessarily optimal, and targeted exploration can accelerate learning (umenberger2019robust). Learning nonlinear systems is even more complex, although there has been much progress over the years (see e.g., (schon2011system; Schoukens_survey_nonlin_sysid)). We consider Gaussian process (GP) regression, which has proven to be an efficient framework in many related applications, including model learning (Nguyen-Tuong_survey_model_learning_robot_control; Deisenroth_PILCO_presentation; Doerr_optimize_long-term_predictions_model-based_policy_search). These probabilistic models have many advantageous properties for learning dynamical systems (Morari_learning_control_GP; Deisenroth_identify_GP_state_space_models; Doerr_probabilistic_recurrent_state-space_models), such as taking uncertainty into account, coping with small datasets, and incorporating prior knowledge.
Active learning, i.e., sequentially choosing where to sample in order to build an informative dataset, has been investigated in many domains (see (Charu_active_learning_survey) for an overview). A critical difference that sets the active learning problem for dynamical systems apart is the fact that it is not possible to arbitrarily sample the state-action space. Indeed, the system is constrained by the dynamics, and has to be excited appropriately by control inputs. Existing approaches for actively learning static maps thrive by incorporating information-theoretic criteria that guide the sampling procedure. In particular, the combination with GPs yields powerful theoretical and practical results (Krause_sensor_placement_in_GP; Krause_nonmyopic_active_learning_GP). For dynamical systems, however, there is little related work. Recent attempts have been made, proposing a greedy exploration scheme (Morari_learning_control_GP), focusing on exploration under safety constraints (Koller_learning_MPC_safe_exploration), or on active exploration for reinforcement learning using linear Bayesian inference rather than GPs (Belousov_receding_horizon_curiosity). Approaches relying on parametrization of the trajectory have also been presented, including for learning time series with GPs (Krause_informative_path_planning_underwater_vehicle; Nguyen-Tuong_safe_AL_time-series_GP). These algorithms are related to this work; however, the analysis differs in several important points, which we discuss further in Section 4.
We investigate the active learning problem for dynamical systems, which are modeled by a GP. In particular, we take the learned dynamics explicitly into account to guide the exploration. The following contributions are made:
Proposal of a method that searches for informative points to visit, then separately drives the system to reach them (separated search and control, sep for short). While we can provide some theoretical guarantees on the suboptimality of the sequence of locations to visit, we find the method to have limitations in practice.
Novel method for deriving optimal input trajectories by maximizing an information criterion, while taking the dynamics into account as constraints. We propose two variants, receding horizon and plan and apply (rec and p&a for short).
Benchmark on a set of numerical experiments, including robotic systems from reinforcement learning benchmarks, showing the efficiency of the second type of methods for actively exploring the state-action space.
2 Problem statement
We consider a system subject to the following discrete-time dynamics:

$$x_{k+1} = f(x_k, u_k), \qquad y_k = x_{k+1} + \epsilon_k, \qquad (1)$$

where $f$ is an unknown Lipschitz-continuous function, $x_k \in \mathcal{X}$ is the system state at time step $k$, with $\mathcal{X}$ the space of possible states, $u_k \in \mathcal{U}$ is the control input, with $\mathcal{U}$ the space of bounded control inputs, and $\epsilon_k$ is i.i.d. Gaussian measurement noise. We assume the system has sufficient controllability and stability properties in order to exclude notoriously difficult learning problems, where systematic exploration would not be meaningful. For inputs $z_k = (x_k, u_k)$, where $z_k \in \mathcal{X} \times \mathcal{U}$, we observe the noisy measurements $y_k = f(z_k) + \epsilon_k$. We are thus in the standard GP regression setting with noise-free input and noisy target. While there are extensions to training GP dynamics with noisy inputs (Rasmussen_GP_training_input_noise_NIGP) and other more realistic settings (Doerr_optimize_long-term_predictions_model-based_policy_search; Doerr_probabilistic_recurrent_state-space_models), we do not consider them here, as they are orthogonal to the problem of excitation, and to ease the presentation. The true system dynamics (1) are approximated by a GP (see (Rasmussen_Williams_GP_for_ML) for an overview). It is fully characterized by its mean function $\mu(z)$ and its covariance function $k(z, z')$. The prediction at an unobserved point $z_*$ is Gaussian, with posterior mean and variance

$$\mu_N(z_*) = \mu(z_*) + k(z_*, Z)(K + \sigma^2 I)^{-1}(y - \mu(Z)), \qquad \sigma_N^2(z_*) = k(z_*, z_*) - k(z_*, Z)(K + \sigma^2 I)^{-1} k(Z, z_*),$$

where $K$ is the covariance matrix of the training inputs $Z = (z_1, \dots, z_N)$, with $K_{ij} = k(z_i, z_j)$, $k(z_*, Z) = (k(z_*, z_1), \dots, k(z_*, z_N))$, and $y = (y_1, \dots, y_N)$ the vector of measurements. The differential entropy of the GP at $z$, which quantifies the uncertainty of the prediction (MacKay_information_theory_inference_learning), is defined as

$$H(z) = \tfrac{1}{2} \log\left(2 \pi e \, \sigma_N^2(z)\right).$$

The kernel $k$ usually depends on some hyperparameters, which are optimized during learning, often by maximizing the log marginal likelihood of the data.
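To make these quantities concrete, the following is a minimal numpy sketch of GP regression and the resulting differential entropy. The squared-exponential kernel, zero prior mean, and the fixed hyperparameter values are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = s^2 * exp(-|a - b|^2 / (2 l^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(Z, y, Zs, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at test points Zs,
    given noisy training targets y at training inputs Z."""
    K = rbf_kernel(Z, Z) + noise * np.eye(len(Z))
    Ks = rbf_kernel(Zs, Z)
    Kss = rbf_kernel(Zs, Zs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss) - (v**2).sum(0)
    return mean, var

def differential_entropy(var):
    """H(z) = 0.5 * log(2*pi*e*sigma^2(z)) for a Gaussian prediction."""
    return 0.5 * np.log(2 * np.pi * np.e * var)
```

Near observed inputs the posterior variance, and hence the entropy, is small; far from the data both are large, which is what the exploration criteria below exploit.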
We address the following question: how should one excite (1) to generate samples for learning in a sample-efficient way? We derive control inputs that optimize information criteria such as differential entropy, while taking the autoregressive structure of (1) explicitly into account. Ultimately, this reduces the prediction error given a fixed number of samples, by choosing informative control inputs that lead to exploratory system behavior.
3 From static to dynamic – a fundamentally different problem
Powerful active learning strategies have been developed for learning static maps (Krause_sensor_placement_in_GP). Here, it is possible to immediately query any point in the input space. Thus, the problem is amenable to a clean information theoretical treatment, which is lost for dynamical systems. The insights from the well-studied static problem are a natural starting point for this work, and we present an extension to the dynamic setting herein. At the same time, we shall underline the fundamentally different nature of the dynamical problem.
3.1 The static problem: sensor placement
A canonical example for actively learning static maps is the sensor placement problem (Krause_sensor_placement_in_GP). The objective is to find the best locations $\mathcal{A}$ of $n$ sensors out of a finite set $\mathcal{V}$ of possible locations, in order to approximate a static map with a GP, using noisy measurements. A possible solution is to select

$$\mathcal{A}^* = \arg\max_{\mathcal{A} \subset \mathcal{V},\, |\mathcal{A}| = n} H(\mathcal{A}),$$

where $H(\mathcal{A})$ is the differential entropy of the GP predictions at the locations in $\mathcal{A}$. Finding such an optimal set of placements is NP-hard (Krause_sensor_placement_in_GP). Therefore, there is a need for tractable approximations. In particular, the optimal sequence of placements can be approximated by the greedy rule

$$a_{j+1} = \arg\max_{a \in \mathcal{V}} H_j(a), \qquad (3)$$

with $H_j$ the differential entropy of the GP at iteration $j$. The set function $H$ is monotonic and exhibits a property of diminishing returns called submodularity: if $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and $a \in \mathcal{V} \setminus \mathcal{B}$, then

$$H(\mathcal{A} \cup \{a\}) - H(\mathcal{A}) \geq H(\mathcal{B} \cup \{a\}) - H(\mathcal{B}).$$

Thanks to the submodularity property and Proposition 4.3 in (Nemhauser_approximations_submodular_functions), it can be shown that the sequence $\mathcal{A}_g$ of greedy placements selected by (3) is close to the true optimal sequence $\mathcal{A}^*$:

$$H(\mathcal{A}_g) \geq \left(1 - \frac{1}{e}\right) H(\mathcal{A}^*). \qquad (4)$$
3.2 Extension to dynamical systems
The dynamical problem is fundamentally different: we cannot sample at an arbitrary state $x$; we need to steer the system to $x$ through the unknown dynamics with a sequence of bounded control inputs. Therefore, there is also an information gain along the trajectory, which is not considered in the previously introduced static framework.
At first, we ignore this fact: we separate the search for informative states from obtaining the control inputs that drive the system to these states. This method is denoted separated search and control (sep). Starting from an initial point $z_0$, at each iteration $j$ the next location to visit is determined by the greedy rule

$$z_{j+1} = \arg\max_{z \in \mathcal{X} \times \mathcal{U}} H_j(z). \qquad (5)$$

After solving (5), we get a state-action pair $z_{j+1} = (x_{j+1}, u_{j+1})$. We steer the system to $x_{j+1}$ using a control trajectory $(u_k, \dots, u_{k+M-1})$, then apply $u_{j+1}$. Here, $M$ is the control horizon, and $k$ is the time step since the beginning of the experiment, while $j$ indexes the iterations of the greedy procedure. Due to the controllability assumption, the existence of such control trajectories is ensured for sufficiently large $M$, and there exist methods to obtain them, e.g., iLQR (Mansard_iLQR_dynamic_programming; Tassa_iLQR_trajectory_optimization). However, there are severe issues in the concrete implementation:
In general, it is difficult to choose $M$ a priori such that each $x_{j+1}$ is attainable in $M$ steps. However, limiting the search space in (5) to the points attainable in $M$ steps yields a time-varying set from which to choose $z_{j+1}$. In this case, Proposition 4.3 in (Nemhauser_approximations_submodular_functions) is not applicable anymore. Hence, suboptimality guarantees of type (4) can typically only be derived for (5) if $z_{j+1}$ is chosen directly from $\mathcal{X} \times \mathcal{U}$.
Information gain along the trajectory is not included in the theoretical framework.
The considered search space is continuous, and not a finite set of possible locations as in the sensor placement problem. This has implications for the property of submodularity, which is originally defined for set functions.
Solving (5), a non-convex problem in a continuous state-action space, is nontrivial, and might not return the true optimum at each iteration.
One can obtain suboptimality guarantees of type (4) under restrictive assumptions, namely: the search space is a finite set, $z_{j+1}$ is the true optimum at iteration $j$, and $z_{j+1}$ is actually visited by the control procedure. However, these assumptions do not hold in practice. Next, we show that more efficient strategies can be designed by optimizing over the whole control trajectory.
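Schematically, the sep procedure alternates a greedy search over candidate state-action pairs with a separate control phase. In the sketch below, `gp_var` (posterior variance of the current model) and `steer` (e.g., an iLQR controller on the learned mean dynamics) are hypothetical placeholders, and the rollout uses toy additive dynamics for illustration only.

```python
import numpy as np

def sep_exploration(gp_var, steer, x0, iters=5, n_candidates=500, rng=None):
    """Separated search and control (sep), schematically.

    gp_var(x, u): posterior variance of the current GP model at (x, u);
    steer(x, x_target): returns a control sequence driving x toward x_target.
    Both are assumed given (hypothetical placeholders)."""
    rng = np.random.default_rng(rng)
    x, visited = x0, [x0]
    for _ in range(iters):
        # Search: sample candidate (x, u) pairs, keep the most uncertain one.
        cand_x = rng.uniform(-1, 1, size=(n_candidates, x0.shape[0]))
        cand_u = rng.uniform(-1, 1, size=(n_candidates, 1))
        scores = np.array([gp_var(cx, cu) for cx, cu in zip(cand_x, cand_u)])
        best = int(np.argmax(scores))
        # Control: drive the system to the chosen state, then apply the
        # chosen input (placeholder additive dynamics for illustration only).
        for u in steer(x, cand_x[best]):
            x = x + u
        visited.append(x)
    return visited
```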
4 Informative control generation
Separating search from control yields an insightful embedding into the sensor placement problem. However, the crucial properties of this problem also become apparent, revealing that many aspects of the existing solutions are not transferable to dynamical systems. The discussed insufficiencies inspired us to jointly optimize for control inputs and informative states, taking the approximate dynamics into account as constraints. We propose the following approach: at time step $k$, we pick the most informative control trajectory by solving

$$\max_{u_k, \dots, u_{k+M-1} \in \mathcal{U}} \; \sum_{i=k}^{k+M-1} H(x_i, u_i) \quad \text{s.t.} \quad x_{i+1} = \mu_N(x_i, u_i), \qquad (6)$$

for a fixed time horizon $M$, with $x_k$ the current state and $\mu_N$ the GP posterior mean. This method is highly versatile: the cost function can easily be extended and further regularized, e.g., by penalizing a suitable norm of the control signals. Numerically, we use direct multiple shooting in CasADi (Casadi) with Ipopt (Ipopt), as in (Koller_learning_MPC_safe_exploration; Morari_learning_control_GP).
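The paper solves (6) as a nonlinear program via direct multiple shooting in CasADi with Ipopt. As a self-contained illustration of the objective only, the following sketch replaces the NLP solver with naive random shooting: sample bounded control sequences, roll each out through the learned mean dynamics, and keep the one with the largest summed differential entropy. `mean_dyn` and `pred_var` stand in for the GP posterior and are hypothetical.

```python
import numpy as np

def plan_informative_controls(mean_dyn, pred_var, x0, horizon=10,
                              n_samples=200, u_bound=1.0, rng=None):
    """Zeroth-order (random-shooting) approximation of the
    information-maximizing optimal control problem: sample control
    sequences, roll them out through the mean dynamics, and keep the
    one with the largest summed predictive entropy. The paper instead
    solves this NLP with multiple shooting in CasADi/Ipopt."""
    rng = np.random.default_rng(rng)
    best_score, best_u = -np.inf, None
    for _ in range(n_samples):
        us = rng.uniform(-u_bound, u_bound, size=(horizon, 1))
        x, score = x0, 0.0
        for u in us:
            var = pred_var(x, u)                            # sigma^2 at (x, u)
            score += 0.5 * np.log(2 * np.pi * np.e * var)   # differential entropy
            x = mean_dyn(x, u)                              # roll out mean dynamics
        if score > best_score:
            best_score, best_u = score, us
    return best_u
```

With a toy integrator and a variance that grows away from the origin, the planner returns a bounded control sequence of the requested horizon.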
Receding horizon or plan and apply
For the method described above, we propose two options. In the default setting, we update the GP and optimize its hyperparameters at each time step. Then, we solve (6) in a receding horizon fashion (variant denoted receding horizon, rec for short). However, this is computationally very costly and may lead to shortsighted behavior, since the exploration strategy has a chance to “change its mind” at every time step. Thus, we propose a computationally cheaper alternative: we solve (6), roll out the whole control trajectory, batch update the GP with the measurements taken along the way, optimize its hyperparameters, and iterate (variant denoted plan and apply, p&a for short). In this case, the horizon $M$ needs to be well-chosen: if it is too large, the GP will not be updated often enough, but if it is too small, the procedure will be too shortsighted.
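The difference between the two variants is only in how much of each planned trajectory is applied before the GP is updated. A schematic outline, with `plan`, `apply_and_measure`, and `update_gp` as hypothetical callables:

```python
def explore(plan, apply_and_measure, update_gp, steps, horizon, receding=True):
    """Outline of the rec / p&a exploration loops.

    plan(horizon): solves problem (6) and returns a control sequence;
    apply_and_measure(u): applies one input and returns the new sample;
    update_gp(data): batch-updates the GP and its hyperparameters.
    All three are hypothetical placeholders."""
    data = []
    while len(data) < steps:
        us = plan(horizon)
        # rec: apply only the first input, then replan at the next step;
        # p&a: roll out the whole planned trajectory before updating.
        n_apply = 1 if receding else horizon
        for u in us[:n_apply]:
            data.append(apply_and_measure(u))
        update_gp(data)
    return data
```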
Recent works propose related algorithms, but with a different focus. For example, in (Nguyen-Tuong_safe_AL_time-series_GP), exploration of dynamical systems modeled by GPs under safety constraints is investigated. However, the main focus lies on learning the safety measure, and exploration is achieved by parametrizing and optimizing a piecewise linear trajectory under safety constraints. An exploration scheme for learning GP dynamics is presented in (Morari_learning_control_GP), but it is greedy. The model learned during exploration is then used for reference tracking under uncertainty by solving an MPC problem. In (Koller_learning_MPC_safe_exploration), an MPC scheme for safe exploration of dynamical systems is proposed, where the constraint does not directly lie on the estimated dynamics, but on the propagation of safe ellipsoids through these dynamics. Exploration for reinforcement learning is considered in (Belousov_receding_horizon_curiosity), where a sequence of discrete actions is optimized. The proposed algorithm is related to (6), but uses linear Bayesian inference as the learning framework. In this paper, we focus on exploration for learning dynamical systems with GPs, with a novel viewpoint extending solutions of the static problem. We also provide a comprehensive benchmark on control systems, which follows next.
5 Numerical benchmark
We compare the proposed methods in a numerical benchmark. For each approach, we evaluate the prediction error and quantify how much of the state space has been explored. The results are summarized in Table 1 and Figure LABEL:table_box_plots.
We run the following methods (code available at https://git-amd.tuebingen.mpg.de/mbuissonfenet/active_learning_gp.git): PRBS and chirp signals ((Nelles_nonlin_sysid), Section 17.7), to compare to standard system identification signals; separated search and control (sep); and our optimal control method (6) with either receding horizon (rec) or plan and apply (p&a). We evaluate on five nonlinear dynamical systems with continuous state-action space and bounded controls, illustrated in Figure LABEL:systems_illustrated:
Pendulum;
Two-link planar robot (see (Siciliano_robotics_book));
Double inverted pendulum on a cart (DIPC) from the MuJoCo environment (Todorov_MuJoCo_presentation) in Gym (Brockman_OpenAI_Gym_presentation), with added damping;
Unicycle (see (udwadia2007analytical));
Half-cheetah from the MuJoCo environment.
Each method uses the same number of data points and the same planning horizon, and starts from a stable equilibrium. We make sure each system has enough damping to be sufficiently stable and controllable in the exploration region, and choose the bounds on the control inputs such that exploring the state space is neither too easy (even random signals can easily go everywhere) nor too hard (even active exploration methods cannot go far). We run 10 trials of rec, since it is computationally heavy, and 100 trials of all other methods.
We quantify the accuracy of the learned model by monitoring the root mean square prediction error (RMSE) over a grid of uniformly randomly distributed states and inputs in a predefined region of interest, and the quality of exploration by computing the percentage of coverage of this region at the end of the experiment. The region of interest is chosen a priori in $\mathcal{X} \times \mathcal{U}$ for each system, by picking bounds in each dimension within which most experiments stay. In this paper, we are less interested in the absolute results than in the comparison between the different methods, and in demonstrating that some explore the state space more than others, yielding a more accurate model.
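A sketch of the two evaluation metrics, assuming a box-shaped region of interest; the histogram-based coverage computation is one possible implementation, not necessarily the one used for Table 1.

```python
import numpy as np

def rmse_on_grid(model, true_f, n_points, bounds, rng=None):
    """Root mean square prediction error over uniformly random test points
    in a box-shaped region of interest (bounds: (low, high) arrays)."""
    rng = np.random.default_rng(rng)
    Z = rng.uniform(bounds[0], bounds[1], size=(n_points, len(bounds[0])))
    err = np.array([model(z) - true_f(z) for z in Z])
    return float(np.sqrt(np.mean(err ** 2)))

def coverage(visited, bounds, bins=10):
    """Percentage of the region of interest visited: the fraction of
    occupied cells when the region is split into bins^d histogram cells."""
    visited = np.asarray(visited)
    H, _ = np.histogramdd(visited, bins=bins,
                          range=list(zip(bounds[0], bounds[1])))
    return 100.0 * np.count_nonzero(H) / H.size
```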
The results shown in Table 1 and Figure LABEL:table_box_plots confirm that our optimal exploration methods (rec and p&a) yield the lowest prediction error and thus, the best models. Indeed, they can push each nonlinear system to unknown regions of the state-action space, generating informative data points in the whole state space, which yields an overall more consistent model. The sep method performs reasonably well, but significantly worse than rec and p&a. The standard signals from system identification do not consider the current model and can therefore perform arbitrarily badly. Nonetheless, PRBS explored surprisingly well for systems that are close to linear, e.g., rigid-body dynamics with torque control, where several states are linear in the input. Note, however, that this is not the case for other types of systems (e.g., DIPC), and that PRBS often does not induce desirable system behavior.
Table 1: Coverage of the state space (in percent of the region of interest).
6 Conclusion
When learning models of dynamical systems, efficient exploration is key, since it determines how informative the gathered data is. In this paper, we propose and benchmark three main algorithms for actively learning dynamical systems with GPs. The separated search and control method is inspired by active learning for static GPs. However, its performance is suboptimal, and the theoretical guarantees of the static case are not directly applicable. More efficient exploration can be obtained with two variants of an approach based on computing optimal excitation signals. The receding horizon variant performs well, but incurs a high computational burden. Hence, we also propose a batch update that trades off computation against performance. This framework is not only efficient but also versatile, as further modifications of the cost function are straightforward. We show on a numerical benchmark of diverse dynamical systems that the proposed methods are capable of exploring the state-action space efficiently, yielding more informative data and hence a more accurate model.
In future work, we intend to study the effects of different cost functions and look into the question of learnability, i.e., what makes a dynamical system easy or difficult to learn. It would also be interesting to generalize our framework to more realistic GP models for learning dynamical systems, such as those with noisy inputs and latent states. Validation on hardware experiments would also be relevant, however, the methods need to be computationally optimized first.
This work was supported in part by the Cyber Valley Initiative, the International Max Planck Research School for Intelligent Systems, and the Max Planck Society.