Robot learning has proven to be a challenge in real-world applications. This is partially due to the ineffectiveness of passive data acquisition for learning and the need to act in order to generate informative data. What makes this problem even more challenging is that active data gathering is not a stable process: it involves exciting states in order to acquire new information. Safe exploration thus becomes a challenge for modern-day robotics. The problem is exacerbated when memory and task constraints (i.e., actively collecting data after deployment) are imposed on the robot. If the structures that compose the dynamics of the robot change over time, the robot will need to explore its own dynamics in a manner that is systematic and informative, avoiding damage to the underlying structures (and humans) in the environment. In this paper, we address these fundamental issues by developing an algorithm that is inspired by hybrid systems theory. This algorithm enables robots to actively pursue informative data by generating area coverage while guaranteeing Lyapunov attractiveness during exploration.
1 Introduction
The need to act in order to generate informative data is generally seen in the field of reinforcement learning (RL), where attempts at a task, as well as learning from the outcome of actions, are used to learn both policies and predictive models [1, 3]. As a result, generalizing these methods to real-world applications has been a topic of research [3, 1, 4], where data inefficiency limits much of the progress. One solution to data inefficiency is to simulate robots in a realistic virtual environment and then use the large amount of synthetic data to solve a learning problem before applying the results on a real robot. This leads to issues such as the “reality gap”, where finer modeling details such as motor delays lead to poor-quality data for learning.
Existing work addresses the data-inefficiency problem by actively seeking out informative data using information maximization or by pruning a data set based on some information measure. These methods still suffer from local minima due to a lack of exploration or non-convex information objectives. Safety is also a concern when actively seeking out informative measurements. Methods typically provide some bound on the worst-outcome model using probabilistic approaches, but often consider safety only with respect to the task and not with respect to the data collection process. We focus on problems where data collection involves exploring the state-space of robots and where safe generation of informative data is important. We treat data acquisition as a dynamic area coverage problem, in which the time the robot spends in a region along its trajectory is proportional to the expected informativeness of the data there. This lets us uncover informative data that is not already expected. With this approach, we can provide attractiveness guarantees (that the robot will eventually return to a stable state) while providing control authority that allows the robot to actively seek out informative data in order to later solve a learning task. Thus, our contribution is an approach to dynamic area coverage for active data collection that starts from equilibrium policies for robots.
We structure the paper as follows. Section 2 reviews related work, and Section 3 defines the problem statement for this work. Section 4 formulates the algorithm for active data acquisition from equilibrium. Section 5 provides simulated and experimental examples of our method. Finally, Section 6 provides concluding remarks on our method and future directions.
2 Related Work
Existing work generally formulates problems of active data acquisition as information maximization with respect to a known parameterized model [10, 11]. The problem with this approach is that robots need to address local optima [12, 11], which results in insufficient data collection. Other approaches have sought to solve this problem by treating information maximization as an area coverage problem [12, 13]. Ergodic exploration, in particular, has remedied the issue of local extrema by using the ergodic metric to minimize the Sobolev distance from the time-averaged statistics of the robot’s trajectory to the expected information in the explored region. This enables both exploration (moving quickly through low-information regions) and exploitation (spending a significant amount of time in highly informative regions) in order to avoid local extrema and collect informative measurements. The major downside is that this method assumes that the model of the robot is fully known. Moreover, there is little guarantee that the robot will not destabilize during the exploration process. This becomes an issue when the robot must explore part of its own state-space (i.e., velocity space) in order to generate informative data. To the best of the authors’ knowledge, this has not been done to date. Another issue is that these methods do not scale well with the dimensionality of the search space, making experimental applications challenging due to computational limitations.
Our approach overcomes these issues by using a sample-based KL-divergence measure as a replacement for the ergodic metric. This form of measure has been used previously; however, it relied on motion primitives in order to compute control actions. We avoid this issue by using hybrid systems theory to compute a controller that sufficiently reduces the KL-divergence measure from an equilibrium stable policy. As a result, we can use approximate models of dynamical systems instead of complete dynamic reconstructions in order to actively collect data while ensuring safety in the exploration process through a notion of attractiveness.
The following section formulates the problem statement that our method solves.
3 Problem Statement
Modeling Assumptions and Stable Policies
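Assume we have a robot whose approximate dynamics can be modeled using the control-affine form

$$\dot{x} = f(x, u) = g(x) + h(x)\,u, \qquad (1)$$

where $x \in \mathbb{R}^n$ is the state of the robot, $u \in \mathbb{R}^m$ is a control vector applied to the robot, $g(x)$ is the free unactuated dynamics, $h(x)$ is the actuated dynamics, and $f(x, u)$ is the time rate of change of the robot at state $x$ subject to the control $u$. Moreover, let us assume that there exists a Lyapunov function $V(x)$ such that, under a policy $\mu(x)$, $\dot{V}(x) < 0$ for all states $x$ in a region around the equilibrium. For the rest of the paper, we will refer to $\mu(x)$ as an equilibrium policy.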
KL-divergence and Area Coverage
Given the assumptions of known approximate dynamics and the equilibrium policy, we can define active exploration for informative data acquisition as automating safe switching between and some control authority that generates actions that actively seek out informative data. This is accomplished by specifying the active data acquisition task using an area coverage objective where we minimize the KL-divergence between the time average statistics of the robot along a trajectory and a spatial distribution defining the current coverage requirement. We can then define an approximation to the spatial statistics of the robot as follows:
Given a search domain where , the -approximated time-averaged statistics of the robot, i.e., the time the robot spends in regions of the search domain , is defined by
where is a point in the search domain , is the component of the robot’s trajectory and actions that intersects the search domain , is a positive definite matrix parameter that specifies the width of the Gaussian, is a normalization constant such that and , is the sampling time, and is the sum of the time horizon and the amount of time the robot remembers into the past.
This is an approximation because the true time-averaged statistics are a collection of delta functions parameterized by time. We approximate each delta function as a Gaussian distribution, which converges to the delta function as its covariance goes to zero. Using this approximation, we are able to relax the ergodic area-coverage objective and use the following KL-divergence objective:
where is the expectation operator, , and , , is a distribution that describes where in the search domain an informative measurement is likely to be acquired. We can further approximate the KL-divergence via sampling where we approximate the expectation operator as
is the number of samples in the search domain drawn from a uniform distribution. With this formulation, we can approximate the ergodic coverage metric using (3).
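The sample-based objective described above can be sketched numerically. The following is a minimal illustration rather than the implementation used in this work: the function names `time_avg_stats` and `kl_coverage` are hypothetical, the time-averaged statistics are normalized over the sample points instead of with an analytic constant, and the small `eps` is added only for numerical safety.

```python
import numpy as np

def time_avg_stats(samples, traj, sigma_inv):
    """Gaussian-smoothed time-averaged statistics q(s) evaluated at sample points.

    samples:   (N, v) points drawn uniformly from the search domain
    traj:      (T, v) trajectory components that lie in the search domain
    sigma_inv: (v, v) inverse of the matrix that sets the Gaussian width
    """
    diff = samples[:, None, :] - traj[None, :, :]                 # (N, T, v)
    expo = -0.5 * np.einsum("ntv,vw,ntw->nt", diff, sigma_inv, diff)
    q = np.exp(expo).sum(axis=1)                                  # unnormalized
    return q / q.sum()                                            # normalize over the samples

def kl_coverage(p_vals, q_vals, eps=1e-12):
    """Sample-based KL divergence between a target p and trajectory statistics q."""
    p = p_vals / p_vals.sum()
    return float(np.sum(p * (np.log(p + eps) - np.log(q_vals + eps))))

# Example: a trajectory that dwells near the target scores lower than one far away.
rng = np.random.default_rng(0)
samples = rng.uniform(size=(500, 2))
p_vals = np.exp(-50.0 * np.sum((samples - 0.5) ** 2, axis=1))
sigma_inv = np.eye(2) / 0.01
traj_near = 0.5 + 0.05 * rng.standard_normal((50, 2))
traj_far = 0.1 + 0.05 * rng.standard_normal((50, 2))
kl_near = kl_coverage(p_vals, time_avg_stats(samples, traj_near, sigma_inv))
kl_far = kl_coverage(p_vals, time_avg_stats(samples, traj_far, sigma_inv))
```

Because only the term involving the trajectory statistics depends on the control, a controller built on this quantity needs only the cross term; the full divergence is computed here so that the two trajectories can be compared directly.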
In addition to the KL-divergence, we can add a task objective
where is the sampling time, is the time horizon, is the running cost, is the terminal cost, and is the state of the robot at time . This additional objective will typically encode some other task, in addition to the KL-divergence objective.
By summing the KL-divergence objective and a task objective (4), we can then pose active data acquisition as an optimal control problem subject to the initial approximate dynamic model of the robot. More formally, the objective is written as
where the goal is to generate a control that minimizes (5) subject to the approximate dynamics (1). Because we are including the equilibrium policy in the objective, we are able to synthesize controllers that take into account the equilibrium policy and the desire to actively seek out measurements.
For the rest of the paper, we assume the following:
We have an initial approximate model of the robot.
We also have an initial policy that maintains the robot at equilibrium, for which there is a Lyapunov function.
These two assumptions are reasonable in that often robots are designed around stable states and typically have locally stable policies.
The following section uses the fact that we have an initial policy in order to synthesize control vectors that reduce (5). Specifically, we want to generate a hybrid composition of control actions that enable active data collection and actions that stabilize the robotic system. That way, it is possible to quantify how much the robot is deviating from a stable equilibrium. Thus, we motivate using hybrid systems theory in order to consider how much the objective (5) changes from switching from the equilibrium policy to the control . By quantifying the change, we specify an unconstrained optimization which solves for a control that applies actions that retain Lyapunov attractiveness.
4 Algorithm for Active Data Acquisition from Equilibrium
Our algorithm starts by considering the objective defined in (5) subject to the approximate dynamic constraints (1) and policy . We want to quantify how sensitive the objective is to switching from policy to the control vector for time for an infinitesimally small time duration . This sensitivity will be a function of and inform us of the most influential time to apply . Thus, we can use the sensitivity to write an objective whose minimizer is the schedule of control vectors that reduces the objective (5). The sensitivity of the objective (5) with respect to the duration time , of switching from the policy to the control at time is
where and , and is the adjoint, or co-state variable which is the solution of the following differential equation
subject to the terminal constraint . Taking the derivative of the objective (5) with respect to the duration time gives
The term is calculated by
where , and is the state transition matrix for the integral equation
where and .
We can similarly show that the term is given by
subject to the terminal condition .
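The adjoint variable in the derivation above is obtained by integrating backward from its terminal condition. As a minimal sketch, and assuming the adjoint equation takes the standard form from hybrid optimal control, $\dot{\rho} = -\partial \ell / \partial x - (\partial f / \partial x)^\top \rho$ with $\rho(t_f) = \partial m / \partial x$ (the symbols $\rho$, $\ell$, and $m$ and the function name `integrate_adjoint` are assumptions made for this illustration), an explicit Euler sweep backward in time looks as follows.

```python
import numpy as np

def integrate_adjoint(xs, dl_dx, df_dx, dm_dx, dt):
    """Euler sweep backward in time for rho_dot = -dl/dx - (df/dx)^T rho.

    xs:    (T, n) states along the trajectory under the equilibrium policy
    dl_dx: callable x -> (n,) gradient of the running cost
    df_dx: callable x -> (n, n) Jacobian of the closed-loop dynamics
    dm_dx: callable x -> (n,) gradient of the terminal cost
    """
    T, n = xs.shape
    rho = np.zeros((T, n))
    rho[-1] = dm_dx(xs[-1])                       # terminal condition
    for k in range(T - 1, 0, -1):
        rho_dot = -dl_dx(xs[k]) - df_dx(xs[k]).T @ rho[k]
        rho[k - 1] = rho[k] - dt * rho_dot        # step from t_k back to t_{k-1}
    return rho

# Scalar check: x_dot = a x, running cost 0.5 q x^2, no terminal cost.
a, q, x0, dt = -1.0, 2.0, 1.5, 1e-3
ts = np.linspace(0.0, 2.0, 2001)
xs = (x0 * np.exp(a * ts))[:, None]
rho = integrate_adjoint(
    xs,
    dl_dx=lambda x: q * x,
    df_dx=lambda x: np.array([[a]]),
    dm_dx=lambda x: np.zeros(1),
    dt=dt,
)
```

For this scalar case the adjoint at the initial time equals the derivative of the total cost with respect to the initial state, which gives a closed-form value to compare against.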
The sensitivity is known as the mode insertion gradient . We can directly compute the mode insertion gradient for any control that we choose. However, our goal is to find one such control that reduces the objective (5) but still maintains its value near the equilibrium policy . To solve for this control, we formulate the following objective function
where is a positive definite matrix that penalizes the deviation from the policy . The control vector that minimizes is given by
Taking the derivative of (12) with respect to gives
Since is convex in , we set the expression in (4) to zero and solve for which gives us
which is a schedule of control values that reduce the objective for time . This controller reduces (5) for that is sufficiently small. The reduction in (5), , by applying can be approximated as . Ensuring that is an indicator that the robot is always actively pursuing data and reducing the objective (5). Let us assume that , where is the control Hamiltonian. Then where is the control space. Inserting (13) into (6) gives
Because of the manner in which we chose to solve for , and cancel out in (15). In addition, implies that and the policy is not an optimizer of (5). As a result, we can further analyze without the need to consider the policy . This gives us the following expression
which we can rewrite as
We automate the switching between and by choosing a and such that is most negative and . This is done through the combination of choosing with a 1-dimensional optimization and solving for using a line search until [16, 17]. By choosing we can place a bound on how much our algorithm excites the dynamical system through Lyapunov analysis (Theorem 4).
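The control schedule and the choice of application time can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes control-affine dynamics with actuation matrix `h`, an adjoint `rho`, and a quadratic penalty with weight `R` on deviation from the equilibrium policy value `mu`. Under those assumptions the stationarity condition gives $u^\star = \mu - R^{-1} h^\top \rho$, and the mode insertion gradient evaluated at $u^\star$ reduces to $-\lVert h^\top \rho \rVert^2_{R^{-1}} \le 0$. All function names are hypothetical.

```python
import numpy as np

def exploratory_control(mu, h, rho, R_inv):
    """Minimizer of rho^T h (u - mu) + 0.5 (u - mu)^T R (u - mu) for one time step."""
    return mu - R_inv @ h.T @ rho

def mode_insertion_gradient(mu, u, h, rho):
    """rho^T (f(x, u) - f(x, mu)), which is rho^T h (u - mu) for control-affine f."""
    return float(rho @ h @ (u - mu))

def best_application_time(mus, hs, rhos, R_inv):
    """Index of the most negative mode insertion gradient over the horizon."""
    grads = np.array([
        mode_insertion_gradient(m, exploratory_control(m, h, r, R_inv), h, r)
        for m, h, r in zip(mus, hs, rhos)
    ])
    return int(np.argmin(grads)), grads

# Example with random stand-in data over a 20-step horizon.
rng = np.random.default_rng(1)
n, m, T = 4, 2, 20
R_inv = np.linalg.inv(np.diag([0.5, 2.0]))
mus = rng.standard_normal((T, m))
hs = rng.standard_normal((T, n, m))
rhos = rng.standard_normal((T, n))
tau_index, grads = best_application_time(mus, hs, rhos, R_inv)
```

A larger `R` shrinks the deviation from the equilibrium policy and therefore the magnitude of the gradient, which matches the role of the regularization weight described in the text. The line search over the duration is omitted from this sketch.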
Assume there exists a Lyapunov function for (1) such that under the policy , is asymptotically stable. That is, where for . Then, given the schedule of control vectors (13) , , where is the Lyapunov function subject to the policy , and . Writing the integral form of the Lyapunov function switching between and at time for a duration of time starting at gives
where we explicitly write the dependency on in . Using the chain rule, we can write
Letting the largest value of be given by we can approximate (4) as
Subtracting from both sides gives the upper bound on instability
for the active data collection process. By fixing the maximum value of , we can provide an upper bound to the change of the Lyapunov function during active data acquisition. Moreover, we can tune our control vector using the regularization value such that as , and . With this bound, we can guarantee Lyapunov attractiveness , where the system (1) is not Lyapunov stable, but rather there exists a time such that the system (1) is guaranteed to return to a region of attraction where the system can be guided towards a stable equilibrium state . This property will play an important role in examples in Section 5. A dynamical system (1) is Lyapunov attractive if at some time , the trajectory of the system where and such that is an equilibrium state. Given the schedule of control vectors (13) , the robotic system governed by the dynamics in (1) is Lyapunov attractive such that , where
is the solution to switching between stable and exploratory motions for duration starting at time . Assume there exists a Lyapunov function such that under the policy . Moreover, assume that subject to the control vector , the trajectory where where . Using Theorem 4, the integral form of the Lyapunov function (4), and the identity (4), we can write
where . Since is fixed and can be tuned by the matrix weight , we can choose a such that . Thus, and , implying Lyapunov attractiveness, where is the minimum of the Lyapunov function at the equilibrium state . Asymptotic attractiveness shows us that the robot will return to a region where will return to a minimum under policy , allowing the robot to actively explore and collect data safely. Moreover, we can choose the value of and in automating the active data acquisition such that attractiveness always holds, giving us an algorithm that is safe for active data collection.
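The attractiveness property can be illustrated on a toy problem. The sketch below is not taken from this work: it assumes a scalar system $\dot{x} = x + u$, a stabilizing policy $\mu(x) = -2x$, and the Lyapunov function $V = x^2/2$, and it adds a bounded exploratory offset `du` to the policy for a duration `lam` starting at time `tau`. The function name `simulate_switch` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_switch(x0, tau, lam, du, t_final=10.0, dt=1e-3):
    """Simulate x_dot = x + u with u = -2x, plus an offset du during [tau, tau + lam).

    Returns the Lyapunov function V = 0.5 x^2 sampled at every step.
    """
    n = int(round(t_final / dt))
    x = x0
    V = np.empty(n + 1)
    V[0] = 0.5 * x0 ** 2
    for k in range(n):
        t = k * dt
        u = -2.0 * x + (du if tau <= t < tau + lam else 0.0)
        x = x + dt * (x + u)            # explicit Euler step
        V[k + 1] = 0.5 * x ** 2
    return V

V = simulate_switch(x0=0.1, tau=1.0, lam=0.5, du=2.0)
```

The exploratory interval raises V above its initial value, and once control is handed back to the policy V decreases monotonically toward zero. Shortening `lam` or shrinking `du` lowers the peak, which is the qualitative behavior the bound describes.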
All that is left is to define a spatial distribution that actively selects which measurements are more informative to the learning task.
Measure of Data Importance for Model Learning
Our goal is to provide a method that is general to any form of learning that requires a robot to actively seek out measurements through action. This may include area mapping or learning the dynamics of the robot. Thus, we use measures that allow the robot to quantify where in the search domain there exists useful data that needs to be collected. While there exist many measures that can select important data subject to a learning task, we use a measure of linear independence [19, 7, 20]. This measure is often used in sparse Gaussian processes [7, 20], where a data set is composed of input measurements and output measurements such that each data point maximizes the measure of linear independence. We use this measure of independence, also known as a measure of importance, to create a distribution for which the robot will provide area coverage in the search domain for active data collection.
This is done by evaluating a new measurement against the existing data points in given the structure of the model that is being learned. The importance measure for a new measurement pair is given by
which is the solution to ,
where are the basis functions (also known as feature vectors), is the coefficient of linear dependence, the matrix is known as the kernel matrix with elements such that is the kernel function given by the inner product , , and . The feature vector can be anything from a Fourier set of basis functions to a neural network. In addition, we can parameterize the functions and have them change over time.
The value provides a measure of how well the point can be represented given the existing data set and structure of the model being learned. Note that this measure will be computationally intractable for very large . Instead, other measures like the expected information density derived from the Fisher information matrix [12, 21] can be used if the learning task has a model that is parameterized by a set of parameters . Since , we define an importance distribution that the robot will use to generate area coverage. The importance distribution is
where , and , are functions of points . Note that will change as is augmented or pruned. If at any time for , we remove the point with the lowest value and add in the new data point.
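A minimal sketch of the linear-independence measure, assuming the common kernel form $\delta = k(d, d) - \mathbf{k}^\top K^{-1} \mathbf{k}$ used in incremental sparse Gaussian processes, is given below. The RBF kernel, its length scale, the jitter term, and the function names are assumptions made for this illustration.

```python
import numpy as np

def rbf(a, b, length=0.5):
    """RBF kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def importance(data, queries, jitter=1e-8):
    """delta(q) = k(q, q) - k^T K^{-1} k for every query point q."""
    K = rbf(data, data) + jitter * np.eye(len(data))
    k = rbf(data, queries)                      # (M, Q) kernel against the data set
    a = np.linalg.solve(K, k)                   # coefficients of linear dependence
    delta = 1.0 - np.sum(k * a, axis=0)         # k(q, q) = 1 for the RBF kernel
    return np.clip(delta, 0.0, None)

def importance_distribution(data, samples):
    """Normalize the importance values over sample points of the search domain."""
    d = importance(data, samples)
    return d / d.sum()

data = np.array([[0.0], [1.0]])
queries = np.array([[0.0], [0.5], [3.0]])
delta = importance(data, queries)
```

Points already in the data set score close to zero and points far from every stored point score close to one, so the normalized distribution concentrates coverage where the current data set represents the domain poorly.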
We provide an outline of our method in Algorithm 1 for online data acquisition. The following section evaluates our method on various simulated environments.
5 Simulated Examples
In this section, we illustrate Algorithm 1 on several examples that may be encountered in robotics. Figure 1 depicts three robotic systems on which we base our examples. In the first example, we use a cart double pendulum for area coverage for shape estimation. In the second and third examples, we use a 22 dimensional quadrotor and a 26 dimensional half-cheetah model from Roboschool for learning a dynamics model of the robotic systems by exploring in the state-space. For implementation details, including parameters used, we refer the reader to the appendix.
Shape Estimation while Stabilizing Cart Double Pendulum
Our first example demonstrates the functionality of our algorithm for estimating a sinusoidal shape while simultaneously balancing a cart double pendulum in its upright position. The purpose of this example is to show that our method can synthesize actions that ensure the cart double pendulum is maintained upright while actively collecting data for estimating the shape. This example also illustrates that our method can safely automate choosing when to stabilize and when to explore for data using approximate linear models of the robot dynamics and stabilizing policies derived from the approximate models.
The measurements of the height of the sinusoidal shape are collected through the position of the cart (illustrated in Fig. 2 as the magenta crosshair underneath the cart). A Gaussian process with a radial basis function (RBF) kernel is then used to estimate the function and provide the distribution used for exploration. The underlying importance distribution (26) is updated as the data set is pruned to include new informative measurements.
As a result of Algorithm 1, the robot will spend time where there is a high probability of acquiring informative data. This results in the shape reconstruction shown in Fig. 2 using a limited fixed set of data.
We analyze our algorithm by observing a candidate Lyapunov function (energy). Figure 3 depicts the value of the Lyapunov function over the time window of the cart double pendulum collecting data for estimating shape. The control vector over the application time increases the overall energy in the system (due to exploration). Since we include a regularization term that ensures does not deviate too far from the equilibrium policy , the cart double pendulum is able to stabilize itself, eventually returning to an equilibrium state and ensuring stability, illustrating the Lyapunov attractiveness property proven in Theorem 4.
Learning Dynamics of Quadrotor
Our next example illustrates active data acquisition in the state-space of a 22 degree of freedom quadrotor vehicle shown in Fig. 1(a). The results are averaged across trials with random initial conditions sampled uniformly in the body angular and linear velocities where is a uniform distribution.
The goal for this quadrotor is to maintain hovering height while collecting data in order to learn the dynamics model . In this example, a linear approximation of the dynamics centered at the hovering height is used as the local dynamics approximation on which Algorithm 1 is based. We then generate an LQR controller with the approximate dynamics, which we use as the equilibrium policy. The input data we collect is the state and control and the output data is which approximates the function . An incremental sparse Gaussian process with a radial basis function kernel is used to generate a learned model of the dynamics using a data set of and to specify the importance measure (4).
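The step of deriving an equilibrium policy from a linear approximation can be sketched with a discrete-time LQR. This is a stand-in illustration and not the quadrotor setup used here: the double-integrator model of height and vertical velocity, the weights `Q` and `R`, the time step, and the function name `dlqr` are all assumptions.

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Discrete-time LQR gain via fixed-point iteration of the Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P

dt = 0.05
A = np.array([[1.0, dt], [0.0, 1.0]])     # height error and vertical velocity (hypothetical)
B = np.array([[0.5 * dt ** 2], [dt]])
K, P = dlqr(A, B, Q=np.eye(2), R=np.array([[0.1]]))

def mu(x):
    """Equilibrium policy mu(x) = -K x around the hover point."""
    return -K @ x

# Closed-loop rollout from a 1 m height error.
x = np.array([1.0, 0.0])
for _ in range(400):
    x = A @ x + B @ mu(x)
```

The matrix `P` also supplies a quadratic Lyapunov function $x^\top P x$ for the closed loop, which is the kind of certificate the equilibrium-policy assumption requires.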
Figure 4(a) and Figure 4(b) illustrate the modeling error and the minimum importance value within the data set using our method and the equilibrium policy with uniformly added noise at of the saturation limit. Our method sequences and automates the process of choosing when it is best to explore and when to stabilize by taking into account the approximate dynamics and the equilibrium policy. As a result, a robot is capable of acquiring informative data that improves the prediction of the nonlinear dynamic model of the quadrotor. In contrast, adding noise to the control input (often referred to as “motor babble”) does not have temporal dependencies. That is, each new sample does not have information from the previous samples and cannot effectively explore the state-space.
As the robot continues to explore, the value of the mode insertion gradient (6) decreases as does the duration time as shown in Fig. 4 (c) and (d). This implies that the robot is sufficiently reducing the objective for area coverage and the equilibrium policy begins to take over to stabilize the robot. This is a result of taking into account the local stability of the robotic system while generating exploratory actions.
Learning to Gallop
In this last example, we consider applications of Algorithm 1 for systems with dynamic models and policies that are learned. We use the half-cheetah from the Roboschool environment for the task of learning a dynamics model in order to control the robot to gallop forward.
We first learn a simple standing upright policy using the augmented random search (ARS) method. In that process, we collect the state and action data to compute a linear approximation of the local dynamics using least squares. Then Algorithm 1 is applied using an incremental sparse Gaussian process with an RBF kernel to generate a dynamics model from data as well as provide the importance measure using a set of data points. The input-output data structure maps input to the change in state . Our running cost is set to maintain the half-cheetah upright. After the Gaussian process model is learned, we use the generated model in the prediction of the forward dynamics as a replacement for the initial dynamics model.
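The least-squares linear approximation mentioned above can be sketched as follows. This is a generic illustration under the assumption of a discrete-time model $x_{k+1} \approx A x_k + B u_k$; the function name `fit_linear_dynamics` and the synthetic data are hypothetical and do not come from the half-cheetah experiments.

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """Least-squares fit of x_{k+1} ~= A x_k + B u_k from rollout data.

    X: (N, n) states, U: (N, m) actions, X_next: (N, n) successor states.
    """
    Z = np.hstack([X, U])                                   # (N, n + m) regressors
    theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)      # (n + m, n)
    n = X.shape[1]
    return theta[:n].T, theta[n:].T                         # A is (n, n), B is (n, m)

# Synthetic check with a known linear system.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])
B_true = np.array([[0.0], [0.1]])
X = rng.standard_normal((200, 2))
U = rng.standard_normal((200, 1))
X_next = X @ A_true.T + U @ B_true.T
A_hat, B_hat = fit_linear_dynamics(X, U, X_next)
```

With data gathered near the standing configuration, such a fit is only locally valid, which is why it serves as the initial approximate model rather than the final learned dynamics.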
As shown in Fig. 5, our method collects informative data while respecting the standing upright policy when compared to noisy inputs. We compared the two learned models using our controller with and the running cost set to maximize the forward velocity of the half-cheetah. We show those results in Fig. 5 over 5 runs of our algorithm at different initial states. Our method provides a learned model that has overall positive integrated velocity (forward movement). While our method is more complex than simply adding noise, it provides stability guarantees based on known policies in order to explore and collect data.
6 Conclusion
Algorithm 1 enables robots to actively seek out informative data based on the learning task while maintaining stability using equilibrium policies. Our method generates area coverage using a KL-divergence measure in order to enable robots to actively seek out informative data. Moreover, by using a hybrid systems theory approach to generating area coverage, we were able to incorporate equilibrium policies in order to provide stability guarantees even when the model of the robot dynamics is only locally known. Finally, we provided examples that illustrate the benefits of our approach for active data acquisition for learning tasks.
References
[1] Petar Kormushev, Sylvain Calinon, and Darwin G Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In International Conference on Intelligent Robots and Systems, pages 3232–3237, 2010.
[2] René Felix Reinhart. Autonomous exploration of motor skills by skill babbling. Autonomous Robots, 41(7):1521–1537, 2017.
[3] Christopher D McKinnon and Angela P Schoellig. Learning multimodal models for robot dynamics online with a mixture of Gaussian process experts. In International Conference on Robotics and Automation, pages 322–328, 2017.
[4] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems, 2018. doi: 10.15607/RSS.2018.XIV.010.
[5] Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization. In International Conference on Robotics and Automation, pages 1557–1563, 2017.
[6] Mac Schwager, Philip Dames, Daniela Rus, and Vijay Kumar. A multi-robot control policy for information gathering in the presence of unknown hazards. In Robotics Research, pages 455–472. 2017.
[7] Duy Nguyen-Tuong and Jan Peters. Incremental online sparsification for model learning in real-time robot control. Neurocomputing, 74(11):1859–1867, 2011.
[8] Dariusz Ucinski. Optimal measurement methods for distributed parameter system identification. CRC Press, 2004.
[9] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
[10] Tsen-Chang Lin and Yen-Chen Liu. Direct learning coverage control based on expectation maximization in wireless sensor and robot network. In Conference on Control Technology and Applications, pages 1784–1790, 2017.
[11] Frederic Bourgault, Alexei A Makarenko, Stefan B Williams, Ben Grocholsky, and Hugh F Durrant-Whyte. Information based adaptive robotic exploration. In International Conference on Intelligent Robots and Systems, volume 1, pages 540–545, 2002.
[12] Lauren M Miller, Yonatan Silverman, Malcolm A MacIver, and Todd D Murphey. Ergodic exploration of distributed information. IEEE Transactions on Robotics, 32(1):36–52, 2016.
[13] Elif Ayvali, Hadi Salman, and Howie Choset. Ergodic coverage in constrained environments using stochastic trajectory optimization. In International Conference on Intelligent Robots and Systems, pages 5204–5210, 2017.
[14] Randolf Arnold and Andreas Wellerding. On the Sobolev distance of convex bodies. Aequationes Mathematicae, 44(1):72–83, 1992.
[15] Henrik Axelsson, Y Wardi, Magnus Egerstedt, and EI Verriest. Gradient descent approach to optimal mode scheduling in hybrid dynamical systems. Journal of Optimization Theory and Applications, 136(2):167–186, 2008.
[16] Anastasia Mavrommati, Emmanouil Tzorakoleftherakis, Ian Abraham, and Todd D Murphey. Real-time area coverage and target localization using receding-horizon ergodic exploration. IEEE Transactions on Robotics, 34(1):62–80, 2018.
[17] Ian Abraham and Todd D Murphey. Decentralized ergodic control: Distribution-driven sensing and exploration for multiagent systems. IEEE Robotics and Automation Letters, 3(4):2987–2994, 2018.
[18] Andrey Polyakov and Leonid Fridman. Stability notions and Lyapunov functions for sliding mode control systems. Journal of the Franklin Institute, 351(4):1831–1865, 2014.
[19] Bernhard Scholkopf, Sebastian Mika, Chris JC Burges, Philipp Knirsch, K-R Muller, Gunnar Ratsch, and Alex J Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.
[20] Xinyan Yan, Vadim Indelman, and Byron Boots. Incremental sparse GP regression for continuous-time trajectory estimation and mapping. Robotics and Autonomous Systems, 87:120–132, 2017.
[21] AF Emery and Aleksey V Nenarokomov. Optimal experiment design. Measurement Science and Technology, 9(6):864, 1998.
[22] Taosha Fan and Todd Murphey. Online feedback control for input-saturated robotic systems on Lie groups. In Proceedings of Robotics: Science and Systems, June 2016. doi: 10.15607/RSS.2016.XII.027.
[23] Oleg Klimov and John Schulman. Roboschool. https://github.com/openai/roboschool, 2017.
[24] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.
[25] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.