Robotic systems deployed in real-world scenarios often operate in partially known environments. Modeling the complex dynamic interactions with the environment requires high-fidelity techniques that are often computationally expensive. For example, spherical harmonics are used to model the effects of a non-uniform gravity field on a spacecraft. Machine-learning models can remedy this difficulty by approximating the dynamics from data [1, 2, 3, 4]. The learned models typically require off-line training with labeled data, which are often unavailable or hard to collect in many applications. Safe exploration is an efficient approach to collect ground truth data by safely interacting with the environment. Recent research on safe exploration uses a deterministic approach in an episodic framework to collect labeled data by querying the domain of interest for informative data.
Planning for safe exploration is challenging when a probabilistic machine learning model is used to approximate unknown dynamics, because: 1) the uncertainties from the learned dynamic model lead to stochastic nonlinear dynamics, 2) the safety constraints are formulated as chance constraints that are generally non-convex, 3) the performance objective may be a min–max stochastic optimal control problem, e.g., maximum exploration with minimum control effort, and 4) the propagation errors and safety violations need to be quantified when the controller is computed with estimated dynamics.
In this paper, we systematically address the aforementioned challenges by proposing an end-to-end episodic framework that unifies learning, planning, and exploration for active and safe data collection for continuous-time dynamical systems, as shown in Fig. 1. The key contributions of the present paper are summarized as follows: a) We use a multivariate robust regression model under a covariate shift constraint to compute the multi-dimensional uncertainty estimates of the unknown dynamics; b) We propose a novel iterative solution method for the Information Stochastic Nonlinear Optimal Control (Info-SNOC) problem to plan a pool of sub-optimal safe and informative trajectories with the learned approximation of the dynamics. We build on a recent method to solve the chance-constrained SNOC problem by projecting it to the generalized polynomial chaos (gPC) space; and c) We prove the safety of rollouts from our exploration method and the reduction in uncertainty over epochs, ensuring consistency of our learning method under mild assumptions. A rollout is defined as executing the computed safe trajectory and policy using a stable feedback controller. To ensure real-time safety, the feedback controller is augmented with a safety filter.
Safe exploration has been extensively studied in the reinforcement learning domain. For continuous dynamical systems, the problem has been studied using the following three frameworks: learning-based model-predictive control (MPC), dual control, and active dynamics learning.
Learning-based MPC [10, 11, 12, 5] has been studied extensively for controlling the estimated system. These techniques are also applied for planning an information trajectory to learn online. The effect of uncertainty in the learned model on the propagation of dynamics is estimated using methods such as bounding boxes and linearization. We use the generalized polynomial chaos (gPC) expansion for propagation, which converges asymptotically to the original distribution and provides guarantees on constraint satisfaction. In MPC, safety is guaranteed by recursively checking feasibility and appending a safe policy to an exploring policy at the end of each rollout. In contrast, we formulate safety as joint chance constraints for a given risk of constraint violation.
Estimating unknown parameters while simultaneously optimizing for performance has been studied as a dual control problem. Dual control is an optimal control problem formulation that computes a control policy optimized for performance with guaranteed parameter convergence. In some recent work [14, 15], the convergence of the estimate is achieved by using a persistence-of-excitation condition in the optimal control problem. Our method uses Sequential Convex Programming (SCP) [16, 17, 18] to compute the persistent excitation trajectory. Recent work uses nonlinear programming tools to solve optimal control problems with an upper-confidence bound cost for exploration without safety constraints. We follow a similar approach but formulate the planning problem as an SNOC with distributionally robust linear and quadratic chance constraints for safety. The distributionally robust chance constraints are convexified via projection to the gPC space. The algorithm proposed in this paper can be used in the MPC framework with appropriate terminal conditions for feasibility, and to solve dual control problems efficiently using interior-point methods.
The paper is organized as follows. We discuss the SNOC problem with results on deterministic approximations of chance constraints along with preliminaries on robust regression in Sec. II. The Info-SNOC algorithm along with the exploration policy is presented in Sec. III. In Sec. IV, we derive the end-to-end safety guarantees. In Sec. V, we apply the Info-SNOC algorithm to the nonlinear three degree-of-freedom spacecraft robot model. We conclude the paper in Sec. VI with a brief discussion on the results of the analysis and the application of the Info-SNOC method.
II Preliminaries and Problem Definition
II-A Robust Regression for Learning
Learning dynamics is regarded as a regression problem under covariate shift. Covariate shift is a special case of distribution shift between training and testing data distributions, where the conditional output distribution given the input variable remains the same while the input distribution differs between training and testing. We refer to them as the source distribution and the target distribution . An exploration step in active data collection for learning dynamics is a covariate shift problem. We use and to represent input and output of the learning model. Robust regression is derived from a min–max adversarial estimation framework, where the estimator tries to minimize a loss function and the adversary tries to maximize the loss under statistical constraints. The minimax framework derives a model that is robust to the worst-case possible data by generating a conditional distribution that is “compatible” with finite training data, while minimizing a loss function defined on a testing data distribution. Robust regression can handle multivariate outputs and their correlations efficiently by incorporating neural networks and predicting a multivariate Gaussian distribution directly, whereas traditional methods like Gaussian process regression scale poorly to high dimensions and require heavy tuning of kernels.
The resulting Gaussian distributions provided by this learning framework are given below. For more technical details, we refer the readers to the prior work [6, 22]. The output Gaussian distribution takes the form , where
with a non-informative base distribution , is the model parameter learned from data, and is the feature function. The features can be learned using neural networks directly from data. The density ratio is estimated from data beforehand.
II-B Optimal and Safe Planning Problem
In this section, we present the finite-time chance-constrained stochastic optimal control problem formulation  used to design an informative trajectory. The optimization has control effort and terminal cost as performance objectives, and the safety is modelled as joint chance constraints. The full stochastic optimal control problem is as follows:
where any realization of the random variable, and are initial and terminal state distributions respectively, the control is deterministic, is the learned probabilistic model, and is the expectation operator. The modelling assumptions and the problem formulation will be elaborated in the following sections.
II-B1 Dynamical Model
The term of (3) is the estimated model of the unknown term of the original dynamics:
where the state is now considered deterministic, and the functions and are Lipschitz with respect to and .
The control set is convex and compact.
The maximum entropy distribution with the known mean and covariance matrix of the random variable is the Gaussian distribution . We follow this notation for mean and covariance matrix for the remainder of the paper.
The learning algorithm computes the mean vector and the covariance matrix estimates of that are functions of the mean of the state and control . Due to Remark 1, the unknown bias term is modeled as a multivariate Gaussian distribution . The estimate in (3) can be expressed as
The existence and uniqueness of a solution to the SDE for a given initial distribution and control trajectory such that with measure , is guaranteed by the following assumptions: (a) Lipschitz Condition: There exists a constant such that any realization and
(b) Restriction on Growth: There exists a constant , is the Frobenius norm such that .
The approximate system (9) is controllable in the given feasible space.
II-B2 State and Safety Constraints
Safety is defined as a constraint on the state space , at time . The safe set is relaxed by formulating a joint chance constraint with risk of constraint violation as
The constant is called the risk measure of a chance constraint in this paper. We consider the polytopic constraint set with flat sides and a quadratic constraint set for any realization of the state. The joint chance constraint can be transformed to the following individual chance constraint:
such that . Here we use . The individual risk measure can be optimally allocated between the constraints as discussed in . A quadratic chance constraint is given as
Evaluating these chance constraints directly is computationally intractable due to the multi-modal probability density function of the state and the multi-dimensional integrals involved. We use a tractable conservative approximation by formulating a distributionally robust chance constraint for known mean and variance of the random variable .
The linear distributionally robust chance constraint is equivalent to the deterministic constraint:
See . ∎
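As an illustrative sketch of the two steps above, the following Python code splits a joint risk budget evenly across the faces of a polytope and evaluates the standard distributionally robust tightening for a linear chance constraint with known mean and covariance. The constraint convention `a @ x + b <= 0`, the uniform allocation, and all function names are assumptions made for this example; the exact expressions in the cited references may differ.

```python
import numpy as np

def allocate_risk_uniform(eps_joint, num_constraints):
    """Split a joint risk budget evenly so that the individual risks sum to eps_joint."""
    return np.full(num_constraints, eps_joint / num_constraints)

def dr_linear_margin(a, b, mean, cov, eps):
    """Left-hand side of a^T mu + b + sqrt((1 - eps) / eps) * sqrt(a^T Sigma a) <= 0.

    A non-positive value means Pr(a^T x + b <= 0) >= 1 - eps for every
    distribution with the given mean and covariance (assumed convention).
    """
    a = np.asarray(a, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    kappa = np.sqrt((1.0 - eps) / eps)
    return float(a @ mean + b + kappa * np.sqrt(a @ cov @ a))

def dr_polytope_is_safe(A, b, mean, cov, eps_joint):
    """Check every face of the polytope {x : A x + b <= 0} with its share of the risk."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    b = np.asarray(b, dtype=float)
    eps_i = allocate_risk_uniform(eps_joint, A.shape[0])
    margins = [dr_linear_margin(A[i], b[i], mean, cov, eps_i[i])
               for i in range(A.shape[0])]
    return all(m <= 0.0 for m in margins), margins

# Example: the box |x_1| <= 5, |x_2| <= 5 written as four half-spaces.
A_box = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b_box = -5.0 * np.ones(4)
safe_tight, _ = dr_polytope_is_safe(A_box, b_box, np.zeros(2), 0.01 * np.eye(2), 0.04)
safe_loose, _ = dr_polytope_is_safe(A_box, b_box, np.zeros(2), np.eye(2), 0.04)
```

The tightening term grows as the allowed risk shrinks, so a large learned covariance can make the constraint infeasible even when the mean lies well inside the safe set.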
The semi-definite constraint on the variance
is a conservative deterministic approximation of the quadratic chance-constraint .
See . ∎
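One standard route to a conservative deterministic bound of this kind is Markov's inequality applied to the quadratic form, which yields a trace condition on the covariance. The sketch below assumes a constraint of the form Pr((x - mu)^T A (x - mu) >= c) <= eps with positive semi-definite `A`; this form, the function names, and the Gaussian sampling used for the sanity check are assumptions for illustration rather than the exact statement of the lemma.

```python
import numpy as np

def quadratic_violation_bound(A, cov, c):
    """Markov bound: Pr((x - mu)^T A (x - mu) >= c) <= trace(A Sigma) / c."""
    return float(np.trace(np.asarray(A) @ np.asarray(cov))) / c

def quadratic_constraint_holds(A, cov, c, eps):
    """Conservative deterministic check of the quadratic chance constraint."""
    return quadratic_violation_bound(A, cov, c) <= eps

def empirical_violation(A, mean, cov, c, n_samples=20000, seed=0):
    """Monte Carlo estimate of the violation probability for a Gaussian state."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mean, cov, size=n_samples)
    d = x - np.asarray(mean)
    q = np.einsum("ni,ij,nj->n", d, np.asarray(A), d)
    return float(np.mean(q >= c))

A_quad = np.eye(2)
cov_small = 0.01 * np.eye(2)
bound_small = quadratic_violation_bound(A_quad, cov_small, c=1.0)
freq_small = empirical_violation(A_quad, np.zeros(2), cov_small, c=1.0)
```

The bound holds for any distribution with the given covariance, which is why it is typically much larger than the empirical violation frequency observed for a Gaussian.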
The risk measures and are assumed to be given.
II-B3 Cost Functional
The integrand cost functional includes two objectives: 1) Exploration: to achieve maximum value of information for learning the unknown dynamics , and 2) Performance: to achieve fuel optimality. The cost functional is split into and corresponding to the performance cost and the information cost for exploration, respectively. The cost functional is of the following form, and is convex in .
We use the following Gaussian Upper Confidence Bound (UCB)  based information cost for exploration, for each element in and diagonal element in .
The information cost in (17) is a functional of the mean of the state and control at time . The coefficient is chosen based on the confidence interval. By minimizing the cost, we maximize the information available in the trajectory to learn the unknown model . The cost functional includes the mean estimate to implicitly trade off exploration and exploitation during the trajectory design. The terminal cost functional is quadratic in the state , , where is a positive semi-definite function. For the rest of the paper we consider the following individual chance-constrained problem,
that is assumed to have a feasible solution with .
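To make the role of the UCB-based information cost concrete, the sketch below evaluates a generic Gaussian UCB score, mean plus a confidence multiple of the standard deviation, summed over output dimensions and negated so that minimizing the cost favors informative points. This generic form, the name `kappa` for the confidence coefficient, and the function name are assumptions; the precise expression in (17) is not reproduced here.

```python
import numpy as np

def ucb_information_cost(pred_mean, pred_cov, kappa=2.0):
    """Negative Gaussian UCB summed over output dimensions (assumed generic form).

    pred_mean: predicted mean vector of the learned model at one (state, control).
    pred_cov:  predicted covariance matrix; only its diagonal is used.
    kappa:     confidence-interval coefficient (hypothetical name).
    """
    pred_mean = np.asarray(pred_mean, dtype=float)
    std = np.sqrt(np.diag(np.asarray(pred_cov, dtype=float)))
    return -float(np.sum(pred_mean + kappa * std))

# Two candidate points with the same predicted mean but different uncertainty.
cost_certain = ucb_information_cost([0.1, -0.2], np.diag([0.04, 0.09]))
cost_uncertain = ucb_information_cost([0.1, -0.2], np.diag([0.16, 0.36]))
```

With equal means, the point with the larger predicted variance has the lower cost, so a planner minimizing this term is drawn toward regions where the learned model is less certain.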
II-C Generalized Polynomial Chaos (gPC)
The problem (18) is projected into a finite-dimensional space using the generalized polynomial chaos algorithm presented in . We briefly review the ideas and equations. We refer the readers to for details on computing the functions and matrices that are mentioned below. In the gPC expansion, any bounded random variable is expressed as the product of the basis function matrix defined on the random variable and the deterministic time-varying coefficients . The functions are chosen to be Gauss-Hermite polynomials for this paper. Let .
A Galerkin projection is used to transform the SDE (9) to the deterministic equation,
The exact form of the function is given in . The moments of the random variable, and can be expressed as polynomial functions of elements of . The polynomial functions defined in are used to project the distributionally robust linear chance constraint in (14) to a second-order cone constraint in as follows:
given that is a diagonal matrix with diagonal element as , where is an inner product operation. Similarly, we project the initial and terminal state constraints and . In the next section, we present the Info-SNOC algorithm by using the projected optimal control problem.
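The statement that the moments are polynomial functions of the gPC coefficients can be checked on a small example. The sketch below assumes a scalar random variable expanded in probabilists' Hermite polynomials of a single standard normal germ, for which the mean is the zeroth coefficient and the variance is a weighted sum of squares of the remaining coefficients. The scalar setting and the function names are simplifying assumptions; the paper's multivariate basis is not reproduced.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def gpc_mean_var(coeffs):
    """Mean and variance of x = sum_n c_n He_n(xi), xi ~ N(0, 1).

    Uses the orthogonality relation E[He_m(xi) He_n(xi)] = n! if m == n, else 0.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    norms = np.array([factorial(n) for n in range(len(coeffs))], dtype=float)
    mean = float(coeffs[0])
    var = float(np.sum(coeffs[1:] ** 2 * norms[1:]))
    return mean, var

def gpc_mean_var_quadrature(coeffs, order=20):
    """Same moments computed by Gauss-Hermite quadrature as a cross-check."""
    nodes, weights = He.hermegauss(order)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to the N(0, 1) density
    values = He.hermeval(nodes, np.asarray(coeffs, dtype=float))
    mean = float(np.sum(weights * values))
    var = float(np.sum(weights * (values - mean) ** 2))
    return mean, var

mean_a, var_a = gpc_mean_var([1.0, 0.5, 0.2])
mean_q, var_q = gpc_mean_var_quadrature([1.0, 0.5, 0.2])
```

Because the variance is quadratic in the coefficients, a bound of the form mean plus a multiple of the standard deviation becomes a second-order cone constraint in the coefficient vector.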
III Info-SNOC Main Algorithm
In this section, we present the main algorithm of the paper by projecting (18) to the gPC space. We formulate a deterministic optimal control problem in the gPC space and solve it using the Sequential Convex Programming (SCP) method [7, 17, 18]. The gPC projection of (18) is given by the following equation
In SCP, the projected dynamics (20) is linearized about a feasible nominal trajectory and discretized to formulate a linear equality constraint. Note that the constraints (21) and (22) are already convex in the states . The terminal constraint is projected to the quadratic constraint . The information cost functional from (17) is expressed as a function of by using the polynomial representation  of in terms of . Let and the cost is linearized around a feasible nominal trajectory to derive a linear convex cost functional :
We use the convex approximation as the information cost in the SCP formulation of the optimal control problem in (23). In the gPC space, we split the problem into two cases: a) that computes a performance trajectory, and b) that computes an information trajectory, in order to have stable iterations. The main algorithm is outlined below.
An initial estimate of the model (8), learned from data generated by a known safe control policy, and a nominal initial trajectory are used to initialize Algorithm 1. The stochastic model and the chance constraints are projected to the gPC state space in line 2 of Algorithm 1. The projected dynamics is linearized around the nominal trajectory and used as a constraint in the SCP. The projection step is only needed in the first epoch, and the projected system can be directly reused in subsequent epochs. The current estimated model is used to solve (23) using SCP, in line 7 with = 0, for a performance trajectory. The output of this optimization is used as initialization to the Info-SNOC problem obtained by setting . The information trajectory is then sampled for a safe motion plan in line 9, which is used for rollout in line 10 to collect more data for learning. The SCP step is performed in the gPC space . After each SCP step, the gPC space coordinates are projected back to the random variable space. The algorithm outputs a trajectory of random variables with finite variance at each epoch.
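The episodic structure described above can be summarized as a short control-flow skeleton. In the sketch below, `learn`, `plan`, `sample`, and `rollout` are hypothetical callables standing in for robust regression, the SCP solve of (23), motion-plan sampling, and closed-loop execution; `info_weight` is a placeholder name for the weight on the information cost. Warm-starting the next epoch from the previous performance trajectory is an assumption of this sketch, and the toy stubs exist only to make the loop runnable.

```python
def run_info_snoc_episodes(data, nominal, learn, plan, sample, rollout,
                           n_epochs, info_weight=1.0):
    """Skeleton of the learn / plan / sample / rollout loop (illustrative only)."""
    model = learn(data)  # initial model from safely collected data
    log = []
    for epoch in range(n_epochs):
        # Step 1: performance trajectory, information weight set to zero.
        performance_traj = plan(model, nominal, 0.0)
        # Step 2: information trajectory, initialized at the performance trajectory.
        information_traj = plan(model, performance_traj, info_weight)
        # Step 3: draw one motion plan and execute it to collect new data.
        motion_plan = sample(information_traj)
        data = data + rollout(motion_plan)
        # Step 4: relearn the model; warm-start the next epoch (assumption).
        model = learn(data)
        nominal = performance_traj
        log.append({"epoch": epoch, "num_samples": len(data)})
    return model, data, log

# Toy stand-ins: scalars replace trajectories so the control flow can be executed.
toy_learn = lambda d: sum(d) / len(d)
toy_plan = lambda model, init, weight: init + weight
toy_sample = lambda traj: traj
toy_rollout = lambda motion_plan: [motion_plan]

toy_model, toy_data, toy_log = run_info_snoc_episodes(
    [0.0], 0.0, toy_learn, toy_plan, toy_sample, toy_rollout, n_epochs=3)
```

Solving for the performance trajectory first and then reusing it as the initial guess mirrors the two-stage ordering in the text, which is reported to keep the SCP iterations numerically stable.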
Convergence and Optimality
The information trajectory computed using SCP with the approximate linear information cost (24) is a sub-optimal solution of (23) with the optimal cost value . Therefore, the optimal cost of (23) given by is bounded above by , . For the Info-SNOC algorithm, we cannot guarantee the convergence of SCP iterations to a Karush-Kuhn-Tucker point using the method in [18, 16] due to the non-convexity of . Due to the non-convex cost function , the linear approximation of the cost can potentially lead to numerical instability in SCP iterations. Finding an initial performance trajectory, , and then optimizing for information, , is observed to be numerically stable compared to directly optimizing for the original cost functional . The two advantages of this approach are: 1) we compute a fuel efficient trajectory that is also informative, and 2) SCP iterations are stable as we are searching around a feasible trajectory.
The initial phases of learning might lead to a large covariance due to insufficient data, resulting in an infeasible optimal control problem. To overcome this, we use two strategies: 1) explore the initial safe set until we find a feasible solution to the problem, and 2) use slack variables on the terminal condition to approximately reach the goal while accounting for a large variance. The feasibility of SCP iterations is ensured by using the techniques discussed in Sec. IV.E of .
III-A Rollout Policy Implementation
The information trajectory computed using the Info-SNOC algorithm is sampled for a pool of motion plans . The trajectory pool is computed by randomly sampling the multivariate Gaussian distribution and transforming it using the gPC expansion . For any realization of , we get a deterministic trajectory that is safe with respect to the distributionally robust chance constraints. The trajectory is executed using the closed-loop control law for rollout, where is the current state. The properties of the control law and the safety during rollout are studied in the following section.
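The sampling step can be illustrated for a scalar state with a single standard normal germ: each draw of the germ is pushed through the gPC expansion at every time step, giving one deterministic trajectory per draw. The array layout (time steps by polynomial order), the probabilists' Hermite basis, and the function name are assumptions made for this sketch.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def sample_trajectory_pool(coeffs, n_plans, seed=0):
    """Sample deterministic trajectories from time-varying gPC coefficients.

    coeffs: array of shape (T, P + 1); row t holds the coefficients at time step t.
    Returns the sampled germ values (n_plans,) and the pool (n_plans, T).
    """
    coeffs = np.asarray(coeffs, dtype=float)
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(n_plans)                # one germ draw per motion plan
    basis = He.hermevander(xi, coeffs.shape[1] - 1)  # He_n(xi), shape (n_plans, P + 1)
    return xi, basis @ coeffs.T

# Three time steps; the last one has zero higher-order coefficients (no uncertainty).
coeffs_demo = np.array([[0.0, 0.1, 0.0],
                        [1.0, 0.2, 0.05],
                        [2.0, 0.0, 0.0]])
xi_demo, pool_demo = sample_trajectory_pool(coeffs_demo, n_plans=5)
```

Every row of the pool shares the same germ value across time, so each sampled trajectory is a consistent realization of the planned distribution rather than independent noise at each step.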
In this section, we present the main theoretical results analyzing the following two questions: 1) at any epoch how do learning errors translate to safety violation bounds during rollout, and 2) consistency of the multivariate robust regression as epoch .
The following bounds are satisfied with high probability for the same input to the original model , and the learned model . The also satisfies the bound shown below
As shown in , the mean predictions made by the model learned using robust regression are bounded by , which depends on the choice of the function class for learning. The variance prediction is bounded by design. With Assumptions 3 and 4, the analysis is decomposed into the following three subsections.
IV-A State Error Bounds During Rollout
We make the following assumptions on the nominal system to derive the state tracking error bound during rollout.
There exists a globally exponentially stable (i.e., finite-gain stable) tracking control law for the nominal dynamics . The control law satisfies the property for any sampled trajectory from the information trajectory . At any time the state satisfies the following inequality, when the closed-loop control is applied to the nominal dynamics,
where , , is a uniformly positive definite matrix with , and and
are the maximum and minimum eigenvalues.
The unknown model satisfies the bound .
The probability density ratio is bounded, where the functions , are probability density functions for at time and respectively.
Given that the estimated model (8) satisfies Assumption 4, and the systems (7) and (9) satisfy Assumptions 5, 6, and 7, if the control is applied to the system (7), then the following condition holds at time
See . ∎
IV-B Safety Bounds
Consider the expectation of the ellipsoidal set . Using , the expectation of the set can be expressed as follows
Using the following equality:
Note that if the learning method converges, i.e., , , and , then . The quadratic constraint violation in (28) depends on the tracking error and the size of the ellipsoidal set described by .
From Lemma 1, the feasible solution satisfies the equivalent condition , where for the risk measure , mean and covariance . Consider the similar condition for the actual trajectory , , as shown below.
Note that since the system (7) is deterministic, we have and . Using , the right hand side of the above inequality reduces to the following:
Using the decomposition , Cauchy-Schwarz’s inequality, and Jensen’s inequality, we have . Using the sub-multiplicative property of -norm in the inequality above, we have . Assuming that in the above inequality, we have
The linear constraint is offset by leading to constraint violation of the original formulation (12). Note that, if , , , and then . In order to ensure real-time safety during trajectory tracking, we use a high gain control for disturbance attenuation with safety filter augmentation for constraint satisfaction.
Data is collected during the rollout of the dynamical system to learn a new model for the next epoch. For epoch , the predictor is a multivariate Gaussian distribution and is the empirical true data. We assume that the set generated by the optimization problem in (18) for the first iterations is a discretization of . Assuming that there exists a global optimal predictor in the function class that can achieve the best error at each epoch :
the consistency of the learning algorithm is defined as
and proven in the following theorem.
The squared error of from the optimal predictor is . We first decompose and bound it using the inequality , , since as follows:
where , is defined in , and is the unit vector in dimensions. This is the tail-probability inequality for multivariate Gaussian distributions. The error is then bounded using the empirical prediction error and the best prediction error (33) as follows:
For converged robust regression learning in epoch , we have , . The bound on is made more concrete as
where is a hyper-parameter that is associated with model selection in robust regression in epoch . Using the Cauchy-Schwarz inequality, we have . If , we have achieved. Therefore, to guarantee the learning consistency and needs to be upper bounded in planning. The Info-SNOC approach automatically outputs trajectories of finite variance. ∎