I Introduction
Reinforcement learning (RL) is a learning framework that addresses sequential decisionmaking problems, wherein an ‘agent’ or a decision maker learns a policy to optimize a longterm reward by interacting with the (unknown) environment. At each step, the RL agent obtains evaluative feedback (called reward or cost) about the performance of its action, allowing it to improve the performance of subsequent actions [sutton1998reinforcement, vrabie2013optimal]. Although RL has witnessed huge successes in recent times [silver2016mastering, silver2017mastering], there are several unsolved challenges, which restrict the use of these algorithms for industrial systems. In most practical applications, control policies must be designed to satisfy operational constraints, and a satisfactory policy should be learnt in a dataefficient fashion [vamtutorial].
Modelbased reinforcement learning (MBRL) methods [deisenroth2011pilco] learn a model from exploration data of the system, and then exploit the model to synthesize a trajectorycentric controller for the system [levine2013guided]. These techniques are, in general, harder to train, but could achieve good data efficiency [levine2016end]
. Learning reliable models is very challenging for nonlinear systems and thus, the subsequent trajectory optimization could fail when using inaccurate models. However, modern machine learning methods such as Gaussian processes (GP), stochastic neural networks (SNN), etc. can generate uncertainty estimates associated with predictions
[rasmussen2003gaussian, romeres2019semiparametrical]. These uncertainty estimates could be used to estimate the confidence set of system states at any step along a given controlled trajectory for the system. The idea presented in this paper considers the stabilization of the trajectory using a local feedback policy that acts as an attractor for the system in the known region of uncertainty along the trajectory [tedrake2010lqr].We present a method for simultaneous trajectory optimization and local policy optimization, where the policy optimization is performed in a neighborhood (local sets) of the system states along the trajectory. These local sets could be obtained by a stochastic function approximator (e.g., GP, SNN, etc.) that used to learn the forward model of the dynamical system. The local policy is obtained by considering the worstcase deviation of the system from the nominal trajectory at every step along the trajectory. Performing simultaneous trajectory and policy optimization could allow us to exploit the modeling uncertainty as it drives the optimization to regions of low uncertainty, where it might be easier to stabilize the trajectory. This allows us to constrain the trajectory optimization procedure to generate robust, highperformance controllers. The proposed method automatically incorporates state and input constraints on the dynamical system.
Contributions. The main contributions of the current paper are:

We present a novel formulation of simultaneous trajectory optimization and timeinvariant local policy synthesis for stabilization.

We present analysis of the proposed technique that allows us to analytically derive the gradient of the robustness constraint for the optimization problem.
It is noted that this paper only presents the controller synthesis part for MBRL – a more detailed analysis of the interplay between model uncertainties and controller synthesis is deferred to another publication.
Ii Related Work
MBRL has raised a lot of interest recently in robotics applications, because model learning algorithms are largely task independent and dataefficient [2019arXiv190702057W, levine2016end, deisenroth2011pilco]. However, MBRL techniques are generally considered to be hard to train and likely to result in poor performance of the resulting policies/controllers, because the inaccuracies in the learned model could guide the policy optimization process to lowconfidence regions of the state space. For nonlinear control, the use of trajectory optimization techniques such as differential dynamic programming [jacobson1968new] or its firstorder approximation, the iterative Linear Quadratic Regulator (iLQR) [tassa2012synthesis] is very popular, as it allows the use of gradientbased optimization, and thus could be used for highdimensional systems. As the iLQR algorithm solves the local LQR problem at every point along the trajectory, it also computes a sequence of feedback gain matrices to use along the trajectory. However, the LQR problem is not solved for ensuring robustness, and furthermore the controller ends up being timevarying, which makes its use somewhat inconvenient for robotic systems. Thus, we believe that the controllers we propose might have better stabilization properties, while also being timeinvariant.
Most modelbased methods use a function approximator to first learn an approximate model of the system dynamics, and then use stochastic control techniques to synthesize a policy. Some of the seminal work in this direction could be found in [levine2016end, deisenroth2011pilco]. The method proposed in [levine2016end]
has been shown to be very effective in learning trajectorybased local policies by sampling several initial conditions (states) and then fitting a neural network which can imitate the trajectories by supervised learning. This can be done by using ADMM
[boyd2011distributed] to jointly optimize trajectories and learn the neural network policies. This approach has achieved impressive performance on several robotic tasks [levine2016end]. The method has been shown to scale well for systems with higher dimensions. Several different variants of the proposed method were introduced later [chebotar2017path, montgomery2016guided, nagabandi2018neural]. However, no theoretical analysis could be provided for the performance of the learned policies.Another set of seminal work related to the proposed work is on the use of sumofsquare (SOS) programming methods for generating stabilizing controller for nonlinear systems [tedrake2010lqr]. In these techniques, a stabilizing controller, expressed as a polynomial function of states, for a nonlinear system is generated along a trajectory by solving an optimization problem to maximize its region of attraction [majumdar2013control].
Some other approaches to trajectorycentric policy optimization could be found in [theodorou2010generalized]. These techniques use path integral optimal control with parameterized policy representations such as dynamic movement primitives (DMPs) [ijspeert2013dynamical] to learn efficient local policies [williams2017information]. However, these techniques do not explicitly consider the local sets where the controller robustness guarantees could be provided, either. Consequently, they cannot exploit the structure in the model uncertainty. A workshop version of our paper could be found here [jha2019robust].
Iii Problem Formulation
In this section, we describe the problem studied in the rest of the paper. To perform trajectorycentric control, we propose a novel formulation for simultaneous design of openloop trajectory and a timeinvariant, locally stabilizing controller that is robust to bounded model uncertainties and/or system measurement noise. As we will present in this section, the proposed formulation is different from that considered in the literature in the sense it allows us to exploit sets of possible deviation of a system to design stabilizing controller.
Iiia Trajectory Optimization as Nonlinear Program
Consider the discretetime dynamical system
(1) 
where , are the differential states and controls, respectively. The function governs the evolution of the differential states. Note that the discretetime formulation (1) can be obtained from a continuous time system by using the explicit Euler integration scheme where is the timestep for integration.
For clarity of exposition we have limited our focus to discretetime dynamical systems of the form in (1) although the techniques we describe can be easily extended to implicit discretization schemes.
In typical applications the states and controls are restricted to lie in sets and , i.e. . We use to denote the index set . Further, there may exist nonlinear inequality constraints of the form
(2) 
with . The inequalities in (2) are termed as path constraints. The trajectory optimization problem is to manipulate the controls over a certain number of time steps so that the resulting trajectory minimizes a cost function . Formally, we aim to solve the trajectory optimization problem
(TrajOpt)  
s.t.  
where is the differential state at initial time . Before introducing the main problem of interest, we would like to introduce some notations.
In the following text, we use the following shorthand notation, . We denote the nominal trajectory as , . The actual trajectory followed by the system is denoted as . We denote a local policy as , where is the policy and denotes the parameters of the policy. The trajectory cost is also sometimes denoted as .
IiiB Trajectory Optimization with Local Stabilization
This subsection introduces the main problem of interest in this paper. A schematic of the problem studied in the paper is also shown in Figure LABEL:fig:problem_definition. In the rest of this section, we will describe how we can simplify the trajectory optimization and local stabilization problem and turn it into an optimization problem that can be solved by standard nonlinear optimization solvers.
Consider the case where the system dynamics, is only partially known, and the known component of is used to design the controller. Consider the deviation of the system at any step ’’ from the state trajectory and denote it as . We introduce a local (timeinvariant) policy that regulates the local trajectory deviation and thus, the final controller is denoted as . The closedloop dynamics for the system under this control is then given by the following:
(3) 
The main objective of the paper is to find the timeinvariant feedback policy that can stabilize the openloop trajectory locally within where defines the set of uncertainty for the deviation . The uncertainty region can be approximated by fitting an ellipsoid to the uncertainty estimate using a diagonal positive definite matrix such that . The general optimization problem that achieves that is proposed as:
(4)  
where denotes the known part of the model. Note that in the above equation, we have introduced additional optimization parameters corresponding to the policy when compared to TrajOpt in the previous section. However, to solve the above, one needs to resort to sampling in order to estimate the expected cost. Instead we introduce a constraint that solves for the worstcase cost for the above problem.
Robustness Certificate. The robust trajectory optimization problem is to minimize the trajectory cost while at the same time satisfying a robust constraint at every step along the trajectory. This is also explained in Figure LABEL:fig:problem_definition, where the purpose of the local stabilizing controller is to push the maxdeviation state at every step along the trajectory to tolerance balls around the trajectory. Mathematically, we express the problem as following:
(RobustTrajOpt)  
s.t.  
The additional constraint introduced in RobustTrajOpt allows us to ensure stabilization of the trajectory by estimating parameters of the stabilizing policy . It is easy to see that RobustTrajOpt solves the worstcase problem for the optimization considered in (4). However, RobustTrajOpt
introduces another hyperparameter to the optimization problem,
. In the rest of the paper, we refer to the following constraint as the robust constraint:(5) 
Solution of the robust constraint for generic nonlinear system is out of scope of this paper. Instead, we linearize the trajectory deviation dynamics as shown in the following Lemma.
Lemma 1.
The trajectory deviation dynamics approximated locally around the optimal trajectory are given by
(6)  
Proof.
Use Taylor’s series expansion to obtain the desired expression. ∎
To ensure feasibility of the RobustTrajOpt problem and avoid tuning the hyperparameter , we make another relaxation by removing the robust constraint from the set of constraints and move it to the objective function. Thus, the simplified robust trajectory optimization problem that we solve in this paper can be expressed as following (we skip the state constraints to save space).
(RelaxedRobustTrajOpt)  
s.t. 
where the term is defined as following after linearization.
(7)  
Note that the matrix allows to weigh states differently. In the next section, we present the solution approach to compute the gradient for the RelaxedRobustTrajOpt which is then used to solve the optimization problem. Note that rthis esults in simultaneous solution to openloop and the stabilizing policy .
Iv Solution Approach
This section introduces the main contribution of the paper, which is a local feedback design that regulates the deviation of an executed trajectory from the optimal trajectory generated by the optimization procedure.
To solve the optimization problem presented in the last section, we first need to obtain the gradient information of the robustness heuristic that we introduced. However, calculating the gradient of the robust constraint is not straightforward, because the
function is nondifferentiable. The gradient of the robustness constraint is computed by the application of Dankins Theorem [bertsekas1997nonlinear], which is stated next.Dankin’s Theorem: Let be a nonempty, closed set and let be a nonempty, open set. Assume that the function is continuous on and that exists and is continuous on . Define the function by
and
Let
be a given vector. Suppose that a neighborhood
of exists such that is nonempty for all and the set is bounded. The following two statements (a) and (b) are valid.
[(a)]

The function is directionally differentiable at and

If reduces to a singleton, say , then is Geaux differentiable at and
Proof See [FacchineiPangVol2], Theorem .
Dankin’s theorem allows us to find the gradient of the robustness constraint by first computing the argument of the maximum function and then evaluating the gradient of the maximum function at the point. Thus, in order to find the gradient of the robust constraint (5), it is necessary to interpret it as an optimization problem in , which is presented next. In Section IIIB, we presented a general formulation for the stabilization controller , where are the parameters that are obtained during optimization. However, solution of the general problem is beyond the scope of the current paper. Rest of this section considers a linear for analysis.
Lemma 2.
Assume the linear feedback . Then, the constraint (7) is quadratic in ,
(8)  
where is shorthand notation for
(9) 
Proof.
The next lemma is one of the main results in the paper. It connects the robust trajectory tracking formulation RelaxedRobustTrajOpt with the optimization problem that is well known in the literature.
Lemma 3.
The worstcase measure of deviation is
where
denotes the maximum eigenvalue of a matrix and
denotes the spectral norm of a matrix. Moreover, the worstcase deviationis the corresponding maximum eigenvector
(11)  
Proof.
Apply coordinate transformation in (8) and write
(12)  
Since is positive semidefinite, the maximum lies on the boundary of the set defined by the inequality. Therefore, the problem is equivalent to
(13)  
The formulation (13) is a special case with a known analytic solution. Specifically, the maximizing deviation that solves (13) is the maximum eigenvector of , and the value at the optimum is the corresponding eigenvalue. ∎
This provides us with the maximum deviation along the trajectory at any step ’k’, and now we can use Danskin’s theorem to compute the gradient which is presented next.
Theorem 1.
Introduce the following notation, . The gradient of the robust inequality constraint with respect to an arbitrary vector is
Where is maximum trajectory deviation introduced in Lemma 3.
Proof.
Start from the definition of gradient of robust constraint
Use Danskin’s Theorem and the result from Lemma 3 to write the gradient of robust constraint with respect to an arbitrary ,
which completes the proof. ∎
The gradient computed from Theorem 1 is used in solution of the RelaxedRobustTrajOpt– however, this is solved only for a linear controller. The next section shows some results in simulation and on a real physical system.
V Experimental Results
In this section, we present some results using the proposed algorithm for an underactuated inverted pendulum, as well as on a experimental setup for a ballandbeam system. We use a Python wrapper for the standard interior point solver IPOPT to solve the optimization problem discussed in previous sections. We perform experiments to evaluate the following questions:

Can an offtheshelf optimization solver find feasible solutions to the joint optimization problem described in the paper?

Can the feedback controller obtained by this optimization stabilize the openloop trajectory in the presence of bounded uncertainties?

How good is the performance of the controller on a physical system with unknown system parameters ?
In the following sections, we try to answer these questions using simulations as well as experiments on real systems.
Va Simulation Results for Underactuated Pendulum
The objective of this subsection is twofold: first, to provide insight into the solution of the optimization problem; and second, to demonstrate the effectiveness of that solution in the stabilization of the optimal trajectory.
For clarity of presentation, we use an underactuated pendulum system, where trajectories can be visualized in state space. The dynamics of the pendulum is modeled as . The continuoustime model is discretized as . The goal state is , and the initial state is and the control limit is . The cost is quadratic in both the states and input. The initial solution provided to the controller is trivial (all states and control are ). The number of discretization points along the trajectory is , and the discretization time step is . The cost weight on robust slack variables is selected to be . The uncertainty region is roughly estimated as along the trajectory. Detailed analysis on uncertainty estimation based on Gaussian processes is deferred to future work, due to space limits. The optimization procedure terminates in iterations with the static solution .
The controller generated by the optimization procedure is then tested in simulation, with noise added to each state of the pendulum model at each step of simulation as with and .
We tested the controller over several settings and found that the underactuated setting was the most challenging to stabilize. In Figure LABEL:fig:state_space, we show the statespace trajectory for the controlled (underactuated) system with additional noise as described above. As seen in the plot, the openloop controller becomes unstable with the additional noise, while the proposed controller can still stabilize the whole trajectory.
In Figure LABEL:fig:control_pendulum, we show the control inputs, the timeinvariant feedback gains obtained by the optimization problem. We also the timevarying LQR gains obtained along the trajectory to show provide some insight between the two solutions. As the proposed optimization problem is finding the feedback gain for the worstcase deviation from the trajectory, the solutions are different than the LQRcase. Next, in Figure LABEL:fig:error_pendulum, we plot the error statistics for the controlled system (in the underactuated setting) over different uncertainty balls using each sample for each ball. We observe that the steadystate error goes to zero and the closedloop system is stable along the entire trajectory. As we are using a linear approximation of the system dynamics, the uncertainty sets are still small, however the results are indicating that incorporating the full dynamics during stabilization could allow to generate much bigger basins of attraction for the stabilizing controller.
VB Results on BallandBeam System
Next, we implemented the proposed method on a ballandbeam system (shown in Figure 5) [8285376]. The ballandbeam system is a lowdimensional nonlinear system with the nonlinearity due to the dry friction and delay in the servo motors attached to the table (see Figure 5).
The ballandbeam system can be modeled with 4 state variables , where is the position of the ball, is the ball’s velocity, is the beam angle in radians, and is the angular velocity of the beam. The acceleration of the ball, , is given by
where is the mass of the ball,
is the moment of inertia of the ball,
is the radius of the ball, is the coefficient of viscous friction of the ball on the beam, is the coefficient of static (dry) friction of the ball on the beam, and is the acceleration due to gravity. The beam is actuated by a servo motor (position controlled) and an approximate model for its motion is estimated by fitting an autoregressive model.We use this model for the analysis where the ball’s rotational inertia is ignored and we approximately estimate the dry friction. The model is inaccurate, as can be seen from the performance of the openloop controller in Figure LABEL:fig:ballbeamresult. However, the proposed controller is still able to regulate the ball position at the desired goal showing the stabilizing behavior for the system (see the performance of the closedloop controller in Figure LABEL:fig:ballbeamresult
). The plot shows the mean and the standard deviation of the error for
runs of the controller. It can be observed that the mean regulation error goes to zero for the closedloop controller. We believe that the performance of the controller will improve as we improve the model accuracy. In future research, we would like to study the learning behavior for the proposed controller by learning the residual dynamics using GP [romeres2019semiparametrical].Vi Conclusion and Future Work
This paper presents a method for simultaneously computing an optimal trajectory along with a local, timeinvariant stabilizing controller for a dynamical system with known uncertainty bounds. The timeinvariant controller was computed by adding a robustness constraint to the trajectory optimization problem. We prove that under certain simplifying assumptions, we can compute the gradient of the robustness constraint so that a gradientbased optimization solver could be used to find a solution for the optimization problem. We tested the proposed approach that shows that it is possible to solve the proposed problem simultaneously. We showed that even a linear parameterization of the stabilizing controller with a linear approximation of the error dynamics allows us to successfully control nonlinear systems locally. We tested the proposed method in simulation as well as a physical system. Due to space limitations, we have to skip extra results regarding the behavior of the algorithm.
However, the current approach has two limitations– it makes linear approximation of dynamics for finding the worstcase deviation, and secondly, the linear parameterization of the stabilizing controller can be limiting for a lot of robotic systems. In future research, we will incorporate these two limitations using Lyapunov theory for nonlinear control, for better generalization and more powerful results of the proposed approach.