## I Introduction

Recent progress in the reinforcement learning (RL) community [lillicrap2015continuous, mnih2016asynchronous, levine2016end, nagabandi2018neural, openai2018learning] has renewed a debate on the utility and role of models in controlling uncertain robotic systems. In this paper, we present a unifying viewpoint in which RL algorithms provide a mechanism for computing a reference tracking controller. This controller may then be used modularly in a variety of hierarchical control and planning schemes.

Specifically, this paper focuses on tracking desired output trajectories for a special class of nonlinear systems using a technique from geometric control theory known as *feedback linearization*. Feedback linearization renders the input-output behavior of a nonlinear system *linear* via application of an appropriately chosen control law. Desired output trajectories for the plant can then be generated using a linear reference model and tracked using well-established techniques from linear systems theory, such as LQR [kalman1960contributions] or linear MPC [borrelli2017predictive].
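To make the downstream tracking step concrete, the following sketch (an illustration, not code from the paper) computes an infinite-horizon LQR gain for a double-integrator reference model; the penalty matrices `Q` and `R` are illustrative placeholders.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Once feedback linearization yields a chain-of-integrators reference
# model zdot = A z + B v, a stabilizing tracking gain can be obtained
# from an infinite-horizon LQR problem. Q and R are placeholders.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])   # double integrator: y'' = v
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)                # state penalty (illustrative)
R = np.array([[0.1]])        # control penalty (illustrative)

P = solve_continuous_are(A, B, Q, R)   # solve the algebraic Riccati equation
K = np.linalg.solve(R, B.T @ P)        # optimal gain: v = -K (z - z_ref)
```

The closed-loop matrix `A - B @ K` is guaranteed Hurwitz, so the reference-model state converges exponentially to the desired trajectory.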

However, the primary drawback of feedback linearization is that it requires accurate knowledge of the plant’s dynamics. Many real-world robotic systems display dynamics with parameters that may be difficult to identify and nonlinearities which may be impractical to incorporate into a system dynamics model. While there have been extensive efforts to develop robust forms of feedback linearization using combinations of feedback and adaptation [sastry2011adaptive, craig1987adaptive, sastry1989adaptive, nam1988model, kanellakopoulos1991systematic, umlauft2017feedback, chowdhary2014bayesian, chowdhary2013bayesian], current methods in the literature make strong structural assumptions about the plant’s nonlinearities. This is highlighted in the case of multiple-input multiple-output nonlinear systems, for which the above references either assume that there is no coupling between the system inputs, or that a highly structured parametric representation of this coupling is available.

In sharp contrast to these methods, we propose a framework for constructing a linearizing controller for a plant with unknown dynamics using policy optimization algorithms from reinforcement learning. Our approach requires no *a priori* information about the structure of the coupling between the inputs and outputs of the plant. While our approach can naturally incorporate information from a nominal dynamics model into the learning process, it can also be applied when nothing but the structure of the linear reference model is known. Specifically, our approach begins by constructing a linearizing controller for the nominal dynamics model (if available). Then, it augments this nominal controller with an *arbitrarily structured* parametric component.
The parameters of this learned component are trained using a reinforcement signal which encourages actions that better match the desired input-output behavior described by the linear reference model.
We demonstrate that for linearly parameterized controllers, the resulting optimization problem is convex, meaning that globally optimal solutions can be found reliably. Additionally, we present conditions which guarantee that an exact linearizing control law can be recovered.

We evaluate our framework in simulation, where it successfully learns to control a double pendulum (4 dimensional state) and a quadrotor (14 dimensional state) along arbitrary reference trajectories. We also demonstrate our method on a Baxter robot arm. In each case, a single learned linearizing control law can accomplish multiple tasks and track multiple reference signals. We report significant improvements in tracking performance within one hour of training time.

## II Related Work

Most approaches for constructing linearizing control laws for plants with *a priori* unknown dynamics are based on linear adaptive control theory [sastry2011adaptive].
The earliest approaches employing *indirect adaptive control* generally assume that a parameterized model of the plant’s true dynamics is available [sastry2011adaptive, craig1987adaptive, sastry1989adaptive, nam1988model, kanellakopoulos1991systematic].
The model parameters are then updated online deterministically using data collected from the plant, and the refined dynamics model yields an improved linearizing control law.
When accompanied by an appropriate (exponentially stabilizing) feedback law, such methods can be shown to track desired output signals asymptotically on the plant.
A large body of subsequent work [spooner1996stable, chen1995adaptive, bechlioulis2008robust] has extended these results to more general classes of function approximators (e.g., neural networks) to approximate the system's dynamics and improve the linearizing control law. Recent efforts have also investigated the use of nonparametric methods for estimating the plant dynamics [umlauft2017feedback, chowdhary2014bayesian, chowdhary2013bayesian]. Frameworks employing *direct adaptive control* [sanner1992gaussian, wang1993stable] directly parameterize the linearizing controller for the system.
These methods also propose deterministic online update laws and feedback control architectures which ensure asymptotic tracking of desired reference signals. As discussed above, each of these methods makes strong assumptions about the coupling of the input-output dynamics of the system. A notable exception to the above literature is [hwang2003reinforcement], where a temporal differencing scheme is used to learn a linearizing controller for single-input single-output nonlinear systems. We build on this contribution by developing a framework for learning linearizing controllers for multiple-input multiple-output systems and by providing theoretical conditions under which an exact linearizing control law can be constructed.

## III Feedback Linearization

This section outlines how to compute input-output linearizing controllers for a known dynamics model. We refer the reader to [sastry2013nonlinear], [isidori2013nonlinear] for a more thorough introduction. In this paper, we consider square control affine systems of the form

$$\dot{x} = f(x) + g(x)u, \qquad y = h(x), \tag{1}$$

where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^q$ is the input and $y \in \mathbb{R}^q$ is the output. The mappings $f \colon \mathbb{R}^n \to \mathbb{R}^n$, $g \colon \mathbb{R}^n \to \mathbb{R}^{n \times q}$ and $h \colon \mathbb{R}^n \to \mathbb{R}^q$ are each assumed to be smooth. We restrict our attention to a compact subset $D \subset \mathbb{R}^n$ of the state space containing the origin.

### III-A Single-input single-output systems

We begin by introducing feedback linearization for single-input, single-output (SISO) systems (i.e., $q = 1$). In order to construct this control law, we take time derivatives of the output $y$ until the input $u$ appears, and then invert the relationship to enforce linear input-output behavior. We begin by examining the first time derivative of the output:

$$\dot{y} = \frac{\partial h}{\partial x}(x)\big(f(x) + g(x)u\big) \tag{2}$$

$$\dot{y} = L_f h(x) + L_g h(x)\,u. \tag{3}$$

Here the terms $L_f h(x)$ and $L_g h(x)$ are known as *Lie derivatives* [sastry2013nonlinear], and capture the rate of change of $h$ along the vector fields $f$ and $g$, respectively. In the case that $L_g h(x) \neq 0$ for each $x \in D$, we can exactly control $\dot{y}$ on $D$. In particular, consider the control

$$u(x, v) = \frac{1}{L_g h(x)}\big({-L_f h(x)} + v\big), \tag{4}$$

which when applied to the system exactly 'cancels out' the nonlinear portion of the differential equation and enforces the linear relationship $\dot{y} = v$. However, it may be the case that $L_g h(x) \equiv 0$ (that is, the input does not directly affect the first derivative of the output), in which case the control law (4) will be undefined. In general, we can differentiate $y$ multiple times, until the input shows up in one of its higher order derivatives. Assuming that the input does not appear the first $\gamma - 1$ times we differentiate the output, the $\gamma$-th time derivative of $y$ will be of the form

$$y^{(\gamma)} = L_f^{\gamma} h(x) + L_g L_f^{\gamma - 1} h(x)\,u. \tag{5}$$

Here, $L_f^{\gamma} h(x)$ and $L_g L_f^{\gamma - 1} h(x)$ are higher order Lie derivatives. More information on how to compute these nonlinear functions can be found in [sastry2013nonlinear, Chapter 9]. If $L_g L_f^{\gamma - 1} h(x) \neq 0$ for each $x \in D$ then the control law

$$u(x, v) = \frac{1}{L_g L_f^{\gamma - 1} h(x)}\big({-L_f^{\gamma} h(x)} + v\big) \tag{6}$$

enforces the relationship $y^{(\gamma)} = v$. The integer $\gamma$ is referred to as the *relative degree* of the nonlinear system.
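The SISO recipe above can be carried out symbolically. The following sketch (a minimal illustration under assumed unit-parameter pendulum dynamics, not code from the paper) computes the Lie derivatives for a pendulum with output $y = \theta$, finds that the relative degree is two, and forms the linearizing control law in the shape of (6):

```python
import sympy as sp

# Pendulum example: state x = (th, thd), xdot = f(x) + g(x) u, y = th.
# Unit mass/length and no damping are assumptions for illustration.
th, thd, v = sp.symbols('th thd v')
x = sp.Matrix([th, thd])
f = sp.Matrix([thd, -sp.sin(th)])   # drift vector field
g = sp.Matrix([0, 1])               # input enters the angular acceleration
h = th                              # output: joint angle

def lie(vec, scalar):
    """Lie derivative of a scalar function along a vector field."""
    return (sp.Matrix([scalar]).jacobian(x) * vec)[0, 0]

Lg_h   = sp.simplify(lie(g, h))     # = 0  -> input absent, differentiate again
Lf_h   = lie(f, h)                  # = thd
LgLf_h = sp.simplify(lie(g, Lf_h))  # = 1  -> nonzero, so relative degree is 2
Lf2_h  = lie(f, Lf_h)               # = -sin(th)

# Linearizing control u(x, v) = (-Lf^2 h + v) / (Lg Lf h) enforces y'' = v.
u = (-Lf2_h + v) / LgLf_h
```

Here the cancellation is exact because the dynamics are known; the rest of the paper concerns the case where `f` and `g` are unavailable.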

### III-B Multiple-input multiple-output systems

Next, we consider (square) multiple-input, multiple-output (MIMO) systems, i.e., $q > 1$. Due to space constraints, we leave a full development of this case to [sastry2013nonlinear, Chapter 9], but outline the main ideas here. As in the SISO case, we differentiate each of the output channels until at least one input appears. Let $\gamma_j$ be the number of times we need to differentiate $y_j$ (the $j$-th entry of $y$) for an input to appear. We then obtain an input-output relationship of the form:

$$\begin{bmatrix} y_1^{(\gamma_1)} \\ \vdots \\ y_q^{(\gamma_q)} \end{bmatrix} = b(x) + A(x)\,u. \tag{7}$$

The square matrix $A(x) \in \mathbb{R}^{q \times q}$ is referred to as the *decoupling matrix* and $b(x) \in \mathbb{R}^q$ is known as the *drift term*.
If $A(x)$ is nonsingular on $D$ then we observe that the control law

$$u(x, v) = A^{-1}(x)\big({-b(x)} + v\big), \tag{8}$$

where $v \in \mathbb{R}^q$, yields the decoupled linear system

$$y_j^{(\gamma_j)} = v_j, \tag{9}$$

where $v_j$ is the $j$-th entry of $v$. We refer to $\gamma = (\gamma_1, \dots, \gamma_q)$ as the *vector relative degree* of the system. The decoupled dynamics (9) can be compactly represented with the LTI system

$$\dot{z} = Az + Bv, \tag{10}$$

which we will hereafter refer to as the *reference model*. Here, we have collected the states $z = \big(y_1, \dot{y}_1, \dots, y_1^{(\gamma_1 - 1)}, \dots, y_q, \dots, y_q^{(\gamma_q - 1)}\big)$ and constructed $A$ and $B$ to represent the decoupled dynamics (9).
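Since the reference model (10) is just one chain of integrators per output channel, $A$ and $B$ can be constructed mechanically from the vector relative degree. A minimal sketch (illustrative, not from the paper):

```python
import numpy as np

def reference_model(gammas):
    """Build the LTI reference model zdot = A z + B v for a vector
    relative degree (gamma_1, ..., gamma_q): one chain of gamma_j
    integrators per output channel, with v_j driving the top derivative."""
    n, q = sum(gammas), len(gammas)
    A = np.zeros((n, n))
    B = np.zeros((n, q))
    row = 0
    for j, gj in enumerate(gammas):
        for k in range(gj - 1):
            A[row + k, row + k + 1] = 1.0   # d/dt y^(k) = y^(k+1)
        B[row + gj - 1, j] = 1.0            # d/dt y^(gamma_j - 1) = v_j
        row += gj
    return A, B
```

For the double pendulum of Section V, `reference_model([2, 2])` yields a pair of decoupled double integrators.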

## IV Directly Learning a Linearizing Controller

In this work, we will examine how to construct a linearizing controller for a physical plant of the form (1) with unknown dynamics

$$\dot{x} = f_p(x) + g_p(x)u, \qquad y = h_p(x), \tag{11}$$

starting from the linearizing controller for the model system

$$\dot{x} = f_m(x) + g_m(x)u, \qquad y = h_m(x), \tag{12}$$

which represents our "best guess" for the true dynamics of the plant. We make the following standard assumption:

###### Assumption 1

The plant (11) and the model (12) have the same well-defined vector relative degree $\gamma = (\gamma_1, \dots, \gamma_q)$ on $D$.

With this assumption in place, we know that there exist linearizing controllers of the form

$$u_m(x, v) = \beta_m(x) + \alpha_m(x)\,v \tag{13}$$

$$u_p(x, v) = \beta_p(x) + \alpha_p(x)\,v \tag{14}$$

for the model and the physical plant, respectively. We can construct $u_m$ using the techniques discussed above, but the terms in $u_p$ are unknown. However, we do know that

$$\beta_p(x) = \beta_m(x) + \Delta\beta(x) \tag{15}$$

$$\alpha_p(x) = \alpha_m(x) + \Delta\alpha(x) \tag{16}$$

for some continuous functions $\Delta\beta$ and $\Delta\alpha$. We construct parameterized estimates for these functions:

$$\beta_{\theta_2}(x) \approx \Delta\beta(x), \qquad \alpha_{\theta_1}(x) \approx \Delta\alpha(x). \tag{17}$$

Here, $\theta_1 \in \Theta_1$ and $\theta_2 \in \Theta_2$ are parameters to be trained by running experiments on the plant. We will assume that $\Theta_1$ and $\Theta_2$ are convex compact sets, and we will frequently abbreviate $\theta = (\theta_1, \theta_2) \in \Theta = \Theta_1 \times \Theta_2$. We assume that $\alpha_{\theta_1}$ and $\beta_{\theta_2}$ are continuous in $x$ and continuously differentiable in $\theta_1$ and $\theta_2$, respectively.

Altogether, for a given $\theta$ our estimate for the controller which exactly linearizes the plant is given by

$$u_{\theta}(x, v) = u_m(x, v) + \beta_{\theta_2}(x) + \alpha_{\theta_1}(x)\,v. \tag{18}$$

In the case where no prior information about the dynamics of the plant is available (other than its vector relative degree), we simply remove $u_m$ from the above expression. Next we define a conceptual optimization problem which selects the parameters for the learned controller which, in a sense we will make precise shortly, best linearize the plant. We then describe a practical variant of this problem which is more amenable to real-world implementation.
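The composition in (18) can be sketched as follows; the function handles `b_m`, `A_m`, `beta_hat` and `alpha_hat` are hypothetical stand-ins for the nominal model terms and the learned corrections (e.g. polynomials or small networks):

```python
import numpy as np

def make_controller(b_m, A_m, beta_hat, alpha_hat):
    """Estimated linearizing controller in the spirit of eq. (18):
    inversion of the nominal model's decoupling matrix plus learned
    additive corrections. All arguments are callables x -> array."""
    def u(x, v):
        u_nominal = np.linalg.solve(A_m(x), v - b_m(x))   # u_m(x, v)
        return u_nominal + beta_hat(x) + alpha_hat(x) @ v  # learned terms
    return u
```

Dropping `u_nominal` from the sum recovers the model-free case discussed above.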

### IV-A Conceptual problem

From Section III we know that the input-output dynamics of the plant are of the form

$$y^{(\gamma)} = b_p(x) + A_p(x)\,u, \tag{19}$$

where the terms $b_p$ and $A_p$ are unknown to us, and we have written the vector of highest order derivatives $\big(y_1^{(\gamma_1)}, \dots, y_q^{(\gamma_q)}\big)$ as $y^{(\gamma)}$ to simplify notation. Under application of $u_{\theta}$ the input-output dynamics are given by:

$$y^{(\gamma)} = b_p(x) + A_p(x)\,u_{\theta}(x, v). \tag{20}$$

Letting $W(x, v, \theta)$ equal the right-hand side of the above expression, we would ideally like to find $\theta^* \in \Theta$ such that $W(x, v, \theta^*) = v$ for each $x \in D$ and each $v \in \mathbb{R}^q$. That is, we would ideally like our feedback linearizing controller to accurately control the highest order derivatives of our output. However, since the dynamics of the plant are unknown to us, we do not know the terms in (20), and thus we cannot directly solve for $\theta^*$. Instead, we define the pointwise loss

$$\ell(x, v, \theta) = \big\| W(x, v, \theta) - v \big\|_2^2, \tag{21}$$

which provides a measure of how well the learned controller linearizes the plant at the state $x$ when the virtual input $v$ is applied. We then specify a probability distribution $X$ over $\mathbb{R}^n$ with support $D$, which we use to model our preference for having an accurate linearizing controller at different points in the state space. We let $V$ be the uniform distribution over a compact set of virtual inputs in $\mathbb{R}^q$, define the weighted loss

$$L(\theta) = \mathbb{E}_{x \sim X,\, v \sim V}\big[\ell(x, v, \theta)\big], \tag{22}$$

and then define our optimal choice of the parameters for the learned controller by solving the following optimization problem:

$$\theta^* \in \arg\min_{\theta \in \Theta} L(\theta). \tag{23}$$

Although we do not know the terms in (20), we can query the loss $\ell$ by applying $u_{\theta}$ at various points in the state space and recording the resulting value of $y^{(\gamma)}$. Thus zeroth-order optimization methods can be used to solve (23). In the following section, we formulate an approximation to this problem which is more directly amenable to policy gradient reinforcement learning algorithms.
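Because the loss can only be queried, a natural way to evaluate the weighted loss (22) is Monte-Carlo sampling, which is exactly what makes zeroth-order methods applicable. A minimal sketch, with `W` standing in for the (normally experimental) plant query:

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_loss(W, theta, sample_x, sample_v, n_samples=1000):
    """Monte-Carlo estimate of the weighted loss (22),
    E_{x ~ X, v ~ V} ||W(x, v, theta) - v||^2. W is only assumed to be
    queryable: on the real plant, a query means applying u_theta(x, v)
    and recording the resulting highest-order output derivatives."""
    total = 0.0
    for _ in range(n_samples):
        x, v = sample_x(rng), sample_v(rng)
        total += np.sum((np.asarray(W(x, v, theta)) - v) ** 2)
    return total / n_samples
```

An exactly linearizing parameter drives this estimate to zero, mirroring Lemma 1.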

Our insistence that $X$ is supported on all of $D$ and that $V$ uniformly excites all directions in $\mathbb{R}^q$ is analogous to the *persistence of excitation* conditions commonly found in the adaptive control literature [sastry2011adaptive], and is crucial for the following parameter convergence results.

###### Lemma 1

Suppose that there exists $\theta^* \in \Theta$ such that $W(x, v, \theta^*) = v$ for each $x \in D$ and $v \in \operatorname{supp}(V)$. Then $\theta^*$ is a globally optimal solution to (23).

Note that if $W(x, v, \theta^*) = v$ for each $x \in D$ and $v \in \operatorname{supp}(V)$, then $L(\theta^*) = 0$. Moreover, we clearly have $L(\theta) \geq 0$ for each $\theta \in \Theta$. Thus, $\theta^*$ must be a global minimizer of the optimization problem (23).

However, (23) is generally a nonconvex optimization problem, which means we cannot reliably find its globally optimal solution. Thus, we seek conditions on $\alpha_{\theta_1}$ and $\beta_{\theta_2}$ which ensure that (23) is actually a convex optimization problem. In particular, we now consider the case where $\alpha_{\theta_1}$ and $\beta_{\theta_2}$ take the form

$$\alpha_{\theta_1}(x) = \sum_{k=1}^{K_1} \theta_1^k \alpha_k(x), \qquad \beta_{\theta_2}(x) = \sum_{k=1}^{K_2} \theta_2^k \beta_k(x), \tag{24}$$

where the $\{\alpha_k\}_{k=1}^{K_1}$ and $\{\beta_k\}_{k=1}^{K_2}$ are nonlinear features, which are each assumed to be continuous functions. The proof of the following result can be found in the Appendix.

###### Lemma 2

Suppose that $\alpha_{\theta_1}$ and $\beta_{\theta_2}$ are of the form (24). Then (23) is a convex optimization problem.

Taken together, the above Lemmas immediately imply the following result, which provides conditions under which we can reliably recover the true linearizing controller for the plant by solving (23).

###### Theorem 1

Suppose that $\alpha_{\theta_1}$ and $\beta_{\theta_2}$ are of the form (24) and that there exists $\theta^* \in \Theta$ such that $W(x, v, \theta^*) = v$ for each $x \in D$ and $v \in \operatorname{supp}(V)$. Then (23) is a convex optimization problem whose globally optimal solutions exactly linearize the plant on $D$.
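The convexity result can be illustrated on a toy problem: under the linear parameterization (24), the residual $W(x, v, \theta) - v$ is affine in $\theta$, so a sampled version of (23) is an ordinary least-squares problem. The scalar "plant" below is a hypothetical black box that we can only query, as in the conceptual problem:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([0.5, -1.2, 2.0])   # the (unknown) exact parameters

def phi(x, v):
    return np.array([1.0, x, x * v])      # hypothetical feature row

def W(x, v, theta):
    # Black-box query: the residual vanishes exactly at theta_star.
    return v + phi(x, v) @ (theta - theta_star)

# Recover the affine structure from queries alone: evaluate W at theta = 0
# (the offset) and at the coordinate directions (the linear part), then
# solve the resulting convex least-squares problem for theta.
d = theta_star.size
rows, targets = [], []
for _ in range(50):
    x, v = rng.uniform(-1, 1), rng.uniform(-1, 1)
    c = W(x, v, np.zeros(d)) - v                      # affine offset
    cols = [W(x, v, e) - v - c for e in np.eye(d)]    # linear coefficients
    rows.append(np.array(cols))
    targets.append(-c)
theta_hat, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
```

Because the sampled loss is a convex quadratic in $\theta$, the least-squares solution is the global minimizer and coincides with the exact parameters, as Lemma 1 and Lemma 2 predict.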

### IV-B Reinforcement learning for practical implementation

To be able to solve (23) efficiently in practical settings we now cast it as a canonical reinforcement learning problem [SuttonBarto]. This allows us to leverage off-the-shelf implementations of on-policy reinforcement learning algorithms to efficiently learn feedback linearizing policies.

Indeed, if we take $\pi_{\theta}$ to be our policy, which takes in both the current state $x_t$ and an auxiliary input $v_t$ and returns the control action $u_{\theta}(x_t, v_t) + w_t$, and take the reward for a given state and auxiliary input to be $r(x_t, v_t) = -\ell(x_t, v_t, \theta)$, the above problem can be written as:

$$\max_{\theta \in \Theta}\; \mathbb{E}\left[\sum_{t=0}^{T} r(x_t, v_t)\right],$$

where $x_0 \sim X_0$ is drawn from the initial state distribution, each $v_t \sim V$ is drawn from a distribution over auxiliary inputs, $T$ is the time horizon of the problem, and $w_t$ is an additive zero-mean noise term which makes the effect of the policy random.

A discretized version of this problem can be solved with on-policy reinforcement learning algorithms. Indeed, for a given fixed value of $\theta$, we can sample rollouts of length $T$ and use the resulting sequences of states, policy outputs, and rewards to construct estimates of the gradient of the objective with respect to $\theta$. This can be done with any number of methods including, but not restricted to, REINFORCE with baseline [SuttonBarto], Deep Deterministic Policy Gradients [DDPO], Proximal Policy Optimization [PPO], and Trust Region Policy Optimization [TRPO].
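A minimal sketch of one policy-gradient update in this spirit: a parameter-perturbation score-function estimator with an average-reward baseline, a simplified stand-in for the REINFORCE/PPO implementations used in the experiments. Here `rollout_fn` is a hypothetical episode runner mapping a parameter vector to its total reward:

```python
import numpy as np

rng = np.random.default_rng(2)

def reinforce_step(theta, rollout_fn, n_rollouts=32, sigma=0.1, lr=1e-2):
    """One score-function (REINFORCE-style) update: perturb the flat
    parameter vector with Gaussian noise, score each noisy policy by its
    episode reward sum_t -l(x_t, v_t, .), subtract the average-reward
    baseline, and ascend the resulting gradient estimate."""
    noises = rng.normal(0.0, sigma, size=(n_rollouts, theta.size))
    rewards = np.array([rollout_fn(theta + n) for n in noises])
    baseline = rewards.mean()                       # variance reduction
    grad = noises.T @ (rewards - baseline) / (n_rollouts * sigma**2)
    return theta + lr * grad                        # ascend the reward
```

Repeated application drives the parameters toward a maximizer of the expected reward, i.e. toward a minimizer of the weighted linearization loss.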

## V Examples

We now use our approach to learn feedback linearizing policies for three different systems to highlight its versatility. The first two examples are trained in silico while the third is in hardware. In all cases, the input to the parameterized policy replaces all angles with their sine and cosine, and does not include Cartesian positions.

### V-A Simulations

#### V-A1 Double pendulum with polynomial policies

We first test our approach on a fully actuated double pendulum with state $x = (\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2)$ and output $y = (\theta_1, \theta_2)$, where $\theta_1$ and $\theta_2$ represent the angles of the two joints and $\dot{\theta}_1$ and $\dot{\theta}_2$ their angular rates. The system has two inputs, $u_1$ and $u_2$, which control the torques applied at the two joints. Although the system is relatively low dimensional and fully actuated, it is highly nonlinear and can produce chaotic trajectories [shinbrot1992chaos].

We train linearizing controllers for the double pendulum in two cases where very poor prior information about the model is available. In the first case, we assume that our estimates for the mass and length of the pendulum arms are only a fraction of their true values. In the second case, we assume that no prior information on the system's dynamics is available, so that the trained controller consists entirely of the learned terms in (18). In both settings we parameterize the learned portion of our policies by linear combinations of second order polynomials, so that the reinforcement learning problem conforms to the assumptions of Theorem 1. We use the REINFORCE algorithm [SuttonBarto], and baseline state-value estimates with the average reward over all states. At each iteration (or epoch) we collect a batch of fixed-length rollouts, and we train for 2000 epochs. Figure 2 presents the resulting trajectories for each learned controller. We do not plot the trajectories generated by the nominal model-based controllers, since in both cases the initial controllers are unable to move the pendulum arms more than a few degrees from the downwards position. For each controller we observe improvement in tracking ability even though a low order polynomial policy is employed. In order to track the desired reference signals we apply a linear feedback gain on the reference model, which is found by solving an infinite horizon LQR problem.

#### V-A2 14D quadrotor with neural network policies

Our second simulation environment uses the quadrotor model and feedback linearization controller proposed in [al2009quadrotor], which makes use of dynamic extension [sastry2013nonlinear]. In particular, the fourteen states of the model are the Cartesian coordinates $x$, $y$ and $z$ of the quadrotor; the roll, pitch and yaw angles $\phi$, $\theta$ and $\psi$; the six time derivatives of these states; and two extra states obtained from the dynamic extension procedure. The outputs for the model are the $x$, $y$ and $z$ coordinates together with the yaw angle $\psi$.

In Figure 3 we show the performance of two learned feedback linearizing policies on two different high-performance reference tracking tasks. For the first learned policy, we initialized the training with an incorrect prior model in which all the parameters of the model (mass and moments of inertia) were scaled by one constant factor. For the second learned policy, the parameters of the incorrect prior model were scaled by a different factor. The policies were fully connected feed-forward neural networks. For each training epoch, 50 rollouts of length 25 were collected and the parameters were updated using PPO. We trained both policies for 2500 epochs. As shown in Figure 3, both prior models were unable to successfully track the desired references for either task, leading to highly unstable dynamics. The learned policies, on the other hand, were able to achieve high quality tracking of both references. For all trajectories a linear feedback gain was applied to the reference model, obtained by solving an LQR problem in which deviations in position and yaw were penalized 20 times more than the norm of the control.

Figure 3 also highlights how a better prior model leads to better performance of the learned policy. Figure 4 highlights this trend through the learning curves of three policies initialized with prior models of decreasing quality. We observe that worse initial models result in worse policy performance, given the same network architecture and training time.

### V-B Robotic experiment: 7-DOF manipulator arm

We also evaluate our approach in hardware, on a 7-DOF Baxter robot arm. The dynamics of this 14-dimensional system are extremely coupled and nonlinear. Taking the 7 joint angles as the output, however, the system is input-output linearizable with relative degree two. We use the system measurements (i.e., masses, link lengths, etc.) provided with Baxter's pre-calibrated URDF [robotics2013baxter] and the OROCOS Kinematics and Dynamics Library (KDL) [bruyninckx2001open] to compute a nominal feedback linearizing control law.

This nominal controller suffers from several inaccuracies. First, Baxter’s actuators are series-elastic, meaning that each joint contains a torsion spring [williams2017baxter] which is unmodeled, and the URDF itself may not be perfectly accurate. Second, the OROCOS solver is numerical, which can lead to errors in computing the decoupling matrix and drift term. Finally, our control architecture is implemented in the Robot Operating System [quigley2009ros], which can lead to minor timing inconsistency.

We use the PPO algorithm to tune the parameters of a neural network. The network maps from the (sine-cosine augmented) 21 states to 56 outputs (the 49 entries of the inverse decoupling matrix and the 7 entries of the drift term). For each training epoch, 1250 rollouts of one timestep (0.05 s) each were collected. We trained for 100 epochs with a fixed learning rate, which took 104 minutes. Figure 5 summarizes typical results on tracking a square wave reference trajectory for each joint angle with period 5 s. As shown, the ideal linear system displays an exponential step response and rapidly converges to each new setpoint. The nominal feedback linearized model from OROCOS has significant steady-state error. Our learned approach significantly reduces, but does not eliminate, this error. We conjecture that this remaining error is a sign that the (relatively small) neural network may not be sufficiently expressive.
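For concreteness, here is one plausible way the 56 network outputs could be split into the two learned correction terms; the layout itself is an assumption for illustration, and only the $49 + 7 = 56$ split comes from the text:

```python
import numpy as np

def split_outputs(net_out):
    """Hypothetical layout: the first 49 outputs form the 7x7
    matrix-valued (inverse decoupling) correction, the last 7 form
    the drift-term correction, matching the 49 + 7 = 56 count."""
    assert net_out.shape == (56,)
    alpha = net_out[:49].reshape(7, 7)   # matrix-valued term
    beta = net_out[49:]                  # drift term
    return alpha, beta
```

These two pieces then play the roles of $\alpha_{\theta_1}(x)$ and $\beta_{\theta_2}(x)$ in the learned controller (18).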

## VI Conclusion

In this paper, we introduced a framework for learning a linearizing control law for a plant with *a priori* unknown dynamics and no assumptions on the coupling between the nonlinear components of the system. We provided theoretical guarantees for conditions under which it is possible to learn the exact linearizing controller. In more general settings, we cast the learning of a feedback linearizing controller as an on-policy reinforcement learning problem.

We validated our proposed approach on three problems. We first showed that it is possible to learn a feedback linearizing controller with *no prior model* to control a highly nonlinear, fully actuated double pendulum. Second, we demonstrated that neural network-based feedback linearizing policies can efficiently track arbitrary trajectories of a high dimensional system. We also empirically observed the advantage of incorporating prior knowledge into the control design. Finally, we tested our approach in hardware on a Baxter robot, where, after 104 minutes of training, we observed a significant improvement in tracking error over the baseline. Together, these empirical results confirm the effectiveness of our approach as a general method for designing high quality model reference controllers for high-dimensional systems with unknown dynamics.

### -A Proof of Lemma 2

First, we rearrange (19) into the form

$$y^{(\gamma)} = \big(b_p(x) + A_p(x)\,u_m(x, v)\big) + A_p(x)\big(\beta_{\theta_2}(x) + \alpha_{\theta_1}(x)\,v\big) \tag{25}$$

to separate out the portions that depend on $\theta$. Under the linear parameterization (24), the second term is linear in $\theta$: collecting the scaled features $\{A_p(x)\alpha_k(x)v\}$ and $\{A_p(x)\beta_k(x)\}$ as the columns of a matrix $\Phi(x, v)$, we may write $W(x, v, \theta) = W_0(x, v) + \Phi(x, v)\theta$, where we set $W_0(x, v) = b_p(x) + A_p(x)\,u_m(x, v)$.

Letting $e(x, v) = W_0(x, v) - v$, we can rewrite

$$\ell(x, v, \theta) = \big\| \Phi(x, v)\theta + e(x, v) \big\|_2^2.$$

From here we observe that $L(\theta) = \theta^\top M \theta + 2c^\top \theta + d$, where $M = \mathbb{E}\big[\Phi^\top \Phi\big]$ is a positive semi-definite matrix, $c = \mathbb{E}\big[\Phi^\top e\big]$ and $d = \mathbb{E}\big[\|e\|_2^2\big]$. Thus, recalling that $\Theta$ is assumed to be a convex set, we see that (23) is a convex optimization problem, which will be strictly convex if, and only if, $M$ is positive definite. Letting $\phi_k(x, v)$ denote the $k$-th column of $\Phi(x, v)$, we see that $M$ is nothing but the Grammian of the set $\{\phi_k\}$ on $D \times \operatorname{supp}(V)$ with respect to an inner product which is weighted by the distributions $X$ and $V$. Thus, $M$ will be positive definite if and only if $\{\phi_k\}$ is linearly independent on $D \times \operatorname{supp}(V)$. For the sake of contradiction, assume that $\{\phi_k\}$ is not linearly independent. Then there exist scalars $\{a_k\}$ and $\{b_k\}$, not all zero, such that for each $x \in D$ and $v \in \operatorname{supp}(V)$

$$\sum_k a_k A_p(x)\alpha_k(x)v + \sum_k b_k A_p(x)\beta_k(x) = 0. \tag{26}$$

Since we know that $A_p(x)$ is invertible for each $x \in D$, this statement is equivalent to

$$\sum_k a_k \alpha_k(x)v + \sum_k b_k \beta_k(x) = 0 \tag{27}$$

holding for each $x \in D$ and $v \in \operatorname{supp}(V)$. However, it is not difficult to see that this condition is ruled out in the case that $\{\alpha_k\}$ and $\{\beta_k\}$ are linearly independent sets.