Technical Report: Adaptive Control for Linearizable Systems Using On-Policy Reinforcement Learning

This paper proposes a framework for adaptively learning a feedback linearization-based tracking controller for an unknown system using discrete-time model-free policy-gradient parameter update rules. The primary advantage of the scheme over standard model-reference adaptive control techniques is that it does not require the learned inverse model to be invertible at all instances of time. This enables the use of general function approximators to approximate the linearizing controller for the system without having to worry about singularities. However, the discrete-time and stochastic nature of these algorithms precludes the direct application of standard machinery from the adaptive control literature to provide deterministic stability proofs for the system. Nevertheless, we leverage these techniques alongside tools from the stochastic approximation literature to demonstrate that with high probability the tracking and parameter errors concentrate near zero when a certain persistence of excitation condition is satisfied. A simulated example of a double pendulum demonstrates the utility of the proposed theory.


I Introduction

Many real-world control systems display nonlinear behaviors which are difficult to model, necessitating the use of control architectures which can adapt to the unknown dynamics online while maintaining certificates of stability. There are many successful model-based strategies for adaptively constructing controllers for uncertain systems [sastry1989adaptive, sastry2013nonlinear, slotine1987adaptive], but these methods often require a simple, reasonably accurate parametric model of the system dynamics. Recently, however, there has been a resurgence of interest in the use of model-free reinforcement learning techniques to construct feedback controllers without the need for a reliable dynamics model [schulman2015trust, schulman2017proximal, lillicrap2015continuous]. As these methods begin to be deployed in real-world settings, a new theory is needed to understand their behavior as they are integrated into safety-critical control loops.

However, the majority of the theory for adaptive control is stated in continuous-time [sastry2013nonlinear], while reinforcement learning algorithms are typically implemented and studied in discrete-time settings [sutton2018reinforcement, borkar2009stochastic]. There have been several attempts to define and study policy-gradient algorithms in continuous-time [munos2006policy, doya2000reinforcement], yet many real-world systems have actuators which can only be updated at a fixed maximum sampling frequency. Thus, we find it more natural and practically applicable to unify these methods in the sampled-data setting.

Specifically, this paper addresses the model mismatch issue by combining continuous-time adaptive control techniques with discrete-time model-free reinforcement learning algorithms to learn a feedback linearization-based tracking controller for an unknown system, online. Unfortunately, it is well-known that sampling can destroy the affine relationship between system inputs and outputs which is usually assumed and then exploited in the stability proofs from the adaptive control literature [grizzle1988feedback]. To overcome this challenge, we first ignore the effects of sampling and design an idealized continuous-time behavior for the system’s tracking and parameter error dynamics which employs a least-squares gradient following update rule. In the sampled-data setting, we then use an Euler approximation of the continuous-time reward signal and implement a policy-gradient parameter update rule to produce a noisy approximation to the ideal continuous-time behavior. Our framework is closely related to that of [westenbroek2019feedback]; however, in this paper we address the problem of online adaptation of the learned parameters whereas [westenbroek2019feedback] considers a fully offline setting.

Beyond naturally bridging the continuous-time and sampled-data settings, the primary advantage of our approach is that it does not suffer from the "loss of controllability" phenomenon which is a core challenge in the model-reference adaptive control literature [sastry1989adaptive, kosmatopoulos1999switching]. This issue arises when the parameterized estimate for the system's decoupling matrix becomes singular, in which case either the learned linearizing control law or the associated parameter update scheme may break down. To circumvent this issue, projection-based parameter update rules are typically used to keep the parameters in a region in which the estimate for the decoupling matrix is known to be invertible. In practice, the construction of these regions requires that a simple parameterization of the system's nonlinearities is available [craig1987adaptive]. In contrast, the model-free approach we introduce does not suffer from singularities and can naturally incorporate 'universal' function approximators such as radial basis functions or bases of polynomials.

However, due to the non-deterministic nature of our sampled-data control law and parameter update scheme, the deterministic guarantees usually found in the adaptive control literature do not apply here. Indeed, policy-gradient parameter updates are known to suffer from high variance [zhao2011analysis]. Nevertheless, we demonstrate that when a standard persistence of excitation condition is satisfied, the tracking and parameter errors of the system concentrate around the origin with high probability, even when the most basic policy-gradient update rule is used. Our analysis technique is derived from the adaptive control literature and the theory of stochastic approximation [borkar2009stochastic, vershynin2018high]. Proofs of the claims made can be found in the Appendix of this document. Finally, a simulation of a double pendulum demonstrates the utility of the approach.

I-A Related Work

A number of approaches have been proposed to avoid the "loss of controllability" problem discussed above. One approach is to perturb the estimated linearizing control law to avoid singularities [kosmatopoulos1999switching, kosmatopoulos2002robust, bechlioulis2008robust]. However, this method never learns the exact linearizing controller during operation and hence sacrifices some tracking performance. Other approaches avoid the need to invert the input-output dynamics by driving the system states to a sliding surface [slotine1987adaptive]. Unfortunately, these methods require high-gain feedback which may lead to undesirable effects such as actuator saturation. Several model-free approaches similar to the one we consider here have been proposed in the literature [hwang2003reinforcement, zomaya1994reinforcement], but these focus on actor-critic methods and, to the best of our knowledge, do not provide any proofs of convergence. Recently, non-parametric function approximators have been used to learn a linearizing controller [umlauft2017feedback, umlauft2019feedback], but these methods still require structural assumptions to avoid singularities.

While our parameter-update scheme is most closely related to the policy-gradient literature, e.g., [sutton2018reinforcement], we believe that recent work in meta-learning [finn2017model, santoro2016meta] is also similar to our own, at least in spirit. Meta-learning aims to learn priors on the solution to a given machine learning problem, and thereby to speed up online fine-tuning when presented with a slightly different instance of the problem [vilalta2002perspective]. Meta-learning is used in practice to apply reinforcement learning algorithms in hardware settings [nagabandi2018learning, andrychowicz2020learning].

I-B Preliminaries

Next, we fix mathematical notation and review some definitions used extensively in the paper. Given a random variable $X$, its expectation (when it exists) is denoted $\mathbb{E}[X]$ and its variance is denoted $\mathrm{Var}[X]$. Our analysis relies heavily on the notion of a sub-Gaussian distribution. We say that a random variable $X$ is sub-Gaussian if there exists a constant $C > 0$ such that for each $t \geq 0$ we have $\mathbb{P}\{|X| \geq t\} \leq 2\exp(-t^2/C^2)$. Informally, a distribution is sub-Gaussian if its tail is dominated by the tail of some Gaussian distribution. We endow the space of sub-Gaussian distributions with the norm $\|\cdot\|_{\psi_2}$ defined by $\|X\|_{\psi_2} = \inf\{t > 0 \colon \mathbb{E}[\exp(X^2/t^2)] \leq 2\}$. As an example, if $X$ is a zero-mean Gaussian random vector with covariance $\sigma^2 I$ (with $I$ the $d$-dimensional identity), then $X$ is sub-Gaussian with $\|X\|_{\psi_2} = C\sigma$, where the constant $C$ does not depend on $\sigma$.
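As a concrete illustration of this definition (our own numerical sketch, not part of the paper), the $\psi_2$ norm of Gaussian samples can be estimated by bisecting on $t$ in the condition $\mathbb{E}[\exp(X^2/t^2)] \leq 2$, and its linear scaling in $\sigma$ checked empirically:

```python
import numpy as np

# Our own numerical sketch (not from the paper): estimate the sub-Gaussian
# norm ||X||_{psi_2} = inf{ t > 0 : E[exp(X^2 / t^2)] <= 2 } of Gaussian
# samples by bisection, and check that it scales linearly with sigma.
def subgaussian_norm(samples, tol=1e-3):
    lo, hi = 1e-6, 100.0
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        # E[exp(X^2/t^2)] is monotone decreasing in t
        if np.mean(np.exp((samples / t) ** 2)) <= 2.0:
            hi = t
        else:
            lo = t
    return hi

rng = np.random.default_rng(0)
n1 = subgaussian_norm(rng.normal(0.0, 1.0, 200_000))
n2 = subgaussian_norm(rng.normal(0.0, 2.0, 200_000))
print(n1, n2)  # n2 is roughly twice n1, consistent with ||X||_{psi_2} = C sigma
```

For $X \sim N(0, \sigma^2)$ the exact value is $\sqrt{8/3}\,\sigma \approx 1.63\sigma$, which the estimate above approaches.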

II Feedback Linearization

Throughout the paper we will focus on constructing output tracking controllers for systems of the form

$\dot{x} = f(x) + g(x)u, \qquad y = h(x)$  (1)

where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^q$ is the input and $y \in \mathbb{R}^q$ is the output. The mappings $f$, $g$ and $h$ are each assumed to be smooth, and we assume without loss of generality that the origin is an equilibrium point of the undriven system, i.e., $f(0) = 0$. Throughout the paper, we will also assume that the state and the output can both be measured.

II-A Single-input single-output systems

We begin by introducing feedback linearization for single-input, single-output (SISO) systems (i.e., $q = 1$). We begin by examining the first time derivative of the output:

$\dot{y} = L_f h(x) + L_g h(x)u$  (2)

Here the terms $L_f h(x)$ and $L_g h(x)$ are known as Lie derivatives [sastry2013nonlinear]. In the case that $L_g h(x) \neq 0$ for each $x \in \mathbb{R}^n$, we can apply

$u(x,v) = \frac{1}{L_g h(x)}\left(-L_f h(x) + v\right),$  (3)

which exactly 'cancels out' the nonlinearities of the system and enforces the linear relationship $\dot{y} = v$, with $v$ some arbitrary, auxiliary input. However, if the input does not affect the first time derivative of the output (that is, if $L_g h(x) \equiv 0$), then the control law (3) will be undefined. In general, we can differentiate $y$ multiple times, until the input shows up in one of the higher derivatives of the output. Assuming that the input does not appear the first $\gamma - 1$ times we differentiate the output, the $\gamma$-th time derivative of $y$ will be of the form

$y^{(\gamma)} = L_f^{\gamma} h(x) + L_g L_f^{\gamma - 1} h(x)u$  (4)

Here, $L_f^{\gamma} h(x)$ and $L_g L_f^{\gamma - 1} h(x)$ are higher-order Lie derivatives, and we direct the reader to [sastry2013nonlinear, Chapter 9] for further details. If $L_g L_f^{\gamma - 1} h(x) \neq 0$ for each $x \in \mathbb{R}^n$ then setting

$u(x,v) = \frac{1}{L_g L_f^{\gamma - 1} h(x)}\left(-L_f^{\gamma} h(x) + v\right)$  (5)

enforces the trivial linear relationship $y^{(\gamma)} = v$. We refer to $\gamma$ as the relative degree of the nonlinear system, which is simply the order of its input-output relationship.
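To make the SISO construction concrete, the following sketch (our own example, not from the paper) applies the law (5) to a single pendulum $\ddot{q} = -(g/l)\sin q + u/(ml^2)$ with output $y = q$, which has relative degree $\gamma = 2$:

```python
import numpy as np

# Hedged sketch (our example system, not the paper's): feedback linearization
# of a single pendulum  q_ddot = -(g/l) sin(q) + u/(m l^2)  with output y = q.
# Differentiating the output twice gives y^(2) = Lf^2 h(x) + Lg Lf h(x) u,
# so gamma = 2 and the law (5) reads u = (v - Lf^2 h) / (Lg Lf h).
m, l, g0 = 1.0, 1.0, 9.81

def lie_terms(x):
    q, qdot = x
    Lf2h = -(g0 / l) * np.sin(q)   # drift term in y^(2)
    LgLfh = 1.0 / (m * l**2)       # input coefficient (never zero here)
    return Lf2h, LgLfh

def linearizing_u(x, v):
    Lf2h, LgLfh = lie_terms(x)
    return (v - Lf2h) / LgLfh

# After the cancellation, y^(2) = v exactly:
x = np.array([0.7, -0.3])
v = 1.5
Lf2h, LgLfh = lie_terms(x)
y_ddot = Lf2h + LgLfh * linearizing_u(x, v)
print(np.isclose(y_ddot, v))  # True
```

Because $L_g L_f h = 1/(ml^2)$ is constant and nonzero, the singularity issue discussed in the introduction does not arise for this system; it is the state-dependent decoupling terms of the MIMO case that can become singular.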

II-B Multiple-input multiple-output systems

Next, we consider square multiple-input, multiple-output (MIMO) systems, for which the number of inputs equals the number of outputs $q$. As in the SISO case, we differentiate each of the output channels until at least one input appears. Let $\gamma_j$ be the number of times we need to differentiate $y_j$ (the $j$-th entry of $y$) for at least one input to appear. Combining the resulting expressions for each of the outputs yields an input-output relationship of the form

$y^{(\gamma)} = b(x) + A(x)u$  (6)

where we have adopted the shorthand $y^{(\gamma)} = [y_1^{(\gamma_1)}, \ldots, y_q^{(\gamma_q)}]^T$. Here, the matrix $A(x) \in \mathbb{R}^{q \times q}$ is known as the decoupling matrix and the vector $b(x) \in \mathbb{R}^q$ is known as the drift term. If $A(x)$ is non-singular for each $x \in \mathbb{R}^n$ then we observe that the control law

$u(x,v) = A^{-1}(x)\left(-b(x) + v\right)$  (7)

where $v \in \mathbb{R}^q$, yields the decoupled linear system

$[y_1^{(\gamma_1)}, y_2^{(\gamma_2)}, \ldots, y_q^{(\gamma_q)}]^T = [v_1, v_2, \ldots, v_q]^T,$  (8)

where $v_j$ is the $j$-th entry of $v$ and $y_j^{(\gamma_j)}$ is the $\gamma_j$-th time derivative of the $j$-th output. We refer to $\gamma = (\gamma_1, \ldots, \gamma_q)$ as the vector relative degree of the system, with $|\gamma| = \gamma_1 + \cdots + \gamma_q$ the total relative degree. The decoupled dynamics (8) can be compactly represented with the LTI system

$\dot{\xi}_r = A\xi_r + Bv_r$  (9)

which we will hereafter refer to as the reference model. Here, $A$ and $B$ are block matrices collecting $q$ chains of integrators, one per output channel, with $B$ constructed so that $B^T B = I_q$, where $I_q$ is the $q$-dimensional identity matrix. Note that (9) collects $\xi_r = (y_1, \dot{y}_1, \ldots, y_1^{(\gamma_1 - 1)}, \ldots, y_q, \ldots, y_q^{(\gamma_q - 1)})$. It can be shown [sastry2013nonlinear, Chapter 9] that there exists a change of coordinates $x \mapsto (\xi, \eta)$ such that in the new coordinates, and after application of the linearizing control law (7), the dynamics of the system are of the form

$\dot{\xi} = A\xi + Bv, \qquad \dot{\eta} = q(\xi, \eta) + p(\xi, \eta)v.$  (10)

That is, the $\xi$ coordinates represent the portion of the system that has been linearized, while the $\eta$ coordinates represent the remaining coordinates of the nonlinear system. The undriven dynamics

$\dot{\eta} = q(0, \eta)$  (11)

are referred to as the zero dynamics. Conditions which ensure that the $\eta$ coordinates remain bounded during operation will be discussed below.
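Numerically, the MIMO law (7) amounts to one linear solve per time step. A minimal sketch, with a stand-in decoupling matrix and drift term of our own choosing (not the paper's double-pendulum terms):

```python
import numpy as np

# Sketch of the MIMO linearizing law (7), u = A(x)^{-1}(-b(x) + v), for an
# illustrative two-output system; A_of_x and b_of_x are stand-ins, not the
# paper's dynamics.
def A_of_x(x):
    # Decoupling matrix: diagonally dominant, hence invertible everywhere.
    return np.array([[2.0 + np.cos(x[0]), 0.3],
                     [0.3, 1.5 + np.sin(x[1]) ** 2]])

def b_of_x(x):
    # Drift term.
    return np.array([np.sin(x[0]) * x[1], x[0] ** 2])

def decoupling_u(x, v):
    # Solve A(x) u = v - b(x) rather than forming the inverse explicitly.
    return np.linalg.solve(A_of_x(x), v - b_of_x(x))

x = np.array([0.4, -0.2])
v = np.array([1.0, -1.0])
u = decoupling_u(x, v)
print(np.allclose(A_of_x(x) @ u + b_of_x(x), v))  # True: y^(gamma) = v
```

The "loss of controllability" issue discussed in the introduction is precisely the failure of the `np.linalg.solve` step when the estimated $A(x)$ becomes singular.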

II-C Inversion & exact tracking for min-phase MIMO systems

Let us assume that we are given a desired reference signal $y_d(\cdot)$. Our goal is to construct a tracking controller for the nonlinear system using the linearizing controller (7), along with a linear controller designed for the reference model (9) which makes use of both feedforward and feedback terms. We will assume that the first $\gamma_j$ derivatives of $y_{j,d}$ are well defined, and that each of these signals can be bounded uniformly.

For compactness of notation, we will collect

$y_d^{(\gamma)}(\cdot) = (y_{1,d}^{(\gamma_1)}(\cdot), y_{2,d}^{(\gamma_2)}(\cdot), \ldots, y_{q,d}^{(\gamma_q)}(\cdot))$
$\xi_d(\cdot) = (y_{1,d}(\cdot), \ldots, y_{1,d}^{(\gamma_1 - 1)}(\cdot), \ldots, y_{q,d}(\cdot), \ldots, y_{q,d}^{(\gamma_q - 1)}(\cdot)).$

Here, $\xi_d$ is used to capture the desired trajectory of the linear reference model, and $y_d^{(\gamma)}$ will be used in a feedforward term in the tracking controller. To construct the feedback term, we define the error

$e(\cdot) = \xi(\cdot) - \xi_d(\cdot)$  (12)

where $\xi(\cdot)$ is the actual trajectory of the linearized coordinates as in (10). Altogether, the tracking controller for the system is then given by

$u = A^{-1}(x)\left(-b(x) + y_d^{(\gamma)} + Ke\right)$  (13)

where $K$ is a linear feedback matrix designed so that $A + BK$ is Hurwitz. Under the application of this control law the closed-loop error dynamics become

$\dot{e} = (A + BK)e$  (14)

and it becomes apparent that $e(t) \to 0$ exponentially quickly. However, while the tracking error decays exponentially, the $\eta$ coordinates may become unbounded during operation, in which case the linearizing control law will break down. One sufficient condition for $\eta$ to remain bounded is for the zero dynamics to be globally exponentially stable and for $\xi$ and $v$ to remain bounded [sastry1989adaptive, Chapter 9]. When the zero dynamics satisfy this condition, we say the nonlinear system is exponentially minimum phase.
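The gain $K$ in (13) can be chosen by ordinary pole placement on the reference model. A small sketch for one output channel with $\gamma_j = 2$ (the specific gains below are illustrative choices of ours, not values from the paper):

```python
import numpy as np

# Pole-placement sketch for the gain K in (13), for one output channel with
# gamma_j = 2, so the reference-model block is a double integrator.  The
# gains are illustrative: K = [-6, -5] gives characteristic polynomial
# s^2 + 5 s + 6 = (s + 2)(s + 3), so A + BK is Hurwitz.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-6.0, -5.0]])

eigs = np.linalg.eigvals(A + B @ K)
print(np.sort(eigs.real))  # approximately [-3. -2.]
```

With these poles in the open left half-plane, (14) gives $\|e(t)\| \leq Me^{-\lambda t}\|e(0)\|$ for some $M, \lambda > 0$.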

From here on, we will aim to learn a feedback linearization-based tracking controller for the unknown plant

$\dot{x}_p = f_p(x_p) + g_p(x_p)u_p, \qquad y_p = h_p(x_p)$  (15)

in an adaptive fashion. We assume that we have access to an approximate dynamics model for the plant

$\dot{x}_m = f_m(x_m) + g_m(x_m)u_m, \qquad y_m = h_m(x_m),$  (16)

which incorporates any prior information available about the plant. It is assumed that the states $x_p$ and $x_m$ of both systems belong to $\mathbb{R}^n$, that the inputs and outputs of both systems belong to $\mathbb{R}^q$, and that each of the mappings in (15) and (16) is smooth. We make the following assumptions about the model and plant:

Assumption 1

The plant and model have the same well-defined vector relative degree $\gamma$ on all of $\mathbb{R}^n$.

Assumption 2

The model and plant are both exponentially minimum phase.

With these assumptions in place, we know that there are globally-defined linearizing controllers for the plant and model, which respectively take the following form:

$u_p(x,v) = \beta_p(x) + \alpha_p(x)v, \qquad u_m(x,v) = \beta_m(x) + \alpha_m(x)v$

While $u_m$ can be calculated using the model dynamics and the procedures outlined in the previous section, the terms comprising $u_p$ are unknown to us. However, we do know that they may be expressed as

$\beta_p(x) = \beta_m(x) + \Delta\beta(x), \qquad \alpha_p(x) = \alpha_m(x) + \Delta\alpha(x)$

where $\Delta\beta$ and $\Delta\alpha$ are unknown but continuous functions. Thus we construct an estimate for $u_p$ of the form

$\hat{u}(\theta, x, v) = \left(\beta_m(x) + \beta_{\theta_1}(x)\right) + \left(\alpha_m(x) + \alpha_{\theta_2}(x)\right)v$

where $\beta_{\theta_1}$ is a parameterized estimate for $\Delta\beta$, and $\alpha_{\theta_2}$ is a parameterized estimate for $\Delta\alpha$. The parameters $\theta_1 \in \mathbb{R}^{K_1}$ and $\theta_2 \in \mathbb{R}^{K_2}$ are to be learned during online operation of the plant, and the total set of parameters $\theta$ is collected by stacking $\theta_1$ on top of $\theta_2$. Our theoretical results will assume that the estimates are of the form

$\beta_{\theta_1}(x) = \sum_{k=1}^{K_1} \theta_1^k \beta_k(x), \qquad \alpha_{\theta_2}(x) = \sum_{k=1}^{K_2} \theta_2^k \alpha_k(x)$  (18)

where $\{\beta_k\}_{k=1}^{K_1}$ and $\{\alpha_k\}_{k=1}^{K_2}$ are linearly independent bases of functions, such as polynomials or radial basis functions.
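As a sketch of the parameterization (18), the expansion is linear in the parameters, so a sufficiently rich radial-basis family can represent a smooth unknown residual over the operating region. The centers, width, and target residual below are illustrative choices of ours, not quantities from the paper:

```python
import numpy as np

# Sketch of the linear-in-parameters estimate (18) using radial basis
# functions; centers, width, and the target residual are illustrative.
centers = np.linspace(-np.pi, np.pi, 25)
width = 0.4

def features(x):
    # beta_k(x): one Gaussian bump per center.
    return np.exp(-((x - centers) ** 2) / (2 * width**2))

def beta_theta1(x, theta1):
    return theta1 @ features(x)  # sum_k theta1^k beta_k(x), eq. (18)

# Offline sanity check: fit a smooth stand-in residual Delta-beta(x) = sin(x)
# by least squares over the operating region.
xs = np.linspace(-np.pi, np.pi, 200)
Phi = np.stack([features(x) for x in xs])
theta1, *_ = np.linalg.lstsq(Phi, np.sin(xs), rcond=None)
err = max(abs(beta_theta1(x, theta1) - np.sin(x)) for x in xs)
print(err < 1e-2)  # True: the basis represents the residual accurately
```

In the paper's online setting the coefficients are of course not fit by least squares on offline data; they are adapted by the update rules developed in the next section.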

III-A Idealized continuous-time behavior

We now introduce a continuous-time update rule for the parameters of the learned linearizing controller which assumes that we know the functional form of the nonlinearities of the system. In Section III-B, we demonstrate how to approximate this ideal behavior in the sampled data setting using a policy gradient update rule which requires no information about the structure of the plant’s nonlinearities.

We begin by assuming that there exists a set of "true" parameters $\theta^*$ for the plant so that for each $x \in \mathbb{R}^n$ and $v \in \mathbb{R}^q$ we have $\hat{u}(\theta^*, x, v) = u_p(x, v)$. In this case, we can write our parameter estimation error as $\phi = \theta - \theta^*$, so that $\theta = \theta^* + \phi$.

With the gain matrix $K$ constructed as in Section II-C, an estimate for the feedback linearization-based tracking controller is of the form

$u = \hat{u}(\theta, x, y_d^{(\gamma)} + Ke).$  (19)

When this control law is applied to the system, the closed-loop error dynamics take the form

$\dot{e} = (A + BK)e + BW(x, y_d^{(\gamma)}, e)\phi$  (20)

where $W$ is a complicated function of $x$, $y_d^{(\gamma)}$ and $e$ which contains terms involving the basis functions $\beta_k$ and $\alpha_k$. The exact form of this function can be found in the technical report. The term $BW\phi$ captures the effects that the parameter estimation error has on the closed-loop error dynamics. As we have done here, we will frequently drop the arguments of $W$ to simplify notation. We will also write $W(t)$ for $W(x(t), y_d^{(\gamma)}(t), e(t))$ when we wish to emphasize the dependence of the function on time.

Ideally, we would like to drive $\phi \to 0$ as $t \to \infty$ so that we obtain the desired closed-loop error dynamics (14). Recalling from Section II-B that the reference model is designed such that $B^T B = I$, this suggests applying the least-squares cost signal

$R(t) = \frac{1}{2}\|BW\phi\|_2^2 = \frac{1}{2}\|W\phi\|_2^2$  (21)

and following the negative gradient of the cost with the update rule

$\dot{\phi} = -W^T W \phi.$  (22)

Least-squares gradient-following algorithms of this sort are well studied in the adaptive control literature [sastry1989adaptive, Chapter 2]. Since $\dot{\phi} = \dot{\theta}$, this suggests that the parameters should also be updated according to $\dot{\theta} = -W^T W \phi$. Altogether, we can represent the tracking and parameter error dynamics with the linear time-varying system

$\begin{bmatrix} \dot{e} \\ \dot{\phi} \end{bmatrix} = \underbrace{\begin{bmatrix} A + BK & BW(t) \\ 0 & -W^T(t)W(t) \end{bmatrix}}_{\mathcal{A}(t)} \begin{bmatrix} e \\ \phi \end{bmatrix}.$  (23)

Letting $X = (e, \phi)$, the solution to this system is given by

$X(t) = \Phi(t, 0)X(0)$  (24)

where for each $t \geq 0$ the state transition matrix $\Phi(t, 0)$ is the solution to the matrix differential equation $\frac{d}{dt}\Phi(t, 0) = \mathcal{A}(t)\Phi(t, 0)$ with initial condition $\Phi(0, 0) = I$, where $I$ is the identity matrix of appropriate dimension. From the adaptive control literature, it is well known that if $W$ is "persistently exciting" in the sense that there exists $\delta > 0$ such that for each $t_0 \geq 0$

$c_1 I \geq \int_{t_0}^{t_0 + \delta} W^T(t)W(t)\,dt \geq c_2 I$  (25)

for some $c_1, c_2 > 0$, then the time-varying system (23) is exponentially stable, provided $W$ also remains bounded. Intuitively, this condition simply ensures that the regressor term $W$ is "rich enough" during the learning process to drive $\phi \to 0$ exponentially quickly. Observing (20), we also see that if $\phi \to 0$ exponentially quickly then $e \to 0$ exponentially as well. We formalize this point with the following Lemma:
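The claim that persistence of excitation drives $\phi \to 0$ exponentially can be sanity-checked numerically. The sketch below Euler-integrates the update (22) with a hand-picked rotating regressor $W(t) = [\sin t, \cos t]$, which satisfies (25) (this regressor is our illustrative choice, not one derived from the paper's system):

```python
import numpy as np

# Numerical check (our construction): Euler-integrating the update (22) with
# the rotating regressor W(t) = [sin t, cos t], which satisfies the PE
# condition (25), drives the parameter error to zero exponentially.
dt, T = 1e-3, 20.0
phi = np.array([1.0, -2.0])
for k in range(int(T / dt)):
    t = k * dt
    W = np.array([np.sin(t), np.cos(t)])  # persistently exciting regressor
    phi = phi - dt * W * (W @ phi)        # phi_dot = -W^T W phi, eq. (22)
print(np.linalg.norm(phi) < 1e-3)  # True: exponential decay
```

Note that $W^T(t)W(t)$ is rank-one at every instant; it is only the integral over a window, as in (25), that is positive definite, which is exactly why excitation must persist over time.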

Lemma 1

Let the persistence of excitation condition (25) hold, and assume that there exists $\bar{W} > 0$ such that $\|W(t)\| \leq \bar{W}$ for each $t \geq 0$. Then there exist $M > 0$ and $\zeta > 0$ such that for each $t_1 \geq t_2 \geq 0$

$\|\Phi(t_1, t_2)\| \leq M e^{-\zeta(t_1 - t_2)}$  (26)

with $\Phi$ defined as above.

Proof of this result can be found in the Appendix, but variations of it can be found in standard adaptive control texts [sastry1989adaptive]. Unfortunately, the update rule (22) cannot be directly implemented, since we do not know the terms $W$ or $\phi$. In the next section we introduce a model-free update rule for the parameters of the learned controller which approximates the continuous update (22) without requiring direct knowledge of $W$ or $\phi$.

III-B Model-free updates in the sampled-data setting

Hereafter, we will assume that the control supplied to the plant can only be updated every $\Delta t$ seconds. While this setting provides a more realistic model for many robotic systems, sampling has the unfortunate effect of destroying the affine relationship between the plant's inputs and outputs [grizzle1988feedback] which was key to the continuous-time design techniques discussed above. Nevertheless, we now introduce a framework for approximately matching the ideal tracking and parameter error dynamics introduced in the previous section in the sampled-data setting, using an Euler discretization of the continuous-time reward (21) and a policy-gradient based parameter update rule.

Before introducing our sampled-data control law and adaptation scheme, we first fix notation and discuss a few key assumptions our analysis will employ. To begin, we let $t_k = k\Delta t$ for each $k \in \mathbb{N}$ denote the sampling times for the system. Letting $x(\cdot)$ denote the trajectory of the plant, we let $x_k = x(t_k)$ denote the state of the plant at the $k$-th sample. Similarly, we let $\xi(\cdot)$ denote the trajectory of the outputs and their derivatives as in (10), and we set $\xi_k = \xi(t_k)$ (not to be confused with the $k$-th entry of $\xi$). Next, we let $u_k$ denote the input applied to the plant on the interval $[t_k, t_{k+1})$. The parameters for our learned controller will be updated only at the sampling times, and we let $\theta_k$ denote the value of the parameters on $[t_k, t_{k+1})$. We again let $y_d(\cdot)$, $\xi_d(\cdot)$ and $y_d^{(\gamma)}(\cdot)$ denote the desired trajectory for the outputs and their appropriate derivatives, and let $\xi_{d,k} = \xi_d(t_k)$, $y_{d,k}^{(\gamma)} = y_d^{(\gamma)}(t_k)$, and $e_k = \xi_k - \xi_{d,k}$. We make the following assumption about the desired output signals and their derivatives:

Assumption 3

The signal $y_d^{(\gamma)}(\cdot)$ is continuous and uniformly bounded. Furthermore, for each $j \in \{1, \ldots, q\}$ the derivatives $y_{j,d}, \dot{y}_{j,d}, \ldots, y_{j,d}^{(\gamma_j)}$ are also continuous and uniformly bounded.

Remark 1

Typical convergence proofs in the continuous-time adaptive control literature generally only require that $y_d^{(\gamma)}$ be bounded, but these methods also assume that the input to the plant can be updated continuously. In the sampled-data setting, we additionally require the continuity of $y_d^{(\gamma)}$ to ensure that it does not vary too much within a given sampling period.

After sampling, the discrete-time tracking error dynamics obey a difference equation of the form

$e_{k+1} = H_k(x_k, e_k, u_k)$  (27)

where $H_k$ is obtained by integrating the dynamics of the nonlinear system and the reference trajectory over $[t_k, t_{k+1}]$. Generally, $H_k$ will no longer be affine in the input. However, the relationship is approximately affine for small values of $\Delta t$. Indeed, with Assumptions 3 and 5 in place, if we apply the control law

$u_k = \hat{u}(\theta_k, x_k, y_{d,k}^{(\gamma)} + Ke_k),$  (28)

then an Euler discretization of the continuous-time error dynamics (20) yields

$e_{k+1} = e_k + \Delta t(A + BK)e_k + \Delta t B W_k \phi_k + O(\Delta t^2)$  (29)

where we have set $W_k = W(t_k)$. Thus, letting $\phi_k = \theta_k - \theta^*$, for small $\Delta t$ the continuous-time cost (21) is well approximated by

$R_k(x_k, e_k, u_k) = \frac{1}{2}\left\|\frac{e_{k+1} - e_k}{\Delta t} - (A + BK)e_k\right\|_2^2$  (30)

where we note that $e_k$ and $e_{k+1}$ are both quantities which can be measured by numerically differentiating the outputs from the plant. Intuitively, the sampled-data cost provides a measure for how well the control $u_k$ matches the desired change in the tracking error (20) over the interval $[t_k, t_{k+1}]$.

Next, we add probing noise to the control law (28) to ensure that the input is sufficiently exciting and to enable the use of policy-gradient methods for estimating the gradient of the discrete-time cost signal. In particular, we will draw the input according to $u_k \sim \pi_k(\cdot|\theta_k, x_k, e_k)$, where

$\pi_k(\cdot|\theta_k, x_k, e_k) = \hat{u}(\theta_k, x_k, y_{d,k}^{(\gamma)} + Ke_k) + w_k$  (31)

and $w_k$ is additive zero-mean Gaussian noise with covariance $\sigma^2 I$. Methods for selecting the variance-scaling term $\sigma > 0$ will be discussed below; for now it is sufficient to assume that $\sigma$ is bounded.

With the addition of the random noise we now define the objective

$J_k(\theta_k) = \mathbb{E}_{u_k \sim \pi_k(\cdot|\theta_k, x_k, e_k)}\left[R_k(x_k, e_k, u_k)\right],$  (32)

noting that it is also common for policy-gradient methods to use an expected "cost-to-go" as the objective. Regardless, using the policy-gradient theorem [sutton2000policy], the gradient of $J_k$ can be written as

$\nabla_{\theta_k} J_k(\theta_k) = \mathbb{E}_{\pi_k}\left[R_k(x_k, e_k, u_k) \cdot \nabla_{\theta_k} \log \pi_k(u_k|\theta_k, x_k, e_k)\right]$  (33)

where the expectation accounts for the randomness due to the input $u_k$.

Moreover, a noisy, unbiased estimate of $\nabla_{\theta_k} J_k(\theta_k)$ is given by

$\hat{J}_k = R_k(x_k, e_k, u_k)\,\nabla_{\theta_k} \log \pi_k(u_k|\theta_k, x_k, e_k)$  (34)

where $u_k$ is the actual input applied to the plant over the $k$-th time interval. Recall that $R_k$ can be directly calculated using $e_k$, $e_{k+1}$ and (30), and $\nabla_{\theta_k} \log \pi_k$ can also be computed since the derivatives of $\hat{u}$ (and thus of $\pi_k$) with respect to $\theta_k$ are known to us. Thus, $\hat{J}_k$ can be computed using values that we have assumed we can measure. However, since the input $u_k$ is random, the gradient estimate is drawn according to

$\hat{J}_k \sim \Delta\hat{J}_k(\cdot|\theta_k, x_k, e_k)$  (35)

where the random variable $\Delta\hat{J}_k(\cdot|\theta_k, x_k, e_k)$ is constructed using the relationship (34). Using our estimate of the gradient for the discrete-time reward, we propose the following noisy update rule for the parameters of our learned controller:

$\theta_{k+1} = \theta_k - \Delta t\,\hat{J}_k$  (36)

Putting it all together, the sampled-data stochastic version of our error dynamics becomes

$e_{k+1} = H_k(x_k, e_k, u_k), \qquad \phi_{k+1} = \phi_k - \Delta t\,\hat{J}_k$  (37)
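For a Gaussian policy such as (31), the score function appearing in (34) has the familiar closed form $\nabla_\theta \log \pi = \nabla_\theta \hat{u}\,(u - \hat{u})/\sigma^2$. The sketch below (with an illustrative feature vector standing in for the paper's basis) assembles the estimate $\hat{J}_k$ from a measured reward:

```python
import numpy as np

# Sketch of the Gaussian-policy score behind (34): for
# u_k ~ N(u_hat(theta_k, .), sigma^2 I) we have
#   grad_theta log pi = grad_theta u_hat * (u_k - u_hat) / sigma^2,
# so J_hat_k = R_k * score needs only the measured reward and known features.
# The feature vector is an illustrative stand-in for the paper's basis.
sigma = 0.1
features = np.array([1.0, 0.5, -0.3])

def u_hat(theta):
    # Linear-in-parameters controller estimate (scalar-input case).
    return theta @ features

def grad_estimate(theta, u_applied, reward):
    score = features * (u_applied - u_hat(theta)) / sigma**2
    return reward * score  # R_k * grad log pi, as in (34)

rng = np.random.default_rng(1)
theta = np.zeros(3)
u = u_hat(theta) + sigma * rng.normal()  # draw from the Gaussian policy (31)
J_hat = grad_estimate(theta, u, reward=2.0)
print(J_hat.shape)  # (3,)
```

Because the score has zero mean under the policy, subtracting any action-independent quantity from the reward leaves the estimate unbiased, which is the fact exploited by the baselines of Section III-D.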

Assumption 4

There exists a constant $C_x > 0$ such that $\|x(t)\| \leq C_x$ for all $t \geq 0$ almost surely.

Assumption 5

There exists a constant $C_J > 0$ such that $\|\theta_k\| \leq C_J$ and $\|\hat{J}_k\| \leq C_J$ for each $k \in \mathbb{N}$ almost surely.

Assumption 4 ensures that the additive noise does not drive the state to be unbounded during a single sampling interval, while Assumption 5 ensures that the gradient estimate does not become undefined during the learning process. These important technical assumptions are common in the theory of stochastic approximations [borkar2009stochastic], and allow us to characterize the estimator for the gradient as follows:

Lemma 2

Let Assumptions 3-5 hold. Then $\Delta\hat{J}_k(\cdot|\theta_k, x_k, e_k)$ is a sub-Gaussian distribution where

$\mathbb{E}[\Delta\hat{J}_k(\cdot|\theta_k, x_k, e_k)] = W_k^T W_k \phi_k + O(\Delta t(1 + \sigma + \sigma^2))$  (38)

and

$\|\Delta\hat{J}_k(\cdot|\theta_k, x_k, e_k)\|_{\psi_2} = O(1/\sigma).$  (39)

The Lemma demonstrates a trade-off between the bias and variance of the gradient estimate that has been observed in the reinforcement learning literature [silver2014deterministic, zhao2011analysis]. Specifically, the noise-dependent portion of the bias of the gradient estimate decreases as $\sigma \to 0$, but this causes the variance of the estimator to blow up, as indicated by the growing sub-Gaussian norm. Moreover, the bias of the gradient estimate has a term which is $O(\Delta t)$ and which does not depend on the amount of noise added to the system. This term comes from the fact that we have resorted to using the finite-difference approximation (30) of the gradient of the continuous-time reward in the sampled-data setting. Due to this inherent bias, little is gained by decreasing $\sigma$ past the point where the noise-dependent bias terms are dominated by the $O(\Delta t)$ term. Next, we analyze the overall behavior of (37).

III-C Convergence analysis

The main idea behind our analysis is to model our sampled-data error dynamics (37) as a perturbation to the idealized continuous-time error dynamics (23), as is commonly done in the stochastic approximation literature [borkar2009stochastic]. Under the assumption that $W$ is persistently exciting, the nominal continuous-time dynamics are exponentially stable, and we observe that the total perturbation accumulated over each sampling interval decays exponentially as time goes on. Due to space constraints, we outline the main points of the analysis here but leave the details to the technical report.

Our analysis makes use of the piecewise-linear curve $\bar{\phi}(\cdot)$ which is constructed by interpolating between $\phi_k$ and $\phi_{k+1}$ along the interval $[t_k, t_{k+1}]$. That is, for $t \in [t_k, t_{k+1}]$ we define

$\bar{\phi}(t) = \phi_k + \frac{t - t_k}{\Delta t}\left(\phi_{k+1} - \phi_k\right).$

Combining the tracking error and the interpolated parameter error into the state $X(t) = (e(t), \bar{\phi}(t))$, we may write

$\frac{d}{dt}X(t) = \mathcal{A}(t)X(t) + \delta(t)$  (40)

where for each $t \geq 0$ the dynamics matrix $\mathcal{A}(t)$ is constructed as in (23), and the disturbance $\delta(t)$ captures the deviation from the idealized continuous dynamics caused at each instant of time by the sampling, the additive noise, and the process of interpolating the parameter error. Again letting $\Phi(t, 0)$ denote the solution to $\frac{d}{dt}\Phi(t, 0) = \mathcal{A}(t)\Phi(t, 0)$ with initial condition $\Phi(0, 0) = I$, for each $t \geq 0$ we have that

$X(t) = \Phi(t, 0)X(0) + \int_0^t \Phi(t, \tau)\delta(\tau)\,d\tau.$  (41)

Now, if we let $X_k = X(t_k)$ for each $k \in \mathbb{N}$, we can instead write

$X_k = \Phi(t_k, 0)X_0 + \sum_{i=0}^{k-1} \Phi(t_k, t_{i+1}) \underbrace{\int_{t_i}^{t_{i+1}} \Phi(t_{i+1}, \tau)\delta(\tau)\,d\tau}_{\delta_i},$  (42)

where the term $\delta_i$ is the total disturbance accumulated over the interval $[t_i, t_{i+1}]$. We separate the effects the disturbance has on the tracking and parameter error dynamics by letting $\delta_k^e$ denote the first $|\gamma|$ elements of $\delta_k$ and letting $\delta_k^\phi$ denote the remaining entries. On the interval $[t_k, t_{k+1}]$ the disturbance can be written as a function of $\theta_k$, $x_k$ and $e_k$. Since $u_k$ is a random function of these quantities, for fixed $\theta_k$, $x_k$ and $e_k$ the two components of $\delta_k$ are distributed according to

$\delta_k^e \sim \Delta_k^e(\cdot|\theta_k, x_k, e_k) \quad \text{and} \quad \delta_k^\phi \sim \Delta_k^\phi(\cdot|\theta_k, x_k, e_k).$  (43)

These random variables are constructed by integrating the disturbance over $[t_k, t_{k+1}]$, and an explicit representation of these variables can be found in the proof of the following Lemma, which is given in the Appendix.

Lemma 3

Let Assumptions 3-5 hold. Then $\Delta_k^e(\cdot|\theta_k, x_k, e_k)$ and $\Delta_k^\phi(\cdot|\theta_k, x_k, e_k)$ are sub-Gaussian random variables where

$\|\mathbb{E}[\Delta_k^e(\cdot|\theta_k, x_k, e_k)]\|_2 = O(\Delta t^2(1 + \sigma + \sigma^2))$  (44)
$\|\mathbb{E}[\Delta_k^\phi(\cdot|\theta_k, x_k, e_k)]\|_2 = O(\Delta t^2(1 + \sigma + \sigma^2))$  (45)
$\|\Delta_k^e(\cdot|\theta_k, x_k, e_k)\|_{\psi_2} = O(\Delta t\,\sigma)$  (46)
$\|\Delta_k^\phi(\cdot|\theta_k, x_k, e_k)\|_{\psi_2} = O(\Delta t / \sigma).$  (47)

Next, for each $i \in \mathbb{N}$ we put $\varepsilon_i = \mathbb{E}[\delta_i]$, and then define the zero-mean random variables $M_i^e = \delta_i^e - \mathbb{E}[\delta_i^e]$ and $M_i^\phi = \delta_i^\phi - \mathbb{E}[\delta_i^\phi]$. Our overall discrete-time process can then be written as

$X_k = \Phi(t_k, 0)X_0 + \sum_{i=0}^{k-1} \Phi(t_k, t_{i+1})\left(\varepsilon_i + M_i\right)$  (48)

where $\varepsilon_i$ is constructed by stacking $\mathbb{E}[\delta_i^e]$ on top of $\mathbb{E}[\delta_i^\phi]$, and $M_i$ is constructed by stacking $M_i^e$ on top of $M_i^\phi$. Now, if we assume that $W$ is persistently exciting, then for each $i \leq k$ we have

$\|\Phi(t_k, t_{i+1})\| \leq M e^{-\zeta(t_k - t_{i+1})} = M\rho^{k-i-1}$  (49)

where $M$ and $\zeta$ are as in Lemma 1 and we have put $\rho = e^{-\zeta\Delta t}$. Thus, under this assumption, we may use the triangle inequality to bound

$\|X_k\| \leq M\rho^k\|X_0\| + \sum_{i=0}^{k-1} M\rho^{k-i-1}\left(\|\varepsilon_i\| + \|M_i\|\right).$  (50)

Thus, when $W$ is persistently exciting, we see that the effects of the disturbance accumulated at each time step decay exponentially as time goes on, along with the effects of the initial tracking and parameter errors. A full proof of the following Theorem is given in the Appendix, but the main idea is to use properties of geometric series to bound $\|\mathbb{E}[X_k]\|$ over time, and to use the concentration inequality from [vershynin2018high, Theorem 2.6.3] to bound the deviation of $X_k$ from its mean.

Theorem 1

Let Assumptions 3-5 hold. Further assume that $W$ is persistently exciting, let $M$ and $\zeta$ be defined as in Lemma 1, and put $\rho = e^{-\zeta\Delta t}$. Then there exist numerical constants $C_1$ and $C_2$ such that

$\|\mathbb{E}[X_k]\| \leq M\rho^k\|X_0\| + \frac{M C_1 \Delta t(1 + \sigma + \sigma^2)}{\zeta}$  (51)

and for each $k \in \mathbb{N}$ and $\epsilon \in (0, 1)$, with probability $1 - \epsilon$ we have

$\|X_k - \mathbb{E}[X_k]\| \leq C_2 M\left(\sigma + \frac{1}{\sigma}\right)\sqrt{\frac{\Delta t}{\zeta}}\sqrt{\log\frac{2}{\epsilon}}.$  (52)

Despite the high variance of the simple policy-gradient parameter update analyzed thus far, the Theorem demonstrates that with high probability our tracking and parameter errors concentrate around the origin. As $\Delta t$ decreases, the bias introduced by the sampling and additive noise diminishes, as does the radius of our high-probability bound. These bounds also become tighter as the exponential rate of decay $\zeta$ for the idealized continuous-time dynamics increases. The Theorem again displays the trade-off between the bias and variance of the learning scheme observed in Lemma 2. However, we still observe in equation (51) that the bias introduced by the noise is relatively small, meaning $\sigma$ does not have to be made so small as to degrade the bound in (52).

III-D Variance Reduction via Baselines

It is common for policy gradients to be implemented with a baseline [williams1992simple]. In this case, the gradient estimator in (34) may become biased, though it often has lower variance [sutton2018reinforcement, greensmith2004variance]. The expression with a baseline is

$\hat{J}_k = \left(R_k(x_k, e_k, u_k) - S_k(x_k, e_k, u_k)\right) \cdot \nabla_{\theta_k} \log \pi_k(u_k|\theta_k, x_k, e_k),$  (53)

where $S_k$ is an estimate of $R_k$. If $S_k$ does not depend on $u_k$, then the addition of the baseline does not add any bias to the gradient estimate [sutton2018reinforcement]. For example, in our numerical example below we use a simple average-of-past-rewards baseline by setting $S_k = \frac{1}{k}\sum_{i=0}^{k-1} R_i$, where $R_i$ is the $i$-th reward recorded. We consider it a matter of future work to rigorously study the effects of this and other common baselines from the reinforcement learning literature within the theoretical framework we have developed.
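A minimal sketch of the baseline-corrected estimate (53) with a running-average baseline (the rewards and scores below are synthetic placeholders, not quantities from the simulation):

```python
import numpy as np

# Sketch of the baseline-corrected estimate (53) with a running-average
# baseline: S_k = mean of rewards recorded so far.  Rewards and score
# vectors are synthetic placeholders.
def baselined_estimates(rewards, scores):
    out, past = [], []
    for r, s in zip(rewards, scores):
        S = np.mean(past) if past else 0.0  # average of past rewards
        out.append((r - S) * s)             # (R_k - S_k) * grad log pi
        past.append(r)
    return out

rewards = [1.0, 3.0, 2.0]
scores = [np.ones(2)] * 3
ests = baselined_estimates(rewards, scores)
print([e[0] for e in ests])  # [1.0, 2.0, 0.0]
```

Since the running average depends only on past intervals and not on the current input $u_k$, this particular choice keeps the estimate unbiased while shrinking the magnitude of the reward term.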

IV Numerical Example

Our numerical example examines the application of our method to the double pendulum depicted in Figure 1 (a), whose dynamics can be found in [shinbrot1992chaos]. With a slight abuse of notation, the system has generalized coordinates which represent the angles the two arms make with the vertical. Taking the joint angles and their velocities as the state, the system can be represented with a state-space model of the form (1), where the angles of the two joints are chosen as outputs. It can be shown that the vector relative degree is (2, 2), so the system can be completely linearized by state feedback.
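Since the system is fully actuated with vector relative degree (2, 2), the exact linearizing controller takes the familiar computed-torque form. A minimal sketch, with M and h as placeholder inertia and bias terms (not the paper's notation):

```python
import numpy as np

def computed_torque(M, h, v):
    # For manipulator dynamics M(q) qdd + h(q, qd) = u, the choice
    # u = M(q) v + h(q, qd) yields qdd = v: two decoupled double
    # integrators, i.e. exact feedback linearization.
    return M @ v + h
```

With inaccurate parameter estimates, the same formula evaluated on the estimated model gives the nominal (imperfect) linearizing controller that the learned terms then correct.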

The dynamics of the system depend on the parameters m_1, m_2, l_1 and l_2, where m_i is the mass of the i-th link and l_i its length. For the purposes of our simulation, we fix true values of these parameters for the plant. However, to set up the learning problem, we assume that we only have inaccurate measurements for each of these parameters; that is, each estimated parameter is scaled to a fixed multiple of its true value. Our nominal model-based linearizing controller is constructed by computing the linearizing controller for the dynamics model which corresponds to the inaccurate parameter estimates. The learned component of the controller is then constructed by using radial basis functions to populate the entries of the learned correction terms. In total, 250 radial basis functions were used.
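A sketch of the radial basis function parameterization (Gaussian bases and a linear-in-parameters readout are our assumptions; the paper does not specify the basis shape here):

```python
import numpy as np

def rbf_features(x, centers, width):
    # Gaussian radial basis functions phi_i(x) = exp(-||x - c_i||^2 / (2 w^2)),
    # evaluated at every center; returns a vector of length len(centers).
    sq_dists = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * width**2))

# Each learned entry is then a linear combination of the features,
# e.g. entry(x) = weights @ rbf_features(x, centers, width), so the
# policy-gradient updates act on the weight vectors.
```
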

For the online learning problem we fixed the sampling interval Δt and the level of probing noise σ. The reward was regularized using an average sum-of-rewards baseline as described in Section III-D. The reference trajectory for each of the output channels was constructed by summing together sinusoidal functions whose frequencies are non-integer multiples of each other, to ensure that the entire region of operation was explored. The feedback gain matrix K was designed so that each of the eigenvalues of A + BK is placed at a single desired stable value, where A and B are the appropriate matrices in the reference model for the system.
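For each output channel the reference error dynamics are a chain of two integrators, so the eigenvalue placement reduces to matching a characteristic polynomial. A sketch for one channel (the eigenvalue location lam is illustrative, not the paper's value):

```python
import numpy as np

# Chain-of-integrators error dynamics for one output channel of a
# relative-degree-2 system: edot = A e + B v
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])

lam = -2.0  # desired (repeated) closed-loop eigenvalue

# Placing both eigenvalues at lam gives characteristic polynomial
# (s - lam)^2 = s^2 - 2*lam*s + lam^2, so v = K e with
# K = [-lam^2, 2*lam] achieves the placement.
K = np.array([[-lam**2, 2.0 * lam]])

eigs = np.linalg.eigvals(A + B @ K)
```
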

Figure 1 (b) shows the norm of the tracking error of the learning scheme over time while Figure 1 (c) shows the norm of the tracking error for the nominal model-based controller with no learning. Note that the learning-based approach is able to steadily reduce the tracking error over time while keeping the system stable.

V Conclusion

This paper developed an adaptive framework which employs model-free policy-gradient parameter update rules to construct a feedback-linearization based tracking controller for systems with unknown dynamics. We combined analysis techniques from the adaptive control literature and the theory of stochastic approximations to provide high-confidence tracking guarantees for the closed-loop system, and demonstrated the utility of the framework through a simulation experiment. Beyond the immediate utility of the proposed framework, we believe the analysis tools we developed provide a foundation for studying the use of reinforcement learning algorithms for online adaptation.

The following Appendices contain items which were too long to present in the main body of the document. Appendix A contains two auxiliary Lemmas which are used extensively throughout the main proofs of Lemma 2 in Appendix B, Lemma 3 in Appendix C, and Theorem 1 in Appendix D. Appendix E introduces the explicit form of the error equations in equation (20), and finally Appendix F provides a proof of Lemma 1.

-A Auxiliary Lemmas

Lemma 4

Let Assumptions 3-5 hold. Then there exists a constant such that

 (54)

and

 (55)

The bound in (54) follows directly from Assumption 5 and the smoothness of the vector field of the plant. The bound in (55) follows from Assumption 3 and the continuity of the basis elements.

Lemma 5

Let Assumptions 3-5 hold. Then there exists a constant such that for each k the three bounds established in the proof below hold.

First, we have that the bound holds on the interval in question. By our standing Assumptions and the continuity of the relevant terms there exists a finite constant such that the quantity is bounded. This implies the first bound, as desired. The bound on the second quantity follows by an analogous argument. To prove the final bound we recall its defining expression, which is given in equation (75) below, and we see it is bounded under our standing Assumptions. Thus there exists a constant bounding it, and the desired result follows from the above observations.

-B Proof of Lemma 2

Next, we note that we may rewrite

 (56)

where for convenience of notation we have defined

 ^uθk=^u(θk,xk,yγd,k+Kek) (57)

and suppressed the dependence on x_k and e_k. Thus, we have that

 (58)

Now, since we can calculate the desired quantity by first calculating how it varies with the parameters, we have that

 (59)

and noting that we can rewrite the above expression as

 ek+1=ek+Δt(A+BK)ek+ΔtBWk
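The discretized error recursion above can be checked numerically; a small sketch (the gains and step size are illustrative) confirms the exponential decay of the tracking error when the disturbance W_k is zero:

```python
import numpy as np

dt = 0.01
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-4.0, -4.0]])   # places both eigenvalues of A + BK at -2
A_cl = A + B @ K

e = np.array([1.0, 0.0])        # initial tracking error
for k in range(1000):
    W_k = np.zeros(1)           # disturbance term set to zero to isolate decay
    e = e + dt * (A_cl @ e) + dt * (B @ W_k)
# after 10 seconds of simulated time the error has contracted to near zero
```

With a nonzero bounded W_k the same recursion instead converges to a ball around the origin, matching the bound in (51).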