Learning from Incremental Directional Corrections

11/30/2020 ∙ by Wanxin Jin, et al. ∙ Purdue University, Northwestern University

This paper proposes a technique that enables a robot to learn a control objective function incrementally from a human user's corrections. The human's corrections can be as simple as directional corrections – corrections that indicate the direction of a control change without indicating its magnitude – applied at some time instances during the robot's motion. We only assume that each of the human's corrections, regardless of its magnitude, points in a direction that improves the robot's current motion relative to an implicit objective function. The proposed method uses the direction of a correction to update the estimate of the objective function based on a cutting plane technique. We establish theoretical results showing that this process of incremental correction and update guarantees convergence of the learned objective function to the implicit one. The method is validated by both simulations and two human-robot games, where human players teach a 2-link robot arm and a 6-DoF quadrotor system to plan motion in environments with obstacles.

Code repository: Learning-from-Directional-Corrections – a method for a robot to learn a control objective from a human user's directional corrections.
I Introduction

For tasks where robots work in proximity to human users, a robot is required not only to accomplish a task but also to complete it in a way that the human user prefers. Different users may have different preferences about how the robot should perform the task. Such customized requirements usually lead to a considerable robot-programming workload, since they require human users to have the expertise to design and repeatedly tune the robot's controllers until its behavior is satisfactory.

To circumvent the expertise requirement in traditional robot programming, learning from demonstrations (LfD) empowers a non-expert human user to program a robot by only providing demonstrations. In existing LfD techniques [31], a human user first provides a robot with behavioral demonstrations in a one-time manner, and the robot then learns a control policy or a control objective function offline from the demonstrations. Successful examples include autonomous driving [19], robot manipulation [9], and motion planning [17]. In some practical cases, the one-time and offline nature of LfD can introduce challenges. For example, when the demonstrations are insufficient to infer the objective function, due to low data informativeness [15] or significant deviation from optimal data [12], new demonstrations have to be re-collected and the robot has to be re-trained. Importantly, acquiring an optimal demonstration in a one-shot fashion for systems with high degrees of freedom can be challenging [12], because the human demonstrator has to move the robot in all degrees of freedom in a spatially and temporally consistent manner.

In this work, we address the above challenges by developing a new programming scheme that enables a non-expert human user to program a robot by incrementally improving the robot's motion. For instance, consider a robot that plans its motion under a (randomly initialized) control objective function. While the robot is executing the motion, a human user who supervises the robot may find the motion unsatisfactory; thus, the human user applies a correction to the robot during its motion execution. The robot then uses the correction to update its control objective function. This process of planning-correction-update repeats until the robot achieves a control objective function whose resulting trajectory agrees with the human user's expectation. In this learning procedure, each of the human's corrections does not need to move the robot to the optimal motion; it merely needs to incrementally improve the robot's current motion towards the human's expectation, thus reducing the workload of a non-expert user compared to LfD. In addition to this incremental learning capability, the proposed learning-from-directional-corrections technique has the following highlights.

1) The proposed method only requires the human's directional corrections. A directional correction is a correction that contains only directional information and does not need to be magnitude-specific. For instance, when teaching a mobile robot, directional corrections can simply be 'left' or 'right', without dictating how far the robot should move.

2) The human’s directional corrections to the robot’s motion can be sparse. That means that the corrections can be applied only at sparse time instances within the time horizon of the robot’s motion. The learning is performed directly based on the sparse corrections, without attaining/retaining any intermediate corrected trajectory that may introduce inaccuracy.

3) Both theoretical results and experiments are provided to show the convergence of the proposed learning algorithm. Specifically, we validate the method in two human-robot games, and the results show that the proposed method enables a robot to efficiently learn a control objective function for the desired motion with only a few directional corrections.

I-A Related Work

I-A1 Offline Learning from Demonstrations

To learn a control objective function from demonstrations, the available approaches include inverse optimal control [25, 29, 14] and inverse reinforcement learning [28, 35, 30], where, given optimal demonstrations, an objective function that explains the demonstrations is inferred and then used for motion control and planning. Despite the significant progress achieved in theory and applications [18, 16, 24, 19, 9], LfD approaches can be inconvenient in some practical situations. First, demonstrations in LfD are usually given in a one-time manner and the learning is usually performed offline after the demonstrations are obtained. When the given demonstration data is insufficient to learn the objective function from, e.g., due to low data informativeness as discussed in [15], or when the demonstrations deviate significantly from optimal ones, the data has to be re-collected and the whole learning process has to be re-run. Second, existing LfD techniques [29, 14, 28, 35, 30] normally assume optimality of the demonstration data, which is challenging to achieve for robots with high degrees of freedom. For example, when producing demonstrations for a humanoid robot, a human demonstrator has to account for the motion in all degrees of freedom in a spatially and temporally consistent manner [12].

I-A2 Online Learning from Feedback or Physical Corrections

Compared to offline LfD, learning from corrections or feedback enables a human user to incrementally correct the robot's current motion, making it more accessible to non-expert users who cannot provide optimal demonstrations in a one-time manner [13]. The key assumption in learning from corrections or feedback is that the corrected robot motion is better than the motion before the correction. Under this assumption, [12] proposes a co-active learning method, in which a robot receives a human's feedback to update its objective function. The human's feedback includes the passive selection of a top-ranked robot trajectory or active physical interference to provide a preferred robot trajectory. By defining a learning regret, which quantifies the average misalignment of the score values between the human's intended trajectory and the robot's trajectory under the human's implicit objective function, the authors show convergence of the regret. However, since the regret is an average indicator over the entire learning process, one still cannot explicitly tell whether the learned objective function actually converges towards the human's implicit one.

Very recently, the authors in [3, 34, 23] approach learning from corrections from the perspective of a partially observable Markov decision process (POMDP), where the human's corrections are viewed as observations of the unknown objective function parameters. By approximating the observation model and applying maximum a posteriori estimation, they obtain a learning update similar to that of co-active learning [12]. To handle sparse corrections, which a human user applies only at a few time instances during the robot's motion, these methods apply the trajectory deformation technique [7] to interpret each single-time-step correction as a human intended trajectory, i.e., a deformed robot trajectory. Although achieving promising results, choosing the hyper-parameters in the trajectory deformation is challenging and can affect the learning performance [34]. In addition, these methods do not provide a convergence guarantee for the learning process.

Both the above co-active learning and the POMDP-based learning require a dedicated setup or process to obtain the human intended/feedback trajectories. Specifically, in co-active learning, a robot is switched to the screening and zero-force gravity-compensation modes to obtain a human feedback trajectory, and in the POMDP-based methods, the human intended trajectory is obtained by deforming the robot's current trajectory around a correction using the trajectory deformation method. These intermediate steps may introduce inaccurate artifacts into the learning process, which can lead to failure of the approach. For example, when a user physically corrects a robot, the magnitude of a correction, i.e., how much correction should be given, can be difficult to determine. If not chosen properly, the given correction may overshoot, i.e., correct too much. Such an overshooting correction can make the obtained human feedback trajectory violate the assumption of improving the robot's motion. In fact, as we will demonstrate in Sections II and V-C, the closer the robot gets to the expected trajectory, the more difficult it becomes to choose a proper correction magnitude, which can lead to learning inefficiency. Also, for the POMDP-based methods, when one applies trajectory deformation, the choice of hyper-parameters determines the shape of the human intended trajectory and thus affects the learning performance, as discussed in [34].

I-B Contributions

This paper develops a new method to learn a robot objective function incrementally from a human's directional corrections. Compared to the existing methods above, the distinctions and contributions of the proposed method are as follows.

  • The proposed method learns a robot control objective function using only the direction information of the human's corrections. It only requires that a correction, regardless of its magnitude, point in a direction that incrementally improves the robot's current motion. As we will show in Sections II and V-C, the feasible corrections that satisfy this requirement always account for half of the entire input space, giving a human user considerable flexibility in choosing corrections.

  • Unlike existing learning techniques, which usually require an intermediate setup/process to obtain a human intended trajectory, the proposed method learns a control objective function directly from directional corrections. The directional corrections can be sparse, i.e., applied only at some time instances within the time horizon of the robot's motion.

  • The proposed learning algorithm is developed based on the cutting plane technique, which has a straightforward geometric interpretation. We establish theoretical results showing the convergence of the learned objective function to the human's implicit one.

The proposed method is validated in two human-robot games based on a two-link robot arm and a 6-DoF quadrotor maneuvering system, where a human player, by applying directional corrections, teaches the robot motion control in environments with obstacles. The experimental results demonstrate that the proposed method enables a non-expert human player to train a robot to learn an effective control objective function for the desired motion with only a few directional corrections.

In the following, Section II describes the problem formulation. Section III presents the outline of the main algorithm. Section IV provides the theoretical results for the algorithm and its detailed implementation. Numerical simulations and comparisons are given in Section V. Section VI presents the experiments on two human-robot games. Conclusions are drawn in Section VIII.

II Problem Formulation

Consider a robot with the following dynamics:

$x_{t+1} = f(x_t, u_t), \quad t = 0, 1, 2, \dots \qquad (1)$

where $x_t \in \mathbb{R}^n$ is the robot state, $u_t \in \mathbb{R}^m$ is the control input, $f$ is differentiable, and $t$ is the time step. As is common in objective learning methods such as [28, 35, 30, 13, 12, 3, 34, 23, 15, 14], we suppose that the robot control cost function obeys the following parameterized form

$J(u_{0:T-1};\, \theta) = \sum_{t=0}^{T-1} \theta^\top \phi(x_t, u_t) + h(x_T) \qquad (2)$

where $\phi(x_t, u_t) \in \mathbb{R}^r$ is a vector of the $r$ specified features (or basis functions) for the running cost; $\theta \in \mathbb{R}^r$ is a vector of weights, which are tunable; and $h(x_T)$ is the final cost that penalizes the final state $x_T$. For a given choice of $\theta$, the robot chooses a sequence of inputs $u_{0:T-1}$ over the time horizon $T$ by optimizing (2) subject to (1), producing a trajectory

$\xi_{\theta} = \{x^{\theta}_{0:T},\; u^{\theta}_{0:T-1}\} \qquad (3)$

For readability, we occasionally write the cost function (2) evaluated along this trajectory as $J(\xi_\theta; \theta)$.

For a specific task, suppose that a human's expectation of the robot's trajectory corresponds to an implicit cost function of the same form as (2) with weight vector $\theta^*$. Here, we call $\theta^*$ the expected weight vector. In general, a human user may neither explicitly write down the value of $\theta^*$ nor demonstrate the corresponding optimal trajectory to the robot, but the human user can tell whether the robot's current trajectory is satisfactory or not. A trajectory of the robot is satisfactory if it minimizes $J(\,\cdot\,; \theta^*)$; otherwise, it is not satisfactory. In order for the robot to achieve $\theta^*$ (and thus generate a satisfactory trajectory), the human user is only able to make corrections to the robot during its motion, based on which the robot updates its guess of the weight vector towards $\theta^*$.

The process for a robot to learn from the human's corrections in this paper is iterative. Each iteration basically includes three steps: planning, correction, and update. Let $k = 0, 1, 2, \dots$ denote the iteration index and let $\theta_k$ denote the robot's weight vector guess at iteration $k$. At $k = 0$, the robot is initialized with an arbitrary weight vector guess $\theta_0$. At iteration $k$, the robot first performs trajectory planning, i.e., it obtains $\xi_{\theta_k}$ by minimizing the cost function $J(\,\cdot\,; \theta_k)$ in (2) subject to its dynamics (1). During the robot's execution of $\xi_{\theta_k}$, the human user gives a correction, denoted by $a_{t_k}$, to the robot in its input space. Here, $t_k \in \{0, 1, \dots, T-1\}$, called the correction time, indicates at which time step within the horizon the correction is made. After receiving $a_{t_k}$, the robot then performs the update, i.e., it changes its guess $\theta_k$ to $\theta_{k+1}$ according to an update rule to be developed later.

Each human’s correction is assumed to satisfy the following condition:

$\big\langle \bar{a}_k,\; -\nabla_{u} J(\xi_{\theta_k}; \theta^*) \big\rangle > 0 \qquad (4)$

Here

$\bar{a}_k = \big[\,\mathbf{0}^\top, \dots, \mathbf{0}^\top,\; a_{t_k}^\top,\; \mathbf{0}^\top, \dots, \mathbf{0}^\top\,\big]^\top \in \mathbb{R}^{mT} \qquad (5)$

with $a_{t_k}$ occupying the $t_k$-th (input) block and zeros elsewhere; $\langle \cdot, \cdot \rangle$ is the dot product; and $-\nabla_{u} J(\xi_{\theta_k}; \theta^*)$ is the gradient-descent direction of $J(\,\cdot\,; \theta^*)$ with respect to the input sequence $u_{0:T-1}$, evaluated at the robot's current trajectory $\xi_{\theta_k}$. Note that the condition in (4) does not require a specific value for the magnitude of $a_{t_k}$, but requires its direction to lie roughly around the gradient-descent direction of $J(\,\cdot\,; \theta^*)$. Such a correction aims to guide the robot's trajectory towards reducing its cost under $\theta^*$ unless the trajectory is already satisfactory. Thus, we call an $a_{t_k}$ satisfying (4) an incremental directional correction.
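Since the left-hand side of (4) is simply the directional derivative of the cost along the padded correction, the condition can be checked numerically with two extra rollouts. The following is a minimal sketch of such a check (our illustration, not the authors' code); the dynamics `f`, features `phi`, final cost `h_final`, true weights `theta_star`, and nominal input sequence `u_seq` are assumed to be supplied by the user.

```python
import numpy as np

def rollout_cost(f, phi, h_final, x0, u_seq, theta):
    """Total cost (2): sum_t theta^T phi(x_t, u_t) + h_final(x_T), rolled out with (1)."""
    x, cost = x0, 0.0
    for u in u_seq:
        cost += float(theta @ phi(x, u))
        x = f(x, u)
    return cost + float(h_final(x))

def satisfies_condition_4(f, phi, h_final, x0, u_seq, theta_star, a, tk, eps=1e-6):
    """Check (4): <a_bar, -grad_u J> > 0, i.e., the directional derivative of J along
    the zero-padded correction a_bar (nonzero only at time tk) is negative."""
    u_plus = [np.array(u, dtype=float) for u in u_seq]
    u_minus = [np.array(u, dtype=float) for u in u_seq]
    u_plus[tk] = u_plus[tk] + eps * a
    u_minus[tk] = u_minus[tk] - eps * a
    dJ = (rollout_cost(f, phi, h_final, x0, u_plus, theta_star)
          - rollout_cost(f, phi, h_final, x0, u_minus, theta_star)) / (2 * eps)
    return dJ < 0
```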

The problem of interest in this paper is to develop a rule to update the robot's weight vector guess $\theta_k$ to $\theta_{k+1}$ such that $\theta_k$ converges to $\theta^*$ as $k \to \infty$, using the human's directional corrections under the assumption (4).

Remark.

We assume that the human user's corrections are in the robot's input space, which means that $a_{t_k}$ can be directly added to the robot's input $u_{t_k}$. This can be satisfied in some cases such as autonomous driving, where a user directly manipulates the steering angle of a vehicle. For other cases where the corrections are not readily in the robot's input space, this requirement could be fulfilled through certain human-robot interfaces which translate the correction signals into the input space; then, $a_{t_k}$ denotes the translated correction. The reason why we do not consider corrections in the robot's state space is that 1) input corrections may be easier to implement, and 2) corrections in the state space can be infeasible for some under-actuated robot systems [33].

(a) Feasible region (green) for magnitude corrections
(b) Feasible region (red) for directional correction
Fig. 1: Magnitude corrections vs. directional corrections. The contour lines and the optimal/satisfactory trajectory (black dot) of the human's implicit cost function are plotted. (a): the green region (a sub-level set) shows all feasible magnitude corrections, i.e., those for which the corrected trajectory has a lower cost under the implicit cost function than the current one. (b): the orange region (half of the input space) shows all feasible directional corrections that satisfy (4).
Remark.

The assumption in (4) on the human's correction is less restrictive than the one in [3, 34, 23, 12], which requires that the cost of the corrected robot trajectory be lower than that of the original one, i.e., that the correction strictly decreases $J(\,\cdot\,; \theta^*)$. As shown in Fig. 1, this requirement usually leads to constraints on the corrections' magnitudes: to guarantee the cost decrease, the correction has to be chosen from a sub-level set of $J(\,\cdot\,; \theta^*)$, as marked by the green region. Furthermore, this region shrinks as the trajectory gets close to the optimal one (black dot), making a proper correction more difficult to choose when the robot's trajectory is near the satisfactory one. In contrast, the directional corrections satisfying (4) always account for half of the entire input space: a human can choose any correction as long as its direction lies in the half space containing the gradient-descent direction of $J(\,\cdot\,; \theta^*)$. Thus, (4) is more likely to be satisfied, especially for non-expert users.
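As a toy illustration of this remark (our own example, not taken from the paper), consider a one-step quadratic cost $J(u) = \|u - u^*\|^2$: any correction with a positive inner product with $-\nabla J$ satisfies (4) regardless of its magnitude, whereas the magnitude-based (cost-decrease) condition fails once the correction overshoots.

```python
import numpy as np

# Toy example: J(u) = ||u - u_star||^2 with the current input u.
u_star = np.array([1.0, 0.0])
u = np.array([0.0, 0.0])
grad = 2 * (u - u_star)                      # gradient of J at u

a_small = np.array([0.5, 0.0])               # modest correction towards u_star
a_big = np.array([5.0, 0.0])                 # overshooting correction, same direction

for a in (a_small, a_big):
    directional_ok = np.dot(a, -grad) > 0                              # condition (4)
    magnitude_ok = np.sum((u + a - u_star) ** 2) < np.sum((u - u_star) ** 2)
    print(a, "directional:", directional_ok, "cost-decrease:", magnitude_ok)

# a_small satisfies both conditions; a_big still satisfies (4) but violates the
# magnitude-based (cost-decrease) requirement.
```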

III Algorithm Outline and Geometric Interpretation

In this section, we present the outline of the proposed main algorithm for a robot to learn from the human's incremental directional corrections, and then provide a geometric interpretation of the main algorithm. First, we present further analysis of the directional corrections.

III-A Equivalent Conditions for Directional Corrections

Before developing the learning procedure, we show that the assumption in (4) is equivalent to a linear inequality on the unknown expected weight vector $\theta^*$, as stated in the following lemma.

Lemma 1.

Suppose that the robot’s current weight vector guess is , and its motion trajectory is a result of minimizing the cost function in (2) subject to dynamics in (1). For , given a human’s incremental directional correction satisfying (4), one has the following inequality equation:

(6)

with

$h_k = \frac{\partial \Phi}{\partial u_{0:T-1}}\Big|_{\xi_{\theta_k}} \bar{a}_k \qquad (7a)$

$b_k = -\Big\langle \bar{a}_k,\; \nabla_{u_{0:T-1}} h\big(x^{\theta_k}_T\big) \Big\rangle \qquad (7b)$

where $\Phi(u_{0:T-1}) = \sum_{t=0}^{T-1}\phi(x_t, u_t)$ denotes the summed feature vector along the trajectory, viewed as a function of the input sequence through the dynamics (1), and $x_T$ is likewise viewed as a function of $u_{0:T-1}$. Here, $\bar{a}_k$ is defined in (5), and $\nabla_{u_{0:T-1}} h(x^{\theta_k}_T)$ involves the gradient of the final cost in (2) evaluated at $x^{\theta_k}_T$. The coefficient matrices in (8)-(9) express the derivatives in (7) in closed form in terms of the Jacobian matrices of the dynamics $f$ and the features $\phi$ evaluated along $\xi_{\theta_k}$.

A proof of Lemma 1 is presented in Appendix -A. In Lemma 1, $h_k$ and $b_k$ in (7) are known quantities that depend on both the human's correction $a_{t_k}$ and the robot's motion trajectory $\xi_{\theta_k}$. Lemma 1 states that each incremental directional correction can be equivalently converted into a linear inequality constraint on the unknown $\theta^*$.
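As a concrete (and hedged) illustration of this conversion: because $h_k$ and $b_k$ are directional derivatives of the summed features and of the final cost along $\bar{a}_k$, they can be approximated from two extra rollouts by finite differences instead of the closed-form matrices in (8)-(9). The sketch below is our own; `f`, `phi`, and `h_final` are user-supplied, as in the earlier snippet.

```python
import numpy as np

def rollout(f, x0, u_seq):
    """Roll out the dynamics (1) and return the state sequence x_0, ..., x_T."""
    xs = [np.array(x0, dtype=float)]
    for u in u_seq:
        xs.append(f(xs[-1], u))
    return xs

def halfspace_from_correction(f, phi, h_final, x0, u_seq, a, tk, eps=1e-6):
    """Return (h_k, b_k) such that condition (4) reads <h_k, theta*> < b_k, as in (6).
    h_k: directional derivative of the summed features along the padded correction.
    b_k: minus the directional derivative of the final cost along the same direction."""
    def feature_sum_and_final(u):
        xs = rollout(f, x0, u)
        feat = sum(phi(x, ut) for x, ut in zip(xs[:-1], u))
        return np.asarray(feat, dtype=float), float(h_final(xs[-1]))

    u_plus = [np.array(u, dtype=float) for u in u_seq]
    u_minus = [np.array(u, dtype=float) for u in u_seq]
    u_plus[tk] = u_plus[tk] + eps * a
    u_minus[tk] = u_minus[tk] - eps * a

    feat_p, fin_p = feature_sum_and_final(u_plus)
    feat_m, fin_m = feature_sum_and_final(u_minus)
    h_k = (feat_p - feat_m) / (2 * eps)
    b_k = -(fin_p - fin_m) / (2 * eps)
    return h_k, b_k
```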

Remark.

$h_k$ and $b_k$ in Lemma 1 also appear in [15], in which they are shown to be efficiently computable in an iterative manner along the trajectory. Specifically, define and initialize

(10a)

then perform the iteration with each next state-input pair,

(10b)

and finally,

(10c)

This iterative property facilitates the computation of $h_k$ and $b_k$ by avoiding the inverse of the large matrix in (8), significantly reducing the computational cost of evaluating (8).

III-B Outline of the Main Algorithm

In order to achieve $\theta_k \to \theta^*$, at each iteration $k$ we let $\Omega_k \subset \mathbb{R}^r$ denote a weight search space such that $\theta^* \in \Omega_k$ and $\theta_k \in \Omega_k$ for all $k$. $\Omega_k$ can be thought of as the set of possible locations of $\theta^*$, and $\theta_k$ as a weight vector guess of $\theta^*$. Rather than a rule to guide $\theta_k$ towards $\theta^*$ directly, we will develop a rule to update $\Omega_k$ to $\Omega_{k+1}$ such that a useful scalar measure of the size of $\Omega_k$ converges to 0.

Main Algorithm (Outline): In the proposed main algorithm, we initialize the weight search space $\Omega_0$ to be

$\Omega_0 = \{\theta \in \mathbb{R}^r : \underline{\theta}_i \le \theta_i \le \bar{\theta}_i,\; i = 1, \dots, r\} \qquad (11)$

where $\underline{\theta}_i$ and $\bar{\theta}_i$ are non-negative constants denoting the lower and upper bounds for the $i$-th entry of $\theta$, denoted $\theta_i$, respectively. Here, $\underline{\theta}_i$ and $\bar{\theta}_i$ can be chosen such that $\Omega_0$ is large enough to include $\theta^*$. The learning proceeds with each iteration $k$ including the following steps:


  • Step 1: Choose a weight vector guess $\theta_k$ from the weight search space $\Omega_k$ (we will discuss how to choose such a $\theta_k$ in Section IV).

  • Step 2: The robot restarts and plans its motion trajectory $\xi_{\theta_k}$ by solving an optimal control problem with the cost function $J(\,\cdot\,; \theta_k)$ in (2) and the dynamics in (1). While the robot is executing $\xi_{\theta_k}$, a human user applies a directional correction $a_{t_k}$ at time $t_k$. Then, a hyperplane $\mathcal{H}_k = \{\theta : \langle h_k, \theta \rangle = b_k\}$ is obtained via (6)-(7).

  • Step 3: Update the weight search space $\Omega_k$ to $\Omega_{k+1}$:

    $\Omega_{k+1} = \Omega_k \cap \{\theta : \langle h_k, \theta \rangle < b_k\} \qquad (12)$

We provide a few remarks on the above outline of the main algorithm. For the initialization in (11), we allow the entries of $\theta$ to have different lower and upper bounds, which may come from the robot's rough pre-knowledge about the range of each weight. Simply, but not necessarily, one could initialize

(13)

where

(14)

In Step 1, one chooses $\theta_k \in \Omega_k$. Soon we will show that $\theta^* \in \Omega_k$ for all $k$. Thus, one can expect $\theta_k$ to be closer to $\theta^*$ if the main algorithm makes $\Omega_k$ smaller. In fact, the weight search space is non-increasing, because $\Omega_{k+1} \subseteq \Omega_k$ by (12) in Step 3. A careful choice of $\theta_k$ that guarantees a strict reduction of a size measure of $\Omega_k$ will be given in Section IV. In Step 2, the robot's trajectory planning is performed by solving an optimal control problem with the cost function in (2) and the dynamics constraint in (1). This can be done by many trajectory optimization methods such as [22] or existing optimal control solvers such as [1]. With the robot's trajectory $\xi_{\theta_k}$ and the human's directional correction $a_{t_k}$, the hyperplane $\mathcal{H}_k$ can be obtained via (6)-(7). The detailed implementation of the main algorithm, including the choice of $\theta_k$ and the termination criterion, will be presented in the next section.
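To make the outline concrete, the following is a compact sketch of the planning-correction-update loop (our illustration, not the authors' implementation). The helpers `plan_trajectory`, `get_human_correction`, and `mve_center` are hypothetical; `halfspace_from_correction` refers to the finite-difference sketch in Section III-A, and `mve_center` is discussed in Section IV.

```python
import numpy as np

def learn_from_directional_corrections(bounds_lo, bounds_hi, num_iters, helpers):
    """Sketch of Steps 1-3: maintain Omega_k as stacked inequalities A theta <= c."""
    r = len(bounds_lo)
    A = np.vstack([np.eye(r), -np.eye(r)])                        # box Omega_0 from (11)/(21)
    c = np.concatenate([np.asarray(bounds_hi), -np.asarray(bounds_lo)])
    for k in range(num_iters):
        theta_k = helpers.mve_center(A, c)                        # Step 1: pick the guess
        x_seq, u_seq = helpers.plan_trajectory(theta_k)           # Step 2: plan with (2), (1)
        a, tk = helpers.get_human_correction(x_seq, u_seq)        #         receive a correction
        h_k, b_k = helpers.halfspace_from_correction(x_seq, u_seq, a, tk)
        A = np.vstack([A, h_k])                                   # Step 3: cut Omega_k with
        c = np.append(c, b_k)                                     # <h_k, theta> <= b_k, as in (12)
    return helpers.mve_center(A, c)
```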

The proposed main algorithm also leads to the following lemma:

Lemma 2.

Under the proposed main algorithm, one has

$\langle h_k, \theta_k \rangle = b_k \qquad (15)$

and

$\theta^* \in \Omega_{k+1} \quad \text{whenever} \quad \theta^* \in \Omega_k \qquad (16)$

A proof of Lemma 2 is given in Appendix -B. Lemma 2 has an intuitive geometric explanation. Note that (15) says that $\theta_k$ always lies on the hyperplane $\mathcal{H}_k$. Moreover, (16) says that although the proposed algorithm directly updates the weight search space $\Omega_k$, the expected weight vector $\theta^*$ always remains inside $\Omega_{k+1}$. Intuitively, the smaller the search space is, the closer $\theta_k$ is to $\theta^*$.

III-C Geometric Interpretation of Updating $\Omega_k$

In this part, we provide an interpretation of the proposed main algorithm from a geometric perspective. For simplicity of illustration, we assume $\theta \in \mathbb{R}^2$ in this subsection.

(a) At the $k$-th iteration
(b) At the $(k+1)$-th iteration
Fig. 2: Illustration of updating $\Omega_k$.

At the $k$-th iteration, shown in Fig. 2(a), a weight vector guess $\theta_k$ (colored in red) is picked from the current weight search space $\Omega_k$ (colored in light blue), i.e., $\theta_k \in \Omega_k$. By Step 2 in the main algorithm, we obtain a hyperplane $\mathcal{H}_k$ (black dashed line), which cuts the weight search space $\Omega_k$ into two portions. By (15) in Lemma 2, we know that $\theta_k$ also lies on this hyperplane because $\langle h_k, \theta_k \rangle = b_k$. By Step 3 in the main algorithm, we only keep one of the two cut portions, namely the intersection between $\Omega_k$ and the half space $\{\theta : \langle h_k, \theta \rangle < b_k\}$, and the kept portion is used as the weight search space for the next iteration, that is, $\Omega_{k+1}$, shown as the blue region in Fig. 2(a). The above procedure repeats for iteration $k+1$, as shown in Fig. 2(b), and finally produces a smaller search space, colored in the darkest blue in Fig. 2(b). From (12), one has $\Omega_{k+2} \subseteq \Omega_{k+1} \subseteq \Omega_k$. Moreover, by (16) in Lemma 2, the expected weight vector $\theta^*$ is always inside $\Omega_{k+1}$ whenever it is inside $\Omega_k$.

(a) A large cut from $\Omega_k$
(b) A small cut from $\Omega_k$
Fig. 3: Illustration of how different directional corrections affect the reduction of the weight search space $\Omega_k$.

Besides the above geometric illustration, we also have the following observations:

  • The key idea of the proposed main algorithm is to cut and shrink the weight search space $\Omega_k$ as each directional correction is given. Thus, we always want $\Omega_k$ to quickly diminish to a very small set as $k$ increases, because then we can say that the robot's current guess $\theta_k$ is close to the expected weight vector $\theta^*$. As shown in Fig. 2, the reduction rate of $\Omega_k$ depends on two factors: the human's directional correction $a_{t_k}$, and how $\theta_k$ is chosen.

  • From (7), we note that the human's directional correction $a_{t_k}$ determines $h_k$, which is the normal vector of the hyperplane $\mathcal{H}_k$. When the weight vector guess $\theta_k$ is fixed, we can think of the hyperplane as rotating around $\theta_k$ for different choices of $a_{t_k}$, which results in different removals from $\Omega_k$, as illustrated in Fig. 3.

  • How $\theta_k$ is chosen from $\Omega_k$ defines the specific position of the hyperplane $\mathcal{H}_k$, because the hyperplane always passes through $\theta_k$ by Lemma 2. Thus, $\theta_k$ also affects how $\Omega_k$ is cut and reduced. This can be illustrated by comparing Fig. 2(a) with Fig. 3(a).

Based on the above discussion, the convergence of the proposed main algorithm is determined by the reduction of the weight search space $\Omega_k$. This reduction depends on both the human's directional corrections (which are hard for the robot to predict) and the robot's choice of the weight vector guess $\theta_k$. In the next section, we present a way for the robot to choose $\theta_k$ that guarantees the convergence of the proposed algorithm.

IV Algorithm Implementation with Convergence Analysis

In this section, we specify the choice of $\theta_k$, provide the convergence analysis of the main algorithm, and finally present a detailed implementation of the algorithm with a termination criterion.

IV-A Choice of $\theta_k$

Under the proposed main algorithm, at each iteration $k$, the weight search space is updated according to (12), i.e., $\Omega_{k+1} = \Omega_k \cap \{\theta : \langle h_k, \theta \rangle < b_k\}$. In order to evaluate the reduction of the weight search space, it is straightforward to use the volume of the (closure of the) weight search space $\Omega_k$, denoted $\mathrm{Vol}(\Omega_k)$; zero volume implies convergence of the search space [5]. Since $\Omega_{k+1} \subseteq \Omega_k$ by (12), we know that $\mathrm{Vol}(\Omega_k)$ is non-increasing. In the following, we further develop a way such that $\mathrm{Vol}(\Omega_k)$ is strictly decreasing under the proposed algorithm; i.e., there exists a constant $0 < \gamma < 1$ such that

$\mathrm{Vol}(\Omega_{k+1}) \le \gamma\, \mathrm{Vol}(\Omega_k) \qquad (17)$

In order to achieve (17), we note that different choices of $\theta_k$ lead to different reductions of $\Omega_k$: as indicated in Fig. 3(a), a large volume reduction from $\Omega_k$ to $\Omega_{k+1}$ can be achieved, while the choice of $\theta_k$ in Fig. 3(b) leads to a very small volume reduction. This observation suggests that, to avoid a very small volume reduction, one should intuitively choose $\theta_k$ at the center of the weight search space $\Omega_k$. Specifically, we use the center of the maximum volume ellipsoid inscribed within the search space, as defined below.

Definition 1 (Maximum Volume inscribed Ellipsoid [4]).

Given a compact convex set $\mathcal{C} \subset \mathbb{R}^r$, the maximum volume ellipsoid (MVE) inscribed within $\mathcal{C}$, denoted $\mathcal{E}$, is represented by

$\mathcal{E} = \{\, B u + d \;:\; \|u\|_2 \le 1 \,\} \qquad (18)$

Here, $B \succ 0$ (i.e., a positive definite matrix); $d$ is called the center of $\mathcal{E}$; and $B$ and $d$ solve the optimization:

$\max_{B \succ 0,\, d} \ \log\det B \quad \text{s.t.} \quad \sup_{\|u\|_2 \le 1} I_{\mathcal{C}}(Bu + d) \le 0 \qquad (19)$

where $I_{\mathcal{C}}(x) = 0$ for $x \in \mathcal{C}$ and $I_{\mathcal{C}}(x) = \infty$ for $x \notin \mathcal{C}$.

Based on Definition 1, we let $\mathcal{E}_k$ denote the MVE inscribed within $\Omega_k$, with $d_k$ denoting the center of $\mathcal{E}_k$. At iteration $k$, we choose the weight vector guess

$\theta_k = d_k \qquad (20)$

as illustrated in Fig. 4. Other choices of $\theta_k$ as a center of the search space are discussed in Appendix -C.

(a) Center of the MVE in $\Omega_k$
(b) Center of the MVE in $\Omega_{k+1}$
Fig. 4: Illustration of choosing the weight vector guess $\theta_k$ as the center of the MVE inscribed in the weight search space $\Omega_k$.

We now present a computational method to obtain $\theta_k$, i.e., the center of the MVE inscribed within $\Omega_k$. Recall that in the proposed main algorithm, the initialization of $\Omega_0$ in (11) is the box $\{\theta : \underline{\theta}_i \le \theta_i \le \bar{\theta}_i\}$, with $\theta_i$ the $i$-th entry of $\theta$. This can be equivalently rewritten as a set of linear inequalities:

$\Omega_0 = \{\theta \in \mathbb{R}^r : e_i^\top \theta \le \bar{\theta}_i,\; -e_i^\top \theta \le -\underline{\theta}_i,\; i = 1, \dots, r\} \qquad (21)$

where $e_i$ is the unit vector with the $i$-th entry equal to 1. Then, following the update in (12), $\Omega_k$ is also a compact polytope, which can be written as

$\Omega_k = \{\theta \in \mathbb{R}^r : A_k \theta \preceq c_k\} \qquad (22)$

where each row of $A_k \in \mathbb{R}^{(2r+k)\times r}$ is either $\pm e_i^\top$ from (21) or a hyperplane normal $h_j^\top$, $j < k$, from (12), and $c_k \in \mathbb{R}^{2r+k}$ stacks the corresponding bounds.

As a result, solving (19) for the center of the MVE inscribed within the polytope $\Omega_k$ becomes a convex program [4], as stated by the following lemma.

Lemma 3.

For the polytope $\Omega_k$ in (22), with rows $\alpha_i^\top$ of $A_k$ and corresponding entries $c_i$ of $c_k$, the center of the MVE inscribed within $\Omega_k$ can be obtained from the following convex optimization:

$\max_{B \succ 0,\, d} \ \log\det B \quad \text{s.t.} \quad \|B\alpha_i\|_2 + \alpha_i^\top d \le c_i, \quad i = 1, \dots, 2r + k \qquad (23)$

where the optimal $d$ is the MVE center.

The proof of the above lemma can be found in Chapter 8.4.2 of [4, pp. 414]. The above convex optimization can be efficiently solved by existing solvers, e.g., [6]. In a practical implementation of solving (23), since the number of linear inequalities grows as the iterations increase, which can increase the computational cost, a mechanism for dropping redundant inequalities in (22) can be adopted [5]. Dropping redundant inequalities does not change $\Omega_k$ or its volume reduction (convergence). See [5] for how to identify the redundant inequalities.
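A minimal sketch of solving (23) with CVXPY (our illustration; the paper relies on an off-the-shelf solver [6], and the stacked polytope description `A`, `c` follows (21)-(22)):

```python
import cvxpy as cp
import numpy as np

def mve_center(A, c):
    """Center of the maximum-volume inscribed ellipsoid of {theta : A theta <= c},
    computed by the convex program (23): maximize log det B subject to
    ||B a_i||_2 + a_i^T d <= c_i for every row a_i of A."""
    m, r = A.shape
    B = cp.Variable((r, r), PSD=True)   # ellipsoid shape matrix
    d = cp.Variable(r)                  # ellipsoid center (used as the guess theta_k)
    constraints = [cp.norm(B @ A[i]) + A[i] @ d <= c[i] for i in range(m)]
    cp.Problem(cp.Maximize(cp.log_det(B)), constraints).solve()
    return d.value

# Example: the box Omega_0 = {theta in R^2 : 0 <= theta_i <= 1}.
A0 = np.vstack([np.eye(2), -np.eye(2)])
c0 = np.array([1.0, 1.0, 0.0, 0.0])
print(mve_center(A0, c0))   # approximately [0.5, 0.5]
```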

IV-B Exponential Convergence and Termination Criterion

In this part, we investigate the convergence of the volume of $\Omega_k$ under the proposed main algorithm, and its termination criterion for practical implementation.

Note that the convergence of the proposed algorithm relies on the reduction of $\mathrm{Vol}(\Omega_k)$, which can be guaranteed by the following lemma:

Lemma 4.

Let $\theta_k$ be chosen as the center of the MVE inscribed within $\Omega_k$. Then, the update (12) leads to

(24)

Lemma 4 is a direct theorem from [32]. Lemma 4 indicates that the volume of the weight search space contracts by at least a constant factor at each iteration, i.e., (17) holds for some constant $\gamma \in (0,1)$.

Thus, $\mathrm{Vol}(\Omega_k) \to 0$ exponentially fast; that is, its convergence speed is of order $\gamma^k$ as $k \to \infty$.

In order to implement the main algorithm in practice, we will need not only the exponential convergence established by Lemma 4, but also a termination criterion, which specifies the maximum number of iterations for a given requirement on $\mathrm{Vol}(\Omega_k)$. Thus, we have the following theorem.

Theorem 1.

Suppose $\Omega_0$ is given by (11), and at each iteration $k$, $\theta_k$ is chosen as the center of the MVE inscribed in $\Omega_k$. Given a termination condition $\mathrm{Vol}(\Omega_k) \le \epsilon$, with $\epsilon > 0$ a user-specified threshold, the main algorithm terminates within at most $N$ iterations, where

(25)

with the constant therein given in (14).

Proof.

Initially, the volume $\mathrm{Vol}(\Omega_0)$ is determined by the bounds in (11). From Lemma 4, after $k$ iterations, we have

(26)

which yields

(27)

When $k \ge N$,

(28)

The above inequality simplifies to

(29)

which means that the termination condition is satisfied. This completes the proof.  

We have the following comments on the above Theorem 1.

Remark.

Since both $\theta^*$ and $\theta_k$ are always within $\Omega_k$ for any $k$ by Lemma 2, the user-specified threshold $\epsilon$ in the termination condition can be understood as an indicator of the distance between the expected weight vector $\theta^*$ (usually unknown in practice) and the robot's weight vector guess $\theta_k$. The threshold is set based on the desired learning accuracy.

IV-C Implementation of the Main Algorithm

With the termination criterion in Theorem 1 and the choice of $\theta_k$ in (20), one can implement the main algorithm in detail as presented in Algorithm 1.

Algorithm 1: Detailed implementation of the main algorithm (with the choice of $\theta_k$ in (20) and the termination criterion from Theorem 1).

V Numerical Examples

In this section, we perform numerical simulations on an inverted pendulum and a two-link robot arm to validate the proposed algorithm and provide comparisons with related work.

V-A Inverted Pendulum

The dynamics of a pendulum is

$m l^2 \ddot{\alpha} + b\,\dot{\alpha} + m g l \sin\alpha = u \qquad (30)$

with $\alpha$ being the angle between the pendulum and the direction of gravity, $u$ the torque applied to the pivot, and $l$ (in m), $m$ (in kg), and $b$ the length, mass, and damping ratio of the pendulum, respectively. We discretize the continuous dynamics by the Euler method with a fixed time interval $\Delta$ (in s). The state and control vectors of the pendulum system are defined as $x = [\alpha, \dot{\alpha}]^\top$ and $u$, respectively, with a fixed initial condition $x_0$. In the cost function (2), we set the weight-feature running cost $\theta^\top \phi(x_t, u_t)$ with the features given in (31), and set the final cost term $h(x_T)$ to penalize the distance of the final state from the upright position, since our goal is to control the pendulum to reach the vertical position.

The time horizon is set to a fixed value $T$.
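A minimal sketch of the Euler-discretized pendulum and a plausible quadratic feature/final-cost choice (our illustration; the numerical constants and the exact features in (30)-(31) are assumptions, since the paper's specific values are not reproduced here):

```python
import numpy as np

# Hypothetical constants; the paper fixes specific values for l, m, b, and the time step.
l, m_, b, g, dt = 1.0, 1.0, 0.1, 9.81, 0.05

def f(x, u):
    """Euler-discretized damped pendulum (30): state x = [alpha, alpha_dot], input u = [torque]."""
    alpha, alpha_dot = x
    alpha_ddot = (u[0] - b * alpha_dot - m_ * g * l * np.sin(alpha)) / (m_ * l ** 2)
    return np.array([alpha + dt * alpha_dot, alpha_dot + dt * alpha_ddot])

def phi(x, u):
    """An assumed quadratic feature vector for the running cost (a stand-in for (31))."""
    alpha, alpha_dot = x
    return np.array([alpha ** 2, alpha_dot ** 2, u[0] ** 2])

def h_final(x):
    """Assumed final cost: squared distance of the final state from the upright position."""
    return (x[0] - np.pi) ** 2 + x[1] ** 2
```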

In the numerical examples, we generate the "human's directional corrections" by simulation. Suppose that the expected weight vector $\theta^*$ is known explicitly. Then, at iteration $k$, the "human's" directional correction is generated using the sign of the gradient of the true cost function $J(\,\cdot\,; \theta^*)$, that is,

$a_{t_k} = -\operatorname{sign}\Big(\big[\nabla_{u} J(\xi_{\theta_k}; \theta^*)\big]_{t_k}\Big) \qquad (32)$

Here, $[\nabla_{u} J(\xi_{\theta_k}; \theta^*)]_{t_k}$ denotes the $t_k$-th entry of the gradient, and the correction time $t_k$ is randomly chosen (evenly distributed) within the horizon $\{0, 1, \dots, T-1\}$. Obviously, the above "human's directional corrections" satisfy the assumption in (4).
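A sketch of this simulated corrector (our illustration, for the scalar-input pendulum; it reuses the hypothetical `rollout_cost` helper from Section II and approximates the gradient entry by a finite difference):

```python
import numpy as np

def simulated_directional_correction(f, phi, h_final, x0, u_seq, theta_star, eps=1e-6):
    """Generate a correction as in (32): pick a random time step t_k and return the
    negative sign of the corresponding entry of the cost gradient."""
    tk = np.random.randint(len(u_seq))
    u_plus = [np.array(u, dtype=float) for u in u_seq]
    u_minus = [np.array(u, dtype=float) for u in u_seq]
    u_plus[tk] = u_plus[tk] + eps
    u_minus[tk] = u_minus[tk] - eps
    grad_tk = (rollout_cost(f, phi, h_final, x0, u_plus, theta_star)
               - rollout_cost(f, phi, h_final, x0, u_minus, theta_star)) / (2 * eps)
    return np.array([-np.sign(grad_tk)]), tk
```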

The initial weight search space $\Omega_0$ is set as

(33)

In Algorithm 1, we set the termination parameter $\epsilon$, and the maximum learning iteration $N$ is solved from (25). We apply Algorithm 1 to learn the expected weight vector $\theta^*$. To illustrate the results, we define the guess error $\|\theta_k - \theta^*\|^2$ (i.e., the squared distance between the weight vector guess $\theta_k$ and the expected weight vector $\theta^*$), and plot the guess error versus the number of iterations $k$ in the top panel of Fig. 5. In the bottom panel of Fig. 5, we plot the directional correction applied at each iteration $k$, where upward and downward bars denote the positive and negative sign (direction) of the correction in (32), respectively, and the number inside each bar denotes the correction time $t_k$ that is randomly picked from the horizon.

Fig. 5: Learning a pendulum cost function from incremental directional corrections. The upper panel shows the guess error $\|\theta_k - \theta^*\|^2$ versus iteration $k$, and the bottom panel shows the directional correction (i.e., positive or negative) applied at each iteration $k$; the value inside each bar is the correction time $t_k$, randomly picked within the time horizon.

Based on the results in Fig. 5, we can see that as the learning iteration $k$ increases, the weight vector guess $\theta_k$ converges to the expected weight vector $\theta^*$. This shows the validity of the method, as guaranteed by Theorem 1.

V-B Two-link Robot Arm System

Here, we test the proposed method on a two-link robot arm. The dynamics of the robot arm system (moving horizontally) is $M(q)\ddot{q} + C(q, \dot{q})\dot{q} = \tau$, where $M(q)$ is the inertia matrix, $C(q, \dot{q})$ is the Coriolis matrix, $q \in \mathbb{R}^2$ is the vector of joint angles, and $\tau \in \mathbb{R}^2$ is the torque vector applied to the joints. The state and control variables for the robot arm control system are defined as $x = [q^\top, \dot{q}^\top]^\top$ and $u = \tau$, respectively. The initial condition of the robot arm is set as