Stability-Guaranteed Reinforcement Learning for Contact-rich Manipulation

April 22, 2020 ∙ Shahbaz A. Khader, et al. ∙ KTH Royal Institute of Technology ∙ ABB

Reinforcement learning (RL) has had its fair share of success in contact-rich manipulation tasks, but it still lags behind in benefiting from advances in robot control theory such as impedance control and stability guarantees. Recently, the concept of variable impedance control (VIC) was adopted into RL with encouraging results. However, the more important issue of stability remains unaddressed. To clarify the challenge in stable RL, we introduce the term all-the-time-stability, which unambiguously means that every possible rollout will be stability-certified. Our contribution is a model-free RL method that not only adopts VIC but also achieves all-the-time-stability. Building on a recently proposed stable VIC controller as the policy parameterization, we introduce a novel policy search algorithm that is inspired by the Cross-Entropy Method and inherently guarantees stability. As a part of our extensive experimental studies, we report, to the best of our knowledge, the first successful application of RL with all-the-time-stability on the benchmark problem of peg-in-hole.


I Introduction

In contact-rich manipulation, modeling and control of contacts are necessary for successful execution of the task. Traditional planning and control methods that specialize in free motion and obstacle avoidance do not address contact-rich manipulation adequately. Planning and control of contact-rich tasks is exceedingly difficult, especially when precise knowledge of the geometry and location of the manipulator and its surroundings is not available. The control of manipulator-environment interaction under the presence of uncertainties is generally studied as interaction control [1, 2], but the terms compliant manipulation [3] and more recently contact-rich manipulation [4, 5, 6] have also been used. While seeking an appropriate control solution, an important property to be satisfied is stability, without which widespread adoption of robots cannot be realized.

Most interaction control methods assume the availability of a nominal reference trajectory, which again presupposes some degree of knowledge of the task geometry. Reinforcement learning (RL) has emerged as a promising paradigm that can alleviate this concern. Many recent works have applied deep RL methods for learning contact-rich manipulation policies that directly output torques or forces [6, 7, 8]. Although valid in principle, these methods do not benefit from the rich theory of interaction control and, more importantly, do not guarantee stability. In an encouraging trend, a number of recent works have studied incorporating variable impedance control (VIC), a well-known interaction control concept, into an RL policy and found that it indeed improved sample efficiency and task performance [4, 9, 10]. Unfortunately, none of the existing applications of RL to contact-rich tasks, even those without a VIC structure, guarantee stability. To clarify the scope of the stability we are interested in, we define the term all-the-time-stability as guaranteed stability for all possible exploratory actions, from the beginning to the end of the RL process.

Fig. 1: (a) RL optimizes both the policy and the Lyapunov function. (b) The VIC policy maps states (position, velocity) to actions (force/torque). (c) Examples of contact states in peg-in-hole. (d) RL reshapes an untrained (illustrative) Lyapunov function that cannot guide trajectories to the goal (red point) into one that can, while maintaining all-the-time-stability.

In this paper, we propose a model-free RL algorithm with all-the-time-stability. The method is well suited to, but not limited to, contact-rich manipulation tasks. Our approach leverages a previously proposed framework for modeling stable VICs for discrete interaction motions [11]. The main technical contribution is the introduction of a novel policy search method, inspired by the Cross-Entropy Method (CEM) [12], in which the sampling distribution is constructed such that it can only generate stable parameter samples. A novel update law is derived for iteratively updating the sampling distribution. Our experimental results reveal that the proposed method not only achieves all-the-time-stability but, with principled initialization, also attains high sample efficiency. We demonstrate, to the best of our knowledge, the first successful reinforcement learning with all-the-time-stability on the peg-in-hole task, a task that has long been considered the benchmark for contact-rich manipulation [13, 14, 5].

II Related Works

This section is organized according to different topics as follows:

II-1 RL for Contact-rich Manipulation

Learning artificial neural network (ANN) policies for contact-rich manipulation has a long history [13, 15], but these early works adopted an admittance control approach that is more suited to non-rigid interactions [2]. Recent well-known examples are [6, 7, 5], where optimal control [6], motion planning [7] and multimodal feature learning [5] were leveraged to make the problem tractable. Other approaches are model predictive control (MPC) with learned dynamics [16, 8] and learning reference force profiles [17, 18]. None of these methods guarantee control stability during learning or for the final policy.

II-2 Learning Variable Impedance Policies

Learning a VIC policy can be done either through learning from demonstration (LfD) [19, 20, 21], where human demonstration data is used to optimize policy parameters, or through RL, where a cost-driven autonomous trial-and-error process leads to a policy. A policy with VIC structure incorporates two levels of control loops: a trajectory generation loop and an impedance control loop. Guarantees on stability can be obtained only if both loops are considered in a unified way [11]. An example of LfD with a stability guarantee is [21].

In this paper we adopt the paradigm of RL, which has the benefit of alleviating the need for human demonstrations. Time-dependent policies, without any stability guarantees, were learned in [9] and [22]. In [9] a Dynamical Movement Primitive (DMP) that encodes both reference and joint impedance trajectories is used, whereas in [22] a policy parameterized as a mixture of proportional-derivative systems is adopted. State-dependent policies were learned in [10, 4]. In [10], both the reference and stiffness trajectories are predicted from a policy parameterized by a Gaussian Mixture Regression (GMR) model. Interestingly, stability was guaranteed for the trajectory generation loop by means of the method proposed in [23], but the overall stability remained unclear because of the lack of a unified analysis of both loops. The ANN-based method [4] learns a state-dependent policy that outputs a reference trajectory and impedance parameters but offers no stability guarantees.

II-3 Stable Variable Impedance Controllers

A recent work [24] performed stability analysis for already existing trajectory and impedance profiles, but it does not propose a particular controller structure that may be utilized in RL. In another interesting work [25], the stability issue of VIC was tackled with a passivity-based approach, but it too assumed a reference trajectory to be given. Khansari-Zadeh et al. [11] proposed a trajectory-independent modeling framework, i-MOGIC, for discrete motions, featuring VIC and stability guarantees. Our method adopts i-MOGIC as the policy parameterization and builds on it to form a model-free RL algorithm with all-the-time-stability.

II-4 Cross-Entropy Method for RL

The Cross-Entropy Method [12], a general-purpose sampling-based optimization method, has previously been used for policy search in both unconstrained [26] and constrained [27] settings. Our approach can also be seen as constrained policy search but, unlike [27], the constraint (symmetric positive definiteness) is automatically guaranteed by the choice of the Wishart sampling distribution. Furthermore, unlike most cases where a Gaussian sampling distribution allows an analytical maximum likelihood estimation (MLE) based update, our special sampling distribution, which consists of both Gaussian and Wishart factors, necessitates a novel approach. Our policy search method may be seen as a CEM-inspired Evolution Strategy rather than a faithful implementation of CEM.

II-5 Relation to Safe RL

Stability has been considered as a means for safe RL in [28, 29, 30]. The method in [28] depends on smoothness properties of the learned Gaussian process (GP) dynamics model, something that is inappropriate for contact. [29] also used a GP and is further limited to learning only the unactuated part of the dynamics, while the remainder is assumed to be a known control-affine model. Another example [30] assumes the availability of a stabilizing prior controller, which, along with the magnitude of the disturbance in the assumed nominal dynamics model, determines the region of guaranteed stability. In our case, such limitations do not exist since no model is learned; instead, our approach assumes a fully known model of the manipulator inverse dynamics, against which stability is derived. This is not a limitation because such models are easily available. Unknown interaction dynamics do not undermine (global) stability as long as the interaction is passive (see III-D). Our perspective on stability is broader than safety: a stability guarantee is required to ascertain the basic functionality of a control system.

III Background and Preliminaries

III-A Reinforcement Learning

A Markov decision process is defined by a set of states $\mathcal{X}$, a set of actions $\mathcal{U}$, a reward function $r(\mathbf{x}_t, \mathbf{u}_t)$, an initial state distribution $p(\mathbf{x}_0)$, a time horizon $T$, and transition probabilities (or dynamics) $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$. In policy search RL, the goal is to obtain the optimal stochastic policy $\pi_{\boldsymbol{\theta}}(\mathbf{u}_t|\mathbf{x}_t)$ by maximizing the expected trajectory reward $\mathbb{E}_{\tau \sim p(\tau)}[R(\tau)]$, where $\tau$ represents a sample from the distribution of trajectories induced by $\pi_{\boldsymbol{\theta}}$, $p(\mathbf{x}_0)$ and $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$, and $R(\tau)$ is the reward accumulated over the time period $T$. This is usually done by a gradient-based ($\nabla_{\boldsymbol{\theta}}$) search for the optimum value $\boldsymbol{\theta}^*$ of the parameterized policy $\pi_{\boldsymbol{\theta}}$:

$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \; \mathbb{E}_{\tau \sim p(\tau)}\big[\textstyle\sum_{t=0}^{T} r(\mathbf{x}_t, \mathbf{u}_t)\big]$  (1)

In this paper, we consider a deterministic policy ($\mathbf{u}_t = \pi_{\boldsymbol{\theta}}(\mathbf{x}_t)$) and deterministic dynamics ($\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)$). Nevertheless, the general form of the problem in (1) remains unchanged, since a distribution over trajectories can still be induced by $p(\mathbf{x}_0)$ and a distribution over the policy parameters $p(\boldsymbol{\theta})$. $p(\boldsymbol{\theta})$ arises in the context of parameter space exploration [31], which is less common than action space exploration (stochastic policy). Parameter space exploration allows exploration with deterministic policies.

III-B Stability in RL

Lyapunov stability analysis deals with the study of the time evolution of non-linear control systems such as robotic manipulators [32]. An equilibrium point of the system is stable if the state trajectories that start close enough remain bounded around it, and asymptotically stable if the trajectories converge to it. Such properties are necessary to ensure that the system does not fail catastrophically in a wide range of circumstances. Stability is guaranteed by mathematically proving the existence of a certain Lyapunov function $V(\mathbf{x})$, a scalar function of the state analogous to energy, that has the properties (among others) $V(\mathbf{x}) > 0$ and $\dot{V}(\mathbf{x}) \leq 0$ for $\mathbf{x} \neq \mathbf{x}^{*}$, where $\mathbf{x}^{*}$ is the equilibrium point.

In an RL context, stability corresponds to a guarantee that any rollout is bounded in state space and tends to the goal position demanded by the task. Ideally, the policy should ensure that the goal position is the unique equilibrium point, in which case we can refer to the stability of the system and not just of an equilibrium point. The notion of global stability, in which the initial state can be anywhere, is also desirable.

III-C Cross-Entropy Method

The Cross-Entropy Method (CEM) [12] conducts sampling-based optimization to solve problems like (1). The optimization relies on a sampling distribution $p(\boldsymbol{\theta}; \boldsymbol{\nu})$ to generate the samples $\{\boldsymbol{\theta}_i\}_{i=1}^{N}$. The performance of each $\boldsymbol{\theta}_i$ is evaluated according to $R(\boldsymbol{\theta}_i)$ and used to compute updates for the distribution parameter $\boldsymbol{\nu}$. Formally, it iteratively solves:

$\boldsymbol{\nu}_{k+1} = \arg\max_{\boldsymbol{\nu}} \textstyle\sum_{i=1}^{N} \mathbb{1}\big[R(\boldsymbol{\theta}_i) \geq \gamma_k\big] \log p(\boldsymbol{\theta}_i; \boldsymbol{\nu})$  (2)

where $\mathbb{1}[\cdot]$ denotes an indicator function that selects only the best samples, or elites, based on individual performance $R(\boldsymbol{\theta}_i)$, with $\gamma_k$ the elite threshold. The computation of $\boldsymbol{\nu}_{k+1}$ from the samples is done by MLE. Very often $p(\boldsymbol{\theta}; \boldsymbol{\nu})$ is modeled as a Gaussian, for which an analytical solution exists [12]. As the iteration index $k$ grows, $p(\boldsymbol{\theta}; \boldsymbol{\nu}_k)$ is encouraged to converge to a narrow solution distribution with high performance, thereby approximately solving (1).
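To make the elite selection and MLE update in (2) concrete, the following is a minimal CEM sketch with a Gaussian sampling distribution; the toy objective, dimensions and population sizes are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def cem(reward_fn, dim, iters=50, n_samples=30, n_elites=6, seed=0):
    """Generic CEM loop: sample, rank, refit the Gaussian to the elites (MLE)."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.eye(dim)      # initial sampling distribution
    for _ in range(iters):
        thetas = rng.multivariate_normal(mu, sigma, size=n_samples)
        rewards = np.array([reward_fn(t) for t in thetas])
        elites = thetas[np.argsort(rewards)[-n_elites:]]    # indicator in (2)
        mu = elites.mean(axis=0)                             # Gaussian MLE
        sigma = np.cov(elites, rowvar=False) + 1e-6 * np.eye(dim)
    return mu

# toy usage: maximize a concave quadratic whose optimum is the ones vector
best = cem(lambda th: -np.sum((th - 1.0) ** 2), dim=5)
```

The shrinking covariance of the refitted Gaussian is what drives the distribution towards a narrow, high-performance solution.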

III-D i-MOGIC

Integrated MOtion Generator and Impedance Controller (i-MOGIC) was proposed as a modeling framework for discrete interaction motions [11]. The controller has the form of a weighted mixture of several spring-damper components:

$\mathbf{u} = \textstyle\sum_{k=0}^{K} h_k(\mathbf{x})\big[\mathbf{K}^{k}(\mathbf{x}^{*k} - \mathbf{x}) - \mathbf{D}^{k}\dot{\mathbf{x}}\big]$  (3)

The stiffness and damping matrices are denoted by $\mathbf{K}^{k}$ and $\mathbf{D}^{k}$, respectively. The superscript $k=0$ is used to indicate the base spring-damper and $k=1,\ldots,K$ a mixture component. $\mathbf{x}$ and $\dot{\mathbf{x}}$ denote the position and velocity of the manipulator. Note that $\mathbf{x}$ is defined relative to the goal position, implying that the global attractor point $\mathbf{x}^{*0}$ is the origin. The attractor points for the remaining components are denoted by $\mathbf{x}^{*k}$. The mixing weight $h_k(\mathbf{x})$ is a function of $\mathbf{x}$ and parameterized by $\alpha^{k}$, where $\alpha^{k}$ is a scalar quantity.

Equation (3) can be seen as a VIC policy [11] of the form:

$\mathbf{u} = \mathbf{K}(\mathbf{x})\big(\mathbf{x}^{ref}(\mathbf{x}) - \mathbf{x}\big) - \mathbf{D}(\mathbf{x})\dot{\mathbf{x}}$  (4)

where $\mathbf{u}$ is the force/torque, and $\mathbf{x}^{ref}(\mathbf{x})$, $\mathbf{K}(\mathbf{x})$ and $\mathbf{D}(\mathbf{x})$ are the state-dependent position reference, stiffness matrix and damping matrix, respectively. The parameter set of i-MOGIC is given by:

$\boldsymbol{\theta} = \big\{\mathbf{K}^{0}, \mathbf{D}^{0}, \{\mathbf{K}^{k}, \mathbf{D}^{k}, \mathbf{x}^{*k}, \alpha^{k}\}_{k=1}^{K}\big\}$  (5)

Khansari-Zadeh et al. [11] showed that, with gravity compensated, (3) is globally asymptotically stable at the origin if:

$\mathbf{K}^{0} \succ 0, \quad \mathbf{D}^{0} \succ 0, \quad \mathbf{K}^{k} \succeq 0, \quad \mathbf{D}^{k} \succeq 0, \quad \alpha^{k} > 0, \qquad k = 1, \ldots, K$  (6)

Global asymptotic stability is proven for free motion using only the manipulator dynamics, while any interaction forces are treated as a persistent disturbance. When the interaction is with a passive environment, asymptotic stability may be lost but stability is retained [11]. A simple example is when an obstacle prevents the manipulator from moving towards the goal; here, the contact forces constitute the persistent disturbance. A passive environment corresponds to the common case where objects in the environment are not actuated. Note that some terms in the original i-MOGIC are omitted in (3) since they are set to zero.
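To illustrate how a policy of the form (3) can be evaluated, the following sketch computes the commanded force/torque from a set of spring-damper components. The normalized Gaussian-kernel mixing weights and all variable names are assumptions made for illustration; the exact i-MOGIC definitions are those in [11].

```python
import numpy as np

def imogic_control(x, xdot, params):
    """Weighted mixture of spring-damper components, cf. (3); x is relative to the goal.

    params: list of dicts with keys 'K' (stiffness), 'D' (damping),
            'x_star' (attractor, zero for the base component) and 'alpha'
            (scalar shaping the mixing weight). The softmax-style weights
            below are an assumed form, not the definition from [11].
    """
    logits = np.array([-0.5 * p["alpha"] * np.sum((x - p["x_star"]) ** 2)
                       for p in params])
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # normalized mixing weights h_k(x)
    u = np.zeros_like(x, dtype=float)
    for wk, p in zip(w, params):
        u += wk * (p["K"] @ (p["x_star"] - x) - p["D"] @ xdot)
    return u                                      # commanded force/torque, cf. (4)
```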

IV Approach

Fig. 2: RL with VIC policy structure. Left: stability-unaware neural network policies outputting unconstrained impedance gains. Right: all-the-time-stability policy search with interpretable and constrained parameters.

As we have seen, i-MOGIC is a parameterized policy with a VIC structure and stability guarantees. To attain our goal of an RL algorithm with all-the-time-stability, we need a policy search algorithm that also features stable exploration and convergence guarantees. Our main contribution is a model-free RL algorithm that meets these requirements. We adopt the i-MOGIC policy parameterization and propose a CEM-like policy search algorithm. Since i-MOGIC is a deterministic policy whose stability is determined solely by its parameter values, the parameter space exploration strategy of CEM is ideal. Figure 2 provides further perspective by juxtaposing our solution with a deep RL approach, such as [4], that has a VIC structure and the more common action space exploration.

In our novel CEM-like algorithm, the constraints in (6) are guaranteed by designing a sampling distribution that makes it impossible to sample an unstable parameter set. A feasible parameter set of i-MOGIC is a mixed set of real-valued vectors, positive scalars, and matrices with symmetry and positive (semi)definiteness. Considering positive numbers as a special case of symmetric positive definite (SPD) matrices and enforcing SPD for all the matrices in (5), all parameters except the attractor points $\mathbf{x}^{*k}$ can be modeled by a distribution over SPD matrices to guarantee constraint satisfaction. We focus on this aspect first and then develop the complete solution. Note that enforcing SPD for all matrices in (5) makes our approach slightly more conservative than required.

IV-A Optimizing Positive-definite Matrix Parameters

Fig. 3: The (apparent) linear relationship between the Wishart parameter $n$ and the entropy of the distribution. Each plot represents a different distribution. Left: random dimensions (between 2 and 10) and random scale parameters. Right: a fixed dimension and random scale parameters.

We model the sampling distribution for an SPD matrix $\mathbf{S}$ with the Wishart distribution [33], which is fully defined by two parameters $\mathbf{V}$ and $n$ and denoted by $\mathcal{W}(\mathbf{V}, n)$. $\mathbf{S}$ and $\mathbf{V}$ are SPD and $n > d - 1$, where $d$ is the matrix dimension. The expected value is given by $\mathbb{E}[\mathbf{S}] = n\mathbf{V}$. The variance of a random matrix is not easy to define, but it is controlled by $n$. An MLE-based update of the Wishart distribution is possible if we adopt a numerical optimization approach for (2). Often, it is desirable to avoid such an approach because it may entail specification of parameter intervals, gradient computation, or a lack of convergence guarantees. Instead, we derive a simple update rule that, despite not directly targeting (2), has the general property of continuously refining the sampling distribution towards an approximate solution. Although ours is no longer a faithful CEM method, it still conforms to the Evolution Strategy paradigm.
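As a quick illustration of why the Wishart distribution is convenient here, the snippet below draws matrix samples that are SPD by construction and checks the mean relation $\mathbb{E}[\mathbf{S}] = n\mathbf{V}$ empirically; the dimension, the parameter values and the use of scipy.stats.wishart are illustrative choices, not settings from the paper.

```python
import numpy as np
from scipy.stats import wishart

d, n = 3, 10                        # dimension and degrees of freedom (n > d - 1)
V = np.eye(d)                       # SPD scale parameter
samples = wishart.rvs(df=n, scale=V, size=2000)   # each draw is an SPD matrix

# every draw is symmetric positive definite ...
assert all(np.all(np.linalg.eigvalsh(S) > 0) for S in samples)
# ... and the empirical mean approaches n * V
print(np.round(samples.mean(axis=0), 1))
```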

We propose to update the parameter $\mathbf{V}$ of the Wishart distribution using the empirical average of the elite samples $\{\mathbf{S}_i\}_{i=1}^{N_e}$. This is valid since scaling and addition preserve symmetry and positive definiteness. Our update rule for the parameter $n$ is based on two modeling assumptions: $\Delta H \propto -(\bar{R}_e - \bar{R})$ and $H \propto -n$, where $H$ is the entropy of the Wishart distribution, and $\bar{R}$ and $\bar{R}_e$ are the average rewards of all samples and of the elites, respectively. The first linear model is our design choice and is motivated by the intuition that there should be a reduction in the entropy of the sampling distribution commensurate with the relative gain between $\bar{R}_e$ and $\bar{R}$. The second model matches closely with reality, as evidenced by Fig. 3. After combining the linear models and rearranging terms we obtain our update law:

$n_{k+1} = n_k + \frac{c_1}{c_2}\big(\bar{R}_e - \bar{R}\big)$  (7a)
$\mathbf{V}_{k+1} = \frac{1}{N_e}\textstyle\sum_{i=1}^{N_e} \tilde{\mathbf{S}}_i$  (7b)

where $\tilde{\mathbf{S}}_i = \mathbf{S}_i / n_{k+1}$ is the elite sample scaled by $1/n_{k+1}$, $k$ is the iteration variable (see (2)), and $c_1$ and $c_2$ are the respective proportionality constants of the linear models. Since Fig. 3 (right) also indicates that $c_2$ depends only on the dimension $d$, $c_2$ can be estimated once $d$ is known. The constant $c_1$, on the other hand, is a hyperparameter and its tuning corresponds to the trade-off between exploration and exploitation. Higher values of $c_1$ yield updates that favour exploitation over exploration, because the variance of the Wishart distribution is a decreasing function of $n$.

We now focus on the convergence behavior of the proposed method. Notice that $\bar{R}_e \geq \bar{R}$ and therefore $n_{k+1} \geq n_k$. The boundary condition $\bar{R}_e = \bar{R}$ is satisfied only at convergence, where all samples from the sampling distribution return identical rewards. In all other cases the distribution shrinks because of the decreasing nature of the variance with increasing $n$. This guarantees eventual convergence. Indeed, the converged solution may be a local minimum, as would be the case for any policy search method. It is worth noting that although our modeling assumption of a linear relationship between $H$ and $n$ is the most appropriate, any decreasing function of $n$ is sufficient.
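The following is a minimal sketch of the update rule as reconstructed in (7): $n$ grows with the elite-versus-average reward gap, and $\mathbf{V}$ is set to the elite mean rescaled by the new $n$ so that the Wishart mean tracks the elites. The constants and array shapes are placeholders, not values from the paper.

```python
import numpy as np

def wishart_update(elite_S, n_k, mean_reward_all, mean_reward_elite, c1=1.0, c2=1.0):
    """One update of a Wishart factor (V, n), following the reconstruction in (7).

    elite_S: array of elite SPD samples, shape (n_elites, d, d).
    The reward gap is non-negative, so n never decreases and the sampling
    variance (for a fixed mean) shrinks over iterations.
    """
    n_next = n_k + (c1 / c2) * (mean_reward_elite - mean_reward_all)   # (7a)
    V_next = np.mean(elite_S, axis=0) / n_next                         # (7b): E[S] = n V
    return V_next, n_next
```

Averaging and rescaling SPD matrices preserves symmetry and positive definiteness, which is exactly why the update keeps every future sample stable.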

IV-B Stability-Guaranteed Policy Search

Algorithm 1: Policy search with stability guarantees
1: Initialize the sampling distribution parameter $\boldsymbol{\nu}_0$
2: $k \leftarrow 0$
3: while not converged do
4:     Do $N$ rollouts with samples $\boldsymbol{\theta}_i \sim p(\boldsymbol{\theta}; \boldsymbol{\nu}_k)$, $i = 1, \ldots, N$
5:     Extract the elite samples from $\{\boldsymbol{\theta}_i\}$ based on the performance measure (Sec. III-C)
6:     Compute the Gaussian parameters $(\boldsymbol{\mu}_{k+1}, \boldsymbol{\Sigma}_{k+1})$ using MLE on the elite attractor points
7:     Compute the Wishart parameters $(\mathbf{V}_{k+1}, n_{k+1})$ using (7) on the elite SPD samples, for each Wishart factor
8:     Assemble $\boldsymbol{\nu}_{k+1}$ from the updated factors
9:     Increment $k$

Our goal is to derive a model-free policy search method that guarantees the stability constraints of i-MOGIC. More specifically, we seek an algorithm that solves (1) using the general strategy in (2), if possible, but subject to the stability constraints in (6). The policy parameterization is (3), with the parameter set in (5). The only missing pieces are the sampling distribution and the exact strategy for the iterative update of its parameter $\boldsymbol{\nu}$.

We define a sampling distribution of the form:

$p(\boldsymbol{\theta}; \boldsymbol{\nu}) = \mathcal{N}(\mathbf{X}^{*}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) \prod_{j} \mathcal{W}(\mathbf{S}^{j}; \mathbf{V}^{j}, n^{j})$  (8)

where $p(\boldsymbol{\theta}; \boldsymbol{\nu})$ is the joint probability distribution of all the parameters in (5). After introducing the notational simplification $\mathbf{X}^{*} = [\mathbf{x}^{*1}, \ldots, \mathbf{x}^{*K}]$, $\mathcal{N}(\mathbf{X}^{*}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is a multivariate Gaussian distribution with parameters $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and each $\mathcal{W}(\mathbf{S}^{j}; \mathbf{V}^{j}, n^{j})$ represents a Wishart distribution with parameters $(\mathbf{V}^{j}, n^{j})$, where $\mathbf{S}^{j} \in \{\mathbf{K}^{0}, \mathbf{D}^{0}, \mathbf{K}^{k}, \mathbf{D}^{k}, \alpha^{k}\}$ for $k = 1, \ldots, K$. While the matrix cases have dimension $d \times d$, the factor for $\alpha^{k}$ is a one-dimensional Wishart distribution over positive numbers (also equivalent to a Gamma distribution). Since we have modeled the individual elements in (5) as independent random variables, each of them can be maintained and updated independently. The update of the Gaussian factor is performed by the standard practice of MLE. For the remaining Wishart cases we employ our novel strategy in (7). The complete algorithm is shown in Alg. 1.
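Putting the pieces together, the following is a minimal sketch of one iteration of Alg. 1 under the factored sampling distribution (8): attractor points are drawn from a Gaussian and updated by MLE, while every SPD factor is drawn from a Wishart and updated with (7). The rollout_reward function, the dictionary layout of the parameter $\boldsymbol{\nu}$, and the population sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import wishart

def iteration(nu, rollout_reward, n_samples=15, n_elites=5, c1=1.0, c2=1.0, rng=None):
    """One iteration of the stability-guaranteed policy search (sketch of Alg. 1)."""
    rng = rng or np.random.default_rng()
    thetas, rewards = [], []
    for _ in range(n_samples):
        theta = {
            "x_star": rng.multivariate_normal(nu["mu"], nu["Sigma"]),
            "spd": {name: wishart.rvs(df=n, scale=V, random_state=rng)
                    for name, (V, n) in nu["wishart"].items()},  # SPD by construction
        }
        thetas.append(theta)
        rewards.append(rollout_reward(theta))
    rewards = np.array(rewards)
    elite_idx = np.argsort(rewards)[-n_elites:]
    R_bar, R_e = rewards.mean(), rewards[elite_idx].mean()

    # Gaussian factor: standard MLE on the elite attractor points
    elite_x = np.array([thetas[i]["x_star"] for i in elite_idx])
    nu["mu"] = elite_x.mean(axis=0)
    nu["Sigma"] = np.cov(elite_x, rowvar=False) + 1e-6 * np.eye(len(nu["mu"]))

    # Wishart factors: the update reconstructed in (7)
    for name, (V, n) in nu["wishart"].items():
        elite_S = np.array([thetas[i]["spd"][name] for i in elite_idx])
        n_new = n + (c1 / c2) * (R_e - R_bar)
        nu["wishart"][name] = (elite_S.mean(axis=0) / n_new, n_new)
    return nu
```

Because every sampled stiffness, damping and scalar parameter is a Wishart draw, each candidate policy satisfies the SPD conditions and is therefore stability-certified before it is ever executed.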

V Experimental Results

Several experiments are performed to study the proposed method, using the MuJoCo simulator and the industrial robot YuMi. A sampling frequency of 100 Hz is used.

V-A Simulated 2D Block-insertion

In this experiment a series of simulated 2D block-insertion tasks are set up. Three tasks are defined by varying the initial position of the block and also the insertion clearance (Fig. 4). The policy controls the block by exerting an orthogonal 2D force (rotation not allowed) with the goal of inserting it into the slot. The task executions are expected to generate contacts between the block and the environment.

Fig. 4: 2D block-insertion tasks. From left to right: Task1 (insertion clearance: 0.5 mm, execution time: 1 s), Task2 and Task3 (insertion clearance: 2 mm, execution time: 2 s). The block is 50×50×50 mm and weighs 2 kg. Rough illustrative paths are indicated by red arrows.
Task1  Task2  Task3
  1      8      -
TABLE I: The best values of the number of components ($K$) for 2D block-insertion.
Fig. 5: RL of the 2D block-insertion tasks. Training progress for Task1, Task2 and Task3. Success rate is the % success among all rollouts per iteration.
Fig. 6: A VIC policy is learned while maintaining stability (Task2). Two rollouts are considered: one from iteration 0 (unsuccessful) and one from iteration 99 (successful). (a) Lyapunov energy plots of the rollouts. Combined stiffness (b) and damping (c) matrices at regular intervals along the successful trajectory. (d-e) Contour plots of the Lyapunov function for the two rollouts, with their trajectories overlaid on them.

Other aspects of the experimental setup include hyperparameter tuning, initialization and the reward model. Apart from the number of spring-damper components ($K$), which we discuss in the following paragraph, only the learning rate needs to be tuned, and a single value worked well across our experiments. We fixed the number of rollouts per iteration and the number of elites for all our experiments. This choice is motivated by the real-world sample complexity of RL for our final experiment, where an excess of rollouts per iteration may be undesirable (also see Sec. VI). We initialize the parameters of the sampling distribution as follows. The Wishart parameter $\mathbf{V}$ is set to the identity matrix (or 1 in the scalar case) and the $n$ parameter is set to its minimum value. The value of $n$ should be interpreted as a measure of confidence in the value of $\mathbf{V}$. For the Gaussian distribution, we set the mean to the zero vector and the covariance to identity. Finally, we adopt a reward model that consists of the Euclidean distance to the goal and a quadratic cost term for the actions.

Does the value of $K$ play a significant role? This question is answered with a grid search over candidate values of $K$. We found that the minimum number of components required does depend on the task. The results are summarized in Table I. Task3 was not successful even for the largest value considered; we believe this task can only be solved with a more complex reward structure (reward shaping) that can circumvent local optima. The final RL progress for the selected values of $K$ is shown in Fig. 5. Task2 is learned earlier than Task1, despite being more complex, because Task1 has a lower insertion clearance. The successful paths were similar to the ones indicated in Fig. 4.

The second aspect worth investigating is whether the method learns variable impedance at all. The combined stiffness (Fig. 6b) and damping (Fig. 6c) matrices, along a successful rollout (position trajectory) in the final iteration of Task2, are plotted. We see that both the stiffness and damping matrices have larger eigenvalues (higher impedance) at the beginning of the trajectory and smaller ones (lower impedance) in the vicinity and interior of the slot. This is exactly what one would expect: higher impedance for free motion and lower impedance for contact motion. Note that the trajectory indicates free motion up until it makes contact, followed by a smooth insertion. The first contact is visible as a small blip right before the insertion.

Finally, to test that stability is indeed maintained throughout the training, we plot the Lyapunov (energy) function of i-MOGIC for one rollout each from the first and last iterations (Fig. 6a). We see that both plots are monotonically decreasing, indicating the main stability-guaranteeing property $\dot{V} \leq 0$. The plot of the first-iteration case did not converge to zero, indicating that it did not succeed. However, stability is still preserved, if not asymptotic stability. This is an example of the case where stability is preserved even when a (passive) environment is preventing convergence to the equilibrium point. In Fig. 6d-e, we see how RL has reshaped an initial Lyapunov energy function into one that is rich enough to succeed in the task, without compromising stability. Note that the energy function shares parameters with the policy.

V-B Peg-in-hole in Simulation

Fig. 7: The peg-in-hole task. (a) The MuJoCo environment: cylindrical peg (24 mm wide, 50 mm long), square hole (25 mm wide, 30 mm deep), and execution time of 2 s. (b-d) The real-world environment: cylindrical peg (27 mm wide, 50 mm long), cylindrical hole (27.5 mm wide, 40 mm deep), and 5 s of execution time. (c) Successful insertion position; an insertion depth of 20 mm is considered successful. (d) 3D-printed cylindrical peg and hole.
Fig. 8: RL on simulated peg-in-hole. (a) Reward progress with and without initialization. (b) Success rate is the % success among all rollouts per iteration. (c) Growth curve of the parameter $n$. (d) Average of the element-wise standard deviations of the Gaussian-modeled and Wishart-modeled parameters (init case).
Fig. 9: Stability test for the first (a) and last (b) RL iterations. Trajectories (3D translation only) from five randomly chosen initial positions.

In this experiment we scale up to a 7-DOF manipulator arm that is expected to insert a cylindrical peg into a square hole (Fig. 7a). The peg-in-hole task has historically been considered the benchmark problem for contact-rich manipulation [13, 15, 14]. To ensure that the policy learns to exploit and also comply with environmental constraints, the initial position is chosen to be laterally away from the hole. The policy is implemented in the operational space, including both translation and rotation (Euler angle representation). To comply with the Lyapunov analysis formalism, where the global attractor point (goal position) is at the origin, the goal frame is adopted as the reference frame. As mentioned earlier, we fix the population parameters, that is, the number of rollouts per iteration and the number of elites. Other settings include the number of components $K$ and the learning rate. We adopt the reward model suggested in [6].

To mitigate the increased risk of local minima due to the larger parameter set, we propose the following principled approach. The base spring-damper distribution parameters are initialized as, but not constrained to, diagonal matrices such that a suitable critically damped pair would bring the manipulator to the close vicinity of the goal. For free motion such a value can be calculated analytically, but in the contact-rich case an experimental approach is best. The $n$ parameters indicate the level of confidence in $\mathbf{V}$, and we set their initial value accordingly. For the remaining spring-damper components, the stiffness parameter was initialized to a fraction of the base stiffness and the damping set to the corresponding critically damped value. For each of the local attractor points, the Gaussian distribution mean is initialized to the zero vector (goal point) and the covariance is initialized with a diagonal matrix such that the initial position of the manipulator is one unit of Mahalanobis distance away from the goal.
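As an illustration of the critically damped initialization described above, the sketch below builds a diagonal base stiffness and its matching damping for an assumed effective mass, and converts them to Wishart parameters via $\mathbb{E}[\mathbf{S}] = n\mathbf{V}$; the numerical values and the per-axis second-order model are placeholders, not the ones used in the paper.

```python
import numpy as np

def critically_damped_init(k_diag, mass, n0):
    """Diagonal base spring-damper and the corresponding Wishart (V, n) parameters.

    For a per-axis second-order system m*x'' + d*x' + k*x = 0, critical damping
    is d = 2*sqrt(m*k) (an assumed simplification of the contact-free dynamics).
    """
    K0 = np.diag(k_diag)
    D0 = np.diag(2.0 * np.sqrt(mass * np.asarray(k_diag)))
    # choose V so that the Wishart mean n0*V equals the desired initial gain
    return {"K0": (K0 / n0, n0), "D0": (D0 / n0, n0)}

params = critically_damped_init(k_diag=[300.0, 300.0, 300.0], mass=2.0, n0=10)
```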

Does the RL process converge by progressively narrowing the sampling distribution towards an approximate solution? Fig. 8a-b shows examples of RL with and without initialization. The case without initialization did not succeed at all. Further, we can see in Fig. 8c that while the overall reward converges for the case with initialization, the corresponding $n$ parameter grows to a high value. We would expect the entropy to reduce accordingly, but more interesting is how the variance of the Wishart samples reduces. In Fig. 8d we compare the average of the element-wise standard deviations of the Gaussian-modeled parameters (the attractor points) and the Wishart-modeled parameters (the stiffness, damping and scalar terms). Note that the former are updated by MLE, while the latter are updated by our novel update rule. Remarkably, both cases appear to converge to zero at a similar rate. Note that 20-25 iterations is all it took for convergence.

It is also interesting to see the stability property in action. In Fig. 9, we plot some trajectories (in operational space) for particular samples of policy parameters in the first (9a) and last (9b) iterations. Individual trajectories correspond to randomly chosen initial positions. Even without any learning (first iteration) the trajectories converge towards the goal position (9a). Similar behavior is evident for the fully learned final iteration also (9b). In fact, for the latter case, one of the rollouts even succeeded, indicating a strong potential for generalization across initial positions if trained accordingly.

V-C Peg-in-hole with Real Robot

In this experiment we bring the previous simulated experiment into the real world, with some minor differences in the geometry and the initial position (Fig. 7b-d). Note that the insertion task is even more challenging, with a clearance of only 0.5 mm. We use the same hyperparameter and parameter initialization values as before. A stopping condition for the RL process is introduced: 10 successes out of 15 trials, provided that the reward has plateaued. In theory, for a deterministic policy and dynamics, even one success at any iteration is enough as a solution, but such a solution is unlikely to be robust enough for the real world.

Fig. 10: RL progress for peg-in-hole with real robot. Success rate is % success among all rollouts per iteration.

The stopping condition was satisfied at iteration 20 with 11 successful insertions. The overall progress of RL can be seen in Fig. 10. Interestingly, due to the good initialization, there were instances of insertion very early on; but, as is evident from the success rate plot, they were also unreliable. The $n$ parameter reached a value of about 2000 at iteration 20. We recommend choosing a target interval for the final value of $n$ as a guiding principle for tuning the learning rate. Throughout the RL process, the motion exhibited compliant behaviour that relied on environmental constraints for task execution. For example, the spike in success rate at iteration 5 corresponded to a behavior in which the peg slid (on the horizontal surface) all the way to the hole. Later iterations learned a faster strategy that targets the opening of the hole directly, which also resulted in compliant insertions. Stable behavior, such as that represented in Fig. 9, was also evident throughout the RL process.

VI Discussions

Our results reveal that it is indeed possible to guarantee all-the-time-stability for RL in the context of contact-rich manipulation. We validated the proposed method by successfully learning the challenging benchmark problem of peg-in-hole. The stability guarantee that originates in the theoretical proof of i-MOGIC [11] is preserved by the unique properties of our novel CEM-like policy search method. It is also noteworthy that our policy is implemented in the operational space, including translation and rotation.

As a CEM-like optimization method, the proposed method is susceptible to local minima and high sample complexity. This is generally the case for all policy search RL methods. Our method tackles the local minima issue by adopting a principled initialization method that is general enough for any task. Initialization is possible only because of the specialized nature of the policy, which features interpretable structures such as spring-damper systems. Good initialization also significantly reduced the sample complexity of the RL process despite the moderate size of the parameter set. In [4] approximately 600 iterations were required for tasks such as door opening and surface wiping, while ours required only 20-50 iterations for the arguably more complex peg-in-hole task. Note that our parameterization features full stiffness and damping matrices that are learned, while in related works such as [4, 10] only diagonal stiffness matrices are learned.

The main contribution of this paper is a novel CEM-like algorithm. The novelty lies not only in modeling most of the sampling distribution (except for the local attractor points) with Wishart distributions, but also in the update rule for the sampling distribution. Although the standard approach of MLE could have been attempted for the distribution update, our novel strategy allowed us to circumvent some practical issues associated with a numerical MLE for the Wishart distribution (see Sec. IV-A). Interestingly, one could associate an additional benefit with our update strategy. Real-world RL imposes constraints on the number of rollouts per iteration and the number of elites. It can be argued that MLE with so few samples may cause the sampling distribution to collapse rapidly and thus contribute to local minima. Our method, on the other hand, offers fine control over the process through the learning rate. Therefore, we recommend keeping the population sizes fixed and varying the learning rate to control the trade-off between exploitation and exploration.

A natural question is how much the specialized policy parameterization (i-MOGIC) restricts the complexity of the tasks that can be learned. Although we have demonstrated a peg-in-hole task, it is not clear if more complex motion profiles, such as those that temporarily move away from the goal, can be handled. Further studies are required to elucidate this. ANN-based policy parameterization, with or without a VIC structure, can be expected to be more flexible in the range of tasks that can be learned. Furthermore, such methods can also incorporate complex high-dimensional state (or observation) spaces such as image pixels. However, it is extremely difficult for such methods to produce stable policies, let alone achieve all-the-time-stability. Since a notion of all-the-time-stability is more or less mandatory for real-world deployment, methods such as ours that achieve it without being too restrictive should be seen as a promising direction.

VII Conclusion

In this paper, we set the ambitious goal of attaining stability-guaranteed reinforcement learning for contact-rich manipulation tasks. To include stable exploration in our notion of stability, we introduced the term all-the-time-stability. We built upon a previously proposed stable variable impedance controller and developed a novel model-free policy search algorithm inspired by the Cross-Entropy Method. In our approach, we formulated the sampling distribution such that it can only produce stable parameter samples, and we derived a law for its iterative update. The experimental results are significant since not only was all-the-time-stability achieved, but it was also possible to perform real-world reinforcement learning on the benchmark task of peg-in-hole with very low sample complexity.

References

  • [1] S. Chiaverini, B. Siciliano, and L. Villani, “A survey of robot interaction control schemes with experimental comparison,” IEEE/ASME Transactions on mechatronics, vol. 4, no. 3, pp. 273–285, 1999.
  • [2] N. Hogan, “Impedance Control: An Approach to Manipulation: Part I—Theory,” Journal of Dynamic Systems, Measurement, and Control, vol. 107, no. 1, pp. 1–7, 03 1985.
  • [3] K. J. A. Kronander, “Control and learning of compliant manipulation skills,” 2015.
  • [4] R. Martín-Martín, M. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg, “Variable impedance control in end-effector space. an action space for reinforcement learning in contact rich tasks,” in Proceedings of the International Conference of Intelligent Robots and Systems (IROS), 2019.
  • [5] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8943–8950.
  • [6] S. Levine, N. Wagener, and P. Abbeel, “Learning contact-rich manipulation skills with guided policy search,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on.   IEEE, 2015, pp. 156–163.
  • [7] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel, “Learning robotic assembly from cad,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–9.
  • [8] A. Tamar, G. Thomas, T. Zhang, S. Levine, and P. Abbeel, “Learning from the hindsight plan—episodic mpc improvement,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on.   IEEE, 2017, pp. 336–343.
  • [9] J. Buchli, F. Stulp, E. Theodorou, and S. Schaal, “Learning variable impedance control,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 820–833, 2011.
  • [10] J. Rey, K. Kronander, F. Farshidian, J. Buchli, and A. Billard, “Learning motions from demonstrations and rewards with time-invariant dynamical systems based policies,” Autonomous Robots, vol. 42, no. 1, pp. 45–64, 2018.
  • [11] S. M. Khansari-Zadeh, K. Kronander, and A. Billard, “Modeling robot discrete movements with state-varying stiffness and damping: A framework for integrated motion generation and impedance control,” Proceedings of Robotics: Science and Systems X (RSS 2014), vol. 10, p. 2014, 2014.
  • [12] R. Y. Rubinstein and D. P. Kroese, The Cross Entropy Method: A Unified Approach To Combinatorial Optimization, Monte-Carlo Simulation (Information Science and Statistics).   Berlin, Heidelberg: Springer-Verlag, 2004.
  • [13] V. Gullapalli, R. A. Grupen, and A. G. Barto, “Learning reactive admittance control,” in Robotics and Automation, 1992. Proceedings., 1992 IEEE International Conference on.   IEEE, 1992, pp. 1475–1480.
  • [14] S.-k. Yun, “Compliant manipulation for peg-in-hole: Is passive compliance a key to learn contact motion?” in 2008 IEEE International Conference on Robotics and Automation.   IEEE, 2008, pp. 1647–1652.
  • [15] M. Nuttin and H. Van Brussel, “Learning the peg-into-hole assembly operation with a connectionist reinforcement technique,” Computers in Industry, vol. 33, no. 1, pp. 101–109, 1997.
  • [16] I. Lenz, R. A. Knepper, and A. Saxena, “Deepmpc: Learning deep latent features for model predictive control.” in Robotics: Science and Systems, 2015.
  • [17] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, “Learning force control policies for compliant manipulation,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on.   IEEE, 2011, pp. 4639–4644.
  • [18] J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel, “Reinforcement learning on variable impedance controller for high-precision robotic assembly,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 3080–3087.
  • [19] S. Calinon, I. Sardellitti, and D. G. Caldwell, “Learning-based control strategy for safe human-robot interaction exploiting task and robot redundancies,” in Proc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS), 2010.
  • [20] K. Kronander and A. Billard, “Learning compliant manipulation through kinesthetic and tactile human-robot interaction,” IEEE transactions on haptics, vol. 7, no. 3, pp. 367–380, 2013.
  • [21] S. M. Khansari-Zadeh and O. Khatib, “Learning potential functions from human demonstrations with encapsulated dynamic and compliant behaviors,” Autonomous Robots, vol. 41, no. 1, pp. 45–69, Jan 2017.
  • [22] P. Kormushev, S. Calinon, and D. G. Caldwell, “Robot motor skill coordination with em-based reinforcement learning,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct 2010.
  • [23] S. M. Khansari-Zadeh and A. Billard, “Learning stable nonlinear dynamical systems with gaussian mixture models,” IEEE Transactions on Robotics, vol. 27, no. 5, pp. 943–957, Oct 2011.
  • [24] K. Kronander and A. Billard, “Stability considerations for variable impedance control,” IEEE Transactions on Robotics, vol. 32, no. 5, pp. 1298–1305, 2016.
  • [25] F. Ferraguti, C. Secchi, and C. Fantuzzi, “A tank-based approach to impedance control with variable stiffness,” in 2013 IEEE International Conference on Robotics and Automation.   IEEE, 2013, pp. 4948–4953.
  • [26] S. Mannor, R. Y. Rubinstein, and Y. Gat, “The cross entropy method for fast policy search,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 512–519.
  • [27] M. Wen and U. Topcu, “Constrained cross-entropy method for safe reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 7450–7460.
  • [28] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in neural information processing systems, 2017, pp. 908–918.
  • [29] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, 2019, pp. 3387–3395.
  • [30] R. Cheng, A. Verma, G. Orosz, S. Chaudhuri, Y. Yue, and J. Burdick, “Control regularization for reduced variance reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning, ICML, 2019, pp. 1141–1150.
  • [31] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, “Parameter space noise for exploration,” in International Conference on Learning Representations (ICLR), 2018.
  • [32] J.-J. E. Slotine, W. Li et al., Applied nonlinear control.   Prentice hall Englewood Cliffs, NJ, 1991, vol. 199, no. 1.
  • [33] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer, 2006.