Compare Contact Model-based Control and Contact Model-free Learning: A Survey of Robotic Peg-in-hole Assembly Strategies

by   Jing Xu, et al.
Tsinghua University

In this paper, we present an overview of robotic peg-in-hole assembly and analyze two main strategies: contact model-based and contact model-free strategies. More specifically, we first introduce the contact model-based control approaches, which consist of two steps: contact state recognition and compliant control. Additionally, we focus on a comprehensive analysis of the whole robotic assembly system. Second, without the contact state recognition process, we decompose the contact model-free learning algorithms into two main subfields: learning from demonstrations and learning from environments (mainly based on reinforcement learning). For each subfield, we survey the landmark studies and ongoing research to compare the different categories. We hope to strengthen the relation between these two research communities by revealing the underlying links. Finally, the remaining challenges and open questions in the field of robotic peg-in-hole assembly are discussed. The promising directions and potential future work are also considered.




I Introduction

Robotic assembly, an essential component of industrial applications, has been studied for a long time. In this work, we look at the most common problem of robotic assembly: peg-in-hole assembly, which is the basis of a wide range of component assemblies[kuangenjamming][su2017sensor-less]. Robotic peg-in-hole assembly has been extensively researched and applied in various fields, from large-scale object assembly, such as aviation components[wan2017optimal][qiao2016largescale], engines[su2012new] and windshield assembly, to small-scale components, such as mold casting manufacturing[visionformultiplepeghole], electronic components[su2012sensor] and even microproduct[chang2011visual] assembly.

I-A The development of robotic peg-in-hole assembly

Many academic and industrial researchers have focused on promoting robotic peg-in-hole assembly on the basis of classical conditioning learning with conventional compliant control strategies[lefebvre2005active], observational learning with learning from demonstrations[zhu2018robot], and operant conditioning learning with learning from environments[sutton2018reinforcement]. In this work, we decompose the existing peg-in-hole assembly strategies into two categories, contact model-based and contact model-free, as illustrated in Fig. 1. The contact model-free strategies can be further subdivided into learning from demonstrations and learning from environments.

Fig. 1: Classification of robotic peg-in-hole assembly strategies.

I-A1 Conventional contact model-based strategies

Conventional contact model-based strategies rely on contact model analysis and decompose peg-in-hole assembly into two steps: contact state recognition and compliant control. The compliant control strategies are preprogrammed by humans according to the recognized contact state, by analyzing the underlying friction and contact model[lefebvre2005active]. Researchers have made efforts to use contact model-based strategies to solve a wide range of peg-in-hole assembly problems with special requirements and high complexity in autonomous industrial manufacturing. For instance, the best methods for performing high-precision assembly[su2017sensor-less][rlhighpercision][2018modelfreelearning] and large-scale component assembly[zhi2014largescale][qiao2016largescale][wan2017optimal] with limited sensors have been widely investigated. Additionally, methods for deriving an assembly control strategy that can flexibly cope with complicated multiple peg-in-hole assembly have attracted considerable attention from researchers[kuangenjamming][zhiminfeedback][hou2018learning].

To date, most of the published research on peg-in-hole assembly has focused only on optimizing the separate stages. On the one hand, contact state recognition has been explored to improve the success rate of recognition through theoretical analysis[whitney1982jamming] and statistical techniques[jasim2017contact] without regard to assembly implementation. On the other hand, to enhance efficiency and stability with small contact forces during assembly, some optimization approaches have been applied to improve the performance of compliant control strategies directly[nn2002analysis][tang2016autonomous][hou2018learning] without using the results of contact state recognition. However, less research has focused on analyzing the relations between contact state recognition and control strategies and on integrating these two stages[lefebvre2005active].

I-A2 Learning from demonstrations (LFD)

As autonomous robotic peg-in-hole assembly techniques progress, compliant assembly control strategies are expected to perform more complicated assembly with higher degrees of compliance in unstructured and nonstationary environments. It is impossible for preprogrammed contact model-based compliant control strategies to take into account all the possible assembly situations in advance. Motivated by the fact that human beings are capable of handling various complicated assembly tasks flexibly and under unpredictable conditions, LFD methods[argall2009surveyforlearingdemonstration] are based on the idea that assembly behaviors can be learned by interpreting human demonstrations without preprogramming, and they have been developed to solve peg-in-hole assembly in recent years[wan2017optimal][tang2015learning][tang2016teach]. To imitate the compliant behaviors of human beings for peg-in-hole assembly, the force control strategies are taken into account in addition to the assembly motion path[tang2015learning][tang2016teach]. A comprehensive survey of LFD methods was presented in [argall2009surveyforlearingdemonstration], which contributed a structure with two phases: gathering demonstrations and deriving a policy. Zhu and Hu[zhu2018robot] surveyed the LFD techniques applied in general robotic assembly and introduced the whole demonstration-based assembly system. Kyrarini et al.[kyrarini2018robot] examined and compared several modeling methods of human demonstrations for industrial assembly tasks.

The LFD method is an effective learning algorithm used to solve robotic assembly problems. However, few studies have analyzed the challenges of LFD methods for peg-in-hole assembly scenarios. Furthermore, less research has focused on improving the ability of adaptation to environmental changes, uncertainties and generalization in new assembly situations.

I-A3 Learning from environments (LFE)

In contrast to improving the generalization of LFD methods, current robots are expected to actively recognize the surrounding environment and to learn assembly skills incrementally, as human beings do. Reinforcement learning (RL) based methods hold great promise for achieving such performance, and these methods enable agents to learn behaviors through interaction with the surrounding environment and ideally by generalizing to unseen scenarios or tasks[sutton2018reinforcement]. To solve the inherent difficulties in behavior modeling and generalization, an adaptable and robust control system was developed not only through learning from expert demonstrations but also through incremental learning. With the development of artificial intelligence techniques, especially deep learning, typical RL-based learning approaches, especially model-free learning algorithms, have been extensively applied to perform complex manipulations[jan2013reinforcement][levine2015learning], including robotic peg-in-hole assembly[rlhighpercision][rl2018peginhole].

It is widely accepted that it is possible to apply model-free RL algorithms in real-world robotic assembly tasks at the expense of data efficiency. To enhance practicality, many studies focusing on combining typical model-free RL approaches with prior knowledge or expert demonstrations have been published[zhiminfeedback][2018learningfromCAD][2018modelfreelearning]. Recently, there has been increasing interest in the development of model-based RL in the robotics community. For example, transition dynamics models have been utilized to derive the feedback rewards or optimal actions and have been investigated in [levine2015learning][polydoros2017survey][2018modelfreelearning]. However, for robotic peg-in-hole assembly, it is not clear how to fuse the existing knowledge into a model-free learning process naturally. Although some papers have worked on the comparison and combination of model-based and model-free RL algorithms for robotic applications[polydoros2017survey], no survey has yet explored the relation between model-based RL algorithms, the conventional theoretical contact model and the implicit model learned from demonstrations.

I-B The motivation and purpose of this paper

Although numerous studies on robotic assembly have been published, there is still no paper surveying the existing research that includes both the contact model-based and the two kinds of contact model-free assembly strategies for peg-in-hole assembly. To the best of our knowledge, few studies have compared conventional contact model-based strategies and contact model-free algorithms. Therefore, one motivation of this paper is to survey the existing assembly strategies and group them as shown in Fig. 1 for the first time. Another motivation is to exploit the underlying relations between different assembly strategies and to explore promising solutions by combining the strengths of contact model-based and contact model-free control strategies. Consequently, to make existing peg-in-hole research tractable, we attempt to give a fairly complete overview with the following goals:

  1. This paper surveys the state-of-the-art research and ongoing developments of robotic peg-in-hole assembly and identifies promising approaches.

  2. This paper provides a novel and clear grouping method to analyze the existing research completely with comprehensive insights.

  3. This paper explores the underlying relationship between traditional contact model-based control strategies and contact model-free learning algorithms and proposes the promising solutions.

  4. This paper highlights the remaining challenges of the existing approaches and identifies open questions for future research.

The remainder of the paper is organized as follows: Section II introduces the whole robotic peg-in-hole assembly system. Section III analyzes the contact model-based control strategies in detail. Section IV surveys and compares two contact model-free learning algorithms. Section V concludes with a discussion of the open questions and potential future research directions.

II Robotic peg-in-hole assembly system

In this section, we briefly introduce the construction of a robotic peg-in-hole control system, as shown in Fig. 2, which consists of three components: mating parts, the sensing system and manipulators. Generally, the holes are fixed and the manipulators grab the pegs to complete the parts mating according to the feedback from the sensing system.

II-A Mating parts

The mating parts, as shown in Fig. 2, are the assembly components and include the pegs and holes. According to the geometrical features shown in Fig. 2, a cylindrical peg-in-hole system is the basic assembly problem and has been extensively studied. Complex-shape peg assembly is also used in some special cases, including square pegs[park2013intuitive][kim2014holedectect], pegs with key slots and pegs with complex shapes[2014forceguide]. In addition, according to the number of peg-hole mating pairs, the research work can be decomposed into single peg-in-hole[wan2017optimal] and multiple peg-in-hole assembly[fei2003assembly][kuangenjamming] scenarios, as shown in Fig. 2. The complexity of assembly increases as the contact states of multiple peg-in-hole scenarios become more complicated.

Fig. 2: Introduction of a robotic peg-in-hole assembly system. (a) Setup of the robotic peg-in-hole assembly system. (b) Figures from Ren et al.[ren2018learning], Kim et al.[kim2014holedectect] and Song et al.[2014forceguide]. (c) Figures from Jasim et al.[jasim2014position], Xu et al.[zhiminfeedback] and Fei and Zhao[fei2003assembly].

| Category | Types | Characteristics | Methods |
|---|---|---|---|
| Vision | Camera (2D, stereo), laser tracker | Contact-less, low-resolution | Boundary detection, visual servo strategy |
| Force | F/T, torque | Monitors contact force | Blind search strategy, impedance control, force-based position control |
| Sensor-less | Joint current, joint encoder | Low cost, no installation | ARIE-based inserting, human-like exploration searching |

TABLE I: Summary of the different sensing systems.

The scale of assembly components corresponds to the application and ranges from macroassembly for large aviation parts to microassembly for electronic components on circuit boards. The clearance between peg and hole also differs with the requirements of the assembly scenarios. In some high-precision scenarios[rlhighpercision][tang2015learning][kuangenjamming], the clearance may be below the resolution and accuracy of the robot, which is typically in the range of 0.02 up to 0.2 mm. In addition to rigid pegs with high stiffness values[zhang2017force], some flexible peg-in-hole components composed of plastics[kuangenjamming] and wood are also used. The clearance and hardness of the mating surfaces change the nature of the part mating tasks[2012contactsvm] and lead to various degrees of complexity.

II-B Sensing system

In the case of robotic peg-in-hole assembly, the sensing system is used to acquire feedback from environment, similar to human sight and tactile sensing. Sensing systems based on two types of sensors or other feedback are surveyed, and the corresponding characteristics are as follows.

II-B1 Vision sensors

2D cameras are widely used for coarse localization by extracting the boundaries of holes from images[miura1998vision]. Marker points captured by 2D cameras were used to calculate the pose (position and orientation) of pegs in [wan2017optimal]. Image-based visual servo systems were designed to track the accurate hole position based on the extracted features[pauli2001vision][wang2008microassembly]. For high-speed microscale peg-in-hole assembly, Chang et al.[chang2011visual] and Huang et al.[huang2013visualservoing] proposed position-based visual servo systems with fast convergence guarantees based on the image calibration method and limited calibration. In contrast to 2D cameras, stereo cameras (Kinect) have been applied to capture 3D point data to estimate the accurate 3D poses of mating parts[abu2014solving][park2017compliance]. Additionally, laser trackers, as high-precision and contact-less tools, have been employed to enhance the position accuracy of large-scale peg-in-hole assembly systems[zhi2014largescale][qiao2016largescale].

II-B2 Force sensors

Position controllers based on vision sensors might produce large contact forces due to position errors. Therefore, force feedback can be utilized not only to monitor the assembly process, but also to accommodate the position uncertainty. The force feedback, referred to as wrench signals (forces and moments), can be acquired from an external force-torque (F/T) sensor mounted on the end-effector of the robot[zhiminfeedback] and from torque sensors integrated into the robot joints[lee2014active][ren2018learning], as shown in Fig. 2.

In general, force feedback is used for compliant control strategies in passive and active ways. As a passive example, auxiliary mechanical devices such as remote-center-compliance (RCC)[whitney1982jamming][sturges1996RCC] and a variant [xu2015robust] (composed of springs and dampers) attached to the end-effector were applied to accommodate the contact forces. Active compliant control strategies aim to control the assembly motions of robots actively based on force feedback [su2017sensor-less] and have been widely surveyed for more than 30 years. Wrench signals from sensors have been applied to recognize the contact-state and generate the low-level commands for position control of robots[lefebvre2005active] through impedance control[hogan1984impedance] and force-based position control[raibert1981hybrid].

II-B3 Sensorless systems

To eliminate the limitations of sensor frequency, sensor installation and measurement error, De Luca and Mattone[de2005sensorless] and Lee and Park[lee2014active] worked on efficient sensor-less active compliant control systems without external F/T sensors, in which the wrench signals were approximated from the currents of the joint motors. Additionally, the poses of pegs attached to the end-effector were obtained from the joint encoders of the robots. Based on the pose information, robotic peg-in-hole assembly strategies driven by environmental constraints, such as attractive regions in the environment (ARIE), without force-sensor feedback have been investigated recently[qiao2015concept][qiao2017iros].
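To make the idea of current-based wrench estimation concrete, the following minimal sketch maps motor currents to joint torques through a torque constant and gear ratio, and then to an end-effector force through the Jacobian transpose of a planar two-link arm. The arm geometry, the values of `kt` and `gear`, and the neglect of friction, gravity and inertia are illustrative assumptions; the cited works rely on more complete dynamic models.

```python
import math


def planar_jacobian(q1, q2, l1, l2):
    """Jacobian of the tip position of a planar two-link arm (assumed model)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def estimate_tip_force(currents, q, kt=(0.1, 0.1), gear=(100.0, 100.0),
                       l1=0.4, l2=0.3):
    """Approximate the planar tip force from joint motor currents.

    Assumes tau_i = kt_i * gear_i * current_i (hypothetical constants) and
    the static relation tau = J^T F, with friction and gravity ignored.
    """
    tau = [kt[i] * gear[i] * currents[i] for i in range(2)]
    jac = planar_jacobian(q[0], q[1], l1, l2)
    # Entries of J^T, written out as [[a, b], [c, d]].
    a, b = jac[0][0], jac[1][0]
    c, d = jac[0][1], jac[1][1]
    det = a * d - b * c
    if abs(det) < 1e-9:
        raise ValueError("arm is near a singular configuration")
    # Solve J^T F = tau with the closed-form 2x2 inverse.
    fx = (d * tau[0] - b * tau[1]) / det
    fy = (-c * tau[0] + a * tau[1]) / det
    return fx, fy


if __name__ == "__main__":
    print(estimate_tip_force(currents=(0.05, -0.02), q=(0.3, 0.8)))
```

In practice the estimate degrades near singular configurations and whenever unmodeled joint friction is significant, which is one reason such schemes are usually paired with a disturbance model.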

Consequently, both vision sensors and force sensors have strengths and shortcomings, as summarized in Table I. Most vision sensors are appropriate for peg-in-hole assemblies with larger clearances and weak contact forces[huang2013visualservoing]. In contrast, force-based assembly control strategies have been explored in high-precision assembly systems with small clearances[rlhighpercision][2018modelfreelearning], large-scale component assembly systems with large contact forces[hou2018learning][kuangenjamming] and complex-shaped part assembly systems with complex contact forces[song2016guidance][dietrich2010contact]. To combine the strengths of vision and force sensors, hybrid sensing systems have also been developed in [xie2009hybridsensor][2004datafusion].

II-C Manipulators

The manipulators in peg-in-hole assembly can be industrial robots (such as those produced by the companies ABB and KUKA) with 6 degrees of freedom (DOFs)[kuangenjamming][zhang2017force], which are used to perform assembly requiring large forces and moments. Furthermore, in recent years, high-compliance robot manipulators (Baxter, PR2) with 7 or 8 DOFs have been developed and can perform manipulation tasks more flexibly and safely[park2017compliance]. Generally, most robotic peg-in-hole assembly environments involve fixed holes, as shown in Fig. 2, and manipulators are used to grab the pegs to control the motions (translation and rotation) in Cartesian space or joint angles in joint space.

Robotic peg-in-hole assembly, in which the pegs are inserted to the desired depth of the holes, generally consists of two main phases: searching and inserting. In the searching phase, the location of the holes must be identified, which is an essential step for the following inserting phase in real-world scenarios. Image-based boundary extraction techniques[chang2011visual], visual servo tracking approaches[wang2008microassembly] and blind search strategies based on a designed human-like searching path[chhatpar2001search] and on force feedback[kim2014holedectect] have been applied to locate holes and track their position. In the inserting phase, the assembly actions involve not only motion but also the applied external forces. The inserting phase is more complicated than the searching phase, and most researchers focus only on active control strategies for the inserting phase and neglect the searching phase[tang2016autonomous][zhang2017force][hou2018learning]. Therefore, in this work, all of the following analyses of robotic peg-in-hole assembly concern the inserting phase.

III Contact model-based control strategies

The assembly process is a constrained motion with geometrical and environmental constraints. The contact constraints between mating parts can be represented as topological contact states[291962][xiao1998contact][10.1007/978-3-642-83625-1_17]. Thus, the overall assembly process can be described as a sequence of transitions between the contact states. For instance, as shown in Fig. 3, a single peg-in-hole insertion process is formulated as the transitions between no contact, one-point contact and two-point contact. The contact model-based strategies for robotic assembly shown in Fig. 4 generally include two steps: contact state recognition and compliant control.

Fig. 3: Contact state analysis and jamming diagram for single peg-in-hole assembly (Whitney [whitney1982jamming]).

The general idea of contact state recognition is to determine the contact constraints according to the observations, such as wrench signals and pose information. In this work, we analyze and decompose the existing research for contact state recognition into two categories: the analytical model[whitney1982jamming] and the statistical model[2012contactsvm]. The analytical model relies on the analysis of the geometrical and environmental constraints. The statistical model has been extensively developed in recent years and estimates the contact state through learning the pattern from the collected samples directly without the need for information about the tasks.

Fig. 4: Framework of contact model-based strategies.

| Categories | Success rate (%) | Computational time | Advantages | Disadvantages |
|---|---|---|---|---|
| GMM, DSM-GMM | 94.4 | 18.795 | Fit distribution of samples | Sensitive to the initial setting |
| SVM, SVM-FIM | 64.2 | 70.719 | Excellent generalization | Sensitive to missing sample |
| CFC, GS-FCA | 27.3/65.9 | 0.002/237.307 | Fuse prior knowledge | Only solve simple case |
| SGB | 60.7 | 92.083 | Without defining parameters | Sensitive to unseen samples |
| HMM | - | - | Eliminating time-varying uncertainties | Sensitive to gain value |
| NNs | - | - | Easy implementation | Less data-efficient |

TABLE II: Comparison of contact state recognition methods with statistical models.

III-A Contact state recognition with analytical model

Contact state recognition with an analytical model[xiao1998contact] generally includes two steps: contact state modeling and contact state determination.

III-A1 Contact state modeling

Contact states are modeled according to the mating features, and the most commonly used features are the force constraints between mating parts. Whitney [whitney1982jamming] proposed a quasi-static model to clarify the relation between force and geometrical constraints for a single peg-in-hole assembly. Additionally, two possible ill contact situations (wedging and jamming) are analyzed. Jamming, which often leads to insertion failure, represents the conditions in which the applied forces/moments of the peg are in the wrong proportions. As shown in Fig. 3, a jamming diagram is drawn to analyze the jamming conditions of the overall assembly process[whitney1982jamming]. Based on this idea, Sathirakul and Sturges[sathirakul1998jamming] and Fei and Zhao[fei2003assembly] enumerated contact states and presented a three-dimensional analysis of jamming conditions for multiple peg-in-hole assembly.

In contrast to the quasi-static model and rigid body assumption, some researchers have focused on dynamical or flexible models, which are closer to real assembly situations. Liao et al.[liao1998analysis] derived a general form of impact equations for an industrial manipulator performing peg-in-hole assembly using Lagrange’s impact model. Xia et al.[xia2006dynamicanalysis] established a three-dimensional jamming analysis based on a compliant elastic contact dynamic model and designed several no-jamming and no-wedging assembly strategies by analyzing the free contact conditions.

III-A2 Contact state determination

The contact states are determined by calculating the similarity between the observed actual and modeled contact states. As the contact state recognition moves forward, the contact states are modeled with uncertain parameters to accommodate the error in the contact model and to enhance the robustness to environmental uncertainties. Kalman filters (KF)[lefebvre2005online] and particle filters[gadeyne2005bayesian] have been utilized to estimate the geometrical parameters for better recognition of the contact state and state transitions. In addition, instead of determining the contact state by calculating the similarity, classification algorithms, such as support vector machine (SVM), have been applied to determine the contact state [2012contactsvm].
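As a minimal illustration of filtering an uncertain contact-model parameter, the sketch below runs a scalar Kalman filter on noisy measurements of a single constant geometrical parameter (for example, a lateral hole offset). The static state model, the noise variances and the prior are assumptions chosen for illustration only; the cited estimators handle multi-dimensional, nonlinear contact models.

```python
def kalman_constant(measurements, x0, p0, r, q=0.0):
    """Scalar Kalman filter for a (nearly) constant parameter.

    x0, p0: prior mean and variance of the parameter (assumed values).
    r: measurement noise variance; q: process noise variance.
    Returns the list of (estimate, variance) pairs after each update.
    """
    x, p = x0, p0
    history = []
    for z in measurements:
        p = p + q              # predict: the state is static, only variance grows
        k = p / (p + r)        # Kalman gain
        x = x + k * (z - x)    # correct the estimate with the innovation
        p = (1.0 - k) * p      # posterior variance
        history.append((x, p))
    return history


if __name__ == "__main__":
    # Hypothetical noisy readings of a hole offset in millimetres.
    readings = [0.52, 0.47, 0.55, 0.49, 0.51]
    for estimate, variance in kalman_constant(readings, x0=0.0, p0=1.0, r=0.01):
        print(round(estimate, 4), round(variance, 5))
```

The shrinking variance is what a recognition scheme could use to decide when the parameter estimate is reliable enough to commit to a contact state.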

In conclusion, the analytical model is sensitive to uncertainties (such as that in the position of parts, the rigidity or elasticity of the assembly system and the friction model), and no perfect model can be adapted automatically to different environments. The aforementioned methods with analytical models relying on force constraint analysis will become more complicated in assembly systems with uncertain mating features and changing jamming conditions. Consequently, only a partial model can be achieved, and the generalization to new assembly scenarios is limited. Another drawback is that the variables of an analytical model can only be determined based on the observed contact states and past transitions.

III-B Contact state recognition with statistical model

In contrast to recognition based on an analytical model, contact state recognition with a statistical model, which does not explicitly consider the possible uncertainties, is formulated as a classification problem given the possible contact states. The contact state can be classified through advanced statistical techniques such as fuzzy classifiers (FC), neural networks (NNs), SVM, Gaussian mixture models (GMMs) and hidden Markov models (HMMs).

Conventional fuzzy classifiers (CFC) have been applied in contact state recognition by accommodating the uncertainties based on prior knowledge without the geometrical information on the pegs[xiao1998contact][1998conceptfuzzy][2000fuzzyanalysis]. In these scenarios, the output contact state is decided through the following fuzzy if-then rules:

$$\text{IF } x_{i1} \text{ is } A_{1k} \text{ and } \dots \text{ and } x_{in} \text{ is } A_{nk},\ \text{THEN the contact state is } C_k,$$

where $x_{ij}$ denotes the $j$th component of the $i$th input observation signal, and $A_{jk}$ is the antecedent membership function of the $j$th input component for the $k$th contact state $C_k$. To enhance the robustness of the fuzzy system, the gravitational search (GS) algorithm is employed to tune the fuzzy rules of each model[2013lmsfuzzy]. CFC is able to solve the simple classification problem through the designed fuzzy logic controller with little computing time.
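A fuzzy if-then classifier of this kind can be sketched in a few lines. The example below is a minimal illustration rather than the formulation of the cited works: it assumes Gaussian membership functions, the minimum operator for the "and" connective, and a hypothetical rule base `RULES` over a two-component observation (contact force in N, contact moment in N·m).

```python
import math


def gaussian_membership(x, center, width):
    """Degree to which x belongs to a fuzzy set centred at `center`."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


def classify_contact_state(observation, rules):
    """Evaluate one rule per contact state and return the strongest one.

    rules maps a state name to a list of (center, width) pairs, one per
    input component. The firing strength of a rule is the minimum of its
    membership degrees (minimum t-norm for the fuzzy 'and').
    """
    strengths = {}
    for state, antecedents in rules.items():
        degrees = [gaussian_membership(x, c, w)
                   for x, (c, w) in zip(observation, antecedents)]
        strengths[state] = min(degrees)
    best = max(strengths, key=strengths.get)
    return best, strengths


# Hypothetical rule base: (force [N], moment [N*m]) antecedents per state.
RULES = {
    "no_contact": [(0.0, 1.0), (0.0, 0.05)],
    "one_point": [(5.0, 2.0), (0.3, 0.15)],
    "two_point": [(12.0, 3.0), (0.9, 0.3)],
}

if __name__ == "__main__":
    state, strengths = classify_contact_state((5.5, 0.35), RULES)
    print(state, strengths)
```

Tuning approaches such as the gravitational search mentioned above would, in this sketch, correspond to adjusting the centers and widths in `RULES`.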

NNs have been developed for a long time and were used to map the nonlinear relationship between the input force information and output contact states[nn2002analysis]. Compared to FC-based methods, the implementation of NNs is feasible without handcrafted extraction features and fuzzy rules. NNs have shown competitive classification performance in recent years with sufficient computing resources and samples. The main issue is that the trained classification model cannot be generalized to the scenarios with different dimensions of inputs due to the fixed network architecture. The performance compared to CFC was analyzed in [nn2002analysis], and both of them have advantages and disadvantages. Additionally, numerous studies focusing on integrating the flexibility of fuzzy set theory and the approximation ability of NNs have been performed[son2002optimal][son2001neuralfuzzy]. For training the classifiers with NNs, the input observed information, including measured wrench signals and pose data, usually requires preprocessing, such as normalization or uniform discretization.

SVM techniques, which reduce both the actual risk and the confidence interval for correct classification, have been demonstrated to be suitable and applicable for real-world recognition with generalization to unknown environments[2012contactsvm]. Previous work has proposed a practical contact state recognition framework, in which the input observations are processed through the discrete wavelet transform (DWT) and the contact states are acquired through existing analytical models. A fuzzy inference mechanism (FIM) with an adaptive classifier boundary generated by SVM was used to classify the contact states of the peg-in-hole assembly sequence[2014fuzzysvm].

GMMs have been employed to model the input observations, and Bayesian classification has been incorporated to estimate a binary classification of the given GMMs[jasim2014contact][jasim2017contact]. The expectation maximization (EM) algorithm has been demonstrated to be efficient in optimizing the parameters of the given GMMs. Jasim et al. utilized the distribution similarity measure (DSM) to determine the optimal number of GMM components based on the previous work[jasim2017contact], and this process significantly improves the modeling performance and reduces the computational cost of contact state modeling for flexible objects.

HMMs show advantages in recognizing both the contact states and state transitions over the previous contact state classification approaches[hovland1998hmm][lau2003hmm][hannaford1991hmm]. In this way, the contact state classification problems solved by HMMs are capable of taking the temporal information into account. A multiple contact model method incorporated into an HMM model to estimate the contact sequence was proposed in [debus2004hmm], and this model only requires partial observations, such as kinematic data, without other object information. Basically, the aforementioned contact state recognition approaches are supervised learning problems, which require considerable labeled samples and extensive training first.
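To show how an HMM uses temporal information to recover a contact sequence, the sketch below applies the Viterbi algorithm to a three-state model (no contact, one-point contact, two-point contact) with discretized force levels as observations. All probabilities are made-up illustrative values, not parameters from the cited studies, which learn them from labeled data.

```python
import math


def _log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence (log-space Viterbi)."""
    scores = [{s: _log(start_p[s]) + _log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor of state s given the scores so far.
            best_prev = max(
                states, key=lambda p: scores[-1][p] + _log(trans_p[p][s]))
            current[s] = (scores[-1][best_prev] + _log(trans_p[best_prev][s])
                          + _log(emit_p[s][obs]))
            pointers[s] = best_prev
        scores.append(current)
        backpointers.append(pointers)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical model parameters for illustration only.
STATES = ["no_contact", "one_point", "two_point"]
START = {"no_contact": 0.8, "one_point": 0.15, "two_point": 0.05}
TRANS = {
    "no_contact": {"no_contact": 0.7, "one_point": 0.25, "two_point": 0.05},
    "one_point": {"no_contact": 0.1, "one_point": 0.7, "two_point": 0.2},
    "two_point": {"no_contact": 0.05, "one_point": 0.15, "two_point": 0.8},
}
EMIT = {
    "no_contact": {"low": 0.8, "medium": 0.15, "high": 0.05},
    "one_point": {"low": 0.2, "medium": 0.6, "high": 0.2},
    "two_point": {"low": 0.05, "medium": 0.2, "high": 0.75},
}

if __name__ == "__main__":
    forces = ["low", "low", "medium", "medium", "high", "high"]
    print(viterbi(forces, STATES, START, TRANS, EMIT))
```

Unlike the frame-by-frame classifiers discussed earlier, the decoded sequence depends on the transition probabilities, so an isolated noisy observation is less likely to flip the estimated state.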

In contrast to parametric learning techniques, the random forest[cabras2016random] technique, which does not require the parameters to be determined in advance, was explored for multiple classifications. In addition, the binary stochastic gradient boosting (SGB)[cabras2010contact] classifier, based on its strength of classifier diversity, can perform contact state recognition. Jasim et al.[jasim2017contact] compared the success rate and computational time of several of the most frequently used classification techniques through an assembly experiment with flexible rubber objects. Based on the given results, we provide a comprehensive summary in Table II, covering the success rate, computational time and pros/cons of the introduced statistical techniques for contact state recognition.

In conclusion, for real-world robotic peg-in-hole assembly with a limited number of samples, NNs and FC-based methods cannot handle complicated contact model recognition well. In contrast, GMMs and SVM have shown high efficiency and better generalization for contact state classification. Furthermore, HMMs can cooperate with other classification algorithms to take the effect of state transitions into account.

III-C Compliant control

In contrast to the general assembly[lefebvre2005active][2012contactsvm], the contact model-based control strategies for peg-in-hole assembly depicted in Fig. 4 can be simplified into two modules: a high-level planning module and a low-level controller. The high-level planning module derives the high-level commands for the low-level controller based on the geometrical and environmental constraints. The low-level controller executes the assembly actions under these commands according to the observed wrench signals, the pose information and the current contact state.

III-C1 Low-level controller

At present, robots can easily meet point-to-point accuracy requirements, and position control has become quite mature[lopes2008force]. The low-level controllers of industrial robots generally fall into two categories: force-based impedance controllers and position-based force controllers. The force-based impedance controllers execute joint torque commands[ren2018learning], and the measured wrench signals are used to generate the desired torque value for the inner force loop. The position-based force controllers typically generate the desired position and orientation commands in the outer force loop; these commands are then executed by the inner position controller. Both of these controllers have strengths and weaknesses; however, direct access to actuator torque data is not available for most industrial robots. Consequently, position-based force controllers are widely utilized for industrial assembly control through a designed outer force controller[kuangenjamming][hou2018learning][tang2016autonomous].
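The structure of a position-based force controller can be sketched as follows. This is a simplified one-axis illustration under stated assumptions: the inner position loop is treated as ideal, the environment is a hypothetical linear spring, and the outer loop is a pure admittance law with an assumed gain. None of these values come from the cited controllers.

```python
def simulate_force_loop(f_desired, stiffness, admittance, dt, steps):
    """Outer force loop that sends position commands to an ideal inner position loop.

    Assumed environment: linear spring, force = stiffness * penetration.
    Returns the history of measured contact forces (N).
    """
    x_cmd = 0.0                      # commanded penetration along the insertion axis (m)
    forces = []
    for _ in range(steps):
        force = stiffness * max(x_cmd, 0.0)   # simulated F/T sensor reading
        error = f_desired - force
        # Outer loop: the force error becomes a position correction,
        # which the inner position controller is assumed to track exactly.
        x_cmd += admittance * error * dt
        forces.append(force)
    return forces
```

With `stiffness=1000.0` N/m, `admittance=0.01` m/(N s) and `dt=0.01` s, the force error shrinks by a factor of 0.9 per step, so the measured force settles at the desired value without any access to joint torques.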

To accommodate the environmental uncertainties of the assembly process, some researchers have focused on optimizing the parameters of the position-based force controller. A network-based adaptive fuzzy model guided by the contact state estimator has been proposed to acquire adaptive parameters for a force impedance controller[1998conceptfuzzy]. To optimize the outer proportional-integral-derivative (PID) force controller within a few trials for real-world peg-in-hole assembly, Hou et al.[hou2018learning] combined evolutionary algorithms (EAs) with support vector regression (SVR) to obtain the optimal PID parameters.

III-C2 High-level planning module

For peg-in-hole assembly, the high-level planning module generally generates the desired force and moment value for the low-level force controller according to geometrical constraints of mating parts[zhang2017force][kuangenjamming][hou2018learning]. In addition to the geometrical constraints, Qiao et al.[qiao2015concept] took environmental constraints into account based on the concept of ARIE, as shown in Fig. 4, and the position uncertainties were eliminated by coarse wrench signals. In [qiao2017iros] and [qiao2015arie], the constraint region in configuration space and physical space has been discussed, and a two-step insertion strategy and a human-inspired compliant strategy based on the ARIE concept have been verified in a broader range of peg-in-hole tasks with sensor-less systems. The environmental constraints based on ARIE can not only compensate for the limitations of the force sensors for high-precision assembly, but also provide the guarantees of safety and reliability in real-world assembly. The generality and robustness of the compliant control system have been improved with the assistance of environmental constraints.

Instead of optimizing the low-level controller directly, some researchers have made significant efforts to optimize the high-level planning module according to the contact model recognition results. Son[son2001neuralfuzzy] utilized fuzzy set theory to manage the uncertainties according to prior assembly knowledge. Additionally, a neural network was constructed to approximate the nonlinear relationship between the jamming analysis and the insertion control strategy. Xia et al.[xia2006dynamicanalysis] proposed a no-jamming and no-wedging assembly strategy by choosing the appropriate set of applied forces and moments based on the corresponding control law for different contact states. In [shirinzadeh2011hybrid], a hybrid methodology was proposed that selects the corresponding low-level controller according to the identified contact state. Tang et al.[tang2016autonomous] proposed an autonomous alignment method to correct the initial pose before the insertion phase based on the estimated contact state. As assembly strategies have become more advanced, Zhang et al.[kuangenjamming] established jamming diagrams based on contact state analysis for a complicated flexible dual peg-in-hole assembly. Then, jamming theory was applied to establish the parameters of the low-level PD force controller.
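The common pattern in these works, in which the recognized contact state selects the commands passed to the low-level controller, can be sketched as a simple lookup. The contact state names, setpoints and controller labels below are hypothetical placeholders; a real table would come from jamming analysis or the control laws of the cited studies.

```python
from typing import NamedTuple

class WrenchSetpoint(NamedTuple):
    force_z: float        # desired insertion force along the hole axis (N)
    max_lateral: float    # allowed lateral force magnitude (N)
    controller: str       # name of the low-level controller to activate

# Hypothetical plan: one setpoint per recognized contact state.
PLAN = {
    "free_space": WrenchSetpoint(0.0, 2.0, "position"),
    "one_point_contact": WrenchSetpoint(5.0, 3.0, "force_pd"),
    "two_point_contact": WrenchSetpoint(8.0, 1.0, "force_pd"),
    "jammed": WrenchSetpoint(-2.0, 0.5, "force_pd"),   # back off to release the jam
}
DEFAULT = WrenchSetpoint(0.0, 0.5, "position")          # conservative fallback

def plan_setpoint(contact_state: str) -> WrenchSetpoint:
    """High-level planner: map the estimated contact state to a low-level command."""
    return PLAN.get(contact_state, DEFAULT)
```

Keeping the plan as data rather than control flow makes it straightforward to replace the hand-written entries with values produced by an optimizer or a learned model.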

However, the high-level planning module of peg-in-hole assembly is sometimes neglected, and most studies commonly focus on contact model recognition and low-level control strategies[zhang2017force]. Furthermore, it remains unclear how best to optimize the high-level planning module and the low-level controller according to the contact model recognition. A flexible and adaptable assembly strategy should match the real-time uncertainties via a smooth integration of all the separate modules.

IV Contact model-free learning strategies

In contrast to the contact model-based control strategies dealing with the contact state recognition and compliant control separately, the contact model-free strategies combine these two steps together. As shown in Fig. 1, contact model-free strategies consist of two categories: LFD and LFE.

IV-A LFD

Compared to industrial robots, humans can perform peg-in-hole assembly with any degree of pose uncertainty owing to the flexibility of the wrist, the sensing system and intelligent decision-making ability. Instead of analyzing how humans accomplish the assembly tasks, many researchers focus on simulating the human assembly demonstrations directly and then transforming the skills into robot programs, which is referred to as LFD (also termed imitation learning and apprenticeship learning). For robotic peg-in-hole assembly, LFD methods consist of three principal phases: sensing, encoding and reproducing, which are depicted in Fig. 5[zhu2018robot][kyrarini2018robot].

Fig. 5: Learning from demonstrations strategies. (a) Framework of the LFD system. (b) Demonstration approaches.
LFD approach | Advantages | Disadvantages | References
DMP | Robust to spatial perturbation; solve multivariate data separately; learn from single demo | Delay and pause of motion | [2010DMP][park2008movement][paxton2015incremental]
GMR | Model joint probability density function; handle different source and missing data; offline learning and online fast regression | |
HMM | Handle partial demonstrations; handle temporal variability; handle periodic and reaching movements together; encode multivariate motion simultaneously | Stability sensitive to gains | [1996hmm][calinon2010learninghmm][calinon2010overview]
TABLE III: Comparison of learning from demonstrations approaches.

IV-A1 Sensing phase

This stage aims to interpret the human motion trajectories, including the observed states and executed actions. At present, the human demonstration data can be collected through kinesthetic demonstration, motion capture systems and teleoperated demonstration, which are shown in Fig. 5. In this paper, the states of the assembly process, including pose information and wrench signals, can be recorded through kinesthetic demonstration and motion capture systems, as shown in Fig. 5. The pose of the peg can be calculated from the robot joint encoders or determined by vision-based pose estimation approaches, such as extracted 2D boundary features, marker point matching[wan2017optimal] or 3D point cloud data processing[jasim2017contact]. The wrench signals are usually detected through external F/T sensors mounted on the end-effector or through joint torque sensors. The corresponding executed actions include translational-rotational offsets or velocities and applied forces acquired through torque sensors or external F/T sensors.

Additionally, to enhance the performance of interpretation, some data preprocessing techniques are applied to the raw data, such as principal component analysis (PCA) to reduce dimensionality and dynamic time warping (DTW)[song2016guidance] to temporally align the sample points from different demonstrations. To integrate different types of sensing information, a data fusion architecture[2004datafusion] based on artificial neural networks (ANNs) is used to combine the pose information and wrench signals, and Kalman filters (KFs) are utilized to minimize the effects of noise.
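As an illustration of the temporal alignment step, the following sketch computes the classic DTW cost between two one-dimensional demonstrations. It is the textbook dynamic programming recursion with an absolute-difference local cost, not the specific variant used in [song2016guidance]; real demonstrations would be multivariate.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences (local cost |x - y|)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Two demonstrations that follow the same path at different speeds, such as `[0, 1, 2, 3]` and `[0, 0, 1, 2, 2, 3]`, obtain a cost of zero, which is the property that makes DTW useful for aligning human demonstrations of different durations.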

IV-A2 Encoding phase

The encoding phase involves mapping the relations among the observed states and the executed actions. The mapping approaches have been developed in three main methodologies: dynamic movement primitives (DMPs), Gaussian mixture regression (GMR) and HMMs.

DMPs, as a nonlinear dynamic system, are utilized to model the discrete movements of the assembly trajectories with a sequence of specific goal positions. A second-order differential equation is employed to encode the desired movement primitives of assembly trajectories (positions, velocities and accelerations)[2010DMP]. The one component motion specified in joint or task space of the observed state is formulated as follows


where denotes the dimensions of the observed state; denotes the number of demonstration trajectories; denotes the length of one single demonstration; denotes the length of the trajectory; , and are the real-time positions, velocity and acceleration, respectively, of the trajectory at time step . Generally, the discrete movements and periodic movements can be represented as a first-order equation and a second-order differential equation, respectively, and can be rewritten in one manner as follows:


For discrete movements, the can be derived through



is a radial-basis function;

is a phase variable to guarantee tends to 0 as time increases. For the periodic movements, the can be derived through


where ; the phase variable moves with the constant speed ; is the amplitude of the oscillator; denotes the period of periodic movements or duration of the training movement; denotes the target position. Furthermore, denotes a nonlinear function representing the convergence property of the position towards the target value with the following two formulations, respectively. and are constant and set to ensure the convergence of the dynamic system represented by (3).
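For reference, the discrete and periodic formulations described above can be written in the notation that is common in the DMP literature. This is a sketch under the assumption that the survey follows the standard formulation; the symbols below (position $y$, scaled velocity $z$, goal $g$, start $y_0$, phase $x$ or $\phi$, amplitude $r$, time constant $\tau$, weights $w_i$, basis centers $c_i$ and widths $h_i$) may differ from those used by the authors.

```latex
% Discrete (point-to-point) DMP for one degree of freedom
\begin{align}
  \tau \dot{z} &= \alpha_z \bigl( \beta_z (g - y) - z \bigr) + f(x),
  \qquad \tau \dot{y} = z, \\
  \tau \dot{x} &= -\alpha_x x, \\
  f(x) &= \frac{\sum_{i=1}^{N} \psi_i(x)\, w_i}{\sum_{i=1}^{N} \psi_i(x)}\; x\,(g - y_0),
  \qquad \psi_i(x) = \exp\bigl( -h_i (x - c_i)^2 \bigr).
\end{align}

% Periodic DMP: the phase advances at constant speed and r is the amplitude
\begin{align}
  \tau \dot{\phi} &= 1, \\
  f(\phi, r) &= \frac{\sum_{i=1}^{N} \psi_i(\phi)\, w_i}{\sum_{i=1}^{N} \psi_i(\phi)}\; r,
  \qquad \psi_i(\phi) = \exp\bigl( h_i (\cos(\phi - c_i) - 1) \bigr).
\end{align}
```

In the discrete case the phase $x$ decays to zero, so the forcing term vanishes and the spring-damper part drives $y$ to the goal $g$; choosing $\beta_z = \alpha_z / 4$ makes that part critically damped.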

DMPs are defined by the parameters , and . can be set directly according to the samples; the duration of training movements can be chosen as ; and the parameter could be calculated from the solution of (4) in the recursive least square manner. For better regression performance, the multiple variables of DMPs are estimated in a separate process synchronized by the phase variable. For instance, locally weighted regression (LWR) with lower computational complexity is applied to synthesize the parameter and nonparametric Gaussian process regression (GPR) with high accuracy is applied to estimate and . DMPs-based approaches have been applied to reach a target or follow a periodic path by a set of mass-spring-damper mechanisms. In [abu2014solving] and [kramberger2017generalization], a complete methodology is proposed to learn from the human assembly demonstrations by combining DMPs to capture the trajectories of pegs with the force-torque profiles. Furthermore, the differential equation of DMPs has been improved to adapt to the uncertainty in the desired position and obstacle avoidance[park2008movement].
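A discrete DMP of this kind can be rolled out numerically with a few lines of code. The sketch below assumes the standard one-dimensional formulation from the DMP literature (spring-damper transformation system, exponentially decaying phase, normalized radial-basis forcing term) and integrates it with Euler steps; the gains, the time step and the function name are illustrative choices, and the weights are taken as given rather than learned by LWR or GPR.

```python
import math

def dmp_rollout(y0, goal, weights, centers, widths, tau=1.0, duration=1.0,
                alpha_z=25.0, beta_z=6.25, alpha_x=4.0, dt=0.001):
    """Integrate a one-dimensional discrete DMP and return the position trajectory."""
    y, z, x = y0, 0.0, 1.0           # position, scaled velocity, phase variable
    path = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for c, h in zip(centers, widths)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-12:
            # Normalized RBF mixture, gated by the phase and scaled by the amplitude.
            forcing = sum(p * w for p, w in zip(psi, weights)) / total * x * (goal - y0)
        z += dt / tau * (alpha_z * (beta_z * (goal - y) - z) + forcing)
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)   # the phase decays towards 0
        path.append(y)
    return path
```

With all weights set to zero the rollout reduces to a critically damped spring-damper system that converges to the goal; nonzero weights shape the transient while the decaying phase suppresses the forcing term over time.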

GMR is introduced to estimate the relation between the observed states and the control commands. GMR is a real-time regression solution that can reproduce the trajectories modeled by a GMM or modified GMM, and the reproduced trajectories can be adapted to control robot assembly tasks. In [tang2015learning] and [tang2016teach], GMR is employed to predict the velocities in a manner similar to the human in response to wrench signals; then, the output velocities are executed through a low-level controller (impedance controller) to realize the peg-in-hole insertion phase. To construct a heavy-weight component assembly process, Wan et al.[wan2017optimal] proposed a complete methodology through learning assembly skills from human demonstrations and compensating for the large deformation with GPR. The joint probability distribution is calculated with a mixture of Gaussian components weighted by as follows


where is the observed state as discussed above; denotes the assembly actions; and denotes the dimensions of the assembly actions. Each component features a Gaussian distribution with a mean of and covariance


The conditional probability is derived by a weighted summation of each as follows:


where is the marginal probability of input variable . Therefore, the parameters of can be estimated iteratively based on the collected demonstration training data by calculating maximum likelihood estimation through the EM algorithm. Then, with the learned Gaussian parameters, the optimal predicted output could be calculated by maximizing as follows


where the weights could be calculated by


Therefore, the GMR model can be learned offline, and the learned regression function calculates the expected actions rapidly online, which makes it appropriate for performing the assembly in real time.
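The online regression step can be sketched for the simplest case of a scalar state x and a scalar action u. The code assumes that the GMM over (x, u) has already been fitted offline and evaluates the standard GMR conditional mean, E[u | x] = sum_k h_k(x) (mu_u,k + cov_xu,k / var_x,k (x - mu_x,k)), where h_k(x) is the responsibility of component k. The tuple layout and the function name are illustrative assumptions.

```python
import math

def gmr_predict(x, components):
    """Conditional mean E[u | x] of a GMM over scalar input x and scalar output u.

    Each component is a tuple (weight, mean_x, mean_u, var_x, cov_xu).
    """
    resp, local = [], []
    for w, mx, mu, vx, cxu in components:
        dens = math.exp(-0.5 * (x - mx) ** 2 / vx) / math.sqrt(2.0 * math.pi * vx)
        resp.append(w * dens)                    # unnormalized responsibility h_k(x)
        local.append(mu + cxu / vx * (x - mx))   # linear regression of component k
    total = sum(resp)
    return sum(r / total * u for r, u in zip(resp, local))
```

Because the prediction only requires evaluating a few Gaussians and a weighted sum, its cost is independent of the number of demonstrations, which is what makes the offline/online split attractive for real-time assembly.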

HMMs have been extensively used to encode and generalize the observed assembly trajectories of humans due to their strength in handling spatial and temporal variability[1996hmm][calinon2010learninghmm][calinon2010overview]. HMMs are considered a type of dynamic Bayesian network and are employed to model the real state transitions in assembly processes (vision feedback and force-moments). An HMM generally includes five components: hidden state , observable state , initial state probability matrix , hidden state transition probability and observable state transition probability matrix . represents the probability of the state transition from to , and represents the probability of acquiring the observation at the state (always for real-world peg-in-hole assembly). The joint probability is encoded by an HMM with continuous states, and each state is encoded by GMR with mean and covariance . Therefore, the HMM can be defined by parameters , which can be learned by the EM algorithm[1996hmm]. In the original GMR approach in (10), the weight representing the importance of the different Gaussian components is constant; this weight is extended to in [calinon2010learninghmm] by recursively calculating a maximum likelihood represented by the HMM. The weight can be derived as


which takes the temporal influence of the dynamic assembly movements into account.
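One common way to obtain such time-dependent weights is the normalized HMM forward variable, which combines the previous weights, the transition probabilities and the Gaussian emission density of the current sample. The sketch below implements this recursion for one-dimensional Gaussian emissions; it is an illustration of the standard forward algorithm and is assumed, not confirmed, to correspond to the expression used in the survey.

```python
import math

def forward_weights(xs, start_p, trans_p, means, variances):
    """Normalized forward variables h[t][i] of an HMM with 1-D Gaussian emissions."""
    def pdf(x, m, v):
        return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)

    n = len(start_p)
    weights, prev = [], None
    for x in xs:
        if prev is None:
            alpha = [start_p[i] * pdf(x, means[i], variances[i]) for i in range(n)]
        else:
            # Predict with the transition matrix, then correct with the emission density.
            alpha = [sum(prev[j] * trans_p[j][i] for j in range(n))
                     * pdf(x, means[i], variances[i]) for i in range(n)]
        total = sum(alpha)
        prev = [a / total for a in alpha]   # rescaling keeps the ratios and avoids underflow
        weights.append(prev)
    return weights
```

Unlike the purely spatial GMR weights, these weights depend on the whole history of observations, so a sample that is ambiguous on its own is attributed to the state that is consistent with the preceding motion.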

In conclusion, different LFD encoding methods for robotic assembly have been compared in [zhu2018robot], [kyrarini2018robot], [calinon2010overview] and [calinon2010learninghmm], and our conclusions regarding the pros and cons of these three LFD encoding methods for peg-in-hole assembly are shown in Table III. A significant strength of DMPs is their adaptability to perturbations through a second-order system. GMMs can model the mapping function well through clustering and probability density estimation, with high robustness to environmental noise. HMMs, which encapsulate the precedence information in a state transition matrix, can perform imitation learning with partial demonstrations[calinon2010learninghmm]. Additionally, in contrast to DMPs, which require two different equations for periodic and discrete problems, HMMs exploit a unified formulation. To enhance the adaptation of learned assembly strategies, many researchers have investigated variants of the above modeling methods or have combined them. Modified GMMs combined with optimal control algorithms were proposed in [kyrarini2018robot]. GMMs combined with HMMs have been explored and have shown competitive performance for robotic assembly[calinon2010learninghmm].

IV-A3 Reproducing phase

After the demonstrations are encoded and the regression functions are optimized, the desired assembly actions are generated in the reproducing phase. The generalization of the learned assembly skills depends on the regression performance. Instead of generalizing the motions directly with statistical regression methods, such as LWR and locally weighted projection regression (LWPR), GMR derives the regression function from the joint probability density of the collected demonstration data. However, the existing LFD methods operate at the trajectory level, which makes them difficult to apply to more complicated assembly tasks with larger uncertainties. Additionally, the generalization to new circumstances and the robustness against perturbations require further improvement.

IV-B LFE

The development of highly intelligent control systems with the ability to learn skills autonomously has advanced considerably. RL is a promising direction that has been extensively used to solve challenges related to complicated contact-rich assembly tasks[rlhighpercision][zhiminfeedback][rl2018peginhole]. The core idea of RL-based strategies is that, given a high-level specification of what to do through the reward mechanism, the robot actively explores and learns the assembly policy instead of being guided explicitly through specific actions. Furthermore, the robots can achieve incremental learning by interacting with the environment through the smooth combination of the contact model recognition and compliant control processes. Recent advances in RL have achieved great success in solving robotic manipulation problems, especially in conjunction with deep neural networks (DNNs) for parameterizing policies and value functions. As shown in Fig. 6, RL approaches are generally divided into two main classes, model-free and model-based, according to whether a model of the dynamic transitions between the robot and the environment is learned. Additionally, the integration of model-based and model-free techniques has also drawn considerable attention in recent years.

Fig. 6: (a) Architecture of reinforcement learning algorithms. (b) Examples of reinforcement learning applied in robotic peg-in-hole assembly (Xu et al.[zhiminfeedback], Luo et al.[rl2018peginhole] and Fan et al.[2018modelfreelearning]).
Category | Advantages | Disadvantages | Methods
Model-free | No need for prior knowledge of environment; easy implementation | Less data efficiency; unstable and dangerous | DDPG; DQN (Q-learning)
Model-based | Fewer interactions with environments; fast convergence to optimal policy | Performance depends on transition model; divergence due to the bias of model | GPS; PILCO
TABLE IV: Comparison of reinforcement learning algorithms.

IV-B1 Model-free RL

Model-free RL methods aim to learn the optimal policy by exploring the state-action space directly, without estimating a dynamic model from the transitions[jan2013reinforcement]. The solutions to an RL problem can be divided into two method families: value-based methods and policy-based methods. A value-based method was first proposed in [gullapalli1992learning] to learn the value function through nonlinear function approximation (via NNs), with the discrete assembly action chosen in a greedy manner. A value-based learning algorithm that uses a long short-term memory (LSTM) network, a variant of the recurrent neural network (RNN), to estimate the value function was proposed to achieve peg-in-hole assembly with a precision exceeding the resolution of the robot[rlhighpercision]. Additionally, a learning framework for solving the real-world robotic assembly problem was proposed in which the output of the RL control system is used as the settings of the low-level position-based force controller instead of controlling the robot directly.
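The value-based idea can be illustrated with tabular Q-learning on a toy alignment problem. The sketch below is not taken from the cited works: the environment (a peg that moves left or right along a line of discrete offsets until it reaches the hole), the reward values and the hyperparameters are all assumptions, and a table replaces the neural network approximators used in practice.

```python
import random

ACTIONS = (-1, +1)   # move the peg one cell left or right

def q_learning(n_states=7, goal=3, episodes=300, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D peg alignment task."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)
        for _ in range(50):
            if s == goal:
                break
            # Epsilon-greedy selection over the discrete actions.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] >= q[s][1] else 1
            s_next = min(max(s + ACTIONS[a], 0), n_states - 1)
            reward = 10.0 if s_next == goal else -1.0
            bootstrap = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (reward + bootstrap - q[s][a])
            s = s_next
    return q

def greedy_policy(q):
    """Greedy action for every state of the learned table."""
    return [ACTIONS[0] if row[0] >= row[1] else ACTIONS[1] for row in q]
```

The agent never builds a model of how the peg moves; it only updates action values from sampled transitions, and the greedy policy extracted from the table moves the peg towards the hole from either side.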

The limitation of the value-based methods is that the output actions of the RL system can only be discrete and low-dimensional. Policy-based methods have been extensively explored for high-compliance robotic applications. An RL method with two components, an actor and a critic, was proposed in [nuttin1997learning]: the actor derives the assembly policy, and the critic evaluates the actions. As policy-based methods advanced, deterministic policy gradient (DPG) theory was derived in [silver2014deterministic] to achieve differential policy learning with high stability. Subsequently, deep deterministic policy gradient (DDPG) approaches[duan2016benchmarking] were developed by combining DPG with DNNs, and these approaches have been widely applied to high-compliance continuous action control applications[zhiminfeedback][ren2018learning].

Policy-based methods are appropriate for solving real-world problems with continuous and high-dimensional actions. However, the learning of the parameterized policy usually converges slowly, with a high degree of variance and instability. To date, some studies have been published on improving the stability and efficiency of the DDPG framework, which allows learning from different sample distributions in an off-policy scheme, for real-world robotic assembly. A model-driven DDPG algorithm was proposed to learn a general assembly policy for multiple peg-in-hole problems[zhiminfeedback]. As shown in Fig. 6, one contribution of the model-driven DDPG algorithm is that the learning of the actor network is driven by the basic actions from a simple but practical controller. Additionally, many studies have focused on incorporating prior knowledge to enhance the efficiency. A DDPG from demonstration (DDPGfD) method was proposed in [vecerik2017leveraging] that stores human demonstrations in the expert memory buffer, where they are reused by a prioritized replay mechanism to enhance policy learning. In contrast to providing a baseline policy for the robot, prior knowledge about the geometric information of the assembly parts was used in [2018learningfromCAD] to plan the motion trajectory that guides the policy learning. Basically, those authors focused on assembly motion planning with geometrical information from computer-aided design (CAD) and utilized the RL algorithm to handle the dynamics of the environment.
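The role of demonstrations in such a replay mechanism can be sketched with a small buffer in which demonstration transitions are never evicted and receive a priority bonus during sampling. This is a simplified illustration of the idea only: the class name, the fixed priorities and the bonus value are assumptions, and the priority updates from temporal difference errors used in practice are omitted.

```python
import random

class DemoReplayBuffer:
    """Replay buffer that keeps demonstrations permanently and samples by priority."""

    def __init__(self, capacity, demo_bonus=1.0, seed=0):
        self.capacity = capacity      # maximum number of agent transitions
        self.demo_bonus = demo_bonus  # extra priority given to demonstrations
        self.demos = []               # (transition, priority), never evicted
        self.agent = []               # (transition, priority), oldest evicted first
        self.rng = random.Random(seed)

    def add_demo(self, transition, priority=1.0):
        self.demos.append((transition, priority + self.demo_bonus))

    def add(self, transition, priority=1.0):
        self.agent.append((transition, priority))
        if len(self.agent) > self.capacity:
            self.agent.pop(0)

    def sample(self, batch_size):
        """Draw transitions with probability proportional to their priority."""
        items = self.demos + self.agent
        weights = [p for _, p in items]
        picked = self.rng.choices(items, weights=weights, k=batch_size)
        return [t for t, _ in picked]
```

Each minibatch therefore mixes expert and self-collected transitions, so the critic keeps seeing successful insertions even while the agent's own early experience consists mostly of failures.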

IV-B2 Model-based RL

In contrast to the typical model-free RL methods, model-based methods aim to learn a dynamic model from the stored transitions, and the policies are optimized by deriving the rewards and next states from the learned model[polydoros2017survey]. For complicated manipulation tasks, policy search methods that derive the optimal policies by interacting directly with the learned dynamic model have shown faster convergence. Guided policy search (GPS) has been developed to learn several manipulation behaviors, as shown in Fig. 6; this method combines a trajectory optimization component and a neural network policy learning component[levine2015learning]. Luo et al. proposed mirror descent GPS (MDGPS) to tackle a complicated assembly task with rigid pegs and deformable holes for use with noncompliant robots and external F/T sensors[rl2018peginhole]. The probabilistic inference for learning control (PILCO)[deisenroth2011pilco] framework employs a Gaussian process to model the transition dynamics and a linear function to represent the policy, and this framework is a state-of-the-art model-based RL algorithm in terms of sample efficiency and time efficiency.
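The two stages of a model-based method, learning a transition model from stored transitions and then using that model to choose actions, can be sketched for a scalar system. For brevity a linear least-squares model stands in for the Gaussian process or neural network models of the cited works, and a one-step planner stands in for trajectory optimization; both substitutions and all names are assumptions made for illustration.

```python
def fit_linear_dynamics(transitions):
    """Least-squares fit of x_next = a * x + b * u from (x, u, x_next) samples."""
    sxx = sum(x * x for x, u, y in transitions)
    suu = sum(u * u for x, u, y in transitions)
    sxu = sum(x * u for x, u, y in transitions)
    sxy = sum(x * y for x, u, y in transitions)
    suy = sum(u * y for x, u, y in transitions)
    det = sxx * suu - sxu * sxu          # determinant of the 2x2 normal equations
    a = (sxy * suu - sxu * suy) / det
    b = (sxx * suy - sxu * sxy) / det
    return a, b

def one_step_action(x, target, a, b):
    """Action that drives the learned model from x to the target in one step."""
    return (target - a * x) / b
```

Once the model is fitted, candidate actions can be evaluated without touching the robot, which is the source of the sample efficiency of model-based methods; the price is that any error in the fitted coefficients is inherited by the planned actions.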

Consequently, model-based RL methods only need to explore a narrower space than the model-free methods, resulting in faster convergence with fewer interactions with the environment. However, the performance of model-based RL methods heavily depends on the accuracy of the learned transition dynamic model. Polydoros and Nalpantidis gave an up-to-date overview of model-based RL algorithms and the related robotic applications in [polydoros2017survey]. We summarize the pros and cons of the model-free and model-based RL algorithms for robotic peg-in-hole assembly in Table IV.

IV-B3 Integration of model-based and model-free methods

Both model-free and model-based RL methods have advantages and disadvantages, as summarized in [polydoros2017survey]. Model-free RL methods can solve complicated assembly problems with a general and easy implementation but are less sample efficient. DDPG-based model-free algorithms can provide more stable policies and attain the asymptotic performance in some assembly tasks that exceeds the performance of nonsmooth dynamics models. Model-based RL methods are able to enhance policy learning by utilizing rich transition information. Additionally, model-based optimal controllers constrain the exploration space to a safe region but often cannot consistently achieve good convergence performance due to a large model bias. In [2018modelfreelearning], Fan et al. analyzed a model-based RL method (GPS) and a model-free RL method (DDPG) and then proposed a more efficient framework by combining model-based optimal control strategies with a model-free actor-critic learning algorithm, as shown in Fig. 6.

The integration of the strengths of model-based and model-free RL methods has been studied for decades. Most of the efforts have focused on smoothing the transition from model learning to policy learning and obtaining more useful information from sample transitions. In [pong2018temporal], the authors introduced a novel strategy called the temporal difference model (TDM) by training a goal-conditional value function with a specific choice of reward and horizon prediction. This model made the robots consider not only reaching a goal state as optimally as possible but also as easily as possible. The TDM conditions were extended in a multistep model study[venkatraman2016multimodel] not only to predict a sequence of future states but also to reach a possible goal state, based on the general value functions idea in [sutton2011horde], by learning rich contextual value functions from one single experience dataset. Additionally, some researchers have focused on exploring how to make full use of the learned dynamic model, beyond the commonly used Dyna architecture[sutton1991dyna] and GPS-based methods[levine2015learning], in order to simulate the entire trajectory in every iteration.

V Discussion and conclusion

We have surveyed the remarkable work on robotic peg-in-hole assembly processes and have provided a comparison of different strategies, summarized in Table V. Both contact model-based and contact model-free strategies can achieve distinguished performance in some special scenarios. In summary, contact model-based conventional controllers and LFD methods guarantee safety and efficiency and are suitable for special assembly scenarios after adjustment through preprogramming. LFE algorithms based on RL are promising for actively and flexibly performing a broad range of complicated assembly processes. Similar to human decision-making systems, which work without tedious programming and rules, RL-based algorithms can remove the task-specific engineering of the feedback controller, and they can naturally solve assembly problems with large environmental uncertainties and generalize to new situations.

Category Contact model-based Contact model-free
TABLE V: Conclusion for Robotic peg-in-hole assembly strategies.

It is clear that it is not possible for robots to perform the peg-in-hole assembly as flexibly as human beings based solely on any single strategy. Although RL-based contact model-free algorithms have attracted more attention than contact model-based algorithms and LFD methods, RL is not the main component for deriving an assembly strategy with sufficient robustness and flexibility to solve all the robotic peg-in-hole assembly problems. Furthermore, typical model-free RL-based methods are still not well suited to robotic problems. Consequently, we highlight a couple of open questions in the field of robotic peg-in-hole assembly and propose some potential directions for future research.

V-A Open questions in the field of robotic peg-in-hole assembly

V-A1 How can the active compliant control strategies cooperating with passive compliant mechanisms be improved?

With the development of sensing hardware and robotic perception techniques, active compliant control strategies have been extensively explored for robotic peg-in-hole assembly. In addition, high-compliance robots have also been employed to develop complicated assembly systems with a simple compliant strategy. Improvements in both active compliant control strategies and passive compliant mechanisms can promote assembly research. For a peg-in-hole assembly, large position or force uncertainties can be accommodated by an active compliant control strategy, while smaller uncertainties can be eliminated by improving the compliance of the mechanism instead of optimizing the parameters of the active controller. Therefore, the combination of active compliant control strategies and passive devices still requires more attention in order to decide when to optimize the compliant control strategy and when to modify the passive mechanism.

V-A2 How can effective and incremental demonstration learning be realized?

LFD methods provide a solution for performing robotic peg-in-hole assembly without handcrafted preprogramming according to contact model recognition. Although it is challenging to collect demonstration experiences, it is an essential task for improving not only data efficiency but also the adaptation and generalization of the learned assembly policy. For instance, in an attempt to solve this challenge, DMPs were used as fundamental blocks with RL to learn advanced skills[rldmp2011reinforcement]. RL is commonly used to obtain the adaptive parameters for robust results. Additionally, to improve the efficiency, better feature extraction methods are required to select better demonstrations and omit undesirable information.

V-A3 How can model-based and model-free RL algorithms be combined?

It is clear that the integration of model-based and model-free RL algorithms is a promising solution for promoting RL-based strategies in robotic peg-in-hole assembly, but this issue raises two key questions: how can a perfect dynamic model be learned, and how can robots be made to balance learning from the transition model and learning directly from the environment?

A good transition model representing the dynamics of the environment allows the robots to have a true understanding of the environment, which ensures that the optimal policy can be chosen accurately based on the model. In specific real-world robotic problems, the environment has been modeled both as a physics-based model and as a statistical model learned from experience data, including deterministic models and stochastic models. In the learning process, the estimation of the statistical model can be considered a supervised learning problem. Deep learning has achieved major advances in function approximation, but a low sample efficiency still limits the performance in real-world scenarios. Therefore, one question is how to consider the environmental uncertainties or the existing physics model in transition model learning. Additionally, the transition model can be extended by taking prior domain knowledge, such as expert experience, into account.

As shown in Fig. 6, the robots decide when to interact with the transition model, and the degree of confidence in the transition model greatly affects the quality of the learned assembly skills. Therefore, scalable methods for effectively planning based on the given transition model are still required, in addition to the Dyna architecture[sutton1991dyna] and GPS-based methods [levine2015learning].

V-B Potential future work

To combine the strengths of contact model-based and contact model-free learning algorithms, we propose the following directions to explore the possible solutions in the field of robotic peg-in-hole assembly.

V-B1 Incorporate the knowledge representation method into contact model recognition and transition model learning

Contact model recognition and transition model learning still require better feature extraction methods for the assembly environment. A promising solution is a knowledge representation method based on general value functions (GVFs), which were proposed to represent an agent's understanding of its environment by learning simple auxiliary tasks given some prior knowledge.

V-B2 Incorporate prior knowledge into the learning process

Prior knowledge can be an existing control law, as in [zhiminfeedback], or can be acquired by learning from expert demonstrations. Additionally, predictions about the environment can be learned as GVFs, which can also be treated as prior knowledge to incorporate into the learning process. For instance, prior knowledge can be used to improve the balance between the model-based and model-free RL strategies.

V-B3 Incorporate a physical model into reward shaping for RL-based algorithms

Solving robotic peg-in-hole assembly problems with RL-based algorithms holds great promise. However, many real-world problems are difficult to express as reward signals, and unpredictable exploration and dangerous actions need to be reduced. How well the designed reward mechanism shapes the assembly problem affects the quality and efficiency of learning. For instance, Xu et al. [zhiminfeedback] investigated a fuzzy reward system to take more prior knowledge into account. Inverse RL has been used to derive rewards from observed expert behavior, thereby exploiting human knowledge [abbeel2004apprenticeship]. Likewise, a physical model, including geometric information on the parts and a mature friction model, can serve as an implicit constraint on the design of the reward mechanism. Additionally, instead of using constant reward signals, the reward mechanism can be updated by evaluating a high-level objective function that reflects the designer's final goal, so that the robots receive different evaluation feedback at different stages.
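One possible way to let geometric information constrain the reward is potential-based reward shaping, a standard technique that is swapped in here for illustration and is not the fuzzy reward system of [zhiminfeedback]. The sketch below assumes a hypothetical state made of lateral error, insertion depth and contact force. The potential changes form between a search stage and an insertion stage, so the feedback differs by stage, and a separate penalty discourages excessive contact force. All constants and names are assumptions.

```python
from dataclasses import dataclass

HOLE_DEPTH_MM = 20.0     # hypothetical part geometry
SEARCH_RANGE_MM = 10.0   # hypothetical maximum lateral error in the search stage
FORCE_LIMIT_N = 30.0     # hypothetical safety threshold


@dataclass
class PegState:
    lateral_error_mm: float  # distance between peg axis and hole axis
    depth_mm: float          # insertion depth, 0 before the peg enters the hole
    contact_force_n: float   # magnitude of the measured contact force


def potential(state):
    """Geometry-based potential with a different form for each stage."""
    if state.depth_mm <= 0.0:
        # Search stage: potential in [-1, 0], higher when better aligned.
        return -min(state.lateral_error_mm, SEARCH_RANGE_MM) / SEARCH_RANGE_MM
    # Insertion stage: potential in (0, 1], higher when deeper.
    return min(state.depth_mm, HOLE_DEPTH_MM) / HOLE_DEPTH_MM


def shaped_reward(prev, cur, gamma=0.99):
    """Sparse task reward plus shaping term plus a force penalty."""
    base = 1.0 if cur.depth_mm >= HOLE_DEPTH_MM else 0.0
    shaping = gamma * potential(cur) - potential(prev)
    # The penalty is not potential-based, so it can change the optimal policy.
    penalty = -1.0 if cur.contact_force_n > FORCE_LIMIT_N else 0.0
    return base + shaping + penalty


forward = shaped_reward(PegState(0.0, 5.0, 5.0), PegState(0.0, 10.0, 5.0))
backward = shaped_reward(PegState(0.0, 10.0, 5.0), PegState(0.0, 5.0, 5.0))
unsafe = shaped_reward(PegState(0.0, 5.0, 5.0), PegState(0.0, 10.0, 40.0))
```

The shaping term alone leaves the optimal policy unchanged, which is the main reason for choosing the potential-based form. A friction model could refine the potential, for example by lowering it in configurations where jamming is predicted.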


The authors would like to thank…