Deep Reinforcement Learning for High Precision Assembly Tasks

08/14/2017 ∙ by Tadanobu Inoue, et al. ∙ IBM ∙ YASKAWA ELECTRIC CORPORATION

High precision assembly of mechanical parts requires accuracy exceeding the robot precision. Conventional part mating methods used in current manufacturing require tedious tuning of numerous parameters before deployment. We show how a robot can successfully perform a tight clearance peg-in-hole task by training a recurrent neural network with reinforcement learning. In addition to saving manual effort, the proposed technique also shows robustness against position and angle errors for the peg-in-hole task. The neural network learns to take the optimal action by observing the robot sensors to estimate the system state. The advantages of our proposed method are validated experimentally on a 7-axis articulated robot arm.




I Introduction

Industrial robots are increasingly being installed in various industries to handle advanced manufacturing and high precision assembly tasks. The classical programming method is to teach the robot to perform industrial assembly tasks by defining key positions and motions using a control box called a “teach pendant”. This on-line programming method is usually tedious and time-consuming. Even after programming, it takes a long time to tune the parameters for deploying the robot to a new factory line due to environment variations.

Another common method is off-line programming or simulation. This method can reduce downtime of actual robots, but it may take longer overall than on-line programming once the time for developing the simulation and testing on the robot is included. It is quite hard to represent the real world, including environment variations, with 100% accuracy in the simulation model. Therefore, this off-line method is not sufficient for some industrial applications such as precision machining and flexible material handling, where the required precision is higher than the robot accuracy.

In this paper, we propose a skill acquisition approach where the low accuracy of conventional programming methods is compensated by a learning method without parameter tuning. Using this approach, the robot learns a high precision fitting task using sensor feedback without explicit teaching.

For such systems, reinforcement learning (RL) algorithms can be utilized to enable a robot to learn new skills through trial and error using a process that mimics the way humans learn [1]. The abstract level concept is shown in Fig. 1. Recent studies have shown the importance of RL for robotic grasping tasks using cameras and encoders [2][3], but none of these methods can be applied directly to high precision industrial applications.

Fig. 1: Robot learns new skills using deep reinforcement learning

To show the effectiveness of this approach, we focus on learning a tight-clearance cylindrical peg-in-hole task, a benchmark problem for force-controlled robotic assembly. The precision required to perform this task exceeds the robot accuracy. In addition to the tight clearance, the hole can be tilted in either direction, which further adds to the problem difficulty. Instead of using super-precise force-torque sensors or cameras, we rely on the common force and position sensors that are ubiquitous in industrial robots. To learn the peg-in-hole task, we use a recurrent neural network, namely Long Short Term Memory (LSTM), trained using reinforcement learning.

The rest of the paper is organized as follows. Section II explains the problem. Details of our proposed method are described in Section III. Quantitative analysis of the method on a real robot is presented in Section IV. Finally, we conclude the paper in Section V with some directions for future work.

II Problem Formulation

A high-precision cylindrical peg-in-hole is chosen as our target task for the force-controlled robotic assembly. This task can be broadly divided into two main phases [4]:

  • Search: the robot places the peg center within the clearance region of the hole center

  • Insertion: the robot adjusts the orientation of the peg with respect to the hole orientation and pushes the peg to the desired position

In this paper, we study and learn these two phases separately.

II-A Search Phase

Although industrial robots have reached a good level of accuracy, it is difficult to align the peg and hole to within a few tens of μm by using a position controller. Visual servoing is also impractical due to the limited resolution of cameras or because internal parts are occluded during assembly, for example when meshing gears and splines in a transmission. In this paper, we use a common 6-axis force-torque sensor to learn the hole location with respect to the peg position.

Newman et al. [5] calculate the moments from sensors and interpret the current position of the peg by mapping the moments onto positions. Sharma et al. [4] utilize the depth profile in addition to roll and pitch data to interpret the current position of the peg. Although these approaches are demonstrated to work in simulation, it is difficult to generalize them to real world scenarios. In the real case, it is very difficult to obtain a precise model of the physical interaction between two objects and calculate the moments caused by the contact forces and friction [6].

II-B Insertion Phase

The insertion phase has been extensively researched. Gullapalli et al. [7] use associative reinforcement learning methods for learning the robot control. Majors and Richards [8] use a neural network based approach. Kim et al. [9] propose an insertion algorithm that can recover from tilted mode without resetting the task to the initial state. Tang et al. [10] propose an autonomous alignment method, based on a three-point contact model, that uses force and moment measurements before the insertion phase.

Compared to these previous works, we insert a peg into a hole with a very small clearance of . This high precision insertion is extremely difficult even for humans. This is due to the fact that humans cannot be so precise and the peg usually gets stuck in the very initial stage of insertion. It is also very difficult for the robot to perform an insertion with clearance tighter than its position accuracy. Therefore, robots need to learn in order to perform this precise insertion task using the force-torque sensor information.

III Reinforcement Learning with Long Short Term Memory

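In this section, we explain the RL algorithm to learn the peg-in-hole task (Fig. 2). The RL agent observes the current state $s$ of the system defined as:

$$s = [F_x, F_y, F_z, M_x, M_y, \tilde{P}_x, \tilde{P}_y] \tag{1}$$

where $F$ and $M$ are the average force and moment obtained from the force-torque sensor, the subscript $x$, $y$, $z$ denotes the axis, and $\tilde{P}_x$, $\tilde{P}_y$ are the rounded peg position values described below.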

Fig. 2: Reinforcement learning with LSTM

The peg position $P$ is calculated by applying forward kinematics to joint angles measured by the robot encoders. During learning, we assume that the hole is not set to the precise position and that it has position errors. This adds robustness against position errors that may occur during inference. To satisfy this assumption, we calculate the rounded values $\tilde{P}_x$ and $\tilde{P}_y$ of the position data $P_x$ and $P_y$ using the grid shown in Fig. 3. Instead of the origin (0, 0), the center of the hole can be located at $-c < x < c$, $-c < y < c$, where $c$ is the margin for the position error. Therefore, when the value is in $(-c, c)$, it will be rounded to 0. Similarly, when the value is in $[c, 2c)$, it will be rounded to $c$, and so on. This gives auxiliary information to the network to accelerate the learning convergence.

Fig. 3: Position data rounded to grid size
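The grid rounding can be illustrated with a minimal sketch. It assumes that rounding means truncating toward zero onto multiples of the margin, so that values inside the band around the hole center map to 0 and the next band maps to one margin; the symmetric treatment of negative values, the function name `round_to_grid`, and the numeric values are assumptions made for illustration only.

```python
import math

def round_to_grid(value, margin):
    """Truncate toward zero onto multiples of `margin` (assumed rounding rule)."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    return math.trunc(value / margin) * margin

# Illustrative values only; the margin is a free parameter here.
rounded_center = round_to_grid(0.4, 1.0)     # inside the central band
rounded_next = round_to_grid(1.7, 1.0)       # inside the next band outward
rounded_negative = round_to_grid(-1.7, 1.0)  # mirrored band (assumption)
```

With this rule every position inside the central band looks identical to the network, which is one way to read the claim that the agent becomes insensitive to small hole position errors.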

The machine learning agent generates an action $a$ to the robot control defined as:

$$a = [F^d_x, F^d_y, F^d_z, R^d_x, R^d_y] \tag{2}$$

where $F^d$ is the desired force and $R^d$ is the desired peg rotation given as input to the hybrid position/force controller of the manipulator. Each component of the vector $a$ is an elementary movement of the peg described in Fig. 4. An action is defined as a combination of one or more elementary movements.

Fig. 4: Elementary movement: (a) Force movements (b) Rotation movements

The RL algorithm starts with a random exploration of the solution space to generate random actions $a$. By increasing exploitation and reducing exploration over time, the RL algorithm strives to maximize the cumulative reward:

$$R_k = r_k + \gamma r_{k+1} + \gamma^2 r_{k+2} + \dots + \gamma^{n-k} r_n = r_k + \gamma R_{k+1} \tag{3}$$

where $\gamma$ is the discount factor, $r$ is the current reward assigned to each action, and $k$ is the step number. In the proposed technique, we only compute one reward at the end of each episode. If the trial succeeds, the following positive reward $r$ is provided to the network:

$$r = 1.0 - \frac{k}{k_{max}} \tag{4}$$

where $k_{max}$ is the maximum number of steps in one episode, $k \in [0, k_{max})$.

As we can see from Eq. (4), the target of the learning is to successfully perform the task in the minimum number of steps. If we cannot finish the task in $k_{max}$ steps, the distance between the starting point and the final position of the peg is used to compute the penalty. The penalty is different for the search phase and the insertion phase. For the search phase, the penalty or negative reward is defined as:

$$r = \begin{cases} 0 & (d \le d_0) \\ -\dfrac{d - d_0}{D - d_0} & (d > d_0) \end{cases} \tag{5}$$

where $d$ is the distance between the target and the peg location at the end of the episode, $d_0$ is the initial position of the peg, and $D$ is the safe boundary. For the insertion phase, the penalty is defined by:

$$r = -\frac{Z - z}{Z} \tag{6}$$

where $Z$ is the insertion goal depth and $z$ is the downward displacement from the initial peg position in the vertical direction.

The reward is designed to stay within the range of $-1 \le r < 1$. The maximum reward is less than 1 because we cannot finish the task in zero steps. The episode is interrupted with reward $-1$ if the distance between the peg position and the goal position is bigger than $D$ in the search phase. In the insertion phase, the reward becomes the minimum value $-1$ when the peg is stuck at the entry point of the hole.
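The end-of-episode reward described above can be sketched as three small functions. This is an assumed reading rather than the authors' implementation: it takes the success reward to decrease linearly with the number of steps used, the search penalty to be normalized by the safe boundary, and the insertion penalty to be normalized by the goal depth. All function and parameter names are hypothetical, and the clamp at -1 is added here only to mirror the stated reward range.

```python
def success_reward(steps, max_steps):
    """Positive reward on success: fewer steps give a larger reward."""
    return 1.0 - steps / max_steps

def search_penalty(final_dist, initial_dist, safe_boundary):
    """Search-phase penalty on timeout, scaled to the range [-1, 0]."""
    if final_dist <= initial_dist:
        return 0.0  # the peg did not drift away from the target
    penalty = -(final_dist - initial_dist) / (safe_boundary - initial_dist)
    return max(-1.0, penalty)  # clamp added for this sketch

def insertion_penalty(displacement, goal_depth):
    """Insertion-phase penalty on timeout: -1 when stuck at the hole entry."""
    return -(goal_depth - displacement) / goal_depth
```

Under these assumptions a search episode that ends exactly on the safe boundary and an insertion episode with zero downward displacement both receive the minimum reward of -1.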

To maximize the cumulative reward of Eq. (3), we use a variant of reinforcement learning called the Q-learning algorithm. At every state, the RL agent learns to select the best possible action. This is represented by a policy $\pi(s)$:

$$\pi(s) = \arg\max_a Q(s, a) \tag{7}$$

In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. In Q-learning, we can approximate the table update by the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \tag{8}$$

where $s'$ and $a'$ are the next state and action respectively.
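For the tabular case, the greedy policy and the Q-learning update can be sketched in a few lines. This is the textbook form of the algorithm, not code from the paper; the dictionary-based table, the state labels, and the values of the learning rate and discount factor are illustrative assumptions.

```python
from collections import defaultdict

def greedy_action(q_table, state, num_actions):
    """Pick the action with the largest Q-value in the given state."""
    return max(range(num_actions), key=lambda a: q_table[(state, a)])

def q_update(q_table, state, action, reward, next_state, num_actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value."""
    best_next = max(q_table[(next_state, a)] for a in range(num_actions))
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]

# Tiny worked example with two states and two actions (illustrative values).
q = defaultdict(float)           # unseen (state, action) pairs default to 0.0
q[("s1", 1)] = 2.0
new_value = q_update(q, "s0", 0, reward=1.0, next_state="s1", num_actions=2)
```

In the example the updated entry moves halfway from 0 toward the target 1.0 + 0.9 × 2.0 = 2.8.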

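As the state space is too big, we train a deep recurrent neural network to approximate the Q-table. The neural network parameters $\theta$ are updated by the following equation:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L_\theta \tag{9}$$

where $\alpha$ is the learning rate, $\nabla$ denotes the gradient function, and $L_\theta$ is the loss function:

$$L_\theta = \frac{1}{2} \left[ \text{target} - \text{prediction} \right]^2 = \frac{1}{2} \left[ r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right]^2 \tag{10}$$

Using the Q-learning update equation, the parameters update equation can be written as:

$$\theta \leftarrow \theta + \alpha \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right) \nabla_\theta Q_\theta(s, a) \tag{11}$$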
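The step from the gradient descent rule and the squared loss to the parameter update of Eq. (11) can be made explicit with a short derivation. It assumes, as is common in Q-learning with function approximation, that the target term is held fixed when differentiating (a semi-gradient); the symbol $y$ is introduced only for this sketch.

```latex
\begin{aligned}
y &= r + \gamma \max_{a'} Q_\theta(s', a')
   \quad \text{(treated as a constant with respect to } \theta\text{)} \\
L_\theta &= \tfrac{1}{2}\,\bigl[\, y - Q_\theta(s, a) \,\bigr]^2 \\
\nabla_\theta L_\theta &= -\bigl(\, y - Q_\theta(s, a) \,\bigr)\, \nabla_\theta Q_\theta(s, a) \\
\theta &\leftarrow \theta - \alpha \nabla_\theta L_\theta
        = \theta + \alpha \bigl(\, y - Q_\theta(s, a) \,\bigr)\, \nabla_\theta Q_\theta(s, a)
\end{aligned}
```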


As shown in [11], we store the agent's experiences from previous episodes in a memory pool of bounded maximum size, managed in a FIFO manner (Algorithm 1). Random sampling from this pool provides replay events that supply diverse and decorrelated data for training.
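A bounded FIFO pool with uniform random sampling can be sketched with the Python standard library. The class name `ReplayMemory`, its methods, and the tiny capacity used in the example are assumptions for illustration and are not taken from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """FIFO experience pool with uniform random sampling."""

    def __init__(self, max_size):
        # deque(maxlen=...) drops the oldest entry once the pool is full
        self.pool = deque(maxlen=max_size)

    def store(self, state, action, reward, next_state):
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Sampling without replacement breaks up runs of consecutive steps
        return rng.sample(list(self.pool), min(batch_size, len(self.pool)))

    def __len__(self):
        return len(self.pool)

# Store five transitions in a pool that holds three (illustrative sizes).
memory = ReplayMemory(max_size=3)
for step in range(5):
    memory.store(step, 0, 0.0, step + 1)
oldest_state = memory.pool[0][0]
```

After the loop only the three most recent transitions remain, so the oldest surviving state is the one stored at step 2.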

When learning on a real robot, it is difficult to collect the data and perform the learning offline. The robot is in the loop and the reinforcement learning keeps improving the performance of the robot over time. In order to efficiently perform the data collection and learning, the proposed algorithm uses two threads, an action thread and a learning thread. Algorithm 1 shows the pseudo code of the action thread. The episode ends when the phase finishes successfully, the maximum number of allowed steps $k_{max}$ is exceeded, or a safety violation occurs (i.e. going outside the safe zone $D$). The action thread stores each observation in the replay memory and outputs the action based on the neural network decision. Algorithm 2 shows the learning thread, which updates the neural network by learning from the replay memory.

  Initialize replay memory pool to its maximum size
  for episode = 1 to M do
     Copy latest network weights $\theta$ from learning thread
     Initialize the start state $s$
     while NOT EpisodeEnd do
        With probability $\epsilon$ select a random action $a$, otherwise select $a = \arg\max_a Q_\theta(s, a)$
        Execute action $a$ by robot and observe reward $r$ and next state $s'$
        Store $(s, a, r, s')$ in the replay memory pool
     end while
  end for
  Send a termination signal to the learning thread
Algorithm 1 Action thread
  Initialize the learning network with random weights
  repeat
     if current episode is greater than the learning start threshold then
        Sample random minibatch of data $(s, a, r, s')$ from the replay memory pool
        Set target = $r + \gamma \max_{a'} Q_\theta(s', a')$
        Set prediction = $Q_\theta(s, a)$
        Update the learning network weight using Eq. (11).
     end if
  until Receive a termination signal from the action thread
Algorithm 2 Learning thread
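The interplay of the two threads can be sketched with Python's `threading` module. This is a toy stand-in, not the authors' system: the robot step is a dummy, a short list of action values replaces the network weights, a simple averaging update replaces the gradient step, and the learner starts once enough samples exist rather than after a given episode count. All names and sizes are hypothetical.

```python
import random
import threading
from collections import deque

def run(num_episodes, max_steps, batch_size, epsilon=0.5, num_actions=4):
    memory = deque(maxlen=1000)        # FIFO replay pool (illustrative size)
    q_values = [0.0] * num_actions     # stand-in for the network weights
    lock = threading.Lock()            # guards memory and q_values
    done = threading.Event()           # termination signal for the learner
    update_count = [0]

    def action_thread():
        try:
            for _ in range(num_episodes):
                with lock:
                    local_q = list(q_values)          # copy latest weights
                state = 0
                for _ in range(max_steps):
                    if random.random() < epsilon:     # explore
                        action = random.randrange(num_actions)
                    else:                             # exploit
                        action = max(range(num_actions),
                                     key=local_q.__getitem__)
                    next_state, reward = state + 1, 0.0   # dummy robot step
                    with lock:
                        memory.append((state, action, reward, next_state))
                    state = next_state
        finally:
            done.set()                 # always release the learning thread

    def learning_thread():
        while not done.is_set():
            with lock:
                ready = len(memory) >= batch_size
                if ready:
                    batch = random.sample(list(memory), batch_size)
                    for _, action, reward, _ in batch:
                        # dummy update in place of the gradient step
                        q_values[action] += 0.1 * (reward - q_values[action])
                    update_count[0] += 1
            if not ready:
                done.wait(0.001)       # avoid a busy loop while waiting

    threads = [threading.Thread(target=action_thread),
               threading.Thread(target=learning_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(memory), update_count[0]

stored, updates = run(num_episodes=5, max_steps=20, batch_size=8)
```

The number of learner updates depends on thread scheduling, so only the number of stored transitions is deterministic in this sketch.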

Unlike [11], we use multiple long short-term memory (LSTM) layers to approximate the Q-function. LSTM can achieve good performance for complex tasks where part of the environment’s state is hidden from the agent [12]. In our task, the peg is in physical contact with the environment and the states are not clearly identified. Furthermore, when we issue an action command shown in Eq. (2), the robot controller interprets the command and executes the action at the next cycle. Therefore, the effect of the robot action on the environment can only be observed two cycles after the action is issued. Experiments show that LSTM can compensate for this delay by considering the history of the sensed data.

IV Experiments

The proposed skill acquisition technique is evaluated by using a 7-axis articulated robot arm. A 6-axis force-torque sensor and a gripper are attached to the end effector of the robot (Fig. 5(a)). The rated load of the force-torque sensor is for the force and for the moment. The resolution of the force is . The gripper is designed to grasp cylindrical pegs of diameter between 34 and 36 . In this paper, we suppose that the peg is already grasped and in contact with the hole plate. As shown in Fig. 5(b), a 1D goniometer stage is attached to the base plate to adjust the angle of this plate with respect to the ground.

Fig. 5: (a) Robot (b) Description of peg-in-hole components

We prepare hole and pegs with different sizes (Table I). The clearance between peg and the hole is defined in the table, while the robot arm accuracy is only .

Type Diameter Height Material Clearance
Peg S1 Steel
Peg S2 Steel
Hole S Steel
TABLE I: Peg and hole dimensions

Fig. 6 shows the architecture of the experimental platform. The robot arm is controlled by action commands issued from an external computer (Apple MacBook Pro®, Retina, 15-inch, Mid 2015 model with Intel Core® i7 ). The computer communicates with the robot controller via User Datagram Protocol (UDP). The sensors are sampled every and the external computer polls the robot controller every to get 20 data points at one time. These 20 data points are averaged to reduce the sensor noise. The learned model is also deployed on a Raspberry Pi® 3 for the execution. The machine learning module in Fig. 6 trains a LSTM network using RL to perform an optimal action for a given system state.
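The noise reduction step, averaging the 20 polled data points into one observation, can be sketched as a per-channel mean. The function name, the list-of-lists layout, and the synthetic readings are assumptions for illustration; the actual data format exchanged with the robot controller is not given in the text.

```python
def average_samples(samples):
    """Average a batch of sensor readings channel by channel."""
    if not samples:
        raise ValueError("at least one sample is required")
    count = len(samples)
    return [sum(channel) / count for channel in zip(*samples)]

# 20 synthetic readings of a 6-axis force-torque sensor (illustrative data).
readings = [[float(i)] * 6 for i in range(20)]
averaged = average_samples(readings)
```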

Fig. 6: Architecture of the experimental platform

We treat search and insertion as two distinct skills and we train two neural networks to learn each skill. Both networks use two LSTM layers of size and (Fig. 2). At the first step, the search phase is learned and then the insertion phase is learned with search skill already in place.

The maximum size of the replay memory shown in Algorithm 1 is set to 20,000 steps and it is overwritten in a first-in-first-out (FIFO) manner. The maximum number of episodes is set to 230 and the maximum number of steps is set to 100 for the search phase and 300 for the insertion phase. The learning thread shown in Algorithm 2 starts learning after episodes. Batch size is to select random experiences from .

The initial exploration rate for the network is set to 1.0 (i.e. the actions are selected randomly at the start of learning). The exploration is reduced by 0.005 after each episode until it reaches 0.1. This allows a gradual transition from exploration to exploitation of the trained network.
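The exploration schedule is a linear decay with a floor, which can be written out directly. The function name and default arguments below are illustrative; the numbers 1.0, 0.005, and 0.1 are the ones stated above.

```python
def exploration_rate(episode, start=1.0, decay=0.005, floor=0.1):
    """Exploration rate after `episode` completed episodes (linear decay)."""
    return max(floor, start - decay * episode)

# Number of episodes needed to go from 1.0 down to the floor of 0.1.
episodes_to_floor = round((1.0 - 0.1) / 0.005)
```

With these values the floor is reached after 180 episodes, which is within the maximum of 230 episodes.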

IV-A Search Phase

Preliminary experiments and analysis on actual robot moment were performed to compute the optimal vertical force . We first calibrate the 6 axis force/torque sensor. In particular, we adjust the peg orientation () to ensure that both and are 0 for a vertical downward force (Fig. 7(a)). After calibration, we analyze the moment for three different downward forces at three different peg locations (Fig. 7(b)).

Fig. 7: Preliminary experiments for the moments analysis. (a) Align peg to get zero moment values (b) Stamp peg nearby the hole.

Fig. 8 shows the moment values for nine different configurations of peg position and force. Figs. 8(a) and 8(d) show that we cannot get a detectable moment by pushing down with a force of . In contrast, it is clear that a downward force of both and can be used for estimating the hole direction based on the moment values. As expected, in the case of in Figs. 8(b) and 8(e), is bigger when the peg is closer to the hole. It is better to use a weaker force to reduce wear and tear of the apparatus, especially for relatively fragile material (e.g. aluminum, plastic). As a result, we use downward force for all subsequent experiments in search phase.

Fig. 8: Moment values in preliminary experiments for nine combinations of peg position and downward force, panels (a) to (i), with the two moment components plotted in red and in blue.

Due to the accuracy of robot sensors there is an inherent error of in the initial position of the peg. In addition, the hole can be set by humans manually in a factory and there can be large position errors in the initial position of the hole. In order to make the system robust to position errors, we add additional error in the position in one of 16 directions randomly selected. Instead of directly starting from large initial offset, the learning is done in stages for the search phase. We start with a very small initial offset of the peg from the hole and learn the network parameters. Using this as prior knowledge we increase the initial offset to . Instead of starting from exploration rate of 1.0 we set initial exploration rate to 0.5 for the subsequent learning stage.
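Adding a position error in one of 16 randomly selected directions can be sketched as below. The sketch assumes the 16 directions are equally spaced around the hole, which the text does not state; the function name, the `magnitude` parameter, and the example value are hypothetical.

```python
import math
import random

def random_offset(magnitude, num_directions=16, rng=random):
    """Return an (x, y) offset of fixed length in a random discrete direction."""
    index = rng.randrange(num_directions)
    angle = 2.0 * math.pi * index / num_directions  # equal spacing (assumption)
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

dx, dy = random_offset(1.0)  # illustrative magnitude, units left unspecified
```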

The state input to the search network is a 7-dimensional vector of Eq. (1). The size of the grid in Fig. 3 is set to for and for . The neural network selects one of the following four actions defined using Eq. (2):

with , and . Since the peg stays in contact with the hole plate by a constant force , it can enter into the hole during the motion. Compared to step wise movements, the continuous movements by the force control can avoid the static friction.

The peg position is used to detect when the search is successful. If becomes smaller than compared to the starting point, we say that the peg is inside the hole. We set for the maximum safe distance (Eq. (5)).

Fig. 9: Performance of the proposed method during learning search phase with clearance, tilted angle, initial offset (a) Reward (b) Step. Means and 90% confidence bounds in a moving window of 20 episodes

Fig. 9 shows the learning progress in case of clearance, tilt angle, and initial offset. Fig. 9(a) shows the learning convergence and Fig. 9(b) illustrates that the number of steps to successfully accomplish the search phase is reduced significantly.

IV-B Insertion Phase

Successful searching is a pre-requisite for the insertion phase. After training the searching network, we train a separate but similar network for insertion. Based on the 7-dimensional vector of Eq. (1), we define the following state input vector of this network:


where, , sense the peg orientation, while indicates if the peg is stuck or not.

To accomplish the insertion phase, the system chooses from the following 5 actions of Eq. (2):

The vertical peg position is used for the goal detection. If the difference between starting position and the final position of the peg becomes larger than , we can judge that the insertion is completed successfully. We use for the stroke threshold (Eq. (6)). The reward for a successful episode is similar to the one used in search phase (Eq. (4)).

IV-C Results

In order to show the robustness of the proposed technique, we perform experiments with pegs of different clearances. We also perform tests with tilted hole plate using a 1D goniometer stage under the plate. The results are shown in the attached video (see

We execute the peg-in-hole task 100 times after learning to show the time performances of the learning method:

  • Case A: initial offset, clearance and tilted angle

  • Case B: initial offset, clearance and tilted angle

Fig. 10 shows histograms of the search, insertion, and total execution times for the two cases. The search time distribution in Fig. 10(a) is spread over a wider range and is shifted further to the right than the one in Fig. 10(d). When the tilt angle is larger, the execution time for the insertion becomes longer as the peg needs to be aligned with the hole.

Fig. 10: Histograms of execution time: The case (A) that clearance, tilted angle, initial offset (a) search time (b) insertion time (c) total time. The case (B) that clearance, tilted angle, initial offset (d) search time (e) insertion time (f) total time.
Approach                      (1)    (2)    (2)    (2)
Clearance [μm]                10     10     10     20
Angle error [°]               1.0    0      0      1.6
Initial position error [mm]   1.0    1.0    3.0    1.0
Search time (s)               -      0.97   2.26   0.95
Insertion time (s)            -      1.40   1.33   2.31
Total time (s)                5.0    3.47   4.68   4.36
TABLE II: Average execution time for peg-in-hole task
(1) Conventional approach using fixed search patterns [13]
(2) Our proposed approach

Table II summarizes the average execution time in 100 trials for the four cases. We achieve a 100% success rate in all cases. For comparison, we list the specifications from the product catalog of the conventional approach, which uses force sensing control and fixed search patterns [13]. The maximum initial position and angle errors allowed by the conventional approach are 1.0 mm and 1.0° respectively (Table II). The results show that robust fitting skills against position and angle errors can be acquired by the proposed learning technique.

V Conclusions and Future Work

There are industrial fitting operations that require very high precision. Classical robot programming techniques take a long setup time to tune parameters due to environment variations. In this paper, we propose an easy to deploy, teach-less approach for precise peg-in-hole tasks and validate its effectiveness by using a 7-axis articulated robot arm. Results show robustness against position and angle errors for a fitting task.

In this paper, the high precision fitting task is learned for each configuration by using online learning. In future work, we will gather trial information from multiple robots in various configurations and upload it to a Cloud server. A more general model will be learned on the Cloud by using this data pool in batches. We would like to generalize the model so that it can handle different materials, robot manipulators, insertion angles, and also different shapes. Then, skill as a service will be delivered to robots in new factory lines with shortened setup time.

The proposed approach uses a discrete number of actions to perform the peg-in-hole task. As an obvious next step, we will analyze the difference between this approach and continuous space learning techniques such as A3C [14] and DDPG [15].


We are very grateful to Masaru Adachi of Tsukuba Research Laboratory, Yaskawa Electric Corporation, Japan, for his helpful support of this work.


  • [1] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • [2] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection,” International Symposium on Experimental Robotics (ISER), 2016.
  • [3] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • [4] K. Sharma, V. Shirwalkar, and P. K. Pal, “Intelligent and Environment-Independent Peg-In-Hole Search Strategies,” International Conference on Control, Automation, Robotics and Embedded Systems (CARE), 2013.
  • [5] W. S. Newman, Y. Zhao, and Y. H. Pao, “Interpretation of Force and Moment Signals for Compliant Peg-in-Hole Assembly,” IEEE International Conference on Robotics and Automation, 2001.
  • [6] C. Bouchard, M. Nesme, M. Tournier, B. Wang, F. Faure, and P. G. Kry, “6D Frictional Contact for Rigid Bodies,” Proceedings of Graphics Interface, 2015.
  • [7] V. Gullapalli, R. A. Grupen, and A. G. Barto, “Learning Reactive Admittance Control,” IEEE International Conference on Robotics and Automation, 1992.
  • [8] M. D. Majors and R. J. Richards, “A Neural Network Based Flexible Assembly Controller,” Fourth International Conference on Artificial Neural Networks, 1995.
  • [9] I. W. Kim, D. J. Lim, and K. I. Kim, “Active Peg-in-hole of Chamferless Parts using Force/Moment Sensor,” IEEE/RSJ International Conference on Intelligent Robots and Systems, 1999.
  • [10] T. Tang, H. C. Lin, Y. Zhao, W. Chen, and M. Tomizuka, “Autonomous Alignment of Peg and Hole by Force/Torque Measurement for Robotic Assembly,” IEEE International Conference on Automation Science and Engineering (CASE), 2016.
  • [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” NIPS Deep Learning Workshop, 2013.
  • [12] B. Bakker, “Reinforcement Learning with Long Short-Term Memory,” 14th International Conference on Neural Information Processing Systems (NIPS), 2001.
  • [13] Yaskawa Europe GmbH, Motofit,”
  • [14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” International Conference on Machine Learning, 2016.
  • [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv:1509.02971, 2015.