Connector insertion and many other tasks commonly found in modern manufacturing settings involve complex contact dynamics and friction. Since it is difficult to capture related physical effects with first-order modeling, traditional control methods often result in brittle and inaccurate controllers, which have to be manually tuned. Reinforcement learning (RL) methods have been demonstrated to be capable of learning controllers in such environments from autonomous interaction with the environment, but running RL algorithms in the real world poses sample efficiency and safety challenges. Moreover, in practical real-world settings we cannot assume access to perfect state information or dense reward signals. In this paper, we consider a variety of difficult industrial insertion tasks with visual inputs and different natural reward specifications, namely sparse rewards and goal images. We show that methods that combine RL with prior information, such as classical controllers or demonstrations, can solve these tasks from a reasonable amount of real-world interaction.
Many industrial tasks on the edge of automation require a degree of adaptability that is difficult to achieve with conventional robotic automation techniques. While standard control methods, such as PID controllers, are heavily employed to automate many positioning tasks, tasks that require significant adaptability or tight visual perception-control loops are often beyond the capabilities of such methods and are therefore typically performed manually. Standard control methods can struggle in the presence of complex dynamical phenomena that are hard to model analytically, such as complex contacts. Reinforcement learning (RL) offers a different solution, relying on trial-and-error learning instead of accurate modeling to construct an effective controller. RL with expressive function approximation, i.e. deep RL, has further been shown to automatically handle high-dimensional inputs such as images.
However, deep RL has thus far not seen wide adoption in the automation community due to several practical obstacles. Sample efficiency is one obstacle: tasks must be completed without excessive interaction time or wear and tear on the robot. Progress in recent years on developing better RL algorithms has led to significantly better sample efficiency, even in dynamically complicated tasks [2, 3], but sample efficiency remains a challenge for deploying RL in real-world robotics contexts. Another major, often underappreciated, obstacle is goal specification: while prior work in RL assumes a reward signal to optimize, this signal is often carefully shaped to allow the system to learn [4, 5, 6]. Obtaining such dense reward signals can be a significant challenge, as one must additionally build a perception system that allows computing dense rewards on state representations. Shaping a reward function so that an agent can learn from it also requires considerable manual effort. An ideal RL system would learn from rewards that are natural and easy to specify. How can we enable robots to autonomously perform complex tasks without significant engineering effort to design perception and reward systems?
We first consider an end-to-end approach that learns a policy from images, where the images serve as both the state representation and the goal specification. Using goal images is not fully general, but can successfully represent tasks in which the objective is to reach a final desired state. Specifying goals via goal images is convenient and requires minimal manual effort. Using images as the state representation also allows a robot to learn behaviors that utilize direct visual feedback, which provides some robustness to sensor and actuator noise.
Secondly, we consider learning from simple and sparse reward signals. Sparse rewards can often be obtained conveniently, for instance from human-provided labels or simple instrumentation. In many electronic assembly tasks, which we consider here, we can directly detect whether the electronics are functional, and use that signal as a reward. Learning from sparse rewards is challenging because exploration with sparse reward signals is difficult, but sufficient prior information about the task can overcome this difficulty. To this end, we extend the residual RL approach [8, 9], which learns a parametric policy on top of a fixed, hand-specified controller, to the setting of vision-based manipulation.
In our experiments, we show that we can successfully complete real-world tight tolerance assembly tasks, such as inserting USB connectors, using RL from images with reward signals that are convenient for users to specify. We show that we can learn from only a sparse reward based on the electrical connection for a USB adapter plug, and we demonstrate learning insertion skills with rewards based only on goal images. These reward signals require no extra engineering and are easy to specify for many tasks. Beyond showing the feasibility of RL for solving these tasks, we evaluate multiple RL algorithms across three tasks and study their robustness to imprecise positioning and noise.
Learning has been applied previously in a variety of robotics contexts. Different forms of learning have enabled autonomous driving, biped locomotion, block stacking, grasping, and navigation [14, 15]. Among these methods, many involve reinforcement learning, where an agent learns to perform a task by maximizing a reward signal. Reinforcement learning algorithms have been developed and applied to teach robots to perform tasks such as balancing a robot, playing ping-pong, and playing baseball. The use of large function approximators, such as neural networks, in RL has further broadened the generality of RL. Such techniques, called “deep” RL, have further allowed robots to be trained directly in the real world to perform fine-grained manipulation tasks from vision, open doors, play hockey, stack Lego blocks, use dexterous hands, and grasp objects. In this work we further explore solving real-world robotics tasks using RL.
Many RL algorithms introduce prior information about the specific task to be solved. One common method is reward shaping, but reward shaping can become arbitrarily difficult as the complexity of the task increases. Other methods incorporate a trajectory planner [25, 26, 27], but for complex assembly tasks, trajectory planners require a host of information about objects and geometries, which can be difficult to provide.
Another body of work on incorporating prior information studies using demonstrations either to initialize a policy [18, 28], to infer reward functions using inverse reinforcement learning [29, 30, 31, 32, 33], or to improve the policy throughout the learning procedure [34, 35, 36, 37]. These methods require multiple demonstrations, which can be difficult to collect, especially for assembly tasks, although learning a reward function by classifying goal states [38, 39] may partially alleviate this issue. More recently, manually specifying a policy and learning the residual task has been proposed [8, 9]. In this work we evaluate both residual RL and combining RL with learning from demonstrations.
Previous work has also tackled high precision assembly tasks, especially insertion-type tasks. One line of work focuses on obtaining high dimensional observations, including geometry, forces, joint positions and velocities [40, 41, 42, 43], but this information is not easily procured, increasing complexity of the experiments and the supervision required. Other work relies on external trajectory planning or very high precision control [42, 41], but this can be brittle to error in other components of the system, such as perception. We show how our method not only solves insertion tasks with much less information about the environment, but also does so under noisy conditions.
In this work, we empirically evaluate learning methods on a set of electric connector assembly tasks, pictured in Fig. 2. Connector plug insertions are difficult for two reasons. First, the robot must be very precise in lining up the plug with its socket. As we show in our experiments, millimeter-scale errors can lead to consistent failure. Second, there is significant friction when the connector plug touches the socket, and the robot must learn to apply sufficient force in order to insert the plug. Image sequences of successful insertions are shown in Fig. 1, where it is also possible to see details of the gripper setup that we used to ensure a failure-free, fully automated training process. In our experiments, we use a 7-degree-of-freedom Sawyer robot with end-effector control, meaning that the action signal can be interpreted as the relative end-effector movement in Cartesian coordinates. The robot’s underlying internal control pipeline is illustrated in Fig. 3.
To comprehensively evaluate connector assembly tasks, we repeat our experiments on a variety of connectors. Each connector offers a different challenge in terms of required precision and force to overcome friction. We chose to benchmark the controllers' performance on the insertion of a USB connector, a D-Sub connector, and a waterproof Model-E connector manufactured by MISUMI. All the explored use cases were part of the IROS 2017 Robotic Grasping and Manipulation Competition, included as part of a task board developed by NIST to benchmark the performance of assembly robots.
In the following we describe the adapters used: USB, D-Sub, and Model-E. The observed difficulty of the insertion increases in that order.
The USB connector is ubiquitous and offers a challenging insertion task. Because the adapter becomes smoother and therefore easier to insert over time due to wear and tear, we periodically replace the adapter. Of the three tested adapters, the USB adapter is the easiest.
Inserting the D-Sub adapter requires aligning several pins correctly, and is therefore more sensitive than inserting the USB adapter. It also requires more downward force due to a tighter fit.
The Model-E adapter is the most difficult of the three tested connectors, as it contains several edges and grooves to align and requires significant downward force to successfully insert the part.
We consider three settings in our experiments in order to evaluate how plausible it is to solve these tasks with more convenient state representations and reward functions, and to evaluate how the performance of different algorithms changes as the setting is modified.
In this experiment, we evaluate whether the RL algorithms can learn to perform the connector assembly tasks from vision without having access to state information. The state provided to the learned policy is a grayscale image, such as shown in Fig. 4. For goal specification, we use a goal image, avoiding the need for state information to compute rewards. The reward is the pixelwise L1 distance to the given goal image. Being able to learn from such a setup is compelling, as it does not require any extra state estimation and many tasks can be specified easily by a goal image.
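The image-based reward described above can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists of pixel intensities, and it assumes the distance is negated so that images closer to the goal receive a higher reward; the function name `image_l1_reward` and the example images are hypothetical and not taken from the original implementation.

```python
from typing import Sequence

# Grayscale image as rows of pixel intensities (an assumed representation).
Image = Sequence[Sequence[float]]


def image_l1_reward(observation: Image, goal: Image) -> float:
    """Return the negative pixelwise L1 distance between two grayscale images."""
    if len(observation) != len(goal) or any(
        len(o_row) != len(g_row) for o_row, g_row in zip(observation, goal)
    ):
        raise ValueError("observation and goal must have the same shape")
    distance = sum(
        abs(o - g)
        for o_row, g_row in zip(observation, goal)
        for o, g in zip(o_row, g_row)
    )
    # Assumption: the distance is negated so that smaller distance = larger reward.
    return -float(distance)


# Hypothetical 2x2 images: `near` differs from the goal in one pixel only.
goal_image = [[0, 255], [255, 0]]
near_image = [[0, 250], [255, 0]]
far_image = [[255, 255], [255, 255]]
```

Under these assumptions, an observation that matches the goal image exactly receives the maximum reward of zero, and the reward decreases as more pixels deviate from the goal.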
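In this experiment, the reward is obtained by directly measuring whether the connection is alive and transmitting:
$$r = \begin{cases} 1, & \text{if the connection is detected} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$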
This is the exact true reward for the task of connecting a cable, and can be naturally obtained in many manufacturing systems. As state, the robot is given the Cartesian coordinates of the end-effector and the vertical force that is acting on the end-effector. We could only automatically detect the USB connection, so we only include the USB adapter for the sparse experiments.
In this experiment, the robot receives a manually shaped reward based on the distance to the target location and on the measured force (Eq. 2), with each term weighted by a fixed hyperparameter. When an insertion is indicated through a distance measurement, the sign of the force term flips. This rewards the agent for pressing down after an insertion and was found to improve the learning process. The force measurements are calibrated before each rollout to account for measurement bias and to decouple the measurements from the robot pose.
To solve the connector insertion tasks, we consider and evaluate a variety of reinforcement learning algorithms.
In a Markov decision process (MDP), an agent at every time step $t$ is at state $s_t$, takes action $a_t$, receives a reward $r_t$, and the state evolves according to the environment transition dynamics $p(s_{t+1} \mid s_t, a_t)$. The goal of reinforcement learning is to choose actions $a_t \sim \pi(a_t \mid s_t)$ to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r_t\right]$, where $H$ is the horizon and $\gamma$ is a discount factor. The policy $\pi$ is often chosen to be an expressive parametric function approximator, such as a neural network, as we use in this work.
One class of RL methods additionally estimates the expected discounted return after taking action $a_t$ from state $s_t$, the Q-value $Q(s_t, a_t)$. Q-values can be recursively defined with the Bellman equation:
$$Q(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right]$$
and learned from off-policy transitions $(s_t, a_t, r_t, s_{t+1})$. Because we are interested in sample-efficient real-world learning, we use RL algorithms that can take advantage of off-policy data.
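To make the discounted return and the Bellman backup concrete, the following sketch uses a small tabular Q-function. This is an illustrative assumption only: the experiments in this work use neural network function approximators, and the names `discounted_return`, `bellman_target`, and `q_update`, as well as the learning rate and discount values, are hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple

# (state, action, reward, next_state, done): one off-policy transition.
Transition = Tuple[Hashable, Hashable, float, Hashable, bool]
QTable = Dict[Tuple[Hashable, Hashable], float]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t, accumulated backwards over the episode."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def bellman_target(q: QTable, reward: float, next_state: Hashable,
                   actions: Sequence[Hashable], gamma: float, done: bool) -> float:
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    return reward + gamma * best_next


def q_update(q: QTable, transition: Transition, actions: Sequence[Hashable],
             gamma: float = 0.9, lr: float = 0.5) -> float:
    """Move Q(s, a) a fraction `lr` of the way toward the Bellman target."""
    state, action, reward, next_state, done = transition
    target = bellman_target(q, reward, next_state, actions, gamma, done)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + lr * (target - old)
    return q[(state, action)]
```

Because the update only needs stored transitions, it can reuse data collected by any earlier policy, which is the property that makes off-policy methods attractive for sample-efficient learning on a real robot.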
For control with continuous actions, computing the required maximum in the Bellman equation is difficult. Continuous control algorithms such as deep deterministic policy gradients (DDPG) additionally learn a policy to approximately choose the maximizing action. In this paper we specifically consider two related reinforcement learning algorithms, soft actor-critic (SAC) and twin delayed deep deterministic policy gradients (TD3), that lend themselves well to real-world learning as they are sample efficient, stable, and require little hyperparameter tuning.
SAC is an off-policy value-based reinforcement learning method based on the maximum entropy reinforcement learning framework with a stochastic policy.
Instead of randomly exploring from scratch, we can inject prior information into an RL algorithm in order to speed up the training process, as well as to minimize unsafe exploration behavior. In residual RL, actions are chosen by additively combining a fixed policy $\pi_H(s_t)$ with a parametric policy $\pi_\theta(s_t)$:
$$a_t = \pi_H(s_t) + \pi_\theta(s_t)$$
The parameters $\theta$ can be learned using any RL algorithm. In this work, we evaluate both SAC and TD3, explained in the previous section. The residual RL implementation that we use in our experiments is summarized in Algorithm 1.
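The additive combination used in residual RL can be sketched as follows. The function name `residual_action`, the two toy policies, and the clipping of the summed action to the range [-1, 1] are assumptions made for illustration; the source does not specify action limits.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Policy = Callable[[Sequence[float]], Vector]


def residual_action(state: Sequence[float], fixed_policy: Policy,
                    learned_policy: Policy,
                    low: float = -1.0, high: float = 1.0) -> Vector:
    """Add the learned residual to the fixed controller's action."""
    base = fixed_policy(state)
    residual = learned_policy(state)
    if len(base) != len(residual):
        raise ValueError("both policies must output actions of the same size")
    # Assumption: the combined action is clipped to the allowed action range.
    return [min(high, max(low, b + r)) for b, r in zip(base, residual)]


# Hypothetical stand-ins for the hand-designed controller and the learned policy.
def toy_fixed_policy(state: Sequence[float]) -> Vector:
    return [0.5, -0.2, 0.9]


def toy_learned_policy(state: Sequence[float]) -> Vector:
    return [0.1, 0.0, 0.3]


combined = residual_action([0.0, 0.0, 0.0], toy_fixed_policy, toy_learned_policy)
```

One consequence of this design is that a residual policy that outputs zeros reproduces the hand-designed controller exactly, so learning starts from the controller's behavior instead of from random exploration.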
A simple P-controller serves as the hand-designed controller in our experiments. The P-controller operates in Cartesian space and calculates the current control action $\pi_H(s_t)$ by
$$\pi_H(s_t) = -k_p \cdot (x_t - x_g)$$
where $x_t$ denotes the current end-effector position, $x_g$ denotes the commanded goal location, and $k_p$ denotes the control gains. This P-controller quickly centers the end-effector above the goal position and reaches the goal after about 10 time steps when starting from the reset position, which is located about 5 cm above the goal.
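A proportional controller of this kind can be written in a few lines. In the sketch below, the function name `p_controller`, the per-axis gain values, and the example positions are hypothetical placeholders, not the values used in the experiments.

```python
from typing import List, Sequence


def p_controller(position: Sequence[float], goal: Sequence[float],
                 gains: Sequence[float]) -> List[float]:
    """Cartesian P-controller: action = -k_p * (x - x_goal), per axis."""
    if not (len(position) == len(goal) == len(gains)):
        raise ValueError("position, goal and gains must have the same length")
    return [-k * (x - g) for k, x, g in zip(gains, position, goal)]


# Hypothetical example: end-effector 5 cm above the goal with a small lateral offset.
example_gains = [1.0, 1.0, 0.5]  # assumed values, for illustration only
example_action = p_controller(
    position=[0.02, -0.01, 0.05], goal=[0.0, 0.0, 0.0], gains=example_gains
)
```

The action always points from the current position toward the commanded goal, which is why such a controller works well when the goal is known exactly and degrades when the commanded goal is perturbed.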
Another method to incorporate prior information is to use demonstrations from an expert policy to guide exploration during RL. We first collect demonstrations with a joystick controller. Then, we add a behavior cloning loss while performing RL that pushes the policy towards the demonstrator actions, as considered in prior work. Instead of DDPG, the underlying RL algorithm used is TD3.
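As a rough sketch of how a behavior cloning term can be added to an RL objective, the code below assumes a mean squared error between the policy's actions and the demonstrator's actions, and a scalar weight on that term. Both the loss form and the names `behavior_cloning_loss`, `combined_actor_loss`, and `bc_weight` are assumptions; the source only states that a behavior cloning loss is added during RL.

```python
from typing import Sequence


def behavior_cloning_loss(policy_actions: Sequence[Sequence[float]],
                          demo_actions: Sequence[Sequence[float]]) -> float:
    """Mean over samples of the squared distance to the demonstrator action."""
    if len(policy_actions) != len(demo_actions) or not policy_actions:
        raise ValueError("need equally sized, non-empty action batches")
    per_sample = [
        sum((p - d) ** 2 for p, d in zip(p_act, d_act))
        for p_act, d_act in zip(policy_actions, demo_actions)
    ]
    return sum(per_sample) / len(per_sample)


def combined_actor_loss(rl_loss: float, bc_loss: float,
                        bc_weight: float = 1.0) -> float:
    """Assumed combination: RL actor loss plus a weighted cloning penalty."""
    return rl_loss + bc_weight * bc_loss


# Hypothetical batch: the policy disagrees with the demonstrator on one action.
policy_batch = [[0.0, 0.0], [1.0, 1.0]]
demo_batch = [[0.0, 1.0], [1.0, 1.0]]
bc = behavior_cloning_loss(policy_batch, demo_batch)
```

The weight controls the trade-off: a larger value keeps the policy close to the demonstrations, while a smaller value lets the RL objective dominate once the policy starts to succeed on its own.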
We evaluate the industrial applicability of the residual RL approach on a variety of connector insertion tasks that are performed on a real robot, using easy-to-obtain reward signals. In this section, we consider two types of natural rewards which are intuitive to humans: an image directly specifying a goal and a binary sparse reward indicating success. For both cases, we report success rates on tasks they solve. We aim to answer the following questions: (1) Can such trained policies provide comparable performance to policies that are trained with densely-shaped rewards? (2) Are these trained policies robust to light variations and noise?
For the vision-based learning experiments, we use only raw image observations and distance between the current image and goal image as the reward signal. Sample images that the robot received are shown in Fig. 4. We evaluate this type of reward on all three connectors. In our experiments, we use grayscale images.
In the sparse reward experiment, we use the binary signal of the connector being electrically connected as the reward signal. This experiment is most applicable to electronic manufacturing settings where the electrical connection between connectors can be directly measured. We only evaluate the sparse reward setting on the USB connector, as it was straightforward to obtain the electrical connection signal.
After evaluating the tasks in the above settings, we further evaluate with full state information with a dense and carefully shaped reward signal, given in Eq. 2, that incorporates distance to the goal and force information. Evaluating in this setting gives us an “oracle” that can be compared to the previous experiments in order to understand how much of a challenge sparse or image rewards pose for various algorithms.
For safe and reliable future usage, it is required that the insertion controller is robust against small measurement or calibration errors that can occur when disassembling and reassembling a mechanical system. In this experiment, small goal perturbations are introduced in order to uncover the required setup precision of our algorithms.
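A robustness evaluation of this kind can be sketched as a success-rate estimate over randomly perturbed goals. The sketch assumes uniform per-axis perturbations and a user-supplied trial function; `perturb_goal`, `success_rate`, `open_loop_trial`, and all numeric values are hypothetical, and the toy trial function merely stands in for running a controller on the robot.

```python
import random
from typing import Callable, List, Sequence


def perturb_goal(goal: Sequence[float], max_offset: float,
                 rng: random.Random) -> List[float]:
    """Shift each goal coordinate by uniform noise in [-max_offset, max_offset]."""
    return [g + rng.uniform(-max_offset, max_offset) for g in goal]


def success_rate(run_trial: Callable[[List[float]], bool], goal: Sequence[float],
                 max_offset: float, n_trials: int, seed: int = 0) -> float:
    """Fraction of trials that succeed when the controller is given a noisy goal."""
    rng = random.Random(seed)
    successes = sum(
        bool(run_trial(perturb_goal(goal, max_offset, rng)))
        for _ in range(n_trials)
    )
    return successes / n_trials


TRUE_GOAL = [0.5, 0.0, 0.1]  # assumed socket position in metres


def open_loop_trial(commanded_goal: List[float], tolerance: float = 0.001) -> bool:
    """Toy stand-in: a controller that goes exactly where it is told succeeds
    only if the commanded goal is within `tolerance` of the true goal."""
    return all(abs(c - g) <= tolerance for c, g in zip(commanded_goal, TRUE_GOAL))
```

For example, `success_rate(open_loop_trial, TRUE_GOAL, max_offset=0.0, n_trials=20)` evaluates the unperturbed setting, and increasing `max_offset` shows how quickly a controller without corrective feedback degrades.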
One advantage of using reinforcement learning is the exploratory behavior that allows the controller to adapt from new experiences unlike a deterministic control law. The two RL algorithms we consider in this paper, SAC and TD3, explore differently. SAC maintains a stochastic policy, and the algorithm also adapts the stochasticity through training. TD3 has a deterministic policy, but uses another noise process (in our case Gaussian) to inject exploratory behavior during training time. We compare the two algorithms, as well as when they are used in conjunction with residual RL, in order to evaluate the effect of the different exploration schemes.
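The two exploration schemes can be contrasted in a short sketch. It assumes that TD3-style exploration adds fixed-scale Gaussian noise to a deterministic action and clips the result, and that SAC-style exploration samples from a state-dependent Gaussian and squashes the sample with tanh, as is common in SAC implementations. The function names and parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence


def td3_explore(mean_action: Sequence[float], noise_std: float,
                rng: random.Random, low: float = -1.0,
                high: float = 1.0) -> List[float]:
    """Deterministic action plus Gaussian noise of fixed scale, then clipped."""
    return [
        min(high, max(low, a + rng.gauss(0.0, noise_std))) for a in mean_action
    ]


def sac_sample(mean: Sequence[float], log_std: Sequence[float],
               rng: random.Random) -> List[float]:
    """Sample from a Gaussian with learned per-dimension scale, squashed by tanh."""
    return [
        math.tanh(m + math.exp(ls) * rng.gauss(0.0, 1.0))
        for m, ls in zip(mean, log_std)
    ]


rng = random.Random(0)
td3_action = td3_explore([0.2, -0.4, 0.9], noise_std=0.1, rng=rng)
sac_action = sac_sample([0.2, -0.4, 0.9], log_std=[-1.0, -1.0, -1.0], rng=rng)
```

In the TD3-style function the noise scale is a fixed argument, whereas in the SAC-style function `log_std` would be an output of the policy network and can therefore shrink or grow during training.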
|RL + LfD||Images||52%||52%|
|RL + LfD||Images||20%||20%|
In order to evaluate our experiments with dense and vision-based rewards, we analyze the achieved final distance to the goal throughout the training process. Policies trained with sparse rewards are compared based on their success rate because their training objective does not include the minimization of the distance to the goal. We report the success rate of all final policies and compare their robustness towards measurement noise in the goal location.
The results of the vision-based experiment are shown in Fig. 5. Our experiments show that a successful and consistent vision-based insertion policy can be learned from relatively few samples using residual RL. This result suggests that goal-specification through images is a practical way to solve these types of industrial tasks.
Although image-based rewards are often very sparse and hard to learn from, in this case the distance between images corresponds to a relatively dense reward signal which is sufficient to distinguish the different stages of the insertion process.
Interestingly, during training with pure RL, the policy would sometimes learn to “hack” the reward signal by moving down in the image in front of or behind the socket. In contrast, the stabilizing human-engineered controller in residual RL provides sufficient horizontal control to prevent this and it also transforms the 3-dimensional task into a quasi 1-dimensional problem for the reinforcement learning algorithm, which explains the very good results obtained with residual RL in conjunction with vision-based rewards.
In this experiment, we compare these methods on the USB insertion task with sparse rewards. The results are reported in Fig. 7. All methods are able to achieve very high success rates in the sparse setting, but the methods that use prior information learn about twice as fast. This result shows that we can learn precise industrial insertion tasks in sparse-reward settings, where the reward can often be obtained much more easily than a dense reward. In fact, prior work has found that the final policy for sparse rewards can outperform the final policy for dense rewards, as it does not suffer from a misspecified objective.
|RL + LfD||Sparse||100%||32%|
The results of the experiment with perfect state information and dense rewards are shown in Fig. 9. In this case, the conclusion that residual RL outperforms RL alone still holds. Due to the shaped reward, RL makes more initial progress than it does with the other reward signals, but it never overcomes the friction required to insert the plug fully and complete the task. It appears that the hand-designed reward function does not incentivize the full insertion enough, as we were able to obtain better results on the USB insertion with sparse rewards.
In the previous set of experiments, the goal locations were known exactly. In this case, the hand-engineered controller performs well. However, once noise is added to the goal location, the deterministic P-controller no longer solves the task. After training on perfect observations, a goal perturbation is created artificially and the controllers are tested under this condition. In the presence of a perturbation on the USB insertion task, the hand-engineered controller succeeds in far fewer trials than the best residual RL policies. A similar difference in success rates was observed on the D-Sub connector, where all policies showed a larger decrease in performance when noise was added. Remarkable performance was achieved by residual RL when trained on the Model-E connector insertion. There, the high precision requirements prevented the human-engineered controller from succeeding reliably, even though the exact goal location was given. Residual RL consistently solved this task and even managed to handle goal perturbations in part of the trials. The success rates from our robustness evaluations are included in Figures 6 and 8. The agent demonstrably learns consistent, small corrective feedback behaviors in order to move in the right direction towards the descending path, a behavior that is very difficult to specify manually. The results of this experiment showcase the strength of residual RL. Since the human controller already specifies the general trajectory of the optimal policy, environment samples are only required to learn this corrective feedback behavior.
All experiments were also performed using TD3 instead of SAC. The final success rates of these experiments are included in Fig. 6. When combined with residual RL, SAC and TD3 perform comparably. However, TD3 is often less robust. These results are likely explained by the nature of the exploration for the two algorithms. TD3 has a deterministic policy and fixed noise during training, so once it observes some high-reward states, it quickly learns to repeat that trajectory. SAC adapts the noise to the correct scale, helping SAC stay robust to small perturbations. Furthermore, we found that the actions output by TD3 approach the extreme values at the edge of the allowed action space. This suggests that TD3 finds a local minimum, from which it is difficult to improve further.
In this paper we studied deep reinforcement learning in a practical setting and demonstrated that this approach can solve complex industrial assembly tasks with tight tolerances, e.g. connector plug insertions. Compared to previous work, which uses dense reward signals, we showed that we can learn insertion policies from only sparse binary rewards, or even purely from goal images with only image observations. We conducted a series of experiments for various connector type assemblies under challenging conditions such as noisy goals and complex connector geometries. We found that residual RL consistently performs well in all of these settings. Our study demonstrates the feasibility of applying RL to industrial automation tasks, where reward shaping or perception may be difficult, but sparse rewards or image goals can often be provided.
There remain significant challenges for applying these techniques in more complex environments. One practical direction for future work is focusing on multi-stage assembly tasks through vision. This would pose a challenge to the goal-based policies, as the background would be visually more complex. Moreover, multi-step tasks involve adapting to previous mistakes or inaccuracies, which could be difficult, but in theory should be manageable with RL. Extending the presented approach to multi-stage assembly tasks will pave the road to higher robot autonomy in flexible manufacturing.
This work was supported by the Siemens Corporation, the Office of Naval Research under a Young Investigator Program Award, and Berkeley DeepDrive.
NIPS Workshop on Deep Learning, 2013, pp. 1–9.
International Conference on Machine Learning, 2018.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
L. Pinto and A. Gupta, “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.
J. Ho and S. Ermon, “Generative Adversarial Imitation Learning,” in Advances in Neural Information Processing Systems (NIPS), 2016.
The IEEE International Conference on Computer Vision (ICCV), Oct 2017.