Robot manipulation and control tasks are naturally addressed in different action spaces. For a walking robot, it is important to directly control contact interactions to avoid slippage [Ludo_Contact_2013]. In contrast, for a tennis swing, it is important to track and control the position, velocity, and at times the acceleration of the end-effector [DMP_Initial_2002]. For a surface-to-surface alignment task, minimizing the moment around a contact is important for robustness [khansari2016adaptive]. Tackling all these tasks requires solving two subproblems: (1) the generation of reference signals (desired contacts, trajectories, moments, etc.), and (2) the tracking of these signals.
Control systems for robots can be structured with two feedback control loops that address these subproblems: an outer loop controller that generates a time-varying reference trajectory, and an inner loop controller that tracks this trajectory. Let us refer to the outer loop as $\pi$, a functional map from observations to reference signals, and the inner loop as $f$, a map from reference signals to actuation commands. The combined control law becomes $u = f(a) = f(\pi(o))$, where $o$ is some observation, $a = \pi(o)$ is an abstract action providing a reference signal in some space $\mathcal{A}$, and $u$ is the control command sent to the robot’s actuators to track this reference.
While control theory provides a vast repertoire of strategies to map from reference signals to actuation commands (implementations of $f$), a key problem in robotics is to generate a viable reference in a suitable space given raw sensory observations, i.e. modelling the function $\pi$ and deciding on the interface to $f$. Once the interface has been defined (e.g. forces, positions, contact points, etc.), the generation of reference signals by $\pi$ can be addressed in multiple ways, including hand-tuned state machines [sen2016automating, khansari2016adaptive, murali2015learning], trajectory optimization [18-toussaint-RSS, 2017_rss_system] [schaal2003computational, Ijspeert:2013:DMP, Calinon08IROS, kroemer2015towards, krishnan2017transition], or reinforcement learning (RL) [lee2019making, harrison2017adapt]. Recent research in RL has focused on “observations-to-torques” [levine2016end], which is akin to merging $\pi$ and $f$ into a single learned model. Other methods use higher-level action spaces, such as joint space commands (e.g. position or velocity) [gu2017deep, haarnoja2018soft, vevcerik2017leveraging, zhu2018reinforcement] or task space commands (e.g. end-effector poses, force or fixed impedance) [lee2019making, Mrinal:2011]. These works typically focus on the effects of choosing an observation space on the learning process and rarely justify the choice of action space, $\mathcal{A}$. However, the choice of action space defines the quantity around which the inner control loop is closed, and by extension the space wherein tracking error is minimized. This critically impacts robustness and task performance as well as learning efficiency and exploration.
Moreover, previously proposed action spaces do not necessarily create suitable references for some contact-rich tasks. Consider the task of wiping a board, wherein a robot must control forces in some directions (to keep pressing against the board) and motion in others. The physical constraints of the task dictate along which axes the robot should be stiff and along which it should be compliant, and these constraints are often time-varying. Manual specification of the task constraints does not scale to the variety of contact-rich tasks the robot may need to perform, and for some tasks it may be non-trivial.
This paper studies how the selection of the interface between $\pi$ and $f$ (the space $\mathcal{A}$) affects RL as a method to learn the mapping from observations to reference signals, $\pi$, and presents the first empirical study comparing the most common choices of $\mathcal{A}$ in contact-rich manipulation. We argue that an action space that captures motion and impedance in end-effector space can enable efficient learning of such tasks. We evaluate joint position, joint velocity, joint torque, joint variable impedance, as well as fixed and variable impedance in end-effector space. The choice of action space should be guided not only by the robot model but also by prior knowledge of the task. Hence, we compare action spaces across tasks with varying degrees of task-space constraints, i.e., free-space motion with no contact (path following), manipulation of constrained mechanisms (door opening), and continuous unconstrained contact (surface wiping). Moreover, we introduce variable impedance control in end-effector space (VICES) and advocate this action space for Deep RL algorithms applied to contact-rich manipulation. We show that policies defined in VICES improve sample efficiency for exploration in RL, improve energy efficiency, and reduce peak forces. 
Thanks to the classical dynamically consistent operational space formalism [khatib1987unified], we observe that policies learned in end-effector space are also more robust to transfer across robots with significant differences in dynamics whether in simulation or the real world.
II Related Work
Robot Motion Control: Compliant control of a robot manipulator enables adaptation to uncertainty in the environment (e.g. exact shape of a surface, or kinematic constraints of mechanisms) during contact-rich tasks. However, certain tasks require direct control of the contact interactions (e.g. limiting the force applied when wiping a window). Previous abstractions have divided the dimensions of the task into those that are controlled kinematically (through position and velocity) and those that are controlled dynamically (through forces and torques) [4308708, khatib1987unified, kroger2004compliant]. However, in practice, this hard decoupling requires a level of knowledge about the task that is not always available. Impedance control [part1985impedance] allows safe robot contact manipulation with an unknown environment by explicitly controlling the amount of force the robot exerts when it deviates from a given kinematic goal. Therefore, it alleviates the need for perfect knowledge or hard separation between the dynamic and kinematic task dimensions. However, different phases of a manipulation task may require a dynamic balancing between kinematic and dynamic control. Existing methods address this by scheduling variable impedance gains to maintain stability or safety for a given kinematic trajectory [Mitrovic2011, Li2018, ruckert2013learned]. However, these methods assume that a reference trajectory is given. Instead, we propose to directly predict both end-effector displacement (reference) and variable impedance gains based on observations.
Action Spaces in Learning from Demonstrations: LfD derives a task policy based on demonstrations provided by some other agent(s) [Argall:2009:SRL]. If the demonstrations do not perfectly overlap, a possible approach is to derive a policy that imitates the mean motion of the demonstration set, and varies the stiffness according to the coherence of the trials [6636303, 5648931] or according to the force sensed during kinesthetic replay [Abu-Dakka2018]. 
Similar use of variable impedance as an action representation for LfD has been demonstrated to be successful for adaptive grasping, manipulation of deformable objects [lee2015learning], and co-adaptation to human workers [rozo2016learning]. However, the specification of impedance in the demonstration only reflects variability in the demonstration trajectories, not the underlying task constraints imposed by the environment nor the force profiles required for the task. This approach is also restricted to tasks where expert demonstrations are feasible, and hence is limited in application to kinematic tasks with phases that require different levels of precision.
Reinforcement Learning: In the field of model-based reinforcement learning, Kim et al. [kim2010impedance] proposed a method to learn the parameters of a variable impedance position controller in end-effector space based on the equilibrium point formalism. They demonstrated the convergence, robustness and energy efficiency of their method on a simulated manipulation task with a two DoF planar arm. However, their method requires an initial trajectory, which is not always available. Similar to our method, Buchli et al. [buchli2011learning] apply policy improvement with path integrals (PI$^2$) [theodorou2010generalized] to refine initial trajectories and learn variable scheduling for the joint impedance parameters. They demonstrate that energy consumption can be optimized while achieving a task using variable impedance. However, they use joint space as their action space, which limits the transferability of the learned behaviors to different robots and the optimality of the trajectories in the space of the task. Also, their method requires an initial estimate of the solution to start the iteration. Rey et al. [Rey2018] propose an approach to simultaneously learn kinematic trajectories from demonstrations and variable impedance in task space from exploration. They use Gaussian Mixture Regression as the representation for the policy and demonstrate their method in simulation and in one real-world planar task with a single stiffness parameter. Another approach refines given trajectories with additional force profiles using PI$^2$ [theodorou2010generalized] and readings from a force-torque sensor. We aim to achieve dynamic behavior without direct force loop control by using impedance to learn both trajectories and variable stiffness profiles. It is worth noting that all previous approaches have bootstrapped learning with initial demonstrations, while we explore learning from scratch to better understand how fast policies converge.
Viereck et al. [Viereck2018] studied how to incorporate control structure to learn hopping policies for a one-legged robot with RL. They use an optimal controller for fixed task conditions and learn to imitate its policy with neural networks to generalize to new task conditions. Interestingly, they compare two network architectures outputting signals in different action spaces: desired torques directly, or full feedback parameters and a desired configuration, which are transformed into torques with an analytic function. Their experiments show that this second action space is best suited for the hopping task with intermittent contact and adds interpretability to the network output. We study a set of analytic functions (controllers) that map policy actions to low level robot commands for robot manipulation in three tasks with different contact properties. With a motivation related to ours, Peng et al. [Peng:2017:LLS:3099564.3099567] studied the importance of different action representations in RL for the task of locomotion. Similar to us, they aimed to shed light on the best action space to be used, but they focused on imitation learning for bipedal motion of simulated agents. We would like to provide similar insights in the more complex contact-rich robot manipulation domain and include preliminary studies of transfer to the real world.
III Reinforcement Learning
The goal in reinforcement learning is to find a policy, $\pi$, that selects actions based on current observations so as to maximize the expected reward obtained from interactions with the environment [sutton2018reinforcement]. We assume that the underlying problem can be modelled as a discrete-time continuous Markov decision problem ($\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $r$, $\gamma$, $\rho_0$), where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a continuous action space, $\mathcal{P}$ is a Markovian transition model defining the probability of transitioning between states for a given action, $r$ is a reward function, $\gamma$ is a discount factor (for infinite horizon problems) and $\rho_0$ is the initial state distribution. When $\pi$ is probabilistic it represents the probability of the action, $a$, given the state, $s$: $\pi(a \mid s)$ is the density over $\mathcal{A}$ in state $s$. Alternatively, we can assume partial observability and learn a policy conditioned on observations instead of the latent state. Herein, the agent following the policy obtains an observation $o_t$ at time $t$ and performs an action $a_t$, receiving from the environment an immediate reward $r_t$ and a new observation $o_{t+1}$. Assuming the policy is parameterized by $\theta$, a policy gradient algorithm optimizes $\theta$ to maximize the expected future return:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right]$$
These algorithms are based on the policy gradient theorem, which states: $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$, where $\rho^{\pi}$ is the discounted state distribution induced by the policy and $Q^{\pi}$ is the action-value function associated to the current policy $\pi_\theta$. There are multiple algorithmic solutions based on the policy gradient theorem that allow us to represent the policy with a deep neural network, e.g. Trust Region Policy Optimization (TRPO) [schulman2015trust], Deep Deterministic Policy Gradients (DDPG) [lillicrap2015continuous], or Advantage Actor-Critic (A2C) [mnih2016asynchronous]. In our evaluation of different action spaces for policies, we will use Proximal Policy Optimization (PPO) [schulman2017proximal]. The evaluation of the sensitivity of different algorithms to the action space is deferred to future work.
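The expected future return that policy gradient methods maximize is built from discounted sums of rewards along sampled episodes. As a minimal sketch (the function name `discounted_returns` and its arguments are illustrative and not taken from the paper), the discounted return from every time step of one episode could be computed as follows:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k>=t} gamma^(k-t) * r_k for every step t of one episode."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the return of the following step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

Averaging the first entry of these returns over many episodes gives a Monte Carlo estimate of the objective; algorithms such as PPO typically combine such returns with a learned value function rather than using them directly.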
IV Action Spaces in RL for Robot Manipulation
Relating the formalisms we introduced in Sec. I and III, the RL policy $\pi_\theta$ corresponds to $\pi$, the function that maps observations to reference actions in some space $\mathcal{A}$, assuming that another function $f$ will map these actions to low level control commands, $u$. We note that the RL algorithms in Sec. III are agnostic to the choice of action space $\mathcal{A}$. In practice, the most common choices of $\mathcal{A}$ in RL for robot manipulation are a) joint torques [levine2016end], b) joint velocities [gu2017deep, vevcerik2017leveraging, zhu2018reinforcement], c) joint positions [haarnoja2018soft], and d) end-effector position [lee2019making, thananjeyan2017multilateral], possibly with orientation. The most common lowest level control commands (and the ones we assume for our underlying physical agent) are joint torques, $u = \tau$. Joint torques are safer than positions and velocities for contact-rich tasks in unstructured environments, because the forces the robot will apply on the environment are limited by the specified desired torques. Manipulation tasks can seldom be solved by controlling motion alone, since many tasks involve contact and force constraints (e.g. the adaptation required to manipulate an articulated object or the minimum force to press while wiping a surface) [4308708, khatib1987unified, 508440, kroger2004compliant]. To succeed in these tasks, the robot needs to dynamically modulate the force exerted on the environment through the torques on each joint. To map between action space $\mathcal{A}$ and actuation space $\mathcal{U}$, we can define analytic parameterized functions (i.e. controllers), $f_\phi$, that transform the output of the policy from the action space to the space of control commands depending on the current state of the robot, $s$. The parameters of these functions, $\phi$, can be made part of the policy action space so that the agent has full control over the manipulation behavior [buchli2011learning]. 
In the following, we introduce the different analytic controllers, $f$, that we use to map policy actions from commonly used policy action spaces into joint torques.
IV-A Joint Torques
When the policy directly outputs desired joint torques, i.e. $a = \tau_{des}$, the function that transforms $a$ to robot commands is simply ($f_{\tau}$):
$$u = f_{\tau}(a) = \tau_{des}$$
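IV-B Joint Velocities
When the policy outputs reference joint velocities, $a = \dot{q}_{des}$, the function to map to joint torques is ($f_{\dot{q}}$):
$$u = f_{\dot{q}}(a) = k_v\,(\dot{q}_{des} - \dot{q})$$
where we close the loop around $\dot{q}$, the current joint velocity (state), and $k_v$ is a vector of proportional gains (parameter $\phi$), applied per joint.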
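A minimal sketch of such a proportional velocity loop is given below. The function and argument names are illustrative assumptions, and the gain vector is applied element-wise, one gain per joint:

```python
import numpy as np

def joint_velocity_controller(qd_des, qd, k_v):
    """Proportional velocity loop: torque grows with the joint velocity error."""
    qd_des, qd, k_v = (np.asarray(x, dtype=float) for x in (qd_des, qd, k_v))
    return k_v * (qd_des - qd)  # element-wise: one proportional gain per joint
```

For instance, a joint that is moving faster than its reference receives a negative torque that slows it down.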
IV-C Joint Positions
For policies that output reference joint positions, $a = q_{des}$, it is most straightforward to use a proportional-derivative (PD) controller that generates torques that increase with the joint position error and decrease with the current joint velocity. We also remove the dynamic effects of the mechanism by scaling the torques with the inertia matrix, $M$ [khatib1987unified]. The function to transform reference joint positions to joint torques is thus ($f_{q}$):
$$u = f_{q}(a) = M\,(k_p\,\Delta q - k_d\,\dot{q})$$
where $\Delta q = q_{des} - q$ is the difference between desired and current joint configurations, which can be used as an alternative policy action space. $k_p$ and $k_d$ are vectors of proportional and derivative gains (parameters $\phi$).
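The PD law described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names are hypothetical, the gains are applied per joint, and the inertia matrix `M` is assumed to be supplied by a dynamics model of the robot.

```python
import numpy as np

def joint_position_controller(q_des, q, qd, k_p, k_d, M):
    """PD law on the joint position error, scaled by the joint-space inertia matrix M."""
    q_des, q, qd = (np.asarray(x, dtype=float) for x in (q_des, q, qd))
    k_p, k_d = np.asarray(k_p, dtype=float), np.asarray(k_d, dtype=float)
    delta_q = q_des - q                      # position error (could itself be the policy action)
    desired_acc = k_p * delta_q - k_d * qd   # grows with the error, shrinks with joint velocity
    return np.asarray(M, dtype=float) @ desired_acc
```

Scaling by `M` means the gains act on accelerations rather than torques, so the same gains produce similar motion on joints with very different inertia.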
IV-D End-Effector Pose
In the cases where the policy outputs the desired 6-D pose of the robot in end-effector space, $a = x_{des}$, we can use an impedance-based PD controller to first derive an end-effector space acceleration to move towards the goal. To do that, $x_{des}$ can be decomposed into desired position, $p_{des}$, and desired orientation, $R_{des}$. In the impedance-based PD controller, the end-effector acceleration increases with the difference between desired end-effector pose and current pose, $p$ and $R$, and decreases with the current end-effector velocity, $v$ and $\omega$. We then compute the robot actuations (joint torques) to achieve the desired end-effector space accelerations leveraging the kinematic and dynamic models of the robot with the dynamically-consistent operational space formulation [Khatib1995a]. First, we compute the wrenches at the end-effector that correspond to the desired accelerations, $F$. Then, we map the wrenches in end-effector space to joint torque commands with the end-effector Jacobian at the current joint configuration, $J(q)$: $\tau = J^{\top}(q)\,F$. Thus, the function that maps end-effector space position and orientation to low level robot commands is ($f_{x}$):
$$u = f_{x}(a) = J_{pos}^{\top}\left[\Lambda_{pos}\left(k_p^{pos}\,(p_{des} - p) - k_d^{pos}\,v\right)\right] + J_{ori}^{\top}\left[\Lambda_{ori}\left(k_p^{ori}\,(R_{des} \ominus R) - k_d^{ori}\,\omega\right)\right]$$
where $\Lambda_{pos}$ and $\Lambda_{ori}$ are the parts corresponding to position and orientation in $\Lambda$, the inertia matrix in the end-effector frame that decouples the end-effector motions, $J_{pos}$ and $J_{ori}$ are the position and orientation parts of the end-effector Jacobian, and $\ominus$ corresponds to the subtraction in $SO(3)$. The difference between current and desired position ($\Delta p = p_{des} - p$) and between current and desired orientation ($\Delta R = R_{des} \ominus R$) can be used as an alternative policy action space, $a = (\Delta p, \Delta R)$. $k_p^{pos}$, $k_d^{pos}$, $k_p^{ori}$, and $k_d^{ori}$ are vectors of proportional and derivative gains for position and orientation (parameters $\phi$), respectively.
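The end-effector controller described above can be sketched in a few lines. Everything named here is an assumption made for illustration: orientations are represented as rotation matrices, the orientation difference uses one common small-angle formula based on cross products of the matrix columns (the paper does not specify which one it uses), and the Jacobian blocks and end-effector inertia blocks are assumed to come from a robot model.

```python
import numpy as np

def orientation_error(R_des, R):
    """Orientation difference between two rotation matrices (one common choice)."""
    R_des, R = np.asarray(R_des, dtype=float), np.asarray(R, dtype=float)
    return 0.5 * (np.cross(R[:, 0], R_des[:, 0])
                  + np.cross(R[:, 1], R_des[:, 1])
                  + np.cross(R[:, 2], R_des[:, 2]))

def end_effector_controller(p_des, R_des, p, R, v, w,
                            kp_pos, kd_pos, kp_ori, kd_ori,
                            J_pos, J_ori, Lambda_pos, Lambda_ori):
    """Impedance-style PD law in end-effector space, mapped to joint torques."""
    # Desired accelerations: grow with the pose error, shrink with the current velocity.
    acc_pos = kp_pos * (np.asarray(p_des) - np.asarray(p)) - kd_pos * np.asarray(v)
    acc_ori = kp_ori * orientation_error(R_des, R) - kd_ori * np.asarray(w)
    # Wrench at the end-effector via the operational-space inertia blocks.
    force = Lambda_pos @ acc_pos
    moment = Lambda_ori @ acc_ori
    # Map the wrench to joint torques with the transposed Jacobian blocks.
    return J_pos.T @ force + J_ori.T @ moment
```

Because the law is written entirely in end-effector quantities, only the Jacobian and inertia blocks depend on the specific robot.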
IV-E Variable Impedance Control in End-Effector Space (VICES)
Thus far, the transformations we have defined between policy actions and robot commands are parameterized by $\phi$. In these cases, the parameters are manually specified. We observe that it is beneficial to augment the action space with these parameters to give the agent full control of the behavior. As discussed in Sec. II, this idea has been previously explored for joint space controllers [buchli2011learning]. In this paper, we propose to also turn the parameters of the end-effector space function ($k_p^{pos}$, $k_d^{pos}$, $k_p^{ori}$, and $k_d^{ori}$) into policy outputs. We term this action space Variable Impedance Control in End-Effector Space (VICES). It enables the policy to learn both to predict the end-effector pose as a trajectory reference and to dynamically adapt the impedance gains along each of the six axes (rotation and translation) according to the phase of the task.
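To make the augmented action space concrete, the sketch below splits a single policy output into a pose displacement and per-axis gains. The layout (6 displacement values, then 6 stiffness and 6 damping values), the normalization to [-1, 1], and all numeric ranges are hypothetical choices made for illustration; the paper does not state them here.

```python
import numpy as np

def decode_vices_action(action, kp_range=(10.0, 300.0), kd_range=(1.0, 30.0), max_delta=0.05):
    """Split an 18-D action in [-1, 1] into a pose displacement and per-axis gains.

    The layout and the numeric ranges are illustrative assumptions, not values from the paper.
    """
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    if action.shape != (18,):
        raise ValueError("expected 6 displacement + 6 stiffness + 6 damping values")

    def rescale(x, lo, hi):
        # Affine map from [-1, 1] to [lo, hi].
        return lo + 0.5 * (x + 1.0) * (hi - lo)

    delta_pose = max_delta * action[:6]      # translation (3) and rotation (3) displacement
    kp = rescale(action[6:12], *kp_range)    # stiffness per end-effector axis
    kd = rescale(action[12:18], *kd_range)   # damping per end-effector axis
    return delta_pose, kp, kd
```

Keeping the gains inside a bounded positive range is one simple way to prevent the policy from commanding negative or arbitrarily large stiffness during exploration.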
V Experiments
We conduct experiments in three application domains: a) free-space path following [buchli2011learning], b) manipulation of articulated mechanisms [kim2010impedance] and c) surface wiping [ott2008cartesian, Leidner2016RoboticAR]. These tasks are not only relevant applications in robotics, but also span different levels of task constraints from free motion to highly constrained contact-rich manipulation, which allows us to evaluate and compare the characteristics of the different action spaces for policy learning. For these three tasks and for each of the evaluated action spaces we aim to answer the following questions: Is the action space suitable for model-free RL? Is the learned policy physically efficient? Does the policy learned with a simulated robot transfer to a different simulated robot? Does a policy learned in simulation transfer to a real robot? To answer these questions we will use the following metrics and tests:
Sample efficiency and task completion: samples required for the policy to succeed in the task and/or converge
Physical efficiency: energy consumed by the robot when using the trained policy. We assume a proportional relationship between joint torques and electric power
Physical effort: wrenches applied to the environment by the trained policy during contact-rich manipulation tasks
Transferability between robots: does a robot achieve the task using a policy trained on a different robot?
Sim-to-real transfer of contact-rich policies: does a real robot achieve the task using a policy trained in simulation?
Fig. 3: Training curves for a) path following (free space), b) door opening (kinematic constraints), and c) surface wiping (contact rich) tasks. The plots depict the mean and standard deviation of five learning processes with different random seeds. Tasks without contact or with kinematic constraints (path following and door opening) do not require variable impedance as action space to achieve high reward. In the contact-rich task (surface wiping), the policy using variable impedance in end-effector space achieves higher reward because it learns to correctly adapt the amount of force applied to the task constraints.
Our control framework is outlined in Fig. 2. In all experiments our policies output actions ($a$) at a fixed rate, while we send joint torque commands ($u$) to the robot at a higher rate. To generate torque commands at this higher frequency, the controllers keep the desired goal from the policy while updating the current state of the robot, $s$. In order to ensure smooth robot commands and generated motions, in all of our controllers we interpolate linearly between policy commands at consecutive time steps.
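The two-rate scheme with linear interpolation can be sketched as below. The names `interpolate_goals`, `control_substeps`, and `substeps` (the number of controller ticks per policy step) are hypothetical, and the actual policy and controller rates are not assumed here.

```python
import numpy as np

def interpolate_goals(prev_goal, next_goal, substeps):
    """Linearly blend from prev_goal to next_goal over `substeps` controller ticks."""
    prev_goal = np.asarray(prev_goal, dtype=float)
    next_goal = np.asarray(next_goal, dtype=float)
    fractions = np.arange(1, substeps + 1) / substeps  # last tick lands exactly on next_goal
    return prev_goal + fractions[:, None] * (next_goal - prev_goal)

def control_substeps(controller, get_state, prev_goal, next_goal, substeps):
    """Run the inner loop `substeps` times for one policy action; returns the torques computed."""
    torques = []
    for goal in interpolate_goals(prev_goal, next_goal, substeps):
        state = get_state()                 # the robot state is re-read at the fast rate
        torques.append(controller(goal, state))
    return np.array(torques)
```

Here `controller(goal, state)` stands for any of the analytic controllers from Sec. IV, and `get_state` for whatever call returns the latest robot state.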
V-A Free-Space Motion - Path Following
Setup. In this experiment we aim to measure the properties of different action spaces for tasks that do not involve any contact with the environment. The agent’s goal is to follow a trajectory in free space passing through four via-points. The via-points are placed on a virtual plane in front of the agent at a constant distance along the x axis. The order and location of the via-points are fixed. This setup is a more complex version of the one via-point trajectory of Buchli et al. [buchli2011learning]. This task can be solved kinematically without impedance control. However, we found that controlling the compliance of the robot could still offer benefits in this setup.
Reward Model. This task is trained in two phases: a first phase of task completion and a second phase of energy optimization. In the first phase, the agent is rewarded only for completing the task, i.e. passing through the four via-points. In the second phase, the trained models from the previous phase are further trained with the additional objective of optimizing their motion to reduce energy consumption. In the first phase of the experiment, the agent is rewarded when it hits a via-point (i.e. when it gets closer than a threshold distance). To help guide exploration, we also provide a small dense reward inversely proportional to the distance to the next via-point in the trajectory. Since the episodes continue after the task is completed, a task-completion bonus proportional to the remaining time steps was introduced to discourage the robot from unnecessarily extending the duration of the task. We train policies with this reward using the different action spaces to evaluate if they can learn to follow the free-space trajectory. In the second phase of the experiment, we explore if the action spaces can optimize for the additional objective of minimizing energy consumption without decreasing the quality of the first objective (passing through the via-points). We add an energy consumption penalty to the previously defined reward function. 
To evaluate energy consumption we assume that the torques from the motors are proportional to electric current and the voltage is constant, and thus the electric power scales proportionally to the torque and the energy is its time integral.
Observations. We use as observations the pose and velocity of the end-effector in the robot reference frame, as well as the location of the via-points (and whether each one has been checked).
Evaluation. We first evaluate each of the different action spaces in simulation, using a simulated Panda robot agent with five different random seeds. In the first phase of the experiment, we measure sample efficiency (reward as a function of the iterations) and level of completion of the task (number of via-points crossed). In the second phase of the experiment, we also measure the total energy consumption and task success. In both phases we evaluate how the original trained policies transfer between robots.
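Under the stated assumption that electric power scales with joint torque, an energy proxy can be obtained by integrating torque magnitudes over an episode. The sketch below is illustrative only: the use of absolute values, the sum over joints, the rectangle-rule integration, and the omitted proportionality constant are assumptions, not details from the paper.

```python
import numpy as np

def energy_proxy(torques, dt):
    """Integrate summed torque magnitudes over time (rectangle rule).

    torques: array of shape (T, n_joints) with the commanded joint torques.
    dt: time between consecutive torque commands, in seconds.
    """
    torques = np.asarray(torques, dtype=float)
    power = np.abs(torques).sum(axis=1)   # proportional to electric power, up to a constant
    return float(power.sum() * dt)
```

A penalty proportional to this quantity could then be subtracted from the task reward in the second training phase.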
Sample Efficiency and Task Completion. Fig. 3 (a) shows the training curves for policies in each of the action spaces. All policies except the ones that output reference joint torques and end-effector poses resolved with a fixed low impedance controller were able to achieve the goal of the task: checking all 4 via-points (see Fig. 4(a)). For the policies that achieve the task, the differences in reward value after convergence are simply a consequence of the termination bonus: some action spaces (e.g. desired end-effector poses resolved with high fixed impedance) allow for faster motion and thus faster completion of the task. We gain insights on how an RL policy exploits VICES for this task by observing the stiffness and damping over the course of an episode. Fig. 4 depicts the commands (the desired stiffness and the product of desired stiffness and delta position) from the policy trained with VICES for one episode after the first stage of training (before applying the energy penalty). The policy exploits the impedance (stiffness) to reach each via-point in the different portions of the trajectory. As it checks each via-point (indicated by the vertical bars in the figure), the impedance changes in the appropriate dimension to move quickly to the next via-point with enough stiffness to avoid overshooting.
Physical Efficiency. We evaluate the physical efficiency of policies in different action spaces by comparing the total energy consumption of the agents at the end of the first phase and of the second phase of our experiment, where we add the energy penalty. We found that the policies using variable impedance in end-effector space as action space were the only end-effector space policies that consistently improved energy efficiency while maintaining task performance. Both the medium and high fixed impedance models became unstable, since the action space does not have sufficient degrees of freedom to optimize the motion to reduce energy consumption while still achieving the trajectory task. Note that since the low impedance model never achieved the task, it was not evaluated with energy penalties. In joint space, the policies outputting actions in variable impedance space were also able to reduce the energy consumption significantly more than the controllers with fixed impedance, as expected [buchli2011learning]. They also reduced energy consumption more than policies outputting variable impedance in end-effector space. This reflects that the joint space policies resulting from the first phase of our experiment solved the task much faster (with higher energy consumption) than their end-effector counterparts and therefore had much more room for improvement when optimizing for energy efficiency. Therefore, the difference in absolute energy optimization between policies in joint and in end-effector space is an artifact of the difference in magnitude between end-effector space delta position limits and joint space delta angle limits (i.e., the joint space agents were originally allowed to move more at each time step).
Transferability. We also evaluate how policies using different action spaces transfer in simulation from one robot to another through zero-shot transfer from the Panda robot to the Sawyer robot. The results are depicted in Fig. 4(a). 
As expected, we observed that after convergence only the policies using fixed and variable impedance in end-effector space could transfer directly between robots. The joint-space policies were not able to transfer due to the very different kinematics and dynamics of the two robot platforms. By using end-effector space control, we factor out the effects of the embodiment from the policy learning problem.
V-B Manipulation of Constrained Mechanisms - Door Opening
Setup. In this task, the robot has to learn how to manipulate a one DoF constrained mechanism, a door, to a specific configuration. The agent is equipped with a two-finger gripper it can use to hold the door handle. The door handle is a bar attached vertically to the door leaf. The gripper is closed, leaving a space between the fingers to cage the door handle while still allowing for rotation between handle and gripper. We want to ensure that the agent learns to interact in a controlled and safe manner. Hence, instead of maximally opening the door, we set the goal to manipulate the door into a desired joint configuration. We measure success as the distance between the final and the desired configuration at the end of each episode.
Reward Model. We reward the agent when the door joint gets closer to the desired configuration. We provide an additional constant reward if the configuration of the door is very close to the desired value (within a small threshold). We penalize forces and torques exerted on the environment that go beyond the physical payload of the robot. We also penalize the agent for colliding with the environment with links other than the gripper and for going beyond its joint limits. For safety, the episode terminates when joint limits are violated.
Observations. We use as observation the pose and velocity of the robot’s end-effector in the robot reference frame, as well as the door’s angle and angular velocity.
Evaluation. We evaluate the different action spaces in simulation. We train an agent with a Panda robot embodiment for each action space with five different random seeds.
Sample Efficiency and Task Completion. We first evaluate the different action spaces on their sample efficiency when learning the door manipulation task. The training results are depicted in Fig. 3, middle. The task success results for the door task can be found in Fig. 4(b). We observe that policies that output end-effector space actions (with medium, variable, and high impedance) outperform policies in all other action spaces, achieving close to 100% task success rate and higher rewards. In end-effector space, the policy resolved with an impedance controller with fixed medium stiffness and damping is able to learn the task at a faster rate than the variable impedance controller, as it is initialized with a suitable impedance to operate the door with the defined friction. However, policies outputting actions in both of the aforementioned spaces reach similar rewards and task success rates at the end of training, as the policies that can vary impedance end up learning a suitable impedance for the task. While the policies resolved with an impedance controller with fixed high impedance parameters also achieve on average 100% task success rate, their rewards are lower because they exert higher forces on the environment, which are penalized. The policies resolved with an impedance controller with fixed low impedance parameters are not able to learn the door opening task because they cannot exert high enough forces to overcome the friction of the door and move it. The policies outputting joint velocity actions can reach up to 75% task success, but their rewards are much lower than those of policies in VICES, as they often reach joint limits while opening the door. 
Policies outputting other joint-space actions (torques, positions) are unable to learn to exert enough force to open the door without reaching joint limits.
Transferability. We also evaluate the ability of policies in different action spaces to transfer from the Panda robot to the Sawyer robot. The results are shown in Fig. 4(b), in lighter colors. Transferring policies for the door opening task is more complex than for the free-space path following task because the different robots’ kinematics lead to very different task-space limitations, as well as very different joint limit constraints. Similar to the results in the other tasks, policies trained in joint space are unable to transfer, since the kinematics of the robots differ substantially. The end-effector space policies transfer much more successfully, as they are able to abstract away the dynamics and kinematics of each specific robot model. There is still a performance drop, as the policies in end-effector space do not learn to account for the robots’ different kinematic constraints (i.e. joint limits).
V-C Contact-Rich Manipulation - Surface Wiping
Setup. In this experiment the goal is to wipe a table whose surface location is unknown. The agents are equipped with a wiping tool, resembling a scrubber or a whiteboard eraser (see Fig. 1). In the simulator, the tool is modeled as a soft material that creates contact forces that increase proportionally to the penetration into the tool’s surface. The material to wipe is modeled as a set of small elements of a color different from the table. The elements are placed randomly on the table surface to form a continuous “stain” and are initially marked as unwiped. They become wiped if the wiping tool passes through them, which also causes them to disappear visually. Note that since the elements are modeled as very thin ( height) cylinders resting on the table’s surface, the agent needs to press the tool against the surface in order to wipe them. The friction of the table is randomized, as well as the initial location of the agent above the table.
Reward Model. The main reward comes from wiping off elements. We also provide an additional reward for wiping off all the elements. Additionally, to help during the initial phases of exploration, we give the agent a small reward for maintaining contact with the table. Finally, since we aim to generate safe solutions that can be tested directly on the real robot, we slightly penalize the agent for applying forces over the payload of the real robot (), and harshly penalize the agent for reaching joint limits or colliding with the table with parts other than the wiping tool. If such collisions occur, the episode ends and the task restarts.
Observations. There is no straightforward way to represent the state of a wiping task. Instead, we directly use visual observations: RGB images of the wiping scene generated in our simulator, and obtained from a camera on our real robot platform for the simulation-to-real transfer experiments. As in previous experiments, we also provide the pose and velocity of the end-effector.
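The structure of the reward model described above can be sketched as follows. This is only an illustration of how the terms combine: the class `WipeStep`, the function `wiping_reward`, and every weight value are hypothetical placeholders, and the force threshold is left as a required argument since its value is not assumed here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class WipeStep:
    """Hypothetical per-step summary of the wiping simulation."""
    newly_wiped: int           # elements wiped during this step
    remaining: int             # elements still unwiped after this step
    tool_in_contact: bool      # wiping tool touches the table
    contact_force: float       # force applied on the table
    joint_limit_reached: bool  # unsafe event
    bad_collision: bool        # table hit with a part other than the tool


def wiping_reward(
    step: WipeStep,
    force_limit: float,        # payload threshold of the real robot
    w_wipe: float = 1.0,       # all weights are illustrative assumptions
    w_all_wiped: float = 10.0,
    w_contact: float = 0.01,
    w_force: float = 0.05,
    w_unsafe: float = 10.0,
) -> Tuple[float, bool]:
    """Return (reward, episode_terminated) for one simulation step."""
    # Harsh penalty and episode termination for unsafe events.
    if step.joint_limit_reached or step.bad_collision:
        return -w_unsafe, True
    # Main term: reward proportional to the elements wiped in this step.
    reward = w_wipe * step.newly_wiped
    # Bonus when the last remaining element has just been wiped.
    if step.newly_wiped > 0 and step.remaining == 0:
        reward += w_all_wiped
    # Small shaping term that helps early exploration.
    if step.tool_in_contact:
        reward += w_contact
    # Slight penalty for exceeding the force threshold.
    if step.contact_force > force_limit:
        reward -= w_force
    return reward, False
```

In this sketch the unsafe-event penalty is much larger than the force penalty, mirroring the distinction between "slightly penalize" and "harshly penalize" in the description.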
Evaluation: We first evaluate the different action spaces in simulation. We train an agent with a Panda robot embodiment for each action space with five different random seeds.
Sample Efficiency and Task Completion. Fig. 3, right, shows the convergence of the agents with different action spaces. We observe that the agents with variable impedance in end-effector space converge faster and achieve higher reward. The higher reward results from a lower penalty for applying excessive force on the table, since the agents can learn to adapt their stiffness appropriately. The mean force applied by the policies with variable impedance in end-effector space is , less than the robot’s payload. Agents using other action spaces either apply a higher mean force or do not apply enough force to wipe the table. In terms of task completion, the results are depicted in Fig. 4(c), dark colors. Policies outputting actions in VICES achieve the highest ratio of wiped units.
Transferability. We also evaluate whether the policies learned with the Panda robot embodiment transfer directly to the Sawyer robot in simulation. Fig. 4(c) depicts the results of the policy transfer between robots. Policies trained with variable impedance in end-effector space transfer better than policies in any other space, since the policy is independent of the robot embodiment. However, there is a significant drop in performance due to the different forces generated by the different embodiments.
Simulation-to-real transfer. In a final experiment we evaluate whether the policies trained in simulation can be used on the real robot without any retraining. The goal in the real world is to wipe a whiteboard painted with a marker. Since our focus is on the evaluation of the action space and not on learning a representation of the image, we convert the real images into fake simulated images by superimposing the results of a color segmentation for the colored parts of the table on an image from the simulator in which the robot configuration is set to track the real robot. As a safety precaution, we stop the robot if the payload is exceeded. We note that the robot does not use any direct force sensing during the experiments. We initialize the robot to the same location and run ten trials, each with a different part of the whiteboard painted. One example run can be seen in Fig. 6, and more runs are shown in the video attachment. We consider a trial successful when the robot wipes more than 3/4 of the painted line. The robot successfully wipes the board in 8 of the 10 trials. In one of the failed trials the robot moved abruptly and triggered the safety mechanism. In the other, the robot did not wipe the mark entirely. These results indicate that the policies trained with VICES can transfer to the real world without retraining by exploiting the knowledge of the dynamics model of the robot.
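The conversion of real images into fake simulated images can be illustrated with a minimal sketch, assuming images are nested lists of RGB tuples and that the real camera view is already aligned pixel-to-pixel with the simulated one. The per-channel tolerance rule, the function names `segment_color` and `superimpose`, and the chosen colors are hypothetical; the segmentation used in the experiments may differ.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def segment_color(image: Image, target: Pixel, tol: int) -> List[List[bool]]:
    """Mark pixels whose every channel is within `tol` of the target color."""
    return [
        [all(abs(c - t) <= tol for c, t in zip(px, target)) for px in row]
        for row in image
    ]


def superimpose(sim_image: Image, mask: List[List[bool]], stain: Pixel) -> Image:
    """Paint the simulated stain color onto the simulator image where mask is set."""
    return [
        [stain if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(sim_image, mask)
    ]


def real_to_fake_sim(real: Image, sim: Image, marker: Pixel,
                     tol: int, stain: Pixel) -> Image:
    """Segment the marker color in the real image and overlay it on the sim image."""
    return superimpose(sim, segment_color(real, marker, tol), stain)
```

With this kind of conversion the policy only ever sees images rendered in the simulator's style, so no visual representation has to be relearned for the real camera.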
Reinforcement Learning (RL) as a family of algorithms has ushered in impressive results in generalization, yet a principled evaluation of how to choose action spaces for learning control policies is missing. We presented a thorough evaluation of the effect of the choice of action space on learning policies in RL for tasks without contact, tasks with kinematic constraints, and contact-rich manipulation tasks. We also presented variable impedance in end-effector space (VICES) as an efficient choice of action space for RL and showed empirically that, even when contact conditions vary dynamically during the task, this model outperforms other action space choices on sample efficiency, energy consumption, and safety. We also showed that, thanks to the subtraction of the dynamic effects of the embodiment, variable impedance in end-effector space allows us to transfer policies learned in simulation to other simulated robots and to a real robot without fine-tuning.
This work has been partially supported by JD.com American Technologies Corporation (“JD”) under the SAIL-JD AI Research Initiative. This article solely reflects the opinions and conclusions of its authors and not JD or any entity associated with JD.com.