Sim-to-Real Transfer Learning using Robustified Controllers in Robotic Tasks involving Complex Dynamics

09/13/2018 ∙ by Jeroen van Baar, et al. ∙ MERL

Learning robot tasks or controllers using deep reinforcement learning has been proven effective in simulations. Learning in simulation has several advantages. For example, one can fully control the simulated environment, including halting motions while performing computations. Another advantage, when robots are involved, is that the amount of time a robot is occupied learning a task (rather than being productive) can be reduced by transferring the learned task to the real robot. Transfer learning requires some amount of fine-tuning on the real robot. For tasks which involve complex (non-linear) dynamics, the fine-tuning itself may take a substantial amount of time. In order to reduce the amount of fine-tuning, we propose to learn robustified controllers in simulation. Robustified controllers are learned by exploiting the ability to change simulation parameters (both appearance and dynamics) for successive training episodes. An additional benefit of this approach is that it alleviates the need to precisely determine the physics parameters for the simulator, which is a non-trivial task. We demonstrate our proposed approach on a real setup in which a robot aims to solve a maze puzzle, which involves complex dynamics due to static friction and potentially large accelerations. We show that the amount of fine-tuning in transfer learning for a robustified controller is substantially reduced compared to a non-robustified controller.




I Introduction

Teaching robots to perform challenging tasks has been an active topic of research. In particular, it has recently been demonstrated that reinforcement learning (RL) coupled with deep neural networks is able to learn policies (controllers) which can successfully perform tasks such as pick and fetch.

Robots can be slow and dangerous, can damage themselves, and are expensive. When a robot is learning a task, it needs to be taken out of production. Learning policies using model-free deep RL typically requires many samples to explore the sequential decision making space. Model-free RL applied to tasks that involve complex dynamics requires even more samples to learn adequate policies than tasks involving (largely) linear dynamics. Directly learning on robots may thus be very costly.

In order to reduce the time required for learning on a real robot, training can be performed in simulation environments. The learned policy is then transferred to the real world domain. Modern graphics cards and sophisticated physics engines enable the simulation of complex tasks. Learning with simulators has several advantages. The rendering and physics engines are capable of computing simulations faster than real-time. This helps to reduce overall training times. Recent deep reinforcement learning algorithms allow agents to learn in parallel [1], which reduces training times. Furthermore, both appearance and physics can be controlled in simulation. For example the lighting condition, or the friction of an object can be changed, or the entire simulation can be halted to allow for computation of updates.

Appearance, complex dynamics, and robot motor movements in the real world can only be simulated up to some approximation. Simulation to real world transfer thus requires fine-tuning on real data. Furthermore, real setups involving various components experience delays which are hard to determine exactly. One example is the delay introduced by the acquisition system: some time passes before an acquired image is available for processing by the algorithm.

By randomization of the appearance, physics and system parameters during reinforcement learning on simulation data, robustified policies can be learned. This is analogous to training a deep convolutional neural network to classify objects regardless of the background in the input images. We found that robustified policies can greatly reduce the amount of time for fine-tuning in transfer learning. Reducing the fine-tuning time in transfer learning becomes especially important for tasks involving complex dynamics.

We demonstrate our proposed approach on the challenging task of a robot learning to solve a marble maze game. The maze game is shown in Figure 1. The marbles are subject to static and rolling friction, acceleration, and collisions (with other marbles and with the maze geometry). A simulator simulates the physics of the marbles in the maze game, and renders the results to images. We learn to solve the game from scratch using deep reinforcement learning. A modified version of the deep reinforcement learning algorithm is used to learn directly on real robot hardware. We learn both a robustified and a non-robustified policy in simulation and compare the times required for fine-tuning after transferring the policy to the real world.

In the remainder of this paper we will refer to learning on simulated data / environments as offline learning, and learning on real data / environments will be referred to as online learning. Transfer learning (TL) with fine-tuning on real data therefore constitutes both offline as well as online learning.

II Related Work

Figure 1: Marble maze game. (Left) Top view of the marble maze after a plexiglass top has been removed (leaving holes in the outermost edge). A paper rim is used to cover the holes. The black dots in each gate between rings are used for alignment. The view also shows the world-aligned $x$ and $y$ axes. (Middle) The marble maze mounted on the robot arm. (Right) A rendering of the simulated marble maze under some chosen lighting conditions (without added noise).

Our work is inspired by the recent advances in deep reinforcement learning, learning complicated tasks and achieving (beyond) human level performance on a variety of tasks [2, 1, 3, 4].

TL has been an active area of research in the context of deep learning. For example, tasks such as object detection and classification can avoid costly training time by using pre-trained networks and fine-tuning [5, 6], where typically only the weights in the last couple of layers are updated. TL from simulated to real has also been applied to learn robot tasks [7, 8, 9, 10]. To reduce the time required for fine-tuning in TL, the authors in [11] propose to make simulated data look more like the real world.

In [12] the authors acknowledge that training robot tasks on simulated data alone does not readily transfer to the real world. They propose a form of fine-tuning where the inverse dynamics for the real robot are recovered. It requires a simulator and training which produces reasonable estimates of the real world situation. The drawback of this method is that it requires long online training times, whereas our goal is to minimize the duration of the online training time.

By randomization of the appearance, the learning can become robust against appearance changes and readily transfer to the real world domain [13, 14]. The method proposed in [15] exploits an ensemble of simulated source domains and adversarial training to obtain robust policies. This policy search approach relies on trajectories and roll-outs which solve the task. The approach proposed in [16] uses model-based RL to learn a controller entirely in simulation, allowing for zero-shot TL. Since we are considering tasks involving (much) more complex dynamics, we instead follow an approach similar to [17], and perform randomization of appearance, physics and system parameters with model-free RL.

Model-agnostic meta-learning (MAML) [18] aims to learn a meta-policy that can be quickly adapted to new (but similar) tasks. In the case of complex dynamics it is not clear how easily MAML could be applied. Appearance and dynamics randomization can be considered as forms of meta-learning. Other approaches aim to learn new tasks, or refine previously learned tasks, without "forgetting", e.g., [19]. Our emphasis instead is on reducing the amount of time required for fine-tuning in TL.

Our simulator provides observations of the state in simulation, similar to the real world. In [20] the critic receives full states, whereas the actor receives observations of states. Coupled with appearance randomization, zero-shot transfer can be achieved. The full state requires that the physics parameters to produce complex dynamics match those of the real world. However, precisely determining the physics parameters is non-trivial.

Formulating reward functions is not straightforward. The authors in [21] propose to discover robust rewards to enable the learning of complicated tasks. Adding sub-goals, essentially a form of curriculum learning [22], can improve the learning as well [23]. The latter approach may be applied to break up the goal of a marble maze into stages. However, in this paper we show that a simple reward function which governs the overall goal of the game is sufficient.

The authors in [24] propose a game-like environment for generating synthetic data for benchmark problems related to reinforcement learning. We developed our simulator along the same lines as [24].

In [25] the authors propose to model both the dynamics and control in order to solve the marble maze game. This is a complementary approach to the TL approach proposed in this paper, and we believe that each approach has its own strengths and weaknesses.

III Preliminaries

We briefly review some concepts from (deep) reinforcement learning (RL) using model-free asynchronous actor-critic, and define some terminology that we will use in the remainder of this paper. In the next section we will discuss our approach.

III-A Reinforcement Learning

In RL an agent interacts with an environment, represented by a set of states $\mathcal{S}$, taking actions from an action set $\mathcal{A}$, and receiving rewards $r \in \mathbb{R}$. The environment is governed by (unknown) state transition probabilities $p(s_{t+1} \mid s_t, a_t)$. The agent aims to learn a (stochastic) policy $\pi(a_t \mid s_t)$, which predicts (a distribution over) actions $a_t$ based on state $s_t$. The goal for the agent is to learn a policy which maximizes the expected return $\mathbb{E}[R_t]$, where the return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ denotes the discounted sum of future rewards, with discount factor $\gamma \in (0, 1]$.
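As a small illustration of the return defined above, the following sketch computes the discounted sum of rewards for a finite episode. The function name and the example reward sequence are illustrative assumptions and do not come from the paper's implementation.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Discounted sum of rewards: sum over k of gamma**k * rewards[k].

    The sum is accumulated backwards, which avoids computing powers of gamma.
    """
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret
```

For example, `discounted_return([0.0, 0.0, 1.0], gamma=0.99)` weights the single reward by `0.99 ** 2`, so a sparse reward that arrives later contributes less to the return.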

To determine for a given policy how good it is to be in a certain state, or how good it is to take a certain action in a certain state, RL depends on two value functions: a state-value function $V^{\pi}(s)$ and an action-value function $Q^{\pi}(s, a)$. For Markov decision processes, the value functions can be written as a recursion of expected rewards, e.g., $V^{\pi}(s) = \mathbb{E}_{\pi}[\, r + \gamma V^{\pi}(s') \,]$, where $s$ denotes the current state, and $s'$ denotes the next state. The recursive formulations are Bellman equations. Solving the Bellman optimality equations would give rise to the optimal policy $\pi^{*}$. For details we refer the reader to [26].

We consider the case where agents interact with the environment in episodes of finite length. The end of an episode is reached if the agent arrives at the timestep of maximum episode length, or the goal (terminal state) is achieved. In either case, the agent restarts from a new initial state.

III-B Deep RL using Advantage Actor-Critic

In [1] the authors propose the asynchronous advantage actor-critic algorithm. The algorithm defines two networks: a policy network $\pi(a_t \mid s_t; \theta)$ with network parameters $\theta$, and a value network $V(s_t; \theta_v)$ with network parameters $\theta_v$. This policy-based model-free method determines a reduced variance estimate of $\nabla_{\theta} \mathbb{E}[R_t]$ as $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\,(R_t - b_t(s_t))$ [27]. The return $R_t$ is an estimate of $Q^{\pi}(s_t, a_t)$ and the baseline $b_t(s_t)$ is a learned estimate of the value function $V^{\pi}(s_t)$. The policy is referred to as the actor, and the value function estimate as the critic.

The authors in [1] describe an algorithm where multiple agents learn in parallel, and each agent maintains local copies of the policy and value networks. Agents are trained on episodes of maximum length $T_{\max}$. Within each episode, trajectories are acquired as sequences $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$ of maximum length $t_{\max}$. Rather than the actual state, the inputs are observations (images) of the state, and a forward pass of each image through the agent's local policy network results in a distribution over the actions. Every $t_{\max}$ steps, the parameters of the global policy and value networks are updated and the agent synchronizes its local copy with the parameters of the global networks. The current episode ends after $T_{\max}$ steps, or when the terminal state is reached, and then a new episode starts. This episodic learning is repeated until the task is solved consistently. See [1] for further details.
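The per-trajectory quantities used by an advantage actor-critic update can be sketched as follows: bootstrapped returns for a trajectory segment, and advantages obtained by subtracting the critic's value estimates as the baseline. This is a minimal sketch under the assumption that the segment is bootstrapped with the critic's value of the state following it (zero if that state is terminal); the function names are illustrative and not taken from [1] or from the authors' code.

```python
from typing import List


def bootstrapped_returns(rewards: List[float], bootstrap_value: float,
                         gamma: float) -> List[float]:
    """Return R_t for every step of a trajectory segment.

    bootstrap_value is the critic's estimate for the state that follows the
    segment (use 0.0 when the segment ends in a terminal state).
    """
    returns: List[float] = []
    ret = bootstrap_value
    for reward in reversed(rewards):
        ret = reward + gamma * ret
        returns.append(ret)
    returns.reverse()
    return returns


def advantages(rewards: List[float], values: List[float],
               bootstrap_value: float, gamma: float) -> List[float]:
    """Advantage estimate R_t - V(s_t), with the critic as the baseline."""
    returns = bootstrapped_returns(rewards, bootstrap_value, gamma)
    return [ret - value for ret, value in zip(returns, values)]
```

The advantages weight the log-probability gradient of the actor, while the returns serve as regression targets for the critic.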

IV Deep Reinforcement Learning for a Task with Complex Dynamics

IV-A Setting up the Task

The task we aim to learn is to solve a marble maze game, see Figure 1. Solving the game means that the marble(s) are maneuvered from the outermost ring, through a sequence of gates, into the center. Due to static and dynamic friction, acceleration, damping, and the discontinuous geometry of the maze, the dynamics are (highly) complex and difficult to model. To solve the marble maze game using model-free RL we can define a reward function as:

$$ r = \begin{cases} +1 & \text{if through a gate towards the goal} \\ -1 & \text{if through a gate away from the goal} \\ 0 & \text{otherwise} \end{cases} \qquad (1) $$

This sparse reward function is general and does not encode any information about the actual geometry of the game. The action space is discretized into five actions. The first four actions constitute rotation increments, clockwise and counterclockwise around the $x$ and $y$ axes, up to a fixed maximum angle. Figure 1–Left shows the orientation of the $x$ and $y$ axes with respect to the maze. The increment is sufficient to overcome the static friction, while simultaneously avoiding accelerations that are too large. We define a fifth action as no-op, i.e., maintain the current orientation of the maze. We empirically determined the fixed maximum angle to be in either direction.
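The discretized action space and a sparse gate-based reward can be sketched as below. Several things here are assumptions made only for illustration: the increment and maximum tilt values (`INCREMENT_DEG`, `MAX_ANGLE_DEG`) are hypothetical because the text does not state them, the sign convention for clockwise rotation is arbitrary, rings are assumed to be numbered from 0 (center) outward, and the reward is assumed to be +1 for passing a gate towards the center and -1 for passing one away from it.

```python
from typing import Dict, Tuple

# Hypothetical values: the text does not state the increment or maximum angle.
INCREMENT_DEG = 0.2
MAX_ANGLE_DEG = 5.0

# Action index -> (increments about x, increments about y).
# The mapping of "clockwise" to a positive sign is an arbitrary choice.
ACTIONS: Dict[int, Tuple[int, int]] = {
    0: (+1, 0),  # clockwise about x
    1: (-1, 0),  # counterclockwise about x
    2: (0, +1),  # clockwise about y
    3: (0, -1),  # counterclockwise about y
    4: (0, 0),   # no-op: keep the current orientation
}


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def apply_action(tilt: Tuple[float, float], action: int) -> Tuple[float, float]:
    """Apply one rotation increment and keep both tilt angles within limits."""
    dx, dy = ACTIONS[action]
    x = clamp(tilt[0] + dx * INCREMENT_DEG, -MAX_ANGLE_DEG, MAX_ANGLE_DEG)
    y = clamp(tilt[1] + dy * INCREMENT_DEG, -MAX_ANGLE_DEG, MAX_ANGLE_DEG)
    return (x, y)


def gate_reward(previous_ring: int, ring: int) -> int:
    """Sparse reward from ring indices (0 = center, larger = further out)."""
    if ring < previous_ring:
        return 1   # passed a gate towards the goal
    if ring > previous_ring:
        return -1  # passed a gate away from the goal
    return 0
```

Note that neither function uses the maze layout beyond a ring index, which matches the stated aim of keeping the reward independent of the maze geometry.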

IV-B Deep Reinforcement Learning on Simulated Robot Environments

In order to learn a robustified policy in simulation, we adopt the idea of randomization from [17, 13, 14]. We implemented two learning schemes. In the first scheme, each agent was assigned different parameters which were kept fixed for the duration of learning. In the second scheme, the physics and appearance parameters are randomly sampled from a pre-determined range, according to a uniform distribution, for each episode and each agent. We found that the second scheme produced robustified policies which adapted more quickly during fine-tuning on the real robot after transfer.

We use the asynchronous advantage actor-critic (A3C) algorithm to learn a policy for the marble maze game. To successfully apply reinforcement learning with sparse rewards, a framework of auxiliary tasks may be incorporated [28]. One could consider path following as an auxiliary (dense reward) task. However, we aim to keep our approach as general as possible, and not rely on the geometry of the maze. We instead incorporate pixel change and reward prediction, as proposed by [28]. Pixel change promotes taking actions which result in maximal change between images of consecutive states. In the context of the maze game, we aim to avoid selecting consecutive actions that would result in little to no marble motions. In addition, reward prediction aims to over-represent rewarding events to offset the sparse reward signal provided by the reward function. To stabilize learning and avoid settling into sub-optimal policies we employ the generalized advantage estimation as proposed by [29] together with entropy regularization with respect to the policy parameters [1].
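The two stabilizing ingredients mentioned above, generalized advantage estimation and an entropy bonus, can be sketched as follows. This follows the standard formulation of generalized advantage estimation (exponentially weighted sums of one-step temporal-difference errors); the parameter name `lam` and the function names are illustrative assumptions, and the paper does not state the values it used.

```python
import math
from typing import List


def generalized_advantages(rewards: List[float], values: List[float],
                           bootstrap_value: float, gamma: float,
                           lam: float) -> List[float]:
    """Generalized advantage estimates for one trajectory segment.

    lam = 0 gives one-step TD errors; lam = 1 gives return minus value.
    """
    result = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        result[t] = running
        next_value = values[t]
    return result


def policy_entropy(probabilities: List[float]) -> float:
    """Entropy of a discrete action distribution (zero entries contribute 0)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)
```

A higher entropy term in the loss discourages the policy from collapsing early onto a single action, which is one way to avoid settling into sub-optimal policies.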

IV-B1 Robustified Policies

At the start of each episode, for each agent, the parameter values for static friction, dynamic friction, damping and marble(s) mass are uniformly sampled from a range of values. We emulated a camera delay by rendering frames into a buffer. The camera delay was varied per episode and agent. During each episode the parameters are held constant. Each observation received from the simulator is corrupted by additive white Gaussian noise (AWGN). We experimented with additional appearance changes, such as different light colors and intensities. We found that those changes had little effect on the time required for fine-tuning for our current setup.
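A per-episode randomization step along these lines could look like the sketch below. All numeric ranges, the delay bound, and the class and function names are hypothetical placeholders, since the text does not list the actual values; images are represented as flat lists of intensities in [0, 1] to keep the example dependency-free.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical sampling ranges; the actual ranges are not given in the text.
RANGES: Dict[str, Tuple[float, float]] = {
    "static_friction": (0.05, 0.15),
    "dynamic_friction": (0.01, 0.05),
    "damping": (0.001, 0.01),
    "marble_mass": (0.004, 0.008),
}
MAX_DELAY_FRAMES = 2  # hypothetical upper bound on the emulated camera delay


@dataclass(frozen=True)
class EpisodeParams:
    static_friction: float
    dynamic_friction: float
    damping: float
    marble_mass: float
    camera_delay_frames: int


def sample_episode_params(rng: random.Random) -> EpisodeParams:
    """Draw one set of parameters, held constant for the whole episode."""
    physics = {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    delay = rng.randint(0, MAX_DELAY_FRAMES)
    return EpisodeParams(camera_delay_frames=delay, **physics)


class DelayedCamera:
    """Emulates acquisition delay by returning the oldest buffered frame."""

    def __init__(self, delay_frames: int) -> None:
        self._buffer: deque = deque(maxlen=delay_frames + 1)

    def push(self, frame):
        self._buffer.append(frame)
        return self._buffer[0]


def add_awgn(pixels: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Corrupt pixel intensities with Gaussian noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]
```

Each agent would call `sample_episode_params` once at the start of an episode, configure the simulator with the result, and pass every rendered frame through `DelayedCamera.push` and `add_awgn` before feeding it to the policy.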

IV-C Deep Reinforcement Learning on Real Robot Environments

A3C is an on-policy method, since the current policy is used in roll-outs (using an $\epsilon$-greedy exploration strategy) to obtain the current trajectory of length $t_{\max}$. For each update, A3C accumulates the losses for the policy and value networks over the trajectory and performs backpropagation of the losses to update the policy and value network parameters. The simulation is halted until the network parameters have been updated, and then roll-outs for the next trajectory continue using the updated policy.


For a real robot setup we need to be able to compute an update, while simultaneously collecting the next trajectory, since we cannot halt the motion of the marble(s) during an update. We therefore adopt an off-policy approach for the real robot setups (see Algorithm 1).

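  $\theta, \theta_v$: initialized or robustly learned in simulation
  obtain $\tau_0$ using $\pi(\cdot\,; \theta)$
  $i \leftarrow 1$
  repeat
     do {concurrently}
         compute update of $\theta, \theta_v$ from $\tau_{i-1}$
         obtain $\tau_i$ using $\pi(\cdot\,; \theta)$
     end
     $i \leftarrow i + 1$
  until done
Algorithm 1: Off-policy A3C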

We acquire the next trajectory $\tau_i$ while concurrently computing the updates for the policy and value networks based on the previously acquired trajectory $\tau_{i-1}$. We first verified in simulation that our off-policy adaptation of A3C would indeed be able to successfully learn a policy to solve the marble maze. If one had access to multiple robots, the robots could act as parallel agents similar to the case of simulation. However, due to practical limitations, we only have access to a single robot and are thus limited to training with a single agent in the real world case.
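The overlap between collecting a trajectory and computing the update from the previous one can be sketched with a background worker. This is only an illustration of the control flow, not the authors' implementation: `collect` and `update` are hypothetical callables standing in for the robot roll-out and the network update, and `params` stands in for the policy and value network parameters.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

Params = TypeVar("Params")
Trajectory = TypeVar("Trajectory")


def run_off_policy(collect: Callable[[Params], Trajectory],
                   update: Callable[[Params, Trajectory], Params],
                   params: Params, iterations: int) -> Params:
    """Overlap the update from the previous trajectory with the next roll-out."""
    trajectory = collect(params)  # first trajectory, before any update
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(iterations):
            # Start the update from the previously acquired trajectory ...
            pending = pool.submit(update, params, trajectory)
            # ... while the next trajectory is collected with the old parameters.
            trajectory = collect(params)
            params = pending.result()
    return params
```

Because each trajectory is gathered with parameters that are one update behind the ones being computed, the data is slightly stale with respect to the learner, which is what makes the scheme off-policy.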

V Implementation

We have implemented a simulation of the marble maze using MuJoCo [30] to simulate the dynamics, and Ogre 3D [31] for the appearance. We carefully measured the maze and marble dimensions to accurately reconstruct the 3D geometry. In order to match the simulated dynamics to the real world dynamics, we tuned the MuJoCo parameters, in particular the static friction, dynamic friction, and damping parameters. For tuning, the maze was inclined to a known orientation, and the marble was released from various pre-determined locations within the maze. Using the markers (see Figure 1) we aligned the images of the simulated maze to the real maze by computing a homography warp. We then empirically tuned the parameters to match the marble oscillations between the simulated and real maze. Learning the parameters instead would be preferable, but this is left as future work. The simulator is executed as a separate process, and communication between controller and simulator is performed via sockets. The simulator receives an action to perform, and returns an image of the updated marble positions and maze orientation, along with a reward (according to Eq. 1) and terminal flag.

The policy network consists of two convolutional layers, followed by a fully-connected layer. The input to the network is an 84×84 image. A one-hot action vector and the reward are appended to the 256-dim. output of the fully-connected layer and serve as input to an LSTM layer. This part of the network is shared between the policy (actor) and value (critic) network. For the policy network a fully-connected layer with softmax activation computes a distribution over the actions. For the value network, a fully connected layer outputs a single value. We empirically chose

and .
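To make the tensor sizes in the architecture description concrete, the short calculation below traces an 84×84 input through two convolutional layers and derives the LSTM input width. The kernel sizes, strides, and filter count are assumptions chosen for illustration (they are not stated in the text); only the 84×84 input, the 256-dimensional fully-connected output, the five actions, and the appended scalar reward come from the description.

```python
def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1


# Assumed (hypothetical) convolution settings: (kernel, stride) per layer.
CONV_LAYERS = [(8, 4), (4, 2)]
LAST_CONV_FILTERS = 32  # hypothetical number of filters in the second layer

FC_OUTPUT_DIM = 256  # from the text
NUM_ACTIONS = 5      # four rotation increments plus no-op
REWARD_DIM = 1       # the previous reward, appended as a scalar


def flattened_conv_features(input_size: int = 84) -> int:
    size = input_size
    for kernel, stride in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride)
    return size * size * LAST_CONV_FILTERS


def lstm_input_dim() -> int:
    """FC output concatenated with the one-hot action and the reward."""
    return FC_OUTPUT_DIM + NUM_ACTIONS + REWARD_DIM
```

Under these assumed settings the fully-connected layer maps the flattened convolutional features to 256 values, and the LSTM receives 256 + 5 + 1 = 262 inputs per step.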

The $(s_t, a_t, r_t)$-tuples are stored in a FIFO experience buffer (of length 3000). We keep track of which tuples have zero and non-zero rewards for importance sampling. For reward prediction we (importance) sample three consecutive frames from the experience buffer. The two convolutional layers and fully connected layer are shared with the policy and value networks. Two more fully connected layers determine a distribution over negative, zero or positive rewards.
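A FIFO buffer with reward-aware sampling can be sketched as below. Two simplifying assumptions are made here: sequences that end in a zero reward and sequences that end in a non-zero reward are drawn with equal probability, and the two index sets are recomputed on each call rather than tracked incrementally. Class and method names are illustrative.

```python
import random
from collections import deque
from typing import Any, List, Tuple

Transition = Tuple[Any, int, float]  # (state, action, reward)


class ExperienceBuffer:
    """FIFO buffer of (state, action, reward) tuples."""

    def __init__(self, capacity: int = 3000) -> None:
        self._data: deque = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._data)

    def add(self, state: Any, action: int, reward: float) -> None:
        self._data.append((state, action, reward))  # oldest entry drops out

    def sample_reward_sequence(self, rng: random.Random,
                               length: int = 3) -> List[Transition]:
        """Sample `length` consecutive tuples, balancing rewarding endings."""
        data = list(self._data)
        if len(data) < length:
            raise ValueError("not enough tuples in the buffer")
        ends = range(length - 1, len(data))
        rewarding = [i for i in ends if data[i][2] != 0]
        neutral = [i for i in ends if data[i][2] == 0]
        if rewarding and neutral:
            pool = rewarding if rng.random() < 0.5 else neutral
        else:
            pool = rewarding or neutral
        end = rng.choice(pool)
        return data[end - length + 1:end + 1]


def reward_class(reward: float) -> int:
    """Target class for reward prediction: 0 negative, 1 zero, 2 positive."""
    if reward < 0:
        return 0
    return 1 if reward == 0 else 2
```

Balancing the two cases over-represents the rare rewarding events, which is the purpose of the importance sampling described above.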

For pixel change, we compute the average pixel-change for a 20×20 grid, for the central 80×80 portion of consecutive images. The pixel-change network re-uses the layers up to and including the LSTM layer for the policy and value network. A fully connected layer together with a deconvolution layer predict 20×20 pixel change images. At most frames are sampled from the experience buffer, and we compute the L2 loss between the pixel change predicted by the network, and the recorded pixel change over the sampled sequence. Both losses are added to the A3C loss.
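The pixel-change target can be sketched as follows: crop the central 80×80 region of two consecutive frames, split it into a 20×20 grid of 4×4 cells, and average the absolute intensity difference in each cell. Images are plain nested lists of grayscale intensities here to keep the example dependency-free, and the sum-of-squares form of the L2 loss is an assumption, since the exact normalization is not specified.

```python
from typing import List

Image = List[List[float]]


def pixel_change(previous: Image, current: Image,
                 crop: int = 80, cells: int = 20) -> List[List[float]]:
    """Average absolute pixel change per grid cell of the central crop."""
    height, width = len(previous), len(previous[0])
    top, left = (height - crop) // 2, (width - crop) // 2
    size = crop // cells  # side length of one cell in pixels
    grid = [[0.0] * cells for _ in range(cells)]
    for gy in range(cells):
        for gx in range(cells):
            total = 0.0
            for y in range(top + gy * size, top + (gy + 1) * size):
                for x in range(left + gx * size, left + (gx + 1) * size):
                    total += abs(current[y][x] - previous[y][x])
            grid[gy][gx] = total / (size * size)
    return grid


def l2_loss(predicted: List[List[float]], target: List[List[float]]) -> float:
    """Sum of squared differences between two pixel-change grids."""
    return sum((p - t) ** 2
               for p_row, t_row in zip(predicted, target)
               for p, t in zip(p_row, t_row))
```

With an 84×84 input the crop discards a two-pixel border, so changes at the very edge of the image do not contribute to the target.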

The physics parameters are uniformly sampled from a range around the empirically estimated parameter values. Due to the lack of intuitive interpretation of some of the physics parameters, the range was determined by visually inspecting the resulting dynamics to ensure that the dynamics had sufficient variety, but did not lead to instability in the simulation.

For the real setup, the ROS framework is used to integrate the learning with camera acquisition and robot control. The camera is an Intel RealSense R200 and the robot arm is a Mitsubishi Electric Melfa RV-6SL (see Figure 1–Middle). The execution time of a rotation command for the robot arm is about 190ms. Forward passes through the networks and additional computation time add up to about 20 or 30ms. Although we can overlap computation and robot command execution to some degree, observations are acquired at a framerate of 4.3Hz, i.e. 233ms intervals, to ensure robot commands are completed entirely before the new state is obtained. We observed that during concurrent network parameter updates the computation time for a forward pass through the policy network increases drastically. If we expect that the robot action cannot be completed before the new state is observed by the camera, we set the action to no-op (Sec. IV-A). We implemented a simple marble detector to determine when a marble has passed through a gate, in order to provide a reward signal. For learning in simulation we use the same 4.3Hz framerate. Each incremental rotation action is performed over the course of the allotted time interval of 233ms, such that the next state provided by the simulator reflects the situation after a complete incremental rotation.
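The timing constraint described above can be checked with a few lines. The frame rate and the robot command time come from the text; the function names, the index used for the no-op action, and the assumption that computation and command execution run strictly one after the other (the text notes they can partly overlap) are illustrative simplifications.

```python
FRAME_RATE_HZ = 4.3       # observation rate from the text
ROBOT_COMMAND_MS = 190.0  # approximate execution time of one rotation command
NO_OP = 4                 # hypothetical index of the no-op action


def frame_interval_ms(rate_hz: float = FRAME_RATE_HZ) -> float:
    """Time available between two consecutive observations."""
    return 1000.0 / rate_hz


def select_safe_action(policy_action: int, compute_ms: float) -> int:
    """Fall back to no-op when the command would not finish before the next frame.

    Conservatively assumes computation and command execution do not overlap.
    """
    if compute_ms + ROBOT_COMMAND_MS > frame_interval_ms():
        return NO_OP
    return policy_action
```

With the usual 20 to 30 ms of computation the budget of roughly 233 ms is met, whereas a forward pass slowed down by a concurrent parameter update can exceed it and trigger the no-op fallback.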

VI Results

             Online (real)   Offline (simulator)   TL (online part)
Robust       3.5M            4.0M                  55K
Non-Robust   3.5M            4.5M                  220K
Table I: Comparison of online, offline and online fine-tuning steps for TL for a single marble. A robustified policy can reduce the training steps by a factor of roughly 60 compared to online training, and by a factor of more than 3 compared to non-robustified TL fine-tuning.

Table I compares the number of steps for training a policy to successfully play a one-marble maze game. Training directly on the real robot takes about 3.5M steps. For TL, we compare the number of fine-tuning steps necessary for a robustified policy versus a non-robustified policy (fixed parameters). Training a robustified policy in simulation takes about 4.0M steps, whereas a non-robustified policy takes approximately 4.5M steps to achieve a 100% success rate. TL of a robustified policy requires about 55K steps to "converge". This is a reduction of roughly 60× compared to online training. A non-robustified policy requires at least 3× the number of fine-tuning steps in order to achieve the same level of success in solving the maze game.
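The reduction factors follow directly from the step counts in Table I:

$$ \frac{3.5 \times 10^{6}}{55 \times 10^{3}} \approx 63.6, \qquad \frac{220 \times 10^{3}}{55 \times 10^{3}} = 4 . $$

The first ratio compares online training with fine-tuning of the robustified policy; the second compares fine-tuning of the non-robustified policy with fine-tuning of the robustified one.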

Figure 2: Results for the fine-tuning of policies solving a maze game with one marble, for a simulation pre-trained robustified policy (Left) and for a simulation pre-trained non-robustified policy (Right). Note that the horizontal axis ranges of Left and Right are slightly different. In the Top row we plot the number of steps per episode (bounded by the maximum episode length), and in the Bottom row we plot the accumulated rewards per episode. The fine-tuning of the robustified policy leads to earlier success in consistently solving the maze game. We consider convergence at ~55K steps for the robustified policy. Even after more than ~220K fine-tuning steps, the non-robustified policy occasionally fails to solve the maze game. In addition, the average number of steps per episode needed to solve the maze game is significantly lower for the robustified policy.

Figure 2 further shows the benefit for TL of a robustified policy. The left side of Figure 2 shows results for the robustified policy, with results for the non-robustified policy on the right. The bottom row shows the accumulated rewards for an episode. An accumulated reward of 4.0 means that the marble has been maneuvered from the outside ring into the center, since there are four gates to pass through. The graph for the robustified policy shows that the learning essentially converges, i.e., achieves 100% success, whereas for the non-robustified policy transfer the success rate is around 90%. The top row of Figure 2 shows the length of each episode. It is evident that the robustified policy has successfully learned how to handle the complex dynamics to solve the maze game.

We repeated the same experiment for a two-marble maze game, with the goal to get both marbles into the center of the maze. We only compared TL with the robustified policy. The results are shown in Table II. Learning a two-marble game in simulation with rewards achieved 100% success. However, training on the real setup with these rewards proved very challenging. We believe this is due to the geometry of the maze (the center has only one gate, surrounded by four gates in the adjacent ring) coupled with the static friction. We designed a reward function which gives more importance to passing through gates into rings closer to the goal. This encourages a marble to stay in the center area while the controller maneuvers the remaining marble. The rewards were modified to instead (which was also used for training the two marble game offline). When learning online, even after 1M steps, the success rate is still at 0% (a single marble reached the center about a dozen times). With fine-tuning of a transferred robustified policy, a success rate of around 75% is achieved after 225K steps.

         Online     Offline       TL
Robust   1M (0%)    3.0M (100%)   225K (75%)
Table II: Comparing TL for a two-marble maze game. Both the number of steps and the success rate are reported.

We also investigated whether transferring a single-marble policy learned offline would require longer fine-tuning for a two-marble game online. After 100K steps of fine-tuning, the policy was able to start solving the game. A success rate of about 50% was achieved after 400K steps. Thus, fine-tuning a robustified policy trained on a two-marble maze game in simulation achieves a higher success rate than fine-tuning a single-marble robustified policy.

We refer the reader to the supplemental material for videos of example roll-outs for single and two marble maze games.

VII Discussion and Future Work

Deep reinforcement learning is capable of learning complicated robot tasks, and in some cases achieving (beyond) human-level performance. Deep RL requires many training samples, especially in the case of model-free approaches. For learning robot tasks, learning in simulation is desirable since robots are slow, can be dangerous and are expensive. Powerful GPUs and CPUs have enabled simulation of complex dynamics coupled with high quality rendering at high speeds. Transfer learning, i.e., training in simulation and subsequent transfer to the real world, is typically followed by fine-tuning. Fine-tuning is necessary to adapt to any differences between the simulated and the real world. Previous work has focused on transfer learning for tasks involving linear dynamics, such as controlling a robot to pick an object and place it at some desired location. In contrast, we explore the case where the dynamics are complex. Non-linearities arise due to static and dynamic friction, acceleration and collisions of objects interacting with each other and the environment. We compare learning online, i.e., directly in the real world, with learning in simulation where the physics, appearance and system parameters are varied during training. For reinforcement learning we refer to this as learning robustified policies. We show that the time required for fine-tuning with robustified policies is greatly reduced.

Although we have shown that model-free deep reinforcement learning can be successfully used to learn tasks involving complex dynamics, there are drawbacks of using a model-free approach. In the example discussed in our paper, the dynamics are (mostly) captured by the LSTM layer in the network. In the case of more than one marble the amount of fine-tuning time significantly increases. In general, as the complexity of the state space increases, the amount of training time increases as well. When people perform tasks such as the maze game, they typically have a decent prediction of where the marble(s) will go given the amount of rotation applied. In [32, 33] the graphics and physics engine are embedded within the learning to recover physics parameters and perform predictions of the dynamics. In [34] the physics and dynamics predictions are modeled with networks. These approaches are interesting research directions for tasks involving complex dynamics.

We currently use high-dimensional images as input to the learning framework. Low-dimensional input, i.e. marble position and velocity, may be used instead. In addition, rather than producing a distribution over a discrete set of actions, the problem can be formulated as a regression problem that directly produces values for the $x$ and $y$ axis rotations [35, 1].

People quickly figure out that the task can be broken down into moving a single marble at a time into the center, while preventing marbles already in the center from spilling back out. Discovering such sub-tasks automatically would be another interesting research direction. Along those lines, teaching a robot to perform tasks by human demonstration, or imitation learning, could teach robots complicated tasks without the need for elaborate reward functions, e.g.,



We want to thank Rachana Sreedhar for the implementation of the simulator and Wei-An Lin for the PyTorch implementation of deep reinforcement learning.