I Introduction
In this paper, we present a trajectory optimization algorithm for robots with unknown dynamics operating under state and control constraints using Reinforcement Learning (RL). Trajectory optimization is a procedure that produces state and control sequences for a dynamical system subject to the relevant constraints. Most state-of-the-art motion planning algorithms generate a plan in the configuration space of the robot, which the robot then follows using a trajectory tracking controller [1]. When the robot dynamics are unknown, traditional PID controllers are commonly used for trajectory tracking. Most planning algorithms ignore the dynamics of the robot and hence return a trajectory that is often unsuitable for many target applications [1]. For example, many industrial robots can achieve very high accelerations and torques that can potentially damage the robot as well as the components being manipulated. As an example, in Figure 1 we show a manipulator arm trying to assemble parts of a computer. In such applications, the robot assembles delicate parts connected by fixed-length wires that can easily break if high torques are applied. Furthermore, many users deploying such robots have access only to a high-fidelity simulator for the system and its kinematic equations, but not to the true robot dynamics. As a result, many existing results from model-based trajectory optimization cannot be used. Consequently, for many applications, a lot of time is spent manually designing trajectory tracking controllers. However, high-fidelity simulators provide a good resource for training an RL agent in the simulated environment, to be used on the real system later.
A model-free RL agent [2] can learn feasible trajectories under dynamic constraints; however, it requires a long time to converge to good solutions. In light of this, we propose to train an RL agent to generate dynamically feasible trajectories, where a sampling-based motion planning algorithm guides the RL agent toward faster learning and convergence. We use an RRT-based [1] trajectory to guide an RL agent to learn dynamically feasible trajectories in the presence of state and input constraints. The RL agent is trained in a curriculum learning setting for faster convergence, and we show that we can achieve good generalization when conditioning the policy on the goal. The proposed algorithm is tested in several simulated environments, as well as with a real robot, by transferring the learned policy from the simulator to the robot. In Figure 1, we show a manipulator arm assembling parts in an environment cluttered with obstacles (a desktop computer) using the proposed algorithm.
Contributions. Our paper has the following contributions.

We propose an RL-based algorithm to efficiently generate control-smooth trajectories for unknown constrained dynamical systems using a goal-directed reference trajectory.

The proposed algorithm can generalize to local perturbations in goal position in the presence of obstacles.

We demonstrate the proposed algorithm in both simulation and experimental environments, where it outperforms a baseline in which a trajectory generated by RRT is tracked by a fine-tuned PID controller.
II Related Work
Reinforcement learning has recently made huge advances, building on the success of deep learning techniques. Recent RL algorithms have achieved very impressive performance in computer games [3, 4], robotics [5], etc. Broadly, RL algorithms can be classified as model-based or model-free [2, 6]. Model-based algorithms can achieve good sample complexity and generalization, but are generally known to be harder to train for nonlinear dynamical systems. Model-free algorithms, on the other hand, can achieve good asymptotic performance, but suffer from high sample complexity. A lot of recent research has focused on leveraging ideas from control and optimization theory for faster learning [7, 8, 9]. In many robotics applications, it is generally advantageous to initialize RL agents with demonstrations, which provide them with an initial reference solution [10]. Learning policies from reference trajectories has been studied in [11, 12, 13]. Motivated by this idea, our work mainly focuses on using reference trajectories provided by off-the-shelf planning algorithms to speed up learning for our RL agent. The planning algorithm serves as a demonstrator for the learning algorithm. The work closest to ours is [13]. In [13], a model-based RL agent is trained using a trajectory-centric RL (Guided Policy Search) approach to learn a trajectory-tracking controller for a trajectory provided by RRT [1]. However, using model-based RL in constrained state and control settings could be difficult, because it is not clear how the underlying trajectory optimization algorithm [14] can account for arbitrary state constraints for manipulator-like systems.
Our combination of RL and reference trajectory tracking can be seen as a form of reward shaping [15]. Reward shaping speeds up learning by creating a more informative reward signal. However, designing shaping rewards requires significant nontrivial reward engineering, and may also alter the optimal solution. To alleviate this problem, automatic reward shaping has been researched [16, 17].
III Background
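We consider the standard RL setting that consists of an agent interacting with a stochastic environment. An environment consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a distribution of initial states $p(s_0)$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, transition probabilities $p(s_{t+1} \mid s_t, a_t)$, and a discount factor $\gamma \in [0, 1]$. An episode starts with an initial observation $s_0$ sampled from $p(s_0)$. At each time step $t$, the agent observes an observation $s_t$ and chooses an action $a_t$ according to a policy $\pi$, which is a mapping from observations to actions: $\pi: \mathcal{S} \rightarrow \mathcal{A}$. Then, the agent obtains a reward $r_t = r(s_t, a_t)$, and the next state $s_{t+1}$ is sampled from $p(\cdot \mid s_t, a_t)$. The goal of the agent is to maximize the expected discounted sum of rewards $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$. The quality of the agent’s action $a_t$ when receiving an observation $s_t$ can be measured by a Q function $Q^{\pi}(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$.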
DDPG [18] is a model-free, Q-learning-based reinforcement learning algorithm that extends the earlier DQN agent [4] to continuous action spaces. In DDPG, we maintain two neural networks: a deterministic policy (called the actor) $\pi(s \mid \theta^{\pi})$ and a Q function approximator (called the critic) $Q(s, a \mid \theta^{Q})$, parameterized by sets of parameters $\theta^{\pi}$ and $\theta^{Q}$, respectively. The actor network deterministically maps observations to actions and tries to maximize the critic's estimate $Q(s, \pi(s))$. DDPG trains the critic network to estimate $Q$ by minimizing the Bellman loss:
$$L(\theta^{Q}) = \mathbb{E}\left[\left(Q(s_t, a_t \mid \theta^{Q}) - y_t\right)^2\right], \qquad (1)$$
where the step target $y_t$ is calculated using target networks $Q'$ and $\pi'$ as
$$y_t = r_t + \gamma\, Q'\left(s_{t+1}, \pi'(s_{t+1})\right). \qquad (2)$$
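As a minimal illustration of the critic update, the sketch below computes a one-step bootstrapped target and the mean squared Bellman error in plain Python. The function names, the `done` flag for terminal transitions, and the numeric values are assumptions made for illustration; they are not taken from the implementation used in this paper.

```python
def ddpg_target(reward, next_q, gamma, done=False):
    """One-step target: reward plus the discounted target-critic value.

    `next_q` stands for the target critic evaluated at the next state and
    the target actor's action. Skipping the bootstrap on terminal steps
    is an assumption of this sketch.
    """
    return reward + (0.0 if done else gamma * next_q)


def bellman_loss(q_values, targets):
    """Mean squared error between critic estimates and their targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Two illustrative transitions (values are arbitrary).
targets = [ddpg_target(1.0, 2.0, 0.9), ddpg_target(-1.0, 5.0, 0.9, done=True)]
loss = bellman_loss([2.5, -1.0], targets)
```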
Each transition of the agent is stored in a replay buffer, from which minibatches are sampled to train the networks. This stabilizes training by removing temporal correlations between samples, and therefore reduces the changes in the distributions the networks are trying to learn. Additionally, a prioritized replay buffer [19] assigns a priority to each transition, computed from the transition's last temporal difference (TD) error and a small positive hyperparameter that keeps every priority nonzero. For more details, see [19].
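The prioritization scheme can be sketched as follows, using the proportional variant described in [19]: a transition's priority is the magnitude of its TD error plus a small constant, and sampling probabilities are proportional to priorities raised to an exponent. The values of `eps` and `alpha` below are placeholders, not the settings used in this paper, and importance-sampling corrections are omitted.

```python
import random


def priorities(td_errors, eps=1e-3):
    """Priority of each transition: |TD error| + eps (eps is a placeholder)."""
    return [abs(d) + eps for d in td_errors]


def sampling_probs(prios, alpha=0.6):
    """Sampling probabilities proportional to priority ** alpha."""
    scaled = [p ** alpha for p in prios]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(probs, batch_size, rng):
    """Draw a minibatch of transition indices (with replacement)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


probs = sampling_probs(priorities([0.0, 1.0, 2.0]))
batch = sample_indices(probs, 4, random.Random(0))
```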
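However, earlier research has shown that DDPG is prone to overestimating Q-values, which results in suboptimal policies. TD3 [20] implements three improvements to address the overestimation resulting from approximation errors. First, it maintains two independent critic networks and always uses the smaller of the two Q-values as the optimization target. Second, it delays policy and target-network updates relative to the critic updates. Finally, it explicitly increases the smoothness of the Q-function prediction by adding clipped Gaussian noise to the action used to compute the target Q-value. Using these three improvements, we can replace the step target of the critic defined in (2) with
$$y_t = r_t + \gamma \min_{i=1,2} Q'_i\left(s_{t+1}, \pi'(s_{t+1}) + \epsilon\right), \quad \epsilon \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma), -c, c\right), \qquad (3)$$
where $Q'_1$ and $Q'_2$ are the two target critics, $\pi'$ is the target actor, and $c$ bounds the smoothing noise.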
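The TD3 target can be sketched in a few lines, assuming the target actor and the two target critics are available as plain Python callables. The callables, the noise parameters, and the action bound `max_action` are hypothetical stand-ins for the networks and settings of an actual implementation.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def td3_target(reward, next_state, target_actor, target_q1, target_q2,
               gamma, sigma, noise_clip, max_action, rng):
    """Clipped double-Q target with target policy smoothing."""
    action = target_actor(next_state)
    # Add clipped Gaussian noise to each action dimension, then keep the
    # result inside the valid action range.
    noisy = [
        clip(a + clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip),
             -max_action, max_action)
        for a in action
    ]
    # Use the smaller of the two target-critic estimates.
    q_min = min(target_q1(next_state, noisy), target_q2(next_state, noisy))
    return reward + gamma * q_min


# Toy callables; sigma=0 disables the noise so the result is deterministic.
y = td3_target(
    reward=1.0, next_state=[0.0],
    target_actor=lambda s: [0.5],
    target_q1=lambda s, a: 1.0 + a[0],
    target_q2=lambda s, a: 2.0,
    gamma=0.9, sigma=0.0, noise_clip=0.5, max_action=1.0,
    rng=random.Random(0),
)
```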
IV Proposed Approach
In this section, we present details of the algorithm and some techniques that allow us to train it efficiently. We train a TD3 agent (see Section III) using a reference trajectory provided by RRT. Furthermore, due to the constrained nature of the problem (presence of obstacles), we use curriculum learning to simplify learning for the TD3 agent. These are explained in detail next, and presented as pseudocode for clarity in Algorithms 1 and 2.
IV-A Reinforcement Learning with Reference Trajectory
We consider the standard RL problem described in Section III with a reference trajectory. We incorporate the information about the reference trajectory into the reward function as an additional term $r_{\mathrm{ref}}$. Therefore, the reward function can be written as
$$r = r_{\mathrm{RL}} + r_{\mathrm{ref}}, \qquad (4)$$
where $r_{\mathrm{RL}}$ is the reward that originates from a pure RL setting, and $r_{\mathrm{ref}}$ is calculated using the reference trajectory. The idea is to accelerate the learning process through the additional term in the reward function. This term penalizes search too far from the reference trajectory, and thus limits the search space for the agent.
In prior work, an expert trajectory is generally used to define this reference term. In contrast, we use a standard RRT algorithm to generate a reference path, because the computational cost of generating a path is much smaller than that of RL. Because RRT is a random-sampling-based algorithm, it produces a jerky path, and this results in jerky trajectories, because the critic directly optimizes the reward function. To mitigate this problem, we investigated two improvements.
First, we reduce the number of vertices that describe the trajectory by randomly shortcutting between them, as described in [21]. To do so, we randomly pick two vertices and subdivide the segment that connects them at a fixed interval. We then check each subdivision point for contact with obstacles; if none of them collides, we shortcut the path, i.e., omit the vertices between the two selected vertices.
Second, in every episode we replace the reference trajectory with a path found by the RL agent during training, provided that it satisfies the following conditions: 1) the path reaches the goal without colliding with obstacles; 2) the total number of steps to reach the goal is the lowest so far; and 3) the cumulative reward is the highest so far.
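The shortcutting step (the first improvement above) can be sketched as follows. This is a straight-line variant written from the description in the text, not the bounded-acceleration method of [21]; the collision checker `in_collision`, the subdivision `step`, and the iteration count are hypothetical placeholders.

```python
import math
import random


def subdivide(p, q, step):
    """Points from p to q (inclusive), spaced at most `step` apart."""
    n = max(1, math.ceil(math.dist(p, q) / step))
    return [[a + (b - a) * k / n for a, b in zip(p, q)] for k in range(n + 1)]


def shortcut(path, in_collision, step=0.05, iterations=100, rng=None):
    """Randomly remove intermediate vertices when the direct segment is free."""
    rng = rng or random.Random()
    path = [list(v) for v in path]
    for _ in range(iterations):
        if len(path) < 3:
            break
        i, j = sorted(rng.sample(range(len(path)), 2))
        if j - i < 2:
            continue  # adjacent vertices: nothing to remove
        if not any(in_collision(v) for v in subdivide(path[i], path[j], step)):
            path = path[:i + 1] + path[j:]  # drop vertices strictly between i and j
    return path


zigzag = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]]
smoothed = shortcut(zigzag, lambda v: False, rng=random.Random(0))
```

The start and goal vertices are never removed, so the shortcut path still connects the same endpoints.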
IV-B Resets to Reference Path
To overcome the problem of exploration, we reset some training episodes to a state on a reference path with a given probability. Restarting from such states makes the agent explore more efficiently, because the reference trajectory is guaranteed to reach the goal. Prior work [10] employs expert trajectories and resets to a state in them; however, we do not have such expert trajectories. Instead, we utilize a reference path that ensures that the goal is reached. To reset to a reference path, we uniformly sample joint angles from the set of reference trajectories, and set the start angles to the sampled values.
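A minimal sketch of this reset rule is given below. The function name and the representation of the reference path as a list of joint-angle vectors are assumptions for illustration, and the reset probability is left as a parameter.

```python
import random


def sample_start(default_start, reference_path, reset_prob, rng):
    """Choose the start angles for a new episode.

    With probability `reset_prob`, start from a uniformly sampled vertex
    of the reference path; otherwise use the default start pose.
    """
    if reference_path and rng.random() < reset_prob:
        return list(rng.choice(reference_path))
    return list(default_start)


rng = random.Random(0)
reference = [[0.0, 0.0], [0.5, 0.2], [1.0, 0.4]]
never = sample_start([0.0, 0.0], reference, reset_prob=0.0, rng=rng)
always = sample_start([9.0, 9.0], reference, reset_prob=1.0, rng=rng)
```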
IV-C Curriculum Learning
Generally, RL is harder to train if the reward function is sparse and if episodes are longer. To simplify learning for the agent, we gradually increase the complexity of the problem in a curriculum learning setting. We use curriculum learning in two different ways. First, we train the RL agent to learn a controller close to the reference trajectory without any sparse penalty for collision, with the intuition that state constraints make the problem harder. This in turn provides denser rewards for reaching the goal. Once the agent has learned to reach the goal in the absence of obstacles, we introduce a penalty for collision, which we gradually increase, so that the agent adapts to the obstacles. Second, to make learning easier, we gradually shrink the goal region for the agent. For our problem, we define an acceptable goal region around the goal joint angles, and we declare success when the robot's joint angles fall within it. (Note that the goal is six-dimensional.)
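The two curriculum schedules can be sketched as simple linear interpolations driven by the number of goals reached so far, in line with the description in the Appendix. All thresholds and values in the example (`warmup_goals`, `ramp_goals`, tolerances, and the maximum penalty) are hypothetical placeholders, not the values used in our experiments.

```python
def linear_schedule(start, end, progress):
    """Interpolate from `start` to `end` as progress goes from 0 to 1."""
    p = min(1.0, max(0.0, progress))
    return start + (end - start) * p


def curriculum(goals_reached, warmup_goals, ramp_goals,
               tol_start, tol_end, penalty_max):
    """Return (goal tolerance, collision penalty) for the current stage.

    Stage 1: no collision penalty while the goal tolerance shrinks.
    Stage 2: tolerance stays at its final value while the penalty ramps up.
    """
    if goals_reached < warmup_goals:
        tol = linear_schedule(tol_start, tol_end, goals_reached / warmup_goals)
        return tol, 0.0
    progress = (goals_reached - warmup_goals) / ramp_goals
    return tol_end, linear_schedule(0.0, penalty_max, progress)


stage1 = curriculum(50, 100, 100, 0.2, 0.05, 10.0)
stage2 = curriculum(150, 100, 100, 0.2, 0.05, 10.0)
```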
The gradual change in the collision penalty forces the critic to fit noisy targets. This could lead the actor to converge to a non-optimal local minimum, since the actor learns from gradients computed using the critic network. To avoid this, we store past good experiences in a separate replay buffer, distinct from the prioritized replay buffer described in Section III, and encourage the actor to choose the same actions as the past good experiences in any given state. This additional buffer stores the past best episodes as in [22], in the sense that 1) the RL agent reaches the goal without colliding with any obstacle, and 2) the RL agent obtains higher episode rewards.
To imitate from such good experiences, we use the behavioural cloning loss which was proposed in [10] and is defined as:
(5) 
After sufficient training, the agent might surpass the performance of the past best experiences, and imitating them would then become detrimental to the agent’s performance. The Q-filter mitigates this problem by applying the behavioral cloning loss only if the critic judges that the action proposed by the actor is worse than the action of the demonstrator, which in our setting is the past good experiences. With the behavioral cloning loss added, the actor loss becomes:
(6) 
where the weighting coefficient is a hyperparameter that balances the actor's learning from the critic against learning from past good experiences.
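The Q-filtered behavioral cloning term can be sketched as below, following the idea in [10]: the squared imitation error is counted only for samples where the critic rates the stored action above the actor's current action. The actor and critic are represented by hypothetical callables, and how the term is weighted against the critic-based objective is left out because it depends on the hyperparameter above.

```python
def q_filtered_bc_loss(batch, actor, critic):
    """Sum of squared action errors over samples that pass the Q-filter.

    `batch` is a list of (state, stored_action) pairs taken from the
    buffer of past good experiences.
    """
    total = 0.0
    for state, stored_action in batch:
        policy_action = actor(state)
        # Imitate only when the critic prefers the stored action.
        if critic(state, stored_action) > critic(state, policy_action):
            total += sum((p - d) ** 2
                         for p, d in zip(policy_action, stored_action))
    return total


# Toy example: the critic prefers actions close to 1.0.
toy_actor = lambda s: [0.0]
toy_critic = lambda s, a: -(a[0] - 1.0) ** 2
bc = q_filtered_bc_loss([([0.0], [1.0]), ([0.0], [-1.0])], toy_actor, toy_critic)
```

In the toy example only the first sample passes the filter, so the second (worse) stored action does not pull the actor toward it.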
IV-D Goal Parameterization of Policy
To achieve generalization to perturbations in the target state for the agent, we parameterize the policy of the agent on the goal. The idea is that a goal-parameterized policy represented by a network with enough capacity should be able to generalize to perturbations in the goal location. This is a very desirable property to have in the final policy, because robots are often expected to adapt to some local perturbations in the target state. We assume that the target state $g$ is sampled from a set of goals $\mathcal{G}$. We train a single network to maximize the expected discounted reward over multiple goal states. The learning problem is to optimize the expected discounted reward $\mathbb{E}\left[\sum_{t} \gamma^{t}\, r(s_t, a_t, g)\right]$ across all goals $g \in \mathcal{G}$. The reward $r(s_t, a_t, g)$ is now conditioned on the goal to reflect the fact that rewards depend on the particular goal (or task). This is achieved by increasing the capacity of the network through additional input units. In the simplest setting, we simply pad the network input with extra units that contain the goal information.
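Padding the network input with the goal amounts to a concatenation, as in the sketch below. The dimensions in the example (six joint angles, six angular velocities, and a six-dimensional goal) follow the setup described in this paper; the function name is illustrative.

```python
def goal_conditioned_input(state, goal):
    """Concatenate the state and the goal into a single network input."""
    return list(state) + list(goal)


angles = [0.0] * 6
velocities = [0.0] * 6
goal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
net_input = goal_conditioned_input(angles + velocities, goal)
```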
V System Overview
In this section, we provide relevant details of the simulator and the real system we used in this paper for our experiments.
V-A Hardware
We use a MELFA RVFR robot, an industrial robot with 6 degrees of freedom [23]. The generated trajectories must ensure that the joint angles and angular velocities that make up the trajectories stay within a known specified range. The robot used in the experiments is operated in a position control mode in which a position command is sent to the robot at a fixed interval, set by the minimum operational time of the industrial robot in the real setting. As a result, the control input is the velocity of each joint. We, however, would like to minimize the acceleration (i.e., the derivative of the control signal, or the control jumps) during operation. This is a desirable feature for many industrial manipulators where direct torque control is not accessible.
V-B Simulator
We utilize a simulator to generate trajectories and then deploy them in a real setting. The simulator is a high-fidelity simulator for the MELFA RVFR called RT ToolBox3 [24]. The baseline controller against which we compare the RL agent in this work is a PID trajectory-tracking controller that can be designed in the simulator given a reference trajectory. For our experiment, this function is our initial baseline, which is described in detail in VI. The simulator has a built-in function for collision checking between the manipulator and obstacles present in the environment, and we use the same function for collision detection during planning. However, the proposed algorithm is agnostic to the collision checking method and simulation environment.
VI Experiments
In this section, we describe the environments in which we test our proposed algorithm. In particular, we test it in two simulated environments: a Bookshelf environment (see Figure 1(a)) and an OpenComputer environment (see Figure 1(b)). In these environments, the robot is trying to manipulate objects that can be damaged if excessive torque or acceleration is applied. Furthermore, we show experimental results with a real robot for the OpenComputer environment (see Figure 1(c)). Videos of the learned behavior of the robot can be seen in the supplementary material.
In our experiments, we try to investigate the following questions:

Does the combination of a reference trajectory and RL improve the performance of each one of them in isolation?

Does the proposed algorithm generate feasible trajectories in the presence of state and control constraints better than some of the traditional control techniques of trajectory tracking with a reference trajectory?

Does curriculum learning help the agent learn faster?
In the following text, we answer the above-mentioned questions, and demonstrate that we can generate smooth trajectories and that the agent can generalize to unseen goal conditions when the policy is conditioned on the goal position.
VI-A Environment
VI-A1 States
The states of the system consist of current angles and angular velocities . Therefore, the state set is represented in . The initial angles and angular velocities are deterministically reset to , and .
VI-A2 Actions
The action of the agent $a_t$ is the vector of angular velocities for the next step. Since we consider a six-dimensional configuration space, the action set is six-dimensional. With a time step $\Delta t$ and joint angles $\theta_t$ at step $t$, the angles of the next step can be calculated as
$$\theta_{t+1} = \theta_t + a_t\, \Delta t. \qquad (7)$$
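The angle update can be sketched as a simple integration of the commanded angular velocities over one time step, together with a joint-limit check. The time step and limits in the example are placeholders, not the values of the real robot.

```python
def step_angles(angles, action, dt):
    """Integrate commanded angular velocities over one time step."""
    return [q + v * dt for q, v in zip(angles, action)]


def within_limits(angles, lower, upper):
    """Check that every joint angle stays inside its allowed range."""
    return all(lo <= q <= hi for q, lo, hi in zip(angles, lower, upper))


next_angles = step_angles([0.0, 1.0], [0.5, -0.5], dt=0.1)
ok = within_limits(next_angles, lower=[-1.0, -1.0], upper=[1.0, 1.0])
```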
VI-A3 Rewards
As described in IV-A, we add a term calculated from the reference trajectory to the conventional reward term. First, referring to [25], we define the conventional reward term as
(8) 
Here, the distance term is the Euclidean distance to the goal. Three indicator terms flag, respectively, whether the agent reaches the goal, whether a collision between the agent and the obstacles occurs, and whether the agent violates the joint-angle constraints. The fifth term encourages the agent to generate a smoother trajectory, which is essential when operating the real system. The final term has a negative value, so it encourages the agent to reach the goal in fewer steps.
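To make the structure of the conventional reward concrete, the sketch below combines the six ingredients described above into a single function. The functional form (a weighted sum, with the change in action between consecutive steps as the smoothness measure) and every weight are assumptions made for illustration; they are not the coefficients of Eq. (8).

```python
import math


def task_reward(dist_to_goal, reached, collided, limit_violated,
                action, prev_action, w):
    """Weighted sum of the reward ingredients (all weights hypothetical)."""
    control_jump = math.dist(action, prev_action)  # proxy for acceleration
    return (-w["dist"] * dist_to_goal
            + w["goal"] * reached
            - w["collision"] * collided
            - w["limit"] * limit_violated
            - w["smooth"] * control_jump
            - w["step"])


weights = {"dist": 1.0, "goal": 10.0, "collision": 5.0,
           "limit": 5.0, "smooth": 0.1, "step": 0.01}
r = task_reward(0.5, False, True, False, [0.2, 0.0], [0.0, 0.0], weights)
```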
Then, we design an additional term by using a reference path as
(9) 
where the first quantity is the distance to the reference path and the second is the progress along the path. The first term penalizes search too far from the reference path, and the second term encourages the agent to move toward the goal angles along the reference path.
In order to calculate and , we divide the reference path and agent’s path at regular intervals, as shown in Fig. 3. By dividing the path, we obtain the subsampled vertices for the reference path, and for the agent’s path, where and are the numbers of vertices in each divided path. We can then define the distance to the given path as , where is the distance to the path calculated as . We can also observe the progress along the path as , where is the vertex index of the nearest neighbor to , i.e., .
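The two path-based quantities can be sketched with a nearest-neighbor search over the subsampled reference vertices. The normalization of the progress to the range from 0 to 1 and the use of a single query point are assumptions of this sketch; the exact definitions follow the formulas in the text.

```python
import math


def nearest_index(point, path):
    """Index of the reference vertex closest to `point`."""
    return min(range(len(path)), key=lambda k: math.dist(point, path[k]))


def distance_to_path(point, path):
    """Distance from `point` to its nearest reference vertex."""
    return math.dist(point, path[nearest_index(point, path)])


def progress_along_path(point, path):
    """Fraction of the reference path covered (assumed normalization)."""
    return nearest_index(point, path) / (len(path) - 1)


reference = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
d = distance_to_path([1.1, 0.5], reference)
p = progress_along_path([1.1, 0.5], reference)
```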
Figure caption: episode rewards over random seeds. The bold line shows the average episode reward, and the shaded region is one standard deviation from the average. The plot shows the faster and more stable learning achieved using a reference trajectory that is updated by self-imitation in each episode.
VI-A4 Termination Condition
An episode terminates when either of the following two conditions holds: the joint angles of the agent are sufficiently close to the goal state, as described in IV-C, or the number of steps in the episode exceeds a specified threshold.
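The termination rule can be sketched as below. The use of the Euclidean distance for the goal test, and the tolerance and step limit in the example, are assumptions of this sketch.

```python
import math


def episode_done(angles, goal, tolerance, step, max_steps):
    """Return (done, reached) for the current step of an episode."""
    reached = math.dist(angles, goal) <= tolerance
    return reached or step >= max_steps, reached


done_goal = episode_done([1.0, 1.0], [1.0, 1.05], 0.1, step=10, max_steps=100)
done_time = episode_done([0.0, 0.0], [1.0, 1.0], 0.1, step=100, max_steps=100)
running = episode_done([0.0, 0.0], [1.0, 1.0], 0.1, step=10, max_steps=100)
```

Returning the `reached` flag separately lets the caller distinguish a success from a timeout.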
VI-B BookShelf Environment
The BookShelf environment consists of a two-row, three-stage bookcase, simulating a pick-and-place task. Each cube in the bookshelf is mm deep, mm high and mm wide. The manipulator starts from an initial pose and has to reach different goal points, which are the center positions of the cubes of the bookshelf; the joint-angle vectors are defined as [0, 8, 131, 0, 41, 180], [52, 59, 106, 141, 78, 170], [52, 28, 111, 134, 60, 152], [52, 13, 95, 111, 42, 117], [128, 59, 106, 141, 78, 190], [128, 28, 111, 134, 60, 208], [128, 13, 95, 111, 42, 243]. We define the tasks as reaching from the initial pose to the goal angles defined above.
VI-C OpenComputer Environment
The OpenComputer environment simulates a computer assembly task: picking up a connector and inserting it into a socket mounted on a motherboard, as illustrated in Fig. 1(b). The picking and insertion steps are outside the scope of this work, so the simulation starts from just above the connector location with joint angles of [47, 8, 113, 0, 75, 138], and the goal is near the socket, at [90, 1, 138, 180, 46, 88]. The real setup shown in Fig. 1(c) is the same as the above environment, except that the robot is grasping a connector with a harness. A video of the implementation of the algorithm on the real manipulator is provided in the supplementary materials.
VII Experimental Results
This section presents results from experiments designed to answer the questions described in VI. The baseline against which we compare our proposed method is a combination of a reference path and a PID-based trajectory tracking controller implemented in our simulator, as described in V-B. Note that the reference path is generated using RRT and is smoothed by shortcutting, as described in IV-A.
VII-A Accelerating RL by Using Reference Paths
First, we evaluate the effectiveness of using reference paths to train an RL agent. We compare three learning methods. First, we train an RL agent without a reference path by disabling the reference term in Eq. (9). Second, we train with a fixed reference path. Finally, we train with a reference path that is updated in every episode in which the agent's path satisfies the conditions described in IV-A. The evaluation metric is the cumulative reward that an RL agent collects during an episode. For a fair comparison between methods with and without reference paths, we omit the reward terms that come from the reference path, i.e., the terms in Eq. (9), when computing this metric.
Figure 4 shows the resulting episodic returns. It suggests that using reference paths improves convergence relative to training without a reference path. Updating the reference path improves performance further, because the initial reference path is jerky, which may cause convergence to a non-optimal trajectory. Thus, the use of a reference trajectory for training the RL agent helps speed up policy learning.
VII-B Generating Smoother and Shorter Trajectories using RL
Next, we compare the quality of the trajectories obtained by the proposed algorithm against the baseline method. We use two metrics to quantify trajectory quality: the time needed to reach the goal, and the magnitude of acceleration. Recall that part of the initial motivation for training the agent this way was to minimize control jumps, and thus generate trajectories with limited acceleration. Table I compares the time required to reach the goal by the proposed algorithm and by the baseline. The proposed method reaches the goal faster than the baseline in all seven tasks. Figure 5 shows the angular acceleration during a rollout of the proposed method, compared against the baseline method. The proposed method generates trajectories with much lower acceleration profiles than the baseline in all joints, while also reducing the time taken to reach the goal.
Table I: Time required to reach the goal.
Task  OpenComputer  BookShelf 1  BookShelf 2  BookShelf 3  BookShelf 4  BookShelf 5  BookShelf 6
Baseline  0.82  0.56  0.65  0.62  0.80  0.75  1.28
Ours  0.22  0.34  0.26  0.25  0.48  0.55  0.50
VII-C Curriculum Learning
Next, we investigate how our curriculum learning helps in training our RL agent. We compare our full model with variants without curriculum learning and without self-imitation for task 1 in the BookShelf environment. Note that in TD3, the actor learns to maximize the Q function (critic) parameterized by a neural network. Therefore, if the Q-function estimate is inaccurate, it gives undesirable gradients to the actor, which results in lower episode rewards.
Fig. 6 shows a comparison of convergence rates of the agent using different methods. Without curriculum learning, the agent achieves slower convergence, because the training of the critic is harder due to a huge collision penalty, and it is harder to get positive reward which the agent receives only upon reaching the goal. Also, training without self imitation results in unstable training, because the critic needs to fit the noisy reward because of changing collision penalty in Eq. (8). Our full model is both stable and converges faster, because curriculum learning makes it easier for the critic to fit the function, and self imitation mitigates the noisy reward problem by imitating past good experiences.
VII-D Generalization to Goal Perturbation
Next, we evaluate the generalization of our method with respect to goal perturbation. As described in IV-D, we add the goal state as an input to both the actor and critic networks, in the expectation that the method can generalize over goal states. The target task is task 2 of the BookShelf environment, moving from the start angles to the goal defined in VI-B. For the targets, we fix a plane in a cube of the BookShelf and vary the goal state within a rectangle on that plane. For training, we randomly sample goal positions in the rectangle. After training, we test the generalization by randomly sampling positions in the same rectangle used during training, recording whether the robot reached the goal or not. To exploit past good experiences, we prepared a top buffer for each goal, in order to make the training more stable and improve sample efficiency.
Table II shows the result of the experiments. It demonstrates that the RL agent successfully generalizes to goal perturbations over a reasonable area even in the presence of state constraints.
Table II: Success counts and success rates under goal perturbation.
  Train  Test  Overall  Train  Test  Overall
Number of successes  10/10  49/50  59/60  49/50  45/50  94/100
Success rate  1  0.98  0.98  0.98  0.90  0.94
VIII Conclusion
The research reported in this paper is based on the idea of combining RL with trajectory optimization for unknown systems in the presence of constraints on state, control, and control jumps. Problems of this kind are common in robotics, where a manipulator operating in a position-control mode has to perform tasks in an environment cluttered with obstacles. We proposed a method based on RL, for the case when the dynamics are unknown, that generates optimal trajectories in the presence of obstacles and other constraints. For faster learning, we use an off-the-shelf sampling-based algorithm to first generate a reference trajectory, which is then used by the RL agent to converge to an optimal solution faster. The proposed method was demonstrated on several simulated environments using a high-fidelity simulator for an industrial-grade manipulator. We compared the learned policy against a baseline controller designed to track the trajectory obtained by smoothing the initial reference trajectory. The proposed algorithm was also tested for generalization to multiple new target states.
In future research, we would like to investigate the proposed algorithm by parameterizing it with the reference trajectory. We expect that as long as we do not change the environment, the agent would learn to produce a better solution respecting all the constraints.
References
 [1] S. M. LaValle, Planning algorithms. Cambridge university press, 2006.
 [2] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
 [4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
 [6] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
 [7] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, “Trust region policy optimization,” in ICML, vol. 37, 2015, pp. 1889–1897.
 [8] S. Levine and V. Koltun, “Guided policy search,” in Proceedings of the 30th International Conference on Machine Learning - Volume 28, ser. ICML’13. JMLR.org, 2013, pp. III–1–III–9. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042817.3042937
 [9] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in NIPS, 2014, pp. 1071–1079.
 [10] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6292–6299.
 [11] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, “Learning and generalization of motor skills by learning from demonstration,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on. IEEE, 2009, pp. 763–768.
 [12] K. Mülling, J. Kober, O. Kroemer, and J. Peters, “Learning to select and generalize striking movements in robot table tennis,” vol. 32, no. 3. Sage Publications Sage UK: London, England, 2013, pp. 263–279.
 [13] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel, “Learning robotic assembly from CAD,” CoRR, vol. abs/1803.07635, 2018. [Online]. Available: http://arxiv.org/abs/1803.07635
 [14] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 4906–4913.
 [15] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in ICML, vol. 99, 1999, pp. 278–287.
 [16] G. Konidaris and A. Barto, “Autonomous shaping: Knowledge transfer in reinforcement learning,” in ICML. ACM, 2006, pp. 489–496.
 [17] B. Marthi, “Automatic shaping and decomposition of reward functions,” in ICML. ACM, 2007, pp. 601–608.
 [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2015.
 [19] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” CoRR, vol. abs/1511.05952, 2015. [Online]. Available: http://arxiv.org/abs/1511.05952
 [20] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1582–1591. [Online]. Available: http://proceedings.mlr.press/v80/fujimoto18a.html
 [21] K. Hauser and V. Ng-Thow-Hing, “Fast smoothing of manipulator trajectories using optimal bounded-acceleration shortcuts,” in 2010 IEEE International Conference on Robotics and Automation, May 2010, pp. 2493–2498.
 [22] J. Oh, Y. Guo, S. Singh, and H. Lee, “Self-imitation learning,” CoRR, vol. abs/1806.05635, 2018. [Online]. Available: http://arxiv.org/abs/1806.05635
 [23] “MELFA RVFR,” http://www.mitsubishielectric.com/fa/products/rbt/robot/, accessed: 2019-01-15.
 [24] “RT ToolBox3,” https://eu3a.mitsubishielectric.com/fa/en/products/rbt/robot/rt_toolbox3, accessed: 2019-01-15.
 [25] M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” 2017.
 [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
Appendix
VIII-A Curriculum Learning Setting
As written in IVC, we used curriculum setting to train our RL agents. Let be the number of goals reached in an experiment. When , we train RL agents without checking collision with obstacles, and linearly decrease from [rad] to [rad]. Then, we linearly increase the value of the collision penalty from to in .
For self imitation, we set for the top replay buffer, and do self imitation only when the buffer is filled with episodes. For goal generalization experiments in VIID, we set for all different goal settings, and start self imitation when more than 20% of the top replay buffer is stored.
VIII-B Training Details
Both the actor and critic networks have two hidden layers with 128 and 64 units, respectively. The hidden layers use the ReLU activation function, and the output layer of the actor uses the tanh activation function, so that each action lies in a bounded range. We cap the number of steps per episode, and the agent randomly resets to a reference path with a given probability, as described in IV-B. We train our TD3 agent for at most one million steps. Both the actor and the critic are updated every time an episode finishes, using minibatches sampled from a prioritized replay buffer with fixed prioritization hyperparameters. For the ADAM optimization algorithm [26], we use the same learning rate for both the actor and the critic, and the default values from the TensorFlow framework for the other hyperparameters. The target networks are also updated every cycle using a fixed decay coefficient. We use a fixed discount factor.
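The target-network update with a decay coefficient can be sketched as an exponential moving average over the parameters. Whether the coefficient weights the online or the target parameters is a convention; the sketch assumes that `tau` is the fraction taken from the online network, and the value used in the example is a placeholder.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction `tau` toward the online value."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


updated = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```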