Modern robots operating in real environments should be able to cope with dynamic workspaces. They should autonomously and flexibly adapt to new tasks, new motions, environment changes, and disturbances. These requirements have generated novel challenges in the area of Robot Control. It is no longer sufficient to implement control algorithms that are robust to noise; they should also be independent of the assigned task, the planned motion, and the accuracy of the dynamic model of the system to be controlled. They should easily adapt to different devices and working conditions, removing the need for complex parameter identification and/or system re-modeling.
Reinforcement Learning (RL) has been commonly adopted for this purpose. However, the high number of degrees of freedom of modern robots leads to high-dimensional state spaces, which are difficult to learn: example demonstrations must often be provided to initialize the policy and mitigate safety concerns during training. Moreover, when performing dimensionality reduction, not all the dimensions can be fully modelled: an appropriate representation for the policy or value function must be provided in order to achieve training times that are practical for physical hardware.
The combination of parallel computing and embedded Deep Neural Networks (DNNs) extended RL to continuous control applications. Parallel computing provides concurrency, i.e., the ability to perform multiple actions simultaneously. DNNs overcome the need for infinite memory to store experiences: they approximate non-linear multidimensional functions by parametrizing the agents' (i.e., robots') experiences through the network's finite weights. The result is the notion of Deep Reinforcement Learning (DRL).
This paper proposes a detailed and extensive comparison of Trust Region Policy Optimization (TRPO) and Deep Q-Network with Normalized Advantage Functions (DQN-NAF) with respect to other state-of-the-art algorithms, namely Deep Deterministic Policy Gradient (DDPG) and Vanilla Policy Gradient (VPG). Both simulated and real-world experiments are provided. They allow us to describe in detail the hyper-parameter selection and tuning procedures, and to demonstrate the robustness and adaptability of TRPO and DQN-NAF on manipulation tasks such as reaching a random target position and picking and placing an object. Such algorithms are able to learn new manipulation policies from scratch, without user demonstrations and without task-specific domain knowledge. Moreover, being model-free, they maintain good performance even when the dynamic and geometric models of the robot change (e.g., link lengths, masses, and inertia).
The rest of the paper is organized as follows. Section II describes existing DRL algorithms. In Section III the essential notation is introduced together with the foundations of TRPO and DQN-NAF. Section IV describes the simulated experiments: the implemented simulated robot model is described in detail, together with the tasks it is asked to perform. Simulation lets us verify the correctness of our system design and show the steps to follow for an effective hyper-parameter estimation. Section V transfers our policies to a real setup. Finally, Section VI contains conclusions and future work.
II State of the Art
Over the years, NNs have been successfully applied to robotic systems. Among others, fuzzy neural networks and explanation-based neural networks have allowed robots to learn basic navigation tasks. Multi-Layer Perceptrons (MLPs) were adopted to learn various tasks of the RoboCup soccer challenge, e.g., defense, interception, kicking, dribbling and penalty shots. With respect to Robot Control, neural oscillators with sensor feedback have been used to learn rhythmic movements in which open- and closed-loop information were combined, such as gaits for a two-legged robot. Focusing on model-free DRL, a robotic arm has been trained to open a door from scratch by using DQN-NAF. Hindsight Experience Replay (HER), instead, has been used to train several tasks by assigning sparse and binary rewards. Such tasks include the Pick&Place of an object and the pushing of a cube. Recently, the scientific community has achieved notable results by combining model-free and model-based DRL with Guided Policy Search (GPS) [9, 21, 3]. This combination achieves good performance on various real-world manipulation tasks requiring localization, visual tracking and the handling of complex contact dynamics.
A neural network has also been trained from single-view image streams to predict the probability of successful grasps, thus learning hand-eye coordination for grasping. Interesting related works on visual DRL for robotics are also [8, 11, 14, 22]. Data-efficient DRL for DPG-based dexterous manipulation has been further explored, mainly focusing on stacking Lego blocks.
III-A Preliminaries and notation
Robotics Reinforcement Learning is a control problem in which a robot acts in a stochastic environment by sequentially choosing actions (e.g., torques to be sent to controllers) over a sequence of time steps. The aim is to maximize a cumulative reward. Such a problem is commonly modeled as a Markov Decision Process (MDP) that provides: a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state distribution with density $p_1(s_1)$, a stationary transition dynamics distribution with conditional density $p(s_{t+1} \mid s_t, a_t)$ satisfying the Markov property $p(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$ for any trajectory $s_1, a_1, \dots, s_T, a_T$, and a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The policy (i.e., the robot controller) is a mapping from states to actions that is used to select actions in the MDP. The policy can be stochastic, $\pi(a \mid s)$, or deterministic, $a = \mu(s)$. In DRL the policy is commonly parametrized as a DNN and denoted by $\pi_\theta$, where $\theta$ is the general parameter storing all the network's weights and biases. A typical example is a Gaussian Multi-Layer Perceptron (MLP) net, which samples the action to be taken from a Gaussian distribution of actions over states:
$$\pi_\theta(a \mid s) = \mathcal{N}\big(\mu_\theta(s), \Sigma_\theta(s)\big).$$
The return $r_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t}\, r(s_k, a_k)$, with $0 < \gamma < 1$, is the total discounted reward from time-step $t$ onwards, where $\gamma$ is known as the discount factor, which favors proximal rewards over distant ones. In RL, value functions are defined as the expected total discounted reward: the state-value $V^\pi(s) = \mathbb{E}[r_1^\gamma \mid S_1 = s; \pi]$ and the action-value $Q^\pi(s, a) = \mathbb{E}[r_1^\gamma \mid S_1 = s, A_1 = a; \pi]$. DRL methods usually approximate such value functions with neural networks (critics) and fit them empirically on return samples with stochastic gradient descent on a quadratic Temporal Difference (TD) loss. The agent's goal is to obtain a policy which maximizes the return from the initial state, denoted by the performance objective $J(\pi) = \mathbb{E}[r_1^\gamma \mid \pi]$. To do so, classical RL methods pick the action that maximizes such value functions (acting greedily) while sometimes acting randomly to explore. In DRL this is accounted for by using stochastic policies, or deterministic policies with added noise. Since not every robotic setup allows noise to be injected into the controller for exploration, we explored both stochastic and deterministic model-free DRL algorithms. In this paper, we implemented Trust Region Policy Optimization (TRPO) as a stochastic policy and Deep Q-Network with Normalized Advantage Functions (DQN-NAF) as a deterministic one.
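The discounted return described above can be computed with a single backward pass over the rewards of a rollout. The following is a minimal sketch; the function name and the example values are illustrative and are not taken from the paper's implementation.

```python
def discounted_returns(rewards, gamma):
    """For every step t, return sum_{k>=t} gamma**(k-t) * rewards[k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Illustrative rewards only.
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

With a discount factor close to 1, distant rewards keep almost their full weight; smaller values make the agent increasingly short-sighted.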
Policy gradient (PG) methods are a class of RL algorithms that enjoy good convergence properties and model-free formulations. The main motivation for PG methods is that the greedy update of classical RL often leads to big changes in the policy, whereas for stable learning it is desirable that both policy and value function evolve smoothly. Thus it is preferable to take small steps in the parameter space, ensuring that the new policy collects more reward than the previous one. The direction of the update should be provided by some policy gradient, which must be estimated as precisely as possible to secure stable learning. The general stochastic gradient ascent update rule for PG methods is
$$\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\pi_\theta)\big|_{\theta = \theta_k},$$
where $J(\pi_\theta)$ is the performance objective and $\alpha$ is the learning rate. A proper network optimizer with adaptive learning rate such as Adam is strongly advised for such updates. Vanilla Policy Gradient (VPG), a variant of the REINFORCE algorithm, estimates the policy gradient from policy rollouts with the log-likelihood ratio trick:
$$\nabla_\theta J(\pi_\theta) \approx \hat{\mathbb{E}}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_t^\gamma\Big],$$
where $r_t^\gamma$, the discounted return from time-step $t$, is a single-sample estimate of the action-value $Q^\pi(s_t, a_t)$ and thus typically has high variance. Many methods have been proposed to reduce the PG variance, including another neural net for estimating $Q^\pi$ or $V^\pi$ (actor-critic methods), or the use of importance sampling to reuse trajectories from older policies. In this paper we use TRPO and DQN-NAF.
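To make the log-likelihood ratio estimate concrete, the sketch below computes it for a single rollout. It assumes a linear Gaussian policy with fixed standard deviation instead of the neural network policies used in the paper, because the gradient of its log-density has a simple closed form. All names and values are illustrative.

```python
import numpy as np


def gaussian_logp_grad(W, s, a, sigma):
    """Gradient of log N(a | W s, sigma^2 I) with respect to W."""
    mean = W @ s
    return np.outer((a - mean) / sigma**2, s)


def vpg_gradient(W, states, actions, rewards, gamma, sigma):
    """Single-rollout likelihood-ratio estimate of the policy gradient."""
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # discounted return from step t onwards
        running = rewards[t] + gamma * running
        returns[t] = running
    grad = np.zeros_like(W)
    for s, a, ret in zip(states, actions, returns):
        grad += gaussian_logp_grad(W, s, a, sigma) * ret
    return grad


if __name__ == "__main__":
    W = np.zeros((1, 2))
    g = vpg_gradient(W, [np.array([1.0, 0.0])], [np.array([2.0])],
                     [3.0], gamma=0.9, sigma=1.0)
    W = W + 0.01 * g  # one plain gradient-ascent step (no Adam here)
    print(g, W)
```

Because each return is a single noisy sample, averaging this estimate over many rollouts is what keeps the update usable in practice.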
Compared with VPG, TRPO introduces the following changes:
1) The single-sample return estimate is replaced with lower-variance advantages $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, estimated with the Generalized Advantage Estimation (GAE) algorithm (similar to actor-critic algorithms).
2) It uses the Natural Policy Gradient (NPG), which makes the PG invariant to the parametrization used by premultiplying it with the inverse of the policy Fisher Information Matrix (FIM), namely the metric tensor for policy space. This kind of update also takes into account the distance, in KL-divergence terms, between subsequent policies; bounding such divergence helps stabilize the learning. Since, for neural network policies with tens of thousands of parameters, forming and inverting the empirical FIM incurs a prohibitive computational cost, the NPG direction is usually retrieved approximately using a Conjugate Gradient (CG) algorithm with a fixed number of iterations.
3) A line search is performed to check that the surrogate loss has improved and that the updated policy does not differ too much in distribution from the old one.
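The Conjugate Gradient step mentioned above solves a linear system of the form F x = g without ever forming or inverting F: it only needs a routine that returns the product of F with a vector. The sketch below is a generic textbook CG solver, not the rllab implementation; `fvp` (Fisher-vector product) is a hypothetical callable, replaced in the example by an explicit small matrix.

```python
import numpy as np


def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns F @ v.

    F is assumed symmetric positive definite and g non-zero.
    """
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at zero)
    p = g.copy()          # current search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x


if __name__ == "__main__":
    F = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in for the FIM
    g = np.array([1.0, 2.0])                 # stand-in for the policy gradient
    print(conjugate_gradient(lambda v: F @ v, g))
```

Fixing the number of iterations, as the text describes, trades accuracy of the natural gradient direction for a bounded cost per policy update.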
This algorithm proved very successful in contact-rich environments and on high-dimensional robots for locomotion tasks, but its efficiency in common robotic tasks such as 3D end-effector positioning and Pick&Place has yet to be validated.
DQN-NAF aims to extend Q-learning to continuous spaces without relying on PG estimates. To solve the hard problem of $Q$-maximization in a continuous action space, it introduces Normalized Advantage Functions (NAF). This method, which adopts a deterministic neural network policy $\mu(s)$, forces the advantage function to be a quadratic, concave function of the action:
$$A(s, a) = -\tfrac{1}{2}\,\big(a - \mu(s)\big)^{\top} P(s)\, \big(a - \mu(s)\big), \tag{5}$$
where $P(s)$ is a trainable state-dependent positive definite matrix. Since $V(s)$ acts just as a constant in the action domain and $Q(s, a) = A(s, a) + V(s)$, the final $Q$-function has the same quadratic properties as the advantage function (5) and can be easily maximized by always choosing $a = \mu(s)$. This allows us to construct just one network that outputs $V$, $\mu$ and $P$, from which the $Q$-values are retrieved. The overall $Q$-network parameter is the union of the parameters of these outputs, since they differ only in the output-layer connections. The DQN-NAF pseudocode is presented in Algorithm 2. The structure is very similar to DDPG, owing to the use of target networks for computing the TD loss, but it uses a single, more complex $Q$-network that incorporates the policy. Another slight difference is that the critic may be fitted several times per timestep, acting as a critic-per-actor update ratio. This increases the computational burden but further stabilizes learning, since the network better approximates the true value function, improving the reliability of policy updates. This algorithm was successfully applied directly on a 7-DOF robotic arm, even managing to learn how to open a door from scratch. In particular, an asynchronous version of DQN-NAF was implemented, in which multiple agents collect samples to be sent to a shared replay buffer. In this way learning is accelerated almost linearly with the number of learners, since the replay buffer provides more decorrelated samples for the critic update. The reward function plays an important role in both DDPG and DQN-NAF, and we will focus on different designs to explore the performance of these two state-of-the-art DRL algorithms.
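The quadratic structure of the NAF critic can be illustrated numerically. In the sketch below the three network outputs (state value, greedy action, and a lower-triangular factor of the positive definite matrix) are plain arrays with made-up values; building the matrix as a product of a triangular factor with its transpose is the usual way to guarantee positive definiteness and is assumed here.

```python
import numpy as np


def naf_q_value(a, v, mu, L):
    """Q(s, a) = V(s) + A(s, a) with a quadratic advantage.

    L is a lower-triangular factor with positive diagonal, so that
    P = L L^T is positive definite and A(s, a) <= 0 with maximum at a = mu.
    """
    P = L @ L.T
    d = a - mu
    advantage = -0.5 * d @ P @ d
    return v + advantage


if __name__ == "__main__":
    mu = np.array([0.5, -1.0])               # illustrative greedy action
    L = np.array([[1.0, 0.0], [0.5, 2.0]])   # illustrative factor of P
    v = 3.0                                  # illustrative state value
    print(naf_q_value(mu, v, mu, L))                      # equals v
    print(naf_q_value(np.array([1.5, -1.0]), v, mu, L))   # strictly lower
```

Because the maximizing action is available in closed form, no separate actor network or inner optimization over actions is needed.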
IV Simulated Experiments
We first compared the most promising state-of-the-art algorithms by means of simulated tasks modeled using the MuJoCo physics simulator. Simulation allows fast and safe comparisons of design choices such as, for DRL, the hyper-parameter settings. We modeled a UR5 manipulator robot from Universal Robots (https://www.universal-robots.com/products/ur5-robot/) with a Robotiq S Model Adaptive 3-finger gripper (https://robotiq.com/products/3-finger-adaptive-robot-gripper) attached to its end effector, for a total of 10 degrees of freedom. The same robot was used in our real-world experiments. We emphasize that only one robotic arm was modeled for the simulated experiments in order to keep consistency with the real-world setup. However, the analyzed algorithms would remain robust even if the dynamic and geometric models of the robot changed (e.g., link lengths, masses, and inertia).
IV-A Robot Modeling
The manipulator and gripper MuJoCo models (MJCF files) are generated from the robots' Unified Robot Description Format (URDF) files (http://wiki.ros.org/urdf). Once the MJCF files were attached to each other, we defined the global joint state and torque vectors by stacking the UR5 and gripper joint position vectors and the UR5 and gripper action vectors, respectively. The gripper joints are indexed by finger and phalanx (the $j$-th phalanx of the $i$-th finger), joint positions are expressed in radians, and the action vectors contain the torques applied to each joint by its motor.
| Joint | Joint Limits |
In order to better match the real robot, the MuJoCo model includes the actual gear reduction ratios and nominal motor torques listed in Tables I and II. Actuators were modeled as torque-controlled motors. As advised by the MuJoCo documentation, joint damping coefficients were added and tuned by trial and error, resulting in an improved simulated joint stiffness.
Focusing on the gripper, the under-actuated system of its fingers was modeled as a constraint on the joint angles of the phalanges. This joint coupling was implemented by defining fixed tendon lengths between phalanges through a set of multiplicative joint coefficients. These coefficients, one for each tendon between coupled phalanges, were found by trial and error until a satisfactory grasp was obtained. This is not how the real system works, but it is the best way we found to ensure a correct simulated gripper closure. Finally, inertia matrices were generated through the MuJoCo inertiafromgeom option, which enables automatic computation of inertia matrices and geom frames directly from the model's meshes.
An important parameter is the MuJoCo simulation timestep, i.e., the interval at which the MuJoCo Pro physics engine computes successive states of the model, given an initial joint configuration. Usually, a value on the order of milliseconds is chosen; the value used in our case ensures a good trade-off between simulation stability and accuracy. Standard gravity is enabled by default in the simulator.
In order to match the real UR5 controller, which operates the robotic arm at a fixed rate, we set a frameskip. This value defines how many MuJoCo state evolutions the OpenAI Gym environment must skip, so that the effective sampling time equals the frameskip multiplied by the simulation timestep. This method guarantees a stable and accurate simulation while sampling our modeled system at the correct rate.
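The relation between simulation timestep, frameskip and the rate seen by the agent is simple arithmetic, sketched below. The numeric values are hypothetical and chosen only so that the numbers are easy to follow; they are not the values used in the paper.

```python
def effective_sampling_time(sim_timestep, frameskip):
    """Time between two consecutive agent actions, in seconds."""
    return sim_timestep * frameskip


if __name__ == "__main__":
    sim_timestep = 0.002   # hypothetical: 2 ms physics step
    frameskip = 4          # hypothetical: physics steps per agent step
    ts = effective_sampling_time(sim_timestep, frameskip)
    print(f"sampling time = {ts * 1000:.1f} ms, control rate = {1.0 / ts:.0f} Hz")
```

Keeping the physics step small while acting only every few steps lets the simulator stay numerically stable without forcing the policy to run faster than the real controller.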
IV-B1 Random Target Reaching
The robot end effector must reach a random target position (the center of the red sphere in Figure 1) within a fixed cube of side 40 in the robot workspace. The target is expressed in global coordinates (world reference frame positioned and centered on the floor under the robotic arm bench) as a vector whose entries are sampled uniformly within the specified bounds.
Restricting the goal position to a cube limits the training space of the DRL algorithms, which would otherwise extend up to 850 mm from the base joint of the robot. In order to promote space exploration and avoid deterministic behavior, uniformly sampled noise is added to the initial joint positions and velocities of the UR5.
The state of the environment is $s = [\,q,\ \dot{q},\ p_{ee},\ p_{target}\,]$, where $q$ is the robotic arm joint vector, $\dot{q}$ is its time derivative, and $p_{ee}$ and $p_{target}$ are the positions of the end effector and of the target, respectively.
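The episode initialization described above (uniform target sampling inside a cube, uniform noise on the initial joint state, and an observation built from joint positions, joint velocities, end-effector position and target position) can be sketched as follows. The cube center, cube side, noise amplitudes and function names are hypothetical placeholders, not the paper's values.

```python
import numpy as np


def sample_target(center, side, rng):
    """Sample a target uniformly inside an axis-aligned cube."""
    half = side / 2.0
    return center + rng.uniform(-half, half, size=3)


def perturb_initial_state(q0, qd0, q_noise, qd_noise, rng):
    """Add uniform noise to the nominal initial joint positions/velocities."""
    q = q0 + rng.uniform(-q_noise, q_noise, size=q0.shape)
    qd = qd0 + rng.uniform(-qd_noise, qd_noise, size=qd0.shape)
    return q, qd


def build_observation(q, qd, ee_pos, target_pos):
    """Stack arm joints, joint velocities, end-effector and target positions."""
    return np.concatenate([q, qd, ee_pos, target_pos])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    center = np.array([0.5, 0.0, 1.0])          # hypothetical cube center (m)
    target = sample_target(center, 0.4, rng)    # hypothetical cube side (m)
    q, qd = perturb_initial_state(np.zeros(6), np.zeros(6), 0.1, 0.01, rng)
    obs = build_observation(q, qd, np.zeros(3), target)
    print(obs.shape)
```

For a six-joint arm this yields an 18-dimensional observation: 6 positions, 6 velocities and two 3D points.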
The episode horizon for this task has been set to , which means that the agent is allowed to achieve its goal within . Thus the engine computes a total of s of simulated time; after that the episode is terminated and a new one starts.
The reward function includes a regularization term that aims to promote the learning of stable and bounded actions by slightly penalizing the use of excessive torques. This reward function is always negative (penalizing rewards), so the maximum collectible return is zero. Here a particular environment state can be defined as terminal by checking whether the task has been correctly performed, truncating the episode. In this case, however, the agent must experience the whole trajectory up to the episode horizon if a good terminal state is not encountered earlier: any other type of termination would lead to a higher return, tricking the agent into inferring that the actual sequence of actions was good. For this reason, such a reward slows the initial learning process, since it is highly likely that the robot finds itself in a state far from the optimum and still must experience the whole bad episode. A discrete timestep is terminal when the end effector has reached the target, which means that the assigned task has been achieved.
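One plausible shape for a reward with the properties listed above (always non-positive, best possible value zero, small penalty on large torques) is the negative end-effector distance minus a quadratic control cost. The sketch below is an assumption made for illustration; the exact formula, the control-cost weight and the termination tolerance are not taken from the paper.

```python
import numpy as np


def reaching_reward(ee_pos, target_pos, action, ctrl_cost=1e-3):
    """Hypothetical penalizing reward: never positive, zero only at the goal."""
    distance = np.linalg.norm(ee_pos - target_pos)
    regularization = ctrl_cost * np.sum(np.square(action))
    return -distance - regularization


def is_terminal(ee_pos, target_pos, tolerance=0.01):
    """Hypothetical success check: end effector within `tolerance` of the target."""
    return np.linalg.norm(ee_pos - target_pos) < tolerance


if __name__ == "__main__":
    ee = np.array([0.4, 0.1, 0.9])
    target = np.array([0.5, 0.0, 1.0])
    torques = np.ones(6)
    print(reaching_reward(ee, target, torques), is_terminal(ee, target))
```

With a reward of this kind, ending an episode early for any reason other than success removes negative terms from the return, which is why only the successful state can safely be treated as terminal.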
IV-B2 Pick&Place
The arm must learn how to grasp a cylinder from a table and place it at a goal position above the object (see Figure 2):
At every episode, both cylinder and goal positions are fixed, while the initial position of the robot’s joints is uniformly sampled.
The state of the environment follows:
The episode horizon is selected as a fixed number of timesteps, which determines the total time allowed to perform the task. A similar task was already performed with DQN-NAF, but with a stick floating in the air attached to a string and a simplified gripper whose fingers had no phalanges. Our task is instead more realistic, and the robot must learn to firmly grasp the cylinder without any slip.
Inspired by prior work, we created a geometry-based reward function that promotes the minimization of three distances:
1) the distance from the end effector to the object;
2) the distance from the fingers to the center of mass of the cylinder, computed from the Cartesian position of the second phalanx of each finger in the world reference frame. The radius of the cylinder acts simply as an offset to avoid meaningless penalties, since the fingers cannot reach the object's center of mass;
3) the distance from the cylinder to the goal.
The final reward function combines these terms through coefficients that are manually selected in order to balance the distance weightings. The function is normalized in order to avoid huge rewards when reaching the goal: when no torque is applied to the motors and the goal is reached, the highest possible reward is 1. This is an encouraging reward function: such reward shaping is one of our main novelties, and its values are mainly positive, which allows us to define a bad terminal state and speeds up the simulation of many trajectories. This type of reward function is widely used in locomotion tasks, since it is easy to assign a reward proportional to the distance traveled or to the forward velocity. For robotic manipulation, on the other hand, this is not always trivial, and such a reward function can be hard to compose efficiently.
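The three distances can be computed directly from Cartesian positions. The sketch below also shows one hypothetical way to normalize them into a reward whose maximum is 1 when all distances are zero and no torque is applied; this normalization, the clipping of the finger distance at zero, the weights and the control-cost coefficient are all assumptions made for illustration and are not the paper's formula.

```python
import numpy as np


def pick_place_distances(ee_pos, finger_pos, obj_pos, goal_pos, radius):
    """End-effector/object, fingers/object (radius offset) and object/goal distances."""
    d_ee = np.linalg.norm(ee_pos - obj_pos)
    # The radius offset avoids penalizing fingers already touching the surface.
    d_fingers = sum(max(np.linalg.norm(f - obj_pos) - radius, 0.0)
                    for f in finger_pos)
    d_goal = np.linalg.norm(obj_pos - goal_pos)
    return d_ee, d_fingers, d_goal


def shaped_reward(distances, weights, action, ctrl_cost=1e-3):
    """Hypothetical encouraging reward bounded above by 1."""
    weighted = sum(w * d for w, d in zip(weights, distances))
    return 1.0 / (1.0 + weighted) - ctrl_cost * float(np.sum(np.square(action)))


if __name__ == "__main__":
    obj = np.array([0.5, 0.0, 0.8])
    goal = obj + np.array([0.0, 0.0, 0.2])
    fingers = [obj + np.array([0.05, 0.0, 0.0])] * 3
    d = pick_place_distances(obj + 0.1, fingers, obj, goal, radius=0.03)
    print(d, shaped_reward(d, (1.0, 1.0, 1.0), np.zeros(10)))
```

Since such a reward is mostly positive, terminating an episode early lowers the return, which is what makes a "bad" terminal state usable.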
The gripper must stay close to the cylinder and the cylinder close to the goal; the episode is terminated when this condition is violated.
IV-C Hyper-parameters settings
By trial and error, we found the number of training episodes that guarantees good training for the reaching task and a good trade-off for the Pick&Place task: the training is stopped when this number is reached. The rllabplusplus algorithms perform a policy update after a fixed number of samples. This means that each algorithm iteration/policy update occurs after a number of episodes equal to the batch size divided by the maximum number of episode timesteps (max path length). Moreover, we used a discount factor that makes the agent slightly prefer near-future rewards over distant ones. Specifically, for every algorithm:
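The relation between batch size, episode horizon and number of policy updates is plain arithmetic, sketched below. All numbers are hypothetical and serve only as an example.

```python
def episodes_per_iteration(batch_size, max_path_length):
    """Full-length episodes that fit in one batch of samples."""
    return batch_size // max_path_length


def policy_updates(total_episodes, batch_size, max_path_length):
    """Number of policy updates performed over a whole training run."""
    return total_episodes // episodes_per_iteration(batch_size, max_path_length)


if __name__ == "__main__":
    # Hypothetical values: 5000 samples per batch, 500 timesteps per episode.
    print(episodes_per_iteration(5000, 500))   # episodes per policy update
    print(policy_updates(2000, 5000, 500))     # updates in 2000 episodes
```

Episodes that terminate early contribute fewer samples, so in practice a batch may contain more episodes than this lower bound.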
DQN-NAF updates the policy based on the critic estimation. The seamless integration of the policy in the second-order approximated critic allows the selection, at each timestep, of the action that globally maximizes the $Q$ function. We tested three different minibatch sizes. In order to explore whether the same reward function, once scaled, may cripple the learning, we scaled the rewards by a constant factor in the policy update only; in other words, the reward used to update the policy is the original reward multiplied by this factor.
In principle, a lower reward should reduce the base step size of the policy gradient. Intuitively, this method is heavily task dependent, but it has been shown to stabilize (though slow down) the learning. The soft target update coefficient for the target networks was left at its default value.
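The soft target update mentioned above blends each target-network parameter with its online counterpart using a small coefficient. The sketch below applies it to plain arrays; the coefficient value is a hypothetical example, since the default used in the paper is not restated here.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    target = [np.zeros(2)]
    online = [np.ones(2)]
    tau = 0.001  # hypothetical coefficient
    for _ in range(3):
        target = soft_update(target, online, tau)
    print(target[0])
```

A small coefficient makes the TD targets move slowly, which reduces the feedback loop between the critic and its own bootstrapped estimates.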
For TRPO, we used the Conjugate Gradient (CG) algorithm with a fixed number of iterations to estimate the NPG direction and to fit the baseline network. We used the rllab default trust region size for both policy and baseline updates. Tests demonstrated that the size of the baseline network does not significantly affect the learning progress; thus, it was kept fixed. This might reflect the fact that the baseline is deep enough to effectively predict the value of the states it is fed with; a larger network would slow the training and risk overfitting. For the advantage estimation procedure we used the suggested GAE coefficient. The batch size strongly affects the stability and the learning performance curve. Thus, we tested 3 different batch sizes, each corresponding to an approximate number of environment runs per algorithm iteration.
IV-D Evaluation and results
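The role of the GAE coefficient can be seen in a short implementation of the estimator. The sketch follows the standard recursive formulation of Generalized Advantage Estimation; the reward and value arrays are illustrative, and `values` is assumed to contain one extra entry used to bootstrap the end of the rollout.

```python
import numpy as np


def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one rollout.

    values must have len(rewards) + 1 entries; the last one is the value
    of the state reached after the final reward (0 for a terminal state).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


if __name__ == "__main__":
    rewards = [1.0, 1.0]
    values = [0.0, 0.0, 0.0]
    print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
    print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))
```

A coefficient of 1 recovers the Monte Carlo return minus the baseline (low bias, high variance), while 0 reduces to the one-step TD residual (low variance, higher bias).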
The average return is used to evaluate the performance of the policies. After each update of the policy neural network, the new controller is tested on $N$ new task episodes and the agent performance is estimated through the average undiscounted return
$$\bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i, \qquad R_i = \sum_{t} r\big(s_t^{(i)}, a_t^{(i)}\big),$$
along with its standard deviation, which represents the shaded region around the mean return. We used the undiscounted return as evaluation metric because the mean sequence of rewards is easier to understand from it than from its $\gamma$-discounted version.
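The evaluation metric can be computed from the per-step rewards of the test episodes as in the sketch below. The use of the population standard deviation (NumPy's default) is an assumption, and the reward lists are illustrative.

```python
import numpy as np


def evaluate_returns(episode_rewards):
    """Mean and standard deviation of the undiscounted episode returns."""
    returns = np.array([np.sum(r) for r in episode_rewards])
    return returns.mean(), returns.std()


if __name__ == "__main__":
    # Two illustrative test episodes with their per-step rewards.
    mean_ret, std_ret = evaluate_returns([[1.0, 2.0], [3.0, 4.0]])
    print(mean_ret, std_ret)
```

The mean gives the learning curve and the standard deviation the width of the shaded band around it.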
Finally, Final Average Return describes the average return of the last 10 policy runs. Episodes Required indicates the minimum number of episodes required to reach a performance similar to a final policy characterized by Final Average Return.
These settings are used to compare DQN-NAF and TRPO on the proposed tasks with two widely used state-of-the-art DRL algorithms: Vanilla Policy Gradient (VPG) and Deep Deterministic Policy Gradient (DDPG). Our aim is to prove the robustness and adaptability of the proposed approaches on these tasks. For an exhaustive comparison, we tested 4 different types of nets (see Table IV-D). Policy networks are trained with , while networks and
| Policy | Policy Hidden Sizes |
IV-D1 Random Target Reaching
| Algorithm | Episodes Req. | Final Avg Return | Max Return |
VPG struggles to learn a near-optimal policy (see Figure 3): its best policy gets stuck early in training at a suboptimal average return. TRPO is not able to solve the task within the allotted episodes but, thanks to its theoretical monotonic improvement guarantees, it should be able to reach a close-to-zero return with a slightly higher number of episodes. DDPG can synthesize a policy that achieves the best possible return. However, it is the algorithm with the most unstable return trend, and it must be carefully tuned in order to get good results.
Being designed to perform robotic tasks, DQN-NAF stably solves the environment in a limited number of episodes. Moreover, almost every policy architecture manages to collect an almost zero return in a very similar number of episodes. This suggests that a huge net is not needed to perform this task: it seems that the method with which the network is trained is what really makes the difference. However, we cannot skip testing different nets on the next environments, since this result is surely related to the reward function used and to the particular task. As a general rule, we found that sufficiently large nets usually deliver better performance across these 4 algorithms.
IV-D2 Pick&Place
As displayed in Figure 4, the pick and place environment proved highly stochastic due to the contacts between the gripper and the cylinder; small impacts during grasp learning often make the cylinder fall and roll, preventing further grasp trials. This is reflected in the high variance of the average return and in the unstable learning of VPG, DDPG and DQN-NAF for almost all network configurations. Their learning curves show an overall increase in return, but the grasp still fails frequently because the cylinder slips or the approach speed is too high. After 5000 episodes, the monotonic improvement theory and precautions of TRPO deliver a policy that performs a solid grasp while generating a stable trajectory for the placement of the cylinder on the blue goal. The most interesting fact about the TRPO grasp is the tilting of the cylinder towards the fingers. This allows the robotic arm to lift the cylinder with less effort while minimizing the risk of object slip/loss. On the other hand, the overall movements for the transport of the cylinder can sometimes be shakier than those observed in the reaching task with DQN-NAF. TRPO's policy was also chosen to perform the task on the real setup because it had the most room for improvement, and further training may polish the network's behavior or deliver better grasping results.
| Algorithm | Episodes Req. | Final Avg Return | Max Return |
V Real-World Experiments
Real-world experiments aim to prove that the policies learned in simulation are powerful also in real environments.
In order to use the learned policies in a real environment, the real setup must be able to communicate with the simulated one. The simulated environment can interface with external software by exchanging JSON data through a TCP socket connection. As the real robotic setup is based on ROS, we used ROSBridge (http://wiki.ros.org/rosbridge_suite), which provides a JSON API to ROS functionality for non-ROS programs.
Focusing on visual data, fiducial markers are placed on the objects in order to easily obtain their poses. In particular, we used the AprilTags library. A Microsoft Kinect One, placed in front of the robot, is used to view the scene.
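The JSON exchange with rosbridge can be illustrated by building the messages that a non-ROS program would send. The sketch below only constructs the JSON strings using the `op`, `topic`, `msg` and `type` fields of the rosbridge protocol; it does not open a socket, and the topic names and message layout are hypothetical examples rather than the ones used in the real setup.

```python
import json


def rosbridge_publish(topic, msg):
    """JSON string for a rosbridge 'publish' operation."""
    return json.dumps({"op": "publish", "topic": topic, "msg": msg})


def rosbridge_subscribe(topic, msg_type):
    """JSON string for a rosbridge 'subscribe' operation."""
    return json.dumps({"op": "subscribe", "topic": topic, "type": msg_type})


if __name__ == "__main__":
    # Hypothetical topics: joint states in, joint commands out.
    print(rosbridge_subscribe("/joint_states", "sensor_msgs/JointState"))
    print(rosbridge_publish("/ur5/joint_command", {"data": [0.0] * 6}))
```

In a full pipeline, strings of this kind would be written to the TCP connection towards the rosbridge server, and the incoming JSON messages parsed back into the observation vector fed to the policy.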
V-1 Random Target Reaching
The policy described in Section IV was tested: a ball is held near the gripper, as in Figure 5. A marker is placed on it in order to obtain its pose. The robot is able to place its end effector at the ball position with a 100% success rate. Moreover, the robot is able to follow the ball when it is in motion (see the supplementary video).
V-2 Pick&Place
The robot has to pick up a cylinder placed on a table and bring it to a random point located above the first one. As in the previous experiment, the cylinder pose is recognized using a fiducial marker (see Figure 6). A 100% success rate is achieved, as demonstrated by the supplementary video.
VI Conclusions and Future Work
Deep Reinforcement Learning algorithms nowadays provide very general methods with little tuning requirements, enabling tabula-rasa learning of complex robotic tasks with deep neural networks. Such algorithms showed great potential in synthesizing neural nets capable of performing the learned task while being robust to changes in physical parameters and in the environment. In simulation, we compared DQN-NAF and TRPO to VPG and DDPG on classical tasks such as end-effector dexterity and Pick&Place on a 10-DOF collaborative robotic arm. Simulated results showed that good performance can be obtained with a reasonable number of episodes, and that training times can be easily improved with more CPUs on computational clusters. DQN-NAF performed really well on the reaching task, achieving a near-optimal policy. TRPO proved to be the most versatile algorithm thanks to its invariance to reward scaling and parametrization. VPG typically learns more slowly, whereas DDPG is the most unstable and difficult to tune, since it is highly sensitive to the reward scale. We found that the policy network architecture (width/depth) was not a decisive learning parameter and that its effect is algorithm dependent. However, a sufficiently large hidden layer is advised for similar continuous control tasks. Finally, we showed that it is possible to transfer the learned policies to real hardware with almost no changes.
- (2017) Constrained policy optimization. arXiv preprint arXiv:1705.10528.
- (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5055–5065.
- (2017) Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078.
- (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3389–3396.
- (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838.
- (2017) Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2017) Learning visual servoing with deep features and fitted q-iteration. arXiv preprint arXiv:1703.11000.
- (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
- (2016) Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics, pp. 173–184.
- (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312.
- (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
- (2011) AprilTag: a robust and flexible visual fiducial system. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 3400–3407.
- (2017) C-learn: learning geometric constraints from demonstrations for multi-step manipulation in shared autonomy. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 4058–4065.
- (2017) Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073.
- (2015) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897.
- (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
- (2014) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 387–395.
- (2012) MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033.
- (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256.
- (2017) Collective robot reinforcement learning with distributed asynchronous guided policy search. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 79–86.
- (2015) Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791.