Current RL algorithms aimed at robot manipulation are either policy gradient methods, such as PPO, or actor-critic methods, such as DDPG, TD3, or SAC. These methods have a stable learning process because they directly optimize policy parameters based on expected return, but they still suffer from sample-inefficient function approximation compared to value-based optimization approaches. While value-based approaches were previously either limited by their function approximation capability or restricted to discrete action spaces, recent works have developed novel neural network architectures that enable general function approximation of continuous state-action value functions using only the Bellman error.
In this paper we evaluate RBF-DQN, an action-value-based method inspired by Q-networks, on multiple RLBench tasks with continuous state and action spaces. RLBench provides a challenging testbed for evaluating reinforcement learning algorithms in multi-task robot settings because it provides only sparse rewards for completing long-horizon tasks. We conduct experiments on 5 different tasks (Fetch Reach, Button Push, Toilet Down, Open Drawer, and Pick and Place), and investigate how value-based approaches (like RBF-DQN) compare to state-of-the-art actor-critic methods (like TD3, PPO, and SAC) and how their performance is impacted by typical data augmentation and replay buffer sampling techniques. To the best of our knowledge, this is the first comparison of value-based approaches against actor-critic methods for continuous robot control in sparse-reward multi-task settings.
II Background and related work
Reinforcement learning is the study of maximizing an agent’s long-term discounted reward through interactions with an environment. It is commonly modeled as a Markov Decision Process (MDP), defined by the tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$. In robotic manipulation domains, $\mathcal{S}$ denotes the continuous state space, $\mathcal{A}$ represents the continuous action space, $T$ is the transition model, and $R$ is the reward model. The discount factor, $\gamma$, determines the importance of immediate rewards compared with future rewards. The action-value function $Q^{\pi}(s, a)$, with $s \in \mathcal{S}$ and $a \in \mathcal{A}$, is defined as the expected return achieved by following a particular policy $\pi$ after seeing some state $s$ and then taking some action $a$. The optimal action-value function $Q^{*}$ corresponds with the optimal policy $\pi^{*}$. The optimal action-value function follows an important identity known as the Bellman equation:

$$Q^{*}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \left[ \max_{a'} Q^{*}(s', a') \right] \tag{1}$$
When the reward function $R$ and transition function $T$ are known, the optimal $Q^{*}$ can be found using standard dynamic programming algorithms such as value iteration. However, if the model dynamics are not known, RL algorithms need to find $Q^{*}$ by interacting with the environment without learning an explicit model. One notable example of these model-free algorithms is Q-learning, which approximates $Q^{*}$ using an estimator $\hat{Q}(s, a; \theta)$ that depends on parameters $\theta$. The parameters can be updated iteratively through gradient descent using the estimates (often stabilized using a target network, such as in DQN), where $\eta$ denotes the step size:

$$\theta \leftarrow \theta + \eta \left( r + \gamma \max_{a'} \hat{Q}(s', a'; \theta) - \hat{Q}(s, a; \theta) \right) \nabla_{\theta} \hat{Q}(s, a; \theta) \tag{2}$$
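As a rough illustration of the Q-learning update described above, the sketch below applies one step in the simplest possible setting: a lookup table over a small discrete action set, with no target network. The function name `q_learning_update`, the default step size, and the default discount are assumptions made for this example only and are not taken from the paper.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, lr=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step and return the TD (Bellman) error."""
    # Bootstrapped target: reward plus the discounted best next-state value.
    best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - q[(s, a)]
    # For a table, the gradient of Q(s, a) with respect to its own entry is 1,
    # so the gradient step reduces to adding the scaled TD error.
    q[(s, a)] += lr * td_error
    return td_error


# Hypothetical two-action problem; all unseen entries start at 0.
q = defaultdict(float)
err = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

The `max` over `actions` is the step that becomes expensive once the action space is continuous, which is the problem the rest of this section addresses.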
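Introduced by Asadi et al., RBF-DQN is a value-based method that can efficiently approximate $Q^{*}$ using a set of radial basis functions, and simultaneously approximates the maximum action value with bounded error. Specifically, RBF-DQN approximates $Q^{*}$ by optimizing centroid locations $a_i(s; \theta)$ and centroid values $v_i(s; \theta)$ as functions of state $s$ and parameters $\theta$, together with a temperature parameter $\beta$, using the following equation:

$$\hat{Q}_{\beta}(s, a; \theta) = \frac{\sum_{i=1}^{N} \exp\left(-\beta \lVert a - a_i(s; \theta) \rVert\right) v_i(s; \theta)}{\sum_{i=1}^{N} \exp\left(-\beta \lVert a - a_i(s; \theta) \rVert\right)}$$

During training, both the centroid locations and the state-dependent centroid values are learned; these are then used during forward propagation to form the Q function output $\hat{Q}_{\beta}(s, a; \theta)$. In multi-dimensional action spaces, the temperature parameter $\beta$ can be tuned to ensure an upper bound on the error:

$$\max_{a \in \mathcal{A}} \hat{Q}_{\beta}(s, a; \theta) - \max_{i \in [1, N]} \hat{Q}_{\beta}(s, a_i; \theta) \leq \mathcal{O}\left(e^{-\beta}\right)$$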
RBF-DQN is powerful as a value-based method in continuous action spaces because of its action-maximization property and because it is a universal function approximator. In Q-learning, the update rule (2) relies on finding $\max_{a'} \hat{Q}(s', a'; \theta)$. This is prohibitively expensive in continuous action spaces, where the search space is effectively infinite, and tricks like discretizing the action space may produce sub-optimal solutions. The action-maximization property of RBF-DQN, however, guarantees that all critical points of $\hat{Q}_{\beta}$ can be well-approximated by a centroid location $a_i$. This makes action maximization as simple as searching over all centroids, $\max_{i} \hat{Q}_{\beta}(s, a_i; \theta)$, where $a_i$ represents a centroid location.
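To make the centroid search concrete, here is a minimal sketch of a normalized RBF value function and its maximization for a single state. It assumes the centroid locations and centroid values have already been produced (in RBF-DQN they would come from a network conditioned on the state); the function names, the example centroids, and the temperature value are hypothetical.

```python
import math


def rbf_q_value(action, centroids, values, beta):
    """Softmax-weighted average of centroid values, evaluated at `action`."""
    dists = [math.dist(action, c) for c in centroids]
    # Shift distances by their minimum so the largest weight is exp(0) = 1;
    # this leaves the normalized result unchanged and avoids underflow.
    d_min = min(dists)
    weights = [math.exp(-beta * (d - d_min)) for d in dists]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def greedy_action(centroids, values, beta):
    """Approximate the maximizing action by evaluating Q only at the centroids."""
    scores = [rbf_q_value(c, centroids, values, beta) for c in centroids]
    best = max(range(len(centroids)), key=scores.__getitem__)
    return centroids[best], scores[best]


# Hypothetical 2-D action space with three centroids for one fixed state.
centroids = [(-1.0, 0.0), (0.5, 0.5), (1.0, -1.0)]
values = [0.2, 1.0, 0.4]
best_action, best_q = greedy_action(centroids, values, beta=5.0)
```

The search costs one Q evaluation per centroid, regardless of the dimensionality of the action space. Because the output is a weighted average, it always lies between the smallest and largest centroid value.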
II-C HER and PER
Most robotic manipulation tasks fall under the sparse reward setting, which makes training an RL agent extremely challenging due to ineffective exploration, leading to high sample inefficiency. Hindsight Experience Replay (HER) and Prioritized Experience Replay (PER) are two methods that improve sample efficiency by making better use of previously experienced transitions. As agents train, transition tuples $(s_t, a_t, r_t, s_{t+1}, g)$ are collected and stored in a replay buffer $\mathcal{D}$, where $s_t, s_{t+1} \in \mathcal{S}$, $a_t \in \mathcal{A}$, and $g$ is the goal. These transitions often come from trajectories generated by the agent’s policy during each episode, and they are stored in the replay buffer as a dataset of samples to train with.
In HER, after each training episode, the transitions of the current trajectory are stored in the replay buffer both with the original goal and with potentially multiple hindsight goals, which are selected from the trajectory according to a goal selection strategy.
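The relabeling step can be sketched as follows. This is an illustrative toy, not the implementation used in the experiments: the tuple layout, the name `relabel_with_hindsight`, and the exact-match reward function are all assumptions chosen to keep the example short.

```python
def relabel_with_hindsight(trajectory, original_goal, hindsight_goals, reward_fn):
    """Return transitions for the original goal plus one copy per hindsight goal.

    Each trajectory item is (state, action, next_state, achieved_goal), where
    achieved_goal is the goal-space value reached in next_state.
    """
    buffer = []
    for goal in [original_goal] + list(hindsight_goals):
        for state, action, next_state, achieved in trajectory:
            # The reward is recomputed against whichever goal is being stored.
            reward = reward_fn(achieved, goal)
            buffer.append((state, action, reward, next_state, goal))
    return buffer


def exact_match_reward(achieved, goal):
    """Toy sparse reward: 1 only when the achieved goal equals the target goal."""
    return 1.0 if achieved == goal else 0.0


# Hypothetical 1-D episode that moves from 0 to 2 while the real goal was 5.
trajectory = [(0, +1, 1, 1), (1, +1, 2, 2)]
stored = relabel_with_hindsight(trajectory, original_goal=5,
                                hindsight_goals=[2], reward_fn=exact_match_reward)
```

In this toy episode every transition stored under the original goal has reward 0, while the copy relabeled with the hindsight goal contains one rewarded transition, which is what gives the agent a learning signal under sparse rewards.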
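In PER, transitions are sampled from the replay buffer with probability weighted by their TD (Bellman) error, rather than being sampled uniformly. Conceptually, this means that the agent prioritizes transitions in the replay buffer that it finds surprising or unexpected. A transition $i$ with priority $p_i$ is sampled with probability

$$P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}$$

The hyperparameter $\alpha$ determines the degree to which prioritization is used, with $\alpha = 0$ corresponding to uniform sampling.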
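A small sketch of proportional prioritization is given below. It assumes the priority of a transition is the absolute TD error plus a small constant, which is one common choice; the function names and the default exponent and constant are illustrative assumptions rather than the settings used in the experiments.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, alpha=0.6, rng=random):
    """Draw buffer indices (with replacement) according to their priorities."""
    probs = per_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


errors = [0.1, 2.0, 0.5]
prioritized = per_probabilities(errors, alpha=1.0)  # follows the TD errors
uniform = per_probabilities(errors, alpha=0.0)      # every transition equal
batch = sample_indices(errors, batch_size=4, rng=random.Random(0))
```

Setting the exponent to 0 makes every priority equal to 1 and recovers uniform replay, while larger exponents concentrate sampling on high-error transitions.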
III Technical Approach
Our aim is to demonstrate RBF-DQN’s efficacy on robot manipulation tasks. We applied RBF-DQN to various simulated robotic manipulation tasks under sparse rewards, and also investigated how combining HER and PER with RBF-DQN impacted performance.
RLBench is a robot learning simulator with many realistic and challenging tasks involving a Franka Panda Arm, such as Fetch Reach, Open Door, and Close Toilet. One aim of RLBench is to provide a standardized suite of tasks for benchmarking the performance of RL strategies.
III-B Goal Selection and Detection
For our robotic manipulation tasks, we utilized two hindsight goal selection strategies for use with HER. A simple strategy, known as final, passes the last state of a trajectory into a function that maps states to goals. We also considered a strategy called future, which takes as goals states that occur later in the trajectory than a given timestep.
In our ablation studies with RBF-DQN involving HER, we combine the final and future strategies: we use as hindsight goals not only the final state, but also states later in the trajectory relative to a given timestep $t$.
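The combined strategy can be sketched as a function that, for a given timestep, returns the final achieved goal together with a few goals achieved later in the same trajectory. The name `select_hindsight_goals`, the `k_future` parameter, and the assumption that the goal-mapping function has already been applied to every state are all hypothetical choices for this example.

```python
import random


def select_hindsight_goals(achieved_goals, t, k_future=4, rng=random):
    """Pick hindsight goals for timestep t using `final` plus `future`.

    achieved_goals[i] is the goal-space value of the state at step i.
    """
    # `final`: the goal achieved by the last state of the trajectory.
    goals = [achieved_goals[-1]]
    # `future`: goals achieved at steps strictly after t.
    later = achieved_goals[t + 1:]
    if later:
        # Sample without replacement, capped by how many later steps exist.
        goals += rng.sample(later, k=min(k_future, len(later)))
    return goals


# Hypothetical 1-D goal values (e.g. a joint position) over a 5-step episode.
achieved = [0.0, 0.1, 0.3, 0.6, 1.0]
goals = select_hindsight_goals(achieved, t=1, k_future=2, rng=random.Random(0))
last_step_goals = select_hindsight_goals(achieved, t=4)
```

At the last timestep there are no later states, so only the final goal is returned.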
The specifics of this goal-mapping function depend largely on the manipulation task being performed. For Fetch Reach, it takes the state as input and returns the position of the end effector, but for a task like Open Drawer, it returns the state of the prismatic joint representing how open or closed the drawer is.
Finally, we determine whether a goal was achieved by checking whether the norm of the difference between the achieved and desired goals is less than some arbitrary threshold $\epsilon$ (held fixed in our experiments): $\lVert g_{\text{achieved}} - g_{\text{desired}} \rVert < \epsilon$.
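The goal check amounts to a distance threshold in goal space, as in the sketch below. The threshold used here is a placeholder and is not the value from the experiments; the function names and the pairing with a 0/1 reward are likewise assumptions for illustration.

```python
import math

EPS = 0.05  # placeholder threshold, NOT the value used in the experiments


def goal_achieved(achieved, desired, eps=EPS):
    """True when the achieved goal lies within eps of the desired goal."""
    return math.dist(achieved, desired) < eps


def sparse_reward(achieved, desired, eps=EPS):
    """Reward of 1 on success and 0 otherwise."""
    return 1.0 if goal_achieved(achieved, desired, eps) else 0.0


# Hypothetical end-effector positions in metres.
close = goal_achieved((0.0, 0.0, 0.0), (0.01, 0.0, 0.0))
far_reward = sparse_reward((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

The same check works for a scalar goal such as a joint position by passing one-element tuples.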
[...], PPO converged at around 2000 iterations, which, with 10 update epochs per iteration, we count as equivalent to 20000 episodes. TD3 converged at around 13000 episodes, and SAC converged at around 12000 episodes. For Button Push, PPO converged at around 1300 iterations (13000 episodes), TD3 converged at around 7000 episodes, and SAC converged at 14000 episodes. For Toilet Down, PPO converged after 1300 iterations (13000 episodes), TD3 converged after more than 11000 episodes, and SAC needed at least 13000 episodes to converge. For Open Drawer, PPO achieved a maximum success rate of 0.30 after 3000 episodes, TD3 achieved a success rate of 0.30 after 5000 episodes, and SAC rarely reached that success rate during training. We ran all algorithms for 3 seeds and shade the 95% confidence interval for each run.
All tasks use a Franka Panda Arm with 8 DoF (7 joints + 1 gripper tip), each with a continuous range of motion. Agents receive a reward of 1 when they complete the task and a reward of 0 at all other time steps. All variations are trained in the joint velocity action space, where actions are represented as an 8-dimensional vector in which each element corresponds to a joint velocity or the gripper tip open position. For each task, we used the low-dimensional state space provided by RLBench, consisting of information about the robot arm joint velocities and all objects in the scene. The state space was pruned to reduce its dimensionality and remove irrelevant information. Each agent was trained for 3,000 episodes, where each episode corresponds to a maximum of 200 steps.
The tasks, initialization sequences, and state spaces are described below.
Fetch Reach and Button Push: The robot arm is required to move to a target position in the environment (and push a button). The state space is 17-dimensional, representing the joint positions of the arm, the position of the end effector tip, and the position of the target to reach. Goals for HER on the Fetch Reach and Button Push tasks are derived from the final end-effector position.
Toilet Seat Down: The robot arm is required to put the lid of a toilet seat down. The state space is 101 dimensional, where the state encompasses information about the gripper joint positions and velocities as well as information about the toilet, like its position, orientation and joint state (how open or closed the lid is). Goals for HER are based on the toilet lid joint.
Pick and Place: The robot arm is required to pick up a block and move it to a spot in 3D space. This task proved extremely difficult in the sparse reward setting, so we simplified the task by first motion planning to the block, forming and maintaining a grasp throughout the trajectory (locking the 8th element of the action vector to keep the gripper closed). The reduced state space is 51 dimensional, representing the robot joint velocities, block position, and target position in space. Goals for HER are formed from the position of the end effector.
Open Drawer Task: The robot arm is required to pull open a drawer. Due to the difficulty of this task in the sparse reward setting, we initialize the gripper at the beginning of each episode to make durative contact with the bottom handle of the drawer, and form a grip. Throughout the trajectory, the robot has full control over its 8 dimensional action space. The reduced state space consisted of 45 dimensions: the joint velocities and gripper state of the robot arm, the waypoint of the bottom drawer, and the prismatic joint of the bottom drawer (loosely representing how open or closed the drawer is). Goals for HER were formed with the drawer’s prismatic joint, which increases from 0 as the drawer is opened.
V Discussion and Analysis
From the results, we see that RBF-DQN under an $\epsilon$-greedy policy compares favorably to other state-of-the-art baselines under the same conditions. In the five sparse-reward RLBench robotic manipulation tasks evaluated, RBF-DQN required one third as many episodes as the baselines to succeed at each task, a substantial improvement in sample efficiency for robotic manipulation.
We note that while RBF-DQN was successful, not all of the sampling strategies (HER, PER, HER+PER) were equally effective on each task: PER may result in unstable learning, and HER may not always be feasible to incorporate. Fetch Reach and Pick and Place succeeded under HER and HER+PER, but with PER alone training tended to become unstable. For Button Push, none of PER, HER, or HER+PER outperformed vanilla RBF-DQN. For Open Drawer, HER did not increase performance, while PER increased learning speed but was unstable. For Toilet Down, because there are no intermediate stable goal states (the lid is either up, or falling down due to gravity after slight perturbations), HER is not useful, and PER leads to unstable learning compared to vanilla RBF-DQN.
Differences in the environment may play a role in the success of the sampling strategies in terms of what areas of state space the agent explored. In particular, the stability of trajectory states sampled from the experience buffer (and those which are chosen as hindsight goals) may have an impact on success. Fetch Reach and Button Push have the property that all states in the state space are stable: in the absence of robotic control, states (of the end effector or the button) do not transition to a fixed point.
For Toilet Down, the goal state of the toilet lid is an attractor for lid joint angle due to gravity, so certain perturbations of the lid when it is open can cause the lid to fall to the goal state. Setting hindsight goals for lid angles which naturally fall towards the goal state, with no robot contact on the lid, could be a successful hindsight goal selection strategy. However, since the robot does not need an intelligent policy at the subgoal states due to the attraction dynamics of this task family, hindsight goal selection is not as beneficial as in cases where planning is challenging from the subgoal states.
Additionally, the mapping from state space to goal space is critical: certain tasks can only be completed if the robot successfully maintains durative contact with the object throughout the trajectory. Therefore, certain tasks require hindsight goals to be created out of states that maintain durative contact. Pick and Place and Open Drawer states are stable, but only as long as the gripper maintains contact with the object or handle. Therefore, in Pick and Place, we opted to always ensure the gripper remained closed. In contrast, for Open Drawer, we performed only grasp initiation, but subsequently allowed the robot arm full control over its DoF, implying that it could potentially release its grip on the drawer. We observe that due to these differences, HER on Pick and Place was more effective than HER on Open Drawer, since in Open Drawer, there is a very low probability that the gripper remains closed throughout the trajectory.
In certain experiments, such as Button Push, Fetch Reach, and Open Drawer, HER + PER caused the agent’s performance to collapse near the end of training. It is possible that the bias introduced by Prioritized Experience Replay is significant enough to destabilize convergence at the end of training, and that the weighted importance sampling ratios in PER would benefit from an annealing schedule that reduces the weights over time. This is especially problematic for value-based approaches like RBF-DQN, which tend to be less stable during training than policy-gradient methods like TD3, PPO, and SAC since they optimize for low Bellman error rather than directly improving the expected returns of the policy. Future work will investigate approaches for mitigating the destabilization issues introduced by biased replay buffer sampling techniques.
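For reference, one standard form of such a schedule is the importance-sampling correction from the original PER formulation, in which an exponent is annealed toward 1 over training. The sketch below illustrates that general idea only; it is not something the experiments above used, and the function names, the linear schedule, and the start and end values are assumptions.

```python
def importance_weights(probs, is_exponent):
    """Normalized weights w_i = (N * P(i)) ** -is_exponent, scaled to max 1."""
    n = len(probs)
    raw = [(n * p) ** (-is_exponent) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


def annealed_exponent(step, total_steps, start=0.4, end=1.0):
    """Linearly move the exponent from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


# Hypothetical sampling probabilities for three stored transitions.
probs = [0.7, 0.2, 0.1]
early = importance_weights(probs, annealed_exponent(0, 3000))
late = importance_weights(probs, annealed_exponent(3000, 3000))
```

As the exponent approaches 1, frequently sampled transitions receive smaller update weights, which offsets more of the sampling bias late in training.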
Even without common sample-efficiency improvements, we have demonstrated that RBF-DQN is more sample efficient than current state-of-the-art baselines, and is able to perform better or comparably on multiple robot manipulation tasks. We attribute the success of RBF-DQN on sparse-reward, continuous state and action manipulation tasks to the action-maximization and function approximation properties of RBFs, which guarantee that the best centroid approximately attains the maximum Q value at a given state, within a bounded error. This property is extremely powerful: it reduces action maximization to a simple search over finitely many centroids, which may explain why RBF-DQN performs efficiently.
Our results provide strong motivation for incorporating RBF-DQN as a sample-efficient value-based method in the domain of robotic manipulation. It seems promising that RBFs can be leveraged to improve sample complexity for robotic manipulation tasks with both on-line and off-line RL methods. It would be interesting to see how RBF-DQN performs with higher-dimensional state representations, and how other sampling methods or goal generation methods could be utilized to improve sample efficiency.
We have seen experimentally that RBF-DQN is comparable to or better than PPO, TD3, and SAC on common robotic manipulation tasks. Especially when paired with HER and PER, RBF-DQN is a powerful value-based model for off-policy continuous action space robotic manipulation.
In the future, we hope to experiment with using RBF-DQN on vision-based state input (depth images and point clouds), incorporating sample efficiency algorithms like CURL and RAD, as well as adapting HER to work with image-based state input. Furthermore, we are working to improve the stability and over-estimation tendencies of RBF-DQN by exploring the potential of incorporating dueling and double DQN techniques into RBF-DQN.
- (2017-07) Hindsight Experience Replay. arXiv e-prints, arXiv:1707.01495.
- (2020-02) Deep Radial-Basis Value Functions for Continuous Control. arXiv e-prints, arXiv:2002.01883.
- (1952) On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America 38 (8), pp. 716–719.
- (2018) Addressing function approximation error in actor-critic methods.
- (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor.
- (2019-09) RLBench: The Robot Learning Benchmark & Learning Environment. arXiv e-prints, arXiv:1909.12271.
- (2020-04) Reinforcement Learning with Augmented Data. arXiv e-prints, arXiv:2004.14990.
- (2019) Continuous control with deep reinforcement learning.
- (2013) Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
- (1994) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., New York, NY.
- (2019) Stable baselines3. GitHub. Note: https://github.com/DLR-RM/stable-baselines3
- (2015-11) Prioritized Experience Replay. arXiv e-prints, arXiv:1511.05952.
- (2017) Proximal policy optimization algorithms.
- (2020-04) CURL: Contrastive Unsupervised Representations for Reinforcement Learning. arXiv e-prints, arXiv:2004.04136.
- (1998) Reinforcement learning: An introduction. The MIT Press.
- (2015-09) Deep Reinforcement Learning with Double Q-learning. arXiv e-prints, arXiv:1509.06461.
- (2015-11) Dueling Network Architectures for Deep Reinforcement Learning. arXiv e-prints, arXiv:1511.06581.
- (1992-05) Technical note: Q-learning. Machine Learning 8, pp. 279–292.