Reinforcement Learning (RL) has been used to accomplish diverse robotic tasks: manipulation [4, 3, 27, 69], locomotion [34, 88], navigation [33, 14, 90, 37], flight [40, 70], interaction [13, 15], motion planning [21, 72] and more. Due to high sample complexity and safety requirements, it is common to train the RL agent in simulation [4, 34, 10]. To reduce training time and encourage exploration, the agent is usually trained with distributed rollouts [22, 45, 20, 44]. For a successful transfer to the real world, researchers use calibration [3, 79], domain randomization [63, 57, 49, 70], fine-tuning with real-world data, and features learned from a combination of simulation and real data [31, 7].
To experiment with robotic reinforcement learning, one needs expertise in many areas, access to a physical robot, an accurate robot model for simulations, a distributed training mechanism and the ability to customize the training procedure, such as modifying the neural network and the loss function or introducing noise. For the uninitiated, dealing with this complexity is daunting and dissuades adoption. As a result, much of the prior work is limited to a single robot [4, 63, 32] or a few robots. We reduce the learning curve and alleviate development effort with DeepRacer.
DeepRacer supports state-of-the-art deep RL algorithms, simulations with the OpenAI Gym interface, distributed rollouts and integration with cloud services. We introduce a training mechanism that decouples RL policy updates from the rollouts, which enables independent scaling of the simulation cluster and supports popular simulators such as Gazebo. The DeepRacer 1/18th scale car is one realization of a physical robot in our platform that uses RL for navigating a race track with a fisheye lens camera. The car hardware includes a GPU for executing the neural network policy locally and live streams the camera view over WiFi. The compute battery supports 6 hours of development time, and the car retails at $400. We have a corresponding robot model in simulation, along with rendering for multiple race tracks. We can train the RL policy with different simulation parameters and multiple tracks in parallel using our distributed rollout mechanism.
We learn an end-to-end policy for navigating a race track. We use a single grayscale camera image as observation and discretized throttle/steering as actions. We train in simulation using the Proximal Policy Optimization (PPO) algorithm, which can converge in <5 minutes and 5000 simulation steps. With no pre-processing, real world data or expert labeling, the learned policy successfully transfers from simulation to real tracks (sim2real). The entire process from training a policy to testing in the real car takes <30 minutes. Multiple models can be trained in parallel with on-demand compute and stored in the car. Thousands of users have designed their own reward functions, trained their models on our platform, and demonstrated real track navigation. To the best of our knowledge, this is the first demonstration of model-free RL based sim2real at scale.
DeepRacer serves as a testbed for many areas of RL research such as reducing sample complexity, sim2real and generalizability. The car can log camera images, inertial sensor measurements and policy decisions. Simulations can be randomized with different tracks, lighting, sensor and actuator noise. The learned policy can underfit/overfit to the simulation settings. We use a robust evaluation method to identify when the learned policy will generalize to the real world. We evaluate multiple checkpoints of the saved policy with domain randomization such as action noise and different starting points. Models that give good results in robust evaluation generalize well to the real world. Our policies trained with domain randomization generalize to multiple cars, tracks and to variations in speed, background, lighting, track shape, color and texture.
II Related Work
RL has been used in robotics for several decades [50, 5, 28, 48]. Initial works used low dimensional state spaces due to scalability challenges. RL concepts were generalized to high dimensional problems with deep networks [54, 76, 73]. High variance, sample complexity and replicability challenges in deep RL algorithms led to the development of simulators, benchmarks [10, 81] and libraries [59, 17]. We build upon these works to create a platform for experimentation with simulation and real robots.
Distributed Rollouts: Algorithms that use distributed rollouts, where multiple simulations are executed in parallel to collect experience data, were introduced to reduce training time [3, 20, 53]. OpenAI Baselines uses OpenMPI to support distributed gradient algorithms, where each worker computes gradients on the data it collects. OpenAI Rapid generalizes this to a distributed system for the PPO algorithm and demonstrates sim2real transfer on dexterous manipulation. Flex extends the same distribution mechanism to use GPUs for simulation and hence can run 750 humanoid MuJoCo simulations on a single GPU. Chebotar et al. use Flex to demonstrate sim2real transfer for manipulation. Surreal uses a decoupled rollout mechanism to support experience replay algorithms, where each worker stores the experience data in a buffer and a separate training worker computes gradients. Ray RLlib [44, 56] introduces a stateful actor framework to support distributed rollouts. DeepRacer integrates with the Intel Coach library, which supports >20 deep RL algorithms in an easy-to-use, modular interface. DeepRacer uses the same rollout mechanism as Surreal, and extends support to Gazebo. Similar to Rapid, DeepRacer can use different simulation settings for each worker and can run separate evaluation workers that validate the performance of the current policy.
Sim2Real: Training RL policies in the real world is challenging due to high sample complexity and safety issues. Simulations alleviate these concerns and serve as a testbed to experiment with algorithms and debug software. However, sim2real transfer is challenging because of differences in dynamics and imagery, and because simulated models are only approximations of the real world [63, 57, 35]. Domain randomization, where simulation parameters are perturbed during training, has been used for successful sim2real transfer for various robotic tasks [3, 70, 12]. Methods include adding noise in dynamics [63, 3] and imagery [70, 82], learning model ensembles [55, 66], adding adversarial noise [49, 64] and assessing simulation bias. Domain adaptation has also been used for sim2real, particularly to address the visual reality gap [31, 9, 36, 78]. DeepRacer serves as a platform to reproduce and experiment with sim2real methods. We demonstrate various forms of domain randomization in our experiments. Navigation with the DeepRacer car can be structured from simple, low speed lane following to complex tasks such as high speed racing or commuting in traffic.
Our distributed rollout mechanism facilitates iterative experimentation as policies converge faster and helps identify underfitting/overfitting. Prior sim2real works use a fixed number of simulation steps [3, 63, 80, 51]. We show that policies can both underfit and overfit to the simulation while training, as identified by prior works [57, 16, 85]. We use a separate robust evaluation to identify the policy checkpoints that are likely to transfer well to the real world.
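[Table I: comparison of platforms for RL, sim2real and autonomous driving. Column headers include Rigid Body Dynamics, GPU on Robot and Robot Cost (USD).]
|Fetch [4, 63]||✓||✓||✓||✓||✓||✓||✗||100K|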
Many works rely on simulators only for testing and use methods such as state estimation, motion planning and model predictive control (MPC) [62, 84, 86] for navigation. Other works have used imitation learning, where expert demonstrations are given either by a person [68, 8] or with an MPC algorithm. Kahn et al. directly learn the RL policy in the real car, with a fixed maneuver when a collision occurs. Domain randomization and image segmentation in simulations have been used to close the visual reality gap with a model based controller [70, 18, 47]. Image pre-processing, learned embeddings and depth cameras have been used to achieve sim2real transfer. Bharadhwaj et al. demonstrate sim2real transfer by mixing expert demonstrations with simulations. We observe that prior sim2real works rely on a model based controller for high speed navigation [47, 19] or achieve slow speeds because of poor transfer of dynamics [89, 87]. With DeepRacer, we demonstrate speeds of 1.6 m/s with a single grayscale monocular image as input and discretized steering/throttle as output. We use simple, non-recurrent networks for our policy and still demonstrate robustness in the real world to multiple cars, tracks, and variations in the environment. We also achieve slow speed (0.5 m/s) sim2real transfer with <5 minutes of training.
Table I compares DeepRacer with other platforms for RL, sim2real and autonomous driving. The other simulation platforms can also be used with DeepRacer. We provide an easy-to-use, economical and flexible platform with support for distributed RL, domain randomization and robust evaluation. DeepRacer tools have enabled us to replicate sim2real RL policy transfer with consistency and at scale.
III Autonomous Racing with RL
In our formulation, the agent steers the car and the environment is the race track. The track is marked by white lanes; there is a single car on the track with no obstacles, and the car only moves forwards. The image from the car’s camera is the observation, and actions are the throttle/steering of the car. As the agent does not receive the full state such as the track layout, this is a partially observed Markov Decision Process. An episode starts with the car somewhere on the track and finishes when the car goes off-track or finishes a lap.
The images from the camera are streamed at 15 fps, downsized to 160 x 120 pixels and converted to grayscale. We discretize the actions to 10 values, with 2 levels for throttle and 5 for steering. Users can customize this discretization, which gets mapped to low level controls. We fix the maximum throttle in simulation and set it manually in the real car. We incentivize the agent to stay close to the center line of the track. If the car is at the edge of the track, a small deviation can take the car off the road and the track is no longer visible in the image. Staying close to the center of the track leads to a stable policy. Users can customize this reward function. Figure 1 illustrates our problem formulation.
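As an illustration of the formulation above, the following minimal sketch builds a 2 x 5 discrete action table and a center-line reward. The throttle and steering values, the function name `centerline_reward`, its parameters and the linear decay are all assumptions made for this example; the text fixes only the number of levels and the intent of rewarding proximity to the center line.

```python
from itertools import product

# Hypothetical values: the text fixes only the counts (2 throttle x 5 steering).
THROTTLE_LEVELS = [0.5, 1.0]                        # m/s, assumed
STEERING_ANGLES = [-30.0, -15.0, 0.0, 15.0, 30.0]   # degrees, assumed

# Index -> (throttle, steering) lookup that maps a discrete action to controls.
ACTIONS = list(product(THROTTLE_LEVELS, STEERING_ANGLES))


def centerline_reward(distance_from_center: float, track_width: float) -> float:
    """Reward that decays linearly from 1 at the center line to 0 at the edge."""
    half_width = track_width / 2.0
    if distance_from_center >= half_width:
        return 0.0  # at or beyond the track edge: no reward
    return 1.0 - distance_from_center / half_width
```

A policy output of index `k` would then be executed as `ACTIONS[k]`; other reward shapes (for example stepped bands around the center line) fit the same signature.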
III-A Reinforcement Learning Algorithm
We use PPO, a state-of-the-art policy gradient algorithm. The algorithm uses two neural networks during training: a policy network and a value network. The policy network decides which action to take given an image as input and the value network estimates the expected cumulative discounted reward given the image. The agent initializes a policy that takes random actions. The policy network is used to interact with the simulation environment to collect data. The resulting dataset is used to update the policy and value networks as per the algorithm’s loss function. The updated policy is used to interact with the environment to collect more data and the training cycle continues until a time limit.
The policy loss function increases the probability of actions that give higher rewards on average, as estimated by the generalized advantage estimation algorithm, and applies a clipped importance sampling weight because the policy that collects the dataset is an older version of the policy being updated. The value loss function uses the mean squared error between the predicted value and the observed value. Only the policy network gets deployed in the real car. By default, we use three convolutional layers and two fully connected layers for both networks. We train a new policy every 20 episodes. The full list of hyperparameters is given in our source code.
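The loss terms described above can be sketched in a few lines of plain Python. This is an illustrative version of generalized advantage estimation, the clipped surrogate and the mean squared value error for a single finished episode, not the training code itself; the default values of `gamma`, `lam` and `clip_eps` are common choices assumed for this example and are not taken from the text.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one finished episode.

    values[t] is the value estimate at step t; the value after the final
    step is taken to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_policy_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate for a single sample (a quantity to maximize)."""
    ratio = math.exp(new_logp - old_logp)  # importance sampling weight
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def value_loss(predicted, observed):
    """Mean squared error between predicted and observed values."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)
```

The clipping bounds how far a single update can move the policy away from the older policy that collected the data, which is why the importance weight appears in both arguments of the `min`.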
IV DeepRacer Design and Implementation
We decouple the simulation data collection from the policy updates. We use RoboMaker  for our simulations with Gazebo and SageMaker  to train our policy with the RL Coach  library. Simulations help us train without manual effort. The decoupled training allows us to use separate machines which are specialized for simulations (e.g. license, Mac/Windows OS) and neural network training (e.g. GPU, large RAM) respectively. We also get the flexibility to launch multiple simulations each with their own settings for domain randomization as well as evaluate policies in parallel.
IV-A Training Workflow
Figure 2 shows the DeepRacer training workflow. The training starts by initializing the policy/value network models and hyper-parameters in SageMaker. The neural network models are saved in S3, an object store service. RoboMaker initializes the simulation and the agent, and loads the models from S3. The agent interacts with the simulation over the OpenAI Gym interface. The agent takes actions (steering/throttle) based on the observation (camera image). The simulator updates the position of the car based on the action and returns the updated camera image and reward. The experiences, collected as tuples of observation, action, reward and next observation, are stored in Redis, an in-memory database. SageMaker trains the neural networks with the data collected in Redis and saves the models in S3. RoboMaker copies the model from S3 and creates more experience data. The cycle continues until training stops. The models in S3 are continually evaluated in a separate simulation to assess convergence and generalizability. Models in S3 can be deployed on the real car. While we show our results with the PPO algorithm, our architecture can be used for various experience replay based algorithms such as DQN, DDPG and SAC. RoboMaker can be replaced with other simulators that can integrate with the Gym interface.
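The decoupled cycle described above can be illustrated with in-process stand-ins. In this sketch a dictionary plays the role of the model store (S3 in the text) and a deque plays the role of the experience store (Redis in the text); the function names, the batch size and the placeholder observations, actions and rewards are assumptions for illustration, and the two sides run sequentially here whereas the platform runs them on separate machines.

```python
import random
from collections import deque

model_store = {"version": 0}   # stand-in for the model checkpoint store
experience_buffer = deque()    # stand-in for the experience store


def rollout_worker(num_steps, rng):
    """Simulation side: read the latest policy version and push experiences."""
    version = model_store["version"]
    for step in range(num_steps):
        observation, next_observation = step, step + 1  # placeholder images
        action = rng.randrange(10)                      # placeholder policy
        reward = rng.random()                           # placeholder reward
        experience_buffer.append(
            (observation, action, reward, next_observation, version)
        )


def trainer_step(batch_size):
    """Training side: consume a batch, then publish a new policy version."""
    if len(experience_buffer) < batch_size:
        return False  # not enough data yet; rollouts keep running
    batch = [experience_buffer.popleft() for _ in range(batch_size)]
    # A real trainer would compute the policy and value updates from `batch`.
    model_store["version"] += 1
    return True


rng = random.Random(0)
for _ in range(3):                # three training cycles
    for _worker in range(2):      # two rollout workers per cycle
        rollout_worker(num_steps=5, rng=rng)
    trainer_step(batch_size=10)
```

Because the two sides share only the model store and the experience store, the number of rollout workers can be changed without touching the trainer.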
IV-B Training with Amazon SageMaker
SageMaker is a platform to train and deploy machine learning models at scale using the Jupyter Notebook as interface. SageMaker integrates RL algorithms using the Coach and RLlib libraries, which build on top of existing deep learning frameworks. SageMaker uses RL Coach to support the decoupled simulation based training used in DeepRacer, and RLlib for integrated simulation and training. The libraries are packaged in a Docker container and training can be launched in a cluster of machines with different configurations (CPU/GPU/RAM). The training clusters are created on-demand and billed per second, freeing users from infrastructure maintenance. Metrics such as rewards per episode, the policy entropy and CPU/memory use are visualized, source code is saved and logs are recorded. Users can launch experiments in parallel and search across experiment metadata. In addition to autonomous racing, SageMaker contains RL examples for HVAC control, robot locomotion, portfolio management and more.
IV-C Simulation with AWS RoboMaker
RoboMaker is a cloud service to develop, test and deploy robot software. It uses Gazebo for simulation. A robot model describes each component of the DeepRacer car (the chassis, wheels, camera and Ackermann steering), their dimensions, how they link together, and their properties such as mass and camera angle. We create our tracks and background environment in Blender, a 3D modeling tool, and import them into Gazebo. We use the ODE physics engine, which simulates the laws of physics using the robot model and takes into account factors such as collision, friction and acceleration. A rendering engine, OGRE, visualizes the graphics. We use Gazebo plugins to add the camera and light sources. We use ROS for communication between the agent and the simulation. The agent uses ROS to place the car on the track at the beginning of an episode, get images from the camera module, get the car’s position and velocity, and send throttle and steering commands to control the car. Users can customize the simulation in Gazebo with their own robot models and environments.
IV-D Sim2Real Calibration
We have matched the URDF robot model to the measured dimensions of the car. We compared images from the real camera and calibrated the height, angle and field of view of the simulation camera to match the real images. As the DeepRacer camera captures 15 fps, we set the simulation environment to use the same frame rate and use a producer-consumer mechanism to ensure one action per image. We map the agent’s action space to the motor control commands by measuring the steering angles and speed of the car under different settings. We have created a real world track that is identical in color, shape and dimensions to one of the simulation tracks. We use barricades around this track to reduce visual distractions. In addition, we have eight other tracks with varying shapes, backgrounds and textures.
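One way to obtain one action per image with a producer-consumer pattern is a bounded queue of size one, sketched below with Python threads. This is an assumption about how such a mechanism could look, not a description of the actual DeepRacer code; `dummy_policy` and the integer frame identifiers are placeholders for neural network inference and camera images.

```python
import queue
import threading

frames = queue.Queue(maxsize=1)   # holds at most one unprocessed camera frame
actions = []                      # (frame, action) pairs emitted by the consumer
STOP = object()                   # sentinel that ends the consumer loop


def camera_producer(num_frames):
    """Publish frames; put() blocks until the previous frame is consumed."""
    for frame_id in range(num_frames):
        frames.put(frame_id)
    frames.put(STOP)


def policy_consumer(policy):
    """Take exactly one action for every frame received."""
    while True:
        frame = frames.get()
        if frame is STOP:
            break
        actions.append((frame, policy(frame)))


def dummy_policy(frame):
    return frame % 10  # placeholder for neural network inference


consumer = threading.Thread(target=policy_consumer, args=(dummy_policy,))
consumer.start()
camera_producer(num_frames=30)    # e.g. two seconds of frames at 15 fps
consumer.join()
```

With `maxsize=1` the producer can never run ahead of the consumer by more than one frame, so no frame is skipped and no frame produces two actions.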
IV-E Calculating Rewards
We compute an ordered set of points along the middle of the track, called waypoints, to estimate the relative position of the car on the track. The track and the background are modeled as a polygon mesh. We separate the track mesh from the background and identify the border edges as those which belong to a single triangle. We get two boundaries corresponding to the inner and outer parts of the track by grouping the border vertices. We construct a bipartite graph from the two sets of vertices and compute the linear sum assignment using the Euclidean distance as edge length. This gives us border vertices parallel to each other on both sides of the track. The waypoints are the mean of the vertices connected by each edge. The spline is the line joining the waypoints. The car starts an episode at a waypoint. We flag the car as off-track when it deviates from the spline by more than half the track width. We measure the car’s progress by the relative distance it covers compared to the length of the spline.
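The waypoint construction above can be sketched on a toy track. The example below solves the linear sum assignment by brute force over permutations, which is only practical for a handful of vertices (a Hungarian-algorithm solver would be used at realistic scale), and assumes both borders have the same number of vertices and that the inner border is listed in order around the track. All function names and the square track are hypothetical.

```python
import math
from itertools import permutations


def match_borders(inner, outer):
    """Brute-force linear sum assignment: pair each inner vertex with one
    outer vertex so that the total Euclidean distance is minimal."""
    best = min(
        permutations(range(len(outer))),
        key=lambda perm: sum(
            math.dist(inner[i], outer[j]) for i, j in enumerate(perm)
        ),
    )
    return list(enumerate(best))


def waypoints_from_borders(inner, outer):
    """Waypoint = midpoint of each matched inner/outer vertex pair."""
    return [
        ((inner[i][0] + outer[j][0]) / 2.0, (inner[i][1] + outer[j][1]) / 2.0)
        for i, j in match_borders(inner, outer)
    ]


def distance_to_segment(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))


def is_off_track(position, waypoints, track_width):
    """Off-track when farther than half the track width from the closed spline."""
    segments = zip(waypoints, waypoints[1:] + waypoints[:1])
    nearest = min(distance_to_segment(position, a, b) for a, b in segments)
    return nearest > track_width / 2.0


# Toy square track: inner border of side 2, outer border of side 4.
inner = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
outer = [(4.0, 0.0), (0.0, 0.0), (0.0, 4.0), (4.0, 4.0)]  # deliberately shuffled
waypoints = waypoints_from_borders(inner, outer)
```

The assignment recovers the correct pairing even though the outer vertices are given in a different order, which is the property the bipartite matching is used for.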
IV-F DeepRacer Hardware
Figure 4 gives an overview of DeepRacer hardware. We have designed the car for experimentation while keeping the cost nominal. The Intel Atom processor with a built-in GPU can perform 15 inferences per second with our default five layer neural network. The motors are equipped with electronic speed controllers. We can use the car as a regular computer with a monitor, mouse and keyboard connected via HDMI and USB. The camera connects over USB and there are three USB ports for extensions. The 13600 mAh compute battery lasts 6 hours. The 1100 mAh drive battery lasts for 45 minutes in typical experiments. The WiFi chip enables remote monitoring and programming. We built the car software on top of ROS. We can load multiple trained models over WiFi. We use Intel OpenVINO to convert our TensorFlow models to an optimized binary for fast inference. The camera images are fed to the OpenVINO inference engine and to a real-time video feed in a browser. There is a web UI for calibrating steering and throttle. The model inference results are converted to motor control commands based on the calibration and action space mapping. In addition, the browser has an interface for manual joystick-like control.
We evaluate our track navigation policies extensively across multiple tracks, with domain randomization in both simulation and real world. We have created a replica of Track A with the track printed on carpet with the same dimensions as in simulation. We place barriers around the track to reduce distractions and evaluate performance both with and without barriers as well as different speeds and lighting conditions. We also made a custom “tape track” with 2 inch white duct tape in our office corridor to test model robustness. The track is roughly 24 inches wide, 12m in length, traverses both carpet and concrete, has multiple turns and the car camera is exposed to clutter and bright lights in the background.
V-A Training with Multiple Rollouts
We train policies under three different conditions: on Track A with a maximum throttle of 1 m/s, on Track A with throttle 1.67 m/s and on Track B with throttle 1.67 m/s. The task gets harder at higher speeds. Track B is more difficult to navigate because of a background with buildings and a higher number of turns. Each episode starts at a different waypoint so that all parts of the track are experienced by the policy. We use a p3.2x instance for training in SageMaker and run each experiment twice for 2 hours. Figure 5 shows the progress on track during training with different numbers of rollout workers.
As we expect, more rollout workers lead to faster convergence. There are diminishing returns as we add workers: 16 workers give only slightly faster convergence than 8. Somewhat surprisingly, the higher throttle of 1.67 m/s helped speed up convergence on Track A. We hypothesize that the agent collects more uniform experience at the faster speed and that this helps with convergence. Track B takes longer to converge but follows similar trends to Track A.
V-B Robust Evaluation
[Table II: sim2real results. Header cells: Training, A Replica, 0.5 m/s, 1 m/s.]
|Throttle=0.33 m/s||21 (100)||2||0||0||0||2||4|
|Throttle=1.67 m/s||72 (91.1)||6||4||5||6||2||23|
|Throttle=2.33 m/s||79 (57.9)||6||5||5||6||2||24|
|B, D||Default||41 (100)||3||3||3||3||1||13|
|Color Aug.||49 (100)||6||5||6||6||3||26|
|All image aug||48 (100)||5||6||3||4||0||18|
We test whether robust evaluation in simulation is indicative of real world performance. If true, we can identify when to stop training in simulation and avoid underfitting/overfitting. We can tune our hyper-parameters entirely in simulation and avoid extensive testing in the real world. We train policies with increasing levels of domain randomization and evaluate the policy in both simulation and the real world.
Our baseline case is trained on Track A with no domain randomization and a throttle of 1 m/s. For domain randomization, we train policies on Track A with (i) up to 10% uniform random noise added to steering and throttle (action noise), (ii) the direction of travel reversed each episode (reverse), (iii) both action noise and reverse, and (iv) on Track B with both action noise and reverse. For robust evaluation, we add uniform random noise to actions and evaluate from multiple starting positions and in both directions of travel on Track A. For naive evaluation, we evaluate on Track A from a fixed starting point without randomization. Both evaluations test each checkpoint 10 times in the simulator. We pick six policies during training from checkpoints 5 through 30, and test their sim2real performance on the Track A replica with 3 trials for each direction of travel. The model performance varies with speed, but it is difficult to maintain a constant speed because battery levels change and the model switches between throttle levels. For sim2real experiments we ensure the model completes a lap in 18 to 22 seconds (0.8-1 m/s). In simulation, the models complete the lap in 35 seconds, so we test the policy at about double speed on the real track.
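The robust evaluation procedure above can be sketched as follows. The example interprets "up to 10% uniform random noise" as a multiplicative factor drawn uniformly from [0.9, 1.1], which is an assumption; `run_episode` is a hypothetical callback standing in for one simulated episode that returns lap progress between 0 and 1.

```python
import random


def add_action_noise(steering, throttle, rng, level=0.10):
    """Scale each command by a factor drawn uniformly from [1-level, 1+level]."""
    noisy_steering = steering * (1.0 + rng.uniform(-level, level))
    noisy_throttle = throttle * (1.0 + rng.uniform(-level, level))
    return noisy_steering, noisy_throttle


def robust_evaluation(run_episode, num_waypoints, rng, trials=10):
    """Average progress over trials with random start, direction and noise.

    run_episode(start_waypoint, reverse, noise_fn) is a hypothetical callback
    that runs one simulated episode and returns lap progress in [0, 1].
    """

    def noise_fn(steering, throttle):
        return add_action_noise(steering, throttle, rng)

    total = 0.0
    for _ in range(trials):
        start = rng.randrange(num_waypoints)   # random starting position
        reverse = rng.random() < 0.5           # random direction of travel
        total += run_episode(start, reverse, noise_fn)
    return total / trials
```

Naive evaluation corresponds to calling the same episode function with a fixed start, a fixed direction and a noise function that returns its inputs unchanged.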
Figure 6 shows the experimental results. Models that perform consistently well in robust evaluation also perform well on the real track. The models are particularly robust when a sequence of checkpoints performs well in the simulator. Reversing the direction of travel significantly improves model performance. Action noise does not help by itself, but improves performance when combined with reverse. Policies trained on Track B do not perform well for the checkpoints in Figure 6, but with more training they start performing well both in robust evaluation and on the real track: policy checkpoint 35 traversed the real track successfully in 5 out of 6 trials.
The performance of the model changes dramatically at slower speeds (35 s lap, 0.5 m/s): even checkpoint 5 of the policy trained on Track A with no randomization traverses the real track. This model is trained in <5 minutes. All the above policies were trained in <1 hour with 4 rollouts.
V-C Robust Sim2Real
We test the robustness of sim2real by training on multiple tracks, with multiple speeds, regularization and domain randomization in actions and observations. By default, we train on Track B with a throttle of 1 m/s, with action noise and the direction of travel reversed each episode. We pick model checkpoints based on performance in robust evaluation and test the policy on the Track A replica at two speeds (0.5 m/s, 1 m/s), with bright sunlight, with no barriers and on the tape track.
Table II summarizes our results. Training on a different track gives good sim2real results, but the results vary from track to track. For regularization, we used L2 norm, dropout, batch normalization and an entropy bonus to the policy loss. We tested the models that give the best performance in robust evaluation. Reducing the entropy bonus to 0.001 (it is 0.1 by default) and dropout with probability 0.3 were particularly effective. Larger throttle speeds in training increased the robustness of the model dramatically but also increased convergence time in the presence of action noise. Mixing multiple tracks during training did not improve performance. We perturb the observation images with random color, horizontal translation, shadow, and salt and pepper noise, each with 0.2 probability. For random color, we combine the effects of random hue, saturation, brightness and contrast to create variations in observation. Random color was the most effective method for sim2real transfer.
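To make the observation perturbations concrete, the sketch below applies two of the four listed perturbations (salt and pepper noise and horizontal translation) to a grayscale image stored as a list of rows, each with probability 0.2 as stated above. The noise amount, the maximum shift and the black padding are assumptions for illustration; random color and shadow are omitted to keep the example short.

```python
import random


def salt_and_pepper(image, rng, amount=0.05):
    """Set a random fraction of pixels to black (0) or white (255)."""
    return [
        [rng.choice((0, 255)) if rng.random() < amount else pixel for pixel in row]
        for row in image
    ]


def horizontal_shift(image, rng, max_shift=2):
    """Shift every row by the same random offset, padding with black pixels.

    Assumes the image is wider than max_shift pixels.
    """
    shift = rng.randint(-max_shift, max_shift)
    width = len(image[0])
    if shift >= 0:
        return [[0] * shift + row[: width - shift] for row in image]
    return [row[-shift:] + [0] * (-shift) for row in image]


def augment(image, rng, probability=0.2):
    """Apply each perturbation independently with the given probability."""
    for perturbation in (salt_and_pepper, horizontal_shift):
        if rng.random() < probability:
            image = perturbation(image, rng)
    return image
```

Each perturbation returns a new image of the same shape, so perturbations can be chained in any order before the image is passed to the policy network.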
We combine the best of our parameters and train a model on Track C with L2 regularization, lower entropy bonus, dropout, color randomization and a maximum throttle of 2.33 m/s. This model performed the best overall in our experiments. The model consistently completed 11 second laps (1.6 m/s) on our Track A replica.
DeepRacer is an experimentation and educational platform for sim2real reinforcement learning. The platform integrates state-of-the-art deep RL algorithms and multiple simulation engines with the OpenAI Gym interface, and provides on-demand compute and distributed rollouts that facilitate domain randomization and robust evaluation in parallel. We demonstrate the DeepRacer platform features with a 1/18th scale car that navigates a race track using reinforcement learning. We have created a calibrated robot model for the car in Gazebo along with multiple race tracks. We demonstrate robust sim2real navigation performance trained in DeepRacer with the PPO algorithm on both our real world replica track and a custom tape track. We achieve sim2real on the real track with <5 minutes of training at slow speeds and achieve speeds of 1.6 m/s using models trained with tuned parameters. Thousands of users have replicated our model training and demonstrated sim2real RL navigation.
-  (2019) Amazon S3. Note: https://aws.amazon.com/s3/ Cited by: §IV-A.
-  (2019) Amazon SageMaker. Note: https://aws.amazon.com/sagemaker/ Cited by: §IV.
-  (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177. Cited by: §I, §II, §II, §II.
-  (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: §I, §I, TABLE I.
-  (1996) Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine learning 23 (2-3), pp. 279–303. Cited by: §II.
-  (2019) AWS RoboMaker. Note: https://aws.amazon.com/robomaker/ Cited by: §IV.
-  (2019) A data-efficient framework for training and sim-to-real transfer of navigation policies. In 2019 International Conference on Robotics and Automation (ICRA), pp. 782–788. Cited by: §I, §II.
-  (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §II.
-  (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250. Cited by: §II.
-  (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §I, §I, §II.
-  (2017) Reinforcement Learning Coach. Cited by: §I, §II, §IV.
-  (2019) Closing the sim-to-real loop: adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979. Cited by: §II, §II.
-  (2019) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6015–6022. Cited by: §I.
-  (2019) Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5993–6000. Cited by: §I.
-  (2019) Guided deep reinforcement learning of control policies for dexterous human-robot interaction. arXiv preprint arXiv:1906.11695. Cited by: §I.
-  (2019) Quantifying generalization in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1282–1289. Cited by: §I, §II.
-  (2017) Openai baselines. GitHub, GitHub repository. Cited by: §II, §II.
-  (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.), Proceedings of Machine Learning Research, Vol. 78, pp. 1–16. Cited by: TABLE I, §II.
-  (2019) Vision-based high-speed driving with a deep dynamic observer. IEEE Robotics and Automation Letters 4 (2), pp. 1564–1571. Cited by: §II.
-  (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1407–1416. Cited by: §I, §II.
-  (2018) Motion planning among dynamic, decision-making agents with deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3052–3059. Cited by: §I.
-  (2018) Surreal: open-source reinforcement learning framework and robot manipulation benchmark. In Conference on Robot Learning, pp. 767–782. Cited by: §I, TABLE I, §II.
-  (2004-09) Open MPI: goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary, pp. 97–104. Cited by: §II.
-  (2019) Quasi-direct drive for low-cost compliant robotic manipulation. arXiv preprint arXiv:1904.03815. Cited by: TABLE I.
-  (2019) AutoRally: an open platform for aggressive autonomous driving. IEEE Control Systems Magazine 39 (1), pp. 26–55. Cited by: TABLE I.
-  (2016) Autonomous drifting with onboard sensors. In Advanced Vehicle Control: Proceedings of the 13th International Symposium on Advanced Vehicle Control (AVEC’16), September 13-16, 2016, Munich, Germany, pp. 133. Cited by: TABLE I.
-  (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. Cited by: §I.
-  (1994) Acquiring robot skills via reinforcement learning. IEEE Control Systems Magazine 14 (1), pp. 13–24. Cited by: §II.
-  (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1861–1870. Cited by: §IV-A.
-  (2018) Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §II.
-  (2017) Darla: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1480–1490. Cited by: §I, §II.
-  (2017) Kubernetes: up and running: dive into the future of infrastructure. 1st edition, O’Reilly Media, Inc. Cited by: §I.
-  (2018) Distributed deep reinforcement learning based indoor visual navigation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2532–2537. Cited by: §I.
-  (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26). Cited by: §I.
-  (1995) Noise and the reality gap: the use of simulation in evolutionary robotics. In European Conference on Artificial Life, pp. 704–720. Cited by: §I, §II.
-  (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12627–12637. Cited by: §II.
-  (2018) Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §I, §II.
-  (2003) On the sample complexity of reinforcement learning. Ph.D. Thesis, University of London, London, England. Cited by: §I.
-  (2017) Project-based, collaborative, algorithmic robotics for high school students: programming self-driving race cars at MIT. In 2017 IEEE Integrated STEM Education Conference (ISEC), pp. 195–203. Cited by: TABLE I.
-  (2004) Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems, pp. 799–806. Cited by: §I.
-  (2016) Jupyter notebooks-a publishing format for reproducible computational workflows. In ELPUB, pp. 87–90. Cited by: §IV-B.
-  (2004-09) Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, pp. 2149–2154. Cited by: §I.
-  (2014) Poppy project: open-source fabrication of 3D printed humanoid robot for science, education and art. Cited by: TABLE I.
-  (2018-07) RLlib: abstractions for distributed reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 3053–3062. Cited by: §I, TABLE I, §II, §IV-B.
-  (2018-10) GPU-accelerated robotic simulation for distributed reinforcement learning. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, pp. 270–282. Cited by: §I, TABLE I, §II.
-  (2016) Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §IV-A.
-  (2019) Deep drone racing: from simulation to reality with domain randomization. arXiv preprint arXiv:1905.09727. Cited by: §II.
-  (1992) Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55 (2-3), pp. 311–365. Cited by: §II.
-  (2017) Adversarially robust policy learning: active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3932–3939. Cited by: §I, §II.
-  (1997) Reinforcement learning in the multi-robot domain. In Robot colonies, pp. 73–83. Cited by: §II.
-  (2018-10) Sim-to-real reinforcement learning for deformable object manipulation. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, pp. 734–743. Cited by: §II.
-  (2014) Docker: lightweight Linux containers for consistent development and deployment. Linux Journal 2014 (239), pp. 2. Cited by: §IV-B.
-  (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §II.
-  (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §II, §IV-A.
-  (2015) Ensemble-CIO: full-body dynamic motion planning that transfers to physical humanoids. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5307–5314. Cited by: §II.
-  (2018) Ray: a distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577. Cited by: §II.
-  (2018) Domain randomization for simulation-based policy optimization with transferability assessment. In Conference on Robot Learning, pp. 700–713. Cited by: §I, §II, §II.
-  (2019) F1/10: an open-source autonomous cyber-physical platform. arXiv preprint arXiv:1901.08567. Cited by: TABLE I.
-  Setting up a benchmark environment for deep reinforcement learning. Cited by: §II.
-  (2018) Agile autonomous driving using end-to-end deep imitation learning. In Robotics: Science and Systems, Cited by: §II.
-  (2015) Visual domain adaptation: a survey of recent advances. IEEE Signal Processing Magazine 32 (3), pp. 53–69. Cited by: §II.
-  (2017) Duckietown: an open, inexpensive and flexible platform for autonomy education and research. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1497–1504. Cited by: TABLE I, §II.
-  (2018) Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §I, §I, TABLE I, §II, §II.
-  (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2817–2826. Cited by: §II.
-  (2009) ROS: an open-source robot operating system. In ICRA workshop on open source software, Vol. 3, pp. 5. Cited by: §IV-C.
-  (2016) EPOpt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §II.
-  (2019) Redis. Note: https://redis.io Cited by: §IV-A.
-  (2019) Donkey car: an opensource DIY self driving platform for small scale cars. Note: http://donkeycar.com Cited by: TABLE I, §II.
-  (2017-11) Sim-to-real robot learning from pixels with progressive nets. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.), Proceedings of Machine Learning Research, Vol. 78, pp. 262–270. Cited by: §I.
-  (2016) CAD2RL: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201. Cited by: §I, §II, §II.
-  (2018) Sim2real viewpoint invariant visual servoing by recurrent control. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4691–4699. Cited by: §I.
-  (2019) PRIMAL: pathfinding via reinforcement and imitation multi-agent learning. IEEE Robotics and Automation Letters 4 (3), pp. 2378–2385. Cited by: §I, §I.
-  (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §II.
-  (2016) High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §III-A.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §I, §III-A.
-  (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §II.
-  (2019) MuSHR: a low-cost, open-source robotic racecar for education and research. arXiv preprint arXiv:1908.08031. Cited by: TABLE I.
-  (2018) Genesis-rt: generating synthetic images for training secondary real-world tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7151–7158. Cited by: §II.
-  (2018-06) Sim-to-real: learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. Cited by: §I.
-  (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332. Cited by: §II.
-  (2018) DeepMind control suite. arXiv preprint arXiv:1801.00690. Cited by: §II.
-  (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §II.
-  (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §II.
-  (2019-06) An online learning approach to model predictive control. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany. Cited by: §II.
-  (2011) Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 120–127. Cited by: §II.
-  (2017) Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. Cited by: §II.
-  (2018) Learn to steer through deep reinforcement learning. Sensors 18 (11), pp. 3650. Cited by: §II.
-  (2018) Feedback control for Cassie with deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1241–1246. Cited by: §I.
-  (2019) Self-driving scale car trained by deep reinforcement learning. arXiv preprint arXiv:1909.03467. Cited by: §II.
-  (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. Cited by: §I, §II.