Reinforcement learning offers a strong framework for approaching machine learning problems that can be formulated in terms of agents operating in environments and receiving rewards. Coupled with the representational power and capacity of deep neural networks, this framework has enabled artificial agents to achieve superhuman performance in Atari games, Go, and the real-time strategy games Dota 2 and StarCraft II. Deep reinforcement learning has been successfully applied to various simulated environments, demonstrating the ability to solve control problems in discrete [24, 44, 55] and continuous [39, 11] action spaces, perform long-term planning [19, 18], use memory, explore environments efficiently and even learn to communicate with other agents. These and many other capabilities demonstrated by deep reinforcement learning (DRL) agents hold an inspiring promise of applicability of DRL to real-world tasks, in particular in the field of robotics.
Although many consider the real-world environment to be the ultimate challenge for reinforcement learning research, the search for solutions to that challenge is being carried out predominantly in simulated environments [22, 23, 54, 53, 11, 25, 17, 44, 70, 21]. The main reason for conducting the research in simulated environments is the high sample complexity of modern DRL methods. Collecting a sufficient amount of observations on a real robotic system is time-consuming and can incur a high cost. Training of robotic agents in the real world is being approached either directly [69, 35, 36, 14, 66, 37, 41, 30] or via different techniques that transfer agents initially trained in simulation to the real world [51, 60, 2]. Recent work on imitation learning [14, 10, 15, 6, 47, 57] and reduction of sample complexity [20, 7, 49, 52, 62] provides a path towards making training in the real world feasible.
From the previous major successes of machine learning we see that the goal the community sets as a benchmark is usually the goal the community eventually solves. Since we want to solve RL in the real world, that is the goal we should set. Adding a real-world benchmark environment to the set of canonical reference tasks such as Atari games and MuJoCo creatures would enable future research to take into account the applicability of newly proposed methods to the real world.
In this work we present a prototype OffWorld Gym environment (video of the agent during learning in the real environment: https://www.youtube.com/watch?v=NBwre0npeLg). OffWorld Gym is a collection of environments with real-world robotic tasks to benchmark RL methods in the real world. This work is based on our first environment that includes one task of navigating towards a visual beacon on an uneven terrain while relying on visual observation only. The tasks that we will formulate in our next environments will address general robotic challenges such as locomotion, navigation, planning, obstacle avoidance, object manipulation, etc. The methods that the research community will find to achieve robust performance on these tasks can then be naturally transferred to the corresponding applications in the real world and industrial robotics.
OffWorld Inc. is committed to providing long-term support and maintenance of the physical environments, as well as constructing additional units to meet the demand.
II Related Work
Publicly available simulated environments play an important role in the development of RL methods, provide a common ground for comparing different approaches, and make it possible to track the progress of the field. Simulated environments address various general aspects of reinforcement learning research such as control, navigation [3, 31, 29, 43], physical interactions and perception. More domain-specific environments explore such fields as robotics [32, 1, 68] and autonomous driving.
Following the signs of applicability of RL in real-world robotics, RL-oriented hardware kits have become available in the past year to support the development of reproducible RL in robotics research [16, 67]. Mandlekar et al. and Orrb et al. introduce platforms for generating high-fidelity robot interaction data that can be used to pre-train robotic RL agents.
OpenAI Gym has provided an elegant ecosystem and an abstraction layer between the learning algorithms and the environments. Currently OpenAI Gym supports classical control tasks and environments such as Atari, MuJoCo, Box2D and OpenAI robotics environments based on MuJoCo that support simulated creatures, the Fetch research platform and the Shadow Dexterous Hand™. OpenAI Gym was created to provide a benchmarking platform for RL research by introducing strict naming and versioning conventions (Name-version), making it possible to compare the results achieved by different algorithms and track the progress in the field.
Zamora et al. introduced an interface to integrate the Gazebo robotics simulator with the OpenAI Gym ecosystem, making it possible to extend the set of possible RL environments to any that can be simulated in Gazebo. In their recent work, James et al. introduced a toolkit for robot learning research based on the V-REP simulator. Another step in this direction is the PyRobot project, which provides a high-level interface for control of different robots via the Robot Operating System (ROS).
Although these tools provide easy access to a variety of environments with a focus on specific tasks, these publicly accessible environments are still limited to simulation. The very few projects that have provided physical systems for community-driven robotics research are the LAGR project from DARPA, Georgia Tech’s Robotarium and TeleWorkBench from Bielefeld University. While being the closest to the concept of OffWorld Gym, the LAGR program has concluded and is no longer active. TeleWorkBench and Robotarium did not postulate a specific task and do not serve as a benchmark challenge. Robotarium’s maximum script execution time of 600 seconds makes it unsuitable for RL research. Moreover, none of the previous systems provided close integration into the modern RL research ecosystem, proposed specific and version-controlled challenges, or had the same level of public accessibility as OffWorld Gym.
III OffWorld Gym
OffWorld Gym is a framework with the goal of enabling the machine learning community to advance reinforcement learning for real-world robotics by validating and comparing different learning methods on a collection of real-world tasks. The framework consists of real-world environments and their simulated replicas along with the necessary hardware and software infrastructure to access and run the experiments.
The prototype real-world environment OffWorldMonolith-v0 that we present in this work is a navigation task, where a wheeled robot has to traverse uneven Moon-like terrain to reach an alluring visual beacon introduced by Kubrick et al. The robot receives visual input from an RGBD camera and nothing else, and operates in a four-action discrete action space (left, right, forward, backward). A sparse reward of is assigned when the robot (Husarion Rosbot, dimensions cm) approaches the monolith within the radius of cm. The environment is reset upon successful completion of the task, reaching the limit of steps or approaching the boundary of the environment. After the reset the robot is moved to a random position with a random orientation. Figure 1 shows the image of the environment and the input stream that the robot receives.
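The reward and reset rules described above can be sketched as a small pure function. This is only an illustration, not the environment's actual control script: the names `EpisodeConfig` and `evaluate_step` are hypothetical, and the goal radius, step limit and reward magnitude are left as required parameters rather than given defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeConfig:
    """Hypothetical container for the task constants (no defaults on purpose)."""
    goal_radius_m: float   # distance to the monolith that counts as success
    max_steps: int         # step limit per episode
    goal_reward: float     # sparse reward granted on success


def evaluate_step(distance_to_goal_m, steps_taken, near_boundary, cfg):
    """Return (reward, done) for one step of a sparse-reward navigation task."""
    if distance_to_goal_m <= cfg.goal_radius_m:
        return cfg.goal_reward, True   # task completed
    if near_boundary:
        return 0.0, True               # robot approached the enclosure boundary
    if steps_taken >= cfg.max_steps:
        return 0.0, True               # step limit reached
    return 0.0, False                  # keep going, no intermediate reward


# Arbitrary illustrative values, not the ones used in the real environment.
example_cfg = EpisodeConfig(goal_radius_m=0.5, max_steps=10, goal_reward=1.0)
```

Because the reward is sparse, every non-terminal step and every unsuccessful termination returns zero; only reaching the goal radius yields a non-zero signal.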
III-A Physical environment
The real instance of the environment is an enclosure of size meters and is designed to visually emulate the lunar surface. The ground layer is covered with small lava rocks that create an uneven terrain that is challenging for the robot (Husarion Rosbot) to traverse and prevents the robot from having stable visual observations. The enclosure provides power to the robot, a network connection to the workstation that is running the environment, and two overhead cameras that allow the user to monitor the environment. An HTC Vive™ tracker and two base stations are used to localize the robot within the environment. The location information is not made available to the learning agent and is used internally by the environment control script to calculate the rewards, reset the environment and specify initial locations at the beginning of an episode.
We define a 3D transformation matrix to allow transformation from the tracker’s coordinate frame to the world coordinate frame (defined as the center of geometry of the enclosure), and another 3D transformation matrix for transformation from the tracker’s coordinate frame to the robot’s coordinate frame. These transformation matrices help determine the robot’s location with respect to the world’s coordinate frame at any time during the experiment. Upon an environment reset the robot is moved to a randomly chosen spawn location using localization information from the HTC Vive setup and motion control using ROS’s move_base navigation and path planning package. Figure 2 shows the internal representation of the real environment that is used by the OffWorld Gym server to control and monitor the environment.
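As a rough illustration of how two such transforms combine, the sketch below composes 4x4 homogeneous matrices to obtain the robot's planar pose in the world frame. It assumes that `world_from_tracker` maps tracker-frame coordinates to the world frame and `robot_from_tracker` maps tracker-frame coordinates to the robot frame; all function names and the numeric values are hypothetical and are not taken from the OffWorld Gym code.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def planar_transform(yaw, x, y, z=0.0):
    """Homogeneous transform: rotation about the z axis, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def robot_pose_in_world(world_from_tracker, robot_from_tracker):
    """Return (x, y, yaw) of the robot frame expressed in the world frame."""
    # world <- tracker <- robot
    world_from_robot = matmul(world_from_tracker, rigid_inverse(robot_from_tracker))
    x, y = world_from_robot[0][3], world_from_robot[1][3]
    yaw = math.atan2(world_from_robot[1][0], world_from_robot[0][0])
    return x, y, yaw


# Illustrative values only: tracker frame rotated 90 degrees in the world and
# offset by (1, 2) m; tracker origin sits 0.1 m behind the robot origin.
pose = robot_pose_in_world(planar_transform(math.pi / 2, 1.0, 2.0),
                           planar_transform(0.0, -0.1, 0.0))
```

Using the closed-form inverse of a rigid transform avoids a general matrix inversion and keeps the example free of external dependencies.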
III-B Simulated analog
The simulated instance of the OffWorld Gym environment is created using the Gazebo simulation software and provides a close replica of the physical environment. It replicates the dimensions, the physical parameters of the real system such as the mass and friction of the robot, the reward and reset criteria, and the visual appearance as closely to the real environment as possible.
In addition to the default applications of the simulated environment, such as algorithm development and preliminary testing of the agent, the close match between the OffWorld Gym simulated and real instances provides a platform for sim2real research.
III-C Architecture of the system
OffWorld Gym consists of three major parts: (a) a Python library that is running on the client machine, (b) the server that handles communication, resource management and control of the environment (reward, episode reset, etc.), and (c) the physical environment that includes power and network infrastructure and the robot itself. Figure 3 provides an overview of the architecture, its components and interactions.
The Husarion Rosbot is equipped with an ASUS Up Board (Quad Core Intel CPU, Ubuntu 16.04) on-board computer, Orbbec Astra RGBD camera and a CORE2-ROS robot controller. The robot controller runs the firmware layer and the on-board computer runs the sensor drivers, ROS sensor packages and robot motion controller ROS package. Since all of the learning happens on the client workstation, the on-board capabilities of the robot can be kept minimal. An Intel NUC (Core i7, 32 GB RAM, Ubuntu 16.04) computer runs the OffWorld Gym Server, the robot mission management software and the ROS packages that control the environment. An IBM workstation (Intel Xeon, 32 GB RAM, Nvidia Quadro, Ubuntu 16.04) interfaces with the HTC Vive lighthouse setup. It runs the HTC Vive driver and a ROS package which publishes the robot’s localization data.
The OffWorld Gym library provides the API to access the environment. The client side of the library integrates with the code of the RL agent and handles the requests that are being issued by the agent, forwarding them to the server. The server controls the resource management and, if the client has access, transforms the request into a sequence of ROS requests that are then forwarded to the ROS action server that is controlling the physical environment. The ROS action server validates each command and forwards it to the robot. Physical execution of the action by the robot has the largest time requirement and can take up to 4 seconds, making the network latency (up to 200 ms, depending on the geographical location with respect to the OffWorld Gym server) and data transmission delays negligible. The robot completes the requested action (movement, position reset, etc.) and sends the final telemetry readings back to the action server. The server pre-processes the telemetry and creates the state variable that is sent back to the client as the observation for the agent. The user does not have direct access to the robot and can only communicate via the established set of telemetry messages and control commands. The control logic and learning run on the user’s workstation, and the user is thus free to explore any algorithmic solution and make use of any amount of computational resources at their disposal.
We have closely followed the ecosystem established by OpenAI Gym so that the deployment of an agent in our environment requires minimal change when switching from any other gym environment. Listing 1 illustrates the conceptual blocks of the program that uses our environment to train a reinforcement learning agent.
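The conceptual blocks of such a program can be sketched with the classic Gym-style `reset`/`step` loop. To keep the sketch runnable without any robot or library, it uses a hypothetical `StubNavigationEnv` stand-in and a `RandomAgent`; neither is part of the real offworld_gym API, and the success probability and step limit are arbitrary.

```python
import random


class StubNavigationEnv:
    """Stand-in with a Gym-style interface; NOT the real OffWorld Gym environment."""
    N_ACTIONS = 4  # left, right, forward, backward

    def __init__(self, max_steps=20, seed=0):
        self._rng = random.Random(seed)
        self._max_steps = max_steps
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self._observe()

    def step(self, action):
        assert 0 <= action < self.N_ACTIONS
        self._steps += 1
        reached_goal = self._rng.random() < 0.05   # arbitrary success chance
        done = reached_goal or self._steps >= self._max_steps
        reward = 1.0 if reached_goal else 0.0      # sparse reward
        return self._observe(), reward, done, {}

    def _observe(self):
        # A real environment would return an RGBD image here.
        return [self._rng.random() for _ in range(4)]


class RandomAgent:
    """Placeholder agent; a learning agent would update itself in observe()."""

    def __init__(self, n_actions, seed=0):
        self._rng = random.Random(seed)
        self._n_actions = n_actions

    def act(self, observation):
        return self._rng.randrange(self._n_actions)

    def observe(self, transition):
        pass  # e.g. store the transition in a replay buffer and train


def run_episode(env, agent):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = agent.act(observation)
        next_observation, reward, done, _info = env.step(action)
        agent.observe((observation, action, reward, next_observation, done))
        observation = next_observation
        total_reward += reward
        steps += 1
    return total_reward, steps


episode_reward, episode_steps = run_episode(StubNavigationEnv(seed=1), RandomAgent(4))
```

Because the loop only relies on `reset` and `step`, swapping the stub for another environment that follows the same convention should leave `run_episode` unchanged.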
To deploy an agent in an OffWorld Gym environment, a user has to install the offworld_gym Python library and register with the resource management system (see https://gym.offworld.ai for details).
We ran a vanilla DQN to provide the baseline performance and behavior in both real and simulated environments. We used a sparse reward of with no step penalty. The agent was rewarded for approaching the monolith within the radius of cm. The environment was reset when either the goal was completed, the agent exceeded the maximum number of actions per episode, or it breached the boundary of the environment. Figure 4 shows the results in simulation that confirm the baseline architecture’s ability to learn the navigation task with the additional challenge of the unstable camera field of view. In simulation, the agent achieves intelligent behavior and plateaus at 2000 episodes. If similar sample efficiency can be achieved in the real environment as well, then the whole end-to-end learning process will take approximately 50 hours.
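To put the 50-hour estimate in perspective, the short calculation below converts it into an average wall-clock budget per episode. The per-action figure in the second step is only an upper-bound assumption (every action taking 4 seconds, the maximum execution time of a single action reported for the physical robot) and ignores the time needed to reset the robot; the helper names are hypothetical.

```python
def seconds_per_episode(total_hours, n_episodes):
    """Average wall-clock time available per episode, in seconds."""
    return total_hours * 3600.0 / n_episodes


def actions_per_episode(episode_seconds, seconds_per_action):
    """Number of actions that fit in an episode if resets took no time."""
    return episode_seconds / seconds_per_action


# 50 hours spread over 2000 episodes.
episode_budget_s = seconds_per_episode(50, 2000)

# Assumption: each action takes the full 4 s and reset time is ignored.
slow_action_count = actions_per_episode(episode_budget_s, 4.0)

print(f"{episode_budget_s:.1f} s per episode, "
      f"about {slow_action_count:.1f} four-second actions")
```

The average budget works out to 90 seconds per episode, which is the figure to compare against the step limit and the reset procedure when planning real-world runs.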
The network architecture consisted of a visual input, followed by three convolutional layers of sizeand , the last one corresponding to the number of actions the agents had. In total the network had 3381 trainable parameters.
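Since the final layer produces one value per action, the way a DQN-style agent uses those outputs can be sketched as follows. This is the textbook Q-learning rule, not the authors' training code: the discount factor `gamma`, the exploration rate `epsilon` and all function names are assumptions, and a full DQN would typically compute `next_q_values` with a separate target network.

```python
import random

ACTIONS = ("left", "right", "forward", "backward")


def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def epsilon_greedy_action(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def td_target(reward, next_q_values, done, gamma):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward  # no bootstrapping from a terminal state
    return reward + gamma * max(next_q_values)


# Illustrative network outputs for the four actions.
q_out = [0.1, 0.5, 0.2, 0.0]
chosen = ACTIONS[epsilon_greedy_action(q_out, epsilon=0.0, rng=random.Random(0))]
```

With a sparse reward, the target for almost every transition is just the discounted bootstrap term, which is one reason such tasks tend to need many episodes.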
Visual analysis of the behavior of the resulting simulated agent demonstrates that it has adapted to the terrain of the environment, learning to avoid ridges that are hard to overcome and developing behavior to use flat regions of the ground to reach the target. We expect that the same behavior will also manifest in the real environment. We see this as a promising first step towards developing obstacle avoidance behavior (in a correspondingly configured environment) without any explicit programming.
Figure 5 shows the performance of the same network architecture retrained in the real environment. It was not able to achieve intelligent behavior, highlighting the differences between learning in real versus simulated environments and emphasizing the importance of the RL community dedicating more time to finding algorithmic approaches and architectures that will result in robust learning and behavior in real environments.
In this work we presented a real-world environment for reinforcement learning research in robotics. We intend it to serve as a benchmark for RL research, making it possible to test learning algorithms not only in simulated and game environments, but also on real robots and real-world tasks.
Real physical environments pose significant challenges for the speed of progress in RL research. Inability to run the experiments faster than real time, mechanical difficulties with the robots and supporting mechanisms, unpredictable behavior of the real physical medium, the cost of the system, and the additional time for resetting the environment between episodes are major technical challenges that have slowed down the advancement of RL in real robotics. Furthermore, in a simulated environment we can engineer any reward schema required by the experimental setup, whereas in the real world reward specification is limited by the sensors a robot has and their robustness. Despite all these challenges, the alternative – robotic simulation – can only partially address all relevant aspects of real robotic behavior. For the real deployment of RL systems the community will have to face the above-mentioned challenges. We hope that interaction with OffWorld Gym will provide valuable insights into these challenges and facilitate the search for solutions to them.
The OffWorld corporation is committed to providing long-term support of OffWorld Gym environments to ensure that they can serve as a benchmark for RL research. By taking care of the maintenance of both the hardware and software components of the system, as well as the construction of additional environments, OffWorld ensures that the RL community can focus on finding algorithmic solutions to the challenges of deploying RL systems in the real world.
The OffWorld Gym architecture has been designed so that all of the real-world complexity is abstracted from the user. Experiments can be easily run not only in simulation but also in the real world with real robots, relieving the user of the burden of managing a physical robotics system. Close integration into the existing ecosystem of OpenAI Gym makes it possible to use the environment without any prior experience in robotics, abstracting it under a familiar API. The scalability of the system is addressed by monitoring user activity via the time booking system and building additional physical environments to meet the demand.
The existence of a simulated environment that is a close replica of the real environment as part of the same framework makes it possible not only to set up and validate an experiment in simulation ahead of real deployment, but also to experiment with learning techniques that rely on pre-training in simulation, domain adaptation to close the reality gap, domain randomization and other techniques aiming to reduce the sample complexity of RL in the real world.
We have conducted experiments with training the same network architecture in simulated and real environments. The results have shown that even when the simulated replica of the environment is made to closely match the real one, the actual learning task is different, and the same architecture and learning process that was able to solve the task in simulation is not likely to work in the real world. This highlighted both the need to focus our efforts on learning in the real world and the benefit of having a common benchmark to track the progress of those efforts. We hope that OffWorld Gym will become one such benchmark.
We have deployed our first OffWorld Gym environment for public access both in real and in simulation and demonstrated the full availability of the framework to benchmark RL methods on a robot navigation task in uneven terrain. This environment is available to the community to start development and training of new RL methodologies.
Our future work includes building and releasing more environments with different tasks, starting with a navigation task that includes the challenge of static obstacle avoidance. An agent that will learn to solve this task will effectively develop obstacle avoidance without any explicit programming. For our follow-on task definitions, we aim to maintain a focus on industrial robotic tasks in unstructured environments, striving towards general applicability of the methodologies that will be discovered inside of these environments to real-world applications.
Future work also includes benchmarking of existing RL algorithms, imitation learning methods, transfer of the agents trained in simulation to the real environment, and of other RL techniques. This research will show which methods are most efficient in terms of sample complexity, optimality and robustness of achieved behaviour and their resilience to the different kinds of noise (environment, sensory, reward, action) a real environment presents.
Thank you Eric Tola, Matt Tomlinson, Matthew Schwab and Piyush Patil for your help with the mechanical and electrical design and implementation.
-  (2018) Ingredients for robotics research. Note: https://openai.com/blog/ingredients-for-robotics-research/ Cited by: §II.
-  (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177. Cited by: §I.
-  (2016) Deepmind lab. arXiv preprint arXiv:1612.03801. Cited by: §II.
- The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §I, §II.
-  (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §II.
-  (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307. Cited by: §I.
-  (2018) Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214. Cited by: §I.
-  (2018) TarMAC: targeted multi-agent communication. arXiv preprint arXiv:1810.11187. Cited by: §I.
-  (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: §II.
-  (2017) One-shot imitation learning. In Advances in neural information processing systems, pp. 1087–1098. Cited by: §I.
-  (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §I, §I.
-  (2019) Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901. Cited by: §I.
-  (2019) Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995. Cited by: §I.
-  (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58. Cited by: §I.
-  (2017) One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905. Cited by: §I.
-  (2019) Quasi-direct drive for low-cost compliant robotic manipulation. arXiv preprint arXiv:1904.03815. Cited by: §II.
-  (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §I.
-  (2019) An investigation of model-free planning. arXiv preprint arXiv:1901.03559. Cited by: §I.
-  (2014) Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pp. 3338–3346. Cited by: §I.
-  (2018) Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, pp. 5302–5311. Cited by: §I.
-  (2018) Learning an embedding space for transferable robot skills. Cited by: §I.
- Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455. Cited by: §I.
-  (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952. Cited by: §I.
-  (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §I.
-  (2016) Vime: variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117. Cited by: §I.
-  (2019) ROSBot pro 2.0. Note: https://husarion.com/manuals/rosbot-manual/ Cited by: §III.
-  (2006) The darpa lagr program: goals, challenges, methodology, and phase i results. Journal of Field robotics 23 (11-12), pp. 945–973. Cited by: §II.
-  (2019) PyRep: bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176. Cited by: §II.
-  (2016) The malmo platform for artificial intelligence experimentation.. In IJCAI, pp. 4246–4247. Cited by: §II.
-  (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §I.
-  (2016) Vizdoom: a doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. Cited by: §II.
-  (2017) Roboschool. Cited by: §II.
-  (1968) 2001: a space odyssey. Cited by: §III.
-  (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §I.
-  (2015) DeepMPC: learning deep latent features for model predictive control.. In Robotics: Science and Systems, Cited by: §I.
-  (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §I.
-  (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §I.
-  (2018) Deep reinforcement learning. arXiv preprint arXiv:1810.06339. Cited by: §I.
-  (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I.
-  (2019-06) ORRB: openai remote rendering backend. In eprint arXiv. Cited by: §II.
-  (2018) Benchmarking reinforcement learning algorithms on real-world robots. arXiv preprint arXiv:1809.07731. Cited by: §I.
-  (2018) ROBOTURK: a crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790. Cited by: §II.
-  (2019) Habitat: A Platform for Embodied AI Research. arXiv preprint arXiv:1904.01201. Cited by: §II.
-  (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §I, §I.
-  (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §I.
- PyRobot: an open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236. Cited by: §II.
-  (2018) Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. Cited by: §I.
-  (2018) OpenAI five. Note: https://blog.openai.com/openai-five/ Cited by: §I.
-  (2019) Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161. Cited by: §I.
-  (2017) The robotarium: a remotely accessible swarm robotics research testbed. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1699–1706. Cited by: §II.
-  (2016) Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286. Cited by: §I.
-  (2019) Addressing sample complexity in visual tasks using hindsight experience replay and hallucinatory gans. Cited by: §I.
-  (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §I.
- High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §I.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §I.
-  (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §I.
-  (2019) Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948. Cited by: §I.
-  (2018) Reinforcement learning: an introduction. MIT press. Cited by: §I.
-  (2006) Teleworkbench: a teleoperated platform for multi-robot experiments. In Proceedings of the 3rd International Symposium on Autonomous Minirobots for Research and Edutainment (AMiRE 2005), pp. 49–54. Cited by: §II.
-  (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §I.
-  (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §I, §II.
-  (2019) Learning latent state representation for speeding up exploration. arXiv preprint arXiv:1905.12621. Cited by: §I.
-  (2019) AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. Note: https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ Cited by: §I.
-  (2018) Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760. Cited by: §I.
-  (2018) Gibson env: real-world perception for embodied agents. In , pp. 9068–9079. Cited by: §II.
-  (2017) Collective robot reinforcement learning with distributed asynchronous guided policy search. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 79–86. Cited by: §I.
-  (2019) REPLAB: a reproducible low-cost arm benchmark platform for robotic learning. arXiv preprint arXiv:1905.07447. Cited by: §II.
-  (2016) Extending the openai gym for robotics: a toolkit for reinforcement learning using ros and gazebo. arXiv preprint arXiv:1608.05742. Cited by: §II, §II.
-  (2015) Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791. Cited by: §I.
-  (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Cited by: §I.