Robots are increasingly present in our lives: from production lines to homes, hospitals and schools, we rely more and more on their presence. Robots working directly with humans are either controlled directly by human input, pre-programmed to follow a pre-planned choreography of movements between the human and the robot, or programmed to follow a control law. Having an intelligent robot that can learn a task and adapt to previously unseen situations, while interacting with a human, is a major challenge in the field.
The rapid advance of artificial intelligence has led researchers to create intelligent agents controlling robots in simulated environments, or real-world robots in specific use cases. The majority of these intelligent systems are based on deep learning [4, 5]. While these systems produce impressive results in automation, they also result in what is referred to as a “black-box” algorithm, i.e. the decision-making process, and the factors affecting it, are not clear to the human user. A human interacting with such a system will not realise how their actions are being interpreted by the robotic system, and how they lead to robot actions. These systems are therefore not explainable. For a human interacting with such a system, not knowing how the robot’s actions relate to their own behaviour or the environment can lead to reduced trust, reliability, safety and efficiency for the overall interaction.
Reinforcement learning is one of the promising solutions to the intelligent robotics problem [6, 4, 7]. Most of these techniques use some form of optimization to solve a task in a given robotics environment; however, very few actually look at the inherent structure of the tasks, or are concerned with creating higher-level representations which are interpretable by an interacting human user.
In this work, our main goal is to create an intelligent robotic agent that can solve manipulation tasks by learning some form of curriculum as well as a structure of movement, in a manner that preserves interpretability for a human operator. Our solution has a hierarchical structure where a complex task is split into multiple low-level, consecutive, simpler actions. This idea builds on the concept of action grammars governing human behaviour, which we previously used in a human-robot interaction scenario.
In our proposed hierarchy, the high-level agent divides the full task into smaller (easier) actions that a low-level agent can learn to fulfil. In this manner, the low-level agent learns to make sense of the many degrees of freedom of the system through a curriculum learning approach, while the high-level agent learns the overall dynamics of the environment and the task at hand, creating a high-level representation governing the decision-making process. Effectively, the high-level agent serves as an interpreter between the human, the environment and the low-level agent controlling the robot’s many degrees of freedom. This serves as a fundamental step towards intelligent robots operating through reinforcement learning that is explainable to human users.
The paper is structured as follows: Section II covers the necessary background upon which our method is built. Section III presents our proposed algorithm, Dot-to-Dot, with its design and implementation explained in detail. Section IV reports the results of training and task performance for Dot-to-Dot, followed by a discussion on the inner representation the high-level agent creates of its world, and its interpretability. Finally, Section V concludes the paper and covers potential future work.
In this section, we detail the building blocks of our work. Our algorithm relies on three existing algorithms and concepts, namely Deep Deterministic Policy Gradients (DDPG), Hindsight Experience Replay (HER) and Hierarchical Reinforcement Learning, which we describe below.
Let us first define a few notations used throughout this paper: an observation (i.e. state) space S, an action space A and a set of goals G. For example, in the case of a robotic arm that needs to push a cube around a table, an observation s in S could be the position of the gripper and the position of the cube, a goal g in G can be a position on the table where the cube needs to be moved to, and A can be a set of actions that end up moving the gripper. We define g as the goal of an episode, sampled at t = 0 when resetting the environment, ag_t an achieved goal at time t (e.g. the position of the cube at time t), and sg a defined sub-goal which will be a waypoint to g. Finally, r_t is the reward at time t and R_t the cumulative reward obtained by the agent at time t.
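As a concrete illustration of this goal-conditioned setup, the sparse reward used in such environments can be sketched as follows (a minimal sketch; the distance threshold is an illustrative value, not taken from the paper):

```python
import numpy as np

def sparse_reward(achieved_goal, goal, threshold=0.05):
    """Sparse reward as used in Fetch-style goal-conditioned environments:
    0 if the achieved goal ag_t is within `threshold` of the goal g,
    -1 otherwise. (`threshold` is an illustrative value.)"""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal))
    return 0.0 if distance < threshold else -1.0
```

Under this reward, almost every step of a random episode yields -1, which is precisely the sparsity that motivates hindsight techniques later in this section.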
II-A Deep Deterministic Policy Gradients
The Deep Deterministic Policy Gradients (DDPG) method combines several techniques in order to solve continuous control tasks. DDPG is an extension of the Deterministic Policy Gradients (DPG) algorithm. DPG uses an actor-critic architecture where the actor’s role is to take actions in the environment while the critic is in charge of assessing how useful these actions are to complete a given task. In other terms, both the actor and the critic are updated using the temporal-difference error of the critic’s value function. Moreover, DPG uses a parameterized function mu(s) as the actor and Q(s, a) as the critic function, which is learned through Q-learning. The updates of the actor consist in applying the gradients of the expected return with respect to the actor parameters.
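The deterministic policy gradient update can be illustrated with linear function approximators standing in for DDPG's deep networks (a toy sketch under our own assumptions, not the paper's implementation): the actor's parameters are moved along the gradient of the critic's value with respect to the action.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2

# Linear function approximators (stand-ins for the deep networks in DDPG):
theta = rng.normal(size=(action_dim, state_dim)) * 0.1   # actor: a = theta @ s
w = rng.normal(size=(state_dim + action_dim,)) * 0.1     # critic: Q(s, a) = w @ [s; a]

def actor(s):
    return theta @ s

def critic(s, a):
    return w @ np.concatenate([s, a])

def actor_update(s, lr=0.01):
    """Deterministic policy gradient step: move the actor's parameters along
    grad_theta mu(s) * grad_a Q(s, a). For the linear critic, grad_a Q is
    simply the action part of w."""
    global theta
    grad_a_Q = w[state_dim:]              # dQ/da for the linear critic
    theta += lr * np.outer(grad_a_Q, s)   # chain rule through a = theta @ s

s = rng.normal(size=state_dim)
q_before = critic(s, actor(s))
for _ in range(50):
    actor_update(s)
q_after = critic(s, actor(s))
# Q(s, mu(s)) increases as the actor ascends the critic's action gradient.
```

In real DDPG, both networks are deep, the critic is trained on the TD error using target networks, and gradients are backpropagated automatically; the sketch only shows the actor's deterministic policy gradient step.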
II-B Hindsight Experience Replay
In Hindsight Experience Replay (HER), Andrychowicz et al. use an elegant trick to train agents faster in large state and action spaces. Observing that very few traces actually yield a positive outcome (i.e. the goal being reached), the authors propose to make use of every trace, no matter whether the goal was reached or not. In fact, they note that regardless of what the objective of a series of actions is, and no matter the outcome, we can still acquire valuable information from experience; i.e. we can still learn how every state of a trace was reached by looking at the previous states visited and the actions taken in those states. During an episode, HER stores traces of states, actions and goals (s_t, a_t, g) for t = 0, ..., T. Then, at training time, for a proportion of these traces, HER replaces the goal g with a randomly selected achieved goal ag_k with k ~ U(t, T),
U being the uniform distribution; meaning that it assumes the incorrect state we ended up in was, in fact, our goal. Therefore, in hindsight, we look at goals that were achieved instead of the original goal, learning from mistakes. This technique proved to be greatly successful for diverse robotics tasks in simulation.
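The hindsight relabelling step can be sketched as follows, using the "future" strategy described above (a minimal sketch; the trace representation and relabel probability are illustrative assumptions):

```python
import random

def her_relabel(trace, relabel_prob=0.8, rng=random.Random(0)):
    """Hindsight relabelling sketch ("future" strategy): for each transition
    at time t, with probability `relabel_prob`, replace the episode goal with
    an achieved goal sampled uniformly from a later step of the same trace.
    `trace` is a list of dicts with keys 'obs', 'action', 'achieved_goal',
    'goal' (an illustrative representation, not the paper's data structure)."""
    relabeled = []
    for t, transition in enumerate(trace):
        new = dict(transition)
        if t + 1 < len(trace) and rng.random() < relabel_prob:
            future = rng.randrange(t + 1, len(trace))
            new['goal'] = trace[future]['achieved_goal']
        relabeled.append(new)
    return relabeled
```

Every relabeled transition is now guaranteed to "reach" its (substituted) goal at some later step, which turns otherwise reward-less traces into useful training signal.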
II-C Hierarchical Reinforcement Learning
Hierarchical reinforcement learning is a technique that intends to address the problem of planning how to solve a task by introducing various levels of abstraction. Most of the time it involves several policies which interact with each other, often as a high-level policy dictating which of another set of policies to use and when. One such structure involves several low-level policies called options, each of which learns how to complete a specific task, while a high-level policy decides which low-level option to use and when.
Our contribution builds on top of these techniques, combining them to create an agent that achieves structured robotic manipulation. Closely related work is that of Nachum et al., where a similar hierarchical structure is used to solve exploration tasks in a data-efficient manner, using observations as goals. In other related work, another hierarchical structure is used to solve common toy environment tasks. The latter uses different low-level policies for each sub-goal while the former focuses on exploration tasks. In our work, on the other hand, we focus on the inherent structure of robotic manipulation and on reusing low-level skills across different steps of a task, as well as on explainability.
III Design and Implementation
We introduce a hierarchical reinforcement learning architecture which we call Dot-to-Dot (DtD). This comes from the fact that our architecture is made of a high-level agent that generates sub-goals for a low-level agent, which follows them one by one to achieve the high-level task – resembling the children’s game of the same name, where connecting dots creates an overall drawing.
To implement this, we define two policies: a low-level one, pi_low, and a high-level one, pi_high. The low-level policy is trained using DDPG and takes as inputs observations as well as goals generated by the high-level policy pi_high. This is described in figure 1(a).
To teach our agent complex sequences of actions in a sparse reward setup, we note that it is easier to learn how to reach nearby goals than goals further away. This defines the DtD method: in order to train the low-level agent in an efficient manner, for a given starting point in an episode, pi_high (i.e. the high-level agent) first generates sub-goals that are in the vicinity of the current position. This is done by making the first sub-goal a noisy perturbation of the initially achieved goal, formally: sg_0 = ag_0 + eps, with eps ~ N(0, sigma),
N being a Gaussian distribution and sigma the noise parameter, effectively ignoring the final goal g. Doing so, the low-level policy can be trained easily on reaching these newly defined sub-goals. The other central idea of DtD is to ignore whether or not goals and sub-goals have been achieved during an episode, and to use HER for both pi_low and pi_high to extract as much information as possible from past experience.
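A sketch of this early-training sub-goal generation, assuming the goal is a Cartesian position (the noise scale sigma is an illustrative value, not taken from the paper):

```python
import numpy as np

def initial_subgoal(achieved_goal, sigma=0.05, rng=np.random.default_rng(0)):
    """Early-training sub-goal generation: a noisy perturbation of the
    currently achieved goal, sg = ag + eps with eps ~ N(0, sigma^2 I),
    ignoring the episode's final goal. `sigma` is illustrative."""
    achieved_goal = np.asarray(achieved_goal, dtype=float)
    return achieved_goal + rng.normal(scale=sigma, size=achieved_goal.shape)
```

Because the sub-goal is always close to where the agent already is, the low-level policy sees reachable targets from the very first episodes.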
Let us take an example of a trace and describe the training process. Figure 1(b) shows an episode with two sub-goals.
In this example, observations (s_t) are represented in blue, sub-goals (sg_i) in green and the final goal (g) in red; ag refers to achieved goals. First, we use pi_high to generate sg_0. As mentioned above, at first sg_0 is a noisy perturbation of the initially achieved goal, such that sg_0 = ag_0 + eps. Then pi_low tries to reach sg_0 in a certain number of steps and reaches ag_1. At the beginning of training, we mostly have ag_1 != sg_0, which is fine, as we ignore this and generate a new sub-goal such that sg_1 = ag_1 + eps. Again, pi_low generates actions to reach sg_1 but instead reaches ag_2. Finally, as we want our algorithm to learn which traces are useful for reaching a given goal, and which are not, we constrain the last sub-goal to be equal to the actual final goal g. Doing so, we can learn whether a sequence of actions and sub-goals can reach a given goal or not.
Once this episode is done, we have obtained a series of states (s_t) reached by pi_low, sub-goals (sg_i) generated by pi_high and actually achieved goals (ag_i). In order to train our agents, we use DDPG in combination with HER. The low-level agent is therefore easier to train, as it needs to reach goals closer to its current state. The interesting part is the training of the high-level agent pi_high: as HER allows for replacement of the actual goal g with some achieved goal ag, it makes training more efficient by learning which sub-goal can make pi_low reach which goal.
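The episode structure just described can be sketched end-to-end as follows (all names and the environment interface here are our own illustrative assumptions, not the paper's implementation):

```python
def dtd_episode(env, pi_low, pi_high, n_subgoals, steps_per_subgoal):
    """Illustrative rollout of one Dot-to-Dot episode: the high-level agent
    proposes sub-goals one at a time, the low-level agent gets a fixed budget
    of steps per sub-goal, and the last sub-goal is forced to equal the
    episode's end-goal. The env interface is a hypothetical stand-in."""
    obs, goal = env.reset()
    achieved = [env.achieved_goal()]
    subgoals, states = [], [obs]
    for i in range(n_subgoals):
        # The last sub-goal is constrained to be the actual end-goal.
        sg = goal if i == n_subgoals - 1 else pi_high(obs, achieved[-1], goal)
        subgoals.append(sg)
        for _ in range(steps_per_subgoal):
            obs = env.step(pi_low(obs, sg))
        achieved.append(env.achieved_goal())
        states.append(obs)
    return states, subgoals, achieved, goal
```

Whether each sub-goal was actually reached is deliberately ignored during the rollout; the collected states, sub-goals and achieved goals are what the hindsight relabelling operates on at training time.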
This method of training intrinsically generates a form of curriculum learning as the overall agent will learn more and more complex tasks along the way. In fact, it will first explore its surroundings, then learn how to generate useful sub-goals for a given goal and finally put both together to solve tasks.
For our experiments we used three robotic environments from the OpenAI Gym robotics suite: FetchPush, FetchPickAndPlace and HandManipulateBlock. These are simulations based on MuJoCo. All of them are goal-oriented setups. In both Fetch environments, the agent is a robotic arm based on the Fetch robot which must move a black cube to a desired goal represented by a red dot, which is either on a table (FetchPush) or in the air (FetchPickAndPlace). For both, the actions are 4-dimensional, with 3 dimensions continuously controlling the position of the gripper and one for opening/closing the gripper. The observations are the Cartesian positions of the gripper and the object, as well as both their linear and angular velocities. In the FetchPush environment, however, the gripper is always set to be closed, which forces the agent to push the cube around, making the task rely heavily on physical properties of the block and table (i.e. friction), which need to be learned by the agent. The Hand environment is a Shadow hand holding a cube which must be manipulated in-hand to reach a desired orientation goal; note that in figure 5, the goal and sub-goal orientations are displayed to the right of the hand. In HandManipulateBlock, the agent directly controls the individual joints of the hand, which makes for a much more challenging task. In this case, the actions are 20-dimensional (24 DoF, 4 of which are coupled), and the observations are the positions and velocities of the 24 joints, as well as the object’s Cartesian position, its rotation as a quaternion, and its linear and angular velocities.
Algorithm 1 presents more details on DtD. Note that the sub-goal selection agent is in charge of initially generating sub-goals as local noisy perturbations of the current achieved goal, gradually handing over to the actor-critic agent for sub-goal inference. This can also be done in an epsilon-greedy manner.
There are a few technical details that we need to address before looking at the actual implementation and the results we obtained. First of all, we made the choice to always force the last sub-goal of an episode to match the actual end-goal set at the beginning of the episode. Another possibility is to fully ignore the actual goal every time we explore, and only set sub-goals as defined above. However, experiments have shown this technique to be quite inefficient, as the agent does not even try to reach the goal, which we believe leads to inefficient training of the policy. Another choice we made is that of replacing goals with achieved goals during training. This results in applying HER to sub-goals when training pi_high’s network. In practice, consider an example with 5 sub-goals. During training we sample an episode stored in our replay buffer, and from this episode we extract the sub-goals, the corresponding observations and achieved goals, and the goal. In classic experience replay training, we would train the network on traces pairing each sub-goal with the original goal g. Instead, similarly to HER, we choose to replace g with a later achieved goal for a certain proportion of traces. For example, we could replace g with the fourth achieved goal ag_4 and train on the corresponding trace instead, virtually making the trace successful, as ag_4 was reached by definition. This proved very efficient during training and will be used in all following experiments.
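The hindsight replacement applied to the high-level agent's replay can be sketched as follows (an illustrative sketch; the function name and trace representation are our own):

```python
def relabel_highlevel_trace(subgoals, achieved_goals, goal, k=None):
    """Hindsight replacement for the high-level agent's replay: produce
    (sub-goal, goal) training pairs where the episode goal is swapped for a
    later achieved goal achieved_goals[k], making the trace successful by
    construction. With k=None the original goal is kept (classic replay)."""
    if k is None:
        return [(sg, goal) for sg in subgoals]
    g_prime = achieved_goals[k]
    return [(sg, g_prime) for sg in subgoals[:k + 1]]
```

For instance, with 5 sub-goals, relabelling at k=3 trains the high-level network on the first four sub-goals paired with the fourth achieved goal, a trace that is successful by construction.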
IV Results and Analysis
IV-A Training performance
Figures 2(a) and 2(b) show the evolution of the success rate of DDPG, HER and DtD over epochs, on the FetchPush and FetchPickAndPlace tasks respectively. We note that DtD is marginally slower than HER at training on the easier task, FetchPush, while being marginally faster on the more difficult one, FetchPickAndPlace. Meanwhile, DDPG did not succeed at the tasks in the given number of epochs. Training was done on five random seeds; the figures show the mean and confidence intervals over these seeds.
IV-B Task performance
In this section we present a few still frames of episodes using the best policy obtained after training. We are interested in assessing whether or not the sub-goals generated by the high-level agent are meaningful and make sense in terms of positioning.
We first present the results obtained on the Fetch environments, where the end-goal is represented by a red dot, while sub-goals are represented by a green dot. We look at configurations with only one sub-goal, as the table is rather small. Note that the last sub-goal is forced to be the end-goal, which is why the red dot turns green in the last frames (Figures 3(c) and 0(d)).
In both environments, figures 3(a) and 0(a) show that the generated sub-goals are indeed located approximately midway between the cube’s initial position and the end-goal. We can also see that the low-level agent does indeed succeed in reaching all sub-goals and completes the task. The FetchPush example also shows that the agent managed to learn some form of concept and representation of its environment, the most obvious one being the fact that sub-goals have to be generated on the tabletop. In fact, during training, the high-level agent is not constrained at all in terms of sub-goal generation. As mentioned in the previous part, sub-goals can be generated anywhere in the vicinity of the cube’s initial location, and can therefore even appear inside the table or in the air. However, traces that include these types of sub-goals will bear very low rewards and therefore force the agent to generate sub-goals close to the tabletop.
Figure 5 shows frames of an episode on the ShadowHand simulation where a cube must be rotated to the goal orientation. We can again see that the sub-goals are generated to be on the way to the end-goal; however, this is harder to observe than in the Fetch environments due to the nature of the task. As figure 4(c) shows, the agent did not manage to reach the first sub-goal in the given time. Despite this miss, we can see that the agent positioned the cube closer to the target anyway, as shown by the yellow side being positioned correctly. Therefore, even though the low-level agent may sometimes miss a sub-goal due to time constraints, the generated sub-goals help reach an end-goal, as shown in Figure 4(d).
IV-C High-level agent’s inner representation
In this part, we look at the way the high-level agent values different regions of the environment as candidates for the low-level agent’s sub-goals. This allows us to interpret the way the agent represents the various environments internally, and makes it easier for humans to read into the decision-making process of the agent, improving explainability. We use a specific configuration of the FetchPush environment, where the initial position and the goal have been chosen to be at opposite corners of the table. The idea is to look at the table from above with the robot north of the table, and discretize both its x and y axes. We then define sub-goals as pairs on the discrete axes: sg = (x_i, y_j). Finally, for each of these sub-goals, we compute pi_high’s Q-values, i.e. the expected value of choosing sg as a sub-goal in the given configuration. A low value means that the sub-goal is not a good candidate, and the overall episode will yield a low cumulative reward.
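This discretised evaluation can be sketched as follows; `toy_q` is a stand-in for the trained high-level critic, with a quadratic form chosen purely for illustration (it peaks midway between a start and a goal corner):

```python
import numpy as np

def subgoal_value_map(q_fn, x_range, y_range, n=20):
    """Evaluate a high-level value function over a discretised tabletop:
    each (x, y) cell is treated as a candidate sub-goal and its Q-value is
    recorded. Returns the grid of values (for a heatmap) and the argmax cell.
    `q_fn` stands in for the trained high-level critic."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    values = np.array([[q_fn(np.array([x, y])) for x in xs] for y in ys])
    i, j = np.unravel_index(np.argmax(values), values.shape)
    return values, (xs[j], ys[i])

# Toy critic preferring sub-goals midway between two opposite corners:
start, goal = np.array([0.0, 0.0]), np.array([1.0, 1.0])
toy_q = lambda sg: -np.linalg.norm(sg - start)**2 - np.linalg.norm(sg - goal)**2
values, best = subgoal_value_map(toy_q, (0, 1), (0, 1), n=21)
```

In practice, only the highest-valued cell would be chosen as the next sub-goal; plotting `values` as a heatmap gives exactly the kind of interpretable picture discussed here.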
In figure 6, we show the two extremes in terms of distance from the initial state to the end-goal. The FetchPush environment is ideal for looking at inner representations as it is almost two-dimensional (goals are always on the tabletop). Figure 5(a) shows this setup. We first look at the Q-values at the very beginning of training, after just one epoch, as shown in figure 5(b). We can see that the values are very close to each other and spread over a small interval, which shows that the high-level agent does not have a clear representation of the environment yet. Despite this lack of a clear representation, the agent seems to attribute higher values to sub-goals closer to the end-goal (to the right of the table) rather than those close to the starting position (to its left). This makes sense, as sub-goals close to the end-goal are most likely to allow the agent to reach its destination, and are therefore the very first sub-goals that lead to successful traces.
Finally, after training, the best policy’s Q-values are represented in figure 5(c). We can now see that the values are spread over a much larger interval, and the higher values therefore mean the associated sub-goals are clearly better candidates. This area of higher value is located well in the middle between the starting position and the end-goal. Note that in practice, the sub-goal will only be the point with the highest value on this heatmap. We can therefore conclude that the agent learnt a good representation of its environment as well as a notion of distance, considering the effects of friction and the dynamics involved in pushing a block around to an end-goal.
We set out to create an agent that can learn to complete tasks in environments that are challenging by nature, being high dimensional and only presenting sparse rewards. We also aimed at finding a structure that equips the agent with the ability to create a representation of its environment that can be easily understood by humans. We achieved this by combining several techniques to produce the Dot-to-Dot algorithm, learning a hierarchical structure of motion and manipulation through curriculum learning. This was implemented and tested through OpenAI Gym and MuJoCo, with the Fetch Robotics Manipulator and the Shadow Hand environments.
In terms of training times we obtained results equivalent to the current baselines; however, we have shown that on top of this, we managed to provide the agent with the ability to produce interpretable representations of its environment. The agent learnt a notion of distance, being able to create waypoints to an end-goal, splitting a complex task into several easier consecutive ones and reusing learnt behaviour across these. We believe this can serve as a fundamental first step to help make robotic agents intelligent while preserving the explainability of their actions.
Future work will focus on improving exploration for sub-goals in the vicinity of a current position; one solution for this could be to use intrinsic motivation and curiosity [24, 25]. Another lead could be to produce more goals that do not necessarily need to be achieved, leading the agent towards a direction instead of having waypoints. Finally, we are interested in testing the algorithm on a real robotic system. This method complements the work we presented previously, making that robotic setup a good candidate for real-world use of Dot-to-Dot.
-  H. Tanaka, K. Ohnishi, H. Nishi, T. Kawai, Y. Morikawa, S. Ozawa, and T. Furukawa, “Implementation of bilateral control system based on acceleration control using fpga for multi-dof haptic endoscopic surgery robot,” IEEE Transactions on Industrial Electronics, vol. 56, no. 3, pp. 618–627, March 2009.
-  A. Bauer, D. Wollherr, and M. Buss, “Human–robot collaboration: a survey,” International Journal of Humanoid Robotics, vol. 5, no. 01, pp. 47–66, 2008.
-  A. Albu-Schäffer, S. Haddadin, C. Ott, A. Stemmer, T. Wimböck, and G. Hirzinger, “The dlr lightweight robot: design and control concepts for robots in human environments,” Industrial Robot: an international journal, vol. 34, no. 5, pp. 376–385, 2007.
-  J. Kober, J. A. D. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” International Journal of Robotics Research, July 2013.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
-  F. Guenter, M. Hersch, S. Calinon, and A. Billard, “Reinforcement learning for imitating constrained reaching movements,” Advanced Robotics, vol. 21, no. 13, pp. 1521–1544, 2007.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3389–3396.
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 41–48.
-  A. Shafti, P. Orlov, and A. A. Faisal, “Gaze-based, context-aware robotic system for assisted reaching and grasping,” in 2019 IEEE International Conference on Robotics and Automation (ICRA), 2019.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
-  M. Andrychowicz, D. Crow, A. K. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in NIPS, 2017.
-  A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, Oct 2003.
-  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ser. ICML’14. JMLR.org, 2014, pp. I–387–I–395.
-  R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb 2015.
-  M. Stolle and D. Precup, “Learning options in reinforcement learning,” vol. 2371, 08 2002, pp. 212–223.
-  O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 3307–3317.
-  A. Levy, R. Platt Jr., and K. Saenko, “Hierarchical actor-critic,” CoRR, vol. abs/1712.00948, 2017.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
-  E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct 2012, pp. 5026–5033.
-  Fetch. [Online]. Available: https://fetchrobotics.com/robotics-platforms/
-  A. Kochan, “Shadow delivers first hand,” Industrial robot: an international journal, vol. 32, no. 1, pp. 15–16, 2005.
-  M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” CoRR, vol. abs/1802.09464, 2018.
-  P.-Y. Oudeyer, J. Gottlieb, and M. Lopes, “Intrinsic motivation, curiosity, and learning: Theory and applications in educational technologies,” vol. 229, Jul 2016.
-  C. Colas, P. Fournier, O. Sigaud, and P. Oudeyer, “CURIOUS: intrinsically motivated multi-task, multi-goal reinforcement learning,” CoRR, vol. abs/1810.06284, 2018.