Reinforcement learning (RL) (Sutton & Barto, 2018) is emerging as one of the promising methods to solve complex decision-making tasks by maximizing human-defined rewards (Silver et al., 2016; Qureshi et al., 2017, 2018; Levine et al., 2016). Currently, the primary focus of RL research is on learning new behaviors (Lillicrap et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018b) rather than composing already acquired skills for a combinatorial outburst in the robot’s abilities (Lake et al., 2017).
In this paper, we propose a novel policy ensemble composition method (supplementary material and videos are available at https://sites.google.com/view/compositional-rl) that takes a basic, low-level, and easily obtainable set of robot policies and learns a composite model through standard or hierarchical RL (Lillicrap et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018b; Dayan & Hinton, 1993; Vezhnevets et al., 2017; Florensa et al., 2017; Nachum et al., 2018). Our composite model combines low-level policies into complex strategies to solve challenging problems. We evaluate our method in challenging environments where learning a single policy is either not possible or does not yield satisfactory results. Our approach has the following salient features. First, it does not assume any structure on the low-level policies. These policies could be entirely agnostic of high-level tasks, and our method can still compose them to solve complex problems. Second, we do not make any assumption on how the low-level policies are obtained. They could be human-defined decision functions or behaviors learned via imitation or RL. Third, our method is trainable through both standard and hierarchical RL algorithms and demonstrates considerable improvements in data efficiency and overall performance.
Figure 1: Composition as a Markov Decision Process. (a) The graphical model of a simple MDP. (b) The augmented graphical model that integrates the composition of sub-level policies. (c) The new MDP with the augmented state space.
2 Related Work
In this section, we present a brief overview of existing solutions to the problem of compositionality and their limitations. We highlight that in the past, the research in learning new primitive skills such as Dynamic Movement Primitives (DMPs) (Schaal et al., 2005) received more attention than building methods to compose already acquired skills. We also mention hierarchical reinforcement learning methods to distinguish them from compositional reinforcement learning methods.
2.1 Composition of control objectives
Compositionality of past skills to acquire new skills has hardly been considered in RL, and to the best of the authors’ knowledge, only a few approaches address this problem, and even then to a limited extent. For instance, Todorov (2009) and Haarnoja et al. (2018a) addressed the problem of compositionality by combining the independent rewards and Q-functions, respectively, of human-defined sub-tasks of the given complex problem. Their methods extract the composite policy by merely maximizing the averaged reward/action-value functions of the individual sub-level tasks. In a similar vein, Sahni et al. (2017) propose a temporal-logic-based composition of low-level policies that solve sub-tasks, defined by human experts, of the given task. In our work, we aim for the composition of general-purpose skills, policies, or motor primitives that are oblivious of complicated high-level tasks and can be reused by the agent wherever and whenever needed, even in different domains/tasks, during operation. Another recent approach (Hausman et al., 2018) uses entropy regularization to learn diverse transferable skills that are composed sequentially to solve the same or a slightly different task. In contrast, our method does not aim to learn composable skills but uses primitive motor skills for composition. Furthermore, we solve the challenge of concurrent composition of skills rather than just sequential execution.
2.2 Dynamic Movement Primitives
Dynamic Movement Primitives (DMPs) (Ijspeert et al., 2002) correspond to compact, parameterized, and modular representations of robot skills. The ultimate goal of DMPs research is to obtain elemental movements that could be combined to obtain complex skills. A lot of research in this area revolves around formulating and learning such DMPs (Schaal et al., 2005; Ijspeert et al., 2013; Paraschos et al., 2013; Tamosiunaite et al., 2011; Matsubara et al., 2011). However, again only a few approaches address the challenge of composing such DMPs in an efficient, scalable manner. To date, DMPs are usually combined through human-defined heuristics, mixture models, or learning from demonstrations (Konidaris et al., 2012; Muelling et al., 2010; Arie et al., 2012). There also exist techniques that determine a model of primitive robot actions and extract the composite policies via planning (Veeraraghavan & Veloso, 2008; Zoliner et al., 2005). However, such methods tend to be less data efficient, as learning the model of robot actions requires a vast amount of interactive experience (Kaelbling & Lozano-Pérez, 2017). Unlike the methods mentioned above, we do not aim to learn composable representations or models of low-level policies. Instead, we strive for a compositional approach that can blend any black-box elemental controllers, policies, or even DMPs through conventional RL techniques.
2.3 Hierarchical Reinforcement Learning
Hierarchical Reinforcement Learning (HRL) commonly corresponds to a two-level policy structure that learns to solve complex tasks by either decomposing them into subtasks (Nachum et al., 2018) or through temporal abstraction using the options framework. The task decomposition methods learn a high-level policy, by maximizing the task reward, to generate intermediate goals for the low-level policy; by achieving these goals, the low-level policy solves the overall complex task. These methods outperform traditional RL in problems with sub-optimal reward functions requiring complex decision making and task planning (Dayan & Hinton, 1993; Vezhnevets et al., 2017; Nachum et al., 2018). The options framework (Sutton et al., 1999; Precup, 2000) learns a set of sub-level policies (options), their termination functions, and a high-level policy over options to solve the given task. Prior work in option-based HRL required pre-defined heuristics to determine options. Recent advancements led to the option-critic algorithm (Bacon et al., 2017) that concurrently learns the high-level policy over options and the underlying options with their termination functions. Despite being an exciting step, the option-critic algorithm requires regularization (Vezhnevets et al., 2016; Harb et al., 2018), or else it ends up discovering an option for every time step or a single option for the entire task.
In practice, HRL methods tend to exhibit high sample complexity and therefore require a huge number of interactions with the real environment. The primary objective of our composition model is to improve the sample complexity of conventional RL and HRL by exploiting prior knowledge or skills. In our experiments, we show that our method is trainable through HRL methods such as HIRO (Nachum et al., 2018) and exhibits significant improvements in data efficiency and performance compared to traditional HRL methods.
3 Problem Formulation

We consider a standard RL formulation (Fig. 1 (a)) based on a Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, $\mathcal{P}$ is the set of transition probabilities, and $r$ denotes the reward function. At time $t$, the agent observes a state $s_t \in \mathcal{S}$ and performs an action $a_t \in \mathcal{A}$. The agent's action transitions the environment state from $s_t$ to $s_{t+1}$ with respect to the transition probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ and leads to a reward $r_t$.
For compositionality, we extend the standard RL framework by assuming that the agent has access to a finite set of primitive policies $\{\pi_1, \dots, \pi_n\}$ that could correspond to the agent's skills, controllers, or motor primitives. Our composition model is agnostic to the structure of the primitive policy functions, but for the sake of this work, we assume that each sub-policy $\pi_i$ solves an MDP defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}_i, r_i)$. Therefore, $\mathcal{S}$, $\mathcal{P}_i$, and $r_i$ are the state space, transition probabilities, and rewards of the primitive policy $\pi_i$, respectively. Each primitive policy $\pi_i$, $i \in \{1, \dots, n\}$, takes a state $s \in \mathcal{S}$ and outputs a distribution over the agent's action space $\mathcal{A}$. We define our composition model as a composite policy $\pi_c$, parameterized by $\theta$, that outputs a distribution over the action space conditioned on the environment's current state and the primitive policies $\pi_{1:n}$. The state space of the composite model is $\mathcal{S}_c = \mathcal{S} \times \mathcal{G}$, where the space $\mathcal{G}$ could include any task-specific information such as target locations. Hence, in our framework, the state inputs to the primitive policies and the composite policy need not be the same.
In the remainder of this section, we show that our composition model solves an MDP problem. To avoid clutter, we assume that both the primitive policy ensemble and the composite policy have the same state space, i.e., $\mathcal{S}_c = \mathcal{S}$. The composition model samples an action from a distribution parameterized by the actions of the sub-level policies and the state of the environment. We can augment the naive graphical model in Fig. 1 (a) to incorporate the outputs of the sub-policies in determining the composite actions, as shown in Fig. 1 (b). It can be seen that by defining a new state space as $\bar{\mathcal{S}} = \mathcal{S} \times \mathcal{A}^n$, where $a_{1:n} \in \mathcal{A}^n$ are the outputs of the sub-level policies, we can construct a new MDP, as shown in Fig. 1 (c), to represent our composite model. This new MDP is defined as $(\bar{\mathcal{S}}, \mathcal{A}, \bar{\mathcal{P}}, r)$, where $\bar{\mathcal{S}}$ is the new composite state space, $\mathcal{A}$ is the action space, $\bar{\mathcal{P}}$ is the transition probability function, and $r$ is the reward function for the given task.
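The augmented state construction described above can be sketched in a few lines. The primitive policies here are toy stand-ins (simple deterministic functions), not the paper's learned skills; shapes and names are illustrative assumptions:

```python
import numpy as np

# Hypothetical primitive policies: each maps an environment state to an
# action (toy stand-ins for the paper's learned skills).
def make_primitive(weight):
    return lambda s: np.tanh(weight * s)

primitives = [make_primitive(w) for w in (0.5, 1.0, 2.0)]

def augmented_state(s):
    """Build the augmented MDP state (s, a_1, ..., a_n) by stacking the
    environment state with the outputs of all primitive policies."""
    prim_actions = [pi(s) for pi in primitives]
    return np.concatenate([s] + prim_actions)

s = np.array([0.2, -0.1])
x = augmented_state(s)
# Augmented state has dimension |S| + n * |A| = 2 + 3 * 2 = 8
```

The composite policy then acts on this augmented state, which is what makes the composition problem itself an MDP.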
4 Policy Ensemble Composition
In this section, we present our policy ensemble composition framework, shown in Fig. 2. Our composition model consists of i) the encoder network that takes the outputs of the primitive policies and embeds them into a latent space; ii) the decoder network that takes the current state of the environment and the latent embeddings from the encoder network to parameterize the attention network; iii) the attention network that outputs a probability distribution over the primitive low-level policies, representing their mixture weights. The remainder of this section explains the individual models of our composition framework and the overall training procedure.
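A minimal numpy sketch of this encoder-decoder-attention pipeline follows. The parameters are randomly initialized toy stand-ins for the trained networks, and all names, dimensions, and the plain-tanh recurrent cell are illustrative assumptions (the paper's encoder is a bidirectional RNN whose exact cell type is not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prims, act_dim, state_dim, hid = 4, 2, 6, 8

# Toy parameters (random here; learned via RL in the paper's framework).
W_f = rng.normal(scale=0.1, size=(hid, hid + act_dim))          # forward RNN
W_b = rng.normal(scale=0.1, size=(hid, hid + act_dim))          # backward RNN
W_dec = rng.normal(scale=0.1, size=(hid, 2 * hid + state_dim))  # decoder MLP
W_att = rng.normal(scale=0.1, size=(n_prims, hid))              # attention logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_weights(prim_actions, state):
    """Encoder (bidirectional RNN over primitive actions) -> decoder ->
    attention, yielding a distribution over the primitives."""
    hf = np.zeros(hid)
    for a in prim_actions:              # forward encoder pass
        hf = np.tanh(W_f @ np.concatenate([hf, a]))
    hb = np.zeros(hid)
    for a in reversed(prim_actions):    # backward encoder pass
        hb = np.tanh(W_b @ np.concatenate([hb, a]))
    # Decoder maps the last hidden states plus the environment state
    # into a latent embedding that parameterizes the attention network.
    z = np.tanh(W_dec @ np.concatenate([hf, hb, state]))
    return softmax(W_att @ z)           # mixture weights over primitives

acts = [rng.normal(size=act_dim) for _ in range(n_prims)]
w = mixture_weights(acts, rng.normal(size=state_dim))
```

Because the encoder is recurrent, the same parameters handle a variable number of primitives, which is one motivation for this architecture.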
4.1 Encoder Network
4.2 Decoder Network
Our decoder is a simple feed-forward neural network that takes the last hidden states of the forward and backward encoder networks and the current state of the environment and maps them into a latent space. The state input to the decoder network is defined as $s_c = (s, g)$, where $s$ is the state input to the low-level policy ensemble and $g$ could be any additional information related to the given task, e.g., the goal position of the target to be reached by the agent.
4.3 Attention Network
4.4 Composite policy
Given the primitive policy ensemble $\pi_{1:n}$, the composite action is the weighted sum of all primitive policy outputs, i.e., $a_c = \sum_{i=1}^{n} w_i a_i$, where $w_{1:n}$ are the mixture weights. Since we consider the primitive policies to be Gaussian distributions, the output of each primitive policy $\pi_i$ is parameterized by a mean $\mu_i$ and variance $\sigma_i^2$, i.e., $a_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. Hence, the composite policy can be represented as $\pi_c = \mathcal{N}(\mu_c, \sigma_c^2)$, where $\mathcal{N}$ denotes a Gaussian distribution, $\mu_c = \sum_{i=1}^{n} w_i \mu_i$, and $\sigma_c^2 = \sum_{i=1}^{n} w_i^2 \sigma_i^2$. Given the mixture weights, other types of primitive policies, such as DMPs (Schaal et al., 2005), can also be composed together by a weighted combination of their normalized outputs.
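The weighted combination of Gaussian primitives can be sketched as follows. The variance rule $\sum_i w_i^2 \sigma_i^2$ used here is the standard one for a weighted sum of independent Gaussians and is our assumption, not necessarily the paper's exact parameterization:

```python
import numpy as np

def compose_gaussians(means, stds, weights):
    """Composite Gaussian from a weighted sum of (assumed independent)
    Gaussian primitives: mean = sum_i w_i * mu_i,
    variance = sum_i w_i^2 * sigma_i^2."""
    means, stds, w = np.asarray(means), np.asarray(stds), np.asarray(weights)
    mu = (w[:, None] * means).sum(axis=0)
    var = ((w[:, None] ** 2) * stds ** 2).sum(axis=0)
    return mu, np.sqrt(var)

# Two 1-D primitives mixed with equal weights.
mu_c, sigma_c = compose_gaussians([[0.0], [2.0]], [[1.0], [1.0]], [0.5, 0.5])
```

Sampling the composite action then reduces to sampling from a single Gaussian with these composed parameters.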
4.5 Composite model training objective
The general objective of RL methods is to maximize the cumulative expected reward, i.e., $J(\theta) = \mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$, where $\gamma \in [0, 1)$ is a discount factor. We consider policy gradient methods to update the parameters of our composite model, i.e., $\theta \leftarrow \theta + \eta \nabla_\theta J(\theta)$, where $\eta$ is the learning rate. We show that our composite policy can be trained through standard RL and HRL methods, described as follows.
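As an illustration of this update rule, a vanilla REINFORCE-style ascent step might look like the sketch below; the paper itself trains with SAC and HIRO rather than this plain estimator, so this is purely didactic:

```python
def reinforce_update(theta, grad_log_probs, returns, lr=1e-3):
    """One policy-gradient ascent step,
    theta <- theta + lr * mean_t[G_t * grad log pi(a_t|s_t)].
    Illustrative REINFORCE estimator, not the paper's actual optimizer."""
    g = sum(G * glp for glp, G in zip(grad_log_probs, returns))
    return theta + lr * g / len(returns)

# Scalar toy example: two timesteps, each with gradient 1.0 and return 2.0.
theta_new = reinforce_update(0.0, [1.0, 1.0], [2.0, 2.0])
```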
4.5.1 Standard Reinforcement learning
In standard RL, the policy gradients are determined by either on-policy or off-policy updates (Lillicrap et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018b), and any of them could be used to train our composite model. However, in this paper, we consider the off-policy soft actor-critic (SAC) method (Haarnoja et al., 2018b) for training our policy function. SAC maximizes the expected entropy in addition to the expected reward, i.e., $J(\theta) = \mathbb{E}\big[\sum_t r_t + \alpha \mathcal{H}(\pi_c(\cdot \mid s_t))\big]$, where $\alpha$ is a hyperparameter. We use SAC as it encourages exploration and has been shown to capture the underlying multiple modes of an optimal behavior. Since there is no direct method to estimate a low-variance gradient of this objective, we use an off-policy value-function-based optimization algorithm (for details, refer to Appendix A.1 of the supplementary material).
4.5.2 Hierarchical Reinforcement Learning
In HRL, there are currently two streams: task decomposition through sub-goals (Nachum et al., 2018) and the options framework (Bacon et al., 2017), which learns temporal abstractions. In the options framework, the options can be composite policies that are acquired together with their termination functions. In task decomposition methods that generate sub-goals through a high-level policy, the low-level policy can be replaced with our composite policy. In our work, we use the latter approach (Nachum et al., 2018), known as the HIRO algorithm, to train our policy function.
Like standard HIRO, we use a two-level policy structure. At each time step $t$, the high-level policy $\pi_h$, with parameters $\theta_h$, observes a state $s_t$ and takes an action by generating a goal $g_t$ in the state space for the composite low-level policy $\pi_c$ to achieve. The composite policy $\pi_c$ takes the state $s_t$, the goal $g_t$, and the primitive actions $a_{1:n}$ to predict a composite action $a_t$ through which the agent interacts with the environment. The high-level policy is trained to maximize the expected task reward given by the environment, whereas the composite low-level policy is trained to maximize the expected intrinsic reward defined as the negative of the distance between the current and goal states, i.e., $r^{int}_t = -\lVert s_t + g_t - s_{t+1} \rVert_2$. To conform with the HIRO setting, we perform off-policy correction of the high-level policy experiences, and we train both high- and low-level policies via the TD3 algorithm (Fujimoto et al., 2018) (for details, refer to Appendix A.2 of the supplementary material).
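The intrinsic reward and the step-to-step goal re-labeling can be sketched as below, following the forms used in HIRO (Nachum et al., 2018):

```python
import numpy as np

def intrinsic_reward(state, goal, next_state):
    """Low-level reward in HIRO: negative L2 distance between the
    goal-relative target and the achieved next state,
    r = -||s + g - s'||_2."""
    return -np.linalg.norm(state + goal - next_state)

def goal_transition(state, goal, next_state):
    """Re-express the goal after one step so it still points at the same
    absolute target: g' = s + g - s'."""
    return state + goal - next_state

s = np.array([0.0, 0.0])
g = np.array([1.0, 0.0])       # "move one unit along x"
s_next = np.array([1.0, 0.0])  # goal achieved exactly
r_int = intrinsic_reward(s, g, s_next)
```

When the agent reaches the goal exactly, the intrinsic reward is zero (its maximum); any deviation makes it negative.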
5 Experiments and Results
We evaluate and compare our method against standard RL and HRL approaches in challenging environments (shown in Fig. 3) that require complex task and motion planning. The implementation details of all presented methods and environment settings are provided in Appendix C of the supplementary material. We also perform an ablation study in which we remove different components of our composite model to highlight their importance.
We consider the following seven environments for our analysis: (1) Pusher: A simple manipulator has to push an object to a given target location. (2) Ant Random Goal: A quadruped Ant is trained to reach a randomly sampled goal location in a confined circular region. (3) Ant Cross Maze: The cross maze contains three target locations. The task for a quadruped Ant is to reach any of the three given targets by navigating through a 3D maze without collision. (4) HalfCheetah Hurdle: The task for a HalfCheetah is to run and jump over three barriers to reach the given target location. (5) Ant Maze: A maze poses a challenging navigation task for a quadruped Ant. In this task, the agent is given random targets all along the maze to reach during training. However, during evaluation, we test the agent on reaching the farthest end of the maze. (6) Ant Push: A challenging environment that requires both task and motion planning. The environment contains a movable block, and the goal region is located behind it. To reach the target, the agent must first move to the left of the maze so that it can then move up and right to push the block out of the way. (7) Ant Fall: A navigation task where the target is located across a rift in front of the agent's initial position. There is also a movable block, so the agent has to move to the right, push the block forward to fill the gap, walk across, and move to the left to reach the target location.
In all tasks, we also acquire primitive skills of the agent for our composite policy. For the Ant, we use four basic policies for moving left, right, up, and down. The Pusher uses two primitive policies: pushing an object to the left and downward. In the HalfCheetah Hurdle environment, the low-level policies include jumping and running forward. Furthermore, in all environments except Pusher, the primitive policies were agnostic of the high-level target locations, which were therefore provided separately to our composite model via the decoder network.
Table 1: Comparison on benchmark control tasks (Ant Random Goal, Ant Cross Maze, Pusher, and HalfCheetah Hurdle) in terms of the distance (lower is better) of the agent from the given target. The mean final distances with standard deviations over ten trials are reported. We normalize the reported values by the agent's initial distance from the goal, so values close to 1 or higher indicate failure. Our method (shown in bold) accomplishes the tasks by reaching the goals, whereas the other methods fail, except SAC in the simple Pusher and Ant Random Goal environments.
5.1 Comparative study
In our comparative studies, we divide our test environments into two groups. The first group includes the Pusher, Ant Random Goal, Ant Cross Maze, and HalfCheetah Hurdle environments, whereas the second group comprises the remaining environments that require task and motion planning under weak reward signals.
In the first group of settings, we compare our composite model trained with SAC (Haarnoja et al., 2018b) against standard Gaussian policies obtained using SAC (Haarnoja et al., 2018b), PPO (Schulman et al., 2017), and TRPO (Schulman et al., 2015). We exclude HRL methods in these cases for two reasons. First, the environment rewards sufficiently represent the underlying task, whereas HRL approaches are applicable in cases that have a weak reward signal or require task and motion planning. Second, HRL methods usually need a large number of training steps, generally many more than traditional RL methods. Table 1 presents the mean and standard deviation of the agent's final distance from the given targets at the end of an evaluation rollout over ten trials. Fig. 4 shows the mean learning performance over all trials during three million training steps. In this set of problems, TRPO and PPO entirely fail to reach the goal, and SAC performs reasonably well but only in the simple Ant Random Goal and Pusher environments, failing in the other cases. Our composite policy obtained using SAC successfully solves all tasks and exhibits high data efficiency by learning in merely a few thousand training steps.
In our second group of environments, we use distance-based rewards that are weak, as greedily following them does not lead to solving the problem. Furthermore, in these environments, policies trained with standard RL, including our composite policy, failed to solve the problem even after 20 million training steps. Therefore, we trained our composite policy with HIRO (Nachum et al., 2018) and compared its performance against the standard HIRO formulation (Nachum et al., 2018). We also tried to include the option-critic framework (Bacon et al., 2017), but we were unable to get any considerable results with its online implementation despite several attempts at parameter tuning. One reason option-critic fails is that it relies purely on task rewards to learn, which makes it inapplicable in cases with weak reward signals (Nachum et al., 2018). Furthermore, unlike HIRO, which used a modified Ant with a 30-unit joint torque limit, we use the MuJoCo standard Ant that has a torque limit of 150 units, which makes learning even harder as the Ant is now more prone to instability.
Fig. 5 shows the learning performance, averaged over ten trials, during 7 million steps. In these problems, the composite policy with HIRO outperforms standard HIRO (Nachum et al., 2018) by a significant margin, which demonstrates the utility of solving RL tasks through composition by leveraging basic pre-acquired skills. HIRO performs poorly with the standard Ant as it imposes a harder control problem: the agent must also learn to balance the Ant to prevent it from flipping over due to the high torques. We were able to replicate the results of HIRO (Nachum et al., 2018) on their modified Ant (torque limit 30), and our composition model also gave comparatively better results on the modified Ant than on the standard Ant. However, we use the standard Ant to stay consistent across all Ant environments presented in this paper. In the Ant Fall environment, the composition model struggles to perform well, which we believe is because the low-level policies were trained in a 2D planar space rather than a 3D space with elevation, which slightly changes the underlying state space.
5.2 Ablative study
We remove the attention network, and both the attention network and the BRNN, from our composition model to highlight their importance in solving complex problems with the proposed architecture. We train all models with SAC (Haarnoja et al., 2018b). The first model is our composite policy without attention, in which the decoder network takes the state information and the last hidden states of the encoder (BRNN) and directly outputs actions rather than mixture weights. The second model is without both the attention network and the BRNN; it is a feed-forward neural network that takes the state information and the primitive actions and predicts the action used to interact with the environment. Fig. 6 shows the mean performance comparison, over ten trials, of our composite model against its ablated versions on the Ant Random Goal, Ant Cross Maze, and Pusher environments. We exclude the remaining test environments from this study, as the ablated models completely failed to perform or show any progress. Note that the better performance of our method compared to the ablated versions highlights the merits of our architecture design. Intuitively, the BRNN allows the dynamic composition of a skill set of variable length, and the attention network bypasses the complex transformation of action embeddings (in the model without attention) or of the actions and state information (in the model without attention and BRNN) directly into the action space.
6 Conclusions and Future work
We present a novel policy ensemble composition method that concurrently or sequentially composes primitive motor policies of a robot through RL to solve challenging problems. Our experiments show that composition is vital for solving problems that require task and motion planning, where reinforcement learning and hierarchical reinforcement learning methods either fail or need a massive number of interactions with the environment to learn.
In our future work, we plan to extend our composition framework to acquire new skills that do not exist in the given primitive skill set but are necessary to solve the given problem (in Appendix B of the supplementary material, we provide a detailed discussion with preliminary results to concretely illuminate this research direction). We also aim for a system that learns hierarchies of composition models by combining primitive policies into complex policies, which would themselves be composed together for a combinatorial outburst in the robot's skill set.
- Arie et al. (2012) Arie, H., Arakaki, T., Sugano, S., and Tani, J. Imitating others by composition of primitive actions: A neuro-dynamic model. Robotics and Autonomous Systems, 60(5):729–741, 2012.
- Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Dayan & Hinton (1993) Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278, 1993.
- Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
- Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
- Haarnoja et al. (2018a) Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773, 2018a.
- Haarnoja et al. (2018b) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018b.
- Harb et al. (2018) Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. When waiting is not an option: Learning options with a deliberation cost. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hausman et al. (2018) Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. 2018.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Ijspeert et al. (2002) Ijspeert, A. J., Nakanishi, J., and Schaal, S. Movement imitation with nonlinear dynamical systems in humanoid robots. In Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, volume 2, pp. 1398–1403. IEEE, 2002.
- Ijspeert et al. (2013) Ijspeert, A. J., Nakanishi, J., Hoffmann, H., Pastor, P., and Schaal, S. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25(2):328–373, 2013.
- Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- Kaelbling & Lozano-Pérez (2017) Kaelbling, L. P. and Lozano-Pérez, T. Learning composable models of parameterized skills. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 886–893. IEEE, 2017.
- Konidaris et al. (2012) Konidaris, G., Kuindersma, S., Grupen, R., and Barto, A. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360–375, 2012.
- Lake et al. (2017) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
- Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Matsubara et al. (2011) Matsubara, T., Hyon, S.-H., and Morimoto, J. Learning parametric dynamic movement primitives from multiple demonstrations. Neural networks, 24(5):493–500, 2011.
- Muelling et al. (2010) Muelling, K., Kober, J., and Peters, J. Learning table tennis with a mixture of motor primitives. In Humanoid Robots (Humanoids), 2010 10th IEEE-RAS International Conference on, pp. 411–416. IEEE, 2010.
- Nachum et al. (2018) Nachum, O., Gu, S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. arXiv preprint arXiv:1805.08296, 2018.
- Paraschos et al. (2013) Paraschos, A., Daniel, C., Peters, J. R., and Neumann, G. Probabilistic movement primitives. In Advances in neural information processing systems, pp. 2616–2624, 2013.
- Precup (2000) Precup, D. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.
- Qureshi et al. (2017) Qureshi, A. H., Nakamura, Y., Yoshikawa, Y., and Ishiguro, H. Show, attend and interact: Perceivable human-robot social interaction through neural attention q-network. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1639–1645. IEEE, 2017.
- Qureshi et al. (2018) Qureshi, A. H., Nakamura, Y., Yoshikawa, Y., and Ishiguro, H. Intrinsically motivated reinforcement learning for human–robot interaction in the real-world. Neural Networks, 2018.
- Sahni et al. (2017) Sahni, H., Kumar, S., Tejani, F., and Isbell, C. Learning to compose skills. arXiv preprint arXiv:1711.11289, 2017.
- Schaal et al. (2005) Schaal, S., Peters, J., Nakanishi, J., and Ijspeert, A. Learning movement primitives. In Robotics research. the eleventh international symposium, pp. 561–572. Springer, 2005.
- Schapire (1999) Schapire, R. E. A brief introduction to boosting. In Ijcai, volume 99, pp. 1401–1406, 1999.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
- Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Tamosiunaite et al. (2011) Tamosiunaite, M., Nemec, B., Ude, A., and Wörgötter, F. Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives. Robotics and Autonomous Systems, 59(11):910–922, 2011.
- Todorov (2009) Todorov, E. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, pp. 1856–1864, 2009.
- Veeraraghavan & Veloso (2008) Veeraraghavan, H. and Veloso, M. Teaching sequential tasks with repetition through demonstration. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 3, pp. 1357–1360. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
- Vezhnevets et al. (2016) Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al. Strategic attentive writer for learning macro-actions. In Advances in neural information processing systems, pp. 3486–3494, 2016.
- Vezhnevets et al. (2017) Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
- Zoliner et al. (2005) Zoliner, R., Pardowitz, M., Knoop, S., and Dillmann, R. Towards cognitive robots: Building hierarchical task representations of manipulations from human demonstration. In Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on, pp. 1535–1540. IEEE, 2005.
Appendix A Composite Model Training Algorithms
a.1 Training with Soft Actor-Critic
In this section, we briefly describe the procedure to train our composition model using SAC (Haarnoja et al., 2018b). Although any RL method can be used to optimize our model, we use SAC as it is reported to perform better than other training methods. Our composite policy $\pi_c$ is a tractable function parameterized by $\theta$. The composite policy update through SAC requires the approximation of Q- and value-functions. The parameterized value and Q-functions are denoted as $V_\psi(s_t)$ with parameters $\psi$ and $Q_\phi(s_t, a_t)$ with parameters $\phi$, respectively. Since the SAC algorithm builds on soft policy iteration, the soft value function and soft Q-function are learned by minimizing the squared residual error $J_V(\psi)$ and the squared soft Bellman error $J_Q(\phi)$, respectively, i.e.,

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_c}\big[Q_\phi(s_t, a_t) - \log \pi_c(a_t \mid s_t)\big]\big)^2\Big]$$

$$J_Q(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\phi(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\Big]$$
where $\mathcal{D}$ is a replay buffer and $\hat{Q}$ is the Bellman target computed as follows:

$$\hat{Q}(\mathbf{s}_t,\mathbf{a}_t) = r(\mathbf{s}_t,\mathbf{a}_t) + \gamma\,\mathbb{E}_{\mathbf{s}_{t+1} \sim p}\big[V_{\bar{\psi}}(\mathbf{s}_{t+1})\big].$$
The function $V_{\bar{\psi}}$ is the target value function with parameters $\bar{\psi}$. The parameters $\bar{\psi}$ are the moving average of the parameters $\psi$, computed as $\bar{\psi} \leftarrow \tau\psi + (1-\tau)\bar{\psi}$, where $\tau$ is the smoothing coefficient. Finally, the policy parameters $\phi$ are updated by minimizing the following expected KL-divergence:

$$J_\pi(\phi) = \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}}\bigg[D_{\mathrm{KL}}\bigg(\pi_\phi(\cdot|\mathbf{s}_t)\,\bigg\|\,\frac{\exp(Q_\theta(\mathbf{s}_t,\cdot))}{Z_\theta(\mathbf{s}_t)}\bigg)\bigg],$$
where $Z_\theta(\mathbf{s}_t)$ is a partition function that normalizes the distribution. Since our Q-function, just like SAC's, is differentiable, the above cost function can be optimized through a simple reparameterization trick; see Haarnoja et al. (2018b) for details. Like SAC, we also maintain two Q-functions that are trained independently, and we use the minimum of the two Q-functions to compute Eqn. 3 and Eqn. 6. Using two Q-functions in this way has been shown to alleviate the positive bias in the policy improvement step. The overall training procedure is summarized in Algorithm 1.
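As a minimal, library-free sketch (not the paper's implementation), the target-network update, Bellman target, and twin-Q minimum described above can be written as:

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """Moving-average update of the target value network:
    psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    return {k: tau * online_params[k] + (1.0 - tau) * target_params[k]
            for k in online_params}

def bellman_target(reward, next_state_value, gamma=0.99):
    """Soft Bellman target: Q_hat = r + gamma * V_target(s')."""
    return reward + gamma * next_state_value

def twin_q_min(q1, q2):
    """Minimum of two independently trained Q-estimates, used in the
    value and policy losses to alleviate positive bias."""
    return np.minimum(q1, q2)
```

In a full implementation these operations would act on neural-network parameters and batches sampled from the replay buffer.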
A.2 Training with HIRO
In this section, we outline the algorithm to train the composite policy through HIRO, which employs a two-level policy structure. The high-level policy generates sub-goals for the low-level composite policy to achieve. The low-level composite policy also has access to the primitive policy actions. Like HIRO, we use the TD3 algorithm (Fujimoto et al., 2018) to train both the high-level and low-level policies with their corresponding Q-functions, $Q_{\theta_{hi}}$ and $Q_{\theta_{lo}}$, respectively. The low-level policy $\pi_{lo}$, with parameters $\phi_{lo}$, is trained to maximize the Q-values from the low-level Q-function for the given state-goal pairs. The Q-function parameters $\theta_{lo}$ are optimized by minimizing the temporal-difference error for the given transitions, i.e.,

$$\mathcal{L}(\theta_{lo}) = \mathbb{E}\Big[\big(Q_{\theta_{lo}}(\mathbf{s}_t,\mathbf{g}_t,\mathbf{a}_t) - r_t - \gamma\,Q_{\theta_{lo}}(\mathbf{s}_{t+1},\mathbf{g}_{t+1},\pi_{lo}(\mathbf{s}_{t+1},\mathbf{g}_{t+1}))\big)^2\Big],$$
where $r_t = r(\mathbf{s}_t,\mathbf{g}_t,\mathbf{a}_t,\mathbf{s}_{t+1}) = -\|\mathbf{s}_t + \mathbf{g}_t - \mathbf{s}_{t+1}\|_2$ is the intrinsic reward and $\mathbf{g}_{t+1} = h(\mathbf{s}_t,\mathbf{g}_t,\mathbf{s}_{t+1}) = \mathbf{s}_t + \mathbf{g}_t - \mathbf{s}_{t+1}$ is the fixed goal-transition function.
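For concreteness, the HIRO-style intrinsic reward and goal-transition function can be sketched as follows (a minimal illustration; the state vector here stands in for the position-based sub-state actually used):

```python
import numpy as np

def goal_transition(s_t, g_t, s_next):
    """h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}: re-expresses the goal
    relative to the new state so the absolute target stays fixed."""
    return s_t + g_t - s_next

def intrinsic_reward(s_t, g_t, s_next):
    """r = -||s_t + g_t - s_{t+1}||_2: rewards the low-level policy for
    moving the state toward the offset requested by the high level."""
    return -np.linalg.norm(s_t + g_t - s_next)
```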
The high-level policy $\pi_{hi}$, with parameters $\phi_{hi}$, is trained to maximize the values of $Q_{\theta_{hi}}$. The Q-function parameters $\theta_{hi}$ are trained by minimizing the following loss for the given transitions:

$$\mathcal{L}(\theta_{hi}) = \mathbb{E}\Big[\big(Q_{\theta_{hi}}(\mathbf{s}_t,\mathbf{g}_t) - \textstyle\sum_{i=t}^{t+c-1} R_i - \gamma\,Q_{\theta_{hi}}(\mathbf{s}_{t+c},\pi_{hi}(\mathbf{s}_{t+c}))\big)^2\Big].$$
During training, the continuous adaptation of the low-level policy poses a non-stationarity problem for the high-level policy. To mitigate the changing behavior of the low-level policy, Nachum et al. (2018) introduced an off-policy correction of the high-level actions. During correction, the high-level action is re-labeled with a goal $\tilde{\mathbf{g}}_t$ that would induce the same low-level policy behavior as was previously induced by the original high-level action (for details, refer to Nachum et al. (2018)). Algorithm 2 presents the procedure to train our composite policy with HIRO.
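A hedged sketch of this relabeling step: among candidate goals, pick the one that maximizes the low-level policy's log-probability of the actions actually executed. Here `log_prob_fn` is a placeholder for the low-level policy's log-density, and candidate generation is left to the caller:

```python
import numpy as np

def relabel_high_level_goal(candidate_goals, states, actions, log_prob_fn):
    """Return the candidate goal g~ under which the stored low-level
    actions are most probable, i.e. argmax_g sum_t log pi_lo(a_t | s_t, g)."""
    scores = [sum(log_prob_fn(s, g, a) for s, a in zip(states, actions))
              for g in candidate_goals]
    return candidate_goals[int(np.argmax(scores))]
```

In HIRO, the candidate set typically includes the original goal, the observed state change, and random perturbations around it.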
Appendix B Future Research Directions & Discussion
In this section, we present preliminary results to highlight the potential research avenues to further extend our method.
B.1 New Skills Acquisition
Apart from compositionality, another research problem on the way to building intelligent machines is the autonomous acquisition of new skills that the system lacks and that therefore hinder it from solving the given task (Lake et al., 2017). In this section, we demonstrate that our composition model holds the potential for building such a system, one that autonomously acquires missing skills and composes them together with the existing skill set to solve the given problem.
Fig. 7 shows a simple ant-maze environment in which the 3D quadruped Ant has to reach the target, indicated as a green region. In this problem, we equip our composition model with a single primitive policy for walking in the right direction and a trainable, parametric policy function. The trainable policy function takes the state without goal information and outputs the parameters of a Gaussian distribution to model the robot's action space. Note that, to reach the given goal, the composite model requires the skill of walking upward in addition to moving to the right. To determine whether our composition model can teach the new parametric policy function to move upward, we trained our composition model together with the new policy using shared Q- and value-functions. Once trained, we observed that our composition model learned a parametric policy function for walking upward with a slight right-turning behavior.
The plot in Fig. 7 shows the performance of our composition model and a standard RL policy obtained with SAC in this environment. The vertical axis indicates the distance from the target location averaged over five randomly seeded trials. The horizontal axis shows the number of environment steps in millions. Our method converged faster than the standard RL policy, showing that our approach also utilizes the existing skills and therefore learns faster than a standard policy function that solves the entire task from scratch. Furthermore, in the current setting, we do not impose any constraint on the new policy function that would make it learn only the missing skills rather than replicating existing skills or solving the given task entirely by itself. We leave the formulation of such a constraint for the autonomous acquisition of new, missing skills to future work.
B.2 Hierarchy of Composition Models
In this section, we discuss how our method can be extended to create hierarchies of composition models by combining primitive policies into complex policies, which can in turn be composed to form even more complex policies. Fig. 8 presents an example of such a model. The composition hierarchy begins by combining basic primitive skills into complex models, level by level, to build a complex composite function. Such a tree/hierarchy can be constructed through stage-wise training of composite models, similar to Boosting algorithms (Schapire, 1999). We performed a simple experiment in the Ant Cross Maze environment where the primitive left, right, up, and down policies were composed to create three composite models for reaching the left, right, and top target locations. These composite models were further combined to solve the original Ant Cross Maze task. We observed that our hierarchical composite model converged significantly faster than the single-layer composite model. However, we leave further experimentation and the development of a stage-wise training algorithm, similar to Boosting (Schapire, 1999), to future work.
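A hypothetical sketch of such a composition tree, with fixed softmax gating standing in for the learned composition model (the class name `CompositeNode` and its gating are illustrative, not the paper's API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

class CompositeNode:
    """A node that mixes its children's actions. Children may be
    primitive policies or other CompositeNodes, forming a hierarchy."""
    def __init__(self, children, logits):
        self.children = children            # callables: state -> action
        self.logits = np.asarray(logits, dtype=float)

    def __call__(self, state):
        weights = softmax(self.logits)      # mixture weights over children
        actions = np.stack([child(state) for child in self.children])
        return weights @ actions            # weighted action combination
```

In this sketch, level-one nodes would combine the primitive left/right/up/down policies into the three goal-reaching composites, and a level-two node would then compose those composites to solve the full maze task.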
Appendix C Implementation details
C.1 Environment Details
In this section, we present the environment details, including reward functions, primitive policies, and state-space information. The reward functions are presented in Table 3 together with the overall reward-scaling values.
C.1.1 Ant Environments
In these environments, we use an 8-DOF, four-legged Ant with a 150-unit torque limit. The primitive policies of moving left, right, down, and up were shared across all these tasks. In these environments, the goal-specific information in the state corresponds to the target location. Let us introduce notation to define the reward function. Let $\mathbf{x}_r$, $\mathbf{x}_g$, $\boldsymbol{\tau}$, and $c$ denote the xy-position of the robot's torso, the xy-position of the goal, the joint torques, and the contact cost, respectively. The reward function for the following environments combines the negative goal distance $-\|\mathbf{x}_r - \mathbf{x}_g\|_2$ with control and contact penalties, and uses a reward scaling of 5 units:
Ant Random Goal: In this environment, the Ant has to navigate to a target sampled uniformly at random within a confined circular region of radius 5 units. The goal radius is defined to be 0.25 units. The reward-function coefficient values for this environment are listed in Table 3.
Ant Cross Maze: In this environment, the Ant has to navigate through a 3D maze to reach one of three candidate targets. The goal radius is defined to be 1.0 unit. The reward-function parameters are the same as in the Ant Random Goal environment.
Simple Ant Maze: The Ant has to navigate through the maze to reach a single target, shown in Fig. 7. The goal radius and reward-function parameters are defined as in the above ant environments.
For the remaining environments (Ant Maze, Ant Push, and Ant Fall), we use a reward function based on the negative L2 distance between the agent and the goal, with no reward scaling; the three coefficient values are listed in Table 3.
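As an illustration only, a navigation reward of the general form described above can be sketched as follows; the weights `w_goal`, `w_ctrl`, and `w_contact` are placeholders, not the paper's values:

```python
import numpy as np

def nav_reward(x_torso, x_goal, torques, contact_cost,
               w_goal=1.0, w_ctrl=1e-3, w_contact=1e-3):
    """Generic navigation reward: negative goal distance minus control
    and contact penalties (placeholder coefficients)."""
    dist = np.linalg.norm(np.asarray(x_torso) - np.asarray(x_goal))
    ctrl = np.sum(np.square(torques))
    return -w_goal * dist - w_ctrl * ctrl - w_contact * contact_cost
```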
Ant Maze: In this environment, we place the Ant in a U-shaped maze for a navigation task between given start and goal configurations. The goal radius is defined to be 5 units. During training, the goal is uniformly sampled from the goal space, and the Ant's initial location is always fixed. During testing, the agent is evaluated on reaching the farthest end of the maze within an L2 distance of 5.
Ant Push: In this environment, the Ant, a movable block, and the goal are placed at fixed initial coordinates. The agent is trained to reach randomly sampled targets, whereas during testing, we evaluate the agent on reaching the designated goal within an L2 distance of 5.
Ant Fall: In this environment, the Ant has to navigate in a 3D maze. The agent starts at a fixed location, with a movable block placed at the same elevation as the Ant. There is a rift between the agent and the target. To reach the target on the other side of the rift, the Ant must push the block down into the rift and then step on it to get to the goal position.
| Hyperparameter | SAC | HIRO | TRPO | PPO |
| --- | --- | --- | --- | --- |
| Discount factor ($\gamma$) | 0.99 | 0.99 | 0.99 | 0.99 |
| Nonlinearity in feedforward networks | ReLU | ReLU | ReLU | ReLU |
| Minibatch sample size | 256 | 128 | - | - |
| Replay buffer size | | | - | - |
| Target parameters smoothing coefficient ($\tau$) | 0.005 | 0.005 | - | - |
| Target parameters update interval | 1 | 2 | - | - |
| Model | Architecture | Hidden units |
| --- | --- | --- |
| Composition-HIRO | High-level policy: three-layer feedforward network | 300 |
| | Encoder network: bidirectional RNN with LSTMs | 128 |
| | Decoder network: single-layer feedforward network | 128 |
| Composition-SAC | Encoder network: bidirectional RNN with LSTMs | 128 |
| | Decoder network: single-layer feedforward network | 128 |
| HIRO | High-level policy: three-layer feedforward network | 300 |
| | Low-level policy: three-layer feedforward network | 300 |
| Standard RL policy | Two-layer feedforward network | 256 |
In the Pusher environment, a simple manipulator has to move an object to the target location. The primitive policies push the object to the bottom and to the left. In this environment, the state information for both the primitive policies and the composite policy includes the goal location; the additional goal-specific state component is therefore null in this case. The reward function is given as:

$$r = -c_1\,\|\mathbf{x}_o - \mathbf{x}_g\| - c_2\,\|\mathbf{x}_o - \mathbf{x}_a\| - c_3\,\|\boldsymbol{\tau}\|^2,$$
where $\mathbf{x}_o$, $\mathbf{x}_g$, $\mathbf{x}_a$, and $\boldsymbol{\tau}$ are the xy-position of the object, the xy-position of the goal, the xy-position of the arm, and the joint torques, respectively. The values of the coefficients $c_1$, $c_2$, and $c_3$ are given in Table 3.
In the halfcheetah-hurdle environment, a 2D cheetah has to jump over three hurdles to reach the target. In this environment, the goal-specific information in the state corresponds to the x-position of the next nearest hurdle in front of the agent, as well as the distance from that hurdle. The reward function is defined as:

$$r = -c_1\,\|\mathbf{x}_r - \mathbf{x}_g\| - c_2\,n_h(\mathbf{x}_r) + c_3\,v_z + c_4\,v_x + c_5\,\mathbb{1}_{goal}(\mathbf{x}_r) - c_6\,\mathbb{1}_{col}(\mathbf{x}_r),$$
where $\mathbf{x}_r$, $\mathbf{x}_g$, $v_z$, and $v_x$ are the xy-position of the robot's torso, the xy-position of the goal, the velocity along the z-axis, and the velocity along the x-axis, respectively. The function $n_h(\cdot)$ returns a count indicating the number of hurdles in front of the robot. The indicator function $\mathbb{1}_{goal}(\cdot)$ returns 1 if the agent has reached the target and 0 otherwise. The function $\mathbb{1}_{col}(\cdot)$ is a collision checker that returns 1 if the agent collides with a hurdle and 0 otherwise. The values of the coefficients $c_1, \dots, c_6$ are given in Table 3.
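A hedged sketch of a reward with this shape; the coefficient values and exact term signs are placeholders, not the paper's:

```python
import numpy as np

def hurdle_reward(x_torso, x_goal, v_z, v_x, n_hurdles_ahead,
                  reached, collided,
                  w_dist=1.0, w_hurdle=1.0, w_vz=0.1, w_vx=0.1,
                  w_goal=10.0, w_col=1.0):
    """Illustrative halfcheetah-hurdle reward: penalize goal distance,
    remaining hurdles, and collisions; reward velocity and success."""
    dist = np.linalg.norm(np.asarray(x_torso) - np.asarray(x_goal))
    reward = -w_dist * dist - w_hurdle * n_hurdles_ahead
    reward += w_vz * v_z + w_vx * v_x
    reward += w_goal * float(reached) - w_col * float(collided)
    return reward
```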
C.2 Hyperparameters and Network Architectures
Table 3 summarizes the network architectures. The standard RL policy structure corresponds to simple SAC, TRPO, and PPO policies. The rightmost column shows the hidden units per layer.