1 Introduction
Learning goal-directed skills is a major challenge in reinforcement learning when the environment’s feedback is sparse. The difficulty arises from insufficient exploration of the state space by an agent, and results in the agent failing to learn a robust policy or value function. The problem is further exacerbated in high-dimensional tasks, such as in robotics. Although the integration of nonlinear function approximators, such as deep neural networks, with reinforcement learning has made it possible to learn patterns and abstractions over high-dimensional spaces (Silver et al. [2016]; Mnih et al. [2015]), exploration in the sparse-reward regime remains a significant challenge. Rarely occurring sparse reward signals are difficult for neural networks to model, since the action sequences leading to high reward must be discovered in a much larger pool of low-reward sequences. In addition to the above difficulties, robotics tasks that involve dexterous manipulation of objects pose the further challenges of tight precision requirements, complex contact dynamics, and highly variable object geometries. In such settings, one natural solution is for the agent to learn, plan, and represent knowledge at different levels of temporal abstraction, so that solving intermediate tasks at the right times helps achieve the final goal. Sutton et al. [1999] provided a mathematical framework for extending the notion of “actions” in reinforcement learning to “options”: policies that take actions over a period of time. The duration of execution of the option policy is specified by the time it takes the agent to meet an intermediate goal intrinsic to the option. The goal is a termination condition for the policy, defined on the state space.
In this work, we present Concept Network Reinforcement Learning (CNRL), an industrially applicable approach to solving complex tasks using reinforcement learning, which facilitates problem decomposition, allows component reuse, simplifies reward functions, trains quickly and robustly, and produces a policy that can be executed safely and reliably when deployed. Inspired by Sutton’s options framework, we introduce the notion of “Concept Networks,” which are tree-like structures in which the leaves are “subconcepts” (subtasks), representing policies on a subset of the state space. The parent (non-leaf) nodes are “Selectors,” which include policies on which subconcept to choose at each time during an episode.
Unlike the options in the framework of Sutton et al. [1999], concepts within concept networks are not indivisible. Each concept can be a trained multi-step policy or a primitive action (like setting joint velocities to control individual fingers of a robotic arm). In addition, concepts can be other concept networks, creating a multi-level hierarchy, or classical controllers. Concepts can also be used for perception or other state transformations, instead of action generation. This enables further simplification of the problem each individual concept has to solve.
The flexibility of CNRL allows us to apply state transformations such as partitioning or extending the state space. This permits recombining subconcepts that may have been trained on different state spaces without having to retrain them.
By treating the subconcepts in a task as black-box components implementing entire skills, we are able to use much simpler reward functions when learning the overall task. Since the subtasks are much simpler than tackling the entire problem at once, their goals can often be defined on subsets of the state space, significantly constraining the necessary exploration and leading to data-efficient learning even in complex environments. In addition, the approach is agnostic to the algorithms used to create and, if necessary, train a concept: each concept is treated as a black box by the rest of the concept network. This makes concepts reusable, meaning the same trained concept can be directly used in multiple concept networks. To speed up training and ensure that concepts are only executed in regions where they have been trained, each concept can include a restriction on the set of states in which it can execute.
We demonstrate CNRL via the task of grasping a rectangular prism and stacking it precisely on top of a cube. This task closely reflects many problems faced in real robotics tasks, and illustrates several of the difficulties CNRL addresses. First, the problem is high-dimensional. Second, the task is composed of several subproblems, such as moving, grasping, and stacking, that are independent of one another yet common to many related tasks. This is typical of the range of real-world robotics problems. Third, the control precision required to solve this task makes it difficult to solve with a classical, hard-coded controller. Finally, for the complete task to be successful, each subtask needs to be mastered: it would not be possible to stack a prism onto a cube if it were not grasped correctly in the first place.
To summarize, the core contributions of CNRL are as follows:

It enables a multi-level hierarchy of problem decomposition, allowing very complex tasks to be broken down into tractable subproblems.

Conversely, it allows existing solutions to subproblems to be composed into an overall solution without requiring retraining, regardless of the algorithms and state space definitions used to solve each subproblem.

It can readily incorporate subtask solutions created with non-RL methods, such as classical controllers.

The method of composing subtask solutions scales to large hierarchies, significantly improving sample complexity compared to the state of the art.

By limiting policy execution to well-explored regions of state space, it can improve the safety and reliability of execution when deployed into production.
2 Background
This section provides a brief review of the reinforcement learning (RL) problem, and of how deep neural networks can be used to adapt simple Q-learning and policy optimization algorithms to tackle complex learning tasks. We focus on the algorithms used in this work: Deep Q-Networks (DQN), Trust Region Policy Optimization (TRPO), and hierarchical reinforcement learning (HRL). We present the generic framework of these algorithms, with just the background needed to understand our approach.
2.1 Reinforcement Learning
We consider the version of the reinforcement learning problem where an agent interacts with an environment in discrete timesteps. At each timestep t, the agent observes a state s_t, performs an action a_t, transitions to a new state s_{t+1}, and receives a reward r_t from the environment. The goal of reinforcement learning is to optimize the agent’s action-selecting policy such that it achieves maximum expected return.
The sequence s_0, a_0, r_0, s_1, a_1, r_1, … is modeled as a Markov Decision Process (MDP) with state transition probability p(s_{t+1} | s_t, a_t) and distribution over initial states p(s_0). We denote the agent’s policy in terms of the probability distribution over actions, \pi(a_t | s_t), and define the return as the expected discounted reward, with discount factor \gamma \in [0, 1], received reward r_i, and T the time step at the terminal state. The standard action-value function Q^{\pi}(s, a) is the critic function for evaluating the value of each action in each state:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{i=t}^{T} \gamma^{i-t} r_i \,\middle|\, s_t = s, a_t = a \right]    (1)
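As a concrete illustration, the discounted return in (1) for a finite episode of recorded rewards can be computed by folding backwards from the terminal step (a minimal sketch; the function name is ours):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_i gamma^(i-t) * r_i over a finite episode, for t = 0."""
    ret = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```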
2.2 Q-Learning and Deep Q-Networks
Deep Q-Networks (DQN) extend the Q-learning algorithm (Watkins and Dayan [1992]) by approximating the critic function (1) with deep neural networks (Mnih et al. [2015]). Like Q-learning, DQN solves the RL problem by maximizing (1), in which the solution satisfies the Bellman equation:

Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]    (2)

where \pi(s) = \arg\max_a Q(s, a) is the greedy policy. With random initialization, a Q function iteratively updated using the Bellman equation converges to the optimal solution via exploration over states and actions. DQN approximates the Q function with a neural network, with the policy converging toward the optimal solution via periodic updates to the parameters of the approximate Q function. With DQN, the solution to the Bellman equation is approached by solving a least-squares regression problem with the following loss function:

L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^2 \right]    (3)

where \theta^{-} parameterizes the slowly updating target Q-network that is used to periodically adjust the parameters \theta of the Q network. An experience replay buffer D stores transitions (s, a, r, s', d), with d as the termination flag, and the network is trained on minibatches sampled at random from this buffer. To ensure adequate exploration of the state space, action selection follows an \epsilon-greedy strategy that selects a random action with probability \epsilon, which is annealed after each training episode. More details and techniques on DQN can be found in (Mnih et al. [2015]) and (Van Hasselt et al. [2016]).
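The target term inside the loss (3) and the ε-greedy action selection can be sketched in a few lines (illustrative helper names, not from any specific DQN implementation):

```python
import random

def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target r + gamma * max_a' Q(s', a'; theta^-),
    truncated when the termination flag d is set."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def epsilon_greedy(q_values, epsilon):
    """Select a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```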
2.3 Policy Optimization
Policy optimization methods learn the policy directly, adjusting it to make higher rewards more likely given the observed sequences of states, actions, and rewards. There is a long line of work in the literature improving the robustness and scalability of such methods, including methods based on derivative-free optimization and policy gradients.
The two main approaches based on derivative-free optimization are the Cross-Entropy Method (Mannor et al. [2003]) and Covariance Matrix Adaptation (Hansen and Ostermeier [1996]). They frame the problem as stochastic optimization, in which the distributions of policy parameters are repeatedly updated using statistics drawn from the most successful sampled paths, and the estimated distributions converge to the optimized solution. The benefits are high scalability and fast convergence, but these methods may not perform as well as gradient-based policy optimization algorithms on problems such as the game Tetris (Gabillon et al. [2013]).
Policy gradient methods are another active field of research in policy optimization. They mainly refer to techniques that optimize the expected return with respect to the policy parameters using gradients. The main challenge is to approximate the gradient with high accuracy when the reward is delayed. Spall [2003] introduces an approach using the Finite Difference Method, where gradient estimation is formulated as a regression problem such that policy gradients fit the temporal difference of the expected reward over small perturbations of the policy parameters. The weakness of this approach is that it requires prior knowledge of the system dynamics, as inappropriate parameter tuning leads to learning divergence.
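The finite-difference idea can be sketched by treating the expected return as a black-box function of the policy parameters (an illustrative toy, not Spall’s exact formulation):

```python
def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate the gradient of expected return J with respect to policy
    parameters theta by perturbing one parameter at a time
    (central differences)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((J(up) - J(down)) / (2 * eps))
    return grad
```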
Tackling policy optimization via stochastic optimization, Benbrahim and Franklin [1997] compute the policy gradients using the likelihood ratio, which yields fast policy convergence. However, the method has limitations when training deterministic policies. To address this problem, Silver et al. [2014] and Lillicrap et al. [2015] develop a pathwise gradient method called the Deterministic Policy Gradient, which computes the policy gradients using the derivative of the output of a critic function with respect to the policy parameters. By approximating the critic and policy functions with neural networks, they demonstrate successful results on numerous robotics tasks. Unfortunately, current implementations are limited to continuous action spaces, and the performance of the algorithm is sensitive to hyperparameter tuning.
Based on the Conservative Policy Iteration algorithm (Kakade and Langford [2002]), Schulman et al. [2015a] propose Trust Region Policy Optimization (TRPO), an algorithm that maximizes a monotonic-improvement surrogate objective subject to a stochastic policy constraint, within which the policy gradient is estimated. Compared to other policy gradient methods, TRPO improves learning stability and accuracy, and converges faster.
For our experiments in this paper, we use the TRPO algorithm with generalized advantage estimation (Schulman et al. [2015b]), as it yields accurate training results on a wide variety of reinforcement learning tasks with little hyperparameter tuning.
2.4 Hierarchical Reinforcement Learning
Effective exploration is one of the main challenges in MDPs. Although methods like \epsilon-greedy can be effective, in large state spaces they are insufficient to explore the full space. To tackle this problem, one can use goals and temporal abstractions: at each time t and for each state s_t, a higher-level controller chooses a goal g_t from G, the set of all possible goals currently available to the controller. Goals provide intrinsic motivations for the agent so that it finishes the overall complex task by choosing a sequence of goals in the right order.
Each goal remains active for some amount of time, until a predefined terminal state is reached. An internal critic evaluates how close the agent is to satisfying the terminal condition of g_t and sends the appropriate reward to the controller. The objective of the controller is to maximize the accumulated rewards received from the environment as the agent executes the policies associated with the chosen goals. This setup is very similar to classical RL, except that an extra layer of abstraction is defined over the set of actions, so that there are specific actions for each of the goals. Different approaches to hierarchical RL are variants of this overall scheme, choosing different tradeoffs in flexibility, training speed, and other properties. We describe our approach to hierarchical RL below.
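The two-level loop described above can be sketched as follows: a higher-level controller picks a goal, that goal’s policy acts until its terminal condition fires, and control returns to the controller. All names here (env.step, env.done, and so on) are illustrative, not from a specific library:

```python
def run_hierarchical_episode(env, meta_policy, goals, max_steps=200):
    """goals maps each goal id to a (policy, terminal_condition) pair."""
    state, total_reward, steps = env.reset(), 0.0, 0
    while steps < max_steps and not env.done():
        # Higher-level controller chooses goal g_t for the current state.
        policy, terminal = goals[meta_policy(state)]
        # Low-level policy acts until the goal's terminal condition fires.
        while True:
            state, reward = env.step(policy(state))
            total_reward += reward
            steps += 1
            if terminal(state) or env.done() or steps >= max_steps:
                break
    return total_reward
```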
3 Related Work
3.1 Reinforcement Learning with Temporal Abstractions
Sutton et al. [1999] propose the options framework, which extends the usual notion of action into closed-loop policies for taking actions over an extended period of time (options). This sort of temporal abstraction makes it possible to solve complex tasks by decomposing them into a combination of high-level abstractions and primitive actions. The framework allows ways of changing or learning the internal structure of options, improving existing options, or even combining a given set of options into a single overall policy. The theory extends Markov Decision Processes (MDPs) for reinforcement learning by building on the theory of Semi-Markov Decision Processes (SMDPs).
A shortcoming of this approach is that the structure of an option is not readily exploited. As Precup [2000] writes, “SMDP methods apply to options, but only when they are treated as opaque indivisible units. Once an option has been selected, such methods require that its policy be followed until the option terminates. More interesting and potentially more powerful methods are possible by looking inside options and by altering inside their internal structure.” Our concepts are equivalent to the options described by the framework in that a concept can be a temporally extended action or a primitive. At the same time, our concepts are not indivisible: they can be broken down into subconcepts to lead to a truly hierarchical reinforcement learning framework.
Kulkarni et al. [2016] propose a scheme for temporal abstraction that involves simultaneously learning options and a control policy to compose options in a deep reinforcement learning framework. The authors use goals to enable better exploration of the state space. The agent focuses on learning sequences of goals in order to maximize the cumulative extrinsic reward while learning options simultaneously. The agent only uses a two-stage hierarchy (consisting of a controller and a meta-controller), and is thus limited in its ability to scale with the number of goals. In contrast, our fully hierarchical approach scales to many more goals.
Tessler et al. [2017] propose a hierarchical model that is able to retain learned knowledge and transfer the knowledge to new tasks. They achieve the knowledge transfer by using a variation of policy distillation (Rusu et al. [2015]). They tackle the problem of scalability (with an increasing number of skills) by encapsulating multiple policies into a single distilled network, and use temporally extended actions to solve tasks with lower sample complexity.
3.2 Applications in Robotics
Levine et al. [2016] use Convolutional Neural Networks (CNNs) to support vision-based robot control (visual servoing), training a network to predict, from camera images, the probability of success of a given grasp attempt by a classical controller, and aborting and replanning if the success probability is too low. They used between 6 and 14 robots to gather samples in parallel, demonstrating the value of parallel sampling when applying deep learning to robotic control. This approach requires an already existing controller that can accomplish the task reasonably well, and augments it with deep learning.
Gu et al. [2016] go further and learn full control policies capable of grasping, picking and placing, and door opening, in simulation and on real platforms. They use staged reward functions, shaping, and parallel sampling in simulation to train policies capable of completing complex tasks from low-dimensional input in 500,000 samples. Though they break their reward functions for complex tasks into stages, these stages are not recomposable.
Popov et al. [2017] tackle the problem of combining a set of options into a single more complex overall policy by using a form of apprenticeship learning that facilitates learning of long-horizon tasks even with sparse rewards. They also introduce a recipe for constructing reward functions for complex tasks consisting of a sequence of subtasks. Using previously trained policies for subtasks, they sample states from successful policy executions and use those as initial states when training a policy to complete the full task. This biases exploration towards regions of state space useful for solving the subconcepts, accelerating policy convergence. Using these pretrained subconcepts for exploration, in conjunction with staged shaped rewards, they were able to learn the task we explore here: picking up an object and precisely stacking it on another. The optimized algorithm learned this task using approximately 1,000,000 environment transitions. Our approach for this problem, described in detail below, is able to use the solutions to subtasks directly, and solves the same task using only 22,000 environment transitions.
Finn et al. [2016] explore an alternative method of simplifying reward function construction for complex tasks: using inverse reinforcement learning to learn a reward function from expert policy execution. By observing successful off-policy execution of a task, they estimate the reward function that led to that policy based on the features provided, and then train a policy to optimize this new reward function. This has the advantage of obviating the need to hand-craft a reward function when an expert can demonstrate the skill using the robot. This could be combined with our approach to generate rewards for reusable subconcepts.
4 Concept Network Reinforcement Learning
As discussed earlier, an industrially applicable approach to solving complex tasks using reinforcement learning should facilitate problem decomposition, simplify reward function design, train quickly and robustly, and produce a policy that can be executed safely and reliably when deployed. Concept Network Reinforcement Learning (CNRL) is a variant of Sutton’s options framework that achieves these goals. This section describes CNRL in detail, and discusses its benefits and tradeoffs.
CNRL is based on decomposing the overall learning problem into concepts, each of which represents some aspect of the solution. There are three types of concepts: control concepts, which define actions to take in certain situations; selector concepts, which choose one of their subconcepts to act next; and transformation concepts, which transform low-level state input into higher-level perceptual features that are more useful for subsequent concepts. The overall concept network is a directed acyclic graph: the overall system state comes in, a mixture of control, selector, and transformation nodes processes that state, and the network ultimately produces the action to execute in the environment. An example is shown in Figure 2. We now describe each type of concept in more detail.
4.1 Selector Concepts
Figure 1 shows the structure of a selector concept: selectors accept the state from the environment and choose one of a set of child concepts, which can be either other selectors or control concepts. This child concept’s policy then interacts with the environment, receiving states and generating actions to transition the environment until the child reaches its terminal condition, at which point the selector again receives the new state and the new value of its own reward function, and makes a new choice. Execution can recursively descend through several selectors in turn before a control concept is reached. Treating skills implemented by child nodes as discrete units for the selector speeds exploration and avoids unnecessary backtracking—if a new child concept were selected for each time step, one concept’s policy could undo the progress made by another. Our results in Section 6 demonstrate how this greatly speeds up training.
The selector’s children are treated as policy-implementing black boxes. This allows incorporating control concepts implemented via non-RL methods such as traditional control, and allows selectors to be nested: simple skills may be grouped together under a selector to form a more complex skill, which may in turn be selected by its parent.
Selectors are trained using a discrete-action algorithm; we use DQN in our implementation. Because the chosen policies are trained separately and treated as black boxes that execute a policy to termination, reward functions for selectors can be simple, typically rewarding progress toward an overall goal. If the selector’s task can be solved with a small number of child policy executions, or the right child to pick is easy to deduce from the state, this becomes a simple, short-horizon reinforcement learning problem, and the selector is very quick to train. Section 6 demonstrates this on the pick-and-place robotics task.
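A selector’s choice step can be sketched as follows, combining its DQN-style Q-values with the validity-region masking discussed in Section 4.2.1 (names are illustrative, not from our implementation):

```python
def select_child(state, q_function, children, is_valid):
    """Pick the highest-valued child concept whose validity region
    contains the current state (the masking variant, rather than the
    no-op mechanism)."""
    q = q_function(state)
    eligible = [i for i in range(len(children)) if is_valid[i](state)]
    best = max(eligible, key=lambda i: q[i])
    return children[best]
```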
4.2 Control Concepts
As illustrated in Figure 1, a control concept takes state as input and produces an action; it can implement a single-step or multi-step policy. Control concepts can be learned using RL, can use a manually coded controller, e.g. one based on inverse kinematics, or can be implemented using a pretrained neural-network-based controller, perhaps reused from another concept network. A policy can even learn behavior on only part of the action space, with a following transformation node adding hard-coded behavior on the rest: for example, when learning to orient the gripper in our robotics task, we hard-code that the gripper fingers should be open, and let the network learn to control the other arm joints.
The black-box nature of concepts allows each to be specified with the most appropriate state space and action space for the task. Transformation nodes before the concept node can convert the input to an appropriate form, e.g. by omitting irrelevant elements or augmenting with derived properties. The learning problem is solved with the transformed input and an appropriate action space, and a following transformation node transforms the output as needed for following nodes.
Each learned control concept has its own reward function, independent of the overall problem. Thus, reward shaping considerations are encapsulated within concepts, and only need to be defined on the relevant portions of the concept’s state and action space. Each learned control concept can also be trained with the most appropriate learning algorithm for that task. This ability to customize the training approach for each subproblem speeds up task design and iteration, and can significantly speed up training.
4.2.1 Validity regions for control concepts
Once a control concept is selected, it continues to execute its policy until it hits one of its terminal conditions. There are three types of terminal conditions. The first is completing the overall task, successfully ending the episode. The second is completing the concept’s task and returning control to its parent selector. The third is the system state leaving the concept’s validity region, a configurable set of states where the concept is allowed to run. The validity region also constrains where the concept may start execution. When the state is outside this region, parent selectors are not allowed to select the concept. (We implement this in our system by having the control concept return a no-op action if chosen in a terminal region, so the selector learns not to make such choices. This could also be implemented by directly masking out ineligible children in the DQN output.)
By cutting off unpromising exploration quickly, validity regions can drastically reduce the time needed to learn concepts. At deployment time, the validity region ensures that the concept can only execute its policy in regions of state space that it has explored during training and in which its policy has been well characterized and deemed safe. Without such constraints, the undefined behavior of RL-based control policies outside the state space they explored during training can pose a significant safety hazard when deployed into production. The validity region can be configured differently during training and deployment. This can be used to provide an additional margin for error, or to further restrict the work space where execution of a given skill is permitted.
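The three terminal conditions for a running control concept can be sketched as a single loop (an illustrative sketch; the function and predicate names are ours):

```python
def run_control_concept(env, state, policy, task_done, concept_done,
                        in_valid_region, max_steps=50):
    """Execute a control concept's policy until one of its three terminal
    conditions fires: the overall task succeeds, the concept's own subtask
    completes, or the state leaves the concept's validity region."""
    for _ in range(max_steps):
        if task_done(state) or concept_done(state) or not in_valid_region(state):
            break
        state = env.step(policy(state))
    return state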
4.3 Transformations
Concept networks can include transformations, which can act on states or actions to adapt them for downstream use. Just like control concepts, transformations can be hard-coded, pretrained, or learned. The two primary uses are transforming states into higher-level representations and adjusting state inputs and action outputs to match each concept’s requirements. Typical examples of the former include perception tasks, such as converting visual input into an object identification or converting text into a topic vector and a sentiment estimate. When such perception transformations are learned using neural networks, it may be more effective to output an embedding vector derived from the penultimate network layer, rather than the output value used for training (e.g. the object class, or the selected text topic).
Other uses of transformations include filtering out or converting state information as shown in Figure 1, adding in hardcoded action elements, mapping data representations into other forms (e.g. converting from cartesian to polar coordinates), and concatenating action elements from concepts controlling individual aspects of the overall task into a complete action vector. The ability to include transformations in concept networks naturally follows from the fact that each concept can be trained separately, and gives great practical flexibility in reuse and network organization.
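Two of the simplest transformation nodes mentioned above can be sketched directly (hypothetical helpers; indices and values are illustrative):

```python
def filter_state(state, keep_indices):
    """Transformation node: drop state elements irrelevant to a concept,
    e.g. omitting the cube's position for the grasp subtask."""
    return [state[i] for i in keep_indices]

def merge_action(learned_part, fixed_part):
    """Transformation node: concatenate a learned partial action with
    hard-coded elements, e.g. a fixed open-gripper command."""
    return list(learned_part) + list(fixed_part)
```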
4.4 Discussion
The CNRL approach has many benefits and directions for extension. Perhaps the greatest benefit is the ability to truly decompose reinforcement learning problems into independent parts. This is crucial for applying RL to real industrial problems, allowing teams to divide and conquer: different groups can independently work on different aspects of a learning problem, quickly assemble them into a full solution, and upgrade individual components later.
Because a selector treats its child concepts as black boxes, each can be implemented using the technique most appropriate to the problem. In robotics, for example, complex tasks requiring dexterous manipulation like grasping may be implemented with deep reinforcement learning, while well-characterized tasks like moving between work spaces can be handled by inverse kinematics. For the same reason, entire concept networks are reusable and composable: a solution to one problem can be used as a component in a larger problem.
Individual concepts in a network can be easily replaced with alternate implementations, allowing easy experimentation and incremental improvement – a hard-coded controller can be replaced with a learned one, or an intractable concept can be further subdivided, all without requiring any change in the rest of the concept network. Additionally, the independence among all children of a selector typically allows them to be trained in parallel.
CNRL, like other hierarchical methods, makes the model more explainable than monolithic training methods: seeing the concepts activated by each selector gives higher-level insight into the behavior than simply seeing the low-level actions at each time step.
CNRL can have several extensions beyond what is described here. As presented, the task must be broken down into concepts that completely cover the state space – if there are situations where none of the children of a selector can make progress, the overall problem will be unsolvable. We have designed and implemented a more advanced selector that can synthesize a policy to cover such gaps, and will report the details in a future publication. In certain tasks with sequential concepts, completely independent parallel training of concepts may be impossible, if the starting conditions of one concept depend on the end state of the previous concept. In such settings, the system needs to coordinate these end and start conditions among concepts. Finally, decomposing the problem into completely independent pieces prevents the system from adjusting all the elements end-to-end: each concept is trained independently. In settings where joint training or fine-tuning is crucial, one can start with a combined concept integrating several components, then split them apart again after training for reuse and independent evolution.
5 Solving “Grasp and Stack” with CNRL
5.1 Concept Network
We demonstrate CNRL on the task of grasping a rectangular prism and precisely stacking it on top of a cube. The rectangular prism was chosen for ease of manipulation by the gripper provided on the JACO model. We initially broke the overall task down into four subconcepts: 1) reaching the working area, 2) grasping the prism, 3) moving to the second working area, and 4) stacking the prism on top of the cube. During training we found that TRPO, the algorithm we chose for training control concepts, had difficulty learning a single policy to grasp the prism. To simplify the learning problem, we broke grasping into two subconcepts: orienting the hand around the prism in preparation for grasping, and lifting the prism, for a total of five control concepts in the concept network. Three of these – orienting, lifting, and stacking – used TRPO to train, while moving to the working area for both grasp and stack (Staging1 and Staging2 in Fig. 4 and Fig. 5) were handled with inverse kinematics.
We explored two concept hierarchies: a single selector with five children (Fig. 4), and a multi-level tree with two selectors, one with four children and the other with two (Fig. 5). In this small example there is little benefit to nesting selectors in this way, but as the size of the tree scales, the ability to encapsulate sections of the problem, and to separately learn the correct circumstances in which to invoke concepts to solve each subproblem, will improve training parallelization, help keep large concept trees organized, and bound the complexity of the task any single selector must learn.
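The multi-level hierarchy of Fig. 5 can be written down as a small nested structure (a hypothetical representation of ours, not the system’s actual configuration format; the top-level selector name is invented):

```python
# One top-level selector with four children; the Grasp child is itself a
# selector over two TRPO-trained control concepts.
concept_network = {
    "selector": "GraspAndStack",  # trained with DQN
    "children": [
        {"concept": "Staging1", "impl": "inverse kinematics"},
        {"selector": "Grasp",     # nested selector, also trained with DQN
         "children": [
             {"concept": "Orient", "impl": "TRPO"},
             {"concept": "Lift", "impl": "TRPO"},
         ]},
        {"concept": "Staging2", "impl": "inverse kinematics"},
        {"concept": "Stack", "impl": "TRPO"},
    ],
}
```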
5.2 State Spaces, Action Spaces, and Rewards
The state vector provided to the agent varied from concept to concept, as did the action space. For example, the reach and grasp stages only need to consider the location of the prism, not the cube, so the cube’s location was omitted from their state space. All actions correspond to target velocities for the nine associated joints. In some cases, only some of the target velocities were learned, with the others hard-coded: keeping the gripper always open during orient, and always closed during moving and stacking. The action and state vectors are described in Tables 1 and 2, while the rewards and terminals are described in Appendices A and B, respectively.
5.3 Experimental Setup
The agent controlled a Kinova JACO arm simulated in MuJoCo. Episodes of the full task were terminated after 150 steps, with a control frequency of . Subconcepts were terminated after 50 steps. All concepts were terminated early under conditions laid out in Appendix B to curtail unnecessary exploration and improve sample efficiency.
The positions and initial angular velocities of the arm joints started with a small random variation, and the positions of the prism and cube varied by in the x and y axes, while the orientations varied by radian.
We report results using the hierarchical concept network shown in Fig. 5. Stack, Orient, and Lift are control concepts trained using TRPO, while the full concept selector and the Grasp selector were trained using DQN. Each node was trained after all of its subconcepts had finished training and their weights were frozen.
We trained the TRPO concepts using the publicly available OpenAI Baselines parallel TRPO implementation, using the ADAM optimizer and 16 parallel workers. We used default hyperparameters, including a batch size of 1024, a maximum KL divergence of 0.01, a gamma of 0.99, and a step size of 1e-3. We made no modifications to the underlying algorithm, to facilitate replication and comparison.
We trained the DQN concepts using the OpenAI Baselines DQN implementation, with the ADAM optimizer and only a single worker. DQN was trained with a batch size of 64, learner memory capacity of 50000 samples, a minimum learner memory threshold of 1000 samples, an exploration probability that decayed from 1 to 0.02 over 10000 steps, a gamma of 0.98, and a learning rate of 5e-4.
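The hyperparameters above, and the bottom-up training schedule (every subconcept trained and frozen before its parent selector), can be summarized as follows; the config dicts restate values from the text, while the tree traversal is our own sketch:

```python
# Hyperparameters as reported in the text, collected as plain dicts.
TRPO_PARAMS = {"batch_size": 1024, "max_kl": 0.01, "gamma": 0.99,
               "step_size": 1e-3, "workers": 16}
DQN_PARAMS = {"batch_size": 64, "buffer_size": 50_000,
              "learning_starts": 1_000, "exploration_final_eps": 0.02,
              "exploration_steps": 10_000, "gamma": 0.98, "lr": 5e-4,
              "workers": 1}

def training_order(node, children_of):
    """Post-order traversal of the concept tree: each child is trained
    (and its weights frozen) before its parent selector is trained."""
    order = []
    for child in children_of.get(node, []):
        order.extend(training_order(child, children_of))
    order.append(node)
    return order
```

For a tree like the hierarchical one described here (e.g. a full-task selector whose children include a Grasp selector over orient and lift), this yields orient and lift before Grasp, and everything before the full-task selector.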
6 Experimental Results
Training performance for DQN was evaluated with ten testing episodes for every 50 training episodes, with mean performance in each testing pass plotted in the selector performance graphs below. Training performance for TRPO uses the raw training episode returns, which are less representative of true policy performance but served well enough to show when the policy had converged. In plots showing the performance of DQN, the x axis represents transitions sampled so far, and the y axis represents mean episode reward. Final evaluation of robustness for both DQN and TRPO was done without exploration.
The orient and stack concepts trained in approximately 2–3 million samples using shaping rewards and guiding terminals, without the need for hyperparameter tuning. The training graphs for the TRPO concepts are presented in Fig. 6. In one of the training runs the lift concept did not converge within 7.5 million samples, and this data was omitted. The very tight terminal constraint on the distance the prism may move from its starting xy coordinates, designed to encourage a straight vertical lift, also increased the number of samples required to find a good policy through exploration, and in at least one case pushed it beyond the sample budget we allotted. Better-designed terminal conditions and rewards could undoubtedly speed up training on this task, but almost all of the policies were sufficient to complete the task effectively.
The full concept selector trained in 22,000 samples (Fig. 7), though the selector itself only saw 6,000 samples, as it does not receive state transitions during the long-running execution of its children. When concepts are compatible – i.e., one concept ends within the operating constraints of another – and there exists some chain of compatible concepts that will achieve a goal, the selector can learn to order these concepts very quickly, without the need to train a monolithic network to subsume the components. Given previously trained subconcepts, the ordering task can be learned nearly two orders of magnitude faster than the individual concepts, and roughly 45x faster than the single policy of Popov et al. [2017], which required about one million samples.
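The sample-count gap (22,000 environment transitions vs. 6,000 selector transitions) follows from the selector treating each child's entire run as a single macro-action, in the spirit of the options framework. A minimal sketch of that bookkeeping, with hypothetical names:

```python
# Each child concept runs to its own termination, and the whole run
# collapses into ONE selector-level transition (an SMDP macro-action).
# This is an illustrative sketch, not the paper's implementation.

def selector_episode(env, state, select, concepts, replay, max_macro=10):
    """Collect selector-level (s, a, R, s') tuples, one per child run."""
    for _ in range(max_macro):
        a = select(state)                         # which concept to run
        next_state, macro_reward, done = concepts[a](env, state)
        replay.append((state, a, macro_reward, next_state))
        state = next_state
        if done:
            break
    return state
```

A DQN trained on `replay` therefore consumes far fewer transitions than the environment generates, which is consistent with the 6,000 vs. 22,000 figures above.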
In 500 episodes we observed no task failures during execution, both when the subconcepts were executed individually in their own environments and when the tree with selectors solved the full task. The concept network is able to grasp an object and precisely stack it on another very reliably, under varying positions and orientations. Videos of the trained policies can be seen at http://bns.ai/robotics_blog.
Concept    State
---------  -----
Orient     For the six joints excluding the fingers:
           1. The sine and cosine of the angles of the joints.
           2. The angular velocities of the joints.
           3. The position and quaternion orientation of the prism.
           4. The orientation of the line between two of the opposed fingers, in degrees normalized by .
           5. The euclidean distance between the pinch point of the opposed fingers and a point above the prism.
           6. The vector between the same two points above.
Lift       For all nine joints:
           1. The sine and cosine of the angles of the joints.
           2. The angular velocities of the joints.
           3. The euclidean distance between the pinch point of the opposed fingers and the center of the prism.
           4. The vector between the same two points above.
           5. The xy vector between the center of the prism and the starting position of the prism.
Stack      For the six joints excluding the fingers:
           1. The sine and cosine of the angles of the joints.
           2. The angular velocities of the joints.
           3. The position and quaternion orientation of the prism.
           4. The position and quaternion orientation of the cube.
           5. The euclidean distance between the bottom of the prism and the top of the cube.
           6. The vector between the same two points above.
Selectors  1. The position of the pinch point between the opposed fingers.
           2. The euclidean distance between the pinch point and the prism.
           3. The position of the centre of mass of the prism.
           4. The position of the centre of mass of the cube.
           5. The euclidean distance between the bottom of the prism and the top of the cube.
Concept    Action
---------  ------
Orient     Target angular velocities for the 6 joints not including the fingers; the fingers extend maximally.
Lift       Target angular velocities for the upper arm (1st, 2nd, and 3rd joints) and the opposed fingers (7th and 9th joints); the remaining finger receives no command.
Stack      Target angular velocities for the 6 joints not including the fingers; the fingers close with moderate force.
7 Conclusion
We presented the Concept Network Reinforcement Learning (CNRL) framework, which enables true problem decomposition for reinforcement learning. A complex learning problem can be broken down into concepts, each learned independently and then reassembled into a complete solution. Decomposing problems in this way can greatly reduce the amount of training needed to achieve a useful result.
Independent training of concepts allows each to use a focused reward function, simplifying reward design. Where solutions to subproblems already exist – such as moving a robotic arm from place to place – they can be plugged in seamlessly among learned concepts. Similarly, individual concepts can be reused as components in other tasks, or replaced with improved versions.
CNRL is suitable for industrial applications, allowing for flexible goal specifications and rapid transfer of solutions to new variants of a problem. Compared with training monolithic networks to solve complete tasks, CNRL greatly accelerates the speed with which new combinations of functionality can be trained and built upon.
CNRL has deployment-time benefits: the training process for concepts naturally produces policies with well-defined validity regions, so they can be executed safely and reliably. It also provides improved explainability: by tracking which subconcepts are activated when generating behavior, the system can provide context for why decisions were made.
We demonstrated CNRL on a complex robotics task requiring dexterous manipulation – grasping a prism and precisely stacking it on a cube. We successfully solved the task, incorporating several inverse-kinematics-based classical controllers as well as a hierarchically decomposed set of learned concepts. Our approach to assembling subconcepts into the overall solution is extremely fast, taking 45x fewer samples than a state-of-the-art approach on the same task from Popov et al. [2017].
There are many directions for future work. We plan to tackle problems where the provided subconcepts are not sufficient to solve the complete task, and the selector concept must synthesize additional behavior to cover the gaps. We will report on experiments applying CNRL to a wider variety of learning tasks, including tasks that require learned concepts for perception. Finally, we will apply these techniques to realworld tasks.
References
 Benbrahim and Franklin [1997] Benbrahim, H., Franklin, J.A., 1997. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems 22, 283–302.
 Finn et al. [2016] Finn, C., Levine, S., Abbeel, P., 2016. Guided cost learning: Deep inverse optimal control via policy optimization. arXiv:1603.00448 .
 Gabillon et al. [2013] Gabillon, V., Ghavamzadeh, M., Scherrer, B., 2013. Approximate dynamic programming finally performs well in the game of tetris, in: Advances in neural information processing systems, pp. 1754–1762.
 Gu et al. [2016] Gu, S., Holly, E., Lillicrap, T., Levine, S., 2016. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. arXiv preprint arXiv:1610.00633.
 Hansen and Ostermeier [1996] Hansen, N., Ostermeier, A., 1996. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, in: Proceedings of the IEEE International Conference on Evolutionary Computation, pp. 312–317.

 Kakade and Langford [2002] Kakade, S., Langford, J., 2002. Approximately optimal approximate reinforcement learning, in: The 19th International Conference on Machine Learning, pp. 267–274.
 Kulkarni et al. [2016] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J., 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation, in: Advances in Neural Information Processing Systems, pp. 3675–3683.
 Levine et al. [2016] Levine, S., Pastor, P., Krizhevsky, A., Quillen, D., 2016. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. arXiv preprint arXiv:1603.02199.
 Lillicrap et al. [2015] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 .
 Mannor et al. [2003] Mannor, S., Rubinstein, R., Gat, Y., 2003. The cross entropy method for fast policy search, in: International Conference on Machine Learning, pp. 512–519.
 Mnih et al. [2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al., 2015. Human-level control through deep reinforcement learning. Nature 518, 529–533.
 Popov et al. [2017] Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe, T., Tassa, Y., Erez, T., Riedmiller, M., 2017. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073.
 Precup [2000] Precup, D., 2000. Temporal abstraction in reinforcement learning .
 Rusu et al. [2015] Rusu, A.A., Colmenarejo, S.G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., Hadsell, R., 2015. Policy distillation. arXiv preprint arXiv:1511.06295 .
 Schulman et al. [2015a] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P., 2015a. Trust region policy optimization, in: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897.
 Schulman et al. [2015b] Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P., 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
 Silver et al. [2016] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al., 2016. Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489.
 Silver et al. [2014] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M., 2014. Deterministic policy gradient algorithms, in: The 31st International Conference on Machine Learning, pp. 387–395.
 Spall [2003] Spall, J.C., 2003. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley-Interscience.
 Sutton et al. [1999] Sutton, R.S., Precup, D., Singh, S., 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112, 181–211.
 Tessler et al. [2017] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D.J., Mannor, S., 2017. A deep hierarchical approach to lifelong learning in minecraft, in: AAAI, pp. 1553–1561.
 Van Hasselt et al. [2016] Van Hasselt, H., Guez, A., Silver, D., 2016. Deep reinforcement learning with double Q-learning, in: AAAI, pp. 2094–2100.
 Watkins and Dayan [1992] Watkins, C.J., Dayan, P., 1992. Q-learning. Machine learning 8, 279–292.
Appendix A Reward Shaping
As discussed previously, the full control concept is decomposed into separate "Grasp" and "Stack" concepts (skills), while Grasp itself is decomposed into "Orient" and "Lift". In this section, we present the precise shaping reward functions used for the training of each concept.
a.1 Orient
(4) 
where r_θ is the angular component of the shaping reward for stack and orient; θ_x, θ_y, and θ_z are the angles between the line passing through the two opposed fingers and the x, y, or z axis in the reference frame of the target object, respectively; and α controls the sharpness of the shaping. Since the objects are symmetrical in x and y, we allow any of the four orientations of the fingers that line up with the x or y axis by only looking at the smallest angular distance to either axis, yielding a distance that ranges from 0 to 45 degrees. The z axis must uniquely line up with the object, so θ_z ranges from 0 to 90 degrees. Here, we used an α value of .
(5) 
where r_d is the shaping reward for reaching toward the goal location, d is the distance between the pinch point of the opposed fingers and the goal location, d_term is the terminal distance for this task, and α controls the sharpness of the shaping.
(6) 
where τ is a time decay factor applied to the reward to encourage fast completion, t is the current time step within the episode, and T is the time step limit for an episode,
(7) 
where r_orient is the final reward for the "Orient" concept, b_orient is the bonus awarded on successful completion of the orient task, d is the distance between the pinch point and the prism, θ is the angle between the line connecting the opposed fingers and one of the axes of the prism, and d_tol and θ_tol are their respective tolerances.
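The bodies of the orient shaping equations did not survive in this copy, so the sketch below implements only what the prose fixes: the symmetry-aware angular distance (smallest distance to the x or y axis, giving a 0–45 degree range), a sharpness-controlled shaping term, and a time decay. The exponential shaping form and all constants are our assumptions:

```python
import math

# Sketch of the orient shaping pieces described in the prose. The
# exponential form of the shaping curve and the alpha value are
# assumptions, not recovered from the paper.

def symmetric_angle_deg(theta_deg):
    """Smallest angular distance (degrees) from the finger line to the
    x or y axis of the object, exploiting its 4-fold symmetry; the
    result lies in [0, 45]."""
    r = theta_deg % 90.0
    return min(r, 90.0 - r)

def shaping(value, scale, alpha=5.0):
    """Generic shaping term in (0, 1]: equals 1 at value=0 and decays
    with sharpness alpha (assumed exponential form)."""
    return math.exp(-alpha * value / scale)

def time_decay(t, t_max):
    """Decay factor encouraging fast completion (assumed linear form)."""
    return 1.0 - t / t_max
```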
a.1.1 Lift
(8) 
where r_pinch is the pinch shaping reward component for closing the fingers, d_f is the distance between the two opposed fingers, and d_max is the maximum possible distance between the fingers.
(9) 
where r_h is the height shaping component for lifting the prism, h is the height of the prism off the ground, and h_lift is the distance at which we declare the prism lifted and terminate the episode. Here we used an h_lift of .
(10) 
where r_lift is the final reward for the "Lift" concept, b_lift is the bonus reward assigned for successfully lifting the prism above the threshold height, h is the height of the prism, h_lift is the threshold height, and d_pinch is the threshold distance between the fingers below which they are considered pinched. The bonus rewards for success are greater than the total reward that could have been accumulated had the agent remained in this highest-reward state for the remaining time in the episode, to encourage fast completion of the task.
a.2 Grasp
Here, r_grasp is the final shaping reward for the "Grasp" concept.
(11) 
a.3 Stack
(12) 
where d is the distance between the pinch point and the goal, θ_tol and d_tol are the thresholds for success in angular and euclidean distance, w_θ and w_d are the weights assigned to those reward components, and b_stack is the bonus reward assigned for successful completion of the stack task.
a.4 Full Task
(13) 
Appendix B Terminal Conditions
Orient
For the orient concept, an episode would end early if the hand moved too far from the prism, if the prism tipped more than 15 degrees, or if the goal was achieved by aligning the opposed fingers with the prism while the pinch point was 1.5 cm above the prism.
Lift
For the lift concept, an episode would end early if the prism was moved outside a virtual cylinder centered on the starting position of the prism, if the hand moved more than a certain distance from the prism, or if the goal was achieved by lifting the prism above a target height.
Stack
For the stack concept, an episode would end early if the prism moved too far from the cube, if the prism touched the ground, or if the goal was achieved by lining the prism up with the cube and bringing them into contact.
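The early-termination rules above can be written as simple per-step predicates. In the sketch below, only the 15-degree tip limit and the 1.5 cm pinch height come from the text; every other threshold is an illustrative placeholder:

```python
# Early-termination predicates following Appendix B. Only the 15-degree
# tip limit and the 1.5 cm pinch height are stated in the text; the
# remaining thresholds are illustrative placeholders.

def orient_terminal(hand_prism_dist, prism_tilt_deg, aligned, pinch_height,
                    max_dist=0.3):
    if hand_prism_dist > max_dist:
        return "fail"       # hand drifted too far from the prism
    if prism_tilt_deg > 15.0:
        return "fail"       # prism tipped more than 15 degrees
    if aligned and abs(pinch_height - 0.015) < 0.005:
        return "success"    # fingers aligned, pinch point ~1.5 cm above
    return "continue"

def lift_terminal(xy_drift, hand_prism_dist, prism_height,
                  cylinder_radius=0.05, max_dist=0.3, target_height=0.1):
    if xy_drift > cylinder_radius:
        return "fail"       # prism left the virtual cylinder
    if hand_prism_dist > max_dist:
        return "fail"       # hand drifted too far from the prism
    return "success" if prism_height > target_height else "continue"
```

Curtailing episodes this way avoids spending samples on states the concept can no longer recover from, which is the sample-efficiency benefit noted in Section 5.3.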