We are interested in devising a new approach to reinforcement learning to solve multiple tasks in a single environment, learning separately the dynamics of the environment and the definitions of the goals in it.
A simple example would be a 2d maze, where there could be two different sets of tasks: reach a given horizontal coordinate, or reach a given vertical coordinate. Learning the spatial configuration of the maze would be useful for both sets of tasks.
Another example would be the robotic arm in the MuJoCo simulator (Todorov et al., 2012) proposed by Andrychowicz et al. (2017). While tasks in that simulator can differ, they all rely on the underlying physical dynamics of the robotic arm.
Our approach consists in segmenting the model into:
a task-agnostic planning quasi-metric that estimates the minimum expected number of steps necessary to go from any state to any state (see § 2.2), and
a series of task-specific planners that estimate, given the current state and a task-specific goal, what target state the agent should aim for (see § 2.3).
These models are trained concurrently. The key value of this approach is that the quasi-metric is shared across tasks, which for instance enables transfer learning, as demonstrated in the experimental section (§ 3).
The idea of a quasi-metric between states is the natural extension of recent works, starting with Universal Value Function Approximators (Schaul et al., 2015), which introduced the notion that the reward function can be learned without a single privileged goal, and extended by Hindsight Experience Replay (Andrychowicz et al., 2017), which introduced the idea that goals do not have to be pre-defined but can be picked arbitrarily. Combined with a constant negative reward, this leads naturally to a metric in which states and goals play more symmetric roles, departing from the historical and classical idea of accumulated reward.
The long-term motivation of our approach is twofold: First, we want to segment the policy of an agent into a life-long learned quasi-metric, and a collection of task-specific easy-to-learn planners. These planners would be related to high-level imperatives for a biological system, triggered by low-level physiological necessities (“eat”, “get warmer”, “reproduce”, “sleep”), and high-level operations for a robot (“recharge battery”, “pick up boxes”, “patrol”, etc.). Second, the central role of a metric where the heavy lifting takes place provides a powerful framework to develop hierarchical planning, curiosity strategies, estimators of performance, etc.
Let $\mathcal{S}$ be the state space and $\mathcal{A}$ the action space. We call a goal a subset of the state space, and a task a set of goals. Many tasks can be defined in a given environment with the same state and action spaces. Note that in the environments we consider, the concrete definition of a task is a subset of the state vector's coordinates, and a goal is defined by the target values for these coordinates.
In the robotic-arm environment, for instance, the state concatenates, among others, the position and velocity of the arm and the location of the object to manipulate. Examples of tasks could be “reach a certain position”, in which case a goal is a set of states parameterized by a 3d position, where the position of the arm handle is fixed but all other degrees of freedom are left free; “reach a certain speed”, where everything is left unconstrained but the handle's speed; “put the object at the left side of the table”, where everything is free but one coordinate of the object's location; and so on.
For what follows, we also let $s_1, a_1, r_1, s_2, a_2, r_2, \dots$ be a state / action / reward sequence.
2.1 Q-learning and Hindsight Experience Replay
Given a discount factor $\gamma$, a standard Q-learning algorithm (Watkins, 1989) aims at learning a value function of the form $Q : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ that should provide an estimate of the maximum [expected] reward that can be accumulated when starting from state $s$ and doing action $a$. This is achieved by iteratively computing $Q$ through updates minimizing
$$\Big( Q(s_t, a_t) - \big( r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) \big) \Big)^2,$$
where $\hat{Q}$ is a “target model”, usually updated less frequently or through a stabilizing moving average (Hasselt, 2010).
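As a concrete illustration, this update can be sketched in tabular form. The dict-based Q-tables, the learning rate `alpha`, and the value of `gamma` below are our assumptions for a toy setting, not the paper's setup, which uses neural approximators:

```python
def q_update(Q, Q_target, s, a, r, s_next, actions, gamma=0.98, alpha=0.1):
    """One step shrinking the squared error between Q(s, a) and the
    bootstrap target r + gamma * max_a' Q_target(s_next, a')."""
    target = r + gamma * max(Q_target[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

With a neural regressor, the same target would instead appear inside a squared-error loss minimized by SGD.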
The Universal Value Function Approximators (Schaul et al., 2015) parameterize the Q-function with a goal, and the Hindsight Experience Replay (Andrychowicz et al., 2017) combines this approach with additional goals sampled along visited trajectories, and a constant negative reward.
The goal-dependent Q value $Q(s, a, g)$ is updated, given a goal $g$, to minimize
$$\Big( Q(s_t, a_t, g) - \big( r_t + \gamma \max_a \hat{Q}(s_{t+1}, a, g) \big) \Big)^2.$$
As noted by Eysenbach et al. (2019), setting an undiscounted ($\gamma = 1$) constant negative reward, except when the goal is reached, as in equation 6, makes the resulting accumulated reward the [opposite of the] distance to the goal in number of steps.
And by considering arbitrary goals, the model actually embodies a distance between any state/goal pair, which is conceptually very similar to a state/state metric.
2.2 Planning Quasi-Metric
Similarly to the distance between states proposed by Eysenbach et al. (2019), we explicitly introduce an action-parameterized quasi-metric $d : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}^{+}$ such that $d(s, a, s')$ is “the minimum [expected] number of steps to go from $s$ to $s'$ when starting with action $a$”.
We stress that it is a quasi-metric since it is not symmetric in most actual planning setups. Consider for instance one-way streets for an autonomous urban vehicle, irreversible physical transformations, or inertia in a robotic task, any of which may make going from $s$ to $s'$ easy and the reciprocal transition difficult.
Given an arbitrary target state $s'$, the update of $d$ should minimize
$$\big( d(s_t, a_t, s_{t+1}) - 1 \big)^2 + \Big( d(s_t, a_t, s') - \big( 1 + \min_a \hat{d}(s_{t+1}, a, s') \big) \Big)^2,$$
where the first term makes the quasi-metric equal to one between successive states, and the second makes it globally consistent with the best policy, following Bellman's equation. As in equation 3, $\hat{d}$ is a “target model”, usually updated less frequently or through a stabilizing moving average.
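A minimal tabular sketch of this update (the paper trains a neural critic with SGD; the dict-based tables and the learning rate `alpha` are assumptions for a toy discrete setting):

```python
def pqm_update(d, d_target, s, a, s_next, s_goal, actions, alpha=0.1):
    """Shrink d(s, a, s_goal) toward 1 (one step) plus, if s_next is not the
    target state, the best remaining distance under the target model."""
    if s_next == s_goal:
        target = 1.0
    else:
        target = 1.0 + min(d_target[(s_next, b, s_goal)] for b in actions)
    d[(s, a, s_goal)] += alpha * (target - d[(s, a, s_goal)])
    return d[(s, a, s_goal)]
```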
We implement the learning of the PQM with a standard actor/critic structure: the PQM itself plays the role of the critic, and the actor is either an explicit argmin over the actions in the case of a finite action set, or a model $a : \mathcal{S} \times \mathcal{S} \rightarrow \mathcal{A}$ approximating $\operatorname{argmin}_{\alpha} d(s, \alpha, s')$ when dealing with a continuous action space.
For training, given a tuple $(s_t, a_t, s_{t+1}, s')$, we update $d$ to reduce the quasi-metric loss above, and we update $a$ to reduce
$$d\big(s_t, a(s_t, s'), s'\big),$$
so that $a$ gets closer to the choice of action at $s_t$ that minimizes the remaining distance to $s'$.
Note that while the quasi-metric allows reaching a certain state, by choosing at any moment the action that decreases the distance to it the most, it does not allow reaching a more abstract “goal”, defined as a set of states. This objective is not trivial: the two objects are defined at completely different scales, the latter possibly ignoring virtually all the degrees of freedom of the former.
Hence, to use the PQM to actually reach goals, a key element is missing: picking the “ideal state” that (1) is in the goal but also (2) is the easiest to reach from the state currently occupied. For this purpose we introduce the idea of a planner (see figure 1), a mapping $f$ such that $f(s, g)$ is the “best” target state, that is the state in $g$ closest to $s$:
$$f(s, g) = \operatorname*{argmin}_{s' \in g} \, \min_a d(s, a, s').$$
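For a small discrete state space, this objective can be solved exhaustively. A sketch, assuming a tabular quasi-metric stored as a dict (the paper instead trains an approximate parameterized planner):

```python
def brute_force_planner(s, goal_states, d, actions):
    """Return the state in the goal set minimizing min_a d(s, a, s')."""
    return min(goal_states, key=lambda t: min(d[(s, a, t)] for a in actions))
```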
The key notion in this formulation is that we can have multiple planners, dedicated to as many goal spaces, that all utilize the same quasi-metric, which is in charge of the heavy lifting of “understanding” the underlying dynamics of the environment.
We follow the idea of the actor for the action choice, and do not implement the planner by explicitly solving the problem of equation 14, but introduce a parameterized model trained to approximate its solution.
For training, given a pair $(s, g)$, we update $f$ to reduce
$$d\big(s, a(s, f(s, g)), f(s, g)\big) \,+\, \lambda \, \Delta\big(f(s, g), g\big) \,+\, \mu \, \Phi\big(f(s, g)\big).$$
The first term is an estimate of the objective of problem (14), that is the distance between $s$ and $f(s, g)$, where the actor's prediction $a(s, f(s, g))$ plays the role of the $\min_a$ of the original problem.
The second term is a penalty replacing the hard constraint of (14) with a distance $\Delta$ to the goal set. In practice, that distance is a norm over a subset of the state's coordinates. We come back to this in more detail in § 3.
The third term is a penalty imposing the validity of the state, for instance ensuring that speeds or angles remain in valid ranges.
The resulting policy combines the actor and the planner: given the current state $s$ and the goal $g$, the chosen action is $a(s, f(s, g))$.
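The composition of the two models can be sketched as follows, using the explicit argmin actor of the finite-action case; the function names and the tabular quasi-metric are illustrative assumptions:

```python
def greedy_actor(s, target, d, actions):
    """Explicit argmin over a finite action set (discrete case of § 2.2)."""
    return min(actions, key=lambda a: d[(s, a, target)])

def policy(s, g, planner, d, actions):
    """Combined policy: head for the planner's target state f(s, g)."""
    return greedy_actor(s, planner(s, g), d, actions)
```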
We have validated our approach experimentally in PyTorch (Paszke et al., 2019) on two standard environments: the bit-flipping problem (see § 3.1), known to be particularly challenging for traditional RL approaches relying on sparse rewards, and the MuJoCo simulator (see § 3.2), which exhibits some key difficulties of real robotic tasks. Our software to reproduce the experiments will be available under an open-source license at the time of publication.
As observed by many other practitioners, deep learning in general, and reinforcement learning in particular, require heavy optimization of meta-parameters, both related to the regressors' architectures (number of layers, layer size, non-linearities, etc.) and to the optimization itself (batch size, sampling strategy, SGD parameterization, etc.). This translates into large computational budgets to obtain an optimal setup and performance. The experimental results presented in this section were obtained with roughly 250 vCPU cores for one month, which is far less than the requirements of some state-of-the-art results. This forced us to only coarsely adapt configurations optimized in previous works for more classical, and consequently quite different, approaches.
3.1.1 Environment and tasks
The state space for this first environment is a Boolean vector of $n$ bits, and there are $n$ actions, each flipping one particular bit of the state. We fix $n$ and define two tasks, corresponding to reaching a target configuration for respectively the first $k$ bits and the last $k$ bits. A goal in these tasks is defined by the target configuration of these $k$ bits.
The difficulty of this environment lies in the cardinality of the state space, the lack of geometrical structure, and the average time it takes to go from any state to any other state under a random policy.
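A minimal sketch of such an environment follows; the class name, interface, and seeding are our conventions, not the paper's:

```python
import random

class BitFlipEnv:
    """n Boolean bits; action i flips bit i of the state."""
    def __init__(self, n, seed=0):
        self.n = n
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # Start from a uniformly random bit configuration.
        self.state = tuple(self.rng.randint(0, 1) for _ in range(self.n))
        return self.state

    def step(self, action):
        bits = list(self.state)
        bits[action] ^= 1  # flip exactly one bit
        self.state = tuple(bits)
        return self.state
```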
To demonstrate the transferability of the quasi-metric, we also consider a task defined by the first 15 bits, with transfer from a task defined by the last 15 bits. After training an agent on one, we train a new agent on the other, but keep the parameters of the quasi-metric and the actor.
3.1.2 Network architectures and training
The critic is implemented as a ReLU MLP with input units corresponding to the concatenation of the two states, one hidden layer, and one output per action. As indicated in § 2.2, the actor for this environment is an explicit argmin over the actions. The planner is also implemented as a ReLU MLP, with input units corresponding to the concatenation of a state and a goal definition, one hidden layer, and output neurons with a final sigmoid non-linearity.
The length of an episode is equal to the number of bits in the goal, which is twice the median of the optimal number of actions.
We kept the meta-parameters as selected by Plappert et al. (2018, appendix B). There is no term imposing the validity of the planner output, hence no corresponding penalty coefficient.
Following algorithm 1, we train for several epochs, each consisting of running the policy for a number of episodes and then performing optimization steps on minibatches sampled uniformly from a replay buffer of transitions. We use the “future” strategy of HER for the selection of goals (Andrychowicz et al., 2017). We update the target networks after every optimization step using a moving-average decay coefficient.
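The “future” relabeling strategy can be sketched as follows; the function name is ours, and `k=4` is the value commonly used in the HER paper, assumed here:

```python
import random

def her_future_goals(trajectory, t, k=4, rng=random):
    """Sample up to k substitute goals among the states visited strictly
    after time t in the same episode (HER's "future" strategy)."""
    future = trajectory[t + 1:]
    return [rng.choice(future) for _ in range(k)] if future else []
```

Each sampled state is then treated as if it had been the goal, turning a failed episode into several successful synthetic ones.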
The experiments in the bit-flip environment show the advantage of using a planning quasi-metric. As shown in figure 2, the training is successful, and the combination of the quasi-metric and the planner results in a policy similar to that of the standard DQN, both in terms of success rate and of time to goal. It also appears that, while this model is slightly harder to train on a single task than DQN, it provides a great performance boost when transferring the PQM between tasks: since the quasi-metric is pre-trained, the training process only needs to learn a planner, which is a simpler object.
It is noteworthy that, due to limited computational means, we essentially kept the meta-parameters of the DQN setup of Andrychowicz et al. (2017), which were heavily optimized for a different architecture; as such, the comparison is biased in DQN's favor.
Figure 3 gives a clearer view of the accuracy of the metric alone. After training, we computed the value of $\min_a d(s, a, s')$ for pairs of starting / target states taken at random, and compared it to the “true” distance, which in this environment happens to be the Hamming distance, that is the number of bits that differ between the two states.
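The ground-truth distance used for this comparison is simply:

```python
def hamming(s, t):
    """Number of differing bits; in the bit-flip environment each differing
    bit costs exactly one flip action, so this is the true distance."""
    return sum(x != y for x, y in zip(s, t))
```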
For the tasks in this environment, the planner predicts a target state whose bits that matter for the task are set to the goal configuration, while the others are unchanged from the starting state, since this corresponds to the shortest path. Hence we considered two groups of state pairs: either “in task”, which means that the two states are consistent with the planner's prediction in the task and differ only on the bits that matter for the task, or “random”, in which case they are arbitrary, and hence may be inconsistent with the biased statistics observed during training.
The results show that the estimate of the quasi-metric is very accurate on the first group, less so on the second, but still strongly monotonic. This is consistent with the transfer providing a substantial boost to the training on a new task.
3.2.1 Environment and tasks
For our second set of experiments, we use the “Fetch” environments of OpenAI Gym (Brockman et al., 2016), which use the MuJoCo physics engine (Todorov et al., 2012), are pictured in figure 4, and are described in detail by Plappert et al. (2018, section 1.1). Unless specified otherwise, the details of our experiments are the same as indicated by Plappert et al. (2018, section 1.4) and Andrychowicz et al. (2017, appendix A).
We consider two tasks: “push”, where a box is placed at random on the table and the robot’s objective is to move it to a desired location also on the table, without using the gripper, and “pick and place”, in which the robot can control its gripper, and the desired location for the box may be located above the table surface.
To demonstrate the transferability of the quasi-metric in this environment, we also consider the “push” task with transfer from “pick and place”: after training an agent on the latter, we train a new agent on the former, but initialize the parameters of the quasi-metric and the actor with the values obtained with the previous agent.
3.2.2 Network architectures and training
In what follows, let $d_s$ be the dimension of the state space, $d_a$ the dimension of the action space, and $d_g$ the dimension of the goal parameter, which corresponds to the desired spatial location of the manipulated box.
The critic is implemented with a ReLU MLP with $2 d_s + d_a$ input units, corresponding to the concatenation of two states and an action, three hidden layers, and a single output unit. The actor is a ReLU MLP with $2 d_s$ input units, three hidden layers, and $d_a$ output units with a tanh non-linearity. As Plappert et al. (2018), we also add a penalty to the actor's loss equal to the square of the output layer pre-activations. Finally, the planner is a ReLU MLP with $d_s + d_g$ input units, three hidden layers, and $d_s$ output units.
Following algorithm 1, we train each task for a fixed number of epochs. As in Plappert et al. (2018), each epoch consists of cycles, each cycle consisting of running the policy for a number of episodes and then performing optimization steps on minibatches sampled uniformly from a replay buffer of transitions. We use the “future” strategy of HER for the selection of goals (Andrychowicz et al., 2017), and update the target networks after every cycle using a moving-average decay coefficient.
We kept the meta-parameters as selected in Plappert et al. (2018, appendix B), with an additional grid search over the number of hidden neurons in the actor and critic models and over the penalty coefficients of the planner loss. For each combination, we trained a policy on the “push” task, and eventually selected the combination with the highest rolling median success rate across epochs.
The results obtained in this environment confirm the observations from the bit-flip environment. Figure 5 shows that the joint training of the PQM and the planners works properly and results in a policy similar to that obtained with the DDPG approach. Performance is slightly lower, in part due to the limited meta-optimization we could afford, which favors DDPG, and in part due to the difficulty of learning the metric, which is a more complex function.
Figures 4 and 6 show the advantage of using the PQM to transfer knowledge from one task to another. Even though the two tasks are quite different, one using the gripper and moving the object through space while holding it, and the other moving it only by contact in the plane, the quasi-metric provides an initial boost in training by providing the ability to position the arm.
Finally, figure 7 shows that the estimate of the quasi-metric accurately reflects the actual distance to the goal state.
4 Related works
The standard Q-learning approach to reinforcement learning consists of learning a policy implicitly through an estimator of the value of state-action pairs, defined as the expected accumulated reward obtained by doing the said action in the said state, and following an optimal policy thereafter (Watkins, 1989). Such an approach has proven extremely effective, in particular when combined with modern deep architectures as value approximators (Mnih et al., 2015). It has been further improved by duplicating the model during optimization to avoid over-estimating state values (van Hasselt et al., 2015).
The main weakness of such classical methods is the need for large training sets, due to the complexity of the model to learn and the sparsity of the reward. The latter point has been tackled recently by learning goal-conditioned policies (Schaul et al., 2015; Andrychowicz et al., 2017), where arbitrary observed states may be considered as “synthetic” goals. Doing so leverages a richer structure, where regularities of the environment observed along trajectories to synthetic goals can be transferred to reach the actual goals of interest.
Interestingly, this idea of goal-conditioned policies is combined with a constant negative reward at every time step, which results in an accumulated reward having the form of the [opposite of the] distance to the goal. Eysenbach et al. (2019) explicitly consider a distance between states, rather than from state to goal, and from there leverage the set of observed states as vertices in a graph that approximates the geodesic distance in the state manifold.
A natural approach to cope with the sparsity of the reward and the limited number of examples is to leverage structures learned on other tasks. This idea of multi-task learning is not specific to reinforcement learning, and aims at solving the same problem of data scarcity. A very straightforward approach is to train a single model with multiple “heads”, each providing a task-specific prediction (Caruana, 1998).
The same idea of transferring models has been applied to reinforcement learning (Taylor and Stone, 2009), with recent successes using a single model that mimics specialized expert actors on individual tasks (Parisotto et al., 2016). The key issue of bringing several models to a common representation is tackled by normalizing the contributions of the different tasks (Hessel et al., 2018). This is in contrast with our proposal, which explicitly leverages being in a similar environment. This corresponds in our view to a more realistic robotic setup, in which the embodiment is fixed, and it makes the shared knowledge explicit.
We have proposed to address the action selection problem by modeling separately a quasi-metric between states, and the estimation of a target state given a goal. Experiments show that this approach is valid, and that these two models can be trained jointly to get an efficient policy. As also illustrated in the experiments, the core advantage is that this decomposition moves the bulk of the modeling to the quasi-metric, which can be trained across tasks, with a dense feedback from the environment.
This model is fundamentally multi-task. The quasi-metric captures more than is necessary for any single task, and requires both a more complex model and more expensive training, as seen in the experiments. However, since it can be shared across tasks, it provides a substantial boost when learning a new task, for which only the planner that estimates the target state for a given goal has to be trained, with minor fine-tuning of the quasi-metric.
By disentangling two very different aspects of planning, this decomposition is very promising for future extensions. The quasi-metric handles the difficulty of learning a global structure known only through local interactions, but is potentially amenable to the triangle inequality, clustering methods, and dimension reduction. The planners are easier to learn, and may be improved with a specific class of regressors taking advantage of a coarse-to-fine structure: your final destination can be initially coarsely defined and refined along your way.
The metric structure over the state space opens the way to more complex planning approaches, where intermediate states to reach are “imagined” with an explicit path-search algorithm, combining tree-search and deep-learning estimates, in a way similar to MCTS for Go (Silver et al., 2016; Schrittwieser et al., 2019).
Finally, it is also promising for deriving quantities related to the remaining uncertainty of the metric estimates, which can possibly be leveraged as a curiosity measure (Pathak et al., 2017).
- Andrychowicz et al. (2017). Hindsight experience replay. CoRR abs/1707.01495.
- Brockman et al. (2016). OpenAI Gym. CoRR abs/1606.01540.
- Caruana (1998). Multitask learning. In Learning to Learn, pp. 95–133.
- Eysenbach et al. (2019). Search on the replay buffer: bridging planning and reinforcement learning. CoRR abs/1906.05253.
- Hasselt (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23, pp. 2613–2621.
- Hessel et al. (2018). Multi-task deep reinforcement learning with PopArt. CoRR abs/1809.04474.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- Parisotto et al. (2016). Actor-mimic: deep multitask and transfer reinforcement learning. In International Conference on Learning Representations (ICLR).
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NeurIPS), pp. 8024–8035.
- Pathak et al. (2017). Curiosity-driven exploration by self-supervised prediction. CoRR abs/1705.05363.
- Plappert et al. (2018). Multi-goal reinforcement learning: challenging robotics environments and request for research. CoRR abs/1802.09464.
- Schaul et al. (2015). Universal value function approximators. In International Conference on Machine Learning (ICML), Vol. 37, pp. 1312–1320.
- Schrittwieser et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. CoRR abs/1911.08265.
- Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503.
- Taylor and Stone (2009). Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research (JMLR) 10, pp. 1633–1685.
- Todorov et al. (2012). MuJoCo: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026–5033.
- van Hasselt et al. (2015). Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461v3.
- Watkins (1989). Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge, UK.