1 Introduction
We are interested in devising a new approach to reinforcement learning that solves multiple tasks in a single environment, and learns separately the dynamics of the environment and the definitions of the goals in it.
A simple example would be a 2d maze with two different sets of tasks: reach a given horizontal coordinate, or reach a given vertical coordinate. Learning the spatial configuration of the maze would be useful for both sets of tasks.
Another example would be the robotic arm in the MuJoCo simulator (Todorov et al., 2012) proposed by Andrychowicz et al. (2017). While tasks in that simulator can be different, they all rely on the underlying physical dynamics of the robotic arm.
Our approach consists of segmenting the model into:

a task-agnostic planning quasimetric that estimates the minimum expected number of steps necessary to go from any state to any other (see § 2.2), and

a series of task-specific planners that estimate, given the current state and a task-specific goal, what target state the agent should aim for (see § 2.3).
These models are trained concurrently. The key value of this approach is to share the quasimetric across tasks, for instance to enable transfer learning, as demonstrated in the experimental section (§ 3).

The idea of a quasimetric between states is the natural extension of recent works, starting with the Universal Value Function Approximators (Schaul et al., 2015), which introduced the notion that learning the value function can be done without a single privileged goal, and then Hindsight Experience Replay (Andrychowicz et al., 2017), which introduced the idea that goals do not have to be predefined but can be picked arbitrarily. Combined with a constant negative reward, this leads naturally to a metric where state and goal play more symmetric roles, which departs from the historical and classical idea of accumulated reward.
The long-term motivation of our approach is twofold: first, we want to segment the policy of an agent into a lifelong-learned quasimetric and a collection of task-specific, easy-to-learn planners. These planners would correspond to high-level imperatives for a biological system, triggered by low-level physiological necessities (“eat”, “get warmer”, “reproduce”, “sleep”), and to high-level operations for a robot (“recharge battery”, “pick up boxes”, “patrol”, etc.). Second, the central role of a metric, where the heavy lifting takes place, provides a powerful framework to develop hierarchical planning, curiosity strategies, estimators of performance, etc.
2 Method
Let $\mathcal{S}$ be the state space and $\mathcal{A}$ the action space. We call a goal a subset $g \subset \mathcal{S}$ of the state space, and a task a set of goals $\mathcal{G}$. Many tasks can be defined in a given environment with the same state and action spaces. Note that in the environments we consider, the concrete definition of a task is a subset of the state vector coordinates, and a goal is defined by the target values for these coordinates.
Consider the robotic arm of the MuJoCo simulator (Todorov et al., 2012) that we use for experiments in § 3.2: the state space concatenates, among others, the position and velocity of the arm and the location of the object to manipulate. Examples of tasks could be “reach a certain position”, in which case a goal is a set of states parameterized by a 3d position, where the position of the arm handle is fixed but all other degrees of freedom are left free; “reach a certain speed”, where everything is left unconstrained but the handle’s speed; “put the object at the left side of the table”, where everything is free but one coordinate of the object location; and so on.
For what follows, we also let

$s_0, a_0, r_0, s_1, a_1, r_1, \dots$   (1)

be a state / action / reward sequence.
2.1 Q-learning and Hindsight Experience Replay
Given a discount factor $\gamma$, a standard Q-learning algorithm (Watkins, 1989) aims at learning an action-value function of the form

$Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$   (2)
that should provide an estimate of the maximum [expected] reward that can be accumulated when starting from state $s$ by doing action $a$. This is achieved by iteratively improving $Q$ through updates minimizing

$\Big( Q(s_t, a_t) - \big( r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a) \big) \Big)^2$   (3)
where $\hat{Q}$ is a “target model”, usually updated less frequently or through a stabilizing moving average (Hasselt, 2010).
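As an illustration, this update can be sketched in a minimal tabular form (a hypothetical helper, not the paper's implementation; a plain dict stands for $Q$, and the model serves as its own target, which is the tabular special case):

```python
def q_update(Q, s, a, r, s_next, actions, gamma=0.98, lr=0.5, done=False):
    """One tabular Q-learning step, reducing the squared TD error of
    equation (3). In this tabular special case the "target model" is
    simply Q itself."""
    if done:
        target = r
    else:
        target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + lr * (target - old)
    return Q[(s, a)]
```

Repeating this update over the transitions of a trivial two-step chain makes the stored values converge to the discounted returns.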
The Universal Value Function Approximators (Schaul et al., 2015) parameterize the Q-function with a goal, and Hindsight Experience Replay (Andrychowicz et al., 2017) combines this approach with additional goals sampled along visited trajectories, and a constant negative reward.
The goal-dependent Q value

$Q : \mathcal{S} \times \mathcal{A} \times \mathcal{G} \to \mathbb{R}$   (4)
is updated, given a goal $g$, to minimize

$\Big( Q(s_t, a_t, g) - \big( r_t(g) + \gamma \max_{a} \hat{Q}(s_{t+1}, a, g) \big) \Big)^2$   (5)
where

$r_t(g) = \begin{cases} 0 & \text{if } s_{t+1} \in g, \\ -1 & \text{otherwise.} \end{cases}$   (6)
As noted by Eysenbach et al. (2019), setting an undiscounted ($\gamma = 1$) constant negative reward, except when the goal is reached, as in equation 6, makes the resulting accumulated reward the [opposite of the] distance to the goal in number of steps.
And by considering arbitrary goals, the model actually embodies a distance between any state/goal pair, which is conceptually very similar to a state/state metric.
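To make the reward structure of equation 6 concrete, the following sketch (hypothetical helpers; goals are represented as predicates on states) shows that with $\gamma = 1$ the accumulated reward of a trajectory is minus the number of steps spent before reaching the goal:

```python
def her_reward(s_next, goal):
    """Constant negative reward of equation (6): 0 once the goal set is
    reached, -1 otherwise."""
    return 0.0 if goal(s_next) else -1.0

def undiscounted_return(states, goal):
    """With gamma = 1, the accumulated reward is the opposite of the
    number of steps taken before the goal is reached."""
    total = 0.0
    for s in states:
        total += her_reward(s, goal)
        if goal(s):
            break
    return total
```

A trajectory that reaches the goal on its third visited state thus accumulates a reward of -2.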
2.2 Planning Quasi-Metric
Similarly to the distance between states proposed by Eysenbach et al. (2019), we explicitly introduce an action-parameterized quasimetric

$D : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{+}$   (7)

such that $D(s, a, s')$ is “the minimum [expected] number of steps to go from $s$ to $s'$ when starting with action $a$”.
We stress that it is a quasimetric, since it is not symmetric in most actual planning setups. Consider for instance one-way streets for an autonomous urban vehicle, irreversible physical transformations, or inertia in a robotic task, any of which may make going from $s$ to $s'$ easy and the reciprocal transition difficult.
Given an arbitrary target state $s'$, the update of $D$ should minimize

$\big( D(s_t, a_t, s_{t+1}) - 1 \big)^2 + \Big( D(s_t, a_t, s') - \big( 1 + \min_{a} \hat{D}(s_{t+1}, a, s') \big) \Big)^2$   (8)

where the first term drives the quasimetric to 1 between successive states, and the second makes it globally consistent with the best policy, following Bellman’s equation. As in equation 3, $\hat{D}$ is a “target model”, usually updated less frequently or through a stabilizing moving average.
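For a small discrete problem, this update can be sketched in tabular form (a hypothetical helper, not the paper's implementation; the two terms of equation 8 are merged into a single regression target, and there is no separate target model):

```python
def pqm_update(D, s, a, s_next, s_target, actions, lr=0.5):
    """Tabular sketch of the quasimetric update of equation (8): the
    distance estimate regresses toward 1 when the observed transition
    reaches the target, and toward 1 + min_a' D(s_next, a', s_target)
    otherwise."""
    if s_next == s_target:
        target = 1.0
    else:
        target = 1.0 + min(D.get((s_next, b, s_target), 0.0) for b in actions)
    key = (s, a, s_target)
    old = D.get(key, 0.0)
    D[key] = old + lr * (target - old)
    return D[key]
```

On a three-state chain with a single forward action, repeated sweeps make the stored values converge to the true step counts (1 and 2 steps to the end of the chain).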
We implement the learning of the PQM with a standard actor/critic structure: first the PQM itself, which plays the role of the critic

$D : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{+}$   (9)
and an actor, which is either an explicit $\operatorname{arg\,min}_a$ in the case of a finite set of actions, or a model

$\pi : \mathcal{S} \times \mathcal{S} \to \mathcal{A}$   (10)

trained to approximate $\operatorname{arg\,min}_a D(s, a, s')$ when dealing with a continuous action space.
For training, given a tuple $(s_t, a_t, s_{t+1}, s')$, we update $D$ to reduce

$\big( D(s_t, a_t, s_{t+1}) - 1 \big)^2 + \Big( D(s_t, a_t, s') - \big( 1 + \hat{D}(s_{t+1}, \hat{\pi}(s_{t+1}, s'), s') \big) \Big)^2$   (11)
and we update $\pi$ to reduce

$D\big(s_t, \pi(s_t, s'), s'\big)$   (12)
so that $\pi$ gets closer to the choice of action at $s_t$ that minimizes the remaining distance to $s'$.
2.3 Planner
Note that while the quasimetric allows the agent to reach a certain state by choosing at any moment the action that decreases the distance to it the most, it does not allow it to reach a more abstract “goal”, defined as a set of states. This objective is not trivial: the two objects are defined at completely different scales, the latter possibly ignoring virtually all the degrees of freedom of the former.

Hence, to use the PQM to actually reach goals, a key element is missing: picking the “ideal state” that (1) is in the goal, but also (2) is the easiest to reach from the state currently occupied. For this purpose we introduce the idea of a planner (see figure 1)
$f : \mathcal{S} \times \mathcal{G} \to \mathcal{S}$   (13)

such that $f(s, g)$ is the “best” target state, that is, the state in $g$ closest to $s$:

$f(s, g) = \operatorname*{arg\,min}_{s' \in g} \; \min_{a} D(s, a, s')$   (14)
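For a small discrete problem, the minimization of equation 14 can be carried out exhaustively; a sketch (hypothetical helper, with the quasimetric stored as a dict keyed by (state, action, target)):

```python
def plan(s, goal_states, D, actions):
    """Exhaustive planner of equation (14): return the state of the goal
    set that is closest to s under the quasimetric, minimizing over the
    first action as well."""
    def dist(s_tgt):
        return min(D.get((s, a, s_tgt), float("inf")) for a in actions)
    return min(goal_states, key=dist)
```

The parameterized planner introduced below replaces this exhaustive search, which is intractable for large goal sets.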
The key notion in this formulation is that we can have multiple planners dedicated to as many goal spaces, all utilizing the same quasimetric, which is in charge of the heavy lifting of “understanding” the underlying dynamics of the environment.
We follow the idea of the actor for the action choice, and do not implement the planner by explicitly solving the minimization of equation 14; instead, we introduce a parameterized model

$f_{\theta} : \mathcal{S} \times \mathcal{G} \to \mathcal{S}$   (15)
For training, given a pair $(s, g)$, we update $f_{\theta}$ to reduce

$D\big(s, \pi(s, f_{\theta}(s, g)), f_{\theta}(s, g)\big) \;+\; \alpha \, \Delta\big(f_{\theta}(s, g), g\big) \;+\; \beta \, \rho\big(f_{\theta}(s, g)\big)$   (16)
The first term is an estimate of the objective of problem (14), that is, the distance between $s$ and $f(s, g)$, where the actor’s prediction plays the role of the $\min_a$ of the original problem.
The second term is a penalty replacing the hard constraint of (14) that the prediction belongs to $g$, with a distance to that set. In practice that distance is a norm over a subset of the state’s coordinates; we come back to this with more details in § 3.
The third term is a penalty for imposing the validity of the state, for instance ensuring that speed or angles remain in valid ranges.
The resulting policy combines the actor $\pi$ and the planner $f$. Given the current state $s$ and the goal $g$, the chosen action is $\pi(s, f(s, g))$.
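Putting the pieces together, the resulting policy can be sketched as a greedy control loop (all names hypothetical; `step` stands for the environment transition, `planner` and `actor` for the two learned models, and the goal is given as a predicate on states):

```python
def rollout(s0, goal, planner, actor, step, max_steps=50):
    """Greedy rollout with the combined policy a = pi(s, f(s, g)): at
    every state the planner picks a target state for the goal, and the
    actor picks the action that moves toward it."""
    s, path = s0, [s0]
    for _ in range(max_steps):
        s = step(s, actor(s, planner(s, goal)))
        path.append(s)
        if goal(s):
            break
    return path
```

With a toy integer environment where the planner always aims at 5 and the actor steps by ±1, the rollout walks straight to the goal.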
3 Experiments
We have validated our approach experimentally in PyTorch (Paszke et al., 2019), on two standard environments: the bit-flipping problem (see § 3.1), known to be particularly challenging for traditional RL approaches relying on sparse rewards, and the MuJoCo simulator (see § 3.2), which exhibits some key difficulties of real robotic tasks. Our software to reproduce the experiments will be available under an open-source license at the time of publication.
As observed by many other practitioners, deep learning in general, and reinforcement learning in particular, require heavy optimization of metaparameters, related both to the regressors’ architectures (number of layers, layer sizes, non-linearities, etc.) and to the optimization itself (batch size, sampling strategy, SGD parameterization, etc.). This translates into large computational budgets to obtain an optimal setup and performance. The experimental results presented in this section were obtained with roughly 250 vCPU cores for one month, which is far less than the requirements of some state-of-the-art results. This forced us to only coarsely adapt configurations optimized in previous works for more classical, and consequently quite different, approaches.
Figure 2: One standard deviation confidence interval of the success rate (left) and time to goal (right) on the bit-flip task (see § 3.1.1) for three different algorithms: a standard deep Q-learning (DQN), our approach combining a planning quasimetric with a planner (PQM), and the same with transfer of the quasimetric trained on a task where the bits to match to reach the goals were different (PQM w/ transfer).

3.1 Bit-flip
3.1.1 Environment and tasks
The state space for this first environment is the set of Boolean vectors of $n$ bits, and there are $n$ actions, each switching one particular bit of the state.
We fix $n$ and define two tasks, corresponding to reaching a target configuration respectively for the first 15 bits and for the last 15 bits. A goal in these tasks is defined by the target configuration of those bits.
The difficulty of this environment lies in the cardinality of the state space, the lack of geometrical structure, and the average time it takes to go from any state to any other under a random policy.
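The environment itself can be sketched in a few lines (hypothetical helpers; states are tuples of bits, and the optimal number of actions between two states is their Hamming distance, since each differing bit requires exactly one flip):

```python
def flip(state, i):
    """Bit-flip environment: action i switches bit i of the Boolean state."""
    return state[:i] + (1 - state[i],) + state[i + 1:]

def hamming(s, t):
    """Minimum number of actions to go from s to t: one flip per
    differing bit."""
    return sum(a != b for a, b in zip(s, t))
```

This Hamming distance is the ground-truth quasimetric against which the learned estimate is compared in § 3.1.3.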
To demonstrate the transferability of the quasimetric, we also consider a task defined by the 15 first bits, with transfer from a task defined by the 15 last bits. After training an agent on one, we train a new agent on the other but keep the parameters of the quasimetric and the actor.
3.1.2 Network architectures and training
The critic is implemented as $D(s, a, s') = \phi(s, s')_a$, where $\phi$ is a ReLU MLP taking as input the concatenation of the two states, with one hidden layer, and one output per action. As indicated in § 2.2, the actor for this environment is an explicit $\arg\min$ over the actions. The planner is also implemented as a ReLU MLP, with input units corresponding to the concatenation of a state and a goal definition, one hidden layer, and one output neuron per bit with a final sigmoid non-linearity.

The length of an episode is equal to the number of bits in the goal, which is twice the median of the optimal number of actions.
We kept the metaparameters as selected by Plappert et al. (2018, appendix B). Since there is no term imposing the validity of the planner output in this environment, the corresponding penalty coefficient of equation 16 is absent.
Following algorithm 1, training is organized in epochs, each consisting of running the policy for a fixed number of episodes and then performing optimization steps on minibatches sampled uniformly from a replay buffer of past transitions. We use the “future” strategy of HER for the selection of goals (Andrychowicz et al., 2017), and update the target networks after every optimization step with an exponential moving average.
3.1.3 Results
The experiments in the bit-flip environment show the advantage of using a planning quasimetric. As shown in figure 2, the training is successful, and the combination of the quasimetric and the planner results in a policy similar to that of the standard DQN, both in terms of success rate and of time to goal. It also appears that, while this model is slightly harder to train on a single task than DQN, it provides a great performance boost when transferring the PQM between tasks: since the quasimetric is pre-trained, the training process only needs to learn a planner, which is a simpler object.
It is noteworthy that, due to limited computational means, we kept essentially the metaparameters of the DQN setup of Andrychowicz et al. (2017), which were heavily optimized for a different architecture; as such the comparison is biased in DQN’s favor.
Figure 3 gives a clearer view of the accuracy of the metric alone. We have computed, after training, the estimated distance $\min_a D(s, a, s')$ for pairs of starting / target states taken at random, and compared it to the “true” distance, which in this environment happens to be the Hamming distance, that is, the number of bits that differ between the two states.
For the tasks in this environment, the planner predicts a target state whose bits that matter for the task are set to the goal configuration, while the others are unchanged from the starting state, since this corresponds to the shortest path. Hence we considered two groups of state pairs: either “in task”, meaning that the two states are consistent with the planner prediction for the task and differ only on the bits that matter for it, or “random”, in which case they are arbitrary, and hence may be inconsistent with the biased statistics observed during training.
The results show that the estimate of the quasimetric is very accurate on the first group, less so on the second, but still strongly monotonic. This is consistent with the transfer providing a substantial boost to the training on a new task.
3.2 MuJoCo
3.2.1 Environment and tasks
For our second set of experiments, we use the “Fetch” environments of OpenAI Gym (Brockman et al., 2016), which use the MuJoCo physics engine (Todorov et al., 2012), are pictured in figure 4, and are described in detail by Plappert et al. (2018, section 1.1). Unless specified otherwise, the details of our experiments are the same as indicated by Plappert et al. (2018, section 1.4) and Andrychowicz et al. (2017, appendix A).
We consider two tasks: “push”, where a box is placed at random on the table and the robot’s objective is to move it to a desired location also on the table, without using the gripper, and “pick and place”, in which the robot can control its gripper, and the desired location for the box may be located above the table surface.
To demonstrate the transferability of the quasimetric in this environment, we also consider the “push” task with transfer from “pick and place”: after training an agent on the latter, we train a new agent on the former, but initialize the parameters of the quasimetric and of the actor with the values obtained with the previous agent.
3.2.2 Network architectures and training
In what follows, let $d_S$ be the dimension of the state space $\mathcal{S}$, $d_A$ the dimension of the action space $\mathcal{A}$, and $d_G$ the dimension of the goal parameter, which corresponds to the desired spatial location of the manipulated box.

The critic is implemented with a ReLU MLP with $2 d_S + d_A$ input units, corresponding to the concatenation of two states and an action, three hidden layers, and a single output unit. The actor is a ReLU MLP with $2 d_S$ input units, three hidden layers, and $d_A$ output units with a tanh non-linearity. As Plappert et al. (2018), we also add a penalty to the actor’s loss equal to the square of the output layer pre-activations. Finally, the planner is a ReLU MLP with $d_S + d_G$ input units, three hidden layers, and $d_S$ output units.
Following algorithm 1, we train “pick and place”, “push with transfer”, and “push” for fixed numbers of epochs. As in Plappert et al. (2018), each epoch consists of several cycles, and each cycle consists of running the policy for a number of episodes and then performing optimization steps on minibatches sampled uniformly from a replay buffer of past transitions. We use the “future” strategy of HER for the selection of goals (Andrychowicz et al., 2017), and update the target networks after every cycle with an exponential moving average.
We kept the metaparameters as selected in Plappert et al. (2018, appendix B), with an additional grid search over the number of hidden neurons in the actor and critic models and over the two penalty coefficients of the planner loss. For each combination, we trained a policy on the “push” task, and eventually selected the combination with the highest rolling median success rate across epochs.
3.2.3 Results
The results obtained in this environment confirm the observations from the bit-flip environment. Figure 5 shows that the joint training of the PQM and planners works properly and results in a policy similar to that obtained with the DDPG approach. Performance is slightly lower, in part due to the limited metaoptimization we could afford, which favors DDPG, and in part due to the difficulty of learning the metric, which is a more complicated functional.
Figures 4 and 6 show the advantage of using the PQM to transfer knowledge from one task to another. Even though the two tasks are quite different, one using the gripper and moving the object through space while holding it, the other moving it only by contact in the plane, the quasimetric provides an initial boost in training by providing the ability to position the arm.
Finally, figure 7 shows that the estimate of the quasimetric accurately reflects the actual distance to the goal state.
4 Related works
The standard Q-learning approach to reinforcement learning consists of learning a policy implicitly, through an estimator of the value of state-action pairs, defined as the expected accumulated reward obtained by first taking the said action in the said state, and then following an optimal policy (Watkins, 1989). Such an approach has proven extremely effective, in particular when combined with modern deep architectures as value approximators (Mnih et al., 2015). It has been further improved by duplicating the model during optimization to avoid overestimating state values (van Hasselt et al., 2015).
The main weakness of such classical methods is the need for large training sets, due to the complexity of the model to learn and the sparsity of the reward. The latter point has been tackled recently by learning goal-conditioned policies (Schaul et al., 2015; Andrychowicz et al., 2017), where arbitrary observed states may be considered as “synthetic” goals. Doing so leverages a richer structure, where regularities of the environment observed along trajectories to synthetic goals can be transferred to reach the actual goals of interest.
Interestingly, this idea of goal-conditioned policies is combined with a constant negative reward at every time step, which results in an accumulated reward having the form of the [opposite of the] distance to the goal. Eysenbach et al. (2019) explicitly consider a distance between states, rather than from state to goal, and from there leverage the set of observed states as vertices of a graph that approximates the geodesic distance on the state manifold.
A natural approach to cope with the sparsity of the reward and the limited number of examples is to leverage structures learned on other tasks. This idea of multi-task learning is not specific to reinforcement learning, and aims at solving the same problem of data scarcity. A very straightforward approach is to train a single model with multiple “heads”, each providing a task-specific prediction (Caruana, 1998).
The same idea of transferring models has been applied to reinforcement learning (Taylor and Stone, 2009), with recent successes using a single model that mimics specialized expert actors on individual tasks (Parisotto et al., 2016). The key issue of bringing several models to a common representation is tackled by normalizing the contributions of the different tasks (Hessel et al., 2018). This is in contrast with our proposal, which explicitly leverages being in a similar environment. This corresponds in our view to a more realistic robotic setup, in which the embodiment is fixed, and it makes the shared knowledge explicit.
5 Conclusion
We have proposed to address the action selection problem by modeling separately a quasimetric between states, and the estimation of a target state given a goal. Experiments show that this approach is valid, and that these two models can be trained jointly to get an efficient policy. As also illustrated in the experiments, the core advantage is that this decomposition moves the bulk of the modeling to the quasimetric, which can be trained across tasks, with a dense feedback from the environment.
This model is fundamentally multi-task. The quasimetric captures more than is necessary for any single task, and requires both a more complex model and more expensive learning, as seen in the experiments. However, since it can be shared across tasks, it provides a substantial boost when learning a new task, for which only the planner that estimates the target state for a given goal has to be trained, along with a minor fine-tuning of the quasimetric.
By disentangling two very different aspects of planning, this decomposition is very promising for future extensions. The quasimetric handles the difficulty of learning a global structure known only through local interactions, but is potentially amenable to the triangle inequality, clustering methods, and dimensionality reduction. The planners are easier to learn, and may be improved with a specific class of regressors taking advantage of a coarse-to-fine structure: one’s final destination can be initially coarsely defined and refined along the way.
The metric structure over the state space opens the way to more complex planning approaches, where intermediate states to reach are “imagined” with an explicit path-search algorithm, combining tree search and deep-learning estimates, in a way similar to MCTS for Go (Silver et al., 2016; Schrittwieser et al., 2019).
Finally, it is also promising for deriving quantities related to the remaining metric uncertainty over state predictions, which could possibly be leveraged as a curiosity measure (Pathak et al., 2017).
References
M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017). Hindsight experience replay. CoRR abs/1707.01495.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). OpenAI Gym. CoRR abs/1606.01540.

R. Caruana (1998). Multitask learning. In Learning to Learn, pp. 95–133.

B. Eysenbach, R. Salakhutdinov, and S. Levine (2019). Search on the replay buffer: bridging planning and reinforcement learning. CoRR abs/1906.05253.

H. van Hasselt (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23, pp. 2613–2621.

M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt (2018). Multi-task deep reinforcement learning with PopArt. CoRR abs/1809.04474.

V. Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.

E. Parisotto, J. Ba, and R. Salakhutdinov (2016). Actor-Mimic: deep multitask and transfer reinforcement learning. In International Conference on Learning Representations (ICLR).

A. Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NeurIPS), pp. 8024–8035.

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017). Curiosity-driven exploration by self-supervised prediction. CoRR abs/1705.05363.

M. Plappert et al. (2018). Multi-goal reinforcement learning: challenging robotics environments and request for research. CoRR abs/1802.09464.

T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015). Universal value function approximators. In International Conference on Machine Learning (ICML), Vol. 37, pp. 1312–1320.

J. Schrittwieser et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. CoRR abs/1911.08265.

D. Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503.

M. E. Taylor and P. Stone (2009). Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research (JMLR) 10, pp. 1633–1685.

E. Todorov, T. Erez, and Y. Tassa (2012). MuJoCo: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026–5033.

H. van Hasselt, A. Guez, and D. Silver (2015). Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461.

C. J. C. H. Watkins (1989). Learning from delayed rewards. Ph.D. thesis, King’s College, Cambridge, UK.