 # Estimating Q(s,s') with Deep Deterministic Dynamics Gradients

In this paper, we introduce a novel form of value function, Q(s, s'), that expresses the utility of transitioning from a state s to a neighboring state s' and then acting optimally thereafter. In order to derive an optimal policy, we develop a forward dynamics model that learns to make next-state predictions that maximize this value. This formulation decouples actions from values while still learning off-policy. We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies. Code and videos are available at <sites.google.com/view/qss-paper>.


## 1 Introduction

The goal of reinforcement learning is to learn how to act so as to maximize long-term reward. A solution is usually formulated as finding the optimal policy, i.e., selecting the optimal action given a state. A popular approach for finding this policy is to learn a function that defines values through actions, $Q(s,a)$, where $\max_a Q(s,a)$ is a state's value and $\arg\max_a Q(s,a)$ is the optimal action (Sutton and Barto, 1998). We will refer to this approach as QSA.

Here, we propose an alternative formulation for off-policy reinforcement learning that defines values solely through states, rather than actions. In particular, we introduce $Q(s,s')$, or simply QSS, which represents the value of transitioning from one state $s$ to a neighboring state $s' \in N(s)$ and then acting optimally thereafter:

$$Q(s,s') = r(s,s') + \gamma \max_{s'' \in N(s')} Q(s',s'').$$

Figure 1: Formulation for (a) Q-learning, or QSA-learning, vs. (b) QSS-learning. Instead of proposing an action, a QSS agent proposes a state, which is then fed into an inverse dynamics model that determines the action given the current state and next state proposal. The environment returns the next observation and reward as usual after following the action.

In this formulation, instead of proposing an action, the agent proposes a desired next state, which is fed into an inverse dynamics model that outputs the appropriate action to reach it (see Figure 1). We demonstrate that this formulation has several advantages. First, redundant actions that lead to the same transition are simply folded into one value estimate. Further, by removing actions, QSS becomes easier to transfer than a traditional Q function in certain scenarios, as it only requires learning an inverse dynamics function upon transfer, rather than a full policy or value function. Finally, we show that QSS can learn policies purely from observations of (potentially sub-optimal) demonstrations with no access to demonstrator actions. Importantly, unlike other imitation from observation approaches, because it is off-policy, QSS can learn highly efficient policies even from sub-optimal or completely random demonstrations.

In order to realize the benefits of off-policy QSS, we must obtain value-maximizing future state proposals without performing explicit maximization. There are two problems one would encounter in doing so. The first is that the set of neighbors of $s$ is not assumed to be known a priori. This is unlike the set of actions in discrete QSA, which is assumed to be provided by the MDP. Second, for continuous state and action spaces, the set of neighbors may be infinite, so maximizing over them explicitly is out of the question. To get around this difficulty, we draw inspiration from Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), which learns a policy $\pi(s) \to a$ over continuous action spaces that maximizes $Q(s,\pi(s))$. We develop the analogous Deep Deterministic Dynamics Gradient (D3G), which trains a forward dynamics model $\tau(s) \to s'$ to predict next states that maximize $Q(s,\tau(s))$. Notably, this model is not conditioned on actions, and thus allows us to train QSS completely off-policy from observations alone.

We begin the next section by formulating QSS, then describe its properties within tabular settings. We then outline the case of using QSS in continuous settings, where we use D3G to train $\tau(s)$. We evaluate in both tabular problems and MuJoCo tasks (Todorov et al., 2012).

## 2 The QSS formulation for RL

We are interested in solving problems specified through a Markov Decision Process, which consists of states $s \in S$, actions $a \in A$, rewards $r(s,s')$, and a transition model $T(s,a,s')$ that indicates the probability of transitioning to a specific next state given a current state and action, $P(s' \mid s,a)$ (Sutton and Barto, 1998). (We use $s$ and $s'$ to denote states consecutive in time, which may alternately be denoted $s_t$ and $s_{t+1}$.) For simplicity, we refer to all rewards $r(s,s')$ as $r$ for the remainder of the paper.

Reinforcement learning aims to find a policy $\pi(a \mid s)$ that expresses the probability of taking action $a$ in state $s$. We are typically interested in policies that maximize the long-term discounted return $\sum_{k=t}^{H} \gamma^{k-t} r_k$, where $\gamma$ is a discount factor that expresses the importance of long-term rewards and $H$ is the environmental horizon.

The standard QSA method estimates the return using Q-values of actions (Watkins and Dayan, 1992):

$$Q(s,a) = \mathbb{E}\left[r + \gamma \max_{a'} Q(s',a') \mid s,a\right].$$

This expresses the value of taking an action in a state and acting optimally thereafter. QSA can be approximated using an approach known as Q-learning:

$$Q(s,a) = Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right].$$

Policies learned with QSA can be formulated as:

$$\pi(s) = \arg\max_a Q(s,a).$$

We propose an alternative paradigm where we learn $Q(s,s')$, or the value of transitioning from state $s$ to state $s'$ and acting optimally thereafter. By analogy with standard QSA-learning, we express this quantity as:

$$Q(s,s') = r + \gamma \max_{s'' \in N(s')} Q(s',s''). \tag{1}$$

Although this equation may be applied to any environment, for it to be a useful formulation, the environment must be deterministic. To see why, note that in QSA-learning, the max is over actions, which the agent has perfect control over, and any uncertainty in the environment is integrated out by the expectation. In QSS-learning the max is over next states, which in stochastic environments are not perfectly predictable. In such environments the above equation does faithfully track a certain value, but it may be considered the "best possible scenario value": the value of a current and subsequent state assuming that any stochasticity the agent experiences turns out as well as possible for the agent. Concretely, this means we assume that the agent can transition reliably (with probability 1) to any state $s'$ that it is possible (with probability greater than 0) to reach from state $s$.

Of course, this will not hold for stochastic domains in general, in which case QSS learning does not track an actionable value. While this limitation may seem severe, we will demonstrate that the QSS formulation affords us a powerful tool for use in deterministic environments, which we develop in the remainder of this article. Henceforth we assume that the transition function is deterministic, and the empirical results that follow show that our approach succeeds over a wide range of tasks, including tabular problems and MuJoCo simulations.
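To make the "best possible scenario value" concrete, the following minimal sketch contrasts an expectation over outcomes (as in QSA) with a max over reachable outcomes (as in QSS) for a single step. The probabilities, rewards, and function names are hypothetical and chosen only for illustration.

```python
def qsa_value(outcomes):
    """Expected one-step return: sum of probability * reward over outcomes."""
    return sum(p * r for p, r in outcomes)


def qss_best_case_value(outcomes):
    """Best-case one-step return: max reward over outcomes with nonzero probability."""
    return max(r for p, r in outcomes if p > 0)


# Hypothetical one-step problem: a single action reaches the goal (reward 1)
# with probability 0.75 and slips off a cliff (reward -1) with probability 0.25.
outcomes = [(0.75, 1.0), (0.25, -1.0)]

print("expectation over outcomes:", qsa_value(outcomes))
print("max over reachable outcomes:", qss_best_case_value(outcomes))
```

In this toy case the max ignores the chance of slipping, so the state-to-state value is higher than the expected return whenever an unfavorable outcome has nonzero probability.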

### 2.1 Bellman update for QSS

We first consider the simple setting where we have access to an inverse dynamics model $I(s,s') \to a$ that returns an action $a$ that takes the agent from state $s$ to $s'$. We also assume access to a function $N(s)$ that outputs the neighbors of $s$. We use this as an illustrative example and will formulate the problem without these assumptions in the next section.

We define the Bellman update for QSS-learning as:

$$Q(s,s') = Q(s,s') + \alpha\left[r + \gamma \max_{s'' \in N(s')} Q(s',s'') - Q(s,s')\right]. \tag{2}$$

Note that $Q(s,s')$ is undefined when $s$ and $s'$ are not neighbors. In order to obtain a policy, we define $\tau(s)$ as a function that selects a neighboring state from $N(s)$ that maximizes QSS:

$$\tau(s) = \arg\max_{s' \in N(s)} Q(s,s'). \tag{3}$$

In words, $\tau(s)$ selects states that have large value, and acts similarly to a policy over states. In order to obtain the policy over actions, we use the inverse dynamics model:

$$\pi(s) = I(s,\tau(s)). \tag{4}$$

This approach first finds the state $s'$ that maximizes $Q(s,s')$, and then uses $I(s,s')$ to determine the action that will take the agent there. We can rewrite Equation 2 as:

$$Q(s,s') = Q(s,s') + \alpha\left[r + \gamma Q(s',\tau(s')) - Q(s,s')\right]. \tag{5}$$
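The update in Equation 5, together with Equations 3 and 4, can be sketched in a few lines of tabular code. This is a minimal illustration under stated assumptions, not the authors' implementation: the 4x4 grid, the step size, the number of updates, the random sampling of transitions, and all function names are hypothetical.

```python
import random

SIZE = 4                      # hypothetical small grid (the experiments here use 11x11)
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(s, a):
    """Deterministic transition; moves that would leave the grid keep the agent in place."""
    x, y = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (x, y) if 0 <= x < SIZE and 0 <= y < SIZE else s


def neighbors(s):
    """N(s): states reachable from s in one step."""
    return sorted({step(s, a) for a in ACTIONS})


def inverse_dynamics(s, s_next):
    """I(s, s'): some action that takes the agent from s to s' (assumed given)."""
    return next(a for a in ACTIONS if step(s, a) == s_next)


def tau(Q, s):
    """Equation 3: the neighbor of s with the largest Q(s, s')."""
    return max(neighbors(s), key=lambda n: Q.get((s, n), 0.0))


def qss_update(Q, s, s_next, r, alpha=0.5, gamma=0.99):
    """Equation 5: move Q(s, s') toward r + gamma * Q(s', tau(s'))."""
    bootstrap = 0.0 if s_next == GOAL else Q.get((s_next, tau(Q, s_next)), 0.0)
    old = Q.get((s, s_next), 0.0)
    Q[(s, s_next)] = old + alpha * (r + gamma * bootstrap - old)


def train(updates=20000, seed=0):
    """Learn QSS from transitions generated by uniformly random states and actions."""
    rng = random.Random(seed)
    Q = {}
    states = [(x, y) for x in range(SIZE) for y in range(SIZE) if (x, y) != GOAL]
    for _ in range(updates):
        s = rng.choice(states)
        s_next = step(s, rng.choice(list(ACTIONS)))
        qss_update(Q, s, s_next, r=-1.0)   # reward of -1 per step
    return Q


def greedy_rollout(Q, s=(0, 0), max_steps=50):
    """Equation 4: act with pi(s) = I(s, tau(s)) until the goal is reached."""
    path = [s]
    while s != GOAL and len(path) <= max_steps:
        s = step(s, inverse_dynamics(s, tau(Q, s)))
        path.append(s)
    return path


path = greedy_rollout(train())
print(path)
```

Note that the table is indexed by state pairs only; actions enter solely through the inverse dynamics lookup at acting time.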

### 2.2 Equivalence of Q(s,a) and Q(s,s′)

Figure 5 (panel shown: (a) 25%): Stochastic experiments in cliffworld. This experiment measures the effect of stochastic actions on the average success rate. Before each episode, we evaluated the learned policy and averaged the return over 10 trials. All experiments were averaged over 10 seeds with 95% confidence intervals.

Let us now investigate the relation between values learned using QSA and QSS. Consider an MDP with a deterministic state transition function and inverse dynamics function $I(s,s')$. QSS can be thought of as equivalent to using QSA to solve the sub-MDP containing only the set of actions returned by $I(s,s')$ for every state $s$:

$$Q(s,s') = Q(s, I(s,s'))$$

Because the MDP solved by QSS is a sub-MDP of that solved by QSA, there must always be at least one action $a$ for which $Q(s,a) \ge \max_{s'} Q(s,s')$.

The original MDP may contain additional actions not returned by $I(s,s')$, but following our assumptions, their return must be less than or equal to that of the action $I(s,s')$. Since this is also true in every state following $s$, we must have:

$$Q(s,a) \le \max_{s'} Q(s, I(s,s')) \quad \text{for all } a$$

Thus we obtain the following equivalence between QSA and QSS for deterministic environments:

$$\max_{s'} Q(s,s') = \max_a Q(s,a)$$

This equivalence will allow us to learn accurate action-values without dependence on the action space.
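The equivalence can be checked numerically on a small deterministic MDP. The sketch below is an illustration under assumptions: the transition table `T`, the rewards `R`, the discount, and the sweep count are all hypothetical, and plain value iteration stands in for learning.

```python
GAMMA = 0.9

# Hypothetical deterministic MDP: T[s][a] is the next state; state 3 is terminal.
# Actions "a" and "c" in state 0 are redundant (same next state).
T = {
    0: {"a": 1, "b": 2, "c": 1},
    1: {"a": 3, "b": 0},
    2: {"a": 3, "b": 2},
    3: {},
}
# Rewards depend only on the transition (s, s'), as in r(s, s').
R = {(0, 1): -1.0, (0, 2): -2.0, (1, 3): 5.0, (1, 0): 0.0, (2, 3): 10.0, (2, 2): 0.0}


def solve_qsa(sweeps=100):
    """Value iteration on Q(s, a)."""
    Q = {(s, a): 0.0 for s in T for a in T[s]}
    for _ in range(sweeps):
        for (s, a) in Q:
            s2 = T[s][a]
            best = max((Q[(s2, a2)] for a2 in T[s2]), default=0.0)
            Q[(s, a)] = R[(s, s2)] + GAMMA * best
    return Q


def solve_qss(sweeps=100):
    """Value iteration on Q(s, s'), where N(s) is the set of next states of s."""
    N = {s: sorted(set(T[s].values())) for s in T}
    Q = {(s, s2): 0.0 for s in T for s2 in N[s]}
    for _ in range(sweeps):
        for (s, s2) in Q:
            best = max((Q[(s2, s3)] for s3 in N[s2]), default=0.0)
            Q[(s, s2)] = R[(s, s2)] + GAMMA * best
    return Q


qsa, qss = solve_qsa(), solve_qss()
for s in (0, 1, 2):
    v_a = max(qsa[(s, a)] for a in T[s])
    v_s = max(v for (s0, _), v in qss.items() if s0 == s)
    print(s, v_a, v_s)
```

The QSS table here has one entry per distinct transition, so the redundant actions in state 0 share a single value.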

## 3 QSS in tabular settings

In simple settings where the state space is discrete, $Q(s,s')$ can be represented by a table. We use this setting to highlight some of the properties of QSS. In each experiment, we evaluate within a simple 11x11 gridworld where an agent, initialized at $\langle 0,0 \rangle$, navigates in each cardinal direction and receives a reward of $-1$ until it reaches the goal, unless otherwise noted.

### 3.1 Example of equivalence of QSA and QSS

We first examine the values learned by QSS (Figure 2). The output of QSS increases as the agent gets closer to the goal, which indicates that QSS learns meaningful values for this task. Additionally, the difference in value between $\max_a Q(s,a)$ and $\max_{s'} Q(s,s')$ approaches zero as the values of QSS and QSA converge. Hence, QSS learns values similar to those of QSA in this deterministic setting.

### 3.2 Example of QSS in a stochastic setting

The next experiment measures the impact of stochastic transitions on learning using QSS. To investigate this property, we add a probability of slipping to each transition, where the agent takes a random action (i.e., slips into an unintended next state) some percentage of time. First, we notice that the values learned by QSS when transitions have 100% slippage (completely random actions) are quite different from those learned by QSA (Figure 3). In fact, the values learned by QSS are similar to the previous experiment when there was no stochasticity in the environment (Figure 1(b)). As the transitions become more stochastic, the distance between values learned by QSA and QSS vastly increases (Figure 2(c)). This provides evidence that the formulation of QSS assumes the best possible transition will occur, thus causing the values to be overestimated in stochastic settings.

Curiously, QSS solves this task more quickly than QSA, even though it learns incorrect values (Figure 4). One hypothesis is that the slippage causes the agent to stumble into the goal state, which is beneficial for QSS because it directly updates values based on state transitions. The correct action that enables this transition is known from the given inverse dynamics model. QSA, on the other hand, would need to learn how the stochasticity of the environment affects the selected action's outcome, and so the values may propagate more slowly.

We now study the case when stochasticity may lead to negative effects for QSS. We modify the gridworld to include a cliff along the bottom edge similar to the example in Sutton and Barto (1998). The agent is initialized on top of the cliff, and if it attempts to step down, it falls off and the episode is reset. Furthermore, the cliff is “windy”, and the agent has a probability of falling off the edge while walking next to it. The reward here is everywhere except the goal, which has a reward of . Here, we see the effect of stochasticity is detrimental to QSS (Figure 5), as it does not account for falling and instead expects to transition towards the goal.

### 3.3 QSS handles redundant actions

One benefit of training QSS is that the transitions from one action can be used to learn values for another action. Consider the setting where two actions in a given state transition to the same next state. QSA would need to make updates for both actions in order to learn their values. But QSS only updates the transitions, thus ignoring any redundancy in the action space. We further investigate this property in a gridworld with redundant actions. Suppose an agent has four underlying actions, up, down, left, and right, but these actions are duplicated a number of times. As the number of redundant actions increases, the performance of QSA deteriorates, whereas QSS remains unaffected (Figure 6).

We also evaluate how QSS is impacted when the inverse dynamics model is learned rather than given (Figure 6). To do this, we instantiate $I(s,s')$ as a set that is updated whenever an action $a$ is observed to produce the transition from $s$ to $s'$. We sample from this set anytime $I(s,s')$ is called, and return a random sample over all redundant actions if $I(s,s') = \emptyset$.

Even when the inverse dynamics model is learned, QSS is able to perform well because it only needs to learn about a single action that transitions from $s$ to $s'$.
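A set-based inverse dynamics model of the kind described above might look like the following sketch. The class name, method names, and action labels are hypothetical; only the behavior (store observed actions per transition, sample from them, fall back to a random action when the set is empty) follows the text.

```python
import random
from collections import defaultdict


class SetInverseDynamics:
    """I(s, s') stored as the set of actions observed to cause the transition s -> s'."""

    def __init__(self, all_actions, seed=0):
        self.all_actions = list(all_actions)
        self.seen = defaultdict(set)
        self.rng = random.Random(seed)

    def update(self, s, a, s_next):
        """Record that action a took the agent from s to s_next."""
        self.seen[(s, s_next)].add(a)

    def __call__(self, s, s_next):
        """Sample a known action for s -> s_next, or any action if none is known yet."""
        known = self.seen.get((s, s_next))
        if not known:
            return self.rng.choice(self.all_actions)
        return self.rng.choice(sorted(known))


# Hypothetical action set with duplicated "right" actions.
actions = ["right_0", "right_1", "left_0"]
model = SetInverseDynamics(actions)
model.update((0, 0), "right_1", (1, 0))
print(model((0, 0), (1, 0)))   # the one action observed for this transition
print(model((5, 5), (5, 6)))   # unseen transition: falls back to a random action
```

A single observed action per transition is enough for acting, which is why duplicated actions do not slow this model down.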

### 3.4 QSS enables value function transfer of permuted actions

The final experiment in the tabular setting considers the scenario of transferring to an environment where the meaning of actions has changed. We imagine this could be useful in environments where the physics are similar but the actions have been labeled differently. In this case, QSS values should directly transfer, but not the inverse dynamics, which would need to be retrained from scratch. We trained QSA and QSS in an environment where the actions were labeled as 0, 1, 2, and 3, then transferred the learned values to an environment where the labels were shuffled. We found that QSS was able to learn much more quickly in the transferred environment than QSA (Figure 6). Hence, we were able to retrain the inverse dynamics model more quickly than the values for QSA. Interestingly, QSA also learns quickly with the transferred values. This is likely because the Q-table is initialized to values that are closer to the true values than a uniformly initialized table.
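The transfer argument can be illustrated on a toy chain: the QSS table is reused unchanged, and only the inverse dynamics lookup is rebuilt for the relabeled actions. Everything below (the 5-cell chain, the hand-computed table, and the helper names) is a hypothetical sketch rather than the experimental setup.

```python
def make_step(displacement):
    """Deterministic 1-D chain with cells 0..4; displacement maps action label -> move."""
    def step(s, a):
        return min(max(s + displacement[a], 0), 4)
    return step


def neighbors(s):
    """N(s): cells reachable from s in one step (walls keep the agent in place)."""
    return sorted({min(max(s + d, 0), 4) for d in (-1, 1)})


# Hand-computed QSS table for goal cell 4, reward -1 per step, gamma = 1:
# V(s') = -(4 - s'), so Q(s, s') = -1 + V(s') = s' - 5.
QSS = {(s, n): float(n - 5) for s in range(4) for n in neighbors(s)}


def learn_inverse_dynamics(step, actions):
    """Try every action once in every state and record the transition it causes."""
    return {(s, step(s, a)): a for s in range(4) for a in actions}


def rollout(step, inverse, s=0, max_steps=10):
    path = [s]
    while s != 4 and len(path) <= max_steps:
        target = max(neighbors(s), key=lambda n: QSS[(s, n)])   # tau(s)
        s = step(s, inverse[(s, target)])                        # pi(s) = I(s, tau(s))
        path.append(s)
    return path


step_a = make_step({0: -1, 1: +1})   # original action labels
step_b = make_step({0: +1, 1: -1})   # shuffled action labels
path_a = rollout(step_a, learn_inverse_dynamics(step_a, [0, 1]))
path_b = rollout(step_b, learn_inverse_dynamics(step_b, [0, 1]))
print(path_a, path_b)
```

The same `QSS` table drives both rollouts; a QSA table, by contrast, would have its columns tied to the old labels.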

## 4 Extending to the continuous domain with D3G

In contrast to domains where the state space is discrete and both QSA and QSS can represent relevant functions with a table, in continuous settings or environments with large state spaces we must approximate values with function approximation. One such approach is Deep Q-learning, which uses a deep neural network to approximate QSA (Mnih et al., 2013, 2015). The loss is formulated as $L_\theta = \lVert y - Q_\theta(s,a) \rVert$, where $y = r + \gamma \max_{a'} Q_{\theta'}(s',a')$.

Here, $Q_{\theta'}$ is a target network that stabilizes training. Training is further improved by sampling experience from a replay buffer to decorrelate the sequential data observed in an episode.

### 4.1 Deep Deterministic Policy Gradients

Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) applies Deep Q-learning to problems with continuous actions. Instead of computing a max over actions for the target $y$, it uses the output of a policy that is trained to maximize a critic $Q$: $y = r + \gamma Q_{\theta'}(s', \pi_\psi(s'))$. Here, $\pi_\psi(s)$ is known as an actor and is trained using the following loss:

$$L_\psi = -Q_\theta(s, \pi_\psi(s)).$$

This approach uses a target network $Q_{\theta'}$ that is moved slowly towards $Q_\theta$ by updating the parameters as $\theta' \leftarrow \eta\theta + (1-\eta)\theta'$, where $\eta$ determines how smoothly the parameters are updated. A target policy network $\pi_{\psi'}$ is also used when training $Q$, and is updated similarly to $\theta'$.
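The slow-moving target update described above is commonly implemented as an exponential moving average of the online parameters. The sketch below assumes parameters stored as flat lists of floats and a mixing coefficient named `eta`; both are illustrative choices, not details given in the text.

```python
def soft_update(target_params, online_params, eta=0.005):
    """Move each target parameter a fraction eta of the way toward the online parameter."""
    return [eta * w + (1.0 - eta) * w_t
            for w, w_t in zip(online_params, target_params)]


# Hypothetical parameter vectors.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, eta=0.5)
print(target)
```

A small `eta` keeps the target nearly fixed between updates, which is what stabilizes the bootstrapped regression target.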

### 4.2 Twin Delayed DDPG

Twin Delayed DDPG (TD3) is a more stable variant of DDPG. One improvement is to delay the updates of the target networks and actor so that they occur less frequently than the critic updates, as controlled by a delay parameter $d$. Additionally, TD3 utilizes Double Q-learning (Hasselt, 2010) to reduce overestimation bias in the critic updates. Instead of training a single critic, this approach trains two and uses the smaller of their two outputs when computing the target $y$:

$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \pi_{\psi'}(s')).$$

The loss for the critics becomes:

$$L_\theta = \sum_i \lVert y - Q_{\theta_i}(s,a) \rVert.$$

Finally, Gaussian noise is added to the policy when sampling actions. We use each of these techniques in our own approach.

Figure 7: Illustration of the cycle consistency for training D3G. Given a state $s$, $\tau(s)$ predicts the next state $s'_\tau$ (black arrow). The inverse dynamics model $I(s, s'_\tau)$ predicts the action that would yield this transition (blue arrows). Then a forward dynamics model $f_\phi(s,a)$ takes the action and current state to obtain the next state, $s'_f$ (green arrows).
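For a single transition, the twin-critic target and critic loss can be written as scalar functions. This is a minimal sketch under assumptions: the norm in the critic loss is taken to be a squared error, and the `done` flag that drops the bootstrap term at episode end is an addition not spelled out in the text.

```python
def td3_target(r, gamma, q1_next, q2_next, done=False):
    """y = r + gamma * min(Q1', Q2'), with no bootstrap after a terminal transition."""
    return r + (0.0 if done else gamma * min(q1_next, q2_next))


def critic_loss(y, q1, q2):
    """Sum over both critics of the (assumed squared) error against the shared target."""
    return (y - q1) ** 2 + (y - q2) ** 2


# Hypothetical target-critic outputs for the next state and next action.
y = td3_target(r=1.0, gamma=0.5, q1_next=4.0, q2_next=2.0)
print(y, critic_loss(y, q1=1.0, q2=3.0))
```

Taking the minimum means an overestimate by one critic only affects the target if the other critic overestimates as well.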

### 4.3 Deep Deterministic Dynamics Gradients (D3G)

A clear difficulty with training QSS in continuous settings is that it is not possible to iterate over an infinite state space to find a maximizing neighboring state. Instead, we propose training a model to directly output the state that maximizes QSS. We introduce an analogous approach to TD3 for training QSS, Deep Deterministic Dynamics Gradients (D3G). Like the deterministic policy gradient formulation $Q(s, \pi_\psi(s))$, D3G learns a model $\tau_\psi(s) \to s'$ that makes predictions that maximize $Q(s, \tau_\psi(s))$. To train the critic, we specify the loss as:

$$L_\theta = \sum_i \lVert y - Q_{\theta_i}(s,s') \rVert. \tag{6}$$

Here, the target is specified as:

$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tau_{\psi'}(s')). \tag{7}$$

Similar to TD3, we utilize two critics to stabilize training and a target network for $Q$. We additionally use a target network for $\tau$, which is updated slowly as $\psi' \leftarrow \eta\psi + (1-\eta)\psi'$.

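We train $\tau$ to maximize the expected return, $J$, starting from any state $s$:

$$\begin{aligned}
\nabla_\psi J &= \mathbb{E}\left[\nabla_\psi Q(s,s')\big|_{s' \sim \tau_\psi(s)}\right] \\
&= \mathbb{E}\left[\nabla_{s'} Q(s,s')\,\nabla_\psi \tau_\psi(s)\right] \quad \text{[using chain rule]}
\end{aligned} \tag{8}$$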

This can be accomplished by minimizing the following loss:

$$L_\psi = -Q_\theta(s, \tau_\psi(s)).$$

We discuss in the next section how this formulation alone may be problematic.
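The chain rule in Equation 8 can be traced by hand on a toy scalar problem. In the sketch below, the critic $Q(s,s') = -(s' - g)^2$, the one-parameter model $\tau_\psi(s) = s + \psi$, the goal value, and the learning rate are all hypothetical stand-ins for the learned networks.

```python
def critic(s, s_next, goal=3.0):
    """Hypothetical fixed critic that prefers next states close to `goal`."""
    return -(s_next - goal) ** 2


def tau(psi, s):
    """Hypothetical one-parameter dynamics model: predicts s' = s + psi."""
    return s + psi


def loss(psi, s):
    """L_psi = -Q(s, tau_psi(s))."""
    return -critic(s, tau(psi, s))


def loss_grad(psi, s, goal=3.0):
    """Chain rule: dL/dpsi = -(dQ/ds') * (dtau/dpsi)."""
    dq_dsnext = -2.0 * (tau(psi, s) - goal)
    dtau_dpsi = 1.0
    return -dq_dsnext * dtau_dpsi


def train(s=0.0, psi=0.0, lr=0.1, steps=200):
    """Gradient descent on L_psi for a single state s."""
    for _ in range(steps):
        psi -= lr * loss_grad(psi, s)
    return psi


psi_star = train()
print(tau(psi_star, 0.0))   # approaches the critic's preferred next state
```

Nothing in this objective constrains the prediction to be a reachable neighbor of `s`: the model simply moves to wherever the critic is largest.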

As in the tabular case, $\tau_\psi(s)$ acts as a policy over states that aims to maximize $Q(s,s')$, except now it is being trained to do so. To obtain the necessary action, we apply an inverse dynamics model $I$ as before:

$$\pi(s) = I_\omega(s, \tau_\psi(s)). \tag{9}$$

Now, $I_\omega$ is trained using a neural network with data $\langle s, a, s' \rangle$. The loss is:

$$L_\omega = \lVert I_\omega(s,s') - a \rVert. \tag{10}$$

#### 4.3.1 Cycle consistency

Figure 8: Gridworld experiments for D3G (top) and D3G– (bottom). The left column represents the value function Q(s,τ(s)). The middle column represents the average nearest neighbor predicted by τ when s was initialized to ⟨0,0⟩. These results were averaged over 5 seeds with 95% confidence intervals. The final column displays the trajectory predicted by τ(s) when starting from the top left corner of the grid.

Figure 9: Experiments for training TD3, DDPG, D3G– and D3G in MuJoCo tasks. Every 5000 timesteps, we evaluated the learned policy and averaged the return over 10 trials. The experiments were averaged over 10 seeds with 95% confidence intervals.

Figure 10: D3G generated plans learned from observational data obtained from a completely random policy in InvertedPendulum-v2 (top) and Reacher-v2 (bottom). To generate the plans, we first plugged the initial state from the column in the left into C(s,τ(s)) to predict the next state s′f. We then plugged this state into C(s′f,τ(s′f)) to hallucinate the next state. We visualize the model predictions after every 5 steps. In the Reacher-v2 environment, we set the target (ball position) to be constant and the delta between the fingertip position and target position to be determined by the joint positions (fully described by the first four elements of the state) and the target position. This was only for visualization purposes and was not done during training. Videos are available at sites.google.com/view/qss-paper.

DDPG has been shown to overestimate the values of the critic, resulting in a policy that exploits this bias (Fujimoto et al., 2018). Similarly, with the current formulation of the D3G loss, $\tau(s)$ can suggest non-neighboring states whose value the critic has overestimated. To overcome this, we propose regularizing $\tau$ by ensuring the proposed states are reachable in a single step. In particular, we introduce an additional function for ensuring cycle consistency, $C(s, \tau_\psi(s))$ (see Algorithm 2).

We use this regularizer as a substitute for training interactions with $\tau$. As shown in Figure 7, given a state $s$, we use $\tau(s)$ to predict the next state $s'_\tau$. We use the inverse dynamics model $I(s, s'_\tau)$ to determine the action $a$ that would yield this transition. We then plug that action into a forward dynamics model $f_\phi(s,a)$ to obtain the final next state, $s'_f$. In other words, we regularize the model to make predictions that are consistent with the inverse and forward dynamics models learned from environment data.

To train the forward dynamics model, we compute:

$$L_\phi = \lVert f_\phi(s,a) - s' \rVert. \tag{11}$$

We can then compute the cycle loss for $\tau_\psi$:

$$L_\psi = -Q_\theta(s, C(s, \tau_\psi(s))) + \beta \lVert \tau_\psi(s) - C(s, \tau_\psi(s)) \rVert. \tag{12}$$

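The second term is a regularizer that further encourages $\tau_\psi$ to predict states that agree with the cycle output, i.e., reachable next states. The final target for training $Q$ becomes:

$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', C(s', \tau_{\psi'}(s'))). \tag{13}$$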

We train each of these models concurrently. The full training procedure is described in Algorithm 1.
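The cycle function and the resulting model loss can be sketched with hand-written stand-ins for the learned models. The 1-D dynamics below (actions clipped to $[-1, 1]$, next state $s + a$), the absolute-value norm, and all function names are assumptions made for illustration; in D3G the inverse and forward models are learned networks.

```python
def inverse_dynamics(s, s_next):
    """Toy I(s, s'): the action in [-1, 1] that moves s as close as possible to s'."""
    return max(-1.0, min(1.0, s_next - s))


def forward_dynamics(s, a):
    """Toy f(s, a): the state reached from s under action a."""
    return s + a


def cycle(s, proposal):
    """C(s, tau(s)) = f(s, I(s, tau(s))): map a proposed state to a reachable one."""
    return forward_dynamics(s, inverse_dynamics(s, proposal))


def model_loss(q_value_fn, s, proposal, beta=1.0):
    """-Q(s, C(s, tau(s))) + beta * |tau(s) - C(s, tau(s))| for a scalar state."""
    reachable = cycle(s, proposal)
    return -q_value_fn(s, reachable) + beta * abs(proposal - reachable)


# Hypothetical critic that simply prefers larger next states.
def toy_q(s, s_next):
    return s_next


print(cycle(0.0, 3.0))               # an unreachable proposal is pulled back to a neighbor
print(model_loss(toy_q, 0.0, 3.0))   # the penalty term grows with the size of the jump
print(model_loss(toy_q, 0.0, 1.0))   # a reachable proposal incurs no penalty
```

Because the critic is only evaluated at the cycled state, overestimated values at unreachable states cannot be exploited directly, and the penalty term pulls the proposal itself back toward a reachable state.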

#### 4.3.2 A note on training dynamics models

We found it useful to train the models $\tau$ and $f$ to predict the difference between states rather than the next state, as has been done in several other works (Nagabandi et al., 2018; Goyal et al., 2018; Edwards et al., 2018). As such, we compute $s'_\tau = s + \tau(s)$ to obtain the next state from $\tau$, and $s'_f = s + f(s,a)$ to obtain the next state prediction for $f$. We describe this implementation detail here for clarity of the paper.

## 5 D3G properties and results

We now describe several experiments that aimed to measure different properties of D3G. We include full training details of hyperparameters and architectures in the appendix.

### 5.1 Example of D3G in a gridworld

We first evaluate D3G within a simple 11x11 gridworld with discrete states and actions (Figure 8). The agent can move a single step in one of the cardinal directions, and obtains a reward of -1 until it reaches the goal. Because D3G uses an inverse dynamics model to determine actions, it is straightforward to apply it to this discrete setting.

These experiments examine if D3G learns meaningful values, predicts neighboring states, and makes realistic transitions toward the goal. We additionally investigate the merits of using a cycle loss.

We first visualize the values learned by D3G and D3G without the cycle loss (D3G–). The output of QSS increases for both methods as the agent moves closer to the goal (Figure 8). This indicates that D3G can be used to learn meaningful QSS values. However, D3G– vastly overestimates these values. (One seed out of five in the D3G– experiments did yield a good value function, but we did not witness this problem of overestimation in D3G.) Hence, it is clear that the cycle loss helps to reduce overestimation bias.

Next, we evaluate whether $\tau(s)$ learns to predict neighboring states. First, we set the agent state to $\langle 0,0 \rangle$. We then compute the minimum Manhattan distance of $\tau(s)$ to the neighbors of $s$, $N(s)$. This experiment examines how close the predictions made by $\tau(s)$ are to neighboring states.
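The nearest-neighbor metric can be sketched as follows. The neighbor definition (cells one cardinal step away inside an 11x11 grid) and the function names are assumptions for illustration; the predicted state may be real-valued.

```python
def neighbors(s, size=11):
    """Grid cells one cardinal step away from s that lie inside the grid."""
    x, y = s
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(i, j) for i, j in candidates if 0 <= i < size and 0 <= j < size]


def min_manhattan_to_neighbors(prediction, s, size=11):
    """Smallest Manhattan distance from a predicted state to any neighbor of s."""
    return min(abs(prediction[0] - n[0]) + abs(prediction[1] - n[1])
               for n in neighbors(s, size))


# Hypothetical predictions for the state <0, 0>.
print(min_manhattan_to_neighbors((1.0, 0.0), (0, 0)))   # lands exactly on a neighbor
print(min_manhattan_to_neighbors((3.0, 3.0), (0, 0)))   # a diagonal jump far from any neighbor
```

A value near zero means the prediction is close to a state the agent can actually reach in one step.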

In this task, D3G is able to predict states that are no more than one step away from the nearest neighbor on average (Figure 8). However, D3G– makes predictions that are significantly outside of the range of the grid. We see this further when visualizing a trajectory of state predictions made by $\tau$. D3G– simply makes predictions along the diagonal until it extends beyond the grid range. However, D3G learns to predict grid-like steps to the goal, as is required by the environment. This suggests that the cycle loss ensures predictions made by $\tau(s)$ are neighbors of $s$.

### 5.2 D3G can be used to solve control tasks

We next evaluate our approach in more complicated MuJoCo tasks from OpenAI Gym (Brockman et al., 2016). These experiments examine if D3G can be used to learn complex control tasks, and the impact of the cycle loss on training. We compare against TD3 and DDPG.

In several tasks, D3G is able to perform as well as TD3 and significantly outperforms DDPG (Figure 9). Without the cycle loss, D3G– is not able to accomplish any of the tasks. D3G does perform poorly in Humanoid-v2 and Walker2d-v2. Interestingly, DDPG also performs poorly in these tasks. Nevertheless, we have demonstrated that D3G can indeed be used to solve difficult control tasks. This introduces a new research direction for actor-critic methods: training a dynamics model, rather than a policy, whose predictions optimize the return. We demonstrate in the next section that this model is powerful enough to learn from observations obtained from completely random policies.

### 5.3 D3G enables learning from observations obtained from random policies

Imitation from observation is a technique for training agents to imitate in settings where actions are not available. Traditionally, approaches have assumed that the observational data was obtained from an expert, and train models to match the distribution of the underlying policy (Torabi et al., 2018; Edwards et al., 2019). Because $Q(s,s')$ does not include actions, we can use it to learn from observations, rather than imitate, in an off-policy manner. This allows learning from observation data from completely random policies.

To learn from observations, we assume we are given a dataset of state observations, rewards, and termination conditions obtained by some behavior policy. We then train D3G to learn QSS values and a model $\tau(s)$ offline without interacting with the environment. One problem is that we cannot use the cycle loss described in Section 4, as this data does not contain actions. Instead, we need another function that allows us to cycle from $\tau(s)$ to a predicted next state.

To do this, we make a novel observation. The forward dynamics model does not need to take in actions to predict the next state. It simply needs an input that can be used as a clue for predicting the next state. We propose using $Q(s,s')$ as a replacement for the action. Namely, we now train the forward dynamics model with the following loss:

$$L_\phi = \lVert f_\phi(s, Q_{\theta'}(s,s')) - s' \rVert. \tag{14}$$

Because $Q$ is changing, we use the target network $Q_{\theta'}$ when learning $f$. We can then use the same losses as before for training QSS and $\tau$, except we utilize the cycle function defined for imitation in Algorithm 2.

We argue that $Q$ is a good replacement for $a$ because for a given state, different QSS values often indicate different neighboring states. While this may not always be useful (there can of course be multiple optimal states), we found that this worked well in practice.

To evaluate this hypothesis, we trained QSS in InvertedPendulum-v2 and Reacher-v2 with data obtained from expert policies with varying degrees of randomness. We first visualize predictions made by $C(s,\tau(s))$ when trained from a completely random policy (Figure 10). Because $\tau(s)$ aims to make predictions that maximize QSS, it is able to hallucinate plans that solve the underlying task. In InvertedPendulum-v2, $C(s,\tau(s))$ makes predictions that balance the pole, and in Reacher-v2, the arm moves directly to the goal location. As such, we have demonstrated that $\tau(s)$ can be trained from observations obtained from random policies.

Once we learn this model, we can use it to determine how to act in an environment. To do this, given a state $s$, we use $\tau(s)$ to propose the best next state to reach. In order to determine what action to take, we train an inverse dynamics model from a few steps taken in the environment, and use it to predict the action that the agent should take. We compare this to Behavioral Cloning from Observation (BCO) (Torabi et al., 2018), which aims to learn policies that mimic the data collected from the behavior policy.

As the data collected from the behavior policy becomes more random, D3G significantly outperforms BCO, and is able to achieve high reward when the demonstrations were collected from completely random policies (Table 1). This suggests that D3G is indeed capable of off-policy learning. Interestingly, D3G performs poorly when the data has 0% randomness. This is likely because off-policy learning requires that every state has some probability of being visited.

## 6 Related work

We now discuss several works related to QSS and D3G.

### 6.1 Hierarchical reinforcement learning

The concept of generating states is reminiscent of hierarchical reinforcement learning (Barto and Mahadevan, 2003), in which the policy is implemented as a hierarchy of sub-policies. In particular, approaches related to feudal reinforcement learning (Dayan and Hinton, 1993) rely on a manager policy providing goals (possibly indirectly, through sub-manager policies) to a worker policy. These goals generally map to actual environment states, either through a learned state representation as in FeUdal Networks (Vezhnevets et al., 2017), an engineered representation as in h-DQN (Kulkarni et al., 2016), or simply by using the same format as raw environment states as in HIRO (Nachum et al., 2018). One could think of the $\tau(s)$ function in QSS as operating like a manager by suggesting a target state, and of the $I(s,s')$ function as operating like a worker by providing an action that reaches that state. Unlike with hierarchical RL, however, both operate at the same time scale.

### 6.2 Goal generation

This work is also related to goal generation approaches in RL, where a goal is a set of desired states, and a policy is learned to act optimally toward reaching the goal. For example, Universal Value Function Approximators (Schaul et al., 2015) consider the problem of conditioning action-values with goals, $Q(s,a,g)$, where these goals (denoted as $g$), in the simplest formulation, are fixed by the environment. Recent advances in automatic curriculum building for RL reflect the importance of self-generated goals, where the intermediate goals of curricula towards a final objective are automatically generated by approaches such as automatic goal generation (Florensa et al., 2018), intrinsically motivated goal exploration processes (Forestier et al., 2017), and reverse curriculum generation (Florensa et al., 2017).

Nair et al. (2018) employ goal-conditioned value functions along with variational autoencoders (VAEs) to generate goals for self-supervised practice and for dense reward relabeling in hindsight. Similarly, IRIS (Mandlekar et al., 2019) trains conditional VAEs for goal prediction and action prediction for robot control. Sahni et al. (2019) use a GAN to hallucinate visual goals and combine it with hindsight experience replay (Andrychowicz et al., 2017) to increase sample efficiency. Unlike all these approaches that learn to generate or sample goals, in our method, goals are always a single step away, generated by maximizing the value of the neighboring state.

### 6.3 Learning from observation

Imitation from observation trains agents to imitate in settings where actions are not available (Sermanet et al., 2017; Liu et al., 2017; Torabi et al., 2018; Edwards et al., 2019; Torabi et al., 2019; Sun et al., 2019). Imitating when the action space differs between the agent and expert is a similar problem, and typically requires learning a correspondence (Kim et al., 2019; Liu et al., 2019). Our approach aims to learn from observations, rather than imitate. Deep Q-learning from Demonstrations similarly learns off-policy from demonstration data, but requires demonstrator actions (Hester et al., 2018).

Several works have considered predicting next states from observations, such as videos, which can be useful for planning or video prediction (Finn and Levine, 2017; Kurutach et al., 2018; Rybkin et al., 2018; Schmeckpeper et al., 2019). In our work, the model is trained automatically to make predictions that maximize the return.

### 6.4 Action reduction

QSS naturally combines actions that have the same effects. Recent works have aimed to express the similarities between actions to learn policies more quickly, especially over large action spaces. For example, one approach is to learn action embeddings, which could then be used to learn a policy (Chandak et al., 2019; Chen et al., 2019). Another approach is to directly learn about irrelevant actions and then eliminate them from being selected (Zahavy et al., 2018).

### 6.5 Successor Representations

The successor representation (Dayan, 1993) describes a state as the sum of expected occupancy of future states under the current policy. It allows for decoupling of the environment’s dynamics from immediate rewards when computing expected returns and can be conveniently learned using TD methods. Barreto et al. (2017) extend this concept to successor features, ψ^π(s, a). Successor features are the expected value of the discounted sum of d-dimensional features of transitions, φ(s, a, s′), under the policy π. In both cases, the decoupling of successor state occupancy or features from a representation of the reward allows easy transfer across tasks where the dynamics remain the same but the reward function can change. Once successor features are learned, they can be used to quickly learn action values for all such tasks. Similarly, QSS is able to transfer or share values when the underlying dynamics are the same but the action labels have changed.
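As a small worked example of this decoupling, the sketch below computes a tabular successor representation M = Σ_t γ^t P^t for a fixed policy and reuses it with two different reward vectors. The three-state transition matrix `P`, the discount `GAMMA`, and the reward vectors are hypothetical values chosen for illustration.

```python
GAMMA = 0.5                      # hypothetical discount factor
P = [[0.0, 1.0, 0.0],            # hypothetical policy-induced transitions:
     [0.0, 0.0, 1.0],            # 0 -> 1 -> 2, and state 2 is absorbing
     [0.0, 0.0, 1.0]]


def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def successor_representation(P, gamma, iters=100):
    """Iterate M <- I + gamma * P * M, which converges to sum_t gamma^t P^t."""
    n = len(P)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    M = [row[:] for row in eye]
    for _ in range(iters):
        PM = matmul(P, M)
        M = [[eye[i][j] + gamma * PM[i][j] for j in range(n)] for i in range(n)]
    return M


def values(M, r):
    """State values are the occupancy-weighted sum of per-state rewards."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]


M = successor_representation(P, GAMMA)
v_task_a = values(M, [0.0, 0.0, 1.0])   # reward only in state 2
v_task_b = values(M, [1.0, 0.0, 0.0])   # same dynamics, different reward
```

Here `M` is computed once from the dynamics, and each new reward vector only requires a matrix-vector product to obtain values.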

## 7 Conclusion

In this paper, we introduced QSS, a novel form of value function that expresses the utility of transitioning to a state and acting optimally thereafter. To train QSS, we developed Deep Deterministic Dynamics Gradients (D3G), which we used to train a model to make predictions that maximize QSS. We showed that QSS learns values similar to those of QSA, learns well in environments with redundant actions, and can transfer across shuffled actions. We additionally demonstrated that D3G can be used to learn complicated control tasks, can generate meaningful plans from observational data obtained from completely random policies, and can train agents to act from such data.

## 8 Acknowledgements

The authors thank Michael Littman for comments on related literature and further suggestions for the paper. We would also like to acknowledge Joost Huizinga, Felipe Petroski Such, and other members of Uber AI Labs for meaningful discussions about this work.

## References

• M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems 30, pp. 5048–5058. Cited by: §6.2.
• A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver (2017) Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065. Cited by: §6.5.
• A. G. Barto and S. Mahadevan (2003) Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems 13 (1-2), pp. 41–77. Cited by: §6.1.
• G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §5.2.
• Y. Chandak, G. Theocharous, J. Kostas, S. Jordan, and P. S. Thomas (2019) Learning action representations for reinforcement learning. arXiv preprint arXiv:1902.00183. Cited by: §6.4.
• Y. Chen, Y. Chen, Y. Yang, Y. Li, J. Yin, and C. Fan (2019) Learning action-transferable policy with action embedding. arXiv preprint arXiv:1909.02291. Cited by: §6.4.
• P. Dayan and G. E. Hinton (1993) Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278. Cited by: §6.1.
• P. Dayan (1993) Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624. Cited by: §6.5.
• A. D. Edwards, L. Downs, and J. C. Davidson (2018) Forward-backward reinforcement learning. arXiv preprint arXiv:1803.10227. Cited by: §4.3.2.
• A. Edwards, H. Sahni, Y. Schroecker, and C. Isbell (2019) Imitating latent policies from observation. In International Conference on Machine Learning, pp. 1755–1763. Cited by: §5.3, §6.3.
• C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. Cited by: §6.3.
• C. Florensa, D. Held, X. Geng, and P. Abbeel (2018) Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pp. 1514–1523. Cited by: §6.2.
• C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 482–495. Cited by: §6.2.
• S. Forestier, Y. Mollard, and P. Oudeyer (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190. Cited by: §6.2.
• S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §4.3.1.
• A. Goyal, P. Brakel, W. Fedus, T. Lillicrap, S. Levine, H. Larochelle, and Y. Bengio (2018) Recall traces: backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379. Cited by: §4.3.2.
• H. V. Hasselt (2010) Double q-learning. In Advances in neural information processing systems, pp. 2613–2621. Cited by: §4.2.
• T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. (2018) Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §6.3.
• K. H. Kim, Y. Gu, J. Song, S. Zhao, and S. Ermon (2019) Cross domain imitation learning. arXiv preprint arXiv:1910.00105. Cited by: §6.3.
• T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683. Cited by: §6.1.
• T. Kurutach, A. Tamar, G. Yang, S. J. Russell, and P. Abbeel (2018) Learning plannable representations with causal infogan. In Advances in Neural Information Processing Systems, pp. 8733–8744. Cited by: §6.3.
• T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §4.1.
• F. Liu, Z. Ling, T. Mu, and H. Su (2019) State alignment-based imitation learning. arXiv preprint arXiv:1911.10947. Cited by: §6.3.
• Y. Liu, A. Gupta, P. Abbeel, and S. Levine (2017) Imitation from observation: learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374. Cited by: §6.3.
• A. Mandlekar, F. Ramos, B. Boots, L. Fei-Fei, A. Garg, and D. Fox (2019) IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321. Cited by: §6.2.
• V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with Deep Reinforcement Learning. ArXiv e-prints. Cited by: §4.
• V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §4.
• O. Nachum, S. S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §6.1.
• A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §4.3.2.
• A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §6.2.
• O. Rybkin, K. Pertsch, K. G. Derpanis, K. Daniilidis, and A. Jaegle (2018) Learning what you can do before doing anything. arXiv preprint arXiv:1806.09655. Cited by: §6.3.
• H. Sahni, T. Buckley, P. Abbeel, and I. Kuzovkin (2019) Addressing sample complexity in visual tasks using her and hallucinatory gans. In Advances in Neural Information Processing Systems 32, pp. 5823–5833. Cited by: §6.2.
• T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In International conference on machine learning, pp. 1312–1320. Cited by: §6.2.
• K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn (2019) Learning predictive models from observation and interaction. arXiv preprint arXiv:1912.12773. Cited by: §6.3.
• P. Sermanet, C. Lynch, J. Hsu, and S. Levine (2017) Time-contrastive networks: self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888. Cited by: §6.3.
• W. Sun, A. Vemula, B. Boots, and J. A. Bagnell (2019) Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948. Cited by: §6.3.
• R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §1, §2, §3.2.
• E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1.
• F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954. Cited by: §5.3, §5.3, §6.3.
• F. Torabi, G. Warnell, and P. Stone (2019) Recent advances in imitation learning from observation. arXiv preprint arXiv:1905.13566. Cited by: §6.3.
• A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3540–3549. Cited by: §6.1.
• C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.
• T. Zahavy, M. Haroush, N. Merlis, D. J. Mankowitz, and S. Mannor (2018) Learn what not to learn: action elimination with deep reinforcement learning. In NeurIPS. Cited by: §6.4.

## Appendix A QSS Experiments

We ran all experiments in an 11x11 gridworld. The state was the agent’s location on the grid. The agent was initialized to and received a reward of until it reached the goal at and obtained a reward of and was reset to the initial position. The episode automatically reset after steps.

We used the same hyperparameters for QSA and QSS. We initialized the Q-values to . The learning rate was set to and the discount factor was set to . The agent followed an ε-greedy policy. Epsilon was set to and decayed to by subtracting 9e-6 every time step.
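The tabular setup can be sketched as follows, using the update Q(s, s′) ← Q(s, s′) + α [r + γ max over s′′ of Q(s′, s′′) − Q(s, s′)]. This is a minimal illustration rather than the experimental code: only the 11x11 grid and the 9e-6 epsilon decrement come from the text, while the start and goal cells, rewards, learning rate, discount, epsilon bounds, episode limit, zero initialization, and terminal handling are all assumed values. The agent is also assumed to pick a neighboring cell directly.

```python
import random
from collections import defaultdict

SIZE = 11                                # 11x11 grid, as in the text
START, GOAL = (0, 0), (10, 10)           # hypothetical positions
STEP_REWARD, GOAL_REWARD = -1.0, 1.0     # hypothetical rewards
ALPHA, GAMMA = 0.1, 0.99                 # hypothetical learning rate, discount
EPS_START, EPS_MIN = 1.0, 0.1            # hypothetical epsilon bounds
EPS_STEP = 9e-6                          # decrement per time step (from the text)
MAX_EPISODE_STEPS = 500                  # hypothetical episode limit


def neighbors(s):
    """In-bounds cells one move (up, down, left, right) away from s."""
    x, y = s
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in cand if 0 <= a < SIZE and 0 <= b < SIZE]


def qss_update(Q, s, s2, r, done):
    """Move Q(s, s') toward r + gamma * max_{s''} Q(s', s'')."""
    bootstrap = 0.0 if done else max(Q[(s2, s3)] for s3 in neighbors(s2))
    Q[(s, s2)] += ALPHA * (r + GAMMA * bootstrap - Q[(s, s2)])


def decay(eps):
    """Linear epsilon decay by a fixed decrement, floored at EPS_MIN."""
    return max(EPS_MIN, eps - EPS_STEP)


def train(total_steps, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)               # assumed zero initialization
    eps, s, t = EPS_START, START, 0
    for _ in range(total_steps):
        options = neighbors(s)
        if rng.random() < eps:           # explore: random neighboring state
            s2 = rng.choice(options)
        else:                            # exploit: highest-valued neighbor
            s2 = max(options, key=lambda n: Q[(s, n)])
        done = s2 == GOAL
        qss_update(Q, s, s2, GOAL_REWARD if done else STEP_REWARD, done)
        eps, t = decay(eps), t + 1
        if done or t >= MAX_EPISODE_STEPS:
            s, t = START, 0              # reset to the initial position
        else:
            s = s2
    return Q, eps
```

Calling `train(n)` returns the learned table and the final epsilon. The equivalent QSA learner would index the table by (state, action) instead of (state, next state).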

## Appendix B D3G Experiments

We used the TD3 implementation from https://github.com/sfujim/TD3 for our experiments. We also used the “OurDDPG” implementation of DDPG. We built our own implementation of D3G from this codebase. We used the default hyperparameters for all of our experiments, as described in Table 2. The replay buffer was filled for steps before learning. All continuous experiments added noise for exploration. In gridworld, the agent followed an ε-greedy policy. Epsilon was set to and decayed to by subtracting 9e-6 every time step.

We ran these experiments in an 11x11 gridworld. The state was the agent’s location on the grid. The agent was initialized to and received a reward of until it reached the goal at and obtained a reward of and was reset to the initial position. The episode automatically reset after steps.

We ran these experiments in the OpenAI Gym MuJoCo environments (https://github.com/openai/gym). We used gym==0.14.0 and mujoco-py==2.0.2. The agent’s state was a vector from the MuJoCo simulator.

### B.3 Learning from Observation Experiments

We used TD3 to train an expert and used the learned policy to obtain demonstrations for learning from observation. We collected samples using the learned policy and took a random action either 0, 25, 50, 75, or 100 percent of the time, depending on the experiment. The samples consisted of the state, reward, next state, and done condition.
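The data collection described above can be sketched as follows. This is an illustrative stand-in: a hypothetical six-state chain replaces the MuJoCo tasks, a hand-written `expert` replaces the TD3 policy, and the sample count of 1000 is an assumed value. The point is that each stored tuple holds the state, reward, next state, and done flag, but not the action.

```python
import random


def collect_observations(reset, step, expert, sample_action, n, random_frac, seed=0):
    """Roll out the expert, replacing its action with a random one
    with probability random_frac, and store action-free tuples."""
    rng = random.Random(seed)
    data, s = [], reset()
    for _ in range(n):
        a = sample_action(rng) if rng.random() < random_frac else expert(s)
        s2, r, done = step(s, a)
        data.append((s, r, s2, done))    # the action is deliberately dropped
        s = reset() if done else s2
    return data


# Hypothetical toy environment: states 0..5, actions -1/+1, goal at state 5.
def reset():
    return 0


def step(s, a):
    s2 = min(max(s + a, 0), 5)
    done = s2 == 5
    return s2, (1.0 if done else 0.0), done


def expert(s):
    return 1                             # stand-in for the trained expert


def sample_action(rng):
    return rng.choice([-1, 1])


# One dataset per randomness level used in the experiments.
datasets = {p: collect_observations(reset, step, expert, sample_action, 1000, p / 100)
            for p in (0, 25, 50, 75, 100)}
```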

We trained BCO with for iterations. During each iteration, we collected samples from the environment using a Behavioral Cloning (BC) policy with added noise , then trained an inverse dynamics model for steps, labeled the observational data using this model, then finally trained the BC policy with this labeled data for steps.
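The four stages of each BCO iteration can be sketched in tabular form. This is a toy analogue under stated assumptions, not the networks used in the experiments: the chain environment, the count-based inverse dynamics model, the majority-vote cloning step, and the values of `iterations`, `samples_per_iter`, and `noise` are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def step(s, a):
    """Hypothetical chain: states 0..5, actions -1/+1."""
    return min(max(s + a, 0), 5)


def bco(observations, iterations=3, samples_per_iter=200, noise=0.3, seed=0):
    rng = random.Random(seed)
    inv_counts = defaultdict(Counter)    # (s, s') -> counts of actions seen
    policy = {}                          # s -> cloned action
    for _ in range(iterations):
        s = 0
        for _ in range(samples_per_iter):
            # 1) interact using the current BC policy plus exploration noise
            a = policy.get(s, rng.choice([-1, 1]))
            if rng.random() < noise:
                a = rng.choice([-1, 1])
            s2 = step(s, a)
            # 2) fit the inverse dynamics model (here: count actions)
            inv_counts[(s, s2)][a] += 1
            s = 0 if s2 == 5 else s2
        # 3) label the state-only demonstrations with inferred actions
        labeled = [(s, inv_counts[(s, s2)].most_common(1)[0][0])
                   for s, s2 in observations if inv_counts[(s, s2)]]
        # 4) behavioral cloning on the labeled data (here: majority vote)
        votes = defaultdict(Counter)
        for s, a in labeled:
            votes[s][a] += 1
        policy = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return policy


demos = [(s, s + 1) for s in range(5)] * 10   # expert always moves right
policy = bco(demos)
```

In the experiments both models are neural networks trained for a fixed number of gradient steps; the counting and voting here only stand in for those training stages.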

We trained D3G with for time steps without any environment interactions. This allowed us to learn the model which informed the agent of what state it should reach. Similarly to BCO, we used some environment interactions to train an inverse dynamics model for D3G. We ran this training loop for iterations as well. During each iteration, we collected samples from the environment using the inverse dynamics policy with added noise , then trained this model for steps.
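The two phases described above can be sketched in tabular form: values and the next-state model are first fit from state-only data with no environment interaction, and an inverse dynamics model is then learned by interacting. The chain environment, `GAMMA`, `ALPHA`, the sweep and step counts, and the noise level are hypothetical, and the lookup tables stand in for the neural networks used in the experiments.

```python
import random
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5                  # hypothetical values


def step(s, a):
    """Hypothetical chain: states 0..5, actions -1/+1, goal at state 5."""
    return min(max(s + a, 0), 5)


# State-only tuples (s, r, s', done); the generating action is discarded.
data = []
for s in range(5):
    for a in (-1, 1):
        s2 = step(s, a)
        data.append((s, float(s2 == 5), s2, s2 == 5))


def learn_qss_offline(data, sweeps=200):
    """Phase 1: fit Q(s, s') and a greedy next-state model without actions."""
    succ = defaultdict(set)
    for s, _, s2, _ in data:
        succ[s].add(s2)
    Q = defaultdict(float)
    for _ in range(sweeps):
        for s, r, s2, done in data:
            nxt = 0.0 if done else max(Q[(s2, s3)] for s3 in succ[s2])
            Q[(s, s2)] += ALPHA * (r + GAMMA * nxt - Q[(s, s2)])
    model = {s: max(sorted(n), key=lambda x: Q[(s, x)]) for s, n in succ.items()}
    return Q, model


def learn_inverse_dynamics(model, steps=300, noise=0.3, seed=0):
    """Phase 2: interact to learn which action produces each transition."""
    rng = random.Random(seed)
    inv, s = {}, 0
    for _ in range(steps):
        a = inv.get((s, model[s]))       # action believed to reach the target
        if a is None or rng.random() < noise:
            a = rng.choice((-1, 1))
        s2 = step(s, a)
        inv[(s, s2)] = a
        s = 0 if s2 == 5 else s2
    return inv


Q, model = learn_qss_offline(data)
inv = learn_inverse_dynamics(model)


def act(s):
    """Action for the proposed next state, or None if not yet learned."""
    return inv.get((s, model[s]))
```

The model proposes the next state to reach and the inverse dynamics model supplies the action, so `act` is the only place where actions enter.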

## Appendix C Architectures

Architectures were specified for the following networks. Entries marked "max action" include a max-action term.

• D3G Model
• D3G Forward Dynamics Model
• D3G Forward Dynamics Model (Imitation)
• D3G Inverse Dynamics Model (Continuous): max action
• D3G Inverse Dynamics Model (Discrete)
• D3G Critic
• TD3 Actor: max action
• TD3 Critic
• DDPG Actor: max action
• DDPG Critic
• BCO Behavioral Cloning Model: max action
• BCO Inverse Dynamics Model: max action