1 Introduction
The goal of reinforcement learning is to learn how to act so as to maximize long-term reward. A solution is usually formulated as finding the optimal policy, i.e., selecting the optimal action given a state. A popular approach for finding this policy is to learn a function that defines values through actions, $Q(s,a)$, where $\max_a Q(s,a)$ is a state's value and $\arg\max_a Q(s,a)$ is the optimal action (Sutton and Barto, 1998). We will refer to this approach as QSA.
Here, we propose an alternative formulation for off-policy reinforcement learning that defines values solely through states, rather than actions. In particular, we introduce $Q(s,s')$, or simply QSS, which represents the value of transitioning from one state $s$ to a neighboring state $s' \in N(s)$ and then acting optimally thereafter:
$$Q(s,s') = r(s,s') + \gamma \max_{s'' \in N(s')} Q(s',s'').$$
In this formulation, instead of proposing an action, the agent proposes a desired next state, which is fed into an inverse dynamics model that outputs the appropriate action to reach it (see Figure 1). We demonstrate that this formulation has several advantages. First, redundant actions that lead to the same transition are simply folded into one value estimate. Further, by removing actions, QSS becomes easier to transfer than a traditional Q function in certain scenarios, as it only requires learning an inverse dynamics function upon transfer, rather than a full policy or value function. Finally, we show that QSS can learn policies purely from observations of (potentially suboptimal) demonstrations with no access to demonstrator actions. Importantly, unlike other imitation-from-observation approaches, because it is off-policy, QSS can learn highly efficient policies even from suboptimal or completely random demonstrations.
In order to realize the benefits of off-policy QSS, we must obtain value-maximizing future state proposals without performing explicit maximization. Two problems arise in doing so. First, the set of neighbors of $s$ is not assumed to be known a priori, unlike the set of actions in discrete QSA, which is assumed to be provided by the MDP. Second, for continuous state and action spaces, the set of neighbors may be infinite, so maximizing over them explicitly is out of the question. To get around this difficulty, we draw inspiration from Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), which learns a policy $\pi(s)$ over continuous action spaces that maximizes $Q(s, \pi(s))$. We develop the analogous Deep Deterministic Dynamics Gradient (D3G), which trains a forward dynamics model $\tau(s)$ to predict next states that maximize $Q(s, \tau(s))$. Notably, this model is not conditioned on actions, and thus allows us to train QSS completely off-policy from observations alone.
We begin the next section by formulating QSS, then describe its properties within tabular settings. We then outline the case of using QSS in continuous settings, where we use D3G to train $\tau(s)$. We evaluate in both tabular problems and MuJoCo tasks (Todorov et al., 2012).
2 The QSS formulation for RL
We are interested in solving problems specified through a Markov Decision Process, which consists of states $s \in S$, actions $a \in A$, rewards $r(s,s') \in R$, and a transition model $T(s,a,s')$ that indicates the probability of transitioning to a specific next state given a current state and action, $P(s' \mid s, a)$ (Sutton and Barto, 1998). (Footnote 1: We use $s$ and $s'$ to denote states consecutive in time, which may alternately be denoted $s_t$ and $s_{t+1}$.) For simplicity, we refer to all rewards $r(s,s')$ as $r$ for the remainder of the paper. Reinforcement learning aims to find a policy $\pi(a \mid s)$ that expresses the probability of taking action $a$ in state $s$. We are typically interested in policies that maximize the long-term discounted return $R = \sum_{k=t}^{H} \gamma^{k-t} r_k$, where $\gamma$ is a discount factor that expresses the importance of long-term rewards and $H$ is the environmental horizon.
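The standard QSA method estimates the return using Q-values of actions (Watkins and Dayan, 1992):
$$Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s, a \right].$$
This expresses the value of taking an action in a state and acting optimally thereafter. QSA can be approximated using an approach known as Q-learning:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right].$$
QSA-learned policies can be formulated as:
$$\pi(s) = \arg\max_a Q(s,a).$$
We propose an alternative paradigm where we learn $Q(s,s')$, or the value of transitioning from state $s$ to state $s'$ and acting optimally thereafter. By analogy with standard QSA-learning, we express this quantity as:
$$Q^*(s,s') = r + \gamma \max_{s'' \in N(s')} Q^*(s',s''). \tag{1}$$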
Although this equation may be applied to any environment, for it to be a useful formulation, the environment must be deterministic. To see why, note that in QSA-learning, the max is over actions, which the agent has perfect control over, and any uncertainty in the environment is integrated out by the expectation. In QSS-learning the max is over next states, which in stochastic environments are not perfectly predictable. In such environments the above equation does faithfully track a certain value, but it may be considered the “best possible scenario value”: the value of a current and subsequent state assuming that any stochasticity the agent experiences turns out as well as possible for the agent. Concretely, this means we assume that the agent can transition reliably (with probability 1) to any state $s'$ that it is possible (with probability greater than 0) to reach from state $s$.
Of course, this will not hold for stochastic domains in general, in which case QSS-learning does not track an actionable value. While this limitation may seem severe, we will demonstrate that the QSS formulation affords us a powerful tool for use in deterministic environments, which we develop in the remainder of this article. Henceforth we assume that the transition function is deterministic, and the empirical results that follow show that our approach succeeds over a wide range of tasks, including tabular problems and MuJoCo simulations.
2.1 Bellman update for QSS
Figure 3: Learned values for tabular Q-learning in an 11x11 gridworld with stochastic transitions. The first two panels show heatmaps of Q-values for QSA and QSS in a gridworld with 100% slippage. The final panel shows the Euclidean distance between the learned values in QSA and QSS as the transitions become more stochastic (averaged over 10 seeds with 95% confidence intervals).
We first consider the simple setting where we have access to an inverse dynamics model $I(s,s') \to a$ that returns an action $a$ that would take the agent from state $s$ to $s'$. Next, we assume access to a function $N(s)$ that outputs the neighbors of $s$. We use this as an illustrative example and will formulate the problem without these assumptions in the next section.
We define the Bellman update for QSS-learning as:
$$Q(s,s') \leftarrow Q(s,s') + \alpha \left[ r + \gamma \max_{s'' \in N(s')} Q(s',s'') - Q(s,s') \right]. \tag{2}$$
Note that $Q(s,s')$ is undefined when $s$ and $s'$ are not neighbors. In order to obtain a policy, we define $\tau(s)$ as a function that selects a neighboring state from $N(s)$ that maximizes QSS:
$$\tau(s) = \arg\max_{s' \in N(s)} Q(s,s'). \tag{3}$$
In words, $\tau(s)$ selects states that have large value, and acts similarly to a policy over states. In order to obtain the policy over actions, we use the inverse dynamics model:
$$\pi(s) = I(s, \tau(s)). \tag{4}$$
This approach first finds the state $s'$ that maximizes $Q(s,s')$, and then uses $I(s,s')$ to determine the action that will take the agent there. We can rewrite Equation 2 as:
$$Q(s,s') \leftarrow Q(s,s') + \alpha \left[ r + \gamma Q(s', \tau(s')) - Q(s,s') \right]. \tag{5}$$
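The tabular update and the state-based policy can be sketched in a few lines of Python. This is a minimal illustration rather than the experimental code: the 4x4 grid, the constants, and every function name are assumptions made for the example, and the training loop sweeps all one-step transitions instead of sampling experience, purely so the result is deterministic.

```python
# Hypothetical 4x4 deterministic gridworld; every name here is illustrative.
SIZE, GOAL, GAMMA, ALPHA = 4, (3, 3), 0.9, 0.5
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(s, a):
    """Deterministic transition; moves into a wall leave the agent in place."""
    x, y = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (min(max(x, 0), SIZE - 1), min(max(y, 0), SIZE - 1))


def neighbors(s):
    """N(s): the distinct states reachable from s in one step."""
    return sorted({step(s, a) for a in MOVES})


def inverse_dynamics(s, s_next):
    """I(s, s'): an action that takes the agent from s to s'."""
    return next(a for a in MOVES if step(s, a) == s_next)


def tau(Q, s):
    """Equation 3: the neighbor of s with the largest Q(s, s')."""
    return max(neighbors(s), key=lambda n: Q.get((s, n), 0.0))


def qss_update(Q, s, s_next, r, done):
    """Equation 5, with no bootstrapping once the goal is reached."""
    bootstrap = 0.0 if done else Q.get((s_next, tau(Q, s_next)), 0.0)
    old = Q.get((s, s_next), 0.0)
    Q[(s, s_next)] = old + ALPHA * (r + GAMMA * bootstrap - old)


def train(sweeps=300):
    """Apply the update to every one-step transition, `sweeps` times."""
    Q = {}
    states = [(x, y) for x in range(SIZE) for y in range(SIZE) if (x, y) != GOAL]
    for _ in range(sweeps):
        for s in states:
            for s_next in neighbors(s):
                qss_update(Q, s, s_next, -1.0, s_next == GOAL)
    return Q


def rollout(Q, s=(0, 0), max_steps=50):
    """Equation 4: act with pi(s) = I(s, tau(s)) until the goal is reached."""
    path = [s]
    while s != GOAL and len(path) <= max_steps:
        s = step(s, inverse_dynamics(s, tau(Q, s)))
        path.append(s)
    return path
```

Note that the table is indexed by state pairs only; actions appear solely inside `step` and `inverse_dynamics`, which is what lets the value estimate ignore how a transition was produced.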
2.2 Equivalence of $Q(s,a)$ and $Q(s,s')$
Let us now investigate the relation between values learned using QSA and QSS. Consider an MDP with a deterministic state transition function and inverse dynamics function $I(s,s')$. QSS can be thought of as equivalent to using QSA to solve the sub-MDP containing only the set of actions returned by $I(s,s')$ for every state $s$:
$$Q(s,s') = Q(s, I(s,s')).$$
Because the MDP solved by QSS is a sub-MDP of that solved by QSA, there must always be at least one action $a$ for which $Q(s,a) \geq \max_{s'} Q(s,s')$.
The original MDP may contain additional actions not returned by $I(s,s')$, but following our assumptions, their return must be less than or equal to that of the action $I(s,s')$. Since this is also true in every state following $s$, we must have:
$$Q(s,a) \leq \max_{s'} Q(s, I(s,s')) \quad \text{for all } a.$$
Thus we obtain the following equivalence between QSA and QSS for deterministic environments:
$$\max_{s'} Q(s,s') = \max_a Q(s,a).$$
This equivalence will allow us to learn accurate action-values without dependence on the action space.
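The equivalence can be checked numerically on a small deterministic problem. The sketch below assumes a hypothetical four-state chain with a duplicated action and a reward of -1 per step; it solves both formulations by synchronous value iteration (a stand-in for Q-learning, chosen so the result is exact) and compares the resulting state values.

```python
GAMMA, GOAL, N_STATES = 0.9, 3, 4
ACTIONS = ["left", "right", "right_dup", "stay"]  # "right_dup" is redundant


def step(s, a):
    """Deterministic chain dynamics, clipped at both ends."""
    delta = {"left": -1, "right": 1, "right_dup": 1, "stay": 0}[a]
    return min(max(s + delta, 0), N_STATES - 1)


def reward(s, s_next):
    """Illustrative reward: -1 for every transition."""
    return -1.0


def solve(iterations=100):
    """Value iteration for Q(s, a) and Q(s, s') on the same MDP."""
    states = [s for s in range(N_STATES) if s != GOAL]
    qsa = {(s, a): 0.0 for s in states for a in ACTIONS}
    qss = {(s, step(s, a)): 0.0 for s in states for a in ACTIONS}
    for _ in range(iterations):
        v_sa = {s: max(qsa[s, a] for a in ACTIONS) for s in states}
        v_ss = {s: max(v for (u, _), v in qss.items() if u == s) for s in states}
        v_sa[GOAL] = v_ss[GOAL] = 0.0  # terminal state
        qsa = {(s, a): reward(s, step(s, a)) + GAMMA * v_sa[step(s, a)]
               for s, a in qsa}
        qss = {(s, n): reward(s, n) + GAMMA * v_ss[n] for s, n in qss}
    return qsa, qss


qsa, qss = solve()
for s in range(N_STATES - 1):
    best_action_value = max(qsa[s, a] for a in ACTIONS)
    best_state_value = max(v for (u, _), v in qss.items() if u == s)
    print(s, round(best_action_value, 4), round(best_state_value, 4))
```

The QSS table has one entry per distinct transition, so the duplicated action adds an entry to the QSA table but not to the QSS table.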
3 QSS in tabular settings
In simple settings where the state space is discrete, $Q(s,s')$ can be represented by a table. We use this setting to highlight some of the properties of QSS. In each experiment, we evaluate within a simple 11x11 gridworld where an agent, initialized at $\langle 0, 0 \rangle$, navigates in each cardinal direction and receives a reward of $-1$ until it reaches the goal, unless otherwise noted.
3.1 Example of equivalence of QSA and QSS
We first examine the values learned by QSS (Figure 2). The output of QSS increases as the agent gets closer to the goal, which indicates that QSS learns meaningful values for this task. Additionally, the difference in value between $\max_{s'} Q(s,s')$ and $\max_a Q(s,a)$ approaches zero as the values of QSS and QSA converge. Hence, QSS learns values similar to those of QSA in this deterministic setting.
3.2 Example of QSS in a stochastic setting
The next experiment measures the impact of stochastic transitions on learning using QSS. To investigate this property, we add a probability of slipping to each transition, where the agent takes a random action (i.e., slips into an unintended next state) some percentage of the time. First, we notice that the values learned by QSS when transitions have 100% slippage (completely random actions) are quite different from those learned by QSA (Figures 3(a) and 3(b)). In fact, the values learned by QSS are similar to those from the previous experiment, where there was no stochasticity in the environment (Figure 2(b)). As the transitions become more stochastic, the distance between values learned by QSA and QSS vastly increases (Figure 3(c)). This provides evidence that the formulation of QSS assumes the best possible transition will occur, thus causing the values to be overestimated in stochastic settings.
Curiously, QSS solves this task more quickly than QSA, even though it learns incorrect values (Figure 4). One hypothesis is that the slippage causes the agent to stumble into the goal state, which is beneficial for QSS because it directly updates values based on state transitions. The correct action that enables this transition is known using the given inverse dynamics model. QSA, on the other hand, would need to learn how the stochasticity of the environment affects the selected action’s outcome, and so the values may propagate more slowly.
We now study the case when stochasticity may lead to negative effects for QSS. We modify the gridworld to include a cliff along the bottom edge, similar to the example in Sutton and Barto (1998). The agent is initialized on top of the cliff, and if it attempts to step down, it falls off and the episode is reset. Furthermore, the cliff is “windy”, and the agent has a probability of falling off the edge while walking next to it. The reward here is $0$ everywhere except the goal, which has a reward of $1$. Here, we see the effect of stochasticity is detrimental to QSS (Figure 5), as it does not account for falling and instead expects to transition towards the goal.
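The best-case bias can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not the gridworld experiment: a hypothetical three-state chain with actions {right, stay}, a reward of -1 per step, and a slip that replaces the chosen action with a uniformly random one. The QSA-style values take the expectation over outcomes, while the QSS-style values take the max over reachable next states and therefore do not depend on the slip probability at all.

```python
def qsa_values(p_slip, gamma=0.9, n=3, iters=500):
    """Expected-return values of always choosing 'right' on an n-state chain.

    With probability p_slip a uniformly random action from {right, stay}
    replaces the chosen one, so the agent stays put with probability p_slip / 2.
    """
    stay = p_slip / 2.0
    v = [0.0] * (n + 1)  # v[n] is the terminal goal state
    for _ in range(iters):
        for s in range(n):
            v[s] = -1.0 + gamma * ((1.0 - stay) * v[s + 1] + stay * v[s])
    return v[:n]


def qss_values(gamma=0.9, n=3, iters=500):
    """Max over the reachable next states {s, s + 1}; independent of p_slip."""
    v = [0.0] * (n + 1)
    for _ in range(iters):
        for s in range(n):
            v[s] = max(-1.0 + gamma * v[t] for t in (s, s + 1))
    return v[:n]


def gap(p_slip):
    """Euclidean distance between the two value vectors."""
    pairs = zip(qsa_values(p_slip), qss_values())
    return sum((a - b) ** 2 for a, b in pairs) ** 0.5


for p in (0.0, 0.5, 1.0):
    print(p, round(gap(p), 4))
```

The gap is zero without slippage and grows with the slip probability, mirroring the qualitative trend described for the gridworld.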
3.3 QSS handles redundant actions
One benefit of training QSS is that the transitions from one action can be used to learn values for another action. Consider the setting where two actions in a given state transition to the same next state. QSA would need to make updates for both actions in order to learn their values. But QSS only updates the transitions, thus ignoring any redundancy in the action space. We further investigate this property in a gridworld with redundant actions. Suppose an agent has four underlying actions, up, down, left, and right, but these actions are duplicated a number of times. As the number of redundant actions increases, the performance of QSA deteriorates, whereas QSS remains unaffected (Figure 6).
We also evaluate how QSS is impacted when the inverse dynamics model is learned rather than given (Figure 6). To do this, we instantiate $I(s,s')$ as a set that is updated with the action taken whenever the transition from $s$ to $s'$ is observed. We sample from this set any time $I$ is called, and return a random sample over all redundant actions if $I(s,s') = \emptyset$.
Even when the inverse dynamics model is learned, QSS is able to perform well because it only needs to learn about a single action that transitions from $s$ to $s'$.
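A set-based inverse dynamics model of this kind might look as follows. The class name, method names, and action labels are hypothetical; the sketch only mirrors the description above (store observed actions per transition, sample from them, and fall back to a random action when nothing has been observed).

```python
import random
from collections import defaultdict


class TabularInverseDynamics:
    """Learned I(s, s') stored as a set of observed actions per transition."""

    def __init__(self, actions, seed=0):
        self.actions = list(actions)   # all actions, including redundant ones
        self.table = defaultdict(set)  # (s, s') -> actions seen to cause it
        self.rng = random.Random(seed)

    def update(self, s, a, s_next):
        """Record that taking a in s led to s_next."""
        self.table[(s, s_next)].add(a)

    def __call__(self, s, s_next):
        """Sample a known action, or any action if the transition is unseen."""
        known = self.table.get((s, s_next))
        pool = sorted(known) if known else self.actions
        return self.rng.choice(pool)


model = TabularInverseDynamics(["up_0", "up_1", "right_0", "right_1"])
model.update((0, 0), "right_1", (1, 0))
print(model((0, 0), (1, 0)))  # the single action recorded for this transition
```

A single observed action per transition is enough for the policy to act, regardless of how many duplicates of that action exist.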
3.4 QSS enables value function transfer of permuted actions
The final experiment in the tabular setting considers the scenario of transferring to an environment where the meaning of actions has changed. We imagine this could be useful in environments where the physics are similar but the actions have been labeled differently. In this case, QSS values should directly transfer, but not the inverse dynamics, which would need to be retrained from scratch. We trained QSA and QSS in an environment where the actions were labeled as 0, 1, 2, and 3, then transferred the learned values to an environment where the labels were shuffled. We found that QSS was able to learn much more quickly in the transferred environment than QSA (Figure 6). Hence, we were able to retrain the inverse dynamics model more quickly than the values for QSA. Interestingly, QSA also learns quickly with the transferred values. This is likely because the Q-table is initialized to values that are closer to the true values than a uniformly initialized table.
4 Extending to the continuous domain with D3G
In contrast to domains where the state space is discrete and both QSA and QSS can represent relevant functions with a table, in continuous settings or environments with large state spaces we must approximate values with function approximation. One such approach is Deep Q-learning, which uses a deep neural network to approximate QSA (Mnih et al., 2013, 2015). The loss is formulated as $L_\theta = \| y - Q_\theta(s,a) \|$, where $y = r + \gamma \max_{a'} Q_{\theta'}(s',a')$. Here, $Q_{\theta'}$ is a target network that stabilizes training. Training is further improved by sampling experience from a replay buffer to decorrelate the sequential data observed in an episode.
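As a rough sketch of the target and loss, the snippet below treats the online and target networks as plain callables `q(s, a)` and `q_target(s, a)`. The function names, the squared-error form of the loss, and the toy transitions are assumptions for illustration; a real implementation would use a neural network and minibatch gradient descent.

```python
import random
from collections import deque


def dqn_target(r, s_next, done, q_target, actions, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrap at episode end."""
    if done:
        return r
    return r + gamma * max(q_target(s_next, a) for a in actions)


def dqn_loss(batch, q, q_target, actions, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s', done) tuples."""
    errors = [
        (dqn_target(r, s_next, done, q_target, actions, gamma) - q(s, a)) ** 2
        for s, a, r, s_next, done in batch
    ]
    return sum(errors) / len(errors)


# Replay buffer: a bounded queue sampled uniformly to decorrelate updates.
buffer = deque(maxlen=1000)
buffer.append((0, "right", -1.0, 1, False))
buffer.append((1, "right", -1.0, 2, True))
batch = random.Random(0).sample(list(buffer), k=2)
```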
4.1 Deep Deterministic Policy Gradients
Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) applies Deep Q-learning to problems with continuous actions. Instead of computing a max over actions for the target $y$, it uses the output of a policy that is trained to maximize a critic $Q$: $y = r + \gamma Q_{\theta'}(s', \pi_{\psi'}(s'))$. Here, $\pi_\psi(s)$ is known as an actor and is trained using the following loss:
$$L_\psi = -Q_\theta(s, \pi_\psi(s)).$$
This approach uses a target network $\theta'$ that is moved slowly towards $\theta$ by updating the parameters as $\theta' \leftarrow \eta \theta + (1 - \eta) \theta'$, where $\eta$ determines how smoothly the parameters are updated. A target policy network $\psi'$ is also used when training $Q$, and is updated similarly to $\theta'$.
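The slow-moving target update is a simple interpolation between the two parameter sets. A minimal sketch, assuming parameters are stored as flat lists of floats (the function name and the default step size are illustrative):

```python
def soft_update(target_params, online_params, eta=0.005):
    """Return eta * online + (1 - eta) * target, element by element."""
    return [eta * w + (1.0 - eta) * w_t
            for w, w_t in zip(online_params, target_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, eta=0.1)  # first entry moves 10% of the way
```

Smaller values of `eta` make the target change more slowly, which trades tracking speed for stability of the bootstrapped target.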
4.2 Twin Delayed DDPG
Twin Delayed DDPG (TD3) is a more stable variant of DDPG. One improvement is to delay the updates of the target networks and actor to be slower than the critic updates by a delay parameter $d$. Additionally, TD3 utilizes Double Q-learning (Hasselt, 2010) to reduce overestimation bias in the critic updates. Instead of training a single critic, this approach trains two and uses the one that minimizes the output of $y$:
$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \pi_{\psi'}(s')).$$
The loss for the critics becomes:
$$L_\theta = \sum_i \| y - Q_{\theta_i}(s,a) \|.$$
Finally, Gaussian noise is added to the policy when sampling actions. We use each of these techniques in our own approach.
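These three ingredients can be sketched with plain callables standing in for the networks. The function names, the squared-error loss, the scalar action, and the noise scale and clipping bounds are all assumptions made for the illustration.

```python
import random


def td3_target(r, s_next, done, critic1_t, critic2_t, actor_t, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller target critic."""
    a_next = actor_t(s_next)
    q_min = min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
    return r if done else r + gamma * q_min


def td3_critic_loss(y, s, a, critic1, critic2):
    """Sum of the squared errors of both critics against the shared target."""
    return (y - critic1(s, a)) ** 2 + (y - critic2(s, a)) ** 2


def noisy_action(actor, s, rng, sigma=0.1, low=-1.0, high=1.0):
    """Exploration: add Gaussian noise to the actor output, then clip."""
    return min(high, max(low, actor(s) + rng.gauss(0.0, sigma)))


rng = random.Random(0)
action = noisy_action(lambda s: 0.5, s=0.0, rng=rng)
```

Taking the minimum of the two critics biases the target downward, which counteracts the upward bias introduced by maximizing over a noisy value estimate.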
4.3 Deep Deterministic Dynamics Gradients (D3G)
A clear difficulty with training QSS in continuous settings is that it is not possible to iterate over an infinite state space to find a maximizing neighboring state. Instead, we propose training a model to directly output the state that maximizes QSS. We introduce an analogous approach to TD3 for training QSS, Deep Deterministic Dynamics Gradients (D3G). Like the deterministic policy gradient formulation $Q(s, \pi_\psi(s))$, D3G learns a model $\tau_\psi(s) \to s'$ that makes predictions that maximize $Q(s, \tau_\psi(s))$. To train the critic, we specify the loss as:
$$L_\theta = \sum_i \| y - Q_{\theta_i}(s,s') \|. \tag{6}$$
Here, the target $y$ is specified as:
$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tau_{\psi'}(s')). \tag{7}$$
Similar to TD3, we utilize two critics to stabilize training and a target network for $Q$. We additionally use a target network for $\tau$, which is updated slowly as $\psi' \leftarrow \eta \psi + (1 - \eta) \psi'$.
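We train $\tau$ to maximize the expected return, $J$, starting from any state $s$:
$$\begin{aligned} \nabla_\psi J &= \mathbb{E}\left[ \nabla_\psi Q(s,s') \big|_{s' = \tau_\psi(s)} \right] \\ &= \mathbb{E}\left[ \nabla_{s'} Q(s,s') \, \nabla_\psi \tau_\psi(s) \right] \quad \text{[using chain rule]} \end{aligned} \tag{8}$$
This can be accomplished by minimizing the following loss:
$$L_\psi = -Q_\theta(s, \tau_\psi(s)).$$
We discuss in the next section how this formulation alone may be problematic.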
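The chain-rule gradient can be illustrated with a one-parameter toy problem in which both derivatives are available in closed form. Everything here is hypothetical: the critic is a fixed quadratic that peaks at $s' = s + 1$, and the model is $\tau_\psi(s) = s + \psi$, so gradient ascent on $J$ should drive $\psi$ toward 1.

```python
def q(s, s_next):
    """Hypothetical fixed critic, maximal when s' = s + 1."""
    return -(s_next - (s + 1.0)) ** 2


def dq_ds_next(s, s_next):
    """Gradient of the critic with respect to the proposed next state."""
    return -2.0 * (s_next - (s + 1.0))


def tau(psi, s):
    """One-parameter state-proposal model: tau_psi(s) = s + psi."""
    return s + psi


def dtau_dpsi(psi, s):
    """Gradient of the model output with respect to its parameter."""
    return 1.0


def train_tau(psi=0.0, lr=0.1, steps=200, states=(0.0, 1.0, 2.0)):
    """Ascend J by averaging dQ/ds' * dtau/dpsi over a batch of states."""
    for _ in range(steps):
        grad = sum(dq_ds_next(s, tau(psi, s)) * dtau_dpsi(psi, s)
                   for s in states) / len(states)
        psi += lr * grad  # ascent on J, i.e. descent on L_psi = -Q
    return psi


psi_star = train_tau()
print(round(psi_star, 6))
```

Because this toy critic is unbounded only downward, the model cannot exploit it; with a learned critic that overestimates some regions, the same ascent would chase those errors.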
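As in the tabular case, $\tau_\psi(s)$ acts as a policy over states that aims to maximize $Q$, except now it is being trained to do so. To obtain the necessary action, we apply an inverse dynamics model $I$ as before:
$$\pi(s) = I_\omega(s, \tau_\psi(s)). \tag{9}$$
Now, $I$ is a neural network trained with data $\langle s, a, s' \rangle$. The loss is:
$$L_\omega = \| I_\omega(s,s') - a \|. \tag{10}$$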
4.3.1 Cycle consistency
DDPG has been shown to overestimate the values of the critic, resulting in a policy that exploits this bias (Fujimoto et al., 2018). Similarly, with the current formulation of the D3G loss, $\tau_\psi(s)$ can suggest non-neighboring states whose value the critic has overestimated. To overcome this, we propose regularizing $\tau_\psi(s)$ by ensuring the proposed states are reachable in a single step. In particular, we introduce an additional function for ensuring cycle consistency, $C(s, \tau_\psi(s))$ (see Algorithm 2).
We use this regularizer as a substitute for training interactions with $\tau$. As shown in Figure 7, given a state $s$, we use $\tau(s)$ to predict the next state $s'_\tau$. We use the inverse dynamics model $I(s, s'_\tau)$ to determine the action $a$ that would yield this transition. We then plug that action into a forward dynamics model $f(s,a)$ to obtain the final next state, $s'_f$. In other words, we regularize $\tau$ to make predictions that are consistent with the inverse and forward dynamics models learned from environment data.
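To train the forward dynamics model, we compute:
$$L_\phi = \| f_\phi(s,a) - s' \|. \tag{11}$$
We can then compute the cycle loss for $\tau_\psi$:
$$L_\psi = -Q_\theta(s, C(s, \tau_\psi(s))) + \beta \| \tau_\psi(s) - C(s, \tau_\psi(s)) \|, \tag{12}$$
where $\beta$ weights the second term, a regularizer that further encourages $\tau_\psi$ to predict valid next states. The final target for training $Q$ becomes:
$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', C(s', \tau_{\psi'}(s'))). \tag{13}$$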
We train each of these models concurrently. The full training procedure is described in Algorithm 1.
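The cycle function can be sketched with hand-written stand-ins for the learned models. The example assumes a hypothetical one-dimensional environment whose actions are displacements clipped to $[-1, 1]$; the loss shown (a critic term on the cycled state plus a weighted distance penalty) is one plausible form of the regularized objective, and the weight `beta` is an illustrative parameter.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def inverse_dynamics(s, s_next):
    """Stand-in for a learned I(s, s'): the closest feasible displacement."""
    return clip(s_next - s, -1.0, 1.0)


def forward_dynamics(s, a):
    """Stand-in for a learned f(s, a) in this toy environment."""
    return s + a


def cycle(s, s_proposed):
    """C(s, tau(s)): map a proposal to an action, then back to a next state."""
    a = inverse_dynamics(s, s_proposed)
    return forward_dynamics(s, a)


def cycle_loss(s, s_proposed, q, beta=1.0):
    """Critic term on the cycled state plus a penalty for leaving it."""
    s_cycled = cycle(s, s_proposed)
    return -q(s, s_cycled) + beta * abs(s_proposed - s_cycled)


print(cycle(0.0, 5.0))  # an unreachable proposal is pulled back to 1.0
print(cycle(0.0, 0.5))  # a reachable proposal is returned unchanged
```

Since the critic is only ever evaluated at the cycled state, a proposal far outside the reachable set gains nothing from an overestimated value and is additionally penalized by its distance from the cycled state.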
4.3.2 A note on training dynamics models
We found it useful to train the models $\tau$ and $f$ to predict the difference between states, $\Delta = s' - s$, rather than the next state, as has been done in several other works (Nagabandi et al., 2018; Goyal et al., 2018; Edwards et al., 2018). As such, we compute $s'_\tau = s + \tau(s)$ to obtain the next state from $\tau(s)$, and $s'_f = s + f(s,a)$ to obtain the next state prediction for $f(s,a)$. We describe this implementation detail here for clarity of the paper.
5 D3G properties and results

We now describe several experiments that aimed to measure different properties of D3G. We include full training details of hyperparameters and architectures in the appendix.
5.1 Example of D3G in a gridworld
We first evaluate D3G within a simple 11x11 gridworld with discrete states and actions (Figure 8). The agent can move a single step in one of the cardinal directions, and obtains a reward of -1 until it reaches the goal. Because D3G uses an inverse dynamics model to determine actions, it is straightforward to apply it to this discrete setting.
These experiments examine if D3G learns meaningful values, predicts neighboring states, and makes realistic transitions toward the goal. We additionally investigate the merits of using a cycle loss.
We first visualize the values learned by D3G and D3G without cycle loss (D3G$^{-}$). The output of QSS increases for both methods as the agent moves closer to the goal (Figure 8). This indicates that D3G can be used to learn meaningful QSS values. However, D3G$^{-}$ vastly overestimates these values. (Footnote 2: One seed out of five in the D3G$^{-}$ experiments did yield a good value function, but we did not witness this problem of overestimation in D3G.) Hence, the cycle loss helps to reduce overestimation bias.
Next, we evaluate whether $\tau(s)$ learns to predict neighboring states. First, we set the agent to a state $s$. We then compute the minimum Manhattan distance of $\tau(s)$ to the neighbors $N(s)$. This experiment examines how close the predictions made by $\tau$ are to neighboring states.
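The distance metric used in this evaluation can be written down directly. The helper names and the handling of grid boundaries below are assumptions; predictions may be real-valued, since the model outputs continuous states.

```python
def neighbors(s, size=11):
    """Cells one cardinal step away from s that lie inside the grid."""
    x, y = s
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(i, j) for i, j in candidates
            if i in range(size) and j in range(size)]


def min_manhattan_to_neighbors(prediction, s, size=11):
    """Smallest Manhattan distance from a predicted state to any neighbor."""
    return min(abs(prediction[0] - n[0]) + abs(prediction[1] - n[1])
               for n in neighbors(s, size))


print(min_manhattan_to_neighbors((1, 0), (0, 0)))   # exact neighbor
print(min_manhattan_to_neighbors((3, 3), (0, 0)))   # diagonal overshoot
```

A value of zero means the prediction coincides with a true neighbor, and values below one mean it lies within one step of the nearest neighbor.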
In this task, D3G is able to predict states that are no more than one step away from the nearest neighbor on average (Figure 2). However, D3G$^{-}$ makes predictions that are significantly outside of the range of the grid. We see this further when visualizing a trajectory of state predictions made by $\tau$. D3G$^{-}$ simply makes predictions along the diagonal until it extends beyond the grid range. However, QSS learns to predict grid-like steps to the goal, as is required by the environment. This suggests that the cycle loss ensures predictions made by $\tau(s)$ are neighbors of $s$.
5.2 D3G can be used to solve control tasks
We next evaluate our approach in more complicated MuJoCo tasks from OpenAI Gym (Brockman et al., 2016). These experiments examine if D3G can be used to learn complex control tasks, and the impact of the cycle loss on training. We compare against TD3 and DDPG.
In several tasks, D3G is able to perform as well as TD3 and significantly outperforms DDPG (Figure 9). Without the cycle loss, D3G$^{-}$ is not able to accomplish any of the tasks. D3G does perform poorly in Humanoid-v2 and Walker2d-v2. Interestingly, DDPG also performs poorly in these tasks. Nevertheless, we have demonstrated that D3G can indeed be used to solve difficult control tasks. This introduces a new research direction for actor-critic methods: training a dynamics model, rather than a policy, whose predictions optimize the return. We demonstrate in the next section that this model is powerful enough to learn from observations obtained from completely random policies.
5.3 D3G enables learning from observations obtained from random policies
Imitation from observation is a technique for training agents to imitate in settings where actions are not available. Traditionally, approaches have assumed that the observational data was obtained from an expert, and train models to match the distribution of the underlying policy (Torabi et al., 2018; Edwards et al., 2019). Because $Q(s,s')$ does not include actions, we can use it to learn from observations, rather than imitate, in an off-policy manner. This allows learning from observation data from completely random policies.
To learn from observations, we assume we are given a dataset of state observations, rewards, and termination conditions obtained by some policy $\pi_o$. We then train D3G to learn QSS values and a model $\tau(s)$ offline without interacting with the environment. One problem is that we cannot use the cycle loss described in Section 4, as this data does not contain actions. Instead, we need another function that allows us to cycle from $\tau(s)$ to a predicted next state.
To do this, we make a novel observation. The forward dynamics model does not need to take in actions to predict the next state. It simply needs an input that can be used as a clue for predicting the next state. We propose using $Q(s,s')$ as a replacement for the action. Namely, we now train the forward dynamics model with the following loss:
$$L_\phi = \| f_\phi(s, Q_{\theta'}(s,s')) - s' \|. \tag{14}$$
Because $Q$ is changing, we use the target network $Q_{\theta'}$ when learning $f$. We can then use the same losses as before for training QSS and $\tau$, except we utilize the cycle function defined for imitation in Algorithm 2.
We argue that $Q$ is a good replacement for $a$ because, for a given state, different QSS values often indicate different neighboring states. While this may not always be useful (there can of course be multiple optimal states), we found that this worked well in practice.
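To make the idea concrete, the sketch below replaces the neural forward model with a nearest-match memory keyed by state and Q-value. This is a deliberately simple, nonparametric stand-in chosen for illustration, not the learned model described above; the class, the function names, and the toy critic are all hypothetical.

```python
class QConditionedForwardModel:
    """Nearest-match memory standing in for a learned f(s, Q(s, s'))."""

    def __init__(self):
        self.memory = {}  # state -> list of (q_value, next_state)

    def fit(self, transitions, q_target):
        """Store each observed (s, s') pair under the Q-value it receives."""
        for s, s_next in transitions:
            self.memory.setdefault(s, []).append((q_target(s, s_next), s_next))

    def predict(self, s, q_value):
        """Return the stored next state whose Q-value is closest to q_value."""
        return min(self.memory[s], key=lambda entry: abs(entry[0] - q_value))[1]


def cycle_from_observations(s, s_proposed, q_target, model):
    """Action-free cycle: the Q-value of the proposal plays the action's role."""
    return model.predict(s, q_target(s, s_proposed))


def toy_q(s, s_next):
    """Toy 1-D critic: the goal is state 10 and nearer states score higher."""
    return -abs(10 - s_next)


model = QConditionedForwardModel()
model.fit([(0, 1), (0, -1), (1, 2), (1, 0)], toy_q)
print(cycle_from_observations(0, 5, toy_q, model))  # proposal 5 maps to 1
```

In this toy case the two observed neighbors of each state receive different Q-values, so the Q-value identifies the neighbor uniquely; when several neighbors share a value, the lookup is ambiguous, which is the caveat noted above.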
To evaluate this hypothesis, we trained QSS in InvertedPendulum-v2 and Reacher-v2 with data obtained from expert policies with varying degrees of randomness. We first visualize predictions made by $\tau(s)$ when trained from a completely random policy (Figure 10). Because $\tau(s)$ aims to make predictions that maximize QSS, it is able to hallucinate plans that solve the underlying task. In InvertedPendulum-v2, $\tau(s)$ makes predictions that balance the pole, and in Reacher-v2, the arm moves directly to the goal location. As such, we have demonstrated that $\tau(s)$ can be trained from observations obtained from random policies.
Once we learn this model, we can use it to determine how to act in an environment. To do this, given a state $s$, we use $\tau(s)$ to propose the best next state to reach. In order to determine what action to take, we train an inverse dynamics model from a few steps taken in the environment, and use it to predict the action that the agent should take. We compare this to Behavioral Cloning from Observation (BCO) (Torabi et al., 2018), which aims to learn policies that mimic the data collected from $\pi_o$.
As the data collected from $\pi_o$ becomes more random, D3G significantly outperforms BCO, and is able to achieve high reward when the demonstrations were collected from completely random policies (Table 1). This suggests that D3G is indeed capable of off-policy learning. Interestingly, D3G performs poorly when the data has 0% randomness. This is likely because off-policy learning requires that every state has some probability of being visited.
6 Related work
We now discuss several works related to QSS and D3G.
6.1 Hierarchical reinforcement learning
The concept of generating states is reminiscent of hierarchical reinforcement learning (Barto and Mahadevan, 2003), in which the policy is implemented as a hierarchy of sub-policies. In particular, approaches related to feudal reinforcement learning (Dayan and Hinton, 1993) rely on a manager policy providing goals (possibly indirectly, through sub-manager policies) to a worker policy. These goals generally map to actual environment states, either through a learned state representation as in FeUdal Networks (Vezhnevets et al., 2017), an engineered representation as in h-DQN (Kulkarni et al., 2016), or simply by using the same format as raw environment states as in HIRO (Nachum et al., 2018). One could think of the $\tau$ function in QSS as operating like a manager by suggesting a target state, and of the $I$ function as operating like a worker by providing an action that reaches that state. Unlike with hierarchical RL, however, both operate at the same time scale.
6.2 Goal generation
This work is also related to goal generation approaches in RL, where a goal is a set of desired states, and a policy is learned to act optimally toward reaching the goal. For example, Universal Value Function Approximators (Schaul et al., 2015) consider the problem of conditioning action-values with goals, $Q(s,a,g)$, where these goals (denoted as $g$), in the simplest formulation, are fixed by the environment. Recent advances in automatic curriculum building for RL reflect the importance of self-generated goals, where the intermediate goals of curricula towards a final objective are automatically generated by approaches such as automatic goal generation (Florensa et al., 2018), intrinsically motivated goal exploration processes (Forestier et al., 2017), and reverse curriculum generation (Florensa et al., 2017).
Nair et al. (2018) employ goal-conditioned value functions along with variational autoencoders (VAEs) to generate goals for self-supervised practice and for dense reward relabeling in hindsight. Similarly, IRIS (Mandlekar et al., 2019) trains conditional VAEs for goal prediction and action prediction for robot control. Sahni et al. (2019) use a GAN to hallucinate visual goals and combine it with hindsight experience replay (Andrychowicz et al., 2017) to increase sample efficiency. Unlike all these approaches that learn to generate or sample goals, in our method, goals are always a single step away, generated by maximizing the value of the neighboring state.
6.3 Learning from observation
Imitation from Observation allows imitation learning without access to actions (Sermanet et al., 2017; Liu et al., 2017; Torabi et al., 2018; Edwards et al., 2019; Torabi et al., 2019; Sun et al., 2019). Imitating when the action space differs between the agent and expert is a similar problem, and typically requires learning a correspondence (Kim et al., 2019; Liu et al., 2019). Our approach aims to learn from observations, rather than imitate. Deep Q-learning from Demonstrations similarly learns off-policy from demonstration data, but requires demonstrator actions (Hester et al., 2018).
Several works have considered predicting next states from observations, such as videos, which can be useful for planning or video prediction (Finn and Levine, 2017; Kurutach et al., 2018; Rybkin et al., 2018; Schmeckpeper et al., 2019). In our work, the model $\tau$ is trained automatically to make predictions that maximize the return.
6.4 Action reduction
QSS naturally combines actions that have the same effects. Recent works have aimed to express the similarities between actions to learn policies more quickly, especially over large action spaces. For example, one approach is to learn action embeddings, which could then be used to learn a policy (Chandak et al., 2019; Chen et al., 2019). Another approach is to directly learn about irrelevant actions and then eliminate them from being selected (Zahavy et al., 2018).
6.5 Successor Representations
The successor representation (Dayan, 1993) describes a state as the sum of expected occupancy of future states under the current policy. It allows for decoupling of the environment’s dynamics from immediate rewards when computing expected returns and can be conveniently learned using TD methods. Barreto et al. (2017) extend this concept to successor features, $\psi^\pi(s,a)$. Successor features are the expected value of the discounted sum of $d$-dimensional features of transitions, $\phi(s,a,s')$, under the policy $\pi$. In both cases, the decoupling of successor state occupancy or features from a representation of the reward allows easy transfer across tasks where the dynamics remain the same but the reward function can change. Once successor features are learned, they can be used to quickly learn action values for all such tasks. Similarly, QSS is able to transfer or share values when the underlying dynamics are the same but the action labels have changed.
7 Conclusion
In this paper, we introduced QSS, a novel form of value function that expresses the utility of transitioning to a state and acting optimally thereafter. To train QSS, we developed Deep Deterministic Dynamics Gradients, which we used to train a model to make predictions that maximize QSS. We showed that the formulation of QSS learns values similar to those of QSA, naturally learns well in environments with redundant actions, and can transfer across shuffled actions. We additionally demonstrated that D3G can be used to learn complicated control tasks, can generate meaningful plans from observational data obtained from completely random policies, and can train agents to act from such data.
8 Acknowledgements
The authors thank Michael Littman for comments on related literature and further suggestions for the paper. We would also like to acknowledge Joost Huizinga, Felipe Petroski Such, and other members of Uber AI Labs for meaningful discussions about this work.
References
 Hindsight experience replay. In Advances in Neural Information Processing Systems 30, pp. 5048–5058. Cited by: §6.2.
 Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065. Cited by: §6.5.
 Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems 13 (12), pp. 41–77. Cited by: §6.1.
 Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §5.2.
 Learning action representations for reinforcement learning. arXiv preprint arXiv:1902.00183. Cited by: §6.4.
 Learning actiontransferable policy with action embedding. arXiv preprint arXiv:1909.02291. Cited by: §6.4.
 Feudal reinforcement learning. In Advances in neural information processing systems, pp. 271–278. Cited by: §6.1.
 Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624. Cited by: §6.5.
 Forwardbackward reinforcement learning. arXiv preprint arXiv:1803.10227. Cited by: §4.3.2.

Imitating latent policies from observation.
In
International Conference on Machine Learning
, pp. 1755–1763. Cited by: §5.3, §6.3.  Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. Cited by: §6.3.
 Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pp. 1514–1523. Cited by: §6.2.
 Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 482–495. Cited by: §6.2.
 Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190. Cited by: §6.2.
 Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §4.3.1.
 Recall traces: backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379. Cited by: §4.3.2.
 Double Q-learning. In Advances in neural information processing systems, pp. 2613–2621. Cited by: §4.2.
 Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §6.3.
 Cross domain imitation learning. arXiv preprint arXiv:1910.00105. Cited by: §6.3.
 Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683. Cited by: §6.1.
 Learning plannable representations with causal infogan. In Advances in Neural Information Processing Systems, pp. 8733–8744. Cited by: §6.3.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §4.1.
 State alignment-based imitation learning. arXiv preprint arXiv:1911.10947. Cited by: §6.3.
 Imitation from observation: learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374. Cited by: §6.3.
 IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321. Cited by: §6.2.
 Playing Atari with Deep Reinforcement Learning. ArXiv e-prints. Cited by: §4.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §4.
 Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §6.1.
 Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §4.3.2.
 Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §6.2.
 Learning what you can do before doing anything. arXiv preprint arXiv:1806.09655. Cited by: §6.3.
 Addressing sample complexity in visual tasks using her and hallucinatory gans. In Advances in Neural Information Processing Systems 32, pp. 5823–5833. Cited by: §6.2.
 Universal value function approximators. In International conference on machine learning, pp. 1312–1320. Cited by: §6.2.
 Learning predictive models from observation and interaction. arXiv preprint arXiv:1912.12773. Cited by: §6.3.
 Time-contrastive networks: self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888. Cited by: §6.3.
 Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948. Cited by: §6.3.
 Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §1, §2, §3.2.
 MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1.
 Behavioral cloning from observation. arXiv preprint arXiv:1805.01954. Cited by: §5.3, §5.3, §6.3.
 Recent advances in imitation learning from observation. arXiv preprint arXiv:1905.13566. Cited by: §6.3.
 Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3540–3549. Cited by: §6.1.
 Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.
 Learn what not to learn: action elimination with deep reinforcement learning. In NeurIPS, Cited by: §6.4.
Appendix A QSS Experiments
We ran all experiments in an 11x11 gridworld. The state was the agent’s location on the grid. The agent was initialized to and received a reward of until it reached the goal at and obtained a reward of and was reset to the initial position. The episode automatically reset after steps.
We used the same hyperparameters for QSA and QSS. We initialized the Q-values to . The learning rate was set to and the discount factor was set to . The agent followed an ε-greedy policy. Epsilon was set to and decayed to by subtracting 9e-6 every time step.
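The exploration schedule described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the start and end values of epsilon (1.0 and 0.1) are placeholder assumptions because those values are not recoverable from the text here, and the per-step decrement is read as 9e-6.

```python
import random


def decayed_epsilon(step, eps_start=1.0, eps_end=0.1, decrement=9e-6):
    """Subtract a fixed decrement per time step, never going below eps_end.

    eps_start and eps_end are assumed placeholder values, not taken from the paper.
    """
    return max(eps_end, eps_start - decrement * step)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random index with probability epsilon, otherwise the argmax index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


if __name__ == "__main__":
    for step in (0, 50_000, 1_000_000):
        print(step, decayed_epsilon(step))
```

Under these assumed endpoints the schedule reaches its floor after 100,000 steps; different start or end values shift that point proportionally.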
Appendix B D3G Experiments
We used the TD3 implementation from https://github.com/sfujim/TD3 for our experiments, along with the “OurDDPG” implementation of DDPG from the same repository. We built our own implementation of D3G from this codebase. We used the default hyperparameters for all of our experiments, as described in Table 2. The replay buffer was filled for steps before learning. All continuous experiments added noise for exploration. In gridworld, the agent followed an ε-greedy policy. Epsilon was set to and decayed to by subtracting 9e-6 every time step.
B.1 Gridworld task
We ran these experiments in an 11x11 gridworld. The state was the agent’s location on the grid. The agent was initialized to and received a reward of until it reached the goal at and obtained a reward of and was reset to the initial position. The episode automatically reset after steps.
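A gridworld of this kind can be sketched in a few lines. The class below is only an illustration under stated assumptions: the start cell, goal cell, reward values, and step limit are hypothetical placeholders (the actual values are not recoverable from the text here), and the four cardinal moves are an assumed action set. Reaching the goal or hitting the step limit ends the episode, after which the caller resets to the initial position.

```python
class GridWorld:
    """Minimal square gridworld; the state is the agent's (x, y) location.

    All default values below are assumed placeholders, not the paper's settings.
    """

    # Assumed action set: up, down, left, right.
    MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}

    def __init__(self, size=11, start=(0, 0), goal=(10, 10),
                 step_reward=-1.0, goal_reward=1.0, max_steps=500):
        self.size = size
        self.start = start
        self.goal = goal
        self.step_reward = step_reward
        self.goal_reward = goal_reward
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        """Place the agent at the initial position and clear the step counter."""
        self.pos = self.start
        self.t = 0
        return self.pos

    def step(self, action):
        """Move one cell, clipping at the walls; return (state, reward, done)."""
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.t += 1
        if self.pos == self.goal:
            return self.pos, self.goal_reward, True
        return self.pos, self.step_reward, self.t >= self.max_steps


if __name__ == "__main__":
    env = GridWorld()
    state = env.reset()
    state, reward, done = env.step(3)  # move right
    print(state, reward, done)
```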
B.2 MuJoCo tasks
We ran these experiments in the OpenAI Gym MuJoCo environments (https://github.com/openai/gym). We used gym==0.14.0 and mujoco-py==2.0.2. The agent’s state was a vector from the MuJoCo simulator.
B.3 Learning from Observation Experiments
We used TD3 to train an expert and used the learned policy to obtain demonstrations for learning from observation. We collected samples using the learned policy and took a random action either 0, 25, 50, 75, or 100 percent of the time, depending on the experiment. The samples consisted of the state, reward, next state, and done condition.
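The data collection step above can be sketched as follows. This is an illustration under assumptions, not the authors' code: `collect_observations`, `ChainEnv`, and the simplified `reset()`/`step()` interface returning `(next_state, reward, done)` are hypothetical and do not mirror the Gym API, and the toy chain environment exists only to exercise the function. The key point is that the action is used to step the environment but is never stored.

```python
import random


def collect_observations(env, policy, sample_random_action, p_random,
                         n_samples, seed=0):
    """Roll out a policy, replacing its action with a random one with
    probability p_random, and record (state, reward, next_state, done)."""
    rng = random.Random(seed)
    data = []
    state = env.reset()
    while len(data) < n_samples:
        if rng.random() < p_random:
            action = sample_random_action(rng)
        else:
            action = policy(state)
        next_state, reward, done = env.step(action)
        # The action is deliberately dropped: only observations are kept.
        data.append((state, reward, next_state, done))
        state = env.reset() if done else next_state
    return data


class ChainEnv:
    """Toy 1-D chain (hypothetical); actions are -1 or +1, goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        done = self.pos == self.length - 1
        return self.pos, (1.0 if done else 0.0), done


if __name__ == "__main__":
    def expert(state):
        return 1  # stand-in for a trained policy: always move right

    def noise(rng):
        return rng.choice([-1, 1])

    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        samples = collect_observations(ChainEnv(), expert, noise, p, n_samples=100)
        print(p, sum(1 for s in samples if s[3]))  # number of completed episodes
```

Because no actions are stored, any learner trained on this data has to recover actions separately, for example through an inverse dynamics model trained on its own environment interactions.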
We trained BCO with for iterations. During each iteration, we collected samples from the environment using a Behavioral Cloning (BC) policy with added noise , then trained an inverse dynamics model for steps, labeled the observational data using this model, then finally trained the BC policy with this labeled data for steps.
We trained D3G with for time steps without any environment interactions. This allowed us to learn the model which informed the agent of what state it should reach. Similarly to BCO, we used some environment interactions to train an inverse dynamics model for D3G. We ran this training loop for iterations as well. During each iteration, we collected samples from the environment using the inverse dynamics policy with added noise , then trained this model for steps.
Appendix C Architectures
D3G Model :
D3G Forward Dynamics Model:
D3G Forward Dynamics Model (Imitation):
D3G Inverse Dynamics Model (Continuous):
max action
D3G Inverse Dynamics Model (Discrete):
D3G Critic:
TD3 Actor:
max action
TD3 Critic:
DDPG Actor:
max action
DDPG Critic:
BCO Behavioral Cloning Model:
max action
BCO Inverse Dynamics Model:
max action