1 Introduction
Reinforcement learning (RL) is a promising approach for obtaining optimal policies in sequential decision-making problems, based on reward signals received during interaction with the environment. The most popular RL algorithms are model-free, since they do not require access to a model of the environment. Value functions play a central role in model-free RL Sutton1988ReinforcementLA : they are used either to derive a policy implicitly in value-based methods Mnih2015DQN or to guide policy updates in policy-based methods Schulman2015TRPO ; Silver2014DPG . With deep neural networks, value functions can be well estimated even with large state and action spaces, making it practical for model-free RL to tackle more challenging tasks Lillicrap2015DDPG ; Mnih2015DQN ; Silver2016Go .
Value functions define the expected cumulative reward of a policy, indicating how beneficial a state or an action could be. They are usually estimated with Monte Carlo (MC) or Temporal Difference (TD) algorithms Sutton1988ReinforcementLA , without explicitly handling the entanglement of reward signals and environmental dynamics. In practical problems, however, the quality of value estimation is heavily crippled by highly stochastic dynamics and flawed or delayed rewards. Intuitively, in contrast to this coupled manner, human beings usually evaluate a policy in a two-step way: 1) think about how the environment would change afterwards; 2) then evaluate how good the predicted future could be. Such an idea of future prediction has also been proposed in cognitive behavior and neuroscience studies Atance2001EpisodicFT ; Schacter2007RememberingTP ; Schacter2007TheCN .
Following this inspiration, in this paper we look into the value function and rewrite it in a composite form of: 1) a reward-independent predictive dynamics function, which defines the expected representation of the future state-action trajectory; and 2) a policy-independent trajectory return function, which maps any trajectory (representation) to its discounted cumulative reward. This induces a two-step understanding of the value estimation process in model-free RL and provides a way to disentangle the dynamics and the returns accordingly. Further, we use modern deep learning techniques to build a practical algorithm based on the above decomposition, including a convolutional trajectory representation model, a conditional variational dynamics model and a convex trajectory return model.
Key contributions of this work are summarized as follows:

We provide a new understanding of the value estimation process, in the form of an explicit two-step composition of future dynamics prediction and trajectory return estimation.

We propose a decoupled way to learn value functions. By training the reward-independent predictive dynamics model and the policy-independent trajectory return model separately, value estimation can be performed more effectively and flexibly in more challenging settings, e.g., delayed reward problems.

We propose a conditional Variational Auto-Encoder (VAE) Higgins2017Beta ; Kingma2013AEVB to model the underlying distribution of future trajectory representations. Moreover, we use the decoder's generation process as the predictive dynamics model and clip the generative noise to a small magnitude to approximate the expected trajectory representation.

For reproducibility, we conduct experiments on commonly adopted MuJoCo continuous control tasks Brockman2016Gym ; Todorov2012MuJoCo and perform ablation studies for each contribution. Our algorithm achieves state-of-the-art performance under common settings and shows significant effectiveness and robustness under challenging delayed reward settings.
2 Background
Consider a Markov Decision Process (MDP) defined by a set of states S, a set of actions A, a transition function p, a reward function r, an initial state distribution ρ_0, a discount factor γ ∈ [0, 1), and a finite horizon T. An agent interacts with the MDP at discrete time steps by performing its policy π, generating a trajectory of states and actions s_0, a_0, s_1, a_1, ..., where s_0 ∼ ρ_0, a_t = π(s_t), and s_{t+1} ∼ p(·|s_t, a_t). The objective of the agent is to maximize the expected cumulative discounted reward J(π) = E[∑_{t=0}^{T} γ^t r_t], where r_t = r(s_t, a_t).
In reinforcement learning, the state-action value function Q^π(s, a) is defined as the expected cumulative discounted reward for selecting action a in state s and then following policy π afterwards:
(1)  Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a \right]
Similarly, the state value function V^π(s) denotes the expected cumulative discounted reward for performing policy π from a certain state s, i.e., V^π(s) = Q^π(s, π(s)) for a deterministic policy.
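As a minimal illustration of these definitions, a Monte Carlo estimate of Q^π(s, a) simply averages observed discounted returns over trajectories that start with (s, a) and then follow π (a NumPy sketch; the function names are our own):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward of one trajectory: sum_t gamma^t * r_t."""
    return float(np.sum(gamma ** np.arange(len(rewards)) * np.asarray(rewards)))

def mc_q_estimate(reward_sequences, gamma=0.99):
    """Monte Carlo estimate of Q(s, a): average the discounted return over
    several trajectories that all begin with (s, a) and then follow pi."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_sequences]))
```

For example, `discounted_return([1, 1, 1], gamma=0.5)` evaluates 1 + 0.5 + 0.25 = 1.75.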
For continuous control, a parameterized policy π_φ with parameters φ can be updated by taking the gradient of the objective J(π_φ). In actor-critic methods, the policy, known as the actor, can be updated via the deterministic policy gradient theorem Silver2014DPG :
(2)  \nabla_{\phi} J(\phi) = \mathbb{E}_{s \sim \rho^{\pi}}\left[ \nabla_{a} Q^{\pi}(s,a)\big|_{a=\pi_{\phi}(s)} \nabla_{\phi} \pi_{\phi}(s) \right]
where ρ^π is the discounted state distribution under policy π. The Q function, also known as the critic, is usually estimated with Monte Carlo (MC) or Temporal Difference (TD) algorithms Sutton1988ReinforcementLA .
3 Model
Value estimation faces the coupling of environmental dynamics and reward signals. It can be intractable to obtain effective estimates of value functions in complex problems with highly stochastic dynamics and flawed or delayed rewards. In this section, we propose a way to disentangle the policy-independent part and the reward-independent part of the value estimation process.
Given a trajectory τ of state-action pairs, we consider a representation function f that maps τ to a compact representation f(τ), and then introduce the following definitions.
Definition 1
The trajectory return function R defines the cumulative discounted reward of any trajectory τ with representation f(τ):
(3)  R(f(\tau)) = \sum_{t=0}^{|\tau|-1} \gamma^{t} r(s_t, a_t)
The trajectory return function models the utility of a trajectory, and can be viewed as an imperfect long-term reward model of the environment since it does not depend on a particular policy.
Definition 2
Given the representation function f, the predictive dynamics function D^π(s, a) denotes the expected representation of the future trajectory when performing action a in state s and then following policy π:
(4)  D^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ f(\tau) \,\middle|\, s_0 = s,\, a_0 = a \right]
Note that the function D^π has a similar form to the Q function, except that the expectation is imposed on the trajectory representation. It is irrelevant to rewards and only predicts how the states and actions would evolve afterwards. With the above definitions, we can now derive the following lemma:
Lemma 1
Given a policy π, the following lower bound of the Q function holds for all s and a when the trajectory return function R is convex:
(5)  Q^{\pi}(s, a) \ge R\left( D^{\pi}(s, a) \right)
The equality holds strictly when R is a linear function.
The proof follows directly from Jensen's inequality, by replacing the summation in Equation 1 with the function R and then exchanging the expectation and R. A similar conclusion can be obtained for the state value function; we focus on the Q function in the rest of the paper.
Lemma 1 provides a lower-bound approximation of the Q function as the composite function of R and D^π. When R is a linear function, the equality guarantees that we can also obtain the optimal policy by optimizing the composite function. Since the input of R, i.e., f(τ), can be nonlinear, the representation ability of the composite function is preserved even with a linear R. When R is a commonly adopted ReLU-activated neural network (a convex function), we can still maximize the lower bound of the Q function by maximizing the composite function; however, there is no guarantee for the optimality of the policy learned in such cases (as we found in our experiments).
The above modeling induces an understanding that the Q function performs an explicit two-step estimation: 1) it first predicts the expected future dynamics under the policy (function D^π); 2) it then evaluates the benefit of the predicted future (function R). This provides a way to decompose the value estimation process by dealing with D^π and R separately. Prediction and evaluation of state-action trajectories can thus be carried out more efficiently in a compact representation space. Decoupling environmental dynamics from returns helps stabilize the value estimation process and provides flexibility for use in different problems. Moreover, it draws a connection between model-free and model-based RL, since the composite function in Lemma 1 indicates an evidence of model learning in model-free value estimation. Concretely, our decomposition of value functions induces an imperfect reward model and a partial dynamics model from the view of trajectories.
Finally, with the composite function approximation in Lemma 1, we can obtain the value-decomposed deterministic policy gradient by extending Equation 2 accordingly with the chain rule:
(6)  \nabla_{\phi} J(\phi) = \mathbb{E}_{s \sim \rho^{\pi}}\left[ \nabla_{\phi} \pi_{\phi}(s) \, \nabla_{a} D^{\pi}(s,a)\big|_{a=\pi_{\phi}(s)} \, \nabla_{D} R\left( D^{\pi}(s,a) \right) \right]
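The chain rule behind this value-decomposed gradient can be verified on a toy instance (a sketch with hypothetical linear stand-ins for the policy, dynamics and return models, not the networks of Section 4):

```python
import numpy as np

# Stand-ins: policy a = P s, dynamics D(s, a) = A s + B a, linear return
# R(d) = w . d, so Q(s) = w . (A s + B P s) with actor parameters P.
rng = np.random.default_rng(2)
s = rng.normal(size=3)
P = rng.normal(size=(2, 3))
A = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 2))
w = rng.normal(size=4)

q = lambda P_: w @ (A @ s + B @ (P_ @ s))

# analytic gradient via the chain rule: (dR/dD)(dD/da) gives B^T w, then da/dP
grad = np.outer(B.T @ w, s)

# finite-difference gradient for comparison
eps, num = 1e-6, np.zeros_like(P)
for i in range(P.shape[0]):
    for j in range(P.shape[1]):
        Pp = P.copy()
        Pp[i, j] += eps
        num[i, j] = (q(Pp) - q(P)) / eps

assert np.allclose(grad, num, atol=1e-4)
```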
4 Algorithm
In this section, we use the two-step understanding of value estimation discussed in the previous section to derive a practical deep RL algorithm based on modern deep learning techniques.
4.1 StateAction Trajectory Representation
To derive a practical algorithm, an effective and compact representation function f is necessary, because: 1) trajectories may have variable length; and 2) states and actions may contain irrelevant features that affect the estimation of the cumulative discounted reward of a trajectory. In this paper, we propose using Convolutional Neural Networks (CNNs) to learn a representation model of state-action trajectories, similar to their use in sentence classification Kim2014CNN . In our experiments, we found that this achieves faster training and better performance than the popular LSTM Hochreiter1997LSTM structure (see ablations in Section 6.2). An illustration of f is shown in the orange part of Figure 1.

Let x_t be the k-dimensional feature vector of the state-action pair (s_t, a_t). A trajectory (padded where necessary) is represented as x_{1:T} = x_1 ⊕ x_2 ⊕ ... ⊕ x_T, where ⊕ is the concatenation operator. With a convolution filter w applied to a window of h state-action pairs and then a max-pooling operation, a feature c can be generated as follows:

(7)  c = \max_{i} \, \mathrm{ReLU}\left( w \cdot x_{i:i+h-1} + b \right)

We apply m filters on x_{1:T} similarly to generate an m-dimensional feature vector [c_1, ..., c_m], then obtain the trajectory representation f(τ) after several fully-connected layers:

(8)  f(\tau) = \mathrm{MLP}\left( [c_1, \dots, c_m] \right)
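The convolution-and-max-pooling step of Equation 7 can be sketched in plain NumPy (without the fully-connected layers of Equation 8; shapes and names are our own):

```python
import numpy as np

def conv_max_feature(pairs, w, b):
    """One feature c: slide filter w over windows of h state-action pairs,
    apply ReLU, then max-pool over all window positions."""
    T, _ = pairs.shape
    h = w.shape[0]
    acts = [max(float(np.sum(w * pairs[i:i + h]) + b), 0.0)
            for i in range(T - h + 1)]
    return max(acts)

def trajectory_features(pairs, filters, biases):
    """Apply m filters and stack the max-pooled features into one vector,
    which a small MLP would then map to the final representation f(tau)."""
    return np.array([conv_max_feature(pairs, w, b)
                     for w, b in zip(filters, biases)])
```

For instance, a trajectory of four all-ones pairs with a single all-ones 2x2 filter and zero bias yields the feature vector [4.0].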
4.2 Trajectory Return Model
Following Lemma 1, we implement the trajectory return function R with convex functions. Without loss of optimality, we use a linear R to ensure strict equality, as illustrated in the blue part of Figure 1. Results for a popular ReLU-activated layer can be seen in the ablations (Section 6.2).
We train the representation model and the return model together by minimizing the mean squared error over minibatch samples from the experience buffer B, with respect to the parameters θ:
(9)  L(\theta) = \mathbb{E}_{\tau \sim B}\left[ \left( R_{\theta}\left( f_{\theta}(\tau) \right) - G(\tau) \right)^{2} \right], \quad G(\tau) = \sum_{t} \gamma^{t} r_t
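As a sketch of this regression step, a linear return model can be fitted to (representation, return) pairs by gradient descent on the mean squared error (all data below is synthetic; names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(256, 8))        # stand-in trajectory representations f(tau)
true_w = rng.normal(size=8)
G = F @ true_w                       # their (noiseless) discounted returns

w = np.zeros(8)                      # linear trajectory return model R(f) = w . f
for _ in range(500):
    grad = F.T @ (F @ w - G) / len(F)    # gradient of the mean squared error
    w -= 0.1 * grad

assert np.mean((F @ w - G) ** 2) < 1e-6  # the fit recovers the returns
```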
4.3 Conditional Variational Dynamics Model
The most straightforward way to implement the predictive dynamics function D^π is to use a Multi-Layer Perceptron (MLP) that takes the state and action as input and predicts the expected representation of the future trajectory. However, such an approach does not truly model the stochasticity of the future trajectory representation. In this paper, we present a conditional Variational Auto-Encoder (VAE) Kingma2013AEVB to model the underlying distribution of future trajectory representations conditioned on the state and action, achieving significant improvement over the MLP (see Ablation in Section 6.2).

The conditional VAE consists of two networks, an encoder and a decoder, with variational and generative parameters respectively. With a chosen prior, generally the multivariate standard normal distribution N(0, I), the encoder approximates the conditional posterior of the latent variable z for a trajectory representation, producing a Gaussian distribution. The decoder generates a representation of the future trajectory for a given latent variable, conditioned on the state-action pair. Besides, we use a pairwise-product operation to emphasize an explicit relation between the condition stream and the trajectory representation stream, which shows better inference results in our experiments (see Ablation in Section 6.2). The structure of the conditional VAE is illustrated in the green part of Figure 1.

During training, the latent variable is sampled from the approximate posterior with the reparameterization trick Kingma2013AEVB , i.e., z = μ + σ ⊙ ε with ε ∼ N(0, I), and is taken as part of the input by the decoder to reconstruct the trajectory representation. This naturally models the underlying stochasticity of the future trajectory. We train the conditional VAE with respect to the variational lower bound Kingma2013AEVB , in the form of a reconstruction loss along with a KL divergence term (see the Supplementary Material for the complete formulation):
(10)  L_{\mathrm{VAE}} = \mathbb{E}_{q}\left[ \big\| \hat{f}(\tau) - f(\tau) \big\|^{2} \right] + \beta \, D_{\mathrm{KL}}\left( q(z \mid f(\tau), s, a) \,\big\|\, \mathcal{N}(0, I) \right)
where f(τ) is obtained from the representation model (Equation 8). We use a weight β > 1 to encourage the VAE to discover disentangled latent factors for better inference quality, which is also known as a β-VAE Burgess2018Understanding ; Higgins2017Beta . See Ablation (Section 6.2) for the results of different values of β.
Since the VAE infers the latent distribution via instance-to-instance reconstruction, during the generation process we propose using a clipped generative noise to narrow the discrepancy between the generated instance and the expected representation (Equation 4). This allows us to obtain high-quality predictions of the expected future trajectory. Finally, the predictive dynamics model can be viewed as the generation process of the conditional VAE with a generative noise ε clipped to [−c, c]:
(11)  \hat{D}(s, a) = \mathrm{Decoder}\left( s, a, \mathrm{clip}(\epsilon, -c, c) \right), \quad \epsilon \sim \mathcal{N}(0, I)
When c is zero, an expected representation of the future trajectory (Equation 4) is generated from the mean of the latent distribution. A further discussion of the clip value c can be found in Ablation (Section 6.2).
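The prediction step can be sketched as follows (the `decoder` callable and the latent dimension are placeholders, not the exact architecture described in the Supplementary Material):

```python
import numpy as np

def predict_dynamics(state, action, decoder, z_dim=8, c=0.2, rng=None):
    """Generation step of the conditional VAE used as the dynamics model:
    sample the latent from the prior N(0, I), clip it to [-c, c], and decode
    it conditioned on (s, a). With c = 0 the decoder receives the prior mean,
    approximating the expected future-trajectory representation."""
    rng = rng or np.random.default_rng()
    z = np.clip(rng.normal(size=z_dim), -c, c)   # clipped generative noise
    return decoder(np.concatenate([state, action]), z)
```

With a dummy identity decoder `lambda cond, z: z`, setting `c=0.0` returns an all-zero latent, i.e., the prior mean.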
4.4 Overall Algorithm
We build our algorithm on the Deep Deterministic Policy Gradient (DDPG) Lillicrap2015DDPG algorithm, replacing the original critic (i.e., the Q function) with a decomposed one consisting of the three models introduced in the previous subsections. The actor is similarly updated through gradient ascent (Equation 6) with respect to the decomposed critic. Note that our algorithm does not use target networks for either the actor or the critic. The overall algorithm is summarized in Algorithm 1.
5 Related Work
Future Prediction
Thinking about the future has been considered an integral component of human cognition Atance2001EpisodicFT ; Schacter2007TheCN . In neuroscience, the concept of the prospective brain Schacter2007RememberingTP indicates that a crucial function of the brain is to use stored information to predict possible future events. The idea of future prediction is also studied in model-based RL Atkeson1997Robot ; Sutton1991Dyna . Simulated Policy Learning Kaiser2019MBRL learns one-step predictive models from the real environment and then trains a policy within the simulated environment. Multi-step and long-term futures are also modeled in Hafner2019Learning ; Ke2019LearningD with recurrent variational dynamics models, after which actions are chosen through online planning with Model-Predictive Control (MPC). Another related work is Dosovitskiy2017DFP , in which a supervised model is trained to predict the residuals of goal-related measurements at a set of temporal offsets in the future; with a manually designed goal vector, actions are chosen to maximize the predicted outcomes.
Value Function Approximation
Most model-free deep RL algorithms approximate value functions directly with deep neural networks, e.g., Proximal Policy Optimization (PPO) Schulman2017PPO , Advantage Actor-Critic (A2C) Mnih2016AC and DDPG Lillicrap2015DDPG , without explicitly handling the coupling of environmental dynamics and rewards. The approach most similar to ours is the Deep Successor Representation (DSR) Kulkarni2016DSR , which factors the value function into the dot-product between the expected representation of state occupancy and a vector representing the immediate reward function. The representation is trained with a TD algorithm Sutton1988ReinforcementLA and the reward vector is approximated from one-step transitions. In contrast, we decompose the value function into the composite form of trajectory dynamics and returns, and instead of a TD algorithm we use a conditional VAE to model the latent distribution and obtain the expected trajectory representation. We demonstrate the superiority of our algorithm in the experimental section.
6 Experiments
We conduct our experiments on MuJoCo continuous control tasks in OpenAI gym Brockman2016Gym ; Todorov2012MuJoCo . For ease of reproducibility, we make no modifications to the original environments or reward functions (except the delayed reward modification in Section 6.3). Open-source code and learning curves are provided in the Supplementary Material and will soon be released on GitHub.
6.1 Evaluation
To evaluate the effectiveness of our algorithm, we focus on two representative MuJoCo tasks: HalfCheetah-v1 and Walker2d-v1, as adopted in Fujimoto2018TD3 ; Fujimoto2018BCQ ; Schulman2015TRPO ; Schulman2017PPO . We compare our algorithm (VDFP) against DDPG, PPO, A2C, as well as the Deterministic DSR (DDSR). For PPO and A2C, we adopt non-parallel implementations and use Generalized Advantage Estimation Schulman2016GAE for stable policy gradients. Since DSR is originally proposed based on DQN Mnih2015DQN for discrete-action problems, we implement DDSR based on DDPG, following the author's DSR code on GitHub. For VDFP, we set the KL weight β to 1000 and the clip value c to 0.2, and use a maximum trajectory length of 64 for HalfCheetah-v1 and 256 for Walker2d-v1. For VDFP, DDPG and DDSR, Gaussian noise, following Fujimoto2018TD3 , is added to each action for exploration. For all algorithms, we use a two-layer feedforward neural network of 200 and 100 hidden units with ReLU activation for both the actor and the critic (with similar scales for the critic variants in DDSR and VDFP).
Figure 2 shows the learning curves of all algorithms over 5 random seeds of the Gym simulator and the network initialization. We observe that our algorithm (VDFP) outperforms the other algorithms in both final performance and learning speed. Our results for DDPG, PPO and A2C are comparable with those in Fujimoto2018TD3 ; Schulman2017PPO , where results for ACKTR Wu2017ACKTR and TRPO Schulman2015TRPO can also be found. Exact experimental details of each algorithm are provided in the Supplementary Material.
6.2 Ablation
We perform ablation studies to analyze the contribution of each component of VDFP: 1) CNN (vs. LSTM) for the trajectory representation model (Representation); 2) conditional VAE (vs. MLP) for the predictive dynamics model (Architecture); 3) pairwise-product (vs. concatenation) operation for the conditional encoding process (Operator); and 4) a single linear layer (vs. ReLU-activated layers) for the trajectory return model (Return). We use the same experimental setup as in Section 6.1 and present the results in Table 1. Complete learning curves can be found in the Supplementary Material.
First, we observe that CNN achieves better performance than LSTM. In fact, CNN also shows lower training losses and takes less wall-clock training time (almost 8x faster) in our experiments. Second, the significance of the conditional VAE is demonstrated by its superior performance over the MLP. This is because it is difficult for an MLP to approximate the expected representation from various trajectory instances; in contrast, the conditional VAE can capture the trajectory distribution well and then obtain the expected representation through the generation process. Third, pairwise-product shows an obvious improvement over concatenation. We suggest that the explicit relation between the condition and the representation imposed by the pairwise-product forces the conditional VAE to learn more effective hidden features. Lastly, adopting a linear layer for the trajectory return model outperforms using ReLU-activated layers, since it ensures the equality between the composite function approximation and the Q function (Lemma 1) and thus obtains a better guarantee for the optimal policy.
Moreover, we analyze the influence of the weight β for the KL loss term (Equation 10) and the clip value c for the prediction process (Equation 11). The results for different values of β are consistent with the studies on β-VAE Burgess2018Understanding ; Higgins2017Beta : a larger β places stronger emphasis on discovering disentangled latent factors, resulting in better inference performance. For the clip value c, clipping achieves superior performance to not clipping, since it narrows the discrepancy between the predicted instance and the expected representation of the future trajectory, as discussed in Section 4.3. Although complete clipping (c = 0) should ensure consistency with the expected representation, and indeed shows good performance and lower deviation, considering the imperfect approximation of neural networks, setting c to a small positive value (0.2) actually achieves a slightly better result.
Representation  Architecture  Operator  Return  
CNN  LSTM  VAE  MLP  PairwiseProd.  Concat.  Linear  ReLU  Results  
✓  ✓  ✓  ✓  1000  0.2  5818.60 336.25  
✓  ✓  ✓  ✓  1000  0.2  5197.03 156.52  
✓  ✓  N/A  N/A  ✓  N/A  N/A  2029.00 486.11  
✓  ✓  ✓  ✓  1000  0.2  4541.71 104.22  
✓  ✓  ✓  ✓  1000  0.2  5119.04 390.89  
✓  ✓  ✓  ✓  100  0.2  4794.96 370.02  
✓  ✓  ✓  ✓  10  0.2  3933.33 361.82  
✓  ✓  ✓  ✓  1000  4752.84 328.75  
✓  ✓  ✓  ✓  1000  0.0  5712.55 233.74 
6.3 Delayed Reward
We further demonstrate the effectiveness and robustness of VDFP under delayed reward settings. We consider two representative delayed reward settings from real-world scenarios: 1) multi-step accumulated rewards are given at sparse time steps; 2) each one-step reward is delayed for a certain number of time steps. To simulate these two settings, we make a simple modification to the MuJoCo tasks respectively: 1) deliver the d-step accumulated reward every d time steps and at the end of an episode; 2) delay the immediate reward of each step by d steps and compensate at the end of the episode.
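The first modification can be sketched as a thin wrapper over a gym-style environment (an assumed `step()` interface returning `(obs, reward, done, info)`; not the exact code used in our experiments):

```python
class DelayedRewardWrapper:
    """First delayed-reward setting: deliver the accumulated reward every
    `delay` steps and at episode end; all intermediate steps return zero."""

    def __init__(self, env, delay=32):
        self.env, self.delay = env, delay
        self._acc, self._t = 0.0, 0

    def reset(self):
        self._acc, self._t = 0.0, 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._acc += reward
        self._t += 1
        out = 0.0
        if done or self._t % self.delay == 0:
            out, self._acc = self._acc, 0.0   # release the accumulated reward
        return obs, out, done, info
```

The wrapper preserves the episode's total reward; only its timing changes.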
With the same experimental setup as in Section 6.1, we evaluate the algorithms under different delayed reward settings, with delay steps d from 16 to 128. For VDFP, a maximum trajectory length of 256 is used for all settings, except that a length of 64 already ensures good performance for HalfCheetah-v1 with d = 16 and 32. Figure 3 plots the results under the first delayed reward setting. Similar results are observed for the second delayed reward setting and can be found in the Supplementary Material.
As the delay step d increases, all algorithms gradually degenerate in comparison with Figure 2 (the no-delay case). DDSR can hardly learn effective policies under such delayed reward settings, due to the failure of its one-step reward model, even with a relatively small delay step (e.g., d = 16). VDFP consistently outperforms the others under all settings, in both learning speed and final performance (2x to 4x that of DDPG). Besides, VDFP shows good robustness to the delay step d. As discussed in Section 3, we suggest that the reason for the superior performance of VDFP is twofold: 1) VDFP can always learn the dynamics of the environment effectively from state and action feedback, which is irrelevant to how rewards are actually delayed; 2) the trajectory return model is robust to delayed rewards since it approximates the cumulative reward instead of one-step immediate rewards.
7 Conclusion and Future Work
We present an explicit two-step understanding of value functions in model-free RL from the perspective of future prediction. By rewriting the value function in a composite function form, we decompose the value estimation process into two separate parts, which allows more effective and flexible use in different problems. Further, we derive a practical algorithm from this decomposition and propose a conditional variational dynamics model with clipped generative noise to predict the future. Evaluation and ablation studies are conducted on MuJoCo continuous control tasks. The effectiveness and robustness of our algorithm are also demonstrated under challenging delayed reward settings.
In this paper, we use off-policy training for VDFP, which could be flawed since trajectories collected by old policies may not represent the future under the current policy. However, we did not observe any adverse effect of off-policy training in our experiments (similar results are found in Dosovitskiy2017DFP ), and explicitly introducing several on-policy correction approaches showed no apparent benefit. We hypothesize that this is because the deterministic policy used in VDFP relaxes the on-policy requirements. Further investigation of this issue, and the extension to stochastic policies, is worthwhile future work. Besides, to some extent, the variational predictive dynamics model of VDFP can be viewed as a Monte Carlo (MC) based estimation Sutton1988ReinforcementLA over the space of trajectory representations. In traditional RL approaches, MC value estimation is widely known to suffer from high variance. Thus, we suggest that VDFP may indicate a new variational MC approach with lower variance. We consider the theoretical analysis of this variance reduction as another direction for future work.
References
 [1] C. M. Atance and D. K. O’Neill. Episodic future thinking. Trends in Cognitive Sciences, 5:533–539, 2001.
 [2] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In ICML, pages 12–20, 1997.
 [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.
 [4] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. CoRR, abs/1804.03599, 2018.
 [5] A. Dosovitskiy and V. Koltun. Learning to act by predicting the future. In ICLR, 2017.
 [6] S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. CoRR, abs/1812.02900, 2018.
 [7] S. Fujimoto, H. v. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In ICML, 2018.
 [8] D. Hafner, T. P. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. CoRR, abs/1811.04551, 2018.
 [9] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 [10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [11] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, R. Sepassi, G. Tucker, and H. Michalewski. Model-based reinforcement learning for Atari. CoRR, abs/1903.00374, 2019.
 [12] N. R. Ke, A. Singh, A. Touati, A. Goyal, Y. Bengio, D. Parikh, and D. Batra. Learning dynamics model in reinforcement learning by incorporating the long term future. In ICLR, 2019.
 [13] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, 2014.
 [14] D. P. Kingma and M. Welling. Autoencoding variational bayes. In ICLR, 2014.
 [15] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman. Deep successor reinforcement learning. CoRR, abs/1606.02396, 2016.
 [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2015.
 [17] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
 [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [19] D. L. Schacter, D. R. Addis, and R. L. Buckner. Remembering the past to imagine the future: the prospective brain. Nature Reviews Neuroscience, 8:657–661, 2007.
 [20] D. L. Schacter and D. Rose Addis. The cognitive neuroscience of constructive memory: remembering the past and imagining the future. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 362 1481:773–86, 2007.
 [21] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
 [22] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. Highdimensional continuous control using generalized advantage estimation. In ICLR, 2016.
 [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 [24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [25] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A. Riedmiller. Deterministic policy gradient algorithms. In ICML, pages 387–395, 2014.
 [26] R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.
 [27] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. IEEE Transactions on Neural Networks, 16:285–286, 1988.
 [28] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for modelbased control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
 [29] Y. Wu, E. Mansimov, S. Liao, R. B. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In NIPS, 2017.
A. Complete Learning Curves
A.1. Learning Curves for the Results in Ablation
Figure 4 shows the learning curves of VDFP and its variants for ablation studies (Section 6.2), corresponding to the results in Table 1.
A.2. Results for the Second Delayed Reward Setting
We use the second delayed reward setting to model a representative class of delayed rewards in real-world scenarios: each one-step reward is delayed for a certain number of time steps. We make a simple modification to the MuJoCo tasks to simulate this class of delayed reward: delay the immediate reward of each step by d steps and compensate at the end of the episode. The complete learning curves of the algorithms under the second delayed reward setting are shown in Figures 5 and 6. All algorithms gradually degenerate as the delay step d increases. VDFP consistently outperforms the others under all settings and shows good robustness to the delay step d.
B. Experimental Details
B.1. Environment Setup
We conduct our experiments on MuJoCo continuous control tasks in OpenAI gym, using gym version 0.9.1, mujoco-py version 0.5.4 and MuJoCo version MJPRO131. Our code is implemented with Python 3.6 and TensorFlow 1.8. Our code and raw learning curves are submitted for review and will be released on GitHub soon.
B.2. Network Structure
As shown in Table 2, we use a two-layer feedforward neural network of 200 and 100 hidden units with ReLU activation (except for the output layer) for the actor network of all algorithms, and for the critic network of DDPG, PPO and A2C. For PPO and A2C, the critic denotes the state value network.
Layer  Actor Network ()  Critic Network ( or ) 

Fully Connected  (state dim, 200)  (state dim, 200) 
Activation  ReLU  ReLU 
Fully Connected  (200, 100)  (action dim + 200, 100) or (200, 100) 
Activation  ReLU  ReLU 
Fully Connected  (100, action dim)  (100, 1) 
Activation  tanh  None 
For DDSR, the factored critic (i.e., the Q function) consists of a representation network, a reconstruction network, an SR network and a linear reward vector, as described in the original DSR paper. The structure of the factored critic of DDSR is shown in Table 3.
Network  Layer  Structure 
Representation Network  Fully Connected  (state dim, 200) 
Activation  ReLU  
Fully Connected  (200, 100)  
Activation  ReLU  
Fully Connected  (100, representation dim)  
Activation  None  
Reconstruction Network  Fully Connected  (representation dim, 100) 
Activation  ReLU  
Fully Connected  (100, 200)  
Activation  ReLU  
Fully Connected  (200, state dim)  
Activation  None  
SR Network  Fully Connected  (representation dim, 200) 
Activation  ReLU  
Fully Connected  (200, 100)  
Activation  ReLU  
Fully Connected  (100, representation dim)  
Activation  None  
Linear Reward Vector  Fully Connected (not use bias)  (representation dim, 1) 
Activation  None 
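The factorization behind Table 3 follows the DSR idea of reading the value out linearly from a successor representation: the SR network maps the state representation to its (expected discounted) successor representation ψ, and the reward vector w gives Q = ψᵀw. A minimal sketch, with all function names ours:

```python
import numpy as np

def factored_q(state_repr, sr_net, reward_vector):
    """DSR-style factored value: Q = psi^T w, where psi = sr_net(phi(s)) is
    the successor representation of the state feature and w is a linear
    reward vector (the table's no-bias fully-connected layer)."""
    psi = sr_net(state_repr)           # successor representation, same dim as phi
    return float(psi @ reward_vector)  # linear read-out of the expected return
```

The reconstruction network in Table 3 is not part of this read-out; it only regularizes the representation by forcing it to retain enough information to rebuild the state.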
For VDFP, the decomposed critic (i.e., the value function) consists of a convolutional representation network, a linear trajectory return network and a conditional VAE (an encoder network and a decoder network). The structure of the decomposed critic of VDFP is shown in Table 4.
Network  Layer (Name)  Structure 

Representation Network  Convolutional  filters of several heights  
Activation  ReLU  
Pooling (concat)  Maxpooling & Concatenation  
Fully Connected (highway)  (filter num, filter num)  
Joint  Sigmoid(highway) ⊙ ReLU(highway) + (1 − Sigmoid(highway)) ⊙ ReLU(concat)  
Dropout  Dropout(drop_prob = 0.2)  
Fully Connected  (filter num, representation dim)  
Activation  None  
Conditional Encoder Network  Fully Connected (main)  (representation dim, 400) 
Fully Connected (encoding)  (state dim + action dim, 400)  
PairwiseProduct  Sigmoid(encoding) ⊙ ReLU(main)  
Fully Connected  (400, 200)  
Activation  ReLU  
Fully Connected (mean)  (200, z dim)  
Activation  None  
Fully Connected (log_std)  (200, z dim)  
Activation  None  
Conditional Decoder Network  Fully Connected (latent)  (z dim, 200) 
Fully Connected (decoding)  (state dim + action dim, 200)  
PairwiseProduct  Sigmoid(decoding) ⊙ ReLU(latent)  
Fully Connected  (200, 400)  
Activation  ReLU  
Fully Connected (reconstruction)  (400, representation dim)  
Activation  None  
Trajectory Return Network  Fully Connected  (representation dim, 1) 
Activation  None 
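Table 4 uses two gating operations: the highway joint in the representation network and the pairwise product that conditions the encoder/decoder on the state-action pair. Both are elementwise; a minimal sketch (function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_joint(highway, concat):
    """Highway gating from the representation network: the highway branch
    decides, per dimension, how much of itself vs. the pooled concat
    features to pass through."""
    g = sigmoid(highway)
    return g * np.maximum(highway, 0.0) + (1.0 - g) * np.maximum(concat, 0.0)

def pairwise_product(main_out, cond_out):
    """Conditioning in the encoder/decoder: the condition branch acts as a
    sigmoid gate multiplied elementwise onto the ReLU-activated main branch."""
    return sigmoid(cond_out) * np.maximum(main_out, 0.0)
```

Multiplicative gating like this tends to couple the two branches more strongly than plain concatenation, which is what the VDFP_Concat ablation below probes.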
For VDFP_MLP, we also use a two-layer feedforward neural network of 200 and 100 hidden units with ReLU activation (except for the output layer) for the representation model. For VDFP_LSTM, we use one LSTM layer with 100 units to replace the convolutional layer (along with the max-pooling layer) described in Table 4. For VDFP_Concat, we concatenate the state, the action and the representation (or latent variable) rather than using the pairwise-product structure. For VDFP_ReLU, we add a ReLU-activated fully-connected layer with 50 units in front of the linear layer of the trajectory return model.
B.3. Hyperparameters
For all our experiments, we use the raw observations and rewards from the environment; no normalization or scaling is applied. No regularization is used for the actor or the critic in any algorithm. Table 5 shows the common hyperparameters of the algorithms used in all our experiments. For VDFP and DDSR, the critic learning rate denotes the learning rate of the conditional VAE and of the successor representation model, respectively. The return (reward) model learning rate denotes the learning rate of the return model (along with the representation model) for VDFP and of the immediate reward vector for DDSR.
Hyperparameter  VDFP  DDSR  DDPG  PPO  A2C 
Actor Learning Rate  2.510  2.510  10  10  10 
Critic (VAE, SR) Learning Rate  10  10  10  10  10 
Return (Reward) Model Learning Rate  510  510       
Discount Factor  0.99  0.99  0.99  0.99  0.99 
Optimizer  Adam  Adam  Adam  Adam  Adam 
Target Update Rate    10  10     
Exploration Policy  None  None  
Batch Size  64  64  64  256  256 
Buffer Size  10  10  10     
Actor Epoch        10  10 
Critic Epoch        10  10 
B.4. Additional Implementation Details
For DDPG, the actor network and the critic network are updated every time step. We implement DDSR based on the DDPG algorithm by replacing the original critic of DDPG with the factored Q function described in the DSR paper. The actor, along with all the networks described in Table 3, is updated every time step. The representation dimension is set to 100. Before the training of DDPG and DDSR, we run 10000 time steps for experience collection, which are also counted in the total time steps.
For PPO and A2C, we use Generalized Advantage Estimation for stable policy gradients. The clip range of the PPO algorithm is set to 0.2. The actor network and the critic network are updated at fixed episode intervals for HalfCheetah-v1 and Walker2d-v1, respectively, with the epochs and batch sizes described in Table 5.
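Generalized Advantage Estimation can be sketched as the usual backward recursion over TD errors. The λ value used in our runs is not restated here, so the default below is only a placeholder assumption:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over a finite episode. `values` has one extra entry for the
    bootstrap value of the final state; lam=0.95 is a placeholder default."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # one-step TD error at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future TD errors
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

Setting lam=0 recovers plain one-step TD errors, while lam=1 recovers Monte Carlo advantages, which is the usual bias-variance dial.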
For VDFP, we set the KL weight to 1000 and the clip value to 0.2. The latent variable dimension (z dim) is set to 50 and the representation dimension is set to 100. We collect trajectory experiences in the first 5000 time steps and then pre-train the trajectory return model (along with the representation model) and the conditional VAE for 15000 time steps, after which we start the training of the actor. All the time steps above are counted into the total time steps for a fair comparison. The trajectory return model (along with the representation model) is trained every 10 time steps during the pre-training process and every 50 time steps in the rest of training, which already ensures good performance in our experiments. The actor network and the conditional VAE are trained every time step.
We consider a max trajectory length for VDFP in our experiments; for example, a max length of 256 can be considered to correspond to a certain effective discount factor. In practice, for a given max length, we add an additional fully-connected layer with ReLU activation before the convolutional representation model to aggregate the trajectory to a length of 64. This reduces the time cost and accelerates the training of the convolutional representation model described in Table 4. For example, we feed every 4 state-action pairs of a trajectory with length 256 into the aggregation layer to obtain an aggregated trajectory of length 64, and then feed the aggregated trajectory to the representation model to obtain a trajectory representation.
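The grouping part of this aggregation can be sketched as a reshape that packs consecutive state-action pairs together; the ReLU fully-connected layer that follows in the actual model is omitted here for brevity:

```python
import numpy as np

def aggregate_trajectory(traj, target_len=64):
    """Group consecutive state-action pairs so a long trajectory is reduced
    to `target_len` steps before the convolutional representation model.
    traj: array of shape (length, feature dim); length must be divisible
    by target_len. In the paper, a ReLU fully-connected layer is applied
    to each grouped vector afterwards."""
    length, dim = traj.shape
    group = length // target_len
    # each output row concatenates `group` consecutive state-action pairs
    return traj.reshape(target_len, group * dim)
```

For a trajectory of length 256 this packs every 4 consecutive pairs into one input vector, so the convolution only slides over 64 positions instead of 256.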
C. Complete Formulation for the conditional VAE
A variational auto-encoder (VAE) is a generative model which aims to maximize the marginal log-likelihood $\sum_{i=1}^{N} \log p_\theta(x^{(i)})$, where $x^{(i)}$ is a sample from the dataset $X$. Since computing the marginal likelihood is intractable in nature, it is a common choice to train the VAE by optimizing the variational lower bound:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \quad (12)$$
where $p(z)$ is a chosen prior, generally the multivariate standard normal distribution $\mathcal{N}(0, I)$.
In our paper, we use a conditional VAE to model the latent distribution of the representation $m$ of the future trajectory conditioned on the state $s$ and action $a$, under a certain policy. Thus, the true parameters of the distribution, denoted as $\theta^*$, maximize the conditional log-likelihood as follows:
$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta\left(m^{(i)} \mid s^{(i)}, a^{(i)}\right) \quad (13)$$
The likelihood can be calculated by marginalizing over the latent variable $z$ with its prior distribution $p(z)$:
$$p_\theta(m \mid s, a) = \int p_\theta(m \mid z, s, a)\, p(z)\, \mathrm{d}z \quad (14)$$
Since the true posterior distribution of the latent variable, $p_\theta(z \mid m, s, a)$, is not easy to compute, an approximate posterior distribution $q_\phi(z \mid m, s, a)$, parameterized by $\phi$, is introduced.
The variational lower bound of such a conditional VAE can be obtained as follows (superscripts and subscripts are omitted for clarity):
$$\log p_\theta(m \mid s, a) = \mathbb{E}_{q_\phi(z \mid m, s, a)}\left[\log p_\theta(m \mid z, s, a)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid m, s, a) \,\|\, p(z)\right) + D_{\mathrm{KL}}\left(q_\phi(z \mid m, s, a) \,\|\, p_\theta(z \mid m, s, a)\right) \quad (15)$$
Rearranging the above equation,
$$\log p_\theta(m \mid s, a) - D_{\mathrm{KL}}\left(q_\phi(z \mid m, s, a) \,\|\, p_\theta(z \mid m, s, a)\right) = \mathbb{E}_{q_\phi(z \mid m, s, a)}\left[\log p_\theta(m \mid z, s, a)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid m, s, a) \,\|\, p(z)\right) \quad (16)$$
Since the KL divergence is non-negative, we obtain the variational lower bound for the conditional VAE:
$$\log p_\theta(m \mid s, a) \ge \mathbb{E}_{q_\phi(z \mid m, s, a)}\left[\log p_\theta(m \mid z, s, a)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid m, s, a) \,\|\, p(z)\right) \quad (17)$$
Thus, we can obtain the optimal parameters by optimizing the objective below:
$$\theta^*, \phi^* = \arg\max_{\theta, \phi}\; \mathbb{E}_{q_\phi(z \mid m, s, a)}\left[\log p_\theta(m \mid z, s, a)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid m, s, a) \,\|\, p(z)\right) \quad (18)$$
Here $q_\phi(z \mid m, s, a)$ is the conditional variational encoder and $p_\theta(m \mid z, s, a)$ is the conditional variational decoder.
In our paper, we implement the encoder and decoder with deep neural networks. The encoder takes the trajectory representation and the state-action pair as input and outputs a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, from which a latent variable $z$ is sampled and then fed into the decoder for the reconstruction of the trajectory representation. Thus, we train the conditional VAE with respect to the variational lower bound (Equation 18), in the following form:
$$\mathcal{L}_{\mathrm{VAE}} = \left\| m - \hat{m} \right\|^2 + \lambda_{\mathrm{KL}}\, D_{\mathrm{KL}}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\right) \quad (19)$$
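The KL term has the usual closed form for a diagonal Gaussian against $\mathcal{N}(0, I)$. A sketch of the resulting loss is below, with the KL weight of 1000 from Section B.4 as the default; the clipping of the KL term mentioned there is omitted, and all function names are ours:

```python
import numpy as np

def kl_to_standard_normal(mu, log_std):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian:
    0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )."""
    log_var = 2.0 * log_std
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def cvae_loss(m, m_recon, mu, log_std, kl_weight=1000.0):
    """Reconstruction MSE plus weighted KL term, following the loss form
    above (Gaussian likelihood makes -log p a squared error up to constants)."""
    recon = np.sum((m - m_recon) ** 2)
    return recon + kl_weight * kl_to_standard_normal(mu, log_std)
```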