1 Introduction
Model-free reinforcement learning (RL) algorithms that learn a good policy without constructing an explicit model of the system’s dynamics have shown promising results in complex simulated problems (Mnih et al., 2013, 2015; Schulman et al., 2015; Haarnoja et al., 2018). However, these methods are not sample efficient, and thus are not suitable for problems in which data collection is burdensome. Model-based RL algorithms address the data-efficiency issue of model-free methods by learning a model and combining model-generated data with data collected from interaction with the real system (Janner et al., 2019). However, designing model-based RL algorithms is often challenging because the bias in the model may affect the process of learning policies and result in worse asymptotic performance than that of their model-free counterparts. A potential solution to this challenge is to incorporate the policy/value optimization method in the process of learning the model (Farahmand, 2018; Abachi et al., 2020). An ideal case here would be to have a universal objective function that is used to learn and improve the model and the policy jointly.
Casting RL as probabilistic inference has a long history (Todorov, 2008; Toussaint, 2009; Kappen et al., 2012; Vijayakumar, 2013). This formulation has the advantage that it allows powerful tools for approximate inference to be employed in RL. One such class of tools is variational techniques (Hoffman et al., 2013), which have been successfully used in RL (Neumann, 2011; Levine and Koltun, 2013; Abdolmaleki et al., 2018). Another formulation of RL with a strong connection to probabilistic inference is the formulation of policy search as an expectation-maximization (EM) style algorithm (Dayan and Hinton, 1997; Peters and Schaal, 2007; Peters et al., 2010; Neumann, 2011; Chebotar et al., 2017; Abdolmaleki et al., 2018). The main idea here is to write the expected return of a policy as a (pseudo-)likelihood function, and then, assuming success in maximizing the return, to find the policy that most likely would have been taken. Another class of RL algorithms related to the RL-as-inference formulation is entropy-regularized algorithms, which add an entropy term to the reward function and find the soft-max optimal policy (Levine and Koltun, 2014; Levine and Abbeel, 2014; Nachum et al., 2017, 2018; Haarnoja et al., 2018; Fellows et al., 2019). For a comprehensive tutorial on RL as probabilistic inference, we refer readers to Levine (2018).
In this paper, we leverage the connection between RL and probabilistic inference, and formulate an objective function for jointly learning and improving the model and the policy as a variational lower-bound of a log-likelihood. This allows us to use EM: we iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), and then improve the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (PI) style algorithms for the E-step and show how the variational distribution that they learn can be used to optimize the M-step using only model-generated samples. Both algorithms are model-based, but they differ in using model-based and model-free algorithms for the E-step. Our experiments on a number of continuous control tasks show that although our algorithm that uses model-based PI for the E-step, which we call variational model-based policy optimization (VMBPO), is more complex than its model-free counterpart, it is more sample-efficient and more robust to hyperparameter tuning. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms, including model-based policy optimization (MBPO) (Janner et al., 2019) and maximum a posteriori policy optimization (MPO) (Abdolmaleki et al., 2018), and show its sample efficiency and performance.
2 Preliminaries
We study the reinforcement learning (RL) problem (Sutton and Barto, 2018) in which the agent’s interaction with the environment is modeled as a discrete-time Markov decision process (MDP) , where and are state and action spaces; is the reward function; (is the set of probability distributions over ) is the transition kernel; and is the initial state distribution. A stationary Markovian policy is a probabilistic mapping from states to actions. Each policy is evaluated by its expected return, i.e., , where is the stopping time, i.e., the random variable of hitting a terminal state.¹ We denote by the set of all terminal states. The agent’s goal is to find a policy with maximum expected return, i.e., . We denote by , a system trajectory of length , whose probability under a policy is defined as . Finally, we define .
¹ Similar to Levine (2018), our setting can be easily extended to infinite-horizon discounted MDPs. This can be done by modifying the transition kernels, such that any action transitions the system to a terminal state with probability , and all standard transition probabilities are multiplied by .
3 Policy Optimization as Probabilistic Inference
Policy search in reinforcement learning (RL) can be formulated as a probabilistic inference problem (e.g., Todorov 2008; Toussaint 2009; Kappen et al. 2012; Levine 2018). The goal in the conventional RL formulation is to find a policy whose generated trajectories maximize the expected return. In contrast, in the inference formulation, we start with a prior over trajectories and then estimate the posterior conditioned on a desired outcome, such as reaching a goal state. In this formulation, the notion of a desired (optimal) outcome is introduced via independent binary random variables , where denotes that we acted optimally at time . The likelihood of , given the state and action , is modeled as , where is a temperature parameter. This allows us to define the log-likelihood of a policy being optimal as
(1) 
where is the optimality likelihood of trajectory and is defined as
(2) 
As a result, finding an optimal policy in this setting would be equivalent to maximizing the log-likelihood in Eq. 1, i.e., .
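To make the quantity in Eq. 1 concrete, the following sketch estimates the log-likelihood of optimality from sampled trajectory returns. It assumes the exponentiated-reward likelihood that is common in the RL-as-inference literature (Levine, 2018), under which the optimality likelihood of a trajectory is the exponential of its temperature-scaled return. The function name and the idea of passing pre-computed returns are hypothetical choices made for illustration only.

```python
import numpy as np


def log_optimality_likelihood(returns, eta):
    """Monte Carlo estimate of log E_{xi ~ p_pi}[exp(eta * R(xi))].

    `returns` holds the total reward R(xi) of trajectories sampled from the
    policy; `eta` is the temperature. The exponentiated-return form of the
    likelihood is an assumption of this sketch.
    """
    z = eta * np.asarray(returns, dtype=float)
    m = z.max()  # subtract the maximum for numerical stability (log-mean-exp)
    return m + np.log(np.mean(np.exp(z - m)))
```

By Jensen's inequality this estimate is never smaller than the temperature-scaled average return, so the objective rewards high-return trajectories more strongly as the temperature grows.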
A potential advantage of formulating RL as an inference problem is the possibility of using a wide range of approximate inference algorithms, including variational methods. In variational inference, we approximate a distribution with a potentially simpler (e.g., tractable factored) distribution in order to make the whole inference process more tractable. If we approximate with a variational distribution , we will obtain the following variational lower-bound for the log-likelihood in Eq. 1:
(3) 
(a) is from Jensen’s inequality, and is the evidence lower-bound (ELBO) of the log-likelihood function. A variety of algorithms have been proposed (e.g., Peters and Schaal 2007; Sugiyama 2009; Neumann 2011; Levine and Koltun 2013; Abdolmaleki et al. 2018; Fellows et al. 2019), whose main idea is to approximate by maximizing w.r.t. both and . This often results in an expectation-maximization (EM) style algorithm in which we first fix and maximize for (E-step), and then for the obtained in the E-step, we maximize for (M-step).
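The Jensen step can be checked numerically on a toy problem. The sketch below treats a handful of trajectories as a discrete set, takes the ELBO to be the expected log-weight under the variational distribution minus its KL divergence from the prior, and compares it with the exact log-evidence. The three-trajectory prior, the returns, and the temperature are made-up numbers, and the exponentiated-return weight is an assumption of the sketch.

```python
import numpy as np


def elbo(q, p, log_w):
    """E_q[log_w] - KL(q || p) for discrete distributions with full support."""
    q, p, log_w = map(np.asarray, (q, p, log_w))
    return float(np.sum(q * (log_w - np.log(q) + np.log(p))))


def log_evidence(p, log_w):
    """Exact log E_p[exp(log_w)]."""
    return float(np.log(np.sum(np.asarray(p) * np.exp(np.asarray(log_w)))))


# Hypothetical example: three trajectories, prior p, and eta * return as log-weight.
p = np.array([0.5, 0.3, 0.2])
log_w = 0.5 * np.array([1.0, 0.0, 2.0])

q_loose = np.array([1 / 3, 1 / 3, 1 / 3])  # arbitrary variational distribution
q_tight = p * np.exp(log_w)                # prior reweighted by exp(log_w)
q_tight /= q_tight.sum()
```

For any choice of the variational distribution the ELBO stays below the log-evidence, and the bound is tight when the variational distribution is the reweighted prior.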
4 Variational Model-based Policy Optimization
In this section, we describe the ELBO objective function used by our algorithms, study the properties of the resulting optimization problem, and propose algorithms to solve it. We propose to use the variational distribution to approximate . Note that shares the same initial state distribution as , but has a different control strategy (policy), , and dynamics, . Using this variational distribution, we may write the ELBO objective of (3) as
(4) 
To maximize w.r.t. and , we first fix and compute the variational distribution (E-step):
(5) 
and then optimize given , i.e., (M-step). Note that in (5), and are both functions of , but we remove from the notation to keep it lighter.
Remark 1.
In our formulation (choice of the variational distribution ), the M-step is independent of the true dynamics, , and thus, can be implemented offline (using samples generated by the model ). Moreover, as will be seen in Section 5, we also use the model, , in the E-step. As discussed throughout the paper, using simulated samples (from ) and reducing the need for real samples (from ) is an important feature of our proposed model-based formulation and algorithm.
Remark 2.
There are similarities between our variational formulation and the one used in the maximum a posteriori policy optimization (MPO) algorithm (Abdolmaleki et al., 2018). However, MPO sets its variational dynamics, , to the dynamics of the real system, , which results in a model-free algorithm, while our approach is model-based, since we learn and use it to generate samples in both the E-step and the M-step of our algorithms.
In the rest of this section, we study the E-step optimization (5) and propose algorithms to solve it.
4.1 Properties of the E-step Optimization
We start by defining two Bellman-like operators related to the E-step optimization (5). For any variational policy and any value function , such that , we define the induced operator and the optimal operator as
(6)  
(7) 
We also define the optimal value function of the E-step, , as
(8) 
For any value function , we define its associated action-value function as
(9) 
We now prove (in Appendices A.1 and A.2) the following lemmas about the properties of these operators, and , and their relation with the (E-step) optimal value function, .
Lemma 2.
The induced, , and optimal, , operators are monotonic and are contractions. Moreover, the optimal value function is the unique fixed point of , i.e., .
From the definition of function in (9) and Lemma 2, we prove (in Appendix A.3) the following proposition for the action-value function associated with the E-step optimal value function .
Proposition 1.
The E-step optimal value function, , and its associated action-value function, , defined by (9), have the following relationship: .
In the rest of this section, we show how to derive a closed-form expression for the variational distribution . For any value function , we define its corresponding variational dynamics, , as the solution to the maximization problem in the definition of (see Eq. 6), i.e.,
(12) 
and its corresponding variational policy, , where is the action-value function associated with (Eq. 9), as the solution to the maximization problem in the definition of (see Eqs. 7 and 10), i.e.,
(13) 
We now derive closed-form expressions for the variational distributions and .
Lemma 3.
The variational dynamics and policy corresponding to a value function and its associated action-value function can be written in closed form as (proof in Appendix A.4)
(14)  
(15) 
From (14) and (15), the variational dynamics, , and policy, , can be seen as an exponential twisting of the dynamics and policy with weights and , respectively. In the special case (the E-step optimal value function), these distributions can be written in closed form as
(16) 
where the denominator of is obtained by applying Proposition 1 to replace with .
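Exponential twisting itself is a simple operation: reweight a base distribution by the exponential of a weight function and renormalize. The sketch below shows it for a discrete distribution. The exact weights used in (14) and (15) are not reproduced here, so `weights` is a generic placeholder, and the function name and the numbers are hypothetical.

```python
import numpy as np


def exponential_twist(base, weights):
    """Return base * exp(weights), normalized over the last axis, and the log-normalizer.

    `base` is a discrete distribution (e.g., next-state probabilities for one
    state-action pair, or action probabilities for one state); `weights` is
    the twisting weight of each outcome.
    """
    base = np.asarray(base, dtype=float)
    weights = np.asarray(weights, dtype=float)
    shift = weights.max(axis=-1, keepdims=True)  # stabilizes the exponentials
    unnorm = base * np.exp(weights - shift)
    norm = unnorm.sum(axis=-1, keepdims=True)
    return unnorm / norm, (np.log(norm) + shift).squeeze(-1)


# Hypothetical numbers: next-state distribution for one (x, a) and a value per next state.
p_next = np.array([0.6, 0.3, 0.1])
v_next = np.array([0.0, 1.0, 3.0])
q_next, log_z = exponential_twist(p_next, v_next)
```

The twisted distribution shifts probability mass toward outcomes with larger weights, and the log-normalizer is the log-expectation of the exponentiated weights under the base distribution.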
4.2 Policy and Value Iteration Algorithms for the E-step
Using the results of Section 4.1, we now propose model-based and model-free dynamic programming (DP) style algorithms, i.e., policy iteration (PI) and value iteration (VI), for solving the E-step problem (5). The model-based algorithms compute the variational dynamics, , at each iteration, while the model-free ones compute only at the end (upon convergence). Having access to at each iteration has the advantage that we may generate samples from the model, , when we implement the sample-based version (RL version) of these DP algorithms in Section 5.
In the model-based PI algorithm, at each iteration , given the current variational policy , we perform the following two steps:
Policy Evaluation: Compute the induced value function (the fixed point of the operator ) by iteratively applying from Eq. 6, i.e., , where the variational model in (6) is computed using Eq. 14 with . We then compute the corresponding action-value function, using Eq. 9.
Policy Improvement: Update the variational distribution using Eq. 15 with .²
Upon convergence, i.e., , we compute from Eq. 14 and return .
The model-free PI algorithm is exactly the same, except that in its policy evaluation step, the induced operator, , is applied using Eq. 10 (without the variational dynamics ). In this case, the variational dynamics, , is computed only upon convergence, , using Eq. 14.
² When the number of actions is large, the denominator of (15) cannot be computed efficiently. In this case, we replace (15) in the policy improvement step of our PI algorithms with , where . We also prove the convergence of our PI algorithms with this update in Appendix A.5.
Lemma 4.
We can similarly derive model-based and model-free value iteration (VI) algorithms for the E-step. These algorithms start from an arbitrary value function, , and iteratively apply the optimal operator, , from Eqs. 6 and 7 (model-based) and Eq. 11 (model-free) until convergence, i.e., . Given , these algorithms first compute from Proposition 1, and then compute (, ) using Eq. 16. From the properties of the optimal operator in Lemma 2, it is easy to see that both model-based and model-free VI algorithms converge to and .
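As a rough illustration of a VI-style recursion of this kind, the sketch below runs a tabular log-sum-exp backup on a tiny episodic MDP. The operator definitions are not reproduced in the text above, so the specific backup used here (temperature-scaled reward plus the log-expectation of the exponentiated next-state value, followed by a log-expectation over the baseline policy's actions) is an assumption of this sketch, as are the three-state MDP and all variable names.

```python
import numpy as np


def soft_value_iteration(p, r, pi, eta, terminal, n_iters=50):
    """Tabular log-sum-exp value iteration (assumed form of the optimal operator).

    Q(x, a) = eta * r(x, a) + log sum_x' p(x'|x, a) exp(V(x'))
    V(x)    = log sum_a pi(a|x) exp(Q(x, a)), with V = 0 at terminal states.
    p has shape (S, A, S); r and pi have shape (S, A).
    """
    v = np.zeros(p.shape[0])
    for _ in range(n_iters):
        q = eta * r + np.log(p @ np.exp(v))
        v = np.log(np.sum(pi * np.exp(q), axis=1))
        v[terminal] = 0.0
    q = eta * r + np.log(p @ np.exp(v))
    return v, q


# Hypothetical MDP: states 0 and 1 are non-terminal, state 2 is terminal (absorbing).
p = np.zeros((3, 2, 3))
p[0, 0] = [0.0, 0.7, 0.3]
p[0, 1] = [0.0, 0.2, 0.8]
p[1, :] = [0.0, 0.0, 1.0]
p[2, :] = [0.0, 0.0, 1.0]
r = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 0.0]])
pi = np.full((3, 2), 0.5)  # uniform baseline policy
eta = 0.5

v, q = soft_value_iteration(p, r, pi, eta, terminal=[2])
q_pi = pi * np.exp(q - v[:, None])  # baseline policy twisted by exp(Q - V)
```

Under these assumptions, the value at the initial state equals the log of the expected exponentiated return under the baseline policy and the true dynamics, and the twisted policy is a proper distribution in every state.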
In this paper, we focus on the PI algorithms, in particular the model-based one, and leave the VI algorithms for future work. In the next section, we show how the PI algorithms can be implemented and combined with a routine for solving the M-step, when the true MDP model, , is unknown (the RL setting) and the state and action spaces are large enough to require function approximation.
5 Variational Model-based Policy Optimization Algorithm
In this section, we propose an RL algorithm, called variational model-based policy optimization (VMBPO). VMBPO is an EM-style algorithm based on the variational formulation proposed in Section 4. The E-step of VMBPO is the sample-based implementation of the model-based PI algorithm, described in Section 4.2. We describe the E-step and M-step of VMBPO in detail in Sections 5.1 and 5.2, and report its pseudo-code in Algorithm 1 in Appendix B. VMBPO uses neural networks to represent: policy , variational dynamics , variational policy , log-likelihood ratio , value function , action-value function , target value function , and target action-value function , with parameters , , , , , , , and , respectively.
5.1 The E-step of VMBPO
At the beginning of the E-step, we generate a number of samples from the current baseline policy , i.e., and , and add them to the buffer . The E-step consists of four updates: 1) computing the variational dynamics , 2) estimating the log-likelihood ratio , 3) computing the induced value, , and action-value, , functions (critic update), and finally 4) computing the variational policy new (actor update). We describe the details of each step below.
Step 1. (Computing ) We find as the solution to the optimization problem (12) for equal to the target value network . Since the in (14) is the solution of (12), we compute by minimizing , which results in the following forward KL loss, for all and :
(17)  
(18) 
where (a) is by removing the independent terms from (17). We update by taking several steps in the direction of the gradient of a sample average of the loss function (18), i.e.,
(19) 
where are randomly sampled from . The intuition here is to focus on learning the dynamics model in regions of the state-action space that have a higher temporal difference, i.e., regions with higher anticipated future return. Note that we can also obtain by optimizing the reverse KL direction in (17), but since it results in a more involved update, we do not report it here.
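The update in Step 1 can be read as a weighted maximum-likelihood fit of the model, in which each real transition is weighted by an exponentiated temporal-difference term. The sketch below computes such a loss from pre-computed quantities. The exact weight in (19) is not reproduced in the text above, so the form used here (temperature-scaled reward plus target value of the next state minus target action-value) is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np


def weighted_model_nll(log_q_next, r, v_next, q_sa, eta):
    """Sample average of -w * log q_theta(x'|x, a) over a batch of transitions.

    log_q_next: model log-probability of each observed next state
    r, v_next, q_sa: reward, target value of the next state, target action-value
    w = exp(eta * r + v_next - q_sa) is the assumed exponentiated-TD weight.
    """
    w = np.exp(eta * np.asarray(r) + np.asarray(v_next) - np.asarray(q_sa))
    return float(-np.mean(w * np.asarray(log_q_next)))
```

When the temporal difference is zero for every transition, the weights equal one and the loss reduces to the ordinary negative log-likelihood. Transitions with a positive temporal difference count for more.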
Step 2. (Computing ) Using the duality of f-divergence (Nguyen et al., 2008) w.r.t. the reverse KL-divergence, the log-likelihood ratio is a solution to
(20) 
for all and . Note that the optimizer of (20) is unique almost surely (at with ), because is absolutely continuous w.r.t. (see the definition of in Eq. 14) and the objective function of (20) is strictly concave. The optimization problem (20) allows us to compute as an approximation to the log-likelihood ratio . We update by taking several steps in the direction of the gradient of a sample average of (20), i.e.,
(21) 
where is the set of samples for which is drawn from the variational dynamics, i.e., . Here we first sample randomly from and use them in the second sum. Then, for all that have been sampled, we generate from and use the resulting samples in the first sum.
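The mechanism behind Step 2 can be illustrated on discrete distributions. One standard variational form of the KL divergence is the supremum over functions of the expectation of the function under one distribution, minus the expectation of its exponential under the other, plus one; the maximizer is the log-ratio of the two distributions. The sketch below runs gradient ascent on a table of values. Whether (20) uses exactly this form is an assumption, and the function names, step size, and distributions are hypothetical.

```python
import numpy as np


def dual_objective(nu, q, p):
    """J(nu) = E_q[nu] - E_p[exp(nu)] + 1 for discrete q and p."""
    return float(np.sum(q * nu) - np.sum(p * np.exp(nu)) + 1.0)


def fit_log_ratio(q, p, lr=0.5, n_steps=5000):
    """Gradient ascent on J over a table nu; J is strictly concave in nu."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    nu = np.zeros_like(q)
    for _ in range(n_steps):
        nu += lr * (q - p * np.exp(nu))  # dJ/dnu_i = q_i - p_i * exp(nu_i)
    return nu
```

In the sample-based setting the two expectations would be replaced by averages over model-generated and real next states, respectively, and the table by a function approximator.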
Step 3. (critic update) To compute (the fixed point of ) and its action-value , we first rewrite (6) with the maximizer from Step 1 and the log-likelihood ratio from Step 2:
(22) 
Since can be written as both (10) and (22), we compute the induced function by setting the RHS of these equations equal to each other, i.e., for all and ,
(23) 
Since the expectation in (23) is w.r.t. the variational dynamics (model) , we can estimate only with samples generated from the model. We do this by taking several steps in the direction of the gradient of a sample average of the square-loss obtained by setting the two sides of (23) equal, i.e.,
(24) 
Note that in (23), the actions are generated by . Thus, in (24), we first randomly sample , then sample from , and finally draw from . If the reward function is known (chosen by the designer of the system), then it is used to generate the reward signals in (24), otherwise, a reward model has to be learned.
After estimating , we approximate , the fixed point of , using the definition in (10) as . This results in updating by taking several steps in the direction of the gradient of a sample average of the square-loss obtained by setting the two sides of the above equation to be equal, i.e.,
(25) 
where is randomly sampled and (without sampling from the true environment).
Step 4. (actor update) We update the variational policy (policy improvement) by solving the optimization problem (13) for the estimated by the critic in Step 3. Since the that optimizes (13) can be written as (15), we update it by minimizing . This results in the following reverse KL loss, for all : . If we reparameterize using a transformation , where is a Gaussian noise, we can update by taking several steps in the direction of the gradient of a sample average of the above loss, i.e.,
(26) 
5.2 The M-step of VMBPO
As described in Section 4, the goal of the M-step is to improve the baseline policy , given the variational model learned in the E-step, by solving the following optimization problem
(27) 
A nice feature of (27) is that it can be solved using only the variational model , without the need for samples from the true environment . However, it is easy to see that if the policy space considered in the M-step, , contains the one used for in the E-step, then we can trivially solve the M-step by setting . Although this is an option, it is more efficient in practice to solve a regularized version of (27). A practical way to regularize (27) is to make sure that the new baseline policy remains close to the old one, which results in the following optimization problem
(28) 
This is equivalent to the weighted MAP formulation used in the M-step of MPO (Abdolmaleki et al., 2018). MPO defines a prior over the parameter and adds it as to the objective function of (27). The prior is then set to a specific Gaussian, which yields an optimization problem similar to (28) (see Section 3.3 in Abdolmaleki et al. 2018). However, since in their variational model (their approach is model-free), they need real samples to solve their optimization problem, while we can solve (28) using only simulated samples (our approach is model-based).
6 Experiments
To illustrate the effectiveness of VMBPO, we (i) compare it with several state-of-the-art RL methods on multiple domains, and (ii) assess the trade-off between sample efficiency via ablation analysis.
Comparison with Baseline RL Algorithms
We compare VMBPO with five baseline methods: MPO (Abdolmaleki et al., 2018) and SAC (Haarnoja et al., 2018), two popular model-free deep RL algorithms, and STEVE (Buckman et al., 2018), PETS (Chua et al., 2018), and MBPO (Janner et al., 2019), three recent model-based RL algorithms. We also compare with the (E-step) model-free variant of VMBPO, which is known as VMBPO-MFE (see Appendix C for details). We evaluate the algorithms on one classical control benchmark (Pendulum) and five MuJoCo benchmarks (Hopper, Walker2D, HalfCheetah, Reacher, Reacher7DoF). The neural network architectures (for the dynamics model, value functions, and policies) of VMBPO are similar to those of MBPO. Details on network architectures and hyperparameters are described in Appendix D. Since we parameterize in the E-step of VMBPO, according to Section 5.2, in the M-step we simply set . For the more difficult environments (Walker2D, HalfCheetah), the number of training steps is set to , while for the medium one (Hopper) and for the simpler ones (Pendulum, Reacher, Reacher7DoF), it is set to and respectively. Policy performance is evaluated every training iterations. Each measurement is an average return over episodes, each generated with a separate random seed. To smooth learning curves, data points are averaged over a sliding window of size .
Table 1 and Figure 3 (Appendix D.1) show the average return of VMBPO, VMBPO-MFE, and the baselines under the best hyperparameter configurations. VMBPO outperforms the baseline algorithms in most of the benchmarks, with significantly faster convergence speed and a higher reward. This verifies our conjecture about VMBPO: (i) utilizing synthetic data from the learned dynamics model generally improves the data-efficiency of RL; (ii) the extra improvement in VMBPO is attributed to the fact that the model is learned from the universal RL objective function. On the other hand, VMBPO-MFE outperforms MPO in 4 out of 6 domains. However, in some cases the learning may experience certain degradation issues (which lead to poor performance). This is due to the instability caused by sample variance amplification in critic learning with exponential-TD minimization (see Eq. 34 in Section C.1). To alleviate this issue one may introduce a temperature term to the exponential-TD update (Borkar, 2002). However, tuning this hyperparameter can be quite non-trivial.³ Table 4 (Appendix D.1) and Figure 1 show the summary statistics averaged over all hyperparameter/random-seed configurations and illustrate the sensitivity to hyperparameters of each method. VMBPO is more robust (with the best performance on all the tasks) than the baselines. This corroborates the hypothesis that MBRL generally is more robust to hyperparameters than its model-free counterparts.
³ The variance is further amplified with a large , but the critic learning is hampered by a small .
Ablation Analysis
We now study the effects of data-efficiency of VMBPO w.r.t. the data samples generated from the dynamics model . For simplicity, we only experiment with 3 standard benchmarks (Pendulum, Hopper, HalfCheetah) and with fewer learning steps (, , ). At each step, we update the actor and critic using synthetic samples. Figure 2 shows the statistics averaged over all hyperparameter/random-seed configurations and illustrates how synthetic data can help with policy learning. The results show that increasing the amount of synthetic data generally improves the policy convergence rate. In the early phase when the dynamics model is inaccurate, sampling data from it may slow down learning, while in the later phase with an improved model adding more synthetic data leads to a more significant performance boost.
7 Conclusion
We formulated the problem of jointly learning and improving model and policy in RL as a variational lower-bound of a log-likelihood, and proposed EM-type algorithms to solve it. Our algorithms, called variational model-based policy optimization (VMBPO), use model-based policy iteration for solving the E-step. We compared our (E-step) model-based and model-free algorithms with each other, and with a number of state-of-the-art model-based (e.g., MBPO) and model-free (e.g., MPO) RL algorithms, and showed their sample efficiency and performance.
We briefly discussed VMBPO-style algorithms in which the E-step is solved by model-based policy iteration methods. However, fully implementing these algorithms and studying their relationship with the existing methods requires more work, which we leave for the future. Other future directions are: 1) finding a more efficient implementation of VMBPO, and 2) using VMBPO-style algorithms to solve control problems from high-dimensional observations, by learning a low-dimensional latent space and latent-space dynamics, and performing control there. This class of algorithms is referred to as learning controllable embedding (Watter et al., 2015; Levine et al., 2020).
Broader Impact
This work proposes methods to jointly learn and improve model and policy in reinforcement learning. We see learning control-aware models as a promising direction to address an important challenge in model-based reinforcement learning: creating a balance between the bias in simulated data and the ease of data generation (sample efficiency).
References
 Policy-aware model learning for policy gradient methods. preprint arXiv:2003.00030. Cited by: §1.
 Maximum a posteriori policy optimisation. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §1, §1, §3, §5.1, §5.2, §6, Remark 2.
 Dynamic programming and optimal control. Vol. 2, Athena scientific Belmont, MA. Cited by: §A.2, §A.2.
 Q-learning for risk-sensitive control. Mathematics of operations research 27 (2), pp. 294–311. Cited by: §A.2, §6.
 Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234. Cited by: §6.
 Path integral guided policy search. In IEEE International Conference on Robotics and Automation, Cited by: §1.
 Unsupervised exploration with deep model-based reinforcement learning. Cited by: §6.
 Using expectation-maximization for reinforcement learning. Neural Computation 9 (2), pp. 271–278. Cited by: §1.
 Iterative value-aware model learning. In Advances in Neural Information Processing Systems 31, pp. 9072–9083. Cited by: §1.
 VIREL: a variational inference framework for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 7120–7134. Cited by: §1, §3.

 Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pp. 1861–1870. Cited by: §1, §1, §6.
 Stochastic variational inference. Journal of Machine Learning Research 14 (4), pp. 1303–1347. Cited by: §1.
 When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems 32, pp. 12519–12530. Cited by: §1, §1, §6.
 Optimal control as a graphical model inference problem. Machine Learning 87 (2), pp. 159–182. Cited by: §1, §3.
 Prediction, consistency, curvature: representation learning for locally-linear control. In Proceedings of the 8th International Conference on Learning Representations, Cited by: §7.
 Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, Cited by: §1.
 Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, Cited by: §1, §3.
 Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning, Cited by: §1.
 Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv:1805.00909. Cited by: §1, §3, footnote 1.
 Playing Atari with deep reinforcement learning. preprint arXiv:1312.5602. Cited by: §1.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
 Path consistency learning in Tsallis entropy regularized MDPs. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pp. 979–988. Cited by: §1.
 Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785. Cited by: §A.1, §A.4, §A.4, §1.
 Variational inference for policy search in changing situations. In Proceedings of the 28th international conference on machine learning, pp. 817–824. Cited by: §1, §3.
 Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems, pp. 1089–1096. Cited by: §5.1.

 Relative entropy policy search. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, Cited by: §1.
 Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on machine learning, Cited by: §1, §3.
 Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1889–1897. Cited by: §1.
 Efficient sample reuse in EM-based policy search. In Proceedings of the European Conference on Machine Learning, Cited by: §3.
 Reinforcement learning: an introduction. MIT press. Cited by: §2.
 General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, pp. 4286–4292. Cited by: §1, §3.
 Robot trajectory optimization using approximate inference. In Proceedings of the 26th International Conference on Machine Learning, pp. 1049–1056. Cited by: §1, §3.
 On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of Robotics: Science and Systems, Cited by: §1.
 Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems 28, pp. 2746–2754. Cited by: §7.
Appendix A Proofs of Section 4
A.1 Proof of Lemma 1
Before proving Lemma 1, we first state and prove the following results.
Lemma 5.
For any state , action-value function , and policy , we have
(29) 
Analogously, for any state-action pair , value function , and transition kernel , we have
(30) 
Proof.
We only prove (29) here, since the proof of (30) follows similar arguments. The proof of (29) comes from the following sequence of equalities:
(a) This follows from Lemma 4 in Nachum et al. (2017). ∎
We now turn to the proof of our main lemma.