Model-free reinforcement learning (RL) algorithms, which learn a good policy without constructing an explicit model of the system’s dynamics, have shown promising results in complex simulated problems (Mnih et al., 2013, 2015; Schulman et al., 2015; Haarnoja et al., 2018). However, these methods are not sample efficient, and thus not suitable for problems in which data collection is burdensome. Model-based RL algorithms address the data-efficiency issue of model-free methods by learning a model and combining model-generated data with data collected from interaction with the real system (Janner et al., 2019). However, designing model-based RL algorithms is often challenging because bias in the model may affect the process of learning policies and result in worse asymptotic performance than the model-free counterparts. A potential solution to this challenge is to incorporate the policy/value optimization method in the process of learning the model (Farahmand, 2018; Abachi et al., 2020). An ideal case here would be to have a universal objective function that is used to learn and improve the model and the policy jointly.
Casting RL as probabilistic inference has a long history (Todorov, 2008; Toussaint, 2009; Kappen et al., 2012; Vijayakumar, 2013). This formulation has the advantage of allowing powerful tools for approximate inference to be employed in RL. One such class of tools is variational techniques (Hoffman et al., 2013), which have been successfully used in RL (Neumann, 2011; Levine and Koltun, 2013; Abdolmaleki et al., 2018). Another formulation of RL with a strong connection to probabilistic inference is the formulation of policy search as an expectation maximization (EM) style algorithm (Dayan and Hinton, 1997; Peters and Schaal, 2007; Peters et al., 2010; Neumann, 2011; Chebotar et al., 2017; Abdolmaleki et al., 2018). The main idea here is to write the expected return of a policy as a (pseudo)-likelihood function, and then, assuming success in maximizing the return, to find the policy that most likely would have been taken. Another class of RL algorithms related to the RL-as-inference formulation is entropy-regularized algorithms, which add an entropy term to the reward function and find the soft-max optimal policy (Levine and Koltun, 2014; Levine and Abbeel, 2014; Nachum et al., 2017, 2018; Haarnoja et al., 2018; Fellows et al., 2019). For a comprehensive tutorial on RL as probabilistic inference, we refer readers to Levine (2018).
In this paper, we leverage the connection between RL and probabilistic inference, and formulate an objective function for jointly learning and improving the model and the policy as a variational lower-bound of a log-likelihood. This allows us to use EM, and iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), followed by improving the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (PI) style algorithms for the E-step and show how the variational distribution that they learn can be used to optimize the M-step, only from model-generated samples. Both resulting algorithms are model-based, but they differ in whether the E-step is solved with a model-based or a model-free method. Our experiments on a number of continuous control tasks show that although our algorithm that uses model-based PI for the E-step, which we call variational model-based policy optimization (VMBPO), is more complex than its model-free counterpart, it is more sample-efficient and robust to hyper-parameter tuning. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms, including model-based policy optimization (MBPO; Janner et al., 2019) and maximum a posteriori policy optimization (MPO; Abdolmaleki et al., 2018), and show its sample efficiency and performance.
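We study the reinforcement learning (RL) problem (Sutton and Barto, 2018) in which the agent’s interaction with the environment is modeled as a discrete-time Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, r, p, p_0 \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are state and action spaces; $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function; $p : \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$ ($\Delta_{\mathcal{X}}$ is the set of probability distributions over $\mathcal{X}$) is the transition kernel; and $p_0 \in \Delta_{\mathcal{X}}$ is the initial state distribution. A stationary Markovian policy $\pi : \mathcal{X} \to \Delta_{\mathcal{A}}$ is a probabilistic mapping from states to actions. Each policy $\pi$ is evaluated by its expected return, i.e., $J(\pi) = \mathbb{E}\big[\sum_{t=0}^{T-1} r(x_t, a_t) \mid p_0, p, \pi\big]$, where $T$ is the stopping time, i.e., the random variable of hitting a terminal state. (Footnote 1: Similar to Levine (2018), our setting can be easily extended to infinite-horizon $\gamma$-discounted MDPs. This can be done by modifying the transition kernels, such that any action transitions the system to a terminal state with probability $1 - \gamma$, and all standard transition probabilities are multiplied by $\gamma$.) We denote by $\mathcal{X}^0$ the set of all terminal states. The agent’s goal is to find a policy with maximum expected return, i.e., $\pi^* \in \arg\max_{\pi} J(\pi)$. We denote by $\xi = (x_0, a_0, \ldots, x_{T-1}, a_{T-1}, x_T)$ a system trajectory of length $T$, whose probability under a policy $\pi$ is defined as $p_\pi(\xi) = p_0(x_0) \prod_{t=0}^{T-1} \pi(a_t \mid x_t)\, p(x_{t+1} \mid x_t, a_t)$. Finally, we define $[T] := \{0, \ldots, T-1\}$.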
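The reduction from discounted to terminating MDPs described in the footnote can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state Markov chain induced by some fixed policy (the matrix `P`, rewards `r`, and discount `gamma` are illustrative and not taken from the paper). It compares discounted policy evaluation on the original chain with undiscounted evaluation on the chain augmented with an absorbing terminal state.

```python
def evaluate(P, r, discount=1.0, n_iters=2000):
    """Iterate V <- r + discount * P V, starting from V = 0."""
    n = len(r)
    V = [0.0] * n
    for _ in range(n_iters):
        V = [r[i] + discount * sum(P[i][j] * V[j] for j in range(n))
             for i in range(n)]
    return V


def add_termination(P, gamma):
    """Multiply every transition by gamma and route the remaining
    1 - gamma probability to a new absorbing terminal state (last index)."""
    n = len(P)
    P_mod = [[gamma * P[i][j] for j in range(n)] + [1.0 - gamma]
             for i in range(n)]
    P_mod.append([0.0] * n + [1.0])  # the terminal state loops on itself
    return P_mod


# Hypothetical two-state chain induced by some fixed policy.
gamma = 0.9
P = [[0.5, 0.5], [0.2, 0.8]]
r = [1.0, -0.5]

V_discounted = evaluate(P, r, discount=gamma)
# Undiscounted evaluation on the modified chain; the terminal state has reward 0.
V_terminating = evaluate(add_termination(P, gamma), r + [0.0])[:2]
```

Under these assumptions the two value vectors agree up to floating-point error, because the reward at step $t$ is collected only if the process has survived $t$ transitions, which happens with probability $\gamma^t$.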
3 Policy Optimization as Probabilistic Inference
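The goal in the conventional RL formulation is to find a policy whose generated trajectories maximize the expected return. In contrast, in the inference formulation, we start with a prior over trajectories and then estimate the posterior conditioned on a desired outcome, such as reaching a goal state. In this formulation, the notion of a desired (optimal) outcome is introduced via independent binary random variables $\mathcal{O}_t$, where $\mathcal{O}_t = 1$ denotes that we acted optimally at time $t$. The likelihood of $\mathcal{O}_t$, given the state $x_t$ and action $a_t$, is modeled as $p(\mathcal{O}_t = 1 \mid x_t, a_t) = \exp(\eta \cdot r(x_t, a_t))$, where $\eta > 0$ is a temperature parameter. This allows us to define the log-likelihood of a policy $\pi$ being optimal as
$$\log p_\pi(\mathcal{O}_{0:T-1} = 1) = \log \int_\xi p(\mathcal{O}_{0:T-1} = 1 \mid \xi)\, p_\pi(\xi)\, d\xi = \log \mathbb{E}_{\xi \sim p_\pi}\big[ p(\mathcal{O}_{0:T-1} = 1 \mid \xi) \big], \tag{1}$$
where $p(\mathcal{O}_{0:T-1} = 1 \mid \xi)$ is the optimality likelihood of trajectory $\xi$ and is defined as
$$p(\mathcal{O}_{0:T-1} = 1 \mid \xi) = \prod_{t=0}^{T-1} p(\mathcal{O}_t = 1 \mid x_t, a_t) = \exp\Big( \eta \sum_{t=0}^{T-1} r(x_t, a_t) \Big). \tag{2}$$
As a result, finding an optimal policy in this setting would be equivalent to maximizing the log-likelihood in Eq. 1, i.e., $\pi^* \in \arg\max_\pi \log p_\pi(\mathcal{O}_{0:T-1} = 1)$.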
A potential advantage of formulating RL as an inference problem is the possibility of using a wide range of approximate inference algorithms, including variational methods. In variational inference, we approximate a distribution $p(\cdot)$ with a potentially simpler (e.g., tractable factored) distribution $q(\cdot)$ in order to make the whole inference process more tractable. If we approximate $p_\pi(\xi)$ with a variational distribution $q(\xi)$, we will obtain the following variational lower-bound for the log-likelihood in Eq. 1:
$$\log p_\pi(\mathcal{O}_{0:T-1} = 1) = \log \mathbb{E}_{\xi \sim q}\Big[ \exp\Big( \eta \sum_{t=0}^{T-1} r(x_t, a_t) \Big) \frac{p_\pi(\xi)}{q(\xi)} \Big] \overset{(a)}{\ge} \mathbb{E}_{\xi \sim q}\Big[ \eta \sum_{t=0}^{T-1} r(x_t, a_t) - \log \frac{q(\xi)}{p_\pi(\xi)} \Big] := \mathcal{J}(q; \pi). \tag{3}$$
(a) is from Jensen’s inequality, and $\mathcal{J}(q; \pi)$ is the evidence lower-bound (ELBO) of the log-likelihood function. A variety of algorithms have been proposed (e.g., Peters and Schaal 2007; Sugiyama 2009; Neumann 2011; Levine and Koltun 2013; Abdolmaleki et al. 2018; Fellows et al. 2019), whose main idea is to approximate $\pi^*$ by maximizing $\mathcal{J}(q; \pi)$ w.r.t. both $q$ and $\pi$. This often results in an expectation-maximization (EM) style algorithm in which we first fix $\pi$ and maximize $\mathcal{J}(\cdot\,; \pi)$ for $q$ (E-step), and then for the $q$ obtained in the E-step, we maximize $\mathcal{J}(q; \cdot)$ for $\pi$ (M-step).
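The lower-bound property can be illustrated on a toy problem. The sketch below assumes a hypothetical setting with only four possible trajectories, a prior `prior` playing the role of the trajectory distribution under the baseline policy, and total rewards `returns`; all names and numbers are illustrative. It evaluates the log-likelihood, the ELBO for a uniform variational distribution, and the ELBO at the exact posterior, which is proportional to the prior times the exponentiated return.

```python
import math


def log_likelihood(prior, returns, eta):
    """log E_{xi ~ prior}[exp(eta * R(xi))]."""
    return math.log(sum(p * math.exp(eta * R) for p, R in zip(prior, returns)))


def elbo(q, prior, returns, eta):
    """E_{xi ~ q}[eta * R(xi) - log(q(xi) / prior(xi))]; terms with q = 0 add 0."""
    return sum(qi * (eta * R - math.log(qi / p))
               for qi, p, R in zip(q, prior, returns) if qi > 0.0)


def posterior(prior, returns, eta):
    """Trajectory posterior, proportional to prior(xi) * exp(eta * R(xi))."""
    w = [p * math.exp(eta * R) for p, R in zip(prior, returns)]
    z = sum(w)
    return [wi / z for wi in w]


# Hypothetical toy problem with four possible trajectories.
eta = 0.5
prior = [0.4, 0.3, 0.2, 0.1]        # trajectory probabilities under the baseline
returns = [-3.0, -1.0, -2.0, 0.0]   # total reward of each trajectory

target = log_likelihood(prior, returns, eta)
loose = elbo([0.25] * 4, prior, returns, eta)
tight = elbo(posterior(prior, returns, eta), prior, returns, eta)
```

In this sketch `loose` stays below `target`, while `tight` matches it, which is the sense in which the E-step tightens the bound before the M-step improves the baseline policy.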
4 Variational Model-based Policy Optimization
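In this section, we describe the ELBO objective function used by our algorithms, study the properties of the resulting optimization problem, and propose algorithms to solve it. We propose to use the variational distribution $q(\xi) = p_0(x_0) \prod_{t=0}^{T-1} q_c(a_t \mid x_t)\, q_d(x_{t+1} \mid x_t, a_t)$ to approximate $p_\pi(\xi)$. Note that $q$ shares the same initial state distribution as $p_\pi$, but has a different control strategy (policy), $q_c$, and dynamics, $q_d$. Using this variational distribution, we may write the ELBO objective of (3) as
$$\mathcal{J}(q; \pi) = \mathbb{E}_{\xi \sim q}\Big[ \sum_{t=0}^{T-1} \Big( \eta\, r(x_t, a_t) - \log \frac{q_c(a_t \mid x_t)}{\pi(a_t \mid x_t)} - \log \frac{q_d(x_{t+1} \mid x_t, a_t)}{p(x_{t+1} \mid x_t, a_t)} \Big) \Big]. \tag{4}$$
To maximize $\mathcal{J}(q; \pi)$ w.r.t. $q$ and $\pi$, we first fix $\pi$ and compute the variational distribution (E-step):
$$q^* = (q_c^*, q_d^*) \in \arg\max_{q_c,\, q_d} \mathcal{J}(q; \pi), \tag{5}$$
and then optimize $\pi$ given $q^*$, i.e., $\arg\max_\pi \mathcal{J}(q^*; \pi)$ (M-step). Note that in (5), $q_c^*$ and $q_d^*$ are both functions of $\pi$, but we remove $\pi$ from the notation to keep it lighter.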
In our formulation (choice of the variational distribution $q$), the M-step is independent of the true dynamics, $p$, and thus can be implemented offline (using samples generated by the model $q_d$). Moreover, as will be seen in Section 5, we also use the model, $q_d$, in the E-step. As discussed throughout the paper, using simulated samples (from $q_d$) and reducing the need for real samples (from $p$) is an important feature of our proposed model-based formulation and algorithm.
There are similarities between our variational formulation and the one used in the maximum a posteriori policy optimization (MPO) algorithm (Abdolmaleki et al., 2018). However, MPO sets its variational dynamics, $q_d$, to the dynamics of the real system, $p$, which results in a model-free algorithm, while our approach is model-based, since we learn $q_d$ and use it to generate samples in both the E-step and the M-step of our algorithms.
In the rest of this section, we study the E-step optimization (5) and propose algorithms to solve it.
4.1 Properties of the E-step Optimization
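We start by defining two Bellman-like operators related to the E-step optimization (5). For any variational policy $q_c$ and any value function $V$, such that $V(x) = 0$ for all terminal states $x$, we define the $q_c$-induced operator $\mathcal{T}_{q_c}$ and the optimal operator $\mathcal{T}$ as
$$\mathcal{T}_{q_c}[V](x) := \mathbb{E}_{a \sim q_c(\cdot \mid x)}\Big[ \eta\, r(x, a) - \log \frac{q_c(a \mid x)}{\pi(a \mid x)} + \max_{q_d} \mathbb{E}_{x' \sim q_d(\cdot \mid x, a)}\Big[ V(x') - \log \frac{q_d(x' \mid x, a)}{p(x' \mid x, a)} \Big] \Big], \tag{6}$$
$$\mathcal{T}[V](x) := \max_{q_c} \mathcal{T}_{q_c}[V](x). \tag{7}$$
We also define the optimal value function of the E-step, $V_\pi$, as
$$V_\pi(x) := \mathbb{E}_{\xi \sim q^*}\Big[ \sum_{t=0}^{T-1} \Big( \eta\, r(x_t, a_t) - \log \frac{q_c^*(a_t \mid x_t)}{\pi(a_t \mid x_t)} - \log \frac{q_d^*(x_{t+1} \mid x_t, a_t)}{p(x_{t+1} \mid x_t, a_t)} \Big) \,\Big|\, x_0 = x \Big]. \tag{8}$$
For any value function $V$, we define its associated action-value function $Q_V$ as
$$Q_V(x, a) := \eta\, r(x, a) + \log \mathbb{E}_{x' \sim p(\cdot \mid x, a)}\big[ \exp(V(x')) \big]. \tag{9}$$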
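Lemma 2. The $q_c$-induced, $\mathcal{T}_{q_c}$, and optimal, $\mathcal{T}$, operators are monotonic contractions. Moreover, the optimal value function $V_\pi$ is the unique fixed-point of $\mathcal{T}$, i.e., $\mathcal{T}[V_\pi](x) = V_\pi(x)$ for all $x$.
Proposition 1. The E-step optimal value function, $V_\pi$, and its associated action-value function, $Q_\pi := Q_{V_\pi}$, defined by (9), have the following relationship: $V_\pi(x) = \log \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[ \exp(Q_\pi(x, a)) \big]$.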
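In the rest of this section, we show how to derive a closed-form expression for the variational distribution $q^* = (q_c^*, q_d^*)$. For any value function $V$, we define its corresponding variational dynamics, $q_d^V$, as the solution to the maximization problem in the definition of $\mathcal{T}_{q_c}$ (see Eq. 6), i.e.,
$$q_d^V(\cdot \mid x, a) \in \arg\max_{q_d} \mathbb{E}_{x' \sim q_d(\cdot \mid x, a)}\Big[ V(x') - \log \frac{q_d(x' \mid x, a)}{p(x' \mid x, a)} \Big]. \tag{12}$$
We now derive closed-form expressions for the variational distributions $q_d^V$ and $q_c^{Q_V}$.
The variational dynamics and policy corresponding to a value function $V$ and its associated action-value function $Q_V$ can be written in closed-form as (proof in Appendix A.4)
$$q_d^V(x' \mid x, a) = \frac{p(x' \mid x, a) \exp(V(x'))}{\mathbb{E}_{x'' \sim p(\cdot \mid x, a)}[\exp(V(x''))]} = p(x' \mid x, a) \exp\big( \eta\, r(x, a) + V(x') - Q_V(x, a) \big), \tag{14}$$
$$q_c^{Q_V}(a \mid x) = \frac{\pi(a \mid x) \exp(Q_V(x, a))}{\mathbb{E}_{a' \sim \pi(\cdot \mid x)}[\exp(Q_V(x, a'))]}. \tag{15}$$
From (14) and (15), the variational dynamics, $q_d^V$, and policy, $q_c^{Q_V}$, can be seen as an exponential twisting of the dynamics $p$ and policy $\pi$ with weights $V$ and $Q_V$, respectively. In the special case $V = V_\pi$ (the E-step optimal value function), these distributions can be written in closed-form as
$$q_d^*(x' \mid x, a) = p(x' \mid x, a) \exp\big( \eta\, r(x, a) + V_\pi(x') - Q_\pi(x, a) \big), \qquad q_c^*(a \mid x) = \pi(a \mid x) \exp\big( Q_\pi(x, a) - V_\pi(x) \big), \tag{16}$$
where the denominator of $q_c^*$ is obtained by applying Proposition 1 to replace $\mathbb{E}_{a \sim \pi(\cdot \mid x)}[\exp(Q_\pi(x, a))]$ with $\exp(V_\pi(x))$.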
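The exponential twisting described above is straightforward to compute in a tabular setting. The following is a minimal sketch for a single state with two actions and three next states; the transition probabilities, rewards, values, and the assumed form of the action value (reward term plus the log-normalizer of the twisted dynamics) are illustrative choices, not values from the paper.

```python
import math


def twist(base, weights):
    """Return base[i] * exp(weights[i]) / Z for every i, together with log Z."""
    unnorm = [b * math.exp(w) for b, w in zip(base, weights)]
    z = sum(unnorm)
    return [u / z for u in unnorm], math.log(z)


eta = 1.0
# Hypothetical quantities for one state x with two actions and three next states.
p = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # p(. | x, a) for a = 0, 1
r = [-0.5, -1.0]                          # r(x, a)
V_next = [-2.0, -1.0, 0.0]                # V(x') for each next state
pi = [0.5, 0.5]                           # baseline policy pi(. | x)

q_d, Q = [], []
for a in range(2):
    q_da, log_z = twist(p[a], V_next)     # dynamics twisted by V
    q_d.append(q_da)
    Q.append(eta * r[a] + log_z)          # assumed action-value form

q_c, V_x = twist(pi, Q)                   # policy twisted by Q; V_x = log E_pi[exp(Q)]
```

Here the twisted dynamics move probability mass toward next states with higher value, and the twisted policy moves mass toward actions with higher action value, while both remain absolutely continuous with respect to the distributions they twist.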
4.2 Policy and Value Iteration Algorithms for the E-step
Using the results of Section 4.1, we now propose model-based and model-free dynamic programming (DP) style algorithms, i.e., policy iteration (PI) and value iteration (VI), for solving the E-step problem (5). The model-based algorithms compute the variational dynamics, $q_d$, at each iteration, while the model-free ones compute $q_d$ only at the end (upon convergence). Having access to $q_d$ at each iteration has the advantage that we may generate samples from the model, $q_d$, when we implement the sample-based version (RL version) of these DP algorithms in Section 5.
In the model-based PI algorithm, at each iteration $k$, given the current variational policy $q_c^{(k)}$, we perform the following steps:
Policy Evaluation: Compute the $q_c^{(k)}$-induced value function (the fixed-point of the operator $\mathcal{T}_{q_c^{(k)}}$) by iteratively applying $\mathcal{T}_{q_c^{(k)}}$ from Eq. 6, where the variational model $q_d$ in (6) is computed using Eq. 14 with the current value function iterate. We then compute the corresponding action-value function, using Eq. 9.
Policy Improvement: Update the variational distribution $q_c^{(k+1)}$ using Eq. 15 with the action-value function computed in the policy evaluation step. (Footnote 2: When the number of actions is large, the denominator of (15) cannot be computed efficiently. In this case, we replace (15) in the policy improvement step of our PI algorithms with , where . We also prove the convergence of our PI algorithms with this update in Appendix A.5.)
Upon convergence of the variational policy to $q_c^*$, we compute $q_d^*$ from Eq. 14 and return $q^* = (q_c^*, q_d^*)$.
The model-free PI algorithm is exactly the same, except that in its policy evaluation step, the $q_c$-induced operator, $\mathcal{T}_{q_c}$, is applied using Eq. 10 (without the variational dynamics $q_d$). In this case, the variational dynamics, $q_d^*$, is computed only upon convergence, using Eq. 14.
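The model-based PI procedure can be sketched on a small tabular problem. The code below is an illustration under stated assumptions rather than the authors' implementation: the three-state MDP (`P`, `R`), the baseline policy `PI`, and the temperature `ETA` are hypothetical, rewards are negative and every action can terminate so that the iterations converge, and the backup assumes the KL-regularized form of the induced operator with the inner maximization solved by exponential twisting.

```python
import math

ETA = 1.0
TERMINAL = 2                              # states 0 and 1 are non-terminal
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical MDP: P[x][a] is p(. | x, a) over the states (0, 1, TERMINAL).
P = {0: {0: [0.6, 0.3, 0.1], 1: [0.1, 0.5, 0.4]},
     1: {0: [0.3, 0.5, 0.2], 1: [0.0, 0.2, 0.8]}}
R = {0: {0: -1.0, 1: -2.0}, 1: {0: -1.0, 1: -0.5}}
PI = {0: [0.5, 0.5], 1: [0.5, 0.5]}       # baseline policy pi(a | x)


def twisted_dynamics(V, x, a):
    """q_d(. | x, a) proportional to p(. | x, a) * exp(V(.)), and its log-normalizer."""
    w = [p * math.exp(v) for p, v in zip(P[x][a], V)]
    z = sum(w)
    return [wi / z for wi in w], math.log(z)


def backup(V, q_c, x):
    """One application of the q_c-induced operator at state x (model-based form)."""
    total = 0.0
    for a in ACTIONS:
        q_d, _ = twisted_dynamics(V, x, a)
        # Expectation under the variational model of V(x') - log(q_d / p).
        inner = sum(q * (V[j] - math.log(q / P[x][a][j]))
                    for j, q in enumerate(q_d) if q > 0.0)
        total += q_c[x][a] * (ETA * R[x][a]
                              - math.log(q_c[x][a] / PI[x][a]) + inner)
    return total


def evaluate(q_c, n_iters=200):
    """Policy evaluation: iterate the induced operator from V = 0."""
    V = [0.0, 0.0, 0.0]                   # the value of the terminal state stays 0
    for _ in range(n_iters):
        V = [backup(V, q_c, x) for x in STATES] + [0.0]
    return V


def improve(V):
    """Policy improvement: twist the baseline policy by the action values."""
    q_c = {}
    for x in STATES:
        w = [PI[x][a] * math.exp(ETA * R[x][a] + twisted_dynamics(V, x, a)[1])
             for a in ACTIONS]
        q_c[x] = [wi / sum(w) for wi in w]
    return q_c


def policy_iteration(n_rounds=50):
    q_c = {x: list(PI[x]) for x in STATES}
    for _ in range(n_rounds):
        q_c = improve(evaluate(q_c))
    V = evaluate(q_c)
    q_d = {x: {a: twisted_dynamics(V, x, a)[0] for a in ACTIONS} for x in STATES}
    return V, q_c, q_d


V_star, q_c_star, q_d_star = policy_iteration()
```

In this toy problem the returned value function satisfies the log-sum-exp relationship between the optimal value and action-value functions, which is one way to sanity-check a tabular implementation before moving to function approximation.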
We can similarly derive model-based and model-free value iteration (VI) algorithms for the E-step. These algorithms start from an arbitrary value function, $V$, and iteratively apply the optimal operator, $\mathcal{T}$, from Eqs. 6 and 7 (model-based) and Eq. 11 (model-free) until convergence, i.e., $V_\pi(x) = \lim_{n \to \infty} \mathcal{T}^n[V](x)$. Given $V_\pi$, these algorithms first compute $Q_\pi$ from Proposition 1, and then compute ($q_c^*$, $q_d^*$) using Eq. 16. From the properties of the optimal operator in Lemma 2, it is easy to see that both model-based and model-free VI algorithms converge to $q_c^*$ and $q_d^*$.
In this paper, we focus on the PI algorithms, in particular the model-based one, and leave the VI algorithms for future work. In the next section, we show how the PI algorithms can be implemented and combined with a routine for solving the M-step, when the true MDP model, $p$, is unknown (the RL setting) and the state and action spaces are large enough to require function approximation.
5 Variational Model-based Policy Optimization Algorithm
In this section, we propose an RL algorithm, called variational model-based policy optimization (VMBPO). VMBPO is an EM-style algorithm based on the variational formulation proposed in Section 4. The E-step of VMBPO is the sample-based implementation of the model-based PI algorithm, described in Section 4.2. We describe the E-step and M-step of VMBPO in detail in Sections 5.1 and 5.2, and report its pseudo-code in Algorithm 1 in Appendix B. VMBPO uses neural networks to represent: policy , variational dynamics , variational policy , log-likelihood ratio , value function , action-value function , target value function , and target action-value function , with parameters , , , , , , , and , respectively.
5.1 The E-step of VMBPO
At the beginning of the E-step, we generate a number of samples from the current baseline policy , i.e., and , and add them to the buffer . The E-step consists of four updates: 1) computing the variational dynamics $q_d$, 2) estimating the log-likelihood ratio , 3) computing the $q_c$-induced value, , and action-value, , functions (critic update), and finally 4) computing the new variational policy (actor update). We describe the details of each step below.
Step 1. (Computing ) We find as the solution to the optimization problem (12) for equal to the target value network . Since the in (14) is the solution of (12), we compute by minimizing , which results in the following forward KL loss, for all and :
where (a) is by removing the terms of (17) that do not depend on the parameters of $q_d$. We update these parameters by taking several steps in the direction of the gradient of a sample average of the loss function (18), i.e.,
where the samples are randomly drawn from the buffer. The intuition here is to focus on learning the dynamics model in regions of the state-action space that have a higher temporal difference, i.e., regions with higher anticipated future return. Note that we can also obtain $q_d$ by optimizing the reverse KL direction in (17), but since it results in a more involved update, we do not report it here.
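The idea of fitting the variational dynamics by a weighted maximum-likelihood update can be sketched with a tabular softmax model. This is an illustration under assumptions, not the paper's loss: it assumes that the forward KL objective reduces to a log-likelihood weighted by the exponentiated temporal difference, `exp(eta * r + V(x') - Q(x, a))`, uses a single hypothetical state-action pair, and replaces the neural network with a vector of logits `theta`.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [ei / z for ei in e]


def fit_variational_dynamics(next_states, weights, n_states, lr=0.5, n_steps=2000):
    """Gradient descent on the weighted negative log-likelihood of a categorical model."""
    n = len(next_states)
    c = [0.0] * n_states              # c[j]: average weight of samples landing in state j
    for x_next, w in zip(next_states, weights):
        c[x_next] += w / n
    total = sum(c)
    theta = [0.0] * n_states
    for _ in range(n_steps):
        q = softmax(theta)
        # The gradient of the loss w.r.t. theta[j] is total * q[j] - c[j].
        theta = [t - lr * (total * qj - cj) for t, qj, cj in zip(theta, q, c)]
    return softmax(theta)


# Hypothetical single state-action pair with three possible next states.
rng = random.Random(0)
eta, r_xa = 1.0, 0.5
p_true = [0.7, 0.2, 0.1]              # true dynamics p(. | x, a)
V = [0.0, 1.0, 3.0]                   # target value of each next state
Q_xa = eta * r_xa + math.log(sum(p * math.exp(v) for p, v in zip(p_true, V)))

samples = rng.choices(range(3), weights=p_true, k=20000)   # "real" transitions
td_weights = [math.exp(eta * r_xa + V[x] - Q_xa) for x in samples]
q_hat = fit_variational_dynamics(samples, td_weights, 3)
q_exact = [p * math.exp(eta * r_xa + v - Q_xa) for p, v in zip(p_true, V)]
```

With enough samples, `q_hat` approaches the exponentially twisted dynamics `q_exact`, even though all samples come from the true dynamics; transitions with a large temporal difference simply receive more weight in the fit.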
Step 2. (Computing ) Using the duality of f-divergence (Nguyen et al., 2008) w.r.t. the reverse KL-divergence, the log-likelihood ratio is a solution to
for all and . Note that the optimizer of (20) is unique almost surely (at with ), because is absolutely continuous w.r.t. (see the definition of in Eq. 14) and the objective function of (20) is strictly concave. The optimization problem (20) allows us to compute as an approximation to the log-likelihood ratio . We update by taking several steps in the direction of the gradient of a sample average of (20), i.e.,
where is the set of samples for which is drawn from the variational dynamics, i.e., . Here we first sample randomly from and use them in the second sum. Then, for all that have been sampled, we generate from and use the resulting samples in the first sum.
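The density-ratio step can be illustrated in a tabular setting. The sketch below assumes the standard variational (dual) form of the KL-divergence, in which one maximizes the expectation of `nu` under the variational dynamics minus the expectation of `exp(nu)` under the true dynamics; the distributions `p` and `q` are hypothetical, and exact expectations are used in place of the two sample sums to keep the example deterministic.

```python
import math


def fit_log_ratio(q, p, lr=0.5, n_steps=5000):
    """Gradient ascent on sum_j q[j] * nu[j] - sum_j p[j] * exp(nu[j])."""
    nu = [0.0] * len(p)
    for _ in range(n_steps):
        # The partial derivative w.r.t. nu[j] is q[j] - p[j] * exp(nu[j]).
        nu = [v + lr * (qj - pj * math.exp(v)) for v, qj, pj in zip(nu, q, p)]
    return nu


# Hypothetical next-state distributions for one state-action pair.
p = [0.7, 0.2, 0.1]     # true dynamics
q = [0.2, 0.2, 0.6]     # variational dynamics
nu = fit_log_ratio(q, p)
log_ratio = [math.log(qj / pj) for qj, pj in zip(q, p)]
```

Because the objective is strictly concave in `nu`, the maximizer is unique wherever the true dynamics has positive probability, and in this example it coincides with the log-likelihood ratio `log(q / p)`.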
Step 3. (critic update) To compute (fixed-point of ) and its action-value , we first rewrite (6) with the maximizer from Step 1 and the log-likelihood ratio from Step 2:
Since the expectation in (23) is w.r.t. the variational dynamics (model) , we can estimate only with samples generated from the model. We do this by taking several steps in the direction of the gradient of a sample average of the square-loss obtained by setting the two sides of (23) equal, i.e.,
Note that in (23), the actions are generated by the variational policy $q_c$. Thus, in (24), we first randomly sample a state from the buffer, then sample an action from $q_c$, and finally draw the next state from the variational dynamics $q_d$. If the reward function is known (chosen by the designer of the system), then it is used to generate the reward signals in (24); otherwise, a reward model has to be learned.
After estimating , we approximate , the fixed-point of , using definition in (10) as . This results in updating by taking several steps in the direction of the gradient of a sample average of the square-loss obtained by setting the two sides of the above equation to be equal, i.e.,
where is randomly sampled and (without sampling from the true environment).
Step 4. (actor update) We update the variational policy (policy improvement) by solving the optimization problem (13) for the estimated by the critic in Step 3. Since the that optimizes (13) can be written as (15), we update it by minimizing . This results in the following reverse KL loss, for all : . If we reparameterize using a transformation , where is a Gaussian noise, we can update by taking several steps in the direction of the gradient of a sample average of the above loss, i.e.,
5.2 The M-step of VMBPO
As described in Section 4, the goal of the M-step is to improve the baseline policy $\pi$, given the variational model $q^* = (q_c^*, q_d^*)$ learned in the E-step, by solving the following optimization problem
A nice feature of (27) is that it can be solved using only the variational model $q^*$, without the need for samples from the true environment $p$. However, it is easy to see that if the policy space considered in the M-step, $\Pi$, contains the one used for $q_c$ in the E-step, then we can trivially solve the M-step by setting $\pi = q_c^*$. Although this is an option, it is more efficient in practice to solve a regularized version of (27). A practical way to regularize (27) is to make sure that the new baseline policy remains close to the old one, which results in the following optimization problem
This is equivalent to the weighted MAP formulation used in the M-step of MPO (Abdolmaleki et al., 2018). In MPO, the authors define a prior over the policy parameter and add its logarithm to the objective function of (27). Then, they set the prior to a specific Gaussian and obtain an optimization problem similar to (28) (see Section 3.3 in Abdolmaleki et al. 2018). However, since $q_d = p$ in their variational model (their approach is model-free), they need real samples to solve their optimization problem, while we can solve (28) using only simulated samples (our approach is model-based).
To illustrate the effectiveness of VMBPO, we (i) compare it with several state-of-the-art RL methods on multiple domains, and (ii) assess its sample efficiency via an ablation analysis.
Comparison with Baseline RL Algorithms
We compare VMBPO with five baseline methods: MPO (Abdolmaleki et al., 2018) and SAC (Haarnoja et al., 2018), two popular model-free deep RL algorithms, and STEVE (Buckman et al., 2018), PETS (Chua et al., 2018), and MBPO (Janner et al., 2019), three recent model-based RL algorithms. We also compare with the (E-step) model-free variant of VMBPO, which we call VMBPO-MFE (see Appendix C for details). We evaluate the algorithms on one classical control benchmark (Pendulum) and five MuJoCo benchmarks (Hopper, Walker2D, HalfCheetah, Reacher, Reacher7DoF). The neural network architectures (for the dynamics model, value functions, and policies) of VMBPO are similar to those of MBPO. Details on network architectures and hyperparameters are described in Appendix D. Since we parameterize $q_c$ in the E-step of VMBPO, according to Section 5.2, in the M-step we simply set $\pi = q_c$. For the more difficult environments (Walker2D, HalfCheetah), the number of training steps is set to , while for the medium one (Hopper) and for the simpler ones (Pendulum, Reacher, Reacher7DoF), it is set to and respectively. Policy performance is evaluated every training iterations. Each measurement is an average return over episodes, each generated with a separate random seed. To smooth learning curves, data points are averaged over a sliding window of size .
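The sliding-window smoothing of learning curves can be sketched as follows. This is a minimal illustration that assumes a trailing window (the text does not say whether the window is trailing or centered), and the function name `smooth`, the window size, and the data are hypothetical.

```python
def smooth(values, window):
    """Trailing moving average; early points average over the data seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical sequence of evaluation returns.
returns = [1.0, 2.0, 3.0, 4.0]
smoothed = smooth(returns, window=2)
```

A trailing window keeps each smoothed point causal, so it never uses returns from later evaluations.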
The results show the average return of VMBPO, VMBPO-MFE, and the baselines under the best hyperparameter configurations. VMBPO outperforms the baseline algorithms in most of the benchmarks, with significantly faster convergence and a higher reward. This supports our conjectures about VMBPO: (i) utilizing synthetic data from the learned dynamics model generally improves the data efficiency of RL; (ii) the extra improvement in VMBPO is attributed to the fact that the model is learned from the universal RL objective function. On the other hand, VMBPO-MFE outperforms MPO in 4 out of 6 domains. However, in some cases its learning may degrade, which leads to poor performance. This is due to the instability caused by sample variance amplification in critic learning with exponential-TD minimization (see Eq. 34 in Section C.1). To alleviate this issue, one may introduce a temperature term to the exponential-TD update (Borkar, 2002). However, tuning this hyper-parameter can be quite non-trivial. (Footnote 3: The variance is further amplified with a large temperature, but the critic learning is hampered by a small one.) Table 4 (Appendix D.1) and Figure 1 show the summary statistics averaged over all hyper-parameter/random-seed configurations and illustrate the sensitivity to hyperparameters of each method. VMBPO is more robust (with best performance on all the tasks) than other baselines. This corroborates the hypothesis that MBRL is generally more robust to hyperparameters than its model-free counterparts.
We now study how the data efficiency of VMBPO depends on the number of samples generated from the dynamics model $q_d$. For simplicity, we only experiment with 3 standard benchmarks (Pendulum, Hopper, HalfCheetah) and with fewer learning steps (, , ). At each step, we update the actor and critic using synthetic samples. Figure 2 shows the statistics averaged over all hyper-parameter/random-seed configurations and illustrates how synthetic data can help with policy learning. The results show that increasing the amount of synthetic data generally improves the policy convergence rate. In the early phase, when the dynamics model is inaccurate, sampling data from it may slow down learning, while in the later phase, with an improved model, adding more synthetic data leads to a more significant performance boost.
We formulated the problem of jointly learning and improving model and policy in RL as a variational lower-bound of a log-likelihood, and proposed EM-type algorithms to solve it. Our algorithms, called variational model-based policy optimization (VMBPO), use model-based policy iteration for solving the E-step. We compared our (E-step) model-based and model-free algorithms with each other, and with a number of state-of-the-art model-based (e.g., MBPO) and model-free (e.g., MPO) RL algorithms, and showed their sample efficiency and performance.
We briefly discussed VMBPO style algorithms in which the E-step is solved by model-based policy iteration methods. However, fully implementing these algorithms and studying their relationship with the existing methods requires more work, which we leave for future work. Other future directions are: 1) finding a more efficient implementation for VMBPO, and 2) using VMBPO style algorithms to solve control problems from high-dimensional observations, by learning a low-dimensional latent space and a latent space dynamics, and performing control there. This class of algorithms is referred to as learning controllable embedding (Watter et al., 2015; Levine et al., 2020).
This work proposes methods to jointly learn and improve model and policy in reinforcement learning. We see learning control-aware models as a promising direction to address an important challenge in model-based reinforcement learning: creating a balance between the bias in simulated data and the ease of data generation (sample efficiency).
- Policy-aware model learning for policy gradient methods. preprint arXiv:2003.00030. Cited by: §1.
- Maximum a posteriori policy optimisation. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §1, §1, §3, §5.1, §5.2, §6, Remark 2.
- Dynamic programming and optimal control. Vol. 2, Athena scientific Belmont, MA. Cited by: §A.2, §A.2.
- Q-learning for risk-sensitive control. Mathematics of operations research 27 (2), pp. 294–311. Cited by: §A.2, §6.
- Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234. Cited by: §6.
- Path integral guided policy search. In IEEE International Conference on Robotics and Automation, Cited by: §1.
- Unsupervised exploration with deep model-based reinforcement learning. Cited by: §6.
- Using expectation-maximization for reinforcement learning. Neural Computation 9 (2), pp. 271–278. Cited by: §1.
- Iterative value-aware model learning. In Advances in Neural Information Processing Systems 31, pp. 9072–9083. Cited by: §1.
- VIREL: a variational inference framework for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 7120–7134. Cited by: §1, §3.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pp. 1861–1870. Cited by: §1, §1, §6.
- Stochastic variational inference. Journal of Machine Learning Research 14 (4), pp. 1303–1347. Cited by: §1.
- When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems 32, pp. 12519–12530. Cited by: §1, §1, §6.
- Optimal control as a graphical model inference problem. Machine Learning 87 (2), pp. 159–182. Cited by: §1, §3.
- Prediction, consistency, curvature: representation learning for locally-linear control. In Proceedings of the 8th International Conference on Learning Representations, Cited by: §7.
- Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, Cited by: §1.
- Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, Cited by: §1, §3.
- Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning, Cited by: §1.
- Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv:1805.00909. Cited by: §1, §3, footnote 1.
- Playing Atari with deep reinforcement learning. preprint arXiv:1312.5602. Cited by: §1.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
- Path consistency learning in Tsallis entropy regularized MDPs. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pp. 979–988. Cited by: §1.
- Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785. Cited by: §A.1, §A.4, §A.4, §1.
- Variational inference for policy search in changing situations. In Proceedings of the 28th international conference on machine learning, pp. 817–824. Cited by: §1, §3.
- Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems, pp. 1089–1096. Cited by: §5.1.
- Relative entropy policy search. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, Cited by: §1.
- Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on machine learning, Cited by: §1, §3.
- Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1889–1897. Cited by: §1.
- Efficient sample reuse in EM-based policy search. In Proceedings of the European Conference on Machine Learning, Cited by: §3.
- Reinforcement learning: an introduction. MIT press. Cited by: §2.
- General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, pp. 4286–4292. Cited by: §1, §3.
- Robot trajectory optimization using approximate inference. In Proceedings of the 26th International Conference on Machine Learning, pp. 1049–1056. Cited by: §1, §3.
- On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of Robotics: Science and Systems, Cited by: §1.
- Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems 28, pp. 2746–2754. Cited by: §7.
Appendix A Proofs of Section 4
A.1 Proof of Lemma 1
Before proving Lemma 1, we first state and prove the following results.
For any state , action-value function , and policy , we have
Analogously, for any state-action pair , value function , and transition kernel , we have
(a) This follows from Lemma 4 in Nachum et al. (2017). ∎
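The identity behind this kind of result can be verified in one line. As a sketch in assumed notation (a base distribution $\pi$, a candidate distribution $q$, and weights $Q$; these symbols are illustrative), define $q^\star(a) = \pi(a)\, e^{Q(a)} / Z$ with $Z = \mathbb{E}_{a \sim \pi}[e^{Q(a)}]$. Then, for any $q$ that is absolutely continuous w.r.t. $\pi$:

```latex
\begin{align*}
\mathbb{E}_{a \sim q}\!\left[ Q(a) - \log\frac{q(a)}{\pi(a)} \right]
  &= \mathbb{E}_{a \sim q}\!\left[ \log Z - \log\frac{q(a)}{q^\star(a)} \right] \\
  &= \log Z - \mathrm{KL}\!\left( q \,\|\, q^\star \right) \;\le\; \log Z .
\end{align*}
```

Equality holds if and only if $q = q^\star$, so the maximum over $q$ equals $\log \mathbb{E}_{a \sim \pi}[e^{Q(a)}]$ and is attained by the exponentially twisted distribution. The same argument applies with a transition kernel in place of $\pi$ and a value function in place of $Q$.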
We now turn to the proof of our main lemma.