Introduction
Reinforcement learning focuses on maximizing the long-term return by interacting with the environment sequentially [Sutton and Barto 1998]. Generally, a reinforcement learning algorithm has two aspects: on the one hand, it estimates the state-action value function (also known as the Q-function); on the other hand, it optimizes or improves the policy to maximize its performance measure.
There are two classes of reinforcement learning methods: model-free and model-based. Model-free methods [Peters and Schaal 2006] estimate and iteratively update the state-action value from rollout samples by Temporal Difference learning [Sutton and Barto 1998]. Model-based methods maintain an approximate model, including the reward function and the state transitions, and then use the approximated rewards and transitions to estimate the value function. Model-based methods are more efficient than model-free methods [Li and Todorov 2004, Levine and Koltun 2013, Montgomery and Levine 2016, Wahlström, Schön, and Deisenroth 2015, Watter et al. 2015], especially in discrete environments, because they reduce the sample complexity. Model-free methods are more generally applicable to continuous and complex control problems but may suffer from high sample complexity [Schulman et al. 2015, Lillicrap et al. 2015].
Model-free methods directly use the immediate rewards and next states from rollout samples and estimate the long-term state-action value by Temporal Difference learning, for example, Sarsa or Q-learning [Sutton and Barto 1998]. The target value is therefore unbiased but may have large variance due to the randomness of the transition dynamics or the off-policy stochastic exploration strategy. Model-based methods predict the immediate reward and the next state from the agent's own belief about the environment. This belief, consisting of the approximate reward function and transition model, is updated as more signals are received from the environment. The estimated target value in model-based methods is thus often biased, due to the approximation error of the model, but it has low variance compared with model-free methods. Combining model-based and model-free learning has been an important question in the recent literature.
We aim to answer the following question: how can model-based and model-free methods be incorporated together for better control?
Dyna-Q [Sutton 1990], the Normalized Advantage Function (NAF) algorithm [Gu et al. 2016], and Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) [Kurutach et al. 2018] use simulated experiences from a learned transition model to supplement the real transition samples. [Cai, Pan, and Tang 2018] uses a convex combination of the two target values as the new target, following the insight that the ensemble could be more accurate. However, a large model prediction loss would also lead to an incorrect update for policy iteration.
In this paper, we incorporate the two methods together to address the tradeoff between exploration and exploitation (EE), using a simple heuristic that adds the discrepancy between the two target values as a relative measure of the exploration value, so as to encourage the agent to explore more difficult transition dynamics.

The EE tradeoff is a fundamental and challenging problem in reinforcement learning, because it is hard to evaluate the value of exploring unfamiliar states and actions. Model-free exploration methods have been widely discussed for decades. To name a few, Upper Confidence Bounds [Auer, Cesa-Bianchi, and Fischer 2002, Li et al. 2010, Chen et al. 2017] adds the upper confidence bound of the uncertainty of the value estimation as an exploration value of actions. Thompson Sampling [Thompson 1933] can also be understood as adding a stochastic exploration value based on the posterior distribution of the value estimation [May et al. 2012]. However, these model-free value-based methods often require a table to store the number of visits to each state-action pair so that the uncertainty of the value estimation can be derived [Tang et al. 2017], so they are not practical for problems with continuous state spaces such as Atari games, or for large-scale real-world problems such as mechanism design [Tang 2017] and e-commerce [Cai et al. 2018a, Cai et al. 2018b]. Model-based methods for deep RL can achieve better exploration by maximizing the information gain [Houthooft et al. 2016] or minimizing the error [Pathak et al. 2017] of the agent's belief; however, they may be unstable due to the difficulty of learning the transition dynamics.

In this paper, based on the discrepancy between the model-free and model-based target values, we present a new algorithm named Policy Optimization with Model-based Exploration (POME), an exploratory modified version of the well-known Proximal Policy Optimization (PPO) algorithm. Different from previous methods, POME uses the model-free sample-based estimator of the advantage function as its base, which is already successfully used in various actor-critic methods including PPO. POME then adds a centralized and clipped exploration bonus onto the advantage value, so as to encourage exploration while stabilizing the performance.
Our intuition is that the discrepancy between the target values of model-based and model-free methods can be understood as the uncertainty of the transition dynamics. A high exploration value means that the transition of the state-action pair is hard to estimate and needs to be explored. We directly add this difference to the advantage estimation as the exploration bonus of the state-action pair. If the exploration value of a state-action pair is higher than average, POME encourages the agent to visit it more frequently in the future so as to better learn the transition dynamics. If not, POME reduces the chance of picking it and gives the agent an opportunity to try other actions.
We verify POME on the Atari 2600 game playing benchmarks, comparing it with the original model-free PPO algorithm and with a model-based extension of PPO. Experimental results show that POME outperforms the original PPO on 33 of 49 Atari games. We also tested two versions of POME, one with a decaying exploration level and the other with a non-decaying exploration level, which leads to an interesting finding: the exploration bonus can improve performance even in the long run.
Background and Notations
A Markov Decision Process (MDP) is defined as the tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho_0)$, where $\mathcal{S}$ is the (discrete or continuous) state space, $\mathcal{A}$ is the (discrete or continuous) action space, $r(s,a)$ is the (immediate) reward function, $P(s'|s,a)$ is the state transition model, and $\rho_0$ is the probability distribution of the initial state $s_0$. The goal of the agent is to find a policy $\pi_\theta$ from a restricted family of parametrized policy functions that maximizes its performance,
(1) $\max_\theta\ \eta(\pi_\theta)$,

where $\eta(\pi)$ is the performance objective defined as

(2) $\eta(\pi) = \mathbb{E}_{s_0, a_0, s_1, \dots}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$,

where $\gamma \in (0,1)$ is a discount factor that balances the short- and long-term returns. For convenience, let $\rho_\pi$ denote the (unnormalized) discounted cumulated state distribution induced by policy $\pi$,

(3) $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s)$,

then the performance objective can be rewritten as $\eta(\pi) = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi(\cdot|s)}[r(s, a)]$.
Since the expected long-term return is unknown, the basic idea behind RL is to construct a tractable estimator to approximate the actual return. Then we can update the policy in a direction along which the performance measure is guaranteed to improve at every step [Kakade and Langford 2002]. Let $Q_\pi(s,a)$ denote the expected action value of action $a$ at state $s$ by following policy $\pi$, i.e.

(4) $Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})\big]$.

And we write the state value and the advantage action value as

(5) $V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q_\pi(s, a)]$,
(6) $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$.

So the state value is the weighted average of the action values, and the advantage function provides a relative measure of the value of each action.
Since we have the Bellman equation

(7) $Q_\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s, a)}[V_\pi(s')]$,

the advantage function can also be written as

(8) $A_\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s, a)}[V_\pi(s')] - V_\pi(s)$.
In practice, estimations of these quantities are used to evaluate the policy and to guide the direction of policy updates.
Policy Optimization Methods
Policy Gradient (PG) methods [Sutton et al. 2000] compute the gradient of $\eta(\pi_\theta)$ w.r.t. the policy parameters $\theta$ and then use gradient ascent to update the policy. The well-known form of the policy gradient writes as follows:

(9) $\nabla_\theta\, \eta(\pi_\theta) = \mathbb{E}_{s \sim \rho_{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, A_{\pi_\theta}(s, a)\big]$.
Instead of directly using the policy gradients, trust region based methods use a surrogate objective to perform policy optimization. Because the trajectory distribution is known only for the prior policy $\pi_{\text{old}}$ before the update, trust region based methods introduce the following local approximation to $\eta(\pi)$ for any untested policy $\pi$:

(10) $L_{\pi_{\text{old}}}(\pi) = \eta(\pi_{\text{old}}) + \mathbb{E}_{s \sim \rho_{\pi_{\text{old}}},\, a \sim \pi(\cdot|s)}\big[A_{\pi_{\text{old}}}(s, a)\big]$.
It is proved [Kakade and Langford 2002, Schulman et al. 2015] that if the new policy is close to $\pi_{\text{old}}$ in the sense of the KL divergence, there is a lower bound on the long-term rewards of the new policy. For the general case, we have the following result:

Theorem 1 (Schulman et al. 2015).
Let $\epsilon = \max_{s,a} |A_\pi(s, a)|$ and $D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi) = \max_s D_{\mathrm{KL}}\big(\pi(\cdot|s) \,\|\, \tilde\pi(\cdot|s)\big)$. The performance of the policy $\tilde\pi$ can be bounded by

(11) $\eta(\tilde\pi) \ \ge\ L_\pi(\tilde\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi)$.
By following Theorem 1, there is a category of policy iteration algorithms with a monotonic improvement guarantee, known as conservative policy iteration [Kakade and Langford 2002]. Among them, Trust Region Policy Optimization (TRPO) is one of the most widely used baselines for optimizing parametric policies in complex or continuous problems. The optimization problem w.r.t. the new parameter $\theta$ is

(12) $\max_\theta\ \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A_{\theta_{\text{old}}}(s, a)\Big]$
s.t. $\max_s D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \le \delta$,

where $\delta$ is a hard constraint on the KL divergence. In practice, the hard KL constraint is softened into an expectation over visited states and moved into the objective with a multiplier $\beta$, so that it becomes an unconstrained optimization problem, i.e.

(13) $\max_\theta\ \mathbb{E}_{s, a}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A_{\theta_{\text{old}}}(s, a) - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)\Big]$,
which stabilizes the standard RL objective.
Proximal Policy Optimization (PPO) [Schulman et al. 2017] can be viewed as a modified version of TRPO, which mainly uses a clipped probability ratio in the objective to avoid excessively large policy updates. It replaces the term inside the expectation in (12) with

(14) $\min\big(r_t(\theta)\, \hat A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, \hat A_t\big)$,

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ denotes the probability ratio.
PPO is empirically more sample efficient than TRPO and has become one of the most frequently used baselines in a wide variety of deep RL tasks.
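As a minimal sketch of the clipped term in (14), the per-sample surrogate can be written as follows (the default `eps=0.2` is a common PPO setting, assumed here for illustration):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate term of PPO: the per-sample value
    min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # taking the minimum gives a pessimistic bound that removes the
    # incentive to move the ratio far outside [1 - eps, 1 + eps]
    return np.minimum(unclipped, clipped)
```

Note that for a negative advantage the minimum picks the clipped branch, so large ratio changes are penalized in both directions.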
Policy Evaluation with Function Approximation
In the previous part, we showed several objectives for policy optimization. These objectives commonly require computing an expectation over state-action pairs obtained by following the current policy, together with an estimator of the advantage values. In on-policy methods such as TRPO and PPO, the expectation is approximated by the empirical average over samples induced by the current policy $\pi_{\theta_{\text{old}}}$. The policy is therefore updated in a manner similar to stochastic gradient descent in modern supervised learning.
Model-free value estimation
Model-free methods estimate the value functions from the samples in the rollout trajectory $\tau = (s_0, a_0, r_0, s_1, \dots)$. By following the Bellman equation, we can estimate the action or advantage value by Temporal Difference learning. When referring to quantities at time step $t$ of a trajectory, we use $s_t$, $a_t$, and $r_t$ for short to denote the state, the action, and the reward $r(s_t, a_t)$, respectively.
Policy-based methods, as described in the previous part, typically need the advantage value $A_\pi(s_t, a_t)$. It is discussed in [Schulman et al. 2015, Wang et al. 2016] that the estimator can be any one of the following without introducing bias: the state-action value $Q_\pi(s_t, a_t)$, the discounted return of the trajectory starting from $(s_t, a_t)$, the one-step temporal difference error (TD error)

(15) $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$,

or the $k$-step cumulated temporal difference error as used in the actual PPO algorithm,

(16) $\hat A_t = \sum_{l=0}^{k-1} (\gamma\lambda)^l\, \delta_{t+l}$,

where $\lambda \in [0, 1]$ is a discount coefficient that balances future errors. When $\lambda = 1$, the estimator becomes the advantage approximation in the asynchronous advantage actor-critic algorithm (A3C) [Mnih et al. 2016]. These estimators do not introduce bias into the policy gradient, but they have different variances.
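The $k$-step estimator in (16) is usually computed with a single backward pass over the rollout. A sketch (the `gamma`/`lam` defaults are common PPO settings, assumed here, not taken from the paper):

```python
import numpy as np

def k_step_advantages(rewards, values, gamma=0.99, lam=0.95):
    """k-step cumulated TD-error advantage of Eq. (16), computed
    backwards over a finite rollout. `values` holds V(s_0..s_k),
    one more entry than `rewards` (the bootstrap value)."""
    k = len(rewards)
    adv = np.zeros(k)
    gae = 0.0
    for t in reversed(range(k)):
        # one-step TD error, Eq. (15)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # accumulate (gamma * lam)-discounted future TD errors
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```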
Moreover, in deep RL, we often approximate the value functions with neural networks, which introduces additional bias and variance [Mnih et al. 2015, Schulman et al. 2015, Wang et al. 2016, Schulman et al. 2017]. For example, to compute the temporal difference errors, the state value is usually approximated with a neural network $V_\phi(s)$, where $\phi$ is the network parameter. If the one-step TD error is computed with on-policy samples, it becomes a function of the parameters $\phi$,

(17) $\delta_t^{\mathrm{free}} = Q_t^{\mathrm{free}} - V_\phi(s_t)$,

where $Q_t^{\mathrm{free}}$ denotes the model-free target value

(18) $Q_t^{\mathrm{free}} = r_t + \gamma V_\phi(s_{t+1})$.

Therefore, in many algorithms the value parameters $\phi$ are updated by minimizing the squared error on samples using stochastic gradient descent.
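As a small sketch of (17)-(18), with the value network stood in by a plain callable `v_fn` (an illustrative assumption):

```python
def model_free_target(r_t, s_next, v_fn, gamma=0.99):
    """Model-free target value, Eq. (18): r_t + gamma * V(s_{t+1})."""
    return r_t + gamma * v_fn(s_next)

def td_error(r_t, s_t, s_next, v_fn, gamma=0.99):
    """One-step TD error, Eq. (17): target minus current estimate."""
    return model_free_target(r_t, s_next, v_fn, gamma) - v_fn(s_t)
```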
The methods discussed so far fall into the model-free category, because the immediate rewards and the state transitions are sampled from the trajectories.
Model-based value estimation
For the model-based case, additional function approximators are used to estimate the reward function and the transition function, i.e.

(19) $\hat r(s_t, a_t) \approx r(s_t, a_t), \qquad \hat f(s_t, a_t) \approx s_{t+1}$.

In continuous problems, $\hat r$ and $\hat f$ can take the form of neural networks, which are trained by minimizing some error measure between the model predictions and the real $r_t$ and $s_{t+1}$. Here the symbols for the parameters are omitted for simplicity. When using the mean squared error, we can define the loss function for these two networks as

$L^{\mathrm{model}} = \big(\hat r(s_t, a_t) - r_t\big)^2 + \big\|\hat f(s_t, a_t) - s_{t+1}\big\|^2$,

which can be minimized using stochastic gradient descent w.r.t. the network parameters. To make use of the model-based estimators, similar to the model-free case, we have the model-based TD error

(20) $\delta_t^{\mathrm{model}} = Q_t^{\mathrm{model}} - V_\phi(s_t)$

as an alternative to $\delta_t^{\mathrm{free}}$ in (17), where $Q_t^{\mathrm{model}}$ denotes the model-based target value

(21) $Q_t^{\mathrm{model}} = \hat r(s_t, a_t) + \gamma\, V_\phi\big(\hat f(s_t, a_t)\big)$.
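Under the same illustrative convention (the learned reward and transition models passed in as callables `r_hat` and `f_hat`), the model-based target of (21) mirrors the model-free one, replacing the sampled reward and next state with predictions:

```python
def model_based_target(s_t, a_t, r_hat, f_hat, v_fn, gamma=0.99):
    """Model-based target value, Eq. (21): use the model's predicted
    reward r_hat(s,a) and predicted next state f_hat(s,a) in place of
    the sampled reward and next state."""
    s_next_pred = f_hat(s_t, a_t)
    return r_hat(s_t, a_t) + gamma * v_fn(s_next_pred)
```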
To solve complex tasks with high-dimensional inputs, model-free deep RL methods can often learn faster than model-based methods, especially at the beginning, mainly because they have fewer parameters to learn.
Policy Optimization with Model-based Exploration
In this section, we propose an extension of trust region based policy optimization with an exploration heuristic that makes use of the difference between the model-free and model-based value estimations.
Let us view the TD error as the difference between the target value and the current function approximation. The target values for model-free and model-based learning are written in (18) and (21). Here we define the discrepancy of targets of the state-action pair $(s_t, a_t)$ as

(22) $\epsilon_t = \big| Q_t^{\mathrm{free}} - Q_t^{\mathrm{model}} \big|$
(23) $\phantom{\epsilon_t} = \big| r_t + \gamma V_\phi(s_{t+1}) - \hat r(s_t, a_t) - \gamma V_\phi(\hat f(s_t, a_t)) \big|$.

We use this error as an additional exploration value for the pair $(s_t, a_t)$. The proposed method is named Policy Optimization with Model-based Exploration (POME). The resulting TD error used in POME is

(24) $\delta_t^{\mathrm{POME}} = \delta_t^{\mathrm{free}} + \alpha\, (\epsilon_t - \bar\epsilon)$,

where $\alpha$ is a coefficient decaying to $0$ and $\bar\epsilon$ is used to shift the exploration bonus to zero mean over samples from the same batch.
Insights
The basic idea behind POME is to add an exploration bonus to the state-action pairs where there is a discrepancy between the model-based and model-free Q-values.
The intuition is that the discrepancy between the two Q-values will be small if the agent is "familiar" with the transition dynamics, for example, when the probability distribution of the next state concentrates on one deterministic state and, meanwhile, the next state has already been visited several times. On the other hand, the error will be large if the agent is uncertain about what is going on, for example, if it has trouble predicting the next state or has never visited the next state at all. Therefore, we treat the discrepancy of targets as a measure of the uncertainty of the long-term value estimation.
By the update rule of policy iteration methods (9) and (12), the chance that the policy selects an action with a larger advantage estimation will be higher after a few updates. At the start of learning, the discrepancy between the two target values can be high, which means the transition dynamics are hard to estimate, so we need to encourage the agent to explore. After training for a while, we schedule $\alpha$ towards a tiny value approaching $0$, so that the algorithm asymptotically recovers the model-free objective.
By the definition in (24), we let the discrepancy of targets serve as an additional exploration value of actions, with the coefficient $\alpha$ addressing the tradeoff between exploration and exploitation. Note that the exploration bonus on some state-action pairs will be propagated to others, so that our method guides exploration towards these "interesting" parts of the state-action space.
Techniques
With the definition of the discrepancy of targets and the new exploratory target value, we still need some techniques to build an actual policy optimization algorithm that is stable and efficient.
We first discuss why we use $\bar\epsilon$ for zero-mean normalization. Consider an on-policy algorithm that optimizes its policy using the samples visited by following the old policy $\pi_{\theta_{\text{old}}}$. Conventional methods estimate the advantage values on these state-action pairs to guide the next step of policy iteration. This means that, for a state $s$, if there exists a certain action that has never been taken before, it has to wait until the algorithm assigns negative advantage values to the other actions at the same state; otherwise it will never get a chance of being chosen. However, when POME biases the advantage estimations by adding the term $\alpha(\epsilon_t - \bar\epsilon)$, where $\bar\epsilon$ normalizes the bonus to zero mean, it naturally reduces the chance of exploiting familiar actions and encourages picking unfamiliar actions which either have uncertain outcomes or had little chance of being picked before. In the actual algorithm, $\bar\epsilon$ is calculated as the median of $\epsilon_t$ over the on-policy samples, because the median is more robust to outliers than the arithmetic mean.
Next, we introduce a practical trick to stabilize the exploration. Since the discrepancy of targets is an unbounded positive value, it can be extremely large when the agent arrives at a totally unexpected state for the first time, which would make the algorithm unstable. We therefore clip the exploration value to a certain range. By Theorem 1, it is necessary to keep the error of the advantage estimation small to guarantee monotonic improvement of policy iteration. So we clip the discrepancy of targets to the same range as the model-free TD error, replacing (24) with

(25) $\delta_t^{\mathrm{POME}} = \delta_t^{\mathrm{free}} + \alpha\, \mathrm{clip}\big(\epsilon_t - \bar\epsilon,\ -|\delta_t^{\mathrm{free}}|,\ |\delta_t^{\mathrm{free}}|\big)$.
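Putting the centering and clipping together, a batched sketch of the POME TD error follows. The vectorized batching, the `alpha=0.1` default, and reading "the same range as the model-free TD error" as a symmetric interval $[-|\delta_t^{\mathrm{free}}|, |\delta_t^{\mathrm{free}}|]$ are illustrative assumptions:

```python
import numpy as np

def pome_td_errors(q_free, q_model, v, alpha=0.1):
    """POME TD error: model-free TD error plus a centralized, clipped
    exploration bonus. q_free / q_model are per-timestep target values
    (Eqs. (18) and (21)); v holds the current estimates V_phi(s_t)."""
    delta_free = q_free - v                 # model-free TD error, Eq. (17)
    eps = np.abs(q_free - q_model)          # discrepancy of targets, Eq. (22)
    bonus = eps - np.median(eps)            # center by the batch median
    # clip the bonus to the magnitude of the model-free TD error
    bonus = np.clip(bonus, -np.abs(delta_free), np.abs(delta_free))
    return delta_free + alpha * bonus
```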
Formally, the optimization problem for POME is

(26) $\max_\theta\ \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\, \hat A_t^{\mathrm{POME}},\ \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, \hat A_t^{\mathrm{POME}}\big)\Big]$,

where

(27) $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$.

Following the one-step temporal difference error defined in (25), we use the $k$-step cumulated error to estimate the advantage value of state-action pairs along the trajectory,

(28) $\hat A_t^{\mathrm{POME}} = \sum_{l=0}^{k-1} (\gamma\lambda)^l\, \delta_{t+l}^{\mathrm{POME}}$.

Therefore the objective is easy to compute by simply replacing the advantage estimation. The overall sample-based objective for the policy parameters is

(29) $L^{\mathrm{POME}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\, \hat A_t^{\mathrm{POME}},\ \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, \hat A_t^{\mathrm{POME}}\big)\Big]$.

The loss function for the value estimator therefore becomes

(30) $L^{V}(\phi) = \hat{\mathbb{E}}_t\Big[\big(Q_t^{\mathrm{free}} - V_\phi(s_t)\big)^2\Big]$.
Now we present our new algorithm, Policy Optimization with Model-based Exploration (POME), which incorporates the techniques above, as shown in Algorithm 1.
Experiments
Experimental Setup
We use the Arcade Learning Environment [Bellemare et al. 2013] benchmarks along with a standard open-source PPO implementation [Dhariwal et al. 2017].
We evaluate two versions of POME in this section. For the first version, to guarantee that POME asymptotically approximates the original objective, we linearly anneal the coefficient $\alpha$ from its initial value to $0$ over the training period. For the second version, we fix the value of $\alpha$ at 0.1. The first version shows that the heuristic of POME helps fast learning, because it mainly influences the algorithm in the beginning phase. The second version additionally shows that, even though the exploration value added to the advantage estimation introduces additional bias, this bias does not damage the performance much.
We test on all 49 Atari games. Each game is run for 10 million timesteps over 3 random seeds. The average score of the last 100 episodes is taken as the measure of performance. The experimental results for PPO in Table 1 and Table 2 are taken from the original paper [Schulman et al. 2017].
Implementation Details
For all the algorithms, the discount factor $\gamma$ is shared, and the advantage values are estimated by the $k$-step error in (16) with a fixed horizon $k$. We use 8 actors (workers) to run the algorithm simultaneously, with a fixed minibatch size.
For the basic network structures of PPO and POME, we use the same actor-critic architecture as [Mnih et al. 2016, Schulman et al. 2017], with shared convolutional layers and separate MLP layers for the policy and the value network.
Since POME extends PPO by incorporating model-based value estimations, the implementation involves two additional parts: fitting the model and adding the discrepancy of targets. Note that PPO and POME were always given the same amount of data during training.
To estimate the model, since learning the state dynamics is more important than learning the rewards, we use a convolutional neural network with one hidden layer to fit the state transition. The inputs of the transition network have two parts: the state (images of four stacked frames) and the action (a discrete number). Before being fed into the model, the images are scaled to the range $[0, 1]$ and the action is one-hot encoded. We concatenate the one-hot encoding of the action to the states' images to form the inputs of the transition model, and we use the sigmoid activation for the outputs of the transition network. Finally, we compute the loss of the transition model between the scaled images of the next state and the outputs.

Table 1. Average scores of the last 100 episodes: PPO vs. POME (decaying coefficient).

Games  PPO  POME

Alien  1850.3  1897.0 
Amidar  674.6  943.9 
Assault  4971.9  5638.6 
Asterix  4532.5  4989.2 
Asteroids  2097.5  1737.6 
Atlantis  2311815.0  1941792.3 
BankHeist  1280.6  1241.7 
BattleZone  17366.7  15156.7 
BeamRider  1590.0  1815.7 
Bowling  40.1  58.3 
Boxing  94.6  92.9 
Breakout  274.8  411.8 
Centipede  4386.4  2921.6 
ChopperCommand  3516.3  4689.0 
CrazyClimber  110202.0  115282.0 
DemonAttack  11378.4  14847.1 
DoubleDunk  14.9  6.8 
Enduro  758.3  835.3 
FishingDerby  17.8  21.1 
Freeway  32.5  33.0 
Frostbite  314.2  272.9 
Gopher  2932.9  4801.8 
Gravitar  737.2  914.5 
IceHockey  4.2  4.5 
Jamesbond  560.7  507.2 
Kangaroo  9928.7  2511.0 
Krull  7942.3  8001.1 
KungFuMaster  23310.3  24570.3 
MontezumaRevenge  42.0  0.0 
MsPacman  2096.5  1966.5 
NameThisGame  6254.9  5902.2 
Pitfall  32.9  0.3 
Pong  20.7  20.8 
PrivateEye  69.5  100.0 
Qbert  14293.3  15712.8 
Riverraid  8393.6  8407.9 
RoadRunner  25076.0  44520.0 
Robotank  5.5  14.6 
Seaquest  1204.5  1789.7 
SpaceInvaders  942.5  964.2 
StarGunner  32689.0  44696.7 
Tennis  14.8  15.5 
TimePilot  4232.0  4052.0 
Tutankham  254.4  199.8 
UpNDown  95445.0  181250.4 
Venture  0  2.0 
VideoPinball  37389.0  33388.0 
WizardOfWor  4185.3  4301.7 
Zaxxon  5008.7  6358.0 
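The input preprocessing for the transition model described above (frame scaling plus one-hot action encoding) can be sketched as follows; the exact way the one-hot vector is tiled across the image, and all shapes and names, are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def make_transition_input(frames, action, n_actions):
    """Build the transition-model input: scale the stacked frames to
    [0, 1] and concatenate a one-hot encoding of the action as extra
    constant channels. frames: uint8 array of shape (H, W, 4)."""
    scaled = frames.astype(np.float32) / 255.0
    onehot = np.zeros(n_actions, dtype=np.float32)
    onehot[action] = 1.0
    # tile the one-hot vector into H x W x n_actions constant planes
    h, w, _ = scaled.shape
    action_planes = np.ones((h, w, n_actions), dtype=np.float32) * onehot
    return np.concatenate([scaled, action_planes], axis=-1)
```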
In the actual implementation, POME uses a unified objective function in order to simplify the computation,

(31) $L(\theta, \phi, \psi) = -L^{\mathrm{POME}}(\theta) + c_1\, L^{V}(\phi) + c_2\, L^{\mathrm{model}}(\psi)$,

where $\psi$ denotes the transition-model parameters. This objective is optimized by the Adam optimizer [Kingma and Ba 2014] with a learning rate proportional to a fraction $f$ linearly annealed from 1 to 0 over the course of learning, and $c_1$, $c_2$ are coefficients for tuning the learning rates of the value function and the transition function, respectively. In our experiments both coefficients are set to fixed values.
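The combination in (31) can be sketched as a one-line helper (the sign convention, maximizing the policy surrogate while minimizing the two auxiliary losses, follows the equation above; concrete values of `c1` and `c2` are not specified here):

```python
def unified_loss(policy_surrogate, value_loss, model_loss, c1, c2):
    """Unified training objective, Eq. (31): negate the clipped policy
    surrogate (so minimizing this loss maximizes it) and add the
    value-function and transition-model losses weighted by c1 and c2."""
    return -policy_surrogate + c1 * value_loss + c2 * model_loss
```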
Comparison with PPO
Table 1 compares POME with the decaying coefficient against the original PPO, using the averaged scores of the last 100 episodes in each environment. Among the 49 games, POME with the decaying coefficient outperforms PPO in 32 games.
The learning curves of four representative Atari games are shown in Figure 1. In these environments, POME outperforms PPO over the entire training period, which indicates that it achieves fast learning and validates the power of our exploration technique based on the discrepancy of targets.
Additional Experimental Results
We now investigate two questions: (1) how does POME perform if we do not anneal the coefficient $\alpha$ to $0$? (2) how does a direct model-based extension of PPO perform?

For the first question, we set up an experiment to see whether the exploration value used in POME damages the performance in the long run. The coefficient $\alpha$ is now fixed at $0.1$ for the entire training period. Secondly, we implement a model-based extension of PPO with the same transition-network architecture as POME, replacing the target value with the model-based target value (21), so that the agent performs on-policy learning while maintaining the belief model.
We test the two extensions on the Atari 2600 games. The setup of the environments and the hyperparameters remains the same as in the previous experiments.
The experimental results in Table 2 show that the model-based version is far from good: it outperforms the baseline in only one game. This shows that with pure model-based PPO, the approximation errors introduced by fitting the model can substantially hurt the performance.
However, POME with the non-decaying coefficient turns out to be not only good but even better than POME with the decaying coefficient: it outperforms the original PPO in 33 of the 49 games. This result indicates that even though adding the exploration value increases the bias of the advantage estimates, it is empirically not harmful to the policy optimization algorithm in most environments.
Table 2. Average scores of the last 100 episodes: PPO, model-based PPO, and POME (non-decaying coefficient).

Games  PPO  PPO (model-based)  POME (non-decay)
Alien  1850.3  1386.4  1658.1 
Amidar  674.6  27.7  704.0 
Assault  4971.9  872.2  6211.5 
Asterix  4532.5  1606.2  7235.0 
Asteroids  2097.5  1456.8  1788.4 
Atlantis  2311815  2864448  2030477 
BankHeist  1280.6  159.4  1245.8 
BattleZone  17366.7  2790.0  15313.3 
BeamRider  1590.0  448.9  1989.2 
Bowling  40.1  28.0  66.2 
Boxing  94.6  52.5  92.6 
Breakout  274.8  18.8  399.2 
Centipede  4386.4  3343.6  2684.7 
ChopperCommand  3516.3  1603.5  3886.3 
CrazyClimber  110202.0  3640.0  112166.3 
DemonAttack  11378.4  169.4  18877.9 
DoubleDunk  14.9  17.9  8.8 
Enduro  758.3  92.2  862.8 
FishingDerby  17.8  62.7  19.3 
Freeway  32.5  4.0  33.0 
Frostbite  314.2  265.4  275.1 
Gopher  2932.9  102.4  5050.4 
Gravitar  737.2  150.0  773.8 
IceHockey  4.2  4.5  4.5 
Jamesbond  560.7  37.2  871.8 
Kangaroo  9928.7  1422.0  2237.0 
Krull  7942.3  6091.0  8795.4 
KungFuMaster  23310.3  12255.5  27667.0 
MontezumaRevenge  42.0  0.0  0.0 
MsPacman  2096.5  2059.7  2101.0 
NameThisGame  6254.9  4485.1  5462.9 
Pitfall  32.9  591.5  4.4 
Pong  20.7  16.1  20.8 
PrivateEye  69.5  39.3  98.9 
Qbert  14293.3  3308.2  14373.7 
Riverraid  8393.6  3646.9  8358.3 
RoadRunner  25076.0  4534.5  41740.7 
Robotank  5.5  2.6  13.3 
Seaquest  1204.5  592.4  1805.7 
SpaceInvaders  942.5  355.4  971.6 
StarGunner  32689.0  1497.5  33836.0 
Tennis  14.8  22.6  12.6 
TimePilot  4232.0  3150.0  3970.7 
Tutankham  254.4  99.0  159.6 
UpNDown  95445.0  1589.4  203080.4 
Venture  0.0  0.0  55.0 
VideoPinball  37389.0  19720.4  31462.0 
WizardOfWor  4185.3  1551.0  6050.7 
Zaxxon  5008.7  1.5  6088.7 
Conclusion and Discussion
Due to the challenge of the exploration-exploitation tradeoff in environments with continuous state spaces, we propose in this paper a novel policy-based algorithm named POME, which uses the discrepancy between the target values of model-free and model-based methods as a relative measure of the exploration value. POME uses several practical techniques to enable exploration while remaining stable, i.e., the clipped and centralized exploration value. In the actual algorithm, POME builds on the model-free PPO algorithm and adds the exploration bonus to the estimation of the advantage function. Experiments show that POME outperforms the original PPO on 33 of 49 Atari games.
One limitation remains: if the reward signal is extremely sparse, the discrepancy between the target values will be close to $0$, so POME provides little improvement in such situations. For future work, it would be interesting to extend this insight to the exploration problem in environments with sparse reward signals, for example, by incorporating reward-independent curiosity-based exploration methods.
References
[Auer, Cesa-Bianchi, and Fischer 2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235–256.
[Bellemare et al. 2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.
[Cai et al. 2018a] Cai, Q.; Filos-Ratsikas, A.; Tang, P.; and Zhang, Y. 2018a. Reinforcement mechanism design for e-commerce. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, 1339–1348.
[Cai et al. 2018b] Cai, Q.; Filos-Ratsikas, A.; Tang, P.; and Zhang, Y. 2018b. Reinforcement mechanism design for fraudulent behaviour in e-commerce. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[Cai, Pan, and Tang 2018] Cai, Q.; Pan, L.; and Tang, P. 2018. Generalized deterministic policy gradient algorithms. arXiv preprint arXiv:1807.03708.
[Chen et al. 2017] Chen, R. Y.; Sidor, S.; Abbeel, P.; and Schulman, J. 2017. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502.
[Dhariwal et al. 2017] Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; and Zhokhov, P. 2017. OpenAI Baselines. https://github.com/openai/baselines.
[Gu et al. 2016] Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, 2829–2838.
[Houthooft et al. 2016] Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; and Abbeel, P. 2016. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 1109–1117.
[Kakade and Langford 2002] Kakade, S., and Langford, J. 2002. Approximately optimal approximate reinforcement learning. In ICML, volume 2, 267–274.
[Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Kurutach et al. 2018] Kurutach, T.; Clavera, I.; Duan, Y.; Tamar, A.; and Abbeel, P. 2018. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations.
[Levine and Koltun 2013] Levine, S., and Koltun, V. 2013. Guided policy search. In International Conference on Machine Learning, 1–9.
[Li and Todorov 2004] Li, W., and Todorov, E. 2004. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), 222–229.
[Li et al. 2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670.
[Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[May et al. 2012] May, B. C.; Korda, N.; Lee, A.; and Leslie, D. S. 2012. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research 13(Jun):2069–2106.
[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
[Montgomery and Levine 2016] Montgomery, W. H., and Levine, S. 2016. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, 4008–4016.
[Pathak et al. 2017] Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017.
[Peters and Schaal 2006] Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, 2219–2225. IEEE.
[Schulman et al. 2015] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.
[Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. In International Conference on Learning Representations.
[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
[Sutton et al. 2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
[Sutton 1990] Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, 216–224. Elsevier.
[Tang et al. 2017] Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Chen, O. X.; Duan, Y.; Schulman, J.; De Turck, F.; and Abbeel, P. 2017. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2753–2762.
[Tang 2017] Tang, P. 2017. Reinforcement mechanism design. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 5146–5150.
[Thompson 1933] Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.
[Wahlström, Schön, and Deisenroth 2015] Wahlström, N.; Schön, T. B.; and Deisenroth, M. P. 2015. From pixels to torques: Policy learning with deep dynamical models. In Deep Learning Workshop at the 32nd International Conference on Machine Learning.
[Wang et al. 2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
[Watter et al. 2015] Watter, M.; Springenberg, J.; Boedecker, J.; and Riedmiller, M. 2015. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, 2746–2754.