1 Introduction
Model-free deep RL algorithms have been employed to solve a variety of complex tasks (Sutton & Barto, 2005; Silver et al., 2016, 2017; Mnih et al., 2013, 2016; Schulman et al., 2015a; Lillicrap et al., 2015; Gu et al., 2016). Model-free RL consists of on-policy and off-policy methods. Off-policy methods enable a target policy π to be learned while following, and obtaining data (actions) from, another policy (the behavior policy b); the agent learns about a policy distinct from the one it is executing. In on-policy methods there is a single policy (the target policy), and the agent learns only about the policy it is carrying out. In short, if the two policies are the same, i.e. π = b, the setting is called on-policy; otherwise (π ≠ b) the setting is called off-policy (Harutyunyan et al., 2016; Degris et al., 2012; Precup et al., 2001; Gu et al., 2016; Hanna et al., 2018; Gruslys et al., 2017).
On-policy methods offer unbiased and stable learning but often suffer from high variance and sample inefficiency. Off-policy methods are more sample efficient and safe but unstable. Neither setting is perfect, and several methods have been proposed to remedy the deficiencies of each: for example, how on-policy learning can achieve sample efficiency similar to off-policy
(Gu et al., 2016; Mnih et al., 2016; Schaul et al., 2015; Schulman et al., 2015a; van Hasselt & Mahmood, 2014) and how off-policy learning can achieve stability similar to on-policy (Degris et al., 2012; Mahmood et al., 2014; Gruslys et al., 2017; Wang et al., 2016; Haarnoja et al., 2018). The aim of this study is to make off-policy learning as stable as on-policy learning using an actor-critic algorithm with deep neural networks; this paper therefore focuses primarily on off-policy rather than on-policy methods. A common approach to stabilizing off-policy learning, which is destabilized by the mismatch between the behavior policy and the target policy, is to use importance sampling techniques (Gu et al., 2016; Hachiya et al., 2009; Rubinstein, 1981). Importance sampling is a well-known method for off-policy evaluation, permitting off-policy data to be used as if it were on-policy (Hanna et al., 2018). IS can be used to study one distribution while sampling from another (Owen, 2013). The degree of deviation of the target policy from the behavior policy at each time t is captured by the importance sampling ratio, i.e. ρ_t = π(a_t | s_t) / b(a_t | s_t) (Precup et al., 2001).
IS can also be viewed as a method for reducing the variance of the estimate of an expectation by carefully choosing a sampling distribution b. If b is chosen properly, the new estimate has lower variance; the variance of the estimator depends on how much the sampling distribution and the target distribution differ
(Rubinstein, 1981). For the theory behind the importance sampling presented here, we refer to (Owen, 2013, Chapter 9). Another source of instability is that IS does not generate uniform values across samples: it sometimes produces a large weight for one sample and a small weight for another. Thus, Yamada et al. (2011) propose a smooth variant of importance sampling, relative importance sampling (RIS), to mitigate this instability. Some of the more important methods based on IS include WIS (Mahmood et al., 2014), ACER (Wang et al., 2016), Retrace (Munos et al., 2016), Q-Prop (Gu et al., 2016), SAC (Haarnoja et al., 2018), MIS (Elvira et al., 2015), Off-PAC (Degris et al., 2012), the Reactor (Gruslys et al., 2017), and GPS (Levine & Koltun, 2013). In this paper, we propose an off-policy actor-critic algorithm based on relative importance sampling in deep reinforcement learning for stabilizing off-policy methods, called RIS-Off-PAC. To the best of our knowledge, we are the first to combine RIS with actor-critic. We use deep neural networks to train both the actor and the critic; the behavior policy is also generated by a deep neural network. We also explore a different type of actor-critic algorithm, the natural-gradient actor-critic, using RIS.
2 Related Work
2.1 On-Policy
Thomas (2014) shows that the discounted-reward natural actor-critic algorithms are biased and derives unbiased average-reward natural actor-critic algorithms. Bhatnagar et al. (2009) present four new online actor-critic reinforcement learning algorithms based on natural gradients, function approximation, and temporal-difference learning, and demonstrate the convergence of these four algorithms to a local maximum. Schaul et al. (2015)
present a framework for prioritizing experience, so as to replay significant transitions more often and thus learn more efficiently. Bounded actions introduce bias when the standard Gaussian distribution is used as a stochastic policy.
Chou et al. (2017) suggest using the Beta distribution instead of the Gaussian and examine the trade-off between bias and variance of the policy gradient in both on-policy and off-policy settings.
Mnih et al. (2016) propose four asynchronous deep RL algorithms. The most effective one, asynchronous advantage actor-critic (A3C), maintains a policy and an estimate of the value function. Van Seijen & Sutton (2014) introduce a true online TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performs better than its standard counterpart in both prediction and control problems. Schulman et al. (2015a) develop an algorithm called Trust Region Policy Optimization (TRPO) that offers monotonic policy improvement and derive a practical algorithm with better sample efficiency and performance; it is similar to natural policy gradient methods. Schulman et al. (2015b) develop a variance-reduction method for the policy gradient, called generalized advantage estimation (GAE), in which a trust-region optimization method is used for the value function. The GAE policy gradient significantly reduces variance while maintaining an acceptable level of bias. We are interested in off-policy rather than on-policy methods.

2.2 Off-Policy
Hachiya et al. (2009) consider the variance of the value function estimator for an off-policy method to control the trade-off between bias and variance. Mahmood et al. (2014) use weighted importance sampling with function approximation and extend it to a new weighted-importance-sampling form of off-policy LSTD(λ) called WIS-LSTD(λ). Degris et al. (2012) propose a method called off-policy actor-critic (Off-PAC) in which the agent learns a target policy while following and getting samples from a behavior policy. Gruslys et al. (2017) present a sample-efficient actor-critic reinforcement learning agent called Reactor, which uses the off-policy multi-step Retrace algorithm to train the critic, while a new policy gradient algorithm, called β-leave-one-out, is used to train the actor. Zimmer et al. (2018) present new off-policy actor-critic RL algorithms that cope with continuous state and action spaces using neural networks; their algorithm also allows a trade-off between data efficiency and scalability. Levine & Koltun (2013) show how to avoid poor local optima in complex policies with hundreds of parameters using guided policy search (GPS); GPS uses differential dynamic programming to generate suitable guiding samples and defines a regularized importance-sampled policy optimization that integrates these samples into policy search. Lillicrap et al. (2015) introduce a model-free, off-policy actor-critic algorithm using deep function approximators based on the deterministic policy gradient (DPG) that can learn policies in high-dimensional, continuous action spaces, called deep deterministic policy gradient (DDPG). Wang et al. (2016) present a stable, sample-efficient actor-critic deep RL agent with experience replay, called ACER, that applies successfully to both continuous and discrete action spaces; ACER utilizes truncated importance sampling with bias correction, stochastic dueling network architectures, and efficient trust region policy optimization to achieve this. Munos et al.
(2016) present a novel algorithm called Retrace(λ), which has three properties: small variance; safety, because it uses samples collected from any behavior policy; and efficiency, because it estimates the Q-function from off-policy data efficiently. Gu et al. (2016) develop a method called Q-Prop which is both sample efficient and stable; it merges the advantages of on-policy methods (stability of policy gradients) and off-policy methods (efficiency). Model-free deep RL algorithms typically suffer from two major challenges: sample inefficiency and instability. Haarnoja et al. (2018) present the soft actor-critic (SAC) method, based on maximum entropy and off-policy learning; off-policy learning provides sample efficiency and entropy maximization provides stability. Most of these methods are similar to ours, but they use standard IS or entropy methods while we use RIS. For a review of IS-based off-policy methods, see the works of (Precup et al., 2000; Sutton et al., 2016; Tang & Abbeel, 2010; Elvira et al., 2015; Gu et al., 2017; van Hasselt & Mahmood, 2014; Precup et al., 2001).
3 Preliminaries
A Markov decision process (MDP) is a mathematical formulation of RL problems. An MDP is defined by a tuple of objects (S, A, R, P, γ), where S is the set of possible states, A is the set of possible actions, R is the distribution of the reward given a (state, action) pair, P is the transition probability, i.e. the distribution of the next state given a (state, action) pair, and γ is the discount factor. π and b denote the target policy and the behavior policy respectively. A policy (π or b) is a function from S to A that specifies which action to take in each state. In classical RL, an agent interacts with an environment over a number of discrete time steps. At each time step t, the agent picks an action a_t according to its policy (π or b) given its present state s_t. In return, the agent gets the next state s_{t+1} according to the transition probability P and observes a scalar reward r_t. The process carries on until the agent reaches a terminal state, after which the process starts again. The agent accumulates the discounted total return from each state, R_t = Σ_{k≥0} γ^k r_{t+k}. In RL, there are two typical functions for evaluating actions under a policy (π or b): the state-action value function Q(s, a) and the state value function V(s). E denotes expectation. Finally, the goal of the agent is to maximize the expected return using the policy gradient with respect to the policy parameters θ. The policy gradient (Sutton et al., 1999), taking notation from (Schulman et al., 2015b), is defined as:

g = E[ Σ_t A^π(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ]   (1)
where A^π is an advantage function. Schulman et al. (2015b) show that several expressions can be used in place of A^π without introducing bias, such as the state-action value Q^π(s_t, a_t), the discounted return R_t, or the temporal-difference (TD) residual r_t + γ V(s_{t+1}) − V(s_t). We use the TD residual in our method. In practice, we use a neural network to estimate the advantage function, and this injects extra estimation error and bias. Classic policy gradient estimators using R_t have higher variance and lower bias, whereas estimators using function approximation have higher bias and lower variance (Wang et al., 2016). IS often has low bias but high variance (Sutton et al., 2016; Hachiya et al., 2009; Mahmood et al., 2014); we use RIS instead of IS. Merging the advantage function with function approximation and RIS to achieve stable RL, together with a trade-off between bias and variance, is one of our main aims. Policy gradient with function approximation denotes actor-critic (Sutton et al., 1999), which optimizes the policy against a critic, e.g. the deterministic policy gradient (Silver et al., 2014; Lillicrap et al., 2015).
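The TD-residual advantage estimate just described can be sketched as a small helper (a minimal sketch; the function name and defaults are ours, not the paper's):

```python
def td_residual(reward, v_s, v_next, gamma=0.99, done=False):
    """One-step TD residual delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    used as a low-variance estimate of the advantage A(s_t, a_t).
    On terminal transitions the bootstrap term is dropped."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s
```

For example, with r_t = 1.0, V(s_t) = 0.5, V(s_{t+1}) = 1.0, and γ = 0.9, the residual is 1.0 + 0.9 − 0.5 = 1.4.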
4 Standard Importance Sampling
One reason for the instability of off-policy learning is a discrepancy between distributions. In off-policy RL, we would like to gather data samples from the distribution of the target policy, but data samples are instead drawn from the distribution of the behavior policy. Importance sampling is a well-known approach to handling this kind of mismatch (Rubinstein, 1981; Precup et al., 2000).
For example, we would like to estimate the expected value of a random variable x under the target distribution π while, in reality, samples are drawn from another distribution b (the behavior policy). The importance sampling estimate of μ = E_π[f(x)] is

μ̂_IS = (1/n) Σ_{i=1}^{n} f(x_i) π(x_i) / b(x_i)   (2)

where the x_i are samples drawn from b and the estimator computes the average of the weighted sample values.
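Equation (2) can be sketched in a few lines (a minimal sketch; function and argument names are ours):

```python
def is_estimate(f, target_pdf, behavior_pdf, samples):
    """Ordinary importance sampling estimate of E_pi[f(x)] from samples
    drawn from the behavior distribution b (Equation (2))."""
    weights = [target_pdf(x) / behavior_pdf(x) for x in samples]
    return sum(f(x) * w for x, w in zip(samples, weights)) / len(samples)
```

For instance, with a two-action target policy π = (0.9, 0.1) and a uniform behavior policy b = (0.5, 0.5), the weighted average recovers the target-policy expectation of f(x) = x, namely 0.1.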
4.1 Relative Importance Sampling
Although some research, for instance (Wang et al., 2016; Precup et al., 2001; Gu et al., 2016), has been carried out on solving instability, no studies have been found which use a smooth variant of IS in RL. A smooth variant of IS, such as RIS (Sugiyama, 2016; Yamada et al., 2011), can be used to ease the instability. Our relative importance weight can be defined as:
w_β(x) = π(x) / (β π(x) + (1 − β) b(x)),   0 ≤ β ≤ 1   (3)

where β controls the smoothness. The relative importance weight becomes the ordinary importance sampling ratio if β = 0; it becomes smoother as β is increased, and it produces the uniform weight 1 if β = 1. This is one of our main contributions.
Proposition 1.
Since the importance weight is always non-negative, the relative importance weight is no greater than 1/β:

w_β(x) = π(x) / (β π(x) + (1 − β) b(x)) ≤ 1/β   (4)
Proof: let p = π(x) and q = b(x), where p ≥ 0, q ≥ 0, and 0 < β ≤ 1. Since (1 − β) q ≥ 0, the denominator satisfies β p + (1 − β) q ≥ β p, and therefore w_β(x) = p / (β p + (1 − β) q) ≤ p / (β p) = 1/β.
The RIS estimate of μ = E_π[f(x)] is

μ̂_RIS = (1/n) Σ_{i=1}^{n} f(x_i) w_β(x_i)   (5)
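Equations (3)-(5) can be sketched as follows (names are ours); the limiting cases β = 0 (ordinary IS ratio) and β = 1 (uniform weight), and the 1/β bound of Proposition 1, are easy to check numerically:

```python
def ris_weight(p_target, p_behavior, beta):
    """Relative importance weight w_beta = pi / (beta*pi + (1-beta)*b)
    (Equation (3)). beta = 0 recovers the ordinary IS ratio pi/b;
    beta = 1 gives the uniform weight 1; w_beta is bounded by 1/beta."""
    return p_target / (beta * p_target + (1.0 - beta) * p_behavior)

def ris_estimate(f, target_pdf, behavior_pdf, samples, beta):
    """RIS estimate of E_pi[f(x)] from samples drawn from b (Equation (5))."""
    vals = [f(x) * ris_weight(target_pdf(x), behavior_pdf(x), beta)
            for x in samples]
    return sum(vals) / len(vals)
```

At β = 0 the estimator coincides with the ordinary IS estimate of Equation (2); for β > 0 it trades a small bias for bounded weights.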
Theorem 1. The relative importance sampling estimator μ̂_RIS is a consistent estimator of μ and has bounded variance.

5 RIS-Off-PAC Algorithm
The actor-critic algorithm can be used with both on-policy and off-policy learning; however, our main focus is off-policy learning. In this section we present our algorithm for the actor and the critic, together with a natural actor-critic version.
5.1 The Critic: Policy Evaluation
Let V_w be an approximate value function with parameters w, and let b(·|s) denote the behavior policy's action probabilities for the current state s. The TD residual of V with discount factor γ (Sutton & Barto, 2005) is given as δ_t = r_t + γ V_w(s_{t+1}) − V_w(s_t). The policy gradient uses the value function V to evaluate the target policy π, and δ_t is considered an estimate of the advantage of the action a_t:

δ_t = r_t + γ V_w(s_{t+1}) − V_w(s_t),   a_t ~ b(·|s_t)   (6)

As can be seen above, in our method the agent uses actions generated by the behavior policy instead of the target policy. The approximate value function is trained to minimize the squared TD residual error:
min_w E[ δ_t² ]   (7)
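A minimal sketch of the critic update, using a linear value function as a stand-in for the paper's neural critic (names and the linear parameterization are our illustration):

```python
def critic_update(w, phi_s, phi_next, reward, alpha_w, gamma=0.99, done=False):
    """Semi-gradient step on the squared TD residual (Equation (7)) for a
    linear value function V(s) = w . phi(s). Returns the updated weights
    and the TD residual delta, which the actor reuses."""
    v_s = sum(wi * fi for wi, fi in zip(w, phi_s))
    v_next = 0.0 if done else sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = reward + gamma * v_next - v_s
    w_new = [wi + alpha_w * delta * fi for wi, fi in zip(w, phi_s)]
    return w_new, delta
```

With a neural critic, the same δ_t would instead drive a gradient step of an optimizer such as Adam on the squared-error loss.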
5.2 The Actor: Policy Improvement
The critic updates the value function parameters w; the actor updates the policy parameters θ in the direction recommended by the critic. The actor selects which action to take, and the critic tells the actor how good its action was and how it should adjust. In practice, we use the approximate TD error δ_t to compute the policy gradient. The discounted TD residual can be used to build a policy gradient estimator of the following form:
g = E[ δ_t ∇_θ log π_θ(a_t | s_t) ]   (8)
Our aim is to reduce the instability of off-policy learning. An imbalance between bias and variance (large bias and large variance, or small bias and large variance) often makes off-policy learning unstable. IS reduces bias but introduces high variance: the IS ratio fluctuates greatly from sample to sample, and IS averages these high-variance weights, so a smooth variant of IS is required to mitigate the high variance (high variance is directly proportional to instability), such as RIS. RIS has bounded variance and low bias. By Theorem 1, since the RIS weight is bounded, i.e. w_β ≤ 1/β, the variance of RIS is also bounded; and since IS reduces bias and RIS is a smooth variant of IS, RIS also reduces bias (Hachiya et al., 2009; Gu et al., 2017; Mahmood et al., 2014; Sugiyama, 2016). Therefore, to minimize bias while maintaining bounded variance, we consider the off-policy case, where δ_t can be estimated using actions drawn from b in place of π, and combine the RIS ratio with the policy gradient, which we call RIS-Off-PAC.
w_β(a_t, s_t) = π_θ(a_t | s_t) / (β π_θ(a_t | s_t) + (1 − β) b(a_t | s_t))   (9)

g = E_b[ w_β(a_t, s_t) δ_t ∇_θ log π_θ(a_t | s_t) ]   (10)
Two important facts about Equation (10) must be pointed out. First, it relies on actions drawn from b instead of π. Second, it does not involve a product of several unbounded importance weights; instead it only needs the single relative importance weight w_β(a_t, s_t). Because the RIS weight is bounded, it is expected to exhibit lower variance. We present two variants of the actor-critic algorithm here: (i) relative importance sampling off-policy actor-critic (RIS-Off-PAC) and (ii) relative importance sampling off-policy natural actor-critic (RIS-Off-PNAC).
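A single RIS-weighted actor step (Equation (10)) for a tabular softmax policy might look as follows; the softmax parameterization and names are our illustration, not the paper's exact network:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ris_actor_step(theta, action, b_prob, delta, alpha, beta):
    """One gradient step theta += alpha * w_beta * delta * grad log pi(a|s)
    for a softmax policy with per-action logits theta; `action` is sampled
    from the behavior policy, whose probability for it is `b_prob`."""
    pi = softmax(theta)
    w = pi[action] / (beta * pi[action] + (1.0 - beta) * b_prob)
    grad_log = [(1.0 if a == action else 0.0) - pi[a] for a in range(len(theta))]
    return [t + alpha * w * delta * g for t, g in zip(theta, grad_log)]
```

A positive TD residual δ_t increases the probability of the taken action under π, scaled by the bounded relative importance weight rather than an unbounded IS ratio.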
In Algorithm 1, α_θ and α_w are the learning rates for the actor and the critic respectively. State s represents the current state, while state s′ represents the next state. The next algorithm is RIS-Off-PNAC, which is based on a natural gradient estimate; we refer to (Bhatnagar et al., 2009; Konda & Tsitsiklis, 2003; Peters et al., 2005; Silver et al., 2014) for further details. The only difference between RIS-Off-PAC and RIS-Off-PNAC is that we use a natural gradient estimate in place of the regular gradient estimate in RIS-Off-PNAC. RIS-Off-PNAC utilizes Algorithm 2 and Equation 26 of Bhatnagar et al. (2009) to estimate the natural gradient. However, the natural actor-critic (NAC) algorithms of Bhatnagar et al. (2009) are on-policy while our algorithms are off-policy. In RL we want to maximize rewards, so the optimization problem we consider is a maximization rather than a minimization; in practice we minimize the negative loss function, whose minimum corresponds to the maximum reward of the original problem.
5.3 RIS-Off-Policy Actor-Critic Architecture
Figure 1(a) shows the RIS-Off-PAC architecture. The difference between RIS-Off-PAC and the traditional actor-critic architecture (Sutton et al., 1999; Sutton & Barto, 2005) is that we introduce the behavior policy b and RIS, use actions generated by b instead of π, and our method is off-policy. We compute RIS in the actor using both the π and b policies; therefore we pass samples from b to the actor, as shown in Figure 1(a). The TD error and the other components are the same as in the traditional actor-critic method. Figure 1(b)
shows the RIS-Off-PAC neural network (NN) architecture. We use the classical RL tasks MountainCar and CartPole for our experiments and apply the RIS-Off-PAC NN to both tasks. The details of our NN are as follows. The architecture has a target network (actor), a value network (critic), and an off-policy network (behavior policy), each implemented as fully connected layers in TensorFlow, as shown in Figure 1(b). Each NN contains an input layer, two hidden layers (hidden layer 1 and hidden layer 2), and an output layer. Hidden layer 1 has 24 neurons (units) in all three networks for all RL tasks. Hidden layer 2 has a single neuron in the value network for all RL tasks, while the number of neurons in hidden layer 2 of the target network and the off-policy network equals the number of actions available in the given RL task. Hidden layer 1 employs the ReLU activation function in the target and value networks and the CReLU activation function in the off-policy network. Hidden layer 2 uses the softmax activation function in the target and off-policy networks, and no activation function in the value network. Weights W are initialized using TensorFlow's "he_uniform" function for all networks and tasks, and we use the AdamOptimizer to learn the neural network parameters for all RL tasks.
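The actor's forward pass described above (a 24-unit hidden layer followed by a softmax layer with one unit per action) can be sketched in plain Python; the weight shapes, inputs, and initialization below are illustrative only, not the exact TensorFlow setup:

```python
import math
import random

def actor_forward(state, w1, b1, w2, b2):
    """Schematic actor forward pass: ReLU hidden layer, then softmax
    over per-action logits, returning action probabilities."""
    h = [max(0.0, sum(wi * s for wi, s in zip(row, state)) + bi)
         for row, bi in zip(w1, b1)]
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + bi
              for row, bi in zip(w2, b2)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative initialization: 2 state inputs, 24 hidden units, 3 actions.
rng = random.Random(0)
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(24)]
b1 = [0.0] * 24
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(24)] for _ in range(3)]
b2 = [0.0] * 3
probs = actor_forward([0.1, -0.2], w1, b1, w2, b2)
```

The output is a valid probability distribution over actions, from which the behavior or target policy can sample.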
β is generated as a uniform random value between 0 and 1.

6 Experimental Setup
We conduct experiments on two OpenAI Gym classic control tasks: MountainCar-v0 and CartPole-v0. Our experiments run on a single machine with 16 GB memory, an Intel Core i7-2600 CPU, and no GPU. We use the 64-bit Ubuntu 18.04.1 LTS operating system, Python 3.6.4, TensorFlow 1.7, and OpenAI Gym (Brockman et al., 2016). For all experiments, we use the simulated environments provided by the OpenAI Gym library. The specific experimental details of each control task can be found in Appendices A and B.
6.1 Experimental Results
We evaluate RIS-Off-PAC policies on two OpenAI Gym environments: MountainCar-v0 and CartPole-v0. The results obtained from our experiments are presented in Table 1. The goal in MountainCar-v0 is to drive up the hill on the right and reach the top of the mountain in the minimum number of episodes and steps. We use a maximum of 200 steps for each algorithm. Figure 1(d) shows the results obtained using the different algorithms: RIS-Off-PNAC outperforms all other algorithms. RIS-Off-PAC and AC both take two steps to reach the goal; however, RIS-Off-PAC only takes 150 episodes whereas AC takes 184 episodes. Overall, the RIS-Off-PAC and RIS-Off-PNAC algorithms appear more stable than the other two. The results for RIS-Off-PAC and RIS-Off-PNAC using different values of β are shown in Figures 1(e) and 1(f) respectively. β controls the smoothness, which helps to remove instability, and the degree of mitigation depends on the choice of β: off-policy learning becomes more stable when RIS is smoother, and RIS gets smoother as β increases. From Figure 1(e) we can see that RIS-Off-PAC is stable for all but one value of β, while RIS-Off-PNAC is stable for all values of β, as can be seen from Figure 1(f). The goal of CartPole-v0 is to prevent the pole from falling over for as long as possible. We use a maximum of 1000 steps for each algorithm. Figure 2(b) shows that RIS-Off-PNAC outperforms all other algorithms, taking only 135 episodes and 105 steps to accomplish the goal. RIS-Off-PAC takes second place, whereas AC and NAC take more than 200 steps to complete the goal. Figures 2(c) and 2(d) are analogous to 1(e) and 1(f), using different values of β; overall, both algorithms show the same kind of stability for all values of β. The Leaderboard webpage (https://github.com/openai/gym/wiki/Leaderboard) tracks the performance of user algorithms for various Gym tasks, including MountainCar-v0 and CartPole-v0.
When we compare our results in Table 1 with the leaderboard, our RIS-Off-PAC and RIS-Off-PNAC for MountainCar outperform all user algorithms. Similarly, our RIS-Off-PAC and RIS-Off-PNAC for CartPole outperform several user algorithms. Videos of the policies learned with RIS-Off-PAC are available online for MountainCar-v0 (https://youtu.be/dBsYHNUdKSw) and CartPole-v0 (https://youtu.be/Zbx0eZ3vvzk). AC and NAC are on-policy algorithms while RIS-Off-PAC and RIS-Off-PNAC are off-policy algorithms; our experiments show that our off-policy algorithms are more stable than the on-policy ones.
Environment | Algorithm | Episodes before solve | Steps before solve | Total steps
MountainCar-v0 | AC | 184 | 2 | 200
MountainCar-v0 | NAC | 176 | 9 | 200
MountainCar-v0 | RIS-Off-PAC | 150 | 2 | 200
MountainCar-v0 | RIS-Off-PNAC | 196 | 1 | 200
CartPole-v0 | AC | 188 | 758 | 1000
CartPole-v0 | NAC | 95 | 320 | 1000
CartPole-v0 | RIS-Off-PAC | 116 | 132 | 1000
CartPole-v0 | RIS-Off-PNAC | 135 | 105 | 1000
7 Conclusions
We have presented off-policy actor-critic reinforcement learning algorithms based on RIS that can achieve performance better than or similar to standard state-of-the-art methods. The method mitigates the instability of off-policy learning. In addition, our algorithms robustly solve classic RL problems such as MountainCar-v0 and CartPole-v0. Future work is to extend this idea to weighted RIS.
Acknowledgements
We would like to thank the editors and referees for their valuable suggestions and comments.
Appendix
Appendix A MountainCar-v0
Mountain car is a famous RL benchmark, shown in Figure 1(c); Moore (1991) first presented this problem in his PhD thesis. A car is positioned between two hills. The goal is to drive up the hill on the right and reach the top (position 0.5). However, the car's engine has inadequate power to climb the hill in a single pass, so the only way to succeed is to drive back and forth to build momentum. There are three actions, corresponding to the values of the force applied to the car. The state S consists of the car's position and velocity. We obtained our results using the following parameter values:
Actor-Critic (AC): we used learning rates of and for the actor and critic respectively.
Natural Actor-Critic (NAC): we used learning rates of and for the actor and critic respectively.
RIS-Off-PAC: we used learning rates of , and for the actor, critic, and off-policy network respectively.
RIS-Off-PNAC: we used learning rates of , and for the actor, critic, and off-policy network respectively.
We run for 200 time steps; an episode terminates when the car reaches its target at position 0.5, when the average reward reaches −110.0 over 100 consecutive episodes, or after 200 iterations. Our reward function is defined as
where a_t is the action chosen at time t, s_t is the state at time t, and s_{t+1} is the next state at time t+1.
Appendix B CartPole-v0
CartPole is another famous RL benchmark, shown in Figure 2(a). The cart-pole environment used here is described by Barto et al. (1983). A cart moves along a frictionless track while balancing a pole. The pole starts upright, and the goal is to stop it from falling over by increasing and decreasing the cart's velocity. A reward of +1 is given for every time step that the pole remains upright. There are two actions, corresponding to the values of the force applied to the cart. The state S consists of the cart position, cart velocity, pole angle, and pole angular velocity. We obtained our results using the following parameter values:
Actor-Critic (AC): we used learning rates of and for the actor and critic respectively.
Natural Actor-Critic (NAC): we used learning rates of and for the actor and critic respectively.
RIS-Off-PAC: we used learning rates of , and for the actor, critic, and off-policy network respectively.
RIS-Off-PNAC: we used learning rates of , and for the actor, critic, and off-policy network respectively.
We run for 1000 time steps; an episode ends when the pole is more than a threshold angle from vertical, when the cart travels more than a threshold distance from the center, when the average reward reaches 195.0 over 100 consecutive episodes, or after 1000 iterations. Our reward function is defined as
where a_t is the action chosen at time t, s_t is the state at time t, and s_{t+1} is the next state at time t+1.
References
 Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC13:834–846, 1983.
 Bhatnagar et al. (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45:2471–2482, 2009.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. CoRR, abs/1606.01540, 2016.
 Chou et al. (2017) Chou, P.W., Maturana, D., and Scherer, S. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In ICML, 2017.
 Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. Off-policy actor-critic. 2012.
 Elvira et al. (2015) Elvira, V., Martino, L., Luengo, D., and Bugallo, M. F. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22:1757–1761, 2015.
 Gruslys et al. (2017) Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The reactor: A sample-efficient actor-critic architecture. 2017.
 Gu et al. (2016) Gu, S., Lillicrap, T. P., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. CoRR, abs/1611.02247, 2016.
 Gu et al. (2017) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. 2017.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
 Hachiya et al. (2009) Hachiya, H., Akiyama, T., Sugiyama, M., and Peters, J. Adaptive importance sampling for value function approximation in off-policy reinforcement learning. Neural Networks, 22(10):1399–1410, 2009.
 Hanna et al. (2018) Hanna, J., Niekum, S., and Stone, P. Importance sampling policy evaluation with an estimated behavior policy. CoRR, abs/1806.01347, 2018.
 Harutyunyan et al. (2016) Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. Q(λ) with off-policy corrections. 2016.
 Konda & Tsitsiklis (2003) Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM J. Control and Optimization, 42:1143–1166, 2003.
 Levine & Koltun (2013) Levine, S. and Koltun, V. Guided policy search. In ICML, 2013.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 Mahmood et al. (2014) Mahmood, A. R., Hasselt, H. V., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In International Conference on Neural Information Processing Systems, pp. 3014–3022, 2014.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. 2016.
 Moore (1991) Moore, A. Efficient Memorybased Learning for Robot Control. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, March 1991.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. Safe and efficient off-policy reinforcement learning. In NIPS, 2016.
 Owen (2013) Owen, A. B. Monte Carlo theory, methods and examples. 2013.
 Peters et al. (2005) Peters, J., Vijayakumar, S., and Schaal, S. Natural actor-critic. In ECML, 2005.
 Precup et al. (2000) Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In ICML, 2000.
 Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In ICML, 2001.
 Rubinstein (1981) Rubinstein, R. Y. Simulation and the monte carlo method. In Wiley series in probability and mathematical statistics, 1981.
 Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
 Schulman et al. (2015a) Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. Trust region policy optimization. In ICML, 2015a.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. Highdimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015b.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. A. Deterministic policy gradient algorithms. In ICML, 2014.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
 Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L. R., Lai, M., Bolton, A., Chen, Y., Lillicrap, T. P., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, 550:354–359, 2017.

 Sugiyama (2016) Sugiyama, M. Introduction to Statistical Machine Learning. Morgan Kaufmann Publishers Inc., 2016.
 Sutton & Barto (2005) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, Bradford Book. IEEE Transactions on Neural Networks, 16(1):285–286, 2005.
 Sutton et al. (1999) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
 Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 17:73:1–73:29, 2016.
 Tang & Abbeel (2010) Tang, J. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. In NIPS, 2010.
 Thomas (2014) Thomas, P. Bias in natural actorcritic algorithms. In ICML, 2014.
 van Hasselt & Mahmood (2014) van Hasselt, H. and Mahmood, A. R. Off-policy TD(λ) with a true online equivalence. 2014.
 Van Seijen & Sutton (2014) Van Seijen, H. and Sutton, R. S. True online TD(λ). In International Conference on International Conference on Machine Learning, pp. I–692, 2014.
 Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actorcritic with experience replay. CoRR, abs/1611.01224, 2016.
 Yamada et al. (2011) Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative densityratio estimation for robust distribution comparison. Neural Computation, 25:1324–1370, 2011.
 Zimmer et al. (2018) Zimmer, M., Boniface, Y., and Dutech, A. Off-policy neural fitted actor-critic. 2018.