Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning

10/30/2018 ∙ by Mahammad Humayoo, et al.

Off-policy learning is more unstable than on-policy learning in reinforcement learning (RL). One reason for this instability is the discrepancy between the target (π) and behavior (b) policy distributions. This discrepancy can be alleviated by employing a smooth variant of importance sampling (IS), such as relative importance sampling (RIS), which has a parameter β ∈ [0, 1] that controls smoothness. To cope with instability, we present the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL. In our method, the network yields a target policy (the actor), a value function (the critic) assessing the current policy (π), and a behavior policy. We train our algorithm on actions generated by the behavior policy rather than the target policy, and we use deep neural networks to train both the actor and the critic. We evaluated our algorithm on a number of OpenAI Gym benchmark problems and demonstrate better or comparable performance to several state-of-the-art RL baselines.




1 Introduction

Model-free deep RL algorithms have been employed to solve a variety of complex tasks (Sutton & Barto, 2005; Silver et al., 2016, 2017; Mnih et al., 2013, 2016; Schulman et al., 2015a; Lillicrap et al., 2015; Gu et al., 2016). Model-free RL consists of on- and off-policy methods. Off-policy methods enable a target policy to be learned while following, and obtaining data (action values) from, another policy (the behavior policy); the agent learns about a policy distinct from the one it is carrying out. In on-policy methods there is a single policy (the target policy), and the agent learns only about the policy it is carrying out. In short, if the two policies are the same, i.e. π = b, the setting is called on-policy; otherwise, the setting is called off-policy (π ≠ b) (Harutyunyan et al., 2016; Degris et al., 2012; Precup et al., 2001; Gu et al., 2016; Hanna et al., 2018; Gruslys et al., 2017). On-policy methods offer unbiased and stable estimates but often suffer from high variance and sample inefficiency. Off-policy methods are more sample efficient and safe, but unstable. Neither on- nor off-policy methods are perfect, so several approaches have been proposed to remedy the deficiency of each: for example, how on-policy learning can achieve sample efficiency similar to off-policy learning (Gu et al., 2016; Mnih et al., 2016; Schaul et al., 2015; Schulman et al., 2015a; van Hasselt & Mahmood, 2014), and how off-policy learning can achieve stability similar to on-policy learning (Degris et al., 2012; Mahmood et al., 2014; Gruslys et al., 2017; Wang et al., 2016; Haarnoja et al., 2018). The aim of this study is to make off-policy learning as stable as on-policy learning using an actor-critic algorithm with deep neural networks. Therefore, this paper primarily focuses on off-policy rather than on-policy learning. A common approach is to use importance sampling techniques to stabilize off-policy learning against the mismatch between the behavior policy and the target policy (Gu et al., 2016; Hachiya et al., 2009; Rubinstein, 1981).

Importance sampling is a well-known method for off-policy evaluation, permitting off-policy data to be used as if it were on-policy (Hanna et al., 2018). IS can be used to study one distribution while sampling from another (Owen, 2013). The degree of deviation of the target policy from the behavior policy at each time t is captured by the importance sampling ratio, i.e. ρ_t = π(a_t|s_t) / b(a_t|s_t) (Precup et al., 2001). IS is also considered a method for reducing the variance of the estimate of an expectation by carefully choosing a sampling distribution b: if b is chosen properly, the new estimate has lower variance. The variance of the estimator depends on how much the sampling distribution and the target distribution differ (Rubinstein, 1981). For the theory behind the importance sampling presented here, we refer to (Owen, 2013, Chapter 9). Another reason for instability is that IS does not always generate uniform values for all samples: IS sometimes generates a large value for one sample and a small value for another. Thus, Yamada et al. (2011) propose a smooth variant of importance sampling, the relative importance sample, to mitigate instability. Some of the more important methods based on IS include WIS (Mahmood et al., 2014), ACER (Wang et al., 2016), Retrace (Munos et al., 2016), Q-Prop (Gu et al., 2016), SAC (Haarnoja et al., 2018), MIS (Elvira et al., 2015), Off-PAC (Degris et al., 2012), the Reactor (Gruslys et al., 2017), and GPS (Levine & Koltun, 2013).

In this paper, we propose an off-policy actor-critic algorithm based on relative importance sampling in deep reinforcement learning for stabilizing off-policy methods, called RIS-Off-PAC. To the best of our knowledge, this is the first time RIS has been combined with actor-critic methods. We use deep neural networks to train both the actor and the critic, and the behavior policy is also generated by a deep neural network. We also explore a different type of actor-critic algorithm, natural-gradient actor-critic, using RIS.

The rest of the paper is arranged as follows. Related work is discussed in section 2. In section 3, we present preliminaries. Sections 4 and 5 present importance sampling and the actor-critic model, respectively. Section 6 presents experiments. Finally, we present conclusions in section 7.

2 Related Work

2.1 On-Policy

Thomas (2014) shows that discounted-reward natural actor-critic algorithms are biased and presents unbiased average-reward natural actor-critic variants. Bhatnagar et al. (2009) present four new online actor-critic reinforcement learning algorithms based on the natural gradient, function approximation, and temporal-difference learning, and demonstrate the convergence of these four algorithms to a local maximum. Schaul et al. (2015) present a framework for prioritizing experience, so as to replay significant transitions more often and thus learn more efficiently. Bounded actions introduce bias when the standard Gaussian distribution is used as a stochastic policy; Chou et al. (2017) suggest using the Beta distribution instead of the Gaussian and examine the trade-off between bias and variance of the policy gradient in both on- and off-policy settings.

Mnih et al. (2016) propose four asynchronous deep RL algorithms. The most effective one, asynchronous advantage actor-critic (A3C), maintains a policy and an estimate of the value function. Van Seijen & Sutton (2014) introduce a true online TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performs better than its standard counterpart in both prediction and control problems. Schulman et al. (2015a) develop an algorithm called Trust Region Policy Optimization (TRPO) that offers monotonic policy improvement, and derive a practical algorithm with better sample efficiency and performance; it is similar to natural policy gradient methods. Schulman et al. (2015b) develop a variance reduction method for policy gradients, called generalized advantage estimation (GAE), where a trust region optimization method is used for the value function. The policy gradient of GAE significantly reduces variance while maintaining an acceptable level of bias. We are interested in off-policy rather than on-policy methods.

2.2 Off-Policy

Hachiya et al. (2009) consider the variance of the value function estimator for an off-policy method to control the trade-off between bias and variance. Mahmood et al. (2014) use weighted importance sampling with function approximation and extend it to a new weighted-importance-sampling form of off-policy LSTD(λ) called WIS-LSTD(λ). Degris et al. (2012) propose a method called off-policy actor-critic (Off-PAC) in which the agent learns a target policy while following and getting samples from a behavior policy. Gruslys et al. (2017) present a sample-efficient actor-critic reinforcement learning agent called Reactor; it uses the off-policy multi-step Retrace algorithm to train the critic, while a new policy gradient algorithm, called β-leave-one-out, is used to train the actor. Zimmer et al. (2018) show new off-policy actor-critic RL algorithms that cope with continuous state and action spaces using neural networks; their algorithm also allows a trade-off between data efficiency and scalability. Levine & Koltun (2013) address avoiding poor local optima in complex policies with hundreds of parameters using guided policy search (GPS); GPS uses differential dynamic programming to generate suitable guiding samples and defines a regularized importance-sampled policy optimization that incorporates these samples into policy search. Lillicrap et al. (2015) introduce a model-free, off-policy actor-critic algorithm using deep function approximators based on the deterministic policy gradient (DPG) that can learn policies in high-dimensional, continuous action spaces, called deep deterministic policy gradient (DDPG). Wang et al. (2016) present a stable, sample-efficient actor-critic deep RL agent with experience replay, called ACER, that applies successfully to both continuous and discrete action spaces; ACER utilizes truncated importance sampling with bias correction, stochastic dueling network architectures, and efficient trust region policy optimization. Munos et al. (2016) show a novel algorithm, called Retrace(λ), which has three properties: small variance, safety (because it can use samples collected from any behavior policy), and efficiency (because it estimates the Q-function from off-policy data efficiently). Gu et al. (2016) develop a method called Q-Prop which is both sample efficient and stable; it merges the advantages of on-policy methods (the stability of policy gradients) and off-policy methods (efficiency). Model-free deep RL algorithms typically suffer from two major challenges: sample inefficiency and instability. Haarnoja et al. (2018) present a soft actor-critic (SAC) method based on maximum entropy and off-policy learning; off-policy learning provides sample efficiency and entropy maximization provides stability. Most of these methods are similar to ours, but they use standard IS or entropy methods while we use RIS. For a review of IS off-policy methods, see the works of (Precup et al., 2000; Sutton et al., 2016; Tang & Abbeel, 2010; Elvira et al., 2015; Gu et al., 2017; van Hasselt & Mahmood, 2014; Precup et al., 2001).

3 Preliminaries

A Markov decision process (MDP) is a mathematical formulation of RL problems. An MDP is defined by a tuple of objects (S, A, R, P, γ), where S is the set of possible states, A is the set of possible actions, R is the distribution of the reward given a (state, action) pair, P is the transition probability, i.e. the distribution of the next state given a (state, action) pair, and γ is the discount factor. π and b denote the target policy and the behavior policy, respectively. A policy (π or b) is a function from S to A that specifies what action to take in each state. In classical RL, an agent interacts with an environment over a number of discrete time steps. At each time step t, the agent picks an action a_t according to its policy (π or b) given its present state s_t. In return, the agent gets the next state s_{t+1} according to the transition probability P and observes a scalar reward r_t. The process carries on until the agent arrives at a terminal state, after which the process starts again. The agent accumulates the γ-discounted total return from each state, i.e. R_t = Σ_{k≥0} γ^k r_{t+k}. In RL, there are two typical functions for evaluating action selection under a policy (π or b): the state-action value function Q(s, a) = E[R_t | s_t = s, a_t = a] and the state value function V(s) = E[R_t | s_t = s], where E denotes expectation. Finally, the goal of the agent is to maximize the expected return using the policy gradient ∇_θ J(θ) with respect to the parameters θ. The policy gradient (Sutton et al., 1999), taking notation from (Schulman et al., 2015b), is defined as:

g = E[ Σ_{t=0}^{∞} ∇_θ log π_θ(a_t|s_t) A(s_t, a_t) ],

where A is an advantage function. Schulman et al. (2015b) show that several expressions can be used in place of A(s_t, a_t) without introducing bias, such as the state-action value Q(s_t, a_t), the discounted return R_t, or the temporal difference (TD) residual r_t + γV(s_{t+1}) − V(s_t). We use the TD residual in our method. In practice, we use a neural network to estimate the advantage function, and this injects extra estimation error and bias. Classic policy gradient approximators with R_t have higher variance and lower bias, whereas approximators using function approximation have higher bias and lower variance (Wang et al., 2016). IS often has low bias but high variance (Sutton et al., 2016; Hachiya et al., 2009; Mahmood et al., 2014); we use RIS instead of IS. Merging the advantage function with function approximation and RIS, to achieve stable RL along with a trade-off between bias and variance, is one of our main aims. Policy gradient with function approximation denotes actor-critic (Sutton et al., 1999), which optimizes the policy against a critic, e.g. the deterministic policy gradient (Silver et al., 2014; Lillicrap et al., 2015).
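As a concrete illustration, the TD residual can stand in for the advantage in the gradient above. The sketch below uses made-up numbers and a hypothetical single-state, two-action softmax policy (not the paper's network): it computes δ_t = r_t + γV(s_{t+1}) − V(s_t) and takes one gradient step on the log-probability of the sampled action.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

gamma = 0.99
V = {"s": 0.0, "s_next": 1.0}             # illustrative value estimates
r = 1.0                                   # observed reward
delta = r + gamma * V["s_next"] - V["s"]  # TD residual as advantage estimate

logits = [0.0, 0.0]                       # policy parameters theta
probs = softmax(logits)
a = 0                                     # action actually taken
# Gradient of log pi(a) w.r.t. the logits of a softmax policy: one_hot(a) - probs.
grad_log_pi = [(1.0 if i == a else 0.0) - probs[i] for i in range(2)]
lr = 0.1
# One ascent step: theta <- theta + lr * delta * grad log pi(a).
logits = [z + lr * delta * g for z, g in zip(logits, grad_log_pi)]
```

A positive δ_t increases the probability of the taken action; a negative δ_t would decrease it.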

4 Standard Importance Sampling

One reason for the instability of off-policy learning is a discrepancy between distributions. In off-policy RL, we would like to gather data samples from the distribution of the target policy, but data samples are actually drawn from the distribution of the behavior policy. Importance sampling is a well-known approach to handle this kind of mismatch (Rubinstein, 1981; Precup et al., 2000). For example, we would like to estimate the expected value of a random variable x under the π (target policy) distribution while, in reality, samples are drawn from another distribution b (behavior policy). A classical form of importance sampling can be defined as:

E_π[x] = Σ_x π(x) x = Σ_x b(x) (π(x)/b(x)) x.

The importance sampling estimate of μ = E_π[x] is

μ̂_IS = (1/n) Σ_{i=1}^{n} (π(x_i)/b(x_i)) x_i,

where the x_i are samples drawn from b and the estimator computes the average of the weighted sample values.
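This estimator is easy to check numerically. The sketch below uses two illustrative discrete distributions (not from the paper): it draws samples from b, reweights them by π(x)/b(x), and recovers the mean under π.

```python
import random

random.seed(0)
# Estimate E_pi[x] while sampling from b (toy distributions over {0, 1, 2}).
pi = [0.7, 0.2, 0.1]    # target distribution
b  = [0.2, 0.3, 0.5]    # behavior (sampling) distribution
values = [1.0, 2.0, 3.0]

n = 100_000
samples = random.choices(range(3), weights=b, k=n)
# mu_hat = (1/n) * sum_i (pi(x_i) / b(x_i)) * x_i
mu_hat = sum(pi[i] / b[i] * values[i] for i in samples) / n
mu_true = sum(p * v for p, v in zip(pi, values))  # exact E_pi[x]
```

With enough samples, μ̂_IS concentrates around the true mean even though no sample was drawn from π.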

4.1 Relative Importance Sampling

Although some research, for instance (Wang et al., 2016; Precup et al., 2001; Gu et al., 2016), has been carried out on solving instability, no studies have been found which use a smooth variant of IS in RL. A smooth variant of IS, such as RIS (Sugiyama, 2016; Yamada et al., 2011), is used to ease the instability. Our quasi RIS can be defined as:

w_β(x) = π(x) / (β π(x) + (1 − β) b(x)), β ∈ [0, 1],

where β controls the smoothness. The relative importance function becomes the ordinary importance sampling ratio if β = 0. It becomes smoother as β is increased, and it produces the uniform weight 1 if β = 1. This is one of our main contributions.
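The three regimes of β can be verified directly. The sketch below uses made-up probabilities p = π(x) and q = b(x):

```python
def ris_weight(p, q, beta):
    # Relative importance weight: p / (beta * p + (1 - beta) * q).
    return p / (beta * p + (1.0 - beta) * q)

p, q = 0.9, 0.1                  # illustrative target and behavior probabilities
w0 = ris_weight(p, q, 0.0)       # beta = 0: ordinary IS ratio p/q
w1 = ris_weight(p, q, 1.0)       # beta = 1: uniform weight 1
w_half = ris_weight(p, q, 0.5)   # intermediate: smoother than p/q
```

Here the raw ratio p/q = 9 shrinks to 1.8 at β = 0.5 and collapses to 1 at β = 1, illustrating how β trades fidelity to the IS ratio for smoothness.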

Proposition 1.

Since the importance weight is always non-negative, the relative importance is no greater than 1/β:

w_β(x) = π(x) / (β π(x) + (1 − β) b(x)) ≤ 1/β.

Proof: let p = π(x) and q = b(x), where p, q ≥ 0 and β ∈ (0, 1]. Since (1 − β) q is non-negative, β p + (1 − β) q ≥ β p. Hence w_β(x) = p / (β p + (1 − β) q) ≤ p / (β p) = 1/β.

The RIS estimate of μ = E_π[x] is

μ̂_RIS = (1/n) Σ_{i=1}^{n} w_β(x_i) x_i.
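Numerically, the bounded RIS weights are far less dispersed than the raw IS ratios. The sketch below (toy distributions, not the paper's tasks) draws samples from b and compares the empirical variance of the two kinds of weights, also checking the 1/β bound of Proposition 1.

```python
import random
import statistics

random.seed(1)
# Compare the spread of ordinary IS weights pi/b against RIS weights
# pi / (beta*pi + (1-beta)*b) on the same behavior-policy samples.
pi = [0.80, 0.15, 0.05]
b  = [0.05, 0.15, 0.80]
beta = 0.5
n = 50_000
xs = random.choices(range(3), weights=b, k=n)
is_w  = [pi[i] / b[i] for i in xs]
ris_w = [pi[i] / (beta * pi[i] + (1 - beta) * b[i]) for i in xs]
var_is  = statistics.variance(is_w)   # dominated by the rare weight pi/b = 16
var_ris = statistics.variance(ris_w)  # every weight capped at 1/beta = 2
```

The rare samples with π(x)/b(x) = 16 inflate the IS weight variance, while every RIS weight stays below 1/β = 2.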

Theorem 1.

The relative importance sampling estimator μ̂_RIS is a consistent estimator of μ = E_π[x], and μ̂_RIS has bounded variance.

By contrast, the ordinary IS estimator is unbiased but suffers from very high variance, as it can involve a product of many potentially unbounded importance weights (Wang et al., 2016; Hachiya et al., 2009).

5 RIS-Off-PAC Algorithm

The actor-critic algorithm can be used with both on- and off-policy learning; however, our main focus is off-policy learning. In this section we present our algorithm for the actor and the critic, as well as a natural actor-critic version of our algorithm.

5.1 The Critic: Policy Evaluation

Let V_ω be an approximate value function with parameters ω, V_ω(s) ≈ V^π(s), and let b(·|s) give the behavior policy probabilities for the current state s. The TD residual of V with discount factor γ (Sutton & Barto, 2005) is given as δ_t = r_t + γV_ω(s_{t+1}) − V_ω(s_t). The policy gradient uses the value function V_ω to evaluate the target policy π. δ_t is considered an estimate of the advantage A(s_t, a_t) of the action a_t, where a_t ∼ b(·|s_t).

As can be seen from the above, in our method the agent uses actions generated by the behavior policy instead of the target policy. The approximate value function is trained to minimize the squared TD residual error:

L(ω) = (1/2) δ_t².
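For a single transition, the critic step amounts to the following sketch (a tabular V with made-up numbers, rather than the paper's neural network): compute the TD residual, form the squared error, and move V(s) toward the TD target.

```python
gamma, alpha = 0.99, 0.5
V = {"s": 0.0, "s_next": 0.8}   # illustrative value table
r = 1.0                          # observed reward

delta = r + gamma * V["s_next"] - V["s"]  # TD residual
loss = 0.5 * delta ** 2                   # squared TD residual error
V["s"] += alpha * delta                   # semi-gradient step on V(s)
```

Minimizing (1/2)δ² by (semi-)gradient descent yields exactly this update for a tabular value function; with a neural critic the same δ is backpropagated through V_ω.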

5.2 The Actor: Policy Improvement

The critic updates the value function parameters ω; the actor updates the policy parameters θ in the direction recommended by the critic. The actor selects which action to take, and the critic conveys to the actor how good its action was and how it should adjust. In practice, we use the approximate TD error δ_t to compute the policy gradient. The discounted TD residual can be used to establish a policy gradient estimator of the following form:

∇_θ J(θ) ≈ E[ δ_t ∇_θ log π_θ(a_t|s_t) ].

Our aim is to reduce the instability of off-policy learning. An imbalance between bias and variance (large bias and large variance, or small bias and large variance) often makes off-policy learning unstable. IS reduces bias but introduces high variance, because the IS ratio fluctuates greatly from sample to sample and the IS estimate averages the ratio-weighted terms, which is of high variance; thus a smooth variant of IS, such as RIS, is required to mitigate the high variance (high variance is directly proportional to instability). RIS has bounded variance and low bias: since RIS is bounded by Proposition 1, i.e. w_β ≤ 1/β, the variance of RIS is also bounded, and since IS reduces bias and RIS is a smooth variant of IS, RIS also reduces bias (Hachiya et al., 2009; Gu et al., 2017; Mahmood et al., 2014; Sugiyama, 2016). Therefore, to minimize bias while maintaining bounded variance, we use the off-policy case, where δ_t is estimated using actions drawn from b in place of π, and we combine the RIS ratio with ∇_θ log π_θ(a_t|s_t), which we call RIS-Off-PAC:

∇_θ J(θ) = E_b[ w_β(a_t|s_t) δ_t ∇_θ log π_θ(a_t|s_t) ]. (10)

Two important facts about equation (10) must be pointed out. First, note that it relies on actions from b instead of π. Second, it does not involve a product of several unbounded importance weights; instead it only needs the single relative importance weight w_β(a_t|s_t). Bounding RIS is expected to yield a lower variance. We present two variants of the actor-critic algorithm here: (i) relative importance sampling off-policy actor-critic (RIS-off-PAC) and (ii) relative importance sampling off-policy natural actor-critic (RIS-off-PNAC).

  Initialize: policy parameters θ, critic parameters ω, discount factor γ, done=false, t=0, β
  for t = 0 to T do
        Choose an action a_t according to b(·|s_t)
        Observe the next state s_{t+1}, reward r_t, and done
        Compute the TD residual δ_t = r_t + γV_ω(s_{t+1}) − V_ω(s_t)
        Update the critic: ω ← ω + α_ω δ_t ∇_ω V_ω(s_t)
        Update the actor: θ ← θ + α_θ w_β(a_t|s_t) δ_t ∇_θ log π_θ(a_t|s_t)
     until done is true
  end for
Algorithm 1 The RIS-off-PAC algorithm
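Algorithm 1 can be condensed into a tabular sketch. The code below is an illustrative toy, not the paper's TensorFlow implementation: a single-state, two-action problem where action 1 pays reward 1, action 0 pays 0, and the behavior policy b is fixed and uniform. It follows the loop above: act with b, compute δ_t, update the critic, then update the actor with the RIS-weighted gradient.

```python
import math
import random

random.seed(0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

gamma, alpha_v, alpha_pi, beta = 0.99, 0.1, 0.1, 0.5
theta = [0.0, 0.0]   # actor (target policy) parameters
V = 0.0              # critic value of the single state
b = [0.5, 0.5]       # fixed uniform behavior policy

for t in range(2000):
    pi = softmax(theta)
    a = random.choices([0, 1], weights=b)[0]        # act with behavior policy
    r = 1.0 if a == 1 else 0.0
    delta = r + gamma * 0.0 - V                     # next state terminal: V(s') = 0
    V += alpha_v * delta                            # critic update
    w = pi[a] / (beta * pi[a] + (1 - beta) * b[a])  # RIS weight
    grad = [(1.0 if i == a else 0.0) - pi[i] for i in range(2)]
    theta = [th + alpha_pi * w * delta * g for th, g in zip(theta, grad)]

final_pi = softmax(theta)   # should strongly prefer the rewarding action
```

Even though every action is drawn from b, the RIS-weighted updates shift the target policy π_θ toward the rewarding action.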

In algorithm 1, α_θ and α_ω are the learning rates for the actor and critic, respectively. State s represents the current state while s_{t+1} represents the next state. The next algorithm is RIS-off-PNAC, which is based on a natural gradient estimate; we refer to (Bhatnagar et al., 2009; Konda & Tsitsiklis, 2003; Peters et al., 2005; Silver et al., 2014) for further details. The only difference between RIS-Off-PAC and RIS-Off-PNAC is that RIS-Off-PNAC uses a natural gradient estimate in place of the regular gradient estimate. RIS-Off-PNAC utilizes Algorithm 2 and Equation 26 of Bhatnagar et al. (2009) to estimate the natural gradient. However, the natural actor-critic (NAC) algorithms of Bhatnagar et al. (2009) are on-policy while our algorithms are off-policy. In RL we want to maximize the reward, so the optimization problem we consider is a maximization instead of a minimization; in practice we minimize the negative objective, whose minimum corresponds to the maximum reward of the original problem.

  Initialize: policy parameters θ, critic parameters ω, discount factor γ, done=false, t=0, β
  for t = 0 to T do
        Choose an action a_t according to b(·|s_t)
        Observe the next state s_{t+1}, reward r_t, and done
        Compute the TD residual δ_t = r_t + γV_ω(s_{t+1}) − V_ω(s_t)
        Update the critic: ω ← ω + α_ω δ_t ∇_ω V_ω(s_t)
        Update the actor: θ ← θ + α_θ (natural gradient estimate of w_β(a_t|s_t) δ_t ∇_θ log π_θ(a_t|s_t))
     until done is true
  end for
Algorithm 2 The RIS-off-PNAC algorithm

5.3 RIS-Off-Policy Actor-critic Architecture

Figure 1(a) shows the RIS-Off-PAC architecture. The difference between RIS-Off-PAC and the traditional actor-critic architecture (Sutton et al., 1999; Sutton & Barto, 2005) is that we introduce a behavior policy b and RIS into our method, use actions generated by b instead of π, and our method is off-policy. We compute RIS using both the π and b policies inside the actor; therefore, we pass samples from b to the actor as shown in figure 1(a). The TD error and the other components are the same as in a traditional actor-critic method. Figure 1(b) shows the RIS-Off-PAC neural network (NN) architecture. We use the classical RL tasks MountainCar and CartPole for our experiments and apply our RIS-Off-PAC NN to both of these tasks. The details of our NN are as follows. In our architecture, we have a target network (actor), a value network (critic) and an off-policy network (behavior policy). Each of them is implemented with fully connected layers using Tensorflow, as shown in figure 1(b). Each NN contains an input layer, 2 hidden layers (hidden layer 1 and hidden layer 2), and an output layer. Hidden layer 1 has 24 neurons (units) in all three networks for all RL tasks. Hidden layer 2 has a single neuron in the value network for all RL tasks. The number of neurons in hidden layer 2 for the target network and the off-policy network equals the number of actions available in the given RL task. Hidden layer 1 employs the RELU activation function in the target and value networks, and the CRELU activation function in the off-policy network. Hidden layer 2 utilizes the SOFTMAX activation function in the target and off-policy networks, whereas it uses no activation function in the value network. Weights W are generated using the "he_uniform" function of Tensorflow for all NNs and tasks. We used the AdamOptimizer for learning the neural network parameters for all RL tasks. β is generated as a uniform random value between 0 and 1.
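The three networks just described can be sketched in plain Python to make the shapes and activations concrete. Everything below (the state vector, the he_uniform stand-in, the layer helpers) is illustrative, not the paper's TensorFlow 1.7 code; shapes follow the CartPole case (4 state inputs, 2 actions, 24 hidden units).

```python
import math
import random

random.seed(0)

def relu(v):  return [max(0.0, x) for x in v]
def crelu(v): return relu(v) + relu([-x for x in v])  # concat(relu(x), relu(-x))
def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]
def dense(x, w, b_):  # fully connected layer: one weight row per output unit
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(w, b_)]
def init(n_in, n_out):  # stand-in for Tensorflow's "he_uniform" initializer
    lim = math.sqrt(6.0 / n_in)
    return ([[random.uniform(-lim, lim) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

n_state, n_actions, hidden = 4, 2, 24
s = [0.01, 0.0, -0.02, 0.0]                # illustrative state vector

w1, b1 = init(n_state, hidden)             # target net, hidden layer 1 (RELU)
w2, b2 = init(hidden, n_actions)           # target net, hidden layer 2 (SOFTMAX)
pi = softmax(dense(relu(dense(s, w1, b1)), w2, b2))        # target policy

v1, c1 = init(n_state, hidden)             # value net, hidden layer 1 (RELU)
v2, c2 = init(hidden, 1)                   # value net, hidden layer 2 (no activation)
V = dense(relu(dense(s, v1, c1)), v2, c2)[0]               # state value

o1, d1 = init(n_state, hidden)             # off-policy net, hidden layer 1 (CRELU)
o2, d2 = init(2 * hidden, n_actions)       # CRELU doubles the layer width
b_probs = softmax(dense(crelu(dense(s, o1, d1)), o2, d2))  # behavior policy
```

Note that CRELU concatenates relu(x) and relu(−x), so the layer that follows it takes twice as many inputs.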

(a) The RIS-Off-PAC Architecture.
(b) The RIS-Off-PAC Neural Network Architecture.

6 Experimental Setup

We conduct experiments on two OpenAI Gym classic control tasks: MountainCar-v0 and CartPole-v0. Our experiments run on a single machine with 16 GB memory, an Intel Core i7-2600 CPU, and no GPU. We use the 64-bit Ubuntu 18.04.1 LTS operating system, Python 3.6.4, Tensorflow 1.7, and OpenAI Gym (Brockman et al., 2016). For all experiments, we use the simulated environments provided by the OpenAI Gym library. The specific experimental details of each control task can be found in appendices A and B.

6.1 Experimental Results

We evaluate RIS-Off-PAC policies on two OpenAI Gym environments: MountainCar-v0 and CartPole-v0. The results obtained from our experiments are presented in table 1. The goal in MountainCar-v0 is to drive up the hill on the right and reach the top of the mountain in a minimum of episodes and steps. We use a maximum of 200 steps for each algorithm. Figure 1(d) shows the results obtained using the different algorithms; as shown there, RIS-Off-PNAC outperforms all other algorithms. RIS-Off-PAC and AC each take two steps to reach the goal; however, RIS-Off-PAC takes only 150 episodes whereas AC takes 184 episodes. Overall, the RIS-Off-PAC and RIS-Off-PNAC algorithms appear more stable than the other two. The results for the RIS-Off-PAC and RIS-Off-PNAC algorithms using different values of β are shown in Figures 1(e) and 1(f), respectively. β controls the smoothness, which helps to remove instability, and the instability mitigation depends on the choice of β: off-policy learning becomes more stable when RIS is smoother, and RIS gets smoother as β increases. From Figure 1(e) we can see that RIS-Off-PAC is stable for all tested values of β except one. RIS-Off-PNAC is stable for all values of β, as can be seen from Figure 1(f). The goal in CartPole-v0 is to prevent the pole from falling over for as long as possible. We use a maximum of 1000 steps for each algorithm. Figure 2(b) shows that the RIS-Off-PNAC algorithm outperforms all other algorithms: it takes only 135 episodes and 105 steps to accomplish the goal. RIS-Off-PAC secures second place, whereas AC and NAC take more than 200 steps to complete the goal. Figures 2(c) and 2(d) are analogous to 1(e) and 1(f) for different values of β; overall, both algorithms show the same kind of stability for all values of β. A leaderboard webpage tracks the performance of user algorithms for various tasks in the gym, including MountainCar-v0 and CartPole-v0. When we compare our results from table 1 with the leaderboard, our RIS-Off-PAC and RIS-Off-PNAC for mountain car outperform all user algorithms. Similarly, our RIS-Off-PAC and RIS-Off-PNAC for cart pole outperform several user algorithms. Videos of the policies learned with RIS-Off-PAC for MountainCar-v0 and CartPole-v0 are available online. AC and NAC are on-policy algorithms while RIS-Off-PAC and RIS-Off-PNAC are off-policy algorithms; our experiments show that our off-policy algorithms are more stable than the on-policy ones.

(c) Mountain Car
(d) All Algorithms
(e) RIS-Off-PAC Algorithm
(f) RIS-Off-PNAC Algorithm
Figure 1: (a) Screenshots of the MountainCar control task on OpenAI Gym. (b) Training summary of all four algorithms. (c), (d) Training summary of RIS-Off-PAC and RIS-Off-PNAC, respectively, for different values of β. The x-axis shows the total number of training episodes. The y-axis shows the average reward over 200 steps.
(a) Cart Pole
(b) Cart Pole
(c) RIS-Off-PAC Algorithm
(d) RIS-Off-PNAC Algorithm
Figure 2: (a) Screenshots of the CartPole control task on OpenAI Gym. (b) Training summary of all four algorithms. (c), (d) Training summary of RIS-Off-PAC and RIS-Off-PNAC, respectively, for different values of β. The x-axis shows the total number of training episodes. The y-axis shows the average reward over 1000 steps.
Environments Algorithm Episodes before solve #Steps before solve #Total Steps
MountainCar v0 AC 184 2 200
MountainCar v0 NAC 176 9 200
MountainCar v0 RIS-Off-PAC 150 2 200
MountainCar v0 RIS-Off-PNAC 196 1 200
CartPole v0 AC 188 758 1000
CartPole v0 NAC 95 320 1000
CartPole v0 RIS-Off-PAC 116 132 1000
CartPole v0 RIS-Off-PNAC 135 105 1000
Table 1: Comparison of algorithm performance across mountain car and cart pole.

7 Conclusions

We have presented off-policy actor-critic reinforcement learning algorithms based on RIS. They achieve better or similar performance to standard state-of-the-art methods, and the approach mitigates the instability issue of off-policy learning. In addition, our algorithms robustly solve classic RL problems such as MountainCar-v0 and CartPole-v0. Future work is to extend this idea to weighted RIS.


We would like to thank the editors and referees for their valuable suggestions and comments.


Appendix A MountainCar v0

Mountain car is a famous benchmark for RL, shown in figure 1(c). Moore (1991) first presented this problem in his PhD thesis. A car is stationed between two hills, and the goal is to drive up the hill on the right and reach the top (top = 0.5 position). However, the car's engine has inadequate power to climb the hill in a single pass; the only way to succeed is to drive back and forth to build momentum. There are three actions, corresponding to the values of the force applied to the car. The state S is defined as (position, velocity). We obtained our results using the following parameter values:
Actor-Critic (AC): we used separate learning rates for the actor and the critic.
Natural Actor-Critic (NAC): we used separate learning rates for the actor and the critic.
RIS-Off-PAC: we used separate learning rates for the actor, the critic, and the off-policy network.
RIS-Off-PNAC: we used separate learning rates for the actor, the critic, and the off-policy network.
We run for 200 time steps; the episode terminates when the car reaches its target at the top = 0.5 position, when the average reward over 100 consecutive steps reaches -110.0, or when 200 iterations are completed. Our reward function is defined as

where a_t is the action chosen at time t, s_t is the state at time t, and s_{t+1} is the next state at time t+1.

Appendix B CartPole v0

CartPole is another famous benchmark for RL, shown in figure 2(a). The cart-pole environment used here is described by Barto et al. (1983). A cart moves along a frictionless track while balancing a pole. The pendulum starts upright, and the goal is to stop it from falling over by increasing and reducing the cart's velocity. A reward of +1 is given for every time step that the pole remains upright. There are two actions, corresponding to the values of the force applied to the cart. The state S is defined as (cart position, cart velocity, pole angle, pole angular velocity). We obtained our results using the following parameter values:
Actor-Critic (AC): we used separate learning rates for the actor and the critic.
Natural Actor-Critic (NAC): we used separate learning rates for the actor and the critic.
RIS-Off-PAC: we used separate learning rates for the actor, the critic, and the off-policy network.
RIS-Off-PNAC: we used separate learning rates for the actor, the critic, and the off-policy network.
We run for 1000 time steps; the episode ends when the pole is more than 12 degrees from vertical, when the cart travels more than 2.4 units from the center, when the average reward over 100 consecutive steps reaches 195.0, or when 1000 iterations are completed. Our reward function is defined as

where a_t is the action chosen at time t, s_t is the state at time t, and s_{t+1} is the next state at time t+1.


  • Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983.
  • Bhatnagar et al. (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45:2471–2482, 2009.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. CoRR, abs/1606.01540, 2016.
  • Chou et al. (2017) Chou, P.-W., Maturana, D., and Scherer, S. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In ICML, 2017.
  • Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. Off-policy actor-critic. 2012.
  • Elvira et al. (2015) Elvira, V., Martino, L., Luengo, D., and Bugallo, M. F. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22:1757–1761, 2015.
  • Gruslys et al. (2017) Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The reactor: A sample-efficient actor-critic architecture. 2017.
  • Gu et al. (2016) Gu, S., Lillicrap, T. P., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. CoRR, abs/1611.02247, 2016.
  • Gu et al. (2017) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. 2017.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • Hachiya et al. (2009) Hachiya, H., Akiyama, T., Sugiayma, M., and Peters, J. Adaptive importance sampling for value function approximation in off-policy reinforcement learning. Neural Networks, 22(10):1399–1410, 2009.
  • Hanna et al. (2018) Hanna, J., Niekum, S., and Stone, P. Importance sampling policy evaluation with an estimated behavior policy. CoRR, abs/1806.01347, 2018.
  • Harutyunyan et al. (2016) Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. Q(λ) with off-policy corrections. 2016.
  • Konda & Tsitsiklis (2003) Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM J. Control and Optimization, 42:1143–1166, 2003.
  • Levine & Koltun (2013) Levine, S. and Koltun, V. Guided policy search. In ICML, 2013.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • Mahmood et al. (2014) Mahmood, A. R., Hasselt, H. V., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In International Conference on Neural Information Processing Systems, pp. 3014–3022, 2014.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • Moore (1991) Moore, A. Efficient Memory-based Learning for Robot Control. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, March 1991.
  • Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. Safe and efficient off-policy reinforcement learning. In NIPS, 2016.
  • Owen (2013) Owen, A. B. Monte Carlo theory, methods and examples. 2013.
  • Peters et al. (2005) Peters, J., Vijayakumar, S., and Schaal, S. Natural actor-critic. In ECML, 2005.
  • Precup et al. (2000) Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In ICML, 2000.
  • Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In ICML, 2001.
  • Rubinstein (1981) Rubinstein, R. Y. Simulation and the monte carlo method. In Wiley series in probability and mathematical statistics, 1981.
  • Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
  • Schulman et al. (2015a) Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. Trust region policy optimization. In ICML, 2015a.
  • Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015b.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. A. Deterministic policy gradient algorithms. In ICML, 2014.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
  • Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L. R., Lai, M., Bolton, A., Chen, Y., Lillicrap, T. P., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, 550:354–359, 2017.
  • Sugiyama (2016) Sugiyama, M. Introduction to Statistical Machine Learning. Morgan Kaufmann Publishers Inc., 2016.
  • Sutton & Barto (2005) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction (A Bradford Book). IEEE Transactions on Neural Networks, 16(1):285–286, 2005.
  • Sutton et al. (1999) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
  • Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 17:73:1–73:29, 2016.
  • Tang & Abbeel (2010) Tang, J. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. In NIPS, 2010.
  • Thomas (2014) Thomas, P. Bias in natural actor-critic algorithms. In ICML, 2014.
  • van Hasselt & Mahmood (2014) van Hasselt, H. and Mahmood, A. R. Off-policy TD(λ) with a true online equivalence. 2014.
  • Van Seijen & Sutton (2014) Van Seijen, H. and Sutton, R. S. True online TD(λ). In International Conference on Machine Learning, 2014.
  • Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. CoRR, abs/1611.01224, 2016.
  • Yamada et al. (2011) Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25:1324–1370, 2011.
  • Zimmer et al. (2018) Zimmer, M., Boniface, Y., and Dutech, A. Off-policy neural fitted actor-critic. 2018.