Log In Sign Up

Mean Actor Critic

We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning. MAC is a policy gradient algorithm that uses the agent's explicit representation of all action values to estimate the gradient of the policy, rather than using only the actions that were actually executed. This significantly reduces variance in the gradient updates and removes the need for a variance reduction baseline. We show empirical results on two control domains where MAC performs as well as or better than other policy gradient approaches, and on five Atari games, where MAC is competitive with state-of-the-art policy search algorithms.


page 1

page 2

page 3

page 4


Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

We present the first class of policy-gradient algorithms that work with ...

Sample-efficient Policy Optimization with Stein Control Variate

Policy gradient methods have achieved remarkable successes in solving ch...

Combining policy gradient and Q-learning

Policy gradient is an efficient technique for improving a policy in a re...

Distributional Policy Optimization: An Alternative Approach for Continuous Control

We identify a fundamental problem in policy gradient-based methods in co...

Variance Reduction for Policy-Gradient Methods via Empirical Variance Minimization

Policy-gradient methods in Reinforcement Learning(RL) are very universal...

Learning Permutations with Sinkhorn Policy Gradient

Many problems at the intersection of combinatorics and computer science ...

Discrete Action On-Policy Learning with Action-Value Critic

Reinforcement learning (RL) in discrete action space is ubiquitous in re...

1 Introduction

In reinforcement learning (RL), two important classes of algorithms are value-function-based methods and policy search methods. Value-function-based methods maintain an estimate of the value of performing each action in each state, and choose the actions associated with the most value in their current state [Sutton and Barto1998]. By contrast, policy search algorithms maintain an explicit policy, and agents draw actions directly from that policy to interact with their environment [Sutton et al.2000]

. A subset of policy search algorithms, policy gradient methods, represent the policy using a differentiable parameterized function approximator (for example, a neural network) and use stochastic gradient ascent to update its parameters to achieve more reward.

To facilitate gradient ascent, the agent interacts with its environment according to the current policy and keeps track of the outcomes of its actions. From these (potentially noisy) sampled outcomes, the agent estimates the gradient of the objective function. A critical question here is how to compute an accurate gradient using these samples, which may be costly to acquire, while using as few sample interactions as possible.

Actor-critic algorithms compute the policy gradient using a learned value function to estimate expected future reward [Sutton et al.2000, Konda and Tsitsiklis2000]. Since the expected reward is a function of the environment’s dynamics, which the agent does not know, it is typically estimated by executing the policy in the environment. Existing algorithms compute the policy gradient using the value of states the agent visits, and critically, these methods take into account only the actions the agent actually executes during environmental interaction.

We propose a new policy gradient algorithm, Mean Actor-Critic (or MAC), for the discrete-action continuous-state case. MAC uses the agent’s policy distribution to average the value function over all actions, rather than using the action-values of only the sampled actions. We prove that, under modest assumptions, this approach reduces variance in the policy gradient estimates relative to traditional actor-critic approaches. We implement MAC using deep neural networks, and we show empirical results on two control domains and six Atari games, where MAC is competitive with state-of-the-art policy search methods.

We note that the core idea behind MAC has also been independently and concurrently explored by ciosek2017expected ciosek2017expected. However, their results mainly focus on continuous action spaces and are more theoretical. We introduce a simpler proof of variance reduction that makes fewer assumptions, and we also show that the algorithm works well in discrete-action domains.

2 Background

In RL, we train an agent to select actions in its environment so that it maximizes some notion of long-term reward. We formalize the problem as a Markov decision process (MDP)

[Puterman1990], which we specify by the tuple , where is a set of states, is a fixed initial state, is a set of discrete actions, the functions and respectively describe the reward and transition dynamics of the environment, and is a discount factor representing the relative importance of immediate versus long-term rewards.

More concretely, we denote the expected reward for performing action in state as:

and we denote the probability that performing action

in state results in state as:

In the context of policy search methods, the agent maintains an explicit policy denoting the probability of taking action in state under the policy parameterized by

. Note that for each state, the policy outputs a probability distribution over the discrete set of actions:

. At each timestep , the agent takes an action drawn from its policy , then the environment provides a reward signal and transitions to the next state .

The agent’s goal at every timestep is to maximize the sum of discounted future rewards, or simply return, which we define as:

In a slight abuse of notation, we will also denote the total return for a trajectory as , which is equal to for that same trajectory.

The agent’s policy induces a value function over the state space. The expression for return allows us to define both a state value function, , and a state-action value function, . Here, represents the expected return starting from state , and following the policy thereafter, and represents the expected return starting from , executing action , and then following the policy thereafter:

Note that:

The agent’s goal is to find a policy that maximizes the return for every timestep, so we define an objective function that allows us to score an arbitrary policy parameter :

where denotes a trajectory. Note that the probability of a specific trajectory depends on policy parameters as well as the dynamics of the environment. Our goal is to be able to compute the gradient of with respect to the policy parameters :


where is the discounted state distribution. In the second and third lines we rewrite the gradient term using a score function. In the fourth line, we convert the summation to an expectation, and use the notation in place of . Next, we make use of the fact that , given by williams1992simple williams1992simple. Intuitively this makes sense, since the policy for a given state should depend only on the rewards achieved after that state. Finally, we invoke the definition that .

A nice property of expectation (1) is that, given access to , the expectation can be estimated through implementing policy in the environment. Alternatively, we can estimate using the return , which is an unbiased (and usually a high variance) sample of . This is essentially the idea behind the REINFORCE algorithm [Williams1992], which uses the following gradient estimator:


Alternatively, we can estimate using some sort of function approximation: , which results in variants of actor-critic algorithms. Perhaps the simplest actor-critic algorithm approximates (1) as follows:


Note that value function approximation can, in general, bias the gradient estimation [Baxter and Bartlett2001].

One way of reducing variance in both REINFORCE and actor-critic algorithms is to use an additive control variate as a baseline [Williams1992, Sutton et al.2000, Greensmith, Bartlett, and Baxter2004]. The baseline function is typically a function that is fixed over actions, and so subtracting it from either the sampled returns or the estimated Q-values does not bias the gradient estimation. We refer to techniques that use such a baseline as advantage variations of the basic algorithms, since they approximate the advantage of choosing action over some baseline representing “typical” performance for the policy in state [Baird1994]. The update performed by advantage REINFORCE is:

where is a scalar baseline measuring the performance of the policy, such as a running average of the observed return over the past few episodes of interaction.

Advantage actor-critic uses an approximation of the expected value of each state as its baseline: , which leads to the following update rule:

Another way of estimating the advantage function is to use the TD-error signal . This approach is convenient, because it only requires estimating one set of parameters, namely for . However, because the TD-error is a sample of the advantage function , this approach has higher variance (due to the environmental dynamics) than methods that explicitly compute . Moreover, given and , can easily be computed as , so in practice, it is still only necessary to estimate one set of parameters (for ).

3 Mean Actor-Critic

An overwhelming majority of recent actor-critic papers have computed the policy gradient using an estimate similar to Equation (3) [Degris, White, and Sutton2012, Mnih et al.2016, Wang et al.2016]. This estimate samples both states and actions from trajectories executed according to the current policy in order to compute the gradient of the objective function with respect to the policy weights.

Instead of using only the sampled actions, Mean Actor-Critic (MAC) explicitly computes the probability-weighted average over all Q-values, for each state sampled from the trajectories. In doing so, MAC is able to produce an estimate of the policy gradient where the variance due to action sampling is reduced to zero. This is exactly the difference between computing the sample mean (whose variance is inversely proportional to the number of samples), and calculating the mean directly (which is simply a scalar with no variance).

MAC is based on the observation that expectation (1), which we repeat here, can be rewritten in the following way:


We can estimate (4) by sampling states from a trajectory and using function approximation:

In our implementation, the inner summation is computed by combining two neural networks that represent the policy and state-action value function. The value function can be learned using a variety of methods, such as temporal-difference learning or Monte Carlo sampling. After performing a few updates to the value function, we update the parameters of the policy with the following update rule:


To improve stability, repeated updates to the value and policy networks are interleaved, as in Generalized Policy Iteration [Sutton and Barto1998].

In traditional actor-critic approaches, which we refer to as sampled-action actor-critic, the only actions involved in the computation of the policy gradient estimate are those that were actually executed in the environment. In MAC, computing the policy gradient estimate will frequently involve actions that were not actually executed in the environment. This results in a trade-off between bias and variance. In domains where we can expect accurate Q-value predictions from our function approximator, despite not actually executing all of the relevant state-action pairs, MAC results in lower variance gradient updates and increased sample-efficiency. In domains where this assumption is not valid, MAC may perform worse than sampled-action actor-critic due to increased bias.

Figure 1: Screenshots of the classic control domains Cart Pole (left) and Lunar Lander (right)

In some ways, MAC is similar to Expected Sarsa [Van Seijen et al.2009]. Expected Sarsa considers all next-actions , then computes the expected TD-error, , and uses the resulting error signal to update the function. By contrast, MAC considers all current-actions , and uses the corresponding values to update the policy directly.

It is natural to consider whether MAC could be improved by subtracting an action-independent baseline, as in sampled-action actor-critic and REINFORCE:

However, we can simplify the expectation as follows:

In doing so, we see that both and the gradient operator can be moved outside of the summation, leaving just the sum of the action probabilities, which is always 1, and hence the gradient of the baseline term is always zero. This is true regardless of the choice of baseline, since the baseline cannot be a function of the actions or else it will bias the expectation. Thus, we see that subtracting a baseline is unnecessary in MAC, since it has no effect on the policy gradient estimate.

4 Analysis of Bias and Variance

In this section we prove that MAC does not increase variance over sampled-action actor-critic (AC), and also, that given a fixed , both algorithms have the same bias. We start with the bias result.

4.1 Theorem 1

If the estimated Q-values, , for both MAC and AC are the same in expectation, then the bias of MAC is equal to the bias of AC.

4.2 Proof

See Appendix A.

This result makes sense because in expectation, AC will choose all of the possible actions with some probability according to the policy. MAC simply calculates this expectation over actions explicitly. We now move to the variance result.

4.3 Theorem 2

If the estimated Q-values, , for both MAC and AC are the same in expectation, and if is independent of for , then . For deterministic policies, there is equality, and for stochastic policies the inequality is strict.

4.4 Proof

See Appendix B.

Intuitively, we can see that for cases where the policy is deterministic, MAC’s formulation of the policy gradient is exactly equivalent to AC, and hence we can do no better than AC. For high-entropy policies, MAC will beat AC in terms of variance.

5 Experiments

This section presents an empirical evaluation of MAC across three different problem domains. We first evaluate the performance of MAC versus popular policy gradient benchmarks on two classic control problems. We then evaluate MAC on a subset of Atari 2600 games and investigate its performance compared to state-of-the-art policy search methods.

5.1 Classic Control Experiments

In order to determine whether MAC’s lower variance policy gradient estimate translates to faster learning, we chose two classic control problems, namely Cart Pole and Lunar Lander, and compared MAC’s performance against four standard sampled-action policy gradient algorithms. We used the open-source implementations of Cart Pole and Lunar Lander provided by OpenAI Gym [Brockman et al.2016], in which both domains have continuous state spaces and discrete action spaces. Screenshots of the two domains are provided in Figure 1.

Algorithm Cart Pole Lunar Lander
Adv. Actor-Critic
Table 1: Performance summary of MAC vs. sampled-action policy gradient algorithms. Scores denote the mean performance of each algorithm over all trials and episodes.
Figure 2: Performance comparison for CartPole (left) and Lunar Lander (right) of MAC vs. sampled-action policy gradient algorithms. Results are averaged over 100 independent trials.

For each problem domain, we implemented MAC using two independent neural networks, representing the policy and Q function. We then performed a hyperparameter search to determine the best network architectures, optimization method, and learning rates. Specifically, the hyperparameter search considered: 0, 1, 2, or 3 hidden layers; 50, 75, 100, or 300 neurons per layer; ReLU, Leaky ReLU (with leak factor 0.3), or tanh activation; SGD, RMSProp, Adam, or Adadelta as the optimization method; and a learning rate chosen from 0.0001, 0.00025, 0.0005, 0.001, 0.005, 0.01, or 0.05. To find the best setting, we ran 10 independent trials for each combination of hyperparameters and chose the setting with the best asymptotic performance over the 10 trials. We terminated each episode after 200 and 1000 timesteps (in Cart Pole and Lunar Lander, respectively), regardless of the state of the agent.

We compared MAC against four standard benchmarks: REINFORCE, advantage REINFORCE, actor-critic, and advantage actor-critic. We implemented the REINFORCE benchmarks using just a single neural network to represent the policy, and we implemented the actor-critic benchmarks using two networks to represent both the policy and Q function. For each benchmark algorithm, we then performed the same hyperparameter search that we had used for MAC.

In order to keep the variance as low as possible for the advantage actor-critic benchmark, we explicitly computed the advantage function , where , rather than sampling it using the TD-error (see Section 2).

Once we had determined the best hyperparameter settings for MAC and each of the benchmark algorithms, we then ran each algorithm for 100 independent trials. Figure 2 shows learning curves for the different algorithms, and Table 1 summarizes the results using the mean performance over trials and episodes. On Cart Pole, MAC learns substantially faster than all of the benchmarks, and on Lunar Lander, it performs competitively with the best benchmark algorithm, advantage actor-critic.

5.2 Atari Experiments

To test whether MAC can scale to larger problem domains, we evaluated it on several Atari 2600 games using the Arcade Learning Environment (ALE) [Bellemare et al.2013] and compared MAC’s performance against that of state-of-the-art policy search methods, namely, Trust Region Policy Optimization (TRPO) [Schulman et al.2015], Evolutionary Strategies (ES) [Salimans et al.2017], and Advantage Actor-Critic (A2C) [Wu et al.2017]. Due to the computational load inherent in training deep networks to play Atari games, we limited our experiments to a subset of six Atari games: Beamrider, Breakout, Pong, Q*bert, Seaquest and Space Invaders. These six games are commonly selected for tuning hyperparameters [Mnih et al.2015, Mnih et al.2016, Wu et al.2017], and thus provide a fair comparison against established benchmarks, despite our limited computational resources.

The MAC network architecture was derived from the OpenAI Baselines implementation of A2C [Wu et al.2017]

. It uses three convolutional layers (size/stride/filters: 8/4/32, 4/2/64, 3/1/64), followed by a fully-connected layer (size 512), all with ReLU activation. A final fully-connected layer is split into two batches of N outputs each, where N is the number of actions. One batch uses a linear activation and corresponds to the Q-values; the other batch uses a softmax activation and corresponds to the policy. We used this architecture for both the MAC results and the A2C results. The TRPO and ES results are taken from their respective papers.

Game Random TRPO ES A2C MAC
Beam Rider 363.9 1425.2 744.0 5846.0 6072.0
Breakout 1.7 10.8 9.5 370.9 372.7
Pong -20.7 20.9 21.0 18.0 10.6
Q*bert 183.0 1973.5 147.5 1651.5 243.4
Seaquest 68.4 1908.6 1390.0 1702.5 1703.4
Space Invaders 148.0 568.4 678.5 1201.2 1173.1
Table 2: Atari performance of MAC vs. policy search methods (random start condition). TRPO and ES results are from their respective papers [Schulman et al.2015, Salimans et al.2017]. A2C and MAC results were obtained with modified versions of the OpenAI Baselines implementation of A2C [Wu et al.2017].

We trained the network using a variation of the multi-part loss function used in A2C

[Wu et al.2017]. The value loss at each timestep was equal to the mean squared error between the observed reward and the Q-value of the selected action. The policy entropy loss was simply the negative entropy of the policy at each timestep. For the A2C experiments, the policy improvement loss was the negative log probability of the selected action times its advantage value. For the MAC experiments, the policy improvement loss became the negative sum of action probabilities times their associated Q-values. The overall loss function was a linear combination of the policy improvement loss (coefficient 0.1), policy entropy loss (coefficient 0.001), and value loss (coefficient 0.5), and the network was trained using RMSProp with a learning rate of 1.5e-3. These coefficients trade off the importance of learning good Q-values, improving the policy, and preventing the policy from converging prematurely. This configuration of hyperparameters was found to perform well experimentally for both methods after a small hyperparameter search. The only difference between the A2C and MAC implementations was to replace A2C’s sampled-action policy improvement loss with MAC’s sum-over-actions loss; the algorithms used exactly the same architecture and hyperparameters.

For A2C and MAC, we trained a network for each game on 50 million frames of play, across 16 parallel threads, pausing every 200K frames to evaluate performance and compute learning curves. In each evaluation, we ran 16 agents in parallel, for 4500 frames (5 minutes) each, or 50 total episodes, whichever came first, and averaged the scores of the completed (or timed-out) episodes. Agents were trained and evaluated under the typical random start condition, where the game is initialized with a random number of no-op ALE actions (between 0 and 30) [Mnih et al.2015]. The A2C and MAC results in Table 2 come from the final evaluation after all 50M frames, and they are averaged across 5 trials involving separately trained networks. Learning curves for each game can be found in Figure 3 in the Appendix. In addition to A2C, we also compared MAC against TRPO (results from a single trial) [Schulman et al.2015], and ES (results averaged over 30 trials) [Salimans et al.2017], and found that MAC performed competitively with all three benchmark algorithms.

Note that MAC’s performance on Pong and Q*bert was low relative to A2C. For Pong this was due to one of the five MAC trials obtaining a final score of -20.1 and pulling the average performance down significantly. The individual Pong scores for MAC were {20.5, 19.7, 18.3, 14.7, -20.1}; the scores for A2C were {19.4, 19.4, 19.3, 16.3, 15.6}. For Q*bert, the performance for both algorithms was much more variable. A2C scored 0.0 on 3 out of 5 trials, and MAC scored 0.0 on 2 out of 5 trials. The reason A2C’s average score is so much higher than MAC’s is that it had one lucky trial where it scored 7780.9 points. The individual Q*bert scores for MAC were {557.4, 504.7, 155.1, 0.0, 0.0}; the scores for A2C were {7780.9, 476.6, 0.0, 0.0, 0.0}. Additional hyperparameter tuning might lead to improved performance; however, the purpose of this Atari experiment was mainly to show that MAC is competitive with state-of-the-art policy search algorithms, and these results seem to indicate that it is.

6 Discussion

At its core, MAC offers a new way of computing the policy gradient that can substantially reduce variance and increase learning speed. There are a number of orthogonal improvements to policy gradient methods, such as using natural gradients [Kakade2002, Peters and Schaal2008], off-policy learning [Wang et al.2016, Gu et al.2016, Asadi and Williams2016], second-order methods [Furmston, Lever, and Barber2016], and asynchronous exploration [Mnih et al.2016]. We have not investigated how MAC performs with these extensions; however, just as these improvements were added to basic actor-critic methods, they could be added to MAC as well, and we expect they would improve its performance in a similar way.

A typical use-case for actor-critic algorithms is for problem domains with continuous actions, which are awkward for value-function-based methods [Sutton and Barto1998]. One approach to dealing with continuous actions is Deterministic Policy Gradients (DPG) [Silver et al.2014, Lillicrap et al.2015], which uses a deterministic policy to perform off-policy policy gradient updates. However, in settings where on-policy learning is necessary, using a deterministic policy leads to sub-optimal behavior [Sutton and Barto1998], and hence a stochastic policy is typically used instead. The recently-introduced Expected Policy Gradients (EPG) [Ciosek and Whiteson2017] addresses this problem by generalizing DPG for stochastic policies. However, while EPG has good experimental performance on domains with continuous action spaces, the authors do not provide experimental results for discrete domains. MAC’s discrete results and EPG’s continuous results are in some sense complementary.

7 Conclusion

Figure 3:

Learning curves on six Atari games for A2C (blue) and MAC (orange). Vertical axis is score; horizontal axis is number of training frames (in millions). Results are averaged over 5 independent trials, and smoothed slightly for readability. Error bars represent standard deviation.

The basic formulation of policy gradient estimators presented here—where the gradient is estimated by averaging the state-action value function across actions—leads to a new family of actor-critic algorithms. This family has the advantage of not requiring an additional variance-reduction baseline, substantially reducing the design effort required to apply them. It is also a natural fit with deep neural network function approximators, resulting in a network architecture that is identical to some sampled-action actor-critic algorithms, but with less variance.

We prove that for stochastic policies, the MAC algorithm (the simplest member of the resulting family), reduces variance relative to traditional actor-critic approaches, while maintaining the same bias. Our neural network implementation of MAC either outperforms, or is competitive with, state-of-the-art policy search algorithms, and our experimental results show that MAC’s lower variance lead to dramatically faster training in some cases. In future work, we aim to develop this family of algorithms further by including typical elaborations of the basic actor-critic architecture like natural or second-order gradients. Our results so far suggest that our new approach is highly promising, and that extensions to it will provide even further improvement in performance.


8 Appendix

8.1 A. Proof of Theorem 1

Both AC and MAC are estimators of the true policy gradient (PG). Given a batch of data , we can write the bias of AC and MAC as:


For clarity, we will rewrite the AC and MAC expectations (1) and (4) to explicitly denote the way that each algorithm estimates the policy gradient, given a batch of data (with size ):

AC (8)
MAC (9)

Substituting (8) and (1) into Eqn. (6) gives:


Since is sampled from trajectories that were carried out according to the policy, we can drop the dependence on inside the expectation, and rewrite (10) as follows:


Now we turn our attention to MAC, and substitute (9) and (1) into Eqn. (7), to obtain:


Comparing (13) and (17), we see that AC and MAC have the same bias.

8.2 B. Proof of Theorem 2

For any random variable

, the variance can be written as:

If we assume the estimated Q-values for MAC and AC are the same in expectation, then the squared expectation’s contribution to the variance of each algorithm will be equal. We are only interested in determining which estimator has lower variance, so we can drop the second term and simply compare

, the second moments.

Again we will employ the explicit definitions of the AC and MAC estimators, for a data set , given by (8) and (9), respectively.

For ease of notation, we define the following two functions:



represents a single parameter of the parameter vector

. We consider an arbitrary choice of , so the following proof holds for all .

The above expressions allow us to rewrite the AC and MAC estimators (Eqn. 8 & 9) in terms of and :


For convenience, we drop the subscript for the rest of this analysis.

Now we are ready to compare vs. .

By the assumption that is independent of for , we can distribute the expectation through in line 3 on the left, to obtain . In the last line, we can drop the second term in each expression, because they are the same. At this point we just need to compare vs. . In order to make this comparison, we make use of Jensen’s Inequality [Jensen1906], which says that for a convex function and a vector :

We note that is convex, and as such, the following holds:

Thus, we can conclude that . Moreover, since is strictly convex, this inequality is strict as long as is not almost surely constant for a given state. That means for deterministic policies, we have , and for stochastic policies, .