1 Introduction
In reinforcement learning (RL), two important classes of algorithms are value-function-based methods and policy search methods. Value-function-based methods maintain an estimate of the value of performing each action in each state, and choose the actions associated with the most value in their current state [Sutton and Barto 1998]. By contrast, policy search algorithms maintain an explicit policy, and agents draw actions directly from that policy to interact with their environment [Sutton et al. 2000]. A subset of policy search algorithms, policy gradient methods, represent the policy using a differentiable parameterized function approximator (for example, a neural network) and use stochastic gradient ascent to update its parameters to achieve more reward.
To facilitate gradient ascent, the agent interacts with its environment according to the current policy and keeps track of the outcomes of its actions. From these (potentially noisy) sampled outcomes, the agent estimates the gradient of the objective function. A critical question is how to compute an accurate gradient from these samples, which may be costly to acquire, while using as few interactions with the environment as possible.
Actor-critic algorithms compute the policy gradient using a learned value function to estimate expected future reward [Sutton et al. 2000; Konda and Tsitsiklis 2000]. Since the expected reward is a function of the environment's dynamics, which the agent does not know, it is typically estimated by executing the policy in the environment. Existing algorithms compute the policy gradient using the value of states the agent visits, and critically, these methods take into account only the actions the agent actually executes during environmental interaction.
We propose a new policy gradient algorithm, Mean Actor-Critic (or MAC), for the discrete-action, continuous-state case. MAC uses the agent's policy distribution to average the value function over all actions, rather than using the action-values of only the sampled actions. We prove that, under modest assumptions, this approach reduces variance in the policy gradient estimates relative to traditional actor-critic approaches. We implement MAC using deep neural networks, and we show empirical results on two control domains and six Atari games, where MAC is competitive with state-of-the-art policy search methods.
We note that the core idea behind MAC has also been independently and concurrently explored by Ciosek and Whiteson (2017). However, their results mainly focus on continuous action spaces and are more theoretical. We introduce a simpler proof of variance reduction that makes fewer assumptions, and we also show that the algorithm works well in discrete-action domains.
2 Background
In RL, we train an agent to select actions in its environment so that it maximizes some notion of long-term reward. We formalize the problem as a Markov decision process (MDP) [Puterman 1990], which we specify by the tuple $\langle \mathcal{S}, s_0, \mathcal{A}, R, T, \gamma \rangle$, where $\mathcal{S}$ is a set of states, $s_0$ is a fixed initial state, $\mathcal{A}$ is a set of discrete actions, the functions $R$ and $T$ respectively describe the reward and transition dynamics of the environment, and $\gamma \in [0, 1)$ is a discount factor representing the relative importance of immediate versus long-term rewards. More concretely, we denote the expected reward for performing action $a$ in state $s$ as:
$$R(s, a) = \mathbb{E}\big[\, r_t \mid s_t = s,\, a_t = a \,\big],$$
and we denote the probability that performing action $a$ in state $s$ results in state $s'$ as:
$$T(s, a, s') = \Pr\big(s_{t+1} = s' \mid s_t = s,\, a_t = a\big).$$
In the context of policy search methods, the agent maintains an explicit policy $\pi_\theta(a \mid s)$, denoting the probability of taking action $a$ in state $s$ under the policy parameterized by $\theta$. Note that for each state, the policy outputs a probability distribution over the discrete set of actions: $\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s) = 1$. At each timestep $t$, the agent takes an action $a_t$ drawn from its policy $\pi_\theta(\cdot \mid s_t)$; the environment then provides a reward signal $r_t$ and transitions to the next state $s_{t+1}$. The agent's goal at every timestep is to maximize the sum of discounted future rewards, or simply the return, which we define as:
$$G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}.$$
In a slight abuse of notation, we will also denote the total return for a trajectory $\tau$ as $G(\tau)$, which is equal to $G_0$ for that same trajectory.
The agent's policy induces a value function over the state space. The expression for return allows us to define both a state value function, $V^\pi(s)$, and a state-action value function, $Q^\pi(s, a)$. Here, $V^\pi(s)$ represents the expected return starting from state $s$ and following the policy $\pi$ thereafter, and $Q^\pi(s, a)$ represents the expected return starting from $s$, executing action $a$, and then following the policy $\pi$ thereafter:
$$V^\pi(s) = \mathbb{E}\big[G_t \mid s_t = s;\, \pi\big], \qquad Q^\pi(s, a) = \mathbb{E}\big[G_t \mid s_t = s,\, a_t = a;\, \pi\big].$$
Note that:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a).$$
The agent's goal is to find a policy that maximizes the return for every timestep, so we define an objective function that allows us to score an arbitrary policy parameter $\theta$:
$$J(\theta) = \mathbb{E}\big[G(\tau) \mid \pi_\theta\big] = \sum_\tau \Pr(\tau; \theta)\, G(\tau),$$
where $\tau$ denotes a trajectory. Note that the probability of a specific trajectory depends on the policy parameters as well as the dynamics of the environment. Our goal is to be able to compute the gradient of $J$ with respect to the policy parameters $\theta$:
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \sum_\tau \Pr(\tau; \theta)\, G(\tau) \\
&= \sum_\tau \Pr(\tau; \theta)\, \nabla_\theta \log \Pr(\tau; \theta)\, G(\tau) \\
&= \sum_\tau \Pr(\tau; \theta) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(\tau) \\
&= \mathbb{E}_\tau\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big] \\
&= \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\big], \qquad (1)
\end{aligned}
$$
where $\rho^\pi$ is the discounted state distribution. In the second and third lines we rewrite the gradient term using the score function. In the fourth line, we convert the summation to an expectation, and use the return $G_t$ in place of $G(\tau)$. Here we make use of the fact that $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k < t} \gamma^k r_k\big] = 0$, given by Williams (1992). Intuitively this makes sense, since the policy for a given state should depend only on the rewards achieved after that state. Finally, we invoke the definition $Q^\pi(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$.
A nice property of expectation (1) is that, given access to $Q^\pi$, the expectation can be estimated by executing the policy $\pi_\theta$ in the environment. Alternatively, we can estimate $Q^\pi(s_t, a_t)$ using the return $G_t$, which is an unbiased (though usually high-variance) sample of $Q^\pi(s_t, a_t)$. This is essentially the idea behind the REINFORCE algorithm [Williams 1992], which uses the following gradient estimator:
$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t. \qquad (2)$$
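To make estimator (2) concrete, the following is a minimal sketch of a single-episode REINFORCE update in PyTorch. It is illustrative only: the network shape, optimizer, and learning rate are assumptions for this sketch, not the configuration used in the experiments below.

```python
import torch
import torch.nn as nn

# Hypothetical policy network: 4-dimensional states, 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                       nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    """One episode: theta <- theta + alpha * sum_t grad log pi(a_t|s_t) * G_t.

    states: list of 1-D float tensors; actions: list of ints; rewards: list of floats.
    """
    # Returns G_t = sum_k gamma^k r_{t+k}, computed by a backwards scan.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    probs = policy(torch.stack(states))                       # (T, |A|)
    idx = torch.arange(len(actions))
    log_probs = torch.log(probs[idx, torch.tensor(actions)])  # log pi(a_t|s_t)
    loss = -(log_probs * returns).sum()  # minimizing this ascends estimator (2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```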
Alternatively, we can estimate $Q^\pi$ using some sort of function approximation, $\hat{Q}(s, a) \approx Q^\pi(s, a)$, which results in variants of actor-critic algorithms. Perhaps the simplest actor-critic algorithm approximates (1) as follows:
$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}(s_t, a_t). \qquad (3)$$
Note that value function approximation can, in general, bias the gradient estimation [Baxter and Bartlett 2001].
One way of reducing variance in both REINFORCE and actor-critic algorithms is to use an additive control variate as a baseline [Williams 1992; Sutton et al. 2000; Greensmith, Bartlett, and Baxter 2004]. The baseline is typically a function that does not depend on the action, so subtracting it from either the sampled returns or the estimated Q-values does not bias the gradient estimation. We refer to techniques that use such a baseline as advantage variations of the basic algorithms, since they approximate the advantage of choosing action $a$ over some baseline representing "typical" performance for the policy in state $s$ [Baird 1994]. The update performed by advantage REINFORCE is:
$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, (G_t - b),$$
where $b$ is a scalar baseline measuring the performance of the policy, such as a running average of the observed return over the past few episodes of interaction.
Advantage actor-critic uses an approximation of the expected value of each state as its baseline, $\hat{V}(s_t) \approx V^\pi(s_t)$, which leads to the following update rule:
$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(\hat{Q}(s_t, a_t) - \hat{V}(s_t)\big).$$
Another way of estimating the advantage function is to use the TD-error signal $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$. This approach is convenient, because it only requires estimating one set of parameters, namely those of $\hat{V}$. However, because the TD-error is a sample of the advantage function $A(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, this approach has higher variance (due to the environmental dynamics) than methods that explicitly compute $\hat{Q}$. Moreover, given $\hat{Q}$ and $\pi_\theta$, $\hat{V}$ can easily be computed as $\hat{V}(s) = \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\, \hat{Q}(s, a)$, so in practice, it is still only necessary to estimate one set of parameters (those of $\hat{Q}$).
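For instance, if `q_values` holds $\hat{Q}(s, \cdot)$ and `probs` holds $\pi_\theta(\cdot \mid s)$ for a batch of states (hypothetical tensor names, with made-up values for illustration), the value and advantage estimates fall out in two lines:

```python
import torch

probs = torch.tensor([[0.7, 0.2, 0.1]])      # pi(.|s) for a batch of one state
q_values = torch.tensor([[1.0, 5.0, -2.0]])  # Q_hat(s, .) for the same batch

# v_hat(s) = sum_a pi(a|s) q_hat(s,a); advantage(s,a) = q_hat(s,a) - v_hat(s)
v_hat = (probs * q_values).sum(dim=-1, keepdim=True)  # (batch, 1) -> 1.5
advantage = q_values - v_hat                          # (batch, |A|)
```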
3 Mean Actor-Critic
An overwhelming majority of recent actor-critic papers have computed the policy gradient using an estimate similar to Equation (3) [Degris, White, and Sutton 2012; Mnih et al. 2016; Wang et al. 2016]. This estimate samples both states and actions from trajectories executed according to the current policy in order to compute the gradient of the objective function with respect to the policy weights.
Instead of using only the sampled actions, Mean Actor-Critic (MAC) explicitly computes the probability-weighted average over all Q-values, for each state sampled from the trajectories. In doing so, MAC is able to produce an estimate of the policy gradient in which the variance due to action sampling is reduced to zero. This is exactly the difference between computing a sample mean (whose variance is inversely proportional to the number of samples) and calculating the mean directly (which is simply a scalar with no variance).
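This difference is easy to see numerically. In the toy NumPy snippet below (purely illustrative, with made-up probabilities and Q-values), the sampled-action estimate of $\sum_a \pi(a \mid s)\, Q(s, a)$ has nonzero variance, while the probability-weighted sum is a single deterministic number:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.2, 0.1])   # policy over 3 actions in some fixed state
q = np.array([1.0, 5.0, -2.0])   # Q-values for that state

sampled = q[rng.choice(3, size=10_000, p=pi)]  # one-sample estimates, AC-style
exact = (pi * q).sum()                         # exact mean, MAC-style

print(sampled.mean(), sampled.var())  # approx 1.5, with variance approx 3.85
print(exact)                          # exactly 1.5; no sampling variance
```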
MAC is based on the observation that expectation (1), which we repeat here, can be rewritten in the following way:
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\big] \\
&= \mathbb{E}_{s \sim \rho^\pi}\Big[\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\Big] \\
&= \mathbb{E}_{s \sim \rho^\pi}\Big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\Big]. \qquad (4)
\end{aligned}
$$
We can estimate (4) by sampling states from a trajectory and using function approximation:
$$\nabla_\theta J(\theta) \approx \sum_t \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_t)\, \hat{Q}(s_t, a).$$
In our implementation, the inner summation is computed by combining two neural networks that represent the policy and state-action value function. The value function can be learned using a variety of methods, such as temporal-difference learning or Monte Carlo sampling. After performing a few updates to the value function, we update the parameters of the policy with the following update rule:
$$\theta \leftarrow \theta + \alpha \sum_t \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_t)\, \hat{Q}(s_t, a). \qquad (5)$$
To improve stability, repeated updates to the value and policy networks are interleaved, as in Generalized Policy Iteration [Sutton and Barto 1998].
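A minimal sketch of update (5) in PyTorch follows. The network sizes, optimizer, and learning rate are our own assumptions for this sketch, not the paper's released code:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
# Two independent networks, as in the classic-control experiments below.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, n_actions), nn.Softmax(dim=-1))
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def mac_policy_update(states):
    """MAC step: theta <- theta + alpha * sum_t sum_a grad pi(a|s_t) Q_hat(s_t, a)."""
    probs = policy(states)             # (T, |A|): pi(a|s_t) for every action
    q_values = q_net(states).detach()  # critic held fixed during the policy step
    loss = -(probs * q_values).sum()   # sums over ALL actions, sampled or not
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```

Note that no sampled actions appear in the loss: differentiating `(probs * q_values).sum()` produces exactly $\sum_t \sum_a \nabla_\theta \pi_\theta(a \mid s_t)\, \hat{Q}(s_t, a)$.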
In traditional actor-critic approaches, which we refer to as sampled-action actor-critic, the only actions involved in the computation of the policy gradient estimate are those that were actually executed in the environment. In MAC, computing the policy gradient estimate will frequently involve actions that were not actually executed in the environment. This results in a trade-off between bias and variance. In domains where we can expect accurate Q-value predictions from our function approximator, despite not actually executing all of the relevant state-action pairs, MAC results in lower-variance gradient updates and increased sample efficiency. In domains where this assumption is not valid, MAC may perform worse than sampled-action actor-critic due to increased bias.
In some ways, MAC is similar to Expected Sarsa [Van Seijen et al. 2009]. Expected Sarsa considers all next-actions $a'$, then computes the expected TD-error, $\delta_t = r_t + \gamma \sum_{a' \in \mathcal{A}} \pi_\theta(a' \mid s_{t+1})\, \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t)$, and uses the resulting error signal to update the $\hat{Q}$ function. By contrast, MAC considers all current-actions $a$, and uses the corresponding $\hat{Q}$ values to update the policy directly.
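The symmetry is easiest to see side by side. A self-contained sketch with hypothetical one-state tensors (the numbers are arbitrary):

```python
import torch

gamma = 0.99
r = torch.tensor([1.0])                      # reward for one transition
q = torch.tensor([[1.0, 5.0, -2.0]])         # Q_hat(s, .)
q_next = torch.tensor([[0.5, 2.0, 0.0]])     # Q_hat(s', .)
probs = torch.tensor([[0.7, 0.2, 0.1]])      # pi(.|s)
probs_next = torch.tensor([[0.3, 0.3, 0.4]]) # pi(.|s')

# Expected Sarsa: average over NEXT actions to form a low-variance critic target.
td_target = r + gamma * (probs_next * q_next).sum(dim=-1)

# MAC: average over CURRENT actions to form a low-variance actor objective.
mac_objective = (probs * q).sum(dim=-1)
```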
It is natural to consider whether MAC could be improved by subtracting an action-independent baseline, as in sampled-action actor-critic and REINFORCE:
$$\nabla_\theta J(\theta) \approx \sum_t \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_t)\, \big(\hat{Q}(s_t, a) - b(s_t)\big).$$
However, we can simplify the baseline term as follows:
$$\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_t)\, b(s_t) = b(s_t)\, \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s_t) = b(s_t)\, \nabla_\theta\, 1 = 0.$$
In doing so, we see that both $b(s_t)$ and the gradient operator can be moved outside of the summation, leaving just the sum of the action probabilities, which is always 1, and hence the gradient of the baseline term is always zero. This is true regardless of the choice of baseline, since the baseline cannot be a function of the actions or else it would bias the expectation. Thus, we see that subtracting a baseline is unnecessary in MAC, since it has no effect on the policy gradient estimate.
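This cancellation can also be checked mechanically with autograd. In the toy snippet below (an illustration, not from the paper), the gradient of the baseline term with respect to the policy parameters comes out (numerically) zero:

```python
import torch

logits = torch.zeros(3, requires_grad=True)  # policy parameters for one state
probs = torch.softmax(logits, dim=0)         # pi(.|s), sums to 1
baseline = 7.0                               # any action-independent constant

(probs * baseline).sum().backward()  # grad of b * sum_a pi(a|s) = grad of b
print(logits.grad)                   # all (numerically) zero
```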
4 Analysis of Bias and Variance
In this section we prove that MAC does not increase variance over sampled-action actor-critic (AC), and also that, given a fixed $\hat{Q}$, both algorithms have the same bias. We start with the bias result.
4.1 Theorem 1
If the estimated Q-values, $\hat{Q}(s, a)$, for both MAC and AC are the same in expectation, then the bias of MAC is equal to the bias of AC.
4.2 Proof
See Appendix A.
This result makes sense because in expectation, AC will choose all of the possible actions with some probability according to the policy. MAC simply calculates this expectation over actions explicitly. We now move to the variance result.
4.3 Theorem 2
If the estimated Q-values, $\hat{Q}(s, a)$, for both MAC and AC are the same in expectation, and if $\hat{Q}(s_i, a_i)$ is independent of $\hat{Q}(s_j, a_j)$ for $i \neq j$, then $\mathrm{Var}(\widehat{\mathrm{MAC}}) \leq \mathrm{Var}(\widehat{\mathrm{AC}})$. For deterministic policies, there is equality, and for stochastic policies the inequality is strict.
4.4 Proof
See Appendix B.
Intuitively, we can see that for cases where the policy is deterministic, MAC's formulation of the policy gradient is exactly equivalent to AC's, and hence we can do no better than AC. For high-entropy policies, MAC will beat AC in terms of variance.
5 Experiments
This section presents an empirical evaluation of MAC across three different problem domains. We first evaluate the performance of MAC versus popular policy gradient benchmarks on two classic control problems. We then evaluate MAC on a subset of Atari 2600 games and investigate its performance compared to state-of-the-art policy search methods.
5.1 Classic Control Experiments
In order to determine whether MAC's lower-variance policy gradient estimate translates to faster learning, we chose two classic control problems, namely Cart Pole and Lunar Lander, and compared MAC's performance against four standard sampled-action policy gradient algorithms. We used the open-source implementations of Cart Pole and Lunar Lander provided by OpenAI Gym [Brockman et al. 2016], in which both domains have continuous state spaces and discrete action spaces. Screenshots of the two domains are provided in Figure 1.
Algorithm          | Cart Pole | Lunar Lander
REINFORCE          |           |
Adv. REINFORCE     |           |
Actor-Critic       |           |
Adv. Actor-Critic  |           |
MAC                |           |
For each problem domain, we implemented MAC using two independent neural networks, representing the policy and Q function. We then performed a hyperparameter search to determine the best network architectures, optimization method, and learning rates. Specifically, the hyperparameter search considered: 0, 1, 2, or 3 hidden layers; 50, 75, 100, or 300 neurons per layer; ReLU, Leaky ReLU (with leak factor 0.3), or tanh activation; SGD, RMSProp, Adam, or Adadelta as the optimization method; and a learning rate chosen from 0.0001, 0.00025, 0.0005, 0.001, 0.005, 0.01, or 0.05. To find the best setting, we ran 10 independent trials for each combination of hyperparameters and chose the setting with the best asymptotic performance over the 10 trials. We terminated each episode after 200 and 1000 timesteps (in Cart Pole and Lunar Lander, respectively), regardless of the state of the agent.
We compared MAC against four standard benchmarks: REINFORCE, advantage REINFORCE, actor-critic, and advantage actor-critic. We implemented the REINFORCE benchmarks using just a single neural network to represent the policy, and we implemented the actor-critic benchmarks using two networks to represent both the policy and the Q function. For each benchmark algorithm, we then performed the same hyperparameter search that we had used for MAC.
In order to keep the variance as low as possible for the advantage actor-critic benchmark, we explicitly computed the advantage function $\hat{A}(s_t, a_t) = \hat{Q}(s_t, a_t) - \hat{V}(s_t)$, where $\hat{V}(s_t) = \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s_t)\, \hat{Q}(s_t, a)$, rather than sampling it using the TD-error (see Section 2).
Once we had determined the best hyperparameter settings for MAC and each of the benchmark algorithms, we then ran each algorithm for 100 independent trials. Figure 2 shows learning curves for the different algorithms, and Table 1 summarizes the results using the mean performance over trials and episodes. On Cart Pole, MAC learns substantially faster than all of the benchmarks, and on Lunar Lander, it performs competitively with the best benchmark algorithm, advantage actor-critic.
5.2 Atari Experiments
To test whether MAC can scale to larger problem domains, we evaluated it on several Atari 2600 games using the Arcade Learning Environment (ALE) [Bellemare et al. 2013] and compared MAC's performance against that of state-of-the-art policy search methods, namely Trust Region Policy Optimization (TRPO) [Schulman et al. 2015], Evolution Strategies (ES) [Salimans et al. 2017], and Advantage Actor-Critic (A2C) [Wu et al. 2017]. Due to the computational load inherent in training deep networks to play Atari games, we limited our experiments to a subset of six Atari games: Beam Rider, Breakout, Pong, Q*bert, Seaquest, and Space Invaders. These six games are commonly selected for tuning hyperparameters [Mnih et al. 2015; Mnih et al. 2016; Wu et al. 2017], and thus provide a fair comparison against established benchmarks, despite our limited computational resources.
The MAC network architecture was derived from the OpenAI Baselines implementation of A2C [Wu et al. 2017]. It uses three convolutional layers (size/stride/filters: 8/4/32, 4/2/64, 3/1/64), followed by a fully-connected layer (size 512), all with ReLU activation. A final fully-connected layer is split into two batches of N outputs each, where N is the number of actions. One batch uses a linear activation and corresponds to the Q-values; the other batch uses a softmax activation and corresponds to the policy. We used this architecture for both the MAC results and the A2C results. The TRPO and ES results are taken from their respective papers.
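A sketch of this architecture in PyTorch is given below. The input of 4 stacked 84x84 grayscale frames is the conventional Atari preprocessing, which we assume here; the single split output layer described above is written as two separate linear heads, which is equivalent.

```python
import torch
import torch.nn as nn

class MACAtariNet(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 8/4/32
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 4/2/64
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 3/1/64
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # FC 512
        )
        self.q_head = nn.Linear(512, n_actions)   # linear head: Q-values
        self.pi_head = nn.Linear(512, n_actions)  # softmax head: policy

    def forward(self, frames):                    # frames: (B, 4, 84, 84)
        z = self.body(frames / 255.0)
        return self.q_head(z), torch.softmax(self.pi_head(z), dim=-1)
```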
Game           | Random | TRPO   | ES     | A2C    | MAC
Beam Rider     |  363.9 | 1425.2 |  744.0 | 5846.0 | 6072.0
Breakout       |    1.7 |   10.8 |    9.5 |  370.9 |  372.7
Pong           |  -20.7 |   20.9 |   21.0 |   18.0 |   10.6
Q*bert         |  183.0 | 1973.5 |  147.5 | 1651.5 |  243.4
Seaquest       |   68.4 | 1908.6 | 1390.0 | 1702.5 | 1703.4
Space Invaders |  148.0 |  568.4 |  678.5 | 1201.2 | 1173.1
We trained the network using a variation of the multi-part loss function used in A2C [Wu et al. 2017]. The value loss at each timestep was equal to the mean squared error between the observed reward and the Q-value of the selected action. The policy entropy loss was simply the negative entropy of the policy at each timestep. For the A2C experiments, the policy improvement loss was the negative log probability of the selected action times its advantage value. For the MAC experiments, the policy improvement loss became the negative sum of action probabilities times their associated Q-values. The overall loss function was a linear combination of the policy improvement loss (coefficient 0.1), policy entropy loss (coefficient 0.001), and value loss (coefficient 0.5), and the network was trained using RMSProp with a learning rate of 1.5e-3. These coefficients trade off the importance of learning good Q-values, improving the policy, and preventing the policy from converging prematurely. This configuration of hyperparameters was found to perform well experimentally for both methods after a small hyperparameter search. The only difference between the A2C and MAC implementations was to replace A2C's sampled-action policy improvement loss with MAC's sum-over-actions loss; the algorithms used exactly the same architecture and hyperparameters.

For A2C and MAC, we trained a network for each game on 50 million frames of play, across 16 parallel threads, pausing every 200K frames to evaluate performance and compute learning curves. In each evaluation, we ran 16 agents in parallel, for 4500 frames (5 minutes) each, or 50 total episodes, whichever came first, and averaged the scores of the completed (or timed-out) episodes. Agents were trained and evaluated under the typical random-start condition, where the game is initialized with a random number of no-op ALE actions (between 0 and 30) [Mnih et al. 2015]. The A2C and MAC results in Table 2 come from the final evaluation after all 50M frames, and they are averaged across 5 trials involving separately trained networks. Learning curves for each game can be found in Figure 3 in the Appendix. In addition to A2C, we also compared MAC against TRPO (results from a single trial) [Schulman et al. 2015] and ES (results averaged over 30 trials) [Salimans et al. 2017], and found that MAC performed competitively with all three benchmark algorithms.
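For concreteness, a sketch of the combined loss described above, for the MAC variant, follows. The coefficients are those quoted in the text; the tensor names and the exact construction of the return target `returns` are our assumptions, with only the loss structure taken from the description.

```python
import torch

def mac_loss(q_values, probs, actions, returns):
    """Total loss = 0.1 * policy improvement + 0.001 * entropy + 0.5 * value."""
    # Value loss: MSE between the observed return and Q of the selected action.
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    value_loss = ((returns - q_taken) ** 2).mean()

    # Entropy loss: negative entropy of the policy (minimized, so entropy rises).
    entropy_loss = (probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()

    # MAC policy improvement loss: negative sum over all actions of pi * Q.
    policy_loss = -(probs * q_values.detach()).sum(dim=-1).mean()

    return 0.1 * policy_loss + 0.001 * entropy_loss + 0.5 * value_loss
```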
Note that MAC's performance on Pong and Q*bert was low relative to A2C. For Pong, this was due to one of the five MAC trials obtaining a final score of -20.1 and pulling the average performance down significantly. The individual Pong scores for MAC were {20.5, 19.7, 18.3, 14.7, -20.1}; the scores for A2C were {19.4, 19.4, 19.3, 16.3, 15.6}. For Q*bert, the performance for both algorithms was much more variable. A2C scored 0.0 on 3 out of 5 trials, and MAC scored 0.0 on 2 out of 5 trials. The reason A2C's average score is so much higher than MAC's is that it had one lucky trial in which it scored 7780.9 points. The individual Q*bert scores for MAC were {557.4, 504.7, 155.1, 0.0, 0.0}; the scores for A2C were {7780.9, 476.6, 0.0, 0.0, 0.0}. Additional hyperparameter tuning might lead to improved performance; however, the purpose of this Atari experiment was mainly to show that MAC is competitive with state-of-the-art policy search algorithms, and these results seem to indicate that it is.
6 Discussion
At its core, MAC offers a new way of computing the policy gradient that can substantially reduce variance and increase learning speed. There are a number of orthogonal improvements to policy gradient methods, such as using natural gradients [Kakade 2002; Peters and Schaal 2008], off-policy learning [Wang et al. 2016; Gu et al. 2016; Asadi and Williams 2016], second-order methods [Furmston, Lever, and Barber 2016], and asynchronous exploration [Mnih et al. 2016]. We have not investigated how MAC performs with these extensions; however, just as these improvements were added to basic actor-critic methods, they could be added to MAC as well, and we expect they would improve its performance in a similar way.
A typical use case for actor-critic algorithms is problem domains with continuous actions, which are awkward for value-function-based methods [Sutton and Barto 1998]. One approach to dealing with continuous actions is Deterministic Policy Gradients (DPG) [Silver et al. 2014; Lillicrap et al. 2015], which uses a deterministic policy to perform off-policy policy gradient updates. However, in settings where on-policy learning is necessary, using a deterministic policy leads to suboptimal behavior [Sutton and Barto 1998], and hence a stochastic policy is typically used instead. The recently introduced Expected Policy Gradients (EPG) [Ciosek and Whiteson 2017] addresses this problem by generalizing DPG to stochastic policies. However, while EPG has good experimental performance on domains with continuous action spaces, the authors do not provide experimental results for discrete domains. MAC's discrete results and EPG's continuous results are in some sense complementary.
7 Conclusion
The basic formulation of policy gradient estimators presented here, in which the gradient is estimated by averaging the state-action value function across actions, leads to a new family of actor-critic algorithms. This family has the advantage of not requiring an additional variance-reduction baseline, substantially reducing the design effort required to apply its members. It is also a natural fit with deep neural network function approximators, resulting in a network architecture that is identical to that of some sampled-action actor-critic algorithms, but with less variance.
We prove that for stochastic policies, the MAC algorithm (the simplest member of the resulting family) reduces variance relative to traditional actor-critic approaches, while maintaining the same bias. Our neural network implementation of MAC either outperforms, or is competitive with, state-of-the-art policy search algorithms, and our experimental results show that MAC's lower variance leads to dramatically faster training in some cases. In future work, we aim to develop this family of algorithms further by including typical elaborations of the basic actor-critic architecture, such as natural or second-order gradients. Our results so far suggest that our new approach is highly promising, and that extensions to it will provide even further improvements in performance.
References
[Asadi and Williams 2016] Asadi, K., and Williams, J. D. 2016. Sample-efficient deep reinforcement learning for dialog control. arXiv preprint arXiv:1612.06000.
[Baird 1994] Baird, L. C. 1994. Reinforcement learning in continuous time: Advantage updating. In Proceedings of the 1994 IEEE International Conference on Neural Networks (IEEE World Congress on Computational Intelligence), volume 4, 2448–2453. IEEE.
[Baxter and Bartlett 2001] Baxter, J., and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15:319–350.
[Bellemare et al. 2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.
[Brockman et al. 2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. CoRR abs/1606.01540.
[Ciosek and Whiteson 2017] Ciosek, K., and Whiteson, S. 2017. Expected policy gradients. arXiv preprint arXiv:1706.05374.
[Degris, White, and Sutton 2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
[Furmston, Lever, and Barber 2016] Furmston, T.; Lever, G.; and Barber, D. 2016. Approximate Newton methods for policy search in Markov decision processes. Journal of Machine Learning Research 17(227):1–51.
[Greensmith, Bartlett, and Baxter 2004] Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5(Nov):1471–1530.
[Gu et al. 2016] Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.
[Jensen 1906] Jensen, J. L. W. V. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30(1):175–193.
[Kakade 2002] Kakade, S. M. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, 1531–1538.
[Konda and Tsitsiklis 2000] Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In Advances in Neural Information Processing Systems, 1008–1014.
[Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
[Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
[Peters and Schaal 2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180–1190.
[Puterman 1990] Puterman, M. L. 1990. Markov decision processes. Handbooks in Operations Research and Management Science 2:331–434.
[Salimans et al. 2017] Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
[Schulman et al. 2015] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1889–1897.
[Silver et al. 2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 387–395.
[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
[Sutton et al. 2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
[Van Seijen et al. 2009] Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected Sarsa. In Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 177–184. IEEE.
[Wang et al. 2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
[Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
[Wu et al. 2017] Wu, Y.; Mansimov, E.; Grosse, R. B.; Liao, S.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, 5285–5294.
8 Appendix
8.1 A. Proof of Theorem 1
Both AC and MAC are estimators of the true policy gradient $\nabla_\theta J(\theta)$. Given a batch of data $\mathcal{D}$, we can write the bias of AC and MAC as:
$$\mathrm{Bias}(\widehat{\mathrm{AC}}) = \mathbb{E}_{\mathcal{D}}\big[\widehat{\mathrm{AC}}(\mathcal{D})\big] - \nabla_\theta J(\theta), \qquad (6)$$
$$\mathrm{Bias}(\widehat{\mathrm{MAC}}) = \mathbb{E}_{\mathcal{D}}\big[\widehat{\mathrm{MAC}}(\mathcal{D})\big] - \nabla_\theta J(\theta). \qquad (7)$$
For clarity, we will rewrite the AC and MAC expectations (1) and (4) to explicitly denote the way that each algorithm estimates the policy gradient, given a batch of data $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$ (with size $N$):
$$\widehat{\mathrm{AC}}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{Q}(s_i, a_i), \qquad (8)$$
$$\widehat{\mathrm{MAC}}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_i)\, \hat{Q}(s_i, a). \qquad (9)$$
Substituting (8) and (1) into Eqn. (6) gives:
$$\mathrm{Bias}(\widehat{\mathrm{AC}}) = \mathbb{E}_{\mathcal{D}}\Big[\frac{1}{N} \sum_{i} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{Q}(s_i, a_i)\Big] - \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\big]. \qquad (10)$$
Since $\mathcal{D}$ is sampled from trajectories that were carried out according to the policy $\pi_\theta$, we can drop the dependence on $\mathcal{D}$ inside the expectation, and rewrite (10) as follows:
$$
\begin{aligned}
\mathrm{Bias}(\widehat{\mathrm{AC}}) &= \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \mathbb{E}[\hat{Q}(s, a)]\big] - \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\big] \qquad (11) \\
&= \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(\mathbb{E}[\hat{Q}(s, a)] - Q^\pi(s, a)\big)\Big] \qquad (12) \\
&= \sum_s \rho^\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, \big(\mathbb{E}[\hat{Q}(s, a)] - Q^\pi(s, a)\big). \qquad (13)
\end{aligned}
$$
Now we turn our attention to MAC, and substitute (9) and (1) into Eqn. (7), to obtain:
$$
\begin{aligned}
\mathrm{Bias}(\widehat{\mathrm{MAC}}) &= \mathbb{E}_{\mathcal{D}}\Big[\frac{1}{N} \sum_{i} \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_i)\, \hat{Q}(s_i, a)\Big] - \nabla_\theta J(\theta) \qquad (14) \\
&= \mathbb{E}_{s \sim \rho^\pi}\Big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, \mathbb{E}[\hat{Q}(s, a)]\Big] - \mathbb{E}_{s \sim \rho^\pi}\Big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\Big] \qquad (15) \\
&= \mathbb{E}_{s \sim \rho^\pi}\Big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, \big(\mathbb{E}[\hat{Q}(s, a)] - Q^\pi(s, a)\big)\Big] \qquad (16) \\
&= \sum_s \rho^\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, \big(\mathbb{E}[\hat{Q}(s, a)] - Q^\pi(s, a)\big). \qquad (17)
\end{aligned}
$$
Comparing (13) and (17), we see that AC and MAC have the same bias.
8.2 B. Proof of Theorem 2
If we assume the estimated Q-values for MAC and AC are the same in expectation, then the squared expectation's contribution to the variance of each algorithm will be equal. We are only interested in determining which estimator has lower variance, so we can drop the second term and simply compare $\mathbb{E}\big[\widehat{\mathrm{AC}}^2\big]$ vs. $\mathbb{E}\big[\widehat{\mathrm{MAC}}^2\big]$, the second moments.
Again we will employ the explicit definitions of the AC and MAC estimators, for a data set $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$, given by (8) and (9), respectively.
For ease of notation, we define the following two functions:
$$X_i = \nabla_{\theta_k} \log \pi_\theta(a_i \mid s_i)\, \hat{Q}(s_i, a_i), \qquad (18)$$
$$Y_i = \sum_{a \in \mathcal{A}} \nabla_{\theta_k} \pi_\theta(a \mid s_i)\, \hat{Q}(s_i, a). \qquad (19)$$
Here, $\theta_k$ represents a single parameter of the parameter vector $\theta$. We consider an arbitrary choice of $k$, so the following proof holds for all $k$. The above expressions allow us to rewrite the AC and MAC estimators (Eqn. 8 & 9) in terms of $X_i$ and $Y_i$:
$$\widehat{\mathrm{AC}} = \frac{1}{N} \sum_{i=1}^{N} X_i, \qquad (20)$$
$$\widehat{\mathrm{MAC}} = \frac{1}{N} \sum_{i=1}^{N} Y_i. \qquad (21)$$
For convenience, we drop the subscript $k$ for the rest of this analysis.
Now we are ready to compare $\mathbb{E}\big[\widehat{\mathrm{AC}}^2\big]$ vs. $\mathbb{E}\big[\widehat{\mathrm{MAC}}^2\big]$:
$$
\begin{aligned}
\mathbb{E}\big[\widehat{\mathrm{AC}}^2\big] &= \tfrac{1}{N^2}\, \mathbb{E}\Big[\big(\textstyle\sum_i X_i\big)^2\Big] \\
&= \tfrac{1}{N^2}\Big(\textstyle\sum_i \mathbb{E}[X_i^2] + \sum_{i \neq j} \mathbb{E}[X_i X_j]\Big) \\
&= \tfrac{1}{N^2}\Big(\textstyle\sum_i \mathbb{E}[X_i^2] + \sum_{i \neq j} \mathbb{E}[X_i]\, \mathbb{E}[X_j]\Big),
\end{aligned}
$$
and similarly
$$\mathbb{E}\big[\widehat{\mathrm{MAC}}^2\big] = \tfrac{1}{N^2}\Big(\textstyle\sum_i \mathbb{E}[Y_i^2] + \sum_{i \neq j} \mathbb{E}[Y_i]\, \mathbb{E}[Y_j]\Big).$$
By the assumption that $\hat{Q}(s_i, a_i)$ is independent of $\hat{Q}(s_j, a_j)$ for $i \neq j$, we can distribute the expectation through in the third line, to obtain $\mathbb{E}[X_i]\, \mathbb{E}[X_j]$. In the last line, we can drop the second term in each expression, because they are the same ($\mathbb{E}[X_i] = \mathbb{E}[Y_i]$, since $Y_i = \mathbb{E}_{a \sim \pi_\theta}[X_i \mid s_i]$). At this point we just need to compare $\mathbb{E}[X^2]$ vs. $\mathbb{E}[Y^2]$. In order to make this comparison, we make use of Jensen's inequality [Jensen 1906], which says that for a convex function $\varphi$ and a random variable $Z$:
$$\varphi\big(\mathbb{E}[Z]\big) \leq \mathbb{E}\big[\varphi(Z)\big].$$
We note that $\varphi(x) = x^2$ is convex, and as such, the following holds:
$$Y^2 = \big(\mathbb{E}_{a \sim \pi_\theta}[X \mid s]\big)^2 \leq \mathbb{E}_{a \sim \pi_\theta}\big[X^2 \mid s\big],$$
and taking the expectation over states gives $\mathbb{E}[Y^2] \leq \mathbb{E}[X^2]$. Thus, we can conclude that $\mathrm{Var}(\widehat{\mathrm{MAC}}) \leq \mathrm{Var}(\widehat{\mathrm{AC}})$. Moreover, since $x^2$ is strictly convex, this inequality is strict as long as $X$ is not almost surely constant for a given state. That means for deterministic policies, we have $\mathrm{Var}(\widehat{\mathrm{MAC}}) = \mathrm{Var}(\widehat{\mathrm{AC}})$, and for stochastic policies, $\mathrm{Var}(\widehat{\mathrm{MAC}}) < \mathrm{Var}(\widehat{\mathrm{AC}})$.