1 Introduction
Q-learning (Watkins, 1989)
is one of the most popular reinforcement learning algorithms. One of the reasons for this widespread adoption is the simplicity of the update. On each step, the agent updates its action value estimates towards the observed reward and the estimated value of the maximal action in the next state. This target represents the highest value the agent thinks it could obtain from the current state and action, given the observed reward.
Unfortunately, this simple update rule has been shown to suffer from overestimation bias (Thrun and Schwartz, 1993; van Hasselt, 2010)
. The agent updates with a maximum over action values, and this maximum might be large because an action's value actually is high, or it can be misleadingly high simply because of stochasticity or errors in the estimator. With many actions, there is a higher probability that one of the estimates is large simply due to stochasticity, and the agent will overestimate the value. This issue is particularly problematic under function approximation, and can significantly impede the quality of the learned policy (Thrun and Schwartz, 1993; Szita and Lőrincz, 2008; Strehl et al., 2009) or even lead to failures of Q-learning (Thrun and Schwartz, 1993). More recently, experiments across several domains suggest that this overestimation problem is common (van Hasselt et al., 2016).
Double Q-learning (van Hasselt, 2010) was introduced to instead ensure underestimation bias. The idea is to maintain two unbiased, independent estimators of the action values. The expected action value of estimator one is selected for the maximal action from estimator two, which is guaranteed not to overestimate the true maximum action value. Double DQN (van Hasselt et al., 2016), the extension of this idea to Q-learning with neural networks, has been shown to significantly improve performance over Q-learning. However, this is not a complete answer to the problem, because trading overestimation bias for underestimation bias is not always desirable, as we show in our experiments.
Several other methods have been introduced to reduce overestimation bias, without fully moving towards underestimation. Weighted Double Q-learning (Zhang et al., 2017) uses a weighted combination of the Double Q-learning estimate, which likely has underestimation bias, and the Q-learning estimate, which likely has overestimation bias. Bias-corrected Q-learning (Lee et al., 2013) reduces the overestimation bias through a bias-correction term. Ensemble Q-learning and Averaged Q-learning (Anschel et al., 2017) take averages of multiple action values, to both reduce the overestimation bias and the estimation variance. However, with a finite number of action-value functions, the average operation in these two algorithms never completely removes the overestimation bias, as the average of several overestimation biases is always positive. Further, these strategies give no guidance on how strongly we should correct for overestimation bias, nor on how to determine or control the level of bias.
The overestimation bias also appears in the actor-critic setting (Fujimoto et al., 2018; Haarnoja et al., 2018). For example, Fujimoto et al. (2018) propose the Twin Delayed Deep Deterministic policy gradient algorithm (TD3), which reduces the overestimation bias by taking the minimum value between two critics. However, they do not provide a rigorous theoretical analysis of the effect of applying the minimum operator, and there is no theoretical guide for choosing the number of estimators such that the overestimation bias is reduced to zero.
In this paper, we study the effects of overestimation and underestimation bias on learning performance, and use them to motivate a generalization of Q-learning called Maxmin Q-learning. Maxmin Q-learning directly mitigates the overestimation bias by using a minimization over multiple action-value estimates. Moreover, it can control the estimation bias, varying it from positive to negative, which helps improve learning efficiency, as we show in the following sections. We prove that, with an appropriate number of action-value estimators, we obtain an unbiased estimator with lower approximation variance than Q-learning. We empirically verify our claims on several benchmarks. We study the convergence properties of our algorithm within a novel Generalized Q-learning framework, which is suitable for studying several of the recently proposed Q-learning variants. We also combine deep neural networks with Maxmin Q-learning (Maxmin DQN) and demonstrate its effectiveness in several benchmark domains.
2 Problem Setting
We formalize the problem as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability function, $R$ is the reward mapping, and $\gamma$ is the discount factor. At each time step $t$, the agent observes a state $S_t \in \mathcal{S}$, takes an action $A_t \in \mathcal{A}$, transitions to a new state $S_{t+1}$ according to the transition probabilities, and receives a scalar reward $R_{t+1}$. The goal of the agent is to find a policy $\pi: \mathcal{S} \to \mathcal{A}$ that maximizes the expected return starting from some initial state.

Q-learning is an off-policy algorithm which attempts to learn the state-action values $Q^*(s, a)$ for the optimal policy. It tries to solve for

$$Q^*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right].$$
The optimal policy is to act greedily with respect to these action values: from each state $s$, select $a$ from $\arg\max_{a'} Q^*(s, a')$. The update rule for an approximation $Q$, for a sampled transition $(s_t, a_t, r_{t+1}, s_{t+1})$, is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \qquad (1)$$
where $\alpha$ is the step-size. The transition can be generated off-policy, from any behaviour that sufficiently covers the state space. This algorithm is known to converge in the tabular setting (Tsitsiklis, 1994), with some limited results for the function approximation setting (Melo and Ribeiro, 2007).
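To make the update in Equation (1) concrete, the following is a minimal sketch of the tabular Q-learning update in Python; the environment interface and the step-size and discount values are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update (Equation 1).

    Q is a (num_states, num_actions) array of action-value estimates.
    The bootstrap target uses the maximal action value in the next state.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```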
3 Understanding when Overestimation Bias Helps and Hurts
In this section, we briefly discuss the estimation bias issue, and empirically show that both overestimation and underestimation bias may improve learning performance, depending on the environment. This motivates our Maxmin Q-learning algorithm described in the next section, which allows us to flexibly control the estimation bias and reduce the estimation variance.
The overestimation bias occurs because the target $\max_{a'} Q(s_{t+1}, a')$ is used in the Q-learning update. Because $Q$ is an approximation, it is probable that the approximation is higher than the true value for one or more of the actions. The maximum over these estimators, then, is likely to be skewed towards an overestimate. For example, even if the estimates $Q(s_{t+1}, a')$ are unbiased for all $a'$, they will vary due to stochasticity: $Q(s_{t+1}, a') = Q^*(s_{t+1}, a') + e_{a'}$, and for some actions the error $e_{a'}$ will be positive. As a result, $\mathbb{E}\left[\max_{a'} Q(s_{t+1}, a')\right] \ge \max_{a'} \mathbb{E}\left[Q(s_{t+1}, a')\right] = \max_{a'} Q^*(s_{t+1}, a')$.

This overestimation bias, however, may not always be detrimental, and, in some cases, erring towards an underestimation bias can be harmful. Overestimation bias can help encourage exploration for overestimated actions, whereas underestimation bias might discourage exploration. In particular, we expect more overestimation bias in highly stochastic areas of the world; if those highly stochastic areas correspond to high-value regions, then encouraging exploration there might be beneficial. An underestimation bias might actually prevent an agent from learning that a region is high-value. Alternatively, if highly stochastic areas also have low values, overestimation bias might cause an agent to over-explore a low-value region.
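As a quick numerical illustration of this effect (a sketch with made-up numbers, not an experiment from the paper), consider several actions whose true values are all zero but whose estimates carry independent zero-mean noise; each estimate is unbiased, yet the maximum of the estimates is positive in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, num_trials, noise_scale = 8, 100_000, 1.0

# True action values are all 0; estimates are unbiased but noisy.
noisy_estimates = rng.normal(0.0, noise_scale, size=(num_trials, num_actions))

print(np.mean(noisy_estimates))           # ~0: each individual estimate is unbiased
print(np.mean(noisy_estimates.max(axis=1)))  # clearly > 0: the max is biased upward
```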
We show this effect in the simple MDP shown in Figure 1. The start state has only two actions, Left and Right, and there is a deterministic neutral reward for both the Left action and the Right action. The Left action transitions to a second state with eight actions, each of which transitions to a terminal state with a highly stochastic reward whose mean is $\mu$. By selecting $\mu > 0$, the stochastic region becomes high-value, and we expect overestimation bias to help and underestimation bias to hurt. By selecting $\mu < 0$, the stochastic region becomes low-value, and we expect overestimation bias to hurt and underestimation bias to help.
We test Q-learning, Double Q-learning, and our new algorithm Maxmin Q-learning in this environment. Maxmin Q-learning (described fully in the next section) uses $N$ estimates of the action values in the targets. For $N = 1$, it corresponds to Q-learning; otherwise, it progresses from overestimation bias at $N = 1$ towards underestimation bias as $N$ increases. In the experiment, all algorithms used the same discount factor, replay buffer size, $\epsilon$-greedy exploration, step-size, and tabular action-values initialized from a Gaussian distribution.

The results in Figure 2 verify our hypotheses for when overestimation and underestimation bias help and hurt. Double Q-learning underestimates too much for $\mu > 0$ and converges to a suboptimal policy. Q-learning learns the optimal policy the fastest, though for all values of $N$, Maxmin Q-learning does progress towards the optimal policy. All methods reach the optimal policy for $\mu < 0$, but now Double Q-learning reaches it the fastest, followed by Maxmin Q-learning with larger $N$.
4 Maxmin Q-learning
In this section, we develop Maxmin Q-learning, a simple generalization of Q-learning designed to control the estimation bias, as well as reduce the estimation variance of action values. The idea is to maintain $N$ estimates of the action values, $Q^1, \ldots, Q^N$, and use the minimum of these estimates in the Q-learning target: $\max_{a'} \min_{i \in \{1, \ldots, N\}} Q^i(s', a')$. For $N = 1$, the update is simply Q-learning, and so likely has overestimation bias. As $N$ increases, the overestimation decreases; for some $N > 1$, this maxmin estimator switches from an overestimate, in expectation, to an underestimate. We characterize the relationship between $N$ and the expected estimation bias below in Theorem 1. Note that Maxmin Q-learning uses a different mechanism to reduce overestimation bias than Double Q-learning; Maxmin Q-learning with $N = 2$ is not Double Q-learning.
The full algorithm is summarized in Algorithm 1, and is a simple modification of Q-learning with experience replay. We use random subsamples of the observed data for each of the $N$ estimators, to make them nearly independent. To do this training online, we keep a replay buffer. On each step, a random estimator is chosen and updated using a minibatch from the buffer. Multiple such updates can be performed on each step, just as in experience replay, meaning multiple estimators can be updated per step using different random minibatches. In our experiments, to better match DQN, we simply do one update per step. Finally, it is also straightforward to incorporate target networks to get Maxmin DQN, by maintaining a target network for each estimator.
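The following is a minimal tabular sketch of the procedure just described: one randomly chosen estimator is updated per step using a minibatch from the replay buffer, with the maxmin target formed from all estimators. The step-size, discount, and minibatch size are illustrative assumptions rather than the values used in Algorithm 1.

```python
import random
import numpy as np

def maxmin_target(Qs, r, s_next, done, gamma=0.99):
    """Maxmin Q-learning target: max over actions of the elementwise min over the N estimates."""
    if done:
        return r
    q_min = np.min([Q[s_next] for Q in Qs], axis=0)  # min over estimators, per action
    return r + gamma * np.max(q_min)

def maxmin_update(Qs, buffer, alpha=0.1, gamma=0.99, batch_size=32):
    """Update one randomly chosen estimator using a random minibatch from the buffer."""
    i = random.randrange(len(Qs))
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for (s, a, r, s_next, done) in batch:
        y = maxmin_target(Qs, r, s_next, done, gamma)
        Qs[i][s, a] += alpha * (y - Qs[i][s, a])
    return Qs
```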
We now characterize the relation between the number of action-value functions used in Maxmin Q-learning and the estimation bias of action values. For compactness, we write $Q^i_{sa}$ instead of $Q^i(s, a)$. Each $Q^i_{sa}$ has random approximation error $e^i_{sa}$:

$$Q^i_{sa} = Q^*_{sa} + e^i_{sa}.$$

We assume that $e^i_{sa}$ is a uniform random variable, $e^i_{sa} \sim U(-\tau, \tau)$, for some $\tau > 0$. The uniform random assumption was used by Thrun and Schwartz (1993) to demonstrate bias in Q-learning, and reflects that non-negligible positive and negative errors are possible. Notice that for $N$ estimators sharing $n$ samples, $\tau$ will be proportional to some function of $n$ and $N$, because the data will be shared amongst the estimators. For the general theorem, we use a generic $\tau$, and in the following corollary provide a specific form for $\tau$ in terms of $n$ and $N$.

Recall that $M$ is the number of actions applicable at state $s'$. Define the estimation bias $Z_{MN}$ for transition $(s, a, r, s')$ to be the difference between the maxmin target and the true target,

$$Z_{MN} = \left( r + \gamma \max_{a'} Q^{\min}_{s'a'} \right) - \left( r + \gamma \max_{a'} Q^*_{s'a'} \right) = \gamma \left( \max_{a'} Q^{\min}_{s'a'} - \max_{a'} Q^*_{s'a'} \right),$$

where $Q^{\min}_{s'a'} = \min_{i \in \{1, \ldots, N\}} Q^i_{s'a'}$. We now show how the expected value and the variance of $Z_{MN}$ are related to the number of action-value functions $N$ in Maxmin Q-learning.
Theorem 1
Under the conditions stated above,

(i) the expected estimation bias $\mathbb{E}[Z_{MN}]$ decreases as $N$ increases: it is positive for $N = 1$ (overestimation) and becomes negative (underestimation) for sufficiently large $N$;

(ii) the variance $\mathrm{Var}[Z_{MN}]$ decreases as $N$ increases: it is largest for $N = 1$ and decreases towards its smallest value as $N \to \infty$.
Theorem 1 is a generalization of the first lemma in Thrun and Schwartz (1993); we provide the proof in Appendix A, as well as a visualization of the expected bias for varying $M$ and $N$. This theorem shows that the expected estimation bias $\mathbb{E}[Z_{MN}]$ decreases as $N$ increases. Thus, we can control the bias by changing the number of estimators in Maxmin Q-learning: the bias can be reduced from positive to negative as $N$ increases, and it crosses zero at an intermediate value of $N$. This suggests that by choosing $N$ appropriately, we can reduce the bias to near zero.
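A small Monte-Carlo sketch of this relationship, under the same uniform-noise assumption (the values of $M$, $\tau$, $\gamma$, and the trial count below are arbitrary choices for illustration): it estimates the expected bias of the maxmin target as $N$ grows, which moves from positive to negative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, tau, gamma, trials = 8, 1.0, 0.99, 100_000  # assumed illustrative values

# True next-state action values are all equal (say 0), so the true max is 0.
for N in (1, 2, 4, 8):
    e = rng.uniform(-tau, tau, size=(trials, N, M))  # e^i_{s'a'} ~ U(-tau, tau)
    q_min = e.min(axis=1)                            # min over the N estimators, per action
    z = gamma * q_min.max(axis=1)                    # Z_MN when all true values are equal
    print(N, z.mean())                               # positive for N=1, negative for large N
```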
Furthermore, $\mathrm{Var}[Z_{MN}]$ decreases as $N$ increases. This indicates that we can also control the estimation variance of the target action value through $N$. We show just this in the following corollary. The subtlety is that, with increasing $N$, each estimator will receive less data. The fair comparison is between the variance of a single estimator that uses all of the data and the maxmin estimator that shares the samples across $N$ estimators. We show that there is an $N > 1$ for which the variance is lower, which arises largely because the variance of each estimator decreases linearly in its number of samples, but the parameter $\tau$ for each estimator only decreases at a square-root rate in the number of samples.
Corollary 1
Assume the $n$ samples are evenly allocated amongst the $N$ estimators. Then each estimator has error variance $\mathrm{Var}[e^i_{s'a'}] = N\sigma^2/n$, where $\sigma^2$ is the variance of individual samples for $(s', a')$, and, for the estimator that uses all $n$ samples for a single estimate, $\mathrm{Var}[e_{s'a'}] = \sigma^2/n$. Under the uniform random noise assumption, for sufficiently large $N$, $\mathrm{Var}[Z_{MN}]$ is smaller than the corresponding variance under the single estimator trained on all $n$ samples.
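A rough simulation of this comparison (a sketch under the corollary's assumptions; the values of $M$, $\gamma$, $\sigma^2$, $n$, and the trial count are arbitrary): the error range of each estimator is set so that its variance matches a sample mean of $n/N$ samples, and the variance of the resulting bias is compared to the single-estimator case.

```python
import numpy as np

rng = np.random.default_rng(1)
M, gamma, sigma2, n, trials = 8, 0.99, 1.0, 64, 100_000  # assumed illustrative values

def var_of_bias(N):
    # Uniform error whose variance matches a sample mean of n/N samples: tau^2 / 3 = N * sigma2 / n.
    tau = np.sqrt(3.0 * N * sigma2 / n)
    e = rng.uniform(-tau, tau, size=(trials, N, M))
    z = gamma * e.min(axis=1).max(axis=1)  # bias of the maxmin target (true values all 0)
    return z.var()

print(var_of_bias(1))                        # single estimator using all n samples
print([var_of_bias(N) for N in (2, 4, 8)])   # falls below the N=1 value once N is large enough
```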
5 Experiments
In this section, we first investigate robustness to reward variance, in a simple environment (Mountain Car) in which we can perform more exhaustive experiments. Then, we investigate performance in seven benchmark environments.
Robustness under increasing reward variance in Mountain Car
Mountain Car (Sutton and Barto, 2018) is a classic testbed in reinforcement learning, where the agent receives a reward of $-1$ per step until the car reaches the goal position and the episode ends. In our experiment, we modify the rewards to be stochastic with the same mean value: on each time step, the reward is sampled from a Gaussian distribution with mean $-1$ and variance $\sigma^2$. An agent should learn to reach the goal position in as few steps as possible.
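A minimal sketch of this reward modification, assuming a Gym-style environment interface (the noise scale below is an arbitrary placeholder, not the variance used in our experiments):

```python
import numpy as np
import gym

class NoisyRewardWrapper(gym.RewardWrapper):
    """Replace each reward with a Gaussian sample that has the same mean."""
    def __init__(self, env, reward_std=1.0):
        super().__init__(env)
        self.reward_std = reward_std

    def reward(self, r):
        return np.random.normal(loc=r, scale=self.reward_std)

env = NoisyRewardWrapper(gym.make("MountainCar-v0"), reward_std=5.0)
```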
The experimental setup is as follows. We trained each algorithm for a fixed number of episodes, and used the number of steps needed to reach the goal position in the last training episode as the performance measure: the fewer steps, the better the performance. All experimental results were averaged over many runs. The key algorithm settings included the function approximator, step-sizes, exploration parameter, and replay buffer size. All algorithms used $\epsilon$-greedy exploration with the same $\epsilon$ and the same buffer size. For each algorithm, the best step-size was chosen from a common candidate set, separately for each reward setting. Tile-coding was used to approximate the action-value function, with multiple tilings and each tile covering a fixed fraction of the bounded distance in each dimension. For Maxmin Q-learning, we randomly chose one action-value function to update at each step.
As shown in Figure 3, when the reward variance is small, the performance of Q-learning, Double Q-learning, Averaged Q-learning, and Maxmin Q-learning is comparable. However, as the variance increases, Q-learning, Double Q-learning, and Averaged Q-learning become much less stable than Maxmin Q-learning. In fact, when the variance was very high (see Appendix C.2), Q-learning and Averaged Q-learning failed to reach the goal position within the step limit, and Double Q-learning produced runs requiring far more steps, even after many episodes.
Figure 3 shows the average number of steps taken in the last episode with one standard error, as well as the number of steps to reach the goal position during training under a high reward variance. All results were averaged across runs, with standard errors. Additional experiments with further elevated variance can be found in Appendix C.2.

Results on Benchmark Environments
To evaluate Maxmin DQN, we choose seven games from Gym (Brockman et al., 2016), the PyGame Learning Environment (PLE) (Tasfi, 2016), and MinAtar (Young and Tian, 2019): Lunarlander, Catcher, Pixelcopter, Asterix, Seaquest, Breakout, and Space Invaders. For the games in MinAtar (i.e., Asterix, Seaquest, Breakout, and Space Invaders), we reused the hyper-parameters and neural network settings of Young and Tian (2019), and the step-size was chosen from a small candidate set. For Lunarlander, Catcher, and Pixelcopter, the neural network was a multi-layer perceptron with fixed hidden-layer sizes, and the discount factor, replay buffer size, and batch size were fixed across algorithms. The weights of the neural networks were optimized by RMSprop with gradient clipping, and the target network was updated at a fixed frequency. $\epsilon$-greedy was applied as the exploration strategy, with $\epsilon$ decreasing linearly from its initial to its final value over an initial number of steps and held fixed thereafter. For each of Lunarlander, Catcher, and Pixelcopter, the best step-size was chosen from a candidate set.

For both Maxmin DQN and Averaged DQN, the number of target networks was chosen from a small candidate set, and we randomly chose one action-value function to update at each step. We first trained each algorithm in a game for a fixed number of steps. After that, each algorithm was tested by running test episodes with $\epsilon$-greedy, using a small fixed $\epsilon$. Results were averaged over several runs for each algorithm, with learning curves shown for the best hyper-parameter setting (see Appendix C.3 for the parameter sensitivity curves).
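As a sketch of how the Maxmin target is formed with multiple target networks (framework-agnostic NumPy pseudocode; the callable-network interface and argument names are our own assumptions, not the exact implementation used here):

```python
import numpy as np

def maxmin_dqn_targets(target_nets, rewards, next_states, dones, gamma=0.99):
    """Compute Maxmin DQN targets for a minibatch.

    target_nets: list of N callables, each mapping a batch of states to a
                 (batch, num_actions) array of Q-values.
    dones:       0/1 float array marking terminal transitions.
    """
    q_next = np.stack([net(next_states) for net in target_nets], axis=0)  # (N, batch, actions)
    q_min = q_next.min(axis=0)                                            # min over the N target networks
    return rewards + gamma * (1.0 - dones) * q_min.max(axis=1)            # max over actions
```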
We see from Figure 4 that Maxmin DQN performs as well as or better than the other algorithms. In environments where its final performance is noticeably better (Pixelcopter, Lunarlander, and Asterix), the initial learning is slower. A possible explanation is that the Maxmin agent explores more extensively early on, promoting better final performance. We additionally show on Pixelcopter and Asterix that, for smaller $N$, Maxmin DQN learns faster but reaches suboptimal performance, behaving more like Q-learning, while for larger $N$ it learns more slowly but reaches better final performance.
6 Convergence Analysis of Maxmin Q-learning
In this section, we show that Maxmin Q-learning is convergent in the tabular setting. We do so by providing a more general result for what we call Generalized Q-learning: Q-learning where the bootstrap target uses a function $G$ of action values. The main condition on $G$ is that it maintains relative maximum values, as stated in Assumption 1. We use this more general result to prove that Maxmin Q-learning is convergent, and then discuss how it provides convergence results for Q-learning, Ensemble Q-learning, Averaged Q-learning, and Historical Best Q-learning as special cases.
Many variants of Q-learning have been proposed, including Double Q-learning (van Hasselt, 2010), Weighted Double Q-learning (Zhang et al., 2017), Ensemble Q-learning (Anschel et al., 2017), Averaged Q-learning (Anschel et al., 2017), and Historical Best Q-learning (Yu et al., 2018). These algorithms differ in their estimate of the one-step bootstrap target. To encompass all variants, the target action-value of Generalized Q-learning is defined based on action-value estimates from both dimensions, across parallel estimators and across recent time steps:
(2) 
where $t$ is the current time step and the target action-value is a function $G$ of these estimates:
(3) 
For simplicity, the collection of action-value estimates for a given state-action pair is denoted as a single vector, and similarly for the next state. The corresponding update rule is

(4)
For different choices of $G$, Generalized Q-learning reduces to different variants of Q-learning, including Q-learning itself. For example, Generalized Q-learning reduces to Q-learning by maintaining a single estimate and letting $G$ return that estimate unchanged. Double Q-learning can be specified with two estimates, where $G$ evaluates one estimate at the maximal action of the other.
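To make the role of $G$ concrete, here is a small sketch of the reductions described above (the simplified signatures, where each function maps the per-action estimates directly to the bootstrap value, are our own assumption for illustration):

```python
import numpy as np

# Each "estimates" argument is an (N, num_actions) array of action values at the next state.

def g_q_learning(estimates):
    # Single estimate: plain Q-learning target value.
    return estimates[0].max()

def g_maxmin(estimates):
    # Maxmin Q-learning: max over actions of the elementwise min over the N estimates.
    return estimates.min(axis=0).max()

def g_average(estimates):
    # Ensemble/Averaged Q-learning style: max over actions of the averaged estimates.
    return estimates.mean(axis=0).max()

def g_double(estimates):
    # Double Q-learning style: evaluate estimate 1 at the action that maximizes estimate 0.
    return estimates[1][np.argmax(estimates[0])]
```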
We first introduce Assumption 1 on the function $G$ in Generalized Q-learning, and then state the theorem. The proof can be found in Appendix B.
Assumption 1 (Conditions on $G$)
Let $\mathbf{q}$ and $\mathbf{q}'$ denote two collections of action-value estimates supplied to $G$, with entries $q_i$ and $q'_i$.

(i) If $q_i \ge q'_i$ for every $i$, then $G(\mathbf{q}) \ge G(\mathbf{q}')$; in particular, $G$ maintains relative maximum values, and its output lies between the smallest and largest of its arguments.

(ii) $|G(\mathbf{q}) - G(\mathbf{q}')| \le \max_i |q_i - q'_i|$, i.e., $G$ is non-expansive in the maximum norm.
We can verify that Assumption 1 holds for Maxmin Q-learning. Set $G$ to be the minimum over the $N$ action-value estimates, for a positive integer $N$. Monotonicity of the minimum means that part (i) of Assumption 1 is satisfied. Part (ii) is also satisfied because the minimum operator is non-expansive: changing every estimate by at most $\epsilon$ changes the minimum by at most $\epsilon$.
Assumption 2 (Conditions on the step-sizes)
There exists some (deterministic) constant $C$ such that for every state-action pair $(s, a)$, $\alpha_t(s, a) \in [0, 1]$, $\sum_t \alpha_t(s, a) = \infty$, and $\sum_t \alpha_t^2(s, a) \le C$ with probability 1.
Theorem 2
Assume a finite MDP and that Assumptions 1 and 2 hold. Then the action-value functions in Generalized Q-learning, using the tabular update in Equation (4), will converge to the optimal action-value function with probability 1, in either of the following cases: (i) $\gamma < 1$, or (ii) $\gamma = 1$, where the MDP contains an absorbing, reward-free state and all policies are proper.
As shown above, because the function $G$ for Maxmin Q-learning satisfies Assumption 1, Theorem 2 implies that it converges. Next, we apply Theorem 2 to Q-learning and its variants, proving the convergence of these algorithms in the tabular case. For Q-learning, $G$ is applied to a single estimate and simply returns it; it is straightforward to check that Assumption 1 holds. For Ensemble Q-learning, $G$ is the average over a positive integer number of parallel estimates, and it is easy to check that Assumption 1 is satisfied. For Averaged Q-learning, the proof is similar to Ensemble Q-learning, except that the average is taken over a positive integer number of previous estimates rather than over parallel estimators. For Historical Best Q-learning, we assume that all auxiliary action-value functions are selected from action-value functions at most a bounded number of updates ago, and $G$ returns the largest action-value among them for each state; Assumption 1 is then satisfied and convergence is guaranteed.
7 Conclusion
Overestimation bias is a byproduct of Q-learning, stemming from the selection of a maximal value to estimate the expected maximal value. In practice, overestimation bias leads to poor performance in a variety of settings. Though multiple Q-learning variants have been proposed, Maxmin Q-learning is the first solution that allows for flexible control of the bias, allowing for overestimation or underestimation determined by the choice of the number of estimators $N$ and by the environment. We showed theoretically that we can decrease both the estimation bias and the estimation variance by choosing an appropriate number of action-value functions. We empirically showed the advantages of Maxmin Q-learning, both on toy problems where we investigated the effect of reward noise and on several benchmark environments. Finally, we introduced a new Generalized Q-learning framework, which we used to prove the convergence of Maxmin Q-learning as well as of several other Q-learning variants that use multiple action-value estimates.

Acknowledgments
We would like to thank Huizhen Yu and Yi Wan for their valuable feedback and helpful discussion.
References

Anschel et al. (2017). Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning. In International Conference on Machine Learning, pp. 176–185.
Bertsekas and Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ.
Bertsekas and Tsitsiklis (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Brockman et al. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
David and Nagaraja (2004). Order Statistics. Encyclopedia of Statistical Sciences.
Fujimoto et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. In International Conference on Machine Learning, pp. 1587–1596.
Haarnoja et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning, pp. 1861–1870.
van Hasselt et al. (2016). Deep Reinforcement Learning with Double Q-learning. In AAAI Conference on Artificial Intelligence.
Lee et al. (2013). Bias-corrected Q-learning to Control Max-operator Bias in Q-learning. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 93–99.
Melo and Ribeiro (2007). Q-learning with Linear Function Approximation. In International Conference on Computational Learning Theory, pp. 308–322.
Strehl et al. (2009). Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10, pp. 2413–2444.
Sutton and Barto (2018). Reinforcement Learning: An Introduction. Second edition, MIT Press.
Szita and Lőrincz (2008). The Many Faces of Optimism: A Unifying Approach. In International Conference on Machine Learning, pp. 1048–1055.
Tasfi (2016). PyGame Learning Environment. GitHub repository: https://github.com/ntasfi/PyGame-Learning-Environment.
Thrun and Schwartz (1993). Issues in Using Function Approximation for Reinforcement Learning. In Proceedings of the Fourth Connectionist Models Summer School.
Tsitsiklis (1994). Asynchronous Stochastic Approximation and Q-learning. Machine Learning.
van Hasselt (2010). Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621.
Watkins (1989). Learning from Delayed Rewards. Ph.D. Thesis, King's College, Cambridge.
Young and Tian (2019). MinAtar: An Atari-inspired Testbed for More Efficient Reinforcement Learning Experiments. arXiv preprint arXiv:1903.03176.
Yu et al. (2018). Historical Best Q-Networks for Deep Reinforcement Learning. In International Conference on Tools with Artificial Intelligence, pp. 6–11.
Zhang et al. (2017). Weighted Double Q-learning. In International Joint Conference on Artificial Intelligence, pp. 3455–3461.
Appendix A The Proof of Theorem 1
We first present Lemma 1 here as a tool to prove Theorem 1. Note that the first three properties in this lemma are well-known results of order statistics (David and Nagaraja, 2004).
Lemma 1
Let $X_1, \ldots, X_N$ be $N$ i.i.d. random variables from an absolutely continuous distribution with probability density function (PDF) $f(x)$ and cumulative distribution function (CDF) $F(x)$. Denote $X_{\max} = \max_i X_i$ and $X_{\min} = \min_i X_i$. Denote the PDF and CDF of $X_{\max}$ as $f_{\max}$ and $F_{\max}$, respectively. Similarly, denote the PDF and CDF of $X_{\min}$ as $f_{\min}$ and $F_{\min}$, respectively. We then have

(i) $F_{\max}(x) = F(x)^N$ and $F_{\min}(x) = 1 - (1 - F(x))^N$.

(ii) $f_{\max}(x) = N F(x)^{N-1} f(x)$.

(iii) $f_{\min}(x) = N (1 - F(x))^{N-1} f(x)$.

(iv) If $X_i \sim U(-\tau, \tau)$, we have $\mathbb{E}[X_{\min}] = -\tau \frac{N-1}{N+1}$ and $\mathrm{Var}[X_{\min}] = \frac{4 N \tau^2}{(N+1)^2 (N+2)}$ for any positive integer $N$.
Proof.

(i) By the definition of $X_{\max}$, we have $\{X_{\max} \le x\} = \{X_1 \le x, \ldots, X_N \le x\}$, so $F_{\max}(x) = \prod_{i=1}^{N} F(x) = F(x)^N$. Similarly, $\{X_{\min} > x\} = \{X_1 > x, \ldots, X_N > x\}$, so $F_{\min}(x) = 1 - (1 - F(x))^N$. A proof can also be found in David and Nagaraja (2004, Chapter 4, Section 4.2).

(ii) We first consider the CDF of $X_{\max}$, $F_{\max}(x) = F(x)^N$. Then the PDF of $X_{\max}$ is $f_{\max}(x) = \frac{d}{dx} F(x)^N = N F(x)^{N-1} f(x)$.

(iii) Similar to (ii), we first consider the CDF of $X_{\min}$, $F_{\min}(x) = 1 - (1 - F(x))^N$. Then the PDF of $X_{\min}$ is $f_{\min}(x) = N (1 - F(x))^{N-1} f(x)$.

(iv) Since $X_i \sim U(-\tau, \tau)$, we have $F(x) = \frac{x + \tau}{2\tau}$ and $f(x) = \frac{1}{2\tau}$ on $(-\tau, \tau)$. Substituting into (iii) and integrating gives $\mathbb{E}[X_{\min}] = -\tau \frac{N-1}{N+1}$ and $\mathrm{Var}[X_{\min}] = \frac{4 N \tau^2}{(N+1)^2 (N+2)}$; it is easy to check that these expressions hold for any positive integer $N$.
Next, we prove Theorem 1.
Proof. Let $F_{\min}$ and $f_{\min}$ be the CDF and PDF of $Q^{\min}_{s'a'}$ for a fixed action $a'$, respectively. Similarly, let $F$ and $f$ be the CDF and PDF of a single estimate $Q^i_{s'a'}$. Since the error $e^i_{s'a'}$ is sampled from $U(-\tau, \tau)$, $F$ and $f$ are the CDF and PDF of a uniform distribution centred at $Q^*_{s'a'}$. By Lemma 1, $F_{\min}(x) = 1 - (1 - F(x))^N$ and $f_{\min}(x) = N (1 - F(x))^{N-1} f(x)$. The expected estimation bias $\mathbb{E}[Z_{MN}]$ is then obtained by integrating over the distribution of the maximum, across the $M$ actions, of these minima.

After a change of variables onto the unit interval, this expectation can be written as an expression whose denominator is a product of terms indexed by $N$. Each term in the denominator decreases as $N$ increases, because the corresponding factor gets smaller. Using this, we conclude that $\mathbb{E}[Z_{MN}]$ decreases as $N$ increases, with $\mathbb{E}[Z_{M1}] > 0$ and $\mathbb{E}[Z_{MN}] < 0$ for $N$ sufficiently large.

By Lemma 1, the variance of $Z_{MN}$ can likewise be computed from the distribution of the maximum of the $M$ minima, and it decreases as $N$ increases. In particular, it is largest for $N = 1$ and smallest in the limit $N \to \infty$.
The bias-variance trade-off of Maxmin Q-learning is illustrated by the empirical results in Figure 5, which support Theorem 1. For each $M$, $N$ can be selected such that the absolute value of the expected estimation bias is close to zero, according to Theorem 1. By adjusting $N$, we can reduce both the estimation variance and the estimation bias.
Finally, we prove the result of the Corollary.
Corollary 1. Assume the $n$ samples are evenly allocated amongst the $N$ estimators. Then each estimator has error variance $\mathrm{Var}[e^i_{s'a'}] = N\sigma^2/n$, where $\sigma^2$ is the variance of individual samples for $(s', a')$, and, for the estimator that uses all $n$ samples for a single estimate, $\mathrm{Var}[e_{s'a'}] = \sigma^2/n$. Under the uniform random noise assumption, for sufficiently large $N$, $\mathrm{Var}[Z_{MN}]$ is smaller than the corresponding variance under the single estimator trained on all $n$ samples.
Proof. Because $Q^i_{s'a'}$ is a sample mean over $n/N$ samples, its variance is $\sigma^2 / (n/N) = N\sigma^2/n$, where $\sigma^2$ is the variance of individual samples for $(s', a')$, and its mean is $Q^*_{s'a'}$ (because it is an unbiased sample average). Consequently, the error $e^i_{s'a'}$ has mean zero and variance $N\sigma^2/n$. Because $e^i_{s'a'}$ is a uniform random variable on $(-\tau, \tau)$, which has variance $\tau^2/3$, we know that $\tau^2 = 3N\sigma^2/n$. Plugging this value into the variance expression from Theorem 1, and comparing to the single estimator, whose error variance is $\sigma^2/n$ because it uses all $n$ samples, it is easy to verify that for sufficiently large $N$ the variance of the maxmin target is smaller.
Appendix B The Convergence Proof of Generalized Q-learning
The convergence proof of Generalized Q-learning is based on Tsitsiklis (1994). The key steps in applying this result to Generalized Q-learning are showing that the dynamic programming operator is a contraction and verifying the noise conditions. We first establish these two steps in Lemma 2 and Lemma 3, and then use these lemmas to make the standard argument for convergence.
B.1 Problem Setting for Generalized Q-learning
Consider a Markov decision problem defined on a finite state space $\mathcal{S}$. For every state $s \in \mathcal{S}$, there is a finite set $\mathcal{A}(s)$ of possible actions and a set of non-negative scalars $p(s' \mid s, a)$, $s' \in \mathcal{S}$, $a \in \mathcal{A}(s)$, such that $\sum_{s' \in \mathcal{S}} p(s' \mid s, a) = 1$ for all $a \in \mathcal{A}(s)$. The scalar $p(s' \mid s, a)$ is interpreted as the probability of a transition to $s'$, given that the current state is $s$ and action $a$ is applied. Furthermore, for every state $s$ and action $a$, there is a random variable $R(s, a)$ which represents the reward received if action $a$ is applied at state $s$. We assume that the variance of $R(s, a)$ is finite for every $s$ and $a$.
A stationary policy is a function $\pi$ defined on $\mathcal{S}$ such that $\pi(s) \in \mathcal{A}(s)$ for all $s \in \mathcal{S}$. Given a stationary policy, we obtain a discrete-time Markov chain $\{S_t\}$ with transition probabilities

$$P(S_{t+1} = s' \mid S_t = s) = p(s' \mid s, \pi(s)). \qquad (5)$$
Let $\gamma \in (0, 1]$ be a discount factor. For any stationary policy $\pi$ and initial state $s$, the state value $V^{\pi}(s)$ is defined by

$$V^{\pi}(s) = \lim_{T \to \infty} \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} R(S_t, \pi(S_t)) \,\middle|\, S_0 = s \right]. \qquad (6)$$
The optimal state value function is defined by

$$V^{*}(s) = \max_{\pi} V^{\pi}(s). \qquad (7)$$
The Markov decision problem is to evaluate the function $V^*$. Once this is done, an optimal policy is easily determined.
Markov decision problems are easiest when the discount factor $\gamma$ is strictly smaller than 1. For the undiscounted case ($\gamma = 1$), we will assume throughout that there is a reward-free state, say state $0$, which is absorbing; that is, $p(0 \mid 0, a) = 1$ and $R(0, a) = 0$ for all $a \in \mathcal{A}(0)$. The objective is then to reach that state at maximum expected reward. We say that a stationary policy is proper if the probability of being at the absorbing state converges to 1 as time goes to infinity; otherwise, we say that the policy is improper.
We define the dynamic programming operator $T$, acting on action-value functions, by $(TQ)(s, a) = \mathbb{E}[R(s, a)] + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \max_{a' \in \mathcal{A}(s')} Q(s', a')$.