1 Introduction
Our main motivation is the work by Mnih et al. (2015), in which Q-learning (Watkins, 1989)
is combined with a deep convolutional neural network
(cf. LeCun et al., 2015). The resulting deep Q-network (DQN) algorithm learned to play a varied set of Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al., 2013), which was proposed as an evaluation framework to test general learning algorithms on many different interesting tasks. DQN was proposed as a singular solution, using a single set of hyperparameters, but the magnitudes and frequencies of rewards vary wildly between different games. To overcome this hurdle, the rewards and temporal-difference errors were clipped to $[-1, 1]$. For instance, in Pong the rewards are bounded by $-1$ and $1$, while in Ms. Pac-Man eating a single ghost can yield a reward of up to 1600, but DQN clips the latter to 1 as well. This is not a satisfying solution, for two reasons. First, such clipping introduces domain knowledge. Most games have sparse non-zero rewards outside of $\{-1, 0, 1\}$. Clipping then results in optimizing the frequency of rewards, rather than their sum. This is a good heuristic in Atari, but it does not generalize to other domains. More importantly, the clipping changes the objective, sometimes resulting in qualitatively different policies of behavior.

We propose a method to adaptively normalize the targets used in the learning updates. If these targets are guaranteed to be normalized, it is much easier to find suitable hyperparameters. The proposed technique is not specific to DQN and is more generally applicable in supervised learning and reinforcement learning. There are several reasons such normalization can be desirable. First, sometimes we desire a single system that is able to solve multiple different problems with varying natural magnitudes, as in the Atari domain. Second, for multivariate functions the normalization can be used to disentangle the natural magnitude of each component from its relative importance in the loss function. This is particularly useful when the components have different units, such as when we predict signals from sensors with different modalities. Finally, adaptive normalization can help deal with non-stationarity. For instance, in reinforcement learning the policy of behavior can change repeatedly during learning, thereby changing the distribution and magnitude of the values.
1.1 Related work
Many machine-learning algorithms rely on a priori access to data to properly tune relevant hyperparameters
(Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). However, it is much harder to learn efficiently from a stream of data when we do not know the magnitude of the function we seek to approximate beforehand, or when these magnitudes can change over time, as is typically the case in reinforcement learning, where the policy of behavior improves over time.

Input normalization has long been recognized as important for efficiently learning non-linear approximations such as neural networks (LeCun et al., 1998), leading to research on how to achieve scale-invariance on the inputs (e.g., Ross et al., 2013; Ioffe and Szegedy, 2015; Desjardins et al., 2015). Output or target normalization has not received as much attention, probably because in supervised learning the data is commonly available before learning commences, making it straightforward to determine appropriate normalizations or to tune hyperparameters. However, this assumes the data is available a priori, which is not true in online (and potentially non-stationary) settings.
Natural gradients (Amari, 1998) are invariant to reparameterizations of the function approximation, thereby avoiding many scaling issues, but they are computationally expensive for functions with many parameters, such as deep neural networks. This is why approximations are regularly proposed, typically trading off accuracy against computation (Martens and Grosse, 2015), and sometimes focusing on a specific aspect such as input normalization (Desjardins et al., 2015; Ioffe and Szegedy, 2015). Most such algorithms are not fully invariant to rescaling the targets.
In the Atari domain several algorithmic variants and improvements for DQN have been proposed (van Hasselt et al., 2016; Bellemare et al., 2016; Schaul et al., 2016; Wang et al., 2016), as well as alternative solutions (Liang et al., 2016; Mnih et al., 2016). However, none of these address the clipping of the rewards or explicitly discuss the impacts of clipping on performance or behavior.
1.2 Preliminaries
Concretely, we consider learning from a stream of data $\{(X_t, Y_t)\}_{t=1}^{\infty}$, where the inputs $X_t \in \mathbb{R}^n$ and targets $Y_t \in \mathbb{R}^k$ are real-valued tensors. The aim is to update parameters $\theta$ of a function $f_\theta : \mathbb{R}^n \to \mathbb{R}^k$ such that the output $f_\theta(X_t)$ is (in expectation) close to the target $Y_t$ according to some loss $l_t(f_\theta)$, for instance defined as a squared difference: $l_t(f_\theta) = \tfrac{1}{2}(f_\theta(X_t) - Y_t)^\top (f_\theta(X_t) - Y_t)$. A canonical update is stochastic gradient descent (SGD). For a sample $(X_t, Y_t)$, the update is then $\theta_t = \theta_{t-1} - \alpha_t \nabla_\theta l_t(f_\theta)$, where $\alpha_t$ is a step size. The magnitude of this update depends on both the step size and the loss, and it is hard to pick suitable step sizes when nothing is known about the magnitude of the loss.

An important special case is when $f_\theta$ is a neural network (McCulloch and Pitts, 1943; Rosenblatt, 1962). Neural networks are often trained with a form of SGD (Rumelhart et al., 1986), with hyperparameters that interact with the scale of the loss. Especially for deep neural networks (LeCun et al., 2015; Schmidhuber, 2015), large updates may harm learning, because these networks are highly non-linear and such updates may ‘bump’ the parameters into regions with high error.
2 Adaptive normalization with PopArt
We propose to normalize the targets $Y_t$, where the normalization is learned separately from the approximating function. We consider an affine transformation of the targets

$\tilde{Y}_t = \Sigma_t^{-1} (Y_t - \mu_t)\,, \qquad (1)$

where $\Sigma_t$ and $\mu_t$ are scale and shift parameters that are learned from data. The scale matrix $\Sigma_t$ can be dense, diagonal, or defined by a scalar $\sigma_t$ as $\Sigma_t = \sigma_t I$. Similarly, the shift vector $\mu_t$ can contain separate components, or be defined by a scalar as $\mu_t = \mu_t \mathbf{1}$. We can then define a loss on the normalized function $g(X_t)$ and the normalized target $\tilde{Y}_t$. The unnormalized approximation, for any input $x$, is then given by $f(x) = \Sigma g(x) + \mu$, where $g$ is the normalized function and $f$ is the unnormalized function.

At first glance it may seem we have made little progress. If we learn $\Sigma$ and $\mu$ using the same algorithm as used for the parameters of the function $g$, then the problem has not become fundamentally different or easier; we would merely have changed the structure of the parameterized function slightly. Conversely, if we consider tuning the scale and shift as hyperparameters, then tuning them is not fundamentally easier than directly tuning other hyperparameters, such as the step size.
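As a concrete illustration, the affine transformation in (1) and its inverse can be sketched in a few lines of numpy; the function and variable names, and the example scales, are ours and purely illustrative:

```python
import numpy as np

# Minimal sketch of equation (1): targets are normalized as
# Y_tilde = inv(Sigma) @ (Y - mu), and the unnormalized prediction
# is recovered as f(x) = Sigma @ g(x) + mu.

def normalize_target(y, sigma, mu):
    """Map an unnormalized target into normalized space."""
    return np.linalg.solve(sigma, y - mu)

def unnormalize_output(g_x, sigma, mu):
    """Map a normalized prediction g(x) back to the target's scale."""
    return sigma @ g_x + mu

sigma = np.diag([100.0, 0.5])   # per-component scales (illustrative)
mu = np.array([1000.0, -3.0])   # per-component shifts (illustrative)

y = np.array([1234.0, -2.5])
y_tilde = normalize_target(y, sigma, mu)   # [2.34, 1.0]
y_back = unnormalize_output(y_tilde, sigma, mu)
```

A diagonal scale, as here, disentangles the natural magnitude of each target component, which is exactly the multivariate use case described above.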
Fortunately, there is an alternative. We propose to update $\Sigma$ and $\mu$ according to a separate objective, with the aim of normalizing the updates for $g$. Thereby, we decompose the problem of learning an appropriate normalization from learning the specific shape of the function. The two properties that we want to achieve simultaneously are

1. to update the scale $\Sigma$ and shift $\mu$ such that the normalized target $\Sigma^{-1}(Y - \mu)$ is appropriately normalized, and
2. to preserve the outputs of the unnormalized function when we change the scale and shift.
We discuss these properties separately below. We refer to algorithms that combine output-preserving updates and adaptive rescaling as PopArt algorithms, an acronym for “Preserving Outputs Precisely, while Adaptively Rescaling Targets”.
2.1 Preserving outputs precisely
Unless care is taken, repeated updates to the normalization might make learning harder rather than easier, because the normalized targets become non-stationary. More importantly, whenever we adapt the normalization based on a certain target, this simultaneously changes the output of the unnormalized function for all inputs. If there is little reason to believe that the other unnormalized outputs were incorrect, this is undesirable and may hurt performance in practice, as illustrated in Section 3. We first discuss how to prevent these issues, before discussing how to update the scale and shift.
The only way to avoid changing all outputs of the unnormalized function whenever we update the scale and shift is by changing the normalized function itself simultaneously. The goal is to preserve the outputs from before the change of normalization, for all inputs. This prevents the normalization from affecting the approximation, which is appropriate because its objective is solely to make learning easier, and to leave solving the approximation itself to the optimization algorithm.
Without loss of generality, the unnormalized function can be written as

$f_{\theta, \Sigma, \mu, W, b}(x) \;\equiv\; \Sigma\, g_{\theta, W, b}(x) + \mu \;\equiv\; \Sigma\,(W h_\theta(x) + b) + \mu\,, \qquad (2)$

where $h_\theta$ is a parametrized (non-linear) function, and $g_{\theta, W, b}$ is the normalized function. It is not uncommon for deep neural networks to end in a linear layer, and then $h_\theta$ can be the output of the last (hidden) layer of non-linearities. Alternatively, we can always add a square linear layer to any non-linear function $h_\theta$ to ensure this constraint, for instance initialized as $W_0 = I$ and $b_0 = 0$.
The following proposition shows that we can update the parameters and to fulfill the second desideratum of preserving outputs precisely for any change in normalization.
Proposition 1.
Consider a function $f : \mathbb{R}^n \to \mathbb{R}^k$ defined as in (2) as

$f_{\theta, \Sigma, \mu, W, b}(x) \;\equiv\; \Sigma\,(W h_\theta(x) + b) + \mu\,,$

where $h_\theta$ is any non-linear function of $x \in \mathbb{R}^n$, $\Sigma$ is a $k \times k$ matrix, $\mu$ and $b$ are $k$-element vectors, and $W$ is a $k \times m$ matrix. Consider any change of the scale and shift parameters from $\Sigma$ to $\Sigma_{\text{new}}$ and from $\mu$ to $\mu_{\text{new}}$, where $\Sigma_{\text{new}}$ is non-singular. If we then additionally change the parameters $W$ and $b$ to $W_{\text{new}}$ and $b_{\text{new}}$, defined by

$W_{\text{new}} = \Sigma_{\text{new}}^{-1}\, \Sigma\, W \qquad\text{and}\qquad b_{\text{new}} = \Sigma_{\text{new}}^{-1} (\Sigma b + \mu - \mu_{\text{new}})\,,$

then the outputs of the unnormalized function $f$ are preserved precisely, in the sense that

$f_{\theta, \Sigma, \mu, W, b}(x) = f_{\theta, \Sigma_{\text{new}}, \mu_{\text{new}}, W_{\text{new}}, b_{\text{new}}}(x)\,, \quad \forall x\,.$

This and later propositions are proven in the appendix. For the special case of scalar scale and shift, with $\Sigma \equiv \sigma I$ and $\mu \equiv \mu \mathbf{1}$, the updates to $W$ and $b$ become $W_{\text{new}} = (\sigma/\sigma_{\text{new}}) W$ and $b_{\text{new}} = (\sigma b + \mu - \mu_{\text{new}})/\sigma_{\text{new}}$. After updating the scale and shift, we can update the output of the normalized function toward the normalized target, using any learning algorithm. Importantly, the normalization can be updated first, thereby avoiding harmful large updates just before they would otherwise occur. This observation is made more precise in Proposition 2 in Section 2.2.
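The output-preserving update of Proposition 1 can be checked numerically. The following minimal sketch (scalar scale and shift; variable names and the particular numbers are ours) verifies that rescaling W and b as above leaves the unnormalized outputs unchanged:

```python
import numpy as np

# Sketch of the output-preserving ("Pop") update for the scalar case
# Sigma = sigma * I and mu = mu * 1: changing (sigma, mu) while applying
# the compensating update to (W, b) leaves f(x) untouched.

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 5))        # top linear layer of the normalized function
b = rng.normal(size=2)
h = rng.normal(size=5)             # h_theta(x): last hidden layer output

def output(W, b, sigma, mu):
    """Unnormalized output f(x) = sigma * (W h + b) + mu."""
    return sigma * (W @ h + b) + mu

sigma, mu = 1.0, 0.0
sigma_new, mu_new = 40.0, 12.5     # an arbitrary change of normalization

# Compensating update from Proposition 1 (scalar case):
#   W_new = (sigma / sigma_new) W,   b_new = (sigma * b + mu - mu_new) / sigma_new
W_new = (sigma / sigma_new) * W
b_new = (sigma * b + mu - mu_new) / sigma_new
```

Because the rescaling cancels algebraically, the check holds to floating-point precision for any h, not just the sampled one.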
Algorithm 1 is an example implementation of SGD with PopArt for a squared loss. It generalizes easily to any other loss by changing the definition of the error δ. Notice that W and b are updated twice: first to adapt to the new scale and shift and preserve the outputs of the function, and then by SGD. The order of these updates is important, because it allows us to use the new normalization immediately in the subsequent SGD update.
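Since Algorithm 1 itself is not reproduced in this excerpt, the following is a minimal sketch of its structure for a scalar target and squared loss, using a small one-hidden-layer tanh network with manual gradients. The order of operations follows the text: Art (update statistics), then Pop (rescale W and b to preserve outputs), then SGD on the normalized error. The architecture, data, and step sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 4, 8
A = rng.normal(scale=0.5, size=(n_hid, n_in))   # lower-layer weights (theta)
c = np.zeros(n_hid)
w = rng.normal(scale=0.5, size=n_hid)           # top linear layer (W)
b = 0.0
mu, nu, sigma = 0.0, 1.0, 1.0                   # normalization statistics
alpha, beta = 0.03, 1e-2                        # SGD and statistics step sizes
max_normalized = 0.0                            # track |normalized target|

for t in range(2000):
    x = rng.normal(size=n_in)
    y = 1000.0 * x[0]                           # targets with large magnitude
    # Art: adapt the statistics to the new target first ...
    mu_old, sigma_old = mu, sigma
    mu = (1 - beta) * mu + beta * y
    nu = (1 - beta) * nu + beta * y * y
    sigma = np.sqrt(max(nu - mu * mu, 1e-8))
    # Pop: ... then rescale w and b so all unnormalized outputs are unchanged
    w *= sigma_old / sigma
    b = (sigma_old * b + mu_old - mu) / sigma
    # SGD on the normalized squared error
    h = np.tanh(A @ x + c)
    delta = (w @ h + b) - (y - mu) / sigma      # normalized error
    max_normalized = max(max_normalized, abs((y - mu) / sigma))
    A -= alpha * delta * np.outer(w * (1 - h * h), x)
    c -= alpha * delta * w * (1 - h * h)
    w -= alpha * delta * h
    b -= alpha * delta
```

Note that sigma automatically grows to the natural scale of the targets, and the normalized targets seen by SGD stay within the bound of Proposition 2 throughout.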
2.2 Adaptively rescaling targets
A natural choice is to normalize the targets to approximately have zero mean and unit variance. For clarity and conciseness, we consider scalar normalizations; it is straightforward to extend these to diagonal or dense matrices. If we have data $\{Y_i\}_{i=1}^{t}$ up to some time $t$, we then may desire

$\mu_t = \frac{1}{t} \sum_{i=1}^{t} Y_i \qquad\text{and}\qquad \sigma_t^2 = \frac{1}{t} \sum_{i=1}^{t} Y_i^2 - \mu_t^2\,. \qquad (3)$

This can be generalized to incremental updates

$\mu_t = (1 - \beta_t)\, \mu_{t-1} + \beta_t Y_t\,, \qquad \nu_t = (1 - \beta_t)\, \nu_{t-1} + \beta_t Y_t^2\,, \qquad \sigma_t = \sqrt{\nu_t - \mu_t^2}\,. \qquad (4)$

Here $\nu_t$ estimates the second moment of the targets and $\beta_t \in [0, 1]$ is a step size. If $\nu_t - \mu_t^2$ is positive initially then it will always remain so, although to avoid issues with numerical precision it can be useful to enforce a lower bound explicitly by requiring $\nu_t - \mu_t^2 \geq \epsilon$ with $\epsilon > 0$. For full equivalence to (3) we can use $\beta_t = 1/t$. If $\beta_t = \beta$ is constant we get exponential moving averages, placing more weight on recent data points, which is appropriate in non-stationary settings.
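As a quick sketch, the recursion in (4) with the step size set to 1/t reproduces the exact sample statistics of (3); the helper below (names are ours) makes both modes explicit:

```python
import numpy as np

# Sketch of the incremental statistics in equation (4). With beta_t = 1/t
# the recursion reproduces the exact sample mean and standard deviation of
# equation (3); a constant beta gives an exponential moving average instead.

def running_stats(targets, beta=None):
    """Return (mu, sigma) after processing the targets with updates (4)."""
    mu, nu = 0.0, 0.0
    for t, y in enumerate(targets, start=1):
        b = 1.0 / t if beta is None else beta   # beta_t = 1/t => exact (3)
        mu = (1 - b) * mu + b * y
        nu = (1 - b) * nu + b * y * y
    return mu, np.sqrt(max(nu - mu * mu, 0.0))

ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = running_stats(ys)   # matches np.mean(ys) and np.std(ys)
```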
A constant β has the additional benefit of never becoming negligibly small. Consider the first time a target is observed that is much larger than all previously observed targets. If β is small, our statistics would adapt only slightly, and the resulting update may be large enough to harm the learning. If β is not too small, the normalization can adapt to the large target before updating, potentially making learning more robust. In particular, the following proposition holds.
Proposition 2.
When using the updates in (4) to adapt the normalization parameters $\sigma$ and $\mu$, the normalized targets are bounded for all $t$ by

$\left| \frac{Y_t - \mu_t}{\sigma_t} \right| \;\leq\; \sqrt{\frac{1 - \beta_t}{\beta_t}}\,.$

For instance, if $\beta_t = \beta = 10^{-4}$ for all $t$, then the normalized target is guaranteed to be in $(-100, 100)$. Note that Proposition 2 does not rely on any assumptions about the distribution of the targets. This is an important result, because it implies we can bound the potential normalized errors before learning, without any prior knowledge about the actual targets we may observe.
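The bound of Proposition 2 can be probed empirically. The sketch below feeds a stream with occasional extreme outliers through the updates in (4) and checks that the normalized targets never exceed the bound, regardless of the target distribution; the stream and constants are illustrative:

```python
import numpy as np

# Empirical check of Proposition 2: with constant beta, the normalized
# target |(Y_t - mu_t)/sigma_t| never exceeds sqrt((1-beta)/beta), even
# for a deliberately nasty stream with huge outliers.

beta = 1e-4
bound = np.sqrt((1 - beta) / beta)    # ~= 99.995 for beta = 1e-4

mu, nu = 0.0, 1.0                     # nu_0 >= mu_0^2 keeps the variance valid
worst = 0.0
rng = np.random.default_rng(3)
for t in range(10000):
    # mostly small targets, with a huge outlier every 1000 steps
    y = 1e6 if t % 1000 == 999 else rng.normal()
    mu = (1 - beta) * mu + beta * y
    nu = (1 - beta) * nu + beta * y * y
    sigma = np.sqrt(nu - mu * mu)
    worst = max(worst, abs(y - mu) / sigma)
```

Because the statistics adapt to the outlier before the normalized target is formed, the worst case sits just below the bound rather than at the raw magnitude of the outlier.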
It is an open question whether it is uniformly best to normalize by mean and variance. In the appendix we discuss other normalization updates, based on percentiles and minibatches, and derive correspondences between all of these.
2.3 An equivalence for stochastic gradient descent
We now step back and analyze the effect of the magnitude of the errors on the gradients when using regular SGD. This analysis suggests a different normalization algorithm, which has an interesting correspondence to PopArt SGD.
We consider SGD updates for an unnormalized multi-layer function of the form $f_{\theta, W, b}(X) = W h_\theta(X) + b$. The update for the weight matrix $W$ is

$W_t = W_{t-1} - \alpha_t\, \delta_t\, h_{\theta_{t-1}}(X_t)^\top\,,$

where $\delta_t = f_{\theta, W, b}(X_t) - Y_t$ is the gradient of the squared loss, which we here call the unnormalized error. The magnitude of this update depends linearly on the magnitude of the error, which is appropriate when the inputs are normalized, because then the ideal scale of the weights depends linearly on the magnitude of the targets.¹

¹In general, care should be taken that the inputs are well-behaved; this is exactly the point of recent work on input normalization (Ioffe and Szegedy, 2015; Desjardins et al., 2015).
Now consider the SGD update to the lower-layer parameters, $\theta_t = \theta_{t-1} - \alpha_t\, J_t W_{t-1}^\top \delta_t$, where $J_t$ is the Jacobian of $h_\theta$. The magnitudes of both the weights $W$ and the errors $\delta$ depend linearly on the magnitude of the targets. This means that the magnitude of the update for $\theta$ depends quadratically on the magnitude of the targets. There is no compelling reason for these updates to depend at all on these magnitudes, because the weights in the top layer already ensure appropriate scaling. In other words, for each doubling of the magnitudes of the targets, the updates to the lower layers quadruple for no clear reason.
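This quadratic dependence can be verified with a worked example. The sketch below uses a tiny two-layer linear model with illustrative numbers (all names are ours): scaling the targets, and correspondingly the top-layer weight, by a factor of 10 scales the top-layer gradient by 10 but the lower-layer gradient by 100:

```python
# Worked check of the scaling argument, for f(x) = w2 * (w1 * x) with
# squared loss 0.5 * (f(x) - y)^2. When targets (and hence the ideal
# top-layer weight w2) scale by a factor c, the top-layer gradient scales
# by c while the lower-layer gradient scales by c^2.

def grads(w1, w2, x, y):
    """Return (d/dw2, d/dw1) of the squared loss for one sample."""
    h = w1 * x
    delta = w2 * h - y                 # unnormalized error
    return delta * h, delta * w2 * x

x = 1.0
g2_small, g1_small = grads(w1=1.0, w2=1.0, x=x, y=2.0)
# scale the targets by c = 10; suppose the top layer has adapted: w2 -> 10
g2_big, g1_big = grads(w1=1.0, w2=10.0, x=x, y=20.0)
# g2 scales by 10, g1 scales by 100
```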
This analysis suggests an algorithmic solution, which seems to be novel in and of itself, in which we track the magnitudes of the targets in a separate parameter , and then multiply the updates for all lower layers with a factor . A more general version of this for matrix scalings is given in Algorithm 2. We prove an interesting, and perhaps surprising, connection to the PopArt algorithm.
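As with Algorithm 1, Algorithm 2 is not reproduced in this excerpt, so the following is a minimal sketch of the idea for a scalar output: the top-layer update uses the unnormalized error as-is, while the lower-layer update is damped by the tracked squared scale. The architecture, data, and step sizes are illustrative assumptions:

```python
import numpy as np

# Sketch of the "Normalized SGD" idea: track the target scale separately
# and multiply only the lower-layer updates by 1/sigma^2, so the top layer
# is free to adapt to the natural magnitude of the targets.

rng = np.random.default_rng(2)
n_in, n_hid = 3, 6
A = rng.normal(scale=0.5, size=(n_hid, n_in))   # lower layer (damped updates)
w = rng.normal(scale=0.5, size=n_hid)           # top layer (undamped updates)
alpha, beta = 0.02, 1e-2
mu, nu = 0.0, 1.0

for t in range(3000):
    x = rng.normal(size=n_in)
    y = 500.0 * x[1]                      # targets with a large natural scale
    mu = (1 - beta) * mu + beta * y       # track first and second moments
    nu = (1 - beta) * nu + beta * y * y
    sigma2 = max(nu - mu * mu, 1e-8)      # tracked squared scale
    h = np.tanh(A @ x)
    delta = w @ h - y                     # unnormalized error
    w -= alpha * delta * h                # top layer: ordinary SGD update
    A -= (alpha / sigma2) * delta * np.outer(w * (1 - h * h), x)
```

The top-layer weights grow to the scale of the targets, while the lower-layer updates stay of roughly constant magnitude regardless of that scale.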
Proposition 3.
Consider two functions defined by

$f_{\theta, \Sigma, \mu, W, b}(x) = \Sigma\,(W h_\theta(x) + b) + \mu \qquad\text{and}\qquad f_{\theta, W, b}(x) = W h_\theta(x) + b\,,$

where $h_\theta$ is the same differentiable function in both cases, and the functions are initialized identically, using $\Sigma_0 = I$ and $\mu_0 = 0$, and the same initial $\theta_0$, $W_0$ and $b_0$. Consider updating the first function using Algorithm 1 (PopArt-SGD) and the second using Algorithm 2 (Normalized SGD). Then, for any sequence of non-singular scales $\Sigma_t$ and shifts $\mu_t$, the two algorithms are equivalent, in the sense that 1) the sequences $\{\theta_t\}$ are identical, and 2) the outputs of the functions are identical, for any input.
The proposition shows a duality between normalizing the targets, as in Algorithm 1, and changing the updates, as in Algorithm 2. This allows us to gain more intuition about the algorithm. In particular, in Algorithm 2 the updates in the top layer are not normalized, thereby allowing the last linear layer to adapt to the scale of the targets. This is in contrast to other algorithms with some flavor of adaptive normalization, such as RMSprop (Tieleman and Hinton, 2012), AdaGrad (Duchi et al., 2011), and Adam (Kingma and Ba, 2015), which divide each component of the gradient by the square root of an empirical second moment of that component. That said, these methods are complementary, and it is straightforward to combine PopArt with optimization algorithms other than SGD.

3 Binary regression experiments
We first analyze the effect of rare events in online learning, when a much larger target is infrequently observed. Such events can for instance occur when learning from noisy sensors that sometimes capture an actual signal, or when learning from sparse non-zero reinforcements. We empirically compare three variants of SGD: without normalization, with normalization but without preserving outputs precisely (i.e., with ‘Art’, but without ‘Pop’), and with PopArt.
The inputs are binary representations of integers drawn uniformly at random between $0$ and $2^{16} - 1$. The desired outputs are the corresponding integer values. Every 1000 samples, we present the binary representation of $2^{16} - 1$ as input (i.e., all 16 inputs are 1), with $2^{16} - 1 = 65{,}535$ as target. The approximating function is a fully connected neural network with 16 inputs, 3 hidden layers with 10 nodes per layer, and tanh internal activation functions. This simple setup allows extensive sweeps over hyperparameters, to avoid bias towards any algorithm by the way we tune these. The step sizes $\alpha$ for SGD and $\beta$ for the normalization are tuned by a grid search.

Figure 1a shows the root mean squared error (RMSE, log scale) for each of 5000 samples, before updating the function (so this is a test error, not a train error). The solid line is the median of 50 repetitions, and the shaded region covers the 10th to 90th percentiles. The plotted results correspond to the best hyperparameters according to the overall RMSE (i.e., area under the curve). The lines are slightly smoothed by averaging over each 10 consecutive samples.
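The rare-event data stream described above can be sketched as follows (the function names are ours):

```python
import numpy as np

# Sketch of the binary regression data stream: inputs are 16-bit binary
# representations of uniformly random integers, the target is the integer
# itself, and every 1000th sample is the outlier 2**16 - 1 (all bits set).

def binary_features(n, bits=16):
    """Binary representation of n as a float vector, most significant bit first."""
    return np.array([(n >> i) & 1 for i in reversed(range(bits))], dtype=float)

def sample_stream(num_samples, seed=0):
    """Yield (input, target) pairs; every 1000th sample is the rare outlier."""
    rng = np.random.default_rng(seed)
    for t in range(1, num_samples + 1):
        n = 2**16 - 1 if t % 1000 == 0 else int(rng.integers(0, 2**16))
        yield binary_features(n), float(n)

stream = list(sample_stream(1000))
```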
SGD favors a relatively small step size to avoid harmful large updates, but this slows learning on the more common small targets; the error curve is almost flat in between spikes. SGD with adaptive normalization (labeled ‘Art’) can use a larger step size and therefore learns faster, but it has high error after the spikes, because the changing normalization also changes the outputs for the smaller inputs, increasing the errors on these. In comparison, PopArt performs much better. It prefers the same step size as Art, but can exploit a much faster rate β for the statistics. The faster tracking of statistics protects PopArt from the large spikes, while the output preservation avoids invalidating the outputs for the smaller targets. We also ran experiments with RMSprop, but left these out of the figure as the results were very similar to those for SGD.
4 Atari 2600 experiments
An important motivation for this work is reinforcement learning with nonlinear function approximators such as neural networks (sometimes called deep reinforcement learning). The goal is to predict and optimize action values defined as the expected sum of future rewards. These rewards can differ arbitrarily from one domain to the next, and nonzero rewards can be sparse. As a result, the action values can span a varied and wide range which is often unknown before learning commences.
Mnih et al. (2015) combined Q-learning with a deep neural network in an algorithm called DQN, which impressively learned to play many games using a single set of hyperparameters. However, as discussed above, to handle the different reward magnitudes with a single system all rewards were clipped to the interval $[-1, 1]$. This is harmless in some games, such as Pong where no reward is ever higher than 1 or lower than $-1$, but it is not satisfactory, as the heuristic introduces specific domain knowledge: that optimizing reward frequencies is approximately as useful as optimizing the total score. In addition, the clipping makes the DQN algorithm blind to differences between certain actions, such as the difference in reward between eating a ghost and eating a pellet in Ms. Pac-Man. We hypothesize 1) that overall performance decreases when we turn off clipping, because it is then not possible to tune a single step size that works well on many games, and 2) that we can regain much of the lost performance by using PopArt. The goal is not to improve state-of-the-art performance, but to remove the domain-dependent heuristic induced by the clipping of the rewards, thereby exposing the true rewards to the learning algorithm.
We ran the Double DQN algorithm (van Hasselt et al., 2016) in three versions: without changes; without clipping of either rewards or temporal-difference errors; and without clipping but additionally using PopArt. The targets are the sum of a reward and the discounted value at the next state:

$Y_t = R_{t+1} + \gamma\, Q\!\left(S_{t+1}, \operatorname*{argmax}_{a} Q(S_{t+1}, a; \theta_t);\, \theta_t^-\right)\,, \qquad (5)$

where $Q(s, a; \theta)$ is the estimated action value of action $a$ in state $s$ according to current parameters $\theta$, and where $\theta_t^-$ is a more stable periodic copy of these parameters (cf. Mnih et al., 2015; van Hasselt et al., 2016, for more details). This is a form of Double Q-learning (van Hasselt, 2010, 2011). We roughly tuned the main step size and the step size for the normalization. It is not straightforward to tune the unclipped version, for reasons that will become clear soon.
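The target in (5) can be sketched as follows; `q_online_next` and `q_target_next` stand in for the two networks' action values at the next state, and all numbers are illustrative:

```python
import numpy as np

# Sketch of the Double Q-learning target of equation (5): select the greedy
# action with the online network, evaluate it with the target network, and
# (with PopArt) learn on the normalized version of the resulting target.

def double_q_target(reward, q_online_next, q_target_next, gamma=0.99):
    """Unnormalized Double Q-learning target for one transition."""
    a_star = int(np.argmax(q_online_next))       # select with online network
    return reward + gamma * q_target_next[a_star]  # evaluate with target net

def normalized_target(y, mu, sigma):
    """The quantity PopArt actually regresses toward."""
    return (y - mu) / sigma

q_online_next = np.array([1.0, 3.0, 2.0])        # illustrative action values
q_target_next = np.array([0.5, 2.0, 4.0])
y = double_q_target(reward=10.0, q_online_next=q_online_next,
                    q_target_next=q_target_next)  # 10 + 0.99 * 2.0
```

Note that selection and evaluation use different parameter sets, which is exactly what distinguishes Double Q-learning from the original Q-learning target.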
Figure 1b shows the norm of the gradient of Double DQN during learning, as a function of the number of training steps. The left plot corresponds to no reward clipping, the middle to clipping (as used in the original DQN and Double DQN), and the right to using PopArt instead of clipping. Each faint dashed line corresponds to the median norm (taken over time) on one game. The shaded areas cover increasing fractions of games.
Without clipping the rewards, PopArt produces a much narrower band within which the gradients fall. Across games, most median norms range over less than two orders of magnitude (roughly between 1 and 20), compared to almost four orders of magnitude for clipped Double DQN, and more than six orders of magnitude for unclipped Double DQN without PopArt. The wide range for the latter shows why it is impossible to find a single suitable step size with neither clipping nor PopArt: the updates are either far too small on some games or far too large on others.
After 200M frames, we evaluated the actual scores of the best performing agent in each game on 100 episodes of up to 30 minutes of play, and then normalized by human and random scores as described by Mnih et al. (2015). Figure 1 shows the differences in normalized scores between (clipped) Double DQN and Double DQN with PopArt.
The main eyecatching result is that the distribution in performance drastically changed. On some games (e.g., Gopher, Centipede) we observe dramatic improvements, while on other games (e.g., Video Pinball, Star Gunner) we see a substantial decrease. For instance, in Ms. PacMan the clipped Double DQN agent does not care more about ghosts than pellets, but Double DQN with PopArt learns to actively hunt ghosts, resulting in higher scores. Especially remarkable is the improved performance on games like Centipede and Gopher, but also notable is a game like Frostbite which went from below 50% to a nearhuman performance level. Raw scores can be found in the appendix.
Some games fare worse with unclipped rewards because the clipping changes the nature of the problem. For instance, in Time Pilot the PopArt agent learns to quickly shoot a mothership to advance to the next level of the game, obtaining many points in the process. The clipped agent instead shoots at anything that moves, ignoring the mothership.² However, in the long run more points are scored in this game with the safer and more homogeneous strategy of the clipped agent. One reason for the disconnect between the seemingly qualitatively good behavior and the lower scores is that the agents are fairly myopic: both use a discount factor of $\gamma = 0.99$, and therefore only optimize rewards that happen within a dozen or so seconds into the future.

²A video is included in the supplementary material.
On the whole, the results show that with PopArt we can successfully remove the clipping heuristic that has been present in all prior DQN variants, while retaining overall performance levels. Double DQN with PopArt performs slightly better than Double DQN with clipped rewards: on 32 out of 57 games performance is at least as good as clipped Double DQN and the median (+0.4%) and mean (+34%) differences are positive.
5 Discussion
We have demonstrated that PopArt can be used to adapt to different and nonstationary target magnitudes. This problem was perhaps not previously commonly appreciated, potentially because in deep learning it is common to tune or normalize a priori, using an existing data set. This is not as straightforward in reinforcement learning when the policy and the corresponding values may repeatedly change over time. This makes PopArt a promising tool for deep reinforcement learning, although it is not specific to this setting.
We saw that PopArt can successfully replace the clipping of rewards as done in DQN to handle the various magnitudes of the targets used in the Qlearning update. Now that the true problem is exposed to the learning algorithm we can hope to make further progress, for instance by improving the exploration (Osband et al., 2016), which can now be informed about the true unclipped rewards.
References
 Amari (1998) S. I. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998. ISSN 08997667.
 Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253–279, 2013.
 Bellemare et al. (2016) M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016.
 Bergstra and Bengio (2012) J. Bergstra and Y. Bengio. Random search for hyperparameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.
 Bergstra et al. (2011) J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
 Desjardins et al. (2015) G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070, 2015.
 Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
 Efron (1991) B. Efron. Regression percentiles using asymmetric squared error loss. Statistica Sinica, 1(1):93–125, 1991.

 Hochreiter (1998) S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2):107–116, 1998.
 Ioffe and Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Kushner and Yin (2003) H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 05 2015.
 Liang et al. (2016) Y. Liang, M. C. Machado, E. Talvitie, and M. H. Bowling. State of the art control of atari games using shallow reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, 2016.
 Martens and Grosse (2015) J. Martens and R. B. Grosse. Optimizing neural networks with kroneckerfactored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2408–2417, 2015.
 McCulloch and Pitts (1943) W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
 Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. (2016) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
 Newey and Powell (1987) W. K. Newey and J. L. Powell. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, pages 819–847, 1987.
 Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. CoRR, abs/1602.04621, 2016.
 Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
 Rosenblatt (1962) F. Rosenblatt. Principles of Neurodynamics. Spartan, New York, 1962.

 Ross et al. (2013) S. Ross, P. Mineiro, and J. Langford. Normalized online learning. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.
 Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
 Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.
 Schmidhuber (2015) J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
 Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 Tieleman and Hinton (2012) T. Tieleman and G. Hinton. Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 van Hasselt (2010) H. van Hasselt. Double Qlearning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.
 van Hasselt (2011) H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.
 van Hasselt et al. (2016) H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with Double Qlearning. AAAI, 2016.
 Wang et al. (2016) Z. Wang, N. de Freitas, T. Schaul, M. Hessel, H. van Hasselt, and M. Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, New York, NY, USA, 2016.
 Watkins (1989) C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
Appendix
In this appendix, we introduce and analyze several extensions and variations, including normalizing based on percentiles or minibatches. Additionally, we prove all propositions in the main text and the appendix.
Experiment setup
For the experiments described in Section 4 of the main paper, we closely followed the setup described in Mnih et al. [2015] and van Hasselt et al. [2016]. In particular, the Double DQN algorithm is identical to that described by van Hasselt et al. The shown results were obtained by running the trained agent for 30 minutes of simulated play (108,000 frames). This was repeated 100 times, where diversity over different runs was ensured by a small probability of exploration on each step (ε-greedy exploration), as well as by performing up to 30 ‘no-op’ actions, as also used and described by Mnih et al. In summary, the evaluation setup was the same as used by Mnih et al., except that we allowed more evaluation time per game (30 minutes instead of 5 minutes), as also used by Wang et al. [2016].
The results in Figure 2 were obtained by normalizing the raw scores: we first subtract the score of a random agent, and then divide by the absolute difference between human and random scores, such that

$\text{score}_{\text{normalized}} = \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\left|\text{score}_{\text{human}} - \text{score}_{\text{random}}\right|}\,.$
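In code, this normalization is a one-liner (the example scores are made up):

```python
# Sketch of the human-normalized score described above: 0 corresponds to
# random play and 1 to human-level play. All numbers are illustrative.

def normalized_score(agent, random, human):
    """Human-normalized score of an agent's raw game score."""
    return (agent - random) / abs(human - random)

score = normalized_score(agent=800.0, random=200.0, human=1200.0)  # 0.6
```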
The raw scores are given below, in Table 1.
Generalizing normalization by variance
We can change the variance of the normalized targets to influence the magnitudes of the updates. For a desired standard deviation of $\psi$, we can use
$$\tilde{Y}_t = \psi\,\sigma_t^{-1}(Y_t - \mu_t)\,,$$
with the updates for $\mu_t$ and $\sigma_t$ as normal. It is straightforward to show that a generalization of Proposition 2 then holds with a bound of
$$\left|\psi\,\sigma_t^{-1}(Y_t - \mu_t)\right| \le \psi\sqrt{\frac{1-\beta_t}{\beta_t}}\,.$$
This additional parameter is, for instance, useful when we desire fast tracking in nonstationary problems. We then want a large step size $\beta$, but without risking overly large updates.
The new parameter $\psi$ may seem superfluous, because increasing the normalization step size $\beta$ also reduces the hard bounds on the normalized targets. However, $\psi$ additionally influences the distribution of the normalized targets. The histograms in the leftmost plot in Figure 2 show what happens when we try to limit the magnitudes using only $\beta$. The red histogram shows normalized targets where the unnormalized targets come from a normal distribution, shown in blue. The normalized targets are contained in $[-\sqrt{(1-\beta)/\beta},\, \sqrt{(1-\beta)/\beta}]$, but the distribution is very non-normal even though the actual targets are normal. Conversely, the red histogram in the middle plot shows that the distribution remains approximately normal if we instead use $\psi$ to reduce the magnitudes. The right plot shows the effect on the variance of the normalized targets for either approach. When we change $\beta$ while keeping $\psi$ fixed, the variance of the normalized targets can drop far below the desired variance of one (magenta curve). When we instead change $\psi$ while keeping $\beta$ fixed, the variance remains predictably at approximately $\psi^2$ (black line). The difference in behavior of the resulting normalization demonstrates that $\psi$ gives us a potentially useful additional degree of freedom.
Sometimes, we can simply roll the additional scaling into the step size, such that without loss of generality we can use $\psi = 1$ and decrease the step size to avoid overly large updates. However, sometimes it is easier to separate the magnitude of the targets, as influenced by $\psi$, from the magnitude of the updates, for instance when using an adaptive step-size algorithm. In addition, the introduction of an explicit scaling allows us to make some interesting connections to normalization by percentiles, in the next section.
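As an illustrative sketch (not the exact implementation used in the experiments), the running statistics and the scaled normalization can be written as:

```python
import math

def update_stats(mu, nu, y, beta):
    """One step of the exponentially weighted first- and second-moment updates."""
    mu = (1.0 - beta) * mu + beta * y
    nu = (1.0 - beta) * nu + beta * y * y
    sigma = math.sqrt(max(nu - mu * mu, 1e-12))  # clamp for numerical safety
    return mu, nu, sigma

def normalize(y, mu, sigma, psi=1.0):
    """Normalize a target; psi is the desired standard deviation."""
    return psi * (y - mu) / sigma
```

With $\psi = 1$ and $\beta = 0.1$, every target normalized by the statistics that include it is bounded in absolute value by $\sqrt{0.9/0.1} = 3$, in line with Proposition 2.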
Adaptive normalization by percentiles
Instead of normalizing by mean and variance, we can normalize such that a given ratio $p$ of normalized targets falls inside the predetermined interval $[-1, 1]$. The per-output objective is then to find $\mu$ and $\sigma$ such that
$$P\big(Y < \mu - \sigma\big) = \frac{1-p}{2} \quad\text{and}\quad P\big(Y > \mu + \sigma\big) = \frac{1-p}{2}\,,$$
so that a proportion $p$ of the normalized targets $\sigma^{-1}(Y - \mu)$ lies in $[-1, 1]$.
For normally distributed targets, there is a direct correspondence to normalizing by means and variance.
Proposition 4.
If scalar targets $Y$ are distributed according to a normal distribution with arbitrary finite mean and variance, then the percentile objective with ratio $p$ is equivalent to the joint objective $\mu = E[Y]$ and $\sigma = \psi^{-1}\sqrt{E[(Y-\mu)^2]}$, with
$$\psi = \frac{1}{\Phi^{-1}\!\left(\frac{1+p}{2}\right)}\,,$$
where $\Phi$ denotes the cumulative distribution function of a standard normal distribution.

For example, percentiles of $p = 0.9$ and $p = 0.99$ correspond to $\psi \approx 0.61$ and $\psi \approx 0.39$, respectively. Conversely, $\psi = 1$ corresponds to $p \approx 0.68$. This equivalence only applies when the targets are normally distributed. For other distributions the two forms of normalization differ, even in terms of their objectives.
We now discuss a concrete algorithm to obtain normalization by percentiles. Let $Y_t^{(1)} \le \ldots \le Y_t^{(t)}$ denote the order statistics of the targets up to time $t$, such that $Y_t^{(1)} = \min_{i \le t} Y_i$ and $Y_t^{(t)} = \max_{i \le t} Y_i$ (for non-integer indices $k$, $Y_t^{(k)}$ can be defined by either rounding $k$ to an integer or, perhaps more appropriately, by linear interpolation between the values at the two nearest integers). For notational simplicity, define $l_t = Y_t^{(t(1-p)/2)}$ and $u_t = Y_t^{(t(1+p)/2)}$. Then, for data up to time $t$, the goal is
$$\mu_t - \sigma_t = l_t \quad\text{and}\quad \mu_t + \sigma_t = u_t\,.$$
Solving for $\mu_t$ and $\sigma_t$ gives
$$\mu_t = \frac{u_t + l_t}{2} \quad\text{and}\quad \sigma_t = \frac{u_t - l_t}{2}\,.$$
In the special case where $p = 1$ we get $l_t = Y_t^{(1)}$ and $u_t = Y_t^{(t)}$. We are then guaranteed that all normalized targets fall in $[-1, 1]$, but this could result in an overly conservative normalization that is sensitive to outliers and may reduce the overall magnitude of the updates too far. In other words, learning will then be safe in the sense that no updates will be too big, but it may be slow because many updates may be very small. In general it is typically better to use a ratio $p < 1$.

Exact order statistics are hard to compute online, because we would need to store all previous targets. To obtain more memory-efficient online updates for percentiles, we can store two values $l$ and $u$, which should eventually have the property that a proportion $(1-p)/2$ of values is larger than $u$ and a proportion $(1-p)/2$ of values is smaller than $l$, such that
$P\big(Y < l\big) = \frac{1-p}{2} \quad\text{and}\quad P\big(Y > u\big) = \frac{1-p}{2}\,.$  (6)
This can be achieved asymptotically by updating $l_t$ and $u_t$ according to
$u_t = u_{t-1} + \eta\left(\mathbb{I}\{Y_t > u_{t-1}\} - \tfrac{1-p}{2}\right)$  (7)
and
$l_t = l_{t-1} - \eta\left(\mathbb{I}\{Y_t < l_{t-1}\} - \tfrac{1-p}{2}\right)\,,$  (8)
where the indicator function $\mathbb{I}\{\cdot\}$ is equal to one when its argument is true and equal to zero otherwise.
Proposition 5.
If the step sizes $\eta_t$ satisfy the standard stochastic-approximation conditions $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, then the updates (7) and (8) converge to values $l$ and $u$ for which (6) holds.

If the step size is too small, it will take a long time for the updates to converge to appropriate values. In practice, it might be better to let the magnitude of the steps depend on the actual errors, such that the update takes the form of an asymmetrical least-squares update [Newey and Powell, 1987, Efron, 1991].
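A minimal sketch of the online percentile updates (7) and (8), with a constant step size for simplicity:

```python
import random

def update_bounds(l, u, y, p, eta):
    """One step of the stochastic percentile updates (7) and (8)."""
    tail = (1.0 - p) / 2.0  # desired proportion of targets in each tail
    u = u + eta * ((1.0 if y > u else 0.0) - tail)
    l = l - eta * ((1.0 if y < l else 0.0) - tail)
    return l, u

# Track the 10th and 90th percentiles of a uniform stream (p = 0.8).
rng = random.Random(0)
l, u = 0.5, 0.5
for _ in range(200000):
    l, u = update_bounds(l, u, rng.random(), p=0.8, eta=0.005)
```

In expectation the update on $u$ is zero exactly when $P(Y > u) = (1-p)/2$, so the fixed points satisfy (6); for uniform data on $[0, 1]$ and $p = 0.8$, the values hover around $l \approx 0.1$ and $u \approx 0.9$.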
Online learning with minibatches
Online normalization by mean and variance with minibatches of size $m$ can be achieved by using the updates
$$\mu_t = (1-\beta_t)\,\mu_{t-1} + \frac{\beta_t}{m}\sum_{i=1}^{m} Y_t^i \quad\text{and}\quad \nu_t = (1-\beta_t)\,\nu_{t-1} + \frac{\beta_t}{m}\sum_{i=1}^{m} \big(Y_t^i\big)^2\,.$$
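As a sketch, assuming each minibatch contributes its sample averages of the targets and their squares:

```python
def minibatch_update(mu, nu, batch, beta):
    """Update running moments using the minibatch averages of Y and Y^2."""
    m = len(batch)
    mu = (1.0 - beta) * mu + beta * sum(batch) / m
    nu = (1.0 - beta) * nu + beta * sum(y * y for y in batch) / m
    return mu, nu
```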
Another interesting possibility is to update $l_t$ and $u_t$ towards the extremes of the minibatch, such that
$l_t = l_{t-1} + \eta\big(\min_i Y_t^i - l_{t-1}\big) \quad\text{and}\quad u_t = u_{t-1} + \eta\big(\max_i Y_t^i - u_{t-1}\big)\,,$  (9)
and then use
$$\mu_t = \frac{u_t + l_t}{2} \quad\text{and}\quad \sigma_t = \frac{u_t - l_t}{2}\,.$$
The statistics of this normalization depend on the size of the minibatches, and there is an interesting correspondence to normalization by percentiles.
Proposition 6.
Consider minibatches of size $m$ whose elements are drawn i.i.d. from a uniform distribution with support on $[a, b]$. If $\eta_t \to 0$ and $\sum_t \eta_t = \infty$, then in the limit the updates (9) converge to values such that (6) holds, with $p = \frac{m-1}{m+1}$.

This fact connects the online minibatch updates (9) to normalization by percentiles. For instance, a minibatch size of $m = 32$ would correspond roughly to online percentile updates with $p = 31/33 \approx 0.94$ and, by Proposition 4, to a normalization by mean and variance with a $\psi \approx 0.53$. These different normalizations are not strictly equivalent, but may behave similarly in practice.
Proposition 6 quantifies an interesting correspondence between minibatch updates and normalizing by percentiles. Although the fact as stated holds only for uniform targets, the proportion of normalized targets in the interval $[-1, 1]$ more generally becomes larger when we increase the minibatch size, just as when we increase $p$ or decrease $\psi$, potentially resulting in better robustness to outliers at the possible expense of slower learning.
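The correspondence can be checked with a small simulation of the updates (9) on uniform data: the bounds converge towards the expected minimum and maximum of a minibatch, $1/(m+1)$ and $m/(m+1)$ for the unit interval, so the fraction of mass between them approaches $(m-1)/(m+1)$:

```python
import random

def track_extremes(m, eta, steps, seed=0):
    """Run the minibatch-extreme updates (9) on uniform [0, 1] data."""
    rng = random.Random(seed)
    l, u = 0.5, 0.5
    for _ in range(steps):
        batch = [rng.random() for _ in range(m)]
        l += eta * (min(batch) - l)
        u += eta * (max(batch) - u)
    return l, u

# For m = 9: l -> 1/10, u -> 9/10, so the covered mass approaches 0.8.
l, u = track_extremes(m=9, eta=0.01, steps=50000)
```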
A note on initialization
When using constant step sizes, it is useful to be aware of the start of learning, so that we trust the data rather than arbitrary initial values. This can be done by using a step size as defined in the following proposition.
Proposition 7.
Consider a recency-weighted running average $x_t$, updated from a stream of data $\{Y_t\}$ using $x_t = (1 - \beta_t)\,x_{t-1} + \beta_t Y_t$, with $\beta_t$ defined by
$\beta_t = \frac{\beta}{1 - (1-\beta)^t}\,.$  (10)
Then 1) the relative weights of the data in $x_t$ are the same as when using a constant step size $\beta$, and 2) the estimate does not depend on the initial value $x_0$.
A similar result was derived to remove the effect of the initialization of certain parameters by Kingma and Ba [2014] for a stochastic optimization algorithm called Adam. In that work, the initial values are assumed to be zero and a standard exponentially weighted average is explicitly computed and stored, and then divided by a term analogous to $1 - (1-\beta)^t$. The step size (10) corrects for any initialization in place, without storing auxiliary variables, but otherwise the method and its motivation are very similar.
Alternatively, it is possible to initialize the normalization safely by choosing a scale that is relatively high initially. This can be beneficial when the targets are at first relatively small and noisy. If we were to use the step size in (10), the updates would treat these initial observations as important, and would try to fit our approximating function to the noise. A high initialization of the scale would instead reduce the effect of the first targets on the learning updates, and would instead use these only to find an appropriate normalization. Only after finding this normalization would the actual learning commence.
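A sketch of the initialization-corrected step size (10); because $\beta_1 = 1$, the first observation fully overwrites the initial value, so two runs with different initializations produce the same estimate:

```python
def corrected_step_size(beta, t):
    """Step size (10): removes any dependence on the initial value."""
    return beta / (1.0 - (1.0 - beta) ** t)

def running_average(x0, ys, beta):
    """Recency-weighted average using the corrected step sizes."""
    x = x0
    for t, y in enumerate(ys, start=1):
        b = corrected_step_size(beta, t)
        x = (1.0 - b) * x + b * y
    return x
```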
Deep PopArt
Sometimes it makes sense to apply the normalization not to the output of the network, but at a lower level. For instance, the output of a neural network with a softmax on top can be written
$$f(x) = \operatorname{softmax}\big(W h(x) + b\big)\,,$$
where $W$ is the weight matrix of the last linear layer before the softmax. The actual outputs are already normalized by using the softmax, but the outputs of the layer below the softmax may still benefit from normalization. To determine the targets to be normalized, we can either backpropagate the gradient of our loss through the softmax or invert the softmax function.
More generally, we can consider applying normalization at any level of a hierarchical nonlinear function. This seems a promising way to counteract undesirable characteristics of backpropagating gradients, such as vanishing or exploding gradients [Hochreiter, 1998].
In addition, normalizing gradients further down in a network can provide a straightforward way to combine gradients from different sources in more complex network graphs than a standard feedforward multilayer network. First, the normalization allows us to normalize the gradient from each source separately before merging them, thereby preventing one source from fully drowning out any others and allowing us to weight the gradients by their actual relative importance, rather than implicitly relying on the current magnitude of each as a proxy. Second, the normalization can prevent undesirably large gradients when many gradients come together at one point of the graph, by normalizing again after merging.
Proofs
Proposition 1.
Consider a function $f_{\theta, \Sigma, \mu, W, b}$ defined by
$$f_{\theta, \Sigma, \mu, W, b}(x) = \Sigma\big(W h_\theta(x) + b\big) + \mu\,,$$
where $h_\theta$ is any nonlinear function of $x \in \mathbb{R}^n$, $\Sigma$ is a $k \times k$ matrix, $\mu$ and $b$ are $k$-element vectors, and $W$ is a $k \times m$ matrix. Consider any change of the scale and shift parameters from $\Sigma$ to $\Sigma_{\text{new}}$ and from $\mu$ to $\mu_{\text{new}}$, where $\Sigma_{\text{new}}$ is nonsingular. If we then additionally change the parameters $W$ and $b$ to $W_{\text{new}}$ and $b_{\text{new}}$, defined by
$$W_{\text{new}} = \Sigma_{\text{new}}^{-1}\Sigma W \quad\text{and}\quad b_{\text{new}} = \Sigma_{\text{new}}^{-1}\big(\Sigma b + \mu - \mu_{\text{new}}\big)\,,$$
then the outputs of the unnormalized function are preserved precisely in the sense that
$$f_{\theta, \Sigma_{\text{new}}, \mu_{\text{new}}, W_{\text{new}}, b_{\text{new}}}(x) = f_{\theta, \Sigma, \mu, W, b}(x)\,, \quad \forall x\,.$$
Proof.
The stated result follows from
$$\Sigma_{\text{new}}\big(W_{\text{new}} h_\theta(x) + b_{\text{new}}\big) + \mu_{\text{new}} = \Sigma W h_\theta(x) + \Sigma b + \mu - \mu_{\text{new}} + \mu_{\text{new}} = \Sigma\big(W h_\theta(x) + b\big) + \mu\,. \qquad ∎$$
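The preservation stated in Proposition 1 can be verified numerically; here is a scalar sketch ($k = 1$), with an arbitrary fixed nonlinearity standing in for $h_\theta$:

```python
def f(x, w, b, sigma, mu):
    """Scalar version of the normalized function: sigma * (w * h(x) + b) + mu."""
    h = x * x + 1.0  # arbitrary fixed nonlinearity standing in for h_theta
    return sigma * (w * h + b) + mu

def rescale(w, b, sigma, mu, sigma_new, mu_new):
    """Compensate W and b so the unnormalized outputs are preserved."""
    w_new = sigma * w / sigma_new
    b_new = (sigma * b + mu - mu_new) / sigma_new
    return w_new, b_new

# Change the scale and shift, compensating the last layer accordingly.
w, b, sigma, mu = 0.7, -0.3, 2.0, 5.0
sigma_new, mu_new = 0.25, -1.0
w_new, b_new = rescale(w, b, sigma, mu, sigma_new, mu_new)
```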
Proposition 2.
When using updates (4) to adapt the normalization parameters $\mu_t$ and $\sigma_t$, the normalized target is bounded for all $t$ by
$$\left|\sigma_t^{-1}\big(Y_t - \mu_t\big)\right| \le \sqrt{\frac{1-\beta_t}{\beta_t}}\,.$$
Proof.
The inequality follows from the fact that $\sigma_t^2 \ge \beta_t(1-\beta_t)\big(Y_t - \mu_{t-1}\big)^2$, which holds because $\nu_{t-1} \ge \mu_{t-1}^2$, together with $Y_t - \mu_t = (1-\beta_t)\big(Y_t - \mu_{t-1}\big)$, so that
$$\frac{\big(Y_t - \mu_t\big)^2}{\sigma_t^2} \le \frac{(1-\beta_t)^2\big(Y_t - \mu_{t-1}\big)^2}{\beta_t(1-\beta_t)\big(Y_t - \mu_{t-1}\big)^2} = \frac{1-\beta_t}{\beta_t}\,. \qquad ∎$$
Proposition 3.
Consider two functions defined by
$$f_{\theta, \Sigma, \mu, W, b}(x) = \Sigma\big(W h_\theta(x) + b\big) + \mu \quad\text{and}\quad f_{\theta, W, b}(x) = W h_\theta(x) + b\,,$$
where $h_\theta$ is the same differentiable function in both cases, and the functions are initialized identically, using $\Sigma_0 = I$ and $\mu_0 = 0$, and the same initial $\theta_0$, $W_0$ and $b_0$. Consider updating the first function using Algorithm 1 and the second using Algorithm 2. Then, for any sequence of nonsingular scales $\{\Sigma_t\}$ and shifts $\{\mu_t\}$, the algorithms are equivalent in the sense that 1) the sequences $\{\theta_t\}$ are identical, and 2) the outputs of the functions are identical, for any input.
Proof.
Let $\theta_t$ and $\tilde{\theta}_t$ denote the parameters of $h$ for Algorithms 1 and 2, respectively. Similarly, let $W_t$ and $b_t$ be the parameters of the first function, while $\tilde{W}_t$ and $\tilde{b}_t$ are the parameters of the second function. It is enough to show that single updates of both Algorithms 1 and 2 from the same starting points have equivalent results. That is, if
$$\theta_{t-1} = \tilde{\theta}_{t-1} \quad\text{and}\quad \Sigma_{t-1}\big(W_{t-1} h(x) + b_{t-1}\big) + \mu_{t-1} = \tilde{W}_{t-1} h(x) + \tilde{b}_{t-1}\,, \quad \forall x\,,$$
then it must follow that
$$\theta_t = \tilde{\theta}_t \quad\text{and}\quad \Sigma_t\big(W_t h(x) + b_t\big) + \mu_t = \tilde{W}_t h(x) + \tilde{b}_t\,, \quad \forall x\,,$$
where the quantities $\tilde{\theta}_t$, $\tilde{W}_t$, and $\tilde{b}_t$ are updated with Algorithm 2 and the quantities $\theta_t$, $W_t$, and $b_t$ are updated with Algorithm 1. We do not require $W_t = \tilde{W}_t$ or $b_t = \tilde{b}_t$, and indeed these quantities will generally differ.
We use the shorthands $f_t$ and $g_t$ for the first and second function at time $t$, respectively. First, we show that $\theta_t = \tilde{\theta}_t$ for all $t$. For $t = 0$, this holds trivially because $\theta_0 = \tilde{\theta}_0$, $W_0 = \tilde{W}_0$ and $b_0 = \tilde{b}_0$. Now assume that $\theta_{t-1} = \tilde{\theta}_{t-1}$. Let $\delta_t$ be the unnormalized error at time $t$. Then, Algorithm 1 results in
Similarly, and if then