1 Introduction
Deep reinforcement learning (RL) agents have achieved impressive results in recent years, tackling long-standing challenges in board games [38, 39], video games [23, 26, 45] and robotics [27]. At the same time, their learning dynamics are notoriously complex. In contrast with supervised learning (SL), these algorithms operate on highly non-stationary data distributions that are coupled with the agent's performance: an incompetent agent will not generate much relevant training data. This paper identifies a problematic, hitherto unnamed issue with the learning dynamics of RL systems under function approximation (FA). We focus on the case where the learning objective can be decomposed into multiple components, $J(\theta) = \sum_k J_k(\theta)$.
Although not always explicit, complex tasks commonly possess this property. For example, it arises when learning about multiple tasks or contexts, when using multiple starting points, in the presence of multiple opponents, and in domains that contain decision points with bifurcating dynamics. Sharing knowledge or representations between components can be beneficial in terms of skill reuse or generalization, and sharing seems essential to scale to complex domains. A common mechanism for sharing is a shared function approximator (e.g., a neural network). In general, however, the different components do not coordinate and so may compete for resources, resulting in interference.

The core insight of this paper is that problematic learning dynamics arise when combining two algorithm properties that are relatively innocuous in isolation: the interference caused by the different components, and the coupling of the learning signal to the future behaviour of the algorithm. Combining these properties leads to a phenomenon we call 'ray interference' (so called due to the tentative resemblance of Figure 1, top left, to a batoidea, or ray fish):
A learning system suffers from plateaus if it has (a) negative interference between the components of its objective, and (b) coupled performance and learning progress.
Intuitively, the reason for the problem is that negative interference creates winner-take-all (WTA) regions, where improvement in one component forces learning to stall or regress for the other components of the objective. Only after such a dominant component is learned (and its gradient vanishes) will the system start learning about the next component. This is where the coupling of learning and performance has its insidious effect: this new stage of learning is very slow, resulting in a long plateau for overall performance (see Figure 1).
We believe aspect (a) is common when using neural networks, which are known to exhibit negative interference when sharing representations between arbitrary tasks. On-policy RL exhibits coupling between performance and learning progress (b) because the improving behavior policy generates the future training data. Thus ray interference is likely to appear in the learning dynamics of on-policy deep RL agents.
2 Minimal, explicit setting
A minimal example that exhibits ray interference is a bandit problem: a deterministic contextual bandit with $n$ contexts and $n$ discrete actions. This problem setting eliminates the confounding RL elements of bootstrapping, temporal structure, and stochastic rewards. It also permits a purely analytic treatment of the learning dynamics.
The bandit reward function is one when the action matches the context and zero otherwise: $r(s_k, a) = \mathbb{1}_{a = a_k}$, where $\mathbb{1}$ is the indicator function. The policy is a softmax $\pi(a|s) \propto e^{l(s,a)}$, where $l(s,a)$ are logits. The mapping from context to logits is parameterised by the trainable weights $\theta$. The expected performance is the sum across contexts, $J(\theta) = \sum_k J_k(\theta)$ with $J_k := \pi(a_k|s_k)$.

A simple way to train is to follow the policy gradient in REINFORCE [46]. For a context-action-reward sample $(s_k, a, r)$, this yields $\Delta\theta = \alpha\, r\, \nabla_\theta \log \pi(a|s_k)$. When samples are generated on-policy according to $\pi$, then the expected update in context $k$ is $\mathbb{E}_{a \sim \pi}[\Delta\theta] = \alpha\, \pi(a_k|s_k)\, \nabla_\theta \log \pi(a_k|s_k) = \alpha\, \nabla_\theta J_k$.
Interference can arise from function-approximation parameters that are shared across contexts. To see this, represent the logits as $l(s) = W^\top s + b$, where $W$ is an $n \times n$ matrix, $b$ is an $n$-dimensional vector, and $s$ is a one-hot vector representing the current context. Note that each context uses a different row of $W$, hence no component of $W$ is shared among contexts. However, $b$ is shared among contexts, and this is sufficient for interference, defined as follows:

Definition 1 (Interference)
We quantify the degree of interference between two components $J_k$ and $J_{k'}$ of the objective as the cosine similarity between their gradients:
$$\rho(J_k, J_{k'}) := \frac{\langle \nabla_\theta J_k, \nabla_\theta J_{k'} \rangle}{\|\nabla_\theta J_k\|\,\|\nabla_\theta J_{k'}\|}.$$
Qualitatively, $\rho = 0$ implies no interference, $\rho > 0$ positive transfer, and $\rho < 0$ negative transfer (interference). $\rho$ is bounded between $-1$ and $1$.
2.1 Explicit dynamics for a bandit
For the two-dimensional case with $n = 2$ contexts and $n = 2$ arms, we can visualize and express the full learning dynamics exactly (for expected updates with an infinite batch size), in the continuous-time limit of small step sizes. First, we clarify our use of two kinds of derivatives that simplify our presentation. The operator $\nabla$ describes a partial derivative and is usually taken with respect to the parameters $\theta$. We omit the subscript whenever it is unambiguous, e.g., $\nabla J_k := \nabla_\theta J_k$. Next, the 'overdot' notation for a function $f$ denotes its temporal derivative when following the gradient of $J$ with respect to $\theta$, namely $\dot f := \langle \nabla_\theta f, \nabla_\theta J \rangle$.
Let $J_1 := \pi(a_1|s_1)$ and $J_2 := \pi(a_2|s_2)$. For two actions, the softmax can be simplified to a sigmoid $\sigma(x) := (1 + e^{-x})^{-1}$. By removing redundant parameters, we obtain:
$$J_1 = \sigma(\theta_1 + b), \qquad J_2 = \sigma(\theta_2 - b),$$
where $\theta = (\theta_1, \theta_2, b)^\top$. This yields the following gradients:
$$\nabla J_1 = J_1(1-J_1)\,(1, 0, 1)^\top, \qquad \nabla J_2 = J_2(1-J_2)\,(0, 1, -1)^\top.$$
From these we compute the degree of interference
$$\rho(J_1, J_2) = \frac{\langle \nabla J_1, \nabla J_2 \rangle}{\|\nabla J_1\|\,\|\nabla J_2\|} = -\frac{1}{2},$$
which implies a substantial amount of negative interference at all points in parameter space.
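This value can be checked numerically; the sketch below (our own, using the reduced parameterisation above, at an arbitrary evaluation point) computes the cosine similarity between the two analytic gradients:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grads(t1, t2, b):
    # analytic gradients of J1 = sigmoid(t1 + b) and J2 = sigmoid(t2 - b)
    # with respect to the parameters (t1, t2, b)
    J1, J2 = sigmoid(t1 + b), sigmoid(t2 - b)
    g1 = [J1 * (1 - J1), 0.0,  J1 * (1 - J1)]
    g2 = [0.0, J2 * (1 - J2), -J2 * (1 - J2)]
    return g1, g2

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

g1, g2 = grads(0.3, -1.2, 0.7)   # any point gives the same value
rho = cosine(g1, g2)
```

The cosine is $-\tfrac12$ everywhere because only the shared bias $b$ contributes to both gradients, with opposite signs.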
2.2 Dynamical system
The learning dynamics follow the full gradient $\nabla J = \nabla J_1 + \nabla J_2$, and we examine them in the limit of small step sizes $\alpha$, in the coordinate system given by $J_1$ and $J_2$. The directional derivatives along $\nabla J$ for the two components $J_1$ and $J_2$ are:
$$\dot J_1 = \langle \nabla J_1, \nabla J \rangle = 2 J_1^2 (1-J_1)^2 - J_1(1-J_1) J_2(1-J_2) \qquad (1)$$
$$\dot J_2 = \langle \nabla J_2, \nabla J \rangle = 2 J_2^2 (1-J_2)^2 - J_1(1-J_1) J_2(1-J_2) \qquad (2)$$
This system of differential equations has fixed points at the four corners $(J_1, J_2) \in \{0, 1\}^2$, where $(0, 0)$ is unstable, $(1, 1)$ is a stable attractor (the global optimum), and $(0, 1)$ and $(1, 0)$ are saddle points; see Section B.1 for derivations. The left of Figure 1 depicts exactly these dynamics.
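A small Euler integration of these equations (our own sketch; step size, horizon, and starting point are chosen for illustration) shows the winner-take-all effect and the eventual convergence to the optimum:

```python
# Euler integration of the expected dynamics in Equations (1)-(2).
# We start where component 1 dominates, so component 2 is first suppressed
# before the trajectory escapes the saddle at (1, 0) and reaches (1, 1).
J1, J2 = 0.6, 0.05
min_J2 = J2
dt = 0.05
for _ in range(400000):
    cross = J1 * (1 - J1) * J2 * (1 - J2)
    dJ1 = 2 * J1**2 * (1 - J1)**2 - cross
    dJ2 = 2 * J2**2 * (1 - J2)**2 - cross
    J1, J2 = J1 + dt * dJ1, J2 + dt * dJ2
    min_J2 = min(min_J2, J2)
# min_J2 dips below its initial value (negative interference), yet both
# components end up close to 1 (the stable attractor)
```

Note that during the second stage, while $J_2$ traverses the mid-range, $J_1$ temporarily regresses before both settle at the optimum.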
2.3 Flat plateaus
By considering the inflection points where the learning dynamics re-accelerate after a slowdown, we can formally characterize its plateaus:
Definition 2 (Plateaus)
We say that the learning dynamics of $J$ have an $\epsilon$-plateau at a point $\theta_0$ if and only if $\ddot J(\theta_0) = 0$, $\dot J(\theta_0) = \epsilon$, and $\ddot J$ changes sign from negative to positive at $\theta_0$. In other words, $\theta_0$ is an inflection point of $J$ where the learning curve switches from concave to convex. At $\theta_0$, the derivative $\epsilon = \dot J(\theta_0)$ characterizes the plateau's flatness, i.e., how slow learning is at the slowest (nearby) point.
In our example, the acceleration is:
$$\ddot J = (J_1 + J_2 - 1)\, p(J_1, J_2), \qquad (3)$$
where $p$ is a polynomial of max degree 6 in $J_1$ and $J_2$; see Section B.2 for details. This implies that $\ddot J = 0$ along the diagonal where $J_1 + J_2 = 1$, and $\ddot J$ changes sign there. We have a plateau if the sign change is from negative to positive, see Figure 2. These points lie near the saddle points and are plateaus, with their 'flatness' given by $\epsilon = \dot J$ evaluated on the diagonal,
which is vanishingly small near the corners. Under appropriate smoothness constraints on $J$, the existence of an $\epsilon$-plateau slows down learning by a number of steps on the order of $1/\epsilon$ compared to a plateau-free baseline; see Figure 8 for empirical results.
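These properties can be verified numerically. The sketch below (our own check, using the chain rule $\ddot J = \frac{\partial \dot J}{\partial J_1}\dot J_1 + \frac{\partial \dot J}{\partial J_2}\dot J_2$ with finite-difference partials) confirms that the acceleration vanishes on the line $J_1 + J_2 = 1$ and changes sign from negative to positive across it:

```python
def deriv(J1, J2):
    # Equations (1)-(2)
    cross = J1 * (1 - J1) * J2 * (1 - J2)
    return (2 * J1**2 * (1 - J1)**2 - cross,
            2 * J2**2 * (1 - J2)**2 - cross)

def Jdot(J1, J2):
    d1, d2 = deriv(J1, J2)
    return d1 + d2

def Jddot(J1, J2, h=1e-5):
    # chain rule with central finite differences for the partials of Jdot
    d1, d2 = deriv(J1, J2)
    p1 = (Jdot(J1 + h, J2) - Jdot(J1 - h, J2)) / (2 * h)
    p2 = (Jdot(J1, J2 + h) - Jdot(J1, J2 - h)) / (2 * h)
    return p1 * d1 + p2 * d2

on_diag = Jddot(0.3, 0.70)   # J1 + J2 = 1: acceleration vanishes
below   = Jddot(0.3, 0.69)   # just below the diagonal: decelerating
above   = Jddot(0.3, 0.71)   # just above the diagonal: re-accelerating
```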
2.4 Basins of attraction
Flat plateaus are only a serious concern if the learning trajectories are likely to pass through them. The exact basins of attraction are difficult to characterize, but some parts are simple, namely the regions where one component dominates.
Definition 3 (Winner-take-all)

We denote the learning dynamics at a point $\theta$ as winner-take-all (WTA) for component $k$ if and only if
$$\dot J_k > 0 \quad \text{and} \quad \dot J_{k'} \le 0 \;\; \forall k' \neq k,$$
that is, following the full gradient only increases the $k$-th component.

The core property of interest for a WTA region is that for every trajectory passing through it, when the trajectory leaves the region, all components of the objective except for the $k$-th will have decreased. In our example, the null clines $\dot J_k = 0$ describe the boundaries of the WTA regions. They follow the hyperbolae:
$$2 J_k(1-J_k) = J_{k'}(1-J_{k'}),$$
as shown in Figure 2. Armed with the knowledge of plateau locations and WTA regions, we can establish their basins of attraction (see Figure 2 and Section B.3), and thus the likelihood of hitting an $\epsilon$-plateau under any distribution of starting points. For distributions near the origin of the $(J_1, J_2)$-plane that are uniform across angular directions, it can be shown that the chance of initializing the model in a WTA region is over 50% (Section B.4). Figure 5 shows empirically that for initializations with low overall performance $J$, a large fraction of learning trajectories hit (very) flat plateaus.
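The over-50% claim can be checked by Monte Carlo (our own sketch; the radius $r = 0.05$ of the low-performance initializations is an arbitrary choice):

```python
import math
import random

def deriv(J1, J2):
    # Equations (1)-(2)
    cross = J1 * (1 - J1) * J2 * (1 - J2)
    return (2 * J1**2 * (1 - J1)**2 - cross,
            2 * J2**2 * (1 - J2)**2 - cross)

random.seed(1)
r = 0.05                          # initializations near the origin (low J)
trials, wta = 100000, 0
for _ in range(trials):
    phi = random.uniform(0.0, math.pi / 2)   # uniform angular direction
    d1, d2 = deriv(r * math.cos(phi), r * math.sin(phi))
    if d1 <= 0 or d2 <= 0:        # one component is suppressed: WTA region
        wta += 1
frac = wta / trials               # clearly above one half
```

For small $r$ the boundary conditions reduce to $2J_1 \lessgtr J_2$, giving a WTA fraction of roughly $1 - \frac{2\arctan(1/2)}{\pi/2} \approx 0.59$.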
2.5 Contrast example: supervised learning
We can obtain the explicit dynamics for a variant of the setup where the policy is trained not by REINFORCE but by supervised learning, using a cross-entropy loss toward the ground-truth ideal arm; crucially, the performance-learning coupling of on-policy RL is absent: $\Delta\theta = \alpha\, \nabla_\theta \log \pi(a_k|s_k)$. In this case, interference is the same as before ($\rho = -\frac{1}{2}$), but there are no saddle points (the only fixed point is the global optimum at $(1, 1)$), nor are there any inflection points that could indicate the presence of a plateau, because $J(t)$ is concave everywhere (see Figure 1, right, and Section B.5 for the derivations):
$$\dot J_1 = 2 J_1 (1-J_1)^2 - J_1(1-J_1)(1-J_2), \qquad \dot J_2 = 2 J_2 (1-J_2)^2 - J_2(1-J_2)(1-J_1). \qquad (4)$$
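Integrating both sets of dynamics from the same low-performance initialization (our own sketch, with arbitrarily chosen starting point and step size) makes the contrast concrete:

```python
def rl_deriv(J1, J2):
    # expected on-policy REINFORCE dynamics, Equations (1)-(2)
    cross = J1 * (1 - J1) * J2 * (1 - J2)
    return (2 * J1**2 * (1 - J1)**2 - cross,
            2 * J2**2 * (1 - J2)**2 - cross)

def sl_deriv(J1, J2):
    # supervised cross-entropy dynamics: no extra performance factor
    return (2 * J1 * (1 - J1)**2 - J1 * (1 - J1) * (1 - J2),
            2 * J2 * (1 - J2)**2 - J2 * (1 - J2) * (1 - J1))

def steps_to_learn(deriv, J1, J2, dt=0.01, limit=2000000):
    for t in range(limit):
        if J1 + J2 > 1.9:          # both components essentially learned
            return t
        d1, d2 = deriv(J1, J2)
        J1, J2 = J1 + dt * d1, J2 + dt * d2
    return limit

rl = steps_to_learn(rl_deriv, 0.55, 0.02)
sl = steps_to_learn(sl_deriv, 0.55, 0.02)
# the coupled RL dynamics pass through a flat plateau; SL does not
```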
2.6 Summary: Conditions for ray interference
To summarize, the learning dynamics of a REINFORCE-trained contextual bandit exhibit ray interference. For many initializations, the WTA dynamics pull the system into a flat plateau near a saddle point. Figures 3 and 1 show this hinges on two conditions. The negative interference (a) is due to having multiple contexts ($n \ge 2$) and a shared function approximator; this creates the WTA regions that make it likely to hit flat plateaus (left subplots).
When using a tabular representation instead (i.e., without the shared action bias $b$), there are no WTA regions, so the basins of attraction for the plateaus are smaller and do not extend toward the origin, and ray interference vanishes; see Figure 1 (middle subplots). On the other hand, the learning dynamics couple performance and learning progress (b) because samples are generated by the current policy. Ray interference disappears when this coupling is broken, because without the coupling the dynamics have no saddle points or plateaus. We see this when a uniform random behavior policy is used, or when the policy is trained directly by supervised learning (Section 2.5); see Figure 1 (right subplots).
3 Generalizations
In this section, we generalize the intuitions gained from the simple bandit example, and characterize a broader class of problems that exhibit ray interference.
3.1 Factored objectives
There is one class of learning problems that lends itself to such generalization, namely when the component updates are explicitly coupled to performance, and can be written in the following form:
$$\mathbb{E}[\Delta_k \theta] = \alpha\, g(J_k)\, f_k(\theta), \qquad (5)$$
where $g$ is a smooth scalar function mapping to non-negative numbers that does not depend on the current $\theta$ except via $J_k$, and $f_k$ is a gradient vector field. Furthermore, suppose that each component is bounded: $J_{\min} \le J_k \le J_{\max}$. When the optimum is reached, there is no further learning, so $g(J_{\max})\,\|f_k(\theta)\| = 0$, but for intermediate points learning is always possible, i.e., $g(J_k)\,\|f_k(\theta)\| > 0$ for $J_{\min} < J_k < J_{\max}$.
A sufficient condition for a saddle point to exist is that for one of the components there is no learning at its performance minimum, i.e., $g(J_{\min})\,\|f_k(\theta)\| = 0$ when $J_k = J_{\min}$. The reason is that $\mathbb{E}[\Delta\theta] = \alpha \sum_k g(J_k) f_k(\theta)$, so the expected update vanishes at any point where $J_k = J_{\min}$ and all other components are fully learned.
As a first step to assess whether there exist plateaus near the saddle points, we look at the two-dimensional case. Without loss of generality, we pick $J_{\min} = 0$ and $J_{\max} = 1$. The saddle point of interest is $(J_1, J_2) = (0, 1)$, so we need to determine the sign of $\ddot J$ at two nearby points, $\theta^-$ on the approach to the saddle and $\theta^+$ on the way out. Under reasonable smoothness assumptions, a sufficient condition for a plateau to exist between these points is that $\ddot J(\theta^-) < 0$ and $\ddot J(\theta^+) > 0$, because $\ddot J$ has to cross zero between them. Under certain assumptions, made explicit in our derivations in Section C.1, near the saddle we have:
$$\ddot J \;\approx\; \dot J_1\, \frac{\partial}{\partial J_1}\!\left[ g(J_1)\, \langle f_1(\theta), \nabla J_1 \rangle \right],$$
the sign of which only depends on $J_1$. Furthermore, we know that $\ddot J(\theta^+) > 0$ for small $J_1$ because $g$ is smooth, $g(0) = 0$ and $\langle f_1, \nabla J_1 \rangle > 0$; and $\ddot J(\theta^-) < 0$ because $J_2$ is still approaching $J_{\max}$ at $\theta^-$ and its progress is slowing down. In other words, the same condition sufficient to induce a saddle point ($g(J_{\min}) = 0$) is also sufficient to induce plateaus nearby. Note that the approximation here comes from assuming that the contribution of the fully learned component is small near the saddle point.
At this point it is worth restating the shape of the $g$ in the bandit examples from the previous section: with the REINFORCE objective we had $g(J_k) = J_k$, and under supervised learning we had $g(J_k) = 1$; the extra factor of $J_k$ in the RL case is what introduced the saddle points, and its source was the (on-policy) data coupling; see Figure 6 for an illustration.
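A one-dimensional toy reduction (our own, using the per-component rate $2J(1-J)^2$ of the bandit) isolates the effect of $g$: with $g(J) = J$, the escape from low performance takes orders of magnitude longer than with $g(J) = 1$.

```python
def steps_to_escape(g, J0=0.001, dt=0.01, target=0.5, limit=10**6):
    # scalar dynamics Jdot = g(J) * 2*J*(1-J)^2, from low performance J0
    J, t = J0, 0
    while J < target and t < limit:
        J += dt * g(J) * 2 * J * (1 - J)**2
        t += 1
    return t

rl_steps = steps_to_escape(lambda J: J)     # g(J) = J: on-policy REINFORCE
sl_steps = steps_to_escape(lambda J: 1.0)   # g(J) = 1: supervised learning
# g(0) = 0 turns a quick (exponential) escape into a long (1/J0) plateau
```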
Ray interference requires a second ingredient besides the existence of plateaus, namely WTA regions that create the basins of attraction for these plateaus. For the saddle point at $(0, 1)$, the WTA region of interest is the one where $J_2$ dominates, i.e., $\dot J_2 > 0$ and $\dot J_1 \le 0$. Of course, this can only happen if there is negative interference ($\rho < 0$). If that is the case, however, a WTA region necessarily exists in a strip around $J_1 = J_{\min}$, because $g$ being smooth (with $g(J_{\min}) = 0$) means that $g(J_1)$ eventually becomes small enough for the negative interference term to dominate. In addition, as Section C.2 shows, the sign change in $\ddot J$ occurs in the region between the null clines $\dot J_1 = 0$ and $\dot J_2 = 0$, which in turn means that for any plateau, there exist starting points inside the WTA region that lead to it.
3.2 More than two components
We have discussed conditions for saddle points to exist for any number of components $n$. In fact, the number of saddle points grows exponentially with the number of components that satisfy $g(J_{\min})\,\|f_k\| = 0$. The previous section's arguments establishing the existence of plateaus nearby can be extended to the $n > 2$ case as well, but we omit the details here.
Characterizing the WTA regions in higher dimensions is less straightforward. The simple case is the 'fully-interfering' one, where all components compete for the same resources and have negative pairwise interference everywhere ($\rho(J_k, J_{k'}) < 0$ for all $k \neq k'$): in this case, the previous section's argument can be extended to show that WTA regions must exist near the boundaries.
However, WTA is an unnecessarily strong criterion for pulling trajectories toward plateaus for ray interference: we have seen in Figure 2 that the basins of attraction extend beyond the WTA region (compare green and orange), especially in the low-performance regime. For example, consider three components A, B and C, where A learns first and suppresses B (as before). During this stage, C might behave in different ways: it could learn only partially, converge in parallel with A, or be suppressed like B. When moving to stage two, once A is learned, if C has not fully converged, ray interference dynamics can appear between B and C. Note that a critical quantity is the performance of C after stage one; this is not trivial to make formal statements about, so we rely on an empirical study. Figure 4 shows that the number of plateaus grows with $n$ in fully-interfering scenarios. For larger $n$, we observe that typically a first plateau is hit only after a few components have been learned, indicating that the initialization was not in a WTA region. But after that, most learning curves look like step functions that learn one of the remaining components at a time, with plateaus in between these stages.
A more surprising secondary phenomenon is that the plateaus seem to get exponentially worse in each stage (note the log scale). We propose two interpretations. First, consider the last two components to be learned. They have not dominated learning for any of the preceding stages, all the while being suppressed by interference, and thus their performance level is very low when the last stage starts, much lower than at initialization. And as Figure 5 shows, a low initial performance dramatically affects the chance that the dynamics go through a very flat plateau. Second, the length of the plateau may come from the interference of the $k$-th task with all previously learned tasks. Basically, when starting to learn the $k$-th task, a first step that improves it can negatively interfere with the first $k-1$ tasks. In a second step, these previous tasks may dominate the update and move the parameters so that they recover their performance. Thus the only changes preserved from these two tug-of-war steps are those in the null space of the first $k-1$ tasks. Learning to use only that restricted capacity for task $k$ takes time, especially in the presence of noise, and could thus explain why the length of the plateaus grows with $k$.
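This staged behaviour can be reproduced in a hypothetical $n$-component extension of Equations (1) and (2) (our own illustrative model, averaging the pairwise interference terms; it is not derived from a specific architecture):

```python
# Each component keeps its self-term 2*A_k^2 and is suppressed by the
# average A of the other components, where A_k = J_k * (1 - J_k).
n = 3
J = [0.25, 0.15, 0.08]        # staggered low-performance initialization
dt = 0.1
finish = [None] * n           # first step at which each component exceeds 0.9
for t in range(1, 200001):
    A = [j * (1 - j) for j in J]
    J = [J[k] + dt * (2 * A[k]**2
                      - A[k] * sum(A[i] for i in range(n) if i != k) / (n - 1))
         for k in range(n)]
    for k in range(n):
        if finish[k] is None and J[k] > 0.9:
            finish[k] = t
    if all(f is not None for f in finish):
        break
# the components cross the threshold one after another; the weakest one is
# first suppressed below its initial value and learned last, after a plateau
```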
3.3 From bandits to RL
While the bandit setting has helped ease the exposition, our aim is to gain understanding in the more general (deep) RL case. In this section, we argue that there are some RL settings that are likely to be affected by ray interference. First, there are many cases where the single scalar reward objective can be seen as a composite of many $J_k$: the simple analogue to the contextual bandit are domains with multiple rooms, levels or opponents (e.g., Atari games like Montezuma's Revenge), where a competent policy needs to be good in or against many of these. The additive assumption may not be a perfect fit, but it can be a good first-order approximation in many cases. More generally, when rewards are very sparse, decompositions that split trajectories near reward events are a common approximation in hierarchical RL [19, 24]. Other plausible decompositions exist in undiscounted domains where each reward can be 'collected' just once, in which case each such reward can be viewed as a distinct $J_k$, and the overall performance depends on how many of them the policy can collect, ignoring the paths and ordering. It is important to note that a decomposition does not have to be explicit, semantic or clearly separable: in some domains, competent behavior may be well explained by a combination of implicit skills, and if the learning of such skills is subject to interference, then ray interference may be an issue.
Another way to explain RL dynamics from our bandit investigation is to consider that each arm is a temporally abstract option [43], with 'contexts' referring to parts of state space where different options are optimal. This connection makes an earlier assumption less artificial: we assumed low initial performance in Figure 3, which is artificial in a 2-armed bandit, but plausible if there is an arm for each possible option.
In RL, interference can also arise in other ways than the competition in policy space we observed for bandits. There can be competition over what to memorize, how to represent state, which options to refine, and where to improve value accuracy. These are commonly conflated in the dynamics of a shared function approximator, which is why we consider ray interference to apply to deep RL in particular. On the other hand, the potential for interference is also paired with the potential for positive transfer ($\rho > 0$), and a number of existing techniques try to exploit this, for example by learning about auxiliary tasks [17].
Coupling can also be stronger in the RL case than for the bandit: while we considered a uniform distribution over contexts, it is more realistic to assume that the RL agent will visit some parts of state space far more often than others. It is likely to favour those where it is learning, or seeing reward already (amplifying the rich-get-richer effect). To make this more concrete, assume that the performance $J_k$ sufficiently determines the data distribution the agent encounters, such that its effect on learning about component $k$ in on-policy RL can be summarized by the scalar function $g(J_k)$ (see Section 3.1), at least in the low-performance regime. For the types of RL domains discussed above it is likely that $g(J_{\min}) \approx 0$, i.e., that very little can be learned from only the data produced by an incompetent policy, thereby inducing ray interference.

We have already alluded to the impact of on-policy versus off-policy learning: the latter is generally considered to lead to a number of difficulties [42]; however, it can also reduce the coupling in the learning dynamics. See for example Figure 6 for an illustration of how $g$ changes when mixing the softmax policy with 10% of random actions: crucially, it no longer satisfies $g(0) = 0$. This perspective, that off-policy learning can induce better learning dynamics in some settings, took some of the authors by surprise.
3.4 Beyond RL
A phenomenon related to the one described here was previously reported for supervised learning by Saxe et al. [34], for the particular case of deep linear models. This setting makes it possible to analytically express the learning dynamics of the system, unlike conventional non-linear neural networks. Assuming a single-hidden-layer deep linear model and following the derivation in [34], under the full-batch dynamics, the continuous-time update rule is given by:
$$\tau \frac{d W_1}{dt} = W_2^\top \left( \Sigma_{yx} - W_2 W_1 \Sigma_x \right), \qquad \tau \frac{d W_2}{dt} = \left( \Sigma_{yx} - W_2 W_1 \Sigma_x \right) W_1^\top,$$
where $W_1$ and $W_2$ are the weights of the first and second layer respectively (the deep linear model does not have biases), $\Sigma_{yx}$ is the correlation matrix between the input and target, and $\Sigma_x$ is the input correlation matrix. Using the singular value decomposition $\Sigma_{yx} = U S V^\top$ and assuming $\Sigma_x = I$, they study the dynamics for each mode, and find that the learning dynamics lead to multiple plateaus, where each plateau corresponds to learning a different mode. Modes are learned sequentially, starting with the one corresponding to the largest singular value (see the original paper for full details). Learning each mode of the SVD has its analogue in the different objective components $J_k$ in our notation. The learning curves shown in [34] resemble those observed in Figures 1 and 4, and their Equation (5) describing the per-mode dynamics has a similar structure to our Equations 1 and 2.

Our intuition on how this similarity comes about is speculative. One could view the hidden representation $h = W_1 x$ as the input of the top part of the model ($W_2$). From the perspective of that part, the input distribution is non-stationary because the hidden features (defined by $W_1$) change. Moreover, this non-stationarity is coupled to what has been learned so far, because the error is propagated through the hidden features into $W_1$. If the system is initialized such that the hidden units are correlated, then learning the different modes leads to a competition over the same representation or resources. The gradient is initially dominated by the mode with the largest singular value, and therefore the changes to the hidden representation correspond to features needed to learn this mode only. Once the loss for the dominant mode converges, symmetry breaking can happen, and some of the hidden features specialize to represent the second mode. This transition is visible as a plateau in the learning curve.

While this interpretation highlights the similarities with the RL case via coupling and competition over resources, we want to be careful to highlight that both of these aspects work differently here. The coupling does not have the property that low performance also slows down learning (Section 3.1). Nor is it clear whether the modes exhibit negative interference, where learning about one mode leads to undoing progress on another; it could be more akin to the magnitude of the noise of the larger mode obfuscating the signal on how to improve the smaller one.
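The sequential mode learning can be reproduced with a tiny deep linear model (our own sketch of the dynamics above, with illustrative values: two modes with singular values 3 and 1, identity input correlations, and a small deterministic initialization):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

S  = [[3.0, 0.0], [0.0, 1.0]]       # Sigma_yx: U = V = I, singular values 3, 1
W1 = [[0.05, 0.01], [0.01, 0.05]]   # small near-symmetric initialization
W2 = [[0.05, 0.01], [0.01, 0.05]]

lr, t1, t2 = 0.01, None, None
for t in range(200000):
    P = matmul(W2, W1)
    E = [[S[i][j] - P[i][j] for j in range(2)] for i in range(2)]
    if t1 is None and abs(E[0][0]) < 0.1:
        t1 = t                       # strong mode (s = 3) fitted
    if t2 is None and abs(E[1][1]) < 0.1:
        t2 = t                       # weak mode (s = 1) fitted
    if t1 is not None and t2 is not None:
        break
    dW1 = matmul(transpose(W2), E)   # full-batch gradient on layer 1
    dW2 = matmul(E, transpose(W1))   # full-batch gradient on layer 2
    W1 = [[W1[i][j] + lr * dW1[i][j] for j in range(2)] for i in range(2)]
    W2 = [[W2[i][j] + lr * dW2[i][j] for j in range(2)] for i in range(2)]
# the mode with the larger singular value is learned first: t1 < t2
```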
Their proposed solution is an initialization scheme that ensures all variations of the data are preserved when going through the hierarchy, in line with previous initialization schemes [10, 20]; it leads to symmetry breaking and reduces negative interference during learning. Unfortunately, as this solution requires access to the entire dataset, it does not have a direct analogue in RL, where the relevant states and possible rewards are not available at initialization.
Multi-task versus continual learning

Our investigation has natural connections to the field of multi-task learning (be it for supervised or RL tasks [31, 25]), namely by considering that the multi-task objective is additive over one component $J_k$ per task. It is not uncommon to observe task dominance in this scenario (learning one task at a time [12]), and our analysis suggests possible reasons why tasks are sometimes learned sequentially despite the setup presenting them to the learning system all at once. On the other hand, we know that deep learning struggles with fully sequential settings, as in continual learning or lifelong learning [30, 36, 40]: one of the reasons is that the neural network's capacity can be exhausted prematurely (saturated units), resulting in an agent that can never reach its full potential. This raises the question of why current multi-task techniques appear to be so much more effective than continual learning, if they implicitly produce sequential learning. One hypothesis is that the tug-of-war dynamics that happen when moving from one component to another are akin to rehearsal methods in continual learning: they help split the representation and leave room for learning the features required by the next component. Two other candidate explanations are that the implicit sequencing produces better task orderings and timings than external ones, or that what is sequenced are not the tasks themselves but skills that are useful across multiple tasks. But primarily, we profess our ignorance here, and hope that future work will elucidate this issue, and lead to significant improvements in continual learning along the way.

4 Discussion
How prevalent is it?
Ray interference does not require an explicit multi-task setting to appear. A given single objective might be internally composed of subtasks, some of which have negative interference. We hypothesize, for example, that the performance plateaus observed in Atari [e.g. 23, 14] might be due to learning multiple interfering skills (such as picking up pellets, avoiding ghosts and eating ghosts in Ms. Pac-Man). Conversely, some of the explicit multi-task RL setups appear not to suffer from visible plateaus [e.g. 6]. There is a long list of reasons why this could be: the task not having interfering subtasks, positive transfer outweighing negative interference, the particular architecture used, or population-based training [16] hiding the plateaus through reliance on other members of the population. Note that the lack of plateaus does not exclude the sequential learning of the tasks. Finally, ray interference might not be restricted to RL settings. Similar behaviour has been observed for deep (linear) models [34], though we leave developing the relationship between these phenomena as future work.
How to detect it?
Ray interference is straightforward to detect if the components are known (and appropriate), by simply monitoring whether progress stalls on some components while others learn, and then picks up later. It can be verified by training a separate network for each component from the same (fixed) data. In the more general case, where only the overall objective is known, a first symptom to watch out for is plateaus in the learning curves of individual runs, as plateaus tend to be averaged out when curves are aggregated across many runs. Once plateaus have been observed, there are two types of control experiments: interference can be reduced by changing the function approximator (capacity or structure), or coupling can be reduced by fixing the data distribution or learning more off-policy. If the plateaus dissipate under these control experiments, they were likely due to ray interference.
What makes it worse?
For simplicity, we have examined only one type of coupling, via the data generated from the current policy, but there can be other sources. When contexts/tasks are not sampled uniformly but shaped into curricula based on recent learning progress [8, 11], this amplifies the winner-take-all dynamics. Also, using temporal-difference methods that bootstrap from value estimates [42] may introduce a form of coupling where the values improve faster in regions of state space that already have accurate and consistent bootstrap targets. A form of coupling that operates at the population level is connected to selective pressure [16]: the population member that initially learns fastest can come to dominate the population and reduce diversity, favoring one-trick ponies in a multi-player setup, for example.

What makes it better?
There are essentially three approaches: reduce interference, reduce coupling, or tackle ray interference head-on. Assuming knowledge of the components of the objective, the multi-task literature offers a plethora of approaches to avoid negative interference, from modular or multi-head architectures to gating or attention mechanisms [7, 32, 41, 3, 37, 44]. Additionally, there are methods that prevent the interference directly at the gradient level [5, 47], normalize the scales of the losses [15], or explicitly preserve capacity for late-learned subtasks [18]. It is plausible that following the natural gradient [1, 28] helps as well; see for example [4] (their figures 9b and 11b) for preliminary evidence. When the components are not explicit, a viable approach is to use population-based methods that encourage diversity, and exploit the fact that different members will learn about different implicit components; that knowledge can then be combined [e.g., using a distillation-based crossover operator, 9]. A possibly simpler approach is to rely on innovations in deep learning itself: it is plausible that deep networks with ReLU non-linearities and appropriate initialization schemes [13] implicitly allow units to specialize. Note also that interference can be positive ($\rho > 0$): learning one component helps on others (e.g., via refined features). Coupling can be reduced by introducing elements of off-policy learning that dilute the coupled data distribution with exploratory experience (or experience from other agents), by rebalancing the data distribution with a suitable form of prioritized replay [35] or fitness sharing [33], or by reward shaping that makes the learning signal less sparse [2]. A generic type of decoupling solution (when the components are explicit) is to train separate networks per component and distill them into a single one [21, 32]. Head-on approaches to alleviate ray interference could draw from the growing body of continual-learning techniques [30, 36, 40, 29, 22].

5 Conclusion
This paper studied ray interference, an issue that can stall progress in reinforcement learning systems. It is a combination of harms that arise from (a) conflicting feedback to a shared representation from multiple objectives, and (b) changing the data distribution during policy improvement. These harms are much worse when combined, as they cause learning progress to stall, because the expected learning update drags the learning system towards plateaus in gradient space that require a long time to escape. As such, ray interference is not restricted to deep RL (a bias unit weight shared across different actions in a linear model suffices), but rather it shows how harmful forms of interference, similar to those studied in deep learning, can arise naturally within reinforcement learning. This initial investigation stops short of providing a full remedy, but it sheds light onto these dynamics, improves understanding, teases out some of the key factors, and hints at possible directions for solution methods.
Zooming out from the overall tone of the paper, we want to highlight that plateaus are not omnipresent in deep RL, even in complex domains. Their absence might be due to the many commonly used practical innovations that have been proposed for stability or performance reasons. As they affect the learning dynamics, they could indeed alleviate ray interference as a secondary effect. It may therefore be worth revisiting some of these methods, from a perspective that sheds light on their relation to phenomena like ray interference.
References
 [1] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 [2] J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, J. Brandstetter, and S. Hochreiter. RUDDER: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
 [3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 [4] R. Dadashi, A. A. Taïga, N. L. Roux, D. Schuurmans, and M. G. Bellemare. The value function polytope in reinforcement learning. CoRR, abs/1901.11524, 2019.
 [5] Y. Du, W. M. Czarnecki, S. M. Jayakumar, R. Pascanu, and B. Lakshminarayanan. Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224, 2018.
 [6] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
 [7] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 [8] S. Forestier and P.-Y. Oudeyer. Modular active curiosity-driven discovery of tool use. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 3965–3972. IEEE, 2016.
 [9] T. Gangwani and J. Peng. Genetic policy optimization. arXiv preprint arXiv:1711.01012, 2017.
 [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10), 2010.
 [11] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003, 2017.

 [12] M. Guo, A. Haque, D.-A. Huang, S. Yeung, and L. Fei-Fei. Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 270–287, 2018.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
 [14] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [15] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt. Multi-task deep reinforcement learning with PopArt. arXiv preprint arXiv:1809.04474, 2018.
 [16] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
 [17] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 [18] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
 [19] T. Lane and L. P. Kaelbling. Toward hierarchical decomposition for planning in uncertain environments. In Proceedings of the 2001 IJCAI workshop on planning under uncertainty and incomplete information, pages 1–7, 2001.
 [20] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 9–48. Springer, 1998.
 [21] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
 [22] D. Lopez-Paz and M. A. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
 [23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [24] A. W. Moore, L. Baird, and L. P. Kaelbling. Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1316–1323, 1999.

 [25] J. Oh, S. Singh, H. Lee, and P. Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2661–2670. JMLR.org, 2017.
 [26] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
 [27] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018.
 [28] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning, pages 280–291. Springer, 2005.
 [29] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019.
 [30] M. B. Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Austin, Texas, 1994.
 [31] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
 [32] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

 [33] B. Sareni and L. Krahenbuhl. Fitness sharing and niching methods revisited. IEEE Transactions on Evolutionary Computation, 2(3):97–106, 1998.
 [34] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014.
 [35] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 [36] T. Schaul, H. van Hasselt, J. Modayil, M. White, A. White, P. Bacon, J. Harb, S. Mourad, M. G. Bellemare, and D. Precup. The Barbados 2018 list of open issues in continual learning. CoRR, abs/1811.07004, 2018.
 [37] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
 [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.
 [39] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
 [40] D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series, 2013.
 [41] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In Advances in Neural Information Processing Systems, pages 3545–3553, 2014.
 [42] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
 [43] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 [44] C.-Y. Tsai, A. M. Saxe, and D. Cox. Tensor switching networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2038–2046. Curran Associates, Inc., 2016.
 [45] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, T. Ewalds, D. Horgan, M. Kroiss, I. Danihelka, J. Agapiou, J. Oh, V. Dalibard, D. Choi, L. Sifre, Y. Sulsky, S. Vezhnevets, J. Molloy, T. Cai, D. Budden, T. Paine, C. Gulcehre, Z. Wang, T. Pfaff, T. Pohlen, Y. Wu, D. Yogatama, J. Cohen, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, C. Apps, K. Kavukcuoglu, D. Hassabis, and D. Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
 [46] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 [47] D. Yin, A. Pananjady, M. Lam, D. S. Papailiopoulos, K. Ramchandran, and P. Bartlett. Gradient diversity empowers distributed learning. CoRR, abs/1706.05699, 2017.
Acknowledgements
We thank Hado van Hasselt, David Balduzzi, Andre Barreto, Claudia Clopath, Arthur Guez, Simon Osindero, Neil Rabinowitz, Junyoung Chung, David Silver, Remi Munos, Alhussein Fawzi, Jane Wang, Agnieszka Grabska-Barwinska, Dan Horgan, Matteo Hessel, Shakir Mohamed and the rest of the DeepMind team for helpful comments and suggestions.
Appendix A Additional results
We investigated numerous additional variants of the basic bandit setup. In each case, we summarize the results by the probability that a plateau of
or worse is encountered, as in Figure 5. We quantify this by computing the slowest progress along the learning curve (excluding the regions near the start and near the optimum), normalized to factor out the step-size. If not mentioned otherwise, we use the following settings across these experiments: , , low initial performance , step-size , and batch size (exactly one sample per context). Learning runs are stopped near the global optimum, when , or after samples.

We can quantify the insights of Section 3.2 by measuring the flatness of the worst plateau in a learning curve (which generally has more than one). Figure 7 gives results that validate the qualitative insights when increasing and jointly. Note that only scaling up actually makes the problem easier (when controlling for initial performance), because the actions are disadvantageous in all contexts, so there is some positive transfer through their action biases.
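The flatness statistic described above can be computed mechanically from a recorded learning curve. A minimal sketch, where the band limits and the normalization convention are our own choices rather than the paper's:

```python
import numpy as np

def worst_plateau_flatness(curve, step_size=1.0, lo=0.1, hi=0.9):
    """Slowest per-update progress while performance sits strictly
    between `lo` and `hi` (excluding the regions near the start and
    near the optimum), divided by `step_size` to factor it out.
    Smaller values mean flatter (worse) plateaus."""
    c = np.asarray(curve, dtype=float)
    in_band = (c[:-1] > lo) & (c[:-1] < hi)   # steps taken inside the band
    rates = np.diff(c)[in_band]               # per-update progress there
    if rates.size == 0:
        return float("nan")                   # curve never crosses the band
    return float(rates.min()) / step_size
```

For a curve that rises steadily the statistic equals the smallest in-band increment; for a curve that stalls mid-way it is close to zero (or negative, if performance temporarily regresses).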
We looked at the influence of some architectural choices, using deep neural networks to parameterize the policy. It turns out that deeper or wider MLPs do not qualitatively change the dynamics from the simple setup in Section 2. Figure 9 illustrates some of the effects of learning rates and optimizer choices.
Appendix B Detailed derivations (bandit)
B.1 Fixed point analysis
The Jacobian of the dynamics with respect to (J_1, J_2) is diagonal with entries 1 - 2J_1 and 1 - 2J_2, with determinant (1 - 2J_1)(1 - 2J_2) and trace 2 - 2(J_1 + J_2). This lets us characterize the four fixed points:
fixed point  trace  determinant  type
(0, 0)       2      1            unstable
(1, 0)       0      -1           saddle
(0, 1)       0      -1           saddle
(1, 1)       -2     1            stable
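The classification of the four fixed points can be checked numerically. A minimal sketch, under our assumption that each component follows the stylized dynamics dJ_k/dt = J_k (1 - J_k), so the Jacobian is diagonal with entries 1 - 2 J_k; this is a reading of the elided equations, not a verbatim reproduction:

```python
import numpy as np

def jacobian(J1, J2):
    # Assumed dynamics: dJ_k/dt = J_k * (1 - J_k), hence a diagonal Jacobian.
    return np.diag([1.0 - 2.0 * J1, 1.0 - 2.0 * J2])

def classify(J1, J2):
    A = jacobian(J1, J2)
    tr, det = float(np.trace(A)), float(np.linalg.det(A))
    if det < 0:
        kind = "saddle"       # eigenvalues of opposite sign
    elif tr > 0:
        kind = "unstable"     # both eigenvalues positive
    else:
        kind = "stable"       # both eigenvalues negative
    return tr, det, kind

for fp in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(fp, classify(*fp))
```

Under this assumption the origin is an unstable node, the two corners where only one component is learned are saddles, and the joint optimum is the only stable fixed point.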
B.2 Derivation of for RL
We characterize the acceleration of the learning dynamics as:
This implies that along the diagonal where , and changes sign there. We have a plateau if the sign change is from negative to positive; in other words, wherever
(see also Figure 2).
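The elided expression can be sketched under an assumed form of the dynamics. Assuming dJ_k/dt = J_k (1 - J_k) for each component (the same assumption as in B.1), a consistent reconstruction is:

```latex
% Sketch under the assumption \dot J_k = J_k (1 - J_k), k = 1, 2.
\ddot J \;=\; \frac{d}{dt}\,\bigl(\dot J_1 + \dot J_2\bigr)
        \;=\; \sum_{k=1}^{2} (1 - 2 J_k)\, J_k (1 - J_k).
% The summand f(x) = (1 - 2x)\,x\,(1 - x) satisfies f(1 - x) = -f(x),
% so the two terms cancel wherever J_1 + J_2 = 1: \ddot J vanishes on
% this diagonal and changes sign across it. The change is from negative
% to positive (a plateau) when the crossing happens far from the
% midpoint, i.e., when |J_1 - J_2| is large (the winner-take-all side).
```

Balanced trajectories (J_1 approximately equal to J_2) instead cross the diagonal with a positive-to-negative sign change, which corresponds to the point of fastest progress rather than a plateau.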
B.3 Lower bound on basin of attraction
We can construct an explicit lower bound on the size of the basin of attraction for a given plateau. The main argument is that once a trajectory is in a WTA region (say, the one with ), it can only leave it after the dominant component is nearly learned (), and the performance on the dominated component has regressed. Note that we abuse notation: technically describes the distribution of . Let us consider the dynamics around the top-left saddle point. A trajectory that leaves the WTA region by crossing the null cline at will traverse the diagonal at a plateau point where and , because in that region. Furthermore, any trajectory that starts at a point within the WTA region to the left of this null cline, with , will exit the WTA region at or above, and thus hit a plateau that is at least as flat. So the basin of attraction for a plateau includes the polygon defined by , but is in fact larger, as other trajectories can enter this region from elsewhere; see Figure 2 for a diagram with this geometric intuition, and Figure 8 for the empirical relation between and basin size. In these simulations, and elsewhere if not mentioned otherwise, we use to produce trajectories, and exactly one sample per context for stochastic updates.
B.4 Probability of WTA initialization near origin
When the distribution of starting points is close to the origin, a quantity of interest is the probability of a starting point falling below either null cline (because from there on, the WTA dynamics will pull it onto a plateau). For this we can compute the derivatives of the null cline equations at the origin:
So for such a distribution that is uniform across angular directions we have . However, as Figure 5 shows empirically, the basins of attraction of flat plateaus are even larger, because starting in a WTA region is sufficient but not necessary.
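The angular-fraction argument can be checked with a quick Monte Carlo estimate. This sketch rests on assumptions of ours: near the origin the two null clines are approximated by the symmetric lines J_2 = s J_1 and J_1 = s J_2 for some slope s in (0, 1) (the actual slope follows from the null-cline derivatives above, which are elided here), and starting directions are uniform in angle over the positive quadrant:

```python
import numpy as np

def wta_fraction_mc(s, n=100_000, seed=0):
    """Monte Carlo estimate of the fraction of uniformly random angular
    directions in the positive quadrant that start below one of the two
    (assumed) null clines J2 = s*J1 or J1 = s*J2, i.e., in a WTA region."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, np.pi / 2.0, n)   # random direction in the quadrant
    j1, j2 = np.cos(phi), np.sin(phi)
    return float(np.mean((j2 < s * j1) | (j1 < s * j2)))

def wta_fraction_exact(s):
    # Each WTA wedge spans an angle arctan(s) out of the quadrant's pi/2.
    return 2.0 * np.arctan(s) / (np.pi / 2.0)
```

As a sanity check, s = 1 makes the two wedges cover the whole quadrant (fraction 1), while s approaching 0 makes them vanish; the Monte Carlo and closed-form values agree for intermediate slopes.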
B.5 Derivation of for supervised learning
We have:
Appendix C Derivations for factored objectives case
C.1 Derivation of
We consider objectives where each component is smooth, and can be written in the following form:
where is a scalar function mapping to positive numbers that does not depend on the current except via and . We further assume that each component is bounded: . When the optimum is reached, there is no further learning, so ; but for intermediate points learning is always possible, i.e., .
If the above conditions hold, they are sufficient to show that the combined objective admits a plateau as defined in Definition 2. Moreover, it will exhibit saddle points at and .