Ray Interference: a Source of Plateaus in Deep Reinforcement Learning

by Tom Schaul, et al.

Rather than proposing a new method, this paper investigates an issue present in existing learning algorithms. We study the learning dynamics of reinforcement learning (RL), specifically a characteristic coupling between learning and data generation that arises because RL agents control their future data distribution. In the presence of function approximation, this coupling can lead to a problematic type of 'ray interference', characterized by learning dynamics that sequentially traverse a number of performance plateaus, effectively constraining the agent to learn one thing at a time even when learning in parallel is better. We establish the conditions under which ray interference occurs, show its relation to saddle points and obtain the exact learning dynamics in a restricted setting. We characterize a number of its properties and discuss possible remedies.





1 Introduction

Deep reinforcement learning (RL) agents have achieved impressive results in recent years, tackling long-standing challenges in board games [38, 39], video games [23, 26, 45] and robotics [27]. At the same time, their learning dynamics are notoriously complex. In contrast with supervised learning (SL), these algorithms operate on highly non-stationary data distributions that are coupled with the agent's performance: an incompetent agent will not generate much relevant training data. This paper identifies a problematic, hitherto unnamed, issue with the learning dynamics of RL systems under function approximation (FA).

We focus on the case where the learning objective can be decomposed into multiple components, $J(\theta) = \sum_k J_k(\theta)$. Although not always explicit, complex tasks commonly possess this property. For example, this property arises when learning about multiple tasks or contexts, when using multiple starting points, in the presence of multiple opponents, and in domains that contain decision points with bifurcating dynamics. Sharing knowledge or representations between components can be beneficial in terms of skill reuse or generalization, and sharing seems essential to scale to complex domains. A common mechanism for sharing is a shared function approximator (e.g., a neural network). In general however, the different components do not coordinate and so may compete for resources, resulting in



Figure 1: Illustration of ray interference in two objective component dimensions $(J_1, J_2)$. Top row: Arrows indicate the flow direction of the learning trajectories. Each colored line is a (stochastic) sample trajectory, color-coded by performance. Bottom row: Matching learning curves for these same trajectories. Note how the trajectories that pass by the saddle points of the dynamics, at $(1, 0)$ and $(0, 1)$, in warm colors, hit plateaus and learn much slower (note that the scale of the x-axis differs per plot). Each column has a different setup. Left: RL with FA exhibits ray interference as it has both coupling and interference. Middle: Tabular RL has few plateaus because there is no interference in the dynamics. Right: Supervised learning has no plateaus even with FA interference.

The core insight of this paper is that problematic learning dynamics arise when combining two algorithm properties that are relatively innocuous in isolation: the interference caused by the different components, and the coupling of the learning signal to the future behaviour of the algorithm. Combining these properties leads to a phenomenon we call ‘ray interference’ (so called due to the tentative resemblance of Figure 1, top left, to a batoidea, i.e., a ray fish):

A learning system suffers from plateaus if it has (a) negative interference between the components of its objective, and (b) coupled performance and learning progress.

Intuitively, the reason for the problem is that negative interference creates winner-take-all (WTA) regions, where improvement in one component forces learning to stall or regress for the other components of the objective. Only after such a dominant component is learned (and its gradient vanishes), will the system start learning about the next component. This is where the coupling of learning and performance has its insidious effect: this new stage of learning is really slow, resulting in a long plateau for overall performance (see Figure 1).

We believe aspect (a) is common when using neural networks, which are known to exhibit negative interference when sharing representations between arbitrary tasks. On-policy RL exhibits coupling between performance and learning progress (b) because the improving behavior policy generates the future training data. Thus ray interference is likely to appear in the learning dynamics of on-policy deep RL agents.

The remainder of the paper is structured to introduce the concepts and intuitions on a simple example (Section 2), before extending it to more general cases in Section 3. Section 4 discusses prevalence, symptoms, interactions with different algorithmic components, as well as potential remedies.

2 Minimal, explicit setting

A minimal example that exhibits ray interference is an $(n, k)$-bandit problem: a deterministic contextual bandit with $n$ contexts and $k$ discrete actions (arms). This problem setting eliminates the confounding RL elements of bootstrapping, temporal structure, and stochastic rewards. It also permits a purely analytic treatment of the learning dynamics.

The bandit reward function is one when the action matches the context and zero otherwise: $r(a, c) = \mathbb{I}_{a=c}$, where $\mathbb{I}$ is the indicator function. The policy is a softmax $\pi(a \mid c) = e^{l_a(c)} / \sum_{a'} e^{l_{a'}(c)}$, where the $l_a(c)$ are logits. The mapping from context to logits is parameterised by the trainable weights $\theta$. The expected performance is the sum across contexts, $J(\theta) = \sum_c J_c(\theta)$, where $J_c := \mathbb{E}_{a \sim \pi(\cdot \mid c)}[r(a, c)] = \pi(a{=}c \mid c)$.

A simple way to train the policy is to follow the policy gradient as in REINFORCE [46]. For a context-action-reward sample $(c, a, r)$, this yields $\Delta\theta \propto r\, \nabla_\theta \log \pi(a \mid c)$. When samples are generated on-policy according to $\pi$, the expected update in context $c$ is $\mathbb{E}_{a \sim \pi(\cdot \mid c)}[\Delta\theta] \propto \pi(c \mid c)\, \nabla_\theta \log \pi(c \mid c) = \nabla_\theta J_c$.
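As an illustrative sketch (not the paper's code), the expected-update dynamics can be simulated directly. The reduced three-parameter form below, one weight per context plus a shared bias, is an assumption for illustration (the reduction is discussed in Section 2.1); with an imbalanced start, total performance stalls on a plateau before the second context is learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_update(theta):
    """Expected REINFORCE update summed over both contexts, assuming the
    reduced parameterization J1 = sigmoid(w1 + b), J2 = sigmoid(w2 - b)."""
    j1 = sigmoid(theta[0] + theta[2])
    j2 = sigmoid(theta[1] - theta[2])
    grad_j1 = j1 * (1 - j1) * np.array([1.0, 0.0, 1.0])   # gradient of J1
    grad_j2 = j2 * (1 - j2) * np.array([0.0, 1.0, -1.0])  # gradient of J2
    return grad_j1 + grad_j2, j1 + j2

theta = np.array([0.0, -3.0, 0.0])  # imbalanced start: J1 = 0.5, J2 ~ 0.05
curve = []
for _ in range(100_000):
    update, performance = expected_update(theta)
    theta = theta + 0.1 * update
    curve.append(performance)
```

Plotting `curve` shows a fast rise to about 1 (the first context is solved), a long flat stretch while the shared parameter suppresses the second context, and only then a second rise towards 2.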

Interference can arise from function approximation parameters that are shared across contexts. To see this, represent the logits as $l(c) = W^{\!\top} c + b$, where $W$ is an $n \times k$ matrix, $b$ is a $k$-dimensional vector, and $c$ is a one-hot vector representing the current context. Note that each context uses a different row of $W$, hence no component of $W$ is shared among contexts. However, $b$ is shared among contexts, and this is sufficient for interference, defined as follows:

Definition 1 (Interference)

We quantify the degree of interference between two components $J_i$ and $J_j$ of $J = \sum_k J_k$ as the cosine similarity between their gradients:

$$\rho(i, j) := \frac{\nabla_\theta J_i \cdot \nabla_\theta J_j}{\lVert \nabla_\theta J_i \rVert\, \lVert \nabla_\theta J_j \rVert}$$

Qualitatively, $\rho = 0$ implies no interference, $\rho > 0$ positive transfer, and $\rho < 0$ negative transfer (interference). $\rho$ is bounded between $-1$ and $1$.
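As a minimal sketch of Definition 1 (a hypothetical helper, not from the paper):

```python
import numpy as np

def interference(grad_i, grad_j):
    """rho(i, j): cosine similarity between the gradients of two components."""
    gi = np.asarray(grad_i, dtype=float)
    gj = np.asarray(grad_j, dtype=float)
    return float(gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj)))
```

For instance, two component gradients that each use one private coordinate plus a shared coordinate with opposite signs, say proportional to (1, 0, 1) and (0, 1, -1), interfere with rho = -1/2.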

2.1 Explicit dynamics for a $(2, 2)$-bandit

For the two-dimensional case with $n = 2$ contexts and $k = 2$ arms, we can visualize and express the full learning dynamics exactly (for expected updates with an infinite batch-size), in the continuous-time limit of small step-sizes. First, we clarify our use of two kinds of derivatives that simplify our presentation. The operator $\nabla$ describes a partial derivative and is usually taken with respect to the parameters $\theta$; we omit the subscript whenever it is unambiguous, e.g., $\nabla J := \nabla_\theta J$. Next, the ‘overdot’ notation for a function $f$ denotes its temporal derivative when following the gradient of $J$, namely $\dot f := \nabla f \cdot \dot\theta$ with $\dot\theta = \nabla_\theta J$.

Let $J_1 := \pi(a{=}1 \mid c{=}1)$ and $J_2 := \pi(a{=}2 \mid c{=}2)$. For two actions, the softmax can be simplified to a simple sigmoid $\sigma(x) = (1 + e^{-x})^{-1}$. By removing redundant parameters, we obtain:

$$J_1 = \sigma(\theta_1 + \theta_b), \qquad J_2 = \sigma(\theta_2 - \theta_b),$$

where $\theta = (\theta_1, \theta_2, \theta_b)$ and $\theta_b$ is the contribution of the shared bias $b$. This yields the following gradients:

$$\nabla J_1 = J_1(1 - J_1)\,(1, 0, 1)^{\!\top}, \qquad \nabla J_2 = J_2(1 - J_2)\,(0, 1, -1)^{\!\top}.$$

From these we compute the degree of interference

$$\rho(1, 2) = \frac{\nabla J_1 \cdot \nabla J_2}{\lVert \nabla J_1 \rVert\, \lVert \nabla J_2 \rVert} = \frac{-1}{\sqrt{2}\,\sqrt{2}} = -\frac{1}{2},$$

which implies a substantial amount of negative interference at all points in parameter space.

2.2 Dynamical system

The learning dynamics follow the full gradient $\dot\theta = \nabla J = \nabla J_1 + \nabla J_2$, and we examine them in the limit of small step-sizes, in the coordinate system given by $J_1$ and $J_2$. The directional derivatives along $\nabla J$ for the two components $J_1$ and $J_2$ are:

$$\dot J_1 = \nabla J_1 \cdot \nabla J = J_1(1 - J_1)\left[2 J_1(1 - J_1) - J_2(1 - J_2)\right] \qquad (1)$$

$$\dot J_2 = \nabla J_2 \cdot \nabla J = J_2(1 - J_2)\left[2 J_2(1 - J_2) - J_1(1 - J_1)\right] \qquad (2)$$

This system of differential equations has fixed points at the four corners of the unit square in $(J_1, J_2)$, where $(0, 0)$ is unstable, $(1, 1)$ is a stable attractor (the global optimum), and $(1, 0)$ and $(0, 1)$ are saddle points; see Section B.1 for derivations. The left column of Figure 1 depicts exactly these dynamics.
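This fixed-point structure is easy to check numerically; a sketch, assuming the dynamics take the form $\dot J_i = u_i(2u_i - u_j)$ with $u_i = J_i(1 - J_i)$:

```python
def rhs(j1, j2):
    """Right-hand side of the (J1, J2) dynamics (assumed form, see text)."""
    u = j1 * (1.0 - j1)
    v = j2 * (1.0 - j2)
    return u * (2.0 * u - v), v * (2.0 * v - u)

corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]  # fixed points

# Euler integration from an imbalanced start: the trajectory passes close to
# the saddle at (1, 0) before eventually reaching the optimum (1, 1).
j1, j2, dt = 0.6, 0.05, 0.05
near_saddle = False
for _ in range(400_000):
    d1, d2 = rhs(j1, j2)
    j1, j2 = j1 + dt * d1, j2 + dt * d2
    if j1 > 0.95 and j2 < 0.1:
        near_saddle = True
```

The trajectory first drives $J_2$ down (winner-take-all), lingers near the saddle, and only then converges to the optimum.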

2.3 Flat plateaus

By considering the inflection points where the learning dynamics re-accelerate after a slow-down, we can characterize their plateaus formally:

Definition 2 (Plateaus)

We say that the learning dynamics of $J$ have an $\epsilon$-plateau at a point $\theta_0$ if and only if $\ddot J(\theta_0) = 0$ and $0 \le \dot J(\theta_0) \le \epsilon$.

In other words, $\theta_0$ is an inflection point of $J$ as a function of time, where the learning curve switches from concave to convex. At $\theta_0$, the derivative $\dot J(\theta_0)$ characterizes the plateau's flatness, i.e., how slow learning is at the slowest (nearby) point.

In our example, the acceleration is:

$$\ddot J = (1 - J_1 - J_2)\, h(J_1, J_2),$$

where $h$ is a polynomial of max degree 6 in $J_1$ and $J_2$; see Section B.2 for details. This implies that $\ddot J = 0$ along the anti-diagonal where $J_1 + J_2 = 1$, and that $\ddot J$ changes sign there. We have a plateau if the sign-change is from negative to positive, see Figure 2. These points lie near the saddle points and are $\epsilon$-plateaus, with their ‘flatness’ given by

$$\epsilon = \dot J = 2 J_1^2 (1 - J_1)^2 \quad \text{(using } J_2 = 1 - J_1 \text{ on the anti-diagonal)},$$

which is vanishingly small near the corners. Under appropriate smoothness constraints on $\ddot J$, the existence of an $\epsilon$-plateau slows down learning by a number of steps on the order of $1/\epsilon$ compared to a plateau-free baseline; see Figure 8 for empirical results.
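These inflection-point claims can be sanity-checked numerically. The helpers below differentiate the two-component dynamics $\dot J_1 = u(2u - v)$, $\dot J_2 = v(2v - u)$, with $u = J_1(1 - J_1)$ and $v = J_2(1 - J_2)$ (an assumed form, matching Section 2.2), by the chain rule:

```python
def dot_J(j1, j2):
    """First time-derivative of J = J1 + J2 under dJ1 = u(2u-v), dJ2 = v(2v-u)."""
    u = j1 * (1.0 - j1)
    v = j2 * (1.0 - j2)
    return 2.0 * (u * u - u * v + v * v)

def ddot_J(j1, j2):
    """Second time-derivative of J along the flow (chain rule on dot_J)."""
    u = j1 * (1.0 - j1)
    v = j2 * (1.0 - j2)
    du = (1.0 - 2.0 * j1) * u * (2.0 * u - v)  # du/dt
    dv = (1.0 - 2.0 * j2) * v * (2.0 * v - u)  # dv/dt
    return 2.0 * ((2.0 * u - v) * du + (2.0 * v - u) * dv)
```

Numerically, the acceleration vanishes wherever $J_1 + J_2 = 1$, switches from negative to positive across that line near the saddle $(1, 0)$, and the progress $\dot J$ there is tiny: exactly a flat plateau.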

2.4 Basins of attraction

Flat plateaus are only a serious concern if the learning trajectories are likely to pass through them. The exact basins of attraction are difficult to characterize, but some parts are simple, namely the regions where one component dominates.

Definition 3 (Winner-take-all)

We denote the learning dynamics at a point $\theta_0$ as winner-take-all (WTA) for component $i$ if and only if $\dot J_i(\theta_0) > 0$ and $\dot J_j(\theta_0) \le 0$ for all $j \neq i$,

that is, following the full gradient $\nabla J$ only increases the $i$-th component.

Figure 2: Bandit learning dynamics: Geometric intuitions to accompany the derivations. The green hyperbolae show the null clines that enclose the WTA regions. Inflection points are shown in blue, of which the solid lines are plateaus, while the dashed lines are not. The orange path encloses the basin of attraction for a plateau. The red polygon is its lower-bound approximation, for which the vertices can be derived explicitly (Section B.3).

The core property of interest for a WTA region is that for every trajectory passing through it, when the trajectory leaves the region, all components of the objective except for the $i$-th will have decreased. In our example, the null clines $\dot J_i = 0$ describe the WTA regions. They follow the hyperbolae

$$2 J_i (1 - J_i) = J_j (1 - J_j),$$

as shown in Figure 2. Armed with the knowledge of plateau locations and WTA regions, we can establish their basins of attraction (see Figure 2 and Section B.3), and thus the likelihood of hitting an $\epsilon$-plateau under any distribution of starting points. For distributions near the origin of parameter space and uniform across angular directions, it can be shown that the chance of initializing the model in a WTA region is over 50% (Section B.4). Figure 5 shows empirically that for initializations with low overall performance $J$, a large fraction of learning trajectories hit (very) flat plateaus.

2.5 Contrast example: supervised learning

We can obtain the explicit dynamics for a variant of the setup where the policy is not trained by REINFORCE, but by supervised learning with a cross-entropy loss toward the ground-truth ideal arm, where crucially the performance-learning coupling of on-policy RL is absent: the update is $\Delta\theta \propto \nabla_\theta \log \pi(a^* \mid c)$ for the correct arm $a^* = c$, regardless of the current performance. In this case, interference is the same as before ($\rho = -\frac{1}{2}$), but there are no saddle points (the only fixed point is the global optimum at $(1, 1)$), nor are there any inflection points that could indicate the presence of a plateau, because the learning curve is concave everywhere (see Figure 1, right, and Section B.5 for the derivations).


Figure 3: Likelihood of encountering a flat plateau. This plot shows the likelihood (vertical axis) that the slowest learning progress, $\min \dot J$, along a trajectory is below some value; when there is a plateau, this is its flatness (horizontal axis). For example, 20% of on-policy runs (red curve) traverse a very flat plateau. All these results are empirical quantiles, when starting at low initial performance and ignoring slow progress near the start or the optimum. There are four settings: ray interference (red) is a consequence of two ingredients, interference and coupling. Multiple ablations eliminate it: interference can be removed by training separate networks or using a tabular representation (green); coupling can be removed by off-policy RL with uniform exploration (blue) or a supervised learning setup as in Section 2.5 (yellow). One key contributing factor that impacts whether a trajectory is ‘lucky’ is whether it is initialized near the diagonal ($J_1 \approx J_2$) or not: the more imbalanced the initial performance, the more likely it is to encounter a slow plateau.

2.6 Summary: Conditions for ray interference

To summarize, the learning dynamics of a $(2, 2)$-bandit exhibit ray interference. For many initializations, the WTA dynamics pull the system into a flat plateau near a saddle point. Figures 3 and 1 show this hinges on two conditions. The negative interference (a) is due to having multiple contexts ($n > 1$) and a shared function approximator; this creates the WTA regions that make it likely to hit flat plateaus (left subplots).

When using a tabular representation instead (i.e., without the shared bias $b$), there are no WTA regions, so the basins of attraction for the plateaus are smaller and do not extend toward the origin, and ray interference vanishes, see Figure 1 (middle subplots). On the other hand, the learning dynamics couple performance and learning progress (b) because samples are generated by the current policy. Ray interference disappears when this coupling is broken, because without the coupling, the dynamics have no saddle points or plateaus. We see this when a uniformly random behavior policy is used, or when the policy is trained directly by supervised learning (Section 2.5); see Figure 1 (right subplots).

3 Generalizations

In this section, we generalize the intuitions gained on the simple bandit example, and characterize a broader class of problems that exhibit ray interference.

3.1 Factored objectives

There is one class of learning problems that lends itself to such generalization, namely when the component updates are explicitly coupled to performance and can be written in the following form:

$$\dot\theta = \sum_k g(J_k)\, f_k(\theta),$$

where $g$ is a smooth scalar function mapping to positive numbers that does not depend on the current $\theta$ except via $J_k$, and $f_k$ is a gradient vector field. Furthermore, suppose that each component is bounded: $J_k^{\min} \le J_k \le J_k^{\max}$. When the optimum is reached, there is no further learning, so $f_k(\theta) = 0$ whenever $J_k = J_k^{\max}$, but for intermediate points learning is always possible, i.e., $f_k(\theta) \neq 0$.

A sufficient condition for a saddle point to exist is that for one of the components there is no learning at its performance minimum, i.e., $g(J_k^{\min}) = 0$. The reason is that $\dot\theta = \sum_j g(J_j) f_j(\theta)$, so $\dot\theta = 0$ at any point where $J_k = J_k^{\min}$ and all other components are fully learned (there, $g(J_k) = 0$ and $f_j(\theta) = 0$ for all $j \neq k$).

As a first step to assess whether there exist plateaus near the saddle points, we look at the two-dimensional case. Without loss of generality, we pick $k = 2$ as the component with $g(J_2^{\min}) = 0$, and normalize $J_k^{\min} = 0$ and $J_k^{\max} = 1$. The saddle point of interest is $(J_1, J_2) = (1, 0)$, so we need to determine the sign of $\ddot J$ at the two nearby points $(1 - \delta, 0)$ and $(1, \delta)$, with $0 < \delta \ll 1$. Under reasonable smoothness assumptions, a sufficient condition for a plateau to exist between these points is that both $\ddot J(1 - \delta, 0) < 0$ and $\ddot J(1, \delta) > 0$, because $\ddot J$ has to cross zero between them. Under certain assumptions, made explicit in our derivations in Section C.1, we have:

$$\ddot J(1, \delta) \approx g'(J_2)\, g(J_2) \left(\nabla J_2 \cdot f_2\right)^2,$$

the sign of which only depends on $g'(J_2)$. Furthermore, we know that $g'(J_2) > 0$ for small $J_2$ because $g$ is smooth, $g(0) = 0$, and $g$ is positive elsewhere; and similarly $g(J_2) > 0$ because $J_2 > 0$ and $g$ maps to positive values. In other words, the same condition sufficient to induce a saddle point ($g(0) = 0$) is also sufficient to induce plateaus nearby. Note that the approximation here comes from assuming that $g(J_2)$ is small near the saddle point.

At this point it is worth restating the shape of the coupling function $g$ in the bandit examples from the previous section: with the REINFORCE objective, we had $g(J_k) = J_k(1 - J_k)$, and under supervised learning we had $g(J_k) = 1 - J_k$; the extra factor of $J_k$ in the RL case is what introduced the saddle points, and its source was the (on-policy) data coupling; see Figure 6 for an illustration.
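The effect of the extra factor can be sketched numerically by running the same shared-parameter update with both coupling functions and comparing how long each takes to reach high total performance from an imbalanced start (the three-parameter form is an assumption matching Section 2):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def steps_to_learn(g, max_steps=300_000, lr=0.1):
    """Apply updates of size g(J_k) along each component's logit direction
    and return the number of steps until J1 + J2 exceeds 1.9."""
    theta = np.array([0.0, -3.0, 0.0])   # J1 = 0.5, J2 ~ 0.05
    d1 = np.array([1.0, 0.0, 1.0])       # logit direction of component 1
    d2 = np.array([0.0, 1.0, -1.0])      # logit direction of component 2
    for t in range(max_steps):
        j1 = sigmoid(theta[0] + theta[2])
        j2 = sigmoid(theta[1] - theta[2])
        if j1 + j2 > 1.9:
            return t
        theta = theta + lr * (g(j1) * d1 + g(j2) * d2)
    return max_steps

steps_rl = steps_to_learn(lambda j: j * (1.0 - j))  # g(0) = 0: saddle, plateau
steps_sl = steps_to_learn(lambda j: 1.0 - j)        # g(0) = 1: no plateau
```

The RL-style coupling takes many times longer from the same start, with the extra steps spent on the plateau near the saddle.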

Ray interference requires a second ingredient besides the existence of plateaus, namely WTA regions that create the basins of attraction for these plateaus. For the saddle point at $(1, 0)$, the WTA region of interest is the one where $J_1$ dominates, i.e., $\dot J_1 > 0$ and $\dot J_2 \le 0$. Of course, this can only happen if there is negative interference ($\rho(1, 2) < 0$). If that is the case, however, a WTA region necessarily exists in a strip near $J_2 = 0$, because $g(0) = 0$ and $g$ being smooth mean that $g(J_2)$ eventually becomes small enough for the negative interference term in $\dot J_2$ to dominate. In addition, as Section C.2 shows, the sign change in $\ddot J$ occurs in the region between the null clines $\dot J_1 = 0$ and $\dot J_2 = 0$, which in turn means that for any plateau, there exist starting points inside the WTA region that lead to it.

3.2 More than two components

We have discussed conditions for saddle points to exist for any number of components $K$. In fact, the number of saddle points grows exponentially with the number of components that satisfy $g(J_k^{\min}) = 0$. The previous section's arguments that establish the existence of plateaus nearby can be extended to the $K > 2$ case as well, but we omit the details here.

Characterizing the WTA regions in higher dimensions is less straightforward. The simple case is the ‘fully-interfering’ one, where all components compete for the same resources and have negative pair-wise interference everywhere ($\rho(i, j) < 0$ for all $i \neq j$): in this case, the previous section's argument can be extended to show that WTA regions must exist near the boundaries.

However, WTA is an unnecessarily strong criterion for pulling trajectories toward plateaus: we have seen in Figure 2 that the basins of attraction extend beyond the WTA region (compare green and orange), especially in the low-performance regime. For example, consider three components A, B and C, where A learns first and suppresses B (as before). During this stage, C might behave in different ways: it could learn only partially, converge in parallel with A, or be suppressed like B. When moving to stage two, once A is learned, if C has not fully converged, ray interference dynamics can appear between B and C. Note that a critical quantity is the performance of C after stage one; this is not a trivial one to make formal statements about, so we rely on an empirical study. Figure 4 shows that the number of plateaus grows with $K$ in fully-interfering scenarios. For larger $K$, we observe that typically a first plateau is hit after a few components have been learned, indicating that the initialization was not in a WTA region. But after that, most learning curves look like step functions that learn one of the remaining components at a time, with plateaus in-between these stages.

A more surprising secondary phenomenon is that the plateaus seem to get exponentially worse in each stage (note the log-scale). We propose two interpretations. First, consider the two last components to be learned. They have not dominated learning for many stages, all the while being suppressed by interference, and thus their performance level is very low when the last stage starts, much lower than at initialization. And as Figure 5 shows, a low initial performance dramatically affects the chance that the dynamics go through a very flat plateau. Second, the length of the plateau may come from the interference of the $k$-th task with all previously learned tasks. Basically, when starting to learn the $k$-th task, a first step that improves it can negatively interfere with the first $k - 1$ tasks. In a second step, these previous tasks may dominate the update and move the parameters such that they recover their performance. Thus the only changes preserved from these two tug-of-war steps are those in the null-space of the first $k - 1$ tasks. Learning to use only that restricted capacity for task $k$ takes time, especially in the presence of noise, and could thus explain why the length of the plateaus grows with $k$.
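The first interpretation can be probed in the two-component dynamics of Section 2: the lower the suppressed component's performance at the start of its stage, the flatter the subsequent plateau. A sketch (the dynamics form is the assumed $(2, 2)$-bandit one):

```python
def min_plateau_progress(j2_init, dt=0.05, t_max=40_000.0):
    """Integrate dJ1 = u(2u - v), dJ2 = v(2v - u) from (0.5, j2_init) and
    return the slowest progress observed while total performance is mid-range."""
    j1, j2, t = 0.5, j2_init, 0.0
    slowest = float("inf")
    while t < t_max and j1 + j2 < 1.9:
        u = j1 * (1.0 - j1)
        v = j2 * (1.0 - j2)
        d1 = u * (2.0 * u - v)
        d2 = v * (2.0 * v - u)
        if 0.9 < j1 + j2 < 1.8:   # ignore the start and the optimum, as in Figure 3
            slowest = min(slowest, d1 + d2)
        j1, j2, t = j1 + dt * d1, j2 + dt * d2, t + dt
    return slowest

flat_high = min_plateau_progress(0.05)    # healthier start
flat_low = min_plateau_progress(0.005)    # weaker start: much flatter plateau
```

Lowering the second component's initial performance by one order of magnitude makes the minimum progress on the plateau drop by far more than that, consistent with plateaus getting dramatically worse in later stages.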

Figure 4: Learning curves when scaling up the problem dimension (jointly the number of contexts $n$ and arms $k$). We observe that the runs go through more separate plateaus, and each plateau takes exponentially longer to overcome than the previous one (the horizontal axis is log-scale).

3.3 From bandits to RL

While the bandit setting has helped ease the exposition, our aim is to gain understanding in the more general (deep) RL case. In this section, we argue that there are some RL settings that are likely to be affected by ray interference. First, there are many cases where the single scalar reward objective can be seen as a composite of many components $J_k$: the simple analogue to the contextual bandit are domains with multiple rooms, levels or opponents (e.g., Atari games like Montezuma's Revenge), and a competent policy needs to be good in/against many of these. The additive assumption may not be a perfect fit, but can be a good first-order approximation in many cases. More generally, when rewards are very sparse, decompositions that split trajectories near reward events are a common approximation in hierarchical RL [19, 24]. Other plausible decompositions exist in undiscounted domains where each reward can be ‘collected’ just once, in which case each such reward can be viewed as a distinct component $J_k$, and the overall performance depends on how many of them the policy can collect, ignoring the paths and ordering. It is important to note that a decomposition does not have to be explicit, semantic or clearly separable: in some domains, competent behavior may be well explained by a combination of implicit skills—and if the learning of such skills is subject to interference, then ray interference may be an issue.

Another way to explain RL dynamics from our bandit investigation is to consider that each arm is a temporally abstract option [43], with ‘contexts’ referring to parts of state space where different options are optimal. This connection makes an earlier assumption less artificial: we assumed low initial performance in Figure 3, which is artificial in a 2-armed bandit, but plausible if there is an arm for each possible option.

In RL, interference can also arise in other ways than the competition in policy space we observed for bandits. There can be competition around what to memorize, how to represent state, which options to refine, and where to improve value accuracy. These are commonly conflated in the dynamics of a shared function approximator, which is why we consider ray interference to apply to deep RL in particular. On the other hand, the potential for interference is also paired with the potential for positive transfer ($\rho > 0$), and a number of existing techniques try to exploit this, for example by learning about auxiliary tasks [17].

Coupling can also be stronger in the RL case than for the bandit: while we considered a uniform distribution over contexts, it is more realistic to assume that the RL agent will visit some parts of state space far more often than others. It is likely to favour those where it is learning, or seeing reward already (amplifying the rich-get-richer effect). To make this more concrete, assume that the performance $J_k$ sufficiently determines the data distribution the agent encounters, such that its effect on learning about component $k$ in on-policy RL can be summarized by the scalar function $g(J_k)$ (see Section 3.1), at least in the low-performance regime. For the types of RL domains discussed above, it is likely that $g(J_k^{\min}) \approx 0$, i.e., that very little can be learned from only the data produced by an incompetent policy, thereby inducing ray interference.

We have already alluded to the impact of on-policy versus off-policy learning: the latter is generally considered to lead to a number of difficulties [42]; however, it can also reduce the coupling in the learning dynamics. See for example Figure 6 for an illustration of how $g$ changes when mixing the soft-max policy with 10% of random actions: crucially, it no longer satisfies $g(0) = 0$. This perspective, namely that off-policy learning can induce better learning dynamics in some settings, took some of the authors by surprise.

3.4 Beyond RL

A related phenomenon to the one described here was previously reported for supervised learning by Saxe et al. [34], for the particular case of deep linear models. This setting makes it easy to analytically express the learning dynamics of the system, unlike traditional neural networks. Assuming a single hidden layer deep linear model, and following the derivation in [34], under the full batch dynamics, the continuous time update rule is given by:

$$\tau\, \dot W_1 = W_2^{\!\top}\left(\Sigma_{yx} - W_2 W_1 \Sigma_x\right), \qquad \tau\, \dot W_2 = \left(\Sigma_{yx} - W_2 W_1 \Sigma_x\right) W_1^{\!\top},$$

where $W_1$ and $W_2$ are the weights of the first and second layer respectively (the deep linear model does not have biases), $\Sigma_{yx} = \mathbb{E}[y x^{\!\top}]$ is the correlation matrix between the input and target, and $\Sigma_x = \mathbb{E}[x x^{\!\top}]$ is the input correlation matrix. Using the singular value decomposition of $\Sigma_{yx}$ and assuming $\Sigma_x = I$, they study the dynamics for each mode, and find that the learning dynamics lead to multiple plateaus, where each plateau corresponds to the learning of a different mode. Modes are learned sequentially, starting with the one corresponding to the largest singular value (see the original paper for full details). Learning each mode of $\Sigma_{yx}$ has its analogue in the different objective components $J_k$ in our notation. The learning curves shown in [34] resemble those observed in Figures 1 and 4, and their Equation (5) describing the per-mode dynamics has a similar structure to our Equations 1 and 2.
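These deep linear dynamics are easy to simulate. The sketch below assumes whitened inputs ($\Sigma_x = I$), a diagonal $\Sigma_{yx}$ with singular values 3 and 1, and a small balanced initialization (choices made for clarity, not taken from [34]); the stronger mode is learned first, with the weaker mode catching up only later:

```python
import numpy as np

# Full-batch gradient flow for a two-layer linear net y = W2 @ W1 @ x.
S = np.diag([3.0, 1.0])   # input-output correlation Sigma_yx (diagonal modes)
W1 = 0.01 * np.eye(2)     # small balanced init (assumption for clarity)
W2 = 0.01 * np.eye(2)
lr = 0.01
t_mode = [None, None]     # first step at which each mode reaches 90% strength
for t in range(20_000):
    E = S - W2 @ W1                                        # residual correlation
    W1, W2 = W1 + lr * W2.T @ E, W2 + lr * E @ W1.T        # simultaneous update
    A = W2 @ W1
    for i in range(2):
        if t_mode[i] is None and A[i, i] >= 0.9 * S[i, i]:
            t_mode[i] = t
```

With this balanced diagonal initialization each mode strength a follows the logistic-like growth da/dt = 2a(s - a), so the mode with the larger singular value s escapes its plateau first, mirroring the stage-wise learning curves.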

Our intuition on how this similarity comes about is speculative. One could view the hidden representation as the input of the top part of the model ($W_2$). From the perspective of that part, the input distribution is non-stationary because the hidden features (defined by $W_1$) change. Moreover, this non-stationarity is coupled to what has been learned so far, because the error is propagated through the hidden features into $W_1$. If the system is initialized such that the hidden units are correlated, then learning the different modes leads to a competition over the same representation or resources. The gradient is initially dominated by the mode with the largest singular value, and therefore the changes to the hidden representation correspond to the features needed to learn this mode only. Once the loss for the dominant mode converges, symmetry-breaking can happen, and some of the hidden features specialize to represent the second mode. This transition is visible as a plateau in the learning curve.

While this interpretation highlights the similarities with the RL case via coupling and competition over resources, we want to be careful to highlight that both of these aspects work differently here. The coupling does not have the property that low performance also slows down learning (Section 3.1). And it is not clear whether the modes exhibit negative interference, where learning about one mode leads to undoing progress on another one; it could be more akin to the magnitude of the noise of the larger mode obfuscating the signal on how to improve the smaller one.

Their proposed solution is an initialization scheme that ensures all variations of the data are preserved when going through the hierarchy, in line with previous initialization schemes [10, 20], which leads to symmetry breaking and reduces negative interference during learning. Unfortunately, as this solution requires access to the entire dataset, it does not have a direct analogue in RL where the relevant states and possible rewards are not available at initialization.

Multi-task versus continual learning

Our investigation has natural connections to the field of multi-task learning (be it for supervised or RL tasks [31, 25]), namely by considering that the multi-task objective is additive over one component $J_k$ per task. It is not uncommon to observe task dominance in this scenario (learning one task at a time [12]), and our analysis suggests possible reasons why tasks are sometimes learned sequentially despite the setup presenting them to the learning system all at once. On the other hand, we know that deep learning struggles with fully sequential settings, as in continual learning or life-long learning [30, 36, 40], one of the reasons being that the neural network's capacity can be exhausted prematurely (saturated units), resulting in an agent that can never reach its full potential. This raises the question of why current multi-task techniques appear to be so much more effective than continual learning, if they implicitly produce sequential learning. One hypothesis is that the tug-of-war dynamics that happen when moving from one component to another are akin to rehearsal methods for continual learning, helping to split the representation and leaving room for learning the features required by the next component. Two other candidates could be that the implicit sequencing produces better task orderings and timings than external ones, or the notion that what is sequenced are not tasks themselves but skills that are useful across multiple tasks. But primarily, we profess our ignorance here, and hope that future work will elucidate this issue, and lead to significant improvements in continual learning along the way.

4 Discussion

How prevalent is it?

Ray interference does not require an explicit multi-task setting to appear. A given single objective might be internally composed of subtasks, some of which have negative interference. We hypothesize, for example, that performance plateaus observed in Atari [e.g. 23, 14] might be due to learning multiple interfering skills (such as picking up pellets, avoiding ghosts and eating ghosts in Ms PacMan). Conversely, some explicit multi-task RL setups appear not to suffer from visible plateaus [e.g. 6]. There is a long list of reasons why this could be, ranging from the task not having interfering subtasks, to positive transfer outweighing negative interference, to the particular architecture used, to population-based training [16] hiding the plateaus through reliance on other members of the population. Note that the lack of plateaus does not exclude the sequential learning of the tasks. Finally, ray interference might not be restricted to RL settings. Similar behaviour has been observed for deep (linear) models [34], though we leave developing the relationship between these phenomena as future work.

How to detect it?

Ray interference is straightforward to detect if the components are known (and appropriate), by simply monitoring whether progress stalls on some components while others learn, and then picks up later. It can be verified by training a separate network for each component from the same (fixed) data. In the more general case, where only the overall objective is known, a first symptom to watch out for is plateaus in the learning curves of individual runs, as plateaus tend to be averaged out when curves are aggregated across many runs. Once plateaus have been observed, there are two types of control experiments: interference can be reduced by changing the function approximator (capacity or structure), or coupling can be reduced by fixing the data distribution or learning more off-policy. If the plateaus dissipate under these control experiments, they were likely due to ray interference.
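As a sketch of this kind of monitoring, a simple detector can flag stretches of a learning curve where smoothed progress stays below a tolerance (the window size and tolerance are arbitrary choices here, not prescribed by the paper):

```python
import numpy as np

def plateau_spans(curve, window=100, tol=1e-3):
    """Return index spans where average progress per step stays below tol."""
    c = np.asarray(curve, dtype=float)
    rate = (c[window:] - c[:-window]) / window   # windowed progress per step
    low = rate < tol
    spans, start = [], None
    for i, flag in enumerate(low):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(low)))
    return spans
```

Applied to a curve that rises, stalls, and rises again, the detector returns the stalled stretch; applied per component, it would reveal the staggered, one-at-a-time pattern described above.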

What makes it worse?

For simplicity, we have examined only one type of coupling, via the data generated from the current policy, but there can be other sources. When contexts/tasks are not sampled uniformly but shaped into curricula based on recent learning progress [8, 11], this amplifies the winner-take-all dynamics. Also, using temporal-difference methods that bootstrap from value estimates [42] may introduce a form of coupling where the values improve faster in regions of state space that already have accurate and consistent bootstrap targets. A form of coupling that operates on the population level is connected to selective pressure [16]: the population member that initially learns fastest can come to dominate the population and reduce diversity, favoring one-trick ponies in a multi-player setup, for example.
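The learning-progress curriculum mentioned above can be sketched as follows; the progress floor and temperature are hypothetical knobs of this sketch, but the winner-take-all amplification is visible: contexts already making progress are sampled even more often:

```python
import random

def progress_proportional_sampler(recent_progress, temperature=1.0):
    """Sample a context/task index with probability proportional to recent
    learning progress (plus a small floor so no task is starved entirely).

    This kind of curriculum can amplify winner-take-all dynamics: a task
    already being learned is sampled more, so it is learned even faster.
    """
    floor = 1e-3  # hypothetical minimum sampling weight
    weights = [max(p, 0.0) ** (1.0 / temperature) + floor
               for p in recent_progress]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]
```

With recent progress (1.0, 0.0), almost all samples go to the first context, starving the second of data.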

What makes it better?

There are essentially three approaches: reduce interference, reduce coupling, or tackle ray interference head-on. Assuming knowledge of the components of the objective, the multi-task literature offers a plethora of approaches to avoid negative interference, from modular or multi-head architectures to gating or attention mechanisms [7, 32, 41, 3, 37, 44]. Additionally, there are methods that prevent the interference directly at the gradient level [5, 47], normalize the scales of the losses [15], or explicitly preserve capacity for late-learned subtasks [18]. It is plausible that following the natural gradient [1, 28] helps as well; see for example [4] (their figures 9b and 11b) for preliminary evidence. When the components are not explicit, a viable approach is to use population-based methods that encourage diversity and exploit the fact that different members will learn about different implicit components; that knowledge can then be combined [e.g., using a distillation-based cross-over operator, 9]. A possibly simpler approach is to rely on innovations in deep learning itself: it is plausible that deep networks with ReLU non-linearities and appropriate initialization schemes [13] implicitly allow units to specialize. Note also that interference can be positive: learning one component can help on others (e.g., via refined features). Coupling can be reduced by introducing elements of off-policy learning that dilute the coupled data distribution with exploratory experience (or experience from other agents), by rebalancing the data distribution with a suitable form of prioritized replay [35] or fitness sharing [33], or by reward shaping that makes the learning signal less sparse [2]. A generic type of decoupling solution (when components are explicit) is to train separate networks per component and distill them into a single one [21, 32]. Head-on approaches to alleviate ray interference could draw from the growing body of continual learning techniques [30, 36, 40, 29, 22].
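As an illustration of gradient-level interference prevention, the sketch below projects away conflicting components of per-objective gradients before summing them. This is a generic projection heuristic under our own assumptions, not necessarily the exact method of the works cited above:

```python
import numpy as np

def project_conflicts(grads):
    """Combine per-component gradients, projecting out pairwise conflicts.

    grads: list of 1-D arrays, one gradient per objective component.
    Whenever two component gradients point in opposing directions
    (negative inner product), the conflicting part of one is removed by
    projecting it onto the normal plane of the other.
    """
    adjusted = [g.astype(float) for g in grads]
    for i, gi in enumerate(adjusted):
        for j, gj in enumerate(grads):
            if i == j:
                continue
            dot = gi @ gj
            if dot < 0:  # conflict: remove the component along gj
                gi -= dot / (gj @ gj) * gj
    return np.sum(adjusted, axis=0)
```

For conflicting gradients such as (1, 0) and (-1, 1), each adjusted gradient becomes orthogonal to the other original gradient before summing; non-conflicting gradients are left untouched.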

5 Conclusion

This paper studied ray interference, an issue that can stall progress in reinforcement learning systems. It arises from the combination of (a) conflicting feedback to a shared representation from multiple objectives, and (b) a data distribution that changes during policy improvement. These two harms are much worse in combination: learning progress stalls because the expected learning update drags the learning system towards plateaus that take a long time to escape. As such, ray interference is not restricted to deep RL (a bias weight shared across different actions in a linear model suffices); rather, it shows how harmful forms of interference, similar to those studied in deep learning, can arise naturally within reinforcement learning. This initial investigation stops short of providing a full remedy, but it sheds light on these dynamics, improves understanding, teases out some of the key factors, and hints at possible directions for solution methods.

In contrast to what the overall tone of the paper may suggest, we want to highlight that plateaus are not omnipresent in deep RL, even in complex domains. Their absence might be due to the many practical innovations in common use, originally proposed for stability or performance reasons. As these affect the learning dynamics, they could indeed alleviate ray interference as a secondary effect. It may therefore be worth revisiting some of these methods from a perspective that sheds light on their relation to phenomena like ray interference.



We thank Hado van Hasselt, David Balduzzi, Andre Barreto, Claudia Clopath, Arthur Guez, Simon Osindero, Neil Rabinowitz, Junyoung Chung, David Silver, Remi Munos, Alhussein Fawzi, Jane Wang, Agnieszka Grabska-Barwinska, Dan Horgan, Matteo Hessel, Shakir Mohamed and the rest of the DeepMind team for helpful comments and suggestions.

Appendix A Additional results

We investigated numerous additional variants of the basic bandit setup. In each case, we summarize the results by the probability that a plateau of flatness ε or worse is encountered, as in Figure 5. We quantify this by computing the slowest progress along the learning curve (not near the start nor the optimum), normalized to factor out the step-size. If not mentioned otherwise, we use the same settings across these experiments: low initial performance, a fixed step-size, and exactly one sample per context per batch. Learning runs are stopped near the global optimum, or after a fixed budget of samples.
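The plateau-flatness measure described above (slowest progress along the curve, away from the start and the optimum, normalized to factor out the step-size) can be sketched as follows; the `lo`/`hi` cutoffs are hypothetical choices for "not near the start nor the optimum":

```python
import numpy as np

def worst_plateau_flatness(curve, step_size, lo=0.1, hi=0.9):
    """Slowest per-step progress along a learning curve, restricted to the
    middle of the run (performance between `lo` and `hi` of its final
    value) and divided by the step-size so that runs with different
    learning rates are comparable."""
    curve = np.asarray(curve, dtype=float)
    progress = np.diff(curve)                 # per-step improvement
    level = curve[:-1] / curve[-1]            # fraction of final performance
    mid = (level > lo) & (level < hi)         # exclude start and optimum
    return progress[mid].min() / step_size
```

A curve that stalls at intermediate performance scores zero, while a curve with steady progress scores strictly positive.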

We can quantify the insights of Section 3.2 by measuring the flatness of the worst plateau in a learning curve, which generally has more than one. Figure 7 gives results that validate the qualitative insights when increasing the number of contexts and actions jointly. Note that only scaling up the number of actions actually makes the problem easier (when controlling for initial performance), because the additional actions are disadvantageous in all contexts, so there is some positive transfer through their action biases.

We looked at the influence of some architectural choices, using deep neural networks to parameterize the policy. It turns out that deeper or wider MLPs do not qualitatively change the dynamics from the simple setup in Section 2. Figure 9 illustrates some of the effects of learning rates and optimizer choices.

Figure 5: Basins of attraction for plateaus of different flatness ε, and for different levels of initial performance, under deterministic dynamics. The dashed line indicates the typical ε for which learning is 10 times slower than necessary (see Figure 8); for example, half of the trajectories at the lowest initial performance hit such a flat plateau.

Figure 6: Plot of the scalar coupling functions of Equation 5 (see Section 3.1). It highlights the U-shape for on-policy REINFORCE (red), in contrast to a supervised learning setup (green, see Section 2.5). In orange, it illustrates how the condition no longer holds when using off-policy data, in this case mixing 90% on-policy data with 10% uniform random data.

Figure 7: Likelihood of encountering a plateau for different numbers of contexts and actions (same type of plot as Figure 3). Note how the positive transfer from spurious actions (warm colors) helps performance.

Figure 8: Relation between the flatness ε of the traversed plateau and the number of steps along a trajectory from near the initialization to near the optimum. The dashed line (‘balanced’) corresponds to trajectories that follow the diagonal and do not encounter a plateau. Note that for the flattest plateaus, the deterministic learning trajectories are 10 times slower than a diagonal trajectory (warm colors in Figure 1).

Figure 9: Likelihood of encountering a bad plateau (same type of plot as Figure 3). Left: comparison between step-sizes. Right: comparison between optimizers.

Appendix B Detailed derivations (bandit)

B.1 Fixed point analysis

The determinant and trace of the Jacobian of the learning dynamics let us characterize the four fixed points:

fixed point    trace    determinant    type
(0, 0)           2       positive      unstable
(1, 0)           0       negative      saddle
(0, 1)           0       negative      saddle
(1, 1)          -2       positive      stable
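The classification in this table follows the standard trace-determinant test for fixed points of two-dimensional dynamics; a minimal sketch:

```python
import numpy as np

def classify_fixed_point(jacobian):
    """Classify a fixed point of a 2-D dynamical system from the trace and
    determinant of the Jacobian of the dynamics evaluated there."""
    J = np.asarray(jacobian, dtype=float)
    tr, det = np.trace(J), np.linalg.det(J)
    if det < 0:
        return "saddle"      # real eigenvalues of opposite sign
    if tr > 0:
        return "unstable"    # eigenvalue real parts both positive
    if tr < 0:
        return "stable"      # eigenvalue real parts both negative
    return "degenerate"      # tr == 0, det >= 0: not hyperbolic
```

For example, a diagonal Jacobian with entries of opposite sign yields a saddle, matching the two intermediate fixed points in the table.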

B.2 Derivation of the acceleration for RL

We characterize the acceleration of the learning dynamics. It vanishes along the diagonal, where the two components are equal, and it changes sign there. We have a plateau if the sign-change is from negative to positive; see also Figure 2.

B.3 Lower bound on basin of attraction

We can construct an explicit lower bound on the size of the basin of attraction for a given plateau. The main argument is that once a trajectory is in a WTA region (say, the one where the first component dominates), it can only leave after the dominant component is nearly learned, and the performance on the dominated component has regressed. Let us consider the dynamics around the top-left saddle point. A trajectory that leaves the WTA region by crossing the null cline will traverse the diagonal at an ε-plateau point, because of the sign of the acceleration in that region. Furthermore, any trajectory that starts within the WTA region to the left of this null cline will exit the WTA region at the same height or above, and thus hit a plateau that is at least as flat. So the basin of attraction for an ε-plateau includes the polygon defined by these boundaries, but is in fact larger, as other trajectories can enter this region from elsewhere; see Figure 2 for a diagram with this geometric intuition, and Figure 8 for the empirical relation between ε and basin size. In these simulations, and elsewhere if not mentioned otherwise, we use the expected updates to produce deterministic trajectories, and exactly one sample per context for stochastic updates.
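Deterministic trajectories of this kind can be produced by Euler-integrating an expected update. The dynamics below (independent logistic growth per component) are a hypothetical stand-in used only to illustrate the integration procedure, not the coupled dynamics derived in the paper:

```python
import numpy as np

def euler_trajectory(dynamics, x0, step_size=0.01, n_steps=10000, tol=1e-3):
    """Euler-integrate an expected learning update to produce a
    deterministic trajectory, stopping near the global optimum at (1, 1)."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x + step_size * dynamics(x)
        traj.append(x.copy())
        if np.all(np.abs(1.0 - x) < tol):
            break  # close enough to the global optimum
    return np.array(traj)

# Hypothetical uncoupled dynamics, for illustration only.
demo = euler_trajectory(lambda x: x * (1 - x), [0.1, 0.2])
```

Swapping in a coupled update rule for `dynamics` reproduces trajectories that linger near the saddle points discussed above.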

B.4 Probability of WTA initialization near the origin

When the distribution of starting points is close to the origin, a quantity of interest is the probability of a starting point falling underneath either null cline (because from there the WTA dynamics will pull it into a plateau). For this we can compute the derivatives of the null cline equations at the origin, which give the chance of starting in a WTA region for a distribution that is uniform across angular directions. However, as Figure 5 shows empirically, the basins of attraction of flat plateaus are even larger, because starting in a WTA region is sufficient but not necessary.

B.5 Derivation of the acceleration for supervised learning

We have:

Appendix C Derivations for factored objectives case

C.1 Derivation

We consider objectives where each component is smooth, and can be written in the following form:

where the prefactor is a scalar function mapping to positive numbers that does not depend on the current parameters except via the component values. We further assume that each component is bounded. When the optimum is reached, there is no further learning, so the prefactor vanishes there, but for intermediate points learning is always possible, i.e., the prefactor is strictly positive.

If the above conditions hold, they are sufficient to show that the combined objective admits an ε-plateau as defined in Definition 2. Moreover, it will exhibit saddle points where one component is at its optimum and the other is not.