I Introduction
In recent years, the successful application of deep neural networks (DNNs) in reinforcement learning (RL) [sutton2018reinforcement] has provided a new perspective to boost its performance on high-dimensional continuous problems. With the powerful function approximation and representation learning capabilities of DNNs, deep RL is regarded as a milestone towards constructing autonomous systems with a higher level of understanding of the physical world. Currently, deep RL has demonstrated great potential on complex tasks, from learning to play video games directly from pixels [mnih2013playing, mnih2015human] to making immediate decisions on robot behavior from camera inputs [faust2018prm, chiang2019learning, francis2020long]. However, these successes are limited and prone to catastrophic interference due to the inherent weakness of DNNs in the face of non-stationary data distributions, and they rely heavily on a combination of various subtle strategies, such as experience replay [mnih2013playing] and fixed target networks [mnih2015human], or distributed training architectures [mnih2016asynchronous, bellemare2017distributional, espeholt2018impala].
Catastrophic interference is the primary challenge for many neural-network-based machine learning systems when learning over a non-stationary stream of data [mccloskey1989catastrophic]. It is normally investigated in multi-task continual learning (CL), mainly including supervised continual learning (SCL) for classification tasks [kirkpatrick2017overcoming, fernando2017pathnet, lopez2017gradient, rebuffi2017icarl, mallya2018packnet, delange2021continual] and continual reinforcement learning (CRL) [kirkpatrick2017overcoming, fernando2017pathnet, riemer2019learning, kessler2020unclear, khetarpal2020towards] for decision tasks. In multi-task CL, the agent continually faces new tasks, and the neural network may quickly fit to the data distribution of the current task while potentially overwriting the information related to learned tasks, leading to catastrophic forgetting of the solutions of old tasks. The underlying reason behind this phenomenon is the global generalization and overlapping representations of neural networks [ghiassian2020improving, bengio2020interference]. Neural network training normally assumes that the inputs are independently and identically distributed (i.i.d.) from a fixed data distribution and that the output targets are sampled from a fixed conditional distribution. Only when this assumption is satisfied can positive generalization be ensured among different batches of stochastic gradient descent. However, when the data distribution drifts during training, the information learned from old tasks may suffer negative interference or even be overwritten by newly updated weights, resulting in catastrophic interference.
Deep RL is essentially a CL problem due to its learning mode of exploring while learning [khetarpal2020towards], and it is particularly vulnerable to catastrophic interference, even within many single-task settings (such as Atari 2600 games, or even simpler classic OpenAI Gym environments) [schaul2019ray, fedus2020catastrophic, lo2019overcoming, liu2019utility]. The non-stationarity of data distributions in single-task RL is mainly attributed to the following properties of RL. Firstly, the inputs of RL are sequential observations received from the environment, which are temporally correlated. Secondly, as learning progresses, the agent's decision-making policy changes gradually, which makes the observations non-stationary. Thirdly, RL methods rely heavily on bootstrapping, where the RL agent uses its own estimated value function as the target, making the target outputs also non-stationary. In addition, as noted in
[fedus2020catastrophic], replay buffers with prioritized experience replay [schaul2016prioritized] that preferentially sample experiences with higher temporal-difference (TD) errors will also exacerbate the non-stationarity of training data. Once the distribution of training data encounters notable drift, catastrophic interference and a chain reaction are likely to occur, resulting in a sudden deterioration of the training performance, as shown in Fig. 1. Currently, there are two major strategies for dealing with catastrophic interference in single-task RL training: experience replay [mnih2013playing, mnih2015human] and local optimization [liu2019utility, lo2019overcoming, ghiassian2020improving]. The former usually exhibits extreme sensitivity to key parameters (e.g., replay buffer size) and often requires maintaining a large experience storage memory. Furthermore, when faced with imbalanced state data, interference may still occur even if the memory is sufficiently large. The latter advocates local network updating for data with a specific distribution instead of global generalization, to reduce the representation overlap among different data distributions. The major issues are that some methods are limited in the capability of model transfer among differently distributed data [liu2019utility, lo2019overcoming], cannot scale to high-dimensional complex tasks [ghiassian2020improving], or require pre-training and may not be suitable for the online and incremental setting [liu2019utility].
In this paper, we focus on the catastrophic interference problem caused by state distribution drift in single-task RL. We propose a novel scheme with low buffer-size sensitivity called Context Division and Knowledge Distillation (CDaKD), which estimates the value function online and incrementally for each state distribution by minimizing the weighted sum of the original loss function of RL algorithms and a regularization term addressing the interference among different groups of states. The schematic architecture is shown in Fig. 3. In order to mitigate the interference among different state distributions during model training, we introduce the concept of "context" into single-task RL, and propose a novel context division strategy based on online clustering. We show that it is essential to decouple the correlations among different state distributions with this strategy, which divides the state space into a series of independent contexts (each context is a set of states distributed close to each other, conceptually similar to a "task" in multi-task CRL). To achieve efficient and adaptive partitioning, we employ Sequential K-Means Clustering [dias2008skm] to process the states encountered during training in real time. Then, we parameterize the value function by a neural network with multiple output heads, as commonly used in multi-task learning [zenke2017continual, golkar2019continual, kessler2020unclear], in which each output head specializes on a specific context while the feature extractor is shared across all contexts. In addition, we apply knowledge distillation as a regularization term in the objective function for value function estimation, which can preserve the learned policies while the RL agent is trained on states guided by the current policy, to further avoid the interference caused by the shared low-level representation. Furthermore, to ease the curse of dimensionality in high-dimensional state spaces, we employ a random encoder, as its low-dimensional representation space can effectively capture the information about the similarity among states without any representation learning [seo2021state]. Clustering is then performed in the low-dimensional representation space of the randomly initialized convolutional encoder. Finally, to validate the efficacy of CDaKD, we conduct extensive experiments on several OpenAI Gym standard benchmark environments, containing 4 basic MDP control environments [brockman2016openai] and 6 high-dimensional complex Arcade Learning Environments (ALE) [bellemare2013arcade]. Compared to existing experience-replay-based and local-optimization-based methods, CDaKD features state-of-the-art performance on classic control tasks (CartPole-v0, Pendulum-v0, CartPole-v1, Acrobot-v1) and Atari tasks (Pong, Breakout, Carnival, Freeway, Tennis, FishingDerby). The main contributions of this paper are summarized as follows:

A novel context division strategy is proposed for single-task RL. It is essential because the widely studied multi-task CRL methods cannot be used directly to reduce interference in single-task RL due to the lack of predefined task boundaries. This strategy can detect contexts adaptively online, so that each context can be regarded as a task in multi-task settings. In this way, we bridge the gap between multi-task CRL and single-task RL in terms of the catastrophic interference problem.

A novel RL training scheme called CDaKD, based on multi-head neural networks, is proposed following the context division strategy. By incorporating the knowledge distillation loss into the objective function, our method can better alleviate the interference suffered in single-task RL than existing methods, in an online and incremental manner.

A high-dimensional state representation method based on a random convolutional encoder is introduced, which further boosts the performance of CDaKD on high-dimensional complex RL tasks.

The CDaKD framework is highly flexible and can be incorporated into various RL models. Experiments on classic control tasks and highdimensional complex tasks demonstrate the overall superiority of our method over baselines in terms of plasticity and stability, confirming the effectiveness of the context division strategy.
The rest of this paper is organized as follows. Section II reviews the relevant strategies for alleviating catastrophic interference as well as context detection and identification. Section III introduces the nature of RL in terms of continual learning and gives a definition of catastrophic interference in single-task RL. The details of CDaKD are described in Section IV, and experimental results and analyses are presented in Section V. Finally, this paper is concluded in Section VI with some discussions and directions for future work.
II Related Work
Catastrophic interference within single-task RL is a special case of CRL, which involves not only strategies to mitigate interference but also context detection and identification techniques.
II-A Multi-task Continual Reinforcement Learning
Multi-task CRL has been an active research area with the development of a variety of RL architectures [lesort2020continual]. In summary, existing methods mainly fall into three categories: experience replay methods, regularization-based methods, and parameter isolation methods.
The core idea of experience replay is to store samples of previous tasks in raw format (e.g., Selective Experience Replay (SER) [isele2018selective], Meta Experience Replay (MER) [riemer2019learning], Continual Learning with Experience And Replay (CLEAR) [rolnick2019experience]) or to generate pseudo-samples from a generative model (e.g., Reinforcement Pseudo Rehearsal (RePR) [atkinson2021pseudo]) to maintain knowledge about the past in the model. These previous-task samples are replayed while learning a new task to alleviate interference, either by being reused as model inputs for rehearsal [isele2018selective, atkinson2021pseudo] or by constraining the optimization of the new task loss [rolnick2019experience, riemer2019learning]. As a result, experience replay has become a very successful approach to tackling interference in CRL. However, experience replay in raw format may result in significant storage requirements for more complex CRL settings. Although a generative model removes the need for a replay buffer, it is still difficult for it to capture the overall distribution of previous tasks.
Regularization-based methods avoid storing raw inputs, and thus alleviate the memory requirements, by introducing an extra regularization term into the loss function to consolidate previous knowledge while learning new tasks. The regularization term includes penalty computing and knowledge distillation. The former focuses on reducing the chance of weights being modified. For example, Elastic Weight Consolidation (EWC) [kirkpatrick2017overcoming] and UNcertainty guided Continual LEARning (UNCLEAR) [kessler2020unclear] use the Fisher information matrix to measure the importance of weights and protect important weights on new tasks. The latter is a form of knowledge transfer [hinton2015distilling], which expects that the model trained on a new task can still perform well on the old ones. It is often used for policy transfer from one model to another (e.g., Policy Distillation [rusu2016policy], Genetic Policy Optimization (GPO) [gangwani2018policy], Distillation for Continual Reinforcement learning (DisCoRL) [traore2019discorl]). This family of solutions is easy to implement and tends to perform well on a small number of tasks, but still faces challenges as the number of tasks increases.
Parameter isolation methods dedicate different model parameters to each task, to prevent any possible interference among tasks. Without constraints on the size of neural networks, one can grow new branches for new tasks while freezing previous task parameters (e.g., Progressive Neural Networks (PNN) [rusu2016progressive]). Alternatively, the architecture remains static, with fixed parts allocated to each task. For instance, PathNet [fernando2017pathnet] uses a genetic algorithm to find a path from input to output for each task in the neural network, and isolates the used network parts, at the parameter level, from the training of new tasks. These methods typically require networks with enormous capacity, especially when the number of tasks is large, and there is often unnecessary redundancy in the network structure, posing a great challenge to model storage and efficiency.
II-B Single-task Reinforcement Learning
Compared with multi-task CRL, catastrophic interference in single-task RL remains an emerging research area that has been relatively underexplored. There are two primary aspects of previous studies: one is finding supporting evidence to confirm that catastrophic interference is indeed prevalent within a specific RL task, and the other is proposing effective strategies for dealing with it.
Researchers at DeepMind studied the learning dynamics of single-task RL and developed the hypothesis that the characteristic coupling between learning and data generation is the main cause of interference and performance plateaus in deep RL systems [schaul2019ray]. Recent studies further confirmed this hypothesis and its universality in single-task RL through large-scale empirical studies (called Memento experiments) in Atari 2600 games [fedus2020catastrophic]. However, none of these studies has suggested any practical solution for tackling the interference.
In order to mitigate interference, many deep RL algorithms such as DQN [mnih2015human] and its variants (e.g., Double DQN [van2016deep], Rainbow [hessel2018rainbow]) employ experience replay and fixed target networks to produce approximately i.i.d. training data, which may quickly become intractable in terms of memory requirement as task complexity increases. Furthermore, even with sufficient memory, it is still possible to suffer from catastrophic interference due to the imbalanced distribution of experiences.
In recent studies [liu2019utility, lo2019overcoming, ghiassian2020improving], researchers proposed methods based on local representation and optimization of neural networks, showing that interference can be reduced by promoting local updating of weights while avoiding global generalization. Sparse Representation Neural Network (SRNN) [liu2019utility] induces sparse representations in neural networks by introducing a distributional regularizer, which requires a large batch of data generated by a fixed policy that covers the space for pre-training. Dynamic Self-Organizing Map (DSOM) [lo2019overcoming] with neural networks introduces a DSOM module to induce such locality in updates. These methods can reduce interference to some extent, but they may inevitably suffer from a lack of positive transfer in the representation layer, which is not desirable in complex tasks. Recently, discretizing (DNN) and tile coding (TCNN) were used to remap the input observations to a high-dimensional space to sparsify input features, reducing the activation overlap [ghiassian2020improving]. However, tile coding increases the dimension of the inputs to a neural network, which can lead to scalability issues for spaces with high dimensionality.
II-C Context Detection and Identification
Context detection and identification is a fundamental step for learning task relatedness in CL. Most multi-task CL methods mentioned above rely on well-defined task boundaries, and are usually trained on a sequence of tasks with known labels or boundaries. Existing context detection approaches commonly leverage statistics or Bayesian inference to detect task boundaries.
On the one hand, some methods tend to react to a changing distribution by finding change points in the pattern of state-reward tuples (e.g., Context QL [padakandla2020reinforcement]), tracking the difference between the short-term and long-term moving average rewards (e.g., CRLUnsup [lomonaco2020continual]), or splitting a game into contexts using the undiscounted accumulated game score as a task contextualization [jain2020algorithmic]. These methods can be agile in responding to scenarios with abrupt changes among contexts or tasks, but are insensitive to smooth transitions from one context to another.
On the other hand, some more ambitious approaches try to learn a belief over the unobserved context state directly from the history of environment interactions, such as the Forget-me-not Process (FMN) [milan2016forget] for piecewise-repeating data-generating sources, and Continual Unsupervised Representation Learning (CURL) [rao2019continual] for task inference without any knowledge about task identity. However, both need to be pre-trained with the complete data when applied to CL problems, and CURL itself also requires additional techniques to deal with catastrophic interference.
Furthermore, Ghosh et al. [ghosh2018divide] proposed to partition the initial state space into a finite set of contexts by performing a K-Means clustering procedure, which can decompose more complex tasks, but cannot completely decouple the correlations among different state distributions from the perspective of interference prevention.
III Preliminaries and Problem Statement
To better characterize the problem studied in this paper, some key definitions and terminology of CRL problems are introduced in this section.
III-A Definitions and Terminology
Some important definitions of RL relevant to this paper are presented as follows.
Definition 1 (RL Paradigm [sutton2018reinforcement]).
An RL problem is regarded as a Markov Decision Process (MDP), which is defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the set of states; $\mathcal{A}$ is the set of actions; $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the environment transition probability function; $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function; and $\gamma \in [0, 1)$ is the discount factor.
According to Definition 1, at each time step $t$, the agent moves from $s_t$ to $s_{t+1}$ with probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ after taking action $a_t$, and receives reward $r_t$. Based on this definition, the optimization objective of value-based RL models is defined as follows:
Definition 2 (RL Optimization Objective [khetarpal2020towards]).
The optimization objective of value-based RL is to learn a policy $\pi_\theta$ with internal parameter $\theta$ that maximizes the expected long-term discounted return for each $s_t$ in time, also known as the value function:
$$V_{\pi_\theta}(s_t) = \mathbb{E}_{\mathcal{P}, \pi_\theta}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\Big] \qquad (1)$$
Here, the expectation is over the process that generates a history using $\mathcal{P}$ and decides actions from $\pi_\theta$ until the end of the agent's lifetime.
The optimization objective in Definition 2 does not just concern itself with the current state, but also with the full expected future distribution of states. As such, it is possible to overcome catastrophic interference for RL over non-stationary data distributions. However, much of the recent work in RL has been in so-called episodic environments, which optimize the episodic RL objective:
Definition 3 (Episodic RL Optimization Objective [khetarpal2020towards]).
Given some future horizon $H$, find a policy $\pi_\theta$ optimizing the expected discounted returns:
$$V_{\pi_\theta}^{H}(s_t) = \mathbb{E}_{\mathcal{P}, \pi_\theta}\Big[\sum_{k=0}^{H} \gamma^k r_{t+k}\Big] \qquad (2)$$
Here, to ensure the feasibility and ease of implementation of optimization, the objective is only optimized over a future horizon $H$, until the current episode terminates.
It is clear that the episodic objective in Definition 3 is biased towards the current episode distribution, while ignoring the possibly far more important future episode distributions over the agent's lifetime. Plugging such an objective directly into non-stationary RL settings leads to biased optimization, which is likely to cause catastrophic interference effects.
For large-scale domains, the value function is often approximated with a member of a parametric function class, such as a neural network with parameter $\theta$, expressed as $Q(s, a; \theta)$, which is fit online using experience samples of the form $(s, a, r, s')$. This experience is typically collected into a buffer $D$ from which batches are later drawn at random to form a stochastic estimate of the loss:
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[\ell\big(r + \gamma \max_{a'} Q(s', a'; \bar{\theta}),\; Q(s, a; \theta)\big)\Big] \qquad (3)$$
where $\ell$ is the agent's loss function, and $D$ is the distribution that defines its sampling strategy. In general, the parameter $\bar{\theta}$ used to compute the target is a prior copy of the parameter $\theta$ used for action selection (as in the settings of DQN [mnih2015human]).
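As a concrete illustration of this sampled loss, the following toy sketch (with a hypothetical linear Q-function, made-up dimensions, and a squared-error loss standing in for $\ell$; not the paper's implementation) forms bootstrap targets from a frozen prior copy of the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(states, theta):
    """Toy linear Q-function: one column of weights per action (illustrative only)."""
    return states @ theta

# Hypothetical dimensions: 4-d states, 2 actions, a batch of 32 transitions.
theta = rng.normal(size=(4, 2))   # online parameters
theta_target = theta.copy()       # prior copy used to compute the bootstrap target
gamma = 0.99

states = rng.normal(size=(32, 4))
actions = rng.integers(0, 2, size=32)
rewards = rng.normal(size=32)
next_states = rng.normal(size=(32, 4))

# Stochastic estimate of the loss in Eq. (3) with a squared-error loss.
targets = rewards + gamma * q_values(next_states, theta_target).max(axis=1)
predictions = q_values(states, theta)[np.arange(32), actions]
loss = np.mean((targets - predictions) ** 2)
```

Holding `theta_target` fixed between periodic syncs is what keeps the regression target quasi-stationary between updates.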
In addition, it is necessary to clarify some important terms in relation to CL.
1) Non-stationary [hadsell2020embracing]: a process whose state or probability distribution changes with time.
2) Interference [bengio2020interference]: a type of influence between two gradient-based processes with objectives $J_1$, $J_2$, sharing parameter $\theta$. Interference is often characterized in the first order by the inner product of their gradients:
$$\rho_{1,2} = \nabla_\theta J_1^{\top} \nabla_\theta J_2 \qquad (4)$$
and can be seen as being constructive ($\rho_{1,2} > 0$, transfer) or destructive ($\rho_{1,2} < 0$, interference), when applying a gradient update using $\nabla_\theta J_1$ on the value of $J_2$.
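This sign convention can be checked with a small numerical example (the gradient values below are made up for illustration):

```python
import numpy as np

# Hypothetical gradients of two objectives J1, J2 w.r.t. the shared parameter.
g1 = np.array([0.5, -1.0, 0.2])
g2_transfer = np.array([0.4, -0.8, 0.1])   # points in a similar direction
g2_interfere = np.array([-0.5, 1.0, 0.3])  # points in an opposing direction

rho_t = float(g1 @ g2_transfer)    # positive: constructive (transfer)
rho_i = float(g1 @ g2_interfere)   # negative: destructive (interference)
```

A step along `g1` increases `J2` to first order in the transfer case and decreases it in the interference case.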
3) Catastrophic Interference [hadsell2020embracing]: a phenomenon observed in neural networks where learning a new task significantly degrades the performance on previous tasks.
III-B Problem Statement
The interference within single-task RL can be approximately measured by the difference in TD errors before and after a model update under the current policy, referred to as Approximate Expected Interference (AEI) [liu2020towards]:
$$\mathrm{AEI} = \mathbb{E}_{(s, a) \sim d_\pi}\big[\delta^2(s, a; \theta_{t+1}) - \delta^2(s, a; \theta_t)\big] \qquad (5)$$
where $d_\pi$ is the distribution of $(s, a)$ under the current policy and $\delta$ is the TD error.
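A minimal sketch of how such a measure could be computed, assuming squared TD errors recorded on the same buffer of transitions before and after one update (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def aei(td_errors_before, td_errors_after):
    """Approximate Expected Interference: mean change in squared TD error
    over a fixed buffer of recent transitions (sketch of Eq. (5))."""
    before = np.asarray(td_errors_before)
    after = np.asarray(td_errors_after)
    return float(np.mean(after ** 2 - before ** 2))

# Hypothetical TD errors on the same transitions before/after one update.
before = np.array([0.2, -0.1, 0.3])
after = np.array([0.5, -0.4, 0.6])  # errors grew: the update interfered
interference = aei(before, after)   # positive value signals interference
```

A positive value means the update increased the TD error on states the agent had already fit, i.e., destructive interference.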
To illustrate the interaction between interference and the agent's performance during single-task RL training, we run an experiment on CartPole using the DQN implemented in OpenAI Baselines (a set of high-quality implementations of RL algorithms by OpenAI: https://github.com/openai/baselines), and set the replay buffer size to 100, a capacity small enough to trigger interference and highlight its effect. We trained the agent for 300K environment steps and approximated $d_\pi$ with a buffer of capacity 10K containing recent transitions, to evaluate the AEI value according to Eq. (5) after each update. Fig. 2 shows two segments of the interference and performance curves during training, from which we can see that the performance started to oscillate whenever AEI started to increase (see the marked points in Figs. 2(a) and 2(b)). In general, the performance of the agent tends to drop significantly in the presence of increasing interference. This result provides direct evidence that interference is closely correlated with the stability and plasticity of the single-task RL model.
From the above analysis, we state the problem investigated in this paper as follows: propose a novel and effective training scheme for single-task RL that alleviates catastrophic interference and reduces performance oscillation during training, improving stability and plasticity simultaneously.
IV The Proposed Method
In this section, we give a detailed description of our CDaKD scheme, whose architecture is shown in Fig. 3. CDaKD consists of three main components, which are jointly optimized to mitigate catastrophic interference in single-task RL: context division, knowledge distillation, and the collaborative training of the multi-head neural network. On the basis of CDaKD, we further propose CDaKD-RE, which adds a random encoder for the efficient contextualization of high-dimensional state spaces.
As mentioned before, catastrophic interference is an undesirable by-product of global updates to the neural network weights on data whose distribution changes over time. A rational solution to this issue is to estimate an individual value function for each distribution, instead of using a single value function for all distributions. When an agent updates its value estimate of a state, the update should only affect the states within the same distribution. With this intuition in mind, we adopt a multi-head neural network with shared representation layers to parameterize the distribution-specific value functions.
The CDaKD scheme proposed in this paper can be incorporated into any existing value-based RL method to train a piecewise value function for single-task RL. The neural network is parameterized by a shared feature extractor and a set of linear output heads, one per context. As shown in Fig. 3, the set of weights of the value function is denoted by $\theta = \{\theta^s, \theta^1, \ldots, \theta^K\}$, where $\theta^s$ is the set of shared parameters, while each $\theta^i$ is context-specific: $\theta^c$ is for the context that corresponds to the current input state $s$, and the remaining $\theta^i$ ($i \neq c$) are for the other contexts. In this section, we take the combination of CDaKD and DQN as an illustrative example.
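The multi-head parameterization can be sketched as follows (a toy NumPy stand-in with made-up dimensions and a single hidden layer; not the actual CDaKD network):

```python
import numpy as np

class MultiHeadQ:
    """Sketch of a multi-head value function: a shared feature extractor
    and one linear output head per context (dimensions are illustrative)."""
    def __init__(self, state_dim, n_actions, n_contexts, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.normal(scale=0.1, size=(state_dim, hidden))
        self.heads = [rng.normal(scale=0.1, size=(hidden, n_actions))
                      for _ in range(n_contexts)]

    def q(self, state, context):
        features = np.tanh(state @ self.W_shared)  # shared representation
        return features @ self.heads[context]      # context-specific head

net = MultiHeadQ(state_dim=4, n_actions=2, n_contexts=3)
s = np.ones(4)
# Each context head produces its own value estimate for the same state.
qs = [net.q(s, c) for c in range(3)]
```

Updating one head leaves the other heads' parameters untouched; only the shared extractor is common, which is exactly the part the distillation term later protects.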
IV-A Context Division
In MDPs, states (or “observations”) represent the most comprehensive information regarding the environment. To better understand the states of different distributions, we define a variable for a set of states that are close to each other in the state space, referred to as a “context”. Formally,
$$\mathcal{C} = \{c_1, c_2, \ldots, c_K\}, \qquad \mathcal{S} = \bigcup_{i=1}^{K} \mathcal{S}_i \qquad (6)$$
where $\mathcal{C}$ is a finite set of contexts and $K$ is the number of contexts. For an arbitrary MDP, we partition its state space $\mathcal{S}$ into $K$ contexts, where all states within each context follow approximately the same distribution, to decouple the correlations among states against distribution drift. More precisely, for a partition $\{\mathcal{S}_1, \ldots, \mathcal{S}_K\}$ of $\mathcal{S}$ in Eq. (6), we associate a context $c_i$ with each set $\mathcal{S}_i$, so that for $s \in \mathcal{S}_i$, $c(s) = c_i$, where $c(\cdot)$ can be thought of as a function of the state $s$.
The inherent learning-while-exploring feature of RL agents means that an agent generally does not experience all possible states of the environment while searching for the optimal policy. Thus, it is unnecessary to process the entire state space. Based on this fact, in CDaKD, we only perform context division on states experienced during training. In this paper, we employ Sequential K-Means Clustering [dias2008skm] to achieve context detection adaptively (see Appendix A for more details).
In Fig. 3, $K$ centroids are initialized at random in the state space. In each subsequent time step $t$, we execute the State Assignment and Centroid Update steps for each incoming state $s_t$ received from the environment, and store its corresponding transition $(s_t, a_t, r_t, s_{t+1}, c_t)$ into the replay buffer $D$. Accordingly, in the training phase, we randomly sample a batch of transitions from $D$ and train the shared feature extractor and the specific output head corresponding to the input state simultaneously, while fine-tuning the other output heads to avoid interference with learned policies. Since we store the context label of each state in the replay buffer, no additional state assignments are required at update steps (in CDaKD, we only need to perform state assignment once for each state).
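The State Assignment and Centroid Update steps can be sketched as a standard sequential (online) K-Means procedure (a simplified illustration with a count-based decaying step size; the paper's exact update rule may differ):

```python
import numpy as np

class SequentialKMeans:
    """Sketch of sequential (online) K-Means for context detection: each
    incoming state is assigned to its nearest centroid, and that centroid
    is then moved toward the state with a count-based step size."""
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float)
        self.counts = np.ones(len(self.centroids))

    def assign_and_update(self, state):
        dists = np.linalg.norm(self.centroids - state, axis=1)
        c = int(np.argmin(dists))                  # State Assignment
        self.counts[c] += 1
        self.centroids[c] += (state - self.centroids[c]) / self.counts[c]  # Centroid Update
        return c  # context label, stored with the transition in the buffer

km = SequentialKMeans(centroids=[[0.0, 0.0], [10.0, 10.0]])
c0 = km.assign_and_update(np.array([0.5, 0.5]))   # near the first centroid
c1 = km.assign_and_update(np.array([9.0, 9.5]))   # near the second centroid
```

Each incoming state costs one pass over the $K$ centroids, so the procedure runs in real time alongside environment interaction.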
Note that it is also possible to conduct context division based on the initial state distribution [ghosh2018divide]. By contrast, we show that partitioning all states experienced during training produces more accurate and effective context division results, as the trajectories starting from initial states within different contexts have a high likelihood of overlapping in subsequent time steps (see Appendix B for more detailed experiments and analysis).
Interference Among Contexts: We investigate the interference among contexts obtained by our context division method in detail. Specifically, we measure the Huber loss of TD errors in different contexts of a game as the agent learns in other contexts, and then record the relative changes in loss before and after the agent's learning, as shown in Fig. 4. The results show that long-term training on any context may lead to negative generalization on all other contexts, even in simple RL tasks (i.e., CartPole-v0 and Pendulum-v0).
Computational Complexity: Assuming a $d$-dimensional environment with $K$ contexts, the time and space complexities of our proposed context division module to process $T$ environment steps are $O(TKd)$ and $O(Kd)$, respectively, since each step compares the incoming state against $K$ centroids of dimension $d$.
IV-B Knowledge Distillation
The shared low-level representation can cause learning in new contexts to interfere with previous learning results, leading to catastrophic interference. A relevant technique to address this issue is knowledge distillation [hinton2015distilling], which works well for encouraging the outputs of one network to approximate the outputs of another. The concept of distillation was originally used to transfer knowledge from a complex ensemble of networks to a relatively simple network, to reduce model complexity and facilitate deployment. In CDaKD, we use it as a regularization term in value function estimation to preserve previously learned information.
When training the model on a specific context, we need to consider two aspects of the loss function: the general loss on the current training context (denoted by $L_{new}$), and the distillation loss on the other contexts (denoted by $L_{old}$). The former encourages the model to adapt to the current context to ensure plasticity, while the latter encourages the model to keep the memory of the other contexts, preventing interference.
To incorporate CDaKD into the DQN framework, we rewrite the original loss function of DQN with the context variable $c$ as:
$$L_{new}(\theta^s, \theta^c) = \mathbb{E}_{(s, a, r, s', c) \sim D}\Big[\ell\big(y_c,\; Q(s, a; \theta^s, \theta^c)\big)\Big] \qquad (7)$$
where $y_c = r + \gamma \max_{a'} Q(s', a'; \bar{\theta}^s, \bar{\theta}^c)$ is the estimated target value of $Q(s, a; \theta^s, \theta^c)$, $D$ is the distribution of samples, i.e., transitions of the form $(s, a, r, s', c)$, and $\ell$ is the agent's loss function.
For each of the other contexts that the environment contains, we expect the output value for each pair $(s, a)$ to be close to the recorded output from the original network. In knowledge distillation, we regard the learned function before the current update step as the teacher network, expressed as $Q(s, a; \tilde{\theta}^s, \tilde{\theta}^i)$, and the current network to be trained as the student network, expressed as $Q(s, a; \theta^s, \theta^i)$, for every context $i$ except the current context $c$. Thus, the distillation loss is defined as:
$$L_{old}(\theta^s, \{\theta^i\}_{i \neq c}) = \sum_{i \neq c} \mathbb{E}_{(s, a) \sim D}\Big[\ell_d\big(Q(s, a; \tilde{\theta}^s, \tilde{\theta}^i),\; Q(s, a; \theta^s, \theta^i)\big)\Big] \qquad (8)$$
where $\ell_d$ is the distillation loss function of the output head corresponding to context $c_i$.
IV-C Joint Optimization Procedure
To optimize a value function that can guide the agent to make proper decisions in each context without being adversely affected by catastrophic interference, we combine Eqs. (7) and (8) to form a joint optimization framework. Namely, we solve the catastrophic interference problem through the following optimization objective:
$$L(\theta) = L_{new} + \beta L_{old} \qquad (9)$$
where $\beta$ is a coefficient that controls the trade-off between the plasticity and stability of the neural network.
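A minimal sketch of the joint objective, assuming squared-error forms for both the TD loss and the distillation loss (the coefficient name `beta` and all values are illustrative):

```python
import numpy as np

def joint_loss(q_new, target_new, q_old_student, q_old_teacher, beta=0.5):
    """Sketch of Eq. (9): TD loss on the current context plus a distillation
    penalty keeping the other heads close to their pre-update (teacher)
    outputs; beta trades off plasticity against stability."""
    l_new = np.mean((target_new - q_new) ** 2)            # current-context loss
    l_old = np.mean((q_old_teacher - q_old_student) ** 2) # distillation loss
    return l_new + beta * l_old

# Hypothetical values: current-context predictions vs. bootstrap targets,
# and other-context outputs before (teacher) and after (student) the update.
loss = joint_loss(q_new=np.array([1.0, 0.5]),
                  target_new=np.array([1.2, 0.4]),
                  q_old_student=np.array([0.9, 0.1]),
                  q_old_teacher=np.array([1.0, 0.2]))
```

With `beta=0` the objective reduces to plain per-context TD learning; larger values penalize drift of the non-current heads more strongly.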
The complete procedure is described in Algorithm 1. The proposed method learns the value function and the context division policy in parallel. For network training, to reduce the correlations with the target and ensure the stability of model training, the target network parameter $\bar{\theta}$ is only updated with the network parameter $\theta$ every $C$ steps and is held fixed between individual updates, as in DQN [mnih2015human]. Similarly, we also adopt fixed target context centroids ($\bar{\mu}_1, \ldots, \bar{\mu}_K$) to avoid the instability of RL training caused by constantly updated context centroids ($\mu_1, \ldots, \mu_K$). To simplify the model implementation, we set the update frequency of the target context centroids to be consistent with that of the target network.
IV-D Random Encoders for High-dimensional State Spaces
For high-dimensional state spaces, we propose to use random encoders for efficient context division, which map high-dimensional inputs into low-dimensional representation spaces, easing the curse of dimensionality. Although the original RL model already contains an encoder module, it is constantly updated, and directly performing clustering in its representation space may introduce extra instability into context division. Therefore, on the basis of CDaKD, we exploit a dedicated random encoder module for dimension reduction. Fig. 5 illustrates this updated framework, called CDaKD-RE, in which the structure of the random encoder is consistent with the underlying RL encoder, but its parameters are randomly initialized and fixed throughout training. We provide the full procedure of CDaKD-RE in Appendix C.
The main motivation for using random encoders arises from the observation that distances in the representation space of a random encoder are adequate for finding similar states without any representation learning [seo2021state] (see Appendix C).
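A minimal sketch of this idea, using a fixed random linear projection in place of the random convolutional encoder (the input dimension, output dimension, and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_random_encoder(dim_in, dim_out=50):
    """Fixed random linear projection standing in for the random conv
    encoder; its weights are drawn once and never updated."""
    P = rng.normal(size=(dim_in, dim_out)) / np.sqrt(dim_out)
    return lambda x: x @ P

encode = make_random_encoder(dim_in=784)      # e.g., a flattened toy frame
states = rng.normal(size=(200, 784))          # stand-in for stored states
z = encode(states)                            # low-dimensional codes

# A slightly perturbed copy of state 0 should retrieve state 0 as its
# nearest neighbor in the representation space.
query = states[0] + 0.01 * rng.normal(size=784)
dists = np.linalg.norm(z - encode(query), axis=1)
nearest = int(np.argmin(dists))
```

Random projections approximately preserve pairwise distances (in the Johnson-Lindenstrauss sense), which is why similarity structure survives without any learning; clustering for context division then operates on `z` instead of the raw states.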
V Experiments and Evaluations
In this section, we conduct comprehensive experiments on several standard benchmarks from OpenAI Gym (a publicly available implementation repository of RL environments: https://github.com/openai/gym), containing 4 classic control tasks and 6 high-dimensional complex Atari games, to demonstrate the effectiveness of our method (experiment code: please refer to the supplementary material). Several state-of-the-art methods are employed as our baselines, including experience replay based methods (e.g., DQN [mnih2015human], Rainbow [hessel2018rainbow]), local optimization based methods (e.g., SRNN [liu2019utility]), and other context division techniques (e.g., Game Score (GS) [jain2020algorithmic], Initial States (IS) [ghosh2018divide]).
V-A Datasets
Classic Control [brockman2016openai] contains 4 classic control tasks: CartPole-v0, Pendulum-v0, CartPole-v1, and Acrobot-v1, where the dimensions of the state spaces range from 3 to 6. The maximum episode lengths are 200 time steps for CartPole-v0 and Pendulum-v0, and 500 for CartPole-v1 and Acrobot-v1. Meanwhile, the reward thresholds used to determine the success of tasks are 195.0, 475.0, and -100.0 for CartPole-v0, CartPole-v1, and Acrobot-v1, respectively, while the threshold for Pendulum-v0 is not available. We choose these domains as they are well-understood and relatively simple, suitable for highlighting the mechanism and verifying the effectiveness of our method in a straightforward manner.
Atari Games [bellemare2013arcade] contain 6 image-level complex tasks: Pong, Breakout, Carnival, Freeway, Tennis, and FishingDerby, where the observation is a screenshot represented by an RGB image of size 210x160x3. We choose these domains to further demonstrate the scalability of our method on high-dimensional complex tasks.
V-B Implementation
Network Structure. For the 4 classic control tasks, we employ a fully-connected layer as the feature extractor and a fully-connected layer as the multi-head action scorer, following the network configuration for this type of task in OpenAI Baselines. For the 6 Atari games, we employ a convolutional neural network similar to [hessel2018rainbow], [castro18dopamine] for feature extraction (i.e., the RL Encoder in Fig. 5) and two fully-connected layers as the multi-head action scorer. In addition, for random encoders, we use convolutional neural networks with the same structure as the underlying RL methods, but we do not update their randomly initialized parameters during training [seo2021state]. Details of the networks can be found in Appendix D.

Parameter Setting. In CDaKD, there are two key parameters: $\lambda$ and $K$. To simplify parameter setting, we set $\lambda$ in accordance with the exploration proportion $\epsilon$ of the agent in all experiments: $\lambda = 1 - \epsilon$, due to the inverse relationship between them in training. During early training, $\epsilon$ is close to 1, the model is normally inaccurate with little interference, and a small $\lambda$ (close to 0) promotes the plasticity of the neural network. Then, as $\epsilon$ gradually approaches 0 during subsequent training, the model has accumulated more and more useful information, and interference becomes likely to occur. Consequently, smoothly increasing $\lambda$ is needed to ensure stability and avoid catastrophic interference. Meanwhile, we set $K$ to 3 for all classic control tasks and 4 for all Atari games; larger values are suggested for complex environments to achieve a reasonable state decoupling effect. In CDaKD-RE, there is an extra parameter: the output dimension of the random encoder. We set it to 50 as in [seo2021state], which has been shown to be both efficient and effective. More details of the parameter setting can be found in Appendix D.

For classic control tasks, we evaluate the training performance using the average episode returns every 10K time steps for CartPole-v0 and Pendulum-v0, and every 20K time steps for CartPole-v1 and Acrobot-v1. For Atari games, the evaluation interval is 200K time steps. All reported results are the average episode returns over 5 independent runs.
V-C Baselines
We compare our method with five state-of-the-art methods: DQN [mnih2015human] and Rainbow [hessel2018rainbow], based on experience replay; SRNN [liu2019utility], based on local optimization for alleviating catastrophic interference; and two context division techniques (i.e., Game Score (GS) [jain2020algorithmic] and Initial States (IS) [ghosh2018divide]) for efficient contextualization. We briefly introduce these baselines below.

DQN [mnih2015human] is a representative algorithm of Deep RL, which reduces catastrophic interference using experience replay and fixed target networks. We use the DQN agent implemented in OpenAI Baselines.

Rainbow [hessel2018rainbow] is an upgraded version of DQN with six extensions, including a prioritized replay buffer [schaul2016prioritized], n-step returns [sutton2018reinforcement], and distributional learning [bellemare2017distributional] for stable RL training. The Rainbow agent is implemented in Google's Dopamine framework (a research framework developed by Google for fast prototyping of reinforcement learning algorithms: https://github.com/google/dopamine).

SRNN [liu2019utility] employs a distributional regularizer to induce sparse representations in neural networks to avert catastrophic interference in single-task RL.

CDaKD-GS is our extension to DQN using the CDaKD scheme, where the context division is based on the undiscounted accumulated game score as in [jain2020algorithmic] instead of all experienced states.

CDaKD-IS is another extension to DQN using the CDaKD scheme, where the context division is based on the initial state distribution as in [ghosh2018divide] instead of all experienced states.
V-D Evaluation Metrics
Following the convention in previous studies [mnih2016asynchronous, bellemare2017distributional, espeholt2018impala, hessel2018rainbow, fedus2020catastrophic], we employ the average training episode return to evaluate our method during training:

$$\bar{R} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} r_{i,t}, \qquad (10)$$

where $N$ is the number of episodes experienced within each evaluation period, $T_i$ is the total number of time steps in episode $i$, and $r_{i,t}$ is the reward received at time step $t$ in episode $i$.
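Eq. (10) amounts to the mean of per-episode summed rewards collected within one evaluation period; a minimal sketch with toy episodes of different lengths:

```python
import numpy as np

def average_episode_return(episode_rewards):
    """Eq. (10): mean over episodes of the per-episode summed rewards
    collected within one evaluation period."""
    return float(np.mean([np.sum(r) for r in episode_rewards]))

# Two toy episodes: returns 3.0 and 1.0, so the average is 2.0.
avg = average_episode_return([[1.0, 0.0, 2.0], [0.5, 0.5]])
```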
V-E Results
Table I: The highest cumulative returns achieved during training by DQN, Rainbow, DQN+CDaKD-RE, and Rainbow+CDaKD-RE on the 6 Atari games (Pong, Breakout, Carnival, Freeway, Tennis, FishingDerby) under different replay buffer capacities N (based on the performance of five runs in Fig. LABEL:fig:Atari_games).

Table II: The maximum performance deterioration ratios of DQN, Rainbow, DQN+CDaKD-RE, and Rainbow+CDaKD-RE on the 6 Atari games under different replay buffer capacities N, together with the per-method averages (based on the average performance of five runs in Fig. LABEL:fig:Atari_games).
The results on the 4 control tasks are presented in Fig. LABEL:fig:Classic_Control, showing the learning curves during training for each task with three levels of replay buffer capacity. Note that CDaKD-ES is our proposed scheme where the context division is based on all experienced states (in this paper, all appearances of CDaKD refer to CDaKD-ES unless otherwise stated). In general, CDaKD-ES is clearly superior to all baselines in terms of plasticity and stability, especially when the replay buffer capacity is small or there is no experience replay at all. In most tasks, CDaKD-ES achieves near-optimal performance as well as good stability even without any experience replay. For Pendulum-v0 and Acrobot-v1, a large replay buffer can help DQN and SRNN escape from catastrophic interference. However, this is not the case for the two CartPole tasks, where the agents exhibit fast initial learning but then collapse in performance.
The learning curves on the 6 Atari games are shown in Fig. LABEL:fig:Atari_games. Moreover, the highest cumulative returns (an indicator of plasticity) achieved during the training of each task are summarized in Table I, and the maximum deterioration ratios (an indicator of stability) relative to the previous maximal episode returns are given in Table II. Overall, for high-dimensional image inputs, the training performance of the original RL algorithms can be noticeably improved with our CDaKD-RE scheme. In Fig. LABEL:fig:Atari_games, our method significantly outperforms DQN on 7 out of 12 settings and is comparable with DQN on the remaining 5. Similarly, our method outperforms Rainbow on 8 out of 12 settings and is comparable with Rainbow on the remaining 4. Furthermore, as shown in Tables I and II, CDaKD-RE achieves higher maximum cumulative scores and less performance degradation in most tasks compared to its counterparts. Among the 24 training settings, only 2 maximum cumulative scores achieved by CDaKD-RE are slightly lower than those of the baselines. In terms of the average performance degradation ratio, DQN and Rainbow incorporated with CDaKD-RE both surpass the original RL methods. Note that, even with a large memory, CDaKD-RE still shows certain advantages over the baselines.
Moreover, we observe that: 1) DQN and SRNN agents are highly sensitive to the replay buffer capacity. They generally perform well with a large buffer (except on CartPole-v1), but their performance deteriorates significantly as the buffer capacity is reduced. DQN performs worst among all baselines when there is no experience replay, as both DQN and SRNN may face severe data drift with a small buffer, while approximately i.i.d. training data are required to avoid interference in training. 2) CDaKD-GS and CDaKD-IS leverage the cumulative game score and the initial state distribution to partition contexts, respectively. However, neither the game score nor the initial state distribution is a reliable determining factor for context boundaries, making it difficult to achieve the necessary decoupling of differently distributed states. Furthermore, they require prior knowledge about the score of each game level or the initial state distribution of the environment.
In summary, the proposed techniques, context division based on the clustering of all experienced states and knowledge distillation in multi-head neural networks, can effectively eliminate catastrophic interference caused by data drift in single-task RL. In addition, our method leverages a fixed, randomly initialized encoder to characterize the similarity among states in a low-dimensional representation space, which can be used to partition contexts effectively in high-dimensional environments.
V-F Analysis
1) Ablation Study: Since our method can be regarded as an extension to existing RL methods (e.g., DQN [mnih2015human]) with three novel components (i.e., adaptive context division by online clustering, knowledge distillation, and the multi-head neural network), the ablation experiments are designed as follows:

No clustering means using a random partition of the raw state space instead of adaptive context division by online clustering.

No distillation means removing the distillation loss from the joint objective function in Eq. (9) (i.e., setting $\lambda = 0$).

No multi-head means removing the context division module and optimizing the neural network with a single-head output (i.e., $K = 1$). Here, the distillation term becomes the distillation of the network before each update of the single output head.
The results of the ablation experiments are shown in Fig. LABEL:fig:ablation_study, using the classic control tasks for convenience of validation. From Fig. LABEL:fig:ablation_study, the following facts can be observed: 1) Across all settings, the overall performance of DQN is the worst, showing the effectiveness of the three components introduced for coping with catastrophic interference in single-task RL, although the contribution of each component varies substantially per task; 2) Removing online clustering from the context division module is likely to damage the performance in most cases; 3) Without the multi-head component, our model is equivalent to a DQN with an extra distillation loss, which generally performs better than DQN; 4) Removing knowledge distillation makes the performance deteriorate on almost all tasks, indicating that knowledge distillation is a key element of our method.
2) Parameter Analysis: There are two critical parameters in CDaKD: $\lambda$ and $K$. By its nature, $\lambda$ is related to the training process. Since we need to preserve the learned good policies during training, it is intuitive to gradually increase $\lambda$ until its value reaches 1. The reason is that, in early-stage training, the model has not yet learned sufficiently useful information, so the distillation constraint can be ignored. As training progresses, the model acquires more and more valuable information and needs to pay serious attention to interference to protect the learned policies while learning further. In our experiments, we recommend setting $\lambda$ to be inversely related to the exploration proportion $\epsilon$, and the results in Figs. LABEL:fig:Classic_Control and LABEL:fig:Atari_games demonstrate the simplicity and effectiveness of this setting.
To investigate the effect of $K$, which is related to the complexity of learning tasks, we conduct experiments with different values of $K$ and show the results in Fig. LABEL:fig:parameter_study. In our experiments, the best-performing value of $K$ differs across tasks: one value suits both CartPole-v0 and CartPole-v1, while another suits CDaKD on Acrobot-v1. It is worth noting that, on Pendulum-v0, our method achieves similar performance with $K$ set to 2, 3, 4, and 5, respectively, but none of these settings is satisfactory. A possible explanation is that the agents failed to learn any useful information due to the limited state space explored in early training, leading to the failure of further learning.
In summary, we can make the following statements: 1) The performance of DQN combined with CDaKD is significantly better than that of the original DQN regardless of the specific value of $K$, confirming the effectiveness of our CDaKD scheme; 2) With a suitably chosen $K$, better performance of CDaKD can be expected. However, large values are not always desirable, as they result in more fine-grained context divisions and more complex neural networks with a large number of output heads, making the model unlikely to converge satisfactorily within a limited number of training steps. Thus, we recommend setting the value of $K$ by taking into consideration the state-space structure of the specific task.
3) Convergence Analysis: To analyze convergence, we track the agent's average loss and average predicted action-value during training. From Fig. 10, we can conclude that: 1) Our method has better convergence and stability in the face of interference compared with the original RL algorithms (see Figs. 10(a) and 10(b)); 2) For a held-out set of states, the average maximum predicted action-value of each output head reflects the expected differences among contexts (see Fig. 10(c)), and the final output of the CDaKD agent is synthesized from all output heads.
4) Computational Efficiency: Our methods are computationally efficient in that: 1) At each time step, the extra context division module only needs to compute the distances between the current state and the $K$ context centroids, which is computationally negligible w.r.t. the SGD complexity of RL itself; 2) Only a small number of extra output heads are added to the neural network, and the increased computation is acceptable w.r.t. the representation complexity; 3) There are no gradient updates through the random encoder; 4) There is no unnecessary distance computation for finding the corresponding context at every update step, as the context label for each state is stored in the replay buffer. Fig. 11 shows the training time of each agent on Pendulum and the floating-point operations (FLOPs) executed by the agents on Breakout, respectively.
VI Conclusion and Future Work
In this paper, we propose CDaKD, a competent scheme to tackle the inherent challenge of catastrophic interference in single-task RL. The core idea is to partition all states experienced during training into a set of contexts using online clustering and to estimate the context-specific value function with a multi-head neural network, together with a knowledge distillation loss to mitigate interference across contexts. Furthermore, we introduce a random convolutional encoder to enhance context division for high-dimensional complex tasks. Our method can effectively decouple the correlations among differently distributed states and can be easily incorporated into various value-based RL models. Experiments on several benchmark tasks show that our method can significantly outperform state-of-the-art RL methods and dramatically reduce the memory requirement of existing RL methods.
In the future, we aim to incorporate our method into policy-based RL models to reduce the interference during training by applying weight or functional regularization on policies. Furthermore, we will investigate a more challenging setting, continual RL in non-stationary environments [lomonaco2020continual]. This setting is a more realistic representation of real-world scenarios and includes abrupt changes or smooth transitions in dynamics, or even shuffled dynamics.
Acknowledgment
The work presented in this paper was supported by the National Natural Science Foundation of China (U1713214).
References
Appendix A Sequential K-Means Clustering
The process of sequential K-Means clustering for the current state is shown in Algorithm 2. Each context centroid $\mu_k$ is the average of all of the states closest to it. In order to obtain a better initialization of the centroids, we can perform offline K-Means clustering on all states experienced before training starts and use the resulting centroids as the initial values. Then, sequential K-Means clustering is performed in subsequent time steps.
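Under these definitions, one step of sequential K-Means can be sketched as follows; the dimensionality, number of centroids, and data are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def skm_update(state, centroids, counts):
    """One sequential K-Means step: assign `state` to the nearest
    centroid, then move that centroid so it remains the running mean
    of all states assigned to it."""
    k = int(np.argmin(np.linalg.norm(centroids - state, axis=1)))
    counts[k] += 1
    centroids[k] += (state - centroids[k]) / counts[k]
    return k

# Initial centroids, e.g., from offline K-Means on pre-training states.
centroids = rng.normal(size=(3, 2))   # K = 3 centroids in a 2-D toy space
counts = np.ones(3)                   # each centroid starts from one seed point

for s in rng.normal(size=(100, 2)):
    context = skm_update(s, centroids, counts)
```

The incremental mean update avoids storing past states: each step costs only the $K$ distance computations mentioned in the complexity analysis.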
Appendix B Initial States vs All Experienced States
We compare the effects of two kinds of context division techniques:

ISC (Initial States Clustering) performs context division using K-Means on samples drawn from the initial state distribution, following [ghosh2018divide];

ESC (Experienced States Clustering) is our proposed technique, performing context division using sequential K-Means clustering on all states experienced during training.
We run the DQN agent incorporated with CDaKD using each of the above two clustering techniques on CartPole-v0 and visualize the two-dimensional t-SNE results of the context division in Fig. 12. It can be clearly observed that the contexts divided by ESC are relatively independent, with almost no overlapping areas, achieving effective decoupling among states with different distributions. By contrast, there are obvious overlapping areas among the contexts divided by ISC, since trajectories starting from initial states in different contexts have a high likelihood of overlapping in subsequent time steps, which is undesirable for reducing interference among differently distributed states during neural network training.
Appendix C Random Encoder
We find the K-nearest neighbors of specific states by measuring distances in the low-dimensional representation space produced by a randomly initialized encoder (random encoder) on Breakout. The results are shown in Fig. 13, from which we can observe that the raw images corresponding to the K-nearest neighbors of the source image in the representation space of the random encoder demonstrate significant similarities. We provide the full procedure of CDaKD with the random encoder in Algorithm 3.
Table III: Neural network architectures of the underlying RL models. For the classic control tasks: FC1 (input: dimension of the state space; Tanh activation) followed by FC2 (output: number of actions; linear activation). For the Atari games: Conv1, Conv2, and Conv3 (ReLU activations) followed by FC4 (ReLU) and FC5 (linear).

Table IV: Common key hyperparameters for the classic control tasks and the Atari games, including the number of training time steps, the training $\epsilon$ decay schedule, the minimum history to start learning, the target network update frequency, the batch size, and the learning rate.
Appendix D Implementation Details
To ensure the fairness of comparison, we compare agents based on the same underlying RL model with the same hyperparameters and neural network architecture. We provide a full list of the neural network architectures of the underlying RL models in Table III and summarize our choices of common key hyperparameters in Table IV.
Appendix E Calculation of Model Complexity
Computational Complexity of Context Division. At each time step, SKM only needs to calculate the distances between the current state and the $K$ context centroids. Given a $d$-dimensional state space and $T$ steps of environment interaction, the time complexity of context division is $O(KdT)$. At the same time, since only $K$ additional context centroids need to be stored for clustering, the space complexity is $O(Kd)$.
Calculation of Floating Point Operations. We obtain the number of operations per forward pass for all layers in the encoder (denoted by $F_E$) and the number of operations per forward pass for all MLP layers in each output head (denoted by $F_H$), as in https://openai.com/blog/aiandcompute/. Therefore, the number of FLOPs of CDaKD-RE is:

$$\text{FLOPs} = B N_U (F_E + F_H) + 2 B N_U (F_E + F_H) + N_S (F_E + F_H) + N_S F_E,$$

where $B$ is the batch size, $N_S$ is the number of environment steps, and $N_U$ is the number of training updates. The first two terms account for the forward and backward passes required in training updates, respectively (a backward pass is counted as roughly twice a forward pass). The latter two terms account for the forward passes required to compute the policy action and to obtain the low-dimensional representation from the random encoder, respectively. In our experiments, substituting the concrete values of $B$, $N_S$, $N_U$, $F_E$, and $F_H$ yields the per-agent totals reported for Rainbow and DQN.
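The accounting above can be sketched numerically as follows; all operation counts and step counts here are made-up placeholders, not the measured values from our experiments:

```python
def cdakd_re_flops(B, N_S, N_U, F_E, F_H, F_RE):
    """FLOPs accounting following the text: each training update costs one
    forward pass plus a backward pass counted as two forwards; every
    environment step adds one policy forward pass and one random-encoder
    forward pass. All arguments are illustrative placeholders."""
    per_forward = F_E + F_H
    train = (1 + 2) * B * N_U * per_forward   # forward + backward in updates
    act = N_S * per_forward                   # action selection
    enc = N_S * F_RE                          # random-encoder embedding
    return train + act + enc

# Tiny placeholder numbers just to exercise the formula.
total = cdakd_re_flops(B=2, N_S=10, N_U=5, F_E=100, F_H=10, F_RE=100)
```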