1 Motivation
Neural networks have recently achieved notable success in different areas including machine translation (bahdanau2014neural) and image classification (krizhevsky2012imagenet). Despite all the successes, there remain unaddressed challenges for neural networks in order to learn efficiently. One such challenge is that of catastrophic interference: neural networks forget what they learned in the past when they are presented with new data (french1999catastrophic).
In this work, we focus on a setting in which the neural network observes each sample only once, learns from it, updates its weights, and then discards the sample without using it again. Apart from its close affinity with how humans learn, this setting is popular in the linear reinforcement learning (RL) framework (sutton2018reinforcement). In the rest of this paper, we refer to this setting as online and fully incremental, whereas we refer to settings that use experience replay (ER) buffers simply as online learning frameworks. The fully incremental setting is of interest because interference tends to be most severe in such cases. In fact, the interference can be so severe that the agent fails to learn to solve even a single small task (ghiassian2018two).
Neural networks inherently suffer from catastrophic interference. The network has a single set of shared weights, which leads to high representational overlap across data samples (french1999catastrophic). One update can easily overwrite what was learned previously by changing too many weights, that is, by generalizing too globally. Interference is especially pronounced when the data is received in a non-independent-and-identically-distributed (non-i.i.d.) manner (mccloskey1989catastrophic), a property of the online and fully incremental setting that we consider.
To mitigate catastrophic interference in RL, practitioners have resorted to approaches with ER buffers that collect recent data to perform minibatch updates (volodymyr2015human). Such minibatch updates approximate an i.i.d. data distribution which attenuates catastrophic interference. Still, as we tackle more complex problems, having an ER buffer might become infeasible in terms of memory. We cannot expect an intelligent agent to store all past experiences and to constantly relearn what it had learned in the past. Thus, it is essential to look for potential ways to achieve online RL in the absence of ER.
Many solutions have been proposed to tackle catastrophic interference. kirkpatrick2017overcoming introduced Elastic Weight Consolidation, a regularization strategy that encourages parameters to stay as close as possible to the parameters of previous tasks. Similar to our method, DBLP:journals/corr/abs190409330 proposed using a self-organizing map (SOM) to mitigate interference. However, it is limited to supervised learning, and the SOM used is unsuitable when the data distribution is non-i.i.d. In terms of online and fully incremental RL, ghiassian2018two proposed using tile coding and geometric projection to sparsify input features, reducing activation overlap. But tile coding increases the number of input dimensions to a neural network, which can lead to scalability issues for benchmarks with high dimensionality. liu2018utility introduced regularization strategies, such as a distributional regularizer, to induce sparse representations in neural networks. The induced representations can provide locality that averts interference. Yet, these methods require pretraining and have not been extended to the online and fully incremental setting.
In this paper, we propose a method that helps an RL agent learn online and fully incrementally without the use of ER buffers and target networks. We show, on two RL tasks, that our method is on par with, and in some cases better than, baseline methods that use ER and target networks, in both performance and speed. We use visualizations and a quantitative measure proposed in riemer2018learning to show that the proposed method indeed reduces interference.
2 Dynamic Self-Organizing Maps for Online Reinforcement Learning
As mentioned before, catastrophic interference is a byproduct of global updates to the neural network weights. The obvious solution to the problem seems to be replacing the global updates with more local ones. When an agent updates its value estimation of a state, the update should only affect value estimations of states that are similar to the current state. With this intuition in mind, we hypothesize the catastrophic interference caused by the global property of neural networks to be a major factor of neural networks’ inability to learn online and fully incrementally. If we can limit the scope of an update to be more local in the state space, we may be able to achieve generalization without the interference induced by global updates.
To localize the updates and alleviate interference, we propose the use of dynamic self-organizing maps (DSOM). The proposed method learns a neural network and a DSOM in parallel. It reduces interference by inducing state-conditioned updates, removing the need for ER and target networks. Self-organizing maps (SOM) are a type of neural network learned without supervision (kohonen1990self). The goal of a SOM is to transform an input space into a one-dimensional or two-dimensional discrete map in a topologically ordered manner, meaning that data samples that are close to each other in the input space are close together on the map. Unlike standard neural networks, SOMs learn with competitive learning instead of error backpropagation. Under pure competitive learning, rather than updating all neurons, only the winner neuron gets updated. The winner is determined by a metric (e.g. the Euclidean norm for SOM). With a SOM, both the winner neuron and its neighboring neurons on the map get updated. The neighborhood of neurons is defined by a neighborhood function. By iterating this process, the map of neurons converges, with each weight vector representing a cluster of the input space, conditioned on the distribution of the received data.
Typically, a SOM contains a set of $n$ $d$-dimensional weight vectors, each denoted as $w_i$, $i \in \{1, \dots, n\}$. Each $w_i$ is associated with a unique position $p_i$ on a one-dimensional or two-dimensional grid map to represent a node. At each iteration, an input vector $x$ is matched with the closest node on the map by
$$s = \arg\min_{i} \lVert x - w_i \rVert \qquad (1)$$
where $\lVert \cdot \rVert$ is the Euclidean norm. The closest unit $s$ is commonly known as the best matching unit (BMU) or the winner neuron. The weight vector associated with the BMU and the weight vectors of the BMU's topologically neighboring nodes are updated to reduce the error between the input vector and those weights, pushing that neighborhood of weight vectors closer to the input vector. The neighborhood is determined by a neighborhood function $h$. In this work, we use DSOM, a type of SOM proposed in rougier2011dynamic. At each time step $t$, the weight vectors are shifted towards $x$ by
$$\Delta w_i = \varepsilon \, \lVert x - w_i \rVert_{\Omega} \, h_{\eta}(i, s, x) \, (x - w_i) \qquad (2)$$
with the neighborhood function defined as
$$h_{\eta}(i, s, x) = \exp\left( -\frac{1}{\eta^{2}} \frac{\lVert p_i - p_s \rVert^{2}}{\lVert x - w_s \rVert_{\Omega}^{2}} \right) \qquad (3)$$
Here $\lVert \cdot \rVert_{\Omega}$, $\varepsilon$ and $\eta$ refer to the normalized Euclidean norm, the DSOM learning rate and the elasticity (plasticity) respectively. Unlike SOM, DSOM removes the dependence on time, which allows the use of SOM with non-i.i.d. data without the need for an offline training period, making it a suitable candidate for the online and fully incremental setting. Note that elasticity modulates the coupling strength between DSOM weight vectors. If elasticity is high, weight vectors tend to be relatively close to each other, while a lower value allows looser coupling between weight vectors.
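The BMU search, neighborhood function and weight shift described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `dsom_step`, the default values, and the way the norm is normalized (dividing by the square root of the input dimension, which assumes inputs scaled to the unit hypercube) are all assumptions made for the example.

```python
import numpy as np

def dsom_step(weights, positions, x, lr=0.1, elasticity=1.0):
    """One DSOM update (hypothetical helper).

    weights:   (n, d) array of DSOM weight vectors
    positions: (n, k) array of fixed grid coordinates, one per node
    x:         (d,) input vector, ASSUMED to be scaled to [0, 1]^d
    Returns the updated weights and the index of the best matching unit.
    """
    # Normalized Euclidean distance; dividing by sqrt(d) is an assumption
    # that bounds the distance by 1 for inputs in the unit hypercube.
    dists = np.linalg.norm(x - weights, axis=1) / np.sqrt(x.size)
    bmu = int(np.argmin(dists))
    if dists[bmu] == 0.0:
        # The input coincides with the BMU: nothing needs to move.
        return weights.copy(), bmu
    # Squared grid distance of every node to the BMU.
    grid_d2 = np.sum((positions - positions[bmu]) ** 2, axis=1)
    # Neighborhood shrinks as the BMU gets closer to the input.
    h = np.exp(-grid_d2 / (elasticity ** 2 * dists[bmu] ** 2))
    # Shift each weight vector towards x, scaled by its own distance to x.
    new_weights = weights + lr * dists[:, None] * h[:, None] * (x - weights)
    return new_weights, bmu
```

Because the step size depends on the distance to the input rather than on a decaying schedule, the same function can be called on every incoming sample without a separate offline training phase.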
We illustrate our method with Sarsa (rummery1994line) and Q-learning (watkins1992q). We parameterize the state-action value functions with neural networks for both algorithms (see Appendix 5.1).
We propose using DSOM as a resource allocation module for a neural network, modulating the extent of an update to each weight to reduce interference. At each time step, DSOM produces an output mask based on the Euclidean distance between each of its weight vectors and the input vector:
(4) 
where
is a tunable hyperparameter. Weight vectors that are closer to the input vector are weighted higher, and vice versa. The number of weight vectors in DSOM is set to the number of hidden units in the hidden layer. The hidden layer output is multiplied elementwise with DSOM's output mask, so that each hidden unit's output is modulated according to the Euclidean distance between the input vector and the corresponding DSOM weight vector. State-action values are then computed by the output layer using the modulated hidden layer output. The neural network and DSOM are learned in parallel and entirely online. See Appendix 5.2 for a sample architecture.
To put it in the context of RL, DSOM's weight vectors learn to represent different parts of the state space. Each weight vector is associated with a hidden unit in the hidden layer. The output mask determines the degree of use of each hidden unit based on the Euclidean distance between the state feature vector and each DSOM weight vector. The weighted mask has a similar effect to the activation sharpening proposed by french1992semi, except that the selection of nodes to be sharpened or dampened is conditioned on the current state. With the state similarity information embedded in the output mask, an update to the state-action value estimate of a state mainly affects the value estimates of similar states. Weights that are used more by dissimilar states are weighted less in the DSOM mask, diminishing changes to those weights during an update. In other words, the learning progress in different parts of the state space cannot interfere with each other as much, which makes it possible for the agent to learn online and fully incrementally through more local updates. We hypothesize that the output mask from DSOM alleviates catastrophic interference by masking out interfering updates while allowing generalization across similar states.
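The masked forward pass described above can be illustrated with a small NumPy sketch. The exact form of the mask in Eq. (4) is not recoverable from the text here, so the sketch assumes an exponential of the negative Euclidean distance divided by a temperature; the function name `masked_q_values` and the parameter name `tau` are likewise hypothetical. The ReLU hidden layer and linear output layer follow the description in Appendix 5.1.

```python
import numpy as np

def masked_q_values(x, W_h, b_h, W_out, b_out, dsom_weights, tau=0.5):
    """Forward pass with a DSOM output mask (illustrative sketch).

    x:            (d,) state feature vector
    W_h, b_h:     hidden layer parameters, shapes (n, d) and (n,)
    W_out, b_out: output layer parameters, shapes (a, n) and (a,)
    dsom_weights: (n, d) DSOM weight vectors, one per hidden unit
    tau:          hypothetical temperature controlling mask sharpness
    """
    # ReLU hidden layer.
    hidden = np.maximum(0.0, W_h @ x + b_h)
    # Distance from the input to every DSOM weight vector.
    dists = np.linalg.norm(dsom_weights - x, axis=1)
    # ASSUMED mask form: closer weight vectors receive values nearer to 1.
    mask = np.exp(-dists / tau)
    # Elementwise modulation of the hidden layer, then a linear output layer.
    q_values = W_out @ (hidden * mask) + b_out
    return q_values, mask
```

Since the mask multiplies the hidden activations, gradients flowing to the weights of a hidden unit are scaled by that unit's mask value, so units whose DSOM vectors lie far from the current state receive correspondingly smaller updates.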
3 Experiments and Results
We evaluated our proposed method on two RL benchmarks, namely Mountain Car and Lunar Lander (see Appendix 5.5). The agent takes in a feature vector as input and produces state-action values for all actions. All methods used a single hidden layer of 800 units unless specified otherwise. For our method, we ensured that the total number of weights, including the DSOM weight vectors, was the same as or less than that of the baselines for a fair comparison. (For instance, to compare with a neural network with a single hidden layer of 800 units (number of weights = number of features × 800 + 800 × number of actions), we used 400 hidden units and 400 DSOM weight vectors for our proposed approach (number of weights = (400 + number of DSOM weight vectors) × number of features + 400 × number of actions). Note that the number of weights used in the latter case is less than that of the baseline.) The performance measure for Mountain Car is the number of steps taken in an episode, with a cutoff of 1000 steps; the fewer the steps, the less time the agent needs to solve the problem. For Lunar Lander, the episodic reward is used as the performance measure. We evaluated our method against several baselines, including baselines that use ER, target networks and adaptive learning rate (ALR) mechanisms, and baselines that learn only with the RL algorithms (see Appendix 5.6). Methods that do not use adaptive learning rate optimizers use stochastic gradient descent (SGD) by default. We performed parameter sweeps over each method and report the best performance averaged over 30 runs. We used Sarsa and Q-learning in Mountain Car and Lunar Lander respectively.
We show learning curves in Figure 1. On Mountain Car, our method (DSOM_S), which learns online without ER, outperforms the baseline methods in terms of performance and speed. The baselines without ER and target networks (Sarsa and Sarsa_ALR) either have trouble solving the task or learn very slowly, likely due to catastrophic interference when updating the neural networks, as pointed out in ghiassian2018two. When ER, a target network and an ALR optimizer are used, the baseline (Replay_ALR_S) is able to learn, as ER helps to alleviate some of the interference. Our method performs well by balancing generalization and interference directly when updating the value function. This is done by modulating the updates based on the state similarity information embedded in the DSOM output mask, so that value estimates of dissimilar states are affected less. Updates are therefore more local, influencing only the area surrounding the current state. This achieves online learning without any explicit storage of observed data.
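The weight-count comparison used to keep the methods comparable can be checked with a short calculation. The sketch below follows the formulas given for the 800-unit baseline and the 400 + 400 configuration; the function names are hypothetical, biases are ignored as in those formulas, and using the raw state dimension as the number of features (2 for Mountain Car, 8 for Lunar Lander) is an assumption.

```python
def baseline_weight_count(n_features, n_hidden, n_actions):
    """Weights of a single-hidden-layer network (biases ignored)."""
    return n_features * n_hidden + n_hidden * n_actions

def dsom_weight_count(n_features, n_hidden, n_dsom, n_actions):
    """Weights of the network plus the DSOM weight vectors (biases ignored)."""
    return (n_hidden + n_dsom) * n_features + n_hidden * n_actions

# Mountain Car: 2 state features, 3 actions.
mc_baseline = baseline_weight_count(2, 800, 3)   # 4000
mc_dsom = dsom_weight_count(2, 400, 400, 3)      # 2800

# Lunar Lander: 8 state features, 4 actions.
ll_baseline = baseline_weight_count(8, 800, 4)   # 9600
ll_dsom = dsom_weight_count(8, 400, 400, 4)      # 8000
```

Under these assumptions the DSOM configuration uses fewer weights than the baseline in both environments, because halving the hidden layer also halves the output layer weights.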
On Lunar Lander, a more complex control problem, our method with DSOM performs as well as, and when ALR is not used, better than, methods that use ER and a target network. It also learns faster than the baselines. This further illustrates the potential benefits of online and fully incremental learning with neural networks, provided that we are able to make more local and differentiated (state-conditioned) updates. Note that the ALR optimizer appears to play a role in reducing interference: in Figures 1c and 1d, all methods perform worse once the ALR optimizer is removed. This is especially visible in the case with ER and a target network (Replay), where learning instability is observed with a significant dip in performance, while our method continues to improve and performs well consistently.
Table 1: Interference measure for each method (lower values indicate less interference).
Online  Online_ALR  Replay  Replay_ALR  DSOM
0.1739  0.2735  0.4164  0.1693  0.1079
We visualize the activation response functions of hidden units across the state space of Mountain Car in Figures 2a and 2b. A set of states representative of the whole state space is used to measure the activation responses of each hidden unit (see Appendix 5.3). Based on the visualizations, the hidden units of our method demonstrate more local and specialized responses to specific areas of the state space. In contrast, hidden units trained with ER, a target network and an ALR optimizer respond to a large area of the state space. This offers a potential explanation for the effectiveness of our method, as each hidden unit only responds to a small subset of similar states. This helps the network generalize across similar states while avoiding overwriting the estimates of dissimilar states. Furthermore, we employ a quantitative measure from riemer2018learning to quantify interference (see Appendix 5.4). This measure is zero when there is no activation overlap across two samples; thus, the lower the value, the less interference there is. We computed this measure for each method, averaging over all unique pairs of the states used in the visualizations. As shown in Table 1, our method has the lowest level of interference across the state space. Together, the qualitative and quantitative results provide evidence that our method performs well in the online and fully incremental setting without ER because it reduces interference by inducing more local updates.
Additionally, since our method induces more local updates, we hypothesize that it may lead to more efficient use of resources (hidden units), as each unit specializes in a small area of the state space. This means our method may need fewer hidden units to solve the same problem. To verify this hypothesis, we conducted experiments on Mountain Car, varying the number of hidden units used. We show the results in Figure 2c. For each method, we measure the average number of steps used over the number of training episodes. As we can see, our method continues to solve the task and outperform all the baselines used, even with a very small number of hidden units. (On the x-axis of Figure 3, the number of units refers to the total number of hidden units in the hidden layer for the baselines. For our method, it refers to the total number of DSOM weight vectors plus the total number of hidden units, as they are of the same feature length.) Our method is able to solve Mountain Car with as few as 36 hidden units. The results support the hypothesis that our method's capability to reduce interference may give rise to more efficient use of resources in neural networks, reducing the number of hidden units needed.
4 Conclusion
We proposed a method that combines dynamic selforganizing maps with neural networks to solve reinforcement learning problems fully incrementally, a setting akin to how humans learn. The method achieves fully incremental learning by localizing the updates and preventing interference with what the network has learned in the past. It removes the need for experience replay buffers and target networks. It also provides a new perspective on how interference can be avoided in neural networks.
Acknowledgments
The authors thank Richard Sutton and Khurram Javed for discussions contributing to the results presented in this paper. The authors gratefully acknowledge funding from the Alberta Machine Intelligence Institute, JPMorgan Chase & Co, the Natural Sciences and Engineering Research Council of Canada, and Google DeepMind.
References
5 Appendix
5.1 RL Algorithms with neural networks
In reinforcement learning, an agent interacts with its environment by taking actions at discrete time steps $t = 0, 1, 2, \dots$. The environment is commonly formulated as a Markov Decision Process (MDP) with states $\mathcal{S}$, actions $\mathcal{A}$, transition probabilities $P$, rewards $R$ and discount function $\gamma$ (white2017unifying). At each time step $t$, the agent is in a state $S_t$ and takes an action $A_t$. In response, the environment emits a reward $R_{t+1}$ and takes the agent to a state $S_{t+1}$. The goal of the agent is to maximize the return, defined as the discounted sum of future rewards:
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \qquad (5)$$
In this paper, we use the Sarsa (rummery1994line) and Q-learning (watkins1992q) algorithms to test our method in control problems. In both algorithms, the agent learns to approximate the state-action value function and acts near-greedily according to those state-action values. The state-action values for a policy $\pi$ are the expected return for that policy beginning from state $s$ and action $a$:
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right] \qquad (6)$$
where $\mathbb{E}_{\pi}$ denotes taking the expectation under policy $\pi$.
We parameterize the state-action value functions with neural networks for Sarsa and Q-learning, denoted as $\hat{q}(s, a, \theta)$, where $\theta$ refers to the neural network weights. The state-action value function is commonly learned by bootstrapping from the state-action value of the next state, minimizing the temporal difference error. The update rules for Sarsa and Q-learning to learn the state-action value function are as follows:
$$\theta_{t+1} = \theta_t + \alpha \left[ R_{t+1} + \gamma \, \hat{q}(S_{t+1}, A_{t+1}, \theta_t) - \hat{q}(S_t, A_t, \theta_t) \right] \nabla_{\theta} \hat{q}(S_t, A_t, \theta_t) \qquad (7)$$
$$\theta_{t+1} = \theta_t + \alpha \left[ R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \theta_t) - \hat{q}(S_t, A_t, \theta_t) \right] \nabla_{\theta} \hat{q}(S_t, A_t, \theta_t) \qquad (8)$$
where $\alpha$ and $\gamma$ are the learning rate and discount factor respectively. Each neural network takes a state feature vector as input and produces state-action values for each possible action. All neural networks used in this work use ReLU as the activation function and a linear output layer.
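The difference between the Sarsa and Q-learning updates lies only in how the bootstrap target is formed, which the following sketch makes explicit. The function names are hypothetical, and the handling of terminal transitions through a `done` flag is a common convention assumed here rather than something stated in the text.

```python
import numpy as np

def sarsa_td_error(q_sa, reward, q_next, next_action, gamma, done=False):
    """TD error using the value of the action actually taken next (on-policy)."""
    bootstrap = 0.0 if done else gamma * q_next[next_action]
    return reward + bootstrap - q_sa

def q_learning_td_error(q_sa, reward, q_next, gamma, done=False):
    """TD error using the greedy value in the next state (off-policy)."""
    bootstrap = 0.0 if done else gamma * np.max(q_next)
    return reward + bootstrap - q_sa

def semi_gradient_step(theta, grad_q, td_error, alpha):
    """Move the weights along the gradient of q(S_t, A_t), scaled by the TD error."""
    return theta + alpha * td_error * grad_q
```

For example, with next-state values `[1.0, 3.0]`, a reward of -1, a discount of 0.9 and a current estimate of 2.0, Sarsa with next action 0 gives a TD error of -2.1, while Q-learning bootstraps from the larger value and gives -0.3.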
5.2 Sample Architecture of using DSOM
5.3 Details on heatmap visualizations
We used a set of 121 states that covers the entire state space. These states were generated in the manner as follows: . For each algorithm, the normalized activation values of the hidden layer for these states were used to create the heatmaps.
5.4 Quantitative measure of Interference
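Introduced in riemer2018learning, the measure looks at the dot product of the gradient vectors of two samples with respect to the parameters:
$$\frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} \qquad (9)$$
where $L$ corresponds to the loss function, $(x_i, y_i)$ and $(x_j, y_j)$ are the two samples, and $\theta$ denotes the network parameters.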
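The gradient dot product and its average over all unique pairs of samples can be sketched as follows. To keep the example self-contained, a linear model with a squared loss stands in for the neural network, so the gradient has a closed form; this substitution and all function names are assumptions made for illustration only.

```python
import numpy as np
from itertools import combinations

def grad_squared_loss(theta, x, y):
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta (linear stand-in model)."""
    return (theta @ x - y) * x

def gradient_dot(theta, sample_i, sample_j):
    """Dot product of the loss gradients of two (x, y) samples."""
    g_i = grad_squared_loss(theta, *sample_i)
    g_j = grad_squared_loss(theta, *sample_j)
    return float(g_i @ g_j)

def mean_pairwise_gradient_dot(theta, samples):
    """Average the gradient dot product over all unique pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(gradient_dot(theta, a, b) for a, b in pairs) / len(pairs)
```

In this stand-in model, two samples whose feature vectors share no active components produce orthogonal gradients and a dot product of zero, which mirrors the statement that the measure vanishes when there is no overlap between two samples.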
5.5 Environment Descriptions
Mountain Car: The Mountain Car environment has a 2-dimensional state space: position and velocity. The value of position is between -1.2 and 0.6 and the value of velocity is between -0.07 and 0.07. The agent has three discrete actions, namely full throttle forward, full throttle backward and no throttle. The car is initialized at a random position near the bottom of the hill, and the agent receives a reward of -1 for each time step until the car gets to the top of the hill. The goal state (top of the hill) is reached when the position is greater than 0.5. The episode terminates if it takes more than 1000 steps.
Lunar Lander: The Lunar Lander environment has an 8-dimensional state space: x-coordinate, y-coordinate, x-velocity, y-velocity, angle, angular velocity, and two binary features that indicate whether the left and right legs of the lander are in contact with the ground. The agent has four discrete actions, namely fire left orientation engine, fire right orientation engine, do nothing and fire main engine. We used the Lunar Lander environment from OpenAI Gym (brockman2016openai). The goal of the agent is to land the lander safely on a fixed landing pad without crashing. The reward for landing on the pad with zero resultant speed is between 100 and 140, with a negative reward if the lander moves away from the pad. A reward of 10 is given for each leg's ground contact and a penalty of 0.3 is given for each frame in which the main engine is fired. The episode terminates when the lander either crashes or comes to rest.
5.6 Experimental Details
For Mountain Car, we used Sarsa as the RL algorithm. We evaluated 5 different methods:

Sarsa (methods that do not use adaptive learning rate optimizers use stochastic gradient descent (SGD) by default),

Sarsa with ALR optimizer, abbreviated as Sarsa_ALR in the figures,

Sarsa with DSOM, abbreviated as DSOM_S,

Sarsa with experience replay buffer and target network, abbreviated as Replay_S, and

Sarsa with experience replay buffer, target network and ALR optimizer, abbreviated as Replay_ALR_S.
For Lunar Lander, we used Q-learning as the RL algorithm. We evaluated 6 different methods:

Q-learning,

Q-learning with ALR optimizer, abbreviated as Q_Learning_ALR,

Q-learning with DSOM, abbreviated as DSOM_Q,

Q-learning with DSOM and ALR optimizer, abbreviated as DSOM_ALR_Q,

Q-learning with experience replay buffer and target network, abbreviated as Replay_Q, and

Q-learning with experience replay buffer, target network and ALR optimizer, abbreviated as Replay_ALR_Q.
In terms of exploration, we used the same exploration policy for all methods in each environment. For Mountain Car, we used an ε-greedy policy that takes a random action with a probability of 10%. For Lunar Lander, we used a decaying ε-greedy policy. The starting ε, ending ε and decay rate are 1.0, 0.1 and 0.995 respectively. Here we present the hyperparameter ranges that were swept over for the experimental runs in each environment.
Mountain Car  Lunar Lander  
Sarsa & Sarsa_ALR  Q_Learning & Q_Learning_ALR  
Learning Rate  
Number of Hidden Units  800  
Optimizer  RMSProp/SGD  Adam/SGD 
Replay_S & Replay_ALR_S  Replay_Q & Replay_ALR_Q  
Learning Rate  
Number of Hidden Units  800  
Optimizer  RMSProp/SGD  Adam/SGD 
Replay Size  20000  100000 
Batch Size  32  64 
Target Network Update Frequency  10  N/A 
Target Network Soft Update Frequency  N/A  4 
Target Network Soft Update Ratio  N/A  
DSOM_S  DSOM_Q & DSOM_ALR_Q  
Learning Rate  
DSOM Learning rate  
Optimizer  RMSProp/SGD  Adam/SGD 
Elasticity  
0.5  0.5  
Number of DSOM Weight Vectors  400 
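The two exploration policies described in this appendix can be sketched as follows. The text gives the starting ε, ending ε and decay rate for Lunar Lander but not the unit over which the decay is applied, so the multiplicative schedule below, indexed by a generic `step` counter, is an assumption, as are the function names.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(step, start=1.0, end=0.1, decay=0.995):
    """ASSUMED schedule: multiply epsilon by `decay` each step, floored at `end`."""
    return max(end, start * decay ** step)

rng = np.random.default_rng(0)
# Mountain Car style: fixed 10% exploration.
action_mc = epsilon_greedy(np.array([0.1, 0.5, 0.2]), 0.1, rng)
# Lunar Lander style: epsilon shrinks from 1.0 towards 0.1.
action_ll = epsilon_greedy(np.array([0.0, 1.0, 0.3, 0.2]), decayed_epsilon(200), rng)
```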