Learning by Active Forgetting for Neural Networks

by   Jian Peng, et al.

Remembering and forgetting mechanisms are two sides of the same coin in a human learning-memory system. Inspired by human brain memory mechanisms, modern machine learning systems have been working to endow machines with lifelong learning capability through better remembering, while treating forgetting as an antagonist to overcome. Nevertheless, this idea might see only half the picture. Recently, a growing number of researchers have argued that a brain is born to forget, i.e., forgetting is a natural and active process for abstract, rich, and flexible representations. This paper presents a learning model based on an active forgetting mechanism for artificial neural networks. The active forgetting mechanism (AFM) is introduced to a neural network via a "plug-and-play" forgetting layer (P&PF), consisting of groups of inhibitory neurons with an Internal Regulation Strategy (IRS) to adjust their own extinction rate via a lateral inhibition mechanism and an External Regulation Strategy (ERS) to adjust the extinction rate of excitatory neurons via an inhibition mechanism. Experimental studies have shown that the P&PF offers surprising benefits: self-adaptive structure, strong generalization, long-term learning and memory, and robustness to data and parameter perturbation. This work sheds light on the importance of forgetting in the learning process and offers new perspectives to understand the underlying mechanisms of neural networks.





Learning and memory capabilities are the essence of biological intelligence. Questions about how the brain processes information and encodes and stores knowledge in learning and memory have attracted considerable research [1, 2] and have seen remarkable progress in recent decades [57], rekindling the hope of reaching artificial general intelligence in a brain-inspired way. In particular, inspired by the memory mechanisms of the human brain, modern machine learning systems have been working to endow machines with lifelong learning capabilities by overcoming catastrophic forgetting in neural networks [41], which intuitively furthers the long-held view of forgetting as a negative consequence of memory and as a passive process [6, 7]. Recent research challenges this view: a brain is born to forget, i.e., forgetting is a natural and active process for abstract, rich, and flexible representations [4]. Forgetting also plays a crucial role in regulating the learning-memory process for good generalisability in the real world [5]. "If we remember everything, we should on most occasions be as ill off as if we remember nothing." This quote, cited by William James [3], suggests that forgetting is essential for humans to move forward.

Deep neural networks (DNNs), a technique widely used for a range of machine intelligence-related tasks, are essentially a form of forgetting that compresses knowledge by filtering irrelevant information from massive signals [8, 9], and many of their techniques draw on the abstract concept of forgetting [10]. Popular implicit approaches, like dropout [11], the L1 norm [12], and decorrelation [13], regularise neurons’ activation, promoting the model’s generalization, robustness, or low parameter complexity [14, 15]. Alternatively, explicit works specify individual modules, e.g., LSTM [14, 15] and its variants [14, 15], and gated control [19] networks, to filter spatial-temporal contextual information. Furthermore, some works mimic mechanisms of natural forgetting, e.g., using the lateral inhibition mechanism of visual cells to suppress category-level irrelevant features in saliency detection [20] or to improve extended sequence memory capacity in lifelong learning [21]. In general, these efforts suggest that forgetting potentially plays an essential role in learning and memory in artificial neural networks. Nevertheless, research on forgetting mechanisms in artificial neural networks and their explicit modelling is missing.

In contrast to artificial neural networks, forgetting mechanisms in biological neural networks have been extensively studied in recent years [22, 23, 24]. Growing evidence [25] suggests that the brain actively forgets through the selective extinction of neurons, which plays a fundamental role in the learning-memory process [23, 26]. Further studies [25] suggest that this regulation of forgetting is accomplished through a class of molecular switches, namely neurons or proteins, that selectively regulate the activation of excitatory neurons. Concretely, these switches are semi-open before learning and adjust in both directions: they close during learning, with a corresponding acceleration of forgetting, and open during memory consolidation, with a corresponding slowing of forgetting. Furthermore, the rate of forgetting depends on the degree of switching, which is regulated according to a competitive mechanism that cooperates with inhibition and lateral inhibition to sparsely encode the gist, allowing the brain to learn and remember more events [27].

Here, we propose a memory model for artificial neural networks based on an active forgetting mechanism inspired by the human brain. We design a "plug-and-play" forgetting layer (P&PF) consisting of groups of inhibitory neurons with an Internal Regulation Strategy (IRS), which adjusts their own extinction rate by mimicking the biological lateral inhibition mechanism, and an External Regulation Strategy (ERS), which adjusts the extinction rate of excitatory neurons by mimicking the biological inhibition mechanism. We demonstrate that this neural switch-based biological forgetting mechanism is also applicable to artificial neural networks. It promotes strong generalization, long-term memory, and strong robustness to data and weight perturbations.

Figure 1: Schematics of the proposed forgetting layer in artificial neural networks. (a). The structure of the forgetting layer in artificial neural networks, which lies between two adjacent layers of excitatory neurons. The signalling of excitatory neurons follows two paths: one outputs the response of excitatory neurons to the next layer of excitatory neurons after the forgetting operation of inhibitory neurons; the other connects directly to the next layer through a shortcut. (b). Gradient values of excitatory neurons in the intermediate layers during the initial and convergence phases of training, with or without the shortcut. Colder colours indicate lower gradient values and vice versa. (c). The lateral inhibition mechanism of inhibitory neurons in the forgetting layer, based on their learned importance. (d). Activation of excitatory neurons for standard neural networks with and without forgetting layers. The horizontal axis represents neurons; the solid blue line and the solid black line are, respectively, the mean and the standard deviation of the activation values over the samples in the test dataset. (e). Comparison of the sparsity of activation values of excitatory neurons using different mechanisms on a four-layer multilayer perceptron, including plain artificial neural networks, ERS, and cooperative IRS and ERS. (f). The process of selective apoptosis of excitatory neurons during active forgetting. Warmer colours indicate more intense closure of the corresponding inhibitory neurons. Excitatory neurons are mostly active in the initialization phase; then, the majority of them are gradually deactivated during training.


Actively Forgetting via the Plug-and-Play Forgetting Layer.

Fig. 1a gives a schematic architecture of the active forgetting model on an artificial neural network, i.e., adding a plug-and-play forgetting layer (P&PF) consisting of a group of inhibitory neurons regulated by ERS and IRS. More specifically, ERS mimics the biological inhibition mechanism by compressing the size of inhibitory neurons through the softmax function with a soft threshold, which acts as a bottleneck that allows only strong inhibitory neurons to survive (see Methods for details). IRS, which is complementary to ERS, mimics the biological lateral inhibition mechanism: it strengthens vital inhibitory neurons and weakens weak ones through mutual inhibition between inhibitory neurons (Fig. 1c).

We found that the forgetting layer is difficult to train to convergence because of the saturation of the softmax function, which causes gradients to vanish, as shown in the top row of Fig. 1b. Drawing on identity mappings [28], we design a shortcut between the activation and forgetting layers to prevent vanishing gradients. Compared with no shortcut, the shortcut propagates the gradient to the parameters effectively, allowing more parameters to adjust over a broader range throughout model training.
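The two-path structure described above can be illustrated with a minimal sketch. It assumes that the gate is a softmax over learned inhibitory strengths followed by a soft threshold, and that the shortcut simply adds the ungated activation back to the gated one. The exact formulation is given in the Methods and may differ, and the names `forgetting_gate`, `forgetting_layer`, `tau`, and the example values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forgetting_gate(inhibitory_scores, tau):
    """Assumed gate: softmax, then a soft threshold that zeroes entries below tau."""
    return [max(p - tau, 0.0) for p in softmax(inhibitory_scores)]


def forgetting_layer(activations, inhibitory_scores, tau, shortcut=True):
    """Gate excitatory activations; optionally add an identity shortcut path."""
    gate = forgetting_gate(inhibitory_scores, tau)
    gated = [g * a for g, a in zip(gate, activations)]
    if shortcut:
        # Assumed shortcut: the ungated activation is added back, so the
        # signal (and its gradient) has a path that bypasses the softmax.
        return [y + a for y, a in zip(gated, activations)]
    return gated


activations = [1.0, 2.0, 3.0, 4.0]
scores = [2.0, 0.0, -1.0, 0.5]  # hypothetical learned inhibitory strengths
gate = forgetting_gate(scores, tau=0.2)
gated_only = forgetting_layer(activations, scores, tau=0.2, shortcut=False)
with_shortcut = forgetting_layer(activations, scores, tau=0.2, shortcut=True)
```

In this toy setting only the first unit passes the threshold, so the gated path alone is exactly zero for the other three units, while the shortcut path still carries their activations forward.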

We first compare the activation states of excitatory neurons on a multilayer perceptron (MLP, with 784-128-256-10 units) with and without the P&PF in Fig. 1d. The results demonstrate that using the forgetting layer can substantially decrease excitatory neurons’ activation and achieve a sparse representation. For quantification, referring to the metric of sparsity in biological neurons [27], we propose a metric to evaluate the sparsity in artificial neural networks. It is calculated by


where N denotes the number of units in one layer, and r_i denotes the activation frequency of the i-th neuron over the whole test dataset.
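As an illustration of how a sparsity score can be computed from per-neuron activation frequencies, the sketch below uses a population sparseness measure that is common in the neuroscience literature (the Treves-Rolls / Vinje-Gallant form). It is an assumption that the metric used here has this form. The function names, and the choice to count a neuron as active when its activation is greater than zero, are also illustrative.

```python
def activation_frequency(activations_per_sample):
    """Fraction of samples on which each neuron is active (activation > 0)."""
    n_samples = len(activations_per_sample)
    n_units = len(activations_per_sample[0])
    counts = [0] * n_units
    for sample in activations_per_sample:
        for i, a in enumerate(sample):
            if a > 0:
                counts[i] += 1
    return [c / n_samples for c in counts]


def population_sparseness(r):
    """Assumed measure: S = (1 - mean(r)^2 / mean(r^2)) / (1 - 1/N).

    S is 0 when all neurons are equally active and 1 when a single
    neuron carries all of the activity.
    """
    n = len(r)
    if n < 2:
        raise ValueError("need at least two units")
    mean = sum(r) / n
    mean_sq = sum(x * x for x in r) / n
    if mean_sq == 0.0:
        # No neuron is ever active; treated as maximally sparse in this sketch.
        return 1.0
    return (1.0 - mean * mean / mean_sq) / (1.0 - 1.0 / n)


freq = activation_frequency([[1.0, 0.0], [2.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
dense_score = population_sparseness([0.5, 0.5, 0.5, 0.5])
sparse_score = population_sparseness([1.0, 0.0, 0.0, 0.0])
```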

We demonstrate that P&PF substantially enhances the sparsity of excitatory neurons. Fig. 1e compares the sparsity of the standard MLP and the MLP with P&PF using the inhibition mechanism (ERS) and the lateral inhibition mechanism (IRS), showing that the sparsity of the model using the forgetting layer is much higher than that of the plain model. Furthermore, forgetting is selective across layers: the sparsity of upper layers is considerably lower than that of lower layers, and the same holds in standard models. This might be because the representation of higher layers is more complex than that of lower layers and therefore requires more neurons to encode. Introducing a lateral inhibition mechanism may reduce sparsity, but it vastly improves the generalization accuracy, implying a more efficient way of encoding representations. In addition, we visualize the intensity changes of the inhibitory neurons, which act as neural switches (Fig. 1f), to analyze the extinction process of excitatory neurons regulated by the P&PF. Prior to learning, the strength of inhibitory neurons is at an intermediate value, implying that a large number of excitatory neurons take part in encoding; then, the strength of inhibitory neurons increases rapidly during learning, which suppresses the expression of excitatory neurons that are insignificant to the task; finally, the majority of inhibitory neurons turn on at the convergence stage, which corresponds to the extinction of a vast number of activating neurons.

Figure 2: The forgetting layer learns the neuron-importance of excitatory neurons. (a). Results of ablating excitatory neurons. The horizontal axis indicates the proportion of neuron ablations, and the vertical axis indicates the model test accuracy. Three types of ablation methods are utilized: random (the red line is the mean test accuracy over ten operations, and the shading shows the standard deviation), positive-order ablation according to importance, and inverse-order ablation according to importance. (b). Results of natural and rapid forgetting in different layers using the forgetting layer. (c). The model’s recognition results on each class after removing neurons with top-1 and top-10 importance values. (d). The distribution of neuron-importance values of excitatory neurons in each layer of the network. Three methods are compared: forgetting layers with and without using a cooperative mechanism of inhibition and lateral inhibition, as well as forgetting layers using only an inhibition mechanism.

Figure 3: Facilitating efficient learning in artificial neural networks using forgetting layers. (a). Structural adaptive information encoding. Separate models are trained on Fashion MNIST: the vanilla multilayer perceptron, the model with L1 regularization, the model with L2 regularization, and the model with the forgetting layer. (b). Comparison of the actual parameter occupancy of the models. Three parameter pruning methods are applied: random pruning, optimal brain damage, and self-learning-based neuron importance pruning. (c). Comparison of the convergence speed of various models: the vanilla model, the forgetting layer without regularization using the inhibition and lateral inhibition mechanism, and the complete forgetting layer.

The forgetting layer selectively deactivates the insignificant excitatory neurons and consolidates the critical excitatory neurons. To verify its effectiveness in learning neuron importance, we trained an MLP (784-256-256-128-10 units) on Fashion MNIST and observed how the model degrades: if the learned importance is accurate, deactivating crucial excitatory neurons should cause a dramatic decline in the model’s test accuracy, and vice versa. Specifically, we calculate the importance of excitatory neurons (see Methods for details), rank the corresponding neurons, and then retain a fixed number of neurons according to a given approach: a positive-order approach, a negative-order approach, or a random approach. The random approach is used as a control group to reflect the average performance independent of importance. A marked difference in performance from the random control implies that the learned importance of excitatory neurons is accurate. Fig. 2a compares the importance-based approaches with the random approach. With the positive-order approach, the model retains a high test accuracy (degradation below 10%) even after removing almost 80% of the neurons. This means that the forgetting layer can carefully discard a vast majority of redundant excitatory neurons with minor impairment of learned task-relevant knowledge. In contrast, the model’s performance collapses after removing only a few neurons (fewer than 5%) in the reverse order. This reveals that neural networks with forgetting layers encode information in a limited set of vital excitatory neurons and that destroying them can efficiently collapse the network.
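The three ablation orders can be sketched as follows. The sketch assumes that "positive order" removes the least important neurons first and that the reverse order removes the most important first, which is how the reported results read. The importance values and function names are hypothetical.

```python
import random


def neurons_to_ablate(importance, fraction, order, seed=0):
    """Return the sorted indices of the neurons to remove under a given order."""
    n_remove = int(round(fraction * len(importance)))
    indices = list(range(len(importance)))
    if order == "positive":    # least important neurons are removed first
        ranked = sorted(indices, key=lambda i: importance[i])
    elif order == "reverse":   # most important neurons are removed first
        ranked = sorted(indices, key=lambda i: importance[i], reverse=True)
    elif order == "random":    # control group, independent of importance
        ranked = indices[:]
        random.Random(seed).shuffle(ranked)
    else:
        raise ValueError("order must be 'positive', 'reverse' or 'random'")
    return sorted(ranked[:n_remove])


def ablate(activations, removed):
    """Zero out the activations of the removed neurons."""
    removed = set(removed)
    return [0.0 if i in removed else a for i, a in enumerate(activations)]


importance = [0.9, 0.1, 0.5, 0.0, 0.7]  # hypothetical learned importance values
positive = neurons_to_ablate(importance, 0.4, "positive")
reverse = neurons_to_ablate(importance, 0.4, "reverse")
control = neurons_to_ablate(importance, 0.4, "random", seed=1)
```

Evaluating the test accuracy after `ablate` for increasing fractions under each order gives curves of the kind compared in Fig. 2a.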

We further analyze the two inactivation modes on various layers, corresponding to neuronal apoptosis in biologically natural amnesia and targeted apoptosis in memory wiping. Fig. 2b reveals that higher-layer neurons are more important than those of lower layers, and selective removal of higher-layer neurons is more effective at destroying task-related memories. By comparison, excitatory neurons in the shallow layers are insensitive to removal, suggesting that fewer neurons at the lower layers and shallower networks may benefit the model’s robustness. In addition, the curves for natural forgetting and targeted forgetting enclose a large area at each layer, further verifying that the forgetting layer can accurately learn the importance of excitatory neurons.

We investigate the distribution of the excitatory neurons’ importance values using the same model as in Fig. 1e. The plain neural network is taken as a control group to analyse the forgetting layer equipped with a mixture of inhibition and lateral inhibition, as well as the forgetting layer equipped with only an inhibition mechanism. Fig. 2d demonstrates that the forgetting layer pushes the distribution of excitatory neurons’ importance values towards a bimodal distribution. Moreover, the introduction of the lateral inhibition mechanism further polarises the distribution. Unlike the near-uniform distribution of the standard model, most values in the distribution with the forgetting layer are concentrated at both ends, i.e., near zero and near the extreme values.

Previous research [11, 30, 31] holds the view that almost no base neurons dominate the model’s performance in standard convolutional neural networks (CNNs), and that groups of convolutional kernels in CNNs perform different functions separately, such as specific neurons responding only to specific features. Nevertheless, we demonstrate that CNNs with forgetting layers might possess such base neurons. Experiments on CIFAR-10 [32] with LeNet [33] suggest that CNNs with a forgetting layer facilitate the formation of base neurons, and removing these neurons leads to dramatic degradation of the model’s performance. Fig. 2c shows the test accuracy of the model after removing the excitatory neurons with top-1 and top-10 importance. After removing these two neurons, the model fails to recognize almost all categories, in contrast to the original standard model.

Forgetting Layers Facilitate Efficient Learning.

Biologically active forgetting allows for efficient information processing, as exhibited by the adaptability of neural networks to the task and the sparsity of neuronal encoding, preventing information overload [34, 35]. In particular, for network adaptation, the current widespread use of a uniform network structure ignores data variances, which leads to a mismatch between the network and the data. A straightforward way to address this issue is to manually adjust the number of neurons layer by layer. As a preliminary demonstration, we train a multilayer perceptron network (784-256-256-128-10 units) with configured forgetting layers on Fashion MNIST, whose network complexity is far beyond what the data needs (Fig. 3a). The vanilla SGD model is the standard model that serves as the control group. Compared with peer algorithms (L1 and L2 regularisation [36]), both of which can reduce the model complexity to fit the data, the forgetting layer better handles the mismatch between the network size and the data. Quantitatively, the forgetting layer substantially increases the generalisability of the model, i.e., a nearly 13% improvement in test accuracy compared to the vanilla SGD model. By comparison, L1 regularisation yields a slight improvement and L2 regularisation a moderate one, but both fall considerably short of the forgetting layer.

We next evaluate the ability of our algorithm to store information by examining how many parameters can be pruned. Intuitively, the fewer parameters the model uses while keeping the test accuracy constant, the more powerful its ability to store information. We trained a VGG network [37] with 9 layers on CIFAR-10. Three parameter pruning strategies are utilised: (i) random, stochastic selection of parameters, which serves as a lower bound on the performance of the algorithm; (ii) optimal brain surgeon (OBS) [38], one of the mainstream parameter pruning methods, which iteratively selects the parameters with the minimal loss in a pruning-training-pruning manner and serves as an upper bound on the algorithm; and (iii) forgetting layer-based parameter importance sampling, which selects the parameters with the lowest importance values, where the importance of a parameter is calculated from the neuron importance as


where i and j are neurons in adjacent layers. If neither neuron is important, the connection between them, corresponding to the parameter, is even less important, and vice versa. In Fig. 3b, we demonstrate the strong ability of the forgetting layer to store information with a small number of parameters. Specifically, it is capable of maintaining the model’s performance with only approximately 20% of the parameters, close to the level of OBS. It is important to emphasize that our algorithm is online and trained for only one session, while OBS is offline and requires multiple prolonged training-pruning sessions.
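One simple reading of the rule above is that a connection's importance is the product of the importance values of the two neurons it joins, so that a connection between two weakly important neurons scores lower than either neuron alone. This product form is an assumption made for illustration, not a formula confirmed by the text, and the function names and values below are hypothetical.

```python
def connection_importance(importance_in, importance_out):
    """Assumed rule: importance of weight (i, j) = importance_in[i] * importance_out[j]."""
    return [[a * b for b in importance_out] for a in importance_in]


def prune_mask(conn_importance, keep_fraction):
    """Keep the highest-scoring fraction of connections; 1 = keep, 0 = prune."""
    flat = [
        (value, i, j)
        for i, row in enumerate(conn_importance)
        for j, value in enumerate(row)
    ]
    n_keep = int(round(keep_fraction * len(flat)))
    flat.sort(key=lambda t: t[0], reverse=True)
    kept = {(i, j) for _, i, j in flat[:n_keep]}
    return [
        [1 if (i, j) in kept else 0 for j in range(len(row))]
        for i, row in enumerate(conn_importance)
    ]


importance_in = [1.0, 0.1]        # hypothetical importance of two input-side neurons
importance_out = [0.5, 0.2, 0.0]  # hypothetical importance of three output-side neurons
scores = connection_importance(importance_in, importance_out)
mask = prune_mask(scores, keep_fraction=1 / 3)
```

Because the scores come directly from neuron importance learned during training, a mask like this can be computed in a single pass, without the repeated pruning and retraining cycles of an iterative method.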

In addition, empirical results on CIFAR-10 with VGG suggest that the forgetting layer can accelerate model convergence. Fig. 3c shows that the model with forgetting layers substantially reduces the loss value, and its curve is smoother than that of the vanilla model. We further analyse the critical component of the forgetting layer, i.e., the cooperative mechanism of inhibition and lateral inhibition. The results show that the model without this mechanism has considerably higher loss values and more fluctuating curves than the vanilla model. We speculate that this is due to the introduction of additional parameters without an effective modulation strategy.

Figure 4: Analysis of model generalization and robustness to neuronal perturbations. a, Scatterplot of model generalization and sparsity. The horizontal axis represents the test accuracy, and the vertical axis represents the sparsity. These results are obtained from an MLP (784-64-64-10) trained ten times on Fashion MNIST. b, The result of 5 repetitions of random perturbations performed on the neurons of each layer of the model trained on Fashion MNIST. The horizontal axis is the proportion of perturbed neurons, and the vertical axis represents the test accuracy. The solid line indicates the mean test accuracy of the model with forgetting layers, and the shaded band indicates the range of fluctuation based on the standard deviation. As a comparison, the dashed line represents the mean test accuracy of the vanilla neural network. c, Results of the clustering visualization with t-SNE, based on the features extracted from the neural network with or without forgetting layers. As a baseline, the original MNIST panel is the result of using the raw images as features.

Forgetting Layers Allow for Well-balanced Sparsity, Generalization, and Robustness to Perturbations.

Achieving a trade-off between sparsity and generalization is challenging. We show the capability of the forgetting layer in dealing with this problem by training fully connected neural networks on Fashion MNIST. We compared various algorithms, including (i) a classical sparse algorithm; (ii) decorrelation [13], a category of algorithms to reduce overfitting; (iii) the vanilla model, a standard neural network; and (iv) two critical components of the forgetting layer, used as control groups: the inhibition mechanism alone, and the combination of inhibition and a shortcut strategy. In Fig. 4a, our algorithm obtains the highest score in sparsity while obtaining suboptimal generalization accuracy. By comparison, the first two baselines achieve high sparsity, but both compromise the model’s generalization compared to standard neural networks. Moreover, we analysed the components of the forgetting layer. Although the neural network using only the inhibition mechanism improves the sparsity, it suffers some loss in generalization accuracy. In contrast, the shortcut combined with inhibition substantially improves the generalization accuracy of the model, but is limited by low sparsity. Taken together, the model using the complete active forgetting strategy greatly enhances both generalization accuracy and sparsity.

We further validated the ability of the forgetting layer to improve generalization, specifically in terms of the learned features, on MNIST to eliminate dataset bias. We extracted features from the network’s last layer and clustered them using the t-SNE algorithm [39]. As baselines, the original input and the features of the vanilla neural network are used. Fig. 4c shows that the forgetting layer enhances the representativeness of the features. Compared with the baselines, the distribution of the features with the forgetting layer is more compact, and the distance between categories is larger. In particular, the distance between the red points and the brown points is considerably larger than in the vanilla model. This indicates that the forgetting layer learns more discriminative features.

Furthermore, given the above results, we conduct a preliminary investigation of whether the forgetting layer can improve robustness while balancing generalization and sparsity. To this end, we randomly perturbed the parameters of the model used in Fig. 4a. For comparison, we perform the same operation on the vanilla model. Combining the results of Fig. 4a and Fig. 4b, we show that the forgetting layer can substantially improve the robustness of the model. In particular, the result of separately perturbing each layer supports our conclusion that the sensitivity to perturbation is considerably reduced by utilising the forgetting layer while maintaining high generalization accuracy and sparsity. Moreover, we found that the sensitivity to perturbation varies widely across layers, with shallow layers having lower sensitivity and higher layers having higher sensitivity. The forgetting layer vastly improves the robustness at higher layers, i.e., the difference in generalization accuracy between the model with the forgetting layer and the standard model gradually increases at higher layers.

Figure 5: Analysis of the forgetting layer for long sequence memory on supervised tasks. a, An example of sequential learning on the Permuted MNIST dataset. b, Results of learning ten tasks sequentially on Permuted MNIST. Four methods are compared: EWC based on a vanilla multilayer perceptron model (784-512-256-10 units); EWC based on the identical multilayer perceptron model equipped with forgetting layers; joint training of all tasks at once, which is free from catastrophic forgetting; and sequential learning of tasks one by one in the vanilla model. All tasks share network parameters except for the output layer. c, Comparison of the distribution of parameter-importance values with various methods, on networks with and without forgetting layers. The yellow bars indicate the model using the forgetting layer, and the blue bars indicate the vanilla neural network model.

Forgetting Layers Support Long Sequences of Memory.

The brain learns and remembers continuously by actively forgetting to reduce the interference with future memory [34]. Analogously, we expect artificial neural networks to support continuous learning and memory; however, these networks face a crucial obstacle, i.e., losing old knowledge when learning new knowledge, which is known as the catastrophic forgetting issue [40]. Studies based on synaptic memory have inspired algorithms that deal with this issue, such as EWC [41], which maintains memory by consolidating parameters related to historical knowledge. Here, we show how forgetting layers can be utilised to improve the ability of such algorithms to overcome catastrophic forgetting and thus achieve long sequence memory. Specifically, we define a long sequence learning scenario in which we sequentially learn subtasks from a given set of tasks (Fig. 5a). The data for these subtasks are Permuted MNIST [42], whose samples are obtained by randomly permuting the pixels of the images in MNIST. We train a multilayer perceptron model sequentially on the task sequence using different strategies, including (i) joint training, which trains all subtasks at once and is an upper bound on model performance; (ii) sequential training, which trains subtasks incrementally using a vanilla model and is a lower bound on model performance; (iii) EWC, a classical synaptic memory algorithm; and (iv) the EWC algorithm with the forgetting layer. Fig. 5b (left) shows that the forgetting layer can effectively improve the model’s performance in overcoming catastrophic forgetting. During sequential learning, the vanilla model forgets earlier task-related memories, resulting in a sharp degradation of test accuracy. In contrast, the models using EWC, with or without the forgetting layer, can still maintain the performance of the first task. In particular, the performance of EWC with the forgetting layer is almost on par with the curve of joint training. It is important to note that the initial test accuracy of the three sequential strategies on the first task is higher than that of joint training, because joint training learns multiple tasks at once, which is more challenging than learning the first task alone. In addition, we use the average test accuracy to evaluate the ability of the algorithm to memorise long sequences, yielding results (Fig. 5b, right) similar to those in Fig. 5b (left). Both metrics show that the forgetting layer improves the model’s ability to memorise long sequences of tasks. The EWC algorithm with the forgetting layer substantially outperforms the other algorithms. Moreover, its superiority increases with the number of tasks in the sequence: the gap between it and the other algorithms gradually widens.
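For readers unfamiliar with this setup, the sketch below shows the two ingredients in their standard textbook form: building a Permuted MNIST task by applying one fixed random pixel permutation per task, and the quadratic EWC penalty that anchors parameters to their values after earlier tasks. The function names, the seed handling, and the toy numbers are illustrative assumptions, and how the forgetting layer enters the parameter importance is not shown.

```python
import random


def make_permutation(n_pixels, seed):
    """One fixed random pixel permutation defines one task."""
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)
    return perm


def permute_image(pixels, perm):
    """Apply a task's permutation to a flattened image."""
    return [pixels[p] for p in perm]


def ewc_penalty(theta, theta_star, fisher, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2.

    theta_star holds the parameters after the previous task and fisher
    holds the per-parameter importance (diagonal Fisher information).
    """
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )


image = [float(i % 7) for i in range(784)]  # stand-in for a flattened 28x28 image
perm_task_2 = make_permutation(784, seed=2)
image_task_2 = permute_image(image, perm_task_2)

penalty = ewc_penalty(
    theta=[1.0, 2.0], theta_star=[0.0, 2.0], fisher=[4.0, 10.0], lam=0.5
)
free_penalty = ewc_penalty(
    theta=[1.0, 2.0], theta_star=[0.0, 2.0], fisher=[0.0, 10.0], lam=0.5
)
```

The second call illustrates why importance values near zero matter: a parameter with zero importance can move freely at no cost, so it remains available for later tasks.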

We further explore why the forgetting layer extends the memory duration. Since synaptic memory algorithms centre on measuring the parameter importance, we follow the above experiments by analysing the distribution of parameter importance for algorithms with and without the forgetting layer. In Fig. 5c (middle), we first compare the EWC method. We demonstrate that the distribution of parameter importance for the model with the forgetting layer on EWC exhibits polarisation, i.e., the forgetting layer moves the distribution towards the ends, with most of the points moving to the zero region, while a smaller fraction of the points lies in the high-value region. This implies that the forgetting layer uses less model capacity to preserve historical memory while freeing up more capacity for future memory. To rule out that this phenomenon is specific to one algorithm, we next performed similar statistics on two other synaptic memory algorithms, MAS [43] and ANPyC [44] (Fig. 5c, left and right). Here too, the parameter-importance values of the models using the forgetting layer are more concentrated in the high- and low-value regions. Additionally, there are considerably fewer points of parameter importance in the intermediate region than in the model without the forgetting layer, and conversely, points in the zero-value region substantially increase, which dramatically decreases the model capacity occupied by historical memory.


In this paper, we first reveal the critical role of active forgetting in learning and memory. We then propose P&PF, a model based on an artificial neural network that simulates the active forgetting mechanism of the brain. It introduces a novel neuron, called the inhibitory neuron, which acts as a neural switch that selectively regulates the activation state of excitatory neurons. The inhibitory neuron self-regulates its internal state through the IRS, the lateral inhibitory mechanism, and modulates the excitatory neuron through the ERS, the inhibitory mechanism, facilitating the sparsity and representativeness of neural coding. We demonstrate the potential of active forgetting mechanisms on a series of supervised task scenarios, particularly improvements in sparsity, generalization, robustness of learning, and extended sequence memory.

The forgetting layer introduces selectivity and uncertainty into artificial neural network learning and memorisation. It strengthens highly representational neurons while eliminating redundant neurons, which improves the model’s generalization. Simultaneously, the release of task-irrelevant neurons increases uncertainty, making the learning system more robust to unknown structural and data changes in the future. Moreover, it explicitly expands network capacity, which is the crux of long persistent memory across a sequence of tasks.

Several previous works used the concept of forgetting in memory. The work in [45] designed a forgetting module that implicitly weakens synapses to forget and focuses on the decay of memory over time to model temporal dynamics, but it was not extended to artificial neural networks. The work in [46] filtered information by discarding specific neurons based on activation, but it was limited by its reliance on an effective data selection strategy. In contrast, our algorithm learns by itself how to select and discard features by drawing on inhibition and lateral inhibition mechanisms. These biological mechanisms have inspired other works, for instance to improve feature representativeness in visual attention and saliency detection [20], or to improve the memory capacity of the model through lateral inhibition [21]; in the latter, however, the parameter importance metric was designed manually rather than learned automatically as in ours.

Although this paper focuses on an algorithm inspired by biological active forgetting mechanisms, we believe that it may also help address some important questions in neuroscience. In particular, we find that artificial neural networks equipped with the forgetting layer may contain a category of neurons that aggregate and encode multilayer features to represent concepts, which is functionally consistent with recently discovered visual nerve cells [47]. This finding disputes the customary claim that independent neurons encode features in a distributed manner [11]. Furthermore, we demonstrate that the forgetting layer performs functions similar to long-term potentiation (LTP) and long-term depression (LTD) in brain memory, which are crucial for efficient learning and memory [48].

Some problems remain to be discussed and improved in the future. Specifically, the global receptive field is utilised in lateral inhibition. However, a local receptive field combined with a specific prior [49] or an adaptive approach utilising affine transformation [50] may be beneficial. Moreover, dynamic memory retention based on the forgetting mechanism in long-term memory needs further exploration. To achieve this, it is probably necessary to consider the cyclic functional structure of learning and forgetting. Some directions deserve investigation: (i) combining attention mechanisms [51] to achieve iterative forgetting and recall of memory to facilitate learning; (ii) introducing active forgetting mechanisms into explicit external memory models [52] to promote long-term memory consolidation; and (iii) drawing on biological neural loop generation mechanisms [53, 54] to achieve structural scalability after forgetting.


Inspired by neuroscientific studies, we argue that the crux of how biological neural networks achieve active forgetting lies in (i) specific mediators that indirectly regulate the activation state of excitatory neurons, which are not available in artificial neural networks, and (ii) a competition mechanism with a clear division of labor, which acts directly on the mediators to promote a balance between the sparsity of the mediators themselves and the expression of excitatory neurons. Based on these properties, we model the active forgetting mechanism by introducing inhibitory neurons that act as neural switches to regulate the activation state of excitatory neurons. The inhibitory neurons maintain their own sparseness via the IRS while maintaining the representativeness of excitatory neurons via the ERS.

Selectivity of forgetting and inhibitory neurons in P&PF. In the "plug-and-play" forgetting layer, we configure specific neural switches called inhibitory neurons to regulate the activation of the corresponding excitatory neurons. The layer selectively preserves or deactivates excitatory neurons through a group of weights, one per excitatory neuron, which filters out redundant information and preserves critical information. A low weight value means that the excitatory neuron is turned off, and a high value means that it stays on. This allows the learning system to encode the task using as few excitatory neurons as possible. Highly task-relevant signals flow into the next layer, facilitating the learning of representative features, while the minimal number of active excitatory neurons avoids occupying much of the network capacity.

As shown in Fig. 1a, the forgetting layer is attached to a vanilla artificial neural network: it receives signals from the excitatory neurons in the previous layer and modulates their activation through inhibitory neurons. Its output is the elementwise product of the outputs of the previous layer and the forgetting layer operation, which regulates how much excitatory activation flows into the next layer by adjusting the inhibitory neurons' weights. The forgetting layer operation is defined through a forgetting function that controls the degree of forgetting. To ensure that the algorithm is differentiable, so that gradients can be computed, we use a variant of the sigmoid function as the forgetting function, because it has the following properties:

(i) it is an exponential function whose inverse is consistent with the Ebbinghaus forgetting curve [56], and thus with the biological law of forgetting; (ii) it is smooth, which guarantees that the algorithm can be trained end-to-end by back-propagation; and (iii) its range lies between 0 and 1, which makes it well suited to describing the state of the neural switch.

Specifically, the right side of the equation contains the weight of the inhibitory neuron and a hyperparameter with a value greater than zero, which controls the degree of polarisation of the sigmoid function. The larger the hyperparameter is, the steeper the function curve is. Intuitively, the hyperparameter acts as a cooling factor, partially affecting the rate of forgetting. For example, increasing its value will cause more inhibitory neurons to turn on, which means that more excitatory neurons turn off and only the most representative excitatory neurons are preserved. The left side of the equation represents the state of the inhibitory neuron, i.e. the neural switch, which determines the magnitude of the signal flowing to the next layer. In the initial state, most neural switches are in an indeterminate half-on, half-off state. During the training phase, most task-irrelevant excitatory neurons are then turned off through the neural switch-based inhibition mechanism, while only the most critical excitatory neurons survive. Thus, the model learns a value that reflects the importance of each excitatory neuron.
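The gating described here can be illustrated with a minimal Python sketch. The function names `forgetting_gate` and `forgetting_layer`, the parameter name `beta` for the positive steepness hyperparameter, and the concrete form sigmoid(beta * w) are assumptions made for illustration; the text only states that the forgetting function is a sigmoid variant that becomes steeper as the hyperparameter grows.

```python
import math


def forgetting_gate(w, beta=1.0):
    """State of one neural switch, strictly between 0 and 1.

    Assumed form: sigmoid(beta * w), where w is the inhibitory neuron's
    weight and beta > 0 is the steepness ("cooling") hyperparameter.
    """
    if beta <= 0:
        raise ValueError("beta must be greater than zero")
    # Numerically stable identity: sigmoid(z) = 0.5 * (1 + tanh(z / 2)).
    return 0.5 * (1.0 + math.tanh(0.5 * beta * w))


def forgetting_layer(prev_outputs, inhibitory_weights, beta=1.0):
    """Elementwise product of previous-layer outputs and switch states."""
    if len(prev_outputs) != len(inhibitory_weights):
        raise ValueError("one inhibitory weight is needed per excitatory neuron")
    return [a * forgetting_gate(w, beta)
            for a, w in zip(prev_outputs, inhibitory_weights)]


if __name__ == "__main__":
    activations = [2.0, 4.0, 1.0]
    weights = [0.0, 3.0, -3.0]  # undecided, mostly on, mostly off
    for beta in (1.0, 5.0):
        print(beta, forgetting_layer(activations, weights, beta))
```

Under these assumptions, a weight of zero gives a switch state of 0.5 (the half-on, half-off initial state), and a larger `beta` pushes the states of non-zero weights closer to 0 or 1.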

Since the sigmoid function tends to saturate for large activation values, the model is prone to vanishing gradients in deep networks, which makes convergence difficult (Fig. 1b). To alleviate this issue, we introduce a shortcut connection (Fig. 1a).


The external regulation strategy via inhibition mechanism.

To ensure that the inhibitory neuron weights are learnable, we design an adversarial mechanism. In addition to making the output of the forgetting layer consistent with the task-relevant ground truth, which corresponds to the standard loss function, we also keep the inhibitory neurons in the forgetting layer turned on as much as possible, which corresponds to a loss inspired by the inhibition mechanism. In this mechanism, inhibitory neurons suppress the transmission of potentials from another group of excitatory neurons as signals pass between neurons. It ensures that minimising the inhibitory neurons while maximising task-related information results in a sparse response (Fig. 1d). The corresponding objective function, which inhibits neuron activation, is indexed by the layer of the network and by the location of each neuron within that layer.
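As a rough illustration of an inhibition objective indexed by layer and neuron location, the sketch below sums the switch states over all layers and neurons. This sum-of-states form, the names `switch_state` and `ers_penalty`, and the parameter `beta` are assumptions for illustration only; the exact formula used by the authors is not reproduced here.

```python
import math


def switch_state(w, beta=1.0):
    """Assumed switch state: sigmoid(beta * w), computed in a stable way."""
    return 0.5 * (1.0 + math.tanh(0.5 * beta * w))


def ers_penalty(weights_per_layer, beta=1.0):
    """Sum of switch states over every layer l and neuron location i.

    Minimising this value pushes switch states towards 0, i.e. it turns
    excitatory neurons off unless the task loss keeps them open.
    """
    return sum(switch_state(w, beta)
               for layer in weights_per_layer
               for w in layer)


if __name__ == "__main__":
    undecided = [[0.0, 0.0], [0.0]]        # all switches half-on
    mostly_off = [[-4.0, -4.0], [4.0]]     # only one neuron kept open
    print(ers_penalty(undecided), ers_penalty(mostly_off))
```

In this sketch, a network whose switches are mostly closed receives a lower penalty than one whose switches are all undecided, which is the sparse response the inhibition term is meant to encourage.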

The internal regulation strategy via lateral inhibition mechanism. Biological excitatory neurons reduce the activity of neighbouring neurons through lateral inhibition, which suppresses the lateral propagation of action potentials from one neuron to its neighbours. This sharpens the contrast of the stimulus signal and reinforces its encoding. Based on this mechanism, we build an analogous mechanism for the inhibitory neurons in the forgetting layer (Fig. 1b) by adding a regularisation term, defined for each layer of the neural network over pairs of neuron positions within that layer.

Furthermore, we consider that in artificial neural networks the interactions between neurons depend on the neurons' properties and not only on the size of their activation responses. Some neurons may be indispensable regardless of their response values. Therefore, we believe the lateral inhibition mechanism must take the importance of neurons into account, and we make two hypotheses: (i) if a neuron is essential, it should be inhibited less; and (ii) if a neuron is essential, it inhibits others more. Based on these two hypotheses, we modify the regularisation term above (equation 7) so that each pair of neurons in a layer is weighted by the importance of the two neurons. We emphasize that this importance is learned automatically through the excitation-inhibition antagonistic mechanism in the forgetting layer.
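The two hypotheses can be made concrete with a small sketch for a single layer. Both functions are illustrative assumptions: the pairwise product form of the plain penalty and the weighting `importance[j] * (1 - importance[i])` (with importance values taken to lie between 0 and 1) are one possible way to encode "essential neurons are inhibited less and inhibit others more", not the authors' formula.

```python
def lateral_inhibition_penalty(states):
    """Sum of s_i * s_j over ordered pairs i != j in one layer (assumed form)."""
    total = 0.0
    for i, s_i in enumerate(states):
        for j, s_j in enumerate(states):
            if i != j:
                total += s_i * s_j
    return total


def weighted_lateral_inhibition_penalty(states, importance):
    """Importance-weighted variant (assumed form).

    Neuron j inhibits neuron i with strength importance[j] * (1 - importance[i]),
    so an essential neuron inhibits others more and is itself inhibited less.
    """
    if len(states) != len(importance):
        raise ValueError("one importance value is needed per neuron")
    total = 0.0
    for i, s_i in enumerate(states):
        for j, s_j in enumerate(states):
            if i != j:
                total += importance[j] * (1.0 - importance[i]) * s_i * s_j
    return total


if __name__ == "__main__":
    states = [1.0, 1.0]
    print(lateral_inhibition_penalty(states))
    print(weighted_lateral_inhibition_penalty(states, [1.0, 0.0]))
```

With these definitions, a layer in which only one switch is open incurs no penalty, while several simultaneously open switches are penalised; in the weighted variant, the only contribution in the example comes from the essential neuron inhibiting the non-essential one.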

Forgetting rate and the cooperative constraints of ERS and IRS. The ERS and the lateral inhibition mechanism of the IRS work in concert to actively select and extinguish insignificant excitatory neurons. The total objective function for a standard model equipped with the forgetting layer combines the loss function of the task-related objective with the inhibition and lateral inhibition terms. Two hyperparameters control the weights of inhibition and lateral inhibition, respectively, and thereby the rate of forgetting.
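In assumed notation, the combination described here can be written as a weighted sum. The symbols $\mathcal{L}_{\text{task}}$, $\mathcal{L}_{\text{ERS}}$, $\mathcal{L}_{\text{IRS}}$, $\lambda_{1}$ and $\lambda_{2}$ are placeholders chosen for this sketch, and the additive form is an assumption based on the description of two weighting hyperparameters:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{1}\,\mathcal{L}_{\text{ERS}}
  + \lambda_{2}\,\mathcal{L}_{\text{IRS}},
\qquad \lambda_{1}, \lambda_{2} \ge 0
```

Under this reading, setting both weights to zero recovers the standard model's loss, and larger weights trade task fit for stronger inhibition and hence faster forgetting.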


The implementation of the training and testing models equipped with active forgetting is based on the TensorFlow framework [55].



  • [1] Craik, F. I. & Jennings, J. M. The handbook of aging and cognition (Lawrence Erlbaum Associates, Inc, 1992).
  • [2] Piaget, J. & Inhelder, B. Memory and Intelligence (Psychology Revivals) (Psychology Press), 1st edn.
  • [3] James, W. The principles of psychology (Cosimo, Inc., 2007).
  • [4] Gravitz, L. The forgotten part of memory. Nature 571, S12–S12 (2019).
  • [5] Bekinschtein, P., Weisstaub, N. V., Gallo, F., Renner, M. & Anderson, M. C. A retrieval-specific mechanism of adaptive forgetting in the mammalian brain. Nature communications 9, 4660 (2018).
  • [6] Loftus, G. Evaluating forgetting curves. Journal of Experimental Psychology: Learning, Memory, and Cognition 11, 397–406, DOI: https://doi.org/10.1037/0278-7393.11.2.397 (1985).
  • [7] Carpenter, S., Pashler, H. & Wixted, J. The effects of tests on learning and forgetting. Memory & Cognition 36, 438–448, DOI: https://doi.org/10.3758/MC.36.2.438 (2008).
  • [8] Zoph, B. & Le, Q. V. Neural architecture search with reinforcement learning. Preprint at https://arxiv.org/abs/1611.01578 (2016).
  • [9] Tishby, N. & Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), 1–5, DOI: 10.1109/ITW.2015.7133169 (IEEE, 2015).
  • [10] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  • [11] Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105 (2012).
  • [12] Zhang, Y., Lee, J. D. & Jordan, M. I. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, 993–1001 (PMLR, 2016).
  • [13] Cogswell, M., Ahmed, F., Girshick, R., Zitnick, L. & Batra, D. Reducing overfitting in deep networks by decorrelating representations. Preprint at https://arxiv.org/abs/1511.06068 (2015).
  • [14] Yang, M., Zhang, L., Yang, J. & Zhang, D. Robust sparse coding for face recognition. In CVPR 2011, 625–632 (IEEE, 2011).
  • [15] Yang, Y., Ruozzi, N. & Gogate, V. Scalable neural network compression and pruning using hard clustering and l1 regularization. Preprint at https://arxiv.org/abs/1806.05355 (2018).
  • [16] Gers, F., Schmidhuber, J. & Cummins, F. Learning to forget: Continual prediction with lstm. Neural Computation 12, 2451–2471, DOI: 10.1162/089976600300015015 (2000).
  • [17] Dey, R. & Salem, F. M. Gate-variants of gated recurrent unit (gru) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), 1597–1600 (IEEE, 2017).
  • [18] Yu, Y., Si, X., Hu, C. & Zhang, J. A review of recurrent neural networks: Lstm cells and network architectures. Neural computation 31, 1235–1270 (2019).
  • [19] Liu, J., Shahroudy, A., Xu, D. & Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In European conference on computer vision, 816–833 (Springer, 2016).
  • [20] Cao, C. et al. Lateral inhibition-inspired convolutional neural network for visual attention and saliency detection. In Thirty-second AAAI conference on artificial intelligence (2018).

  • [21] Aljundi, R., Rohrbach, M. & Tuytelaars, T. Selfless sequential learning. In 7th International Conference on Learning Representations (2019).
  • [22] Hardt, O., Nader, K. & Nadel, L. Decay happens: the role of active forgetting in memory. Trends in cognitive sciences 17, 111–120 (2013).
  • [23] Davis, R. L. & Zhong, Y. The biology of forgetting—a perspective. Neuron 95, 490–503 (2017).
  • [24] Tononi, G. & Cirelli, C. Sleep and the price of plasticity: from synaptic and cellular homeostasis to memory consolidation and integration. Neuron 81, 12–34 (2014).
  • [25] Shuai, Y. et al. Forgetting is regulated through rac activity in drosophila. Cell 140, 579–589 (2010).
  • [26] Izawa, S. et al. Rem sleep–active mch neurons are involved in forgetting hippocampus-dependent memories. Science 365, 1308–1313 (2019).
  • [27] Yu, Y., Migliore, M., Hines, M. L. & Shepherd, G. M. Sparse coding and lateral inhibition arising from balanced and unbalanced dendrodendritic excitation and inhibition. Journal of Neuroscience 34, 13701–13713 (2014).
  • [28] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 770–778 (2016).

  • [29] Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Preprint at https://arxiv.org/abs/1708.07747 (2017).
  • [30] Guidotti, R. et al. A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51, 1–42 (2018).
  • [31] Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning, 3145–3153 (PMLR, 2017).
  • [32] Krizhevsky, A., Hinton, G. et al. Learning multiple layers of features from tiny images. Tech. Rep., Citeseer (2009).

  • [33] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
  • [34] Wimber, M., Alink, A., Charest, I., Kriegeskorte, N. & Anderson, M. C. Retrieval induces adaptive forgetting of competing memories via cortical pattern suppression. Nature neuroscience 18, 582–589 (2015).
  • [35] Richards, B. A. & Frankland, P. W. The persistence and transience of memory. Neuron 94, 1071–1084 (2017).
  • [36] Phaisangittisagul, E. An analysis of the regularization between l2 and dropout in single hidden layer neural network. In 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS), 174–179 (IEEE, 2016).
  • [37] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).
  • [38] Hassibi, B. & Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5 (NIPS 1992), 164–171 (Morgan Kaufmann, San Mateo, CA, 1993).
  • [39] Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. Journal of machine learning research 9 (2008).
  • [40] McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, vol. 24, 109–165 (Elsevier, 1989).
  • [41] Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114, 3521–3526 (2017).
  • [42] Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F. J. & Schmidhuber, J. Compete to compute. In NIPS, 2310–2318 (Citeseer, 2013).
  • [43] Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M. & Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), 139–154 (2018).
  • [44] Peng, J. et al. Overcoming long-term catastrophic forgetting through adversarial neural pruning and synaptic consolidation. IEEE Transactions on Neural Networks and Learning Systems (2021).
  • [45] Fusi, S., Drew, P. J. & Abbott, L. F. Cascade models of synaptically stored memories. Neuron 45, 599–611 (2005).
  • [46] Gomez, A. N. et al. Learning sparse networks using targeted dropout. Preprint at https://arxiv.org/abs/1905.13678 (2019).
  • [47] Han, Y. et al. The logic of single-cell projections from visual cortex. Nature 556, 51–56 (2018).
  • [48] Norimoto, H. et al. Hippocampal ripples down-regulate synapses. Science 359, 1524–1527 (2018).
  • [49] Turcsany, D., Bargiela, A. & Maul, T. Local receptive field constrained deep networks. Information Sciences 349, 229–247 (2016).
  • [50] Wei, Z., Sun, Y., Wang, J., Lai, H. & Liu, S. Learning adaptive receptive fields for deep image parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
  • [51] Kim, S., Hori, T. & Watanabe, S. Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 4835–4839 (IEEE, 2017).
  • [52] Graves, A. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016).
  • [53] Ostapenko, O., Puscas, M., Klein, T., Jahnichen, P. & Nabi, M. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
  • [54] Mocanu, D. C. et al. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications 9, 1–12 (2018).
  • [55] Abadi, M. et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), 265–283 (2016).
  • [56] Ebbinghaus, H. Memory: A contribution to experimental psychology. Annals of neurosciences 20, 155 (2013).
  • [57] Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).


This work was supported in part by the National Natural Science Foundation of China under Grant 41871364, Grant 41571397, Grant 42071427, and Grant 41771458 and by using Computing Resources at the High Performance Computing Platform of Central South University.

Author contributions statement

Ethics declarations

Competing interests:

The authors declare no competing interests.

Additional information

The source code and data will be available at https://github.com/GeoX-Lab/P-PF.