Triple Memory Networks: a Brain-Inspired Method for Continual Learning

03/06/2020 ∙ by Liyuan Wang, et al. ∙ Tsinghua University

Continual acquisition of novel experience without interfering with previously learned knowledge, i.e. continual learning, is critical for artificial neural networks, but is limited by catastrophic forgetting: a neural network adjusts its parameters when learning a new task and then fails to perform the old tasks well. By contrast, the brain has a powerful ability to continually learn new experience without catastrophic interference. The underlying neural mechanisms are possibly attributable to the interplay of the hippocampus-dependent and neocortex-dependent memory systems, mediated by the prefrontal cortex. Specifically, the two memory systems develop specialized mechanisms to consolidate information in more specific and more generalized forms, respectively, and complement the two forms of information through their interplay. Inspired by this brain strategy, we propose a novel approach named triple memory networks (TMNs) for continual learning. TMNs model the interplay of hippocampus, prefrontal cortex and sensory cortex (a neocortex region) as a triple-network architecture of generative adversarial networks (GAN). The input information is encoded as a specific representation of the data distributions in a generator, or as generalized knowledge for solving tasks in a discriminator and a classifier, and appropriate brain-inspired algorithms are implemented to alleviate catastrophic forgetting in each module. In particular, the generator replays generated data of the learned tasks to the discriminator and the classifier, both of which are implemented with a weight consolidation regularizer to complement the information lost in the generation process. TMNs achieve new state-of-the-art performance on a variety of class-incremental learning benchmarks on MNIST, SVHN, CIFAR-10 and ImageNet-50, compared with strong baseline methods.


1 Introduction

The ability to continually learn new information without interfering with previously learned knowledge, i.e. continual learning, is one of the basic challenges for deep neural networks (DNNs), because the continual acquisition of information from dynamic data distributions generally results in catastrophic forgetting McCloskey and Cohen (1989). When accommodating new experience, a normally trained DNN tends to adjust its learned parameters and thus forgets the old knowledge.

Numerous efforts have been devoted to mitigating catastrophic forgetting, e.g. regularization methods and memory replay Parisi et al. (2019). Regularization methods protect parameters that are important for solving the learned tasks, e.g. EWC Kirkpatrick et al. (2017) and SI Zenke et al. (2017), but find it hard to allocate additional parameters for new outputs without access to the old data distributions. Memory replay methods replay a small amount of training data, or use deep generative models to replay generated data Shin et al. (2017), which usually cannot precisely maintain the distributions of the learned training data. As shown in Kemker et al. (2018), the strategy that achieves the optimal performance of continual learning often heavily depends on the learning paradigms and the datasets being used, and none of the existing methods solves catastrophic forgetting. These empirical results suggest a general strategy to further alleviate catastrophic forgetting: both forms of information, i.e. the learned knowledge for solving old tasks and the learned distribution of the old training data, should be maintained so that each can complement the information lost by the other during continual learning.

Figure 1: Diagrams of the brain memory system. Two forms of memory are encoded in three brain regions and protected by two consolidation mechanisms: 1. Consolidation of Synapses; 2. Neurogenesis & Neural Inhibition. (b) is modified from Frankland and Bontempi (2005).

Compared with DNNs, the brain is able to continually learn new experience without catastrophic forgetting Deng et al. (2010); Wiskott et al. (2006); McClelland et al. (1995). This ability is possibly achieved by the organization principles of the brain memory system, i.e. the cooperation of three memory networks, a currently well-accepted model Frankland and Bontempi (2005). In each module of the triple-network architecture, information is encoded in more specific forms, e.g. an object you have seen, or in more generalized forms, e.g. the features of the object that associate to a concept. The encoded information is stabilized individually by different neural mechanisms in these modules (Fig. 1), a process called consolidation of memory. Specifically, the hippocampus develops the mechanisms of neurogenesis and neural inhibition to encode more specific information. Neurogenesis generates additional neurons to create space for incoming experience Deng et al. (2010), and inhibitory neurons develop together with them to inhibit irrelevant parts of the network and prevent interference Gonçalves et al. (2016). By contrast, the neocortex encodes more generalized information, which is consolidated by strengthening synaptic connectivity Frankland and Bontempi (2005). Within the neocortex, the prefrontal cortex (PFC) mediates the interplay of the two memory systems: it develops a discrimination mechanism to regulate the encoding of specific information in the hippocampus, and integrates (sensory) cortical modules (SC) in the neocortex to encode generalized information Frankland and Bontempi (2005). The interplay of the hippocampus, PFC and SC complements the individually consolidated information to avoid catastrophic forgetting. This strategy inspires us to interdependently protect both the structured knowledge for solving tasks and the learned distribution of the training data in a continual learning system.

In this work, we aim to provide a new approach for continual learning by systematically drawing inspiration from the triple-memory architecture of brain functions. Specifically, the framework applies the idea of generative adversarial networks (GAN) Goodfellow et al. (2014) to model the functions of the hippocampus, PFC and SC as the cooperation of a generator, a discriminator and a classifier, respectively. A conditional generator learns the distribution of the training data, similar to the specific experience encoded in the hippocampus. The generator models the roles of neurogenesis and neural inhibition with an extendable architecture to learn incoming domains and a domain-specific attention mask to prevent interference with the encoded experience. The generated data is replayed to a discriminator, which models PFC and detects the familiarity of input experience, and then to a classifier, which models SC. During memory replay, the classifier learns new tasks under the supervision of the discriminator, as PFC integrates SC to encode generalized information. To regularize the difference between the generated data and the old training data, both the classifier and the discriminator are implemented with weight consolidation algorithms, inspired by the strengthened synaptic connectivity in the neocortex. The model is named "triple memory networks (TMNs)" (Fig. 2), since the three networks continually learn the knowledge for generation, identification and classification, and maintain the learned knowledge through the corresponding brain-inspired algorithms.

Our contributions include: (1) we propose that the brain strategy of processing both specific and generalized information through the interplay of three memory networks can be applied in continual learning; (2) we present the triple memory networks (TMNs), an "artificial memory system" for generative replay and class-incremental learning, which leverages the organization principles of the brain memory system to mitigate catastrophic forgetting; (3) our method achieves state-of-the-art (SOTA) performance on a variety of image classification benchmarks, including MNIST, SVHN, CIFAR-10 and ImageNet-50; and (4) to the best of our knowledge, TMNs are the first attempt to model the triple-network architecture of the brain memory system for continual learning, bridging the fields of artificial and biological neural networks.

2 Related Work

Regularization methods use an additional regularization term in the loss function to penalize changes to parameters that are important for the old tasks, and thus stabilize these weights (synapses).

Kirkpatrick et al. (2017) proposed the elastic weight consolidation (EWC) method, which adds a quadratic penalty on the difference between the parameters for the old and the new tasks. In EWC, the strength of the penalty (the "importance") on each parameter is calculated by the diagonal of the Fisher information matrix (FIM). Zenke et al. (2017) proposed the synaptic intelligence (SI) approach to calculate synaptic relevance in an online fashion. Memory aware synapses (MAS) Aljundi et al. (2018) use the gradient of the learned function with respect to a parameter as the importance of the corresponding synapse. Interestingly, Aljundi et al. (2018) showed that in MAS the importance of a synapse equals the co-activation frequency of the two neurons connected by that synapse. This principle is analogous to the Hebbian theory of synaptic plasticity: "cells that fire together, wire together" Hebb (1962). Serrà et al. (2018) proposed the hard attention to the task (HAT) algorithm, which allocates a dedicated parameter subspace to each task through a hard attention mask, similar to the function of biological neural inhibition.
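As a concrete illustration of the EWC penalty described above, the following PyTorch sketch computes the diagonal empirical FIM and the quadratic penalty; the helper names (`ewc_penalty`, `fisher_diagonal`) and the structure are our own illustration, not code from any of the cited papers:

```python
import torch

def ewc_penalty(model, fisher_diag, old_params, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

def fisher_diagonal(model, loader, criterion):
    """Diagonal of the empirical FIM: averaged squared gradients of the loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        criterion(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}
```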

Memory replay strategies mitigate catastrophic forgetting by replaying a small amount of stored training data, or by replaying generated data from a deep generative model (generative memory replay). Rebuffi et al. (2017) proposed the iCaRL approach, which stores exemplar data points of old tasks to train the feature extractor together with new data. EEIL uses a distillation measure to retain the exemplar set of old classes Castro et al. (2018). However, in strict continual learning setups, storing real data is not allowed. Generative memory replay avoids the limitation of replaying raw data by reconstructing training examples from a generative model. FearNet Kemker and Kanan (2017) applied a generative auto-encoder to train a classifier together with the training data. Deep Generative Replay Shin et al. (2017), Memory Replay GAN (MeRGAN) Wu et al. (2018) and Dynamic Generative Memory (DGMw) Ostapenko et al. (2019) use GANs as the generative model.

To adapt to the setup of incremental tasks, some continual learning frameworks make the network extendable. Rusu et al. (2016) proposed progressive networks, which freeze the network trained on previous tasks and allocate new sub-networks for incoming tasks. Several generative replay methods make the generative network extendable to learn new domains, such as FearNet, MeRGAN and DGMw.

The performance of continual learning is mainly evaluated in task-incremental or class-incremental scenarios van de Ven and Tolias (2019); Ostapenko et al. (2019). In task-incremental scenarios, the task label is given at test time, i.e. multi-head evaluation. The class-incremental setup, which predominantly applies single-head evaluation Chaudhry et al. (2018), is more general and realistic, in that the task label is not provided during testing. However, single-head evaluation usually requires the model to allocate dedicated parameters for the new outputs of the incremental classes so that all learned classes can be evaluated together. Allocating such additional parameters is challenging for weight regularization methods without access to the old data distributions. Sub-network methods with attention gates decorrelate the learned tasks and thus fail to evaluate them together. Memory replay shows advantages in this incremental setting Kemker et al. (2018) but is limited by the unavailability of the real data. Generative memory replay is a promising strategy that uses generated data to replace the unavailable old training data, but it only performs well on relatively simple datasets, e.g. MNIST and SVHN Parisi et al. (2019).

Our method aims to improve continual learning in the class-incremental setup with single-head evaluation, and it can be easily extended to other scenarios. Below, we analyze two main issues of existing generative replay methods and solve them through brain-inspired strategies.

3 Preliminary Knowledge

We start by introducing the notation and the setup of continual learning, and then discuss the limitations of existing methods.

3.1 Setups and Notations

We consider continual learning with $T$ tasks. The entire dataset is a union of task-specific datasets, $\mathcal{D} = \bigcup_{t=1}^{T} \mathcal{D}_t$, where $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t}$ and $x_i$ is an input data point with the true label $y_i$. In the incremental setup, the dataset $\mathcal{D}_t$ is only available during the training of task $t$; $\mathcal{D}_t$ may contain data from multiple classes or from only a single class. Testing follows single-head evaluation, i.e. task labels are not provided and all the learned classes are evaluated together at test time Chaudhry et al. (2018).

In this setting, suppose a DNN first learns parameters $\theta$ from the dataset $\mathcal{D}_t$ of the current task $t$ and performs well on it. For the new task $t+1$, if the DNN is directly trained on $\mathcal{D}_{t+1}$ without the availability of $\mathcal{D}_t$, the parameters will adapt to achieve good performance on task $t+1$, while tending to forget the learned knowledge of task $t$. This is known as catastrophic forgetting McCloskey and Cohen (1989).
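As a toy illustration of this class-incremental setup (our own example, assuming a torchvision-style dataset with a `.targets` attribute), the stream below yields the task-specific subsets $\mathcal{D}_t$ one at a time:

```python
import torch
from torch.utils.data import Subset

def class_incremental_tasks(dataset, classes_per_task=2, num_classes=10):
    """Split a labeled dataset into a sequence of task-specific subsets D_t.

    Each D_t is only available while training task t; at test time, all
    classes learned so far are evaluated together (single-head evaluation).
    """
    targets = torch.as_tensor(dataset.targets).tolist()
    for start in range(0, num_classes, classes_per_task):
        task_classes = set(range(start, start + classes_per_task))
        indices = [i for i, y in enumerate(targets) if y in task_classes]
        yield sorted(task_classes), Subset(dataset, indices)
```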

3.2 Two Issues of Existing Generative Replay Methods

To avoid forgetting the information of old tasks in continual learning, a promising strategy is generative replay, which first learns a generative model to describe the old training data and then uses the model to generate data of all the previous tasks. When training task $t$, we use the extended dataset $\tilde{\mathcal{D}}_t = \mathcal{D}_t \cup \mathcal{D}^{gen}_{1:t-1}$, which contains both the training data $\mathcal{D}_t$ of the current task and the generated data $\mathcal{D}^{gen}_{1:t-1}$ of the previous tasks, thereby preventing the forgetting of previous knowledge. In the sequel, we use $y'$ to denote the label of a generated data point $x'$, distinguishing it from a training data label $y$. Because $\tilde{\mathcal{D}}_t$ consists of both training data $(x, y)$ and generated data $(x', y')$, we use $\tilde{y}$ to denote the label of a sample $\tilde{x}$ in $\tilde{\mathcal{D}}_t$.

However, current state-of-the-art (SOTA) generative replay methods, such as MeRGAN Wu et al. (2018) and DGMw Ostapenko et al. (2019), only perform well on relatively simple datasets (e.g., MNIST and SVHN) and much worse on complex ones (e.g., ImageNet). Here we show two issues of current generative replay methods that decrease the performance of continual learning:

Issue (a): Difference between the generated data and training data.

Generative replay does not directly solve catastrophic forgetting, but transfers the stress from the task-solver network to the generative network. Catastrophic forgetting in generative replay methods is mainly caused by the deviation between generated data and training data: if the generative module precisely learned the distribution of the old training data, the performance of generative replay would be the same as joint training on all the training data. To show the difference between generated data and training data on the task-solver network during generative replay, we first train a classifier with training data to simulate the training stage, and then with generated data of the same task to simulate the replay stage, in a 10-class ImageNet task. We then measure the empirical Fisher information matrix (FIM), i.e., the squared gradients of the parameters, on training data and on generated data, and quantify the cosine similarity in Table 3 (the $\lambda = 0$ group). Cosine similarity measures the cosine of the angle between the two matrices through the normalized inner product. The divergent directions of the two empirical FIMs indicate that the parameters of the classifier are optimized in different directions on the two types of data (i.e., generated data and training data).
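The diagnostic above can be reproduced with a short routine of this form (our own sketch): estimate the diagonal empirical FIM on each data source, flatten it, and take the normalized inner product:

```python
import torch
import torch.nn.functional as F

def empirical_fim_diag(model, x, y):
    """Flattened diagonal empirical FIM: squared gradients of the CE loss."""
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return torch.cat([p.grad.detach().flatten() ** 2
                      for p in model.parameters() if p.grad is not None])

def fim_cosine_similarity(model, real_batch, generated_batch):
    """Cosine of the angle between the FIMs on real and generated data."""
    f_real = empirical_fim_diag(model, *real_batch)
    f_gen = empirical_fim_diag(model, *generated_batch)
    return F.cosine_similarity(f_real, f_gen, dim=0).item()
```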

Issue (b): Interference of discrimination and classification in a joint discriminator network. Many SOTA generative replay methods are implemented in a two-module GAN architecture, e.g., AC-GAN Odena et al. (2017), where the discriminator network is responsible for both discriminating fake/real examples and predicting class labels. However, in continual learning, the discriminator is optimized for the discrimination of the current task on dataset $\mathcal{D}_t$ and the classification of all the learned tasks on dataset $\tilde{\mathcal{D}}_t$. Thus, discrimination and classification in a joint discriminator network might not reach their optima simultaneously. To verify this interference, we measure the empirical FIMs of the auxiliary classifier and the discriminator after training an AC-GAN model continually on 10 classes of MNIST or SVHN. The two FIMs show poor overlap in the shared part of the discriminator network (Table 4). In particular, the weights of deeper convolution layers show much lower similarity: the averaged cosine similarities of conv1.weight, conv2.weight and conv3.weight are 0.66, 0.40 and 0.16 on MNIST, and 0.70, 0.20 and 0.08 on SVHN. Thus, we hypothesize that discrimination and classification interfere with each other in continual learning setups.

4 Our Proposal

We now present our approach to addressing the above issues of existing memory replay methods. We build on the success of the brain memory system in continual learning and develop three "artificial memory networks", implementing consolidation approaches in the appropriate modules.

Figure 2: Architecture and training of TMNs. G, D, D' and C represent the generator, discriminator, auxiliary classifier and classifier, respectively. $\mathcal{D}_t$ and $\mathcal{D}^{gen}_t$ are the training and generated datasets of task $t$.

4.1 Triple Memory Networks

Fig. 2 illustrates the overall architecture of our Triple Memory Networks (TMNs), which consist of a generator ($G$), a discriminator ($D$) and a classifier ($C$). Before diving into the implementation details, we provide the rationale behind the design of each module.

Because the generative module and the hippocampus share the function of reconstructing specific information of learned experience, we start by modeling the mechanisms of the hippocampus in $G$ to mitigate catastrophic forgetting in the generation process. The hippocampus accommodates new information without interfering with previous experience through neurogenesis and neural inhibition Deng et al. (2010): it continually generates new-born neurons to encode new experience and develops inhibitory neurons to inhibit irrelevant parts of the network. These neurons quickly mature and decrease in plasticity Deng et al. (2010). Thus, even when an incoming pattern is similar to a previous one, the two do not interfere with each other Rolls (2013). Hippocampal neurogenesis can be modeled as an extendable generative network, and neural inhibition is close to a binary attention mechanism Serrà et al. (2018). Since PFC inhibits the hippocampus to prevent the encoding of redundant information when the incoming experience matches a previously learned cortical memory Frankland and Bontempi (2005), we model the function of PFC as a discriminator and the hippocampus-PFC interaction as the adversarial training of a GAN. Such a combination has been proven effective for generating images in incrementally learned simple domains Ostapenko et al. (2019), although it performs much worse on complex datasets (e.g., ImageNet).

To address issue (b) above, we use a relatively independent classifier $C$, optimized only for classification rather than for two objectives, inspired by the organization principles of the neocortex. Generalized information (e.g., structured knowledge for discrimination and classification) is encoded in the neocortex but supported by different regions. In particular, the sensory cortex (SC) is the dedicated region that extracts features of sensory input, e.g., visual information Frankland and Bontempi (2005), and maintains generalized sensory information, analogous to the function of a classifier.

To address issue (a) above, we use a weight consolidation regularizer based on the empirical FIM of the classification process to regularize the divergence between training data and generated data during generative replay. This strategy is inspired by the strengthened synaptic connections in the neocortex during biological memory replay. During biological memory replay, the specific information encoded in the hippocampus is transferred into PFC and SC as generalized knowledge, corresponding to the knowledge of discrimination and classification in our framework. To consolidate the generalized knowledge, neocortical regions (e.g., PFC and SC) incrementally strengthen synaptic connections to stabilize the synapses connecting simultaneously activated neurons. Notably, our weight regularization process remains biologically plausible, because the empirical FIM is also a measurement of the synaptic connectivity of a network Achille et al. (2017).

Note that the closest GAN architecture to TMNs is Triple-GAN Li et al. (2017), which also applies an additional classifier. However, Triple-GAN was proposed to improve classification and class-conditional image generation in semi-supervised learning, so its classifier is trained to label unlabeled data. In contrast, TMNs are designed to alleviate catastrophic forgetting in continual learning, so the three networks are implemented with consolidation algorithms inspired by the brain memory system, as detailed below.

4.2 Implementation Details

We now present the implementation details of TMNs, starting with the interaction of $G$ with $D$ and $C$. In biological memory replay, SC and other cortical modules learn generalized information under the integration of PFC. During the training of task $t$, $G$ randomly generates data $x' = G(z, y')$ with a label $y'$ from the current task and all the learned tasks, using a random noise vector $z$: the sampling distribution of $y'$ is uniform over the learned classes and $z$ is Gaussian, $z \sim \mathcal{N}(0, I)$. After training each task $t$, $G$ generates the dataset $\mathcal{D}^{gen}_{1:t}$ to update $\tilde{\mathcal{D}}_{t+1}$. To model the interaction of PFC and SC, although both $D$ and $C$ receive the replay dataset, $D$ supervises $C$ to learn the generalized knowledge. Since $C$ learns to classify the data, $D$ should learn not only to identify fake data but also the class labels $\tilde{y}$.
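A minimal sketch of this sampling procedure is shown below; the conditional `generator(z, y)` call signature is an assumption of this illustration:

```python
import torch

@torch.no_grad()
def sample_replay(generator, num_learned_classes, batch_size, z_dim):
    """Sample labels y' uniformly over all learned classes and z ~ N(0, I),
    then generate replay data x' = G(z, y')."""
    y = torch.randint(0, num_learned_classes, (batch_size,))
    z = torch.randn(batch_size, z_dim)
    return generator(z, y), y
```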

A straightforward design is to use $D$ to model the real/fake discrimination and an auxiliary classifier $D'$ to learn the class labels. The optimization problem then involves four groups of parameters in three networks: $\{\theta_G, \theta_D, \theta_{D'}, \theta_C\}$. To stabilize the training process, the loss function of the discriminator follows WGAN-GP Gulrajani et al. (2017). The loss function of the auxiliary classifier consists of a cross entropy term and an elastic weight consolidation (EWC) regularizer based on the empirical FIM $F^{D'}$:

$L_{D'} = L_{CE}^{D'} + \frac{\lambda}{2} \sum_i F_i^{D'} \left(\theta_{D',i} - \theta_{D',i}^{*}\right)^2$    (1)

where $\theta_{D'}^{*}$ denotes the parameters learned after the previous tasks, and the cross entropy is calculated from the classification results of the auxiliary classifier and the true labels, on the training data of the current task and the generated data of the previous tasks:

$L_{CE}^{D'} = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_t}\left[\log D'(y \,|\, x)\right] - \mathbb{E}_{(x', y') \sim \mathcal{D}^{gen}_{1:t-1}}\left[\log D'(y' \,|\, x')\right]$    (2)

For notational clarity, we write $D'$ without explicitly writing out its parameters $\theta_{D'}$, and likewise for $G$, $D$ and $C$.

Because of the deviation between the generated data and the training data of the previous tasks, minimizing $L_{CE}^{D'}$ alone cannot drive $D'$ toward the true distribution. In theory, the gap can be filled by the regularization term, since $F^{D'}$ is directly calculated from the training data. In the class-incremental setup, the output layer expands by one dimension for each incremental class, which changes the shape of its weight matrix. The regularization term therefore calculates and protects only the parameters of the other layers, i.e. the network shared by $D$ and $D'$. The loss function of the discriminator is:

$L_{D} = \mathbb{E}_{x' \sim p_G}\left[D(x')\right] - \mathbb{E}_{x \sim \mathcal{D}_t}\left[D(x)\right] + \lambda_{gp}\,\mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]$    (3)
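To make Eqs. (1)-(2) concrete, here is a minimal PyTorch sketch of the auxiliary classifier's objective with the EWC penalty restricted to the shared layers; the `output_layer` name filter and all helper names are our own illustrative conventions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def aux_classifier_loss(model, x_real, y_real, x_gen, y_gen,
                        fisher, old_params, lam, output_layer="fc_out"):
    """Cross entropy on current training data and replayed generated data
    (Eq. 2), plus an EWC penalty on the shared layers only (Eq. 1)."""
    ce = F.cross_entropy(model(x_real), y_real) + F.cross_entropy(model(x_gen), y_gen)
    reg = torch.zeros(())
    for name, p in model.named_parameters():
        if name.startswith(output_layer):  # output layer grows per class; not consolidated
            continue
        reg = reg + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return ce + 0.5 * lam * reg
```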

The loss function of the classifier in (4) includes a cross entropy term and an EWC regularization term. The cross entropy term minimizes the difference between $C$ and $D'$ on the replay dataset $\tilde{\mathcal{D}}_t$ to transfer knowledge from $D'$ to $C$, since the regularization term in (1) has already penalized the gap between the generated and true distributions. Similar to $D'$, the cross entropy and FIM on previous tasks are calculated from generated data, and the gap caused by imperfect generation can be filled by the weight consolidation regularizer. The Fisher information of $C$ is not calculated directly from its loss function $L_C$ but from the augmented loss $\tilde{L}_C$ in (6), which includes an additional cross entropy between the classification results and the ground-truth labels of the training data to minimize the gap between the two distributions.

$L_{C} = L_{CE}^{C} + \frac{\lambda}{2} \sum_i F_i^{C} \left(\theta_{C,i} - \theta_{C,i}^{*}\right)^2$    (4)

$L_{CE}^{C} = -\,\mathbb{E}_{\tilde{x} \sim \tilde{\mathcal{D}}_t}\left[\log C\big(\hat{y}_{D'}(\tilde{x}) \,|\, \tilde{x}\big)\right], \quad \hat{y}_{D'}(\tilde{x}) = \arg\max_{y} D'(y \,|\, \tilde{x})$    (5)

$\tilde{L}_{C} = L_{C} - \mathbb{E}_{(x, y) \sim \mathcal{D}_t}\left[\log C(y \,|\, x)\right]$    (6)
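One plausible implementation of the classifier objective in Eqs. (4)-(5) is sketched below; using hard pseudo-labels from $D'$ on replayed data, with ground-truth labels kept on the current task, is our reading of the knowledge-transfer step, not a verified detail of the authors' code:

```python
import torch
import torch.nn.functional as F

def classifier_loss(classifier, aux_classifier, x_real, y_real, x_gen,
                    fisher, old_params, lam):
    """C fits ground-truth labels on the current task and labels predicted by
    D' on replayed data (cf. Eqs. 4-5), plus a weight consolidation term."""
    ce_real = F.cross_entropy(classifier(x_real), y_real)
    with torch.no_grad():                      # D' supervises C on the replay data
        y_pseudo = aux_classifier(x_gen).argmax(dim=1)
    ce_gen = F.cross_entropy(classifier(x_gen), y_pseudo)
    reg = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
              for n, p in classifier.named_parameters() if n in fisher)
    return ce_real + ce_gen + 0.5 * lam * reg
```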

The conditional generator uses an extendable network and hard attention masks to model the hippocampus. We apply a conditional generator architecture similar to DGMw Ostapenko et al. (2019): the attention weight of layer $l$ at task $t$ is $a_l^t = \sigma(s\, e_l^t)$, where $s$ is a positive scaling factor, $e_l^t$ is a mask embedding matrix and $\sigma(\cdot)$ is the sigmoid function. To prevent interference with the learned generation tasks, the gradient $g_l$ of each layer is multiplied by the reverse of the cumulative attention mask $a_l^{\le t-1}$ of the learned tasks:

$g_l' = \left(1 - a_l^{\le t-1}\right) g_l, \qquad a_l^{\le t-1} = \max\left(a_l^1, \dots, a_l^{t-1}\right)$    (7)

The loss function of the generator likewise follows the requirements of WGAN-GP, and $R_s$ is a sparsity regularizer on the attention masks, where $N_l$ is the number of parameters of layer $l$:

$L_{G} = -\,\mathbb{E}_{z \sim \mathcal{N}(0, I),\, y'}\left[D\big(G(z, y')\big)\right] + \lambda_s R_s$    (8)

$R_s = \frac{\sum_l \|a_l^t\|_1}{\sum_l N_l}$    (9)
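A minimal sketch of the gradient masking in Eq. (7) follows; real implementations (HAT, DGMw) keep per-unit masks that are broadcast to parameter shapes, whereas this simplified version stores one mask tensor per parameter:

```python
import torch

def mask_generator_gradients(generator, cumulative_masks):
    """Eq. (7): multiply each layer's gradient by (1 - cumulative attention),
    freezing the parameters claimed by previous tasks."""
    for name, p in generator.named_parameters():
        if name in cumulative_masks and p.grad is not None:
            p.grad.mul_(1.0 - cumulative_masks[name])

def accumulate_masks(cumulative_masks, task_masks):
    """Update the cumulative mask after task t: a^{<=t} = max(a^{<=t-1}, a^t)."""
    for name, a_t in task_masks.items():
        prev = cumulative_masks.get(name, torch.zeros_like(a_t))
        cumulative_masks[name] = torch.maximum(prev, a_t)
    return cumulative_masks
```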

Because both $D'$ and $C$ can make predictions, a decision-making process is required to decide which classifier to use. Here, we adopt a simple yet effective decision-making rule. Specifically, $D'$ and $C$ first estimate the probabilities $p_{D'}(y_k \,|\, x)$ and $p_C(y_k \,|\, x)$ that an input feature vector $x$ belongs to class $y_k$ out of all $K$ classes, where $k = 1, \dots, K$. The final prediction is then made by the network with the highest confidence:

$\hat{y} = \arg\max_{y_k} \max\left\{p_{D'}(y_k \,|\, x),\ p_C(y_k \,|\, x)\right\}$    (10)
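Eq. (10) simply lets whichever network is more confident make the final call; a minimal sketch:

```python
import torch

@torch.no_grad()
def tmn_predict(classifier, aux_classifier, x):
    """Eq. (10): the final prediction comes from whichever of C and D'
    assigns the highest class probability."""
    p_c = torch.softmax(classifier(x), dim=1)
    p_d = torch.softmax(aux_classifier(x), dim=1)
    return torch.maximum(p_c, p_d).argmax(dim=1)
```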

5 Experiment

5.1 Experiment Setup

Our framework is evaluated following the class-incremental setup on four benchmark datasets: MNIST LeCun (1998), SVHN Netzer et al. (2011), CIFAR-10 Krizhevsky et al. (2009) and ImageNet Russakovsky et al. (2015). The evaluation measure is the averaged accuracy on the test sets of all the classes trained so far.

Datasets: MNIST includes 50,000 training samples, 10,000 validation samples and 10,000 testing samples of black-and-white handwritten digits of size 28×28. SVHN includes 73,257 training samples and 26,032 testing samples, each a colored digit photographed in various environments, of size 32×32. CIFAR-10 contains 50,000 training samples and 10,000 testing samples of 10-class colored images of size 32×32. The iILSVRC-2012 dataset contains 1000 classes of images with 1300 samples per class. We randomly choose 50 classes of iILSVRC-2012 as a subset, ImageNet-50, and resize all images to 32×32 before the experiment. The 50 classes of ImageNet-50 are trained in incremental batches of 10. We use top-1 and top-5 accuracy as the evaluation measures for ImageNet-50 on the validation part of iILSVRC-2012. The results shown in Tables 1, 2 and 5 use top-1 accuracy. All experimental results are averaged over 10 runs.

Architecture: We apply a 3-layer DCGAN architecture Radford et al. (2015) for the MNIST, SVHN and CIFAR-10 experiments, and a ResNet-18 architecture for the ImageNet-50 experiment. The discriminator and the classifier use similar architectures except for the output layer. The generator applies an extendable network with hard attention masks, similar to Ostapenko et al. (2019).

Baselines: We primarily compare with continual learning methods that follow the strict setup, i.e. without storing training samples. In particular, because our method is essentially a generative memory replay approach, we compare with other methods based on the same idea. To the best of our knowledge, DGMw Ostapenko et al. (2019) achieves state-of-the-art (SOTA) performance in class-incremental learning on most benchmark datasets, followed by MeRGAN Wu et al. (2018), DGR Shin et al. (2017) and EWC-M Seff et al. (2017). We compare our results directly with DGMw under the same architecture and hyperparameters for a fair comparison. In the ImageNet-50 experiment, we also compare with iCaRL Rebuffi et al. (2017) and EEIL Castro et al. (2018), the SOTA methods for incrementally learning complex domains, which however have to store training samples. Since the relaxed DGMw (DGMw-R) outperforms iCaRL when given access to training samples, we also compare our method with DGMw-R. iCaRL, EEIL and DGMw-R are allowed to keep (total number of classes in the dataset) × (20 training samples per class) real samples. All experiments use joint training as the upper-bound performance.

To examine the idea of weight consolidation, we use the SVHN benchmark to evaluate our system implemented with EWC or with SI Zenke et al. (2017), another method that incrementally stabilizes important parameters. SI uses an additional quadratic surrogate loss to replace the EWC term in the loss functions. To calculate the synaptic relevance in SI, we use the same loss functions for the classifier and the discriminator as with EWC.

5.2 Comparison with SOTA Methods

The results of the comparison with other methods are summarized in Table 1, with joint training as the upper-bound performance. We compare the averaged accuracy after 5 classes ($A_5$) and 10 classes ($A_{10}$) on MNIST, SVHN and CIFAR-10, and after 30 classes ($A_{30}$) and 50 classes ($A_{50}$) on ImageNet-50. Our method outperforms the SOTA methods on SVHN and achieves comparable results on MNIST. Our method also achieves SOTA performance on CIFAR-10; in particular, our results in Table 2 show that our method outperforms the SOTA method more significantly on both $A_5$ and $A_{10}$. The difference is possibly caused by network architectures different from those of Ostapenko et al. (2019). Our approach significantly outperforms DGMw on ImageNet-50 (Fig. 3). When given access to the real training samples, iCaRL, EEIL and DGMw-R are the SOTA methods on ImageNet. TMNs outperform these three methods or achieve comparable results on the ImageNet-50 benchmark, although they store no training samples.

                                    MNIST           SVHN            CIFAR-10        ImageNet-50
Methods                             A5      A10     A5      A10     A5      A10     A30     A50
Joint Training                      99.87   99.24   92.99   88.72   83.40   77.82   57.35   49.88
With training data:
  EEIL (Castro et al., 2018)        -       -       -       -       -       -       27.87   11.80
  iCaRL (Rebuffi et al., 2017)      84.61   55.80   -       -       57.30   43.69   29.38   28.98
  DGMw-R (Ostapenko et al., 2019)   -       -       -       -       -       -       36.87   18.84
Without training data:
  EWC-M (Seff et al., 2017)         70.62   77.03   39.84   33.02   -       -       -       -
  DGR (Shin et al., 2017)           90.39   85.40   61.29   47.28   -       -       -       -
  MeRGAN (Wu et al., 2018)          98.19   97.00   80.90   66.78   -       -       -       -
  DGMw (Ostapenko et al., 2019)     98.75   96.46   83.93   74.38   72.45   56.21   32.14   17.82
  TMNs (ours)                       98.80   96.72   87.12   77.08   72.72   61.24   38.23   28.08
Table 1: Averaged accuracy (%) of class-incremental learning on the image benchmarks. The results of the baselines are cited from Wu et al. (2018); Chaudhry et al. (2018); Ostapenko et al. (2019).
                                 SVHN                          CIFAR-10                      ImageNet-50
Methods                          A5            A10             A5            A10             A30           A50
DGMw* (Ostapenko et al., 2019)   84.82 (0.30)  73.27 (0.35)    68.85 (0.25)  54.40 (0.65)    30.34 (0.63)  17.84 (0.44)
TMNs (w/o EWC)                   84.99 (0.36)  73.81 (0.36)    69.18 (0.49)  55.70 (0.48)    32.05 (0.63)  18.18 (0.42)
TMNs (D'+EWC)                    86.56 (0.35)  75.28 (0.42)    71.16 (0.36)  59.80 (0.21)    36.05 (0.89)  25.36 (0.59)
TMNs (C+EWC)                     86.34 (0.31)  75.45 (0.26)    70.33 (0.22)  57.32 (0.44)    34.21 (0.60)  19.55 (0.49)
TMNs (C, D'+EWC)                 87.12 (0.22)  77.08 (0.26)    72.72 (0.36)  61.24 (0.14)    38.23 (0.75)  28.08 (0.33)
TMNs (C+SI)                      86.29 (0.35)  75.40 (0.34)    -             -               -             -
TMNs (C, D'+SI)                  86.41 (0.27)  75.58 (0.20)    -             -               -             -
Table 2: Averaged accuracy (%) (±SEM) on SVHN, CIFAR-10 and ImageNet-50, averaged over ten runs. *The performance of DGMw is from our re-run.

5.3 Effectiveness of Weight Consolidation

Next, three pieces of evidence support the effectiveness of the weight consolidation algorithms in the framework. The first is the parallel experiment with SI and EWC, two comparable methods for approximating synaptic relevance Parisi et al. (2019). We implement SI into our system in the same way as EWC and compare the performance of the two variants on the SVHN dataset (Table 2). Implementing EWC or SI in only the classifier, or in both the classifier and the discriminator, results in a similar improvement of the averaged accuracy.

The second is the comparison in Table 2 between the model with EWC on both $C$ and $D'$, on only one of the two networks, or without EWC. In all experiments, EWC implemented in both $C$ and $D'$ outperforms EWC in a single network and TMNs w/o EWC, and TMNs with EWC in a single network also outperform TMNs w/o EWC. In particular, TMNs with EWC on the two classifiers significantly outperform TMNs w/o EWC on ImageNet-50 (Fig. 3).

Figure 3: Averaged top-1 and top-5 accuracy of class-incremental learning on ImageNet.
                        block 3 (four λ settings)              block 4 (four λ settings)
Parameter
0.layers.0.weight       0.8294  0.8089  0.9214  0.9495        0.4571  0.8673  0.9209  0.8704
0.layers.1.weight       0.8350  0.9371  0.9653  0.9943        0.6411  0.8460  0.9120  0.9899
0.layers.3.weight       0.6767  0.8110  0.9184  0.9519        0.6364  0.7749  0.8928  0.8309
0.layers.4.weight       0.6039  0.8101  0.9541  0.9501        0.6324  0.8549  0.8909  0.9059
0.shortcut.0.weight     0.7766  0.8127  0.9199  0.8144        0.6451  0.7602  0.6837  0.8510
0.shortcut.1.weight     0.8230  0.8953  0.9110  0.9626        0.5553  0.8444  0.8581  0.9302
1.layers.0.weight       0.7934  0.8500  0.9478  0.9334        0.4826  0.8008  0.8754  0.9241
1.layers.1.weight       0.8150  0.9840  0.8992  0.9988        0.5160  0.9152  0.9172  0.9693
1.layers.3.weight       0.7361  0.9333  0.9083  0.9608        0.2159  0.8247  0.8331  0.8629
1.layers.4.weight       0.6755  0.8904  0.9270  0.9828        0.2774  0.8747  0.9408  0.9776
Table 3: Averaged cosine similarity of the empirical FIMs of the last two blocks of ResNet-18 in the 10-class ImageNet task. We train the classifier first with training data and then with generated data of the same task, under different strengths of weight consolidation ($\lambda$).

Thirdly, we measure the empirical FIM, i.e. the squared gradients of the parameters, of a classifier trained on training data and then on generated data of the same task. A higher strength (larger $\lambda$) of weight consolidation increases the similarity of the FIMs computed on training data and on generated data during generative replay (Table 3). The directions of the FIM on the generated data and on the training data are much closer under a higher strength of weight consolidation, which regularizes the optimization of the parameters on generated data toward a direction closer to that on the training data.

                    MNIST                   SVHN
Parameter           Cosine   Correlation    Cosine   Correlation
conv1.weight        0.6586   0.2285         0.7000   -0.0196
conv2.weight        0.3996   0.3566         0.2031   0.0806
conv3.weight        0.1625   0.1486         0.0783   -0.0061
BatchNorm2.weight   0.5078   0.0250         0.4378   -0.0869
BatchNorm3.weight   0.5576   -0.0251        0.4788   -0.0747
Table 4: Averaged similarity of the empirical FIM in AC-GAN. We calculate the cosine similarity (Cosine) and correlation coefficient (Correlation) after 10-class incremental learning on the MNIST or SVHN dataset.

5.4 Effectiveness of Triple-Network Architecture

A key difference between TMNs and many single-head generative replay methods is the relatively independent classifier. Since $D'$ can also make predictions, we use the simple decision-making rule in Eq. (10) for the final prediction. We now examine the necessity of the additional classifier and the decision-making process. One piece of evidence has already been mentioned: our preliminary experiment (Table 4) shows divergent directions of the FIMs of the discriminator and the auxiliary classifier in the shared network of the AC-GAN architecture, indicating that they interfere with each other.

We also quantify the classification results of the individual classifiers and of the final prediction after decision-making in Table 5. In all the experiments above, the averaged accuracy of the final prediction is always higher than that of the individual $D'$ and $C$. Moreover, TMNs w/o EWC in Table 2 and Fig. 3 outperform DGMw, which uses the same form of conditional generator but on an AC-GAN architecture. Notably, the first data point in Fig. 3 is the averaged accuracy on the first incremental batch of the ImageNet-50 experiment. The first incremental batch uses only training data rather than generated data, and involves no weight regularization. Both the top-1 and top-5 accuracies of TMNs (69.53, 96.32) significantly outperform DGMw (63.34, 93.75). Thus, the triple-network architecture further alleviates catastrophic forgetting.

D’ C Output
SVHN 86.34 (0.19) 86.35 (0.29) 87.12 (0.22)
SVHN 74.81(0.32) 75.12(0.17) 77.08(0.26)
CIFAR-10 70.40(0.29) 69.36(0.45) 72.72(0.36)
CIFAR-10 59.96(0.23) 53.11(0.47) 61.24(0.14)
ImageNet-50 36.46(0.84) 36.65(0.81) 38.23(0.75)
ImageNet-50 26.90(0.50) 24.33(0.48) 28.08(0.33)
Table 5: Averaged accuracy (%) (SEM) of individual network and final output of TMNs.

6 Conclusions

In this work, we analyze how the brain memory system encodes, consolidates and complements specific and generalized information to successfully overcome catastrophic forgetting. Inspired by the organization principles of the brain memory system, we apply a triple-network GAN architecture to model the interplay of the hippocampus, prefrontal cortex and sensory cortex. Inspired by the neural mechanisms that consolidate specific or generalized information, we implement the three modules with appropriate brain-inspired algorithms to develop the "artificial memory networks". The triple-network architecture includes a classifier network that shows its advantages in classification; this module can be replaced by a corresponding task-solver network with weight regularization methods to cope with other tasks. Further work will focus on modeling synaptic plasticity more accurately and extending the framework to other continual learning scenarios.

References

  • A. Achille, M. Rovere, and S. Soatto (2017) Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856.
  • R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars (2018) Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154.
  • F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248.
  • A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547.
  • W. Deng, J. B. Aimone, and F. H. Gage (2010) New neurons and new memories: how does adult hippocampal neurogenesis affect learning and memory? Nature Reviews Neuroscience 11 (5), pp. 339.
  • P. W. Frankland and B. Bontempi (2005) The organization of recent and remote memories. Nature Reviews Neuroscience 6 (2), pp. 119.
  • J. T. Gonçalves, S. T. Schafer, and F. H. Gage (2016) Adult neurogenesis in the hippocampus: from stem cells to behavior. Cell 167 (4), pp. 897–914.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
  • D. O. Hebb (1962) The organization of behavior: a neuropsychological theory. Science Editions.
  • R. Kemker and C. Kanan (2017) FearNet: brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563.
  • R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan (2018) Measuring catastrophic forgetting in neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • C. Li, T. Xu, J. Zhu, and B. Zhang (2017) Triple generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 4088–4098.
  • J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102 (3), pp. 419.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning.
  • A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651.
  • O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11321–11329.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010.
  • E. Rolls (2013) The mechanisms for pattern completion and pattern separation in the hippocampus. Frontiers in Systems Neuroscience 7, pp. 74.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671.
  • A. Seff, A. Beatson, D. Suo, and H. Liu (2017) Continual learning in generative adversarial nets. arXiv preprint arXiv:1705.08395.
  • J. Serrà, D. Surís, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999.
  • G. M. van de Ven and A. S. Tolias (2019) Three scenarios for continual learning. arXiv preprint arXiv:1904.07734.
  • L. Wiskott, M. J. Rasch, and G. Kempermann (2006) A functional hypothesis for adult hippocampal neurogenesis: avoidance of catastrophic interference in the dentate gyrus. Hippocampus 16 (3), pp. 329–343.
  • C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. (2018) Memory replay GANs: learning to generate new categories without forgetting. In Advances in Neural Information Processing Systems, pp. 5962–5972.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3987–3995.