Dissecting Catastrophic Forgetting in Continual Learning by Deep Visualization

01/06/2020 ∙ by Giang Nguyen, et al. ∙ 15

Interpreting the behavior of Deep Neural Networks (usually considered black boxes) is critical, especially now that they are widely adopted across diverse aspects of human life. Building on advances in Explainable Artificial Intelligence, this paper proposes a novel technique called Auto DeepVis to dissect catastrophic forgetting in continual learning. A new method to deal with catastrophic forgetting, named critical freezing, is also introduced based on the investigation by Auto DeepVis. Experiments on a captioning model show in detail how catastrophic forgetting happens, in particular which components are forgetting or changing. The effectiveness of our technique is then assessed: critical freezing achieves the best performance on both previous and incoming tasks over the baselines, supporting the value of the investigation. Our techniques are explainable and could complement existing solutions toward eradicating catastrophic forgetting in life-long learning.

Code Repositories

dissect_catastrophic_forgetting


1 Introduction

In human evolution, life-long learning has been considered one of the most crucial abilities, helping us develop more complicated skills throughout our lifetime. The idea of this learning strategy is hence deployed extensively in the deep learning community. Life-long learning (or continual learning) enables machine learning models to acquire new knowledge while exhibiting backward-forward transfer, non-forgetting, or few-shot learning (Ling and Bohn, 2019). While these properties are the ultimate goals for life-long learning systems, catastrophic forgetting or semantic drift naturally occurs in deep neural networks (DNNs) because they are mostly updated by gradient descent (Goodfellow et al., 2013).

Many attempts have succeeded in addressing the forgetting problem in generative models (Zhai et al., 2019), object detection (Shmelkov et al., 2017), semantic segmentation (Tasar et al., 2019), and captioning (Nguyen et al., 2019b). However, these algorithms tend to rely on external factors (e.g., input data, network structure) while ignoring why catastrophic forgetting happens internally. A clear picture of the forgetting process, and of how it affects models, would be one step towards learning without forgetting.

Contemporary interpretability methods help us understand the decision-making process of deep neural networks, ranging from visualizing saliency maps (Simonyan et al., 2013; Dabkowski and Gal, 2017) to transforming models into human-friendly structures (Che et al., 2016). Interpreting the activation of a neuron or a layer in a network helps us categorize the specific role of each block, layer, or even node. It has been shown that the earlier layers extract basic features, such as edges or colors, while deeper layers are responsible for detecting distinctive characteristics. Prediction Difference Analysis (PDA) (Zintgraf et al., 2017), even more specifically, highlights pixels that support or counteract a certain class, indicating which features are positive or negative for a prediction.

Although catastrophic forgetting is tough and undesirable, research that tries to understand the problem is rare in the AI community. Interest in understanding or measuring catastrophic forgetting has not kept pace with the volume of research on mitigating it. (Kemker et al., 2018) develop new metrics to help compare continual learning techniques fairly and directly. (Nguyen et al., 2019a) study which properties make the learning process hard. By modeling the chosen properties using task space, they can estimate how much a model forgets in a sequential learning scenario, shedding light on the factors affecting the error rate on a task sequence. However, they cannot explain what is being forgotten or which components inside the model are forgetting; they only show which task properties trigger catastrophic forgetting. In comparison, our work studies which components of a network are most likely to change for a given sequence of tasks.

This research introduces a novel approach to dissect catastrophic forgetting by visualizing hidden layers in class-incremental learning (considered the hardest scenario in continual learning). In this learning paradigm, rehearsal strategies using previous data are prohibited and samples of the incoming tasks have not been seen before. We propose a tool named Auto DeepVis, which leverages the prediction difference analysis of (Zintgraf et al., 2017). This tool automates the dissection of catastrophic forgetting, pointing out exactly which components in a model are causing the forgetting. Using Intersection over Union (IoU), the degree of forgetting is measured after each class is added, giving us an intuition of how forgetting happens in a given part of the network.

In the first step, our tool observes the evidence for and against a prediction. If the evidence for a given image changes, we argue that the model is forgetting where it needs to look. The IoU between the previous and current evidence determines the degree of forgetting. We automate this procedure on representative samples of the trained classes, followed by a generalization step to identify the main culprits of forgetting. The procedure is outlined in Algorithm 1.

In a thorough analysis of the model, it is necessary to pick the appropriate components in a block to visualize the changes. For instance, ResNet-50 (He et al., 2016) has 5 convolutional blocks, so interpreting the activation of all filters in all 5 blocks would take a large amount of time. We propose to choose the filter having the highest IoU value with the ground truth segmentation. This choice is a reasonable heuristic: some filters may be suppressed by the activation function, and ranking the importance of the remaining filters within a convolutional block is infeasible. As a result, the filter with the largest IoU serves as the representative of a given block.

(Kemker et al., 2018) conduct experiments on state-of-the-art continual learning techniques that address catastrophic forgetting. They demonstrate that the algorithms work only under weak constraints and against unfair baselines, so the forgetting problem is not yet fully solved. They also argue that toy datasets such as MNIST (Kirkpatrick et al., 2017) or CIFAR (Zenke et al., 2017) are inadequate for continual learning. Consequently, we choose Split MS-COCO (Nguyen et al., 2019b) to measure forgetting in deep neural networks. Based on the results of the dissection, we simply freeze the most plastic components in the network to protect the accumulated information. In addition to visualizing the Convolutional Neural Network (CNN), we also briefly study how the decoder, which consists of a Recurrent Neural Network (RNN), changes in continual learning.

We summarize the contributions of this work as follows. First, we propose a novel method to analyze catastrophic forgetting in continual learning: Auto DeepVis automatically points out the forgetting components in a network while it learns a task sequence. Second, we introduce a new approach to mitigate catastrophic forgetting based on these findings. Our techniques could play a complementary role in the effort to eradicate catastrophic forgetting.

2 Related Work

Feature Visualization

To understand how a model forgets the features learned before, we need to visualize how the model processes the input image. To do this, the layer-wise visualization of the model is performed. The old model (or original model) is the model we have obtained so far, denoting acquired knowledge on past tasks. On the other hand, the new model (or current model) is the network facing the new incoming tasks and should avoid catastrophic forgetting.

Several works visualize convolutional neural networks by various means. (Zeiler and Fergus, 2014) make use of multi-layered deconvolutional networks (deconvnet), using switches to record the local max value of the original input images. Deconvnet allows us to recognize which features are expected by a specific part of a network, or which image properties excite a chosen neuron the most. Instead of feeding an input image to diagnose, (Yosinski et al., 2015) generate a synthetic image that maximizes the activation of a given neuron by gradient descent.

Another broadly used approach is saliency maps (Simonyan et al., 2013), a gradient-based technique that measures how sensitively the classification score changes in response to small changes in pixels. This work can be seen as a generalized version of deconvolution layers. A similar approach has been proposed by (Robnik-Šikonja and Kononenko, 2008), which completely removes input pixels instead of calculating the gradient of those pixels. A variant of this work, called PDA, is proposed by (Zintgraf et al., 2017). They utilize conditional sampling that removes patches of connected pixels instead of one single pixel at a time, which eventually gives better visualization results. However, this tool only generates the computer vision (what the model sees), leaving the conclusion to users. Such a manual process cannot ensure the quality of the observation when there are hundreds or even thousands of feature maps in a convolutional block. In this work, PDA is adopted to build an automatic tool for visualizing catastrophic forgetting.
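To make the patch-removal idea concrete, here is a minimal sketch in plain Python. It is not the PDA implementation of (Zintgraf et al., 2017): it swaps conditional sampling for a constant fill value and uses a plain probability difference instead of the weight of evidence. The names `occlusion_relevance`, `predict`, `window`, and `fill` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_relevance(image: Image,
                        predict: Callable[[Image], float],
                        window: int = 2,
                        fill: float = 0.0) -> Image:
    """Average drop in predict() over all windows that cover each pixel.

    Positive values are evidence for the prediction, negative values are
    evidence against it. Assumes window <= min(height, width).
    """
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    base = predict(image)
    for top in range(height - window + 1):
        for left in range(width - window + 1):
            # Copy the image and blank out one window (simplified assumption:
            # a constant fill instead of conditional sampling).
            occluded = [row[:] for row in image]
            for r in range(top, top + window):
                for c in range(left, left + window):
                    occluded[r][c] = fill
            diff = base - predict(occluded)
            for r in range(top, top + window):
                for c in range(left, left + window):
                    total[r][c] += diff
                    count[r][c] += 1
    return [[total[r][c] / count[r][c] for c in range(width)]
            for r in range(height)]


def toy_predict(img: Image) -> float:
    # Toy score: the top-left pixel supports the class, the bottom-right opposes it.
    return (img[0][0] - img[2][2] + 1.0) / 2.0


image = [[1.0] * 3 for _ in range(3)]
relevance = occlusion_relevance(image, toy_predict, window=1)
# relevance[0][0] is 0.5 (evidence for), relevance[2][2] is -0.5 (evidence against)
```

A real model would replace `toy_predict` with the class probability of a trained network, and larger windows trade spatial detail for smoother maps.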

Catastrophic Forgetting

Catastrophic forgetting was first introduced by (McCloskey and Cohen, 1989) in connectionist networks. More recently, (Toneva et al., 2018) reframe forgetting as the event in which the prediction of a model on a sample shifts from correct to incorrect during the learning process. Looking deeper into the model, we can argue that the values of neurons have changed drastically, possibly resulting in a different answer for the same input image.
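The correct-to-incorrect definition of forgetting can be illustrated with a small counter. This is a sketch under the assumption that a sample is evaluated repeatedly during training and each evaluation is recorded as a boolean; the function name `count_forgetting_events` is hypothetical.

```python
from typing import Sequence


def count_forgetting_events(correct: Sequence[bool]) -> int:
    """Count transitions from a correct to an incorrect prediction."""
    return sum(1 for prev, curr in zip(correct, correct[1:])
               if prev and not curr)


# One sample evaluated five times during training.
history = [True, True, False, True, False]
events = count_forgetting_events(history)  # two correct-to-incorrect transitions
```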

Elastic Weight Consolidation, EWC (Kirkpatrick et al., 2017), uses the Fisher information matrix to help the model mimic the synaptic consolidation mechanism of the human brain. Parameters important for performance on the old task are protected while the others are updated to minimize the loss on the new dataset. By comparison, the significance of each synapse (or weight) in the neural network is computed locally in (Zenke et al., 2017). When distribution interference appears, the synaptic states track the importance of synapses through an online estimate, and crucial synapses are prevented from changing. Learning without Forgetting, LwF (Li and Hoiem, 2017), generates pseudo labels on incoming data to help capture the previous distribution. In training, knowledge distillation (Hinton et al., 2015) turns the pseudo labels into soft targets and a warm-up step is applied. Based on Bayesian neural networks, (Lee et al., 2017) match the moments of the posteriors of the two tasks to guide the new network to a common low-error region. Knowledge distillation from the old model and an expert feature extractor is conducted in (Hou et al., 2018), complemented by a retrospection on a small fraction of old data. (Rusu et al., 2016) dynamically expand the network with specialized sub-networks to absorb new knowledge while the old modules are frozen.
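For reference, the EWC objective mentioned above is usually written as the new-task loss plus a quadratic penalty weighted by the diagonal Fisher information. The symbols below follow the common presentation of (Kirkpatrick et al., 2017), with A the old task and B the new task; they are not notation used elsewhere in this paper.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{B}(\theta)
  + \sum_{i} \frac{\lambda}{2} \, F_{i} \left(\theta_{i} - \theta^{*}_{A,i}\right)^{2}
```

Here $\theta^{*}_{A}$ are the parameters after training on the old task, $F_{i}$ is the $i$-th diagonal entry of the Fisher information matrix, and $\lambda$ controls how strongly the old task is protected.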

Catastrophic forgetting is generally addressed by regularization, data replay, or altering the network architecture. In this research, we show that simply freezing the fragile blocks of a network can significantly improve the generalization performance.

Figure 1: (a) IoU value between the segmentation of a train and the positive features. (b) The IoU of two representative maps. Red is evidence and blue is against.

3 Methodology

Although (Nguyen et al., 2019a) show an interest in understanding catastrophic forgetting, they focus on how task properties influence the hardness of sequential learning. Hence, their explanation is based on the input data. In comparison, our work attempts to explain how forgetting happens over time based on the computer vision of models, i.e. what the models see. (Zintgraf et al., 2017) present the response of a network to a given image, from which we can identify which features support or counteract the prediction. We leverage this tool for visualizing the computer vision, but extend it to an automatic version, Auto DeepVis, to dissect the forgetting dilemma efficiently. The ultimate goal of this tool is to find the most plastic layers or blocks in a network. Plasticity means a low degree of stiffness, or being easy to change. Moreover, continual learning techniques have been proposed to alleviate catastrophic forgetting, but none of them are devised based on findings from Explainable AI. Critical freezing is built on top of the investigation by Auto DeepVis to provide an interpretable yet effective approach to life-long learning for deep learning models. In the learning process, the stable state of the old model is used to initialize the new network, which mimics the working mechanism of the human brain.

Figure 2: Overall scheme of Auto DeepVis

3.1 Auto DeepVis

To dissect the model, we visualize the hidden layers to understand the forgetting effect inside a model after it has been trained on different tasks. (Zintgraf et al., 2017) claim that different features of the objects being dissected can be captured and visualized by particular feature maps in different layers. By looking into every response map in one convolutional block, we find that diverse features, such as eyes, face shape, car wheels, or background, are recognized in isolation by different channels. Unfortunately, inspecting each feature map manually by eye is inefficient. To solve this issue, we only compare the computer vision with the ground truth segmentation rather than with small details. In general, we do not ask which features are being forgotten, but which layers are forgetting.

Assuming that the semantic segmentation label of the MS-COCO dataset (Lin et al., 2014) is what a human sees, we compare this segmentation with the computer vision of the model, concentrating on the positive evidence for a prediction. The IoU value between the segmentation and the evidence is calculated as shown in Fig. 1 (a). The red dots in the map mark the positive evidence while the blue points mark the evidence against. The scheme of Auto DeepVis is illustrated in Fig. 2.
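The comparison between positive evidence and a segmentation mask can be sketched in a few lines of plain Python. Thresholding the relevance map at zero to obtain the evidence mask, and returning 0.0 when both masks are empty, are assumptions made for this illustration; `evidence_mask` and `iou` are hypothetical helper names.

```python
from typing import List, Sequence


def evidence_mask(relevance: Sequence[Sequence[float]],
                  threshold: float = 0.0) -> List[List[bool]]:
    """Keep only pixels whose relevance is positive (evidence for the class)."""
    return [[value > threshold for value in row] for row in relevance]


def iou(mask_a: Sequence[Sequence[int]],
        mask_b: Sequence[Sequence[int]]) -> float:
    """Intersection over Union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += bool(a and b)
            union += bool(a or b)
    return intersection / union if union else 0.0


relevance = [[0.4, -0.2],
             [0.1, 0.0]]
ground_truth = [[1, 1],
                [0, 0]]
score = iou(evidence_mask(relevance), ground_truth)  # 1 shared pixel out of 3
```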

Given the m-th feature map (FM) in the l-th layer of a model M and the ground truth segmentation GT, the IoU is computed as:

(1)

To select the feature map with the largest overlap with the ground truth in each convolutional block, the representative feature map (RM) with the best IoU is defined as:

(2)

To understand how the computer vision changes over the training process, we compare the RM in the new model with the RM of the original network (Fig. 1 (b)). The forgetting effect of each trained model is measured by the IoU between the original model and the new model:

(3)

As with finding the map that best fits the ground truth, the feature map that best preserves the memory of the original feature map is defined in Equation 4. Within the same block of the old and new models, the role of a filter can change. For instance, a filter in a block of the old model may detect the eyes, while the same filter in the same block of the new model responds to the face. Picking the feature map in this way is therefore a reasonable heuristic.

(4)
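One plausible way to write Equations 1 to 4, inferred only from the surrounding definitions, is given below. The superscripts old and new, the hat on the best-memory map, and the name of the forgetting IoU are assumptions for illustration and may differ from the authors' notation.

```latex
\begin{align}
\mathrm{IoU}\left(FM_{l}^{m}, GT\right)
  &= \frac{\left| FM_{l}^{m} \cap GT \right|}{\left| FM_{l}^{m} \cup GT \right|} \\
RM_{l}
  &= \operatorname*{arg\,max}_{FM_{l}^{m}} \; \mathrm{IoU}\left(FM_{l}^{m}, GT\right) \\
\mathrm{IoU}\left(RM_{l}^{\mathrm{old}}, FM_{l}^{m,\mathrm{new}}\right)
  &= \frac{\left| RM_{l}^{\mathrm{old}} \cap FM_{l}^{m,\mathrm{new}} \right|}
          {\left| RM_{l}^{\mathrm{old}} \cup FM_{l}^{m,\mathrm{new}} \right|} \\
\widehat{RM}_{l}^{\mathrm{new}}
  &= \operatorname*{arg\,max}_{FM_{l}^{m,\mathrm{new}}} \;
     \mathrm{IoU}\left(RM_{l}^{\mathrm{old}}, FM_{l}^{m,\mathrm{new}}\right)
\end{align}
```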
  Input: Sample set S, segmentation ground truth GT, old model, new model, number of blocks K
  Output: Forgetting layers
  repeat
     for each block from 1 to K do
        Append()
        Append()
     end for
     Append(Ł, )
  until S is exhausted
  Most frequent block in Ł
Algorithm 1 Auto DeepVis

The pipeline of our method is depicted in Algorithm 1. The sample set is described in Section 4, and K is the number of convolutional blocks in the network. Feeding the images of the sample set one by one, we obtain the visualization of filters over the whole network by PDA. Next, we iterate through the blocks to find the fragile block for a given image; the indexOf step returns the most forgetting block. Finally, we generalize over the sample set to return the most forgetting component.
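The pipeline described above can be sketched as follows, assuming the PDA maps have already been computed and binarized per block. The criterion used for the most forgetting block (the largest drop in IoU relative to the preceding block, with 1.0 before the first block) is an assumption, since the exact rule is not spelled out here; all function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Mask = List[List[int]]
Blocks = List[List[Mask]]  # blocks -> feature maps -> binary mask


def iou(a: Mask, b: Mask) -> float:
    """Intersection over Union of two binary masks."""
    inter = sum(bool(x and y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(bool(x or y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def most_forgetting_block(old_maps: Blocks, new_maps: Blocks, gt: Mask) -> int:
    """Return the 1-based index of the block with the sharpest IoU drop."""
    memory = []
    for old_block, new_block in zip(old_maps, new_maps):
        # Representative map: the old feature map closest to the ground truth.
        representative = max(old_block, key=lambda fm: iou(fm, gt))
        # Best memory: the new feature map closest to that representative.
        memory.append(max(iou(representative, fm) for fm in new_block))
    # Assumed criterion: largest drop relative to the preceding block.
    drops = [prev - curr for prev, curr in zip([1.0] + memory, memory)]
    return drops.index(max(drops)) + 1


def auto_deepvis(samples: Sequence[Tuple[Blocks, Blocks, Mask]]) -> int:
    """Vote over all samples and return the most frequent forgetting block."""
    votes = Counter(most_forgetting_block(old, new, gt)
                    for old, new, gt in samples)
    return votes.most_common(1)[0][0]


gt = [[1, 1], [0, 0]]
old = [[[[1, 1], [0, 0]], [[0, 0], [1, 1]]],  # block 1: two feature maps
       [[[1, 0], [0, 0]]]]                    # block 2: one feature map
new = [[[[1, 1], [0, 0]]],                    # block 1 is preserved
       [[[0, 0], [0, 1]]]]                    # block 2 has changed
culprit = auto_deepvis([(old, new, gt), (old, new, gt)])  # block 2
```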

Figure 3: Two-head network for dissecting catastrophic forgetting.

3.2 Critical Freezing

Using the investigation from Auto DeepVis, we freeze or apply a tiny learning rate to the most plastic layers in a deep learning model. Take ResNet-50, which has 5 convolutional blocks, as an example: if we find a given convolutional block fragile, that block should be only slightly updated or completely frozen while training on the next task. The objective function is the standard cross-entropy loss for image captioning, computed over the vocabulary:

(5)
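A common form of the per-word cross-entropy loss, written here under assumed notation, is shown below: $V$ is the size of the vocabulary, $y_{i}$ is the one-hot target for word $i$, and $p_{i}$ is the predicted probability of word $i$ at the current time step. The exact symbols of Equation 5 may differ.

```latex
\mathcal{L} = - \sum_{i=1}^{V} y_{i} \log p_{i}
```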

The proposed technique can accompany various existing solutions. For instance, when knowledge distillation is used for continual semantic segmentation as in (Michieli and Zanuttigh, 2019), progressively adding classes alters the evolution of the model, misleading the network to a local optimum. Knowing which regions need to remain intact could largely improve the performance on the old task. Our tool could also benefit other continual learning approaches, ranging from regularization to dynamic architectures.
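As an illustration of freezing or slowing down only the plastic blocks, the sketch below builds a per-block learning-rate plan in plain Python. The block names, the default rates, and the function `critical_freezing_plan` are hypothetical; in a framework such as PyTorch, a plan like this would typically be realized through per-parameter-group learning rates or by disabling gradients for the frozen blocks.

```python
from typing import Dict, Iterable, Sequence


def critical_freezing_plan(block_names: Sequence[str],
                           plastic_blocks: Iterable[str],
                           base_lr: float = 1e-3,
                           plastic_lr: float = 0.0) -> Dict[str, float]:
    """Map each block to a learning rate; plastic blocks get plastic_lr.

    plastic_lr = 0.0 freezes the block, a small positive value updates
    it only slightly.
    """
    plastic = set(plastic_blocks)
    unknown = plastic - set(block_names)
    if unknown:
        raise ValueError(f"unknown blocks: {sorted(unknown)}")
    return {name: (plastic_lr if name in plastic else base_lr)
            for name in block_names}


blocks = ["block1", "block2", "block3", "block4", "block5"]
plan = critical_freezing_plan(blocks, ["block3", "block4"])
# block3 and block4 are frozen, the other blocks keep the base learning rate
```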

Past tasks                            | Newly added tasks
BLEU1  BLEU4  METEOR  ROUGE_L  CIDEr  | BLEU1  BLEU4  METEOR  ROUGE_L  CIDEr
68.1   24.9   23.4    50.8     77.8   | -      -      -       -        -
46.0   6.5    11.7    34.5     11.3   | 58.1   15.5   17.7    44.2     35.0
51.3   11.1   15.3    38.8     27.1   | 60.5   17.2   19.3    45.4     43.7
53.3   13.6   15.6    40.3     35.3   | 60.2   17.3   17.8    44.7     36.5
54.3   12.5   16.1    40.4     33.0   | 61.1   17.8   19.6    46.3     45.6

Table 1: Performance when 5 classes arrive sequentially on past tasks and newly added tasks.

4 Experiments

PDA (Zintgraf et al., 2017) is run with the window size that gives the best visualization. We use a modified dataset called Split MS-COCO from (Nguyen et al., 2019b) to reproduce catastrophic forgetting. The dataset contains over 47k images for training and over 23k images for validation and testing. In the incremental learning setup, one new class arrives at each time step. Notably, data balancing is not applied; it is left for future work as a way to increase the overall performance.

Experiments are performed on a multi-modal task (captioning) whose architecture combines a CNN and an RNN, following the sequential scenario of (Nguyen et al., 2019b). We initially train on 19 classes to acquire the original model, then add 5 classes incrementally. If the findings from our tool hold for this task, we expect they could also be applied to object detection, classification, or segmentation. The captioning model is divided into an encoder followed by a decoder. Therefore, the original structure is transformed into a two-head network so that we can obtain the prediction of the CNN and the sentence generated by the decoder simultaneously. We add one more output layer as a classifier after the encoder; the proposed architecture is presented in Fig. 3. The encoder is ResNet-50, and the decoder includes an embedding layer, a single-layer LSTM, and a fully-connected layer producing one word per time step.

Figure 4: Feature maps from convolutional blocks of ResNet-50.

As our tool works on a single image, it must be run multiple times on different and diverse input images to generalize the forgetting. We choose a sample set (bicycle, car, motorcycle, airplane, bus, train, bird, cat, dog, horse, sheep, and cow) from the 19 trained classes.

To evaluate critical freezing, we compare against baselines such as fine-tuning and heuristic freezing. In fine-tuning, the old model initializes the new model and training minimizes the loss on the new task. As the network contains two parts, an encoder and a decoder, we freeze them separately. The traditional image captioning scores are used to evaluate critical freezing against the baselines.

4.1 Auto DeepVis to Dissect CNNs

To dissect which parts of the encoder are forgetting the most, we first apply Auto DeepVis to compute the IoU of each layer against both the ground truth and the original computer vision, i.e. what the layers of the original model (trained on 19 classes) see. Each newly added class yields a new model, identified by the number of classes it has witnessed. In Fig. 4, the visualized results show that the first and second blocks of ResNet-50 capture the overall outline of objects. Deeper in the network, the computer vision turns to more detailed features of the objects and to background features in order to determine the class of the input image.

Figure 5: IoU of the models compared with GT.

Subsequently, the IoUs of the different blocks in each model, compared with the ground truth, are calculated by Equations 1 and 2. The results reveal that although the models differ in classification performance, the IoU values are roughly similar at all layers. This implies that, regardless of performance, the degree of matching between the feature maps of each model and human vision is preserved (Fig. 5).

To quantitatively measure how forgetting occurs in the encoder, we compute IoUs by Equations 3 and 4. As shown in Fig. 6, the IoU with the original model is much higher for one of the new models than for the other in every block. For the better-preserved model, the IoU is always 1.00 at the first block of the network, showing that only trivial forgetting happens there. The IoU then starts to drop along the blocks because the later feature maps are constructed from the previous maps. The forgetting effect persists and does not reveal which block is forgetting the most.

Figure 6: IoU of the two new models compared with the original model.
Table 2: Qualitative analysis on the encoder. The prediction is preserved via critical freezing.

For the other model, the first block still gets a high IoU compared with the original model, and the values decrease from the second block. Unlike the steadily decreasing trend of the better-preserved model, the rate of decrease fluctuates through the blocks, and a severe drop at block 3 is observed for all the test inputs, suggesting that the forgetting effect is strongest in this block. After this block, almost all instances show a continued IoU decline at block 4. Iterating this procedure over all images of the sample set reinforces that the most dramatic forgetting happens at block 3 and block 4 of ResNet-50.

Freezing the mentioned blocks means the feature extraction stays unchanged. We conduct a qualitative analysis to observe the response of the encoder when critical freezing is adopted (Table 2). The predictions remain virtually the same as the output of the original model. The only misclassification in Table 2 is between American eagle and Crane; however, both still belong to the bird family. While the two naive freezing approaches of (Nguyen et al., 2019b) (partial freezing) are also implemented, we devise critical freezing based on our findings, which freezes only the most plastic layers. As shown in Table 1, precise freezing helps learning on both the new and old tasks more effectively.

4.2 Decoder Dissection

In addition to dissecting the CNN, we also want to inspect and visualize the changes in the decoder. However, to the best of our knowledge, no work has yet revealed how RNNs visually see or sense the input. Therefore, in this work, we simply freeze each component of the decoder while learning the new task and observe the effect of freezing. We also keep the encoder frozen to see how the generated caption depends on an individual layer of the decoder. Table 3 suggests the significance of the LSTM network in the sentence generation process, while the linear layer can remain trainable over a task series.

Frozen component  BLEU1  BLEU4  METEOR  ROUGE_L  CIDEr
LSTM              46.1   7.0    11.1    34.1     12.8
Embedding         45.0   6.7    11.1    34.0     12.8
Linear            39.3   2.1    8.3     31.3     4.2
Table 3: Performance on the past tasks when freezing the decoder's components.

5 Conclusion

As catastrophic forgetting hinders life-long learning, understanding how this phenomenon happens in computer vision is highly significant. We introduce Auto DeepVis to dissect catastrophic forgetting. Knowing where the forgetting comes from, we propose a technique that focuses on the plastic components of a model to moderate the information loss. The results indicate the superiority of critical freezing over the baselines. We also tried knowledge distillation on the plastic layers, but it does not help much because the discrepancy between teacher and student accumulates over time, which makes teaching difficult. To the best of our knowledge, no previous work mitigates catastrophic forgetting based on Explainable AI. This work shows a satisfying result from the investigation. Auto DeepVis gives a good view of the forgetting layers, and freezing the critical layers helps the model mitigate catastrophic forgetting and could serve as a supplement to other techniques that address it.

Several directions remain for future work. First and foremost, a deeper, cleaner, and more automatic version of our tool should be developed to give better insight into catastrophic forgetting. At present, we only consider the filter with the highest IoU. Although this assumption is reasonable, it is still crude, so looking into other filters in a convolutional block might be necessary as well. Scaling this work to other tasks would better validate the feasibility of the proposed continual learning algorithm. Last but not least, RNNs are currently overlooked and not fully understood by interpretability methods. RNN dissection requires effort but would open the door to understanding the nature of this kind of network.

References

  • Z. Che, S. Purushotham, R. Khemani, and Y. Liu (2016) Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, Vol. 2016, pp. 371. Cited by: §1.
  • P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976. Cited by: §1.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • S. Hou, X. Pan, C. Change Loy, Z. Wang, and D. Lin (2018) Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 437–452. Cited by: §2.
  • R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan (2018) Measuring catastrophic forgetting in neural networks. In Thirty-second AAAI conference on artificial intelligence, Cited by: §1, §1.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §2.
  • S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in neural information processing systems, pp. 4652–4662. Cited by: §2.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.1.
  • C. X. Ling and T. Bohn (2019) A unified framework for lifelong learning in deep neural networks. arXiv preprint arXiv:1911.09704. Cited by: §1.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §2.
  • U. Michieli and P. Zanuttigh (2019) Incremental learning techniques for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §3.2.
  • C. V. Nguyen, A. Achille, M. Lam, T. Hassner, V. Mahadevan, and S. Soatto (2019a) Toward understanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091. Cited by: §1, §3.
  • G. Nguyen, T. J. Jun, T. Tran, and D. Kim (2019b) ContCap: a comprehensive framework for continual image captioning. arXiv preprint arXiv:1909.08745. Cited by: §1, §1, §4.1, §4, §4.
  • M. Robnik-Šikonja and I. Kononenko (2008) Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering 20 (5), pp. 589–600. Cited by: §2.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §2.
  • K. Shmelkov, C. Schmid, and K. Alahari (2017) Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3400–3409. Cited by: §1.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §1, §2.
  • O. Tasar, Y. Tarabalka, and P. Alliez (2019) Incremental learning for semantic segmentation of large-scale remote sensing data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (9), pp. 3524–3537. Cited by: §1.
  • M. Toneva, A. Sordoni, R. T. d. Combes, A. Trischler, Y. Bengio, and G. J. Gordon (2018) An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159. Cited by: §2.
  • J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson (2015) Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579. Cited by: §2.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §2.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §1, §2.
  • M. Zhai, L. Chen, F. Tung, J. He, M. Nawhal, and G. Mori (2019) Lifelong gan: continual learning for conditional image generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2759–2768. Cited by: §1.
  • L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595. Cited by: §1, §1, §2, §3.1, §3, §4.