Reducing catastrophic forgetting with learning on synthetic data

04/29/2020 ∙ by Wojciech Masarczyk, et al. ∙ Tooploox

Catastrophic forgetting is a problem caused by neural networks' inability to learn data in sequence: after learning two tasks in sequence, performance on the first one drops significantly. This is a serious disadvantage that prevents the application of deep learning to many real-life problems where not all object classes are known beforehand, or where a change in the data requires adjustments to the model. To reduce this problem we investigate the use of synthetic data. Namely, we answer the question: is it possible to generate synthetic data such that learning it in sequence does not result in catastrophic forgetting? We propose a method to generate such data in a two-step optimisation process via meta-gradients. Our experimental results on the Split-MNIST dataset show that training a model on such synthetic data in sequence does not result in catastrophic forgetting. We also show that our method of generating data is robust to different learning scenarios.

1 Introduction

Figure 1: Synthetic data produced by the generator is divided into five tasks according to class, and the learner (green) learns the tasks sequentially. The same procedure is applied to a learner trained on real data (red). The right plot shows that accuracy at the end of each task does not decrease when training on the learned synthetic data, whereas with real data it deteriorates sharply.

Deep learning methods have succeeded in many different domains, such as scene understanding, image generation and natural language processing [2, 27, 25, 19]. While deep learning methods differ in architecture choice, objective function or optimization strategy, they all assume that the training data is independent and identically distributed (i.i.d.). Methods built on this assumption are effective for fixed environments with stationary data distributions, where the tasks to be solved do not change over time or the classes present in the dataset are known from the beginning. However, in most real-life scenarios this assumption is violated and there is a need for methods that are able to handle such cases. Two examples of such scenarios are: a new object class is introduced, but the dataset used to train the baseline model is no longer available; or the data characteristics change seasonally and the model needs to adjust its predictions according to these trends. Continual learning [21] is a paradigm where data is presented sequentially to the algorithm without the ability to manipulate this sequence. Additionally, there is no assumption about the structure of the sequence. A successful continual learning algorithm needs to be able to learn a growing number of tasks, be resistant to catastrophic forgetting [17] and be able to adapt to distribution shifts. The memory and computational requirements of such an algorithm should scale reasonably with the incoming data.

Although the problem of continual learning has been known for many years [21, 17], only recently has the field gained significant traction, and many interesting ideas have been proposed. Most continual learning contributions can be divided into three categories [12, 20]: optimization, architecture and rehearsal. Methods based on optimization modifications usually add regularization terms to the objective function to dampen catastrophic forgetting [9, 11]. The second category gathers methods that propose various architectural modifications, e.g. Progressive Net [22], where increased capacity is obtained by initialising a new network for each task. The last category, rehearsal-based methods, consists of methods that assume the life-long presence of a subset of historical data that can be re-used to retain knowledge about past tasks [13, 5].

This work proposes a new data-driven path that is orthogonal to existing approaches. Specifically, we explore the possibility of creating input data artificially, in a coordinated manner, such that it reduces the catastrophic forgetting phenomenon. We achieve this by combining two separate neural networks connected by a two-step optimisation. We use a generative model to create a synthetic dataset and form a sequence of tasks to evaluate a learner model in a continual learning scenario. The sequence of synthetic tasks is used to train the learner network. Then, the learner network is evaluated on real data. The loss obtained on real data is used to tune the parameters of the generative network. In the following step, the learner network is replaced with a new one.

Unlike existing approaches, our method is independent of the training method and task, and it can easily be incorporated into the above-mentioned strategies to provide additional gains.

2 Related Work

One line of research for continual learning focuses on the optimization process. It draws inspiration from the biological phenomenon known as synaptic plasticity [1]. It assumes that weights (connections) that are important for a particular task become less plastic in order to retain the desired performance on previous tasks. An example of such an approach is Elastic Weight Consolidation (EWC) [9], where a regularisation term based on the Fisher information matrix is used to slow down the change of important weights. However, the accumulation of these constraints prevents the network from learning longer sequences of tasks. Another optimization-based method is Learning without Forgetting (LwF) [11]. It tries to retain the knowledge of previous tasks by optimizing a linear combination of the current task loss and a knowledge distillation loss. LwF is a conceptually simple method that benefits from the knowledge distillation phenomenon [6]. The downside of this approach is that applying LwF requires additional memory and computation resources for each optimization step.
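As a rough illustration of the kind of regularisation term EWC adds, the sketch below computes the commonly used quadratic penalty with a diagonal Fisher approximation. It is a minimal pure-Python sketch, not code from this work or from [9]; the function name `ewc_penalty` and the parameter names `fisher_diag` and `strength` are hypothetical.

```python
def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic EWC-style penalty (diagonal Fisher approximation).

    params        -- current weights (flat list of floats)
    anchor_params -- weights stored after the previous task
    fisher_diag   -- per-weight importance estimates (assumed given)
    strength      -- hypothetical name for the regularisation coefficient
    """
    # Weights with a large Fisher value are pulled strongly towards
    # their old values; unimportant weights remain free to change.
    return 0.5 * strength * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# The penalty is added to the loss of the task currently being learned.
current_task_loss = 0.3  # placeholder value for illustration only
total_loss = current_task_loss + ewc_penalty(
    [1.0, 2.0], [0.0, 2.0], [2.0, 5.0], strength=1.0
)
```

Because one such term is accumulated per past task, the weights become progressively more constrained, which is the limitation for long task sequences noted above.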

Methods based on architectural modifications dynamically expand or shrink networks, select sub-networks, freeze weights or create additional networks to preserve knowledge. The authors of [22] propose an algorithm that creates a separate network (a column) for each new task and trains it to solve that particular task. Additionally, connections between previous columns and the current column are learned to enable forward transfer of knowledge. This algorithm avoids catastrophic forgetting completely and enables effective transfer learning. However, the computational cost of this approach is prohibitive for longer sequences of tasks. Other methods [31, 29] address the problem of computational cost by expanding single layers or neurons instead of whole networks; however, these methods have less capacity to solve upcoming tasks. Other approaches that modify architectures select sub-networks for solving the current task in such a way that only the fraction of the network's parameters relevant to the current task is changed [15, 16, 3]. The challenge here is to balance the number of frozen and active weights so that the network is still able to learn new tasks while preserving current knowledge.

Rehearsal methods are based on the concept of memory replay. It is assumed that a subset of previously processed data is stored in a memory bank and interleaved with upcoming data in such a way that the neural network learns to solve the current task while preserving current knowledge [13, 28, 5]. A naive rehearsal method would be to save random data samples that were present during training. However, such an approach is inefficient, since samples are not equally informative. Hence the challenge of rehearsal methods is to choose the most representative samples for a given dataset, such that minimum storage is occupied. In [28], the authors apply a dataset distillation method based on meta-gradient optimization to reduce the size of the memory bank. It is possible to represent a whole class of examples by storing just one carefully optimized example. Unfortunately, applying this meta-optimization method is computationally expensive. The biggest downside of rehearsal-based methods is the need to store the actual data, which in some cases can violate data privacy rules or can be computationally prohibitive. To mitigate this issue, solutions based on generative networks were proposed [30, 23]. Namely, they use a dual-model architecture composed of a learner network and a generative network. The role of the generative network is to model the data previously experienced by the learner network. Data sampled from the generator network is used as rehearsal data for the learner network to reduce the effect of catastrophic forgetting.
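To make the naive rehearsal baseline concrete, the sketch below keeps a fixed-size uniform random subset of a data stream using reservoir sampling. This is one common way to "save random data samples" and is an assumption for illustration, not the procedure of any cited method; the class name `ReservoirBuffer` and its methods are hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-size memory bank holding a uniform random subset of a stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; each sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Classic reservoir sampling: replace a stored slot with
            # probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def replay(self, k):
        """Draw up to k stored samples to interleave with the current task."""
        return self.rng.sample(self.items, min(k, len(self.items)))


buffer = ReservoirBuffer(capacity=3)
for sample in range(10):
    buffer.add(sample)
replayed = buffer.replay(5)
```

Such a buffer treats all samples as equally informative, which is exactly the inefficiency that the more selective rehearsal methods try to address.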

Our method is also a dual-architecture model based on a generative network; however, the aim of the generative network is radically different. In contrast to the authors of [30, 23], we do not aim to capture the statistics of real data. Instead, we try to generate entirely synthetic data such that a learner trained on a sequence of such data does not suffer from catastrophic forgetting.

3 Method

The main idea of our approach is to generate data samples such that a network trained on them in sequence does not suffer from catastrophic forgetting. One of many ways to generate artificial data is to use the meta-optimization strategy introduced in [14]. It is shown there that by applying meta-learning it is possible to apply gradient-based optimization both to hyperparameters and to input data. However, this approach is limited to small problems, since each data point must be optimised separately. To overcome this bottleneck, the authors of Generative Teaching Networks (GTNs) [26] use a generative network to create artificial data samples instead of directly optimizing the input data. We adopt a similar approach in our method: we use a generative network (green rectangle "Generator" in Fig. 2) to produce synthetic data from noise vectors sampled from a random distribution. Next, we split the data into separate tasks according to classes and form a continual learning task for the learner network (blue rectangle in Fig. 2). After completing the whole sequence of tasks, the learner network is evaluated on real training data. The loss from real data classification after learning all tasks in sequence is then backpropagated to the generator network to tune its parameters, as shown in Fig. 2.

Our approach is similar to the one proposed in [7]. Using two-step meta-learning optimization, the authors of [7] try to learn the best representation of the input data such that a model trained with standard optimization does not suffer from catastrophic forgetting.

Unlike [26], we do not use curriculum-based learning, as our goal is to have a realistic continual learning scenario where the order of the data sequence is not known beforehand. To ensure that the generator network does not generate data suited to one particular sequence of tasks, we shuffle the order of tasks at each meta-optimization step. Precisely, at each step we generate samples for each class and then randomly create a sequence of binary classification tasks from this data.
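The task shuffling described above can be sketched as follows. This is a minimal illustration assuming an even number of classes that are paired at random into binary tasks; the function name `make_binary_task_sequence` is hypothetical and the exact pairing procedure used in the experiments may differ.

```python
import random


def make_binary_task_sequence(classes, rng):
    """Randomly group classes into an ordered sequence of disjoint binary tasks.

    Assumes an even number of classes (e.g. the 10 MNIST digits).
    """
    shuffled = list(classes)
    rng.shuffle(shuffled)
    # Consecutive pairs of the shuffled list form the tasks, so both the
    # pairing of classes and the order of tasks change on every call.
    return [tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled), 2)]


rng = random.Random(0)
# One fresh task sequence would be drawn per meta-optimization step.
tasks = make_binary_task_sequence(range(10), rng)
```

Drawing a new sequence at every step means the generator cannot rely on any fixed task order when shaping its samples.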

Figure 2: Synthetic data from the generator $G$ is passed to the learner, where the inner optimization is performed, and the meta-loss is backpropagated to $G$.

Precisely, let $G$ be a generative neural network, $f$ a standard convolutional network for classification, and $\mathcal{T} = (T_1, \dots, T_n)$ a sequence of tasks, where each task $T_i$ is a binary classification task and the classes in the tasks form mutually disjoint sets.

The inner training loop consists of a sequence of tasks, where generated samples from previous tasks are not replayed once a task is finished. To achieve this, the sequence of tasks must be defined a priori, and the samples generated by the network $G$ are conditioned on the information about the particular task. For each task $T_i$ the network $G$ generates two batches of samples $G(z, c_k)$ for $k \in \{1, 2\}$, where $z$ is a batch of noise vectors sampled from a Normal distribution and $c_k$ is a class indicator for task $T_i$. Note that the generator network has access to class indicators, since only the learner network is meant to learn in the continual learning scenario.

The learner network $f$ learns the tasks sequentially using a standard SGD optimizer, with learning rate and momentum optimized through meta-gradients. At the end of the sequence, the network $f$ is evaluated on the real dataset $(x, y)$, obtaining the meta-loss as shown in Fig. 2. This meta-loss is backpropagated through all inner training loops of the model $f$ to optimize the generator network $G$. The parameters $\theta_G$ of the network $G$ are updated according to the equation:

$$\theta_G \leftarrow \theta_G - \beta \, \nabla_{\theta_G} \mathcal{L}\big(f_{\theta_N}(x), y\big) \qquad (1)$$

where $\theta_N$ are the parameters of the network $f$ after $N$ optimization steps, $\beta$ is a fixed learning rate, $\mathcal{L}$ is the cross-entropy loss function, and $x$, $y$ are real data samples and labels respectively.
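To illustrate what backpropagating the meta-loss through the inner training loop means in the update of Eq. (1), the toy sketch below reduces everything to scalars. The "generator" is a single learnable synthetic target `s`, the "learner" is a single weight `w`, and a squared error stands in for the cross-entropy loss. These simplifications and all names (`inner_train`, `meta_step`, `alpha`, `beta`) are assumptions made for illustration only; in practice the derivative through the unrolled steps is obtained by automatic differentiation rather than by hand.

```python
def inner_train(w0, s, alpha, n_steps):
    """Run n_steps of SGD on the inner loss 0.5 * (w - s)**2.

    Returns the final weight and its derivative with respect to the
    synthetic datum s, accumulated through every unrolled step.
    """
    w, dw_ds = w0, 0.0
    for _ in range(n_steps):
        grad_w = w - s                           # gradient of the inner loss
        w = w - alpha * grad_w                   # SGD step on synthetic data
        dw_ds = (1.0 - alpha) * dw_ds + alpha    # chain rule through the step
    return w, dw_ds


def meta_step(s, y, w0, alpha, beta, n_steps):
    """One outer update of s using the meta-loss measured on real data y."""
    w_n, dw_ds = inner_train(w0, s, alpha, n_steps)
    meta_loss = 0.5 * (w_n - y) ** 2
    meta_grad = (w_n - y) * dw_ds                # d(meta_loss) / d(s)
    return s - beta * meta_grad, meta_loss


s = 0.0
for _ in range(200):
    # A fresh learner (w0 = 0) is trained from scratch at every meta-step.
    s, meta_loss = meta_step(s, y=2.0, w0=0.0, alpha=0.1, beta=0.5, n_steps=5)
```

In this toy setting the learned `s` ends up far from the real target `y`, because it only needs to steer a 5-step learner to the right place. This loosely mirrors how the generated samples need not resemble real data.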

4 Experiments

To test our hypothesis we use the popular continual learning benchmark Split-MNIST [10, 24]. In the first experiment, we use a 5-fold split with two classes per task to create a moderately difficult sequence of tasks. The generator network $G$ generates 250 samples per class. During inner optimisation the learner network is trained on batches of 40 generated images (20 samples per class, drawn randomly from the pool of 250 samples per class). We train the learner network on each task for 5 inner steps. Once a task is over, samples from this task are not shown to the network again until the end of training. At test time, after learning each task the network is evaluated on the part of the test set composed of the classes seen so far. Both networks are simple convolutional neural networks with two convolutional layers, followed by one fully connected layer for the classification network and two for the generative network. Each layer is followed by a batch normalisation layer.
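The batch sampling and evaluation protocol described above can be sketched as below. Strings stand in for images, and the helper names `sample_task_batch` and `seen_class_accuracy` are hypothetical; this is a minimal sketch of the bookkeeping, not the experimental code.

```python
import random


def sample_task_batch(pool, task_classes, per_class, rng):
    """Draw per_class samples for each class of the current task.

    pool maps a class label to its list of generated samples.
    """
    batch = []
    for label in task_classes:
        for sample in rng.sample(pool[label], per_class):
            batch.append((sample, label))
    rng.shuffle(batch)
    return batch


def seen_class_accuracy(predict, test_set, seen_classes):
    """Accuracy restricted to test examples whose class has been seen so far.

    Assumes at least one test example belongs to seen_classes.
    """
    subset = [(x, y) for x, y in test_set if y in seen_classes]
    correct = sum(1 for x, y in subset if predict(x) == y)
    return correct / len(subset)


rng = random.Random(0)
# Pool of 250 placeholder samples for each of the 10 classes.
pool = {c: [f"sample_{c}_{i}" for i in range(250)] for c in range(10)}
# One inner step on the task with classes (0, 1): 20 samples per class.
batch = sample_task_batch(pool, (0, 1), per_class=20, rng=rng)
```

With 5 inner steps per task, `sample_task_batch` would be called five times before moving to the next task, and `seen_class_accuracy` would be computed after each task with a growing set of seen classes.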

As a baseline, we use the simple fully connected network proposed in [8] ('MLP', red in Fig. 4). To further investigate the impact of generated data, we use the same network architectures and optimizer settings, with learning rate and momentum optimized by the meta-learning process described in Section 3, but we optimize the learner network on real data ('Real Data', yellow in Fig. 4). We also compare our results with GAN-based data samples. In this scenario we follow the setting of the 'Real Data' scenario except for the source of data: we use a Conditional GAN [18] to model the original data distribution and then draw 250 samples per class ('GAN based', blue in Fig. 4).

We implement the experiments in the PyTorch library, which is well suited for computing higher-order gradients [4].

Figure 3: Samples generated by the network $G$ at the end of meta-optimisation. Starting from zero (leftmost), each sample to the right represents the following class.
Figure 4: Overall accuracy measured on a subset of the test data. After learning each task, the subset consists only of samples from classes seen during the current and previous tasks.
Figure 5: Overall accuracy measured on the test set after training the learner network on synthetic data for a varying number of inner steps on each task.

Results. The obtained results support our hypothesis that it is possible to generate synthetic data such that, even if a network learns this data in sequence (seeing each sample once), the learning process does not result in catastrophic forgetting.

Figure 4 shows that learning on synthetic data in sequence results in less catastrophic forgetting than learning on a sequence of real data samples. Note that additional performance could be gained with careful hyperparameter tuning; however, our aim was to show the potential of this approach rather than to compete for the best performance. The higher accuracy of the 'Real Data' scenario over 'MLP' can be attributed to the effectiveness of the optimised learning rate and momentum parameters; however, the main advantage comes from using meta-learned data samples. The results obtained with GAN-generated data are almost identical to the ones obtained with real data. This is expected, as the data modeled by a GAN closely resembles the original data.

An example batch of generated samples is shown in Figure 3. The samples are ordered according to classes (starting from 0). In contrast to [26], the data samples are abstract blobs rather than interpretable images. We verified experimentally that the lack of structure in the generated samples is due to the lack of curriculum learning in our scenario. We skip it intentionally to provide a more realistic continual learning scenario for the learner network.

Fig. 5 shows the impact of changing the learning scenario of the learner network $f$ after the generator network $G$ has been trained. In this experiment, the data generated by the network $G$ in the first experiment is used. Here, we investigate how the final accuracy after learning five consecutive tasks changes with the number of inner optimization steps. Note that $G$ was optimised to create samples that are robust to catastrophic forgetting with an inner optimization loop of 5 steps. As we can see, for longer learning horizons the network trained on synthetic data (green plot in Fig. 5) suffers significantly less than the same network trained on real data (yellow plot in Fig. 5). Even though the accuracy of both networks drops as the number of inner steps increases, the drop is smoother for synthetic data.

5 Conclusions

The aim of this work was to answer the question of whether it is possible to create data that dampens the effect of catastrophic forgetting. Experiments show that this hypothesis is true: it is possible to generate such samples, although they usually do not visually resemble real data. Surprisingly, applying the method on its own can already result in a high-performing network. An additional interesting advantage of this synthetic data is its robustness to changes in the inner optimisation parameters: a 15-fold increase in batch size and training length still results in compelling performance. We believe that our experiments open a new and exciting path in continual learning research. As future work we plan to adapt the current method to datasets of higher complexity and test its effectiveness in an online learning scenario.

6 Acknowledgements

The authors would like to thank Petr Hlubuček and GoodAI for publishing the code at https://github.com/GoodAI/GTN.

References

  • [1] J. Cichon and W. Gan (2015-03) Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature 520.
  • [2] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • [3] S. Golkar, M. Kagan, and K. Cho (2019) Continual learning via neural pruning. CoRR abs/1903.04476.
  • [4] E. Grefenstette, B. Amos, D. Yarats, P. M. Htut, A. Molchanov, F. Meier, D. Kiela, K. Cho, and S. Chintala (2019) Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727.
  • [5] T. L. Hayes, N. D. Cahill, and C. Kanan (2019) Memory efficient experience replay for streaming learning. In International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20-24, 2019, pp. 9769–9776.
  • [6] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
  • [7] K. Javed and M. White (2019) Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 1820–1830.
  • [8] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan (2018) Measuring catastrophic forgetting in neural networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 3390–3398.
  • [9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. ISSN 0027-8424. https://www.pnas.org/content/114/13/3521.full.pdf
  • [10] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4652–4662.
  • [11] Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
  • [12] V. Lomonaco (2018) Continual learning with deep architectures. Ph.D. Thesis.
  • [13] D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6467–6476.
  • [14] D. Maclaurin, D. Duvenaud, and R. P. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp. 2113–2122.
  • [15] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Vol. 11208, pp. 72–88.
  • [16] A. Mallya and S. Lazebnik (2018) PackNet: adding multiple tasks to a single network by iterative pruning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7765–7773.
  • [17] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation 24, pp. 104–169.
  • [18] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784.
  • [19] W. Oleszkiewicz, P. Kairouz, K. Piczak, R. Rajagopal, and T. Trzciński (2019) Siamese generative adversarial privatizer for biometric data. In Computer Vision – ACCV 2018, C.V. Jawahar, H. Li, G. Mori, and K. Schindler (Eds.), Cham, pp. 482–497. ISBN 978-3-030-20873-8.
  • [20] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. ISSN 0893-6080.
  • [21] M. B. Ring (1994) Continual learning in reinforcement environments. Ph.D. Thesis, University of Texas at Austin, USA.
  • [22] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv:1606.04671.
  • [23] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 2990–2999.
  • [24] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber (2013) Compete to compute. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2310–2318.
  • [25] W. Stokowiec, T. Trzcinski, K. Wołk, K. Marasek, and P. Rokita (2017-07) Shallow reading with deep learning: predicting popularity of online content using only its title. pp. 136–145. ISBN 978-3-319-60437-4.
  • [26] F. P. Such, A. Rawal, J. Lehman, K. Stanley, and J. Clune (2020) Generative teaching networks: accelerating neural architecture search by learning to generate synthetic training data.
  • [27] I. Tautkute, T. Trzciński, A. P. Skorupa, Ł. Brocki, and K. Marasek (2018) DeepStyle: multimodal search engine for fashion and interior design. IEEE Access 7, pp. 84613–84628.
  • [28] T. Wang, J. Zhu, A. Torralba, and A. A. Efros (2018) Dataset distillation. CoRR abs/1811.10959.
  • [29] Y. Wang, D. Ramanan, and M. Hebert (2019) Growing a brain: fine-tuning by increasing model capacity. CoRR abs/1907.07844.
  • [30] C. Wu, L. Herranz, X. Liu, Y. Wang, J. van de Weijer, and B. Raducanu (2018) Memory replay GANs: learning to generate new categories without forgetting. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5962–5972.
  • [31] J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations.