1 Introduction
Neural networks are achieving human-level performance on many cognitive tasks, including image classification krizhevsky2012imagenet and speech recognition hinton2006fast. However, unlike humans, their acquired knowledge is comparatively volatile and easily lost. In particular, the catastrophic forgetting phenomenon refers to the case in which a neural network forgets past tasks if it is not allowed to retrain or iterate on them again goodfellow2013empirical; mccloskey1989catastrophic.
Continual learning is a research direction that aims to solve the catastrophic forgetting problem. Recent works have tackled this issue from a variety of perspectives. Regularization methods (e.g., kirkpatrick2017overcoming; zenke2017continual) aim to consolidate the weights that are important to previous tasks, while expansion-based methods (e.g., rusu2016progressive; yoon2018lifelong) typically increase the model capacity to cope with new tasks. Repetition-based methods (e.g., lopez2017gradient; chaudhry2018efficient) usually do not require additional complex modules; however, they have to maintain a small memory of previous data and use it to preserve knowledge. Unfortunately, the performance boost of repetition-based methods comes at the cost of storing previous data, which may be undesirable whenever privacy is important. To address this issue, the authors of farajtabar2019orthogonal propose to use the gradients of the previous data to constrain the weight updates; however, this may still be subject to privacy issues, since the gradient associated with each individual data point may disclose information about the raw data.
In this paper, we study the continual learning problem from the perspective of loss landscapes. We explicitly target minimizing an average over all tasks' loss functions. The proposed method stores neither the data samples nor the individual gradients on the previous tasks. Instead, we propose to construct an approximation to the loss surface of previous tasks. More specifically, we approximate the loss function by estimating its second-order Taylor expansion. The approximation is used as a surrogate added to the loss function of the current task. Our method only stores information based on statistics of the entire training dataset, such as the full gradient and full Hessian matrix (or its low-rank approximation), and thus better protects privacy. In addition, since we do not expand the model capacity, the neural network structure is less complex than that of expansion-based methods.
We study our algorithm from an optimization perspective, and make the following theoretical contributions:

We prove a sufficient and worst-case necessary condition under which, by conducting gradient descent on the approximate loss function, we can still minimize the actual loss function.

We further provide a convergence analysis of our algorithm for both nonconvex and convex loss functions. Our results imply that early stopping can be helpful in continual learning.

We make connections between our method and elastic weight consolidation (EWC) kirkpatrick2017overcoming .
In addition, we make the following experimental contributions:

We conduct a comprehensive comparison of our algorithm with several baseline algorithms kirkpatrick2017overcoming; chaudhry2018efficient; farajtabar2019orthogonal on a variety of combinations of datasets and models. We observe that in many scenarios, especially when the learner is not allowed to store raw data samples, our proposed algorithm outperforms them. We also discuss the conditions under which the proposed method or any of the alternatives is effective.

We provide experimental evidence validating the importance of accurately approximating the Hessian matrix, and discuss scenarios in which early stopping is helpful for our algorithm.
2 Related work
Avoiding catastrophic forgetting in continual learning Parisi2018ContinualLL; beaulieu2020learning is an important milestone towards achieving artificial general intelligence (AGI), which entails developing measurements toneva2018empirical; kemker2018measuring, evaluation protocols farquhar2018towards; de2019continual, and theoretical understanding nguyen2019toward; farquhar2019unifying of the phenomenon. Generally speaking, three classes of algorithms exist to overcome catastrophic forgetting farajtabar2019orthogonal.
The expansion-based methods allocate new neurons, layers, or modules to accommodate new tasks while utilizing the shared representation learned from previous ones rusu2016progressive; xiao2014error; yoon2018lifelong; li2019learn; Jerfel2018ReconcilingMA. Although this is a very natural approach, the mechanism of dynamic expansion can be quite complex and can add considerable overhead to the training process. The repetition- and memory-based methods store previous data or, alternatively, train a generative model of them, and replay samples interleaved with samples drawn from the current task shin2017continual; kamra2017deep; zhang2019prototype; rios2018closed; luders2016continual; lopez2017gradient; farajtabar2019orthogonal. They achieve promising performance, however, at the cost of a higher risk to users' privacy from storing their data or learning a generative model of it.
The regularization-based approaches impose limiting constraints on the weight updates of the neural network according to some relevance score for previous knowledge kirkpatrick2017overcoming; nguyen2017variational; titsias2019functional; ritter2018online; mirzadeh2020dropout; zenke2017continual; park2019continual. These methods provide a better privacy guarantee, as they do not explicitly store data samples. Broadly, SOLA also belongs to this category, since we use the second-order Taylor expansion as the regularization term in new tasks. Many of the regularization methods are derived from a Bayesian perspective of estimating the posterior distribution of the model parameters given the data from a sequence of tasks kirkpatrick2017overcoming; nguyen2017variational; titsias2019functional; ritter2018online; others use heuristics to either estimate the importance of the weights of the neural network zenke2017continual; park2019continual or implicitly limit the capacity of the network mirzadeh2020dropout. Similar to our approach, several regularization-based methods use quadratic functions as the regularization term, and many of them use the diagonal form of quadratic functions kirkpatrick2017overcoming; zenke2017continual; park2019continual. In Section 5.3, we demonstrate that in some cases, the EWC algorithm kirkpatrick2017overcoming can be considered the diagonal approximation of our approach. We note that the diagonal form of quadratic regularization has the drawback that it does not take the interactions between the weights into account.
Among the regularization-based methods, the online Laplace approximation algorithm ritter2018online is the most similar to our proposed method. Despite the similarity in the implementations, the two algorithms are derived from very different perspectives: the online Laplace approximation algorithm uses a Bayesian approach that approximates the posterior distribution of the weights with a Gaussian distribution, whereas our algorithm is derived from an optimization viewpoint using a Taylor approximation of the loss functions. More importantly, the Gaussian approximation in ritter2018online is proposed as a heuristic, whereas in this paper we provide rigorous theoretical analysis of how the approximation error affects the optimization procedure. We believe that our analysis provides deeper insights into the loss landscape of continual learning problems, and explains some important implementation details such as early stopping.
We also note that continual learning is broader than solving catastrophic forgetting alone, and is connected to many other areas such as meta-learning riemer2018learning, few-shot learning wen2018few; gidaris2018dynamic, and learning without explicit task identifiers rao2019continual; aljundi2019online, to name a few.
3 Problem formulation
We consider a sequence of supervised learning tasks , .¹ For task , there is an unknown distribution over the space of feature-label pairs . Let be a model parameter space,² and for the th task, let be the loss function of associated with data point . The population loss function of task is defined as . Our general objective is to learn a parametric model that minimizes the population loss over all the tasks. More specifically, in continual learning, the learner follows this protocol: when learning the th task, the learner obtains access to data points , sampled i.i.d. according to , and we define as the empirical loss function; the learner then updates the model parameter using these training data, and after the training procedure is finished, the learner loses access to the training data but can store some side information about the task. Our goal is to avoid forgetting previous tasks when training on new tasks by utilizing this side information. We provide details of our algorithm design in the next section.

¹ For any positive integer , we define .
² In most cases, we consider .

4 Our approach
To measure the effectiveness of a continual learning algorithm, we use a simple criterion: after each task, we hope the average population loss over all the tasks trained on so far to be small, i.e., for every , after training on , we hope to solve . Since minimizing the loss function is the key to training a good model, we propose a straightforward method for continual learning: storing the second-order Taylor expansion of the empirical loss function, and using it as a surrogate of the loss function for an old task when training on new tasks. We start with a simple setting. Suppose that there are two tasks, and at the end of , we obtain a model . We then compute the gradient and Hessian matrix of at , and construct the second-order Taylor expansion of at :
When training on , we try to minimize . The idea behind this design is that we hope that, in a neighborhood around , the quadratic function remains a good approximation of , so that we still approximately minimize the average of the empirical loss functions , which in the limit generalizes to the population loss function .
We rely on the assumption that the second-order Taylor approximation of the loss functions can capture their local geometry well. For a general nonlinear function and arbitrary displacement, this approximation can be overly simplistic; however, we point to the abundance of observations that the loss surfaces of modern neural networks are well behaved, with flat and wide minima choromanska2015loss; goodfellow2014qualitatively. Moreover, the assumption of a well-behaved loss around tasks' local minima also forms the basis of other continual learning algorithms such as EWC kirkpatrick2017overcoming and OGD farajtabar2019orthogonal.
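As a minimal, self-contained illustration of the two-task construction above, the sketch below uses toy least-squares tasks in NumPy (our illustrative choice; the dimensions, learning rate, and step counts are assumptions, and the paper's experiments use TensorFlow). For a quadratic loss the second-order expansion is exact, so minimizing the current loss plus the surrogate coincides with minimizing the sum of both empirical losses:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Toy task: least-squares regression. The loss is exactly quadratic, so the
# second-order Taylor expansion of it is exact.
def make_task(n=100):
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    loss = lambda w: np.mean((X @ w - y) ** 2)
    grad = lambda w: 2 * X.T @ (X @ w - y) / n
    hess = lambda w: 2 * X.T @ X / n  # constant for least squares
    return loss, grad, hess

loss1, grad1, hess1 = make_task()
loss2, grad2, _ = make_task()

# Train on task 1 with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    w -= 0.05 * grad1(w)
w1 = w.copy()

# Side information stored at the end of task 1: value, gradient, Hessian at w1.
f1, g1, H1 = loss1(w1), grad1(w1), hess1(w1)

def surrogate(v):
    """Second-order Taylor expansion of loss1 around w1."""
    dv = v - w1
    return f1 + g1 @ dv + 0.5 * dv @ H1 @ dv

# Train on task 2, minimizing loss2 + surrogate instead of loss2 + loss1.
for _ in range(500):
    w -= 0.05 * (grad2(w) + g1 + H1 @ (w - w1))
```

Since the tasks here are quadratic, the surrogate equals the task-1 loss everywhere, and the final parameters jointly minimize both empirical losses without revisiting the task-1 data.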
Formally, let be the model that we obtain at the end of the th task. We define the approximation of the sum of the first empirical loss functions as
(1)  
where denotes the Hessian matrix or its low-rank approximation, , , and is a constant that does not depend on . We construct at the end of task , and when training on task , we minimize . In the following, we name our algorithm SOLA, an acronym for Second-Order Loss Approximation.
As we can see, in the SOLA algorithm, if we choose to use the exact Hessian matrix after each task, i.e., , , it suffices to update and in memory, and thus the memory cost of the algorithm is , which does not grow with the number of tasks. However, in practice, especially for overparameterized neural networks, the dimension of the model is usually large, and thus the cost of storing the Hessian matrix can be high. Recent studies have shown that the Hessian matrices of the loss functions of deep neural networks are usually approximately low rank ghorbani2019investigation. If we choose as a rank approximation of , we need to keep accumulating the low-rank approximations of the Hessian matrices in order to construct , and at the end of task , the memory cost is , which in practice can be much smaller than that of using the exact Hessian matrices. We formally present our approach in Algorithm 1; the methods that use exact Hessian matrices and their low-rank approximations are presented as options I and II, respectively. Moreover, we can use a recursive implementation for the low-rank approximation, which further reduces the memory cost to , which does not grow with . We present the details of the recursive implementation in Section 6.
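The low-rank option can be sketched as follows (the matrix size, rank, and spectrum below are illustrative assumptions, mimicking the approximately low-rank Hessians observed for deep networks): keep only the top eigenpairs of the Hessian, reducing storage from quadratic to linear in the model dimension:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 50, 5

# A synthetic approximately low-rank Hessian: a few large eigenvalues plus a
# small residual, mimicking spectra observed for deep networks.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
eigs = np.concatenate([np.array([10., 8., 6., 4., 2.]),
                       1e-3 * rng.random(d - 5)])
H = (U * eigs) @ U.T

# Rank-r approximation: keep the top-r eigenpairs (by magnitude).
vals, vecs = np.linalg.eigh(H)
idx = np.argsort(np.abs(vals))[::-1][:r]
lam, V = vals[idx], vecs[:, idx]      # storage: r scalars + a d-by-r matrix
H_r = (V * lam) @ V.T

# Memory: O(r d) floats instead of O(d^2).
full_cost, low_rank_cost = d * d, r * d + r
```

The spectral-norm error of the approximation is exactly the largest discarded eigenvalue, so when the spectrum decays quickly the compression is nearly lossless.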
5 Theoretical analysis
In this section, we provide a theoretical analysis of our algorithm. The key idea in our algorithm is to approximate the loss functions of previous tasks with quadratic functions. This leads to the following theoretical question: by running gradient descent on an approximate loss function, can we still minimize the actual loss that we are interested in?
For the purpose of theoretical analysis, we make a few simplifications to our setup. Without loss of generality, we study the training process of the last task , and still use to denote the model parameters obtained at the end of the th task. We use the loss function approximation in (1), but for simplicity we ignore the finite-sample effect and replace the empirical loss functions with the population loss functions, i.e., we define
(2) 
where represents or its low-rank approximation. The reason for this simplification is that our focus is the optimization aspect of the problem, while the generalization aspect can be tackled with tools such as uniform convergence mohri2018foundations. As discussed, during the training of the last task, we have access to the approximate loss function , whereas the actual loss function that we care about is . We also focus on gradient descent rather than its stochastic counterpart. In particular, let be the initial model parameter for the last task. We run the following update for :
(3) 
We use the following standard notions for a differentiable function .
Definition 1.
is smooth if , .
Definition 2.
is Hessian Lipschitz if , .
We assume that the loss functions are smooth and Hessian Lipschitz. We note that the Hessian Lipschitz assumption is standard in the analysis of nonconvex optimization nesterov2006cubic; jin2017escape.
Assumption 1.
We assume that is smooth and Hessian Lipschitz .
We also assume that the error between the matrices and is bounded.
Assumption 2.
We assume that for every , , where is defined in Assumption 1, and that for some .
5.1 Sufficient and worst-case necessary condition for one-step descent
We begin by analyzing a single step during training. Our goal is to understand whether, by running a single step of gradient descent on , we can minimize the actual loss function . More specifically, we have the following result.
Theorem 1.
We prove Theorem 1 in Appendix A. We emphasize that this result does not assume any convexity of the loss functions. The theorem provides a sufficient condition (4) under which, by running gradient descent on , we can still minimize the true loss function . Intuitively, this condition requires the gradient of to be large enough that its magnitude exceeds the error caused by the inexactness of the loss function. In Proposition 1 below, we see that this condition is also necessary in the worst case, at least when . More specifically, we can construct cases in which (4) is violated and the gradients of and point in opposite directions.
Proposition 1.
Suppose that , , . Then, there exists , , , and such that if , then .
We prove Proposition 1 in Appendix B. In addition, we note that Theorem 1 implies that, as training goes on and decreases, it is beneficial to decrease the learning rate : when decreases, the upper bound on that guarantees the decay of (i.e., ) also decreases. The importance of learning rate decay for continual learning has also been observed in recent empirical studies mirzadeh2020dropout.
5.2 Convergence analysis
Although the condition in (4) provides insights into the dynamics of the training algorithm, it is usually hard to check in every step, since we may not have good estimates of and . A practical implementation is to choose a constant learning rate along with an appropriate number of training steps. In this section, we provide bounds on the convergence behavior of our algorithm with a constant learning rate and iterations, for both nonconvex and convex loss functions. These results imply that early stopping can be helpful, and provide a theoretical treatment of the intuitive fact that the more iterations one spends optimizing the current task, the more forgetting can happen for the previous ones. We begin with a convergence analysis for nonconvex loss functions in Theorem 2, in which we use the common choice of learning rate for gradient descent on smooth functions bubeck2014convex.
Theorem 2 (nonconvex).
We prove Theorem 2 in Appendix C. Unlike in standard optimization analyses, the average norm of the gradients does not always decrease as increases, when or . Intuitively, as we move far from the points where we conduct the Taylor expansion, the gradient of becomes more and more inaccurate, and thus we need to stop early. We provide experimental evidence in Section 7.
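The early-stopping phenomenon can be reproduced in a two-dimensional toy example (entirely synthetic; the perturbed quadratic below merely stands in for an inexact surrogate): gradient descent on the approximate loss first decreases and then increases the true loss, so stopping at the right iteration is best.

```python
import numpy as np

# True loss: F(w) = 0.5 ||w||^2, minimized at the origin.
F = lambda w: 0.5 * w @ w

# Approximate loss: a quadratic with a perturbed minimizer, standing in for
# the inexact surrogate built from an approximate Hessian.
w_tilde = np.array([-2.0, 1.0])
grad_approx = lambda w: w - w_tilde

w = np.array([2.0, 1.0])
lr = 0.1
trajectory = [F(w)]
for _ in range(50):
    w = w - lr * grad_approx(w)
    trajectory.append(F(w))

# The iterates head toward the surrogate's (wrong) minimizer, passing near the
# true minimizer on the way: the true loss dips and then rises again.
best_step = int(np.argmin(trajectory))
```

The optimal stopping point `best_step` lies strictly inside the run: continuing past it strictly increases the true loss, mirroring the role of the and terms in Theorem 2.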
When the loss functions are convex, we can prove a better guarantee which does not have the and terms as in Theorem 2. More specifically, we have the following assumption and theorem.
Assumption 3.
is convex and , .
Theorem 3 (convex).
We prove Theorem 3 in Appendix D. As we can see, if or , we still cannot guarantee convergence to the true minimum of , due to the inexactness of . On the other hand, if the loss functions are quadratic and we store the full Hessian matrices, i.e., , then we have full information about the previous loss functions and recover the standard convergence rate of gradient descent on convex and smooth functions.
5.3 Connection to EWC
The elastic weight consolidation (EWC) algorithm kirkpatrick2017overcoming for continual learning is based on the Bayesian idea of estimating the posterior distribution of the model parameters. Interestingly, our algorithm has a connection to EWC, although their underlying ideas are quite different. More specifically, we show that in some cases, the regularization term that the EWC algorithm uses can be considered a diagonal approximation of the Hessian matrix of the loss function. Suppose that in the th task, the data points are sampled from a probabilistic model with likelihood function , and we use the negative log-likelihood as the loss function, i.e., . Suppose that at the end of this task, we obtain the ground-truth model parameter . Then we know that , and that the Fisher information of the th coordinate of is . The EWC algorithm constructs a regularization term as a proxy for the loss function of the th task and uses it in the following tasks. As we can see, in this case, the quadratic regularization in EWC is a diagonal approximation of the quadratic term in our loss function approximation.
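This connection can be checked numerically in a toy Gaussian model (our illustrative setup, not the paper's experiments): for the negative log-likelihood of observations drawn from a unit-variance Gaussian, the Hessian of the loss is the identity, and the diagonal Fisher information estimated from per-sample gradients, as in EWC, matches its diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Probabilistic model: y ~ N(theta, I); the negative log-likelihood is, up to
# a constant, l(theta; y) = 0.5 * ||y - theta||^2.
theta_star = np.array([1.0, -2.0])
Y = theta_star + rng.normal(size=(100000, 2))

theta_hat = Y.mean(axis=0)          # MLE at the end of the task

# Per-sample gradients of the NLL at theta_hat.
G = theta_hat - Y                   # gradient of 0.5 * ||y - theta||^2

# EWC-style diagonal Fisher: mean of squared per-sample gradients.
fisher_diag = np.mean(G ** 2, axis=0)

# Hessian of the empirical loss here is exactly the identity, so the Fisher
# diagonal should be close to a vector of ones.
hessian_diag = np.ones(2)
```

In this special case the diagonal Fisher and the diagonal of the Hessian agree, which is exactly the sense in which EWC's regularizer is a diagonal approximation of our quadratic surrogate.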
6 A recursive implementation
As we have seen, one drawback of SOLA with low-rank approximation as presented in Section 4 is that the memory cost grows with the number of tasks. In this section, we present a more practical and memory-efficient implementation. Recall that is the empirical loss function for the th task, . We define the loss function approximation recursively: we begin with , and for every , we define
(5)  
(6) 
where is a rank approximation of the Hessian matrix . That is, at the end of task , we compute the second-order Taylor expansion of the approximate loss function at , with the Hessian matrix replaced by the low-rank approximation . Thus, we only need to store and , and the memory cost is , which does not grow with . We formally present this approach in Algorithm 2. In our experiments in Section 7, we use this recursive implementation for SOLA with low-rank approximation.
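A sketch of this recursive scheme (class and variable names are ours; the rank, dimensions, and quadratic toy tasks are illustrative assumptions): at the end of each task, the previous surrogate's gradient and low-rank Hessian are folded into a new expansion, so memory stays fixed regardless of the number of tasks.

```python
import numpy as np

def top_r(H, r):
    """Rank-r eigen-approximation of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(H)
    idx = np.argsort(np.abs(vals))[::-1][:r]
    return vecs[:, idx], vals[idx]

class RecursiveSOLA:
    """Keeps one accumulated gradient and one rank-r Hessian: O(r d) memory."""
    def __init__(self, d, r):
        self.r = r
        self.g = np.zeros(d)        # gradient of the surrogate at w_ref
        self.V = np.zeros((d, r))   # eigenvectors of the rank-r Hessian
        self.lam = np.zeros(r)      # eigenvalues
        self.w_ref = np.zeros(d)    # expansion point of the current surrogate

    def surrogate_grad(self, w):
        """Gradient of the quadratic surrogate of all previous tasks."""
        return self.g + self.V @ (self.lam * (self.V.T @ (w - self.w_ref)))

    def end_task(self, w_k, task_grad, task_hess):
        # Fold the finished task into the surrogate, then re-compress to rank r.
        g_total = task_grad + self.surrogate_grad(w_k)
        H_total = task_hess + (self.V * self.lam) @ self.V.T
        self.V, self.lam = top_r(H_total, self.r)
        self.g, self.w_ref = g_total, w_k

# Two toy quadratic tasks whose Hessians have rank 2 each, so rank r = 4
# represents their sum exactly and the surrogate gradient is exact.
rng = np.random.default_rng(4)
d, r = 6, 4
A1, A2 = rng.normal(size=(d, 2)), rng.normal(size=(d, 2))
H1, H2 = A1 @ A1.T, A2 @ A2.T
c1, c2 = rng.normal(size=d), rng.normal(size=d)

sola = RecursiveSOLA(d, r)
w1 = rng.normal(size=d)               # parameters at the end of task 1
sola.end_task(w1, H1 @ (w1 - c1), H1)
w2 = rng.normal(size=d)               # parameters at the end of task 2
sola.end_task(w2, H2 @ (w2 - c2), H2)
```

When the accumulated Hessian has rank above r, the re-compression introduces the approximation error that the analysis in Section 5 bounds; here the rank is small enough for the surrogate gradient to match the true sum of task gradients.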
7 Experiments
We implement the experiments with TensorFlow abadi2016tensorflow. When computing the exact Hessian matrix or its low-rank approximation, we treat each tensor in the model independently; in other words, we compute a block-diagonal approximation of the Hessian matrix. This technique has the benefit that the Hessian computation is independent of the model architecture, and it has been used in recent studies on second-order optimization gupta2018shampoo. We use the recursive implementation for SOLA with low-rank approximation. In the following, we denote SOLA with the exact Hessian matrix and with the low-rank approximation by SOLAexact and SOLAprox, respectively. To compute the low-rank matrix, we make use of Hessian-vector products and provide details in Appendix E.
Datasets. We use multiple standard continual learning benchmarks based on the MNIST lecun1998gradient and CIFAR-10 krizhevsky2009learning datasets: Permuted MNIST goodfellow2013empirical, Rotated MNIST lopez2017gradient, Split MNIST zenke2017continual, and Split CIFAR (similar to a dataset in chaudhry2018efficient). In Permuted MNIST, for each task, we choose a random permutation of the pixels of the MNIST images, and reorder all the images accordingly. We use a task Permuted MNIST in the experiments. In Rotated MNIST, for each task, we rotate the MNIST images by a particular angle. In our experiments, we choose a task Rotated MNIST, with rotation angles of , , , , and degrees. For Split MNIST, we split the labels of the MNIST dataset into disjoint subsets, and for each task, we use the MNIST data whose labels belong to a particular subset. In this paper, we use a task Split MNIST, with label subsets , , , , and . Split CIFAR is defined similarly to Split MNIST, and we use a task Split CIFAR with label subsets and .
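The Hessian-vector-product computation mentioned above can be sketched as follows (a generic matrix-free recipe, not the exact implementation of Appendix E): combined with power iteration, it recovers the top of the spectrum without ever forming the Hessian. The toy quadratic loss below is an assumption for checkability; on a quadratic, the finite-difference product is exact.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 20

# Toy loss with a known Hessian so the estimate can be checked:
# f(w) = 0.5 (w - c)^T H_true (w - c).
A = rng.normal(size=(d, d))
H_true = A @ A.T / d
c = rng.normal(size=d)
grad = lambda w: H_true @ (w - c)

def hvp(w, v, eps=1e-4):
    """Hessian-vector product via central finite differences of the gradient."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def top_eigenpair(w, iters=200):
    """Power iteration using only Hessian-vector products."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = hvp(w, v)
        v /= np.linalg.norm(v)
    return v @ hvp(w, v), v          # Rayleigh quotient and eigenvector

w0 = rng.normal(size=d)
lam_est, v_est = top_eigenpair(w0)
```

In an autodiff framework the finite-difference `hvp` would be replaced by an exact Hessian-vector product, and repeated deflation (or a Lanczos routine) yields the top eigenpairs needed for the rank approximation.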
Architecture.
We use both multilayer perceptrons (MLP) and convolutional neural networks (CNN). In most cases, we use an MLP with two hidden layers, sometimes denoted by MLP , with and being the numbers of hidden units. We may use CNN to denote a CNN model with convolutional layers, and provide details of the model in Appendix F. For Split MNIST and Split CIFAR, we use MLP and CNN models with a multi-head structure similar to those used in chaudhry2018efficient; farajtabar2019orthogonal. In the multi-head model, instead of having logits in the output layer, we use separate heads for different tasks, and each head corresponds to the classes of the associated task. During training, for each task, we only optimize the cross-entropy loss over the logits and labels of the corresponding output head.
Baselines. We compare SOLA with several baselines: the vanilla algorithm, which runs SGD over all tasks without storing any side information; the multitask algorithm, which assumes access to all the training data of previous tasks; the repetition-based AGEM algorithm chaudhry2018efficient, which stores a subset of data samples from previous tasks and forms a constrained optimization problem when training on new tasks; the regularization-based EWC algorithm kirkpatrick2017overcoming discussed in Section 5.3; and the orthogonal gradient descent (OGD) algorithm farajtabar2019orthogonal, which stores the gradients from previous tasks and forms a constrained optimization problem.³ Among these, our algorithm, along with the vanilla, EWC, and OGD algorithms, does not explicitly store raw data samples. Following prior works chaudhry2018efficient; farajtabar2019orthogonal; kirkpatrick2017overcoming, we choose a learning rate of and a batch size of . For all results, we report the average over independent runs, as well as the standard deviation (shown as shaded areas in the figures).

³ Comparison between OGD and SOLAprox: since OGD has a memory cost that grows linearly with the number of tasks but SOLAprox does not, we keep their average memory costs the same.
Dataset  PMNIST  PMNIST  RMNIST  RMNIST  RMNIST  SMNIST  SMNIST 
Model type  MLP  MLP  MLP  MLP  CNN  MLP  MLP 
Model size  [10, 10]  [100, 100]  [10, 10]  [100, 100]  4conv  [10, 10]  [100, 100] 
Multitask  91.8 ± 0.4  97.0 ± 0.1  91.4 ± 0.4  97.5 ± 0.1  98.8 ± 0.1  98.9 ± 0.3  99.3 ± 0.1
AGEM  84.1 ± 1.1  93.2 ± 0.4  83.6 ± 1.0  92.6 ± 0.4  95.3 ± 0.3  91.2 ± 4.9  97.8 ± 0.4
Vanilla  69.2 ± 3.1  81.1 ± 1.6  76.8 ± 0.9  86.0 ± 0.5  89.5 ± 0.6  86.4 ± 6.6  97.2 ± 0.9
EWC  69.1 ± 3.7  80.2 ± 1.4  76.9 ± 1.0  86.1 ± 0.6  89.4 ± 0.7  87.7 ± 9.2  97.7 ± 0.8
OGD  68.9 ± 3.3  81.5 ± 1.7  81.1 ± 1.3  88.0 ± 0.7  89.5 ± 0.7  97.1 ± 1.8  98.8 ± 0.1
SOLAexact  90.0 ± 0.9  88.6 ± 0.9  96.3 ± 3.0
SOLAprox  86.2 ± 1.5  87.8 ± 0.6  86.5 ± 0.9  90.4 ± 0.5  92.2 ± 1.5  96.1 ± 2.5  99.0 ± 0.2
Model  Multitask  AGEM  Vanilla  EWC  OGD  SOLAexact  SOLAprox 

CNN2  75.9 ± 0.9  65.8 ± 2.1  57.2 ± 4.2  55.6 ± 4.6  56.5 ± 4.2  62.0 ± 5.4  59.4 ± 3.8
CNN6  78.6 ± 1.4  68.1 ± 2.3  57.5 ± 4.6  57.7 ± 3.8  58.3 ± 4.8  58.6 ± 5.2
MLP  69.2 ± 0.5  66.1 ± 0.7  63.5 ± 1.6  63.8 ± 2.1  65.8 ± 1.2  55.7 ± 3.2
Results. We provide a comprehensive comparison between SOLA and the baseline algorithms on a variety of combinations of datasets and models. Tables 1 and 2 present the results for the MNIST-based datasets and Split CIFAR, respectively. We make a few notes before discussing the results. First, the multitask algorithm uses all the data of previous tasks, and thus serves as an upper bound for the performance of continual learning algorithms. Second, since the AGEM algorithm stores a subset of data samples from previous tasks, it is not completely fair to compare AGEM with algorithms that do not store raw data; we still report its results for reference, storing data points for each task. Third, since the performance of the algorithms depends on the number of epochs trained on each task, we treat this quantity as a tuning parameter, and for each algorithm, we report the result corresponding to its best epoch choice. In particular, for the MNIST-based datasets, we choose the number of epochs from , and for Split CIFAR, from . Due to memory constraints, we only run SOLAexact on small models such as MLP and CNN2. We draw the following conclusions from the results:
If storing raw data is allowed, a repetition-based algorithm such as AGEM should be the method of choice. This underscores the importance of the information contained in the raw data samples. In some cases we observe that SOLA outperforms AGEM, e.g., on MLP ; however, we expect that the performance of AGEM can be improved if more data are stored in memory.

If storing raw data is not allowed due to privacy concerns, then in many scenarios SOLA outperforms the baseline algorithms. In particular, on the MNIST-based datasets, SOLAexact or SOLAprox achieves the best performance in out of settings.

On Split CIFAR, we observe mixed results. When the model is relatively small (CNN2) and we can store the exact Hessian matrix, SOLAexact achieves the best performance. On a relatively large CNN model (CNN6), none of the continual learning algorithms (EWC, OGD, SOLA) significantly outperforms the vanilla algorithm. On a large MLP, OGD performs best and the result for SOLAprox becomes worse. We believe the reason is that since we only use eigenvectors to approximate a very high-dimensional Hessian matrix in this experiment, the approximation error is too large for SOLAprox to find a descent direction close to the true gradient. This highlights the importance of future study of SOLA on models with more complicated structures or higher dimensions.
Performance vs. approximation. We study how the approximation of the Hessian matrices affects the performance of SOLAprox. In particular, we choose different values of the rank in SOLAprox and investigate its correlation with the final average test accuracy. Our theory implies that when the approximation of the Hessian matrices is better, i.e., is smaller, the final performance is better. Our experiments validate this point: Figures 0(a) and 0(b) show that as we increase , i.e., use more eigenvectors to approximate the Hessian matrix, the average test accuracy over all tasks improves.
Early stopping. Our theoretical analysis in Section 5 implies that early stopping can be helpful for SOLA. Here, we discuss empirical evidence. As we can see from Figure 0(c), on Permuted MNIST with MLP , the average test accuracy of SOLAprox degrades if we train more than epochs per task; similarly, from Figure 0(a), we can see that training each task for more epochs can hurt the performance of MLP on Permuted MNIST. However, this phenomenon is less severe on Rotated MNIST: in Figure 0(d), for SOLAprox with , we observe only one case where the average test accuracy gradually decreases as we increase the number of epochs per task. Moreover, we did not observe this phenomenon for SOLAexact. Hence, we conclude that the importance of early stopping for SOLA depends on how different the tasks are and how well we approximate the Hessian matrix. In Permuted MNIST, the pixels are randomly shuffled when switching to a new task, whereas in Rotated MNIST we only rotate the images by degrees; thus early stopping is more important for Permuted MNIST. On the other hand, if we store the exact Hessian matrix ( in Theorem 1), the approximation error of the gradients is small, and we can train for more epochs on new tasks. In addition, we note that early stopping has been observed to be helpful for other continual learning algorithms as well farajtabar2019orthogonal.
8 Conclusions
We propose the SOLA algorithm based on the idea of loss function approximation. We establish theoretical guarantees, make connections to the EWC algorithm, and present experimental results showing that in many scenarios, our algorithm outperforms several baseline algorithms, especially among the ones that do not explicitly store the raw data samples. Future directions include studying SOLA on broader classes of neural network architectures and parameter spaces with higher dimensions.
Acknowledgements
We would like to thank Dilan Gorur, Alex Mott, Clara Huiyi Hu, Nevena Lazic, Nir Levine, and Michalis Titsias for helpful discussions.
References

[1]
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard, et al.
Tensorflow: A system for largescale machine learning.
In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.  [2] R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio. Online continual learning with no task boundaries. arXiv preprint arXiv:1903.08671, 2019.
 [3] S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney. Learning to continually learn. arXiv preprint arXiv:2002.09571, 2020.
 [4] S. Bubeck. Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980, 2014.
 [5] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with AGEM. arXiv preprint arXiv:1812.00420, 2018.
 [6] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
 [7] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.
 [8] M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, 2020.
 [9] S. Farquhar and Y. Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
 [10] S. Farquhar and Y. Gal. A unifying Bayesian view of continual learning. arXiv preprint arXiv:1902.06494, 2019.
 [11] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019.

 [12] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
 [13] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
 [14] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
 [15] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. arXiv preprint arXiv:1802.09568, 2018.
 [16] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
 [17] G. Jerfel, E. Grant, T. L. Griffiths, and K. A. Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. In Advances in Neural Information Processing Systems, 2019.
 [18] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1724–1732. JMLR.org, 2017.
 [19] N. Kamra, U. Gupta, and Y. Liu. Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368, 2017.
 [20] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan. Measuring catastrophic forgetting in neural networks. In Thirty-second AAAI Conference on Artificial Intelligence, 2018.
 [21] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
 [22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [25] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. arXiv preprint arXiv:1904.00310, 2019.
 [26] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

 [27] B. Lüders, M. Schläger, and S. Risi. Continual learning through evolvable neural Turing machines. In NIPS 2016 Workshop on Continual Learning and Deep Networks (CLDL 2016), 2016.
 [28] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
 [29] S.I. Mirzadeh, M. Farajtabar, and H. Ghasemzadeh. Dropout as an implicit gating mechanism for continual learning. arXiv preprint arXiv:2004.11545, 2020.
 [30] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
 [31] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 [32] C. V. Nguyen, A. Achille, M. Lam, T. Hassner, V. Mahadevan, and S. Soatto. Toward understanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091, 2019.
 [33] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
 [34] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. 2018.
 [35] D. Park, S. Hong, B. Han, and K. M. Lee. Continual learning by asymmetric loss approximation with single-side overestimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3335–3344, 2019.
 [36] D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.
 [37] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
 [38] A. Rios and L. Itti. Closedloop GAN for continual learning. arXiv preprint arXiv:1811.01146, 2018.
 [39] H. Ritter, A. Botev, and D. Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.
 [40] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 [41] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
 [42] M. K. Titsias, J. Schwarz, A. G. d. G. Matthews, R. Pascanu, and Y. W. Teh. Functional regularisation for continual learning using Gaussian Processes. arXiv preprint arXiv:1901.11356, 2019.
 [43] M. Toneva, A. Sordoni, R. T. d. Combes, A. Trischler, Y. Bengio, and G. J. Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
 [44] J. Wen, Y. Cao, and R. Huang. Few-shot self reminder to overcome catastrophic forgetting. arXiv preprint arXiv:1812.00543, 2018.
 [45] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 177–186. ACM, 2014.
 [46] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations. ICLR, 2018.
 [47] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3987–3995. JMLR, 2017.
 [48] M. Zhang, T. Wang, J. H. Lim, and J. Feng. Prototype reminding for continual learning. arXiv preprint arXiv:1905.09447, 2019.
Appendix
Appendix A Proof of Theorem 1
We first provide a bound for the difference between the gradients of and .
Lemma 1.
Let . Then we have
We prove Lemma 1 in Appendix A.1. Since the loss functions for all the tasks are smooth, we know that is also smooth. Then we have
Therefore, as long as for some , we have
(7) 
Then we can complete the proof by combining (7) with Lemma 1.
A.1 Proof of Lemma 1
Appendix B Proof of Proposition 1
We first note that it suffices to construct and , as one can always choose and then the construction of and is equivalent to that of and . Let ,
One can easily check that , , and . In addition, since the second derivative of is always bounded in , we know that is smooth. Since , we know that is Hessian Lipschitz. Therefore, and satisfy all of our assumptions.
Since , we know that , , and then
is equivalent to , which implies that .
Appendix C Proof of Theorem 2
Similar to Appendix A, we define . According to Assumptions 1 and 2, we know that both and are smooth. By the smoothness of and using the fact that , we get
which implies
(9) 
By averaging (9) over , we get
By taking the square root on both sides, and using the Cauchy-Schwarz inequality as well as the fact that , we get
(10) 
We then proceed to bound . According to Lemma 1, we have
where