1 Introduction
The ability to acquire new knowledge over time while retaining previously learned experiences, referred to as continual learning (CL), brings machine learning closer to human learning
[29, 32, 3]. More specifically, given a stream of tasks, CL focuses on training a machine learning (ML) model to quickly learn a new task by leveraging the knowledge acquired from previous tasks, under a limited amount of computation and memory resources [19, 31]. As a result, the main challenge of existing CL algorithms is that they can quickly suffer from catastrophic forgetting. Also, memorizing previous tasks while learning new tasks further exposes CL models to adversarial attacks, especially model and data inference [34, 12, 36]. CL models can disclose private and sensitive information in the training set, such as healthcare data [44, 4, 16], financial records [38, 43], and biomedical images [27, 15]. Continuously accessing the data from previously learned tasks, either stored in episodic memories [7, 30, 2, 35, 28] or produced from generative memories [33, 37, 21], incurs additional privacy risk compared with a single ML model trained on a single task. However, there is still a lack of scientific study on protecting private training data in CL algorithms.
Motivated by this, we propose to preserve differential privacy (DP) [8], which offers rigorous privacy protection in probabilistic terms, for the training data in CL. Merely employing existing DP-preserving mechanisms can either cause a significantly large privacy loss or quickly exhaust the limited computation and memory resources when learning new tasks while memorizing previous tasks through either episodic or generative memories. Thus, effectively and efficiently preserving DP in CL remains a mostly open problem.
Key contributions. To effectively bound the DP privacy loss in CL, we first define continual adjacent databases (Def. 2) to capture the impact of the current task's data and the episodic memory on the privacy loss and model utility. Based upon that, we incorporate a moments accountant [1] into the Averaged Gradient Episodic Memory (A-GEM) algorithm [7] in a new DP-CL algorithm to preserve DP in CL.
Our idea is to configure the episodic memory in A-GEM as independent mini-memory blocks. For each task, we store a subset of the task's training data in a mini-memory block with an associated task index in the episodic memory. At each training step, we compute reference gradients on the mini-memory blocks independently. The reference gradients are used to optimize the process of memorizing previously learned tasks, as in A-GEM. More importantly, by keeping track of the task and mini-memory block indices, we can leverage a moments accountant to estimate the privacy cost spent on each mini-memory block. Based upon this, we derive a new strategy (Lemma 2) to bound the DP loss over the whole CL process while maintaining the computational efficiency of the A-GEM algorithm. To our knowledge, our proposed mechanism establishes the first formal connection between DP and CL. Experiments conducted on the permuted MNIST dataset [13] and the Split CIFAR dataset [41] show promising results in preserving DP in CL, compared with baseline approaches.

2 Background
In this section, we revisit continual learning and differential privacy, and introduce our problem statement. The goal of CL is to learn a model through a sequence of tasks such that learning each new task does not cause forgetting of the previously learned tasks. Let $D_t = \{(x_i, y_i)\}_{i=1}^{n_t}$ be the dataset at task $t$, consisting of $n_t$ samples, where each sample $x_i$ is associated with a label $y_i$. Each $y_i$ is a one-hot vector of $C$ categories: $y_i \in \{0, 1\}^C$. A classifier $f_\theta$ outputs class scores, mapping an input $x$ to a vector of scores $f_\theta(x) = (s_1, \dots, s_C)$ s.t. $\forall c: s_c \geq 0$ and $\sum_{c=1}^{C} s_c = 1$. The class with the highest score is selected as the predicted label for the sample. The classifier $f_\theta$ is trained by minimizing a loss function $\ell(f_\theta(x), y)$ that penalizes the mismatch between the prediction $f_\theta(x)$ and the true label $y$.

Averaged Gradient Episodic Memory (A-GEM) [7]. There is a sequence of $t-1$ tasks that have been learnt. The goal is to train the model at the current task $t$ so that it minimizes the loss on task $t$ and does not forget the previously learned tasks $\tau < t$. The key feature of A-GEM is to store a subset of data from each task $\tau$, denoted as $M_\tau$, in an episodic memory $M$. The algorithm then ensures that the loss on the average episodic memory across all the previously learned tasks, i.e., $\ell(f_\theta, M)$, does not increase at any training step. In A-GEM, the objective function of learning the current task is:

$$\min_{\theta} \; \ell(f_\theta, D_t) \quad \text{s.t.} \quad \ell(f_\theta, M) \leq \ell(f_{\theta^{t-1}}, M) \qquad (1)$$

where $\theta^{t-1}$ denotes the values of the model parameters learned after training task $t-1$, and $M = \cup_{\tau < t} M_\tau$.
The constrained optimization problem of Eq. 1 can be approximated quickly, and the updated gradient is as follows:

$$\tilde{g} = g - \frac{g^\top g_{ref}}{g_{ref}^\top g_{ref}} \, g_{ref} \qquad (2)$$

where $g$ is the proposed gradient update on $D_t$ and $g_{ref}$ is the reference gradient computed from the episodic memory of previous tasks. (When $g^\top g_{ref} \geq 0$, the proposed gradient is kept unchanged: $\tilde{g} = g$.)
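The projected update of Eq. 2 can be sketched in a few lines of NumPy. This is a minimal illustration of the A-GEM rule only (the function name and vectors are ours, not from the paper's code):

```python
import numpy as np

def agem_update(g, g_ref):
    """A-GEM projected gradient (Eq. 2): if the proposed gradient g conflicts
    with the reference gradient g_ref (negative inner product), remove the
    conflicting component so the average memory loss does not increase."""
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g  # no interference with previously learned tasks
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref
```

For example, with `g_ref = [1, 0]`, a proposed gradient `[-1, 2]` conflicts with the memory direction and is projected to `[0, 2]`, which is orthogonal to `g_ref`; a non-conflicting gradient such as `[1, 1]` is returned unchanged.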
Differential Privacy [9, 10, 20, 39]. To prevent training data leakage, DP restricts what adversaries can learn from the training data given the model parameters, by ensuring similar model outcomes with and without any single data sample in the dataset. The definition of DP is as follows:
Definition 1
$(\epsilon, \delta)$-DP [8]. A randomized algorithm $A$ fulfills $(\epsilon, \delta)$-DP if, for any two adjacent databases $D$ and $D'$ differing in at most one sample, and for all outcomes $O \subseteq Range(A)$, we have:

$$\Pr[A(D) \in O] \leq e^{\epsilon} \Pr[A(D') \in O] + \delta \qquad (3)$$

where $\epsilon$ is the privacy budget and $\delta$ is the broken probability.
The privacy budget $\epsilon$ controls the amount by which the distributions induced by $D$ and $D'$ may differ; a smaller $\epsilon$ enforces a stronger privacy guarantee. The broken probability $\delta$ bounds the improbable "bad" events in which an adversary can infer whether a particular data sample belongs to the training data; such events occur with probability at most $\delta$.
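To make Eq. 3 concrete, consider the classic randomized-response mechanism (an illustrative textbook example, not part of this paper's mechanism): reporting a private bit truthfully with probability $p$ and flipped otherwise satisfies pure $\epsilon$-DP ($\delta = 0$) with $\epsilon = \ln(p / (1 - p))$:

```python
import math

# Randomized response: report the true bit with probability p, flip otherwise.
# The two adjacent "worlds" are true bit = 1 vs true bit = 0.
p = 0.75
eps = math.log(p / (1 - p))  # ln(3) for p = 0.75

# Output probabilities under the two worlds:
# P[out=1 | 1] = p,     P[out=1 | 0] = 1 - p
# P[out=0 | 1] = 1 - p, P[out=0 | 0] = p
for pr_world_a, pr_world_b in [(p, 1 - p), (1 - p, p)]:
    # Eq. 3 with delta = 0: every output's probability under one world is
    # within e^eps of the other; tiny tolerance absorbs float rounding.
    assert pr_world_a <= math.exp(eps) * pr_world_b + 1e-12
```

Here the ratio of output probabilities is exactly $e^{\epsilon} = 3$, the worst case allowed by the guarantee.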
DP in Continual Learning. There are several works on DP in CL [11, 23]. In [11], the authors train a DP-GAN [42] to approximate the distribution of the past datasets. They leverage a small portion of public data (i.e., data that does not need to be kept private) to initialize and train the GAN in the first few iterations of each task, and then continue training the GAN model under a DP constraint. The trained generator produces adversarial examples imitating real examples of past tasks. The adversarial examples are then employed to supplement the actual data of the current training task. DP-L2M [23] perturbs the objective functions using a DP-AL mechanism [22, 24] and applies A-GEM to optimize the perturbed objective function. However, both [11, 23] lack a concrete definition of adjacent databases, leaving their DP protection unclear or not well justified. Different from existing works, we provide a formal DP protection for CL models.
3 Continual Learning with DP
This section establishes a connection between differential privacy and continual learning. We first propose a definition of continual adjacent databases in CL: two databases are continual adjacent if they differ in a single sample of the current task's training data and in a single sample of the episodic memory across all tasks. Formally:
Definition 2
Continual Adjacent Databases. Two databases $D = D_t \cup M$ and $D' = D'_t \cup M'$, where $M = \cup_{\tau < t} M_\tau$ and $M' = \cup_{\tau < t} M'_\tau$, are called continual adjacent if: $\|D_t - D'_t\|_1 \leq 1$ and $\|M - M'\|_1 \leq 1$.
A Naive Algorithm. Based upon Definition 2, a straightforward approach, called DP-AGEM, is to simply apply a moments accountant [1] to A-GEM [7] to preserve DP in CL. At each task $t$, we divide the dataset $D_t$ into two disjoint parts: a training part and a mini-memory block $M_t$. By using the training data with a sampling rate $q$, DP-AGEM computes a proposed gradient $g$, which is bounded by a predefined norm clipping bound $C$. In the real world, it is beneficial to keep track of the privacy budget spent on each task independently, as well as the total privacy budget used in the entire training process. To achieve this, in computing the reference gradients, the algorithm first randomly samples data from all the data samples in the episodic memory with a sampling probability $q_{ref}$. Given a particular block $M_\tau$ ($\tau < t$) in the episodic memory, the sampled data is used to compute a reference gradient $g_\tau$, which is clipped with the norm bound $C$. Then the Gaussian mechanism is employed to inject random Gaussian noise with a predefined noise scale $\sigma$ into both $g$ and each $g_\tau$. The reference gradient $g_{ref}$ is the average of the reference gradients computed on each block: $g_{ref} = \frac{1}{t-1} \sum_{\tau < t} g_\tau$. Finally, the updated gradient computed using Eq. 2 with $g$ and $g_{ref}$ is used to update the model parameters. After training task $t$, $M_t$ is added into the episodic memory $M$. The training process continues until the model has been trained on all tasks.
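The per-block clip-and-noise step of DP-AGEM can be sketched as follows. This is a simplified illustration under our own naming (`grad_fn` stands in for the actual per-block gradient computation; the subsampling machinery is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_and_noise(grad, clip_c, sigma):
    """Gaussian mechanism: clip the gradient to L2 norm at most clip_c,
    then add Gaussian noise with standard deviation sigma * clip_c."""
    norm = np.linalg.norm(grad)
    if norm > clip_c:
        grad = grad * (clip_c / norm)
    return grad + rng.normal(0.0, sigma * clip_c, size=grad.shape)

def dp_agem_reference_gradient(memory_blocks, grad_fn, clip_c, sigma):
    """DP-AGEM touches EVERY stored mini-memory block at every step:
    one noisy clipped gradient per block, averaged into g_ref. Accessing
    all blocks is why its privacy budget accumulates quickly across tasks."""
    noisy = [clip_and_noise(grad_fn(block), clip_c, sigma)
             for block in memory_blocks]
    return np.mean(noisy, axis=0)
```

With the noise scale set to zero, a block gradient `[3, 4]` is clipped to unit norm `[0.6, 0.8]` before averaging, so the sensitivity of each per-block contribution is bounded by the clipping bound.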
Since the norms of the gradients $g$ and $g_{ref}$ are bounded, we can leverage a moments accountant to bound the privacy loss for a single task, as well as the privacy loss accumulated across all tasks. Let $\epsilon$ be the privacy budget used to compute $g$ on the current task's data, and let $\epsilon_{ref}$ be the privacy budget spent on computing the reference gradient at each training task. The privacy budget used for a specific task and the total privacy budget of DP-AGEM accumulated until task $t$ can be computed in the following lemma.
Lemma 1
Until the task , 1) the privacy budget used for a specific and previously learned task is: , and 2) the total privacy budget of DP-AGEM is: .
Proof
We use induction to prove Lemma 1.
When , there is an empty episodic memory; therefore, and .
Hence, Lemma 1 is true for . Assume that Lemma 1 is true for ; then we have and . We need to show that Lemma 1 is true for .
We have: , and . Thus, Lemma 1 holds.
Two Levels of DP Protection. Based on our definition of continual adjacent databases (Def. 2), Lemma 1 provides two levels of DP protection to an arbitrary data sample, as follows. Until the task $t$: (1) Given the DP budget for a specific task, the participation information of an arbitrary data sample in that task is protected under DP given the released parameters, for any adjacent databases; and (2) the participation information of an arbitrary data sample in the whole training data is protected under DP given the released parameters, for any continual adjacent databases. This is fundamentally different from existing works [11, 23], which do not provide any formal DP protection in CL.
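Using the $(\epsilon, \delta)$-DP form of Eq. 3, the two guarantees can be sketched as follows (the symbols $\theta_t$ for the parameters released after task $t$, $\epsilon_{t'}$ for the task-level budget, and $\bar{\epsilon}_t$ for the total budget are illustrative placeholders):

```latex
% (1) Task-level protection for a task t', for any adjacent D_{t'}, D'_{t'}:
\Pr[\mathcal{A}(D_{t'}) = \theta_t] \;\le\; e^{\epsilon_{t'}} \Pr[\mathcal{A}(D'_{t'}) = \theta_t] + \delta
% (2) Protection over the whole training data, for any continual adjacent D, D':
\Pr[\mathcal{A}(D) = \theta_t] \;\le\; e^{\bar{\epsilon}_t} \Pr[\mathcal{A}(D') = \theta_t] + \delta
```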
Although DP-AGEM can preserve DP in CL, it suffers from a large privacy budget accumulation across tasks, resulting in loose DP protection that is impractical in the real world. To address this, we present an algorithm that tightens the DP loss.
DP-CL Algorithm. Our DP-CL algorithm (Alg. 1 and Figure 1) takes a sequence of tasks and their datasets as inputs. Samples of the current task are used to compute the proposed gradient update $g$ with a sampling rate $q$ (Line 6). We clip $g$ so that its norm is bounded by a predefined gradient clipping bound $C$, and then add random Gaussian noise to $g$ with a predefined noise scale $\sigma$ (Line 9). Note that after training each task, a subset of its samples is added to the episodic memory as a mini-memory block (Lines 17, 24-26). To reduce the privacy budget accumulated over the number of tasks, we limit the access to seen data of previous tasks by using a single, randomly selected mini-memory block from the episodic memory to compute the reference gradient $g_{ref}$ (Lines 20-23). We clip $g_{ref}$ by the gradient clipping bound $C$ and then add random Gaussian noise to it (Line 14). The updated gradient is computed by Eq. 2 (Line 15) and then used to update the model parameters (Line 16). The privacy budgets of our DP-CL algorithm can be bounded in the following lemma.

Lemma 2
Until the task , 1) the privacy budget used for a specific and previously learned task is: , where is the privacy budget used for a randomly chosen mini-memory block from the episodic memory to compute the reference gradient at each task, and 2) the total privacy budget of DP-CL is: .
Proof
It is obvious that our DP-CL algorithm significantly reduces the privacy consumption to a total budget that is linear in the number of training tasks. In addition, our sampling approach to compute the reference gradient is unbiased, since every data sample in the episodic memory has the same expectation of being selected. In our experiments, we show that DP-CL outperforms the baseline approach DP-AGEM.
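The single-block sampling step (Lines 20-23 of Alg. 1) and its unbiasedness can be illustrated with a small NumPy sketch. The function and variable names here are ours, and `grad_fn` again stands in for the actual per-block gradient computation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_cl_reference_gradient(memory_blocks, grad_fn, clip_c, sigma):
    """Sample ONE mini-memory block uniformly at random, then clip the
    resulting reference gradient and add Gaussian noise. Touching a single
    block per step is what keeps the accumulated budget linear in tasks."""
    k = int(rng.integers(len(memory_blocks)))   # uniformly chosen block
    g_ref = grad_fn(memory_blocks[k])
    norm = np.linalg.norm(g_ref)
    if norm > clip_c:                           # clip to the norm bound C
        g_ref = g_ref * (clip_c / norm)
    noise = rng.normal(0.0, sigma * clip_c, size=g_ref.shape)
    return k, g_ref + noise

# Unbiasedness of block selection: with equally sized blocks, every stored
# sample has the same probability of contributing to the reference gradient.
blocks = [np.arange(i * 4, i * 4 + 4, dtype=float) for i in range(5)]
counts = np.zeros(5)
for _ in range(20_000):
    k, _ = dp_cl_reference_gradient(blocks, lambda b: b, clip_c=1.0, sigma=0.0)
    counts[k] += 1
selection_freq = counts / counts.sum()  # each entry approaches 1/5
```

The Monte Carlo frequencies concentrate around 1/5 per block, matching the claim that each sample's selection expectation is identical.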
4 Experimental Results
Table 1: Forgetting (F), worst-case forgetting, and LCA on the permuted MNIST dataset.

|          | Forgetting (F) | Worst-case F | LCA |
| A-GEM    |                |              |     |
| ε = 0.85 | 0.0070         |              |     |
| ε = 0.9  |                |              |     |
| ε = 0.95 |                |              |     |
| ε = 1.0  |                |              |     |
| ε = 1.15 |                |              |     |
| ε = 1.30 |                |              |     |

Table 2: Forgetting (F), worst-case forgetting, and LCA on the Split CIFAR dataset.

|          | Forgetting (F) | Worst-case F | LCA |
| A-GEM    |                |              |     |
| ε = 0.95 |                |              |     |
| ε = 0.96 |                |              |     |
| ε = 0.97 |                |              |     |
| ε = 0.98 |                |              |     |
| ε = 0.99 |                |              |     |
| ε = 1.0  |                |              |     |
We have conducted experiments on the permuted MNIST dataset [13] and the Split CIFAR dataset [41]. The permuted MNIST dataset is a variant of the MNIST dataset [18] of handwritten digits, involving ten-digit classification, where each task consists of a different random permutation of the input pixels of the images. Split CIFAR [41] is a split version of the original CIFAR-100 dataset [17]; it consists of disjoint subsets, where each subset is constructed by randomly sampling classes without replacement from the total of 100 classes. Our validation focuses on shedding light on the interplay between model utility and privacy loss when preserving DP in CL. Our code and datasets are available on GitHub: https://github.com/PhungLai728/DPCL.
Baseline Approaches. We evaluate our DP-CL algorithm and compare it with A-GEM [7], one of the state-of-the-art CL algorithms. Note that A-GEM does not preserve DP; therefore, we only use A-GEM to show the upper bound on model performance. We apply four well-known metrics, including the average accuracy, the average forgetting measure [6], the worst-case forgetting measure [7], and the learning curve area (LCA) [7], to evaluate our mechanism.
Model Configuration. For the permuted MNIST dataset, we use a fully connected network with two hidden layers. Given the stream of tasks, the model is optimized via stochastic gradient descent. The batch size is set separately for each training task and for the mini-memory block, and the noise scale and gradient clipping bound are fixed. For the Split CIFAR dataset, we use a reduced ResNet18 [19, 14] with three times fewer feature maps across all the layers, followed by a final linear classifier for prediction. Other hyper-parameters, e.g., learning rate, noise scale, gradient clipping bound, etc., are the same as in the permuted MNIST experiment. Results of each experiment are averaged over multiple runs.

Comparing Privacy Accumulation. Since the number of data samples and the sampling rate remain the same for every task, the privacy budgets for computing the proposed and reference gradients can be the same for every task. Therefore, for the sake of clarity and without loss of generality, in this privacy accumulation comparison between DP-AGEM and our DP-CL algorithm, we draw random Gaussian values and assign the generated values as the per-task privacy budgets.
Figure (a) illustrates how the privacy loss accumulates over tasks in DP-AGEM and in our DP-CL algorithm. Our algorithm achieves a notably tighter privacy budget compared with DP-AGEM, which accesses data samples from the whole episodic memory to compute the reference gradient. When the number of tasks increases, DP-AGEM's privacy budget exponentially increases. In contrast, our approach's privacy budget increases only slightly and is linear in the number of tasks or training steps.
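The difference in growth rates can be seen with a toy accounting sketch that simply counts mini-memory accesses (an illustrative simplification of ours, not the moments-accountant computation of the lemmas):

```python
# DP-AGEM reads every stored mini-memory block at every task, so the total
# number of block accesses grows quadratically with the number of tasks T;
# DP-CL reads one randomly chosen block per task, so accesses grow linearly.

def dp_agem_accesses(num_tasks):
    # at task t there are t - 1 stored blocks, and all of them are accessed
    return sum(t - 1 for t in range(1, num_tasks + 1))

def dp_cl_accesses(num_tasks):
    # at task t exactly one stored block is accessed (none at the first task)
    return sum(1 for t in range(2, num_tasks + 1))
```

For 5 tasks, DP-AGEM performs 0 + 1 + 2 + 3 + 4 = 10 block accesses versus 4 for DP-CL, and the gap widens rapidly as the task stream grows; each access incurs additional privacy cost under composition.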
Privacy Loss and Model Utility. From our theoretical analysis, DP-AGEM suffers from a huge privacy budget accumulation over tasks. Therefore, for the sake of simplicity, we only compare our DP-CL algorithm with the noiseless A-GEM model.
As shown in Figures (b) and (c), our proposed method achieves an average accuracy comparable with the noiseless A-GEM model at the first task. On the permuted MNIST dataset, when the number of tasks increases, the average accuracy of our DP-CL drops faster than that of the A-GEM model, even under a tight privacy budget. When the privacy budget increases, the average accuracy gap between our model and the noiseless A-GEM is larger, indicating that preserving DP in CL may increase catastrophic forgetting. This phenomenon is further clarified by the measures of forgetting, worst-case forgetting, and LCA (Table 1): as the privacy budget grows, the forgetting and worst-case forgetting significantly increase, and the LCA moderately decreases in DP-CL.
On the Split CIFAR dataset, when the number of tasks increases, the average accuracy of DP-CL drops quickly while the average accuracy of the A-GEM model fluctuates. The fluctuation in the A-GEM model is probably due to the curse of dimensionality: the number of training examples is much smaller than the number of trainable parameters in the ResNet18. Different from the permuted MNIST dataset, on the Split CIFAR dataset the average accuracy gap between DP-CL and the noiseless A-GEM shrinks as the privacy budget increases, especially at the first task. This shows the trade-off between privacy budget and model utility: when we spend more privacy budget, the model accuracy improves. The gap between DP-CL's and A-GEM's average accuracy becomes significantly bigger as the number of tasks increases, but the differences among different privacy budgets decrease. As shown in Table 2, when the privacy budget increases, the forgetting and worst-case forgetting significantly increase, while the LCA fluctuates only slightly. This further confirms our observation on the MNIST dataset that preserving DP in CL may increase catastrophic forgetting.
Key observations. From our preliminary experiments, we obtain the following observations. (1) Merely incorporating the moments accountant into A-GEM causes a large privacy budget accumulation. (2) Although our DP-CL algorithm can preserve DP in CL, optimizing the trade-off between model utility and privacy loss in CL remains an open problem, since the privacy noise can worsen catastrophic forgetting.
5 Conclusion and Future Work
In this paper, we established the first formal connection between DP and CL. We combine the moments accountant and A-GEM in a holistic approach to preserve DP in CL under a tightly accumulated privacy budget. Our model shows promising results under strong DP guarantees in CL and opens a new research line for optimizing the trade-off between model utility and privacy loss. One immediate question is how to align the privacy noise with the catastrophic forgetting under the same privacy protection. We also plan to apply our approach to a broader range of models and datasets, especially under adversarial attacks [34, 5] and heterogeneous and adaptive privacy-preserving mechanisms [25, 26, 40]. Our work further highlights an open direction of quantifying the privacy risk given diverse correlations among tasks: learning a highly related task can further disclose private information of another task, and vice versa.
Acknowledgments
The authors gratefully acknowledge the support from the National Science Foundation (NSF) grants NSF CNS1935928/1935923, CNS1850094, IIS2041096/2041065.
References
 [1] (2016) Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
 [2] (2020) Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3931–3940.
 [3] (2019) Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11254–11263.
 [4] (2017) An adaptive differential privacy algorithm for range queries over healthcare data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pp. 397–402.
 [5] (2021) Extracting training data from large language models. In USENIX Security Symposium.
 [6] (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In ECCV, pp. 532–547.
 [7] (2019) Efficient lifelong learning with A-GEM. In International Conference on Learning Representations.
 [8] (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
 [9] (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.
 [10] (2008) Differential privacy: a survey of results. In International Conference on Theory and Applications of Models of Computation, pp. 1–19.
 [11] (2018) Differentially private continual learning. In Privacy in Machine Learning and AI Workshop at ICML.
 [12] (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333.
 [13] (2014) An empirical investigation of catastrophic forgetting in gradient-based neural networks. In ICLR.
 [14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 [15] (2013) Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature 500 (7461), pp. 168–174.
 [16] (2019) Differential privacy for the vast majority. ACM Transactions on Management Information Systems (TMIS) 10 (2), pp. 1–15.
 [17] (2009) Learning multiple layers of features from tiny images.
 [18] (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
 [19] (2017) Gradient episodic memory for continual learning. In Neural Information Processing Systems (NeurIPS).
 [20] (2007) Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pp. 94–103.
 [21] (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11321–11329.
 [22] (2020) Scalable differential privacy with certified robustness in adversarial learning. In International Conference on Machine Learning, pp. 7683–7694.
 [23] (2019) Differentially private lifelong learning. In Privacy in Machine Learning (NeurIPS 2019 Workshop).
 [24] (2019) Preserving differential privacy in adversarial learning with provable robustness. CoRR abs/1903.09822.
 [25] (2019) Heterogeneous Gaussian mechanism: preserving differential privacy in deep learning with provable robustness. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19), pp. 4753–4759.
 [26] (2017) Adaptive Laplace mechanism: differential privacy preservation in deep learning. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 385–394.
 [27] (2014) Deep learning for neuroimaging: a validation study. Frontiers in Neuroscience 8, pp. 229.
 [28] (2020) iTAML: an incremental task-agnostic meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13588–13597.
 [29] (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review 97 (2), pp. 285.
 [30] (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR.
 [31] (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671.
 [32] (2018) Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537.
 [33] (2017) Continual learning with deep generative replay. In Neural Information Processing Systems (NeurIPS).
 [34] (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18.
 [35] (2020) Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12183–12192.
 [36] (2015) Regression model fitting under differential privacy and model inversion attack. In IJCAI, pp. 1003–1009.
 [37] (2018) Memory Replay GANs: learning to generate new categories without forgetting. In Neural Information Processing Systems (NeurIPS).
 [38] (2020) The value of collaboration in convex machine learning with differential privacy. In IEEE Symposium on Security and Privacy.
 [39] (2019) GANobfuscator: mitigating information leakage under GAN via differential privacy. IEEE Transactions on Information Forensics and Security 14 (9), pp. 2358–2371.
 [40] (2021) Removing disparate impact on model accuracy in differentially private stochastic gradient descent. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD '21), pp. 1924–1932.
 [41] (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995.
 [42] (2018) Differentially private releasing via deep generative model (technical report). arXiv preprint arXiv:1801.01594.
 [43] (2017) Differential privacy and applications.
 [44] (2020) Application of differential privacy approach in healthcare data – a case study. In 2020 14th International Conference on Innovations in Information Technology (IIT), pp. 35–39.