1 Introduction
When building a classification system, one rarely has all the data to be used for training available at the outset. More often, one starts by pre-training a model with some “core” dataset (e.g. ImageNet, or datasets close to the target task) and then incorporates various cohorts of task-specific data as they become available from diverse sources. In some cases, the wrong data may be incorporated inadvertently, or the owners may change their mind and demand that their data be removed. One can, of course, restart the training from scratch every time such a demand is made, but at a significant cost of time and disruption. What if one could remove the effect of cohort(s) of data
à la carte, without retraining, in such a way that the resulting model is functionally indistinguishable from one that has never seen the cohort(s) in question, and in addition has no residual information about them buried in the weights of the model? Of course, forgetting can always be trivially achieved by zeroing the weights or replacing them with random noise, but this comes at the expense of the accuracy of the model. Can we forget the cohort of interest without interfering with information about the other data, while preserving, to the extent possible, the accuracy of the trained model? Recently, the problem of forgetting has received considerable attention [15, 16, 13, 19, 5, 24, 35, 41, 6, 39, 12, 7, 36], but solutions have focused on simpler machine learning problems, such as linear or logistic regression. Removing information from the weights of a standard convolutional network remains an open problem, with some initial results working only on small-scale problems
[15, 16]. This is mainly due to the highly non-convex loss landscape of CNNs, which makes the influence of a particular sample on the optimization trajectory and the final weights highly non-trivial to model.

In this paper we introduce Mixed-Linear Forgetting (ML-Forgetting), a method to train large-scale computer vision models in such a way that information about a subset of the data can be removed on request – with strong bounds on the amount of remaining information – while at the same time retaining close to state-of-the-art accuracy on the tasks. To the best of our knowledge, this is the first algorithm to achieve forgetting for deep networks trained on large-scale computer vision problems without compromising accuracy. To further improve performance in realistic use cases, we introduce the notion of forgetting in a
mixed-privacy setting, that is, when we know that a subset of the training dataset, which we call core data, will not need to be forgotten. For example, the core data may be a large dataset of generic data used for pre-training (e.g., ImageNet) or a large freely available collection of task-specific data (e.g., a self-driving dataset) which is unlikely to change. We show that ML-Forgetting can naturally take advantage of this setting to improve both accuracy and the bounds on the amount of remaining information after forgetting.
One of the main challenges of forgetting in deep networks is estimating the effect of a given training sample on the parameters of the model, which has led research to focus on simpler convex learning problems, such as linear or logistic regression, for which a theoretical analysis is feasible. To address this problem, Mixed-Linear Forgetting uses a first-order Taylor-series-inspired decomposition of the network to learn two sets of weights: a set of core weights, which is trained only with the core data using a standard (non-convex) algorithm, and a set of linear user weights, which is trained to minimize a quadratic loss function on the changeable user data. Since forgetting is not required on the core data, the core weights are learned through standard training, while the user weights are obtained as the solution to a strongly convex quadratic optimization problem. This allows us to remove the influence of a subset of the user data with strong guarantees. Moreover, by construction, simply setting the user weights to zero removes the influence of all changeable data with the lowest possible drop in performance, thus easily allowing the user to remove all of their data at once.

To summarize, our key contributions are:


We introduce the problem of forgetting (also called unlearning, data deletion, or scrubbing) in a mixed-privacy setting which, compared to previous formalizations, is better tailored to standard practice and allows for better privacy guarantees.

In this setting we propose ML-Forgetting, which trains a set of non-linear core weights and a set of linear user weights. This allows it to achieve both good accuracy, thanks to the flexibility of the non-linear weights, and strong privacy guarantees, thanks to the linear weights.

As a side effect, all the user data may be forgotten completely, with the lowest possible drop in performance, by simply erasing the user weights.

We show that ML-Forgetting can be applied to large-scale vision datasets, and enjoys both strong forgetting guarantees and test-time accuracy comparable to standard training of a Deep Neural Network (DNN). To the best of our knowledge, this is the first forgetting algorithm to do so.

Furthermore, we show that ML-Forgetting can handle multiple sequential forgetting requests without degrading performance, which is important for real-world applications.
2 Related Work
Forgetting. The problem of machine unlearning was introduced in [8], with an efficient forgetting algorithm for statistical query learning. [32, 13] give methods for forgetting for particular classes of learning algorithms, such as k-means clustering. Other methods split the data into multiple subsets and train models separately on combinations of them [6, 41]. This allows perfect forgetting, but incurs heavy storage costs, as multiple models/gradients need to be stored. In the context of model interpretability and cross-validation, [27, 14] provided a Hessian-based method for estimating the influence of a training point on the model predictions. [5] proposed a method to hide information about an entire class from the output logits, but it does not remove information from the model weights.
[19] proposed to remove information from the weights on convex problems using Newton’s method, and uses differential privacy [1, 11, 10, 9] to certify data removal. [24] provides a projective residual update method using synthetic data points to delete data points from linear/logistic regression based models. [36] proposed an unlearning mechanism for logistic regression and Gaussian processes in a Bayesian setting using variational inference. Recently, [35] proposed a gradient-descent-based method for data deletion in convex settings, with theoretical guarantees for multiple forgetting requests. They also introduce the notion of statistical indistinguishability of the entire state or just the outputs, similar to the information-theoretic framework of [16]. We use some of their proof techniques for our theoretical results.

Deep networks pose additional challenges to forgetting due to their highly non-convex loss functions. [15] proposed an information-theoretic procedure to scrub the information from intermediate layers of a DNN trained with stochastic gradient descent (SGD), exploiting the stability of SGD [21]. They also bound the amount of remaining information in the weights [3] after scrubbing. [16] extend the framework of [15] to activations. They also show that an approximation of the training process based on a first-order Taylor expansion of the network (NTK theory) can be used to estimate the weights after forgetting. This approximation works well on small-scale vision datasets; however, the approximation accuracy and the computational cost degrade for larger datasets (in particular, the cost is quadratic in the number of samples). We also use linearization but, crucially, instead of linearly approximating the training dynamics of a non-linear network, we show that we can directly train a linearized network for forgetting. This ensures that our forgetting procedure is correct, and it allows us to scale easily to standard real-world vision datasets.

Linearization. Using a first-order Taylor expansion (linearization) of the network to study its behavior has gained interest recently in NTK theory [25, 30] as a tool to study the dynamics of DNNs in the limit of infinite filters. [33]
shows that, beyond being a theoretical tool, a (finite) linearized network can be trained directly using an efficient algorithm for the Jacobian-vector product computation.
[2] show that, with some changes to the architecture and training process, linearized models can match the performance of non-linear models on many vision tasks, while still maintaining a convex loss function.

3 Preliminaries and Notation
We use the empirical risk minimization (ERM) framework throughout this paper for training. Let the training set be a collection of input-output pairs, where each input datum is, for example, an image, and each output is, for example, a one-hot vector in classification. Given an input image, let a function (for instance, a DNN) parameterized by a set of weights model the input-output relation. Given an input-target pair, we denote the empirical risk, or training loss, on that pair accordingly, sometimes abusing notation by dropping the arguments; for a training dataset, we denote the total training loss on it similarly and use the two interchangeably. We also write the weights obtained after a number of steps of a training algorithm (SGD, in our case) from a given initialization, the norm of a vector, and the largest eigenvalue of a matrix Q. To keep the notation uncluttered, we also use shorthand where convenient.

4 The Forgetting Problem
The weights of a trained deep network are a (possibly stochastic) function of the training data. As such, they may retain information about the training samples, which an attacker can extract. A forgetting procedure (also called a scrubbing function) is a function which, given a set of weights trained on a dataset and a subset of images to forget, outputs a new set of weights which is indistinguishable from weights obtained by training without that subset. (We will abuse the notation and omit the arguments of the scrubbing function when they are clear from the context.)
Readout functions. The success of the forgetting procedure can be measured by looking at whether there exists a discriminator function that can guess – at better than chance probability – whether a set of weights was trained with or without the data to forget, or whether it was trained with that data and then scrubbed. Following [15, 16], we call such functions readout functions. A popular example of a readout function is the confidence of the network (that is, the entropy of the output softmax vector) on the samples to forget: since networks tend to be overconfident on their training data [20, 28], a higher than expected confidence may indicate that the network was indeed trained on those samples. We discuss more readout functions in section 8.1. Alternatively, we can measure the success of the forgetting procedure through the remaining mutual information between the scrubbed weights and the data to be forgotten. (The mutual information between two random variables is defined through their joint and marginal distributions.) While this is more difficult to estimate, it can be shown that it upper-bounds the amount of information that any readout function can extract [15, 16]. Said otherwise, it is an upper bound on the amount of information that an attacker can extract about the forgotten data using the scrubbed weights.

Quadratic forgetting.
An important example is forgetting in a linear regression problem, which has a quadratic loss function. Given the weights $w$ obtained after training on the full dataset $\mathcal{D}$, the optimal forgetting function for a set $\mathcal{D}_f$ of samples to forget is given by:

$$S(w) = w - H^{-1} g \qquad (1)$$

where $H$ and $g$ are, respectively, the Hessian and the gradient of the loss function computed on the remaining data $\mathcal{D}_r = \mathcal{D} \setminus \mathcal{D}_f$, so that eq. 1 can be interpreted as a reverse Newton step that unlearns the data [19, 15]. Since the “user weights” of ML-Forgetting minimize a similar quadratic loss function, as we will discuss in Section 6, eq. 1 also describes the optimal forgetting procedure for our model. The main challenge for us will be how to accurately compute the forgetting step, since the Hessian matrix cannot be computed or stored in memory due to the high number of parameters of a deep network (section 6).
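Because the loss is exactly quadratic, the reverse Newton step of eq. 1 lands precisely on the weights one would obtain by retraining from scratch on the remaining data. The following numpy sketch checks this for ridge regression; the sizes, the regularization coefficient, and the synthetic data are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 0.1  # ridge term keeps the problem strongly convex

def fit(X, y):
    # closed-form ridge regression minimizer
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_full = fit(X, y)           # trained on all data
keep = slice(10, n)          # forget the first 10 samples
Xr, yr = X[keep], y[keep]

# One reverse Newton step on the retained-data loss: w - H^{-1} g
H_r = Xr.T @ Xr + lam * np.eye(d)
g_r = Xr.T @ (Xr @ w_full - yr) + lam * w_full
w_scrubbed = w_full - np.linalg.solve(H_r, g_r)

w_retrained = fit(Xr, yr)    # paragon: retrain from scratch on retained data
assert np.allclose(w_scrubbed, w_retrained)
```

One Newton step on the retained-data loss recovers the retrained solution to machine precision, which is what makes the quadratic case the ideal setting for forgetting.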
Convex forgetting. Unfortunately, for more general machine learning models we do not have a closed-form expression for the optimal forgetting step. However, it can be shown [27] that eq. 1 is always a first-order approximation of the optimal forgetting step. [19] shows that for strongly convex Lipschitz loss functions, the discrepancy between eq. 1 and the optimal forgetting is bounded. Since this discrepancy – even if bounded – can leak information, a possible solution is to add a small amount of noise after forgetting:
$$w \leftarrow S(w) + \sigma\,\epsilon \qquad (2)$$

where $\epsilon$ is a vector of random Gaussian noise, which aims to destroy any information that may leak through small discrepancies. Increasing the variance $\sigma^2$ of the noise destroys more information, thus making forgetting more secure, but it also reduces the accuracy of the model, since the weights become increasingly random. The curve of possible Pareto-optimal trade-offs between accuracy and forgetting can be formalized with the Forgetting Lagrangian [15].

Alternatively, to forget data in a strongly convex problem, one can fine-tune the weights on the remaining data using perturbed projected GD [35]. Since projected GD converges to the unique minimum of a strongly convex function regardless of the initial condition (contrary to SGD, which may not converge unless a proper learning-rate schedule is used), this is guaranteed to remove all influence of the initial data [35]. The downside is that gradient descent (GD) is impractical for large-scale deep learning applications compared to SGD, projection-based algorithms are rarely used in practice, and commonly used loss functions are not generally Lipschitz.
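On a quadratic loss, the accuracy cost of the noise in eq. 2 can be computed in closed form: adding Gaussian noise with standard deviation sigma raises the expected loss by (sigma^2 / 2) times the trace of the Hessian. A hedged numpy sketch on synthetic linear-regression data (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.linalg.lstsq(X, y, rcond=None)[0]  # stand-in for the scrubbed weights

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def noisy_forget(w, sigma):
    # eq. 2: Gaussian noise destroys residual information left by an
    # approximate forgetting step, at the price of a higher loss
    return w + sigma * rng.normal(size=w.shape)

# Expected loss increase is sigma^2/2 * tr(H), with Hessian H = X^T X / n here
H = X.T @ X / n
sigma = 0.1
avg = np.mean([loss(noisy_forget(w, sigma)) for _ in range(2000)])
assert abs(avg - loss(w) - sigma**2 / 2 * np.trace(H)) < sigma**2 * np.trace(H)
```

This trade-off (more noise, more forgetting, less accuracy) is exactly the accuracy-forgetting curve formalized by the Forgetting Lagrangian mentioned above.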
Non-convex forgetting. Due to their highly non-convex loss landscape, small changes in the training data can cause large changes in the final weights of a deep network. This makes applying eq. 2 challenging. [15] shows that pre-training helps increase the stability of SGD, derives an expression similar to eq. 2 for DNNs, and provides a way to upper-bound the amount of remaining information in a DNN. [16] builds on recent results on linear approximation of DNNs and approximates the training path of a DNN with that of its linear approximation. While this improves the forgetting results, the approximation is still not good enough to remove all the information. Moreover, computing the forgetting step scales quadratically with the number of training samples and classes, which restricts the applicability of the algorithm to smaller datasets.
5 Mixed-Linear Forgetting
Let $f_w(x)$ be the output of a deep network with weights $w$ computed on an input image $x$. For ease of notation, assume that the core dataset $\mathcal{D}_c$ and the user dataset $\mathcal{D}_u$ share the same output space (for example, the same set of classes in a classification problem). After training a set of weights $w_c$ on the core dataset $\mathcal{D}_c$, we would like to further perturb those weights to fine-tune the network on the user data $\mathcal{D}_u$. We can think of this as solving the two minimization problems:
$$w_c = \operatorname*{arg\,min}_{w}\; L_{\mathcal{D}_c}(f_w) \qquad (3)$$

$$\operatorname*{arg\,min}_{w'}\; L_{\mathcal{D}_c \cup \mathcal{D}_u}(f_{w_c + w'}) \qquad (4)$$
where we can think of the user weights as a perturbation of the core weights that adapts them to the user task. However, since the deep network is not a linear function of the weights, the loss function can be highly non-convex. As discussed in the previous section, this makes forgetting difficult. However, if the perturbation is small, we can hope for a linear approximation of the DNN around the core weights to have performance similar to fine-tuning the whole network [33], while at the same time making forgetting easy.
Motivated by this, we introduce the following model, which we call the Mixed-Linear model (ML-model):

$$f_{w_c, w_u}(x) = f_{w_c}(x) + \nabla_{w} f_{w}(x)\big|_{w = w_c} \cdot w_u \qquad (5)$$

The model can be seen as a first-order Taylor approximation of the effect of fine-tuning the original deep network. It has two sets of weights: a set of non-linear core weights $w_c$, which enter the model through the non-linear network $f_{w_c}$, and a set of linear user weights $w_u$, which enter the model linearly. Even though the model is linear in $w_u$, it is still a highly non-linear function of the input $x$, due to the non-linear activations in $f_{w_c}$.
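The first-order structure of eq. 5 can be checked numerically on a toy network: the ML-model output is the core-network output plus the Jacobian with respect to the core weights dotted with the user weights, and it agrees with actually perturbing the core weights up to second-order terms. A numpy sketch with a made-up two-layer network and a finite-difference Jacobian (all sizes are illustrative):

```python
import numpy as np

def net(w, x):
    # tiny two-layer network; weights packed in one flat vector (illustrative)
    W1, b1, W2 = w[:6].reshape(2, 3), w[6:8], w[8:10]
    return W2 @ np.tanh(W1 @ x + b1)

def jac(w, x, eps=1e-6):
    # finite-difference gradient of the scalar output w.r.t. the weights
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (net(w + e, x) - net(w - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
w_core, x = rng.normal(size=10), rng.normal(size=3)

def ml_model(w_user):
    # eq. 5: core output plus a Jacobian-vector product with the user weights
    return net(w_core, x) + jac(w_core, x) @ w_user

w_user = 1e-3 * rng.normal(size=10)
# the linearized model matches full fine-tuning up to O(||w_user||^2) terms
assert abs(ml_model(w_user) - net(w_core + w_user, x)) < 1e-3
```

For small perturbations the two outputs coincide up to the quadratic remainder of the Taylor expansion, which is what makes training only the linear user weights a reasonable substitute for full fine-tuning.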
We train the model by solving two separate minimization problems:

$$w_c = \operatorname*{arg\,min}_{w}\; L_{CE}(f_{w};\, \mathcal{D}_c) \qquad (6)$$

$$w_u = \operatorname*{arg\,min}_{w'}\; L_{MSE}(f_{w_c, w'};\, \mathcal{D}_c \cup \mathcal{D}_u) \qquad (7)$$
Eq. (6) is akin to pre-training the weights $w_c$ on the core dataset $\mathcal{D}_c$, while eq. 7 fine-tunes the linear weights on all the data. This ensures that $w_c$ only contains information about the core dataset, while all information about the user data is contained in $w_u$. Also note that we introduce two separate loss functions for the core and user data. To train the user weights we use a mean-square-error (MSE) loss [23, 34, 17]:

$$L_{MSE}(f_{w_c, w_u};\, \mathcal{D}) = \frac{1}{2} \sum_{(x, y) \in \mathcal{D}} \big\| f_{w_c, w_u}(x) - y \big\|^2 \qquad (8)$$

where $y$ is a one-hot encoding of the class label. This loss has the advantage that the user weights are the solution to a quadratic problem, for which the optimal forgetting step can be written in closed form (see eq. 1). On the other hand, since we do not need to remove any information from the core weights, we can train them using any loss in eq. 3. We pick the standard cross-entropy loss, although this choice is not fundamental for our method.

5.1 Optimizing the Mixed-Linear model
Ideally, we want the ML-model to have accuracy on the user data similar to that of a standard non-linear network. At the same time, we want the ML-model to perform significantly better than simply training a linear classifier on top of the last-layer features, which is the trivial baseline method for training a linear model on an object classification task. In fig. 1 (see section 8 for details) we see that this is indeed the case: while linear, the ML-model is still flexible enough to fit the data with accuracy comparable to the fully non-linear model (DNN). However, some considerations are in order regarding how to train our ML-model.

Training the core model. Eq. (3) reduces to the standard training of a DNN on the core dataset using the cross-entropy loss. We train using SGD with an annealing learning rate. In case the core data is composed of multiple datasets, for example ImageNet and a second dataset closer to the user task, we first pre-train on ImageNet and then fine-tune on the other dataset.
Training the Mixed-Linear model. Training the linear weights of the Mixed-Linear model in eq. 4 is slightly more involved, since we need to compute the Jacobian-vector product (JVP) of the network. While a naïve implementation would require a separate backward pass for each sample, [33, 37] show that the JVP of a batch of samples can be computed easily for deep networks using a slightly modified forward pass. The modified forward pass has only double the computational cost of a standard forward pass, and this can be further reduced by linearizing only the final layers of the network. Using the algorithm of [33] to compute the model output, eq. 4 reduces to a standard optimization problem, which we again solve with SGD with an annealing learning rate. Note that, since the problem is quadratic, we could use more powerful quasi-Newton methods to optimize; however, we avoid that to keep the analysis simpler, since optimization speed is not the focus of this paper.
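The modified forward pass is, in essence, forward-mode differentiation: each layer propagates a (value, tangent) pair, so the Jacobian-vector product for a batch costs roughly one extra forward pass and no backward passes. A hedged numpy sketch on a made-up two-layer network (not the actual implementation of [33]; names and sizes are ours):

```python
import numpy as np

# Forward-mode sketch of the JVP trick: propagate (value, tangent) pairs
# through the network so that J(x) @ v falls out of one modified forward pass.
def linear(W, b, x, dW, db, dx):
    return W @ x + b, dW @ x + W @ dx + db

def tanh_fwd(x, dx):
    y = np.tanh(x)
    return y, (1 - y ** 2) * dx

def jvp(params, tangents, x):
    (W1, b1, W2, b2), (dW1, db1, dW2, db2) = params, tangents
    h, dh = linear(W1, b1, x, dW1, db1, np.zeros_like(x))
    h, dh = tanh_fwd(h, dh)
    return linear(W2, b2, h, dW2, db2, dh)  # (f(x), J(x) @ v)

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2))
tangents = tuple(rng.normal(size=p.shape) for p in params)
x = rng.normal(size=3)
out, dout = jvp(params, tangents, x)
```

Here `dout` is the directional derivative of the network output with respect to the weights in the direction `tangents`, obtained in a single pass; with the user weights as the tangent direction, this is exactly the linear term of eq. 5.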
Architecture changes. We observe that a straightforward application of [33] to a standard pretrained ResNet-50 tends to underperform in our setting (fine-tuning on large-scale vision tasks). In particular, it achieves only slightly better performance than training a linear classifier on top of the last-layer features. Following the suggestion of [2], we replace the ReLUs with Leaky ReLUs, since this boosts the accuracy of linearized models.
6 Forgetting procedure
The user weights $w_u$ are obtained by minimizing the quadratic loss function in section 5 on the user data. Let $\mathcal{D}_f$ denote a subset of samples we want to forget (by hypothesis $\mathcal{D}_f \subseteq \mathcal{D}_u$, i.e., the core data is not going to change) and let $\mathcal{D}_r$ denote the remaining data. As discussed in Section 4, in the case of a quadratic training loss the optimal forgetting step to delete $\mathcal{D}_f$ is given by:

$$w_u \leftarrow w_u - \Delta w, \qquad \Delta w = H^{-1} g_r \qquad (9)$$

where we define $g_r$ as the gradient of the loss on the remaining data $\mathcal{D}_r$, and we can explicitly write the Hessian of the loss in section 5 as:

$$H = \frac{1}{|\mathcal{D}_r|} \sum_{x \in \mathcal{D}_r} \nabla_{w_u} f(x)\, \nabla_{w_u} f(x)^{\top} + \lambda I \qquad (10)$$

where $I$ is the identity matrix of the size of $w_u$. Thus, forgetting amounts to computing the update step of eq. 9. Unfortunately, even if we can easily write the Hessian in closed form, we cannot store it in memory, much less invert it. Instead, we now discuss how to find an approximation of the forgetting step by solving an optimization problem which does not require constructing or inverting the Hessian.

Since $H$ is positive definite, we can define the auxiliary loss function

$$L_{\mathrm{aux}}(v) = \frac{1}{2}\, v^{\top} H\, v - g_r^{\top} v \qquad (11)$$
It is easy to show that the forgetting update $\Delta w$ is the unique minimizer of $L_{\mathrm{aux}}$, so we can recast computing the forgetting update as simply minimizing this loss using SGD. In general, the product $Hv$ in eq. 11 can be computed efficiently without constructing the Hessian using the Hessian-vector product algorithm [27]. However, in our case we have a better alternative, due to the fact that we use the MSE loss and that the ML-model is linear in weight space: using eq. 10, we can easily show that

$$H v = \frac{1}{|\mathcal{D}_r|} \sum_{x \in \mathcal{D}_r} \nabla_{w_u} f(x)\, \big( \nabla_{w_u} f(x)^{\top} v \big) + \lambda\, v \qquad (12)$$
where $\nabla_{w_u} f(x)^{\top} v$ is a Jacobian-vector product which can be computed efficiently (see Section 5.1). Using this result, we compute the (approximate) minimizer of eq. 11 using SGD. When optimizing eq. 11, we compute the residual gradient $g_r$ exactly on $\mathcal{D}_r$ and approximate eq. 10 by Monte-Carlo sampling. In fig. 4, we show that this method outperforms fully stochastic minimization of eq. 11.
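Putting the pieces together on a toy problem: gradient descent on the auxiliary loss of eq. 11, with the Hessian-vector product computed as two Jacobian products per eq. 12, converges to the exact forgetting update without ever materializing the Hessian. A numpy sketch in which the per-sample Jacobians, sizes, and weight-decay value are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 6
J = rng.normal(size=(n, p))  # per-sample Jacobians (scalar output, illustrative)
lam = 0.5                    # weight-decay term, as in eq. 10
g = rng.normal(size=p)       # residual gradient on the retained data

def Hv(v):
    # eq. 12: Hessian-vector product via two Jacobian products;
    # the p x p Hessian itself is never built or inverted
    return J.T @ (J @ v) / n + lam * v

# Gradient descent on L(v) = 1/2 v^T H v - g^T v converges to the update H^{-1} g
v, lr = np.zeros(p), 0.05
for _ in range(2000):
    v -= lr * (Hv(v) - g)
```

In the paper's setting the same scheme runs with SGD over mini-batches of the remaining data, with the residual gradient computed once exactly, as discussed above.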
Mixed-Linear Forgetting. Let $\Delta w_T$ be the approximate minimizer of eq. 11 obtained by training with SGD for $T$ iterations. Our forgetting procedure for the ML-model, which we call Mixed-Linear (ML) Forgetting, is:

$$w_u \leftarrow w_u - \Delta w_T + \sigma\, \epsilon \qquad (13)$$

where $\epsilon$ is a random noise vector [15, 16]. As mentioned in Section 4, we need to add noise to the weights since $\Delta w_T$ is only an approximation of the optimal forgetting step, and the small difference may still contain information about the original data. By adding noise, we destroy the remaining information. Larger values of $\sigma$ ensure better forgetting, but can reduce the performance of the model. In the next sections, we analyze theoretically and empirically the role of $\sigma$.
Sequential forgetting. In practical applications, we may receive several separate requests to forget data, arriving sequentially. In such cases, we simply apply the forgetting procedure of eq. 13 to the weights obtained at the end of the previous step. A key requirement is that the performance of the system does not deteriorate too much after many sequential requests, which we address next.
7 Bounds on Remaining Information
We now derive bounds on the amount of information that an attacker can extract from the weights of the model after applying the scrubbing procedure of eq. 13. This will also guide us in selecting the optimal noise level $\sigma$ and the number of iterations used to approximate the forgetting step that are necessary to reach a given privacy level (see fig. 2). Let some attribute of interest regarding $\mathcal{D}_f$ be what an attacker might want to access; then, by Proposition 1 in [15], the information any readout function can extract about it is bounded by the mutual information between $\mathcal{D}_f$ and the output of the scrubbing/forgetting method, which, given weights trained on the data, removes information about $\mathcal{D}_f$ (in our case it is given by eq. 13). Hence, bounding the amount of information about $\mathcal{D}_f$ that remains in the weights after forgetting uniformly bounds all the information that an attacker can extract.
We now upperbound the remaining information after applying the forgetting procedure in eq. 13 to our MLmodel, over multiple forgetting requests. Let be the total data asked to be forgotten at the end of forgetting requests and let be the weights obtained using the forgetting procedure in eq. 13 sequentially. Then we seek to provide a bound on the mutual information between the two, i.e., . We prove the following theorem.
Theorem 1 (Informal).
Let $\Delta w_T$ be the approximate update step obtained by minimizing eq. 11 using $T$ steps of SGD with mini-batch size $b$. Let the learning rate be bounded in terms of the smoothness constant of the loss in eq. 7. Consider a sequence of equally sized forgetting requests, and let $w_u$ be the weights obtained after the requests using eq. 13. Then we have the following bound on the amount of information remaining in the weights about the forgotten data:
(14) 
where , and , and .
Figure 2: We aim to forget 10% of the training data through 10 forgetting requests on the Caltech-256 (left) and FGVC Aircraft (right) datasets. The remaining information in the weights decreases as the forgetting noise or the number of epochs during forgetting increases, as predicted by the bound in theorem 1. Increasing the forgetting noise increases the test error after forgetting (top). In terms of computational efficiency, 2-3 passes over the data (i.e. 2-3 epochs) are sufficient for forgetting (in terms of both the test error and the remaining information), rather than retraining from scratch for 50 epochs (bottom) for each forgetting request, thus providing a 16-25x speed-up per forgetting request. We fine-tune the ML-Forgetting model for 50 epochs while training the user weights. Values for the noise and the number of iterations can be chosen using these trade-off curves, given a desired privacy level.

[35] provides a similar probabilistic bound on the distance of the scrubbed weights from the optimal weights for strongly convex Lipschitz loss functions trained using projected GD. We prove our bound for the more general case of a convex loss function with regularization trained using SGD (instead of GD), and we also bound the remaining information in the weights.
Role of the noise $\sigma$. We make some observations regarding eq. 14. First, increasing the variance of the noise added to the weights after the forgetting step reduces the possible leakage of information from an imperfect approximation. Of course, the downside is that increasing the noise may reduce the performance of the model (see fig. 2 (top) for the trade-off between the two).
Forgetting with more iterations. Running the algorithm for an increasing number of steps improves the accuracy of the forgetting step, and hence reduces the amount of remaining information. We confirm this empirically in fig. 2 (bottom). Note, however, that there are diminishing returns: the variance of the stochastic optimization eventually overshadows the gains in accuracy from longer optimization (see the additive term depending on the batch size). Increasing the batch size in eq. 13 reduces the variance of the estimate and leads to better convergence.
Fraction of data to forget. Finally, forgetting a smaller fraction of the data is easier. On the other hand, increasing the number of parameters of the model may make forgetting more difficult.
8 Experiments
We use a ResNet-50 [22] as the model in ML-Forgetting. Unless specified otherwise, in all experiments we forget around 10% of randomly chosen training data through 10 sequential forgetting requests of equal size. In the appendix, we also provide results for forgetting an entire class, and show that our method is invariant to the choice of the subset to be forgotten. More experimental details can be found in the appendix.
Datasets used. We test our method on the following image classification tasks: Caltech-256 [18], MIT-67 [38], Stanford Dogs [26], CUB-200 [40], FGVC Aircraft [31], and CIFAR-10 [29]. Readout-function and forgetting-accuracy trade-off plots for MIT-67, Stanford Dogs, CUB-200, and CIFAR-10 can be found in the appendix.

8.1 Readout functions
The forgetting procedure should be such that an attacker with access to the scrubbed weights cannot construct a function that leaks information about the set to forget. More precisely, the scrubbing procedure should be such that, for all readout functions:
(15) 
where the right-hand side is some baseline that does not depend on the data to forget (it only depends on the subset to retain). Here the two sides correspond to the distributions of weights (due to the stochastic training algorithm) obtained after minimizing the empirical risk with and without the data to forget, and the scrubbing update is the one defined in eq. 13. For an ideal forgetting procedure, the value of the readout functions (or evaluation metrics) should be the same for a model obtained after forgetting and for a model retrained from scratch without the forgotten data. Some common choices of readout functions include (see Figure 3):

Error on the forget, retain, and test sets: the scrubbed model and the model retrained from scratch on the retained data should have similar accuracy on all three subsets of the data.

Relearn time: we fine-tune the scrubbed model (the model after forgetting) and the retrained model for a few iterations on a subset of the training data (which includes the forgotten samples) and compute the number of iterations each model takes to relearn them. For an ideal forgetting procedure, the relearn time of the scrubbed model should be comparable to that of the retrained model (we plot the relative relearn time in Figure 3). Relearn time serves as a proxy for the amount of information about the forgotten data remaining in the weights (see fig. 3).

Activation distance: we compute the distance between the final activations of the scrubbed model and the retrained model on different subsets of the data. We compare the distances obtained for the original weights without any forgetting, the weights after adding Fisher noise, and ML-Forgetting (see fig. 3). This serves as a proxy for the amount of information about the forgotten data remaining in the activations.

Membership attack: we construct a simple yet effective membership attack similar to [16]. We compute the entropy of the scrubbed model's output on the retain set and on the test set, and label the two sets class 0 and class 1 respectively (i.e., we create a binary classification problem with the output entropies on the retain and test sets as the two classes). We then fit a weighted support vector classifier (SVC) to this data (weighted because the retain and test sets may be of unequal size) and compute the attack success on the data to forget. If the SVC identifies the samples to forget as class 0 during the attack, then the attack is successful, because the adversary has identified from the scrubbed model what data it was initially trained on. Ideally, a forgetting procedure should have the same attack success as a retrained model (see fig. 3).
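To make the entropy-based attack concrete, the following numpy sketch generates synthetic logits in which "retain" samples are overconfident and held-out "test" samples are not, and attacks with a simple entropy threshold standing in for the weighted SVC described above (all numbers are made up for illustration):

```python
import numpy as np

def entropy(logits):
    # entropy of the softmax output, the attack feature used in the text
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

rng = np.random.default_rng(4)
# Illustrative logits: retain-set samples are overconfident (low entropy),
# held-out test-set samples are less confident (higher entropy)
retain_logits = 5.0 * np.eye(3)[rng.integers(0, 3, 200)] + 0.1 * rng.normal(size=(200, 3))
test_logits = 1.0 * np.eye(3)[rng.integers(0, 3, 200)] + 0.5 * rng.normal(size=(200, 3))

h_retain, h_test = entropy(retain_logits), entropy(test_logits)
# A threshold classifier standing in for the SVC: below-threshold entropy
# is labeled "was in the training set" (class 0)
thresh = 0.5 * (h_retain.mean() + h_test.mean())

def attack_success(logits):
    return (entropy(logits) < thresh).mean()
```

After a successful scrub, the attack success on the forget set should drop to that of the held-out test set, matching a model retrained without the data.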
8.2 Complete vs Stochastic residual gradient
In eq. 13 we compute the residual gradient exactly, once, over the remaining data, instead of estimating that term stochastically. In Figure 4, we compare the two methods of computing the residual gradient. We show that in the ideal noise region, both the remaining information and the test error after forgetting (10% of the data through 10 requests) are lower when the residual gradient is computed exactly.
8.3 Effect of choosing different core datasets
For fine-grained datasets like FGVC Aircraft and CUB-200, we show that if the core data contains some information about the user task, forgetting improves significantly, both in terms of the remaining information and of the test accuracy. In fig. 5, we show that using ImageNet + 30% of the Aircraft data (we assume that we are not asked to forget this 30%) as core data and 100% of the Aircraft data as user data performs much better than simply using ImageNet as the core. In fig. 5 (right), we also show that increasing the percentage of the user distribution in the core data improves the test accuracy of the Mixed-Linear model.
9 Conclusion
We provide a practical forgetting procedure to remove the influence of a subset of the data from a trained image classification model. We achieve this by linearizing the model in a mixed-privacy setting, which enables us to split the weights into a set of core weights and a set of forgettable user weights. When asked to delete all the user data, we can simply discard the user weights. The quadratic nature of the training loss enables us to efficiently forget a subset of the user data without compromising the accuracy of the model. In terms of time complexity, we only need 2-3 passes over the dataset per forgetting query to remove information from the weights, rather than the 50 epochs of retraining, thus providing a 16x or more speed-up per request (see fig. 2). We test the forgetting procedure against various readout functions, and show that it performs comparably to a model retrained from scratch (the ideal paragon). Finally, we also provide theoretical guarantees on the amount of information remaining in the weights, and verify the behavior of the information bounds empirically through extensive evaluation in fig. 2.
Even though we provide a forgetting procedure for deep networks, by linearizing them without compromising their accuracy, efficiently removing information directly from highly non-convex deep networks largely remains an unsolved problem.
References
 [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
 [2] Alessandro Achille, Aditya Golatkar, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. LQF: Linear quadratic fine-tuning, 2020.
 [3] Alessandro Achille and Stefano Soatto. Where is the Information in a Deep Neural Network? arXiv preprint arXiv:1905.12213, 2019.
 [4] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
 [5] Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. arXiv preprint arXiv:2002.02730, 2020.
 [6] Lucas Bourtoule, Varun Chandrasekaran, Christopher Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. arXiv preprint arXiv:1912.03817, 2019.
 [7] Jonathan Brophy and Daniel Lowd. DART: Data addition and removal trees. arXiv preprint arXiv:2009.05567, 2020.
 [8] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pages 463–480. IEEE, 2015.
 [9] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. Advances in neural information processing systems, 21:289–296, 2008.
 [10] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
 [11] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
 [12] Sanjam Garg, Shafi Goldwasser, and Prashant Nalini Vasudevan. Formalizing data deletion in the context of the right to be forgotten. arXiv preprint arXiv:2002.10635, 2020.
 [13] Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pages 3513–3526, 2019.

 [14] Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1139–1147, 2019.
 [15] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020.
 [16] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. arXiv preprint arXiv:2003.02960, 2020.
 [17] Pavel Golik, Patrick Doetsch, and Hermann Ney. Cross-entropy vs. squared error training: a theoretical and experimental comparison. 2013.
 [18] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
 [19] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
 [20] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.
 [21] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
 [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [23] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. arXiv preprint arXiv:2006.07322, 2020.
 [24] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models: Algorithms and evaluations. arXiv preprint arXiv:2002.10077, 2020.
 [25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
 [26] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs.
 [27] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
 [28] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes overconfidence in relu networks. arXiv preprint arXiv:2002.10118, 2020.

 [29] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [30] Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.
 [31] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
 [32] Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Deletion-robust submodular maximization: Data summarization with “the right to be forgotten”. In International Conference on Machine Learning, pages 2449–2458, 2017.
 [33] Fangzhou Mu, Yingyu Liang, and Yin Li. Gradients as features for deep representation learning. In International Conference on Learning Representations, 2020.
 [34] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter? arXiv preprint arXiv:2005.08054, 2020.
 [35] Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning, 2020.
 [36] Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational bayesian unlearning. Advances in Neural Information Processing Systems, 33, 2020.
 [37] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
 [38] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
 [39] David Marco Sommer, Liwei Song, Sameer Wagh, and Prateek Mittal. Towards probabilistic verification of machine unlearning. arXiv preprint arXiv:2003.04247, 2020.
 [40] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
 [41] Yinjun Wu, Edgar Dobriban, and Susan B Davidson. DeltaGrad: Rapid retraining of machine learning models. arXiv preprint arXiv:2006.14755, 2020.
Appendix
In this appendix we provide additional experiments (Appendix A), experimental details (Appendix B) and theoretical results (Appendix C).
Appendix A Additional Experiments
a.1 Forgetting an entire class
In the main paper we considered forgetting a random subset of 10% of the training data. Here we instead consider the problem of completely forgetting all samples of a given class in a single forgetting request. In figs. 6 and 7, we observe that our proposed method outperforms other methods in this setting as well, and is robust to different readout functions. Note that when removing an entire class, the target forget error (i.e., the error on the class to be forgotten) is 100%.
a.2 Role of Regularization
We plot the amount of remaining information and the test error as a function of the regularization coefficient. Note that instead of incorporating weight decay directly in the optimization step, as is often done, we explicitly add the regularization term to the loss function. As expected theoretically (theorem 3), increasing the regularization coefficient makes the training optimization problem more strongly convex, which in turn makes forgetting easier. However, increasing the weight decay too much also hurts the accuracy of the model. Hence there is a trade-off, mediated by the regularization coefficient, between the amount of remaining information and the test error. We plot this trade-off in fig. 8.
a.3 More experiments using SGD for forgetting
We repeat the experiments of fig. 3 on the following datasets: Stanford Dogs, MIT-67, CIFAR-10, CUB-200, and FGVC-Aircraft. Overall, we observe consistent results across all datasets.
a.4 Information vs Noise/Epochs
Appendix B Experimental Details
We use a ResNet-50 pretrained on ImageNet. For the plots in fig. 1, we train the ML-Forgetting model using SGD for 50 epochs with batch size 64, learning rate 0.05, momentum 0.9, and weight decay 0.00001, where the learning rate is annealed by 0.1 at epochs 25 and 40. We explicitly add the regularization term to the loss function instead of incorporating it into the SGD update equation. We only linearize the final layers of the ResNet-50, and scale the one-hot vectors by 5 when using the MSE loss. For the fine-grained datasets, FGVC-Aircraft and CUB-200, in addition to ImageNet pretraining, we also pretrain on a randomly sampled 30% of the training data (which we assume is part of the core set).
For training the ML-Forgetting model in the readout-function and information plots using SGD, we use the same experimental setting as above with an increased weight decay of 0.0005 for Caltech-256, Stanford Dogs and CIFAR-10, and 0.001 for MIT-67, CUB-200 and FGVC-Aircraft. We use a higher value of weight decay to increase the strong-convexity constant of the training loss, which facilitates forgetting (see lemma 4).
For forgetting with the ML-Forgetting model in the readout-function/information plots using SGD (i.e., using SGD to minimize eq. 11), we use momentum 0.999 and decrease the learning rate by 0.5 per epoch. We run SGD for 3 epochs with an initial learning rate of 0.01 for Caltech-256, Stanford Dogs and CIFAR-10, and for 4 epochs with an initial learning rate of 0.025 for MIT-67, CUB-200 and FGVC-Aircraft.
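The schedule above (heavy momentum with the learning rate halved every epoch) can be sketched as follows; the quadratic objective, data, and shapes are hypothetical placeholders standing in for eq. 11:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic stand-in for the forgetting objective of eq. 11.
X = rng.normal(size=(128, 8))
y = rng.normal(size=128)

def grad(w, batch):
    # Minibatch gradient of the mean squared error.
    Xb, yb = X[batch], y[batch]
    return 2 * Xb.T @ (Xb @ w - yb) / len(batch)

w = np.zeros(8)
v = np.zeros(8)
lr, momentum, batch_size = 0.01, 0.999, 32
for epoch in range(3):                      # 3 epochs, as for Caltech-256 etc.
    perm = rng.permutation(len(y))
    for i in range(0, len(y), batch_size):
        g = grad(w, perm[i:i + batch_size])
        v = momentum * v + g                # heavy-ball momentum accumulation
        w = w - lr * v
    lr *= 0.5                               # halve the learning rate each epoch
```

The very high momentum averages gradients over many steps, while the rapid learning-rate decay damps the resulting oscillations within the small per-query step budget.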
Appendix C Theoretical Results
Lemma 1.
Let be two random vectors such that . Then we have the following, for any :
Proof.
(16) 
for any , where (a) follows from the Cauchy-Schwarz inequality and (b) follows from the AM-GM inequality. ∎
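The statement and display of Lemma 1 are elided in this copy. The proof pattern it cites, Cauchy-Schwarz followed by AM-GM with a free parameter, is the standard chain below, included only as an illustrative reconstruction, not the paper's exact inequality:

```latex
% Illustrative only: the standard Cauchy--Schwarz + AM--GM combination.
% For random vectors $X, Y$ and any $\lambda > 0$:
\begin{align*}
\mathbb{E}\,\|X + Y\|^2
  &= \mathbb{E}\,\|X\|^2 + 2\,\mathbb{E}\,\langle X, Y\rangle + \mathbb{E}\,\|Y\|^2 \\
  &\overset{(a)}{\leq} \mathbb{E}\,\|X\|^2 + 2\,\mathbb{E}\,\|X\|\,\|Y\| + \mathbb{E}\,\|Y\|^2 \\
  &\overset{(b)}{\leq} (1+\lambda)\,\mathbb{E}\,\|X\|^2
     + \Bigl(1+\tfrac{1}{\lambda}\Bigr)\,\mathbb{E}\,\|Y\|^2,
\end{align*}
% where (a) is Cauchy--Schwarz and (b) uses $2ab \leq \lambda a^2 + b^2/\lambda$.
```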
Lemma 2.
Let and be a strongly convex function with , and such that . Then , we have that . When is also quadratic with , the maximum eigenvalue of the Hessian, we have that .
Proof.
Let , where ; then from the Mean Value Theorem (MVT) we have that for some between 0 and 1. This implies that , and . Thus from the MVT we get:
(17) 
where follows from the Cauchy-Schwarz inequality and follows from the fact that and , .
When is quadratic, then we can always write , where is a constant symmetric matrix and . From our definition of we can write:
where (a) follows from the definition of and (b) follows from the triangle inequality. Substituting this result in Equation 17 we get:
∎
Lemma 3.
Consider a function , where , and is a dataset of size . Let , then .
Proof.
Thus, , where follows from the assumption that is nonnegative, follows from the fact that is the minimizer of and follows from the assumption that . Note that the result is independent of , thus, the empirical risk minimizers of the datasets obtained by removing a subset of samples will also lie within a dimensional sphere of radius .
∎
Lemma 4.
Consider a function , where , is strongly convex function with . Let (retain set) be the dataset remaining after removing samples (forget set) from i.e. . Let and . Let and . Then we have that:
When is also quadratic with , the smoothness constant of , we have that .
Proof.
We use the same technique as proposed in [35].
where the first inequality follows from the fact that is the minimizer of .
From Lemma 3 we know that . Also from the definition of we have that . Then applying Lemma 2 with for we get that
(19) 
From the definition of we know that it is a strongly convex function, so we have the following property:
(20) 
Substituting Equation 19 and Equation 20 into the first inequality above, we get:
When is also quadratic with , the smoothness constant, then from lemma 2 we have .
∎
Lemma 5.
Let be a convex and smooth with minimizer, . Then we have that:
Proof.
From the definition of smoothness we have that:
(21) 
Setting , and using the fact that , we get that:
For (2) we minimize Equation 21 with respect to :
(22) 
where follows from the result that . Setting , in the general smoothness bound and rearranging the terms, we get (2). ∎
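The displays of Lemma 5 and its proof are elided in this copy. The step the proof describes, minimizing the smoothness upper bound over its free point, yields the standard bound below, included as an illustrative reconstruction under the stated assumptions (convex, $\beta$-smooth $f$ with minimizer $w^*$):

```latex
% Illustrative: from beta-smoothness,
%   f(w') <= f(w) + <\nabla f(w), w' - w> + (beta/2) ||w' - w||^2,
% the right-hand side is minimized at w' = w - (1/beta) \nabla f(w);
% combining with f(w^*) <= f(w') gives
\begin{equation*}
f(w^*) \;\leq\; f(w) - \frac{1}{2\beta}\,\|\nabla f(w)\|^2,
\qquad\text{equivalently}\qquad
\|\nabla f(w)\|^2 \;\leq\; 2\beta\,\bigl(f(w) - f(w^*)\bigr).
\end{equation*}
```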
Theorem 2.
(SGD) Consider