Mixed-Privacy Forgetting in Deep Networks

by Aditya Golatkar et al.

We show that the influence of a subset of the training samples can be removed – or "forgotten" – from the weights of a network trained on large-scale image classification tasks, and we provide strong computable bounds on the amount of remaining information after forgetting. Inspired by real-world applications of forgetting techniques, we introduce a novel notion of forgetting in a mixed-privacy setting, where we know that a "core" subset of the training samples does not need to be forgotten. While this variation of the problem is conceptually simple, we show that working in this setting significantly improves the accuracy and guarantees of forgetting methods applied to vision classification tasks. Moreover, our method allows efficient removal of all information contained in non-core data by simply setting to zero a subset of the weights, with minimal loss in performance. We achieve these results by replacing a standard deep network with a suitable linear approximation. With opportune changes to the network architecture and training procedure, we show that such a linear approximation achieves comparable performance to the original network, and that the forgetting problem becomes quadratic and can be solved efficiently even for large models. Unlike previous forgetting methods for deep networks, ours achieves close to state-of-the-art accuracy on large-scale vision tasks. In particular, we show that our method allows forgetting without having to trade off model accuracy.





1 Introduction

When building a classification system, one rarely has all the data to be used for training available at the outset. More often, one starts by pre-training a model with some "core" dataset (e.g., ImageNet, or datasets close to the target task) and then incorporates various cohorts of task-specific data as they become available from diverse sources. In some cases, the wrong data may be incorporated inadvertently, or the owners may change their mind and demand that their data be removed. One can, of course, restart the training from scratch every time such a demand is made, but at a significant cost of time and disruption. What if one could remove the effect of cohorts of data à la carte, without re-training, so that the resulting model is functionally indistinguishable from one that has never seen the cohorts in question, and in addition has no residual information about them buried in the weights of the model? Of course, forgetting can always be trivially achieved by zeroing the weights or replacing them with random noise, but this comes at the expense of the accuracy of the model. Can we forget the cohort of interest without interfering with information about the other data, while preserving, to the extent possible, the accuracy of the trained model?

Recently, the problem of forgetting has received considerable attention [15, 16, 13, 19, 5, 24, 35, 41, 6, 39, 12, 7, 36], but solutions have focused on simpler machine learning problems such as linear logistic regression. Removing information from the weights of a standard convolutional network remains an open problem, with some initial results working only on small-scale problems [15, 16]. This is mainly due to the highly non-convex loss landscape of CNNs, which makes the influence of a particular sample on the optimization trajectory and the final weights highly non-trivial to model.

In this paper we introduce Mixed-Linear Forgetting (ML-Forgetting), a method to train large-scale computer vision models in such a way that information about a subset of the data can be removed on request – with strong bounds on the amount of remaining information – while at the same time retaining accuracy close to the state of the art on the task. To the best of our knowledge, this is the first algorithm to achieve forgetting for deep networks trained on large-scale computer vision problems without compromising accuracy. To further improve performance in realistic use-cases, we introduce the notion of forgetting in a mixed-privacy setting, that is, when we know that a subset of the training dataset, which we call core data, will not need to be forgotten. For example, the core data may be a large dataset of generic data used for pre-training (e.g., ImageNet) or a large freely available collection of task-specific data (e.g., a self-driving dataset) which is not likely subject to changes. We show that ML-Forgetting can naturally take advantage of this setting to improve both the accuracy and the bounds on the amount of remaining information after forgetting.

One of the main challenges of forgetting in deep networks is estimating the effect of a given training sample on the parameters of the model, which has led research to focus on simpler convex learning problems, such as linear or logistic regression, for which a theoretical analysis is feasible. To address this problem, Mixed-Linear Forgetting uses a first-order Taylor-series-inspired decomposition of the network to learn two sets of weights: a set of core weights, which is trained only on the core data using a standard (non-convex) algorithm, and a set of linear user weights, which is trained to minimize a quadratic loss function on the changeable user data. The core weights are learned through standard training (since forgetting is not required on the core data), while the user weights are obtained as the solution to a strongly convex quadratic optimization problem. This allows us to remove the influence of a subset of the data with strong guarantees. Moreover, by construction, simply setting the user weights to zero removes the influence of all changeable data with the lowest possible drop in performance, thus easily allowing users to remove all of their data at once.

To summarize, our key contributions are:

  1. We introduce the problem of forgetting (also called unlearning, data deletion, or scrubbing) in a mixed-privacy setting which, compared to previous formalizations, is better tailored to standard practice and allows for better privacy guarantees.

  2. In this setting we propose ML-Forgetting. ML-Forgetting trains a set of non-linear core weights and a set of linear user weights, which allow it to achieve both good accuracy, thanks to the flexibility of the non-linear weights, and strong privacy guarantees, thanks to the linear weights.

  3. As a side effect, all the user data may be forgotten completely, with the lowest possible drop in performance, by simply erasing the user weights.

  4. We show that ML-Forgetting can be applied to large-scale vision datasets, and enjoys both strong forgetting guarantees and test-time accuracy comparable to standard training of a Deep Neural Network (DNN). To the best of our knowledge, this is the first forgetting algorithm to do so.

  5. Furthermore, we show that ML-Forgetting can handle multiple sequential forgetting requests without degrading its performance, which is important for real-world applications.

2 Related Work

Forgetting. The problem of machine unlearning was introduced in [8] together with an efficient forgetting algorithm for statistical query learning. [32, 13] give methods for forgetting for particular classes of learning algorithms, such as k-means clustering. Other methods involve splitting the data into multiple subsets and training models separately on combinations of them [6, 41]. This allows perfect forgetting, but incurs heavy storage costs, as multiple models/gradients need to be stored. In the context of model interpretability and cross-validation, [27, 14] provide a Hessian-based method for estimating the influence of a training point on the model predictions. [5] proposes a method to hide information about an entire class from the output logits, but it does not remove information from the model weights.

[19] proposes to remove information from the weights of convex problems using Newton's method, and uses differential privacy [1, 11, 10, 9] to certify data removal. [24] provides a projective residual update method using synthetic data points to delete data points from linear/logistic regression models. [36] proposes an unlearning mechanism for logistic regression and Gaussian processes in a Bayesian setting using variational inference. Recently, [35] proposed a gradient-descent-based method for data deletion in convex settings, with theoretical guarantees for multiple forgetting requests. They also introduce the notion of statistical indistinguishability of the entire state or just the outputs, similar to the information-theoretic framework of [16]. We use some of their proof techniques for our theoretical results.

Deep networks provide additional challenges for forgetting due to their highly non-convex loss functions. [15] proposes an information-theoretic procedure to scrub information from the intermediate layers of a DNN trained with stochastic gradient descent (SGD), exploiting the stability of SGD [21]. They also bound the amount of information remaining in the weights [3] after scrubbing. [16] extends the framework of [15] to the activations. They also show that an approximation of the training process based on a first-order Taylor expansion of the network (NTK theory) can be used to estimate the weights after forgetting. This approximation works well on small-scale vision datasets; however, the approximation accuracy and the computational cost degrade for larger datasets (in particular, the cost is quadratic in the number of samples). We also use linearization but, crucially, instead of linearly approximating the training dynamics of a non-linear network, we show that we can directly train a linearized network for forgetting. This ensures that our forgetting procedure is correct, and it allows us to scale easily to standard real-world vision datasets.

Linearization. Using a first-order Taylor expansion (linearization) of the network to study its behavior has gained interest recently in NTK theory [25, 30] as a tool to study the dynamics of DNNs in the limit of infinitely many filters. [33] shows that, beyond its use as a theoretical tool, it is possible to directly train a (finite) linearized network using an efficient algorithm for the Jacobian-vector product computation. [2] shows that, with some changes to the architecture and the training process, linearized models can match the performance of non-linear models on many vision tasks, while still maintaining a convex loss function.

3 Preliminaries and Notations

We use the empirical risk minimization (ERM) framework throughout this paper for training. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a dataset, where $x_i$ denotes the input datum (for example, an image) and $y_i$ the corresponding output (for example, a one-hot vector in classification). Given an input image $x$, let $f_w(x)$ (for instance, a DNN) be a function parameterized by weights $w$ used to model the relation $x \mapsto y$. Given an input-target pair $(x, y)$, we denote the empirical risk or training loss for $(x, y)$ by $\ell(f_w(x), y)$. We will sometimes abuse the notation and write $\ell(w)$, dropping $(x, y)$. For a training dataset $\mathcal{D}$, we denote the empirical risk/total training loss on $\mathcal{D}$ by $L_{\mathcal{D}}(w) = \sum_{(x,y) \in \mathcal{D}} \ell(f_w(x), y)$, which we use interchangeably with $L(w)$ when the dataset is clear from the context. Let $w_t = A_t(w_0)$ denote the weights obtained after $t$ steps of a training algorithm $A$ (for example, SGD in our case) using $w_0$ as the initialization. We denote with $\|v\|$ the $\ell_2$ norm of a vector $v$ and with $\lambda_{\max}(Q)$ the largest eigenvalue of a matrix $Q$.


4 The Forgetting Problem

The weights $w$ of a trained deep network are a (possibly stochastic) function of the training data $\mathcal{D}$. As such, they may retain information about the training samples which an attacker can extract. A forgetting procedure is a function $S(w; \mathcal{D}, \mathcal{D}_f)$ (also called a scrubbing function; we will abuse the notation and write $S(w)$ when its arguments are clear from the context) which, given a set of weights $w$ trained on $\mathcal{D}$ and a subset $\mathcal{D}_f \subset \mathcal{D}$ of images to forget, outputs a new set of weights which are indistinguishable from weights obtained by training without $\mathcal{D}_f$.

Readout functions. The success of the forgetting procedure can be measured by looking at whether a discriminator function $r(w)$ can guess – at better than chance probability – whether a set of weights $w$ was trained with or without $\mathcal{D}_f$, or whether it was trained with $\mathcal{D}_f$ and then scrubbed. Following [15, 16], we call such functions readout functions. A popular example of a readout function is the confidence of the network (that is, the entropy of the output softmax vector) on the samples in $\mathcal{D}_f$: since networks tend to be overconfident on their training data [20, 28], a higher than expected confidence may indicate that the network was indeed trained on $\mathcal{D}_f$. We discuss more readout functions in Section 8.1. Alternatively, we can measure the success of the forgetting procedure by the amount of remaining mutual information (recall that $I(x; y) = \mathbb{E}_{p(x,y)}\big[\log \frac{p(x,y)}{p(x)\,p(y)}\big]$ is the mutual information between $x$ and $y$, where $p(x, y)$ is the joint distribution and $p(x)$, $p(y)$ are the marginal distributions) between the scrubbed weights $S(w)$ and the data $\mathcal{D}_f$ to be forgotten. While this is more difficult to estimate, it can be shown that $I(\mathcal{D}_f; S(w))$ upper-bounds the amount of information that any readout function can extract [15, 16]. Said otherwise, it is an upper bound on the amount of information that an attacker can extract about $\mathcal{D}_f$ using the scrubbed weights $S(w)$.
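As an illustration of the confidence readout mentioned above, the following is a minimal sketch (with hypothetical random logits standing in for real model outputs) that computes the mean softmax entropy on the forget set and on held-out data:

```python
import numpy as np

def softmax_entropy(logits):
    # Entropy of the softmax output for each sample; low entropy = high confidence.
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

# Hypothetical logits standing in for the scrubbed model's outputs.
rng = np.random.default_rng(0)
logits_forget = rng.normal(size=(100, 10))   # outputs on the forget set D_f
logits_heldout = rng.normal(size=(100, 10))  # outputs on held-out data

# If forgetting succeeded, the two mean entropies should be similar; a markedly
# lower entropy (higher confidence) on D_f suggests residual memorization.
print(softmax_entropy(logits_forget).mean(), softmax_entropy(logits_heldout).mean())
```

In practice the two distributions of per-sample entropies would be compared statistically, not just by their means.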

Quadratic forgetting. An important example is forgetting in a linear regression problem, which has a quadratic loss function $L_{\mathcal{D}}(w) = \frac{1}{2} \sum_{(x,y) \in \mathcal{D}} \|w^T x - y\|^2$. Given the weights $w = A(\mathcal{D})$ obtained after training on $\mathcal{D}$ using algorithm $A$, the optimal forgetting function is given by:

$$S(w) = w - H_{\mathcal{D}_r}^{-1} \nabla L_{\mathcal{D}_r}(w), \qquad (1)$$

where $H_{\mathcal{D}_r}$ and $\nabla L_{\mathcal{D}_r}(w)$ are the Hessian and gradient of the loss computed on the remaining data $\mathcal{D}_r = \mathcal{D} \setminus \mathcal{D}_f$, respectively. When $w$ minimizes the total loss $L_{\mathcal{D}}(w)$, we can replace $\nabla L_{\mathcal{D}_r}(w)$ with $-\nabla L_{\mathcal{D}_f}(w)$ (since the two gradients sum to zero at the minimum), in which case eq. 1 is interpreted as a reverse Newton step that unlearns the data [19, 15]. Since the "user weights" of ML-Forgetting minimize a similar quadratic loss function, as we will discuss in Section 6, eq. 1 also describes the optimal forgetting procedure for our model. The main challenge for us will be how to accurately compute the forgetting step, since the Hessian matrix cannot be computed or stored in memory due to the high number of parameters of a deep network (Section 6).
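The optimal quadratic forgetting step can be checked on a toy ridge-regression problem. The following sketch (hypothetical random data, numpy only) trains on all the data, applies the Newton forgetting step, and verifies that the result coincides with retraining from scratch on the remaining data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 1e-2
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
Xr, yr = X[:40], y[:40]  # remaining data D_r; the last 10 rows are D_f

def fit(A, b):
    # Exact minimizer of the ridge loss 0.5*||A w - b||^2 + 0.5*lam*||w||^2.
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

w = fit(X, y)  # weights trained on all the data

# Forgetting step of eq. 1: a Newton step on the loss of the remaining data.
H_r = Xr.T @ Xr + lam * np.eye(d)      # Hessian on D_r
g_r = Xr.T @ (Xr @ w - yr) + lam * w   # gradient on D_r, evaluated at w
w_scrubbed = w - np.linalg.solve(H_r, g_r)

# For a quadratic loss the step is exact: it lands on the retrained optimum.
print(np.allclose(w_scrubbed, fit(Xr, yr)))  # prints True
```

Because the loss is exactly quadratic, a single Newton step from any starting point reaches the minimizer on $\mathcal{D}_r$, so no information about the forgotten rows survives in the scrubbed weights.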

Convex forgetting. Unfortunately, for more general machine learning models we do not have a closed-form expression for the optimal forgetting step. However, it can be shown [27] that eq. 1 is always a first-order approximation of the optimal forgetting. [19] shows that for strongly convex Lipschitz loss functions, the discrepancy between eq. 1 and the optimal forgetting is bounded. Since this discrepancy – even if bounded – can leak information, a possible solution is to add a small amount of noise after forgetting:

$$S(w) = w - H_{\mathcal{D}_r}^{-1} \nabla L_{\mathcal{D}_r}(w) + \sigma \epsilon, \qquad (2)$$

where $\epsilon \sim \mathcal{N}(0, I)$ is a vector of random Gaussian noise, which aims to destroy any information that may leak due to small discrepancies. Increasing the variance $\sigma^2$ of the noise destroys more information, thus making the forgetting more secure, but also reduces the accuracy of the model, since the weights become increasingly random. The curve of possible Pareto-optimal trade-offs between accuracy and forgetting can be formalized with the Forgetting Lagrangian [15].

Alternatively, to forget data in a strongly convex problem, one can fine-tune the weights on the remaining data using perturbed projected-GD [35]. Since projected-GD converges to the unique minimum of a strongly convex function regardless of the initial condition (contrary to SGD, which may not converge unless proper learning rate scheduling is used), this is guaranteed to remove all influence of the initial data [35]. The downside is that gradient descent (GD) is impractical for large-scale deep learning applications compared to SGD, projection-based algorithms are not popular in practice, and the commonly used loss functions are not generally Lipschitz.

Non-convex forgetting. Due to the highly non-convex loss landscape, small changes in the training data can cause large changes in the final weights of a deep network. This makes the application of eq. 2 challenging. [15] shows that pre-training helps increase the stability of SGD, derives an expression similar to eq. 2 for DNNs, and also provides a way to upper-bound the amount of remaining information in a DNN. [16] builds on recent results in linear approximation of DNNs and approximates the training path of a DNN with that of its linear approximation. While this improves the forgetting results, the approximation is still not good enough to remove all the information. Moreover, computing the forgetting step scales quadratically with the number of training samples and classes, which restricts the applicability of the algorithm to smaller datasets.

5 Mixed-Linear Forgetting

Let $f_w(x)$ be the output of a deep network model with weights $w$ computed on an input image $x$. For ease of notation, assume that the core dataset $\mathcal{D}_c$ and the user dataset $\mathcal{D}_u$ share the same output space (for example, the same set of classes, for a classification problem). After training a set of weights $w_c$ on the core dataset $\mathcal{D}_c$, we would like to further perturb those weights to fine-tune the network on the user data $\mathcal{D}_u$. We can think of this as solving the two minimization problems:

$$w_c = \arg\min_{w} L_{\mathcal{D}_c}(w), \qquad (3)$$
$$w_u = \arg\min_{w_u} L_{\mathcal{D}_c \cup \mathcal{D}_u}(w_c + w_u), \qquad (4)$$

where we can think of the user weights $w_u$ as a perturbation of the core weights $w_c$ that adapts them to the user task. However, since the deep network $f_w(x)$ is not a linear function of the weights $w$, the loss function can be highly non-convex. As discussed in the previous section, this makes forgetting difficult. However, if the perturbation $w_u$ is small, we can hope for a linear approximation of the DNN around $w_c$ to have a performance similar to fine-tuning the whole network [33], while at the same time granting us ease of forgetting.

Motivated by this, we introduce the following model, which we call the Mixed-Linear Forgetting model (ML-model):

$$f^{\mathrm{ML}}_{(w_c, w_u)}(x) = f_{w_c}(x) + \nabla_w f_w(x)\big|_{w = w_c} \cdot w_u. \qquad (5)$$

The model can be seen as a first-order Taylor approximation of the effect of fine-tuning the original deep network $f_w(x)$. It has two sets of weights: a set of non-linear core weights $w_c$, which enter the model through the non-linear network $f_{w_c}(x)$, and a set of linear user weights $w_u$, which enter the model linearly. Even though the model is linear in $w_u$, it is still a highly non-linear function of the input $x$ due to the non-linear activations in $f_{w_c}$.
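To make the structure of the model concrete, here is a minimal numpy sketch of the Mixed-Linear output for a toy two-layer network. The architecture and sizes are hypothetical, and the Jacobian-vector product is approximated with a symmetric finite difference rather than the exact forward-mode computation of [33]:

```python
import numpy as np

def net(w, x):
    # Toy two-layer network; the architecture and sizes are hypothetical.
    W1, W2 = w[:8].reshape(4, 2), w[8:].reshape(3, 4)
    h = W1 @ x
    return W2 @ np.maximum(0.1 * h, h)  # Leaky ReLU, as suggested by [2]

def ml_model(w_core, w_user, x, eps=1e-6):
    # Mixed-Linear output: f_{w_c}(x) + (Jacobian of f at w_c) @ w_user.
    # The JVP is approximated by a symmetric finite difference; an exact
    # forward-mode computation (as in [33]) would be used in practice.
    jvp = (net(w_core + eps * w_user, x) - net(w_core - eps * w_user, x)) / (2 * eps)
    return net(w_core, x) + jvp

rng = np.random.default_rng(0)
w_c = rng.normal(size=20)        # non-linear core weights
w_u = 0.1 * rng.normal(size=20)  # linear user weights (a small perturbation)
x = rng.normal(size=2)

# The correction term is linear in w_u: doubling w_u doubles it.
print(ml_model(w_c, w_u, x) - net(w_c, x))
```

The printed correction is exactly the linear-in-$w_u$ term of the model; the core output $f_{w_c}(x)$ is unchanged by the user weights.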

We train the model by solving two separate minimization problems:

$$w_c^* = \arg\min_{w_c} L_{\mathcal{D}_c}(w_c), \qquad (6)$$
$$w_u^* = \arg\min_{w_u} L^{\mathrm{MSE}}_{\mathcal{D}_c \cup \mathcal{D}_u}(w_u). \qquad (7)$$

Eq. (6) is akin to pretraining the weights $w_c$ on the core dataset $\mathcal{D}_c$, while eq. 7 fine-tunes the linear weights $w_u$ on all the data $\mathcal{D}_c \cup \mathcal{D}_u$. This ensures that the weights $w_c$ will only contain information about the core dataset $\mathcal{D}_c$, while all information about the user data $\mathcal{D}_u$ is contained in $w_u$. Also note that we introduce two separate loss functions for the core and the user data. To train the user weights we use a mean square error (i.e., $\ell_2$) loss [23, 34, 17]:

$$L^{\mathrm{MSE}}_{\mathcal{D}}(w_u) = \sum_{(x,y) \in \mathcal{D}} \frac{1}{2}\big\|f^{\mathrm{ML}}_{(w_c^*, w_u)}(x) - y\big\|^2 + \frac{\lambda}{2}\|w_u\|^2, \qquad (8)$$

where $y$ is a one-hot encoding of the class label. This loss has the advantage that the weights $w_u$ are the solution to a quadratic problem, in which case the optimal forgetting step can be written in closed form (see eq. 1). On the other hand, since we do not need to remove any information from the weights $w_c$, we can train them using any loss in eq. 3. We pick the standard cross-entropy loss, although this choice is not fundamental for our method.
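Since the model is linear in the user weights, the MSE loss above is an ordinary ridge-regression problem. A small sketch with hypothetical one-dimensional-output "Jacobian features" shows the resulting closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 6, 1e-1
J = rng.normal(size=(n, p))  # hypothetical Jacobian features, one row per sample
f0 = rng.normal(size=n)      # hypothetical core-model outputs f_{w_c}(x_i)
y = rng.normal(size=n)       # targets (one-hot rows in the classification case)

# With the linearized model f(x) = f0(x) + J(x) @ w_u, the MSE loss
#   0.5*||f0 + J w_u - y||^2 + 0.5*lam*||w_u||^2
# is a ridge-regression problem in w_u with a closed-form solution:
w_u = np.linalg.solve(J.T @ J + lam * np.eye(p), J.T @ (y - f0))

# The gradient vanishes at the solution, confirming optimality.
grad = J.T @ (f0 + J @ w_u - y) + lam * w_u
print(np.allclose(grad, 0))  # prints True
```

For a real network the explicit solve is of course infeasible (the Jacobians are huge), which is why the paper optimizes eq. 7 with SGD instead.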

5.1 Optimizing the Mixed-Linear model

Ideally, we want the ML-model to have an accuracy on the user data similar to that of a standard non-linear network. At the same time, we want the ML-model to perform significantly better than simply training a linear classifier on top of the last-layer features of the core network, which is the trivial baseline method to train a linear model for an object classification task. In fig. 1 (see section 8 for details) we see that this is indeed the case: while linear, the ML-model is still flexible enough to fit the data with accuracy comparable to the fully non-linear model (DNN). However, some considerations are in order regarding how to train our ML-model.

Training the core model. Eq. (3) reduces to the standard training of a DNN on the core dataset using the cross-entropy loss. We train using SGD with an annealed learning rate. In case the core data is composed of multiple datasets, for example ImageNet and a second dataset closer to the user task, we first pretrain on ImageNet and then fine-tune on the other dataset.

Training the Mixed-Linear model. Training the linear weights of the Mixed-Linear model in eq. 4 is slightly more involved, since we need to compute the Jacobian-vector product (JVP) of the network. While a naïve implementation would require a separate backward pass for each sample, [33, 37] show that the JVP for a batch of samples can be computed easily for deep networks using a slightly modified forward pass. The modified forward pass has only double the computational cost of a standard forward pass, and the cost can be further reduced by linearizing only the final layers of the network. Using the algorithm of [33] to compute the model output, eq. 4 reduces to a standard optimization problem, which we again solve with SGD with an annealed learning rate. Note that, since the problem is quadratic, we could use more powerful quasi-Newton methods to optimize it; however, we avoid this to keep the analysis simple, since optimization speed is not the focus of this paper.

Architecture changes. We observe that a straightforward application of [33] to a standard pre-trained ResNet-50 tends to under-perform in our setting (fine-tuning on large-scale vision tasks). In particular, it achieves only slightly better performance than training a linear classifier on top of the last-layer features. Following the suggestion of [2], we replace the ReLUs with Leaky ReLUs, since this boosts the accuracy of linearized models.

Figure 1: The Mixed-Linear model has accuracy comparable to a standard DNN. Plot of the test errors on different datasets using different models. We take a ResNet-50 pretrained on ImageNet and fine-tune it using different procedures: (DNN) we fine-tune the whole network on the various datasets; (Mixed-Linear) we fine-tune the linearized ResNet-50 (eq. 5) in the mixed-privacy setting; (Last-Layer Features) we fine-tune only the final fully connected (FC) layer of the ResNet-50. Fine-tuning a linearized DNN in the mixed-privacy framework performs comparably to fine-tuning a DNN, and outperforms simply fine-tuning the last FC layer.

6 Forgetting procedure

The user weights $w_u$ are obtained by minimizing the quadratic loss function of section 5 on the user data $\mathcal{D}_u$. Let $\mathcal{D}_f \subset \mathcal{D}_u$ denote a subset of samples we want to forget (by hypothesis $\mathcal{D}_f \subset \mathcal{D}_u$, i.e., the core data is not going to change) and let $\mathcal{D}_r = \mathcal{D} \setminus \mathcal{D}_f$ denote the remaining data. As discussed in Section 4, in the case of a quadratic training loss the optimal forgetting step to delete $\mathcal{D}_f$ is given by:

$$S(w_u) = w_u - \Delta w, \qquad (9)$$

where we define $\Delta w = H_{\mathcal{D}_r}^{-1} \nabla L^{\mathrm{MSE}}_{\mathcal{D}_r}(w_u)$, and we can explicitly write the Hessian of the loss of section 5 as:

$$H_{\mathcal{D}_r} = \sum_{x \in \mathcal{D}_r} \nabla_w f_{w_c^*}(x)^T \, \nabla_w f_{w_c^*}(x) + \lambda I, \qquad (10)$$

where $\lambda$ is the regularization coefficient and $I$ is the identity matrix of the same size as $w_u$. Thus, forgetting amounts to computing the update step in eq. 9. Unfortunately, even if we can easily write the Hessian in closed form, we cannot store it in memory, much less invert it. Instead, we now discuss how to find an approximation of the forgetting step by solving an optimization problem which does not require constructing or inverting the Hessian.

Since $H_{\mathcal{D}_r}$ is positive definite, we can define the auxiliary loss function

$$L_{\mathrm{aux}}(v) = \frac{1}{2} v^T H_{\mathcal{D}_r} v - v^T \nabla L^{\mathrm{MSE}}_{\mathcal{D}_r}(w_u). \qquad (11)$$

It is easy to show that the forgetting update $\Delta w$ is the unique minimizer of $L_{\mathrm{aux}}(v)$, so we can recast computing the forgetting update as simply minimizing the loss $L_{\mathrm{aux}}(v)$ using SGD. In general, the Hessian-vector product in eq. 11 can be computed efficiently without constructing the Hessian using the Hessian-vector product algorithm of [27]. However, in our case we have a better alternative, due to the fact that we use the MSE loss and that the ML-model is linear in weight space: using eq. 10, we can easily show that

$$H_{\mathcal{D}_r} v = \sum_{x \in \mathcal{D}_r} \nabla_w f_{w_c^*}(x)^T \big(\nabla_w f_{w_c^*}(x) \cdot v\big) + \lambda v, \qquad (12)$$

where $\nabla_w f_{w_c^*}(x) \cdot v$ is a Jacobian-vector product which can be computed efficiently (see Section 5.1). Using this result, we compute the (approximate) minimizer of eq. 11 using SGD. When optimizing eq. 11, we compute the gradient term exactly and approximate the Hessian in eq. 10 by Monte-Carlo sampling. In fig. 4, we show that this method outperforms full stochastic minimization of eq. 11.
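A toy numpy sketch of this idea (with hypothetical random Jacobians, and plain gradient descent standing in for the paper's mini-batch SGD): the auxiliary loss is minimized using only Hessian-vector products, and the result matches the exact, impractical Hessian solve:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 20, 1e-1
J = rng.normal(size=(n, p))  # hypothetical per-sample Jacobians of the ML-model
g = rng.normal(size=p)       # gradient of the loss on the remaining data

def hvp(v):
    # H v = sum_x J(x)^T (J(x) v) + lam * v, using only matrix-vector products;
    # the p x p Hessian itself is never materialized.
    return J.T @ (J @ v) + lam * v

# Minimize L_aux(v) = 0.5 v^T H v - g^T v by gradient descent (the paper uses
# SGD over mini-batches; full-batch GD keeps the sketch short).
v = np.zeros(p)
lr = 1.0 / (np.linalg.norm(J, 2) ** 2 + lam)  # step size below 1/L for stability
for _ in range(500):
    v -= lr * (hvp(v) - g)  # gradient of L_aux at v

# The minimizer coincides with the exact (impractical) Hessian solve.
print(np.allclose(v, np.linalg.solve(J.T @ J + lam * np.eye(p), g)))  # prints True
```

For a real network, each `hvp` call would be one JVP followed by one backward pass per mini-batch, so memory stays linear in the number of user weights.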

Mixed-Linear Forgetting. Let $\Delta w_t$ be the approximate minimizer of eq. 11 obtained by training with SGD for $t$ iterations. Our forgetting procedure for the ML-model, which we call Mixed-Linear (ML) Forgetting, is:

$$S(w_u) = w_u - \Delta w_t + \sigma \epsilon, \qquad (13)$$

where $\epsilon \sim \mathcal{N}(0, I)$ is a random noise vector [15, 16]. As mentioned in Section 4, we need to add noise to the weights since $\Delta w_t$ is only an approximation of the optimal forgetting step, and the small difference may still contain information about the original data. By adding noise, we destroy the remaining information. Larger values of $\sigma$ ensure better forgetting, but can reduce the performance of the model. In the next sections, we analyze theoretically and practically the role of $\sigma$.

Sequential forgetting. In practical applications, we may receive several separate requests to forget the data in a sequential fashion. In such cases, we simply apply the forgetting procedure in eq. 13 on the weights obtained at the end of the previous step. A key component is to ensure that the performance of the system does not deteriorate too much after many sequential requests, which we do next.

7 Bounds on Remaining Information

We now derive bounds on the amount of information that an attacker can extract from the weights of the model after applying the scrubbing procedure of eq. 13. This will also guide us in selecting the optimal noise scale $\sigma$ and the number of iterations $t$ used to approximate the forgetting step that are necessary to reach a given privacy level (see fig. 2). Let $A$ denote some attribute of interest regarding $\mathcal{D}_f$ that an attacker might want to access; then from Proposition 1 in [15] we have:

$$I(A; S(w)) \le I(\mathcal{D}_f; S(w)),$$

where $S$ is the scrubbing/forgetting method which, given weights $w$ trained on $\mathcal{D}$, removes information about $\mathcal{D}_f$ (in our case, eq. 13). Hence, bounding the amount of information about $\mathcal{D}_f$ that remains in the weights after forgetting uniformly bounds all the information that an attacker can extract.

We now upper-bound the remaining information after applying the forgetting procedure in eq. 13 to our ML-model over multiple forgetting requests. Let $\mathcal{D}_f = \bigcup_{i=1}^{k} \mathcal{D}_f^{i}$ be the total data asked to be forgotten at the end of $k$ forgetting requests, and let $w^{(k)}$ be the weights obtained by applying the forgetting procedure in eq. 13 sequentially. Then we seek to provide a bound on the mutual information between the two, i.e., $I(\mathcal{D}_f; w^{(k)})$. We prove the following theorem.

Theorem 1 (Informal).

Let $\Delta w_t$ be the approximate update step used in eq. 13, obtained by minimizing the auxiliary loss in eq. 11 with $t$ steps of SGD with mini-batch size $b$, and let $\beta$ be the smoothness constant of the loss in eq. 7. Consider a sequence of $k$ equally sized forgetting requests and let $w^{(k)}$ be the weights obtained after the $k$ requests using eq. 13. Then we have the following bound on the amount of information remaining in the weights about the forgotten data:

$$I\big(\mathcal{D}_f; w^{(k)}\big) \;\le\; \mathcal{O}\!\left(\frac{k}{\sigma^2}\left(A\, c^{2t} + \frac{B}{b}\right)\right), \qquad (14)$$

where $c < 1$ is a contraction rate determined by the SGD step size and the smoothness constant $\beta$, $A$ bounds the initial error of the optimization, and $B$ bounds the variance of the stochastic gradients.

Figure 2: Forgetting-accuracy trade-off. Plots of the amount of remaining information in the weights about the data to forget (red, left axis) and of the test error (blue, right axis) as a function of (top) the scrubbing noise $\sigma$ and (bottom) the number of optimization iterations used to compute the scrubbed weights in eq. 13. We aim to forget 10% of the training data through 10 forgetting requests on the Caltech-256 (left) and FGVC-Aircraft (right) datasets. Note that the remaining information in the weights decreases with an increase in the forgetting noise or in the number of epochs during forgetting, as predicted by the bound in theorem 1. Increasing the forgetting noise also increases the test error after forgetting (top). In terms of computational efficiency, doing 2-3 passes over the data (i.e., 2-3 epochs) per forgetting request is sufficient for forgetting (in terms of the test error and the remaining information), rather than re-training from scratch for 50 epochs (bottom), thus providing a 16-25x speed-up per forgetting request. We fine-tune the ML-Forgetting model for 50 epochs while training the user weights. Values for $\sigma$ and the number of iterations can be chosen using these trade-off curves, given a desired privacy level.

[35] provides a similar probabilistic bound on the distance of the scrubbed weights from the optimal weights, for strongly convex Lipschitz loss functions trained using projected GD. We prove our bound for the more general case of a convex loss function with $\ell_2$ regularization trained using SGD (instead of GD), and we also bound the remaining information in the weights.

Role of $\sigma$. We make some observations regarding eq. 14. First, increasing the variance $\sigma^2$ of the noise added to the weights after the forgetting step reduces the possible leakage of information from an imperfect approximation. Of course, the downside is that increasing the noise may reduce the performance of the model (see fig. 2 (top) for the trade-off between the two).

Forgetting with more iterations. Running the optimization for an increasing number of steps $t$ improves the accuracy of the forgetting step, and hence reduces the amount of remaining information. We confirm this empirically in fig. 2 (bottom). Note, however, that there are diminishing returns: the variance of the stochastic optimization eventually overshadows the gains in accuracy from longer optimization (see the additive term depending on the batch size). Increasing the mini-batch size $b$ used to compute the update in eq. 13 reduces the variance of the estimate and leads to better convergence.

Fraction of data to forget. Finally, forgetting a smaller fraction of the data is easier. On the other hand, increasing the number of parameters of the model may make the forgetting more difficult.

Figure 3: Readout functions for different forgetting methods. We forget a subset of 10% of the training data through 10 equally-sized sequential deletion requests using different forgetting methods, and show the values of several readout functions for the resulting scrubbed models. Ideally, the value of a readout function should be the same as the value (denoted by the green area) obtained on a model re-trained from scratch without those samples; closer to the green area is better. (Original) denotes the trivial baseline where we do not apply any forgetting procedure. (Fisher) adds Fisher noise as described in [15]. (ML-Forgetting) is the model obtained with our method after forgetting. In all cases, we observe that ML-Forgetting obtains a model that is indistinguishable from one trained from scratch without the data, whereas the other methods fail to do so. This is particularly the case for the Re-learn Time readout function, which exploits full knowledge of the weights and is therefore more difficult to defend against.

8 Experiments

We use a ResNet-50 [22] as the model in ML-Forgetting. Unless specified otherwise, in all the experiments we forget around 10% of randomly chosen training data through 10 equally sized sequential forgetting requests. In the appendix, we also provide results for forgetting an entire class and show that our method is invariant to the choice of the subset to be forgotten. More experimental details can be found in the appendix.

Datasets used. We test our method on the following image classification tasks: Caltech-256 [18], MIT-67 [38], Stanford Dogs [26], CUB-200 [40], FGVC Aircrafts [31], CIFAR-10 [29]. Readout function and forgetting-accuracy trade-off plots for MIT-67, Stanford Dogs, CUB-200 and CIFAR-10 can be found in the appendix.

8.1 Readout functions

The forgetting procedure should be such that an attacker with access to the scrubbed weights cannot construct a function that leaks information about the set to forget. More precisely, the scrubbing procedure should be such that, for every read-out function:


where the baseline is some function that does not depend on the set to forget (it only depends on the subset of the data to retain). The two distributions of weights being compared (induced by the stochastic training algorithm) are those obtained by minimizing the empirical risk on the full dataset and then applying the scrubbing update defined in eq. 13, and by minimizing the empirical risk on the retain set alone, respectively. For an ideal forgetting procedure, the value of the readout functions (or evaluation metrics) should be the same for the model obtained after forgetting and for a model re-trained from scratch without the data to forget. Some common choices of readout functions include (see Figure 3):


  1. Error on the retain, forget, and test sets: The scrubbed model and the model re-trained from scratch on the retain set should have similar accuracy on all three subsets of the data.

  2. Re-learn Time: We fine-tune the scrubbed model (the model after forgetting) and the re-trained model for a few iterations on a subset of the training data (which includes the data to forget) and compute the number of iterations it takes for each model to re-learn the forgotten samples. An ideal forgetting procedure should have a re-learn time comparable to that of the re-trained model (we plot the relative re-train time in Figure 3). Re-learn time serves as a proxy for the amount of information remaining in the weights about the forgotten data (see fig. 3).

  3. Activation Distance: We compute the distance between the final activations of the scrubbed model and the re-trained model on different subsets of the data. We compare the scrubbed models corresponding to the original weights without any forgetting, the weights after adding Fisher noise, and ML-Forgetting (see fig. 3). This serves as a proxy for the amount of information remaining in the activations about the forgotten data.

  4. Membership Attack: We construct a simple yet effective membership attack similar to [16]: we compute the entropy of the scrubbed model's outputs on the retain set and the test set, and label them class 0 and 1 respectively (i.e., we create a binary classification problem with the output entropies on the retain and test sets as the two classes). We then fit a weighted support vector classifier (SVC) to this data (weighted because the retain and test sets may be of unequal size) and compute the attack success on the data to forget. If the SVC identifies the samples in the forget set as class 0, the attack is successful, because the adversary has identified from the scrubbed model what data it was initially trained on. Ideally, a forgetting procedure should have the same attack success as a re-trained model (see fig. 3).
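The membership attack can be sketched in a few lines. The snippet below is a simplified, hypothetical variant, not the paper's implementation: since the features are one-dimensional entropies, a midpoint threshold stands in for the weighted SVC, and the model outputs are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(probs):
    """Shannon entropy of each row of a (batch, classes) probability matrix."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Hypothetical model outputs: a model that memorized its training data is
# over-confident (low entropy) on retained samples, less so on test samples.
retain_logits = rng.normal(size=(500, 10)) * 6.0   # confident -> low entropy
test_logits = rng.normal(size=(500, 10)) * 1.5     # uncertain -> high entropy

h_retain = entropy(softmax(retain_logits))  # class 0
h_test = entropy(softmax(test_logits))      # class 1

# 1-D classifier on entropy: threshold at the midpoint of the class means
# (a simplified stand-in for the weighted SVC used in the paper).
threshold = (h_retain.mean() + h_test.mean()) / 2.0

def attack_success(h_forget):
    """Fraction of forget-set samples classified as 'retained' (class 0)."""
    return float(np.mean(h_forget < threshold))
```

For an ideally scrubbed model, outputs on the forget set should look like test outputs, so `attack_success` should match that of a model retrained from scratch.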

8.2 Complete vs Stochastic residual gradient

Figure 4: Comparison of complete and stochastic residual gradient estimation for forgetting. ML-Forgetting uses the complete residual gradient for the forgetting step, exploiting the fact that the loss function is quadratic. However, one can also estimate it stochastically, which is equivalent to fine-tuning on the remaining data. Here we show that both methods work but that, when using the same number of steps, the complete estimate gives a better solution due to smaller variance and faster convergence (lower test error and information leakage).

In eq. 13 we compute the residual gradient exactly, once, over the remaining data, instead of estimating that term stochastically with mini-batches. In Figure 4 we compare the two methods of computing the residual gradient. We show that, in the ideal region of noise, both the remaining information and the test error after forgetting (10% of the data through 10 requests) are lower when the residual gradient is computed completely.
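The comparison can be illustrated on a toy quadratic problem. This is a minimal sketch on synthetic data, not the paper's setup: we contrast gradient descent with the complete gradient against a constant-batch stochastic estimate over the same number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic (least-squares) loss over the retained data, as in the
# linearized model: L_r(w) = (1/N_r) * ||X w - y||^2.
N_r, d = 2000, 30
X = rng.normal(size=(N_r, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N_r)

def full_grad(w):
    """Complete gradient over all retained samples."""
    return 2.0 / N_r * X.T @ (X @ w - y)

def stochastic_grad(w, batch=32):
    """Mini-batch estimate of the same gradient."""
    idx = rng.choice(N_r, size=batch, replace=False)
    return 2.0 / batch * X[idx].T @ (X[idx] @ w - y[idx])

def run(grad_fn, steps=500, lr=0.02):
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # exact minimizer on D_r
err_full = np.linalg.norm(run(full_grad) - w_star)
err_stoch = np.linalg.norm(run(stochastic_grad) - w_star)
```

With a fixed step count and learning rate, the complete gradient converges to the minimizer, while the stochastic estimate plateaus at a noise floor set by its variance, mirroring the gap in Figure 4.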

8.3 Effect of choosing different core datasets

Figure 5: Effect of using a core data close to the task. Plot of the remaining information and test error on Aircrafts using (a) generic ImageNet core data, and (b) ImageNet pre-training + 30% of the Aircrafts. When the core data contain information close to the user task that the network does not need to forget, ML-Forgetting can exploit this to create better core-weights and a correspondingly better linearized model. This improves both the accuracy of the model and makes forgetting easier, as seen from the accuracy-forgetting curves in the plot.

For fine-grained datasets like FGVC-Aircrafts and CUB-200, we show that if the core data has some information about the user task, then it improves forgetting significantly both in terms of the remaining information and the test accuracy. In fig. 5, we show that using ImageNet + 30% of the Aircrafts (we assume that we are not asked to forget this 30% of the data) as core data and 100% of the Aircrafts as the user data, performs much better than simply using ImageNet as core. In fig. 5(right), we also show that increasing the percentage of user distribution in the core data improves the test accuracy of the Mixed-Linear model.

9 Conclusion

We provide a practical forgetting procedure to remove the influence of a subset of the data from a trained image classification model. We achieve this by linearizing the model in a mixed-privacy setting, which enables us to split the weights into a set of core weights and forgettable user weights. When asked to delete all the user data, we can simply discard the user weights. The quadratic nature of the training loss enables us to efficiently forget a subset of the user data without compromising the accuracy of the model. In terms of time complexity, we only need 2-3 passes over the dataset per forgetting query to remove information from the weights, rather than the 50 epochs needed to re-train, thus providing a 16× or more speed-up per request (see fig. 2). We test the forgetting procedure against various read-out functions, and show that it performs comparably to a model re-trained from scratch (the ideal paragon). Finally, we also provide theoretical guarantees on the amount of remaining information in the weights and verify the behavior of the information bounds empirically through extensive evaluation in fig. 2.

Even though we provide a forgetting procedure for deep networks by linearizing them without compromising their accuracy, efficiently removing information directly from highly non-convex deep networks largely remains an open problem.



In this appendix we provide additional experiments (Appendix A), experimental details (Appendix B) and theoretical results (Appendix C).

Appendix A Additional Experiments

A.1 Forgetting an entire class

In the main paper we considered forgetting a random subset of 10% of the training data. Here we consider instead the problem of completely forgetting all samples of a given class in a single forgetting request. In figs. 7 and 6, we observe that also in this setting our proposed method outperforms other methods and is robust to different readout functions. Note that for the case of removing an entire class the target forget error (i.e. the error on the class to forget) is 100%.

Figure 6: Readout function plot similar to fig. 3 for the Caltech-256 dataset, where we forget an entire class rather than a sequence of randomly sampled data subsets.
Figure 7: Readout function plot similar to fig. 3 for the FGVC-Aircrafts dataset, where we forget an entire class rather than a sequence of randomly sampled data subsets.

A.2 Role of L2-Regularization

We plot the amount of remaining information and the test error as a function of the regularization coefficient. Note that instead of incorporating weight decay directly in the optimization step, as is often done, we explicitly add the regularization term to the loss function. As expected theoretically (theorem 3), increasing the regularization coefficient makes the training optimization problem more strongly convex, which in turn makes forgetting easier. However, increasing the weight decay too much also hurts the accuracy of the model. Hence there is a trade-off between the amount of remaining information and the accuracy of the model as the regularization coefficient varies. We plot this trade-off in fig. 8.
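The effect of the regularization coefficient on the speed of the forgetting optimization can be illustrated on a toy quadratic. This is only a hedged sketch with synthetic data and hypothetical constants: a larger L2 coefficient raises the strong-convexity constant, so gradient descent gets closer to its (regularized) minimizer in the same number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic training loss with explicit L2 regularization:
#   L(w) = (1/N) * ||X w - y||^2 + lam * ||w||^2
N, d = 1000, 20
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d)

def gd_distance(lam, steps=50, lr=0.05):
    """Distance to the regularized minimizer after `steps` of gradient descent."""
    H = 2.0 / N * X.T @ X + 2.0 * lam * np.eye(d)  # Hessian of L(w)
    g = -2.0 / N * X.T @ y                          # gradient of L at w = 0
    w_star = np.linalg.solve(H, -g)                 # exact regularized minimizer
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * (H @ w + g)
    return float(np.linalg.norm(w - w_star))

# Heavier regularization converges faster to its own minimizer...
dist_weak = gd_distance(lam=0.001)
dist_strong = gd_distance(lam=0.5)
```

...but, as noted above, the heavily regularized minimizer is more biased, which is the accuracy side of the trade-off.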

Figure 8: Plot of the amount of remaining information and test error vs the regularization coefficient. We forget 10% of the training data sequentially through 10 forgetting requests.

A.3 More experiments using SGD for forgetting

We repeat the same experiments as in fig. 3 on the following datasets: Stanford Dogs, MIT-67, CIFAR-10, CUB-200, FGVC Aircrafts. Overall, we observe consistent results over all datasets.

Figure 9: Same experiments as fig. 3 for StanfordDogs.
Figure 10: Same experiments as fig. 3 for MIT-67.
Figure 11: Same experiments as fig. 3 for CIFAR-10.
Figure 12: Same experiments as fig. 3 for CUB-200.
Figure 13: Same experiments as fig. 3 for FGVC-Aircrafts.

A.4 Information vs Noise/Epochs

Figure 14: Same experiment as Figure 2 for the Stanford Dogs and CUB-200 datasets.
Figure 15: Same experiment as Figure 2 for MIT-67.

Appendix B Experimental Details

We use a ResNet-50 pre-trained on ImageNet. For the plots in fig. 1, we train the ML-Forgetting model using SGD for 50 epochs with batch size 64, learning rate lr=0.05, momentum=0.9, and weight decay=0.00001, where the learning rate is annealed by a factor of 0.1 at epochs 25 and 40. We explicitly add the regularization term to the loss function instead of incorporating it in the SGD update equation. We only linearize the final layers of the ResNet-50 and scale the one-hot targets by 5 when using the MSE loss. For the fine-grained datasets, FGVC-Aircrafts and CUB-200, in addition to the ImageNet pre-training, we also pre-train on a randomly sampled 30% of the training data (which we assume is part of the core set).
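The step learning-rate schedule described above can be written as a small helper. This is only a sketch mirroring the stated hyper-parameters (base lr 0.05, annealed by a factor of 0.1 at epochs 25 and 40), not the actual training code:

```python
def lr_at_epoch(epoch, base_lr=0.05, milestones=(25, 40), gamma=0.1):
    """Step schedule: the base rate is multiplied by `gamma` at each
    milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

So epochs 0-24 train at 0.05, epochs 25-39 at 0.005, and epochs 40-49 at 0.0005, matching the annealing schedule above.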

For training the ML-Forgetting model in the readout-function and information plots using SGD, we use the same experimental settings as above with an increased weight decay of 0.0005 for Caltech-256, Stanford Dogs and CIFAR-10, and 0.001 for MIT-67, CUB-200 and FGVC-Aircrafts. We use a higher value of weight decay to increase the strong-convexity constant of the training loss, which facilitates forgetting (see lemma 4).

For forgetting with the ML-Forgetting model in the readout-function/information plots using SGD (ML-Forgetting minimizes eq. 11), we use momentum=0.999 and decrease the learning rate by a factor of 0.5 per epoch. We run SGD for 3 epochs with an initial lr=0.01 for Caltech-256, Stanford Dogs and CIFAR-10, and for 4 epochs with an initial lr=0.025 for MIT-67, CUB-200 and FGVC-Aircrafts.

Appendix C Theoretical Results

Lemma 1.

Let be two random vectors such that . Then we have the following, for any :


for any , where (a) follows from the Cauchy-Schwarz inequality and (b) follows from the AM-GM inequality. ∎
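In the notation of two random vectors x and y, the inequality chain presumably has the following form. This is a reconstruction under the assumption that step (a) is the Cauchy-Schwarz inequality and step (b) is the AM-GM inequality, not the verbatim original statement:

```latex
\mathbb{E}\,\langle x, y \rangle
\;\overset{(a)}{\leq}\; \mathbb{E}\big[\,\|x\|\,\|y\|\,\big]
\;\overset{(b)}{\leq}\; \frac{t}{2}\,\mathbb{E}\,\|x\|^{2} \;+\; \frac{1}{2t}\,\mathbb{E}\,\|y\|^{2}
\qquad \text{for any } t > 0 .
```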

Lemma 2.

Let and be a strongly convex function with , and such that . Then , we have that . When is also quadratic with , the maximum eigenvalue of the Hessian, we have that .


Let , where then from Mean Value Theorem (MVT) we have that for some in between 0 and 1. This implies that , and . Thus from MVT we get:


where (a) follows from the Cauchy-Schwarz inequality and (b) follows from the fact that and , .

When is quadratic, then we can always write , where is a constant symmetric matrix and . From our definition of we can write:

where (a) follows from the definition of and (b) follows from the triangle inequality. Substituting this result in Equation 17 we get:

Lemma 3.

Consider a function , where , and is a dataset of size . Let , then .


Thus, , where (a) follows from the assumption that the loss is non-negative, (b) follows from the fact that it is the minimizer of the objective, and (c) follows from the assumption bounding the loss. Note that the result is independent of the particular dataset; thus, the empirical risk minimizers of the datasets obtained by removing a subset of samples will also lie within a sphere of the same radius.
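The argument can be summarized in one chain. This is a sketch under the assumption that the objective is $f(w) = \frac{1}{N}\sum_i \ell_i(w) + \lambda\|w\|^2$ with each loss $\ell_i$ non-negative and $\ell_i(0) \le B$:

```latex
\lambda \,\|w^{*}\|^{2}
\;\overset{(a)}{\leq}\; f(w^{*})
\;\overset{(b)}{\leq}\; f(0)
\;\overset{(c)}{\leq}\; B
\quad\Longrightarrow\quad
\|w^{*}\| \;\leq\; \sqrt{B/\lambda},
```

so the minimizer lies in a ball whose radius depends only on the bound $B$ and the regularization coefficient $\lambda$, not on which samples are in the dataset.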

Lemma 4.

Consider a function , where , is a strongly convex function with . Let (the retain set) be the dataset remaining after removing samples (the forget set) from , i.e. . Let and . Let and . Then we have that:

When is also quadratic with , the smoothness constant of , we have that .


We use the same technique as proposed in [35].

where the first inequality follows from the fact that is the minimizer of .

From Lemma 3 we know that . Also from the definition of we have that . Then applying Lemma 2 with for we get that


From the definition of we know that it is a strongly convex function, so we have the following property:


Substituting Equation 19 and Equation 20 in the inequality above, we get:

When is also quadratic with , the smoothness constant, then from lemma 2 we have .

Lemma 5.

Let be a convex and smooth function with minimizer . Then we have that:


From the definition of smoothness we have that:


Setting , and using the fact that , we get that:

For (2) we minimize Equation 21 with respect to :


where (a) follows from the result that . Setting in the smoothness inequality above and re-arranging the terms, we get (2). ∎

Theorem 2.

(SGD) Consider