Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Neural Networks

11/12/2019 ∙ by Aditya Golatkar, et al. ∙ Amazon 20

We explore the problem of selectively forgetting a particular set of data used for training a deep neural network. While the effects of the data to be forgotten can be hidden from the output of the network, insights may still be gleaned by probing deep into its weights. We propose a method for “scrubbing” the weights clean of information about a particular set of training data. The method does not require retraining from scratch, nor access to the data originally used for training. Instead, the weights are modified so that any probing function of the weights, computed with no knowledge of the random seed used for training, is indistinguishable from the same function applied to the weights of a network trained without the data to be forgotten. This condition is weaker than Differential Privacy, which seeks protection against adversaries that have access to the entire training process, and is more appropriate for deep learning, where a potential adversary might have access to the trained network, but generally, have no knowledge of how it was trained.




1 Introduction

Say you are the number ‘6’ in the MNIST handwritten digit database. You are proud of having nurtured the development of convolutional neural networks and their many beneficial uses. But you are beginning to feel uncomfortable with the attention surrounding the new “AI Revolution,” and long to not be recognized everywhere you appear. You wish a service existed, like that offered by the firm Lacuna INC in the screenplay The Eternal Sunshine of the Spotless Mind, whereby you could submit your images to have your identity scrubbed clean from handwritten digit recognition systems. Before you, the number ‘9’ already demanded that digit recognition systems return, instead of a ten-dimensional “pre-softmax” vector (meant to approximate the log-likelihood of an image containing a number from 0 to 9), a nine-dimensional vector that excludes the number ‘9’. So now, every image showing ‘9’ yields an outcome at random between 0 and 8. Is this enough? It could be that the system still contains information about the number ‘9’, and just suppresses it in the output. How do you know that the system has truly forgotten about you, even inside the black box? Is it possible to scrub the system so clean that it behaves as if it had never seen an image of you? Is it possible to do so without sabotaging information about other digits, who wish to continue enjoying their celebrity status? In the next section we formalize these questions to address the problem of selective forgetting in deep neural networks (DNNs). Before doing so, we present a summary of our contributions in the context of related work.

1.1 Related Work

Tampering with a learned model to achieve, or avoid, forgetting pertains to the general field of life-long learning. Specifically for the case of deep learning and representation learning, this topic has algorithmic, architectural and modeling ramifications, which we address in order.

Differential privacy [5] focuses on guaranteeing that the parameters of a trained model do not contain information about any particular individual. While this may be relevant in some applications, the condition is often too difficult to enforce in deep learning (although see [1]), and not always necessary. More importantly, differential privacy aims at making it impossible to extract information given knowledge of both the dataset and the training procedure. The requirement is that the distribution of possible weights given the dataset, $P(w \mid \mathcal{D})$, changes by only a small, bounded amount when a sample is replaced. Our definition of selective forgetting can be seen as a generalization of differential privacy. In particular, we do not require that information about every sample in the dataset be minimized, but only information about a particular subset selected by the user. Moreover, we allow a “scrubbing” function $S(w)$ to perturb the weights in order to remove information, so that $P(S(w) \mid \mathcal{D})$, rather than $P(w \mid \mathcal{D})$, needs to remain unchanged. This less restrictive setting allows us to train standard deep neural networks using SGD, while still being able to guarantee forgetting.

Deep Neural Networks can memorize details about particular instances, rather than only shared characteristics [24, 3]. This makes forgetting critical, as attackers can try to extract information from the weights of the model. Membership attacks [23, 10, 19, 9, 21] attempt to determine whether a particular cohort of data was used for training, without any constructive indication on how to actively forget it. They relate to the ability of recovering data from the model [7], which exploits the increased confidence of the model on the training data to reconstruct images used for training; [18] proposes a method for performing zero-shot knowledge distillation by generating data impressions from the parameters of the teacher model to train the student model. Most techniques attempt to extract information from the confidence scores of samples, exploiting the fact that such confidence is higher for training images than for unseen samples. [20] proposes a definition of forgetting based on changes of the value of the loss function. We show that this is not meaningful forgetting, and in some cases it may lead to the (opposite) “Streisand effect,” where the sample to be forgotten is actually made more noticeable.

Stability of SGD. In [8], a bound is derived on the divergence of the training paths of models trained with the same random seed (i.e., same initialization and sampling order) on datasets that differ by one sample (the “stability” of the training path). This can be thought of as a form of cross-validation and used to bound the generalization error of the network. It can also be thought of as a bound on how much of a sample the network has memorized. While these bounds are often loose, we introduce a novel bound on the information that remains about a set of samples to be forgotten, which exploits ideas from both the stability bounds and the PAC-Bayes bounds [17], which have been successful even for DNNs [6].

The term “forgetting” is also used frequently in life-long learning, but often with different connotations than in our work: catastrophic forgetting refers to a network trained on a task rapidly losing accuracy on that task when fine-tuned for another. But while the network can forget a task, the information on the data it used may still be accessible from the weights. Hence, even catastrophic forgetting does not satisfy our stronger definition (we show this in Table 1). Interestingly, however, our proposed solution for forgetting relates to techniques used to avoid forgetting: [12] suggests adding an $L_2$ regularizer using the Fisher Information Matrix of the task. We use the Fisher Information Matrix, restricted to the samples we wish to retain, to compute optimal noise to destroy information, so that a cohort can be forgotten while maintaining good accuracy for the remaining samples. Part of our forgetting algorithm can be interpreted as performing “optimal brain damage” [14] in order to remove information from the weights if it is useful only or mainly to the class to be forgotten.

In this paper we talk about the weights of a network as containing “information,” even though we have only one set of weights, whereas information is commonly defined only for random variables. While this has caused some confusion in the literature, the issue has been recently formalized by [2]. Thus, we will use the term “information” liberally even when talking about a particular set of weights and dataset.

In defining forgetting, we wish to be resistant to both “black-box” attacks, which only have access to the model through some function (API) that returns the model output given the inputs, and “white-box” attacks, where the attacker can download and run an exact copy of the model. Since at this point it is unclear how much information about a model can be recovered by looking only at its inputs and outputs, it is also unclear how much less powerful black-box attacks are. To avoid unforeseen weaknesses, we give our definition of forgetting for the stronger case of white-box attacks, and derive bounds and defense mechanism for this situation.

1.2 Contributions

In summary, our contributions are, first, to propose a definition of selective forgetting for trained neural network models. It is not as simple as obfuscating the activations, and not as restrictive as Differential Privacy. Second, we propose a scrubbing procedure that removes information from the trained weights, without the need to access the original training data or to re-train the entire network. We compare the scrubbed network to the gold-standard model(s) trained from scratch without any knowledge of the data to be forgotten. We also prove the optimality of this procedure in the quadratic case. The approach is applicable to the case where an entire class needs to be forgotten (e.g., the number ‘6’), to multiple classes (e.g., all odd numbers), and to a particular set of data within a class, while still maintaining output knowledge of that class (e.g., samples of class ‘6’ hand-written by a particular individual, or all odd number samples furnished by a particular individual). Our approach is applicable to networks pre-trained using standard loss functions, such as cross-entropy, unlike Differential Privacy methods that require the training to be conducted in a special manner. Third, we introduce a computable upper bound on the amount of retained information, which can be efficiently computed even for deep neural networks. We further characterize the optimal trade-off with preserving complementary information. We illustrate the criteria using the MNIST and CIFAR10 datasets, in addition to a new dataset called “Lacuna.”

1.3 Preliminaries and notation

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ be a dataset of images $x_i$, each with an associated label $y_i \in \{1, \dots, K\}$ representing a class (or label, or identity). We assume that $(x_i, y_i) \sim P(x, y)$ are drawn from an unknown distribution $P$.

Let $\mathcal{D}_f \subset \mathcal{D}$ be a subset of the data (cohort), whose information we want to remove (scrub) from a trained model, and let its complement $\mathcal{D}_r := \mathcal{D} \setminus \mathcal{D}_f$ be the data that we want to retain. The data to forget $\mathcal{D}_f$ can be any subset of $\mathcal{D}$, but we are especially interested in the case where $\mathcal{D}_f$ consists of all the data with a given label $k$ (that is, we want to completely forget about a class), or a subset of a class.

Let $\phi_w(\cdot) \in \mathbb{R}^K$ be a parametric function (model), for instance a deep neural network, with parameters $w$ (weights) trained using $\mathcal{D}$ so that the $k$-th component of the vector $\phi_w$ in response to an image $x$ approximates the optimal discriminant (log-posterior), $\phi_w(x)_k \simeq \log P(y = k \mid x)$, up to a normalizing constant.

We use some notions from Information Theory. In particular, we use the Kullback-Leibler divergence (or KL-divergence) between two distributions, $\mathrm{KL}(p(x) \,\|\, q(x)) := \mathbb{E}_{x \sim p(x)}[\log(p(x)/q(x))]$, which is always non-negative and is zero if and only if $p(x) = q(x)$. While it is not a distance (in particular it is not symmetric), it can be considered a measure of the divergence of the two distributions. The mutual information between two random variables $x$ and $y$ can be defined as $I(x; y) := \mathbb{E}_{x}[\mathrm{KL}(p(y \mid x) \,\|\, p(y))]$.
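The two quantities just defined can be illustrated on small discrete distributions. The following is a minimal sketch; the helper names `kl_divergence` and `mutual_information` and the example probability tables are illustrative choices, not part of the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(joint):
    """I(x; y) = E_x[ KL(p(y|x) || p(y)) ] for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return sum(p_x[i] * kl_divergence(joint[i] / p_x[i], p_y)
               for i in range(len(p_x)) if p_x[i] > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)   # differs from kl_qp: the divergence is not symmetric
kl_qp = kl_divergence(q, p)

# For independent variables the joint factorizes and the mutual information vanishes.
independent = np.outer([0.3, 0.7], [0.4, 0.6])
mi_independent = mutual_information(independent)
```

The asymmetry between `kl_pq` and `kl_qp` is the reason the KL-divergence is not a distance, as noted above.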

1.4 Training algorithm and distribution of weights

Given a dataset $\mathcal{D}$, we can train a model (or, equivalently, a set of weights $w$) using some training algorithm $A$, that is, $w = A(\mathcal{D})$. The training algorithm may have some stochasticity (for example arising from SGD or the random initial condition), in which case $A(\mathcal{D})$ is a stochastic function. In any case, we may talk about the distribution $P(w \mid \mathcal{D})$ of the possible outcomes of training. If training is deterministic, $P(w \mid \mathcal{D})$ will be a degenerate (Dirac delta) distribution. We will find it very useful for $S(w)$, the scrubbing function, to also be a stochastic function. For instance, it may add noise to the weights to destroy information. Putting it all together, $P(S(w) \mid \mathcal{D})$ is the distribution of possible outcomes after training on the dataset $\mathcal{D}$ and then forgetting.

2 Definition and testing of forgetting

Let $\phi_w(\cdot)$ be a model trained on a dataset $\mathcal{D} = \mathcal{D}_f \sqcup \mathcal{D}_r$, where $\mathcal{D}_f$ denotes some data that has to be forgotten. Then, a forgetting (or “scrubbing”) procedure consists in applying a function $S(w)$ to the weights, with the goal of ensuring that an “attacker” (algorithm) in possession of the model $\phi_w$ cannot compute some “readout function” $f(w)$ to reconstruct information about $\mathcal{D}_f$.

It should be noted that one can always infer some properties of $\mathcal{D}_f$, even without having ever seen it. For example, if $\mathcal{D}$ consists of images of faces, we can infer that images in $\mathcal{D}_f$ are likely to display two eyes, even without looking at the model $w$. What matters for forgetting is the amount of additional information that $f(\cdot)$ can extract about a cohort $\mathcal{D}_f$ by exploiting the weights $w$, and that could not have been inferred simply from its complement $\mathcal{D}_r$. This can be formalized as follows:

Definition 1.

Given a readout function $f$, an optimal scrubbing function for $f$ is a function $S(w, \mathcal{D}_f)$ (or simply $S(w)$, omitting the argument $\mathcal{D}_f$) such that there is a function $S_0(w)$ that does not depend on $\mathcal{D}_f$ for which:

$$\mathrm{KL}\big(P(f(S(w)) \mid \mathcal{D}) \,\|\, P(f(S_0(w)) \mid \mathcal{D}_r)\big) = 0. \qquad (1)$$

This can be interpreted as measuring how much the distribution of possible readout values applied to a model trained on $\mathcal{D}$ and then scrubbed of $\mathcal{D}_f$ differs from the readout values of a model trained without having ever seen $\mathcal{D}_f$. The formal connection between the KL-divergence above and the amount of Shannon Information that can be extracted is given by the following:

Proposition 1.

Let $Y$ be an attribute of interest that depends on $\mathcal{D}_f$, considered as a random variable (for instance, by re-sampling the subset from the same class distribution). Then,

$$I(Y; f(S(w))) \le \mathbb{E}_{\mathcal{D}_f}\big[\mathrm{KL}\big(P(f(S(w)) \mid \mathcal{D}) \,\|\, P(f(S_0(w)) \mid \mathcal{D}_r)\big)\big].$$

Note that, since the mutual information is defined only for random variables, the above proposition requires considering $\mathcal{D}_f$ as a random variable. This is done only to relate the notion of forgetting to a well-known information measure, but we do not follow this path in the rest of the paper.

Another interpretation of (1) arises from noticing that, if that quantity is zero, then, given the output of the readout function, we cannot predict with better-than-chance accuracy whether the model was trained with or without the data. In other words, after forgetting, any membership attack would fail.

In general, we may not know what readout function a potential attacker will use, and hence we want to be robust to every $f(w)$. Consider the following:

Lemma 1.

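For any function $f(w)$ we have

$$\mathrm{KL}\big(P(f(S(w)) \mid \mathcal{D}) \,\|\, P(f(S_0(w)) \mid \mathcal{D}_r)\big) \le \mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big).$$

Given this result, we can then focus on minimizing

$$\mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big), \qquad (2)$$

which guarantees robustness to any readout function.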

For the sake of concreteness, we give a first simple example of a possible scrubbing procedure.

Example 1 (Forgetting by adding noise).

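Assume the weights $w$ of the model are bounded. Let $S(w) = S_0(w) = w + \sigma n$, where $n \sim N(0, I)$, be a scrubbing procedure that adds a sample from a Gaussian random variable. Then, as the variance $\sigma^2$ increases, we achieve total forgetting:

$$\mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big) \to 0 \quad \text{as } \sigma \to \infty.$$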
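A small numerical illustration of this example, under a simplifying assumption: for a single fixed random seed, training with and without the cohort yields two fixed weight vectors, so the scrubbed distributions are Gaussians with equal covariance $\sigma^2 I$ and different means, whose KL-divergence is $\|w - w'\|^2 / (2\sigma^2)$. The weight values below are hypothetical.

```python
import numpy as np

def gaussian_kl_same_cov(mu_p, mu_q, sigma):
    """KL( N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I) ) in nats."""
    diff = np.asarray(mu_p, dtype=float) - np.asarray(mu_q, dtype=float)
    return float(diff @ diff) / (2.0 * sigma ** 2)

# Hypothetical weights from training with (w) and without (w_prime) the cohort.
w = np.array([1.0, -2.0, 0.5])
w_prime = np.array([0.8, -1.5, 0.9])

sigmas = [0.1, 1.0, 10.0, 100.0]
kls = [gaussian_kl_same_cov(w, w_prime, s) for s in sigmas]
for s, kl in zip(sigmas, kls):
    print(f"sigma={s:>6}: KL = {kl:.6f} nats")
```

The divergence shrinks as $1/\sigma^2$, but the same noise is added to every weight regardless of its importance, which is what makes the model useless at large $\sigma$.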

While adding noise with a large variance does indeed help forgetting, it throws away the baby along with the bath water, rendering the model useless. Instead, we want to forget as much as possible about a cohort while retaining information about its complement. This can be formalized by minimizing the Forgetting Lagrangian:

$$\mathcal{L} = \mathbb{E}_{S(w)}\big[L_{\mathcal{D}_r}(w)\big] + \lambda \, \mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big), \qquad (3)$$

where $L_{\mathcal{D}_r}(w)$ denotes the loss of the model $w$ on the retained data $\mathcal{D}_r$. Optimizing the first term, say the cross-entropy loss on the data to be remembered, is relatively easy. The problem is doing so while also minimizing the second (forgetting) term: for a DNN, the distribution $P(w \mid \mathcal{D})$ of possible outcomes of the training process is high-dimensional, complex and multi-modal, and we are not likely to be able to model it with enough precision to estimate the KL divergence above. Nevertheless, the Forgetting Lagrangian, if optimized, captures the notion of selective forgetting at the core of our paper.

2.1 Stability and local forgetting bound

Notice that for many algorithms we may expect that removing some data points does not significantly change the outcome of the training, that is, the training algorithm is “stable”. Formally, let $A(\mathcal{D})$ be a (possibly stochastic) training algorithm. The stochasticity of the algorithm can be made explicit by writing $A(\mathcal{D}) = A(\mathcal{D}, \epsilon)$ for some deterministic function $A(\mathcal{D}, \epsilon)$ and a “random seed” $\epsilon$. We say that $A$ is stable if $A(\mathcal{D}, \epsilon)$ is close to $A(\mathcal{D}', \epsilon)$ whenever $\mathcal{D}$ and $\mathcal{D}'$ differ only by a few samples. In this case, we may expect the two distributions $P(S(w) \mid \mathcal{D})$ and $P(S_0(w) \mid \mathcal{D}_r)$ to also be close. Using this intuition, we now show that we can indeed exploit the stability of the learning algorithm to bound the Forgetting Lagrangian.

Proposition 2 (Local Forgetting Bound).

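Let $A(\mathcal{D}, \epsilon)$ be a training algorithm with random seed $\epsilon \sim P(\epsilon)$. Notice that in this case $P(S(w) \mid \mathcal{D}) = \mathbb{E}_{\epsilon}[P(S(w) \mid \mathcal{D}, \epsilon)]$. We then have the bound:

$$\mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big) \le \mathbb{E}_{\epsilon}\big[\mathrm{KL}\big(P(S(w) \mid \mathcal{D}, \epsilon) \,\|\, P(S_0(w) \mid \mathcal{D}_r, \epsilon)\big)\big].$$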

We refer to this as the “local forgetting bound” because, instead of worrying about the global distribution of possible outcomes as the random seed varies, it allows us to average the local results of forgetting using a particular random seed. To see the value of this bound, consider the following example.

Example 2 (Gaussian noise forgetting).

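Consider the case where $S(w) = h(w) + n$ and $S_0(w) = w + n'$, with $n, n' \sim N(0, \Sigma)$ Gaussian noise and $h$ a deterministic function. Since for a fixed random seed $\epsilon$ the weights $w = A(\mathcal{D}, \epsilon)$ are a deterministic function of the data, we simply have $P(S(w) \mid \mathcal{D}, \epsilon) = N(h(A(\mathcal{D}, \epsilon)), \Sigma)$ and similarly $P(S_0(w) \mid \mathcal{D}_r, \epsilon) = N(A(\mathcal{D}_r, \epsilon), \Sigma)$. Using the previous bound and the closed form expression for the KL-divergence of Gaussians, we then have:

$$\mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big) \le \frac{1}{2} \mathbb{E}_{\epsilon}\big[(h(w) - w')^T \Sigma^{-1} (h(w) - w')\big], \qquad (4)$$

where $w = A(\mathcal{D}, \epsilon)$ and $w' = A(\mathcal{D}_r, \epsilon)$.

That is, we managed to upper-bound the complex term $\mathrm{KL}(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r))$ with a much simpler term which can easily be estimated by averaging the results of training with a few different random seeds.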
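The averaging over seeds can be sketched on a toy problem. The following is an illustration only: it assumes the bound takes the form $\frac{1}{2}\mathbb{E}_\epsilon[(h(w) - w')^T \Sigma^{-1} (h(w) - w')]$ with $w$ and $w'$ obtained by training with and without the cohort using the same seed, and it uses a hypothetical `train` function (gradient descent on a least-squares loss) in place of a real training algorithm.

```python
import numpy as np

def train(X, y, seed, steps=50, lr=0.1):
    """Hypothetical training algorithm A(D, eps): gradient descent on the
    mean squared error from a seed-dependent random initialization."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def local_forgetting_bound(X, y, forget_idx, Sigma, h=lambda w: w, seeds=range(20)):
    """Average 0.5 * (h(w) - w')^T Sigma^{-1} (h(w) - w') over random seeds."""
    keep = np.setdiff1d(np.arange(len(y)), forget_idx)
    Sigma_inv = np.linalg.inv(Sigma)
    terms = []
    for seed in seeds:
        w = train(X, y, seed)                    # trained on all of D
        w_prime = train(X[keep], y[keep], seed)  # trained on D_r, same seed
        diff = h(w) - w_prime
        terms.append(0.5 * diff @ Sigma_inv @ diff)
    return float(np.mean(terms))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
forget_idx = np.arange(10)  # cohort to forget: the first 10 samples

bound_small_noise = local_forgetting_bound(X, y, forget_idx, 0.01 * np.eye(5))
bound_large_noise = local_forgetting_bound(X, y, forget_idx, 1.0 * np.eye(5))
```

With `h` left as the identity, the only way to lower the bound is to add more noise; a better choice of `h` shrinks `diff` itself.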

Moreover, this suggests three simple but general procedures to forget. Assuming the training algorithm is stable, training with or without the sensitive data will give different but close results $w = A(\mathcal{D}, \epsilon)$ and $w' = A(\mathcal{D}_r, \epsilon)$. In order to forget we can either (i) apply a function $h(w)$ that maps $w$ closer to $w'$ (i.e., minimize $h(w) - w'$ in eq. 4), or (ii) add noise whose covariance $\Sigma$ is high in the direction $h(w) - w'$, or (iii) both. Indeed, this will be the basis of our forgetting algorithm, which we describe next.

3 The optimal quadratic scrubbing algorithm

In this section, we derive an optimal scrubbing algorithm under a local quadratic approximation. We then validate the method empirically in complex real world problems where the assumptions are challenged.

We start with very strong assumptions: that the loss is everywhere quadratic, and that the dynamics are given by a continuous gradient flow (that is, gradient descent in the limit of a very small learning rate), rather than from discrete stochastic gradient descent steps. We will then weaken these assumptions.

Proposition 3 (Optimal quadratic scrubbing algorithm).

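Let the loss be $L_{\mathcal{D}}(w) = L_{\mathcal{D}_f}(w) + L_{\mathcal{D}_r}(w)$, and assume both $L_{\mathcal{D}}(w)$ and $L_{\mathcal{D}_r}(w)$ are quadratic. Assume that the optimization algorithm $A_t(\mathcal{D}, \epsilon)$ at time $t$ is given by the gradient flow of the loss with a random initialization. Consider the function

$$h(w) = w + e^{-Bt} e^{At} d - e^{-Bt}(d - d_r) - d_r,$$

where $A = \nabla^2 L_{\mathcal{D}}(w)$, $B = \nabla^2 L_{\mathcal{D}_r}(w)$, $d = A^{-1} \nabla_w L_{\mathcal{D}}(w)$ and $d_r = B^{-1} \nabla_w L_{\mathcal{D}_r}(w)$, and $e^{At}$ denotes the matrix exponential. Then, $h(w)$ is such that $h(A_t(\mathcal{D}, \epsilon)) = A_t(\mathcal{D}_r, \epsilon)$ for all random initializations $\epsilon$ and all times $t$. In particular, this means that the scrubbing function $S(w) = h(w)$ perfectly scrubs all the information:

$$\mathrm{KL}\big(P(S(w) \mid \mathcal{D}, \epsilon) \,\|\, P(w \mid \mathcal{D}_r, \epsilon)\big) = 0.$$

Notice that when $t \to \infty$, this reduces to the Newton update:

$$S_\infty(w) = w - B^{-1} \nabla_w L_{\mathcal{D}_r}(w).$$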

Notice that we do not need to assume that the algorithm is close to convergence.
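The proposition can be checked numerically on a synthetic quadratic problem. The sketch below uses the closed-form gradient flow $w(t) = w^* + e^{-Ht}(w_0 - w^*)$ and the scrubbing map $h(w) = w + e^{-Bt} e^{At} d - e^{-Bt}(d - d_r) - d_r$, whose signs follow from substituting $w^*_{\mathcal{D}} = w - d$ and $w^*_{\mathcal{D}_r} = w - d_r$ into that closed form. The two quadratic losses, their random Hessians, and all helper names are assumptions made for the illustration.

```python
import numpy as np

def expm_sym(M, t):
    """Matrix exponential e^{M t} for a symmetric matrix M."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.exp(vals * t)) @ vecs.T

def gradient_flow(H, w_star, w0, t):
    """Closed-form gradient flow on a quadratic with Hessian H and minimum w_star."""
    return w_star + expm_sym(H, -t) @ (w0 - w_star)

def scrub(w, A, B, grad_D, grad_Dr, t):
    """h(w) = w + e^{-Bt} e^{At} d - e^{-Bt} (d - d_r) - d_r."""
    d = np.linalg.solve(A, grad_D)
    d_r = np.linalg.solve(B, grad_Dr)
    eB = expm_sym(B, -t)
    return w + eB @ expm_sym(A, t) @ d - eB @ (d - d_r) - d_r

rng = np.random.default_rng(0)
dim = 4

def random_spd():
    M = rng.normal(size=(dim, dim))
    return M @ M.T / dim + 0.5 * np.eye(dim)

# L_Df(w) = 0.5 (w - a)^T A_f (w - a),  L_Dr(w) = 0.5 (w - b)^T B (w - b)
A_f, B = random_spd(), random_spd()
a, b = rng.normal(size=dim), rng.normal(size=dim)
A = A_f + B                                    # Hessian of L_D = L_Df + L_Dr
w_star_D = np.linalg.solve(A, A_f @ a + B @ b)

w0, t = rng.normal(size=dim), 0.7              # shared initialization, finite time
w = gradient_flow(A, w_star_D, w0, t)          # trained on D (not converged)
w_prime = gradient_flow(B, b, w0, t)           # trained on D_r, same initialization

grad_D = A_f @ (w - a) + B @ (w - b)
grad_Dr = B @ (w - b)
w_scrubbed = scrub(w, A, B, grad_D, grad_Dr, t)
max_err = float(np.max(np.abs(w_scrubbed - w_prime)))
```

Because the check runs at a finite time `t`, it also illustrates that convergence is not required. At `t = 0` the map reduces to the identity, as expected when both runs still sit at the shared initialization.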

3.1 Robust scrubbing

In general we cannot expect the procedure of Proposition 3 to work perfectly, either because the optimization procedure is not an exact gradient flow or because the loss is not exactly quadratic, or a combination of the two. Hence, we introduce the following robust scrubbing procedure, which exploits the remaining degree of freedom we have from Equation 4: adding noise.

Proposition 4 (Robust scrubbing procedure).

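Assume that $h(w)$ is close to $w'$ up to some normally distributed error $h(w) - w' \sim N(0, \Sigma_h)$, and assume that $L_{\mathcal{D}_r}(w)$ is (locally) quadratic around $h(w)$. Then the optimal scrubbing procedure in the form $S(w) = h(w) + n$, with $n \sim N(0, \Sigma)$, that minimizes the Forgetting Lagrangian (eq. 3) is obtained when $\Sigma B \Sigma = \lambda \Sigma_h$, where $B = \nabla^2 L_{\mathcal{D}_r}(h(w))$. In particular, if the error is isotropic, that is, $\Sigma_h = \sigma_h^2 I$ is a multiple of the identity, we have $\Sigma = \sqrt{\lambda \sigma_h^2} \, B^{-1/2}$.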
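The isotropic solution can be checked numerically. The sketch below assumes the objective takes the form $\frac{1}{2}\mathrm{tr}(B\Sigma) + \frac{\lambda}{2}\mathrm{tr}(\Sigma^{-1}\Sigma_h)$ up to constants, which is what a quadratic loss and the Gaussian bound give under the stated assumptions; this is one reading of the proposition, not a quotation of its proof. The matrix `B` and the values of `lam` and `sigma_h2` are arbitrary test values.

```python
import numpy as np

def sym_power(M, p):
    """M^p for a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * vals ** p) @ vecs.T

def optimal_noise_cov(B, lam, sigma_h2):
    """Sigma = sqrt(lam * sigma_h^2) * B^{-1/2} (isotropic-error case)."""
    return np.sqrt(lam * sigma_h2) * sym_power(B, -0.5)

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
B = M @ M.T + np.eye(3)       # hypothetical Hessian of the retained-data loss
lam, sigma_h2 = 0.1, 0.04

Sigma = optimal_noise_cov(B, lam, sigma_h2)
# Stationarity condition: Sigma B Sigma = lam * Sigma_h with Sigma_h = sigma_h^2 I.
residual = Sigma @ B @ Sigma - lam * sigma_h2 * np.eye(3)

def objective(S):
    """Expected loss increase plus lam times the Gaussian forgetting bound."""
    return 0.5 * np.trace(B @ S) + 0.5 * lam * sigma_h2 * np.trace(np.linalg.inv(S))

obj_opt = objective(Sigma)
obj_perturbed = [objective(Sigma + 0.05 * np.diag(rng.uniform(0, 1, 3)))
                 for _ in range(5)]
```

Directions in which `B` has small eigenvalues (weights that matter little for the retained data) receive the most noise.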

Putting this together with the result in Proposition 3 gives us the following robust scrubbing procedure:

$$S_t(w) = w + e^{-Bt} e^{At} d - e^{-Bt}(d - d_r) - d_r + (\lambda \sigma_h^2)^{1/4} B^{-1/4} \epsilon, \qquad (5)$$

where $\epsilon \sim N(0, I)$ and $A$, $B$, $d$, $d_r$ are as in Proposition 3. In Figure 1 we show the effect of the scrubbing procedure on a simple logistic regression problem (which is not quadratic) trained with SGD (which does not satisfy the assumptions regarding the dynamics). Nonetheless, the scrubbing procedure manages to bring the value of the KL divergence almost to zero.

Figure 1: (Top) Distributions of weights $P(w \mid \mathcal{D})$ and $P(w \mid \mathcal{D}_r)$ before and after the scrubbing procedure is applied to forget the samples $\mathcal{D}_f$. The scrubbing procedure makes the two distributions indistinguishable, thus preventing an attacker from extracting any information about $\mathcal{D}_f$. The KL divergence measures the maximum amount of information that an attacker can extract. After forgetting, less than 1 nat of information about the cohort is accessible. (Bottom) The effect of the scrubbing procedure on the distribution of possible classification boundaries obtained after training. After forgetting the patient of the top left blue cluster, the classification boundaries adjust as if they never existed, and the distribution mimics the one that would have been obtained by training from scratch without that data (bottom row).

When $t \to \infty$, this simplifies to the noisy Newton update, which can be more readily applied:

$$S_t(w) = w - B^{-1} \nabla_w L_{\mathcal{D}_r}(w) + (\lambda \sigma_h^2)^{1/4} B^{-1/4} \epsilon. \qquad (6)$$

Here $\lambda$ is a hyperparameter that controls the trade-off between the increase in loss due to the scrubbing procedure and the amount of removed information, and $\sigma_h^2$ is a hyperparameter that controls how much we trust our approximation to correctly model the optimization dynamics.

It should also be noted that the term $e^{At}$ in eq. 5 may diverge for $t \to \infty$ if the actual dynamics of the optimization algorithm do not exactly satisfy the hypotheses. In practice, we found it useful to replace it with $e^{\mathrm{clamp}(A, 1/t)\, t}$, where $\mathrm{clamp}(A, 1/t)$ clamps the eigenvalues of $A$ so that they are smaller than or equal to $1/t$, and the expression does not diverge. (Footnote 1: If $A = U S U^T$ is an eigenvalue decomposition of the symmetric matrix $A$, we define $\mathrm{clamp}(A, m) = U \min(S, m) U^T$.) Notice also that in general $e^{-Bt} e^{At} \neq e^{(A - B)t}$, unless $A$ and $B$ commute, but we found this to be a good approximation in practice.
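The clamping operation can be sketched with an eigendecomposition. This is a minimal illustration assuming a symmetric matrix $A$ and the definition $\mathrm{clamp}(A, m) = U \min(S, m) U^T$; the function names and the diagonal test matrix are illustrative.

```python
import numpy as np

def clamp(A, m):
    """clamp(A, m) = U min(S, m) U^T for a symmetric matrix A = U S U^T."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.minimum(vals, m)) @ vecs.T

def clamped_exp(A, t):
    """e^{clamp(A, 1/t) t}: every eigenvalue of the exponent is at most 1."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.exp(np.minimum(vals, 1.0 / t) * t)) @ vecs.T

A = np.diag([0.01, 0.5, 3.0])
t = 100.0
E = clamped_exp(A, t)                       # unclamped, e^{3 * 100} would overflow in practice
largest = float(np.linalg.eigvalsh(E).max())
```

Since the clamped exponent has eigenvalues of at most 1, the eigenvalues of the result never exceed $e$, whatever the value of `t`.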

3.2 Forgetting using a subset of the data

Once a model is trained, a request to forget $\mathcal{D}_f$ may be initiated by providing that cohort, as in the fictional service of Lacuna INC, but in general one may no longer have the remainder of the dataset used for training, $\mathcal{D}_r$, available for re-training. However, assuming we are at a minimum of $L_{\mathcal{D}}(w)$, we have $\nabla L_{\mathcal{D}}(w) = 0$. Hence, we can rewrite $\nabla L_{\mathcal{D}_r}(w) = -\nabla L_{\mathcal{D}_f}(w)$ and $\nabla^2 L_{\mathcal{D}_r}(w) = \nabla^2 L_{\mathcal{D}}(w) - \nabla^2 L_{\mathcal{D}_f}(w)$. Using these identities, instead of recomputing the gradients and Hessian on the whole dataset, we can simply use those computed on the cohort to be forgotten, provided we cached the Hessian $\nabla^2 L_{\mathcal{D}}(w)$ we obtained at the end of the training on the original dataset $\mathcal{D}$.
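These identities can be illustrated on a least-squares problem, where the loss is exactly quadratic and a single Newton step lands on the retrained solution. The sketch assumes a summed (not averaged) squared loss so that $L_{\mathcal{D}} = L_{\mathcal{D}_f} + L_{\mathcal{D}_r}$ holds exactly; the data and variable names are synthetic, and the retained data is used only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=100)
X_f, y_f = X[:10], y[:10]      # cohort to forget
X_r, y_r = X[10:], y[10:]      # retained data (only used to check the result)

# Summed squared loss, so that L_D = L_Df + L_Dr holds exactly.
H_D = X.T @ X                              # cached Hessian on the full dataset
w = np.linalg.solve(H_D, X.T @ y)          # minimum of L_D, so grad L_D(w) = 0

grad_f = X_f.T @ (X_f @ w - y_f)           # gradient on the cohort only
H_f = X_f.T @ X_f                          # Hessian on the cohort only

grad_r = -grad_f                           # grad L_Dr(w) = -grad L_Df(w)
H_r = H_D - H_f                            # Hessian of L_Dr from cached quantities
w_scrubbed = w - np.linalg.solve(H_r, grad_r)   # Newton step toward the D_r optimum

w_retrained = np.linalg.solve(X_r.T @ X_r, X_r.T @ y_r)
```

For a deep network neither the exact minimum nor the exact Hessian is available, so the agreement with retraining would only be approximate.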

3.3 Hessian approximation and Fisher Information

In practice, the Hessian is too expensive to compute for a DNN. In general, we cannot even ensure it is positive definite. To address both issues, we use the Levenberg-Marquardt positive semi-definite approximation:

$$\nabla^2 L_{\mathcal{D}}(w) \simeq \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \nabla_w \log p_w(y \mid x) \, \nabla_w \log p_w(y \mid x)^T. \qquad (7)$$

Note that for some loss functions, including most of those used in machine learning, this approximation of the Hessian coincides with the Fisher Information Matrix, which opens the door to information-theoretic interpretations of the scrubbing procedure. We also notice that the approximation of the Hessian with the Fisher Information Matrix is exact for some problems, such as linear regression and linear logistic regression.
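The exactness claim for linear logistic regression can be checked directly. In the sketch below the Fisher Information Matrix is computed with labels averaged under the model's own predictive distribution $p_w(y \mid x)$; that averaging convention is an assumption of this illustration, as are the function names and the random inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hessian(w, X):
    """Hessian of the mean negative log-likelihood of binary logistic regression."""
    p = sigmoid(X @ w)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X)

def fisher_information(w, X):
    """Mean over inputs of E_{y ~ p_w(y|x)}[ g g^T ], with g = grad_w log p_w(y|x)."""
    p = sigmoid(X @ w)
    F = np.zeros((X.shape[1], X.shape[1]))
    for x_i, p_i in zip(X, p):
        g1 = (1 - p_i) * x_i      # gradient of log p_w(y=1 | x)
        g0 = -p_i * x_i           # gradient of log p_w(y=0 | x)
        F += p_i * np.outer(g1, g1) + (1 - p_i) * np.outer(g0, g0)
    return F / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)
H = logistic_hessian(w, X)
F = fisher_information(w, X)
```

Both matrices reduce to the average of $p(1 - p)\, x x^T$, and the Fisher form is positive semi-definite by construction since it is a sum of outer products.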

4 Deep Network Scrubbing

Deep networks challenge many of the assumptions we used to derive the forgetting procedure in eq. 6. In this section we show how to adapt these methods. We present two variants. The first uses the Fisher Information Matrix (FIM) of the network. However, since this depends on the network gradients, it may not be robust when the loss landscape of the network is rough. To solve this, we present a more robust method that attempts to directly minimize the Forgetting Lagrangian (eq. 3) through a variational optimization procedure.

4.1 Fisher forgetting

As mentioned above, we approximate the Hessian using the Fisher Information Matrix (eq. 7). Since the whole matrix would be too large to store in memory, we only compute its diagonal. We note that a better approximation could be obtained using the Kronecker-factorized approximation of the FIM [16]. In our experiments, we found that for deep networks the diagonal of the FIM is not a good enough approximation to take a full Newton step as eq. 6 suggests. Rather, we limit ourselves to just adding noise, hence relying on the stability of SGD for the two points to be close. This results in the simplified scrubbing procedure:

$$S(w) = w + (\lambda \sigma_h^2)^{1/4} F^{-1/4} \epsilon, \qquad \epsilon \sim N(0, I),$$

where $F$ is the FIM computed at the point $w$. Here $\lambda$ is a hyperparameter that mediates the trade-off between the amount of forgetting and the increase in error of the scrubbing procedure, as shown in Figure 2.
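With a diagonal FIM the noise is applied independently to each weight. The following sketch assumes the update $w + (\lambda \sigma_h^2)^{1/4} F^{-1/4} \epsilon$ with $\epsilon \sim N(0, I)$; the small constant `eps` added to the diagonal for numerical stability, the function name `fisher_forget`, and the example values are assumptions of the illustration, not details from the paper.

```python
import numpy as np

def fisher_forget(w, fisher_diag, lam, sigma_h2, rng, eps=1e-8):
    """Return w + (lam * sigma_h^2)^{1/4} * F^{-1/4} * noise, with diagonal F.
    `eps` (an assumption) guards against division by a zero Fisher entry."""
    scale = (lam * sigma_h2) ** 0.25 * (fisher_diag + eps) ** -0.25
    return w + scale * rng.normal(size=w.shape)

rng = np.random.default_rng(0)
w = np.array([0.3, -1.2, 0.8, 2.0])
# Hypothetical diagonal FIM: the first two weights carry almost no
# information about the retained data, the last two carry a lot.
fisher_diag = np.array([1e-6, 1e-6, 5.0, 5.0])

samples = np.stack([fisher_forget(w, fisher_diag, 1e-2, 1.0, rng)
                    for _ in range(2000)])
per_weight_std = samples.std(axis=0)
```

Weights with a small Fisher entry are flooded with noise, while weights that are informative for the retained data are left nearly untouched.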

Figure 2: Trade-off between information remaining about the class to forget and test error, mediated by the parameter $\lambda$ in the Lagrangian: we can always forget more, but this comes with decreasing accuracy of the model.

Notice that this procedure may be interpreted as adding noise to destroy the weights that may have been informative to classify $\mathcal{D}_f$, but are not useful to classify $\mathcal{D}_r$. We illustrate this in Figure 3.

Figure 3: Filters of a two-layer fully connected neural network, before and after the scrubbing procedure is applied to forget the digit 5. (Top) Filters specific to 5 are flooded with noise after scrubbing. (Bottom) Filters not specific to 5 remain almost unchanged.

4.2 Variational forgetting

Rather than using the Fisher, we may try to optimize for the optimal noise to add, by minimizing a proxy for the Forgetting Lagrangian (eq. 3): not knowing the optimal direction in which to add noise (see eq. 4), we may try to add the maximum amount of noise in all directions, under the constraint that it does not increase the loss function too much. Formally, we minimize the proxy Lagrangian:

$$\mathcal{L}(\Sigma) = \mathbb{E}_{n \sim N(0, \Sigma)}\big[L_{\mathcal{D}_r}(w + n)\big] - \lambda \log |\Sigma|.$$

The optimal $\Sigma$ may indeed be seen as the Fisher Information Matrix computed over a smoothed landscape. We note that since the noise is Gaussian, the loss can be optimized using the local reparametrization trick [11].
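A toy version of this optimization can be written down for a quadratic loss with a diagonal noise covariance, where the expected loss increase is available in closed form as $\frac{1}{2}\sum_i B_{ii} s_i$ and no sampling is needed. The sketch below is a simplification under those assumptions: it does not use the local reparametrization trick, and the curvature values `hess_diag` are hypothetical.

```python
import numpy as np

def proxy_lagrangian(log_var, hess_diag, lam):
    """0.5 * sum_i B_ii * s_i - lam * sum_i log s_i, with s = exp(log_var).
    The constant L(w) is dropped; the first term is the exact expected loss
    increase for a quadratic loss under diagonal Gaussian noise."""
    return 0.5 * np.sum(hess_diag * np.exp(log_var)) - lam * np.sum(log_var)

def fit_noise(hess_diag, lam, steps=2000, lr=0.5):
    """Gradient descent on the log-variances of the diagonal noise."""
    log_var = np.zeros_like(hess_diag)
    for _ in range(steps):
        grad = 0.5 * hess_diag * np.exp(log_var) - lam
        log_var -= lr * grad
    return np.exp(log_var)

hess_diag = np.array([0.1, 1.0, 10.0])   # hypothetical per-weight curvature
lam = 0.1
var = fit_noise(hess_diag, lam)
closed_form = 2 * lam / hess_diag        # stationary point of the objective
```

In this toy case the learned variance is inversely proportional to the curvature, which mirrors the role the Fisher Information Matrix plays in the noise of the previous section.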

4.3 Pre-training improves the forgetting bound

Since our local forgetting bound depends on the stability of the algorithm, it may be ineffective when training a large deep network from scratch for a long time. A partial remedy is to pre-train the network. This keeps all training paths closer together, since they all start from a common good configuration of weights, and it decreases the training time, and hence the opportunities for the paths to diverge.

We show in the experiments that this greatly improves the bound. The drawback is that the local forgetting bound now cannot guarantee forgetting of any information contained in the pre-training dataset (that is, $\mathcal{D}_f$ needs to be disjoint from the pre-training dataset).

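| Setting | Metric | Original model | Retrain (target) | Fine-tune | Negative gradient | Random labels | Hiding | Fisher (ours) | Variational (ours) |
|---|---|---|---|---|---|---|---|---|---|
| Lacuna-10, scrub 100 images, All-CNN | Error on $\mathcal{D}_{test}$ | 9.2% | 10.2% | 9.0% | 9.0% | 12.3% | 17.2% | 22.60% | 19% |
| | Error on $\mathcal{D}_f$ | 0.0% | 17.0% | 0.0% | 0.0% | 10.0% | 100.0% | 31% | 14% |
| | Error on $\mathcal{D}_r$ | 0.0% | 0.0% | 0.0% | 0.0% | 0.1% | 7.0% | 14.62% | 9.53% |
| | Fine-tune time (epochs) | 0 | 14 | 0 | 0 | 1 | 4 | 2 | 3 |
| | Memb. attack | 55.35% | 42.32% | 52.23% | 50% | 24.80% | 31.90% | 49.07% | 49.74% |
| | Info-bound | | | | | | | 1953 nats | 1512 nats |
| Lacuna-10, forget class, All-CNN | Error on $\mathcal{D}_{test}$ | 9.2% | 17.8% | 10.0% | 17.2% | 17.5% | 16.9% | 26.7% | 29.6% |
| | Error on $\mathcal{D}_f$ | 0.0% | 100.0% | 2.0% | 100.0% | 88.9% | 98.5% | 100.0% | 100.0% |
| | Error on $\mathcal{D}_r$ | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 10.9% | 12.2% |
| | Fine-tune time (epochs) | 0 | 29 | 0 | 1 | 3 | 2 | 29 | 24 |
| | Memb. attack | 53.82% | 33.77% | 37.94% | 28.97% | 5.16% | 23.16% | 44.69% | 44.96% |
| | Info-bound | | | | | | | 6310 nats | 3095 nats |
| CIFAR-10, scrub 100 images, All-CNN | Error on $\mathcal{D}_{test}$ | 13.0% | 11.9% | 11.9% | 11.8% | 12.1% | 19.5% | 23.8% | 24.63% |
| | Error on $\mathcal{D}_f$ | 0.0% | 14.0% | 0.0% | 0.0% | 12.0% | 100.0% | 22% | 41% |
| | Error on $\mathcal{D}_r$ | 0.2% | 0.0% | 0.0% | 0.0% | 0.0% | 9.9% | 16.10% | 18.19% |
| | Fine-tune time (epochs) | 0 | 29 | 0 | 0 | 3 | 8 | 4 | 2 |
| | Memb. attack | 52.58% | 45.24% | 50% | 50% | 49.73% | 43.80% | 49.58% | 49.48% |
| | Info-bound | | | | | | | 21533 nats | 17818 nats |
| CIFAR-10, forget class, All-CNN | Error on $\mathcal{D}_{test}$ | 13.0% | 20.1% | 13.1% | 18.9% | 19.4% | 19.5% | 26.65% | 26.76% |
| | Error on $\mathcal{D}_f$ | 0.0% | 100.0% | 21.0% | 100.0% | 96.6% | 100.0% | 99.86% | 100% |
| | Error on $\mathcal{D}_r$ | 0.2% | 0.3% | 0.0% | 0.0% | 0.0% | 0.1% | 10.80% | 10.07% |
| | Fine-tune time (epochs) | 0 | 29 | 1 | 1 | 3 | 13 | 29 | 21 |
| | Memb. attack | 52.17% | 39.32% | 50% | 36.07% | 8.49% | 41.28% | 48.74% | 49.61% |
| | Info-bound | | | | | | | 218530 nats | 171906 nats |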
Table 1: Original model is the model trained on all data $\mathcal{D} = \mathcal{D}_f \sqcup \mathcal{D}_r$. The forgetting algorithm should scrub information from the weights of this model. Retrain denotes the model obtained by retraining from scratch on $\mathcal{D}_r$, which hence has no information about $\mathcal{D}_f$. This is the optimal forgetting procedure we compare to. We consider the following forgetting procedures: Fine-tune denotes fine-tuning the model on $\mathcal{D}_r$. Negative gradient denotes fine-tuning on $\mathcal{D}_f$ by moving in the direction of increasing loss. Random labels denotes replacing the labels of the class with random labels and then fine-tuning on all of $\mathcal{D}$. Hiding denotes simply removing the class from the final classification layer. Fisher is our proposed method, which uses the Fisher Information Matrix at the point $w$ to add noise and destroy information about $\mathcal{D}_f$. Variational is our second proposed method, which directly minimizes the Forgetting Lagrangian (eq. 3) using a variational approach. We benchmark these methods using several readout functions: errors on $\mathcal{D}_f$ and $\mathcal{D}_r$ after scrubbing, time to retrain on the forgotten samples after scrubbing, and accuracy of a membership attack on $\mathcal{D}_f$ run against the network. In all cases, the readout of the scrubbed model should be closer to the target retrained model than to the original. Note that our methods also provide an upper bound on the amount of information remaining.

5 Experiments

5.1 Datasets

We report experiments on MNIST, CIFAR10 [13], and on two new datasets we introduce specifically for selective forgetting, called Lacuna-10 and Lacuna-100.

Lacuna-10 consists of face images of 10 celebrities from VGGFaces2 [4]. We create Lacuna-10 by randomly sampling 10 celebrities with at least 500 images each. We split this data into a test set of 100 samples for each class and the remaining samples form the training set. Similarly, we create Lacuna-100 by randomly sampling 100 celebrities with at least 500 images each. We resize the images in both datasets to 32x32. There is no overlap between the celebrities in the two Lacuna datasets.

We use Lacuna-100 to pre-train the network (and hence assume that we won’t have to forget that data), and then fine-tune on Lacuna-10. The scrubbing procedure is required to forget some or all images for one identity in Lacuna-10.

On both CIFAR-10 and Lacuna-10 we choose to forget either the entire class 5, which is chosen at random, or a hundred images of that class.

5.2 Models and training

For experiments on images (Lacuna-10 and CIFAR-10), we use the All-CNN architecture [22], to which we add a batch normalization layer before each non-linearity.

We first pretrain on Lacuna-100 for 15 epochs, using SGD with fixed learning rate of 0.1, momentum 0.9 and weight decay 0.0005. We then fine-tune on Lacuna-10 with learning rate 0.01. To simplify the theoretical analysis, during fine-tuning we don’t update the running mean and variance of batch normalization, and rather reuse the pre-trained ones.

5.3 Linear logistic regression

To confirm our theoretical derivation, we test the scrubbing procedure in eq. 5 on a toy linear logistic regression problem, where the task is to forget the data points belonging to one of two clusters comprising the class (see Figure 1). We train using a uniform random initialization for the weights and SGD with batch size 10, with early stopping after 10 epochs. Since the problem is low-dimensional, we can easily approximate the distributions $P(w \mid \mathcal{D})$ and $P(w \mid \mathcal{D}_r)$ by running 100 trainings with different random seeds.

As can be seen in Figure 1, the scrubbing procedure is able to almost perfectly overlap the two distributions, therefore preventing an attacker from extracting any information about the forgotten cluster. Notice also that since we use early stopping, the algorithm has not yet converged, and exploiting the time dependency in eq. 5 rather than using the simpler eq. 6 was essential.

5.4 Baselines

We consider four baselines against which to compare our methods. (i) Fine-tune: we fine-tune the model on the remaining data $\mathcal{D}_r$. (ii) Negative gradient: we fine-tune the model on $\mathcal{D}$ while moving in the direction of increasing loss for samples in $\mathcal{D}_f$, which is equivalent to using a negative gradient for the samples to forget. (iii) Random labels: we fine-tune the model on $\mathcal{D}$ after randomly shuffling the labels of the images belonging to $\mathcal{D}_f$. (iv) Hiding: we simply remove the row corresponding to the class to forget from the final classification layer of the DNN. It is unclear how much information these methods remove, since unlike our proposed methods they do not come with any upper bound. For this reason, we introduce some readout functions, which may be used to gauge how much information they destroy.
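The three non-trivial baselines can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the experiments: weights and gradients are plain lists, the function names are hypothetical, and for the random-labels baseline new labels are drawn uniformly at random, since permuting labels inside a single-class cohort would leave them unchanged.

```python
import random


def negative_gradient_step(w, grad_retain, grad_forget, lr=0.01):
    """Descend on the retained samples and ascend on the samples to forget."""
    return [wi - lr * (gr - gf) for wi, gr, gf in zip(w, grad_retain, grad_forget)]


def randomize_forget_labels(labels, forget_idx, num_classes, rng):
    """Return a copy of labels with new random labels on the forget set.

    Assumption: labels are redrawn uniformly at random rather than permuted.
    """
    new_labels = list(labels)
    for i in forget_idx:
        new_labels[i] = rng.randrange(num_classes)
    return new_labels


def hide_class(final_layer_rows, class_to_forget):
    """Drop the row of the final classification layer for one class."""
    return [row for c, row in enumerate(final_layer_rows) if c != class_to_forget]


rng = random.Random(0)
labels = [5, 5, 3, 5, 1]
print(randomize_forget_labels(labels, forget_idx=[0, 1, 3], num_classes=10, rng=rng))
print(hide_class([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], class_to_forget=1))
print(negative_gradient_step([1.0, -1.0], [0.2, 0.0], [0.5, -0.5]))
```

Hiding changes only the output layer, so every other weight of the network is left untouched.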

5.5 Readout functions

We use the following readout functions to evaluate our methods: (i) Error on the test set $\mathcal{D}_{\text{test}}$; (ii) Error on the cohort to be forgotten $\mathcal{D}_f$; (iii) Error on the retained data $\mathcal{D}_r$ after scrubbing; (iv) Fine-tune time: the time (in epochs) it takes for the loss of a scrubbed network to fall below a certain threshold when it is fine-tuned using samples from all the classes. This effectively measures how quickly a scrubbed network can regain information about the cohort that was scrubbed from it; (v) Membership attack: we use a simple membership attack based on the model entropy to identify whether a cohort was used to train the model. The attack is a logistic regression whose input is the entropy of the model output. Its training set has label 1 for training images (excluding the cohort to be forgotten) and label 0 for test images, and its test set consists of the cohort to be forgotten. We train the logistic classifier on this training set and compute the predicted probability on the test set. Ideally, if the model has been scrubbed of the cohort, the predicted probability on the test set should be small; (vi) Information bound: for our methods, we upper-bound the amount of information the model contains about the cohort to be forgotten using our local forgetting bound (Proposition 2).
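The entropy-based membership attack can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the model outputs are made-up softmax vectors, the names `entropy`, `fit_logistic` and `predict` are hypothetical, and the one-dimensional logistic regression is fit by plain batch gradient descent.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit p(y=1|x) = sigmoid(a*x + b) by batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def predict(a, b, x):
    """Predicted probability that an input with entropy x was a training image."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


# Hypothetical model outputs: confident on training images, less so on test images.
train_outputs = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02], [0.99, 0.005, 0.005]]
test_outputs = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2]]
# Hypothetical cohort: the first sample looks forgotten, the second does not.
cohort_outputs = [[0.45, 0.35, 0.2], [0.98, 0.01, 0.01]]

xs = [entropy(p) for p in train_outputs + test_outputs]
ys = [1.0] * len(train_outputs) + [0.0] * len(test_outputs)
a, b = fit_logistic(xs, ys)
cohort_scores = [predict(a, b, entropy(p)) for p in cohort_outputs]
print(cohort_scores)
```

A low score on the cohort means the attack treats those images like unseen test images, which is the behavior expected after successful scrubbing.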

6 Discussion

We have introduced a notion of selective forgetting for deep neural networks, and derived a practical way to compute an upper bound on the amount of information that the network retains. This hinges on a connection between differential privacy (of which our framework may be seen as a generalization) and the stability of SGD. The use of noise to bound the amount of information contained in the weights is also common to other frameworks, such as PAC-Bayes [2]. As we discussed, forgetting may also be seen as minimizing an upper bound on the amount of information a membership attack may extract about a particular cohort $\mathcal{D}_f$.

Finally, we notice that the notion of forgetting is entangled with the definition of information. Thus far, we have used Shannon Mutual Information, as it is the best known and most intuitive, but a few caveats are in order. When we say we “forget” in the sense of Shannon, we imply that there is no function that, in expectation, can extract information relative to some attribute. The expectation is with respect to the stochasticity both in the algorithms (if present) and, less trivially, in the sampling of the data used for scrubbing, since that is a random variable. Hence, even when we talk about forgetting a particular sample, in practice we talk about a random variable. That we do not restrict the class of functions is another important difference. In principle, the readout function may be arbitrarily complex, and may even have perfect knowledge of all remaining data and of the training procedure. On the other hand, most readout functions used in practice are agnostic to the details of the training process, and even with perfect knowledge of them, it is doubtful that an attacker could model the resulting distribution of the weights of a deep network well enough to extract information from it. This suggests that limiting the allowed readout functions may be a promising area of research.


Appendix A Proofs

Lemma 1

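For any function $f(w)$ we have

$$\mathrm{KL}\big(P(f(S(w)) \mid \mathcal{D}) \,\|\, P(f(S_0(w)) \mid \mathcal{D}_r)\big) \le \mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big).$$

Proof. We will consider the random variables to be discrete. Let us consider an equivalent, simplified version of the above inequality:

$$\mathrm{KL}\big(Q(f(w)) \,\|\, R(f(w))\big) \le \mathrm{KL}\big(Q(w) \,\|\, R(w)\big),$$

where $Q$ denotes the distribution $P(S(w) \mid \mathcal{D})$ and $R$ denotes $P(S_0(w) \mid \mathcal{D}_r)$. For bijective functions $f$, the above result holds with equality. For surjective functions, let $W_c = \{w : f(w) = c\}$; that is, $W_c$ is the subset of the domain of $f$ on which $f(w) = c$, where $c$ is some constant. Then $Q(f(w) = c) = \sum_{w \in W_c} Q(w)$ and similarly $R(f(w) = c) = \sum_{w \in W_c} R(w)$.

Let us write the LHS using this notation:

$$\mathrm{KL}\big(Q(f(w)) \,\|\, R(f(w))\big) = \sum_c \Big(\sum_{w \in W_c} Q(w)\Big) \log \frac{\sum_{w \in W_c} Q(w)}{\sum_{w \in W_c} R(w)}. \tag{8}$$

Similarly, let us write the RHS:

$$\mathrm{KL}\big(Q(w) \,\|\, R(w)\big) = \sum_c \sum_{w \in W_c} Q(w) \log \frac{Q(w)}{R(w)}. \tag{9}$$

From the log-sum inequality, we know that for each $c$ in Eqn. (8) and Eqn. (9):

$$\Big(\sum_{w \in W_c} Q(w)\Big) \log \frac{\sum_{w \in W_c} Q(w)}{\sum_{w \in W_c} R(w)} \le \sum_{w \in W_c} Q(w) \log \frac{Q(w)}{R(w)}.$$

Thus a summation over $c$ on each side of the inequality concludes the proof. ∎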
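The statement of Lemma 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the two distributions and the readout function `f` are made up, and `kl` and `pushforward` are hypothetical helper names.

```python
import math
from collections import defaultdict


def kl(q, r):
    """KL divergence between two discrete distributions given as dicts."""
    return sum(qv * math.log(qv / r[k]) for k, qv in q.items() if qv > 0.0)


def pushforward(dist, f):
    """Distribution of f(w) when w is drawn from dist."""
    out = defaultdict(float)
    for w, p in dist.items():
        out[f(w)] += p
    return dict(out)


def f(w):
    """Hypothetical readout that merges the values {0, 1} and {2, 3}."""
    return w // 2


# Hypothetical distributions over four discrete weight values.
Q = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}
R = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.4}

kl_full = kl(Q, R)
kl_readout = kl(pushforward(Q, f), pushforward(R, f))
print(kl_readout <= kl_full)
```

Any other choice of `f` should give the same ordering, since merging outcomes can only discard information that distinguishes the two distributions.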

Proposition 2

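Let $A(\mathcal{D})$ be a (possibly stochastic) training algorithm, the outcome of which we indicate as $w = A(\mathcal{D}, \epsilon)$ for some deterministic function $A$ and a random variable $\epsilon$, for instance a random seed. Then, we have $P(S(w) \mid \mathcal{D}) = \mathbb{E}_\epsilon[P(S(w) \mid \mathcal{D}, \epsilon)]$. We have the bound:

$$\mathrm{KL}\big(P(S(w) \mid \mathcal{D}) \,\|\, P(S_0(w) \mid \mathcal{D}_r)\big) \le \mathbb{E}_\epsilon\Big[\mathrm{KL}\big(P(S(w) \mid \mathcal{D}, \epsilon) \,\|\, P(S_0(w) \mid \mathcal{D}_r, \epsilon)\big)\Big].$$

Proof. Let us consider an equivalent version of the above inequality with simplified notation:

$$\mathrm{KL}\big(Q(w) \,\|\, R(w)\big) \le \mathbb{E}_\epsilon\Big[\mathrm{KL}\big(Q(w \mid \epsilon) \,\|\, R(w \mid \epsilon)\big)\Big],$$

where $Q$ denotes the distribution $P(S(w) \mid \mathcal{D})$ and $R$ denotes $P(S_0(w) \mid \mathcal{D}_r)$. The LHS can be equivalently written as:

$$\mathrm{KL}\big(Q(w) \,\|\, R(w)\big) = \int \Big(\int Q(w \mid \epsilon)\, p(\epsilon)\, d\epsilon\Big) \log \frac{\int Q(w \mid \epsilon)\, p(\epsilon)\, d\epsilon}{\int R(w \mid \epsilon)\, p(\epsilon)\, d\epsilon}\, dw.$$

Applying the log-sum inequality to the inner integrals for each $w$ bounds this quantity by $\int\!\!\int p(\epsilon)\, Q(w \mid \epsilon) \log \frac{Q(w \mid \epsilon)}{R(w \mid \epsilon)}\, d\epsilon\, dw$, which is the RHS. ∎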
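The bound in Proposition 2 says that averaging over the seed can only decrease the divergence. A small numerical check, assuming a discrete seed with two values and made-up conditional distributions (all names below are hypothetical), could look like this:

```python
import math


def kl(q, r):
    """KL divergence between two discrete distributions given as lists."""
    return sum(qi * math.log(qi / ri) for qi, ri in zip(q, r) if qi > 0.0)


def mixture(components, seed_probs):
    """Marginal over seeds: sum over eps of p(eps) * P(w | eps)."""
    size = len(components[0])
    return [sum(p * comp[i] for p, comp in zip(seed_probs, components))
            for i in range(size)]


# Hypothetical conditionals for two seeds over three discrete weight values.
seed_probs = [0.5, 0.5]
Q_given_seed = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
R_given_seed = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]

kl_marginal = kl(mixture(Q_given_seed, seed_probs), mixture(R_given_seed, seed_probs))
kl_expected = sum(p * kl(q, r)
                  for p, q, r in zip(seed_probs, Q_given_seed, R_given_seed))
print(kl_marginal <= kl_expected)
```

The right-hand side is the quantity that is easier to control in practice, because it compares the scrubbed and retrained weights seed by seed.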

Proposition 3

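Let the loss be $L_{\mathcal{D}}(w) = L_{\mathcal{D}_f}(w) + L_{\mathcal{D}_r}(w)$, and assume both $L_{\mathcal{D}}$ and $L_{\mathcal{D}_r}$ are quadratic. Assume that the optimization algorithm $A_t(\mathcal{D}, \epsilon)$ at time $t$ is given by the gradient flow of the loss with a random initialization. Consider the function

$$h(w) = e^{-Bt} e^{At} d + e^{-Bt}(d_r - d) - d_r,$$

where $A = \nabla^2 L_{\mathcal{D}}(w)$, $B = \nabla^2 L_{\mathcal{D}_r}(w)$, $d = A^{-1} \nabla_w L_{\mathcal{D}}$ and $d_r = B^{-1} \nabla_w L_{\mathcal{D}_r}$, and $e^{At}$ denotes the matrix exponential. Then, $S(w) = w + h(w)$ is such that $S(A_t(\mathcal{D}, \epsilon)) = A_t(\mathcal{D}_r, \epsilon)$ for all random initializations $\epsilon$ and all times $t$.

Proof. Since $L_{\mathcal{D}}$ and $L_{\mathcal{D}_r}$ are quadratic, assume without loss of generality that:

$$L_{\mathcal{D}}(w) = \tfrac{1}{2}(w - w_A^*)^T A\, (w - w_A^*), \qquad L_{\mathcal{D}_r}(w) = \tfrac{1}{2}(w - w_B^*)^T B\, (w - w_B^*).$$

Since the training dynamic is given by a gradient flow, the training path is the solution to the differential equation:

$$\dot{w}_A(t) = -A\,(w_A(t) - w_A^*), \qquad \dot{w}_B(t) = -B\,(w_B(t) - w_B^*),$$

which, for a common initialization $w_0$, is given respectively by:

$$w_A(t) = w_A^* + e^{-At}(w_0 - w_A^*), \qquad w_B(t) = w_B^* + e^{-Bt}(w_0 - w_B^*).$$

We can compute $w_0$ from the first expression:

$$w_0 = e^{At}(w_A(t) - w_A^*) + w_A^* = e^{At} d + w_A^*,$$

where we defined $d = w_A(t) - w_A^* = A^{-1} \nabla_w L_{\mathcal{D}}(w_A(t))$. We now replace this expression of $w_0$ in the second expression to obtain:

$$w_B(t) = e^{-Bt} e^{At} d + e^{-Bt}(w_A^* - w_B^*) + w_B^* = w_A(t) + e^{-Bt} e^{At} d + e^{-Bt}(d_r - d) - d_r,$$

where $d_r = w_A(t) - w_B^* = B^{-1} \nabla_w L_{\mathcal{D}_r}(w_A(t))$, so that $w_A^* - w_B^* = d_r - d$ and $w_B^* = w_A(t) - d_r$. ∎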
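The identity in Proposition 3 can be verified numerically on a small quadratic problem. The sketch below is an illustration only: the two-dimensional Hessians, minimizers and initialization are made up, the matrix exponential is computed with a truncated Taylor series (adequate only for small, well-scaled matrices), and all function names are hypothetical. It uses the scrubbing shift $h(w) = e^{-Bt} e^{At} d + e^{-Bt}(d_r - d) - d_r$, with $d$ and $d_r$ computed as distances to the two minimizers, which coincide with $A^{-1}\nabla_w L_{\mathcal{D}}$ and $B^{-1}\nabla_w L_{\mathcal{D}_r}$ for quadratic losses.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matvec(X, v):
    return [sum(x * vi for x, vi in zip(row, v)) for row in X]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def expm(M, t, terms=60):
    """Matrix exponential e^{M t} by a truncated Taylor series (small matrices only)."""
    n = len(M)
    Mt = [[m * t for m in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, Mt)]
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def flow(H, w_star, w0, t):
    """Gradient flow on 0.5 (w - w_star)^T H (w - w_star), started at w0."""
    return add(w_star, matvec(expm(H, -t), sub(w0, w_star)))


def h(w, A, B, w_star_A, w_star_B, t):
    """Scrubbing shift for quadratic losses."""
    d = sub(w, w_star_A)    # equals A^{-1} grad L_D(w) for a quadratic loss
    d_r = sub(w, w_star_B)  # equals B^{-1} grad L_Dr(w) for a quadratic loss
    first = matvec(expm(B, -t), matvec(expm(A, t), d))
    second = matvec(expm(B, -t), sub(d_r, d))
    return sub(add(first, second), d_r)


# Hypothetical 2-D problem: B is the Hessian on D_r, A = B + (Hessian on D_f).
B = [[1.0, 0.2], [0.2, 0.5]]
A = [[1.8, 0.5], [0.5, 1.1]]
w_star_A, w_star_B = [0.3, -0.4], [1.0, 0.5]
w0, t = [2.0, -1.5], 0.7

w_A = flow(A, w_star_A, w0, t)                            # trained on all data
scrubbed = add(w_A, h(w_A, A, B, w_star_A, w_star_B, t))  # S(w) = w + h(w)
w_B = flow(B, w_star_B, w0, t)                            # trained on D_r only
print(max(abs(a - b) for a, b in zip(scrubbed, w_B)))
```

The printed gap should be at the level of floating-point round-off. At $t = 0$ the shift vanishes, and for large $t$ it approaches the Newton step $-d_r$.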