Selective Forgetting of Deep Networks at a Finer Level than Samples

by Tomohiro Hayase, et al.

Selective forgetting, i.e., removing information from deep neural networks (DNNs), is essential for continual learning and is challenging to control. Such forgetting is also crucial in a practical sense, since deployed DNNs may have been trained on data with outliers, data poisoned by attackers, or data containing leaked or sensitive information. In this paper, we formulate selective forgetting for classification tasks at a finer level than the sample level. We specify the finer level based on four datasets distinguished by two conditions: whether they contain the information to be forgotten and whether they are available for the forgetting procedure. Additionally, we reveal the need for such a formulation by presenting concrete and practical situations. Moreover, we introduce the forgetting procedure as an optimization problem over three criteria: the forgetting, the correction, and the remembering term. Experimental results show that the proposed methods can make the model forget to use specific information for classification. Notably, in certain cases, our methods improved the model's accuracy on datasets that contain the information to be forgotten but are unavailable in the forgetting procedure. In actual situations, such data are unexpectedly encountered and misclassified.









In practical applications of machine learning, models must deal with continually arriving input data. Lifelong machine learning (Chen et al., 2018; Parisi et al., 2019) is a framework addressing this problem. It consists of various techniques, such as continual learning, transfer learning, meta learning, and multi-task learning. Continual learning is the most straightforward idea of lifelong machine learning. It aims to accumulate knowledge in a model from many tasks and data that arrive intermittently. Eventually, we expect the model to solve different types of tasks on a wide range of data. In practice, however, there are many difficulties in achieving this kind of learning.

In terms of deep neural networks (DNNs), the most typical problem in continual learning is catastrophic forgetting (Kirkpatrick et al., 2017; Li and Hoiem, 2018). If we train an already trained DNN on a new task, the parameters of the DNN will be overwritten, and the DNN will completely forget the previous task. This behavior is called catastrophic forgetting. Previously proposed techniques can alleviate this problem and have shown the possibility of continually adding information to DNNs (Kirkpatrick et al., 2017; Kemker et al., 2018).

Selective forgetting, which is a subtraction of information from DNNs, is also crucial for continual learning. In practical situations, training data often contain useless or undesired data, and we may want to remove such information from the model afterward. Especially in industrial settings, various problems can appear in long-term operation even if developers thought everything was fine at the initial stage of deployment.

Typically, a trained DNN is required to forget specific samples in the training dataset. One reason for the request is poor inference performance caused by outliers: if the dataset has noisy outliers, the model's generalization performance degrades. Privacy, which is related to the GDPR (General Data Protection Regulation) in Europe and the right to be forgotten, is also a common reason. For example, when you construct an image dataset using images from the Internet and train a DNN with it, some rights holders of the images may demand removing their information from both the dataset and the trained model. From a privacy-protection perspective, selective forgetting is quite difficult because the training data can be estimated from the model (Fredrikson et al., 2015). Continually learning DNNs such as chatbots also need selective forgetting. Chatbots often learn from users' posts, and their corpora are frequently polluted. Rolling back is a possible solution, but it removes both recent useful and useless corpora. A better solution is to forget only the useless corpus selectively and preserve the useful one.

Targets for selective forgetting can be at a finer level than samples in many situations. Poisoning (Muñoz-González et al., 2017), an attack that pollutes training data and makes models malfunction, creates one such situation. Chen et al. (2017) have illustrated that attackers can set up backdoors by injecting specific image patches into some training samples. More concretely, by adding face images with specific glasses to the training data, the attackers can make a face-recognizing DNN classify face images with the glasses as a specific class. In such cases, it is desirable to make the DNN forget to use the feature corresponding to the glasses rather than forget whole poisoned samples. Leakage (Kaufman et al., 2012) or shortcut learning (Geirhos et al., 2020) can also create situations that need forgetting. If some of the training data contain something like a data ID, the DNN exploits it, and the generalization performance is ruined. Hence, as in the case of poisoning, it is important to forget the effect of the leakage. In the context of fairness (Binns, 2018), some explanatory variables can be sensitive (e.g., sexuality, address, and racial information), and the model may be required to forget them.

In this paper, we formulate selective forgetting using four datasets, which we denote D_F, D_R, D'_F, and D'_R, an idea that comes from considering practical situations (see Section Information to be Forgotten). The four datasets are distinguished by two conditions: whether they contain the information to be forgotten and whether they are available for the forgetting procedure. Additionally, we derive a novel forgetting procedure and show that it successfully makes the DNN forget selectively.

We formulate three patterns of selective forgetting in classification tasks. In the first pattern, we make the DNN forget samples that contain information to be forgotten. In this pattern, we split a dataset into two datasets: a dataset D_F that consists of the data with the information to be forgotten and a dataset D_R that consists of the rest. In the second and third patterns, we adjust the targets to be forgotten at a finer level than samples. In these patterns, features in the input data can be seen as a backdoor or leakage. To evaluate the forgetting of such information, we use two additional datasets, D'_F and D'_R. The dataset D'_F is drawn from a similar distribution to D_F, but each sample in D'_F is processed not to contain the information to be forgotten. Conversely, the dataset D'_R is drawn from a similar distribution to D_R, but each sample in D'_R is modified to have the information. The second and the third patterns differ in the datasets on which the DNN should perform well (see Table 1). We describe the three patterns and their importance with concrete and practical situations that need forgetting (see Section Situations).

We propose a forgetting procedure as training with a combination of a forgetting term, a correction term, and a remembering term. The forgetting term is based on a random distillation. The correction term and the remembering term are based on elastic weight consolidation (EWC) (Kirkpatrick et al., 2017); the former is a classification loss on the additional data, and the latter is a regularization restricting the parameters' movement by employing Fisher information. Regarding that the original training data are large and hardly accessible, we only use the dataset to be forgotten, D_F, and its variant D'_F in the forgetting procedure. Experimental results show that the proposed method can make the DNN forget the target information in certain situations. Notably, we have found that our methods improve the performance on data shown in neither the pretraining nor the forgetting procedure.

Related Work

Catastrophic Forgetting

To alleviate catastrophic forgetting, Kirkpatrick et al. (2017) proposed EWC. EWC estimates which parameters are important for previously learned tasks by calculating the diagonal Fisher information matrix (FIM) on the previous task. Many other techniques have been proposed to prevent catastrophic forgetting (Kemker et al., 2018).

Formulations of Selective Forgetting

An important point for selective forgetting is how to define a DNN's forgotten state. Guo et al. (2019) and Golatkar et al. (2020a) defined such states based on differential privacy (Dwork, 2008). Differential privacy is the idea that two models trained by the same algorithm should have (almost) the same parameters when one model is trained on some dataset and the other is trained on the same dataset with one sample removed. In short, differential privacy guarantees that the removal of a sample from a dataset does not (or hardly) affects the resulting model. Data deletion (Ginart et al., 2019) has a similar concept: forgetting (or deletion) must result in the model that would have been trained on the dataset without the data to be forgotten. Bourtoule et al. (2019) also employed a similar definition of forgetting and named it machine unlearning. Certified data removal (Guo et al., 2019) relaxes differential privacy by comparing two models: the model trained without the sample to be forgotten, and the model trained with it and then made to forget it. Golatkar et al. (2020a) aimed for a more practical definition, especially for DNNs; the target to be forgotten is relaxed to a subset of the dataset instead of a single sample. Moreover, it allows the model parameters to be perturbed in order to remove the information of the subset to be forgotten. These formulations mainly concern forgetting one or more samples. We formulate selective forgetting at a finer level than samples. In our formulation, a DNN's forgotten state means that the behavior of the model does not change depending on whether a dataset contains the information to be forgotten.

Features Finer than Samples

As a feature at a finer level than samples, the backdoor is well-known (Li et al., 2020). A backdoor is a hidden feature that the attacker injects into the training data. It leads the model to predict as the attacker wants. Defense methods against backdoor attacks have been proposed (Li et al., 2020), but most of them must be applied before training, in contrast to forgetting, which is an operation performed after training.

Methods for Selective Forgetting

For certified data removal (Guo et al., 2019), a forgetting method for linear classifiers has been proposed. The method is based on additive noise applied to the loss function at training time and Newton's method on the dataset without the data to be forgotten. It is applicable when the last layer of the DNN is a linear layer.

Bourtoule et al. (2019) proposed SISA training, which trains several models with disjoint subsets of the original training dataset. The models trained with SISA training can efficiently forget certain samples under the condition that the whole dataset is stored and available to the unlearning algorithm. In contrast, our method targets the case where access to the dataset is restricted. Scrubbing (Golatkar et al., 2020a) is a perturbation of the parameters; it randomly moves the parameters in a direction that scrubs the information of the data to be forgotten and does not affect the rest. The direction is derived from the FIM or, for example, the neural tangent kernel (Golatkar et al., 2020b). Our method also modifies parameters using randomness, in a more naive way than scrubbing. Additionally, Ginart et al. (2019) treated forgetting methods for k-means, not for DNNs. They proposed a quantized variant of k-means as a clustering method that is robust to data removal.



Let f_θ : X → Y be a DNN, where θ, X, and Y are the parameters of the DNN, the input space, and the label space, respectively. We assume that the DNN is trained on a classification task using a dataset D and a loss function ℓ. We call D the pretraining dataset hereinafter to distinguish between the dataset for the classification task and the dataset for the training procedure of the forgetting. As a result of the pretraining, we have a parameter θ* that makes the loss on D sufficiently small.

Let D_F ⊂ D be a set of data that contain the information to be forgotten. Write D_R = D \ D_F, which is the dataset to be remembered. A trivial solution for selective forgetting is retraining using D_R. However, this strategy is not practical because D_R is often huge and the retraining takes a long time. Besides, we sometimes do not have access to D_R. Thus, we assume that D and D_R are basically not accessible in the forgetting procedure.

Information to be Forgotten

A pattern of selective forgetting is to make the DNN forget the information that D_F contains but D_R does not. In other words, the DNN is required to forget the samples in D_F and to remember those in D_R. We evaluate the forgetting by the accuracy on the datasets: we say that the DNN forgets D_F if it keeps high accuracy on D_R and achieves low accuracy on D_F. Golatkar et al. (2020a, b) utilize the DNN trained only on D_R as the forgotten state. For classification problems, the DNN in such a state should pass our forgetting criterion. By evaluating the forgetting using accuracy on the datasets, we can easily apply the method to a wide range of models.

Further, we introduce patterns of selective forgetting of subtler information than samples. Here, the subtle information to be forgotten is determined by D_F and an additionally given dataset D'_F. The additional dataset D'_F has data similar to those in D_F but does not contain the information to be forgotten. For convenience, we introduce a map φ : X → X that describes adding the information to be forgotten. The map is assumed to satisfy the following two conditions. Firstly, since D'_F describes the forgotten version of D_F, we assume that applying φ to the inputs of D'_F recovers D_F and that the labels are unchanged. Secondly, the distance between x and φ(x) is assumed to be sufficiently small. In this situation, we evaluate the forgetting by the accuracy on four datasets: D_F, D_R, D'_F, and an extra dataset D'_R = φ(D_R). D'_R has data similar to D_R, but the samples in D'_R have the information to be forgotten. D'_R may not exist in some cases, but it is necessary for evaluating the performance of the forgetting.

Which datasets the DNN should achieve high (or low) accuracy on depends on the information to be forgotten, as shown in Table 1. Concrete applications for each pattern in Table 1 are described in a later section.

Forgetting pattern | D_F  | D_R  | D'_F | D'_R | Examples to be forgotten
Pretrained state   | high | high | N/A  | N/A  |
Pattern A          | low  | high | N/A  | N/A  | Samples
Pattern B          |      | high | high | high | Backdoor
Pattern C          | high | high |      | high | Leakage

Table 1: Relationship between accuracy on the datasets and targets for forgetting. D_F and D_R are testing data; D'_F and D'_R are additional data. "high" and "low" respectively denote that high or low accuracy is required for the corresponding pattern and dataset. A blank cell means we do not care about the corresponding accuracy.

Above, we described that D'_R is obtained afterward. However, it is often found at first; for example, misclassification of samples similar to those in D'_R reveals the need for the forgetting and the information to be forgotten. In such a case, we choose the map φ so that D'_R ≈ φ(D_R) is satisfied. Then, we obtain the dataset for the forgetting procedure by D_F = φ(D'_F).

Loss Function for Forgetting

We formulate a loss function for selective forgetting as a combination of the forgetting term L_fgt, the correction term L_cor, and the remembering term L_rem. Here, the remembering loss approximates the KL divergence against the old model f_θ*, where θ* is the parameter of the network before applying selective forgetting. The loss function for the selective forgetting is a linear combination of the three terms with non-negative weights. The weights are hyperparameters and are decided by cross-validation. Recall that we do not use D_R or D'_R in minimizing the selective forgetting loss. However, we require validation subsets of D_R and D'_R for deciding the hyperparameters. We describe the specific forms of the losses in a later section.


Situations

We list several situations that need selective forgetting. They correspond to the patterns in Table 1 and Figure 1. We also clarify the concrete content of the datasets (e.g., D_F, D'_F, and D'_R) in the examples.

Pattern A: Forget Samples

Generally, DNNs are good at dealing with large datasets, which are costly to build. To save the cost, we can use automatically collected datasets such as the WebVision database (Li et al., 2017). Selective forgetting plays an important role when DNNs learn from huge and noisy datasets like WebVision. Suppose you have a DNN trained on WebVision and you find some outliers that affect the performance of the DNN. The most naive way to remove the effect of the outliers is retraining without them. However, the dataset is huge, and it may take a couple of weeks to complete the training. In this case, it is useful to forget the outliers in a short time without accessing the whole dataset.

In the context of continual learning, the need for selective forgetting is clearer. Consider a chatbot that learns continually from users' reactions. Even if the bot is once successfully trained on useful information, malicious users may teach it irrelevant expressions. Since the bot learns continually, it will soon produce such expressions, like Microsoft Tay, which ended up repeating racist remarks because of a corpus poisoned by malicious users (Neff and Nagy, 2016; Wolf et al., 2017). Rolling the bot back to the state before the attack is a trivial solution in such a case. However, the bot may have learned proper expressions from normal users during the attack. If we can make the bot forget the bad corpus and preserve the rest, the bot stays under control and can continue to work after the attack.

Such cases require the DNN to forget specific samples. They correspond to Pattern A in Table 1: the DNN must achieve low accuracy on D_F while keeping the accuracy on D_R. The outliers and polluted data are D_F, and the rest of the training data are D_R. D_R is hardly accessible in both cases; it is too large to iterate over in the case of WebVision, and it may be deleted in a streaming fashion in the case of the chatbot.

Pattern B: Forget Backdoor

Say you are developing a face authentication system using a DNN. Attackers may put malicious data into the training data to set up a backdoor so that they can pass the system by wearing specific glasses, as Chen et al. (2017) describe. After deploying the system, you notice the attack by seeing some unauthorized people with the glasses passing the system. You are required to make the model promptly forget the poisoned data.

The poisoned images contain the glasses, which are the key to the backdoor. We want the DNN to forget using the feature that comes from the glasses. In this situation, since you noticed the attack by seeing testing samples with the backdoor, we have D'_R at first. D'_R contains face images just like those in D_R, but they come with the glasses. Once we notice the backdoor, we can collect D_F, the images with the glasses in the pretraining data. For the forgetting procedure, we assume we can construct D'_F. It is just like D_F, but each image in it does not have the glasses. In order to say the DNN has forgotten the backdoor, the DNN must achieve the following:

  • High accuracy on D'_F to correct the poisoned knowledge on D_F.

  • High accuracy on D'_R to ensure robustness against a backdoor added to the clean data.

Thus, Pattern B in Table 1 corresponds to this case. Note that we do not care about the accuracy on D_F, because the accuracy should be low anyway and the learned information about D_F will be overwritten in the forgetting procedure.

Pattern C: Forget Leakage

We can use DNNs to decide marketing strategies; for example, oil companies may be interested in the models of cars that come to a certain gas station and want to classify car images from monitoring cameras in the station. In this situation, the first thing to do is to train a DNN with a dataset that has images of various types of cars. Assume that the DNN learned the emblems of the cars to distinguish the models. In the operational phase, the model will confuse the emblems on actual cars with those on posters and advertisements at the gas station. For instance, the DNN may classify a car of company A as company B because the background of the input image has an advertisement for a car of company B. The emblems are a kind of leaked information in this case. A straightforward workaround is masking the emblems in the dataset and retraining with it. However, masking every single emblem is not very practical because it is expensive in terms of both human resources and time. Forgetting leaked parts of the input/feature will help the DNN classify the cars by their shape rather than by the emblems appearing in the input images.

Here, the leaked information to forget is the emblems. As in the case of the backdoor, we are likely to find D'_R, the data misclassified due to the emblems, at first. Then we can construct D_F, which has the problematic emblems (i.e., the emblems of company B in the context of the example above). Note that D_F only contains the images of company B's cars, because only they have the emblem of company B. We can also obtain D'_F by masking the emblems. Assuming the pretraining data do not contain the emblems in the background, we make the DNN forget the leaked information by achieving the following:

  • High accuracy on D_F, which has the right combination of the emblems and the car type.

  • High accuracy on D'_R to ensure robustness against leakage added to the clean data.

This situation corresponds to Pattern C in Table 1. Regarding that the cars of company B always have the emblems, we do not care about the accuracy on D'_F.


We construct the selective forgetting in the classification problem as minimizing a combination of a loss for forgetting and a loss for defense against catastrophic forgetting.

The forgetting term

We make DNNs forget by training them toward random outputs, which represent an unlearned state. We introduce two forgetting terms, L_RND and L_RLD.

Random Network Distillation

For the data in D_F to be forgotten, we consider the following loss as random network distillation (RND):

L_RND(θ) = (1 / |D_F|) Σ_{x ∈ D_F} ‖ f_θ(x) − f_{θ_rand}(x) ‖²,

where θ_rand is a randomly initialized parameter of the DNN.
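As a concrete illustration, the RND term can be sketched as a mean squared error against a frozen, randomly initialized copy of the network. The tiny numpy MLP below is a minimal stand-in for the actual model; the function names and architecture are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mlp_forward(x, params):
    """Two-layer MLP forward pass: x -> logits."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

def init_params(rng, d_in, d_hid, d_out):
    """Randomly initialized parameters (an 'unlearned' network)."""
    return (rng.normal(0, 0.1, (d_in, d_hid)), np.zeros(d_hid),
            rng.normal(0, 0.1, (d_hid, d_out)), np.zeros(d_out))

def rnd_forgetting_loss(x_forget, params, frozen_random_params):
    """Mean squared error between the model's outputs and those of a
    fixed, randomly initialized network on the data to be forgotten."""
    out = mlp_forward(x_forget, params)
    target = mlp_forward(x_forget, frozen_random_params)  # held constant
    return np.mean((out - target) ** 2)
```

Minimizing this loss drives the model's outputs on D_F toward those of an untrained network.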

Random Label Distillation

Let ℓ_CE denote the softmax cross entropy loss defined as follows:

ℓ_CE(f_θ(x), y) = − log softmax(f_θ(x))_y, where softmax(z)_k = exp(z_k) / Σ_{j=1}^{K} exp(z_j),

and K is the number of classes. Then we consider the random label distillation (RLD) as follows: for the data in D_F to be forgotten,

L_RLD(θ) = (1 / |D_F|) Σ_{x ∈ D_F} ℓ_CE(f_θ(x), ŷ(x)),

where each random label ŷ(x) is uniformly distributed on a subset of the label set {1, …, K}.
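A minimal sketch of the RLD term, assuming the random labels are redrawn uniformly from an allowed subset of classes each time the loss is evaluated (the exact sampling scheme here is an assumption):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rld_forgetting_loss(logits_forget, rng, num_classes, allowed=None):
    """Cross entropy between the model's predictions on the forget data
    and labels drawn uniformly at random from a subset of the classes."""
    if allowed is None:
        allowed = np.arange(num_classes)
    random_labels = rng.choice(allowed, size=logits_forget.shape[0])
    p = softmax(logits_forget)
    picked = p[np.arange(len(random_labels)), random_labels]
    return -np.mean(np.log(picked + 1e-12))
```

For a model with uniform outputs, this loss equals log K, the entropy of a uniform guess over K classes.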



Additionally, we introduce a truncation of the output, which can be used for class-wise forgetting, only for comparison with RLD and RND. The truncation removes a specified index c from the output vector of the model; that is, we estimate the class label by ignoring c as follows: for each data point x,

ŷ(x) = argmax_{j ≠ c} f_θ(x)_j,

where f_θ(x)_j is the j-th element of the K-dimensional vector f_θ(x). In this method, we do not train the parameters of the model. However, we emphasize that the truncation cannot deal with forgetting information more subtle than a class.
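The truncation requires no training at all; it only masks the forgotten index at prediction time. A sketch:

```python
import numpy as np

def truncated_predict(logits, forgotten_class):
    """Class-wise forgetting without retraining: predict the argmax of the
    output vector while ignoring the forgotten class index."""
    masked = logits.copy()
    masked[:, forgotten_class] = -np.inf  # the forgotten class can never win
    return masked.argmax(axis=1)
```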

The correction term

In the classification problem, we use the cross entropy as the correction term:

L_cor(θ) = (1 / |D'_F|) Σ_{(x, y) ∈ D'_F} ℓ_CE(f_θ(x), y).

We compute the correction term on D'_F.

The remembering term

In order to prevent catastrophic forgetting of what needs to be remembered, we keep the following diagonal regularization term small:

L_rem(θ) = Σ_{i=1}^{N} F_i (θ_i − θ*_i)²,

where θ* = (θ*_1, …, θ*_N) is the parameter of the pretrained model before applying the forgetting methods, and the diagonal Fisher information is given by

F_i = E[ ( ∂ log p_θ(y | x) / ∂θ_i )² ] evaluated at θ = θ*,

for i = 1, …, N, where N is the number of parameters. The regularization term L_rem, which is used in elastic weight consolidation (EWC) introduced by Kirkpatrick et al. (2017), is a variant of the KL-divergence of the current model against the old model.
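The remembering term can be sketched for a linear softmax classifier, where the per-sample gradient of the log-likelihood has a closed form. Estimating the Fisher by sampling labels from the model's own predictions is one common choice; the function names and estimation details below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def diag_fisher(W, X):
    """Diagonal Fisher information of a linear softmax classifier,
    estimated with labels sampled from the model's own predictions."""
    rng = np.random.default_rng(0)
    F = np.zeros_like(W)
    for x in X:
        p = softmax(W @ x)
        y = rng.choice(len(p), p=p)
        # grad of -log p(y|x) w.r.t. W is (p - onehot(y)) outer x
        g = np.outer(p - np.eye(len(p))[y], x)
        F += g ** 2
    return F / len(X)

def remembering_term(W, W_star, F):
    """EWC-style penalty: parameters important for the pretraining data
    (large Fisher values) are anchored to the pretrained values W_star."""
    return np.sum(F * (W - W_star) ** 2)
```

The penalty is zero at the pretrained parameters and grows quadratically as important parameters drift away from them.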

Setting of Experiments

Pattern A

We construct a forgetting method for a specified class. In the case of RND and RLD, we combine the forgetting and defensive losses as follows:

L(θ) = L_fgt(θ) + λ L_rem(θ),

where the forgetting term L_fgt, which is one of L_RND and L_RLD, is computed on the inputs of the data to be forgotten, and λ is fixed through the experiments.

Pattern B and C

Consider the classification of images. First, we apply one of the following transformations to the images contained in a specified class.

  • Line-type: we set the brightness of a square area in the middle of the left side of each image to 255.

  • Tile-type: we replace the values of small square areas with 255 so that the areas are scattered throughout the image.

  • Color-type: for each pixel, we set the B-value to the average of the RGB-values and the R- and G-values to zero.

Examples of these transformations are shown in Figure 2.
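The three transformations can be sketched as numpy operations on H×W×3 uint8 images; the patch sizes and grid stride below are illustrative assumptions, since the paper's exact sizes are not reproduced here.

```python
import numpy as np

def line_backdoor(img, size=4):
    """Set a square area in the middle of the left side to 255."""
    out = img.copy()
    top = img.shape[0] // 2 - size // 2
    out[top:top + size, :size] = 255
    return out

def tile_backdoor(img, size=2, stride=8):
    """Set small square areas to 255, scattered over the image on a grid."""
    out = img.copy()
    for i in range(0, img.shape[0], stride):
        for j in range(0, img.shape[1], stride):
            out[i:i + size, j:j + size] = 255
    return out

def color_backdoor(img):
    """Per pixel: B-channel <- mean of RGB, R- and G-channels <- 0."""
    out = np.zeros_like(img)
    out[..., 2] = img.mean(axis=-1).astype(img.dtype)
    return out
```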

Figure 2: Visualization of backdoors (or leakage), created by adding the line-type backdoor (second from left), the tile-type one (third from left), and the color-type one (rightmost) to a picture (leftmost) contained in the dataset CIFAR10.

We train the model f_θ by stochastic gradient descent to minimize:

L(θ) = α L_RLD(θ) + β L_cor(θ) + γ L_rem(θ),

where α, β, and γ are non-negative hyperparameters. In Pattern B (resp. Pattern C), we optimize the hyperparameters by cross-validation. In the cross-validation, we maximize the minimum of the top-1 accuracy of the model on the datasets D'_F and D'_R (resp. D_F and D'_R).
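The total objective is then simply a non-negative weighted sum of the three scalar terms, for example:

```python
def selective_forgetting_loss(l_fgt, l_cor, l_rem, alpha, beta, gamma):
    """Total loss for Patterns B and C: a non-negative weighted sum of the
    forgetting (RLD), correction, and remembering terms."""
    assert min(alpha, beta, gamma) >= 0, "weights must be non-negative"
    return alpha * l_fgt + beta * l_cor + gamma * l_rem
```

Setting β = γ = 0 recovers a pure forgetting objective, while α = 0 recovers an EWC-style update on the corrected data.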

Results of Forgetting
Table 2: Results of forgetting the samples in a class (Pattern A) on Fashion-MNIST and CIFAR10, showing accuracy on the forgotten class 0 (D_F) and the remaining classes 1–9 (D_R) for the pretrained state, RND, RLD, and the truncation. The results of the RND and the RLD are the averages and standard deviations over 10 runs. For the truncation, the result on class 0 (D_F) is undefined, since the truncated model does not output a label probability for the forgotten class. The accuracy on classes 1–9 (D_R) is the average of the accuracy on each class in D_R.
Figure 3: Distribution of the softmax outputs on the forgotten class. We use Fashion-MNIST, and the forgotten class is 0.
Figure 4: Plots of the testing accuracy in the case of the tile-type transformation and Pattern B (left figures) or C (right ones). Each result is the average with the standard deviation over 10 experiments. In each figure, CE+Fisher+RLD is the result of learning with the hyperparameters α, β, and γ described in the Supplementary Materials. CE is the result of using the correction term alone. The lines CE+Fisher are the results of using the correction and remembering terms with the same weights as in CE+Fisher+RLD. In the legend of each figure, the symbols correspond to Table 1.


Pattern A

Table 2 shows the performance comparison of the forgetting terms in Pattern A. A method with higher accuracy on D_R and lower accuracy on D_F is better. Firstly, we observed that the truncation achieved better performance than RND and RLD. Unfortunately, the truncation cannot be applied to forget information more subtle than a class (e.g., samples). Next, we observed that the standard deviation of the results of the RND is larger than that of the RLD. Therefore, we chose the RLD as the forgetting term in Pattern B and Pattern C.

We evaluated the methods on Fashion-MNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky and others, 2009). We fixed the weight λ throughout the experiments.

Additionally, Figure 3 shows the distributions of the softmax outputs of several models. We observed that in the case of the truncation, the distribution after training on the whole training dataset (Figure 3, pink dotted line) approximates the distribution after training only on D_R (Figure 3, blue line). However, we observed that RND and RLD do not approximate the model trained from scratch.

Pattern B and C

Notably, we observed in Figure 4 that the proposed method (CE+Fisher+RLD) achieved a higher accuracy on D'_R than the method using only the correction and remembering terms. Therefore, the random distillation term made the model forget the information contained in D_F. Here D'_R is used in neither the forgetting process nor the pretraining process. In this sense, the proposed selective forgetting method made DNNs forget the information contained in D_F by using D_F and D'_F.

We observed the catastrophic forgetting of D_R in the baseline results (CE in Figure 4), which use only the correction term. Therefore, the remembering term prevented the catastrophic forgetting.

We applied the line-type and the tile-type (resp. the line-, the tile-, and the color-type) transformations to a specified class of the Fashion-MNIST (resp. CIFAR10) dataset. Then we trained a 10-layer multilayer perceptron (MLP), whose hidden layers have the same width as the input, on the training data. After training, we computed the Fisher information matrix on the dataset excluding the specified class. For the line-type and the color-type transformations, detailed results are shown in the Supplementary Materials (see Figures S2 and S1).

Discussion and Conclusion

Focusing on realistic problems that need selective forgetting, we have formulated three patterns of selective forgetting. The formulation is based on the performance on the four datasets shown in Table 1: D_F, D_R, D'_F, and D'_R. This formulation allows us to quantitatively assess selective forgetting at a finer level than forgetting samples.

In order to meet the demand for modifying trained models quickly, we have restricted the datasets accessible in selective forgetting to D_F and D'_F. That is, the restriction is to modify the model using only the data that contain the information to be forgotten and the corresponding data without the information.

The loss function for the forgetting is a combination of the forgetting loss, the correction loss, and the remembering loss. In our approach, we use the classification loss on D'_F and regression to the random network or random labels on D_F, while the KL-divergence from the pretrained model prevents the model from going too far from the pretrained state. This structure is a combination of EWC (Kirkpatrick et al., 2017) and the distillation to random values. Since EWC allows the model to learn additional data, it is naturally expected that the accuracy on D'_F and D_R is high. Remarkably, the accuracy on D'_R improved by introducing the distillation term without using D'_R itself. Therefore, the distillation to random values is indicated to be useful for forgetting information more subtle than samples in some situations.

However, the accuracy on D'_R remains around 50%, although it is improved. We consider the Fisher information on the pretrained model to be the cause of this problem; it contains the information to be forgotten. We believe that the reason for the insufficient accuracy is that the effects of the information to be forgotten contained in the Fisher information hardly disappear in EWC (Umer et al., 2020).

There would be several approaches for enhancing the accuracy, especially on D'_R. One way, as an extension of EWC or Fisher information, is to devise a method that removes the effect of specific samples from the Fisher information. This can lead the DNN to forget the target information more effectively. We expect that such a method can be constructed based on (Golatkar et al., 2020b). Another way is to construct a mechanism that memorizes the information of the pretraining data and can recall it by querying a single data point. The Fisher information, which we used in the experiments, can also be considered a memory for remembering the pretraining data, but we cannot divide it into the information of every single data point. Utilizing the memory of differentiable neural computers (Graves et al., 2016) is also a possible choice. When there is no restriction on saving the pretraining data, such as privacy protection, we can take a simpler approach: just saving the data. However, even in such situations, it is not practical to save all the data and iterate over them. Instead of storing the whole data, we can save some of the data that seem to be important, or save the data as a generative model. In the other direction, finding or constructing a concrete map φ would be useful. We assumed that the map is known in the experiments, but it can be constructed in a data-driven way. We can use domain translation techniques such as CycleGAN (Zhu et al., 2017) by regarding the information to be forgotten as a domain. By finding the map, we can reduce the amount of the data to be stored and improve the performance of the forgetting.

We assumed that is given, and we have not specified how to determine the data that should contain. If we know the information to be forgotten, such as an image background that affects classification, the choice is straightforward: we collect data with that information as . Another possible situation is that we find additional extrapolating data that are misclassified into a certain class due to a feature they share, and then determine accordingly. Specifically, we pick data belonging to that class from the pretraining dataset and use them as . In these situations, the feature to be forgotten is determined manually. Systematically suggesting such features or data from the additional data remains a future direction. Such a method would be especially useful in the context of continual learning.
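When the forget set is determined from misclassified extrapolating data as described above, the selection itself can be as simple as filtering the pretraining data by the predicted class. Here `model_predict` is a placeholder for the trained classifier, not a function from the paper:

```python
import numpy as np

def collect_forget_candidates(model_predict, data, target_class):
    # Keep the pretraining samples that the model assigns to the class
    # the extrapolating data fall into; these form the forget set.
    preds = model_predict(data)
    return data[preds == target_class]
```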

Dataset         Pattern   Learning Rate
Fashion-MNIST   Line B
Fashion-MNIST   Tile B
CIFAR10         Line B
CIFAR10         Tile B
CIFAR10         Color B
Table S1: Hyperparameters determined by cross-validations.

Appendix A Supplementary Materials

Searching hyperparameters

For the experiments of Pattern B and Pattern C, we searched the hyperparameters , , and the learning rate of SGD with Optuna (Akiba et al., 2019), evaluating the five-fold cross-validation accuracy over 200 loops. The value evaluated in each loop was computed as follows. Recall that are supposed to be available for tuning hyperparameters via cross-validation. Set . In each loop of the search, we uniformly divided the training dataset of (resp. ) into a 20% and an 80% split, written and (resp.  and ). We then applied the selective forgetting to the model using and

by SGD with momentum 0.9 for 10 epochs, and calculated the accuracy on , , and . For Pattern B and Pattern C, we maximized the minimum of the accuracy on the corresponding three validation sets described in Table 1. Table S1 lists the searched hyperparameters.

Setting of Model

Throughout the experiments, we used the same MLP setup. The number of layers was ten. To make backpropagation stable, we used a normalized hard tanh as the activation function, given by the following:

where and . This setting of the activation makes the MLP achieve dynamical isometry (Pennington et al., 2018). The model did not contain batch normalization layers or any other normalization layers. We initialized the weight matrices with independently and uniformly sampled orthogonal matrices, and the bias terms with 0.
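A minimal sketch of this setup follows. The scaling constants of the activation are placeholders, since the paper's exact dynamical-isometry values are not reproduced here:

```python
import numpy as np

def normalized_hard_tanh(x, a=1.0, b=1.0):
    # Hard tanh clipped to [-b, b] and rescaled by a; the constants
    # a, b stand in for the paper's dynamical-isometry choice.
    return a * np.clip(x, -b, b)

def orthogonal(d, rng):
    # Orthogonal weight init: QR of a Gaussian matrix, with column
    # signs fixed so the draw is uniform over orthogonal matrices.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def mlp_forward(x, weights):
    # Ten-layer MLP without normalization layers; biases start at 0
    # and are therefore omitted in this sketch.
    for w in weights:
        x = normalized_hard_tanh(x @ w)
    return x

rng = np.random.default_rng(0)
weights = [orthogonal(64, rng) for _ in range(10)]
y = mlp_forward(rng.standard_normal((8, 64)), weights)
```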

Appendix B Additional Experiments

In Figure S1, for the case of Fashion-MNIST, we observed that the accuracy on increased from the initial state when we used CE+Fisher+RLD. However, as Figures S1 and S2 show, for the case of CIFAR10 the increase in accuracy was slight. Since the line-style transformation is smaller than the tile-style one, we attribute this phenomenon to the difficulty of tuning the hyperparameters.

Figure S1: Plots of the testing accuracy in the case of the line-type transformation and Pattern B (left figures) or C (right ones). Each result is the average with the standard deviation of 10 experiments.
Figure S2: Plots of the testing accuracy in the case of the color-type transformation and Pattern B (left figures) or C (right ones). Each result is the average with the standard deviation of 10 experiments.


  • T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: Appendix A.
  • R. Binns (2018) Fairness in machine learning: lessons from political philosophy. In Conference on Fairness, Accountability and Transparency, Proceedings of Machine Learning Research, Vol. 81, pp. 149–159. Cited by: Introduction.
  • L. Bourtoule, V. Chandrasekaran, C. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot (2019) Machine unlearning. Preprint arXiv:1912.03817. Cited by: Formulations of Selective Forgetting, Methods for Selective Forgetting.
  • X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. Preprint arXiv:1712.05526. Cited by: Introduction, Pattern B: Forget Backdoor.
  • Z. Chen, B. Liu, R. Brachman, P. Stone, and F. Rossi (2018) Lifelong machine learning: second edition. Morgan & Claypool Publishers. Cited by: Introduction.
  • C. Dwork (2008) Differential privacy: a survey of results. In Theory and Applications of Models of Computation, pp. 1–19. Cited by: Formulations of Selective Forgetting.
  • M. Fredrikson, S. Jha, and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pp. 1322–1333. Cited by: Introduction.
  • R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020) Shortcut learning in deep neural networks. Preprint arXiv:2004.07780. Cited by: Introduction.
  • A. Ginart, M. Guan, G. Valiant, and J. Y. Zou (2019) Making AI forget you: data deletion in machine learning. In Advances in Neural Information Processing Systems 32, pp. 3518–3531. Cited by: Formulations of Selective Forgetting, Methods for Selective Forgetting.
  • A. Golatkar, A. Achille, and S. Soatto (2020a) Eternal sunshine of the spotless net: selective forgetting in deep networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9304–9312. Cited by: Formulations of Selective Forgetting, Methods for Selective Forgetting, Information to be Forgotten.
  • A. Golatkar, A. Achille, and S. Soatto (2020b) Forgetting outside the box: scrubbing deep networks of information accessible from input-output observations. Preprint arXiv:2003.02960. Cited by: Methods for Selective Forgetting, Information to be Forgotten, Discussion and Conclusion.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. Cited by: Discussion and Conclusion.
  • C. Guo, T. Goldstein, A. Hannun, and L. van der Maaten (2019) Certified data removal from machine learning models. Preprint arXiv:1911.03030. Cited by: Formulations of Selective Forgetting, Methods for Selective Forgetting.
  • S. Kaufman, S. Rosset, C. Perlich, and O. Stitelman (2012) Leakage in data mining: formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (4), pp. 1–21. Cited by: Introduction.
  • R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan (2018) Measuring catastrophic forgetting in neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3390–3398. Cited by: Introduction, Catastrophic Forgetting.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: Introduction, Introduction, Catastrophic Forgetting, The remembering term , Discussion and Conclusion.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: Figure 1.
  • A. Krizhevsky et al. (2009) Learning multiple layers of features from tiny images. Technical report Cited by: Pattern A.
  • W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool (2017) Webvision database: visual learning and understanding from web data. Preprint arXiv:1708.02862. Cited by: Figure 1, Pattern A: Forget Samples.
  • Y. Li, B. Wu, Y. Jiang, Z. Li, and S. Xia (2020) Backdoor learning: a survey. Preprint arXiv:2007.08745. Cited by: Features Finer than Samples.
  • Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. Cited by: Introduction.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), pp. 3730–3738. Cited by: Figure 1.
  • L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, pp. 27–38. Cited by: Introduction.
  • G. Neff and P. Nagy (2016) Automation, algorithms, and politics — talking to bots: symbiotic agency and the case of Tay. International Journal of Communication 10, pp. 17. Cited by: Pattern A: Forget Samples.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. Cited by: Introduction.
  • J. Pennington, S. Schoenholz, and S. Ganguli (2018) The emergence of spectral universality in deep networks. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1924–1932. Cited by: Appendix A.
  • M. Umer, G. Dawson, and R. Polikar (2020) Targeted forgetting and false memory formation in continual learners through adversarial backdoor attacks. 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: Discussion and Conclusion.
  • M. J. Wolf, K. W. Miller, and F. S. Grodzinsky (2017) Why we should have seen that coming: comments on Microsoft’s Tay “experiment,” and wider implications. The ORBIT Journal 1 (2), pp. 1–12. Cited by: Pattern A: Forget Samples.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Preprint arXiv:1708.07747. Cited by: Pattern A.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: Discussion and Conclusion.