1 Introduction
The right to be forgotten, or right to erasure, entitles data owners to have their data deleted by any entity storing it. Recently enacted legislation, such as the General Data Protection Regulation (GDPR, https://gdprinfo.eu/) in the European Union and the California Consumer Privacy Act (CCPA, https://oag.ca.gov/privacy/ccpa) in California, has legally solidified this right. Google Search has received nearly 3.2 million requests to delist certain URLs in search results over five years [8].
In the machine learning context, the right to be forgotten requires that, in addition to the data itself, any influence of the data on the model disappears [10, 59]. This process, also called machine unlearning, has gained momentum both in academia and industry [10, 62, 17, 9, 55, 18, 20, 35, 7, 27, 50]. The most legitimate way to implement machine unlearning is to remove the data sample requested to be deleted (referred to as the target sample) and retrain the ML model from scratch, but this incurs high computational overhead. Recently, Bourtoule et al. [9] have proposed an approximate method to accelerate the machine unlearning process.
Machine unlearning naturally generates two versions of the ML model, namely the original model and the unlearned model, and creates a discrepancy between those due to the target sample’s deletion. While originally designed to protect the privacy of the target, we argue that machine unlearning may leave some imprint of the target sample, and thus create unintended privacy risks. More specifically, while the original model may not reveal much private information about the target, additional information might be leaked through the unlearned model.
1.1 Our Contributions
In this paper, we study to what extent data is indelibly imprinted in an ML model by quantifying the additional information leakage caused by machine unlearning. We concentrate on machine learning classification, the most common machine learning task, and assume both original and unlearned models to be black-box, the most challenging setting for an adversary.
We first propose a novel membership inference attack in the machine unlearning setting that aims at determining whether the target sample is part of the training set of the original model. Different from classical membership inference attacks [54, 49], which leverage the output (posterior) of a single target model, our attack leverages outputs of both original and unlearned models. More concretely, we propose several aggregation methods to jointly use the two posteriors from the two models as our attack model’s input, either by concatenating them or by computing their differences. Our empirical results show that the concatenation-based methods perform better in overfitted models, while the difference-based methods perform better in well-generalized models.
Second, to quantify the unintended privacy risks incurred by machine unlearning, we propose two novel privacy metrics, namely Degradation Count and Degradation Rate. Both of them measure how much relative privacy the target has lost due to machine unlearning. Concretely, Degradation Count calculates the proportion of cases where the adversary’s confidence about the membership status of the target sample is larger with our attack than with the classical membership inference attack. Degradation Rate calculates the average absolute increase of confidence between our attack and classical membership inference.
We conduct extensive experiments to evaluate the performance of our attack over a series of ML models, ranging from logistic regression to convolutional neural networks, with multiple categorical datasets and image datasets. The experimental results show that our attack consistently degrades the membership privacy of the target sample, which indicates machine unlearning can have counterproductive effects on privacy. In particular, we observe that privacy is especially degraded because of machine unlearning in the case of well-generalized models. For example, we observe that the classical membership inference attack has an accuracy (measured by AUC) close to 0.5, or random guessing, on the low-overfitted decision tree classifier. On the contrary, our attack attains a much higher AUC on the same classifier, together with high Degradation Count and Degradation Rate, which demonstrates that machine unlearning can have a detrimental effect on membership privacy even with well-generalized models. We further show that we can effectively infer membership information when a group of samples (instead of a single one) are deleted together from the original target model.
Finally, in order to mitigate the privacy risks stemming from machine unlearning, we propose two possible defense mechanisms: (i) publishing only the top-k confidence values of the posterior, and (ii) publishing only the predicted label. The experimental results show that our attack is very robust to the top-k defense, even when the model owner only releases the top-1 confidence value. On the other hand, publishing only the predicted label can effectively prevent our attack.
To summarize, we show that machine unlearning will degrade privacy of the target sample in general. This discovery sheds light on the risks of implementing the right to be forgotten in the ML context. We believe that our attack and metrics will help develop more privacy-preserving machine unlearning approaches. The main contributions of this paper are fourfold:

We take the first step to quantify the unintended privacy risks in machine unlearning through the lens of membership inference attacks.

We propose several practical approaches for aggregating the information returned by the two versions of the ML models.

We propose two novel metrics to measure the privacy degradation stemming from machine unlearning and conduct extensive experiments to show the effectiveness of our attack.

We propose two defense mechanisms to mitigate the privacy risks stemming from our attack and empirically evaluate their effectiveness.
1.2 Organization
Section 2 introduces some background knowledge about machine learning and machine unlearning, and the threat model. Section 3 presents the details of our proposed attack. We propose two privacy degradation metrics in Section 4. We conduct extensive experiments to illustrate the effectiveness of the proposed attack in Section 5. In Section 6, we introduce several possible defense mechanisms and empirically evaluate their effectiveness. We discuss the related work in Section 7 and conclude the paper in Section 8.
2 Preliminaries
In this section, we first introduce some background knowledge on machine learning and unlearning, and then we present the threat model.
2.1 Machine Learning
In this paper, we focus on machine learning classification, the most common machine learning task. An ML classifier $\mathcal{M}$ maps a data sample $x$ to posterior probabilities $\mathcal{P}$, where $\mathcal{P}$ is a vector of entries indicating the probability of $x$ belonging to a certain class according to the model $\mathcal{M}$. The sum of all values in $\mathcal{P}$ is 1 by definition. To construct an ML model, one needs to collect a set of data samples, referred to as the training set $\mathcal{D}$. The model is then built through a training process that aims at minimizing a predefined loss function with some optimization algorithm, such as stochastic gradient descent (SGD).
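As a small illustration of the posterior described above, the following sketch assumes a classifier that produces raw per-class scores and normalizes them with a softmax. The softmax normalization and the example scores are assumptions made for illustration; the text only requires that the posterior entries sum to 1.

```python
import math

def softmax(scores):
    """Turn raw class scores into a posterior vector whose entries sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores of a 3-class classifier for one sample.
posterior = softmax([2.0, 1.0, 0.1])

# The predicted label is the class with the largest posterior entry.
predicted_class = max(range(len(posterior)), key=posterior.__getitem__)
```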
2.2 Machine Unlearning
Recent legislation such as GDPR and CCPA enacts the “right to be forgotten”, which allows individuals to request the deletion of their data by the service provider to preserve their privacy. In the context of machine learning, e.g., MLaaS, this implies that the model owner should remove the target sample $x$ from its training set $\mathcal{D}_o$. Moreover, any influence $x$ has on the model should also be removed. This process is referred to as machine unlearning.
Retraining from Scratch. The most legitimate way to implement machine unlearning is to retrain the whole ML model from scratch. Formally, denoting the original model as $\mathcal{M}_o$ and its training dataset as $\mathcal{D}_o$, this approach consists of training a new model $\mathcal{M}_u$ on dataset $\mathcal{D}_u = \mathcal{D}_o \setminus \{x\}$ (we also study the removal of more than one sample in our experimental evaluation, but for simplicity we formalize our problem with one sample only). We call $\mathcal{M}_u$ the unlearned model. Retraining from scratch is easy to implement. However, when the original dataset is large and the model is complex, the computational overhead of retraining becomes prohibitive. To reduce the computational overhead, several approximate approaches have been proposed [27, 10, 50, 7], among which SISA [9] works in an ensemble style and is the most general one.
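Retraining from scratch can be sketched as follows. The nearest-centroid "model", the softmax over negative squared distances, and the four-sample dataset are toy assumptions chosen only to keep the example self-contained; they are not the models evaluated in this paper.

```python
import math
from collections import defaultdict

def train(dataset):
    """Toy nearest-centroid model: maps each label to its mean feature vector."""
    sums, counts = {}, defaultdict(int)
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def posterior(model, features):
    """Softmax over negative squared distances to each class centroid."""
    labels = sorted(model)
    scores = [-sum((a - b) ** 2 for a, b in zip(features, model[l])) for l in labels]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def unlearn_by_retraining(dataset, target):
    """Drop the target sample (and any exact duplicates), then train anew."""
    remaining = [sample for sample in dataset if sample != target]
    return train(remaining)

d_o = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([1.0, 1.0], 1), ([0.9, 1.2], 1)]
target = ([0.2, 0.1], 0)

m_o = train(d_o)                          # original model
m_u = unlearn_by_retraining(d_o, target)  # unlearned model

p_o = posterior(m_o, target[0])  # posterior of the target on the original model
p_u = posterior(m_u, target[0])  # posterior of the target on the unlearned model
```

In this toy setting the two posteriors differ for the deleted sample, which is the kind of discrepancy the attack in Section 3 exploits.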
SISA. The training dataset $\mathcal{D}_o$ in SISA is partitioned into $k$ disjoint parts $\mathcal{D}_o^1, \mathcal{D}_o^2, \dots, \mathcal{D}_o^k$. The model owner trains a set of original ML models $\mathcal{M}_o^1, \mathcal{M}_o^2, \dots, \mathcal{M}_o^k$, one on each corresponding dataset $\mathcal{D}_o^i$. When the model owner receives a request to delete a data sample $x$, it just needs to retrain the submodel $\mathcal{M}_o^i$ whose dataset contains $x$, which results in the unlearned submodel $\mathcal{M}_u^i$. Submodels whose datasets do not contain $x$ remain unchanged. Notice that the size of dataset $\mathcal{D}_o^i$ is much smaller than that of $\mathcal{D}_o$; thus, the computational overhead of SISA is much smaller than that of the “retraining from scratch” method.
At inference time, the model owner aggregates predictions from the different submodels to provide an overall prediction. The most commonly used aggregation strategies are majority vote and posterior average. In our experiments, we use posterior average as the aggregation strategy.
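The sharding, selective retraining, and posterior averaging described above can be sketched as follows. This is a minimal illustration rather than the SISA implementation of [9]: the round-robin partition and the label-frequency submodel (whose posterior ignores the query entirely) are stand-ins chosen for brevity, and all function names are hypothetical.

```python
def partition(dataset, k):
    """Split the training set into k disjoint shards (round-robin)."""
    return [list(dataset[i::k]) for i in range(k)]

def train_label_frequency(shard, num_classes):
    """Toy submodel: its posterior is simply the shard's label frequencies.
    Assumes the shard is non-empty."""
    counts = [0] * num_classes
    for _, label in shard:
        counts[label] += 1
    return [c / len(shard) for c in counts]

def sisa_unlearn(shards, submodels, target, num_classes):
    """Retrain only the submodel whose shard contains the target sample."""
    new_shards, new_submodels = list(shards), list(submodels)
    for i, shard in enumerate(shards):
        if target in shard:
            new_shards[i] = [s for s in shard if s != target]
            new_submodels[i] = train_label_frequency(new_shards[i], num_classes)
    return new_shards, new_submodels

def aggregate(submodels):
    """Posterior average across all submodels."""
    k = len(submodels)
    return [sum(p[c] for p in submodels) / k for c in range(len(submodels[0]))]

dataset = [("a", 0), ("b", 0), ("c", 1), ("d", 1), ("e", 0), ("f", 1)]
shards = partition(dataset, 2)
submodels = [train_label_frequency(s, 2) for s in shards]
before = aggregate(submodels)

# Deleting ("c", 1) touches only the first shard; the second submodel is reused.
new_shards, new_submodels = sisa_unlearn(shards, submodels, ("c", 1), 2)
after = aggregate(new_submodels)
```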
2.3 Threat Model
The objective of the adversary is to perform membership inference against the target sample, i.e., to determine whether a given target sample is in the training set of the original model [54, 49]. Knowing that a specific data sample was used to train a particular model may lead to a potential privacy breach. For example, knowing that a certain patient’s clinical records were used to train a model associated with a disease (e.g., to determine the appropriate drug dosage or to discover the genetic basis of the disease) can reveal that the patient carries the associated disease. The classical membership inference attack can achieve this objective by exploiting the output (typically the posterior distribution over possible classes) returned by the original model. In the machine unlearning setting, the adversary has access to the outputs of both the original model and the unlearned model; thus the adversary can exploit two versions of posteriors to launch the membership inference attack against the target sample.
Similar to previous membership inference attacks [54, 49], we assume the adversary has black-box access to the models. This means that the adversary can only query these models and obtain their corresponding posteriors. Compared to the white-box setting, where the adversary has direct access to the architecture and parameters of the target model, the black-box setting is more realistic, and more challenging for the adversary [39]. We further assume that the adversary has a shadow dataset which can be used to train a set of shadow models to mimic the behavior of the target model. The shadow models are then used to generate another dataset to train the attack model (see Section 3 for more details). The shadow dataset can either come from the same distribution as the target dataset or from a different one. We evaluate both settings in Section 5.
3 Membership Inference in Unlearning
In this section, we detail our membership inference attack in the machine unlearning setting.
3.1 Attack Pipeline
The general pipeline of our attack is illustrated in Figure 1. It consists of three phases: posterior generation, feature construction and (membership) inference.
Posterior Generation. The adversary has access to two versions of the target ML model, the original model $\mathcal{M}_o$ and the unlearned model $\mathcal{M}_u$. Given a target sample $x$, the adversary queries $\mathcal{M}_o$ and $\mathcal{M}_u$, and obtains the corresponding posteriors, i.e., $\mathcal{P}_o$ and $\mathcal{P}_u$, also referred to as confidence values or levels [54].
Feature Construction. Given the two posteriors $\mathcal{P}_o$ and $\mathcal{P}_u$, the adversary aggregates them to construct the feature vector $\mathcal{F}$. There are several alternatives for constructing the feature; we discuss them in Section 3.3.
Inference. Finally, the adversary feeds the obtained $\mathcal{F}$ to the attack model, which is a binary classifier, to determine whether the target sample is in the training set of the original model or not. We describe how to build the attack model in Section 3.2.
3.2 Attack Model Training
We assume the adversary has a local dataset, which we call the shadow dataset $\mathcal{D}_s$. The shadow dataset can come from a different distribution than the one used to train the target model. To infer whether the target sample is in the training set of the original model or not, our approach is to train a binary classifier that captures the difference between the two posteriors. The intuition is that, if the target sample is deleted, the two models $\mathcal{M}_o$ and $\mathcal{M}_u$ will behave differently. Figure 2 illustrates the training process of the attack model, and the detailed training procedure is presented as follows.
Training Shadow Models. To mimic the behavior of the target models, the adversary needs to train a shadow original model and a set of shadow unlearned models. To do this, the adversary first partitions $\mathcal{D}_s$ into two disjoint parts, the shadow negative set $\mathcal{D}_s^n$ and the shadow positive set $\mathcal{D}_s^p$. The shadow positive set $\mathcal{D}_s^p$ is used to train the shadow original model $\mathcal{M}_o^s$. The shadow unlearned model is trained by deleting samples from $\mathcal{D}_s^p$. For ease of presentation, we assume the shadow unlearned model is obtained by deleting exactly one sample. We will show that our attack is still effective for group deletion in Section 5.7. The adversary randomly generates a set of deletion requests (target samples) $\mathcal{R}_u = \{x_1, \dots, x_m\}$ and trains a set of shadow unlearned models $\mathcal{M}_u^{s,1}, \dots, \mathcal{M}_u^{s,m}$, where shadow unlearned model $\mathcal{M}_u^{s,i}$ is trained on dataset $\mathcal{D}_s^p \setminus \{x_i\}$.
Obtaining Posteriors. At the posterior generation phase, the adversary feeds each target sample $x_i$ to the shadow original model $\mathcal{M}_o^s$ and its corresponding shadow unlearned model $\mathcal{M}_u^{s,i}$, and gets two posteriors $\mathcal{P}_o^i$ and $\mathcal{P}_u^i$.
Constructing Features. The adversary then uses the feature construction methods discussed in Section 3.3 to construct training cases for the attack model. In classical membership inference, posteriors of $x_i$ serve as member cases of the attack model. But in the machine unlearning setting, $x_i$ is a member of the shadow original model $\mathcal{M}_o^s$ and a non-member of the shadow unlearned model $\mathcal{M}_u^{s,i}$. To avoid confusion, we call the samples related to $x_i$ positive cases instead of member cases for the attack model.
To train the attack model, the adversary also needs a set of negative cases. This can be done by sampling a set of negative query samples $\mathcal{R}_n$ from the shadow negative dataset $\mathcal{D}_s^n$ and querying the shadow original model and unlearned model. To get a good model generalization performance, the adversary needs to ensure that the number of positive cases and the number of negative cases of the attack model are balanced, i.e., $|\mathcal{R}_u| = |\mathcal{R}_n|$, where $|\cdot|$ is the cardinality of the sample set.
Improving Diversity. To improve the diversity of the attack model, the adversary obtains multiple shadow original models by randomly sampling multiple subsets of samples from the shadow positive dataset $\mathcal{D}_s^p$. For each shadow original model, the adversary randomly generates a set of deletion requests and trains a sequence of shadow unlearned models. In Section 5.5, we will conduct empirical experiments to show the impact of the number of shadow original models on the attack performance.
Training Attack Model. Given the features of the positive cases and of the negative cases, we rely on four standard and widely used classifiers for our attack model: logistic regression, decision tree, random forest, and multilayer perceptron.
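The construction of positive and negative cases described in this subsection can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the callables `train`, `query`, and `make_feature` are hypothetical placeholders, each negative sample is paired with the shadow unlearned model of one deletion request, and the "memorizing" toy model exists only so that the example runs end to end.

```python
import random

def build_attack_training_set(shadow_positive, shadow_negative, train, query,
                              make_feature, num_requests, rng):
    """Return (features, labels): label 1 for deleted samples, 0 for negatives."""
    shadow_original = train(shadow_positive)
    targets = rng.sample(shadow_positive, num_requests)    # deletion requests
    negatives = rng.sample(shadow_negative, num_requests)  # same size: balanced
    features, labels = [], []
    for target, negative in zip(targets, negatives):
        # Shadow unlearned model: retrain without the requested sample.
        shadow_unlearned = train([s for s in shadow_positive if s != target])
        # Positive case: the deleted sample queried on both models.
        features.append(make_feature(query(shadow_original, target),
                                     query(shadow_unlearned, target)))
        labels.append(1)
        # Negative case: a sample that was never part of the training set.
        features.append(make_feature(query(shadow_original, negative),
                                     query(shadow_unlearned, negative)))
        labels.append(0)
    return features, labels

# Toy stand-ins: a model that is confident only on samples it was trained on.
def toy_train(samples):
    return frozenset(samples)

def toy_query(model, sample):
    return [0.9, 0.1] if sample in model else [0.5, 0.5]

def concat(p_o, p_u):
    return p_o + p_u

rng = random.Random(0)
positives = [("pos", i) for i in range(10)]
negatives = [("neg", i) for i in range(10)]
features, labels = build_attack_training_set(
    positives, negatives, toy_train, toy_query, concat, 4, rng)
```

The resulting `features` and `labels` would then be passed to any binary classifier, such as the four listed above.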
3.3 Feature Construction
Given the two posteriors, a straightforward approach to aggregate the information is to concatenate them, i.e., $\mathcal{P}_o \,\|\, \mathcal{P}_u$, where $\|$ is the concatenation operation. This preserves the full information. However, it is possible that the concatenation contains redundancy. In order to reduce redundancy, we can instead rely on the difference between $\mathcal{P}_o$ and $\mathcal{P}_u$ to capture the discrepancy left by the deletion of the target sample. In particular, we make use of the element-wise difference $\mathcal{P}_o - \mathcal{P}_u$ and the Euclidean distance $\lVert \mathcal{P}_o - \mathcal{P}_u \rVert_2$.
In order to better capture the level of confidence of the model, one may also sort the posterior before the difference or concatenation operations [16]. Specifically, we sort the original posterior $\mathcal{P}_o$ in descending order and get the sorted original posterior $\widetilde{\mathcal{P}}_o$. We then rearrange the order of the unlearned posterior $\mathcal{P}_u$ to align its elements with $\widetilde{\mathcal{P}}_o$, and get the sorted unlearned posterior $\widetilde{\mathcal{P}}_u$.
To summarize, we adopt the following five methods to construct the feature for the attack model:

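Direct concatenate, i.e., $\mathcal{P}_o \,\|\, \mathcal{P}_u$.

Sorted concatenate, i.e., $\widetilde{\mathcal{P}}_o \,\|\, \widetilde{\mathcal{P}}_u$.

Direct difference, i.e., $\mathcal{P}_o - \mathcal{P}_u$.

Sorted difference, i.e., $\widetilde{\mathcal{P}}_o - \widetilde{\mathcal{P}}_u$.

Euclidean distance, i.e., $\lVert \mathcal{P}_o - \mathcal{P}_u \rVert_2$.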
In Section 5.3, we conduct empirical experiments to evaluate the performance of the above methods and summarize a feature choice guidance.
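The five feature construction methods can be sketched in a few lines. The function names are illustrative, and the sign convention of the difference (original minus unlearned) is an assumption; the sorted variants order the original posterior in descending order and apply the same permutation to the unlearned posterior, as described above.

```python
import math

def _sort_aligned(p_o, p_u):
    """Sort p_o in descending order and reorder p_u with the same permutation."""
    order = sorted(range(len(p_o)), key=lambda i: p_o[i], reverse=True)
    return [p_o[i] for i in order], [p_u[i] for i in order]

def direct_concat(p_o, p_u):
    return list(p_o) + list(p_u)

def sorted_concat(p_o, p_u):
    s_o, s_u = _sort_aligned(p_o, p_u)
    return s_o + s_u

def direct_diff(p_o, p_u):
    return [a - b for a, b in zip(p_o, p_u)]

def sorted_diff(p_o, p_u):
    s_o, s_u = _sort_aligned(p_o, p_u)
    return [a - b for a, b in zip(s_o, s_u)]

def euclidean_distance(p_o, p_u):
    # Returned as a one-element list so every method yields a feature vector.
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(p_o, p_u)))]

# Hypothetical posteriors of one target sample on a 3-class task.
p_o = [0.1, 0.7, 0.2]  # original model
p_u = [0.3, 0.4, 0.3]  # unlearned model

features = {
    "direct_concat": direct_concat(p_o, p_u),
    "sorted_concat": sorted_concat(p_o, p_u),
    "direct_diff": direct_diff(p_o, p_u),
    "sorted_diff": sorted_diff(p_o, p_u),
    "euclidean_distance": euclidean_distance(p_o, p_u),
}
```

Note that the concatenation features have twice the dimension of a posterior, the difference features have the same dimension, and the Euclidean distance collapses everything into a single scalar.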
4 Privacy Degradation Measurement
In this paper, we aim to evaluate to what extent machine unlearning may degrade the membership privacy of an individual whose data sample has been deleted from the training set (we also call this the target sample). Specifically, we want to quantify the additional privacy degradation our attack has over classical membership inference (or the improvement of membership inference) in order to measure the unintended information leakage due to data deletion in machine learning. To this end, we propose two privacy degradation metrics that measure the difference of the confidence levels of our attack and classical membership inference in predicting the right membership status of the target sample.
Given $n$ target samples $x_1$ to $x_n$, define $p_i^u$ as the confidence of our attack for classifying $x_i$ as a member, and $p_i^c$ as the confidence of classical membership inference. Let $b_i$ be the true status of $x_i$, i.e., $b_i = 1$ if $x_i$ is a member, and $b_i = 0$ otherwise. With that, we define the following two metrics:

DegCount. DegCount stands for Degradation Count. It calculates the proportion of target samples whose true membership status is predicted with higher confidence by our attack than by classical membership inference. Formally, DegCount is defined as
$$\mathrm{DegCount} = \frac{1}{n} \sum_{i=1}^{n} \left[ b_i \, \mathbb{1}_{p_i^u > p_i^c} + (1 - b_i) \, \mathbb{1}_{p_i^u < p_i^c} \right],$$
where $\mathbb{1}_{P}$ is the indicator function, which equals 1 if $P$ is true and 0 otherwise. Higher DegCount means higher privacy degradation level.

DegRate. DegRate stands for Degradation Rate. It calculates the average confidence improvement of our attack in predicting the true membership status compared to classical membership inference. DegRate can be formally defined as
$$\mathrm{DegRate} = \frac{1}{n} \sum_{i=1}^{n} \left[ b_i \left( p_i^u - p_i^c \right) + (1 - b_i) \left( p_i^c - p_i^u \right) \right].$$
Higher DegRate means higher privacy degradation level.
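A minimal sketch of the two metrics follows, written directly from their verbal definitions: for members, a higher membership confidence counts as an improvement, and for non-members a lower one does. The function and argument names, as well as the four example samples, are illustrative assumptions.

```python
def deg_count(status, conf_ours, conf_classic):
    """Fraction of samples whose true status our attack predicts more confidently."""
    hits = 0
    for b, ours, classic in zip(status, conf_ours, conf_classic):
        if (b == 1 and ours > classic) or (b == 0 and ours < classic):
            hits += 1
    return hits / len(status)

def deg_rate(status, conf_ours, conf_classic):
    """Average confidence gain of our attack towards the true status."""
    gains = [
        (ours - classic) if b == 1 else (classic - ours)
        for b, ours, classic in zip(status, conf_ours, conf_classic)
    ]
    return sum(gains) / len(gains)

# Hypothetical example: two members followed by two non-members.
status = [1, 1, 0, 0]
conf_ours = [0.9, 0.6, 0.2, 0.7]     # membership confidence of our attack
conf_classic = [0.6, 0.7, 0.5, 0.4]  # membership confidence of the classical attack

count = deg_count(status, conf_ours, conf_classic)
rate = deg_rate(status, conf_ours, conf_classic)
```

In this example the first and third samples are predicted with more confidence by our attack, so the count is 0.5, while the gains and losses partly cancel in the rate.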
5 Evaluation
In this section, we conduct extensive experiments to evaluate the unintended privacy risks of machine unlearning. We first conduct an end-to-end experiment to validate the effectiveness of our attack on multiple datasets using the most legitimate unlearning method, i.e., retraining from scratch. Second, we compare the different feature construction methods proposed in Section 3.3 and summarize a principle for choosing among them. Third, we evaluate the impact of overfitting and different hyperparameters on our attack. Fourth, we conduct experiments to evaluate dataset and model transferability between shadow model and target model. Finally, we show the effectiveness of our attack against group deletion and the SISA unlearning method.
5.1 Experimental Setup
Environment. All algorithms are implemented in Python 3.7 and all the experiments are conducted on a server with an Intel Xeon E7-8867 v3 @ 2.50GHz and 1.5TB memory.
Datasets. We run experiments on two different types of datasets: categorical datasets and image datasets. The categorical datasets are used to evaluate the vulnerability of simple machine learning models, such as logistic regression (LR), decision tree (DT), random forest (RF), and multilayer perceptron (MLP). The image datasets are used to evaluate the vulnerability of state-of-the-art convolutional neural networks, such as ResNet [24]. We use the following datasets in our experiment.

UCI Adult [3]. UCI Adult is a widely used categorical dataset for classification. It is a census dataset that contains around samples with features, including race, gender, occupation, etc. The original classification task is to predict whether the income of a person is over , which is a binary classification task. To evaluate the performance of multiclass classification, we transform the occupation feature into a label in our experiment, with possible classes for this label. For ease of presentation, we denote these two tasks as two different datasets, namely Adult (income) and Adult (occupation).

US Accident [4]. US Accident is a countrywide traffic accident dataset, which covers states of the United States. This dataset contains around 3M samples. We filter out attributes with too many missing values and obtain valid features. The valid features include temperature, humidity, pressure, etc. The classification task is to predict the accident severity level which contains classes.

InstaNY [6]. This dataset contains a collection of Instagram users’ location check-in data in New York. Each check-in contains a location and a timestamp, and each location belongs to a category. We use the number of check-ins that happened at each location in each hour on a weekly basis as the location feature vector. The classification task is to predict each location’s category among different categories. After filtering out locations with fewer than 50 check-ins, we get 19,215 locations for the InstaNY dataset. Later in the section, we also make use of check-ins in Los Angeles, namely InstaLA [6], for evaluating the data transferring attack. This dataset includes 16,472 locations.

MNIST [2]. MNIST is an image dataset widely used for classification. It is a 10-class handwritten digits dataset which contains 42,000 samples, each being formatted into a 28 × 28 pixel image.

CIFAR10 [1]. CIFAR10 is a benchmark dataset used to evaluate image recognition algorithms. This dataset contains 60,000 colored images of size 32 × 32, which are equally distributed over the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 50,000 training images and 10,000 testing images.
Metrics. In addition to the two privacy degradation metrics proposed in Section 4, we also rely on the traditional AUC metric to measure the absolute performance of our attack and classical membership inference. To summarize, we have the following three metrics:

AUC. It is a widely used metric to measure the performance of binary classification over a range of thresholds [15, 6, 42, 41, 46, 22, 49, 66, 30, 65]. It tells how capable the attack model is of distinguishing between members and non-members. A higher AUC value implies a better ability to predict the membership status. An AUC value of 1 indicates maximum performance (true-positive rate of 1 with false-positive rate of 0), while an AUC value of 0.5 indicates a performance equivalent to random guessing.

DegCount. It stands for Degradation Count, which is defined in Section 4.

DegRate. It stands for Degradation Rate, which is defined in Section 4.
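For reference, the AUC can be computed without choosing any threshold by using its rank interpretation: it equals the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and example scores are illustrative, and the quadratic loop is meant for clarity rather than efficiency.

```python
def auc(labels, scores):
    """AUC as the fraction of (member, non-member) pairs ranked correctly."""
    members = [s for y, s in zip(labels, scores) if y == 1]
    non_members = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for m in members:
        for n in non_members:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(members) * len(non_members))

# Hypothetical attack scores for two members and two non-members.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
value = auc(labels, scores)

# A constant score carries no information and yields random-guess performance.
chance = auc(labels, [0.5, 0.5, 0.5, 0.5])
```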
Experimental Setting. We evenly split each dataset into disjoint target dataset and shadow dataset . In Section 5.6, we will show that the shadow dataset can come from a different distribution than the target dataset. The shadow dataset is further split into shadow positive dataset and shadow negative dataset ( for and for ). We randomly sample subsets of samples from , each containing samples, to train shadow original models. Let us denote the training dataset for the shadow original model as , where . For each shadow original model , we train shadow unlearned models on , where is randomly sampled from . We then follow the procedure in Section 3.2 to construct the training dataset for the attack model, and train the attack model using different classifiers. By default, we set the hyperparameters of the shadow models to .
Similarly, the target dataset is split into target positive dataset and target negative dataset . Following the same procedure as for the shadow dataset, we train target original models, each containing samples and generating unlearned models. The data generated by the shadow models serve as training data for the attack model, while the data generated by the target models serve as testing data. By default, we set the hyperparameters of the target models to . The model parameter settings of logistic regression, decision tree, random forest and multilayer perceptron are listed in Appendix B.
5.2 Evaluation of the Method
In this subsection, we conduct an end-to-end experiment to evaluate our attack against the most legitimate unlearning approach of retraining the ML model from scratch. We start by considering the scenario where only one sample is deleted for each unlearned model. The scenario where multiple samples are deleted before the ML model is retrained will be evaluated in Section 5.7. We conduct the experiment on both categorical datasets and image datasets with three evaluation metrics, namely AUC, DegCount, and DegRate.
Figure 3 shows the performance for categorical datasets. We evaluate the performance on multiple target models and multiple attack models. Specifically, we use four standard classifiers for both target models and attack models, resulting in 16 combinations. The groups on the x-axis represent the attack models and the color bars (see the legend) represent the target models. We report here the results with the optimal features as explained in Section 5.3.
The experimental results show that our attack performs consistently better than classical membership inference on all datasets, target models, attack models, and metrics. Compared to classical membership inference, our attack achieves up to improvement of the AUC. The best DegCount and DegRate values are of and , respectively. This indicates that our attack indeed degrades membership privacy of the target sample in the machine unlearning setting. Comparing the performance of different target models, we observe that decision tree is the most vulnerable ML model.
Figure 4 illustrates the performance for the image datasets. We use a simple convolutional neural network (CNN), whose architecture is described in Appendix A, as the target model for both MNIST and CIFAR10 datasets. We further use ResNet50 for the CIFAR10 dataset. As for the attack model, we use four standard classifiers, the same as for the categorical datasets. The groups on the x-axis represent the attack models and the color bars (see the legend) represent the datasets and their corresponding target models. The experimental results show that CIFAR10 trained with ResNet50 is the most vulnerable case. The reason is that the overfitting level of CIFAR10 trained with ResNet50 is the largest. From Table 1, we observe that the overfitting level of CIFAR10 trained with ResNet50 is 0.260, while both MNIST and CIFAR10 trained with simple CNNs have an overfitting level smaller than 0.05.
5.3 Finding Optimal Features
Figure 5 illustrates the attack AUC of different feature construction methods. We compare two different types of target models: (a) the well-generalized model logistic regression (trained on the InstaNY dataset), and (b) the overfitted model ResNet50 (trained on the CIFAR10 dataset). Readers can refer to Table 1 for the overfitting values, and we will further discuss the impact of overfitting in Section 5.4. We then apply the different feature construction methods proposed in Section 3.3 to different attack models and evaluate every combination. For comparison, we also include the classical membership inference as a baseline.
Concatenation vs. Difference. Concatenation-based methods (direct concatenation and sorted concatenation) directly concatenate the two posteriors to preserve the full information, while difference-based methods capture the discrepancy between the two versions of posteriors. We use two approaches to capture this discrepancy: element-wise difference (direct difference and sorted difference) and Euclidean distance.
Overall, Figure 5 shows that, on the one hand, concatenation-based methods perform better on the overfitted model, i.e., ResNet50. On the other hand, the difference-based methods perform better on the well-generalized model, i.e., logistic regression.
The reason behind this is that the concatenation-based methods exploit similar information as classical membership inference, namely the plain posterior information. We can observe from Figure 5b that classical membership inference performs well on an overfitted target model, which is consistent with the conclusion of previous studies [54, 49]. Thus, our attack should also perform well on an overfitted target model by using a similar type of information, i.e., the plain posterior information. Our attack outperforms classical membership inference on the overfitted target model essentially because of the additional posterior it uses from the unlearned model. On the other hand, instead of using plain posterior information, the difference-based methods capture the discrepancy between two versions of the posterior due to the deletion of the target sample, i.e., the imprint of the target sample. Therefore, these methods can achieve a high AUC on the well-generalized model (regardless of the attack model), while the corresponding AUC of classical membership inference is close to 0.5, i.e., equivalent to random guessing.
Sorted vs. Unsorted. Comparing direct concatenation to sorted concatenation and direct difference to sorted difference in Figure 5, we observe that the attack AUC of both the concatenation-based method and the difference-based method is clearly better after sorting. These results confirm our conjecture that sorting could improve the confidence level of the adversary.
Feature Selection Summary. Our empirical comparison provides us with the following rules for the feature construction methods: (1) use concatenation-based methods on overfitted models; (2) use difference-based methods on well-generalized models; (3) sort posteriors before the concatenation and difference operations.
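Table 1: Attack AUC of our attack for different overfitting levels, with the AUC of classical membership inference in parentheses.

| Dataset | Target Model | Overfitting | AUC |
|---|---|---|---|
| Adult (inc) | LR | 0.013 | 0.600 (0.505) |
| Adult (inc) | DT | 0.017 | 0.882 (0.497) |
| Adult (inc) | RF | 0.009 | 0.544 (0.509) |
| Adult (occ) | LR | 0.016 | 0.507 (0.506) |
| Adult (occ) | DT | 0.017 | 0.903 (0.506) |
| Adult (occ) | RF | 0.043 | 0.611 (0.507) |
| Accident | LR | 0.022 | 0.538 (0.494) |
| Accident | DT | 0.025 | 0.929 (0.501) |
| Accident | RF | 0.026 | 0.572 (0.497) |
| InstaNY | LR | 0.096 | 0.983 (0.490) |
| InstaNY | DT | 0.024 | 0.941 (0.503) |
| InstaNY | RF | 0.081 | 0.685 (0.551) |
| MNIST | CNN | 0.018 | 0.511 (0.496) |
| CIFAR10 | CNN | 0.036 | 0.507 (0.502) |
| CIFAR10 | ResNet50 | 0.260 | 0.719 (0.548) |
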
5.4 Impact of Overfitting
Overfitting measures the accuracy gap between training and testing datasets. Previous studies [54, 63] have shown that overfitted models are more susceptible to classical membership inference attacks, while well-generalized models are almost immune to them. In this subsection, we revisit the impact of overfitting on our attack.
Table 1 depicts the attack AUC for different overfitting levels. We use random forest as the attack model, and use a difference-based and a concatenation-based feature construction method for well-generalized and overfitted target models, respectively. The experimental results show that our attack can still correctly infer the membership status of the target sample in well-generalized models. For example, when the target model is a decision tree, the overfitting level on the Adult (income) dataset is 0.017, thus the decision tree can be regarded as a well-generalized model. While the performance of classical membership inference on this model is equivalent to random guessing (AUC = 0.497), our attack performs very well, with an AUC of 0.882. In general, we observe that our attack performance is relatively independent of the overfitting level.
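The overfitting level used in this subsection, i.e., the gap between training and testing accuracy, can be computed as in the following minimal sketch. The function names and the example predictions are illustrative assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def overfitting_level(train_preds, train_labels, test_preds, test_labels):
    """Training accuracy minus testing accuracy."""
    return accuracy(train_preds, train_labels) - accuracy(test_preds, test_labels)

# Hypothetical predictions: perfect on the training set, 3 out of 4 on the test set.
gap = overfitting_level(
    train_preds=[1, 0, 1, 1], train_labels=[1, 0, 1, 1],
    test_preds=[1, 0, 0, 1], test_labels=[1, 1, 0, 1],
)
```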
5.5 Hyperparameter Sensitivity
We now evaluate the impact of different hyperparameters on the performance of our attack. Specifically, we want to know the impact of the number of shadow original models, the number of samples per shadow original model, and the number of unlearned models per shadow original model. The corresponding hyperparameters of the target models are fixed (as defined at the end of Section 5.1), since only the hyperparameters of the shadow models can be tuned to launch the attack.
We conduct the experiments on the Adult (income) dataset with decision tree as the target model. Following our findings in Section 5.3, we evaluate the attack AUC of different combinations of attack models, i.e., decision tree, random forest and logistic regression, and difference-based feature construction methods, i.e., direct difference, sorted difference, and Euclidean distance.
Number of Shadow Original Models . Figure a depicts the impact of , which varies from to . The figure shows that the attack AUC sharply increases when increases from to , but remains quite stable for greater values of . This indicates that setting is enough for the diversity of shadow original model.
Number of Samples per Shadow Original Model. Figure b illustrates the impact of . When increases from to , the attack AUC with increases from to , while the attack AUC with increases from to , except for logistic regression. However, adding more than 1000 samples does not help improve the attack performance further.
Number of Unlearned Models per Shadow Original Model. Figure c illustrates the impact of , which varies from to . The experimental result shows that has negligible impact on the attack AUC. This indicates that using a few unlearned models is sufficient to achieve a high attack performance.
5.6 Attack Robustness
In this subsection, we conduct experiments to validate the dataset and model transferability between the shadow model and the target model. That is, we evaluate whether the adversary can train the shadow models with a different dataset and a different model architecture than those of the target model.
ShadowTarget  NYNY  NYLA
DTDT  0.944 (0.491)  0.931 (0.503)
DTLR  0.964 (0.494)  0.974 (0.513)
LRLR  0.986 (0.505)  0.982 (0.511)
LRDT  0.927 (0.502)  0.926 (0.508)
We use the InstaNY and InstaLA datasets to perform the dataset transferring attack. As described in Section 5.1, these two datasets contain check-in data from different cities and thus have different distributions. We use the InstaNY dataset to train the shadow models and the InstaLA dataset to train the target model. We evaluate the dataset transferability for two target models (decision tree and logistic regression) with logistic regression as the attack model. The results are given in Table 2, whose row labels name the shadow model followed by the target model. We break the table into two parts: the upper two rows give the results when the shadow model is a decision tree, and the lower two rows give the results when it is logistic regression. Within each part, the lower row gives the results for model transfer, and the right column gives the results for dataset transfer.
Dataset Transferability. Comparing the AUC values of the transfer setting with those of the non-transfer setting, we observe only a small performance drop for all target models. For instance, when both the shadow model and the target model are decision trees (row DTDT of Table 2), the attack AUC of the transfer setting and the non-transfer setting are 0.931 and 0.944, respectively. The attack AUC drops by only 0.013.
Model Transferability. For the model transferring attack, we evaluate the pairwise transferability between decision tree and logistic regression. In Table 2, the model-transfer rows (DTLR and LRDT) of column NYNY show the performance of model transfer. The experimental results show that model transfer only slightly degrades the performance of our attack. For example, when the shadow model and the target model are both logistic regression, the attack AUC equals 0.986. When we change the target model to a decision tree, the attack AUC is still 0.927.
Dataset and Model Transferability. The model-transfer rows (DTLR and LRDT) of column NYLA in Table 2 show the attack AUC when we transfer both the dataset and the model simultaneously. Even in this setting, our attack still achieves a high AUC (0.974 and 0.926, respectively). This result shows that our attack is robust to differences in both dataset distribution and model architecture.
5.7 Evaluation of Group Deletion
So far, we have focused on the scenario where only one sample is deleted for each unlearned model. However, a group of samples may be deleted before the unlearned model is generated. This can happen when multiple data owners request the deletion of their data at the same time, or when the model owner, to save computational resources, caches the deletion requests and updates the model only after receiving a number of them.
In this subsection, we conduct experiments to show the performance of our attack in the group deletion scenario. We randomly delete a group of data samples from each original model to generate the unlearned model. We evaluate our attack on the InstaNY dataset with three metrics, four different target models, and four different attack models. For each attack model, we select the best features following the principles described in Section 5.3.
The experimental results in Figure 7 show that our attack is still effective, although its performance under group deletion is slightly worse than under single-sample deletion (see Figure 3). This is the case, for example, when the target model is logistic regression and the attack model is random forest. The reason is that a single sample can be hidden among the group of deleted samples, which helps preserve its membership information. This result suggests that group deletion can mitigate, to some extent, the impact of our attack.
5.8 Evaluation of the SISA Method
The unlearning algorithm we have focused on so far is retraining from scratch, which can become computationally prohibitive for large datasets and complex models. Several approximate unlearning algorithms have been proposed to accelerate the unlearning process. In this subsection, we evaluate the performance of our attack against the most general approximate unlearning algorithm, SISA [9].
We remind the readers that the main idea of SISA is to split the original dataset into disjoint shards and to train one submodel per shard. In the inference phase, the model owner aggregates the predictions of all submodels to produce the global prediction using some aggregation algorithm. In this experiment, we fix the number of shards and use posterior averaging as the aggregation algorithm.
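The sharding and aggregation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [9]: the toy nearest-centroid submodel, the class name `ShardedEnsemble`, the toy data, and the choice of two shards are all hypothetical, and the slicing optimization is omitted.

```python
import math
import random


def train_submodel(shard):
    """Fit a toy nearest-centroid model: one mean feature value per class."""
    sums, counts = {}, {}
    for x, y in shard:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def posterior(centroids, x, classes):
    """Softmax over negative squared distances to each class centroid."""
    scores = [-((x - centroids[c]) ** 2) if c in centroids else -math.inf
              for c in classes]
    top = max(scores)  # assumes the shard is non-empty, so top is finite
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ShardedEnsemble:
    """Hypothetical shard-and-aggregate model with per-shard retraining."""

    def __init__(self, data, num_shards, classes, seed=0):
        self.classes = classes
        shuffled = list(data)
        random.Random(seed).shuffle(shuffled)
        # Disjoint shards: the i-th shuffled sample goes to shard i mod num_shards.
        self.shards = [shuffled[i::num_shards] for i in range(num_shards)]
        self.submodels = [train_submodel(s) for s in self.shards]

    def predict_proba(self, x):
        """Posterior average over all submodels."""
        posts = [posterior(m, x, self.classes) for m in self.submodels]
        return [sum(col) / len(posts) for col in zip(*posts)]

    def unlearn(self, sample):
        """Drop `sample`, retrain only the shard that held it, return its index."""
        for i, shard in enumerate(self.shards):
            if sample in shard:
                shard.remove(sample)
                self.submodels[i] = train_submodel(shard)
                return i
        raise ValueError("sample not found in any shard")


# Toy one-dimensional data (hypothetical): (feature, label) pairs.
data = [(0.1, 0), (0.3, 0), (0.2, 0), (0.4, 0),
        (1.9, 1), (2.1, 1), (2.0, 1), (1.8, 1)]
model = ShardedEnsemble(data, num_shards=2, classes=[0, 1])
before = model.predict_proba(0.4)
snapshot = [dict(m) for m in model.submodels]
retrained = model.unlearn((0.4, 0))
after = model.predict_proba(0.4)
print(before, after, "retrained shard:", retrained)
```

Only one submodel changes after the deletion, and its contribution to the global posterior is averaged with the untouched submodels.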
Figure 8 illustrates the performance of our attack against SISA. The experiment is conducted on the InstaNY dataset with three metrics reported. We report the experimental results of four different target models and four different attack models. For each attack model, we select the best features following the principles described in Section 5.3.
The experimental results show that our attack performance drops compared to retraining from scratch. Now, only the LR target model suffers a significant privacy degradation due to unlearning. One possible reason is that the aggregation algorithm of SISA reduces the influence of any specific sample on the global model.
6 Possible Defenses
This paper takes the first step to investigate the privacy risks stemming from machine unlearning. Extensive experiments have demonstrated that publishing the unlearned model could degrade the membership privacy of a target whose data has been deleted. In this section, we present two possible defense mechanisms and empirically evaluate their effectiveness. The main idea of both mechanisms is to reduce the information accessible to the adversary [54].
Dataset (Target Model)  Attack Model  No defense  Top1  Top2  Top3  Label
Adult (occupation) (DT)  RF  0.916  0.899  0.906  0.911  0.501
Adult (occupation) (DT)  DT  0.918  0.903  0.906  0.910  0.506
Adult (occupation) (DT)  LR  0.918  0.904  0.907  0.911  0.506
Adult (occupation) (DT)  MLP  0.918  0.904  0.909  0.907  0.493
InstaNY (DT)  RF  0.937  0.930  0.931  0.942  0.506
InstaNY (DT)  DT  0.938  0.932  0.932  0.943  0.502
InstaNY (DT)  LR  0.928  0.923  0.927  0.926  0.502
InstaNY (DT)  MLP  0.928  0.923  0.927  0.929  0.505
InstaNY (LR)  RF  0.976  0.947  0.965  0.965  0.546
InstaNY (LR)  DT  0.972  0.946  0.961  0.961  0.546
InstaNY (LR)  LR  0.969  0.948  0.960  0.962  0.546
InstaNY (LR)  MLP  0.970  0.948  0.960  0.966  0.453
Publishing the Top Confidence Values. This defense reduces the adversary’s knowledge by publishing only the top k confidence values of the posteriors returned by both the original and the unlearned models. Formally, the posterior is a vector with one confidence value per class of the target model. When the target model receives a query, the model owner calculates the posterior, sorts it in descending order, and publishes only the first k values of the sorted vector.
In the machine unlearning setting, the top confidence values of the original model and the unlearned model may not correspond to the same set of classes. To launch our attack, the adversary constructs a pseudo-complete posterior vector for both the original model and the unlearned model. The pseudo-complete posterior takes the published confidence values for their corresponding classes and distributes the remaining confidence evenly over the other classes, i.e., each unpublished class receives one minus the sum of the published values, divided by the number of unpublished classes. The adversary can then launch our attack using the pseudo-complete posteriors.
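The construction of the pseudo-complete posterior described above can be sketched as follows. This is an illustrative sketch only: the function names `publish_top_k` and `pseudo_complete`, the dictionary representation of the published values, and the example posteriors are assumptions, not part of the paper's implementation.

```python
def publish_top_k(posterior, k):
    """Defense side: keep only the k largest confidence values, keyed by class index."""
    ranked = sorted(range(len(posterior)), key=lambda c: posterior[c], reverse=True)
    return {c: posterior[c] for c in ranked[:k]}


def pseudo_complete(published, num_classes):
    """Adversary side: rebuild a full-length posterior from the published values.

    Published classes keep their confidence; the remaining confidence mass is
    spread evenly over the classes that were not published.
    """
    missing = num_classes - len(published)
    remaining = max(0.0, 1.0 - sum(published.values()))
    fill = remaining / missing if missing else 0.0
    return [published.get(c, fill) for c in range(num_classes)]


# Hypothetical posteriors of the original and the unlearned model for one query.
original = [0.70, 0.20, 0.06, 0.04]
unlearned = [0.40, 0.15, 0.35, 0.10]

pc_original = pseudo_complete(publish_top_k(original, 2), 4)
pc_unlearned = pseudo_complete(publish_top_k(unlearned, 2), 4)
print(pc_original)
print(pc_unlearned)
```

In this example the two models publish different class sets (classes 0 and 1 versus classes 0 and 2), which is exactly the case the pseudo-complete posterior is meant to handle.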
Table 3 shows the experimental results of the Top1, Top2 and Top3 defenses. We conduct experiments on the InstaNY and Adult (occupation) datasets. For the Adult (occupation) dataset, we report the results with a decision tree as the target model; for the InstaNY dataset, we report the results with a decision tree and with logistic regression as the target model. For each dataset, we report the performance of different attack models, each selecting the best feature following the principle described in Section 5.3. The results show that publishing only the top confidence values cannot significantly mitigate our attack: in every setting of Table 3, the attack AUC stays within 0.03 of the no-defense value.
Publishing the Label Only. This defense further reduces the information accessible to the adversary by publishing only the predicted label instead of the confidence values (posteriors). To launch our attack, the adversary again needs to construct a pseudo-complete posterior for both the original model and the unlearned model. The main idea is to set the confidence value of the predicted class to 1 and the confidence values of all other classes to 0.
Table 3 illustrates the performance of the “label only” defense. The experimental setting is similar to that of the top confidence values defense. The experimental results show that the “label only” defense can effectively mitigate our attack in all cases: the attack AUC falls to between 0.453 and 0.546, i.e., close to random guessing. The reason is that deleting one sample is unlikely to change the output label of a specific target sample.
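The following sketch illustrates why the “label only” defense leaves the adversary with little signal. It is a toy example under stated assumptions: the function name, the example posteriors, and the use of an elementwise difference as a stand-in for the discrepancy between the two models' outputs are all hypothetical.

```python
def label_only_posterior(posterior):
    """One-hot pseudo-complete posterior: 1 for the predicted class, 0 elsewhere."""
    predicted = max(range(len(posterior)), key=lambda c: posterior[c])
    return [1.0 if c == predicted else 0.0 for c in range(len(posterior))]


# Hypothetical posteriors for the same query before and after unlearning.
original = [0.62, 0.30, 0.08]
unlearned = [0.55, 0.36, 0.09]

# With full posteriors, the two model versions visibly disagree.
diff_full = [o - u for o, u in zip(original, unlearned)]

# With labels only, both versions predict class 0, so the difference vanishes.
diff_label = [o - u for o, u in zip(label_only_posterior(original),
                                    label_only_posterior(unlearned))]
print(diff_full)
print(diff_label)
```

As long as the deletion does not flip the predicted label, the two one-hot vectors are identical and carry no information about the deleted sample.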
We leave the in-depth exploration of effective defense mechanisms against our attack as future work.
7 Related Work
Machine Unlearning. The notion of machine unlearning, first proposed in [10], is the application of the right to be forgotten in the machine learning context. The most legitimate approach to implement machine unlearning is to remove the revoked samples from the original training dataset and retrain the ML model from scratch. However, retraining from scratch incurs very high computational overhead when the dataset is large and when revocation requests happen frequently. Thus, most of the previous studies in machine unlearning focus on reducing the computational overhead of the unlearning process [10, 9, 50, 7, 27].
Cao and Yang [10] proposed to transform the learning algorithms into a summation form that follows statistical query learning, breaking down the dependencies on the training data. To remove a data sample, the model owner only needs to remove the transformations of this data sample from the summations that depend on it. However, the algorithm in [10] is not applicable to learning algorithms that cannot be transformed into summation form, such as neural networks. Thus, Bourtoule et al. [9] proposed a more general algorithm named SISA. The main idea of SISA is to split the training data into several disjoint shards, with each shard training one submodel. To remove a specific sample, the model owner only needs to retrain the submodel that contains this sample. To further speed up the unlearning process, the authors proposed to split each shard into several slices and to store the intermediate model parameters when the model is updated by each slice.
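To make the summation-form idea concrete, the sketch below uses a toy classifier whose parameters depend on the training data only through per-class sums and counts, so that a sample can be removed by subtracting its terms. The class `SummationCentroids` and the toy data are hypothetical and only illustrate the principle; they are not the algorithm of [10].

```python
class SummationCentroids:
    """Toy classifier whose parameters depend on the data only through sums."""

    def __init__(self):
        self.count = {}  # number of samples seen per class
        self.total = {}  # sum of feature values per class

    def learn(self, x, y):
        self.count[y] = self.count.get(y, 0) + 1
        self.total[y] = self.total.get(y, 0.0) + x

    def forget(self, x, y):
        """Unlearn one sample by subtracting its terms from the summations."""
        if self.count.get(y, 0) == 0:
            raise KeyError(f"no stored samples for class {y!r}")
        self.count[y] -= 1
        self.total[y] -= x
        if self.count[y] == 0:
            del self.count[y], self.total[y]

    def centroids(self):
        return {y: self.total[y] / self.count[y] for y in self.count}


# Toy (feature, label) pairs; the values are hypothetical.
data = [(1.0, 0), (2.0, 0), (6.0, 1), (8.0, 1), (3.0, 0)]
target = (2.0, 0)

# Unlearning via the summations.
unlearned = SummationCentroids()
for x, y in data:
    unlearned.learn(x, y)
unlearned.forget(*target)

# Reference: retraining from scratch without the target sample.
scratch = SummationCentroids()
for x, y in data:
    if (x, y) != target:
        scratch.learn(x, y)

print(unlearned.centroids(), scratch.centroids())
```

Both routes yield the same parameters here, but the summation route touches only two running totals instead of the whole dataset.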
Another line of machine unlearning study aims to verify whether the model owner complies with the data deletion request. Sommer et al. [55] proposed a backdoor-based method. The main idea is to allow the data owners to implant a backdoor in their data before the ML model is trained in the MLaaS setting. When the data owners later request the deletion of their data, they can verify whether their data have been deleted by checking the backdoor success rate.
The research problem in this paper is orthogonal to previous studies. Our goal is to quantify the unintended privacy risks for deleted samples in machine learning systems when the adversary has access to both original model and unlearned model. To the best of our knowledge, this paper is the first to investigate this problem.
Although quantifying privacy risks of machine unlearning has not been investigated yet, there are multiple studies on quantifying the privacy risks in the general right to be forgotten setting. For example, Xue et al. [62] demonstrate that in search engine applications, the right to be forgotten can enable an adversary to discover deleted URLs when there are inconsistent regulation standards in different regions. Ellers et al. [13] demonstrate that, in network embeddings, the right to be forgotten enables an adversary to recover the deleted nodes by leveraging the difference between the two versions of the network embeddings.
Membership Inference. Membership inference attacks have been extensively studied in many different data domains, ranging from biomedical data [26, 5, 22] to mobility traces [46]. Shokri et al. [54] presented the first membership inference attack against ML models. The main idea is to use shadow models to mimic the target model’s behavior and thereby generate training data for the attack model. Salem et al. [49] gradually removed the assumptions of [54] by proposing three different attack methods. Since then, membership inference has been extensively investigated in various ML models and tasks, such as federated learning [37], white-box classification [39], generative adversarial networks [23, 11], text-generation models [56], and computer vision segmentation [25]. Another line of study focused on investigating the impact of overfitting [63, 31] and of the number of classes of the target model [53] on the attack performance. However, all of the previous studies focus on the classical ML setting where the adversary only has access to a single snapshot of the target model. This is the first work studying membership inference in the machine unlearning context.
To mitigate the threat of membership inference, a plethora of defense mechanisms have been proposed. These defenses can be classified into three classes: reducing overfitting, perturbing posteriors, and adversarial training. There are several ways to reduce overfitting in the ML field, such as regularization [54], dropout [49], and model stacking [49]. In [32], the authors proposed to explicitly reduce overfitting by adding to the training loss function a regularization term, which is defined as the difference between the output distributions of the training set and the validation set. Jia et al. [30] proposed a posterior perturbation method inspired by adversarial examples. Nasr et al. [38] proposed an adversarial training defense to train a secure target classifier. During the training of the target model, a defender’s attack model is trained simultaneously to launch the membership inference attack. The optimization objective of the target model is to reduce the prediction loss while minimizing the membership inference attack accuracy.
Attacks against Machine Learning. Besides membership inference attacks, there exist numerous other types of attacks against ML models [44, 58, 19, 43, 57, 45, 40, 60, 64, 16, 21, 51, 29, 36, 61, 34, 31, 47, 52, 33, 12, 48, 28]. Ganju et al. [16] proposed a property inference attack that aims to infer general properties of the training data (such as the proportion of each class in the training data). Model inversion attacks [15, 14] focus on inferring the missing attributes of the target ML model. A major attack type in this space is adversarial examples [44, 43, 57, 45, 64]. In this setting, an adversary adds carefully crafted noise to samples in order to mislead the target classifier. A similar type of attack is the backdoor attack, where the adversary, acting as a model trainer, embeds a trigger into the model for her to exploit when the model is deployed [19, 36, 61, 48]. Another line of work is model stealing: Tramèr et al. [58] proposed the first attack on inferring a model’s parameters. Other works focus on inferring a model’s hyperparameters [40, 60].
8 Conclusion
This paper takes the first step to investigate the unintended privacy risks in machine unlearning through the lens of membership inference. We propose several feature construction methods to summarize the discrepancy between the posteriors returned by the original model and the unlearned model. Extensive experiments on five different real-world datasets show that, in multiple cases, our attack outperforms the classical membership inference attack on the target sample, especially on well-generalized models. We further present two mechanisms that mitigate the newly discovered privacy risks by reducing the information accessible to the adversary. We hope that these results will help improve privacy in practical implementations of machine unlearning.
References
 [1] Note: https://www.cs.toronto.edu/~kriz/cifar.html Cited by: 5th item.
 [2] Note: http://yann.lecun.com/exdb/mnist/ Cited by: 4th item.
 [3] Note: https://archive.ics.uci.edu/ml/datasets/adult Cited by: 1st item.
 [4] Note: https://www.kaggle.com/sobhanmoosavi/usaccidents Cited by: 2nd item.
 [5] (2016) Membership Privacy in MicroRNA-based Studies. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 319–330. Cited by: §7.
 [6] (2017) walk2friends: Inferring Social Links from Mobility Profiles. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1943–1957. Cited by: 3rd item, 1st item.

 [7] (2020) Machine Unlearning: Linear Filtration for Logit-based Classifier. Note: CoRR abs/2002.02730 Cited by: §1, §2.2, §7.
 [8] (2019) Five Years of the Right to be Forgotten. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 959–972. Cited by: §1.
 [9] (2019) Machine Unlearning. Note: CoRR abs/1912.03817 Cited by: §1, §2.2, §5.8, §7, §7.
 [10] (2015) Towards Making Systems Forget with Machine Unlearning. In IEEE Symposium on Security and Privacy (S&P), pp. 463–480. Cited by: §1, §2.2, §7, §7.
 [11] (2019) GAN-Leaks: A Taxonomy of Membership Inference Attacks against GANs. Note: CoRR abs/1909.03935 Cited by: §7.
 [12] (2020) On Training Robust PDF Malware Classifiers. In USENIX Security Symposium (USENIX Security), Cited by: §7.
 [13] (2019) Privacy Attacks on Network Embeddings. Note: CoRR abs/1912.10979 Cited by: §7.
 [14] (2015) Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1322–1333. Cited by: §7.
 [15] (2014) Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing. In USENIX Security Symposium (USENIX Security), pp. 17–32. Cited by: 1st item, §7.
 [16] (2018) Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 619–633. Cited by: §3.3, §7.
 [17] (2019) Making AI Forget You: Data Deletion in Machine Learning. In Annual Conference on Neural Information Processing Systems (NIPS), pp. 3513–3526. Cited by: §1.
 [18] (2020) Forgetting Outside the Box: Scrubbing Deep Networks of Information Accessible from Input-Output Observations. Note: CoRR abs/2003.02960 Cited by: §1.
 [19] (2017) Badnets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. Note: CoRR abs/1708.06733 Cited by: §7.
 [20] (2020) Certified Data Removal from Machine Learning Models. Note: CoRR abs/1911.03030 Cited by: §1.

 [21] (2018) LEMNA: Explaining Deep Learning based Security Applications. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 364–379. Cited by: §7.
 [22] (2019) MBeacon: Privacy-Preserving Beacons for DNA Methylation Data. In Network and Distributed System Security Symposium (NDSS), Cited by: 1st item, §7.
 [23] (2019) LOGAN: Evaluating Privacy Leakage of Generative Models Using Generative Adversarial Networks. Symposium on Privacy Enhancing Technologies Symposium. Cited by: §7.

 [24] (2016) Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §5.1.
 [25] (2019) Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation. Note: CoRR abs/1912.09685 Cited by: §7.
 [26] (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLOS Genetics. Cited by: §7.
 [27] (2020) Approximate Data Deletion from Machine Learning Models: Algorithms and Evaluations. Note: CoRR abs/2002.10077 Cited by: §1, §2.2, §7.
 [28] (2020) High Accuracy and High Fidelity Extraction of Neural Networks. In USENIX Security Symposium (USENIX Security), Cited by: §7.
 [29] (2018) Model-Reuse Attacks on Deep Learning Systems. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 349–363. Cited by: §7.
 [30] (2019) MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 259–274. Cited by: 1st item, §7.
 [31] (2019) Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference. Note: CoRR abs/1906.11798 Cited by: §7, §7.

 [32] (2020) Membership Inference Attacks and Defenses in Supervised Learning via Generalization Gap. Note: CoRR abs/2002.12062 Cited by: §7.
 [33] (2019) How to Prove Your Model Belongs to You: A Blind-Watermark based Framework to Protect Intellectual Property of DNN. In Annual Computer Security Applications Conference (ACSAC), Cited by: §7.
 [34] (2019) DEEPSEC: A Uniform Platform for Security Analysis of Deep Learning Model. In IEEE Symposium on Security and Privacy (S&P), pp. 673–690. Cited by: §7.
 [35] (2020) Learn to Forget: User-Level Memorization Elimination in Federated Learning. Note: CoRR abs/2003.10933 Cited by: §1.
 [36] (2019) Trojaning Attack on Neural Networks. In Network and Distributed System Security Symposium (NDSS), Cited by: §7.
 [37] (2019) Exploiting Unintended Feature Leakage in Collaborative Learning. In IEEE Symposium on Security and Privacy (S&P), Cited by: §7.
 [38] (2018) Machine Learning with Membership Privacy using Adversarial Regularization. In ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §7.
 [39] (2019) Comprehensive Privacy Analysis of Deep Learning: Passive and Active Whitebox Inference Attacks against Centralized and Federated Learning. In IEEE Symposium on Security and Privacy (S&P), Cited by: §2.3, §7.
 [40] (2018) Towards Reverse-Engineering Black-Box Neural Networks. In International Conference on Learning Representations (ICLR), Cited by: §7.
 [41] (2017) DeepCity: A Feature Learning Framework for Mining Location Check-Ins. In International Conference on Weblogs and Social Media (ICWSM), pp. 652–655. Cited by: 1st item.
 [42] (2017) Quantifying Location Sociality. In ACM Conference on Hypertext and Social Media (HT), pp. 145–154. Cited by: 1st item.
 [43] (2017) Practical Black-Box Attacks Against Machine Learning. In ACM Asia Conference on Computer and Communications Security (ASIACCS), pp. 506–519. Cited by: §7.
 [44] (2016) The Limitations of Deep Learning in Adversarial Settings. In IEEE European Symposium on Security and Privacy (Euro S&P), pp. 372–387. Cited by: §7.
 [45] (2018) SoK: Towards the Science of Security and Privacy in Machine Learning. In IEEE European Symposium on Security and Privacy (Euro S&P), Cited by: §7.
 [46] (2018) Knock Knock, Who’s There? Membership Inference on Aggregate Location Data. In Network and Distributed System Security Symposium (NDSS), Cited by: 1st item, §7.
 [47] (2019) Misleading Authorship Attribution of Source Code using Adversarial Learning. In USENIX Security Symposium (USENIX Security), pp. 479–496. Cited by: §7.
 [48] (2020) Dynamic Backdoor Attacks Against Machine Learning Models. Note: CoRR abs/2003.03675 Cited by: §7.
 [49] (2019) ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Network and Distributed System Security Symposium (NDSS), Cited by: §1.1, §2.3, §2.3, 1st item, §5.3, §7, §7.
 [50] (2020) “Amnesia”  A Selection of Machine Learning Models That Can Forget User Data Very Fast. In Annual Conference on Innovative Data Systems Research (CIDR), pp. 8364–44035–46992. Cited by: §1, §2.2, §7.
 [51] (2018) Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS), pp. 6103–6113. Cited by: §7.
 [52] (2019) Neutaint: Efficient Dynamic Taint Analysis with Neural Networks. In IEEE Symposium on Security and Privacy (S&P), Cited by: §7.

 [53] (2020) Exploiting Transparency Measures for Membership Inference: a Cautionary Tale. In The AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI), Cited by: §7.
 [54] (2017) Membership Inference Attacks Against Machine Learning Models. In IEEE Symposium on Security and Privacy (S&P), pp. 3–18. Cited by: §1.1, §2.3, §2.3, §3.1, §5.3, §5.4, §6, §7, §7.
 [55] (2020) Towards Probabilistic Verification of Machine Unlearning. Note: CoRR abs/2003.04247 Cited by: §1, §7.
 [56] (2019) Auditing Data Provenance in Text-Generation Models. In ACM Conference on Knowledge Discovery and Data Mining (KDD), pp. 196–206. Cited by: §7.
 [57] (2017) Ensemble Adversarial Training: Attacks and Defenses. In International Conference on Learning Representations (ICLR), Cited by: §7.
 [58] (2016) Stealing Machine Learning Models via Prediction APIs. In USENIX Security Symposium (USENIX Security), pp. 601–618. Cited by: §7.
 [59] (2018) Humans Forget, Machines Remember: Artificial Intelligence and the Right to Be Forgotten. Computer Law & Security Review. Cited by: §1.
 [60] (2018) Stealing Hyperparameters in Machine Learning. In IEEE Symposium on Security and Privacy (S&P), Cited by: §7.
 [61] (2019) Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In IEEE Symposium on Security and Privacy (S&P), pp. 707–723. Cited by: §7.
 [62] (2016) The Right to be Forgotten in the Media: A Data-Driven Study. Symposium on Privacy Enhancing Technologies Symposium. Cited by: §1, §7.
 [63] (2018) Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In IEEE Computer Security Foundations Symposium (CSF), Cited by: §5.4, §7.
 [64] (2018) Tagvisor: A Privacy Advisor for Sharing Hashtags. In The Web Conference (WWW), pp. 287–296. Cited by: §7.
 [65] (2020) Towards Plausible Graph Anonymization. In Network and Distributed System Security Symposium (NDSS), Cited by: 1st item.
 [66] (2019) Language in Our Time: An Empirical Analysis of Hashtags. In The Web Conference (WWW), pp. 2378–2389. Cited by: 1st item.
Appendix A Architecture of Simple CNN
Layer  Parameters 

Conv2D_1  (, 32, =3, 1) 
Relu   
Conv2D_2  (32, , , 1) 
Maxpooling2D  =2 
Dropout_1  (0.25) 
Flatten  1 
Linear_1  (, 128) 
Relu   
Dropout_2  0.5 
Linear_2  (128, #classes) 
Softmax  dim=1 
The kernel sizes of the Conv2D layers and the Maxpooling layer are equal to 3 and 2, respectively.
Appendix B Model Parameter Settings
We use multiple ML models in our experiments. All models are implemented with sklearn version 0.22. For reproducibility, we list their parameter settings as follows:

Logistic Regression: We use LBFGS as solver and L2 penalty for regularization, and set other parameters as default.

Decision Tree: We use Gini index as criterion, set parameter max_leaf_nodes as 10, and set other parameters as default.

Random Forest: We use Gini index as criterion, use estimators, set min_samples_leaf=30, and set other parameters as default.

Multilayer Perceptron: We use SGD as solver and ReLU as activation function. The hidden layer size is , the learning rate is , the regularizer is .
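For reference, the settings listed above can be written against the scikit-learn estimator API roughly as follows. This is a sketch, not the authors' code: the dictionary `models` and its keys are hypothetical, and every hyperparameter whose value is not stated above (number of estimators, hidden layer size, learning rate, regularization strength) is left at the library default, which is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry of target/attack models keyed by the abbreviations
# used in the tables. Unstated hyperparameters stay at library defaults.
models = {
    # LBFGS solver; the default penalty of LogisticRegression in sklearn 0.22 is L2.
    "LR": LogisticRegression(solver="lbfgs"),
    # Gini criterion with at most 10 leaf nodes.
    "DT": DecisionTreeClassifier(criterion="gini", max_leaf_nodes=10),
    # Gini criterion with at least 30 samples per leaf.
    "RF": RandomForestClassifier(criterion="gini", min_samples_leaf=30),
    # SGD solver with ReLU activation.
    "MLP": MLPClassifier(solver="sgd", activation="relu"),
}

for name, estimator in models.items():
    print(name, type(estimator).__name__)
```

Each estimator can then be trained with its `fit(X, y)` method and queried for posteriors with `predict_proba(X)`.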