Machine Unlearning of Features and Labels

by Alexander Warnecke et al.

Removing information from a machine learning model is a non-trivial task that requires partially reverting the training process. This task is unavoidable when sensitive data, such as credit card numbers or passwords, accidentally enter the model and need to be removed afterwards. Recently, different concepts for machine unlearning have been proposed to address this problem. While these approaches are effective in removing individual data points, they do not scale to scenarios where larger groups of features and labels need to be reverted. In this paper, we propose a method for unlearning features and labels. Our approach builds on the concept of influence functions and realizes unlearning through closed-form updates of model parameters. It allows the influence of training data on a learning model to be adapted retrospectively, thereby correcting data leaks and privacy issues. For learning models with strongly convex loss functions, our method provides certified unlearning with theoretical guarantees. For models with non-convex losses, we empirically show that unlearning features and labels is effective and significantly faster than other strategies.








1 Introduction

Machine learning has become a ubiquitous tool in analyzing personal data and developing data-driven services. Unfortunately, the underlying learning models can pose a serious threat to privacy if they inadvertently reveal sensitive information from the training data. For example, Carlini et al. [12] show that the Google text completion system contains credit card and social security numbers from personal emails, which may be exposed to users during the autocompletion of text. Once such sensitive data has entered a learning model, however, its removal is non-trivial and requires selectively reverting the learning process. In the absence of specific methods for this task in the past, retraining from scratch has been the only resort, which is costly and only possible if the original training data is still available.

As a remedy, Cao & Yang [11] and Bourtoule et al. [8] propose methods for machine unlearning. These methods decompose the learning process and are capable of removing individual data points from a learning model in retrospection. As a result, they make it possible to eliminate isolated privacy issues, such as data points associated with individuals. However, information leaks may not only manifest in single data instances but also in groups of features and labels. A leaked address of a celebrity might be shared in hundreds of social media posts, affecting large parts of the training data. Similarly, relevant features in a bag-of-words model may be associated with sensitive names and data, contaminating the entire feature space.

Unfortunately, instance-based unlearning as proposed in previous work is inefficient in these cases: First, a runtime improvement can hardly be obtained over retraining as the leaks are not isolated and larger parts of the training data need to be removed. Second, omitting several data points will inevitably reduce the fidelity of the corrected learning model. It becomes clear that the task of unlearning is not necessarily confined to removing data points, but may also require corrections on the orthogonal layers of features and labels, regardless of the amount of affected training data.

In this paper, we propose a method for unlearning features and labels. Our approach is inspired by the concept of influence functions, a technique from robust statistics [31] that allows for estimating the influence of data on learning models [33, 34]. By reformulating this influence estimation as a form of unlearning, we derive a versatile approach that maps changes of the training data in retrospection to closed-form updates of the model parameters. These updates can be calculated efficiently, even if larger parts of the training data are affected, and enable the removal of features and labels. As a result, our method can correct privacy leaks in a wide range of learning models with convex and non-convex loss functions.

For models with strongly convex loss, such as logistic regression and support vector machines, we prove that our approach enables certified unlearning. That is, it provides theoretical guarantees on the removal of features and labels from the models. To obtain these guarantees, we extend the concept of certified data removal [28] and show that the difference between models obtained with our approach and retraining from scratch becomes arbitrarily small. Consequently, we can define an upper bound on this difference and thereby realize provable unlearning in practice.

For models with non-convex loss functions, such as deep neural networks, similar theoretical guarantees do not hold in general. However, we empirically demonstrate that our approach provides substantial advantages over prior work. Our method is significantly faster than sharding [24, 8] and retraining, while removing the data and preserving a similar level of accuracy. Moreover, due to the compact updates, our approach requires only a fraction of the training data and hence is applicable when the original data is not entirely available. We show the efficacy of our approach in case studies on removing privacy leaks in spam classification and unintended memorization in natural language processing.


In summary, we make the following major contributions:

  1. Unlearning with closed-form updates. We introduce a novel framework for unlearning of features and labels. This framework builds on closed-form updates of learning models and thus is significantly faster than instance-based approaches to unlearning.

  2. Certified unlearning. We derive two unlearning strategies for our framework based on first-order and second-order gradient updates. Under convexity and continuity assumptions on the loss, we show that both strategies can provide certified unlearning.

  3. Empirical analysis. We empirically show that unlearning of sensitive information is possible even for deep neural networks with non-convex loss functions. We find that our first-order update is extremely efficient, enabling a speed-up over retraining by up to three orders of magnitude.

The rest of the paper is structured as follows: We review related work on machine unlearning and influence functions in Section 2. Our approach and its technical realization are introduced in Sections 3 and 4, respectively. The theoretical analysis of our approach is presented in Section 5 and its empirical evaluation in Section 6. Finally, we discuss limitations in Section 7 and conclude the paper in Section 8.

2 Related Work

The increasing application of machine learning to personal data has started a series of research on detecting and correcting privacy issues in learning models [e.g., 53, 13, 12, 45, 36, 47]. In the following, we provide an overview of work on machine unlearning and influence functions. A broader discussion of privacy and machine learning is given by De Cristofaro [21] and Papernot et al. [41].

Machine unlearning

Methods for unlearning sensitive data are a recent branch of security research. Earlier, the efficient removal of samples was also called decremental learning [14] and used to speed up cross-validation for various linear classifiers [16, 17, 15]. Cao & Yang [11] show that a large number of learning models can be represented in a closed summation form that allows for elegantly removing individual data points in retrospection. However, for adaptive learning strategies, such as stochastic gradient descent, this approach provides little advantage over retraining from scratch and thus is not well suited for correcting problems in neural networks.

As a remedy, Bourtoule et al. [8] propose a universal strategy for unlearning data points from classification models. Similarly, Ginart et al. [24] develop a technique for unlearning points in clustering. The key idea of both approaches is to split the data into independent partitions, so-called shards, and aggregate the final model from submodels trained over these shards. In this setting, the unlearning of data points can be efficiently carried out by retraining only the affected submodels. Aldaghri et al. [3] show that this approach can be further sped up for least-squares regression by choosing the shards cleverly. Unlearning based on shards, however, is suitable for removing a few data points only and inevitably deteriorates in performance when larger portions of the data require changes.

This limitation of sharding is schematically illustrated in Fig. 1. The probability that all shards need to be retrained increases with the number of data points to be corrected. For a practical setup with shards, as proposed by Bourtoule et al. [8], changes to as few as points are already sufficient to impact all shards and render this form of unlearning inefficient, regardless of the size of the training data. We provide a detailed analysis of this limitation in Stochastic Sharding Analysis. Consequently, privacy leaks involving hundreds or thousands of data points cannot be addressed with these approaches.
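To make the sharding bottleneck concrete, the probability that every shard contains at least one affected point can be computed by inclusion-exclusion. The sketch below assumes that removed points fall into shards uniformly at random; the shard and point counts are illustrative, not the paper's exact setup:

```python
from math import comb

def prob_all_shards_affected(n_points: int, n_shards: int) -> float:
    """Probability that n_points uniformly assigned points touch all
    n_shards, via inclusion-exclusion over the shards left empty."""
    s = n_shards
    return sum((-1) ** k * comb(s, k) * ((s - k) / s) ** n_points
               for k in range(s + 1))

# The probability rises quickly with the number of affected points, so
# unlearning soon requires retraining every shard.
for n in (20, 60, 120):
    print(n, round(prob_all_shards_affected(n, 20), 3))
```

Once this probability approaches one, sharded unlearning degenerates into full retraining.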

Figure 1: Probability of all shards being affected during unlearning, for varying numbers of data points and shards.

Influence functions

The concept of influence functions that forms the basis of our approach originates from robust statistics [31] and was first used by Cook & Weisberg [20] for investigating the changes of simple linear regression models. Although the proposed techniques have been occasionally employed in machine learning [35, 32], the seminal work of Koh & Liang [33] recently brought general attention to this concept and its application to modern learning techniques. In particular, this work uses influence functions for explaining the impact of data points on the predictions of learning models.

Influence functions have then been used to trace bias in word embeddings back to documents [10, 19], determine reliable regions in learning models [46], and explain deep neural networks [7]. Moreover, Basu et al. [6] increase the accuracy of influence functions by using high-order approximations, Barshan et al. [5] improve the precision of influence calculations through nearest-neighbor strategies, and Guo et al. [29] show that the runtime can be decreased when only specific samples are considered. Golatkar et al. [26, 25] use influence functions for sample removal in deep neural networks by proposing special approximations of the learning model.

In terms of theoretical analysis, Koh et al. [34] study the accuracy of influence functions when estimating the loss on test data and Neel et al. [40] perform a similar analysis for gradient based update strategies. Rad & Maleki [44] further show that the prediction error on leave-one-out validations can be reduced with influence functions. Finally, Guo et al. [28] introduce the idea of certified removal for data points that we extend in our approach.

All of these approaches, however, remain on the level of data instances. To our knowledge, we are the first to build on the concept of influence functions for unlearning features and labels from learning models.

3 Unlearning with Updates

Let us start by considering a supervised learning task that is described by a dataset D = {z_1, …, z_n}, with each object z_i = (x_i, y_i) consisting of a data point x_i and a label y_i. We assume that the input space is a vector space and denote the j-th feature (dimension) of x by x[j]. Given a loss function ℓ(z; θ) that measures the difference between the predictions of a learning model θ and the true labels, the optimal model θ* can be found by minimizing the regularized empirical risk,

θ* = argmin_θ L(θ; D),   L(θ; D) = Σ_i ℓ(z_i; θ) + λ Ω(θ),   (1)

where Ω is a regularizer and L(θ; D) describes the loss on the entire dataset. In this setup, the process of unlearning amounts to adapting θ* to changes in D without recalculating the optimization problem in Eq. 1.
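The risk minimization above can be illustrated with a small sketch; logistic loss, an L2 regularizer, and plain gradient descent are our own concrete choices for illustration, not prescribed by the method:

```python
import numpy as np

def logistic_loss(theta, X, y, lam):
    """Regularized empirical risk in the sense of Eq. 1: a sum of
    per-sample losses l(z_i; theta) plus lambda * Omega(theta)."""
    margins = y * (X @ theta)
    return np.log1p(np.exp(-margins)).sum() + lam * theta @ theta

def fit(X, y, lam, lr=0.005, steps=2000):
    """Minimize the regularized risk with plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -(y / (1 + np.exp(y * (X @ theta)))) @ X + 2 * lam * theta
        theta -= lr * grad
    return theta

# Toy data with labels in {-1, +1} (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200))
theta_star = fit(X, y, lam=1.0)
```

The resulting theta_star plays the role of the optimal model that unlearning later adapts without re-running this optimization.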

3.1 Unlearning Data Points

To provide an intuition for our approach, we begin by asking the following question: How would the optimal learning model θ* change if only one data point z had been perturbed to some z̃? Replacing z by z̃ leads to the new optimal set of model parameters

θ*_{z→z̃} = argmin_θ L(θ; D) + ℓ(z̃; θ) − ℓ(z; θ).   (2)

However, calculating the new model exactly is expensive. Instead of replacing the data point z with z̃, we can also up-weight z̃ by a small value ε and down-weight z accordingly, resulting in the following optimization problem:

θ*_{ε,z→z̃} = argmin_θ L(θ; D) + ε ℓ(z̃; θ) − ε ℓ(z; θ).   (3)

Eqs. 2 and 3 are equivalent for ε = 1 and solve the same problem. As a result, we do not need to explicitly remove a data point from the training data but can revert its influence on the learning model through a combination of appropriate up-weighting and down-weighting.
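The equivalence of explicit replacement and re-weighting with ε = 1 can be checked numerically. The toy below uses a simple squared loss (our own choice for illustration), whose optimal parameter is the data mean:

```python
import numpy as np

# Toy loss: l(z; theta) = (theta - z)^2, so the optimum is the mean.
D = np.array([1.0, 2.0, 3.0, 4.0])
z, z_tilde = D[0], 10.0          # perturb the first point

# (a) Explicit replacement: retrain on the modified dataset.
D_tilde = D.copy()
D_tilde[0] = z_tilde
theta_replaced = D_tilde.mean()

# (b) Up-weight z_tilde and down-weight z with epsilon = 1:
#     minimize sum_i (theta - z_i)^2 + (theta - z_tilde)^2 - (theta - z)^2
thetas = np.linspace(-5, 15, 200001)
objective = ((thetas[:, None] - D[None, :]) ** 2).sum(axis=1) \
            + (thetas - z_tilde) ** 2 - (thetas - z) ** 2
theta_weighted = thetas[np.argmin(objective)]
```

Both minimizers coincide, confirming that the weighted objective reverts the influence of z without touching the dataset itself.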

It is easy to see that this approach is not restricted to a single data point. We can simply define a set of data points Z and its perturbed version Z̃, and arrive at the weighted optimization problem

θ*_{ε,Z→Z̃} = argmin_θ L(θ; D) + ε Σ_{z̃ ∈ Z̃} ℓ(z̃; θ) − ε Σ_{z ∈ Z} ℓ(z; θ).   (4)

This generalization enables us to approximate changes on larger portions of the training data. Instead of solving the problem in Eq. 4, however, we formulate this optimization as an update of the original model θ*. That is, we seek a closed-form update Δ(Z, Z̃) of the model parameters, such that

θ*_{Z→Z̃} ≈ θ* + Δ(Z, Z̃),   (5)

where Δ(Z, Z̃) has the same dimension as the learning model θ* but is sparse if only a few parameters are affected.

As a result of this formulation, we can describe changes of the training data as a compact update Δ rather than iteratively solving an optimization problem. We show in Section 4 that this update step can be efficiently computed using first-order and second-order gradients. Furthermore, we prove in Section 5 that the unlearning success of both updates can be certified up to a tolerance if the loss function ℓ is strictly convex, twice differentiable, and Lipschitz-continuous.

3.2 Unlearning Features and Labels

Equipped with a general method for updating a learning model, we proceed to introduce our approach for unlearning features and labels. To this end, we expand our notion of perturbations and allow a perturbation to modify the features of a data point, its label, or both. By using different changes in the perturbations Z̃, we can now realize different types of unlearning through closed-form updates.

Replacing features

As the first type of unlearning, we consider the task of correcting features in a learning model. This task is relevant if the content of some features violates the privacy of a user and needs to be replaced with alternative values. As an example, personal names, identification numbers, residence addresses, or other sensitive data might need to be removed after a model has been trained on a corpus of emails.

For a set of features F and their new values V, we define a perturbation on each affected point z = (x, y) by replacing the features in F, that is, z̃ = (x with x[f] set to v_f for each f ∈ F, y). For example, a credit card number contained in the training data can be blinded by a random number sequence in this setting. The new values can be adapted individually for each data point, such that fine-grained corrections become possible.
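Constructing such feature-replacement perturbations is mechanical; a minimal sketch, in which the data matrix and the blinded feature are entirely hypothetical:

```python
import numpy as np

def replace_features(Z, feature_idx, new_values):
    """Build perturbed points z~ by overwriting the affected features F
    with their new values V; the labels stay untouched."""
    Z_tilde = Z.copy()
    Z_tilde[:, feature_idx] = new_values
    return Z_tilde

# Hypothetical example: feature 2 holds a leaked card-number fragment and
# is blinded with a random digit per affected point.
rng = np.random.default_rng(0)
Z = np.array([[0.1, 0.0, 4532.0],
              [0.3, 1.0, 4532.0]])
Z_tilde = replace_features(Z, feature_idx=[2],
                           new_values=rng.integers(0, 10, size=(2, 1)))
```

The pair (Z, Z_tilde) is exactly the input the closed-form updates of Section 4 expect.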

Replacing labels

As the second type of unlearning, we focus on correcting labels. This form of unlearning is necessary if the labels captured in a model contain unwanted information. For example, in generative language models, the training text is used as input features (preceding characters)

and labels (target characters) [27, 48]. Hence, defects can only be eliminated if the labels are unlearned as well.

For the affected points Z and a set of new labels, we define the corresponding perturbations by z̃ = (x, y′), where x corresponds to the data points in Z without their original labels and y′ is the new label. The new labels can be individually selected for each data point, as long as they come from the considered label domain. Note that replaced labels and features can easily be combined in one set of perturbations Z̃, so that defects affecting both can be corrected in a single update. In Section 6.2, we demonstrate that this combination can be used to remove unintended memorization from generative language models with high efficiency.

Revoking features

Based on appropriate definitions of Z and Z̃, our approach makes it possible to replace the content of features and thus eliminate privacy leaks. However, in some scenarios it might be necessary to completely remove features from a learning model, a task that we denote as revocation. In contrast to the correction of features, this form of unlearning poses a unique challenge: The revocation of features reduces the input dimension of the learning model. While this adjustment can be easily carried out through retraining with adapted data, constructing a model update as in Eq. 5 is tricky.

To address this problem, let us consider a model θ* trained on a dataset D. If we remove the features F from this dataset and train the model again, we obtain a new optimal model with reduced input dimension. By contrast, if we set the values of the features F to zero in the dataset and train again, we obtain an optimal model with the same input dimension as θ*. Fortunately, these two models are equivalent for a large class of learning models, including support vector machines and several neural networks, as the following lemma shows.

Lemma 1.

For learning models processing inputs x via linear transformations of the form θᵀx, training with the features F removed and training with the features F set to zero yield equal optimal models.


It is easy to see that it is irrelevant for the dot product θᵀx whether a dimension of x is missing or equals zero in the linear transformation. As a result, the loss of both models is identical for every data point z. Hence, the empirical risk is also equal for both models, and thus the same objective is minimized during learning, resulting in equal parameters. ∎
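The lemma can be verified numerically for a concrete linear model. The sketch below uses closed-form ridge regression as a stand-in (our own choice; any model applying θᵀx behaves the same way): zeroing a feature column yields a zero weight for that feature, and the remaining weights equal those of a model retrained without the column:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression, a linear model of the form theta^T x."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

X_zeroed = X.copy()
X_zeroed[:, 2] = 0.0                                     # revoke feature 2 by zeroing
theta_zeroed = ridge(X_zeroed, y, lam)
theta_reduced = ridge(np.delete(X, 2, axis=1), y, lam)   # retrain without the column
```

The zeroed model keeps the original input dimension, so it can be reached by a closed-form update; dropping its dead dimension afterwards recovers the reduced model.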

Lemma 1 enables us to erase features from many learning models by first setting them to zero, calculating the parameter update, and then reducing the dimension of the models accordingly. Concretely, to revoke the features F, we locate the affected data points Z, that is, the points where at least one feature in F is non-zero, and construct corresponding perturbations Z̃ in which these features are set to zero by unlearning.

Revoking labels

The previous strategy allows revoking features from several learning models. It is crucial if, for example, a bag-of-words model has captured sensitive data in relevant features and therefore a reduction of the input dimension during unlearning is unavoidable. Unfortunately, a similar strategy for the revocation of labels is not available for our method, as we are not aware of a general shortcut, such as Lemma 1. Still, if the learning model contains explicit output dimensions for the class labels, as with some neural network architectures, it is possible to first replace unwanted labels and then manually remove the corresponding dimensions.

4 Update Steps for Unlearning

Our approach rests on changing the influence of training data with a closed-form update of the model parameters, as shown in Eq. 5. In the following, we derive two strategies for calculating this closed form: a first-order update and a second-order update. The first strategy builds on the gradient of the loss function and thus can be applied to any model with a differentiable loss. The second strategy also incorporates second-order derivatives, which limits its application to loss functions with an invertible Hessian matrix.

4.1 First-Order Update

Recall that we aim to find an update Δ(Z, Z̃) that we add to our model θ*. If the loss ℓ is differentiable, we can compute the optimal first-order update as

Δ(Z, Z̃) = −τ ( Σ_{z̃ ∈ Z̃} ∇_θ ℓ(z̃; θ*) − Σ_{z ∈ Z} ∇_θ ℓ(z; θ*) ),   (6)

where τ is a small constant that we refer to as the unlearning rate. A complete derivation of Eq. 6 is given in First-order update. Intuitively, this update shifts the model parameters in the direction from Z to Z̃, where the size of the update step is determined by the rate τ. This update strategy is related to the classic gradient descent update GD used in many learning algorithms. However, it differs from that step in that it moves the model along the difference between the gradients of the perturbed and the original data, which minimizes the loss on Z̃ and at the same time removes the information contained in Z.
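A minimal sketch of this first-order step for a logistic regression trained with gradient descent; the dataset, the regularization strength, and the unlearning rate are illustrative choices of this sketch:

```python
import numpy as np

def grad_sum(theta, Z, Y):
    """Sum of per-sample logistic-loss gradients over the points in Z."""
    return -(Y / (1 + np.exp(Y * (Z @ theta)))) @ Z

def first_order_update(theta, Z_old, Y_old, Z_new, Y_new, tau):
    """Eq. 6 style step: subtract tau times the difference between the
    gradients on the perturbed points and on the original points."""
    return theta - tau * (grad_sum(theta, Z_new, Y_new)
                          - grad_sum(theta, Z_old, Y_old))

# Train a toy model (hyperparameters are arbitrary illustrative choices).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sign(X @ np.array([2.0, -1.0, 0.0, 1.0]) + 0.1 * rng.normal(size=300))
lam = 1.0
theta = np.zeros(4)
for _ in range(2000):
    theta -= 0.005 * (grad_sum(theta, X, y) + 2 * lam * theta)

# Unlearn: blind feature 0 in the first 50 training points.
Z_old = X[:50]
Z_new = Z_old.copy()
Z_new[:, 0] = 0.0
theta_new = first_order_update(theta, Z_old, y[:50], Z_new, y[:50], tau=0.005)
```

Because the regularizer terms cancel in the gradient difference, a single such step already lowers the loss on the corrected dataset.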

The first-order update is a simple yet effective strategy: Gradients of the loss can be computed in O(p) time for a model with p parameters [43], and modern auto-differentiation frameworks like TensorFlow and PyTorch [42] offer easy gradient computations for the practitioner. The update step involves the parameter τ, which controls the impact of the unlearning step. To ensure that data has been completely replaced, it is necessary to calibrate this parameter using a measure for the success of unlearning. In Section 6, for instance, we show how the exposure metric by Carlini et al. [12] can be used for this calibration.

4.2 Second-Order Update

The calibration of the update step can be eliminated if we make further assumptions on the properties of the loss ℓ. If we assume that ℓ is twice differentiable and strictly convex, the influence of a single data point can be approximated in closed form [20] by

∂θ*_{ε,z→z̃} / ∂ε ≈ −H⁻¹_{θ*} ( ∇_θ ℓ(z̃; θ*) − ∇_θ ℓ(z; θ*) ),

where H⁻¹_{θ*} is the inverse Hessian of the loss at θ*, that is, the inverse matrix of the second-order partial derivatives. We can now perform a linear approximation of θ*_{z→z̃} to obtain

θ*_{z→z̃} ≈ θ* − H⁻¹_{θ*} ( ∇_θ ℓ(z̃; θ*) − ∇_θ ℓ(z; θ*) ).   (7)

Since all operations are linear, we can easily extend Eq. 7 to account for multiple data points and derive the following second-order update:

Δ(Z, Z̃) = −H⁻¹_{θ*} ( Σ_{z̃ ∈ Z̃} ∇_θ ℓ(z̃; θ*) − Σ_{z ∈ Z} ∇_θ ℓ(z; θ*) ).   (8)

A full derivation of this update step is provided in Second-order update. Note that the update does not require any parameter calibration, since the weighting of the changes is directly derived from the inverse Hessian of the loss function.

The second-order update is the preferred strategy for unlearning on models with a strongly convex and twice differentiable loss function, such as logistic regression, which guarantees the existence of the inverse Hessian. Technically, the update step in Eq. 8 can be easily calculated with common machine-learning frameworks. In contrast to the first-order update, however, this computation involves the inverse Hessian matrix, which is non-trivial to obtain for neural networks, for example.
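For such a small strongly convex model, the second-order update can be formed explicitly. The sketch below (toy data and hyperparameters of our own choosing) compares the updated parameters against retraining from scratch:

```python
import numpy as np

def grad_sum(theta, Z, Y):
    """Sum of per-sample logistic-loss gradients over the points in Z."""
    return -(Y / (1 + np.exp(Y * (Z @ theta)))) @ Z

def hessian(theta, X, lam):
    """Hessian of the regularized logistic loss at theta."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (X * (p * (1 - p))[:, None]).T @ X + 2 * lam * np.eye(X.shape[1])

def second_order_update(theta, X, lam, Z_old, Y_old, Z_new, Y_new):
    """Eq. 8 style step: premultiply the gradient difference with the
    inverse Hessian of the loss (solved, never inverted explicitly)."""
    diff = grad_sum(theta, Z_new, Y_new) - grad_sum(theta, Z_old, Y_old)
    return theta - np.linalg.solve(hessian(theta, X, lam), diff)

def fit(X, y, lam, lr=0.005, steps=4000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * (grad_sum(theta, X, y) + 2 * lam * theta)
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 1.5]) + 0.1 * rng.normal(size=200))
lam = 1.0
theta = fit(X, y, lam)

X_corr = X.copy()
X_corr[:40, 1] = 0.0                      # blind feature 1 in 40 points
theta_updated = second_order_update(theta, X, lam, X[:40], y[:40],
                                    X_corr[:40], y[:40])
theta_retrained = fit(X_corr, y, lam)     # reference: retraining from scratch
```

A single second-order step lands markedly closer to the retrained model than the original parameters are, without re-running the optimization.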

Computing the inverse Hessian

Given a model with p parameters, forming and inverting the Hessian requires O(p³) time and O(p²) space [33]. For models with a small number of parameters, the inverse matrix can be pre-computed and explicitly stored, such that each subsequent request for unlearning only involves a simple matrix-vector multiplication. For example, in Section 6.1, we show that unlearning features from a logistic regression model in this manner can be realized in less than a second.

For complex learning models, such as deep neural networks, the Hessian matrix quickly becomes too large for explicit storage. Moreover, these models typically do not have convex loss functions, such that the matrix may also be non-invertible, rendering an exact update impossible. Nevertheless, we can approximate the inverse Hessian using techniques proposed by Koh & Liang [33]. While this approximation weakens the theoretical guarantees of the unlearning process, it enables applying second-order updates to a variety of complex learning models, similar to the first-order strategy.

To apply second-order updates in practice, we have to avoid storing the Hessian H explicitly while still being able to compute the product H⁻¹v. To this end, we rely on the scheme proposed by Agarwal et al. [2] for computing expressions of the form H⁻¹v that only require calculating Hv and never need to store H. Hessian-vector products (HVPs) allow us to calculate Hv efficiently by making use of the linearity of the gradient.

Denoting the first j terms of the Taylor expansion of H⁻¹ by H⁻¹_j, we have H⁻¹_j = Σ_{i=0}^{j} (I − H)^i and can recursively define the approximation H⁻¹_j = I + (I − H) H⁻¹_{j−1}. If λ_i < 1 holds for all eigenvalues λ_i of H, we have H⁻¹_j → H⁻¹ for j → ∞. To ensure this convergence, we add a small damping term to the diagonal of H and scale down the loss function (and thereby the eigenvalues) by some constant, which does not change the optimal parameters θ*. Under these assumptions, we can formulate the following algorithm for computing an approximation of H⁻¹v: Given data points sampled from D, we define iterative updates in which, at step i, the Hessian is estimated from a single data point, and we can use HVPs to evaluate the product efficiently in O(p) time, as demonstrated by Pearlmutter [43]. Using batches of data points instead of single ones and averaging the results further speeds up the approximation. Choosing j large enough so that the updates converge and averaging several runs to reduce the variance of the results, we obtain the final estimate of H⁻¹v. In Section 6.2 we demonstrate that this strategy can be used to calculate the second-order updates for a deep neural network with millions of parameters.
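The Neumann-series recursion for H⁻¹v can be sketched as follows; for clarity this version applies the exact Hessian-vector product on the full data, whereas Agarwal et al. [2] substitute mini-batch estimates and average several runs (logistic loss and all constants are our own illustrative choices):

```python
import numpy as np

def hvp(theta, X, lam, v):
    """Hessian-vector product of the regularized logistic loss, computed
    without forming the Hessian: Hv = X^T diag(p(1-p)) X v + 2*lam*v."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ ((p * (1 - p)) * (X @ v)) + 2 * lam * v

def inverse_hvp(theta, X, lam, v, scale=100.0, iters=500):
    """Approximate H^{-1} v with the truncated Neumann series
    h_j = v + (I - H/scale) h_{j-1}, which converges when all eigenvalues
    of H/scale lie in (0, 1); `scale` plays the role of the loss scaling
    described in the text."""
    h = v.copy()
    for _ in range(iters):
        h = v + h - hvp(theta, X, lam, h) / scale
    return h / scale  # undo the loss scaling

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
theta = 0.1 * rng.normal(size=4)
v = rng.normal(size=4)
h = inverse_hvp(theta, X, lam=0.1, v=v)
```

Only HVPs are evaluated inside the loop, so the memory footprint stays linear in the number of parameters.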

5 Certified Unlearning

Machine unlearning is a delicate task, as it aims at reliably removing privacy issues and sensitive data from learning models. This task should ideally build on theoretical guarantees to enable certified unlearning, where the corrected model is stochastically indistinguishable from one created by retraining. In the following, we derive conditions under which the updates of our approach introduced in Section 4 provide certified unlearning. To this end, we build on the concepts of differential privacy [22, 18] and certified data removal [28], and adapt them to the unlearning problem.

Let us first briefly recall the idea of differential privacy in machine learning: For a training dataset D, let A be a learning algorithm that outputs a model θ from a model space Θ after training on D, that is, A(D) = θ. Randomness in A induces a probability distribution over the output models in Θ. The key idea of differential privacy is to measure the difference between a model trained on D and one trained on a neighboring dataset that differs in a single data point z.

Definition 1.

Given some ε > 0, a learning algorithm A is said to be ε-differentially private (ε-DP) if

e^{−ε} ≤ P(A(D) ∈ T) / P(A(D′) ∈ T) ≤ e^{ε}

holds for all measurable subsets T ⊆ Θ and all datasets D and D′ that differ in one data point.

Thus, for an ε-DP learning algorithm, the difference between the log-likelihood of a model trained on a dataset D and one trained on a neighboring dataset D′ is smaller than ε for all possible models, datasets, and data points. Based on this definition, we can introduce the concept of ε-certified unlearning. In particular, we consider an unlearning method U that maps a model A(D) to a corrected model U(A(D), D, D̃), where D̃ denotes the dataset containing the perturbations required for the unlearning task.

Definition 2.

Given some ε > 0 and a learning algorithm A, an unlearning method U is ε-certified if

e^{−ε} ≤ P(U(A(D), D, D̃) ∈ T) / P(A(D̃) ∈ T) ≤ e^{ε}

holds for all measurable subsets T ⊆ Θ and all datasets D and D̃.

This definition ensures that the probability of obtaining a model using the unlearning method deviates from that of training a new model on D̃ from scratch by at most ε. Similar to certified data removal [28], we introduce (ε, δ)-certified unlearning, a relaxed version of ε-certified unlearning, defined as follows.

Definition 3.

Under the assumptions of Definition 2, an unlearning method U is (ε, δ)-certified if

P(U(A(D), D, D̃) ∈ T) ≤ e^{ε} P(A(D̃) ∈ T) + δ   and   P(A(D̃) ∈ T) ≤ e^{ε} P(U(A(D), D, D̃) ∈ T) + δ

hold for all measurable subsets T ⊆ Θ and all datasets D and D̃.

That is, (ε, δ)-certified unlearning allows the method to slightly violate the conditions of Definition 2 by a constant δ. Using the above definitions, it becomes possible to derive conditions under which certified unlearning is achievable for both of our approximate update strategies.

5.1 Certified Unlearning of Features and Labels

Based on the concept of certified unlearning, we analyze our approach and its theoretical guarantees on removing features and labels. To ease this analysis, we make two assumptions on the employed learning algorithm: First, we assume that the loss function ℓ is twice differentiable and strictly convex, such that the inverse Hessian always exists. Second, we employ an L2 regularizer in the optimization problem (1), which ensures that the loss function is strongly convex.

A powerful concept for analyzing unlearning is the gradient residual ∇L(θ; D̃) for a given model θ and a corrected dataset D̃. For strongly convex loss functions, the gradient residual is zero if and only if θ equals the model obtained by retraining on D̃, since in this case the optimum is unique. Therefore, the norm of the gradient residual reflects the distance of a model from one obtained by retraining on the corrected dataset D̃. While a small value of this norm is not sufficient to judge the quality of unlearning, we can develop upper bounds on it to prove properties related to differential privacy [18, 28]. Consequently, we derive bounds for the gradient residual norms of our two update strategies. The corresponding proofs are given in Proofs for Certified Unlearning.
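As a sketch, the gradient residual norm is cheap to evaluate and can serve as a practical unlearning metric; the logistic loss and the toy data below are our own choices:

```python
import numpy as np

def gradient_residual_norm(theta, X, y, lam):
    """Norm of the gradient of the (corrected) loss at theta. For a
    strongly convex loss it is zero iff theta equals the unique optimum
    obtained by retraining on (X, y)."""
    g = -(y / (1 + np.exp(y * (X @ theta)))) @ X + 2 * lam * theta
    return np.linalg.norm(g)

def fit(X, y, lam, lr=0.005, steps=4000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * (-(y / (1 + np.exp(y * (X @ theta)))) @ X
                       + 2 * lam * theta)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sign(X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=200))
lam = 1.0

theta = fit(X, y, lam)                 # model trained on the original data
X_corr = X.copy()
X_corr[:30, 0] = 0.0                   # corrected dataset
theta_ref = fit(X_corr, y, lam)        # retrained reference model
```

The residual of the stale model on the corrected data is clearly non-zero, while the retrained reference drives it to (numerically) zero; any unlearning update should shrink this norm.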

Theorem 1.

If all perturbations lie within a bounded radius and the loss ℓ is Lipschitz-continuous with respect to the data and the model parameters, the following upper bounds hold:

  1. If the unlearning rate τ is chosen sufficiently small, the gradient residual norm of the first-order update of our approach is bounded.

  2. If, in addition, the second derivative of ℓ is Lipschitz-continuous, the gradient residual norm of the second-order update of our approach is bounded.

This theorem enables us to bound the gradient residual norm of both update steps. We leverage these bounds to reduce the difference between unlearning and retraining from scratch. In particular, we follow the approach of Chaudhuri et al. [18] and add a random linear term to the loss function to shape the distribution of the model parameters. Given a vector b drawn from a random distribution, we define the perturbed loss

L_b(θ; D) = L(θ; D) + bᵀθ,

with a corresponding gradient residual given by

∇L_b(θ; D̃) = ∇L(θ; D̃) + b.

By definition, the gradient residual of L_b differs only by the added vector b from the residual of the original loss L, which allows us to precisely determine its influence on the bounds of Theorem 1, depending on the underlying distribution of b.

Let θ_b be an exact minimizer of L_b on the corrected dataset D̃ and θ_u an approximate minimum obtained through unlearning. Guo et al. [28] show that the max-divergence between the distributions of these two models can be bounded using the following theorem.

Theorem 2 (Guo et al. [28]).

Let U be an unlearning method with a gradient residual r satisfying ‖r‖ ≤ ε′. If the noise vector b is drawn from a probability distribution with density p such that ‖b₁ − b₂‖ ≤ ε′ implies p(b₁) ≤ e^{ε} p(b₂) for all b₁, b₂, then the max-divergence between the distribution of models retrained from scratch and the distribution of models produced by the unlearning method U is bounded by ε.

Theorem 2 equips us with a way to prove the certified unlearning property from Definition 2. Using the gradient residual bounds derived in Theorem 1, we can adjust the density function of b in such a way that Theorem 2 applies to both update strategies, following the approach presented by Chaudhuri et al. [18] for differentially private learning.

Theorem 3.

Let A be the learning algorithm that returns the unique minimum of the perturbed loss L_b and let U be an unlearning method whose gradient residual r satisfies ‖r‖ ≤ ε′ for some ε′ > 0. Then we have the following guarantees.

  1. If b is drawn from a distribution with density p(b) ∝ e^{−(ε/ε′)‖b‖}, then the method U performs ε-certified unlearning.

  2. If b is drawn from a Gaussian distribution with standard deviation σ = c ε′/ε for some c > 0, then the method U performs (ε, δ)-certified unlearning with δ = 1.5 e^{−c²/2}.

Theorem 3 allows us to establish certified unlearning of features and labels in practice: Given a learning model trained with a noise term b in its loss, our approach is certified if the gradient residual norm, which can be bounded using Theorem 1, remains smaller than a constant depending on ε, δ, and the parameters of the distribution of b.
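As an illustration of such a calibration, a Gaussian-mechanism-style rule maps a gradient residual bound and a privacy budget to a noise scale. The exact constant below follows the form used for certified data removal by Guo et al. [28] and should be treated as an assumption of this sketch:

```python
import math

def gaussian_sigma(residual_bound, eps, delta):
    """Noise scale for the perturbation vector b so that a gradient
    residual of norm at most `residual_bound` yields (eps, delta)-style
    guarantees. Gaussian-mechanism calibration in the spirit of Guo et
    al. [28]; the constant 1.5 is an assumption of this sketch."""
    return (residual_bound / eps) * math.sqrt(2.0 * math.log(1.5 / delta))

# Hypothetical budget: a tighter residual bound permits less noise.
sigma = gaussian_sigma(residual_bound=0.1, eps=1.0, delta=1e-4)
```

The rule captures the qualitative trade-off stated above: a larger residual bound or a smaller ε forces a larger noise scale.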

Dataset | Model | Points | Features | Parameters | Classes | Replacement | Certified
Enron | LR | | | | | |
Alice | LSTM | | | | | |

Table 1: Overview of the considered datasets and models for unlearning scenarios.

6 Empirical Analysis

We proceed with an empirical analysis of our approach and its capabilities. For this analysis, we examine the efficacy of unlearning in practical scenarios and compare our method to other strategies for removing data from learning models, such as retraining and fine-tuning. As part of these experiments, we employ models with convex and non-convex loss functions to understand how this property affects the success of unlearning. Overall, our goal is to investigate the strengths and potential limitations of our approach when unlearning features and labels in practice and examine the theoretical bounds derived in Section 5.1.

Unlearning scenarios

Our empirical analysis is based on the following two scenarios in which sensitive information must be removed from a learning model. The scenarios involve common privacy and security issues in machine learning, with each scenario focusing on a different issue, learning task, and model. Table 1 provides an overview of these scenarios for which we present more details on the experimental setup in the following sections.

Scenario 1: Sensitive features. Our first scenario deals with machine learning for spam filtering. Content-based spam filters are typically constructed using a bag-of-words model [4, 51]. These models are extracted directly from the email content, so that sensitive words and personal names in the emails unavoidably become features of the learning model. These features pose a severe privacy risk when the spam filter is shared, for example in an enterprise environment, as they can reveal the identities of individuals in the training data similar to a membership inference attack [45]. We evaluate unlearning as a means to remove these features (Section 6.1).

Scenario 2: Unintended memorization. In the second scenario, we consider the problem of unintended memorization [12]. Generative language models based on recurrent neural networks are a powerful tool for completing and generating text. However, these models can memorize sequences that appear rarely in the training data, including credit card numbers or private messages. This memorization poses a privacy problem: Through specifically crafted input sequences, an attacker can extract this sensitive data from the models during text completion [13, 12]. We apply unlearning of features and labels to remove identified leaks from language models (Section 6.2).

Performance measures

Unlike other problems in machine learning, the performance of unlearning does not depend on a single numerical measure. For example, one method may only partially remove data from a learning model, whereas another may be successful but degrades the prediction performance of the model. Consequently, we identify three factors that contribute to effective unlearning and provide performance measures for our empirical analysis.

1. Efficacy of unlearning. The most important factor for successful unlearning is the removal of data. While certified unlearning, as presented in Section 5, theoretically ensures this removal, we cannot provide similar guarantees for learning models with non-convex loss functions. As a result, we need to employ measures that quantitatively assess the efficacy of unlearning. In particular, we use the exposure metric [12] to measure the memorization strength of specific sequences in language generation models after unlearning.

2. Fidelity of unlearning. The second factor contributing to the success of unlearning is the performance of the corrected model. An unlearning method is of practical use only if it preserves the capabilities of the learning model as much as possible. Hence, we consider the fidelity of the corrected model as a performance measure. In our experiments, we use the accuracy of the original model and the corrected model on a hold-out set as a measure for the fidelity.

3. Efficiency of unlearning. If the training data used to generate a model is still available, a simple but effective unlearning strategy is retraining from scratch. This strategy, however, involves significant runtime and storage costs. Therefore, we also consider the efficiency of unlearning as a relevant factor. In our experiments, we measure the runtime and the number of gradient calculations for each unlearning method and relate them to retraining, since gradient computations dominate the cost of both our update strategies and modern optimization algorithms.

Baseline methods

To compare our approach with related strategies for data removal, we employ different baseline methods as reference for examining the efficacy, fidelity, and efficiency of unlearning.

Retraining. As the first baseline method, we employ retraining from scratch. This method is applicable if the original training data is available and guarantees proper removal of data. The unlearning method by Bourtoule et al. [8] does not provide advantages over this baseline when too many shards are affected by data changes. As shown in Section 2 and detailed in Stochastic Sharding Analysis, this effect already occurs for relatively small sets of data points, and thus we do not explicitly consider sharding in our empirical analysis.


Fine-tuning. As a second method for comparison, we make use of naive fine-tuning. Instead of starting over, this strategy simply continues to train a model using the corrected data. This is especially helpful for neural networks, where the new optimum is close to the original parameters, so that many early optimization steps can be saved. In particular, we implement this fine-tuning by performing stochastic gradient descent over the training data for one epoch. This naive unlearning strategy serves as a middle ground between costly retraining and specialized methods, such as our approach.
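A minimal sketch of this baseline for a least-squares loss; the loss function and learning rate here are illustrative choices (the experiments use the models' own training losses), but the one-epoch SGD loop is the same idea.

```python
import numpy as np

def fine_tune_one_epoch(theta, X, y, lr=0.1):
    """Naive fine-tuning baseline: one epoch of stochastic gradient
    descent on the corrected data, starting from the already trained
    parameters. Sketch for a least-squares loss 0.5*(x.theta - y)^2."""
    for xi, yi in zip(X, y):
        grad = (xi @ theta - yi) * xi  # per-sample gradient
        theta = theta - lr * grad
    return theta
```

Starting from the trained parameters rather than from scratch is what makes this cheaper than retraining, at the price of no removal guarantee.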

Occlusion. For linear classifiers, there exists a one-to-one mapping between features and weights. In this case, one can naively unlearn features by simply replacing them with zero when they occur or, equivalently, setting the corresponding weights to zero. This method ignores the shift in the data distribution incurred by the missing features but is very efficient as it requires no training or update steps. Although easy to implement, occlusion can lead to problems if the removed features have a significant impact on the model.
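For a linear model with one weight per feature, occlusion reduces to zeroing entries of the weight vector; a minimal sketch:

```python
import numpy as np

def occlude_features(weights, feature_indices):
    """Occlusion baseline for a linear classifier: unlearn features by
    zeroing their weights, which is equivalent to zeroing the features
    at prediction time. Ignores the induced distribution shift."""
    corrected = weights.copy()
    corrected[list(feature_indices)] = 0.0
    return corrected
```

For example, `occlude_features(np.array([0.5, -1.2, 0.3]), [1])` removes the influence of the second feature while leaving all other weights untouched.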

6.1 Unlearning Sensitive Names

In our first unlearning scenario, we remove sensitive features from a content-based spam filter. As a basis for this filter, we use the Enron dataset [39], which includes  emails labeled as spam or non-spam. We divide the dataset into a training and test partition with a ratio of and respectively. To create a feature space for learning, we extract the words contained in each email using whitespace delimiters and obtain a bag-of-words model with  features weighted by the term-frequency inverse-document-frequency (TF-IDF) metric. We normalize the feature vectors such that and learn a logistic regression classifier on the training set to use it for spam filtering. Logistic regression is commonly used for similar learning tasks and employs a strictly convex and twice differentiable loss function. This ensures that a single optimal parameter exists, which can be obtained via retraining and used as an optimal baseline. Moreover, the Hessian matrix has a closed form and can be stored in memory, allowing an exact second-order update for evaluation.
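The training setup can be sketched with scikit-learn. The toy corpus below merely stands in for the Enron emails, and the exact vectorizer settings are assumptions; the pipeline of whitespace tokenization, TF-IDF weighting, L2 normalization, and logistic regression follows the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the Enron emails (the real dataset is
# loaded separately); labels: 1 = spam, 0 = non-spam.
emails = ["win free money now", "meeting agenda attached",
          "free prize claim now", "project report attached"]
labels = [1, 0, 1, 0]

# Whitespace tokenization, TF-IDF weighting, and L2-normalized
# feature vectors, as in the spam-filter setup.
vectorizer = TfidfVectorizer(token_pattern=r"\S+", norm="l2")
X = vectorizer.fit_transform(emails)

# Logistic regression: strictly convex, twice differentiable loss,
# so a unique optimum exists for a fixed regularization strength.
clf = LogisticRegression(C=1.0).fit(X, labels)
print(clf.predict(vectorizer.transform(["free money prize"])))
```

Because each TF-IDF dimension corresponds to one word, sensitive words map directly to individual model weights, which is what the unlearning task exploits.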

Sensitive features

To gain insights into relevant features of the classifier, we employ a simple gradient-based explanation method [50]. While we observe several reasonable words with high weights in the model, we also discover features that contain sensitive information. For example, we identify several features corresponding to first and last names of email recipients. Using a list of common names, we can find about  surnames and forenames present in the entire dataset. Similarly, we find features corresponding to phone numbers and zip codes related to the company Enron.

Although these features may not appear to be a significant privacy violation at first glance, they lead to multiple problems: First, if the spam filter is shared as part of a network service, the model may reveal the identity of individuals in the training data. Second, these features likely represent artifacts and thus bias spam filtering for specific individuals, for example, those having similar names or postal zip codes. Third, if the features are relevant for the non-spam class, an adversary might craft inputs that evade the classifier. Consequently, there is a need to resolve this issue and correct the learning model.

Unlearning task

We address the problem using feature unlearning, that is, we apply our approach and the baseline methods to revoke the identified features from the classification model. Technically, we benefit from the convex loss function of the logistic regression, which allows us to apply certified unlearning as presented in Section 5. Specifically, it is easy to see that Theorem 3 holds since the gradients of the logistic regression loss are bounded and are thus Lipschitz-continuous. For a detailed discussion on the Lipschitz constants, we refer the reader to the paper by Chaudhuri et al. [18].
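The closed-form parameter corrections applied here can be illustrated with a minimal numpy sketch. The function names, the explicit step size `tau`, and the use of a dense Hessian solved directly are simplifying assumptions for illustration (the paper's actual derivation lives in Section 5); the key idea is that both updates move the parameters against the change in gradient induced by replacing the affected points.

```python
import numpy as np

def first_order_update(theta, grad_old, grad_new, tau):
    """First-order unlearning sketch: move the parameters opposite
    the gradient change induced by replacing the affected points."""
    return theta - tau * (grad_new - grad_old)

def second_order_update(theta, grad_old, grad_new, hessian):
    """Second-order unlearning sketch: precondition the gradient
    change with the inverse Hessian of the full training loss."""
    return theta - np.linalg.solve(hessian, grad_new - grad_old)
```

On a quadratic loss with identity Hessian, the second-order update lands exactly on the new optimum in one step, which is why it achieves the smallest gradient residuals in the evaluation below.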

Efficacy evaluation

Theorem 3 equips us with a certified unlearning strategy via a privacy budget that must not be exceeded by the gradient residual norm of the parameter update. Concretely, for given parameters and noise on the weights sampled from a Gaussian distribution with a given variance, the gradient residual must remain smaller than the resulting bound.

Table 2 shows the effect of the regularization strength and the variance on the classification performance on the test dataset. As expected, the privacy budget is clearly affected by the noise variance: a large variance degrades the classification performance, whereas the impact of the regularization is small.

Table 2: The spam filter’s accuracy for varying parameters  and .
Figure 2: Distribution of the gradient residual norm when removing random combinations of a given number of names. Lower residuals reflect a better approximation and allow smaller values of .

To evaluate the gradient residual norm further, we set both and to and remove random combinations of the most important names from the dataset using our approaches. The distribution of the gradient residual norm after unlearning is presented in Fig. 2. We can observe that the residual rises with the number of names to be removed, since more data is affected by the update steps. The second-order update step achieves extremely small gradient residuals with small variance, while both the first-order update and naive feature removal produce higher residuals with more outliers. Since naive retraining always produces gradient residuals of zero, the second-order approach is best suited for unlearning in this scenario. Notice that by Equation (9), the bound for the gradient residual depends linearly on the parameter for a given and . Therefore, the second-order update also allows smaller values of for a given feature combination to unlearn or, vice versa, allows many more features to be unlearned for a given .

Fidelity evaluation

To evaluate the fidelity in a broad sense, we use the approach of Koh & Liang [33] and first compare the loss on the test data after the unlearning task. Fig. 3 shows the difference in test loss between retraining and unlearning when randomly removing combinations of size (left) and (right) features from the dataset. Both the first-order and second-order methods approximate the retraining very well (Pearson's ), even if many features are removed. Simply setting the affected weights to zero, however, cannot adapt to the distribution shift and leads to larger deviations on the test set.

Figure 3: Difference in test loss between retraining and unlearning when removing random combinations of size 16 (left) and 32 (right). After a perfect unlearning the results lie on the identity line.

In addition to the loss, we also evaluate the accuracy of the spam filter for the different unlearning methods on the test data in Table 3. Since the accuracy is less sensitive to small model changes, we restrict ourselves to the most important features and choose a small regularization strength of such that single features can become important. The number of affected samples in this experiment rises quickly from ( of the dataset) for four features to () when deleting features. The large fraction of affected training data stresses that instance-based methods are not suited to repair these privacy leaks, as they shrink the available data noticeably.

As a first observation, we note that removing the sensitive features leads to a slight drop in performance for all methods, especially when more features are removed. On a small scale, in turn, the second-order method provides the best results and is closest to a model trained from scratch. This evaluation shows that single features can have a significant impact on the classification performance and that unlearning can be necessary if the application of the model requires a high level of accuracy.

Removed features
Original model
Unlearning (1st)
Unlearning (2nd)
Table 3: Test accuracy of the corrected spam filter for varying number of removed features (in  ).

Efficiency evaluation

We use the previous experiment to measure the efficiency when deleting features and present the results in Table 4. In particular, we use the L-BFGS algorithm [37] for optimization of the logistic regression loss. Due to the linear structure of the learning model and the convexity of the loss, the runtime of all methods is remarkably low. We find that the first-order update is significantly faster than the other approaches. This difference in performance results from the underlying optimization problem: While the other approaches operate on the entire dataset, the first-order update considers only the corrected points and thus enables a speedup factor of . For the second-order update, the majority of the runtime and gradient computations is spent on computing and inverting the Hessian matrix. If, however, the number of parameters is small, it is possible to pre-compute and store the inverse Hessian on the training data, such that the second-order update comes down to a matrix-vector multiplication and becomes faster than retraining.
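The pre-computation trick can be illustrated on synthetic data; the dimension and variable names are arbitrary, and the sketch assumes the Hessian fits in memory, as stated above for small parameter counts.

```python
import numpy as np

# When the parameter dimension d is small, the inverse Hessian of the
# training loss can be precomputed once; each second-order unlearning
# update then reduces to a matrix-vector product.
rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
hessian = A @ A.T + np.eye(d)          # symmetric positive definite
H_inv = np.linalg.inv(hessian)          # precomputed once, offline

theta = rng.normal(size=d)
grad_diff = rng.normal(size=d)          # gradient change of affected points
theta_new = theta - H_inv @ grad_diff   # O(d^2) per unlearning request
```

Each subsequent unlearning request only pays for the matrix-vector product, amortizing the one-time inversion cost across many requests.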

Unlearning methods Gradients pt Runtime pt Speed-up
Retraining s   —
Unlearning (1st) s
Unlearning (2nd) s
Hessian stored s
Table 4: Average runtime when unlearning random combinations of features from the spam filter.

6.2 Unlearning Unintended Memorization

In the second scenario, we remove unintended memorization artifacts from a generative language model. Carlini et al. [12] show that these models can memorize rare inputs in the training data and exactly reproduce them during application. If this data contains private information like credit card numbers or telephone numbers, this may become a severe privacy issue [13, 53]. In the following, we use our approach to tackle this problem and demonstrate that unlearning is also possible with non-convex loss functions.

Canary insertion

We conduct our experiments using the novel Alice in Wonderland as training set and train an LSTM network on the character level to generate text [38]. Specifically, we train an embedding with  dimensions for the characters and use two layers of  LSTM units followed by a dense layer resulting in a model with  million parameters. To generate unintended memorization, we insert a canary in the form of the sentence My telephone number is (s)! said Alice into the training data, where (s) is a sequence of digits of varying length [12]. In our experiments, we use numbers of length and repeat the canary so that points are affected. After optimizing the categorical cross-entropy loss of the model on this data, we find that the inserted phone numbers are the most likely prediction when we ask the language model to complete the canary sentence, indicating exploitable memorization.
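Canary insertion itself is straightforward; the helper below is a hypothetical sketch of how the canary sentence could be injected into the training text, repeated so that several training points are affected, following the template described above.

```python
def insert_canary(corpus, secret_digits, repeats):
    """Insert a canary sentence carrying a secret digit sequence into
    the training text, repeated `repeats` times so that several
    training points are affected (sketch; the sentence template
    follows the experimental setup)."""
    canary = "My telephone number is {}! said Alice".format(secret_digits)
    return corpus + ("\n" + canary) * repeats

text = insert_canary("alice was beginning to get very tired", "0123456789", 3)
```

Training on this augmented corpus is what produces the measurable memorization that the exposure metric then quantifies.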

Exposure metric

In contrast to the previous scenario, the loss of the generative language model is non-convex and thus certified unlearning is not applicable. A simple comparison to a retrained model is also difficult, since the optimization procedure is non-deterministic and might get stuck in local minima of the loss function. Consequently, we require an additional measure to assess the efficacy of unlearning in this experiment and to make sure that the inserted telephone numbers have been effectively removed. To this end, we use the exposure metric introduced by Carlini et al. [12], defined as exposure(s) = log2 |Q| − log2 rank(s), where s is a sequence and Q is the set containing all sequences of identical length given a fixed alphabet. The function rank returns the rank of s with respect to the model and all other sequences in Q. The rank is calculated using the log-perplexity of the sequence and states how many other sequences are more likely, i.e., have a lower log-perplexity.
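A rank-based exposure computation over an explicit candidate set can be sketched as follows; the function name is an assumption, and the convention that lower log-perplexity means more likely follows the description above.

```python
import math

def exposure(target_perplexity, perplexities):
    """Exposure of a candidate sequence: log2 of the number of
    candidate sequences minus log2 of the rank of the candidate's
    log-perplexity (rank 1 = most likely). `perplexities` enumerates
    the log-perplexities of all sequences in the candidate set."""
    rank = 1 + sum(1 for p in perplexities if p < target_perplexity)
    return math.log2(len(perplexities)) - math.log2(rank)
```

A strongly memorized canary ranks first among all candidates and attains the maximum exposure log2 of the set size, while an unlearned canary drifts toward rank |Q| and exposure zero.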

As an example, Fig. 4 shows the perplexity distribution of our model where a telephone number of length has been inserted during training. The histogram is created using of the total possible sequences in . The perplexity of the inserted number significantly differs from all other number combinations in , indicating that it has been strongly memorized by the underlying language model.

Figure 4: Perplexity distribution of the language model. The vertical lines indicate the perplexity of an inserted telephone number before and after unlearning. Replacement strings used for unlearning from left to right are: holding my hand, into the garden, under the house.

The exact computation of the exposure metric is expensive, as it requires operating over the full set of candidate sequences. Note that our approximation of the perplexity in Fig. 4 precisely follows a skew-normal distribution, even though the evaluated number of sequences is small compared to the size of this set. Therefore, we can use the approximation proposed by Carlini et al. [12] that determines the exposure of a given perplexity value using the cumulative distribution function of the fitted skew-normal density.
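A sketch of this approximation using scipy, with synthetic perplexity samples standing in for the model's values (the skew-normal parameters below are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import skewnorm

# Synthetic log-perplexity values standing in for the model's scores.
sample = skewnorm.rvs(a=4, loc=10, scale=2, size=5000, random_state=1)

# Fit a skew-normal density to the sampled perplexities once.
params = skewnorm.fit(sample)  # (shape, loc, scale)

def approx_exposure(perplexity, total_sequences):
    """Approximate exposure from the fitted CDF: the CDF estimates the
    fraction of candidate sequences more likely than the candidate."""
    rank = max(skewnorm.cdf(perplexity, *params) * total_sequences, 1.0)
    return np.log2(total_sequences) - np.log2(rank)
```

This avoids enumerating the full candidate set: only a small sample of sequences needs to be scored, and the CDF extrapolates the rank.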

Unlearning task

To unlearn the memorized sequences, we replace each digit of the phone number in the data with a different character, such as a random or constant value. Empirically, we find that using text substitutions from the training corpus works best for this task. The model has already captured these character dependencies, resulting in a small update of the model parameters. However, due to the way generative language models are trained, the update is more involved than in the previous scenario. The model is trained to predict a character from the preceding characters. Thus, replacing a text means changing both the features (preceding characters) and the labels (target characters). Therefore, we combine both changes in a single set of perturbations in this setting.
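Collecting the perturbation set can be sketched as follows; the window size and helper name are illustrative assumptions, and equal-length replacements keep the alignment of positions simple.

```python
def char_perturbations(text, old, new, window=8):
    """Replacing a substring in the training text changes both the
    features (preceding characters) and the labels (target characters)
    of a character-level model. Collect the (context, target) pairs
    that differ before and after an equal-length replacement."""
    corrected = text.replace(old, new)

    def pairs(s):
        # (context, target) pair for every position of the text
        return [(s[max(0, i - window):i], s[i]) for i in range(1, len(s))]

    before, after = pairs(text), pairs(corrected)
    # keep only positions whose context or target actually changed
    return [(b, a) for b, a in zip(before, after) if b != a]
```

The returned pairs are exactly the combined feature-and-label changes that a single unlearning update has to account for.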

Efficacy evaluation

First, we check whether the memorized telephone numbers have been successfully unlearned from the generative language model. An important result of the study by Carlini et al. [12] is that the exposure is associated with an extraction attack: For a set with elements, a sequence with an exposure smaller than  cannot be extracted. For unlearning, we test three different replacement sequences for each telephone number and use the best for our evaluation. Table 5 shows the results of this experiment.

Length of number 5 10 15 20
Original model
Unlearning (1st)
Unlearning (2nd)
Table 5: Exposure metric of the canary sequence for different lengths. Lower exposure values make extraction harder.

We observe that our first-order and second-order updates yield exposure values close to zero, rendering an extraction impossible. In contrast, fine-tuning leaves a large exposure in the model, making a successful extraction very likely. On closer inspection, we find that the performance of fine-tuning depends on the order of the training data during the gradient updates, resulting in a high standard deviation in the different experimental runs. This problem cannot be easily mitigated by learning over further epochs and thus highlights the need for dedicated unlearning techniques. The fact that the simple first-order update can eradicate the memorization completely also shows that unintended memorization is present only on a local scale of the model.

Replacement Canary Sentence completion
taken mad!’ ‘prizes! said the lory confuse
not there␣ it,’ said alice. ‘that’s the beginning
under the mouse the book!’ she thought to herself ‘the
the capital of paris it all about a gryphon all the three of
Table 6: Completions of the canary sentence of the corrected model for different replacement strings and lengths.

Throughout our experiments, we also find that the replacement string plays a major role in the unlearning process in the context of language generation models. In Fig. 4, we report the log-perplexity of the canary for three different replacement strings after unlearning for comparison. (Strictly speaking, each replacement induces its own perplexity distribution, but we find the differences to be marginal and thus place all values in the same histogram for the sake of clarity.) Each replacement shifts the canary far to the right and turns it into a very unlikely prediction, with exposure values ranging from to . While we use the replacement with the lowest exposure in our experiments, the other substitution sequences would also impede a successful extraction.

It remains to answer the question of what the model actually predicts for the canary sequence after the unlearning step. Table 6 shows different completions of the inserted canary sentence produced by the second-order update for replacement strings of different lengths. Apparently, the predicted string is not equal to the replacement, that is, the unlearning does not push the model completely into the parameter set matching the replacement. In addition, we note that the sentences do not seem random: they follow the structure of the English language and still reflect the wording of the novel.

Fidelity evaluation

To evaluate the fidelity of the unlearning strategies, we examine the performance of the model in terms of accuracy. Table 7 shows the accuracy after unlearning for different numbers of affected data points. For small sets of affected points, our approach yields results comparable to retraining from scratch. No statistically significant difference can be observed in this setting, even when comparing sentences produced by the models. However, the accuracy of the corrected model decreases as the number of points becomes larger, because the concept of infinitesimal change is violated. Here, the second-order method is better able to handle larger changes, because the Hessian contains information about the unchanged samples. The first-order approach focuses only on the samples to be fixed and thus increasingly reduces the accuracy of the corrected model. Again, we find that the replacement string plays an important role for the fidelity, especially when more samples are affected, which is reflected in the high standard deviation observed in this case. Depending on the task, the replacement string can thus be seen as a hyperparameter of the unlearning approach that has to be tuned.

Affected samples
Original model
Unlearning (1st)
Unlearning (2nd)
Table 7: Fidelity of original and corrected models (in ).

Efficiency evaluation

Finally, we examine the efficiency of the different unlearning methods in this scenario. At the time of writing, the CUDA library version 10.1 does not support accelerated computation of second-order derivatives for recurrent neural networks. Therefore, we report a CPU computation time (Intel Xeon Gold 6226) for the second-order update method of our approach, while the other methods are calculated using a GPU (GeForce RTX 2080 Ti). The runtime and number of gradient computations required for each approach are presented in Table 8.

As expected, the time to retrain the model from scratch is extremely long, as the model and dataset are large. In comparison, one epoch of fine-tuning is faster but does not solve the unlearning task in terms of efficacy. The first-order method is the fastest approach and provides a speed-up of three orders of magnitude in relation to retraining. The second-order method still yields a speed-up factor of over retraining, although the underlying implementation does not benefit from GPU acceleration. Given that the first-order update provides a high efficacy in unlearning and only a slight decrease in fidelity when correcting less than  points, it provides the overall best performance in this scenario.

Unlearning methods Gradients Runtime Speed-up
Retraining min   —
Fine-tuning s
Unlearning (1st) s
Unlearning (2nd) s
Table 8: Runtime performance of unlearning methods for affected samples

7 Limitations

Removing data from a learning model in retrospect is a challenging endeavor. Although our unlearning approach successfully solves this task in our empirical analysis, it has limitations that are discussed in the following and need to be considered in practical applications.

Scalability of unlearning

As shown in our empirical analysis, the efficacy of unlearning decreases with the number of affected data points. While privacy leaks with dozens of sensitive features and hundreds of affected points can be handled well with our approach, changing half of the training data likely exceeds its capabilities. Clearly, our work does not violate the no-free-lunch theorem [52] and unlearning using closed-form updates cannot replace the large variety of different learning strategies in practice.

Still, our method provides a significant speedup compared to retraining and sharding in situations where a moderate number of data points need to be corrected. Consequently, it is a valuable unlearning method in practice and a countermeasure to mitigate privacy leaks when the entire training data is no longer available or retraining from scratch would not resolve the issue fast enough.

Non-convex loss functions

Our approach can only guarantee certified unlearning for strongly convex loss functions that have Lipschitz-continuous gradients. While both update steps of our approach work well for neural networks with non-convex loss functions, they require an additional measure to validate successful unlearning in practice. Fortunately, such external measures are often available, as they typically provide the basis for characterizing data leakage prior to its removal. In our experiments, for instance, we use a metric proposed by Carlini et al. [12] for unintended memorization in generative language models. Furthermore, the active research field of Lipschitz-continuous neural networks [49, 23, 30] already provides promising models that may result in better unlearning guarantees in the near future.

Unlearning requires detection

Finally, we would like to point out that our unlearning method requires knowledge of the data to be removed from a model. Detecting privacy leaks in learning models is a hard problem that is outside the scope of this work. First, the nature of privacy leaks depends on the type of data and learning models being used. For example, the analysis of Carlini et al. [12, 13] focuses on generative learning models and cannot easily be transferred to non-sequential models. Second, privacy issues are usually context-dependent and difficult to formalize. The Enron dataset, which was released without proper anonymization, may contain other sensitive information not currently known to the public. The automatic discovery of such privacy issues is a research challenge in its own right.

8 Conclusion

Instance-based unlearning is concerned with removing data points from a learning model after training—a task that becomes essential when users demand the “right to be forgotten” under privacy regulations such as the GDPR. However, privacy-sensitive information is often spread across multiple instances, impacting larger portions of the training data. Instance-based unlearning is limited in this setting, as it depends on a small number of affected data points. As a remedy, we propose a novel framework for unlearning features and labels based on the concept of influence functions. Our approach captures the changes to a learning model in a closed-form update, providing significant speedups over other approaches.

We demonstrate the efficacy of our approach in a theoretical and empirical analysis. Based on the concept of differential privacy, we prove that our framework enables certified unlearning on models with a strongly convex loss function and evaluate the benefits of our unlearning strategy in empirical studies on spam classification and text generation. In particular, for generative language models, we are able to remove unintended memorization while preserving the functionality of the models. This result provides insights on the problem of memorized sequences and shows that memorization is not necessarily deeply embedded in the neural networks.

We hope that this work fosters further research that derives approaches for unlearning and sharpens theoretical bounds on privacy in machine learning.


The authors gratefully acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy EXC 2092 CASA-390781972. Furthermore, we acknowledge funding from the German Federal Ministry of Education and Research (BMBF) under the projects IVAN (FKZ 16KIS1167) and BIFOLD (Berlin Institute for the Foundations of Learning and Data, ref. 01IS18025 A and ref 01IS18037 A) as well as by the Ministerium für Wirtschaft, Arbeit und Wohnungsbau Baden-Wuerttemberg under the project Poison-Ivy.


  • Abadi et al. [2015] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
  • Agarwal et al. [2017] N. Agarwal, B. Bullins, and E. Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research (JMLR), pages 4148–4187, 2017.
  • Aldaghri et al. [2020] N. Aldaghri, H. Mahdavifar, and A. Beirami. Coded machine unlearning. arxiv:2012.15721, 2020.
  • Attenberg et al. [2009] J. Attenberg, K. Weinberger, A. Dasgupta, A. Smola, and M. Zinkevich. Collaborative email-spam filtering with the hashing trick. In Proc. of the Conference on Email and Anti-Spam (CEAS), 2009.
  • Barshan et al. [2020] E. Barshan, M. Brunet, and G. Dziugaite. Relatif: Identifying explanatory training examples via relative influence. In Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  • Basu et al. [2020] S. Basu, X. You, and S. Feizi. On second-order group influence functions for black-box predictions. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, pages 715–724, 2020.
  • Basu et al. [2021] S. Basu, P. Pope, and S. Feizi. Influence functions in deep learning are fragile. In International Conference on Learning Representations (ICLR), 2021.
  • Bourtoule et al. [2021] L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot. Machine unlearning. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2021.
  • Boyd & Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Brunet et al. [2019] M.-E. Brunet, C. Alkalay-Houlihan, A. Anderson, and R. Zemel. Understanding the origins of bias in word embeddings. In Proc. of International Conference on Machine Learning (ICML), 2019.
  • Cao & Yang [2015] Y. Cao and J. Yang. Towards making systems forget with machine unlearning. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2015.
  • Carlini et al. [2019] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In Proc. of USENIX Security Symposium, pages 267–284, 2019.
  • Carlini et al. [2021] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, and A. Roberts. Extracting training data from large language models. In Proc. of USENIX Security Symposium, 2021.
  • Cauwenberghs & Poggio [2000] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS), pages 388–394, 2000.
  • Cawley [2006] G. Cawley. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In Proc. of the IEEE International Joint Conference on Neural Networks (IJCNN), pages 1661–1668, 2006.
  • Cawley & Talbot [2003] G. C. Cawley and N. L. Talbot. Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers. Pattern Recognition, 36(11):2585–2592, 2003.
  • Cawley & Talbot [2004] G. C. Cawley and N. L. Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10):1467–1475, 2004.
  • Chaudhuri et al. [2011] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
  • Chen et al. [2020] H. Chen, S. Si, Y. Li, C. Chelba, S. Kumar, D. Boning, and C.-J. Hsieh. Multi-stage influence function. In Advances in Neural Information Processing Systems (NeurIPS), pages 12732–12742, 2020.
  • Cook & Weisberg [1982] R. D. Cook and S. Weisberg. Residuals and influence in regression. New York: Chapman and Hall, 1982.
  • De Cristofaro [2021] E. De Cristofaro. A critical overview of privacy in machine learning. IEEE Security & Privacy Magazine, 19(4), 2021.
  • Dwork [2006] C. Dwork. Differential privacy. In Automata, Languages and Programming, pages 1–12, 2006.
  • Fazlyab et al. [2019] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS), 2019.
  • Ginart et al. [2019] A. Ginart, M. Y. Guan, G. Valiant, and J. Zou. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Golatkar et al. [2020] A. Golatkar, A. Achille, and S. Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Golatkar et al. [2021] A. Golatkar, A. Achille, A. Ravichandran, M. Polito, and S. Soatto. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Graves [2013] A. Graves. Generating sequences with recurrent neural networks. Technical Report arXiv:1308.0850, Computing Research Repository (CoRR), 2013.
  • Guo et al. [2020a] C. Guo, T. Goldstein, A. Y. Hannun, and L. van der Maaten. Certified data removal from machine learning models. In Proc. of International Conference on Machine Learning (ICML), pages 3822–3831, 2020a.
  • Guo et al. [2020b] H. Guo, N. F. Rajani, P. Hase, M. Bansal, and C. Xiong. Fastif: Scalable influence functions for efficient model interpretation and debugging. arxiv:2012.15781, 2020b.
  • Gouk et al. [2020] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv, 2020.
  • Hampel [1974] F. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
  • Hassibi et al. [1994] B. Hassibi, D. Stork, and G. Wolff. Optimal brain surgeon: Extensions and performance comparisons. In Advances in Neural Information Processing Systems (NeurIPS), 1994.
  • Koh & Liang [2017] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proc. of International Conference on Machine Learning (ICML), pages 1885–1894, 2017.
  • Koh et al. [2019] P. W. Koh, K. Ang, H. H. K. Teo, and P. Liang. On the accuracy of influence functions for measuring group effects. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • LeCun et al. [1990] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), 1990.
  • Leino & Fredrikson [2020] K. Leino and M. Fredrikson. Stolen memories: Leveraging model memorization for calibrated white-box membership inference. In Proc. of the USENIX Security Symposium, 2020.
  • Liu & Nocedal [1989] D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical Programming, 45:503–528, 1989.
  • Merity et al. [2018] S. Merity, N. S. Keskar, and R. Socher. An analysis of neural language modeling at multiple scales. arxiv:1803.08240, 2018.
  • Metsis et al. [2006] V. Metsis, G. Androutsopoulos, and G. Paliouras. Spam filtering with Naive Bayes - which Naive Bayes? In Proc. of Conference on Email and Anti-Spam (CEAS), 2006.
  • Neel et al. [2020] S. Neel, A. Roth, and S. Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. arxiv:2007.02923, 2020.
  • Papernot et al. [2018] N. Papernot, P. McDaniel, A. Sinha, and M. P. Wellman. SoK: Security and privacy in machine learning. In Proc. of the IEEE European Symposium on Security and Privacy (EuroS&P), 2018.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS). 2019.
  • Pearlmutter [1994] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
  • Rad & Maleki [2018] K. R. Rad and A. Maleki. A scalable estimate of the extra-sample prediction error via approximate leave-one-out. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 2018.
  • Salem et al. [2019] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Proc. of the Network and Distributed System Security Symposium (NDSS), 2019.
  • Schulam & Saria [2019] P. Schulam and S. Saria. Can you trust this prediction? auditing pointwise reliability after learning. In Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Shokri et al. [2017] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In Proc. of the IEEE Symposium on Security and Privacy (S&P), pages 3–18, 2017.
  • Sutskever et al. [2011] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In Proc. of International Conference on Machine Learning (ICML), 2011.
  • Virmaux & Scaman [2018] A. Virmaux and K. Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Warnecke et al. [2020] A. Warnecke, D. Arp, C. Wressnegger, and K. Rieck. Evaluating explanation methods for deep learning in computer security. In Proc. of the IEEE European Symposium on Security and Privacy (EuroS&P), Sept. 2020.
  • Weinberger et al. [2009] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proc. of the International Conference on Machine Learning (ICML), 2009.
  • Wolpert & Macready [1997] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.
  • Zanella Béguelin et al. [2020] S. Zanella Béguelin, L. Wutschitz, S. Tople, V. Rühle, A. Paverd, O. Ohrimenko, B. Köpf, and M. Brockschmidt. Analyzing Information Leakage of Updates to Natural Language Models. In Proc. 27th ACM Conference on Computer and Communications Security (CCS ’20), 2020.


Deriving the Update Steps

In the following, we derive the first-order and second-order update strategies used in the paper. For a deeper theoretical discussion of the employed techniques, we refer the reader to the textbook of Boyd & Vandenberghe [9].

First-order update

To derive the first-order update for our approach, let us first reconsider the optimization problem for the corrected learning model:

$$\theta^*_{Z \to \tilde{Z}} = \operatorname*{argmin}_{\theta} L(\theta; D'),$$

where $L$ is the combined loss function that is minimized and $D' = (D \setminus Z) \cup \tilde{Z}$ denotes the corrected dataset. If the difference $\theta^*_{Z \to \tilde{Z}} - \theta^*$ is small and $L$ is differentiable with respect to $\theta$, we can approximate $L(\theta; D')$ using a first-order Taylor series at $\theta^*$:

$$L(\theta; D') \approx L(\theta^*; D') + \nabla_{\theta} L(\theta^*; D')^{\top} (\theta - \theta^*).$$

Since $\theta^*$ is a minimum of $L(\theta; D)$, we assume $\nabla_{\theta} L(\theta^*; D) = 0$. Plugging in the Taylor series approximation and using the condition that $\nabla_{\theta} L(\theta^*; D) = 0$, we arrive at

$$L(\theta; D') \approx L(\theta^*; D') + \Big( \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \sum_{z \in Z} \nabla_{\theta} \ell(z, \theta^*) \Big)^{\top} (\theta - \theta^*).$$

Since $L(\theta^*; D')$ is constant in $\theta$, we can focus on the dot product. For two vectors $a$ and $b$ the dot product can be written as $a^{\top} b = \|a\| \|b\| \cos \varphi$, where $\varphi$ is the angle between the vectors $a$ and $b$. The minimal value of the cosine is $-1$, which is achieved when $b = -a$, hence we have $\theta - \theta^* \propto -\nabla_{\theta} L(\theta^*; D')$. This result indicates that $-\nabla_{\theta} L(\theta^*; D')$ is the optimal direction to move starting from $\theta^*$. The actual step size, however, is unknown and must be adjusted by a small constant $\tau$, yielding the update step defined in Section 4.1:

$$\theta^- = \theta^* - \tau \Big( \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \sum_{z \in Z} \nabla_{\theta} \ell(z, \theta^*) \Big).$$

Due to the linearity of the gradient, the derivation remains identical when multiple points are affected.
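As a concrete sketch of this update, the NumPy snippet below applies the first-order step to a ridge-regularized squared loss. The loss, the toy data, and the unlearning rate `tau` are illustrative assumptions and not the implementation evaluated in the paper:

```python
import numpy as np

def point_grads(theta, X, y):
    """Sum of per-point gradients of the squared loss 0.5*(x^T theta - y)^2."""
    return X.T @ (X @ theta - y)

def full_grad(theta, X, y, lam=0.1):
    """Gradient of the combined (regularized) loss L(theta; D)."""
    return point_grads(theta, X, y) + lam * theta

def first_order_update(theta, Z, Z_tilde, tau=0.05):
    """theta_minus = theta - tau * (sum grad l(z~) - sum grad l(z))."""
    (Xz, yz), (Xt, yt) = Z, Z_tilde
    G = point_grads(theta, Xt, yt) - point_grads(theta, Xz, yz)
    return theta - tau * G
```

For a sufficiently small `tau`, one step already shrinks the gradient residual on the corrected dataset relative to the unmodified model.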

Second-order update

If we assume that the loss $L$ is twice differentiable and strictly convex, there exists an inverse Hessian matrix $H^{-1}_{\theta^*}$ of $\nabla^2_{\theta} L(\theta^*; D)$ and we can proceed to approximate changes to the learning model using the technique of Cook & Weisberg [20]. In particular, we can determine the optimality conditions for Eq. 10 directly by

$$0 = \nabla_{\theta} L(\theta^*_{\epsilon}; D) + \epsilon \nabla_{\theta} \ell(\tilde{z}, \theta^*_{\epsilon}) - \epsilon \nabla_{\theta} \ell(z, \theta^*_{\epsilon}).$$

If $\epsilon$ is sufficiently small, we can approximate these conditions using a first-order Taylor series at $\theta^*$. This approximation yields the solution:

$$\theta^*_{\epsilon} - \theta^* \approx - \big( \nabla^2_{\theta} L(\theta^*; D) + \epsilon \nabla^2_{\theta} \ell(\tilde{z}, \theta^*) - \epsilon \nabla^2_{\theta} \ell(z, \theta^*) \big)^{-1} \big( \nabla_{\theta} L(\theta^*; D) + \epsilon \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \epsilon \nabla_{\theta} \ell(z, \theta^*) \big).$$

Since we know that $\nabla_{\theta} L(\theta^*; D) = 0$ by the optimality of $\theta^*$, we can rearrange this solution using the Hessian $H_{\theta^*} = \nabla^2_{\theta} L(\theta^*; D)$ of the loss function, such that

$$\theta^*_{\epsilon} - \theta^* \approx - \epsilon H^{-1}_{\theta^*} \big( \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \nabla_{\theta} \ell(z, \theta^*) \big),$$

where we additionally drop all terms in $\mathcal{O}(\epsilon^2)$. By expressing this solution in terms of the influence of $\tilde{z}$, we can further simplify it and obtain

$$\frac{\partial \theta^*_{\epsilon}}{\partial \epsilon} \Big|_{\epsilon = 0} = - H^{-1}_{\theta^*} \big( \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \nabla_{\theta} \ell(z, \theta^*) \big).$$

Finally, when using $\epsilon = 1$ in Eq. 10, the data point $z$ is replaced by $\tilde{z}$ completely. In this case, Eq. 12 directly leads to the second-order update defined in Section 4.2:

$$\theta^- = \theta^* - H^{-1}_{\theta^*} \Big( \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \sum_{z \in Z} \nabla_{\theta} \ell(z, \theta^*) \Big).$$
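As an illustration of this closed-form update, the sketch below uses the same assumed ridge-regularized squared loss as before (not the paper's code): replacing labels `y_old` by `y_new` reduces the update to one linear solve with the Hessian, and for a quadratic loss the result coincides with exact retraining:

```python
import numpy as np

def second_order_update(theta, X, y_old, y_new, lam=0.1):
    """theta_minus = theta - H^{-1} (sum grad l(z~) - sum grad l(z)).

    For the squared loss, replacing a label y by y~ changes the per-point
    gradient x*(x^T theta - y) by x*(y - y~), so the gradient difference
    over all replaced points sums to X.T @ (y_old - y_new).
    """
    G = X.T @ (y_old - y_new)
    H = X.T @ X + lam * np.eye(X.shape[1])  # Hessian of the combined loss
    return theta - np.linalg.solve(H, G)
```

Because the squared loss is quadratic, its Hessian does not depend on theta, which is why a single second-order step here is exact rather than approximate.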

Proofs for Certified Unlearning

In the following, we present the proofs for certified unlearning of our approach and, in particular, the bounds on the gradient residual used in Section 5. First, let us recall Theorem 1 from Section 5.1.

Theorem 1.

If all perturbations lie within a radius $\delta$, that is $\|\tilde{z} - z\| \leq \delta$ for every corrected point, and the gradient $\nabla_{\theta} \ell$ is $\gamma$-Lipschitz with respect to $z$ and $\theta$, the following upper bounds hold:

  1. If the unlearning rate satisfies $\tau \leq (\gamma n)^{-1}$, we have

    $$\|\nabla_{\theta} L(\theta^-; D')\| \leq 2 \gamma m \delta$$

    for the first-order update of our approach.

  2. If $\nabla^2_{\theta} \ell$ is $\Gamma$-Lipschitz with respect to $\theta$, we have

    $$\|\nabla_{\theta} L(\theta^-; D')\| \leq \Gamma n \, \|H^{-1}_{\theta^*} G(Z, \tilde{Z})\|^2$$

    for the second-order update of our approach, where $G(Z, \tilde{Z}) = \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \sum_{z \in Z} \nabla_{\theta} \ell(z, \theta^*)$.

To prove this theorem, we begin by introducing a lemma which is useful for investigating the gradient residual of the model $\theta^*$ on the corrected dataset $D'$.

Lemma 2.

Given a radius $\delta$ with $\|\tilde{z} - z\| \leq \delta$ for all corrections, a gradient $\nabla_{\theta} \ell$ that is $\gamma$-Lipschitz with respect to $z$, and a learning model $\theta^*$ trained on $D$, we have

$$\|\nabla_{\theta} L(\theta^*; D')\| \leq \gamma m \delta,$$

where $m = |Z|$ denotes the number of affected data points.

By definition, we have

$$\nabla_{\theta} L(\theta^*; D') = \sum_{z \in D'} \nabla_{\theta} \ell(z, \theta^*).$$

We can now split the dataset $D'$ into the set of affected data points $\tilde{Z}$ and the remaining data $D \setminus Z$ as follows:

$$\nabla_{\theta} L(\theta^*; D') = \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) + \sum_{z \in D \setminus Z} \nabla_{\theta} \ell(z, \theta^*).$$

By applying a zero addition and leveraging the optimality of the model $\theta^*$ on our dataset $D$, that is $\nabla_{\theta} L(\theta^*; D) = 0$, we then express the gradient as follows:

$$\nabla_{\theta} L(\theta^*; D') = \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \sum_{z \in Z} \nabla_{\theta} \ell(z, \theta^*) + \nabla_{\theta} L(\theta^*; D) = \sum_{(z, \tilde{z})} \big( \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \nabla_{\theta} \ell(z, \theta^*) \big).$$

Finally, using the Lipschitz continuity of the gradient in this expression, we arrive at the following inequalities that finalize the proof of Lemma 2:

$$\|\nabla_{\theta} L(\theta^*; D')\| \leq \sum_{(z, \tilde{z})} \|\nabla_{\theta} \ell(\tilde{z}, \theta^*) - \nabla_{\theta} \ell(z, \theta^*)\| \leq \sum_{(z, \tilde{z})} \gamma \|\tilde{z} - z\| \leq \gamma m \delta.$$
With the help of Lemma 2, we can prove the update bounds of Theorem 1. Our proof is structured in two parts: we first investigate the first case and then proceed with the second case of the theorem.

Proof (Case 1).

For the first-order update, we recall that

$$\theta^- = \theta^* - \tau \, G(Z, \tilde{Z}),$$

where $\tau$ is the unlearning rate and we have

$$G(Z, \tilde{Z}) = \sum_{\tilde{z} \in \tilde{Z}} \nabla_{\theta} \ell(\tilde{z}, \theta^*) - \sum_{z \in Z} \nabla_{\theta} \ell(z, \theta^*).$$

Consequently, we seek to bound the norm of $\nabla_{\theta} L(\theta^-; D')$. By Taylor's theorem, there exists a constant $\eta \in [0, 1]$ and a parameter