Learn to Forget: User-Level Memorization Elimination in Federated Learning

by   Yang Liu, et al.
Xidian University

Federated learning is a decentralized machine learning technique that has attracted widespread attention in both research and industry. However, current privacy-preserving federated learning schemes only provide a secure way for users to contribute their private data; they leave no way to withdraw that contribution from the model update. Such an irreversible setting potentially violates data protection regulations and increases the risk of data extraction. To resolve the problem, this paper describes a novel concept for federated learning, called memorization elimination. Based on the concept, we propose Forsaken, a federated learning framework that allows a user to eliminate the memorization of its private data in the trained model. Specifically, each user in Forsaken is deployed with a trainable dummy gradient generator. After a number of training steps, the generator can produce dummy gradients that stimulate the neurons of a machine learning model to eliminate the memorization of the specific data. We also prove that the additional memorization elimination service of Forsaken does not break the common procedure of federated learning or lower its security.





I Introduction

Owing to its improved security and high efficiency, federated learning quickly became the star of decentralized learning after it was proposed [18]. In 2019, a Google research team announced that federated learning had matured enough to solve applied learning problems over tens of millions of real-world users, and anticipated its use in billion-user applications [5]. Federated learning is therefore likely to lead the trend of decentralized online learning in the future market. More applications that deploy machine learning as a service can save a significant amount of model training cost by utilizing federated learning.

Despite its popularity, federated learning still faces a variety of security and application challenges, such as defending against the user data reconstruction attack [34] and adapting federated learning to specific application scenarios [16], [30]. In fact, these widely studied challenges can be summarized as one point, i.e., how to make federated learning securely and efficiently memorize the training data. However, compared to “memorize”, the reverse process, “forget”, seems to be neglected because of the difficulty of extracting specific memorization from a trained model. As a result, there is currently neither an effective way for the user to withdraw the uploaded private data nor a uniform indicator to evaluate the “forget” state.

Recently released data protection regulations, e.g., the California Consumer Privacy Act (CCPA) in the United States [14] and the General Data Protection Regulation (GDPR) in the European Union [20], clearly rule that the user should have the right to withdraw its private data if there is no special statement in the user agreement. These rules imply that the lack of a forgetting mechanism in federated learning potentially violates the regulations and denies the user fair and free control over its private data. Moreover, the training set of a user in federated learning may contain unintended data that are private but not really useful for improving model accuracy. The trained model's memorization of these unintended data greatly increases the chance that an adversary extracts the user’s private information. Take the language model as an example. The adversary can recover an inadvertently inserted sequence that is fully memorized after only attempts [8]. In principle, there is no direct way for the server to manage the unintended memorization in the trained model. However, we can resolve the problem indirectly by letting the user withdraw the trained model's memorization of the data to be eliminated, which is referred to as memorization elimination in this paper. With this method, the user can lower the risk of privacy leakage as much as possible.

To achieve memorization elimination, we propose Forsaken in this paper. Forsaken mainly achieves the following two breakthroughs: 1) it provides an efficient way to implement user-level memorization elimination for federated learning; 2) it defines a new indicator to evaluate the performance of memorization elimination. Forsaken implements memorization elimination through a trainable generator. When a user needs to conduct memorization elimination, the generator is invoked to generate dummy gradients by learning the state of the trained model. The dummy gradient can be accumulated to the trained model in the same way as a gradient computed by the normal method used in most federated learning schemes, e.g., stochastic gradient descent (SGD) [18]. Therefore, our memorization elimination does not break the common procedure of federated learning, and also fits the existing gradient privacy protection methods, such as [13], [6], [29]. The difference is that the normal gradient is used to improve the overall performance of the target model, whereas the dummy gradient is used to stimulate the neuron units of the machine learning model to eliminate the memorization of the specific data. Given a memorization elimination method, the remaining dilemma is that there is no quantified way to evaluate its performance. For this purpose, Forsaken defines the forgetting rate to experimentally evaluate memorization elimination. By building on the concept of membership inference for machine learning [21], the forgetting rate describes the fraction of the data that are successfully eliminated from the memorization of the target model.

The contributions of this paper are summarized below.

  • Revocable Federated Learning. We propose Forsaken, a federated learning framework that lets the user independently manage the trained model’s memorization of its data. In particular, the memorization management does not need to retrain the machine learning model or break the common procedure of federated learning.

  • Memorization Elimination. We formally give a novel definition of memorization elimination for federated learning, which contains a quantified indicator to measure the performance of memorization elimination, called the forgetting rate. The indicator is derived from an active research topic, membership inference, and can be easily computed by an inspector.

  • Learn to Forget. We design a trainable generator that learns the state of the target model to generate dummy gradients, which stimulate the machine learning model to eliminate the memorization of the specific data.

  • Performance Evaluation. We implement Forsaken on five standard machine learning datasets to evaluate its performance. The results show that when we eliminate the memorization of 10 users’ 200 samples (about 1% of the whole training set), Forsaken achieves an average forgetting rate of 87.49% and causes less than 5% accuracy loss of the target model on the remaining data.

II Background

In this section, we briefly overview the essential technical backgrounds of federated learning and membership oracle.

II-A Federated Learning

Fig. 1: A standard framework of federated learning

Federated learning is an online machine learning framework that protects the privacy of training data providers at the “gradient” level [34]. Compared with the conventional training method with centralized data storage, federated learning avoids direct access to the private data of users and has significant advantages in training efficiency because of its distributed architecture. A standard framework of federated learning is shown in Fig. 1. Assume that there are $n$ users $u_1, \dots, u_n$, each of which owns a local private dataset $D_i$. At the beginning of federated learning, the central server $S$ and the users agree on an identical machine learning model architecture $F$ and objective function $L$. Then, at each iteration, each user $u_i$ downloads the parameters of the current model from $S$ and trains the model with $D_i$ based on the gradient descent method. The model updates (i.e., gradients) obtained by each user are subsequently returned to $S$, which averages the updates from all users and accumulates them to the current model according to the following equation.

$$\theta_{j} = \theta_{j-1} - \eta \sum_{i=1}^{n} \frac{m_i}{m}\, g_i^{j} \tag{1}$$

where $\theta_j$ is the parameters of the model trained for $j$ iterations; $\eta$ controls the learning rate; $g_i^{j}$ is the gradient uploaded by $u_i$ at the $j$-th iteration; $m_i$ is the number of training samples owned by $u_i$; and $m = \sum_{i=1}^{n} m_i$. In this way, federated learning implements distributed model training without uploading the user’s private data, which greatly decreases the risk of privacy leakage.
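As a minimal sketch of the server-side aggregation described above, the following NumPy snippet applies one federated SGD update. It assumes the common sample-size weighting, in which each user's gradient is weighted by its share of the training samples. The function name `fedsgd_step` and its arguments are illustrative and are not taken from the paper.

```python
import numpy as np

def fedsgd_step(theta, grads, sizes, lr):
    """One server update: theta - lr * sum_i (m_i / m) * g_i.

    theta : current model parameters (1-D array)
    grads : list of per-user gradients, same shape as theta
    sizes : list of per-user sample counts m_i (assumed weighting)
    lr    : learning rate
    """
    sizes = np.asarray(sizes, dtype=float)
    weights = sizes / sizes.sum()              # m_i / m
    avg_grad = sum(w * g for w, g in zip(weights, grads))
    return theta - lr * avg_grad

# Two users holding 1 and 3 samples respectively.
theta = np.zeros(2)
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
new_theta = fedsgd_step(theta, grads, sizes=[1, 3], lr=1.0)
```

The user with more samples pulls the model further, which is the only design choice this sketch encodes.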

Gradient Privacy. In federated learning, the gradient computed by each user is not completely secure. Through a generative adversarial network [1] or just a simple gradient optimizer [34], the adversary can reconstruct the private data from the gradient. To overcome the problem, the core idea of the existing methods [13], [6], [29] is as follows. First, before the gradient is sent to the server, the user masks it with a privacy-preserving tool, e.g., differential privacy [13] or secret sharing [6]. Then, the server (the adversary under the security model) aggregates the masked gradients and accumulates the result to the current model. During the process, the server cannot access the gradient of any specific user and obtains only the final aggregated result. In this way, the success rate of such an attack becomes very low. For security, we have to ensure that the above gradient privacy protection method can still be applied to Forsaken, even when memorization elimination is invoked.
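The masking idea can be illustrated with a toy additive-masking scheme in which every pair of users shares a random mask that cancels in the sum. This is only a sketch in the spirit of secret-sharing-based aggregation: it is not the protocol of [6], it is not cryptographically secure (the masks come from one shared NumPy generator), and all function names are hypothetical.

```python
import numpy as np

def pairwise_masks(n_users, dim, seed=0):
    """Build per-user masks that sum to zero across all users."""
    rng = np.random.default_rng(seed)
    masks = [np.zeros(dim) for _ in range(n_users)]
    for i in range(n_users):
        for j in range(i + 1, n_users):
            r = rng.normal(size=dim)   # mask shared by users i and j
            masks[i] += r              # user i adds it
            masks[j] -= r              # user j subtracts it
    return masks

def mask_gradients(grads, masks):
    """Each user uploads its gradient plus its own mask."""
    return [g + m for g, m in zip(grads, masks)]

def aggregate(masked):
    """The server only sums the uploads; the masks cancel out."""
    return sum(masked)

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masks = pairwise_masks(n_users=3, dim=2)
masked = mask_gradients(grads, masks)
total = aggregate(masked)
```

The server recovers the sum of the gradients while each individual upload looks random to it.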

II-B Membership Oracle

The goal of a membership oracle for machine learning is to determine the attribution of a certain sample, as defined in Definition 1 [22].

Definition 1 (Membership Oracle). Given a trained machine learning model $F$ and an arbitrary sample $x$, an ideal membership oracle $O$ outputs TRUE if $x$ belongs to the training set of $F$; otherwise, $O$ outputs FALSE.

Initially, the membership oracle was mainly used as a membership inference attacker for breaking the membership privacy of machine learning [22]. In Forsaken, we utilize it as a tool to evaluate the performance of memorization elimination. Up to now, the target-shadow method proposed by Shokri et al. [22] and its variants [21], [19] are still the mainstream way to implement a membership oracle. Concretely, the inspector (or attacker) collects a shadow dataset coming from the same distribution as the training set used for the target model, and then trains a shadow model with the shadow dataset. Next, the inspector uses the confidence vectors predicted by the shadow model for some members and non-members of the shadow model’s training set to train a binary classifier. Taking a sample’s confidence vector predicted by the target model as input, the binary classifier can infer whether the sample is a member of the training set of the target model or not, which is precisely a membership oracle. Note that the existing membership inference methods can only be used in classification tasks. Thus, unless otherwise stated, the machine learning models mentioned in the rest of this paper are all classifiers by default. The regression task is left for future work.
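To make the target-shadow idea concrete, the sketch below replaces the learned binary classifier with the simplest possible one: a single threshold on the largest confidence value, fitted on confidence vectors that a shadow model produced for its own members and non-members. This is a deliberate simplification for illustration, not the attack model of [22], and the function names are hypothetical.

```python
import numpy as np

def fit_threshold_oracle(member_conf, nonmember_conf):
    """Fit a one-dimensional decision stump on max-confidence.

    member_conf / nonmember_conf: arrays of shape (n, c) holding the shadow
    model's confidence vectors for its training members and non-members.
    Returns the threshold with the best accuracy on the shadow data.
    """
    m = np.max(np.asarray(member_conf, dtype=float), axis=1)
    n = np.max(np.asarray(nonmember_conf, dtype=float), axis=1)
    best_t, best_acc = None, -1.0
    for t in np.sort(np.concatenate([m, n])):
        acc = (np.sum(m >= t) + np.sum(n < t)) / (len(m) + len(n))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t

def oracle(conf_vector, threshold):
    """Predict membership from the target model's confidence vector."""
    return bool(np.max(conf_vector) >= threshold)

shadow_members = [[0.9, 0.1], [0.8, 0.2]]
shadow_nonmembers = [[0.6, 0.4], [0.5, 0.5]]
t = fit_threshold_oracle(shadow_members, shadow_nonmembers)
```

A real oracle would train a classifier on the full confidence vector, but the threshold version already shows why overconfident predictions leak membership.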

III What is Memorization Elimination?

In this section, we introduce the meaning of memorization elimination. Further, we describe the adversary model and design goals for implementing secure memorization elimination in federated learning.

III-A Motivation

Due to its excellent performance in both security and effectiveness, federated learning has reached an unprecedented range of applications in the decentralized machine learning field. However, the mainstream of existing work on federated learning is still limited to how to make a model securely and effectively “memorize” something, e.g., enhancing the security and efficiency of gradient uploading [2] or extending the practicality of federated learning in some special scenarios [24, 28]. The reverse process, which we call memorization elimination, remains largely unexplored.

Motivating Examples: Our study is first motivated by the practical requirements of federated learning. Up to now, federated learning is still a one-way trip for the user. Once a user has contributed its private data, no route of retreat is provided to withdraw the trained model's memorization of these data. As mentioned before, such an irreversible setting on data memorization leaves a potential risk of violating national data protection regulations when federated learning is applied in practice.

Moreover, the lack of a data forgetting mechanism increases the possibility of user data leakage. Consider a commonly discussed scenario of federated learning: training a generative sequence model on a text dataset for automated sentence completion. Ideally, the dataset should not contain any rare-but-sensitive information about individual users; alternatively, the trained model should not strongly memorize this information and should never emit it as a sentence completion. In particular, if a user accidentally uses a sentence with the prefix “My bank password is …” to update the model, the trained model should not predict the exact number in the suffix of the user’s text as the most likely completion when another user types the same prefix. Unfortunately, the research of [8] points out that it is hard for the current training environment to achieve this ideal condition, and the adversary can exploit this loophole to extract the sensitive information of the user efficiently. From the perspective of real-world applications, a general way to overcome the problem is to allow the user to check the list of its data used for federated learning and selectively withdraw the trained model's memorization of the sensitive data.

Memorization Elimination: To explain what memorization elimination means for federated learning, we first illustrate what memorization is for machine learning. Abstractly, the training process of a machine learning model is simply a memorization enhancement process. Through a series of iterative learning steps, the neurons of a machine learning model obtain some form of memorization about the pattern of the training set, even when the pattern of the training set is randomized [31]. Further, a trained model tends to output what accords with its memorization, i.e., its outputs strongly suggest what training data were used (see the concept of membership oracle [22, 21]). Take the classification task as an example. We can reasonably derive Definition 2: if some samples are “totally forgotten”, the model can only guess the samples’ categories; in other words, the model will not specifically suggest that these samples belong to any category.

Definition 2. For a $c$-classification machine learning task, we say that a sample $x$ is totally forgotten by a machine learning model $F$ if the confidence vector satisfies $F(x) = \left(\frac{1}{c}, \dots, \frac{1}{c}\right)$.
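Definition 2 can be checked numerically. The sketch below assumes that “totally forgotten” means the confidence vector is uniform over the classes, in line with the remark that the model can only guess the category. The tolerance parameter and both function names are illustrative additions.

```python
import numpy as np

def forgetting_gap(conf):
    """Largest deviation of a confidence vector from the uniform vector."""
    conf = np.asarray(conf, dtype=float)
    return float(np.max(np.abs(conf - 1.0 / conf.size)))

def is_totally_forgotten(conf, tol=1e-3):
    """True if every class confidence is within `tol` of 1/c."""
    return forgetting_gap(conf) <= tol

guessing = [0.25, 0.25, 0.25, 0.25]      # model can only guess
memorized = [0.70, 0.10, 0.10, 0.10]     # model strongly suggests class 0
```

In practice an exact uniform output is rarely reached, so a tolerance (or the gap itself) is the more useful quantity.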

Fig. 2: $\tau$-elimination: Express the training steps of federated learning as a time sequence. $\tau$-elimination ensures that before $\tau$, the inspector regards the model as being trained by the whole dataset $D$; after $\tau$, the inspector regards the model as being trained by the dataset $D \setminus D_e$ with the elimination set $D_e$ removed.

Inspired by the above illustration, memorization elimination for federated learning can be roughly understood as forcing a trained model to forget the pattern of the specific private data owned by the user. Referring to [8], we treat the private data required to be withdrawn as out-of-scope training data, which are usually a tiny part of the whole training data. Once the memorization elimination operation is called, federated learning is no longer intended to make the model memorize any such data. Based on this principle, we present the following definition, $\tau$-elimination, i.e., conducting memorization elimination at the $\tau$-th iteration.

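Definition 3 ($\tau$-Elimination). Given a machine learning model $F_\tau$ trained after $\tau$ iterations, a membership oracle $O$, a training set $D$ and an elimination dataset $D_e \subset D$, $\tau$-elimination is perfectly performed if there exists an elimination function $E$ that can output $F'_\tau = E(F_\tau, D_e)$ where for each $x \in D_e$, we have $O(F_\tau, x) = \mathrm{TRUE}$ and $O(F'_\tau, x) = \mathrm{FALSE}$; for each $x \in D \setminus D_e$, $O(F'_\tau, x) = \mathrm{TRUE}$.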
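The condition of Definition 3 can be phrased as a small checker. The sketch assumes the reading that eliminated samples flip from member to non-member under the oracle while all remaining training samples stay members. The `oracle(model, x)` callable and the set-based toy models are hypothetical stand-ins used only to exercise the logic.

```python
def is_perfect_elimination(oracle, model_before, model_after, train_set, elim_set):
    """Check the (assumed) conditions of perfect elimination.

    oracle(model, x) -> bool is any membership oracle.
    Eliminated samples must be members before and non-members after;
    the remaining training samples must still be members after.
    """
    elim = set(elim_set)
    for x in train_set:
        if x in elim:
            if not oracle(model_before, x) or oracle(model_after, x):
                return False
        elif not oracle(model_after, x):
            return False
    return True

# Toy stand-in: a "model" is just the set of samples it has memorized.
toy_oracle = lambda model, x: x in model
before = {"a", "b", "c"}
good_after = {"a", "b"}      # forgot only "c"
bad_after = {"a"}            # also forgot "b", which was not requested
```

The second toy model illustrates the directivity requirement: forgetting more than the requested data violates the definition.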

Suppose that there is an inspector who owns an ideal membership oracle. As shown in Fig. 2, Definition 3 rules that, from the perspective of the inspector, the machine learning models obtained after the $\tau$-th iteration are trained by the training set without the eliminated data ($D \setminus D_e$); the models obtained before the $\tau$-th iteration are not affected, i.e., only the backward privacy of the eliminated data is protected. We say that backward privacy is enough because it ensures that the final output model of federated learning does not contain the contributions of the eliminated data. Furthermore, ideal $\tau$-elimination should have good directivity, that is, it forces the model to forget the pattern of specific data after a particular iteration but does not influence the performance on the remaining data.

III-B Adversary Model

The adversary model of Forsaken is inherited from the standard federated learning scheme [6, 12]. The details are given as follows.

Learning Scenario. Prior to introducing the adversary model, we describe the entities of the learning scenario in Forsaken. As illustrated in Fig. 1, there are $n$ users in total and one central server, which agree on the same training objective in Forsaken. For generality, the data of all users are assumed to be non-IID (not independent and identically distributed), which is consistent with the setting of standard federated learning schemes.

Adversary Description. We consider the security of Forsaken against the adversary under the following assumptions.

  1. The adversary basically follows the standard curious-but-honest model [4], and can be the server or any user. The curious-but-honest adversary honestly conducts the predetermined protocol steps but never misses an opportunity to infer the honest user’s data from the received legitimate messages.

  2. The adversary is restricted to polynomial-time computation capacity. Moreover, we assume that there are secure channels between the server and the user for transmitting model updates. (Footnote 1: This paper focuses on memorization elimination of federated learning. For secure channel construction, please refer to [6, 13].)

  3. The device of the user (e.g., smartphone) provides confidentiality guarantees for the storage of the local private data (both training data and the model) and is physically secure against the adversary. Such local isolation is the prerequisite for the security of federated learning.

III-C Design Goal

Forsaken is designed to achieve the following two goals.

Goal 1 (Security Inheritance). Forsaken does not break the common procedure of federated learning, and meanwhile, ensures that there is no polynomial-time algorithm for a curious-but-honest adversary to infer any data of the user.

The first goal is to avoid user privacy leakage, which is also the goal of conventional federated learning schemes. In other words, we have to guarantee that Forsaken is at least as secure as other federated learning schemes in the curious-but-honest model. According to our adversary model, the attack surface of the adversary is concentrated on the user gradient. Therefore, to achieve this goal, Forsaken should be compatible with the existing secure aggregation schemes that protect the gradient. Besides the above goal, the other goal of Forsaken is to provide an additional memorization elimination service for the user, which is defined below.

Goal 2 (Memorization Elimination). An arbitrary user, whether honest or curious-but-honest, can choose to conduct $\tau$-elimination on its data at any iteration of federated learning, and the additional operation of $\tau$-elimination does not influence the learning process of other users.

The second goal is to implement secure memorization elimination based on $\tau$-elimination. Note that the elimination function given in Definition 3 is an ideal function for achieving our goal. Nevertheless, considering the meaning of memorization for machine learning, it is not realistic to implement such a function perfectly. Instead, in the next section, we define a simple optimizer to approximate the function.

IV Memorization Elimination in Federated Learning

In this section, we present the design details of memorization elimination in Forsaken.

IV-A Measuring Memorization Elimination

Prior to introducing Forsaken, we define a quantified indicator to measure the performance of a $\tau$-elimination method, called the forgetting rate (FR).

For both the human brain and machine learning, memorization is an abstract concept that is hard to measure directly. However, a widely accepted fact is that memorization can only be formed from known objects (the training samples for machine learning). Even though the human brain or a machine learning model can identify a never-seen object, this association ability is derived in some way from the known memorization. Thus, if we successfully eliminate the memorization of some specific data, the most intuitive reflection is that these data are transformed from “known” to “unknown”, as illustrated in Fig. 2. In such a case, the eliminated data are ensured to be no longer related to the memorization of the target model. Naturally, the transformation rate between “known” and “unknown”, i.e., FR, can be directly utilized to evaluate the performance of memorization elimination. To compute FR, it is necessary to find a tool to identify whether a given sample is “known” or not. Definition 1 precisely describes such an ideal tool, that is, the membership oracle. Nonetheless, the dilemma is that none of the existing membership inference algorithms can implement an ideal membership oracle. Therefore, FR combines both the performance of the membership oracle and the transformation rate, and can be mathematically expressed in terms of the counts given in Table I.

To further understand FR, we can refer to a common evaluation indicator of machine learning, the recall rate [10]. In machine learning, the recall rate is the fraction of all positive instances that are correctly classified as “true”. Before conducting memorization elimination, the positive instances are the training samples required to be eliminated. One component of FR expresses the recall rate of the membership oracle on these instances, i.e., the fraction correctly identified as “known”. Correspondingly, the other component indicates the recall rate of the instances that are correctly transformed to be “unknown” after conducting memorization elimination. In short, FR gives a quantified indicator to measure how many samples are moved from the memorized set (training set) to the unknown set (testing set) after memorization elimination. Notably, as the computation of FR involves the inference of a membership oracle that cannot be owned by a normal user, it is impractical to directly utilize FR as an objective function of memorization elimination. To overcome the problem, we define a new objective function based on Definition 2 in Section IV-C.

Notation    Description
BT          The number of eliminated training samples that are predicted to be TRUE by $O$ BEFORE conducting $E$.
BF          The number of eliminated training samples that are predicted to be FALSE by $O$ BEFORE conducting $E$.
AT          The number of eliminated training samples that are predicted to be TRUE by $O$ AFTER conducting $E$.
AF          The number of eliminated training samples that are predicted to be FALSE by $O$ AFTER conducting $E$.
$O$; $E$    The membership oracle; the $\tau$-elimination function.
TABLE I: Notation Table
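The four counts of Table I are straightforward to compute from the oracle's verdicts on the eliminated samples. The sketch below does so and then derives one plausible rate from them: the share of samples flagged as “known” before elimination that are flagged as “unknown” afterwards. That rate is an assumption made for illustration and should not be read as the paper's FR expression; the shorthand names `bt`, `bf`, `at`, `af` and both function names are likewise assumed.

```python
def table_counts(before, after):
    """Count oracle verdicts on the eliminated samples.

    before / after: lists of booleans, one per eliminated sample, giving the
    membership oracle's verdict before and after elimination.
    Returns (bt, bf, at, af): TRUE/FALSE counts before, TRUE/FALSE counts after.
    """
    bt = sum(1 for b in before if b)
    bf = len(before) - bt
    at = sum(1 for a in after if a)
    af = len(after) - at
    return bt, bf, at, af

def flipped_fraction(before, after):
    """ASSUMED rate: share of 'known' samples that became 'unknown'."""
    bt = sum(1 for b in before if b)
    if bt == 0:
        return 0.0
    flipped = sum(1 for b, a in zip(before, after) if b and not a)
    return flipped / bt

before = [True, True, True, False]    # oracle verdicts before elimination
after = [False, False, True, False]   # oracle verdicts after elimination
counts = table_counts(before, after)
rate = flipped_fraction(before, after)
```

Normalizing by the samples the oracle recognized beforehand keeps an imperfect oracle from inflating the rate with samples it never identified as members.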

IV-B Intuition of Memorization Elimination

Our memorization elimination method is originally inspired by the nature of the human brain. Consider the laws of human memory. The most intuitive way to make the brain forget something is to focus on other things. Similarly, to make a model eliminate the memorization of some specific data, the simplest way is to remove these data from the training set and then continue to train the model with the remaining dataset. Theoretically, after countless training steps, the model gradually weakens its memorization of the pattern of the removed data. The advantage of such a natural memorization elimination method is that it has little impact on the original performance of the target model. However, its disadvantage is also obvious: it suffers a great loss in efficiency and practicality. A well-trained model is designed to have features similar to the deep cognition behaviour of the human brain (e.g., the language ability). To make a well-trained model forget something in a natural way, the required time is usually too long to be acceptable for practical applications.

To verify our analysis, we use the natural memorization elimination method to conduct a simple experiment with the standard image classification dataset CIFAR-10. The detailed experiment setting is listed in Section VI. In the experiment, we first train a target model for 25 epochs according to the federated learning procedure. Then, at the 25th epoch, we randomly select a user and remove it from the candidate data provider set. Next, we let the remaining users continue to train the target model for the same number of epochs as in the previous training. The result shown in Fig. 3 indicates that only a very small increase in FR is obtained from the extra epochs of training, which is consistent with the above analysis.

Fig. 3: Memorization elimination with the natural method for CIFAR-10. Memorization elimination occurs at epoch 25.

To overcome the defect of natural memorization elimination, it is necessary to enhance the “forgetting” strength towards the data of the selected user. To achieve this goal, we refer to the stochastic gradient descent (SGD) method used for model updates in federated learning (mathematically expressed as Eq. 1 in Section II). From Eq. 1, it is observed that the memorization of a trained model in federated learning is formed by the gradients uploaded by the users. Therefore, a forcible method to implement memorization elimination is to withdraw the gradients that the user has contributed to the model update, and the process can be completed in a normal training epoch of federated learning as given in Eq. 3.

$$\theta_{\tau} = \theta_{\tau-1} - \eta \left( \sum_{i \neq e} \frac{m_i}{m}\, g_i^{\tau} - \sum_{j=1}^{\tau-1} \frac{m_e}{m}\, g_e^{j} \right) \tag{3}$$

where $u_e$ is the user whose data memorization is eliminated at the $\tau$-th iteration, and $g_e^{j}$ is the gradient uploaded by $u_e$ at the $j$-th iteration; $\theta$, $\eta$, $m_i$ and $m$ denote the model parameters, the learning rate, the number of samples owned by user $u_i$ and the total number of samples, as in Eq. 1. As shown in Fig. 4, using the forcible elimination method to implement $\tau$-elimination can indeed achieve a higher FR. However, since the SGD method used in federated learning is time-sequence related and irreversible, the performance loss caused by the method on the target model is also obvious. Although the above method does not obtain a satisfactory result, it inspires us to implement Forsaken at the gradient level.
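The forcible method can be sketched as bookkeeping on the server: record every weighted contribution a user makes, then add those contributions back when the user asks to be forgotten. The snippet assumes the sample-size weighting of federated SGD and uses fixed toy gradients; the function names and the history structure are hypothetical.

```python
import numpy as np

def train_with_history(theta, rounds, sizes, lr):
    """Run federated SGD and record each user's weighted contribution.

    rounds : list of rounds, each a list with one gradient per user
    sizes  : per-user sample counts (assumed weighting m_i / m)
    """
    m = float(sum(sizes))
    history = {i: [] for i in range(len(sizes))}
    for grads in rounds:
        for i, g in enumerate(grads):
            contrib = (sizes[i] / m) * g
            history[i].append(contrib)
            theta = theta - lr * contrib
    return theta, history

def forcible_eliminate(theta, history, user, lr):
    """Withdraw every past contribution of `user` from the model."""
    return theta + lr * sum(history[user])

rounds = [
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    [np.array([2.0, 0.0]), np.array([0.0, 3.0])],
]
theta, history = train_with_history(np.zeros(2), rounds, sizes=[1, 1], lr=0.1)
theta_forgot = forcible_eliminate(theta, history, user=0, lr=0.1)
```

With fixed toy gradients the withdrawal is exact. In real training each gradient depends on the model state produced by all earlier updates, so subtracting old gradients does not reproduce a model trained without the user, which matches the performance loss noted above.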

Fig. 4: Memorization elimination with the forcible method for CIFAR-10. Memorization elimination occurs at epoch 40.

IV-C Generator: Learn to Forget

We now present our memorization elimination methodology, which relies on a trainable dummy gradient generator.

Dummy Gradient. From the forcible elimination method, we observe that the gradient is a powerful component for influencing the memorization in federated learning. Therefore, Forsaken also utilizes the gradient to implement memorization elimination. The difference is that the gradient used for memorization elimination in Forsaken is not computed with the normal gradient method but is specially produced by a dummy gradient generator. The dummy gradient is initialized with a series of tiny random values and has the same size as the normal gradient. By learning the state of the target model, the generator successively adapts the dummy gradient to remodel the target model towards a predefined direction, so as to implement directive and lossless memorization elimination. The following presents the principle by which the generator produces the dummy gradient.

Learn to Forget. The design of the dummy gradient generator is inspired by the neurological “active forgetting” mechanism of the human body [11]. Different from the hysteresis of natural forgetting (also known as “passive forgetting”), active forgetting is vigorous and can eliminate all traces and engram cells for a given memory. The active forgetting process is as follows [11]. To eliminate the specific memorization stored in the engram cells, the human body lets the dopamine neurons, called forgetting cells, produce a special kind of dopamine. The dopamine serves as the forgetting signal that stimulates the remodeling of the engram cells to accelerate memorization elimination. Based on the feedback of the engram cells, the forgetting cells learn to regulate the production of dopamine to avoid unexpected forgetting.

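1:Input: the target model and its corresponding trainable parameters; the elimination set; the user that owns the elimination set; the maximum number of training epochs.
2:Initialize the dummy gradient with small values, and then perform the following iteration.
3:for each epoch up to the maximum number of training epochs do
4:     Update the parameters of the target model with the current dummy gradient in the SGD manner, scaled by the learning rate and the data size.
5:     Compute the confidence vectors of the updated model on the elimination set.
6:     Optimize the dummy gradient based on the objective function.
7:     Output a new dummy gradient.
8:end for
9:Return the sum of the dummy gradients.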
Protocol 1 Memorization Elimination Generator (DummyG)

Similarly, the generator plays the role of the forgetting cell in our memorization elimination mechanism for machine learning. To eliminate the memorization of the trained model about some specific data, the generator successively produces a certain amount of dummy gradients, which function like the dopamine. The dummy gradients can be applied to the neuron units of the machine learning model and stimulate them to remodel themselves to eliminate the memorization of the given data. Besides, the generator can learn the state of the neuron units with respect to memorization elimination from the feedback of the target model. According to the learning result, the generator regulates the size of the dummy gradients to avoid unexpected elimination. After several epochs of learning, the target model quickly loses the memorization of the specific data under the influence of the dummy gradients. The detailed construction method of the generator is illustrated below and listed in Protocol 1 (DummyG).

1:Input: the users, each of which owns a local dataset; the server; the learning rate; the maximum epochs for memorization elimination; a gradient privacy protection function.
2:The server initializes a machine learning model and determines the loss function.
3:for each iteration do
4:     The server randomly selects a set of users and publishes the current model information to them.
5:     for each selected user do
6:         if the user chooses to eliminate the memorization of the model about its data then
7:              Ask for the permission of memorization elimination.
8:              Compute the dummy gradient with DummyG, and then send the protected result to the server.
9:         else
10:              Compute the normal gradient with the local data, and then send the protected result to the server.
11:         end if
12:     end for
13:     On receiving the protected gradients, the server uses the secure aggregation function SecAgg corresponding to the gradient privacy protection function to compute the aggregated gradient, and updates the model accordingly.
14:end for
15:Return the trained model.
Protocol 2 Privacy-preserving Federated Learning with Memorization Elimination (SecForget)

Denote the target model whose memorization has to be eliminated as , where is the trainable parameter set of the target model. In the training process, first initializes the dummy gradient with a series of small random values (empirically set to less than ). The size of is the same as . Then, at the iteration of training, , sends to , and uses to update in a similar way to the SGD method used in federated learning, i.e., , where is a hyperparameter that controls the learning rate (mentioned in Eq. 1) and is the size of . Next, we input each sample of the elimination set into and compute . Notably, since Forsaken only focuses on tasks with finite and discrete outputs, as mentioned before, the output of the target model is by default a confidence vector with finite dimensions. is fed back to and used to optimize the dummy gradient by minimizing the following objective.

is differentiable w.r.t. the dummy gradient and can be optimized with a standard gradient-based method, such as L-BFGS [33] or Adam [32]. Here, the distance governs the optimization objective of the training, where is the confidence vector of a “perfectly” forgotten sample according to Definition 1. constrains the maximum step length and the optimization direction. is a regularization term that penalizes changes to the original model to avoid performance loss. Finally, after completing the above iterations, we obtain the dummy gradient that can eliminate the target model's memorization of .
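The generator loop described above can be sketched on a toy model. The following is a minimal illustration, not the authors' implementation. It assumes a linear softmax classifier as the target model, a squared Euclidean distance to the uniform confidence vector plus an L2 penalty on the dummy gradient as the objective (the exact form of Eq. 4 may differ), plain gradient descent with finite differences in place of L-BFGS or Adam, and the elimination set size as the normalizing constant. All function names and hyperparameter values are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(theta, x):
    """Confidence vector of a linear softmax classifier (stand-in target model)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in theta])

def apply_dummy(theta, g, lr, n):
    """SGD-style step with the dummy gradient: theta' = theta - (lr / n) * g."""
    return [[w - lr / n * gw for w, gw in zip(rw, rg)] for rw, rg in zip(theta, g)]

def objective(theta, g, elim, lr, lam):
    """Assumed objective: squared distance to the uniform vector plus an L2 penalty on g."""
    k = len(theta)
    theta_new = apply_dummy(theta, g, lr, len(elim))
    dist = sum((p - 1.0 / k) ** 2 for x in elim for p in predict(theta_new, x))
    reg = sum(gw * gw for row in g for gw in row)
    return dist + lam * reg

def dummy_g(theta, elim, lr=1.0, lam=1e-3, steps=200, step_size=0.5, eps=1e-5):
    """Optimize the dummy gradient by gradient descent with finite differences."""
    g = [[1e-3 for _ in row] for row in theta]  # small initial values
    for _ in range(steps):
        base = objective(theta, g, elim, lr, lam)
        grad = []
        for i, row in enumerate(g):
            grad_row = []
            for j in range(len(row)):
                g[i][j] += eps
                grad_row.append((objective(theta, g, elim, lr, lam) - base) / eps)
                g[i][j] -= eps
            grad.append(grad_row)
        g = [[gw - step_size * d for gw, d in zip(rg, rd)] for rg, rd in zip(g, grad)]
    return g

theta = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # toy 3-class model
elim = [[1.0, 0.0], [0.9, 0.1]]                  # toy elimination set
g = dummy_g(theta, elim)
before = objective(theta, [[0.0, 0.0] for _ in theta], elim, 1.0, 1e-3)
after = objective(theta, g, elim, 1.0, 1e-3)
```

In this sketch the dummy gradient has the same shape as the parameters, and the objective value after optimization (`after`) is lower than with a zero dummy gradient (`before`), which means the confidence vectors of the elimination samples have moved towards the uniform vector.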

IV-D Memorization Elimination in Federated Learning

With the dummy gradient generator, Forsaken can simply implement memorization elimination in federated learning as stated in Protocol 2 (SecForget).

The procedure of SecForget follows the privacy-preserving federated learning scheme proposed by Google [18, 6] but provides an additional memorization elimination option for each user. First, the server initializes the trainable parameters of a machine learning model and determines its loss function. Then, at each iteration, users are selected to participate in the model training. Each selected user has two options. If it chooses to eliminate the model's memorization of its data, the user invokes the generator defined in DummyG to produce the dummy gradient; otherwise, it uses its local data to compute the gradient according to the standard SGD method. In principle, the user is not allowed to choose memorization elimination during the first few epochs of federated learning. Furthermore, since the plaintext gradient of a single user can be used to derive the user’s data [15], most existing federated learning schemes use cryptographic tools to protect the privacy of the gradient before sending it to the server. From the design of DummyG, it can be observed that the dummy gradient can be treated in the same way as the normal gradient. Therefore, the mainstream gradient privacy protection methods, e.g., differential privacy [13], secret sharing [6] and homomorphic encryption [29], can be directly used in SecForget. Finally, the server invokes the secure aggregation algorithm SecAgg that corresponds to the gradient privacy protection function to obtain the sum of all gradients and uses it to update the current model according to Eq. 1.
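To illustrate why a dummy gradient can pass through the same protection and aggregation path as a normal gradient, the sketch below simulates one round with pairwise additive masking, a simplified stand-in for the secret-sharing based secure aggregation of [6]. The masking scheme, the update rule in `server_update`, and all names and values are assumptions made for illustration; they are not taken from Protocol 2 or Eq. 1.

```python
import random

def mask_gradients(grads, seed=0):
    """Add pairwise random masks that cancel when all masked vectors are summed."""
    rng = random.Random(seed)
    masked = [list(g) for g in grads]
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            for k in range(len(grads[0])):
                r = rng.uniform(-1.0, 1.0)
                masked[i][k] += r  # user i adds the shared mask
                masked[j][k] -= r  # user j subtracts the same mask
    return masked

def sec_agg(masked):
    """Server side: sum the masked vectors; the pairwise masks cancel out."""
    return [sum(col) for col in zip(*masked)]

def server_update(theta, agg, lr, total_samples):
    """Assumed update rule: theta <- theta - lr * agg / total_samples."""
    return [t - lr * a / total_samples for t, a in zip(theta, agg)]

normal_1 = [0.2, -0.1, 0.4]   # gradient from a user training normally
normal_2 = [0.1, 0.3, -0.2]   # gradient from another such user
dummy = [-0.5, 0.0, 0.6]      # dummy gradient from a user who wants to be forgotten
uploads = mask_gradients([normal_1, normal_2, dummy])
agg = sec_agg(uploads)
theta = server_update([1.0, 1.0, 1.0], agg, lr=0.1, total_samples=300)
```

The server only handles `uploads` and `agg`. Each upload is a masked vector of the same length, so nothing in the upload format marks the third user as having sent a dummy gradient, while `agg` still equals the plain sum of the three vectors up to floating-point error.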

V Theoretical Analysis

In this section, we theoretically prove that Forsaken achieves its two design goals. We also discuss how Forsaken could be extended into an attack tool that threatens the security of federated learning.

Data Privacy. As defined in Goal 1, the security goal of Forsaken is to inherit the same level of data privacy as existing privacy-preserving federated learning schemes. From the procedure of SecForget, it can be seen that the only difference between Forsaken and other federated learning schemes is the dummy gradient produced by DummyG. Thus, we can give a formal definition of the security of Forsaken as follows.

Definition 4 (Security of Forsaken). We say that Forsaken is as secure as the existing federated learning schemes if the dummy gradient produced by DummyG can be treated in the same way as the true gradient produced by the standard gradient method with the privacy protection function.

Proof. According to the common gradient method used in federated learning, such as SGD [17], the normal gradient given in Eq. 5 is the derivative of the loss function with respect to the trainable parameters , where is the training sample.


As for Forsaken, the generator in DummyG also uses a gradient-based optimizer to generate the dummy gradient. The difference is that the generator treats the gradient, not the model parameters, as the trainable target. Therefore, the dummy gradient is actually the derivative of the loss function defined in Eq. 4 with respect to the gradient, which is mathematically expressed as Eq. 6.


From the perspective of the server, both the dummy gradient and the normal gradient can be regarded as a series of numerical matrices that have the same function and the same size. Thus, the dummy gradient can also be correctly passed through the gradient privacy protection function, as in other federated learning schemes [6, 13]. Moreover, after the gradient privacy protection function is applied, the server can only access the aggregated gradients. From the aggregation result, it is impossible to separate the gradient of a specific user in polynomial time. It follows that the dummy gradient and the normal gradient in SecForget are computationally indistinguishable. In conclusion, Forsaken is at least as secure as the existing federated learning schemes.

Memorization Elimination. As mentioned before, Forsaken implements memorization elimination (i.e., Goal 2) based on the dummy gradient generator, whose core is defined in Eq. 4. We now explain why Eq. 4 leads the generator to produce effective dummy gradients for memorization elimination. Eq. 4 is mainly composed of two parts. The first part is based on Definition 2, which describes the output of a machine learning model for a “totally forgotten” sample. Definition 2 is reasonable because if we have no knowledge about a given sample, the only way to judge its category is to guess, and a guessing output is always neutral. For example, if a machine learning model learns nothing about some specific samples in a -classification task, its output for these samples should not be biased towards any class, that is, it should approximate the guessing vector . Notably, learning nothing means having no relevant memorization; it does not mean having a wrong memorization that biases the output towards the wrong class. Therefore, the first part drives the dummy gradient to change the specific sample from “memorized” to “totally forgotten”. The second part of Eq. 4 penalizes parameter changes on the target model. In this way, we can minimize the performance loss on the target model caused by memorization elimination. In the next section, we further use experiments to validate the above setting for memorization elimination.
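The difference between "no memorization" and "wrong memorization" can be made concrete with a small numeric sketch. It assumes the guessing vector is the uniform vector over the classes and uses the Euclidean distance as the measure; both the distance choice and the confidence values are illustrative assumptions, not quantities from the paper.

```python
def guessing_vector(k):
    """Uniform confidence vector for a k-class task."""
    return [1.0 / k] * k

def distance_to_guess(conf):
    """Euclidean distance between a confidence vector and the guessing vector."""
    u = guessing_vector(len(conf))
    return sum((c - ui) ** 2 for c, ui in zip(conf, u)) ** 0.5

memorized = [0.90, 0.05, 0.05]   # confident and correct (true class 0)
mislearned = [0.05, 0.90, 0.05]  # confident but wrong: still a biased output
forgotten = [0.34, 0.33, 0.33]   # close to guessing

d_mem = distance_to_guess(memorized)
d_mis = distance_to_guess(mislearned)
d_fgt = distance_to_guess(forgotten)
```

Here `d_mem` and `d_mis` are equal and large, while `d_fgt` is close to zero. Under this measure, pushing a sample to the wrong class does not count as forgetting it, which matches the distinction drawn above.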

Extended Discussion. Similar to the membership oracle, our dummy gradient generator can also be turned from a benign tool into an attack tool that launches a kind of inconspicuous data poisoning attack against federated learning.

Usually, a data poisoning attack is launched by a malicious data provider [7]. By uploading gradients computed with mislabeled data, the attacker can mislead the server into training a machine learning model that always gives wrong outputs for some specific data. However, such an attack has the defect that it causes an obvious performance loss of the trained model on other data [26]. Our dummy gradient generator provides an alternative that overcomes this defect. If a malicious user slightly changes the objective function ( in Eq. 4) to suit its needs and successively uploads the dummy gradient to the server, the final trained model can also be misled into identifying some specific data wrongly. Even worse, as shown in the experiments of Section VI, the attack has little impact on the model performance, which makes it difficult to detect. Although active attacks are out of the scope of this paper, we still give an intuitive way to defend against the dummy gradient based poisoning attack in practice, i.e., observing the running time of each user. Since generating a dummy gradient is more complicated than computing a normal gradient and cannot be done in advance, the server can detect a malicious user by checking whether its gradient generation time is unusually longer than that of other users.
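The running-time defense can be sketched as a simple outlier rule. The rule below (flag any user whose time exceeds a multiple of the median) and the factor of 3 are hypothetical choices for illustration; the text only states that unusually long generation times are suspicious, and the timings are made-up values.

```python
import statistics

def flag_slow_users(times, factor=3.0):
    """Return users whose gradient generation time exceeds factor * median."""
    med = statistics.median(times.values())
    return sorted(u for u, t in times.items() if t > factor * med)

# Seconds each selected user needed to return its gradient (illustrative values).
times = {"u1": 0.9, "u2": 1.1, "u3": 1.0, "u4": 15.6, "u5": 1.2}
suspects = flag_slow_users(times)
```

With these values the median is 1.1 s, so only `u4` is flagged. A real deployment would also have to account for honest users with slow devices and for users who legitimately request memorization elimination.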

VI Experiment Implementation

In this section, comprehensive experiments are conducted to prove that Forsaken can achieve Goal 2, memorization elimination.

VI-A Experiment Setup

We utilize five different machine learning datasets to conduct our experiments, namely CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), CIFAR-100, MNIST (http://yann.lecun.com/exdb/mnist/) and News (http://qwone.com/~jason/20Newsgroups/). Among them, the first three are standard image classification datasets. The News dataset is a commonly used text classification and clustering dataset that has a balanced class distribution. Since the raw News dataset consists entirely of string-type data that cannot be directly used for machine learning, we preprocess it by encoding the raw data into numerical matrices in a similar way to [21]. The detailed information about the datasets is listed in Table II. In particular, to evaluate the indicator of our scheme, we have to train a membership oracle. Therefore, we train a membership oracle for each dataset according to the target-shadow method presented in [21]. Specifically, we first randomly partition the target dataset into two halves. One half is used to train the target model. The other is used to train the shadow model. Then, we train the target and shadow models in the white-box mode, i.e., using the same architecture and hyperparameters. Finally, with the outputs of the two models as the training and testing sets, we train an XGBoost [9] model to serve as the membership oracle. Interested readers can refer to [21] for details.
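The target-shadow procedure can be sketched in a few lines. To keep the example self-contained, the XGBoost oracle is replaced by a much simpler stand-in: a threshold on the maximum confidence, fitted on outputs of the shadow model. This substitution, the function names, and the confidence values are all assumptions for illustration.

```python
import random

def split_halves(samples, seed=0):
    """Randomly partition a dataset into a target half and a shadow half."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def fit_threshold_oracle(member_conf, nonmember_conf):
    """Pick the max-confidence threshold that best separates members from non-members."""
    best_t, best_acc = 0.0, -1.0
    total = len(member_conf) + len(nonmember_conf)
    for t in sorted(set(member_conf + nonmember_conf)):
        correct = sum(c >= t for c in member_conf) + sum(c < t for c in nonmember_conf)
        if correct / total > best_acc:
            best_t, best_acc = t, correct / total
    return best_t, best_acc

target_half, shadow_half = split_halves(list(range(10)))

# Max confidences of the shadow model on samples it was / was not trained on
# (illustrative values standing in for real model outputs).
shadow_member_conf = [0.99, 0.97, 0.95, 0.90]
shadow_nonmember_conf = [0.60, 0.72, 0.91, 0.55]
threshold, train_acc = fit_threshold_oracle(shadow_member_conf, shadow_nonmember_conf)

def is_member(max_conf):
    """Oracle decision for an output of the target model."""
    return max_conf >= threshold
```

The fitted threshold is then applied to the outputs of the target model. A sample that the oracle labels as a member before elimination and as a non-member afterwards counts as forgotten.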

Corresponding to the five datasets, we use five neural networks with different architectures. To comprehensively evaluate the performance of Forsaken, the five neural networks are deliberately chosen to differ markedly in parameter size. The neural networks used for MNIST, CIFAR-100 and News are given in Fig. 5, Fig. 6 and Fig. 7. For CIFAR-10, we use a previously proposed deep network architecture, VGG-13 [23].


  1. Convolution: Input image , window size , number of output channels .

  2. ReLU: Calculate ReLU for each input.

  3. MaxPooling: Window size .

  4. Convolution: Window size , number of output channels .

  5. ReLU: Calculate ReLU for each input.

  6. Fully Connected Layer: Fully connect the incoming nodes to the outgoing nodes.

  7. Fully Connected Layer: Fully connect the incoming nodes to the outgoing nodes.

Fig. 5: The neural network trained for MNIST


  1. Convolution: Input image , window size , number of output channels .

  2. BatchNorm + ReLU: Apply batch normalization and ReLU to each input.

  3. MaxPooling: Window size .

  4. Convolution: Window size , number of output channels .

  5. BatchNorm + ReLU: Apply batch normalization and ReLU to each input.

  6. MaxPooling: Window size .

  7. Convolution: Window size , number of output channels .

  8. BatchNorm + ReLU: Apply batch normalization and ReLU to each input.

  9. MaxPooling: Window size .

  10. Convolution: Window size , number of output channels .

  11. BatchNorm + ReLU: Apply batch normalization and ReLU to each input.

  12. MaxPooling: Window size .

  13. Fully Connected Layer: Fully connect the incoming nodes to the outgoing nodes.

Fig. 6: The neural network trained for CIFAR-100


  1. Embedding: Input word vector , the output word embedding is .

  2. Transpose Convolution: Window size , number of output channels .

  3. ReLU: Calculate ReLU for each input.

  4. MaxPooling: Window size .

  5. Convolution: Window size , number of output channels .

  6. ReLU: Calculate ReLU for each input.

  7. MaxPooling: Window size .

  8. Convolution: Window size , number of output channels .

  9. ReLU: Calculate ReLU for each input.

  10. MaxPooling: Window size .

  11. Fully Connected Layer: Fully connect the incoming nodes to the outgoing nodes.

  12. Fully Connected Layer: Fully connect the incoming nodes to the outgoing nodes.

Fig. 7: The neural network trained for News. The layers in the networks are all used in the 1-dimension mode.

Moreover, the dummy gradient is initialized with the normal random generator provided by NumPy (limited to less than degree). The optimizer used as the memorization elimination generator is L-BFGS [33]. All experiments are implemented with PyTorch, an open-source machine learning library for Python. To simulate the federated learning process, we randomly split the training set of each dataset into a series of subsets of the same size (100 by default), each of which is assigned to a user as its local dataset. The users cooperatively train the target model (or shadow model) according to the steps of Protocol 2. Note that memorization elimination is always applied to the target model in the following experiments. The default elimination size is 200. The eliminated samples come from randomly selected users. Each selected user contributes 20 randomly chosen samples to make up the elimination set.
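The data partition described above can be reproduced with a short sketch. It assumes samples are referred to by integer index and that the number of contributing users is the elimination size divided by the per-user contribution (200 / 20 = 10 with the defaults). The training set size of 5000 and the function names are hypothetical.

```python
import random

def split_among_users(indices, subset_size=100, seed=0):
    """Randomly split sample indices into equally sized local datasets."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_users = len(shuffled) // subset_size
    return [shuffled[u * subset_size:(u + 1) * subset_size] for u in range(n_users)]

def build_elimination_set(user_data, elimination_size=200, per_user=20, seed=0):
    """Pick random users and take per_user random samples from each of them."""
    rng = random.Random(seed)
    chosen_users = rng.sample(range(len(user_data)), elimination_size // per_user)
    elim = []
    for u in chosen_users:
        elim.extend(rng.sample(user_data[u], per_user))
    return chosen_users, elim

users = split_among_users(list(range(5000)))   # 50 users with 100 samples each
chosen, elim = build_elimination_set(users)    # 10 users contribute 20 samples each
```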

| Name | No. of Instances | Features | Classes |
|---|---|---|---|
| MNIST | 70000 (10000 for testing) | | 10 |
| News | 11314 (2262 for testing) | 1000 | 20 |
| CIFAR-10 | 60000 (10000 for testing) | | 10 |
| CIFAR-100 | 60000 (10000 for testing) | | 100 |

TABLE II: Dataset Information

VI-B Performance of Memorization Elimination

For memorization elimination, what the user cares about most is how much of its data is successfully forgotten by the trained model, which can be evaluated by FR, the indicator proposed in Section IV. In the evaluation, we focus on the performance of Forsaken on the well-trained model. To achieve this, we first train several pairs of target and shadow models with the prepared datasets based on the common procedure of federated learning given in Protocol 2 and carefully avoid overtraining. Then, we randomly select one user to call the memorization elimination service. Table III summarizes the FR of the selected user’s data after applying the dummy gradients to the trained models on different datasets. The result shows that Forsaken can transfer more than 90% of the eliminated samples from the training set to the known set. However, considering the accuracy of membership inference, the FR of Forsaken is a little lower than the expected value. In addition, neither the size of the training set nor the size of the dummy gradient that the generator has to simulate strongly influences the performance of Forsaken. Even for VGG-13, whose dummy gradient has 14.09M entries to be simulated, Forsaken can still effectively eliminate the specific memorization.

Unlike the user, the central server is more concerned about the performance loss of the target model after memorization elimination. To evaluate this aspect, we record the training accuracy and training loss of the target model before and after memorization elimination, shown in Fig. 8(a) and Fig. 8(b). Compared with the two intuitive methods proposed in Section IV, the performance loss (reflected by the training accuracy and training loss in the experiments) caused by memorization elimination in Forsaken is negligible. Taken together, the above experiment results indicate that Forsaken can basically satisfy our design goal for memorization elimination.

| Dataset | Dummy Gradient Size | B.R.R | A.R.R | FR |
|---|---|---|---|---|
| MNIST | 0.96M | 85.54% | 99% | 84.68% |
| News | 0.24M | 86.37% | 99.67% | 86.01% |
| CIFAR-10 | 14.09M | 87.33% | 98% | 85.58% |
| CIFAR-100 | 9.02M | 95.56% | 95.5% | 91.26% |

B.R.R ; A.R.R .
FR = B.R.R × A.R.R, i.e., Eq. 2 in Section IV.
TABLE III: Forgetting Rate of Forsaken on Different Datasets
(a) Training accuracy change after memorization elimination.
(b) Training loss change after memorization elimination.
Fig. 8: The performance change of the target model after conducting Forsaken, Baseline-1 and Baseline-2.
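The three percentage columns of Table III can be checked for internal consistency. The sketch below assumes that the columns are, in order, B.R.R, A.R.R and FR, and that FR is the product of the first two, as the table footnote suggests. Both points are inferred from the table rather than stated explicitly in the surviving text.

```python
# (B.R.R, A.R.R, reported FR) in percent, copied from Table III.
rows = {
    "MNIST": (85.54, 99.0, 84.68),
    "News": (86.37, 99.67, 86.01),
    "CIFAR-10": (87.33, 98.0, 85.58),
    "CIFAR-100": (95.56, 95.5, 91.26),
}

def fr_from_rates(brr, arr):
    """Assumed relation: FR = B.R.R * A.R.R, with all values in percent."""
    return brr * arr / 100.0

# Absolute gap (in percentage points) between the product and the reported FR.
gaps = {name: abs(fr_from_rates(b, a) - fr) for name, (b, a, fr) in rows.items()}
largest_gap = max(gaps, key=gaps.get)
```

Under this assumption the product reproduces the reported value to within 0.1 percentage points for every row. The News row shows the largest gap, which may be due to rounding of the reported rates.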

VI-C Effect Factors for Memorization Elimination

After evaluating the overall performance, we further conduct experiments to analyse some key factors that may affect the performance of Forsaken, namely the eliminated data size, the number of training iterations of the dummy gradient generator, and overtraining. The experiments in this section are mainly performed on two datasets, MNIST and News, which respectively represent two classical machine learning tasks, image classification and text classification.

Eliminated Data Size. For Forsaken, the complexity of the dummy gradient generator is positively related to the eliminated data size. Here, we conduct experiments to test how the performance of Forsaken changes with the eliminated data size. Usually, only a small fraction of the users launch memorization elimination at the same time. Therefore, we vary the eliminated size from 50 to 300 (about 1% of the total training set) in the experiments. Fig. 9 plots the experiment result. Increasing the eliminated data size leads to a slight reduction in FR for Forsaken. Besides, from the changes of the target model’s accuracy on the training and testing sets, we can observe that memorization elimination hardly affects the performance of the target model, even when the eliminated size reaches 300. This shows that Forsaken is robust to modest changes in the eliminated data size.

(a) The change of FR with different elimination sizes for MNIST.
(b) The change of FR with different elimination sizes for News.
Fig. 9: The performance change of Forsaken with different elimination sizes. Diff.Train.Acc and Diff.Test.Acc mean the difference of the training/testing accuracy before and after conducting Forsaken, respectively.

Training Iteration of Generator. Since the dummy gradient generator is implemented with a gradient-based optimizer, its performance is strongly influenced by the number of training iterations. Fig. 10 shows how the performance of memorization elimination changes with increasing iterations. Commonly, the dummy gradient generator reaches its best performance after tens of iterations and then stops. In the experiments, we force the generator to train for some extra iterations to better illustrate its behavior. Once the best point is reached, more training iterations hardly reduce FR but cause more and more accuracy loss of the target model. The reason is that, by the design of the generator, overtraining cannot cancel the eliminated memorization; however, it can cause over-elimination of unrelated memorization, which lowers the accuracy of the target model. In practice,

(a) The FR at each iteration of the dummy gradient generator for MNIST.
(b) The FR at each iteration of the dummy gradient generator for News.
Fig. 10: The FR of Forsaken at each iteration of the dummy gradient generator.
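Because extra iterations past the best point only cost accuracy, the generator needs some stopping rule. The sketch below shows one possible rule, a patience-based early stop on the forgetting rate. The rule, its parameters, and the FR trace are hypothetical and are not taken from the experiments.

```python
def stop_iteration(fr_history, patience=3, min_gain=0.005):
    """Return the 1-based iteration with the last meaningful FR improvement."""
    best, best_it, stale = float("-inf"), 0, 0
    for it, fr in enumerate(fr_history, start=1):
        if fr > best + min_gain:
            best, best_it, stale = fr, it, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no meaningful gain for `patience` iterations in a row
    return best_it

# Illustrative FR trace: rapid improvement, then a plateau.
trace = [0.10, 0.35, 0.60, 0.80, 0.85, 0.851, 0.852, 0.850, 0.849]
stop_at = stop_iteration(trace)
```

With this trace the rule selects iteration 5, the last step that improved FR by more than `min_gain`, and ignores the plateau that follows.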

Moreover, the efficiency of memorization elimination is also an important criterion for evaluating Forsaken. Concretely, the efficiency of Forsaken is mainly affected by two factors: the number of training iterations of the generator and the size of the dummy gradient to be simulated. Table IV reports the running time needed to accomplish one user’s memorization elimination with different iteration numbers and different sizes of neural networks. The experiments are performed on a laptop equipped with an Intel Core i7-7200 CPU @2.50GHz and 8GB RAM (no GPU acceleration). The experiment results show that the dummy gradient generation process can be completed in minutes, even for the neural network with 14.09M parameters.

| Iteration | Running Time (s) | | | |
|---|---|---|---|---|
| 10 | 3.88 | 14.78 | 59.89 | 56.01 |
| 15 | 6.13 | 22.34 | 86.97 | 88.81 |
| 20 | 8.82 | 29.65 | 117.51 | 124.14 |
| 25 | 11.93 | 36.99 | 152.71 | 164.82 |
| 30 | 15.64 | 44.43 | 182.48 | 209.97 |

TABLE IV: Efficiency of Memorization Elimination

Overtraining. According to former research [8], overtraining is tightly related to memorization in machine learning. Loosely speaking, overtraining gives the trained model a deeper and stronger memorization of the training set and leads to worse performance on unknown data. Fig. 11 shows a typical example of overtraining, which occurs when the target model is trained with only half of the training set for News. During the first few epochs, the testing loss drops rapidly until it reaches the best point. After that point, the trend of the testing loss reverses, i.e., it begins to increase, which means the model is overtrained. To evaluate Forsaken under overtraining, we record the memorization elimination results at different stages of model training, both before and after overtraining. Not surprisingly, the memorization is successfully eliminated whether or not overtraining occurs. Nonetheless, Forsaken performs better on the overtrained model. The reason is that the overtrained model usually has more redundant memorization of the training set, which leaves more room for Forsaken to remodel the model for memorization elimination.

Fig. 11: The performance of Forsaken versus overtraining

VI-D Memorization Elimination for Language Model

For language models, Carlini et al. [8] introduce an efficient way to measure the degree of memorization of a given sequence , which can be mathematically expressed as the following equation.


where is the memorization degree; is the log-perplexity of under the machine learning model ; is a skew-normal function with mean , standard deviation and skew . Log-perplexity is a common indicator of how “surprised” the language model is to see a given sequence, which can be computed according to Eq. 8.


Notably, is a specially defined indicator that can only be applied to sequence models, e.g., language models. As stated in [8], the risk of a given sequence being extracted by an adversary is positively related to .
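As a minimal sketch of the two quantities discussed above, the code below computes the log-perplexity of a sequence from per-token conditional probabilities, following the definition in [8]. It also computes exposure in its exact rank-based form from [8], which is our reading of that work and differs from the skew-normal approximation used in the text. The function names and input values are hypothetical.

```python
import math

def log_perplexity(token_probs):
    """Sum of -log2 Pr(x_i | x_1..x_{i-1}) over the tokens of a sequence."""
    return sum(-math.log2(p) for p in token_probs)

def exposure_from_rank(rank, num_candidates):
    """Rank-based exposure: log2(number of candidates) - log2(rank of the canary)."""
    return math.log2(num_candidates) - math.log2(rank)

# Conditional probabilities the model assigns to each token of a short sequence.
px = log_perplexity([0.5, 0.25])           # 1 bit + 2 bits = 3.0

most_exposed = exposure_from_rank(1, 1024)      # canary ranked first: 10.0
least_exposed = exposure_from_rank(1024, 1024)  # canary ranked last: 0.0
```

A lower log-perplexity means the model is less surprised by the sequence. A canary that ranks first among all candidate sequences has the maximum exposure, and successful memorization elimination should push its rank down and its exposure towards zero.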

Fig. 12: Comparing and test loss to across different training iterations of the dummy gradient generator.

To further evaluate the effectiveness of Forsaken, we use the Penn TreeBank (PTB) dataset (https://github.com/tomsercu/lstm/tree/master/dat) to train a language model with a two-layer LSTM [27] that has 512 hidden units. During the training process, we insert 200 canaries (a canary is a specially generated sequence defined in [8]) into ten different users’ local sets. The 200 canaries are treated as the elimination set. After 100 epochs of training, we conduct memorization elimination. Fig. 12 plots the experiment result, where is the averaged value of the 200 canaries. It can be seen that decreases to about 1.2 along with the training of our dummy gradient generator. Carlini’s experiment [8] points out that the success probability for the adversary to extract a given sequence is negligible when is less than 10. Therefore, the above experiment shows that Forsaken can significantly lower the probability that the adversary extracts the user’s private data from the unintended memorization. Meanwhile, from the change in test loss (an increase of about 10%), we can conclude that the performance loss caused by memorization elimination is acceptable.

VII Related Work

A significant amount of related work about machine learning security inspires our work in this paper.

Federated learning. Forsaken is designed for the emerging federated learning technique, which was first proposed by Google for privacy-preserving machine learning model training in the mobile crowdsensing scenario [18]. Federated learning significantly improves the security level of traditional distributed learning by raising the attack object from data to gradient. Nevertheless, Bagdasaryan et al. pointed out that the gradient was not as secure as Google declared, and was vulnerable to the adversarial example attack [3]. Therefore, the follow-up work on federated learning put a tremendous amount of effort into protecting the gradient privacy through different cryptographic tools, e.g., secret sharing [6], differential privacy [13], and homomorphic encryption [29]. The core of all these methods is that they significantly increase the difficulty of an attack by using a secure gradient aggregation method to implement the model update [26], which can also be used in Forsaken.

Although extensive research has been done on federated learning, there is still an unresolved practical problem: none of the existing work considers how to let a user securely quit a learning federation. After all, people increasingly realize the importance of individual privacy, and the GDPR [20] released by the European Union rules that a natural person should have the right to choose to remain or withdraw its private data without special statement.

Membership Inference. The most important evaluation tool used in Forsaken is the membership inference algorithm, i.e., membership oracle. The research of membership inference is first inspired by the membership privacy problem while deploying machine learning as a service [22]. Given a trained model , a training set and an arbitrary sample , membership inference answers the question whether is a member of based on the model output . Further, it makes it possible to determine whether the memorization of about the pattern of is directly trained by , an individual person’s private information in federated learning, or derived from other similar data.

Early membership inference against machine learning used multiple shadow models to train the membership oracle [22], which cost considerable computation resources. Later, the experiments of Salem et al. [21] showed that the membership oracle could still work even with only one shadow model. Considering efficiency and practicality in the federated learning scenario, we do not add the membership inference performance to the optimization objective of the dummy gradient generator. However, the principle of membership inference greatly inspired our definition of memorization elimination for federated learning.

Memorization of Machine Learning. Few existing works focus on the security of memorization in machine learning. Song et al. [25] proposed several encoding methods that made the model secretly “memorize” the training data during the training process. Correspondingly, the adversary could extract the memorized data from the trained model with a specially designed decoding method. In fact, the memorization mentioned in [25] is intentionally backdoored by the adversary, and to obtain the memorization, the training procedure (i.e., the normal loss function) has to be changed. Carlini et al. [8] studied the unintentional memorization phenomenon in language models and gave an excellent method to measure its exposure level. Besides the applicable range, the critical difference between the above two works and ours is that they focus on how to extract or measure the memorization of machine learning, whereas ours tries to eliminate the specific memorization.

VIII Conclusion

To date, most research on federated learning has concentrated on how to efficiently and securely memorize the training set. However, there was still a lack of research on how to help the user eliminate unexpected memorization. In this paper, we discussed this problem in depth and proposed Forsaken, a new framework of federated learning that provided the memorization elimination service for the user. To implement Forsaken, we first presented a quantified indicator, called forgetting rate, to measure the performance of memorization elimination. Then, inspired by the memorization management mechanism of the human body, we proposed a “learn to forget” method to achieve memorization elimination for machine learning. In the method, the user could stimulate the neurons of a machine learning model to eliminate the memorization of the specific data by training a dummy gradient generator. In particular, the dummy gradient could be treated according to the common procedure of federated learning, which indirectly ensured the security of Forsaken.

Although a novel memorization elimination method was proposed, there were still several areas where our work was limited in scope: 1) Forsaken was designed to fit the classification task of machine learning and could not directly handle other types of tasks, e.g., regression and clustering. 2) Although only several epochs were required to train the dummy gradient generator, the computation overhead of memorization elimination was still too high for some energy-limited devices when the size of the neural network parameters was large.


  • [1] Y. Aono, T. Hayashi, L. Wang, S. Moriai, et al. (2017) Privacy-preserving deep learning: revisited and enhanced. In International Conference on Applications and Techniques in Information Security, pp. 100–110. Cited by: §II-A.
  • [2] S. Awan, F. Li, B. Luo, and M. Liu (2019) Poster: a reliable and accountable privacy-preserving federated learning framework using the blockchain. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2561–2563. Cited by: §III-A.
  • [3] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov (2018) How to backdoor federated learning. arXiv preprint arXiv:1807.00459. Cited by: §VII.
  • [4] D. Bogdanov, S. Laur, and J. Willemson (2008) Sharemind: a framework for fast privacy-preserving computations. In European Symposium on Research in Computer Security, pp. 192–206. Cited by: item 1.
  • [5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. (2019) Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046. Cited by: §I.
  • [6] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2017) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. Cited by: §I, §II-A, §III-B, §IV-D, §V, §VII, footnote 1.
  • [7] D. Cao, S. Chang, Z. Lin, G. Liu, and D. Sun (2019) Understanding distributed poisoning attack in federated learning. In 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), pp. 233–239. Cited by: §V.
  • [8] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284. Cited by: §I, §III-A, §III-A, §VI-C, §VI-D, §VI-D, §VII, footnote 6.
  • [9] T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §VI-A.
  • [10] J. Davis and M. Goadrich (2006) The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. Cited by: §IV-A.
  • [11] R. L. Davis and Y. Zhong (2017) The biology of forgetting—a perspective. Neuron 95 (3), pp. 490–503. Cited by: §IV-C.
  • [12] D. Gao, Y. Liu, A. Huang, C. Ju, H. Yu, and Q. Yang (2019) Privacy-preserving heterogeneous federated transfer learning. In 2019 IEEE International Conference on Big Data (Big Data), pp. 2552–2559. Cited by: §III-B.
  • [13] R. C. Geyer, T. Klein, and M. Nabi (2017) Differentially private federated learning: a client level perspective. arXiv preprint arXiv:1712.07557. Cited by: §I, §II-A, §IV-D, §V, §VII, footnote 1.
  • [14] E. L. Harding, J. J. Vanto, R. Clark, L. Hannah Ji, and S. C. Ainsworth (2019) Understanding the scope and impact of the california consumer privacy act of 2018. Journal of Data Protection & Privacy 2 (3), pp. 234–253. Cited by: §I.
  • [15] B. Hitaj, G. Ateniese, and F. Perez-Cruz (2017) Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 603–618. Cited by: §IV-D.
  • [16] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §I.
  • [17] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu (2017) Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340. Cited by: §V.
  • [18] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: §I, §I, §IV-D, §VII.
  • [19] M. Nasr, R. Shokri, and A. Houmansadr (2018) Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646. Cited by: §II-B.
  • [20] G. D. P. Regulation (2016) Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46. Official Journal of the European Union (OJ) 59 (1-88), pp. 294. Cited by: §I, §VII.
  • [21] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes (2018) Ml-leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246. Cited by: §I, §II-B, §III-A, §VI-A, §VII.
  • [22] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §II-B, §II-B, §III-A, §VII, §VII.
  • [23] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §VI-A.
  • [24] V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434. Cited by: §III-A.
  • [25] C. Song, T. Ristenpart, and V. Shmatikov (2017) Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 587–601. Cited by: §VII.
  • [26] Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan (2019) Can you really backdoor federated learning?. arXiv preprint arXiv:1911.07963. Cited by: §V, §VII.
  • [27] M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association. Cited by: §VI-D.
  • [28] N. H. Tran, W. Bao, A. Zomaya, N. M. NH, and C. S. Hong (2019) Federated learning over wireless networks: optimization model design and analysis. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 1387–1395. Cited by: §III-A.
  • [29] R. Xu, N. Baracaldo, Y. Zhou, A. Anwar, and H. Ludwig (2019) HybridAlpha: an efficient approach for privacy-preserving federated learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, pp. 13–23. Cited by: §I, §II-A, §IV-D, §VII.
  • [30] Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19. Cited by: §I.
  • [31] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §III-A.
  • [32] Z. Zhang (2018) Improved adam optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pp. 1–2. Cited by: §IV-C.
  • [33] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal (1997) Algorithm 778: l-bfgs-b: fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 23 (4), pp. 550–560. Cited by: §IV-C, §VI-A.
  • [34] L. Zhu, Z. Liu, and S. Han (2019) Deep leakage from gradients. In Advances in Neural Information Processing Systems, pp. 14747–14756. Cited by: §I, §II-A, §II-A.