I-A Motivating Industrial Scenario in Alibaba
The industrial scenario in Alibaba that motivated federated submodel learning is the desire to provide customized and accurate e-commerce recommendations for billion-scale clients while keeping user data on local devices.
Currently, the recommendation systems in Alibaba are cloud based and require the server cluster to collect, process, and store numerous user data. In addition, the deployed recommendation models follow a golden paradigm of embedding^1 and Multi-Layer Perceptron (MLP): user data are first encoded into high-dimensional sparse feature vectors, then embedded into low-dimensional dense vectors, and finally fed into fully connected layers. To improve accuracy, Deep Interest Network (DIN) introduces the attention mechanism to activate the user’s historical behaviors, namely relative interests, with respect to the target item; Deep Interest Evolution Network (DIEN) further extracts latent interests and monitors interest evolution through a Gated Recurrent Unit (GRU) coupled with an attention update gate; and Behavior Sequence Transformer (BST) incorporates the transformer to capture the sequential signals underlying the user’s behavior sequence.
^1 In general, deep learning with a huge and sparse input space (e.g., e-commerce goods IDs, natural language texts, and locations) requires an embedding layer to first transform inputs into a lower-dimensional space. Additionally, the full embedding matrix tends to occupy a large proportion of the whole model parameters (e.g., in our evaluated DIN model and more than two-thirds in Gboard’s CIFG language model).
However, typical fields of user data involved in recommendation include the user profile (e.g., user ID, gender, and age), user behavior (e.g., the list of visited goods IDs and relevant information, such as category IDs and shop IDs), and context (e.g., time, display position, and page number). These data fields are all more or less sensitive, and some clients who value security and privacy highly may refuse to share their data. In addition, according to the General Data Protection Regulation (GDPR), which was legislated by the European Commission and took effect on May 25, 2018, any institution or company is prohibited from uploading user data and storing it in the cloud without explicit permission from the European Union users [7, 8]. Under such circumstances, refining the recommendation models and further providing accurate recommendations become urgent demands as well as thorny challenges in practice.
Federated learning, which decouples the ability to do machine learning from the need to upload and store data in the cloud, is a potential solution. However, the original framework of federated learning, proposed by Google researchers, requires each client to download the full machine learning model for training and inference, which is impractical for resource-constrained clients in the context of complex deep learning tasks. For example, as the largest online consumer-to-consumer platform in China, Taobao (owned by Alibaba) hosts roughly two billion goods in total, far more than the 10,000-word vocabulary in the natural language scenario of Google’s Gboard [2, 11, 12]. This implies that the full embedding matrix of goods has roughly two billion rows and occupies roughly 134GB of space when the embedding dimension is 18 and each element adopts a 32-bit representation. If each client directly downloads the full matrix for learning, it inevitably incurs huge overheads, which are unacceptable and unaffordable for one billion Taobao users with smart devices. To improve efficiency, we observe that any given user tends to browse, click, and buy a small number of goods, and thus just needs a tailored model for learning, which can sharply reduce the overheads and is more practical for mobile clients. Continuing with the above example, if a Taobao user’s historical data involve 100 goods, she only needs to pull and push the corresponding 100 rows, rather than the entire two billion rows, of the embedding matrix. Based on this key observation, we propose a new framework of federated learning, called federated submodel learning, as follows.
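The storage estimate above can be reproduced with simple arithmetic; the figures (two billion goods, embedding dimension 18, and 32-bit elements) come directly from the text:

```python
# Back-of-the-envelope size of the full goods embedding matrix (figures from the text).
num_goods = 2_000_000_000      # roughly two billion goods on Taobao
embedding_dim = 18             # embedding dimension
bytes_per_element = 4          # 32-bit representation

full_matrix_bytes = num_goods * embedding_dim * bytes_per_element
full_matrix_gib = full_matrix_bytes / 2**30
print(f"Full embedding matrix: ~{full_matrix_gib:.0f} GB")       # ~134 GB

# A user whose history involves 100 goods only needs 100 embedding rows:
submodel_bytes = 100 * embedding_dim * bytes_per_element
print(f"Required rows for 100 goods: {submodel_bytes} bytes")    # 7200 bytes
```

The seven-orders-of-magnitude gap between the two printed sizes is what motivates downloading only the required rows.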
I-B Framework of Federated Submodel Learning
At the beginning of one communication round, a cloud server first selects a certain number of eligible clients, typically end users whose mobile devices are idle, charging, and connected to an unmetered Wi-Fi network. These eligibility criteria are used to avoid any negative effect on the user experience, data usage, or battery life. Then, each chosen client downloads the part of the global model that she requires, namely a submodel, from the cloud server. For example, in the e-commerce scenario above, a client’s submodel mainly consists of the embedding parameters for the displayed and clicked goods in her historical data, as well as the parameters of the other network layers. Afterwards, the client trains the submodel over her private data locally. At the end of the round, the cloud server lets those chosen clients who are still alive upload the updates of their submodels and then aggregates the submodel updates to form a consensus update to the global model. Considering the convergence of the global model at the cloud server and of the submodels on clients, the above process is iterated for several rounds.
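The round structure described above can be sketched as follows; all function and variable names here are illustrative, not the paper’s API, and the aggregation step is shown in the clear (the secure scheme performs the same per-index sum obliviously):

```python
# Illustrative sketch of one federated submodel learning round.
# The global model is represented as a dict: row index -> row vector.

def run_round(global_model, candidate_clients, num_chosen):
    # 1. The cloud server selects eligible clients (idle, charging, unmetered Wi-Fi).
    chosen = [c for c in candidate_clients if c.is_eligible()][:num_chosen]

    updates = []
    for client in chosen:
        # 2. Each chosen client downloads only the rows (submodel) she requires.
        index_set = client.required_indices()
        submodel = {i: list(global_model[i]) for i in index_set}
        # 3. The client trains the submodel over her private local data.
        update = client.train(submodel)      # dict: row index -> delta vector
        updates.append(update)

    # 4. The surviving clients' submodel updates are aggregated per row index
    #    to form a consensus update to the global model.
    for update in updates:
        for i, delta in update.items():
            global_model[i] = [w + d for w, d in zip(global_model[i], delta)]
    return global_model
```

Note that rows no chosen client requires are never transmitted, which is the source of the efficiency gain over conventional federated learning.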
If each client leverages the full model rather than her required submodel for learning, federated submodel learning degenerates to conventional federated learning. Compared with the conventional framework, our new framework further decouples the ability to accomplish federated learning from the need to use the prohibitively large full model, which can dramatically improve efficiency. For example, in our evaluation, the size of a client’s desired submodel is only a tiny fraction of the full model’s size. Thus, our framework is more practical for resource-constrained clients and deep learning tasks.
I-C Newly Introduced Privacy Risks
Just as every coin has two sides, federated submodel learning not only brings efficiency but also introduces extra privacy risks. On one hand, compared with using the public full model in conventional federated learning, the download of a submodel and the upload of the submodel update require each client to provide an index set as auxiliary information, specifying the “position” of her submodel. However, the index set normally corresponds to the client’s private data. For example, to specify the required rows of the embedding matrix in the e-commerce scenario, a client mainly needs to provide the goods IDs in her user data as the index set. Thus, the disclosure of a client’s real index set to the cloud server can still be regarded as a leakage of the client’s private data, breaking the tenet of federated learning. On the other hand, compared with the aligned full model in federated learning, each client only submits the update of her customized and highly differentiated submodel in federated submodel learning. As a result, the aggregate update with respect to a certain index can come from one unique client, which means that the cloud server not only can ascertain that the client has a certain index but also can learn her detailed update. Both kinds of knowledge compromise the privacy of the client’s data. Further, such a privacy risk in e-commerce is more severe than that in natural language, because the goods IDs of different Taobao users are more differentiated than the vocabularies of different Gboard users. We detail and visualize these privacy risks in Section III-A and Fig. 2.
I-D Fundamental Problems and Challenges
In essence, to mitigate the above privacy risks, we need to jointly solve two fundamental problems modeled from the processes of downloading a submodel and uploading a submodel update, respectively. One is how a client can download a row from a matrix, maintained by an untrusted cloud server, without revealing which row (alternatively, the row index) to the cloud server. The other is how a client can modify a row of the matrix, still without revealing which row was modified or the altered content to the cloud server. In the terminology of file system permissions, the first problem has a “read-only” attribute, where the client only reads the file. In contrast, the second problem is in a “write” mode, where the client can edit the file. Further, since it must keep both the read and the write oblivious, the second problem appears more challenging than the first one. We now analyze these two problems in detail.
We start with the first problem. One trivial method is for the client to download the full matrix, as in conventional federated learning, and then extract the required row locally. Although this method perfectly hides the fetched row index, it incurs significant communication cost, which can be unaffordable for resource-constrained mobile devices, especially when the matrix is huge, e.g., representing a deep neural network. To avoid downloading the full matrix, Private Information Retrieval (PIR) [13, 14, 15] can be applied, which exactly matches our problem settings, including the read-only mode and the requirement to keep the retrieved elements private. The state-of-the-art constructions of private information retrieval include Microsoft’s SealPIR and Labeled PSI and Google’s PSIR, where the two Microsoft protocols have been deployed in the Pung private communication system. We note that another celebrated cryptographic primitive, called Oblivious Transfer (OT), is stronger than private information retrieval. It not only guarantees that the cloud server does not know which row the client has downloaded, as in private information retrieval, but also ensures that the client does not learn the other rows of the matrix, which is not needed in practical federated submodel learning. Therefore, if we consider the first problem independently, private information retrieval may be a good choice.
We next dissect the second problem. For a concrete row of the full matrix, if clients modify this row one by one, the cloud server definitely knows which clients modified this row and the detailed contents of their modifications. Thus, one feasible way is to first securely aggregate all the modifications without revealing any individual modification, and then apply the aggregate modification to the row of the full matrix once. In particular, such a guarantee can be provided by the secure aggregation protocol and by some other schemes for oblivious addition, e.g., those based on additively homomorphic cryptosystems [19, 20, 21]. With the secure aggregation guarantee, if more than one client participates in aggregation and at least one of their modifications is nonzero, then the cloud server cannot reveal which client(s) truly intend to modify this row or their detailed modifications. Further, a larger number of involved clients implies a stronger privacy guarantee. One extreme case is conventional federated learning, which lets all chosen clients in one communication round be involved, no matter whether they truly intend to modify this row or not. Thus, it offers the best privacy guarantee. Nevertheless, considering that each client needs to be involved for every row of the full matrix, it is too inefficient to be applicable in the large-scale deep learning context. The other extreme case is federated submodel learning, which simply lets those clients who really intend to modify this row be involved. Hence, each client only needs to be involved for those rows that she truly intends to modify, implying the best efficiency. However, different clients tend to modify highly differentiated or even mutually exclusive rows. For the joint modification of some row, chances are high (e.g., with probability in our evaluated Taobao dataset) that only one client is involved. Under such a circumstance, the secure aggregation guarantee no longer works, which leaks the client’s real intention and her detailed modification. In a nutshell, trivial solutions to the second problem cannot balance privacy and efficiency well, let alone support tuning between them.
I-E Our Solution Overview and Major Contributions
Jointly considering the above two fundamental problems and several practical issues, we propose a secure scheme for federated submodel learning. In our scheme, each chosen client generates three types of index sets locally: real, perturbed, and succinct. First, the real index set is extracted from a client’s private data and is kept secret from the other system participants, including the cloud server and any other chosen client. Second, the perturbed index set is used to interact with others in the download and upload phases. It is generated by applying randomized response twice, with one memoization step in between. Such a design, together with secure aggregation, gives the client self-controllable deniability regarding whether she really intends, or does not intend, to download some row and to upload the modification of that row, even if the client is chosen to participate in multiple communication rounds. The strength of this deniability is rigorously quantified using local differential privacy. Further, rather than trivially using the prohibitively large full index set as the questionnaire of randomized response in every communication round, we identify a necessary and sufficient index set, namely the union of the chosen clients’ real index sets. Considering the secrecy of each client’s real index set, we propose an efficient and scalable Private Set Union (PSU) protocol based on Bloom filters, secure aggregation, and randomization, allowing clients to obtain the union under the mediation of an untrusted cloud server without revealing any individual real index set. In particular, private set union promises a wide range of applications but has received little attention; due to unaffordable overheads, none of the existing protocols can yet be deployed in practice. Last, the succinct index set is derived from the intersection of the real and perturbed index sets, and it is used to prepare the data and submodel for local training.
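One way the Bloom-filter idea behind the private set union could work is sketched below. This is a simplification: the real protocol additionally hides each client’s individual filter behind secure aggregation and randomization, so the server only ever sees the aggregate. All names and parameters here are illustrative.

```python
import hashlib

def bloom_positions(item, m, k):
    # k hash positions for an item in an m-bit Bloom filter (illustrative hashing).
    return [int(hashlib.sha256(f"{item}:{j}".encode()).hexdigest(), 16) % m
            for j in range(k)]

def client_filter(index_set, m=1024, k=3):
    # A client encodes her real index set into an m-bit Bloom filter.
    bits = [0] * m
    for item in index_set:
        for pos in bloom_positions(item, m, k):
            bits[pos] = 1
    return bits

def union_filter(filters):
    # In the real protocol this sum is computed via secure aggregation, so no
    # individual client's filter is ever revealed; a nonzero aggregate bit means
    # at least one client set it, giving a Bloom filter of the union.
    m = len(filters[0])
    summed = [sum(f[i] for f in filters) for i in range(m)]
    return [1 if v > 0 else 0 for v in summed]

def maybe_in_union(item, uf, k=3):
    # Bloom filters admit false positives but no false negatives.
    return all(uf[pos] for pos in bloom_positions(item, len(uf), k))
```

A false positive only pads the resulting questionnaire with an extra index, so the union-as-superset property needed by the scheme is preserved.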
We summarize our key contributions in this work as follows:
To the best of our knowledge, we are the first to propose the framework of federated submodel learning and further to identify and remedy new privacy risks.
Our proposed secure scheme mainly features randomized response and secure aggregation to empower each client with tunable deniability regarding her real intention of downloading the desired submodel and uploading its update, thus protecting her private data. As a key building block, we designed an efficient and scalable private set union protocol based on Bloom filters and secure aggregation, which can be of independent and significant value in practice.
We instantiated federated submodel learning with Taobao’s e-commerce scenario, adopted Deep Interest Network (DIN) for recommendation, and implemented a prototype system. Additionally, we extensively evaluated it over one month of Taobao data. The evaluation and analysis results demonstrate the practical feasibility of our scheme, as well as its remarkable advantages over the conventional federated learning framework in terms of model accuracy and convergence, and communication, computation, and storage overheads. Specifically, compared with conventional federated learning, which diverges in the end, our scheme improves the highest Area Under the Curve (AUC) by 0.072. In addition, at the same security and privacy levels as conventional federated learning with secure aggregation, our scheme reduces the communication overhead on both the client and cloud server sides. Moreover, our scheme reduces the computation and memory overheads on the sides of the client and the cloud server, respectively. Furthermore, when the size of the full model scales further, it incurs no additional overhead to our scheme, whereas it prohibits conventional federated learning from being applied at all. Finally, for our private set union, when the number of chosen clients in one round is 100, the communication overhead per client is less than 1MB, and the computation overheads of the client and the cloud server are both less than 40s, even under a high dropout ratio of the chosen clients.
II Related Work
In recent years, federated learning has become an active topic in both academic and industrial fields. In this section, we briefly review some major focuses and relevant work. For more related work, we direct interested readers to the surveys by Li et al. and Yang et al.
First and most important is to identify and address the security and privacy issues of federated learning. Bonawitz et al. proposed a secure, communication-efficient, and failure-robust aggregation protocol in both honest-but-curious and active adversary settings. It ensures that the untrusted cloud server learns nothing but the aggregate (mathematically, the sum) of the model updates contributed by chosen clients, even if some clients drop out during the aggregation process. To bound the leakage of a certain client’s training data from her individual model update, several differentially private mechanisms have been proposed. McMahan et al. offered client-level differential privacy for recurrent language models based on the celebrated moments accountant scheme. Here, the moments accountant allows the release of all intermediate results during the training process, particularly the gradients per iteration; keeps track of the privacy loss in every iteration; and provides a tighter cumulative privacy guarantee under composition. However, in the practical federated learning scenario, only the model update after multiple iterations/epochs is revealed, whereas all intermediate gradients are hidden. Specific to this case, Feldman et al. analyzed the detailed amplification effect of hiding intermediate results on differential privacy. In contrast to these defense mechanisms, Bagdasaryan et al. developed a model replacement attack launched by malicious clients to backdoor the global model at the cloud server. Melis et al. exploited membership and property inference attacks to uncover features of the clients’ training data from model updates.
Second is to improve communication efficiency, especially over the expensive and limited uplink bandwidth of mobile clients. To overcome this bottleneck, two types of solutions have been proposed in general. One is to reduce the total number of communication rounds between the cloud server and the clients. A pioneering work is the federated averaging algorithm proposed by McMahan et al. Its key principle is to let each client locally train the global model for multiple epochs, and then upload the model update. Thus, it is more communication efficient than the common practice in conventional distributed learning of exchanging gradients per iteration in datacenter-based scenarios. The other, complementary way is to further reduce the size of the transmitted message in each communication round, particularly by compressing model updates. Typical compression techniques include sparsification, subsampling, and probabilistic quantization coupled with random rotation. For example, after quantization, the original float-type elements of the update of the global model can be encoded as integer-type values with a few bits [28, 29]. Considering that compressed model updates are discrete, whereas classic differentially private deep learning mechanisms, which hinge on the Gaussian mechanism, support only continuous inputs, Agarwal et al. proposed a Binomial mechanism to guarantee differential privacy for one iteration while retaining communication efficiency. Another effective approach to improving communication efficiency is to first apply dropout strategies to the global model, and then let clients train over the same reduced model architecture. As a result, the downloaded model and the uploaded model update are compressed in terms of dimension.
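The probabilistic quantization mentioned above can be sketched as follows. This is a generic unbiased stochastic quantizer, not any specific cited scheme; the level count and range parameters are illustrative:

```python
import random

def stochastic_quantize(values, num_levels, lo, hi):
    # Map each float in [lo, hi] to one of `num_levels` evenly spaced levels,
    # rounding up or down at random so the quantizer is unbiased in expectation.
    step = (hi - lo) / (num_levels - 1)
    out = []
    for v in values:
        pos = (v - lo) / step                  # fractional level index
        floor_idx = int(pos)
        frac = pos - floor_idx
        idx = floor_idx + (1 if random.random() < frac else 0)
        out.append(min(idx, num_levels - 1))   # small integer, encodable in a few bits
    return out

def dequantize(indices, num_levels, lo, hi):
    # Recover approximate float values from the transmitted level indices.
    step = (hi - lo) / (num_levels - 1)
    return [lo + i * step for i in indices]
```

With, say, 16 levels, each float-type element is transmitted as a 4-bit integer, which is the source of the bandwidth savings.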
Third is from learning theory. The federated learning framework has two atypical characteristics: non-independent and identically distributed (non-iid) and unbalanced data distributed over numerous clients. Such statistical heterogeneity makes most existing analysis techniques for iid data infeasible and poses significant challenges for designing theoretically robust and efficient learning algorithms. The federated averaging algorithm mentioned above, as a cornerstone of federated learning, empirically shows its effectiveness in some tasks, but was observed to diverge for a large number of local epochs. More specifically, it lets multiple chosen clients run mini-batch Stochastic Gradient Descent (SGD) in parallel, and then lets the cloud server periodically aggregate the model updates in a weighted manner, where the weights are proportional to the sizes of the clients’ training sets. Recently, Yu et al. advanced the convergence analysis of the global model in federated averaging by imposing smoothness and boundedness assumptions on the loss function. Their follow-up work further presented a momentum extension of parallel restarted SGD, which is compatible with federated learning. Different from the above work, Smith et al. focused on learning separate but related personalized models for distinct clients by leveraging multitask learning for shared representation. Chen et al. instead adopted meta-learning to enable client-specific modeling, where clients contribute information at the algorithm level rather than the model level to help train the meta-learner. Mohri et al. considered the unfairness issue that the global model can be unevenly biased toward different clients. They thus proposed a new agnostic federated learning framework where the global model can be optimized for any possible target distribution, formed as a mixture of the client distributions. Eichner et al. captured data heterogeneity in federated learning, particularly cyclic patterns, and offered a pluralistic solution for convex objectives and sequential SGD.
Fourth is regarding production and standardization. Google has deployed federated learning in its Android keyboard, Gboard, to polish several language tasks, including next-word prediction, query suggestion, out-of-vocabulary word learning, and emoji prediction. In particular, query suggestion used logistic regression as the triggering model for on-device training to determine whether a candidate suggestion should be shown. The other three tasks leveraged a tailored Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN), called Coupled Input and Forget Gate (CIFG). Google’s team also detailed their initial system design and summarized practical deployment issues, such as irregular device availability, unreliable network connectivity and interrupted execution, orchestration of lock-step execution across heterogeneous devices, and limited device storage and computation resources. They also pointed out some future optimization directions regarding bias, convergence time, device scheduling, and bandwidth. To facilitate open research, Google integrated a federated learning simulation interface into its deep learning framework, called TensorFlow Federated. However, this open-source module lacks several core functionalities, e.g., security and privacy preserving mechanisms, on-device training, socket communication between the cloud server and clients, task scheduling, and dropout/exception handling. These gaps significantly hinder federated learning related production deployments at other commercial companies. Caldas et al. released a benchmark for federated learning, called LEAF. Currently, LEAF comprises some representative datasets, evaluation metrics, and a reference implementation of federated averaging.
Parallel to existing work, where clients use the same (simplified) global model for learning, we propose a novel federated submodel learning framework for the sake of scalability. Under this framework, we identify and remedy new security and privacy issues, due to the dependence between the position of a client’s desired submodel and her private data as well as the misalignment of clients’ submodel updates in aggregation.
TABLE I: Frequently used notations and abbreviations.
|Global/full model at the cloud server, denoted by a matrix with rows and columns|
|Full row index set of the global model|
|The set of clients chosen by the cloud server in one communication round, and its cardinality|
|The up-to-date set of clients who remain alive throughout the communication round|
|A chosen client|
|Client ’s real index set, which corresponds to her local data and specifies the truly required rows of the global model|
|Client ’s perturbed index set, used to download the submodel from the cloud server and to securely upload the submodel update to the cloud server|
|Client ’s downloaded submodel|
|Client ’s succinct submodel for local training|
|Client ’s succinct submodel update|
|Client ’s uploaded submodel update, obtained by padding the succinct submodel update with zero vectors|
|A privacy level/budget of local differential privacy|
|A Bloom filter with bits and hash functions, representing/accommodating a set of elements|
|Dimension of a vector in the secure aggregation protocol|
|Client ’s probability parameters used to generate the perturbed index set|
|The probability that an index in client ’s real index set falls into her perturbed index set|
|The probability that an index not in client ’s real index set falls into her perturbed index set|
|The probability of the cloud server ascertaining that an index belongs to some client’s real index set and also learning her detailed update with respect to this index from the securely aggregated submodel update|
|The probability of the cloud server ascertaining that an index does not belong to some client’s real index set from the securely aggregated submodel update|
|The expected cardinality of each client’s real index set|
|The least residue system modulo a given integer|
|A level of the stochastic quantization mechanism|
|FL|Conventional federated learning|
|SFL|Secure federated learning, namely conventional federated learning with secure aggregation|
|SFSL|Secure federated submodel learning|
|CPP|Choice of probability parameters|
In this section, we elaborate on the privacy risks sketched in Section I-C and formally define the corresponding security requirements. We also review some existing building blocks.
We first introduce some necessary notation. For clarity, frequently used notations and abbreviations throughout this paper are also listed in Table I. We use a two-dimensional matrix with rows and columns to represent the global/full model. Such a matrix-based representation not only suffices for the recommendation models used in Alibaba but also easily degenerates to the widely used vector-based representation [18, 30] by setting the number of columns to 1. Additionally, we let the full index set denote the entire row index set of the matrix. Moreover, we consider the set of clients who are selected by the cloud server to participate in one communication round of federated submodel learning. For a chosen client, her real index set denotes those row indices of the global model that her user data involve.
III-A Details on Privacy Risks and Security Requirements
We now expand on the two kinds of privacy leakage that federated submodel learning brings in compared with conventional federated learning, and provide Fig. 2 for illustration. We adopt an honest-but-curious security model, in which the cloud server and all clients follow the designed protocol but try to glean sensitive information about others.
The first kind of privacy leakage is the disclosure of a client’s real index set, which specifies the position of a submodel and implies the client’s private data, to the cloud server. For example, each row of the embedding matrix for goods in the recommendation model is linked with a certain goods ID, which indicates that a client’s real index set, specifying her required rows of the embedding matrix, is in fact the goods IDs in her private data. Similarly, when federated submodel learning is applied to the natural language scenario (e.g., next-word prediction in Gboard), a client’s real index set to locate her wanted parameters of word embedding is actually the vocabulary extracted from her typed texts. Thus, the disclosure of a client’s real index set to the cloud server is still regarded as the leakage of the client’s private data. In contrast, for conventional federated learning, each client essentially uses the full index set, which is public to the cloud server and all other clients, and does not reveal any private information.
The second kind of privacy leakage stems from the aggregation of misaligned submodel updates, through which the cloud server may not only learn that a certain client has a concrete index but also learn her detailed update with respect to this index. Besides the fact that the real index reveals a client’s private data, the client’s individual submodel update can still memorize, or even allow reconstruction of, her private data via “model inversion” attacks [42, 43, 44, 1, 45]. To conceal a client’s individual update in conventional federated learning, the secure aggregation protocol can be applied, which allows the cloud server to obtain the sum of multiple vectors without learning any individual vector. As shown in Fig. 2(a), with respect to one index, Alice submits her update vector, whereas Bob and Charlie submit two zero vectors. The secure aggregation protocol guarantees that the cloud server only obtains the sum of the three vectors and does not learn the content of any individual vector. This further implies that from the aggregate result, the cloud server can merely infer that at least one client has the index, but cannot identify which client(s). Such a functionality is essentially analogous to anonymization. In a nutshell, the zero updates from Bob and Charlie function as two shields for Alice. However, in federated submodel learning, due to the differentiation and misalignment of the clients’ submodels, the “zero” shields from other clients vanish, and the aggregate update with respect to a certain index can come from one unique client, making secure aggregation ineffective. For example, in Fig. 2(b), only Alice, who has index 2, submits her update, whereas Bob and Charlie submit nothing. Without the blindings from Bob and Charlie, the cloud server not only knows that Alice has index 2 while Bob and Charlie do not, but also learns Alice’s detailed update.
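The zero-vector “shield” effect can be illustrated with a toy pairwise-masking aggregator in the spirit of the cited secure aggregation protocol. This is a deliberate simplification: real protocols derive the masks from key agreement and handle client dropouts, and the concrete update values below are invented for the example.

```python
import random

def masked_updates(updates, seed=42):
    # updates: list of per-client vectors of equal length.
    # Each ordered pair of clients (i, j), i < j, shares a random mask; client i
    # adds it and client j subtracts it, so all masks cancel in the overall sum.
    n, dim = len(updates), len(updates[0])
    rng = random.Random(seed)
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.uniform(-1e6, 1e6) for _ in range(dim)]
            for d in range(dim):
                masked[i][d] += mask[d]
                masked[j][d] -= mask[d]
    return masked

# With respect to one row index: Alice has a real update, while Bob and
# Charlie contribute zero vectors as "shields".
alice, bob, charlie = [0.3, -0.1], [0.0, 0.0], [0.0, 0.0]
blinded = masked_updates([alice, bob, charlie])
aggregate = [round(sum(col), 6) for col in zip(*blinded)]
print(aggregate)  # equals Alice's update, yet each blinded vector looks random
```

If Bob and Charlie submitted nothing for this index, as in federated submodel learning, the server would receive Alice’s masked vector alone and could unmask it from the aggregate, which is exactly the leakage described above.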
Given the two kinds of privacy leakage above, we define the corresponding security requirements. First, regarding the disclosure of real index sets when clients interact with the cloud server, we require that each client have plausible deniability of whether a certain index is or is not in her real index set. To measure the strength of this plausible deniability, we adopt local differential privacy, a variant of standard differential privacy in the local setting. Specifically, the perturbation in local differential privacy is performed by the clients in a distributed manner, rather than relying on a data curator, as a trusted authority, to conduct centralized perturbation as in differential privacy. Thus, the privacy of an individual client’s data is preserved not only from external attackers but also from the untrusted data curator, e.g., the cloud server in our context. Due to its intriguing security properties, local differential privacy for various population statistics has recently seen significant industrial deployment (e.g., at Google [46, 47], Apple, and Microsoft), as well as lasting academic attention [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]. We now present the formal definition of local differential privacy:
Definition 1 (Local Differential Privacy).
A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-local differential privacy, if for any pair of inputs from a client, denoted as $v$ and $v'$, and for any possible output of $\mathcal{M}$, denoted as $o$, we have
$$\Pr[\mathcal{M}(v) = o] \le e^{\epsilon} \cdot \Pr[\mathcal{M}(v') = o],$$
where $\epsilon$ is a privacy budget controlled by the client. A smaller $\epsilon$ offers a better privacy guarantee.
Intuitively, the above definition says that the output distribution of the randomized mechanism does not change too much given distinct inputs from the client. Thus, local differential privacy formalizes a sort of plausible deniability: no matter what output is revealed, it is approximately as likely to have come from one input as from any other input. In addition, when local differential privacy is applied to obscure the membership of a certain index in federated submodel learning, the inputs and the outputs are boolean values, where the possible inputs (resp., outputs) are two states: a certain index being "in" or "not in" a client's real (resp., revealed) index set. Moreover, we can check that conventional federated learning provides the strongest deniability, where the level of local differential privacy is $\epsilon = 0$ for each client. The reason is that no matter whether an index is or is not in a client's real index set (different inputs), this index will definitely be revealed (the same output). In contrast, federated submodel learning provides the weakest deniability, where the level of local differential privacy is $\epsilon = \infty$ for each client, because if an index is in (resp., not in) a client's real index set (different inputs), this index will definitely (resp., definitely not) be revealed, i.e., the same output occurs with probability 1 (resp., 0).
Second, direct secure aggregation of submodel updates is the most efficient but least secure case: it can leak whether some client has a certain index as well as her detailed update. The other extreme case is conventional federated learning with secure aggregation, which is the most secure but least efficient: all participating clients upload full model updates, which perfectly prevents the privacy leakages due to the misalignment of customized submodels. To enable clients to tune privacy and efficiency in a fine-grained manner, we define a client-controllable privacy protection mechanism for submodel updates aggregation.
A privacy protection mechanism for submodel updates aggregation is client controllable, if it enables participating clients to determine the probabilities of the following two complementary events: From the securely aggregated submodel update,
Event 1: the cloud server ascertains that an index belongs to some client’s real index set and also learns her detailed update with respect to this index;
Event 2: the cloud server ascertains that an index does not belong to some client’s real index set.
We note that revealing the states of some clients having or not having a certain index should both be regarded as privacy leakages. Furthermore, when the above definition is applied to conventional federated learning with at least two clients participating in aggregation, the probability of Event 1 is 0, and the probability of Event 2 is also 0 for those indices within the union of the chosen clients' real index sets. For an index outside the union, e.g., index 5 shown in Fig. 2(a), the probability of Event 2 approaches 1. The reason is that from the aggregate zero vector, the cloud server almost certainly ascertains that all clients do not have this index, except in some rare cases (e.g., Alice and Bob submit two vectors whose elements differ only in sign, and Charlie submits a zero vector).
III-B Building Blocks
We review randomized response, secure aggregation, and Bloom filter underlying our design.
III-B1 Randomized Response
Randomized response, due to Warner in 1965, is a survey technique in the social sciences for collecting statistical information about illegal, embarrassing, or sensitive topics, where the respondents want to preserve the privacy of their answers. A classical example illustrating this technique is the question "Are you a member of the communist party?" For this question, each respondent flips a fair coin in secret and tells the truth if it comes up tails; otherwise, she flips a second fair coin and responds "Yes" if heads and "No" if tails. Thus, a communist (resp., non-communist) will answer "Yes" with probability 3/4 (resp., 1/4) and "No" with probability 1/4 (resp., 3/4).
The intuition behind randomized response is that it provides plausible deniability for both "Yes" and "No" answers. In particular, a communist can attribute her response of "Yes" to the event that the first and second coin flips were both heads, which occurs with probability 1/4. Meanwhile, a non-communist can attribute her response of "No" to the event that the first coin came up heads and the second tails, which also occurs with probability 1/4. Furthermore, the plausible deniability of randomized response can be rigorously quantified by local differential privacy. As analyzed in [62, 46], for a one-time response, each respondent has local differential privacy at the level $\epsilon = \ln 3$, irrespective of any attacker's prior knowledge.
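As a concrete illustration, the two-coin mechanism and its privacy level can be sketched in Python (the function name and the simulation are ours, not part of any protocol):

```python
import math
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Warner's two-coin mechanism: answer truthfully if the first fair
    coin comes up tails; otherwise let a second fair coin decide."""
    if rng.random() < 0.5:          # first coin: tails -> tell the truth
        return truth
    return rng.random() < 0.5       # first coin: heads -> second coin decides

# Pr["Yes" | member] = 3/4 and Pr["Yes" | non-member] = 1/4, so a one-time
# response satisfies local differential privacy at level ln((3/4)/(1/4)) = ln 3.
epsilon = math.log((3 / 4) / (1 / 4))
```

Simulating many responses with a fixed seed confirms the 3/4 and 1/4 answer frequencies empirically.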
III-B2 Secure Aggregation
An individual model update may leak a client's private data under the notorious model inversion attack. Nevertheless, to update the global model in federated learning, the cloud server does not need access to any individual model update and only requires the aggregate, basically the sum, of multiple model updates. For example, if $n$ clients participate in the aggregation, where client $i$ holds a vector $\mathbf{x}_i$ of dimension $d$, the cloud server should obtain just the sum $\sum_{i=1}^{n} \mathbf{x}_i$, while each individual $\mathbf{x}_i$ remains secret. For this purpose, and considering the characteristics of mobile devices, particularly limited and unstable network connections and common dropouts, Google researchers proposed a secure aggregation protocol. By virtue of this oblivious addition functionality, secure aggregation in federated learning further ensures that even if the model inversion attack succeeds, the attacker (e.g., the honest-but-curious or actively adversarial cloud server, or an external intruder) can only infer that a group of clients has a certain data item but cannot identify which concrete client. This functionality is similar to anonymization. In what follows, we briefly review the secure aggregation protocol in terms of communication settings, technical intuitions, scalability, and efficiency.
First, we introduce its communication settings. During the aggregation process, a client can neither establish direct communication channels with other clients nor natively authenticate other clients. However, each client has a secure (private and authenticated) channel with the cloud server. Thus, if one client intends to exchange messages with other clients, she needs to use the cloud server as a relay. In addition, to guarantee confidentiality and integrity against the mediating cloud server, client-to-client messages should be encrypted with symmetric authenticated encryption, where the secret key is set up through Diffie-Hellman key exchange between the two clients. Moreover, to defend against active adversaries, a digital signature scheme is required for consistency checks. These basic settings distinguish the secure aggregation protocol from other relevant work on oblivious addition [19, 20, 21], or, more generally, secure multiparty computation [63, 64, 65, 66], which requires direct peer-to-peer communication between clients, assumes the availability of multiple noncolluding cloud servers, or resorts to a trusted third party for key generation and distribution.
Second, we outline the technical intuitions behind secure aggregation. Each client doubly masks her private data with a self mask and a mutual mask. Here, the self mask is chosen by the client alone, whereas each mutual mask is agreed on with another client through Diffie-Hellman key exchange and cancels additively when the clients' vectors are summed. Considering that some clients may drop out at any point, leaving masks that cannot be canceled, each client uses a threshold secret sharing scheme to split both her private seed of a Pseudo-Random Number Generator (PRNG), which generates the self mask, and her private key, which generates the mutual masks, and then distributes the shares to the other clients. As long as some minimum (no less than the threshold) number of clients remain alive, they can jointly help the cloud server remove the self masks of live clients and the mutual masks between dropped and live clients.
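To make the double-masking idea concrete, the following toy sketch (our own simplification: seeds are handled in the clear, and the secret-sharing machinery for dropouts is omitted) shows how the self masks and the pairwise mutual masks cancel in the aggregate:

```python
import random

def prng_mask(seed: int, dim: int) -> list:
    """Expand a seed into a mask vector with a PRNG."""
    return random.Random(seed).choices(range(1_000_000), k=dim)

def doubly_mask(i: int, x: list, self_seed: int, pair_seeds: dict) -> list:
    """Client i masks her vector x with a self mask plus pairwise mutual
    masks; the mutual mask shared with client j is added if i < j and
    subtracted otherwise, so each pair's contributions cancel in the sum."""
    dim = len(x)
    y = [a + b for a, b in zip(x, prng_mask(self_seed, dim))]
    for j, seed in pair_seeds.items():
        sign = 1 if i < j else -1
        y = [a + sign * b for a, b in zip(y, prng_mask(seed, dim))]
    return y
```

In the real protocol, the server learns each live client's self-mask seed only through threshold secret sharing; here the sum is exposed by simply subtracting the self masks.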
Third, we present the scalability and efficiency of the secure aggregation protocol. We list its communication, computation, and storage complexities in Table II, where $n$ is the number of clients involved in the aggregation, and $d$ denotes the number of data items held by each client, i.e., the dimension of her data vector. We can see that this protocol is quite efficient for large-scale data vectors, especially in terms of communication overhead, and thus is applicable to mobile scenarios. In particular, as reported in the original evaluation, even with thousands of clients involved in the aggregation, the communication overhead of the secure aggregation protocol expands only by a small constant factor over sending data in the clear.
III-B3 Bloom Filter
The Bloom filter, conceived by Bloom in 1970, is a space-efficient probabilistic data structure for representing a set whose elements come from a huge domain. When testing whether an element is a member of the set, a false positive is possible, but a false negative is impossible. In other words, an element that is reported to be present in the set may not actually belong to it, whereas an element that is reported to be absent definitely does not belong to it. We describe its technical details and properties as follows.
A Bloom filter is an $m$-bit vector with all bits initially set to 0. In addition, it requires $k$ different independent hash functions, whose output range is $\{1, 2, \ldots, m\}$, corresponding to the positions of the Bloom filter. To represent a set of $n$ elements, we apply the $k$ hash functions to each element and set the bits at the positions of the hash values to 1. In the membership test phase, to check whether an element belongs to the set, we simply inspect the bits at the positions of its hash values. If any of these bits is 0, the element is definitely not in the set. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. Specifically, the false positive rate $p$ of a Bloom filter depends on the length $m$ of the Bloom filter, the number $k$ of hash functions, and the cardinality $n$ of the set. According to [68, 69], its detailed formula is given as
$$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$
Given $m$ and $n$, to minimize the false positive rate, the optimal number of hash functions is
$$k = \frac{m}{n} \ln 2.$$
In addition, given $n$ and assuming the optimal number of hash functions is used, to achieve a desired false positive rate $p$, the optimal length of the Bloom filter should be
$$m = -\frac{n \ln p}{(\ln 2)^{2}}.$$
Thus, the optimal number of bits per element of the set is
$$\frac{m}{n} = -\frac{\ln p}{(\ln 2)^{2}} \approx -1.44 \log_{2} p,$$
and the corresponding number of hash functions is
$$k = \frac{m}{n} \ln 2 = -\log_{2} p.$$
The above deductions mean that for a given false positive rate, the length of a Bloom filter is proportional to the size of the set being filtered, while the required number of hash functions only relies on the target false positive rate.
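A minimal Bloom filter following these sizing formulas might look as follows (the hashing scheme, salting SHA-256 with the hash index, is an illustrative choice of ours):

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_expected: int, fp_rate: float):
        # m = -n ln p / (ln 2)^2 and k = -log2 p, rounded up
        self.m = math.ceil(-n_expected * math.log(fp_rate) / math.log(2) ** 2)
        self.k = math.ceil(-math.log2(fp_rate))
        self.bits = [0] * self.m

    def _positions(self, item):
        """Derive k positions by salting SHA-256 with the hash index."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

Inserting 100 elements with a 1% target rate yields no false negatives and roughly 1% false positives on outside queries.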
We further introduce an appealing property of Bloom filters when taking the union of the underlying sets. To represent $c$ sets $S_1, \ldots, S_c$, we use $c$ Bloom filters $B_1, \ldots, B_c$ with the same length and the same hash functions. Then, the union of these sets, i.e., $\bigcup_{i=1}^{c} S_i$, can be represented by the Bloom filter obtained by performing bitwise OR operations over the $c$ Bloom filters, i.e., $B_1 \vee B_2 \vee \cdots \vee B_c$. Such a union operation is lossless (implying that the false positive rate remains unchanged) in the sense that the resulting Bloom filter is identical to a Bloom filter created from scratch using the union of these sets. In addition, because the Bloom filter needs to accommodate the union of the $c$ sets, the parameter $n$, denoting the cardinality of the represented set, should be determined by estimating the cardinality of the union $\left|\bigcup_{i=1}^{c} S_i\right|$ rather than that of each individual set $S_i$.
We finally present a generalized version of the Bloom filter, called the counting Bloom filter. It consists of counters/integers rather than bits and can represent a multiset, where an element can occur more than once. The main difference from the plain Bloom filter is that, when inserting an element, we increment the counters at the positions of its hash values by one. Thus, compared with the membership test of the Bloom filter, the counting Bloom filter further supports more general counting queries on a given element, e.g., whether the count of a given element in the multiset is smaller than a certain threshold.
IV Design of Secure Federated Submodel Learning
In this section, we present the design rationale and the design details of Secure Federated Submodel Learning (SFSL).
IV-A Design Rationale
We illustrate our key design principles mainly through demonstrating how to handle two fundamental problems raised in Section I-D and how to resolve several practical issues.
As shown in Fig. 3, we handle the two fundamental problems in a unified manner rather than separately. During both the download and upload phases, a client consistently uses a perturbed index set in place of her real index set. During the local training phase, in contrast, the client leverages the intersection of her real index set and her perturbed index set to prepare the succinct submodel and the involved user data. By interacting with the outside world only through the perturbed index set, the client retains plausible deniability of whether some index is or is not in her real index set. Specifically, the client generates her perturbed index set locally with randomized response as follows. First, the sensitive question asked by the cloud server here is "Do you have a certain index?". Then, the client answers "Yes" with two customized probabilities, conditioned on whether the index is or is not in her real index set. These two probabilities allow the client to fine-tune the balance between privacy and utility.
We further examine the feasibility of our index set perturbation method in handling the two fundamental problems. For the first problem, in the download phase, if a client intends to download a certain row of the full matrix and actually downloads it, she can attribute her action to randomization, i.e., to the event that an index not in a client's real index set returns a "Yes" answer. Similarly, if a client does not intend to download the row and actually does not, she can still attribute her action to randomization, i.e., to the event that an index in a client's real index set returns a "No" answer. Regarding the second problem, in the upload phase, the usage of the perturbed index set still empowers a client to deny her underlying intention of modifying or not modifying some row of the full matrix, even if the cloud server observes her binary action of modifying or not modifying. Additionally, for a concrete row, two different groups of clients are involved in the joint modification: (1) one group consists of those clients who intend to modify the row and contribute nonzero modifications; and (2) the other group comprises those clients who do not intend to modify the row and pretend to modify it by submitting zero modifications. Under the secure aggregation guarantee, even though the cloud server observes the aggregate modification, it is hard for the cloud server, as an adversary, to identify any individual modification and further to infer whether some client originally intended to perform a modification or not. The hardness is controlled by the sizes of the two groups, or alternatively by the probabilities that an index in and not in the real index set returns a "Yes" answer, which are fully tunable by clients.
In addition to these two basic problems, there still exist two practical issues to be solved before the above index set perturbation method can apply to federated submodel learning. The first issue regards practical efficiency, i.e., whether it is practical and necessary for the cloud server to ask "Do you have a certain index?" for each index in the full index set. In our context, the number of rows of the matrix representing the deep learning model is on the order of billions. Thus, it is impractical for a client to answer billion-scale questions and further download and securely upload those rows of the full model with "Yes" answers. We therefore narrow down the scope of the questions and identify a sufficient and necessary index set, namely the union of the chosen clients' real index sets. Our optimization is inspired by an example: if a client's real index set is of size 100 and the full index set is of size 1 billion, then using the probability parameters in the survey of party membership (3/4 and 1/4), her expected number of "Yes" answers is roughly $2.5 \times 10^8$. Such a calculation implies that the dominant "Yes" answers are those that are "No" in reality but "Yes" due to randomness. Nevertheless, most of the "No"-to-"Yes" answers are useless. More specifically, for those indices that do not belong to any client's real index set, e.g., index 5 in Fig. 2 and Fig. 3, although a fraction (1/4 in expectation) of clients upload zero vectors for randomization, the cloud server can still infer from the aggregate zero vectors that these clients do not actually have the indices. Thus, it is not necessary to cover any index outside the union.
Accompanying the first issue, another fundamental and thorny problem arises: how multiple clients can obtain the union of their real index sets under the mediation of an untrusted cloud server without revealing any individual client's real index set to others, i.e., the need for a private set union protocol. Considering that no existing scheme satisfies the atypical setting and the stringent requirements of federated submodel learning, we design a novel private set union scheme based on Bloom filter, secure aggregation, and randomization. We first let each chosen client represent her real index set as a Bloom filter. Then, unlike the common practice of deriving the union of sets by performing bitwise OR operations over their Bloom filters, which naturally requires both addition and multiplication operations, we let the cloud server directly "sum" the Bloom filters. Here, the sum can be computed obliviously and efficiently under the coordination of the untrusted cloud server with the secure aggregation protocol. The aggregate Bloom filter is actually a counting Bloom filter, equivalent to one constructed from scratch by inserting each set in sequence. Besides the membership information, the counting Bloom filter also contains the count number of each element in the union. To prevent such an undesirable leakage, we let each client replace the 1 bits in her Bloom filter with random integers, while keeping each 0 bit unchanged. When recovering the union of real index sets, one naive method for the cloud server is to do membership tests over the full index set, which is prohibitively time consuming and can also introduce a large number of false positives. To handle these problems, we let the cloud server first divide the full index set into a certain number of partitions and then let each client fill in a bit vector indicating whether any element of her real index set falls into each partition.
Then, just like computing the union with Bloom filters, the cloud server can securely determine those partitions that contain clients’ indices to further facilitate efficient union reconstruction.
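The core of the private set union step can be sketched as follows (the partitioning of the full index set is omitted, a plain vector sum stands in for secure aggregation, and all names and toy parameters are ours):

```python
import hashlib
import random

M, K = 256, 4   # toy Bloom filter length and number of hash functions

def positions(item):
    """Derive K positions by salting SHA-256 with the hash index."""
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def randomized_filter(index_set, rng):
    """Bloom filter with each 1 bit replaced by a random positive integer,
    so the aggregate no longer reveals per-element counts."""
    vec = [0] * M
    for idx in index_set:
        for pos in positions(idx):
            vec[pos] = rng.randint(1, 2**20)
    return vec

def recover_union(aggregate, candidates):
    # An index is (probably) in the union iff all its positions are nonzero.
    return {c for c in candidates if all(aggregate[p] for p in positions(c))}
```

Because every inserted index sets all of its positions to positive values in its owner's filter, the recovered union has no false negatives; false positives follow the usual Bloom filter rate.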
The second issue regards the longitudinal privacy guarantee when a client is chosen to participate in multiple communication rounds. The initial version of randomized response provides a rigorous privacy guarantee only when a respondent answers the same question once, supporting only a one-time response to "Do you have a certain index?" in our context. Thus, we need to extend the original randomized response mechanism to allow repeated responses from the same client to already answered indices in a privacy-preserving manner. Our extension leverages key principles from Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) [46, 47] and plays the randomized response game twice with a memoization step in between. Specifically, the noisy answers generated by the inner randomized response are memoized and permanently replace the real answers in the outer randomized response. This ensures that even if a client responds to the membership of a concrete index an infinite number of times, she still holds plausible deniability of her real answer, where the level of deniability is lower bounded by that of the memoized noisy answer.
IV-B Design Details
Following the guidelines in Section IV-A, we propose a secure scheme for federated submodel learning. We introduce the scheme in a top-down manner, where we first give an overview of its top-level architecture and then show two underlying modules, namely index set perturbation and private set union. For the sake of clarity, we outline our design in Algorithm 1, Algorithm 2, and Algorithm 3.
IV-B1 Secure Federated Submodel Learning
Before presenting our secure federated submodel learning framework, we first briefly review the federated averaging algorithm, which is the cornerstone and core of conventional federated learning. In particular, federated averaging is a synchronous distributed learning method for non-IID and unbalanced training data distributed across massive numbers of communication-constrained clients, under the coordination of a cloud server. At the beginning of one communication round, the cloud server sends the up-to-date parameters of the global model and the training hyperparameters to some clients. Here, the training hyperparameters include the optimization algorithm, typically mini-batch SGD, the local batch size (the number of training samples used to locally update the global model once, namely per iteration), the number of local epochs (the number of passes over a client's entire training data), and the learning rate. Then, each chosen client trains the global model on her data and uploads the update of the global model together with the size of her training data to the cloud server. The cloud server takes a weighted average of all updates, where one client's weight is proportional to the size of her local data, and finally adds the aggregate update to the global model.
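In code, one round of federated averaging reduces to a size-weighted mean of the client updates (a schematic version of ours, with models flattened to plain vectors):

```python
def federated_averaging_round(global_model, client_updates):
    """One FedAvg round: average the client updates with weights
    proportional to local training-set sizes, then apply to the model."""
    total = sum(n for _, n in client_updates)
    dim = len(global_model)
    avg = [0.0] * dim
    for update, n in client_updates:
        for j in range(dim):
            avg[j] += (n / total) * update[j]
    return [w + d for w, d in zip(global_model, avg)]
```

A client holding three times as much data as another pulls the average three times as strongly toward her update.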
We now present secure federated submodel learning in Algorithm 1, which generalizes the federated averaging algorithm to support effective and efficient submodel learning, preserving desirable security and privacy properties while accommodating the unstable and limited network connections of mobile devices. At the initial stage, the cloud server randomly initializes the global model (Line 1). For each communication round, the cloud server first selects some clients to participate (Line 3) and also maintains an up-to-date set of clients who remain alive throughout the whole round. A chosen client determines her real index set based on her local data, which specifies the "position" of her truly required submodel (Line 10). For example, if the visited goods IDs of a Taobao user correspond to the first, second, and fourth goods, then she requires the first, second, and fourth rows of the embedding matrix for goods IDs, which further implies that her real index set should contain these three indices. Then, the cloud server launches the private set union protocol to obtain the union of all chosen clients' real index sets while keeping each individual client's real index set secret (Lines 4 and 11). The union will be further delivered to live clients, based on which each client can perturb her real index set with a customized local differential privacy guarantee (Line 12). In addition, each client will use the perturbed index set, rather than the real index set, to download her submodel and upload the submodel update (Lines 13 and 19). In other words, when interacting with the cloud server, a client's real index set is replaced with her perturbed index set, which provides deniability of her real index set and thus obscures her training data. Upon receiving the perturbed index set from a client, the cloud server stores it for later usage and returns the corresponding submodel and the training hyperparameters to the client (Line 6).
Based on the intersection of the real index set and the perturbed index set, called the succinct index set, the client extracts a succinct submodel and prepares the involved data as the succinct training set (Line 14). For example, a Taobao user receives from the cloud server the submodel whose row indices come from her perturbed index set, but she only needs to train the succinct submodel whose row indices come from the intersection, over her local data involving the corresponding goods IDs. After training under the preset hyperparameters, the client derives the update of the succinct submodel (Line 15) and further prepares the submodel update to be uploaded with the perturbed index set by placing the update of the succinct submodel in the rows with the succinct indices and padding zero vectors in the other rows (Line 16). Additionally, to facilitate the cloud server in averaging submodel updates according to the sizes of the relevant local training data, each client also needs to count the number of her samples involving every index in the perturbed index set (Line 17). In particular, the numbers of samples involving the indices outside the succinct index set are all zero. Furthermore, each client prepares the submodel update to be uploaded by multiplying each row with the corresponding count number, namely the weight, in advance (Line 18). Finally, the weighted submodel updates and the count vectors from live clients are securely aggregated under the guidance of the cloud server (Lines 7–9 and 19). Specifically, the cloud server guides the secure aggregation by enumerating every index in the union of the chosen clients' real index sets. For each index, the cloud server first determines the set of live clients whose perturbed index sets contain this index and then lets these clients submit the materials for securely adding up the weighted updates and the count numbers with respect to this index (Line 8).
The cloud server finally applies the update to the corresponding row of the global model by adding the quotient of the sum of the weighted updates and the total count number, namely the weighted average (Line 9). Considering that the weighted submodel updates and the count numbers are aggregated side by side, each client can augment the matrix denoting her weighted submodel update with the transposed count vector as the last column when preparing materials for secure aggregation (Line 19). In addition, to reduce the interactions between the cloud server and a client, they can package all the materials supporting secure aggregation, rather than exchanging the materials for one index at a time (Lines 7–9), i.e., the cloud server executes Lines 7 and 8 for each live client in parallel and then executes Line 9 for each index in the union.
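The per-index aggregation logic of Lines 7–9 can be sketched as follows, with plain sums standing in for secure aggregation (the data layout and field names are our illustrative choices):

```python
def aggregate_submodel_updates(union_indices, clients, global_rows):
    """For each index in the union, sum the pre-weighted rows and the
    count numbers side by side over clients whose perturbed index sets
    contain the index, then apply the weighted average to the model row."""
    for idx in union_indices:
        weighted_sum, count_sum = 0.0, 0
        for c in clients:
            if idx in c["perturbed"]:
                # clients outside the succinct set contribute zero rows/counts
                weighted_sum += c["weighted_update"].get(idx, 0.0)
                count_sum += c["counts"].get(idx, 0)
        if count_sum > 0:
            global_rows[idx] += weighted_sum / count_sum
    return global_rows
```

Rows whose total count is zero, i.e., rows only "pretend-modified" with zero updates, are left unchanged.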
IV-B2 Index Set Perturbation
We now present how a client can generate a perturbed index set, with a customized local differential privacy guarantee against the cloud server, to download her submodel and to upload the submodel update. Just like the exemplary question about party membership, the sensitive question here, asked by the cloud server, is "Do you have a certain index?". The clients participating in one round of federated submodel learning make up the population to be surveyed. Thus, the clients can use randomized response to answer "Yes" or "No", which provides measurable deniability of their true answers. However, as sketched in Section IV-A, several practical issues need to be resolved before randomized response can truly apply here. In what follows, we elaborate on our solutions to these issues.
As shown in Algorithm 2, we view the union of the chosen clients' real index sets, rather than the full index set, as the scope of the cloud server's questionnaire (Input). We reason about necessity and sufficiency as follows. Without loss of generality, consider one client and the other chosen clients. If any of the other clients wants to obtain deniability of her real indices, she requires the considered client to join as a respondent to the questions about her real indices, which implies that the considered client's questionnaire should cover the union of the other chosen clients' real index sets. Incorporating the considered client's own real index set, her questionnaire should thus contain the union of all chosen clients' real index sets. Applying the above reasoning to all clients, we can derive that the global questionnaire should contain the union of all chosen clients' real index sets. Next, we illustrate why the union suffices. Consider any index outside the union. Under the conventional federated learning framework, each client would upload a zero vector to the cloud server for this index. When the cloud server learns that the sum is a zero vector, it can infer that all chosen clients do not have this index; see index 5 in Fig. 2(a) for an intuition. From this perspective, any index outside the union does not need to be covered in federated submodel learning either. Nevertheless, suppose an index outside the union is introduced by chance, e.g., due to a false positive of the Bloom filter when reconstructing the union in Algorithm 3. A nice phenomenon occurs. On the one hand, the privacy of federated submodel learning is enhanced in the sense that the cloud server can only ascertain that those clients who return "Yes" answers do not really have this index, but cannot ascertain the states of the other clients due to plausible deniability.
On the other hand, those clients with “Yes” answers need to download the row with respect to this index and further to upload zero vectors through secure aggregation, which are useless and increase their overheads.
Given the questionnaire, a client basically uses two probability parameters in randomized response, denoted here as p1 and p2, to fine-tune the tension among effectiveness, efficiency, and privacy (Lines 3–6). In particular, p1 denotes the probability that an index in the client's real index set will return a "Yes" answer, and it controls the factual size of the client's user data contributed to federated submodel learning. Thus, a larger p1 implies better effectiveness in terms of convergence rate. In addition, p2 denotes the probability that an index outside the client's real index set will return a "Yes" answer, and it determines the number of redundant rows to be downloaded and the number of padded zero vectors to be uploaded through the secure aggregation protocol. Hence, given a fixed p1, a smaller p2 indicates higher efficiency. Furthermore, p1 and p2 jointly adjust the level of local differential privacy, where a closer pair provides a better privacy guarantee. We examine three typical examples: (1) the randomized response in the party membership survey takes p1 = 3/4 and p2 = 1/4 for each respondent; (2) conventional federated learning essentially leverages the full index set, takes p1 = 1 and p2 = 1 for each client, and offers the best privacy and effectiveness guarantees but the worst efficiency guarantee; and (3) federated submodel learning adopts p1 = 1 and p2 = 0 for each client and provides the best effectiveness and efficiency guarantees but the worst privacy guarantee.
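Writing $p_1$ (resp., $p_2$) for the probability that an index in (resp., not in) the real index set returns a "Yes" answer (our notation), the level of local differential privacy for a single one-time response follows directly from Definition 1:

```latex
\epsilon \;=\; \max\left( \left| \ln \frac{p_1}{p_2} \right|,\;
                          \left| \ln \frac{1 - p_1}{1 - p_2} \right| \right).
```

For the party-membership parameters $p_1 = 3/4$, $p_2 = 1/4$, this gives $\epsilon = \ln 3$; for $p_1 = 1$, $p_2 = 0$, it degenerates to $\epsilon = \infty$, matching the two extreme cases above.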
Considering that a client can be chosen to participate in multiple communication rounds and needs to repeatedly respond for some already answered indices, we extend the basic randomized response mechanism to offer a rigorous privacy guarantee, also called longitudinal privacy in the literature [46, 47, 58]. We adopt a memoization technique from RAPPOR, whose core idea is to play the randomized response game twice with a memoization step in between. The first perturbation step, called the permanent randomized response, creates a noisy answer, which is memoized by the client and permanently reused in place of the real answer. The second perturbation step, called the instantaneous randomized response, reports on the memoized answer over time, eventually revealing it completely. In other words, the privacy level guaranteed by the memoized answer in the permanent randomized response imposes a lower bound on the privacy level ensured by each instantaneous/revealed response. When the memoization technique is applied to federated submodel learning, we let each client maintain two index sets containing the indices with "Yes" and "No" answers in the permanent randomized response, respectively (Input). Here, the permanent randomized response mechanism is parameterized by two probabilities to tune privacy and utility (Lines 3–6), as illustrated in the preceding paragraph. In addition, given that one client can be grouped with distinct clients in different communication rounds while the union of real index sets varies from one round to another, the client needs to handle new indices. As a new index arrives (Line 2), the client generates a permanent noisy answer for it and updates her two memoized sets accordingly (Lines 7–10). Moreover, the client obtains her final perturbed index set by performing an instantaneous randomized response over the memoized answers to the union of real index sets in the current communication round (Lines 11–16).
In particular, the instantaneous randomized response is parameterized with another two probabilities (Lines 13 and 15), similar to those in the permanent randomized response. Now, these four probability parameters jointly support tuning the tension among privacy, effectiveness, and efficiency. More specifically, the overall probability that an index in the client's real index set finally returns a "Yes" answer and the overall probability that an index outside the real index set finally returns a "Yes" answer now play the same roles as the two permanent-response parameters, respectively. Detailed derivations of these overall probabilities are deferred to Section V-A.
Finally, we provide some comments on the above design. First, our design is different from conventional locally differentially private schemes (e.g., randomized response and RAPPOR), which require every participating user to choose the same probability parameters so that true statistics (e.g., heavy hitters, histograms, and frequencies) can be well estimated from the collected noisy answers, particularly after additional corrections. Such a requirement/assumption is no longer needed in our design because the cloud server, as the aggregator, computes aggregate statistics based on secure aggregation rather than over the noisy answers, e.g., counting how many samples from the chosen clients involve a certain index in total (Algorithm 1, Line 8). Therefore, as mentioned above, different clients can customize their probability parameters to tune privacy and utility. Second, our index perturbation mechanism in Algorithm 2 has a prerequisite that the real index set of a client does not change when she participates in different communication rounds. Considering that the real index set corresponds to the client's local data, this prerequisite can be further converted to the invariance of the client's local data. One feasible way to meet this prerequisite is to introduce the concept of a "period" into federated submodel learning, e.g., one period can be set to one month. Within a concrete period, a client uses the historical data from the previous period to participate in federated submodel learning for several communication rounds. In addition, when entering a new period, the client simply restarts Algorithm 2. The other feasible way is to allow changes in a client's real index set from one communication round to another. This implies that the underlying binary states of some indices may change. For example, if a client's local data and thus her real index set expand incrementally, some indices that previously were not in the real index set can fall into it in later rounds.
Recently, Erlingsson et al. considered a similar setting, in particular the collection of user statistics (e.g., software adoption) multiple times, with each user changing her underlying boolean value a limited number of times. Therefore, their design, based on a binary tree and the Bernoulli distribution, can be leveraged to extend Algorithm 2, allowing a client to change her local data and thus her real index set in different communication rounds.
IV-B3 Private Set Union
We introduce the last module of our design: private set union. We first briefly review related work on private set operations, with a focus on the often overlooked but significantly important private set union. Then, we explain why existing protocols are practically infeasible when adapted to federated submodel learning. We finally present our efficient and scalable scheme.
The goal of a private set operation protocol is to allow multiple parties, each holding a private set, to obtain the result of an operation over all the sets, without revealing any individual private set and without introducing a trusted third party. Compared with Private Set Intersection (PSI) [71, 72, 73, 74, 75, 76] and Private Set Union Cardinality (PSC) [77, 78, 79], which have received tremendous attention and have also seen several practical applications, such as in social networks [80, 81], human genome testing, location-based services, security incident information sharing, online advertising, private contact discovery, and the Tor anonymity network, there is little work and negligible focus on Private Set Union (PSU). Nevertheless, private set union promises a wide range of applications in practice, e.g., union queries over several databases and, more generally, the integration/sharing of datasets from multiple private sources. Thus, independent of federated submodel learning, the task of designing a practical private set union protocol is itself highly desired and urgent. Existing protocols mainly come from the fields of data mining and cryptography. In the data mining field, the representative design of private set union is based on commutative encryption and requires direct communication between any pair of parties. Unfortunately, the design leaks the cardinality of any two-party set intersection, and the underlying commutative encryption is fragile as well. In the cryptography field, according to the representation format of a set, the protocols can generally be divided into two categories: polynomial based [87, 88, 89, 90, 91] and Bloom filter based [92, 93, 94]. For the polynomial-based protocols, the elements of a set are represented as the roots of a polynomial, and the union of two sets is converted to the multiplication of two polynomials.
For the Bloom filter based protocols, the union operation over sets is normally transformed into the element-wise OR operation over Bloom filters, as demonstrated in Section III-B3, whereas the logical OR operation can be further converted to bit addition and bit multiplication. To obliviously perform addition and multiplication operations, the above two kinds of protocols mainly resort to generic secure two-party/multiparty computation (e.g., garbled circuits, homomorphic encryption, secret sharing, and oblivious transfer), or outsource secure computation to multiple noncolluding servers. Due to unaffordable computation and communication overheads, none of the existing private set union protocols has been deployed in practice. In addition to inefficiency, the basic setting of these protocols significantly differs from that of federated submodel learning, where clients cannot directly communicate with each other and must mediate through an untrusted cloud server. Additionally, the set elements here can come from a billion-scale domain, which has not been addressed in previous work.
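The OR-to-arithmetic conversion mentioned above can be illustrated with a small sketch in plain Python (no cryptography): over bits, OR(a, b) = a + b − a·b, which is exactly the addition-and-multiplication form that generic secure computation protocols evaluate.

```python
def bloom_or(bf_a, bf_b):
    """Element-wise OR of two Bloom filters, given as bit lists."""
    return [a | b for a, b in zip(bf_a, bf_b)]

def bloom_or_arithmetic(bf_a, bf_b):
    """The same OR expressed with only bit addition and bit
    multiplication, the form amenable to secure computation."""
    return [a + b - a * b for a, b in zip(bf_a, bf_b)]
```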
Given the infeasibility of existing protocols and the atypical setting of federated submodel learning, we present our new private set union scheme in Algorithm 3. First, each client represents her real index set as a Bloom filter (Line 3). The details of how to set the parameters of the Bloom filter can be found in Section III-B3. Second, unlike the common practice of deriving the union of sets by performing bitwise OR operations over their Bloom filters, which requires both addition and multiplication operations, we let the cloud server directly sum the Bloom filters. Here, the sum can be computed obliviously and efficiently under the coordination of the untrusted cloud server with secure aggregation (Line 8). In addition, the resulting Bloom filter is actually a counting Bloom filter, equivalent to one constructed from scratch by sequentially inserting each real index set. Besides membership information, the counting Bloom filter also contains the counts of the elements in the union of real index sets. In other words, the cloud server may learn how many clients have a certain index, which is undesired in our context. To prevent the leakage of counts, we let each client generate a perturbed integer vector, which replaces each bit 1 in her Bloom filter with a random integer and keeps each bit 0 unchanged (Line 4). Such a perturbation process obscures the counts while retaining membership information. Third, after obtaining the sum of perturbed Bloom filters, the cloud server can recover the union of real index sets by doing membership tests over the full index set. For example, to judge whether an index belongs to the union, we check the resulting integer vector at the positions of its hash values. The index is considered to be in the union only if all the values are nonzero. However, one practical issue arises: the domain of indices can be huge, e.g., the full set of goods IDs in Taobao is on the order of billions.
Thus, it can be prohibitively time consuming to enumerate all indices. Even worse, the direct enumeration method can also introduce a large number of false positives into the union, i.e., indices not in the union are falsely judged to be in it, which can further lead to unnecessary redundancy in the download and upload phases. To handle these problems, we further incorporate a private "partition" union to narrow down the scope of indices for the union reconstruction above. We let the cloud server divide the full domain of indices into a certain number of partitions ahead of time (Line 1). A good partition scheme needs to balance the benefit in the union reconstruction phase against the additional cost it introduces. Given the partitions, each client first uses a bit vector to record whether any index in her real index set falls into each partition (Line 5). Just as with the Bloom filter, to hide the concrete counts, the client further replaces each bit 1 with a random integer (Line 6). Then, the cloud server obtains the sum of the integer vectors using the secure aggregation protocol and reveals those partitions with nonzero integers in the corresponding positions. By simply doing membership tests for the indices falling into these partitions, the cloud server can efficiently construct the union. Last, the union is delivered to all live clients (Line 8).
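A minimal end-to-end sketch of this scheme follows. The concrete choices (filter length `M`, `K` hash functions, a modulo-based partition rule, and a small stand-in index domain) are illustrative assumptions, and the plain sums stand in for the secure aggregation protocol, which would hide the individual clients' vectors from the server.

```python
import hashlib
import random

M, K = 64, 3          # Bloom filter length and number of hash functions
NUM_PARTITIONS = 8    # hypothetical partition count over the index domain
DOMAIN = range(1000)  # stand-in for the full (in practice huge) domain

def positions(idx):
    # K hash positions for an index, derived from one cryptographic hash.
    h = hashlib.sha256(str(idx).encode()).digest()
    return [int.from_bytes(h[4 * k:4 * k + 4], "big") % M for k in range(K)]

def perturbed_bloom(real_set):
    vec = [0] * M
    for idx in real_set:
        for p in positions(idx):
            vec[p] = 1
    # Replace each 1 with a random positive integer to hide counts.
    return [random.randint(1, 2 ** 20) if b else 0 for b in vec]

def perturbed_partition_vector(real_set):
    vec = [0] * NUM_PARTITIONS
    for idx in real_set:
        vec[idx % NUM_PARTITIONS] = 1  # assumed partition rule
    return [random.randint(1, 2 ** 20) if b else 0 for b in vec]

def recover_union(summed_bloom, summed_partitions):
    # Candidate indices: those falling into nonempty partitions only.
    live = {p for p, v in enumerate(summed_partitions) if v != 0}
    candidates = (i for i in DOMAIN if i % NUM_PARTITIONS in live)
    # Membership test: all hash positions must be nonzero.
    return {i for i in candidates
            if all(summed_bloom[p] != 0 for p in positions(i))}
```

Note that the recovered set is a superset of the true union: there are no false negatives, but Bloom filter false positives may survive the partition filter, which is exactly the redundancy discussed above.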
V Security and Performance Analyses
In this section, we first analyze the privacy guarantees of our secure federated submodel learning scheme according to Definition 1 and Definition 2, i.e., Theorem 1 and Theorem 2. We also provide an instantiation of our scheme, where each client consistently uses the union of the chosen clients’ real index sets when interacting with the cloud server, and prove that its security and privacy guarantees are as strong as conventional federated learning with secure aggregation (hereinafter also called “Secure Federated Learning” and abbreviated as “SFL”), i.e., Theorem 3. We then show the proven security of our proposed private set union protocol, i.e., Theorem 4. We finally analyze the performance of our scheme by comparing with that of secure federated learning.
V-A Security and Privacy Analyses
By Definition 1, we analyze the local differential privacy guarantee of the index set perturbation in Algorithm 2. As stepping stones, we first analyze the permanent randomized response and a one-time instantaneous randomized response in Lemma 1 and Lemma 2, which impose an upper bound and a lower bound, respectively, on the privacy level of Algorithm 2, namely Theorem 1.
The permanent randomized response in Algorithm 2 for a client achieves local differential privacy at the level $\epsilon_1 = \ln \max\left(\frac{p_1}{p_2}, \frac{p_2}{p_1}, \frac{1-p_1}{1-p_2}, \frac{1-p_2}{1-p_1}\right)$, where $p_1$ and $p_2$ denote the probabilities that an index in and not in the client's real index set receives a permanent "Yes" answer, respectively.
We focus on a certain index. According to Definition 1, we need to consider all possible pairs of inputs from the client and all possible outputs of the permanent randomized response in Algorithm 2. Here, the input pair is the index being in and not in the client's real index set. In addition, the possible outputs are obtaining a "Yes" and a "No" noisy answer for memoization. We thus can compute four ratios between the conditional probabilities of a permanent output given a pair of distinct inputs: $\frac{p_1}{p_2}$, $\frac{p_2}{p_1}$, $\frac{1-p_1}{1-p_2}$, and $\frac{1-p_2}{1-p_1}$. By Definition 1, we can derive the level of local differential privacy: $\epsilon_1 = \ln \max\left(\frac{p_1}{p_2}, \frac{p_2}{p_1}, \frac{1-p_1}{1-p_2}, \frac{1-p_2}{1-p_1}\right)$. ∎
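This style of bound can be checked numerically. The helper below (with hypothetical parameter names) computes the privacy level of a binary randomized response and exhaustively verifies that every output's likelihood ratio between the two inputs stays within $e^{\epsilon}$.

```python
import math

def rr_epsilon(p1: float, p2: float) -> float:
    """LDP level of a randomized response that answers "Yes" with
    probability p1 (index in the real set) or p2 (index not in it)."""
    ratios = [p1 / p2, p2 / p1, (1 - p1) / (1 - p2), (1 - p2) / (1 - p1)]
    return math.log(max(ratios))

def check_ldp(p1, p2):
    """Verify Pr[output | input] <= exp(eps) * Pr[output | input']
    for both outputs ("Yes"/"No") and both input orderings."""
    eps = rr_epsilon(p1, p2)
    for pr_in, pr_out in [(p1, p2), (1 - p1, 1 - p2)]:
        assert pr_in <= math.exp(eps) * pr_out + 1e-12
        assert pr_out <= math.exp(eps) * pr_in + 1e-12
    return eps
```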
A one-time instantaneous randomized response in Algorithm 2 for a client satisfies local differential privacy at the level $\epsilon_2 = \ln \max\left(\frac{\tilde p_1}{\tilde p_2}, \frac{\tilde p_2}{\tilde p_1}, \frac{1-\tilde p_1}{1-\tilde p_2}, \frac{1-\tilde p_2}{1-\tilde p_1}\right)$, where $\tilde p_1 = p'_1 p_1 + p'_2 (1 - p_1)$ and $\tilde p_2 = p'_1 p_2 + p'_2 (1 - p_2)$; here, $p_1$ and $p_2$ are the permanent "Yes" probabilities for indices in and not in the real index set, and $p'_1$ and $p'_2$ denote the probabilities of an instantaneous "Yes" answer given a memoized "Yes" and "No" answer, respectively.
The proof is similar to that of Lemma 1. The difference is that the possible outputs are the index being in and not in the final perturbed index set. We first compute the two conditional probabilities $\tilde p_1$ and $\tilde p_2$, denoting the probabilities of the index being in the final perturbed index set given that it is and is not in the client's real index set, respectively. In particular, letting $Y_I$, $Y_P$, and $E$ denote the events of an instantaneous "Yes" answer, a permanent "Yes" answer, and the index being in the real index set, we can derive $\tilde p_1$ through
$\tilde p_1 = \Pr[Y_I \mid E] = \Pr[Y_I \mid Y_P, E] \Pr[Y_P \mid E] + \Pr[Y_I \mid \bar{Y}_P, E] \Pr[\bar{Y}_P \mid E] \quad (2)$
$= p'_1 p_1 + p'_2 (1 - p_1), \quad (3)$
where Equation (2) follows from the law of total probability, and Equation (3) follows because $Y_I$ is independent of $E$ conditioned on $Y_P$ or $\bar{Y}_P$. In a similar way, we can get $\tilde p_2 = p'_1 p_2 + p'_2 (1 - p_2)$. Based on $\tilde p_1$ and $\tilde p_2$, we can still compute the four ratios between the conditional probabilities of an instantaneous output given a pair of different inputs and draw the level of local differential privacy $\epsilon_2 = \ln \max\left(\frac{\tilde p_1}{\tilde p_2}, \frac{\tilde p_2}{\tilde p_1}, \frac{1-\tilde p_1}{1-\tilde p_2}, \frac{1-\tilde p_2}{1-\tilde p_1}\right)$. ∎
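The total-probability composition above can be sanity checked with exact rational arithmetic; the concrete parameter values below are hypothetical, chosen only for illustration.

```python
import math
from fractions import Fraction as F

def effective_yes(p, q1, q2):
    """Law of total probability: overall "Yes" probability when the
    permanent answer is "Yes" with probability p, and the instantaneous
    step reports "Yes" with probability q1 (memoized "Yes") or q2
    (memoized "No")."""
    return p * q1 + (1 - p) * q2

def ldp_level(a, b):
    """LDP level for a binary mechanism with "Yes" probabilities a, b."""
    return math.log(max(a / b, b / a, (1 - a) / (1 - b), (1 - b) / (1 - a)))

# Hypothetical parameter choices for illustration.
p1, p2 = F(3, 4), F(1, 4)   # permanent "Yes" probabilities (in / out)
q1, q2 = F(2, 3), F(1, 3)   # instantaneous "Yes" probabilities
t1 = effective_yes(p1, q1, q2)  # overall "Yes" prob., index in real set
t2 = effective_yes(p2, q1, q2)  # overall "Yes" prob., index not in set
```

Because the instantaneous step adds a second layer of noise, the overall probabilities are closer together than the permanent ones, so the one-time level is no larger than the permanent level, matching the bounds combined in Theorem 1.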
From the above deduction, we can see that the overall probabilities of a final "Yes" answer for indices in and not in the real index set play the same roles in the instantaneous randomized response as the two probability parameters do in the permanent randomized response. This intuition was given in Section IV-B2 and is now formally verified here.
By combining the above two lemmas, we show the level of local differential privacy ensured by Algorithm 2.
When a client is chosen to participate in an arbitrary number of communication rounds, Algorithm 2 satisfies $\epsilon$-local differential privacy, where $\epsilon$ is lower bounded by the privacy level of a one-time instantaneous randomized response (Lemma 2) and upper bounded by that of the permanent randomized response (Lemma 1).
We consider that the client participates in $T$ communication rounds of federated submodel learning and that Algorithm 2 guarantees $\epsilon_T$-local differential privacy. Thus, the client generates $T$ instantaneous randomized responses. On one hand, suppose that an attacker only leverages the $T$-th instantaneous randomized response while ignoring all previous instantaneous randomized responses. This corresponds to the strongest possible local differential privacy guarantee, namely the lower bound on $\epsilon_T$. According to Lemma 2, a one-time instantaneous randomized response guarantees local differential privacy at a certain level, which thus lower bounds $\epsilon_T$. On the other hand, if the attacker leverages all $T$ instantaneous randomized responses, then as $T$ approaches positive infinity, the worst case is that the attacker recovers the permanent randomized response. This corresponds to the weakest possible local differential privacy guarantee, namely the upper bound on $\epsilon_T$. By Lemma 1, the permanent randomized response guarantees local differential privacy at a certain level, which thus upper bounds $\epsilon_T$. We complete the proof. ∎
In fact, to derive the detailed local differential privacy guarantee when the client participates in a given number $T$ of communication rounds, namely $\epsilon_T$, we need to make an additional assumption on how effectively the attacker infers the client's permanent randomized response from all $T$ instantaneous randomized responses. We leave this explorative problem as future work. Additionally, if we set all four probability parameters to 1, any index, no matter whether it is or is not in the client's real index set, will receive a permanent "Yes" answer and an instantaneous "Yes" answer. In other words, if the client takes the union of the chosen clients' real index sets as her perturbed index set, the local differential privacy guarantee of Algorithm 2 is $\epsilon = 0$, which is the strongest case, as in conventional federated learning.
Algorithm 1 is a client-controllable privacy protection mechanism for submodel update aggregation. In particular, for any index, we let $n_1$ and $n_2$ denote the numbers of live clients who do not have and who have this index in reality, respectively. If each live client chooses the same probability parameters, so that every live client's overall probabilities of a final "Yes" answer for an index in and not in her real index set are $\tilde p_1$ and $\tilde p_2$, respectively, then Algorithm 1 can guarantee: Event 1 happens with probability $n_2 \tilde p_1 (1 - \tilde p_1)^{n_2 - 1} (1 - \tilde p_2)^{n_1}$, and Event 2 happens with probability $(1 - \tilde p_1)^{n_2} \left(1 - (1 - \tilde p_2)^{n_1}\right)$.
Event 1 happens when exactly one of the $n_2$ clients who have the index in reality submits a nonzero update, while none of the $n_1$ clients who do not have it in reality submits a zero update. By the product rule, we can compute the joint probability of Event 1 as $n_2 \tilde p_1 (1 - \tilde p_1)^{n_2 - 1} (1 - \tilde p_2)^{n_1}$.
Event 2 happens when none of the $n_2$ clients who have the index in reality submits a nonzero update, while part (at least one) of the $n_1$ clients who do not have it in reality submit zero updates. Under such a circumstance, from the aggregate zero update, the cloud server almost ascertains that these submitting clients do not have the index in reality. According to the product rule, we can compute the probability of Event 2 as $(1 - \tilde p_1)^{n_2} \left(1 - (1 - \tilde p_2)^{n_1}\right)$. ∎
Theorem 2 enables clients to jointly adjust the privacy level of secure submodel update aggregation by choosing different probability parameters. Details on fine-tuning are deferred to Appendix A. Additionally, we still examine the case in which each client uses the union of the chosen clients' real index sets to upload her submodel update, by setting all four probability parameters to 1, implying that the overall probabilities of a final "Yes" answer are both 1. This is the strongest possible client-controllable privacy for secure submodel update aggregation, as in secure federated learning. Combining this with the local differential privacy guarantee, we can see that secure federated submodel learning with this setting holds the same security and privacy levels as secure federated learning under both Definition 1 and Definition 2. We further generalize this observation in Theorem 3, which is free of any specific security or privacy definition.
If the probability parameters all take the value 1 for each chosen client, then the security and privacy of the secure federated submodel learning scheme in Algorithm 1 are as strong as those of secure federated learning.
We consider any index from the full index set and prove the claim in two complementary cases, depending on whether the index belongs to the union of the chosen clients' real index sets.
Case 1 (the index is in the union of the chosen clients' real index sets): In both secure federated submodel learning and secure federated learning, each client will download the corresponding row of the full model and then upload the update of this row to the cloud server through secure aggregation. Specifically, if the index is not in a client's real index set, then this client will upload a zero vector. The whole processes of the two schemes are consistent, implying the same security and privacy guarantees.
Case 2 (the index is not in the union): In secure federated submodel learning, no client will download the corresponding row, and no client will upload the zero vector. Thus, the adversary knows that each client does not have the index and that each client's update is a zero vector. In contrast, in secure federated learning, each client will download the corresponding row and upload a zero vector as her update. From the result that the aggregate update is a zero vector, the adversary still ascertains that each client does not have the index and that her update is a zero vector, which is the same as in secure federated submodel learning.
By combining the two cases, we complete the proof. ∎
We finally analyze the security of the private set union protocol. As a springboard, we give Lemma 3, which states that the modular addition of one or more random integers from $\mathbb{Z}_R = \{0, 1, \ldots, R-1\}$ is still uniformly random in $\mathbb{Z}_R$. We note that the elementary operation in secure aggregation is modular addition with a large modulus $R$ rather than ordinary addition, which is consistent with Lemma 3. In addition, in the field of number theory, the set of integers $\{0, 1, \ldots, R-1\}$ is called the least residue system modulo $R$, or the ring of integers modulo $R$. Moreover, this set together with the operation of modular addition forms a finite cyclic group.
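As an exhaustive sanity check of this claim (with a deliberately tiny modulus, whereas secure aggregation uses a large one), the sketch below computes the exact output distribution of a modular sum of independent uniform integers; by the lemma, it should be uniform for any number of summands.

```python
from itertools import product

R = 5  # small modulus for an exhaustive check; in practice R is large

def sum_mod_distribution(n):
    """Exact distribution of (x_1 + ... + x_n) mod R when each x_i is
    independent and uniform over {0, ..., R-1}."""
    counts = [0] * R
    for xs in product(range(R), repeat=n):
        counts[sum(xs) % R] += 1
    total = R ** n
    return [c / total for c in counts]
```

The same property underlies the one-time-pad style masking in secure aggregation: adding even a single uniform mask makes the masked value uniform, hence independent of the client's input.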
For any nonempty index set $\Gamma$, if for every $i \in \Gamma$, $x_i$ is randomly and independently chosen from $\mathbb{Z}_R$, denoted as $x_i \in_R \mathbb{Z}_R$, then $\sum_{i \in \Gamma} x_i \bmod R$ is still uniformly random in $\mathbb{Z}_R$.
We prove by induction on the cardinality of $\Gamma$, denoted as $|\Gamma|$, where $|\Gamma| \geq 1$.
The base case is to show that the statement holds for $|\Gamma| = 1$. We let $x_i$ denote the single element, where $x_i \in_R \mathbb{Z}_R$. Thus, it is trivial that $x_i \bmod R$ is uniformly random in $\mathbb{Z}_R$.
The inductive step is to prove that if the statement holds for any nonempty $\Gamma$, then the statement still holds for $\Gamma \cup \{j\}$, where $j \notin \Gamma$. We let $y$ denote $\sum_{i \in \Gamma} x_i \bmod R$. Hence, it suffices to show that