Machine learning algorithms play an essential role in extracting patterns and useful knowledge from datasets. Traditional machine learning methods are usually developed from a centralized or parallel perspective, i.e., the entire dataset is stored on one machine or a group of machines, and the learning objective can be achieved efficiently because those machines have direct access to the data. Although many of these algorithms are well studied and perform excellently even on petabyte-scale datasets, they run into serious difficulties in business or research settings with strong data-confidentiality concerns. For example, portable and wearable smart devices collect the motion and body-condition information of billions of users every day, and these data contain numerous valuable patterns. However, traditional machine learning algorithms cannot be applied for knowledge extraction, because strict privacy laws and regulations, such as the GDPR, prohibit uploading the distributedly collected data to central servers. Similar examples exist in a variety of areas, such as finance, health, and education.
Federated learning, first proposed in 2016 [4], is a privacy-preserving machine learning approach to analyzing the patterns of distributed sensitive datasets. The notion was later expanded into a clearer and more comprehensive framework with three components: horizontal federated learning, in which participants share the same feature space; vertical federated learning, in which participants share the same ID space; and federated transfer learning, for disjoint data distributions. In this framework, every participant shares encrypted messages computed from its own dataset and updates its local model according to the messages it receives, with or without a trusted third party. However, in federated learning, especially in the vertical scenario, it is common but unjustified to ignore the information leaked by sample IDs once they have been used to align the distributed datasets. In a standard vertical federated learning system, for example, samples held by different participants are aligned by executing a secure protocol that lets every participant learn the exact intersection set. In practical applications, the participants are usually companies or institutions in competition with one another, and many clients of one company are potential advertising targets of another. On the one hand, therefore, the participants would not allow their sample IDs to be disclosed. On the other hand, real-life federations can be asymmetric: a subset of the participants are small companies with strong requirements for ID privacy protection, while the others are large companies that are less concerned about ID privacy because their customers include almost every citizen. Such an unbalanced setup requires a vertical federated learning system to distinguish the “weak” and the “strong” sides of the federation and to take their specific privacy protection demands into account.
In this paper, we divide vertical federated learning into two classes, the symmetric and the asymmetric, in order to develop federated learning algorithms that preserve the privacy of sample IDs for the participants who actually demand it. The contributions of this paper are summarized as follows.
- We formally propose and comprehensively characterize the notion of asymmetrically vertical federated learning.
- We adapt the standard private set intersection protocol to achieve the asymmetric ID alignment phase in an asymmetrically vertical federated learning system. In addition, we provide a Pohlig-Hellman realization of the adapted private set intersection protocol.
- We present a genuine with dummy approach to achieving asymmetric federated model training. To illustrate its application, we provide a federated logistic regression algorithm as an example. Experiments are also conducted to validate the feasibility of the approach.
The rest of the paper is organized as follows. In Section 2, we formulate the symmetric and asymmetric classification of vertical federated learning. In Section 3, we present an asymmetric private set intersection protocol for ID alignment and one of its realizations. Section 4 provides the genuine with dummy approach to asymmetric model training and its application to federated logistic regression, and Section 5 reports a few experimental validations. Finally, several concluding remarks are given in Section 6.
2 Problem Definition
In this section, we revisit the notion of vertical federated learning and formally categorize it into two classes to study the privacy preservation of sample IDs.
2.1 Vertical Federated Learning
Let denote a complete dataset with representing the sample ID space, the feature space and the label space, respectively. It was defined in [8] that vertical federated learning is conducted over two datasets , satisfying
In real-world applications, however, it is nearly impossible to find two original datasets collected by distributed parties that share exactly the same sample ID space. Therefore, as a preparation phase for vertical federated learning, it is necessary to introduce appropriate ID-alignment protocols that assist each party with its secure identification of and establishment of two datasets’ row mapping. We define as the pre-ID-alignment datasets that correspond to and , respectively. Clearly, and . In addition, we write as the whole ID space that contains all possible elements in and . Then the vertical federated learning presented by [8] is depicted in Figure 1.
2.2 Private Set Intersection
Recall that ID-alignment protocols are essential in the preparation of vertical federated model training. To achieve ID alignment, Private Set Intersection (PSI) protocols, one of the most well-studied areas in secure multiparty computation, are usually implemented in a federated learning system. In standard PSI, each party holds a set , which involves ’s confidential data. All parties would like to cooperatively find the intersection , and in the meantime, each keeps the elements in private. The realization of PSI protocols can be based on classical public-key cryptosystems [1, 3], oblivious transfer [7], garbled circuits [2], etc.
2.3 Symmetric and Asymmetric
Let us call the parties who own . As can be seen from Figure 1, and have equally powerful positions in the sense that the numbers of elements in their pre-ID-alignment sample ID spaces share the same order of magnitude and both sets play a major role in the whole ID space . In fact, even if picks uniformly at random, there is probability of . In such a “federation of the strong”, neither party would gain much knowledge by executing PSI protocols to obtain the intersection , and thus neither party minds letting the other know which IDs it possesses. Since both parties finally obtain the intersection symmetrically for model training, we term this scenario symmetrically vertical federated learning.
The opposite scenario is that , without loss of generality, has , while still holds . It is common that the federated learning system to be established is for ’s learning task, and thus the labels are provided by . This vertical data distribution is shown in Figure 2. Clearly, there is a sufficiently small probability that an ID picked uniformly at random belongs to . Therefore, we say is on the weak side in the sense that each sample ID in is regarded as sensitive information, whose privacy would be severely compromised through the revelation of by executing standard PSI protocols. In such a “federation of the weak and the strong”, it is necessary to asymmetrically protect the ID privacy of the weak party in the ID alignment phase. This scenario is termed asymmetrically vertical federated learning.
It is reasonable not to analyze the “federation of the weak” scenario, because we may shrink the whole space to so that this scenario can be reduced to the “federation of the strong” one. Throughout this paper, we impose the following assumption.
It is evident that if Assumption 1 holds, either or . Now by representing a positive number by with and naming the order of magnitude of , we provide the following precise definition for.
It is called Symmetrically Vertical Federated Learning (SVFL) if
Since Assumption 1 holds in Definition 1, there always exist a weak and a strong participant in AVFL. Note that we base the analysis above and the rest of the paper on the two-party federation case for the simplicity of demonstration. In fact, the analysis can be trivially extended to the multi-party case by asymmetrically protecting the sample ID privacy of the weak party.
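The order-of-magnitude comparison behind the symmetric/asymmetric distinction can be illustrated with a short sketch. It assumes that a federation counts as symmetric when the two parties' pre-ID-alignment ID sets have sizes of the same order of magnitude; this criterion, and the function names `order_of_magnitude` and `classify_federation`, are illustrative assumptions rather than a restatement of Definition 1.

```python
def order_of_magnitude(n):
    """Return b such that n = a * 10**b with 1 <= a < 10 (n a positive integer)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Counting decimal digits avoids floating-point error from log10.
    return len(str(n)) - 1


def classify_federation(size_a, size_b):
    """Hypothetical rule: 'SVFL' if both ID-set sizes share an order of
    magnitude, otherwise 'AVFL'."""
    same = order_of_magnitude(size_a) == order_of_magnitude(size_b)
    return "SVFL" if same else "AVFL"


# Two large parties versus one small and one large party.
label_strong_strong = classify_federation(3_000_000, 8_000_000)
label_weak_strong = classify_federation(5_000, 8_000_000)
```

Under this assumed rule the first pair is classified as symmetric and the second as asymmetric.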
3 Asymmetric ID Alignment
ID alignment is typically the first stage of a vertical learning workflow. In this section, we adapt the standard PSI protocol to achieve asymmetric ID alignment and provide a realization using a classical cryptosystem.
3.1 Asymmetric PSI Protocols
Recall that in SVFL’s ID alignment phase, the execution of standard PSI protocols not only lets each party gain knowledge about the samples they both hold, but also provides a tag that manages to efficiently link together distributed sample fragments and paves the way for the follow-up federated model training phase. To realize a proper ID-alignment phase in AVFL, one has to adapt standard PSI protocols so that they satisfy
(i) The exact intersection set is kept private against the strong participant.
(ii) In the federated model training phase, the distributed samples with ID in are still alignable.
Let denote the pre-ID-alignment sample ID sets held by the weak and the strong participant, respectively. Alternatively, they are defined by
A possible ID-alignment approach that meets requirements (i)-(ii) above is to use a variant of PSI protocols that we present below.
Asymmetric PSI (APSI) protocols, as a variant of PSI protocols, yield an obfuscated set at each party satisfying
In addition, only the weak participant further knows .
The output of such PSI protocols is illustrated in Figure 3.
3.2 A Pohlig-Hellman Realization
We now provide a realization of APSI protocols based on the Pohlig-Hellman encryption. Define and as the set of integers coprime to . Let denote a multiplicative group. Then the well-known Pohlig-Hellman encryption scheme is described by the following three components.
(Key generation) Select a prime number such that every plaintext is an element of , where has at least one large prime factor. For example, select with being also a prime number. Then select and compute . Finally, one reveals as public knowledge and keeps as the key to this symmetric cryptosystem.
(Encryption) Encrypt the plaintext by
(Decryption) Decrypt the ciphertext by
It is straightforward to show the commutative property of the Pohlig-Hellman encryption by checking for any . Now we present an APSI protocol based on the Pohlig-Hellman encryption in Algorithm 1.
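To make the three components and the commutative property concrete, the following is a minimal sketch of the textbook Pohlig-Hellman exponentiation cipher. The symbol names `P`, `e` and `d`, the toy safe prime 2039 = 2 × 1019 + 1, and the fixed exponents are assumptions chosen for illustration; a real deployment would use a prime of cryptographic size and randomly drawn keys.

```python
import math

P = 2039  # toy safe prime (2039 = 2 * 1019 + 1); far too small for real use


def make_key(e):
    """Return (e, d) with e * d = 1 mod (P - 1); e must be coprime to P - 1."""
    if math.gcd(e, P - 1) != 1:
        raise ValueError("e must be coprime to P - 1")
    return e, pow(e, -1, P - 1)  # modular inverse, Python 3.8+


def encrypt(m, e):
    """Encrypt a plaintext m in {1, ..., P - 1} as m**e mod P."""
    return pow(m, e, P)


def decrypt(c, d):
    """Decrypt a ciphertext c as c**d mod P."""
    return pow(c, d, P)


e_a, d_a = make_key(7)    # first party's secret exponent pair
e_b, d_b = make_key(11)   # second party's secret exponent pair
m = 1234

round_trip = decrypt(encrypt(m, e_a), d_a)
ab = encrypt(encrypt(m, e_a), e_b)  # encrypt with a, then b
ba = encrypt(encrypt(m, e_b), e_a)  # encrypt with b, then a
```

Because both orders of encryption raise `m` to the same product of exponents, `ab` and `ba` coincide, which is what allows two parties to compare doubly encrypted IDs without revealing their keys.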
Input: The strong participant holds . The weak participant holds . Security number .
Output: only obtains satisfying and
obtains both and .
Security Analysis. In Algorithm 1, the exchanged information by step 6 is , whose security is evidently guaranteed by the Pohlig-Hellman encryption scheme. By step 8, the strong participant receives the encrypted obfuscated set but it cannot distinguish which elements belong to the encrypted true intersection set. In fact, cannot even obtain any plaintext based on because the elements in are still encrypted with ’s key . By step 11, since the cooperatively decrypted message is only , ’s private information is protected against the weak participant . Clearly, in step 12, only is revealed to with kept private by .
It can be noted that the security number is interpreted as a cardinality ratio of the obfuscated intersection set to the true intersection set . As can be directly computed, there is probability that a uniformly random element in that picks belongs to . Clearly, , in which case and both obtain the true intersection set. In this special case, AVFL becomes SVFL. As goes up from zero to one, exponentially decreases, and thereby it becomes more difficult for to potentially identify from . When reaches its maximum of one, the obfuscated set , i.e., the whole ID space of is used for obfuscating and cannot gain any knowledge by executing Algorithm 1.
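The flow described in the security analysis can be simulated in a few lines. The following is a sketch under stated assumptions, not a reproduction of Algorithm 1: both parties run in one process, IDs are small integers in the cipher's plaintext range, the toy prime `P` is far too small for real use, and the security number is interpreted as the fraction of the strong party's non-intersecting IDs that the weak party mixes in as padding. All function and variable names are hypothetical.

```python
import math
import random

P = 2039  # toy safe prime (2039 = 2 * 1019 + 1), illustration only


def keygen(rng):
    """Draw an exponent e coprime to P - 1 and return (e, e^{-1} mod P - 1)."""
    while True:
        e = rng.randrange(3, P - 1)
        if math.gcd(e, P - 1) == 1:
            return e, pow(e, -1, P - 1)


def power(x, k):
    """Pohlig-Hellman encryption or decryption: x**k mod P."""
    return pow(x, k, P)


def apsi(weak_ids, strong_ids, security, rng):
    """Return (true_intersection, obfuscated_set).

    The weak party learns both sets; the strong party learns only
    obfuscated_set. `security` in [0, 1] is the assumed padding fraction.
    """
    a, a_inv = keygen(rng)  # weak party's key
    b, b_inv = keygen(rng)  # strong party's key

    # Weak -> strong -> weak: the weak party's IDs come back doubly encrypted
    # in the same order, so the weak party can map each ciphertext to its ID.
    weak_double = {power(power(i, a), b): i for i in weak_ids}

    # Strong -> weak: the strong party's IDs, encrypted with b and then a.
    # The weak party cannot read the plaintexts behind these ciphertexts.
    strong_double = [power(power(j, b), a) for j in strong_ids]

    # Weak party: intersect in ciphertext space, then pad with dummies.
    common = set(weak_double) & set(strong_double)
    true_intersection = {weak_double[c] for c in common}
    rest = [c for c in strong_double if c not in common]
    padded = list(common) + rng.sample(rest, math.ceil(security * len(rest)))
    rng.shuffle(padded)

    # Strong party strips its layer b; weak party strips a and shares the
    # resulting plaintext obfuscated set with the strong party.
    half = [power(c, b_inv) for c in padded]
    obfuscated = {power(c, a_inv) for c in half}
    return true_intersection, obfuscated


rng = random.Random(0)
weak = {5, 17, 101, 1500}
strong = set(range(2, 200))
true_ids, obf_ids = apsi(weak, strong, 0.5, rng)
hit_probability = len(true_ids) / len(obf_ids)
```

Here `hit_probability` is the chance that an ID drawn uniformly from the obfuscated set is genuine; with a padding fraction of zero the obfuscated set equals the true intersection, and with a fraction of one it equals the strong party's whole ID set.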
4 Asymmetric Federated Model Training
In this section, we investigate the asymmetric federated model training process and propose a novel and general approach to training a model in an asymmetric fashion that is as good as in a symmetric fashion. We also provide an application of this approach to an existing federated learning algorithm.
4.1 Genuine with Dummy Approach
Using APSI protocols given in Definition 2, the weak participant obtains the true intersection set , which is a subset of the obfuscated intersection set that the strong participant knows. As shown in Figure 3, the vertical federated learning domain now contains margins, i.e., all labels and a few features are missing for the samples with their ID in . Indeed, federated transfer learning [8] can be used to fill in the margins by the feature-representation-transfer approach [6]. Nevertheless, in many practical applications, such as financial risk management, it is normal to train machine learning models based on the small but exactly original data in , in order to avoid misjudgement of dishonest conduct. In these areas, the learned features and labels for , even if not “negatively transferred”, may lead to undesirably strict or loose risk control strategies. Therefore, it is necessary to design asymmetric model training schemes that take the distributed output of APSI protocols and yield the same or almost the same result as the SVFL. Based on the standard vertical model training in [8], we now present a Genuine with Dummy (GWD) approach to achieving asymmetric model training as follows. Note that a trusted third party is introduced as a secure coordinator.
generates a public-key cryptosystem, and sends the public key to .
exchange intermediate variables to cooperatively compute gradient and loss. The weak participants, say , execute the normal computation rule for the samples in , i.e., the genuine, but set the variables that correspond to the samples in , i.e., the dummy, to specific mathematical identities so that their existence will not affect the relevant computed result. The identities can be, for example, zero in addition, one in multiplication or in function composition.
turn to for gradient and loss decryption service, and update their local models.
As we can see, the central idea of the GWD approach provided above is to let the weak participant execute the normal protocol for the genuine samples, while mathematically muting the dummy samples, which it does not actually hold, before sending them to the strong participant. To keep the strong participant unaware of the existence of the dummy samples, a potential method is to use a semantically secure encryption scheme, such as the Paillier cryptosystem [5], so that the participant cannot efficiently distinguish the identities from a group of normal variables. In addition, it is clear that the asymmetric model training achieved with the GWD approach would exhibit exactly the same performance as the standard (or symmetric) model training because the introduced identities strictly guarantee invariant intermediate results at every step.
An illustration of the GWD architecture is provided in Figure 4, which depicts the federated execution process of a general subroutine. The weak participant would like to send encrypted messages to the strong participant and then expect a response of the execution result . However, the direct transmission would give away the genuine IDs. Instead, the weak participant sends along with the mathematical identities corresponding to the dummy samples, and finally receives , which is identical to what it expects. Meanwhile, the strong participant performs the computation , while remaining oblivious to the genuine IDs.
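The muting effect of the identities can be checked with a small plaintext sketch. Encryption is deliberately omitted here, so this shows only the arithmetic invariance and not the indistinguishability that a semantically secure scheme would add; the helper names `build_message` and `strong_aggregate` and the sample values are hypothetical.

```python
import operator
from functools import reduce


def build_message(obfuscated_ids, genuine_values, identity):
    """Weak party: genuine value for IDs it holds, `identity` for dummy IDs."""
    return {i: genuine_values.get(i, identity) for i in sorted(obfuscated_ids)}


def strong_aggregate(message, op, identity):
    """Strong party: fold over every entry, unaware which ones are dummies."""
    return reduce(op, message.values(), identity)


obfuscated = {1, 3, 5, 8, 9}   # IDs known to the strong party
genuine = {3: 2.5, 8: -1.0}    # values only the weak party truly holds

# Additive aggregation: dummies carry 0 and leave the sum unchanged.
sum_msg = build_message(obfuscated, genuine, 0.0)
total = strong_aggregate(sum_msg, operator.add, 0.0)

# Multiplicative aggregation: dummies carry 1 and leave the product unchanged.
prod_msg = build_message(obfuscated, genuine, 1.0)
product = strong_aggregate(prod_msg, operator.mul, 1.0)
```

In both cases the strong party processes five entries, yet the result equals the aggregate over the two genuine samples alone.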
4.2 Asymmetrically Vertical Logistic Regression
We now take the coordinator-free federated logistic regression training presented in [9] as an example and adapt it to an asymmetric training protocol using the GWD approach. Let denote the ciphertext of a plain message . Then the adapted Asymmetrically Vertical Logistic Regression (AVLR) training protocol is presented in Algorithm 2.
Input: The strong participant holds the sample set . The weak participant holds the sample set . Learning rate .
Output: learn the weights , respectively, such that the joint weight is the global optimum of the model.
initialize their model estimates and , respectively.
In Algorithm 2, the central subroutine is to let the weak participant share the encrypted scalars , based on which the strong participant evaluates its local gradient . To implement the GWD approach, computes for based on the true labels, but sets for all , which is the additive identity. Because preserves semantic security, can neither identify from the received set of s, nor distinguish which samples are the dummies. Since performs additions to compute , the presence of the dummy samples’ has no effect on the result. In this way, the dataset over which the federation of and train their model is inherently the sample set with IDs in . This guarantees that the asymmetric vertical training yields the same result as the standard (symmetric) vertical training does.
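The effect of zeroing the dummy scalars on the strong participant's gradient can be sketched in plaintext. This sketch assumes the usual logistic residual (sigmoid of the joint score minus the label) as the shared scalar and omits encryption and any masking step entirely; the function names, weights and feature values are made up for illustration and are not taken from Algorithm 2.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weak_residuals(obf_ids, genuine, w_weak, strong_scores):
    """Weak party: residual for genuine IDs, additive identity 0.0 for dummies.
    `genuine` maps ID -> (weak-side features, label)."""
    d = {}
    for i in sorted(obf_ids):
        if i in genuine:
            x_w, y = genuine[i]
            d[i] = sigmoid(strong_scores[i] + dot(w_weak, x_w)) - y
        else:
            d[i] = 0.0
    return d


def strong_gradient(d, strong_feats, dim):
    """Strong party: sum of d_i * x_i over every ID it received."""
    g = [0.0] * dim
    for i in sorted(d):
        for k in range(dim):
            g[k] += d[i] * strong_feats[i][k]
    return g


strong_feats = {1: [0.5, 1.0], 2: [-1.0, 2.0], 3: [0.3, -0.7], 4: [1.5, 0.2]}
genuine = {2: ([1.0], 1), 4: ([-0.5], 0)}   # weak party holds IDs 2 and 4
w_strong, w_weak = [0.1, -0.2], [0.3]
scores = {i: dot(w_strong, x) for i, x in strong_feats.items()}

# Asymmetric run over the obfuscated set {1, 2, 3, 4}.
g_asym = strong_gradient(weak_residuals({1, 2, 3, 4}, genuine, w_weak, scores),
                         strong_feats, 2)
# Symmetric reference run over the true intersection {2, 4} only.
g_sym = strong_gradient(weak_residuals({2, 4}, genuine, w_weak, scores),
                        strong_feats, 2)
```

The two gradients agree because every dummy term contributes exactly zero to the sum.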
5 Experiments
This section provides a few experiments that validate the feasibility of the APSI + AVLR protocol and compare it with the existing standard (symmetric) protocol.
5.1 Experimental Setup
We implement our APSI and AVLR algorithms in the federated learning framework FATE (https://github.com/FederatedAI/FATE). The performance of the APSI + AVLR protocol is demonstrated on the MNIST dataset (http://yann.lecun.com/exdb/mnist/), which has 60000 samples and 784 features. To adapt the dataset to our distributed setup, we manually allocate an ID to each sample, split the dataset, and assign the partitions to the weak participant and the strong participant as in Table 1.
Table 1: Partition of the dataset between the two participants, each holding 392 features.
As indicated by Table 1, the underlying federated training would be performed over the samples that both and hold. Therefore, the training performance is not expected to be as good as that of algorithms that take the whole dataset as input. However, the distributed setup in Table 1 is reasonable, since the experiments are conducted to validate the feasibility of the APSI + AVLR protocol and to compare it with the standard (symmetric) version.
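A vertical partition of this kind can be sketched as follows. The sketch uses synthetic rows rather than MNIST, and the ID ranges, the choice of which half of the columns goes to which party, and the function name `vertical_split` are assumptions for illustration; only the 392/392 feature split and the weak participant holding the labels follow the text.

```python
def vertical_split(samples, weak_ids, strong_ids, n_weak_features):
    """Split `samples` (ID -> (feature list, label)) column-wise.

    The weak party receives the first `n_weak_features` columns plus the
    label for its IDs; the strong party receives the remaining columns
    for its IDs.
    """
    weak = {i: (samples[i][0][:n_weak_features], samples[i][1])
            for i in weak_ids}
    strong = {i: samples[i][0][n_weak_features:] for i in strong_ids}
    return weak, strong


# Ten synthetic samples with 784 features each and a binary label.
samples = {i: ([float((i + k) % 7) for k in range(784)], i % 2)
           for i in range(10)}

# Hypothetical ID allocation: a small weak party and a larger strong party.
weak_part, strong_part = vertical_split(samples, range(0, 4), range(2, 10), 392)
overlap = set(weak_part) & set(strong_part)
```

Only the IDs in `overlap` have both halves of their feature vector available to the federation, which is why training runs over the intersection rather than the whole dataset.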
For the computing hardware, we use two individual machines to serve as the participants, each with 4 CPU cores and 16 GB of RAM. Both machines are located in the same region of Tencent Cloud (https://cloud.tencent.com/).
5.2 Numerical Results
In the experiments of the APSI + AVLR protocol, we adopt the fixed learning rate but various security numbers , and let the training process execute for iterations. It is worth mentioning that the standard (symmetric) model training corresponds to the case in our experiments. We plot the trajectories of training loss and AUC in Figure 5 and Figure 6, respectively. In Figure 5, the trajectories are almost the same for different s, especially for the and the case, which also holds true for the AUC trend in Figure 6. These figures validate that the APSI + AVLR protocol performs as well as its symmetric version, owing to the introduction of mathematical identities.
6 Conclusion
In this paper, we studied the privacy preservation of sample IDs in vertical federated learning. To meet the privacy protection demands of different participants, we first proposed the notion of asymmetrically vertical federated learning. We then adapted the standard private set intersection protocol to achieve the asymmetric ID alignment phase in an asymmetrically vertical federated learning system. Correspondingly, a Pohlig-Hellman realization of the adapted protocol was provided. To achieve asymmetric federated model training, we also presented a genuine with dummy approach. We illustrated its application by providing a federated logistic regression algorithm as an example. Experiments were also conducted to validate the feasibility of this approach.
References
[1] (2004) Efficient private matching and set intersection. In International Conference on the Theory and Applications of Cryptographic Techniques, pp. 1–19.
[2] (2012) Private set intersection: are garbled circuits better than custom protocols? In NDSS.
[3] (2004) Privacy-preserving inter-database operations. In International Conference on Intelligence and Security Informatics, pp. 66–82.
[4] (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
[5] (1999) Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques, pp. 223–238.
[6] (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
[7] (2014) Faster private set intersection based on OT extension. In 23rd USENIX Security Symposium (USENIX Security 14), pp. 797–812.
[8] (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19.
[9] (2019) Parallel distributed logistic regression for vertical federated learning without third-party coordinator. arXiv preprint arXiv:1911.09824.