1 Introduction
Machine learning algorithms play an essential role in extracting patterns and useful knowledge from datasets. Traditional machine learning methods are usually developed from a centralized or parallel perspective, i.e., the overall datasets are stored in one machine, or a group of computing machines at hand, and the learning objective can be efficiently achieved by the machines with their direct access to the datasets. Although many of those algorithms are well-studied and perform excellently even over petabyte-scale datasets, they still face serious difficulties when deployed in business or research settings with strong concerns about data confidentiality. For example, every day portable and wearable smart devices collect billions of users' motion and body condition records, which contain numerous valuable patterns. However, it is impossible to apply traditional machine learning algorithms for knowledge extraction, because the distributedly collected data are prohibited from being uploaded to central servers under strict privacy laws and regulations, such as the GDPR. Similar examples also exist in a variety of areas, such as finance, health, and education.
Federated learning, first proposed in 2016 [4], is a privacy-preserving machine learning approach to analyzing the patterns of distributed sensitive datasets. The notion was later expanded into a clearer and more comprehensive framework with three components: horizontal federated learning, in which participants share the same feature space; vertical federated learning, in which participants share the same ID space; and federated transfer learning, for disjoint data distributions. In this framework, every distributed participant shares encrypted messages that are computed on its individual dataset and updates its local model according to the received messages, with or without the presence of a trusted third party. However, in federated learning, and especially in the vertical scenario, it is common but unreasonable to omit investigating the information leakage of sample IDs after leveraging them to align the distributed datasets. In a standard vertical federated learning system, for example, samples held by different participants are aligned by executing a secure protocol that lets everyone know the exact intersection set. In practical applications, the participants are usually companies or institutions in competitive relations, and many clients of one company are potential advertising targets of another. On the one hand, therefore, the participants would not allow the disclosure of their sample IDs. On the other hand, there are asymmetrical federations in real life, in which a subset of the participants are small companies with strong requirements for ID privacy protection, while the others are large companies that are not much concerned about ID privacy, because their customers cover almost the whole population. Such an unbalanced setup requires a vertical federated learning system to distinguish the "weak" and the "strong" sides of the federation and take their specific privacy protection demands into consideration.
In this paper, we divide vertical federated learning into two classes, the symmetrical and the asymmetrical, in order to develop federated learning algorithms that preserve the privacy of sample IDs for the participants who indeed demand it. The contributions of this paper are summarized as follows.

We formally propose and comprehensively characterize the notion of asymmetrical vertical federated learning.

We incorporate the standard private set intersection protocol to achieve the asymmetrical ID alignment phase in an asymmetrical vertical federated learning system. In addition, we provide a Pohlig-Hellman realization of the adapted private set intersection protocol.

We present a genuine with dummy approach to achieving asymmetrical federated model training. To illustrate its application, we provide a federated logistic regression algorithm as an example. Experiments are also conducted to validate the feasibility of the approach.
The rest of the paper is organized as follows. In Section 2, we formulate the symmetrical and asymmetrical classification of vertical federated learning. We present an asymmetrical private set intersection protocol for ID alignment, together with one of its realizations, in Section 3. Section 4 provides the genuine with dummy approach to asymmetrical model training and its application to federated logistic regression, with experimental validations following in Section 5. Finally, several concluding remarks are given in Section 6.
2 Problem Definition
In this section, we revisit the notion of vertical federated learning and formally categorize it into two classes to study the privacy preservation of sample IDs.
2.1 Vertical Federated Learning
Let D = (I, X, Y) denote a complete dataset, with I, X and Y representing the sample ID space, the feature space and the label space, respectively. It was defined in [8] that vertical federated learning is conducted over two datasets D_A and D_B satisfying X_A ≠ X_B, Y_A ≠ Y_B and I_A = I_B.
In real-world applications, however, it is nearly impossible to find two original datasets collected by distributed parties that share exactly the same sample ID space. Therefore, as a preparation phase for vertical federated learning, it is necessary to introduce appropriate ID-alignment protocols that assist each party in securely identifying the shared IDs and establishing the two datasets' row mapping. We define D̄_A and D̄_B as the pre-ID-alignment datasets that correspond to D_A and D_B, respectively, with pre-ID-alignment sample ID spaces Ī_A and Ī_B. Clearly, I_A ⊆ Ī_A and I_B ⊆ Ī_B. In addition, we write U for the whole ID space that contains all possible elements of Ī_A and Ī_B. The vertical federated learning presented in [8] is then depicted in Figure 1.
2.2 Private Set Intersection
Recall that ID-alignment protocols are essential in the preparation of vertical federated model training. To achieve ID alignment, Private Set Intersection (PSI) protocols, one of the most well-studied primitives in secure multi-party computation, are usually implemented in a federated learning system. In standard PSI, each party P_i holds a set S_i containing P_i's confidential data. All parties would like to cooperatively find the intersection of the sets, while each P_i keeps the remaining elements of S_i private. Realizations of PSI protocols can be based on classical public-key cryptosystems [1, 3], oblivious transfer [7], garbled circuits [2], etc.
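As a point of reference, the functionality of standard PSI, as opposed to its security mechanism, can be stated in a few lines: both parties end up with exactly the intersection and learn nothing about the other side's non-intersecting elements. A minimal Python sketch of this ideal functionality, with made-up ID sets:

```python
# Ideal PSI functionality: both parties learn only the intersection.
# This sketch shows the input/output behavior; a real PSI protocol
# (public-key based, OT based, or garbled-circuit based) computes the
# same result without either party revealing its full set.

def ideal_psi(set_a, set_b):
    """Return the intersection that both parties are allowed to learn."""
    return set_a & set_b

ids_a = {"u01", "u02", "u03", "u07"}   # party A's confidential IDs
ids_b = {"u02", "u03", "u05", "u09"}   # party B's confidential IDs

shared = ideal_psi(ids_a, ids_b)
print(sorted(shared))  # ['u02', 'u03']
```

The protocols cited above realize this functionality under different cryptographic assumptions and cost profiles.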
2.3 Symmetrical and Asymmetrical
Let us call A and B the parties who own D̄_A and D̄_B, respectively. As can be seen from Figure 1, A and B have equally powerful positions in the sense that the numbers of elements in their pre-ID-alignment sample ID spaces share the same order of magnitude and both sets play a major role in the whole ID space U. In fact, even if A picks an ID from U uniformly at random, there is a non-negligible probability that it belongs to Ī_B. In such a "federation of the strong", neither party would gain much knowledge by executing PSI protocols to obtain the intersection Ī_A ∩ Ī_B, and thus it bothers neither party to let the other know which IDs it possesses. Since finally both parties symmetrically obtain the intersection for model training, we term this scenario symmetrical vertical federated learning.
The opposite scenario is that one party, say A without loss of generality, holds an ID set Ī_A that is orders of magnitude smaller than U, while B still holds an Ī_B comparable to U. It is common that the federated learning system to be established serves A's learning task, and thus the labels are provided by A. This vertical data distribution is shown in Figure 2. Clearly, there is only a sufficiently small probability that an arbitrary ID picked uniformly at random from U belongs to Ī_A. Therefore, we say A is at the weak side, in the sense that each sample ID in Ī_A is regarded as sensitive information, whose privacy would be severely compromised through the revelation of Ī_A ∩ Ī_B by executing standard PSI protocols. In such a "federation of the weak and the strong", it is necessary to asymmetrically protect the ID privacy of the weak party in the ID alignment phase. This scenario is termed asymmetrical vertical federated learning.
It is reasonable not to analyze the "federation of the weak" scenario, because we may shrink the whole space U to Ī_A ∪ Ī_B so that this scenario reduces to the "federation of the strong" one. Throughout this paper, we impose the following assumption.
Assumption 1
The order of magnitude of |U| equals the larger of the orders of magnitude of |Ī_A| and |Ī_B|.
It is evident that if Assumption 1 holds, either Ī_A or Ī_B has the same order of magnitude as U. Now, by representing a positive number x as x = a × 10^b with 1 ≤ a < 10 and naming b the order of magnitude of x, written ord(x), we provide the following precise definition.
Definition 1
Under Assumption 1, vertical federated machine learning can be classified into two categories based on the participating parties' pre-ID-alignment sample ID spaces.

It is called Symmetrically Vertical Federated Learning (SVFL) if ord(|Ī_A|) = ord(|Ī_B|), and Asymmetrically Vertical Federated Learning (AVFL) if ord(|Ī_A|) ≠ ord(|Ī_B|).
Since Assumption 1 holds in Definition 1, there always exist a weak and a strong participant in AVFL. Note that, for simplicity of demonstration, we base the analysis above and in the rest of the paper on the two-party federation case. In fact, the analysis trivially extends to the multi-party case by asymmetrically protecting the sample ID privacy of the weak party.
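The classification in Definition 1 reduces to comparing two integers. The small helper below is an illustrative sketch under the assumption that the decision hinges only on the order of magnitude of the two pre-ID-alignment set sizes; the cardinalities in the usage lines are made up:

```python
import math

def order_of_magnitude(x):
    """The b in x = a * 10**b with 1 <= a < 10, for positive x."""
    return math.floor(math.log10(x))

def classify(n_ids_a, n_ids_b):
    """SVFL vs. AVFL in the sense of Definition 1 (sketch)."""
    if order_of_magnitude(n_ids_a) == order_of_magnitude(n_ids_b):
        return "SVFL"   # federation of the strong: comparable ID sets
    return "AVFL"       # weak/strong federation: asymmetric ID sets

print(classify(30_000, 80_000))    # SVFL: both sizes are of order 4
print(classify(3_000, 5_000_000))  # AVFL: orders 3 vs. 6
```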
3 Asymmetrical ID Alignment
ID alignment is typically the first stage of a vertical federated learning workflow. In this section, we adapt the standard PSI protocol to achieve asymmetrical ID alignment and provide a realization using a classical cryptosystem.
3.1 Asymmetrical PSI Protocols
Recall that in SVFL's ID alignment phase, the execution of standard PSI protocols not only lets each party gain knowledge about the samples they both hold, but also provides a tag that efficiently links the distributed sample fragments together and paves the way for the follow-up federated model training phase. To realize a proper ID-alignment phase in AVFL, one has to adapt standard PSI protocols so that they satisfy:

(i) The exact intersection set Ī_A ∩ Ī_B is kept private from the strong participant.

(ii) In the federated model training phase, the distributed samples with IDs in Ī_A ∩ Ī_B are still alignable.
Let Ī_A and Ī_B denote the pre-ID-alignment sample ID sets held by the weak participant A and the strong participant B, respectively; that is, ord(|Ī_A|) < ord(|Ī_B|).
A possible ID-alignment approach that meets requirements (i) and (ii) above is to use the variant of PSI protocols that we present below.
Definition 2
Asymmetrical PSI (APSI) protocols, as a variant of PSI protocols, yield an obfuscated set S at each party satisfying Ī_A ∩ Ī_B ⊆ S ⊆ Ī_B.
In addition, only the weak participant further knows the true intersection Ī_A ∩ Ī_B.
The output of such APSI protocols is illustrated in Figure 3.
3.2 A PohligHellman Realization
We now provide a realization of APSI protocols based on the Pohlig-Hellman encryption. Define Z_n = {0, 1, …, n − 1} and Z_n* as the set of integers in Z_n that are coprime to n. Let (Z_p*, ·) denote the multiplicative group modulo a prime p. Then the well-known Pohlig-Hellman encryption scheme is described by the following three components.

(Key generation) Select a prime number p such that every plaintext is an element of Z_p*, where p − 1 has at least one large prime factor. For example, select p = 2q + 1 with q also being a prime number. Then select e ∈ Z_{p−1}* and compute d = e^{−1} mod (p − 1). Finally, one reveals p as public knowledge and keeps (e, d) as the key to this symmetrical cryptosystem.

(Encryption) Encrypt the plaintext m ∈ Z_p* by c = E_e(m) = m^e mod p.

(Decryption) Decrypt the ciphertext c by m = D_d(c) = c^d mod p.
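The three components above can be checked numerically. The sketch below uses a toy safe prime that is far too small for real security; the parameters are illustrative only:

```python
# Toy Pohlig-Hellman exponentiation cipher over Z_p^*.
# p = 2q + 1 with q = 1019 prime; real deployments need a large prime
# (e.g. 2048 bits) so that p - 1 has a large prime factor.
p = 2039

def keygen(e, p):
    """Derive the decryption exponent; requires gcd(e, p - 1) == 1."""
    return e, pow(e, -1, p - 1)

def enc(m, e, p):
    return pow(m, e, p)          # c = m^e mod p

def dec(c, d, p):
    return pow(c, d, p)          # m = c^d mod p

e1, d1 = keygen(7, p)            # one party's symmetric key
e2, d2 = keygen(11, p)           # the other party's symmetric key

m = 1234
assert dec(enc(m, e1, p), d1, p) == m                          # round trip
assert enc(enc(m, e1, p), e2, p) == enc(enc(m, e2, p), e1, p)  # commutes
```

The last assertion is the commutative property that the APSI protocol below relies on.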
It is straightforward to show the commutative property of the Pohlig-Hellman encryption by checking E_{e_1}(E_{e_2}(m)) = m^{e_1 e_2} mod p = E_{e_2}(E_{e_1}(m)) for any m ∈ Z_p*. We now present an APSI protocol based on the Pohlig-Hellman encryption in Algorithm 1.
Input: The strong participant B holds Ī_B. The weak participant A holds Ī_A. Security number λ ∈ [0, 1].
Output: B only obtains the obfuscated set S satisfying Ī_A ∩ Ī_B ⊆ S ⊆ Ī_B, while A obtains both S and the true intersection Ī_A ∩ Ī_B.
Security Analysis. In Algorithm 1, the information exchanged by step 6 consists of Pohlig-Hellman ciphertexts, whose security is evidently guaranteed by the encryption scheme. By step 8, the strong participant B receives the encrypted obfuscated set but cannot distinguish which elements belong to the encrypted true intersection set. In fact, B cannot even obtain any plaintext, because the elements it receives are still encrypted with A's key. By step 11, since the cooperatively decrypted message is only the obfuscated set S, B's remaining private information is protected against the weak participant A. Clearly, in step 12, only S is revealed to B, with the true intersection Ī_A ∩ Ī_B kept private by A.
It can be noted that the security number λ governs the cardinality ratio of the obfuscated intersection set S to the true intersection set Ī_A ∩ Ī_B. As can be directly computed, the probability that a uniformly random element of S picked by B belongs to Ī_A ∩ Ī_B is |Ī_A ∩ Ī_B| / |S|. Clearly, λ = 0 yields S = Ī_A ∩ Ī_B, in which case A and B both obtain the true intersection set and AVFL degenerates to SVFL. As λ goes up from zero to one, this probability exponentially decreases, and it thereby becomes more difficult for B to identify Ī_A ∩ Ī_B from S. When λ reaches its maximum of one, the obfuscated set becomes S = Ī_B, i.e., the whole ID space of B is used for obfuscation, and B cannot gain any knowledge by executing Algorithm 1.
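The overall flow can be sketched end to end with a commutative exponentiation cipher. The sketch below is a plausible single-process reading of such a protocol, not the paper's exact Algorithm 1 nor FATE's implementation: both parties' steps are collapsed into one function, messages that would cross the network are local variables, IDs are assumed to be already encoded as elements of Z_p*, and the rule tying λ to the fraction of unmatched IDs used as padding is an assumption.

```python
import random

p = 10007  # toy safe prime, p = 2 * 5003 + 1; real use needs a large prime

def enc(x, e):
    return pow(x, e, p)

def apsi(ids_weak, ids_strong, lam, e_weak=7, e_strong=11):
    # Each party encrypts its IDs with its own key, the sets are
    # exchanged, and each is re-encrypted under the other party's key;
    # the order of exponentiation commutes.
    weak_twice = {i: enc(enc(i, e_weak), e_strong) for i in ids_weak}
    strong_twice = {i: enc(enc(i, e_strong), e_weak) for i in ids_strong}
    # Weak side: matching doubly encrypted values reveal the true
    # intersection, to the weak side only.
    strong_cts = set(strong_twice.values())
    true_inter = {i for i, c in weak_twice.items() if c in strong_cts}
    # Weak side pads the matched ciphertexts with a lam-fraction of the
    # unmatched strong ciphertexts: lam = 0 means no obfuscation,
    # lam = 1 uses all of the strong side's IDs.
    matched = {c for c in weak_twice.values() if c in strong_cts}
    unmatched = sorted(strong_cts - matched)
    padded = matched | set(random.sample(unmatched, round(lam * len(unmatched))))
    # Strong side maps the padded ciphertexts back to its own IDs,
    # obtaining the obfuscated set S with true_inter <= S <= ids_strong.
    obfuscated = {i for i, c in strong_twice.items() if c in padded}
    return true_inter, obfuscated
```

For example, with `ids_weak = {3, 5, 9, 12}` and `ids_strong = {5, 9, 14, 20, 31, 44}`, the weak side recovers exactly {5, 9}, while the strong side's set S interpolates between {5, 9} at λ = 0 and the full ID set at λ = 1.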
4 Asymmetrical Federated Model Training
In this section, we investigate the asymmetrical federated model training process and propose a novel and general approach to training a model in an asymmetrical fashion that performs as well as its symmetrical counterpart. We also provide an application of this approach to an existing federated learning algorithm.
4.1 Genuine with Dummy Approach
Using the APSI protocols given in Definition 2, the weak participant A obtains the true intersection set Ī_A ∩ Ī_B, which is a subset of the obfuscated intersection set S that the strong participant B knows. As shown in Figure 3, the vertical federated learning domain now contains margins, i.e., all labels and some features are missing for the samples with IDs in S \ (Ī_A ∩ Ī_B). Indeed, federated transfer learning [8] can be used to fill in the margins via the feature-representation-transfer approach [6]. Nevertheless, in many practical applications, such as financial risk management, it is normal to train a machine learning model on the small but genuinely original data in Ī_A ∩ Ī_B, in order to avoid misjudging dishonest conduct. In these areas, the learned features and labels for the margin samples, even if not "negatively transferred", may lead to undesirably strict or loose risk control strategies. Therefore, it is necessary to design asymmetrical model training schemes that take the distributed output of APSI protocols and yield the same, or almost the same, result as SVFL. Based on the standard vertical model training in [8], we now present a Genuine with Dummy (GWD) approach to achieving asymmetrical model training as follows. Note that a trusted third party C is introduced as a secure coordinator.

C generates a public-key cryptosystem and sends the public key to A and B.

A and B exchange intermediate variables to cooperatively compute the gradient and loss. The weak participant A normally executes the computation rule for the samples in Ī_A ∩ Ī_B, i.e., the genuine, but sets the variables that correspond to the samples in S \ (Ī_A ∩ Ī_B), i.e., the dummy, to specific mathematical identities so that their existence does not affect the relevant computed results. The identities can be, for example, zero in addition, one in multiplication, or the identity map in function composition.

A and B turn to C for gradient and loss decryption services, and update their local models.
As we can see, the central idea of the GWD approach is to let the weak participant execute the normal protocol for the genuine samples, while mathematically muting the dummy samples that it does not actually hold before sending them to the strong participant. To keep the strong participant unaware of the existence of the dummy samples, a potential method is to implement a semantically secure encryption scheme, such as the Paillier cryptosystem [5], which prevents the participant from efficiently distinguishing the identities within a group of normal variables. In addition, it is clear that asymmetrical model training achieved with the GWD approach exhibits exactly the same performance as standard (symmetrical) model training, because the introduced identities strictly guarantee invariant intermediate results at every step.
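The invariance argument above can be demonstrated in a few lines. In this sketch the encryption layer is omitted: in a real system each scalar below would be sent under a semantically secure additively homomorphic scheme (e.g. Paillier [5]) so the strong participant cannot tell the zero dummies from genuine values. IDs and numbers are illustrative.

```python
# GWD sketch: the weak side sends genuine scalars for genuine IDs and
# the additive identity (0) for dummy IDs, so the strong side's
# aggregate over the obfuscated set is unchanged.
true_inter = {"u02", "u03"}                 # known only to the weak side
obfuscated = {"u02", "u03", "u05", "u09"}   # the set S the strong side sees

genuine_scalars = {"u02": 0.5, "u03": -0.75}   # e.g. per-sample residuals

# Weak side: genuine value for genuine IDs, zero for the dummies.
sent = {i: genuine_scalars.get(i, 0.0) for i in obfuscated}

# Strong side: aggregates over all of S, oblivious to which rows are dummies.
features = {"u02": 1.0, "u03": 2.0, "u05": 5.0, "u09": 3.0}
gradient = sum(sent[i] * features[i] for i in obfuscated)

# Identical to the symmetrical computation over the true intersection alone.
assert gradient == sum(genuine_scalars[i] * features[i] for i in true_inter)
print(gradient)  # -1.0
```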
An illustration of the GWD architecture is provided in Figure 4, which depicts the federated execution process of a general subroutine. The weak participant would like to send encrypted messages to the strong participant and then expects a response containing the execution result. However, direct transmission would give away the genuine IDs. Instead, the weak participant sends the genuine messages along with the mathematical identities corresponding to the dummy samples, and finally receives a result identical to what it expects. Meanwhile, the strong participant performs the computation while remaining oblivious to the genuine IDs.
4.2 Asymmetrical Vertical Logistic Regression
We now take the coordinator-free federated logistic regression training presented in [9] as an example and adapt it into an asymmetrical training protocol using the GWD approach. Let [[m]] denote the ciphertext of a plain message m. The adapted Asymmetrical Vertical Logistic Regression (AVLR) training protocol is presented in Algorithm 2.
Input: The strong participant B holds the sample set {x_i^B : i ∈ S}. The weak participant A holds the sample set {(x_i^A, y_i) : i ∈ Ī_A ∩ Ī_B}. Learning rate η.
Output: A and B learn the weights w_A and w_B, respectively, such that the joint weight (w_A, w_B) is the global optimum of the model.
In Algorithm 2, the central subroutine is to let the weak participant A share the encrypted scalars [[d_i]], based on which the strong participant B evaluates its local gradient. To implement the GWD approach, A computes d_i for i ∈ Ī_A ∩ Ī_B based on the true labels, but sets [[d_i]] = [[0]] for all i ∈ S \ (Ī_A ∩ Ī_B), zero being the additive identity. Because the encryption preserves semantic security, B can neither identify Ī_A ∩ Ī_B from the received set of [[d_i]]'s, nor distinguish which samples are the dummies. Since B performs additions to compute its gradient, the presence of the dummy samples' [[0]]'s has no effect on the result. In this way, the dataset over which the federation of A and B trains its model is inherently the sample set with IDs in Ī_A ∩ Ī_B. This guarantees that the asymmetrical vertical training yields the same result as the standard (symmetrical) vertical training.
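The effect of the zero dummies on B's gradient can be checked directly. The sketch below is a toy version of one AVLR gradient step with the homomorphic encryption replaced by plaintext arithmetic and both parties' computations collapsed into one place for readability; the data, weights, and IDs are made up.

```python
import math

# id -> (x_A, x_B, y); y is None for a dummy (B-only) row in S \ (I_A n I_B).
samples = {
    "u02": (0.5, 1.0, 1),
    "u03": (-1.0, 2.0, 0),
    "u05": (None, 5.0, None),
}
w_a, w_b = 0.3, -0.2
true_inter = {i for i, (_, _, y) in samples.items() if y is not None}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A's side: d_i = sigmoid(w . x_i) - y_i for genuine rows, 0 for dummies.
d = {}
for i, (xa, xb, y) in samples.items():
    if i in true_inter:
        # In the protocol A would combine its own score with B's
        # encrypted partial score; here the full score is in the clear.
        d[i] = sigmoid(w_a * xa + w_b * xb) - y
    else:
        d[i] = 0.0               # additive identity for the dummy row

# B's local gradient over the whole obfuscated set S ...
grad_b = sum(d[i] * samples[i][1] for i in samples)
# ... equals the symmetrical gradient over the true intersection alone.
assert math.isclose(grad_b, sum(d[i] * samples[i][1] for i in true_inter))
```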
5 Experiments
This section provides a few experiments that validate the feasibility of the APSI + AVLR protocol and compare it with the existing standard (symmetrical) protocol.
5.1 Settings
We implement our APSI and AVLR algorithms in the federated learning framework FATE (https://github.com/FederatedAI/FATE). The performance of the APSI + AVLR protocol is demonstrated on the dataset MNIST (http://yann.lecun.com/exdb/mnist/), which has 60000 samples and 784 features. To adapt the dataset to our distributed setup, we manually allocate an ID to each sample, then split the data and assign the partitions to the weak participant A and the strong participant B as in Table 1.
Table 1: Partition of the MNIST dataset between the participants.

                 392 Features    392 Features
  10000 Samples  A (weak)        B (strong)
  50000 Samples  (Abandoned)     B (strong)
As indicated by Table 1, the underlying federated training is performed over the samples that both A and B hold. Therefore, the training performance is expected not to be as good as that of algorithms that take the whole dataset as input. However, it is fairly reasonable to use the distributed setup in Table 1, since the experiments are conducted for the purpose of validating the feasibility of the APSI + AVLR protocol and comparing it with the standard (symmetrical) version.
As for the computing hardware, we use two individual machines to serve as the participants, each with 4 CPU cores and 16 GB RAM. The machines are located in the same region of Tencent Cloud (https://cloud.tencent.com/).
5.2 Numerical Results
In the experiments on the APSI + AVLR protocol, we adopt a fixed learning rate but various security numbers λ, and let the training process execute for a fixed number of iterations. It is worth mentioning that the standard (symmetrical) model training corresponds to the λ = 0 case in our experiments. We plot the trajectories of training loss and AUC in Figure 5 and Figure 6, respectively. In Figure 5, the trajectories are almost the same for the different values of λ, which also holds true for the AUC trends in Figure 6. These figures validate that the APSI + AVLR protocol performs as well as its symmetrical version, owing to the introduction of mathematical identities.
6 Conclusions
In this paper, we studied the privacy preservation of sample IDs in vertical federated learning. To meet the privacy protection demands of different participants, we first proposed the notion of asymmetrical vertical federated learning. We then adapted the standard private set intersection protocol to achieve the asymmetrical ID alignment phase in an asymmetrical vertical federated learning system. Correspondingly, a Pohlig-Hellman realization of the adapted protocol was provided. To achieve asymmetrical federated model training, we also presented a genuine with dummy approach and illustrated its application with a federated logistic regression algorithm as an example. Experiments were also conducted to validate the feasibility of this approach.
References
 [1] (2004) Efficient private matching and set intersection. In International Conference on the Theory and Applications of Cryptographic Techniques, pp. 1–19.
 [2] (2012) Private set intersection: are garbled circuits better than custom protocols? In NDSS.
 [3] (2004) Privacy-preserving inter-database operations. In International Conference on Intelligence and Security Informatics, pp. 66–82.
 [4] (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
 [5] (1999) Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques, pp. 223–238.
 [6] (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
 [7] (2014) Faster private set intersection based on OT extension. In 23rd USENIX Security Symposium (USENIX Security 14), pp. 797–812.
 [8] (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19.
 [9] (2019) Parallel distributed logistic regression for vertical federated learning without third-party coordinator. arXiv preprint arXiv:1911.09824.