Machine learning uses algorithms to perform tasks such as prediction or classification. Classical machine learning approaches require centralizing the training data on a single machine or in a cluster. Centralized machine learning is, by far, the most common architecture. However, this requirement prevents organizations in the financial sector (e.g., insurance companies and banks) from employing machine learning techniques: for confidentiality reasons, these organizations cannot share their data or store it in the cloud, and thus cannot benefit from centralized machine learning together with other organizations.
Data privacy is a fundamental challenge for many machine learning applications that depend on data aggregation across different entities, especially in the financial sector. The trade-off between data privacy and learning on aggregated data creates a collaborative setting. Decentralized machine learning aims to keep data safe and to ensure privacy under certain conditions. Federated Learning (or Collaborative Machine Learning) McMahan et al. (2017); Yang et al. (2019), proposed by Google, offers a decentralized and confidential approach to machine learning. By training algorithms on aggregated model-parameter updates instead of raw data, federated learning brings data network effects to sectors where data cannot be transferred to third parties for confidentiality reasons. In fact, federated learning plays a similar role to data-parallel distributed machine learning: it is, in effect, a distributed version of deep neural networks in which the data can be partitioned and stored on multiple machines. However, federated learning only supports horizontal parallelism, which splits the data by quantity, i.e., different subsets of samples go into the parallel computation. Its learning scheme therefore cannot handle vertical parallelism. Additionally, recent research Zhu et al. (2019) has shown that it is possible to recover private training data from publicly shared gradients.
In this paper, we investigate a learning technique that allows two parties to collaboratively train a model while one party (the active party) holds the data and the other party (the passive party) holds the corresponding labels. We term this Asymmetrically Collaborative Machine Learning, since the parties actively interact by sharing information while taking on asymmetric roles. In contrast to federated learning and collaborative machine learning, asymmetrically collaborative machine learning transforms machine learning algorithms into vertical parallelism, which splits the data based on one or more specific internal characteristics of the data.
Technically, the straightforward solution for this scheme is to learn on encrypted data. There are two main ways to learn a model on encrypted data: differential privacy Dwork et al. (2014); Abadi et al. (2016) and homomorphic encryption Acar et al. (2018). Differential privacy injects noise into query results to prevent inferring information about any specific record. However, it requires careful calibration to balance privacy and model usability. Further, private attributes remain in plaintext, which is unacceptable for the financial sector, so users may still have security concerns. A more promising solution comes from recent advances in homomorphic encryption, which allows users to encrypt data with a public key and offload computation to the cloud (or other parties). The cloud computes on the encrypted data and generates encrypted results; without the secret key, the cloud simply serves as a computation platform and cannot access any user information. However, homomorphic encryption is extremely costly in computation, and thus unsuitable for high-dimensional data and expensive machine learning methods (e.g., deep neural networks).
In this paper, we propose a novel privacy-preserving architecture that lets two parties collaboratively train a neural network while one party holds the data and the other holds the labels. Intuitively, we avoid learning on encrypted data directly: we decompose a Deep Neural Network (DNN) into two components, the 'feature extraction' part and the 'classifier' part. In the 'feature extraction' part, a locally unencrypted deep neural network extracts compact features from unencrypted data. In the 'classifier' part, a shallow neural network learns a classifier on the encrypted features coming from the feature-extraction component. However, this locally unencrypted and locally encrypted design raises a new privacy challenge: unlike learning on encrypted data, the 'classifier' must calculate the gradient of its inputs and pass it back to the 'feature extraction' part, so we also need to avoid information leakage through this gradient, a risk that has been demonstrated in Zhu et al. (2019). We take a two-layer neural network as an example, decompose the forward and backward propagation into four steps, and propose a protocol that avoids information leakage in all of them. Our contributions include:
We define a new scheme of collaborative machine learning, called asymmetrically collaborative machine learning, which is promising in many real-world applications.
We propose a novel privacy-preserving, computationally efficient, homomorphic encryption-based backpropagation algorithm for asymmetrically collaborative machine learning.
An extensive empirical evaluation demonstrates that the proposed method achieves a 100-times speed-up compared with methods based on learning on encrypted data.
2 Related Work
Traditional encryption methods, such as AES (Advanced Encryption Standard) Daemen and Rijmen (1999), are extremely fast and allow data to be stored conveniently in encrypted form. However, performing even simple analytics on the encrypted data (i.e., ciphertexts) is costly and challenging: the cloud server needs access to the secret key, or the owner of the data needs to download, decrypt, and operate on the data locally, both of which raise security concerns. Homomorphic Encryption (HE) makes it possible to encrypt data before sending it to a cloud computing platform while still allowing operations such as search, sort, and edit on the ciphertexts. This property avoids shipping data back and forth between the owner and the cloud for decryption. HE that supports arbitrary functions on ciphertexts is known as Fully Homomorphic Encryption (FHE) Gentry (2009), while Partially Homomorphic Encryption (PHE) Damgård et al. (2012); Juvekar et al. (2018) covers encryption schemes that are homomorphic with respect to only one operation (e.g., only addition or only multiplication, but not both). However, FHE is extremely costly in computation, making it impractical for some machine learning methods (e.g., high-dimensional linear models and neural networks). Paillier Paillier (1999), a well-known PHE scheme, supports an unlimited number of additions between ciphertexts and multiplication between a ciphertext and a scalar constant. In other words, given $E(a)$ and $E(b)$, one cannot obtain $E(a \cdot b)$; one can only compute $E(a) \oplus E(b)$, which equals $E(a + b)$. Given $E(a)$ and a plaintext scalar $b$, one can compute $E(a) \otimes b$, which equals $E(a \cdot b)$; note that $b$ in this case is not encrypted. Here $\oplus$ denotes homomorphic addition of ciphertexts, $\otimes$ denotes homomorphic multiplication between a ciphertext and a scalar constant, and $E(x)$ denotes the ciphertext of plaintext $x$.
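To illustrate these properties, the following is a minimal toy Paillier implementation in pure Python. It is for exposition only (tiny primes, no security hardening); a real deployment should use a vetted library such as python-paillier with keys of 2048 bits or more. In this scheme, homomorphic addition of ciphertexts is multiplication modulo $n^2$, and multiplication by a plaintext scalar is exponentiation of the ciphertext:

```python
import math
import random

def keygen(p=293, q=433):
    # Tiny demo primes; real keys use primes of >= 1024 bits each.
    n = p * q
    n2 = n * n
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                      # standard generator choice
    # mu = (L(g^lam mod n^2))^{-1} mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:     # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

pub, priv = keygen()
n2 = pub[0] ** 2
a, b = encrypt(pub, 17), encrypt(pub, 25)
assert decrypt(priv, a * b % n2) == 42   # E(17) (+) E(25) = E(42)
assert decrypt(priv, pow(a, 3, n2)) == 51  # E(17) (*) 3 = E(51)
```

Note that ciphertext-times-ciphertext decrypts to the *sum* of the plaintexts, which is exactly why two encrypted factors (as in $[W_2][a]$ later in the paper) are incompatible with this scheme.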
GELU-Net Zhang et al. (2018) proposes a novel privacy-preserving architecture based on Paillier. The main difference from our method is the setting: GELU-Net assumes a single party holds both the data and the labels, whereas in this paper we focus on the privacy issue that arises when the data and the labels are distributed across different parties.
3.1 Applying Neural Networks to Collaborative Machine Learning
To solve this problem, our main strategy is to make a deep neural network learn partially on unencrypted data and partially on encrypted features, so as to reduce the computation cost of the stage that learns on encrypted data. More specifically, a deep neural network is carefully partitioned into two parts: the 'feature extraction' part and the 'classifier' (or 'regression') part. The 'feature extraction' part performs dimension reduction in the plaintext and produces compact features for the 'classifier'. The 'classifier' then learns a simplified model (e.g., logistic regression) on the encrypted features generated by the 'feature extraction' part. This setting is insensitive to the input dimension and the network architecture: the 'feature extraction' part contains the majority of the neural network but uses no homomorphic arithmetic, while the computational complexity of the 'classifier' part depends on the dimension of the features generated by the 'feature extraction' part, which is far more compact than the raw input. Take a two-layer neural network as an example, one layer for 'feature extraction' and the other for the 'classifier'. The corresponding training objective is shown in Eq. 1 (for simplicity, the bias terms are omitted):

$$\min_{W_1, W_2} \sum_{(x_i, y_i) \in D} \mathcal{L}\big(\mathrm{softmax}(W_2\,\sigma(W_1 x_i)),\, y_i\big) \qquad (1)$$
where $D$ is the dataset, and $x_i$ and $y_i$ represent an input data sample and its corresponding target (label). $\sigma$ is the ReLU function Glorot et al. (2011), and $W_1$ and $W_2$ are the parameters of the two layers, respectively; the input samples have dimension $d$, the hidden layer has dimension $h$, and the output dimension equals the number of categories. Assuming $\mathcal{L}$ is the cross-entropy loss LeCun et al. (2015) and $\eta$ is the learning rate, we can decompose the forward and backward propagation into two parts each, giving the following four steps:
Step 1 The forward propagation on the active party: the active party feeds the data $x$ into its neural net ('feature extraction'), computing the activations $a = \sigma(W_1 x)$. The output activations $a$ are then sent to the passive party.
Step 2 The forward propagation on the passive party: the passive party propagates the received activations through its neural net, computing the weighted sum $z = W_2 a$ and the cross-entropy loss $\ell = \mathcal{L}(\mathrm{softmax}(z), y)$.
Step 3 The backward propagation on the passive party: the passive party computes the required gradients $\frac{\partial \ell}{\partial W_2}$ and $\frac{\partial \ell}{\partial a}$, updates its parameters via $W_2 \leftarrow W_2 - \eta \frac{\partial \ell}{\partial W_2}$, and sends $\frac{\partial \ell}{\partial a}$ back to the active party.
Step 4 The backward propagation on the active party: the active party receives the gradient $\frac{\partial \ell}{\partial a}$ from the passive party, computes the gradient $\frac{\partial \ell}{\partial W_1}$, and updates its parameters via $W_1 \leftarrow W_1 - \eta \frac{\partial \ell}{\partial W_1}$.
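The four steps above can be sketched in plaintext (encryption omitted) with a minimal NumPy implementation; the dimensions and learning rate below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, lr = 8, 4, 3, 0.05
W1 = rng.normal(scale=0.5, size=(d_hid, d_in))   # active party's layer
W2 = rng.normal(scale=0.5, size=(d_out, d_hid))  # passive party's layer
x = rng.normal(size=d_in)                        # sample held by active party
y = np.array([0.0, 1.0, 0.0])                    # one-hot label, passive party

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def loss_fn(W1, W2):
    # Full plaintext forward pass, used only to check training progress.
    a = np.maximum(W1 @ x, 0.0)
    return -np.log(softmax(W2 @ a) @ y)

loss_before = loss_fn(W1, W2)

# Step 1: active party forward; activations a would be sent (encrypted)
h = W1 @ x
a = np.maximum(h, 0.0)                 # ReLU
# Step 2: passive party forward and softmax prediction
z = W2 @ a
p = softmax(z)
# Step 3: passive party backward: dL/dz, dL/dW2, dL/da; update W2
dz = p - y                             # gradient of cross-entropy w.r.t. z
dW2 = np.outer(dz, a)
da = W2.T @ dz                         # sent back to the active party
W2 = W2 - lr * dW2
# Step 4: active party backward: chain through ReLU, update W1
dh = da * (h > 0)
W1 = W1 - lr * np.outer(dh, x)

assert loss_fn(W1, W2) <= loss_before  # one descent step did not increase loss
```

This is exactly split learning on a two-layer network; Secs. 3.2-3.5 describe how each exchanged quantity is encrypted or masked.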
To fulfil the requirement that neither the data nor the labels can be obtained or inferred by the other side, we encrypt the intermediate results exchanged between the two parties and carefully design a privacy-preserving back-propagation that is compatible with PHE in the setting of asymmetrically collaborative machine learning. We detail the privacy issues and the algorithm in the following sections.
3.2 Secure Forward Propagation on the active party
The secure forward propagation on the active party is similar to the normal forward propagation used in training neural nets. The only difference is that the last-layer activations (outputs) need to be encrypted with PHE; we denote the additive homomorphic encryption of the activations $a$ as $[a]$. Since these activations will be transmitted to the passive party, leaving them in plaintext would lead to information leakage: the passive party or an attacker could collect these activations as meaningful features for later use, and potentially infer personal information from them. Thus, the active party encrypts the activations to $[a]$ before sending them to the passive party.
3.3 Secure Forward Propagation on the passive party
After receiving the encrypted activations $[a]$, the passive party continues the forward propagation, as shown in Step 2 of Sec. 3.1: it calculates the weighted sum $[z] = W_2 [a]$ and applies softmax. However, the non-linearity cannot be computed on the ciphertext. One solution is to transmit the weighted sum back to the active party, which decrypts it and computes the softmax on the plaintext Zhang et al. (2018). But directly sending the weighted sum without any protection leaks the prediction to the active party: the active party could use activation-prediction pairs to learn the classifier part held by the passive party, thereby learning the passive party's model weights and approximating the labels. To prevent this, instead of transmitting the weighted sum directly, the passive party injects random noise $\epsilon$ into it, sending $[z + \epsilon]$. The noise hides the real weighted sum, which prevents the active party from accumulating activation-prediction pairs to infer the weights of the passive party's neural network. Finally, the active party decrypts the noisy weighted sum and sends the decrypted $z + \epsilon$ to the passive party, which removes the injected noise and computes the prediction (the softmax result).
Another problem here is that the passive party then holds both $z$ and $W_2$, and can easily recover the activations $a$ via linear regression. To prevent this, the passive party should use noisy weights $\widetilde{W}_2 = W_2 + \Delta$ to calculate the weighted sum $[\tilde{z}] = \widetilde{W}_2 [a]$. Note that the noise $\Delta$ is generated by the active party; we discuss how $\Delta$ is injected into $W_2$ in Sec. 3.4. Following the same procedure, the passive party still needs to add its own noise $\epsilon$ to $[\tilde{z}]$ and send $[\tilde{z} + \epsilon]$ back to the active party, to avoid computing the non-linearity homomorphically. To perform the correct forward propagation, the active party cancels the weight noise via $z + \epsilon = \tilde{z} + \epsilon - \Delta a$ and sends $z + \epsilon$ to the passive party, which removes $\epsilon$ and computes the true softmax output with respect to $z$. Since the passive party only observes the noisy weights $\widetilde{W}_2$ together with the true $z$, it cannot infer the activations $a$. The whole forward propagation is thus performed in a privacy-preserving manner.
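The masking algebra of this round trip can be checked numerically. The sketch below uses plaintext NumPy arrays in place of ciphertexts (the homomorphic operations are simulated), with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hid, d_out = 4, 3
a = rng.normal(size=d_hid)                # activations (known to active party)
W2 = rng.normal(size=(d_out, d_hid))      # true classifier weights
Delta = rng.normal(size=(d_out, d_hid))   # weight noise, known only to active
W2_noisy = W2 + Delta                     # what the passive party stores
eps = rng.normal(size=d_out)              # passive party's one-time mask

# Passive party: masked weighted sum, computed homomorphically on [a]
z_tilde_masked = W2_noisy @ a + eps
# Active party: decrypts and cancels the weight noise it generated
z_masked = z_tilde_masked - Delta @ a
# Passive party: removes its own mask and recovers the true weighted sum
z = z_masked - eps
assert np.allclose(z, W2 @ a)             # softmax can now run on the true z
```

Neither intermediate value equals $W_2 a$: the active party only ever sees masked sums, and the passive party only ever holds noisy weights.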
3.4 Secure Backward Propagation on the passive party
During the backpropagation, we need to compute two gradients: the gradient $\frac{\partial \ell}{\partial W_2} = \frac{\partial \ell}{\partial z} a^{\top}$ and the gradient $\frac{\partial \ell}{\partial a} = W_2^{\top} \frac{\partial \ell}{\partial z}$. Note that these two gradients are linear transformations of either $a$ or $W_2$, so both the active party and the passive party could derive what they want via regression; careless gradient updating may therefore cause a significant information leak. In backpropagation, the passive party computes the gradients $\frac{\partial \ell}{\partial z}$, $[\frac{\partial \ell}{\partial W_2}] = \frac{\partial \ell}{\partial z} [a]^{\top}$, and $[\frac{\partial \ell}{\partial a}]$. Note that the gradient of the weights is in encrypted form, so after the update $W_2 \leftarrow W_2 - \eta \frac{\partial \ell}{\partial W_2}$ the weights themselves would be in encrypted form. In the forward propagation of the next iteration, there would then be two encrypted quantities in the weighted sum $[W_2][a]$, which is incompatible with PHE. To avoid this situation, the passive party needs to send the gradients of the weights to the active party and get the decrypted gradients back. However, doing this naively is dangerous: since the passive party holds $\frac{\partial \ell}{\partial z}$ and the active party holds $a$, revealing the gradient would leak information to both parties at the same time. Thus, both parties need to add random noise to the gradients of the weights before sending them to the other side. Specifically, they do the following:
The passive party adds noise $\epsilon_1$ to the encrypted gradient and sends $[\frac{\partial \ell}{\partial W_2} + \epsilon_1]$ to the active party.
The active party decrypts $[\frac{\partial \ell}{\partial W_2} + \epsilon_1]$.
The active party adds random noise $\epsilon_2$: $\frac{\partial \ell}{\partial W_2} + \epsilon_1 + \epsilon_2$.
The active party sends $\frac{\partial \ell}{\partial W_2} + \epsilon_1 + \epsilon_2$ to the passive party.
The passive party removes its noise $\epsilon_1$: $\frac{\partial \ell}{\partial W_2} + \epsilon_2$.
Note that the noise $\epsilon_1$ generated by the passive party can be removed immediately, while the gradient of the weights still contains noise: $\tilde{g} = \frac{\partial \ell}{\partial W_2} + \epsilon_2$. With $\tilde{g}$, the passive party blindly updates its parameters as $\widetilde{W}_2 \leftarrow \widetilde{W}_2 - \eta\,(\frac{\partial \ell}{\partial W_2} + \epsilon_2)$.
We can see that the noise accumulates in the weights at every iteration. If we denote the accumulated noise as $\Delta$, the true weights that should be used in the forward and backward propagation are $W_2 = \widetilde{W}_2 - \Delta$. To perform the correct forward propagation, the active party cancels the noise by subtracting $\Delta a$ from the noisy weighted sums, as described in Sec. 3.3. Similarly, in the backward propagation the extra noise also enters the gradient, since $\widetilde{W}_2^{\top} \frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial a} + \Delta^{\top} \frac{\partial \ell}{\partial z}$, and it must be removed before backpropagating to the active party. To achieve this, the active party sends the encrypted $[\Delta]$ to the passive party; the passive party calculates the true gradient $[\frac{\partial \ell}{\partial a}] = (\widetilde{W}_2 - [\Delta])^{\top} \frac{\partial \ell}{\partial z}$ and sends this encrypted gradient to the active party.
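The gradient-masking exchange and the noise bookkeeping can likewise be simulated in plaintext. In this sketch, `Delta` is the accumulated noise tracked by the active party, and the final check confirms that the passive party's blindly updated weights always equal the true weights plus `Delta`:

```python
import numpy as np

rng = np.random.default_rng(1)
shape, lr = (3, 4), 0.1
W2_true = rng.normal(size=shape)      # true weights (never materialized)
Delta = np.zeros(shape)               # accumulated noise, active party's ledger
W2_noisy = W2_true + Delta            # the passive party's stored weights

for _ in range(5):                    # a few blind update rounds
    g = rng.normal(size=shape)        # true gradient dL/dW2 ([.] in practice)
    eps1 = rng.normal(size=shape)     # passive party's one-time mask
    masked = g + eps1                 # sent encrypted; active party decrypts
    eps2 = rng.normal(size=shape)     # active party's persistent noise
    returned = masked + eps2          # sent back to the passive party
    g_noisy = returned - eps1         # passive removes eps1, keeps g + eps2
    W2_noisy = W2_noisy - lr * g_noisy   # blind update with the noisy gradient
    Delta = Delta - lr * eps2         # active party tracks accumulated noise
    W2_true = W2_true - lr * g        # reference: the intended true update

# The active party can always reconstruct the effect of the true weights:
assert np.allclose(W2_noisy - Delta, W2_true)
```

Because `Delta` is known only to the active party, the passive party never sees the true $W_2$, yet the forward and backward passes remain exact after the cancellations of Secs. 3.3 and 3.4.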
3.5 Secure Backward Propagation on the active party
The active party decrypts the gradient $[\frac{\partial \ell}{\partial a}]$ received from the passive party, computes the gradient $\frac{\partial \ell}{\partial W_1}$, and updates its parameters via $W_1 \leftarrow W_1 - \eta \frac{\partial \ell}{\partial W_1}$. The whole privacy-preserving backpropagation is detailed in Algorithm 1 and Algorithm 2.
To evaluate the proposed method, we specifically determine whether our model (i) achieves lossless performance in classification tasks and (ii) brings advantages over other solutions based on learning on encrypted data. Please note that we do not aim to highlight the raw performance of the neural networks, but rather that we can achieve lossless performance. We implement the passive party and the active party on two PCs, each with a Core i7-6850K CPU and 64 GB RAM, connected by a 1 Gbps LAN. The neural network of the active party is built using PyTorch Paszke et al. (2017), and the neural net of the passive party is implemented in pure Python integrated with a Paillier library. We compare the proposed method with two previous studies that consider privacy-preserving training with homomorphic encryption for deep neural networks: GELU-Net Zhang et al. (2018), which uses PHE Paillier (1999), and CryptoNets Gilad-Bachrach et al. (2016).
To compare with the previous work, GELU-Net and CryptoNets Gilad-Bachrach et al. (2016), we implement the following architectures for the proposed method:
Multinomial Logistic Regression (MLR): Data-Dense(10)-Softmax
Conv-1: Data-Conv(5×5, stride 2, 5 filters)-ReLU-MeanPooling-ReLU-Dense(84)-Dense(10)-Softmax
LeNet-5: Data-Conv(5×5, stride 1, 6 filters)-MeanPooling-ReLU-Conv(5×5, stride 1, 16 filters)-MeanPooling-ReLU-Dense(120)-Dense(84)-Dense(10)-Softmax
For the proposed method, the final layer, Dense(10)-Softmax, resides on the passive party side; the remaining layers reside on the active party side.
4.2 Training Accuracy
In this section, we show that the proposed method is a lossless solution. The neural network architecture we use is Conv-1. Because the arithmetic operations of homomorphic encryption only support multiplication and addition, CryptoNets uses the square function to avoid computing the non-linear activation. However, this makes training unstable and damages accuracy Gilad-Bachrach et al. (2016). From Table 1, we can see that CryptoNets suffers an accuracy loss ranging from 2% to 5% compared with the proposed method. We also note that GELU-Net is likewise a lossless solution, because it adopts a round-trip strategy similar to ours, sending activations back and forth between the passive party and the active party to handle non-linear operations.
4.3 Computation Speed
As mentioned above, the input dimension and the neural network architecture are the key cost factors for other methods. In this section, we demonstrate that the proposed method is insensitive to both.
Sensitivity of Input Dimension
One problem of learning on encrypted data is that the computation speed depends on the input dimension, because the dimension of the encrypted data directly increases the number of homomorphic encryption operations. We compare our model with a multinomial logistic regression learned on encrypted data, experimenting on simulated data with various feature lengths. As shown in Fig. 1, our proposed method is insensitive to the dimension of the input data compared with multinomial logistic regression (i.e., a shallow network with softmax activation). We also experiment on real-world data using MNIST LeCun et al. (1998) and CIFAR-10 Krizhevsky and Hinton (2009). We can observe from Table 2 that the result on CIFAR-10 (input dimension 3,072) is very similar to the result on MNIST (input dimension 784). This is because we learn on encrypted features generated by the 'feature extraction' part of the deep neural network, which significantly reduces the raw data dimension; compared with learning on encrypted data, feature extraction is very efficient. That is why the proposed method consumes similar time on the different datasets.
| Method | Time (s) |
| Multinomial Logistic Regression (MNIST) | 0.5887 |
| Multinomial Logistic Regression (CIFAR-10) | 2.7027 |
Sensitivity of Neural Network Architectures
Note that the proposed method is designed to work on any architecture without much degradation in computation speed. In comparison, the baseline models such as GELU-Net and CryptoNets are sensitive to either the neural network architecture or the input data dimension. Thus, using complex neural nets such as VGG Simonyan and Zisserman (2014) or ResNet He et al. (2016) with high-dimensional inputs such as ImageNet Deng et al. (2009) is infeasible for methods that learn on encrypted data; it would take an extremely long time. Moreover, since arithmetic operations in PHE are much faster than in FHE, we mainly focus on comparing the proposed method with GELU-Net. To make a fair comparison, we train the proposed method and GELU-Net on the different architectures described in Sec. 4.1 and report the computation time (for one inference).
| Method | Time (s) | Accuracy |
| Multinomial Logistic Regression | 0.588 | 0.93 |
| Our Method (Conv-1) | 0.0582 | |
Table 3 shows the computation time (for one inference) and the accuracy on MNIST LeCun et al. (1998). We observe that the proposed method achieves more than a 100x speed-up over GELU-Net with no accuracy loss. Even compared with the simpler multinomial logistic regression (MLR) architecture, our method is much faster. Our method thus works much faster than GELU-Net and performs much better than a linear model on encrypted data, because the deep neural network can learn more semantic representations of the targets. Table 4 shows the computation speed of the proposed method on both a deep (LeNet-5) and a shallow (Conv-1) neural net. The results reveal that the time for one inference is almost invariant to the architecture as long as the output dimension is the same (84 for both Conv-1 and LeNet-5). This means that the main computational bottleneck is the 'classifier' part, which learns on encrypted data. It also shows that our method has high potential for application to more complicated data that requires deeper neural networks.
In this paper, we have proposed a new scheme, called asymmetrically collaborative machine learning, where one party has the data but the other party has only the labels. For this scheme, we proposed a deep neural network with a partly unencrypted and partly encrypted strategy to avoid learning on encrypted data directly. Beyond that, we offered a series of solutions to preserve the privacy of both parties involved. The design ensures the efficiency and effectiveness of the proposed method. We have carried out extensive experiments that demonstrate more than a 100-times speed-up compared with state-of-the-art solutions.
- M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
- A. Acar, H. Aksu, A. S. Uluagac, and M. Conti (2018) A survey on homomorphic encryption schemes: theory and implementation. ACM Computing Surveys (CSUR) 51 (4), pp. 79.
- J. Daemen and V. Rijmen (1999) AES proposal: Rijndael.
- I. Damgård, V. Pastro, N. Smart, and S. Zakarias (2012) Multiparty computation from somewhat homomorphic encryption. In Annual Cryptology Conference, pp. 643–662.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.
- C. Gentry (2009) Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 169–178.
- R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing (2016) CryptoNets: applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pp. 201–210.
- X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan (2018) Gazelle: a low latency framework for secure neural network inference. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1651–1669.
- A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
- P. Paillier (1999) Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques, pp. 223–238.
- A. Paszke et al. (2017) Automatic differentiation in PyTorch. In NIPS-W.
- K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19.
- Q. Zhang, C. Wang, H. Wu, C. Xin, and T. V. Phuong (2018) GELU-Net: a globally encrypted, locally unencrypted deep neural network for privacy-preserved learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3933–3939.
- L. Zhu, Z. Liu, and S. Han (2019) Deep leakage from gradients. In Advances in Neural Information Processing Systems, pp. 14747–14756.