In 2016, when AlphaGo beat a top human professional Go player, artificial intelligence (AI) concepts and applications became familiar to the public. AI has since achieved great success in diverse application areas, such as image classification, recommendation systems, and precision marketing, and has become an essential part of daily life.
In the past decade, the rapid development of AI has depended mainly on growth in data volume, advances in graphics processing units (GPUs), and progress in deep learning. Deep learning methods have proved effective in real applications, but they require powerful computing resources and a huge amount of training data to prevent overfitting. Moreover, to train a deep model, various transactional data involving private information must be collected from users or institutions and stored on a central server. This centralization is dangerous because a deliberate attack on the central server can compromise data security or leak private information.
Data security and privacy have attracted global attention in recent years. In November 2016, China passed its first Cybersecurity Law, aiming to strengthen cyberspace governance through a number of initiatives, including personal information protection, special protection of critical information infrastructure, and local storage of data. The General Data Protection Regulation (GDPR) took effect in May 2018. The GDPR is designed to give individuals greater control over how their personal data is collected, stored, transferred, and used, while also simplifying the regulatory environment across the European Union (EU).
Discovering AI knowledge from big data without compromising data security and privacy is a new challenge. To address it, Google proposed a federated learning framework in McMahan et al. (2016), the first framework for training a privacy-preserving model in this way. The main idea is to enable multiple devices to collaboratively learn a shared prediction model while all the training data is kept locally on each device. The federated averaging algorithm introduced in that work greatly reduces the number of communication rounds needed for convergence, and Konečnỳ et al. (2016) further reduce the communication cost of each round by compressing updates using random rotation and quantization. Bonawitz et al. (2017) developed a secure aggregation protocol that encrypts each participant's local gradients before aggregation.
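The intuition behind secure aggregation can be sketched with pairwise masks that cancel in the sum. This is a deliberately simplified illustration, not the protocol of Bonawitz et al. (2017): the shared seed stands in for pairwise key agreement, dropouts are not handled, and the names `MODULUS`, `pairwise_masks`, `mask_update`, and `aggregate` are hypothetical.

```python
import random

MODULUS = 2 ** 32  # assumed parameter: all arithmetic is done modulo this value


def pairwise_masks(num_parties, dim, seed=0):
    """Build one mask per party such that all masks sum to zero mod MODULUS."""
    rng = random.Random(seed)  # stand-in for secrets agreed between each pair
    masks = [[0] * dim for _ in range(num_parties)]
    for i in range(num_parties):
        for j in range(i + 1, num_parties):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS  # party i adds
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS  # party j subtracts
    return masks


def mask_update(update, mask):
    """What a party uploads: its integer-encoded update plus its mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """The server sums masked vectors; the pairwise masks cancel out."""
    dim = len(masked_updates[0])
    return [sum(vec[k] for vec in masked_updates) % MODULUS for k in range(dim)]


updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # toy integer updates
masks = pairwise_masks(len(updates), dim=3)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate(masked)
```

In this sketch the server only ever sees the masked vectors, yet `total` equals the element-wise sum of the original updates.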
Federated learning methods, however, focus mainly on safe model training based on encrypted gradient updates. In fact, there are other strategies for secure federation that are not restricted to learning from multi-party data. One solution is to train models or extract knowledge in a ciphertext space on a central server, where the primitive data is encrypted before being pooled together and is not decrypted during training. Another solution is to first extract crude knowledge from each participant, encoded with deep neural networks or traditional machine-learning models, and then refine that knowledge through ensemble or aggregation on a server. As more and more knowledge is extracted and stored as knowledge nodes, connecting them naturally forms a knowledge network that supports further secure knowledge reasoning.
In this paper, we put forward a new term, knowledge federation, and provide a four-level federation hierarchy as well as related definitions. Knowledge federation unifies the above-mentioned strategies and is a general framework for secure multi-party computation and learning.
2 Overview and Hierarchy
In this section, we give an overview of knowledge federation, covering its definition and its four-level hierarchy, and then elaborate on the four levels: information level, model level, cognition level, and knowledge level.
In this work, knowledge federation is formally defined as collaboratively creating or discovering significant knowledge over isolated multi-party data while preventing data leakage. Specifically, given a set of data $\{D_i\}$, with each $D_i$ held by a party $P_i$, we expect to find knowledge $K$ from this data with a mapping function $f$, that is, $K = f(\{D_i\})$. Generally speaking, knowledge refers to models or patterns generated from available data. With this knowledge, one can make reasonable inferences about new data and sound decisions. For example, a consumer has income data at a bank and already holds two credit cards from two other financial institutions. With his/her monthly income and granted credit data, the accumulated credit risk can be assessed more reliably as a piece of knowledge. When this customer applies for a new credit card from another institution, the accumulated risk knowledge can be used to predict his/her overall risk and to decide rationally whether to approve the application.
From the definition, the aim of knowledge federation is to extract knowledge from multi-party data, and the federation process must not leak the private data of any participant to the others. That is, the data of each party must be kept locally and can be used only after encryption or embedding. Meanwhile, the global knowledge obtained through federation should perform better than the local knowledge obtained from isolated data, and should approximate the knowledge that would be generated in a centralized way. In the previous example, both income and credit limit are extremely private and thus need to be strictly protected during federation.
Generally, knowledge discovery involves three key elements: original data, models or patterns, and knowledge representation. Federation can happen at the level of any of these elements, as shown in Figure 1. Therefore, according to the stage at which federation occurs, knowledge federation can be viewed as a four-level theoretical framework:
Low level. At this level, the federation takes place at the early stage of computing or learning; it basically assembles all the isolated ciphertext after data encryption. The encryption must be homomorphic so that the subsequent computing or learning can work normally in the encrypted space. Since each party's data is separately processed into new information before federation, this level is also called the information level.
Middle level. When the federation occurs during model training, knowledge federation is equivalent to federated machine learning to some extent. At this level, local models are iteratively updated by aggregating models on a third-party server. Model updates are usually protected before being uploaded to the server, with encryption or with techniques such as differential privacy. This level is therefore also called the model level.
High level. Cognition is a kind of knowledge. At the high level, coarse cognition is first extracted locally on each party, and the federation then works on the coarse cognition to produce fine cognition or meaningful knowledge. This federation is similar to ensemble learning in some sense, but the noticeable difference is that ensemble learning is not concerned with multi-party data or privacy preservation. Cognition is so important to the high level that we also call it the cognition level.
Top level. Once knowledge is created or learned, it is stored in a knowledge warehouse and shared with other entities. At the top level, all knowledge is viewed as independent knowledge nodes that connect to each other to form a knowledge network. By roaming and exploring the network, one can produce or infer more knowledge for decision making. This level is also called the knowledge level.
2.2 Information Level
As shown in Figure 2, information-level federation requires that original data be encrypted on each party before it is uploaded to the third-party server. The encryption is supposed to be homomorphic, since knowledge creation on the server needs mathematical operations for computing or learning without decrypting the ciphertext. That is, the original data $D_i$ of each party is first converted into encrypted information $E(D_i)$ through an encryption function $E$, and then significant knowledge $K$ is produced on the encrypted information with a function $f$.
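To make the homomorphic requirement concrete, the following minimal sketch uses a toy Paillier-style additively homomorphic scheme, chosen here only for illustration; the works cited in this section use other schemes, including fully homomorphic ones. The primes are far too small to be secure, the function names are hypothetical, and the sketch assumes the private key stays with the data owners rather than the server.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier keys from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt an integer 0 <= m < n under public key n (generator g = n + 1)."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c using the private key."""
    lam, mu = private_key
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts yields an encryption of the plaintext sum (mod n)."""
    return (c1 * c2) % (n * n)


n, private_key = keygen()
income, limit_a, limit_b = 5200, 3000, 4500  # hypothetical private values
ciphertexts = [encrypt(n, value) for value in (income, limit_a, limit_b)]

# The server combines ciphertexts without ever decrypting them.
encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add_encrypted(n, encrypted_total, c)

total = decrypt(n, private_key, encrypted_total)
```

Here `total` equals 12700, the sum of the three plaintext values, although the aggregation step only handled ciphertexts.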
Information-level federation was studied previously by Graepel et al. (2013) and Aslett et al. (2015), where privacy-preserving machine learning is based on fully homomorphic encryption (FHE). Subsequently, CryptoNets, the first neural network over encrypted data, was proposed in Dowlin et al. (2016) for privacy-preserving deep learning inference. Other cryptographic techniques were also applied in Liu et al. (2017); Jiang et al. (2018) to achieve similar goals. In addition, to support both the training and inference phases, the CryptoNN framework was proposed in Xu et al. (2019) to train a neural network model over encrypted data using a functional encryption scheme. Akavia et al. (2019) developed a privacy-preserving solution for learning regularized linear regression models using a linearly homomorphic encryption (LHE) scheme. Kim et al. (2020) combined differential privacy methods and homomorphic encryption techniques for logistic regression. For more information, please refer to the survey by Domingo-Ferrer et al. (2019).
Applications of information-level federation include the secure prediction of neural networks Jiang et al. (2018), the secure retrieval of data from encrypted databases Akavia et al. (2018), classification Hesamifard et al. (2019), and document ranking Shao et al. (2019). The challenge is that the current state of the technology, and its efficiency in particular, still restricts the wide applicability of information-level federation in practice.
2.3 Model Level
Model-level knowledge federation mainly concerns how to extract global knowledge based on local models, as demonstrated in Figure 3. The way of gathering local models varies considerably with the data distribution on each party. Taking into account the differences in data distribution in real applications, model-level federation is further classified into three types: cross-sample, cross-feature, and bi-cross federation.
2.3.1 Cross-sample Federation
In cross-sample cases, data with the same features is distributed across the parties, but the samples or users on one party are independent of, and mostly disjoint from, those on other parties. Labels for samples are collected separately on each party. The federation aims to take advantage of all these samples to produce a common model by aggregating model updates rather than uploading local data to the third-party server. Since local labels are only used to supervise local models and do not need to be transmitted among parties, label privacy is protected as well. A typical application of cross-sample federation is next-word prediction on mobile phones, which was first studied by Google in Konečnỳ et al. (2016).
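A minimal sketch of cross-sample aggregation in the style of federated averaging is given below, assuming a toy linear model and plaintext model exchange; protection of the uploaded models is omitted here. The function names, the data, and the hyperparameters are hypothetical.

```python
def local_train(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent epochs for y ~ w0 + w1 * x on one party's data."""
    w0, w1 = weights
    for _ in range(epochs):
        g0 = sum((w0 + w1 * x - y) for x, y in data) / len(data)
        g1 = sum((w0 + w1 * x - y) * x for x, y in data) / len(data)
        w0, w1 = w0 - lr * g0, w1 - lr * g1
    return [w0, w1]


def federated_average(local_weights, sample_counts):
    """Average local models, weighting each party by its number of samples."""
    total = sum(sample_counts)
    dim = len(local_weights[0])
    return [
        sum(w[k] * n for w, n in zip(local_weights, sample_counts)) / total
        for k in range(dim)
    ]


# Two parties hold disjoint samples with the same features; here y = 1 + 2x.
parties = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]

global_weights = [0.0, 0.0]
for _ in range(300):
    # Each party starts from the current global model and trains locally.
    local_models = [local_train(global_weights, data) for data in parties]
    # Only model parameters are sent to the server, never raw samples or labels.
    global_weights = federated_average(local_models, [len(d) for d in parties])
```

After the communication rounds, `global_weights` is close to the intercept 1 and slope 2 that generated the toy data, even though neither party saw the other's samples.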
2.3.2 Cross-feature Federation
When several parties share common samples but hold different features for them, fusing the separate features of the common samples helps to improve the model. Unfortunately, for the sake of data security it is unacceptable to concatenate them directly. Cross-feature federation is a form of federated learning that both considers the comprehensive features and prevents data leakage. Hardy et al. (2017); Nock et al. (2018) described a cross-feature federation scheme to train a logistic regression model and adopted homomorphic encryption for privacy-preserving computations. This type of federation has been in demand for financial risk control since the rapid development of Internet finance. However, cross-feature federation still faces two challenges. One is how to prevent leakage of user privacy while aligning common samples between parties; the other is how to preserve label privacy while training models on a party that has collected no labels.
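The core arithmetic of cross-feature logistic regression can be sketched as follows: each party computes a partial score on its own features, and only the partial scores are combined. This is a plaintext illustration under stated assumptions; in a scheme such as that of Hardy et al. (2017) the exchanged values would be homomorphically encrypted, and sample alignment is assumed to have been done already. All names and numbers are hypothetical.

```python
import math


def partial_score(features, weights):
    """Computed locally: dot product of a party's features and its weights."""
    return sum(x * w for x, w in zip(features, weights))


def predict(partial_scores, bias=0.0):
    """Combine partial scores into a logistic-regression probability."""
    z = bias + sum(partial_scores)
    return 1.0 / (1.0 + math.exp(-z))


# One aligned sample: the same user, described by different features per party.
bank_features, bank_weights = [0.8, 0.1], [1.5, -2.0]
shop_features, shop_weights = [0.3], [0.5]

# Federated prediction: only two scalars leave the parties.
joint = predict([
    partial_score(bank_features, bank_weights),
    partial_score(shop_features, shop_weights),
])

# Reference: the same model evaluated on directly concatenated features.
central = predict([
    partial_score(bank_features + shop_features, bank_weights + shop_weights)
])
```

The two predictions agree, which is why a linear score lends itself to feature-wise splitting; the privacy of the partial scores themselves still has to be ensured separately.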
2.3.3 Bi-cross Federation
Besides cross-sample and cross-feature federation, there is a more complex scenario in which only a small portion of samples or features overlap among the participants. In this case, the federation is a hybrid of cross-sample and cross-feature federation, so we refer to this setting as bi-cross federation. To make the best of the available data, transfer learning or knowledge distillation can be used to provide federation solutions for the entire sample and feature space. The federated transfer learning method proposed in Liu et al. (2018) explores hidden representations of incomplete features and samples by adapting extracted knowledge to the target domain. This type of federation is more common and useful in real applications. For example, suppose there are two institutions: a local insurance company in one city and a hospital in another city. Because of the different geographical areas, only a small portion of user samples is likely to overlap between the two parties. Moreover, because their businesses differ, the overlapping features between them are quite limited. If we want to use data from these two institutions to train a model for insurance risk assessment, bi-cross federation will come in handy.
2.4 Cognition Level
The obvious distinction between the cognition and model levels is that in cognition-level federation, feature embeddings, rather than model updates, are encrypted and passed on for further ensemble. The embedding could be the last fully-connected layer in a deep neural network or the local cognition extracted on a party. The ensemble on the third party is a training process with an independent model based on local embeddings; this training interacts with the local models and iterates until convergence. More specifically, as shown in Figure 4, during federation the high-level features embedded in local data are first encrypted and sent to the third-party server. The server then performs knowledge discovery by training an ensemble model. The ensemble model in turn directs the optimization of the local models. Local embeddings can be viewed as coarse cognition (or meta knowledge) that is brought together to create fine cognition (or global knowledge). Practical applications often need this type of federation. For instance, to comprehensively analyze and predict user behavior from multi-source heterogeneous data including video, audio, and text, cognition-level federation is a natural way to extract global behavior knowledge while preserving the privacy of each data source.
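The interaction between local encoders and the server-side ensemble model can be sketched as below. This is one possible reading of the description above, not the authors' implementation: embeddings are exchanged in plaintext rather than encrypted, each local encoder is a single linear map to a scalar embedding, and the server model is a logistic head. The classes `Party` and `Server` and all values are hypothetical.

```python
import math


class Party:
    """Holds private features and a local linear encoder."""

    def __init__(self, features, weights):
        self.features = features  # private rows, never sent to the server
        self.weights = weights    # local encoder parameters

    def embed(self):
        """Return one scalar embedding per sample (the coarse cognition)."""
        return [sum(w * x for w, x in zip(self.weights, row)) for row in self.features]

    def apply_feedback(self, grad_embeddings, lr):
        """Update the encoder from gradients the server computed per embedding."""
        for row, g in zip(self.features, grad_embeddings):
            self.weights = [w - lr * g * x for w, x in zip(self.weights, row)]


class Server:
    """Trains a logistic head on the parties' embeddings."""

    def __init__(self, num_parties):
        self.head = [1.0] * num_parties
        self.bias = 0.0

    def step(self, embeddings, labels, lr):
        n = len(labels)
        loss, grad_bias = 0.0, 0.0
        grad_head = [0.0] * len(self.head)
        feedback = [[0.0] * n for _ in self.head]
        for s in range(n):
            z = self.bias + sum(h * e[s] for h, e in zip(self.head, embeddings))
            p = 1.0 / (1.0 + math.exp(-z))
            loss -= labels[s] * math.log(p) + (1 - labels[s]) * math.log(1 - p)
            err = (p - labels[s]) / n
            grad_bias += err
            for i, h in enumerate(self.head):
                grad_head[i] += err * embeddings[i][s]
                feedback[i][s] = err * h  # gradient w.r.t. party i's embedding
        self.head = [h - lr * g for h, g in zip(self.head, grad_head)]
        self.bias -= lr * grad_bias
        return loss / n, feedback


party_a = Party([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]], [0.1, 0.1])
party_b = Party([[1.0], [0.0], [1.0], [0.0]], [0.1])
labels = [1, 0, 1, 0]
server = Server(num_parties=2)

losses = []
for _ in range(300):
    embeddings = [party_a.embed(), party_b.embed()]
    loss, feedback = server.step(embeddings, labels, lr=0.2)
    party_a.apply_feedback(feedback[0], lr=0.2)  # server directs local optimization
    party_b.apply_feedback(feedback[1], lr=0.2)
    losses.append(loss)
```

The server never sees raw features, only embeddings, and the parties never see each other's data, only the gradients returned for their own embeddings.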
2.5 Knowledge Level
Once the initial knowledge is constructed in a certain way and saved in a knowledge base, the federation will enter a higher-level phase, knowledge-level federation, where initial knowledge from multiple knowledge bases will further collaborate and evolve into more significant knowledge. In order to ensure that knowledge is able to flow easily among different knowledge sources, a knowledge network should be first constructed through connecting all knowledge nodes each of which represents an independent knowledge base. In a nutshell, the knowledge-level federation actually expects to let knowledge freely flow in the knowledge network and mine more comprehensive and valuable knowledge through knowledge fusion or reasoning, which is greatly helpful for managers or supervisors to make sound decisions.
It should be emphasized that a knowledge network is different from, but closely related to, a knowledge graph. The latter mainly describes entities and their interrelations, organized in a graph, as discussed in Ehrlinger and Wöß (2016). A knowledge network is built on top of knowledge graphs and is envisaged as a network of all kinds of knowledge relevant to several specific domains or to multiple organizations. In this case, knowledge fusion and reasoning techniques Grosan and Abraham (2011); Dong et al. (2015) can be applied to provide solutions on the network under a federation.
Here is an example of a knowledge-level application. Consider two pieces of knowledge: one knowledge node records that a company has a history of tax evasion, and another node records that the same company is unable to offset its debts with its assets. Credit risk can thus be comprehensively assessed through knowledge-level federation.
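The example can be sketched as simple rule-based reasoning over facts gathered from several knowledge nodes. This is an illustrative toy under stated assumptions: the node names, the company name, the predicates, and the single rule are all hypothetical, and no privacy mechanism is shown.

```python
# Each knowledge node is an independent knowledge base of (subject, predicate) facts.
knowledge_nodes = {
    "tax_authority": {("AcmeCo", "has_tax_evasion_record")},
    "debt_registry": {("AcmeCo", "cannot_offset_debts_with_assets")},
}

# A rule maps a set of premises to a conclusion.
RULES = [
    ({"has_tax_evasion_record", "cannot_offset_debts_with_assets"}, "high_credit_risk"),
]


def federated_facts(nodes, entity):
    """Collect every predicate known about an entity across all nodes."""
    facts = set()
    for node_facts in nodes.values():
        facts |= {pred for subj, pred in node_facts if subj == entity}
    return facts


def infer(nodes, entity, rules):
    """Forward-chain over the rules until no new conclusion can be added."""
    facts = federated_facts(nodes, entity)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


combined = infer(knowledge_nodes, "AcmeCo", RULES)
single = infer({"tax_authority": knowledge_nodes["tax_authority"]}, "AcmeCo", RULES)
```

The conclusion is reached only when facts from both nodes are combined; neither node alone supports it.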
3 Unification of Secure Multi-party Computation and Learning
Knowledge federation is a unified framework for secure multi-party computation (MPC) and multi-party learning (MPL), since both computing and learning tasks can be accomplished under this framework. The notable difference between MPC and MPL is that the latter requires training a model on the multi-party input data, whereas the former does not. In the knowledge federation framework, both MPC and MPL are unified as a federation process that takes place on a virtual or real, but always independent, third-party server.
3.1 Secure Multi-party Computation
In MPC, the knowledge is generated by performing homomorphic operations such as addition, multiplication, and maximum on the encrypted data, which requires the encryption to be homomorphic. That is, an operation on the ciphertext yields, after decryption, exactly the same result as the corresponding operation on the plaintext. Although the third party may be untrusted, this computing procedure does not leak private data because it runs entirely in the ciphertext space. If the third party is virtual or omitted, the proposed framework is fully decentralized, which is especially useful in two-party collaboration. According to the description above, secure MPC is essentially a special case of information-level federation with regard to computation.
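The homomorphic requirement can be stated compactly as follows. The notation is introduced here only for illustration and is an assumption: $E$ and $D$ denote encryption and decryption, $\odot$ is an operation on plaintexts, and $\circledast$ is its counterpart on ciphertexts.

```latex
% Homomorphic property: computing on ciphertexts and then decrypting
% gives the same result as computing directly on the plaintexts.
D\bigl(E(x_1) \circledast E(x_2)\bigr) = x_1 \odot x_2
\quad \text{for all plaintexts } x_1, x_2 .
```

An additively homomorphic scheme satisfies this for $\odot = +$ only, whereas a fully homomorphic scheme supports both addition and multiplication.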
3.2 Secure Multi-party Learning
In some situations, the knowledge must be jointly learned from the input data of each party, which is in essence a secure multi-party learning (MPL) problem. There are two ways of achieving secure federation in MPL. A natural idea is that local data is first homomorphically encrypted and then sent to the third-party server, where models are trained with traditional or deep learning methods. This case amounts to information-level federation with regard to learning. Another popular way of implementing secure MPL is what is often called federated learning in the literature Konečnỳ et al. (2016); Yang et al. (2019). A model is first trained locally with the isolated data on each party, the model updates are gathered and combined on the third party, and the model is then updated and trained iteratively in this way until convergence. This approach is equivalent to the model-level knowledge federation introduced in Section 2.3.
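In the second approach, a party may protect its update before uploading it. One common pattern, sketched here as an assumption rather than as the method of any cited work, is to clip the update and add Gaussian noise in the spirit of differential privacy. The function names and default parameters are hypothetical, and calibrating the noise to a formal privacy budget is not shown.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def privatize_update(update, max_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip the update, then add Gaussian noise proportional to the clip bound."""
    rng = rng or random.Random()
    clipped = clip_update(update, max_norm)
    sigma = noise_multiplier * max_norm
    return [u + rng.gauss(0.0, sigma) for u in clipped]


local_update = [3.0, 4.0]  # toy model update with L2 norm 5
clipped = clip_update(local_update, max_norm=1.0)
noisy = privatize_update(local_update, rng=random.Random(0))
```

Clipping bounds the influence of any single party's update, and the noise masks what remains; the server would then aggregate the noisy updates as usual.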
Recently, federated learning has been used frequently in research and business, usually in close association with secure multi-party computation, privacy-preserving learning, and distributed machine learning. Nevertheless, the term is a narrow concept that focuses merely on secure multi-party learning; it cannot fully describe practical scenarios involving secure federation, such as shared computation. Federation should be a broad topic that is not limited to learning. In this work, we have proposed a novel concept, knowledge federation, that unifies secure multi-party computation and learning in one framework, and we have delineated its four-level hierarchy, which serves as a basis for discussions on this topic. Given its diverse applications, knowledge federation bears more resemblance to an abstract framework than to a mathematical structure. Our ongoing research focuses on an in-depth analysis of this concept with respect to privacy-preserving implementations, as well as on assessing the data quality and contribution of each participant. We expect that in the near future knowledge federation will break the barriers between institutions and establish a model market where knowledge can be created and shared freely yet safely.
- Akavia et al. (2018) Adi Akavia, Dan Feldman, and Hayim Shaul. 2018. Secure Search on Encrypted Data via Multi-Ring Sketch. 985–1001. https://doi.org/10.1145/3243734.3243810
- Akavia et al. (2019) Adi Akavia, Hayim Shaul, Mor Weiss, and Zohar Yakhini. 2019. Linear-Regression on Packed Encrypted Data in the Two-Server Model. 21–32. https://doi.org/10.1145/3338469.3358942
- Aslett et al. (2015) Louis Aslett, Pedro Esperança, and Chris Holmes. 2015. Encrypted statistical machine learning: new privacy preserving methods. (08 2015).
- Bonawitz et al. (2017) Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1175–1191.
- Domingo-Ferrer et al. (2019) Josep Domingo-Ferrer, Oriol Farràs, Jordi González, and David Sánchez. 2019. Privacy-preserving cloud computing on sensitive data: A survey of methods, products and challenges. Computer Communications 140-141 (05 2019), 38–60. https://doi.org/10.1016/j.comcom.2019.04.011
- Dong et al. (2015) Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. 2015. From Data Fusion to Knowledge Fusion. Proceedings of the VLDB Endowment 7 (03 2015). https://doi.org/10.14778/2732951.2732962
- Dowlin et al. (2016) Nathan Dowlin, Ran Gilad-Bachrach, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. 2016. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. 12 pages.
- Ehrlinger and Wöß (2016) Lisa Ehrlinger and Wolfram Wöß. 2016. Towards a Definition of Knowledge Graphs. SEMANTiCS (Posters, Demos, SuCCESS) 48 (2016).
- Graepel et al. (2013) Thore Graepel, Kristin Lauter, and Michael Naehrig. 2013. ML Confidential: Machine Learning on Encrypted Data. In Information Security and Cryptology – ICISC 2012, Taekyoung Kwon, Mun-Kyu Lee, and Daesung Kwon (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–21.
- Grosan and Abraham (2011) Crina Grosan and Ajith Abraham. 2011. Knowledge Representation and Reasoning. Springer Berlin Heidelberg, Berlin, Heidelberg, 131–147. https://doi.org/10.1007/978-3-642-21004-4_6
- Hardy et al. (2017) Stephen James Hardy, Wilko Henecka, Hamish Iveylaw, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2017. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv: Learning (2017).
- Hesamifard et al. (2019) Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. 2019. Deep Neural Networks Classification over Encrypted Data. 97–108. https://doi.org/10.1145/3292006.3300044
- Jiang et al. (2018) Xiaoqian Jiang, Miran Kim, Kristin Lauter, and Yongsoo Song. 2018. Secure Outsourced Matrix Computation and Application to Neural Networks, Vol. 2018. 1209–1222. https://doi.org/10.1145/3243734.3243837
- Kim et al. (2020) M. Kim, J. Lee, L. Ohno-Machado, and X. Jiang. 2020. Secure and Differentially Private Logistic Regression for Horizontally Distributed Data. IEEE Transactions on Information Forensics and Security 15 (2020), 695–710.
- Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
- Koushanfar et al. (2019) Farinaz Koushanfar, Sadegh Riazi, and Bita Rouhani. 2019. Deep Learning on Private Data. IEEE Security and Privacy Magazine (02 2019). https://doi.org/10.1109/MSEC.2019.2935666
- Liu et al. (2017) Jian Liu, Mika Juuti, Yao Lu, and N. Asokan. 2017. Oblivious Neural Network Predictions via MiniONN Transformations. 619–631. https://doi.org/10.1145/3133956.3134056
- Liu et al. (2018) Yang Liu, Tianjian Chen, and Qiang Yang. 2018. Secure Federated Transfer Learning. arXiv: Learning (2018).
- McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
- Nock et al. (2018) Richard Nock, Stephen James Hardy, Wilko Henecka, Hamish Iveylaw, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2018. Entity Resolution and Federated Learning get a Federated Resolution. arXiv: Databases (2018).
- Shao et al. (2019) Jinjin Shao, Shiyu Ji, and Tao Yang. 2019. Privacy-aware Document Ranking with Neural Signals. 305–314. https://doi.org/10.1145/3331184.3331189
- Xu et al. (2019) Runhua Xu, James Joshi, and Chao Li. 2019. CryptoNN: Training Neural Networks over Encrypted Data.
- Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology 10, 2 (2019), 12.