1 Introduction
Knowledge Graphs (KGs) have recently been constructed and used by more and more companies due to their ability to connect different types of data in meaningful ways and to support rich data services. A KG is a heterogeneous graph composed of entities (nodes) and relations (edges), and some KGs [besta2019demystifying] also contain properties (features) and labels (labels can be taken as special properties, and we use properties to denote both features and labels in the rest of this paper) for entities. A knowledge edge is represented as a factual triple of the form (head entity, relation, tail entity), also denoted as (h, r, t). For example, (Andrew C. Yao [Scientist, theorist, and Professor]; WinnerOf; Turing Award [Alan Turing and ACM]) is a fact in a KG with entity properties. So far, KGs have been applied to various tasks such as question answering [zhang2018variational, huang2019knowledge], recommender systems [cao2019unifying, wang2019kgat], and information extraction [hoffmann2011knowledge, koncel2019text]. Recent advances in KGs include knowledge representation learning [bordes2011learning, nickel2016holographic, wang2017knowledge], knowledge acquisition and completion [han2018neural, chen2018variational, omran2019embedding], and knowledge-aware applications [petroni2019language].
KG isolation problem: an example. Data isolation has been a long-standing problem, especially as various regulations have come into force around the world in recent years. KG isolation is a typical instance of the data isolation problem: KGs are held separately by multiple parties, as shown in Figure 1. KG isolation is quite common in practice, since different institutions (e.g., banks, financial companies, and social media platforms) may construct their own KGs from their own data. Figure 1 shows a typical case of the KG isolation problem, where there are two parties and each of them has its own KG. More specifically, party A has four entities and their relations, and party B has four entities and their relations. Besides, each entity has its properties, e.g., 'good' and 'bad'. Due to the data isolation problem, the KGs of party A and party B cannot be shared with each other. Thus, the information available to both parties for deploying further artificial intelligence applications is also limited. For example, although they have an overlapping entity (Jim Butler), it is difficult for party A or party B to figure out the fact that Jim Butler works for both companies C1 and C2. Moreover, when both parties train models on their standalone KGs, the model of party A is likely to misjudge Jim Butler as 'good', since a good entity ('Alice') knows him and he works for a good company. However, this is not the case for party B: party B knows that Jim Butler has a relation with a 'really bad' person, so he should probably be labelled as 'bad'.

To solve the data isolation problem, existing works propose different privacy preserving machine learning techniques such as collaborative learning
[chase2017private, li2020homopai], federated learning [yang2019federated, kairouz2019advances], split learning [vepakomma2018split], and secure machine learning [mohassel2017secureml, riazi2018chameleon]. To date, existing privacy preserving machine learning techniques have covered most traditional data mining and machine learning algorithms, e.g., k-means [mohassel2019practical], PCA [liu2020privacy, chen2020homomorphic], tree-based models [fang2020hybrid, zheng2020industrial], and recommender systems [chen2018privacy, chen2020practical].

So far, there has been some literature on privacy preserving graph algorithms [brickell2005privacy, he2011privacy, blanton2013data, sharma2016privacy, chang2016privacy]. For example, Brickell and Shmatikov [brickell2005privacy] proposed secure methods for finding shortest distances on multiple graphs. Graph anonymization and private link discovery approaches were presented in [he2011privacy], and data-oblivious graph analysis algorithms were provided in [blanton2013data]. Besides, the authors in [sharma2016privacy, chang2016privacy] proposed secure methods for graph analysis on encrypted graphs in the cloud computing setting. However, how to perform privacy preserving KG tasks under the KG isolation setting remains an open research problem, for two potential reasons. On the one hand, data in a KG not only contain entities (samples) and properties (features), but also involve different kinds of relations between entities, which is more complicated than data in traditional machine learning. On the other hand, techniques in KG usually involve machine learning approaches such as deep learning, and are thus more complex than plain graph analysis algorithms.
In this paper, to fill this gap, we aim to summarise the open problems for privacy preserving KG in the data isolation setting. That is, there are multiple parties, each of which has a KG constructed from its own private data. The purpose of privacy preserving KG is to perform KG-related tasks using the KGs from multiple parties while protecting data privacy, and meanwhile to achieve performance comparable to that of the plaintext KG obtained by merging the raw KGs of the multiple parties.
Open problems in privacy preserving KG. Motivated by the existing advanced techniques in graph learning and KG, in this paper, we summarize the open problems in privacy preserving KG from five aspects, i.e., merging, query, representation, completion, and applications, as is shown in Figure 2.
Privacy preserving KG merging. The KG isolation problem describes the fact that parties independently hold their own private KGs; therefore, the first open problem for privacy preserving KG is to merge these KGs and store the merged result securely. For merging, different parties are likely to have different entity sets, and therefore the most important task is to identify their common entities and merge the corresponding properties privately. Possible solutions are private set intersection [pinkas2018scalable] and secure Multi-Party Computation (MPC) [yao1986generate, lindell2020secure]. For storage, the difficult problem is how to keep the merged result secure while keeping it usable for the subsequent operations and applications on the KG. Homomorphic encryption [gentry2009fully] and secret sharing [shamir1979share] are possible ways to achieve this.
Privacy preserving KG query. For a traditional KG, query is quite straightforward for a single party using mature graph traversal languages such as Gremlin [rodriguez2015gremlin]. However, under the privacy preserving KG setting, query is challenging considering the privacy constraint. That is, the one who initiates the query cannot obtain any information except the query result, and the KG parties do not learn what the query is. Intuitively, this can be done directly by using Oblivious Transfer (OT) [rabin2005exchange] or private information retrieval [chor1995private]. Unfortunately, privacy preserving KG query is more challenging since the KGs are isolated across multiple parties, and therefore it is the second open problem in privacy preserving KG. A possible solution is combining OT with other cryptographic techniques such as secret sharing [shamir1979share], garbled circuits [yao1986generate], and Pseudo-Random Generators (PRGs) [haastad1999pseudorandom].
Privacy preserving KG representation. KG representation learning is a critical research direction of KG, which aims to learn low-dimensional embeddings of entities and relations, paving the way for many knowledge completion tasks and downstream applications [ji2020survey]. In the KG isolation setting, after KG merging, data (including entities, properties, and relations) are securely stored by multiple parties. Although existing secure machine learning as a service [mohassel2017secureml] provides methods for representing entities (samples) and properties (features), it cannot capture the relations between entities, and thus cannot achieve performance comparable to traditional plaintext KG representation learning approaches. Therefore, privacy preserving KG representation is the third open problem in privacy preserving KG. A possible solution could be combining graph neural (convolutional) networks [cai2018comprehensive, liu2019geniepath] with the above-mentioned cryptographic techniques to build privacy preserving graph neural (convolutional) networks. The learned representations can be stored in plaintext or encrypted format depending on the privacy requirements.
Privacy preserving KG completion. KG completion (a.k.a. reasoning) is an active field of research since KGs are known for their incompleteness and noise. Existing research on KG completion includes entity property prediction [lin2015learning] and triple classification [dong2019triple]. With the learned KG representations of entities and relations, in either plaintext or encrypted format, one can build secure machine learning algorithms for further KG completion tasks. The challenge here is how to provide scalable and flexible secure machine learning frameworks so that one can easily build secure machine learning algorithms to meet the various needs of KG completion. One possible solution is a system with hybrid secure computation protocols, rich computation operations, and a powerful domain-specific language.
Privacy preserving KG-aware applications. KGs and their related techniques have boosted the performance of numerous applications such as natural language understanding [logan2019barack], question answering [chen2019bidirectional], fraud detection [liu2018heterogeneous], risk assessment [cheng2019risk], and recommender systems [wang2019kgat]. Although one cannot enumerate all possible KG-aware applications in the KG isolation setting, we showcase three real-world privacy preserving KG-aware applications and present how to solve them using the four privacy preserving KG techniques above.
The rest of this paper is organized as follows. Section 2 describes definitions, related secure computation techniques, and the threat model. Sections 3 to 7 present the detailed open problems and possible solutions in privacy preserving KG, i.e., merging, query, representation, completion, and applications. Section 8 concludes this paper.
2 Definitions and Backgrounds
We first give the definitions of the traditional KG and privacy preserving KG. We then describe background knowledge on secure computation techniques. We finally describe the threat model in privacy-preserving KG, which models the adversaries' behaviour. Table 1 lists the main notations used throughout the paper.
Notation | Explanation
G | Knowledge graph (KG)
E | Entity set
R | Relation set
F | Fact set
h / t | Head/tail entity
P | Property set
p | A property (feature)
L | Location set in KG
Q | Query instruction set
r | A relation between entities
⟨x⟩ | Secret sharing of x
h | Entity embedding
W | Weight matrix
σ | Nonlinear activation function
c | Traversal condition
a_i | Polynomial coefficients
N(·) | Neighborhood function
AGG_k | Aggregator of the k-th depth propagation
K | Propagation depth
A, B | Parties who own KGs
2.1 Definition
Definition of traditional KG. Following previous literature [wang2017knowledge, ji2020survey], we define a knowledge graph as G = {E, R, F}, where E, R, and F are the sets of entities, relations, and facts, respectively. A fact is denoted as a triple (h, r, t), which is composed of a head entity h, a tail entity t, and a relation r between them, e.g., (Andrew C. Yao; WinnerOf; Turing Award). Besides, entities may have a property set P that describes them. For each entity, there is a property subset that describes it, e.g., Andrew C. Yao [Scientist, theorist, and Professor] and Turing Award [Alan Turing and ACM].
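The definition above can be made concrete with a minimal in-memory sketch. The class and field names below are illustrative, not part of the paper's formalism; the structure simply mirrors G = {E, R, F} plus the entity property map.

```python
from collections import defaultdict

class KG:
    """Minimal knowledge graph per the definition: G = {E, R, F} + properties."""
    def __init__(self):
        self.entities = set()                # E
        self.relations = set()               # R
        self.facts = set()                   # F: triples (h, r, t)
        self.properties = defaultdict(list)  # entity -> property subset

    def add_fact(self, head, relation, tail):
        self.entities.update({head, tail})
        self.relations.add(relation)
        self.facts.add((head, relation, tail))

kg = KG()
kg.add_fact("Andrew C. Yao", "WinnerOf", "Turing Award")
kg.properties["Andrew C. Yao"] += ["Scientist", "theorist", "Professor"]
kg.properties["Turing Award"] += ["Alan Turing", "ACM"]
```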
Definition of privacy preserving KG. Assume there are N parties, each of which has an individual KG G_i = {E_i, R_i, F_i}, where i ∈ {1, ..., N}. As in the traditional KG, the fact set F_i of party i contains triples (h, r, t) composed of a head entity h, a tail entity t, and a relation r between them. Besides, for each party i, there is a property set P_i describing its entities. The purpose of privacy-preserving KG is to conduct KG-related tasks (including query, representation, completion, and application) on the basis of (1) preserving the data confidentiality of the KGs held by the parties, and (2) achieving performance comparable to that of a traditional KG built on the mixed plaintext data.
2.2 Secure Computation
Secure computation is a general cryptography term encompassing all methods that allow computation on data while keeping the data private.
It is also considered the core technique for implementing privacy-preserving applications. In the literature, secure computation has directed research into generic solutions such as Homomorphic Encryption (HE) [Gentry2010ComputingAF, Fan2012SomewhatPF], Oblivious Random Access Memory (ORAM) [Goldreich1987TowardsAT, Goodrich2011ObliviousRS] and Universal Circuit (UC) [Kolesnikov2008APU, Lipmaa2016ValiantsUC, Gnther2017MoreEU], and has also inspired works on primitives targeting specific problems, such as Oblivious Transfer (OT) [Ishai2003ExtendingOT, Asharov2013MoreEO, Chou2015TheSP], Private Information Retrieval (PIR) [Kiayias2015OptimalRP, Canetti2017TowardsDE, Boyle2017CanWA, Patel2018PrivateSI, Angel2018PIRWC, Ali2019CommunicationComputationTI] and Private Set Intersection (PSI) [Chase2020PrivateSI].
Though the problem of secure computation has been studied for almost 30 years, most works have been theoretical. Recently, the rapid development of computer networking has pushed one solution into practice: secure multi-party computation (MPC) [yao1986generate, lindell2020secure]. In this work, we take MPC as the main technique for solving privacy preserving KG problems. More specifically, we focus on the secret sharing technique.
Secret sharing (SS) [shamir1979share]. Secret sharing is an essential cryptographic primitive for many MPC protocols. Roughly, a secret sharing scheme splits a secret value into multiple pieces, such that the secret is only revealed given a sufficient number of pieces. Formally, a secret sharing scheme comprises two algorithms (Shr, Rec), and we use angle brackets, i.e., ⟨x⟩, to denote that x is secret shared. Take two-party (A and B) additive secret sharing as an example: supposing party A wants to share (Shr) its private data x with party B, A first randomly generates a share ⟨x⟩_A ∈ Z_q, with q denoting a large prime, and keeps it. A then calculates ⟨x⟩_B = (x - ⟨x⟩_A) mod q, and sends ⟨x⟩_B to B. To reconstruct (Rec) the data x, which is shared between both parties, one party obtains the share from the other party and then calculates x = (⟨x⟩_A + ⟨x⟩_B) mod q.
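The two-party additive scheme described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the modulus and function names are our own choices.

```python
import random

Q = 2**61 - 1  # a large prime modulus (illustrative parameter)

def shr(x):
    """Shr: party A keeps share_a and sends share_b to party B."""
    share_a = random.randrange(Q)
    share_b = (x - share_a) % Q
    return share_a, share_b

def rec(share_a, share_b):
    """Rec: either party reconstructs x once it holds both shares."""
    return (share_a + share_b) % Q

share_a, share_b = shr(42)
assert rec(share_a, share_b) == 42
# A single share reveals nothing: share_a alone is uniformly random in Z_Q.
```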
We now list secret sharing based secure computation primitives used in this paper as follows.


LINEAR(⟨u⟩, ⟨v⟩; a, b, c) for secretly shared values ⟨u⟩ and ⟨v⟩ and plaintext values a, b, and c, which returns a shared value ⟨z⟩ = a⟨u⟩ + b⟨v⟩ + c. Linear operations can be done by each party locally without interacting with other parties, and the result remains shared.

MUL(⟨u⟩, ⟨v⟩) for secretly shared values ⟨u⟩ and ⟨v⟩, which returns a shared value ⟨z⟩ such that z = u · v. Secretly shared multiplication relies on Beaver's triple technique [beaver1991efficient], which requires interaction between participants.

DIV(⟨u⟩, ⟨v⟩) for secretly shared values ⟨u⟩ and ⟨v⟩, which returns a shared value ⟨z⟩ such that z = u / v. Secretly shared division can be implemented using numerical optimization algorithms such as Goldschmidt's series expansion algorithm [goldschmidt1964applications], after which DIV can be approximated and computed by LINEAR and MUL.

ARGMAX(⟨u_1⟩, ..., ⟨u_n⟩) for a list of shared values, which returns the one with the maximum value. This can be done by conducting secure comparison using Boolean secret sharing [demmler2015aby]. It can also be sped up using tree-structured parallel comparison [mohassel2019practical].
Note that secret sharing only works over a finite field to guarantee security, and fixed-point representation is popularly used to make it suitable for floating-point numbers [mohassel2017secureml].
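The MUL primitive above rests on Beaver's triple trick. The following minimal two-party sketch makes the algebra explicit; for simplicity a trusted dealer generates the triple locally (an assumption for illustration, since real protocols generate triples via an offline MPC phase), and all names are ours.

```python
import random

P = 2**61 - 1  # field prime (illustrative parameter)

def shr(x):
    a = random.randrange(P)
    return (a, (x - a) % P)

def rec(s):
    return (s[0] + s[1]) % P

def beaver_mul(xs, ys):
    """Multiply two secret-shared values via a Beaver triple (toy sketch)."""
    # Offline phase: a dealer generates a shared triple (a, b, c) with c = a*b.
    a, b = random.randrange(P), random.randrange(P)
    a_s, b_s, c_s = shr(a), shr(b), shr((a * b) % P)
    # Online phase: parties open the masked values e = x - a and f = y - b.
    # These leak nothing about x and y because a and b are uniformly random.
    e = (xs[0] - a_s[0] + xs[1] - a_s[1]) % P
    f = (ys[0] - b_s[0] + ys[1] - b_s[1]) % P
    # Each party i computes z_i = c_i + e*b_i + f*a_i; one party adds e*f.
    # Then z = c + eb + fa + ef = ab + (x-a)b + (y-b)a + (x-a)(y-b) = xy.
    z0 = (c_s[0] + e * b_s[0] + f * a_s[0] + e * f) % P
    z1 = (c_s[1] + e * b_s[1] + f * a_s[1]) % P
    return (z0, z1)

assert rec(beaver_mul(shr(6), shr(7))) == 42
```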
2.3 Threat Model
The threat model of privacy preserving KG follows the standard MPC security definition under the real world vs. ideal world paradigm [lindell2020secure]. That is, we categorize the adversary's behaviour into one of the following:

Semi-honest adversary, who corrupts parties but follows the protocol as specified.

Malicious adversary who causes corrupted parties to deviate arbitrarily from the prescribed protocol in an attempt to violate security.
3 Privacy Preserving KG Merging
In this section, we first describe the problems in privacy preserving KG merging and our proposed secure merging solutions. We then present how to store the merged KG securely based on secret sharing.
3.1 Merging
Under the KG isolation setting, KGs are separated across multiple parties, and naturally the first step for privacy preserving KG is to merge these KGs and store the merged result securely. Privacy preserving KG merging has several challenging problems. First, the parties have their own entity sets, and how to align their entities privately is the first challenge. Second, the KGs of different parties are built from different data sources; hence, their entity names may differ even when those names have the same meaning. For instance, party A may mark a node as "Albert Einstein", while party B marks the node as "Einstein", both of which refer to the same entity. Therefore, how to conduct entity linking across different parties privately is the second problem. Third, the same entity across multiple KGs may also have different property values. For example, the KG of party A shows that an entity has a property of 25 years old, while the KG of party B indicates the same entity is 35 years old. Therefore, how to merge the property values of the same entity becomes another problem. We now describe these challenging problems in detail and present possible solutions.
3.1.1 Private Entity Alignment
Under KG isolation setting, private entity alignment aims to align the same entities among different KGs. Formally, it is defined as follows.
Definition 3.1 (Private entity alignment).
Given any two KGs, namely G_A = {E_A, R_A, F_A} and G_B = {E_B, R_B, F_B}, private entity alignment aims to find the common entity set E_A ∩ E_B of G_A and G_B, and meanwhile protect the other private information of both KGs.
Taking the example in Figure 1, private entity alignment will find the common entity set (entity 2 and entity 3) of party A and party B and keep the other information private.
Possible solutions for private entity alignment. We find that private entity alignment has exactly the same purpose as an existing primitive called PSI [pinkas2018scalable]. PSI is a cryptographic protocol that computes the intersection of two sets held by two parties, during which both parties learn the elements (if any) common to both sets and nothing (or as little as possible) else [de2010practical]. PSI has been extensively researched recently. So far, there are several types of PSI, including public-key based PSI [chen2017fast], circuit based PSI [huang2012private], and OT (extension) based PSI [pinkas2018scalable]. One can directly apply these PSI protocols to the private entity alignment task.
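To give intuition for how a public-key based PSI protocol would align entity names, here is a toy Diffie-Hellman-style sketch: each party exponentiates hashed elements with its secret key, and the doubly-exponentiated values match exactly when the underlying elements match. All parameters and names are ours; this is not a secure production protocol (no proper hash-to-group, no malicious-security measures).

```python
import hashlib
import random

P = 2**127 - 1  # Mersenne prime modulus (toy parameter, not production-grade)

def h(element):
    """Hash an element into the multiplicative group mod P."""
    return int(hashlib.sha256(element.encode()).hexdigest(), 16) % P

def psi(set_a, set_b):
    """Toy DH-style PSI: H(x)^(k_a*k_b) == H(y)^(k_b*k_a) iff x == y (w.h.p.)."""
    k_a = random.randrange(2, P - 1)  # party A's secret key
    k_b = random.randrange(2, P - 1)  # party B's secret key
    # A sends {H(x)^k_a}; B sends {H(y)^k_b}; each applies its own key again.
    double_a = {pow(pow(h(x), k_a, P), k_b, P) for x in set_a}
    double_b = {pow(pow(h(y), k_b, P), k_a, P) for y in set_b}
    common = double_a & double_b
    # A recognizes which of its elements landed in the intersection.
    return [x for x in set_a if pow(pow(h(x), k_a, P), k_b, P) in common]

print(psi({"Alice", "Bob", "JB"}, {"Sam", "JB", "Lee"}))  # -> ['JB']
```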
3.1.2 Private Entity Linking
Entity linking (disambiguation) is a long-standing problem in KG [moro2014entity]. In the KG isolation setting, private entity linking is defined as follows.
Definition 3.2 (Private entity linking).
Given any two KGs G_A = {E_A, R_A, F_A} and G_B = {E_B, R_B, F_B}, for a target entity e_A ∈ E_A, private entity linking aims to find the entity e_B ∈ E_B that matches e_A (if one exists), and meanwhile protect the data privacy of both KGs. Note that here e_A and e_B have different surface names, since private entity alignment has already solved the identical-name case.
Taking Figure 4 as an example, for the target entity 'Albert Einstein' of one party, private entity linking aims to find its matched entity 'Einstein' from the other party while keeping the entities, relations, and descriptions private.
Existing work on plaintext entity linking mainly has two steps, i.e., candidate entity generation and candidate entity ranking [shen2014entity]. Candidate entity generation can be done by either lexical based methods or semantic based methods. Lexical based methods directly calculate the text similarity of entities and their properties [zhang2010entity]. Later on, semantic based methods, mostly neural models, became popular for learning entity embeddings [francis2016capturing, chen2018bilinear, sun2017cross]. Candidate entity ranking aims to select the most relevant candidate entity to link to the target entity. Traditional ranking models, e.g., logistic regression, tree based models, and neural networks, can be directly used for ranking. However, none of them is designed to protect data privacy in the KG isolation setting.
Recently, there have been several works on securely calculating the similarity of texts [gondree2009longest, pang2010privacy, reich2019privacy]. For example, Gondree and Mohassel proposed a secure method for calculating the longest common subsequence between two parties [gondree2009longest]. Pang et al. proposed a private text retrieval method using encryption techniques [pang2010privacy]. Reich et al. proposed a secure text classification protocol using MPC [reich2019privacy]. Although these protocols can be applied to private entity linking, they fail to capture the semantic information between entities.
Possible solutions for private entity linking. Private entity linking can be taken as a ranking problem, as described above. Therefore, we design a two-step solution for it, as shown in Figure 4, i.e., generating entity features securely based on the KGs and ranking entities securely based on the features. Note that we do not list candidate entity generation as a first step, but one can do it beforehand to decrease the cost of entity feature generation and entity ranking.
Generating entity features securely based on KGs. Entity features should represent the characteristics (e.g., lexical or semantic aspects) of entities. The purpose of this step is to generate entity features for all parties using all the KG data, while keeping these data private. We propose to generate entity features for multiple parties securely using whatever information is available, e.g., entity names and properties. However, there are few existing solutions for generating entity features under the privacy-preserving setting, especially semantic based methods. Therefore, there is an urgent need for privacy-preserving semantic models such as word2vec [mikolov2013distributed] and doc2vec [le2014distributed]. After generating entity features securely, each party holds the encrypted or plaintext entity features, depending on the specific security requirement.
Ranking entities securely based on features. Entity ranking aims to find the best matching entity from the other KG (if one exists) for a target entity in a given KG, based on the generated entity features. We propose to rank entities securely using either unsupervised learning models, e.g., privacy preserving nearest neighbor [shaneck2009privacy], or supervised learning models [fang2020hybrid, chen2020homomorphic]. After this, a party can link its entities to the entities of other parties. As the example in Figure 4 shows, entity 3 ('Albert Einstein') in KG 1 is linked to entity 8 ('Einstein') in KG 2.

3.1.3 Private Property Merging
After private entity alignment and private entity linking, the parties obtain the common entity set, which includes the same entities (e.g., entity 2 and entity 3 in Figure 1) and the linked entities (e.g., entity 3 and entity 8 in Figure 4). However, these entities may have different properties, e.g., 'physicist' for entity 3 and 'actor' for entity 8 in Figure 4. The simplest solution is to take 'physicist' and 'actor' as two different properties, but doing so introduces more noise. Formally, we define private property merging as follows.
Definition 3.3 (Private property merging).
Given the aligned or linked common entity set E_C of the KGs, private property merging aims to merge the properties of each entity e ∈ E_C while keeping each party's property values private.
Possible solutions for private property merging. The most suitable solution for private property merging is to perform secure computations according to the property type, as shown in Table 2. We divide entity properties into two types, i.e., continuous or discrete properties and categorical properties. The former are real-valued variables (e.g., height) or numeric variables (e.g., age) that have comparison relations, while the latter take a limited number of categories or distinct groups (e.g., gender and country) that usually do not have a logical order. We propose different merging operations for these two kinds of properties. First, for continuous and discrete properties, we propose three secure merging operations, i.e., max, min, and (weighted) average. These can be done using secure MPC techniques, e.g., garbled circuits and secret sharing. For example, party A has a property (age, 23) and party B has a property (age, 15), and one possible merged property is their average, (age, 19). Second, for categorical properties, we propose binary or multiple selection, i.e., selecting the most likely property value from two or more values based on certain rules (e.g., voting). This can be done using garbled circuits. A typical example: for the same entity property (gender), party A has value 'M', party B has value 'F', and party C has value 'F'; the merged gender is the majority value ('F'). The merged results should be stored securely to prevent information leakage.
Property type | Secure merging operation | Example
Continuous or discrete | max, min, (weighted) average | A: 23, B: 15, merged: 19
Categorical | binary or multiple selection | A: 'M', B: 'F', C: 'F', merged: 'F'
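The two merging operations from Table 2 can be sketched as follows. The continuous case uses the additive secret sharing and LINEAR primitive from Section 2.2 (each party adds its shares locally, then the masked sum is opened and averaged); the categorical case shows the majority-vote logic in plaintext for intuition only, since a private version would evaluate it inside a garbled circuit. Names and parameters are ours.

```python
import random
from collections import Counter

P = 2**61 - 1  # field prime (illustrative parameter)

def shr(x):
    a = random.randrange(P)
    return (a, (x - a) % P)

def rec(s):
    return (s[0] + s[1]) % P

# Continuous property (age): party A contributes 23, party B contributes 15.
# Adding shares is a local linear operation (the LINEAR primitive).
age_a, age_b = shr(23), shr(15)
sum_shares = ((age_a[0] + age_b[0]) % P, (age_a[1] + age_b[1]) % P)
merged_age = rec(sum_shares) // 2  # open the sum, then average publicly
assert merged_age == 19

# Categorical property (gender): majority vote, shown in plaintext here;
# a private version would run this selection inside a garbled circuit.
votes = ["M", "F", "F"]
merged_gender = Counter(votes).most_common(1)[0][0]
assert merged_gender == "F"
```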
3.2 Storage
The purpose of KG storage is to securely store the merged KGs for all parties. After merging, the relations and properties of the common entity set (including aligned entities and linked entities) naturally change. Therefore, the main challenge here is how to securely store the relations and properties of the common entity set so as to facilitate subsequent KG tasks, such as query, representation, and completion.
To date, there are several ways to store KGs, e.g., triple tables [harris20033store], property tables [wilkinson2006jena], and DB2RDF [bornea2013building]. For example, we can represent the property graph model in Figure 1 in the form of a triple table and a property table. Under the assumption that each entity has two properties, Table 3 describes the storage structure of the KGs. After merging, each party still has its own triples; however, its properties have been merged with those of other parties and thus have to be saved in a secure manner to prevent potential information leakage.
Party A's property table:

Entity | Property
Alice | (0.8, 1)
Bob | (0.5, 1)
Jim Butler (JB) | ( )
C1 | (0.7, 1)

Party B's property table:

Entity | Property
Sam | (0.1, 1)
Jim Butler (JB) | ( )
Lee | (0.3, 0)
C2 | (0.4, 1)
Possible solutions for secure KG storage. Two kinds of methods are popularly used to store data securely, i.e., the secret-shared method and the encrypted method. The former stores data in secret sharing format [lai2019graphse2]; that is, each party holds a share of the raw data, as described in Section 2.2. The latter stores data in the form of ciphertext [cash2013highly]; that is, each party holds the encrypted data whose secret key is kept by the other party. Following this existing research, we propose to store the entity property table securely using either the secret-shared method or the encrypted method. Tables 4 and 5 show how to store KGs using the secret-shared method, where ⟨p_{i,j}⟩ denotes (a share of) the j-th property of entity i.
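A minimal sketch of the secret-shared storage method: the merged property values of a common entity are encoded as fixed-point integers and split into additive shares, so that each party's table alone reveals nothing. The merged values, scale, and table names below are hypothetical, chosen only to illustrate the layout.

```python
import random

P = 2**61 - 1  # field prime (illustrative parameter)

def shr(x):
    a = random.randrange(P)
    return (a, (x - a) % P)

# Hypothetical merged properties of the common entity 'Jim Butler (JB)',
# encoded as fixed-point integers (scale 10) before sharing.
SCALE = 10
merged_props = {"JB": [int(0.6 * SCALE), 1]}  # assumed merged values

table_a, table_b = {}, {}  # each party stores only its own shares
for entity, props in merged_props.items():
    shares = [shr(v) for v in props]
    table_a[entity] = [s[0] for s in shares]  # party A's column
    table_b[entity] = [s[1] for s in shares]  # party B's column

# Neither table alone reveals the values; together they reconstruct them.
recovered = [(a + b) % P for a, b in zip(table_a["JB"], table_b["JB"])]
assert recovered == [6, 1]
```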
4 Privacy Preserving KG Query
In this section, we first describe the taxonomy of KG queries. Then we describe the problem of privacypreserving KG query. We finally present the possible solutions for the above problem.
4.1 Taxonomy of KG Queries
Essentially, a KG query language is in the form of a regular graph query language but targets knowledge graphs. Industrial graph query languages, such as SPARQL [prudhommeaux2008sparql], Cypher [francis2018cypher], and Gremlin [rodriguez2015gremlin], adopt the property graph model (which is more elaborate) for knowledge graphs. In summary, a KG query is a graph query that supports search functionality over property graphs. Angles et al. [Angles2017FoundationsOM] inspected the theory of graph queries in detail, and summarized that graph query languages share two fundamental querying functionalities: Graph Pattern Matching (GPM) and Graph Navigation (GN).

Graph Pattern Matching (GPM). The core functionality of answering graph queries is graph pattern matching. A graph pattern is a graph-structured query with variables and constants. In the case of the property graph model, the entities and relations of the graph are defined as constants, while the query variables form the variable set. For instance, the query "Search for the people that Marko knows" towards a graph G could be turned into a graph pattern (Marko, knows, ?x), where Marko and knows are constants and ?x is a variable. To find a match, we match the graph pattern to the graph G and then search for the occurrences of this pattern. Technically, a match for a KG graph pattern is defined as follows:
Definition 4.1 (Match).
Given a KG G and an instruction (graph pattern) P, a match of P in G is a mapping μ from the constants and variables of P to the constants of G such that μ maps constants to themselves and variables to constants; if the image of P under μ is contained within G, then μ is a match.
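In plaintext, finding the matches of a single-triple basic graph pattern reduces to scanning the fact set for consistent variable bindings. The following sketch (names and the '?'-prefix convention for variables are our own choices) implements the definition above:

```python
def match(pattern, facts):
    """Find matches of a one-triple graph pattern ('?'-prefixed items are
    variables) against a fact set: constants map to themselves, variables
    bind consistently to constants."""
    results = []
    for fact in facts:
        binding = {}
        for p, f in zip(pattern, fact):
            if p.startswith("?"):          # variable: bind it to a constant
                if binding.get(p, f) != f:  # reject inconsistent re-binding
                    break
                binding[p] = f
            elif p != f:                   # constant: must map to itself
                break
        else:
            results.append(binding)
    return results

facts = {("Marko", "knows", "Josh"), ("Marko", "knows", "Peter"),
         ("Josh", "created", "lop")}
bindings = match(("Marko", "knows", "?x"), facts)
print(sorted(b["?x"] for b in bindings))  # -> ['Josh', 'Peter']
```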
Additionally, Basic Graph Patterns (BGPs) can be augmented with relational operations, including projection, join, union, difference, optional, and filter; such graph patterns are called Complex Graph Patterns (CGPs). We summarize the various graph patterns, operations, and related literature in Table 6.
Type | Operations | Related Works
BGP | Matching | [martinez1997algorithm] [Fan2012GraphPM] [Chen2018OnEO] [Cheng2008FastGP]
CGP | Projection | [Buzmakov2015RevisitingPS]
CGP | Join | [Fuchs2020EdgeFrameWO] [Vidal2010EfficientlyJG]
CGP | Union | [Shoudai2018PolynomialTL]
CGP | Difference | [ferre2018sparql]
CGP | Optional | [Mennicke2019FastDS]
CGP | Filter | [ferre2018sparql]
Graph Navigation (GN). While graph pattern matching provides most of the query functionality, it is also helpful to support navigation over the graph topology. Graph navigation has been widely studied by the research community [Wood2012QueryLF, Barcel2013QueryingGD] and is adopted in modern graph query languages such as Gremlin [rodriguez2015gremlin]. One typical example of such a query is 'finding all friends-of-a-friend of some person' in a social network: here we are interested not only in the immediate acquaintances of a person, but also in the people she might know through other people, namely her friends-of-a-friend, their friends, and so on. Traditionally, the path query is the basic component of GN functionality; a path query navigates through an arbitrary number of edges in the graph. In particular, unlike solutions to GPM that have a fixed-arity output, paths do not have a fixed arity, and therefore we cannot directly define a mapping from variables to constants as in the case of GPM.
Definition 4.2 (Path Query).
Given a KG G, a path query is defined in the form (h, c, t), where h and t specify the beginning (head) and ending (tail) entities of the path, and c denotes the traversal condition on the paths.
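A plaintext evaluator for this definition can be sketched as a breadth-first search that only follows edges whose relation satisfies the traversal condition (function and variable names are illustrative):

```python
from collections import deque

def path_query(facts, head, tail, cond):
    """Evaluate a path query (head, cond, tail): return a path of facts
    from head to tail using only relations allowed by cond, via BFS."""
    queue = deque([(head, [])])
    visited = {head}
    while queue:
        node, path = queue.popleft()
        if node == tail:
            return path
        for h, r, t in facts:
            if h == node and cond(r) and t not in visited:
                visited.add(t)
                queue.append((t, path + [(h, r, t)]))
    return None  # no path satisfying the condition

facts = [("Alice", "knows", "Bob"), ("Bob", "knows", "JB"),
         ("JB", "worksFor", "C1")]
# 'friend-of-a-friend': traverse only 'knows' edges
print(path_query(facts, "Alice", "JB", lambda r: r == "knows"))
# -> [('Alice', 'knows', 'Bob'), ('Bob', 'knows', 'JB')]
```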
In a more complex and popular form, the condition can also be expressed as a regular expression. Path queries with regular expressions are called Regular Path Queries (RPQs), and two-way Regular Path Queries (2RPQs) further allow the inverse on traversal, i.e., traversing edges in the backwards direction. Additionally, Conjunctive Regular Path Queries (CRPQs) support complex conjunctions of path queries. We summarize these categories of GN queries and their related literature in Table 7.
Category | Features | Related Works
RPQ | Regular expressions | [Calvanese2003ReasoningOR]
2RPQ | Regular expressions with inverse | [Calvanese1999RewritingOR] [Calvanese2003ReasoningOR]
CRPQ | Conjunctions of RPQs | [Consens1990GraphLogAV] [Sasaki2020StructuralIF]
4.2 Problem Description of Privacy Preserving KG Query
Privacy preserving KG query addresses the problem of querying securely-stored graph databases. Recall from Section 3.2 that after merging, the graph database is distributed and stored between two parties. Each party holds its own triple table and property table, where the properties of the common entities are stored in secret-shared format and the others are stored in plaintext. Informally, we define the privacy-preserving KG query problem as follows:
Definition 4.3 (Privacy preserving KG query).
Assume there are multiple parties, where each party holds a private KG, and each party also has a property set, stored in secret sharing format, describing its entity set. For a general query, the purpose of privacy preserving KG query is to find the matched result in all the KGs, on the basis of protecting KG privacy.
Intuitively, this problem seems to have a close relationship with the widely-studied Protected Database Search (PDS) [song2000practical] problem. In the literature, PDS has been studied from different angles, including Private Information Retrieval (PIR) [Kiayias2015OptimalRP, Canetti2017TowardsDE, Boyle2017CanWA, Patel2018PrivateSI, Angel2018PIRWC, Ali2019CommunicationComputationTI], Searchable Symmetric Encryption (SSE) [cash2013highly], and Order-Preserving Encryption (OPE) [ahmed2019semi]. Nevertheless, most of the existing works focus on relational databases or NoSQL databases; there is quite limited study on searching over protected graph databases. Unlike relational databases, graph databases by design maintain the relationships between nodes, which allows fast and efficient connection checks between two nodes [besta2019demystifying, cui2020highly]. The most related work on protected graph database query is GraphSE² [lai2019graphse2], which addresses the protected graph query problem by leveraging an SSE scheme called the Oblivious Cross-Tags (OXT) protocol [cash2013highly]. However, GraphSE² inherits the leakage of the original OXT protocol and inevitably leaks access patterns. Therefore, it is not an ideal solution for privacy preserving KG query. To date, there is little research on solving the privacy preserving KG query problem both securely and efficiently.
4.3 Possible Solutions for Privacy Preserving KG Query
One possible way to build a general-purpose privacy preserving KG query system is to adopt ideas from industrial systems such as Gremlin [rodriguez2015gremlin], SPARQL [prudhommeaux2008sparql], and Cypher [francis2018cypher]. In this section, we introduce the idea from [cryptoeprint:2020:1415] and describe how to implement a privacy-preserving graph traversal machine.
To begin with, Gremlin uses a traverser with instructions to enable both navigation over the graph topology and graph pattern matching. With GPM and GN as the two most fundamental query functionalities, we use the term “instruction” to represent a single atomic operation for GPM or GN, and the term “multi-instruction” for a conjunction of single instructions. Additionally, we take a simplified formalization of a graph traversal machine, that is,
a pair consisting of the traverser’s location set and the query instruction set. Moreover, we consider the locations during a traversal as sensitive, apart from the beginning and the ending of the traversal, since those are the input and output of the traversal. This traverser forms the basis of our privacy-preserving graph traversal machine.
Secure single instruction query. At a high level, a single instruction query takes a shared location set and an instruction as the input of a traverser, and outputs a new shared location set. Specifically, this procedure can be divided into two main steps, as shown in Figure 5. The first step is an atomic instruction processing operation, which takes the shared location set and the instruction as input and outputs a shared binary vector whose length equals the size of the traverser’s entire location set. Here, the shared binary vector acts as an indication vector. After that, given the shared binary vector and the traverser’s entire location set, the second step obtains a new shared location set by multiplying the binary vector and the location set. Note that the detailed construction of the first step varies with KG storage methods, KG graph models, and even the underlying graph data. Therefore, in this paper, we do not go deeper into the details of its construction and only give an abstraction.
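The selection step above can be sketched as follows. This is a toy two-party illustration (hypothetical names, additive sharing over a prime field): the shared binary indication vector is taken as given, the traverser's entire location set is treated as public, and each party multiplies its indicator share by the public location ids locally; no interaction is needed for this step.

```python
import random

P = 2**61 - 1  # prime modulus for additive secret sharing

def share(x):
    """Split integer x into two additive shares mod P."""
    r = random.randrange(P)
    return r, (x - r) % P

def reconstruct(a, b):
    return (a + b) % P

# Shared binary indication vector: 1 marks locations selected by the instruction.
indicator = [1, 0, 1, 0]
shares0, shares1 = zip(*[share(b) for b in indicator])

# The traverser's entire location set (public location ids in this sketch).
locations = [10, 20, 30, 40]

# Each party multiplies its indicator share by the public location ids locally.
out0 = [(s * l) % P for s, l in zip(shares0, locations)]
out1 = [(s * l) % P for s, l in zip(shares1, locations)]

# Reconstructing reveals the selected locations (zeros mean "not selected").
selected = [reconstruct(a, b) for a, b in zip(out0, out1)]
print(selected)  # [10, 0, 30, 0]
```

In a full protocol the reconstruction would only happen at the very end of the traversal; intermediate location sets stay shared.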
Secure multi-instruction query. Given the above secure single instruction query, it is still challenging to compose single instruction queries into a secure and efficient general framework for multi-instruction query. One naive solution is to sequentially perform Single Instruction Evaluation (SIE), and then use secure matrix multiplication to get the shared traverser location. As shown in Figure 6, the traverser begins with the entire location set and an instruction set. Then SIE sequentially evaluates each instruction and yields a location set after every evaluation. Finally, we call a reconstruction on the resulting location set and obtain the final results.
5 Privacy Preserving KG Representation
In this section, we first briefly summarize the taxonomy of traditional KG representation learning. We then describe the privacy preserving KG representation problem. Finally, we present possible solutions to this problem.
5.1 Taxonomy of KG Representation
KG representation learning, aka KG embedding, aims to learn low-dimensional embeddings of entities and relations [ji2020survey]. It has been widely studied recently, because it significantly affects the performance of downstream KG completion and application tasks. According to [ji2020survey, wang2017knowledge], most existing works on KG representation mainly focus on two directions, i.e., the encoding model and the scoring function. The former aims to encode the interactions of entities and relations through specific model architectures, while the latter is used to measure the plausibility of facts. Commonly used encoding models include linear models, factorization models, neural network (NN) models, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs). Popularly adopted scoring functions consist of distance-based and similarity-based methods. We summarize some related work on KG representation in Table
8. More details can be found in [ji2020survey, wang2017knowledge].

Directions | Methods | Related Works
Encoding Model | Linear | [bordes2013translating, wang2018multi]
Encoding Model | Factorization | [nickel2011three, jenatton2012latent]
Encoding Model | NN | [dong2014knowledge, socher2013reasoning]
Encoding Model | CNN | [shang2019end, dettmers2018convolutional]
Encoding Model | RNN | [gardner2014incorporating, neelakantan2015compositional]
Encoding Model | GNN | [nathani2019learning, vashishth2019composition, schlichtkrull2018modeling]
Scoring Function | Distance based | [bordes2011learning, bordes2013translating, lin2015learning]
Scoring Function | Similarity based | [xue2018expanding, zhang2019interaction, xu2019relation]
5.2 Problem Description of Privacy Preserving KG Representation
KG representation learning is nontrivial under the KG isolation setting, since not only entity properties but also relations between entities are securely stored by multiple parties after merging, as shown in Table 5 and Table 5. Formally, privacy preserving KG representation is defined as follows.
Definition 5.1 (Privacy preserving KG representation).
Given multiple parties, each of whom has an individual KG, and each party also has a property set describing its entity set, where the properties of the common entity set are stored in secret sharing format. The purpose of privacy preserving KG representation is to learn a low-dimensional embedding for each entity of each party, and a relation embedding for each relation between entities, on the basis of protecting each party’s private data.
So far, there is only limited work on privacy preserving graph embedding and graph neural networks [zhou2020privacy, zheng2020asfgnn]. For example, Zhou et al. [zhou2020privacy] proposed a server-aided privacy preserving GNN learning method, which adopts the idea of split learning [vepakomma2018split] and splits the computation graph of the GNN into two parts: the private-graph-data-related computations are done by the data holders, and the remaining hidden-layer computations are done by a neutral server. Zheng et al. [zheng2020asfgnn] proposed to combine federated learning with automated machine learning to address the privacy problem and the non-independent-and-identically-distributed data problem in the data isolation setting. Although both approaches can protect data privacy to a certain extent, they are not provably secure.
5.3 Possible Solutions for Privacy Preserving KG Representation Learning
We divide the solution for privacy preserving KG representation learning into the following three steps based on the main steps in traditional GNN.
Secure initial entity (node) embedding generation.
The first step is to generate initial node embeddings securely for multiple parties using their node features. Traditionally, initial node embeddings are generated by a non-linear transformation, i.e.,
h = σ(xW), where x is the node feature, W is a weight matrix, and σ is a non-linear activation function. Under the KG isolation setting, after secure KG storage, node features are either kept by a single party or secretly shared by multiple parties, as shown in Table 5 and Table 5. Besides, the weight matrix W should also be kept secret for privacy concerns. Motivated by existing work [demmler2015aby], we propose to store W in secret sharing format, that is, each party holds a share of W. Therefore, secure initial node embedding generation becomes the following problem: each party holds its private feature x (or a share of x) together with its share of W, and all the parties want to calculate σ(xW) securely and collaboratively, such that each party holds a share of the result at the end of this step. A further difficulty is that activation functions are either non-linear continuous functions such as Sigmoid and Tanh, which are not cryptography-friendly, or piecewise functions such as ReLU, which rely on time-consuming secure comparison protocols. For the non-linear continuous activation functions, existing works propose to use polynomials
[aono2016scalable] or piecewise functions [mohassel2017secureml] to approximate them.

We now present how to calculate σ(xW) securely for multiple parties in detail. We assume both x and W are securely shared (note that x can be easily transformed into secret sharing format, even when it is originally held by a single party), and summarize the calculation procedure in Algorithm 1. It mainly has three steps. The first step securely calculates xW for multiple parties, which involves secret sharing based addition and multiplication. The second step calculates the polynomial variables. The last step approximates σ(·) using a polynomial. Here, we use a second-order polynomial as an example to approximate the non-linear functions as follows
σ(z) ≈ c_0 + c_1 z + c_2 z^2,  (1)
where the coefficients could be set using different methods [chen2020homomorphic]. After this step, each party holds a share for the same number of entities, with the same embedding dimension.
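As a quick plaintext check of the second-order approximation, the sketch below fits the coefficients by least squares on the interval [-4, 4] (one of several possible coefficient-setting methods, chosen here for illustration) and measures the worst-case error against the true Sigmoid. Once the input is secretly shared, evaluating such a polynomial requires only secure multiplication and scaling by public coefficients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit c0 + c1*z + c2*z^2 to the Sigmoid on [-4, 4] by least squares.
zs = np.linspace(-4.0, 4.0, 200)
c2, c1, c0 = np.polyfit(zs, sigmoid(zs), deg=2)  # highest degree first

def sigmoid_poly(z):
    return c0 + c1 * z + c2 * z * z

# Worst-case approximation error on the fitting interval.
err = float(np.max(np.abs(sigmoid_poly(zs) - sigmoid(zs))))
print(err < 0.2)  # True
```

The error grows quickly outside the fitting interval, so inputs are typically clipped or normalized before applying the approximation.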
Secure embedding propagation. The second step is to propagate node embeddings (aka message passing) securely for multiple parties using their initial node embeddings and the relations between nodes on KGs, as shown in Figure 7. Existing works on GNNs have proposed different kinds of embedding propagation methods, e.g., convolution based [hamilton2017inductive], attention based [velivckovic2017graph], gated mechanism based [li2015gated], and their mixtures [liu2019geniepath]. Take GraphSAGE [hamilton2017inductive]—a classic convolution based GNN—for example: it first aggregates neighborhood embeddings and then transforms them using a fully-connected layer,
h^k_{N(v)} = AGGREGATE_k({h^{k-1}_u, ∀u ∈ N(v)}),  (2)
h^k_v = σ(W^k · CONCAT(h^{k-1}_v, h^k_{N(v)})),  (3)
where the aggregator function is of three types, i.e., Mean, LSTM, and Pooling. Under the KG isolation setting, secure embedding propagation becomes challenging because the initial node embeddings are shared among all the parties, and the relations between nodes are also separated across the parties.
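To make the Mean aggregator concrete under sharing, the toy sketch below (two-party additive sharing over floats, hypothetical names) shows why the aggregation step is cheap: the mean is linear and the neighbor count is public, so each party can average its own shares locally with no interaction, and reconstruction matches the plaintext aggregator. The subsequent concatenate-and-transform step would then go through the secure non-linear transformation of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def share(x):
    """Additively share an array between two parties (toy, over floats)."""
    r = rng.normal(size=x.shape)
    return r, x - r

# Node embeddings (4 nodes, 3 dimensions) and the neighbor list of node 0.
H = rng.normal(size=(4, 3))
neighbors_of_0 = [1, 2, 3]

# Secretly share every embedding between party 0 and party 1.
H0, H1 = share(H)

# Mean aggregation is linear, so each party aggregates its shares locally;
# the division is by a public neighbor count and needs no protocol.
agg0 = H0[neighbors_of_0].mean(axis=0)
agg1 = H1[neighbors_of_0].mean(axis=0)

# Reconstruction matches the plaintext Mean aggregator.
plain = H[neighbors_of_0].mean(axis=0)
print(np.allclose(agg0 + agg1, plain))  # True
```

By contrast, the Pooling aggregator needs an element-wise maximum across neighbors, which is a comparison and therefore requires an interactive protocol (e.g., a secret sharing ArgMax).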
We present how to perform embedding propagation securely under the KG isolation setting in Algorithm 2, where we take the Mean aggregator and the Pooling aggregator as examples and leave other aggregators as future work. Before secure embedding propagation, each party holds its own graph and the shares of the initial node embeddings. Besides, the weight matrices of each layer should also be kept secret for privacy concerns. In Algorithm 2, embeddings are propagated using the Mean aggregator as follows: all the parties first locally calculate the mean of their shares, and then securely calculate the mean of embeddings using a secret sharing division protocol. Alternatively, one can choose the Pooling aggregator to do embedding propagation: we apply an element-wise max-pooling operation to aggregate information across neighbors, using a secret sharing ArgMax protocol. After that, all the parties concatenate their local shares and calculate a non-linear transformation using Algorithm 1. Finally, each party holds a share of the entity embeddings after a certain depth of propagation.

Secure loss computation. The third step is to compute the loss securely for multiple parties using their propagated entity embeddings, based on certain tasks. For example, one usually uses the cross-entropy loss for classification tasks [hamilton2017inductive] and the noise contrastive estimation loss for unsupervised tasks [mikolov2013distributed]. In this step, each party holds a share of the entity embeddings, which can be taken as features, and all the parties want to calculate the loss securely together. Existing secure machine learning models [mohassel2017secureml, demmler2015aby] can be directly applied to solve this problem.

6 Privacy Preserving KG Completion
In this section, we first briefly summarize the taxonomy of traditional KG completion. We then describe the privacy preserving KG completion problem. Finally, we present possible solutions to this problem.
6.1 Taxonomy of KG Completion
KG completion, aka KG reasoning, is an important research problem due to the inherent incompleteness of KGs. KG completion aims to infer the missing properties and triples. According to [ji2020survey], existing works on traditional KG completion mainly fall into three categories, i.e., embedding-based methods [guan2018shared, shi2018open], relation path inference [lao2010relational, gardner2014incorporating], and rule-based reasoning [omran2019embedding, guo2016jointly]. Among them, embedding-based methods are popularly used due to their high efficiency. Take triple prediction for example: embedding-based methods first calculate pairwise scores of all the candidate entities given a target entity, and then rank the top candidate entities to link edges with the target entity.
6.2 Problem Description of Privacy Preserving KG Completion
Under the KG isolation setting, privacy preserving KG completion is defined as follows.
Definition 6.1 (Privacy preserving KG completion).
Given multiple parties, each of whom has an individual KG, and each party also has a sparse property set describing its entity set. The purpose of privacy preserving KG completion is to infer the missing properties and the missing triples for any party.
Although traditional KG completion has been extensively studied, to the best of our knowledge, there is little literature on how to perform privacy preserving KG completion under the KG isolation setting.
6.3 Possible Solutions for Privacy Preserving KG Completion
We propose an embedding-based privacy preserving KG completion approach, the key to which is building embedding projection functions using the learnt KG representations.
First, for privacy preserving property completion, the embedding projection function maps an embedding to a property. Here, the embeddings are secretly shared by multiple parties, and can be seen as private features. The properties are either secretly shared by multiple parties or held by a single party (in which case they can also be transformed into secret sharing format), determined by whether the corresponding entity is a common entity or not, and can be taken as private labels. That is, the secretly shared property is obtained by applying a projection function, parameterized by a tensor for property completion, to the secretly shared entity embedding. The projection function could be any existing neural layer, such as a fully-connected layer or a convolutional layer.

Second, for privacy preserving triple completion, the embedding projection function maps two entity embeddings to a relation. Here, the entity embeddings are secretly shared by multiple parties, and the relation denotes whether two entities are linked and is thus held by a single party. That is, the secretly shared relation is obtained by applying a projection function, parameterized by a tensor for triple completion, to the secretly shared head entity embedding and tail entity embedding. The projection function could be any existing function for triple completion, such as ProjE [shi2016proje] and SENN [guan2018shared].
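As a plaintext illustration, the sketch below is a simplified, hypothetical combine-then-project projection function for triple completion, loosely inspired by ProjE but not its exact formulation: it combines the head embedding with relation-specific parameters and scores every candidate tail entity in one shot. All names and shapes are assumptions for this sketch; in the privacy preserving setting, each operation would run on secret shares.

```python
import numpy as np

rng = np.random.default_rng(1)

def triple_scores(e_h, e_t_candidates, W, b):
    """Hypothetical projection function: map a head embedding to a
    relation-specific query vector, then score all candidate tails."""
    q = np.tanh(e_h @ W + b)       # combine head with relation parameters
    return e_t_candidates @ q      # one dot-product score per candidate

e_h = rng.normal(size=4)                  # head entity embedding
candidates = rng.normal(size=(10, 4))     # 10 candidate tail embeddings
W = rng.normal(size=(4, 4))               # relation-specific tensor
b = rng.normal(size=4)

scores = triple_scores(e_h, candidates, W, b)
print(scores.shape)  # (10,)
```

Ranking the candidates by score then yields the top entities to link, as in the triple prediction procedure described above.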
From the above description, we find that the projection functions for both property completion and triple completion need to be changeable due to the variety of candidate functions. Therefore, the challenge here is how to provide a scalable and flexible secure machine learning platform on which one can easily build secure machine learning algorithms to meet the various projection functions in KG completion. This can be addressed by developing a system (e.g., Nebula [wuposter]) with hybrid secure computation protocols, rich computation operations, and a powerful domain specific language. To this end, we propose a possible solution, as shown in Figure 8, which can be divided into four layers:


Cryptography primitives layer mainly contains secure computation protocols and their conversions. This is because different protocols have their own advantages and are suitable for different scenarios.

Operation layer implements popularly used operations in machine learning such as matrix multiplication. Besides, these operations can be performed by various protocols in the cryptography primitives layer.

Adapter layer defines the domain specific language and compiler. This layer aims to facilitate machine learning developers in developing various algorithms and functions without knowing the complicated cryptographic techniques.

Secure machine learning layer is built upon the lower-level layers. It provides commonly used secure machine learning models, such as the Multi-Layer Perceptron (MLP), to meet the flexible requirements of KG completion tasks.
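The layering above can be sketched as a minimal interface design (all class and method names here are hypothetical, not from an actual system): the cryptography primitives layer exposes a protocol interface, and the operation layer builds ML operations such as matrix multiplication on top of whichever protocol is plugged in.

```python
from abc import ABC, abstractmethod

class Protocol(ABC):
    """Cryptography primitives layer: one interface, many protocols."""
    @abstractmethod
    def mul(self, x, y): ...

class SecretSharingProtocol(Protocol):
    def mul(self, x, y):
        # Placeholder: a real backend would run an MPC multiplication
        # (e.g., Beaver triples) instead of plaintext arithmetic.
        return x * y

class OperationLayer:
    """Operation layer: ML operations built on the chosen protocol."""
    def __init__(self, protocol: Protocol):
        self.protocol = protocol

    def matmul(self, A, B):
        n, k, m = len(A), len(B), len(B[0])
        return [[sum(self.protocol.mul(A[i][t], B[t][j]) for t in range(k))
                 for j in range(m)] for i in range(n)]

ops = OperationLayer(SecretSharingProtocol())
print(ops.matmul([[1, 2]], [[3], [4]]))  # [[11]]
```

Swapping in a different `Protocol` implementation changes the security backend without touching the operation layer or anything built above it, which is what makes the platform flexible for varying projection functions.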
7 Privacy Preserving KGaware Applications
KGs have been applied to various tasks, including question answering [zhang2018variational, huang2019knowledge], recommender systems [cao2019unifying, wang2019kgat], risk assessment [cheng2019risk], fraud detection [liu2018heterogeneous, wang2019semi], and information extraction [hoffmann2011knowledge, koncel2019text]. However, with more people caring about privacy and various regulations coming into force, KG isolation becomes a serious problem, which limits the performance of KGs and prevents them from being more widely used. In this section, we present three privacy preserving KG-aware applications and briefly describe how our proposed techniques can be applied to them.
7.1 Risk Assessment in Guarantee Loan
Bank loans are important to the development of small and medium enterprises, and a popular way to decrease loan defaults is the guarantee mechanism. That is, when a small or medium enterprise needs a bank loan, it can choose another enterprise as a guarantor. As more and more enterprises participate, the guarantee relations form a guarantee loan KG. Therefore, judging whether a guarantee is risky or not is important to risk control in banks. In practice, it is common that enterprises take loans from multiple banks to obtain sufficient funds. However, due to privacy considerations, these banks cannot share customer information with each other, i.e., the guarantee loan KGs are isolated, which increases the difficulty of risk control.
Figure 9 shows a risk assessment example under the guarantee loan KG isolation setting. Here, we assume there are two banks, and each of them has a guarantee loan KG where nodes denote enterprises and edges denote guarantee relations. Apparently, the risk assessment ability is limited if only a single KG is used. For example, based on its own KG alone, the first bank is likely to allow the guarantee of enterprise 3 to enterprise 1. However, by incorporating the KG of the other bank, the first bank will probably refuse this guarantee, because enterprise 1 has guaranteed enterprise 3 indirectly (through enterprise 2), and a guarantee loop usually has high risk and is thus forbidden. Meanwhile, it is difficult for the other bank to decide whether enterprise 5 is appropriate to guarantee other enterprises, since enterprise 5 has no guarantee behaviors in that bank’s own KG. By combining the first bank’s KG, the other bank can more easily make the right decision.
The above risk assessment problem under the guarantee loan KG isolation setting can be solved using our proposed privacy preserving KG query and KG representation methods. First, the guarantee loop detection problem is actually querying whether an enterprise (e.g., 1) has a link to another enterprise (e.g., 3) in the multiple KGs, on the basis of protecting the KGs. Second, to better assess guarantee loans, privacy preserving KG representation can be used to learn enterprise representations, followed by a binary classification task, similar to [cheng2019risk]. Thus, both privacy preserving KG query and KG representation can decrease the risk under the guarantee loan KG isolation setting.
7.2 Fraud Detection
Fraud detection is a major task for many companies, especially financial companies, due to its direct effect on capital loss. A key step in fraud detection is to classify whether a user is a fraud user or a non-fraud user [wang2019semi]. In practice, users often use the products of multiple companies, and each company usually builds its fraud detection system based on its own data. That is, each company has its own KG. It is natural that these companies could improve their fraud detection ability together by combining their KGs. Unfortunately, these companies cannot share private user data with each other for regulation or competition reasons. Therefore, the KGs of different companies for fraud detection are isolated.

Figure 10 shows a fraud detection example under the KG isolation setting. Two companies have different user-item KGs; they share some common users but have different graph relational data. We assume one party has a KG with user-device login data and the other party has a KG with user-WiFi login data. Apparently, taking the KGs of both companies into consideration will bring a more intelligent fraud detection ability. The purpose is to build a privacy preserving fraud detection system, i.e., to classify whether a user is a fraud user or not, using the two KGs.
The above fraud detection problem under the KG isolation setting can be solved using the privacy preserving property completion technique. As we described in Section 1, labels can be taken as special properties. With privacy preserving property completion, the two companies can build a fraud detection system together while protecting their own KGs.
7.3 Recommender System
The recommender system is a powerful tool for solving the information overload problem, especially in the current big data era. Traditional recommender systems are mainly built on user-item interaction data, e.g., rating and purchase data. Recent studies have shown that rich side information, e.g., user social information, is effective for improving recommendation performance [chen2020secure]. However, such information may be held by another platform in practice. This data isolation problem limits the performance of recommender systems.
Figure 11 shows a typical case of the KG isolation problem in recommender systems. Figure 11 (a) is the KG of one party, built using its user-item interaction data. Figure 11 (b) is the KG of another party, built based on its user social data. This is quite a common situation in reality, since e-commerce platforms such as Amazon have rich user-item interaction data, while social media platforms like Facebook have plenty of user social data. How to use the additional data on other platforms to further improve the performance of the recommendation platform, while protecting the raw data security of both platforms, is a crucial question to be answered [chen2020secure].
The above secure recommendation problem under the KG isolation setting can be solved using the privacy preserving triple completion technique, because the essence of a recommender system is a link prediction problem on a KG, i.e., predicting whether there is a relation between a user (head entity) and an item (tail entity). With this technique at hand, the development of recommender systems will head in a new direction in the future.
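As a minimal plaintext illustration of this link prediction view (hypothetical names and toy embeddings), a user-item pair can be scored by the inner product of their learnt embeddings, and the higher-scoring items are recommended; in the privacy preserving setting, the same score would be computed on secret-shared embeddings.

```python
import numpy as np

def link_score(user_emb, item_emb):
    """Score the likely existence of a (user, relation, item) triple."""
    return float(user_emb @ item_emb)

user = np.array([0.9, 0.1, 0.4])
item_close = np.array([0.8, 0.0, 0.5])    # close to the user's taste
item_far = np.array([-0.6, 0.9, -0.2])    # far from the user's taste

print(link_score(user, item_close) > link_score(user, item_far))  # True
```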
8 Conclusion and Future Work
In this paper, we first described the KG isolation problem in practice. We then summarized the open problems in privacy preserving KG, including merging, query, representation, and completion. We formally defined these problems and proposed possible solutions for them. We finally presented three application scenarios of our proposed privacy preserving KG techniques. Our work aims to shed light on future directions for privacy preserving KG under the data isolation setting. In the future, we would like to present detailed technical solutions for these problems and further deploy them in real-world applications.