In recent years, people have built a large number of knowledge graphs (KGs) such as Freebase bollacker2008freebase , DBpedia auer2007dbpedia , YAGO suchanek2007yago , NELL carlson2010nell and Wikidata vrandevcic2014wikidata . KGs provide a novel way to describe the real world by storing structured relational facts about concrete entities and abstract concepts. These structured relational facts can be either automatically extracted from large amounts of plain text and structured Web data, or manually annotated by human experts. To store this knowledge, KGs mainly contain two elements: entities, which represent both concrete and abstract concepts, and relations, which indicate relationships between entities. To record relational facts in KGs, many schemes such as RDF (Resource Description Framework) have been proposed, and they typically represent the entities and relations in KGs as discrete symbols. For example, we know that Beijing is the capital of China. In KGs, we represent this fact as the triple (Beijing, is_capital_of, China). Nowadays, KGs play an important role in many artificial intelligence tasks, such as word similarity computation pedersen2004wordnet , word sense disambiguation leacock1998combining ; chen2014unified , entity disambiguation dredze2010entity , semantic parsing bordes2012joint ; berant2013semantic , text classification scott1998text ; wang2009using , topic indexing medelyan2008topic ; verma2007semantic , document ranking hu2009understanding , information extraction hoffmann2011knowledge ; daiber2013improving , and question answering bordes2014question ; bordes2014open .
However, two main challenges remain when utilizing KGs in real-world applications: data sparsity and growing computational inefficiency. Existing knowledge construction and application approaches lao2010relational ; lao2012reading ; di2012linked ; smirnov2015patterns usually store relational facts in KGs with one-hot representations of entities and relations, which cannot capture their rich semantic information. A one-hot representation, in essence, maps each entity or relation to an index, which is very efficient for storage. However, it does not embed any semantic aspect of entities and relations. Hence, it cannot capture the similarities and differences among “Bill Gates”, “Steve Jobs” and “United States”. Moreover, these works rely on sophisticated, specially designed features extracted from external information sources or from the network structure of KGs. As KGs grow in size, these methods usually suffer from computational inefficiency and a lack of extensibility.
With the development of deep learning, distributed representation learning has shown its ability in computer vision and natural language processing. Recently, distributed representation learning of KGs has also been explored, showing a powerful capability for representing knowledge in relation extraction, knowledge inference, and other knowledge-driven applications. Knowledge representation learning (KRL) typically learns distributed representations of both the entities and the relations of a KG and projects them into a low-dimensional semantic space, aiming to encode the semantic meaning of entities and relations in their corresponding low-dimensional vectors. Compared with the traditional representation, KRL gives the entities and relations in a KG much denser representations, which leads to lower computational complexity in applications. Moreover, KRL can explicitly capture the similarity between entities and relations by measuring the similarity of their low-dimensional embeddings. With these advantages, KRL is flourishing in the applications of KGs, and a great number of methods using representation learning in KGs have been proposed.
In this article, we first review the recent advances in KRL. Second, we perform a quantitative analysis of most existing KRL models on three typical knowledge acquisition tasks: knowledge graph completion, triple classification, and relation extraction. Third, we introduce typical real-world applications of KRL such as recommender systems, language modeling, and question answering. Finally, we re-examine the remaining research challenges and outline future trends for KRL and its applications.
2 Knowledge Representation Learning
Knowledge representation learning aims to embed the entities and relations in KGs into a low-dimensional continuous semantic space. For convenience of presentation, we first introduce the basic notations used in this paper. We define $G = (E, R, T)$ as a KG, where $E$ is a set of entities, $R$ is a set of relations, and $T$ is the set of fact triples with the format $(h, r, t)$. Here, $h$ and $t$ indicate the head and tail entities, and $r$ indicates the relationship between them. For example, (Microsoft, founder, Bill Gates) indicates that there is a relation founder between Microsoft and Bill Gates. We write the embeddings of $h$, $r$ and $t$ in bold as $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$.
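As a minimal sketch of this notation, a KG can be held as a plain set of (head, relation, tail) triples, from which the entity and relation sets are derived. The toy facts are the two examples used in this article; the variable names (`triples`, `entity_id`, `relation_id`, `holds`) are illustrative assumptions, not part of any library.

```python
# Toy knowledge graph: a set of (head, relation, tail) fact triples.
triples = {
    ("Microsoft", "founder", "Bill Gates"),
    ("Beijing", "is_capital_of", "China"),
}

# Entity and relation sets derived from the triples.
entities = sorted({e for h, _, t in triples for e in (h, t)})
relations = sorted({r for _, r, _ in triples})

# Symbolic (index) representation: every entity or relation is just an id.
# The ids carry no semantics, which is the limitation KRL tries to remove.
entity_id = {e: i for i, e in enumerate(entities)}
relation_id = {r: i for i, r in enumerate(relations)}


def holds(h, r, t):
    """Return True if (h, r, t) is a stored fact."""
    return (h, r, t) in triples
```

Note that the triple is directed: swapping head and tail yields a different, unstored fact.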
Recently, KRL has become one of the most popular research areas, and researchers have proposed many models to embed entities and relations in KGs. Next, we introduce the typical models for KRL, including linear models, neural models, matrix factorization models, translation models and other models.
2.1 Linear Models
Linear models employ a linear combination of the entities’ and relations’ representations to measure the probability of a fact triple.
2.1.1 Structured Embedding (SE)
SE bordes2011learning is one of the early models to embed KGs. SE first learns two relation-specific matrices $\mathbf{M}_{r,1}$ and $\mathbf{M}_{r,2}$ for head entities and tail entities respectively. It then multiplies the head and tail entity vectors by these projection matrices and defines the score function of each triple $(h, r, t)$ as the distance between the two projected vectors:
$$f_r(h, t) = \| \mathbf{M}_{r,1}\mathbf{h} - \mathbf{M}_{r,2}\mathbf{t} \|.$$
That is, SE transforms the entity vectors $\mathbf{h}$ and $\mathbf{t}$ by the corresponding head and tail relation matrices $\mathbf{M}_{r,1}$ and $\mathbf{M}_{r,2}$ for the relation $r$ and then measures their similarity in the transformed relation-specific space, which reflects the semantic relatedness of the head and tail entities under the relation $r$.
However, since the model learns two separate matrices for head and tail entities for each relation, it cannot precisely capture the semantic relatedness between entities and relations.
2.1.2 Semantic Matching Energy (SME)
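SME bordes2012joint ; bordes2014semantic first represents head entities, relations and tail entities with vectors, and then models the correlations between entities and relations with semantic matching energy functions. SME defines a linear form of the semantic matching energy function:
$$f_r(h, t) = (\mathbf{M}_1\mathbf{h} + \mathbf{M}_2\mathbf{r} + \mathbf{b}_1)^\top (\mathbf{M}_3\mathbf{t} + \mathbf{M}_4\mathbf{r} + \mathbf{b}_2),$$
and also a bilinear form:
$$f_r(h, t) = \big((\mathbf{M}_1\mathbf{h}) \otimes (\mathbf{M}_2\mathbf{r}) + \mathbf{b}_1\big)^\top \big((\mathbf{M}_3\mathbf{t}) \otimes (\mathbf{M}_4\mathbf{r}) + \mathbf{b}_2\big),$$
where $\mathbf{M}_1$, $\mathbf{M}_2$, $\mathbf{M}_3$ and $\mathbf{M}_4$ are transformation matrices, $\otimes$ indicates the Hadamard product, and $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors. In bordes2014semantic , the bilinear form of SME is further extended by replacing its matrices with 3-way tensors to improve its modeling ability.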
2.1.3 Latent Factor Model (LFM)
LFM jenatton2012latent ; sutskever2009modelling employs a relation-specific bilinear form to model the relatedness between entities and relations, and the score function for each triple $(h, r, t)$ is defined as:
$$f_r(h, t) = \mathbf{h}^\top \mathbf{M}_r \mathbf{t},$$
where $\mathbf{M}_r$ is the matrix for relation $r$.
This is a big improvement over the previous models, since it lets the distributed representations of head and tail entities interact in a simple and efficient way. However, LFM is still restricted by the large number of parameters used for modeling relations.
DistMult yang2014embedding further reduces the number of relation parameters in LFM by simply restricting $\mathbf{M}_r$ to be a diagonal matrix. This results in a less complex model which achieves superior performance.
ANALOGY pmlr-v70-liu17d uses the same bilinear score function as LFM to measure the probability of fact triples and further discusses the normality and commutativity of the relation matrices.
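To make the difference in parameter count concrete, the following sketch scores one triple with a full bilinear form and with its diagonal restriction. It assumes the usual forms $\mathbf{h}^\top \mathbf{M}_r \mathbf{t}$ for LFM and $\mathbf{h}^\top \mathrm{diag}(\mathbf{r}) \mathbf{t}$ for DistMult; the function names and toy vectors are illustrative only.

```python
def bilinear_score(h, M, t):
    """LFM-style score h^T M t with a full d x d relation matrix M (list of rows)."""
    return sum(h[i] * M[i][j] * t[j] for i in range(len(h)) for j in range(len(t)))


def distmult_score(h, r_diag, t):
    """DistMult-style score: the relation matrix is restricted to diag(r_diag)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r_diag, t))


h = [1.0, 2.0]
t = [3.0, -1.0]
r_diag = [0.5, 2.0]           # d parameters per relation
M = [[0.5, 0.0], [0.0, 2.0]]  # d * d parameters per relation, here diagonal
```

With a diagonal `M` the two scores coincide. A side effect of the diagonal restriction, visible in the sketch, is that swapping head and tail leaves the DistMult score unchanged.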
2.2 Neural Models
Neural Models aim to output the probability of the fact triples by neural networks which take the entities’ and relations’ embeddings as inputs.
2.2.1 Multi Layer Perceptron (MLP)
2.2.2 Single Layer Model (SLM)
SLM is similar to the MLP model. It attempts to alleviate the issue of the SE model by connecting entity and relation embeddings implicitly via the nonlinearity of a single-layer neural network. The score function for each triple $(h, r, t)$ of SLM is defined as
$$f_r(h, t) = \mathbf{r}^\top \tanh(\mathbf{M}_{r,1}\mathbf{h} + \mathbf{M}_{r,2}\mathbf{t}),$$
where $\mathbf{M}_{r,1}$ and $\mathbf{M}_{r,2}$ are weight matrices and $\mathbf{r}$ is the vector of relation $r$.
Although SLM improves over the SE model, it still suffers from problems when modeling large-scale KGs. The reason is that its nonlinearity can only implicitly capture the interaction between entities and relations, and can even make optimization hard.
2.2.3 Neural Tensor Network
As illustrated in Figure 1, Neural Tensor Network (NTN) socher2013reasoning employs a bilinear tensor to combine two entities’ embeddings via multiple aspects. The score function for each triple $(h, r, t)$ of NTN is defined as:
$$f_r(h, t) = \mathbf{r}^\top \tanh(\mathbf{h}^\top \vec{\mathbf{M}}_r \mathbf{t} + \mathbf{M}_{r,1}\mathbf{h} + \mathbf{M}_{r,2}\mathbf{t}),$$
where $\vec{\mathbf{M}}_r$ is a 3-way tensor, $\mathbf{M}_{r,1}$ and $\mathbf{M}_{r,2}$ are weight matrices, and $\mathbf{r}$ is the vector of relation $r$. Note that SLM can be viewed as a special case of NTN without its tensor.
Meanwhile, unlike previous KRL models that model each entity with one vector, NTN represents each entity by averaging the word embeddings of its name. This approach can capture the semantic meaning of each entity name and further reduce the sparsity of entity representation learning.
However, although the tensor operation in NTN gives a more explicit description of the comprehensive semantic relatedness between entities and relations, the resulting high complexity may restrict its application to large-scale KGs.
2.2.4 Neural Association Model (NAM)
NAM liu2016probabilistic adopts multi-layer nonlinear activations in a deep neural network to model the conditional probabilities between head and tail entities. NAM studies two model structures: the deep neural network (DNN) and the relation-modulated neural network (RMNN).
NAM-DNN feeds the head and tail entities’ embeddings into an MLP with fully connected layers, which is formalized as follows:
where , and is the weight matrix and bias vector for the -th fully connected layer respectively. And finally the score function of NAM-DNN is defined as:
Different from NAM-DNN, NAM-RMNN feeds the relation embedding into each layer of the deep neural network as follows:
where , and indicate the weight matrices. And the score function of NAM-RMNN is defined as:
2.3 Matrix Factorization Models
Matrix factorization is an important technique to obtain low-rank representations. Hence, researchers also use matrix factorization in KRL.
A typical matrix factorization model in KRL is RESCAL, a collective tensor factorization model presented in nickel2011three ; nickel2012factorizing , which reduces the modeling of the structure of KGs to a tensor factorization operation. In RESCAL, the triples in a KG form a large tensor $\mathbf{X}$ whose entry $X_{hrt}$ is $1$ when $(h, r, t)$ holds and $0$ otherwise. Tensor factorization aims to factorize $\mathbf{X}$ into entity embeddings and relation embeddings, so that $X_{hrt}$ is close to $\mathbf{h}^\top \mathbf{M}_r \mathbf{t}$. Almost at the same time, (drumond2012predicting, ) also use tensor factorization in KRL in the same way. RESCAL is similar to the previous model LFM. The major difference is that RESCAL optimizes all values in $\mathbf{X}$, including the zero values, while LFM focuses on the triples in KGs.
Besides RESCAL, there are other works utilizing matrix factorization in KRL. (riedel2013relation, ; fan2014distant, ) learn representations for head-tail entity pairs instead of single entities. Formally, they build an entity pair-relation matrix whose entry for the pair $(h, t)$ and the relation $r$ is $1$ when $(h, r, t)$ holds and $0$ otherwise. Matrix factorization is then applied to factorize this matrix into entity pair embeddings and relation embeddings. Similarly, (tresp2009materializing, ) and (huang2014scalable, ) both model the head entity and the relation-tail entity pair with two separate vectors. However, such paired modeling cannot capture the interaction within the pairs and suffers more easily from data sparsity.
2.4 Translation Models
Representation learning has been widely used in many NLP tasks since (mikolov2013distributed, ) proposed a distributed word representation model and released the tool word2vec. Mikolov et al. observed some interesting phenomena with their models. They found that the difference between the vectors of two words often embodies the relation between the two words in the semantic space. For example, we have:
$$C(\text{king}) - C(\text{queen}) \approx C(\text{man}) - C(\text{woman}),$$
where $C(w)$ indicates the word vector of word $w$. In other words, word embeddings can capture the implicit semantic relatedness between king and queen, man and woman. Moreover, they found that this phenomenon exists for both lexical and syntactic relations according to their experimental results in the analogy task. Researchers fu2014learning also use this property of word embeddings to discover hierarchical relations between words.
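Inspired by mikolov2013distributed , TransE bordes2013translating regards a relation $r$ as a translation vector $\mathbf{r}$ between the head and tail entities’ vectors $\mathbf{h}$ and $\mathbf{t}$ for each triple $(h, r, t)$. As illustrated in Figure 2, TransE wants $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ when $(h, r, t)$ holds. The score function for each triple is defined as:
$$f_r(h, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \|_{L_1/L_2},$$
where $\|\cdot\|_{L_1/L_2}$ can be either the $L_1$- or the $L_2$-norm.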
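The translation idea can be sketched in a few lines. The code below assumes the distance form $\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$ under the $L_1$ or $L_2$ norm, where a smaller value means a more plausible triple; the function name and the toy vectors are illustrative assumptions.

```python
def transe_score(h, r, t, norm=1):
    """Distance ||h + r - t|| under the L1 (norm=1) or L2 (norm=2) norm."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return sum(d * d for d in diff) ** 0.5


h = [0.0, 1.0]
r = [1.0, 0.0]
t_good = [1.0, 1.0]  # exactly h + r, so the distance is zero
t_bad = [3.0, 5.0]   # far from h + r
```

Ranking candidate tails by this distance is what knowledge graph completion with a translation model amounts to: `t_good` is ranked ahead of `t_bad` for the query `(h, r, ?)`.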
Compared to traditional knowledge representation models, TransE can model complicated semantic relatedness between entities and relations with fewer model parameters and lower computational complexity. Bordes et al. evaluate the performance of TransE on the task of knowledge graph completion on WordNet and Freebase datasets. The experimental results show that TransE significantly outperforms previous KRL models, especially on large-scale and sparse KGs.
Bordes et al. also propose a naive version of TransE, the Unstructured Model bordes2012joint ; bordes2014semantic , which simply assigns the zero vector to each relation, so that the score function is defined as:
$$f_r(h, t) = \| \mathbf{h} - \mathbf{t} \|_{L_1/L_2}.$$
However, due to the lack of relation embeddings, the Unstructured Model cannot consider relation information in the structure of KGs.
Since TransE is simple and efficient, many researchers have extended it and applied it to many tasks, and TransE has become a representative model of knowledge representation. In the next section, we will take TransE as an example and introduce the major challenges and solutions in knowledge representation.
2.5 Other Models
Since the proposal of TransE, most new KRL models have been based on it. Besides TransE and its extensions, we also introduce some other models which achieve promising performance.
2.5.1 Holographic Embeddings (HolE)
To combine the expressive power of the tensor product with the efficiency and simplicity of TransE, HolE nickel2015holographic uses the circular correlation of vectors to represent pairs of entities as $\mathbf{h} \star \mathbf{t}$, where $\star: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ denotes circular correlation:
$$[\mathbf{a} \star \mathbf{b}]_k = \sum_{i=0}^{d-1} a_i \, b_{(k+i) \bmod d}.$$
Circular correlation is not commutative, and each of its components can be viewed as a dot product. This lets it better model irreflexive relations and similar relations in KGs. Moreover, although circular correlation can be interpreted as a special case of the tensor product, it can be accelerated by the fast Fourier transform, which makes it faster while maintaining strong expressive ability.
For each triple $(h, r, t)$, HolE defines its score function as:
$$f_r(h, t) = \sigma(\mathbf{r}^\top (\mathbf{h} \star \mathbf{t})),$$
where $\sigma$ is the sigmoid function.
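The following sketch spells out circular correlation directly from its definition, $[\mathbf{a} \star \mathbf{b}]_k = \sum_i a_i b_{(k+i) \bmod d}$, and plugs it into a HolE-style score $\sigma(\mathbf{r}^\top(\mathbf{h} \star \mathbf{t}))$. It uses the naive $O(d^2)$ loop rather than the FFT, and the function names are assumptions made for illustration.

```python
import math


def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_{(k + i) mod d}, computed naively in O(d^2)."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]


def hole_score(h, r, t):
    """Sigmoid of the dot product between r and the correlation of h and t."""
    corr = circular_correlation(h, t)
    x = sum(ri * ci for ri, ci in zip(r, corr))
    return 1.0 / (1.0 + math.exp(-x))


a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]
ab = circular_correlation(a, b)  # [0.0, 1.0, 0.0]
ba = circular_correlation(b, a)  # [0.0, 0.0, 1.0]
```

Swapping the arguments changes the result, which illustrates the non-commutativity mentioned above. The component at index 0 is the plain dot product of the two vectors.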
2.5.2 Complex Embedding (ComplEx)
ComplEx employs an eigenvalue decomposition model to take complex-valued embeddings into consideration in KRL. The composition of complex embeddings makes ComplEx capable of modeling various kinds of binary relations. Formally, the score function of the fact $(h, r, t)$ in ComplEx is defined as:
where is expected to be when holds, otherwise . Here, is further calculated as follows:
where is a weight matrix, , indicates the imaginary part of and indicates the real part of . Note that ComplEx can be viewed as an extension of RESCAL that assigns complex-valued embeddings to the entities and relations.
Besides, (hayashi2017equivalence, ) have recently proved that HolE is mathematically equivalent to ComplEx.
3 The Main Challenges of Knowledge Graph Representation Learning
Recently, knowledge representation models such as TransE have achieved significant improvements in many real-world tasks. However, there are still many challenges in KRL. In this section, we take TransE as an example and introduce related works which try to solve these problems.
3.1 Complex Relation Modeling
TransE is simple and effective, and it performs well on large-scale KGs. However, because of its simplicity, TransE cannot properly model complex relations in KGs.
Here, complex relations are defined as follows. According to their mapping properties, relations are divided into four types: 1-to-1, 1-to-n, n-to-1 and n-to-n relations. Taking a 1-to-n relation as an example, the head entity in this relation links to multiple tail entities. We regard 1-to-n, n-to-1 and n-to-n relations as complex relations.
Researchers have found that existing KRL models perform poorly when dealing with complex relations. Take TransE as an example: since TransE regards a relation as a translation vector between head and tail entities, it expects $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for each fact triple $(h, r, t)$. Therefore, we directly obtain the following contradictions:
(1) If the relation $r$ is a reflexive relation such as friends, i.e., both $(h, r, t)$ and $(t, r, h)$ hold, we will get $\mathbf{r} \approx \mathbf{0}$ and $\mathbf{h} \approx \mathbf{t}$.
(2) If the relation $r$ is a 1-to-n relation, i.e., $(h, r, t_i)$ holds for all $i \in \{0, \ldots, m\}$, we will get $\mathbf{t}_0 \approx \mathbf{t}_1 \approx \cdots \approx \mathbf{t}_m$. Similarly, this problem also exists when $r$ is an n-to-1 relation.
For example, there are two triples (United States, President, Barack Obama) and (United States, President, George W. Bush) in KGs. Here, the relation President is a typical one-to-many relation. If we use TransE to model these two triples, as illustrated in Figure 3, we will get the same embeddings for Barack Obama and George W. Bush.
This obviously deviates from the truth. Barack Obama and George W. Bush differ in many aspects, apart from both being presidents of the United States. Therefore, the entity embeddings learned by TransE lack discrimination because of these complex relations.
Hence, how to deal with complex relations is one of the main challenges in KRL. Recently, there are some extensions of TransE which focus on this challenge. We will introduce these models in this section.
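To address the issue of TransE when modeling complex relations, TransH wang2014knowledge proposes that an entity should have different distributed representations in fact triples with different relations.
As illustrated in Figure 4, for a relation $r$, TransH projects head and tail entities onto the hyperplane specific to this relation. Formally, for a triple $(h, r, t)$, the head and tail entities are first projected onto the hyperplane of the relation $r$, denoted as $\mathbf{h}_\perp$ and $\mathbf{t}_\perp$, which are calculated by:
$$\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \, \mathbf{w}_r, \quad \mathbf{t}_\perp = \mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \, \mathbf{w}_r,$$
where $\mathbf{w}_r$ is the normal vector of the hyperplane, with $\|\mathbf{w}_r\|_2 = 1$. Then the score function for each triple is defined as
$$f_r(h, t) = \| \mathbf{h}_\perp + \mathbf{r} - \mathbf{t}_\perp \|_{L_1/L_2}.$$
Note that there may exist an infinite number of hyperplanes for a relation $r$, but TransH simply requires $\mathbf{w}_r$ and $\mathbf{r}$ to be approximately orthogonal by restricting $|\mathbf{w}_r^\top \mathbf{r}| \le \epsilon$.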
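A small numeric sketch shows why the hyperplane projection helps with 1-to-n relations. It assumes the projection $\mathbf{v} - (\mathbf{w}_r^\top \mathbf{v})\mathbf{w}_r$ with a unit normal vector and an $L_1$ distance; the helper names and the toy vectors are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_hyperplane(v, w):
    """Project v onto the hyperplane with unit normal w: v - (w^T v) w."""
    s = dot(w, v)
    return [vi - s * wi for vi, wi in zip(v, w)]


def transh_score(h, r, t, w):
    """L1 distance between the projected head plus r and the projected tail."""
    hp = project_to_hyperplane(h, w)
    tp = project_to_hyperplane(t, w)
    return sum(abs(a + b - c) for a, b, c in zip(hp, r, tp))


w = [0.0, 0.0, 1.0]    # unit normal of the relation hyperplane
r = [1.0, 0.0, 0.0]    # translation vector lying in the hyperplane
h = [0.0, 0.0, 0.0]
t1 = [1.0, 0.0, 2.0]   # two distinct tails that differ only
t2 = [1.0, 0.0, -3.0]  # along the normal direction
```

Both tails fit the relation perfectly after projection, although their embeddings stay different in the entity space. Under plain TransE, the same two tails could not both satisfy $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$.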
Although TransH enables an entity to have different representations for different relations, it still assumes that entities and relations can be represented in a unified semantic space, which prevents TransH from modeling entities and relations precisely. TransR lin2015learning observes that an entity may exhibit different attributes in distinct relations, and therefore models entities and relations in separate spaces. As a result, although some entities such as Beijing and London are far away from each other in the entity space, they can be similar and close to each other in some specific relation spaces, and vice versa.
As illustrated in Figure 5, for each triple $(h, r, t)$, non-relevant head/tail entities (denoted as colored triangles) are kept away from relevant entities (denoted as colored circles) in the specific relation space by the relation-specific projection, while these entities are not necessarily far away from each other in the entity space.
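For each triple $(h, r, t)$, TransR first projects head and tail entities from the entity space to the $r$-relation space via a projection matrix $\mathbf{M}_r$, denoted as $\mathbf{h}_r$ and $\mathbf{t}_r$, which are defined as:
$$\mathbf{h}_r = \mathbf{M}_r \mathbf{h}, \quad \mathbf{t}_r = \mathbf{M}_r \mathbf{t}.$$
We then enforce $\mathbf{h}_r + \mathbf{r} \approx \mathbf{t}_r$. For each triple $(h, r, t)$, the score function is correspondingly defined as:
$$f_r(h, t) = \| \mathbf{h}_r + \mathbf{r} - \mathbf{t}_r \|_{L_1/L_2}.$$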
Besides, (nguyen-EtAl:2016:N16-1, ) propose an extension of TransR: STransE that represents a relation with two different mapping matrices and a translation vector.
Further, Lin et al. found that a specific relation usually corresponds to head-tail entity pairs with distinct attributes. For example, for the relation “/location/location/contains”, the head-tail entity patterns may be continent-country, country-city, country-university, and so on. If relations are divided into more precise sub-relations, entities can be projected into more accurate sub-relation spaces, which should benefit the representation of KGs.
Therefore, Lin et al. propose CTransR, which clusters all triples of a specific relation $r$ into multiple groups according to the embedding offsets $(\mathbf{h} - \mathbf{t})$. The relation in the triples of the same group is defined as a new sub-relation. CTransR then learns a sub-relation vector $\mathbf{r}_c$ for each cluster and a relation-specific projection matrix $\mathbf{M}_r$. For each triple $(h, r, t)$, the score function of CTransR is finally defined as
$$f_r(h, t) = \| \mathbf{h}_{r,c} + \mathbf{r}_c - \mathbf{t}_{r,c} \|_{L_1/L_2} + \alpha \| \mathbf{r}_c - \mathbf{r} \|_{L_1/L_2},$$
where $\mathbf{h}_{r,c} = \mathbf{M}_r \mathbf{h}$ and $\mathbf{t}_{r,c} = \mathbf{M}_r \mathbf{t}$, and $\alpha$ controls how closely each sub-relation vector $\mathbf{r}_c$ stays to the original relation vector $\mathbf{r}$.
In fact, although TransR improves significantly over TransE and TransH, it still has several limitations. First, it simply shares one relation-specific projection matrix between head and tail entities, ignoring the various types and attributes of head and tail entities. Moreover, compared to TransE and TransH, TransR has many more parameters and higher computational complexity because of its matrix multiplication operation.
To address these issues, (jiknowledge2015, ) propose TransD which sets different mapping matrices for head and tail entities.
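As illustrated in Figure 6, for a triple $(h, r, t)$, TransD further learns two projection matrices $\mathbf{M}_{rh}$ and $\mathbf{M}_{rt}$ to project head and tail entities from the entity space to the relation space respectively, which are defined as follows:
$$\mathbf{M}_{rh} = \mathbf{r}_p \mathbf{h}_p^\top + \mathbf{I}^{m \times n}, \quad \mathbf{M}_{rt} = \mathbf{r}_p \mathbf{t}_p^\top + \mathbf{I}^{m \times n},$$
where $\mathbf{I}^{m \times n}$ indicates the identity matrix, $\mathbf{h}, \mathbf{h}_p, \mathbf{t}, \mathbf{t}_p \in \mathbb{R}^n$, $\mathbf{r}, \mathbf{r}_p \in \mathbb{R}^m$, and the subscript $p$ marks the projection vectors. Here, the mapping matrices $\mathbf{M}_{rh}$ and $\mathbf{M}_{rt}$ are related to both entities and relations, and using two projection vectors instead of matrices solves the issue of the large number of parameters in TransR. Hence, for a triple $(h, r, t)$, the score function of TransD is defined as:
$$f_r(h, t) = \| \mathbf{M}_{rh}\mathbf{h} + \mathbf{r} - \mathbf{M}_{rt}\mathbf{t} \|_{L_1/L_2}.$$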
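The saving from using projection vectors can be checked numerically. Assuming the mapping matrix has the form $\mathbf{r}_p \mathbf{e}_p^\top + \mathbf{I}^{m \times n}$, its product with an entity vector $\mathbf{e}$ equals $(\mathbf{e}_p^\top \mathbf{e})\,\mathbf{r}_p + \mathbf{I}^{m \times n}\mathbf{e}$, so the matrix never has to be formed. The sketch below compares both routes; the function names and toy values are illustrative assumptions.

```python
def transd_project_matrix(e, e_p, r_p):
    """Build M = r_p e_p^T + I (m x n) explicitly and return M e."""
    m, n = len(r_p), len(e)
    M = [[r_p[i] * e_p[j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(m)]
    return [sum(M[i][j] * e[j] for j in range(n)) for i in range(m)]


def transd_project_vector(e, e_p, r_p):
    """Same projection without forming M: (e_p^T e) r_p + I^{m x n} e."""
    m, n = len(r_p), len(e)
    s = sum(a * b for a, b in zip(e_p, e))
    # I^{m x n} e keeps the first min(m, n) components of e and pads with zeros.
    return [s * r_p[i] + (e[i] if i < n else 0.0) for i in range(m)]


e = [1.0, 2.0, 3.0]    # entity vector in R^n, n = 3
e_p = [0.5, 0.0, 1.0]  # entity projection vector in R^n
r_p = [2.0, -1.0]      # relation projection vector in R^m, m = 2
```

Both functions return the same projected vector; the second needs only vector operations, which is the source of the lower cost compared with a full relation-specific matrix.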
Further, (yoon2016translation, ) propose a KRL model based on TransE, TransR, and TransD to preserve the logical properties of relations.
Although existing translation-based models have strong ability to model KGs, they are still far from practicality since entities and relations are heterogeneous and unbalanced, which is a great challenge in KRL.
To address these two issues, TranSparse ji2016knowledge considers heterogeneity and imbalance when modeling entities and relations in KGs. To overcome the heterogeneity, TranSparse(share) replaces the dense matrices in TransR with sparse matrices, whose sparse degrees are determined by the number of entity pairs linked by the corresponding relations. Formally, for each relation $r$, the sparse degree of the projection matrix $\mathbf{M}_r$ is $\theta_r$, which is defined as:
$$\theta_r = 1 - (1 - \theta_{\min}) N_r / N_{r^*},$$
where $\theta_{\min}$ is a hyper-parameter indicating the minimum sparse degree, $N_r$ indicates the number of entity pairs linked by relation $r$, and $r^*$ is the relation which links the most entity pairs. Therefore, the projected entity vectors can be calculated by:
$$\mathbf{h}_p = \mathbf{M}_r(\theta_r)\mathbf{h}, \quad \mathbf{t}_p = \mathbf{M}_r(\theta_r)\mathbf{t}.$$
Besides, TranSparse(separate) uses two different projection matrices $\mathbf{M}_r^h$ and $\mathbf{M}_r^t$ for head and tail entities to deal with the imbalance of relations. The sparse degree is defined as:
$$\theta_r^l = 1 - (1 - \theta_{\min}) N_r^l / N_{r^*}^{l^*}, \quad l \in \{h, t\},$$
where $N_r^l$ denotes the number of head/tail entities linked by relation $r$, and $N_{r^*}^{l^*}$ denotes the maximum one in $N_r^l$.
Hence, the projected vectors of head/tail entities are defined as:
$$\mathbf{h}_p = \mathbf{M}_r^h(\theta_r^h)\mathbf{h}, \quad \mathbf{t}_p = \mathbf{M}_r^t(\theta_r^t)\mathbf{t}.$$
For both TranSparse(share) and TranSparse(separate), the score function for a triple $(h, r, t)$ is defined as:
$$f_r(h, t) = \| \mathbf{h}_p + \mathbf{r} - \mathbf{t}_p \|_{L_1/L_2}.$$
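As a worked example of the sparse degree, the sketch below assumes the form $\theta_r = 1 - (1 - \theta_{\min}) N_r / N_{r^*}$ and treats the sparse degree as the fraction of zero entries in the projection matrix. The relation names and pair counts are made up for illustration.

```python
def sparse_degree(n_r, n_max, theta_min):
    """theta_r = 1 - (1 - theta_min) * N_r / N_{r*}."""
    return 1.0 - (1.0 - theta_min) * n_r / n_max


# Hypothetical numbers of entity pairs linked by each relation.
pair_counts = {"contains": 1000, "capital_of": 100, "twin_of": 10}
n_max = max(pair_counts.values())

theta = {r: sparse_degree(n, n_max, theta_min=0.2)
         for r, n in pair_counts.items()}
# contains -> about 0.2, capital_of -> about 0.92, twin_of -> about 0.992
```

The relation linking the most entity pairs receives the minimum sparse degree (the densest matrix, hence the most free parameters), while rare relations receive almost entirely sparse matrices.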
(xiao2015transa, ) argue that TransE and its extensions have two major problems: (1) TransE and its extensions only use distance in their loss metric, and hence lack flexibility. (2) TransE and its extensions treat each dimension of the entity and relation vectors identically because of the oversimplified loss metric.
To address these two issues, TransA replaces the inflexible $L_1$ or $L_2$ distance with an adaptive Mahalanobis distance of the absolute loss. The score function of TransA is defined as follows:
$$f_r(h, t) = (|\mathbf{h} + \mathbf{r} - \mathbf{t}|)^\top \mathbf{W}_r (|\mathbf{h} + \mathbf{r} - \mathbf{t}|),$$
where $|\cdot|$ takes the element-wise absolute value and $\mathbf{W}_r$ is a relation-specific symmetric non-negative weight matrix that corresponds to the adaptive metric.
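The effect of the adaptive metric can be sketched as follows, assuming the score $(|\mathbf{h} + \mathbf{r} - \mathbf{t}|)^\top \mathbf{W}_r (|\mathbf{h} + \mathbf{r} - \mathbf{t}|)$. The function name and the two weight matrices are illustrative assumptions.

```python
def transa_score(h, r, t, W):
    """(|h + r - t|)^T W (|h + r - t|) with a relation-specific weight matrix W."""
    d = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    n = len(d)
    return sum(d[i] * W[i][j] * d[j] for i in range(n) for j in range(n))


h = [0.0, 0.0]
r = [1.0, 1.0]
t = [3.0, 1.0]  # the error |h + r - t| is [2, 0], entirely on the first axis

identity = [[1.0, 0.0], [0.0, 1.0]]      # reduces to the squared L2 distance
down_weight = [[0.25, 0.0], [0.0, 1.0]]  # the first dimension matters less
```

With the identity matrix the score is the squared Euclidean distance (4.0 here). Down-weighting the first dimension lowers the score of the same triple to 1.0, which is how a learned $\mathbf{W}_r$ can treat dimensions differently per relation.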
As illustrated in Fig. 7, and are correct tail entities while are not. Fig. 7(a) shows that the incorrect entity is matched with the L2-norm distance. And Fig. 7(b) shows that by weighting embedding dimensions, the embeddings are refined because the correct entities have a smaller loss in x-axis or y-axis direction.
Similar to TransA, TransM fan2014transition also proposes a new loss function in KRL, which assigns each fact triple $(h, r, t)$ a relation-specific weight $w_r$. The key idea of TransM is that different relations may have different importance when learning the representations of KGs. The score function of TransM is defined as:
$$f_r(h, t) = w_r \| \mathbf{h} + \mathbf{r} - \mathbf{t} \|_{L_1/L_2}.$$
TransM alleviates TransE’s inability to model complex relations by assigning lower weights to those relations.
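Besides, TransF feng2016knowledge employs the dot product instead of the $L_1$ or $L_2$ distance in TransE to measure the probability of a fact triple $(h, r, t)$, and the score function of TransF is defined as:
$$f_r(h, t) = (\mathbf{h} + \mathbf{r})^\top \mathbf{t} + (\mathbf{t} - \mathbf{r})^\top \mathbf{h}.$$
That is, TransF wants the vector of the head entity $\mathbf{h}$ to have the same direction as $\mathbf{t} - \mathbf{r}$, and the vector of the tail entity $\mathbf{t}$ to have the same direction as $\mathbf{h} + \mathbf{r}$.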
Similar to CTransR, TransG xiao2015transg finds that existing translation-based models such as TransE cannot deal with the situation where a relation has multiple meanings when involved with different entity pairs. The reason is that these models maintain only a single vector for each relation, which may be insufficient to model distinct relation meanings. Fig. 8(a) shows that the valid triples cannot be distinguished from the incorrect ones by existing translation-based models, since all semantic meanings of a relation are regarded as the same. Fig. 8(b) shows that by considering the multiple semantic meanings of relations, TransG can discriminate the valid triples from the invalid ones.
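TransG proposes to use a Bayesian non-parametric infinite mixture embedding model to take the multiple semantic meanings of relations into consideration in KRL. For each entity, TransG assumes that the entity embedding vector follows a normal distribution, i.e.,
$$\mathbf{h} \sim \mathcal{N}(\boldsymbol{\mu}_h, \sigma_h^2 \mathbf{I}), \quad \mathbf{t} \sim \mathcal{N}(\boldsymbol{\mu}_t, \sigma_t^2 \mathbf{I}),$$
where $\mathbf{I}$ indicates the identity matrix, $\boldsymbol{\mu}_h$ and $\boldsymbol{\mu}_t$ are the means of the head and tail entity vectors respectively, and $\sigma_h^2$ and $\sigma_t^2$ indicate the variances of the head and tail entity vectors’ distributions respectively. Hence, the relation vector is then defined as
$$\mathbf{r}_i = \mathbf{t} - \mathbf{h} \sim \mathcal{N}(\boldsymbol{\mu}_t - \boldsymbol{\mu}_h, (\sigma_h^2 + \sigma_t^2)\mathbf{I}),$$
where $\mathbf{r}_i$ indicates the relation embedding vector for the $i$-th semantic meaning of relation $r$.
Then for a triple $(h, r, t)$, the score function of TransG is defined as:
$$f_r(h, t) = \sum_i \pi_{r,i} \exp\left(-\frac{\| \boldsymbol{\mu}_h + \boldsymbol{\mu}_{r,i} - \boldsymbol{\mu}_t \|_2^2}{\sigma_h^2 + \sigma_t^2}\right),$$
where $\boldsymbol{\mu}_{r,i}$ is the mean of $\mathbf{r}_i$ and $\pi_{r,i}$ is the weight factor corresponding to the $i$-th semantic meaning of relation $r$.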
He et al. he2015learning notice that the semantic meanings of entities and relations in KGs are often uncertain. However, previous translation-based models do not consider this phenomenon when distinguishing a valid triple from its corresponding invalid triples. In order to explicitly consider the uncertainty in KGs, KG2E represents each entity or relation through a Gaussian distribution instead of a single vector. For an entity or a relation, the mean of its embedding denotes the center position of its semantic meaning, and the covariance matrix describes its uncertainty.
As illustrated in Figure 9, each circle denotes an entity or a relation in KG, while its size denotes the corresponding uncertainty. Here, we can find that the uncertainty of relation Nationality is higher than other relations.
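KG2E uses $\mathbf{h} - \mathbf{t}$ to express the relation between head entity $h$ and tail entity $t$, which corresponds to the probability distribution:
$$P_e \sim \mathcal{N}(\boldsymbol{\mu}_h - \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_h + \boldsymbol{\Sigma}_t).$$
The relation $r$ can also be expressed by a probability distribution $P_r \sim \mathcal{N}(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$. Hence, we can measure the plausibility of a triple $(h, r, t)$ by measuring the similarity of the two distributions $P_e$ and $P_r$. In KG2E, the similarity between $P_e$ and $P_r$ is defined with two measures: KL-divergence and expected likelihood.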
(1) Asymmetric similarity: KL-divergence based score function (KG2E_KL) is defined as
(2) Symmetric similarity: expected likelihood based score function (KG2E_EL) is defined as
Note that, to avoid overfitting, KG2E needs regularization during learning. It uses the following hard constraints:
(xiao2016knowledge, ) find that existing KRL models cannot perform precise knowledge graph completion on large-scale KGs because these models all employ an overly strict geometric form and an ill-posed algebraic system. To address these issues, they propose a novel model, ManifoldE, which adopts a manifold-based embedding principle instead of the traditional $L_1$ or $L_2$ distance to model fact triples. Hence, for a given fact triple $(h, r, t)$, the score function of ManifoldE is calculated by measuring the distance to the manifold:
$$f_r(h, t) = \| \mathcal{M}(h, r, t) - D_r^2 \|^2,$$
where $D_r$ is a relation-specific parameter and $\mathcal{M}$ is the manifold function, which can be defined in two different ways:
Sphere assumes that for a fact triple $(h, r, t)$, the tail entity lies on a sphere with radius $D_r$ centered at $\mathbf{h} + \mathbf{r}$. Hence, $\mathcal{M}$ is defined as:
$$\mathcal{M}(h, r, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \|_2^2.$$
Hyperplane proposes to embed head and tail entities into two separate hyperplanes, which intersect with each other when they are not parallel. Hence, $\mathcal{M}$ is defined as:
$$\mathcal{M}(h, r, t) = (\mathbf{h} + \mathbf{r}_{h})^\top (\mathbf{t} + \mathbf{r}_{t}),$$
where $\mathbf{r}_{h}$ and $\mathbf{r}_{t}$ are relation-specific vectors for head and tail entities respectively.
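The sphere setting can be illustrated with a short sketch. It assumes the manifold function $\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2^2$ and the score $(\mathcal{M}(h, r, t) - D_r^2)^2$; the function names and toy points are illustrative assumptions.

```python
def manifold_sphere(h, r, t):
    """Sphere manifold function: squared L2 distance between h + r and t."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def manifolde_score(h, r, t, radius):
    """Squared gap between the manifold function and the squared radius D_r^2."""
    return (manifold_sphere(h, r, t) - radius ** 2) ** 2


h = [0.0, 0.0]
r = [0.0, 0.0]
radius = 1.0

on_sphere_a = [1.0, 0.0]  # distance 1 from h + r
on_sphere_b = [0.0, 1.0]  # distance 1 from h + r
center = [0.0, 0.0]       # exactly h + r, which TransE would prefer
```

Every point on the sphere is a perfect tail, so many distinct tails can fit the same head and relation. The point `h + r` itself is penalized here, in contrast to a point-wise model such as TransE.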
Recently, extensions of TransE such as TransH, TransR, TransD, TranSparse, TransA, TransG and KG2E have focused on the complex relation modeling issue. Experimental results on real-world datasets show that these methods improve over TransE, which confirms the benefit of considering the different characteristics of complex relations in KGs.
3.2 Relational Path Modeling
Although TransE and its extensions have achieved great success in modeling entities and relations in KGs, they still face a problem caused by considering only direct relations between entities. There are also relational paths between entities, which indicate complicated semantic relatedness between them. For example, a relational path from $h$ to $t$ may indicate that there is a relation GrandMother between $h$ and $t$, i.e., ($h$, GrandMother, $t$). In fact, relational paths have been taken into consideration in knowledge inference on large-scale KGs. (lao2010relational, ; lao2012reading, ; lao2011random, ; gardner2013improving, ) propose the Path Ranking Algorithm (PRA) and apply it to finding unknown relational facts in large-scale KGs. PRA uses the relational paths between entities to predict their relations and achieves great success, which indicates that relational paths between entities are informative for inferring unknown facts.
Inspired by PRA, (lin2015modeling, ) propose Path-based TransE (PTransE), which extends TransE to model relational paths in KGs. Since there are a large number of relational paths in KGs and they usually contain noise, PTransE utilizes a Path-Constraint Resource Allocation (PCRA) algorithm to measure whether a relational path is reliable. Further, PTransE proposes three typical operations, namely addition, multiplication and recurrent neural network (RNN), to compose relation embeddings into a relational path embedding. Formally, for a relational path $p = (r_1, \ldots, r_l)$, the addition operation is formalized as:
$$\mathbf{p} = \mathbf{r}_1 + \mathbf{r}_2 + \cdots + \mathbf{r}_l,$$
the multiplication operation is formalized as:
$$\mathbf{p} = \mathbf{r}_1 \cdot \mathbf{r}_2 \cdot \ldots \cdot \mathbf{r}_l,$$
and the composition operation of the RNN is defined using a recurrent matrix $\mathbf{W}$:
$$\mathbf{c}_i = f(\mathbf{W}[\mathbf{c}_{i-1}; \mathbf{r}_i]),$$
where the relational path embedding is defined as the final state $\mathbf{c}_l$ of the RNN.
Finally, the score function of PTransE is defined as:
$$f_r(h, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \| + \frac{1}{Z} \sum_{p \in P(h, t)} R(p|h, t) \, \| \mathbf{p} - \mathbf{r} \|,$$
where $P(h, t)$ indicates the set of relational paths between $h$ and $t$ found by PCRA, $R(p|h, t)$ indicates the reliability of the relational path $p$, and $Z$ is a normalization constant.
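The addition and multiplication compositions, and the reliability-weighted path term, can be sketched as below. The sketch assumes that multiplication is element-wise, that a path is compared with the direct relation through $\|\mathbf{p} - \mathbf{r}\|_1$, and that the normalization constant is the sum of the path reliabilities; the reliabilities are given by hand here rather than computed by PCRA. All names are illustrative.

```python
def compose_add(path):
    """Path embedding as the sum of its relation embeddings."""
    return [sum(components) for components in zip(*path)]


def compose_mul(path):
    """Path embedding as the element-wise product (an assumed reading)."""
    out = [1.0] * len(path[0])
    for rel in path:
        out = [o * x for o, x in zip(out, rel)]
    return out


def path_term(paths_with_reliability, r, compose=compose_add):
    """(1/Z) * sum_p R(p) * ||p - r||_1, with Z the sum of the reliabilities."""
    z = sum(rel for _, rel in paths_with_reliability)
    total = 0.0
    for path, reliability in paths_with_reliability:
        p = compose(path)
        total += reliability * sum(abs(a - b) for a, b in zip(p, r))
    return total / z


r = [1.0, 1.0]
paths = [
    ([[1.0, 0.0], [0.0, 1.0]], 0.75),  # composes exactly to r under addition
    ([[2.0, 2.0]], 0.25),              # a less reliable, poorly matching path
]
```

Here the reliable two-step path contributes no error and the unreliable one is down-weighted, giving a path term of 0.5. Adding the usual TransE distance for the direct triple would complete the score.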
Almost at the same time, other researchers also successfully considered relational paths in KRL in a similar way garciacomposing ; neelakantan-roth-mccallum:2015:ACL-IJCNLP ; luo2015context . Algorithms utilizing relational paths often suffer from the expensive computation cost of enumerating paths between entities. Both lin2015modeling and garciacomposing address this issue by sampling informative paths. toutanova-EtAl:2016:P16-1 propose a dynamic programming algorithm to make use of all relation paths efficiently, and das-EtAl:2017:EACLlong1 propose an attention mechanism to incorporate multiple relational paths. Further, (feng-EtAl:2016:COLING1, ) propose to leverage the graph’s structure information in KRL. Besides, relational path learning has also been used in relation extraction zeng-EtAl:2017:EMNLP2017 and KG-based QA gu2015traversing .
The success of PTransE and other related models shows that taking relational paths into account can significantly improve the discriminative power of relational learning and the system performance on tasks such as knowledge graph completion. However, the existing models are still preliminary attempts at modeling relational paths. Many further investigations into the reliability measure and the semantic composition of relational paths remain to be done.
3.3 Multi-source Information Learning
Most KRL methods described above concentrate only on the fact triples in the KG and disregard rich multi-source information such as textual information, type information and visual information. This cross-modal information can provide additional knowledge located in plain texts, type structures or images of entities, and is important when learning knowledge graph representations.
3.3.1 Textual Information
Textual information is one of the most significant and widespread kinds of information we send and receive every day, so it is intuitive to incorporate it into KRL. NTN socher2013reasoning attempts to capture the potential textual relationships between entities by representing an entity using the word embeddings of its name. (wang2014knowledge_2, ; zhong2015aligning, ) propose jointly learning entity and word embeddings by projecting them into a unified semantic space, which aligns the entity and word embeddings using entity names, descriptions and Wikipedia anchors. (zhang2015joint, ; xiao2017ssp, ) also propose other joint frameworks for learning text representations and knowledge graph representations. Further, (xu2016knowledge, ) recently propose to learn the models of relation extraction and knowledge graph representation jointly. These methods take textual information as a supplement for KRL.
Another way of utilizing textual information is to construct knowledge graph representations directly from entity descriptions. Entity descriptions are often short paragraphs that provide the definitions or attributes of entities; they are maintained by some KGs or can be extracted from large datasets like Wikipedia. (xie2016representation, ) propose DKRL, which learns, in a unified semantic space, both entity representations based on their descriptions with CBOW or CNN encoders and entity representations from the fact triples of the KG. The score function of DKRL is defined as:
where $h_d$ and $t_d$ are the text-based representations of $h$ and $t$, which are obtained from the entity descriptions. Note that a description-based representation can be built for an entity even if the entity is not in the training set. Therefore, the DKRL model is capable of handling the zero-shot scenario. Recently, (fan2017distributed, ) also propose a logistic approach which learns entity representations both from their descriptions and from the fact triples of the KG, and achieves better performance.
To model the complex relations in KGs, (Wang:2016:TRL:3060621.3060801, ) propose TEKE, which, when modeling a fact triple, enhances the representations of the head/tail entities and the relation with the representations of neighbor entities with similar text. TEKE first calculates a co-occurrence matrix in which each element $y_{ij}$ indicates the co-occurrence frequency between the texts of $e_i$ and $e_j$. TEKE then defines $n(e_i) = \{e_j \mid y_{ij} > \theta\}$ ($\theta$ is a hyper-parameter) as the set of neighbor entities of entity $e_i$, and defines $n(h, t) = n(h) \cap n(t)$. Hence, the representations of neighbor entities are defined as:
And the score function of TEKE is defined as:
where $\mathbf{A}$ and $\mathbf{B}$ are mapping matrices.
3.3.2 Type Information
Besides textual information, entity type information, which can be viewed as a kind of label for entities, is also useful for KRL. Some KGs such as Freebase and DBpedia possess their own entity types. An entity can belong to multiple types, and these entity types are usually arranged in hierarchical structures. For example, William Shakespeare has both the hierarchical types book/author and music/artist in Freebase.
(krompassISWC2015, ; chang-EtAl:2014:EMNLP2014, ) take type information as type constraints in KRL, aiming to distinguish entities which belong to the same types. Their methods improve the performance of both RESCAL chang-EtAl:2014:EMNLP2014 and TransE krompassISWC2015 . Instead of merely considering type information as type constraints, (guo2015semantically, ) propose semantically smooth embedding (SSE), which incorporates type information into KRL by forcing entities which belong to the same type to be close to each other in the semantic space. SSE employs two kinds of learning algorithms. The first is Laplacian eigenmaps belkin2002laplacian :
$$R_1 = \frac{1}{2} \sum_{i} \sum_{j} \|\mathbf{e}_i - \mathbf{e}_j\|_2^2 \, w_{ij},$$
where $w_{ij} = 1$ if $e_i$ and $e_j$ have the same type, and $w_{ij} = 0$ otherwise. The second is locally linear embedding roweis2000nonlinear :
$$R_2 = \sum_{i} \Big\| \mathbf{e}_i - \sum_{e_j \in N(e_i)} w_{ij} \, \mathbf{e}_j \Big\|_2^2,$$
where $N(e_i)$ indicates the set of neighbors of entity $e_i$. The regularizer $R_1$ or $R_2$ is then incorporated into the overall loss function when learning the knowledge graph representation. However, SSE still cannot utilize the hierarchy of the entity types.
To address this issue, (hu2015entity, ) learn entity representations considering the whole entity hierarchy of Wikipedia. Further, TKRL xie2016representation_t utilizes hierarchical type structures to help learn the embeddings of entities and relations in KGs, especially for entities and relations with few fact triples. Inspired by the idea of multiple entity representations proposed in TransR, TKRL constructs projection matrices for each hierarchical type, and the score function of TKRL is defined as follows:
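where $\mathbf{M}_{rh}$ and $\mathbf{M}_{rt}$ are two projection matrices for $h$ and $t$ depending on their corresponding hierarchical types in this triple, which are constructed by hierarchical type encoders. As the head entities of a relation may have several types, $\mathbf{M}_{rh}$ is defined as a weighted sum of the matrices of all involved hierarchical types (and likewise for $\mathbf{M}_{rt}$):
$$\mathbf{M}_{rh} = \frac{\sum_{i=1}^{n} \alpha_i \mathbf{M}_{c_i}}{\sum_{i=1}^{n} \alpha_i},$$
where $\alpha_i = 1$ if the type $c_i$ is in the hierarchical type set of head entities of relation $r$, and $\alpha_i = 0$ otherwise. Further, the hierarchical type encoders regard sub-types as projection matrices, and utilize multiplication or weighted summation to construct the projection matrix for each hierarchical type, i.e.,
$$\mathbf{M}_c = \prod_{i=1}^{m} \mathbf{M}_{c^{(i)}} \quad \text{or} \quad \mathbf{M}_c = \sum_{i=1}^{m} \beta_i \mathbf{M}_{c^{(i)}},$$
where $c^{(i)}$ is the $i$-th sub-type of $c$, $\mathbf{M}_{c^{(i)}}$ is the corresponding projection matrix, and $\beta_i$ is the weight of the $i$-th sub-type.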
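The two hierarchical type encoders (multiplication and weighted summation over sub-type matrices) and the weighted sum over the types involved in a relation can be sketched as below. This is only an illustrative sketch: the function names and the sub-type weights are hypothetical, and normalizing the head projection by the number of involved types is an assumption.

```python
import numpy as np

def recursive_encoder(sub_type_matrices):
    """Multiplication encoder: product of the sub-type projection matrices."""
    result = np.eye(sub_type_matrices[0].shape[0])
    for matrix in sub_type_matrices:
        result = result @ matrix
    return result

def weighted_encoder(sub_type_matrices, weights):
    """Weighted-summation encoder over the sub-type projection matrices."""
    return sum(w * m for w, m in zip(weights, sub_type_matrices))

def head_projection(type_matrices, in_type_set):
    """Average the matrices of the types flagged as involved (0/1 indicators).

    Assumption: the weighted sum is normalized by the sum of the indicators.
    """
    alphas = np.asarray(in_type_set, dtype=float)
    stacked = np.stack(type_matrices)
    return np.tensordot(alphas, stacked, axes=1) / alphas.sum()

identity = np.eye(2)
m_product = recursive_encoder([2 * identity, 3 * identity])
m_weighted = weighted_encoder([identity, 2 * identity], [0.5, 0.5])
m_head = head_projection([identity, 3 * identity, 5 * identity], [1, 1, 0])
```

A type that is not in the relation's head type set (indicator 0) contributes nothing, so only the first two matrices affect `m_head` here.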
3.3.3 Visual Information
Besides textual and type information, visual information such as images, which can provide an intuitive view of the corresponding entities, is also useful for KRL. The reason is that visual information may give significant hints about some inherent attributes of entities from certain aspects.
(xie2016image, ) propose a novel KRL approach, Image-embodied Knowledge Representation Learning (IKRL), to take visual information into consideration when learning representations of KGs. Specifically, IKRL first constructs the image representations for all entity images with neural networks, and then projects these image representations from the image semantic space to the entity semantic space via a transformation matrix. Since most entities may have multiple images of different quality, IKRL selects the more informative and discriminative images via an attention mechanism. Finally, IKRL defines the score function following the framework of DKRL:
where $h_I$ and $t_I$ are the image-based representations of $h$ and $t$.
The evaluation results of IKRL not only confirm the significance of visual information in understanding entities but also show the possibility of a joint heterogeneous semantic space. Moreover, the authors also find some interesting semantic regularities in the visual space, similar to those found in the word space.
3.3.4 Logic Rules
Most existing KRL methods only consider the information of each relational fact separately, ignoring the interactions and correlations between different triples. Logic rules, which usually summarize experience derived from human prior knowledge, can help with knowledge reasoning. For example, if we know the triple fact (Obama, president_of, United States), we can infer with high confidence that (Obama, nationality, United States), since we know the logic rule president_of $\Rightarrow$ nationality.
Recently, some works have attempted to introduce logic rules into knowledge acquisition and inference. ALEPH muggleton1995inverse , WARMR dehaspe1999discovery , and AMIE galarraga2013amie extract logic rules from KGs via inductive logic programming and rule mining. (pujara2013knowledge, ; beltagy2014efficient, ; wang2015knowledge, ) utilize Markov logic networks to take logic rules into consideration when extracting knowledge. Besides, (rocktaschel2015injecting, ) attempt to incorporate first-order logic domain knowledge into a matrix factorization model to extract unknown relational facts from plain text. (rocktaschel2014low, ; wang2016learning, ) further learn low-dimensional embeddings of logic rules.
Recently, KALE guo2016jointly incorporates logic rules into KRL by modeling triples and rules jointly. For triple modeling, KALE follows the translation assumption with a minor alteration, and the score function of KALE is defined as follows:
where the score takes a value in $[0, 1]$ for the convenience of joint learning.
To model the newly added rules, KALE employs the t-norm fuzzy logics proposed in hajek1998metamathematics . Specifically, KALE uses two typical types of logic rules. The first one is $\forall h, t: (h, r_1, t) \Rightarrow (h, r_2, t)$, which has the same form as the example above. KALE represents the scoring function of this logic rule as follows:
The second type of logic rule is $\forall h, e, t: (h, r_1, e) \wedge (e, r_2, t) \Rightarrow (h, r_3, t)$ (e.g., given (Barbara Pierce Bush, father, George W. Bush) and (George W. Bush, father, George H. W. Bush), we can infer that (Barbara Pierce Bush, grandfather, George H. W. Bush)). KALE defines the second scoring function as:
The joint training strategy takes all positive formulae including fact triples as well as logic rules into consideration. In fact, the path-based TransE lin2015modeling stated above also implicitly considers the latent logic rules between different relations via relational paths.
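To make the rule scoring concrete, the sketch below evaluates the two rule types with product t-norm fuzzy logic, where a conjunction is the product of truth values and an implication $a \Rightarrow b$ has truth value $a \cdot b - a + 1$. The triple truth value used here, $1 - \|h + r - t\|_1 / (3\sqrt{d})$, is our reading of how a translation-based score can be mapped into $[0, 1]$ and should be treated as an assumption, as should all function names.

```python
import numpy as np

def triple_truth(h, r, t):
    """Assumed truth value of a triple in [0, 1] for norm-bounded embeddings."""
    d = h.shape[0]
    return 1.0 - np.abs(h + r - t).sum() / (3.0 * np.sqrt(d))

def conj(a, b):
    """Product t-norm conjunction: a AND b."""
    return a * b

def implies(a, b):
    """Truth value of a => b under the product t-norm: a*b - a + 1."""
    return a * b - a + 1.0

def rule_one(premise, conclusion):
    """(h, r1, t) => (h, r2, t)."""
    return implies(premise, conclusion)

def rule_two(premise_a, premise_b, conclusion):
    """(h, r1, e) AND (e, r2, t) => (h, r3, t)."""
    return implies(conj(premise_a, premise_b), conclusion)

fully_true = triple_truth(np.zeros(4), np.zeros(4), np.zeros(4))
violated = rule_one(1.0, 0.0)      # true premise, false conclusion
vacuous = rule_one(0.0, 0.3)       # false premise: rule holds trivially
composed = rule_two(0.5, 0.5, 1.0)
```

A rule is penalized most when its premise is fully true and its conclusion fully false, which is what pushes the embeddings of the conclusion triple toward the premise during joint training.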
It is natural to learn about the real world from all kinds of multi-source information. Multi-source information such as plain texts, hierarchical types, or even images and videos is of great importance when modeling the complicated world and constructing cross-modal representations. The success of these preliminary attempts demonstrates the significance and feasibility of using multi-source information, while existing methods still leave room for improvement. Moreover, there are other types of information which could also be encoded into KRL.
4 Training Strategies
In this section, we will introduce the training strategies for KRL models. There are two typical training strategies: the margin-based approach and the logistic-based approach.
4.1 Margin-based Approach
The margin-based approach defines the following loss function as training objective:
where $\theta$ indicates all parameters of the KRL model, $\max(x, y)$ returns the higher value between $x$ and $y$, $\gamma$ is the margin and $T'$ is the set of invalid fact triples.
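As a minimal sketch of a margin-based objective, the snippet below uses a TransE-style L1 energy and a hinge over pairs of valid and invalid triples. It assumes an energy convention in which lower values mean more plausible triples, which may differ from the sign convention of the score functions in this paper; the function names are hypothetical.

```python
import numpy as np

def transe_energy(h, r, t):
    """TransE-style L1 energy ||h + r - t||_1 (assumed: lower is better)."""
    return np.abs(h + r - t).sum(axis=-1)

def margin_loss(valid_energy, invalid_energy, margin=1.0):
    """Sum of max(0, margin + E(valid) - E(invalid)) over paired triples."""
    return float(np.maximum(0.0, margin + valid_energy - invalid_energy).sum())

h = np.array([[0.0, 0.0]])
r = np.array([[1.0, 1.0]])
t_valid = np.array([[1.0, 1.0]])
t_invalid = np.array([[1.0, 0.5]])

loss = margin_loss(transe_energy(h, r, t_valid), transe_energy(h, r, t_invalid))
```

Here the valid triple has energy 0 and the invalid one 0.5, so with margin 1 the pair still violates the margin by 0.5; once the gap exceeds the margin the pair contributes nothing to the loss.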
Generating Invalid Triple Set. Existing KGs contain only valid fact triples, so we need to generate invalid triples for training with the margin-based approach. Researchers have proposed generating invalid triples by randomly replacing entities or relations in valid fact triples. Hence, the invalid triple set is defined as follows:
$$T' = \{(h', r, t)\} \cup \{(h, r', t)\} \cup \{(h, r, t')\}, \quad (h, r, t) \in T.$$
However, generating the invalid triple set by uniform replacement may lead to some errors. For example, the triple (Bill Gates, nationality, United States) may generate the false invalid triple (Steve Jobs, nationality, United States), although Steve Jobs is in fact American. To alleviate this issue, (wang2014knowledge, ) propose to assign different weights to head/tail entity replacement according to the relation characteristics when generating invalid triples. For example, for a 1-to-n relation, they tend to replace the “one” side instead of the “n” side, so the probability of generating false invalid fact triples is reduced.
Besides, the uniform generating approach may not be able to generate representative negative training triples. For example, the triple (Bill Gates, nationality, United States) may generate the invalid triple (Bill Gates, nationality, Steve Jobs). Steve Jobs is not a nation, so such a negative fact triple cannot fully train the KRL models. Therefore, (socher2013reasoning, ) propose to generate negative triples by replacing entities with other entities of the same type.
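The relation-dependent replacement described above can be sketched as follows. We assume the commonly used weighting in which the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail for the relation; the helper names are hypothetical, and candidates that are already known valid triples are simply resampled.

```python
import random

def head_replace_prob(triples, relation):
    """Probability of corrupting the head: tph / (tph + hpt).

    Assumes `relation` occurs in at least one triple.
    """
    tails_per_head, heads_per_tail = {}, {}
    for h, r, t in triples:
        if r != relation:
            continue
        tails_per_head.setdefault(h, set()).add(t)
        heads_per_tail.setdefault(t, set()).add(h)
    tph = sum(len(v) for v in tails_per_head.values()) / len(tails_per_head)
    hpt = sum(len(v) for v in heads_per_tail.values()) / len(heads_per_tail)
    return tph / (tph + hpt)

def corrupt(triple, entities, known, head_prob, rng):
    """Replace the head or the tail, resampling until the triple is unknown.

    Note: loops forever if every possible corruption is a known triple.
    """
    h, r, t = triple
    while True:
        entity = rng.choice(entities)
        candidate = (entity, r, t) if rng.random() < head_prob else (h, r, entity)
        if candidate not in known:
            return candidate

triples = [("USA", "contains", "NY"), ("USA", "contains", "LA"), ("USA", "contains", "SF")]
prob = head_replace_prob(triples, "contains")  # 1-to-n relation: favor the head
negative = corrupt(triples[0], ["USA", "NY", "LA", "SF", "France"],
                   set(triples), prob, random.Random(0))
```

For this 1-to-n relation tph is 3 and hpt is 1, so the head (the “one” side) is replaced three times out of four, which lowers the chance of producing a false invalid triple.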
4.2 Logistic-based Approach
The logistic-based approach defines the following loss function as training objective:
where $E(h, r, t)$ indicates the energy of the fact triple $(h, r, t)$, which is further defined as:
where $b$ is a bias constant.
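A logistic objective of this kind is often written as a sum of $\log(1 + \exp(-y \cdot g))$ terms, with $y = +1$ for valid and $y = -1$ for invalid triples. The sketch below follows that common form; the particular biased score $g = b - \|h + r - t\|_1$, its higher-is-better sign convention, and the function names are assumptions made for illustration and are not taken from this paper.

```python
import numpy as np

def biased_score(h, r, t, b=0.0):
    """Assumed score g = b - ||h + r - t||_1 (higher means more plausible)."""
    return b - np.abs(h + r - t).sum(axis=-1)

def logistic_loss(scores, labels):
    """Sum of log(1 + exp(-y * g)) with labels y in {+1, -1}."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way.
    return float(np.logaddexp(0.0, -labels * scores).sum())

zero = np.zeros(3)
g = biased_score(zero, zero, zero, b=7.0)
loss_at_zero = logistic_loss([0.0], [1])
```

Unlike the margin-based objective, every triple contributes a non-zero gradient here, and the bias shifts the point at which a triple counts as more likely valid than invalid.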
5 Applications of Knowledge Graph Representation
Recent years have witnessed great success in knowledge-driven applications such as information retrieval and question answering. These applications are expected to understand user requirements accurately and deeply, and then respond appropriately. Hence, they cannot work well without certain external knowledge.
However, there are still gaps between the knowledge stored in KGs and the knowledge used in knowledge-driven applications. To address this issue, researchers employ KRL to bridge the gap between them. Knowledge graph representations are capable of alleviating data sparsity and modeling the relatedness between entities and relations. Moreover, they are convenient to include in deep learning methods and by nature have the potential to be combined with heterogeneous information.
In this section, we will introduce typical applications of KRL including three knowledge acquisition tasks and other tasks.
5.1 Knowledge Graph Completion
Knowledge graph completion aims to predict the missing entities or relations for given incomplete fact triples. In this task, to evaluate KRL approaches more effectively, we do not give only a single best prediction, but a detailed ranking list of all the entities or relations in the KG.
In this paper, we select three typical KGs, WordNet, Freebase and Wikidata, to evaluate the knowledge graph representation models. For WordNet, we employ the widely used dataset WN18 from bordes2014semantic . For Freebase, we also select a widely used dataset, FB15K, from bordes2014semantic .
For FB15K, we find that there exists some direct relatedness between the fact triples in its training set and those in its testing set, which prevents an exact evaluation of the various KRL approaches. The reason is that some relations, such as contains, may have their reverse relation, contained by, in the testing set. Therefore, we also sample a dataset from Wikidata, named WD50k, to further evaluate the performance of these KRL models. We list the statistics of these datasets in Table 1.
5.1.2 Evaluation Results
As set up in bordes2013translating , we adopt the following evaluation metrics: (1) Mean Rank, which indicates the mean rank of all correct predictions; and (2) Hits@10, which is the proportion of correct predictions ranked in the top 10. We also use two settings, “Raw” and “Filter”, where the “Filter” setting filters out the other correct entities when measuring the evaluation metrics.
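The two metrics and the difference between the “Raw” and “Filter” settings can be illustrated with the short sketch below. It assumes that a higher score means a more plausible candidate and that candidates are identified by their index; the function names are hypothetical.

```python
import numpy as np

def rank_of(scores, target, filtered_out=()):
    """Rank (1 = best) of `target`; `filtered_out` holds other correct answers.

    Passing no `filtered_out` gives the "Raw" rank; passing the indices of the
    other known correct entities gives the "Filter" rank.
    """
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(len(scores), dtype=bool)
    for index in filtered_out:
        keep[index] = False
    keep[target] = True
    return int((scores[keep] > scores[target]).sum()) + 1

def mean_rank(ranks):
    """Mean rank of the correct predictions (lower is better)."""
    return float(np.mean(ranks))

def hits_at(ranks, k=10):
    """Proportion of correct predictions ranked in the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

scores = [0.9, 0.8, 0.7, 0.1]                   # candidate 0 is another correct entity
raw = rank_of(scores, target=2)
filtered = rank_of(scores, target=2, filtered_out=[0])
```

In this toy case the target is ranked third in the “Raw” setting but second in the “Filter” setting, because the higher-scoring candidate 0 is itself a correct answer and should not count against the model.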
|Metric||Mean Rank||Hits@10 (%)||Mean Rank||Hits@10 (%)||Mean Rank||Hits@10 (%)|
In this section, we discuss the performance in detail to gain more insight into what really works for KRL. Evaluation results on WN18, FB15K and WD50k are shown in Table 2. From the table, we can see that:
(1) All the models with complex relation modeling, including TransH, TransR, TransD, TranSparse and KG2E, significantly outperform TransE in Hits@10 and Mean Rank on both datasets. The reason is that TransE cannot deal with the complex relations in KGs, whereas these models attempt to alleviate the issue.
(2) By taking relational paths into consideration, PTransE achieves the best performance among all models on FB15K. It indicates that there exist complex relation inferences in KGs which can benefit KRL.
(3) On WN18, we find that for all the models, as the dimension increases, the performance in Hits@10 improves while the performance in Mean Rank degrades. The reason is perhaps that increasing the dimension d improves the discrimination of entities and relations, especially for entities with a large number of fact triples. However, for entities with only a few fact triples, the increase in dimension may lead to insufficient learning, which may hurt the system performance.
Translation models, HolE and ComplEx have achieved promising results in the task of knowledge graph completion. To conduct an in-depth analysis of these models, we select and re-implement eight typical models including TransE, TransH, TransR, TransD, TranSparse, PTransE, HolE and ComplEx. In this section, we compare the performance of the selected models in different mapping properties, dimensions, and margins.
|Tasks||Predicting Head (Hits@10)||Predicting Tail (Hits@10)|
We categorize the relations according to their characteristics into four classes: 1-to-1, 1-to-n, n-to-1 and n-to-n. In Table 3, we show separate evaluation results for these four types of relations on FB15K. We can observe that:
(1) All of TransE’s extensions that consider complex relation modeling achieve better results for the “1-to-n”, “n-to-1” and “n-to-n” relations as compared to TransE. It indicates that these models actually improve the ability to model complex relations.
(2) PTransE also performs better for the “1-to-1” relations as compared to TransE. It indicates that these models obtain better representations of entities and relations by especially dealing with complex relations.
(3) PTransE achieves the best performance among all models in all mapping properties. It indicates that PTransE obtains better representations of entities and relations by taking relational paths into consideration, and that relation inference can benefit knowledge graph completion.
(4) ComplEx performs better than all translation models except PTransE, which considers the information of relational paths. It demonstrates that complex embeddings are more suitable for representing KGs than traditional real-valued vectors.
For all the above models, there are two hyper-parameters which have a significant influence on the performance: the dimension $d$ and the margin $\gamma$. Hence, we further compare the performance with respect to these two hyper-parameters on the dataset FB15K in Hits@10. For other hyper-parameters, we use the same settings as in the task of knowledge graph completion on FB15K.
Effect of dimension d.
Evaluation results are shown in Table 4. From the table, we observe that:
(1) All the models achieve better performance in dimension , and , and the system performance doesn’t improve significantly when the dimension is greater than 400.
(2) ComplEx is more robust than all other models, even TransR, which has many more parameters. This indicates that complex embeddings make the model more expressive.
Effect of margin.
Evaluation results are shown in Table 5 (As HolE and ComplEx don’t have this hyper-parameter, we don’t list their results here). From the table, we observe that:
(1) All the models perform well when the margin . Therefore these models can keep stable when the margin within a reasonable range.
(2) All the models cannot perform well when the margin . But TransR performs better as compared to other models when the margin . The reason is perhaps that TransR has much more parameters than other models and its strong model ability makes it more robust.
5.1.4 Type Constraints
In addition to the fact triples, most existing KGs such as Wikidata also provide type-constraint information for relations, which specifies the types of the head and tail entities allowed for each relation. This prior knowledge about relations provides additional information for KRL, e.g., that the relation nationality should relate only head entities of the type Person and tail entities of the type Country.
It has been shown that taking the type-constraint information of relations into account can help KRL approaches to model entities and relations in KGs krompassISWC2015 . We also report the Hits@10 for all models with type constraints (+TC) in Table 6.
|TransE||74.5||78.7 (+4.2)||48.4||49.9 (+1.5)|
|TransH||76.7||80.0 (+3.3)||52.0||53.6 (+1.6)|
|TransR||79.0||81.9 (+2.9)||52.8||54.5 (+1.7)|
|TransD||77.5||80.0 (+2.5)||50.0||51.4 (+1.4)|
|TranSparse||75.9||79.8 (+3.9)||50.7||52.3 (+1.6)|
|ComplEx||84.0||87.2 (+3.2)||47.6||54.1 (+6.5)|
From the table we can see that all the models improve when type constraints are considered. It indicates that the type-constraint information of relations provided by KGs is useful for existing KRL methods in modeling KGs and in further knowledge-driven tasks.
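One simple way to use such constraints is to restrict the candidate entities considered for a relation to those whose types match the relation's expected head or tail types. The sketch below illustrates this for tail prediction; the dictionaries and the function name are hypothetical, and actual type-constraint methods may apply the constraints differently (for example, during negative sampling in training).

```python
def candidate_tails(relation, entity_types, range_types):
    """Keep only entities having at least one type allowed as tail of `relation`."""
    allowed = range_types[relation]
    return [entity for entity, types in entity_types.items() if types & allowed]

# Hypothetical toy type information.
entity_types = {
    "China": {"Country"},
    "Beijing": {"City"},
    "Obama": {"Person"},
}
range_types = {"nationality": {"Country"}}

tails = candidate_tails("nationality", entity_types, range_types)
```

Ranking only the type-consistent candidates removes entities such as Beijing or Obama from the list for nationality, which is one plausible reason why Hits@10 rises under the +TC setting.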
5.2 Triple Classification
Triple classification aims to determine whether a given triple is correct or not, and has been studied in socher2013reasoning ; wang2014knowledge as one of their evaluation tasks. Here, we use three typical datasets for this task, WN11, FB13 and FB15K, where the first two datasets are used in socher2013reasoning , and their statistics are listed in Table 7.
The experimental results of triple classification are shown in Table 8. From the table, we have the following observations:
(1) On WN11, TransE and its extensions have similar performance. The reason is perhaps that WN11 has only 11 relations, which are too simple to distinguish the modeling ability of different translation models.
(2) None of TransE and its extensions can outperform NTN on FB13 with only 13 relations. In contrast, on the sparser dataset FB15K with 1,345 relations, TransE and its extensions perform much better than NTN. The reason is perhaps that NTN is more expressive but has many more parameters. Therefore, it performs better on dense graphs, while suffering from the lack of data on sparse graphs. In contrast, TransE and its extensions are simpler and more effective, achieving promising results on sparse graphs.
5.3 Relation Extraction
Relation extraction (RE) aims to extract unknown relational facts from plain text on the web, which is an important information source for enriching KGs. Recently, distantly supervised RE models mintz2009distant ; riedel2010modeling ; hoffmann2011knowledge ; surdeanu2012multi have become the mainstream approaches to extracting novel facts from plain texts. However, these methods only use the information in plain text for knowledge acquisition, ignoring the rich information contained in the structure of KGs.
weston2013connecting propose to combine TransE with existing distantly supervised RE models to extract novel facts, and obtain substantial improvements. Moreover, (han2016joint, ) propose a novel joint representation learning framework for KRL and RE. In this section, we will investigate whether existing KRL models can effectively enhance existing distantly supervised RE models.
Following weston2013connecting , we adopt a widely used dataset NYT10 which is developed by riedel2010modeling in our experiments. This dataset contains relations, and relational facts as well as relational facts in training and testing sets respectively. Besides, the training and testing set contain and sentences respectively.
In our experiments, without loss of generality, we follow the experimental settings in lin2015learning to implement the distantly supervised RE model, named Sm2r following weston2013connecting , and the KRL models are all trained on the FB40k dataset which contains entities and .
We combine the output scores from Sm2r with the scores from various KRL models to predict novel relational facts, and obtain precision-recall curves for the models combined with TransE, TransH, TransR and PTransE.
From the figure, we observe that by combining existing KRL models, the performance of distantly supervised RE is much better than that of the original model. It indicates that incorporating the information from KGs is useful for distantly supervised RE.
5.4 Other Applications
Besides knowledge acquisition, KRL has also been applied in many other NLP tasks. In this section, we will introduce some typical knowledge-driven tasks including language modeling, question answering, information retrieval and recommendation systems.
5.4.1 Language Modeling
Language models aim to learn a probability distribution over sequences of words, which is a classical and essential NLP task. Recently, neural models such as RNN have proved to be effective in language modeling. However, most existing neural language models suffer from the incapability of modeling and utilizing background knowledge. The reason is that the statistical co-occurrences cannot instruct the generation of all kinds of knowledge, especially for those entities with low frequencies in plain text.
To address this issue, (ahn2016neural, ) propose a Neural Knowledge Language Model (NKLM) that considers background knowledge provided by KGs when generating natural language sequences with RNN language models. The key is NKLM’s two heterogeneous ways to generate words. One is to generate a word from the “word vocabulary” according to the word probabilities calculated by RNN language model, and another one is to generate a word from the “knowledge vocabulary” according to the external KGs.
The NKLM model explores a novel neural model that combines the symbolic knowledge information in external KGs with RNN language models. However, the topic knowledge is needed when generating natural languages, which makes NKLM less practical and scalable for more general topic-independent texts. Nevertheless, we still believe that it is promising to encode knowledge into language model with such methods.
5.4.2 Question Answering
Question answering aims to give answers according to users’ questions, which needs the capabilities of both natural language understanding on questions and inference on answer selection. Therefore, combining knowledge with question answering is a straightforward application for knowledge representations. Conventional question answering systems directly utilize KGs as background databases. These systems usually transform user’s questions into regular queries and search KG for appropriate answers. However, they always ignore the potential relationships between entities and relations. Recently, with the development of deep learning, explorations have been done on neural network models for understanding questions and even generating answers.
Considering the flexibility and diversity of answer generation in natural languages, (yin2016neural, ) propose a neural generative question answering model which explores how to utilize the facts in KGs to answer simple factoid questions. Besides, KRL models are also applied in serban-EtAl:2016:P16-1 , which attempts to generate factoid questions. Moreover, (hegenerating, ) propose an end-to-end question answering system which incorporates copying and retrieving mechanisms to generate natural answers using KRL techniques.
5.4.3 Information Retrieval
Information retrieval aims to retrieve related articles according to users’ queries. Similar to question answering, how to understand exactly what users mean is a crucial problem in information retrieval. Hence, incorporating the information of KGs could be beneficial to information retrieval. Traditional information retrieval systems regard the user’s query and the retrieved articles as strings and measure their similarity using human-designed features such as bag-of-words. However, these systems cannot really capture users’ meaning via simple string matching.
Recently, with the success of KRL in many other NLP tasks, researchers have focused on utilizing KRL techniques for information retrieval. They usually improve the word-based representations used in information retrieval with entity-based representations learned by KRL methods. (hasibi2015entity, ) propose an entity-based language model to understand users’ queries in information retrieval, which is combined with a word-based retrieval model to further improve the retrieval performance. Similarly, (xiong2016bag, ) propose a bag-of-entity model which represents queries and articles with their entities. Moreover, (nguyen2016toward, ) propose to incorporate KGs into deep neural approaches for document ranking, and (xiong2017explicit, ) represent queries and articles in the entity space and utilize KRL to capture their semantic relatedness in KGs.
5.4.4 Recommendation System
With the rapid growth of web information, recommender systems have been playing an important role in web applications. A recommender system aims to predict the “rating” or “preference” that users may give to items. Since KGs can provide rich information including both structured and unstructured data, recommender systems increasingly utilize KGs to enrich their contexts.
(cheekula2015entity, ) explore how to utilize the hierarchical knowledge from the DBpedia category structure in recommendation system and employ the spreading activation algorithm to identify entities of interest to the user. Besides, Passant passant2010dbrec measures the semantic relatedness of the artist entity in a KG to build music recommendation systems. However, most of these systems mainly investigate the problem by leveraging the structure of KGs. Recently, with the development of representation learning, (zhang2016collaborative, ) propose to jointly learn the representations of entities in both collaborative filtering recommendation systems and KGs.
Besides the tasks stated above, there are gradually more efforts focusing on encoding knowledge graph representations into other tasks such as dialogue systems le2016lstm ; zhu2017flexible , entity disambiguation huang2015leveraging ; fang2016entity , entity typing Ren:2016:LNR:2939672.2939822 ; neelakantan-chang:2015:NAACL-HLT , knowledge graph alignment ijcai2017-209 ; ijcai2017-595 , dependency parsing kim2015re , etc. Moreover, the idea of KRL has also motivated research on visual relation extraction Zhang_2017_CVPR ; baier2017improving and social relation extraction ijcai2017399 .
6 Discussion and Outlook
KGs represent both entities and their relations in the form of relational triples, which provides an effective way for human beings to learn and understand the real world. Now, as a useful and convenient tool for dealing with large-scale KGs, KRL is widely explored and utilized in multiple knowledge-driven tasks, and significantly improves their performance.
Although existing KRL models have already shown their power in modeling KGs, there are still many possible improvements to be explored. In this section, we will discuss the challenges of KRL and its applications.
6.1 Further Exploration of Internal and External Information
Relational triples, which are regarded as the internal information of KGs, have been well organized by existing KRL methods. However, the performances of these models are still far from being practical in real-world application such as knowledge graph completion. In fact, entities and relations in KGs have their complex characteristics and rich information which have not been taken into full consideration. In this section, we will discuss the internal and external information to be further explored to enhance the performance of KRL methods.
6.1.1 Type of Knowledge
Researchers usually divide the relations in KGs into four types, 1-to-1, 1-to-n, n-to-1 and n-to-n, according to their mapping properties. Different KRL methods perform differently when dealing with these four kinds of relations, which indicates that we need to design specific KRL frameworks for different kinds of knowledge or relations. However, existing KRL methods simply divide all relations into 1-to-1, 1-to-n, n-to-1 and n-to-n relations, which cannot effectively describe the characteristics of knowledge. According to the cognitive and computational characteristics of knowledge, existing knowledge could be divided into several types: (1) Hyponymy (e.g. has_part), which indicates the subordination between entities. (2) Attribute (e.g. nationality), which indicates the attribute information of entities. Many entities may share the same attributes, especially enumerative attributes such as gender, age, etc. (3) Interrelation (e.g. friend_of), which indicates relationships between entities. It is intuitive that these different kinds of relations should be modeled in different ways.
6.1.2 Dynamics of Knowledge
Existing KRL methods usually embed the whole KG into a unified semantic space by learning from all fact triples, neglecting the time information contained in the KG. In fact, knowledge is not static and changes over time. For any point in time, there should be a unique KG with the corresponding timestamp. For instance, George W. Bush was the president of the United States only during his term of office, and should not be regarded as a politician in recent years. Considering the time information of fact triples will help to understand entities and their relations more precisely in KRL. Moreover, research on the development of KGs has an impact not only on KG theories and applications but also on the study of human history and cognition. Some existing works jiang2016encoding ; esteban2016predicting ; trivedi2017know attempt to incorporate temporal information into KRL, but these efforts are still preliminary, and the dynamics of knowledge need to be further explored.
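The idea that each point in time corresponds to its own KG can be sketched as follows. The snippet assumes that every fact carries a validity interval in years; the five-field tuple layout, the function name snapshot, and the placeholder entities are hypothetical and serve only as an illustration.

```python
def snapshot(temporal_facts, year):
    """Return the set of triples that hold in the given year.

    Each fact is (head, relation, tail, start_year, end_year);
    end_year is None when the fact is still valid.
    """
    return {
        (head, relation, tail)
        for head, relation, tail, start, end in temporal_facts
        if start <= year and (end is None or year <= end)
    }

# Hypothetical time-scoped facts with placeholder entities.
temporal_facts = [
    ("person_a", "president_of", "country_x", 2000, 2007),
    ("person_b", "president_of", "country_x", 2008, None),
    ("person_a", "born_in", "city_y", 1950, None),
]

kg_2005 = snapshot(temporal_facts, 2005)
kg_2010 = snapshot(temporal_facts, 2010)
```

Here kg_2005 and kg_2010 disagree on who holds the president_of relation, whereas a model trained on all triples at once would treat both facts as simultaneously true.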
6.1.3 Multi-lingual Representation Learning
Mikolov et al. mikolov2013distributed observe a strong similarity of the geometric arrangements of corresponding concepts between the vector spaces of different languages, and suggest that a cross-lingual mapping between the two vector spaces is technically plausible. Joint-space models for cross-lingual word representations are desirable, as language-invariant semantic features can be generalized to make it easy to transfer models across languages. Besides, many projects, such as DBpedia, YAGO, and Freebase, are constructing multilingual KGs by extracting structured information from Wikipedia. Multilingual KGs are important for the globalization of knowledge sharing and play important roles in many applications such as cross-lingual information retrieval, machine translation, and question answering. However, to the best of our knowledge, little work has been done on representation learning of multilingual KGs. Therefore, multi-lingual KRL, which aims to improve the performance on comparatively sparse KGs in some languages with the help of KGs in resource-rich languages, is a significant but challenging problem to be solved.
6.1.4 Multi-source Information Learning
With the fast development of high-speed networks, billions of people all over the world can easily upload and share multimedia content instantly. As we are witnessing, the Internet nowadays contains not only pages and hyperlinks; audio, photos, and videos have also become more and more prevalent on the Web. How to efficiently and effectively utilize multi-source information from text to video is becoming a critical and challenging problem in KRL. Multi-source information learning has shown its potential to help model KGs, while existing methods of utilizing such information are still preliminary. We could design more effective and elegant models to utilize these kinds of information better. Moreover, other forms of multi-source information such as social networks are still isolated from the construction of knowledge graph representations, which could be further explored.
6.1.5 One-shot/Zero-shot Learning
Recently, one-shot/zero-shot learning has been flourishing in various fields such as word representation, sentiment classification, and machine translation. One-shot/zero-shot learning aims to learn from instances of an unseen class or a class with only a few instances. In the representation of KGs, the practical problem is that low-frequency entities and relations are learned more poorly than high-frequency ones. The representations of these low-frequency entities and relations are one of the keys to applying KGs in real-world applications. It is natural that external information such as multi-lingual and multi-modal information can help to construct knowledge graph representations, especially for large-scale sparse KGs. We believe that with the help of multi-lingual and multi-modal representations of entities and relations, the representations of low-frequency entities and relations could be improved to some degree. Besides, it is necessary to design new KRL frameworks that are more suitable for the representation learning of low-frequency entities and relations.
6.2 Complexity in Real-world Knowledge Applications
KGs play an important role in a variety of applications such as web search, knowledge inference, and question answering. However, due to the complexity of real-world knowledge applications, it is still difficult to utilize KGs effectively and efficiently. In this section, we discuss the issues we are confronted with when utilizing KGs in real-world applications.
6.2.1 Low Quality of KGs
One of the main challenges in real-world knowledge applications is the quality of the huge KGs themselves. Typical KGs such as Freebase, DBpedia, YAGO, and Wikidata often obtain their fact triples through automatic knowledge acquisition from huge amounts of plain text on the Internet. Therefore, these KGs inevitably suffer from noise and contradictions due to the lack of human labeling. Such noise and conflicts lead to error propagation when the KGs are used in real-world applications. How to automatically detect conflicts or errors in existing KGs thus becomes an important problem when incorporating the information of KGs into real-world applications.
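One simple kind of conflict can be illustrated with a rule-based check. The sketch below assumes that some relations are known to be functional, i.e., each head should have at most one tail, and flags heads that violate this. The set of functional relations, the function name find_conflicts, and the toy triples are hypothetical; this is only one possible heuristic and not a method proposed in this paper.

```python
from collections import defaultdict

def find_conflicts(triples, functional_relations):
    """Return (head, relation) pairs that have more than one tail
    for a relation assumed to allow only a single tail per head."""
    tails = defaultdict(set)
    for head, relation, tail in triples:
        if relation in functional_relations:
            tails[(head, relation)].add(tail)
    return {key: sorted(values) for key, values in tails.items() if len(values) > 1}

# Hypothetical noisy triples: person_a has two contradictory birthplaces.
noisy_triples = [
    ("person_a", "born_in", "city_x"),
    ("person_a", "born_in", "city_y"),
    ("person_b", "born_in", "city_x"),
    ("person_a", "friend_of", "person_b"),
    ("person_a", "friend_of", "person_c"),
]
conflicts = find_conflicts(noisy_triples, functional_relations={"born_in"})
```

The check reports the two birthplaces of person_a but leaves friend_of untouched, since an n-to-n relation may legitimately have several tails per head. Deciding which of the conflicting triples is wrong still requires additional evidence.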
6.2.2 Large Volume of KGs
The existing KGs are too cumbersome to deploy efficiently in real-world applications. They already include millions of entities and billions of facts about the world. For example, Freebase alone contains millions of entities and billions of fact triples. Due to the huge sizes of KGs, some existing methods are not practical because of their model and computational complexity. To the best of our knowledge, there is still much room to improve existing methods so that they achieve both effectiveness and efficiency on such huge KGs.
6.2.3 Endless Changing of KGs
Knowledge changes over time, and new knowledge comes into being as time goes by. Existing KRL methods have to re-learn their models from scratch every time the KG changes, since their optimization objectives depend on all the fact triples in the KG. This is time-consuming and impractical if we want to utilize KGs in real-world applications. Therefore, designing a new KRL framework that can carry out online learning and update model parameters incrementally is crucial to the applications of KGs.
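To make the contrast with re-learning from scratch concrete, the following sketch updates only the vectors involved in a newly added triple, using a translation-style squared distance ||h + r - t||^2. It is a deliberately simplified assumption: negative sampling, the margin-based loss, and norm constraints are omitted, and the function names, learning rate, and toy embeddings are hypothetical.

```python
import random

def transe_distance(h, r, t):
    """Squared L2 distance between h + r and t."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))

def incremental_update(entity_vecs, relation_vecs, new_triple,
                       dim=2, lr=0.05, steps=20, seed=0):
    """Fit one new triple by gradient steps on its own vectors only.

    Unseen entities or relations get a random initial vector of size dim,
    which must match the size of the existing vectors. All other
    embeddings are left untouched.
    """
    rng = random.Random(seed)
    head, relation, tail = new_triple
    for name, table in ((head, entity_vecs), (relation, relation_vecs), (tail, entity_vecs)):
        if name not in table:
            table[name] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    h, r, t = entity_vecs[head], relation_vecs[relation], entity_vecs[tail]
    for _ in range(steps):
        residual = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
        for i, g in enumerate(residual):
            step = lr * 2 * g  # gradient of the squared distance
            h[i] -= step
            r[i] -= step
            t[i] += step
    return transe_distance(h, r, t)

# Hypothetical pre-trained embeddings and one newly arrived fact.
entity_vecs = {"Beijing": [0.1, 0.2], "China": [0.4, -0.3], "Paris": [0.0, 0.5]}
relation_vecs = {"is_capital_of": [0.2, 0.1]}
final_distance = incremental_update(
    entity_vecs, relation_vecs, ("Paris", "is_capital_of", "France"))
```

Note that the shared relation vector is also moved, so triples that were learned earlier can be affected. Controlling this interference without revisiting all fact triples is part of what makes online KRL challenging.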
In this paper, we first give a broad overview of existing approaches based on KRL, with a particular focus on three main challenges: complex relation modeling, relational path modeling, and multi-source information learning. Secondly, we present a quantitative analysis of recent KR models and explore which factors actually benefit the modeling in three knowledge acquisition tasks. Thirdly, we introduce typical applications of KRL, including language modeling, question answering, information retrieval, and recommendation systems. Finally, we discuss the remaining challenges of KRL and its applications, and give an outlook on the future study of KRL.
- (1) K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of SIGKDD, 2008, pp. 1247–1250.
- (2) S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, Dbpedia: A nucleus for a web of open data, The semantic web (2007) 722–735.
- (3) F. M. Suchanek, G. Kasneci, G. Weikum, Yago: a core of semantic knowledge, in: Proceedings of WWW, ACM, 2007, pp. 697–706.
- (4) A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, Jr., T. M. Mitchell, Toward an architecture for never-ending language learning, in: Proceedings of AAAI, AAAI Press, 2010, pp. 1306–1313.
- (5) D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledge base, Communications of the ACM 57 (10) (2014) 78–85.
- (6) T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity: measuring the relatedness of concepts, in: Proceedings of HLT-NAACL, Association for Computational Linguistics, 2004, pp. 38–41.
- (7) C. Leacock, M. Chodorow, Combining local context and wordnet similarity for word sense identification, WordNet: An electronic lexical database 49 (2) (1998) 265–283.
- (8) X. Chen, Z. Liu, M. Sun, A unified model for word sense representation and disambiguation, in: Proceedings of EMNLP, 2014, pp. 1025–1035.
- (9) M. Dredze, P. McNamee, D. Rao, A. Gerber, T. Finin, Entity disambiguation for knowledge base population, in: Proceedings of ICCL, Association for Computational Linguistics, 2010, pp. 277–285.
- (10) A. Bordes, X. Glorot, J. Weston, Y. Bengio, Joint learning of words and meaning representations for open-text semantic parsing, in: Proceedings of AISTATS, 2012, pp. 127–135.
- (11) J. Berant, A. Chou, R. Frostig, P. Liang, Semantic parsing on freebase from question-answer pairs, in: Proceedings of EMNLP, Vol. 2, 2013, pp. 1533–1544.
- (12) S. Scott, S. Matwin, Text classification using wordnet hypernyms, Usage of WordNet in Natural Language Processing Systems.
- (13) P. Wang, J. Hu, H.-J. Zeng, Z. Chen, Using wikipedia knowledge to improve text classification, Knowledge and Information Systems 19 (3) (2009) 265–281.
- (14) O. Medelyan, I. H. Witten, D. Milne, Topic indexing with wikipedia, in: Proceedings of the AAAI WikiAI workshop, Vol. 1, 2008, pp. 19–24.
- (15) R. Verma, P. Chen, W. Lu, A semantic free-text summarization system using ontology knowledge, in: Proceedings of TAC, 2007.
- (16) J. Hu, G. Wang, F. Lochovsky, J.-t. Sun, Z. Chen, Understanding user’s query intent with wikipedia, in: Proceedings of the WWW, ACM, 2009, pp. 471–480.
- (17) R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, D. S. Weld, Knowledge-based weak supervision for information extraction of overlapping relations, in: Proceedings of ACL-HLT, 2011, pp. 541–550.
- (18) J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes, Improving efficiency and accuracy in multilingual entity extraction, in: Proceedings of ICSS, ACM, 2013, pp. 121–124.
- (19) A. Bordes, S. Chopra, J. Weston, Question answering with subgraph embeddings, in: Proceedings of EMNLP, 2014, pp. 615–620.
- (20) A. Bordes, J. Weston, N. Usunier, Open question answering with weakly supervised embedding models, in: Proceedings of ECML PKDD, Springer, 2014, pp. 165–180.
- (21) N. Lao, W. W. Cohen, Relational retrieval using a combination of path-constrained random walks, Machine learning 81 (1) (2010) 53–67.
- (22) N. Lao, A. Subramanya, F. Pereira, W. W. Cohen, Reading the web with learned syntactic-semantic inference rules, in: Proceedings of EMNLP-CoNLL, 2012, pp. 1017–1026.
- (23) T. Di Noia, R. Mirizzi, V. C. Ostuni, D. Romito, M. Zanker, Linked open data to support content-based recommender systems, in: Proceedings of ICSS, ACM, 2012, pp. 1–8.
- (24) A. Smirnov, T. Levashova, N. Shilov, Patterns for context-based knowledge fusion in decision support systems, Information Fusion 21 (2015) 114–129.
- (25) A. Bordes, J. Weston, R. Collobert, Y. Bengio, et al., Learning structured embeddings of knowledge bases, in: Proceedings of AAAI, 2011, pp. 301–306.
- (26) A. Bordes, X. Glorot, J. Weston, Y. Bengio, A semantic matching energy function for learning with multi-relational data, Machine Learning 94 (2) (2014) 233–259.
- (27) R. Jenatton, N. L. Roux, A. Bordes, G. R. Obozinski, A latent factor model for highly multi-relational data, in: Proceedings of NIPS, 2012, pp. 3167–3175.
- (28) I. Sutskever, J. B. Tenenbaum, R. Salakhutdinov, Modelling relational data using bayesian clustered tensor factorization, in: Proceedings of NIPS, 2009, pp. 1821–1828.
- (29) B. Yang, W. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, CoRR abs/1412.6575.
- (30) H. Liu, Y. Wu, Y. Yang, Analogical inference for multi-relational embeddings, in: Proceedings of ICML, Vol. 70, PMLR, International Convention Centre, Sydney, Australia, 2017, pp. 2168–2178.
- (31) X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, W. Zhang, Knowledge vault: A web-scale approach to probabilistic knowledge fusion, in: Proceedings of SIGKDD, ACM, 2014, pp. 601–610.
- (32) R. Socher, D. Chen, C. D. Manning, A. Ng, Reasoning with neural tensor networks for knowledge base completion, in: Proceedings of NIPS, 2013, pp. 926–934.
- (33) Q. Liu, H. Jiang, A. Evdokimov, Z.-H. Ling, X. Zhu, S. Wei, Y. Hu, Probabilistic reasoning via deep learning: Neural association models, arXiv preprint arXiv:1603.07704.
- (34) M. Nickel, V. Tresp, H.-P. Kriegel, A three-way model for collective learning on multi-relational data, in: Proceedings of ICML, 2011, pp. 809–816.
- (35) M. Nickel, V. Tresp, H.-P. Kriegel, Factorizing yago: scalable machine learning for linked data, in: Proceedings of WWW, 2012, pp. 271–280.
- (36) L. Drumond, S. Rendle, L. Schmidt-Thieme, Predicting rdf triples in incomplete knowledge bases with tensor factorization, in: Proceedings of SAC, ACM, 2012, pp. 326–331.
- (37) S. Riedel, L. Yao, A. McCallum, B. M. Marlin, Relation extraction with matrix factorization and universal schemas, in: Proceedings of HLT-NAACL, 2013, pp. 74–84.
- (38) M. Fan, D. Zhao, Q. Zhou, Z. Liu, T. F. Zheng, E. Y. Chang, Distant supervision for relation extraction with matrix completion, in: Proceedings of ACL, 2014, pp. 839–849.
- (39) V. Tresp, Y. Huang, M. Bundschus, A. Rettinger, Materializing and querying learned knowledge, Proceedings of IRMLeS 2009.
- (40) Y. Huang, V. Tresp, M. Nickel, A. Rettinger, H.-P. Kriegel, A scalable approach for statistical learning in semantic graphs, Semantic Web 5 (1) (2014) 5–22.
- (41) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of NIPS, 2013, pp. 3111–3119.
- (42) R. Fu, J. Guo, B. Qin, W. Che, H. Wang, T. Liu, Learning semantic hierarchies via word embeddings, in: Proceedings of ACL, 2014, pp. 1199–1209.
- (43) A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Proceedings of NIPS, 2013, pp. 2787–2795.
- (44) M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs, in: Proceedings of AAAI, 2016, pp. 1955–1961.
- (45) T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: Proceedings of ICML, 2016, pp. 2071–2080.
- (46) K. Hayashi, M. Shimbo, On the equivalence of holographic and complex embeddings for link prediction, in: Proceedings of ACL, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 554–559.
- (47) Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes, in: Proceedings of AAAI, 2014, pp. 1112–1119.
- (48) Y. Lin, Z. Liu, M. Sun, Y. Liu, X. Zhu, Learning entity and relation embeddings for knowledge graph completion, in: Proceedings of AAAI, 2015, pp. 2181–2187.
- (49) D. Q. Nguyen, K. Sirts, L. Qu, M. Johnson, Stranse: a novel embedding model of entities and relationships in knowledge bases, in: Proceedings of NAACL, Association for Computational Linguistics, San Diego, California, 2016, pp. 460–466.
- (50) G. Ji, S. He, L. Xu, K. Liu, J. Zhao, Knowledge graph embedding via dynamic mapping matrix, in: Proceedings of ACL, 2015, pp. 687–696.
- (51) H.-G. Yoon, H.-J. Song, S.-B. Park, S.-Y. Park, A translation-based knowledge graph embedding preserving logical property of relations, in: Proceedings of HLT-NAACL, 2016, pp. 907–916.
- (52) G. Ji, K. Liu, S. He, J. Zhao, Knowledge graph completion with adaptive sparse transfer matrix, in: Proceedings of AAAI, 2016.
- (53) H. Xiao, M. Huang, Y. Hao, X. Zhu, Transa: An adaptive approach for knowledge graph embedding, CoRR.
- (54) M. Fan, Q. Zhou, E. Chang, T. F. Zheng, Transition-based knowledge graph embedding with relational mapping properties, in: Proceedings of PACLIC, 2014, pp. 328–337.
- (55) J. Feng, M. Huang, M. Wang, M. Zhou, Y. Hao, X. Zhu, Knowledge graph embedding by flexible translation, in: Proceedings of KR, 2016, pp. 557–560.
- (56) H. Xiao, M. Huang, X. Zhu, Transg: A generative model for knowledge graph embedding, in: Proceedings of ACL, 2016, pp. 2316–2325.
- (57) S. He, K. Liu, G. Ji, J. Zhao, Learning to represent knowledge graphs with gaussian embedding, in: Proceedings of CIKM, ACM, 2015, pp. 623–632.
- (58) H. Xiao, M. Huang, X. Zhu, From one point to a manifold: Orbit models for knowledge graph embedding, in: Proceedings of IJCAI, 2016, pp. 1315–1321.
- (59) N. Lao, T. Mitchell, W. W. Cohen, Random walk inference and learning in a large scale knowledge base, in: Proceedings of EMNLP, 2011, pp. 529–539.
- (60) M. Gardner, P. P. Talukdar, B. Kisiel, T. M. Mitchell, Improving learning and inference in a large knowledge-base using latent syntactic cues, in: Proceedings of EMNLP, 2013, pp. 833–838.
- (61) Y. Lin, Z. Liu, M. Sun, Modeling relation paths for representation learning of knowledge bases, Proceedings of EMNLP (2015) 705–714.
- (62) A. García-Durán, A. Bordes, N. Usunier, Composing relationships with translations, Proceedings of EMNLP (2015) 318–327.
- (63) A. Neelakantan, B. Roth, A. McCallum, Compositional vector space models for knowledge base completion, in: Proceedings of ACL-IJCNLP, Association for Computational Linguistics, Beijing, China, 2015, pp. 156–166.
- (64) Y. Luo, Q. Wang, B. Wang, L. Guo, Context-dependent knowledge graph embedding, in: Proceedings of EMNLP, 2015, pp. 1656–1661.
- (65) K. Toutanova, V. Lin, W.-t. Yih, H. Poon, C. Quirk, Compositional learning of embeddings for relation paths in knowledge base and text, in: Proceedings of ACL, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1434–1444.
- (66) R. Das, A. Neelakantan, D. Belanger, A. McCallum, Chains of reasoning over entities, relations, and text using recurrent neural networks, in: Proceedings of EACL, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 132–141.
- (67) J. Feng, M. Huang, Y. Yang, X. Zhu, Gake: Graph aware knowledge embedding, in: Proceedings of COLING, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 641–651.
- (68) W. Zeng, Y. Lin, Z. Liu, M. Sun, Incorporating relation paths in neural relation extraction, in: Proceedings of EMNLP, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 1769–1778.
- (69) K. Guu, J. Miller, P. Liang, Traversing knowledge graphs in vector space, Proceedings of EMNLP.
- (70) Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph and text jointly embedding, in: Proceedings of EMNLP, 2014, pp. 1591–1601.
- (71) H. Zhong, J. Zhang, Z. Wang, H. Wan, Z. Chen, Aligning knowledge and text embeddings by entity descriptions, in: Proceedings of EMNLP, 2015, pp. 267–272.