Distributional representations of concepts are often easy to obtain from unstructured data sets, but they tend to provide only a blurry picture of the relationships that exist between concepts. In contrast, knowledge graphs directly encode this relational information, but it can be difficult to summarize the graph structure in a single representation for each entity.
To combine the advantages of distributional and relational data, Faruqui et al. (2015) propose to retrofit embeddings learned from distributional data to the structure of a knowledge graph. Their method first learns entity representations solely from distributional data and then applies a retrofitting step that updates the representations based on the structure of a knowledge graph. This modular approach conveniently separates the distributional data and entity representation learning from the knowledge graph and retrofitting model, allowing one to flexibly combine, reuse, and adapt existing representations to new tasks.
However, a core assumption of Faruqui et al. (2015)'s retrofitting model is that connected entities should have similar embeddings. This assumption often fails to hold in large, complex knowledge graphs, for a variety of reasons. First, subgraphs of a knowledge graph often contain distinct classes of entities that are most naturally embedded in disconnected vector spaces. In the extreme case, the representations for these entities might derive from very different underlying data sets. For example, in a health knowledge graph, the subgraphs containing diseases and drugs should be allowed to form disjoint vector spaces, and we might want to derive the initial representations from radically different data sets. Second, many knowledge graphs contain diverse relationships whose semantics are different from – perhaps even in conflict with – similarity. For instance, in the knowledge graph in Figure 1, the model of Faruqui et al. (2015) would push the embedding of Aragorn toward that of the Nazgûl ($\mathbf{v}_{\text{Aragorn}} \approx \mathbf{v}_{\text{Nazgûl}}$), which is problematic as Aragorn is not semantically similar to a Nazgûl (they are enemies).
To address these limitations, we present Functional Retrofitting, a retrofitting framework that explicitly models pairwise relations as functions. The framework supports a wide range of instantiations, from simple linear relational functions to complex multilayer neural ones. Here, we evaluate both linear and neural instantiations of Functional Retrofitting on a variety of diverse knowledge graphs. For benchmarking against existing approaches, we use FrameNet and WordNet. We then move into the medical domain, where knowledge graphs play an important role in knowledge accumulation and discovery. These experiments show that even simple instantiations of Functional Retrofitting significantly outperform baselines on knowledge graphs with semantically complex relations and sacrifice no accuracy on graphs where Faruqui et al. (2015)'s assumptions about similarity do hold. Finally, we use the model to identify promising new disease targets for existing drugs.
Code which implements Functional Retrofitting is available at https://github.com/roaminsight/roamresearch.
A knowledge graph is composed of a set of vertices $V$, a set of relation types $R$, and a set of edges $E$, where each edge is a tuple $(u, r, v)$ in which the relationship $r \in R$ holds between vertices $u$ and $v$. Our goal is to learn a set of representations $\{\mathbf{q}_u\}_{u \in V}$ which contain the information encoded in both the distributional data and the knowledge graph structure, and can be used for downstream analysis. Throughout this paper, we use lowercase italics (e.g., $s$) to refer to a scalar, bold lowercase (e.g., $\mathbf{v}$) to refer to a vector, and bold uppercase (e.g., $\mathbf{M}$) to refer to a matrix or tensor.
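To make these definitions concrete, the following is a minimal sketch of a knowledge-graph container in Python. The class and method names are our own illustrations, not those of the released code.

```python
class KnowledgeGraph:
    """A knowledge graph: vertices V, relation types R, and directed edges E,
    where each edge is a tuple (u, r, v) stating that relation r holds from u to v."""

    def __init__(self):
        self.vertices = set()
        self.relations = set()
        self.edges = set()

    def add_edge(self, u, r, v):
        self.vertices.update((u, v))
        self.relations.add(r)
        self.edges.add((u, r, v))

    def neighbors(self, u, r):
        """Vertices v such that (u, r, v) is an annotated edge."""
        return {v for (h, rel, v) in self.edges if h == u and rel == r}


kg = KnowledgeGraph()
kg.add_edge("Aragorn", "enemy_of", "Nazgul")
kg.add_edge("Aragorn", "member_of", "Fellowship")
```

The retrofitted representations $\{\mathbf{q}_u\}$ would then be stored per vertex, e.g., in a dictionary keyed by the elements of `kg.vertices`.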
3 Related Work
Here we are interested in post-hoc retrofitting methods, which adjust entity embeddings to fit the structure of a previously unseen knowledge graph.
3.1 Retrofitting Models
The primary introduction of retrofitting was Faruqui et al. (2015), in which the authors showed the value of retrofitting semantic embeddings by minimizing the weighted least-squares problem

$$\Psi(Q) = \sum_{i \in V} \Big[ \alpha_i \|\mathbf{q}_i - \hat{\mathbf{q}}_i\|^2 + \sum_{j : (i,j) \in E} \beta_{ij} \|\mathbf{q}_i - \mathbf{q}_j\|^2 \Big], \tag{1}$$

where $\hat{\mathbf{q}}_i$ is the embedding learned from the distributional data, and $\alpha_i$, $\beta_{ij}$ set the relative weighting of each type of data. When $\alpha_i = 1$ and $\beta_{ij} = \deg(i)^{-1}$, this model assigns equal weight to the distributional data and the structure of the knowledge graph.
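The coordinate updates that solve this objective can be sketched as follows: each $\mathbf{q}_i$ is repeatedly set to a weighted average of its original distributional vector and its current neighbors. The toy entity names and the helper name `retrofit` are ours; $\alpha = 1$, $\beta_{ij} = \deg(i)^{-1}$ follows the equal-weight setting above.

```python
import numpy as np

def retrofit(q_hat, edges, alpha=1.0, n_iters=10):
    """Iteratively pull each embedding toward the mean of its neighbors
    while anchoring it to its distributional vector q_hat[i]."""
    q = {i: v.copy() for i, v in q_hat.items()}
    nbrs = {i: [] for i in q_hat}
    for i, j in edges:                 # treat edges as undirected similarity links
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(n_iters):
        for i, js in nbrs.items():
            if not js:
                continue
            beta = 1.0 / len(js)       # beta_ij = deg(i)^{-1}
            num = alpha * q_hat[i] + beta * sum(q[j] for j in js)
            q[i] = num / (alpha + beta * len(js))
    return q

q_hat = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
q = retrofit(q_hat, edges=[("a", "b")])
```

After retrofitting, connected entities sit closer together than their distributional vectors did, while the originals in `q_hat` are left untouched.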
More recently, Hamilton et al. (2017) presented GraphSAGE, a two-step method which learns both an aggregation function $A : \mathbb{R}^{n \times d} \to \mathbb{R}^{d'}$, to condense the representations of neighbors into a single point, and an update function $U : \mathbb{R}^{d + d'} \to \mathbb{R}^{d}$, to combine the aggregation with a central vertex. Here, $d$ is the embedding dimensionality, $d'$ is the aggregation dimensionality, and $n$ is the number of neighbors for each vertex. Note that $d' = nd$ is permitted, allowing for aggregation by concatenation. While this method is extremely effective for learning representations on simple knowledge graphs, it is not formulated for knowledge graphs with multiple types of relations. Furthermore, when the representation of a relation is known a priori, it can be useful to explicitly set the penalty function (e.g., Mrkšić et al. (2016) use hand-crafted functions to effectively model antonymy and synonymy). By aggregating neighbors into a point estimate before calculating relationship likelihoods, GraphSAGE makes it difficult to encode, learn, or extract the representation of a pairwise relation.
In a similar vein, Faruqui et al. (2016) developed a graph-based semi-supervised learning method to expand morpho-syntactic lexicons from seed sets. Though the task is different from the retrofitting task we consider here, the performance and scalability of their method demonstrate the utility of directly encoding pairwise relations as message-passing functions.
3.2 Relational Penalty Functions
Our new Functional Retrofitting framework models each relation $r$ via a penalty function $f_r(\mathbf{u}, \mathbf{v})$ acting on a pair of entities with embeddings $\mathbf{u} \in \mathbb{R}^{d_u}$ and $\mathbf{v} \in \mathbb{R}^{d_v}$ of dimensionalities $d_u$ and $d_v$, respectively. By explicitly modeling relations between pairs of entities, Functional Retrofitting supports the use of a wide array of scoring functions that have previously been developed for knowledge graph completion. Here, we present a brief review of such scoring functions; for an extensive review, see [Nickel et al.2016].
TransE [Bordes et al.2013] uses additive relations in which the penalty function $f_r(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} + \mathbf{r} - \mathbf{v}\|^2$ is low iff $\mathbf{u} + \mathbf{r} \approx \mathbf{v}$. The simple Unstructured Model [Bordes et al.2012] was proposed as a naïve version of TransE that assigns all $\mathbf{r} = \mathbf{0}$, leading to the penalty function $f(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|^2$. This is the underlying penalty function of [Faruqui et al.2015]. It cannot consider multiple types of relations. In addition, while it models 1-to-1 relations well, it struggles to model multivalued relations.
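These two penalty functions can be sketched directly; the function names and toy vectors are illustrative.

```python
import numpy as np

def transe_penalty(u, r, v):
    """TransE: low when the relation vector r translates u onto v."""
    return float(np.sum((u + r - v) ** 2))

def unstructured_penalty(u, v):
    """Unstructured Model: TransE with r = 0, i.e., plain squared distance."""
    return transe_penalty(u, np.zeros_like(u), v)

u = np.array([0.0, 1.0])
r = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
```

Here `u + r` lands exactly on `v`, so the TransE penalty is zero, while the Unstructured penalty is positive because `u` and `v` themselves differ.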
TransH [Wang et al.2014] was proposed to address this limitation by using multiple representations for a single entity via relation hyperplanes. For a relation $r$, TransH models the relation as a vector $\mathbf{r}$ on a hyperplane defined by normal vector $\mathbf{w}_r$. For a triple $(u, r, v)$, the entity embeddings $\mathbf{u}$ and $\mathbf{v}$ are first projected to the hyperplane of $\mathbf{w}_r$. By constraining $\|\mathbf{w}_r\|_2 = 1$, we have the penalty function $f_r(\mathbf{u}, \mathbf{v}) = \|\mathbf{u}_\perp + \mathbf{r} - \mathbf{v}_\perp\|^2$, where $\mathbf{u}_\perp = \mathbf{u} - \mathbf{w}_r^\top \mathbf{u}\, \mathbf{w}_r$ and $\mathbf{v}_\perp = \mathbf{v} - \mathbf{w}_r^\top \mathbf{v}\, \mathbf{w}_r$.
TransR [Lin et al.2015] embeds relations in a separate space from entities via a relation-specific matrix $\mathbf{M}_r$ that projects from entity space to relation space and a relation vector $\mathbf{r}$ that translates in relation space, giving $f_r(\mathbf{u}, \mathbf{v}) = \|\mathbf{M}_r \mathbf{u} + \mathbf{r} - \mathbf{M}_r \mathbf{v}\|^2$. We use this model as the inspiration for our linear penalty function.
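A sketch of this projection-then-translation penalty, with toy dimensions and values of our own choosing:

```python
import numpy as np

def transr_penalty(u, v, M_r, r):
    """TransR-style penalty: project both entities into relation space
    with M_r, then measure how well r translates one onto the other."""
    return float(np.sum((M_r @ u + r - M_r @ v) ** 2))

M_r = np.array([[1.0, 0.0]])   # project 2-d entities onto a 1-d relation space
r = np.array([1.0])            # translation in relation space
u = np.array([0.0, 5.0])
v = np.array([1.0, -3.0])
```

Note that `u` and `v` are far apart in entity space, yet the penalty is zero because the projection discards the second coordinate and the translation aligns the first; this is the flexibility that a fixed similarity penalty cannot express.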
The Neural Tensor Network (NTN; Socher et al. 2013) defines a score function $g(u, r, v) = \mathbf{u}_r^\top \tanh\big(\mathbf{u}^\top \mathbf{W}_r^{[1:k]} \mathbf{v} + \mathbf{V}_r [\mathbf{u}; \mathbf{v}] + \mathbf{b}_r\big)$, where $\mathbf{u}_r$ is a relation-specific linear layer, $\tanh$ is applied element-wise, $\mathbf{W}_r^{[1:k]}$ is a 3-way tensor, and $\mathbf{V}_r$ are weight matrices. All of these models can be directly incorporated in our Functional Retrofitting framework.
4 Functional Retrofitting
We propose the framework of Functional Retrofitting (FR) to incorporate a set of relation-specific penalty functions $f_r(\mathbf{u}, \mathbf{v})$, each of which penalizes the embeddings of a pair of entities with dimensionalities $d_u$ and $d_v$, respectively. This gives the complete minimization:

$$\Psi(Q, F) = \sum_{i \in V} \alpha_i \|\mathbf{q}_i - \hat{\mathbf{q}}_i\|^2 + \sum_{r \in R} \sum_{(i,j) \in E_r} \beta_{ijr}\, f_r(\mathbf{q}_i, \mathbf{q}_j) - \sum_{r \in R} \sum_{(i,j) \in E_r^-} \beta^-_{ijr}\, f_r(\mathbf{q}_i, \mathbf{q}_j) + \lambda \sum_{r \in R} \rho(f_r), \tag{2}$$

where $\hat{\mathbf{q}}_i$ is observed from distributional data, $\alpha$ and $\beta$ set the relative strengths of the distributional data and the knowledge graph structure, and $\rho$ regularizes each $f_r$ with strength set by $\lambda$. $E^-$ is the negative space of the knowledge graph, a set of edges that are not annotated in the knowledge graph. FR uses $E^-$ to penalize relations that are implied by the representations but not annotated in the graph. To populate $E^-$, we sample a single negative edge $(i, r, j')$ with the same outgoing vertex for each true edge $(i, r, j)$. The user can calibrate trust in the completeness of the knowledge graph via the $\beta^-$ hyperparameter.
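The negative-edge sampling described above can be sketched as follows; the helper name and the toy graph are illustrative, not from the released code.

```python
import random

def sample_negative_edges(edges, vertices, seed=0):
    """For each true edge (i, r, j), draw one corrupted edge (i, r, j')
    with the same outgoing vertex i, where (i, r, j') is not annotated."""
    rng = random.Random(seed)
    edge_set = set(edges)
    negatives = []
    for (i, r, j) in edges:
        candidates = [v for v in vertices
                      if v != j and (i, r, v) not in edge_set]
        if candidates:
            negatives.append((i, r, rng.choice(candidates)))
    return negatives

edges = [("d1", "treats", "flu"), ("d2", "treats", "cold")]
vertices = ["d1", "d2", "flu", "cold"]
negs = sample_negative_edges(edges, vertices)
```

This yields exactly one negative per positive, so the relative weight of $E^-$ in Eq. 2 is controlled entirely by $\beta^-$.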
In contrast to prior retrofitting work, FR explicitly encodes directed relations. This allows the model to fit graphs which contain diverse relation types and entities embedded in disconnected vector spaces. Here, we compare the performance of two instantiations of FR – one with all linear relations and one with all neural relations – and show that even these simple models provide significant performance improvements. In practice, we recommend that users select relation-specific functions in accordance with the semantics of their graph’s relations.
4.1 Linear Relations
We implement a linear relational penalty function

$$f_r(\mathbf{q}_i, \mathbf{q}_j) = \|\mathbf{A}_r \mathbf{q}_i + \mathbf{b}_r - \mathbf{q}_j\|^2$$

with $\ell_2$ regularization $\rho(f_r) = \|\mathbf{A}_r\|_2^2 + \|\mathbf{b}_r\|_2^2$, for minimization of:

$$\Psi(Q, F) = \sum_{i \in V} \alpha_i \|\mathbf{q}_i - \hat{\mathbf{q}}_i\|^2 + \sum_{r \in R} \sum_{(i,j) \in E_r} \beta_{ijr} \|\mathbf{A}_r \mathbf{q}_i + \mathbf{b}_r - \mathbf{q}_j\|^2 - \sum_{r \in R} \sum_{(i,j) \in E_r^-} \beta^-_{ijr} \|\mathbf{A}_r \mathbf{q}_i + \mathbf{b}_r - \mathbf{q}_j\|^2 + \lambda \sum_{r \in R} \big(\|\mathbf{A}_r\|_2^2 + \|\mathbf{b}_r\|_2^2\big). \tag{3}$$

Faruqui et al. (2015)'s model is a special case of this formulation in which each $\mathbf{A}_r$ is fixed to the identity matrix and each $\mathbf{b}_r$ to the zero vector.
Throughout the remainder of this paper, we refer to this baseline model as the “FR-Identity” retrofitting method.
We initialize the embeddings as those learned from distributional data ($\mathbf{q}_i = \hat{\mathbf{q}}_i$) and the relations to imply similarity:

$$\mathbf{A}_r = \mathbf{I}, \qquad \mathbf{b}_r = \mathbf{0}, \qquad \beta_{ijr} = \frac{\beta}{d_{i,r}}, \qquad \beta^-_{ijr} = \frac{\beta^-}{d_{i,r}},$$

where $d_{i,r}$ is the out-degree of vertex $i$ for relation type $r$, $\beta$ is a hyperparameter to trade off distributional data against structural data, and $\beta^-$ sets the trust in completeness of the knowledge graph structure. In our experiments, we use $\alpha_i = 1$ and $\beta = 1$ for straightforward comparison with the method of Faruqui et al. (2015) and optimize $\beta^-$ by cross-validation. Given prior knowledge about the semantic meaning of relations, we could instead initialize relations to respect these meanings (e.g., antonymy could be represented by $\mathbf{A}_r = -\mathbf{I}$, $\mathbf{b}_r = \mathbf{0}$).
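A sketch of the linear penalty and its similarity-preserving initialization; the function and variable names are ours.

```python
import numpy as np

def linear_penalty(q_i, q_j, A_r, b_r):
    """Linear relational penalty f_r(q_i, q_j) = ||A_r q_i + b_r - q_j||^2."""
    return float(np.sum((A_r @ q_i + b_r - q_j) ** 2))

d = 3
A_r = np.eye(d)      # identity initialization: relation implies similarity
b_r = np.zeros(d)

q_i = np.array([1.0, 2.0, 3.0])
q_j = np.array([1.0, 2.0, 3.0])

# Antonymy-style initialization: A_r = -I maps an embedding to its negation,
# so the penalty is zero exactly when q_j = -q_i.
A_antonym = -np.eye(d)
```

With the identity initialization, the penalty reduces to the squared distance of Faruqui et al. (2015); swapping in `A_antonym` gives zero penalty for opposed vectors instead of equal ones.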
We optimize this model by block coordinate optimization, alternating between the embeddings $\mathbf{q}_i$ and the relation parameters $(\mathbf{A}_r, \mathbf{b}_r)$. Conveniently, we have closed-form solutions at the points where the partial derivatives of Eq. 3 equal $0$: each block update is the solution of a regularized least-squares problem. Constraining each $\mathbf{A}_r$ to be orthogonal by requiring $\mathbf{A}_r^\top \mathbf{A}_r = \mathbf{I}$ likewise yields a closed-form update for $\mathbf{A}_r$.
4.2 Neural Relations
We also instantiate FR with a neural penalty function $f_r(\mathbf{q}_i, \mathbf{q}_j) = \|\tanh(\mathbf{A}_r \mathbf{q}_i + \mathbf{b}_r) - \mathbf{q}_j\|^2$, where $\tanh$ is the element-wise tanh operation, again with $\ell_2$ regularization. We initialize weights in a similar manner as for the linear relations and update them via stochastic gradient descent. In our experiments, we sample the same number of non-neighbors as true neighbors.
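A sketch of a neural penalty of this shape, with a finite-difference gradient step standing in for the analytic SGD update; the exact architecture and update rule in our implementation may differ.

```python
import numpy as np

def neural_penalty(q_i, q_j, A_r, b_r):
    """Neural relational penalty: affine map followed by element-wise tanh."""
    return float(np.sum((np.tanh(A_r @ q_i + b_r) - q_j) ** 2))

def sgd_step_A(q_i, q_j, A_r, b_r, lr=0.1, eps=1e-5):
    """One gradient step on A_r, using a central finite-difference gradient
    for clarity (an analytic gradient would be used in practice)."""
    grad = np.zeros_like(A_r)
    for idx in np.ndindex(A_r.shape):
        A_plus = A_r.copy();  A_plus[idx] += eps
        A_minus = A_r.copy(); A_minus[idx] -= eps
        grad[idx] = (neural_penalty(q_i, q_j, A_plus, b_r)
                     - neural_penalty(q_i, q_j, A_minus, b_r)) / (2 * eps)
    return A_r - lr * grad

q_i = np.array([0.5, -0.5])
q_j = np.array([0.3, 0.1])
A_r, b_r = np.eye(2), np.zeros(2)
A_new = sgd_step_A(q_i, q_j, A_r, b_r)
```

A single step lowers the penalty on this pair, illustrating how the relation parameters adapt where a fixed identity relation could not.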
5 Experiments

We test FR on four knowledge graphs. The first two are standard lexical knowledge graphs (FrameNet, WordNet), on which FR significantly improves retrofitting quality for complex graphs and loses no accuracy on simple graphs. The final two are large healthcare ontologies (SNOMED-CT, Roam Health Knowledge Graph), which demonstrate the scalability of the framework and the utility of the new embeddings.
For each graph, we successively evaluate link prediction accuracy after retrofitting to links of the other relation types. Specifically, for each relation type $r$, we retrofit to $E \setminus E_r$, where $E_r$ is the set of all edges of type $r$ between entities $i$ and $j$ (with 70% of vertices selected as training examples and the remainder reserved for testing). To have balanced class labels, we sample an equivalent number of non-edges $(i, j')$ with $(i, r, j') \notin E$. Thus, the random baseline classification rate is 50%. Other baselines are the embeddings built from distributional data and the retrofitting method of Faruqui et al. (2015), denoted as “None” and “FR-Identity”, respectively.
5.1 FrameNet

FrameNet [Baker et al.1998, Fillmore et al.2003] is a linguistic knowledge graph containing information about lexical and predicate-argument semantics of the English language. FrameNet contains two distinct entity classes: frames and lexical units, where a frame is a meaning and a lexical unit is a single meaning for a word. To create a graph from FrameNet, we connect lexical unit $\ell$ to frame $F$ if $\ell$ occurs in $F$. We denote this relation as “Frame”, and its inverse as “Lexical unit”. Finally, we connect frames to one another according to the structure of FrameNet (Table 6). Distributional embeddings are from the Google News pre-trained Word2Vec model [Mikolov et al.2013a]; the counts of each entity type that were also found in the distributional corpus are shown in Table 6.
As seen in Table 1, the representations learned by FR-Linear and FR-Neural are significantly more useful for link prediction than those of the baseline methods.
5.2 WordNet

WordNet [Miller1995, Fellbaum2005] is a lexical database consisting of words (lemmas) which are grouped into unordered sets of synonyms (synsets). To examine the performance of FR on knowledge graphs which predominantly satisfy the assumptions of Faruqui et al. (2015), we extract a simple knowledge graph of lemmas and the connections between these lemmas that are annotated in WordNet. These connections are dominated by hypernymy and hyponymy (Table 7), which correlate with similarity, so we expect the baseline retrofitting method to perform well.
As seen in Table 2, the increased flexibility of the FR framework does not degrade embedding quality even when this extra flexibility is not intuitively necessary. Here, we evaluate standard lexical metrics for word embeddings: word similarity and syntactic relations. For word similarity tasks, the evaluation metric is the Spearman correlation between predicted and annotated similarities; for syntactic relations, the evaluation metric is the mean cosine similarity between the learned representation of the correct answer and the prediction by the vector offset method [Mikolov et al.2013b]. In contrast to our other experiments, here the only stochastic behavior is due to stochastic gradient descent training, not sampling of evaluation samples. Even though the WordNet knowledge graph largely satisfies the assumptions of the naïve retrofitting model, the flexible FR framework achieves sustained improvements on the word similarity datasets (WordSim-353 [Finkelstein et al.2001], MTurk-771 (http://www2.mta.ac.il/~gideon/mturk771.html), and MTurk-287) and syntactic relations (the Google Analogy Test Set, http://download.tensorflow.org/data/questions-words.txt).
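The vector offset evaluation referenced above can be sketched as follows; the toy two-dimensional vocabulary is invented for illustration.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def vector_offset(vocab, a, b, c):
    """Answer the analogy "a : b :: c : ?" by finding the word whose vector
    is most cosine-similar to b - a + c (Mikolov et al. 2013b)."""
    target = vocab[b] - vocab[a] + vocab[c]
    return max((w for w in vocab if w not in {a, b, c}),
               key=lambda w: cosine(vocab[w], target))

vocab = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([2.0, 1.1]),
    "woman": np.array([2.0, -0.9]),
}
```

The evaluation in Table 2 then reports the mean cosine similarity between the representation of the correct answer and the offset prediction `target`.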
5.3 SNOMED-CT

SNOMED-CT is an ontology of clinical healthcare terms and concepts including diseases, treatments, anatomical terms, and many other types of entities. From the publicly available SNOMED-CT knowledge graph (https://www.nlm.nih.gov/healthit/snomedct/index.html), we extracted 327,001 entities and 3,809,639 edges of 169 different types (Table 8). To create distributional embeddings, we first link each SNOMED-CT concept to a set of Wikipedia articles by indexing the associated search terms in WikiData (https://dumps.wikimedia.org/wikidatawiki/entities/). We then aggregate each article set by the method of Arora et al. (2016), which performs TF-IDF-weighted aggregation of pre-trained term embeddings, to create distributional embeddings of SNOMED-CT concepts. This yields a single 300-dimensional vector for each entity.
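The weighted aggregation step can be sketched as a plain weighted average of term vectors; note that Arora et al.'s full method also removes a common component, which we omit here, and the terms, weights, and vectors below are invented.

```python
import numpy as np

def aggregate(term_vectors, weights):
    """Combine pre-trained term embeddings into one concept vector,
    weighting each term by its (e.g., TF-IDF) importance."""
    total = sum(weights.values())
    return sum((w / total) * term_vectors[t] for t, w in weights.items())

term_vectors = {"drug": np.array([1.0, 0.0]),
                "trial": np.array([0.0, 1.0])}
weights = {"drug": 3.0, "trial": 1.0}
concept_vec = aggregate(term_vectors, weights)
```

In the real pipeline the term vectors are 300-dimensional and the weights come from TF-IDF statistics over the linked Wikipedia articles.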
As the SNOMED-CT ontology is dominated by synonymy-like relations, we expect the simple retrofitting methods to perform well. Nevertheless, we see minimal loss in link prediction performance by using the more flexible FR framework (Table 3). Our implementation supports the use of different function classes to represent different relation types; in practice, we recommend that users select function classes in accordance with relation semantics.
5.4 Roam Health Knowledge Graph
Finally, we investigate the utility of FR in the Roam Health Knowledge Graph (RHKG). The RHKG is a rich picture of the world of healthcare, with connections into numerous data sources: diverse medical ontologies, provider profiles and networks, product approvals and recalls, population health statistics, academic publications, financial data, clinical trial summaries and statistics, and many others. As of June 2, 2017, the RHKG contains 209,053,294 vertices, 1,021,163,726 edges, and 6,231,287,999 attributes. Here, we build an instance of the RHKG using only public data sources involving drugs and diseases. The structure of this knowledge graph is summarized in Table 9. In total, we select 48,649 disease–disease relations, 227,051 drug–drug relations, and 13,667 drug–disease relations for retrofitting. A disjoint set of 11,306 drug–disease relations is reserved for evaluation.
In the RHKG, as in many industrial knowledge graphs, different distributional corpora are available for each type of entity. First, we mine 2.9M clinical texts for co-occurrence counts in physician notes. After counting co-occurrences, we perform a pointwise mutual information transform and row normalization to generate embeddings for each entity. For drug embeddings, we supplement these embeddings with physician prescription habits. We extract prescription counts for each of 808,020 providers in the 2013 Centers for Medicare & Medicaid Services (CMS) Part D prescriber dataset (https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html) and 837,629 providers in the 2014 CMS dataset. By aggregating prescription counts across provider specialty, we produce 201-dimensional distributional embeddings for each drug. Finally, we retrofit these distributional embeddings to the structure of the knowledge graph (excluding ‘Treats’ edges reserved for evaluation).
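The PMI transform and row normalization can be sketched as follows; the positive-PMI clipping is our assumption (common practice for sparse co-occurrence embeddings), and the toy counts are invented.

```python
import numpy as np

def ppmi_embeddings(counts):
    """Turn a co-occurrence count matrix into row-normalized (positive) PMI
    embeddings: PMI(i, j) = log(p(i, j) / (p(i) p(j))), clipped at zero."""
    total = counts.sum()
    p_ij = counts / total
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    norms = np.linalg.norm(ppmi, axis=1, keepdims=True)
    return np.where(norms > 0, ppmi / norms, 0.0)

# Two entities that only ever co-occur with themselves:
counts = np.array([[10.0, 0.0],
                   [0.0, 10.0]])
emb = ppmi_embeddings(counts)
```

Each row of `emb` is a unit-length distributional embedding; in the RHKG pipeline the rows correspond to entities and the columns to co-occurring terms in physician notes.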
As shown in Table 5, the FR framework significantly improves prediction of ‘Treats’ relations. We hypothesize that this is due to the separable nature of the graph; as seen in Figure 2, the FR retrofitting framework can learn Disease and Drug subgraphs that are nearly separable. In contrast, Identity retrofitting generates a single connected space and distorts the embeddings.
We also investigate the predictions induced by the retrofitted representations. An interesting use of healthcare knowledge graphs is to predict drug retargets, that is, diseases for which there is no annotated treatment relationship with the drug but such a relationship may exist medically. As shown in Table 4, the top retargets predicted by the linear retrofitting model are all medically plausible. In particular, the model confidently predicts that Kenalog would treat contact dermatitis, an effect also found in a clinical trial [Usatine and Riojas2010]. The second most confident prediction of drug retargets was that Kenalog can treat pemphigus, which is indicated on Kenalog's drug label (https://www.accessdata.fda.gov/drugsatfda_docs/label/2014/014901s042lbledt.pdf) but was not previously included in the knowledge graph. The third prediction was that methylprednisolone acetate would treat nephrotic syndrome, which is reasonable as the drug is now labelled to treat idiopathic nephrotic syndrome (https://dailymed.nlm.nih.gov/dailymed/fda/fdaDrugXsl.cfm?setid=978b8416-2e88-4816-8a37-bb20b9af4b1d). Interestingly, several models predict that furosemide treats “aneurysm of unspecified site”, a relationship not indicated on the drug label (https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=eadfe464-720b-4dcd-a0d8-45dba706bd33), though furosemide has been observed to reduce intracranial pressure [Samson and Beyer Jr1982], a key factor in brain aneurysms. Finally, both the distributional data and the embeddings produced by the baseline identity retrofitting model make the nonsensical prediction that latanoprost, a medication used to reduce intraocular pressure, would also treat superficial ankle and foot injuries.
The accuracy of the predictions from the more complex models underscores the utility of the new framework for retrofitting distributional embeddings to knowledge graphs with relations that do not imply similarity.
6 Conclusions and Future Work
We have presented Functional Retrofitting, a new framework for post-hoc retrofitting of entity embeddings to the structure of a knowledge graph. By explicitly modeling pairwise relations, this framework allows users to encode, learn, and extract information about relation semantics while simultaneously updating entity representations. This framework extends the popular concept of retrofitting to knowledge graphs with diverse entity and relation types. Functional Retrofitting is especially beneficial for graphs in which distinct distributional corpora are available for different entity classes, but it loses no accuracy when applied to simpler knowledge graphs. Finally, we are interested in the possibility of improvements to the optimization procedure outlined in this paper, including dynamic updates of the $\beta$ and $\beta^-$ parameters to increase trust in the graph structure while the relation functions are learned.
Acknowledgments

We would like to thank Adam Foster, Ben Peloquin, JJ Plecs, and Will Hamilton for insightful comments, and the anonymous reviewers for constructive criticism.
- [Arora et al.2016] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations.
- [Baker et al.1998] Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1, pages 86–90. Association for Computational Linguistics.
- [Bordes et al.2012] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In JMLR W&CP: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012).
- [Bordes et al.2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795.
- [Faruqui et al.2015] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615. Association for Computational Linguistics.
- [Faruqui et al.2016] Manaal Faruqui, Ryan McDonald, and Radu Soricut. 2016. Morpho-syntactic lexicon generation using graph-based semi-supervised learning. Transactions of the Association of Computational Linguistics, 4(1):1–16.
- [Fellbaum2005] Christiane Fellbaum. 2005. Wordnet and wordnets. In Keith Brown et al., editor, Encyclopedia of Language and Linguistics, pages 665–670. Oxford: Elsevier, second edition.
- [Fillmore et al.2003] Charles J Fillmore, Christopher R Johnson, and Miriam RL Petruck. 2003. Background to framenet. International journal of lexicography, 16(3):235–250.
- [Finkelstein et al.2001] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.
- [Hamilton et al.2017] William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216.
- [Lin et al.2015] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187.
- [Mikolov et al.2013a] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
- [Mikolov et al.2013b] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 13, pages 746–751.
- [Miller1995] George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
- [Mrkšić et al.2016] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148.
- [Nickel et al.2016] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.
- [Samson and Beyer Jr1982] Duke Samson and Chester W Beyer Jr. 1982. Furosemide in the intraoperative reduction of intracranial pressure in the patient with subarachnoid hemorrhage. Neurosurgery, 10(2):167–169.
- [Socher et al.2013] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934.
- [Usatine and Riojas2010] Richard P Usatine and Marcela Riojas. 2010. Diagnosis and management of contact dermatitis. American family physician, 82(3):249–255.
- [Wang et al.2014] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Appendix A Structure of Knowledge Graphs
Tables 6 and 7 list, for each entity type in the FrameNet and WordNet graphs, the total count and the count matched in the Word2Vec vocabulary; among the WordNet connections, the Derivationally Related Form relation alone contributes 60,250 edges.
The structure of the knowledge graph extracted from the SNOMED-CT ontology is shown in Table 8.
A.4 Roam Health Knowledge Graph
The structure of the extracted subgraph of the RHKG is summarized in Table 9. A disjoint set of 11,306 drug–disease relations is reserved for evaluation.
| Relation Type | Source | Target | Count |
| --- | --- | --- | --- |
| Has Active Ingredient | Drug | Drug | 18,422 |
| Active Ingredient Of | Drug | Drug | 17,175 |
| Inverse Is A | Drug | Drug | 10,369 |
| Precise Ingredient Of | Drug | Drug | 3,562 |
| Has Precise Ingredient | Drug | Drug | 3,562 |
| Possibly Equivalent To | Drug | Drug | 1,233 |
| Causative Agent of | Drug | Drug | 1,070 |
| Has Dose Form | Drug | Drug | 138 |