1 Introduction
Text-to-entity mapping is the task of associating a text with a concept in a knowledge graph (KG) or an ontology (we use the two terms interchangeably). Recent works (Kartsaklis et al., 2018; Hill et al., 2015) use neural networks to project a text to a vector space in which the entities of a KG are represented as continuous vectors. Despite being successful, these models have two main disadvantages. First, they rely on a predefined vector space which is used as a gold-standard representation of the entities in the KG; the quality of these algorithms therefore depends on the quality of that vector space. Second, these algorithms are not interpretable; hence, it is impossible to understand why a certain text was linked to a particular entity.
To address these issues we propose a novel technique which first represents an ontology concept as the sequence of its ancestors (hypernyms) in the ontology and then maps the corresponding textual description to this unique representation. For example, given the textual description of the concept swift (“small bird that resembles a swallow and is noted for its rapid flight”), we map it to the hierarchical sequence of entities in a lexical ontology: animal → chordate → vertebrate → bird → apodiform_bird. This sequence of nodes constitutes a path. [Footnote 1: We only consider hypernymy relations, from the root to the parent node (apodiform_bird) of the entity swift.]
Our model is based on a sequence-to-sequence neural network (Sutskever et al., 2014) coupled with an attention mechanism (Bahdanau et al., 2014). Specifically, we use an LSTM (Hochreiter and Schmidhuber, 1997) encoder to project the textual description into a vector space and an LSTM decoder to predict the sequence of entities that are relevant to this definition. With this framework we do not need to rely on a pre-existing vector space of the entities, since the decoder explicitly learns topological dependencies between the entities of the ontology. Furthermore, the proposed model is more interpretable, for two reasons. First, instead of the closest points in a vector space, it outputs paths; therefore, we can trace all the predictions the model makes. Second, the attention mechanism allows us to visualise which words in a textual description the model attends to while predicting a specific concept in the path. In this paper, we consider rooted tree graphs only [Footnote 2: Only a single root is allowed. If a tree has more than one root, one can create a dummy root node and connect the roots to it.] and leave the extension of the algorithm to more general graphs to future work.
We evaluate the ability of our model to generate graph paths for previously unseen textual definitions on seven ontologies (Section 3). We show that our technique either outperforms or performs on a par with a competitive multi-sense LSTM model (Kartsaklis et al., 2018) by better utilising external information in the form of word embeddings. The code and resources for the paper can be found at https://github.com/VictorProkhorov/Text2Path.
2 Methodology
We assume that an ontology is represented as a rooted tree graph $G = (V, E, D)$, where $V$ is a set of entities (e.g. synsets in WordNet), $E$ is a set of hyponymy edges, and $D$ is a set of textual descriptions such that every entity $v \in V$ has a description $d_v \in D$.
2.1 Node representation
We assume that an ontological concept can be defined either by a textual description from a dictionary or by the hypernyms of the concept in the ontology. For example, to define the noun swift one can use the dictionary definition mentioned previously. Alternatively, the concept of swift can be understood from its hypernyms; in the trivial case one can say that swift is an animal. This definition is not very useful, since animal is a hypernym of many other nouns. To provide a more specific definition, one can use a sequence of hypernyms, e.g. animal → chordate → vertebrate → bird → apodiform_bird, starting from the most abstract node (the root of the ontology) and ending at the most specific one (the parent node of the noun).
More formally, for each entity $v \in V$ we create a path $p_v$. Each $p_v$ starts from the root and ends with the parent (a hypernym) of $v$, i.e., the hierarchical order of entities is preserved. The path $p_v$ is then aligned with the description $d_v$, such that each node is defined by both a textual definition and a path. This set of aligned representations is used to train the model.
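To make the construction of these training pairs concrete, the sketch below extracts a root-to-parent path for every node of a rooted tree with NetworkX. This is our own illustration rather than the released code; `definitions` is an assumed mapping from nodes to their textual descriptions.

```python
# Hypothetical sketch: build (definition, path) training pairs from a rooted tree
# whose edges point from a hypernym to its hyponym.
import networkx as nx

def build_training_pairs(tree: nx.DiGraph, root, definitions: dict):
    pairs = []
    for node in tree.nodes:
        if node == root:
            continue
        # The root-to-node path is unique in a tree; dropping the node itself
        # keeps only its ancestors (root ... parent).
        path = nx.shortest_path(tree, source=root, target=node)[:-1]
        pairs.append((definitions[node], path))
    return pairs

# Toy usage with the swift example from Section 1.
toy = nx.DiGraph([("animal", "chordate"), ("chordate", "vertebrate"),
                  ("vertebrate", "bird"), ("bird", "apodiform_bird"),
                  ("apodiform_bird", "swift")])
defs = {n: f"definition of {n}" for n in toy.nodes}
print(build_training_pairs(toy, "animal", defs)[-1])
# ('definition of swift', ['animal', 'chordate', 'vertebrate', 'bird', 'apodiform_bird'])
```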
The path representation of an entity ends with its parent node; therefore, a leaf node is not present in any of the paths. This is problematic if a novel definition should be attached to a leaf. To alleviate this issue we employ the “dummy source sentences” technique from neural machine translation (NMT) (Sennrich et al., 2016): we create an additional set of paths from the root node to each leaf and pair each of them with an empty textual definition.

2.2 Model
We use a sequence-to-sequence model with an attention mechanism to map a textual description of a node to its path representation.
Encoder.
To encode a textual definition $d = (w_1, \dots, w_n)$, where $n$ is the sentence length, we first map each word $w_i$ to a dense embedding $x_i$ and then use a bi-directional LSTM to project the sequence into a latent representation. The final encoding state is obtained by concatenating the forward and backward hidden states of the bi-LSTM.
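As an illustration, the following Keras sketch builds such an encoder. It is our own minimal example, not the authors' implementation; the vocabulary size is a placeholder, while the 64-dimensional embeddings and 128 hidden units follow the settings reported in Appendix C.

```python
# Minimal sketch of the bi-LSTM definition encoder (assumed layer sizes).
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, emb_dim, hidden = 10000, 64, 128  # vocab_size is a placeholder

def_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="definition")
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(def_inputs)
# Bi-directional LSTM over the definition: keep the per-word states for the
# attention mechanism and concatenate the final forward/backward hidden states.
word_states, fwd_h, fwd_c, bwd_h, bwd_c = layers.Bidirectional(
    layers.LSTM(hidden, return_sequences=True, return_state=True))(x)
summary = layers.Concatenate()([fwd_h, bwd_h])
encoder = tf.keras.Model(def_inputs, [word_states, summary])
```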
Decoder.
Decoding the path representation of a node from the latent state of the textual description is again done with an LSTM decoder. Similarly to the encoding stage, we map each symbol $y_j$ of the path $p = (y_1, \dots, y_m)$ to a dense embedding $e_j$, where $m$ is the path length. To calculate the probability of the path symbol $y_t$ at time step $t$, we first represent the path prefix with the decoder state $s_t = \mathrm{LSTM}(s_{t-1}, e_{t-1})$. Then, we concatenate $s_t$ with the context vector $c_t$ (defined next) and pass the concatenated representation through the softmax function, i.e. $p(y_t \mid y_{<t}, d) = \mathrm{softmax}(W_o [s_t; c_t])$, where $W_o$ is a weight parameter. To calculate the context vector we use an attention mechanism over the words in the textual description: $c_t = \sum_i \alpha_{ti} h_i$ and $\alpha_{ti} = \mathrm{softmax}_i\big(v_a^\top \tanh(W_a s_t + U_a h_i)\big)$, where $v_a$, $W_a$ and $U_a$ are weight parameters and $h_i$ is the encoder state of word $w_i$.
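The NumPy sketch below works through one decoding step with this attention mechanism, purely to make the shapes and operations explicit. The weights are randomly initialised for illustration, and the equations above are reconstructed in standard Bahdanau form, so the variable names are ours rather than the authors'.

```python
# One attention-augmented decoding step (illustrative shapes, random weights).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n, enc_dim, dec_dim, n_symbols, att_dim = 12, 256, 128, 50, 64
H = np.random.randn(n, enc_dim)        # encoder states h_1 ... h_n
s_t = np.random.randn(dec_dim)         # decoder state at step t
W_a = np.random.randn(att_dim, dec_dim)
U_a = np.random.randn(att_dim, enc_dim)
v_a = np.random.randn(att_dim)
W_o = np.random.randn(n_symbols, dec_dim + enc_dim)

scores = np.tanh(H @ U_a.T + W_a @ s_t) @ v_a    # e_ti for every word i
alpha = softmax(scores)                          # attention weights over words
c_t = alpha @ H                                  # context vector
p_t = softmax(W_o @ np.concatenate([s_t, c_t]))  # distribution over path symbols
```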
3 Experimental Setup

Ontologies.
We experimented with seven graphs, four of which are related to the bio-medical domain: Phenotype And Trait Ontology (PATO, http://www.obofoundry.org), Human Disease Ontology (Schriml et al., 2012, HDO), Human Phenotype Ontology (Robinson et al., 2008, HPO) and Gene Ontology (Ashburner et al., 2000, GO). [Footnote: After preprocessing GO we took its largest connected component.] The other three graphs, i.e. WNanimal.n.01 [Footnote: The subscript of ‘WN’ indicates the name of the root node of the graph.], WNplant.n.02 and WNentity.n.01, are subgraphs of WordNet 3.0 (Fellbaum, 1998). We present the statistics of the graphs in Table 1.
Graphs | Nodes | Depth (avg, max) | Branch (avg, max) | A.D.
---|---|---|---|---
PATO | 1742 | (4.94, 10) | (3.95, 92) | 20
WNanimal.n.01 | 3999 | (6.94, 12) | (3.79, 52) | 26
WNplant.n.02 | 4487 | (4.70, 9) | (5.91, 357) | 28
HDO | 9095 | (5.92, 12) | (4.59, 222) | 27
HPO | 13348 | (6.95, 14) | (3.40, 32) | 24
GO | 29682 | (6.40, 14) | (3.28, 172) | 21
WNentity.n.01 | 74374 | (8.01, 18) | (4.52, 402) | 36

Table 1: Statistics of the graphs.
Models | PATO | WNanimal.n.01 | WNplant.n.02 | HDO | HPO | GO | WNentity.n.01 |
---|---|---|---|---|---|---|---|
BOW-LR | 0.79 | 0.75 | 0.65 | 0.55 | 0.63 | 0.32 | 0.41 |
MS-LSTM | 0.77 | 0.73 | 0.62 | 0.70 | 0.72 | 0.69 | 0.51 |
MS-LSTM | 0.80 | 0.76 | 0.65 | 0.70 | 0.73 | 0.70 | 0.57 |
MS-LSTM | 0.75 | 0.66 | 0.57 | 0.65 | 0.63 | 0.62 | 0.51 |
text2nodes | 0.75 | 0.66 | 0.66 | 0.69 | 0.62 | 0.67 | 0.60 |
text2edges | 0.76 | 0.68 | 0.66 | 0.69 | 0.69 | 0.69 | 0.61 |
MS-LSTM | 0.81 | 0.76 | 0.66 | 0.71 | 0.74 | 0.71 | 0.58 |
text2nodes | 0.83 | 0.71 | 0.68 | 0.71 | 0.69 | 0.70 | 0.62 |
text2edges | 0.83 | 0.77 | 0.70 | 0.73 | 0.74 | 0.72 | 0.65 |
Table 2: Ancestor-F1 scores. The models in the last three rows use pre-trained word embeddings in the encoder (see Section 3.3). We use the same number of epochs, batch size and number of latent dimensions for both MS-LSTM and our models (Appendix C).

Ontology Preprocessing.
All the ontologies we experimented with are represented as directed acyclic graphs (DAGs). This creates an ambiguity for node path definitions, since there may be multiple pathways from the root concept to another concept. We assume that a single unambiguous pathway reduces the complexity of the problem and leave the treatment of ambiguous pathways (which would inevitably require a more complex model) to future work. To convert a DAG to a tree we constrain each entity to have only one parent node; the edges to the other parent nodes are removed. [Footnote: The choice of which edge to keep is made at random.]
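A possible implementation of this preprocessing step is sketched below, assuming the DAG is stored as a NetworkX DiGraph with edges pointing from parent to child; the function name and random seed are ours.

```python
# Hypothetical DAG-to-tree conversion: keep one randomly chosen incoming edge
# per node and drop the edges to all other parents.
import random
import networkx as nx

def dag_to_tree(dag: nx.DiGraph, seed: int = 0) -> nx.DiGraph:
    rng = random.Random(seed)
    tree = dag.copy()
    for node in list(tree.nodes):
        parents = list(tree.predecessors(node))
        if len(parents) > 1:
            keep = rng.choice(parents)
            tree.remove_edges_from([(p, node) for p in parents if p != keep])
    return tree
```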
Path Representations.
We also experiment with two path representations. Our first approach, text2nodes, uses the labels of the entities (cf. Section 1) to represent a path. This is not efficient, since the decoder has to select among all of the entities in the ontology, which also requires more parameters in the model. Our second approach, text2edges, reduces the number of symbols the model has to choose from by using edges to represent the path. To do this we create an artificial vocabulary of size $B$, where $B$ corresponds to the maximum degree of a node, and label each edge in the graph with a symbol from this vocabulary. For the example in Section 1, the path would be animal [a] chordate [b] vertebrate [c] bird [d] apodiform_bird, where {a, b, c, d} is the artificial vocabulary. In the resulting path we discard the entity labels; therefore, the path reduces to: [a] [b] [c] [d].
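The edge-labelling scheme could be sketched as follows. This is our own illustration: we interpret the maximum degree as the maximum out-degree (branching factor) and use numeric symbols `[0]`, `[1]`, … in place of the {a, b, c, d} of the example.

```python
# Assign each edge a symbol from a small artificial vocabulary so that sibling
# edges under the same parent receive distinct symbols.
import networkx as nx

def label_edges(tree: nx.DiGraph) -> dict:
    max_degree = max(d for _, d in tree.out_degree())
    vocab = [f"[{i}]" for i in range(max_degree)]
    return {(parent, child): vocab[i]
            for parent in tree.nodes
            for i, child in enumerate(tree.successors(parent))}

def path_to_edge_symbols(tree: nx.DiGraph, root, node, edge_labels: dict):
    ancestors = nx.shortest_path(tree, root, node)[:-1]  # root ... parent of node
    return [edge_labels[(u, v)] for u, v in zip(ancestors, ancestors[1:])]
```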
3.1 Baselines
Bag-of-Words Linear Regression (BOW-LR):
To represent a textual definition in a vector space, we first use a set of pre-trained word embeddings (Speer et al., 2017) to represent the words in the definition and then take the mean of these word embeddings. For the ontology, we use node2vec (Grover and Leskovec, 2016) to represent each entity in a vector space. To align the two vector spaces we use linear regression.
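A compact sketch of this baseline is given below, assuming `word_vecs` and `node_vecs` are dictionaries holding the pre-trained word embeddings and the node2vec entity embeddings; the names and function signatures are ours. At test time, a definition would be projected with the fitted regressor and linked to the entity whose node2vec vector lies closest to the projection.

```python
# BOW-LR sketch: mean-of-word-embeddings definition vectors aligned to the
# node2vec space with multi-output linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

def definition_vector(definition: str, word_vecs: dict, dim: int) -> np.ndarray:
    vecs = [word_vecs.get(w, np.zeros(dim)) for w in definition.lower().split()]
    return np.mean(vecs, axis=0)

def fit_bow_lr(train_defs, train_nodes, word_vecs, node_vecs, dim=300):
    X = np.stack([definition_vector(d, word_vecs, dim) for d in train_defs])
    Y = np.stack([node_vecs[n] for n in train_nodes])
    return LinearRegression().fit(X, Y)  # maps text space into the entity space
```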
Multi-Sense LSTM (MS-LSTM):
Kartsaklis et al. (2018) proposed a model that achieves state-of-the-art results on text-to-entity mapping on the SNOMED CT dataset (https://www.snomed.org/snomed-ct). The approach uses a novel multi-sense LSTM, augmented with an attention mechanism, to project the definition into the ontology vector space. Additionally, for a better alignment between the two vector spaces, the authors augmented the ontology graph with textual features.
3.2 Evaluation Metric
To evaluate the models described above we use the Ancestor-F1 score (Mao et al., 2018). This metric compares the ancestors $A_{pred}$ of the predicted node with the ancestors $A_{gold}$ of the gold node in the taxonomy:

$P = \frac{|A_{pred} \cap A_{gold}|}{|A_{pred}|}, \qquad R = \frac{|A_{pred} \cap A_{gold}|}{|A_{gold}|},$

where $P$ and $R$ are precision and recall, respectively. The Ancestor-F1 is then defined as:

$F1 = \frac{2 P R}{P + R}.$
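For clarity, the metric can be computed as in the short sketch below; the helper name and the set-valued inputs are our own choice.

```python
# Ancestor-F1: harmonic mean of ancestor precision and recall.
def ancestor_f1(pred_ancestors: set, gold_ancestors: set) -> float:
    overlap = len(pred_ancestors & gold_ancestors)
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_ancestors)
    r = overlap / len(gold_ancestors)
    return 2 * p * r / (p + r)
```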
3.3 Intrinsic Evaluation
To verify the reliability of our model on text-to-entity mapping, we conducted a set of experiments on the seven graphs (Section 3) in which we map a textual definition of a concept to a path.
To conduct the experiments we randomly sampled 10% of leaves from the graph. From this sample, 90% are used to evaluate the model and 10% are used to tune the model. The remaining nodes in the graph are used for training. We sample leaves for two reasons: (1) to predict a leaf, the model needs to make the maximum number of (correct) predictions and (2) this way we do not change the original topology of the graph. Note that the sampled nodes and their textual definitions are not present in the training data.
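The splitting procedure could be implemented as in the following sketch; the function name, random seed and NetworkX representation are our assumptions, and leaves are taken to be nodes without children.

```python
# Sample 10% of the leaves; 90% of the sample is used for testing and 10% for
# tuning, while all remaining nodes are kept for training.
import random
import networkx as nx

def split_nodes(tree: nx.DiGraph, seed: int = 0):
    rng = random.Random(seed)
    leaves = [n for n in tree.nodes if tree.out_degree(n) == 0]
    sampled = rng.sample(leaves, max(1, int(0.1 * len(leaves))))
    cut = int(0.9 * len(sampled))
    test, dev = sampled[:cut], sampled[cut:]
    sampled_set = set(sampled)
    train = [n for n in tree.nodes if n not in sampled_set]
    return train, dev, test
```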
Both baselines predict a single entity instead of a path. To have the same evaluation framework for all the models, for each node predicted by a baseline we create a path from the root of the graph to the predicted node. [Footnote: We used NetworkX (https://networkx.github.io) to find the path from the predicted node to the root of the graph.] However, we want to emphasise that this setup is disadvantageous for our model, since our model must predict every symbol in the path, whereas the baselines predict only a single node.
The results are presented in Table 2. The models in the last three rows of Table 2 use pre-trained word embeddings (Speer et al., 2017) in the encoder; MS-LSTM and our models above the last three rows use randomly initialised word vectors. We make four observations: (1) without pre-trained word embeddings in the encoder, our model outperforms the best MS-LSTM on only two of the seven graphs; (2) with pre-trained embeddings, the text2edges model outperforms all the other models, including MS-LSTM; (3) the text2edges model exploits pre-trained word embeddings better than MS-LSTM; (4) our model performs better when the paths are represented using edges rather than nodes. We also found a strong negative correlation (by both the Spearman and Pearson coefficients) between A.D. (Table 1) and the Ancestor-F1 score for the text2edges model, meaning that the Ancestor-F1 score decreases as A.D. increases.
3.4 Error Analysis
We carried out an analysis of the outputs of our best-performing model, i.e. text2edges with pre-trained word embeddings. One factor that affects performance is the number of invalid sequences predicted by the text2nodes and text2edges models. An invalid sequence is a path that does not exist in the original graph. This happens because at each time step the decoder outputs a distribution over all nodes/edges rather than only over the possible children of the current node. We therefore counted the invalid sequences produced by the model. The percentage of invalid sequences is in the range 1.82%–8.50% (Appendix B), which is relatively low. A similar analysis was performed by Kusner et al. (2017), who use a context-free grammar to guarantee that the model always produces valid outputs; a similar method could be adopted in our work.
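A simple validity check of this kind can be sketched as follows for node paths; for edge paths one would first map the artificial symbols back to nodes. The function name is ours.

```python
# A predicted node path is valid only if every consecutive pair of nodes is an
# edge of the original graph.
import networkx as nx

def is_valid_path(tree: nx.DiGraph, predicted_nodes: list) -> bool:
    return all(tree.has_edge(u, v)
               for u, v in zip(predicted_nodes, predicted_nodes[1:]))
```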
[Figure 1: Top: frequency of path lengths in the training data. Bottom: mean length of the generated paths compared with the gold path length.]
Another factor that affects performance is the length of the generated paths, which is expected to match the length of the gold path. To test this, we compared the mean length of the generated sequences with the length of the gold paths (bottom of Figure 1). We also relate, on the training set, the lengths of the paths to their frequencies (top of Figure 1). We found that (1) the lengths of the generated paths are biased towards the more frequent path lengths in the training data, and (2) if a path length is infrequent in the training data, the model either under-generates or over-generates the length (Appendix D).
4 Related Work
Text-to-entity mapping is an essential component of many NLP tasks, e.g. fact verification (Thorne et al., 2018) or question answering (Yih et al., 2015). Previous work has approached this problem with pairwise learning-to-rank methods (Leaman et al., 2013) or phrase-based machine translation (Limsopatham and Collier, 2015). However, these methods generally ignore the structure of the ontology. More recent work has viewed text-to-entity mapping as the projection of a textual definition to a single point in a KG (Kartsaklis et al., 2018; Hill et al., 2015). Path-based models, such as the one we propose in this paper, instead map a textual definition to multiple relevant entities in an external resource. Despite potential advantages, such as being more interpretable and less brittle (the model predicts multiple related entities instead of one), path-based approaches have received relatively little attention.
5 Conclusion and Future Work
We presented a model that maps textual definitions to interpretable ontological pathways. We evaluated the proposed technique on seven semantic graphs, showing that it can perform competitively with respect to existing state-of-the-art text-to-entity systems, while being more interpretable and self-contained. We hope this work will encourage further research on path-based text-to-entity mapping algorithms. A natural next step will be to extend our framework to DAGs. Furthermore, we plan to constrain our model to always predict paths that exist in the graph, as we discussed above.
Acknowledgments
We would like to thank the anonymous reviewers for their comments. Also, we would like to thank Dimitri Kartsaklis and Ehsan Shareghi for helpful discussions and comments. This research was supported by an EPSRC Experienced Researcher Fellowship (N. Collier: EP/M005089/1) and an MRC grant (M.T. Pilehvar: MR/M025160/1). We gratefully acknowledge the donation of a GPU from the NVIDIA Grant Program.
References
- Ashburner et al. (2000) Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. 2000. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- Fellbaum (1998) Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. CoRR, abs/1607.00653.
- Hill et al. (2015) Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2015. Learning to understand phrases by embedding the dictionary. CoRR, abs/1504.00548.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
- Kartsaklis et al. (2018) Dimitri Kartsaklis, Mohammad Taher Pilehvar, and Nigel Collier. 2018. Mapping text to knowledge graph entities using multi-sense LSTMs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1959–1970. Association for Computational Linguistics.
- Kusner et al. (2017) Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. 2017. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning (ICML).
- Leaman et al. (2013) Robert Leaman, Rezarta Dogan, and Zhiyong Lu. 2013. DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics, 29.
- Limsopatham and Collier (2015) Nut Limsopatham and Nigel Collier. 2015. Adapting phrase-based machine translation to normalise medical terms in social media messages. CoRR, abs/1508.02285.
- Mao et al. (2018) Yuning Mao, Xiang Ren, Jiaming Shen, Xiaotao Gu, and Jiawei Han. 2018. End-to-end reinforcement learning for automatic taxonomy induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2462–2472. Association for Computational Linguistics.
- Robinson et al. (2008) Peter N. Robinson, Sebastian Köhler, Sebastian Bauer, Dominik Seelow, Denise Horn, and Stefan Mundlos. 2008. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. American Journal of Human Genetics, 83(5):610–615.
- Schriml et al. (2012) Lynn M. Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren A. Kibbe. 2012. Disease ontology: a backbone for disease semantic integration. In Nucleic Acids Research.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96. Association for Computational Linguistics.
- Speer et al. (2017) Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.
- Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. Fever: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819. Association for Computational Linguistics.
- Yih et al. (2015) Wen-Tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL.
Appendix A DAGs
Graphs | Mult.P% (nodes with multiple parents) | AV.P (average parents per such node)
---|---|---|
PATO | 31.29 | 2.97 |
WNanimal.n.01 | 0.88 | 2.00 |
WNplant.n.02 | 0.16 | 2.00 |
HDO | 16.23 | 2.13 |
HPO | 23.24 | 2.23 |
GO | 64.01 | 2.77 |
WNentity.n.01 | 1.91 | 2.03 |
Appendix B Invalid Sequences
Graphs | Invalid% | Ntotal |
---|---|---|
PATO | 1.82 | 110 |
WNanimal.n.01 | 4.56 | 263 |
WNplant.n.02 | 2.23 | 314 |
HDO | 4.02 | 622 |
HPO | 7.08 | 847 |
GO | 6.94 | 1845 |
WNentity.n.01 | 8.50 | 5191 |
Appendix C Settings for Models
BOW-LR:
To represent an ontology in a vector space we use node2vec (https://snap.stanford.edu/node2vec/). For all the graphs the following hyper-parameters of the algorithm are the same: walk-length=5, window-size=5 and iter=40. We set the number of dimensions to 128 for the PATO, WNanimal.n.01, WNplant.n.02, HDO and HPO graphs, and to 256 for the GO and WNentity.n.01 graphs. All other node2vec parameters are left at their defaults.
We do not modify the numberbatch embeddings (https://github.com/commonsense/conceptnet-numberbatch). If a word in a textual definition is missing from the embedding vocabulary, we initialise its embedding with zeros.
For all the graphs, to map the textual vector space to the ontology vector space, we use the linear regression model from the scikit-learn API (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
MS-LSTM:
There are only two hyper-parameters that we vary when embedding the ontology concepts: one whose values we report in the paper, and the embedding size of the concepts. We set the embedding size to 128 for the PATO, WNanimal.n.01, WNplant.n.02, HDO and HPO graphs, and to 256 for the GO and WNentity.n.01 graphs.
For all the graphs, the model is trained for 300 epochs, the word-embedding dimensionality is set to 64, and a bi-LSTM is used instead of an LSTM. The batch size is set to 16 and the number of latent dimensions in the bi-LSTM to 128 for the PATO, WNanimal.n.01, WNplant.n.02, HDO and HPO graphs; for the GO and WNentity.n.01 graphs we set these parameters to 128 and 256, respectively. All other hyper-parameters are left at their defaults.
When we use pre-trained word embeddings we reduce their dimensionality from 300 to 64 with PCA (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
Our Model:
For all the graphs, the model is trained for 300 epochs, the dimensionality of the word embeddings (and of the node/edge embeddings) is set to 64, and a bi-LSTM is used in the encoder and an LSTM in the decoder. The batch size is set to 16 and the number of latent dimensions in the bi-LSTM encoder and LSTM decoder to 128 for the PATO, WNanimal.n.01, WNplant.n.02, HDO and HPO graphs; for the GO and WNentity.n.01 graphs we set these parameters to 128 and 256, respectively. As the optimiser we use RMSProp (https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) with a learning rate of 0.001.
When we use pre-trained word embeddings we reduce their dimensionality from 300 to 64 with PCA (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
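A minimal sketch of this reduction step is shown below; the random matrix simply stands in for the 300-dimensional pre-trained embedding table.

```python
# Reduce pre-trained embeddings from 300 to 64 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

pretrained = np.random.randn(1000, 300)          # stand-in for the embedding table
reduced = PCA(n_components=64).fit_transform(pretrained)
print(reduced.shape)                             # (1000, 64)
```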
Appendix D Length of Generated Path
In Figures 1 and 2, the blue line indicates the ideal scenario, i.e. the mean length of the generated sequences equals the gold length; the black dots are the mean lengths of the decoded sequences and the red bars show the standard deviations. The general trend is the following: for short sequences the model generates (slightly) longer sequences than the gold standard, and for long sequences it generates (slightly) shorter ones. Another trend is that sequences of certain lengths match the gold standard closely. To understand why, one needs to look at the plot relating the length of a sequence in the training corpus to the frequency of that length. It becomes clear that there is a correlation between the two: the model tends to generate sequences of the lengths that are most frequent in the training data.
[Figure 2: For each of the seven graphs, the frequency of path lengths in the training data (top) and the mean length of the generated paths compared with the gold path length (bottom).]