Pre-trained language representation models (PLMs) such as ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019a) and XLNet (Yang et al., 2019) learn effective language representations from large-scale unstructured and unlabelled corpora and achieve strong performance on various NLP tasks. However, they typically lack factual world knowledge (Petroni et al., 2019; Logan et al., 2019).
To alleviate this issue, some recent knowledge-enhanced PLMs (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2019a) utilize entity embeddings from large-scale knowledge bases to provide external knowledge for PLMs and improve their performance on various NLP tasks. However, they have some issues: (1) They use fixed entity embeddings learned by a separate knowledge embedding (KE) algorithm, which cannot be easily aligned with the language representations because the two essentially live in different vector spaces. (2) They require an entity linker to link words in context to the corresponding entities before they can benefit from the entity embeddings, which makes them suffer from error propagation. (3) Their sophisticated mechanisms for retrieving and using entity embeddings lead to additional inference overhead compared with vanilla PLMs.
Indeed, knowledge embedding methods have a strong connection with NLP models. Many works integrate knowledge embeddings into NLP models to improve applications such as machine translation (Zaremoodi et al., 2018), reading comprehension (Mihaylov and Frank, 2018; Zhong et al., 2019) and dialogue systems (Madotto et al., 2018); conversely, some early works use text as additional information for KE (Xie et al., 2016; An et al., 2018) or jointly train knowledge and text embeddings in the same space (Wang et al., 2014; Toutanova et al., 2015; Han et al., 2016; Cao et al., 2017, 2018).
In this paper, we propose to learn knowledge embeddings and language representations with a unified model that encodes them into the same semantic space, which can not only better integrate knowledge into PLMs but also help learn more informative knowledge embeddings from effective language representations. We propose KEPLER, short for “a unified model for Knowledge Embedding and Pre-trained LanguagE Representation”. We collect informative textual descriptions for entities in the knowledge graph and utilize a typical PLM to encode the descriptions as text embeddings; we then treat the description embeddings as entity embeddings and optimize a KE objective on top of them. The key idea is to encode structural knowledge into the textual representations of entities using a PLM, which can generalize to entities unobserved in the knowledge graph.
Our KEPLER enjoys the following advantages: (1) We integrate world knowledge into PLMs under the supervision of the KE objective, which is more flexible for the PLMs, and encode entities and text into the same space, which avoids the gap between language representations and fixed entity embeddings. (2) We need neither an entity linker nor additional mechanisms to retrieve entity embeddings, which avoids error propagation and extra overhead. During inference, KEPLER is exactly the same as a standard PLM and can be adopted in a wide range of NLP applications. (3) Different from conventional KE methods, KEPLER encodes textual entity descriptions as entity embeddings, which enables knowledge embedding in the inductive setting (producing embeddings for unseen entities). This is especially useful for deployment, where the model may encounter unseen entities.
Existing KE datasets are relatively small-scale, which is not sufficient to pre-train a large model, and they typically lack description data and a data split for the inductive setting. Therefore, we construct Wikidata5m, a new large-scale knowledge graph dataset with an aligned text description for each entity. Wikidata5m is a subset of Wikidata (Vrandečić and Krötzsch, 2014), a free knowledge base with about sixty million entities. To ensure each entity is informative and the knowledge base is as clean as possible, we only select entities with corresponding Wikipedia pages. Wikidata5m contains five million entities and twenty million triplets. We also benchmark several classical KE methods on Wikidata5m to facilitate future research. To our knowledge, this is the first million-scale general knowledge graph dataset.
To summarize, our contribution is three-fold: (1) We propose to encode entities and text into the same space and jointly train the KE and language modeling objectives, obtaining a knowledge-enhanced PLM that avoids error propagation and additional overhead. Experimental results on various NLP tasks demonstrate the effectiveness of KEPLER. (2) We encode textual descriptions as entity embeddings, which improves KE with textual information and enables inductive KE. (3) We introduce Wikidata5m, a new large-scale knowledge graph dataset, which may promote research on large-scale knowledge graphs, inductive knowledge embedding and interactions between knowledge graphs and NLP.
2 Related Work
Pre-trained Language Model
There has been a long history of pre-training in NLP. Early works focus on distributed word representation (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014), many of which are still often adopted in current models as word embeddings for their ability to capture syntactic and semantic information from large-scale corpora. Peters et al. (2018b) push this trend a step forward by using a bidirectional LSTM to capture contextualized word embeddings (ELMo) for richer semantic meanings under different circumstances.
Apart from those methods using pre-trained word embeddings as input features, there is another trend exploring pre-trained encoders. Dai and Le (2015) first propose to train an auto-encoder on unlabeled data and then fine-tune it on downstream tasks. Howard and Ruder (2018) propose a universal language model (ULMFiT) based on AWD-LSTM (Merity et al., 2018). With the powerful Transformer (Vaswani et al., 2017) as the encoder, Radford et al. (2018) demonstrate a pre-trained generative model (GPT), while Devlin et al. (2019b) release a pre-trained deep Bidirectional Encoder Representations from Transformers (BERT) model, achieving state-of-the-art results on dozens of benchmarks.
After Devlin et al. (2019b), similar pre-trained encoders have sprung up. Yang et al. (2019) propose a permutation language model (XLNet) based on Transformer-XL (Dai et al., 2019). Later, Liu et al. (2019c) show that more data and more careful hyperparameter tuning benefit pre-trained encoders substantially and release a new state-of-the-art model (RoBERTa). Other works explore adding more tasks (Liu et al., 2019b) and more parameters (Lan et al., 2019; Raffel et al., 2019) to pre-trained models.
Recently some works attempt to incorporate knowledge information in pre-training. Zhang et al. (2019) introduce pre-processed knowledge embeddings into the Transformer architecture of BERT (Devlin et al., 2019b). With similar ideas, Peters et al. (2019) incorporate an integrated entity linker in their models. Besides, Logan et al. (2019); Hayashi et al. (2019) utilize relations between entities inside one sentence to help train better generation models. Despite the promising results those methods bring with knowledge-enhanced techniques, they either use fixed external knowledge information or have complex structures or pipelines to handle entities within sentences.
Knowledge Graph Embeddings
In recent years, knowledge embeddings have been extensively studied through predicting missing links in graphs. Conventional models define score functions for relation triples and predict head or tail entities by scoring candidate entities. For example, TransE (Bordes et al., 2013) treats tail entities as translations of head entities, DistMult (Yang et al., 2015) uses matrix multiplications as the score function, ComplEx (Trouillon et al., 2016) extends it with complex-valued operations, and RotatE (Sun et al., 2019a) combines the advantages of both.
Among these works, Xie et al. (2016) propose to utilize entity descriptions as an external information source and introduce an entity description encoder to enhance the TransE score function. Though similar to our method, Xie et al. (2016) aim at utilizing entity descriptions to help knowledge representation learning, while we take entity descriptions as a tool to incorporate external knowledge in our model.
3 KEPLER Model
In this section, we introduce the structure of our KEPLER model, and how we combine two training goals of masked language modeling and knowledge representation learning.
3.1 Training Objectives
We jointly optimize the two objectives:

$\mathcal{L} = \mathcal{L}_{KE} + \mathcal{L}_{MLM}$,

where $\mathcal{L}_{KE}$ represents the knowledge embedding loss and $\mathcal{L}_{MLM}$ represents the masked language model loss. Since our PLMs are involved in both tasks, jointly optimizing the two objectives can implicitly integrate knowledge from external graphs into the text encoders, while keeping the strong abilities of PLMs for syntactic and semantic understanding.
More specifically, for $\mathcal{L}_{KE}$ we adopt a general negative sampling format, in which the score of the correct triple $(h, r, t)$ from the knowledge graph is optimized against the scores of negatively sampled triples $(h'_i, r, t'_i)$; $d_r(h, t)$ denotes the score function, for which we have many choices. Different from conventional knowledge embedding methods, for the entity embeddings $\mathbf{h}$ and $\mathbf{t}$, instead of looking them up in embedding tables, we use PLMs as text encoders to extract entity representations from their descriptions.
For $\mathcal{L}_{MLM}$, many pre-trained language representation objectives can be used, e.g., the masked language model (Devlin et al., 2019b). Note that the two tasks only share the text encoder, and for each mini-batch, the text sampled for $\mathcal{L}_{KE}$ and $\mathcal{L}_{MLM}$ is not (necessarily) the same.
3.2 Model Details
Though we have many alternative model structures and training objectives to choose from under the KEPLER framework, for clarity we introduce the specific one used in our experiments.
We use the Transformer architecture (Vaswani et al., 2017) as in Devlin et al. (2019b) and Liu et al. (2019c), which we will not describe in detail. To be specific, we use the RoBERTaBASE code and checkpoints (https://github.com/pytorch/fairseq) in all our experiments, since it is one of the state-of-the-art pre-trained models with acceptable computing requirements. Besides the training data and hyperparameters, one of the major differences between RoBERTa and BERT is that RoBERTa uses Byte-Pair Encoding (BPE) (Sennrich et al., 2016) to better tokenize rare words.
Given a sequence of tokens $x_1, x_2, \ldots, x_n$, the input format is [CLS], $x_1, x_2, \ldots, x_n$, [EOS], where [CLS] and [EOS] are two special tokens. The model output at [CLS] is often used as the sentence representation.
Inspired by BERT (Devlin et al., 2019b), MLM randomly selects 15% of the input tokens; among the selected tokens, 80% are masked with the special token [MASK], 10% are replaced by another random token, and the remaining 10% stay unchanged. Under MLM, the model tries to predict the correct tokens, and a cross-entropy loss is calculated over the selected positions.
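The BERT-style masking scheme (selecting 15% of positions and corrupting them with the 80/10/10 rule) can be sketched in plain Python. This is a didactic stand-in with names of our own: real implementations operate on subword IDs in batched tensors.

```python
import random

MASK = "[MASK]"

def mlm_mask(tokens, vocab, select_prob=0.15, rng=None):
    """Select ~15% of positions; of those, 80% -> [MASK], 10% -> a
    random vocabulary token, and the remaining 10% stay unchanged.
    Returns the corrupted sequence and the selected positions
    (where the cross-entropy loss would be computed)."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < select_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                corrupted[i] = MASK
            elif r < 0.9:                    # 10%: random replacement
                corrupted[i] = rng.choice(vocab)
            # remaining 10%: keep the original token
    return corrupted, targets
```

The model then predicts the original token at each position in `targets`, while unselected positions contribute nothing to the loss.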
We adopt the pre-trained RoBERTaBASE checkpoint to initialize our model. However, we still keep MLM as one of our objectives to avoid catastrophic forgetting (McCloskey and Cohen, 1989) while training towards the KE loss. Note that experiments show that further pre-training from the RoBERTaBASE checkpoint alone does not bring improvement, suggesting that the combination of the two tasks contributes most to the performance.
For the KE objective, we use the following loss with negative sampling:

$\mathcal{L}_{KE} = -\log \sigma\big(\gamma - d_r(h, t)\big) - \sum_{i=1}^{n} \frac{1}{n} \log \sigma\big(d_r(h'_i, t'_i) - \gamma\big)$,

where $(h, r, t)$ is the correct triple, $(h'_i, r, t'_i)$ are negatively sampled triples, $\gamma$ is the margin, $\sigma$ is the sigmoid function, and $d_r$ is the score function, for which we choose to follow TransE (Bordes et al., 2013) for its simplicity and efficiency:

$d_r(h, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_p$,

where we take the norm $p$ as 1. Due to the limit of computing resources, we take the negative sampling size as 1. The negative sampling policy is to fix the head entity and randomly sample a tail entity, and vice versa.
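As a didactic sketch, the TransE score with the L1 norm and the sigmoid-based margin loss can be written in a few lines of plain Python. The vectors below are toy values of our own; actual training computes this over batched tensors and backpropagates through the text encoder.

```python
import math

def transe_score(h, r, t, p=1):
    """TransE energy d_r(h, t) = ||h + r - t||_p over plain Python
    lists; lower means more plausible. We use p = 1 as in the paper."""
    diffs = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if p == 1:
        return sum(abs(d) for d in diffs)
    return sum(abs(d) ** p for d in diffs) ** (1.0 / p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ke_loss(pos, negs, gamma=1.0):
    """Negative sampling loss: push the positive triple's score below
    the margin and the negatives' scores above it. `pos` is an
    (h, r, t) triple of vectors; `negs` is a list of corrupted triples
    (of size 1 in our setting)."""
    h, r, t = pos
    loss = -math.log(sigmoid(gamma - transe_score(h, r, t)))
    for hn, rn, tn in negs:
        loss -= math.log(sigmoid(transe_score(hn, rn, tn) - gamma)) / len(negs)
    return loss
```

A triple whose tail really is the head translated by the relation gets score 0 and thus a small loss; corrupting either endpoint increases the score and the loss.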
Different from conventional KE methods, we do not have an entity embedding lookup table. Instead, we use our KEPLER model to encode the corresponding entity descriptions and take the [CLS] outputs as the entity embeddings.
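Schematically, an entity embedding is simply the [CLS] output of the encoder run over the entity's description. The sketch below uses a toy deterministic stand-in for the encoder (the arithmetic is invented purely so the example runs; in KEPLER the encoder is the pre-trained Transformer):

```python
def encode(tokens, dim=4):
    """Toy stand-in for the PLM encoder: one vector per input position.
    A real PLM (e.g. RoBERTa) would go here; the per-token arithmetic
    below is made up so the example is deterministic and runnable."""
    outs = []
    for pos, tok in enumerate(tokens):
        base = (sum(ord(c) for c in tok) % 97) / 97.0
        outs.append([base + 0.01 * pos * (j + 1) for j in range(dim)])
    return outs

def entity_embedding(description_tokens):
    """KEPLER-style entity embedding: wrap the description with the
    special tokens and take the encoder output at the [CLS] position."""
    tokens = ["[CLS]"] + list(description_tokens) + ["[EOS]"]
    return encode(tokens)[0]   # position 0 is [CLS]
```

Because the embedding is recomputed from text, any entity with a description, seen during training or not, gets a representation, which is what enables the inductive setting.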
3.3 Downstream Tasks
Like all BERT-like models, we fine-tune KEPLER on downstream tasks and use the [CLS] output for sentence-level prediction and the outputs of all tokens for sequence labelling tasks (Devlin et al., 2019b). For supervised relation extraction and few-shot relation extraction, we follow the approaches of Baldini Soares et al. (2019) and Gao et al. (2019), respectively.
4 Wikidata5m

We construct Wikidata5m, a new large-scale knowledge graph dataset with aligned text descriptions. Our dataset is built by integrating Wikidata (Vrandečić and Krötzsch, 2014), a large-scale open knowledge base, with Wikipedia. Each entity in the knowledge graph is aligned with its text description from Wikipedia pages. In the following sections, we first introduce the data collection steps, and then give benchmarks of popular KE methods on this dataset.
4.1 Data Collection
We pull the latest dumps of Wikidata (https://www.wikidata.org) and Wikipedia (https://en.wikipedia.org) from their websites. We remove pages whose first paragraphs contain fewer than 5 words. For each entity, we align it to a Wikipedia page with the MediaWiki wbgetentities action API, and extract the first section of the Wikipedia page as the entity description. Entities that have no corresponding Wikipedia pages are discarded.
To construct the knowledge graph, we retrieve all the statements in entity pages, and map the entities and relations in statements to their canonical IDs in Wikidata. A statement is considered a valid triplet if both of its entities can be aligned with Wikipedia pages and its relation has a non-empty page in Wikidata. The final knowledge graph dataset contains 4,813,455 entities, 822 relations and 21,344,269 triplets, where each entity has a text description. Statistics of our Wikidata5m dataset and four widely-used datasets are shown in Table 1. Top-5 entity categories are listed in Table 3. We can see that our Wikidata5m is much larger than existing knowledge graph datasets, covering all sorts of domains.
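The validity rule above amounts to a simple join between aligned entities and known relations. A schematic version (the entity/relation IDs and the input format here are made up for illustration):

```python
def build_triplets(statements, aligned_entities, relation_pages):
    """Keep (head, relation, tail) statements where both entities were
    aligned to a Wikipedia page and the relation has a non-empty
    Wikidata page; everything else is discarded."""
    triplets = []
    for head, rel, tail in statements:
        if head in aligned_entities and tail in aligned_entities \
                and rel in relation_pages:
            triplets.append((head, rel, tail))
    return triplets
```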
4.2 Data Split
The data split statistics for the conventional transductive setting are also shown in Table 1.
In this work, we also evaluate models in the challenging inductive setting, which requires models to produce embeddings for entities not seen at training time and to perform link prediction for these unseen entities. We therefore provide a data split for inductive evaluation, whose statistics are shown in Table 2. In the inductive setting, the entities and triplets in the training, validation and test sets are mutually disjoint, while in the transductive setting, only the triplet sets are mutually disjoint.
| Model | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|---|---|---|---|---|---|
| TransE (Bordes et al., 2013) | 109370 | 0.253 | 0.170 | 0.311 | 0.392 |
| DistMult (Yang et al., 2015) | 211030 | 0.253 | 0.208 | 0.278 | 0.334 |
| ComplEx (Trouillon et al., 2016) | 244540 | 0.281 | 0.228 | 0.310 | 0.373 |
| SimplE (Kazemi and Poole, 2018) | 115263 | 0.296 | 0.252 | 0.317 | 0.377 |
| RotatE (Sun et al., 2019b) | 89459 | 0.290 | 0.234 | 0.322 | 0.390 |

Table 4: Benchmarks of popular KE methods on Wikidata5m.
To assess the challenges of Wikidata5m, we benchmark several popular knowledge graph embedding models on the dataset. Since conventional knowledge graph embedding models are inherently transductive, we split the triplets of the knowledge graph into training, validation and test sets. Each model is trained on the training set and evaluated on the link prediction task.
We evaluate five knowledge graph embedding models, including TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), SimplE (Kazemi and Poole, 2018) and RotatE (Sun et al., 2019b). Because their original implementations do not scale to Wikidata5m, we benchmark these methods using the multi-GPU implementation in GraphVite (Zhu et al., 2019). The performance of link prediction is evaluated in the filtered setting, where test triplets are ranked against all candidate triplets that are not observed in the knowledge graph. We report the standard metrics of Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hits at N (HITS@N).
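The filtered ranking protocol can be sketched as follows (plain Python, tail prediction only; `score` is a pluggable score function of our own naming, with lower = more plausible, as in TransE):

```python
def filtered_tail_ranks(test_triples, all_true, entities, score):
    """For each test triple (h, r, t), rank t among all entities by
    score(h, r, e), skipping candidates e != t where (h, r, e) is a
    known true triple (the 'filtered' setting)."""
    ranks = []
    for h, r, t in test_triples:
        gold = score(h, r, t)
        rank = 1
        for e in entities:
            if e == t or (h, r, e) in all_true:
                continue              # filter out other true triples
            if score(h, r, e) < gold:
                rank += 1
        ranks.append(rank)
    return ranks

def metrics(ranks):
    """Mean Rank, Mean Reciprocal Rank and HITS@N over a rank list."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        "HITS@1": sum(r <= 1 for r in ranks) / n,
        "HITS@10": sum(r <= 10 for r in ranks) / n,
    }
```

Head prediction works symmetrically by fixing (r, t) and ranking candidate heads; reported numbers average the two directions.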
Table 4 shows the benchmarks of popular methods on Wikidata5m.
5 Experiments

In this section, we introduce the experimental settings and results of KEPLER on various NLP and KE tasks.
5.1 Pre-training Settings
In experiments, we choose RoBERTa (Liu et al., 2019c) as our base model and implement our method in the fairseq framework (Ott et al., 2019) for pre-training. Due to limited computing resources, we use the BASE architecture and initialize our model with the released roberta.base parameters (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md).
5.2 NLP Tasks
In this section, we introduce how our KEPLER can be used as a knowledge-enhanced PLM on various NLP tasks and its performance compared with state-of-the-art models.
Relation classification is an important NLP task that requires models to classify the relation type between two given entities from text. We evaluate our model and baselines on two commonly-used datasets: TACRED (Zhang et al., 2017) and FewRel (Han et al., 2018). TACRED covers 42 relation types and contains 106,264 sentences. FewRel is a few-shot relation classification dataset, which has 100 relations and 700 instances for each relation.
Here we follow the relation extraction fine-tuning procedure from Zhang et al. (2019), where four special tokens are added before and after entity mentions in the sentence to highlight where the entities are. Then we take the [CLS] output as the sentence representation for classification.
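The marker insertion step can be illustrated as follows (the marker strings and the helper name are placeholders of our own; in practice the four markers are special tokens in the model vocabulary):

```python
def add_entity_markers(tokens, head_span, tail_span,
                       head_marks=("[E1]", "[/E1]"),
                       tail_marks=("[E2]", "[/E2]")):
    """Insert special tokens before and after the two entity mentions.
    Spans are (start, end) token indices with end exclusive; insertions
    are done right-to-left so earlier indices stay valid."""
    out = list(tokens)
    spans = sorted([(head_span, head_marks), (tail_span, tail_marks)],
                   key=lambda x: x[0][0], reverse=True)
    for (start, end), (open_m, close_m) in spans:
        out.insert(end, close_m)
        out.insert(start, open_m)
    return out
```

The marked sequence is then fed to the encoder and the [CLS] output is classified.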
[Table 6: FewRel accuracies under the 5-way 1-shot, 5-way 5-shot, 10-way 1-shot and 10-way 5-shot settings.]
Table 5 shows the results of various models on TACRED, from which we can see that our model achieves state-of-the-art results on this benchmark. Note that some baselines use the LARGE version of pre-trained language models while we use the BASE architecture. We gain a large improvement over our base model (RoBERTaBASE) while remaining slightly ahead of other competitive methods (even those using a LARGE architecture).
Our model also shows strength on the FewRel dataset. We use Prototypical Networks (Snell et al., 2017) and PAIR (Gao et al., 2019) as the base frameworks and try different pre-trained models as encoders. As shown in Table 6, for both frameworks, our models outperform the others. We also compare with the current state-of-the-art MTB (Baldini Soares et al., 2019), which outperforms ours by a small margin. Note, however, that MTB uses the LARGE version of BERT while we use the BASE version, and that it introduces a new pre-training task specifically targeting relation extraction, whereas ours is a general way to combine knowledge and natural language that can benefit all knowledge-related tasks.
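Prototypical Networks reduce N-way K-shot classification to nearest-prototype search: each relation's prototype is the mean of its K support embeddings, and a query is assigned to the closest prototype. A minimal sketch in plain Python (Euclidean distance; in the actual pipeline the embeddings come from the PLM encoder):

```python
def prototype(vectors):
    """Class prototype: the element-wise mean of support embeddings."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

def classify(query, support):
    """N-way classification: return the label of the nearest prototype.
    `support` maps label -> list of K support embeddings."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    protos = {label: prototype(vecs) for label, vecs in support.items()}
    return min(protos, key=lambda label: sq_dist(query, protos[label]))
```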
Entity typing requires models to classify given entity mentions into pre-defined entity types. For this task, we evaluate all the models on OpenEntity (Choi et al., 2018) following the setting from Zhang et al. (2019), which focuses on nine general entity types.
Evaluation results are shown in Table 7. We achieve better results than RoBERTa, while ERNIE and KnowBERT show slightly better results than ours. This is mainly because we extract entity representations differently: KnowBERT adds special tokens before and after the mention and uses the output at the token before the mention as the representation for typing, while ours, for now, directly uses [CLS]. We will try this better way of extracting entity representations in the future.
5.3 Knowledge Embedding
In this section, we show how KEPLER works as a KE model and evaluate it on our Wikidata5m dataset in the inductive setting.
We do not use existing KE benchmarks because they lack high-quality text descriptions for their entities and do not have a reasonable data split for the inductive setting.
6 Conclusion and Future Work
In this paper, we propose KEPLER, a unified model for knowledge embedding and pre-trained language representation. We jointly train the knowledge embedding and language representation objectives on top of the language representation model. Experimental results on extensive tasks demonstrate the effectiveness of our model.
In the future, we will: (1) Evaluate whether our model can recall factual knowledge with more tasks. (2) Try variations of the existing model, such as highlighting entity mentions in descriptions or changing the form of the knowledge embedding, to better understand how KEPLER works and bring further improvements on downstream tasks.
- An et al. (2018) Bo An, Bo Chen, Xianpei Han, and Le Sun. 2018. Accurate text-enhanced knowledge graph representation learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 745–755, New Orleans, Louisiana. Association for Computational Linguistics.
- Baldini Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS, pages 2787–2795.
- Cao et al. (2018) Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 227–237, Brussels, Belgium. Association for Computational Linguistics.
- Cao et al. (2017) Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1623–1633, Vancouver, Canada. Association for Computational Linguistics.
- Choi et al. (2018) Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96, Melbourne, Australia. Association for Computational Linguistics.
- Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.
- Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Proceedings of NIPS, pages 3079–3087.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
- Devlin et al. (2019a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.
- Devlin et al. (2019b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019b. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Gao et al. (2019) Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6251–6256, Hong Kong, China. Association for Computational Linguistics.
- Han et al. (2016) Xu Han, Zhiyuan Liu, and Maosong Sun. 2016. Joint representation learning of text and knowledge for knowledge graph completion. arXiv preprint arXiv:1611.04125.
- Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809, Brussels, Belgium. Association for Computational Linguistics.
- Hayashi et al. (2019) Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2019. Latent relation language models. arXiv preprint arXiv:1908.07690.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Kazemi and Poole (2018) Seyed Mehran Kazemi and David Poole. 2018. Simple embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pages 4284–4295.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Liu et al. (2019a) Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2019a. K-bert: Enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606.
- Liu et al. (2019b) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.
- Liu et al. (2019c) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Logan et al. (2019) Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack’s wife hillary: Using knowledge graphs for fact-aware language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5962–5971, Florence, Italy. Association for Computational Linguistics.
- Madotto et al. (2018) Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478, Melbourne, Australia. Association for Computational Linguistics.
- McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
- Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing lstm language models. In Proceedings of ICLR.
- Mihaylov and Frank (2018) Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832, Melbourne, Australia. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Peters et al. (2017) Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765, Vancouver, Canada. Association for Computational Linguistics.
- Peters et al. (2018a) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
- Peters et al. (2018b) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Peters et al. (2019) Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.
- Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
- Sun et al. (2019a) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019a. RotatE: Knowledge graph embedding by relational rotation in complex space. In Proceedings of ICLR.
- Sun et al. (2019b) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019b. Rotate: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.
- Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Lisbon, Portugal. Association for Computational Linguistics.
- Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning, pages 2071–2080.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledge base. Communications of the ACM, 57(10):78–85.
- Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591–1601, Doha, Qatar. Association for Computational Linguistics.
- Xie et al. (2016) Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2659–2665. AAAI Press.
- Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of ICLR.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Zaremoodi et al. (2018) Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 656–661, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2017) Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.
- Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.
- Zhong et al. (2019) Wanjun Zhong, Duyu Tang, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2019. Improving question answering by commonsense-based pre-training. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 16–28. Springer.
- Zhu et al. (2019) Zhaocheng Zhu, Shizhen Xu, Jian Tang, and Meng Qu. 2019. Graphvite: A high-performance cpu-gpu hybrid system for node embedding. In The World Wide Web Conference, pages 2494–2504. ACM.