Self-supervised neural models like ELMo Peters et al. (2018), BERT Devlin et al. (2019); Liu et al. (2019b), GPT Radford et al. (2018, 2019), or XLNet Yang et al. (2019) have rendered language modeling a very suitable pretraining task for learning language representations that are useful for a wide range of language understanding tasks Wang et al. (2018, 2019). Although shown versatile w.r.t. the types of knowledge Rogers et al. (2020) they encode, much like their predecessors – static word embedding models Mikolov et al. (2013); Pennington et al. (2014) – neural LMs still only “consume” the distributional information from large corpora. Yet, a number of structured knowledge sources exist – knowledge bases (KBs) Suchanek et al. (2007); Auer et al. (2007) and lexico-semantic networks Miller (1995); Liu and Singh (2004); Navigli and Ponzetto (2010) – encoding many types of knowledge that are underrepresented in text corpora.
Starting from this observation, most recent efforts have focused on injecting factual Zhang et al. (2019); Liu et al. (2019a); Peters et al. (2019) and linguistic knowledge Lauscher et al. (2019); Peters et al. (2019) into pretrained LMs and demonstrated the usefulness of such knowledge in language understanding tasks Wang et al. (2018, 2019). Joint pretraining models, on the one hand, augment distributional LM objectives with additional objectives based on external resources Yu and Dredze (2014); Nguyen et al. (2016); Lauscher et al. (2019) and train the extended model from scratch. For models like BERT, this implies computationally expensive retraining of the encoding transformer network from scratch. Post-hoc fine-tuning models Zhang et al. (2019); Liu et al. (2019a); Peters et al. (2019), on the other hand, use the objectives based on external resources to fine-tune the encoder’s parameters, pretrained via distributional LM objectives. If the amount of fine-tuning data is substantial, however, this approach may lead to (catastrophic) forgetting of distributional knowledge obtained in pretraining Goodfellow et al. (2014); Kirkpatrick et al. (2017).
In this work, similar to the concurrent work of Wang et al. (2020), we resort to the recently proposed adapter-based fine-tuning paradigm Rebuffi et al. (2018); Houlsby et al. (2019), which remedies the shortcomings of both joint pretraining and standard post-hoc fine-tuning. Adapter-based training injects additional parameters into the encoder and only tunes their values: original transformer parameters are kept fixed. Because of this, adapter training preserves the distributional information obtained in LM pretraining, without the need for any distributional (re-)training. While Wang et al. (2020) inject factual knowledge from Wikidata Vrandečić and Krötzsch (2014) into BERT, in this work, we investigate two resources that are commonly assumed to contain general-purpose and common sense knowledge (our results in §3.2 scrutinize this assumption): ConceptNet Liu and Singh (2004); Speer et al. (2017) and the Open Mind Common Sense (OMCS) corpus Singh et al. (2002), from which the ConceptNet graph was (semi-)automatically extracted. For our first model, dubbed CN-Adapt, we first create a synthetic corpus by randomly traversing the ConceptNet graph and then learn adapter parameters with masked language modelling (MLM) training Devlin et al. (2019) on that synthetic corpus. For our second model, named OM-Adapt, we learn the adapter parameters via MLM training directly on the OMCS corpus.
We evaluate both models on the GLUE benchmark, where we observe limited improvements over BERT on a subset of GLUE tasks. However, a more detailed inspection reveals large improvements over the base BERT model (up to 20 Matthews correlation points) on language inference (NLI) subsets labeled as requiring World Knowledge or knowledge about Named Entities. Investigating further, we relate this result to the fact that ConceptNet and OMCS contain much more of what downstream tasks treat as factual world knowledge than of what they judge to be common sense knowledge. Our findings pinpoint the need for more detailed analyses of compatibility between (1) the types of knowledge contained in external resources and (2) the types of knowledge that benefit concrete downstream tasks, within the emerging body of work on injecting knowledge into pretrained transformers.
2 Knowledge Injection Models
In this work, we primarily set out to investigate whether injecting specific types of knowledge (given in the external resource) benefits downstream inference that clearly requires those exact types of knowledge. Because of this, we resort to arguably the most straightforward mechanisms for injecting the ConceptNet and OMCS information into BERT, and leave the exploration of potentially more effective knowledge injection objectives for future work. We inject the external information into adapter parameters of the adapter-augmented BERT Houlsby et al. (2019) via BERT’s natural objective – masked language modelling (MLM). OMCS, already a corpus in natural language, can be used for MLM training directly, after we filtered out non-English sentences. To subject ConceptNet to MLM training, we need to transform it into a (synthetic) corpus.
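To make the MLM objective concrete, the following is a minimal sketch of the token corruption step as described for the original BERT recipe (Devlin et al., 2019): 15% of positions are selected for prediction, and of those 80% are replaced by a mask symbol, 10% by a random token, and 10% are left unchanged. We do not state these rates for our adapter training in this section, so they are assumptions here, and the function name `mask_tokens` and the toy vocabulary are purely illustrative.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None at unselected positions."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Toy usage on one sentence of the kind found in the synthetic corpus.
sentence = "christianity is part of religion .".split()
vocab = sorted(set(sentence))
print(mask_tokens(sentence, vocab, random.Random(1)))
```

In adapter training, only the adapter parameters would receive gradient updates from the loss computed at the selected positions, while the original transformer parameters stay fixed.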
Following established previous work (Perozzi et al., 2014; Ristoski and Paulheim, 2016), we induce a synthetic corpus from ConceptNet by randomly traversing its graph. We convert relation strings into NL phrases (e.g., synonyms to is a synonym of) and duplicate the object node of a triple, using it as the subject for the next sentence. For example, from the path “alcoholism stigma christianity religion” we create the text “alcoholism causes stigma. stigma is used in the context of christianity. christianity is part of religion.”. We fix the length of each walk to a set number of relations and sample the starting and neighboring nodes from uniform distributions. In total, we performed 2,268,485 walks, resulting in a corpus of 34,560,307 synthetic sentences.
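The corpus induction described above can be sketched as follows. This is a toy illustration under stated assumptions, not our actual pipeline: the four triples, the relation identifiers (`causes`, `hasContext`, `partOf`, `relatedTo`), the phrase templates in `PHRASES`, and the helper names are all hypothetical, and the walk length used in the example call is arbitrary.

```python
import random

# Toy ConceptNet-style triples (subject, relation, object); illustrative only.
TRIPLES = [
    ("alcoholism", "causes", "stigma"),
    ("stigma", "hasContext", "christianity"),
    ("christianity", "partOf", "religion"),
    ("religion", "relatedTo", "faith"),
]

# Hypothetical relation-to-phrase templates; the full mapping is not given here.
PHRASES = {
    "causes": "causes",
    "hasContext": "is used in the context of",
    "partOf": "is part of",
    "relatedTo": "is related to",
}


def build_graph(triples):
    """Map each subject node to its outgoing (relation, object) edges."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph


def random_walk_text(graph, walk_length, rng):
    """Sample one walk and verbalize it, one sentence per traversed edge."""
    node = rng.choice(sorted(graph))  # uniform choice of the start node
    sentences = []
    for _ in range(walk_length):
        edges = graph.get(node)
        if not edges:  # dead end: stop the walk early
            break
        rel, obj = rng.choice(edges)  # uniform choice among outgoing edges
        sentences.append(f"{node} {PHRASES[rel]} {obj}.")
        node = obj  # the object becomes the subject of the next sentence
    return " ".join(sentences)


graph = build_graph(TRIPLES)
print(random_walk_text(graph, walk_length=3, rng=random.Random(0)))
```

Reusing the object of one triple as the subject of the next sentence is what lets MLM training see multi-hop context in a single text window rather than isolated triples.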
We follow Houlsby et al. (2019) and adopt the adapter-based architecture for which they report solid performance across the board. We inject bottleneck adapters into BERT’s transformer layers. In each transformer layer, we insert two bottleneck adapters: one after the multi-head attention sub-layer and another after the feed-forward sub-layer. Let $X \in \mathbb{R}^{T \times H}$ be the sequence of contextualized vectors (of size $H$) for the input of $T$ tokens in some transformer layer, input to a bottleneck adapter. The bottleneck adapter, consisting of two feed-forward layers and a residual connection, yields the following output:
$$\mathrm{Adapter}(X) = X + f(X W_d + b_d)\, W_u + b_u$$
where $W_d$ (with bias $b_d$) and $W_u$ (with bias $b_u$) are the adapter’s parameters, that is, the weights of the linear down-projection and up-projection sub-layers, and $f$ is the non-linear activation function. Matrix $W_d \in \mathbb{R}^{H \times m}$ compresses vectors in $X$ to the adapter size $m < H$, and the matrix $W_u \in \mathbb{R}^{m \times H}$ projects the activated down-projections back to the transformer’s hidden size $H$.
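The bottleneck adapter described above (down-projection, non-linearity, up-projection, residual connection) can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than our implementation: the toy sizes `T`, `H`, `M`, the helper names, the Gaussian initialization, and the zero-initialized up-projection are assumptions made for the example. GELU is used as the activation, computed in its exact form via the error function.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    """Multiply an (n x k) by a (k x p) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def adapter(x, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: x + f(x W_down + b_down) W_up + b_up."""
    down = [[gelu(v + b) for v, b in zip(row, b_down)] for row in matmul(x, w_down)]
    up = matmul(down, w_up)
    return [
        [xi + ui + b for xi, ui, b in zip(x_row, u_row, b_up)]
        for x_row, u_row in zip(x, up)
    ]


def init_matrix(rows, cols, rng, scale):
    """Random Gaussian matrix as nested lists (illustrative initialization)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for readability, not the configuration used in the paper.
T, H, M = 3, 8, 2
rng = random.Random(0)
x = init_matrix(T, H, rng, 1.0)
w_down, b_down = init_matrix(H, M, rng, 0.01), [0.0] * M
w_up, b_up = [[0.0] * H for _ in range(M)], [0.0] * H  # zero up-projection

out = adapter(x, w_down, b_down, w_up, b_up)
print(out == x)  # with a zero up-projection the adapter is the identity map
```

Because of the residual connection, an adapter whose up-projection is zero leaves the transformer's computation untouched, which is one way to see why adapter training can start from the pretrained model's behavior.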
3 Evaluation
We first briefly describe the downstream tasks and training details, and then proceed with the discussion of results obtained with our adapter models.
3.1 Experimental Setup
We evaluate BERT and our two adapter-based models, CN-Adapt and OM-Adapt, with injected knowledge from ConceptNet and OMCS, respectively, on the tasks from the GLUE benchmark (Wang et al., 2018):
CoLA Warstadt et al. (2018): Binary sentence classification, predicting grammatical acceptability of sentences from linguistic publications;
SST-2 Socher et al. (2013): Binary sentence classification, predicting binary sentiment (positive or negative) for movie review sentences;
MRPC Dolan and Brockett (2005): Binary sentence-pair classification, recognizing sentences which are mutual paraphrases;
STS-B Cer et al. (2017): Sentence-pair regression task, predicting the degree of semantic similarity for a given pair of sentences;
QQP Chen et al. (2018): Binary classification task, recognizing question paraphrases;
MNLI Williams et al. (2018): Ternary natural language inference (NLI) classification of sentence pairs. Two test sets are given: a matched version (MNLI-m) in which the test domains match the domains from training data, and a mismatched version (MNLI-mm) with different test domains;
QNLI: A binary classification version of the Stanford Q&A dataset (Rajpurkar et al., 2016);
RTE Bentivogli et al. (2009): Another NLI dataset, binary entailment classification for sentence pairs;
Diag Wang et al. (2018): A manually curated NLI dataset, with examples labeled with specific types of knowledge needed for entailment decisions.
We inject our adapters into a BERT Base model (12 transformer layers with 12 attention heads each; hidden size 768) pretrained on lowercased corpora. Following Houlsby et al. (2019), we use the same size for all adapters and GELU Hendrycks and Gimpel (2016) as the adapter activation function. We train the adapter parameters with the Adam algorithm Kingma and Ba (2015), with learning-rate warm-up and weight decay. In downstream fine-tuning, we train in fixed-size batches and limit the input sequences to a maximum number of wordpiece tokens. For each task, we find the optimal hyperparameter configuration from a grid over the learning rate and the number of epochs.
3.2 Results and Analysis
Table 1 reveals the performance of CN-Adapt and OM-Adapt in comparison with BERT Base on GLUE evaluation tasks. We show the results for two snapshots of OM-Adapt, after 25K and 100K update steps, and for two snapshots of CN-Adapt, after 50K and 100K steps of adapter training. Overall, none of our adapter-based models with injected external knowledge from ConceptNet or OMCS yields significant improvements over BERT Base on GLUE. However, we observe substantial improvements (of around 3 points) on RTE and on the Diagnostics NLI dataset (Diag), which encompasses inference instances that require a specific type of knowledge.
Since our adapter models draw specifically on the conceptual knowledge encoded in ConceptNet and OMCS, we expect the positive impact of injected external knowledge – assuming effective injection – to be most observable on test instances that target the same types of conceptual knowledge. To investigate this further, we measure the model performance across different categories of the Diagnostic NLI dataset. This allows us to tease apart inference instances which truly test the efficacy of our knowledge injection methods. We show the results obtained on different categories of the Diagnostic NLI dataset in Table 2.
The improvements of our adapter-based models over BERT Base on these phenomenon-specific subsections of the Diagnostics NLI dataset are generally much more pronounced: e.g., OM-Adapt (25K) yields a 7% improvement on inference that requires factual or common sense knowledge (KNO), whereas CN-Adapt (100K) yields a 6% boost for inference that depends on lexico-semantic knowledge (LS). These results suggest that (1) ConceptNet and OMCS do contain the specific types of knowledge required for these inference categories and that (2) we managed to inject that knowledge into BERT by training adapters on these resources.
Fine-Grained Knowledge Type Analysis.
In our final analysis, we “zoom in” on our models’ performance on three fine-grained categories of the Diagnostics NLI dataset – inference instances that require Common Sense Knowledge (CS), World Knowledge (World), and knowledge about Named Entities (NE), respectively. The results for these fine-grained categories are given in Table 3.
These results show an interesting pattern: our adapter-based knowledge-injection models massively outperform BERT Base (up to 15 and 21 MCC points, respectively) for NLI instances labeled as requiring World Knowledge or knowledge about Named Entities. In contrast, we see drops in performance on instances labeled as requiring common sense knowledge. This initially came as a surprise, given the common belief that OMCS and ConceptNet contain so-called common sense knowledge. Manual scrutiny of the diagnostic test instances from both CS and World categories uncovers a noticeable mismatch between the kind of information that is considered common sense in KBs like ConceptNet and what is considered common sense knowledge in downstream tasks. In fact, the majority of information present in ConceptNet and OMCS falls under the World Knowledge definition of the Diagnostic NLI dataset, including factual geographic information (stockholm [partOf] sweden), domain knowledge (roadster [isA] car) and specialized terminology (indigenous [synonymOf] aboriginal). (We compare NLI examples from the World Knowledge and Common Sense categories in the Supplementary Material.) In contrast, many of the CS inference instances require complex, high-level reasoning, understanding metaphorical and idiomatic meaning, and making far-reaching connections. In such cases, explicit conceptual links often do not suffice for a correct inference and much of the required knowledge is not explicitly encoded in the external resources. Consider, e.g., the following CS NLI instance: [premise: My jokes fully reveal my character ; hypothesis: If everyone believed my jokes, they’d know exactly who I was ; entailment]. While ConceptNet and OMCS may associate character with personality or personality with identity, the knowledge that the phrase who I was may refer to identity is beyond these resources.
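For readers less familiar with the metric behind the “MCC points” reported above, the following sketch computes the Matthews correlation coefficient in its multiclass form from label counts; for two labels it reduces to the usual binary MCC. We assume here that one MCC point corresponds to 0.01 on the coefficient's scale. The function below is an illustration and not the evaluation script used for the reported numbers, and the example labels are made up.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true, y_pred):
    """Multiclass MCC computed from label counts; 0.0 when undefined."""
    s = len(y_true)  # total number of instances
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correct predictions
    t_counts, p_counts = Counter(y_true), Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(t_counts[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_counts[k] ** 2 for k in labels)
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Hypothetical gold and predicted NLI labels for six instances.
gold = ["entailment", "neutral", "contradiction", "entailment", "neutral", "entailment"]
pred = ["entailment", "neutral", "entailment", "entailment", "contradiction", "entailment"]
print(round(100 * matthews_corrcoef(gold, pred), 1))  # score in MCC points
```

Unlike accuracy, the coefficient is zero for a model that always predicts a single label, which makes it a stricter measure on the small, imbalanced diagnostic categories.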
4 Conclusion
We presented two simple strategies for injecting knowledge from ConceptNet and OMCS, respectively, into BERT via bottleneck adapters. Additional adapter parameters store the external knowledge and allow for the preservation of corpus knowledge in the original transformer parameters. We demonstrated the effectiveness of these models in language understanding settings that require precisely the type of knowledge one finds in ConceptNet and OMCS, in which our adapter-based models outperform BERT by up to 20 performance points. Our findings stress the importance of detailed analyses comparing the types of knowledge found in external sources and the types of knowledge needed in concrete reasoning tasks.
The work of Anne Lauscher and Goran Glavaš is supported by the Eliteprogramm of the Baden-Württemberg Stiftung (AGREE grant). Leonardo F. R. Ribeiro has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1. This work has been supported by the German Research Foundation within the project “Open Argument Mining” (GU 798/25-1), associated with the Priority Program “Robust Argumentation Machines (RATIO)” (SPP-1999).
- Dbpedia: a nucleus for a web of open data. In The semantic web, pp. 722–735. Cited by: §1.
- The fifth pascal recognizing textual entailment challenge. In TAC, Cited by: §3.1.
- SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14. Cited by: §3.1.
- Quora question pairs. Cited by: §3.1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §1.
- Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Cited by: §3.1.
- An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
- Gaussian error linear units (gelus). Cited by: §3.1.
- Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. Cited by: §1, §2, §2, §3.1.
- Adam: a method for stochastic optimization. In Proceedings of ICLR, Cited by: §3.1.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1.
- Informing unsupervised pretraining with external linguistic knowledge. arXiv preprint arXiv:1909.02339. Cited by: §1.
- ConceptNet—a practical commonsense reasoning tool-kit. BT technology journal 22 (4), pp. 211–226. Cited by: §1, §1.
- K-bert: enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606. Cited by: §1.
- RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
- WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
- BabelNet: building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, pp. 216–225. Cited by: §1.
- Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL, pp. 454–459. Cited by: §1.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1.
- DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp. 701–710. Cited by: §2.
- Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. Cited by: §1.
- Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 43–54. Cited by: §1.
- Improving language understanding by generative pre-training. OpenAI Technical Report. Cited by: §1.
- Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. Cited by: §3.1.
- Efficient parametrization of multi-domain deep neural networks. In CVPR, Cited by: §1.
- Rdf2vec: rdf graph embeddings for data mining. In International Semantic Web Conference, pp. 498–514. Cited by: §2.
- A primer in bertology: what we know about how bert works. arXiv preprint arXiv:2002.12327. Cited by: §1.
- Open mind common sense: knowledge acquisition from the general public. In OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, pp. 1223–1237. Cited by: §1.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §3.1.
- Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1.
- Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pp. 697–706. Cited by: §1.
- Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85. Cited by: §1.
- Superglue: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3261–3275. Cited by: §1, §1.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the BlackboxNLP Workshop, pp. 353–355. Cited by: §1, §1, §3.1, §3.1.
- K-adapter: infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808. Cited by: §1.
- Neural network acceptability judgments. arXiv preprint arXiv:1805.12471. Cited by: §3.1.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: §3.1.
- XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
- Improving lexical embeddings with semantic knowledge. In Proceedings of ACL, pp. 545–550. Cited by: §1.
- ERNIE: enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1441–1451. Cited by: §1.