Understanding Few-Shot Commonsense Knowledge Models

01/01/2021 ∙ by Jeff Da, et al. ∙ Allen Institute for Artificial Intelligence

Providing natural language processing systems with commonsense knowledge is a critical challenge for achieving language understanding. Recently, commonsense knowledge models have emerged as a suitable approach for hypothesizing situation-relevant commonsense knowledge on-demand in natural language applications. However, these systems are limited by the fixed set of relations captured by schemas of the knowledge bases on which they are trained. To address this limitation, we investigate training commonsense knowledge models in a few-shot setting with limited tuples per commonsense relation in the graph. We perform five separate studies on different dimensions of few-shot commonsense knowledge learning, providing a roadmap on best practices for training these systems efficiently. Importantly, we find that human quality ratings for knowledge produced from a few-shot trained system can achieve performance within 6% of knowledge produced from fully supervised systems. This few-shot performance enables coverage of a wide breadth of relations in future commonsense systems.


1 Introduction

Language understanding systems that are grounded to world knowledge remain elusive. While large-scale language models have led to considerable advances on popular NLP benchmarks (Wang2018GLUEAM; Wang2019SuperGLUEAS)

, the desired interplay between knowledge and language representations is not reliably captured by current leading training schemes for NLP systems. Consequently, practitioners turn to commonsense knowledge graphs to ground language with explicit rules about the world.

However, commonsense knowledge graphs are limited in breadth and diversity, and cannot be grown with high fidelity to the scale needed to model general purpose commonsense knowledge. Manual approaches to knowledge graph construction yield reliable and precise annotations of commonsense relationships, but are cost-prohibitive: human annotation is not cheap Speer2017ConceptNet5A; Sap2019ATOMICAA. Automatic, extraction-based methods achieve much greater scale Zhang2020TransOMCSFL, but yield tuples of lower precision and questionable fidelity (Hwang2020COMETATOMIC2O). Reporting bias gordon2013reporting — the idea that obvious details go unstated in text (grice1975logic) — limits the degree of useful commonsense knowledge that can be directly extracted from text.

More recently, commonsense knowledge models have emerged as a potential solution to this bottleneck Bosselut2019COMETCT. These models learn to represent knowledge graphs implicitly and accessibly. They are pretrained on large text corpora Radford2019LanguageMA; Lewis2020BARTDS; Raffel2019ExploringTL and then further fine-tuned on examples from a knowledge graph. This two stage process allows them to transfer implicit, but inaccessible, representations of knowledge learned from language (Petroni2019LanguageMA) to the task of hypothesizing declarative knowledge. Since their inception, they have become a popular mechanism for providing commonsense knowledge on-demand to downstream NLP systems chakrabarty-etal-2020-r; ammanabrolu2020automated; Kearns2020AWI; majumder-etal-2020-like.

While commonsense knowledge graphs cover a diverse set of heads and tails (Bosselut2019COMETCT), the graphs are restricted in the number of relations they are able to cover (each relation has thousands of tuples in the graph). Because of this, commonsense systems seeded from these graphs cannot generalize beyond the fixed set of relationships originally present in the KG. To reach the coverage needed for broad applicability for NLP systems, knowledge models must be able to rapidly ingest new relations of commonsense knowledge.

In response, we perform the first comprehensive study of the few-shot learning potential of commonsense knowledge models. We explore five different dimensions for knowledge model transfer in the few-shot setting: learning vs. context augmentation, model scale, input representation, example selection, and learning schedule. Along these axes, we provide novel insights into the few-shot learning behavior of knowledge models. Our empirical results showcase the following main takeaways:


  • Few-shot learning is more flexible than few-shot adaptation (§4.1), allowing a 16x smaller model to exceed the generalization performance of GPT-3 Brown2020LanguageMA.

  • Under similar experimental settings, commonsense knowledge models with more parameters generalize more effectively (§4.2).

  • Expressive prompting that uses language descriptions of relations yields more efficient transfer — 10x-100x fewer examples (§4.3).

  • When annotating examples for few-shot transfer, situation breadth (i.e., diverse knowledge tuple heads) is more beneficial than situation depth (i.e., multiple tails per head) (§5.1).

  • Pretraining on other relations improves zero-shot relation induction, but benefits subside when as few as 30 examples of a new relation are available for transfer (§5.2).

2 Background

In this work, we investigate commonsense knowledge model performance under different model and training settings for few-shot learning. Below, we describe background concepts that are helpful for contextualizing these analyses.

2.1 Commonsense Knowledge Graphs

Commonsense knowledge graphs are structured, relational representations of commonsense knowledge. In our study, we use the Atomic Hwang2020COMETATOMIC2O knowledge graph, a commonsense knowledge graph with 1.33M everyday inferential knowledge tuples about entities and events. Atomic represents a large-scale commonsense repository of textual descriptions that encode social and physical aspects of common human experiences. Across its 1.33M tuples, Atomic captures information about 23 relationship types: 9 relations about social interactions, 7 physical-entity commonsense relations, and 7 event-centered commonsense relations. Example head entities and relations can be found in Table 2. The Atomic knowledge graph is adversarially split into training, development, and test subsets such that no head entities in one set appear in any other. This property allows models trained on these resources to be evaluated on their capacity to generalize commonsense relationships to new entities.

2.2 Commonsense Knowledge Models

Commonsense knowledge models represent facts by learning to encode a commonsense knowledge graph (Bosselut2019COMETCT). They are seeded with language models and are provided knowledge tuples as training data to learn to hypothesize knowledge relationships through language generation. After training on a large collection of tuples from a knowledge graph, they learn the structure and relationships of that knowledge graph. Furthermore, because they are seeded with pretrained language models, they learn to generalize the relationships to other entities about which the language model implicitly encodes knowledge (Petroni2019LanguageMA; roberts-etal-2020-much). Consequently, they can be used to produce precise knowledge on-demand for any entity that can be expressed through language.

2.3 Few-Shot Learning vs. Augmentation

Recently, the term “few-shot” has taken two meanings: the classical definition of training on limited examples, and a new definition linked to few-shot context augmentation (Brown2020LanguageMA). In few-shot learning, language models are trained directly on a limited set of examples from the knowledge graph. In few-shot augmentation (or adaptation), models are given examples as a prepended augmentation of their context and can attend to these examples to recognize the structure of the task. While the results of few-shot adaptation have been impressive, classical few-shot learning has key advantages. Larger example sets will exceed maximum context windows for transformer language models, limiting the number of examples that can be used for few-shot adaptation. This bottleneck potentially lowers performance when additional training examples could be used for adaptation (§4.1).
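To make the distinction concrete, the sketch below (a minimal illustration, not the authors' released code) builds the two input formats for the same relation. The prompt wording loosely follows Table 1, and the example tuples are invented for illustration.

```python
# A minimal sketch contrasting few-shot learning with few-shot context
# augmentation. The prompt wording follows Table 1 (ObjectUse); the example
# tuples are illustrative, not drawn from Atomic.

PROMPT = "{head} is used for"   # natural language prompt for ObjectUse

def learning_example(head, tail):
    # Few-shot learning: each tuple becomes an (input, target) pair the model
    # is fine-tuned on directly.
    return {"input": PROMPT.format(head=head), "target": tail}

def augmentation_input(demos, query_head):
    # Few-shot augmentation: solved demonstrations are prepended to the
    # context and the frozen model completes the final query in one pass.
    lines = [f"{PROMPT.format(head=h)} {t}" for h, t in demos]
    lines.append(PROMPT.format(head=query_head))  # left for the model to finish
    return "\n".join(lines)

demos = [("video camera", "recording video"), ("hammer", "driving nails")]
print(learning_example("state highway", "driving between cities"))
print(augmentation_input(demos, "state highway"))
```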

Relation | Prompt
ObjectUse | ___ is used for
AtLocation | You are likely to find a ___ in a
xIntent | Because of ___, PersonX wanted
xWant | After ___, PersonX would want
xAttr | ___ is seen as
isAfter | Something that happens after ___ is
oWant | As a result of ___, others would want

Table 1: Examples of natural language prompts used to represent input for knowledge models. Prompts significantly speed up transfer in few-shot learning settings.
Head | Relation | Generated Tail (COMeT) | Generated Tail (GPT-3)
nail | AtLocation | construction site ✓ | wall ✓
state highway | ObjectUse | statewide transportation ✓ | drive a car ✗
video camera | ObjectUse | video recording ✓ | record ✓
PersonX takes it to the vet | HinderedBy | PersonX doesn’t have money to pay the vet ✓ | PersonX gets a new pet ✗
PersonX makes PersonY very sick | HinderedBy | PersonX isn’t close to PersonY. ✓ | PersonY is not sick ✓
PersonX finds another job | isAfter | PersonX leaves job ✓ | PersonX gets a new job ✗
PersonX falls ill | isBefore | PersonX feels sick ✓ | to be happy with personx ✗
PersonX gets a call for an interview | xAttr | qualified ✓ | hopeful ✓
PersonX wants to learn how to swim | xAttr | PersonX isn’t confident in the water ✓ | to be able to swim ✗
PersonX falls ill | xEffect | is hospitalized ✓ | they are under the weather ✓
PersonX sprays by a skunk | xEffect | PersonX will be sick ✓ | their smell ✓
PersonX works really hard | xIntent | to be appreciated ✓ | to be rewarded ✓
PersonX misses class | xNeed | to have a valid excuse ✓ | to be in class ✗
PersonX notices a strange smell | xWant | to investigate the smell ✓ | to get rid of it ✓
PersonX wants to learn how to sing | xWant | to take voice lessons ✓ | to learn how to sing ✗
Table 2: Examples of few-shot generations produced by COMeT (T5) and GPT-3. We find that COMeT (T5) is able to produce diverse and novel tail hypotheses despite learning from only a few examples.

3 Experimental Setup

To address the challenge of few-shot learning of knowledge models, we set up empirical studies to evaluate the effect of different modeling, training, and data considerations. In this section, we outline our final approach for training a few-shot learner. In the following sections (§4, §5), we describe studies that led to this system’s design.

3.1 Input

The schema of Atomic contains 23 relations across social and physical commonsense scenarios. Each link in the knowledge graph is composed of a {head h, relation r, tail t} triplet. The head and tail entities in the triplet are natural language words or phrases. Commonsense knowledge models are trained by providing the tokens of h and r as inputs to the model and learning to generate the tokens of t. In this work, rather than initializing r with a new token and random embedding Bosselut2019COMETCT, we automatically format the input tuples into natural language prompts to represent the relation (Feldman2019CommonsenseKM; jiang20tacl). Table 1 shows examples of such prompts.
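As an illustration, the sketch below shows one possible implementation of this formatting step. The template strings are assumptions modeled on Table 1 rather than the paper's verbatim prompts, and only a subset of the 23 relations is shown.

```python
# A rough sketch of prompt-based input formatting. The template strings are
# assumptions modeled on Table 1, not the authors' exact templates; {h} marks
# where the head entity is inserted.

PROMPTS = {
    "ObjectUse":  "{h} is used for",
    "AtLocation": "You are likely to find a {h} in a",
    "xIntent":    "Because of {h}, PersonX wanted",
    "xWant":      "After {h}, PersonX would want",
    "xAttr":      "{h} is seen as",
    "isAfter":    "Something that happens after {h} is",
    "oWant":      "As a result of {h}, others would want",
}

def to_seq2seq_example(head, relation, tail):
    """Serialize a {head, relation, tail} triple into an encoder input and a
    decoder target for a seq2seq knowledge model such as T5."""
    return {"input": PROMPTS[relation].format(h=head), "target": tail}

print(to_seq2seq_example("nail", "AtLocation", "wall"))
# {'input': 'You are likely to find a nail in a', 'target': 'wall'}
```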

3.2 Few-Shot Data Generation

We source training tuples from the Atomic training set Hwang2020COMETATOMIC2O. When constructing few-shot training sets, we set a target number of examples to randomly sample from the knowledge graph for each relation (i.e., examples per relation × number of relations = total training examples). Unless stated otherwise (§5.1), for each relation, examples are selected by randomly selecting head entities, and then selecting one tail entity (connected through that relation) for each head entity. This procedure ensures a diversity of head entities for training because each head can have multiple connected tail entities in the graph.
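A minimal sketch of this sampling procedure is shown below. It assumes the training split has been loaded as a list of (head, relation, tail) tuples; the function name and structure are illustrative, not the authors' released code.

```python
# Sketch of few-shot sampling: for each relation, draw k distinct head
# entities at random and keep one tail per head.
import random
from collections import defaultdict

def sample_few_shot(graph, k, seed=0):
    """graph: list of (head, relation, tail) tuples from the training split."""
    rng = random.Random(seed)
    by_relation = defaultdict(lambda: defaultdict(list))
    for h, r, t in graph:
        by_relation[r][h].append(t)

    few_shot = []
    for r, heads in by_relation.items():
        chosen_heads = rng.sample(list(heads), k)              # diverse heads
        for h in chosen_heads:
            few_shot.append((h, r, rng.choice(heads[h])))      # one tail per head
    return few_shot   # length == k * number of relations
```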

3.3 Training

Once a training set of examples is produced, the model is trained on this subset to minimize the negative log-likelihood of the tokens of the tail entity for each tuple. We use a constant learning rate of 0.001, a mini-batch size of 4, and train the model for 3 epochs. Unless stated otherwise (§4.2), we use T5-11B Raffel2019ExploringTL as a seed language model for all experiments. We checkpoint the model after each few-shot training epoch and select the best one based on training loss (a development set is available, but using it for early stopping would violate the few-shot setting).
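The sketch below outlines one way to implement this training loop with Hugging Face Transformers. It uses t5-small and AdamW purely for illustration (the paper fine-tunes T5-11B, and the optimizer is not specified here); the constant learning rate of 0.001, mini-batch size of 4, and 3 epochs follow the text.

```python
# Simplified few-shot fine-tuning sketch; model size and optimizer choice are
# assumptions made for illustration, not the authors' exact configuration.
import torch
from torch.utils.data import DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # constant learning rate

# Few-shot examples already serialized with prompts (see Section 3.1).
few_shot = [
    ("You are likely to find a nail in a", "wall"),          # AtLocation
    ("video camera is used for", "recording video"),         # ObjectUse
]
loader = DataLoader(few_shot, batch_size=4, shuffle=True)

model.train()
for epoch in range(3):
    for inputs, targets in loader:
        enc = tokenizer(list(inputs), return_tensors="pt", padding=True)
        labels = tokenizer(list(targets), return_tensors="pt", padding=True).input_ids
        labels[labels == tokenizer.pad_token_id] = -100       # ignore padding in the loss
        loss = model(**enc, labels=labels).loss               # NLL of the tail tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```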

3.4 Evaluation

We evaluate the knowledge hypothesized by the trained few-shot models using human and automatic evaluations. For the human evaluation (Accept % in Table 3), we use the procedure described in Hwang2020COMETATOMIC2O. We ask annotators to label the quality of the {given head, given relation, generated tail} tuple using a 4-point Likert scale. Our scale corresponds to the following assessments of plausibility about the generated tuple: {always/often true, sometimes/likely true, false/untrue, nonsensical}. We collect 3 annotations per relation, convert each annotation to an acceptability label (i.e., ✓, ✗), and use the majority label as the acceptability judgment. Evaluation agreement is measured using Fleiss's κ for the acceptability judgment (i.e., moderate agreement between raters). Due to the cost of human annotation, and the number of experiments pursued, we use automated metrics, such as BLEU Papineni2002BleuAM, METEOR Banerjee2005METEORAA, ROUGE-L Lin2004ROUGEAP, and CIDEr Vedantam2015CIDErCI, to report performance in most experiments.
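For concreteness, the snippet below sketches the aggregation of the three Likert annotations into a single acceptability judgment. The label strings are paraphrases of the 4-point scale above, and the mapping to accept/reject follows the description in the text.

```python
# Sketch of the acceptability aggregation: each Likert rating is mapped to
# accept/reject and the majority of the 3 votes is kept.
from collections import Counter

ACCEPT = {"always/often true": True, "sometimes/likely true": True,
          "false/untrue": False, "nonsensical": False}

def acceptability(ratings):
    """ratings: the 3 Likert judgments collected for one generated tuple."""
    votes = Counter(ACCEPT[r] for r in ratings)
    return votes.most_common(1)[0][0]          # majority label

print(acceptability(["always/often true", "nonsensical", "sometimes/likely true"]))
# True
```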

4 How do knowledge models learn?

Methodology | Model | BLEU-2 | METEOR | ROUGE-L | CIDEr | Accept %
zero-shot | GPT-2 XL | 2.8 | 8.2 | 9.8 | 4.7 | 36.6
few-shot | GPT-2 XL (augmentation) | 5.7 | 10.2 | 13.8 | 6.6 | 38.3
few-shot | GPT-3 (augmentation) | 15.3 | 18.2 | 25.5 | 17.5 | 73.0
few-shot | COMeT (T5) (learning) | 21.9 | 19.5 | 25.7 | 19.2 | 78.6
fully supervised | COMeT (GPT-2 XL) | 24.8 | 29.2 | 48.5 | 65.3 | 72.5
fully supervised | COMeT (BART) | 28.6 | 33.0 | 49.5 | 65.8 | 84.5
fully supervised | COMeT (T5) | 28.6 | 33.5 | 47.1 | 59.7 | 84.6

Table 3: Comparison between various methods of training knowledge models. Few-shot knowledge models transfer well in both the learning (COMeT) and augmentation (GPT-3) settings, suggesting that many of the beneficial effects of knowledge modeling can be achieved from limited example sets. (Footnote 2: In the few-shot setting, augmentation runs are primed with the examples for the relevant relation, while learning runs are given no priming but are instead trained on the examples.) (§4.1)
Figure 2: Effect of model size for commonsense knowledge modeling, comparing T5-Small (60M), T5-Large (770M), and T5-11B (11B parameters). The difference between model sizes is greatest in the few-shot settings.

In this section, we survey modeling and training considerations for few-shot commonsense knowledge models (e.g. training using the COMeT framework). First, we explore the general few-shot learning capability of large-scale language models. Then, we investigate the effect of model size on few-shot learning potential of language models. Finally, we explore the importance of natural language prompts for representing relations over symbolic representations.

4.1 Can commonsense knowledge models be trained in few-shots?

We find that knowledge models trained in a few-shot setting can learn to produce high-quality commonsense knowledge tuples (Table 3).

Motivation

Most work on commonsense knowledge modeling learns from knowledge graphs with a fixed schema and a limited number of relations (e.g., 9 in ATOMIC; Sap2019ATOMICAA, 23 in Atomic; Hwang2020COMETATOMIC2O). These fixed schemas force commonsense understanding to be achieved through a small number of relationships, limiting the types of knowledge that can be represented. Pretrained language models offer a promising solution. They learn to represent commonsense knowledge, and can be queried for it in a zero-shot manner by converting knowledge graph relations to natural language prompts Feldman2019CommonsenseKM. However, zero-shot induction of commonsense relationships from only language is not reliable (jiang20tacl). For knowledge models to achieve broad applicability, they must efficiently learn new relationships from few examples.

Experiments

We train COMeT models as described in §3. In this experiment, we set the number of training examples per relation to 5 (the number of adaptation examples often selected for GPT-3). For our few-shot augmentation baselines (e.g., GPT-{2,3}), we prepend the training examples (for the same relation) to the start of the input sequence. The baseline model conditions on these additional examples to generate the tokens of the predicted tail. For the fully supervised setting, we train on all tuples in Atomic. We also report the scores for fully supervised COMeT models from Hwang2020COMETATOMIC2O seeded with GPT-2 XL (Radford2019LanguageMA) and BART (Lewis2020BARTDS).

Findings

We find that both the few-shot adaptation and few-shot learning settings are able to produce high quality tuples for large models. Using only 5 tuples per relation, both GPT-3 and COMeT (T5) outperform the fully-supervised COMeT (GPT-2 XL) model. We also see a significant improvement from the few-shot COMeT (T5) model over GPT-3, indicating that few-shot learning may be a richer adaptation strategy than few-shot augmentation. A model with 16x fewer parameters is able to transfer more successfully to the task. Finally, we observe that zero-shot knowledge elicitation from language (GPT-2 XL) is not a viable way of hypothesizing knowledge tuples. In Table 2, we show examples of generations comparing the few-shot settings across different relations.

4.2 How does model size affect knowledge model learning?

We find that larger models generalize better in few-shot settings.

Method

We train a few-shot COMeT (T5) model across different pretrained language model sizes. We implement the COMeT model with T5-Small (60M parameters), T5-Large (770M parameters), and T5-11B (11B parameters), and record their performance across a range of few-shot training set sizes. Training hyper-parameters remain the same between model sizes.

Findings

In Figure 2, we observe that the performance difference between model sizes is largest in few-shot settings, but gradually decreases as more examples are available for training. However, the larger seed language models provide a consistent improvement regardless of the number of examples available for training.

4.3 How do prompts influence knowledge model learning?

Figure 3: Illustration of knowledge model prompting. With prompting, the commonsense relation is converted using a natural language template.
Figure 4: Comparison of training using natural language prompting versus previous methods utilizing a relation embedding. We show that prompting improves the data efficiency of knowledge model training.
Figure 5: Illustration of sampling tuples from a knowledge graph for the same relation by prioritizing diverse head entities or multiple tail entities per head entity.
Figure 6: Comparison of training on examples with diverse head entities or multiple tail entities per head. Our results suggest that example breadth is more important for training knowledge models than example depth.
Figure 7: We investigate two training schedules: 1) pretraining on the seed relations, followed by training on examples of the target relations; 2) training only on examples of the target relations.
Figure 8: Comparison of few-shot training only on target relations versus pretraining on seed relations in the Atomic training set followed by few-shot training on the target relations.

Using natural language prompts to express relations accelerates learning commonsense knowledge models in few-shot settings (Figure 4).

Motivation

Recent work has shown that prompts can help models elicit knowledge from pre-trained language representations Feldman2019CommonsenseKM; Shin2020AutoPromptEK. However, eliciting knowledge through zero-shot prompting has drawbacks. First, because the language model is not explicitly trained for this purpose, the prompt may not yield salient knowledge in practice. Second, the output is sensitive to subtle variations in the construction of the language prompt (jiang20tacl). Here, we explore whether prompts can accelerate few-shot learning in a stable manner.

Method

We explore two different settings for modeling relations in knowledge models (Figure 3). In the first, we initialize natural language prompts for each relation following the procedure defined in Section 3.1. Example prompts for each relation can be found in Table 1. In the second, we follow Bosselut2019COMETCT and initialize a unique token for each relation, which is appended to the tokens of the head entity and maps to a unique learnable embedding for each relation.
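The sketch below contrasts the two input representations using Hugging Face APIs; the specific prompt and token name are illustrative assumptions rather than the exact setup used in the paper.

```python
# Two ways to represent the xIntent relation in the model input.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

head = "PersonX works really hard"

# (1) Natural language prompt: the relation is expressed in words the
#     pretrained model already understands.
prompt_input = f"Because of {head}, PersonX wanted"

# (2) Symbolic relation token: a fresh token whose randomly initialized
#     embedding must be learned from the few available examples.
tokenizer.add_tokens(["<xIntent>"])
model.resize_token_embeddings(len(tokenizer))
symbolic_input = f"{head} <xIntent>"
```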

Findings

In Figure 4, we see that knowledge models can efficiently learn from fewer examples when relations are represented using natural language prompts. Prompts are especially important in the more restrictive few-shot settings where there is little signal to learn a relation embedding from scratch. Once approximately 3000 examples per relation are available for training, the performance between the models trained with and without prompts is similar, suggesting a point where prompts no longer help. Interestingly, we find a steeper slope for the prompt model when jumping from 3 to 30 examples per relation, implying that the model requires a minimum number of examples to map an understanding of the relation to the prompt words during few-shot training.

5 How should we annotate new knowledge relations?

In this section, we design two studies to explore the effect of training set construction on few-shot learning of commonsense knowledge models, guiding future annotation efforts in commonsense knowledge graphs. First, we compare the tradeoff of example breadth and example depth by evaluating the learning improvement from training on different head entities for the same relation (i.e., breadth) or different tail entities for the same head entity (i.e., depth). Second, we explore whether pretraining on other relations in the graph can help few-shot transfer for a specific target relation, simulating a situation where we may want to learn a new commonsense relation in an online manner.

5.1 Heads or Tails: Example Breadth vs. Example Depth

Training knowledge models with diverse heads, rather than many tails per head, leads to better few-shot learning performance (Figure 6).

Method

In this setting, we compare two strategies for sampling examples for few-shot learning (Figure 5). In the first, we sample diverse head entities (i.e., unique seed instances), and limit any head entity to be sampled only once in the few-shot training set. In the second strategy, we sample multiple tail entities per head entity. Here, the model can learn from a richer set of commonsense relationships for each head entity, but at the expense of learning from fewer head entities overall. We sample head entities that are linked to at least 5 tail entities through the same relation in Atomic, and collect all tails for those sampled heads.
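A rough sketch of the two sampling strategies for a single relation is given below. The data structure (a mapping from each head entity to its linked tails) and the minimum of 5 tails for the depth strategy follow the description above, while the function names are illustrative.

```python
# Sketch of breadth vs. depth sampling for one relation.
import random

def sample_breadth(tails_by_head, k, rng=random.Random(0)):
    """Diverse heads: k distinct head entities, one tail each."""
    heads = rng.sample(list(tails_by_head), k)
    return [(h, rng.choice(tails_by_head[h])) for h in heads]

def sample_depth(tails_by_head, k, rng=random.Random(0)):
    """Multiple tails per head: only heads linked to >= 5 tails, keeping all
    of their tails until k examples are collected."""
    rich_heads = [h for h, ts in tails_by_head.items() if len(ts) >= 5]
    sample = []
    for h in rng.sample(rich_heads, len(rich_heads)):   # shuffled order
        for t in tails_by_head[h]:
            sample.append((h, t))
            if len(sample) == k:
                return sample
    return sample
```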

Findings

In Figure 6, we see a large empirical gain when training on diverse heads rather than training on more tails per head. (Footnote 3: We do not endorse extrapolating these findings to other common formulations of the “heads or tails?” question.) While this difference is most notable for small training sets (i.e., more extreme few-shot settings), a gap persists across all training set sizes. Consequently, when annotating commonsense knowledge graphs for few-shot transfer, we advise annotating diverse head entities for each relation, rather than annotating multiple inferences for the same head entities (e.g., training with contrast sets; gardner-etal-2020-evaluating).

5.2 Does knowledge of other relations help to learn a new relation?

Models benefit from pretraining on the other relations in Atomic in ultra few-shot settings. As the number of examples for a relation increases, pretraining no longer helps (Figure 8).

Method

In this study, we select a subset of 6 relations from Atomic as the few-shot set. Then, we pretrain the knowledge model on all examples of the remaining relations in Atomic. As before, we use natural language prompts to represent relations, as we want knowledge about relations to transfer between them, which symbolic relation representations would hinder. We then perform few-shot training on the set-aside relations using a limited number of examples from each relation. Here, we cap the number of examples per relation in our study, as we see no improvement from pretraining once more examples are available. We compare these results to a baseline trained on the same examples from the set-aside relations with no pretraining.
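Schematically, the two training schedules can be summarized as below. Both `train` and `sample_few_shot` are hypothetical stand-ins for the fine-tuning loop of Section 3.3 and the sampling procedure of Section 3.2; they are not real APIs, and the epoch counts are illustrative.

```python
# Schematic comparison of the two schedules; `train` and `sample_few_shot`
# are placeholder helpers described in Sections 3.3 and 3.2, respectively.

def schedule_pretrain_then_few_shot(model, seed_tuples, target_tuples, k):
    # 1) Pretrain on all tuples of the non-target ("seed") relations ...
    train(model, seed_tuples, epochs=1)
    # 2) ... then few-shot train on k examples per target relation.
    train(model, sample_few_shot(target_tuples, k), epochs=3)
    return model

def schedule_few_shot_only(model, target_tuples, k):
    # Baseline: train only on k examples per target relation.
    train(model, sample_few_shot(target_tuples, k), epochs=3)
    return model
```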

Findings

We find that when more than 30 examples of a particular relation are available (i.e., annotated), the benefit of pretraining on other relations evaporates. This result is surprising given the diversity of head entities available among the pretraining tuples. We would expect the model to see greater benefit from seeing many commonsense relationships during pretraining. Consequently, we conclude that annotating even a few examples of a new relation is the most fruitful way to improve a knowledge model’s understanding of that relation.

However, it appears that knowledge models do manage to transfer some information between relations, with zero-shot relation induction improving drastically when the knowledge model is trained on other relations in the graph.

6 Related Work

Commonsense Knowledge Graphs

Our work uses Atomic as the transfer commonsense knowledge graph, but other works have designed CSKGs as well. Part of Atomic is built off ATOMIC, a crowdsourced graph with 9 social commonsense relations (Sap2019ATOMICAA). ConceptNet (Speer2017ConceptNet5A) was also partially manually constructed from crowdsourced statements of commonsense knowledge. Recently, Zhang2020TransOMCSFL built off the ConceptNet schema by constructing a knowledge graph using automatically converted syntactic parses from sentences. More recent works have explored adding context to knowledge graph head entities to provide a richer learning space for commonsense knowledge models (mostafazadeh-etal-2020-glucose).

Commonsense Knowledge Models

Our work uses commonsense knowledge models, first proposed by Bosselut2019COMETCT, to learn from commonsense knowledge graphs. Hwang2020COMETATOMIC2O also trained commonsense models on Atomic, but focused on fully-supervised learning. Other works have developed commonsense knowledge models that are grounded to visual scenes (Park2020VisualCOMETRA; Da2020EditedMU), requiring multimodal commonsense inference generation. Recent works extend commonsense knowledge models beyond generating single-hop inferences and generate multi-branch (Bosselut2019DynamicKG) and multi-hop (wang-etal-2020-connecting) inferential structures. Commonsense knowledge base completion is also a closely related task to commonsense inference generation (li2016commonsense; saito2018commonsense). Recent works on this task combine language and graph structure representations for improved generalization (Malaviya2019ExploitingSA; Wang2020InductiveLO).

7 Conclusion

In this work, we investigate five different dimensions of few-shot adaptation for knowledge model learning: adaptation strategy (i.e., learning vs. augmentation), model size, relation prompting, few-shot training set selection, and knowledge graph pretraining. Our studies yield a roadmap for efficient few-shot learning of knowledge models. We use these insights to train a few-shot knowledge model that exceeds the performance of GPT-3 on commonsense knowledge hypothesization, and comes within 6% of the performance of a fully-supervised model trained on all of Atomic.

Acknowledgments

This research was supported in part by DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), JD.com, and the Allen Institute for AI (AI2).

References

8 Appendix

8.1 Accuracy in zero-shot MLM setting

Model | BLEU-1 | BLEU-2
T5 (zero-shot) | 0.067 | 0.017

While little work has explored few-shot knowledge completion, recent works have investigated the performance of zero-shot knowledge graph completion Petroni2019LanguageMA; Feldman2019CommonsenseKM. Thus, we investigate the ability of T5 to complete commonsense knowledge in a zero-shot setting. Unlike the few-shot and supervised approaches, we do not use teacher forcing; instead, we use prompts to leverage the masking objective of the language model pretraining. In addition to single mask prediction, we try a couple of variants. Since the mask only predicts several tokens at a time, for relations with longer responses (e.g., ATOMIC relations), we allow the model to predict up to 7 mask tokens in succession, or until the model predicts an empty string for the mask. We suggest that this is still only a workaround, and that masked models are poor predictors of longer tail entities.
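The snippet below sketches this iterative mask-filling procedure with T5's sentinel tokens. It is an assumed reconstruction of the setup described above (the decoding parameters and post-processing are illustrative), using t5-small so the example runs on commodity hardware.

```python
# Sketch of zero-shot tail prediction with T5's span-masking sentinels: the
# prompt ends in <extra_id_0>, the predicted span is appended, and the process
# repeats up to 7 times or until the model predicts an empty span.
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # t5-small for illustration
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def zero_shot_tail(prompt, max_spans=7):
    tail = ""
    for _ in range(max_spans):
        ids = tokenizer(prompt + tail + " <extra_id_0>", return_tensors="pt").input_ids
        out = tokenizer.decode(model.generate(ids, max_new_tokens=10)[0],
                               skip_special_tokens=False)
        # Keep only the text predicted for the first sentinel span.
        match = re.search(r"<extra_id_0>(.*?)(<extra_id_1>|</s>|$)", out)
        span = (match.group(1) if match else "").strip()
        if not span:                       # empty prediction ends the loop
            break
        tail += " " + span
    return tail.strip()

print(zero_shot_tail("You are likely to find a nail in a"))
```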