KFCNet: Knowledge Filtering and Contrastive Learning Network for Generative Commonsense Reasoning

09/14/2021 ∙ by Haonan Li, et al. ∙ Microsoft ∙ The University of Melbourne

Pre-trained language models have led to substantial gains over a broad range of natural language processing (NLP) tasks, but have been shown to have limitations for natural language generation tasks with high-quality requirements on the output, such as commonsense generation and ad keyword generation. In this work, we present a novel Knowledge Filtering and Contrastive learning Network (KFCNet) which references external knowledge and achieves better generation performance. Specifically, we propose a BERT-based filter model to remove low-quality candidates, and apply contrastive learning separately to each of the encoder and decoder, within a general encoder–decoder architecture. The encoder contrastive module helps to capture global target semantics during encoding, and the decoder contrastive module enhances the utility of retrieved prototypes while learning general features. Extensive experiments on the CommonGen benchmark show that our model outperforms the previous state of the art by a large margin: +6.6 points (42.5 vs. 35.9) for BLEU-4, +3.7 points (33.3 vs. 29.6) for SPICE, and +1.3 points (18.3 vs. 17.0) for CIDEr. We further verify the effectiveness of the proposed contrastive module on ad keyword generation, and show that our model has potential commercial value.




1 Introduction

Pre-trained language models have achieved impressive results across a wide range of NLP tasks (Devlin et al., 2019; Yang et al., 2019; Sun et al., 2019; Liu et al., 2019; Lewis et al., 2020a; Qi et al., 2020; He et al., 2020b). However, their ability to accurately reflect factual knowledge or perform logical inference is still limited. To investigate the ability of systems to capture commonsense knowledge, datasets such as CommonsenseQA (Talmor et al., 2019), SWAG (Zellers et al., 2018), and WinoGrande (Sakaguchi et al., 2020) have been proposed. In contrast to these discriminative tasks, which require models to choose the correct option from multiple candidates, CommonGen (Lin et al., 2020) is framed as a generation task, and requires the system to generate a logical and coherent sentence describing an everyday scenario based on a concept set. Experiments show that state-of-the-art generation models are not adequate or accurate enough to generate plausible sentences or reflect commonsense assumptions in this setting.

External knowledge provides not only information about the sorts of relationships that hold between concepts, to potentially guide generation models in capturing the implicit logic between concepts, but also interpretability. Inspired by Lewis et al. (2020b) and Fan et al. (2020), we adopt a retrieval-and-generation framework, and propose a BERT-filter and two contrastive learning modules for retrieval and generation, respectively.

For retrieval, previous research (Lewis et al., 2020b) has shown that traditional sparse vector space models, such as TF-IDF and BM25, perform better than dense representation-based retrieval on heavily entity-centric tasks such as FEVER (Thorne et al., 2018). However, while using sparse vector space retrieval models can retrieve relevant prototypes that contain a set of concepts, there can be significant domain mismatches between the retrieved results and target distribution, making it difficult for generation models to bridge between prototypes and targets. We argue that a two-stage retrieval strategy alleviates this issue by combining sparse vector space search and dense representation filters. First, a sparse vector retrieval model is used to find passage candidates with high coverage of concept words, and then a dense vector-based filter is applied to score the candidates, and filter out low-quality prototypes.

For generation, we apply contrastive learning to each of the encoder and decoder, in a general encoder–decoder architecture. The core idea of contrastive learning is to construct positive and negative samples from an anchor sample, and draw together the anchor and positive samples while pushing away the anchor from all negative samples in the embedding space during training. Given that high-quality prototypes can be used as clusters of positive samples, we propose a decoder contrastive module that minimizes the distance between decoded sentence representations with distinct prototypes retrieved from the same concept set. Common scenario information and abstract concept relationships can be learned based on the contrasts between different prototypes. Moreover, we propose an encoder contrastive module to force the encoder to learn sentence representations, and save it into a global token which is visible to the decoder during decoding. In this way, global sentence-level semantics can be captured better.

The main contributions of this work are threefold: (1) we demonstrate that adding a high-quality matching model to the word overlap-based retriever benefits entity-centric retrieval tasks; (2) we propose two contrastive learning modules that can be applied to a general encoder–decoder generation model; and (3) we conduct experiments on CommonGen and an ad keyword generation task, and show that our method achieves large improvements on both tasks.

2 Related Work

2.1 Knowledge Enhanced Generation

There is significant work on incorporating external knowledge from knowledge bases and incorporating retrieved information in language generation tasks (Weston et al., 2018; Cao et al., 2018; Guan et al., 2019; Hossain et al., 2020). Lewis et al. (2020b) explore a general-purpose fine-tuning recipe for retrieval-augmented generation that combines a dense passage retriever (Karpukhin et al., 2020) with a BART (Lewis et al., 2020a) generator. For commonsense generation, Liu et al. (2020) propose a knowledge graph-augmented language generation model that encompasses concepts from a knowledge graph, and produces more logical and natural sentences.

Fan et al. (2020) propose to retrieve prototypes based on sparse vector similarity, and introduce a scaling module and a prototype position indicator to explicitly deal with retrieval noise.

This paper proposes a two-stage retrieval strategy and differs in applying contrastive learning to make better use of prototypes for generation.

2.2 Contrastive Learning

Recently, contrastive learning has achieved remarkable results in many self-supervised and supervised learning tasks, primarily for computer vision. The two key elements of contrastive learning are: (1) the construction of positive and negative samples; and (2) the learning framework.

2.2.1 Sample Construction

Usually in contrastive learning, positive samples are augmented forms of anchor data points, and negative samples are augmented forms of other data points. In NLP, Meng et al. (2021) create positive samples by masking and cropping tokens from sentences; Gunel et al. (2020) and Fang and Xie (2020) use back-translation to create positive augmentations of original sentences; Chi et al. (2020) and Wei et al. (2021) regard parallel sentences distributed in one or multiple languages as different views of the same semantics to learn cross-lingual representations; and Gao et al. (2021) demonstrate that constructing positive pairs with only standard dropout as minimal data augmentation works surprisingly well on the NLI task. Distinct from these methods, we propose to create positive sample pairs from retrieved results.

2.2.2 Learning Framework

Previous contrastive learning methods have required either specialized architectures (Bachman et al., 2019; Hénaff, 2020) or a memory bank to store large volumes of negative samples (Wu et al., 2018; Tian et al., 2020). Chen et al. (2020) present a simple framework consisting of a feature extraction module, and a non-linear transformation module, which outperforms previous work on ImageNet (Russakovsky et al., 2015) without using a specialized architecture or a memory bank. However, it requires a large batch size to yield high performance, which is computationally prohibitive. Moco (He et al., 2020a) addresses this issue by maintaining a queue of data samples as the memory bank, and enqueuing encoded representations of the current mini-batch and dequeuing the oldest representations on each iteration. They further propose a momentum encoder to maintain the consistency of representations in the queue. In this work, we use the Moco framework to train our contrastive learning modules.

3 Method

In this section, we detail our method: Knowledge Filtering and Contrastive learning Network (KFCNet). First, we introduce our prototype retrieval strategy together with the knowledge filter model. Then we present our generative model, based on an encoder–decoder architecture with two contrastive learning modules. Finally, we show how we adapt the Moco contrastive learning framework, and deal with multiple positive samples.

3.1 Task Formulation

We use $x = \{c_1, c_2, \dots, c_{|x|}\}$ to denote a set of concepts, where $c_i \in \mathcal{C}$, and $\mathcal{C}$ is the concept vocabulary, and use $\mathcal{X}$ to denote all possible concept sets. The commonsense generation task is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps the concept set $x$ to a sentence $y$, where $y \in \mathcal{Y}$, and $\mathcal{Y}$ is the target sentence space. The generated sentence must be a plausible sentence that describes a common scenario in our daily life based on the contents of $x$.

3.2 Prototype Retrieval and Filtering

In order to retrieve prototypes that contain the concepts in a given concept set while keeping the retrieval results and target sentences semantically as similar as possible, we use a two-stage retrieval strategy combining sparse vector space search and dense representation matching. In Stage 1, a sparse vector retrieval model is used to retrieve $N$ candidate prototypes from the corpus $\mathcal{D}$. Then in Stage 2, a dense representation-based scorer is used to score the candidates, and the top-$K$ scored candidates ($K \leq N$) are chosen as the final prototypes.

3.2.1 Stage 1

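Given a concept set $x$, we first split the corpus $\mathcal{D}$ into parts $\mathcal{D}_0, \mathcal{D}_1, \dots, \mathcal{D}_{|x|}$ according to the number of concepts each sentence contains, where sentences in $\mathcal{D}_i$ contain $i$ of the concepts in $x$. Given that most concepts in $x$ are verb and noun lemmas, we pre-process the corpus based on lemmatization and stemming. Then we choose $N$ sentences as candidates from the parts, prioritized such that $\mathcal{D}_{|x|} \succ \mathcal{D}_{|x|-1} \succ \dots \succ \mathcal{D}_0$, i.e., sentences covering more concepts are chosen first.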
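The coverage-first selection in Stage 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `normalize` is a crude, hypothetical stand-in for real lemmatization and stemming, the corpus is a small in-memory list rather than an indexed collection, and the function names are not taken from the paper.

```python
from collections import defaultdict


def normalize(token):
    """Crude stand-in for lemmatization and stemming (an assumption)."""
    token = token.lower().strip(".,!?;:")
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stage1_candidates(concepts, corpus, n):
    """Return up to n sentences, taking those that cover more concepts first."""
    targets = {normalize(c) for c in concepts}
    buckets = defaultdict(list)  # number of covered concepts -> sentences
    for sentence in corpus:
        tokens = {normalize(t) for t in sentence.split()}
        buckets[len(targets & tokens)].append(sentence)
    ranked = []
    for covered in range(len(targets), -1, -1):  # highest coverage first
        ranked.extend(buckets[covered])
    return ranked[:n]
```

Within a bucket the sketch keeps corpus order; a sparse ranker such as BM25 could be used to order sentences that tie on coverage.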

3.2.2 Stage 2

After retrieving candidate prototypes, a scorer is used to rank the candidates and filter out candidates that are far from the targets in embedding space. In this work, we use a BERT-based model as an encoder, and use the hidden state of the [CLS] token as the sample representation. The representation is then fed into a multi-layer perceptron with a scalar output as follows:

$$h = \mathrm{BERT}(s)_{\texttt{[CLS]}} \tag{1}$$

$$p = \mathrm{MLP}(h) \tag{2}$$

where $s$ is the training sample created from the concept set and candidates/targets, $h$ is the sample representation, and $p$ is the final score of the sample. Theoretically, the label of the training set can be any real number in the range $[0, 1]$, but we find that it is sufficient to train the scorer as a simple binary classifier. We create the positive samples by combining a concept set with each corresponding target sentence, and create the negative samples by combining the concept set with a different candidate prototype or random sentence from the corpus $\mathcal{D}$. During inference, we score all candidates and choose the candidate with the highest score as the prototype.

For all experiments in this paper, we set the number of final prototypes to $K = 2$ and the number of Stage 1 candidates to $N = 8$. $K = 2$ means that for each input, we construct one positive sample, which is the setting widely used in contrastive learning work. $N = 8$ is chosen because experience shows that at least 2 high-quality prototypes can be retrieved from 8 candidates.
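A minimal sketch of how the filter's training samples and the final selection could be put together is given below. The `[SEP]`-joined text format, the one-negative-per-positive ratio, the 50/50 choice between candidate and random negatives, and the function names are all illustrative assumptions; the scoring model itself is abstracted as a `score_fn` callable.

```python
import random


def build_filter_samples(concepts, targets, candidates, corpus, rng):
    """Create (text, label) pairs for a binary scorer (assumed input format)."""
    source = " ".join(concepts)
    # Positives: the concept set paired with each gold target sentence.
    samples = [(f"{source} [SEP] {t}", 1) for t in targets]
    # Negatives: the concept set paired with a retrieved candidate or a
    # random corpus sentence (assumed: one negative per positive).
    for _ in targets:
        pool = candidates if candidates and rng.random() < 0.5 else corpus
        samples.append((f"{source} [SEP] {rng.choice(pool)}", 0))
    return samples


def top_k(candidates, score_fn, k=2):
    """Keep the k highest-scoring candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]
```

In practice `score_fn` would wrap the trained scorer; here any callable that maps a candidate to a number works, which keeps the selection logic testable on its own.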

Figure 1: Generation model structure with contrastive learning.

3.3 Contrastive Learning for Generation

3.3.1 Encoder–decoder Architecture

The encoder–decoder architecture is widely used in generation tasks. Compared to single decoder generation models such as GPT-2 (Radford et al., 2019) where words are conditioned only on left context, models using an encoder–decoder framework such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020) enable bidirectional interactions with an encoder, and auto-regressive generation with a decoder. In this work, we use BART with an auto-regressive objective for generation as shown in the middle of Figure 1, and propose separate contrastive modules for the encoder and decoder, corresponding to the right and left sub-structures in the figure.

Typical generation inputs only contain the source sentence, which is the concept set $x$ in CommonGen. The difference here is that we append one of the retrieved prototypes to $x$ to enhance the input.

3.3.2 Encoder Contrastive Loss

Although BART learns bidirectional interactions using a transformer-based encoder and implements cross-attention at each layer of the decoder, the global target information is not explicitly learned during encoding, meaning that the decoder needs to find important local information for the current step during each timestep of decoding, without having access to the global goal of generation. Here, we propose to force the model to learn global target information during encoding and save it to a special token, using source–target contrastive learning. The special token is visible to the decoder via cross-attention during each timestep of decoding. Specifically, given a concept set $x$ and a target sentence $y$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we denote the retrieved prototypes as $\{z_1, \dots, z_K\}$, where each $z_i$ is a complete prototype. We construct the input for encoder contrastive learning by concatenating $x$ with $z_i$ and $x$ with $y$, respectively. As illustrated in Figure 1, the concatenation of $x$ and $z_i$ will be used as the input to the main encoder, which is followed by a decoder with gradient and auto-regressive generation loss, and the concatenation of $x$ and $y$ will be used as the input to another encoder without a decoder or gradient.

3.3.3 Decoder Contrastive Loss

At the retrieval step, multiple high-quality retrieval results are collected as prototypes to augment generation. Although these retrieved results substantially boost external information, they inevitably introduce noise. In order to learn general information associated with the concept set and eliminate noise in the prototypes, we propose a decoder contrastive learning module, which we apply to the sentence representation at the decoder output. Formally, we concatenate the concept set $x$ with $z_j$ ($j \neq i$), which is a different prototype for $x$ from the one ($z_i$) used in the main-branch BART model. Note also that different from the main-branch model, here the gradient is not back-propagated.
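To make the three input variants concrete, the following sketch builds the text fed to each branch: the main branch (concept set plus one prototype), the encoder contrastive branch (concept set plus target), and the decoder contrastive branch (concept set plus each remaining prototype). The separator string and the function name are assumptions for illustration, not details given in the text.

```python
def build_branch_inputs(concepts, prototypes, target, main_index=0, sep=" </s> "):
    """Return the inputs for the main, encoder-key, and decoder-key branches."""
    source = " ".join(concepts)
    # Main branch: trained with gradients and the generation loss.
    main = source + sep + prototypes[main_index]
    # Encoder contrastive branch: the source paired with the gold target.
    encoder_key = source + sep + target
    # Decoder contrastive branch: the source paired with the other prototypes.
    decoder_keys = [
        source + sep + z for j, z in enumerate(prototypes) if j != main_index
    ]
    return main, encoder_key, decoder_keys
```

With two prototypes per input, `decoder_keys` holds exactly one entry, which then serves as the single positive for the decoder contrastive module.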

3.4 Momentum Contrast with Memory Bank

Most existing training methods greatly limit the number of in-batch negative samples, limiting the potential of contrastive learning. To enable large-scale interactions between negative samples, we follow Moco (He et al., 2020a) in maintaining a dictionary as a queue of encoded/decoded data samples. The keys of the dictionary are samples from the data after encoding/decoding, and the queries are samples in the current mini-batch after encoding/decoding during training. Learning is formulated as minimizing the contrastive loss, which makes the query similar to its matching key and dissimilar to other keys.

3.4.1 Memory Bank as Queues

We use two dictionaries to store the representations of the encoder and decoder output, respectively. In each training iteration, the newest encoded representations are enqueued and the oldest are dequeued, to maintain a fixed queue size. For each sample, the number of contrast pairs is the size of the queue, where usually only the matching key in the same mini-batch is positive, and all others are negative.

3.4.2 Dealing with Multiple Positive Samples

In the CommonGen task, the mapping between source sequences and gold targets can be many-to-many. $n$ independent sample pairs are created for 1-to-$n$ and $n$-to-1 source–target pairs, which can be distributed across mini-batches. However, these samples should all be regarded as mutually positive. To enable interactions between positive samples intra- and inter-mini-batch, we assign each source–target pair an identity, which indicates pairs that share the same source or target. These identities are saved in another queue that is updated synchronously with the encoder and decoder memory banks.
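The queue-plus-identity bookkeeping can be sketched with two parallel fixed-length queues. The class and method names are hypothetical, and the sketch assumes that each group of mutually positive pairs has already been assigned a single shared identity value.

```python
from collections import deque


class MemoryBank:
    """Fixed-size FIFO of key representations with a parallel identity queue."""

    def __init__(self, size):
        self.keys = deque(maxlen=size)        # encoded or decoded key vectors
        self.identities = deque(maxlen=size)  # identity of each stored key

    def enqueue(self, keys, identities):
        # A deque with maxlen drops the oldest entries automatically, so both
        # queues stay aligned as long as they are extended together.
        self.keys.extend(keys)
        self.identities.extend(identities)

    def positive_indices(self, identity):
        """Indices of all stored keys that share the given identity."""
        return [i for i, ident in enumerate(self.identities) if ident == identity]
```

Looking up positives by identity is what lets a query treat keys from earlier mini-batches as positives rather than as negatives.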

3.4.3 Momentum Updated Parameters

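To keep the consistency of representations in the memory banks, we update the parameters of the key encoder and decoder with momentum. Formally, denote the parameters of the query encoder and decoder as $\theta_q^{\text{enc}}$ and $\theta_q^{\text{dec}}$, and the parameters of the key encoder and decoder as $\theta_k^{\text{enc}}$ and $\theta_k^{\text{dec}}$. The parameters $\theta_q^{\text{enc}}$ and $\theta_q^{\text{dec}}$ are updated by back-propagation, and the parameters $\theta_k^{\text{enc}}$ and $\theta_k^{\text{dec}}$ are updated by:

$$\theta_k^{\text{enc}} \leftarrow m\,\theta_k^{\text{enc}} + (1 - m)\,\theta_q^{\text{enc}} \tag{3}$$

$$\theta_k^{\text{dec}} \leftarrow m\,\theta_k^{\text{dec}} + (1 - m)\,\theta_q^{\text{dec}} \tag{4}$$

Here, $m \in [0, 1)$ is a momentum coefficient which is set to be close to 1. In this way, the parameters of the key encoder and decoder evolve more smoothly than those of the query, which maintains the consistency of key representations in the memory bank.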
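The momentum update is an exponential moving average of the query parameters. A minimal sketch, assuming parameters are stored as a name-to-value mapping (plain floats here for readability; the function name is illustrative):

```python
def momentum_update(key_params, query_params, m=0.999):
    """In-place update: key <- m * key + (1 - m) * query, for every parameter."""
    for name, query_value in query_params.items():
        key_params[name] = m * key_params[name] + (1.0 - m) * query_value
    return key_params
```

With a coefficient of 0.999, each update moves the key parameters only 0.1% of the way towards the query parameters, which is why the stored key representations drift slowly between iterations.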

3.5 Training Objective

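Consider a batch of query-key pairs $\{(q_i, k_i)\}_{i=1}^{B}$, where there is only one positive key $k_i$ for a given query $q_i$. After encoding, we fetch the representation of the last <EOS> tokens and apply a projection to it as:

$$u_i = \mathrm{Proj}_q\big(\mathrm{Enc}_q(q_i)_{\texttt{<EOS>}}\big) \tag{5}$$

$$v_i = \mathrm{Proj}_k\big(\mathrm{Enc}_k(k_i)_{\texttt{<EOS>}}\big) \tag{6}$$

The encoder contrastive loss function, called InfoNCE, is as follows:

$$\mathcal{L}_{\text{enc}} = -\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(u_i, v_i)/\tau)}{\sum_{j \in M} \exp(\mathrm{sim}(u_i, v_j)/\tau)} \tag{7}$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity function, $\tau$ is a temperature hyper-parameter, and $M$ denotes all indices in the memory bank. The denominator has $|M|$ total terms, including one positive and $|M| - 1$ negative samples. Intuitively, the loss function is the log loss of an $|M|$-way softmax classifier that tries to classify $u_i$ according to the positive $v_i$. Eqn (7) is only able to deal with the case of a single positive key existing for each query. To generalize it to an arbitrary number of positives, inspired by SupCon (Khosla et al., 2020), we consider the following loss functions:

$$\mathcal{L}_{\text{enc}}^{\text{out}} = \sum_{i=1}^{B} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(u_i, v_p)/\tau)}{\sum_{j \in M} \exp(\mathrm{sim}(u_i, v_j)/\tau)} \tag{8}$$

$$\mathcal{L}_{\text{enc}}^{\text{in}} = \sum_{i=1}^{B} -\log \left\{ \frac{1}{|P(i)|} \sum_{p \in P(i)} \frac{\exp(\mathrm{sim}(u_i, v_p)/\tau)}{\sum_{j \in M} \exp(\mathrm{sim}(u_i, v_j)/\tau)} \right\} \tag{9}$$

Here, $P(i)$ denotes all positive indices of the sample $i$; Eqn (8) sums over the positive samples outside of the log function, and Eqn (9) sums over them inside it.

The decoder contrastive loss $\mathcal{L}_{\text{dec}}$ can be obtained in the same way, except that the sentence representation is fetched from the <EOS> token after decoding. During training, we try to minimize the sum of the encoder contrastive loss, decoder contrastive loss, and the decoder auto-regressive generation loss:

$$\mathcal{L} = \mathcal{L}_{\text{ce}} + \lambda_{\text{enc}}\,\mathcal{L}_{\text{enc}} + \lambda_{\text{dec}}\,\mathcal{L}_{\text{dec}} \tag{10}$$

Here, $\mathcal{L}_{\text{ce}}$ denotes the cross-entropy generation loss, and $\lambda_{\text{enc}}$ and $\lambda_{\text{dec}}$ are tunable scalars.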
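The single-positive InfoNCE loss and its two multi-positive generalizations differ only in where the average over positives is taken. The sketch below computes all three for one query in plain Python, assuming dot-product similarity and representing the memory bank as a list of vectors; the function names are illustrative and the per-batch sum is left out.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_softmax_over_bank(query, bank, tau):
    """Log-probability of each bank entry under a softmax over similarities."""
    logits = [dot(query, key) / tau for key in bank]
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_norm for l in logits]


def info_nce(query, bank, positive, tau=0.1):
    """Single-positive loss: negative log-probability of the positive key."""
    return -log_softmax_over_bank(query, bank, tau)[positive]


def supcon_out(query, bank, positives, tau=0.1):
    """Multi-positive loss with the average taken outside the log."""
    log_probs = log_softmax_over_bank(query, bank, tau)
    return -sum(log_probs[p] for p in positives) / len(positives)


def supcon_in(query, bank, positives, tau=0.1):
    """Multi-positive loss with the average taken inside the log."""
    log_probs = log_softmax_over_bank(query, bank, tau)
    mean_prob = sum(math.exp(log_probs[p]) for p in positives) / len(positives)
    return -math.log(mean_prob)
```

With a single positive the three functions coincide. With several positives, Jensen's inequality implies that the inside-log variant is never larger than the outside-log variant, so the outside-log form penalizes a poorly matched positive more strongly.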

During inference, we discard the momentum encoder and decoder, together with the projection layers.

4 Experiments

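| Statistic | In-domain corpus | Out-of-domain corpus |
|---|---|---|
| # Sentences | 4,118,993 | 70,245,048 |
| Avg. length | 11.10 | 18.76 |
| Avg. # missing concepts (size=3) | 0.40 | 0.48 |
| Avg. # missing concepts (size=4) | 0.56 | 1.00 |
| Avg. # missing concepts (size=5) | 0.80 | 1.98 |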

Table 1: Statistics of the two corpora. “Missing concept” indicates the number of missing concepts in the top-2 retrieved sentences, broken down by concept-set size.

4.1 Datasets

CommonGen (Lin et al., 2020) contains 32,651/993/1,497 unique training/development/test concept sets, corresponding to 67,389 and 4,018 English target sentences in the training and development sets, meaning that one concept-set can map to multiple target sentences. The percentages of concept-sets in the development and test sets that are unseen in the training set are 99.60% and 100.00%, respectively, making the dataset challenging for compositional generalization.

4.2 Prototype Collection

4.2.1 In-domain Corpus

As CommonGen was created from visually-grounded caption datasets that describe everyday scenarios, we build an in-domain corpus from datasets of image captions, video captions, and natural language inference. In detail, we extracted sentences from ActivityNet (Krishna et al., 2017), VaTeX (Wang et al., 2019), Conceptual Captions (Sharma et al., 2018), SNLI (Bowman et al., 2015), and MNLI (Williams et al., 2018) as the in-domain corpus.

4.2.2 Out-of-domain Corpus

In addition to in-domain experiments, we create an out-of-domain corpus from Wikipedia (English Wikipedia dump from May 01, 2020), using SpaCy (https://spacy.io/) as our sentence tokenizer.

For both corpora, sentences with fewer than 5 tokens or more than 100 tokens were removed. Table 1 shows the basic statistics of the two corpora. Although the out-of-domain corpus is much larger than the in-domain corpus, sentences retrieved from the in-domain corpus contain more of the required concepts on average. Specifically, for concept-sets of size 4 and 5, the sentences retrieved from the in-domain corpus contain 0.44 and 1.18 more relevant concepts, respectively, than those retrieved from the out-of-domain corpus.

4.3 Experimental Setup

4.3.1 Implementation Details

We employ the pre-trained BART-large model as the base generation model, and initialize the momentum encoder and decoder by copying parameters from the base model. We use the Adam optimizer with , , and 0.1 weight decay, with the initial learning rate setting selected from . We use the polynomial decay learning rate scheduler with 500 warmup steps, and set dropout to 0.1. We set the max tokens per batch to 3000 and max batch-size to 48, with 15k total updates. For the auto-regressive generation loss, we use cross-entropy loss with 0.1 label-smoothing penalty. During decoding, we use beam search with size 5, and 1.0 length penalty.

For contrastive learning, we use an MLP as the projection network, with a single hidden layer of 1024d and the output size of 128d. We use Eqn (8) as the loss function, with similarity measured by dot-product, and set the temperature to 0.1. The queue size of the memory bank is set to 4096, and the momentum coefficient is set to 0.999.
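For reference, the contrastive-learning settings listed above can be gathered into a single configuration object. The class and field names below are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastiveConfig:
    projection_hidden_dim: int = 1024  # single hidden layer of the projection MLP
    projection_output_dim: int = 128   # size of the projected representation
    similarity: str = "dot"            # dot-product similarity
    positive_sum: str = "outside_log"  # Eqn (8): positives summed outside the log
    temperature: float = 0.1
    queue_size: int = 4096             # memory bank length
    momentum: float = 0.999            # coefficient for the key encoder/decoder
```

Freezing the dataclass keeps the settings immutable during a run, so an ablation has to construct a new object, e.g. `ContrastiveConfig(similarity="cosine")`.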

4.3.2 Baselines

We use several state-of-the-art pre-trained language generation models as baselines: GPT-2 (Radford et al., 2019), BERT-Gen (Bao et al., 2020), UniLM (Dong et al., 2019), UniLM-v2 (Bao et al., 2020), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020a). All models are fine-tuned in seq2seq mode. We also compare our model with two strong baselines that use external knowledge: EKI (Fan et al., 2020) and KG-BART (Liu et al., 2020).

4.3.3 Evaluation Metrics

To evaluate generation performance, we use BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), in addition to evaluation metrics for captioning tasks, namely CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016). As all metrics score the output in the range $[0, 100]$, we also present the average score across all metrics.

5 Results

| Model | ROUGE-2 | ROUGE-L | BLEU-3 | BLEU-4 | METEOR | CIDEr | SPICE | Overall |
|---|---|---|---|---|---|---|---|---|
| GPT-2 (Radford et al., 2019) | 17.18 | 39.28 | 30.70 | 21.10 | 26.20 | 12.15 | 25.90 | 24.64 |
| BERT-Gen (Bao et al., 2020) | 18.05 | 40.49 | 30.40 | 21.10 | 27.30 | 12.49 | 27.30 | 25.30 |
| UniLM (Dong et al., 2019) | 21.48 | 43.87 | 38.30 | 27.70 | 29.70 | 14.85 | 30.20 | 29.44 |
| UniLM-v2 (Bao et al., 2020) | 18.24 | 40.62 | 31.30 | 22.10 | 28.10 | 13.10 | 28.10 | 25.93 |
| T5 (Raffel et al., 2020) | 22.01 | 42.97 | 39.00 | 28.60 | 30.10 | 14.96 | 31.60 | 29.89 |
| BART (Lewis et al., 2020a) | 22.23 | 41.98 | 36.30 | 26.30 | 30.90 | 13.92 | 30.60 | 28.89 |
| EKI-out (Fan et al., 2020) | 24.36 | 45.42 | 42.90 | 32.10 | 32.00 | 16.80 | 32.50 | 32.29 |
| KFCNet-out | 24.10 | 45.59 | 44.09 | 34.20 | 32.83 | 17.39 | 33.11 | 33.04 |
| KG-BART (Liu et al., 2020) | 23.38 | 44.54 | 42.10 | 30.90 | 32.40 | 16.83 | 32.70 | 31.83 |
| EKI (Fan et al., 2020) | 25.43 | 46.53 | 46.00 | 36.10 | 33.80 | 17.80 | 33.40 | 34.15 |
| KFCNet w/o FC | 25.16 | 46.13 | 50.22 | 41.97 | 36.22 | 18.85 | 35.90 | 36.35 |
| KFCNet w/o C | 25.91 | 46.81 | 54.75 | 47.33 | 38.19 | 20.21 | 38.20 | 38.77 |
| KFCNet | 26.81 | 47.52 | 57.33 | 51.46 | 38.92 | 20.98 | 39.15 | 40.31 |

Table 2: Overall performance of the different models on CommonGen (v1.0). Models in the first block are fine-tuned pre-trained language models without external knowledge; models in the second block use out-of-domain knowledge; models in the last two blocks use in-domain knowledge, where KG-BART uses ConceptNet, and both EKI and KFCNet (our model) use the in-domain prototype corpus as a knowledge base.
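As a quick check on how the Overall column relates to the other columns, the short Python snippet below recomputes it as the unweighted mean of the seven metric scores for the GPT-2 and KFCNet rows. The helper name is illustrative.

```python
def overall(scores):
    """Unweighted mean of the metric scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# ROUGE-2, ROUGE-L, BLEU-3, BLEU-4, METEOR, CIDEr, SPICE from Table 2
gpt2 = [17.18, 39.28, 30.70, 21.10, 26.20, 12.15, 25.90]
kfcnet = [26.81, 47.52, 57.33, 51.46, 38.92, 20.98, 39.15]

print(overall(gpt2))    # 24.64
print(overall(kfcnet))  # 40.31
```

Both values match the Overall entries reported for those rows.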

Table 2 presents the experimental results across all the metrics. (Note that the latest test set (v1.1) adds one more human reference to each example in the test set (v1.0), but is not publicly available. Additionally, EKI and KG-BART were evaluated on v1.0, so this is what we use for our experiments.) We observe the following: (1) Methods in the 2nd, 3rd, and 4th blocks of Table 2 that use external knowledge outperform the fine-tuned pre-trained language models in the first block. This demonstrates that external knowledge benefits commonsense reasoning and generation. (2) The overall performance of EKI and our method (KFCNet), which both use natural sentences as prototypes, is better than that of KG-BART, which incorporates structured knowledge from knowledge bases. We hypothesize that this is because pre-trained language models like BART can more easily exploit natural language samples than structured information, even with elaborate modules for information fusion. (3) Prototypes retrieved from the in-domain corpus result in better performance than those from the out-of-domain corpus. (4) Simply fine-tuning BART on our retrieved prototypes beats the previously published SOTA on several metrics, and using filtered prototypes boosts the performance again. On the one hand, this shows that the quality of prototypes has a large impact on generation; on the other hand, it indicates that our retrieval method is better than that of EKI, and that our filter helps in selecting good prototypes. (5) Our KFCNet achieves new state-of-the-art results which surpass all other methods by a large margin.

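| Retrieval strategy | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|
| w/o Retrieval | 26.30 | 13.92 | 30.60 |
| BM25 | 36.84 | 17.33 | 32.96 |
| + Lemma & Stem | 41.97 | 18.85 | 35.90 |
| + BERT Filter | 47.33 | 20.21 | 38.20 |

Table 3: Results for fine-tuning BART based on different retrieval strategies over the test set.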

5.1 Ablation study

To better understand the impact of the different modules in KFCNet, we perform a number of ablation experiments.

5.1.1 Retrieval and Filter

Prototype retrieval is a key part of any retrieval-based generation model. To assess the effectiveness of the retrieval-and-filter mechanism, we retrieve prototypes from the in-domain corpus and run ablations on a single BART model. Table 3 shows the results. Compared to models without retrieval, using prototypes retrieved by a simple BM25 model improves generation performance, which we suggest is due to the retrieved prototypes helping the model to better capture relationships between concepts, and construct a coherent scenario. With word lemmatization and stemming, the variety of the retrieval results increases, resulting in better prototypes. Adding a BERT filter boosts the performance again, achieving 5.36, 1.36, and 2.30 absolute improvements for BLEU-4, CIDEr, and SPICE, respectively. This verifies the effectiveness of using a high-quality matching model as an auxiliary module for a word overlap-based retriever.

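| Model | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|
| KFCNet | 36.10 | 17.96 | 33.89 |
| w/o decoder contrastive module | 34.09 | 17.24 | 33.97 |
| w/o encoder and decoder contrastive modules | 30.82 | 16.20 | 33.16 |

Table 4: Contrastive ablation study on the CommonGen development set. The second row removes the decoder contrastive module, and the third row removes both the encoder and decoder contrastive modules.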

5.1.2 Contrastive Learning

The contrastive loss plays an important role in our model. We perform an ablation study on the development set of CommonGen, by comparing the model without the contrastive module, using only encoder contrastive learning, and using both encoder and decoder contrastive learning. As shown in Table 4, using only encoder contrastive learning leads to improvements over the baseline BART model, and adding decoder contrastive learning further improves results based on BLEU-4 and CIDEr.

5.2 Similarity Function and Summation Location

We further compare the performance of different similarity functions and positive summation locations, as mentioned in Section 3.5. The results in Table 5 demonstrate that the combination of dot-product similarity with summation outside of the log function performs best, consistent with the findings of Khosla et al. (2020).

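| Similarity | Summation | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|---|
| Dot-product | Out | 34.11 | 16.77 | 33.59 |
| Dot-product | In | 32.45 | 16.54 | 33.79 |
| Cosine | Out | 32.49 | 16.62 | 33.73 |
| Cosine | In | 33.52 | 16.64 | 33.39 |

Table 5: Comparison of different similarity functions (dot-product and cosine similarity) and positive sample summation locations (outside or inside the log function).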

5.3 Model Efficiency

5.3.1 Retrieval

The prototype retrieval is done separately from the generation model, and the retrieval time consists of 2 parts: (1) sparse vector matching time, in the form of BM25 search; and (2) BERT filter inference, for fine-grained selection, noting that only a few candidates (8 in our experiments) are left after stage 1, which can be processed in a single mini-batch.

5.3.2 Contrastive Module

During training, the momentum encoder and decoder parameters are updated by Eqn (3) and there are no gradients or back-propagation in these modules. Therefore, training takes no more than double the time required without the contrastive modules. During inference, the contrastive modules are disabled, and hence the efficiency does not decrease.

5.4 Final Leaderboard Results

| Model | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|
| Human | 46.49 | 37.64 | 52.43 |
| KFCNet | 42.45 | 18.37 | 33.27 |
| RE-T5 | 40.86 | 17.66 | 31.07 |
| KG-BART | 33.86 | 16.92 | 29.63 |
| EKI-BART | 35.94 | 16.99 | 29.58 |
| T5-Large | 31.96 | 15.12 | 28.85 |
| BART | 31.82 | 13.97 | 27.99 |
| UniLM | 30.61 | 14.88 | 27.42 |
| BERT-Gen | 23.46 | 12.60 | 24.82 |
| GPT-2 | 26.83 | 12.18 | 23.56 |

Table 6: Final CommonGen leaderboard results, using SPICE to rank the methods.

Table 6 shows the final evaluation results on the latest test set with additional human references (v1.1; https://inklab.usc.edu/CommonGen/leaderboard.html). Note that the model in second place (RE-T5) expands the original training data and does continuous pretraining before fine-tuning on CommonGen. Our method, KFCNet, performs best on all metrics. Among all fine-tuned methods, KFCNet beats the previous state-of-the-art by a large margin: 6.51 (18.11%) for BLEU-4, 1.38 (8.12%) for CIDEr, and 3.64 (11.95%) for SPICE.

5.5 Experiment on Keyword Generation

To test the effectiveness of the proposed contrastive learning modules, we constructed a real-world adword dataset, based on an advertising platform (Edelman et al., 2007). The goal is to display a list of ads that matches the user intent, for which the first step is to retrieve relevant keywords provided by advertisers given a user query. The dataset contains 72,876 training samples, 10,000 dev samples, and 10,000 test samples from a major search engine, with each sample corresponding to a query–keyword pair. Titles of the top-two web search results of the query from the search engine are kept as prototypes.

We fine-tune BART models following the same sequence generation experimental design. The results are shown in Table 7.

From the first two lines, we see that directly appending the retrieved information to the source does not lead to noticeable improvements, almost certainly because of noise in the retrieved results. However, our contrastive modules alleviate the effects of noise, and beat BART on all metrics.

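| Model | ROUGE-2/L | BLEU-3/4 | AVG |
|---|---|---|---|
| BART | 33.03/60.17 | 31.61/25.03 | 38.96 |
| BART + retrieved prototypes | 33.68/60.31 | 31.69/24.75 | 39.70 |
| BART + retrieved prototypes + contrastive modules | 35.18/61.24 | 33.60/26.78 | 41.44 |

Table 7: Experimental results on ad keyword generation.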

6 Conclusion

In this paper, we present KFCNet: a novel knowledge filtering and contrastive learning model for retrieval-augmented sequence generation, which achieves state-of-the-art results on the CommonGen benchmark. Two contrastive learning modules are proposed to capture global target semantics and learn general features from multiple retrieved prototypes. A prototype retrieval ablation study showed the effectiveness of the proposed filter for filtering low-quality candidates, and further experiments on ad keyword generation showed that our model has potential commercial value. In the future, we plan to extend the contrastive module to more general settings, such as natural language understanding and representation learning.


References

  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV - 14th European Conference, Part V, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Vol. 9909, pp. 382–398. Cited by: §4.3.3.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 15509–15519. Cited by: §2.2.2.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL, J. Goldstein, A. Lavie, C. Lin, and C. R. Voss (Eds.), pp. 65–72. Cited by: §4.3.3.
  • H. Bao, L. Dong, F. Wei, W. Wang, N. Yang, X. Liu, Y. Wang, J. Gao, S. Piao, M. Zhou, and H. Hon (2020) UniLMv2: pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML, pp. 642–652. Cited by: §4.3.2, Table 2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton (Eds.), pp. 632–642. Cited by: §4.2.1.
  • Z. Cao, W. Li, S. Li, and F. Wei (2018) Retrieve, rerank and rewrite: soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, I. Gurevych and Y. Miyao (Eds.), pp. 152–161. Cited by: §2.1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020) A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119, pp. 1597–1607. Cited by: §2.2.2.
  • Z. Chi, L. Dong, F. Wei, N. Yang, S. Singhal, W. Wang, X. Song, X. Mao, H. Huang, and M. Zhou (2020) InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. CoRR abs/2007.07834. Cited by: §2.2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. Cited by: §1.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 13042–13054. Cited by: §4.3.2, Table 2.
  • B. Edelman, M. Ostrovsky, and M. Schwarz (2007) Internet advertising and the generalized second-price auction: selling billions of dollars worth of keywords. American Economic Review 97 (1), pp. 242–259. Cited by: §5.5.
  • Z. Fan, Y. Gong, Z. Wei, S. Wang, Y. Huang, J. Jiao, X. Huang, N. Duan, and R. Zhang (2020) An enhanced knowledge injection model for commonsense generation. In Proceedings of the 28th International Conference on Computational Linguistics, COLING, D. Scott, N. Bel, and C. Zong (Eds.), pp. 2014–2025. Cited by: §1, §2.1, §4.3.2, Table 2.
  • H. Fang and P. Xie (2020) CERT: contrastive self-supervised learning for language understanding. CoRR abs/2005.12766. Cited by: §2.2.1.
  • T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. CoRR abs/2104.08821. Cited by: §2.2.1.
  • J. Guan, Y. Wang, and M. Huang (2019) Story ending generation with incremental encoding and commonsense knowledge. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 6473–6480. Cited by: §2.1.
  • B. Gunel, J. Du, A. Conneau, and V. Stoyanov (2020) Supervised contrastive learning for pre-trained language model fine-tuning. CoRR abs/2011.01403. Cited by: §2.2.1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020a) Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition, CVPR, pp. 9726–9735. Cited by: §2.2.2, §3.4.
  • P. He, X. Liu, J. Gao, and W. Chen (2020b) DeBERTa: decoding-enhanced BERT with disentangled attention. CoRR abs/2006.03654. Cited by: §1.
  • O. J. Hénaff (2020) Data-efficient image recognition with contrastive predictive coding. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119, pp. 4182–4192. Cited by: §2.2.2.
  • N. Hossain, M. Ghazvininejad, and L. Zettlemoyer (2020) Simple and effective retrieve-edit-rerank text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 2532–2538. Cited by: §2.1.
  • V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 6769–6781. Cited by: §2.1.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. In Advances in Neural Information Processing Systems, NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). Cited by: §3.5, §5.2.
  • R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-captioning events in videos. In International Conference on Computer Vision, ICCV, pp. 706–715. Cited by: §4.2.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020a) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 7871–7880. Cited by: §1, §2.1, §3.3.1, §4.3.2, Table 2.
  • P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020b) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). Cited by: §1, §2.1.
  • B. Y. Lin, M. Shen, W. Zhou, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2020) CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Conference on Automated Knowledge Base Construction, AKBC, D. Das, H. Hajishirzi, A. McCallum, and S. Singh (Eds.). Cited by: §1, §4.1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.3.3.
  • Y. Liu, Y. Wan, L. He, H. Peng, and P. S. Yu (2020) KG-BART: knowledge graph-augmented BART for generative commonsense reasoning. CoRR abs/2009.12677. Cited by: §2.1, §4.3.2, Table 2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §1.
  • Y. Meng, C. Xiong, P. Bajaj, S. Tiwary, P. Bennett, J. Han, and X. Song (2021) COCO-LM: correcting and contrasting text sequences for language model pretraining. CoRR abs/2102.08473. Cited by: §2.2.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL, pp. 311–318. Cited by: §4.3.3.
  • W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou (2020) ProphetNet: predicting future n-gram for sequence-to-sequence pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 2401–2410. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §3.3.1, §4.3.2, Table 2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. Cited by: §3.3.1, §4.3.2, Table 2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), pp. 211–252. Cited by: §2.2.2.
  • K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2020) WinoGrande: an adversarial Winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pp. 8732–8740. Cited by: §1.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, I. Gurevych and Y. Miyao (Eds.), pp. 2556–2565. Cited by: §4.2.1.
  • Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019) ERNIE: enhanced representation through knowledge integration. CoRR abs/1904.09223. Cited by: §1.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4149–4158. Cited by: §1.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 809–819. Cited by: §1.
  • Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In ECCV - 16th European Conference, Proceedings, Part XI, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Vol. 12356, pp. 776–794. Cited by: §2.2.2.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: Consensus-based image description evaluation. In Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4566–4575. Cited by: §4.3.3.
  • X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Y. Wang (2019) VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In International Conference on Computer Vision, ICCV, pp. 4580–4590. Cited by: §4.2.1.
  • X. Wei, R. Weng, Y. Hu, L. Xing, H. Yu, and W. Luo (2021) On learning universal representations across languages. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Cited by: §2.2.1.
  • J. Weston, E. Dinan, and A. H. Miller (2018) Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2nd International Workshop on Search-Oriented Conversational AI, SCAI@EMNLP, A. Chuklin, J. Dalton, J. Kiseleva, A. Borisov, and M. S. Burtsev (Eds.), pp. 87–92. Cited by: §2.1.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pp. 1112–1122. Cited by: §4.2.1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3733–3742. Cited by: §2.2.2.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5754–5764. Cited by: §1.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 93–104. Cited by: §1.