Pre-trained language models have achieved impressive results across a wide range of NLP tasks (Devlin et al., 2019; Yang et al., 2019; Sun et al., 2019; Liu et al., 2019; Lewis et al., 2020a; Qi et al., 2020; He et al., 2020b). However, their ability to accurately reflect factual knowledge or perform logical inference is still limited. To investigate the ability of systems to capture commonsense knowledge, datasets such as CommonsenseQA (Talmor et al., 2019), SWAG (Zellers et al., 2018), and WinoGrande (Sakaguchi et al., 2020) have been proposed. Separate from these discriminative tasks that require models to choose the correct option from multiple candidates, CommonGen (Lin et al., 2020) is framed as a generation task, and requires the system to generate a logical and coherent sentence describing an everyday scenario based on a given concept set. Experiments show that state-of-the-art generation models are not adequate or accurate enough to generate plausible sentences or reflect commonsense assumptions in this setting.
External knowledge provides not only information about the sorts of relationships that hold between concepts, to potentially guide generation models in capturing the implicit logic between concepts, but also interpretability. Inspired by Lewis et al. (2020b) and Fan et al. (2020), we adopt a retrieval-and-generation framework, and propose a BERT-filter and two contrastive learning modules for retrieval and generation, respectively.
For retrieval, previous research (Lewis et al., 2020b)
has shown that traditional sparse vector space models, such as TF-IDF and BM25, perform better than dense representation-based retrieval on heavily entity-centric tasks such as FEVER (Thorne et al., 2018). However, while sparse vector space retrieval models can retrieve relevant prototypes that contain a set of concepts, there can be significant domain mismatches between the retrieved results and the target distribution, making it difficult for generation models to bridge between prototypes and targets. We argue that a two-stage retrieval strategy alleviates this issue by combining sparse vector space search and dense representation filters. First, a sparse vector retrieval model is used to find passage candidates with high coverage of concept words, and then a dense vector-based filter is applied to score the candidates and filter out low-quality prototypes.
For generation, we apply contrastive learning to the encoder and the decoder of a general encoder–decoder architecture. The core idea of contrastive learning is to construct positive and negative samples from an anchor sample, and to draw the anchor and positive samples together while pushing the anchor away from all negative samples in the embedding space during training. Given that high-quality prototypes can be used as clusters of positive samples, we propose a decoder contrastive module that minimizes the distance between decoded sentence representations generated from distinct prototypes retrieved for the same concept set. Common scenario information and abstract concept relationships can be learned from the contrasts between different prototypes. Moreover, we propose an encoder contrastive module that forces the encoder to learn a sentence representation and save it into a global token which is visible to the decoder during decoding. In this way, global sentence-level semantics can be captured better.
The main contributions of this work are threefold: (1) we demonstrate that adding a high-quality matching model to the word overlap-based retriever benefits entity-centric retrieval tasks; (2) we propose two contrastive learning modules that can be applied to a general encoder–decoder generation model; and (3) we conduct experiments on CommonGen and an ad keyword generation task, and show that our method achieves large-scale improvements on both tasks.
2 Related Work
2.1 Knowledge Enhanced Generation
There is significant work on incorporating external knowledge from knowledge bases and incorporating retrieved information in language generation tasks (Weston et al., 2018; Cao et al., 2018; Guan et al., 2019; Hossain et al., 2020). Lewis et al. (2020b) explore a general-purpose fine-tuning recipe for retrieval-augmented generation that combines a dense passage retriever (Karpukhin et al., 2020) with a BART (Lewis et al., 2020a) generator. For commonsense generation, Liu et al. (2020)
propose a knowledge graph-augmented language generation model that encompasses concepts from a knowledge graph, and produces more logical and natural sentences. Fan et al. (2020) propose to retrieve prototypes based on sparse vector similarity, and introduce a scaling module and a prototype position indicator to explicitly deal with retrieval noise.
This paper differs in proposing a two-stage retrieval strategy and in applying contrastive learning to make better use of prototypes for generation.
2.2 Contrastive Learning
Recently, contrastive learning has achieved remarkable results in many self-supervised and supervised learning tasks, primarily for computer vision. The two key elements of contrastive learning are: (1) the construction of positive and negative samples; and (2) the learning framework.
2.2.1 Sample Construction
Usually in contrastive learning, positive samples are augmented forms of anchor data points, and negative samples are augmented forms of other data points. In NLP, Meng et al. (2021) create positive samples by masking and cropping tokens from sentences; Gunel et al. (2020) and Fang and Xie (2020) use back-translation to create positive augmentations of original sentences; Chi et al. (2020) and Wei et al. (2021) regard parallel sentences distributed in one or multiple languages as different views of the same semantics to learn cross-lingual representations; and Gao et al. (2021) demonstrate that constructing positive pairs with only standard dropout as minimal data augmentation works surprisingly well on the NLI task. Distinct from these methods, we propose to create positive sample pairs from retrieved results.
2.2.2 Learning Framework
Previous contrastive learning methods have required either specialized architectures (Bachman et al., 2019; Hénaff, 2020) or a memory bank to store large volumes of negative samples (Wu et al., 2018; Tian et al., 2020). Chen et al. (2020) propose a simple framework that achieves strong results on ImageNet (Russakovsky et al., 2015) without using a specialized architecture or a memory bank. However, it requires a large batch size to yield high performance, which is computationally prohibitive. Moco (He et al., 2020a) addresses this issue by maintaining a queue of data samples as the memory bank, enqueuing the encoded representations of the current mini-batch and dequeuing the oldest representations on each iteration. They further propose a momentum encoder to maintain the consistency of representations in the queue. In this work, we use the Moco framework to train our contrastive learning modules.
3 Method

In this section, we detail our method: Knowledge Filtering and Contrastive learning Network (KFCNet). First, we introduce our prototype retrieval strategy together with the knowledge filter model. Then we present our generative model, based on an encoder–decoder architecture with two contrastive learning modules. Finally, we show how we adapt the Moco contrastive learning framework, and deal with multiple positive samples.
3.1 Task Formulation
We use $x = \{c_1, c_2, \dots, c_m\}$ to denote a set of concepts, where $c_i \in \mathcal{C}$, and $\mathcal{C}$ is the concept vocabulary, and use $\mathcal{X}$ to denote all possible concept sets. The commonsense generation task is to learn a function $f : \mathcal{X} \rightarrow \mathcal{Y}$ that maps the concept set $x$ to a sentence $y$, where $y \in \mathcal{Y}$, and $\mathcal{Y}$ is the target sentence space. The generated sentence $y$ must be a plausible sentence that describes a common scenario in our daily life based on the contents of $x$.
3.2 Prototype Retrieval and Filtering
In order to retrieve prototypes that contain the concepts in a given concept set while keeping the retrieval results and target sentences semantically as similar as possible, we use a two-stage retrieval strategy combining sparse vector space search and dense representation matching. In Stage 1, a sparse vector retrieval model is used to retrieve $N$ candidate prototypes from the corpus $D$, where $N > k$. Then in Stage 2, a dense representation-based scorer is used to score the candidates, and the top-$k$ scored candidates are chosen as the final prototypes.
3.2.1 Stage 1
Given a concept set $x$, we first split corpus $D$ into parts $D_1, \dots, D_m$ according to the number of concepts each sentence contains, where sentences in $D_j$ contain $j$ concepts in $x$. Given that most concepts in $x$ are verb and noun lemmas, we pre-process $D$ based on lemmatization and stemming. Then we choose $N$ sentences as candidates from the parts, prioritized such that $D_m \succ D_{m-1} \succ \dots \succ D_1$.
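As a concrete illustration, the Stage-1 bucketing can be sketched as follows. This is a minimal Python sketch with our own helper names; the crude suffix-stripping `normalize` merely stands in for the real lemmatization and stemming pre-processing:

```python
from collections import defaultdict

def stage1_candidates(concepts, corpus, n_candidates,
                      normalize=lambda w: w.lower().rstrip("s")):
    """Bucket corpus sentences by how many query concepts they contain,
    then pick candidates from the fullest buckets first (D_m > ... > D_1).

    `normalize` is a placeholder for lemmatization/stemming."""
    targets = {normalize(c) for c in concepts}
    buckets = defaultdict(list)  # coverage count -> sentences
    for sent in corpus:
        tokens = {normalize(t) for t in sent.split()}
        coverage = len(targets & tokens)
        if coverage > 0:
            buckets[coverage].append(sent)
    candidates = []
    # Prioritize sentences covering the most concepts.
    for coverage in sorted(buckets, reverse=True):
        for sent in buckets[coverage]:
            if len(candidates) == n_candidates:
                return candidates
            candidates.append(sent)
    return candidates
```

In practice the Stage-1 matcher is a BM25 search rather than this exact set intersection, but the bucket prioritization is the same idea.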
3.2.2 Stage 2
After retrieving candidate prototypes, a scorer is used to rank the candidates and filter out candidates that are far from the targets in embedding space. In this work, we use a BERT-based model as an encoder, and use the hidden state of the [CLS] token as the sample representation. The representation is then fed into a multi-layer perceptron with a scalar output as follows:

$$r_i = \mathrm{BERT}_{[\mathrm{CLS}]}(t_i), \qquad s_i = \mathrm{MLP}(r_i)$$

where $t_i$ is the training sample created from the concept set and candidates/targets, $r_i$ is the sample representation, and $s_i$ is the final score of the sample. Theoretically, the label of the training set can be any real number in the range $[0, 1]$, but we find that it is sufficient to train the scorer as a simple binary classifier. We create the positive samples by combining a concept set with each corresponding target sentence, and create the negative samples by combining the concept set with a different candidate prototype or a random sentence from $D$. During inference, we score all candidates and choose the candidates with the highest scores as the prototypes.
For all experiments in this paper, we set $N = 8$ and $k = 2$. $k = 2$ means that for each input, we construct one positive sample pair, which is widely used in contrastive learning work. $N = 8$ is because experience shows that at least 2 high-quality prototypes can be retrieved with 8 candidates.
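A minimal sketch of the Stage-2 filter follows, assuming a BERT-style encoder whose first hidden state corresponds to [CLS]; the toy encoder and all class/variable names here are our own illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class PrototypeScorer(nn.Module):
    """Sketch of the Stage-2 filter: the encoder's [CLS] representation is
    fed into an MLP with a scalar output, trained as a binary classifier.
    `encoder` is any module mapping token ids to hidden states; in the
    paper this would be a pre-trained BERT."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)       # (batch, seq, hidden)
        cls_repr = hidden[:, 0, :]             # [CLS] is the first token
        return self.mlp(cls_repr).squeeze(-1)  # one score per sample

# Toy stand-in encoder so the sketch runs without downloading BERT.
class ToyEncoder(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
    def forward(self, ids):
        return self.emb(ids)

scorer = PrototypeScorer(ToyEncoder(100, 16), 16)
scores = scorer(torch.randint(0, 100, (8, 12)))  # N = 8 candidates
best = scores.argmax()                           # highest-scoring prototype
loss = nn.BCEWithLogitsLoss()(scores, torch.zeros(8))  # binary training signal
```

At inference time, the top-$k$ candidates by score are kept as prototypes.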
3.3 Contrastive Learning for Generation
3.3.1 Encoder–decoder Architecture
The encoder–decoder architecture is widely used in generation tasks. Compared to decoder-only generation models such as GPT-2 (Radford et al., 2019), where words are conditioned only on the left context, models using an encoder–decoder framework such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020) enable bidirectional interactions in the encoder, and auto-regressive generation with the decoder. In this work, we use BART with an auto-regressive objective for generation, as shown in the middle of Figure 1, and propose separate contrastive modules for the encoder and decoder, corresponding to the right and left sub-structures in the figure.
Typical generation inputs contain only the source sentence, which is the concept set $x$ in CommonGen. The difference here is that we append one of the retrieved prototypes to $x$ to enhance the input.
3.3.2 Encoder Contrastive Loss
Although BART learns bidirectional interactions using a transformer-based encoder and implements cross-attention at each layer of the decoder, the global target information is not explicitly learned during encoding, meaning that at each timestep of decoding the decoder needs to find important local information for the current step, without having access to the global goal of generation. Here, we propose to force the model to learn global target information during encoding and save it to a special token, using source–target contrastive learning. The special token is visible to the decoder via cross-attention at each timestep of decoding. Specifically, given a concept set $x$ and a target sentence $y$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we denote the retrieved prototypes as $p_1$ and $p_2$, where each $p_i$ is a complete prototype. We construct the input for encoder contrastive learning by concatenating $x$ with $p_1$ and $y$ with $p_1$, respectively. As illustrated in Figure 1, the concatenation of $x$ and $p_1$ is used as the input to the main encoder, which is followed by a decoder with gradients and the auto-regressive generation loss, and the concatenation of $y$ and $p_1$ is used as the input to another encoder without a decoder or gradients.
3.3.3 Decoder Contrastive Loss
At the retrieval step, multiple high-quality retrieval results are collected as prototypes to augment generation. Although these retrieved results substantially boost external information, they inevitably introduce noise. In order to learn general information associated with the concept set and suppress the noise in the prototypes, we propose a decoder contrastive learning module, which we apply to the sentence representation at the decoder output. Formally, we concatenate $x$ with $p_2$, a prototype for $x$ different from the one used in the main-branch BART model. Note also that, different from the main-branch model, here the gradient is not back-propagated.
3.4 Momentum Contrast with Memory Bank
Most existing training methods greatly limit the number of in-batch negative samples, limiting the potential of contrastive learning. To enable large-scale interactions between negative samples, we follow Moco (He et al., 2020a) in maintaining a dictionary as a queue of encoded/decoded data samples. The keys of the dictionary are samples from the data after encoding/decoding, and the queries are samples in the current mini-batch after encoding/decoding during training. Learning is formulated as minimizing the contrastive loss, which makes a query similar to its matching key and dissimilar to all other keys.
3.4.1 Memory Bank as Queues
We use two dictionaries to store the representations of the encoder and decoder output, respectively. In each training iteration, the newest encoded representations are enqueued and the oldest are dequeued, to maintain a fixed queue size. For each sample, the number of contrast pairs is the size of the queue, where usually only the matching key in the same mini-batch is positive, and all others are negative.
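The queue update can be sketched as follows, with illustrative sizes and our own names; like Moco, the sketch assumes the batch size divides the queue size:

```python
import torch

class MemoryBank:
    """FIFO queue of encoded representations (sketch). New mini-batch keys
    are enqueued and the oldest entries overwritten, so the bank keeps a
    fixed number of contrast candidates."""
    def __init__(self, queue_size, dim):
        self.queue = torch.randn(queue_size, dim)  # random init
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, keys):
        n = keys.size(0)
        assert self.queue.size(0) % n == 0  # simplification, as in Moco
        self.queue[self.ptr:self.ptr + n] = keys  # overwrite oldest slots
        self.ptr = (self.ptr + n) % self.queue.size(0)

bank = MemoryBank(queue_size=8, dim=4)
bank.enqueue_dequeue(torch.ones(4, 4))
```

Two such banks are kept, one for encoder outputs and one for decoder outputs.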
3.4.2 Dealing with Multiple Positive Samples
In the CommonGen task, the mapping between source sequences and gold targets can be many-to-many: $n$ independent sample pairs are created for 1-to-$n$ and $n$-to-1 source–target pairs, which can be distributed across mini-batches. However, these samples should all be regarded as mutually positive. To enable interactions between positive samples both intra- and inter-mini-batch, we assign each source–target pair an identity, which indicates pairs that share the same source or target. These identities are saved in another queue that is updated synchronously with the encoder and decoder memory banks.
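The identity-based positive matching can be sketched as follows (a hypothetical helper of our own, not the paper's code): any key in the bank whose stored identity matches a query's identity is treated as positive.

```python
import torch

def positive_mask(query_ids, key_ids):
    """Pairs sharing a source/target identity count as mutual positives,
    whether they come from the current mini-batch or the memory bank.
    Returns a (num_queries, num_keys) boolean mask."""
    return query_ids.unsqueeze(1) == key_ids.unsqueeze(0)

ids_in_batch = torch.tensor([7, 8])       # identities of current queries
ids_in_bank = torch.tensor([7, 9, 7, 8])  # identity queue beside the bank
mask = positive_mask(ids_in_batch, ids_in_bank)
```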
3.4.3 Momentum Updated Parameters
To keep the consistency of representations in the memory banks, we update the parameters of the key encoder and decoder with momentum. Formally, denote the parameters of the query encoder and decoder as $\theta_q^{e}$ and $\theta_q^{d}$, and the parameters of the key encoder and decoder as $\theta_k^{e}$ and $\theta_k^{d}$. The parameters $\theta_q^{e}$ and $\theta_q^{d}$ are updated by back-propagation, and the parameters $\theta_k^{e}$ and $\theta_k^{d}$ are updated by:

$$\theta_k^{e} \leftarrow m\,\theta_k^{e} + (1-m)\,\theta_q^{e}, \qquad \theta_k^{d} \leftarrow m\,\theta_k^{d} + (1-m)\,\theta_q^{d} \qquad (3)$$

Here, $m \in [0, 1)$ is a momentum coefficient which is set close to 1. In this way, the parameters of the key encoder and decoder evolve more smoothly than those of the query, which maintains the consistency of key representations in the memory bank.
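The momentum update described above can be sketched in PyTorch (the helper name is ours):

```python
import torch
import torch.nn as nn

def momentum_update(query_net: nn.Module, key_net: nn.Module, m: float = 0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q: the key network drifts
    slowly toward the query network, keeping bank entries consistent."""
    for q_param, k_param in zip(query_net.parameters(), key_net.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)
```

The same update is applied to both the key encoder and the key decoder after each training step.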
3.5 Training Objective
Consider a batch of query–key pairs $\{(q_i, k_i)\}$, where there is only one positive key $k_+$ for a given query $q$. After encoding, we fetch the representations of the last <EOS> tokens and apply a projection to them as:

$$q = g(h_{\mathrm{EOS}})$$

where $g(\cdot)$ is an MLP projection head and $h_{\mathrm{EOS}}$ is the hidden state of the last <EOS> token. The encoder contrastive loss function, called InfoNCE, is as follows:

$$\mathcal{L}_q = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\sum_{i=0}^{K} \exp(\mathrm{sim}(q, k_i)/\tau)} \qquad (7)$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity function, $\tau$ is a temperature hyper-parameter, and $i$ ranges over all indices in the memory bank. The denominator has $K+1$ total terms, including one positive and $K$ negative samples. Intuitively, the loss function is the log loss of a $(K+1)$-way softmax classifier that tries to classify $q$ as the positive key $k_+$. Eqn (7) is only able to deal with the case of a single positive key existing for each query. To generalize it to an arbitrary number of positives, inspired by SupCon (Khosla et al., 2020), we consider the following loss functions, which differ in whether the summation over positives is placed outside or inside the log:

$$\mathcal{L}_{\mathrm{out}} = -\frac{1}{|P(q)|} \sum_{p \in P(q)} \log \frac{\exp(\mathrm{sim}(q, k_p)/\tau)}{\sum_{i=0}^{K} \exp(\mathrm{sim}(q, k_i)/\tau)} \qquad (8)$$

$$\mathcal{L}_{\mathrm{in}} = -\log \frac{1}{|P(q)|} \sum_{p \in P(q)} \frac{\exp(\mathrm{sim}(q, k_p)/\tau)}{\sum_{i=0}^{K} \exp(\mathrm{sim}(q, k_i)/\tau)} \qquad (9)$$

where $P(q)$ is the set of indices of all keys that are positive for $q$.
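A sketch of the multi-positive contrastive loss with the summation over positives taken outside the log (the variant reported to work best in Section 5.2); function and variable names are ours, and the positive mask plays the role of $P(q)$:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query, keys, pos_mask, tau=0.1):
    """InfoNCE generalized to multiple positives, summing over positives
    outside the log. `keys` holds bank + in-batch keys; `pos_mask` flags
    which keys are positive for each query."""
    logits = query @ keys.t() / tau           # dot-product similarity
    log_prob = F.log_softmax(logits, dim=-1)  # (K+1)-way classifier
    # Average the negative log-likelihood over each query's positives.
    pos_counts = pos_mask.sum(dim=-1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=-1) / pos_counts

q = F.normalize(torch.randn(2, 8), dim=-1)   # projected query reps
k = F.normalize(torch.randn(16, 8), dim=-1)  # memory bank keys
mask = torch.zeros(2, 16)
mask[0, 0] = 1.0
mask[1, 1] = 1.0
loss = contrastive_loss(q, k, mask).mean()
```

With exactly one positive per query, this reduces to standard InfoNCE.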
The decoder contrastive loss can be obtained in the same way, except that the sentence representation is fetched from the <EOS> token after decoding. During training, we minimize the sum of the encoder contrastive loss, decoder contrastive loss, and the decoder auto-regressive generation loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{gen}} + \lambda_1 \mathcal{L}_{\mathrm{enc}} + \lambda_2 \mathcal{L}_{\mathrm{dec}}$$

Here, $\mathcal{L}_{\mathrm{gen}}$ denotes the cross-entropy generation loss, and $\lambda_1$ and $\lambda_2$ are tunable scalars.
During inference, we discard the momentum encoder and decoder, together with the projection layers.
4 Experiments

4.1 Dataset

CommonGen (Lin et al., 2020) contains 32,651/993/1,497 unique training/development/test concept sets, corresponding to 67,389 and 4,018 English target sentences in the training and development sets respectively, meaning that one concept set can map to multiple target sentences. The percentages of concept sets in the development and test sets that are unseen in the training set are 99.60% and 100.00%, respectively, making the dataset challenging for compositional generalization.
4.2 Prototype Collection
4.2.1 In-domain Corpus
As CommonGen was created from visually-grounded caption datasets that describe everyday scenarios, we build an in-domain corpus from datasets of image captions, video captions, and natural language inference. In detail, we extracted sentences from ActivityNet (Krishna et al., 2017), VaTeX (Wang et al., 2019), Conceptual Captions (Sharma et al., 2018), SNLI (Bowman et al., 2015), and MNLI (Williams et al., 2018) as the in-domain corpus ($D_{\mathrm{in}}$).
4.2.2 Out-of-domain Corpus
In addition to in-domain experiments, we create an out-of-domain corpus ($D_{\mathrm{out}}$) from Wikipedia (English Wikipedia dump from May 01, 2020), using spaCy (https://spacy.io/) as our sentence tokenizer.
For both corpora, sentences with fewer than 5 tokens or more than 100 tokens were removed. Table 1 shows the basic statistics of the two corpora. Although $D_{\mathrm{out}}$ is much larger than $D_{\mathrm{in}}$, sentences retrieved from $D_{\mathrm{in}}$ contain more required concepts on average than those from $D_{\mathrm{out}}$. Specifically, for concept sets of size 4 and 5, the retrieved sentences from $D_{\mathrm{in}}$ have 0.44 and 1.18 more relevant concepts, respectively, than those from $D_{\mathrm{out}}$.
4.3 Experimental Setup
4.3.1 Implementation Details
We employ the pre-trained BART-large model as the base generation model, and initialize the momentum encoder and decoder by copying parameters from the base model. We use the Adam optimizer with 0.1 weight decay, and select the initial learning rate from a small set of candidate values. We use the polynomial decay learning rate scheduler with 500 warmup steps, and set dropout to 0.1. We set the maximum tokens per batch to 3,000 and the maximum batch size to 48, with 15k total updates. For the auto-regressive generation loss, we use cross-entropy loss with a 0.1 label-smoothing penalty. During decoding, we use beam search with a beam size of 5 and a length penalty of 1.0.
For contrastive learning, we use an MLP as the projection network, with a single hidden layer of 1024d and the output size of 128d. We use Eqn (8) as the loss function, with similarity measured by dot-product, and set the temperature to 0.1. The queue size of the memory bank is set to 4096, and the momentum coefficient is set to 0.999.
4.3.2 Baselines

We use several state-of-the-art pre-trained language generation models as baselines: GPT-2 (Radford et al., 2019), BERT-Gen (Bao et al., 2020), UniLM (Dong et al., 2019), UniLM-v2 (Bao et al., 2020), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020a). All models are fine-tuned in seq2seq mode. We also compare our model with two strong baselines that use external knowledge: EKI (Fan et al., 2020) and KG-BART (Liu et al., 2020).
4.3.3 Evaluation Metrics
We adopt the standard generation metrics BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), in addition to evaluation metrics for captioning tasks, namely CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016). As all metrics score the output in the same range, we also present the average score across all metrics.
Table 2 presents the experimental results across all the metrics. (Note that the latest test set (v1.1) adds one more human reference to each example in the test set (v1.0), but is not publicly available. Additionally, EKI and KG-BART were evaluated on v1.0, so this is what we use for our experiments.) We observe the following: (1) Methods in the 2nd, 3rd, and 4th blocks of Table 2 that use external knowledge outperform the fine-tuned pre-trained language models in the first block. This demonstrates that external knowledge benefits commonsense reasoning and generation. (2) The overall performance of EKI and our method (KFCNet), which both use natural sentences as prototypes, is better than KG-BART, which incorporates structured knowledge from knowledge bases. We hypothesize that this is because pre-trained language models like BART can more easily exploit natural language samples than structured information, even with elaborate modules for information fusion. (3) Prototypes retrieved from the in-domain corpus result in better performance than those from the out-of-domain corpus. (4) Simply fine-tuning BART on our retrieved prototypes beats the previously published state of the art on several metrics, and using filtered prototypes boosts performance further. On the one hand, this shows that the quality of prototypes has a large impact on generation; on the other hand, it indicates that our retrieval method is better than that of EKI, and that our filter helps in selecting good prototypes. (5) Our KFCNet achieves new state-of-the-art results, surpassing all other methods by a large margin.
5.1 Ablation Study
To better understand the impact of the different modules in KFCNet, we perform a number of ablation experiments.
5.1.1 Retrieval and Filter
Prototype retrieval is a key part of any retrieval-based generation model. To assess the effectiveness of the retrieval-and-filter mechanism, we retrieve prototypes from the in-domain corpus and run ablations on a single BART model. Table 3 shows the results. Compared to models without retrieval, using prototypes retrieved by a simple BM25 model improves generation performance, which we suggest is due to the retrieved prototypes helping the model to better capture relationships between concepts, and construct a coherent scenario. With word lemmatization and stemming, the variety of the retrieval results increases, resulting in better prototypes. Adding a BERT filter boosts the performance again, achieving 5.38, 1.36, and 2.30 absolute improvements for BLEU-4, CIDEr, and SPICE. This verifies the effectiveness of using a high-quality matching model as an auxiliary module for a word overlap-based retriever.
5.1.2 Contrastive Learning
The contrastive loss plays an important role in our model. We perform an ablation study on the development set of CommonGen, by comparing the model without the contrastive module, using only encoder contrastive learning, and using both encoder and decoder contrastive learning. As shown in Table 4, using only encoder contrastive learning leads to improvements over the baseline BART model, and adding decoder contrastive learning further improves results based on BLEU-4 and CIDEr.
5.2 Similarity Function and Summation Location
We further compare the performance of different similarity functions and positive summation locations, as mentioned in Section 3.5. The results in Table 5 demonstrate that the combination of dot-product similarity with summation outside of the log function performs best, consistent with the findings of Khosla et al. (2020).
In Table 5, "cos" denotes cosine similarity.
5.3 Model Efficiency
5.3.1 Retrieval

Prototype retrieval is performed separately from the generation model, and the retrieval time consists of two parts: (1) sparse vector matching time, in the form of BM25 search; and (2) BERT filter inference, for fine-grained selection. Note that only a few candidates (8 in our experiments) remain after Stage 1, and these can be processed in a single mini-batch.
5.3.2 Contrastive Module
During training, the momentum encoder and decoder parameters are updated by Eqn (3), and there are no gradients or back-propagation in these modules. Therefore, training takes no more than double the time of training without the contrastive modules. During inference, the contrastive modules are disabled, and hence efficiency does not decrease.
5.4 Final Leaderboard Results
Table 6 shows the final evaluation results on the latest test set with additional human references (v1.1; leaderboard: https://inklab.usc.edu/CommonGen/leaderboard.html). Note that the model in second place (RE-T5) expands the original training data and performs continued pretraining before fine-tuning on CommonGen. Our method, KFCNet, performs best on all metrics. Among all fine-tuned methods, KFCNet beats the previous state of the art by a large margin: 6.51 (18.11%) for BLEU-4, 1.38 (8.12%) for CIDEr, and 3.64 (11.95%) for SPICE.
5.5 Experiment on Keyword Generation
To test the effectiveness of the proposed contrastive learning modules, we constructed a real-world ad keyword dataset, based on an advertising platform (Edelman et al., 2007). The goal is to display a list of ads that match the user intent, for which the first step is to retrieve relevant keywords provided by advertisers given a user query. The dataset contains 72,876 training samples, 10,000 dev samples, and 10,000 test samples from a major search engine, with each sample corresponding to a query–keyword pair. The titles of the top-two web search results for the query from the search engine are kept as prototypes.
We fine-tune BART models following the same sequence generation experimental design. The results are shown in Table 7.
From the first two lines, we see that directly appending the retrieved information to the source does not lead to noticeable improvements, almost certainly because of noise in the retrieved results. However, our contrastive modules alleviate the effects of noise, and beat BART on all metrics.
6 Conclusion

In this paper, we presented KFCNet: a novel knowledge filtering and contrastive learning model for retrieval-augmented sequence generation, which achieves state-of-the-art results on the CommonGen benchmark. Two contrastive learning modules were proposed to capture global target semantics and to learn general features from multiple retrieved prototypes. A prototype retrieval ablation study showed the effectiveness of the proposed filter in filtering out low-quality candidates, and further experiments on ad keyword generation showed that our model has potential commercial value. In the future, we plan to extend the contrastive module to more general settings, such as natural language understanding and representation learning.
References

- SPICE: semantic propositional image caption evaluation. In ECCV, Part V, Vol. 9909, pp. 382–398.
- Learning representations by maximizing mutual information across views. In NeurIPS, pp. 15509–15519.
- METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL, pp. 65–72.
- UniLMv2: pseudo-masked language models for unified language model pre-training. In ICML, pp. 642–652.
- A large annotated corpus for learning natural language inference. In EMNLP, pp. 632–642.
- Retrieve, rerank and rewrite: soft template based neural summarization. In ACL, pp. 152–161.
- A simple framework for contrastive learning of visual representations. In ICML, Vol. 119, pp. 1597–1607.
- InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. CoRR abs/2007.07834.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186.
- Unified language model pre-training for natural language understanding and generation. In NeurIPS, pp. 13042–13054.
- Internet advertising and the generalized second-price auction: selling billions of dollars worth of keywords. American Economic Review 97(1), pp. 242–259.
- An enhanced knowledge injection model for commonsense generation. In COLING, pp. 2014–2025.
- CERT: contrastive self-supervised learning for language understanding. CoRR abs/2005.12766.
- SimCSE: simple contrastive learning of sentence embeddings. CoRR abs/2104.08821.
- Story ending generation with incremental encoding and commonsense knowledge. In AAAI, pp. 6473–6480.
- Supervised contrastive learning for pre-trained language model fine-tuning. CoRR abs/2011.01403.
- Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9726–9735.
- DeBERTa: decoding-enhanced BERT with disentangled attention. CoRR abs/2006.03654.
- Data-efficient image recognition with contrastive predictive coding. In ICML, Vol. 119, pp. 4182–4192.
- Simple and effective retrieve-edit-rerank text generation. In ACL, pp. 2532–2538.
- Dense passage retrieval for open-domain question answering. In EMNLP, pp. 6769–6781.
- Supervised contrastive learning. In NeurIPS.
- Dense-captioning events in videos. In ICCV, pp. 706–715.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, pp. 7871–7880.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
- CommonGen: a constrained text generation challenge for generative commonsense reasoning. In AKBC.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
- KG-BART: knowledge graph-augmented BART for generative commonsense reasoning. CoRR abs/2009.12677.
- RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- COCO-LM: correcting and contrasting text sequences for language model pretraining. CoRR abs/2102.08473.
- BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- ProphetNet: predicting future n-gram for sequence-to-sequence pre-training. In Findings of EMNLP, pp. 2401–2410.
- Language models are unsupervised multitask learners. OpenAI Blog 1(8), pp. 9.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67.
- ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), pp. 211–252.
- WinoGrande: an adversarial Winograd schema challenge at scale. In AAAI, pp. 8732–8740.
- Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565.
- ERNIE: enhanced representation through knowledge integration. CoRR abs/1904.09223. External Links: Cited by: §1.
- CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4149–4158. External Links: Cited by: §1.
- FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 809–819. External Links: Cited by: §1.
- Contrastive multiview coding. In ECCV - 16th European Conference, Proceedings, Part XI, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Vol. 12356, pp. 776–794. External Links: Cited by: §2.2.2.
- CIDEr: Consensus-based image description evaluation. In Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4566–4575. External Links: Cited by: §4.3.3.
- VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In International Conference on Computer Vision, ICCV, pp. 4580–4590. External Links: Cited by: §4.2.1.
- On learning universal representations across languages. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Cited by: §2.2.1.
- Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2nd International Workshop on Search-Oriented Conversational AI, SCAI@EMNLP, A. Chuklin, J. Dalton, J. Kiseleva, A. Borisov, and M. S. Burtsev (Eds.), pp. 87–92. External Links: Cited by: §2.1.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT), pp. 1112–1122. External Links: Cited by: §4.2.1.
- Unsupervised feature learning via non-parametric instance discrimination. In Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3733–3742. External Links: Cited by: §2.2.2.
- XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5754–5764. External Links: Cited by: §1.
- SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 93–104. External Links: Cited by: §1.