Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

11/03/2020 · Beliz Gunel et al. · Stanford University

State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. Cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, the SCL loss we propose obtains improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in both the high-data and low-data regimes, and it does not require any specialized architecture, data augmentation of any kind, memory banks, or additional unsupervised data. We also demonstrate that the new objective leads to models that are more robust to different levels of noise in the training data, and can generalize better to related tasks with limited labeled task data.


1 Introduction

State-of-the-art performance on most existing natural language processing (NLP) classification tasks is currently achieved by systems that are first pre-trained on auxiliary language modeling tasks and then fine-tuned on the task of interest with cross-entropy loss (Radford et al., 2019; Howard and Ruder, 2018; Liu et al., 2019; Devlin et al., 2019). Although commonly used, cross-entropy loss – the KL-divergence between the one-hot label vectors and the model's output distribution – has several shortcomings. Cross-entropy loss leads to poor generalization performance due to poor margins (Liu et al., 2016; Cao et al., 2019), and it lacks robustness to noisy labels (Zhang and Sabuncu, 2018; Sukhbaatar et al., 2014) or adversarial examples (Elsayed et al., 2018; Nar et al., 2019). Effective alternatives have been proposed that change the reference label distributions through label smoothing (Szegedy et al., 2016; Müller et al., 2019), Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), knowledge distillation (Hinton et al., 2015), or self-training (Yalniz et al., 2019; Xie et al., 2020).

Additionally, it has recently been demonstrated in NLP that fine-tuning with cross-entropy loss tends to be unstable (Zhang et al., 2020; Dodge et al., 2020), especially when supervised data is limited, a scenario in which pre-training is particularly helpful. To tackle the issue of unstable fine-tuning, recent work proposes local smoothness-inducing regularizers (Jiang et al., 2020) and regularization methods inspired by trust-region theory (Aghajanyan et al., 2020) to prevent representation collapse that leads to poor generalization performance. Empirical analysis suggests that fine-tuning for longer, reinitializing the top few layers (Zhang et al., 2020), and using a debiased Adam optimizer during fine-tuning (Mosbach et al., 2020) can make the fine-tuning procedure more stable.

We are inspired by the learning strategy that humans deploy when given a few examples – they try to find the commonalities between the examples of each class and contrast them with examples from other classes. We hypothesize that a similarity-based loss will be able to hone in on the important dimensions of the multidimensional hidden representations, leading to better few-shot learning results and more stable fine-tuning of pre-trained models. We propose a novel objective for fine-tuning pre-trained language models that includes a supervised contrastive learning term, which pushes examples from the same class close together and examples from different classes further apart. The new term is similar to the contrastive objectives used for self-supervised representation learning in the image, speech, and video domains (Sohn, 2016; Oord et al., 2018; Wu et al., 2018; Bachman et al., 2019; Hénaff et al., 2019; Baevski et al., 2020; Conneau et al., 2020; Tian et al., 2019; Hjelm et al., 2019; Han et al., 2019; He et al., 2020; Misra and Maaten, 2020; Chen et al., 2020a, b). In contrast to these methods, however, we use a contrastive objective for supervised learning of the final task, instead of contrasting different augmented views of examples.

Adding the supervised contrastive learning (SCL) term to the fine-tuning objective improves performance on several natural language understanding tasks from the GLUE benchmark (Wang et al., 2019), including SST-2, CoLA, MRPC, RTE, and QNLI, over state-of-the-art models fine-tuned with cross-entropy loss. The improvements are particularly strong in few-shot learning settings (20, 100, or 1000 labeled examples), and models trained with SCL are not only robust to noise in the training data, but also generalize better to related tasks with limited labeled data. Our approach does not require any specialized architectures (Bachman et al., 2019; Hénaff et al., 2019), memory banks (Wu et al., 2018; Tian et al., 2019; Misra and Maaten, 2020), data augmentation of any kind, or additional unsupervised data. To the best of our knowledge, our work is the first to successfully integrate a supervised contrastive learning objective into the fine-tuning of pre-trained language models. Our main contributions are as follows:

  • We propose a novel objective for fine-tuning of pre-trained language models that includes a supervised contrastive learning term, as described in Section 2.

  • We show that our proposed objective improves over cross-entropy loss on several natural language classification tasks of the GLUE benchmark (Wang et al., 2019), including SST-2, CoLA, MRPC, RTE, and QNLI, as shown in Table 2, yielding improvements of up to 1.2 points.

  • We obtain strong improvements in few-shot learning settings (20, 100, and 1000 labeled examples), as shown in Table 4, with gains of up to 10.7 points when only 20 labeled examples are available.

  • We demonstrate that our proposed objective is more robust across augmented training datasets with varying noise levels, as shown in Table 5, leading to an average improvement of 7 points on MNLI across the augmented training sets.

  • We show that task models fine-tuned with our proposed objective have better generalization ability to a related task with limited labeled data, as shown in Table 7, leading to a 2.9-point improvement on Amazon-2, along with a significant reduction in variance across few-shot training samples, when transferred from the source SST-2 task model.

2 Approach

We propose a novel objective that includes a supervised contrastive learning term for fine-tuning pre-trained language models. The loss is meant to capture similarities between examples of the same class and contrast them with examples from other classes.

We work with a batch of N training examples {(x_i, y_i)}_{i=1,...,N}. Φ(x_i) denotes the normalized embedding of the final encoder hidden layer before the softmax projection; N_{y_i} is the total number of examples in the batch that have the same label as x_i; τ is an adjustable scalar temperature parameter that controls the separation of classes; and λ is a scalar weighting hyperparameter that we tune for each downstream task. The loss is given by the following formulas:

\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{SCL}}    (1)

\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \cdot \log \hat{y}_{i,c}    (2)

\mathcal{L}_{\mathrm{SCL}} = \sum_{i=1}^{N} -\frac{1}{N_{y_i}-1}\sum_{j=1}^{N} \mathbf{1}_{i \neq j}\, \mathbf{1}_{y_i = y_j}\, \log \frac{\exp\left(\Phi(x_i)\cdot\Phi(x_j)/\tau\right)}{\sum_{k=1}^{N} \mathbf{1}_{i \neq k}\, \exp\left(\Phi(x_i)\cdot\Phi(x_k)/\tau\right)}    (3)

where C is the number of classes, y_{i,c} is the one-hot label, and ŷ_{i,c} is the model's predicted probability for class c.

The overall loss is a weighted average of CE and the SCL loss, as given in equation (1). The canonical definition of CE that we use is given in equation (2). The novel SCL loss is given in equation (3).
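To make the formulation concrete, below is a minimal PyTorch sketch of the combined objective. It is an illustrative implementation of equations (1)-(3), not the exact training code used in our experiments; the default values for λ and τ are placeholders rather than tuned settings.

```python
import torch
import torch.nn.functional as F

def scl_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.3) -> torch.Tensor:
    """Supervised contrastive term of equation (3).

    embeddings: (N, d) normalized [CLS] embeddings from the encoder.
    labels:     (N,) integer class labels.
    """
    n = embeddings.size(0)
    sim = embeddings @ embeddings.t() / tau                        # pairwise dot products / temperature
    not_self = ~torch.eye(n, dtype=torch.bool, device=sim.device)  # the 1_{i != k} mask
    # log of each row's denominator: sum over k != i of exp(sim_ik)
    log_denom = torch.log((torch.exp(sim) * not_self).sum(dim=1, keepdim=True))
    log_prob = sim - log_denom
    # Positive-pair mask 1_{y_i = y_j} with i != j; its row sum equals N_{y_i} - 1.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    num_pos = pos.sum(dim=1)
    per_anchor = -(pos * log_prob).sum(dim=1) / num_pos.clamp(min=1)
    return per_anchor[num_pos > 0].sum()                           # anchors without positives are skipped

def combined_loss(logits, embeddings, labels, lam=0.9, tau=0.3):
    """Weighted combination of CE and SCL, as in equation (1)."""
    ce = F.cross_entropy(logits, labels)
    scl = scl_loss(F.normalize(embeddings, dim=-1), labels, tau)
    return (1.0 - lam) * ce + lam * scl
```

Note that the SCL term only receives signal from anchors that have at least one other example of the same class in the batch, which is why batch size matters for this objective (see Section 5.1).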

This loss can be applied using a variety of encoders Φ – for example, a ResNet for a computer vision application or a pre-trained large language model such as BERT for an NLP application. In this work, we focus on fine-tuning pre-trained language models for single-sentence and sentence-pair classification. For single-sentence classification, each example x_i consists of a sequence of tokens prepended with the special [CLS] token, and the sequence length L is constrained by the maximum input length of the encoder. Similarly, for sentence-pair classification tasks, each example x_i is a concatenation of two token sequences corresponding to the two sentences, with special tokens delimiting them: [CLS] s_1 [SEP] s_2 [SEP]; the length of the concatenated sequences is constrained in the same way. In both cases, Φ(x_i) uses the embedding of the [CLS] token as the representation for example x_i. These settings follow standard practices for fine-tuning pre-trained language models for classification (Devlin et al., 2019; Liu et al., 2019).
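As an illustration of this input construction, a sketch using the Hugging Face tokenizer interface is shown below; this is our own example for clarity (our experiments use fairseq, as described in Section 4), and the sample sentences are made up.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Single-sentence classification: [CLS] s [SEP]
# (RoBERTa's equivalents of [CLS]/[SEP] are <s> and </s>.)
single = tokenizer("a gripping and beautifully shot film", truncation=True)

# Sentence-pair classification: [CLS] s1 [SEP] s2 [SEP]
pair = tokenizer("Brain tissue is naturally soft.",
                 "What liquid can stiffen brain tissue?",
                 truncation=True)

# The first token of each encoded sequence is the [CLS]-style token whose
# final-layer embedding serves as the example representation.
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
```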

Figure 1: Our proposed objective includes a cross-entropy (CE) term and a supervised contrastive learning (SCL) term, and it is formulated to push examples from the same class close together and examples from different classes further apart. We show examples from the SST-2 sentiment analysis dataset from the GLUE benchmark, where class A (shown in red) is negative movie reviews and class B (shown in blue) is positive movie reviews. Although we show a binary classification case for simplicity, the loss is generally applicable to any multi-class classification setting.

Empirical observations show that both the normalization of the encoded embedding representations and an adjustable scalar temperature parameter τ improve performance. A lower temperature increases the influence of examples that are harder to separate, effectively creating harder negatives. Using hard negatives has previously been shown to improve performance in the context of margin-based loss formulations such as triplet loss (Schroff et al., 2015). The empirical behavior of the adjustable temperature parameter is consistent with the observations of previous work on contrastive learning (Chen et al., 2020a; Khosla et al., 2020).
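The effect of the temperature can be seen with a toy computation (the similarity values below are made up purely for illustration): as τ decreases, the softmax over the contrastive denominator concentrates on the hardest negative.

```python
import torch

# One hard negative (similarity 0.8) and two easy negatives (0.1, 0.0).
sims = torch.tensor([0.8, 0.1, 0.0])

for tau in (1.0, 0.3, 0.1):
    share = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: hard-negative share = {share[0].item():.2f}")
# The hard negative's share of the denominator grows as tau shrinks,
# so it contributes a proportionally larger gradient signal.
```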

Relationship to Self-Supervised Contrastive Learning Self-supervised contrastive learning has shown success in learning powerful representations, particularly in the computer vision domain (Chen et al., 2020a; He et al., 2020; Tian et al., 2019; Mnih and Kavukcuoglu, 2013; Gutmann and Hyvärinen, 2012; Kolesnikov et al., 2019). Self-supervised learning methods do not require any labeled data; instead, they sample a mini-batch from unsupervised data and create positive and negative examples from these samples using strong data augmentation techniques such as AutoAugment (Cubuk et al., 2019) or RandAugment (Cubuk et al., 2020) for computer vision. Positive examples are constructed by applying data augmentation to the same example (cropping, flipping, etc. for an image), and negative examples are simply all the other examples in the sampled mini-batch. Intuitively, self-supervised contrastive objectives learn representations that are invariant to different views of positive pairs while maximizing the distance between negative pairs. The distance metric used is often the inner product or the Euclidean distance between the vector representations of the examples.

For a batch of size N, the self-supervised contrastive loss is defined as:

\mathcal{L}_{\mathrm{self}} = -\sum_{i=1}^{2N} \log \frac{\exp\left(\Phi(\tilde{x}_i)\cdot\Phi(\tilde{x}_{j(i)})/\tau\right)}{\sum_{k=1}^{2N} \mathbf{1}_{i \neq k}\, \exp\left(\Phi(\tilde{x}_i)\cdot\Phi(\tilde{x}_k)/\tau\right)}    (4)

where Φ(·) is the normalized embedding from the encoder before the final classification softmax layer, τ is a scalar temperature parameter, and j(i) is the index of the other augmented example that originates from the same original example as x̃_i. A is defined as a data augmentation block that generates two randomly augmented examples, x̃_{2k-1} and x̃_{2k}, from the original example x_k: A(x_k) = {x̃_{2k-1}, x̃_{2k}}. As an example, A can be RandAugment for a computer vision application, or it could be a back-translation model for an NLP application.
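For comparison with the supervised variant above, a minimal sketch of equation (4) is given below, assuming the batch is arranged so that consecutive rows (2k and 2k+1 when 0-indexed) are the two augmented views of the same original example; the function name and layout are our own.

```python
import torch

def self_supervised_contrastive_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Equation (4) for 2N normalized embeddings z, where rows 0/1, 2/3, ...
    are the two augmented views of each original example."""
    two_n = z.size(0)
    sim = z @ z.t() / tau
    not_self = ~torch.eye(two_n, dtype=torch.bool, device=z.device)   # 1_{i != k}
    log_denom = torch.log((torch.exp(sim) * not_self).sum(dim=1))
    partner = torch.arange(two_n, device=z.device) ^ 1                # 0<->1, 2<->3, ...
    log_num = sim[torch.arange(two_n, device=z.device), partner]
    return -(log_num - log_denom).sum()
```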

3 Related Work

Traditional Machine Learning and Theoretical Understanding Several works have analyzed the shortcomings of the widely adopted cross-entropy loss, demonstrating that it leads to poor generalization performance due to poor margins (Liu et al., 2016; Cao et al., 2019) and lacks robustness to noisy labels (Zhang and Sabuncu, 2018; Sukhbaatar et al., 2014) or adversarial examples (Elsayed et al., 2018; Nar et al., 2019). On the other hand, a body of work has explored the performance difference between classifiers trained with discriminative losses (i.e., optimizing for p(y|x), where y is the label and x is the input) such as cross-entropy loss and those trained with generative losses (i.e., optimizing for p(x, y)). Ng and Jordan (2001) show that classifiers trained with generative losses can outperform their counterparts trained with discriminative losses in the context of Logistic Regression and Naive Bayes. Raina et al. (2003) show that a hybrid discriminative and generative objective outperforms both purely discriminative and purely generative approaches. In the context of contrastive learning, Arora et al. (2019) propose a theoretical framework for analyzing contrastive learning algorithms by hypothesizing that semantically similar points are sampled from the same latent class, which allows formal guarantees on the quality of the learned representations.

Contrastive Learning There have been several investigations of contrastive loss formulations for self-supervised, semi-supervised, and supervised learning methods, primarily in the computer vision domain. Chen et al. (2020a) propose a framework for contrastive learning of visual representations without specialized architectures or a memory bank and show state-of-the-art results on ImageNet ILSVRC-2012 (Russakovsky et al., 2015), outperforming previous methods for self-supervised, semi-supervised, and transfer learning. Similarly, Khosla et al. (2020) propose a supervised contrastive loss that outperforms cross-entropy loss and achieves state-of-the-art results on ImageNet with both ResNet-50 and ResNet-200 (He et al., 2016) using AutoAugment (Cubuk et al., 2019) data augmentation. They also show increased robustness on the ImageNet-C dataset (Hendrycks and Dietterich, 2019) and demonstrate that supervised contrastive loss is less sensitive to hyperparameter settings such as optimizers or data augmentations than cross-entropy loss. Liu and Abbeel (2020) propose hybrid discriminative-generative training of energy-based models in which they approximate the generative term with a contrastive loss using large batch sizes, and show improved classification accuracy of WideResNet-28-10 (Zagoruyko and Komodakis, 2016) on the CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) datasets, outperforming state-of-the-art discriminative and generative classifiers. They also demonstrate improved performance for WideResNet-28-10 on robustness, out-of-distribution detection, and calibration compared to other state-of-the-art generative and hybrid models. Finally, Fang and Xie (2020) propose pre-training language models using a self-supervised contrastive learning objective at the sentence level, using back-translation as the augmentation method, followed by fine-tuning by predicting whether two augmented sentences originate from the same sentence – showing improvements over fine-tuning BERT on a subset of GLUE tasks.

Stability and Robustness of Fine-tuning Language Models There have been several works analyzing the robustness of fine-tuning large pre-trained language models, since they tend to overfit to the labeled task data and fail to generalize to unseen data when there is limited labeled data for the downstream task. To improve generalization performance, Jiang et al. (2020) propose a local smoothness-inducing regularizer to manage the complexity of the model and a Bregman proximal point optimization method, an instance of trust-region methods, to prevent aggressive updating of the model during fine-tuning. They show state-of-the-art performance on the GLUE, SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), and ANLI (Nie et al., 2020) natural language understanding benchmarks. Similarly, Aghajanyan et al. (2020) propose a regularized fine-tuning procedure inspired by trust-region theory that replaces adversarial objectives with parametric noise sampled from a normal or uniform distribution, in order to prevent representation collapse during fine-tuning and improve generalization without hurting performance. They show improved performance on a range of natural language understanding and generation tasks including DailyMail/CNN (Hermann et al., 2015), Gigaword (Napoles et al., 2012), Reddit TIFU (Kim et al., 2019), and the GLUE benchmark. There has also been empirical analysis suggesting that fine-tuning for more epochs, reinitializing the top few layers (Zhang et al., 2020) instead of only the classification head, and using a debiased Adam optimizer instead of BERTAdam (Devlin et al., 2019) during fine-tuning (Mosbach et al., 2020) make the fine-tuning procedure more stable across different runs.

4 Experimental Setup

4.1 Datasets and Training Details

We use datasets from the GLUE natural language understanding benchmark (Wang et al., 2019) for evaluation. We include both single-sentence classification tasks and sentence-pair classification tasks to test whether our hypothesis is generally applicable across tasks. We summarize each dataset's main task, domain, number of training examples, and number of classes in Table 1.

In all of our experiments (full dataset and few-shot learning), we sample half of the original validation set of the GLUE benchmark and use it as our test set, and we sample 500 examples from the original validation set for our validation set, in both cases taking the label distribution of the original validation set into account. For each task, we want the validation set to be small enough to avoid easy overfitting on the validation set, and big enough to avoid high variance when early-stopping at various epochs for the few-shot learning experiments. We keep the same smaller validation set for the full dataset experiments in order to allow easy comparison between the low-data and high-data regimes. For full dataset experiments such as the ones shown in Table 2 and Table 3, we use the full training sets of the GLUE benchmark.

We run each experiment with 10 different seeds, pick the top model out of the 10 seeds based on validation accuracy, and report its corresponding test accuracy. We pick the best hyperparameter combination based on the average validation accuracy across the 10 seeds. For few-shot learning experiments such as the ones shown in Table 4 and Table 5, we sample 10 different training set samples with the specified total number of examples from the original training set of the GLUE benchmark, taking the label distribution of the original training set into account. We report the average and the standard deviation of the test accuracies of the top 3 models, based on their validation accuracies, out of the 10 random training set samples. The best hyperparameter combination is picked based on the average validation accuracy of the top 3 models. We focus on the top 3 models for this setting because we would like to reduce the variance across training set samples.
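A sketch of this label-distribution-preserving sampling is shown below, using scikit-learn's stratified splitting; the function and variable names are ours.

```python
from sklearn.model_selection import train_test_split

def sample_few_shot(texts, labels, n_labeled, seed):
    """Draw n_labeled training examples whose label distribution
    matches that of the full training set."""
    few_texts, _, few_labels, _ = train_test_split(
        texts, labels,
        train_size=n_labeled,
        stratify=labels,      # preserve the original label distribution
        random_state=seed,
    )
    return few_texts, few_labels

# e.g. 10 different 20-example training samples, one per random seed:
# samples = [sample_few_shot(train_texts, train_labels, 20, seed) for seed in range(10)]
```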

Dataset Task Domain #Train #Classes
SST-2 sentiment analysis movie reviews 67k 2
CoLA grammatical correctness linguistic publications 8.5k 2
MRPC paraphrase news 3.7k 2
RTE textual entailment news/Wikipedia 2.5k 2
QNLI question answering/textual entailment Wikipedia 105k 2
MNLI textual entailment multi-domain 393k 3
Table 1: GLUE Benchmark datasets used for evaluation.

We use the fairseq library (Ott et al., 2019) and the open-source RoBERTa-Large model for all of our experiments. During all fine-tuning runs, we use the Adam optimizer with a learning rate of 1e-5, a batch size of 16 (unless specified otherwise), and a dropout rate of 0.1.
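In plain PyTorch terms, a single fine-tuning step with this configuration might look like the sketch below; it reuses the illustrative combined_loss helper from Section 2 and assumes a model that returns both logits and the [CLS] embedding, an interface of our own choosing rather than the fairseq API.

```python
import torch

# Assumes: `model` is a RoBERTa-Large classification model (dropout 0.1) that
# returns (logits, cls_embeddings), and `combined_loss` is the sketch from Section 2.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train_step(batch):
    model.train()
    logits, cls_embeddings = model(batch["input_ids"])
    loss = combined_loss(logits, cls_embeddings, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```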

4.2 Constructing Augmented Noisy Training Datasets

Machine learning researchers and practitioners often do not know how noisy their datasets are, as input examples might be corrupted or the ground-truth labeling might not be perfect. Therefore, it is preferable to use robust training objectives that can extract more information out of datasets of different noise levels, even when there is a limited amount of labeled data. We simulate augmented training datasets of different noise levels using a back-translation model (Edunov et al., 2018), where we increase the temperature parameter to create noisier examples. Back-translation refers to the procedure of translating an example in language A into language B and then translating it back to language A, and it is a commonly used data augmentation procedure for NLP applications, as the new examples obtained through back-translation provide targeted inductive bias to the model while preserving the meaning of the original example. Specifically, we use WMT'18 English-German and German-English translation models, use random sampling to get more diverse examples, and employ an augmentation ratio of 1:3 for supervised examples to augmented examples. We observe that employing random sampling with a tunable temperature parameter is critical for getting diverse paraphrases of the supervised examples, consistent with previous work (Edunov et al., 2018; Xie et al., 2019), since the commonly used beam search results in very regular sentences that do not add diversity to the existing data distribution. We keep the validation and test sets the same as in the experiments shown in Table 2 and Table 4.
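Schematically, the augmentation pipeline looks like the sketch below. The translate_en_de and translate_de_en functions are hypothetical wrappers around the WMT'18 English-German and German-English models (their loading code is not reproduced here), and the ratio and temperature arguments mirror the setup described above.

```python
import random

def back_translate(sentence: str, temperature: float) -> str:
    """Round-trip a sentence through German with temperature-controlled sampling.

    translate_en_de / translate_de_en are hypothetical wrappers around the
    WMT'18 translation models; higher temperature yields noisier paraphrases.
    """
    german = translate_en_de(sentence, sampling=True, temperature=temperature)
    return translate_de_en(german, sampling=True, temperature=temperature)

def augment_dataset(examples, temperature, ratio=3):
    """Return the original examples plus `ratio` noisy paraphrases each
    (a 1:3 supervised-to-augmented ratio by default)."""
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        for _ in range(ratio):
            augmented.append((back_translate(text, temperature), label))
    random.shuffle(augmented)
    return augmented
```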

5 Analysis and results

5.1 GLUE Benchmark Full Dataset Results

In Table 2, we report results using our proposed objective on six downstream tasks from the GLUE benchmark. We use a very strong baseline of fine-tuning RoBERTa-Large with cross-entropy loss, which is currently standard practice for state-of-the-art NLP classification models. Details of the experimental setup are explained in Section 4.

We observe that adding the supervised contrastive learning (SCL) term to the objective improves performance over the strong RoBERTa-Large baseline on 5 out of 6 datasets, leading to a 1.2-point improvement on SST-2, a 0.9-point improvement on CoLA, and a 0.9-point improvement on QNLI. This shows that our proposed objective is effective both for binary single-sentence classification tasks such as sentiment analysis and grammatical correctness, and for sentence-pair classification tasks such as textual entailment and paraphrasing. On the other hand, we observe that our proposed method does not lead to an improvement on MNLI, which is a three-way textual entailment classification task. We believe this is because positive example pairs are quite sparse when we fine-tune our RoBERTa-Large models with batch size 16 due to memory constraints. We leave experiments with larger batch sizes, which require additional engineering effort, for future work. We provide evidence for this hypothesis in the ablation studies shown in Table 3, where we conduct the full dataset experiments for CE+SCL with the same experimental setup described here for Table 2 on SST-2, CoLA, and QNLI with batch sizes 16, 64, and 256 using RoBERTa-Base. We observe that as we increase the batch size, performance improves significantly across all datasets. Specifically, we observe a 0.4-point improvement on SST-2, a 0.5-point improvement on CoLA, and a 0.8-point improvement on QNLI when we increase the batch size from 16 to 256.
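To make the sparsity argument concrete, a back-of-the-envelope estimate (assuming roughly balanced classes, which is our simplification): for a C-way task, each anchor in a batch of size N has on average about N/C - 1 positives in equation (3), so

\frac{16}{3} - 1 \approx 4.3 \qquad \text{vs.} \qquad \frac{256}{3} - 1 \approx 84.3,

i.e., going from batch size 16 to 256 increases the expected number of positive pairs per anchor by roughly a factor of twenty.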

Model Loss SST-2 CoLA MRPC RTE QNLI MNLI Avg
RoBERTa CE 94.7 86.4 87.3 85.0 94.5 90.0 89.7
RoBERTa CE + SCL 95.9 87.3 87.8 85.6 95.4 89.9 90.3
Table 2: Results on the GLUE benchmark. We compare fine-tuning RoBERTa-Large with CE with and without SCL using the full training set of each task.
Model Loss Bsz SST-2 CoLA QNLI
RoBERTa CE + SCL 16 93.9 83.4 92.1
RoBERTa CE + SCL 64 94.2 84.8 92.7
RoBERTa CE + SCL 256 94.3 84.9 92.9
Table 3: Ablation study fine-tuning RoBERTa-Base with CE+SCL using the full training set of each task, increasing the batch size (Bsz).
Figure 2: tSNE plots of the learned [CLS] embeddings on the SST-2 test set with 20 labeled training examples, comparing CE with and without the SCL term. Blue: positive examples; red: negative examples.

5.2 GLUE Benchmark Few-shot Learning Results

We proposed adding the SCL term inspired by the learning strategy humans deploy when given only a few examples. In Table 4, we report our few-shot learning results on SST-2, QNLI, and MNLI from the GLUE benchmark with 20, 100, and 1000 labeled training examples. Details of the experimental setup are explained in Section 4. We use a very strong baseline of fine-tuning RoBERTa-Large with cross-entropy loss. We observe that the SCL term improves performance over the baseline significantly across all datasets and data regimes, leading to a 10.7-point improvement on QNLI, a 3.4-point improvement on MNLI, and a 2.2-point improvement on SST-2 when we have 20 labeled examples for training. As we increase the number of labeled examples, the performance improvement over the baseline decreases, leading to a 1.9-point improvement on MNLI for 100 examples and a 0.6-point improvement on QNLI for 1000 examples. In Figure 2, we show tSNE plots of the learned [CLS] embeddings on the SST-2 test set when trained with 20 labeled examples, comparing CE with and without the SCL term. The SCL term enforces more compact clustering of examples with the same label, while the distribution of embeddings learned with CE alone is close to random. We include a more detailed comparison of tSNE plots for CE and CE+SCL with 20 labeled examples, 100 labeled examples, and the full dataset, respectively, in Figure 3 in the Appendix.
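The visualization itself can be produced with standard tooling; a sketch using scikit-learn's t-SNE is shown below (our own plotting code, with cls_embeddings and labels standing in for the encoder outputs and gold labels).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# cls_embeddings: (n_examples, hidden_dim) array of [CLS] embeddings from the
# fine-tuned encoder; labels: gold classes (0 = negative, 1 = positive for SST-2).
points = TSNE(n_components=2, random_state=0).fit_transform(cls_embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="coolwarm", s=10)
plt.title("SST-2 test-set [CLS] embeddings")
plt.show()
```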

Model Loss N SST-2 QNLI MNLI
RoBERTa CE 20 85.9±2.1 65.0 39.3
RoBERTa CE + SCL 20 88.1±3.3 75.7±4.8 42.7±4.6
RoBERTa CE 100 91.1±1.3 81.9±0.4 59.2±2.1
RoBERTa CE + SCL 100 92.8±1.3 82.5±0.4 61.1±3.0
RoBERTa CE 1000 94.0±0.6 89.2±0.6 81.4±0.2
RoBERTa CE + SCL 1000 94.1±0.5 89.8±0.4 81.5±0.2
Table 4: Few-shot learning results on the GLUE benchmark where we have N=20, 100, 1000 labeled examples for training. Reported results are the mean and the standard deviation of the test accuracies of the top 3 models based on validation accuracy out of 10 random training set samples.

5.3 Robustness Across Augmented Noisy Training Datasets

In Table 5, we report our results on augmented training sets with varying levels of noise. We have 100 labeled examples for training for each task, and we augment their training sets with noisy examples using a back-translation model, as described in detail in Section 4.2. Note that we use the back-translation model to simulate training datasets of varying noise levels and not as a method to boost model performance. Experimental setup follows what is described in Section 4 for few-shot learning experiments. T is the temperature for the back-translation model used to augment the training sets, and higher temperature corresponds to more noise in the augmented training set.

We observe consistent improvements over the RoBERTa-Large baseline with our proposed objective across all datasets and all noise levels, with average improvements of 0.4 points on SST-2, 2.5 points on QNLI, and 7 points on MNLI across the augmented training sets. The improvement is particularly significant for inference tasks (QNLI, MNLI) when the noise levels are higher (higher temperature), leading to a 7.7-point improvement on MNLI when T=0.7 and a 4.2-point improvement on QNLI when T=0.9. We show samples of the augmented examples used in this robustness experiment in Table 6. For T=0.3, examples mostly stay the same with minor changes in their phrasing, while for T=0.9, some grammatical mistakes and factual errors are introduced.

Dataset Loss Original T=0.3 T=0.5 T=0.7 T=0.9 Average
SST-2 CE 91.1±1.3 92.0±1.3 91.4±1.0 91.7±1.3 90.0±0.5 91.3±1.2
SST-2 CE + SCL 92.8±1.3 92.6±0.9 91.5±1.0 91.2±0.6 91.5±1.0 91.7±1.0
QNLI CE 81.9±0.4 81.1±2.3 80.0±2.9 78.9±3.7 75.9±4.0 79.0±3.5
QNLI CE + SCL 82.5±0.4 82.7±1.9 81.9±2.5 81.3±0.6 80.1±2.5 81.5±2.0
MNLI CE 59.2±2.1 54.0±1.1 55.3±2.4 54.6±2.2 47.0±1.8 52.7±3.9
MNLI CE + SCL 61.1±3.0 61.2±2.3 62.1±0.9 62.3±1.1 53.0±2.1 59.7±4.3
Table 5: Results on the GLUE benchmark for robustness across noisy augmented training sets. Average shows the average performance across augmented training sets.
Dataset Type Sentence
SST-2 Original As possibly the best actor working in movies today.
SST-2 Augmented (T=0.3) As perhaps the best actor who now stars in films.
SST-2 Original The young stars are too cute; the story and ensuing complications are too manipulative.
SST-2 Augmented (T=0.9) The babies are too cute, the image and complications that follow too manipulative.
QNLI Original Brain tissue is naturally soft, but can be stiffened with what liquid?
QNLI Augmented (T=0.3) Brain tissue is omitted naturally, but with what fluid it can be stiffened?
QNLI Original In March 1968, CBS and Sony formed CBS/Sony Records, a Japanese business joint venture.
QNLI Augmented (T=0.9) CBS was founded by CBS and Sony Records in March 1962, a Japanese company.
MNLI Original However, the link did not transfer the user to a comment box particular to the rule at issue.
MNLI Augmented (T=0.3) However, the link did not send the user to a comment field specifically for the rule.
MNLI Original Tenants could not enter the apartment complex due to a dangerous chemical spill.
MNLI Augmented (T=0.9) Tenants were banned from entering the medical property because of a blood positive substance.
Table 6: Sample of augmented examples with different noise levels for the robustness experiment shown in Table 5. Higher temperature (T) corresponds to more noise in the augmented training set.

5.4 Generalization Ability of Task Models

In this experiment, we first fine-tune RoBERTa-Large on SST-2 using its full training set to obtain a task model trained with and without the SCL term. Then, we transfer this task model to two related single-sentence binary sentiment analysis classification tasks – Amazon-2 and Yelp-2 (Zhang et al., 2015). For both, we sample 20 labeled examples per class and follow the few-shot learning experimental setup described in Section 4. In Table 7, we demonstrate that using the SCL term for both the source (SST-2) and target (Amazon-2, Yelp-2) domains leads to better generalization ability, with a 2.9-point improvement on Amazon-2 and a 0.4-point improvement on Yelp-2, along with a significant reduction in variance across training set samples.
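In outline, the transfer procedure is: load the fully fine-tuned SST-2 task model, then continue fine-tuning on the small target-domain sample with the same objective. A generic PyTorch sketch is given below; the checkpoint path, data loader, and model interface are placeholders of our own, and combined_loss is the illustrative helper from Section 2.

```python
import torch

# Start from the SST-2 task model fine-tuned on the full SST-2 training set.
state = torch.load("sst2_task_model.pt")           # hypothetical checkpoint path
model.load_state_dict(state)

# Continue fine-tuning on the 40 labeled target-domain examples (20 per class).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(num_epochs):                    # num_epochs chosen via early stopping
    for batch in target_few_shot_loader:           # e.g. an Amazon-2 or Yelp-2 sample
        logits, cls_embeddings = model(batch["input_ids"])
        loss = combined_loss(logits, cls_embeddings, batch["labels"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```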

Model Loss N Amazon-2 Yelp-2
RoBERTa CE 40 87.4±6.4 90.8±2.2
RoBERTa CE + SCL 40 90.3±0.6 91.2±0.4
Table 7: Generalization of the SST-2 task model (fine-tuned using the full training set) to related tasks (Amazon-2, Yelp-2) where there are 20 labeled examples for each class.

6 Conclusion

We propose a supervised contrastive learning objective for fine-tuning pre-trained language models and demonstrate improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in both high-data and low-data regimes. We also show that our proposed objective leads to models that are more robust to different levels of noise in the training data and can generalize better to related tasks with limited labeled task data. Currently, data augmentation methods in NLP and their effects on the downstream tasks are neither as effective nor as well understood as their counterparts in the computer vision domain. In future work, we plan to study principled and automated data augmentation techniques for NLP that would allow extending our supervised contrastive learning objective to both semi-supervised and self-supervised learning settings.

References

  • A. Aghajanyan, A. Shrivastava, A. Gupta, N. Goyal, L. Zettlemoyer, and S. Gupta (2020) Better fine-tuning by reducing representational collapse. ArXiv abs/2008.03156. Cited by: §1, §3.
  • S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019) A theoretical analysis of contrastive unsupervised representation learning. ArXiv abs/1902.09229. Cited by: §3.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. ArXiv abs/1906.00910. Cited by: §1, §1.
  • A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv abs/2006.11477. Cited by: §1.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. ArXiv abs/1508.05326. Cited by: §3.
  • K. Cao, C. Wei, A. Gaidon, N. Aréchiga, and T. Ma (2019) Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS, Cited by: §1, §3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020a) A simple framework for contrastive learning of visual representations. ArXiv abs/2002.05709. Cited by: §1, §2, §2, §3.
  • T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton (2020b) Big self-supervised models are strong semi-supervised learners. ArXiv abs/2006.10029. Cited by: §1.
  • A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli (2020) Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979. Cited by: §1.
  • E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3008–3017. Cited by: §2.
  • E. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation strategies from data. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113–123. Cited by: §2, §3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §1, §2, §3.
  • J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. A. Smith (2020) Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. ArXiv abs/2002.06305. Cited by: §1.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. ArXiv abs/1808.09381. Cited by: §4.2.
  • G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio (2018) Large margin deep networks for classification. ArXiv abs/1803.05598. Cited by: §1, §3.
  • H. Fang and P. Xie (2020) CERT: contrastive self-supervised learning for language understanding. ArXiv abs/2005.12766. Cited by: §3.
  • M. Gutmann and A. Hyvärinen (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13, pp. 307–361. Cited by: §2.
  • T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1483–1492. Cited by: §1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020) Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §3.
  • O. J. Hénaff, A. Srinivas, J. Fauw, A. Razavi, C. Doersch, S. Eslami, and A. Oord (2019) Data-efficient image recognition with contrastive predictive coding. ArXiv abs/1905.09272. Cited by: §1, §1.
  • D. Hendrycks and T. G. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. ArXiv abs/1807.01697. Cited by: §3.
  • K. Hermann, T. Kociský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In NIPS, Cited by: §3.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. ArXiv abs/1503.02531. Cited by: §1.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. ArXiv abs/1808.06670. Cited by: §1.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 328–339. Cited by: §1.
  • H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and T. Zhao (2020) SMART: robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In ACL, Cited by: §1, §3.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. ArXiv abs/2004.11362. Cited by: §2, §3.
  • T. Khot, A. Sabharwal, and P. Clark (2018) SciTaiL: a textual entailment dataset from science question answering. In AAAI, Cited by: §3.
  • B. Kim, H. Kim, and G. Kim (2019) Abstractive summarization of reddit posts with multi-level memory networks. ArXiv abs/1811.00783. Cited by: §3.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1920–1929. Cited by: §2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §3.
  • H. Liu and P. Abbeel (2020) Hybrid discriminative-generative training via contrastive learning. ArXiv abs/2007.09070. Cited by: §3.
  • W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML. Cited by: §1, §3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. Cited by: §1, §2.
  • I. Misra and L. V. D. Maaten (2020) Self-supervised learning of pretext-invariant representations. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6706–6716. Cited by: §1, §1.
  • A. Mnih and K. Kavukcuoglu (2013) Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, Cited by: §2.
  • M. Mosbach, M. Andriushchenko, and D. Klakow (2020) On the stability of fine-tuning bert: misconceptions, explanations, and strong baselines. ArXiv abs/2006.04884. Cited by: §1, §3.
  • R. Müller, S. Kornblith, and G. E. Hinton (2019) When does label smoothing help?. In NeurIPS, Cited by: §1.
  • C. Napoles, M. R. Gormley, and B. V. Durme (2012) Annotated gigaword. In AKBC-WEKEX@NAACL-HLT, Cited by: §3.
  • K. Nar, O. Ocal, S. Sastry, and K. Ramchandran (2019) Cross-entropy loss and low-rank features have responsibility for adversarial examples. ArXiv abs/1901.08360. Cited by: §1, §3.
  • A. Y. Ng and M. I. Jordan (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. In NIPS, Cited by: §3.
  • Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2020) Adversarial nli: a new benchmark for natural language understanding. ArXiv abs/1910.14599. Cited by: §3.
  • A. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. ArXiv abs/1807.03748. Cited by: §1.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §4.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1.
  • R. Raina, Y. Shen, A. Y. Ng, and A. McCallum (2003) Classification with hybrid generative/discriminative models. In NIPS, Cited by: §3.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, pp. 211–252. Cited by: §3.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. Cited by: §2.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, Cited by: §1.
  • S. Sukhbaatar, J. Bruna, M. Paluri, L. D. Bourdev, and R. Fergus (2014) Training convolutional networks with noisy labels. arXiv: Computer Vision and Pattern Recognition. Cited by: §1, §3.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. Cited by: §1.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. ArXiv abs/1906.05849. Cited by: §1, §1, §2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, External Links: Link Cited by: 2nd item, §1, §4.1.
  • Z. Wu, Y. Xiong, S. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §1.
  • Q. Xie, Z. Dai, E. H. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv: Learning. Cited by: §4.2.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: §1.
  • I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan (2019) Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546. Cited by: §1.
  • S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022–6031. Cited by: §1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. ArXiv abs/1605.07146. Cited by: §3.
  • H. Zhang, M. Cissé, Y. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. ArXiv abs/1710.09412. Cited by: §1.
  • T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi (2020) Revisiting few-sample bert fine-tuning. ArXiv abs/2006.05987. Cited by: §1, §3.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In NIPS, Cited by: §5.4.
  • Z. Zhang and M. R. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, Cited by: §1, §3.

Appendix A Appendix

Figure 3: tSNE plots of the learned [CLS] embeddings on the SST-2 test set with 20 labeled examples, 100 labeled examples, and the full dataset, respectively, comparing CE with and without the SCL term. Blue: positive examples; red: negative examples.