The ever-increasing amount of unverified information online makes it challenging to judge what to believe or distrust. Fact verification, the task of identifying whether a textual claim is supported or refuted by the given evidence text, can play a critical role in recognizing and correcting false information. Consequently, it has drawn lots of attention from the NLP community to promote the veracity and correctness of factual claims.
Recent research in this field has advanced thanks to the release of large-scale datasets (Thorne et al., 2018; Wang, 2017) and the development of pre-trained language models, such as BERT (Devlin et al., 2019), enabling increasingly complex claims to be accurately fact-checked. However, recent works have demonstrated that the process of data collection using crowdsourcing often introduces idiosyncratic biases due to annotation artifacts (Gururangan et al., 2018; Geva et al., 2019; Schuster et al., 2019). These biases are typically characterized as superficial surface patterns that are strongly associated with target labels. As an example, in the FEVER dataset, negation phrases such as “did not” and “failed to” in the claim are highly correlated with the REFUTES label, irrespective of the given evidence (Schuster et al., 2019).
As a result of such biases, models tend to exploit the spurious patterns between shortcut words and labels in the dataset instead of performing factual reasoning over the given evidence, as depicted in Figure 1. In turn, models often appear to perform well on in-domain evaluation sets but show substantial performance degradation on out-of-distribution samples. Moreover, this behavior makes models vulnerable to adversarial sets consisting of counterexamples that cause classification errors in existing systems (Thorne and Vlachos, 2019). Therefore, overcoming such biases is a key challenge in developing robust fact verification models.
To tackle this issue, previous methods either reduce the importance of biased examples via a modified training objective (Karimi Mahabadi et al., 2020), regularize the confidence of the model on biased examples (Utama et al., 2020), or train a model in an ensemble with a biased model to discourage it from leveraging statistical shortcuts (Clark et al., 2019). However, the majority of these methods target a specific bias; as a result, they achieve improvements on the targeted evaluation set while generally performing poorly on evaluation sets that include different types of biases.
In this paper, we propose CrossAug, an alternative approach for debiasing fact verification models by augmenting the data with contrastive samples. CrossAug generates new data samples through a novel two-stage augmentation pipeline: 1) neural-based negative claim generation and 2) lexical search-based evidence modification. The generated claim and evidence are then paired cross-wise with the original pair, yielding contrastive pairs that are subtly different from one another in respect to context but are assigned opposite labels. We postulate that such contrastive samples encourage a model to rely less on spurious correlations, leading to better reasoning capabilities by learning more robust representations for the task.
Indeed, our approach outperforms the regularization-based state-of-the-art method by 3.6% on the Symmetric FEVER dataset (Schuster et al., 2019), an unbiased evaluation set, and also shows a consistent performance boost on other fact verification datasets.
To verify the performance of the proposed debiasing method in real-world application scenarios, where fact verification datasets often have a limited amount of data, we further experiment on data-scarce settings by sub-sampling the FEVER set. Experimental results demonstrate that our approach is also effective at debiasing in these low-resource conditions, exceeding the baseline performance on the Symmetric dataset with just 1% of the original data.
In summary, our contributions in this work are as follows:
We propose CrossAug, a novel contrastive data augmentation method for debiasing fact verification models.
We empirically show that training a model with the data augmented by our proposed method leads to the state-of-the-art performance on the Symmetric FEVER dataset.
Our augmentation-based debiasing approach shows performance improvements particularly in low-resource conditions compared to previous regularization-based debiasing approaches.
2.1. Task Formulation
Given a textual claim c and evidence e, the objective of the fact verification task is to predict whether the claim is supported by the evidence, refuted by the evidence, or whether the evidence does not provide enough information for verification. We denote the i-th sample in a dataset of size N as a triplet (cᵢ, eᵢ, yᵢ), where yᵢ is the data label.
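For concreteness, a dataset entry under this formulation can be represented as follows (a minimal illustrative sketch; the class and field names are ours, not from the paper):

```python
from dataclasses import dataclass

# The three FEVER-style labels used in the task formulation.
LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")

@dataclass
class Sample:
    claim: str      # c_i: the textual claim to verify
    evidence: str   # e_i: the evidence text
    label: str      # y_i: one of LABELS

sample = Sample(
    claim="The film was released over 30 days after its premiere.",
    evidence="The film saw wide release 45 days after premiering.",
    label="SUPPORTS",
)
assert sample.label in LABELS
```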
| Train method | FEVER dev | Symmetric | Adversarial | FM2 dev | Δ Sym. | Δ avg. |
| --- | --- | --- | --- | --- | --- | --- |
| No augmentation (baseline) | 86.15 ± 0.42 | 58.77 ± 1.29 | 49.66 ± 0.37 | 40.81 ± 0.43 | – | – |
| EDA | 85.09 ± 0.25 | 58.55 ± 1.63 | 51.41 ± 1.14 | 41.21 ± 1.11 | −0.22% | +0.22% |
| Paraphrasing | 84.33 ± 0.34 | 59.02 ± 1.38 | 52.53 ± 1.20 | 40.60 ± 0.71 | +0.25% | +0.27% |
| Re-weighting | 85.56 ± 0.32 | 61.87 ± 1.16 | 49.92 ± 0.80 | 43.80 ± 0.46 | +3.10% | +1.44% |
| Product of Experts (PoE) | 86.50 ± 0.35 | 65.30 ± 1.73 | 51.07 ± 1.20 | 46.69 ± 1.11 | +6.53% | +3.54% |
| CrossAug (ours) | 85.34 ± 0.68 | 68.90 ± 1.68 | 51.78 ± 1.02 | 44.17 ± 1.27 | +10.13% | +3.70% |
| – Negative claim only augmentation | 85.70 ± 0.28 | 61.00 ± 0.71 | 51.96 ± 0.90 | 43.06 ± 0.40 | +2.23% | +1.58% |
| – Negative evidence only augmentation | 85.87 ± 0.16 | 67.06 ± 0.99 | 51.46 ± 0.43 | 43.70 ± 0.97 | +8.29% | +3.18% |
Table 1: Experimental results on fact verification datasets. The mean and standard deviation of the classification accuracy over five runs are reported for each method.
2.2. Data Augmentation Pipeline
In our proposed method, we generate three additional synthetic samples for each original claim-evidence pair through two stages of augmentation. Note that CrossAug utilizes only the positive claims (SUPPORTS claims) in the FEVER dataset, which are verifiable by specific evidence. The whole process of our data augmentation pipeline is shown in Figure 2.
(1) Negative Claim Generation: The first stage generates a negative claim c⁻ from a positive claim c⁺ using a neural sequence-to-sequence model. This generative process involves transformations of the positive claim, such as inserting a negation or replacing a word with an antonym. The generated claim thus carries a different meaning and is refuted by the evidence e that supports the positive claim, which is why we call it a negative claim. Through this process, we form a new data sample (c⁻, e, REFUTES).
To this end, we fine-tune BART (Lewis et al., 2020) on WikiFactCheck-English dataset, which provides pairs of positive claims and their corresponding negative claims (Sathe et al., 2020). In fine-tuning, positive claims are used as the source text and negative claims are taken as the target text of the model. To provide a richer context, we also fine-tune the model with the source and target text reversed since a positive claim can be seen as a refuted version of a negative claim.
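The bidirectional fine-tuning data construction described above can be sketched as follows (an illustrative simplification; the function and field names are our assumptions, not the authors' code):

```python
def build_seq2seq_pairs(claim_pairs):
    """Build BART fine-tuning pairs from (positive, negative) claim pairs.

    Each pair is used in both directions: positive -> negative, and,
    since a positive claim can be seen as a refuted version of the
    negative one, negative -> positive as well.
    """
    pairs = []
    for pos, neg in claim_pairs:
        pairs.append({"source": pos, "target": neg})  # forward direction
        pairs.append({"source": neg, "target": pos})  # reversed direction
    return pairs

data = [("Paris is the capital of France.",
         "Paris is not the capital of France.")]
pairs = build_seq2seq_pairs(data)
assert pairs[0]["source"] == pairs[1]["target"]  # both directions present
```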
(2) Evidence Modification: The negative claim generated in the first stage often differs from the positive claim by only a few words, and the change can thus be seen as a span replacement. For example, “over 30 days” is simply substituted with “less than 10 days”, as shown in Figure 2. This phenomenon is due to the characteristics of the data used for fine-tuning the model: the negative claims were manually written by annotators under constraints on both sentence length and subject so as to be similar to the positive claims (Sathe et al., 2020). We also observe that the words replaced in the positive claim are often found verbatim in the evidence. This is because the changed part of the claim usually corresponds to factual information taken from the evidence.
As the second stage of our data augmentation process, we build on these observations to perform a lexical search-based evidence modification. First, we compare the positive claim with the negative claim to identify the part changed in the first stage. Once the words replaced in the positive claim are recognized (“over 30 days” in Figure 2), we search for the same words in the evidence and replace them with the substituted words from the negative claim (“less than 10 days” in Figure 2). Since this substitution induces the same factual modification on the evidence as that applied to the negative claim, it logically follows that the resulting modified evidence e⁻ supports the negative claim c⁻ and refutes the positive claim c⁺. Consequently, we form two additional contrastive samples (c⁻, e⁻, SUPPORTS) and (c⁺, e⁻, REFUTES) in the second stage.
Exceptional Cases: In the first stage, the generated negative claim c⁻ is occasionally exactly the same as the positive claim c⁺. For such samples, we skip the augmentation. Also, in the second stage, we carry out the evidence modification only when the number of replaced words in the claims is less than or equal to k, where k is a threshold value. This is necessary to prevent invalid evidence modifications: when the replaced part is large, it frequently contains terms inappropriate for reconstructing the evidence, such as non-factual words, producing an illogical sentence. However, we still keep the sample from the first stage even when the evidence modification stage is skipped.
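Putting the second stage and its exceptional cases together, the evidence modification can be sketched with a word-level diff. This is our own minimal reconstruction of the described procedure, not the released implementation, and the default threshold value is an arbitrary placeholder:

```python
import difflib

def modify_evidence(pos_claim, neg_claim, evidence, k=3):
    """Replace in the evidence the span that changed between the positive
    and negative claims. Returns None when the modification is skipped:
    identical claims, more than one changed span, a span longer than k
    words, or a span not found verbatim in the evidence."""
    pos_words, neg_words = pos_claim.split(), neg_claim.split()
    if pos_words == neg_words:          # generation produced an identical claim
        return None
    sm = difflib.SequenceMatcher(a=pos_words, b=neg_words)
    replaced = [(pos_words[i1:i2], neg_words[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
    if len(replaced) != 1:              # only handle a single changed span
        return None
    old_span, new_span = replaced[0]
    if len(old_span) > k or len(new_span) > k:  # span too large: skip
        return None
    old, new = " ".join(old_span), " ".join(new_span)
    if old not in evidence:             # span must appear verbatim in evidence
        return None
    return evidence.replace(old, new, 1)

pos = "The trial lasted over 30 days in total."
neg = "The trial lasted less than 10 days in total."
ev = "Court records show the trial lasted over 30 days."
print(modify_evidence(pos, neg, ev))
# -> Court records show the trial lasted less than 10 days.
```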
3. Experiments
For our experiments, we evaluate our proposed data augmentation method on four datasets, including FEVER, and compare its performance with existing methods.
3.1. Datasets
FEVER (Thorne et al., 2018) is a crowdsourced fact verification dataset containing claim-evidence pairs based on Wikipedia articles. We only use the claims paired with a single evidence for training and evaluation in this work.
Symmetric (Schuster et al., 2019) is a test set based on the FEVER development set designed for unbiased evaluation. It is carefully constructed to eliminate the correlation between claim n-grams and labels.
Adversarial (Thorne and Vlachos, 2019) is an adversarially constructed dataset explicitly designed to induce errors in models trained on the FEVER dataset.
Fool Me Twice (FM2) (Eisenschlos et al., 2021) is a Wikipedia-based fact verification dataset composed of 13k claim-evidence pairs collected through games between crowd-workers.
3.2. Compared Methods
Data Augmentation Methods: We compare our method against two data augmentation techniques commonly used across natural language processing tasks: Easy Data Augmentation (EDA) (Wei and Zou, 2019) and neural paraphrasing. EDA applies simple mutations, such as random swapping or synonym replacement, to the original sentence to generate new examples. For neural paraphrasing, we use a GPT-2 model (Radford et al., 2019) fine-tuned on back-translated data to paraphrase the original text (Krishna et al., 2020). For each original claim-evidence pair, we create a new pair that holds the same relation by transforming only the claim, leading to an augmentation ratio of original to augmented data of 1:1.
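For reference, one of EDA's four mutation operations, random swap, can be sketched as follows (a minimal sketch of the technique, not the original EDA code):

```python
import random

def eda_random_swap(sentence, n_swaps=1, seed=0):
    """Randomly swap two word positions, repeated n_swaps times.
    One of the four simple mutations in EDA (Wei and Zou, 2019)."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]  # swap the two positions
    return " ".join(words)

print(eda_random_swap("the movie was released in 1999", n_swaps=1, seed=0))
```

The output is a permutation of the original words, so the label of the claim-evidence pair is assumed to be preserved.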
Regularization-based Debiasing Methods: We also compare with two debiasing techniques that reduce the reliance on biases by regularizing the model on the biased samples. The first one is an example re-weighting method that targets biases from the shortcut words (Schuster et al., 2019). By re-weighting the importance of claims containing those words, it forces a model to focus on the hard examples in which relying on the bias results in incorrect predictions. The other one is Product of Experts (PoE) (Karimi Mahabadi et al., 2020), which computes the training loss in an ensemble of the base model and the bias-only model. Similar to the first method, it controls the base model’s loss depending on the prediction of the bias-only model for each example.
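The PoE combination can be illustrated as follows: the log-probabilities of the base and bias-only models are summed before computing cross-entropy, so examples that the bias-only model already classifies confidently contribute less loss. This is a schematic NumPy sketch, not the authors' implementation:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def poe_loss(base_logits, bias_logits, labels):
    """Product of Experts: cross-entropy over the renormalized product
    of the base and bias-only model distributions."""
    combined = log_softmax(base_logits) + log_softmax(bias_logits)
    log_probs = log_softmax(combined)
    return -log_probs[np.arange(len(labels)), labels].mean()

base = np.array([[2.0, 0.5, -1.0]])
unbiased = np.array([[0.0, 0.0, 0.0]])     # uninformative bias-only model
confident = np.array([[5.0, -5.0, -5.0]])  # bias-only model sure of label 0
y = np.array([0])
# A confident, correct bias-only model reduces the loss (and thus the
# gradient signal) for the base model on that example.
assert poe_loss(base, confident, y) < poe_loss(base, unbiased, y)
```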
3.3. Implementation Details
For our experiments, we use the BERT-base-uncased model (Devlin et al., 2019), which demonstrates competitive performance on fact verification tasks. We fine-tune BERT with an additional classification layer on top of the [CLS] token embedding. We concatenate the claim and the evidence, inserting a [SEP] token in between, to form the input sequence. Following previous works, we set the maximum sequence length to 128 and the batch size to 32, and optimize the model with a standard cross-entropy loss using the Adam optimizer (Kingma and Ba, 2015)
with a learning rate of 2e-5. We train the model on the FEVER train set and evaluate generalization performance using the development sets of the Symmetric, Adversarial, and FM2 datasets. We train the model for 3 epochs with 5 different random seeds for all experiments and report the averaged results. For our augmentation pipeline, we set the maximum span size k for evidence modification, which produces an augmented dataset with an augmentation ratio of 1:0.58.
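The input construction described above can be sketched as follows (whitespace tokenization stands in for BERT's WordPiece tokenizer here, which is a simplification; [CLS] and [SEP] are BERT's standard special tokens):

```python
def build_input(claim, evidence, max_len=128):
    """Build a single BERT-style input sequence:
    [CLS] claim tokens [SEP] evidence tokens [SEP], truncated to max_len."""
    tokens = ["[CLS]"] + claim.split() + ["[SEP]"] + evidence.split() + ["[SEP]"]
    return tokens[:max_len]

seq = build_input("Paris is the capital of France.",
                  "Paris has been France's capital since 987.")
print(seq[:3])  # ['[CLS]', 'Paris', 'is']
```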
3.4. Results on the Full Dataset
First, we augment the full FEVER train set with our approach and compare performance in Table 1. Our proposed method achieves a 10.13% improvement over the baseline and a 3.6% improvement over the previous state-of-the-art debiasing technique on the Symmetric dataset. This result shows that our method is highly effective at preventing the model from predicting based on spurious biases. Our approach also shows a 2.12% improvement on the Adversarial dataset and a 3.36% improvement on the FM2 dataset compared to the baseline, indicating that our augmentation method benefits not just the diagnostic dataset for lexical bias but fact verification in general. Finally, our method leads to the greatest overall improvement across the datasets out of all compared training methods. This result empirically shows that the contrastive samples generated by our augmentation method enhance factual reasoning capabilities by encouraging the model to learn a more robust feature representation, achieving strong generalization.
Compared to our approach, the example re-weighting and PoE methods perform slightly worse on the Symmetric and Adversarial datasets and slightly better on the original FEVER development set and the FM2 dataset. On the other hand, EDA and paraphrasing augmentations show negligible performance improvement on the Symmetric dataset. These results suggest that simply training with more data does not necessarily help mitigate the bias in data.
3.5. Ablation Studies
We conduct an ablation study to verify the effectiveness of the augmented samples generated at each step of our augmentation process. The results in Table 1 reveal that a moderate performance improvement on the Symmetric, Adversarial, and FM2 evaluation sets is attained even when using only the negative claims generated in the first step. However, the performance on the Symmetric dataset is still significantly lower than with the full augmentation method, implying that augmenting with negative claims alone is less effective for debiasing the model.
Training with the negative-evidence-augmented data, on the other hand, exhibits more competitive performance, notably outperforming the previous state-of-the-art technique on the Symmetric dataset. The results imply that the key component of our debiasing approach is training the model with contrastive samples sharing the same claim, which enables the model to learn by comparing the claim against the input evidence instead of relying on artifacts in the claim. Nevertheless, the full augmentation method outperforms all ablations, indicating that contrastive data samples in general help the model learn more robust representations.
3.6. Results on Low-resource Conditions
In the real world, the available training corpus for fact verification is often small due to the high cost and difficulty of collecting and labelling data. Training on such a limited amount of data can lead to even more biased models due to the lack of samples from which to learn factual reasoning. Therefore, it is important to investigate whether debiasing methods perform well in low-resource conditions.
Thus, we further evaluate our data augmentation method in low-resource conditions, simulating a data-scarce setting by sampling subsets of the original FEVER training data.
We perform class-balanced sub-sampling of the FEVER training data with 5 different random seeds, and train the model on each subset with another 5 different random seeds to account for statistical variation.
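The class-balanced sub-sampling above can be sketched as follows (our own sketch of the protocol, not the exact experimental code):

```python
import random
from collections import defaultdict

def class_balanced_subsample(samples, fraction, seed=0):
    """Sub-sample the same fraction of examples from each label class,
    so the subset keeps the original class balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample["label"]].append(sample)
    subset = []
    for label, group in by_label.items():
        n = max(1, round(len(group) * fraction))  # keep at least one per class
        subset.extend(rng.sample(group, n))
    return subset

data = [{"label": "SUPPORTS"}] * 100 + [{"label": "REFUTES"}] * 100
subset = class_balanced_subsample(data, fraction=0.01)
print(len(subset))  # 2: one example per class
```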
As in the previous experiments, we augment each subset with an augmentation ratio of 1:1 for the EDA and paraphrasing methods, while our augmentation method results in an average augmentation ratio of 1:0.47.
We evaluate on subsets of 0.1%, 0.2%, 0.5%, and 1.0% of the original training data, and the results are presented in Figure 3.
Our augmentation method shows a consistent improvement in the low-resource conditions across all evaluated datasets, and notably outperforms the baseline trained on the full dataset with just 1% of the original training data on the Symmetric evaluation set. On the other hand, the PoE method shows little to no improvement on all of the datasets except FM2, and in some cases shows a slight performance drop. These results indicate that PoE, which relies on training a biased model to regularize learning from biased samples, does not generalize well in data-scarce settings.
EDA and paraphrasing augmentation show a moderate improvement over the baselines, verifying their effectiveness in low-resource conditions. However, our augmentation method still achieves a marked improvement over EDA and paraphrasing on the Adversarial and FM2 datasets, implying that our approach is more robust and effective at generalizing to out-of-distribution samples. In summary, the results with varying training dataset sizes show that our augmentation method is effective for low-resource domains.
4. Conclusion
In this work, we propose a novel data augmentation method for debiasing fact verification models. Our approach generates negative claim and evidence pairs and forms contrastive samples to augment the data, which encourages the trained model to rely less on spurious correlations and to learn better representations. We evaluate our approach on various fact verification datasets and show that it outperforms previous methods on the unbiased evaluation set. We also show that our approach is effective in low-resource conditions with limited data compared to regularization-based debiasing approaches.
Acknowledgements. Kyomin Jung is with ASRI, Seoul National University, Seoul, Korea. This work was supported by AIRS Company in Hyundai Motor and Kia through HMC/KIA-SNU AI Consortium Fund.
References
- Clark et al. (2019). Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. In Proceedings of EMNLP-IJCNLP, Hong Kong, China, pp. 4069–4082.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Eisenschlos et al. (2021). Fool me twice: entailment from Wikipedia gamification. In Proceedings of NAACL-HLT, pp. 352–365.
- Geva et al. (2019). Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of EMNLP-IJCNLP, pp. 1161–1166.
- Gururangan et al. (2018). Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT.
- Karimi Mahabadi et al. (2020). End-to-end bias mitigation by modelling biases in corpora. In Proceedings of ACL, Online, pp. 8706–8716.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. In Proceedings of ICLR, San Diego, CA, USA.
- Krishna et al. (2020). Reformulating unsupervised style transfer as paraphrase generation. In Proceedings of EMNLP, Online, pp. 737–762.
- Lewis et al. (2020). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of ACL, pp. 7871–7880.
- Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
- Sathe et al. (2020). Automated fact-checking of claims from Wikipedia. In Proceedings of LREC, pp. 6874–6882.
- Schuster et al. (2019). Towards debiasing fact verification models. In Proceedings of EMNLP-IJCNLP, Hong Kong, China, pp. 3419–3425.
- Thorne et al. (2018). FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of NAACL-HLT, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 809–819.
- Thorne and Vlachos (2019). Adversarial attacks against Fact Extraction and VERification.
- Utama et al. (2020). Mind the trade-off: debiasing NLU models without degrading the in-distribution performance. In Proceedings of ACL, pp. 8717–8729.
- Wang (2017). “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In Proceedings of ACL.
- Wei and Zou (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of EMNLP-IJCNLP, pp. 6383–6389.