
Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks

There has been significant progress in recent years in the field of Natural Language Processing thanks to the introduction of the Transformer architecture. Current state-of-the-art models, via a large number of parameters and pre-training on massive text corpora, have shown impressive results on several downstream tasks. Many researchers have studied previous (non-transformer) models to understand their actual behavior under different scenarios, showing that these models take advantage of clues or failures of datasets and that slight perturbations of the input data can severely reduce their performance. In contrast, recent models have not been systematically tested with adversarial examples in order to show their robustness under severe stress conditions. For that reason, this work evaluates three transformer-based models (RoBERTa, XLNet, and BERT) in Natural Language Inference (NLI) and Question Answering (QA) tasks to determine whether they are more robust or have the same flaws as their predecessors. Our experiments reveal that RoBERTa, XLNet and BERT are more robust than recurrent neural network models to stress tests for both NLI and QA tasks. Nevertheless, they are still very fragile and demonstrate various unexpected behaviors, thus revealing that there is still room for future improvement in this field.





1 Introduction

The use of deep learning has allowed for solving several problems related to natural language processing (NLP), even surpassing human performance in some tasks. However, previous research has shown that neural networks are powerful enough to memorize the training data, which limits their ability to generalize or to really understand the tasks they are dealing with [35]. Moreover, some recent studies propose evaluation scenarios for neural-based models in various natural language understanding (NLU) tasks.

One way to test NLP models is by using adversarial tests, which involve an intentional perturbation of the input sentence to confuse a model into making wrong predictions. This methodology has shown that models are still weak [2, 10, 22, 6]. Other researchers have also shown that language models can “falsely” solve the task. In other words, they might be taking advantage of dataset failures or artifacts in the input sentences in order to guess the answer [7, 1, 13]. These evaluations, also known as “stress tests”, have been performed on classic models based on recurrent networks (RNN). However, transformer-based models such as RoBERTa [14], XLNet [33] and BERT [5], which are state-of-the-art for NLU tasks, have not been systematically evaluated under severe stress conditions. Only BERT has been tested with objectives similar to ours [9, 11, 18], but neither as systematically as here nor in the same scenarios.

In this work, we focus on three language models based on the state-of-the-art transformer architecture (RoBERTa, XLNet and BERT), with the aim of carrying out a stress test evaluation on two NLU tasks. The first is Natural Language Inference (NLI), also known as recognizing textual entailment (RTE), which consists of finding semantic relations between a premise sentence and an associated hypothesis by classifying whether they are in an entailment, contradiction or neutral relationship. The second is question answering (QA), also known as machine reading comprehension (MRC), which consists of predicting the answer to a question given a paragraph.

The evaluation of the NLI task was performed using the MultiNLI dataset [30], following the methodology of Naik et al. (2018). For the QA task we used the SQuAD dataset [21] and the adversarial techniques introduced by Jia and Liang (2017). We also developed a new adversarial dataset for SQuAD, using techniques inspired by Belinkov and Bisk (2018); we released this dataset.

All test procedures use adversarial examples to test the strength of the models, by distracting them, confusing them, or probing their competence.

Experiments show that all models are affected by stress tests, but on transformer-based models, the adversaries have smaller impact compared to previous models based on RNNs. This behavior could be explained by the large number of parameters and prior training of these models. Nevertheless, in this work we not only measure the impact on performance of various adversarial or noisy conditions, but also reveal that in some cases the state-of-the-art models behave in strange and unexpected ways.

We provide detailed quantitative analysis on all the performed tests, and in some cases we report representative examples via inspection of the attention matrices that these models produce during inference when tested under adversarial test scenarios.

2 Transformer for Natural Language Understanding

The Transformer [27] is a deep learning architecture originally proposed to improve the performance of neural machine translation applications. The main idea behind this model is multi-head self-attention: the ability to attend to different parts and aspects of the input sequence to compute a contextual representation of it, at increasing levels of abstraction (layers). This architecture overcomes the long-term dependency problems that are common in Recurrent Neural Network (RNN) models, and has the added benefit of being highly parallelizable.

Early works such as GPT [20] and BERT [5] proposed variants of the Transformer architecture for language modeling [3]. These works show that the representations learned on large-scale language modeling datasets are effective for downstream sentence-level tasks (e.g., NLI) and token-level tasks (e.g., QA) via fine-tuning. However, in contrast to RNNs, no systematic evaluation of robustness and failure modes for this kind of model (especially the most recent variants) has been performed in previous works.

In this work, we evaluate three state-of-the-art models in their large versions: BERT [5], which was the first model to introduce bidirectional representations in the transformer encoder together with masked language modeling; XLNet [33], which proposed permutation language modeling to avoid corrupting the input with masks; and RoBERTa [14], which can be seen as an optimization of BERT that includes additional pre-training and hyperparameter improvements.

We use the HuggingFace Python library [32], which includes pre-trained models, to fine-tune each model into a classifier for the NLI task and a regressor for the QA task. We used the hyperparameters specified in the original paper of each model, to achieve an accuracy close to the one reported for each task.

Additionally, we include pre-transformer baselines as a comparison reference. These models are based on the LSTM architecture [8] and are task-dependent. However, our analysis and discussion are mainly about experiments on transformer-based models.

3 NLI Task Description

3.1 Task

The MultiNLI corpus is a crowd-sourced collection of 433k sentence pairs from a broad range of genres, annotated with textual entailment information. In this task, given a premise, the model has to determine whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral).

3.2 Baselines

As a baseline to evaluate stress test performance for this task, we chose the winner of the RepEval 2017 Shared Task [16], which proposed a model of stacked BiLSTMs with residual connections [17]. We also used the baseline proposed in the original paper of the dataset [30], which consists of a standard BiLSTM.

4 QA Task Description

4.1 Task

SQuAD, the Stanford Question Answering Dataset [21] is a widely used Question Answering benchmark that consists of a collection of English Wikipedia paragraphs with more than 100k associated question-answer pairs generated via crowdsourcing. The task is designed in a way that the solution to each question is literally contained in the corresponding paragraph, so the task is to predict the answer text span in the corresponding passage. We use SQuAD v1.1 instead of SQuAD v2.0 to allow comparability with previous work.

4.2 Baselines

To be consistent with previous work, we used BiDAF [24] and Match-LSTM [29] as baselines to compare stress tests against transformer-based models. BiDAF consists of embedding, attention and modeling layers with a BiLSTM that outputs a vector with information about the context and the query, followed by an output layer with probabilities indicating where the answer starts and ends in the context text. Match-LSTM is an architecture that remembers important word-level matching results to get better predictions of the answers.

5 Experiments

5.1 NLI Task Evaluation

Our experiments on the MultiNLI dataset closely follow the procedure of Naik et al. (2018), who conducted a stress test evaluation of several models of the RepEval 2017 Shared Task. Below we describe each test set used in this work (we use the sets provided by the authors to avoid discrepancies during the procedure), and Table 1 shows some examples. For further details on the construction of the sets, we refer readers to the work by Naik et al. (2018).

Test Set | Premise | Hypothesis
Word overlap | Then he ran. | He ran like an athlete and true is true.
Length mismatch | Then he ran and true is true and true is true and true is true and true is true and true is true. | He ran like an athlete.
Negation | Then he ran. | He ran like an athlete and false is not true.
Spelling error | Then he ran. | He ran like an athleet.
Antonymy | The Joint Venture had justified itself by failure. | The Joint Venture had justified itself by success.
Numerical reasoning | Adam spent 1/6 of his lifetime in adolescence. | Adam spent less than 1/6 of his lifetime in adolescence.
Table 1: Examples of stress tests for the NLI task.

5.1.1 Distraction Test

The distraction test explores the model's robustness after a text with a clear “True” value is added.

  • One way to evaluate this is by decreasing the lexical similarity between premise and hypothesis. On the one hand, the word overlap set adds a tautology (“and true is true”) at the end of each hypothesis sentence. On the other hand, the length mismatch set adds the same tautology five times to each premise.

  • We can also evaluate this by the inclusion of strong negations. The negation set is quite similar to the previous ones, but in this case, the tautology added to the hypothesis includes negation words (“and false is not true”).
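The three distraction sets can be illustrated with a minimal sketch. It assumes the tautology is appended before the final period of the sentence, as in the examples of Table 1; the function names and the `_append` helper are hypothetical and are not taken from the original stress-test code.

```python
TAUTOLOGY = "and true is true"
NEGATION_TAUTOLOGY = "and false is not true"


def _append(sentence: str, suffix: str) -> str:
    """Append a clause to a sentence, keeping a trailing period if present."""
    stripped = sentence.rstrip()
    if stripped.endswith("."):
        return f"{stripped[:-1]} {suffix}."
    return f"{stripped} {suffix}"


def word_overlap(premise: str, hypothesis: str) -> tuple[str, str]:
    """Add the tautology once to the hypothesis."""
    return premise, _append(hypothesis, TAUTOLOGY)


def length_mismatch(premise: str, hypothesis: str, repeats: int = 5) -> tuple[str, str]:
    """Add the tautology `repeats` times to the premise."""
    return _append(premise, " ".join([TAUTOLOGY] * repeats)), hypothesis


def negation(premise: str, hypothesis: str) -> tuple[str, str]:
    """Add a tautology containing negation words to the hypothesis."""
    return premise, _append(hypothesis, NEGATION_TAUTOLOGY)
```

Applied to the pair ("Then he ran.", "He ran like an athlete."), these functions reproduce the first three rows of Table 1.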

5.1.2 Noise Test

This test verifies the model's strength against noisy data in the form of spelling errors. It applies one of two types of perturbation to a word randomly selected from the hypothesis: a swap of adjacent characters within the word, or a random substitution of a character with one next to it on the English keyboard. Note that only one perturbation is performed for the entire sentence.
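A minimal sketch of this single-word perturbation follows. The keyboard adjacency map is a simplifying assumption (only left and right neighbours within a QWERTY row, lowercase letters only), and all function names are hypothetical; the evaluation itself uses the sets released by the original authors.

```python
import random

# Assumption: adjacency is limited to left/right neighbours in each QWERTY row.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEYBOARD_NEIGHBORS = {
    ch: [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    for row in ROWS
    for i, ch in enumerate(row)
}


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def keyboard_typo(word: str, rng: random.Random) -> str:
    """Replace one letter with a neighbouring key (the result is lowercase)."""
    positions = [i for i, c in enumerate(word) if c.lower() in KEYBOARD_NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    replacement = rng.choice(KEYBOARD_NEIGHBORS[word[i].lower()])
    return word[:i] + replacement + word[i + 1:]


def perturb_hypothesis(hypothesis: str, rng: random.Random) -> str:
    """Apply exactly one perturbation to one randomly chosen word."""
    words = hypothesis.split()
    if not words:
        return hypothesis
    idx = rng.randrange(len(words))
    operation = rng.choice([swap_adjacent, keyboard_typo])
    words[idx] = operation(words[idx], rng)
    return " ".join(words)
```

Because only one word is touched, the perturbed hypothesis differs from the original in at most one token, which is what makes this test much milder than the QA noise tests of Section 5.3.2.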

5.1.3 Competence Test

The competence test consists of two evaluation sets to measure the reasoning ability of the models.

  • Understanding of antonymy relationships. This set includes sentences that result in contradiction simply by using an antonym in some adjectives or nouns.

  • Numerical reasoning ability of a model. This evaluation includes statements of simple algebraic problems with solutions as premises. The entailed, contradictory and neutral hypotheses were generated through the use of heuristic rules.

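Column groups: Distraction Test (word overlap, negation, length mismatch), Noise Test (spelling error), Competence Test (antonymy, numerical reasoning).
Model | Dev (M) | Dev (MM) | Word overlap (M) | Word overlap (MM) | Negation (M) | Negation (MM) | Length mismatch (M) | Length mismatch (MM) | Spelling error (M) | Spelling error (MM) | Antonymy (M) | Antonymy (MM) | Numerical reasoning
RoBERTa | 90.0 | 89.7 | 64.3 | 62.3 | 59.0 | 58.5 | 87.5 | 88.2 | 85.3 | 85.7 | 63.9 | 59.2 | 64.9
XLNet | 89.2 | 89.1 | 71.0 | 68.9 | 60.0 | 59.5 | 87.2 | 87.5 | 83.5 | 83.7 | 74.7 | 70.9 | 63.9
BERT | 86.0 | 86.1 | 61.2 | 56.8 | 57.3 | 57.6 | 83.7 | 84.6 | 79.5 | 79.8 | 64.6 | 59.2 | 56.8
S-BiLSTM | 74.2 | 74.8 | 47.2 | 47.1 | 39.5 | 40.0 | 48.2 | 47.3 | 51.1 | 49.8 | 15.1 | 19.3 | 21.2
BiLSTM | 70.2 | 70.8 | 57.0 | 58.5 | 51.4 | 51.9 | 49.7 | 51.2 | 65.0 | 65.1 | 13.2 | 9.8 | 31.3
Table 2: Classification accuracy (%) of transformer-based models and baselines. Both genre-matched (M) and mismatched (MM) sets were evaluated.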

5.2 NLI Task Results

Table 2 shows the results of the performed tests. All models lose accuracy in every evaluation. However, transformer-based models show more robustness in some tests. The results of the models in each stress test are analyzed in the following sections.

Figure 1: Accuracy results in the development set and adversarial sets: word overlap, negation, length mismatch and spelling error. Only the matched partition is shown.

5.2.1 Models Performance on Distraction Test

Figure 1 shows a bar graph of the “matched” partition of the evaluation sets on the different types of distraction tests. As mentioned in a previous section, the distraction tests allow us to check the robustness in two different ways.

On the one hand, introducing negation words drops the models' accuracy below 60%, close to the baselines. We compared the model predictions on the negation test with those on the development set and found that BERT and XLNet obtained 93% and 91% of E-N error (entailment predicted as neutral), respectively. In contrast, RoBERTa obtained 85% of N-E error (neutral predicted as entailment). This could be due to the introduction of extra negation words (“false” and “not”).

On the other hand, the decrease of lexical similarity by word overlap and length mismatch evaluation shows:

  • In the first case (word overlap set), the transformer-based models reach around 60% accuracy, which is approximately 20 percentage points less than in the development set. We found behavior similar to that of the previous set (negation): BERT and XLNet obtained 83% and 61% of E-N error, respectively, while RoBERTa stands out with 89% of N-E error.

  • In the second case (length mismatch), the models performed better than expected, reaching almost the same accuracy as in the development set. We hypothesize that this is because the length mismatch set modifies the premise sentence instead of the hypothesis, unlike the negation and word overlap sets, which suggests that the model pays more attention to the hypothesis when answering.

To verify the results on the length mismatch set, we extended the evaluation by adding the tautology “and true is true” to the hypothesis or to the premise a varying number of times. Figure 2 shows the performance of XLNet in these tests; we observed similar behavior in the other models. We noticed that adding the distractions to the premise sentence does not affect the model performance. However, when we add the tautology a single time to the hypothesis sentence (which is equivalent to the word overlap test), the performance drops by about 20 percentage points. The more repetitions we add, the more the accuracy recovers, almost reaching the performance obtained in the development set. We also checked the attention weights and did not identify anomalous behavior.

The unexpected result in accuracy indicates that the lexical similarity is not a strong enough signal to generate distraction in this type of model; in other words, the model can discern the tautologies. Moreover, the model seems to pay more attention to the hypothesis sentence to perform this task, without discarding the premise. However, the distraction evaluation indicates that these transformer-based models are fragile to adversarial attacks that include strong negation words.

Figure 2: Accuracy (%) of XLNet after the addition of different number of tautologies in hypothesis or premises.

5.2.2 Models Performance on Noise Test

The noise test with the spelling error set shows that transformer-based models perform very well. They lose only between 2 and 5 percentage points in accuracy with respect to the development set. The results suggest that the multi-head self-attention mechanism of these models is very effective at recovering the global information from the corrupted sentence.

However, the adversarial attacks of this set only modify one word of the hypothesis. This explains why there is no sudden drop in performance in models, even for the BiLSTM-based models.

5.2.3 Models Performance on Competence Test

As we supposed, transformer-based models work quite well in this evaluation task. In the case of the antonymy test, the models exceeded baselines by approximately 50 percentage points in accuracy. This is probably because transformers were pre-trained on a diverse and big corpus, allowing them to adequately represent the majority of the words of the dictionary. XLNet and BERT were trained with BookCorpus and Wikipedia, so we expected better accuracy of RoBERTa which used additional data. However, XLNet outperformed others by at least 10 percentage points, suggesting that permutation modeling could help capture antonymy relationships better.

Furthermore, the results on the numerical reasoning evaluation show a lower performance for all models. In this task, XLNet and RoBERTa have similar accuracy but have different behavior. On the one hand, XLNet specialized in classifying the “entailment”, achieving 90% in that class. On the other hand, RoBERTa specialized in “neutral” category, obtaining 89% of correct answers. In both cases, the remaining classes achieved less than 74% of accuracy (the model finds it hard to distinguish between those classes). These results indicate that transformer-based models trained in the NLI task have serious difficulties in numerical reasoning and that they take different strategies to solve the task.

For both evaluations, we also explored the attention weights via the BertViz library [28]. Appendix B shows a brief analysis of some specific cases on all the mentioned transformer-based models.

5.2.4 Annotation Artifacts Exploitation Test

Gururangan et al. (2018) found that the MultiNLI dataset has annotation artifacts. This means that the crowd workers who participated in the creation of the data adopted heuristics to generate the hypotheses quickly and easily. For instance, they often used keywords such as “not”, “never”, etc. to create negation sentences.

To evaluate if transformer-based models leverage the artifacts, we tested the models by removing the premise sentence in the development set. In other words, the models are unaware of the premises of the dataset.
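The setup of this test can be sketched as follows. This is an illustration only: it assumes that removing the premise means replacing it with an empty string, and that examples are dictionaries with `premise`, `hypothesis` and `label` keys; both are assumptions, as are the function names.

```python
from collections import Counter


def remove_premise(examples: list[dict]) -> list[dict]:
    """Return copies of the examples with the premise blanked out."""
    return [{**example, "premise": ""} for example in examples]


def majority_class_accuracy(labels: list[str]) -> float:
    """Accuracy (%) of always predicting the most frequent label."""
    _, frequency = Counter(labels).most_common(1)[0]
    return 100.0 * frequency / len(labels)
```

A premise-unaware model whose accuracy stays close to `majority_class_accuracy` is not extracting a usable signal from the hypothesis alone, whereas a clearly higher accuracy points to exploitation of annotation artifacts.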

Table 3 shows the results of this experiment. Transformer-based models perform similarly to the majority class (used here as a baseline of random guessing), which denotes an unbiased guess by the models. In contrast, BiLSTM-based models correctly classify a significant proportion of samples without even looking at the premise, which is an undesirable behavior. This result demonstrates that transformer-based models are in fact learning to take into account and relate the two sentences of the NLI task in order to choose the correct answer, which is consistent with the findings in Section 5.2.1.

Model | Matched | Mismatched
Majority Class | 35.4 | 35.2
RoBERTa | 35.2 | 35.8
XLNet | 35.4 | 35.8
BERT | 35.5 | 35.7
S-BiLSTM | 45.2 | 45.4
BiLSTM | 37.4 | 38.3
Table 3: Performance (%) of premise-unaware text models on the MultiNLI development set. Greater accuracy means more exploitation of artifacts, so smaller numbers mean that the models performed better.

5.3 QA Task Evaluation

One of our test scenarios was taken from Jia and Liang (2017), who intentionally add a new adversarial sentence at the end of the SQuAD passages of the development set. These sentences are specifically designed (via different strategies) to act as decoys that confuse the model. The other test scenario is inspired by Belinkov and Bisk (2018). Although originally proposed for a different task, we replicated the 5 types of noise proposed by the authors and applied them to the development set of SQuAD.

5.3.1 Adversarial Sentence Tests

Jia and Liang (2017) proposed 4 strategies to create a sentence specifically designed to confuse models by pretending to be the correct answer to a specific question, although it is unrelated to the question. This adversarial sentence is concatenated to the corresponding paragraph provided at test time. The 4 strategies proposed were:

  • AddOneSent: Adjectives and nouns of the question are replaced by antonyms. Named entities and numbers are replaced by their nearest word in GloVe [19]. This modified question is then turned into declarative form (using a set of manually defined rules) and a fake answer of the same type as the original answer is inserted. Finally the sentence is manually checked and fixed via crowdsourcing.

  • AddSent: Identical to AddOneSent but generating multiple candidate sentences (adversaries) and keeping only the one that induces the biggest error when tested on a specific model.

  • AddAny: The adversarial sentence is generated by sampling random words and successively replacing them by elements from a sampled set of words each time. Words are selected from this set by using a criterion that tries to minimize the confidence of the model on the correct answer. The 20-word set is sampled from a list of common words plus the words from the question. This process is repeated iteratively 6 times for each adversarial phrase.

  • AddCommon: Identical to AddAny, but in this case the 20-word set is sampled from the list of common words directly.
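The word-level search behind AddAny and AddCommon, as described in the list above, can be sketched as a greedy coordinate search. This is a simplified illustration, not the original implementation: `confidence_fn` is a hypothetical callable that stands in for querying the target model's confidence on the correct answer, the sentence length of 10 words is an assumed default, and the candidate pool is sampled from the union of common and question words. Passing an empty list of question words corresponds to the AddCommon variant.

```python
import random
from typing import Callable, Sequence


def add_any_sketch(
    question_words: Sequence[str],
    common_words: Sequence[str],
    confidence_fn: Callable[[list[str]], float],
    length: int = 10,      # assumed length of the adversarial sentence
    pool_size: int = 20,   # size of the sampled candidate set
    epochs: int = 6,       # number of passes over the sentence
    seed: int = 0,
) -> list[str]:
    """Greedily pick words that minimize the model's confidence."""
    rng = random.Random(seed)
    vocabulary = list(common_words) + list(question_words)
    sentence = [rng.choice(vocabulary) for _ in range(length)]
    for _ in range(epochs):
        for pos in range(length):
            pool = rng.sample(vocabulary, min(pool_size, len(vocabulary)))
            # Keep the current word as a candidate so confidence never increases.
            candidates = [sentence[pos]] + pool
            sentence[pos] = min(
                candidates,
                key=lambda w: confidence_fn(sentence[:pos] + [w] + sentence[pos + 1:]),
            )
    return sentence
```

Since every candidate is scored by querying the model under attack, the resulting sentences are tailored to that model, which is relevant when they are reused against other models.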

Article: Super Bowl 50

Context: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.

Question: What is the name of the quarterback who was 38 in Super Bowl XXXIII?

Original prediction: John Elway

Prediction after adversarial phrase is added: Jeff Dean

Figure 3: An example of an AddOneSent adversarial sample, taken from Jia and Liang (2017). The model correctly answers the original question, but after the inclusion of the adversarial sentence (the last sentence of the context), the model fails.

5.3.2 Noise Tests

Although originally proposed for a different task, we replicated the 5 types of noise introduced by Belinkov and Bisk (2018). In each experiment, a specific noise type was applied to each word in the passage of SQuAD’s development set. The question was kept unchanged, and the answers were adapted to preserve consistency with the modified passage. In contrast to the noise tests performed in the NLI setting (Section 5.1.2), the scenario tested here is significantly more aggressive because it introduces noise to every word in the reference text.

The 5 noise types tested are:

  • Natural Noise: Words are replaced by real typing errors of people. To automate this, we used a collection of word corrections performed by people in web platforms that keep track of edits history [15, 34, 31, 23].

  • Swap Noise: For each word in the text, one random pair of consecutive characters is swapped.

  • Middle Random Noise: For each word in the text, all characters are shuffled, except for the first and last characters.

  • Fully Random Noise: For each word in the text, all characters are shuffled.

  • Keyboard Typo Noise: For each word in the text, one character is replaced by an adjacent character in traditional English keyboards.
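The three shuffling-based noise types can be sketched in a few lines. The sketch assumes whitespace tokenization with punctuation left attached to the word, which appears consistent with the example in Figure 4; the function names are hypothetical. Natural Noise and Keyboard Typo Noise are omitted because they additionally require a collection of real misspellings and a keyboard layout, respectively.

```python
import random


def swap_noise(word: str, rng: random.Random) -> str:
    """Swap one random pair of consecutive characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def middle_random_noise(word: str, rng: random.Random) -> str:
    """Shuffle all characters except the first and the last."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def fully_random_noise(word: str, rng: random.Random) -> str:
    """Shuffle all characters of the word."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)


def apply_noise(passage: str, noise_fn, seed: int = 0) -> str:
    """Apply a noise function to every whitespace-separated word."""
    rng = random.Random(seed)
    return " ".join(noise_fn(word, rng) for word in passage.split())
```

For example, `apply_noise(passage, middle_random_noise)` corrupts every word of a passage while preserving each word's first and last characters, in the style of Figure 4. In the actual test the gold answers must also be adapted to match the noisy passage, a step this sketch does not cover.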

5.4 QA Task Results

Similarly to the observations for the NLI experiments, for QA it is clear that the performance of all models is affected by the stress tests, with transformer-based models being the most robust in all the cases analyzed.

5.4.1 Results on Adversarial Sentence Tests

Figure 5 shows a bar graph that compares the accuracy of the tested models under the different adversarial strategies.

When we analyze the results of the AddOneSent experiments, we notice a smaller accuracy reduction for the transformer-based models than for the non-transformer models. Despite being more robust than their counterparts, transformer-based models still suffer a significant impact on performance, which reveals a clear opportunity for future improvements on this kind of model. The same phenomenon is observed for AddSent adversaries, but more pronounced (as expected, since AddSent tests the worst case for each candidate question). Here too, the accuracy reductions for transformer-based models are smaller than those for non-transformer models.

We notice that the more powerful a model is in the main task (accuracy on the unmodified SQuAD 1.1 development set), the more robust it is. This conclusion is encouraging because other works have asserted that more powerful models could owe their performance to their higher capacity for memorization [35]. These experiments, in contrast, indicate that the models are improving their reading capabilities in a balanced fashion.

Interestingly, AddAny and AddCommon adversaries show that those strategies are very model-specific, as evidenced by the fact that transformer-based models lose only a small amount of accuracy when tested against adversaries on which other models failed. These results are interesting because, as reported by Jia and Liang (2017), those adversaries (and especially AddAny) turned out to be very effective at misleading the models that they were targeting. This cross-check between different models' adversaries for AddAny is consistent with the results reported by Jia and Liang (2017), although in the case of transformer-based models the aforementioned behavior is even more pronounced. For AddCommon, on the other hand, these tests were neither reported in previous work nor analyzed by the authors who proposed these adversaries, so this finding is especially relevant.

Further details on the results of every experiment performed can be found in Appendix A. Also in Appendix C we perform a more qualitative analysis of the attention matrices that these models produce during inference.

5.4.2 Results on Noise Tests

As shown in Figure 6, all five types of noise have a significant negative impact on accuracy on all the tested models. The accuracy reduction is more prominent than on Adversarial Sentence tests (Section 5.4.1) due to the aggressiveness of the strategies tested here.

Swap Noise has a significant impact on accuracy, although only a single pair of characters per word is altered. Performance is only slightly better than with Middle Random Noise (in which all the letters are shuffled, except for the first and last characters). We hypothesize that this is because, after introducing this change, the resulting tokenizations differ significantly from the original ones and are also very different from the ones seen in training or fine-tuning, so the model is not prepared to answer accurately.

Article: Genghis Khan

Context: (…) Maluqi, a tsuretd lteneitnau, was given cmmnoad of the Monogl focres angisat the Jin dytasny whlie Gneghis Kahn was ftgniihg in Ctneral Aais, and Stbuaui and Jbee were aeolwld to prusue the Great Raid itno the Cuaucsas and Kaiven Rus’, an idea tehy had peestrned to the Kaaghn on tehir own ieivtnitia. Whlie grnniatg his gneaelrs a gerat dael of amotonuy in mkiang canommd diesscion, Gnhgeis Kahn also epecxted uvwannrieg layolty from them.

Question: Who was delegated command of the Mongol forces against the Jin dynasty?

Answer: Maluqi

Figure 4: An adversarial example from SQuAD after the introduction of Middle Random noise. Note that only the context (and the answer, accordingly) is modified, but not the question.

Note also that, in absolute terms, under Middle Random noise, the model is still able to correctly answer one in four questions, despite the fact that the text is severely transformed (for an example see Figure 4).

Another interesting pattern that these tests showed is the fact that for transformer-based models the Keyboard Typo noise is clearly more difficult to deal with than Swap Noise. This finding is especially interesting because Keyboard Typo noise corrupts only one character for each word, and Swap Noise corrupts two. For this reason this result is opposed to what we expected, and reveals that swapping operations affect these models less than replacement operations. This effect may be caused by the fact that the tokenized representation of words with swapped characters might be closer to the original one (in the embedding space of each model), or maybe it is because this kind of noise might be more frequent in real misspellings than keyboard typos, and thus the models were more exposed to this kind of noise during pre-training. Further study is required to find out which phenomenon is the dominant one in this case, but this analysis is out of the scope of this work.

Similarly to what was reported by Belinkov and Bisk (2018), Natural Noise is significantly easier to overcome than the other 4 tested noise types, even considering that in the dataset we built for Natural Noise we forcibly replaced every word by a noisy version of it (when real typing errors were available). It is natural to think that in real scenarios, misspelled words will appear at a much lower rate than in this test. Thus this result can be seen as a kind of lower-bound estimate for performance on Natural Noise in real scenarios. When we compare the results of the Natural Noise experiments with those of the Swap Noise experiments, we hypothesize that the gap in favor of Natural Noise arises because, during the pre-training phase, the model observed this type of noise (in real occurrences) and was therefore able to learn useful representations both for well-written words and for versions with common misspellings.

Figure 5: Accuracy results in the adversarial sets proposed by Jia and Liang (2017). AddAny* and AddCommon* report the worst accuracy after running against all the alternative adversarial datasets of that specific type published by the original authors. For fair comparison, experiments on the adversaries generated for the model itself are excluded in those two specific cases.

Figure 6: Accuracy results in SQuAD when the models are exposed to noise tests. It is clear that all noise types heavily affect the performance of all the models. Further comparative analysis (Section 5.4.2) shows some interesting and unexpected findings in these results.

6 Related Work and Discussion

Prior work [25] discusses the importance of evaluation frameworks that allow characterizing model success and failures. During previous years, several approaches to test NLP models have been proposed on various tasks, showing that most of the time, predictions are memorized without really understanding the real meaning of utterances.

Early research demonstrated that NLP models are fragile to input perturbations. Some attempts at performing stress tests on machine translation systems demonstrated that by adding small perturbations on the input text, the general performance of language models could be profoundly affected [2, 22, 6]. In the same line, the inspiring work of Jia and Liang (2017) proposed an evaluation procedure for language models using the SQuAD dataset. They used SQuAD samples, concatenating adversarial sentences at the end of the paragraph that contains the answer, and showed that 14 open-source models failed when these changes are introduced.

Other relevant findings reveal that models take advantage of lexical cues in the dataset, which allows them to appear to solve the problem without actually doing so. Gururangan et al. [7] observed that some NLI datasets have annotation artifacts that models efficiently use to predict the answer without even seeing the rest of the sentence. The same problem was found in the Visual Question Answering (VQA) field. Agrawal et al. [1] analyzed the behavior of three models based on CNN, LSTM, and attention mechanisms by adding adversaries only to the caption of the image, finding that most of the time the models were paying attention to the text and not the image at inference time.

The success of language models based on the Transformer architecture in tasks such as machine translation [27, 26], text summarization [12], and reading comprehension [4], among others, motivated new research. Recent work has performed adversarial testing of BERT on the Argument Reasoning Comprehension Task [18], showing that, when tested against adversaries, BERT outperforms BiLSTM and Bag of Vectors baselines but still has trouble with logical understanding. Furthermore, Jin et al. [11] showed that BERT is the language model that performs best under adversarial attacks when compared to CNN and LSTM in terms of success rate and perturbation rate, preservation of semantic content, and efficiency for text classification tasks. Hsieh et al. [9] also studied BERT and compared it with recurrent architectures, inspecting the attention matrices of the models and proposing an algorithm to generate adversaries that distract models but not humans.

Although there has been considerable progress in this area, this article differs from previous work by systematically evaluating adversaries, artifacts, and various severe stress conditions on state-of-the-art Transformer-based language models (BERT and the models that came after it), in order to verify their language comprehension capabilities and generalization power.

7 Conclusion

We conducted a stress test evaluation for transformer-based language models in NLI and QA tasks. In general, our experiments indicate that applying stress tests influenced the performance of all models, but as expected, more recent models such as XLNet and RoBERTa are more robust, showing a better response to this evaluation.

In the NLI task, we verified that distraction and noise sets significantly reduce the performance of all models. However, concerning the competency test, the models perform better because they were pre-trained for this particular task.

Moreover, in the QA task, experiments revealed that all models suffer in performance when tested with adversarial or noisy samples. Despite this, transformer-based models turned out to be more robust than their predecessors. We compared transformer-based models against each other and observed that, as they improved on the main task, they also improved in robustness in a balanced way. We also noticed that some adversaries are model-specific, as they affect one model but not the rest. In the noise tests specifically, we observed that the robustness trend also holds, but noticed some unexpected behavior in the relative analysis: some types of noise affect the models more severely than others, revealing specific weak points across all transformer-based models that were not evident at first sight.

We consider this evaluation to be valuable to the community because it exhibits the strengths and weaknesses of the state-of-the-art models. We argue that it is vital that models pass behavioral checks to ensure proper performance in extreme scenarios, where data failures are not being considered. Taking this into consideration, we see that there is still room for future improvements on transformer-based models.

8 Bibliographical References


  • [1] A. Agrawal, D. Batra, and D. Parikh (2016) Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1955–1960.
  • [2] Y. Belinkov and Y. Bisk (2018) Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
  • [3] Y. Bengio, R. Ducharme, and P. Vincent (2001) A neural probabilistic language model. In Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), pp. 932–938.
  • [4] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser (2018) Universal transformers. arXiv preprint arXiv:1807.03819.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [6] J. Ebrahimi, D. Lowd, and D. Dou (2018) On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 653–663.
  • [7] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 107–112.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [9] Y. Hsieh, M. Cheng, D. Juan, W. Wei, W. Hsu, and C. Hsieh (2019) On the robustness of self-attentive models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1520–1529.
  • [10] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer (2018) Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1875–1885.
  • [11] D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2019) Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.
  • [12] D. Kroening, N. Sharygina, S. Tonetta, A. Tsitovich, and C. M. Wintersteiger (2008) Loop summarization using abstract transformers. In International Symposium on Automated Technology for Verification and Analysis, pp. 111–125.
  • [13] O. Levy, S. Remus, C. Biemann, and I. Dagan (2015) Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 970–976.
  • [14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • [15] A. Max and G. Wisniewski (2010) Mining naturally-occurring corrections and paraphrases from Wikipedia’s revision history. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta.
  • [16] N. Nangia, A. Williams, A. Lazaridou, and S. Bowman (2017) The RepEval 2017 shared task: multi-genre natural language inference with sentence representations. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, Copenhagen, Denmark, pp. 1–10.
  • [17] Y. Nie and M. Bansal (2017) Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, Copenhagen, Denmark, pp. 41–45.
  • [18] T. Niven and H. Kao (2019) Probing neural network comprehension of natural language arguments. arXiv preprint arXiv:1907.07355.
  • [19] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • [20] A. Radford and I. Sutskever (2018) Improving language understanding by generative pre-training.
  • [21] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392.
  • [22] M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 856–865.
  • [23] K. Šebesta, Z. Bedřichová, K. Šormová, B. Štindlová, M. Hrdlička, T. Hrdličková, J. Hana, V. Petkevič, T. Jelínek, S. Škodová, P. Janeš, K. Lundáková, H. Skoumalová, Š. Sládek, P. Pierscieniak, D. Toufarová, M. Straka, A. Rosen, J. Náplava, and M. Poláčková (2017) CzeSL grammatical error correction dataset (CzeSL-GEC). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • [24] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
  • [25] N. A. Smith (2012) Adversarial evaluation for models of natural language. arXiv preprint arXiv:1207.0245.
  • [26] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, et al. (2018) Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [28] J. Vig (2019) A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714.
  • [29] S. Wang and J. Jiang (2016) Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.
  • [30] A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122.
  • [31] K. Wisniewski, K. Schöne, L. Nicolas, C. Vettori, A. Boyd, D. Meurers, A. Abel, and J. Hana (2013) MERLIN: an online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data.
  • [32] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • [33] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • [34] T. Zesch (2012) Measuring contextual fitness using error contexts extracted from the Wikipedia revision history. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, pp. 529–538.
  • [35] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization.

Appendix A: Detailed Results on SQuAD Tests

In this section, we report the detailed results of all the experiments performed on the adversarial versions of the SQuAD dataset. In all the experiments, each model was trained/fine-tuned on the original SQuAD v1.1 training set and tested on each of the generated adversarial datasets. Table 4 shows the results on the adversarial examples proposed by jia-liang-2017-adversarial, and Table 5 reports the results of the tests using different noise types inspired by Belinkov and Bisk [2]. We see that all models are affected by these adversarial samples, but we also found that some adversaries are model-specific: they do not affect all models as much as they affect the model they target.

Model under Evaluation
Targeted Model    Match-LSTM    BiDAF    BERT-Large    XLNet-Large    RoBERTa-Large
Original (for reference only)
Match-LSTM Single
Match-LSTM Ensemble
BiDAF Single
BiDAF Ensemble
Match-LSTM Single
Match-LSTM Ensemble
BiDAF Single
BiDAF Ensemble
Match-LSTM Single -
Match-LSTM Ensemble - -
BiDAF Single -
BiDAF Ensemble - -
Table 4: Adversarial examples transferability between models. Each row measures accuracy (%) on adversarial examples designed to attack one particular model. Each column reports the test results of one particular model on all the adversarial datasets. Information from the first two columns was taken from jia-liang-2017-adversarial and was included as a reference. For that reason, not all experiments have available results.
BERT-Large XLNet-Large RoBERTa-Large
Original (for reference only)
Swap Noise
Middle Random Noise
Fully Random Noise
Keyboard Typo Noise
Natural Noise
Table 5: Results of applying different types of noise to SQuAD. It is interesting to see how all models get profoundly affected by all noise types, with Natural Noise being the least aggressive, and Fully Random being the most harmful in terms of accuracy.
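The five noise types listed in Table 5 can be sketched as word-level perturbations. This is a minimal illustration based on the usual definitions from Belinkov and Bisk [2], not the authors' implementation: the function names, the tiny keyboard-neighbor map, and the misspelling lookup passed to `natural_noise` are all hypothetical, and the minimum word length of 4 is an assumption.

```python
import random


def swap_noise(word, rng):
    """Swap two adjacent inner letters; the first and last letters stay in place."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def middle_random_noise(word, rng):
    """Shuffle every letter except the first and the last."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def fully_random_noise(word, rng):
    """Shuffle all letters of the word."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)


# Hypothetical excerpt of a QWERTY adjacency map; a real one would cover every key.
KEYBOARD_NEIGHBORS = {"a": "qwsz", "e": "wsdr", "o": "iklp"}


def keyboard_typo_noise(word, rng):
    """Replace one letter with a neighboring key, if the word has a mapped letter."""
    positions = [i for i, c in enumerate(word) if c in KEYBOARD_NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(KEYBOARD_NEIGHBORS[word[i]]) + word[i + 1:]


def natural_noise(word, misspellings, rng):
    """Replace the word with a recorded real misspelling when one is available."""
    options = misspellings.get(word)
    return rng.choice(options) if options else word


def add_noise(text, noise_fn, rng):
    """Apply a word-level noise function to every whitespace-separated token."""
    return " ".join(noise_fn(word, rng) for word in text.split())


rng = random.Random(0)
sentence = "transformer models remain fragile because of noise"
lookup = {"because": ["becuase"]}  # hypothetical misspelling lookup
swapped = add_noise(sentence, swap_noise, rng)
natural = add_noise(sentence, lambda w, r: natural_noise(w, lookup, r), rng)
```

Note that `natural_noise` leaves a word untouched when no real misspelling is recorded for it, which mirrors the "when real typing errors were available" condition, whereas the synthetic noise types alter almost every word.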

Appendix B: Attention-level Results of NLI Task

Antonymy Evaluation

For this analysis, we took a representative adversarial example in which a word in the sentence was replaced by its antonym. The model is asked to decide whether the relationship between the two sentences is contradiction, neutral, or entailment. We expect the model to direct attention between the replaced words in order to predict the correct answer. Consider the following pair of sentences:

I saw that daylight was coming, and heard the people sleeping up.
I saw that daylight was coming, and heard the people waking

In this representative example for testing antonyms, we computed the attention produced by XLNet, RoBERTa, and BERT. We looked for the layers and heads where a clear attention pattern was present between the word and its antonym, as shown in the XLNet, RoBERTa, and BERT antonym test figures. In this particular case, for XLNet, only 2.86% of all attention heads across layers showed this pattern. For RoBERTa, this number was 2.60%, and for BERT, 1.56%. On the other hand, for all models, most of the attention was paid to separators and to all words of the reference sentence without distinction (see the failed antonym test figure).

Figures: XLNet antonym test; RoBERTa antonym test; BERT antonym test; Failed antonym test.
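The percentage of layer-head pairs showing a given attention pattern can be computed as sketched below. This is only an illustration under stated assumptions: the paper does not give its selection criterion, so the fixed `threshold` is hypothetical, as are the function name and the toy attention values. The final lines check one reading of the reported figures: the large model variants have 24 layers with 16 heads each (384 layer-head pairs), and 2.86% is consistent with 11 of those 384; the count of 11 is inferred here, not stated in the text.

```python
def heads_with_pattern(attentions, source, target, threshold=0.5):
    """Find (layer, head) pairs where token `source` attends to token `target`.

    `attentions[layer][head][i][j]` is the attention weight from token i to token j.
    Returns the matching pairs and their share of all layer-head pairs, in percent.
    """
    hits = []
    total = 0
    for layer_index, layer in enumerate(attentions):
        for head_index, head in enumerate(layer):
            total += 1
            if head[source][target] >= threshold:  # hypothetical criterion
                hits.append((layer_index, head_index))
    return hits, 100.0 * len(hits) / total


# Toy input: 2 layers x 2 heads over 3 tokens. Values are invented for illustration.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
focused = [[0.1, 0.1, 0.8], [1 / 3, 1 / 3, 1 / 3], [0.8, 0.1, 0.1]]
toy_attentions = [[uniform, focused], [uniform, uniform]]

hits, percentage = heads_with_pattern(toy_attentions, source=0, target=2)

# Consistency check for the reported 2.86%, assuming 24 layers x 16 heads.
large_model_heads = 24 * 16
inferred_xlnet_percentage = round(100 * 11 / large_model_heads, 2)
```

In the toy input only one of the four heads links the two tokens, so the pattern is present in 25% of the layer-head pairs.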

Numerical Reasoning Evaluation

For numerical reasoning samples in NLI, the expectation is that the model should pay attention to words like “more” or “less” to check whether there is a change in numerical references. Consider the following pair of sentences:

The next day Bob took the test and with this grade, included the new average, was more than 48.
The next day Bob took the test and with this grade, included the new average, was 78.

In this test example, the premise includes “more than 48” and the hypothesis replaces this last part with “78”, yet all the models (XLNet, RoBERTa, and BERT) incorrectly predicted “contradiction”. We observed that the expected pattern (shown in the XLNet, RoBERTa, and BERT numerical test figures) is very infrequent for all models: it appeared in 5.20% of the cases for XLNet, in only 4.42% for RoBERTa, and in 1.30% for BERT. In the other cases, the models focused on sentence separators (as shown in the failed numerical test figure).

Figures: XLNet numerical test; RoBERTa numerical test; BERT numerical test; Failed numerical test.

Appendix C: Attention-level Results of QA Task

QA Task Attention-level Evaluation

For the QA task, we manually inspected failure cases to see how much attention the model paid to the introduced adversaries compared with the correct answer. Here we show one representative example of a “what” question:

Question: What company took over Edison Machine works?
Answer: General Electric.
Adversary: Stark Industries.

In this particular example, with the question “What company took over Edison Machine works?”, the correct answer was “General Electric”, and the artificially introduced adversary was “Stark Industries”, appended at the end of the context of the original sample.
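The construction of such a sample, with the adversary appended at the end of the context, can be sketched as follows. This is an illustration only: the helper names are hypothetical, and both the context and the distractor sentence are invented placeholders, not text from SQuAD or from the adversarial datasets.

```python
def append_adversary(context, adversarial_sentence):
    """Append a distractor sentence after the last sentence of the context."""
    context = context.rstrip()
    if not context.endswith((".", "!", "?")):
        context += "."
    return context + " " + adversarial_sentence.strip()


def answer_char_span(context, answer):
    """Character offsets of the first occurrence of the answer, or None if absent."""
    start = context.find(answer)
    return (start, start + len(answer)) if start >= 0 else None


# Placeholder context and distractor, invented for illustration.
original_context = "Edison Machine Works was later taken over by General Electric."
distractor = "Stark Industries took over Tesla Machine Works."

adversarial_context = append_adversary(original_context, distractor)
span_before = answer_char_span(original_context, "General Electric")
span_after = answer_char_span(adversarial_context, "General Electric")
```

Because the distractor is placed after the original text, the character span of the correct answer is unchanged, so the original annotation remains valid and only the presence of a competing candidate differs.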

All models fell into the same trap. The XLNet, RoBERTa, and BERT SQuAD figures show that they paid attention to the wrong answer. In this case, the pattern appeared in 52% of the layer-heads of XLNet, 60% in the case of RoBERTa, and 30% in the case of BERT. However, when checking each model's level of certainty in the predicted wrong answer for this example, XLNet assigned it a probability of 43.3%, BERT 75.5%, and RoBERTa was the most mistaken, with a probability of 99.9% for the wrong answer (which is consistent with the sharpness of attention in the corresponding figure). This behavior provides evidence that the three models behave slightly differently and that increased accuracy in the main task (before adversarial evaluation) is not a direct indicator of increased robustness in every case, but only in the average case.

Figures: XLNet SQuAD; RoBERTa SQuAD; BERT SQuAD.