Contextualized representations trained over large-scale text data have brought remarkable improvements to a wide range of NLP tasks, including natural language inference [Bowman2015ALA], question answering [rajpurkar-etal-2018-know] and reading comprehension [Lai2017RACELR]. Given that these representations yield new state-of-the-art results that approach or surpass human performance on several benchmark datasets, it is an interesting question what types of knowledge are learned during pre-training, in order to better understand how the representations benefit the NLP problems above. There has been work investigating the nature of syntactic [liu-etal-2019-linguistic], semantic [liu-etal-2019-linguistic] and word sense [kim-etal-2019-probing] knowledge contained in such contextualized representations, in particular BERT [devlin-etal-2019-bert], showing that such knowledge can be effectively learned via language model (LM) pre-training over large-scale data.
Commonsense knowledge spans “a huge portion of human experience, encompassing knowledge about the spatial, physical, social, temporal, and psychological aspects of typical everyday life” [Liu2004ConceptNetA]. Intuitively, such knowledge is at least as useful as semantic and syntactic knowledge for natural language inference, reading comprehension and coreference resolution. For example, the word “it” in the sentence “the dog cannot cross the street because it is too X” can refer to three different entities when the word “X” is “timid”, “wide” and “dark”, respectively, and resolving such ambiguity can require that a system has relevant commonsense knowledge beyond the sentence level. However, relatively little work has been conducted on systematically evaluating the nature of commonsense knowledge learned in contextualized representations.
Table 1: Positive (✓) and negative (✗) samples for each task.

| Task | Sample | Label |
|---|---|---|
| CA | They broadcast an announcement, but a subway came into the station and I couldn’t hear it. | ✓ |
| | They broadcast an announcement, before a subway came into the station and I couldn’t hear it. | ✗ |
| WSC | The trophy doesn’t fit into the brown suitcase because the trophy is too large. | ✓ |
| | The trophy doesn’t fit into the brown suitcase because the suitcase is too large. | ✗ |
| SM | money can be used for buying cars | ✓ |
| | money can be used for buying stars | ✗ |
| SMR | “he put an elephant into the fridge” (because) an elephant is much bigger than a fridge. | ✓ |
| | “he put an elephant into the fridge” (because) elephants are usually gray… | ✗ |
| | “he put an elephant into the fridge” (because) an elephant cannot eat a fridge. | ✗ |
| SWAG | Someone unlocks the door and they go in. Someone leads the way in. | ✓ |
| | Someone unlocks the door and they go in. Someone opens the door and walks out. | ✗ |
| | Someone unlocks the door and they go in. Someone walks out of the driveway. | ✗ |
| | Someone unlocks the door and they go in. Someone walks next to someone and sits on a pew. | ✗ |
| HellaSwag | A carved pumpkin with a light in it glows on a counter. Supplies for carving are then shown. (context) | |
| | A woman cuts the top off the pumpkin, emptying the seeds. | ✓ |
| | she cuts down all the pieces and dumps them in a trash bin in the end. | ✗ |
| | she then carves the traced lines to cut out the design. | ✗ |
| | she tapes the top shut as the continue carving the pumpkin. | ✗ |
| ARCT | People can choose not to use Google. Other search engines don’t redirect to Google. Google is not a harmful monopoly. | ✓ |
| | People can choose not to use Google. All other search engines redirect to Google. Google is not a harmful monopoly. | ✗ |
We fill this gap by evaluating five state-of-the-art contextualized embedding models on seven commonsense benchmarks. The models include off-the-shelf embeddings (https://github.com/huggingface/transformers) from GPT [rad-2018], GPT2 [radford2019language], BERT [devlin-etal-2019-bert], XLNet [zhilin-19] and RoBERTa [liu2019roberta], and the benchmarks include Conjunction Acceptability, Sense Making [wang-etal-2019-make], the Winograd Schema Challenge [Levesque:2012:WSC:3031843.3031909], SWAG [zellers-etal-2018-swag], HellaSwag [zellers-etal-2019-hellaswag], Sense Making with Reasoning [wang-etal-2019-make], and Argument Reasoning Comprehension [habernal-etal-2018-argument]. We evaluate the commonsense knowledge contained in the above models by unifying the form of all the datasets and comparing LM perplexities on positive and negative samples (i.e., sentences that make sense and those that do not, respectively). The commonsense knowledge covered by our data spans a wide range of subjects, from physical world knowledge to social conventions, and from scientific domains to daily life scenes. We further categorize the tasks by difficulty level, namely the number of inference steps necessary for making sense of each instance.
We reframe the datasets in order to conduct both word- and sentence-level testing. For word-level testing, negative samples are drawn by replacing words in positive samples. We consider nouns, verbs, adjectives, adverbs, pronouns and conjunctions, which reflect different aspects of commonsense. For example, while verbs such as “buy, throw, sell …” are relatively more associated with event knowledge, conjunctions such as “because, but, so …” are more associated with logical reasoning. For sentence-level testing, negative samples are drawn by replacing a full subsentence (such as a clause) with irrelevant or conflicting content. Sentence-level tests are more concerned with commonsense inference.
From the results we have four salient observations. First, the pre-trained models give consistently better performance than random baselines, which demonstrates that language model pre-training is useful for learning commonsense knowledge. Second, models based on bi-directional contexts, such as BERT, XLNet and RoBERTa, are stronger at learning commonsense knowledge than those based on uni-directional contexts, such as GPT and GPT2. Third, more commonsense knowledge can be learned from larger training sets, which conforms well to intuition. Fourth, the models have a certain degree of commonsense reasoning ability. However, as the number of necessary inference steps increases, model performance drops, which shows that commonsense remains a big challenge that is not completely solved by pre-trained contextualized language models (LMs).
Finally, we further test the robustness of the five models by making dual test samples. Here a dual test sample is built by adding, deleting or replacing words in a test sample, or swapping two words in the sample, thereby resulting in a closely related test case. In theory, a model equipped with relevant commonsense should give consistent predictions on a pair of dual test cases. However, we find that none of the models are able to reach such consistency. Instead, the models are confused by the modification, tending to give the same predictions over a pair of dual samples even though they may have different gold labels. This further reveals that the commonsense contained in the pre-trained models may remain at a surface level, without deep semantic comprehension. We publicly release our datasets, named commonsense ability tests (CATs), and the test script at https://github.com/XuhuiZhou/CATS.
Tasks for Evaluating Commonsense
Commonsense ability can be broadly divided into two categories. First, a model with commonsense ability should have basic knowledge about the world, for example, that water always goes down. Second, it should be able to reason over commonsense knowledge, such as water always goes down because there is gravity on the earth and if you are injured, you should go to the hospital. To comprehensively test different models’ commonsense ability, we synthesize six challenging tasks by taking positive and negative samples from existing benchmarks, and further introduce a new task called Conjunction Acceptability (CA).
We reframe all the tasks into sentence-scoring tasks by substitution or concatenation. For example, we create positive and negative samples by replacing the pronoun in the sentence of a WSC question with each of the candidates, obtaining a test instance as shown in Table 2. A model is asked to score the sentences, and we pick the sentence with the highest score as its prediction for the test instance. Below we introduce the data sources and reframed tasks in detail (the correct answer is bolded).
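As a minimal sketch of this substitution step, a WSC-style question can be turned into scorable sentences as below. The function name and argument layout are our own illustration, not the released test script:

```python
import re

def reframe_wsc(sentence, pronoun, candidates):
    """Build one scoring sentence per candidate by substituting the
    target pronoun (first whole-word occurrence only)."""
    pattern = rf"\b{re.escape(pronoun)}\b"
    return [re.sub(pattern, c, sentence, count=1) for c in candidates]

sents = reframe_wsc(
    "The trophy doesn't fit into the brown suitcase because it is too large.",
    "it",
    ["the trophy", "the suitcase"],
)
print(sents[0])
# → The trophy doesn't fit into the brown suitcase because the trophy is too large.
```

A model then scores each candidate sentence, and the highest-scoring one is taken as its prediction for the instance.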
Sense Making (SM)
Introduced by wang-etal-2019-make (2019), this task tests whether a model can differentiate sense-making statements from non-sense-making ones. Given a pair of statements (i.e., a test instance), it requires the model to choose the more sensible statement. One example is: I work 8 hours a day / I work 25 hours a day. This task conforms to our evaluation schema without any change. More examples are shown in the SM section of Table 1. The statements typically differ in only one key word, which can be a noun, verb, adjective or adverb.
Winograd Schema Challenge (WSC)
The Winograd Schema Challenge (WSC) dataset [Levesque:2012:WSC:3031843.3031909] consists of 273 instances of the pronoun resolution problem. Each instance contains a sentence with a pronoun referring to one of two candidate nouns; the original question is to pick the correct noun. For our task, we transform the test as shown in Table 2. More examples are shown in the WSC section of Table 1. WSC is recognized as one of the most difficult commonsense datasets.
Conjunction Acceptability (CA)
As stated by lobue-yates-2011-types (2011), logic-based commonsense knowledge is an important part of world knowledge in addition to content-based knowledge. We aim to probe a model’s ability to understand logical relations in language by extracting 189 positive samples from the WSC dataset and manually replacing the conjunction with another conjunction to obtain a negative sample. We pair the positive and negative samples to obtain a test instance. For example, The lawyer asked the witness a question, and the witness was reluctant to answer it / The lawyer asked the witness a question, but the witness was reluctant to answer it. More examples are shown in the CA section of Table 1. This task uses “because”, “before”, “when”, “but” and “and” to correspond to the Cause and Effect, Preconditions, Simultaneous Conditions, Contradiction, and Addition logical relations, respectively. It is complementary to the other token-level tasks, which focus more on content-based knowledge.
SWAG
SWAG [zellers-etal-2018-swag] is a dataset of multiple-choice questions about grounded situations. It probes models’ understanding of the relationship between two physical scenes. With the help of adversarial filtering (AF), zellers-etal-2018-swag (2018) created a sufficiently large number of questions automatically. For example, given On stage, a woman takes a seat at the piano. She, the question is to choose among the following candidates: A. sits on a bench as her sister plays with the doll B. smiles with someone as the music plays C. is in the crowd, watching the dancers D. nervously sets her fingers on the keys. We obtain a positive or negative sample by concatenating the context and a candidate together (e.g., On stage, a woman takes a seat at the piano. She nervously sets her fingers on the keys). There are one positive sample and three negative samples in a SWAG test instance. More examples are shown in the SWAG section of Table 1. By forcing the model to predict the next action, the task requires inductive and temporal reasoning.
HellaSwag
HellaSwag [zellers-etal-2019-hellaswag] is an augmented version of SWAG with the same data format, more inference steps and higher data quality. While HellaSwag also includes data from WikiHow, we choose only the instances coming from ActivityNet to make the results comparable to the original SWAG dataset.
Sense Making with Reasoning (SMR)
Sense Making with Reasoning focuses on identifying the reason why a statement is against commonsense [wang-etal-2019-make]. A model needs to understand that a specific statement (e.g., a can is usually made of gold) is against commonsense, and to choose the reason behind this from three candidates (e.g., gold is too bright to make cans, gold is too soft to make cans and gold is too expensive to make cans). We make a positive or negative sample by concatenating the statement and a candidate reason together. For each test instance in SMR, there is one positive sample and two negative samples. More examples are shown in the SMR section of Table 1. This task is intuitively difficult since it requires a model to have deeper knowledge together with higher-level inference, which belongs to abductive reasoning.
Argument Reasoning Comprehension Task (ARCT)
Similar to SMR, habernal-etal-2018-argument (2018) propose the ARCT dataset to test a model’s abductive reasoning ability. Its domain lies in social topics such as search engines and LGBT rights, which is different from daily-routine scenarios. For example, given a reason: I find the idea that it is a sin to be born or live a life at all to be preposterous and a claim: Christians have created a harmful atmosphere for gays, the task is to pick the correct warrant from two candidates: A. being gay isn’t considered a sin B. being gay is considered a sin. We make a positive or negative sample by concatenating the reason, candidate warrant and claim together (e.g., I find the idea that it is a sin to be born or live a life at all to be preposterous and since being gay is considered a sin, Christians have created a harmful atmosphere for gays). A test instance in ARCT contains a pair of positive and negative samples. More examples are shown in the ARCT section of Table 1. We further break this task into two variants: ARCT1 represents the original dataset, and ARCT2 represents an augmented dataset obtained by adding negation to the original instances to alleviate the statistical cues in the dataset [niven-kao-2019-probing].
We integrate the above test sets into a commonsense ability tests (CATs) benchmark, which we release for future research.
We take five contextualized representation models that give state-of-the-art performance on NLP benchmarks such as GLUE [wang-etal-2018-glue] and SQuAD [rajpurkar-etal-2018-know]. Off-the-shelf models are used. Below we give the detailed settings.
GPT [rad-2018] is a uni-directional transformer LM trained on 800M tokens of BookCorpus [Zhu2015AligningBA]. Given a text sequence $x = (x_1, \dots, x_T)$, GPT works in the same way as a conventional auto-regressive (AR) LM:

$$\max_{\theta} \; \log p_{\theta}(x) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})$$

where $x_{<t} = (x_1, \dots, x_{t-1})$. The model has a hidden state dimension of $H=768$, $A=12$ attention heads, $L=12$ layers and a total parameter size of 117M.
GPT2 [radford2019language] works similarly to GPT with a few modifications to the hyperparameters. In particular, GPT2 optimizes the layer normalization, expands the vocabulary size to 50,257, increases the context size from 512 to 1024 tokens, and optimizes with a larger batch size of 512. In addition, GPT2 is pre-trained on WebText, which was created by scraping web pages. The dataset contains roughly 8 million documents (40 GB). We study GPT2-base and GPT2-medium, with model sizes ($H=768$, $L=12$, $A=12$; 117M parameters) and ($H=1024$, $L=24$, $A=16$; 345M parameters), respectively, where the definitions of H, L and A are the same as for GPT.
BERT [devlin-etal-2019-bert] jointly trains on a masked language modeling task and a next sentence prediction (NSP) task. The model is trained on BookCorpus and English Wikipedia, a total of approximately 3300M tokens. BERT is designed with the following objective:

$$\max_{\theta} \; \log p_{\theta}(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{x})$$

where $\hat{x}$ is a corrupted version of the text sequence $x$, and $\bar{x}$ is the set of masked tokens. $m_t = 1$ if token $x_t$ belongs to $\bar{x}$, and $m_t = 0$ otherwise.
Here we consider BERT-base and BERT-large, with ($H=768$, $L=12$, $A=12$; 110M parameters) and ($H=1024$, $L=24$, $A=16$; 340M parameters), respectively, where the definitions of H, L and A are the same as for GPT.
XLNet [zhilin-19] is trained with a permutation-based language modeling objective to capture bidirectional contexts while retaining the benefits of AR models. Specifically, let $\mathcal{Z}_T$ be the set of all possible permutations of the length-$T$ index sequence $[1, 2, \dots, T]$:

$$\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}}) \right]$$

where $z_t$ and $z_{<t}$ are the $t$-th element and the first $t-1$ elements of a permutation $z \in \mathcal{Z}_T$, respectively. In this way, XLNet ensures that any specific token in $x$ can condition on all the tokens before or after it.
We consider XLNet-base and XLNet-large, whose model sizes are ($H=768$, $L=12$, $A=12$) and ($H=1024$, $L=24$, $A=16$), respectively, where the definitions of H, L and A are the same as for GPT. Note that XLNet-base is trained with the same data as BERT, while XLNet-large is trained with a larger dataset that consists of 32.98B subword pieces coming from Wikipedia, BookCorpus, Giga5, ClueWeb, and Common Crawl.
RoBERTa [liu2019roberta] has the same architecture as BERT but is trained with dynamic masking, FULL-SENTENCES without the NSP loss, a larger batch size and a larger vocabulary. Given these optimized design choices, one key difference between RoBERTa and the other models is its large training dataset, which consists of BookCorpus, CC-NEWS, OpenWebText, and STORIES. With a total of 160GB of text, RoBERTa has access to more potential knowledge than the other models.
The CAT datasets are applicable to any model that has a method to score a sentence. They fit with the pre-trained models above, which are by nature language models. We derive the score of a sentence below with uni-directional-context LMs and bi-directional-context LMs, respectively.
Formally, suppose the sentence $S$ consists of $n$ words, $S = w_1, w_2, \dots, w_n$. We define the score of a sentence as:

$$\textit{Score}(S) = \frac{\sum_{i=1}^{n} \log p(w_i \mid C_i)}{n}$$

where the denominator alleviates the influence of sentence length on the models’ predictions, especially in sentence-level tasks. For a uni-directional model, $C_i = w_1, \dots, w_{i-1}$. The numerator becomes $\log p(w_1, \dots, w_n)$, which is factorized as $\sum_{i=1}^{n} \log p(w_i \mid w_1, \dots, w_{i-1})$. This is essentially a LM. For a bi-directional model, $C_i = S_{-i}$, which represents the sentence $S$ with the $i$-th word removed. In particular, the $i$-th word can be removed by replacing $w_i$ with a special token ‘[MASK]’ in BERT. The numerator can also be factorized from $\log p(w_1, \dots, w_n)$ under the assumption that $w_i$ is independent of the successive words (i.e., $p(w_i \mid w_1, \dots, w_{i-1}) \approx p(w_i \mid S_{-i})$), which gives the bi-directional-context LM.
$p(w_i \mid C_i)$ can be interpreted as how probable a word $w_i$ is given the context $C_i$, i.e., $w_1, \dots, w_{i-1}$ or $S_{-i}$. For example, let $\hat{S}$ = He put an [MASK] into the fridge, $w$ = “elephant” and $w^{\prime}$ be an ordinary alternative such as “apple”. $p(w^{\prime} \mid \hat{S})$ should have a relatively larger value, since filling in “elephant” results in an improper sentence, which is against commonsense.
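The scoring and prediction procedure can be sketched in a few lines of Python. Here `toy_logprob` is a hand-made stand-in for a real masked-LM probability log p(w_i | S_-i), not any of the models evaluated in this paper; its table entries are purely illustrative:

```python
import math

def sentence_score(words, token_logprob):
    """Length-normalized score: Score(S) = (1/n) * sum_i log p(w_i | C_i)."""
    return sum(token_logprob(words, i) for i in range(len(words))) / len(words)

def predict(instance, token_logprob):
    """Return the index of the highest-scoring sentence in a test instance."""
    return max(range(len(instance)),
               key=lambda i: sentence_score(instance[i], token_logprob))

# Toy stand-in for a bi-directional LM: a hand-made probability table,
# with a default probability for words not in the table.
toy = {"turkey": math.log(0.4), "elephant": math.log(0.001)}
def toy_logprob(words, i):
    return toy.get(words[i], math.log(0.1))

instance = [
    "he put a turkey into the fridge".split(),      # sensible
    "he put an elephant into the fridge".split(),   # against commonsense
]
print(predict(instance, toy_logprob))  # → 0
```

The length normalization matters in sentence-level tasks, where candidate continuations can differ substantially in length.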
As introduced earlier (Table 1), all CATs tasks consist of instances with positive and negative sentences. After scoring each sample in a test instance, a model predicts the positive sample simply by taking the highest-scoring sentence in the instance.
Commonsense Tests Results
Table 3 shows the model performances with random choice as the baseline. Taking WSC as an example, the random baseline is 0.5, human performance is 0.920, and all the models range between 0.512 and 0.694, with RoBERTa-large giving the best result of 0.694. Except for the ARCT task, all tested models demonstrate stronger performance than RANDOM, which indicates that the models all have varying degrees of commonsense. However, for most of the tasks, all of the models are well below human performance.
Uni-directional vs. Bi-directional LMs
We compare uni-directional models (GPT, GPT2-base, GPT2-medium) and bi-directional models (BERT-base, BERT-large, XLNet-base, XLNet-large, RoBERTa-base, and RoBERTa-large). Picking the strongest model from each group, RoBERTa-large outperforms GPT2-medium by a large margin on every task. As mentioned before, RoBERTa-large has the same parameter size as GPT2-medium. However, RoBERTa-large is trained with much more data than GPT2-medium.
From Figure 1, we can see that except for the SM task, both BERT-large and XLNet-large outperform GPT2-medium, even though BERT-large is trained with a smaller dataset than GPT2-medium. This indicates that bi-directional context can be more useful for learning commonsense. Intuitively, models with bi-directional context can make more sentence-level inferences. While each word receives context only from its preceding words in a uni-directional model, every word has the full context in bi-directional models. Table 4 shows examples where RoBERTa-large makes the correct prediction but GPT2-medium does not. We can see that the key tokens, which are considered to be the most influential part in making the correct prediction, lie in the middle of the sentence. This can be the main reason why bi-directional context is important for models’ commonsense ability.
Scale of Training Data
A larger training dataset intuitively allows a model to have access to more commonsense knowledge, and thus to perform better in our tests. Trained with by far the most data, RoBERTa is the winner for every task. Most of the models are in fact trained on a subset of the data used to train RoBERTa. However, a larger dataset does not always help when the model capacity is limited with regard to commonsense. For example, GPT2-base underperforms GPT on many tasks in our dataset, which suggests that GPT2-base underfits the WebText dataset with regard to commonsense. The fact that RoBERTa-base has the same parameter size as GPT2-base, yet benefits from the larger dataset, suggests that bi-directional models have greater representational power for commonsense ability.
Number of Inference Steps
Similar to humans, model performance can intuitively drop when commonsense inference becomes more complicated. To verify this intuition, we pick 100 sentences randomly from each test dataset and manually annotate the number of required inference steps (IS) of each instance. The inference step count of each test dataset is defined as the average number of turns of reasoning necessary for its instances. We measure this by counting the logical operations that exist in an instance. For example, for the sentence
They add a lot to the piece and I look forward to reading comments, but since comments sections always distract me from my work, Comment sections have failed., the logic chain is (They add a lot to the piece ∧ I look forward to reading comments) → comments sections always distract me from my work → Comment sections have failed. Thus, this instance needs three inference steps.
In this way, we obtain the inference step (IS) measure for the seven test datasets. Each instance is labeled by two expert annotators, and the inter-annotator agreement is 93%. The final IS is the average from both annotators. Figure 2 shows the results on the test cases with different IS (performances on tasks with more than one negative sample are transformed to binary-choice scales). Performance decreases as IS increases. SWAG and HellaSwag fall outside this trend, which may suggest that the models have stronger commonsense ability in temporal reasoning.
Generally speaking, all of our tested models outperform the random baselines except on the ARCT task, which suggests that despite using different modeling schemas, language modeling stands as an effective objective for extracting commonsense knowledge from large, raw texts. For each task, the overall performance increases with a larger model parameter size, a more sophisticated model design, and larger training data.
The robustness of models in commonsense reasoning is an important perspective in evaluating deep commonsense ability. Intuitively, a person can reason about whether a statement makes sense because they have consistent knowledge. If the statement changes slightly, for example by changing a key word, that person should still make the correct judgement.
We aim to test the robustness of the five models by making dual test samples. A dual instance should test the same commonsense knowledge point as the original instance, or one largely relevant to it. In this way, we expect the model to demonstrate consistency in its decisions. One example is shown in Table 5, where choosing A in the original instance should lead to choosing B in the dual case (see Figure 3 for more examples).
We consider multiple ways to construct a dual test instance. In particular, a dual test instance is built by one of four methods: adding, deleting or replacing words in a test sample, or swapping two words in the sample, thereby resulting in a closely related test instance. All of our dual test instances are constructed from the original commonsense test data.
We construct 75 dual instances for each method above over WSC, SM, and ARCT, keeping the number of instances from each dataset approximately equal in order to evaluate the influence of the different duality methods on the models. We then pair each dual instance with its original instance to form a new test case. If the model gives the correct prediction for both instances in the case, or the wrong prediction for both, we recognize it as a consistent case.
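Concretely, this consistency measure can be sketched as below. The data layout (lists of (original, dual) answer pairs) is an assumption for illustration, not the released evaluation code:

```python
def consistency(predictions, golds):
    """Fraction of (original, dual) pairs where the model is correct on
    both instances or wrong on both instances."""
    consistent = sum(
        (p_o == g_o) == (p_d == g_d)
        for (p_o, p_d), (g_o, g_d) in zip(predictions, golds)
    )
    return consistent / len(predictions)

# Toy example: correct on both (consistent), correct on only the
# original (inconsistent), correct on only the dual (inconsistent).
pred = [("A", "B"), ("A", "A"), ("B", "B")]
gold = [("A", "B"), ("A", "B"), ("A", "B")]
print(round(consistency(pred, gold), 3))  # → 0.333
```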
The results are shown in Table 6. In theory, a model equipped with relevant commonsense should give consistent predictions on a pair of dual test cases. However, we find that none of the models reach such consistency. In fact, their consistency is well below the random baselines except for the Swap method.
To better investigate the reason behind the poor consistency, we look at inconsistent cases from the pre-trained model (i.e., RoBERTa-large). Similar to Trinh-2018-a (2018), we investigate how the model makes a decision between two candidate sentences $S^{(1)}$ and $S^{(2)}$ that have the same number of words. In particular, we look at:

$$c_i = \log p(w_i^{(1)} \mid S^{(1)}_{-i}) - \log p(w_i^{(2)} \mid S^{(2)}_{-i})$$

where $i \in \{1, \dots, n\}$. It follows that the choice between $S^{(1)}$ and $S^{(2)}$ is made by the value $\sum_{i=1}^{n} c_i$ being bigger than 0 or not. Visualizing the value of each $c_i$ provides more insights into the decisions of the model.
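This per-word comparison between two equal-length candidates can be sketched as follows; `toy_logprob` is again an illustrative hand-made stand-in for the masked-LM probability, not RoBERTa itself:

```python
import math

def token_diffs(s1, s2, token_logprob):
    """Per-position log-probability differences between two equal-length
    candidate sentences; the model prefers s1 over s2 iff their sum > 0."""
    assert len(s1) == len(s2)
    return [token_logprob(s1, i) - token_logprob(s2, i) for i in range(len(s1))]

# Toy probability table; unknown words get a default probability.
table = {"wide": math.log(0.3), "timid": math.log(0.2)}
def toy_logprob(words, i):
    return table.get(words[i], math.log(0.1))

s1 = "the street is too wide".split()
s2 = "the street is too timid".split()
diffs = token_diffs(s1, s2, toy_logprob)
print(sum(diffs) > 0)  # → True
```

Positions where the two candidates agree contribute zero, so the decision is driven by the few positions where they differ, which is what the visualization inspects.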
From Figure 3, we can tell that the model is confused by the modification, tending to give the same predictions over a pair of dual samples despite the fact that they have different gold labels, especially for Sub, Add and Del. This further reveals that the commonsense knowledge contained in the pre-trained models may remain at a surface level, without deep semantic comprehension.
liu-etal-2019-linguistic (2019) evaluate BERT [devlin-etal-2019-bert], GPT [rad-2018], and ELMo [peters-etal-2018-deep] on a variety of linguistic tasks. Their results suggest that the features generated by pre-trained contextualizers are sufficient for high performance on a broad set of tasks, but that the models fail on tasks requiring fine-grained linguistic knowledge. Tenney2019WhatDY (2019) evaluate similar models on a variety of sub-sentence linguistic analysis tasks. Their results suggest that contextualized word representations encode both syntax and semantics. Our work is in line with theirs in the sense that contextualized representations encode rich knowledge to be ‘probed’. However, we focus on evaluating the commonsense in those representations. To the best of our knowledge, this is the first work to systematically evaluate commonsense in pre-trained models.
Our evaluation method is similar to that of Trinh-2018-a (2018), who make use of LMs to score sentences. However, they focus on Winograd schema questions with only self-trained recurrent LMs, while we test five models’ commonsense with seven diverse tasks.
We studied the commonsense knowledge and reasoning ability of pre-trained contextualizers with a suite of seven diverse probing tasks, showing that large-scale pre-trained contextualized representations have a certain degree of commonsense knowledge, but that there is still a large gap between the current state-of-the-art representation models and robust human-level commonsense reasoning, which may require further breakthroughs in modeling. We release our test sets, named CATs, publicly.
We would like to thank the anonymous reviewers for their insightful comments, and Mr. Cunxiang Wang for his help on the collection of the data. Yue Zhang is the corresponding author.