Exploring Unsupervised Pretraining and Sentence Structure Modelling for Winograd Schema Challenge

04/22/2019 ∙ by Yu-Ping Ruan, et al. ∙ Queen's University USTC Anhui USTC iFLYTEK Co 0

Winograd Schema Challenge (WSC) was proposed as an AI-hard problem in testing computers' intelligence on common sense representation and reasoning. This paper presents the new state-of-theart on WSC, achieving an accuracy of 71.1 We demonstrate that the leading performance benefits from jointly modelling sentence structures, utilizing knowledge learned from cutting-edge pretraining models, and performing fine-tuning. We conduct detailed analyses, showing that fine-tuning is critical for achieving the performance, but it helps more on the simpler associative problems. Modelling sentence dependency structures, however, consistently helps on the harder non-associative subset of WSC. Analysis also shows that larger fine-tuning datasets yield better performances, suggesting the potential benefit of future work on annotating more Winograd schema sentences.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The impressive advance achieved in the last several years on distributed representation and neural networks has resulted in significant improvement in many research areas

(Krizhevsky et al., 2012; Mnih et al., 2013; Mikolov et al., 2013; Bahdanau et al., 2014; Vaswani et al., 2017; Lample et al., 2017; Devlin et al., 2018), which in turn triggers further curiosity to understand how much such models can further solve hard AI problems.

Winograd Schema Challenge (WSC) was proposed as an AI-complete problem in testing computers’ intelligence on common sense reasoning (Levesque et al., 2012; Morgenstern and Ortiz, 2015; Marcus et al., 2016). WSC elegantly embeds common sense reasoning into a simple form: a binary classification on coreference resolution. A typical WSC example was proposed in Winograd (1972): for the sentence “The city councilmen refused the demonstrators a permit because they feared violence.” and a corresponding question “Who does the word they refers to?”, a computer is expected to find the answer, i.e., the city councilmen but not the demonstrators.

While answering WSC questions is usually very simple for common human beings, it presents a great challenge for machines. Recently, there have been many different efforts attempting to solve the problem (refer to Section 2 for a brief survey).

In this paper, we present the new state-of-the-art model for WSC, achieving an accuracy of 71.1%. We demonstrate that the leading result benefits from jointly introducing sentence structures into modelling, utilizing external knowledge learned from cutting-edge pretraining, together with fine-tuning using the Raham-Ng dataset (Rahman and Ng, 2012).

In addition, we conduct detailed analyses. We observed that that fine-tuning is critical for achieving the performance, but it helps more on solving the simpler associative problems (Trichelair et al., 2018). Modelling sentence dependency structures, however, consistently helps on the harder non-associative subset of WSC. Analysis also shows that larger fine-tuning datasets yield better performances, suggesting the potential benefit of future work on annotating more Winograd schema sentences, which may help further complement and leverage the pretrained models.

2 Related Work

Some previous efforts on resolving the Winograd Schema problem relied heavily on the annotated knowledge, hand-crafted features, and (or) rule-based reasoning (Peng et al., 2015; Bailey et al., 2015; Schüller, 2014; Liu et al., 2017b, a). In particular, Rahman and Ng (2012) employed human annotators to build more supervised training data, in which the models utilized nearly 70 thousand hand-crafted features, including querying data from Google Search API.

More recently Sharma et al. (2015) utilized a semantic parser on the WSC sentences, queried texts through Google Search, and reasoned on the graph produced by the parser. Emami et al. (2018) showed better performance along the same direction. Schüller (2014)

formalized a knowledge-graph data structure and a reasoning process based on cognitive linguistics theories.

Bailey et al. (2015) introduced a framework for reasoning using expensive annotated knowledge bases as axioms, while Liu et al. (2017b) incorporated several knowledge bases into the training process of skip-gram word embeddings, and the resulting knowledge-enhanced embeddings were used to better score the candidates.

Unlike the above work, we are first curious about the effectiveness of cutting-edge pretrained models. We show they do help achieve the state-of-the-art performance, but we observe that they help more on the simpler associative problems. We propose to incorporate sentence structures into the pretraining and fine-tuning framework, which consistently helps on the harder non-associative subset of WSC.

For research on unsupervised pretraining, there exists considerable work in the literature but the typical models proposed in the most recent years include GPT, ELMo, and BERT  (Radford et al., 2018; Peters et al., 2018; Devlin et al., 2018), which achieved impressive performance on a wide variety of tasks. Among the models, we choose BERT (Devlin et al., 2018), which is among the-state-of-art.

3 Understanding the Roles of Pretraining and Fine-Tuning for WSC

While the unsupervised pretrained models on large text corpus have recently achieved impressive performance on various NLP tasks, it remains a fundamental question on how such pretrained model learn common sense to help solve hard common sense reasoning problems. For simple inference problems, BERT has achieved performance comparable to that of human being’s, e.g, on the recently published SWAG dataset (Zellers et al., 2018). This invites a further investigation on the harder WSC problems.

While BERT has been tested on the GLUE dataset (Wang et al., 2018), there have no conclusive results on WSC due to data splitting issues (Devlin et al., 2018). This paper is the first to perform a detailed study on the state-of-the-art pretraining-finetuning framework for WSC.

WSC as Next Sentence Prediction

It is reasonable to utilize BERT to learn and encode common sense knowledge existing in large text corpora in an unsupervised manner. While BERT has two targets, i.e., optimizing masked language models (LMs) and next-sentence prediction (refer to (Devlin et al., 2018) for details), to solve the WSC problems, we formulate the WSC sentences in the next-sentence prediction framework; a specific example is presented as follows:

  • Original WSC sentence: The trophy doesn’t fit into the brown suitcase because it is too large.

  • Candidate sentence 1: [CLS] The trophy doesn’t fit into the brown suitcase because [SEP] the trophy is too large . [SEP]

  • Candidate sentence 2: [CLS] The trophy doesn’t fit into the brown suitcase because [SEP] the brown suitcase is too large. [SEP]

In the formulation, we replace the pronoun “it” in the original WSC sentence with the two candidate entities ”the trophy” and “the brown suitcase”, respectively, to derive two candidate sentences, where a special classification token [CLS] and delimiter token [SEP] are inserted in following Devlin et al. (2018).

Then we score these two candidate sentences using pretrained next-sentence-prediction. The substitution that results in a more probable candidate sentence will be the correct answer.

Fine-tuning BERT for WSC

As we will discuss later, we find fine-tuning is critical for WSC. The original WSC dataset has only 273 sentences. We did not observed any performance gain using the 273 sentences to fine-tune BERT with 10-fold cross validation. It is therefore interesting to understand if more Winograd schema sentences will benefit the framework. Fortunately, Rahman and Ng (2012) gathered a dataset consisting of sentences of pronoun resolution problems111available at http://www.hlt.utdallas.edu/~vince/data/emnlp12/. To ensure that there is no overlapping sentences between this dataset and the WSC dataset, we manually investigated and removed from the sentences sentences that are overlapping with the WSC set. The remaining sentences, referred to as the Raham-Ng dataset in the remainder of this paper, will be used to fine-tune the pretrained BERT models to develop our state-of-the-art models and analyze how pretraining and fine-tuning help WSC.

Sentence Type Examples Proportion
Fred and Alice had very warm down coats, but they were not prepared
for the cold in Alaska.
I’m sure that my map will show this building; it is very good.
13.5% (37)
The large ball crashed right through the table because it was made
of steel.
The firemen arrived after the police because they were coming from
so far away.
86.5% (236)
Table 1: Examples and proportions of associative and non-associative instances in the WSC.

4 Modelling Sentence Structures

In addition to investigating common sense knowledge learned with pretraining and encoded in the Transformer-based data structures, we believe carefully modelling WSC sentences themselves are important, as human rely heavily on sentence syntax and semantics to solve the WSC problems. Unlike previous models (Liu et al., 2017b; Sharma et al., 2015), in this paper we explore modelling sentence structures together with the pretraining-finetuning framework, to leverage and combine their complementary strengths.

BERT is mainly built on deep-stacked bidirectional Transformers (Vaswani et al., 2017). A typical Transformer works purely on attention mechanism and does not have an explicit notion of word order beyond marking each word with its absolute position embedding. It has been found that the RNNs are superior to the full attention network, i.e, Transformer, on modelling some hierarchical structures in sentences (Tran et al., 2018). In this paper we will explicitly incorporate dependency structure into BERT. For many tasks that are sensitive to sentence structures where a relatively small training data are provided, an explicit utilization of sentence structures often benefits. We wonder if WSC can benefit from both sentence structure and pre-learned knowledge jointly, and if so how?

Figure 1: Dependency parsing of a sentence “A cat sits on the desk.”, and the corresponding dependency mask matrix.

Adding Dependency into Transformers

Transformers in BERT consist of multiple layers (Vaswani et al., 2017), among which the multi-head self-attention layer serves as the most important component. The self-attention layer in Transformer is presented as following, while the detailed discussion can be found in (Vaswani et al., 2017).

For input hidden states corresponding to tokens in the sequence, which can be the output of last Transformer layer or the embedding layer, one self-attention head, i.e., , derive its output as follows:


where , , represents the query, key, value matrices respectively, and , , are three projection matrices.

Then a dot-product attention and softmax-weighted sum is applied to , , to derive the output :


where is the scaling factor. Finally, outputs from multiple heads will be concatenated to form the output of the self-attention layer.

For the original Transformers in BERT, as illustrated above, each word will attend to all words in the sequence when in the process of self-attention. To add explicit structural constraints, we propose to combine the dependency structures with self-attention in Transformer.

We propose a Dependency Mask technology to constrain a word to specially attend to its head word, child words, and itself. As shown in Figure 1, for sentence “A cat sits on the desk.”, we first derive its dependency tree and then convert it to a simpler dependency mask matrix, denoted as here, each row in represents the relation between the -th word and all words in the sentence, the positions indicating its head word, child words, and itself are set to . The dependency mask are added to the softmax-weighted sum as follows:


where and are the same with that in Eq. 4 and Eq. 3, the is used to mask the dot-attention weights .

We explore two approaches to adding the dependency structures into BERT: the inside and outside approach.

  • Inside: In the inside approach, the dependency mask is added to Transformer layers inside BERT—we mask the attention weights in the last, middle, or first Transformer layers.

  • Outside: While BERT is powerful, it may be very sensitive to the modification that is made only in the fine-tuning phase (but not in pretraining), particularly directly in the internal layers. Motivated by that, we also propose the outside incorporation. Specifically we define layers of new Transformers, denoted as TransformerRNN- here, upon the last Transformer layer of the original BERT model. These new layers share the same set of parameters. The TransformerRNN- works in a recurrent manner, motivated by the universal Transformers (Dehghani et al., 2018). With this, the dependency mask is added to all layers in TransformerRNN-.

5 Evaluation Protocols

Due to the difficulty to acquire high-quality samples for common-sense reasoning under the Winograd Schema settings, the WSC dataset comprises only 273 test instances.222Recently, 12 new sentences have been added. Previous state-of-the-art on the full WSC set for single-model performance is around 55% (Trinh and Le, 2018; Emami et al., 2018)

. Indeed, it has been found that there is more than a 1-in-3 chance of scoring above 55% accuracy with a set of random classifiers

(Trichelair et al., 2018). So achieving above random accuracy on the WSC does not necessarily corresponding to success on common sense reasoning. In particular, Trichelair et al. (2018) proposed two new evaluation protocols to alleviate above evaluation problems by subdivide and augment the WSC data according to two properties: associativity and swithchability.

In this paper we use these effective protocols to help probe the roles of pretraining-funetuning frameworks and our proposed models in WSC.


As one of the designing principle (Levesque et al., 2012), the sentence in WSC should not be resolvable via simple statistics that associate a candidate antecedent to certain components in the sentences. For example, in the above formulation, the statement “The lions ate the zebras because they are predators” (Rahman and Ng, 2012), we have two pairs of candidate sentences “The lions ate the zebras because lions are predators” and “The lions ate the zebras because zebras are predators”. Such sentences can be resolved based on a much stronger association/collocation of lions with predators than that of zebras.  Trichelair et al. (2018) released a dataset in which the sentences in WSC are manually annotated either as associative or non-associative. Table 1 lists some examples and the size of these two subsets.

Full WSC Associative Non-Associative Unswitched Switched Consistent
(#273) (#37) (#236) (#131) (#131) (#131)
Previous state-of-art models
Single LM 54.8% 73.0% 51.7% 55.0% 54.2% 31.0%
Ensemble 14 LMs 63.7% 83.8% 60.6% 63.4% 53.4% 28.0%
Knowledge Hunter 57.1% 50.0% 58.3% 58.8% 58.8% 52.9%
BERT without fine-tuning
BERT-base 52.0% 56.8% 51.3% 51.9% 55.0% 13.7%
BERT-large 52.0% 48.6% 52.5% 52.7% 54.2% 22.7%
BERT-base + dependency 52.7% 59.5% 51.7% 51.9% 54.2% 12.2%
BERT-large + dependency 52.7% 48.6% 53.4% 54.2% 56.5% 25.2%
BERT with fine-tuning
BERT-base 64.5% 81.1% 61.9% 63.4% 64.9% 53.4%
BERT-large 68.1% 75.7% 67.0% 70.2% 71.8% 64.9%
BERT-base + dependency 67.4% 78.4% 65.7% 66.4% 61.8% 53.4%
BERT-large + dependency 71.1% 81.1% 69.5% 74.1% 72.5% 66.4%
Table 2: Performance of different models on the full WSC dataset and subsets evaluated with different protocols.


A switchable sentence in WSC (Trichelair et al., 2018) means that switching the two antecedents does not obscure the sentence nor affect the rationale to make the resolution decision. A typical example from the WSC is as follows:

  • Original sentence: Paul tried to call George on the phone, but [Paul/George] wasn’t successful.

  • Switched sentence: George tried to call Paul on the phone, but [Paul/George] wasn’t successful.

When switching the antecedents Paul and George, the correct answer changes from Paul to George as well. A system that can correctly resolves both the original and the switched sentence indicates it learns the reasoning better than a system that is confused by the switching. The switchable subset contains 131 instances, which accounts for 47% of the original WSC set (Trichelair et al., 2018). For the evaluation, we report not only accuracy on the original unswitched and switched sentences in switchable subset but also consistent accuracy. Our consistent accuracy is computed as the number of correctly answered pairs (i.e., correctly answered WSC sentences both before and after a switch) divided by the total number of switchable sentences (i.e., 131), to reflect the absolute differences of systems’ performance on the switchable subset.

6 Experiment Results and Analysis

6.1 Baselines & Training Details

We use two state-of-the-art systems as our baselines:

  • Pretrained LMs: These are specially trained and ensembled language models (LMs) (Trinh and Le, 2018), in which the language models are used to score the two sentences obtained by replacing the pronoun with the two candidate entities. The sentence that is assigned a higher probability is chosen as the answer. We will use in our experiments the best Single LM and ensembling of 14 LMs from Trinh and Le (2018).

  • Knowledge Hunter

    : The Knowledge Hunter is a rule-based system that uses search engines to gather information for the candidate resolutions and then reasons over the gathered knowledge without relying on the entities themselves

    (Emami et al., 2018).

Training Details

All sentences in the Raham-Ng dataset are used as training set to fine-tune the pretrained BERT. The dependency parsing for the sentences are conducted by using spaCy tool333https://spacy.io/usage/linguistic-features#section-dependency-parse

. Concerning the final dependency mask matrix input to the BERT, for the subword after tokenization, we just duplicate its original word’s dependency relation vector

as its own. And for the special token [CLS] and [SEP] we set all elements in their corresponding to .

For the hyperparameters during the fine-tuning process, we use most default settings. Specifically, the learning rate is

. The batch size for BERT-base and BERT-large are set to and respectively. The warmup rate for BERT-base is and that for BERT-large is . The max sequence length is set to

, max training epochs is

. Dropout rates for both BERT-base and BERT-large are set

. We use the PyTorch implementation of BERT with the pretrained model files

bert-base-uncased and bert-large-uncased provided by Google.444https://github.com/huggingface/pytorch-pretrained-BERT#Fine-tuning-with-BERT-running-the-examples

6.2 Overall Performance

Table 2 shows the overall performances of different models. The results of the baselines (Single LM, Ensemble 14 LMs, and Knowledge Hunter) are copied from Trichelair et al. (2018) and the consistent accuracies are computed from the consistency scores in Trichelair et al. (2018). Together with these baselines, we present the performances of different models with regard to whether adopting fine-tuning or not, across different evaluation protocols and subsets. Note that BERT-base and BERT-large are pretrained on the same corpus but with different model sizes: BERT-base has 110M parameters and BERT-large has 348M. 555

Note that when we submit the paper, the GPT-2 (

https://blog.openai.com/better-language-models/) has just been posted, which is pretrained on much larger text corpora and has much more parameters than BERT-large; however, the corresponding code and model have not been fully released.

We can see that our proposed model that leverages dependency structures with the fine-tuned BERT-large framework achieves the best performance on all metrics and across different evaluation protocols. Particularly it achieves the new state-of-the-art accuracy, 71.1%, on the full WSC dataset. The detailed analysis of adding dependency information will be discussed later in this section.

We also observe that the pretraining frameworks (BERT-large) with proper fine-tuning (using the Raham-Ng dataset) achieves an accuracy of 68.1%, which outperforms all previous (baseline) models already. Fine-tuning plays a critical role in achieving the performance, and it helps significantly more on the associative subset. We also see that incorporating dependency structures consistently helps on the harder non-associative subset.

Effects of Fine-Tuning

We can find that all BERT models without fine-tuning perform worse than all previous state-of-the-art models, but have a significant performance improvement after fine-tuning (i.e., about - accuracy improvement on the full WSC set), which shows that pure pretrained BERT models are not competent on WSC, while fine-tuning is critical, which combines knowledge from Winograd annotation and that from pretrained models.

Effects of Adding Dependency

Both fine-tuned BERT-base and BERT-large models benefit from leveraging dependency structures and achieve an accuracy improvement for about on the full WSC set, which shows the effectiveness of incorporating sentence structures into BERT for the WSC problem.

Effects of Pretrained Model Sizes

As BERT-large has a bigger model size than BERT-base, it can potentially accommodate to encode more knowledge during the pretraining process, with its advantages having been shown in many recent research efforts. Specifically for WSC, Table 2 shows that fine-tuned BERT-large models significantly outperform the corresponding fine-tuned BERT-base models on across all metrics and evaluation protocols.

Associative v.s. Non-Associative

As discussed before, the associative WSC sentences are likely to be resolved by utilizing statistics such as co-occurency found in large text corpora. However, the non-associative WSC sentences are much more challenging.

From the results of BERT models in Table 2 and as we have already highlighted above, the associative sentences benefit more from the fine-tuning strategy than non-associative sentences. We due this to the capability of pretraining and fine-tuning frameworks in capturing simple statistics to judge that ”lions are predators” is more plausible than ”zebras are predators”, as in the example we discussed earlier. However, such statistics seems to be less effective for solving harder non-associative problems.

With the associative knowledge being effectively captured in the pretraining-finetuning mechanism, dependency knowledge can further help solve the harder non-associative problems consistently.

Switchable & Consistency

For the results on switchable subset in WSC, from Table 2, we can observe that all models have significant performance drops on the consistent accuracy, compared to performance on the unswitched accuracy. Among these, the fine-tuned BERT-large models and Knowledge Hunter behave the most stable on the switchable subset and decrease by less than from the unswitched accuracy to consistent accuracy, which demonstrates that the BERT-large models do not only outperform all other models, but also have robustness comparable to the rule-based Knowledge Hunter.

Percentage of
trainin data
Full WSC Associative Non-Associative Unswitched Switched Consistent
(#273) (#37) (#236) (#131) (#131) (#131)
0% 52.7% 48.6% 53.4% 54.2% 56.5% 25.2%
20% 61.9% 67.6% 61.0% 64.1% 59.5% 46.6%
40% 65.2% 73.0% 64.0% 66.4% 64.1% 57.3%
60% 66.7% 70.3% 66.1% 68.7% 70.2% 63.4%
80% 68.5% 67.6% 68.6% 71.0% 67.9% 61.1%
100% 71.1% 81.1% 69.5% 74.1% 72.5% 66.4%
Table 3: Performance of fine-tuned BERT-large + dependency when gradually increasing randomly sampled fine-tuning data.
BERT-base 64.5%
 + mask first 5 layers (0-4) 63.7%
 + mask middle 5 layers (3-7) 67.0%
 + mask last 5 layers (7-11) 67.4%
 + mask all 12 layers (0-11) 63.7%
 + TransformerRNN- 61.5%
 + TransformerRNN- 65.6%
 + TransformerRNN- 61.5%
 + TransformerRNN- 58.2%
BERT-large 68.1%
 + mask first 5 layers (0-4) 67.8%
 + mask middle 5 layers (10-14) 65.6%
 + mask last 5 layers (19-23) 71.1%
 + mask all 24 layers (0-23) 66.3%
Table 4: Performance of different settings for incorporating dependency into BERT with fine-tuning.

6.3 Detailed Analysis

Adding Dependency into BERT

As discussed in Section 4, we explore two approaches to leveraging the dependency structure into pretrained models, i.e., the inside and outside modelling. Specifically, for the inside method, we show here the performances of different ways to add the dependency information: masking the first, middle, last 5 layers, or masking all Transformer layers. For the outside approach, we also show results of different settings: TransformerRNN-, TransformerRNN-, TransformerRNN-, and TransformerRNN- respectively.

We fine-tune the BERT models on the Raham-Ng dataset and these results are present in Table 4. For the inside method, we can see that both BERT-base and BERT-large with masking the last 5 layers achieve the best accuracy, and all other masking settings except masking middle 5 layers do not improve the performance for both BERT-base and BERT-large. For the outside approach, we can find that only TransformerRNN- benefits the BERT-base’s performance on the WSC, which is still inferior to the best setting of inside manner. We do not experiment the outside method on the BERT-large model due to its poor performance on BERT-base and the much larger model size of BERT-large.

The Effects of Fine-Tuning Data Sizes

Our experiments have shown that pretrained BERT models can obtain the state-of-the-art performances on WSC with fine-tuning on a relative small dataset, i.e., the Raham-Ng dataset. A natural question is that how the sizes of fine-tuning data affect the performances.

We randomly selected different percentages of data from the Raham-Ng dataset to fine-tune our best “BERT-large + dependency” model and test their performances on the WSC dataset. The results are presented in Table 3, which shows that a larger tuning dataset yields a better performance, suggesting the potential benefit of future work on annotating more Winograd schema sentences.

7 Conclusions and Discussion

We report the new state-of-the-art performance, a 71.1% accuracy, on the Winograd Schema Challenge (WSC). The proposed state-of-the-art solver for WSC benefits from jointly modelling sentence structures, utilizing knowledge learned from cutting-edge pretraining models, and performing proper fine-tuning. We conduct detailed analyses, showing that fine-tuning is critical for achieving the performance, but it helps more on the simpler associative problems. Modelling sentence dependency structures, however, consistently helps on the harder non-associative subset of WSC. Analysis also shows that larger fine-tuning datasets yield better performances, suggesting the potential benefit of future work on annotating more Winograd schema sentences. Although this work focuses on exploring distributed representation for WSC, caution should certainly be taken on if that by itself will result in a final solution.