Pre-trained language models are a key component in many natural language understanding (NLU) tasks such as semantic textual similarity [cer2017semeval], question answering [rajpurkar2016squad] and sentiment classification [socher2013recursive]. To obtain reliable language representations, neural language models are designed to define the joint probability function of sequences of words in text with unsupervised learning. Different from traditional word-specific embeddings, in which each token is assigned a global representation, recent work such as CoVe [mccann2017learned], ELMo [peters2018deep], GPT [radford2018improving] and BERT [devlin2018bert] derives contextualized word vectors from a language model trained on a large text corpus. These models have been shown to be effective for many downstream NLU tasks.
Among the context-sensitive language models, BERT has taken the NLP world by storm. It is designed to pre-train bidirectional representations by jointly conditioning on both left and right context in all layers and model the representations by predicting masked words only through the contexts. However, it does not make the most of underlying language structures.
According to Elman’s study [elman1990finding], recurrent neural networks were shown to be sensitive to regularities in word order in simple sentences. Since language fluency is determined by the ordering of words and sentences, finding the best permutation of a set of words and sentences is an essential problem in many NLP tasks, such as machine translation and NLU [hasler2017comparison]. Recently, word ordering was treated as LM-based linearization solely based on language models [schmaltz2016word]. Schmaltz showed that recurrent neural network language models [mikolov2010recurrent] with long short-term memory [hochreiter1997long] cells work effectively for word ordering even without any explicit syntactic information.
In this paper, we introduce a new type of contextual representation, StructBERT, which incorporates language structures into BERT pre-training by proposing two novel linearization strategies. Specifically, in addition to the existing masking strategy, StructBERT extends BERT by leveraging the structural information: word-level ordering and sentence-level ordering. We augment model pre-training with two new structural objectives on the inner-sentence and inter-sentence structures, respectively. In this way, the linguistic aspects [elman1990finding] are explicitly captured during the pre-training procedure. With structural pre-training, StructBERT encodes dependency between words as well as sentences in the contextualized representation, which provides the model with better generalizability and adaptability.
StructBERT significantly advances the state-of-the-art results on a variety of NLU tasks, including the GLUE benchmark [wang2018glue], the SNLI dataset [bowman2015large] and the SQuAD v1.1 question answering task [rajpurkar2016squad]. All of these experimental results clearly demonstrate StructBERT’s exceptional effectiveness and generalization capability in language understanding.
We make the following major contributions:
We propose novel structural pre-training that extends BERT by incorporating the word structural objective and the sentence structural objective to leverage language structures in contextualized representation. This enables the StructBERT to explicitly model language structures by forcing it to reconstruct the right order of words and sentences for correct prediction.
StructBERT significantly outperforms the previous state-of-the-art models on a wide range of NLU tasks. This model extends the superiority of BERT, and boosts the performance in many language understanding applications such as semantic textual similarity, sentiment analysis, textual entailment, and question answering.
2 Related Work
2.1 Contextual Language Representation
Because a word can have different meanings depending on its context, contextualized word representation is considered an important part of modern NLP research.
Among the various pre-trained LMs [mccann2017learned, peters2018deep, radford2018improving, devlin2018bert], ELMo [peters2018deep] learns two unidirectional LMs based on long short-term memory networks (LSTMs). A forward LM reads the text from left to right, and a backward LM encodes the text from right to left. Following a similar idea to ELMo, OpenAI GPT [radford2018improving] expands the unsupervised language model to a much larger scale by training on a giant collection of free text corpora. Different from ELMo, it builds upon a multi-layer Transformer [vaswani2017attention] decoder, and uses a left-to-right Transformer to predict a text sequence word by word.
In contrast, BERT [devlin2018bert] employs a bidirectional Transformer encoder to fuse both the left and the right context, and introduces two novel pre-training tasks for better language understanding. We base our LM on the architecture of BERT, and further extend it by introducing word and sentence structures into pre-training tasks for deep language understanding.
2.2 Word & Sentence Ordering
The task of linearization is to recover the original order of a shuffled sentence [schmaltz2016word]. As part of a larger discussion as to whether LSTMs capture syntactic phenomena, linearization has been standardized in a recent line of research as a method for isolating the performance of text-to-text generation models [zhang2015discriminative]. Recently, Transformers have emerged as a powerful architecture for learning the latent structure of language. For example, the bidirectional Transformer model BERT has reduced the perplexity on the language modeling task. We revisit Elman’s question by applying BERT to the word-ordering task, without any explicit syntactic approaches. We find that pre-trained language models are effective for various downstream tasks with linearization.
Many important downstream tasks such as STS and NLI [wang2018glue] are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. While BERT [devlin2018bert] pre-trains a binarized next sentence prediction task to understand sentence relationships, we take one step further and treat it as a sentence ordering task. The goal of sentence ordering is to arrange a set of sentences into a coherent text in a clear and consistent manner, which can be viewed as a ranking problem [chen2016neural]. The task is general yet challenging, and is especially important for natural language generation [reiter1997building]. A text should be organized according to the following properties of discourse coherence [chen2016neural]: rhetorical coherence, topical relevancy, chronological sequence, and cause-effect. In this work, we focus on what is arguably the most basic characteristic of a sequence: its order. Most previous research on sentence ordering was integrated into the study of downstream tasks, such as multi-document summarization [bollegala2010bottom]. We revisit this problem in our language modeling task as a new sentence prediction task.
3 StructBERT Model Pre-training
StructBERT builds upon the BERT architecture, which uses a multi-layer bidirectional Transformer network [vaswani2017attention]. Given a single text sentence or a pair of text sentences, BERT packs them in one token sequence and learns a contextualized vector representation for each token. An input token is represented according to the word, the position, and the text segment it belongs to. Next, the input vectors are fed into a stack of multi-layer bidirectional Transformer blocks, which use self-attention to compute the text representations by considering the whole input sequence.
The original BERT introduces two unsupervised prediction tasks to pre-train the model: a masked LM task and a next sentence prediction task. Different from the original BERT, StructBERT amplifies the ability of the masked LM task by randomly shuffling a certain number of tokens after word masking and then predicting the right order. Moreover, to better understand the relationship between sentences, StructBERT randomly exchanges the sentence order so as to predict both the next sentence and the previous sentence as a new sentence prediction task. In this way, the new model not only explicitly captures the fine-grained word structure in every sentence, but also properly models the inter-sentence structure in a bidirectional manner. Once the StructBERT language model is pre-trained with these two auxiliary tasks, we can fine-tune it on task-specific data for a wide range of downstream tasks.
3.1 Input Representation
The input is a word sequence, which can be either a single sentence or a pair of sentences packed together. The input representation follows that used in BERT [devlin2018bert]. For each input token, its vector representation is computed by summing the corresponding token embedding, position embedding, and segment embedding. We always add a special classification embedding ([CLS]) as the first token of every sequence, and a special end-of-sequence ([SEP]) token to the end of each segment. Texts are tokenized into subword units by WordPiece [wu2016google], and absolute positional embeddings are learned with supported sequence lengths up to 512 tokens. In addition, the segment embeddings are used to differentiate a pair of texts as in BERT.
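The packing described above can be illustrated with a minimal sketch. The function name `pack_inputs` and the use of integer segment ids 0 and 1 are assumptions made for illustration; the paper only states that the representation follows BERT.

```python
from typing import List, Optional, Tuple


def pack_inputs(
    tokens_a: List[str],
    tokens_b: Optional[List[str]] = None,
    max_len: int = 512,
) -> Tuple[List[str], List[int], List[int]]:
    """Pack one or two tokenized segments into a single BERT-style sequence.

    Returns the token list plus the segment and position ids that index the
    segment and position embedding tables (hypothetical helper, not the
    authors' code).
    """
    # [CLS] opens every sequence; [SEP] closes each segment.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_len:
        raise ValueError(f"sequence of {len(tokens)} tokens exceeds {max_len}")
    # Absolute positions: one learned embedding per index.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids
```

Each token's input vector would then be the sum of the three embeddings looked up with the token, `segment_ids` and `position_ids` entries at that position.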
3.2 Transformer Encoder
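We use a multi-layer bidirectional Transformer encoder [vaswani2017attention] to encode contextual information for input representation. Given the input vectors $X = \{x_i\}_{i=1}^{N}$, an $L$-layer Transformer is used to encode the input as:
$$H^{l} = \mathrm{Transformer}_{l}(H^{l-1}) \tag{1}$$
where $l \in [1, L]$, $H^{0} = X$ and $H^{L} = [h_1^{L}, \cdots, h_N^{L}]$. We use the hidden vector $h_i^{L}$ as the contextualized representation of the input token $x_i$.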
3.3 Pre-training Objectives
To make full use of the rich inner-sentence and inter-sentence structures in language, we extend the pre-training objectives of original BERT in two ways: (1) word structural objective (mainly for the single-sentence task), and (2) sentence structural objective (mainly for the sentence-pair task). We pre-train these two auxiliary LM objectives together with the original masked LM objective in a unified model to exploit inherent language structures.
3.3.1 Word Structural Objective
The word structural objective is used to endow the model with the ability to reconstruct the right order of a certain number of intentionally shuffled word tokens. We use this new objective as a supplement to the original masked LM objective, and train the two objectives jointly.
In particular, we first treat the pre-training procedure as a Cloze task as in BERT [devlin2018bert] by masking some percentage of the input tokens at random and then predicting only these masked tokens. The corresponding output vectors of the masked tokens computed by the bidirectional Transformer encoder are fed into a softmax classifier to predict the original tokens.
Next, to take word order into consideration, every sentence is treated as a collection of continuous trigram sequences. We additionally select some percentage of trigrams and randomly shuffle the order of the three words within each trigram, e.g., it [MASK] raining outside → it [MASK] outside raining. In this way, the StructBERT model is expected to reconstruct the right order of the shuffled trigram sequence, i.e., [MASK] outside raining → [MASK] raining outside. The output vectors of the shuffled tokens computed by the bidirectional Transformer encoder are then fed into a softmax classifier to predict the original tokens.
In both of the above-mentioned objectives, the parameters of StructBERT are learned to minimize the cross-entropy loss between predicted tokens and original tokens. The two objectives are learned jointly in a unified pre-trained language model with equal weights.
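As an illustration of the trigram shuffling step, the following is a minimal sketch rather than the authors' implementation. The function name `shuffle_trigrams`, the left-to-right non-overlapping selection, and the choice to skip windows containing [CLS] or [SEP] are all assumptions; the paper only states that a percentage of trigrams is selected and shuffled.

```python
import random
from typing import Dict, List, Optional, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def shuffle_trigrams(
    tokens: List[str],
    select_prob: float = 0.05,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Dict[int, str]]:
    """Shuffle randomly selected trigrams and record reconstruction targets.

    Returns the corrupted sequence and a map from position to the original
    token at that position, which the classifier is trained to predict.
    """
    if rng is None:
        rng = random.Random()
    shuffled = list(tokens)
    targets: Dict[int, str] = {}
    i = 0
    while i + 3 <= len(tokens):
        window = tokens[i:i + 3]
        # Assumption: never move the special boundary tokens.
        if not SPECIAL_TOKENS.intersection(window) and rng.random() < select_prob:
            permuted = window[:]
            rng.shuffle(permuted)  # may occasionally leave the order unchanged
            shuffled[i:i + 3] = permuted
            for k in range(3):
                targets[i + k] = tokens[i + k]
            i += 3  # keep selected trigrams non-overlapping
        else:
            i += 1
    return shuffled, targets
```

The cross-entropy loss over the positions in `targets` would be added to the masked LM loss with equal weight, as stated above.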
3.3.2 Sentence Structural Objective
The next sentence prediction task is considered easy for the pre-trained BERT model (the prediction accuracy of BERT can easily achieve 97%-98% at this task [devlin2018bert]). We, therefore, extend the sentence prediction task by predicting both the next sentence and the previous sentence, to make the pre-trained language model aware of the sequential order of the sentences in a bidirectional manner.
Given a pair of sentences ($S_1$, $S_2$) as input, we predict whether $S_2$ is the next sentence that follows $S_1$, or the previous sentence that precedes $S_1$, or a random sentence from a different document. Specifically, for the sentence $S_1$, $\frac{1}{3}$ of the time we choose the text span that follows $S_1$ as the second sentence $S_2$, $\frac{1}{3}$ of the time the previous sentence ahead of $S_1$ is selected, and $\frac{1}{3}$ of the time a sentence randomly sampled from the other documents is used as $S_2$. The input token sequence is packed as “[CLS] $S_1$ [SEP] $S_2$ [SEP]”. We pool the model output by simply taking the hidden state corresponding to the first token [CLS], and feed the encoding vector of [CLS] into a softmax classifier to make a three-class prediction.
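The three-way sampling of sentence pairs can be sketched as follows. This is an illustrative sketch only: the label ids, the function name `make_sentence_pair`, and the fallback to a random sentence at document boundaries are assumptions that the paper does not specify.

```python
import random
from typing import List, Tuple

# Hypothetical label ids for the three-class prediction.
NEXT, PREV, RANDOM = 0, 1, 2


def make_sentence_pair(
    docs: List[List[str]],
    doc_idx: int,
    sent_idx: int,
    rng: random.Random,
) -> Tuple[str, str, int]:
    """Build one (first sentence, second sentence, label) training example.

    `docs` is a list of documents, each a list of sentences; at least two
    documents are required so that a random sentence can be drawn elsewhere.
    """
    doc = docs[doc_idx]
    first = doc[sent_idx]
    choice = rng.randrange(3)  # each case is drawn with equal probability
    if choice == NEXT and sent_idx + 1 < len(doc):
        return first, doc[sent_idx + 1], NEXT
    if choice == PREV and sent_idx > 0:
        return first, doc[sent_idx - 1], PREV
    # Random sentence from another document. Assumption: also used as a
    # fallback when the first sentence has no next or previous neighbour.
    other_docs = [d for j, d in enumerate(docs) if j != doc_idx]
    return first, rng.choice(rng.choice(other_docs)), RANDOM
```

The returned pair would be packed with [CLS] and [SEP] tokens, and the label predicted from the [CLS] encoding.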
3.4 Pre-training Setup
The training objective function is a linear combination of the word structural objective and the sentence structural objective. For the masked LM objective, we followed the same masking rate and settings as in BERT [devlin2018bert]. For the trigram shuffling objective, we selected 5% of trigrams for random shuffling.
We used documents from English Wikipedia (2,500M words) and BookCorpus [zhu2015aligning] as pre-training data, following the preprocessing and the WordPiece tokenization of [devlin2018bert]. The maximum length of input sequence was 512.
We used Adam with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, learning rate warmup over the first 10% of the total steps, and linear decay of the learning rate. We used a dropout probability of 0.1 on all layers. The GELU activation [hendrycks2016gaussian] was used, as in GPT [radford2018improving].
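The warmup and linear decay of the learning rate can be written as a small function of the training step. This is a sketch under stated assumptions: the name `learning_rate` is hypothetical, and decaying all the way to zero at the final step is an assumption, since the paper does not give the end value.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-4,
    warmup_frac: float = 0.1,
) -> float:
    """Linear warmup to `peak_lr`, then linear decay (assumed to reach zero)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup phase: rate grows linearly from 0 to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: rate shrinks linearly over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```

With `total_steps=1000`, the rate rises over the first 100 steps, peaks at 1e-4, and falls linearly afterwards.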
We denote the number of Transformer block layers as L, hidden size as H, and the number of self-attention heads as A. Following the practice in BERT, we primarily report results on two model sizes:
StructBERTBase: L=12, H=768, A=12, Total parameters=110M
StructBERTLarge: L=24, H=1024, A=16, Total parameters=340M
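The parameter totals can be sanity-checked with a rough count. The sketch below assumes a standard BERT encoder layout that the paper does not spell out: a WordPiece vocabulary of 30,522 entries, 512 positions, 2 segment types, a feed-forward size of 4H, layer normalization after the embeddings and after each sub-layer, and a pooling layer on [CLS]. Task and pre-training heads are not counted, and the number of heads A does not change the count.

```python
def encoder_parameters(
    num_layers: int,
    hidden: int,
    vocab_size: int = 30522,   # assumption: BERT uncased WordPiece vocabulary
    max_positions: int = 512,
    num_segments: int = 2,
) -> int:
    """Approximate parameter count of a BERT-style encoder."""
    ffn = 4 * hidden           # assumption: feed-forward size is 4H
    layer_norm = 2 * hidden    # scale and bias vectors
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = encoder_parameters(num_layers=12, hidden=768)
large = encoder_parameters(num_layers=24, hidden=1024)
print(f"Base: {base / 1e6:.1f}M, Large: {large / 1e6:.1f}M")
```

Under these assumptions the counts come out close to the rounded 110M and 340M figures listed above.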
Training of StructBERT was performed on a distributed computing cluster consisting of 64 Tesla V100 GPU cards. For StructBERTBase, we ran the pre-training procedure for 40 epochs, which took about 38 hours, and the training of StructBERTLarge took about 7 days to complete.
4 Fine-tuning on Downstream Tasks
We can fine-tune the pre-trained StructBERT with additional task-specific layers for various downstream tasks. Enriched with the new word structural objective and sentence structural objective, the new pre-trained model is adapted to a wide range of single-sentence and sentence-pair understanding applications, including the GLUE benchmark [wang2018glue], SQuAD question answering [rajpurkar2016squad] and SNLI language inference [bowman2015large].
4.1 Single-Sentence Understanding
For single-sentence understanding, we mainly focus on classification of a single sentence for various tasks. During fine-tuning, we use the encoding vector of [CLS] as the representation of an input sequence X, denoted as $h^{L}_{\mathrm{[CLS]}}$. An additional task-specific linear head $W$ is added on top of $h^{L}_{\mathrm{[CLS]}}$. The probability that X is labeled as class $c$ is predicted by a softmax classifier as:
$$P(c \mid X) = \mathrm{softmax}(W^{\top} h^{L}_{\mathrm{[CLS]}}) \tag{2}$$
We maximize the likelihood of the labeled training examples by updating the parameters of the pre-trained StructBERT and the added linear head $W$.
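A minimal sketch of the classification head follows, in plain Python for readability. The names `classify` and `negative_log_likelihood` are hypothetical, and storing the head as one weight vector per class is a layout chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_cls: List[float], head: List[List[float]]) -> List[float]:
    """Class probabilities from the [CLS] encoding and a linear head.

    `head` holds one weight vector per class (num_classes x hidden size).
    """
    logits = [sum(w * h for w, h in zip(row, h_cls)) for row in head]
    return softmax(logits)


def negative_log_likelihood(probs: List[float], gold: int) -> float:
    """Loss minimized during fine-tuning; equivalent to maximizing likelihood."""
    return -math.log(probs[gold])
```

In actual fine-tuning, the gradient of this loss would update both the head and all encoder parameters.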
4.2 Sentence-Pair Understanding
Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on the understanding of the relationship between two text sentences. For sentence-pair understanding, we pack the two sentences (denoted as $S_1$ and $S_2$), together with the special tokens as in BERT, to form an input sequence “[CLS] $S_1$ [SEP] $S_2$ [SEP]”. The packed input sequence is then fed into pre-trained StructBERT and the task-specific layers for fine-tuning on a target task.
For a sentence-pair classification task, we use the same softmax classifier with the task-specific linear head as in Equation 2 for updating the model parameters.
For a text similarity task, we also use the encoding vector $h^{L}_{\mathrm{[CLS]}}$ of [CLS] as the sequence representation. The similarity score of the sentence pair ($S_1$, $S_2$) is computed as:
$$\mathrm{Sim}(S_1, S_2) = g(w^{\top} h^{L}_{\mathrm{[CLS]}}) \tag{3}$$
where $w$ is a task-specific parameter vector, and $g$ is a sigmoid function that maps the score to a real value in the range $(0, 1)$. We use the mean squared error as in [liu2019multi] for the training objective.
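The similarity head and its training objective can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that gold similarity scores have been rescaled to the output range of the sigmoid, which the paper does not state.

```python
import math
from typing import List


def similarity(h_cls: List[float], w: List[float]) -> float:
    """Sigmoid of the dot product between the [CLS] encoding and the head."""
    score = sum(w_k * h_k for w_k, h_k in zip(w, h_cls))
    return 1.0 / (1.0 + math.exp(-score))


def mean_squared_error(predictions: List[float], targets: List[float]) -> float:
    """Training objective; targets are assumed rescaled to the sigmoid range."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```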
For an extractive question answering task, we follow the method used in [devlin2018bert] and add two linear heads to predict the start index and end index, respectively. The probability of word $i$ being the start or end of an answer span is computed by a softmax over all of the words in the given paragraph as:
$$P_i = \frac{\exp(w^{\top} h_i^{L})}{\sum_{j} \exp(w^{\top} h_j^{L})} \tag{4}$$
where $h_i^{L}$ is the output vector of word $i$ and $w$ denotes the linear head parameters for start or end. The training objective is the log-likelihood of the true start and end positions.
4.3 Fine-tuning Setup
Following BERT’s practice, during fine-tuning on downstream tasks, we performed a grid search or an exhaustive search (depending on the data size) over the following sets of hyperparameters and chose the model that performed best on the dev set, while all other parameters remained the same as those in pre-training:
Batch size: 16, 24, 32
Learning rate: 2e-5, 3e-5, 5e-5
Number of epochs: 2, 3
Dropout rate: 0.05, 0.1
We observed that small learning rates are more likely to achieve better results on small data sets such as RTE, MRPC and CoLA, especially for large models.
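The search space listed above contains 3 × 3 × 2 × 2 = 36 configurations, which can be enumerated as in the sketch below. The `evaluate` callback is hypothetical: it stands in for fine-tuning a model with one configuration and returning its dev-set score.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List

SEARCH_SPACE: Dict[str, List[float]] = {
    "batch_size": [16, 24, 32],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "num_epochs": [2, 3],
    "dropout": [0.05, 0.1],
}


def grid(space: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield every combination of the listed hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(
    space: Dict[str, List[float]],
    evaluate: Callable[[Dict[str, float]], float],
) -> Dict[str, float]:
    """Return the configuration with the highest dev-set score."""
    return max(grid(space), key=evaluate)
```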
In this section, we report results of StructBERT on a variety of downstream tasks including General Language Understanding Evaluation (GLUE benchmark), Stanford Natural Language Inference (SNLI corpus) and extractive question answering (SQuAD v1.1).
Table 1: Tasks in the GLUE benchmark.
|Corpus (Task)||#Train/#Dev/#Test||Metric|
|CoLA (Acceptability)||8.5k/1k/1k||Matthews corr|
|Pairwise Text Classification|
|QQP (Paraphrase)||364k/40k/391k||F1 score/Accuracy|
|MRPC (Paraphrase)||3.7k/408/1.7k||F1 score/Accuracy|
Table 2: GLUE test results.
|System||CoLA||SST-2||MRPC||STS-B||QQP||MNLI-m/mm||QNLI||RTE||WNLI||AX||Average|
|BERT on STILTs [phang2018sentence]||62.1||94.3||90.2/86.6||88.7/88.3||71.9/89.4||86.4/85.6||92.7||80.1||65.1||28.3||82.0|
|Snorkel MeTaL [ratner2017snorkel]||63.8||96.2||91.5/88.5||90.1/89.7||73.1/89.9||87.6/87.2||93.9||80.9||65.1||39.9||83.2|
|MT-DNN ensemble [liu2019multi]||65.4||96.5||92.2/89.5||89.6/89.0||73.7/89.9||87.9/87.4||96.0||85.7||65.1||42.8||84.2|
5.1 General Language Understanding
5.1.1 GLUE benchmark
The General Language Understanding Evaluation (GLUE) benchmark [wang2018glue] is a collection of nine NLU tasks as in Table 1, covering textual entailment (RTE [bentivogli2009fifth] and MNLI [williams2017broad]), question-answer entailment (QNLI [wang2018glue]), paraphrase (MRPC [dolan2005automatically]), question paraphrase (QQP), textual similarity (STS-B [cer2017semeval]), sentiment (SST-2 [socher2013recursive]), linguistic acceptability (CoLA), and Winograd Schema (WNLI [levesque2012winograd]).
On the GLUE benchmark, given the similarity of MRPC/RTE/STS-B to MNLI, we fine-tuned StructBERT on MNLI before training on MRPC/RTE/STS-B data for the respective tasks. This follows the two-stage transfer learning approach STILTs introduced in [phang2018sentence]. For all the other tasks (i.e., QNLI, QQP, SST-2, CoLA and MNLI), we fine-tuned StructBERT for each single task only on its in-domain data.
Table 2 presents the GLUE test results obtained from the official benchmark evaluation server. At the time of paper submission (May 21, 2019), the StructBERT model, which was submitted under a different name ALICE, achieved the best performance on the leaderboard, creating a new state-of-the-art result of 84.5% for the ensemble model and 83.9% for the single model.
As shown in Table 2, we achieved state-of-the-art results in six of the nine tasks (we excluded WNLI: as noted on the GLUE website https://gluebenchmark.com/faq, there are issues in the WNLI dataset, and none of the submitted systems has ever outperformed the majority voting baseline, whose accuracy is 65.1). In the most popular MNLI task, our single model improved the best result by 0.3%/0.5%. Since we fine-tuned MNLI only on its in-domain data, this improvement is entirely attributed to our new training objectives. The most significant improvement over BERT was observed on CoLA (4.8%), which may be due to the strong correlation between the word order task and the grammatical error correction task. In the SST-2 task, our model improved over BERT but performed worse than MT-DNN, which indicates that sentiment analysis based on single sentences benefits less from the word structural objective and sentence structural objective.
Natural Language Inference (NLI) is one of the important tasks in natural language understanding. The goal of this task is to test the ability of the model to reason about the semantic relationship between two sentences. In order to perform well on an NLI task, a model needs to capture the semantics of sentences, and thus to infer the relationship between a pair of sentences: entailment, contradiction or neutral.
|System||Dev set||Test set|
We evaluated our model on the most widely used NLI dataset: The Stanford Natural Language Inference (SNLI) Corpus [bowman2015large], which consists of 549,367/9,842/9,824 premise-hypothesis pairs in train/dev/test sets and target labels indicating their relations. We performed a grid search on the sets of parameters as described in Section 4.3, and chose the model that performed best on the dev set.
Table 3 compares the results of our model on the SNLI dataset with those of other published models. StructBERT outperformed all existing systems on SNLI, creating a new state-of-the-art result of 91.7%, which amounts to a 0.4% absolute improvement over the previous state-of-the-art model SJRC and a 0.9% absolute improvement over BERT. Since the network architecture of our model is identical to that of BERT, this improvement is entirely attributed to the new pre-training objectives, which justifies the effectiveness of the proposed tasks of word prediction and sentence prediction.
5.2 Extractive Question Answering
SQuAD v1.1 is a popular machine reading comprehension dataset consisting of 100,000+ questions created by crowd workers on 536 Wikipedia articles [rajpurkar2016squad]. The goal of the task is to extract the right answer span from the corresponding paragraph given a question.
We fine-tuned our StructBERT language model on the SQuAD dataset for 3 epochs, and compared the result against the state-of-the-art methods on the official leaderboard (https://rajpurkar.github.io/SQuAD-explorer/), as shown in Table 4. We can see that without any additional data augmentation techniques, the proposed StructBERT model was superior to all the other methods. (We submitted the model under the name of ALICE to the SQuAD v1.1 CodaLab for evaluation on the test set on May 5th, 2019 for the single model and on May 9th, 2019 for the ensemble model. However, due to a crash of the CodaLab evaluation server, we had not received our test result at the time of paper submission. We will update the result once it is announced.) This demonstrates the effectiveness of the proposed pre-trained StructBERT in modeling the question-paragraph relationship for extractive question answering. Incorporating the word and sentence structures significantly improves the understanding ability in this fine-grained answer extraction task.
5.3 Effect of Different Structural Objectives
We have demonstrated the strong empirical results of the proposed model on a variety of downstream tasks. In the StructBERT pre-training, the two new structural prediction tasks are the most important components. Therefore, we conducted an ablation study by removing one structural objective from pre-training at a time to examine how the two structural objectives influence the performance on various downstream tasks.
Results are presented in Table 5. From the table, we can see that: (1) the two structural objectives were both critical to most of the downstream tasks, except for the word structural objective in the SNLI task. Removing any word or sentence objective from pre-training always led to degraded performance in the downstream tasks. The StructBERT model with structural pre-training consistently outperformed the original BERT model, which shows the effectiveness of the proposed structural objectives. (2) For the sentence-pair tasks such as MNLI, SNLI, QQP and SQuAD, incorporating the sentence structural objective significantly improved the performance. It demonstrates the effect of inter-sentence structures learned by pre-training in understanding the relationship between sentences for downstream tasks. (3) For the single-sentence tasks such as CoLA and SST-2, the word structural objective played the most important role. Especially in the CoLA task, which is related to the grammatical error correction, the improvement was over 5%. The ability of reconstructing the order of words in pre-training helped the model better judge the acceptability of a single sentence.
We also studied the effect of both structural objectives during unsupervised pre-training. Figure 1 illustrates the loss and accuracy of word and sentence prediction over the number of pre-training steps for StructBERTBase and BERTBase. From the two sub-figures on top, it is observed that compared with BERT, the augmented shuffled token prediction in StructBERT’s word structural objective had little effect on the loss and accuracy of masked token prediction. On the other hand, the integration of the simpler task of shuffled token prediction (lower loss and higher accuracy) provides StructBERT with the capability of word reordering. In contrast, the new sentence structural objective in StructBERT leads to a more challenging prediction task than that in BERT, as shown in the two figures at the bottom. This new pre-training objective enables StructBERT to exploit inter-sentence structures, which benefits sentence-pair downstream tasks.
In this paper, we propose novel structural pre-training which incorporates word and sentence structures into BERT pre-training. A word structural objective and a sentence structural objective are introduced as two new pre-training tasks for deep understanding of natural language at different granularities. Experimental results demonstrate that the new StructBERT model can obtain new state-of-the-art results in a variety of downstream tasks, including the popular GLUE benchmark, the SNLI corpus and the SQuAD v1.1 question answering task.