Deepening Hidden Representations from Pre-trained Language Models for Natural Language Understanding

11/05/2019 ∙ by Junjie Yang, et al.

Transformer-based pre-trained language models have proven effective for learning contextualized language representations. However, current approaches use only the output of the encoder's final layer when fine-tuning on downstream tasks. We argue that relying on a single layer's output restricts the power of the pre-trained representation. We therefore deepen the representation learned by the model by fusing the hidden representations by means of an explicit HIdden Representation Extractor (HIRE), which automatically absorbs representations complementary to the output of the final layer. With RoBERTa as the backbone encoder, our proposed improvement over the pre-trained model is shown to be effective on multiple natural language understanding tasks and helps our model rival the state-of-the-art models on the GLUE benchmark.




1 Introduction

Language representation is essential to the understanding of text. Recently, pre-trained language models based on the Transformer Vaswani et al. (2017), such as GPT Radford et al. (2018), BERT Devlin et al. (2019), XLNet Yang et al. (2019), and RoBERTa Liu et al. (2019b), have been shown to be effective for learning contextualized language representations. These models have since continued to achieve new state-of-the-art results on a variety of natural language processing tasks, including question answering Rajpurkar et al. (2018); Lai et al. (2017), natural language inference Williams et al. (2018); Bowman et al. (2015), named entity recognition Tjong Kim Sang and De Meulder (2003), sentiment analysis Socher et al. (2013) and semantic textual similarity Cer et al. (2017); Dolan and Brockett (2005).

Normally, Transformer-based models are pre-trained on large-scale unlabeled corpora in an unsupervised manner and then fine-tuned on downstream tasks by introducing a task-specific output layer. When fine-tuning on supervised downstream tasks, the models pass the output of the Transformer encoder's final layer, which is considered the contextualized representation of the input text, directly to the task-specific layer.

However, given the numerous layers (i.e., Transformer blocks) and considerable depth of these pre-trained models, we argue that the output of the last layer may not always be the best representation of the input text when fine-tuning on a downstream task. Devlin et al. (2019) show that different combinations of layer outputs from the pre-trained BERT yield distinct performance on the CoNLL-2003 Named Entity Recognition (NER) task Tjong Kim Sang and De Meulder (2003). Peters et al. (2018) point out that for pre-trained language models, including the Transformer, the most transferable contextualized representations of the input text tend to occur in the middle layers, while the top layers specialize in language modeling. Therefore, relying solely on the last layer's output may restrict the power of the pre-trained representation.

In this paper, we introduce RTRHI: Refined Transformer Representation with Hidden Information, a fine-tuning approach for Transformer-based models that leverages the hidden information in the Transformer's hidden layers to refine the language representation. Our approach consists of two main additional components:

  1. HIdden Representation Extractor (HIRE) dynamically learns a complementary representation that contains the information the final layer's output fails to capture. We place a 2-layer bidirectional GRU beside the encoder to summarize the output of each layer into a single vector, which is used to compute the contribution score.

  2. Fusion layer integrates the hidden information extracted by HIRE with the output of the Transformer's final layer through two steps with different functions, yielding a refined contextualized language representation.

Taking advantage of the robustness of RoBERTa by using it as the Transformer-based encoder of RTRHI, we conduct experiments on the GLUE benchmark Wang et al. (2018), which consists of nine Natural Language Understanding (NLU) tasks. RTRHI outperforms our baseline model RoBERTa on 5 of the 9 tasks and advances the state-of-the-art on the SST-2 dataset. Even though we make no modification to the encoder's internal architecture and do not redefine the pre-training procedure with different objectives or datasets, we still achieve performance comparable to other state-of-the-art models on the GLUE leaderboard. These results highlight RTRHI's excellent ability to refine the language representation of Transformer-based models.

2 Related Work

Transformer-based language models take the Transformer Vaswani et al. (2017) as their model architecture but are pre-trained with different objectives or corpora. OpenAI GPT Radford et al. (2018) is the first model to introduce the Transformer architecture into unsupervised pre-training. It is a 12-layer left-to-right Transformer pre-trained on the BooksCorpus Zhu et al. (2015) dataset. Instead of using a left-to-right architecture like GPT, BERT Devlin et al. (2019) adopts a masked LM objective during pre-training, which enables the representation to incorporate context from both directions. BERT also uses the next sentence prediction (NSP) objective to better capture the relationship between two sentences. Its training is conducted on a combination of BooksCorpus and English Wikipedia. XLNet Yang et al. (2019), a generalized autoregressive language model, instead uses a permutation language modeling objective during pre-training. In addition to BooksCorpus and English Wikipedia, it also uses Giga5, ClueWeb 2012-B and Common Crawl for pre-training. Trained with dynamic masking, large mini-batches, a larger byte-level BPE vocabulary, and full sentences without NSP, RoBERTa Liu et al. (2019b) improves BERT's performance on downstream tasks. Its pre-training corpora include BooksCorpus, CC-News, OpenWebText and Stories. Fine-tuned on downstream tasks in a supervised manner, these powerful Transformer-based models have all pushed the state of the art on various NLP tasks to a new level.

Recent works have proposed new methods for fine-tuning on downstream tasks, including multi-task learning Liu et al. (2019a), adversarial training Zhu et al. (2019) and the incorporation of semantic information into the language representation Zhang et al. (2019b).

3 Model and Method

Figure 1: Architecture of RTRHI.

3.1 Transformer-based encoder layer

The Transformer-based encoder layer is responsible for encoding the input text into a sequence of high-dimensional vectors, which is considered the contextualized representation of the input sequence. Let $X = \{x_1, \dots, x_n\}$ represent a sequence of $n$ words of input text. We use a Transformer-based encoder to encode the input sequence and thereby obtain its universal contextualized representation $R \in \mathbb{R}^{n \times d}$:

$$R = \mathrm{Encoder}(X) \qquad (1)$$

where $d$ is the hidden size of the encoder. Note that $R$ is the output of the Transformer-based encoder's last layer, which has the same length as the input text. We call it the preliminary representation in this paper to distinguish it from the one that we introduce in section 3.2. Here, we omit the rather extensive formulation of the Transformer and refer readers to Radford et al. (2018), Devlin et al. (2019) and Yang et al. (2019) for more details.

3.2 Hidden Representation Extractor

Since a Transformer-based encoder normally has many identical layers stacked together (for example, BERT-Large and RoBERTa-Large both contain 24 layers of identical structure), the output of the final layer may not be the best candidate to fully represent the information contained in the input text.

To address this problem, we introduce an HIdden Representation Extractor (HIRE) beside the encoder to draw from the hidden states the information that the output of the last layer fails to capture. Since the hidden states of different layers are not equally important for representing a given input sequence, we adopt a mechanism that computes their importance dynamically. We call this importance the contribution score.

The input to the HIRE is $\{H_0, \dots, H_l\}$, where $H_i \in \mathbb{R}^{n \times d}$ and $l$ represents the number of layers in the encoder. Here $H_0$ is the initial embedding of the input text, which is the input of the encoder's first layer but is updated during training, and $H_i$ (for $1 \le i \le l$) is the hidden-state of the encoder at the output of layer $i$. For the sake of simplicity, we call them all hidden-states hereafter.

For each hidden-state of the encoder, we use the same 2-layer bidirectional Gated Recurrent Unit (GRU) Cho et al. (2014) to summarize it. Instead of taking the whole output of the GRU as the representation of the hidden-state, we concatenate the final states of each layer and each direction of the GRU. In this way, we summarize the hidden-state into a fixed-size vector. Hence, we obtain $U = \{u_0, \dots, u_l\}$, with $u_i$ the summarized vector of $H_i$:

$$u_i = \mathrm{BiGRU}(H_i) \qquad (2)$$

where $u_i \in \mathbb{R}^{4d}$. Then the importance value $\alpha_i$ for hidden-state $H_i$ is calculated by:

$$\alpha_i = \mathrm{ReLU}(W^\top u_i + b) \qquad (3)$$

where $W \in \mathbb{R}^{4d}$ and $b \in \mathbb{R}$ are trainable parameters. Let $S = \{s_0, \dots, s_l\}$ represent the contribution scores for all hidden-states. $S$ is computed as follows:

$$S = \mathrm{softmax}(\alpha), \quad \alpha = (\alpha_0, \dots, \alpha_l) \qquad (4)$$

Note that $\sum_{i=0}^{l} s_i = 1$, where $s_i$ is the weight of hidden-state $H_i$ when computing the representation. Subsequently, we obtain the input sequence's new representation $A$ by:

$$A = \sum_{i=0}^{l} s_i H_i \qquad (5)$$

With the same shape as the output of the Transformer-based encoder's final layer, HIRE's output $A$ is expected to contain additional useful information from the encoder's hidden-states that helps to better understand the input text. We call it the complementary representation.
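The scoring and mixing steps of HIRE can be illustrated with a small plain-Python sketch. It is not the authors' implementation: the per-layer summary vectors are passed in directly instead of being produced by a bidirectional GRU, the ReLU applied before the softmax is an assumption of this sketch, and the name `hire_mix` and all toy values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def hire_mix(hidden_states, summaries, w, b):
    """Weight each layer's hidden-state by its contribution score.

    hidden_states: list of (l + 1) matrices, each n x d (list of rows).
    summaries:     list of (l + 1) summary vectors (stand-ins for GRU outputs).
    w, b:          projection vector and bias (trainable in a real model).
    """
    # Importance value per layer: ReLU of an affine projection of the summary.
    alphas = [max(0.0, sum(wi * ui for wi, ui in zip(w, u)) + b)
              for u in summaries]
    scores = softmax(alphas)  # contribution scores, summing to 1
    n, d = len(hidden_states[0]), len(hidden_states[0][0])
    # Complementary representation: score-weighted sum of all hidden-states.
    mixed = [[sum(s * h[t][k] for s, h in zip(scores, hidden_states))
              for k in range(d)] for t in range(n)]
    return scores, mixed


# Toy input: 3 hidden-states (embedding + 2 layers), n = 2 tokens, d = 2.
H = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 4.0], [4.0, 0.0]],
]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores, A = hire_mix(H, U, w=[0.5, 0.5], b=0.0)
```

Because the scores are normalized across layers, the result keeps the n x d shape of a single hidden-state, so it can be combined with the final layer's output position by position.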

Figure 2: Architecture of the Hidden Representation Extractor. The GRUs share the same parameters.

3.3 Fusion Layer

This layer fuses the information contained in the output of the Transformer-based encoder with the information extracted from the encoder's hidden states by HIRE.

Given the preliminary representation $R$, instead of letting it flow directly into the task-specific output layer, we combine it with the complementary representation $A$ to yield $M$, which we define by:

$$M = [R;\, A;\, R + A;\, R \circ A] \qquad (6)$$

where $\circ$ is elementwise multiplication (Hadamard product) and $[\,;\,]$ is concatenation across the last dimension.

A two-layer bidirectional GRU, with an output size of $d$ for each direction, is then used to fully fuse the information contained in the preliminary representation with the additional useful information included in the complementary representation. We concatenate the outputs of the GRU in the two directions, and hence obtain the final contextualized representation $F$ of the input text:

$$F = \mathrm{BiGRU}(M) \qquad (7)$$

The use of GRUs enables complete interaction between the two kinds of information mentioned above. Therefore, $F$ is expected to be a refined universal representation of the input text.
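The first fusion step, which builds the input of the fusion GRU, can be sketched as follows. This is only an illustration under the assumption that the fused features are the two representations, their sum and their elementwise product, concatenated along the feature dimension; the function name `fuse_features` and the toy matrices are hypothetical, and the GRU itself is omitted.

```python
def fuse_features(R, A):
    """Concatenate R, A, R + A and R * A along the feature dimension.

    R, A: matrices of shape n x d given as lists of rows.
    Returns a matrix of shape n x 4d.
    """
    fused = []
    for r_row, a_row in zip(R, A):
        added = [r + a for r, a in zip(r_row, a_row)]
        multiplied = [r * a for r, a in zip(r_row, a_row)]
        fused.append(list(r_row) + list(a_row) + added + multiplied)
    return fused


# Toy input: n = 2 tokens, d = 2 features.
R = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.5], [2.0, 0.0]]
M = fuse_features(R, A)
```

Each token keeps its own row, so the recurrent layer that follows still sees a sequence of length n, only with wider feature vectors.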

3.4 Output layer

The output layer is task-specific, which means HIRE and the fusion layer can be adapted to other downstream tasks, such as question answering, by changing only the output layer.

The GLUE benchmark contains two types of tasks: (1) classification and (2) regression. For classification tasks, given the input text's contextualized representation $F$, following Devlin et al. (2019), we take the first row $C \in \mathbb{R}^{2d}$ of $F$, corresponding to the first input token ([CLS]), as the aggregate representation. Let $m$ be the number of labels in the dataset. We pass $C$ through a feed-forward network (FFN):

$$Q = W_2^\top \tanh(W_1^\top C + b_1) + b_2 \qquad (8)$$

with $W_1 \in \mathbb{R}^{2d \times d}$, $W_2 \in \mathbb{R}^{d \times m}$, $b_1 \in \mathbb{R}^{d}$ and $b_2 \in \mathbb{R}^{m}$ the only parameters that we introduce in the output layer. Finally, the probability distribution over the predicted labels is computed as:

$$P = \mathrm{softmax}(Q) \qquad (9)$$

For regression tasks, we obtain $Q$ in the same manner with $m = 1$ and take $Q$ as the predicted value.
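A classification head of this kind can be sketched in plain Python. The sketch assumes a two-layer feed-forward network with a tanh activation followed by a softmax; the helper names `affine` and `classify` and all weights are hypothetical toy values, not parameters from the paper.

```python
import math


def affine(x, weight, bias):
    """Compute weight^T x + bias, with weight of shape len(x) x len(bias)."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def classify(c, w1, b1, w2, b2):
    """Return (logits, probabilities) for an aggregate representation c."""
    hidden = [math.tanh(v) for v in affine(c, w1, b1)]
    logits = affine(hidden, w2, b2)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]


# Toy sizes: input width 4, hidden width 2, m = 3 labels.
c = [1.0, 0.0, -1.0, 0.5]
w1 = [[0.1, 0.2], [0.0, 0.1], [0.3, -0.1], [0.2, 0.2]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0, 0.0]
logits, probs = classify(c, w1, b1, w2, b2)
predicted_label = probs.index(max(probs))
```

For a regression task the same function could be used with a single output column, reading the lone logit as the predicted value and ignoring the softmax.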

3.5 Training

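For classification tasks, the training loss to be minimized is the cross-entropy:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{m} y_{i,c} \log p_{i,c} \qquad (10)$$

where $\theta$ is the set of all parameters in the model, $N$ is the number of examples in the dataset, $p_{i,c}$ is the predicted probability of class $c$ for example $i$ and $y_{i,c}$ is the binary indicator defined as:

$$y_{i,c} = \begin{cases} 1 & \text{if label } c \text{ is the correct classification for example } i \\ 0 & \text{otherwise} \end{cases}$$

For regression tasks, we define the training loss as the mean squared error (MSE):

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} (Q_i - y_i)^2 \qquad (11)$$

where $Q_i$ is the predicted value for example $i$, $y_i$ is the ground truth value for example $i$, and $N$ and $\theta$ are the same as in equation 10.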
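Both losses are standard and can be written in a few lines. In the sketch below, the correct class is given as an integer index, which is equivalent to a one-hot binary indicator because only the correct class contributes to the inner sum; the function names and toy values are hypothetical.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class.

    probs:  list of N probability distributions (one list per example).
    labels: list of N integer class indices.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def mean_squared_error(preds, targets):
    """Mean of squared differences between predictions and ground truth."""
    return sum((q - y) ** 2 for q, y in zip(preds, targets)) / len(targets)


ce = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
mse = mean_squared_error([1.0, 3.0], [0.0, 1.0])
```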

4 Experiments

4.1 Dataset

We conducted the experiments on the General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2018) to evaluate our method's performance. GLUE is a collection of 9 diverse datasets for training, evaluating, and analyzing natural language understanding models. Following the original paper, the GLUE benchmark groups its tasks into three categories:

Single-sentence tasks: The Corpus of Linguistic Acceptability (CoLA) Warstadt et al. (2018) requires the model to determine whether a sentence is grammatically acceptable; the Stanford Sentiment Treebank (SST-2) Socher et al. (2013) requires predicting the sentiment of movie reviews, labeled positive or negative.

Similarity and paraphrase tasks: These tasks require predicting whether each pair of sentences captures a paraphrase or semantic equivalence relationship. The Microsoft Research Paraphrase Corpus (MRPC) Dolan and Brockett (2005), the Quora Question Pairs (QQP) Shankar et al. (2016) and the Semantic Textual Similarity Benchmark (STS-B) Cer et al. (2017) fall into this category.

Natural Language Inference (NLI) tasks: Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". The GLUE benchmark contains the following NLI tasks: the Multi-Genre Natural Language Inference Corpus (MNLI) Williams et al. (2018), Question NLI (QNLI), derived from the Stanford Question Answering Dataset Rajpurkar et al. (2016), Recognizing Textual Entailment (RTE) Bentivogli et al. (2009) and the Winograd Schema Challenge (WNLI) Levesque et al. (2012).

Four official metrics are adopted to evaluate model performance: Matthews correlation Matthews (1975), accuracy, F1 score, and Pearson and Spearman correlation coefficients. More details will be presented in section 4.3.
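Of these metrics, Matthews correlation is the least familiar, so a minimal binary-label sketch may help. The function below uses the standard confusion-matrix formula; the name `matthews_corrcoef` and the toy labels are illustrative, and returning 0.0 for a zero denominator is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


mcc = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
```

The coefficient ranges from -1 to 1; benchmark tables such as those in this paper report it multiplied by 100.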

| Model | CoLA (Mcc) | SST-2 (Acc) | MRPC (Acc) | QQP (Acc) | STS-B (Pearson) | MNLI-m/mm (Acc) | QNLI (Acc) | RTE (Acc) |
|---|---|---|---|---|---|---|---|---|
| MT-DNN | 63.5 | 94.3 | 87.5 | 91.9 | 90.7 | 87.1/86.7 | 92.9 | 83.4 |
| XLNet-Large | 63.6 | 95.6 | 89.2 | 91.8 | 91.8 | 89.8/- | 93.9 | 83.8 |
| ALBERT (1.5M) | 71.4 | 96.9 | 90.9 | 92.2 | 93.0 | 90.8/- | 95.3 | 89.2 |
| RoBERTa (Baseline) | 68.0 | 96.4 | 90.9 | 92.2 | 92.4 | 90.2/90.2 | 94.7 | 86.6 |
| RTRHI | 69.6 | 96.8 | 90.9 | 92.0 | 92.2 | 90.6/90.4 | 95.0 | 86.6 |

Table 1: GLUE Dev results. CoLA and SST-2 are single-sentence tasks; MRPC, QQP and STS-B are similarity and paraphrase tasks; MNLI, QNLI and RTE are natural language inference tasks. RTRHI results are based on a single model trained on a single task; the median over five runs with different random seeds but the same hyperparameters is reported for each task. The results of MT-DNN, XLNet-Large, ALBERT (1.5M) and RoBERTa are from Liu et al. (2019a), Yang et al. (2019), Lan et al. (2019) and Liu et al. (2019b), respectively. See the lower-most row for the performance of our approach.

4.2 Implementation

Our implementation of RTRHI is based on the PyTorch implementation of Transformer.


Preprocessing: Following Liu et al. (2019b), we adopt the GPT-2 Radford et al. (2019) tokenizer with a Byte-Pair Encoding (BPE) vocabulary of 50K subword units. We format the input sequence in the following way. Given a single sequence $X$, we add a <s> token at the beginning and a </s> token at the end: <s>X</s>. For a pair of sequences $(X, Y)$, we additionally use </s> to separate the two sequences: <s>X</s>Y</s>.
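The input format described above can be sketched as a small helper. It assumes the sequences have already been split into BPE tokens by some tokenizer; the function name `format_input` and the example tokens are hypothetical.

```python
def format_input(x_tokens, y_tokens=None):
    """Wrap one sequence, or a pair of sequences, in <s> and </s> markers."""
    tokens = ["<s>"] + list(x_tokens) + ["</s>"]
    if y_tokens is not None:
        # A second sequence is appended and closed by another </s>.
        tokens += list(y_tokens) + ["</s>"]
    return tokens


single = format_input(["a", "good", "film"])
pair = format_input(["it", "rains"], ["the", "ground", "is", "wet"])
```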

Model configurations: We use RoBERTa-Large as the Transformer-based encoder and load the pre-trained weights of RoBERTa Liu et al. (2019b). Like BERT-Large, the RoBERTa-Large model contains 24 Transformer blocks, with a hidden size of 1024 and 16 self-attention heads Liu et al. (2019b); Devlin et al. (2019).

Optimization: We use the Adam optimizer Kingma and Ba (2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-6}$, and the learning rate is selected amongst {5e-6, 1e-5, 2e-5, 3e-5} with a warmup rate ranging from 0.06 to 0.25 depending on the nature of the task. The number of training epochs ranges from 4 to 10 with early stopping, and the batch size is selected amongst {16, 32, 48}. In addition, we clip the gradient norm to 1 to prevent the exploding gradient problem from occurring in the recurrent neural networks in our model.
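Clipping the gradient norm to 1 means rescaling all gradients by a common factor whenever their global L2 norm exceeds that threshold. The plain-Python sketch below treats the gradients as one flat list for illustration; the function name `clip_grad_norm` is hypothetical, and in a PyTorch training loop the built-in `torch.nn.utils.clip_grad_norm_` would normally be used instead.

```python
import math


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale gradients so that their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
```

Rescaling preserves the direction of the update, which is why it is a common safeguard for recurrent layers such as the GRUs used here.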

Regularization: We employ two types of regularization methods during training. We apply dropout Srivastava et al. (2014) of rate 0.1 to all layers in the Transformer-based encoder and GRUs in the HIRE and fusion layer. We additionally adopt L2 weight decay of 0.1 during training.

4.3 Main results

Table 1 compares our method RTRHI with a list of Transformer-based models on the development set. To obtain a direct and fair comparison with our baseline model RoBERTa, following the original paper Liu et al. (2019b), we fine-tune RTRHI separately for each of the GLUE tasks, using only task-specific training data. The single-model results for each task are reported. We run our model with five different random seeds but the same hyperparameters and take the median value. Due to the problematic nature of the WNLI dataset, we exclude its results from this table. The results show that RTRHI outperforms RoBERTa on 4 of the GLUE task development sets, with improvements of 1.6 points, 0.4 points, 0.4/0.2 points and 0.3 points on CoLA, SST-2, MNLI and QNLI respectively. On the MRPC and RTE tasks, our model gets the same result as RoBERTa. Note that the improvement is entirely attributable to the introduction of the HIdden Representation Extractor and the Fusion Layer in our model.
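The reporting protocol (five seeds, same hyperparameters, median taken) can be expressed with the standard library. The seed values and scores below are made-up placeholders for illustration only and are not results from the paper; the function name `report_median` is hypothetical.

```python
from statistics import median


def report_median(scores_by_seed):
    """Return the median development-set score over all random seeds."""
    return median(scores_by_seed.values())


# Hypothetical scores for one task, keyed by random seed.
toy_scores = {11: 0.61, 22: 0.58, 33: 0.64, 44: 0.60, 55: 0.63}
reported = report_median(toy_scores)
```

With an odd number of runs the median is always one of the observed scores, so the reported number corresponds to an actual trained model.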

| Model | CoLA (8.5k) | SST-2 (67k) | MRPC (3.7k) | QQP (364k) | STS-B (7k) | MNLI-m/mm (393k) | QNLI (108k) | RTE (2.5k) | WNLI (634) | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| XLNet | 67.8 | 96.8 | 93.0/90.7 | 74.2/90.3 | 91.6/91.1 | 90.2/89.7 | 98.6 | 86.3 | 90.4 | 88.4 |
| MT-DNN | 68.4 | 96.5 | 92.7/90.3 | 73.7/89.9 | 91.1/90.7 | 87.9/87.4 | 96.0 | 86.3 | 89.0 | 87.6 |
| FreeLB-RoBERTa | 68.0 | 96.8 | 93.1/90.8 | 74.8/90.3 | 92.4/92.2 | 91.1/90.7 | 98.8 | 88.7 | 89.0 | 88.8 |
| ALICE v2 | 69.2 | 97.1 | 93.6/91.5 | 74.4/90.7 | 92.7/92.3 | 90.7/90.2 | 99.2 | 87.3 | 89.7 | 89.0 |
| ALBERT | 69.1 | 97.1 | 93.4/91.2 | 74.2/90.5 | 92.5/92.0 | 91.3/90.0 | 99.2 | 89.2 | 91.8 | 89.4 |
| T5 | 70.8 | 97.1 | 91.9/89.2 | 74.6/90.4 | 92.5/92.1 | 92.0/91.7 | 96.7 | 92.5 | 93.2 | 89.7 |
| RoBERTa (Baseline) | 67.8 | 96.7 | 92.3/89.8 | 74.3/90.2 | 92.2/91.9 | 90.8/90.2 | 98.9 | 88.2 | 89.0 | 88.5 |
| RTRHI | 68.6 | 97.1 | 93.0/90.7 | 74.3/90.2 | 92.4/92.0 | 90.7/90.4 | 95.5 | 87.9 | 89.0 | 88.3 |

Table 2: GLUE Test results, scored by the official evaluation server. CoLA and SST-2 are single-sentence tasks; MRPC, QQP and STS-B are similarity and paraphrase tasks; MNLI, QNLI, RTE and WNLI are natural language inference tasks. All the results are obtained from the GLUE leaderboard at the time of submitting RTRHI (3 November, 2019). The number in parentheses after each task's name indicates the size of the training dataset. The state-of-the-art results are in bold. RTRHI takes RoBERTa as its Transformer-based encoder. Mcc, acc and pearson denote Matthews correlation, accuracy and Pearson correlation coefficient respectively.

Table 2 presents the results of RTRHI and other models on the test set that have been submitted to the GLUE leaderboard. Following Liu et al. (2019b), we fine-tune STS-B and MRPC starting from the MNLI single-task model. Given the similarity between RTE, WNLI and MNLI, and the large scale of the MNLI dataset (393k), we also initialize RTRHI with the weights of the MNLI single-task model before fine-tuning on RTE and WNLI. We submitted the ensemble-model results to the leaderboard. The results show that RTRHI still improves on the strong RoBERTa baseline on the test set. Specifically, RTRHI outperforms RoBERTa on CoLA, SST-2, MRPC, STS-B and MNLI-mm, with improvements of 0.8 points, 0.4 points, 0.7/0.9 points, 0.2/0.1 points and 0.2 points respectively. Meanwhile, RTRHI gets the same results as RoBERTa on QQP and WNLI. By category, RTRHI performs better than RoBERTa on the single-sentence tasks and on the similarity and paraphrase tasks. It is worth noting that our model obtains state-of-the-art results on the SST-2 dataset, with a score of 97.1. These results are quite promising, since HIRE neither modifies the encoder's internal architecture Yang et al. (2019) nor redefines the pre-training procedure Liu et al. (2019b), and we still obtain results comparable to theirs.

5 Analysis

Figure 3: Distribution of contribution scores over different layers when computing the complementary representation for various NLU tasks. The contribution scores are normalized by SoftMax to sum up to 1 for each row. The numbers on the abscissa axis indicate the corresponding layer with 0 being the first layer and 23 being the last layer.

We compare the distribution of contribution scores across different NLU tasks. For each task, we run our best single model over the development set, and the results are calculated by averaging the values across all the examples within each dataset. The results are shown in Figure 3. From the top to the bottom of the heatmap, the results are placed in the following order: single-sentence tasks, similarity and paraphrase tasks, and natural language inference tasks. From Figure 3, we find that the distribution differs among tasks, which demonstrates RTRHI's ability to adapt dynamically to distinct tasks when computing the complementary representation. The largest contribution occurs below the final layer for all tasks except MRPC and RTE, for which all layers contribute nearly equally.
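The per-task averaging used for this analysis amounts to a column-wise mean over the per-example score vectors. A minimal sketch, with a hypothetical function name and made-up toy scores for a two-layer case:

```python
def average_scores(per_example_scores):
    """Average contribution scores layer by layer across examples.

    per_example_scores: list of score vectors, one per example, all of the
    same length (one entry per layer).
    """
    count = len(per_example_scores)
    num_layers = len(per_example_scores[0])
    return [sum(example[i] for example in per_example_scores) / count
            for i in range(num_layers)]


# Two toy examples with two layers each; every vector sums to 1.
task_row = average_scores([[0.2, 0.8], [0.4, 0.6]])
```

Since each per-example vector sums to 1, the averaged row does too, which keeps rows for different tasks directly comparable.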

Figure 4 presents the distribution of contribution scores over different layers for each example of the SST-2 dataset. The number on the ordinate axis denotes the index of the example. We observe that even though there are subtle differences among the examples, they follow certain common patterns when calculating the complementary representation: for example, layers 21 and 22, along with the layers around them, contribute the most for almost all examples. However, the figure also shows that for some examples, all layers contribute almost equally.

Figure 4: Distribution of contribution scores over different layers for each example of SST-2 dataset. The number on the ordinate axis denotes the index of the example.

6 Conclusion

In this paper, we have introduced RTRHI, a novel approach that refines language representation by leveraging the Transformer-based model's hidden layers. Specifically, an HIdden Representation Extractor is used to dynamically generate complementary information, which is incorporated with the preliminary representation in the Fusion Layer. The experimental results demonstrate the effectiveness of the refined language representation for natural language understanding. The analysis highlights the distinct contribution of each layer's output for different tasks and different examples. We expect future work could be conducted in the following directions: (1) exploring a sparse version of the Hidden Representation Extractor for more efficient computation and lower memory usage; (2) incorporating extra knowledge information Zhang et al. (2019a) or structured semantic information Zhang et al. (2019b) into the current language representation in the fusion layer during fine-tuning; (3) integrating multi-task training Caruana (1997) or knowledge distillation Buciluǎ et al. (2006); Hinton et al. (2015) into our model.