SegaBERT: Pre-training of Segment-aware BERT for Language Understanding

by He Bai, et al.

Pre-trained language models have achieved state-of-the-art results in various natural language processing tasks. Most of them are based on the Transformer architecture, which distinguishes tokens by their position index in the input sequence. However, sentence index and paragraph index are also important for indicating a token's position in a document. We hypothesize that a text encoder with richer positional information can generate better contextual representations. To verify this, we propose SegaBERT, a segment-aware BERT that replaces the token position embedding of the Transformer with a combination of paragraph index, sentence index, and token index embeddings. We pre-train SegaBERT on BERT's masked language modeling task, without any affiliated tasks. Experimental results show that our pre-trained model outperforms the original BERT model on various NLP tasks.




1 Introduction

Large neural language models (LMs) trained on massive amounts of text data have shown great potential for transfer learning and achieved state-of-the-art results in various natural language processing tasks. Compared with traditional word embeddings such as Skip-Gram Mikolov et al. (2013) and GloVe Pennington et al. (2014), pre-trained LMs learn contextual representations and can be fine-tuned as text encoders for downstream tasks; examples include OpenAI GPT Radford (2018), BERT Devlin et al. (2018), XLNet Yang et al. (2019b), and BART Lewis et al. (2019). Pre-trained LMs have therefore become a standard technique in natural language processing.

Most of these pre-trained models use a multi-layer Transformer Vaswani et al. (2017) and are pre-trained with self-supervised tasks such as masked language modeling (MLM). The Transformer network was initially used in seq2seq architectures for machine translation, where the input is usually a single sentence. Hence, it is intuitive to distinguish each token by its position index in the input sequence.

However, in the LM pre-training setting, inputs usually range from 512 to 1024 tokens and span multiple sentences and paragraphs. Although the token position embedding helps the Transformer be aware of token order by distinguishing each token in the input sequence, the sentence order, paragraph order, token position within a sentence, sentence position within a paragraph, and paragraph position within a document all remain implicit. Such segmentation information is essential for language understanding and can help a text encoder generate better contextual representations.

Hence, we propose SegaBERT, a novel segment-aware BERT that jointly encodes the paragraph index within a document, the sentence index within a paragraph, and the token index within a sentence during both pre-training and fine-tuning. We pre-train SegaBERT with the MLM objective under the same settings as BERT, but without next sentence prediction (NSP) or other affiliated tasks. Experimental results show that SegaBERT outperforms BERT on both general language understanding and machine reading comprehension: a 1.17-point average score improvement on GLUE Wang et al. (2019a) and a 1.14/1.54 exact-match/F1 improvement on SQuAD v1.1 Rajpurkar et al. (2016).

2 Model

Figure 1: Input Representation of SegaBERT

Our SegaBERT is based on the BERT architecture, which is a multi-layer transformer-based bidirectional masked language model. For a long sequence of text, SegaBERT leverages the segment information, such as the sentence position and paragraph position information, to learn a better contextual representation for each token. The original BERT uses a learned position embedding to encode the tokens’ position information. Instead of using the global token indexing, we introduce three types of embeddings: Token Index Embedding, Sentence Index Embedding, and Paragraph Index Embedding, as shown in Figure 1. Thus, the global token position is uniquely determined by three parts in SegaBERT: the token index within a sentence, the sentence index within a paragraph, and the paragraph index within a long document.

Input Representation The input is a sequence of tokens, which can be one or more sentences or paragraphs. Similar to the input representation used in BERT, the representation of each token is computed by summing its token embedding, token index embedding, sentence index embedding, and paragraph index embedding. Two special tokens, [CLS] and [SEP], are added to the text sequence before the first token and after the last token. Following BERT, the text is tokenized into subwords with WordPiece, and the maximum sequence length is 512.
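The summation above can be sketched in plain Python. The hidden size and vocabulary size here are toy values, and the lookup tables are random stand-ins for learned embedding matrices; the index ranges (256 token, 100 sentence, 50 paragraph positions) follow the limits stated in the training setup below.

```python
import random

HIDDEN = 8  # toy hidden size; BERT-base uses 768

def make_table(rows, dim, seed):
    """A lookup table of `rows` random vectors, standing in for a
    learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

token_emb     = make_table(1000, HIDDEN, 0)  # toy WordPiece vocabulary
token_idx_emb = make_table(256,  HIDDEN, 1)  # token index within a sentence
sent_idx_emb  = make_table(100,  HIDDEN, 2)  # sentence index within a paragraph
para_idx_emb  = make_table(50,   HIDDEN, 3)  # paragraph index within a document

def input_representation(token_id, token_pos, sent_pos, para_pos):
    """Sum of the four embeddings, as in Figure 1."""
    rows = (token_emb[token_id], token_idx_emb[token_pos],
            sent_idx_emb[sent_pos], para_idx_emb[para_pos])
    return [sum(vals) for vals in zip(*rows)]
```

In a real implementation these tables would be trainable parameters and the sum would be computed over whole batches, but the per-token arithmetic is exactly this element-wise addition.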

Encoder Architecture

A multi-layer bidirectional Transformer encoder is used to encode the contextual representation of the input. With an L-layer Transformer, the last-layer hidden vector of each token is used as its contextualized representation. With the rich segmentation information present in the input representation, the encoder has better contextualization ability.

Training Objective Following BERT, we use the masked LM as our training objective. The other training objective, next sentence prediction, is not used in our model.

Training Setup For SegaBERT-base, we pre-train the model on English Wikipedia. For SegaBERT-large, English Wikipedia and BookCorpus are used. The text is preprocessed following BERT and tokenized into sub-tokens with WordPiece. For each Wikipedia document, we first split it into paragraphs, and all sub-tokens in the p-th paragraph are assigned the same Paragraph Index Embedding. The paragraph index starts from 0 for each document. Similarly, each paragraph is further segmented into sentences, and all sub-tokens in the s-th sentence are assigned the same Sentence Index Embedding. The sentence index starts from 0 for each paragraph. Within each sentence, all sub-tokens are indexed from 0, and the t-th sub-token receives the corresponding Token Index Embedding. The maximum paragraph index, sentence index, and token index are 50, 100, and 256, respectively.
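The indexing scheme above can be sketched as follows. The input `document` is assumed to be already split into paragraphs, sentences, and sub-tokens; clamping out-of-range indices to the stated maxima (50/100/256) is our assumption, since the text does not specify how longer inputs are handled.

```python
def assign_indices(document, max_para=50, max_sent=100, max_tok=256):
    """Assign a (paragraph, sentence, token) index triple to every
    sub-token of a document. Paragraph indices restart at 0 per
    document, sentence indices per paragraph, and token indices per
    sentence. `document` is a list of paragraphs, each a list of
    sentences, each a list of sub-tokens."""
    triples = []
    for p, paragraph in enumerate(document):
        for s, sentence in enumerate(paragraph):
            for t, _tok in enumerate(sentence):
                triples.append((min(p, max_para - 1),
                                min(s, max_sent - 1),
                                min(t, max_tok - 1)))
    return triples
```

For example, a document with two paragraphs (the first containing two sentences) yields triples in which the sentence and token counters reset at each paragraph and sentence boundary, respectively.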

We conduct our experiments with two model sizes, SegaBERT-base and SegaBERT-large:

  • SegaBERT-base: L=12, H=768, A=12

  • SegaBERT-large: L=24, H=1024, A=24

Here, L denotes the number of Transformer layers, H the hidden size, and A the number of self-attention heads.

The pre-training runs on 16 Tesla V100 GPU cards. We train 500K steps for SegaBERT-base and 1M steps for SegaBERT-large. For optimization, we use Adam with a learning rate of 1e-4, β1=0.9, β2=0.999, learning-rate warm-up over the first 1% of total steps, and linear decay of the learning rate.
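The schedule just described (linear warm-up over the first 1% of steps, then linear decay) can be written as a simple step-to-rate function; this is a generic sketch of the stated schedule, not the authors' training code, and the decay is assumed to end at zero.

```python
def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.01):
    """Learning rate at a given step: linear warm-up from 0 to
    `peak_lr` over the first `warmup_frac` of steps, then linear
    decay back to 0 at `total_steps`."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For the 500K-step base-model run this means the rate peaks at step 5,000 and decreases linearly thereafter.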

3 Experiments

Task (Metrics)          | BERT dev | SegaBERT dev | BERT test | SegaBERT test
CoLA (Matthew Corr.)    | 60.6     | 65.3         | 60.5      | 62.6
SST-2 (Acc.)            | 93.2     | 94.7         | 94.9      | 94.8
MRPC (F1)               | -        | 92.3         | 89.3      | 89.7
STS-B (Spearman Corr.)  | -        | 90.3         | 86.5      | 88.6
QQP (F1)                | -        | 89.1         | 72.1      | 72.5
MNLI-m (Acc.)           | 86.6     | 87.6         | 86.7      | 87.9
MNLI-mm (Acc.)          | -        | 87.5         | 85.9      | 87.7
QNLI (Acc.)             | 92.3     | 93.6         | 92.7      | 94.0
RTE (Acc.)              | 70.4     | 78.3         | 70.1      | 71.6
Average                 | -        | 86.5         | 82.1      | 83.3
Table 1: Results on the GLUE benchmark for the large models (SegaBERT-large trained on Wikipedia+BookCorpus for 1M steps). BERT-large dev scores are from Sun et al. (2019); BERT-large test scores are from Devlin et al. (2018).

In this section, we present the results of SegaBERT on several downstream tasks: the General Language Understanding Evaluation (GLUE) benchmark and extractive question answering (SQuAD v1.1).

3.1 GLUE

On the GLUE benchmark, we conduct the fine-tuning experiments as follows. For single-sentence classification, such as sentiment classification (SST-2), the sentence is assigned Paragraph Index 0 and Sentence Index 0. For sentence-pair classification, such as question-answer entailment (QNLI), the first sentence is assigned Paragraph Index 0 and Sentence Index 0, and the second sentence is assigned Paragraph Index 1 and Sentence Index 0.
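The two assignment rules above can be sketched directly; how the [CLS] and [SEP] tokens are indexed is not stated in this section, so special tokens are omitted from the sketch.

```python
def single_sentence_indices(tokens):
    """Single-sentence tasks (e.g. SST-2): the whole sentence gets
    Paragraph Index 0 and Sentence Index 0, tokens indexed from 0."""
    return [(0, 0, t) for t in range(len(tokens))]

def sentence_pair_indices(tokens_a, tokens_b):
    """Sentence-pair tasks (e.g. QNLI): the first sentence gets
    Paragraph Index 0, the second Paragraph Index 1; both get
    Sentence Index 0, and token indices restart per sentence."""
    return ([(0, 0, t) for t in range(len(tokens_a))]
            + [(1, 0, t) for t in range(len(tokens_b))])
```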

Our base models are trained mainly for ablation purposes and are hence fine-tuned without grid search. The default hyper-parameters are a learning rate of 3e-5, a batch size of 256, and 3 training epochs. The other hyper-parameters follow the HuggingFace Transformers library.


For the large model, we pre-trained a large SegaBERT and conducted a grid search on the GLUE dev set for the low-resource tasks (CoLA, MRPC, RTE, SST-2, and STS-B). For QQP, MNLI, and QNLI, the hyper-parameters are the same as for the base model. Our grid search space is as follows:

  • Batch size: 16, 24, 32

  • Learning rate: 2e-5, 3e-5, 5e-5

  • Number of epochs: 3-10

As Table 1 shows, the average GLUE score of SegaBERT is higher than that of BERT on both the dev set and the test set. Our large model outperforms BERT-large by over 1.2 points in average GLUE score and achieves better scores on nearly all tasks. These results demonstrate SegaBERT's effectiveness and generalization capability in natural language understanding.

3.2 Reading Comprehension

System                                        | EM   | F1
XLNet (Single+DA) Yang et al. (2019b)         | 89.7 | 95.1
StructBERT large (Single) Wang et al. (2019b) | 85.2 | 92.0
KT-NET Yang et al. (2019a)                    | 85.2 | 91.7
BERT base (Single) Devlin et al. (2018)       | 80.8 | 88.5
BERT large (Single) Devlin et al. (2018)      | 84.1 | 90.9
BERT large (Single+DA) Devlin et al. (2018)   | 84.2 | 91.1
SegaBERT base (Single)                        | 83.2 | 90.2
SegaBERT large (Single)                       | 85.3 | 92.4
Table 2: Evaluation results (EM/F1) on the SQuAD v1.1 dev set.
Task (Metrics)          | BERT base   | BERT with P.S. | SegaBERT base
CoLA (Matthew Corr.)    | 54.7        | 48.6           | 52.1
SST-2 (Acc.)            | 91.4        | 91.5           | 92.1
MRPC (F1)               | 89.7        | 91.5           | 91.2
STS-B (Spearman Corr.)  | 87.8        | 88.2           | 88.7
QQP (F1)                | 86.5        | 87.1           | 87.0
MNLI-m (Acc.)           | 83.2        | 83.2           | 83.8
MNLI-mm (Acc.)          | 83.4        | 83.7           | 84.1
QNLI (Acc.)             | 90.4        | 91.5           | 91.5
RTE (Acc.)              | 61.0        | 62.1           | 66.4
GLUE Average            | 80.9        | 80.8           | 81.9
SQuAD v1.1 (EM/F1)      | 81.89/89.40 | 83.0/90.3      | 83.23/90.15
Table 3: Dev-set results on GLUE and SQuAD v1.1 for base models, all trained on Wikipedia for 500K steps.

For the reading comprehension task, we fine-tune as follows: the question is assigned Paragraph Index 0 and Sentence Index 0. For a context with n paragraphs, Paragraph Indices 1 through n are assigned to them in order. Within each paragraph, the sentences are indexed from 0.
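This assignment can be sketched as below; as with the GLUE sketch, the treatment of [CLS]/[SEP] special tokens is not specified here, so they are omitted.

```python
def reading_comprehension_indices(question_tokens, context_paragraphs):
    """(paragraph, sentence, token) triples for a QA input: the
    question gets Paragraph 0 / Sentence 0; the i-th context
    paragraph gets Paragraph i+1, with sentence indices restarting
    per paragraph and token indices per sentence. `context_paragraphs`
    is a list of paragraphs, each a list of tokenized sentences."""
    triples = [(0, 0, t) for t in range(len(question_tokens))]
    for p, paragraph in enumerate(context_paragraphs, start=1):
        for s, sentence in enumerate(paragraph):
            for t in range(len(sentence)):
                triples.append((p, s, t))
    return triples
```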

This task can benefit from segment information more than the GLUE tasks, because the reading comprehension context is usually longer and spans several sentences. The segment position embeddings can guide the self-attention layers to encode the text better.

We fine-tune our SegaBERT model on SQuAD v1.1 for 4 epochs with a batch size of 128 and a learning rate of 3e-5. As Table 2 shows, without any data augmentation (DA) or model ensembling, SegaBERT-large outperforms BERT-large with DA, StructBERT Wang et al. (2019b), which pre-trains BERT with multiple unsupervised tasks, and KT-NET Yang et al. (2019a), which augments BERT with external knowledge bases. Although a gap to XLNet remains, our proposed method is not specific to BERT but applies to Transformers in general. We believe that segmentation information can also help other pre-trained models, including XLNet, which could be verified in future work.

3.3 Ablation Study

We further conduct an ablation study on token position embeddings. Our proposed method replaces the original 512 token position embeddings with 50 paragraph embeddings, 100 sentence embeddings, and 256 token embeddings (restarting from 0 for each sentence of the input sequence), so the total parameter size is reduced. For comparison, we pre-train a base model, BERT with P.S., by adding paragraph embeddings and sentence embeddings on top of the original BERT token position embeddings (indexed globally over the input sequence); this model's parameter size is larger than BERT's.

The results are shown in Table 3. From this table, we can see that BERT with P.S., pre-trained with paragraph and sentence indices, outperforms the BERT base model on most tasks, except on the CoLA dev set, possibly due to the smaller training data size. Moreover, the SQuAD v1.1 results of BERT with P.S. and SegaBERT are similar, and both significantly outperform BERT base, which indicates that segmentation information is crucial for machine reading comprehension. These results further demonstrate the effectiveness and generalization capability of our method.

4 Conclusion

In this paper, we proposed a novel segment-aware Transformer that incorporates richer positional information into pre-training. Applying this method to BERT pre-training, we obtained a new model named SegaBERT. Experimental results demonstrate that the proposed method improves BERT on a variety of downstream tasks, including the GLUE benchmark and SQuAD v1.1 question answering.


Acknowledgments

This work was partially supported by NSERC OGP0046506, the National Key R&D Program of China 2018YFB1003202, and the Canada Research Chair Program.

We would like to thank Wei Zeng and his team at Peng Cheng Laboratory (PCL) for their computing-resource support for this project.


References

  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR abs/1910.13461.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP 2014, pp. 1532–1543.
  • A. Radford (2018) Improving language understanding by generative pre-training.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, pp. 2383–2392.
  • Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang (2019) ERNIE 2.0: a continual pre-training framework for language understanding. CoRR abs/1907.12412.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR 2019.
  • W. Wang, B. Bi, M. Yan, C. Wu, Z. Bao, L. Peng, and L. Si (2019b) StructBERT: incorporating language structures into pre-training for deep language understanding. CoRR abs/1908.04577.
  • A. Yang, Q. Wang, J. Liu, K. Liu, Y. Lyu, H. Wu, Q. She, and S. Li (2019a) Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of ACL 2019, pp. 2346–2357.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019b) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pp. 5754–5764.