Pre-trained language models such as ELMo [peters2018deep], BERT [devlin2019BERT], ERNIE-Baidu [sun2019ernie, sun2019ernie2], ERNIE-Tsinghua [zhang2019ernie], XLNet [yang2019xlnet], RoBERTa [Liu2019RoBERTaAR] and MegatronLM (https://nv-adlr.github.io/MegatronLM) have demonstrated remarkable success in modeling contextualized word representations by utilizing massive amounts of training text. As a fundamental technique in natural language processing (NLP), language models pre-trained on text can be easily transferred to downstream NLP tasks through finetuning, achieving state-of-the-art performance on many tasks including sentiment analysis, machine reading comprehension, sentence matching, named entity recognition and natural language inference.
The existing pre-trained language models are mostly learned from English corpora (e.g., BooksCorpus and English Wikipedia). There are several attempts to train models specifically for the Chinese language, including Google’s BERT [devlin2019BERT] for Chinese, ERNIE-Baidu [sun2019ernie, sun2019ernie2] and BERT-WWM [cui2019pre]. All of the models are based on Transformer [vaswani2017attention] and trained on two unsupervised tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). In the MLM task, the model learns to recover the masked words in the training sentences. In the NSP task, it tries to predict whether one sentence is the next sentence of the other. One of the main differences among the Chinese models lies in their word masking strategy in the MLM task. Google’s BERT masks each Chinese character or WordPiece token [wu2016google] independently. ERNIE-Baidu makes the MLM task more challenging by masking the entities or phrases in a sentence as a whole, where each entity or phrase may contain multiple characters or tokens. BERT-WWM takes a similar strategy called Whole Word Masking (WWM), which enforces that all the tokens belonging to a Chinese word are masked together. Besides, the most recently published ERNIE-Baidu 2.0 [sun2019ernie2] also incorporates additional pre-training tasks such as Token-Document Relation Prediction and Sentence Reordering.
In this technical report, we present our practice of pre-training the language models NEZHA (NEural contextualiZed representation for CHinese lAnguage understanding), which is currently based on BERT and trained on Chinese text. Specifically, we employ a technique called Functional Relative Positional Encoding in our model. In the vanilla Transformer as well as the BERT model, the positional encoding of each word in the sequence is a vector with its absolute position information encoded. The positional encodings are added to the word embeddings as the inputs to the Transformer. There are two typical ways to determine the positional encodings. One is the functional positional encoding, where the positional encodings are determined by some pre-defined functions (e.g., sinusoidal functions in [vaswani2017attention]). The other is the parametric positional encoding, where the encodings are part of the model parameters and learned as in [devlin2019BERT]. [shaw2018self] proposes a parametric relative positional encoding, where the relative position information is incorporated in the self-attention layers of Transformer. Later, Transformer-XL [dai2019transformer] and XLNet [yang2019xlnet] propose using a sinusoid encoding matrix and two trainable bias terms to represent the relative positions. In this technical report, we employ a functional relative positional encoding scheme, which encodes the relative positions in self-attention by pre-defined functions without any trainable parameter. Our empirical study shows that it is an effective positional encoding scheme for pre-trained language models, and it yields consistent gains in our experiments. Besides, in training NEZHA we employ three techniques shown to be effective in the pre-training of the BERT model: Whole Word Masking [cui2019pre], Mixed Precision Training [micikevicius2017mixed] and the LAMB Optimizer [you2019reducing].
The contribution of this technical report is that we systematically study the problem of pre-training language models on large-scale Chinese corpora, evaluate the models on several Chinese NLU tasks, and assess the effectiveness of training factors including the positional encoding scheme, the masking strategy, the sources of training corpora, and the length of training sequences. We will release our NEZHA models as well as the source code to the community.
2 Pre-training NEZHA Models
In this section, we present our NEZHA model in detail. Section 2.1 presents the preliminaries of the BERT model and the positional encoding schemes. Section 2.2 presents the functional relative positional encoding adopted in our model. Sections 2.3, 2.4 and 2.5 introduce the three techniques used in our pre-training, i.e., whole word masking, mixed precision training and the LAMB optimizer.
2.1 Preliminaries: BERT Model & Positional Encoding
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model, which is a stack of Transformer encoders. Each Transformer encoder is a multi-head self-attention layer followed by a position-wise feed-forward network. It uses residual connections around each sub-layer, followed by a layer normalization. We refer the reader to [vaswani2017attention] for more details of the Transformer architecture. Each sample in the training data of BERT is a pair of sentences. In each sample, 12% of the tokens are masked and 1.5% of the tokens are randomly replaced by another token in the vocabulary. Besides, in the training set, each sample (containing sentences A and B) is constructed as follows: 50% of the time, B is actually the next sentence of A, and 50% of the time, B is a random sentence from the corpus that is not the next sentence of A. In the pre-training phase, BERT has two tasks. One is masked language modeling (MLM), which aims to predict the masked tokens from the other tokens. The second pre-training task is next sentence prediction (NSP), which predicts whether the second sentence in each training sample is the next sentence of the first sentence. In some sense, BERT can be regarded as a denoising auto-encoder, since one of its training objectives is to recover the data with noise added.
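The sample construction described above can be illustrated with a minimal Python sketch. It is not the actual BERT data pipeline: the function names `make_nsp_pairs` and `corrupt`, the `[MASK]` string, and the choice to draw the random sentence from the same sentence list are all assumptions made for illustration.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (A, B, is_next) samples; B is the true next sentence about half of the time.

    Assumes at least three distinct sentences so that a negative can always be drawn.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Draw a random sentence that is neither A itself nor its successor.
            candidates = [s for k, s in enumerate(sentences) if k not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

def corrupt(tokens, vocab, rng, p_mask=0.12, p_replace=0.015):
    """Mask about 12% of the tokens and replace about 1.5% by a random vocabulary token."""
    corrupted, targets = [], []
    for tok in tokens:
        r = rng.random()
        if r < p_mask:
            corrupted.append("[MASK]")
            targets.append(tok)       # the MLM task must recover this token
        elif r < p_mask + p_replace:
            corrupted.append(rng.choice(vocab))
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)      # untouched position, no prediction target
    return corrupted, targets
```

The MLM objective is computed only at positions whose target is not `None`, while the `is_next` flag is the label of the NSP task.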
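In Transformer, each attention head operates on a sequence of tokens $x = (x_1, \ldots, x_n)$, where $x_i \in \mathbb{R}^{d_x}$, and outputs a sequence $z = (z_1, \ldots, z_n)$ of the same length, where $z_i \in \mathbb{R}^{d_z}$. Each attention head has three parametric matrices $W^K, W^Q, W^V \in \mathbb{R}^{d_x \times d_z}$ to be learned. The output $z_i$ is calculated as follows.
$$z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V) \tag{1}$$
The attention score $\alpha_{ij}$ between the hidden states in position $i$ and position $j$ is computed by using a softmax function:
$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}} \tag{2}$$
where $e_{ij}$ is the scaled dot product between the linear transformations of the input elements:
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}} \tag{3}$$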
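For concreteness, a single attention head of this kind can be sketched in plain Python. The names `attention_head`, `w_q`, `w_k`, `w_v`, `z` and `alpha` are illustrative choices, and matrices are represented as lists of rows; a real implementation would use a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """x: n x d_x inputs; w_q, w_k, w_v: d_x x d_z projections. Returns (z, alpha)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_z = len(w_q[0])
    # Scaled dot product between the query at position i and the key at position j.
    e = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_z) for k_row in k]
         for q_row in q]
    alpha = [softmax(row) for row in e]   # attention scores; each row sums to 1
    z = matmul(alpha, v)                  # weighted sum of the projected values
    return z, alpha
```

Note that nothing in this computation depends on the positions of the tokens themselves, which is why positional information has to be injected separately.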
Since the multi-head attention in Transformer (and BERT) is permutation invariant, it is not sensitive to the word order. Therefore, [vaswani2017attention] incorporates an absolute positional encoding for each position, which is an embedding vector added directly to the token embedding. Later on, [shaw2018self] proposes a parametric relative positional encoding for Transformer. In the relative positional encoding scheme, the computation of the attention scores involves a parametric embedding of the relative distance between the two positions. Specifically, it modifies the computation of the output $z_i$ in equation 1 and the $e_{ij}$ in equation 3 as follows:
$$z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V + a_{ij}^V) \tag{4}$$
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^T}{\sqrt{d_z}} \tag{5}$$
In the two equations above, $a_{ij}^V, a_{ij}^K \in \mathbb{R}^{d_z}$ are two vectors with the relative position between $i$ and $j$ encoded, and they are shared across all attention heads. Transformer-XL [dai2019transformer] and XLNet [yang2019xlnet] implement the relative positional encoding with a different formulation. We refer the reader to their papers for more details.
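The effect of the two relative position vectors can be sketched as follows. This is an illustrative sketch only: the function name `relative_attention` and the convention of passing the relative vectors as callables `a_k(i, j)` and `a_v(i, j)` are assumptions, and the inputs `q`, `k`, `v` stand for the already projected queries, keys and values.

```python
import math

def relative_attention(q, k, v, a_k, a_v):
    """q, k, v: n x d_z projected inputs.
    a_k(i, j), a_v(i, j): return length-d_z vectors encoding the relative position of i and j."""
    n, d_z = len(q), len(q[0])
    z = []
    for i in range(n):
        # Score: query i against (key j + relative key vector), scaled by sqrt(d_z).
        e = [sum(qc * (kc + ac) for qc, kc, ac in zip(q[i], k[j], a_k(i, j))) / math.sqrt(d_z)
             for j in range(n)]
        m = max(e)
        w = [math.exp(s - m) for s in e]
        total = sum(w)
        alpha = [s / total for s in w]
        # Output: weighted sum of (value j + relative value vector).
        z.append([sum(alpha[j] * (v[j][c] + a_v(i, j)[c]) for j in range(n))
                  for c in range(d_z)])
    return z
```

In a parametric scheme, `a_k` and `a_v` would look up learned embeddings indexed by the relative distance; in a functional scheme, they would evaluate a fixed function of that distance.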
2.2 Functional Relative Positional Encoding
In the current version of NEZHA, we employ functional relative positional encoding, where the computation of the outputs and attention scores involves sinusoidal functions of their relative position. This idea is inspired by the functional absolute positional encoding adopted in Transformer [vaswani2017attention]. Specifically, in our model, $a_{ij}^V$ and $a_{ij}^K$ are both derived from sinusoidal functions and fixed during model training. In the remainder of this technical report, we write $a_{ij}$ to present the formulation of $a_{ij}^V$ and $a_{ij}^K$ for clarity. Consider the dimension $2k$ and the dimension $2k+1$ of $a_{ij}$ respectively,
$$a_{ij}[2k] = \sin\left(\frac{j - i}{10000^{2k / d_z}}\right) \tag{6}$$
$$a_{ij}[2k+1] = \cos\left(\frac{j - i}{10000^{2k / d_z}}\right) \tag{7}$$
That is, each dimension of the positional encoding corresponds to a sinusoid, and the sinusoidal functions for different dimensions have different wavelengths. In the above equations, $d_z$ is equal to the hidden size per head of the NEZHA model (i.e., the hidden size divided by the number of heads). The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We choose the fixed sinusoidal functions mainly because they may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
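As a small illustration of this scheme, the following sketch computes the fixed sinusoidal vector for a pair of positions. The function name `frpe` is hypothetical, the base 10000 follows the sinusoidal encoding of [vaswani2017attention], and the per-head size `d_z` is assumed to be even.

```python
import math

def frpe(i, j, d_z):
    """Fixed sinusoidal vector for the relative position j - i (d_z assumed even)."""
    rel = j - i
    vec = [0.0] * d_z
    for k in range(d_z // 2):
        # Dimension pair (2k, 2k+1) shares one wavelength, which grows geometrically with k.
        angle = rel / (10000 ** (2 * k / d_z))
        vec[2 * k] = math.sin(angle)
        vec[2 * k + 1] = math.cos(angle)
    return vec
```

Because the vector depends only on `j - i` and has no trainable parameters, the same function can be evaluated for relative distances that never occurred during training.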
2.3 Whole Word Masking
In the vanilla BERT, each token or Chinese character is masked randomly. In [cui2019pre], the whole word masking (WWM) strategy is found to be more effective than random masking for training BERT. In WWM, once a Chinese character is masked, the other characters belonging to the same Chinese word are all masked together. In implementing WWM for NEZHA, we used the tokenization tool Jieba (https://github.com/fxsjy/jieba) for Chinese word segmentation (i.e., finding the boundaries of the Chinese words). In the WWM training data, each sample contains several masked Chinese words; the masked Chinese characters amount to roughly 12% of its length, and a further 1.5% of the characters are randomly replaced.
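A minimal sketch of whole word masking is given below. It assumes the sentence has already been segmented into words (for example by a segmenter such as Jieba), and it leaves out the random replacement step; the function name `whole_word_mask` and the budget heuristic are illustrative assumptions, not the procedure used to build the NEZHA data.

```python
import random

def whole_word_mask(words, rng, mask_ratio=0.12):
    """words: a sentence already segmented into Chinese words (strings).

    Chooses whole words at random until roughly mask_ratio of the characters are masked,
    and returns the character sequence with every character of a chosen word masked.
    """
    n_chars = sum(len(w) for w in words)
    budget = max(1, round(n_chars * mask_ratio))
    order = list(range(len(words)))
    rng.shuffle(order)
    chosen, masked = set(), 0
    for idx in order:
        if masked >= budget:
            break
        chosen.add(idx)
        masked += len(words[idx])
    out = []
    for idx, w in enumerate(words):
        # All characters of a chosen word are masked together.
        out.extend(["[MASK]"] * len(w) if idx in chosen else list(w))
    return out
```

For instance, with the segmented input `["华为", "诺亚", "方舟", "实验室"]`, a masked position never falls inside a word while leaving its remaining characters visible.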
2.4 Mixed Precision Training
In the pre-training of our NEZHA models, we adopt the technique of mixed precision training [micikevicius2017mixed]. The technique can speed up the training by 2-3 times and also reduce the space consumption of the model, as a result of which, a larger batch size could be utilized.
Conventionally, the training of deep neural networks uses FP32 (i.e., the single-precision floating-point format) to represent all the variables (including the model parameters and gradients) involved in the training. Mixed precision training [micikevicius2017mixed] adopts mixed precision in the training. Specifically, it maintains a single-precision copy (called Master Weights) of the weights in the model. In each training iteration, it rounds the Master Weights into FP16 (i.e., the half-precision floating-point format) and performs the forward and backward pass with the weights, activations and gradients stored in FP16 format. Finally, it converts the gradients into FP32 format and updates the Master Weights by using the FP32 gradients.
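The role of the Master Weights can be illustrated with a toy scalar example. The sketch below emulates FP16 and FP32 rounding with Python's `struct` module; the function names, the constant gradient and the learning rate are hypothetical, and the sketch covers only the weight update loop described above.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sgd_mixed_precision(master_w, grad_fn, lr, steps):
    """Keep an FP32 master weight; run the (toy) forward/backward pass in FP16."""
    for _ in range(steps):
        w16 = to_fp16(master_w)        # FP16 copy used for the forward and backward pass
        g16 = to_fp16(grad_fn(w16))    # gradient produced in FP16
        g32 = to_fp32(g16)             # converted to FP32 before the update
        master_w = to_fp32(master_w - lr * g32)
    return master_w

def sgd_fp16_only(w16, grad_fn, lr, steps):
    """Baseline that stores and updates the weight in FP16 only."""
    for _ in range(steps):
        g16 = to_fp16(grad_fn(w16))
        w16 = to_fp16(w16 - lr * g16)
    return w16
```

With a weight of 1.0, a constant gradient of -1.0 and a learning rate of 1e-4, each update is smaller than half the spacing between adjacent FP16 values near 1.0, so the FP16-only weight never moves, whereas the FP32 master weight accumulates the updates.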
2.5 LAMB Optimizer
The LAMB optimizer [you2019reducing] is designed for the large batch-size synchronous distributed training of deep neural networks (DNNs). Training DNNs with large mini-batches is an effective method to speed up the training. However, without careful tuning of the learning rate schedule, the performance could be largely harmed when the batch size is beyond a certain threshold. Instead of hand-tuning the learning rate, the LAMB optimizer employs a general adaptation strategy and meanwhile provides insight into the convergence by theoretical analysis. The optimizer speeds up the training of BERT by using a very large batch size (up to more than 30k in [you2019reducing]) without incurring a loss of performance, and it even obtains state-of-the-art performance on many tasks. Remarkably, the training time of BERT is significantly reduced from 3 days to 76 minutes.
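To give a flavor of the adaptation strategy, the following is a rough sketch of one layer-wise update in the style of LAMB, based on our reading of [you2019reducing]: an Adam-style direction with weight decay is rescaled by the ratio between the norm of the layer's weights and the norm of the update. The function name `lamb_step` and all hyperparameter values are illustrative assumptions and are not the settings used for NEZHA.

```python
import math

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer.

    w, g, m, v: flat lists of floats (weights, gradients, first and second moments).
    t: 1-based step counter used for bias correction.
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Adam-style direction plus weight decay.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    # Layer-wise trust ratio: the step length becomes proportional to the weight norm.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return w, m, v
```

In this sketch the length of each step equals the learning rate times the norm of the layer's weights, independently of the gradient scale, which is one intuition for why a single learning rate schedule can remain usable as the batch size grows.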
3 Experiments
In this section, we report the experimental results on pre-training our NEZHA models for Chinese text and finetuning on Chinese NLU downstream tasks. It should be noted that the training techniques are not limited to Chinese and can be readily applied to other languages.
3.1 Experimental Setting
We adopt three Chinese corpora for pre-training the NEZHA models:
Chinese Wikipedia (https://zh.wikipedia.org/wiki/). Chinese Wikipedia is a Chinese-language encyclopedia containing 1,067,552 articles. We downloaded the latest Chinese Wikipedia dump and cleaned the raw data with the tool WikiExtractor (https://github.com/attardi/wikiextractor). The cleaned corpus contains both simplified and traditional Chinese and has roughly 202M tokens.
Baidu Baike (https://baike.baidu.com/). We crawled webpages from Baidu Baike, which is a Chinese-language, collaborative, web-based encyclopedia owned and produced by the Chinese search engine Baidu. As of August 2018, Baidu Baike has more than 15.4 million articles. The cleaned corpus contains 4,734M tokens.
Chinese News. We crawled Chinese News corpus from multiple news websites (e.g., Sina News). The cleaned corpus contains 5,600M tokens.
For each corpus above, we prepared two versions of the pre-training data for NEZHA. The first version is processed in the same way as in [devlin2019BERT] and contains 12% masked Chinese characters and 1.5% randomly replaced Chinese characters. We used the tools provided by the BERT GitHub project (https://github.com/google-research/bert) to convert the text data into pre-training examples. The second version is based on the whole word masking (WWM) strategy. We created the WWM pre-training examples with the Chinese word segmenter Jieba for identifying the boundaries of Chinese words. In the WWM examples, each sample contains several masked Chinese words; the masked Chinese characters amount to roughly 12% of its length, and a further 1.5% of the characters are randomly replaced. Table 1 summarizes the statistics of the datasets for several pre-trained models.
We train the NEZHA models on 10 servers on Huawei Cloud (https://www.huaweicloud.com/product/modelarts.html), each of which has 8 NVIDIA Tesla V100 GPUs with 32GB memory. The distributed training algorithm is Ring-AllReduce (https://github.com/baidu-research/baidu-allreduce), employed with the Horovod framework [sergeev2018horovod]. We train each model from scratch and terminate the training when the training loss converges. For the NEZHA base models, the learning rate is warmed up to its maximum over 1800 steps and then decayed linearly. The batch size on each GPU is 180, and thus the total batch size is 180 * 8 * 10 = 14400. For the NEZHA large models, the learning rate is warmed up to its maximum over 1800 steps and then decayed polynomially. The batch size on each GPU is 64, and thus the total batch size is 64 * 8 * 10 = 5120. In addition, we adopt mixed-precision training using FP16 [micikevicius2017mixed] in the pre-training phase.
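The batch size arithmetic and the shape of a warm-up-then-decay schedule can be sketched as follows. The function names are hypothetical, and `max_lr` and `total_steps` are placeholders supplied by the caller: the sketch does not encode the actual maximum learning rates or training lengths of the NEZHA models.

```python
def global_batch_size(per_gpu, gpus_per_server=8, servers=10):
    """Total batch size under synchronous data-parallel training."""
    return per_gpu * gpus_per_server * servers

def learning_rate(step, max_lr, warmup_steps=1800, total_steps=100000, power=1.0):
    """Linear warm-up to max_lr, then polynomial decay to zero.

    power=1.0 gives linear decay; total_steps is a hypothetical training length.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return max_lr * remaining ** power
```

Here `global_batch_size(180)` gives 14400 and `global_batch_size(64)` gives 5120, matching the totals stated above.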
|Model||Pre-Training Corpora||#Tokens||Vocabulary size||Activation function||Hidden Size/#Layers||#Heads|
|ERNIE-Baidu 1.0||Wikipedia+Baike+Tieba||9,388 M||18,000||ReLU||768/12||12|
|Model||Pre-Training Tasks||Training Precision||Optimizer||Position Encoding|
|BERT||MLM, NSP||Single Precision (FP32)||ADAM||PAPE|
|ERNIE-Baidu 1.0||MLM (KM), NSP||Single Precision (FP32)||ADAM||PAPE|
|ERNIE-Baidu 2.0||MLM (KM), SR & SD, DR & IR||Mixed Precision||ADAM||PAPE|
|NEZHA||MLM (WWM), NSP||Mixed Precision||LAMB||FRPE|
3.2 Experimental Results
In the experiments, we compare the NEZHA models with the state-of-the-art Chinese pre-trained language models: Google’s BERT [devlin2019BERT] for Chinese, BERT-WWM [cui2019pre] and ERNIE-Baidu [sun2019ernie, sun2019ernie2]. Their model configurations are shown in Table 1. We also summarize the pre-training techniques adopted in each Chinese pre-trained language model in Table 2. Note that ERNIE-Baidu has three different versions, which are ERNIE-Baidu 1.0 base, ERNIE-Baidu 2.0 base and ERNIE-Baidu 2.0 large. The ERNIE-Baidu 2.0 versions introduced many different pre-training tasks, and we refer the readers to their papers for the details of these tasks. As shown in the table, the unique technique in our models is the functional relative position encoding. We test the performance of the pre-trained models by finetuning on a variety of natural language understanding (NLU) tasks, which are listed as follows. The hyperparameters for finetuning each task are shown in Table 3.
CMRC (Chinese Machine Reading Comprehension 2018) [cui2018span]: A machine reading comprehension task that returns an answer span in a given passage for a given question.
XNLI (Cross-lingual Natural Language Inference) [conneau2018xnli]: The Chinese portion of XNLI, which is a version of MultiNLI where the dev and test sets have been translated (by humans) into 15 languages. XNLI is a natural language inference task. The goal of this task is to predict if the second sentence is a contradiction, entailment or neutral to the first sentence.
LCQMC (Large-scale Chinese Question Matching Corpus) [liu2018lcqmc]: A sentence pair matching task. Given a pair of sentences, the task is to determine if the two sentences are semantically equivalent or not.
PD-NER (People’s Daily Named Entity Recognition) (https://github.com/ProHiryu/bert-chinese-ner): A sequence labeling task that identifies the named entities in text. The corpus is from People’s Daily, a Chinese news media outlet.
ChnSenti (Chinese Sentiment Classification) (https://github.com/pengming617/bert_classification): A binary classification task that predicts whether the sentiment of a given sentence is positive or negative.
|Task Name||Batch Size||SL||LR||Epochs||#Train||#Dev||#Test||Domain|
We show the comparison results of different pre-trained models on the aforementioned tasks in Table 4. Among the groups of both base models and large models, either ERNIE-Baidu 2.0 or NEZHA achieves the best performance. Note that part of the results are directly copied from the original papers [cui2019pre, sun2019ernie2]. Due to possible differences in the experimental setting or finetuning methods, the comparison may not be entirely fair. We notice that there are consistent gaps between our implementation and the results reported in [cui2019pre, sun2019ernie2] on the CMRC task. Once the ERNIE-Baidu 2.0 Chinese models are released, we will evaluate them under the same setting and update this report.
|BERT-WWM (in [cui2019pre])||66.30||85.60||79.00||78.20||89.40||87.00||95.30||65.10||95.10||95.40|
|ERNIE-Baidu 1.0 (in [sun2019ernie])||65.10||85.10||79.9||78.4||89.70||87.40||-||-||95.20||95.40|
|ERNIE-Baidu 2.0 base (in [sun2019ernie2])||69.10||88.60||81.20||79.70||90.90||87.90||-||-||95.70||95.50|
|ERNIE-Baidu 2.0 large (in [sun2019ernie2])||71.50||89.90||82.60||81.00||90.90||87.90||-||-||96.10||95.80|
3.3 Ablation Study
In this section, we study the effectiveness of the data and different techniques for training NEZHA, which are listed as follows.
Positional Encoding: the effectiveness of the functional relative positional encoding (FRPE) employed in our work compared with the parametric absolute positional encoding (PAPE) and parametric relative positional encoding (PRPE) adopted in the existing studies.
Masking Strategy: the effect of the whole word masking (WWM) on the performance of the pre-trained models.
Training Sequence Length: the impact of the training with longer sequences.
Training Corpora: the impact of the source of the training data.
With the above objectives, we compare the performances of several variants of NEZHA model, as shown in Table 5. The results demonstrate that the techniques mentioned above generally have positive contributions to the downstream tasks, where functional relative positional encoding shows a notable advantage compared with other positional encoding methods. For instance, we can see that when trained with a maximum of 128 tokens, the model using the absolute positional encodings performs significantly worse than those using relative positional encodings on the CMRC task, where the input passages can be much longer.
|News, PAPE, SL:128||37.96||58.40||78.79||77.72||89.31||86.74||94.87||98.10||94.17||95.67|
|News, PRPE, SL:128||65.26||86.17||79.18||77.98||89.21||86.92||96.93||98.12||94.67||95.08|
|News, FRPE, SL:128||65.95||86.46||79.96||78.32||89.40||87.23||96.69||98.10||95.58||95.75|
|News, FRPE, SL:512||67.79||86.60||80.57||79.52||90.06||86.73||97.04||97.62||95.09||95.08|
|News+Wiki+Baike, FRPE, SL:128||66.95||86.41||81.25||79.06||89.83||87.13||97.21||97.41||95.25||94.42|
|News+Wiki+Baike, FRPE, WWM, SL:128||67.82||86.25||81.25||79.11||89.85||87.10||97.41||98.35||94.75||95.84|
|News+Wiki+Baike, FRPE, WWM, SL:512||66.45||86.16||80.96||79.86||89.64||86.18||96.79||98.10||95.08||95.42|
4 Conclusion
In this technical report, we have presented our practice of training the large-scale pre-trained language models NEZHA on Chinese corpora. We have employed an effective functional relative positional encoding scheme, which leads to a notable improvement over the other positional encodings. The pre-training of the NEZHA models also integrates several techniques, including the whole word masking strategy, mixed precision training, and the LAMB optimizer. Experiments show that our models can achieve state-of-the-art performance on several Chinese natural language understanding tasks. In the future, we plan to continue the work on improving NEZHA on Chinese and other languages and to extend the applications of NEZHA to more scenarios.