Conditional BERT Contextual Augmentation

by   Xing Wu, et al.

We propose a novel data augmentation method for labeled sentences called conditional BERT contextual augmentation. Data augmentation methods are often applied to prevent overfitting and improve the generalization of deep neural network models. The recently proposed contextual augmentation augments labeled sentences by randomly replacing words with more varied substitutions predicted by a language model. BERT demonstrates that a deep bidirectional language model is more powerful than either a unidirectional language model or the shallow concatenation of a forward and a backward model. We retrofit BERT to conditional BERT by introducing a new conditional masked language model task. (The term "conditional masked language model" appeared once in the original BERT paper, where it indicates context-conditional and is equivalent to the term "masked language model". In our paper, "conditional masked language model" indicates that we apply an extra label-conditional constraint to the masked language model.) The well-trained conditional BERT can be applied to enhance contextual augmentation. Experiments on six different text classification tasks show that our method can be easily applied to both convolutional and recurrent neural network classifiers to obtain clear improvement.




1 Introduction

Deep neural network-based models overfit easily and lose their ability to generalize when the training data is limited in size. To address this issue, data augmentation methods are often applied to generate more training samples. Recent years have witnessed great success in applying data augmentation in speech (Jaitly and Hinton, 2013; Ko et al., 2015) and computer vision (Krizhevsky et al., 2012; Simard et al., 1998; Szegedy et al., 2015). Data augmentation in these areas can be easily performed by transformations such as resizing, mirroring, random cropping, and color shifting. However, applying these universal transformations to text is largely random and uncontrollable, which makes it impossible to ensure semantic invariance and label correctness. For example, given a movie review “The actors is good”, by mirroring we get “doog si srotca ehT”, and by random cropping we get “actors is”, both of which are meaningless.
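The two transformations in the example above can be reproduced in a few lines. This is only an illustrative sketch: the helper names `mirror` and `crop` are our own, and the crop window is fixed rather than sampled so that the output matches the example.

```python
def mirror(text: str) -> str:
    """Reverse the character order, the text analogue of image mirroring."""
    return text[::-1]


def crop(text: str, start: int, length: int) -> str:
    """Keep `length` consecutive words starting at word index `start`."""
    words = text.split()
    return " ".join(words[start:start + length])


review = "The actors is good"
mirrored = mirror(review)
cropped = crop(review, start=1, length=2)
print(mirrored)  # doog si srotca ehT
print(cropped)   # actors is
```

Neither output is a valid review, which is why image-style transformations cannot be reused for text without risking the label.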

Existing data augmentation methods for text often lack generality, since they are developed with handcrafted rules or pipelines for specific domains. A general approach to text data augmentation is the replacement-based method, which generates new sentences by replacing words in a sentence with relevant words (e.g. synonyms). However, the words that have synonyms in a handcrafted lexical database such as WordNet (Miller, 1995) are very limited, so replacement-based augmentation with synonyms can produce only a limited diversity of patterns from the original texts. To address this limitation, Kobayashi (2018) proposed contextual augmentation for labeled sentences, which offers a wide range of substitute words predicted by a label-conditional bidirectional language model according to the context. But contextual augmentation suffers from two shortcomings: the bidirectional language model is simply a shallow concatenation of a forward and a backward model, and the use of LSTM models restricts their prediction ability to a short range.

BERT, which stands for Bidirectional Encoder Representations from Transformers, pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. BERT addresses the unidirectional constraint with a “masked language model” (MLM) objective, which masks some percentage of the input tokens at random and predicts the masked words based on their context. This is very similar to how contextual augmentation predicts replacement words. But BERT was proposed to pre-train text representations, so the MLM task is performed in an unsupervised way and takes no label information into consideration.

This paper focuses on replacement-based methods and proposes a novel data augmentation method called conditional BERT contextual augmentation. The method applies contextual augmentation with conditional BERT, which is obtained by fine-tuning BERT. We adopt BERT as our pre-trained language model for two reasons. First, BERT is based on the Transformer, which provides a more structured memory for handling long-term dependencies in text. Second, BERT, as a deep bidirectional model, is strictly more powerful than the shallow concatenation of a left-to-right and a right-to-left model. We therefore apply BERT to contextual augmentation for labeled sentences, offering a wider range of substitute words predicted by the masked language model task. However, the masked language model predicts the masked word based only on its context, so the predicted word may be incompatible with the annotated label of the original sentence. To address this issue, we introduce a new fine-tuning objective: the “conditional masked language model” (C-MLM). The conditional masked language model randomly masks some of the tokens in an input, and the objective is to predict a label-compatible word based on both its context and the sentence label. Unlike Kobayashi’s work, the C-MLM objective allows a deep bidirectional representation by jointly conditioning on both left and right context in all layers. To evaluate how well our augmentation method improves the performance of deep neural network models, we follow Kobayashi (2018) and test it with the two most common neural network structures, LSTM-RNN and CNN, on text classification tasks. Through experiments on six different text classification tasks, we demonstrate that the proposed conditional BERT model augments sentences better than the baselines, and that conditional BERT contextual augmentation can be easily applied to both convolutional and recurrent neural network classifiers. We further explore the connection between our conditional MLM task and the style transfer task and demonstrate that our conditional BERT can also be applied to style transfer.

Our contributions are summarized as follows:

  • We propose a conditional BERT contextual augmentation method. The method allows BERT to augment sentences without breaking label compatibility. Our conditional BERT can further be applied to the style transfer task.

  • Experimental results show that our approach clearly outperforms existing text data augmentation approaches.

To the best of our knowledge, this is the first attempt to alter BERT into a conditional BERT or to apply BERT to text generation tasks.

2 Related Work

2.1 Fine-tuning on Pre-trained Language Model

Language model pre-training has attracted wide attention, and fine-tuning a pre-trained language model has been shown to be effective for improving many downstream natural language processing tasks. Dai and Le (2015) pre-trained on unlabeled data to improve sequence learning with recurrent networks. Howard and Ruder (2018) proposed a general transfer learning method, Universal Language Model Fine-tuning (ULMFiT), with key techniques for fine-tuning a language model. Radford et al. (2018) proposed that large gains on a diverse range of tasks could be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, and achieved large improvements on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018). BERT (Devlin et al., 2018) obtained new state-of-the-art results on a broad range of diverse tasks. BERT pre-trains deep bidirectional representations that jointly condition on both left and right context in all layers, followed by discriminative fine-tuning on each specific task. Unlike previous works that fine-tune a pre-trained language model to perform discriminative tasks, we aim to apply pre-trained BERT to generative tasks by performing the masked language model (MLM) task. To generate sentences that are compatible with given labels, we retrofit BERT to conditional BERT by introducing a conditional masked language model task and fine-tuning BERT on that task.

2.2 Text Data Augmentation

Text data augmentation has been extensively studied in natural language processing. Sample-based methods include downsampling from the majority classes and oversampling from the minority class, both of which perform weakly in practice. Generation-based methods employ deep generative models such as GANs (Goodfellow et al., 2014) or VAEs (Bowman et al., 2015; Hu et al., 2017), trying to generate sentences from a continuous space with desired attributes such as sentiment and tense. However, it is very hard to guarantee the quality of sentences generated by these methods, in terms of both label compatibility and readability. Word replacement augmentation has been applied in some specific areas (Jia and Liang, 2017; Xie et al., 2017; Ebrahimi et al., 2017). Wang and Yang (2015) proposed using neighboring words in continuous representations to create new instances for every word in a tweet in order to augment the training dataset. Zhang et al. (2015) extracted all replaceable words from the given text, randomly chose some of them to be replaced, and substituted them with synonyms from WordNet (Miller, 1995). Kolomiyets et al. (2011) replaced only the headwords, under a task-specific assumption that temporal trigger words usually occur as headwords. They selected the substitute words with the highest scores given by the Latent Words LM (Deschacht and Moens, 2009), which is an LM based on fixed-length contexts. Fadaee et al. (2017) focused on the rare word problem in machine translation, replacing words in a source sentence with only rare words. A word in the translated sentence is also replaced using a word alignment method and a rightward LM. The work most similar to ours is Kobayashi (2018), which used a fill-in-the-blank context for data augmentation by replacing every word in the sentence using a language model. To prevent the generated words from reversing the information related to the labels of the sentences, Kobayashi (2018) introduced a conditional constraint to control the replacement of words. Unlike previous works, we adopt a deep bidirectional language model to perform the replacement, and the attention mechanism within our model allows a more structured memory for handling long-term dependencies in text, resulting in more general and robust improvement on various downstream tasks.

3 Conditional BERT Contextual Augmentation

3.1 Preliminary: Masked Language Model Task

3.1.1 Bidirectional Language Model

In general, a language model (LM) models the probability of generating natural language sentences or documents. Given a sequence $S$ of $N$ tokens, $\langle t_1, t_2, \ldots, t_N \rangle$, a forward language model allows us to predict the probability of the sequence as:

$$p(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_1, t_2, \ldots, t_{i-1})$$

Similarly, a backward language model allows us to predict the probability of the sentence as:

$$p(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_{i+1}, t_{i+2}, \ldots, t_N)$$

Traditionally, a bidirectional language model is a shallow concatenation of independently trained forward and backward LMs.
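As a sanity check on the forward and backward factorizations, the following sketch builds a toy joint distribution over three-token sequences and confirms that both chain-rule products recover the same sequence probability. The vocabulary, the weights, and all function names are illustrative assumptions; a real LM would estimate the conditionals with a neural network rather than compute them from an explicit joint table.

```python
from itertools import product

VOCAB = ("a", "b")
N = 3

# Toy joint distribution over all length-3 sequences (weights are arbitrary).
weights = {seq: i + 1 for i, seq in enumerate(product(VOCAB, repeat=N))}
total = sum(weights.values())
joint = {seq: w / total for seq, w in weights.items()}


def marginal(fixed):
    """Probability that the positions in `fixed` take the given tokens."""
    return sum(p for seq, p in joint.items()
               if all(seq[i] == t for i, t in fixed.items()))


def forward_prob(seq):
    """Product over i of p(t_i | t_1, ..., t_{i-1})."""
    prob = 1.0
    for i in range(len(seq)):
        prefix = {j: seq[j] for j in range(i)}
        prob *= marginal({**prefix, i: seq[i]}) / marginal(prefix)
    return prob


def backward_prob(seq):
    """Product over i of p(t_i | t_{i+1}, ..., t_N)."""
    prob = 1.0
    for i in range(len(seq)):
        suffix = {j: seq[j] for j in range(i + 1, len(seq))}
        prob *= marginal({**suffix, i: seq[i]}) / marginal(suffix)
    return prob


example = ("a", "b", "b")
print(joint[example], forward_prob(example), backward_prob(example))
```

Both directions describe the same distribution; they differ only in which side of the sequence each conditional is allowed to see, which is why concatenating two such models remains shallow.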

3.1.2 Masked Language Model Task

In order to train a deep bidirectional language model, BERT proposed the Masked Language Model (MLM) task, which is also referred to as the Cloze task (Taylor, 1953). The MLM task randomly masks some percentage of the input tokens and then predicts only those masked tokens according to their context. Given a masked token $t_i$, the context is the set of tokens surrounding $t_i$ in the sequence $S$, i.e. the cloze sentence $S \setminus \{t_i\}$. The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary to produce words with a probability distribution $p(\cdot \mid S \setminus \{t_i\})$. The MLM task predicts only the masked words rather than reconstructing the entire input, which suggests that more pre-training steps are required for the model to converge. Pre-trained BERT can augment sentences through the MLM task by predicting new words in masked positions according to their context.
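The masking step and the output softmax can be sketched without any model at all. In the snippet below the vocabulary and the scores at the masked position are hypothetical placeholders for what a trained network would produce, and the helper names are our own.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, num_masks, rng):
    """Replace `num_masks` randomly chosen positions with the mask symbol."""
    positions = sorted(rng.sample(range(len(tokens)), num_masks))
    cloze = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return cloze, positions


def softmax(logits):
    """Turn vocabulary scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
tokens = "this actor is good".split()
cloze, positions = mask_tokens(tokens, num_masks=1, rng=rng)

# Hypothetical scores a model might assign at the masked position.
vocab = ["good", "bad", "boring", "great"]
probs = softmax([2.0, 1.5, 0.5, 1.8])
print(cloze, dict(zip(vocab, probs)))
```

Note that with context-only scores like these, "bad" receives substantial probability even for a positive sentence, which is the label-compatibility problem that conditional BERT targets.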

3.2 Conditional BERT

Figure 1: Model architecture of conditional BERT. The label embeddings in conditional BERT correspond to the segmentation embeddings in BERT, but their functions are different.

As shown in Figure 1, our conditional BERT shares the same model architecture as the original BERT. The differences are the input representation and the training procedure.

The input embeddings of BERT are the sum of the token embeddings, the segmentation embeddings and the position embeddings. For the segmentation embeddings in BERT, a learned sentence A embedding is added to every token of the first sentence, and if a second sentence exists, a sentence B embedding is added to every token of the second sentence. However, the segmentation embeddings have no connection to the actual annotated label of a sentence, such as sense, sentiment or subjectivity, so the predicted word is not always compatible with the annotated label. For example, given a positive movie remark “this actor is good”, suppose the word “good” is masked. Through BERT’s masked language model task, the word predicted in the masked position may well be a negative word such as “bad” or “boring”. New sentences generated by substituting masked words in this way are implausible with respect to their original labels, and will be harmful if added to the corpus for augmentation. To address this issue, we propose a new task: the “conditional masked language model”.

3.2.1 Conditional Masked Language Model

The conditional masked language model randomly masks some of the tokens in a labeled sentence, and the objective is to predict the original vocabulary index of the masked word based on both its context and its label. Given a masked token $t_i$, the context $S \setminus \{t_i\}$ and the label $y$ are both considered, aiming to calculate $p(\cdot \mid y, S \setminus \{t_i\})$ instead of $p(\cdot \mid S \setminus \{t_i\})$. Unlike MLM pre-training, the conditional MLM objective allows the representation to fuse the context information and the label information, which allows us to further train label-conditional deep bidirectional representations.
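The objective of predicting the original word at each masked position can be written as a negative log-likelihood over the masked positions only. The sketch below assumes the standard cross-entropy form, which the text does not spell out, and uses made-up scores for a two-word vocabulary to contrast a context-only prediction with a label-conditioned one.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax distribution."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def masked_nll(logits_by_position, target_ids, masked_positions):
    """Average negative log-likelihood over the masked positions only."""
    losses = []
    for pos in masked_positions:
        log_probs = log_softmax(logits_by_position[pos])
        losses.append(-log_probs[target_ids[pos]])
    return sum(losses) / len(losses)


# "this actor is [MASK]" with label "positive"; vocabulary = ["good", "bad"].
targets = {3: 0}                        # the original word is "good"
context_only_logits = {3: [0.0, 0.0]}   # hypothetical: context cannot decide
label_aware_logits = {3: [2.0, 0.0]}    # hypothetical: label favours "good"

loss_context_only = masked_nll(context_only_logits, targets, [3])
loss_label_aware = masked_nll(label_aware_logits, targets, [3])
print(loss_context_only, loss_label_aware)
```

A model that can use the label has a way to lower this loss on label-sensitive words, which is the pressure that pushes the fine-tuned model toward label-compatible predictions.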

To perform the conditional MLM task, we fine-tune the pre-trained BERT. We replace the segmentation embeddings with label embeddings, which are learned from the annotated labels of labeled datasets. Note that BERT is designed with a segmentation embedding that is either embedding A or embedding B, so when a downstream task dataset has more than two labels, we have to resize the embedding table to match the number of labels. We train conditional BERT with the conditional MLM task on the labeled dataset. After the model has converged, it is expected to predict words in masked positions by considering both the context and the label.
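The change to the input representation can be illustrated with small random tables standing in for learned parameters. The dimensions, table sizes, and function names below are arbitrary assumptions chosen for readability; the point is only that the two-row segment table is swapped for a table with one row per label and that the same label vector is added at every position.

```python
import random


def make_table(rows, dim, rng):
    """Small random embedding table (stand-in for learned parameters)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, label_id, token_table, label_table, position_table):
    """Sum token, label and position embeddings for every input position."""
    label_vec = label_table[label_id]
    return [
        [t + l + p for t, l, p in
         zip(token_table[tok], label_vec, position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


rng = random.Random(0)
DIM, VOCAB_SIZE, MAX_LEN = 4, 10, 8
NUM_LABELS = 5  # e.g. a five-class dataset; BERT's segment table has 2 rows

token_table = make_table(VOCAB_SIZE, DIM, rng)
position_table = make_table(MAX_LEN, DIM, rng)
label_table = make_table(NUM_LABELS, DIM, rng)  # replaces the segment table

vectors = embed([3, 1, 4], 2, token_table, label_table, position_table)
other = embed([3, 1, 4], 0, token_table, label_table, position_table)
print(len(vectors), len(vectors[0]))
```

Because only the label row differs between `vectors` and `other`, any difference in the model's predictions for the two inputs must come from the label signal.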

3.3 Conditional BERT Contextual Augmentation

After the conditional BERT is well trained, we use it to augment sentences. Given a labeled sentence from the corpus, we randomly mask a few words in the sentence. Conditional BERT then predicts various words that are compatible with the label of the sentence. After substituting the masked words with the predicted words, a new sentence is generated, which shares a similar context and the same label with the original sentence. The new sentences are then added to the original corpus. We elaborate the entire process in Algorithm 1.

1:  Alter the segmentation embeddings to label embeddings
2:  Fine-tune the pre-trained BERT using conditional MLM task on labeled dataset D until convergence
3:  for each iteration i=1,2,…,M do
4:     Sample a sentence from D
5:     Randomly mask words
6:     Use the fine-tuned conditional BERT to predict label-compatible words at the masked positions and generate a new sentence
7:  end for
8:  Add new sentences into dataset to get augmented dataset
9:  Perform downstream task on augmented dataset
Algorithm 1: Conditional BERT contextual augmentation algorithm. By fine-tuning the pre-trained BERT, we retrofit BERT to conditional BERT using the conditional MLM task on a labeled dataset. After the model has converged, we use it to augment sentences. The new sentences are added to the dataset to augment it.
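The sampling loop of Algorithm 1 (steps 3 to 8) might look roughly like the sketch below. The function `toy_predict` is a hypothetical stand-in for a fine-tuned conditional BERT, and the tiny dataset, the parameter names `num_new` and `num_masks`, and the label strings are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def augment(dataset, predict_fn, num_new, num_masks, rng):
    """Generate `num_new` sentences that keep the label of their source."""
    new_examples = []
    for _ in range(num_new):
        tokens, label = rng.choice(dataset)            # sample a sentence
        positions = rng.sample(range(len(tokens)), num_masks)
        masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
        filled = list(masked)
        for pos in positions:                          # fill each mask
            filled[pos] = predict_fn(masked, pos, label)
        new_examples.append((filled, label))
    return new_examples


def toy_predict(masked_tokens, position, label):
    """Hypothetical stand-in for a fine-tuned conditional BERT."""
    return "great" if label == "positive" else "awful"


dataset = [("this actor is good".split(), "positive"),
           ("the plot is bad".split(), "negative")]
rng = random.Random(0)
new_sentences = augment(dataset, toy_predict, num_new=4, num_masks=1, rng=rng)
augmented = dataset + new_sentences
print(len(augmented))
```

Passing the predictor in as a callable keeps the loop independent of any particular model library; the downstream classifier would then be trained on `augmented`.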

4 Experiment

In this section, we present the conditional BERT parameter settings and, following Kobayashi (2018), apply different augmentation methods to two types of neural models on six text classification tasks. The pre-trained BERT model used in our experiments is $\text{BERT}_{\text{BASE}}$, with number of layers (i.e., Transformer blocks) $L = 12$, hidden size $H = 768$, number of self-attention heads $A = 12$, and 110M total parameters. Detailed pre-training parameter settings can be found in the original paper (Devlin et al., 2018). For each task, we perform the following steps independently. First, we evaluate the augmentation ability of the original BERT model pre-trained on the MLM task. We use pre-trained BERT to augment the dataset by predicting masked words conditioned only on the context of each sentence. Second, we fine-tune the original BERT model into a conditional BERT. The well-trained conditional BERT augments each sentence in the dataset by predicting masked words conditioned on both context and label. Third, we compare the performance of the two methods with the contextual augmentation results of Kobayashi (2018). Note that the segmentation embedding layer of the original BERT is compatible only with two-label datasets. When the task-specific dataset has more than two labels, we re-train a label embedding layer whose size matches the number of labels instead of directly fine-tuning the pre-trained one.

4.1 Datasets

Six benchmark classification datasets are used, and their summary statistics are listed in Table 1. Following Kim (2014), for a dataset without validation data, we use 10% of its training set as the validation set.
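Holding out 10% of a training set for validation could be done as in the sketch below. Shuffling before the split and the fixed seed are our assumptions; the text states only the proportion.

```python
import random


def split_validation(train, fraction=0.1, seed=0):
    """Hold out `fraction` of the training examples as a validation set."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


examples = [(f"sentence {i}", i % 2) for i in range(100)]
train_part, valid_part = split_validation(examples)
print(len(train_part), len(valid_part))  # 90 10
```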

Dataset  c  l   N      |V|    Test
SST5     5  18  11855  17836  2210
SST2     2  19  9613   16185  1821
Subj     2  23  10000  21323  CV
TREC     6  10  5952   9592   500
MPQA     2  3   10606  6246   CV
RT       2  21  10662  20287  CV
Table 1: Summary statistics for the datasets after tokenization. c: Number of target classes. l: Average sentence length. N: Dataset size. |V|: Vocabulary size. Test: Test set size (CV means there was no standard train/test split and thus 10-fold cross-validation was used).

SST (Socher et al., 2013): SST (Stanford Sentiment Treebank) is a dataset for sentiment classification of movie reviews, which are annotated with five labels (SST5: very positive, positive, neutral, negative, or very negative) or two labels (SST2: positive or negative).
Subj (Pang and Lee, 2004): Subj (Subjectivity dataset) is annotated with whether a sentence is subjective or objective.
MPQA (Wiebe et al., 2005): The MPQA Opinion Corpus is an opinion polarity detection dataset of short phrases rather than sentences. It contains news articles from a wide variety of news sources, manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
RT (Pang and Lee, 2005): RT is another movie review sentiment dataset, containing a collection of short review excerpts from Rotten Tomatoes collected by Bo Pang and Lillian Lee.
TREC (Li and Roth, 2002): TREC is a dataset for classifying questions into six question types (whether the question is about a person, a location, numeric information, etc.).

4.2 Text classification

Model               SST5  SST2  Subj  MPQA  RT    TREC  Avg.
CNN*                41.3  79.5  92.4  86.1  75.9  90.0  77.53
  w/synonym*        40.7  80.0  92.4  86.3  76.0  89.6  77.50
  w/context*        41.9  80.9  92.7  86.7  75.9  90.0  78.02
  w/context+label*  42.1  80.8  93.0  86.7  76.1  90.5  78.20
  w/BERT            41.5  81.9  92.9  87.7  78.2  91.8  79.00
  w/C-BERT          42.3  82.1  93.4  88.2  79.0  92.6  79.60
RNN*                40.2  80.3  92.4  86.0  76.7  89.0  77.43
  w/synonym*        40.5  80.2  92.8  86.4  76.6  87.9  77.40
  w/context*        40.9  79.3  92.8  86.4  77.0  89.3  77.62
  w/context+label*  41.1  80.1  92.8  86.4  77.4  89.2  77.83
  w/BERT            41.3  81.4  93.5  87.3  78.3  89.8  78.60
  w/C-BERT          42.6  81.9  93.9  88.0  78.9  91.0  79.38
Table 2: Accuracies of different methods for various benchmarks on two classifier architectures. C-BERT, which represents conditional BERT, performs best on both classifier structures over the six datasets. “w/” stands for “with”; rows marked with “*” are experimental results from Kobayashi (2018).
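The Avg. column of Table 2 is the unweighted mean of the six per-dataset accuracies, which makes the average gain of conditional BERT over the unaugmented classifiers easy to recompute. The sketch below copies four rows from the table; the dictionary layout and the helper name are our own.

```python
ROWS = {
    # SST5, SST2, Subj, MPQA, RT, TREC (values copied from Table 2)
    "CNN": [41.3, 79.5, 92.4, 86.1, 75.9, 90.0],
    "CNN w/C-BERT": [42.3, 82.1, 93.4, 88.2, 79.0, 92.6],
    "RNN": [40.2, 80.3, 92.4, 86.0, 76.7, 89.0],
    "RNN w/C-BERT": [42.6, 81.9, 93.9, 88.0, 78.9, 91.0],
}


def average(scores):
    """Unweighted mean over the six benchmarks."""
    return sum(scores) / len(scores)


cnn_gain = average(ROWS["CNN w/C-BERT"]) - average(ROWS["CNN"])
rnn_gain = average(ROWS["RNN w/C-BERT"]) - average(ROWS["RNN"])
print(f"CNN: +{cnn_gain:.2f}  RNN: +{rnn_gain:.2f}")  # CNN: +2.07  RNN: +1.95
```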

4.2.1 Sentence Classifier Structure

We evaluate the performance improvement brought by conditional BERT contextual augmentation on sentence classification tasks, so we need to prepare two common sentence classifiers beforehand. For comparison, following Kobayashi (2018), we adopt two typical classifier architectures: CNN and LSTM-RNN. The CNN-based classifier (Kim, 2014) has convolutional filters of size 3, 4, 5 and word embeddings. The outputs of all filters are concatenated and passed through max-pooling over time, then fed into a two-layer feed-forward network with ReLU, followed by the softmax function. The RNN-based classifier has a single-layer LSTM and word embeddings, whose output is fed into an output affine layer with the softmax function. For both architectures, dropout (Srivastava et al., 2014) and Adam optimization (Kingma and Ba, 2014) are applied during training. Training is terminated by early stopping, with validation at each epoch.
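Early stopping with per-epoch validation can be sketched as follows. The patience value and the list of validation scores are hypothetical; the text does not state the exact stopping criterion, so this shows one common variant rather than the one used in the experiments.

```python
def early_stopping(validation_scores, patience=3):
    """Return (best_epoch, best_score), stopping after `patience` stale epochs.

    `validation_scores` stands in for evaluating the classifier on the
    validation set after each training epoch.
    """
    best_epoch, best_score, stale = -1, float("-inf"), 0
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


scores = [0.70, 0.74, 0.73, 0.75, 0.74, 0.74, 0.73, 0.80]
print(early_stopping(scores))  # (3, 0.75)
```

In this run the loop stops at epoch 6, so the later score of 0.80 is never reached; that is the usual trade-off between training time and the risk of stopping too early.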

4.2.2 Hyper-parameters Setting

Sentence classifier hyper-parameters, including learning rate, embedding dimension, unit or filter size, and dropout ratio, are selected using grid search for each task-specific dataset. We refer to Kobayashi’s implementation in the released code. For BERT, all hyper-parameters are kept the same as in Devlin et al. (2018); code in TensorFlow and PyTorch is available on GitHub, and the pre-trained BERT model can also be downloaded. The number of conditional BERT training epochs ranges over [1, 50] and the number of masked words over [1, 2].
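A grid search over classifier hyper-parameters can be written generically as below. The candidate values in `GRID` and the scoring function `toy_evaluate` are invented for illustration and are not the settings used in the experiments; in practice `evaluate` would train a classifier and return its validation accuracy.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and keep the best-scoring one."""
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


GRID = {  # hypothetical candidate values
    "learning_rate": [1e-3, 1e-4],
    "dropout": [0.3, 0.5],
    "embedding_dim": [100, 300],
}


def toy_evaluate(config):
    """Stand-in for training a classifier and measuring validation accuracy."""
    return (-abs(config["dropout"] - 0.5)
            - abs(config["embedding_dim"] - 300) / 1000
            - config["learning_rate"])


best_config, best_score = grid_search(GRID, toy_evaluate)
print(best_config)
```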

4.2.3 Baselines

We compare the performance improvements obtained by our proposed method with the following baseline methods (“w/” means “with”):

  • w/synonym: Words are randomly replaced with synonyms from WordNet (Miller, 1995).

  • w/context: The contextual augmentation proposed by Kobayashi (2018), which uses a bidirectional language model and replaces each word with a given probability.

  • w/context+label: Kobayashi’s contextual augmentation method (Kobayashi, 2018) with a label-conditional LM architecture.
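The per-word replacement used by the baselines can be illustrated with the synonym variant. The small `SYNONYMS` dictionary is a toy stand-in for WordNet, and the function name and probability parameter are assumptions made for this sketch.

```python
import random

SYNONYMS = {  # toy lexicon standing in for WordNet
    "good": ["fine", "great"],
    "movie": ["film"],
}


def replace_with_synonyms(tokens, prob, rng):
    """Replace each word that has synonyms with probability `prob`."""
    output = []
    for token in tokens:
        candidates = SYNONYMS.get(token)
        if candidates and rng.random() < prob:
            output.append(rng.choice(candidates))
        else:
            output.append(token)
    return output


rng = random.Random(0)
sentence = "a good movie".split()
print(replace_with_synonyms(sentence, prob=0.5, rng=rng))
```

Words absent from the lexicon, such as "a" here, can never change, which illustrates why synonym replacement yields only limited diversity.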

4.2.4 Experiment Results

Table 2 lists the accuracies of all methods on the two classifier architectures. The results show that, across datasets and classifier architectures, our conditional BERT contextual augmentation improves model performance the most. BERT can also augment sentences to some extent, but not as much as conditional BERT does. Because we mask words randomly, the masked words may be label-sensitive or label-insensitive. If label-sensitive words are masked, the words predicted by BERT may not be compatible with the original labels. The improvement over all benchmark datasets also shows that conditional BERT is a general augmentation method for multi-label sentence classification tasks.

4.2.5 Effect of Number of Fine-tuning Steps

We also explore the effect of the number of training steps on the performance of conditional BERT data augmentation. The number of fine-tuning epochs ranges from 1 to 50, and Table 3 lists the number of fine-tuning epochs that conditional BERT needs to outperform BERT on each benchmark. The results show that our conditional BERT contextual augmentation achieves a clear performance improvement after only a few fine-tuning epochs, which makes it very convenient to apply to downstream tasks.

Model  SST5  SST2  Subj  MPQA  RT  TREC
CNN    4     3     1     2     2   1
RNN    6     2     2     2     1   1
Table 3: Fine-tuning epochs of conditional BERT to outperform BERT for various benchmarks

5 Connection to Style Transfer

Original: there ’s no disguising this as one of the worst films of the summer .
Generated: there ’s no disguising this as one of the best films of the summer .
Original: it ’s probably not easy to make such a worthless film …
Generated: it ’s probably not easy to make such a stunning film …
Original: woody allen has really found his groove these days .
Generated: woody allen has really lost his groove these days .
Table 4: Examples generated by conditional BERT on the SST2 dataset. To perform style transfer, we reverse the original label of a sentence, and conditional BERT outputs a new sentence compatible with the reversed label.

In this section, we look further into the connection to style transfer and apply our well-trained conditional BERT to the style transfer task. Style transfer is defined as the task of rephrasing a text to contain specific stylistic properties without changing the intent or affect within the context (Prabhumoye et al., 2018). Our conditional MLM task changes words in the text conditioned on a given label without changing the context. Viewed from this point, the two tasks are very close. To apply conditional BERT to style transfer, given a sentence with a specific style, we proceed in two steps: first, we find the words relevant to the style; second, we mask the style-relevant words and use conditional BERT to predict new substitutes from the sentence context and the target style property. To find style-relevant words in a sentence, we refer to Xu et al. (2018), who proposed an attention-based method to extract the contribution of each word to the sentiment label of the sentence. For example, given a positive movie remark “This movie is funny and interesting”, we pick out the words that contribute most to the label and mask them. Then, through our conditional BERT contextual augmentation method, we fill in the masked positions by predicting words conditioned on the opposite label and the sentence context, resulting in “This movie is boring and dull”. The words “boring” and “dull” make the new sentence negative in style. We sample some sentences from the SST2 dataset and transfer them to the opposite label, as listed in Table 4.
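The two-step procedure can be sketched with placeholder components. The per-word scores below are invented stand-ins for attention-based contribution weights, the threshold is an arbitrary assumption, and `toy_predict` is a hypothetical substitute for conditional BERT that simply looks up a fill for each position and label.

```python
MASK = "[MASK]"


def transfer_style(tokens, scores, target_label, predict_fn, threshold=0.3):
    """Mask words whose style score passes `threshold`, then refill them."""
    positions = [i for i, s in enumerate(scores) if s >= threshold]
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    output = list(masked)
    for pos in positions:
        output[pos] = predict_fn(masked, pos, target_label)
    return output


def toy_predict(masked_tokens, position, label):
    """Hypothetical stand-in for conditional BERT under a target label."""
    fills = {
        "negative": {3: "boring", 5: "dull"},
        "positive": {3: "funny", 5: "interesting"},
    }
    return fills[label][position]


tokens = "This movie is funny and interesting".split()
scores = [0.02, 0.05, 0.03, 0.45, 0.05, 0.40]  # hypothetical word weights
result = transfer_style(tokens, scores, "negative", toy_predict)
print(" ".join(result))  # This movie is boring and dull
```

Only the masked positions change, so the non-stylistic context of the sentence is preserved by construction.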

6 Conclusions and Future Work

In this paper, we fine-tune BERT into conditional BERT by introducing a novel conditional MLM task. Once well trained, the conditional BERT can be applied to data augmentation for sentence classification tasks. Experimental results show that our model clearly outperforms several baseline methods. Furthermore, we demonstrate that our conditional BERT can also be applied to the style transfer task. In the future, (1) we will explore how to perform text data augmentation on imbalanced datasets with a pre-trained language model, and (2) since we believe the idea of conditional BERT contextual augmentation is universal, we will apply it to paragraph-level or document-level data augmentation.