Deep neural network-based models are prone to overfitting and lose generalization ability when the training data are limited. To address this issue, data augmentation methods are often applied to generate more training samples. Recent years have witnessed great success in applying data augmentation in the fields of speech Jaitly and Hinton (2013); Ko et al. (2015)
and computer vision Krizhevsky et al. (2012); Simard et al. (1998); Szegedy et al. (2015). Data augmentation in these areas can be easily performed by transformations such as resizing, mirroring, random cropping, and color shifting. However, applying these universal transformations to text is largely randomized and uncontrollable, which makes it impossible to ensure semantic invariance and label correctness. For example, given the movie review “The actors is good”, mirroring yields “doog si srotca ehT” and random cropping yields “actors is”, both of which are meaningless.
Existing data augmentation methods for text often lack generality: they are developed with handcrafted rules or pipelines for specific domains. A general approach to text data augmentation is the replacement-based method, which generates new sentences by replacing words in a sentence with relevant words (e.g., synonyms). However, the synonyms available in a handcrafted lexical database like WordNet Miller (1995) are very limited, so replacement-based augmentation with synonyms can only produce a narrow range of patterns from the original texts. To address this limitation, Kobayashi (2018) proposed contextual augmentation for labeled sentences, offering a wide range of substitute words predicted by a label-conditional bidirectional language model according to the context. But contextual augmentation suffers from two shortcomings: the bidirectional language model is simply a shallow concatenation of a forward and a backward model, and the use of LSTMs restricts its prediction ability to a short range.
BERT, which stands for Bidirectional Encoder Representations from Transformers, pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. BERT addresses the unidirectional constraint with a “masked language model” (MLM) objective: some percentage of the input tokens are masked at random, and the masked words are predicted based on their context. This is very similar to how contextual augmentation predicts replacement words. But BERT was proposed to pre-train text representations, so the MLM task is performed in an unsupervised way, taking no label variance into consideration.
This paper focuses on replacement-based methods by proposing a novel data augmentation method called conditional BERT contextual augmentation. The method applies contextual augmentation through a conditional BERT, which is fine-tuned from pre-trained BERT. We adopt BERT as our pre-trained language model for two reasons. First, BERT is based on the Transformer, which provides a more structured memory for handling long-term dependencies in text. Second, BERT, as a deep bidirectional model, is strictly more powerful than the shallow concatenation of a left-to-right and a right-to-left model. We therefore apply BERT to contextual augmentation for labeled sentences, offering a wider range of substitute words predicted by the masked language model task. However, the masked language model predicts a masked word based only on its context, so the predicted word may be incompatible with the annotated label of the original sentence. To address this issue, we introduce a new fine-tuning objective: the “conditional masked language model” (C-MLM). The conditional masked language model randomly masks some of the tokens from an input, and the objective is to predict a label-compatible word based on both the context and the sentence label. Unlike Kobayashi’s work, the C-MLM objective allows deep bidirectional representations by jointly conditioning on both left and right context in all layers. To evaluate how well our augmentation method improves the performance of deep neural network models, following Kobayashi (2018), we experiment on the two most common neural network structures, LSTM-RNN and CNN, on text classification tasks. Through experiments on six different text classification tasks, we demonstrate that the proposed conditional BERT model augments sentences better than the baselines, and that the conditional BERT contextual augmentation method can be easily applied to both convolutional and recurrent neural network classifiers.
We further explore the connection between our conditional MLM task and style transfer, and demonstrate that our conditional BERT can also be applied to the style transfer task.
Our contributions are summarized as follows:
We propose a conditional BERT contextual augmentation method. The method allows BERT to augment sentences without breaking label compatibility. Our conditional BERT can further be applied to the style transfer task.
Experimental results show that our approach clearly outperforms existing text data augmentation approaches.
To the best of our knowledge, this is the first attempt to alter BERT into a conditional BERT or to apply BERT to text generation tasks.
2 Related Work
2.1 Fine-tuning on Pre-trained Language Model
Language model pre-training has attracted wide attention, and fine-tuning a pre-trained language model has proven effective for improving many downstream natural language processing tasks. Dai and Le (2015) pre-trained on unlabeled data to improve sequence learning with recurrent networks. Howard and Ruder (2018)
proposed a general transfer learning method, Universal Language Model Fine-tuning (ULMFiT), with key techniques for fine-tuning a language model. Radford et al. (2018) showed that generative pre-training of a language model on a diverse corpus of unlabeled text yields large gains on a diverse range of tasks, achieving large improvements on many sentence-level tasks from the GLUE benchmark Wang et al. (2018). BERT Devlin et al. (2018) obtained new state-of-the-art results on a broad range of diverse tasks. BERT pre-trains deep bidirectional representations that jointly condition on both left and right context in all layers, followed by discriminative fine-tuning on each specific task. Unlike previous works that fine-tune a pre-trained language model to perform discriminative tasks, we aim to apply pre-trained BERT to a generative task by performing the masked language model (MLM) task. To generate sentences that are compatible with given labels, we retrofit BERT into conditional BERT by introducing a conditional masked language model task and fine-tuning BERT on it.
2.2 Text Data Augmentation
Text data augmentation has been extensively studied in natural language processing. Sample-based methods include downsampling the majority classes and oversampling the minority class, both of which perform weakly in practice. Generation-based methods employ deep generative models such as GANs Goodfellow et al. (2014) or VAEs Bowman et al. (2015); Hu et al. (2017), trying to generate sentences from a continuous space with desired attributes such as sentiment and tense. However, it is hard to guarantee the quality of sentences generated by these methods, in terms of both label compatibility and sentence readability. In some specific areas Jia and Liang (2017); Xie et al. (2017); Ebrahimi et al. (2017), word replacement augmentation has been applied. Wang and Yang (2015) proposed using neighboring words in continuous representations to create new instances for every word in a tweet to augment the training dataset. Zhang et al. (2015) extracted all replaceable words from the given text, randomly chose some of them to be replaced, and substituted them with synonyms from WordNet Miller (1995). Kolomiyets et al. (2011) replaced only the headwords under a task-specific assumption that temporal trigger words usually occur as headwords, and selected substitute words with the highest scores given by the Latent Words LM Deschacht and Moens (2009), a language model based on fixed-length contexts. Fadaee et al. (2017) focused on the rare word problem in machine translation, replacing words in a source sentence with rare words; a word in the translated sentence is also replaced using a word alignment method and a rightward language model. The work most similar to ours is Kobayashi (2018), which used a fill-in-the-blank context for data augmentation, replacing words in a sentence with words predicted by a language model.
To prevent the generated words from reversing the label-related information of the sentences, Kobayashi (2018) introduced a conditional constraint to control the replacement of words. Unlike previous works, we adopt a deep bidirectional language model for replacement, and the attention mechanism within our model provides a more structured memory for handling long-term dependencies in text, resulting in more general and robust improvement on various downstream tasks.
3 Conditional BERT Contextual Augmentation
3.1 Preliminary: Masked Language Model Task
3.1.1 Bidirectional Language Model
In general, a language model (LM) models the probability of generating natural language sentences or documents. Given a sequence $S$ of $N$ tokens, $\langle t_1, t_2, \dots, t_N \rangle$, a forward language model allows us to predict the probability of the sequence as:

$$p(t_1, t_2, \dots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_1, t_2, \dots, t_{i-1}).$$
Similarly, a backward language model allows us to predict the probability of the sentence as:

$$p(t_1, t_2, \dots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_{i+1}, t_{i+2}, \dots, t_N).$$
Traditionally, a bidirectional language model is a shallow concatenation of independently trained forward and backward LMs.
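As a concrete illustration of the two factorizations, the sketch below scores a sentence under toy forward and backward bigram models; the probability tables are invented for illustration only and are not from the paper.

```python
import math

# Toy bigram probability tables (hypothetical values for illustration).
forward_probs = {
    ("<s>", "the"): 0.5, ("the", "movie"): 0.4, ("movie", "is"): 0.6,
    ("is", "good"): 0.3, ("good", "</s>"): 0.8,
}
backward_probs = {
    ("</s>", "good"): 0.5, ("good", "is"): 0.4, ("is", "movie"): 0.6,
    ("movie", "the"): 0.3, ("the", "<s>"): 0.8,
}

def forward_log_prob(tokens):
    """log p(t_1..t_N) = sum_i log p(t_i | t_{i-1}) under a bigram forward LM."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(forward_probs[(a, b)])
               for a, b in zip(padded, padded[1:]))

def backward_log_prob(tokens):
    """log p(t_1..t_N) = sum_i log p(t_i | t_{i+1}) under a bigram backward LM."""
    padded = ["<s>"] + tokens + ["</s>"]
    rev = padded[::-1]
    return sum(math.log(backward_probs[(a, b)])
               for a, b in zip(rev, rev[1:]))
```

A shallow bidirectional LM in the sense above simply combines the two independently trained directions, e.g. by summing their log-probabilities, rather than conditioning on both sides jointly.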
3.1.2 Masked Language Model Task
To train a deep bidirectional language model, BERT proposed the Masked Language Model (MLM) task, also referred to as the Cloze task Taylor (1953). The MLM task randomly masks some percentage of the input tokens and then predicts only those masked tokens according to their context. Given a masked token $t_i$, the context $S \backslash \{t_i\}$ is the set of tokens surrounding $t_i$ in the sequence $S$, i.e., the cloze sentence. The MLM task only predicts the masked words rather than reconstructing the entire input, which means more pre-training steps are required for the model to converge. Pre-trained BERT can augment sentences through the MLM task by predicting new words in masked positions according to their context.
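The masking step of the MLM task can be sketched as follows; the 15% mask ratio follows BERT's setting, while the whitespace tokenization and helper names are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, rng=None):
    """Randomly mask a percentage of tokens, producing a cloze sentence.

    Returns the masked sequence and the masked positions; an MLM is
    trained to recover the original tokens at those positions from
    the surrounding context only.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    cloze = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return cloze, positions

cloze, pos = mask_tokens("the actors in this movie are very good".split())
```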
3.2 Conditional BERT
As shown in Fig 1, our conditional BERT shares the same model architecture as the original BERT. The differences are the input representation and the training procedure.
The input embeddings of BERT are the sum of the token embeddings, the segmentation embeddings, and the position embeddings. For the segmentation embeddings, a learned sentence A embedding is added to every token of the first sentence, and if a second sentence exists, a sentence B embedding is added to every token of the second sentence. However, the segmentation embeddings have no connection to the actual annotated label of a sentence, such as sense, sentiment, or subjectivity, so the predicted word is not always compatible with the annotated label. For example, given the positive movie remark “this actor is good” with the word “good” masked, the word predicted in the masked position through BERT’s masked language model task may well be a negative word like “bad” or “boring”. New sentences generated by substituting such masked words are implausible with respect to their original labels and will be harmful if added to the corpus for augmentation. To address this issue, we propose a new task: the “conditional masked language model”.
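A minimal numpy sketch of this input representation, with toy dimensions (BERT-base actually uses a hidden size of 768) and randomly initialized tables standing in for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"[CLS]": 0, "this": 1, "actor": 2, "is": 3, "good": 4, "[SEP]": 5}
hidden = 8  # toy hidden size for illustration

tok_emb = rng.normal(size=(len(vocab), hidden))  # token embeddings
seg_emb = rng.normal(size=(2, hidden))           # sentence A / B embeddings
pos_emb = rng.normal(size=(16, hidden))          # position embeddings

def input_representation(tokens, segment_ids):
    """BERT input embedding: token + segmentation + position, summed per token."""
    ids = [vocab[t] for t in tokens]
    return (tok_emb[ids]
            + seg_emb[segment_ids]
            + pos_emb[np.arange(len(ids))])

x = input_representation(["[CLS]", "this", "actor", "is", "good", "[SEP]"],
                         segment_ids=[0, 0, 0, 0, 0, 0])
```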
3.2.1 Conditional Masked Language Model
The conditional masked language model randomly masks some of the tokens from a labeled sentence, and the objective is to predict the original vocabulary index of each masked word based on both its context and its label. Given a masked token $t_i$, both the context $S \backslash \{t_i\}$ and the label $y$ are considered, aiming to calculate $p(t_i \mid y, S \backslash \{t_i\})$ instead of $p(t_i \mid S \backslash \{t_i\})$. Unlike MLM pre-training, the conditional MLM objective allows the representation to fuse the context information and the label information, which allows us to train label-conditional deep bidirectional representations.
To perform the conditional MLM task, we fine-tune pre-trained BERT. We alter the segmentation embeddings into label embeddings, which are learned corresponding to the annotated labels of the labeled dataset. Note that BERT is designed with only two segmentation embeddings (embedding A and embedding B), so when a downstream task dataset has more than two labels, we have to adapt the size of the embedding table to the number of labels. We train conditional BERT with the conditional MLM task on a labeled dataset. After the model has converged, it is expected to predict words in masked positions considering both the context and the label.
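The embedding-table swap described above can be sketched as follows, with numpy arrays standing in for the learned embedding layers; the random initialization for the larger table is a placeholder for the actual re-training on labeled data.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8  # toy hidden size

# Pre-trained BERT ships exactly two segmentation embeddings (sentence A / B).
pretrained_seg_emb = rng.normal(size=(2, hidden))

def make_label_embeddings(num_labels, seg_emb=pretrained_seg_emb):
    """Turn segmentation embeddings into label embeddings.

    With two labels the pre-trained A/B rows can be reused directly for
    fine-tuning; with more labels a table sized to the label set must be
    re-trained, here initialised randomly as a stand-in for training.
    """
    if num_labels == 2:
        return seg_emb.copy()
    return rng.normal(size=(num_labels, hidden))

binary = make_label_embeddings(2)    # e.g. SST2: positive / negative
five_way = make_label_embeddings(5)  # e.g. SST5: five sentiment labels
```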
3.3 Conditional BERT Contextual Augmentation
After the conditional BERT is well trained, we utilize it to augment sentences. Given a labeled sentence from the corpus, we randomly mask a few words in the sentence. Through conditional BERT, various words compatible with the label of the sentence are predicted for the masked positions. After substituting the masked words with predicted words, a new sentence is generated, which shares a similar context and the same label as the original sentence. The new sentences are then added to the original corpus. We elaborate the entire process in Algorithm 1.
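The augmentation loop can be sketched as below; `dummy_conditional_predictor` and its tiny lexicon are hypothetical stand-ins for the trained conditional BERT, not the paper's model.

```python
import random

def dummy_conditional_predictor(cloze, position, label):
    """Stand-in for conditional BERT: returns a label-compatible word
    for the masked position, drawn from a toy hypothetical lexicon."""
    lexicon = {"positive": ["great", "fine"], "negative": ["bad", "dull"]}
    return random.Random(position).choice(lexicon[label])

def augment(corpus, predictor, num_masks=1, rng=None):
    """Conditional BERT contextual augmentation sketch: mask a few words
    per labeled sentence, predict label-compatible substitutes, and add
    the resulting new sentences to the corpus."""
    rng = rng or random.Random(0)
    augmented = list(corpus)
    for tokens, label in corpus:
        positions = rng.sample(range(len(tokens)), num_masks)
        new_tokens = list(tokens)
        for p in positions:
            new_tokens[p] = predictor(new_tokens, p, label)
        augmented.append((new_tokens, label))
    return augmented

corpus = [(["this", "actor", "is", "good"], "positive")]
out = augment(corpus, dummy_conditional_predictor)
```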
4 Experiments
In this section, we present the conditional BERT parameter settings and, following Kobayashi (2018), apply different augmentation methods to two types of neural models across six text classification tasks. The pre-trained model used in our experiments is BERT_BASE, with 12 layers (i.e., Transformer blocks), a hidden size of 768, 12 self-attention heads, and 110M total parameters. Detailed pre-training parameter settings can be found in the original paper Devlin et al. (2018). For each task, we perform the following steps independently. First, we evaluate the augmentation ability of the original BERT model pre-trained with the MLM task: pre-trained BERT augments the dataset by predicting masked words conditioned only on the context of each sentence. Second, we fine-tune the original BERT model into a conditional BERT; the well-trained conditional BERT augments each sentence in the dataset by predicting masked words conditioned on both context and label. Third, we compare the performance of the two methods with Kobayashi’s (2018) contextual augmentation results. Note that the original BERT’s segmentation embeddings layer is only compatible with two-label datasets; when a task-specific dataset has more than two labels, we re-train a label embeddings layer sized to the label set instead of directly fine-tuning the pre-trained one.
4.1 Datasets
Six benchmark classification datasets are listed in Table 1. Following Kim (2014), for datasets without validation data, we use 10% of the training set as the validation set. Summary statistics of the six classification datasets are shown in Table 1.
SST Socher et al. (2013): SST (Stanford Sentiment Treebank) is a dataset for sentiment classification of movie reviews, annotated with five labels (SST5: very positive, positive, neutral, negative, or very negative) or two labels (SST2: positive or negative).
Subj Pang and Lee (2004): Subj (Subjectivity dataset) is annotated with whether a sentence is subjective or objective.
MPQA Wiebe et al. (2005): The MPQA Opinion Corpus is an opinion polarity detection dataset of short phrases rather than sentences, containing news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
RT Pang and Lee (2005): RT is another movie review sentiment dataset, containing a collection of short review excerpts from Rotten Tomatoes collected by Bo Pang and Lillian Lee.
TREC Li and Roth (2002): TREC is a dataset for classification of six question types (whether the question is about a person, location, numeric information, etc.).
4.2 Text classification
4.2.1 Sentence Classifier Structure
We evaluate the performance improvement brought by conditional BERT contextual augmentation on sentence classification tasks, so we first prepare two common sentence classifiers. For comparison, following Kobayashi (2018), we adopt two typical classifier architectures: CNN and LSTM-RNN. The CNN-based classifier Kim (2014)
has convolutional filters of sizes 3, 4, and 5 over word embeddings. The outputs of all filters are max-pooled over time and concatenated, then fed into a two-layer feed-forward network with ReLU, followed by the softmax function. The RNN-based classifier has a single-layer LSTM over word embeddings, whose output is fed into an output affine layer with the softmax function. For both architectures, dropout Srivastava et al. (2014) and Adam optimization Kingma and Ba (2014) are applied during training. Training is finished by early stopping with validation at each epoch.
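For reference, the forward pass of the CNN classifier described above can be sketched in numpy with toy dimensions; the filter count, sizes, and random (untrained) weights here are illustrative assumptions, and dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_filters, n_classes, hidden = 8, 4, 2, 16  # toy sizes

# One bank of filters per window size 3, 4, 5, as in Kim (2014).
filters = {k: rng.normal(size=(n_filters, k, emb_dim)) for k in (3, 4, 5)}
W1 = rng.normal(size=(3 * n_filters, hidden))  # first feed-forward layer
W2 = rng.normal(size=(hidden, n_classes))      # output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_classify(embedded):  # embedded: (seq_len, emb_dim) word embeddings
    feats = []
    for k, F in filters.items():
        # slide each size-k filter over time, then max-pool over time
        windows = np.stack([embedded[i:i + k]
                            for i in range(len(embedded) - k + 1)])
        conv = np.einsum("tke,fke->tf", windows, F)  # (time, n_filters)
        feats.append(conv.max(axis=0))               # max over time
    h = np.maximum(np.concatenate(feats) @ W1, 0)    # ReLU layer
    return softmax(h @ W2)

probs = cnn_classify(rng.normal(size=(12, emb_dim)))
```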
4.2.2 Hyper-parameters Setting
Sentence classifier hyper-parameters, including learning rate, embedding dimension, unit or filter size, and dropout ratio, are selected by grid search for each task-specific dataset. We refer to Kobayashi’s implementation in the released code (https://github.com/pfnet-research/contextual_augmentation). For BERT, all hyper-parameters are kept the same as in Devlin et al. (2018); code in TensorFlow (https://github.com/google-research/bert) and PyTorch (https://github.com/huggingface/pytorch-pretrained-BERT) is available on GitHub, and the pre-trained BERT model can also be downloaded. The number of conditional BERT training epochs ranges in [1, 50] and the number of masked words ranges in [1, 2].
4.2.3 Baselines
We compare the performance improvements obtained by our proposed method with the following baseline methods (“w/” means “with”):
w/synonym: Words are randomly replaced with synonyms from WordNet Miller (1995).
w/context: Proposed by Kobayashi (2018), which used a bidirectional language model to apply contextual augmentation; each word is replaced with a probability.
w/context+label: Kobayashi’s contextual augmentation method Kobayashi (2018) in a label-conditional LM architecture.
4.2.4 Experiment Results
Table 2 lists the accuracies of all methods on the two classifier architectures. The results show that, for various datasets on different classifier architectures, our conditional BERT contextual augmentation improves model performance the most. BERT can also augment sentences to some extent, but not as much as conditional BERT does. Since we mask words randomly, the masked words may be label-sensitive or label-insensitive; if label-sensitive words are masked, words predicted by BERT may not be compatible with the original labels. The improvement over all benchmark datasets also shows that conditional BERT is a general augmentation method for multi-class sentence classification tasks.
4.2.5 Effect of Number of Fine-tuning Steps
We also explore the effect of the number of fine-tuning steps on the performance of conditional BERT data augmentation. The fine-tuning epoch setting ranges in [1, 50]; Table 3 lists the fine-tuning epochs at which conditional BERT outperforms BERT on the various benchmarks. The results show that our conditional BERT contextual augmentation achieves clear performance improvement after only a few fine-tuning epochs, which makes it very convenient to apply to downstream tasks.
5 Connection to Style Transfer
Original: there ’s no disguising this as one of the worst films of the summer .
Generated: there ’s no disguising this as one of the best films of the summer .
Original: it ’s probably not easy to make such a worthless film …
Generated: it ’s probably not easy to make such a stunning film …
Original: woody allen has really found his groove these days .
Generated: woody allen has really lost his groove these days .
In this section, we delve further into the connection with style transfer and apply our well-trained conditional BERT to the style transfer task. Style transfer is defined as the task of rephrasing text to contain specific stylistic properties without changing the intent or affect within the context Prabhumoye et al. (2018). Our conditional MLM task changes words in the text conditioned on a given label without changing the context; viewed from this perspective, the two tasks are very close. To apply conditional BERT to the style transfer task, given a sentence in a specific style, we break the task into two steps: first, we find the words relevant to the style; second, we mask the style-relevant words and use conditional BERT to predict new substitutes based on the sentence context and the target style property. To find style-relevant words in a sentence, we refer to Xu et al. (2018), who proposed an attention-based method to extract the contribution of each word to the sentence’s sentiment label. For example, given the positive movie remark “This movie is funny and interesting”, we find the words that contribute most to the label and mask them. Then, through our conditional BERT contextual augmentation method, we fill in the masked positions by predicting words conditioned on the opposite label and the sentence context, resulting in “This movie is boring and dull”. The words “boring” and “dull” make the new sentence negative in style. We sample some sentences from the SST2 dataset and transfer them to the opposite label, as listed in Table 4.
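The two-step procedure above can be sketched as follows; both the attention-based saliency scores of Xu et al. (2018) and the conditional BERT predictor are replaced here by toy stand-ins, so the lexicons and threshold are illustrative assumptions.

```python
def style_relevant_positions(tokens, saliency):
    """Step 1: find style-relevant words. The paper uses attention-based
    contribution scores; here a hypothetical saliency lookup stands in."""
    return [i for i, t in enumerate(tokens) if saliency.get(t, 0.0) > 0.5]

def transfer_style(tokens, target_label, saliency, predictor):
    """Step 2: mask the style-relevant words and predict substitutes
    conditioned on the sentence context and the target style label."""
    out = list(tokens)
    for p in style_relevant_positions(tokens, saliency):
        out[p] = predictor(out, p, target_label)
    return out

# Toy stand-ins for the attention scorer and for conditional BERT.
toy_saliency = {"funny": 0.9, "interesting": 0.8}
toy_predictor = lambda toks, p, label: {"funny": "boring",
                                        "interesting": "dull"}[toks[p]]

new = transfer_style("this movie is funny and interesting".split(),
                     "negative", toy_saliency, toy_predictor)
```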
6 Conclusions and Future Work
In this paper, we fine-tune BERT into conditional BERT by introducing a novel conditional MLM task. After being well trained, the conditional BERT can be applied to data augmentation for sentence classification tasks. Experimental results show that our model clearly outperforms several baseline methods. Furthermore, we demonstrate that our conditional BERT can also be applied to the style transfer task. In the future, (1) we will explore how to perform text data augmentation on imbalanced datasets with a pre-trained language model, and (2) we believe the idea of conditional BERT contextual augmentation is universal and can be applied to paragraph- or document-level data augmentation.
- Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. pages 3079–3087.
- Deschacht and Moens (2009) Koen Deschacht and Marie-Francine Moens. 2009. Semi-supervised semantic role labeling using the latent words language model. pages 21–29.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ebrahimi et al. (2017) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. Hotflip: White-box adversarial examples for nlp. arXiv preprint arXiv:1712.06751.
- Fadaee et al. (2017) Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. arXiv preprint arXiv:1705.00440.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. pages 2672–2680.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. 1:328–339.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. arXiv preprint arXiv:1703.00955.
- Jaitly and Hinton (2013) Navdeep Jaitly and Geoffrey E Hinton. 2013. Vocal tract length perturbation (vtlp) improves speech recognition. 117.
- Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Ko et al. (2015) Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition.
- Kobayashi (2018) Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201.
- Kolomiyets et al. (2011) Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2011. Model-portability experiments for textual temporal analysis. pages 271–276.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. pages 1097–1105.
- Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. pages 1–7.
- Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
- Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. page 271.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. pages 115–124.
- Prabhumoye et al. (2018) Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. arXiv preprint arXiv:1804.09000.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Simard et al. (1998) Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. 1998. Transformation invariance in pattern recognition—tangent distance and tangent propagation. pages 239–274.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. pages 1631–1642.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. pages 1–9.
- Taylor (1953) Wilson L Taylor. 1953. “cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Wang and Yang (2015) William Yang Wang and Diyi Yang. 2015. That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using# petpeeve tweets. pages 2557–2563.
- Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 39(2-3):165–210.
- Xie et al. (2017) Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.
- Xu et al. (2018) Jingjing Xu, Xu Sun, Qi Zeng, Xuancheng Ren, Xiaodong Zhang, Houfeng Wang, and Wenjie Li. 2018. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. arXiv preprint arXiv:1805.05181.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. pages 649–657.