AEDA: An Easier Data Augmentation Technique for Text Classification

08/30/2021 ∙ by Akbar Karimi, et al. ∙ 0

This paper proposes the AEDA (An Easier Data Augmentation) technique to help improve performance on text classification tasks. AEDA consists only of random insertion of punctuation marks into the original text. It is easier to implement than the EDA method (Wei and Zou, 2019), with which we compare our results. In addition, it keeps the order of the words while changing their positions in the sentence, leading to better generalization. Furthermore, the deletion operation in EDA can cause loss of information which, in turn, misleads the network, whereas AEDA preserves all the input information. Following the baseline, we perform experiments on five different datasets for text classification. We show that models trained on AEDA-augmented data outperform those trained on EDA-augmented data on all five datasets. The source code is available for further study and reproduction of the results.

1 Introduction

Text classification is a major area of study in natural language processing (NLP) with numerous applications such as sentiment analysis, toxicity detection, and question answering, to name but a few. In order to build text classifiers that perform well, the training data need to be large enough so that the model can generalize to the unseen data. However, for many machine learning (ML) applications and domains, there do not exist sufficient labeled data for training. In this situation, data augmentation (DA) can provide a solution and help improve the performance of ML systems (Ragni et al., 2014; Fadaee et al., 2017; Ding et al., 2020). DA can be carried out in many different ways such as by modifying elements of the input sequence, namely word substitution, deletion, and insertion (Wei and Zou, 2019; Zhang et al., 2015), and back-translation (Sennrich et al., 2016). It can also be performed by noise injection in the input sequence (Xie et al., 2019) or in the embedding space utilizing a deep language model (Jiao et al., 2020; Karimi et al., 2020; Garg and Ramakrishnan, 2020).

Figure 1: Average performance of the generated data using our proposed augmentation method (AEDA) compared with that of the original and EDA-generated data on five text classification tasks. Using both EDA and AEDA, we added 9 augmented sentences to the original training set to train the models. For each task, we ran the models with 5 different seed numbers and took the average score.

Using a deep language model to do DA can be complicated, while word replacement techniques that rely on a thesaurus, although simple, risk information loss due to operations such as deletion and substitution. These operations can even change the label of the input sequence (Kumar et al., 2020), thus misleading the network.

To address these problems, we propose an extremely simple yet effective approach called AEDA (An Easier Data Augmentation), which consists only of inserting various punctuation marks into the input sequence. AEDA preserves all the input information and does not mislead the network: it keeps the word order intact and changes only the positions of the words, which are shifted to the right by the inserted marks. Our extensive experiments show that AEDA helps the models avoid overfitting (Figure 1).

2 Related Work

Although the textual content is always increasing, data augmentation is still a highly active area of research since for machine learning applications, especially the new ones, the initial annotated data are usually small. As a result, researchers are constantly coming up with innovative ideas to create new data from the available content.

Some have experimented at the input sequence level performing operations on words. For example, to improve machine translation quality, Fadaee et al. (2017) utilize substitution of common words with rare ones, thus providing more context for the rare words, while Sennrich et al. (2016) use back-translation where automatically translated data along with the original human-translated data are employed to train a neural machine translation system.

Wang and Yang (2015) replace words with their synonyms for classifying tweets. Similarly, Andreas (2020) replaces sentence fragments from common categories with each other in order to produce new sentences.

Others have opted for using pre-trained language models such as BERT (Devlin et al., 2019). Kobayashi (2018) utilizes contextual augmentation, replacing the words with the prediction of a bidirectional language model at a desired position in the sentence. Hu et al. (2019) and Liu et al. (2020) utilize reinforcement learning with a conditional language model which is carried out by attaching the correct label to the input sequence when training (Wu et al., 2019). Working with the Transformer model (Vaswani et al., 2017), Sun et al. (2020) propose Mixup-Transformer where two input sentences and their corresponding labels are linearly interpolated to create new samples.

Xie et al. (2019) make use of data noising which can be considered similar to our work with the difference that they replace words choosing from the unigram frequency distribution or insert the underscore character as a placeholder, whereas we insert punctuation characters which usually occur in sentences. The related works mostly use some auxiliary data or a complicated language model to produce augmented data. Conversely, our method is extremely simple to implement and does not need any extra data. In addition, it shows superior performance to EDA in both simple models such as RNNs and CNNs and deep models such as BERT.

3 AEDA Augmentation

In order to insert the punctuation marks, we randomly choose a number between 1 and one-third of the length of the sequence, which indicates how many insertions will be carried out. The reason is that we want to ensure there is at least one inserted mark and at the same time we do not want to insert too many punctuation marks, as too much noise might have a negative effect on the model, although this effect can be investigated in future work. Then, as many positions in the sequence as the number selected in the previous step are chosen at random. In the end, for each chosen position, a punctuation mark is picked randomly from the six punctuation marks in {".", ";", "?", ":", "!", ","}. Table 3, in Supplementary Material, shows example augmentations by the AEDA technique.
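The three steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name `aeda`, whitespace tokenization, inserting each mark before a chosen token, and the fallback to a single insertion for sequences shorter than three tokens are all assumptions made for the sketch.

```python
import random

# The six punctuation marks listed in the text.
PUNCTUATION = [".", ";", "?", ":", "!", ","]


def aeda(sentence, rng=None):
    """Return one AEDA-style augmentation of a whitespace-tokenized sentence."""
    rng = rng or random
    words = sentence.split()
    if not words:
        return sentence
    # Step 1: number of insertions, between 1 and one-third of the length.
    # max(1, ...) keeps the range valid for very short inputs (assumption).
    upper = max(1, len(words) // 3)
    n_insert = rng.randint(1, upper)
    # Step 2: choose that many distinct positions at random.
    # Assumption: position i means "insert before words[i]".
    positions = set(rng.sample(range(len(words)), n_insert))
    # Step 3: pick a random mark for each chosen position.
    out = []
    for i, word in enumerate(words):
        if i in positions:
            out.append(rng.choice(PUNCTUATION))
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    text = "a sad , superior human comedy played out on the back roads of life ."
    print(aeda(text, rng=random.Random(0)))
```

Because the sketch only inserts tokens, the original words always remain in the output in their original order, which is the property the method relies on.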

4 Experimental Setup

Since we compare our proposed method with Wei and Zou (2019), we used the same codebase as theirs with no changes in the implementation of the models. We executed the code using a GeForce RTX 2070 GPU with 8 GB of memory.

4.1 Datasets

We experiment with the same five datasets as our baseline: SST-2, the Stanford Sentiment Treebank (Socher et al., 2013); CR, the Customer Reviews Dataset (Hu and Liu, 2004; Ding et al., 2008; Liu et al., 2015); SUBJ, the Subjectivity/Objectivity Dataset (Pang and Lee, 2004); TREC, the Question Classification Dataset (Li and Roth, 2002); and PC, the Pros and Cons Dataset (Ganapathibhotla and Liu, 2008). Table 4, in Supplementary Material, shows the statistics of the utilized datasets.

The train and test sets utilized for the experiments for these datasets were not made available by the baseline. Therefore, after collecting them, we shuffled and divided them into train and test sets with almost the same size as the ones reported by the baseline. For the CR dataset, we combined all the reviews from the three cited sources. The annotations included multiple target sentiments for each sentence. Therefore, to convert them into binary classes, we considered a sentence positive if there was no negative sentiment and negative if there was no positive sentiment. We will make our datasets available along with the source code.
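The binary conversion rule for CR can be illustrated with a short sketch. The representation of the annotations as a list of signed per-target scores, the function name `binarize`, and the choice to return `None` for sentences with mixed or missing sentiments are assumptions; the text does not specify how such sentences were handled.

```python
from typing import Optional, Sequence


def binarize(target_sentiments: Sequence[int]) -> Optional[int]:
    """Map per-target polarity scores to 1 (positive), 0 (negative), or None.

    Assumption: each score is a signed integer, > 0 for a positive target
    sentiment and < 0 for a negative one.
    """
    has_positive = any(score > 0 for score in target_sentiments)
    has_negative = any(score < 0 for score in target_sentiments)
    if has_positive and not has_negative:
        return 1  # positive: no negative sentiment present
    if has_negative and not has_positive:
        return 0  # negative: no positive sentiment present
    # Assumption: mixed or unannotated sentences get no binary label.
    return None
```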

4.2 Models

To be consistent as well as for a fair comparison of the effects of EDA- and AEDA-augmented data, we used the same Recurrent Neural Network (RNN) (Liu et al., 2016) and Convolutional Neural Network (CNN) (Kim, 2014) as implemented in the baseline.

5 Results

To evaluate the quality of augmented sentences, we performed experiments using the data augmented by both EDA and AEDA as well as the original data. For the results reported in Table 1, we added 16 augmentations and for the ones in Figure 2, 9 augmentations to be consistent with the baseline. All experiments were repeated with 5 different seed numbers and the average scores are reported.
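As a rough sketch of how such a training set could be assembled, the snippet below keeps every original sample and adds a fixed number of augmented copies with the same label. The names `build_training_set` and `insert_one_mark` are hypothetical, and `insert_one_mark` is only a simplified stand-in augmenter, not the method used in the experiments.

```python
import random

PUNCTUATION = [".", ";", "?", ":", "!", ","]


def insert_one_mark(text, rng=random):
    """Stand-in augmenter (hypothetical): insert one random mark at a random slot."""
    words = text.split()
    words.insert(rng.randrange(len(words) + 1), rng.choice(PUNCTUATION))
    return " ".join(words)


def build_training_set(samples, augment, num_aug):
    """Return the original (text, label) pairs plus num_aug augmented copies of each."""
    augmented = []
    for text, label in samples:
        augmented.append((text, label))  # keep the original sample
        # Augmented copies inherit the label of their source sentence.
        augmented.extend((augment(text), label) for _ in range(num_aug))
    return augmented


if __name__ == "__main__":
    data = [("good movie", 1), ("bad plot", 0)]
    train = build_training_set(data, insert_one_mark, num_aug=16)
    print(len(train))  # 2 originals + 2 * 16 augmentations
```

With `num_aug=16` the training set grows to 17 times its original size, and with `num_aug=9` to 10 times.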

5.1 AEDA Outperforms EDA

The results of the experiments with 500, 2000, 5000 and full dataset sizes for training are reported in Table 1. We can see that for some small training sets EDA improves the results, while for bigger ones it has a negative effect on the performance of the models. Conversely, AEDA improves the average score at every training set size, with greater boosts for smaller ones. For instance, with 500 sentences, the average absolute improvement is 3.2% while for the full dataset it is 0.5%. The weaker performance of EDA can be attributed to operations such as deletion and substitution, which feed more misleading information into the network as the number of augmentations grows. In contrast, AEDA keeps the original information in all augmentations.
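The quoted improvements follow directly from the "Average" rows of Table 1. The short script below reproduces the arithmetic; the dictionary layout and the helper name `gains` are illustrative choices, while the numbers are copied from the table.

```python
# Average scores from Table 1 for training sizes 500, 2,000, 5,000, full set.
SIZES = ["500", "2,000", "5,000", "full set"]
AVERAGE = {
    "Original": [75.0, 83.2, 86.5, 87.9],
    "+EDA": [76.8, 81.8, 84.9, 86.3],
    "+AEDA": [78.2, 84.2, 86.9, 88.4],
}


def gains(method):
    """Absolute change in average score relative to training on the original data."""
    return [round(a - b, 1) for a, b in zip(AVERAGE[method], AVERAGE["Original"])]


if __name__ == "__main__":
    for method in ("+EDA", "+AEDA"):
        print(method, dict(zip(SIZES, gains(method))))
```

The AEDA gains come out as 3.2, 1.0, 0.4 and 0.5 points, while EDA gains 1.8 points at 500 sentences and loses 1.4 to 1.6 points at the larger sizes.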

5.2 Trend on Training Set Sizes

Figure 2 shows how both models perform on different fractions of the training set. These fractions include {1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} percent. We can see that AEDA outperforms EDA in all tasks and also improves over the original data. One observation to point out is that EDA also works well on small datasets, which may be due to the lower number of augmentations compared to the ones reported in Table 1.

Figure 2: Performance of the RNN model trained on various proportions of the original, EDA-generated, and AEDA-generated training data for five text classification tasks. All the scores are the average of 5 runs. Panels: (a) SST-2, (b) CR, (c) SUBJ, (d) TREC, (e) PC.

           Training set size
Model      500     2,000   5,000   full set
RNN        73.5    82.6    85.9    87.9
+EDA       76.1    81.3    85.2    86.5
+AEDA      77.8    83.9    87.2    88.6
CNN        76.5    83.8    87.0    87.9
+EDA       77.5    82.2    84.5    86.1
+AEDA      78.5    84.4    86.5    88.1
Average    75.0    83.2    86.5    87.9
+EDA       76.8    81.8    84.9    86.3
+AEDA      78.2    84.2    86.9    88.4

Table 1: Comparing average performance of EDA and AEDA across all datasets on different training set sizes. For each training sample, 16 augmented sentences were added. Scores are the average of 5 runs.

6 Ablation Study

In this section, we investigate how much gain there is for different numbers of augmentations, the effect of random initialization, and whether AEDA can improve deep models.

Figure 3: Impact of number of augmentations on the performance of the RNN model trained on various training sizes. Scores are the average of 5 runs over the five datasets. The y axis shows the percentage of improvement.

6.1 Number of Augmentations

Figure 3 presents the impact of adding various numbers of augmentations to the training set. We can see that a single augmentation can improve the performance by an absolute amount of 1.5% to 2.5% for all dataset sizes. However, as the number of augmentations increases, the smallest dataset benefits greatly, with an improvement of almost 4%, while the full dataset gains only 1%. The middle-sized ones have a gain in between (2% to 2.5%).

6.2 Effect of Random Initialization

When conducting the experiments, we noticed that different seed numbers produce different results. As a result, we ran the experiments 5 times. However, even runs with the same seed number can give slightly different results due to the local and global generators in TensorFlow. Therefore, to ensure that 5 runs show the correct trend, we chose two of the datasets (CR and TREC) and ran the models for 21 different seeds (zero to 20). From Figure 4, we see that the trend is similar to Figure 2, which shows the average results of 5 seeds.

Figure 4: Average performance of EDA and AEDA over 21 different seed numbers. The results are in line with the experiments run over 5 seeds. Panels: (a) CR, (b) TREC.

6.3 Using AEDA with Deep Models

AEDA can also improve the performance of a deep model such as BERT. For instance, we trained the BERT model used in Kumar et al. (2020) on SST2 and TREC for 3 epochs with its default settings and observed that adding one augmentation for each training sample increased the performance by 0.66% for SST2 and 0.2% for TREC (Table 2).

Model      SST2     TREC
BERT       91.10    97.00
+EDA       90.99    96.00
+AEDA      91.76    97.20

Table 2: Comparing the impact of EDA and AEDA on the BERT model. The model was trained on the combination of the original data and one augmentation for each training sample.

7 Discussion

Comparing the results that we have gained in our experiments with the ones reported in Wei and Zou (2019), we can see some discrepancy, especially in the impact of EDA on improving the performance of the models. We speculate that the difference can be caused by the inconsistency in the training and test sets. Although we obtained the datasets from the same references they have specified, some of them are not divided into train and test datasets ready to be used. As mentioned in Section 4.1, we randomly divided them into train and test sets. In addition, some of them have different sizes which can produce different results.

With that said, to conduct a fair evaluation, we kept the same setting for all comparisons in terms of the utilized library and source code, train and test sets, number of augmentations, number of runs, batch size, and learning rate.

8 Conclusion and Future Work

We proposed an easy data augmentation technique for text classification tasks. Extensive experiments on five different datasets showed that this extremely simple method, which uses only punctuation marks, outperforms the EDA technique, which includes random deletion, insertion, and substitution of words, on all the utilized datasets. Future work will investigate which punctuation marks have more impact, which ones to add or discard, and how many of them should be used to achieve better performance. In addition, we will investigate whether the punctuation marks should be inserted randomly or whether some positions are more effective.

References

  • J. Andreas (2020) Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7556–7566. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §2.
  • B. Ding, L. Liu, L. Bing, C. Kruengkrai, T. H. Nguyen, S. Joty, L. Si, and C. Miao (2020) DAGA: data augmentation with a generation approach for low-resource tagging tasks. arXiv preprint arXiv:2011.01549. Cited by: §1.
  • X. Ding, B. Liu, and P. S. Yu (2008) A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 international conference on web search and data mining, pp. 231–240. Cited by: §4.1.
  • M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 567–573. Cited by: §1, §2.
  • M. Ganapathibhotla and B. Liu (2008) Mining opinions in comparative sentences. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 241–248. Cited by: §4.1.
  • S. Garg and G. Ramakrishnan (2020) BAE: bert-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6174–6181. Cited by: §1.
  • M. Hu and B. Liu (2004) Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. Cited by: §4.1.
  • Z. Hu, B. Tan, R. Salakhutdinov, T. Mitchell, and E. P. Xing (2019) Learning data manipulation for augmentation and weighting. arXiv preprint arXiv:1910.12795. Cited by: §2.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020) TinyBERT: distilling bert for natural language understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4163–4174. Cited by: §1.
  • A. Karimi, L. Rossi, A. Prati, and K. Full (2020) Adversarial training for aspect-based sentiment analysis with bert. arXiv preprint arXiv:2001.11316. Cited by: §1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: §4.2.
  • S. Kobayashi (2018) Contextual augmentation: data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 452–457. Cited by: §2.
  • V. Kumar, A. Choudhary, and E. Cho (2020) Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pp. 18–26. Cited by: §1, §6.3.
  • X. Li and D. Roth (2002) Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §4.1.
  • P. Liu, X. Qiu, and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2873–2879. Cited by: §4.2.
  • Q. Liu, Z. Gao, B. Liu, and Y. Zhang (2015) Automated rule selection for aspect extraction in opinion mining. In Twenty-Fourth international joint conference on artificial intelligence, Cited by: §4.1.
  • R. Liu, G. Xu, C. Jia, W. Ma, L. Wang, and S. Vosoughi (2020) Data boost: text data augmentation through reinforcement learning guided conditional generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9031–9041. Cited by: §2.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp. 271–278. Cited by: §4.1.
  • A. Ragni, K. M. Knill, S. P. Rath, and M. J. Gales (2014) Data augmentation for low resource languages. In INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, pp. 810–814. Cited by: §1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96. Cited by: §1, §2.
  • R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng (2013) Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 455–465. Cited by: §4.1.
  • L. Sun, C. Xia, W. Yin, T. Liang, P. S. Yu, and L. He (2020) Mixup-transformer: dynamic data augmentation for nlp tasks. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 3436–3440. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Cited by: §2.
  • W. Y. Wang and D. Yang (2015) That’s so annoying!!!: a lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2557–2563. Cited by: §2.
  • J. Wei and K. Zou (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6383–6389. Cited by: AEDA: An Easier Data Augmentation Technique for Text Classification, §1, §4, §7.
  • X. Wu, S. Lv, L. Zang, J. Han, and S. Hu (2019) Conditional bert contextual augmentation. In International Conference on Computational Science, pp. 84–95. Cited by: §2.
  • Z. Xie, S. I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, and A. Y. Ng (2019) Data noising as smoothing in neural network language models. In 5th International Conference on Learning Representations, ICLR 2017, Cited by: §1, §2.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pp. 649–657. Cited by: §1.

9 Supplementary Material

9.1 Example Augmentations

Original   a sad , superior human comedy played out on the back roads of life .
Aug 1      a sad , superior human comedy played out on the back roads ; of life ; .
Aug 2      a , sad . , superior human ; comedy . played . out on the back roads of life .
Aug 3      : a sad ; , superior ! human : comedy , played out ? on the back roads of life .

Table 3: Examples of augmented data using AEDA technique.

9.2 Benchmark Datasets

Dataset   N_c   L     N_train   N_test   |V|
SST-2     2     19    7791      1821     15771
CR        2     19    4067      451      9048
SUBJ      2     25    9000      1000     22715
TREC      6     10    5452      500      9448
PC        2     7     40000     5806     26090

Table 4: Statistics of the utilized datasets. N_c: number of classes, L: average sentence length, N_train: number of training samples, N_test: number of test samples, |V|: number of unique words.