Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation

05/25/2022
by   Dheeraj Mekala, et al.

Manually annotating datasets requires domain experts to read through many documents and carefully label them, which is often expensive. Recently, pre-trained generative language models (GLMs) have demonstrated exceptional abilities in generating text, which motivates leveraging them for generative data augmentation. We improve generative data augmentation by formulating data generation as a context generation task and using question answering (QA) datasets for intermediate training. Specifically, we view QA as a format rather than a task and train GLMs as context generators for a given question and its respective answer. We then cast downstream tasks into a question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLMs to generate relevant contexts, which serve as synthetic training data for the corresponding tasks. We perform extensive experiments, case studies, and ablation studies on multiple sentiment and topic classification datasets and demonstrate substantial improvements in performance in both few-shot and zero-shot settings. Remarkably, on the SST-2 dataset, intermediate training on the SocialIQA dataset achieves an improvement of 40%. We further observe that QA datasets that require high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.
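
The approach can be summarized as: cast a classification label into QA form (a question plus its answer), prompt a context-generator GLM for a plausible context, and pair the generated context with the label as synthetic training data. The sketch below is a minimal illustration assuming Hugging Face transformers; the prompt template, the label-to-answer mapping, and the use of "gpt2" as a stand-in for a GLM that has been intermediate-trained as a context generator on a QA dataset (e.g., SocialIQA) are all illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of generative data augmentation via context generation.
# Assumptions: the prompt template and label-to-answer mapping are illustrative;
# "gpt2" stands in for a GLM already intermediate-trained as a context generator
# on a QA dataset and adapted to the target task domain.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_synthetic_example(question: str, answer: str) -> str:
    """Cast a downstream label into QA format and ask the GLM for a context."""
    prompt = f"question: {question} answer: {answer} context:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,        # sampling yields diverse synthetic contexts
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text[len(prompt):].strip()  # keep only the generated context

# Example: augmenting a binary sentiment task (labels become answers).
question = "What is the sentiment of this movie review?"
synthetic_data = [
    (generate_synthetic_example(question, label), label)
    for label in ["positive", "negative"]
]
print(synthetic_data)  # (synthetic review, label) pairs for training a classifier
```

In this sketch the generated contexts play the role of unlabeled documents the annotator never had to read: each one inherits its label from the answer used in the prompt, so the pairs can be fed directly to a downstream classifier in few-shot or zero-shot settings.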


Related research

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training (10/14/2021)
With the rise of large-scale pre-trained language models, open-domain qu...

Gotta: Generative Few-shot Question Answering by Prompt-based Cloze Data Augmentation (06/07/2023)
Few-shot question answering (QA) aims at precisely discovering answers t...

Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies (04/19/2022)
Prior studies in privacy policies frame the question answering (QA) task...

Encoding Explanatory Knowledge for Zero-shot Science Question Answering (05/12/2021)
This paper describes N-XKT (Neural encoding based on eXplanatory Knowled...

Question Answering Infused Pre-training of General-Purpose Contextualized Representations (06/15/2021)
This paper proposes a pre-training objective based on question answering...

Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models (05/31/2023)
Visual Question Answering is a challenging task, as it requires seamless...

Self-supervised Knowledge Triplet Learning for Zero-shot Question Answering (05/01/2020)
The aim of all Question Answering (QA) systems is to be able to generali...
