Reprint: a randomized extrapolation based on principal components for data augmentation

04/26/2022
by Jiale Wei, et al.

Data scarcity and data imbalance have attracted considerable attention in many fields. Data augmentation, explored as an effective approach to tackle them, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of the samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class, using subspaces spanned by principal components to summarize the distribution structure of both the source and target classes. Consequently, the generated examples diversify the target class while maintaining the original geometry of the target distribution. In addition, the method includes a label refinement component that synthesizes new soft labels for the augmented examples. Compared with different NLP data augmentation approaches under a range of data-imbalanced scenarios on four text classification benchmarks, REPRINT shows prominent improvements. Moreover, through comprehensive ablation studies, we show that label refinement is better than label preservation for augmented examples, and that our method yields stable and consistent improvements given suitable choices of principal components. Finally, REPRINT is appealing for its ease of use, since it has only one hyperparameter, which determines the dimension of the subspace, and requires few computational resources.
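
The abstract does not state the exact update rule, so the following is only a rough sketch of the general idea (principal-component subspaces per class, randomized extrapolation into the target class, soft labels), not the authors' algorithm. The function names, the way source coefficients are mapped onto the target's principal directions, the uniform random scale, and the fixed mixing weight `alpha` are all assumptions made for illustration; only the subspace dimension `k` corresponds to the single hyperparameter mentioned above.

```python
import numpy as np


def principal_subspace(X, k):
    """Return the class mean and the top-k principal directions (as rows) of X."""
    mu = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]


def extrapolate(source, target, k, n_new, rng,
                num_classes, src_label, tgt_label, alpha=0.9):
    """Hypothetical randomized extrapolation from a source class to a target class.

    source, target: (n, d) arrays of hidden representations for each class.
    k: subspace dimension (must not exceed min(n, d) of either class).
    alpha: assumed weight of the target class in the soft label.
    """
    mu_s, V_s = principal_subspace(source, k)
    mu_t, V_t = principal_subspace(target, k)

    # Pick random source samples and express their deviation from the
    # source mean in the source principal subspace.
    idx = rng.integers(0, len(source), size=n_new)
    coeffs = (source[idx] - mu_s) @ V_s.T          # shape (n_new, k)

    # Assumption: reuse those coefficients along the target's principal
    # directions, with a random scale, and add them to the target mean.
    scale = rng.uniform(0.0, 1.0, size=(n_new, 1))
    new_x = mu_t + scale * (coeffs @ V_t)          # shape (n_new, d)

    # Assumption: soft labels mix the target and source classes.
    soft = np.zeros((n_new, num_classes))
    soft[:, tgt_label] = alpha
    soft[:, src_label] += 1.0 - alpha
    return new_x, soft


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(50, 8))              # majority (source) class
    target = rng.normal(size=(10, 8)) + 3.0        # minority (target) class
    new_x, soft = extrapolate(source, target, k=3, n_new=5, rng=rng,
                              num_classes=2, src_label=0, tgt_label=1)
    print(new_x.shape, soft[0])
```

In this sketch every generated point deviates from the target mean only along the target's top-`k` principal directions, which is one simple way to diversify the target class without leaving the subspace that summarizes its geometry.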

research
01/14/2021

Text Augmentation in a Multi-Task View

Traditional data augmentation aims to increase the coverage of the input...
research
09/12/2021

Good-Enough Example Extrapolation

This paper asks whether extrapolating the hidden space distribution of t...
research
10/14/2019

Rethinking Data Augmentation: Self-Supervision and Self-Distillation

Data augmentation techniques, e.g., flipping or cropping, which systemat...
research
09/12/2022

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

This paper proposes a simple yet effective interpolation-based data augm...
research
11/17/2021

Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification

Data augmentation techniques are widely used for enhancing the performan...
research
06/06/2023

Augmenting Reddit Posts to Determine Wellness Dimensions impacting Mental Health

Amid ongoing health crisis, there is a growing necessity to discern poss...
research
07/27/2023

TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation

Recent label mix-based augmentation methods have shown their effectivene...
