GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation

11/18/2022
by   Biyang Guo, et al.
0

We introduce GENIUS: a conditional text generation model using sketches as input, which can fill in the missing contexts for a given sketch (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction from sketch objective using an extreme and selective masking strategy, enabling it to generate diverse and high-quality texts given sketches. Comparison with other competitive conditional language models (CLMs) reveals the superiority of GENIUS's text generation quality. We further show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks. Most existing textual data augmentation methods are either too conservative, by making small changes to the original text, or too aggressive, by creating entirely new samples. With GENIUS, we propose GeniusAug, which first extracts the target-aware sketches from the original training set and then generates new samples based on the sketches. Empirical experiments on 6 text classification datasets show that GeniusAug significantly improves the models' performance in both in-distribution (ID) and out-of-distribution (OOD) settings. We also demonstrate the effectiveness of GeniusAug on named entity recognition (NER) and machine reading comprehension (MRC) tasks. (Code and models are publicly available at https://github.com/microsoft/SCGLab and https://github.com/beyondguo/genius)

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/26/2020

iNLTK: Natural Language Toolkit for Indic Languages

We present iNLTK, an open-source NLP library consisting of pre-trained l...
research
05/18/2023

BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER

Biomedical Named Entity Recognition (BioNER) is the fundamental task of ...
research
06/01/2023

ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER

Complex Named Entity Recognition (NER) is the task of detecting linguist...
research
04/24/2022

EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification

Recent works have empirically shown the effectiveness of data augmentati...
research
06/15/2021

SSMix: Saliency-Based Span Mixup for Text Classification

Data augmentation with mixup has shown to be effective on various comput...
research
02/17/2023

Entry Separation using a Mixed Visual and Textual Language Model: Application to 19th century French Trade Directories

When extracting structured data from repetitively organized documents, s...
research
03/08/2023

Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference

Large Language Models (LLMs) like GPT-3 have sparked significant interes...

Please sign up or login with your details

Forgot password? Click here to reset