Probabilistic Linguistic Knowledge and Token-level Text Augmentation

06/29/2023
by Zhengxiang Wang et al.

This paper investigates the effectiveness of token-level text augmentation and the role of probabilistic linguistic knowledge within a linguistically motivated evaluation context. Two text augmentation programs, REDA and REDA_NG, were developed, both implementing five token-level text editing operations: Synonym Replacement (SR), Random Swap (RS), Random Insertion (RI), Random Deletion (RD), and Random Mix (RM). REDA_NG additionally leverages pretrained n-gram language models to select the most likely augmented texts from REDA's output. Comprehensive and fine-grained experiments were conducted on a binary question matching classification task in both Chinese and English. The results strongly refute the general effectiveness of the five token-level text augmentation techniques under investigation, whether applied together or separately, and irrespective of the classification model type used, including transformers. Furthermore, the role of probabilistic linguistic knowledge is found to be minimal.
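The five editing operations named in the abstract can be sketched as simple list transformations. The code below is a minimal illustration, not the paper's actual REDA implementation: the synonym table, deletion probability, and the `select_most_likely` reranking helper (standing in for REDA_NG's n-gram scoring) are all assumptions for demonstration.

```python
import random

# Toy synonym table standing in for a real lexical resource;
# the abstract does not specify which synonym source REDA uses.
SYNONYMS = {"quick": ["fast", "rapid"], "test": ["exam", "quiz"]}

def synonym_replacement(tokens, rng):
    """SR: replace one token that has a synonym, if any exists."""
    out = list(tokens)
    idxs = [i for i, t in enumerate(out) if t in SYNONYMS]
    if idxs:
        i = rng.choice(idxs)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, rng):
    """RS: swap two randomly chosen token positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insertion(tokens, rng):
    """RI: insert a synonym of a random token at a random position."""
    out = list(tokens)
    with_syn = [t for t in out if t in SYNONYMS]
    if with_syn:
        word = rng.choice(with_syn)
        out.insert(rng.randrange(len(out) + 1), rng.choice(SYNONYMS[word]))
    return out

def random_deletion(tokens, rng, p=0.2):
    """RD: drop each token with probability p, keeping at least one."""
    out = [t for t in tokens if rng.random() > p]
    return out or [rng.choice(tokens)]

def random_mix(tokens, rng):
    """RM: apply a random subset of the other four operations in sequence."""
    ops = [synonym_replacement, random_swap, random_insertion, random_deletion]
    out = list(tokens)
    for op in rng.sample(ops, rng.randint(1, len(ops))):
        out = op(out, rng)
    return out

def select_most_likely(candidates, logprob):
    """REDA_NG-style step (sketch): keep the candidate an external
    language model scores highest; `logprob` is a caller-supplied scorer."""
    return max(candidates, key=logprob)
```

A usage note: each operation takes and returns a token list, so augmented candidates can be generated by repeated application and then filtered by a language-model scorer, which is the pipeline shape the abstract describes for REDA_NG.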

