Text Data Augmentation: Towards better detection of spear-phishing emails

by Mehdi Regina, et al.

Text data augmentation, i.e. the creation of synthetic textual data from an original text, is challenging because augmentation transformations must account for language complexity while remaining relevant to the target Natural Language Processing (NLP) task (e.g. Machine Translation, Question Answering, Text Classification). Motivated by a business application, Business Email Compromise (BEC) detection, we propose a corpus- and task-agnostic text augmentation framework that combines several methods: the BERT language model, multi-step back-translation, and heuristics. We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora (SST2 and TREC), as well as on a BEC detection task. We also provide a comprehensive discussion of the limitations of our augmentation framework.
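To make the idea of combining several augmentation methods concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the methods are combined, so the uniform random choice between transforms, the adjacent-word swap used as an example heuristic, and all function names (`augment`, `swap_adjacent_words`, `make_replacement_transform`, `make_back_translation_transform`) are assumptions made for illustration. The language model and the translators are passed in as plain callables, so a real masked language model such as BERT or a real translation system would have to be plugged in by the reader.

```python
import random
from typing import Callable, List, Sequence

# A transform takes a text and a random generator and returns a new text.
Transform = Callable[[str, random.Random], str]


def swap_adjacent_words(text: str, rng: random.Random) -> str:
    """Hypothetical heuristic: swap one randomly chosen pair of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def make_replacement_transform(
    propose: Callable[[List[str], int], str]
) -> Transform:
    """Build a word-replacement transform.

    `propose(words, index)` is a stand-in for a masked language model:
    it returns a replacement for the word at `index` given its context.
    """
    def transform(text: str, rng: random.Random) -> str:
        words = text.split()
        if not words:
            return text
        i = rng.randrange(len(words))
        words[i] = propose(words, i)
        return " ".join(words)
    return transform


def make_back_translation_transform(
    translators: Sequence[Callable[[str], str]]
) -> Transform:
    """Build a multi-step back-translation transform.

    `translators` is a chain of translation functions whose last step
    returns to the source language (e.g. en->fr, fr->de, de->en).
    """
    def transform(text: str, rng: random.Random) -> str:
        for translate in translators:
            text = translate(text)
        return text
    return transform


def augment(
    text: str, transforms: Sequence[Transform], n_samples: int, seed: int = 0
) -> List[str]:
    """Create `n_samples` synthetic texts, picking one transform per sample."""
    rng = random.Random(seed)
    return [rng.choice(transforms)(text, rng) for _ in range(n_samples)]
```

Because each method is just a callable with the same signature, the framework stays independent of the corpus and the downstream task, and individual methods can be added or removed without touching `augment`.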



Performance of Data Augmentation Methods for Brazilian Portuguese Text Classification

Improving machine learning performance while increasing model generaliza...

An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution

One critical issue of zero anaphora resolution (ZAR) is the scarcity of ...

I-WAS: a Data Augmentation Method with GPT-2 for Simile Detection

Simile detection is a valuable task for many natural language processing...

Differentiable Retrieval Augmentation via Generative Language Modeling for E-commerce Query Intent Classification

Retrieval augmentation, which enhances downstream models by a knowledge ...

PnPOOD: Out-Of-Distribution Detection for Text Classification via Plug and Play Data Augmentation

While Out-of-distribution (OOD) detection has been well explored in comp...

Multilingual Augmenter: The Model Chooses

Natural Language Processing (NLP) relies heavily on training data. Trans...

Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Before entering the neural network, a token is generally converted to th...