Text Data Augmentation: Towards better detection of spear-phishing emails

07/04/2020
by Mehdi Regina, et al.

Text data augmentation, i.e. the creation of synthetic textual data from an original text, is challenging because augmentation transformations must account for the complexity of language while remaining relevant to the target Natural Language Processing (NLP) task (e.g. Machine Translation, Question Answering, Text Classification). Motivated by a business application, the detection of Business Email Compromise (BEC), we propose a corpus- and task-agnostic text augmentation framework that combines several methods: the BERT language model, multi-step back-translation, and heuristics. We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora (SST2 and TREC), as well as on a BEC detection task. We also provide a detailed discussion of the limitations of our augmentation framework.
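To make the idea of combining several augmentation transformations concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy `SYNONYMS` lexicon, the two heuristic transforms, and the `augment` function are all hypothetical names introduced for illustration, and simple string heuristics stand in for the BERT-based and back-translation methods mentioned in the abstract.

```python
import random
from typing import Callable, List

# Hypothetical toy lexicon (an assumption for this sketch); a real system
# would rely on a language model or a curated thesaurus instead.
SYNONYMS = {
    "urgent": ["pressing", "time-sensitive"],
    "payment": ["transfer", "wire"],
    "send": ["forward", "dispatch"],
}

# A transformation maps a text (plus a random source) to a new text.
Transform = Callable[[str, random.Random], str]


def replace_synonyms(text: str, rng: random.Random) -> str:
    """Replace every word found in the toy lexicon with a random synonym."""
    out = []
    for word in text.split():
        choices = SYNONYMS.get(word.lower())
        out.append(rng.choice(choices) if choices else word)
    return " ".join(out)


def swap_contractions(text: str, rng: random.Random) -> str:
    """Contract a few common expanded forms (rng is unused here)."""
    pairs = {"I am": "I'm", "do not": "don't", "cannot": "can't"}
    for expanded, contracted in pairs.items():
        text = text.replace(expanded, contracted)
    return text


def augment(text: str, transforms: List[Transform], n: int, seed: int = 0) -> List[str]:
    """Return up to n distinct synthetic variants of text.

    Each variant is produced by applying a random non-empty subset of the
    transforms in random order. Variants equal to the original or to an
    earlier variant are discarded.
    """
    rng = random.Random(seed)
    seen = {text}
    variants: List[str] = []
    for _ in range(n * 10):  # bounded number of attempts
        if len(variants) >= n:
            break
        candidate = text
        k = rng.randint(1, len(transforms))
        for transform in rng.sample(transforms, k):
            candidate = transform(candidate, rng)
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants
```

Calling `augment("I am waiting, please send the urgent payment today", [replace_synonyms, swap_contractions], n=3)` yields up to three distinct rewrites of the sentence. In a classification setting each variant would inherit the label of its source text, which is only safe if the transformations preserve the label.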


research
04/05/2023

Performance of Data Augmentation Methods for Brazilian Portuguese Text Classification

Improving machine learning performance while increasing model generaliza...
research
11/02/2020

An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution

One critical issue of zero anaphora resolution (ZAR) is the scarcity of ...
research
08/08/2023

I-WAS: a Data Augmentation Method with GPT-2 for Simile Detection

Simile detection is a valuable task for many natural language processing...
research
08/18/2023

Differentiable Retrieval Augmentation via Generative Language Modeling for E-commerce Query Intent Classification

Retrieval augmentation, which enhances downstream models by a knowledge ...
research
10/31/2021

PnPOOD: Out-Of-Distribution Detection for Text Classification via Plug and Play Data Augmentation

While Out-of-distribution (OOD) detection has been well explored in comp...
research
02/19/2021

Multilingual Augmenter: The Model Chooses

Natural Language Processing (NLP) relies heavily on training data. Trans...
research
02/28/2022

Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Before entering the neural network, a token is generally converted to th...
