On Cross-Dataset Generalization in Automatic Detection of Online Abuse

10/14/2020
by Isar Nejadgholi, et al.

NLP research has attained high performance in abusive language detection as a supervised classification task. In research settings, training and test datasets are usually obtained from similar data samples; in practice, however, systems are often applied to data that differ from the training set in topic and class distributions. The ambiguity in class definitions inherent in this task further aggravates the discrepancies between source and target datasets. We explore topic bias and task formulation bias in cross-dataset generalization. We show that the benign examples in the Wikipedia Detox dataset are biased towards platform-specific topics. We identify these examples using unsupervised topic modeling and manual inspection of the topics' keywords. Removing these topics increases cross-dataset generalization without reducing in-domain classification performance. For robust dataset design, we suggest applying inexpensive unsupervised methods to inspect the collected data and downsize the non-generalizable content before manually annotating for class labels.
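The filtering step described above can be illustrated with a minimal sketch. It assumes a topic model has already been fitted and has produced one topic-weight vector per document, and that a human has flagged some topic indices as platform-specific after inspecting their keywords. The function names, the "dominant topic" rule, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set


def dominant_topic(topic_weights: Sequence[float]) -> int:
    """Return the index of the highest-weight topic for one document.

    Ties are resolved in favour of the lowest topic index.
    """
    return max(range(len(topic_weights)), key=lambda k: topic_weights[k])


def drop_flagged_topics(
    docs: Sequence[str],
    doc_topic_weights: Sequence[Sequence[float]],
    flagged_topics: Set[int],
) -> List[str]:
    """Keep only documents whose dominant topic was not flagged.

    Assumption: assigning each document to its single dominant topic is
    an illustrative simplification, not necessarily the paper's rule.
    """
    if len(docs) != len(doc_topic_weights):
        raise ValueError("docs and doc_topic_weights must have equal length")
    return [
        doc
        for doc, weights in zip(docs, doc_topic_weights)
        if dominant_topic(weights) not in flagged_topics
    ]


# Toy data (hypothetical): two topics, where topic 1 is assumed to have
# been flagged as platform-specific during manual keyword inspection.
docs = ["a", "b", "c", "d"]
weights = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
kept = drop_flagged_topics(docs, weights, flagged_topics={1})
print(kept)
```

Because the filter needs only topic weights and a manually chosen set of topic indices, it can run before any class labels exist, which is the ordering the abstract recommends.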


