HQP: A Human-Annotated Dataset for Detecting Online Propaganda

04/28/2023
by Abdurahman Maarouf, et al.

Online propaganda poses a severe threat to the integrity of societies. However, existing datasets for detecting online propaganda have a key limitation: they were annotated using weak labels that can be noisy and even incorrect. To address this limitation, our work makes the following contributions: (1) We present HQP: a novel dataset (N=30,000) for detecting online propaganda with high-quality labels. To the best of our knowledge, HQP is the first dataset for detecting online propaganda that was created through human annotation. (2) We show empirically that state-of-the-art language models fail to detect online propaganda when trained with weak labels (AUC: 64.03). In contrast, state-of-the-art language models can accurately detect online propaganda when trained with our high-quality labels (AUC: 92.25), which is an improvement of roughly 44%. (3) We extend our work to few-shot learning. Specifically, we show that prompt-based learning using a small sample of high-quality labels can still achieve reasonable performance (AUC: 80.27). Finally, we discuss implications for the NLP community to balance the cost and quality of labeling. Crucially, our work highlights the importance of high-quality labels for sensitive NLP tasks such as propaganda detection.
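
The size of the improvement can be checked against the two AUC values reported in the abstract. The following is a minimal sketch of that arithmetic, assuming the improvement is measured relative to the weak-label baseline; the function and variable names are illustrative and do not come from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# AUC values as reported in the abstract (on a 0-100 scale).
auc_weak_labels = 64.03          # model trained with weak labels
auc_high_quality_labels = 92.25  # model trained with HQP's human-annotated labels

gain = relative_improvement(auc_weak_labels, auc_high_quality_labels)
print(f"Relative AUC improvement: {gain:.1f}%")
```

Under this assumption, the absolute gain of 28.22 AUC points corresponds to a relative gain of about 44% over the weak-label baseline.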
