Leveraging Large-scale Multimedia Datasets to Refine Content Moderation Models

12/01/2022
by Ioannis Sarridis, et al.

The sheer volume of online user-generated content has rendered content moderation technologies essential in order to protect digital platform audiences from content that may cause anxiety, worry, or concern. Despite efforts to develop automated solutions to this problem, creating accurate models remains challenging due to the lack of adequate task-specific training data. This limitation is directly related to the fact that manually annotating such data is a highly demanding procedure that could severely affect the annotators' emotional well-being. In this paper, we propose the CM-Refinery framework, which leverages large-scale multimedia datasets to automatically extend initial training datasets with hard examples that can refine content moderation models, while significantly reducing the involvement of human annotators. We apply our method to two model adaptation strategies designed with respect to the different challenges observed while collecting data, i.e., lack of (i) task-specific negative data or (ii) both positive and negative data. Additionally, we introduce a diversity criterion applied to the data collection process that further enhances the generalization performance of the refined models. The proposed method is evaluated on the Not Safe for Work (NSFW) and disturbing content detection tasks on benchmark datasets achieving 1.32 of the art, respectively. Finally, it significantly reduces human involvement, as 92.54 while no human intervention is required for the NSFW task.


