BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service

08/21/2023
by Inez Okulska, et al.

Advances in automated detection of offensive language online, including hate speech and cyberbullying, require improved access to publicly available datasets comprising social media content. In this paper, we introduce BAN-PL, the first open dataset in the Polish language that encompasses texts flagged as harmful and subsequently removed by professional moderators. The dataset contains a total of 691,662 pieces of content, both posts and comments, from Wykop, a popular social networking service often referred to as the "Polish Reddit", and is evenly split between two classes: "harmful" and "neutral". We provide a comprehensive description of the data collection and preprocessing procedures and highlight the linguistic specificity of the data. The BAN-PL dataset, along with advanced preprocessing scripts for, among other things, unmasking profanities, will be publicly available.

