KOLD: Korean Offensive Language Dataset

05/23/2022
by   Younghoon Jeong, et al.

Although considerable attention has been paid to hate speech detection, most of the work has been done in English and does not readily apply to other languages. To fill this gap, we present the Korean Offensive Language Dataset (KOLD), 40k comments labeled with offensiveness, target, and targeted group information. We also collect two types of spans, the offensive span and the target span, which justify the categorization decision within the text. By comparing the distribution of targeted groups with that of an existing English dataset, we point out the need for a hate speech dataset tailored to the language, one that best reflects its culture. We report the baseline performance of models built on top of large pretrained language models and trained on our dataset. We also show that title information serves as context and helps discern the target of hatred, especially when the target is omitted from the comment.
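As a rough illustration of the annotation layers described above, the following Python sketch shows one way a labeled comment could be represented and how the title could be prepended as context. It is a minimal sketch under stated assumptions, not the dataset's actual schema: the class name, field names, label strings, character-offset span encoding, and the " [SEP] " separator are all hypothetical, and the example text is an invented English placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical record layout; the real dataset's field names and
# span encoding may differ.
@dataclass
class LabeledComment:
    title: str                                       # title of the page the comment was posted under
    comment: str                                     # the comment text itself
    offensive: bool                                  # offensiveness label
    target: Optional[str] = None                     # target label (placeholder string)
    target_group: Optional[str] = None               # targeted group label (placeholder string)
    offensive_span: Optional[Tuple[int, int]] = None  # [start, end) character offsets
    target_span: Optional[Tuple[int, int]] = None     # [start, end) character offsets


def span_text(text: str, span: Optional[Tuple[int, int]]) -> Optional[str]:
    """Return the substring covered by a [start, end) span, or None if absent."""
    if span is None:
        return None
    start, end = span
    return text[start:end]


def build_input(example: LabeledComment, use_title: bool = True,
                sep: str = " [SEP] ") -> str:
    """Build the classifier input, optionally prepending the title as context."""
    if use_title:
        return example.title + sep + example.comment
    return example.comment


# Invented placeholder example: the comment alone does not say who "They" are,
# so the title supplies the missing context.
example = LabeledComment(
    title="City council votes on new policy",
    comment="They are all useless.",
    offensive=True,
    target="group",          # placeholder label
    target_group=None,       # left unset rather than guessing a category
    offensive_span=(13, 20),
    target_span=(0, 4),
)

print(span_text(example.comment, example.offensive_span))
print(span_text(example.comment, example.target_span))
print(build_input(example))
```

Storing spans as character offsets keeps them independent of any particular tokenizer; a model that predicts spans at the token level would need an extra alignment step, which this sketch leaves out.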

research
12/13/2021

WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

Recently, large pretrained language models (LMs) have gained popularity....
research
08/27/2021

Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling

Social media has effectively become the prime hub of communication and d...
research
10/07/2021

mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer

The translation of natural language questions to SQL queries has attract...
research
05/22/2023

PrOnto: Language Model Evaluations for 859 Languages

Evaluation datasets are critical resources for measuring the quality of ...
research
05/24/2023

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Joint speech-language training is challenging due to the large demand fo...
research
03/02/2020

Identification of primary and collateral tracks in stuttered speech

Disfluent speech has been previously addressed from two main perspective...
research
05/22/2023

llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology

This study constructed a Japanese chat dataset for tuning large language...
