A simple language-agnostic yet very strong baseline system for hate speech and offensive content identification

02/05/2022
by Yves Bestgen, et al.

To automatically identify hate speech and offensive content in tweets, the SATLab team proposes a system based on a classical supervised algorithm fed only with character n-grams, which makes it completely language-agnostic. After optimization of the feature weighting and the classifier parameters, it reached, in the multilingual HASOC 2021 challenge, a medium performance level in English, the language for which it is easy to develop deep learning approaches relying on many external linguistic resources, but a far better level for the two less-resourced languages, Hindi and Marathi. It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches. These results suggest that it is a useful reference level against which to evaluate the benefits of more complex approaches, such as deep learning, or of complementary resources.
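To make the idea of a classifier fed only with character n-grams concrete, here is a minimal sketch using only the Python standard library. It is not the SATLab system: the abstract does not name the supervised algorithm or the feature weighting, so multinomial naive Bayes with raw counts stands in for both. The names `char_ngrams` and `CharNgramNB`, the n-gram range of 1 to 4, and the toy sentences and labels are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=4):
    """Return all character n-grams of length n_min..n_max.

    No tokenizer, stemmer, or lexicon is involved, which is what makes
    the features language-agnostic. The text is lowercased and padded
    with one space on each side (an assumption of this sketch).
    """
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class CharNgramNB:
    """Multinomial naive Bayes over raw character n-gram counts.

    A stand-in classifier: the abstract only says "classical supervised
    algorithm" and does not specify which one, nor the weighting scheme.
    """

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)         # label -> number of texts
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {
            label: sum(self.ngram_counts[label].values())
            for label in self.doc_counts
        }
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        # n-grams never seen in training are ignored.
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + vocab_size
            for g in grams:
                score += math.log((self.ngram_counts[label][g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only (not from HASOC 2021).
train_texts = [
    "you are an idiot",
    "what an idiot",
    "have a nice day",
    "nice weather today",
]
train_labels = ["OFF", "OFF", "NOT", "NOT"]

model = CharNgramNB().fit(train_texts, train_labels)
print(model.predict("idiot!!"), model.predict("nice day"))
```

Because the features are plain substrings, the same code would apply unchanged to Hindi or Marathi text. A tuned system like the one described would additionally weight the features and optimize the classifier parameters, which this sketch leaves out.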


