The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification

07/30/2020
by Mihaela Gaman, et al.

In this work, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task included two sub-task types: one that consisted of discriminating between the Moldavian and the Romanian dialects, and one that consisted of classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g., the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates than machine learning (ML) models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, e.g., when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles versus tweets). We also analyze the most discriminative features of the best-performing models, providing some explanations for the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the machine learning performance on the MRC shared task can be improved through an ensemble based on classifier stacking.
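The macro F1 score quoted above is the unweighted mean of the per-class F1 scores, so both dialects count equally regardless of how many samples each has. The following is a minimal sketch of that computation, not the shared task's evaluation script; the label names `MD` and `RO`, the function name `macro_f1`, and the toy predictions are assumptions made purely for illustration.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        # Count true positives, false positives and false negatives for this class.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: "MD" = Moldavian, "RO" = Romanian.
y_true = ["MD", "MD", "MD", "RO", "RO", "RO"]
y_pred = ["MD", "MD", "RO", "RO", "RO", "MD"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because each class contributes equally to the average, a model that favors one dialect is penalized even when its overall accuracy looks high.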


