UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row

05/13/2018
by Andrei M. Butnaru, et al.

We present a machine learning approach that ranked first in the Arabic Dialect Identification (ADI) Closed Shared Tasks of the 2018 VarDial Evaluation Campaign. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech or phonetic transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the organizers. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Preliminary experiments indicate that KRR provides better classification results. Our approach is shallow and simple, but the empirical results obtained in the 2018 ADI Closed Shared Task show that it achieves the best performance. Furthermore, our top macro-F1 score (58.92%) is significantly better than the second best score (57.59%), according to the statistical significance test performed by the organizers. With a very similar approach (that did not include phonetic features), we also ranked first in the ADI Closed Shared Tasks of the 2017 VarDial Evaluation Campaign, surpassing the second best method by 4.62%. We therefore conclude that our multiple kernel learning method is the best approach for Arabic dialect identification.
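To make the pipeline described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a spectrum-style kernel (dot product of character p-gram counts), a linear kernel on i-vector-like features, kernel combination by simple summation after normalization, and one-vs-all KRR in its dual form. All function names, the toy transcripts, the 2-dimensional stand-in vectors, and the regularization value `lam` are hypothetical and chosen only for illustration.

```python
from collections import Counter
import numpy as np

def pgram_kernel(docs, p=3):
    """Spectrum-style kernel: dot product of character p-gram counts."""
    counts = [Counter(d[i:i + p] for i in range(len(d) - p + 1)) for d in docs]
    n = len(docs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            # Counter returns 0 for p-grams that are absent from document j.
            K[i, j] = K[j, i] = sum(c * counts[j][g] for g, c in counts[i].items())
    return K

def normalize(K):
    """Scale K so every diagonal entry is 1 (assumes no all-zero samples)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def krr_fit(K_train, labels, n_classes, lam=0.1):
    """Dual KRR with one-vs-all targets in {-1, +1}; lam is an assumed value."""
    Y = -np.ones((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return np.linalg.solve(K_train + lam * np.eye(len(labels)), Y)

def krr_predict(K_test_train, alpha):
    """Pick the class with the highest regression score."""
    return np.argmax(K_test_train @ alpha, axis=1)

# Toy data (hypothetical): two "dialects" with distinct character patterns.
train_docs = ["abab abab", "baba abab", "xyxy xyxy", "yxyx xyxy"]
test_docs = ["abab", "xyxy"]
labels = np.array([0, 0, 1, 1])
# Stand-ins for i-vectors; real i-vectors have far more dimensions.
train_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
test_vecs = np.array([[1.0, 0.1], [0.1, 1.0]])

# Compute each kernel over train and test samples together, normalize, then sum.
docs = train_docs + test_docs
vecs = np.vstack([train_vecs, test_vecs])
K = normalize(pgram_kernel(docs, p=3)) + normalize(vecs @ vecs.T)

n = len(train_docs)
alpha = krr_fit(K[:n, :n], labels, n_classes=2)
pred = krr_predict(K[n:, :n], alpha)
print(pred)
```

Summing normalized kernel matrices corresponds to concatenating the underlying feature maps, which is one simple way to combine textual and audio representations. Weighted combinations or other p-gram kernels could be substituted without changing the rest of the sketch.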

