Can string kernels pass the test of time in Native Language Identification?

07/26/2017
by   Radu Tudor Ionescu, et al.
0

We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the shared task organizers. For the learning stage, we choose Kernel Discriminant Analysis (KDA) over Kernel Ridge Regression (KRR), because the former classifier obtains better results than the latter one on the development set. In our previous work, we have used a similar machine learning approach to achieve state-of-the-art NLI results. The goal of this paper is to demonstrate that our shallow and simple approach based on string kernels (with minor improvements) can pass the test of time and reach state-of-the-art performance in the 2017 NLI shared task, despite the recent advances in natural language processing. We participated in all three tracks, in which the competitors were allowed to use only the essays (essay track), only the speech transcripts (speech track), or both (fusion track). Using only the data provided by the organizers for training our models, we have reached a macro F1 score of 86.95 of 87.55 closed fusion track. With these scores, our team (UnibucKernel) ranked in the first group of teams in all three tracks, while attaining the best scores in the speech and the fusion tracks.

READ FULL TEXT
research
05/13/2018

UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row

We present a machine learning approach that ranked on the first place in...
research
07/22/2017

Native Language Identification on Text and Speech

This paper presents an ensemble system combining the output of multiple ...
research
08/25/2018

Improving the results of string kernels in sentiment analysis and Arabic dialect identification by adapting them to your test set

Recently, string kernels have obtained state-of-the-art results in vario...
research
10/29/2020

Unbabel's Participation in the WMT20 Metrics Shared Task

We present the contribution of the Unbabel team to the WMT 2020 Shared T...
research
04/21/2018

Automated essay scoring with string kernels and word embeddings

In this work, we present an approach based on combining string kernels a...
research
09/09/2018

SHOMA at Parseme Shared Task on Automatic Identification of VMWEs: Neural Multiword Expression Tagging with High Generalisation

This paper presents a language-independent deep learning architecture ad...
research
12/28/2016

A Deep Learning Approach To Multiple Kernel Fusion

Kernel fusion is a popular and effective approach for combining multiple...

Please sign up or login with your details

Forgot password? Click here to reset