Unravelling Interlanguage Facts via Explainable Machine Learning

08/02/2022
by Barbara Berti, et al.

Native language identification (NLI) is the task of training, via supervised machine learning, a classifier that guesses the native language (L1) of the author of a text. The task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved. We focus on a different facet of NLI: analysing the internals of an NLI classifier trained by an explainable machine learning algorithm in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena “give a speaker's native language away”. We take this perspective to tackle both NLI and a much less researched companion task, i.e., guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, i.e., are most indicative of a speaker's L1. We also present two case studies, one on Spanish and one on Italian learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s. Overall, our study shows that the use of explainable machine learning can be a valuable tool for th
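
The idea of reading explanations off a classifier's internals can be illustrated with a minimal sketch. The abstract does not name the learning algorithm or the feature set, so the code below swaps in a smoothed naive Bayes log-odds model over word tokens as a stand-in. The function names (`train_log_odds`, `top_features`, `classify`), the `ES`/`OTHER` labels, and the four toy sentences are all invented for illustration and are not taken from the paper or its datasets.

```python
import math
from collections import Counter


def train_log_odds(docs, labels, positive):
    """Fit add-one-smoothed token log-odds for `positive` vs. all other labels.

    Hypothetical helper: a stand-in for an explainable linear classifier,
    not the method used in the paper.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(docs, labels):
        target = pos_counts if label == positive else neg_counts
        target.update(text.lower().split())
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    return {
        tok: math.log((pos_counts[tok] + 1) / pos_total)
        - math.log((neg_counts[tok] + 1) / neg_total)
        for tok in vocab
    }


def top_features(weights, k=3):
    """Return the k tokens whose weights point most strongly to the positive class."""
    return sorted(weights, key=lambda tok: (-weights[tok], tok))[:k]


def classify(weights, text):
    """Predict the positive class when the summed log-odds is above zero.

    Unknown tokens contribute nothing; class priors are assumed balanced.
    """
    score = sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return score > 0


# Invented toy corpus (assumption): two sentences labelled "ES", two "OTHER".
docs = [
    "i have 25 years and i am agree with you",
    "people is agree that the life is hard",
    "i am 25 years old and i agree with you",
    "people agree that life is hard",
]
labels = ["ES", "ES", "OTHER", "OTHER"]

weights = train_log_odds(docs, labels, positive="ES")
print("Most indicative of ES:", top_features(weights, k=3))
print("Least indicative of ES:", min(weights, key=weights.get))
print("Prediction for 'she have the car':", classify(weights, "she have the car"))
```

Because every token has a single signed weight, the ranking doubles as an explanation: tokens at the top are the ones that "give the L1 away" in this toy setting. A real study would need far more data and richer features (morphological, syntactic, statistical) than bare word tokens.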
