The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus

by Samia Touileb, et al.

Recent years have seen growing interest in cross-lingual transfer between typologically similar languages, and between languages written in different scripts. However, how language similarity and script difference interact in cross-lingual transfer is a less studied problem. We explore this interplay for two supervised tasks, namely part-of-speech tagging and sentiment analysis. We introduce a newly annotated corpus of Algerian user-generated comments comprising parallel annotations of Algerian written in Latin, Arabic, and code-switched scripts, as well as annotations for sentiment and topic categories. We perform baseline experiments by fine-tuning multilingual language models. We further explore the effect of script vs. language similarity in cross-lingual transfer by fine-tuning multilingual models on languages that a) are typologically distinct but use the same script, b) are typologically similar but use a distinct script, or c) are typologically similar and use the same script. We find that there is a delicate relationship between script and typology for part-of-speech tagging, while sentiment analysis is less sensitive.




