Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

by Elena Alvarez-Mellado, et al.

This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings (words from one language that are introduced into another without orthographic adaptation) and use it to evaluate several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models). The corpus contains 370,000 tokens and is larger, more borrowing-dense, richer in out-of-vocabulary (OOV) words, and more topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings, combined with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings, outperforms a multilingual BERT-based model.
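To make the sequence labeling framing concrete, the following is a minimal sketch of how borrowing spans could be encoded with per-token BIO tags and read back out. The tag name `BORROWING`, the helper `extract_spans`, and the example sentence are illustrative assumptions; they are not taken from the corpus or from the authors' code.

```python
from typing import List, Optional, Tuple


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from a BIO-tagged token sequence."""
    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span being built, if any.
        nonlocal current_label, current_tokens
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag continues the open span with the same label.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends any open span.
            flush()
    flush()
    return spans


# Hypothetical Spanish sentence containing two unassimilated borrowings.
tokens = ["Las", "prendas", "bestsellers", "se", "estampan",
          "con", "motivos", "animal", "print"]
tags = ["O", "O", "B-BORROWING", "O", "O",
        "O", "O", "B-BORROWING", "I-BORROWING"]

print(extract_spans(tokens, tags))
```

A sequence labeling model such as a CRF or BiLSTM-CRF would predict the `tags` list from the `tokens` list; a helper like this one only converts predictions back into borrowing spans for inspection or span-level scoring.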

