Lexical borrowing is the process of importing words from one language into another [onysko2007anglicisms, poplack1988social], a phenomenon that occurs in all languages. The task of automatically extracting lexical borrowings from text has proven relevant both in lexicographic work and for downstream NLP tasks such as parsing [alex2008automatic], text-to-speech synthesis [leidig2014automatic], and machine translation [tsvetkov2016cross].
In recent decades, English in particular has produced numerous lexical borrowings (often called anglicisms) in many European languages [furiassi2012anglicization]. Previous work estimated that a reader of French newspapers encounters a new lexical borrowing every 1,000 words [chesley_paula_predicting_2010], with English borrowings outnumbering all other borrowings combined [chesley2010lexical]. In Chilean newspapers, lexical borrowings account for approximately 30% of neologisms, with 80% of those corresponding to anglicisms [gerding2014anglicism]. In European Spanish, anglicisms were estimated to account for 2% of the vocabulary used in the Spanish newspaper El País in 1991 [gorlach_felix], a number that is likely to be higher today. As a result, the usage of lexical borrowings in Spanish (and particularly anglicisms) has attracted considerable attention, both in linguistic studies and among the general public.
For ADoBo 2021, we proposed a shared task on automatically detecting lexical borrowings in Spanish newswire, with a special focus on unassimilated anglicisms. In this paper we describe the purpose and scope of the shared task, introduce the systems that participated in it, and share the results obtained during the competition.
2 Related work
Several projects have approached the task of extracting lexical borrowings in various European languages, such as German [alex2008automatic, alex-2008-comparing, garley-hockenmaier-2012-beefmoves, leidig2014automatic], Italian [furiassi2007retrieval], French [alex2008automatic, chesley2010lexical], Finnish [mansikkaniemi2012unsupervised], and Norwegian [andersen2012semi, losnegaard2012data], with a particular focus on anglicism extraction.
Despite the interest in modeling anglicism usage, the problem of automatically extracting lexical borrowings has been seldom explored in the NLP literature for Iberian languages in general and for Spanish in particular, with only a few recent exceptions [serigos2017applying, alvarez2020lazaro].
3 Lexical borrowing: scope of the phenomenon
The concept of linguistic borrowing covers a wide range of linguistic phenomena, but is generally understood as the process of introducing words, elements or patterns of one language (the donor language) into another language (the recipient language) [haugen1950analysis, weinreich1963languages]. In that sense, lexical borrowing is somewhat similar to linguistic code-switching (the process of using two languages interchangeably in the same discourse that is common among bilingual speakers), and in fact both phenomena have sometimes been described as a continuum with a fuzzy frontier between the two [clyne2003dynamics]. Consequently, there is disagreement about what is (and is not) a borrowing [gomez1997towards], and various classifications and typologies for characterizing borrowing usage have been proposed, both for borrowings in general [thomason1992language, matras2007grammatical, haspelmath2009loanwords] and for anglicism usage in Spanish in particular [pratt1980anglicismo, lorenzo1996anglicismos, gomez1997towards, gonzalez1999anglicisms, nogueroles2018comprehensive].
4 Task description
For the ADoBo shared task we have focused on unassimilated lexical borrowings, words from another language that are used in Spanish without orthographic modification and that have not (yet) been integrated into the recipient language—for example, running, smartwatch, influencer, holding, look, hype, prime time and lawfare.
4.1 Motivation for the task
The task of extracting unassimilated lexical borrowings is a more challenging undertaking than it might appear at first. To begin with, lexical borrowings can be either single or multitoken expressions (e.g., prime time, tie break or machine learning). Second, linguistic assimilation is a diachronic process and, as a result, what constitutes an unassimilated borrowing is not clear-cut. For example, words like bar or club were unassimilated lexical borrowings in Spanish at some point in the past, but have been around for so long in the Spanish language that the process of phonological and morphological adaptation is now complete and they cannot be considered unassimilated borrowings anymore. On the other hand, realia words, that is, culture-specific elements whose names entered via the language of origin decades ago (like jazz or whisky), cannot be considered unassimilated either, despite their orthography not having been adapted to Spanish conventions.
All these subtleties make the annotation of lexical borrowings non-trivial. Consequently, in prior work on anglicism extraction from Spanish text, plain dictionary lookup produced very limited results with F1 scores of 47 [serigos2017applying] and 26 [alvarez2020lazaro]. In fact, whether a given expression is a borrowing or not cannot always be determined by plain dictionary lookup; after all, an expression such as social media is an anglicism in Spanish, even when both social and media also happen to be Spanish words that are registered in regular dictionaries. This justifies the need for a more NLP-heavy approach to the task, which has already proven to be promising. Previous work on borrowing extraction using a CRF model with handcrafted features produced an F1 score of 86 on a corpus of Spanish headlines [alvarez2020lazaro].
Finally, although there are some already well-established shared tasks on mixed-language settings, they have focused exclusively on code-switched data [solorio-etal-2014-overview, molina-etal-2016-overview, aguilar-etal-2018-named], which is close to borrowing but different in scope and nature (see Section 3), and no specific venue exists on borrowing detection in NLP so far. To the best of our knowledge, ADoBo is the first shared task specifically devoted to linguistic borrowing.
4.2 Corpus description
A corpus of newspaper articles written in Spanish was distributed to the task participants. The corpus articles were sourced from various Spanish newspapers and online media based in Spain. The articles were annotated with unassimilated lexical borrowings.
Given that lexical borrowings can be multiword expressions (such as best seller, big data) and that those units should be treated as one borrowing and not as two independent borrowings, BIO encoding was used to denote the boundaries of each span.
Two classes were used for borrowings: ENG for English borrowings, and OTHER for lexical borrowings from other languages. Tokens that were not part of a borrowing were annotated with the “outside” tag (O). Only unassimilated lexical borrowings were considered borrowings. This means that borrowings that have already gone through orthographic adaptation (such as fútbol or hackear) were not considered borrowings and were therefore annotated as O. Annotation guidelines were also made available for participants.
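As an illustration of the scheme above, a multitoken borrowing like prime time receives a B-ENG tag on its first token and an I-ENG tag on the following token. The minimal Python sketch below (ours, not part of the task tooling; it simplifies some BIO edge cases) shows how such tagged sequences map to labeled spans:

```python
def bio_spans(tokens, tags):
    """Collect (start, end, label) spans from a BIO-tagged sequence.

    `end` is exclusive. Simplified sketch: a stray I- tag whose label
    does not continue the open span simply closes that span.
    """
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # span continues
        else:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

tokens = ["La", "serie", "se", "emite", "en", "prime", "time"]
tags   = ["O",  "O",     "O",  "O",     "O",  "B-ENG", "I-ENG"]
print(bio_spans(tokens, tags))  # [(5, 7, 'ENG')]
```

This treats the two tokens of prime time as a single ENG span rather than two independent borrowings, which is the behavior the annotation scheme requires.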
The data was distributed in CoNLL format. An additional collection of documents that was not evaluated (the background set) was released as a part of the test set. This was done to encourage scalability to larger data collections and to ensure that participating teams would not be able to easily perform manual examination of the evaluated part of the test set.
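A file in this format can be parsed with a few lines of code. The sketch below assumes a two-column layout (token, BIO tag) with blank lines separating sentences; the exact layout of the released files is defined in the task guidelines, so treat this as illustrative:

```python
def read_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) sentences.

    Assumes whitespace-separated columns with the token first and the
    BIO tag second, and blank lines between sentences.
    """
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()[:2]
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences

sample = """Un O
best B-ENG
seller I-ENG

El O
look B-ENG
"""
for toks, tags in read_conll(sample):
    print(toks, tags)
# ['Un', 'best', 'seller'] ['O', 'B-ENG', 'I-ENG']
# ['El', 'look'] ['O', 'B-ENG']
```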
The dataset contained a high number of unique borrowings and OOV words, and there was minimal overlap between splits. This enabled a more rigorous evaluation of system performance, as it helped us better assess the generalizing abilities of the participants’ models. Table 2 contains the number of tokens and borrowing spans per type in each split.
4.3 Evaluation metrics
The evaluation metrics used for the task were the standard precision, recall and F1 over spans:
Precision: The percentage of borrowings in the system’s output that are correctly recognized and classified.
Recall: The percentage of borrowings in the test set that were correctly recognized and classified.
F1-measure was used as the official evaluation score for the final ranking of the participating teams. Evaluation was done exclusively at the span level. This means that only exact matches were considered, and no credit was given to partial matches. For example, given the multitoken borrowing late night, the entire phrase would have to be correctly labeled in order to count as a true positive. This makes the evaluation more rigorous, as it avoids the overly generous scores that can sometimes result from token-level evaluation. A model that can only detect English function words would detect on and the in on the rocks, or by in stand by, and still obtain a generous result under token-level evaluation.
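The exact-match, span-level scoring can be sketched as follows, representing each span as a (sentence_id, start, end, label) tuple. This is our simplified illustration, not the SeqScore implementation used for the official ranking:

```python
def span_prf(gold, pred):
    """Exact-match span-level precision, recall and F1.

    A predicted span counts as a true positive only if its boundaries
    AND its label match a gold span exactly; no partial credit.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "late night" predicted as two single-token spans scores zero:
gold = [(0, 5, 7, "ENG")]                    # one two-token span
pred = [(0, 5, 6, "ENG"), (0, 6, 7, "ENG")]  # two partial spans
print(span_prf(gold, pred))  # (0.0, 0.0, 0.0)
```

Under token-level evaluation the same prediction would score perfectly, which is precisely the leniency the span-level setup avoids.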
4.4 Resource limitation for model training
The following limitations were established for participants during training:
No additional human annotation was allowed for training. Given that the main purpose of the shared task was to evaluate how different models perform for the task of borrowing detection, using external data annotated with borrowings would prevent a fair evaluation of different model approaches.
Although the usage of regular lexicons and linguistic resources was accepted, no automatically-compiled lexicons of borrowings (such as those produced by already-existing models that perform borrowing extraction) were allowed. The reason for this limitation was that we were interested in evaluating how different approaches to borrowing detection performed when dealing with previously unseen borrowings, and models that piggyback on already-existing systems' output would prevent that.
5 System descriptions
We received nine submissions from four different teams. However, only two teams submitted system descriptions. As a result, we have no description whatsoever for two of the participating systems, including the one that obtained the best results. We provide a brief summary of the two participating systems for which we received a description, and refer the reader to their respective task description papers for further details.
5.1 BERT4EVER team: CRF model with data augmentation
The BERT4EVER team [jiang2021bert4ever] submitted a system to ADoBo based on combining several CRF models trained on different portions of the task’s training data. The models were used to label a freely-available open corpus in Spanish, and individual models were then re-trained on the output. Results suggest that this strategy improves F1 by two points on the test set compared to a baseline trained only on the task data. The paper combines two well-known items in the ML toolbox, namely CRFs and data augmentation, and shows that bootstrapping an additional dataset is indeed useful.
5.2 Versae team: using STILTs
The Versae team [de2021futility] submitted a system that experimented with STILTs, supplementary training on intermediate labeled-data tasks [phang2019sentence], for the ADoBo task. They experimented with training on part-of-speech, named entity recognition, code-switching, and language identification datasets, but found that models trained in this way consistently performed worse than directly fine-tuned multilingual language models. The Versae team also explored which multilingual language models perform best, evaluating multilingual BERT, RoBERTa, and models trained on small sets of languages.
6 Results
Results of the task were computed using SeqScore (https://github.com/bltlab/seqscore) [seqscore], a Python package for evaluating sequence labeling tasks, configured to emulate the conlleval evaluation script. Scores are summarized in Table 1. F1 ranged from 37.29 to 85.03, with the Marrouviere team scoring highest (F1=85.03, P=88.81, R=81.56), closely followed by the Versae team (F1=84.80, P=88.77, R=81.17).
In order to get a better understanding of the systems that took part in the shared task, we performed some experiments on the output that was submitted by participants.
6.1 Combining outputs
In order to assess the complementarity of the submitted systems, an experiment was carried out combining their outputs. The combination consisted of the union of all detected terms. Since the number of systems is not very high, all combinations of systems were explored. In terms of F1 score, the best performing combination was (1), (2), and (4), with F1=87.83, P=87.83, and R=89.26, a result that outperforms the scores obtained separately by each individual system.
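The union-based combination can be sketched as a simple set union over the span outputs of the individual systems. This is our illustration (the exact representation used in the experiment is not specified); note that a plain union keeps any span detected by at least one system, which tends to raise recall at the possible cost of precision:

```python
def combine_by_union(*system_outputs):
    """Combine system outputs by taking the union of detected spans.

    Each output is a collection of (sentence_id, start, end, label)
    tuples; a span is kept if ANY system detected it.
    """
    combined = set()
    for spans in system_outputs:
        combined |= set(spans)
    return combined

sys1 = {(0, 5, 7, "ENG")}
sys2 = {(0, 5, 7, "ENG"), (1, 2, 3, "OTHER")}
print(sorted(combine_by_union(sys1, sys2)))
# [(0, 5, 7, 'ENG'), (1, 2, 3, 'OTHER')]
```

A real combination would additionally need a policy for overlapping or conflicting spans coming from different systems, which this sketch ignores.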
6.2 Removing ortho-typographic cues
Three variations of the test set were included in the background set (the additional collection of documents released along with the test set):
A lowercase version, where all uppercase letters in the original test set were transformed to lowercase.
A no-quotation-mark version, where all quotation marks in the original test set (“ ” ‘ ’ « ») were removed.
A lowercase no-quotation-mark version, where all uppercase letters were transformed to lowercase AND all quotation marks were removed.
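The three variants above can be generated mechanically. The following sketch (ours, not the organizers' script) applies lowercasing and strips the quotation marks listed above:

```python
# The quotation marks removed in the experiments: “ ” ‘ ’ « »
QUOTES = "\u201c\u201d\u2018\u2019\u00ab\u00bb"

def lowercase_variant(text):
    """Lowercase every uppercase letter."""
    return text.lower()

def unquoted_variant(text):
    """Remove all quotation marks from the text."""
    return "".join(ch for ch in text if ch not in QUOTES)

def lowercase_unquoted_variant(text):
    """Apply both transformations."""
    return unquoted_variant(text.lower())

s = "El «hype» del Big Data"
print(lowercase_unquoted_variant(s))  # el hype del big data
```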
None of these versions was used to rank the systems; they were created to observe differences in system performance under different textual conditions. The rationale for these experiments was to assess how well systems performed when certain ortho-typographic cues that usually appear along with borrowings (such as quotation marks) were removed. After all, a borrowing is still a borrowing regardless of whether it is written with or without quotation marks, and it would be of little use to have a model that systematically labeled anything between quotation marks as a borrowing, or that only detected borrowings when they were written between quotation marks.
Similarly, many of the foreign words that appear in newswire are usually proper names, where the uppercase can serve as cue to distinguish them from borrowings. Given that speakers are capable of distinguishing borrowings from proper names in oral settings—where no case distinction exists—and that these cues are not present in other textual genres (e.g. social media), we were interested in assessing how well the models performed when no case cue was available.
Results for these experiments are presented in Tables 3, 4 and 5. Focusing on the two best-performing systems, we observe a drop in global F1 on the lowercased versions of the test set, caused by a consistent drop in precision that is not compensated by a slight increase in recall. In general, the drop for system (2) is more pronounced than for system (1), which causes the two systems to swap positions in the corresponding rankings. For the unquoted version of the test set, system (1) increases its F1 while system (2) decreases slightly. Since we have no information about system (1), we cannot attribute any of these differences to particular characteristics of the systems.
7 Conclusions
In this paper we have presented the results of the ADoBo shared task on extracting unassimilated lexical borrowings from Spanish newswire. We have introduced the motivation for the task, described its scope and nature, shared the results obtained, and summarized the main findings. Participants' results ranged from F1 scores of 37 to 85. These scores show that this is not a trivial task and that lexical borrowing detection remains an open problem that requires further research.
Our goal with this shared task was to raise awareness about a topic that, although highly relevant in the linguistics literature, has been mostly neglected within NLP. Although participation in this first edition was modest (nine systems submitted by four different teams), the response was positive and seems to indicate that a segment of the community is interested in borrowing as an NLP task. In fact, a post-task survey distributed among registered participants showed that 85% of respondents were interested in seeing future editions around this phenomenon, particularly on languages other than Spanish and including both semantic and diachronic borrowings.