Texts represent the main source of knowledge for our society. However, they can be written in various manners, thus creating a barrier between the readers and the ideas they intend to convey. Therefore, document comprehension is the main challenge users have to overcome, by understanding the meaning behind troublesome words and becoming familiar with them. Complex Word Identification (CWI) is a task that aims to identify hard-to-understand tokens, highlighting them for further clarification and assisting users in grasping the contents of the document.
Motivation. Each culture includes exclusive ideas, available only to those who can pass the obstacle of language. However, properly understanding language can prove to be a difficult task. By identifying complex words, users can make consistent steps towards adapting to the culture and accessing the knowledge it has to offer. As an example, entries like “mayoritariamente” (eng. “mostly”) or “gobernatura” (eng. “governance”) in the Spanish environment can create understanding problems for non-native Spanish speakers, thus requiring users to familiarize themselves with these particular terms.
Challenges. The identification task becomes increasingly more difficult, as proper complex word identification is not guaranteed. For example, if we use human identification techniques, language learners may consider a new word to be complex, while others might not share the same opinion by relying on their prior knowledge in that language. Therefore, universal annotation techniques are required, such that a ground truth can be established and the same set of words is considered complex in any context.
We consider state-of-the-art solutions, namely multilingual Transformer-based approaches, to address the CWI challenge. First, we apply a zero-shot learning approach. This was performed by training Recurrent Neural Networks (RNNs) and Transformer-based models on a source language corpus, followed by validating and testing on a corpus from a target language, different from the source language. A second experiment consists of a one-shot learning approach that considers training on each of the three languages (i.e., English, German, Spanish), but only keeping one entry from the target language, and validating and testing on English, German, Spanish, and French, respectively.
In addition, we performed few-shot learning experiments by validating and testing on a language, and training on the others, but with the addition of a small number of training entries from the target language. The model learns sample structures from the language and, in general, performs better when applied to multiple entries. Furthermore, this training process can help the model adapt to situations in which the number of training inputs is scarce. The dataset provided by the CWI Shared Task 2018 was used to perform all experiments.
This paper is structured as follows. The second section describes related work and its impact on the CWI task. The third section describes the corpus and outlines our method based on multilingual embeddings and Transformer-based models, together with the corresponding experimental setup. The fourth section details the results, alongside a discussion and an error analysis. The fifth section concludes the paper and outlines the main ideas, together with potential extensions.
II Related Work
Complex word identification was explored in various other studies and underlying approaches can be split into two main categories: monolingual and cross-lingual.
Monolingual CWI. The first category implies the usage of the same language for training, testing, and validation processes using a supervised approach. Sheang proposed a solution based on Convolutional Neural Networks trained on both word embeddings and handcrafted features. The author used pretrained GloVe word embeddings for representing words from each of the three languages in the dataset. Furthermore, the author engineered a series of morphological features to obtain additional insights into the structure of the entries, such as the number of vowels, word length, and Tf-Idf. At the same time, the author considered a series of linguistic features, alongside morphological ones, by identifying syntactic dependencies between words. However, the presence of these features together with language-specific word embeddings implies a complex training and evaluation process, performed on each language separately and with different configuration setups.
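As an illustration of the kind of handcrafted morphological features mentioned above (word length and number of vowels), a minimal sketch follows. The function name `morphological_features` and the vowel inventory are assumptions made for this example and are not taken from the cited work.

```python
# Assumption: a small vowel inventory covering English, German, and Spanish.
VOWELS = set("aeiouáéíóúäöü")


def morphological_features(word: str) -> dict:
    """Return two simple surface features of a single word."""
    lowered = word.lower()
    return {
        "length": len(lowered),
        "num_vowels": sum(ch in VOWELS for ch in lowered),
    }


features = morphological_features("mayoritariamente")
```

Such features are cheap to compute, but they have to be designed and tuned per language, which is the setup cost noted in the paragraph above.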
Cross-lingual CWI. Cross-lingual transfer has been successfully used in various NLP tasks, for example: machine translation, verb sense disambiguation, dependency parsing, coreference resolution, event detection, sentence summarization, document retrieval, irony detection, dialogue systems, domain-specific tweet classification, as well as abusive language identification.
In addition, cross-lingual approaches were employed in a few works on the CWI task. For example, Finnimore et al. extracted cross-lingual features for each considered language (i.e., English, German, Spanish, and French). They concluded that the best features for cross-lingual approaches are the number of syllables, the number of tokens, and the number of punctuation marks. However, performing this process can prove to be costly, as it requires re-running the model for each additional language in which the user intends to perform complex word identification.
Bingel and Bjerva proposed a sequence labeling approach for CWI. They used 300-dimensional word embeddings for encoding the input words, and fed this input to a Bidirectional Long Short-Term Memory (BiLSTM) network that considered both word and character-level representations. The authors imposed a probability threshold of 0.5 for classifying a word as complex and applied the same rules for phrase-level classification. The authors used an English dataset based on news articles written with different levels of professionalism. Their approach underlines the effectiveness of sequence labeling models, which surpassed prior methods by a margin of up to 3.6% in terms of macro F1-score.
Zampieri et al. developed ensemble classifiers to identify complex words. They used two approaches for classification, namely Plurality Voting and Oracle. Based on multiple subsystems, the authors concluded that the latter approach performed well when integrating the top three methods participating in the SemEval CWI 2016 competition.
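The two ensemble strategies can be sketched as follows. This is a simplified illustration under stated assumptions: each system emits one binary label per token, and the helper names `plurality_vote` and `oracle` are hypothetical. The oracle is an upper bound that needs the gold labels, so it cannot be used at prediction time.

```python
from collections import Counter


def plurality_vote(predictions):
    """predictions: list of per-system label lists, all of equal length.

    Returns, for each token, the label chosen by the most systems.
    With an odd number of systems and binary labels no ties can occur.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def oracle(predictions, gold):
    """A token counts as correct if at least one system predicted the gold label."""
    combined = []
    for labels, gold_label in zip(zip(*predictions), gold):
        # Fall back to the first system's label when every system is wrong.
        combined.append(gold_label if gold_label in labels else labels[0])
    return combined


systems = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
gold = [1, 1, 0, 1]
voted = plurality_vote(systems)
upper_bound = oracle(systems, gold)
```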
A different approach to CWI was taken by Thomas et al., who considered simplifying the entire document lexicon, thus making the text more accessible for non-native speakers. The authors introduced different algorithms for reducing the lexicon size, by combining disambiguation and lexical reduction steps.
In contrast to the previous approaches, we developed a system based on state-of-the-art NLP solutions (i.e., Transformers), that can efficiently adapt to a large number of languages, without prior setup or feature engineering. The Transformer multi-lingual models are pretrained on a large number of languages, with various word representations already mapped into the same space. Unlike previous work, our models are universal, can be easily extended to other languages, and can be used for transfer learning.
We consider two main multilingual approaches for CWI: a) RNN-based solutions, alongside multilingual word embeddings, and b) multilingual Transformers specialized in token classification. Our aim is to infer cross-lingual features of complex words by training or fine-tuning on a labelled corpus containing different languages, followed by the identification of complex words in a newly encountered language. Preprocessing is minimal and consists only of removing unknown characters and extra spaces from the dataset.
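The minimal preprocessing described above could look like the sketch below. The text does not define what counts as an "unknown character", so this example assumes it means the Unicode replacement character and non-whitespace control or unassigned code points; the function name `clean_text` is hypothetical.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Drop unknown characters and collapse runs of whitespace."""
    kept = []
    for ch in text:
        # Assumption: U+FFFD marks a character lost during decoding.
        if ch == "\ufffd":
            continue
        # Assumption: control (Cc) and unassigned (Cn) code points are noise,
        # except whitespace such as tabs, which is collapsed below.
        if unicodedata.category(ch) in {"Cc", "Cn"} and not ch.isspace():
            continue
        kept.append(ch)
    return re.sub(r"\s+", " ", "".join(kept)).strip()


cleaned = clean_text("  la \ufffd gobernatura\x00   es\tnueva ")
```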
Our analysis uses the dataset provided by the CWI Shared Task 2018 , which contains entries in four languages, namely: English, German, Spanish, and French. The English section of the dataset contains articles written at three proficiency levels: professional (news), non-professional (WikiNews), and Wikipedia articles. The German and the Spanish sections contain only one category of entries, taken from Wikipedia pages. Quantitatively, the English section contains 27,299 entries for training and 3,328 for validation. In contrast, the German section offers only 6,151 training elements and 795 for validation. At the same time, the Spanish section provides 13,750 training entries and 1,622 validation entries. We note that there are no training and validation entries for the French language.
As expected, the number of complex words is lower when compared to the number of non-complex words. Table I shows the distribution of complex words in the dataset. While the Spanish and English sections contain a relatively large number of complex and non-complex words, the vocabulary corresponding to the German section is considerably smaller, with only 17,462 words. The small number of German entries is caused by the general focus on English and Spanish, languages with a greater number of speakers when compared to German (see https://www.visualcapitalist.com/100-most-spoken-languages/). Additionally, the test dataset also contains French entries, with a total of 4,507 words.
|Language||Complex Words||Non-complex Words|
III-B Multilingual Word Embeddings
Our first experiment consists of using a common embedding for all four languages. We selected pretrained FastText embeddings for English, German, Spanish, and French. However, these embedding spaces are not aligned with one another. Thus, we mapped them into a merged space by using Facebook MUSE, a tool that receives as inputs two embedding files and a target vector space, and maps them into the same space. The mapping process consists of learning a rotation matrix W that aligns the two distributions by using an adversarial learning technique. The matrix W is then refined by using Procrustes transformations because the initial alignment is rough. The transformation consists of setting frequent words aligned in the previous step as anchor points, followed by minimizing an energy function between the anchor points. Finally, an expansion is performed using the matrix W and a distance metric for the space containing a high density of words, such that the distance between unrelated words is increased.
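For reference, the Procrustes refinement step has a closed-form solution. The formulation below follows the standard orthogonal Procrustes problem; the symbols X and Y are assumptions introduced here (they are not defined in the text) and denote the d-by-n matrices whose columns hold the source and target embeddings of the n anchor word pairs.

```latex
W^{\star}
  = \operatorname*{arg\,min}_{W \in O_d(\mathbb{R})} \lVert W X - Y \rVert_{F}
  = U V^{\top},
\qquad \text{where } U \Sigma V^{\top} = \operatorname{SVD}\!\left( Y X^{\top} \right).
```

Constraining W to be orthogonal keeps the mapping a pure rotation (possibly with a reflection), so distances within the source embedding space are preserved after alignment.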
The tool requires a parallel corpus between the languages. The corpus can be created by selecting the desired ground-truth bilingual dictionaries available in the Facebook MUSE repository (https://github.com/facebookresearch/MUSE). The mapping was performed in two steps, as follows. First, we mapped the English and German vectors by using an English-German parallel corpus. Second, we added the Spanish embeddings, by further using an English-Spanish parallel corpus. The obtained embeddings are then fed into a BiLSTM network, alongside a TimeDistributed layer (https://keras.io/api/layers/recurrent_layers/time_distributed/). The experiments were performed in different scenarios: a) a zero-shot approach that required training on combinations of all the available languages, except the target language; b) a one-shot approach that introduces the target language (one entry) into the training corpus; and c) a few-shot approach, introducing 100 target language entries in the training dataset.
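The three scenarios differ only in how many target-language entries are added to the training corpus. A minimal sketch of that data selection is given below, assuming the corpora are held in a dictionary keyed by language code; the helper `build_training_set` and its parameters are hypothetical names for illustration.

```python
import random


def build_training_set(corpora, target, k=0, sources=None, seed=0):
    """Assemble a training set for cross-lingual transfer.

    corpora: dict mapping a language code to its list of training entries.
    target:  language used for validation and testing.
    k:       number of target-language entries to add
             (0 = zero-shot, 1 = one-shot, 100 = few-shot in the text).
    sources: source languages to train on; defaults to all non-target ones.
    """
    if sources is None:
        sources = [lang for lang in corpora if lang != target]
    train = [entry for lang in sources for entry in corpora[lang]]
    if k > 0:
        rng = random.Random(seed)  # fixed seed so the sample is reproducible
        train.extend(rng.sample(corpora[target], k))
    return train


corpora = {
    "EN": [("EN", i) for i in range(5)],
    "DE": [("DE", i) for i in range(3)],
    "ES": [("ES", i) for i in range(4)],
}
zero_shot = build_training_set(corpora, "DE")
one_shot = build_training_set(corpora, "DE", k=1)
```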
III-C Multilingual BERT
Multilingual BERT (mBERT) is a pretrained Transformer architecture trained on over 100 languages, which we selected for multilingual token classification. The efficiency of representations generated by the model needs to be maximized because we performed our experiments in a multilingual environment. Fortunately, mBERT offers the possibility of splitting its representations into two categories, language-neutral components and language-specific components, thus sharing certain features between the languages of interest. mBERT was fine-tuned for the CWI task by using the previously mentioned zero-shot and one-shot learning approaches.
XLM-RoBERTa is also a multilingual model built with the Masked Language Model objective, which should have an advantage over mBERT because it was pretrained on even more multilingual data (approximately 2.5 TB of raw text data). The model obtains state-of-the-art results for the GLUE benchmark tasks, while performing extremely well on Named Entity Recognition and Cross-lingual Natural Language Inference tasks.
III-E Other BERT-based Monolingual Models
Alongside mBERT, we decided to experiment with models extensively pretrained on each one of our target languages, alternatives that have shown better performance than the multilingual models in other NLP tasks. Thus, we used new models for the German, Spanish, and French languages, namely: German BERT (https://deepset.ai/german-bert), Spanish BERT (BETO), and French BERT (CamemBERT). Our goal was to increase performance by specifically focusing on a certain language, instead of over 100 languages (as is the case for mBERT).
III-F Implementation Details
Six experiments were conducted: a) embeddings aligned with MUSE fed to a BiLSTM network, b) mBERT token classification, c) XLM-RoBERTa token classification, d) German BERT token classification, e) BETO token classification, and f) CamemBERT token classification. Each experiment is also divided into sub-experiments that considered the usage of each language individually, as well as all possible combinations of languages in the training set. The four languages (i.e., English, German, Spanish, and French) were considered, in turn, for validation and testing. The BiLSTM-based solution was trained for 5 epochs, while the others (i.e., the Transformer-based solutions) were trained for 4 epochs. We concluded that this setup offers the best results considering that all our solutions start overfitting after 5 and 4 epochs, respectively. Table II presents the hyperparameters used for training the models during the experiments.
|Hyperparameter||MUSE + BiLSTM||Transformer|
|Optimizer||RMSprop ||AdamW |
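The sub-experiment grid described in the implementation details (each language individually, plus all possible combinations of training languages) can be enumerated as in the sketch below. The language codes and the helper name `training_combinations` are assumptions for illustration; French is left out because the dataset provides no French training entries.

```python
from itertools import combinations

# Languages that have training data in the dataset (French has none).
TRAIN_LANGUAGES = ["EN", "DE", "ES"]


def training_combinations(languages):
    """Return every non-empty subset of the given training languages."""
    combos = []
    for size in range(1, len(languages) + 1):
        combos.extend(combinations(languages, size))
    return combos


grid = training_combinations(TRAIN_LANGUAGES)
```

Each tuple in `grid` corresponds to one training configuration, which is then evaluated on each validation and test language in turn.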
|MUSE + BiLSTM||✓||.606||.582||.577||.622||.609||.592||.587||.579||.625||.640||.524|
We considered: EN-W = English-Wikipedia; EN-WN = English-WikiNews; EN-N = English-News; DE = German; ES = Spanish; FR = French.
|Model||Train||Macro F1-score (one-shot)||Macro F1-score (few-shot)|
✓implies the usage of the entire dataset corresponding to that language. Additionally, we randomly selected 1 (for one-shot learning) or 100 (for few-shot learning) training entries from the language corresponding to the result for that line.
Table III contains the macro F1-scores obtained on the CWI validation and test datasets for each experiment and for each combination of training languages, covering the monolingual and zero-shot learning experiments. The best results for the zero-shot approach are marked in bold, while the best results for the monolingual approach are underlined.
|Our best solution, zero-shot learning||.774||.720||.731||.782||.734||.702|
|Our best solution, few-shot learning||.761||.733||.726||.766||.730||-|
|Our best monolingual solution||.808||.811||.808||.795||.756||-|
IV-A Zero-Shot Transfer Evaluation
The best results on both validation and test datasets for the zero-shot learning strategy are obtained using the XLM-RoBERTa model, with a single exception represented by the validation dataset on German. With a considerable margin when compared to its counterparts, XLM-RoBERTa fine-tuned on English and Spanish manages to obtain a macro F1-score of 0.782 on the German test dataset, compared to 0.626 (MUSE+BiLSTM), 0.739 (mBERT), and 0.717 (German BERT). The results are similar for the Spanish test dataset (0.702) and for the English test datasets (0.774 on Wikipedia, 0.720 on WikiNews, and 0.731 on News). The increased performance of XLM-RoBERTa can be attributed to the larger corpus it was pretrained on, a clear advantage over other BERT-based solutions. However, if we look at the other BERT-based monolingual models (i.e., German BERT, BETO, and CamemBERT), we can see that their performance is surpassed by both mBERT and XLM-RoBERTa. These models are pretrained on a main language, and fine-tuning them on different languages can lead to poorer results, as seen in Table III. For example, the difference in performance (macro F1) between XLM-RoBERTa and BETO is 6.8% on the Spanish validation dataset, a significant discrepancy for a CWI task.
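Since every comparison above is expressed as a macro F1-score, a minimal reference implementation may help make the metric concrete. It assumes binary token labels (1 = complex, 0 = non-complex); the function name `macro_f1` is illustrative and the evaluation script of the shared task may differ in details.

```python
def macro_f1(gold, pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


score = macro_f1([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

Because both classes contribute equally regardless of their frequency, the metric is not dominated by the more numerous non-complex words.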
IV-B One-Shot Transfer Evaluation
Furthermore, the best values for the one-shot learning approach are marked in bold in Table IV, where we considered only one training entry corresponding to the language of the result. We can observe that, again, the XLM-RoBERTa model offers the best performance. For example, XLM-RoBERTa obtains a macro F1-score of 0.731 on the WikiNews dataset, compared to 0.711 for mBERT. Moreover, the large difference is maintained for the German language as well, with a result of 0.783 versus 0.743. However, the scores for the Spanish language are closer, with a value around 0.730 for both models.
IV-C Few-Shot Transfer Evaluation
Next, we included a small number of training entries (i.e., 100) from the same language as the test dataset because we intended to further improve the scores obtained by the Transformer-based solution using the zero-shot learning scenario. Using this approach, the model can infer characteristics of the target language and may perform better when identifying complex words on a wide range of different test entries.
Table IV contains the results obtained in the few-shot learning experiments. Unexpectedly, the models perform slightly worse. This phenomenon can be attributed to the models’ incapacity to grasp the main language characteristics, as well as the representations of a complex word, given a small number of training entries.
To conclude, our solution manages to outperform state-of-the-art alternatives on five out of six cross-lingual entries, the only exception being the French language (see Table V). Furthermore, our solution manages to surpass state-of-the-art results for German in the monolingual setup, even though it was created for cross-lingual experiments.
IV-D Error Analysis
Most misclassifications occurred in the English News test dataset, where our models yielded a maximum macro F1-score of 0.733 by using a few-shot learning approach with XLM-RoBERTa. The high number of wrongly categorized tokens can be attributed to the complexity of the dataset, written in a more formal manner, adequate for news articles. This complexity implies the presence of more sophisticated words (e.g., “underwriter”) that are not present in the training dataset, thus causing the model to wrongly classify them. In addition, the dataset contains news with series of location names (e.g., “Londonderry”) or compound notions (e.g., “better-optimized”, “android-running”, “java-related”) that, once again, are not included in the training set.
At the same time, another aspect that influences the classification performance is the annotators’ subjectivity. In certain circumstances, words may not be considered complex (e.g., “with”, “connection”, “been”) in the training set, while they are marked as complex in the test dataset. Similar situations also occur in the English Wikipedia, English WikiNews, German, and Spanish datasets, with a series of tokens that either are not present in the training dataset, or have different labels between the two sets.
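The two error sources discussed above (test tokens absent from the training set, and tokens labelled differently across the two sets) can be surfaced with a simple check. The sketch assumes each dataset is available as a list of (token, label) pairs; the function name `analyse_test_tokens` and the toy data are hypothetical.

```python
def analyse_test_tokens(train, test):
    """Return (unseen, conflicting) sets of lowercased test tokens.

    unseen:      tokens that never occur in the training data.
    conflicting: tokens whose test label was never assigned to them in training.
    """
    train_labels = {}
    for token, label in train:
        train_labels.setdefault(token.lower(), set()).add(label)

    unseen, conflicting = set(), set()
    for token, label in test:
        key = token.lower()
        if key not in train_labels:
            unseen.add(key)
        elif label not in train_labels[key]:
            conflicting.add(key)
    return unseen, conflicting


# Toy data for illustration only; labels are 1 = complex, 0 = non-complex.
train = [("with", 0), ("connection", 0), ("market", 1)]
test = [("with", 1), ("underwriter", 1), ("market", 1), ("Londonderry", 1)]
unseen, conflicting = analyse_test_tokens(train, test)
```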
V Conclusions and Future Work
Complex Word Identification is a challenging task, even when using state-of-the-art Transformer-based solutions. In this work, we introduce an approach that improves the previous results on the cross-lingual and monolingual CWI Shared Task 2018 by using multilingual and language-specific Transformer models, multilingual word embeddings (non-Transformer), and different fine-tuning techniques. Fine-tuning a model on data from two different languages creates the opportunity of grasping features that empower it to better recognize complex words in certain contexts, even in a different language. In addition, zero-shot, one-shot, and few-shot learning strategies provide good results, surpassing strong baselines and proposing an alternative to help non-native speakers properly understand the difficult aspects of a certain language.
For future work, we intend to improve our results on the monolingual tasks by integrating additional models, such as XLNet, and techniques like adversarial training and multi-task learning. Furthermore, we intend to experiment with other pretraining techniques specific to Transformer models, such that the results for French can benefit from cross-lingual transfer learning.
This work was supported by the Operational Programme Human Capital of the Ministry of European Funds through the Financial Agreement 51675/09.07.2019, SMIS code 125125.
- (2019) On difficulties of cross-lingual transfer with order differences: a case study on dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2440–2452.
- (2018-06) Cross-lingual complex word identification with multitask learning. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, Louisiana, pp. 166–174.
- (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- (2020) Spanish pre-trained bert model and evaluation data. In Practical ML for Developing Countries Workshop@ ICLR 2020.
- (1997) Multitask learning. Machine learning 28 (1), pp. 41–75.
- (2020) Cross-lingual disaster-related multi-label tweet classification with manifold mixup. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 292–298.
- (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087.
- (2019) Zero-shot cross-lingual abstractive sentence summarization through teaching generation and attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3162–3172.
- (2019) Strong baselines for complex word identification across multiple languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 970–977.
- (2019) Cross-lingual visual verb sense disambiguation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1998–2004.
- (2020) Irony detection in a multilingual context. In European Conference on Information Retrieval, pp. 141–149.
- (2019) Complex word identification as a sequence labelling task. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1148–1153.
- (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- Effective cross-lingual transfer of neural machine translation models without shared vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1246–1257.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2001) Decision templates for multiple classifier fusion: an experimental comparison. Pattern recognition 34 (2), pp. 299–314.
- (2020) Extensively matching for few-shot learning event detection. arXiv preprint arXiv:2006.10093.
- (2017) Perceptions, awareness and perceived effects of home culture on intercultural communication: perspectives of university students in china. System 67, pp. 25–37.
- (2020) Coach: a coarse-to-fine approach for cross-domain slot filling. arXiv preprint arXiv:2004.11727.
- (2019) Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894.
- An introduction to convolutional neural networks. ArXiv e-prints.
- (2016) Semeval 2016 task 11: complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 560–569.
- Cross-domain and cross-lingual abusive language detection: a hybrid approach with deep learning and a multilingual lexicon. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 363–370.
- (2014-10) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- (2019) How multilingual is multilingual bert?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4996–5001.
- (2006) Ensemble based systems in decision making. IEEE Circuits and systems magazine 6 (3), pp. 21–45.
- (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
- (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3795–3805.
- (2019) Multilingual complex word identification: convolutional neural networks with morphological and linguistic features. In Proceedings of the Student Research Workshop Associated with RANLP 2019, pp. 83–89.
- (2020-03) Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network. Physica D: Nonlinear Phenomena 404, pp. 132306.
- (2012) WordNet-based lexical simplification of a document. In KONVENS, pp. 80–88.
- (2019) Deep cross-lingual coreference resolution for less-resourced languages: the case of basque. In Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference, pp. 35–41.
- (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
- (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355.
- (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763.
- (2018) A report on the complex word identification shared task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 66–78.
- (2017-09) Multilingual and cross-lingual complex word identification. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, pp. 813–822.
- (2017) Complex word identification: challenges in data annotation and system performance. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pp. 59–63.
- (2019) Improving low-resource cross-lingual document retrieval by reranking with deep bilingual representations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3173–3179.
- (2019) Freelb: enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.