Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition

09/18/2019 ∙ by Genta Indra Winata, et al. ∙ The Hong Kong University of Science and Technology

In countries that speak multiple main languages, mixing different languages within a conversation is commonly called code-switching. Previous works addressing this challenge mainly focused on word-level aspects such as word embeddings. However, in many cases, languages share common subwords, especially closely related languages, but also languages that seem unrelated. Therefore, we propose Hierarchical Meta-Embeddings (HME), which learn to combine multiple monolingual word-level and subword-level embeddings to create language-agnostic lexical representations. On the task of Named Entity Recognition for English-Spanish code-switching data, our model achieves state-of-the-art performance in the multilingual setting. We also show that, in the cross-lingual setting, our model not only leverages closely related languages but also learns from languages with different roots. Finally, we show that combining different subunits is crucial for capturing code-switching entities.




1 Introduction

Code-switching is a phenomenon that often occurs among multilingual speakers, in which they switch between two languages within a conversation; recognizing it well is therefore practically useful in spoken language systems winata2018code. It occurs especially often with entities such as organizations or products, which motivates us to focus on the specific problem of Named Entity Recognition (NER) in code-switching scenarios. One example is the following:

  • walking dead le quita el apetito a cualquiera

  • (translation) walking dead (a movie title) takes away the appetite of anyone

For this task, previous works have mostly focused on applying pre-trained word embeddings from each language to represent noisy mixed-language text, combining them with character-level representations trivedi2018iit; wang2018code; winata2018bilingual. However, despite the effectiveness of such word-level approaches, they neglect the importance of subword-level characteristics shared across different languages. Such information is often hard to capture with word embeddings or randomly initialized character-level embeddings. Naturally, we can turn to subword-level embeddings such as FastText grave2018learning, which allow us to leverage the morphological structure shared across different languages.

Figure 1: Hierarchical Meta-Embeddings (HME) architecture for the Named Entity Recognition (NER) task. Left: Transformer-CRF architecture for Named Entity Recognition. Right: HME accepts word, BPE, and character inputs.

Despite this expected usefulness, not much attention has been paid to using subword-level features in this task. This is partly because combining different language embeddings in the subword space is non-trivially difficult: different languages segment the same word into different subwords. This leads us to explore the literature on Meta-Embeddings yin2016learning; muromagi2017linear; bollegala2018think; coates2018frustratingly; kiela2018dynamic; winata2019learning, a method for learning how to combine different embeddings.

In this paper, we propose Hierarchical Meta-Embeddings (HME), which learn how to combine different pre-trained monolingual embeddings at the word, subword, and character levels into a single language-agnostic lexical representation without using specific language identifiers (the source code is available at https://github.com/gentaiscool/meta-emb). To address the issue of different segmentations, we add a Transformer vaswani2017attention encoder which learns the important subwords in a given sentence. We evaluate our model on the task of Named Entity Recognition for English-Spanish code-switching data, using Transformer-CRF, a transformer-based encoder for sequence labeling based on the implementation of winata2019learning. Our experimental results confirm that HME significantly outperforms the state-of-the-art system in absolute F1 score. The analysis shows that on English-Spanish mixed texts, not only do similar languages like Portuguese or Catalan help, but seemingly distant languages of Celtic origin also significantly increase performance.

2 Related Work


Previous works have extensively explored representations at the word mikolov2013distributed; pennington2014glove; grave2018learning; xu2018emo2vec, subword sennrich2016neural; heinzerling2018bpemb, and character Santos2014LearningCR; wieting2016charagram levels. lample2016neural successfully concatenated character and word embeddings in their model, showing the potential of combining multiple representations. liu-etal-2019-incorporating proposed leveraging word and subword embeddings for unsupervised machine translation.


Recently, there have been studies on combining multiple word embeddings in pre-processing steps yin2016learning; muromagi2017linear; bollegala2018think; coates2018frustratingly. Later, kiela2018dynamic introduced a method to dynamically learn word-level meta-embeddings, which can be used effectively in a supervised setting. winata2019learning proposed leveraging multiple embeddings from different languages to generate language-agnostic meta-representations for mixed-language data.

3 Hierarchical Meta-Embeddings

We propose a method to combine word, subword, and character representations into a mixture of embeddings. We generate multilingual meta-embeddings of words and subwords, and then concatenate them with character-level embeddings to form the final word representations, as shown in Figure 1. Let $\mathbf{w} = [w_1, \dots, w_n]$ be a sequence of $n$ words, where $w_i$ is the $i$-th word. Each word $w_i$ can be tokenized into a list of subwords $\mathbf{s}_i$ and a list of characters $\mathbf{c}_i$. The list of subwords is generated using a function $f$ that maps a word into a sequence of subwords: $\mathbf{s}_i = f(w_i)$. Further, let $E^w$, $E^s$, and $E^c$ be the sets of word, subword, and character embedding lookup tables. Each set consists of different monolingual embeddings, and each element is transformed into an embedding vector in $\mathbb{R}^d$. We use subscripts to denote the element and embedding-language indices, and superscripts $w$, $s$, and $c$ to denote word, subword, and character.
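As a concrete illustration of this input hierarchy, a toy segmenter might look like the following (the fixed-length chunking is a hypothetical stand-in for a learned BPE vocabulary such as BPEmb):

```python
def segment(word):
    """Toy stand-in for the subword segmentation function f: maps a word
    to its list of subwords and its list of characters. Real BPE merges
    are learned; naive 3-character chunking is used here purely to show
    the shapes involved."""
    subwords = [word[i:i + 3] for i in range(0, len(word), 3)]
    chars = list(word)
    return subwords, chars

subwords, chars = segment("walking")  # → ['wal', 'kin', 'g'], 7 characters
```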

3.1 Multilingual Meta-Embeddings (MME)

We generate meta-representations by taking the vector representations from multiple monolingual pre-trained embeddings of different subunits, such as words and subwords. We apply a projection matrix $W_l$ to transform each embedding $x_{i,l}$ from its original space to a new shared space $\mathbb{R}^{d'}$. Then, we calculate attention weights $\alpha_{i,l}$ with a non-linear scoring function $\phi$ (e.g., tanh) to take important information from each individual embedding. MME is then calculated as the weighted sum of the projected embeddings $x'_{i,l}$:

$$x'_{i,l} = W_l\, x_{i,l}, \qquad \alpha_{i,l} = \frac{\exp(\phi(x'_{i,l}))}{\sum_{j=1}^{m}\exp(\phi(x'_{i,j}))}, \qquad x^{\mathrm{MME}}_i = \sum_{l=1}^{m} \alpha_{i,l}\, x'_{i,l},$$

where $m$ is the number of monolingual embeddings.
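A minimal numpy sketch of MME for a single token, assuming two source languages with different embedding dimensions; the tanh-then-sum scorer is a simplified stand-in for the learned scoring network:

```python
import numpy as np

def mme(embeddings, projections):
    """Project each monolingual embedding of one token into a shared
    d'-dimensional space, score each projection, and return the
    attention-weighted sum (softmax over languages)."""
    projected = [W @ x for W, x in zip(projections, embeddings)]
    scores = np.array([np.tanh(p).sum() for p in projected])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, projected))

rng = np.random.default_rng(0)
embs = [rng.standard_normal(300), rng.standard_normal(200)]   # e.g. en, es vectors
projs = [rng.standard_normal((100, 300)) * 0.05,              # W_l: d_l -> d' = 100
         rng.standard_normal((100, 200)) * 0.05]
shared_vec = mme(embs, projs)                                  # lives in R^100
```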

Table 1: Results (F1 mean ± standard deviation over five experiments). The first four columns are the multilingual setting (with the main languages); the last three are the cross-lingual setting (without the main languages). ca-pt adds closely-related languages; br-cy-ga-gd-gv adds distant ones. CONCAT and LINEAR are flat word-level baselines; MME is word-level only.

| Model | en-es | ca-pt | ca-pt-de-fr-it | br-cy-ga-gd-gv | ca-pt | ca-pt-de-fr-it | br-cy-ga-gd-gv |
|---|---|---|---|---|---|---|---|
| CONCAT | 65.30 ± 0.38 | 64.99 ± 1.06 | 65.91 ± 1.16 | 65.79 ± 1.36 | 58.28 ± 2.66 | 64.02 ± 0.26 | 50.77 ± 1.55 |
| LINEAR | 64.61 ± 0.77 | 65.33 ± 0.87 | 65.63 ± 0.92 | 64.95 ± 0.77 | 60.72 ± 0.84 | 62.37 ± 1.01 | 53.28 ± 0.41 |
| MME (word) winata2019learning | 65.43 ± 0.67 | 66.63 ± 0.94 | 66.80 ± 0.43 | 66.56 ± 0.40 | 61.75 ± 0.56 | 63.23 ± 0.29 | 53.43 ± 0.37 |
| HME + BPE | 65.90 ± 0.72 | 67.31 ± 0.34 | 67.26 ± 0.54 | 66.88 ± 0.37 | 63.44 ± 0.33 | 63.78 ± 0.62 | 60.19 ± 0.63 |
| HME + Char | 65.88 ± 1.02 | 67.38 ± 0.84 | 65.94 ± 0.47 | 66.10 ± 0.33 | 61.97 ± 0.60 | 63.06 ± 0.69 | 57.50 ± 0.56 |
| HME + BPE + Char | 66.55 ± 0.72 | 67.80 ± 0.31 | 67.07 ± 0.49 | 67.02 ± 0.16 | 63.90 ± 0.22 | 64.52 ± 0.35 | 60.88 ± 0.84 |
3.2 Mapping Subwords and Characters to Word-Level Representations

We propose to map subwords into word-level representations, and choose byte-pair encodings (BPE) sennrich2016neural since they have a compact vocabulary. First, we apply $f$ to segment each word into a set of subwords, and then we extract the pre-trained subword embedding vectors $x^s_{i,l}$ for each language $l$.

Since each language segments a word into a different set of subwords, we replace the projection matrix with a Transformer vaswani2017attention encoder that learns and combines the important subwords into a single vector representation. Then, we create the subword-level MME $x^{\mathrm{MME},s}_i$ by taking the weighted sum of the encoded subword representations.
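Because the number of subwords varies per word, the encoder must collapse them to a fixed-size word vector. The attention-pooling sketch below (plain numpy, not a real Transformer) shows only that collapse step, with weights derived from each subword's similarity to the mean subword vector:

```python
import numpy as np

def pool_subwords(subword_vecs):
    """Reduce a variable-length list of subword vectors to one word-level
    vector via a softmax-weighted sum — a lightweight stand-in for the
    learned self-attention of a Transformer encoder."""
    X = np.stack(subword_vecs)                 # (num_subwords, d)
    scores = X @ X.mean(axis=0)                # relevance to the mean subword
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ X                         # fixed-size (d,) output

vecs = [np.ones(64), 3 * np.ones(64)]          # two subwords of one word
word_vec = pool_subwords(vecs)                 # 64-dim regardless of subword count
```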

| Model | F1 |
|---|---|
| trivedi2018iit | 61.89 |
| wang2018code | 62.39 |
| wang2018code (Ensemble) | 62.67 |
| winata2018bilingual | 62.76 |
| trivedi2018iit (Ensemble) | 63.76 |
| winata2019learning (MME) | 66.63 ± 0.94 |
| Random embeddings | 46.68 ± 0.79 |
| Aligned embeddings: MUSE (es → en) | 60.89 ± 0.37 |
| Aligned embeddings: MUSE (en → es) | 61.49 ± 0.62 |
| Multilingual embeddings: our approach | 67.80 ± 0.31 |
| Multilingual embeddings: our approach (Ensemble) | 69.17 |
| Cross-lingual embeddings: our approach | 64.52 ± 0.35 |
| Cross-lingual embeddings: our approach (Ensemble) | 65.99 |

Table 2: Comparison to existing works. Ensemble: we run a majority voting scheme over five different models.

To obtain character-level representations, we apply an encoder over the characters of each word.


We combine the word-level, subword-level, and character-level representations by concatenation: $x_i = [x^{\mathrm{MME},w}_i; x^{\mathrm{MME},s}_i; x^c_i]$, where $x^{\mathrm{MME},w}_i$ and $x^{\mathrm{MME},s}_i$ are the word-level and BPE-level MMEs and $x^c_i$ is a character embedding. We randomly initialize the character embedding and keep it trainable, while all subword and word pre-trained embeddings remain fixed during training.
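With illustrative dimensions (100 for each MME and 50 for the character encoding; the actual sizes are hyperparameters), the final lexical representation is a plain concatenation:

```python
import numpy as np

word_mme = np.zeros(100)    # word-level MME (built from frozen pre-trained embeddings)
bpe_mme = np.zeros(100)     # BPE-level MME (built from frozen pre-trained embeddings)
char_vec = np.zeros(50)     # character encoding (randomly initialized, trainable)

# Final per-token input to the sequence labeler.
x = np.concatenate([word_mme, bpe_mme, char_vec])
```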

3.3 Sequence Labeling

To predict the entities, we use Transformer-CRF, a transformer-based encoder followed by a Conditional Random Field (CRF) layer Lafferty2001ConditionalRF. The CRF layer is useful for constraining the dependencies between labels.


The best output sequence is then selected using the Viterbi algorithm.
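A minimal sketch of the Viterbi decode performed at the CRF layer, assuming per-token emission scores and a tag-transition matrix (real CRF layers also learn start/end transition scores, omitted here):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring tag path.
    emissions: (T, K) per-token tag scores from the encoder.
    transitions: (K, K) score for moving from tag i to tag j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[i, j] = best score ending in tag i, then transitioning to j.
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):       # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two tokens, two tags: each token strongly prefers a different tag.
em = np.array([[2.0, 0.0], [0.0, 2.0]])
tr = np.zeros((2, 2))
best = viterbi(em, tr)   # → [0, 1]
```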

4 Experiments

4.1 Experimental Setup

We train our model to solve Named Entity Recognition on English-Spanish code-switching tweet data from W18-3219. There are nine entity labels in IOB format. The training, development, and testing sets contain 50,757, 832, and 15,634 tweets, respectively.

We use FastText word embeddings trained on Common Crawl and Wikipedia grave2018learning for English (en) and Spanish (es), as well as four Romance languages: Catalan (ca), Portuguese (pt), French (fr), Italian (it); a Germanic language: German (de); and five Celtic languages as the distant language group: Breton (br), Welsh (cy), Irish (ga), Scottish Gaelic (gd), Manx (gv). We also add the English Twitter GloVe word embeddings pennington2014glove and BPE-based subword embeddings from heinzerling2018bpemb.

We train our model in two different settings: (1) the multilingual setting, where we combine the main languages (en-es) with the Romance languages and a Germanic language, and (2) the cross-lingual setting, where we use the Romance and Germanic languages without the main languages. Our model contains four layers of transformer encoders with a hidden size of 200, four heads, and a dropout of 0.1. We use the Adam optimizer and start training with a learning rate of 0.1 and early stopping with a patience of 15 iterations. We replace user hashtags and mentions with <USR>, emoji with <EMOJI>, and URLs with <URL>. We evaluate our model using the absolute F1 score.
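The token replacements can be sketched with regular expressions; the exact patterns are assumptions (the paper only names the special tokens), and emoji substitution is omitted here for brevity:

```python
import re

def preprocess(tweet):
    """Replace URLs, user mentions, and hashtags with the special tokens
    used during training. Emoji -> <EMOJI> would require an emoji
    codepoint table and is left out of this sketch."""
    tweet = re.sub(r"https?://\S+", "<URL>", tweet)
    tweet = re.sub(r"[@#]\w+", "<USR>", tweet)  # mentions and hashtags share <USR>
    return tweet

preprocess("@john walking dead le quita el apetito #tv http://t.co/x")
# → '<USR> walking dead le quita el apetito <USR> <URL>'
```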

4.2 Baselines


CONCAT

We concatenate word embeddings by merging the dimensions of the word representations. This method combines embeddings into a high-dimensional input that may cause inefficient computation.



LINEAR

We sum all word embeddings into a single word vector with equal weight. This method combines embeddings without considering the importance of each of them.
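The two flat baselines differ only in how they merge the per-language vectors (dimensions illustrative):

```python
import numpy as np

# The same word represented in two language-specific embedding spaces.
vec_en = np.ones(300)
vec_es = 2 * np.ones(300)

concat = np.concatenate([vec_en, vec_es])   # CONCAT: 600-dim input
linear = vec_en + vec_es                    # LINEAR: equal-weight sum, stays 300-dim
```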


Random Embeddings

We use randomly initialized word embeddings and keep it trainable to calculate the lower-bound performance.

Aligned Embeddings

We align English and Spanish FastText embeddings using CSLS in two scenarios. We set English (en) as the source language and Spanish (es) as the target language (en → es), and vice versa (es → en). We run MUSE using the code released by the authors of lample2018word (https://github.com/facebookresearch/MUSE).

5 Results & Discussion

In general, from Table 1, we can see that word-level meta-embeddings, even without subword or character-level information, consistently perform better than the flat baselines (e.g., CONCAT and LINEAR) in all settings. This is mainly because of the attention layer, which does not require additional parameters. Furthermore, comparing our approach to previous state-of-the-art models, we can clearly see that all of our proposed approaches significantly outperform them.

From Table 1, in the multilingual setting, which trains with the main languages, it is evident that adding both closely-related and distant language embeddings improves performance. This shows that our model is able to leverage the lexical similarity between the languages. The effect is more distinct in the cross-lingual setting, where using distant languages performs significantly worse than using closely-related ones (e.g., ca-pt). Interestingly, for distant languages, adding subwords still yields a drastic performance increase. We hypothesize that even though the characters are mostly different, the lexical structure is similar to that of our main languages.

Figure 2: Heatmap of attention over languages from a validation sample. Left: word-level MME, Right: BPE-level MME. We extract the attention weights from a multilingual model (en-es-ca-pt-de-fr-it).
Figure 3: The average of attention weights for word embeddings versus NER tags from the validation set.

On the other hand, adding subword inputs to the model is consistently better than adding characters. This is due to the transfer of information from the pre-trained subword embeddings. As shown in Table 1, subword embeddings are more effective for distant languages (the Celtic languages) than for closely-related languages such as Catalan or Portuguese.

Moreover, we visualize the attention weights of the model at the word and subword levels to interpret the model dynamics. From the left image of Figure 2, at the word level, the model mostly chooses the correct language embedding for each word, but also mixes in other languages. Without any language identifiers, it is impressive to see that our model learns to attend to the right languages. The right side of Figure 2, which shows attention weight distributions at the subword level, demonstrates interesting behavior: for most English subwords, the model leverages ca, fr, and de embeddings. We hypothesize this is because the dataset is mainly constructed with Spanish words, which can also be verified from Figure 3, in which most NER tags are classified as Spanish (es).


6 Conclusion

We propose Hierarchical Meta-Embeddings (HME), which learn how to combine multiple monolingual word-level and subword-level embeddings into language-agnostic representations without any specific language information. We achieve state-of-the-art results on the task of Named Entity Recognition for English-Spanish code-switching data. We also show that our model can very effectively leverage subword information from languages with different roots to generate better word representations.


Acknowledgments

This work has been partially funded by ITF/319/16FP and MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government, the School of Engineering Ph.D. Fellowship Award of the Hong Kong University of Science and Technology, and RDC 1718050-0 of EMOS.AI. We sincerely thank the three anonymous reviewers for their insightful comments on our paper.