Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition

06/01/2021 ∙ by Shining Liang, et al. ∙ Microsoft, Simon Fraser University, Jilin University

Named entity recognition (NER) is a fundamental component in many applications, such as Web Search and Voice Assistants. Although deep neural networks greatly improve the performance of NER, they require large amounts of training data and therefore can hardly scale out to many languages in an industry setting. To tackle this challenge, cross-lingual NER transfers knowledge from a rich-resource language to languages with low resources through pre-trained multilingual language models. Instead of using training data in target languages, cross-lingual NER has to rely only on training data in source languages, optionally adding translated training data derived from the source languages. However, the existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages, which is relatively easy to collect in industry applications. To address the opportunities and challenges, in this paper we describe our novel practice at Microsoft to leverage such large amounts of unlabeled data in target languages in real production settings. To effectively extract weak supervision signals from the unlabeled data, we develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning. The empirical study on three benchmark data sets verifies that our approach establishes the new state-of-the-art performance by a clear margin. The NER techniques reported in this paper are now on their way to becoming a fundamental component for Web ranking, Entity Pane, Answers Triggering, and Question Answering in the Microsoft Bing search engine. Moreover, our techniques will also serve as part of the Spoken Language Understanding module for a commercial voice assistant. We plan to open source the code of the prototype framework after deployment.




1. Introduction

Named entity recognition (NER) (Nadeau and Sekine, 2007) identifies text spans that belong to predefined entity categories, such as persons, locations, and organizations. For example, in the sentence “John Doe wrote to the association of happy entrepreneurs.”, NER may identify that the first two words, “John Doe”, refer to a person, and the last five words, “the association of happy entrepreneurs” refer to an organization. As a fundamental component in Natural Language Processing (NLP), NER has numerous applications in various industrial products. For example, in a commercial Web search engine, such as Microsoft Bing, NER is crucial for Query Understanding (et al., 2019b), Web Information Extraction (et al., 2020m), and Question Answering (et al., 2020c, g). For voice assistants such as Siri, Alexa, and Cortana, NER is a key building block for Spoken Language Understanding (SLU) (Tur and Mori, 2011). For global companies, such as Microsoft, cross-lingual NER is critical to deploy and maintain their products across hundreds of regions with a large number of languages (typically over one hundred).

Recently, deep neural networks have achieved great performance in NER (et al., 2020b; Akbik, 2019). However, deep neural network models typically require large amounts of training data, which presents a huge challenge for global companies that deploy and maintain their products across different regions with many languages. Importantly, labeling training data is not a one-off effort. Maintaining high-quality NER models requires periodic refreshes of the training data, e.g., tens of thousands of newly annotated instances every few months per language. Moreover, as products evolve, the schema often needs to be updated, e.g., by adding more classes of named entities to be recognized, merging some existing classes, or retiring some classes. Such schema updates incur extra cost in adjusting or even relabeling training data to comply with the new schema. Although crowd-sourcing can substantially reduce the cost of data labeling, once data refreshes, schema updates, and a large number of languages are considered, it is still too expensive, if not entirely unrealistic, to manually label training data at an industrial scale. In addition to financial constraints, hiring crowd-sourcing workers, building labeling guidelines and pipelines, and controlling labeling quality, especially for low-resource languages, are also challenging and time-consuming. Therefore, scaling out NER to a large number of languages remains a grand challenge to the industry.

To reduce the cost of human labeling training data, cross-lingual NER tries to transfer knowledge from rich-resource (source) languages to low-resource (target) languages. This approach usually pre-trains a multilingual model to learn a unified representation of different languages (such as mBERT (et al., 2019d), Unicoder (et al., 2019c), and XLM-Roberta (et al., 2020a)). Then the pre-trained model is further fine-tuned using the training data in the source language, and is applied to other languages (et al., 2020i; Wu and Dredze, 2019). Although this approach has shown good results for classification tasks, the performance on sequence labeling tasks, such as NER and SLU, is still far from perfect (et al., 2020e; Wu and Dredze, 2019). Table 1 compares the NER performance in English versus that in some target languages. Following (Moon, 2019), we fine-tune mBERT with English data and directly test on the target languages. A dramatic drop in F1 score in every target language clearly indicates a big performance loss.

Language   English   Spanish          Dutch            German
F1 score   90.87     75.56 (-15.31)   78.86 (-12.01)   71.94 (-18.93)
Table 1. NER performance (F1 score) in English versus some target languages. Numbers in parentheses are the differences from the English score. Following (Moon, 2019), we fine-tune mBERT with English data and directly test on the target languages.

To enhance the transferability of cross-lingual models, several methods convert training examples in a source language into examples in a target language through machine translation (et al., 2017c, 2020e). The annotation of entities is derived through word or phrase alignments between source and target languages (et al., 2018a, 2017b). Despite the improved transferability across languages, this approach still suffers from several critical limitations. First, parallel data and machine translators may not be available for all target languages. Second, translated data may not be diverse enough compared to real target data, and there may exist some translation artifacts in the data distribution (et al., 2020h). Finally, there are both translation errors and alignment errors in translated data, which hurt the performance of models (et al., 2017b).

In this paper, we describe a different approach to cross-lingual NER practiced in the Microsoft product team. Our approach is based on the industry reality that in real product settings, it is often feasible to collect large amounts of unlabeled data in target languages. For example, in both Web search engines and voice assistants, there are huge amounts of user queries or utterances recorded in the search/product logs. Compared with the existing approaches, our method does not need parallel data or machine translators. Moreover, the real user input is much larger in size and much richer in the diversity of expressions. However, leveraging such rich and diversified unlabeled data is far from straightforward. A recent effort (et al., 2020k) explores a semi-supervised knowledge distillation (et al., 2015) approach that allows a student model to learn NER knowledge from a teacher model through the distillation process. Yet, as shown in Table 1 as well as in previous work (Wu and Dredze, 2019; et al., 2020e), fine-tuning using English data alone often leads to inferior results for sequence labeling tasks.

In our approach, we adopt the knowledge distillation framework and use a weak model tuned from English data alone as a starting point. The novelty of our approach is that we develop a reinforcement learning (RL) based framework, which trains a policy network to predict the utility of an unlabeled example for improving the student model. Then, based on the predicted utility, the examples are selectively added to the knowledge distillation process. We observe that this screening process can effectively improve the performance of the derived student model. Moreover, we adopt a bootstrapping approach and extend the knowledge distillation step into an iterative process: the student model derived from the last round takes the role of the teacher model in the next round. With the guidance of the policy network, the noise in supervision signals, that is, the prediction errors made by teacher models, is reduced step by step. The model evolves towards better performance for NER in each round, which in turn generates stronger supervision signals for the next round.

We make the following contributions in this paper. First, we target an underlying component in many industrial applications and call out the unaddressed challenges for cross-lingual NER. After analyzing various existing approaches to this problem and considering the industry practice, we propose to leverage large amounts of unlabeled data, which can often be easily collected in real applications. Second, we present our findings that by smartly selecting the unlabeled data in an iterative reinforcement learning framework, the model performance can be improved substantially. We develop an industry solution that can be used in many products built on NER. Third, we conduct experiments on three widely used datasets and demonstrate the effectiveness of the proposed framework. We establish the new SOTA performance with a clear gain compared to the existing strong baselines. The NER techniques reported in this paper are now on their way to becoming a fundamental component for Web ranking, Entity Pane, Answers Triggering, and Question Answering in the Microsoft Bing search engine. Moreover, our techniques will also serve as part of the Spoken Language Understanding module for a commercial voice assistant.

The rest of the paper is organized as follows. We review the related work in Section 2, and present our method in Section 3. We report an empirical study in Section 4, and conclude the paper in Section 5. Table 2 summarizes some frequently used symbols.

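Symbol                                                  Description
$p(\cdot;\Theta_{tea})$ / $p(\cdot;\Theta_{stu})$       probability distribution of teacher/student model
$\mathcal{M}_{src}$                                     model for source language
$\mathcal{M}_{tgt}$                                     model for target language
$\mathcal{M}^{(k)}_{tgt}$                               student model in the $k$-th distillation iteration
$\pi_{\Phi}$                                            policy network used to select instances
$\mathcal{S}$                                           batch of state vectors
$\mathcal{A}$                                           sampled action with policy network
$\mathcal{B}$, $\hat{\mathcal{B}}$                      batches of instances and selected instances in target language
Table 2. Frequently used notations in the paper.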

2. Related Work

Our approach is closely related to the existing work on cross-lingual NER, knowledge distillation, and reinforcement learning. In this section, we briefly review the most closely related studies, in addition to those discussed in Section 1.

Zero-shot cross-lingual NER seeks to extract entities in a target language but assumes only annotated data in a source language. Pseudo training data in a target language may be generated by leveraging parallel corpus and word alignment models (et al., 2017b) or by machine translation approaches (et al., 2017c, 2018a). In addition to training using synthetic data, some approaches directly transfer models in source languages to target languages using a shared vector space to represent different languages (et al., 2017a, e).

Recently, pre-trained multilingual language models have been adopted to address the challenge of cross-lingual transfer by using only the labeled training data in the source language and directly transferring to target languages (Wu and Dredze, 2019; et al., 2020f; Wu and Dredze, 2020). Taking advantage of large-scale unsupervised pre-training, these methods achieve prominent results in cross-lingual NER. However, the performance in target languages is still unsatisfactory due to the lack of corresponding knowledge about target languages. In this work, on top of those pre-trained multilingual models, we propose an iterative distillation framework under the guidance of reinforcement learning to enhance the cross-lingual transferability using unlabeled data in target languages.

Knowledge distillation (KD) is effective in transferring knowledge from a complex teacher model to a simple student model (et al., 2015, 2020o). In a standard KD procedure, a teacher model is first obtained by training on gold-standard labeled data. A student model is then optimized by learning from the ground-truth labels as well as mimicking the output distribution of the teacher model. KD has also been used for cross-lingual transfer. For example, Xu and Yang (Xu and Yang, 2017) leverage soft labels produced by a model in a rich-resource language to train a target language model on a parallel corpus. Wu et al. (et al., 2020k) train a teacher model based on a pre-trained multilingual language model and directly distill knowledge using unlabeled data in target languages. Nevertheless, these methods perform knowledge distillation with all instances and do not address the subtlety that some samples may have a negative impact due to teacher model prediction errors.

In this work, we establish a reinforcement learning based framework to select unlabeled instances for knowledge distillation in cross-lingual knowledge transfer by removing the errors in teacher model predictions in the target language. This framework can be applied to not only NER, but also more cross-lingual Web applications, such as relation extraction and question answering.

Reinforcement learning (RL) (Sutton and Barto, 2018) has been widely used in natural language processing, such as dialogue systems (González-Garduño, 2019) and machine translation (et al., 2018b). Those methods leverage semantic information as rewards to train generative models. In particular, a series of studies use RL to select proper training instances. For example, Wang et al. (et al., 2019e) leverage a selector to select source domain data close to the target domain, and the selector receives its reward from the discriminator and the transfer learning module.

Motivated by the above studies, in this work, we leverage RL to smartly select unlabeled instances for knowledge distillation. To the best of our knowledge, our work is the first to apply reinforcement learning for cross-lingual transfer learning.

3. Methodology

In this section, we first define the problem and review the preliminaries. Then, we introduce our iterative knowledge distillation framework for cross-lingual NER. Last, we develop our reinforced selective knowledge distillation technique.

3.1. Problem Definition and Preliminaries

We model cross-lingual named entity recognition as a sequence labeling problem. Given a sentence $x = \{x_i\}_{i=1}^{L}$ with $L$ tokens, a NER model produces a sequence of labels $y = \{y_i\}_{i=1}^{L}$, where $y_i$ indicates the category of the entity (or not an entity) of the corresponding token $x_i$. Denote by $\mathcal{D}^{S}_{train}$ the annotated data in the source language, where the superscript $S$ indicates that this is a data set in the source language. In the target language, annotated data is not available for training except for a test set $\mathcal{D}^{T}_{test}$, where the superscript $T$ indicates that those are data sets in the target language. We also assume unlabeled data in the target language, denoted by $\mathcal{D}^{T}_{unlabel}$, which may be leveraged for knowledge distillation. Formally, zero-shot cross-lingual NER is to learn a model by leveraging both $\mathcal{D}^{S}_{train}$ and $\mathcal{D}^{T}_{unlabel}$ to obtain good performance on $\mathcal{D}^{T}_{test}$.
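To make the sequence labeling formulation concrete, the following minimal sketch converts entity spans into per-token BIO tags for the example sentence from Section 1. The function name `spans_to_bio` and the span format (start index, exclusive end index, entity type) are assumptions made for illustration and are not taken from the paper.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, type) entity spans into BIO tags.

    Hypothetical helper for illustration only.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = "B-" + entity_type          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type          # remaining tokens of the entity
    return tags


tokens = "John Doe wrote to the association of happy entrepreneurs .".split()
spans = [(0, 2, "PER"), (4, 9, "ORG")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Each token receives exactly one label, so a sentence of length L yields a label sequence of length L.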

An encoder $f_{\theta}$ is used to learn contextualized embeddings and produce hidden states $H = \{h_i\}_{i=1}^{L}$ for an input sentence $x$, that is, $H = f_{\theta}(x)$, where $\theta$ denotes the parameters of the encoder. Here we adopt two pre-trained multilingual language models, mBERT and XLM-Roberta (XLM-R), as the basic encoders separately, to verify the generalization of our method. In general, any encoding model that produces a hidden state $h_i$ for the corresponding input token $x_i$ may be employed. For each token of the sequence, the probability of each category is learned by $p(y_i \mid x_i) = \mathrm{softmax}(W h_i + b)$, where $W$ and $b$ are the weight and the bias term.
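As an illustration of the per-token classification layer described above (a softmax over a linear map of the hidden state), here is a minimal pure-Python sketch. The toy dimensions, the numeric values, and the function names are assumptions chosen for readability; a real system would apply this layer on top of mBERT or XLM-R hidden states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_label_distribution(hidden, weight, bias):
    """Return softmax(W h + b) for one token.

    hidden: d floats, weight: C rows of d floats, bias: C floats.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Toy example: 2-dimensional hidden state and 3 labels (O, B-PER, I-PER).
hidden = [1.0, -0.5]
weight = [[0.2, 0.1], [1.5, -1.0], [-0.3, 0.4]]
bias = [0.0, 0.1, -0.1]
probs = token_label_distribution(hidden, weight, bias)
print(probs)
```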

In general, the Knowledge Distillation (KD) approach (et al., 2015) uses the soft output (logits) of one large model or the ensemble of multiple large models as the knowledge and transfers the knowledge to a small/single student model. The distilled student model can achieve decent performance with high efficiency as well. Although KD was initially proposed for model compression, in this paper, we apply this approach to cross-lingual NER in order to transfer knowledge learned from the training data in the (rich-resource) source language to the (low-resource) target language.

Figure 1. The architecture of our proposed method, Reinforced Iterative Knowledge Distillation for cross-lingual NER. (a) The iterative KD framework. (b) RL based selective KD. Please note that the source model is first obtained through fine-tuning the base model with the labeled data in the source language.

For a NER task, given an unlabeled sentence $x$, the distillation loss is the mean squared error between the entity label distributions predicted by the student model and those predicted by the teacher model. To be specific, the loss with regard to $x$ used to train a student model is formalized as $\mathcal{L}(x;\Theta_{stu}) = \mathrm{MSE}\big(p(x;\Theta_{tea}),\, p(x;\Theta_{stu})\big)$, where $\Theta_{tea}$ and $\Theta_{stu}$ are the parameters of the teacher model and the student model, respectively, $p(x;\Theta_{tea})$ and $p(x;\Theta_{stu})$ are the predicted label distributions of the teacher model and the student model, respectively, and $\mathrm{MSE}(\cdot,\cdot)$ represents the mean squared error. The parameters of the teacher model are fixed during the training. In our knowledge distillation framework, both the teacher model and the student model share the same architecture (multilingual models) but with different parameter weights.
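The distillation loss described above can be sketched as follows. This is a minimal illustration, assuming that the mean is taken over all tokens and all label classes of one sentence; the paper does not spell out the normalization, and the function name is hypothetical.

```python
def mse_distillation_loss(teacher_probs, student_probs):
    """Mean squared error between per-token label distributions.

    Both arguments are lists (one entry per token) of probability lists
    (one entry per label). Averaging over tokens and labels is an assumption.
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must cover the same tokens")
    total, count = 0.0, 0
    for teacher_token, student_token in zip(teacher_probs, student_probs):
        for t, s in zip(teacher_token, student_token):
            total += (t - s) ** 2
            count += 1
    return total / count


# Two tokens, two labels: the student disagrees with the teacher on token 1 only.
teacher = [[0.9, 0.1], [0.2, 0.8]]
student = [[0.7, 0.3], [0.2, 0.8]]
loss = mse_distillation_loss(teacher, student)
print(loss)
```

Only the student would receive gradients from this loss, since the teacher parameters are held fixed during distillation.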

3.2. Iterative KD for Cross-Lingual NER

One challenge in cross-lingual NER is that the teacher model is trained on the source language but applied to the target language. Due to the differences between languages, the knowledge transferred from the teacher model to the student model may contain much noise. To address this challenge, we propose a framework called Reinforced Iterative Knowledge Distillation (or RIKD for short).

The overall architecture of our method is shown in Figure 1(a). A source multilingual model is first trained using the annotated data in the source language. The source multilingual model is leveraged as a teacher model to train a target model by transferring the shared knowledge from the source language to the target language. To reduce noise in knowledge transfer, we introduce a reinforced instance selector to select unlabeled data in the distillation step for better transfer learning. Through the smart selection of examples in knowledge distillation, the student model can improve over the teacher model on the target language. Therefore, we further iterate this RL-based knowledge distillation step for multiple rounds to derive the final target model, where the student model derived from the last round takes the role of the teacher model in the next round.

In cross-lingual knowledge distillation for NER, although the source model is only trained using the labeled data in the source language, it is capable of inferring directly on instances in the target language, since it benefits from the language-independent common feature space of the pre-trained multilingual encoder and from the entity-specific knowledge of the labeled data. The cross-lingual transfer step aims to transfer language-agnostic knowledge from the source model to the model in the target language by minimizing the distance between the prediction distribution of the source model and that of the target model.

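Input: Iteration number $K$; number of training steps $N$; pre-trained model $\mathcal{M}_{src}$ in the source language, which serves as the initial teacher $\mathcal{M}^{(0)}_{tgt}$; target-language unlabeled data $\mathcal{D}^{T}_{unlabel}$; base NER model $\mathcal{M}_{0}$ initialized using the pre-trained weights of mBERT or XLM-R.
Output: Distilled student model $\mathcal{M}^{(K)}_{tgt}$ for the target language.
for $k = 1$ to $K$ do
    Initialize a new model $\mathcal{M}^{(k)}_{tgt}$ with $\mathcal{M}_{0}$.
    for $j = 1$ to $N$ do
        Sample a batch $\mathcal{B}$ from $\mathcal{D}^{T}_{unlabel}$, then distill knowledge from $\mathcal{M}^{(k-1)}_{tgt}$ to $\mathcal{M}^{(k)}_{tgt}$ with $\mathcal{B}$ by Algorithm 2.
    end for
end for
Algorithm 1: Reinforced Iterative Knowledge Distillation.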

Specifically, given an instance $x$ in the target language, we minimize the MSE of the output probability distributions between the source model and the target model, which is given by $\mathcal{L}(x;\Theta_{tgt}) = \mathrm{MSE}\big(p(x;\Theta_{src}),\, p(x;\Theta_{tgt})\big)$, where $\Theta_{src}$ and $\Theta_{tgt}$ are the parameters of the source model and the target model, respectively, and $p(x;\Theta_{src})$ and $p(x;\Theta_{tgt})$ are the predicted label distributions of the source model and the target model, respectively. This cross-lingual distillation step enables the target model to leverage the unlabeled data in the target language by mimicking the soft labels predicted by the source model to transfer knowledge from the source language.

Inspired by the self-training paradigm (et al., 2020l, n), where a model itself is used to generate labels from unlabeled data and the model is retrained using the same structure based on the generated labels, we further leverage the target model to produce the probability distributions of the training instances on the unlabeled data in the target language and conduct another knowledge distillation step to derive a new model in the target language. The training objective of this distillation step is formulated as

$$\mathcal{L}\big(x;\Theta^{(k)}_{tgt}\big) = \mathrm{MSE}\big(p(x;\Theta^{(k-1)}_{tgt}),\, p(x;\Theta^{(k)}_{tgt})\big) \tag{1}$$

where $\Theta^{(k-1)}_{tgt}$ and $\Theta^{(k)}_{tgt}$ are the parameters of the model from the previous iteration and the new model to be trained, respectively, and $p(x;\Theta^{(k-1)}_{tgt})$ and $p(x;\Theta^{(k)}_{tgt})$ are the predicted label distributions of these two models, respectively.

This iterative training step may be conducted for multiple rounds by leveraging unlabeled data in the target language, which is much easier to obtain than labeled data. In our experiments on the benchmark datasets, we find that three rounds can achieve decent and stable results. Algorithm 1 shows the pseudo-code of our RIKD method.
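The control flow of the iterative procedure can be sketched as below. This is only an outline under stated assumptions: `make_student` and `distill` are hypothetical callables standing in for model initialization from pre-trained weights and for one full round of (selective) distillation, and the toy dictionaries merely track which generation each model belongs to.

```python
def iterative_distillation(source_model, make_student, distill, unlabeled, rounds=3):
    """Run several distillation rounds; each student becomes the next teacher."""
    teacher = source_model                      # round 1 is taught by the source model
    for _ in range(rounds):
        student = make_student()                # fresh model from pre-trained weights
        distill(teacher, student, unlabeled)    # one round of distillation
        teacher = student                       # student takes the teacher role
    return teacher


# Toy stand-ins (hypothetical): models are dicts that record their generation.
def make_toy_student():
    return {"generation": None, "seen": 0}


def toy_distill(teacher, student, unlabeled):
    student["generation"] = teacher["generation"] + 1
    student["seen"] = len(unlabeled)


source = {"generation": 0, "seen": 0}
final = iterative_distillation(source, make_toy_student, toy_distill,
                               ["sentence 1", "sentence 2"], rounds=3)
print(final)
```

The default of three rounds mirrors the number reported as sufficient in the experiments; it is a parameter here, not a fixed property of the method.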

3.3. Reinforced Selective Knowledge Distillation

Now let us explain the instance selector in our approach, shown in Figure 1(b). While conventional knowledge distillation directly transfers knowledge from the source model to the target model, the discrepancy between the source language and the target language may induce noise in the soft labels of the source model. As shown in Table 1, the model trained on the source language performs considerably worse on other languages, so the supervision from its predictions may be noisy.

To address this challenge, we use reinforcement learning to select the most informative training instances to strengthen transfer learning between the two languages (or two generations after the first round). We adopt this method in each round of RIKD. The major elements in our reinforcement learning procedure include states, actions, and rewards.

3.3.1. State

We model the state of a given unlabeled instance in the target language by a continuous real-valued vector $s$. Given a target language instance $x$, we first obtain a pseudo-labeled sequence $\hat{y} = \{\hat{y}_i\}_{i=1}^{L}$ using the source model, and use the concatenation of a series of features to form the state vector.

The first two features are based on the prediction results from the source model only. The number of predicted entities indicates the informativeness of an instance by the source model, that is, $n_{ent} = \sum_{i=1}^{L} \mathbb{1}(\hat{y}_i \neq \mathrm{O})$, where $\mathrm{O}$ denotes non-entities in the BIO tagging schema following (Wu and Dredze, 2019). The inference loss of the source model indicates how confident the source model is about the prediction.

The third feature is the MSE loss of the output probability distributions between the source model and the target model on the unlabeled instance, which is $\mathrm{MSE}\big(p(x;\Theta_{src}),\, p(x;\Theta_{tgt})\big)$. It combines the predictions by the source model and the target model. This feature is based on the intuition that the agreement between the source model and the target model may indicate how well the target model imitates the source model on the current instance.

The fourth feature describes the internal representation as well as the output of the target model after seeing the example, using the label-aware representation of the target model. We convert the predicted label $\hat{y}_i$ into a label embedding $e_{\hat{y}_i}$ through a shared trainable embedding matrix $E$, which is trained during the optimization process of the policy network. Then a linear transformation is used to build a label-aware vector for each token: $u_i = W_u\,(h_i \oplus e_{\hat{y}_i}) + b_u$, where $W_u$ and $b_u$ are the weight matrix and the bias term, respectively, the symbol $\oplus$ denotes the concatenation operation, and $h_i$ is the hidden state of token $x_i$ at the last layer of the target model. A max-pooling operation over the length is used to generate the semantic feature: $u = \max_{1 \le i \le L} u_i$.

Last, we use the length $L$ of the target unlabeled instance $x$, which relies only on the instance itself and is used to balance the effects of the instance length and the number of predicted entities. The parameters introduced to learn the state features are part of the policy network to be trained.
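The scalar part of the state vector can be sketched as follows. This is a simplified illustration that omits the learned label-aware semantic feature. Two choices are assumptions, not details confirmed by the text: predicted entities are counted as tokens whose predicted tag is not "O", and the inference loss is taken to be the average negative log-probability of the source model's own predicted labels. The function and variable names are hypothetical.

```python
import math


def scalar_state_features(source_probs, target_probs, labels):
    """Build [num_entity_tokens, inference_loss, mse, length] for one sentence.

    source_probs / target_probs: per-token label distributions of the two models.
    labels: label names, index-aligned with each distribution.
    """
    length = len(source_probs)
    # Pseudo labels: arg-max of the source model's distribution for each token.
    predicted = [max(range(len(p)), key=p.__getitem__) for p in source_probs]
    num_entity_tokens = sum(1 for i in predicted if labels[i] != "O")
    # Assumed inference loss: mean negative log-probability of the pseudo labels.
    inference_loss = -sum(math.log(p[i]) for p, i in zip(source_probs, predicted)) / length
    # Agreement between the two models, averaged over tokens and labels.
    squared = sum((s - t) ** 2
                  for sp, tp in zip(source_probs, target_probs)
                  for s, t in zip(sp, tp))
    mse = squared / (length * len(labels))
    return [float(num_entity_tokens), inference_loss, mse, float(length)]


labels = ["O", "B-PER", "I-PER"]
source_probs = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.9, 0.05, 0.05]]
target_probs = [[0.2, 0.7, 0.1], [0.1, 0.1, 0.8], [0.9, 0.05, 0.05]]
state = scalar_state_features(source_probs, target_probs, labels)
print(state)
```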

3.3.2. Action

We introduce a binary action space $a \in \{0, 1\}$, which indicates whether to keep or drop the current instance from a batch of instances to optimize the target model. A policy function $\pi_{\Phi}(s)$ takes the state as input and outputs the probability distribution from which the action is drawn. The policy network is implemented using a two-layer fully connected network computed as $\pi_{\Phi}(s) = \mathrm{softmax}\big(W_2\,\phi(W_1 s + b_1) + b_2\big)$, where $W_j$ and $b_j$, respectively, are the weight matrix and the bias term of the $j$-th fully connected layer, and $\phi$ is the ReLU activation function.
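A two-layer policy network with a ReLU hidden layer can be sketched in pure Python as below. The softmax output over the two actions, the convention that index 1 means "keep", and all weights and names are assumptions made for this illustration.

```python
import math
import random


def linear(weight, bias, vector):
    """Compute W v + b for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def policy_probs(state, w1, b1, w2, b2):
    """Two fully connected layers; returns [P(drop), P(keep)] (assumed ordering)."""
    hidden = relu(linear(w1, b1, state))
    logits = linear(w2, b2, hidden)
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw a binary action: 1 (keep) with probability probs[1], else 0 (drop)."""
    return 1 if rng.random() < probs[1] else 0


state = [2.0, 0.12, 0.002, 3.0]                 # toy state vector
w1 = [[0.1, -0.2, 0.0, 0.05], [-0.3, 0.4, 0.1, 0.0]]
b1 = [0.0, 0.1]
w2 = [[0.5, -0.5], [-0.5, 0.5]]
b2 = [0.0, 0.0]
probs = policy_probs(state, w1, b1, w2, b2)
action = sample_action(probs, random.Random(0))
print(probs, action)
```

Sampling, rather than taking the arg-max, keeps the selector exploratory during training.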

3.3.3. Reward

The selector takes as input a batch of states $\mathcal{S}$ corresponding to a batch of unlabeled data $\mathcal{B}$, and samples actions $\mathcal{A}$ to keep a subset $\hat{\mathcal{B}}$ of the batch. Since we sample a batch of actions in each updating step, the proposed method assigns one reward to a batch of actions.

We adopt delayed reward (et al., 2019e) to update the selector. Recall that we aim to select a subset of informative training data to improve the cross-lingual NER task. For the $j$-th training step in iteration $k$, we sample actions $\mathcal{A}$ and use the selected sub-batch $\hat{\mathcal{B}}$ to optimize the new target model with parameters $\Theta^{(k)}_{tgt}$. The optimization objective is formulated as below according to Equation 1.

$$\mathcal{L}_{j} = \frac{1}{|\hat{\mathcal{B}}|} \sum_{x \in \hat{\mathcal{B}}} \mathrm{MSE}\big(p(x;\Theta^{(k-1)}_{tgt}),\, p(x;\Theta^{(k)}_{tgt})\big) \tag{2}$$

Since we target the zero-shot setting where no labeled data is available in the target language, including a development set, we use the training loss delta on $\hat{\mathcal{B}}$ to obtain the delayed reward, motivated by Yuan et al. (et al., 2020d), that is,

$$r_{j} = \mathcal{L}_{j-1} - \mathcal{L}_{j} \tag{3}$$

where $\mathcal{L}_{j-1}$ is cached beforehand, and initialized by the training loss of the last warm-up training step when the reinforced training starts.
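The delayed reward can be sketched as a small tracker that caches the previous training loss. The sign convention (previous loss minus current loss, so that a decrease in loss yields a positive reward) is an assumption consistent with the description above, and the class name is hypothetical.

```python
class DelayedReward:
    """Cache the previous training loss and emit the loss delta as reward."""

    def __init__(self, last_warmup_loss):
        # Initialized with the training loss of the last warm-up step.
        self.previous_loss = last_warmup_loss

    def step(self, current_loss):
        reward = self.previous_loss - current_loss   # positive if the loss decreased
        self.previous_loss = current_loss            # cache for the next step
        return reward


tracker = DelayedReward(last_warmup_loss=0.50)
first_reward = tracker.step(0.42)    # loss went down
second_reward = tracker.step(0.45)   # loss went up
print(first_reward, second_reward)
```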

3.3.4. Optimization

We use the policy-based RL method (et al., 1999) to train the selector. Algorithm 2 shows the pseudo-code. First, we pre-train the target model without instance selection for several warm-up steps. Second, for each batch $\mathcal{B}$, we sample actions based on the probabilities given by the policy network. Denote by $\hat{\mathcal{B}}$ the selected instances. Those selected target-language instances are then used as the inputs to perform knowledge distillation and update the parameters of the target model. The selector remains unchanged. Last, we calculate the delayed reward according to Equation 3 and optimize the selector using this reward with cached states and actions, that is,

$$\Phi \leftarrow \Phi + \alpha\, r_{j} \sum_{i=1}^{|\mathcal{B}|} \nabla_{\Phi} \log \pi_{\Phi}(a_i \mid s_i) \tag{4}$$

where $\alpha$ is the learning rate of the policy network, and $s_i$ and $a_i$ are the cached state and action of the $i$-th instance in the batch.

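Input: Policy network $\pi_{\Phi}$; target-language unlabeled data $\mathcal{D}^{T}_{unlabel}$; teacher model $\mathcal{M}^{(k-1)}_{tgt}$ and student model $\mathcal{M}^{(k)}_{tgt}$ initialized using mBERT or XLM-R; warm-up training steps $N_w$ and reinforced training steps $N_r$.
Output: Distilled student model $\mathcal{M}^{(k)}_{tgt}$ in iteration $k$.
for $j = 1$ to $N_w$ do
    Sample a random batch $\mathcal{B}$ from $\mathcal{D}^{T}_{unlabel}$.
    Update the target model with $\mathcal{B}$ according to Equation 1.
end for
for $j = 1$ to $N_r$ do
    Sample a random batch $\mathcal{B}$ from $\mathcal{D}^{T}_{unlabel}$.
    Obtain states $\mathcal{S}$ for instances in $\mathcal{B}$ with $\mathcal{M}^{(k-1)}_{tgt}$ and $\mathcal{M}^{(k)}_{tgt}$.
    Sample a batch of actions $\mathcal{A}$ based on the probabilities estimated by $\pi_{\Phi}$.
    Obtain the selected training batch $\hat{\mathcal{B}}$ according to $\mathcal{A}$.
    Update the target model with $\hat{\mathcal{B}}$ according to Equation 2.
    Utilize the training loss of the current step and the previous step to obtain the delayed reward $r_j$ according to Equation 3.
    Update the policy model with $\mathcal{S}$, $\mathcal{A}$, and $r_j$ according to Equation 4.
end for
Algorithm 2: Algorithm for Reinforced Instance Selection.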
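The policy update at the end of each reinforced step can be illustrated with a REINFORCE-style gradient step. To keep the sketch short, a single-layer logistic policy is swapped in for the two-layer network described in the paper, so the gradient of the log-probability has the closed form (a - p_keep) * s. One batch-level reward scales the summed gradient. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_update(weights, states, actions, reward, learning_rate=0.01):
    """One policy-gradient step for a logistic keep/drop policy.

    P(keep | s) = sigmoid(w . s); for this policy the gradient of
    log pi(a | s) with respect to w is (a - P(keep | s)) * s.
    """
    gradient = [0.0] * len(weights)
    for state, action in zip(states, actions):
        p_keep = sigmoid(sum(w * x for w, x in zip(weights, state)))
        for j, x in enumerate(state):
            gradient[j] += (action - p_keep) * x
    # Gradient ascent on the expected reward, scaled by the shared batch reward.
    return [w + learning_rate * reward * g for w, g in zip(weights, gradient)]


weights = [0.0, 0.0]
states = [[1.0, 0.0], [0.0, 1.0]]     # cached states of a toy batch
actions = [1, 0]                      # keep the first instance, drop the second
new_weights = reinforce_update(weights, states, actions, reward=1.0, learning_rate=0.1)
print(new_weights)
```

With a positive reward, the update makes the sampled actions more likely for similar states; with a negative reward, it makes them less likely.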

4. An Empirical Study

In this section, we report a systematic empirical study using three well-accepted benchmark data sets and compare our proposed methods with a series of state-of-the-art methods.

4.1. Datasets

We use the datasets from the CoNLL 2002 and 2003 NER shared tasks (Sang, 2002; Sang and Meulder, 2003) with 4 distinct languages (Spanish, Dutch, English, and German). Moreover, to evaluate the generalization and scalability of our proposed framework, we select three non-western languages (Arabic, Hindi, and Chinese) from another multilingual NER dataset: WikiAnn (et al., 2017d), partitioned by Rahimi et al. (et al., 2019a). Each dataset is split into training, development, and test sets. The statistics of those datasets are shown in Table 3. All of those datasets are annotated with 4 types of entities, namely PER, LOC, ORG and MISC, in the BIO tagging schema following Wu et al. (et al., 2020k) and Wu and Dredze (Wu and Dredze, 2019). The words are tokenized using WordPiece (et al., 2016b) and, following Wu et al. (et al., 2020k) and Wu and Dredze (Wu and Dredze, 2019), we only tag the first sub-word if a word is split.
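The "tag only the first sub-word" convention can be sketched as follows. The tokenizer below is a toy stand-in that splits long words into fixed-size pieces; it is not WordPiece. The placeholder label "X" for non-initial pieces is also an assumption made for this illustration.

```python
def align_labels_to_subwords(words, labels, tokenize, ignore_label="X"):
    """Give each word's label to its first sub-word; mark the rest as ignored."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub_words = tokenize(word)
        pieces.extend(sub_words)
        piece_labels.extend([label] + [ignore_label] * (len(sub_words) - 1))
    return pieces, piece_labels


def toy_tokenize(word, size=4):
    """Hypothetical stand-in for WordPiece: cut words into chunks of `size` chars."""
    if len(word) <= size:
        return [word]
    rest = ["##" + word[i:i + size] for i in range(size, len(word), size)]
    return [word[:size]] + rest


words = ["John", "Doe", "wrote"]
labels = ["B-PER", "I-PER", "O"]
pieces, piece_labels = align_labels_to_subwords(words, labels, toy_tokenize)
print(list(zip(pieces, piece_labels)))
```

Positions carrying the placeholder label would be excluded from both the training loss and the evaluation.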

For both CoNLL and WikiAnn, we use English as the source language and the others as target languages. The pre-trained multilingual language models are fine-tuned using the annotated English training data. As for target languages, we remove the entity labels in the corresponding training data and adopt them as the unlabeled target-language instances. Note that to follow the zero-shot setting, we use the English development set to select the best checkpoints and evaluate them directly on the target language test sets.

Table 3. Statistics of the datasets: (a) CoNLL; (b) WikiAnn.

4.2. Implementation Details

We leverage the PyTorch version of cased multilingual BERT-base and XLM-R-base in HuggingFace's Transformers as the basic encoders for all variants. Each of the two models has 12 Transformer layers, 12 self-attention heads, and 768 hidden units. The hidden sizes of the policy network and the label embedding vector are set to fixed values. We set the batch size to 64 and train each model for 5 epochs with a linear scheduler. The parameters of the embedding layer and the bottom three layers are fixed, following Wu and Dredze (Wu and Dredze, 2019). We use the AdamW optimizer (Loshchilov and Hutter, 2017) for all source and target models with a weight decay rate selected from {5e-3, 7.5e-3} and a learning rate chosen from {3e-5, 5e-5, 7e-5}. The policy network is optimized using stochastic gradient descent with a learning rate of 0.01. The number of warm-up steps is selected from {250, 500}. For evaluation, we use entity-level F1 score as the metric. The target models are evaluated on the development set every 100 steps and the checkpoints are saved based on the evaluation results. Training each round on 8 Tesla V100 GPUs takes 5 to 40 minutes, depending on the encoder and target language.
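Entity-level F1 counts a predicted entity as correct only when both its span and its type match a gold entity exactly. The sketch below illustrates this for a single sentence. It is a simplification under stated assumptions: stray I- tags with no open span are ignored, and a real evaluation would aggregate counts over the whole test set (for example with the standard CoNLL evaluation script). Function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end_exclusive, type) spans in a BIO tag sequence."""
    entities = set()
    start, entity_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if entity_type is not None and not continues:
            entities.add((start, i, entity_type))    # close the open span
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]          # open a new span
    return entities


def entity_f1(gold_tags, predicted_tags):
    """Exact-match entity-level F1 for one sentence."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
predicted = ["B-PER", "I-PER", "O", "B-ORG", "O"]    # ORG span is cut short
score = entity_f1(gold, predicted)
print(score)
```

In the example, the truncated ORG span counts as both a false positive and a false negative, which is why partial matches are penalized more heavily than under token-level accuracy.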

4.3. Baseline Models

We compare our method with the following baselines. (Täckström, 2012) is trained using cross-lingual word cluster features. (et al., 2016a, 2017b) leverage an extra knowledge base or word alignment tools to annotate training data. (et al., 2017c, 2018a) generate target-language training data with machine translation. (Wu and Dredze, 2019; Moon, 2019) are trained using monolingual data and perform inference directly in the target language. (et al., 2020j) leverages a meta-learning based method that benefits from similar instances. (et al., 2020k) explores the teacher-student paradigm to directly distill knowledge from a source language to a target language using the unlabeled data in the target language. (Wu and Dredze, 2020) proposes a contrastive alignment objective for multilingual encoders and outperforms previous word-level alignments. (et al., 2020f) introduces language, task, and invertible adapters to give pre-trained multilingual models high portability and efficient transfer capability. We test statistical significance using a t-test.


4.4. Major Results

Table 4 shows the results on cross-lingual NER of our method and the baseline methods. The first block compares our model (on top of mBERT) with the SOTA mBERT-based approaches and the approaches based on non-pretrained models. The second block compares our method (on top of XLM-R) with the SOTA XLM-R based models. The third block of Table 4 (a) lists work that uses training data from multiple source languages. Our proposed method significantly outperforms the baseline methods and achieves new state-of-the-art performance, which demonstrates the effectiveness of the proposed cross-lingual NER framework.

(a) Results on CoNLL.

(b) Results on WikiAnn.

Table 4. The F1 scores of our method and the baseline models. Notes: the table marks results reported with the bottom three layers of the language model fixed, statistically significant improvements over Wu et al. (2020k), and approaches that use training data from multiple source languages.
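The F1 scores reported here are entity-level, meaning that a predicted entity counts as correct only when both its span and its type match the gold annotation. The following is a minimal sketch of such a metric. It assumes BIO-encoded tags and exact-match scoring, and it ignores stray `I-` tags that do not continue an open entity; these choices and the function names are illustrative assumptions, not the evaluation script used in the paper.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))  # end index is exclusive
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged entity-level F1 over a list of tagged sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "B-ORG"]]  # right span, wrong type for the last entity
    print(entity_f1(gold, pred))
```

Under this definition a boundary error or a type error each costs both precision and recall, which is why entity-level F1 is stricter than token-level accuracy.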

The models based on pre-trained contextualized embeddings (Moon, 2019; Wu et al., 2020j, 2020k) outperform models learned from scratch by a large margin. Our mBERT and XLM-R based methods further achieve average gains of 1.57 and 1.61 percentage points in F1, respectively, over the strongest baseline (Wu et al., 2020k) on CoNLL. On the non-western languages in WikiAnn, RIKD shows consistent improvements over Wu et al. (2020k) and over non-distillation methods that use additional layers (Pfeiffer et al., 2020f) or external resources (Wu and Dredze, 2020). (Note that Wu and Dredze (2020) re-tokenize the Chinese dataset and obtain relatively high results.) While the pre-trained methods directly adopt models trained on the source language, our proposed framework lets the model learn target-language information from unlabeled text and thus achieves even better performance.

In particular, Wu et al. (2020k) represents the previous state-of-the-art performance. It also employs a teacher-student framework to distill knowledge from a teacher model in the source language to the target model. However, it transfers knowledge directly from the source model and neglects the noise in it. Our proposed reinforced framework selectively transfers knowledge to reduce noise from the source language and fits the target languages better.

We further compare our method with the state-of-the-art multi-source cross-lingual NER approaches (Täckström, 2012; Moon, 2019; Wu et al., 2020k), where human-annotated data in multiple source languages is assumed to be available. Although it uses only labeled data in English and unlabeled data in the target languages, our proposed method achieves the best average performance compared to the SOTA multi-source methods in our experiments. This further verifies that iterative knowledge distillation with a reinforced instance selector is effective for the zero-shot cross-lingual NER task.

From the application point of view, our method is more convenient and requires less data, especially in scenarios where multilingual training data is not available. Our proposed method also has great potential in multi-source settings and other cross-lingual tasks, which are left for future work.

4.5. Further Analysis

4.5.1. Ablation Study

We conduct experiments on different variants of the proposed framework to investigate the contributions of different components. Table 5 presents the results of removing one component at a time.

Variant    es     nl     de     Average
RIKDFull   79.46  81.40  78.40  79.75
RL         79.21  81.21  76.65  79.02 (0.73 ↓)
IKD        78.90  81.02  74.86  78.26 (1.49 ↓)
RL&IKD     78.77  80.99  74.67  78.14 (1.60 ↓)
KD         76.79  79.88  70.64  75.77 (3.97 ↓)

Table 5. F1 scores for the ablation of the reinforced selector and iterative knowledge distillation. Each row other than RIKDFull is named after the component it removes, and the value in parentheses is the drop in average F1 relative to RIKDFull. RIKDFull is the proposed framework with iterative knowledge distillation and reinforced selective transfer. RL removes reinforced knowledge distillation from the full method. IKD removes iterative knowledge distillation from the full method. RL&IKD removes both. KD removes knowledge distillation from the full method and directly conducts inference using the source model.

In general, all of the proposed techniques contribute to the cross-lingual setting. The full model consistently achieves the best performance on all languages in the experiment. “KD” is trained using the annotated data in the source language and directly infers in the target languages; it suffers a decrease of 3.97 F1 points on average. Both the single-step and the iterative knowledge distillation settings outperform the direct model transfer approach, indicating that knowledge distillation is an effective way to transfer knowledge across languages. “IKD” leads to a gain of 0.11 F1 points on average compared to “RL&IKD”, which directly performs cross-lingual knowledge distillation. Since the IKD step does not introduce extra supervision signals, we believe that the gain may come from the fact that our iterative distillation framework benefits from better teacher models than the single-round method.

The RL-based instance selection module contributes to both the single-step and the iterative knowledge distillation frameworks. Directly forcing the student model in the target language to imitate all behaviors of the teacher model may introduce source-language-specific bias (Liu, 2020) as well as the teacher model's inference errors. The intuition behind our instance selector is that selective distillation from the most informative cases can lead to better knowledge transfer performance on the target languages. The experimental results verify that the RL-based selector enhances knowledge distillation by removing noise from the teacher model's predictions.

4.5.2. Effect of Multiple Iterations in RIKD

We further conduct experiments to study the effect of the number of iterations in the RIKD training framework. Five rounds of iterations are conducted. In each iteration, the RL-based instance selector is used to select instances for KD. The results are shown in Figure 2. Taking German as an example, RIKD improves over the previous round by , , in the 1st, 2nd, and 3rd rounds, respectively. This further demonstrates the effectiveness of our proposed iterative training approach: it reduces noise in the supervision signals step by step under the guidance of reinforcement learning. The figure also shows that the model performance stabilizes after three rounds.

Figure 2. Effect of different iterations in RIKD (XLM-R) on CoNLL.
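The round structure studied here can be sketched as a loop in which the current teacher labels unlabeled target-language text, a selector filters the pseudo-labeled instances, and a student is trained on what remains. The sketch below shows only this control flow. The lookup-table "models", the fixed confidence rule standing in for the learned RL selector, and the choice to reuse each student as the next teacher are all simplifying assumptions for illustration and not the paper's implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sentence = Sequence[str]
SoftLabels = List[List[float]]  # one label distribution per token
Model = Callable[[Sentence], SoftLabels]
Example = Tuple[Sentence, SoftLabels]


def iterative_distillation(source_model, unlabeled, select, train_student, rounds=3):
    """Run `rounds` teacher-student rounds and return the student of each round."""
    teacher = source_model
    students = []
    for _ in range(rounds):
        # 1. The current teacher labels the unlabeled target-language text.
        pseudo = [(s, teacher(s)) for s in unlabeled]
        # 2. An instance selector keeps only some of the pseudo-labeled instances.
        kept = [(s, y) for s, y in pseudo if select(s, y)]
        # 3. A student is trained on the kept soft labels.
        student = train_student(kept)
        students.append(student)
        # 4. Assumption: the new student serves as the teacher of the next round.
        teacher = student
    return students


# --- Toy stand-ins (hypothetical) so that the loop can be executed ---
def lookup_model(table: Dict[str, List[float]], num_labels: int = 2) -> Model:
    uniform = [1.0 / num_labels] * num_labels
    return lambda sentence: [table.get(tok, uniform) for tok in sentence]


def train_lookup_student(examples: List[Example]) -> Model:
    """'Training' here just averages the soft labels seen for each token."""
    sums, counts = {}, {}
    for sentence, soft in examples:
        for tok, dist in zip(sentence, soft):
            acc = sums.setdefault(tok, [0.0] * len(dist))
            for k, p in enumerate(dist):
                acc[k] += p
            counts[tok] = counts.get(tok, 0) + 1
    return lookup_model({t: [v / counts[t] for v in acc] for t, acc in sums.items()})


def confident(sentence, soft, threshold=0.6):
    """Fixed rule used in place of the learned selector."""
    return sum(max(d) for d in soft) / len(soft) > threshold


if __name__ == "__main__":
    source = lookup_model({"Paris": [0.1, 0.9], "is": [0.9, 0.1], "nice": [0.8, 0.2]})
    corpus = [["Paris", "is", "nice"], ["xyz"]]
    students = iterative_distillation(source, corpus, confident, train_lookup_student)
    print(len(students), students[-1](["Paris", "xyz"]))
```

With real neural students, each round would also re-train the selector; the toy models here cannot improve between rounds and only make the loop executable.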

4.5.3. Data Selection Strategy

We study the effect of reinforced training instance selection by comparing it with two other selection strategies: Confidence Selection and Agreement Selection.

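A straightforward method for training data selection is to keep those cases that the teacher model predicts with high confidence. Specifically, for an instance $x$ of $n$ tokens $x_1, \dots, x_n$, we take, for each token, the entity label with the highest probability predicted by the teacher model as its entity label. The confidence of $x$ is the average of these token-level label probabilities. That is,

$$\mathrm{conf}(x) = \frac{1}{n} \sum_{i=1}^{n} \max_{c} \, p^{\mathrm{tea}}(c \mid x_i),$$

where $p^{\mathrm{tea}}(c \mid x_i)$ is the probability that the teacher model assigns to entity label $c$ for token $x_i$. Each instance $x$ with $\mathrm{conf}(x)$ over a predefined threshold $\delta_{\mathrm{conf}}$ is selected; otherwise, it is discarded.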
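Confidence Selection as described above can be sketched in a few lines. The example assumes that each instance is represented by the teacher's per-token label distributions; the function names and the threshold value are illustrative assumptions.

```python
def confidence(token_probs):
    """Average, over tokens, of the highest label probability given by the teacher."""
    if not token_probs:
        raise ValueError("instance has no tokens")
    return sum(max(dist) for dist in token_probs) / len(token_probs)


def select_by_confidence(instances, threshold):
    """Keep the instances whose confidence exceeds the threshold."""
    return [probs for probs in instances if confidence(probs) > threshold]


if __name__ == "__main__":
    sure = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]      # teacher is confident on both tokens
    unsure = [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]    # teacher is close to guessing
    kept = select_by_confidence([sure, unsure], threshold=0.7)  # hypothetical threshold
    print(len(kept), "of 2 instances kept")
```

Because the score is an average, a single uncertain token in a long sentence lowers the confidence only slightly, so the threshold mainly filters sentences that are uncertain throughout.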

Another straightforward way to select data is based on the agreement between the predictions of the teacher model and those of the student model. Specifically, the agreement score of each instance $x$ is defined as the negative mean squared error between the output probability distributions of the source (teacher) and target (student) models, that is,

$$\mathrm{agr}(x) = -\,\mathrm{MSE}\big(p^{\mathrm{tea}}(\cdot \mid x),\; p^{\mathrm{stu}}(\cdot \mid x)\big).$$

Each instance $x$ whose $\mathrm{agr}(x)$ passes a predefined threshold $\delta_{\mathrm{agr}}$ is selected; otherwise, it is discarded.
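Agreement Selection can be sketched in the same style. The example assumes that the mean squared error is averaged over all tokens and all labels, which is one possible normalization and not necessarily the one used in the paper; the function names and the threshold are likewise illustrative.

```python
def agreement(teacher_probs, student_probs):
    """Negative mean squared error between two models' per-token label distributions."""
    squared_error, count = 0.0, 0
    for t_dist, s_dist in zip(teacher_probs, student_probs):
        for t, s in zip(t_dist, s_dist):
            squared_error += (t - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("instance has no tokens")
    return -(squared_error / count)


def select_by_agreement(instances, threshold):
    """instances: list of (teacher_probs, student_probs); keep those at or above threshold."""
    return [pair for pair in instances if agreement(*pair) >= threshold]


if __name__ == "__main__":
    agree = ([[0.9, 0.1]], [[0.9, 0.1]])      # identical predictions, score 0
    disagree = ([[1.0, 0.0]], [[0.5, 0.5]])   # score -0.25
    kept = select_by_agreement([agree, disagree], threshold=-0.1)  # hypothetical threshold
    print(len(kept), "of 2 instances kept")
```

The score is at most zero, so the threshold is a negative number and instances closer to zero are those on which the two models agree more.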

For a fair comparison, we set the two thresholds so that each strategy selects a number of training instances per batch comparable to that of our RIKD method. To this end, we record the number of instances discarded under our reinforcement learning strategy and obtain the average discard ratio for each target language in iteration 1, as shown in Table 6. Accordingly, we compare against the performance of RIKD in iteration 1 only.

The results in Table 7 show that the reinforced selection method achieves the best results, which verifies that RIKD can select more informative cases for better knowledge transfer.

                    es     nl     de     ar     hi     zh
Discard ratio (%)   38.8   56.7   52.4   60.8   23.5   57.2

Table 6. Average discard ratio of our approach. We first calculate the percentage of discarded cases in each batch and then average over all training batches in one iteration.
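The averaging described in the caption of Table 6 is a mean of per-batch percentages, which can differ from the pooled ratio when batch sizes vary. A minimal sketch, assuming each batch is summarized by a hypothetical `(num_discarded, batch_size)` pair with made-up counts:

```python
def average_discard_ratio(batches):
    """Mean of per-batch discard percentages over one iteration."""
    if not batches:
        raise ValueError("no batches recorded")
    ratios = [100.0 * discarded / size for discarded, size in batches]
    return sum(ratios) / len(ratios)


def pooled_discard_ratio(batches):
    """Discard percentage over all instances, shown for comparison only."""
    total_discarded = sum(d for d, _ in batches)
    total_size = sum(s for _, s in batches)
    return 100.0 * total_discarded / total_size


if __name__ == "__main__":
    batches = [(32, 64), (16, 64), (8, 16)]  # made-up counts; last batch is smaller
    print(average_discard_ratio(batches), pooled_discard_ratio(batches))
```

The two quantities coincide when all batches have the same size, so the distinction matters mainly for a smaller final batch.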
Strategy               es     nl     de     ar     hi     zh     Average
RIKD                   78.90  81.02  74.86  52.38  72.88  32.14  65.36
Agreement Selection    78.47  77.64  72.18  51.71  72.02  32.86  64.15 (1.21 ↓)
Confidence Selection   78.61  78.54  74.45  52.37  71.99  32.15  64.69 (0.67 ↓)

Table 7. Analysis of different selection strategies. Values in parentheses are the drop in average F1 relative to RIKD.

4.5.4. State Vector Study

This section investigates the effect of the different features in the proposed state vector. Table 8 shows the results when we remove each feature vector from the state vector introduced in Section 3.3. The results show that ablating the features degrades performance to different extents. Among them , and show the biggest contributions. One explanation is that these features represent not only the semantic meaning of the input training instance but also how noisy the instance (in the target language) is. To be specific, encodes both the internal representation and the predicted labels of the training instance. denotes the prediction confidence of the source model. denotes the agreement degree (through the MSE metric) between the predictions of the source and target models. If the agreement is low, the training instance may be noisy.

Variant   ar     hi     zh     Average
RIKD      52.38  72.88  32.14  52.47
-         49.98  71.95  30.92  50.95 (2.52 ↓)
-         50.08  71.78  30.31  50.72 (1.75 ↓)
-         51.24  71.77  30.49  51.17 (1.30 ↓)
-         49.20  71.68  28.05  49.64 (2.83 ↓)
-         48.14  72.37  29.77  50.09 (2.38 ↓)

Table 8. Analysis of different features in the state vector from Section 3.3. Each row marked “-” removes one feature from the state vector.

4.5.5. Application to Industry Scenarios

We further apply RIKD to a production scenario from the Microsoft Bing search engine to illustrate its practical effectiveness. The dataset consists of a 50k human-annotated en training set and multilingual test sets. Large-scale unlabeled sentences are collected from web documents for distillation: 70.9k, 30.1k, and 55.3k for es, hi, and zh, respectively. We use the corresponding 5.6k, 3.2k, and 11k test data from the dataset for evaluation. As shown in Table 9, RIKD outperforms Wu et al. (2020k) by 3.07 F1 points on average, which demonstrates the effectiveness and generalization ability of our proposed method.

Method              es     hi     zh     Average
Wu et al. (2020k)   77.94  65.61  31.17  58.24
RIKD                79.59  68.29  36.06  61.31 (3.07 ↑)

Table 9. Results on the industry scenario.

5. Conclusion and Industry Impact

In this paper, we propose a reinforced knowledge distillation framework for cross-lingual named entity recognition (NER). The proposed method iteratively distills transferable knowledge from the model in the source language and performs language adaptation using only unlabeled data in the target language. We further introduce a reinforced instance selector that helps to selectively transfer useful knowledge from a teacher model to a student model. We report a series of experiments on several widely used benchmark datasets. The results verify that the proposed framework outperforms the existing methods and establishes new state-of-the-art performance on the cross-lingual NER task.

Moreover, RIKD is on its way to being deployed as an underlying technique in the Microsoft Bing search engine, where it will serve many core modules such as Web ranking, Entity Pane, and Answers Triggering. Our framework will also be adopted to improve User Intent Recognition and Slot Filling in a commercial voice assistant.

Shining Liang’s research is supported by the National Natural Science Foundation of China (61976103, 61872161), the Scientific and Technological Development Program of Jilin Province (20190302029GX, 20180101330JC, 20180101328JC), and the Development and Reform Commission Program of Jilin Province (2019C053-8). Jian Pei’s research is supported in part by the NSERC Discovery Grant program. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.


  • A. Akbik (2019) Pooled contextualized embeddings for named entity recognition. In NAACL-HLT, pp. 724–728. Cited by: §1.
  • A. C. et al. (2020a) Unsupervised cross-lingual representation learning at scale. In ACL, pp. 8440–8451. Cited by: §1.
  • A. R. et al. (2019a) Massively multilingual transfer for ner. In ACL, pp. 151–164. Cited by: §4.1.
  • B. L. et al. (2019b) A user-centered concept mining system for query and document understanding at tencent. In KDD, pp. 1831–1841. Cited by: §1.
  • C. L. et al. (2020b) Bond: bert-assisted open-domain named entity recognition with distant supervision. In KDD, pp. 1054–1064. Cited by: §1.
  • C. T. et al. (2016a) Cross-lingual named entity recognition via wikification. In CoNLL, pp. 219–228. Cited by: §4.3, Table 4.
  • D. W. et al. (2017a) A multi-task learning approach to adapting bilingual word embeddings for cross-lingual named entity recognition. In IJCNLP, pp. 383–388. Cited by: §2.
  • F. Y. et al. (2020c) Enhancing answer boundary detection for multilingual machine reading comprehension. In ACL, Cited by: §1.
  • F. Y. et al. (2020d) Reinforced multi-teacher selection for knowledge distillation. arXiv preprint arXiv:2012.06048. Cited by: §3.3.3.
  • G. H. et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2, §3.1.
  • H. H. et al. (2019c) Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks. In EMNLP-IJCNLP, pp. 2485–2494. Cited by: §1.
  • H. L. et al. (2020e) MTOP: a comprehensive multilingual task-oriented semantic parsing benchmark. arXiv preprint arXiv:2008.09335. Cited by: §1, §1, §1.
  • J. D. et al. (2019d) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186. Cited by: §1.
  • J. N. et al. (2017b) Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In ACL, pp. 1470–1480. Cited by: §1, §2, §4.3, Table 4.
  • J. P. et al. (2020f) MAD-X: an adapter-based framework for multi-task cross-lingual transfer. In EMNLP, pp. 7654–7673. Cited by: §2, §4.3, §4.4, Table 4.
  • J. X. et al. (2018a) Neural cross-lingual named entity recognition with minimal resources. In EMNLP, pp. 369–379. Cited by: §1, §2, §4.3, Table 4.
  • L. S. et al. (2020g) Mining implicit relevance feedback from user behavior for web question answering. Cited by: §1.
  • L. W. et al. (2018b) A study of reinforcement learning for neural machine translation. In EMNLP, pp. 3612–3621. Cited by: §2.
  • M. A. et al. (2020h) Translation artifacts in cross-lingual transfer learning. In EMNLP, pp. 7674–7684. Cited by: §1.
  • M. B. et al. (2020i) Zero-resource cross-lingual named entity recognition. In AAAI, pp. 7415–7423. Cited by: §1.
  • Q. W. et al. (2020j) Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. In AAAI, pp. 9274–9281. Cited by: §4.3, §4.4, Table 4.
  • Q. W. et al. (2020k) Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In ACL, pp. 6505–6514. Cited by: §1, §2, §4.1, §4.3, §4.4, §4.4, §4.4, §4.5.5, Table 4, Table 9.
  • Q. X. et al. (2020l) Self-training with noisy student improves imagenet classification. In CVPR, pp. 10687–10698. Cited by: §3.2.
  • R. S. et al. (1999) Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057–1063. Cited by: §3.3.4.
  • S. M. et al. (2017c) Cheap translation for cross-lingual named entity recognition. In EMNLP, pp. 2536–2545. Cited by: §1, §2, §4.3, Table 4.
  • W. B. et al. (2019e) A minimax game for instance based selective transfer learning. In KDD, pp. 34–43. Cited by: §2, §3.3.3.
  • X. D. et al. (2020m) Multi-modal information extraction from text, semi-structured, and tabular data on the web. In KDD, pp. 3543–3544. Cited by: §1.
  • X. L. et al. (2020n) Self-supervised learning: generative or contrastive. arXiv preprint arXiv:2006.08218. Cited by: §3.2.
  • X. P. et al. (2017d) Cross-lingual name tagging and linking for 282 languages. In ACL, pp. 1946–1958. Cited by: §4.1.
  • Y. W. et al. (2016b) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §4.1.
  • Z. Y. et al. (2017e) Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR, Cited by: §2.
  • Z. Y. et al. (2020o) Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In WSDM, pp. 690–698. Cited by: §2.
  • A. V. González-Garduño (2019) Reinforcement learning for improved low resource dialogue generation. In AAAI, pp. 9884–9885. Cited by: §2.
  • Z. Liu (2020) Do we need word order information for cross-lingual sequence labeling. arXiv preprint arXiv:2001.11164. Cited by: §4.5.1.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in adam. CoRR abs/1711.05101. Cited by: §4.2.
  • T. Moon (2019) Towards lingua franca named entity recognition with bert. arXiv preprint arXiv:1912.01389. Cited by: Table 1, §1, §4.3, §4.4, §4.4, Table 4.
  • D. Nadeau and S. Sekine (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (1), pp. 3–26. Cited by: §1.
  • E. Sang and F. D. Meulder (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. In CoNLL, pp. 142–147. Cited by: §4.1.
  • E. Sang (2002) Introduction to the conll-2002 shared task: language-independent named entity recognition. In CoNLL, Cited by: §4.1.
  • R.S. Sutton and A.G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.
  • O. Täckström (2012) Nudging the envelope of direct transfer methods for multilingual named entity recognition. In NAACL-HLT Workshop on the Induction of Linguistic Structure, pp. 55–63. Cited by: §4.3, §4.4, Table 4.
  • G. Tur and R. D. Mori (Eds.) (2011) Spoken language understanding: systems for extracting semantic information from speech. Cited by: §1.
  • S. Wu and M. Dredze (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In EMNLP-IJCNLP, pp. 833–844. Cited by: §1, §1, §2, §3.3.1, §4.1, §4.2, §4.3, Table 4.
  • S. Wu and M. Dredze (2020) Do explicit alignments robustly improve multilingual encoders?. In EMNLP, pp. 4471–4482. Cited by: §2, §4.3, §4.4, Table 4.
  • R. Xu and Y. Yang (2017) Cross-lingual distillation for text classification. In ACL, pp. 1415–1425. Cited by: §2.