
TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

Deep and large pre-trained language models are the state-of-the-art for various natural language processing tasks. However, the huge size of these models could be a deterrent to using them in practice. Some recent and concurrent works use knowledge distillation to compress these huge models into shallow ones. In this work we study knowledge distillation with a focus on multi-lingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme leveraging teacher internal representations that is agnostic of teacher architecture, and show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors like the amount of unlabeled data, annotation resources, model architecture and inference latency, to name a few. We show that our approach leads to massive compression of MBERT-like teacher models by up to 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95% of its performance for massive multi-lingual NER.





1 Introduction

Motivation: Pre-trained deep language models have shown state-of-the-art performance for various natural language processing applications like text classification, named entity recognition, question-answering, etc. A significant challenge facing many practitioners is how to deploy these huge models in practice. For instance, BERT Large and GPT-2 contain 340 million and 1.5 billion model parameters, respectively. Although these models are trained offline, during prediction we still need to traverse the deep neural network architecture stack involving a large number of parameters. This significantly increases latency and memory requirements.

Knowledge distillation (DBLP:journals/corr/HintonVD15; DBLP:conf/nips/BaC14), originally developed for computer vision applications, provides one of the techniques to compress huge neural networks into smaller ones. In this approach, shallow models (called students) are trained to mimic the output of huge models (called teachers) based on a transfer set. Similar approaches have recently been adopted for language model distillation.

Limitations of existing work: Recent works (DBLP:journals/corr/abs-1904-09482; zhu-etal-2019-panlp; DBLP:journals/corr/abs-1903-12136; turc2019wellread) leverage only the soft output (logits) from the teacher as optimization targets for distilling student models, with some notable exceptions from concurrent work. sun2019patient; sanh2019; aguilar2019knowledge; zhao2019extreme additionally use internal representations from the teacher to provide useful hints for distilling better students. However, these methods are constrained by the teacher architecture, such as the embedding dimension in BERT and transformer architectures. This makes it difficult to massively compress these models (without being able to reduce the network width) or to adopt alternate architectures. For instance, we observe BiLSTMs as students to be more accurate than Transformers for low-latency configurations. Some of the concurrent works (turc2019wellread; zhao2019extreme) adopt pre-training or dual training to distil student models of arbitrary architecture. However, pre-training is expensive both in terms of time and computational resources.

Additionally, most of the above works are geared for distilling language models for GLUE tasks. There has been very limited exploration of such techniques for NER (izsak2019training; Shi_2019) or multi-lingual tasks (Tsai_2019). Moreover, these works also suffer from the same drawbacks as mentioned before.

Overview of our method: In this work, we compare distillation strategies used in all the above works and propose a new scheme outperforming prior ones. In this, we leverage teacher internal representations to transfer knowledge to the student. However, in contrast to prior work, we are not restricted by the choice of student architecture. This allows representation transfer from Transformer-based teacher model to BiLSTM-based student model with different embedding dimensions and disparate output spaces. We also propose a stage-wise optimization scheme to sequentially transfer most general to task-specific information from teacher to student for better distillation.

Overview of our task: Unlike prior works mostly focusing on GLUE tasks in a single language, we employ our techniques to study distillation for massive multi-lingual Named Entity Recognition (NER) over 41 languages. Prior work on multi-lingual transfer on the same dataset (rahimi-etal-2019-massively) (MMNER) requires knowledge of the source and target language, whereby they judiciously select pairs for effective transfer, resulting in a customized model for each language. In our work, instead, we adopt Multi-lingual Bidirectional Encoder Representations from Transformers (MBERT) as our teacher and show that it is possible to perform language-agnostic joint NER for all languages with a single model that has similar performance but is massively compressed in contrast to MBERT and MMNER.

Perhaps the closest work to ours is that of (Tsai_2019), where MBERT is leveraged for multi-lingual NER. We discuss this in detail and use their strategy as one of our baselines. We show that our distillation strategy is better, leading to much higher compression and faster inference. We also investigate several unexplored dimensions of distillation like the impact of unlabeled transfer data and annotation resources, the choice of multi-lingual word embeddings, architectural variations and inference latency, to name a few.

Our techniques obtain massive compression of MBERT-like teacher models, by up to 35x in terms of parameters and 51x in terms of latency for batch inference, while retaining 95% of its performance for massive multi-lingual NER, and matching or outperforming it for classification tasks. Overall, our work makes the following contributions:

Method: We propose a distillation method leveraging internal representations and parameter projection that is agnostic of teacher architecture.

Inference: To learn model parameters, we propose a stage-wise optimization schedule with gradual unfreezing that outperforms prior schemes.

Experiments: We perform distillation for multi-lingual NER on 41 languages with massive compression and comparable performance to huge models (we will release code and distilled model checkpoints). We also perform classification experiments on four datasets where our compressed models perform at par with huge teachers.

Study: We study the influence of several factors on distillation like the availability of annotation resources for different languages, model architecture, quality of multi-lingual word embeddings, memory footprint and inference latency.

Problem Statement: Consider a sequence $x = \langle x_1 \cdots x_T \rangle$ with $T$ tokens and $y = \langle y_1 \cdots y_T \rangle$ as the corresponding labels. Consider $D_l = \{x_l, y_l\}$ to be a set of $n$ labeled instances, with $x_l$ denoting the instances and $y_l$ the corresponding labels. Consider $D_u = \{x_u\}$ to be a transfer set of $N$ unlabeled instances from the same domain, where $N \gg n$. Given a teacher $\mathcal{T}(\theta^t)$, we want to train a student $\mathcal{S}(\theta^s)$, with $\theta^s$ being trainable parameters, such that $|\theta^s| \ll |\theta^t|$ and the student is comparable in performance to the teacher based on some evaluation metric. In the following sections, the superscript 't' always represents the teacher and 's' denotes the student.

2 Models

The Student: The input to the model are $E$-dimensional word embeddings for each token. In order to capture sequential information in the tokens, we use a single-layer Bidirectional Long Short-Term Memory network (BiLSTM). Given a sequence of $T$ tokens, a BiLSTM computes a set of $T$ vectors $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ as the concatenation of the states generated by a forward and a backward LSTM. Assuming the number of hidden units in each LSTM to be $H$, each hidden state $h_t$ is of dimension $2H$. The probability of label $c$ at timestep $t$ is given by:

$$p(y_t = c \mid h_t) = \mathrm{softmax}(W \cdot h_t + b)$$

where $W \in \mathbb{R}^{C \times 2H}$, $b \in \mathbb{R}^C$, and $C$ is the number of labels.
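The prediction head above can be sketched as follows. This is a minimal illustrative example, not the authors' code; the dimensions and names are made up for demonstration.

```python
import numpy as np

# Sketch of the student's prediction head: a BiLSTM hidden state h_t of
# dimension 2H is mapped to label probabilities with a softmax layer.
H, C = 4, 11                      # hidden units per direction, number of NER tags
rng = np.random.default_rng(0)

W = rng.normal(size=(C, 2 * H))   # trainable projection, C x 2H
b = np.zeros(C)                   # trainable bias

def label_probs(h_t: np.ndarray) -> np.ndarray:
    """p(y_t = c | h_t) = softmax(W . h_t + b)."""
    scores = W @ h_t + b
    scores -= scores.max()        # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

h_t = rng.normal(size=2 * H)      # stand-in for a concatenated BiLSTM state
p = label_probs(h_t)              # probability over the C labels
```

In training, `W` and `b` would be updated by backpropagation together with the BiLSTM and embedding parameters.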

We train the student network end-to-end minimizing the cross-entropy loss over labeled data:

$$\mathcal{L}_{CE} = -\sum_{t} \sum_{c} \mathbb{1}[y_t = c] \log\, p(y_t = c \mid h_t)$$
The Teacher: Pre-trained language models like ELMO (DBLP:conf/naacl/PetersNIGCLZ18), BERT (DBLP:conf/naacl/DevlinCLT19) and GPT (radford2018improving; radford2019) have shown state-of-the-art performance for several tasks. We adopt BERT as the teacher – specifically, the multi-lingual version of BERT (MBERT), trained on the 104 languages with the largest Wikipedias. MBERT does not use any markers to distinguish languages during pre-training and learns a single language-agnostic model trained via masked language modeling over Wikipedia articles from all languages.

Tokenization: Similar to MBERT, we use WordPiece tokenization with a shared WordPiece vocabulary. We preserve casing, remove accents, and split on punctuation and whitespace.

Fine-tuning the Teacher: The pre-trained language models are trained for general language modeling objectives. In order to adapt them to the given task, the teacher is fine-tuned end-to-end with task-specific labeled data to learn the parameters $\theta^t$ using cross-entropy loss as in Equation 2.

3 Distillation Features

Fine-tuning the teacher gives us access to its task-specific representations for distilling the student model. To this end, we use different kinds of information from the teacher.

Teacher Logits: Logits, as logarithms of predicted probabilities, provide a better view of the teacher by emphasizing the different relationships learned by it across different instances. Consider $q(c \mid x_t)$ to be the classification probability of token $x_t$ as generated by the fine-tuned teacher, with $logit(c \mid x_t)$ representing the corresponding logits. Our objective is to train a student model with these logits as targets. Given the student's hidden state representation $h_t$ for token $x_t$, we can obtain the corresponding classification score (since the targets are logits) as:

$$r^s(x_t) = W^s \cdot h_t + b^s$$

where $W^s \in \mathbb{R}^{C \times 2H}$ and $b^s \in \mathbb{R}^C$ are trainable parameters and $C$ is the number of classes. We train the student network end-to-end by minimizing the element-wise mean-squared error between the classification scores given by the student and the target logits from the teacher:

$$\mathcal{L}_{LL} = \frac{1}{2} \sum_t \big\| r^s(x_t) - logit^t(\cdot \mid x_t) \big\|^2$$
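The logit-matching objective can be sketched in a few lines. This is a hedged illustration with toy numbers, not the released implementation; `logit_loss` is an illustrative name.

```python
import numpy as np

def logit_loss(student_scores: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Half the element-wise mean-squared error between the student's
    linear classification scores and the teacher's logits."""
    diff = student_scores - teacher_logits
    return 0.5 * float(np.mean(diff ** 2))

# Toy T x C matrices: T = 2 tokens, C = 3 classes.
teacher = np.array([[2.0, -1.0, 0.5],
                    [0.1,  3.0, -2.0]])   # logits from the fine-tuned teacher
student = np.array([[1.5, -0.5, 0.5],
                    [0.0,  2.5, -1.5]])   # student scores before training

loss = logit_loss(student, teacher)       # positive; 0 only on a perfect match
```

Since the targets are raw logits rather than probabilities, no softmax is applied to the student scores before computing the loss.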
3.1 Internal Teacher Representations

Hidden representations: Recent works (sun2019patient; DBLP:journals/corr/RomeroBKCGB14) have shown the hidden state information from the teacher to be helpful as hint-based guidance for the student. Given a large collection of task-specific unlabeled data, we can transfer the teacher's knowledge to the student via its hidden representations. However, this poses a challenge in our setting, as the teacher and student models have different architectures with disparate output spaces.

Consider $z^s(x_t)$ and $z^{t,l}(x_t)$ to be the representations generated by the student and the $l^{th}$ deep layer of the fine-tuned teacher, respectively, for a token $x_t$. Consider $D_u$ to be the set of unlabeled instances. We will later discuss the choice of the teacher layer $l$ and its impact on distillation.

Projection: To make all output spaces compatible, we perform a non-linear projection of the student representation to have the same shape as the teacher representation for each token $x_t$:

$$\tilde{z}^s(x_t) = \mathrm{Gelu}(W^f \cdot z^s(x_t) + b^f)$$

where $W^f \in \mathbb{R}^{|z^{t,l}| \times 2H}$ is the projection matrix, $b^f$ is the bias, and Gelu (Gaussian Error Linear Unit) (DBLP:journals/corr/HendrycksG16) is the non-linear projection function; $|z^{t,l}|$ represents the embedding dimension of the teacher. This transformation aligns the output spaces of the student and teacher and allows us to accommodate arbitrary student architectures. Also note that the projections (and therefore the parameters) are shared across tokens at different timesteps.

The projection parameters are learned by minimizing the KL-divergence (KLD) between the student and the $l^{th}$-layer teacher representations:

$$\mathcal{L}_{RL} = KLD\big(\tilde{z}^s(x_t),\ z^{t,l}(x_t)\big)$$
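The projection and divergence can be sketched as below. This is an illustrative toy version, not the authors' code: Gelu uses the common tanh approximation, and both representations are softmax-normalized before the KLD so that it is computed between valid distributions, which is a simplifying assumption on our part.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def kld(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors; clipped for stability."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
z_s = rng.normal(size=8)            # student BiLSTM state, toy 2H = 8
W_f = rng.normal(size=(12, 8))      # projection to a toy teacher dim of 12
b_f = np.zeros(12)
z_t = rng.normal(size=12)           # teacher layer-l representation

z_proj = gelu(W_f @ z_s + b_f)      # student state projected to teacher space
loss = kld(softmax(z_proj), softmax(z_t))
```

The loss is non-negative and reaches zero only when the projected student representation matches the teacher's, which is the training signal for $W^f$ and $b^f$.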
Multi-lingual word embeddings: A large number of parameters reside in the word embeddings. For MBERT, a shared multi-lingual WordPiece vocabulary of $V$ tokens and an embedding dimension of $D$ leads to $V \cdot D$ embedding parameters. To obtain massive compression, we cannot directly incorporate MBERT embeddings in our model. Since we use the same WordPiece vocabulary, we are likely to benefit more from these embeddings than from Glove (DBLP:conf/emnlp/PenningtonSM14) or FastText (bojanowski2016enriching).

We use a dimensionality reduction algorithm, Singular Value Decomposition (SVD), to project the MBERT word embeddings to a lower-dimensional space. Given the MBERT word embedding matrix of dimension $V \times D$, SVD finds the best $d$-dimensional representation that minimizes the sum of squares of the projections (of rows) to the subspace.
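The reduction can be sketched with a truncated SVD. Sizes below are toy values for illustration (MBERT's actual vocabulary is roughly 119K wordpieces with $D = 768$); this is not the authors' preprocessing code.

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, d = 100, 16, 4                 # toy vocabulary size, embedding dim, target dim
E = rng.normal(size=(V, D))          # stand-in for the V x D embedding matrix

# For E = U S Vt, the rows of U[:, :d] * S[:d] give the best d-dimensional
# representation of the rows of E in the least-squares sense (Eckart-Young).
U, S, Vt = np.linalg.svd(E, full_matrices=False)
E_reduced = U[:, :d] * S[:d]         # V x d compressed embeddings

# Reconstruction from the top-d factors, to inspect the approximation error:
E_approx = E_reduced @ Vt[:d]
err_d = np.linalg.norm(E - E_approx)
```

The residual `err_d` equals the Frobenius norm of the discarded singular values, so increasing `d` monotonically reduces the approximation error.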

4 Training

We want to optimize the loss functions for representation $\mathcal{L}_{RL}$, logits $\mathcal{L}_{LL}$ and cross-entropy $\mathcal{L}_{CE}$. These optimizations can be scheduled differently to obtain different training regimens, as follows.

4.1 Joint Optimization

In this regimen, we optimize the following losses jointly:

$$\mathcal{L} = \alpha \sum_{D_l} \mathcal{L}_{CE} + \beta \sum_{D_u} \big(\mathcal{L}_{RL} + \mathcal{L}_{LL}\big)$$

where $\alpha$ and $\beta$ weigh the contribution of the different losses. A high value of $\alpha$ makes the student focus more on the easy targets, whereas a high value of $\beta$ shifts the focus to the difficult ones. The above loss is computed over two different task-specific data segments: the first part involves cross-entropy loss over labeled data, whereas the second part involves representation and logit loss over unlabeled data.
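The weighted combination can be sketched as a one-liner. The weight names and values below are illustrative, not the paper's tuned settings (the paper reports varying such weights in multiples of 10).

```python
def joint_loss(l_ce: float, l_logit: float, l_repr: float,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted sum of the cross-entropy loss (labeled data) and the
    logit + representation losses (unlabeled data)."""
    return alpha * l_ce + beta * (l_logit + l_repr)

# Toy per-batch loss values; weights chosen only for illustration.
total = joint_loss(l_ce=0.8, l_logit=0.3, l_repr=0.1, alpha=1.0, beta=10.0)
```

In practice each component would be averaged over its own data segment before combining, and the weights tuned on a held-out set.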

Algorithm: Multi-stage distillation.
  Fine-tune the teacher on $D_l$ and update $\theta^t$
  for stage in {1, 2, 3}:
    Freeze all student layers
    if stage = 1: targets = teacher representations on $D_u$ from the $l^{th}$ layer; loss = $\mathcal{L}_{RL}$
    if stage = 2: targets = teacher logits on $D_u$; loss = $\mathcal{L}_{LL}$
    if stage = 3: targets = ground-truth labels on $D_l$; loss = $\mathcal{L}_{CE}$
    for each layer from top to bottom:
      Unfreeze the layer
      Update parameters by minimizing the loss between student outputs and targets till convergence

4.2 Stage-wise Training

Instead of optimizing all loss functions jointly, we propose a stage-wise scheme to gradually transfer the most general to the most task-specific representations from teacher to student. In the first stage, we train the student to mimic teacher representations from its $l^{th}$ layer by optimizing $\mathcal{L}_{RL}$ on unlabeled data. The student learns the parameters for word embeddings ($\theta_w$), BiLSTM ($\theta_b$) and projections ($W^f$, $b^f$).

In the second stage, we optimize the cross-entropy and logit losses jointly, on labeled and unlabeled data respectively, to learn the corresponding classifier parameters.

The above can be further broken down into two stages, where we sequentially optimize the logit loss on unlabeled data and then the cross-entropy loss on labeled data. Every stage learns parameters conditioned on those learned in the previous stage, followed by end-to-end fine-tuning.

4.3 Gradual Unfreezing

One potential drawback of end-to-end fine-tuning for stage-wise optimization is ‘catastrophic forgetting’ (DBLP:conf/acl/RuderH18) where the model forgets information learned in earlier stages. To address this, we adopt gradual unfreezing – where we tune the model one layer at a time starting from the configuration at the end of previous stage.

We start from the top layer that contains the most task-specific information and allow the model to configure the task-specific layer first while all other layers remain frozen. The lower layers are then gradually unfrozen one by one and the model trained till convergence. Once a layer is unfrozen, it maintains its state. When the last layer (word embeddings) is unfrozen, the entire network is trained end-to-end. The order of this unfreezing scheme (top-to-bottom) is the reverse of that in (DBLP:conf/acl/RuderH18), and we find it to work better in our setting with the following intuition. At the end of the first stage of optimizing $\mathcal{L}_{RL}$, the student has learned to generate representations similar to those of the $l^{th}$ layer of the teacher. Now, we need to add only a few task-specific parameters ($W^s$, $b^s$) to optimize for the logit loss with all other layers frozen. Next, we gradually give the student more flexibility to optimize for the task-specific loss by tuning the layers below, where the number of parameters increases with depth.

We tune each layer for a fixed number of epochs and restore the model to the best configuration based on validation loss on a held-out set. Therefore, the model retains the best possible performance from any iteration. Algorithm 4.1 shows the overall processing scheme.
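The top-to-bottom unfreezing schedule described above can be sketched as follows. This is an illustrative skeleton, not the released training code; layer names follow the student described earlier and `train_to_convergence` is a hypothetical placeholder.

```python
# Gradual unfreezing, top-to-bottom: start with only the most task-specific
# layer trainable, then unfreeze one lower layer per phase; each unfrozen
# layer stays unfrozen, so the final phase trains the network end-to-end.
layers_top_to_bottom = ["softmax", "projection", "bilstm", "word_emb"]
frozen = {name: True for name in layers_top_to_bottom}

schedule = []                                # record trainable sets per phase
for layer in layers_top_to_bottom:
    frozen[layer] = False                    # unfreeze and keep unfrozen
    trainable = [l for l in layers_top_to_bottom if not frozen[l]]
    schedule.append(list(trainable))
    # train_to_convergence(trainable)        # placeholder for the actual update
```

After the last phase, `schedule[-1]` contains every layer, i.e. the whole network is trained end-to-end, matching the description above.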

Dataset Labels Train Test Unlabeled
Wikiann-41 11 705K 329K 7.2MM
IMDB 2 25K 25K 50K
DBPedia 14 560K 70K -
AG News 4 120K 7.6K -
Elec 2 25K 25K 200K
Table 1: Full dataset summary.

5 Experiments

Work PT TA Distil.
sanh2019 Y Y D1
turc2019wellread Y N D1
DBLP:journals/corr/abs-1904-09482; zhu-etal-2019-panlp; Shi_2019; Tsai_2019; DBLP:journals/corr/abs-1903-12136; izsak2019training; Clark-2019 N N D1
sun2019patient N Y D2
jiao2019tinybert N N D2
zhao2019extreme Y N D2
TinyMBERT (ours) N N D4
Table 2: Different distillation strategies. D1 leverages soft logits with hard labels. D2 uses representation loss. PT denotes pre-training with language modeling. TA depicts students constrained by teacher architecture.
Strategy Features Transfer = 0.7MM Transfer = 1.4MM Transfer = 7.2MM
D0 Labels per lang. 71.26 (6.2) - -
D0-S Labels across all lang. 81.44 (5.3) - -
D1 Labels and Logits 82.74 (5.1) 84.52 (4.8) 85.94 (4.8)
D2 Labels, Logits and Repr. 82.38 (5.2) 83.78 (4.9) 85.87 (4.9)
D3.1 (S1) Repr. (S2) Labels and Logits 83.10 (5.0) 84.38 (5.1) 86.35 (4.9)
D3.2 + Gradual unfreezing 86.77 (4.3) 87.79 (4.0) 88.26 (4.3)
D4.1 (S1) Repr. (S2) Logits (S3) Labels 84.82 (4.7) 87.07 (4.2) 87.87 (4.1)
D4.2 + Gradual unfreezing 87.10 (4.2) 88.64 (3.8) 88.52 (4.1)
Table 3: Comparison of several distillation strategies showing average F1-score (and standard deviation) across 41 languages over different transfer data sizes. (S*) depicts separate stages and the corresponding optimized loss functions.

Dataset Description: We evaluate our model TinyMBERT for multi-lingual NER on 41 languages in the same setting as (rahimi-etal-2019-massively). This data has been derived from the WikiAnn NER corpus (pan-etal-2017-cross) and partitioned into training, development and test sets. All NER results are reported on this test set for a fair comparison with existing works. We report both the average F1-score and the standard deviation between scores across the 41 languages for phrase-level evaluation. Refer to Figure 2 for language codes and the distribution of training labels across languages.

We also perform experiments with data from four other domains (refer to Table 1): IMDB (DBLP:conf/acl/MaasDPHNP11), SST-2 (socher-etal-2013-parsing) and Elec (DBLP:conf/recsys/McAuleyL13) for sentiment analysis of movie and electronics product reviews, and DBpedia (DBLP:conf/nips/ZhangZL15) and AG News (DBLP:conf/nips/ZhangZL15) for topic classification of Wikipedia and news articles.

NER Tags: The NER corpus uses the IOB2 tagging scheme with entities like LOC, ORG and PER. Following MBERT, we do not use language markers and share these tags across all languages. We use additional syntactic markers {CLS, SEP, PAD} and 'X' for marking segmented wordpieces, contributing a total of 11 tags (with a shared 'O').
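The expansion of word-level IOB2 tags to wordpiece level can be sketched as below. This is an illustrative toy version: the example tokenization is hand-rolled, not MBERT's actual WordPiece output, and `align_tags` is a name we invented.

```python
# The first piece of a word keeps the original tag; continuation pieces
# (marked '##' in WordPiece) get the auxiliary 'X' tag; CLS/SEP markers
# are added around the sequence.
def align_tags(wordpieces, word_tags):
    tags, widx = ["CLS"], 0
    for piece in wordpieces:
        if piece.startswith("##"):
            tags.append("X")          # continuation of a segmented word
        else:
            tags.append(word_tags[widx])
            widx += 1
    tags.append("SEP")
    return tags

pieces = ["Ang", "##ela", "visited", "Berlin"]        # toy wordpiece sequence
tags = align_tags(pieces, ["B-PER", "O", "B-LOC"])
# tags -> ['CLS', 'B-PER', 'X', 'O', 'B-LOC', 'SEP']
```

At evaluation time, predictions for 'X' and the syntactic markers would be discarded before computing phrase-level F1.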

5.1 Evaluating Distillation Strategies

Baselines: A trivial baseline (D0) is to learn models one per language using only corresponding labels for learning. This can be improved by merging all instances and sharing information across all languages (D0-S). Most of the concurrent and recent works (refer to Table 2 for an overview) leverage logits as optimization targets for distillation (D1). A few exceptions also use teacher internal representations along with soft logits (D2). For our model we consider multi-stage distillation, where we first optimize representation loss followed by jointly optimizing logit and cross-entropy loss (D3.1) and further improving it by gradual unfreezing of neural network layers (D3.2). Finally, we optimize the loss functions sequentially in three stages (D4.1) and improve it further by unfreezing mechanism (D4.2). We further compare all strategies while varying the amount of unlabeled transfer data for distillation (hyper-parameter settings in Appendix).

Results: From Table 3, we observe that all strategies sharing information across languages work better (D0-S vs. D0), with soft logits adding more value than hard targets alone (D1 vs. D0-S). Interestingly, we observe that simply combining the representation loss with logits (D2 vs. D1) hurts the model. We observe this strategy to be vulnerable to the hyper-parameters ($\alpha$, $\beta$ in Eqn. 7) used to combine the multiple loss functions. We vary these hyper-parameters in multiples of 10 and report the best numbers.

Stage-wise optimizations remove these hyper-parameters and improve performance. We also observe the gradual unfreezing scheme to improve both stage-wise distillation strategies significantly.

Focusing on the data dimension, we observe all models to improve as more and more unlabeled data is used for transferring teacher knowledge to student. However, we also observe the improvement to slow down after a point where additional unlabeled data does not yield significant benefits. Table 4 shows the gradual performance improvement in TinyMBERT after every stage and unfreezing various neural network layers.

5.2 Performance, Compression and Speedup

Stage Unfreezing Layer F1 Std. Dev.
2 Linear ($W^s$, $b^s$) 0 0
2 Projection ($W^f$, $b^f$) 2.85 3.9
2 BiLSTM ($\theta_b$) 81.64 5.2
2 Word Emb ($\theta_w$) 85.99 4.4
3 Softmax ($W$, $b$) 86.38 4.2
3 Projection ($W^f$, $b^f$) 87.65 3.9
3 BiLSTM ($\theta_b$) 88.08 3.9
3 Word Emb ($\theta_w$) 88.64 3.8
Table 4: Gradual F1-score improvement over multiple distillation stages in TinyMBERT.
(a) Parameter compression vs. F1-score.
(b) Inference speedup vs. F1-score.
Figure 1: Variation in TinyMBERT F1-score with parameter and latency compression against MBERT. Each point in the linked scatter plots represents a configuration with corresponding embedding dimension and BiLSTM hidden states (E, H).
Model Avg. F1 Std. Dev.
MBERT-single (DBLP:conf/naacl/DevlinCLT19) 90.76 3.1
MBERT (DBLP:conf/naacl/DevlinCLT19) 91.86 2.7
MMNER (rahimi-etal-2019-massively) 89.20 2.8
TinyMBERT (ours) 88.64 3.8
Table 5: F1-score comparison of different models with standard deviation across 41 languages.
Figure 2: F1-score comparison for different models across 41 languages. The y-axis on the left shows the scores, whereas the axis on the right (plotted against blue dots) shows the number of training labels (in thousands).

Performance: We observe TinyMBERT in Table 5 to perform competitively with other models. MBERT-single models are fine-tuned per language with corresponding labels, whereas MBERT is fine-tuned with data across all languages. MMNER results are reported from  rahimi-etal-2019-massively.

Figure 2 shows the variation in F1-score across the different languages, with variable amounts of training data, for the different models. We observe all models to follow the general trend, with some aberrations for languages with fewer training labels.

Parameter compression: TinyMBERT performs at par with MMNER while obtaining at least 41x compression by learning a single model across all 41 languages, as opposed to learning language-specific models.

Figure 1(a) shows the variation in F1-score of TinyMBERT and its compression against MBERT for different configurations of the embedding dimension (E) and number of BiLSTM hidden states (H). We observe that reducing the embedding dimension leads to large compression with minimal performance loss, whereas reducing the BiLSTM hidden states impacts the performance more and contributes less to the compression.

Inference speedup: We compare the runtime inference efficiency of MBERT and our model on a single P100 GPU for batch inference (batch size = 32) on queries of fixed sequence length. We average the time taken for predicting labels over all queries for each model, aggregated over multiple runs. Compared to batch inference, the speedups are smaller for online inference (batch size = 1) on an Intel(R) Xeon(R) CPU (E5-2690 v4 @2.60GHz) (refer to the Appendix for details).

Figure 1(b) shows the variation in F1-score of TinyMBERT and its inference speedup against MBERT with the same (linked) parameter configurations as before. As expected, the performance degrades with gradual speedup. We observe that parameter compression does not necessarily lead to inference speedup: reducing the word embedding dimension leads to massive model compression, but does not have a similar effect on latency. The BiLSTM hidden states, on the other hand, constitute the real latency bottleneck. One of the best configurations leads to 35x compression and 51x speedup over MBERT while retaining nearly 95% of its performance.

Model #Transfer Samples F1
MMNER - 62.1
MBERT - 79.54
TinyMBERT 4.1K 19.12
TinyMBERT 705K 76.97
TinyMBERT 1.3MM 77.17
TinyMBERT 7.2MM 77.26
Table 6: F1-score comparison for the low-resource setting with 100 labeled samples per language and transfer sets of different sizes for TinyMBERT.

5.3 Low-resource NER and Distillation

Models in all prior experiments are trained on 705K labeled instances across all languages. In this setting, we instead consider only 100 labeled samples for each language, with a total of 4.1K instances. From Table 6, we observe MBERT to outperform MMNER by more than 17 percentage points, with TinyMBERT closely following suit.

Furthermore, we observe our model's performance to improve with the transfer set size, depicting the importance of unlabeled transfer data for knowledge distillation. As before, beyond a point additional data contributes only marginally.

5.4 Word Embeddings

Random initialization of word embeddings works well. Multi-lingual FastText embeddings (bojanowski2016enriching) lead to minor improvement due to the overlap between FastText tokens and MBERT wordpieces; English Glove does much better. We experiment with recent dimensionality reduction techniques and find SVD to work best. Surprisingly, it leads to marginal improvement over MBERT embeddings before reduction. As expected, MBERT embeddings after fine-tuning perform better than those from pre-trained checkpoints (refer to the Appendix for F1-measures).

5.5 Architectural Considerations

Which teacher layer to distil from? The topmost teacher layer captures the most task-specific knowledge. However, it may be difficult for a shallow student to capture this knowledge given its limited capacity. On the other hand, the shallower representations in the middle of the teacher model are easier for the student to mimic. We observe the student to benefit most from distilling one of the middle layers of the teacher (results in Appendix).

Figure 3: BiLSTM and Transformer F1-scores (left y-axis) vs. inference latency (right y-axis) in 13 different settings, with corresponding embedding dimension and width/depth of the student.

Which student architecture to use for distillation? Recent works in distillation leverage both BiLSTMs and Transformers as students. In this experiment, we vary the embedding dimension and hidden states for BiLSTM-based students, and the embedding dimension and depth for Transformer-based students, to obtain configurations with similar inference latency. Each of the 13 configurations in Figure 3 depicts the F1-scores obtained by students of different architectures but similar latency, for strategy D0-S in Table 3. We observe that for low-latency configurations, BiLSTMs work better than shallow Transformers, whereas the latter start performing better at greater depths, although with higher latency.

5.6 Distillation for Text Classification

Model Transfer Set Acc.
BERT Large Teacher - 94.95
TinyBERT SST+Imdb 93.35
BERT Base Teacher - 92.78
TinyBERT SST+Imdb 92.89
sun2019patient SST 92.70
turc2019wellread SST+IMDB 91.10
Table 7: Model accuracy on the SST-2 dev. set.

We now switch gears and focus on classification tasks. In contrast to sequence tagging, we use the last hidden state of the BiLSTM as the final sentence representation for projection, regression and softmax.

Comparison with baselines: Since we focus only on single-instance classification in this work, SST-2 (socher-etal-2013-parsing) is the only GLUE benchmark on which to compare against other distillation techniques. Table 7 shows the accuracy comparison with such methods as reported on the SST-2 development set.

We extract sentences from all IMDB movie reviews in Table 1 to form the unlabeled transfer set for distillation. We obtain the best performance when distilling from BERT Large (uncased, whole-word-masking model) rather than BERT Base – demonstrating better student performance with a better teacher, and outperforming other methods.

Other classification tasks: Table 8 shows the distillation performance of TinyBERT with different teachers. We observe the student to almost match the teacher performance. The performance also improves with a better teacher, although the improvement is marginal as the student model saturates.

Dataset | Student (no distil.) | Distil (BERT Base) | Distil (BERT Large) | BERT Base | BERT Large
Ag News 89.71 92.33 94.33 92.12 94.63
IMDB 89.37 91.22 91.70 91.70 93.22
Elec 90.62 93.55 93.56 93.46 94.27
DbPedia 98.64 99.10 99.06 99.26 99.20
Table 8: Distillation performance with BERT.

Table 9 shows the distillation performance with limited labeled samples per class. The distilled student improves over the non-distilled version by up to 27 percentage points and matches the teacher performance for all of the tasks, demonstrating the impact of distillation for low-resource settings.

Dataset | Student (no distil.) | Student (with distil.) | BERT Large
AG News 85.85 90.45 90.36
IMDB 61.53 89.08 89.11
Elec 65.68 91.00 90.41
DBpedia 96.30 98.94 98.94
Table 9: Distillation with BERT Large on limited labeled samples per class.

6 Related Work

Model compression and knowledge distillation: Prior works in the vision community dealing with huge architectures like AlexNet and ResNet have addressed this challenge in two ways. Works in model compression use quantization (DBLP:journals/corr/GongLYB14), low-precision training and pruning the network, as well as their combination (HanMao16) to reduce the memory footprint. On the other hand, works in knowledge distillation leverage student teacher models. These approaches include using soft logits as targets (DBLP:conf/nips/BaC14), increasing the temperature of the softmax to match that of the teacher (DBLP:journals/corr/HintonVD15) as well as using teacher representations (DBLP:journals/corr/RomeroBKCGB14) (refer to (DBLP:journals/corr/abs-1710-09282) for a survey).

Recent and concurrent works: DBLP:journals/corr/abs-1904-09482; zhu-etal-2019-panlp; Clark-2019 leverage ensembling to distil knowledge from several multi-task deep neural networks into a single model. sun2019patient; sanh2019; aguilar2019knowledge train student models leveraging architectural knowledge of the teacher models, which adds architectural constraints (e.g., embedding dimension) on the student. In order to address this shortcoming, more recent works combine task-specific distillation with pre-training student models of arbitrary embedding dimension, while still relying on transformer architectures (turc2019wellread; jiao2019tinybert; zhao2019extreme).

izsak2019training; Shi_2019 extend these techniques to sequence tagging for Part-of-Speech (POS) tagging and Named Entity Recognition (NER) in English. The work closest to ours, Tsai_2019, extends the above to multi-lingual NER.

Most of these works rely on general corpora for pre-training and task-specific labeled data for distillation. To harness additional knowledge, (turc2019wellread) leverage task-specific unlabeled data. (DBLP:journals/corr/abs-1903-12136; jiao2019tinybert) use rule- and embedding-based data augmentation in the absence of such unlabeled data.

7 Conclusions

We develop a multi-stage distillation framework for massive multi-lingual NER and classification that performs close to huge pre-trained models with massive compression and inference speedup. Our distillation strategy, leveraging teacher representations agnostic of the teacher's architecture together with a stage-wise optimization schedule, outperforms existing ones. We perform an extensive study of several hitherto less-explored distillation dimensions, like the impact of the unlabeled transfer set, embeddings and student architectures, and make interesting observations.