
TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

Deep and large pre-trained language models are the state-of-the-art for various natural language processing tasks. However, the huge size of these models could be a deterrent to using them in practice. Some recent and concurrent works use knowledge distillation to compress these huge models into shallow ones. In this work we study knowledge distillation with a focus on multi-lingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme leveraging teacher internal representations that is agnostic of teacher architecture, and show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors like the amount of unlabeled data, annotation resources, model architecture and inference latency, to name a few. We show that our approach leads to massive compression of MBERT-like teacher models by up to 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95% of its performance for massive multi-lingual NER.





1 Introduction

Motivation: Pre-trained deep language models have shown state-of-the-art performance for various natural language processing applications like text classification, named entity recognition, question-answering, etc. A significant challenge facing many practitioners is how to deploy these huge models in practice. For instance, BERT Large and GPT-2 contain 340 million and 1.5 billion model parameters, respectively. Although these models are trained offline, during prediction we still need to traverse the deep neural network architecture stack involving a large number of parameters. This significantly increases latency and memory requirements.

Knowledge distillation (DBLP:journals/corr/HintonVD15; DBLP:conf/nips/BaC14), originally developed for computer vision applications, provides one of the techniques to compress huge neural networks into smaller ones. In this approach, shallow models (called students) are trained to mimic the output of huge models (called teachers) based on a transfer set. Similar approaches have recently been adopted for language model distillation.

Limitations of existing work: Recent works (DBLP:journals/corr/abs-1904-09482; zhu-etal-2019-panlp; DBLP:journals/corr/abs-1903-12136; turc2019wellread) leverage only the soft output (logits) from the teacher as optimization targets for distilling student models, with some notable exceptions from concurrent work. sun2019patient; sanh2019; aguilar2019knowledge; zhao2019extreme additionally use internal representations from the teacher to provide useful hints for distilling better students. However, these methods are constrained by the teacher architecture, such as the embedding dimension in BERT and transformer architectures. This makes it difficult to massively compress these models (without being able to reduce the network width) or to adopt alternate architectures. For instance, we observe BiLSTMs as students to be more accurate than Transformers for low-latency configurations. Some of the concurrent works (turc2019wellread; zhao2019extreme) adopt pre-training or dual training to distil student models of arbitrary architecture. However, pre-training is expensive both in terms of time and computational resources.

Additionally, most of the above works are geared for distilling language models for GLUE tasks. There has been very limited exploration of such techniques for NER (izsak2019training; Shi_2019) or multi-lingual tasks (Tsai_2019). Moreover, these works also suffer from the same drawbacks as mentioned before.

Overview of our method: In this work, we compare distillation strategies used in all the above works and propose a new scheme outperforming prior ones. In this, we leverage teacher internal representations to transfer knowledge to the student. However, in contrast to prior work, we are not restricted by the choice of student architecture. This allows representation transfer from Transformer-based teacher model to BiLSTM-based student model with different embedding dimensions and disparate output spaces. We also propose a stage-wise optimization scheme to sequentially transfer most general to task-specific information from teacher to student for better distillation.

Overview of our task: Unlike prior works mostly focusing on GLUE tasks in a single language, we employ our techniques to study distillation for massive multi-lingual Named Entity Recognition (NER) over 41 languages. Prior work on multi-lingual transfer on the same dataset (rahimi-etal-2019-massively) (MMNER) requires knowledge of the source and target language, whereby they judiciously select pairs for effective transfer, resulting in a customized model for each language. In our work, instead, we adopt Multi-lingual Bidirectional Encoder Representations from Transformers (MBERT) as our teacher and show that it is possible to perform language-agnostic joint NER for all languages with a single model that has similar performance but is massively compressed in contrast to MBERT and MMNER.

Perhaps the closest work to ours is that of (Tsai_2019), where MBERT is leveraged for multi-lingual NER. We discuss this in detail and use their strategy as one of our baselines. We show that our distillation strategy is better, leading to much higher compression and faster inference. We also investigate several unexplored dimensions of distillation like the impact of unlabeled transfer data and annotation resources, the choice of multi-lingual word embeddings, architectural variations and inference latency, to name a few.

Our techniques obtain massive compression of MBERT-like teacher models, by up to 35x in terms of parameters and 51x in terms of latency for batch inference, while retaining 95% of its performance for massive multi-lingual NER, and matching or outperforming it for classification tasks. Overall, our work makes the following contributions:

Method: We propose a distillation method leveraging internal representations and parameter projection that is agnostic of teacher architecture.

Inference: To learn model parameters, we propose a stage-wise optimization schedule with gradual unfreezing that outperforms prior schemes.

Experiments: We perform distillation for multi-lingual NER on 41 languages with massive compression and comparable performance to huge models (we will release code and distilled model checkpoints). We also perform classification experiments on four datasets where our compressed models perform at par with huge teachers.

Study: We study the influence of several factors on distillation like the availability of annotation resources for different languages, model architecture, quality of multi-lingual word embeddings, memory footprint and inference latency.

Problem Statement: Consider a sequence $x = \langle x_1 \cdots x_T \rangle$ with $T$ tokens and $y = \langle y_1 \cdots y_T \rangle$ as the corresponding labels. Consider $D_l = \{x_l, y_l\}$ to be a set of $n$ labeled instances, with $x_l$ denoting the instances and $y_l$ the corresponding labels. Consider $D_u = \{x_u\}$ to be a transfer set of $N$ unlabeled instances from the same domain, where $N \gg n$. Given a teacher $\mathcal{T}(\theta^t)$, we want to train a student $\mathcal{S}(\theta^s)$, with $\theta^s$ being trainable parameters, such that $|\theta^s| \ll |\theta^t|$ and the student is comparable in performance to the teacher based on some evaluation metric. In the following sections, the superscript 't' always represents the teacher and 's' denotes the student.

2 Models

The Student: The input to the model are $E$-dimensional word embeddings for each token. In order to capture sequential information in the tokens, we use a single-layer Bidirectional Long Short-Term Memory network (BiLSTM). Given a sequence of $T$ tokens, a BiLSTM computes a set of $T$ vectors $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ as the concatenation of the states generated by a forward and a backward LSTM. Assuming the number of hidden units in each LSTM to be $H$, each hidden state $h_t$ is of dimension $2H$. The probability of label $c$ at timestep $t$ is given by:

$$p(y_t = c \mid h_t) = \mathrm{softmax}(W \cdot h_t + b)$$

where $W \in \mathbb{R}^{C \times 2H}$, $b \in \mathbb{R}^C$, and $C$ is the number of labels.
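The prediction head above can be sketched as follows. This is a minimal illustrative example, not the authors' code; the dimensions and names are made up for demonstration.

```python
import numpy as np

# Sketch of the student's prediction head: a BiLSTM hidden state h_t of
# dimension 2H is mapped to label probabilities with a softmax layer.
H, C = 4, 11                      # hidden units per direction, number of NER tags
rng = np.random.default_rng(0)

W = rng.normal(size=(C, 2 * H))   # trainable projection, C x 2H
b = np.zeros(C)                   # trainable bias

def label_probs(h_t: np.ndarray) -> np.ndarray:
    """p(y_t = c | h_t) = softmax(W . h_t + b)."""
    scores = W @ h_t + b
    scores -= scores.max()        # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

h_t = rng.normal(size=2 * H)      # stand-in for a concatenated BiLSTM state
p = label_probs(h_t)              # probability over the C labels
```

In training, `W` and `b` would be updated by backpropagation together with the BiLSTM and embedding parameters.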

We train the student network end-to-end minimizing the cross-entropy loss over labeled data:

$$\mathcal{L}_{CE} = -\sum_{t} \sum_{c} \mathbb{1}[y_t = c] \log\, p(y_t = c \mid h_t)$$
The Teacher: Pre-trained language models like ELMO (DBLP:conf/naacl/PetersNIGCLZ18), BERT (DBLP:conf/naacl/DevlinCLT19) and GPT (radford2018improving; radford2019) have shown state-of-the-art performance for several tasks. We adopt BERT as the teacher – specifically, the multi-lingual version of BERT (MBERT), trained on the 104 languages with the largest Wikipedias. MBERT does not use any markers to distinguish languages during pre-training and learns a single language-agnostic model trained via masked language modeling over Wikipedia articles from all languages.

Tokenization: Similar to MBERT, we use WordPiece tokenization with a shared WordPiece vocabulary. We preserve casing, remove accents, and split on punctuation and whitespace.

Fine-tuning the Teacher: The pre-trained language models are trained for general language modeling objectives. In order to adapt them to the given task, the teacher is fine-tuned end-to-end with task-specific labeled data to learn the parameters $\theta^t$ using cross-entropy loss as in Equation 2.

3 Distillation Features

Fine-tuning the teacher gives us access to its task-specific representations for distilling the student model. To this end, we use different kinds of information from the teacher.

Teacher Logits: Logits, as logarithms of predicted probabilities, provide a better view of the teacher by emphasizing the different relationships learned by it across different instances. Consider $q(c \mid x_t)$ to be the classification probability of token $x_t$ as generated by the fine-tuned teacher, with $logit(c \mid x_t)$ representing the corresponding logits. Our objective is to train a student model with these logits as targets. Given the student's hidden state representation $h_t$ for token $x_t$, we can obtain the corresponding classification score (since the targets are logits) as:

$$r^s(x_t) = W^s \cdot h_t + b^s$$

where $W^s \in \mathbb{R}^{C \times 2H}$ and $b^s \in \mathbb{R}^C$ are trainable parameters and $C$ is the number of classes. We train the student network end-to-end by minimizing the element-wise mean-squared error between the classification scores given by the student and the target logits from the teacher:

$$\mathcal{L}_{LL} = \frac{1}{2} \sum_t \big\| r^s(x_t) - logit^t(\cdot \mid x_t) \big\|^2$$
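The logit-matching objective can be sketched in a few lines. This is a hedged illustration with toy numbers, not the released implementation; `logit_loss` is an illustrative name.

```python
import numpy as np

def logit_loss(student_scores: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Half the element-wise mean-squared error between the student's
    linear classification scores and the teacher's logits."""
    diff = student_scores - teacher_logits
    return 0.5 * float(np.mean(diff ** 2))

# Toy T x C matrices: T = 2 tokens, C = 3 classes.
teacher = np.array([[2.0, -1.0, 0.5],
                    [0.1,  3.0, -2.0]])   # logits from the fine-tuned teacher
student = np.array([[1.5, -0.5, 0.5],
                    [0.0,  2.5, -1.5]])   # student scores before training

loss = logit_loss(student, teacher)       # positive; 0 only on a perfect match
```

Since the targets are raw logits rather than probabilities, no softmax is applied to the student scores before computing the loss.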
3.1 Internal Teacher Representations

Hidden representations: Recent works (sun2019patient; DBLP:journals/corr/RomeroBKCGB14) have shown the hidden state information from the teacher to be helpful as hint-based guidance for the student. Given a large collection of task-specific unlabeled data, we can transfer the teacher's knowledge to the student via its hidden representations. However, this poses a challenge in our setting, as the teacher and student models have different architectures with disparate output spaces.

Consider $z^s(x_t)$ and $z^{t,l}(x_t)$ to be the representations generated by the student and the $l^{th}$ deep layer of the fine-tuned teacher, respectively, for a token $x_t$. Consider $D_u$ to be the set of unlabeled instances. We will later discuss the choice of the teacher layer $l$ and its impact on distillation.

Projection: To make all output spaces compatible, we perform a non-linear projection of the student representation to have the same shape as the teacher representation for each token $x_t$:

$$\tilde{z}^s(x_t) = \mathrm{Gelu}(W^f \cdot z^s(x_t) + b^f)$$

where $W^f \in \mathbb{R}^{|z^{t,l}| \times 2H}$ is the projection matrix, $b^f$ is the bias, and Gelu (Gaussian Error Linear Unit) (DBLP:journals/corr/HendrycksG16) is the non-linear projection function; $|z^{t,l}|$ represents the embedding dimension of the teacher. This transformation aligns the output spaces of the student and teacher and allows us to accommodate arbitrary student architectures. Also note that the projections (and therefore the parameters) are shared across tokens at different timesteps.

The projection parameters are learned by minimizing the KL-divergence (KLD) between the student and the $l^{th}$-layer teacher representations:

$$\mathcal{L}_{RL} = KLD\big(\tilde{z}^s(x_t),\ z^{t,l}(x_t)\big)$$
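The projection and divergence can be sketched as below. This is an illustrative toy version, not the authors' code: Gelu uses the common tanh approximation, and both representations are softmax-normalized before the KLD so that it is computed between valid distributions, which is a simplifying assumption on our part.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def kld(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors; clipped for stability."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
z_s = rng.normal(size=8)            # student BiLSTM state, toy 2H = 8
W_f = rng.normal(size=(12, 8))      # projection to a toy teacher dim of 12
b_f = np.zeros(12)
z_t = rng.normal(size=12)           # teacher layer-l representation

z_proj = gelu(W_f @ z_s + b_f)      # student state projected to teacher space
loss = kld(softmax(z_proj), softmax(z_t))
```

The loss is non-negative and reaches zero only when the projected student representation matches the teacher's, which is the training signal for $W^f$ and $b^f$.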
Multi-lingual word embeddings: A large number of parameters reside in the word embeddings. For MBERT, a shared multi-lingual WordPiece vocabulary of $V$ tokens and an embedding dimension of $D$ leads to $V \cdot D$ embedding parameters. To obtain massive compression, we cannot directly incorporate MBERT embeddings in our model. Since we use the same WordPiece vocabulary, we are likely to benefit more from these embeddings than from Glove (DBLP:conf/emnlp/PenningtonSM14) or FastText (bojanowski2016enriching).

We use a dimensionality reduction algorithm, Singular Value Decomposition (SVD), to project the MBERT word embeddings to a lower-dimensional space. Given the MBERT word embedding matrix of dimension $V \times D$, SVD finds the best $d$-dimensional representation that minimizes the sum of squares of the projections (of rows) to the subspace.
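The reduction can be sketched with a truncated SVD. Sizes below are toy values for illustration (MBERT's actual vocabulary is roughly 119K wordpieces with $D = 768$); this is not the authors' preprocessing code.

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, d = 100, 16, 4                 # toy vocabulary size, embedding dim, target dim
E = rng.normal(size=(V, D))          # stand-in for the V x D embedding matrix

# For E = U S Vt, the rows of U[:, :d] * S[:d] give the best d-dimensional
# representation of the rows of E in the least-squares sense (Eckart-Young).
U, S, Vt = np.linalg.svd(E, full_matrices=False)
E_reduced = U[:, :d] * S[:d]         # V x d compressed embeddings

# Reconstruction from the top-d factors, to inspect the approximation error:
E_approx = E_reduced @ Vt[:d]
err_d = np.linalg.norm(E - E_approx)
```

The residual `err_d` equals the Frobenius norm of the discarded singular values, so increasing `d` monotonically reduces the approximation error.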

4 Training

We want to optimize the loss functions for representation $\mathcal{L}_{RL}$, logits $\mathcal{L}_{LL}$ and cross-entropy $\mathcal{L}_{CE}$. These optimizations can be scheduled differently to obtain different training regimens, as follows.

4.1 Joint Optimization

In this regimen, we optimize the following losses jointly:

$$\mathcal{L} = \alpha \sum_{D_l} \mathcal{L}_{CE} + \beta \sum_{D_u} \big(\mathcal{L}_{RL} + \mathcal{L}_{LL}\big)$$

where $\alpha$ and $\beta$ weigh the contribution of the different losses. A high value of $\alpha$ makes the student focus more on the easy targets, whereas a high value of $\beta$ shifts the focus to the difficult ones. The above loss is computed over two different task-specific data segments: the first part involves cross-entropy loss over labeled data, whereas the second part involves representation and logit loss over unlabeled data.
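The weighted combination can be sketched as a one-liner. The weight names and values below are illustrative, not the paper's tuned settings (the paper reports varying such weights in multiples of 10).

```python
def joint_loss(l_ce: float, l_logit: float, l_repr: float,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted sum of the cross-entropy loss (labeled data) and the
    logit + representation losses (unlabeled data)."""
    return alpha * l_ce + beta * (l_logit + l_repr)

# Toy per-batch loss values; weights chosen only for illustration.
total = joint_loss(l_ce=0.8, l_logit=0.3, l_repr=0.1, alpha=1.0, beta=10.0)
```

In practice each component would be averaged over its own data segment before combining, and the weights tuned on a held-out set.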

Algorithm: Multi-stage distillation.
  Fine-tune the teacher on $D_l$ and update $\theta^t$
  for stage in {1, 2, 3}:
    Freeze all student layers
    if stage = 1: targets = teacher representations on $D_u$ from the $l^{th}$ layer; loss = $\mathcal{L}_{RL}$
    if stage = 2: targets = teacher logits on $D_u$; loss = $\mathcal{L}_{LL}$
    if stage = 3: targets = ground-truth labels on $D_l$; loss = $\mathcal{L}_{CE}$
    for each layer from top to bottom:
      Unfreeze the layer
      Update parameters by minimizing the loss between student outputs and targets till convergence

4.2 Stage-wise Training

Instead of optimizing all loss functions jointly, we propose a stage-wise scheme to gradually transfer the most general to the most task-specific representations from teacher to student. In the first stage, we train the student to mimic teacher representations from its $l^{th}$ layer by optimizing $\mathcal{L}_{RL}$ on unlabeled data. The student learns the parameters for word embeddings ($\theta_w$), BiLSTM ($\theta_b$) and projections ($W^f$, $b^f$).

In the second stage, we optimize the cross-entropy and logit losses jointly, on labeled and unlabeled data respectively, to learn the corresponding classifier parameters.

The above can be further broken down into two stages, where we sequentially optimize the logit loss on unlabeled data and then the cross-entropy loss on labeled data. Every stage learns parameters conditioned on those learned in the previous stage, followed by end-to-end fine-tuning.

4.3 Gradual Unfreezing

One potential drawback of end-to-end fine-tuning for stage-wise optimization is ‘catastrophic forgetting’ (DBLP:conf/acl/RuderH18) where the model forgets information learned in earlier stages. To address this, we adopt gradual unfreezing – where we tune the model one layer at a time starting from the configuration at the end of previous stage.

We start from the top layer that contains the most task-specific information and allow the model to configure the task-specific layer first while all other layers remain frozen. The lower layers are then gradually unfrozen one by one and the model trained till convergence. Once a layer is unfrozen, it maintains its state. When the last layer (word embeddings) is unfrozen, the entire network is trained end-to-end. The order of this unfreezing scheme (top-to-bottom) is the reverse of that in (DBLP:conf/acl/RuderH18), and we find it to work better in our setting with the following intuition. At the end of the first stage of optimizing $\mathcal{L}_{RL}$, the student has learned to generate representations similar to those of the $l^{th}$ layer of the teacher. Now, we need to add only a few task-specific parameters ($W^s$, $b^s$) to optimize for the logit loss with all other layers frozen. Next, we gradually give the student more flexibility to optimize for the task-specific loss by tuning the layers below, where the number of parameters increases with depth.

We tune each layer for a fixed number of epochs and restore the model to the best configuration based on validation loss on a held-out set. Therefore, the model retains the best possible performance from any iteration. Algorithm 4.1 shows the overall processing scheme.
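The top-to-bottom unfreezing schedule described above can be sketched as follows. This is an illustrative skeleton, not the released training code; layer names follow the student described earlier and `train_to_convergence` is a hypothetical placeholder.

```python
# Gradual unfreezing, top-to-bottom: start with only the most task-specific
# layer trainable, then unfreeze one lower layer per phase; each unfrozen
# layer stays unfrozen, so the final phase trains the network end-to-end.
layers_top_to_bottom = ["softmax", "projection", "bilstm", "word_emb"]
frozen = {name: True for name in layers_top_to_bottom}

schedule = []                                # record trainable sets per phase
for layer in layers_top_to_bottom:
    frozen[layer] = False                    # unfreeze and keep unfrozen
    trainable = [l for l in layers_top_to_bottom if not frozen[l]]
    schedule.append(list(trainable))
    # train_to_convergence(trainable)        # placeholder for the actual update
```

After the last phase, `schedule[-1]` contains every layer, i.e. the whole network is trained end-to-end, matching the description above.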

Dataset Labels Train Test Unlabeled
Wikiann-41 11 705K 329K 7.2MM
IMDB 2 25K 25K 50K
DBPedia 14 560K 70K -
AG News 4 120K 7.6K -
Elec 2 25K 25K 200K
Table 1: Full dataset summary.

5 Experiments

Work PT TA Distil.
sanh2019 Y Y D1
turc2019wellread Y N D1
DBLP:journals/corr/abs-1904-09482; zhu-etal-2019-panlp; Shi_2019; Tsai_2019; DBLP:journals/corr/abs-1903-12136; izsak2019training; Clark-2019 N N D1
sun2019patient N Y D2
jiao2019tinybert N N D2
zhao2019extreme Y N D2
TinyMBERT (ours) N N D4
Table 2: Different distillation strategies. D1 leverages soft logits with hard labels. D2 uses representation loss. PT denotes pre-training with language modeling. TA depicts students constrained by teacher architecture.
Strategy Features Transfer = 0.7MM Transfer = 1.4MM Transfer = 7.2MM
D0 Labels per lang. 71.26 (6.2) - -
D0-S Labels across all lang. 81.44 (5.3) - -
D1 Labels and Logits 82.74 (5.1) 84.52 (4.8) 85.94 (4.8)
D2 Labels, Logits and Repr. 82.38 (5.2) 83.78 (4.9) 85.87 (4.9)
D3.1 (S1) Repr. (S2) Labels and Logits 83.10 (5.0) 84.38 (5.1) 86.35 (4.9)
D3.2 + Gradual unfreezing 86.77 (4.3) 87.79 (4.0) 88.26 (4.3)
D4.1 (S1) Repr. (S2) Logits (S3) Labels 84.82 (4.7) 87.07 (4.2) 87.87 (4.1)
D4.2 + Gradual unfreezing 87.10 (4.2) 88.64 (3.8) 88.52 (4.1)
Table 3: Comparison of several distillation strategies showing average F1-score (and standard deviation) across 41 languages over different transfer data sizes. (S*) depicts separate stages and the corresponding optimized loss functions.

Dataset Description: We evaluate our model TinyMBERT for multi-lingual NER on 41 languages in the same setting as (rahimi-etal-2019-massively). This data has been derived from the WikiAnn NER corpus (pan-etal-2017-cross) and partitioned into training, development and test sets. All NER results are reported on this test set for a fair comparison with existing works. We report both the average F1-score and the standard deviation between scores across the 41 languages for phrase-level evaluation. Refer to Figure 2 for language codes and the distribution of training labels across languages.

We also perform experiments with data from four other domains (refer to Table 1): IMDB (DBLP:conf/acl/MaasDPHNP11), SST-2 (socher-etal-2013-parsing) and Elec (DBLP:conf/recsys/McAuleyL13) for sentiment analysis of movie and electronics product reviews, and DBpedia (DBLP:conf/nips/ZhangZL15) and AG News (DBLP:conf/nips/ZhangZL15) for topic classification of Wikipedia and news articles.

NER Tags: The NER corpus uses the IOB2 tagging scheme with entities like LOC, ORG and PER. Following MBERT, we do not use language markers and share these tags across all languages. We use additional syntactic markers {CLS, SEP, PAD} and 'X' for marking segmented wordpieces, contributing a total of 11 tags (with a shared 'O').
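The expansion of word-level IOB2 tags to wordpiece level can be sketched as below. This is an illustrative toy version: the example tokenization is hand-rolled, not MBERT's actual WordPiece output, and `align_tags` is a name we invented.

```python
# The first piece of a word keeps the original tag; continuation pieces
# (marked '##' in WordPiece) get the auxiliary 'X' tag; CLS/SEP markers
# are added around the sequence.
def align_tags(wordpieces, word_tags):
    tags, widx = ["CLS"], 0
    for piece in wordpieces:
        if piece.startswith("##"):
            tags.append("X")          # continuation of a segmented word
        else:
            tags.append(word_tags[widx])
            widx += 1
    tags.append("SEP")
    return tags

pieces = ["Ang", "##ela", "visited", "Berlin"]        # toy wordpiece sequence
tags = align_tags(pieces, ["B-PER", "O", "B-LOC"])
# tags -> ['CLS', 'B-PER', 'X', 'O', 'B-LOC', 'SEP']
```

At evaluation time, predictions for 'X' and the syntactic markers would be discarded before computing phrase-level F1.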

5.1 Evaluating Distillation Strategies

Baselines: A trivial baseline (D0) is to learn models one per language using only corresponding labels for learning. This can be improved by merging all instances and sharing information across all languages (D0-S). Most of the concurrent and recent works (refer to Table 2 for an overview) leverage logits as optimization targets for distillation (D1). A few exceptions also use teacher internal representations along with soft logits (D2). For our model we consider multi-stage distillation, where we first optimize representation loss followed by jointly optimizing logit and cross-entropy loss (D3.1) and further improving it by gradual unfreezing of neural network layers (D3.2). Finally, we optimize the loss functions sequentially in three stages (D4.1) and improve it further by unfreezing mechanism (D4.2). We further compare all strategies while varying the amount of unlabeled transfer data for distillation (hyper-parameter settings in Appendix).

Results: From Table 3, we observe that all strategies sharing information across languages work better (D0-S vs. D0), with soft logits adding more value than hard targets alone (D1 vs. D0-S). Interestingly, we observe that simply combining the representation loss with logits (D2 vs. D1) hurts the model. We observe this strategy to be vulnerable to the hyper-parameters ($\alpha$, $\beta$ in Eqn. 7) used to combine the multiple loss functions. We vary these hyper-parameters in multiples of 10 and report the best numbers.

Stage-wise optimizations remove these hyper-parameters and improve performance. We also observe the gradual unfreezing scheme to improve both stage-wise distillation strategies significantly.

Focusing on the data dimension, we observe all models to improve as more and more unlabeled data is used for transferring teacher knowledge to student. However, we also observe the improvement to slow down after a point where additional unlabeled data does not yield significant benefits. Table 4 shows the gradual performance improvement in TinyMBERT after every stage and unfreezing various neural network layers.

5.2 Performance, Compression and Speedup

Stage Unfreezing Layer F1 Std. Dev.
2 Linear ($W^s$, $b^s$) 0 0
2 Projection ($W^f$, $b^f$) 2.85 3.9
2 BiLSTM ($\theta_b$) 81.64 5.2
2 Word Emb ($\theta_w$) 85.99 4.4
3 Softmax ($W$, $b$) 86.38 4.2
3 Projection ($W^f$, $b^f$) 87.65 3.9
3 BiLSTM ($\theta_b$) 88.08 3.9
3 Word Emb ($\theta_w$) 88.64 3.8
Table 4: Gradual F1-score improvement over multiple distillation stages in TinyMBERT.
(a) Parameter compression vs. F1-score.
(b) Inference speedup vs. F1-score.
Figure 1: Variation in TinyMBERT F1-score with parameter and latency compression against MBERT. Each point in the linked scatter plots represents a configuration with corresponding embedding dimension and BiLSTM hidden states (E, H).
Model Avg. F1 Std. Dev.
MBERT-single (DBLP:conf/naacl/DevlinCLT19) 90.76 3.1
MBERT (DBLP:conf/naacl/DevlinCLT19) 91.86 2.7
MMNER (rahimi-etal-2019-massively) 89.20 2.8
TinyMBERT (ours) 88.64 3.8
Table 5: F1-score comparison of different models with standard deviation across 41 languages.
Figure 2: F1-score comparison for different models across 41 languages. The y-axis on the left shows the scores, whereas the axis on the right (plotted against blue dots) shows the number of training labels (in thousands).

Performance: We observe TinyMBERT in Table 5 to perform competitively with other models. MBERT-single models are fine-tuned per language with corresponding labels, whereas MBERT is fine-tuned with data across all languages. MMNER results are reported from  rahimi-etal-2019-massively.

Figure 2 shows the variation in F1-score across the different languages, with variable amounts of training data, for the different models. We observe all models to follow the general trend, with some aberrations for languages with fewer training labels.

Parameter compression: TinyMBERT performs at par with MMNER while obtaining at least 41x compression by learning a single model across all 41 languages, as opposed to learning language-specific models.

Figure 1(a) shows the variation in F1-score of TinyMBERT and its compression against MBERT for different configurations of the embedding dimension (E) and number of BiLSTM hidden states (H). We observe that reducing the embedding dimension leads to large compression with minimal performance loss, whereas reducing the BiLSTM hidden states impacts the performance more and contributes less to the compression.

Inference speedup: We compare the runtime inference efficiency of MBERT and our model on a single P100 GPU for batch inference (batch size = 32) on queries of fixed sequence length. We average the time taken for predicting labels over all queries for each model, aggregated over multiple runs. Compared to batch inference, the speedups are smaller for online inference (batch size = 1) on an Intel(R) Xeon(R) CPU (E5-2690 v4 @2.60GHz) (refer to the Appendix for details).

Figure 1(b) shows the variation in F1-score of TinyMBERT and its inference speedup against MBERT with the same (linked) parameter configurations as before. As expected, the performance degrades with gradual speedup. We observe that parameter compression does not necessarily lead to inference speedup: reducing the word embedding dimension leads to massive model compression, but does not have a similar effect on latency. The BiLSTM hidden states, on the other hand, constitute the real latency bottleneck. One of the best configurations leads to 35x compression and 51x speedup over MBERT while retaining nearly 95% of its performance.

Model #Transfer Samples F1
MMNER - 62.1
MBERT - 79.54
TinyMBERT 4.1K 19.12
TinyMBERT 705K 76.97
TinyMBERT 1.3MM 77.17
TinyMBERT 7.2MM 77.26
Table 6: F1-score comparison for the low-resource setting with 100 labeled samples per language and transfer sets of different sizes for TinyMBERT.

5.3 Low-resource NER and Distillation

Models in all prior experiments are trained on 705K labeled instances across all languages. In this setting, we instead consider only 100 labeled samples for each language, with a total of 4.1K instances. From Table 6, we observe MBERT to outperform MMNER by more than 17 percentage points, with TinyMBERT closely following suit.

Furthermore, we observe our model's performance to improve with the transfer set size, depicting the importance of unlabeled transfer data for knowledge distillation. As before, beyond a point additional data contributes only marginally.

5.4 Word Embeddings

Random initialization of word embeddings works well. Multi-lingual FastText embeddings (bojanowski2016enriching) lead to minor improvement due to the overlap between FastText tokens and MBERT wordpieces; English Glove does much better. We experiment with recent dimensionality reduction techniques and find SVD to work best. Surprisingly, it leads to marginal improvement over MBERT embeddings before reduction. As expected, MBERT embeddings after fine-tuning perform better than those from pre-trained checkpoints (refer to the Appendix for F1-measures).

5.5 Architectural Considerations

Which teacher layer to distil from? The topmost teacher layer captures the most task-specific knowledge. However, it may be difficult for a shallow student to capture this knowledge given its limited capacity. On the other hand, the shallower representations in the middle of the teacher model are easier for the student to mimic. We observe the student to benefit most from distilling one of the middle layers of the teacher (results in Appendix).

Figure 3: BiLSTM and Transformer F1-scores (left y-axis) vs. inference latency (right y-axis) in 13 different settings, with corresponding embedding dimension and width/depth of the student.

Which student architecture to use for distillation? Recent works in distillation leverage both BiLSTMs and Transformers as students. In this experiment, we vary the embedding dimension and hidden states for BiLSTM-based students, and the embedding dimension and depth for Transformer-based students, to obtain configurations with similar inference latency. Each of the 13 configurations in Figure 3 depicts the F1-scores obtained by students of different architectures but similar latency, for strategy D0-S in Table 3. We observe that for low-latency configurations, BiLSTMs work better than shallow Transformers, whereas the latter start performing better at greater depths, although with higher latency.

5.6 Distillation for Text Classification

Model Transfer Set Acc.
BERT Large Teacher - 94.95
TinyBERT SST+Imdb 93.35
BERT Base Teacher - 92.78
TinyBERT SST+Imdb 92.89
sun2019patient SST 92.70
turc2019wellread SST+IMDB 91.10
Table 7: Model accuracy on the SST-2 dev. set.

We now switch gears and focus on classification tasks. In contrast to sequence tagging, we use the last hidden state of the BiLSTM as the final sentence representation for projection, regression and softmax.

Comparison with baselines: Since we focus only on single-instance classification in this work, SST-2 (socher-etal-2013-parsing) is the only GLUE benchmark on which to compare against other distillation techniques. Table 7 shows the accuracy comparison with such methods as reported on the SST-2 development set.

We extract sentences from all IMDB movie reviews in Table 1 to form the unlabeled transfer set for distillation. We obtain the best performance when distilling from BERT Large (uncased, whole-word-masking model) rather than BERT Base – demonstrating better student performance with a better teacher, and outperforming other methods.

Other classification tasks: Table 8 shows the distillation performance of TinyBERT with different teachers. We observe the student to almost match the teacher performance. The performance also improves with a better teacher, although the improvement is marginal as the student model saturates.

Dataset | Student (no distil.) | Distil (BERT Base) | Distil (BERT Large) | BERT Base | BERT Large
Ag News 89.71 92.33 94.33 92.12 94.63
IMDB 89.37 91.22 91.70 91.70 93.22
Elec 90.62 93.55 93.56 93.46 94.27
DbPedia 98.64 99.10 99.06 99.26 99.20
Table 8: Distillation performance with BERT.

Table 9 shows the distillation performance with limited labeled samples per class. The distilled student improves over the non-distilled version by up to 27 percentage points and matches the teacher performance for all of the tasks, demonstrating the impact of distillation for low-resource settings.

Dataset | Student (no distil.) | Student (with distil.) | BERT Large
AG News 85.85 90.45 90.36
IMDB 61.53 89.08 89.11
Elec 65.68 91.00 90.41
DBpedia 96.30 98.94 98.94
Table 9: Distillation with BERT Large on limited labeled samples per class.

6 Related Work

Model compression and knowledge distillation: Prior works in the vision community dealing with huge architectures like AlexNet and ResNet have addressed this challenge in two ways. Works in model compression use quantization (DBLP:journals/corr/GongLYB14), low-precision training and pruning the network, as well as their combination (HanMao16) to reduce the memory footprint. On the other hand, works in knowledge distillation leverage student teacher models. These approaches include using soft logits as targets (DBLP:conf/nips/BaC14), increasing the temperature of the softmax to match that of the teacher (DBLP:journals/corr/HintonVD15) as well as using teacher representations (DBLP:journals/corr/RomeroBKCGB14) (refer to (DBLP:journals/corr/abs-1710-09282) for a survey).

Recent and concurrent works: DBLP:journals/corr/abs-1904-09482; zhu-etal-2019-panlp; Clark-2019 leverage ensembling to distil knowledge from several multi-task deep neural networks into a single model. sun2019patient; sanh2019; aguilar2019knowledge train student models leveraging architectural knowledge of the teacher models, which adds architectural constraints (e.g., embedding dimension) on the student. In order to address this shortcoming, more recent works combine task-specific distillation with pre-training student models of arbitrary embedding dimension, while still relying on transformer architectures (turc2019wellread; jiao2019tinybert; zhao2019extreme).

izsak2019training; Shi_2019 extend these techniques to sequence tagging for Part-of-Speech (POS) tagging and Named Entity Recognition (NER) in English. The work closest to ours, Tsai_2019, extends the above to multi-lingual NER.

Most of these works rely on general corpora for pre-training and task-specific labeled data for distillation. To harness additional knowledge, (turc2019wellread) leverage task-specific unlabeled data. (DBLP:journals/corr/abs-1903-12136; jiao2019tinybert) use rule- and embedding-based data augmentation in the absence of such unlabeled data.

7 Conclusions

We develop a multi-stage distillation framework for massive multi-lingual NER and classification that performs close to huge pre-trained models with massive compression and inference speedup. Our distillation strategy, leveraging teacher representations agnostic of the teacher's architecture together with a stage-wise optimization schedule, outperforms existing ones. We perform an extensive study of several hitherto less-explored distillation dimensions, like the impact of the unlabeled transfer set, embeddings and student architectures, and make interesting observations.