Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

06/15/2022
by Jack FitzGerald, et al.

We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% for intent classification and 7.01% for slot filling. We find that a model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to a teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by at least 4.23%. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by at least 3.74% on an automatic measurement of full-system user dissatisfaction.
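To make the distillation step concrete, below is a minimal sketch of generic response-based knowledge distillation in PyTorch, in which a small student encoder is trained to match a frozen teacher's softened output distribution alongside the usual hard-label loss. The temperature, mixing weight, and training-loop details are illustrative assumptions and are not taken from the paper.

# Illustrative sketch only: generic soft-target knowledge distillation in PyTorch.
# The temperature, loss weighting, and usage pattern below are assumptions,
# not the exact recipe used for the Alexa Teacher Models.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher -> student) with hard-label cross-entropy."""
    # Soften both distributions with the temperature, then match them via KL divergence.
    teacher_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(student_log_probs, teacher_log_probs,
                  log_target=True, reduction="batchmean")
    kd = kd * temperature ** 2  # standard rescaling so gradients match the unscaled case
    ce = F.cross_entropy(student_logits, labels)  # supervised loss on gold labels
    return alpha * kd + (1.0 - alpha) * ce

# Usage pattern (hypothetical): the teacher is frozen and run in inference mode,
# and only the student receives gradient updates.
# with torch.no_grad():
#     t_logits = teacher(**batch).logits
# s_logits = student(**batch).logits
# loss = distillation_loss(s_logits, t_logits, batch["labels"])
# loss.backward(); optimizer.step(); optimizer.zero_grad()

Because the teacher only runs forward passes, a multi-billion-parameter encoder can supervise a much smaller student that is cheap enough to serve in a production NLU system.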

