Industry Scale Semi-Supervised Learning for Natural Language Understanding

03/29/2021
by Luoxin Chen, et al.

This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in a production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how the selected data affect the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, Pseudo-Label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT), and Cross-View Training (CVT), in conjunction with two data selection methods: committee-based selection and submodular-optimization-based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each method is likely to improve large-scale NLU systems.
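Of the four SSL techniques compared, Pseudo-Label (PL) is the simplest to illustrate: a teacher trained on labeled data assigns labels to unlabeled examples, and only confident predictions are added to the student's training set. The sketch below uses a hand-rolled nearest-centroid bag-of-words classifier so it is self-contained; the toy utterances, intents, and confidence threshold are illustrative assumptions, not the paper's actual models or data.

```python
# Hedged sketch of Pseudo-Label (PL) SSL for intent classification (IC).
# The nearest-centroid bag-of-words "models", toy utterances, and the
# confidence threshold are illustrative assumptions, not the paper's setup.
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector represented as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(texts, labels):
    """Teacher/student = one bag-of-words centroid per intent."""
    centroids = {}
    for t, y in zip(texts, labels):
        centroids.setdefault(y, Counter()).update(bow(t))
    return centroids

def predict(centroids, text):
    """Return (best_intent, confidence), confidence = top cosine score."""
    scores = {y: cosine(bow(text), c) for y, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

labeled = [("play some jazz", "PlayMusic"),
           ("what is the weather today", "GetWeather"),
           ("turn off the lights", "SmartHome")]
unlabeled = ["play rock music please", "weather in boston today",
             "switch off the lamp"]

teacher = train(*zip(*labeled))

# Keep only pseudo-labels the teacher is confident about (threshold is a guess)
pseudo = [(t, y) for t in unlabeled
          for y, conf in [predict(teacher, t)] if conf >= 0.25]

# Retrain the student on labeled + pseudo-labeled data
student = train(*zip(*(labeled + pseudo)))
```

In production, the teacher and student would be neural NLU models and the pool would hold millions of utterances, but the loop is the same: label, filter by confidence, retrain.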
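For the data-selection question, one common committee-based variant retains an unlabeled sample only when enough independently trained teachers agree on its label. The keyword-rule "teachers" and the unanimity threshold below are toy stand-ins to show the mechanism, not the paper's implementation.

```python
# Hedged sketch of committee-based data selection: keep an unlabeled sample
# only when at least `min_agree` committee members vote for the same label.
# The keyword-rule teachers and the threshold are illustrative assumptions.
from collections import Counter

def committee_select(samples, teachers, min_agree):
    """Return (sample, label) pairs where >= min_agree teachers agree."""
    selected = []
    for s in samples:
        votes = Counter(t(s) for t in teachers)
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            selected.append((s, label))
    return selected

# Toy committee: keyword heuristics standing in for independently trained models
teacher_a = lambda s: "PlayMusic" if "play" in s else "BookFlight"
teacher_b = lambda s: "PlayMusic" if "music" in s or "song" in s else "BookFlight"
teacher_c = lambda s: "BookFlight" if "flight" in s or "book" in s else "PlayMusic"

pool = ["play rock music", "book a flight", "play the flight song"]

# Unanimity keeps the first two samples and drops the ambiguous third
selected = committee_select(pool, [teacher_a, teacher_b, teacher_c], min_agree=3)
```

Requiring agreement trades pool size for label quality; lowering `min_agree` admits more (noisier) samples into SSL training.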

