ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise

01/29/2022
by Minjia Zhang, et al.

In recent years, large pre-trained Transformer-based language models have led to dramatic improvements on many natural language understanding tasks. To train models of ever-increasing size, practitioners often increase the batch size so that training can be spread across many GPUs. However, large batch sizes frequently make optimization harder, causing slow convergence or poor generalization and sometimes requiring orders of magnitude more training time to reach the same model quality. In this paper, we examine the steepness of the loss landscape of large-batch optimization when adapting pre-trained Transformer-based language models to domain-specific tasks and find that the landscape tends to be highly complex and irregular, posing challenges to generalization on downstream tasks. To tackle this challenge, we propose ScaLA, a novel and efficient method that accelerates the adaptation of pre-trained Transformer networks. Unlike prior methods, we take a sequential game-theoretic approach and inject lightweight adversarial noise into large-batch optimization, which significantly improves adaptation speed while preserving model generalization. Experimental results show that ScaLA attains 2.7–9.8× adaptation speedups over the baseline on GLUE with BERT-base and RoBERTa-large, while achieving comparable, and sometimes higher, accuracy than state-of-the-art large-batch optimization methods. Finally, we address the theoretical side of large-batch optimization with adversarial noise and provide a convergence rate analysis for ScaLA using techniques for analyzing non-convex saddle-point problems.
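The abstract describes injecting lightweight adversarial noise into large-batch fine-tuning, which amounts to a min-max (saddle-point) procedure: an inner step perturbs the input embeddings to increase the loss within a small bounded region, and an outer step updates the model parameters on that perturbed loss. The sketch below is a minimal, generic illustration of such an adversarial-noise fine-tuning step in PyTorch; it is not the authors' ScaLA implementation, and the function name `adversarial_finetune_step` and the hyperparameters `adv_lr`, `adv_eps`, and `adv_steps` are illustrative assumptions.

```python
# A minimal, generic sketch of a fine-tuning step with adversarial noise added
# in embedding space: an inner maximization over a bounded perturbation,
# followed by an outer parameter update on the perturbed loss. This is NOT the
# authors' ScaLA implementation; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def adversarial_finetune_step(model, embeddings, labels, optimizer,
                              adv_lr=0.1, adv_eps=1e-3, adv_steps=1):
    # Inner max: start from a small random perturbation of the input
    # embeddings and take a few ascent steps on the loss with respect to it.
    delta = torch.zeros_like(embeddings).uniform_(-adv_eps, adv_eps)
    delta.requires_grad_(True)
    for _ in range(adv_steps):
        loss = F.cross_entropy(model(embeddings + delta), labels)
        (grad,) = torch.autograd.grad(loss, delta)
        # Signed gradient-ascent step, projected back onto the eps-ball.
        delta = (delta + adv_lr * grad.sign()).clamp(-adv_eps, adv_eps)
        delta = delta.detach().requires_grad_(True)

    # Outer min: update the model parameters on the adversarially
    # perturbed loss.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(embeddings + delta.detach()), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()


if __name__ == "__main__":
    # Toy stand-in for a Transformer classification head, only to keep the
    # sketch self-contained and runnable end to end.
    model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.ReLU(),
                                torch.nn.Linear(768, 2))
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    embeddings = torch.randn(512, 768)      # one large batch of pooled embeddings
    labels = torch.randint(0, 2, (512,))
    print(adversarial_finetune_step(model, embeddings, labels, optimizer))
```

Per the abstract, the key design point is that the adversarial noise is kept lightweight, so the extra cost of the inner step stays small relative to the generalization it preserves under very large batch sizes.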

