Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models

09/10/2022
by Jared Lichtarge, et al.

Recent trends toward training ever-larger language models have substantially improved machine learning performance across linguistic tasks. However, the huge cost of training larger models can make tuning them prohibitively expensive, motivating the study of more efficient methods. Gradient-based hyper-parameter optimization offers the capacity to tune hyper-parameters during training, yet has not previously been studied in a sequence-to-sequence setting. We apply a simple and general gradient-based hyper-parameter optimization method to sequence-to-sequence tasks for the first time, demonstrating both efficiency and performance gains over strong baselines for both Neural Machine Translation and Natural Language Understanding (NLU) tasks (via T5 pretraining). For translation, we show that the method generalizes across language pairs, is more efficient than Bayesian hyper-parameter optimization, and that learned schedules for some hyper-parameters can outperform even optimal constant-valued tuning. For T5, we show that learning hyper-parameters during pretraining can improve performance across downstream NLU tasks. When learning multiple hyper-parameters concurrently, we show that the global learning rate can follow a schedule over training that improves performance and is not explainable by the 'short-horizon bias' of greedy methods <cit.>. We release the code used to facilitate further research.
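The authors' released code is the authoritative reference for their method. As a rough illustration of the general idea only, the sketch below tunes a single hyper-parameter (the learning rate) by gradient descent during training, using a hypergradient-style update in PyTorch. The toy model, data, and meta learning rate are assumptions for illustration and do not reflect the paper's implementation.

```python
# Minimal sketch (assumptions throughout): adapt the learning rate online
# via its own gradient, in the spirit of gradient-based hyper-parameter
# optimization. Not the paper's method or code.
import torch

torch.manual_seed(0)

# Toy regression problem standing in for a real sequence-to-sequence model.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = torch.nn.MSELoss()

lr = 0.01         # hyper-parameter tuned during training
meta_lr = 1e-4    # step size for the hyper-parameter update (assumed)
prev_grads = None  # gradients from the previous optimizer step

for step in range(100):
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, model.parameters())

    if prev_grads is not None:
        # Hypergradient: since theta_t = theta_{t-1} - lr * g_{t-1},
        # d(loss)/d(lr) is approximately -<g_t, g_{t-1}>.
        hypergrad = -sum((g * pg).sum() for g, pg in zip(grads, prev_grads))
        lr = lr - meta_lr * hypergrad.item()  # gradient step on the hyper-parameter

    # Plain SGD update with the (possibly adjusted) learning rate.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= lr * g

    prev_grads = [g.detach() for g in grads]
```

In this toy setup the learning rate follows a schedule determined by the training signal itself rather than a hand-designed decay, which is the flavor of result the abstract describes for the global learning rate.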


