
Multi-armed bandits for online optimization of language model pre-training: the use case of dynamic masking

03/24/2022
by   Iñigo Urteaga, et al.

Transformer-based language models (TLMs) provide state-of-the-art performance in many modern natural language processing applications. TLM training is conducted in two phases. First, the model is pre-trained over large volumes of text to minimize a generic objective function, such as the Masked Language Model (MLM). Second, the model is fine-tuned in specific downstream tasks. Pre-training requires large volumes of data and high computational resources, while introducing many still unresolved design choices. For instance, selecting hyperparameters for language model pre-training is often carried out based on heuristics or grid-based searches. In this work, we propose a multi-armed bandit-based online optimization framework for the sequential selection of pre-training hyperparameters to optimize language model performance. We pose the pre-training procedure as a sequential decision-making task, where at each pre-training step, an agent must determine what hyperparameters to use towards optimizing the pre-training objective. We propose a Thompson sampling bandit algorithm, based on a surrogate Gaussian process reward model of the MLM pre-training objective, for its sequential minimization. We empirically show how the proposed Gaussian process based Thompson sampling pre-trains robust and well-performing language models. Namely, by sequentially selecting masking hyperparameters of the TLM, we achieve satisfactory performance in fewer epochs, not only in terms of the pre-training MLM objective, but also in diverse downstream fine-tuning tasks. The proposed bandit-based technique provides an automated hyperparameter selection method for pre-training TLMs of interest to practitioners. In addition, our results indicate that, instead of MLM pre-training with fixed masking probabilities, sequentially adapting the masking hyperparameters improves both pre-training loss and downstream task metrics.



1 Introduction

In the field of Natural Language Processing (NLP), models for learning unsupervised representations from unlabeled text based on Transformer architectures (58) have attained state-of-the-art results on diverse tasks, e.g., question answering and language inference (23). Transformer-based language models (TLMs), such as BERT (10) and RoBERTa (34), rely on the combination of unsupervised pre-training of the model and a subsequent task-specific fine-tuning procedure, via additional neural network layers targeted to the task of interest. TLMs are pre-trained over large unlabeled text data using self-supervision, i.e., by learning the relationships between different sections, sentences or words of the input data. Once the TLM is pre-trained over large volumes of text, it can be used in various downstream tasks after fine-tuning task-specific layers.

The key insight from pre-trained TLMs is that they learn language representations or embeddings that are useful across downstream tasks, minimizing the need to retrain the entire model from scratch. The advantage is that extensive pre-training of TLMs can lead to significant downstream performance improvements, i.e., it is worth training complex TLMs on huge natural language corpora before fine-tuning them for particular tasks.

Following the success of TLMs in general NLP tasks, many have replicated the pre-train, then fine-tune framework in different specific domains, ranging from language models pre-trained with scientific documents in SciBERT (5) and biomedical corpora in BioBERT (32), ClinicalBERT (3), and BlueBERT (17); to in-house, industry-specific implementations and pre-training of TLMs (23). In addition, the importance of further pre-training with in-domain corpora, a procedure known as continual training (23), has also been documented to yield downstream performance gains (18).

However, even if conceptually simple and empirically powerful, pre-training is challenging and expensive: the relationship between the Transformer architecture, the training corpus, the training hyperparameters, and the evaluation metrics is multi-modal and complex. Furthermore, many have highlighted the importance of previously overlooked design choices in pre-training (such as deciding the pre-training metric and optimizing hyperparameters) that result in significant performance differences.

In this work, our goal is to improve the pre-training procedure of TLMs, by selecting pre-training hyperparameters that result in optimized performance. We argue that an optimized selection of pre-training hyperparameters will accelerate pre-training (i.e., achieve a satisfactory evaluation metric value in fewer epochs) and allow for a better pre-training procedure (i.e., achieve a superior metric value). Increased efficiency in TLM training is all the more important amidst rising concerns pertaining to the carbon footprint of large language models (41); and more specifically, the significant impact hyperparameter choice has on power consumption (43).

Our TLM pre-training use case is random dynamic masking hyperparameter optimization, in contrast to alternative (rule- or task-based) MLM dynamic masking approaches, such as SpanBERT (22) and ERNIE (52). Even though (34) showed the efficiency and benefits of random dynamic masking, the selection of the masking probability hyperparameters is often carried out based on heuristics or grid-based search approaches. Instead, we investigate automating TLM pre-training hyperparameter selection via multi-armed bandit (MAB) optimization.

We cast the TLM pre-training hyperparameter selection procedure as a sequential decision process, in which, at each interaction, an agent selects an action (e.g., pre-training hyperparameters) to maximize cumulative rewards (e.g., a pre-training metric). In the dynamic masking use case, the MAB actions (i.e., arms) are the dynamic masking choices, and the masked language model performance is the unknown function the bandit algorithm tries to maximize.

Hyperparameter search in machine learning is often addressed as a black-box optimization problem, where the aim is to optimize a computationally expensive function with no additional information known about it. These black-box optimization problems are often solved using evolutionary algorithms (62), entropy-search based methods (19, 20), or Bayesian optimization (BO) (12).

BO can tackle both the problem of optimizing a black-box function with possibly noisy evaluations (50), and that of speeding up the allocation of resources to promising hyperparameter configurations, as in (33). We here focus on the former task and, aligned with the findings of Turner et al. (56) that BO is effective for hyperparameter tuning, propose a BO approach for sequential tuning of the dynamic masking procedure in MLMs. To that end, we probabilistically model a surrogate for the pre-training objective function and propose a bandit-based technique for its sequential optimization.

In contrast to novel work that aims at deciding which subsets of tokens to mask via combinatorial optimization and dynamic programming (59), we target online learning of appropriate dynamic masking hyperparameters via reinforcement learning (i.e., multi-armed bandits). In addition, and in contrast to approaches that adapt the language model's masking policy to a particular task of interest (24), we aim to find the sequential set of MLM pre-training choices that result not only in performant pre-trained TLMs, but also in models that perform best across diverse fine-tuning tasks.

Contributions.

The specific contributions of this work are:

  • To present a bandit-based generic framework for online, black-box optimization of TLM pre-training.

  • To formulate a Gaussian process based Thompson sampling algorithm for online MLM-loss minimization of TLMs. The novelty of the presented framework lies in fitting the estimated pre-training validation losses with a Gaussian process reward model for the formulation of a Thompson sampling bandit policy, which results in an equivalence between bandit cumulative reward maximization and pre-training loss minimization.

  • To showcase empirically that the proposed algorithm efficiently pre-trains TLMs with robust and satisfactory performance, both in pre-training and across diverse downstream fine-tuned tasks.

  • To show that sequentially deciding, based on the proposed bandit-based algorithm, how many tokens of the input to mask —and how to mask them— results in improved dynamic masking-based MLM pre-training.

The rest of the manuscript is organized as follows: Section 2 provides a succinct background on Bayesian optimization, the multi-armed bandit and the TLM pre-training procedure; Section 3 describes the proposed method for bandit-based TLM pre-training optimization; with results on its empirical performance evaluated in Section 4, and concluding remarks provided in Section 5.

2 Background

2.1 Bayesian optimization and multi-armed bandits

Bayesian optimization (BO) is a widely used technique to address the problem of hyperparameter optimization in machine learning (50, 26, 56) and many closely related applications in engineering, control systems, materials, and drug discovery (37, 8, 13, 21, 9). BO relies on a probabilistic surrogate model (providing a measure of uncertainty) for the objective function (47, 12) to tackle the fundamentally challenging problem of simultaneously fitting and optimizing a high-dimensional, non-convex function with unknown smoothness and possibly noisy evaluations. Given the black-box optimization nature of BO, it is of paramount importance that the surrogate model provides a measure of uncertainty, for which generative models, Bayesian neural networks and Gaussian processes are often used (36). Using this surrogate model, an acquisition function determines the most promising point to evaluate next. The multi-armed bandit is a useful framework for addressing this challenge of learning about the environment (i.e., exploration) while simultaneously maximizing the outcomes observed (exploitation).

The multi-armed bandit (MAB) is a well-studied abstraction for problems that require learning while simultaneously maximizing the rewards obtained (30), i.e., balancing the exploration-exploitation tradeoff (31). A MAB is a sequential decision process between an agent and an unknown environment that requires decision-making under uncertainty (48). Mathematically, at each interaction t, a bandit agent needs to choose an action (arm) a_t from a (not necessarily finite) set of actions \mathcal{A}. It then observes a stochastic reward r_t drawn from the unknown, stochastic reward distribution of the selected arm, r_t \sim p(r \mid a_t, \theta). The reward function is in general unknown and dependent on properties often characterized parametrically by \theta. The goal of a MAB agent is to maximize the expected (cumulative) rewards, \mathbb{E}[\sum_t r_t], where we denote each arm's expected reward as \mu_a = \mathbb{E}[r \mid a, \theta]. The challenge in MAB reward maximization is the lack of knowledge about the reward-generating model (e.g., its parameters), which demands learning the properties of the reward distribution as the agent interacts with the environment.

Bandit algorithms.

Since the introduction of the MAB problem by Thompson (53), diverse algorithms have been proposed and analyzed to solve it, from computing optimal strategies for certain types of bandits (15) and probabilistically greedy approaches (4), to upper confidence bound (UCB) (29, 25) and Thompson sampling (54, 46) algorithms. The latter bandit strategies rely on a stochastic model-based view of the MAB, where a reward model is specified with unknown, to-be-learned parameters. For models in the exponential family, these algorithms have been empirically and theoretically proven to perform competitively (29, 25, 1, 2, 27). Extensions to accommodate reward functions not in the exponential family have also been proposed, by modeling observed rewards via ensembles of plausible models (35), using Gaussian mixture models (57) and Gaussian processes (51, 16, 28), as well as with neural networks (6, 39).
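For intuition on the Thompson sampling principle referenced above, a minimal sketch for the simplest Beta-Bernoulli bandit is given below; it is purely illustrative (the arms and reward probabilities are made up) and is not the Gaussian process reward model this work builds on in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli reward probabilities
alpha = np.ones_like(true_probs)          # Beta posterior parameters (successes + 1)
beta = np.ones_like(true_probs)           # Beta posterior parameters (failures + 1)

for t in range(1000):
    # Thompson sampling: sample each arm's mean from its posterior, play the argmax
    sampled_means = rng.beta(alpha, beta)
    arm = int(np.argmax(sampled_means))
    reward = float(rng.random() < true_probs[arm])
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print("posterior mean estimates:", alpha / (alpha + beta))
```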

In the context of BO in general, and MABs in particular, reward uncertainty quantification is critical. On the one hand, Riquelme et al. (45) emphasized the need to investigate how to sidestep the slow convergence of the uncertainty estimates in neural network based bandit algorithms. On the other hand, Gaussian processes (44) have been shown to provide not only adequate Bayesian uncertainty estimates, but also a successful approach to specifying surrogate models that encode smoothness assumptions of the payoff function in different bandit tasks (28, 7, 38).

As detailed in Section 3, we resort to a Gaussian process surrogate reward model in the proposed bandit-based optimization of TLM pre-training.

2.2 Language model pre-training and the Masked Language Model

Language model pre-training aims at learning language representations that are useful across tasks, i.e., pre-training allows for a model to be better initialized for quick fine-tuning (while avoiding overfitting) to specific downstream tasks. With pre-training, TLMs learn language representations based on the supervision provided by one or more pre-training tasks. A pre-training task is a self-supervised task whose labels are generated automatically. Two popular objectives for TLM pre-training are the Masked Language Model (MLM) and Next Sentence Prediction (NSP).

We focus on MLM pre-training, as initially proposed by Devlin et al. (10) and implemented by many others (34, 23). MLMs learn by taking an input sequence of words, where a random sample of the tokens is replaced with the special mask token [MASK], and learning to predict them. That is, for a given input sequence of tokens (with special tokens delimiting it),

x = (x_1, x_2, \cdots, x_I) ,   (1)

MLMs select a random sample of the tokens, replace them with the mask, and learn to predict these masked tokens, utilizing both left and right contexts when using TLMs.

Dynamic masking.

In the original BERT model pre-training (10), a random but static subset of the input sequence tokens is replaced with the mask token. On the contrary, Liu et al. (34) proposed a dynamic masking procedure, which generates a new masking pattern (given a fixed probability of masking) for every sequence fed to the model. Liu et al. (34) demonstrate that this dynamic approach becomes crucial when pre-training for more steps or with larger datasets, attaining better pre-trained and fine-tuned performance. Dynamic masking relies on several hyperparameters, specifically: the probability of replacing an input token with the mask (mask-prob), the probability that a masked token is left unmasked (leave-unmasked-prob), and the probability of replacing a token with a random token instead of with the mask (random-token-prob). The online optimization of these dynamic masking hyperparameters is the use case for our experiments in Section 4.
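As an illustration of how these three probabilities interact, the following sketch applies dynamic masking to a toy token sequence, drawing a fresh masking pattern on every call in the spirit of (34); the function name and vocabulary are hypothetical, and details such as whole-word masking and special-token handling are omitted.

```python
import random

def dynamic_mask(tokens, vocab, mask_prob=0.15, leave_unmasked_prob=0.1,
                 random_token_prob=0.1, mask_token="<mask>"):
    """Toy dynamic masking: each call generates a new masking pattern."""
    masked_tokens, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:       # token selected as a prediction target
            targets.append(tok)
            u = random.random()
            if u < leave_unmasked_prob:                          # keep original token
                masked_tokens.append(tok)
            elif u < leave_unmasked_prob + random_token_prob:    # random replacement
                masked_tokens.append(random.choice(vocab))
            else:                                                # replace with mask
                masked_tokens.append(mask_token)
        else:
            masked_tokens.append(tok)
            targets.append(None)              # not predicted / ignored in the loss
    return masked_tokens, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
print(dynamic_mask(sentence, vocab=sentence))
```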

MLM pre-training.

In pre-training, one aims at minimizing the MLM loss, which is a function of the original dataset \mathcal{D} and its masked version \tilde{\mathcal{D}}, the TLM architecture parameters w, and the hyperparameters \phi of the pre-training procedure. The MLM objective is the cross-entropy loss for predicting the masked tokens in the masked sequence \tilde{x}, where we denote with \mathcal{M}(x) the indices of the masked tokens \tilde{x}_i, i \in \mathcal{M}(x), in the original input sequence x. Mathematically,

L_{MLM}(w; \mathcal{D}, \tilde{\mathcal{D}}, \phi) = \sum_{x \in \mathcal{D}} l(w; x, \tilde{x}, \phi) ,   (2)
l(w; x, \tilde{x}, \phi) = - \sum_{i \in \mathcal{M}(x)} \log p_w(x_i \mid \tilde{x}) ,   (3)

where we explicitly indicate with \phi the dependency with respect to all hyperparameters relevant in the pre-training and optimization procedures.

The analytical form of the MLM loss function, which is a function of the hyperparameters \phi used and the data where it is evaluated, is in general complex and unknown. However, estimates of the MLM loss are available at every epoch of pre-training, i.e., an empirical estimate of the MLM loss can be computed. For the sake of fair comparisons under different training setups (e.g., mini-batch sizes or other hyperparameters), per-epoch averaged empirical MLM losses are computed in the validation dataset \mathcal{D}_{val},

\bar{l}_e(\phi) = \frac{1}{|\mathcal{D}_{val}|} \sum_{x \in \mathcal{D}_{val}} l(w_e; x, \tilde{x}, \phi) .   (4)

The pre-training objective is to find the TLM architecture parameters w that minimize the MLM loss for the whole dataset \mathcal{D} and its masked version \tilde{\mathcal{D}}. In practice, this minimization is commonly executed via stochastic gradient-descent methods, run for e = 1, \cdots, E epochs with randomly drawn mini-batches \mathcal{B}_e \subseteq \mathcal{D},

\hat{w} = \arg\min_{w} \sum_{e=1}^{E} \sum_{x \in \mathcal{B}_e} l(w; x, \tilde{x}, \phi) .   (5)

3 Proposed method

We hereby propose to optimize the TLM pre-training procedure by casting it as a sequential decision process, where we tackle the problem of sequentially fitting and optimizing a pre-training black-box loss function with noisy evaluations. We pose the task of TLM pre-training with noisy MLM loss observations as a multi-armed bandit problem. We first determine the appropriate action space, and then formulate a proper surrogate reward function (leveraging the observed empirical MLM validation losses) that the bandit maximizes for its sequential selection of arms.

We define pre-training steps (i.e., a fixed number u of stochastic gradient updates; note that these updates might or might not correspond to a full pre-training epoch e) as bandit interactions t = 1, \cdots, T, towards minimizing a TLM pre-training objective given tunable hyperparameters \phi —with (stochastic) objective evaluations estimated in the validation set. In the use case of MLM pre-training with dynamic masking, in each bandit interaction one selects the hyperparameters \phi_t (e.g., the number of tokens to mask and the associated random masking probabilities), pre-trains the TLM for u stochastic updates that minimize the MLM loss as in Equation (5), and evaluates the pre-trained model's MLM performance in the validation subset as per Equation (4). To that end, we identify the pre-training hyperparameters at interaction t, \phi_t, as the bandit's arms, i.e., a_t = \phi_t.

Due to the black-box nature of the pre-training objective (with only stochastic evaluations available), we formulate below the stochastic reward function surrogate needed to formalize the MAB approach to online optimization of TLM pre-training.

3.1 From MLM pre-training to Gaussian process based regret minimization

We hereby devise a bandit reward function that results in the sequential optimization of the MLM pre-training objective. To that end, we transform the empirical pre-training validation loss of each pre-training interaction into a reward quantity that allows for its cumulative optimization. To accommodate the empirical, stochastic loss estimates collected from the unknown analytical form of the loss function, we resort to Gaussian process modeling.

Bandit rewards as empirical MLM loss differences.

To guarantee that the cumulative rewards a bandit agent maximizes result in minimization of the pre-training objective, we compute the observed empirical rewards as the difference in averaged MLM losses between consecutive bandit interactions, i.e.,

r_t = \bar{l}_{t-1}(a_{t-1}) - \bar{l}_t(a_t) ,   (6)

where \bar{l}_t(a_t) is the averaged validation MLM loss of Equation (4) after bandit interaction t, and we have dropped the dependency of the MLM loss with respect to the TLM parameters w for ease of exposition.

We now show that maximizing the cumulative rewards as defined above is equivalent to minimizing the validation loss at interaction T. First, we compute the cumulative rewards,

R_T = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T} \left[ \bar{l}_{t-1}(a_{t-1}) - \bar{l}_t(a_t) \right]   (7)
    = \bar{l}_0 - \bar{l}_T(a_T) ,   (8)

where \bar{l}_0 is a constant, the initial loss of a randomly initialized model. We then conclude that maximizing cumulative rewards,

\arg\max_{a_{1:T}} R_T = \arg\max_{a_{1:T}} \left[ \bar{l}_0 - \bar{l}_T(a_T) \right]   (9)
                     = \arg\min_{a_{1:T}} \bar{l}_T(a_T) ,   (10)

is equivalent to minimizing the validation MLM loss. Therefore, a bandit agent that aims at maximizing the cumulative rewards as in Equation (7) minimizes the MLM pre-training objective of Equation (4).
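The telescoping argument above can be checked numerically; the snippet below uses made-up per-interaction validation losses purely to illustrate the identity.

```python
# Hypothetical averaged MLM validation losses; l_bar[0] is the initial loss of a
# randomly initialized model, l_bar[t] the loss after bandit interaction t.
l_bar = [9.3, 7.1, 5.8, 5.2, 4.9, 4.7]

# Rewards as loss differences between consecutive interactions, as in Equation (6)
rewards = [l_bar[t - 1] - l_bar[t] for t in range(1, len(l_bar))]

# Cumulative rewards telescope to (initial loss - final loss), Equations (7)-(8):
# maximizing the sum of rewards therefore minimizes the final validation loss.
assert abs(sum(rewards) - (l_bar[0] - l_bar[-1])) < 1e-12
print(sum(rewards), l_bar[0] - l_bar[-1])
```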

Bandit reward functions as Gaussian process models.

In practice, TLM pre-training is carried out based on empirical risk minimization: i.e., only empirical estimates of the true MLM objective are available. Namely, rewards as defined in Equation (6) are stochastic draws from an analytically unknown objective function. To accommodate these stochastic observations of the unknown loss function —that we aim at optimizing with respect to its hyperparameters \phi— we model the bandit reward function via a Gaussian process, with the observed (stochastic) rewards independent and identically (i.i.d.) distributed as

r_t = f(a_t) + \epsilon_t , \quad f \sim GP(\mu(\cdot), k(\cdot, \cdot)) ,   (11)

where f is a Gaussian process (GP) surrogate model of the pre-training objective, and \epsilon_t denotes the stochastic nature of each of the observed rewards —as empirical estimates computed in Equation (6). In summary, we overcome the black-box nature of the pre-training objective function (e.g., the MLM loss) by modeling the observed rewards as realizations of a noisy surrogate GP model.

Gaussian process.

A collection of random variables f(a) is said to be drawn from a GP with mean function \mu(\cdot) and covariance function k(\cdot, \cdot) if, for any finite set of elements a_1, \cdots, a_n, the associated finite set of random variables f(a_1), \cdots, f(a_n) follows

(f(a_1), \cdots, f(a_n)) \sim N\big( (\mu(a_1), \cdots, \mu(a_n)), K \big) , \quad K_{ij} = k(a_i, a_j) .   (12)

In particular, a GP is a stochastic process such that any finite collection of random variables has a multivariate Gaussian distribution (44). A GP can be seen as a probability distribution over arbitrary functions, with \mu(\cdot) its mean function and k(\cdot, \cdot) the covariance kernel.

GP model fitting.

The mean and kernel functions determine the GP function class, i.e., the regularity/smoothness assumptions of the modeled data. These are parameterized prior functions \mu_\theta(\cdot) and k_\theta(\cdot, \cdot) with hyperparameters \theta, which can be fitted to the observed data r_{1:t} at inputs a_{1:t}. For instance, via Type-II maximum likelihood estimation (MLE) of the GP model's hyperparameters \theta,

\hat{\theta} = \arg\max_{\theta} \log p(r_{1:t} \mid a_{1:t}, \theta) ,   (13)

where the data likelihood is a function of the observation noise's probability distribution. Bayesian approaches to hyperparameter selection for GP model training can also be implemented (44).
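A minimal GPyTorch sketch of this Type-II MLE fitting step is given below, consistent with the GP configuration we later detail in Appendix A (constant mean, scaled RBF kernel, Gaussian likelihood, Adam optimizer); the toy arm/reward data are made up for illustration only.

```python
import torch
import gpytorch

class RewardGP(gpytorch.models.ExactGP):
    """GP surrogate over bandit rewards, with constant mean and scaled RBF kernel."""
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Toy (arm, reward) observations: arms are 1-dimensional masking probabilities
train_x = torch.tensor([0.10, 0.15, 0.20, 0.30, 0.40])
train_y = torch.tensor([0.20, 0.35, 0.30, 0.15, 0.05])

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = RewardGP(train_x, train_y, likelihood)

# Type-II MLE: maximize the exact marginal log-likelihood over GP hyperparameters
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```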

Gaussian process posteriors.

Given a fitted GP, posterior inference —computing the predictive distribution of f(a) at a new input a after observing rewards r_{1:t} at inputs a_{1:t}— can be performed in closed form for the Gaussian observation noise case: i.e., when the noise in Equation (11) is i.i.d. drawn \epsilon_t \sim N(0, \sigma_\epsilon^2). Formally, for a given set of observations r_{1:t} at inputs a_{1:t}, the posterior distribution over f is a GP with the following mean and covariance functions:

\mu_t(a) = \mu(a) + k_t(a)^\top (K_t + \sigma_\epsilon^2 I)^{-1} (r_{1:t} - \mu_{1:t}) ,   (14)
k_t(a, a') = k(a, a') - k_t(a)^\top (K_t + \sigma_\epsilon^2 I)^{-1} k_t(a') ,   (15)
\sigma_t^2(a) = k_t(a, a) ,   (16)

where k_t(a) = (k(a_1, a), \cdots, k(a_t, a))^\top, (K_t)_{ij} = k(a_i, a_j), and \mu_{1:t} = (\mu(a_1), \cdots, \mu(a_t))^\top. These closed-form posterior inference expressions can be efficiently computed, both in exact and approximate ways (44, 42). Posterior inference with observation noise beyond the Gaussian assumption is an active research area, with many approximate techniques available for practitioners (49, 55, 61, 11).
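For reference, the closed-form posterior of Equations (14)-(16) can also be computed directly; the self-contained sketch below assumes a zero prior mean and an RBF kernel, with toy observations matching the previous example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.25, outputscale=1.0):
    """Squared-exponential kernel between 1-d input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return outputscale * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(a_obs, r_obs, a_new, noise_var=0.01):
    """Posterior mean and covariance of f at a_new, assuming zero prior mean."""
    K = rbf_kernel(a_obs, a_obs) + noise_var * np.eye(len(a_obs))   # K_t + sigma^2 I
    K_s = rbf_kernel(a_obs, a_new)                                  # k_t(a) columns
    K_ss = rbf_kernel(a_new, a_new)
    mean = K_s.T @ np.linalg.solve(K, r_obs)            # Equation (14), zero prior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)        # Equations (15)-(16)
    return mean, cov

a_obs = np.array([0.10, 0.15, 0.20, 0.30, 0.40])   # observed arms (toy)
r_obs = np.array([0.20, 0.35, 0.30, 0.15, 0.05])   # observed rewards (toy)
a_new = np.linspace(0.05, 0.45, 9)
mean, cov = gp_posterior(a_obs, r_obs, a_new)
print(mean.round(3))
print(np.sqrt(np.diag(cov)).round(3))   # posterior standard deviations
```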

3.2 GP-Thompson sampling for TLM pre-training.

Leveraging the GP-based reward model in Equation (11) and defining the pre-training hyperparameters as the bandit arms, a_t = \phi_t, we propose a bandit-based method for online pre-training loss minimization. Namely, we propose a Thompson sampling (TS) bandit algorithm that sequentially decides which arm a_t to play at each interaction t, by drawing from the GP posterior updated with all available data up to interaction t, to maximize its cumulative rewards as defined in Equation (7).

The proposed GP-TS bandit algorithm —with pseudo-code provided in Algorithm 1— views the TLM pre-training procedure as an unknown black-box function with inputs a_t and outputs r_t as in Equation (11) —for which cumulative rewards need to be maximized.

We note that any TLM can be used within our proposed framework, as long as the pre-training hyperparameter space \Phi is identified, and rewards as in Equation (6) can be computed based on a given pre-training objective. The GP reward model in Equation (11) accommodates continuous arms a \in \mathcal{A}, with dimensionality determined by the TLM pre-training hyperparameter space \Phi.

1:  Input: TLM and training corpus \mathcal{D}
2:  Input: Pre-training hyperparameter space \Phi
3:  Input: Number of bandit pre-training interactions T, number of updates u per interaction
4:  Input: GP prior functions \mu(\cdot) and k(\cdot, \cdot), initial GP hyperparameters \theta_0
5:  Initialize: \mathcal{A} = \Phi, \mathcal{H}_0 = \emptyset
6:  for t = 1, \cdots, T do
7:     Draw a posterior sample from the posterior GP, i.e., f^{(s)}_t(\cdot) \sim GP(\mu_{t-1}(\cdot), k_{t-1}(\cdot, \cdot) \mid \mathcal{H}_{t-1}, \theta_{t-1})
8:     Select arm based on the drawn posterior sample, i.e., a_t = \arg\max_{a \in \mathcal{A}} f^{(s)}_t(a)
9:     Run TLM pre-training for u steps, with hyperparameters \phi_t = a_t
10:    Compute validation loss \bar{l}_t(\phi_t) of the pre-trained TLM, i.e., as in Equation (4).
11:    Observe bandit reward r_t, i.e., as in Equation (6).
12:    Update bandit history \mathcal{H}_t = \mathcal{H}_{t-1} \cup \{(a_t, r_t)\}
13:    Fit the GP model with \mathcal{H}_t, i.e., \theta_t = \arg\max_{\theta} \log p(r_{1:t} \mid a_{1:t}, \theta)
14:  end for
Algorithm 1 GP-TS for online optimization of TLM pre-training
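A high-level Python sketch of the GP-TS loop in Algorithm 1 is given below. It is a sketch under assumptions: `pretrain_and_validate`, `fit_gp` and `sample_posterior` are hypothetical helpers standing in for the TLM pre-training/validation step of Equations (4)-(6) and for the GPyTorch fitting and posterior-sampling routines sketched above, and `candidate_arms` is a discretization of the hyperparameter space.

```python
import numpy as np

def gp_ts_pretraining(pretrain_and_validate, candidate_arms, num_interactions,
                      fit_gp, sample_posterior, initial_loss, seed=0):
    """GP-TS loop: arms are pre-training hyperparameters, rewards are loss decreases."""
    rng = np.random.default_rng(seed)
    history_arms, history_rewards = [], []
    prev_loss = initial_loss        # validation loss of the randomly initialized TLM
    gp = None                       # before any data, the GP posterior is the prior
    for t in range(num_interactions):
        if gp is None:
            arm = candidate_arms[rng.integers(len(candidate_arms))]
        else:
            sampled_f = sample_posterior(gp, candidate_arms)   # one posterior draw
            arm = candidate_arms[int(np.argmax(sampled_f))]    # Thompson sampling step
        loss = pretrain_and_validate(arm)       # run u updates, then Equation (4)
        reward = prev_loss - loss               # Equation (6)
        prev_loss = loss
        history_arms.append(arm)
        history_rewards.append(reward)
        gp = fit_gp(np.array(history_arms), np.array(history_rewards))   # Type-II MLE
    return history_arms, history_rewards
```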

GP-TS policy.

To execute the proposed GP-Thompson sampling policy (54, 46), we compute the GP reward model posterior, with sufficient statistics as in Equations (14)-(16) —which, due to its Gaussian nature, permits drawing predictive samples from it, as in Step 7 of Algorithm 1. These samples are used in the proposed GP-TS to determine (in Step 8 of Algorithm 1) the sequential arms (hyperparameters \phi_t) for the next interaction of the pre-training procedure.

After u pre-training steps of the TLM, we collect the pre-trained model's MLM validation loss to compute the observed bandit reward as in Equation (6). After every bandit interaction t, new evidence is collected that allows for updating (i.e., re-fitting) the GP model to the observed input (action)-output (reward) history \mathcal{H}_t. For instance, via Type-II MLE —although we acknowledge that other hyperparameter selection procedures might be used— as in Step 13 of Algorithm 1.

4 Experiments

4.1 Evaluation set-up

Implementation.

We implement Algorithm 1 in Python, with Gaussian process modules based on GPyTorch (14); our implementation of the proposed bandit-based framework will be made publicly available in a public Github repository upon acceptance. Our use case for evaluating the proposed method is the online optimization of the dynamic masking procedure of TLM pre-training, as argued for by Liu et al. (34) in their RoBERTa model. We implement the RoBERTa model as provided by Fairseq (40) and incorporate it as a module in our proposed framework.

RoBERTa's dynamic masking procedure relies on several hyperparameters \phi, specifically: the probability of replacing an input token with the mask (mask-prob); the probability that a masked token is left unmasked (leave-unmasked-prob); and the probability of replacing a token with a random token instead of with the mask (random-token-prob).

Experiment set-up.

Our goal is to probe the ability of the proposed GP-TS method to —given a dataset, a TLM architecture, and a computational budget— efficiently pre-train the best-performing language model, with satisfactory performance in pre-training and downstream tasks. We replicate the pre-training described in (34) and compare the performance of different RoBERTa models.

We download, pre-process, and encode the WikiText-103 dataset, and pre-train each of the candidate TLMs on it from scratch. To quantify the downstream capabilities of the pre-trained models, we evaluate their performance in the General Language Understanding Evaluation (GLUE) benchmark (60).

We run our experiments with RoBERTa models using the BERT-base architecture (125M parameters) in a server with 8 Tesla V100-SXM2-32GB GPUs —implementation and configuration details are provided in Appendix A. The aim is not to compare these models with large-scale TLMs trained in huge datasets, but to scrutinize the quality of pre-trained RoBERTa models under equal experimental conditions. To that end, we fix the seed for the execution of RoBERTa pre-training and fine-tuning, but run five different realizations of the proposed GP-TS: i.e., we quantify the performance variability induced by the stochastic bandit decisions.

In Section 4.2 below, we compare RoBERTa models pre-trained based on a grid-search over masking probabilities —as originally executed by Liu et al. (34)— with our proposed bandit-based online optimization procedure —the proposed GP-TS in Algorithm 1. After showing the benefits of GP-TS based sequential selection of the masking probability mask-prob, we investigate in Section 4.3 how successful the bandit-based optimization is when selecting all dynamic masking hyperparameters.

4.2 GP-TS for online optimization of the masking probability

We first focus on the online optimization of the masking probability (mask-prob), with leave-unmasked-prob and random-token-prob fixed to their default values as per the guidelines in (34). The masking probability search space is discretized at a regular interval for the proposed GP-TS method.

Pre-training performance.

Results in Figure 1, where we compare the MLM loss computed in the validation set over each of the pre-training epochs (a bandit interaction equals a single epoch), demonstrate that the proposed GP-TS pre-trains the best performing RoBERTa model.

The benefits provided are not only in the attained lower MLM metric value, but also in a faster pre-training procedure: a better model than the grid-search based alternatives is found in fewer epochs. The RoBERTa model with the best MLM loss is pre-trained by GP-TS early in the pre-training run, with no significant performance improvements in later pre-training epochs. Namely, the selected RoBERTa architecture, when pre-trained in a given dataset, achieves the best MLM loss in fewer epochs when pre-trained with the proposed GP-TS.

Figure 1: MLM validation loss performance comparison (lower is better) of grid-search based and the GP-TS based pre-trained RoBERTa models. GP-TS results are averaged across five realizations, with standard deviation shown in the shaded area.

In addition, we observe that GP-TS avoids model overfitting: contrary to RoBERTa models pre-trained with fixed masking probabilities, which result in V-shaped validation curves (MLM training loss values are provided in Appendix B.1), the GP-TS pre-trained RoBERTa model showcases minimal degradation over pre-training epochs.

All in all, Figure 1 exhibits how the proposed GP-TS is able to sequentially select dynamic masking probabilities that result in fast, robust, and accurate pre-trained models.

Fine-tuning performance.

We showcase the downstream benefits of pre-trained TLMs by evaluating their accuracy in GLUE tasks, after fine-tuning each of the pre-trained language models for just two epochs. To elucidate the pre-trained language models’ quality over pre-training steps, we fine-tune and evaluate each RoBERTa model after every pre-training epoch: i.e., the x-axis in Figure 2 is identical to the x-axis in Figure 1.

Figure 2: GLUE QQP task accuracy comparison (higher is better) after two fine-tuning epochs of grid-search based and the GP-TS based pre-trained RoBERTa models.

Figure 2 showcases that the GP-TS pre-trained model, after only two fine-tuning epochs, provides the best GLUE-QQP task performance —results for all GLUE tasks are provided in Appendix B.2. We note that the downstream performance benefit is most evident after enough pre-training epochs, i.e., as soon as the best MLM-based pre-trained models have been learned by GP-TS, as shown in Figure 1.

Model CoLA MNLI MRPC QNLI QQP RTE SST-2 STS-B
0.689 0.613 0.706 0.66 0.763 0.473 0.795 0.143
0.691 0.642 0.694 0.687 0.773 0.48 0.79 0.247
0.687 0.657 0.703 0.79 0.781 0.477 0.807 0.225
0.685 0.667 0.699 0.787 0.788 0.48 0.808 0.314
0.675 0.661 0.691 0.787 0.788 0.502 0.819 0.248
GP-TS 0.69 0.669 0.708 0.791 0.801 0.491 0.812 0.353
Table 1: GLUE task accuracy (higher is better) for models at pre-training epoch 75, subsequently fine-tuned for two epochs. STS-B is evaluated via the Pearson correlation coefficient.

The accuracy of all pre-trained RoBERTa models in all GLUE tasks, after pre-training for 75 epochs and fine-tuning for only two epochs per-task, is shown in Table 1. Results in Table 1 exhibit how the proposed GP-TS pre-trains language models that can then be quickly fine-tuned to downstream tasks with top accuracy.

Overall, the presented empirical evidence illustrates that the proposed GP-TS enables fast and superior pre-training of language models: not only is GP-TS able to improve the MLM pre-training objective, but it results in robust TLMs that can be quickly fine-tuned to downstream tasks with excellent accuracy performance. These results show that instead of pre-training with fixed masking probabilities, sequentially deciding how many input tokens to mask —as per the GP-TS that minimizes MLM loss— is beneficial.

4.3 GP-TS for online optimization of all dynamic masking hyperparameters

We now interrogate the capability of the proposed GP-TS to optimize TLMs with respect to all hyperparameters of the dynamic masking procedure. For these experiments, GP-TS searches over the hypercube spanned by mask-prob, leave-unmasked-prob and random-token-prob, and we compare its performance to the GP-TS of Section 4.2 that searches only along mask-prob, with leave-unmasked-prob and random-token-prob fixed to default values. The goal is to inspect whether the proposed method is able to autonomously pre-train a RoBERTa model when no knowledge about what dynamic masking hyperparameters to use is available.

Figure 3: MLM validation loss performance comparison (lower is better) of GP-TS based pre-trained models, when searching over the masking probability only and when searching over all dynamic masking hyperparameters. GP-TS results are averaged across five realizations, with standard deviation shown in the shaded area.

As shown in Figure 3, the proposed GP-TS is successful, even when operating over a 3-dimensional search space, in finding a sequence of hyperparameters that results in a best-performing RoBERTa model.

We also observe that successful pre-training is achieved faster: i.e., a lower MLM loss is attained in fewer epochs when GP-TS searches over all dynamic masking hyperparameters than when it adapts the masking probability only, with no significant performance improvement in later epochs. The GP-TS pre-trained model is again easily fine-tuned to provide satisfactory downstream performance across all GLUE tasks, as reported in Table 2.

Model CoLA MNLI MRPC QNLI QQP RTE SST-2 STS-B
GP-TS (masking probability only) 0.69 0.669 0.708 0.791 0.801 0.491 0.812 0.353
GP-TS (all masking hyperparameters) 0.69 0.665 0.699 0.778 0.796 0.477 0.836 0.325
Table 2: GLUE task accuracy (higher is better) for GP-TS pre-trained models at epoch 70, fine-tuned for a single epoch. STS-B is evaluated via the Pearson correlation coefficient.

Based on the presented results, we conclude that the proposed GP-TS is able to successfully find sequences of dynamic masking hyperparameters —even when no good guesses for them are available— that minimize MLM pre-training loss. Instead of pre-training with fixed dynamic masking hyperparameters, our results indicate that the proposed GP-TS algorithm sequentially selects hyperparameters that result in robust and well-performing models.

5 Conclusion

We have presented a multi-armed bandit-based online optimization framework for the sequential selection of pre-training hyperparameters towards optimized Transformer-based language model performance.

We model noisy evaluations of the pre-training objective function (e.g., the MLM loss) as drawn from a surrogate Gaussian process, and propose a Gaussian process based Thompson sampling (GP-TS) for online MLM-loss minimization. We prove the equivalence between the proposed bandit reward function’s cumulative maximization and pre-training loss minimization.

We provide empirical evidence of how the proposed GP-TS, when applied to MLM dynamic masking optimization, results in robust and accurate language models. Notably, while (34) randomly select which input tokens to mask with fixed probability, we show that sequentially adapting the masking hyperparameters as determined by GP-TS results in superior performance.

Our experiments demonstrate not only the practical significance of the proposed method in terms of efficiency (i.e., successful pre-training in fewer epochs), but also that GP-TS based models achieve superior performance in pre-training (i.e., reduced MLM loss) and across diverse downstream tasks.

Building upon our formulation and the provided evidence, we envision interesting follow-up work on showcasing the proposed method’s ability to successfully pre-train large-scale models in general purpose corpora, as well as for domain-specific models and tasks.

References

  • Agrawal and Goyal (2012) S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
  • Agrawal and Goyal (2013) S. Agrawal and N. Goyal. Further Optimal Regret Bounds for Thompson Sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
  • Alsentzer et al. (2019) E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.
  • Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2-3):235–256, May 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352.
  • Beltagy et al. (2019) I. Beltagy, K. Lo, and A. Cohan. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
  • Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1613–1622, Lille, France, 2015. JMLR.org.
  • Bogunovic et al. (2016) I. Bogunovic, J. Scarlett, and V. Cevher. Time-varying gaussian process bandit optimization. In Artificial Intelligence and Statistics, pages 314–323. PMLR, 2016.
  • Calandra et al. (2016) R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1):5–23, 2016.
  • Candelieri et al. (2018) A. Candelieri, R. Perego, and F. Archetti. Bayesian optimization of pump operations in water distribution systems. Journal of Global Optimization, 71(1):213–235, 2018.
  • Devlin et al. (2018) J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. URL https://arxiv.org/abs/1810.04805.
  • Flaxman et al. (2015) S. Flaxman, A. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast Kronecker Inference in Gaussian Processes with non-Gaussian Likelihoods. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 607–616, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/flaxman15.html.
  • Frazier (2018) P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • Frazier and Wang (2016) P. I. Frazier and J. Wang. Bayesian optimization for materials design. In Information science for materials discovery and design, pages 45–75. Springer, 2016.
  • Gardner et al. (2018) J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. In Advances in Neural Information Processing Systems, 2018.
  • Gittins (1979) J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological), 41(2):148–177, 1979. ISSN 00359246.
  • Grünewälder et al. (2010) S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret Bounds for Gaussian Process Bandit Problems. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 273–280, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/grunewalder10a.html.
  • Gu et al. (2021) Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
  • Gururangan et al. (2020) S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv preprint, Apr. 2020.
  • Hennig and Schuler (2012) P. Hennig and C. J. Schuler. Entropy Search for Information-Efficient Global Optimization. Journal of Machine Learning Research, 13(57):1809–1837, 2012. URL http://jmlr.org/papers/v13/hennig12a.html.
  • Hernández-Lobato et al. (2014) J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/069d3bb002acd8d7dd095917f9efe4cb-Paper.pdf.
  • Hernández-Lobato et al. (2017) J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In International conference on machine learning, pages 1470–1479. PMLR, 2017.
  • Joshi et al. (2020) M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
  • Kalyan et al. (2021) K. S. Kalyan, A. Rajasekharan, and S. Sangeetha. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv preprint arXiv:2108.05542, 2021.
  • Kang et al. (2020) M. Kang, M. Han, and S. J. Hwang. Neural mask generator: Learning to generate adaptive word maskings for language model adaptation. arXiv preprint arXiv:2010.02705, 2020. URL https://arxiv.org/abs/2010.02705.
  • Kaufmann et al. (2012) E. Kaufmann, O. Cappe, and A. Garivier. On Bayesian Upper Confidence Bounds for Bandit Problems. In N. D. Lawrence and M. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 592–600, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
  • Klein et al. (2017) A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 528–536, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/klein17a.html.
  • Korda et al. (2013) N. Korda, E. Kaufmann, and R. Munos. Thompson Sampling for 1-Dimensional Exponential Family Bandits. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1448–1456. Curran Associates, Inc., 2013.
  • Krause and Ong (2011) A. Krause and C. Ong. Contextual Gaussian Process Bandit Optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper/2011/file/f3f1b7fc5a8779a9e618e1f23a7b7860-Paper.pdf.
  • Lai (1987) T. L. Lai. Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. The Annals of Statistics, 15(3):1091–1114, 1987. ISSN 00905364.
  • Lai and Robbins (1985) T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1):4–22, mar 1985. ISSN 0196-8858. doi: 10.1016/0196-8858(85)90002-8.
  • Lattimore and Szepesvári (2019) T. Lattimore and C. Szepesvári. Bandit algorithms. Preprint, 2019.
  • Lee et al. (2020) J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
  • Li et al. (2017) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
  • Liu et al. (2019) Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692.
  • Lu and Roy (2017) X. Lu and B. V. Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, pages 3258–3266, 2017.
  • Maddox et al. (2021) W. J. Maddox, M. Balandat, A. G. Wilson, and E. Bakshy. Bayesian Optimization with High-Dimensional Outputs. arXiv preprint arXiv:2106.12997, 2021.
  • Negoescu et al. (2011) D. M. Negoescu, P. I. Frazier, and W. B. Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
  • Nguyen et al. (2020) V. Nguyen, V. Masrani, R. Brekelmans, M. Osborne, and F. Wood. Gaussian Process Bandit Optimization of the Thermodynamic Variational Objective. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5764–5775. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/3f2dff7862a70f97a59a1fa02c3ec110-Paper.pdf.
  • Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep Exploration via Bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.
  • Ott et al. (2019) M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
  • Patterson et al. (2021) D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
  • Pleiss et al. (2018) G. Pleiss, J. Gardner, K. Weinberger, and A. G. Wilson. Constant-Time Predictive Distributions for Gaussian Processes. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4114–4123. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/pleiss18a.html.
  • Puvis de Chavannes et al. (2021) L. H. Puvis de Chavannes, M. G. K. Kongsbak, T. Rantzau, and L. Derczynski. Hyperparameter power impact in transformer language model training. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 96–118, Virtual, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.sustainlp-1.12. URL https://aclanthology.org/2021.sustainlp-1.12.
  • Rasmussen and Williams (2005) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
  • Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In International Conference on Learning Representations, 2018.
  • Russo et al. (2018) D. J. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen. A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. ISSN 1935-8237. doi: 10.1561/2200000070. URL http://dx.doi.org/10.1561/2200000070.
  • Shahriari et al. (2015) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • Slivkins (2019) A. Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019. ISSN 1935-8237. doi: 10.1561/2200000068. URL http://dx.doi.org/10.1561/2200000068.
  • Snelson and Ghahramani (2006) E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.
  • Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/05311655a15b75fab86956663e1819cd-Paper.pdf.
  • Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 1015–1022, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.
  • Sun et al. (2020) Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8968–8975, 2020.
  • Thompson (1933) W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3/4):285–294, 1933. ISSN 00063444.
  • Thompson (1935) W. R. Thompson. On the Theory of Apportionment. American Journal of Mathematics, 57(2):450–456, 1935. ISSN 00029327, 10806377.
  • Titsias (2009) M. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In D. van Dyk and M. Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
  • Turner et al. (2021) R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, and I. Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. arXiv preprint arXiv:2104.10201, 2021.
  • Urteaga and Wiggins (2018) I. Urteaga and C. Wiggins. Variational inference for the multi-armed contextual bandit. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 698–706, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Vu et al. (2020) T.-T. Vu, D. Phung, and G. Haffari. Effective unsupervised domain adaptation with adversarially trained language models. arXiv preprint arXiv:2010.01739, 2020. URL https://arxiv.org/abs/2010.01739.
  • Wang et al. (2018) A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  • Wilson and Nickisch (2015) A. Wilson and H. Nickisch. Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1775–1784, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/wilson15.html.
  • Yu and Gen (2010) X. Yu and M. Gen. Introduction to evolutionary algorithms. Springer Science & Business Media, 2010.

Appendix A Implementation details

A.1 Gaussian process

We implement Gaussian process modules based on GPyTorch (14), and execute all experiments with the GP prior and GP fitting details described in Table 3.

Hyperparameter Initial Value
GP Model
Mean Function Constant
Prior constant 0
Kernel Function Scaled RBF Kernel
Prior output-scale 1
Prior length-scale 0.25
Observation Model
Likelihood function Gaussian
Noise variance 1
Training details
Loss function ExactMarginalLogLikelihood
train max iters 100
loss epsilon 0.01
Optimizer
optimizer adam
lr 0.1
Table 3: Gaussian Process prior and hyperparameters.

A.2 RoBERTa pre-training

We execute the RoBERTa pre-training procedure as described in Fairseq’s RoBERTa pre-training tutorial, with specific hyperparameters as described in Table 4.

Hyperparameter Value
Architecture RoBERTa base
Task masked lm
Criterion masked lm
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.01
Training details
batch-size 16
update-freq 16
sample-break-mode complete
tokens-per-sample 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 0.001
lr-scheduler polynomial decay
linear-warmup-updates 1000
Dynamic masking
mask-prob
leave-unmasked-prob 0.1
random-token-prob 0.1
Table 4: RoBERTa pre-training hyperparameters.

A.3 RoBERTa fine-tuning

We execute the RoBERTa fine-tuning procedure for GLUE tasks as described in Fairseq’s RoBERTa GLUE tutorial, with specific hyperparameters as described in Tables 5-12.

Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 2
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 16
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 534
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 1e-5
lr-scheduler polynomial decay
linear-warmup-updates 32
total-num-update 534
Other
init-token 0
separator-token 2
Table 5: RoBERTa fine-tuning hyperparameters for CoLA.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 3
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 32
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 12387
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 1e-5
lr-scheduler polynomial decay
linear-warmup-updates 743
total-num-update 12387
Other
init-token 0
separator-token 2
Table 6: RoBERTa fine-tuning hyperparameters for MNLI.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 2
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 16
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 230
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 1e-5
lr-scheduler polynomial decay
linear-warmup-updates 13
total-num-update 230
Other
init-token 0
separator-token 2
Table 7: RoBERTa fine-tuning hyperparameters for MRPC.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 2
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 32
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 3311
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 1e-5
lr-scheduler polynomial decay
linear-warmup-updates 199
total-num-update 3311
Other
init-token 0
separator-token 2
Table 8: RoBERTa fine-tuning hyperparameters for QNLI.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 2
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 32
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 11327
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 1e-5
lr-scheduler polynomial decay
linear-warmup-updates 2832
total-num-update 11327
Other
init-token 0
separator-token 2
Table 9: RoBERTa fine-tuning hyperparameters for QQP.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 2
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 16
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 204
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 2e-5
lr-scheduler polynomial decay
linear-warmup-updates 12
total-num-update 204
Other
init-token 0
separator-token 2
Table 10: RoBERTa fine-tuning hyperparameters for RTE.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 2
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 32
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 2093
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 1e-5
lr-scheduler polynomial decay
linear-warmup-updates 125
total-num-update 2093
Other
init-token 0
separator-token 2
Table 11: RoBERTa fine-tuning hyperparameters for SST-2.
Hyperparameter Value
Architecture RoBERTa base
Task sentence prediction
Criterion sentence prediction
num-classes 1 (regression-target)
max-epoch 2
Model details
dropout 0.1
attention-dropout 0.1
weight-decay 0.1
Training details
batch-size 16
update-freq 1
required-batch-size-multiple 1
sample-break-mode complete
tokens-per-sample 512
max-update 360
max-tokens 4400
max-positions 512
Optimizer
optimizer adam
adam-betas (0.9,0.98)
adam-eps 1e-6
clip-norm 1.0
Learning rate
lr 2e-5
lr-scheduler polynomial decay
linear-warmup-updates 21
total-num-update 360
Other
init-token 0
separator-token 2
Table 12: RoBERTa fine-tuning hyperparameters for STS-B.

Appendix B Additional results

B.1 Pre-training losses

We showcase in Figures 4 and 5 the pre-training MLM losses over epochs computed both in the training and validation datasets, where we observe overfitting for RoBERTa models with fixed hyperparameters, yet robust learning for the proposed GP-TS technique.

(a) MLM training loss.
(b) MLM validation loss.
Figure 4: MLM loss performance comparison (lower is better) of grid-search based and the GP-TS based pre-training with respect to the masking probability. GP-TS results are averaged across 5 realizations, with standard deviation shown in the shaded area.
(a) MLM training loss.
(b) MLM validation loss.
Figure 5: MLM loss performance comparison (lower is better) of the GP-TS based pre-trained models, when searching over the masking probability only and when searching over all dynamic masking hyperparameters. GP-TS results are averaged across 5 realizations, with standard deviation shown in the shaded area.

B.2 Fine-tuning losses

We showcase in Figure 6 the accuracy in all GLUE task dev sets, after fine-tuning each of the pre-trained language models for only two fine-tuning epochs. We note that the downstream performance in GLUE-tasks with small datasets (i.e., CoLA, MRPC, RTE) is unsatisfactory (for both fixed and GP-TS pre-trained models) when run with the hyperparameters as in Appendix A.3. Although further experimentation is required to improve downstream performance in these GLUE tasks, our claim that GP-TS provides pre-trained models easily fine-tunable to a variety of tasks still holds.

(a) GLUE-CoLA task accuracy.
(b) GLUE-MNLI task accuracy.
(c) GLUE-MRPC task accuracy.
(d) GLUE-QNLI task accuracy.
(e) GLUE-QQP task accuracy.
(f) GLUE-RTE task accuracy.
(g) GLUE-SST-2 task accuracy.
(h) GLUE-STS-B task Pearson correlation.
Figure 6: GLUE task accuracy comparison (higher is better) after two fine-tuning epochs of grid-search based and the proposed GP-TS based pre-trained models.