1 Introduction
In the field of Natural Language Processing (NLP), models for learning unsupervised representations from unlabeled text based on Transformer architectures (58) have attained state-of-the-art results on diverse tasks, e.g., question answering and language inference (23). Transformer-based language models (TLMs), such as BERT (10) and RoBERTa (34), rely on the combination of unsupervised pretraining of the model and a subsequent task-specific fine-tuning procedure, via additional neural network layers targeted to the task of interest. TLMs are pretrained over large unlabeled text data using self-supervision, i.e., by learning the relationships between different sections, sentences or words of the input data. Once the TLM is pretrained over large volumes of text, it can be used in various downstream tasks after fine-tuning task-specific layers.
The key insight from pretrained TLMs is that they learn language representations or embeddings that are useful across downstream tasks, minimizing the need to retrain the entire model from scratch. The advantage is that extensive pretraining of TLMs can lead to significant downstream performance improvements, i.e., it is worth learning complex TLMs on huge natural language corpora before fine-tuning them for particular tasks.
Following the success of TLMs in general NLP tasks, many have replicated the pretrain, then fine-tune framework in different specific domains, ranging from language models pretrained with scientific documents in SciBERT (5) and biomedical corpora in BioBERT (32), ClinicalBERT (3), and BlueBERT (17); to in-house, industry-specific implementations and pretraining of TLMs (23). In addition, the importance of further pretraining with in-domain corpora, a procedure known as continual training (23), has also been documented to yield downstream performance gains (18).
However, even if conceptually simple and empirically powerful, pretraining is challenging and expensive: the relationship between the Transformer architecture, the training corpus, the training hyperparameters, and the evaluation metrics is multimodal and complex. Furthermore, many have highlighted the importance of previously overlooked design choices in pretraining (such as deciding the pretraining metric and optimizing hyperparameters) that result in significant performance differences.
In this work, our goal is to improve the pretraining procedure of TLMs by selecting pretraining hyperparameters that result in optimized performance. We argue that an optimized selection of pretraining hyperparameters will accelerate pretraining (i.e., achieve a satisfactory evaluation metric value in fewer epochs) and allow for a better pretraining procedure (i.e., achieve a superior metric value). Increased efficiency in TLM training is all the more important amidst rising concerns pertaining to the carbon footprint of large language models (41); and more specifically, the significant impact hyperparameter choice has on power consumption (43).
Our TLM pretraining use case is random dynamic masking hyperparameter optimization, in contrast to alternative (rule- or task-based) MLM dynamic masking approaches, such as SpanBERT (22) and ERNIE (52). Even though Liu et al. (34) showed the efficiency and benefits of random dynamic masking, the selection of the masking probability hyperparameters is often carried out based on heuristics or grid-based search approaches. Instead, we investigate automating TLM pretraining hyperparameter selection via multi-armed bandit (MAB) optimization.
We cast the TLM pretraining hyperparameter selection procedure as a sequential decision process, in which, at each interaction, an agent selects an action (e.g., pretraining hyperparameters) to maximize cumulative rewards (e.g., a pretraining metric). In the dynamic masking use case, the MAB actions (i.e., arms) are the dynamic masking choices, and the masked-language model performance is the unknown function the bandit algorithm is trying to maximize.
Hyperparameter search in machine learning is often addressed as a black-box optimization problem, where the aim is to optimize a computationally expensive function with no additional information known about it. These black-box optimization problems are often solved using evolutionary algorithms (62), entropy-search based methods (19, 20), or Bayesian optimization (BO) (12). BO can tackle the problem of simultaneously optimizing a black-box function with possibly noisy evaluations (50), and of speeding up the allocation of resources to promising hyperparameter configurations, as in (33). We here focus on the former task and, aligned with the success of BO for hyperparameter tuning recently reported by Turner et al. (56), propose a BO approach for sequential tuning of the dynamic masking procedure in MLMs. To that end, we probabilistically model a surrogate for the pretraining objective function and propose a bandit-based technique for its sequential optimization.
Contrary to novel work that aims at deciding which subsets of tokens to mask via combinatorial optimization and dynamic programming (59), we target online learning of appropriate dynamic masking hyperparameters via reinforcement learning (i.e., multi-armed bandits). In addition, and in contrast to approaches that adapt the language model's masking policy to a particular task of interest (24), we aim to find the sequential set of MLM pretraining choices that results not only in performant pretrained TLMs, but also in models that perform best across diverse fine-tuning tasks.

Contributions.
The specific contributions of this work are:

To present a bandit-based generic framework for online, black-box optimization of TLM pretraining.

To formulate a Gaussian process based Thompson sampling algorithm for online MLM-loss minimization of TLMs. The novelty in the presented framework is in fitting the estimated pretraining validation losses with a Gaussian process reward model for the formulation of a Thompson sampling bandit policy, which results in an equivalence between bandit cumulative reward maximization and pretraining loss minimization.

To showcase empirically that the proposed algorithm efficiently pretrains TLMs with robust and satisfactory performance, both in pretraining and across diverse downstream fine-tuned tasks.

To show that sequentially deciding, based on the proposed bandit-based algorithm, how many tokens of the input to mask —and how to mask them— results in improved dynamic masking-based MLM pretraining.
The rest of the manuscript is organized as follows: Section 2 provides a succinct background on Bayesian optimization, the multi-armed bandit, and the TLM pretraining procedure; Section 3 describes the proposed method for bandit-based TLM pretraining optimization; with results on its empirical performance evaluated in Section 4, and concluding remarks provided in Section 5.
2 Background
2.1 Bayesian optimization and multi-armed bandits
Bayesian optimization
(BO) is a widely used technique to address the problem of hyperparameter optimization in machine learning (50, 26, 56) and many closely related applications in engineering, control systems, materials, and drug discovery (37, 8, 13, 21, 9). BO relies on a probabilistic surrogate model (providing a measure of uncertainty) for the objective function (47, 12) to tackle the fundamentally challenging problem of simultaneously fitting and optimizing a high-dimensional, non-convex function with unknown smoothness and possibly noisy evaluations. Given the black-box optimization nature of BO, it is of paramount importance that the surrogate model provides a measure of uncertainty, for which generative models, Bayesian neural networks, and Gaussian processes are often used (36). Using this surrogate model, an acquisition function determines the most promising point to evaluate next. The multi-armed bandit is a useful framework for addressing this challenge of learning about the environment (i.e., exploration) while simultaneously maximizing the outcomes observed (i.e., exploitation).
The multi-armed bandit
(MAB) is a well-studied abstraction for problems that require learning while simultaneously maximizing the rewards obtained (30), i.e., balancing the exploration-exploitation tradeoff (31). A MAB is a sequential decision process between an agent and an unknown environment that requires decision-making under uncertainty (48). Mathematically, at each interaction $t = 1, \ldots, T$, a bandit agent needs to choose an action $a_t$ from a (not necessarily finite) set of actions $\mathcal{A}$. It then observes a stochastic reward $r_t$ drawn from the unknown, stochastic reward distribution of the selected arm $a_t$. The reward function is in general unknown, dependent on properties often characterized parametrically, i.e., $r_t \sim p(r \mid a_t, \phi)$. The goal of a MAB agent is to maximize its expected (cumulative) rewards, $\mathbb{E}\left[ \sum_{t=1}^{T} r_t \right]$, where we denote each arm's expected reward as $\mu_a = \mathbb{E}\left[ r \mid a, \phi \right]$. The challenge in MAB reward maximization is the lack of knowledge about the reward generating model (e.g., its parameters), which demands learning the properties of the reward distribution as the agent interacts with the environment.
Bandit algorithms.
Since the introduction of the MAB problem by Thompson (53), diverse algorithms have been proposed and analyzed to solve it, from computing optimal strategies for certain types of bandits (15) and probabilistically greedy approaches (4), to upper confidence bound (UCB) (29, 25) and Thompson sampling (54, 46) algorithms. The latter bandit strategies rely on a stochastic model-based view of the MAB, where a reward model is specified with unknown, to-be-learned parameters. For models in the exponential family, these algorithms have been empirically and theoretically proven to perform competitively (29, 25, 1, 2, 27). Extensions to accommodate reward functions not in the exponential family have also been proposed, by modeling observed rewards via ensembles of plausible models (35), using Gaussian mixture models (57) and Gaussian processes (51, 16, 28), as well as with neural networks (6, 39). In the context of BO in general, and MABs in particular, reward uncertainty quantification is critical. On the one hand, Riquelme et al. (45) emphasized the need for investigating how to sidestep the slow convergence of the uncertainty estimates in neural network based bandit algorithms. On the other hand, Gaussian processes (44) have been shown to provide not only adequate Bayesian uncertainty estimates, but also a successful approach to specifying surrogate models that encode smoothness assumptions of the payoff function in different bandit tasks (28, 7, 38).
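To make the Thompson sampling principle concrete before its GP-based variant in Section 3, consider a minimal sketch for the classic Bernoulli bandit with conjugate Beta priors (the function and variable names here are ours, for illustration only):

```python
import random

def thompson_sampling_bernoulli(draw_reward, n_arms, horizon):
    # Beta(1, 1) priors over each arm's unknown success probability.
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    for _ in range(horizon):
        # Draw one plausible success probability per arm from its posterior,
        # then act greedily with respect to the drawn samples.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        r = draw_reward(arm)  # observe a Bernoulli reward from the environment
        # Conjugate posterior update: Beta(alpha + r, beta + 1 - r).
        alpha[arm] += r
        beta[arm] += 1 - r
    return alpha, beta
```

The agent explores arms whose posteriors are still wide, and increasingly exploits arms whose sampled means dominate; the GP-based policy of Section 3 follows the same sample-then-maximize logic over a continuous arm space.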
As detailed in Section 3, we resort to a Gaussian process surrogate reward model in the proposed bandit-based optimization of TLM pretraining.
2.2 Language model pretraining and the Masked Language Model
Language model pretraining aims at learning language representations that are useful across tasks, i.e., pretraining allows for a model to be better initialized for quick fine-tuning (while avoiding overfitting) to specific downstream tasks. With pretraining, TLMs learn language representations based on the supervision provided by one or more pretraining tasks. A pretraining task is a self-supervised task whose labels are generated automatically. Two popular objectives for TLM pretraining are the Masked Language Model (MLM) and Next Sentence Prediction (NSP).
We focus on MLM pretraining, as initially proposed by Devlin et al. (10) and implemented by many others (34, 23). MLMs learn by taking an input sequence of words, where a random sample of the tokens is replaced with the special mask token, and learning to predict them. I.e., for a given input sequence (with special tokens delimiting it),

$x = \left( \langle s \rangle, x_1, x_2, \ldots, x_N, \langle /s \rangle \right) \;, \quad (1)$

MLMs select a random sample of the tokens, replace them with the mask, and learn to predict these masked tokens, utilizing both left and right contexts when using TLMs.
Dynamic masking.
In the original BERT model pretraining (10), a random but static subset of the input sequence tokens is replaced with the mask token. On the contrary, Liu et al. (34) proposed a dynamic masking procedure, which generates a new masking pattern (given a fixed probability of masking) for every sequence fed to the model. Liu et al. (34) demonstrate that this dynamic approach becomes crucial when pretraining for more steps or with larger datasets, attaining better pretrained and fine-tuned performance. Dynamic masking relies on several hyperparameters, specifically: ($i$) $\theta_{mask}$, the probability of replacing an input token with the mask; ($ii$) $\theta_{unmask}$, the probability that a masked token is left unmasked; and ($iii$) $\theta_{rand}$, the probability of replacing a token with a random token (instead of with the mask). The online optimization of these dynamic masking hyperparameters is the use case for our experiments in Section 4.
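To make the role of these hyperparameters concrete, a minimal sketch of per-sequence dynamic masking follows (the function and argument names are ours; the defaults shown are BERT's published 15%/80-10-10 choices, and this is not Fairseq's actual implementation):

```python
import random

def dynamic_mask(tokens, mask_id, vocab_size,
                 mask_prob=0.15, leave_unmasked_prob=0.1, random_token_prob=0.1):
    # A fresh masking pattern is drawn on every call, i.e., for every
    # sequence fed to the model: this is what makes the masking "dynamic".
    masked, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:  # theta_mask: select token for prediction
            targets[i] = tok             # the model must predict the original token
            u = random.random()
            if u < leave_unmasked_prob:  # theta_unmask: leave the token unmasked
                pass
            elif u < leave_unmasked_prob + random_token_prob:  # theta_rand
                masked[i] = random.randrange(vocab_size)
            else:
                masked[i] = mask_id      # default: replace with the mask token
    return masked, targets
```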
MLM pretraining.
In pretraining, one aims at minimizing the MLM loss, which is a function of the original ($\mathcal{D}$) and masked ($\bar{\mathcal{D}}$) datasets, the TLM architecture parameters $w$, and the hyperparameters $\theta$ of the pretraining procedure. The MLM objective is the cross-entropy loss for predicting the masked tokens in the masked sequence $\bar{x}$, where we denote with $\bar{x}_m$ the masked tokens in $\bar{x}$ that correspond to tokens $x_m$, $m \in \mathcal{M}$, in the original input sequence $x$. Mathematically,

$\ell(x, \bar{x}; w, \theta) = - \sum_{m \in \mathcal{M}} \log p\left( x_m \mid \bar{x}; w, \theta \right) \;, \quad (2)$

$L(\mathcal{D}, \bar{\mathcal{D}}; w, \theta) = \sum_{x \in \mathcal{D}} \ell(x, \bar{x}; w, \theta) \;, \quad (3)$

where we explicitly indicate with $\theta$ the dependency with respect to all hyperparameters relevant in the pretraining and optimization procedures.
The analytical form of the MLM loss function, which is a function of the hyperparameters $\theta$ used and the data where it is evaluated, is in general complex and unknown. However, estimates of the MLM loss are available at every epoch of pretraining, i.e., an empirical estimate of the MLM loss can be computed. For the sake of fair comparisons under different training setups (e.g., minibatch sizes or other hyperparameters), per-epoch averaged empirical MLM losses are computed in the validation dataset $\mathcal{D}_{val}$,

$\hat{L}_e(\theta) = \frac{1}{|\mathcal{D}_{val}|} \sum_{x \in \mathcal{D}_{val}} \ell(x, \bar{x}; w_e, \theta) \;. \quad (4)$
The pretraining objective is to find the TLM architecture parameters $w$ that minimize the MLM loss for the whole dataset $\mathcal{D}$ and its masked version $\bar{\mathcal{D}}$. In practice, this minimization is commonly executed via stochastic gradient-descent methods, run for $e = 1, \ldots, E$ epochs with randomly drawn minibatches $\mathcal{B}_e \subset \mathcal{D}$,

$\hat{w} = \mathop{\mathrm{argmin}}_{w} \sum_{e=1}^{E} \sum_{x \in \mathcal{B}_e} \ell(x, \bar{x}; w, \theta) \;. \quad (5)$
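In code, the per-sequence loss of Equation (2) is a cross-entropy evaluated only at the masked positions; a minimal PyTorch sketch (assuming a model that outputs per-position vocabulary logits; the function name is ours):

```python
import torch.nn.functional as F

def mlm_loss(logits, targets, ignore_index=-100):
    # logits: (batch, seq_len, vocab_size), model outputs on the masked input.
    # targets: original token ids at masked positions, ignore_index elsewhere,
    # so that only the masked tokens contribute to the cross-entropy of Eq. (2).
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=ignore_index,
    )
```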
3 Proposed method
We hereby propose to optimize the TLM pretraining procedure by casting it as a sequential decision process, where we tackle the problem of sequentially fitting and optimizing a pretraining black-box loss function with noisy evaluations. We pose the task of TLM pretraining with noisy MLM loss observations as a multi-armed bandit problem. We first determine the appropriate action space, and then formulate a proper surrogate reward function (leveraging the observed empirical MLM validation losses) that the bandit maximizes for its sequential selection of arms.
We define pretraining steps (i.e., a fixed number of stochastic gradient updates, which might or might not correspond to a full pretraining epoch) as bandit interactions $t = 1, \ldots, T$, towards minimizing a TLM pretraining objective given tunable hyperparameters $\theta$ —with (stochastic) objective evaluations estimated in the validation set. In the use case of MLM pretraining with dynamic masking, in each bandit interaction, one selects the hyperparameters $\theta_t$ (e.g., the number of tokens to mask and associated random masking probabilities), pretrains the TLM for certain stochastic updates that minimize the MLM loss as in Equation (5), and evaluates the pretrained model's MLM performance in the validation subset as per Equation (4). To that end, we identify the pretraining hyperparameters at interaction $t$, $\theta_t$, as the bandit's arms, i.e., $a_t = \theta_t$.
Due to the black-box nature of the pretraining objective (with only stochastic evaluations available), we formulate below the stochastic reward function surrogate needed to formalize the MAB approach to online optimization of TLM pretraining.
3.1 From MLM pretraining to Gaussian process based regret minimization
We hereby devise a bandit reward function that results in the sequential optimization of the MLM pretraining objective. To that end, we transform the empirical pretraining validation loss of each pretraining interaction into a reward quantity that allows for its cumulative optimization. To accommodate the empirical, stochastic loss estimates collected from the unknown analytical form of the loss function, we resort to Gaussian process modeling.
Bandit rewards as empirical MLM loss differences.
To guarantee that the cumulative rewards a bandit agent maximizes result in minimization of the pretraining objective, we compute the observed empirical rewards as the difference in averaged MLM losses between bandit interactions, i.e.,

$r_t = \hat{L}_{t-1}(\theta_{t-1}) - \hat{L}_t(\theta_t) \;, \quad (6)$

where $\hat{L}_t(\theta_t)$ denotes the empirical validation MLM loss of Equation (4) after bandit interaction $t$, and where we have dropped the dependency of the MLM loss with respect to the TLM parameters $w$ for ease of exposition.
We now show that maximizing the cumulative rewards as defined above is equivalent to minimizing the pretraining loss at interaction $T$. First, we compute the cumulative rewards,

$R_T = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T} \left( \hat{L}_{t-1}(\theta_{t-1}) - \hat{L}_t(\theta_t) \right) \quad (7)$

$= \hat{L}_0 - \hat{L}_T(\theta_T) \;, \quad (8)$

where $\hat{L}_0$ is the constant, initial loss of a randomly initialized model. We then conclude that maximizing cumulative rewards,

$\max_{\theta_1, \ldots, \theta_T} R_T = \max_{\theta_1, \ldots, \theta_T} \left( \hat{L}_0 - \hat{L}_T(\theta_T) \right) \quad (9)$

$\equiv \min_{\theta_1, \ldots, \theta_T} \hat{L}_T(\theta_T) \;, \quad (10)$

is equivalent to minimizing the validation MLM loss. Therefore, a bandit agent that aims at maximizing the cumulative rewards as in Equation (7) minimizes the MLM pretraining objective of Equation (4).
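The telescoping identity behind Equations (7)-(10) is easy to verify numerically; with hypothetical per-interaction validation losses:

```python
# Hypothetical per-interaction validation MLM losses, L_0 through L_T.
losses = [4.2, 3.1, 2.6, 2.4]

# Rewards as in Equation (6): r_t = L_{t-1} - L_t.
rewards = [l_prev - l_curr for l_prev, l_curr in zip(losses, losses[1:])]

# Cumulative rewards telescope to L_0 - L_T, as in Equations (7)-(8):
# maximizing them is minimizing the final loss L_T.
assert abs(sum(rewards) - (losses[0] - losses[-1])) < 1e-12
```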
Bandit reward functions as Gaussian process models.
In practice, TLM pretraining is carried out based on empirical risk minimization: i.e., only empirical estimates of the true MLM objective are available. Namely, rewards as defined in Equation (6) are stochastic draws from an analytically unknown objective function $f(\cdot)$. To accommodate these stochastic observations of the unknown loss function —which we aim at optimizing with respect to its hyperparameters $\theta$— we model the bandit reward function via a Gaussian process, with the observed (stochastic) rewards independent and identically distributed (i.i.d.) as

$r_t = f(\theta_t) + \epsilon_t \;, \qquad f(\cdot) \sim \mathcal{GP}\left( \mu(\cdot), k(\cdot, \cdot) \right) \;, \quad (11)$

where $f(\cdot)$ is a Gaussian process (GP) surrogate model of the pretraining objective, and $\epsilon_t$ denotes the stochastic nature of each of the observed rewards —as empirical estimates computed in Equation (6). In summary, we overcome the black-box nature of the pretraining objective function (e.g., the MLM loss) by modeling the observed rewards as realizations of a noisy surrogate GP model.
Gaussian process.
A collection of random variables $\{ f(\theta) : \theta \in \Theta \}$ is said to be drawn from a GP with mean function $\mu(\cdot)$ and covariance function $k(\cdot, \cdot)$ if, for any finite set of elements $\theta_1, \ldots, \theta_n \in \Theta$, the associated finite set of random variables $f(\theta_1), \ldots, f(\theta_n)$ follows

$\left( f(\theta_1), \ldots, f(\theta_n) \right) \sim \mathcal{N}\left( \boldsymbol{\mu}, \boldsymbol{K} \right) \;, \text{ with } \boldsymbol{\mu}_i = \mu(\theta_i) \text{ and } \boldsymbol{K}_{i,j} = k(\theta_i, \theta_j) \;. \quad (12)$

In particular, a GP is a stochastic process such that any finite collection of its random variables has a multivariate Gaussian distribution (44). A GP can be seen as a probability distribution over arbitrary functions, with $\mu(\cdot)$ its mean function and $k(\cdot, \cdot)$ the covariance kernel.

GP model fitting.
The mean and kernel functions determine the GP function class, i.e., the regularity/smoothness assumptions of the modeled data. These are parameterized prior functions $\mu(\cdot \mid \gamma)$ and $k(\cdot, \cdot \mid \gamma)$ with hyperparameters $\gamma$, which can be fitted to the observed data $r_{1:t}$ at inputs $\theta_{1:t}$; for instance, via Type-II maximum likelihood estimation (MLE) of the GP model's hyperparameters $\gamma$,

$\hat{\gamma} = \mathop{\mathrm{argmax}}_{\gamma} \; \log p\left( r_{1:t} \mid \theta_{1:t}, \gamma \right) \;, \quad (13)$

where the data likelihood is a function of the observation noise's probability distribution. Bayesian approaches to hyperparameter selection for GP model training can also be implemented (44).
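A sketch of this Type-II MLE fit with GPyTorch, using the prior and training settings of Table 3 in Appendix A.1 (constant mean, scaled RBF kernel, Gaussian likelihood, Adam with learning rate 0.1, up to 100 iterations); the class and function names are ours:

```python
import torch
import gpytorch

class RewardGP(gpytorch.models.ExactGP):
    # GP surrogate f over pretraining hyperparameters theta, with the constant
    # mean and scaled RBF kernel of Table 3 (Appendix A.1).
    def __init__(self, train_theta, train_r, likelihood):
        super().__init__(train_theta, train_r, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, theta):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(theta), self.covar_module(theta)
        )

def fit_gp(train_theta, train_r, max_iters=100):
    # Type-II MLE of the GP hyperparameters gamma, as in Equation (13):
    # maximize the exact marginal log-likelihood of the observed rewards.
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = RewardGP(train_theta, train_r, likelihood)
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    for _ in range(max_iters):
        optimizer.zero_grad()
        loss = -mll(model(train_theta), train_r)
        loss.backward()
        optimizer.step()
    return model, likelihood
```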
Gaussian process posteriors.
Given a fitted GP, posterior inference —computing the predictive distribution at a new datapoint $\theta$ after observing rewards $r_{1:t}$— can be performed in closed form for the Gaussian observation noise case, i.e., when the noise in Equation (11) is i.i.d. drawn $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. Formally, for a given set of observations $r_{1:t}$ at inputs $\theta_{1:t}$, the posterior distribution over $f(\cdot)$ is a GP with the following mean and covariance functions:

$\mu_t(\theta) = k_t(\theta)^\top \left( \boldsymbol{K}_t + \sigma^2 \boldsymbol{I} \right)^{-1} r_{1:t} \;, \quad (14)$

$k_t(\theta, \theta') = k(\theta, \theta') - k_t(\theta)^\top \left( \boldsymbol{K}_t + \sigma^2 \boldsymbol{I} \right)^{-1} k_t(\theta') \;, \quad (15)$

$\text{with } k_t(\theta) = \left( k(\theta_1, \theta), \ldots, k(\theta_t, \theta) \right)^\top \text{ and } \left[ \boldsymbol{K}_t \right]_{i,j} = k(\theta_i, \theta_j) \;. \quad (16)$

These closed-form posterior inference expressions can be efficiently computed, both in exact and approximate ways (44, 42). Posterior inference with observation noise beyond the Gaussian assumption is an active research area, with many approximate techniques available for practitioners (49, 55, 61, 11).
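For intuition, the closed-form expressions of Equations (14)-(16) can be computed directly; a self-contained NumPy sketch for one-dimensional arms, with the RBF kernel and the illustrative prior values of Table 3 (the example observation values at the end are hypothetical):

```python
import numpy as np

def rbf(a, b, lengthscale=0.25, outputscale=1.0):
    # Scaled RBF kernel k(theta, theta'); prior values as in Table 3.
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return outputscale * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(theta_train, r_train, theta_test, noise_var=1.0):
    # Closed-form GP posterior mean and covariance, Equations (14)-(15).
    K = rbf(theta_train, theta_train) + noise_var * np.eye(len(theta_train))
    K_s = rbf(theta_train, theta_test)   # k_t(theta) for each test input
    K_ss = rbf(theta_test, theta_test)
    mean = K_s.T @ np.linalg.solve(K, r_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Example: posterior over a grid of candidate masking probabilities
# after three (hypothetical) reward observations.
theta_obs = np.array([0.1, 0.2, 0.4])
r_obs = np.array([0.3, 0.5, 0.2])
mean, cov = gp_posterior(theta_obs, r_obs, np.linspace(0.0, 0.5, 11))
```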
3.2 GP-Thompson sampling for TLM pretraining
Leveraging the GP-based reward model in Equation (11) and defining the pretraining hyperparameters as the bandit arms, $a_t = \theta_t$, we propose a bandit-based method for online pretraining loss minimization. Namely, we propose a Thompson sampling (TS) bandit algorithm that sequentially decides which arm to play at each interaction $t$, by drawing from the GP posterior updated with all available data up to interaction $t$, to maximize its cumulative rewards as defined in Equation (7).
The proposed GP-TS bandit algorithm —with pseudocode provided in Algorithm 1— views the TLM pretraining procedure as an unknown black-box function with inputs $\theta_t$ and outputs $r_t$ as in Equation (11) —for which cumulative rewards need to be maximized.
We note that any TLM can be used within our proposed framework, as long as the pretraining hyperparameter space $\Theta$ is identified, and rewards as in Equation (6) can be computed based on a given pretraining objective. The GP reward model in Equation (11) accommodates continuous arms $\theta \in \Theta$, with dimensionality determined by the TLM pretraining hyperparameter space.
GP-TS policy.
To execute the proposed GP-Thompson sampling policy (54, 46), we compute the GP reward model posterior, with sufficient statistics as in Equation (16) —which, due to its Gaussian nature, permits drawing predictive samples from it, as in Step 6 of Algorithm 1. These samples are used in the proposed GP-TS to determine (in Step 7 of Algorithm 1) the sequential arms (hyperparameters $\theta_{t+1}$) for the next interaction of the pretraining procedure.
After the selected number of pretraining steps of the TLM, we collect the pretrained model's MLM validation loss to compute the observed bandit rewards as in Equation (6). After every bandit interaction $t$, new evidence is collected that allows for updating (i.e., refitting) the GP model to the observed input (action) and output (reward) history $\{\theta_{1:t}, r_{1:t}\}$; for instance, via Type-II MLE —although we acknowledge that other hyperparameter selection procedures might be used— as in Step 12 of Algorithm 1.
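Putting the pieces together, one realization of the GP-TS interaction loop can be sketched as follows (again under our naming assumptions: `fit_gp` is the illustrative Type-II MLE routine sketched after Equation (13), and `pretrain_and_eval` stands in for the TLM pretraining and validation evaluation of Equations (4)-(5), which we do not show; this is a sketch, not the released implementation):

```python
import torch

def gp_ts_pretrain(candidate_arms, pretrain_and_eval, initial_loss, n_interactions):
    # candidate_arms: (n_candidates, d) tensor of hyperparameter configurations.
    # pretrain_and_eval(theta): runs one pretraining interaction and returns the
    # empirical validation MLM loss, as in Equation (4).
    # initial_loss: validation loss L_0 of the randomly initialized model.
    thetas, rewards = [], []
    prev_loss = initial_loss
    for t in range(n_interactions):
        if t == 0:
            # No evidence yet: play a randomly chosen arm.
            idx = torch.randint(len(candidate_arms), (1,)).item()
        else:
            # Refit the GP reward model to the history (Step 12 of Algorithm 1).
            model, likelihood = fit_gp(torch.stack(thetas), torch.tensor(rewards))
            model.eval()
            likelihood.eval()
            with torch.no_grad():
                # Draw one posterior sample of f over all candidate arms (Step 6)
                # and play the arm maximizing the sampled rewards (Step 7).
                f_sample = model(candidate_arms).sample()
            idx = f_sample.argmax().item()
        theta_t = candidate_arms[idx]
        loss = pretrain_and_eval(theta_t)
        thetas.append(theta_t)
        rewards.append(prev_loss - loss)  # reward as loss decrease, Equation (6)
        prev_loss = loss
    return thetas, rewards
```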
4 Experiments
4.1 Evaluation setup
Implementation.
We implement Algorithm 1 in Python (our implementation of the proposed bandit-based framework will be made publicly available in a public GitHub repository upon acceptance), with Gaussian process modules based on GPyTorch (14). Our use case for evaluating the proposed method is the online optimization of the dynamic masking procedure of TLM pretraining, as argued for by Liu et al. (34) in their RoBERTa model. We implement the RoBERTa model as provided by Fairseq (40) and incorporate it as a module in our proposed framework.
RoBERTa's dynamic masking procedure relies on several hyperparameters, $\theta = (\theta_{mask}, \theta_{unmask}, \theta_{rand})$, specifically: $\theta_{mask}$, the probability of replacing an input token with the mask; $\theta_{unmask}$, the probability that a masked token is left unmasked; and $\theta_{rand}$, the probability of replacing a token with a random token (instead of with the mask).
Experiment setup.
Our goal is to probe the ability of the proposed GP-TS method to —given a dataset, a TLM architecture, and a computational budget— efficiently pretrain the best-performing language model, with satisfactory performance in pretraining and downstream tasks. We replicate the pretraining described in (34) and compare the performance of different RoBERTa models.
We download, preprocess, and encode the WikiText-103 dataset for pretraining, from scratch, each of the candidate TLMs. To quantify the downstream capabilities of the pretrained models, we evaluate their performance in the General Language Understanding Evaluation (GLUE) benchmark (60).
We run our experiments with RoBERTa models using the BERT-base architecture (125M parameters) in a server with 8 Tesla V100-SXM2-32GB GPUs —implementation and configuration details are provided in Appendix A. The aim is not to compare these models with large-scale TLMs trained in huge datasets, but to scrutinize the quality of pretrained RoBERTa models under equal experimental conditions. To that end, we fix the seed for the execution of RoBERTa pretraining and fine-tuning, but run five different realizations of the proposed GP-TS: i.e., we quantify the performance variability induced by the stochastic bandit decisions.
In Section 4.2 below, we evaluate the difference between RoBERTa models pretrained based on a grid-search over masking probabilities $\theta_{mask}$ —as originally executed by Liu et al. (34)— and our proposed bandit-based online optimization procedure —the proposed GP-TS in Algorithm 1. After showing the benefits of GP-TS based sequential selection of the masking probability $\theta_{mask}$, we investigate in Section 4.3 how successful the bandit-based optimization is when selecting all dynamic masking hyperparameters, $\theta = (\theta_{mask}, \theta_{unmask}, \theta_{rand})$.
4.2 GP-TS for online optimization of the masking probability
We first focus on the online optimization of the masking probability $\theta_{mask}$, with $\theta_{unmask}$ and $\theta_{rand}$ fixed to their default values as per guidelines in (34). The masking probability search space for the proposed GP-TS method is an interval of candidate $\theta_{mask}$ values, discretized at regular intervals.
Pretraining performance.
Results in Figure 1, where we compare the MLM loss computed in the validation set over each of the pretraining epochs (a bandit interaction equals a single epoch), demonstrate that the proposed GP-TS pretrains the best-performing RoBERTa model.
The benefits are not only in the attained (lower) MLM metric value, but also in a faster pretraining procedure: a better model than the grid-search based alternatives is found within fewer pretraining epochs. The RoBERTa model with the best MLM loss is pretrained by GP-TS early in the training run, with no significant performance improvements in later pretraining epochs. Namely, the selected RoBERTa architecture, when pretrained in a given dataset, achieves the best MLM loss in fewer epochs when pretrained with the proposed GP-TS.
In addition, we observe that GP-TS avoids model overfitting: contrary to RoBERTa models pretrained with fixed $\theta_{mask}$, which result in V-shaped validation loss curves (MLM training loss values are provided in Appendix B.1), the GP-TS pretrained RoBERTa model showcases minimal degradation over pretraining epochs.
All in all, Figure 1 exhibits how the proposed GP-TS is able to sequentially select dynamic masking probabilities that result in fast, robust, and accurate pretrained models.
Fine-tuning performance.
We showcase the downstream benefits of pretrained TLMs by evaluating their accuracy in GLUE tasks, after fine-tuning each of the pretrained language models for just two epochs. To elucidate the pretrained language models' quality over pretraining steps, we fine-tune and evaluate each RoBERTa model after every pretraining epoch: i.e., the x-axis in Figure 2 is identical to the x-axis in Figure 1.
Figure 2 showcases that the GP-TS pretrained model, after only two fine-tuning epochs, provides the best GLUE-QQP task performance —results for all GLUE tasks are provided in Appendix B.2. We note that the downstream performance benefit is most evident in later pretraining epochs, i.e., as soon as the best MLM-based pretrained models have been learned by GP-TS, as shown in Figure 1.
Model  CoLA  MNLI  MRPC  QNLI  QQP  RTE  SST-2  STS-B
Fixed $\theta_{mask}$  0.689  0.613  0.706  0.66  0.763  0.473  0.795  0.143
Fixed $\theta_{mask}$  0.691  0.642  0.694  0.687  0.773  0.48  0.79  0.247
Fixed $\theta_{mask}$  0.687  0.657  0.703  0.79  0.781  0.477  0.807  0.225
Fixed $\theta_{mask}$  0.685  0.667  0.699  0.787  0.788  0.48  0.808  0.314
Fixed $\theta_{mask}$  0.675  0.661  0.691  0.787  0.788  0.502  0.819  0.248
GP-TS  0.69  0.669  0.708  0.791  0.801  0.491  0.812  0.353
The accuracy of all pretrained RoBERTa models in all GLUE tasks, after pretraining for 75 epochs and fine-tuning for only two epochs per task, is shown in Table 1. Results in Table 1 exhibit how the proposed GP-TS pretrains language models that can then be quickly fine-tuned to downstream tasks with top accuracy.
Overall, the presented empirical evidence illustrates that the proposed GP-TS enables fast and superior pretraining of language models: not only is GP-TS able to improve the MLM pretraining objective, but it results in robust TLMs that can be quickly fine-tuned to downstream tasks with excellent accuracy. These results show that, instead of pretraining with fixed masking probabilities, sequentially deciding how many input tokens to mask —as per the GP-TS that minimizes MLM loss— is beneficial.
4.3 GP-TS for online optimization of all dynamic masking hyperparameters
We now interrogate the capability of the proposed GP-TS in optimizing TLMs with respect to all hyperparameters of the dynamic masking procedure, $\theta = (\theta_{mask}, \theta_{unmask}, \theta_{rand})$. For these experiments, we focus on the associated three-dimensional hypercube search space, and compare performance to the GP-TS method of Section 4.2 that searches only along $\theta_{mask}$, with default $\theta_{unmask}$ and $\theta_{rand}$ values. The goal is to inspect whether the proposed method is able to autonomously pretrain a RoBERTa model when no knowledge about what dynamic masking hyperparameters to use is available.
As shown in Figure 3, the proposed GP-TS is successful, even when operating over a three-dimensional search space, at finding a sequence of hyperparameters that results in a best-performing RoBERTa model.
We also observe that successful pretraining is achieved faster: i.e., a lower MLM loss is attained in fewer epochs when pretraining with GP-TS over all dynamic masking hyperparameters $\theta$ than with the GP-TS that optimizes $\theta_{mask}$ only, with no significant performance improvement in later epochs. The GP-TS pretrained model is again easily fine-tuned to provide satisfactory downstream performance across all GLUE tasks, as reported in Table 2.
Model  CoLA  MNLI  MRPC  QNLI  QQP  RTE  SST-2  STS-B
GP-TS ($\theta_{mask}$ only)  0.69  0.669  0.708  0.791  0.801  0.491  0.812  0.353
GP-TS (all $\theta$)  0.69  0.665  0.699  0.778  0.796  0.477  0.836  0.325
Based on the presented results, we conclude that the proposed GP-TS is able to successfully find sequences of dynamic masking hyperparameters —even when no good guesses for them are available— that minimize the MLM pretraining loss. Instead of pretraining with fixed dynamic masking hyperparameters, our results indicate that the proposed GP-TS algorithm sequentially selects hyperparameters that result in robust and well-performing models.
5 Conclusion
We have presented a multi-armed bandit-based online optimization framework for the sequential selection of pretraining hyperparameters towards optimized Transformer-based language model performance.
We model noisy evaluations of the pretraining objective function (e.g., the MLM loss) as drawn from a surrogate Gaussian process, and propose a Gaussian process based Thompson sampling (GP-TS) algorithm for online MLM-loss minimization. We prove the equivalence between cumulative maximization of the proposed bandit reward function and pretraining loss minimization.
We provide empirical evidence of how the proposed GP-TS, when applied to MLM dynamic masking optimization, results in robust and accurate language models. Notably, while Liu et al. (34) randomly select which input tokens to mask with a fixed probability, we show that sequentially adapting the masking hyperparameters as determined by GP-TS results in superior performance.
Our experiments demonstrate not only the practical significance of the proposed method in terms of efficiency (i.e., successful pretraining in fewer epochs), but also that GP-TS based models achieve superior performance in pretraining (i.e., reduced MLM loss) and across diverse downstream tasks.
Building upon our formulation and the provided evidence, we envision interesting follow-up work on showcasing the proposed method's ability to successfully pretrain large-scale models in general-purpose corpora, as well as for domain-specific models and tasks.
References
 Agrawal and Goyal (2012) S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
 Agrawal and Goyal (2013) S. Agrawal and N. Goyal. Further Optimal Regret Bounds for Thompson Sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
 Alsentzer et al. (2019) E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.
 Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2-3):235–256, May 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352.
 Beltagy et al. (2019) I. Beltagy, K. Lo, and A. Cohan. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
 Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, pages 1613–1622, Lille, France, 2015. JMLR.org.
 Bogunovic et al. (2016) I. Bogunovic, J. Scarlett, and V. Cevher. Time-varying Gaussian process bandit optimization. In Artificial Intelligence and Statistics, pages 314–323. PMLR, 2016.
 Calandra et al. (2016) R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1):5–23, 2016.
 Candelieri et al. (2018) A. Candelieri, R. Perego, and F. Archetti. Bayesian optimization of pump operations in water distribution systems. Journal of Global Optimization, 71(1):213–235, 2018.
 Devlin et al. (2018) J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. URL https://arxiv.org/abs/1810.04805.
 Flaxman et al. (2015) S. Flaxman, A. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast Kronecker Inference in Gaussian Processes with non-Gaussian Likelihoods. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 607–616, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/flaxman15.html.
 Frazier (2018) P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
 Frazier and Wang (2016) P. I. Frazier and J. Wang. Bayesian optimization for materials design. In Information science for materials discovery and design, pages 45–75. Springer, 2016.
 Gardner et al. (2018) J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson. GPyTorch: Blackbox MatrixMatrix Gaussian Process Inference with GPU Acceleration. In Advances in Neural Information Processing Systems, 2018.
 Gittins (1979) J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological), 41(2):148–177, 1979. ISSN 0035-9246.
 Grünewälder et al. (2010) S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret Bounds for Gaussian Process Bandit Problems. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 273–280, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/grunewalder10a.html.
 Gu et al. (2021) Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domainspecific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
 Gururangan et al. (2020) S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv preprint, Apr. 2020.
 Hennig and Schuler (2012) P. Hennig and C. J. Schuler. Entropy Search for InformationEfficient Global Optimization. Journal of Machine Learning Research, 13(57):1809–1837, 2012. URL http://jmlr.org/papers/v13/hennig12a.html.
 Hernández-Lobato et al. (2014) J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/069d3bb002acd8d7dd095917f9efe4cb-Paper.pdf.
 Hernández-Lobato et al. (2017) J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In International conference on machine learning, pages 1470–1479. PMLR, 2017.
 Joshi et al. (2020) M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
 Kalyan et al. (2021) K. S. Kalyan, A. Rajasekharan, and S. Sangeetha. AMMUS: A survey of transformer-based pretrained models in natural language processing. arXiv preprint arXiv:2108.05542, 2021.
 Kang et al. (2020) M. Kang, M. Han, and S. J. Hwang. Neural mask generator: Learning to generate adaptive word maskings for language model adaptation. arXiv preprint arXiv:2010.02705, 2020. URL https://arxiv.org/abs/2010.02705.
 Kaufmann et al. (2012) E. Kaufmann, O. Cappe, and A. Garivier. On Bayesian Upper Confidence Bounds for Bandit Problems. In N. D. Lawrence and M. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 592–600, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
 Klein et al. (2017) A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 528–536, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/klein17a.html.
 Korda et al. (2013) N. Korda, E. Kaufmann, and R. Munos. Thompson Sampling for 1Dimensional Exponential Family Bandits. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1448–1456. Curran Associates, Inc., 2013.
 Krause and Ong (2011) A. Krause and C. Ong. Contextual Gaussian Process Bandit Optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper/2011/file/f3f1b7fc5a8779a9e618e1f23a7b7860-Paper.pdf.
 Lai (1987) T. L. Lai. Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. The Annals of Statistics, 15(3):1091–1114, 1987. ISSN 0090-5364.
 Lai and Robbins (1985) T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1):4–22, Mar. 1985. ISSN 0196-8858. doi: 10.1016/0196-8858(85)90002-8.
 Lattimore and Szepesvári (2019) T. Lattimore and C. Szepesvári. Bandit algorithms. Preprint, 2019.
 Lee et al. (2020) J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang. BioBERT: a pretrained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
 Li et al. (2017) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
 Liu et al. (2019) Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692.
 Lu and Roy (2017) X. Lu and B. V. Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, pages 3258–3266, 2017.
 Maddox et al. (2021) W. J. Maddox, M. Balandat, A. G. Wilson, and E. Bakshy. Bayesian Optimization with HighDimensional Outputs. arXiv preprint arXiv:2106.12997, 2021.
 Negoescu et al. (2011) D. M. Negoescu, P. I. Frazier, and W. B. Powell. The knowledgegradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
 Nguyen et al. (2020) V. Nguyen, V. Masrani, R. Brekelmans, M. Osborne, and F. Wood. Gaussian Process Bandit Optimization of the Thermodynamic Variational Objective. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5764–5775. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/3f2dff7862a70f97a59a1fa02c3ec110-Paper.pdf.
 Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep Exploration via Bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.
 Ott et al. (2019) M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACLHLT 2019: Demonstrations, 2019.
 Patterson et al. (2021) D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
 Pleiss et al. (2018) G. Pleiss, J. Gardner, K. Weinberger, and A. G. Wilson. Constant-Time Predictive Distributions for Gaussian Processes. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4114–4123. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/pleiss18a.html.
 Puvis de Chavannes et al. (2021) L. H. Puvis de Chavannes, M. G. K. Kongsbak, T. Rantzau, and L. Derczynski. Hyperparameter power impact in transformer language model training. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 96–118, Virtual, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.sustainlp-1.12. URL https://aclanthology.org/2021.sustainlp-1.12.
 Rasmussen and Williams (2005) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
 Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In International Conference on Learning Representations, 2018.
 Russo et al. (2018) D. J. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen. A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. ISSN 1935-8237. doi: 10.1561/2200000070. URL http://dx.doi.org/10.1561/2200000070.
 Shahriari et al. (2015) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
 Slivkins (2019) A. Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019. ISSN 1935-8237. doi: 10.1561/2200000068. URL http://dx.doi.org/10.1561/2200000068.
 Snelson and Ghahramani (2006) E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.
 Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/05311655a15b75fab86956663e1819cd-Paper.pdf.
 Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 1015–1022, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.
 Sun et al. (2020) Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang. ERNIE 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8968–8975, 2020.
 Thompson (1933) W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3/4):285–294, 1933. ISSN 0006-3444.
 Thompson (1935) W. R. Thompson. On the Theory of Apportionment. American Journal of Mathematics, 57(2):450–456, 1935. ISSN 0002-9327, 1080-6377.
 Titsias (2009) M. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
 Turner et al. (2021) R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, and I. Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. arXiv preprint arXiv:2104.10201, 2021.
 Urteaga and Wiggins (2018) I. Urteaga and C. Wiggins. Variational inference for the multi-armed contextual bandit. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 698–706, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
 Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 Vu et al. (2020) T.-T. Vu, D. Phung, and G. Haffari. Effective unsupervised domain adaptation with adversarially trained language models. arXiv preprint arXiv:2010.01739, 2020. URL https://arxiv.org/abs/2010.01739.
 Wang et al. (2018) A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multitask benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

 Wilson and Nickisch (2015) A. Wilson and H. Nickisch. Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1775–1784, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/wilson15.html.
 Yu and Gen (2010) X. Yu and M. Gen. Introduction to evolutionary algorithms. Springer Science & Business Media, 2010.
Appendix A Implementation details
A.1 Gaussian process
We implement Gaussian process modules based on GPyTorch (14), and execute all experiments with a GP prior and GP fitting details as described in Table 3.
Hyperparameter  Initial Value

GP Model
Mean Function  Constant
Prior constant  0
Kernel Function  Scaled RBF Kernel
Prior outputscale  1
Prior lengthscale  0.25
Observation Model
Likelihood function  Gaussian
Noise variance  1
Training details
Loss function  ExactMarginalLogLikelihood
train max iters  100
loss epsilon  0.01
Optimizer
optimizer  adam
lr  0.1
A.2 RoBERTa pretraining
We execute the RoBERTa pretraining procedure as described in Fairseq’s RoBERTa pretraining tutorial, with specific hyperparameters as described in Table 4.
Hyperparameter  Value

Architecture  RoBERTa base
Task  masked_lm
Criterion  masked_lm
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.01
Training details
batch-size  16
update-freq  16
sample-break-mode  complete
tokens-per-sample  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  0.001
lr-scheduler  polynomial_decay
linear-warmup-updates  1000
Dynamic masking
mask-prob  (set per experiment; see Section 4)
leave-unmasked-prob  0.1
random-token-prob  0.1
A.3 RoBERTa fine-tuning
We execute the RoBERTa fine-tuning procedure for GLUE tasks as described in Fairseq's RoBERTa GLUE tutorial, with specific hyperparameters as described in Tables 5–12.
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  2
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  16
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  534
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  1e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  32
total-num-update  534
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  3
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  32
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  12387
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  1e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  743
total-num-update  12387
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  2
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  16
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  230
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  1e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  13
total-num-update  230
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  2
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  32
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  3311
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  1e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  199
total-num-update  3311
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  2
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  32
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  11327
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  1e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  2832
total-num-update  11327
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  2
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  16
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  204
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  2e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  12
total-num-update  204
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  2
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  32
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  2093
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  1e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  125
total-num-update  2093
Other
init-token  0
separator-token  2
Hyperparameter  Value

Architecture  RoBERTa base
Task  sentence_prediction
Criterion  sentence_prediction
num-classes  1 (regression target)
max-epoch  2
Model details
dropout  0.1
attention-dropout  0.1
weight-decay  0.1
Training details
batch-size  16
update-freq  1
required-batch-size-multiple  1
sample-break-mode  complete
tokens-per-sample  512
max-update  360
max-tokens  4400
max-positions  512
Optimizer
optimizer  adam
adam-betas  (0.9, 0.98)
adam-eps  1e-6
clip-norm  1.0
Learning rate
lr  2e-5
lr-scheduler  polynomial_decay
linear-warmup-updates  21
total-num-update  360
Other
init-token  0
separator-token  2
Appendix B Additional results
B.1 Pretraining losses
We showcase in Figures 4 and 5 the pretraining MLM losses over epochs, computed both in the training and validation datasets, where we observe overfitting for RoBERTa models with fixed hyperparameters, yet robust learning for the proposed GP-TS technique.
B.2 Fine-tuning losses
We showcase in Figure 6 the accuracy in all GLUE task dev sets, after fine-tuning each of the pretrained language models for only two epochs. We note that the downstream performance in GLUE tasks with small datasets (i.e., CoLA, MRPC, RTE) is unsatisfactory (for both fixed and GP-TS pretrained models) when run with the hyperparameters as in Appendix A.3. Although further experimentation is required to improve downstream performance in these GLUE tasks, our claim that GP-TS provides pretrained models easily fine-tuned to a variety of tasks still holds.