Transfer learning has been widely used for tasks in natural language processing (NLP) (Collobert et al., 2011; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Phang et al., 2018). In particular, Devlin et al. (2018) recently demonstrated the effectiveness of finetuning a large-scale language model, pretrained on a large unannotated corpus, on a wide range of NLP tasks including question answering and language inference. They designed two variants of models, $\mathrm{BERT}_{\mathrm{LARGE}}$ (340M parameters) and $\mathrm{BERT}_{\mathrm{BASE}}$ (110M parameters). Although $\mathrm{BERT}_{\mathrm{LARGE}}$ generally outperforms $\mathrm{BERT}_{\mathrm{BASE}}$, it was observed that finetuning sometimes fails when a target dataset has fewer than 10,000 training instances (Devlin et al., 2018; Phang et al., 2018).
When finetuning a big, pretrained language model, dropout (Srivastava et al., 2014) has been used as a regularization technique to prevent co-adaptation of neurons (Vaswani et al., 2017; Devlin et al., 2018; Yang et al., 2019). We provide a theoretical understanding of dropout and its variants, such as Gaussian dropout (Wang & Manning, 2013), variational dropout (Kingma et al., 2015), and dropconnect (Wan et al., 2013), as an adaptive $L^2$-penalty toward the origin (all zero parameters $\mathbf{0}$), and generalize dropout by considering a target model parameter $u$ (instead of the origin), to which we refer as mixout($u$). We illustrate mixout($u$) in Figure 1. To be specific, mixout($u$) replaces all outgoing parameters from a randomly selected neuron with the corresponding parameters of $u$. mixout($u$) prevents optimization from diverging away from $u$ through an adaptive $L^2$-penalty toward $u$. Unlike mixout($u$), dropout encourages a move toward the origin, which deviates away from $u$, since dropout is equivalent to mixout($\mathbf{0}$).
Figure 1: Illustration of mixout($u$), where $u$ is a target model parameter and $w$ is the current model parameter. (a): The vanilla network at $u$. (b): In the dropout network, we randomly choose an input neuron to be dropped (a dotted neuron) with a probability of $p$. That is, all outgoing parameters from the dropped neuron are eliminated (dotted connections). (c): In the mixout($u$) network, the eliminated parameters in (b) are replaced by the corresponding parameters in (a). In other words, the mixout($u$) network at $w$ is the mixture of the vanilla network at $u$ and the dropout network at $w$ with a probability of $p$.
We conduct experiments empirically validating the effectiveness of the proposed mixout($w_{\mathrm{pre}}$), where $w_{\mathrm{pre}}$ denotes a pretrained model parameter. To validate our theoretical findings, we train a fully connected network on EMNIST Digits (Cohen et al., 2017) and finetune it on MNIST. We observe that a finetuning solution of mixout($w_{\mathrm{pre}}$) deviates less from $w_{\mathrm{pre}}$ in the $L^2$-sense than that of dropout. In the main experiment, we finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$) on small training sets of GLUE (Wang et al., 2018). We observe that mixout($w_{\mathrm{pre}}$) reduces the number of unusable models that fail with the chance-level accuracy and increases the average development (dev) scores for all tasks. In the ablation studies, we perform the following three experiments for finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$): (i) the effect of mixout($w_{\mathrm{pre}}$) on a sufficient number of training examples, (ii) the effect of a regularization technique for an additional output layer which is not pretrained, and (iii) the effect of the probability of mixout($w_{\mathrm{pre}}$) compared to dropout. From these ablation studies, we observe three characteristics of mixout($w_{\mathrm{pre}}$): (i) finetuning with mixout($w_{\mathrm{pre}}$) does not harm model performance even with a sufficient number of training examples; (ii) it is beneficial to use a variant of mixout as a regularization technique for the additional output layer; (iii) the proposed mixout($w_{\mathrm{pre}}$) is helpful to the average dev score and to the finetuning stability in a wider range of its hyperparameter $p$ than dropout.
1.1 Related Work
For large-scale pretrained language models (Vaswani et al., 2017; Devlin et al., 2018; Yang et al., 2019), dropout has been used as one of several regularization techniques. The theoretical analysis of dropout as an $L^2$-regularizer toward $\mathbf{0}$ was explored by Wan et al. (2013), where $\mathbf{0}$ is the origin. They provided a sharp characterization of dropout for a simplified setting (generalized linear model). Mianjy & Arora (2019) gave a formal and complete characterization of dropout in deep linear networks with squared loss as a nuclear norm regularization toward $\mathbf{0}$. However, neither Wan et al. (2013) nor Mianjy & Arora (2019) gives a theoretical analysis for the extension of dropout which uses a point other than $\mathbf{0}$.
Wiese et al. (2017), Kirkpatrick et al. (2017), and Schwarz et al. (2018) use an $L^2$-penalty toward a pretrained model parameter to improve model performance. They focus on preventing catastrophic forgetting to enable their models to learn multiple tasks sequentially. They however neither discuss nor demonstrate the effect of an $L^2$-penalty toward the pretrained model parameter on the stability of finetuning. Barone et al. (2017) introduced tuneout, which is a special case of mixout. They applied various regularization techniques including dropout, tuneout, and an $L^2$-penalty toward a pretrained model parameter to finetune neural machine translation. They however demonstrate neither the empirical significance of tuneout compared to other regularization techniques nor its theoretical justification.
2 Preliminaries and Notations
Norms and Loss Functions
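A differentiable function $f$ is strongly convex if there exists $m > 0$ such that
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2} \lVert y - x \rVert^2$$
for all $x$ and $y$.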
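We refer to minimizing
$$\mathcal{L}(w) + \frac{\lambda}{2} \lVert w - u \rVert^2$$
instead of the original loss function $\mathcal{L}(w)$ as “wdecay($u$, $\lambda$)”. $\lambda$ is a regularization coefficient. Usual weight decay of $\lambda$ is equivalent to wdecay($\mathbf{0}$, $\lambda$).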
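The penalized objective can be sketched in a few lines. This is a minimal illustration, assuming the penalty takes the usual form $\mathcal{L}(w) + \frac{\lambda}{2}\lVert w - u \rVert^2$; the names `wdecay_objective` and `toy_loss` are hypothetical and parameters are plain Python lists.

```python
def wdecay_objective(loss_fn, w, u, lam):
    """Return loss_fn(w) + (lam / 2) * ||w - u||^2 for list-valued parameters."""
    sq_dist = sum((wi - ui) ** 2 for wi, ui in zip(w, u))
    return loss_fn(w) + 0.5 * lam * sq_dist


def toy_loss(w):
    # Hypothetical stand-in for the original loss function.
    return sum(wi ** 2 for wi in w)


w = [1.0, 2.0]
u = [1.0, 0.0]          # target parameter the penalty pulls toward
origin = [0.0, 0.0]     # usual weight decay penalizes distance to the origin

toward_u = wdecay_objective(toy_loss, w, u, lam=0.5)
toward_origin = wdecay_objective(toy_loss, w, origin, lam=0.5)
```

Setting `u` to the origin recovers ordinary weight decay, while any other `u` penalizes the distance to that target instead.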
Probability for Dropout and Dropconnect
Dropout (Srivastava et al., 2014) is a regularization technique selecting a neuron to drop with a probability of $p$. Dropconnect (Wan et al., 2013) chooses a parameter to drop with a probability of $p$. To emphasize their hyperparameter $p$, we write dropout and dropconnect with a drop probability of $p$ as “dropout($p$)” and “dropconnect($p$)”, respectively. dropout($p$) is a special case of dropconnect($p$) if we simultaneously drop the parameters outgoing from each dropped neuron.
Inverted Dropout and Dropconnect
In the case of dropout($p$), a neuron is retained with a probability of $1 - p$ during training. If we denote the weight parameter of that neuron as $w$ during training, then we use $(1 - p)w$ for that weight parameter at test time (Srivastava et al., 2014). This ensures that the expected output of a neuron is the same as the actual output at test time. In this paper, dropout($p$) refers to inverted dropout($p$), which uses $w / (1 - p)$ instead of $w$ during training. By doing so, we do not need to compute the output separately at test time. Similarly, dropconnect($p$) refers to inverted dropconnect($p$).
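The effect of the inverted scaling can be checked numerically. The sketch below assumes a drop probability `p`, a keep probability `1 - p`, and rescaling of surviving weights by `1 / (1 - p)`; for simplicity it draws one mask per weight (dropconnect-style) rather than one per neuron, and `inverted_dropout` is a hypothetical helper name.

```python
import random


def inverted_dropout(weights, p, rng):
    """Zero each weight with probability p; rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [wi / keep if rng.random() < keep else 0.0 for wi in weights]


rng = random.Random(0)
w = [0.5, -1.0, 2.0]
p = 0.3
n_samples = 20000

# Monte Carlo estimate of the expected masked weights.
mean = [0.0] * len(w)
for _ in range(n_samples):
    sample = inverted_dropout(w, p, rng)
    mean = [m + s / n_samples for m, s in zip(mean, sample)]
```

Because the average of the masked weights stays close to `w`, the unmasked weights can be used as-is at test time.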
3 Analysis of Dropout and Its Generalization
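We start our theoretical analysis by investigating dropconnect, which is a general form of dropout, and then apply the result derived from dropconnect to dropout. The iterative SGD equation with a learning rate of $\eta$ for dropconnect($p$) is
$$w^{(t+1)} = w^{(t)} - \eta B^{(t)} \nabla \mathcal{L}\big( (\mathbb{E} B_1^{(t)})^{-1} B^{(t)} w^{(t)} \big), \quad t = 0, 1, 2, \ldots,$$
where $B^{(t)} = \mathrm{diag}(B_1^{(t)}, \ldots, B_d^{(t)})$ and $B_i^{(t)}$'s are mutually independent $\mathrm{Bernoulli}(1 - p)$ random variables with a drop probability of $p$ for all $i$ and $t$.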
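Gaussian dropout (Wang & Manning, 2013) and variational dropout (Kingma et al., 2015) use random masks other than Bernoulli random masks to improve dropout. To explain these variants of dropout as well, we set a random mask matrix $M = \mathrm{diag}(M_1, \ldots, M_d)$ to satisfy $\mathbb{E} M_i = \mu$ and $\mathrm{Var}(M_i) = \sigma^2$ for all $i$. Now we define a random mixture function with respect to $w$ from $u$ and $M$ as
$$\Phi(w; u, M) = \mu^{-1} \big( (I - M) u + M w - (1 - \mu) u \big),$$
and a minimization problem with “mixconnect($u$, $\mu$, $\sigma^2$)” as
$$\min_w \mathbb{E} \mathcal{L}\big( \Phi(w; u, M) \big).$$
If we assume the strong convexity of the loss function $\mathcal{L}$, we can derive a lower bound for $\mathbb{E} \mathcal{L}(\Phi(w; u, M))$ as in Theorem 1:
Theorem 1. Assume that the loss function $\mathcal{L}$ is strongly convex, and let $\Phi(w; u, M)$ be the random mixture function defined above with $\mathbb{E} M_i = \mu$ and $\mathrm{Var}(M_i) = \sigma^2$ for all $i$. Then, there exists $m > 0$ such that
$$\mathbb{E} \mathcal{L}\big( \Phi(w; u, M) \big) \ge \mathcal{L}(w) + \frac{m \sigma^2}{2 \mu^2} \lVert w - u \rVert^2 \quad (7)$$
for all $w$.
Theorem 1 shows that minimizing the l.h.s. of equation 7 minimizes the r.h.s. of equation 7 when the r.h.s. is a sharp lower limit of the l.h.s. The strong convexity of $\mathcal{L}$ means that $\mathcal{L}$ is bounded from below by a quadratic function, and the inequality of equation 7 comes from the strong convexity. Hence, the equality holds if $\mathcal{L}$ is quadratic, and mixconnect($u$, $\mu$, $\sigma^2$) is an $L^2$-regularizer with a regularization coefficient of $m \sigma^2 / \mu^2$.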
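A small sketch can make the random mixture function concrete. It assumes the element-wise form $\Phi(w; u, M) = \mu^{-1}((I - M)u + Mw - (1 - \mu)u)$ with a diagonal mask, and uses a Bernoulli mask with mean $\mu = 1 - p$ as the example; `random_mixture` is a hypothetical name, and the expectation is computed exactly by enumerating the two mask outcomes rather than by sampling.

```python
def random_mixture(w, u, mask, mu):
    """Element-wise Phi(w; u, M) = ((1 - M) * u + M * w - (1 - mu) * u) / mu."""
    return [((1.0 - m) * ui + m * wi - (1.0 - mu) * ui) / mu
            for wi, ui, m in zip(w, u, mask)]


p = 0.25                 # probability that a coordinate is swapped for u
mu = 1.0 - p             # mean of a Bernoulli(1 - p) mask
w = [2.0, -1.0]
u = [0.5, 0.5]

kept = random_mixture(w, u, [1.0, 1.0], mu)      # mask = 1 on every coordinate
swapped = random_mixture(w, u, [0.0, 0.0], mu)   # mask = 0 on every coordinate

# Exact per-coordinate expectation over the two mask outcomes.
expected = [mu * k + p * s for k, s in zip(kept, swapped)]
```

Under these assumptions a swapped coordinate equals the target value exactly, and the correction term `- (1 - mu) * u` together with the `1 / mu` rescaling keeps the mixture unbiased around `w`.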
3.1 Mixconnect to Mixout
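We propose mixout as a special case of mixconnect, which is motivated by the relationship between dropout and dropconnect. We assume that
$$w = \big( w_1^{(N_1)}, \ldots, w_{d_1}^{(N_1)}, w_1^{(N_2)}, \ldots, w_{d_2}^{(N_2)}, \ldots, w_1^{(N_k)}, \ldots, w_{d_k}^{(N_k)} \big),$$
where $w_j^{(N_i)}$ is the $j$th parameter outgoing from the neuron $N_i$. We set the corresponding $M$ to
$$M = \mathrm{diag}\big( M^{(N_1)}, \ldots, M^{(N_1)}, M^{(N_2)}, \ldots, M^{(N_2)}, \ldots, M^{(N_k)}, \ldots, M^{(N_k)} \big), \quad (8)$$
where $\mathbb{E} M^{(N_i)} = \mu$ and $\mathrm{Var}(M^{(N_i)}) = \sigma^2$ for all $i$. In this paper, we set $M^{(N_i)}$ to $\mathrm{Bernoulli}(1 - p)$ for all $i$, and mixout($u$) hereafter refers to this correlated version of mixconnect with Bernoulli random masks. We write it as “mixout($u$, $p$)” when we emphasize the mix probability $p$.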
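The neuron-level masking can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: `weight[i][j]` is taken to be the j-th parameter outgoing from input neuron i, a single Bernoulli(1 - p) mask is shared by all parameters of a neuron, and the mixture uses the rescaled form `((1 - m) * u + m * w - p * u) / (1 - p)`. The function name `mixout` and the list-of-lists layout are choices made for this example.

```python
import random


def mixout(weight, target, p, rng):
    """Swap all outgoing parameters of a neuron for the target's with probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("mix probability p must lie in [0, 1)")
    keep = 1.0 - p
    mixed = []
    for w_row, u_row in zip(weight, target):
        m = 1.0 if rng.random() < keep else 0.0   # one mask per neuron
        mixed.append([((1.0 - m) * u + m * w - p * u) / keep
                      for w, u in zip(w_row, u_row)])
    return mixed


rng = random.Random(0)
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # current parameters
target = [[0.5, 0.5], [0.0, 0.0], [1.0, -1.0]]   # target parameters
mixed = mixout(weight, target, p=0.5, rng=rng)
```

With an all-zero `target`, the same function reduces to inverted dropout on the rows of `weight`, which matches the view of dropout as mixout toward the origin.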
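Corollary 1.1. Assume that the loss function $\mathcal{L}$ is strongly convex. We denote the random mixture function of mixout($u$, $p$), which is equivalent to that of mixconnect($u$, $1 - p$, $p - p^2$), as $\Phi(w; u, M)$ where $M$ is defined in equation 8. Then, there exists $m > 0$ such that
$$\mathbb{E} \mathcal{L}\big( \Phi(w; u, M) \big) \ge \mathcal{L}(w) + \frac{m p}{2 (1 - p)} \lVert w - u \rVert^2 \quad (9)$$
for all $w$.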
Corollary 1.1 is a straightforward result from Theorem 1. As the mix probability $p$ in equation 9 increases to 1, the $L^2$-regularization coefficient of $mp / (1 - p)$ increases to infinity. It means that $p$ of mixout($u$, $p$) can adjust the strength of the $L^2$-penalty toward $u$ in optimization. mixout($u$) differs from wdecay($u$) since the regularization coefficient of mixout($u$) depends on $m$ determined by the current model parameter $w$.
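For a quadratic loss the bound can be checked exactly in one dimension. The sketch below assumes a Bernoulli mask with drop probability `p`, so the mixture equals `(w - p * u) / (1 - p)` with probability `1 - p` and `u` with probability `p`, and it assumes the penalty coefficient `m * p / (1 - p)` on `(1/2) * (w - u)^2`; the loss `0.5 * m * (w - c)^2` and all function names are hypothetical choices for illustration.

```python
def quadratic_loss(w, m, c):
    """A one-dimensional quadratic loss with curvature m and minimizer c."""
    return 0.5 * m * (w - c) ** 2


def expected_mixout_loss(w, u, p, m, c):
    """Exact expectation of the loss over the two outcomes of the mask."""
    keep = 1.0 - p
    phi_keep = (w - p * u) / keep   # mask = 1: rescaled current parameter
    phi_swap = u                    # mask = 0: target parameter
    return keep * quadratic_loss(phi_keep, m, c) + p * quadratic_loss(phi_swap, m, c)


def penalized_loss(w, u, p, m, c):
    """Loss at w plus the penalty with coefficient m * p / (1 - p)."""
    return quadratic_loss(w, m, c) + m * p / (2.0 * (1.0 - p)) * (w - u) ** 2


w, u, m, c = 1.5, -0.5, 2.0, 0.25
gaps = [expected_mixout_loss(w, u, p, m, c) - penalized_loss(w, u, p, m, c)
        for p in (0.1, 0.5, 0.9)]
```

In this quadratic case the gap vanishes for every `p`, and the coefficient `m * p / (1 - p)` grows without bound as `p` approaches 1, which is the sense in which the mix probability controls the strength of the penalty.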
We often apply dropout to specific layers. For instance, Simonyan & Zisserman (2014) applied dropout to fully connected layers only. We generalize Theorem 1 to the case in which mixout is only applied to specific layers, which can be done by constructing $M$ in a particular way. We demonstrate this approach in Supplemental B and show that mixout for specific layers adaptively $L^2$-penalizes their parameters.
3.2 Mixout for Pretrained Models
Hoffer et al. (2017) have empirically shown that
$$\lVert w_t - w_0 \rVert \sim \log t, \quad (10)$$
where $w_t$ is a model parameter after the $t$-th SGD step. When training from scratch, we usually sample an initial model parameter $w_0$ from a distribution whose mean is $\mathbf{0}$ and whose variance is small. Since $w_0$ is close to the origin, $w_t$ is away from the origin only with a large $t$ by equation 10. When finetuning, we initialize our model parameter from a pretrained model parameter $w_{\mathrm{pre}}$. Since we usually obtain $w_{\mathrm{pre}}$ by training from scratch on a large pretraining dataset, $w_{\mathrm{pre}}$ is often far away from the origin. By Corollary 1.1, dropout $L^2$-penalizes the model parameter for deviating away from the origin rather than from $w_{\mathrm{pre}}$. To explicitly prevent deviation from $w_{\mathrm{pre}}$, we instead propose to use mixout($w_{\mathrm{pre}}$).
4 Verification of Theoretical Results for Mixout on MNIST
Wiese et al. (2017) have highlighted that wdecay($w_{\mathrm{pre}}$) is an effective regularization technique to avoid catastrophic forgetting during finetuning. Because mixout($w_{\mathrm{pre}}$) keeps the finetuned model in the vicinity of the pretrained model, similarly to wdecay($w_{\mathrm{pre}}$), we suspect the proposed mixout($w_{\mathrm{pre}}$) to have a similar effect of alleviating the issue of catastrophic forgetting. To empirically verify this claim, we pretrain a 784-300-100-10 fully-connected network on EMNIST Digits (Cohen et al., 2017), and finetune it on MNIST. For a more detailed description of the model architecture and datasets, see Supplemental C.1.
In the pretraining stage, we run five random experiments with a batch size of 32 for training epochs. We use Adam (Kingma & Ba, 2014) with a learning rate of , , , , learning rate warm-up over the first 10% steps of the total steps, and linear decay of the learning rate after the warm-up. We use dropout for all layers except the input and output layers. We select $w_{\mathrm{pre}}$ whose validation accuracy on EMNIST Digits is best (0.992) in all experiments.
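The warm-up and decay schedule described here can be sketched as a function of the step index. The sketch assumes a linear warm-up from zero over the first 10% of steps and a linear decay to zero at the final step; the peak learning rate is a free parameter here (`peak_lr` and `lr_at_step` are hypothetical names), since this example does not depend on any particular value.

```python
def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.1):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Learning rate at a few points of a hypothetical 1000-step run.
schedule = [lr_at_step(s, total_steps=1000, peak_lr=1.0)
            for s in (0, 50, 100, 550, 1000)]
```

The returned value would typically be assigned to the optimizer's learning rate before each update.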
For finetuning, most of the model hyperparameters are kept the same as in pretraining, with the exception of the learning rate, number of training epochs, and regularization techniques. We train with a learning rate of for 5 training epochs. We replace dropout($p$) with mixout($w_{\mathrm{pre}}$, $p$). We do not use any other regularization technique such as wdecay($\mathbf{0}$) and wdecay($w_{\mathrm{pre}}$). We monitor $\lVert w_{\mathrm{ft}} - w_{\mathrm{pre}} \rVert$ (where $w_{\mathrm{ft}}$ is a model parameter after finetuning), validation accuracy on MNIST, and validation accuracy on EMNIST Digits to compare mixout($w_{\mathrm{pre}}$, $p$) to dropout($p$) across 10 random restarts (each restart uses the same pretrained model parameter $w_{\mathrm{pre}}$ but a different shuffling of the finetuning data).
As shown in Figure 2 (a), after finetuning with mixout($w_{\mathrm{pre}}$, $p$), the deviation from $w_{\mathrm{pre}}$ is minimized in the $L^2$-sense. This result verifies Corollary 1.1. We demonstrate that the validation accuracy of mixout($w_{\mathrm{pre}}$, $p$) has greater robustness to the choice of $p$ than that of dropout($p$). In Figure 2 (b), both dropout($p$) and mixout($w_{\mathrm{pre}}$, $p$) result in high validation accuracy on the target task (MNIST) over a range of $p$, although mixout($w_{\mathrm{pre}}$, $p$) is much more robust with respect to the choice of the mix probability $p$. In Figure 2 (c), the validation accuracy of mixout($w_{\mathrm{pre}}$, $p$) on the source task (EMNIST Digits) drops from the validation accuracy of the model at $w_{\mathrm{pre}}$ (0.992) to approximately 0.723 regardless of $p$. On the other hand, the validation accuracy of dropout($p$) on the source task drops by 0.041, 0.074 and 0.105 more, respectively, than that of mixout($w_{\mathrm{pre}}$, $p$) for the three compared values of $p$.
5 Finetuning a Pretrained Language Model with Mixout
In order to experimentally validate the effectiveness of mixout, we finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ on a subset of GLUE (Wang et al., 2018) tasks (RTE, MRPC, CoLA, and STS-B) with mixout($w_{\mathrm{pre}}$). We choose them because Phang et al. (2018) have observed that it was unstable to finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ on these four tasks. We use the publicly available pretrained model released by Devlin et al. (2018), ported into PyTorch by HuggingFace (https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin). We use the learning setup and hyperparameters recommended by Devlin et al. (2018). We use Adam with a learning rate of , , , learning rate warmup over the first 10% steps of the total steps, and linear decay of the learning rate after the warmup finishes. We train with a batch size of 32 for 3 training epochs. Since the pretrained $\mathrm{BERT}_{\mathrm{LARGE}}$ is the sentence encoder, we have to create an additional output layer, which is not pretrained. We initialize each parameter of it by . We describe our experimental setup further in Supplemental C.2.
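The original regularization strategy used in Devlin et al. (2018) for finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ is using both dropout and wdecay($\mathbf{0}$) for all layers except layer normalization and intermediate layers activated by GELU (Hendrycks & Gimpel, 2016). We however can use neither mixout($w_{\mathrm{pre}}$) nor wdecay($w_{\mathrm{pre}}$) for the additional output layer, which was not pretrained and therefore does not have $w_{\mathrm{pre}}$. We do not use any regularization for the additional output layer when finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$). For the other layers, we replace dropout and wdecay($\mathbf{0}$) with mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$), respectively.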
Phang et al. (2018) have reported that large pretrained models (e.g., $\mathrm{BERT}_{\mathrm{LARGE}}$) are prone to degenerate performance when finetuned on a task with a small number of training examples, and that multiple random restarts (using the same pretrained model parameter $w_{\mathrm{pre}}$, with each random restart differing from the others in the shuffling of target data and the initialization of the additional output layer) are required to obtain a usable model better than random prediction. To compare the finetuning stability of the regularization techniques, we need to demonstrate the distribution of model performance. We therefore train $\mathrm{BERT}_{\mathrm{LARGE}}$ with each regularization strategy on each task with 20 random restarts. We validate each random restart on the dev set to observe the behaviour of the proposed mixout and finally evaluate it on the test set for generalization. We present the test score of our proposed regularization strategy on each task in Supplemental C.3.
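We finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$) for all tasks. For the baselines, we finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ with both dropout and wdecay($\mathbf{0}$) as well as with wdecay($w_{\mathrm{pre}}$). These choices are made based on the experiments on RTE investigating the effect of the mix probability in Section 6.3. We observe that finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$) is significantly more stable with a large mix probability $p$, while finetuning with dropout becomes unstable as $p$ increases.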
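In Figure 3, we plot the distributions of the dev scores from 20 random restarts when finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with various regularization strategies on each task. For conciseness, we only show four regularization strategies: Devlin et al. (2018)’s: both dropout and wdecay($\mathbf{0}$), Wiese et al. (2017)’s: wdecay($w_{\mathrm{pre}}$), ours: mixout($w_{\mathrm{pre}}$), and ours+Wiese et al. (2017)’s: both mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$). As shown in Figure 3 (a–c), we observe many finetuning runs that fail with the chance-level accuracy when we finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ with both dropout and wdecay($\mathbf{0}$) on RTE, MRPC, and CoLA. We also observe many degenerate model configurations when we use wdecay($w_{\mathrm{pre}}$) without mixout($w_{\mathrm{pre}}$).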
Unlike existing regularization strategies, when we use mixout($w_{\mathrm{pre}}$) as a regularization technique with or without wdecay($w_{\mathrm{pre}}$) for finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$, the number of degenerate model configurations that fail with a chance-level accuracy significantly decreases. For example, in Figure 3 (c), we have only one degenerate model configuration when finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$) on CoLA, while we observe seven and six degenerate models with Devlin et al. (2018)’s and Wiese et al. (2017)’s regularization strategies, respectively.
In Figure 3 (a), we further improve the stability of finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ by using both mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$). Figure 3 (d) shows two and one degenerate model configurations with Devlin et al. (2018)’s and Wiese et al. (2017)’s strategies, respectively, but we do not have any degenerate resulting model with ours and ours+Wiese et al. (2017)’s. In Figure 3 (b, c), we observe that the number of degenerate model configurations increases when we use wdecay($w_{\mathrm{pre}}$) in addition to mixout($w_{\mathrm{pre}}$). In short, applying our proposed mixout significantly stabilizes the finetuning results of $\mathrm{BERT}_{\mathrm{LARGE}}$ on small training sets regardless of whether we use wdecay($w_{\mathrm{pre}}$).
In Table 1, we report the average and the best dev scores across 20 random restarts for each task with various regularization strategies. The average dev scores with mixout($w_{\mathrm{pre}}$) increase for all tasks. For instance, the mean dev score of finetuning with mixout($w_{\mathrm{pre}}$) on CoLA is 57.9, which is a 49.2% increase over 38.8 obtained by finetuning with both dropout and wdecay($\mathbf{0}$). We observe that using wdecay($w_{\mathrm{pre}}$) also improves the average dev scores for most tasks compared to using both dropout and wdecay($\mathbf{0}$). We however observe that finetuning with mixout($w_{\mathrm{pre}}$) outperforms that with wdecay($w_{\mathrm{pre}}$) on average. This confirms that mixout($w_{\mathrm{pre}}$) has a different effect for finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ compared to wdecay($w_{\mathrm{pre}}$), since mixout($w_{\mathrm{pre}}$) is an adaptive $L^2$-regularizer along the optimization trajectory.
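The relative increase quoted for CoLA follows directly from the two mean dev scores stated in the text, as this short worked computation shows:

```latex
\[
\frac{57.9 - 38.8}{38.8} \;=\; \frac{19.1}{38.8} \;\approx\; 0.492
\qquad \text{(a relative increase of about } 49.2\%\text{)}.
\]
```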
Since finetuning a large pretrained language model such as $\mathrm{BERT}_{\mathrm{LARGE}}$ on a small training set frequently fails, the final model performance has often been reported as the maximum dev score (Devlin et al., 2018; Phang et al., 2018) among a few random restarts. We thus report the best dev score for each setting in Table 1. According to the best dev scores as well, mixout($w_{\mathrm{pre}}$) improves model performance for all tasks compared to using both dropout and wdecay($\mathbf{0}$). For instance, using mixout($w_{\mathrm{pre}}$) improves the maximum dev score by 0.9 compared to using both dropout and wdecay($\mathbf{0}$) on MRPC. Unlike the average dev scores, the best dev scores achieved by using wdecay($w_{\mathrm{pre}}$) are better than those achieved by using mixout($w_{\mathrm{pre}}$), except on RTE, on which it was better to use mixout($w_{\mathrm{pre}}$) than wdecay($w_{\mathrm{pre}}$).
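Table 1: Mean (best) dev scores across 20 random restarts for each task and regularization strategy.
|TECHNIQUE 1||TECHNIQUE 2||RTE||MRPC||CoLA||STS-B|
|dropout||wdecay($\mathbf{0}$)||56.5 (73.6)||83.4 (90.4)||38.8 (63.3)||82.4 (90.3)|
|-||wdecay($w_{\mathrm{pre}}$)||56.3 (71.5)||86.2 (91.6)||41.9 (65.6)||85.4 (90.5)|
|-||wdecay($w_{\mathrm{pre}}$)||51.5 (70.8)||85.8 (91.5)||35.4 (64.7)||80.7 (90.6)|
|-||wdecay($w_{\mathrm{pre}}$)||57.0 (70.4)||85.8 (91.0)||48.1 (63.9)||89.6 (90.3)|
|-||wdecay($w_{\mathrm{pre}}$)||54.6 (71.1)||84.2 (91.8)||45.6 (63.8)||84.3 (90.1)|
|-||mixout($w_{\mathrm{pre}}$)||61.6 (74.0)||87.1 (91.1)||57.4 (62.1)||89.6 (90.3)|
|-||mixout($w_{\mathrm{pre}}$)||64.0 (74.0)||89.0 (90.7)||57.9 (63.8)||89.4 (90.3)|
|-||mixout($w_{\mathrm{pre}}$)||64.3 (73.3)||88.2 (91.4)||55.2 (63.4)||89.4 (90.0)|
|mixout($w_{\mathrm{pre}}$)||wdecay($w_{\mathrm{pre}}$)||65.3 (74.4)||87.8 (91.8)||51.9 (64.0)||89.6 (90.6)|
|mixout($w_{\mathrm{pre}}$)||wdecay($w_{\mathrm{pre}}$)||62.8 (74.0)||86.3 (90.9)||58.3 (65.1)||89.7 (90.3)|
|mixout($w_{\mathrm{pre}}$)||wdecay($w_{\mathrm{pre}}$)||65.0 (75.5)||88.6 (91.3)||58.1 (65.1)||89.5 (90.0)|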
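We investigate the effect of combining both mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$) to see whether they are complementary. We finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ with both mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$). This leads to improvements not only in the average dev scores but also in the best dev scores compared to using wdecay($w_{\mathrm{pre}}$) and using both dropout and wdecay($\mathbf{0}$). The experiments in this section confirm that using mixout($w_{\mathrm{pre}}$) as one of several regularization techniques prevents finetuning instability and yields gains in dev scores.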
6 Ablation Study
In this section, we perform ablation experiments to better understand mixout($w_{\mathrm{pre}}$). Unless explicitly stated, all experimental setups are the same as in Section 5.
6.1 Mixout with a Sufficient Number of Training Examples
We showed the effectiveness of the proposed mixout for finetuning with only a few training examples in Section 5. In this section, we investigate the effectiveness of the proposed mixout in the case of a larger finetuning set. Since it has been stable to finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ on a sufficient number of training examples (Devlin et al., 2018; Phang et al., 2018), we expect to see a change in the behaviour of mixout($w_{\mathrm{pre}}$) when we use it to finetune $\mathrm{BERT}_{\mathrm{LARGE}}$ on a larger training set.
We train $\mathrm{BERT}_{\mathrm{LARGE}}$ by using both mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$) with 20 random restarts on SST-2 (for the description of the SST-2 dataset, see Supplemental C.2). We also train $\mathrm{BERT}_{\mathrm{LARGE}}$ by using both dropout and wdecay($\mathbf{0}$) with 20 random restarts on SST-2 as the baseline. In Table 2, we report the mean and maximum of SST-2 dev scores across 20 random restarts with each regularization strategy. We observe that there is little difference between their mean and maximum dev scores on a larger training set, although using both mixout($w_{\mathrm{pre}}$) and wdecay($w_{\mathrm{pre}}$) outperformed using both dropout and wdecay($\mathbf{0}$) on small training sets in Section 5.
|TECHNIQUE 1||TECHNIQUE 2||SST-2|
6.2 Effect of a Regularization Technique for an Additional Output Layer
In this section, we explore the effect of a regularization technique for an additional output layer. There are two regularization techniques available for the additional output layer: dropout and mixout($w_0$), where $w_0$ is its randomly initialized parameter. Either of these strategies differs from the earlier experiments in Section 5, where we did not apply any regularization to the additional output layer.
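Table 3: Mean (best) dev scores across 20 random restarts when varying the regularization technique for the additional output layer.
|OUTPUT-LAYER TECHNIQUE||RTE||MRPC||CoLA||STS-B|
|-||61.6 (74.0)||87.1 (91.1)||57.4 (62.1)||89.6 (90.3)|
|mixout($w_0$)||66.5 (75.5)||88.1 (92.4)||58.7 (65.6)||89.7 (90.6)|
|dropout||57.2 (70.8)||85.9 (92.5)||48.9 (64.3)||89.2 (89.8)|
|The best of each result from Table 1||65.3 (75.5)||89.0 (91.8)||58.3 (65.6)||89.7 (90.6)|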
We report the average and best dev scores across 20 random restarts when finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$) while varying the regularization technique for the additional output layer in Table 3 (in this experiment, we use neither wdecay($\mathbf{0}$) nor wdecay($w_{\mathrm{pre}}$)). We observe that using mixout($w_0$) for the additional output layer improves both the average and best dev score on RTE, CoLA, and STS-B. In the case of MRPC, we have the highest best dev score by using dropout for the additional output layer, while the highest mean dev score is obtained by using mixout($w_0$) for it. In Section 3.2, we discussed how mixout($w_0$) does not differ from dropout when the layer is randomly initialized, since we sample $w_0$ from a distribution whose mean and variance are $\mathbf{0}$ and small, respectively. Although the additional output layer is randomly initialized, we observe a significant difference between dropout and mixout($w_0$) in this layer. We conjecture that $\lVert w_0 \rVert$ is not sufficiently small because $\mathbb{E} \lVert w_0 \rVert^2$ is proportional to the dimensionality of the layer (2,024 for this experiment). We therefore expect mixout($w_0$) to behave differently from dropout even for the case of training from scratch.
In the last row of Table 3, we present the best of the corresponding results from Table 1. We have the highest mean and best dev scores when we respectively use mixout($w_{\mathrm{pre}}$) and mixout($w_0$) for the pretrained layers and the additional output layer on RTE, CoLA, and STS-B. The highest mean dev score on MRPC is obtained by using mixout($w_{\mathrm{pre}}$) for the pretrained layers, which is one of the results in Table 1. We have the highest best dev score on MRPC when we use mixout($w_{\mathrm{pre}}$) and dropout for the pretrained layers and the additional output layer, respectively. The experiments in this section reveal that using mixout($w_0$) for a randomly initialized layer of a pretrained model is one of the regularization schemes that improve the average dev score and the best dev score.
6.3 Effect of Mix Probability for Mixout and Dropout
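We explore the effect of the hyperparameter $p$ when finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$, $p$) and dropout($p$). We train $\mathrm{BERT}_{\mathrm{LARGE}}$ with mixout($w_{\mathrm{pre}}$, $p$) on RTE with 20 random restarts. We also train $\mathrm{BERT}_{\mathrm{LARGE}}$ after replacing mixout($w_{\mathrm{pre}}$, $p$) by dropout($p$) with 20 random restarts. We do not use any regularization technique for the additional output layer. Because we use neither wdecay($\mathbf{0}$) nor wdecay($w_{\mathrm{pre}}$) in this section, dropout($0$) and mixout($w_{\mathrm{pre}}$, $0$) are equivalent to finetuning without regularization.
It is not helpful to vary $p$ for dropout($p$), while mixout($w_{\mathrm{pre}}$, $p$) helps significantly in a wide range of $p$. Figure 4 shows distributions of RTE dev scores across 20 random restarts when finetuning $\mathrm{BERT}_{\mathrm{LARGE}}$ with dropout($p$) and mixout($w_{\mathrm{pre}}$, $p$) for each value of $p$. The mean dev score of finetuning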