1 Introduction
The regularisation technique known as dropout underpins numerous state-of-the-art results in deep learning (Hinton et al. 2012; Srivastava et al. 2014), and its application has received much attention in the form of optimisation (Wang & Manning 2013) and attempts at explaining or improving its approximation properties (Baldi & Sadowski 2013; Zolna et al. 2017; Ma et al. 2016). The dominant perspective today views dropout as either an implicit ensemble method (Warde-Farley et al. 2013) or averaging over an approximate Bayesian posterior (Gal & Ghahramani 2016a). Regardless of which view we take, dropout training is carried out the same way, by minimising the expectation of the loss over randomly sampled dropout masks. However, at test time these views naturally lead to different algorithms: the Bayesian approach computes an arithmetic average as it marginalises out the weight uncertainty, while the ensemble approach typically uses the geometric average due to its close relationship to the loss. Collectively they are called MC dropout and neither is clearly better than the other (Warde-Farley et al. 2013). A third way to make predictions is to “turn dropout off”, that is, propagate expected values through the network in a single, deterministic pass. This deterministic (also known as standard) dropout is considered to be an excellent approximation to MC dropout.
This situation is unsatisfactory as it does not provide theoretical grounding for dropout, without which the choice of dropout variant remains arbitrary. In this paper, we provide such theoretical foundations. First, we prove the dropout objective to be a common lower bound on the objectives of a family of infinitely many models. This family includes models corresponding to the three aforementioned methods of evaluation: the arithmetic averaging, the geometric averaging, and the deterministic. Thus by maximising the dropout objective we get a single set of parameters and many models that all have the same parameters but differ in how they make predictions.
This allows us to train once and perform model selection at validation time by evaluating the different methods of making predictions corresponding to individual models in the family. Second, we turn the conventional perspective on its head by showing that while dropout training performs stochastic regularisation, the trained model is best viewed as deterministic, not as a stochastic model with a deterministic approximation.
This paper is structured as follows. In §2, we revisit variational dropout (Gal & Ghahramani 2016a) and demonstrate that, despite common perception, sharing of masks is not necessary, neither in theory nor in practice. Then, by recasting dropout in a simple conditional form, we highlight the counterintuitive role played by the variational posterior. §3 contains our main contributions. Here we construct a family of conditional models whose MAP objectives are all lower bounded by the usual dropout objective, and identify a member of this family as best in terms of model fit. In §4, we select the best of this family in terms of generalisation to improve language modelling. Finally, creating a cheap approximation to the bias of this model allows us to get better results from model tuning.
2 Variational dropout
Since its original publication (Hinton et al. 2012), dropout had been considered a stochastic regularisation method, implemented as a tweak to the loss function. That was until Gal & Ghahramani (2016a) grounded dropout in much-needed theory. Their subsequent work (Gal & Ghahramani 2016b) focused on RNNs, showing that if dropout masks are shared between time steps, the objective for their proposed variational model is the same as the commonly used dropout objective with an $\ell_2$ penalty. Their method became known as variational dropout, not to be confused with Kingma et al. (2015), and is used in state-of-the-art sequential models (Merity et al. 2017; Melis et al. 2017). Before we move on to a more general formulation we revisit it to better understand its critical features.
First, we recall the derivation of variational dropout. Consider an RNN that takes input $x$ and maps it to output $y$ and is trained on a set of data points in paired sets $X$, $Y$. A variational lower bound on the log likelihood is obtained as follows:
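$$\ln p(Y \mid X) \;\geq\; \int q(\omega) \ln p(Y \mid X, \omega)\, d\omega \;-\; \mathrm{KL}\big(q(\omega)\,\|\,p(\omega)\big) \qquad (1)$$
where $p(Y \mid X, \omega)$ is defined by the RNN with weights $\omega$. Variational Bayesian methods then maximise this lower bound with respect to the variational distribution $q(\omega)$. For variational dropout, $q(\omega)$ takes the form of a mixture of two gaussians with small variances: one with zero mean that represents the dropped out rows of weights, and another with mean $\Theta_k$:
$$q(\omega_k) = p\,\mathcal{N}(\omega_k;\, 0,\, \sigma^2 I) + (1 - p)\,\mathcal{N}(\omega_k;\, \Theta_k,\, \sigma^2 I)$$
In the above, $k$ is the index of a row of a weight matrix, $p$ is the dropout rate and $\sigma^2$ is the small variance. Dropping whole rows of weights is equivalent to the more familiar view of dropout over units. The prior over the weights is a zero mean gaussian.
The loss is defined based on Eq. 1. The integrals are approximated using a single sample $\hat{\omega} \sim q(\omega)$, and the KL term is approximated with weight decay on $\Theta$:
$$\mathcal{L} = -\sum_i \ln p(y_i \mid x_i, \hat{\omega}_i) + \lambda \lVert \Theta \rVert_2^2 \qquad (2)$$
where $\hat{\omega}_i$ is the sample drawn for data point $i$ and $\lambda$ is the weight decay coefficient. The same dropout mask (and consequently the same $\hat{\omega}$) is employed at every time step. This sharing of masks is considered the defining characteristic of variational dropout, but we note in passing that the theory for the non-shared masks case is very similar and there is little difference between them in practice with LSTMs (see Appendix A). With this we conclude the recap of variational dropout, and describe our contributions in the rest of the paper.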
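The difference between shared and non-shared masks can be made concrete with a small sketch. The function name `sample_masks` and its parameters are hypothetical, chosen for illustration only; the sketch samples unit-level masks, which the text above notes is equivalent to dropping rows of weights.

```python
import random


def sample_masks(num_steps, num_units, dropout_rate, rng, shared=True):
    """Return one dropout mask (list of 0.0/1.0) per time step.

    shared=True reuses a single mask at every time step (variational dropout);
    shared=False draws an independent mask for each time step.
    """
    def one_mask():
        return [0.0 if rng.random() < dropout_rate else 1.0
                for _ in range(num_units)]

    if shared:
        mask = one_mask()
        return [mask] * num_steps
    return [one_mask() for _ in range(num_steps)]


rng = random.Random(0)
shared_masks = sample_masks(num_steps=10, num_units=16, dropout_rate=0.5, rng=rng)
independent_masks = sample_masks(10, 16, 0.5, rng, shared=False)
```

In both cases a mask is cheap to sample, which is what the single-sample approximation of the integrals relies on.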
2.1 Dropout as a conditional model
In variational inference the idea is to approximate the intractable and complicated posterior $p(\omega \mid X, Y)$ with a simple, parameterised distribution $q(\omega)$. Crucially, this approximation affects our inferences and predictions. If we are serious about it being an approximation to the posterior and want to reduce its distortion of the model, then $q(\omega)$ can be made more flexible. But making $q(\omega)$ more flexible in variational dropout can potentially ruin the regularisation effect. So the particular choice of $q(\omega)$ plays an important, active role: it effectively performs posterior regularisation and acts as an integral part of the model.
Coming from another angle, Osband (2016) makes the point that in variational dropout the posterior over weights does not concentrate with more data, unlike for example in Graves (2011), which is unexpected behaviour from a Bayesian model. This conundrum is caused by encoding dropout with a fixed rate mixture of fixed variance components in $q(\omega)$, which also necessitates expensive tuning of the dropout rate. Gal et al. (2017) propose a way to address these shortcomings.
To avoid getting bogged down in the issues surrounding the suitability of variational inference and ease interpretation, we construct a straightforward conditional model and lower bound its MAP objective in the same form as the variational objective. Suppose we want to do MAP estimation for the model parameters (the means of the distribution of weights, ): . Consider a conditional model as a crippled generative model with constant, and independent. Place a normal prior on the means and otherwise make the weights conditional on the same way as they were in the variational posterior :
(3) 
The log posterior of this model has a similar lower bound to the variational objective (Eq. 1):
(4) 
See Appendix C for a detailed derivation. Dropping the normalisation constant that doesn’t depend on $\Theta$, and approximating the above integrals with a single sample, the loss corresponding to the MAP objective becomes:
$$\mathcal{L}_{\mathrm{MAP}} = -\sum_i \ln p(y_i \mid x_i, \hat{\omega}_i) - \ln p(\Theta) \qquad (5)$$
where $\hat{\omega}_i$ is the weight sample drawn for data point $i$. The first term of this loss is identical to that of the loss for variational dropout (Eq. 2). If the prior on $\Theta$ is a zero mean gaussian, then the second term is equivalent to a weight decay penalty just like the KL term in the variational setup. With the two losses being effectively the same, in the following we focus on MAP estimation for the conditional model to sidestep any questions about whether variational inference makes sense in this case.
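The argument of this subsection can be written out compactly. The following is a sketch in assumed notation rather than a reproduction of the paper's own equations: $\Theta$ stands for the means, $\omega$ for the weights, $p(\omega \mid \Theta)$ for the dropout distribution over weights, the data points are taken to be conditionally independent given $\omega$, and $\Theta$ is taken to be independent of $X$.

```latex
\begin{align*}
\ln p(\Theta \mid X, Y)
  &= \ln p(Y \mid X, \Theta) + \ln p(\Theta) - \ln p(Y \mid X) \\
  &= \ln \int p(\omega \mid \Theta)\, p(Y \mid X, \omega)\, d\omega
     + \ln p(\Theta) - \ln p(Y \mid X) \\
  &\geq \int p(\omega \mid \Theta) \ln p(Y \mid X, \omega)\, d\omega
     + \ln p(\Theta) - \ln p(Y \mid X) \\
  &= \sum_i \int p(\omega \mid \Theta) \ln p(y_i \mid x_i, \omega)\, d\omega
     + \ln p(\Theta) - \ln p(Y \mid X)
\end{align*}
```

The inequality is Jensen's, and the last term does not depend on $\Theta$, so it plays no role in the optimisation.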
3 The dropout family of models
Having developed a conditional model for dropout that leads to the same objective as variational dropout, we now derive a family of models whose objectives are all lower bounded by the usual dropout objective. We draw inspiration from the different evaluation methods employed for dropout:

Deterministic dropout propagates the expectation of each unit through the network in a single pass. This is very efficient and is viewed as a good approximation to the next option.

MC dropout mimics the training procedure, and averages the predicted probabilities over randomly sampled dropout masks. With one forward pass per sample, this can be rather expensive. There is some ambiguity as to what kind of averaging should be applied: oftentimes the geometric average (GMC) is used, because of its close relationship to the loss, but the arithmetic average (AMC) is also widespread.
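The three evaluation methods can be sketched on a toy model. Everything below is an illustrative assumption rather than the setup used in the experiments: a single linear layer followed by a softmax, dropout applied to the inputs without rescaling, and hypothetical function names.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, weights, mask):
    # mask[j] scales input unit j; weights[c][j] connects unit j to class c.
    logits = [sum(w * xj * mj for w, xj, mj in zip(row, x, mask))
              for row in weights]
    return softmax(logits)


def deterministic(x, weights, keep_prob):
    # Single pass with every mask entry replaced by its expected value.
    return forward(x, weights, [keep_prob] * len(x))


def mc_dropout(x, weights, keep_prob, num_samples, rng, geometric=False):
    preds = []
    for _ in range(num_samples):
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in x]
        preds.append(forward(x, weights, mask))
    num_classes = len(weights)
    if geometric:
        # GMC: average the log-probabilities, then renormalise.
        means = [math.exp(sum(math.log(p[c]) for p in preds) / num_samples)
                 for c in range(num_classes)]
        z = sum(means)
        return [m / z for m in means]
    # AMC: plain average of the predicted probabilities.
    return [sum(p[c] for p in preds) / num_samples for c in range(num_classes)]


rng = random.Random(0)
x = [1.0, -0.5, 2.0]
weights = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
det = deterministic(x, weights, keep_prob=0.5)
amc = mc_dropout(x, weights, 0.5, num_samples=200, rng=rng)
gmc = mc_dropout(x, weights, 0.5, num_samples=200, rng=rng, geometric=True)
```

The deterministic pass costs one forward evaluation, while both MC variants cost one per sampled mask.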
Our goal in this section is to demonstrate the consequences of optimising a lower bound instead of the true objective. While it is easy to argue in general that objectives of more than one model may share any given lower bound, for dropout a particularly simple explicit construction of such a family of models is possible. As we will see, this allows for post-training model selection based on validation results given a trained set of parameters. In the absence of validation results to guide model selection, inspection of the tightness of the lower bound indicates the deterministic model as the most reasonable choice from the family.
3.1 Geometric model
First, we investigate whether the geometric or the arithmetic mean is the correct choice for making predictions in the context of classification. Recall the predictive term of the MAP loss in Eq. 5: $-\sum_i \ln p(y_i \mid x_i, \hat{\omega}_i)$. Notice how with SGD and multiple epochs, for each data point several dropout masks are encountered, and the approximating quantity becomes the geometric mean of the predicted probabilities $p(y_i \mid x_i, \hat{\omega})$ over the masked weights. For this reason, the posterior predictive distribution is often computed as the renormalised geometric mean. This is in apparent conflict with the conditional model that prescribes the arithmetic mean (integrating $\omega$ out of Eq. 3). However, we can define another model where the conditional distribution is directly defined to be the renormalised geometric mean, with normalisation constant $Z$:
$$p(y \mid x, \Theta) = \frac{1}{Z} \exp\Big(\mathbb{E}_{\omega \sim p(\omega \mid \Theta)} \ln p(y \mid x, \omega)\Big) \qquad (6)$$
with a slight abuse of notation, due to using the symbol in although . It can be shown that the arithmetic model’s (Eq. 3) lower bound (Eq. 4) is a lower bound for this renormalised geometric model (Eq. 6), as well. See Appendix D for the derivation. The answer to the question whether we should use GMC or AMC is that it depends: they correspond to different models, but the dropout objective is a lower bound on the objectives of both models. So one can freely choose between GMC and AMC at evaluation time, doing model selection retrospectively after training.
3.2 The power mean model family
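Having two models to choose from, it is natural to ask whether these are just instantiations of a larger class of models. We propose the power mean family of models to extend the set of models to a continuum between the geometric and arithmetic models described in §3.1 and §2.1, respectively, and show that they have the same lower bound. The power mean of a positive random variable $V$ is defined as:
$$M_\alpha(V) = \big(\mathbb{E}[V^\alpha]\big)^{1/\alpha}$$
For $\alpha = 1$ we arrive at the arithmetic mean while the natural extension to $\alpha = 0$ is the geometric mean as it is the limit of $M_\alpha$ at $\alpha \to 0$, which can be proven with L’Hôpital’s rule. Similarly to the construction of the geometric model, we define the power mean model by directly conditioning on $\Theta$:
$$p(y \mid x, \Theta) = \frac{1}{Z}\Big(\mathbb{E}_{\omega \sim p(\omega \mid \Theta)}\, p(y \mid x, \omega)^\alpha\Big)^{1/\alpha} \qquad (7)$$
where $Z$ is at most $1$ if $\alpha \leq 1$ because $M_\alpha$ is monotonically increasing in $\alpha$ and $Z$ is $1$ for $\alpha = 1$. Here we provide a concise derivation of a lower bound on the log posterior (the full derivation can be found in Appendix E), writing $Z_i$ for the normalisation constant at input $x_i$ and $C$ for terms that do not depend on $\Theta$:
$$\ln p(\Theta \mid X, Y) = \sum_i \ln \frac{M_\alpha\big(p(y_i \mid x_i, \omega)\big)}{Z_i} + \ln p(\Theta) + C \;\geq\; \sum_i \ln M_\alpha\big(p(y_i \mid x_i, \omega)\big) + \ln p(\Theta) + C \qquad (8)$$
$$\geq\; \sum_i \mathbb{E}_{\omega \sim p(\omega \mid \Theta)} \ln p(y_i \mid x_i, \omega) + \ln p(\Theta) + C \qquad (9)$$
The first inequality above follows from $Z \leq 1$ for all $x$ and $\alpha \leq 1$, while the second is an application of Jensen’s inequality assuming $\alpha \geq 0$. We arrived at the same lower bound on the objective as we had for the geometric (Eq. 6) and arithmetic models (Eq. 3), thus defining the power mean family of models with parameter $\alpha \in [0, 1]$ from which we can choose at evaluation time. For $\alpha > 1$, the normalising constant would be greater than $1$, and this would not be a lower bound in general.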
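The behaviour of the normalising constant can be checked numerically. The sketch below is illustrative only: it replaces the expectation over dropout masks with an average over a handful of hand-written probability vectors, and the names `power_mean` and `power_mean_prediction` are hypothetical.

```python
import math


def power_mean(values, alpha):
    """Power mean of positive values; alpha == 0 is the geometric mean (the limit)."""
    n = len(values)
    if alpha == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** alpha for v in values) / n) ** (1.0 / alpha)


def power_mean_prediction(sampled_probs, alpha):
    """Combine per-mask class probabilities with a per-class power mean.

    sampled_probs[s][c] is the probability of class c under dropout mask s.
    Returns the renormalised prediction and the normalising constant.
    """
    num_classes = len(sampled_probs[0])
    means = [power_mean([p[c] for p in sampled_probs], alpha)
             for c in range(num_classes)]
    z = sum(means)
    return [m / z for m in means], z


samples = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.1, 0.6, 0.3]]
for alpha in (0.0, 0.5, 1.0, 2.0):
    prediction, z = power_mean_prediction(samples, alpha)
    print(alpha, round(z, 4), [round(p, 4) for p in prediction])
```

On this toy input the constant grows with the exponent, equals one for the arithmetic mean and exceeds one beyond it, in line with the argument above.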
3.3 Tightness of the lower bound
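To better understand the quality of fit for models in the power mean family we examine the tightness of their lower bounds. There are two steps involving inequalities in the derivation of the bound: one where the normalisation constant is dropped (Eq. 8) and another where the logarithm is moved inside the expectation (Eq. 9). We show that the gaps introduced by these steps can be made arbitrarily small by reducing the variance of $p(y \mid x, \omega)$ with respect to $\omega$.
Notice that the Jensen gap with the logarithm function is scale invariant: for a positive random variable $V$ and any constant $c > 0$,
$$\ln \mathbb{E}[cV] - \mathbb{E}[\ln cV] = \ln \mathbb{E}[V] - \mathbb{E}[\ln V]$$
Intuitively, this suggests that the variance of the normalised variable $V / \mathbb{E}[V]$ is closely related to the size of the gap. Indeed, Maddison et al. (2017) show that if the first inverse moment of $V$ is finite, then
$$\ln \mathbb{E}[V] - \mathbb{E}[\ln V] = \tfrac{1}{2}\,\mathrm{Var}\!\left(\frac{V}{\mathbb{E}[V]}\right) + R \qquad (10)$$
where $R$ is a higher-order remainder term.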
Here we go a bit further and show that if there is a positive lower and upper bound on $V$, then there are non-trivial lower and upper bounds on its Jensen gap and these bounds are multiplicative in $\mathrm{Var}(V)$. Let $V$ be a random variable such that $P(V \in (a, b)) = 1$ where $-\infty \leq a < b \leq \infty$. Furthermore, let $\varphi$ be a convex function. Jensen’s inequality states that $\mathbb{E}[\varphi(V)] \geq \varphi(\mathbb{E}[V])$. Liao & Berg (2017) show that the Jensen gap can be bounded from below and above:
$$\inf_{v \in (a, b)} h(v; \mu)\, \mathrm{Var}(V) \;\leq\; \mathbb{E}[\varphi(V)] - \varphi(\mu) \;\leq\; \sup_{v \in (a, b)} h(v; \mu)\, \mathrm{Var}(V), \qquad \mu = \mathbb{E}[V]$$
where $h$ does not depend on the distribution of $V$, only on its expected value and on the function $\varphi$. Substituting $V = p(y \mid x, \omega)$ (a random variable on $[0, 1]$ due to the randomness of the dropout masks) and $\varphi = -\ln$, we can see that the gap introduced by Eq. 9 can be made smaller by decreasing the variance of the predictions while maintaining the expected value of $V$ (i.e. the expected probability), assuming that there is a positive lower and upper bound on them (so that the supremum is finite and the infimum is positive, respectively). A similar argument shows that $Z$ approximately monotonically approaches $1$ as the variance decreases, so the gap of Eq. 8 can also be reduced.
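A small numerical sketch illustrates both properties of the Jensen gap for the logarithm: scale invariance, and shrinkage as the variance is reduced at a fixed mean. The expectation is taken over a uniform distribution on a short list of made-up probabilities, and the helper names are hypothetical.

```python
import math


def jensen_gap(values):
    """ln E[V] - E[ln V] for a uniform distribution over positive values."""
    n = len(values)
    mean = sum(values) / n
    return math.log(mean) - sum(math.log(v) for v in values) / n


def shrink_towards_mean(values, factor):
    """Scale deviations from the mean by `factor`, keeping the mean fixed."""
    mean = sum(values) / len(values)
    return [mean + factor * (v - mean) for v in values]


probs = [0.9, 0.5, 0.1]
print("gap:", jensen_gap(probs))
print("gap after rescaling by 3:", jensen_gap([3.0 * v for v in probs]))
for factor in (1.0, 0.5, 0.1, 0.0):
    print(factor, jensen_gap(shrink_towards_mean(probs, factor)))
```

With the mean held fixed, the gap in this example falls steadily as the spread is reduced and vanishes when the values coincide, which is the deterministic limit discussed next.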
Suppose we pick a base model from the power mean family and have a continuum of subvariants with gradually reduced variance in their predictions but the same expectation. Clearly, for each of them we can derive a lower bound the same way as we did for the power mean family. And as we showed above, the lower bounds will tend to increase as the variance of the predictions decreases (see Fig. 0(a)). They do not strictly increase, only tend to, due to how the Jensen gap is bounded from above and below and also due to the term of Eq. 10. Nonetheless, as we approach determinism the lower bound is forced into increasingly tighter ranges with strictly monotonically increasing bounds around it, thus we can always reduce the variance such that there is no overlap between the ranges and we get a guaranteed improvement on the lower bound. This effect reaches its apex at the deterministic model whose lower bound is both exact and higher than any other model’s. Fig. 0(b) illustrates that regardless of the choice of base model, reducing the prediction variance will eventually transform it into the same deterministic model.
3.4 The extended power mean family: controlling the tightness of the bound
Intuitively, in the absence of other sources of stochasticity the dropout rate controls the variance of the predictions and if it is low, the lower bound can be pretty snug. However, there are two problems.
First, decreasing the dropout rate does not necessarily keep the expectation of the predictions the same. We offer no solution to this bias issue, but refer the reader to previous studies of dropout’s approximation properties such as Baldi & Sadowski (2013) and our subsequent empirical results.
Second, reducing the dropout rate would trade off generalisation for tighter bounds. But doing so only at evaluation time leaves the training time regularisation effect intact, and can be seen as picking another model whose lower bound tends to be higher than that of the base model. Having thus extended the dropout family further, we can now tweak both $\alpha$ and dropout rates at evaluation time.
Depending on the severity of the introduced bias compared to the benefits of having a tighter lower bound, the optimal variance may lie anywhere between the deterministic and the base model. We show experimentally that across a number of datasets the benefits of tighter bounds matter more, and observe monotonic improvement in model fit as evaluation time dropout rates are decreased all the way to full determinism. The experiment was conducted as follows. On an already trained model, the dropout rate was multiplied by a factor between 0 and 1 at evaluation time.
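Table 1: Training cross entropy (XE) as a function of the evaluation-time dropout rate multiplier.
Multiplier  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
XE  2.731  2.738  2.746  2.755  2.766  2.777  2.791  2.807  2.826  2.849  2.878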
As Table 1 shows, the model fit as measured by cross entropy (XE) on the training set improves monotonically when reducing the dropout rate multiplier. Results on other datasets and with other power mean models are very similar. We call the union of the reduced dropout rate subvariants of all power mean family models the extended dropout family, parameterised by $\alpha$ and the dropout rate multiplier.
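Scaling the dropout rate at evaluation time can be sketched as follows. This is a minimal illustration under an assumption not stated in the text, namely inverted dropout, where kept units are rescaled so that the expected mask value stays at one for any rate; the function name is hypothetical.

```python
import random


def eval_dropout_mask(num_units, train_rate, multiplier, rng):
    """Sample an inverted-dropout mask at rate multiplier * train_rate.

    Assumes train_rate < 1. A multiplier of 0 gives the deterministic model
    (all ones); a multiplier of 1 gives the base model used in training.
    """
    rate = multiplier * train_rate
    keep = 1.0 - rate
    return [1.0 / keep if rng.random() < keep else 0.0
            for _ in range(num_units)]


rng = random.Random(0)
for multiplier in (0.0, 0.5, 1.0):
    print(multiplier, eval_dropout_mask(6, train_rate=0.5, multiplier=multiplier, rng=rng))
```

Averaging predictions over such masks, for a chosen power mean, then gives one member of the extended family.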
Therefore, we can say that dropout training optimises a deterministic model subject to regularisation constraints, and deterministic evaluation, widely believed to approximate MC evaluation, is the closest match to the true objective at our disposal. It is not that dropout evaluation has a deterministic approximation: dropout trains a deterministic model first and foremost and a continuum of stochastic ones to various extents.
In summary, we described dropout training as optimising a common lower bound for a family of models. Since this lower bound is the same for all models in the family, we can nominate any of them at evaluation time. However, the tightness of the bound varies, which affects model fit. Having trained a model with dropout, the best fit is achieved by the deterministic model with no dropout. This result isolates the regularisation effects from the biases of the lower bound and the dropout family.
4 Applying dropout
We investigate how members of the extended dropout model family perform in terms of generalisation. We follow the experimental setup of Melis et al. (2017) and base our work on their best performing model variant for each dataset. Unless explicitly stated, no retraining was performed and their model weights were reused. In the experiments with the tuning objective, we follow their experimental setup, using Google Vizier (Golovin et al. 2017), a black-box hyperparameter tuner based on batched Gaussian Process Bandits.
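Table 2: Cross entropy for deterministic evaluation (DET) and for the geometric ($\alpha = 0$), power and arithmetic ($\alpha = 1$) models at evaluation-time dropout rate multipliers 0.8, 0.9 and 1.0.
Dataset  DET  Geometric 0.8  Geometric 0.9  Geometric 1.0  Power 0.8  Power 0.9  Power 1.0  Arithmetic 0.8  Arithmetic 0.9  Arithmetic 1.0
MNIST  0.070  0.087  0.087  0.088  0.92  0.93  0.93  0.100  0.100  0.100
Enwik8  0.886  0.879  0.878  0.881  0.877  0.877  0.877  0.875  0.875  0.875
PTB  4.110  4.090  4.090  4.093  4.072  4.070  4.073  4.061  4.064  4.080
Wikitext2  4.236  4.229  4.231  4.235  4.025  4.026  4.208  4.203  4.212  4.228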
See Table 2 for results of image classification on MNIST, character based language modelling on Enwik8, word based language modelling on PTB and Wikitext2. On MNIST, deterministic dropout is the best in terms of cross entropy, which matches our theoretical predictions. In contrast, on language modelling arithmetic averaging produces the best results, which necessitates further analysis.
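Table 3: Training and validation XE on PTB for words grouped by frequency, for deterministic evaluation (DET), the 0.8 dropout multiplier and AMC.
Frequency  Num targets  Training DET  Training 0.8  Training AMC  Validation DET  Validation 0.8  Validation AMC
25000  13580  1.40  1.50  1.56  1.58  1.64  1.68
5000  26658  1.65  1.75  1.81  1.93  1.98  2.02
500  44702  2.19  2.30  2.36  2.58  2.63  2.66
500  29058  4.07  4.19  4.29  6.49  6.39  6.39
100  14222  4.24  4.38  4.49  7.81  7.64  7.61
20  5008  4.00  4.19  4.33  9.20  9.01  8.97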
We suspected that the particularly severe form of class imbalance exhibited by the power-law word distribution (Zipf 1935) might play a role. To verify this, we contrasted training and validation XEs on PTB for words grouped by frequency (see Table 3). On the training set, the gap between deterministic dropout and AMC is wider for low frequency words. On the validation set, AMC is worse for frequent words but better for rare words. The 0.8 dropout multiplier just finds a reasonable compromise.
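Grouping cross entropy by word frequency can be sketched as below. The bucketing rule (a word falls into the first bucket whose lower bound its training count reaches) and all names are assumptions made for illustration; the exact grouping used for Table 3 is not specified here.

```python
import math


def xe_by_frequency(tokens, probs, train_counts, thresholds):
    """Mean cross entropy of target words, grouped by training-set frequency.

    tokens: target words; probs: probability the model assigned to each target;
    train_counts: word -> count in the training set;
    thresholds: bucket lower bounds, e.g. [25000, 500, 0].
    Returns {threshold: (num_targets, mean_xe)} for the non-empty buckets.
    """
    ordered = sorted(thresholds, reverse=True)
    sums = {}
    for word, prob in zip(tokens, probs):
        count = train_counts.get(word, 0)
        # Words rarer than every threshold fall into the lowest bucket.
        bucket = next((t for t in ordered if count >= t), ordered[-1])
        n, total = sums.get(bucket, (0, 0.0))
        sums[bucket] = (n + 1, total - math.log(prob))
    return {b: (n, total / n) for b, (n, total) in sums.items()}


tokens = ["the", "the", "cat", "zebra"]
probs = [0.5, 0.25, 0.1, 0.01]
counts = {"the": 30000, "cat": 600, "zebra": 3}
print(xe_by_frequency(tokens, probs, counts, thresholds=[25000, 500, 0]))
```

Running the same grouping for two evaluation methods and comparing bucket by bucket gives the kind of contrast reported in the table.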
4.1 Softmax temperature
The observed effect is consistent with smoothing, thus we posit that the reason MNIST results are worse with AMC is that the marginal distributions of labels in the training and test set are identical by construction and further smoothing is unnecessary. On the other hand, PTB and Wikitext2 benefit from AMC’s smoothing because the penalty for underestimating low probabilities is harsh, hence the large improvement on rare words. The character based Enwik8 dataset lies somewhere in between: the training and test distributions are better matched and there are no very low probability characters.
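Table 4: Validation and test perplexity on Wikitext2 (WT2) and PTB at softmax temperature 1 and at the temperature optimised on the validation set (opt), for deterministic evaluation (DET) and for the geometric ($\alpha = 0$), power and arithmetic ($\alpha = 1$) models at dropout rate multipliers 0.8, 0.9 and 1.0.
Set  Dataset  Temp  DET  Geometric 0.8  Geometric 0.9  Geometric 1.0  Power 0.8  Power 0.9  Power 1.0  Arithmetic 0.8  Arithmetic 0.9  Arithmetic 1.0
Validation  WT2  1  69.1  68.6  68.8  69.1  67.0  67.2  67.2  66.9  67.5  68.6
Validation  WT2  opt  67.4  67.5  67.7  68.0  67.0  67.1  67.2  66.9  67.4  68.1
Validation  PTB  1  60.9  59.6  59.7  59.7  58.1  57.9  58.0  57.3  57.5  58.5
Validation  PTB  opt  57.5  57.5  57.9  58.3  57.1  57.3  57.8  57.1  57.5  58.4
Test  WT2  1  65.9  65.3  65.4  65.6  63.8  63.9  64.2  63.7  64.5  65.5
Test  WT2  opt  64.5  64.7  64.8  64.9  63.8  63.8  64.2  63.7  64.2  64.9
Test  PTB  1  58.6  57.3  57.4  57.4  56.0  55.8  55.9  55.3  55.5  56.5
Test  PTB  opt  56.0  56.0  56.1  56.5  55.7  55.7  56.0  55.3  55.5  56.3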
To test the hypothesis that AMC’s advantage lies in smoothing, we tested how performing smoothing by other means affects the results. In this experiment, on a trained model the temperature of the final softmax was optimised on the validation set and the model was applied with the optimal temperature to the validation and test sets. Our experimental results in Table 4 support the hypothesis that AMC smooths the predicted distribution, as increasing the temperature improves DET and GMC considerably but not AMC. In fact, for one of the AMC variants the optimal temperature was slightly lower than 1, which corresponds to sharpening, not smoothing.
Tuning the evaluation time softmax temperature is similar to label smoothing (Pereyra et al. 2017), the main difference being that our method does not affect training. While this is convenient, for tuning model hyperparameters, ideally we would determine the optimal evaluation parameters ($\alpha$, the dropout rate multiplier and the temperature) for the calculation of the validation score for each set of hyperparameters tried, but this would be prohibitively expensive. Since deterministic evaluation coupled with the optimal temperature is very close to the best performing AMC model, it serves as a good proxy for the ideal tuning objective. The optimal temperature can be approximately determined using a linear search on a subset of the validation data, which is orders of magnitude faster than MC dropout. In our experiments, hyperparameter tuning with validation scores computed at the optimal softmax temperature did improve results, albeit very slightly (about half a perplexity point). Thus we can conclude that deterministic dropout is already a reasonable proxy for which to optimise.
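Choosing the softmax temperature on held-out data can be sketched as a search over a fixed list of candidates. The candidate grid, the toy logits and the function names are all assumptions for illustration; in practice the logits would come from a deterministic forward pass over a subset of the validation data.

```python
import math


def log_softmax(logits, temperature):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_z for v in scaled]


def mean_xe(logit_rows, targets, temperature):
    """Average cross entropy of the targets at the given temperature."""
    total = sum(log_softmax(row, temperature)[t]
                for row, t in zip(logit_rows, targets))
    return -total / len(targets)


def best_temperature(logit_rows, targets, candidates):
    """Return the candidate temperature with the lowest mean cross entropy."""
    return min(candidates, key=lambda temp: mean_xe(logit_rows, targets, temp))


# An overconfident toy model: always strongly favours class 0,
# but one in four targets is class 1.
logit_rows = [[4.0, 0.0]] * 4
targets = [0, 0, 0, 1]
chosen = best_temperature(logit_rows, targets, [0.5, 1.0, 2.0, 4.0, 8.0])
print(chosen, mean_xe(logit_rows, targets, 1.0), mean_xe(logit_rows, targets, chosen))
```

A temperature above one flattens the predicted distribution, which is the smoothing effect discussed above; a value below one sharpens it.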
4.2 Results
We have improved the best test result of Melis et al. (2017) from 58.3 to 55.7 on PTB, and from 65.9 to 63.7 on Wikitext2 using their model weights, only tuning the evaluation parameters ($\alpha$, the dropout rate multiplier and the softmax temperature) on the validation set. By retuning the hyperparameters of the PTB model with optimal temperature deterministic evaluation, we improved to 55.3 on PTB. For lack of resources, we did not retune for Wikitext2. For comparison, the state of the art in language modelling without resorting to dynamic evaluation or a continuous cache pointer is Mixture of Softmaxes (Yang et al. 2017) with 54.44 and 61.45 on PTB and Wikitext2, respectively. At present, it is unclear whether the benefits of their approach and ours combine.
In summary, we looked at how different models and evaluation methods rank in terms of generalisation. Across a number of tasks and datasets the ranking differed from what was observed on the training set. We found that AMC smooths the distribution of the prediction probabilities and we achieved a similar effect without resorting to expensive sampling simply by adjusting the temperature of the final softmax. Finally, we brought the tuning objective more in line with the improved evaluation by automatically determining the optimal softmax temperature when evaluating on the validation set which further improved results.
5 Implications
The construction of a conditional model family with a common lower bound on their objectives is applicable to other latent variable models with similar structure and inference method. This lower bound admits ambiguity as to what model is being fit to the data, which in turn allows for picking any such model at evaluation time. However, the tightness of the bound and the quality of the fit vary. For dropout, the deterministic model has the best fit even though the training objective is highly stochastic, but this result hinges on the approximation properties of deterministic dropout and will not carry over to other probabilistic models in general. In particular, standard VAEs (Kingma & Welling 2013), whose lower bound is very similar in construction to Eq. 1, cannot quite collapse to a deterministic model, or else they suffer an infinite KL penalty. Still, the lower bound being looser on the tails of the variational distribution is related to the problem of underestimating posterior uncertainty (Turner & Sahani 2011).
In related works, expectation-linear dropout (Ma et al. 2016) and fraternal dropout (Zolna et al. 2017) both try to reduce the “inference gap”: the mismatch between the training objective and deterministic evaluation. The gains reported in those works might be explained by reducing the bias of deterministic evaluation and also by encouraging small variance in the predictions and thus getting tighter bounds. Another recent work, activation regularisation (Merity et al. 2017), could be thought of as a mechanism to reduce the variance of predictions to a similar effect. In the context of language modelling, the connection between noise and smoothing was established by Xie et al. (2017). Our improved understanding further emphasises that connection, and at the same time challenges the way we think about dropout.
Acknowledgments
We would like to thank Laura Rimell, Aida Nematzadeh, and Andriy Mnih for their valuable feedback.
References
 Baldi & Sadowski (2013) Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in neural information processing systems, pp. 2814–2822, 2013.
 Bayer et al. (2013) Justin Bayer, Christian Osendorfer, Daniela Korhammer, Nutan Chen, Sebastian Urban, and Patrick van der Smagt. On fast dropout and its applicability to recurrent networks. arXiv preprint arXiv:1311.0701, 2013.

 Gal & Ghahramani (2016a) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016a.
 Gal & Ghahramani (2016b) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019–1027, 2016b.
 Gal et al. (2017) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3584–3593, 2017.
 Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.
 Graves (2011) Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.
 Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
 Liao & Berg (2017) JG Liao and Arthur Berg. Sharpening jensen’s inequality. The American Statistician, 2017.
 Ma et al. (2016) Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard H. Hovy. Dropout with expectation-linear regularization. CoRR, abs/1609.08017, 2016. URL http://arxiv.org/abs/1609.08017.
 Maddison et al. (2017) Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pp. 6573–6583, 2017.
 Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
 Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
 Osband (2016) Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In NIPS Bayesian Deep Learning Workshop, 2016.
 Pachitariu & Sahani (2013) Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650, 2013.
 Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
 Semeniuta et al. (2016) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Turner & Sahani (2011) Richard E Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. Bayesian Time series models, 1(3.1):3–1, 2011.
 Wang & Manning (2013) Sida Wang and Christopher Manning. Fast dropout training. In international conference on machine learning, pp. 118–126, 2013.
 Warde-Farley et al. (2013) David Warde-Farley, Ian J Goodfellow, Aaron Courville, and Yoshua Bengio. An empirical analysis of dropout in piecewise linear networks. arXiv preprint arXiv:1312.6197, 2013.
 Xie et al. (2017) Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573, 2017.
 Yang et al. (2017) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: a high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zipf (1935) George Kingsley Zipf. The psychobiology of language. Houghton, Mifflin, 1935.
 Zolna et al. (2017) Konrad Zolna, Devansh Arpit, Dendi Suhubdy, and Yoshua Bengio. Fraternal dropout. arXiv preprint arXiv:1711.00066, 2017.
Appendix A Variational dropout with non-shared masks
If $q(\omega)$ and $p(\omega)$ are redefined for the non-shared setting to be products of identical and independent, per time step factors, neither term of the variational objective requires rethinking: the MC approximation still works since $q(\omega)$ is easy to sample from, while the KL term becomes a sum of componentwise KL divergences and can still be implemented as weight decay. Consequently, both shared and non-shared masks fit into the variational framework. For a detailed derivation see Appendix B.
In related works, Pachitariu & Sahani (2013) in their investigation of regularisation of standard RNN based language models dismiss applying dropout to recurrent connections “to avoid introducing instabilities into the recurrent part of the LMs”. Bayer et al. (2013) echo this claim about RNNs, which is then cited by Zaremba et al. (2014), but their work is based on LSTMs not standard RNNs. Finally, Gal & Ghahramani (2016b) cite all of the above but also work with LSTMs. Their results indicate a large, about 15 perplexity point advantage to shared mask dropout for language modelling on the Penn Treebank (PTB) corpus (see Fig. 2 in their paper).
Our experimental results obtained with careful and extensive hyperparameter tuning, listed in Table 5, indicate only a small difference between the two which is in agreement with the empirical study of Semeniuta et al. (2016).
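Table 5: Perplexity with shared (S) and non-shared (NS) dropout masks in the 10M and 24M settings.
Dataset  10M S  10M NS  24M S  24M NS
validation  59.4  60.2  57.5  58.3
test  57.5  58.6  56.0  56.9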
In any case, non-shared masks, in addition to being variational, are also surprisingly competitive with shared masks for LSTMs (we make no claims about standard RNNs). We also tested whether embedding dropout (in which dropout is applied to entire vectors in the input embedding lookup table) proposed by Gal & Ghahramani (2016b) improves results, and found that embedding dropout does not offer any improvement on top of input dropout.
Appendix B Derivation of variational dropout with non-shared masks
In this section, we formulate naive (i.e. nonshared mask) dropout in the variational setting. In contrast to the shared mask case, where was a single set of weights, here (or , for short) has a set of weights for each time step that differ in their dropout masks. The variational posterior and the prior are both products of identical distributions over time:
An unbiased approximation to the integrals in Eq. 1 is based on a single, easy to obtain sample :
Showing that the KL term can still be approximated with weight decay with non-shared masks is not much more involved. Both distributions are products of densities over independent random variables, so the componentwise KL divergences sum. In particular:
We partitioned the variables into two mutually exclusive sets and its complement , and split the multiple integral using Fubini’s theorem (or, equivalently, using the expectation of independent random variables rule). After the split, the first integral is trivially and the second has no dependence on .
What we end up with is a sum of identical KL terms of the same distributions as in the shared mask case, so the full KL can be approximated with weight decay.
Appendix C Derivation of the MAP lower bound for the arithmetic model
We can rewrite the posterior as:
Moving to the log domain and using Jensen’s inequality allows us to construct a lower bound that is a sum of per data point terms (i.e. something that can be conveniently optimised):
Appendix D Derivation of the MAP lower bound for the geometric model
From Eq. 6 recall that:
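The normalisation constant $Z$ is at most $1$, due to the geometric mean being bounded from above by the arithmetic mean on a per class basis:
$$Z = \sum_y \exp\Big(\mathbb{E}_{\omega \sim p(\omega \mid \Theta)} \ln p(y \mid x, \omega)\Big) \;\leq\; \sum_y \mathbb{E}_{\omega \sim p(\omega \mid \Theta)}\, p(y \mid x, \omega) = 1$$
Since this is a conditional model, we can rewrite the posterior as: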
is dropped in the last step as it is constant. Moving to the log domain once again:
where the lower bound arises due to $Z \leq 1$.
Appendix E Derivation of the MAP lower bound for the power mean family
In §3.2 we proved that . Starting from just like in the geometric case, we derive a lower bound in the log domain: