Dropout: Explicit Forms and Capacity Control

03/06/2020 · Raman Arora, et al. · Johns Hopkins University · University of California, Berkeley · Toyota Technological Institute at Chicago

We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a data-dependent regularizer that, in expectation, equals the weighted trace-norm of the product of the factors. In deep learning, we show that the data-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks. These developments enable us to give concrete generalization error bounds for the dropout algorithm, both for matrix completion and for training deep neural networks. We evaluate our theoretical findings on real-world datasets, including MovieLens, MNIST, and Fashion-MNIST.

1 Introduction

Dropout is a popular algorithmic regularization technique for training deep neural networks that aims at “breaking co-adaptation” among neurons by randomly dropping them at training time (hinton2012improving). Dropout has been shown to be effective across a wide range of machine learning tasks, from classification (srivastava2014dropout; szegedy2015going) to regression (toshev2014human). Notably, dropout is considered an essential component in the design of AlexNet (krizhevsky2012imagenet), which won the prominent ImageNet challenge in 2012 by a significant margin and helped transform the field of computer vision.

Dropout regularizes the empirical risk by randomly perturbing the model parameters during training. A natural first step toward understanding generalization due to dropout, therefore, is to instantiate the explicit form of the regularizer that dropout induces. In linear regression, with dropout applied to the input layer (i.e., on the input features), the explicit regularizer was shown to be akin to a data-dependent ridge penalty (srivastava2014dropout; wager2013dropout; baldi2013understanding; wang2013fast). In factored models, dropout yields more exotic forms of regularization. For instance, dropout induces a regularizer that behaves similarly to nuclear norm regularization in matrix factorization (cavazza2018dropout), in single hidden-layer linear networks (mianjy2018implicit), and in deep linear networks (mianjy2019dropout). However, none of the works above discuss how the induced regularizer provides capacity control, or equivalently, helps us establish generalization bounds for dropout.
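To make the linear-regression case above concrete, the short check below (our own illustration, not code from the cited works) verifies numerically that averaging the dropout-perturbed squared loss over the Bernoulli masks recovers the unperturbed loss plus a data-dependent ridge term. We assume the common inverted-dropout convention in which kept features are rescaled by 1/p, so the penalty becomes ((1-p)/p) Σ_i w_i² x_i²; constants may differ under other conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 0.8                      # feature dimension, keep probability
x = rng.normal(size=d)
w = rng.normal(size=d)
y = 1.3

# Monte Carlo average of the dropout-perturbed squared loss,
# with kept coordinates rescaled by 1/p (inverted dropout).
T = 200_000
b = rng.binomial(1, p, size=(T, d)) / p
mc = np.mean((y - (b * x) @ w) ** 2)

# Closed form: unperturbed loss + data-dependent ridge penalty.
closed = (y - w @ x) ** 2 + (1 - p) / p * np.sum(w**2 * x**2)

print(mc, closed)                  # the two numbers should agree closely
```

Because the penalty weighs each w_i² by the second moment of the corresponding feature, it acts as a data-dependent rescaling of weight decay rather than a plain ridge penalty.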

In this paper, we provide an answer to this question. We give explicit forms of the regularizers induced by dropout for the matrix sensing problem and for two-layer neural networks with ReLU activations. Further, we establish capacity control due to dropout and give precise generalization bounds. Our key contributions are as follows.

  1. In Section 2, we study dropout for matrix completion, wherein the matrix factors are dropped randomly during training. We show that this algorithmic procedure induces a data-dependent regularizer that behaves similarly to the weighted trace-norm, which has been shown to yield strong generalization guarantees for matrix completion (foygel2011learning).

  2. In Section 3, we study dropout in two-layer ReLU networks. We show that the regularizer induced by dropout is a data-dependent measure that behaves as an $\ell_2$ path-norm (neyshabur2015path), and we establish data-dependent generalization bounds.

  3. In Section 5, we present empirical evaluations that confirm our theoretical findings for matrix completion and deep regression on real-world datasets, including the MovieLens dataset as well as the MNIST and Fashion-MNIST datasets.

1.1 Related Work

Dropout was first introduced by hinton2012improving as an effective heuristic for algorithmic regularization, yielding lower test errors on the MNIST and TIMIT datasets. In a subsequent work, srivastava2014dropout reported similar improvements over several tasks in computer vision (on the CIFAR-10/100 and ImageNet datasets), speech recognition, text classification, and genetics.

Since then, dropout has been widely used in training state-of-the-art systems for several tasks, including large-scale visual recognition (szegedy2015going), large-vocabulary continuous speech recognition (dahl2013improving), image question answering (yang2016stacked), handwriting recognition (pham2014dropout), sentiment prediction and question classification (kalchbrenner2014convolutional), dependency parsing (chen2014fast), and brain tumor segmentation (havaei2017brain).

Following the empirical success of dropout, there have been several studies in recent years aimed at establishing theoretical underpinnings of why and how dropout helps with generalization. Early work of baldi2013understanding showed that for a single linear unit (and a single sigmoid unit, approximately), dropout amounts to weight decay regularization on the weights. A similar result was shown by mcallester2013pac in a PAC-Bayes setting. For generalized linear models, wager2013dropout established that dropout performs an adaptive regularization which is equivalent to a data-dependent scaling of the weight decay penalty. In their follow-up work, wager2014altitude show that for linear classification, under a generative assumption on the data, dropout improves the convergence rate of the generalization error. In this paper, we focus on predictors represented in a factored form and give generalization bounds for matrix learning problems and single hidden layer ReLU networks.

In a related line of work, helmbold2015inductive study the structural properties of the dropout regularizer in the context of linear classification. They characterize the landscape of the dropout criterion in terms of unique minimizers and establish the non-monotonic and non-convex nature of the regularizer. In a follow-up work, helmbold2017surprising extend their analysis to dropout in deep ReLU networks and find, surprisingly, that the nature of the regularizer is different from that in linear classification. In particular, they show that, unlike weight decay, the dropout regularizer in deep networks can grow exponentially with depth and remains invariant to rescaling of inputs, outputs, and network weights. We confirm some of these findings in our theoretical analysis. However, counter to the claims of helmbold2017surprising, we argue that dropout does indeed prevent co-adaptation.

In a line of work closely related to ours, zhai2018adaptive, gao2016dropout, and wan2013regularization bound the Rademacher complexity of deep neural networks trained using dropout. In particular, gao2016dropout show that the Rademacher complexity of the target class decreases polynomially or exponentially, for shallow and deep networks, respectively, albeit under additional norm bounds on the weight vectors. Similarly, wan2013regularization and zhai2018adaptive assume that certain norms of the weights are bounded, and show that the Rademacher complexity of the target class decreases with the dropout rate. We argue in this paper that dropout alone does not directly control the norms of the weight vectors; therefore, each of the works above fails to capture the practice. We emphasize that none of the previous works provides a generalization guarantee, i.e., a bound on the gap between the population risk and the empirical risk, merely in terms of the value of the explicit regularizer due to dropout. We give a first such result for dropout in the context of matrix completion and for a single hidden-layer ReLU network.

Several other works do not fall into any of the categories above and, in fact, are somewhat unrelated to the focus of this paper. Nonetheless, we discuss them here for completeness. For instance, gal2016dropout study dropout as a Bayesian approximation, and bank2018relationship draw insights from frame theory to connect the notion of equiangular tight frames with dropout training in auto-encoders. Some recent works have also considered variants of dropout. For instance, mou2018dropout consider a variant of dropout, which they call “truthful” dropout, that ensures that the output of the randomly perturbed network is unbiased. However, rather than bound the generalization error, mou2018dropout bound the gap between the population risk and the dropout objective, i.e., the empirical risk plus the explicit regularizer. li2016improved study yet another variant based on multinomial sampling (different nodes are dropped with different rates), and establish sub-optimality bounds for stochastic optimization of linear models (for convex Lipschitz loss functions).

Matrix Factorization with Dropout.

Our study of dropout is motivated in part by the recent works of cavazza2018dropout, mianjy2018implicit, and mianjy2019dropout. This line of work was initiated by cavazza2018dropout, who studied dropout for low-rank matrix factorization without constraining the rank of the factors or adding an explicit regularizer to the objective. They show that dropout in the context of matrix factorization yields an explicit regularizer whose convex envelope is given by the nuclear norm. This result is further strengthened by mianjy2018implicit, who show that the induced regularizer is indeed the nuclear norm.

While matrix factorization is not a learning problem per se (for instance, there is no distinction between training and test data), in follow-up works by mianjy2018implicit and mianjy2019dropout, the authors show that training deep linear networks with the $\ell_2$ loss using dropout reduces to the matrix factorization problem if the marginal distribution of the input feature vectors is assumed to be isotropic, i.e., $\mathbb{E}[xx^\top] = \mathrm{I}$. We note that this is a strong assumption. If we do not assume isotropy, we show that dropout induces a data-dependent regularizer which amounts to a simple scaling of the parameters and, therefore, does not control capacity in any meaningful way. We revisit this discussion in Section 4.

To summarize, while we are motivated by cavazza2018dropout, the problem setup, the nature of the statements in this paper, and the tools we use are different from those in cavazza2018dropout. Our proofs are simple and quickly verified. We do build closely on the prior work of mianjy2018implicit. However, different from mianjy2018implicit, we rigorously argue for dropout in matrix completion by 1) showing that the induced regularizer is equal to the weighted trace-norm, which, as far as we know, is a novel result, 2) giving strong generalization bounds, and 3) providing extensive experimental evidence that dropout provides state-of-the-art performance on one of the largest datasets in recommendation systems research. Beyond that, we rigorously extend our results to two-layer ReLU networks: we describe the explicit regularizer, bound the Rademacher complexity of the hypothesis class controlled by dropout, show precise generalization bounds, and support them with empirical results.

1.2 Notation and Preliminaries

We denote matrices, vectors, scalar variables, and sets by Roman capital letters, Roman small letters, small letters, and script letters, respectively (e.g., X, x, $x$, and $\mathcal{X}$). For any integer $k$, we represent the set $\{1, \ldots, k\}$ by $[k]$. For any vector x, $\mathrm{diag}(\mathrm{x})$ represents the diagonal matrix whose $i$-th diagonal entry equals $x_i$, and $\sqrt{\mathrm{x}}$ is the elementwise square root of x. Let $\|\mathrm{x}\|$ represent the $\ell_2$-norm of vector x, and let $\|\mathrm{X}\|_2$, $\|\mathrm{X}\|_F$, and $\|\mathrm{X}\|_*$ represent the spectral norm, the Frobenius norm, and the nuclear norm of matrix X, respectively. Let $\mathrm{X}^\dagger$ denote the Moore–Penrose pseudo-inverse of X. Given a positive definite matrix C, we denote the Mahalanobis norm by $\|\mathrm{x}\|_{\mathrm{C}} := \sqrt{\mathrm{x}^\top \mathrm{C}\,\mathrm{x}}$. For a random variable x that takes values in $\mathcal{X}$, given i.i.d. samples $x_1, \ldots, x_n$, the empirical average of a function $f$ is denoted by $\hat{\mathbb{E}}[f(x)] := \frac{1}{n}\sum_{i=1}^n f(x_i)$. Furthermore, we denote the second moment of x by $\mathrm{C} := \mathbb{E}[xx^\top]$. The standard inner product is represented by $\langle \cdot, \cdot \rangle$, for vectors or matrices, where $\langle \mathrm{X}, \mathrm{Y} \rangle = \mathrm{Tr}(\mathrm{X}^\top \mathrm{Y})$.

We are primarily interested in understanding how dropout controls the capacity of the hypothesis class when it is used for training. To that end, we consider Rademacher complexity, a sample-dependent measure of the complexity of a hypothesis class that directly bounds the generalization gap (bartlett2002rademacher). Formally, let $S = \{x_1, \ldots, x_n\}$ be a sample of size $n$. Then the empirical Rademacher complexity of a function class $\mathcal{F}$ with respect to $S$, and the expected Rademacher complexity, are defined, respectively, as

$\hat{\mathfrak{R}}_S(\mathcal{F}) := \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\Big], \qquad \mathfrak{R}_n(\mathcal{F}) := \mathbb{E}_{S}\big[\hat{\mathfrak{R}}_S(\mathcal{F})\big],$

where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher random variables.
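As a concrete (and standard) illustration of this definition, the snippet below estimates the empirical Rademacher complexity of the class of linear predictors with Euclidean norm at most B, for which the supremum inside the expectation has the closed form (B/n)·‖Σ_i σ_i x_i‖. This is our own example, not a construction used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 100, 10, 2.0                      # sample size, dimension, norm bound
X = rng.normal(size=(n, d))                 # sample S = {x_1, ..., x_n}

# Empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= B}:
#   R_S = E_sigma sup_{||w|| <= B} (1/n) sum_i sigma_i <w, x_i>
#       = (B/n) E_sigma || sum_i sigma_i x_i ||_2
T = 20_000
sigma = rng.choice([-1.0, 1.0], size=(T, n))   # i.i.d. Rademacher draws
R_hat = (B / n) * np.mean(np.linalg.norm(sigma @ X, axis=1))
print(R_hat)
```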

2 Matrix Sensing

We begin by studying dropout for matrix sensing, a problem that is arguably an important instance of a matrix learning problem with many applications, and one that is well understood from a theoretical perspective. The problem setup is the following. Let $\mathrm{M}_* \in \mathbb{R}^{d_1 \times d_2}$ be a matrix of rank $r$. Let $\mathrm{A}_1, \ldots, \mathrm{A}_n$ be a set of measurement matrices of the same size as $\mathrm{M}_*$. The goal of matrix sensing is to recover the matrix $\mathrm{M}_*$ from $n$ observations $y_1, \ldots, y_n$ such that $y_i = \langle \mathrm{A}_i, \mathrm{M}_* \rangle$.

A natural approach is to represent the matrix in terms of factors $\mathrm{U} \in \mathbb{R}^{d_1 \times k}$ and $\mathrm{V} \in \mathbb{R}^{d_2 \times k}$ and solve the following empirical risk minimization problem:

$\min_{\mathrm{U}, \mathrm{V}}\ \hat{L}(\mathrm{U}, \mathrm{V}) := \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle \mathrm{A}_i, \mathrm{U}\mathrm{V}^\top \rangle\big)^2. \qquad (1)$

When the number of factors $k$ is unconstrained, there exist many “bad” empirical minimizers, i.e., minimizers with a large true risk. Interestingly, li2018algorithmic showed recently that under a restricted isometry property (RIP), despite the existence of such poor ERM solutions, gradient descent with proper initialization is implicitly biased towards finding solutions with minimum nuclear norm; this is an important result which was first conjectured and empirically verified by gunasekar2017implicit. We do not make an RIP assumption here. Further, we argue that, for the most part, modern machine learning systems employ explicit regularization techniques. In fact, as we show in the experimental section, the implicit bias due to (stochastic) gradient descent does not prevent it from blatant overfitting in the matrix completion problem.

We propose solving the ERM problem (1) using dropout, where at training time, corresponding columns of U and V are dropped uniformly at random. As opposed to the implicit effect of gradient descent, dropout explicitly regularizes the empirical objective. It is then natural to ask, in the case of matrix sensing, if dropout also biases the ERM towards certain low-norm solutions. To answer this question, we begin with the observation that dropout can be viewed as an instance of SGD on the following objective (cavazza2018dropout; mianjy2018implicit):

$\hat{L}_{\mathrm{drop}}(\mathrm{U}, \mathrm{V}) := \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\mathrm{B}}\Big(y_i - \big\langle \mathrm{A}_i, \tfrac{1}{p}\,\mathrm{U}\mathrm{B}\mathrm{V}^\top \big\rangle\Big)^2, \qquad (2)$

where $\mathrm{B} \in \mathbb{R}^{k \times k}$ is a diagonal matrix whose diagonal elements are Bernoulli random variables distributed as $\mathrm{B}_{jj} \sim \mathrm{Bern}(p)$. It is easy to show that for $\lambda = \frac{1-p}{p}$:

$\hat{L}_{\mathrm{drop}}(\mathrm{U}, \mathrm{V}) = \hat{L}(\mathrm{U}, \mathrm{V}) + \lambda\, \hat{R}(\mathrm{U}, \mathrm{V}), \qquad (3)$

where $\hat{R}(\mathrm{U}, \mathrm{V}) := \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k (\mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j)^2$ is a data-dependent term that captures the explicit regularizer due to dropout. A similar result was shown by cavazza2018dropout and mianjy2018implicit, but we provide a proof for completeness (see Proposition 2 in the Appendix).
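The identity (3) is easy to check numerically. The sketch below is our own illustration; it follows the scaling convention used in (2)–(3), with kept columns rescaled by 1/p and λ = (1−p)/p, and compares a Monte Carlo average of the dropout-perturbed loss against the empirical loss plus the explicit regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k, n, p = 6, 5, 4, 30, 0.7
U = rng.normal(size=(d1, k))
V = rng.normal(size=(d2, k))
A = rng.normal(size=(n, d1, d2))                 # measurement matrices A_i
y = rng.normal(size=n)                           # observations

def dropout_objective_mc(T=50_000):
    """Monte Carlo average over Bernoulli masks of the perturbed squared loss."""
    vals = np.empty(T)
    for t in range(T):
        b = rng.binomial(1, p, size=k) / p       # drop columns, rescale kept ones by 1/p
        M = (U * b) @ V.T                        # U diag(b) V^T
        preds = np.einsum('iab,ab->i', A, M)
        vals[t] = np.mean((y - preds) ** 2)
    return vals.mean()

# Empirical loss at (U, V) plus the explicit dropout regularizer.
preds = np.einsum('iab,ab->i', A, U @ V.T)
erm = np.mean((y - preds) ** 2)
lam = (1 - p) / p
quad = np.einsum('aj,iab,bj->ij', U, A, V)       # quad[i, j] = u_j^T A_i v_j
reg = lam * np.mean(np.sum(quad ** 2, axis=1))   # lam * (1/n) sum_i sum_j (u_j^T A_i v_j)^2

print(dropout_objective_mc(), erm + reg)         # should agree up to Monte Carlo error
```

For Gaussian measurements, the quadratic forms $\mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j$ have second moment $\|\mathrm{u}_j\|^2\|\mathrm{v}_j\|^2$, which is how the trace-norm behavior discussed next emerges.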

We show that the explicit regularizer concentrates around its expected value w.r.t. the data distribution (see Lemma 2 in the Appendix). Furthermore, given that we seek a minimizer of the dropout objective (3), it suffices to consider the factors with the minimal value of the regularizer among all factors that yield the same empirical loss. This motivates studying the following distribution-dependent induced regularizer:

$\Theta(\mathrm{M}) := \min_{\mathrm{U}\mathrm{V}^\top = \mathrm{M}}\ \mathbb{E}\big[\hat{R}(\mathrm{U}, \mathrm{V})\big].$

For a wide range of random measurements, $\Theta$ turns out to be a “suitable” regularizer. Here, we instantiate two important examples (see Proposition 3 in the Appendix).

Gaussian Measurements.

For all $i \in [n]$, let $\mathrm{A}_i$ be standard Gaussian matrices, i.e., matrices with i.i.d. $\mathcal{N}(0,1)$ entries. In this case, it is easy to see that $\mathbb{E}\,\langle \mathrm{A}_i, \mathrm{M} \rangle^2 = \|\mathrm{M}\|_F^2$, so the population risk reduces to $\|\mathrm{M}_* - \mathrm{U}\mathrm{V}^\top\|_F^2$ and we recover the matrix factorization problem. Furthermore, we know from cavazza2018dropout and mianjy2019dropout that the dropout regularizer acts as trace-norm regularization, i.e., the induced regularizer is proportional to the squared nuclear norm $\|\mathrm{M}\|_*^2$.

Matrix Completion.

For all $i \in [n]$, let $\mathrm{A}_i$ be an indicator matrix whose $(a, b)$-th element equals one, where the index pair $(a, b)$ is selected randomly with probability $p(a)\,q(b)$, with $p(a)$ and $q(b)$ denoting the probability of choosing the $a$-th row and the $b$-th column, respectively. Then the induced regularizer $\Theta$ is the weighted trace-norm studied by srebro2010collaborative and foygel2011learning.
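For concreteness, the weighted trace-norm of a matrix M with respect to row marginals p and column marginals q is commonly written as ‖diag(√p) M diag(√q)‖_*; the snippet below computes it and checks that uniform marginals recover the usual trace-norm up to scaling. This is our own illustration.

```python
import numpy as np

def weighted_trace_norm(M, row_probs, col_probs):
    """|| diag(sqrt(p)) M diag(sqrt(q)) ||_* , the weighted trace-norm."""
    W = np.sqrt(row_probs)[:, None] * M * np.sqrt(col_probs)[None, :]
    return np.linalg.norm(W, ord='nuc')

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 6))
p_row = np.full(8, 1 / 8)            # uniform row marginals
q_col = np.full(6, 1 / 6)            # uniform column marginals
print(weighted_trace_norm(M, p_row, q_col))
print(np.linalg.norm(M, ord='nuc') / np.sqrt(8 * 6))   # equal under uniform marginals
```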

These observations are especially important because they connect dropout, an algorithmic heuristic in deep learning, to strong complexity measures that are empirically effective as well as theoretically well understood. To illustrate, we now give a generalization bound for matrix completion using dropout in terms of the value of the explicit regularizer at the minimizer.

Theorem 1.

Assume that and . Furthermore, assume that . Let $(\mathrm{U}, \mathrm{V})$ be a minimizer of the dropout ERM objective in equation (3). Let $\gamma$ be such that . Then, for any $\delta > 0$, the following generalization bound holds with probability at least $1 - \delta$ over a sample of size $n$:

where the clipped predictor thresholds the entries of $\mathrm{M} = \mathrm{U}\mathrm{V}^\top$, and $L$ denotes the true risk of the clipped predictor.

The proof of Theorem 1 follows from standard generalization bounds for bounded losses (mohri2018foundations), based on the Rademacher complexity (bartlett2002rademacher) of the class of functions with weighted trace-norm bounded by $\gamma$. The non-degeneracy condition on the sampling distribution is required to obtain a bound on the Rademacher complexity of this class, as established by foygel2011learning.

We note that, for a large enough sample size, the value of the explicit regularizer at the minimizer concentrates around its expectation, where the second approximation is due to the fact that the pair $(\mathrm{U}, \mathrm{V})$ is a minimizer. That is, compared to the weighted trace-norm, the value of the explicit regularizer at the minimizer scales roughly as assumed in the statement of the theorem.

In practice, for models that are trained with dropout, the training error is negligible (see Figure 1 for experiments on the MovieLens dataset). Moreover, given that the sample size is large enough, the third term can be made arbitrarily small. Having said that, the second term dominates the right-hand side of the generalization error bound in Theorem 1. In the Appendix, we also give optimistic generalization bounds that decay as $O(1/n)$.

Finally, the required sample size heavily depends on the value of the explicit regularizer at a minimizer, and hence, on the dropout rate. In particular, increasing the dropout rate increases the regularization parameter $\lambda$, thereby intensifying the penalty due to the explicit regularizer. Intuitively, a larger dropout rate results in a smaller value of the regularizer at the minimizer, so that a tighter generalization gap can be guaranteed. We show through experiments that this is indeed the case in practice.

3 Non-linear Networks

Next, we focus on neural networks with a single hidden layer. Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces, respectively. Let $\mathcal{D}$ denote the joint probability distribution on $\mathcal{X} \times \mathcal{Y}$. Given $n$ examples $\{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from the joint distribution and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the goal of learning is to find a hypothesis $f_{\mathrm{w}}$, parameterized by w, that has a small population risk $L(\mathrm{w}) := \mathbb{E}_{\mathcal{D}}[\ell(f_{\mathrm{w}}(x), y)]$.

We focus on the squared loss, i.e., $\ell(\hat{y}, y) = \|\hat{y} - y\|^2$, and study the generalization properties of the dropout algorithm for minimizing the empirical risk $\hat{L}(\mathrm{w}) := \frac{1}{n}\sum_{i=1}^n \ell(f_{\mathrm{w}}(x_i), y_i)$. We consider the hypothesis class associated with feed-forward neural networks with a single hidden layer of width $k$, i.e., functions of the form $f_{\mathrm{w}}(x) = \mathrm{U}\,\sigma(\mathrm{W}x)$, where U and W are the weight matrices. The parameter w is the collection of weight matrices and $\sigma(z) = \max\{0, z\}$ is the ReLU activation function applied entrywise to an input vector.

As in Section 2, we view dropout as an instance of stochastic gradient descent on the following dropout objective:

$\hat{L}_{\mathrm{drop}}(\mathrm{w}) := \frac{1}{n}\sum_{j=1}^{n} \mathbb{E}_{\mathrm{B}}\,\big\| y_j - \tfrac{1}{p}\,\mathrm{U}\mathrm{B}\,\sigma(\mathrm{W}x_j) \big\|^2, \qquad (4)$

where B is a diagonal random matrix with diagonal elements distributed identically and independently as $\mathrm{B}_{ii} \sim \mathrm{Bern}(p)$, $i \in [k]$, for some dropout rate $1 - p$. We seek to understand the explicit regularizer due to dropout:

$\hat{R}(\mathrm{w}) := \hat{L}_{\mathrm{drop}}(\mathrm{w}) - \hat{L}(\mathrm{w}). \qquad (5)$

We denote the output of the $i$-th hidden node on an input vector x by $a_i(x)$; for example, for the network above, $a_i(x) = \sigma(\mathrm{w}_i^\top x)$, where $\mathrm{w}_i$ is the $i$-th row of W. Similarly, the vector $a(x) = \sigma(\mathrm{W}x)$ denotes the activation of the hidden layer on input x. Using this notation, we can rewrite the objective in (4) as $\frac{1}{n}\sum_{j=1}^{n} \mathbb{E}_{\mathrm{B}}\,\|y_j - \tfrac{1}{p}\,\mathrm{U}\mathrm{B}\,a(x_j)\|^2$. It is then easy to show that the regularizer due to dropout in (5) is given as (see Proposition 4 in the Appendix):

$\hat{R}(\mathrm{w}) = \lambda \sum_{i=1}^{k} \|\mathrm{u}_i\|^2\, \hat{\mathbb{E}}\big[a_i(x)^2\big], \qquad \lambda = \tfrac{1-p}{p},$

where $\mathrm{u}_i$ denotes the $i$-th column of U. The explicit regularizer is a summation, over hidden nodes, of the product of the squared norm of the outgoing weights with the empirical second moment of the output of the corresponding neuron. We should view it as a data-dependent variant of the path-norm of the network, studied recently by neyshabur2015norm and shown to yield capacity control in deep learning. Indeed, if we consider ReLU activations and input distributions that are symmetric and isotropic (mianjy2018implicit), the expected regularizer is equal (up to a constant) to the sum, over all paths from input to output, of the product of the squares of the weights along the path, which is precisely the squared path-norm of the network. We refer the reader to Proposition 5 in the Appendix for a formal statement and proof.
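The sketch below (our own illustration) computes the data-dependent factor Σ_i ‖u_i‖² Ê[a_i(x)²] for a single-output network and compares it to the squared ℓ2 path-norm Σ_i u_i² ‖w_i‖²; for symmetric isotropic inputs such as standard Gaussians, E[σ(w_iᵀx)²] = ‖w_i‖²/2, so the two agree up to a factor of 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 8, 50_000
W = rng.normal(size=(k, d))            # hidden-layer weights, rows w_i
u = rng.normal(size=k)                 # top-layer weights (single output)
X = rng.normal(size=(n, d))            # symmetric, isotropic inputs (standard Gaussian)

relu = lambda z: np.maximum(z, 0.0)
A = relu(X @ W.T)                      # A[j, i] = sigma(w_i^T x_j), hidden activations

# Data-dependent factor induced by dropout (without the lambda prefactor):
#   sum_i u_i^2 * (1/n) sum_j sigma(w_i^T x_j)^2
data_dep = np.sum(u**2 * np.mean(A**2, axis=0))

# Squared l2 path-norm: sum over input -> hidden -> output paths of squared weights.
path_sq = np.sum(u**2 * np.sum(W**2, axis=1))

print(data_dep, 0.5 * path_sq)         # close for symmetric isotropic inputs
```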

3.1 Generalization Bounds

To understand the generalization properties of dropout, we focus on the following distribution-dependent class:

$\mathcal{F}_\gamma := \Big\{\, x \mapsto \mathrm{u}^\top \sigma(\mathrm{W}x)\ :\ \sum_{i=1}^{k} |u_i|\, \sqrt{\mathbb{E}\big[a_i(x)^2\big]} \le \gamma \,\Big\},$

where u is the top-layer weight vector, $u_i$ denotes the $i$-th entry of u, and $\mathbb{E}[a_i(x)^2]$ is the expected squared activation of the $i$-th hidden node. For simplicity, we focus on networks with one output neuron; the extension to multiple output neurons is rather straightforward.

We argue that networks that are trained with dropout belong to the class $\mathcal{F}_\gamma$ for a small value of $\gamma$. In particular, by the Cauchy–Schwarz inequality, it is easy to see that $\sum_i |u_i| \sqrt{\mathbb{E}[a_i(x)^2]} \le \sqrt{k \sum_i u_i^2\, \mathbb{E}[a_i(x)^2]}$, and the right-hand side is precisely $\sqrt{k\, \mathbb{E}[\hat{R}(\mathrm{w})]/\lambda}$ for a single-output network. Thus, for a fixed width, dropout implicitly controls the function class $\mathcal{F}_\gamma$. More importantly, this inequality is loose if a small subset of hidden nodes “co-adapt” in the sense that they dominate the output while the other hidden nodes are almost inactive, i.e., their expected squared activations are negligible. In other words, by minimizing the expected regularizer, dropout is biased towards networks where the gap between the two sides of the inequality is small, which in turn happens if all hidden nodes contribute nearly equally. In this sense, dropout breaks “co-adaptation” between neurons by promoting solutions with nearly equal contributions from the hidden neurons.

As we mentioned in the introduction, a bound on the dropout regularizer is not sufficient to guarantee a bound on the norm-based complexity measures that are common in the deep learning literature (see, e.g., golowich2018size and the references therein), whereas a norm bound on the weight vectors would imply a bound on the explicit regularizer due to dropout. Formally, we show the following.

Proposition 1.

For any , there exists a distribution on the unit Euclidean sphere, and a network , such that , while .

In other words, even though we connect the dropout regularizer to the path-norm, the data-dependent nature of the regularizer prevents us from leveraging that connection in a data-independent manner (i.e., for all distributions). At the same time, making strong distributional assumptions (as in Proposition 5) would be impractical. Instead, we argue for the following milder condition on the input distribution, which we show is sufficient to ensure generalization.

Assumption 1 ($\beta$-retentive).

The marginal input distribution is $\beta$-retentive, for some $\beta > 0$, if for any non-zero vector $\mathrm{v}$, it holds that $\mathbb{E}\big[\sigma(\mathrm{v}^\top x)^2\big] \ge \beta\, \mathbb{E}\big[(\mathrm{v}^\top x)^2\big]$.

Intuitively, the assumption implies that the variance (i.e., the information or signal in the data) in the pre-activation at any node in the network is not quashed considerably by the non-linearity. In fact, no reasonable training algorithm should learn weights for which $\mathbb{E}[\sigma(\mathrm{v}^\top x)^2]$ is small. However, we steer clear of algorithmic aspects of dropout training, and make the assumption above for every weight vector, as we will need it when carrying out a union bound.
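Although β is defined through an infimum over all directions and is therefore hard to certify, a crude empirical probe is easy to run: sweep random directions v and record the smallest observed ratio Ê[σ(vᵀx)²]/Ê[(vᵀx)²]. The sketch below is our own illustration; it assumes the ratio form of Assumption 1 as stated above, and it only upper-bounds the infimum since it checks sampled directions rather than all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2_000, 10
X = rng.normal(size=(n, d)) + 0.3          # a generic (slightly shifted) sample

def retentiveness_probe(X, num_directions=2_000, seed=1):
    """Smallest observed ratio E_hat[relu(v.x)^2] / E_hat[(v.x)^2] over random
    directions v; the true beta is an infimum over all v, so this only
    upper-bounds it."""
    rng_local = np.random.default_rng(seed)
    V = rng_local.normal(size=(num_directions, X.shape[1]))
    Z = X @ V.T                              # Z[j, t] = v_t^T x_j
    num = np.mean(np.maximum(Z, 0.0) ** 2, axis=0)
    den = np.mean(Z ** 2, axis=0)
    return np.min(num / den)

print(retentiveness_probe(X))
```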

We now present the first main result of this section, which bounds the Rademacher complexity of $\mathcal{F}_\gamma$ in terms of $\gamma$, the retentiveness coefficient $\beta$, and the Mahalanobis norm of the data with respect to the pseudo-inverse of the second moment, i.e., $\|x_i\|_{\mathrm{C}^\dagger}$.

Theorem 2.

For any sample of size ,

Furthermore, it holds for the expected Rademacher complexity that

First, note that the bound depends on a sample-dependent quantity which can be of the same order as the dimension-dependent quantities more common in the literature (neyshabur2018towards; bartlett2017spectrally; neyshabur2017pac; golowich2018size; neyshabur2015norm). This is unfortunately unavoidable, unless one makes stronger distributional assumptions.

Second, as we discussed earlier, the dropout regularizer directly controls the value of $\gamma$, thereby controlling the Rademacher complexity in Theorem 2. This bound also gives us a bound on the Rademacher complexity of the networks trained using dropout. To see that, consider the class of networks with bounded explicit regularizer. Then, Theorem 2 yields a corresponding bound on the Rademacher complexity of this class. In fact, we can show that this bound is tight, up to a multiplicative factor, by a reduction to the linear case. Formally, we show the following.

Theorem 3 (Lower bound on the Rademacher complexity of $\mathcal{F}_\gamma$).

There is a constant such that for any scalar ,

Moreover, it is easy to give a generalization bound based on Theorem 2 that depends only on distribution-dependent quantities. Let the clipped predictor project the network output onto a bounded range. We have the following generalization guarantees for $\mathcal{F}_\gamma$.

Corollary 1.

For any $\gamma > 0$ and any $\delta > 0$, the following generalization bound holds with probability at least $1 - \delta$ over a sample of size $n$:

$\beta$-independent Bounds.

Geometrically, $\beta$-retentiveness requires that for any hyperplane passing through the origin, both halfspaces contribute significantly to the second moment of the data in the direction of the normal vector. It is not clear, however, whether $\beta$ can be estimated efficiently given a dataset. Nonetheless, when the inputs are entrywise non-negative, which is the case for image datasets, a simple symmetrization technique, described below, allows us to give bounds that are $\beta$-independent; note that the bounds still depend on the sample. Here is the randomized symmetrization we propose. Given a training sample $S = \{(x_i, y_i)\}_{i=1}^n$, consider the following randomized perturbation, $\tilde{S} = \{(s_i x_i, y_i)\}_{i=1}^n$, where the $s_i$'s are i.i.d. Rademacher random variables. We give a generalization bound (w.r.t. the original data distribution) for the hypothesis class with bounded regularizer w.r.t. the perturbed data distribution.
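The perturbation itself is a one-liner: flip the sign of each input with an independent Rademacher variable, keeping the label unchanged. The sketch below is a minimal illustration with our own variable names.

```python
import numpy as np

def symmetrize(X, rng):
    """Randomized symmetrization: multiply each input by an i.i.d. Rademacher
    sign; labels are left unchanged."""
    s = rng.choice([-1.0, 1.0], size=(X.shape[0], 1))
    return s * X

rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 4))        # e.g. non-negative pixel intensities
X_tilde = symmetrize(X, rng)        # the perturbed inputs forming S~
```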

Corollary 2.

Given an i.i.d. sample , let

where the regularizer is computed on the symmetrized sample. For any $\gamma > 0$ and any $\delta > 0$, the following generalization bound holds with probability at least $1 - \delta$ over a sample of size $n$ and the randomization in the symmetrization:

Note that the population risk of the clipped predictor is bounded in terms of the empirical risk on the symmetrized sample $\tilde{S}$. Finally, we verify in Section 5 that symmetrization of the training set, on the MNIST and Fashion-MNIST datasets, does not have an effect on the performance of the trained models.

              plain SGD                        dropout
width         last iterate    best iterate    (four dropout rates)
              0.8041          0.7938          0.7805   0.785    0.7991   0.8186
              0.8315          0.7897          0.7899   0.7771   0.7763   0.7833
              0.8431          0.7873          0.7988   0.7813   0.7742   0.7743
              0.8472          0.7858          0.8042   0.7852   0.7756   0.7722
              0.8473          0.7844          0.8069   0.7879   0.7772   0.772
Table 1: MovieLens dataset: Test RMSE of plain SGD as well as the dropout algorithm with various dropout rates, for various factorization sizes (one per row). The grey cells show the best performance(s) in each row.

4 Role of Parametrization

In this section, we argue that parametrization plays an important role in determining the nature of the inductive bias.

We begin by considering matrix sensing in non-factorized form, which entails minimizing $\frac{1}{n}\sum_{i=1}^n \big(y_i - \langle \mathrm{vec}(\mathrm{A}_i), \mathrm{vec}(\mathrm{M})\rangle\big)^2$ over M, where $\mathrm{vec}(\mathrm{M})$ denotes the column vectorization of M. Then, the expected explicit regularizer due to dropout equals $\lambda\, \mathrm{vec}(\mathrm{M})^\top \mathrm{diag}(\mathrm{C})\, \mathrm{vec}(\mathrm{M})$, where $\mathrm{C} = \mathbb{E}[\mathrm{vec}(\mathrm{A})\mathrm{vec}(\mathrm{A})^\top]$ is the second moment of the measurement matrices. For instance, with Gaussian measurements, the second moment equals the identity matrix, in which case the regularizer reduces to the squared Frobenius norm of the parameters, $\lambda\,\|\mathrm{M}\|_F^2$. While such a ridge penalty yields a useful inductive bias in linear regression, it is not “rich” enough to capture the kind of inductive bias that provides rank control in matrix sensing.

However, simply representing the hypotheses in a factored form alone is not sufficient in terms of imparting a rich inductive bias to the learning problem. Recall that in linear regression, dropout, when applied to the input features, yields ridge regularization. However, if we were to represent the linear predictor in terms of a deep linear network, then we argue that the effect of dropout is markedly different. Consider a deep linear network with a single output neuron, computing $x \mapsto \mathrm{w}^\top x$ for an end-to-end linear map w. In this case, mianjy2019dropout show that the explicit regularizer equals $\nu\, \hat{\mathbb{E}}\big[(\mathrm{w}^\top x)^2\big]$, where $\nu$ is a regularization parameter independent of the parameters w. Consequently, in deep linear networks with a single output neuron, dropout reduces to solving

$\min_{\mathrm{w}}\ \frac{1}{n}\sum_{i=1}^n (y_i - \mathrm{w}^\top x_i)^2 + \frac{\nu}{n}\sum_{i=1}^n (\mathrm{w}^\top x_i)^2.$

All the minimizers of the above problem are solutions to the system of linear equations $(1+\nu)\,\mathrm{X}^\top\mathrm{X}\,\mathrm{w} = \mathrm{X}^\top \mathrm{y}$, where X and y are the design matrix and the response vector, respectively. In other words, the dropout regularizer manifests itself merely as a scaling of the parameters, which does not seem to be useful in any meaningful way.

What we argue above may at first seem to contradict the results of Section 2 on matrix sensing, which is arguably an instance of regression with a two-layer linear network. Note, though, that casting matrix sensing in a factored form as a linear regression problem requires us to use a convolutional structure. This is easy to check, since

$\langle \mathrm{A}, \mathrm{U}\mathrm{V}^\top \rangle = \mathrm{vec}(\mathrm{A})^\top \mathrm{vec}(\mathrm{U}\mathrm{V}^\top) = \mathrm{vec}(\mathrm{A})^\top (\mathrm{V} \otimes \mathrm{I})\, \mathrm{vec}(\mathrm{U}),$

where $\otimes$ is the Kronecker product, and we used the fact that $\mathrm{vec}(\mathrm{X}\mathrm{Y}^\top) = (\mathrm{Y} \otimes \mathrm{I})\,\mathrm{vec}(\mathrm{X})$ for any pair of matrices X, Y. The map $\mathrm{vec}(\mathrm{U}) \mapsto (\mathrm{V} \otimes \mathrm{I})\,\mathrm{vec}(\mathrm{U})$ represents a fully connected convolutional layer with filters specified by the columns of V. The convolutional structure, in addition to dropout, is what imparts the nuclear norm regularization to the matrix sensing problem. For nonlinear networks, however, a simple feed-forward structure suffices, as we saw in Section 3.
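The identity used above is easy to verify numerically, keeping in mind that the Kronecker form requires the column-major (Fortran-order) vectorization; the snippet below is a standalone check.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 5, 4, 3
A = rng.normal(size=(d1, d2))
U = rng.normal(size=(d1, k))
V = rng.normal(size=(d2, k))

vec = lambda M: M.flatten(order='F')          # column-major vectorization

lhs = np.sum(A * (U @ V.T))                   # <A, U V^T>
rhs = vec(A) @ np.kron(V, np.eye(d1)) @ vec(U)
print(np.allclose(lhs, rhs))                  # True
```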

5 Experimental Results

In this section, we report our empirical findings on real world datasets. All results are averaged over 50 independent runs with random initialization.

Figure 1: MovieLens dataset: the training error (left), the test error (middle), and the generalization gap (right) for plain SGD and for dropout with several dropout rates, as a function of the number of iterations, at a fixed factorization size.

5.1 Matrix Completion

We evaluate dropout on the MovieLens dataset harper2016movielens, a publicly available collaborative filtering dataset that contains 10M ratings for 11K movies by 72K users of the online movie recommender service MovieLens.

We initialize the factors using the standard He initialization scheme. We train the model for 100 epochs over the training data, using a fixed learning rate and a fixed batch size. We report the results for plain SGD (i.e., with no dropout) as well as for the dropout algorithm with several dropout rates.
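For readers who want to reproduce the qualitative behavior, the sketch below runs dropout-SGD for matrix completion on synthetic (user, item, rating) triples standing in for MovieLens; the factorization size, learning rate, number of epochs, and the 1/p rescaling of kept factor columns are our own choices rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, p, lr = 200, 100, 16, 0.7, 0.05

# Synthetic (user, item, rating) triples standing in for MovieLens.
n_obs = 5_000
users = rng.integers(n_users, size=n_obs)
items = rng.integers(n_items, size=n_obs)
ratings = rng.normal(loc=3.5, scale=1.0, size=n_obs)

U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_items, k))

for epoch in range(5):
    for idx in rng.permutation(n_obs):
        u_row, i_row, r = users[idx], items[idx], ratings[idx]
        b = rng.binomial(1, p, size=k) / p          # drop factor columns, rescale kept ones
        pred = (U[u_row] * b) @ V[i_row]
        err = pred - r
        grad_u = err * b * V[i_row]                 # gradient of 0.5 * (pred - r)^2 w.r.t. U[u_row]
        grad_v = err * b * U[u_row]                 # gradient of 0.5 * (pred - r)^2 w.r.t. V[i_row]
        U[u_row] -= lr * grad_u
        V[i_row] -= lr * grad_v

rmse = np.sqrt(np.mean([((U[u] @ V[i]) - r) ** 2
                        for u, i, r in zip(users, items, ratings)]))
print(rmse)
```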

Figure 1 shows the progress in terms of the training and test error, as well as the gap between them, as a function of the number of iterations. It can be seen that plain SGD is the fastest at minimizing the empirical risk. The dropout rate clearly determines the trade-off between the approximation error and the estimation error: as the dropout rate increases, the algorithm favors less complex solutions that suffer a larger empirical error (left figure) but enjoy a smaller generalization gap (right figure). The best trade-off here seems to be achieved by a moderate dropout rate. We observe similar behaviour for different factorization sizes; please see the Appendix for additional plots with other factorization sizes.

It is remarkable how, even in the “simple” problem of matrix completion, plain SGD lacks a proper inductive bias. As seen in the middle plot, without explicit regularization, in particular without early stopping or dropout, SGD starts overfitting. We further illustrate this in Table 1, where we compare the test root-mean-squared error (RMSE) of plain SGD with that of the dropout algorithm, for various factorization sizes. To show the superiority of dropout over SGD with early stopping, we give SGD the advantage of having access to the test set (and not a separate validation set), and report the best iterate in the third column. Even with this impractical privilege, dropout achieves a better test RMSE.

5.2 Neural Networks

Figure 2: (left) “co-adaptation”, (middle) the generalization gap, and (right) the quantity appearing in the Rademacher complexity bound, each as a function of the width of networks trained with dropout on MNIST. In the left figure, the dashed brown and dotted purple lines represent minimal and maximal co-adaptation, respectively.

We train two-layer neural networks, with and without dropout, on the MNIST dataset of handwritten digits and the Fashion-MNIST dataset of Zalando's article images, each of which contains 60K training examples and 10K test examples, where each example is a 28x28 grayscale image associated with a label from one of 10 classes. We extract two classes and label them as ±1 (we observe similar results across other choices of target classes). The learning rate is fixed in all experiments. We train the models for 30 epochs over the training set. We run the experiments both with and without symmetrization. Here we only report the results with symmetrization, and on the MNIST dataset. For experiments without symmetrization, and for experiments on Fashion-MNIST, please see the Appendix. We remark that under the above experimental setting, the trained networks achieve nearly 100% training accuracy.

For any hidden node, we define its flow as the product of the norm of its outgoing weights and the root of the empirical second moment of its activation (computed analogously on the symmetrized data), which measures the overall contribution of the node to the output of the network. Co-adaptation occurs when a small subset of nodes dominates the overall function of the network. We argue that a normalized aggregate of these flows is a suitable measure of co-adaptation (or lack thereof) in a network parameterized by w: in the case of high co-adaptation, only a few nodes have a high flow, whereas at the other end of the spectrum all nodes are equally active. Figure 2 (left) illustrates this measure as a function of the network width for several dropout rates. In particular, we observe that a higher dropout rate corresponds to less co-adaptation. More interestingly, even plain SGD is implicitly biased towards networks with less co-adaptation. Moreover, for a fixed dropout rate, the regularization effect due to dropout decreases as we increase the width. Thus, it is natural to expect more co-adaptation as the network becomes wider, which is what we observe in the plots.
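One natural instantiation of such a measure (our own choice for illustration; the precise definition used in Figure 2 is the paper's) takes the flow of node i to be ν_i = |u_i|·sqrt(Ê[a_i(x)²]) and reports ‖ν‖₁/(√k·‖ν‖₂), which equals 1/√k when a single node carries all the flow and 1 when all nodes contribute equally.

```python
import numpy as np

def node_flows(W, u, X):
    """Flow of hidden node i: |u_i| * sqrt( mean_j relu(w_i^T x_j)^2 ).
    (Our own instantiation of the 'flow' described in the text.)"""
    A = np.maximum(X @ W.T, 0.0)
    return np.abs(u) * np.sqrt(np.mean(A ** 2, axis=0))

def coadaptation_ratio(flows):
    """||flows||_1 / (sqrt(k) * ||flows||_2): 1/sqrt(k) if one node dominates,
    1 if all nodes contribute equally."""
    k = flows.size
    return np.sum(flows) / (np.sqrt(k) * np.linalg.norm(flows))

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 20))
W, u = rng.normal(size=(8, 20)), rng.normal(size=8)
print(coadaptation_ratio(node_flows(W, u, X)))
```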

The generalization gap is plotted in Figure 2 (middle). As expected, increasing the dropout rate decreases the generalization gap, uniformly over all widths. In our experiments, the generalization gap increases with the width of the network. The figure on the right shows the quantity that appears in the Rademacher complexity bounds in Section 3. We note that the bound on the Rademacher complexity is predictive of the generalization gap, in the sense that a smaller bound corresponds to a curve with a smaller generalization gap.

6 Conclusion

Motivated by the success of dropout in deep learning, we study a dropout algorithm for matrix sensing and show that it enjoys strong generalization guarantees as well as competitive test performance on the MovieLens dataset. We then focus on deep regression under the squared loss and show that the regularizer due to dropout serves as a strong complexity measure for the underlying class of neural networks, using which we give a generalization error bound in terms of the value of the regularizer.

Acknowledgements

This research was supported, in part, by NSF BIGDATA award IIS-1546482 and NSF CAREER award IIS-1943251. The seeds of this work were sown during the summer 2019 workshop on the Foundations of Deep Learning at the Simons Institute for the Theory of Computing. Raman Arora acknowledges the support provided by the Institute for Advanced Study, Princeton, New Jersey as part of the special year on Optimization, Statistics, and Theoretical Machine Learning.

References

Appendix A Auxiliary Results

Lemma 1 (Khintchine-Kahane inequality).

Let be i.i.d. Rademacher random variables, and . Then there exist universal constants such that

Theorem 4 (Hoeffding’s inequality; Theorem 2.6.2 of vershynin2018high).

Let $X_1, \ldots, X_n$ be independent, mean-zero, sub-Gaussian random variables. Then, for every $t \ge 0$, we have

$\mathbb{P}\Big(\Big|\sum_{i=1}^n X_i\Big| \ge t\Big) \le 2\exp\Big(-\frac{c\, t^2}{\sum_{i=1}^n \|X_i\|_{\psi_2}^2}\Big),$

where $c > 0$ is an absolute constant and $\|\cdot\|_{\psi_2}$ denotes the sub-Gaussian norm.

Theorem 5 (Theorem 3.1 of mohri2018foundations).

Let $\mathcal{G}$ be a family of functions mapping from $\mathcal{Z}$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over a sample $S$ of size $n$, the following holds for all $g \in \mathcal{G}$:

Theorem 6 (Theorem 10.3 of mohri2018foundations).

Assume that for all . Then, for any $\delta > 0$, with probability at least $1 - \delta$ over a sample of size $n$, the following inequalities hold uniformly for all .

Theorem 7 (Based on Theorem 1 in srebro2010optimistic).

Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input space and the label space, respectively. Let $\mathcal{F}$ be the target function class. For any $f \in \mathcal{F}$ and any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, let $\ell(f(x), y) = (f(x) - y)^2$ be the squared loss. Let $L(f)$ be the population risk with respect to the joint distribution on $\mathcal{X} \times \mathcal{Y}$. For any $\delta > 0$, with probability at least $1 - \delta$ over a sample of size $n$, we have for any $f \in \mathcal{F}$:

where , and $c$ is a numeric constant derived from srebro2010optimistic.

Theorem 8 (Theorem 3.3 in mianjy2018implicit).

For any pair of matrices , there exists a rotation matrix such that the rotated matrices satisfy , for all .

Theorem 9 (Theorem 1 in foygel2011learning).

Assume that for all . For any $\gamma > 0$, consider the class of linear transformations with weighted trace-norm bounded by $\gamma$. Then the expected Rademacher complexity of this class is bounded as follows:

Appendix B Matrix Sensing

Proposition 2 (Dropout regularizer in matrix sensing).

The following holds for any $\mathrm{U} \in \mathbb{R}^{d_1 \times k}$ and $\mathrm{V} \in \mathbb{R}^{d_2 \times k}$:

$\hat{L}_{\mathrm{drop}}(\mathrm{U}, \mathrm{V}) = \hat{L}(\mathrm{U}, \mathrm{V}) + \lambda\, \hat{R}(\mathrm{U}, \mathrm{V}), \qquad (6)$

where $\hat{R}(\mathrm{U}, \mathrm{V}) = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k (\mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j)^2$ and $\lambda = \frac{1-p}{p}$ is the regularization parameter.

Proof of Proposition 2.

Similar statements and proofs can be found in several previous works (srivastava2014dropout; wang2013fast; cavazza2018dropout; mianjy2018implicit). For completeness, we include a proof here. The following equality follows from the definition of variance:

$\mathbb{E}_{\mathrm{B}}\Big(y_i - \big\langle \mathrm{A}_i, \tfrac{1}{p}\,\mathrm{U}\mathrm{B}\mathrm{V}^\top \big\rangle\Big)^2 = \Big(y_i - \mathbb{E}_{\mathrm{B}}\big\langle \mathrm{A}_i, \tfrac{1}{p}\,\mathrm{U}\mathrm{B}\mathrm{V}^\top \big\rangle\Big)^2 + \mathrm{Var}_{\mathrm{B}}\Big(\big\langle \mathrm{A}_i, \tfrac{1}{p}\,\mathrm{U}\mathrm{B}\mathrm{V}^\top \big\rangle\Big). \qquad (7)$

Recall that for a Bernoulli random variable $b \sim \mathrm{Bern}(p)$, we have $\mathbb{E}[b] = p$ and $\mathrm{Var}(b) = p(1-p)$. Thus, the first term on the right-hand side is equal to $(y_i - \langle \mathrm{A}_i, \mathrm{U}\mathrm{V}^\top \rangle)^2$. For the second term we have

$\mathrm{Var}_{\mathrm{B}}\Big(\tfrac{1}{p}\sum_{j=1}^k \mathrm{B}_{jj}\, \mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j\Big) = \frac{1}{p^2}\sum_{j=1}^k \mathrm{Var}(\mathrm{B}_{jj})\,(\mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j)^2 = \frac{1-p}{p}\sum_{j=1}^k (\mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j)^2.$

Plugging the above into Equation (7) and averaging over the $n$ samples, we get

$\hat{L}_{\mathrm{drop}}(\mathrm{U}, \mathrm{V}) = \hat{L}(\mathrm{U}, \mathrm{V}) + \frac{1-p}{p} \cdot \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k (\mathrm{u}_j^\top \mathrm{A}_i \mathrm{v}_j)^2,$

which completes the proof. ∎

Lemma 2 (Concentration in matrix completion).

For , let be an indicator matrix whose -th element is selected according to some distribution. Assume