Recently, deep belief networks (DBNs) have been used to design state-of-the-art systems in several important learning applications. An important reason for the success of DBNs is that they can model complex prediction functions using a large number of parameters linked through non-linear gating functions. However, this also makes training such models an extremely challenging task. Since there are potentially a large number of local minima in the space of parameters, any standard gradient-descent-style method is prone to getting stuck in a local minimum that might be arbitrarily far from the global optimum.
A popular heuristic to avoid such local minima is dropout which perturbs the objective function randomly by dropping out several nodes of the DBN. Recently, there has been some work to understand this heuristic in certain limited convex settings Baldi and Sadowski (2014); Wager et al. (2013). However, in general the heuristic is not well understood, especially in the context of DBNs.
In this work, we first seek to understand why and under what conditions dropout helps in training DBNs. To this end, we show that for fairly general one-hidden-layer neural networks, dropout indeed helps avoid local minima/stationary points. We prove that the following holds with at least a constant probability: dropout decreases the objective function value by a multiplicative factor, as long as the objective function value is not close to the optimal value (see Theorem 1). To the best of our knowledge, ours is the first such result that explains the performance of dropout for training neural networks.
Recently in a seminal work, Andoni et al. (2014) showed rigorously that a gradient descent based method for neural networks can be used to learn low-degree polynomials. However, their method analyzes a complex perturbation to gradient descent and does not apply to dropout. Moreover, our results apply to a significantly more general problem setting than the ones considered in Andoni et al. (2014); see Section 2.1 for more details.
Excess Risk bounds for Dropout. Additionally, we also study the dropout heuristic in a relatively easier setting of convex empirical risk minimization (ERM), where gradient descent methods are known to converge to the global optimum. In contrast to the above mentioned “instability” result, for convex ERM setting, our result indicates that the dropout heuristic leads to “stability” of the optimum. This hints at a dichotomy that dropout makes the global optimum stable while de-stabilizing local optima.
In particular, we study the excess error incurred by the dropout method when applied to the convex ERM problem. We show that, in expectation, dropout solves a problem similar to weighted $\ell_2$-regularized ERM and exhibits fast excess risk rates (see Theorem 3). In comparison to recent works that analyze dropout for ERM-style problems Baldi and Sadowski (2014); Wager et al. (2014), we study the general problem of convex ERM in generalized linear models (GLMs) and provide precise generalization error bounds for the same. See Section 3.1 for more details.
Private learning using dropout. Privacy is a looming concern for several large scale machine learning applications that have access to potentially sensitive data (e.g., medical health records) Dwork (2006). Differential privacy Dwork et al. (2006b) is a cryptographically strong notion of statistical data privacy. It has been extremely effective in protecting privacy in learning applications Chaudhuri et al. (2011); Duchi et al. (2013); Song et al. (2013); Jain and Thakurta (2014).
As mentioned above, for convex ERMs, dropout can be shown to be “stable” w.r.t. changing one or a few entries in the training data. Using this insight, we design a dropout based differentially private algorithm for convex ERMs (in GLMs). Our algorithm requires that, in expectation over the randomness of dropout, the minimum eigenvalue of the Hessian of the given convex function be lower bounded. This is in stark contrast to the existing differentially private learning algorithms, most of which either need a strongly convex regularization or assume that the given ERM itself is strongly convex.
Experimental evaluation of dropout. Finally, we empirically validate our stability and “regularization” assertion for dropout in the convex ERM setting. In particular, we focus on the stability of dropout w.r.t. removal of training data, i.e., LOO stability. We study the random and adversarial removal of data samples. Interestingly, recent works by Szegedy et al. (2013) and Maaten et al. (2013) provide a complementary set of experiments: while we study dropout for adversarial removal of the training data, Szegedy et al. (2013) study adversarial perturbation of test inputs and Maaten et al. (2013) consider corrupted features. Our experiments indicate that dropout engenders more stability in accuracy than $\ell_2$ regularization (with appropriate cross-validation to tune the regularization parameter). Moreover, perhaps surprisingly, dropout yields a more accurate classifier than the popular $\ell_2$ regularization for several datasets. For example, for the Atheist dataset from UCI repository, dropout based logistic regression is almost 3% more accurate than the $\ell_2$-regularized logistic regression.
Paper Organization: We present our analysis of dropout for training neural networks in Section 2. Then, Section 3 presents excess risk bounds for dropout when applied to the convex ERM problem. In Section 4, we show that dropout applied to convex ERMs leads to stable solutions that can be used to guarantee differential privacy for the algorithm. Finally, we present our empirical results in Section 5.
2 Dropout algorithm for neural networks
In this section, we provide rigorous guarantees for training a certain class of neural networks (which are in particular non-convex) using the dropout heuristic. In particular, we show that dropout ensures with a constant probability that gradient descent does not get stuck in a “local optimum”. In fact, under certain assumptions (stated in Theorem 1), one can show that the function estimation error actually reduces by a multiplicative factor due to dropout. Andoni et al. (2014) also study the robustness properties of the local optima encountered by the gradient descent procedure while training neural networks. However, their proof applies only for complex perturbation of gradient descent and only for approximating low-degree polynomials.
We first describe the exact problem setting that we study. Let the space of input feature vectors be. Let be a fixed distribution defined on . For a fixed function , the goal is to approximate the function with a neural network (which we will define shortly). For a given estimated function , the error is measured by . We also define inner-product w.r.t. the distribution as . We now define the architecture of the neural network that is used to approximate .
Neural network architecture. We consider a one-hidden layer neural network architecture with nodes in the hidden layer. Let the underlying function be given by: , where is the link function for each hidden layer node . For simplicity, we assume that the coefficients are fixed . The goal is to learn parameters for each node. Also, let . The training data given for the learning task is . Note that Andoni et al. (2014) also studies the same architecture but their link functions are assumed to be low-degree polynomials.
Dropout heuristic. We now describe the dropout algorithm for this problem. At the -th step (for any ), sample a data point and perform gradient descent with learning rate : , where is the gradient of the error in approximation of . That is,
where is the -th step approximation to .
Now, if the procedure is stuck in a local minimum, then we use dropout perturbation to push it out of the local minima. That is, select a vector , where each . Now, for the current estimation (at time step ) we obtain a new polynomial as:
where . We now perform the gradient descent procedure using this perturbed instead of the true -th step iterate .
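The dropout step described above can be sketched in a few lines. This is a minimal illustration under assumptions of our own choosing, not a statement of the exact procedure analyzed here: the link function is taken to be `tanh`, the fixed output coefficients are set to `1/k`, each hidden node is kept with probability 1/2, and the function names (`predict`, `dropout_mask`, `sgd_step`) are hypothetical.

```python
import math
import random

def predict(weights, alpha, x, mask=None):
    """One-hidden-layer network: sum_i alpha[i] * mask[i] * tanh(<w_i, x>)."""
    if mask is None:
        mask = [1] * len(weights)
    return sum(
        a * m * math.tanh(sum(wj * xj for wj, xj in zip(w, x)))
        for w, a, m in zip(weights, alpha, mask)
    )

def dropout_mask(k, rng, keep_prob=0.5):
    """Sample b in {0,1}^k; node i is kept with probability keep_prob (assumed)."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(k)]

def sgd_step(weights, alpha, x, y, lr, mask=None):
    """One gradient step on the squared error (g(x) - y)^2 for a single sample."""
    if mask is None:
        mask = [1] * len(weights)
    residual = predict(weights, alpha, x, mask) - y
    new_weights = []
    for w, a, m in zip(weights, alpha, mask):
        pre = sum(wj * xj for wj, xj in zip(w, x))
        # Chain rule: d/dw_i of alpha_i * m_i * tanh(<w_i, x>) = alpha_i * m_i * (1 - tanh^2) * x
        scale = 2.0 * residual * a * m * (1.0 - math.tanh(pre) ** 2)
        new_weights.append([wj - lr * scale * xj for wj, xj in zip(w, x)])
    return new_weights

rng = random.Random(0)
k, p = 4, 3
weights = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(k)]
alpha = [1.0 / k] * k          # fixed output coefficients (illustrative value)
x, y = [0.5, -1.0, 2.0], 0.3   # one toy training point

mask = dropout_mask(k, rng)    # dropout perturbation of the current estimate
perturbed = sgd_step(weights, alpha, x, y, lr=0.1, mask=mask)
```

In this sketch, nodes with a zero mask entry contribute nothing to the perturbed estimate, so their weights are left unchanged by the step.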
We now analyze the effectiveness of the above dropout heuristic for function approximation. We would like to stress that the objective in this section is to demonstrate instability of local minima in function approximation w.r.t. dropout perturbation. This entails that if the gradient descent procedure is stuck in a local minimum/stationary point, then the dropout heuristic helps get out of it and in fact reduces the estimation error significantly. However, the exposition here does not guarantee that the dropout algorithm reaches the global minimum. It just ensures that using dropout one can get out of the local minimum.
Let be the true polynomial for , where represents the -th node’s link function in the neural network. Let represent the weights on the output layer of the neural network, with . Let be the current estimate of .
Let be a fixed distribution on from which the training examples are drawn. If and , then with probability at least over the dropout, the dropped out neural network satisfies the following.
Also, for all .
At a high level, the above theorem shows that if the estimation error is large enough, then the following holds with at least a constant probability: , which is a dropout based perturbation of , has significantly lesser estimation error than itself. Next in Section 2.1 we apply our results to the problem of learning low-degree polynomials and compare our guarantees with those of Andoni et al. (2014).
Proof of Theorem 1.
Let be the error polynomial for the approximation and let be the error polynomial for the approximation . We have the following identity. Here , and the last step in (1) is for notational purposes.
We first analyze the term . Notice that one can equivalently write the polynomial as , where . We have.
) we now need to lower bound the variance of
, i.e., lower bound the random variable. By the randomness of ’s we have . Also by (3), we have . Using the standard Paley-Zygmund anti-concentration inequality, we have . Plugging in the bounds on from above we have .
Now we focus on the term in (1) and provide an upper bound. We have . Using the bounds on the terms and one can conclude that if , then with probability at least , . This completes the proof. ∎
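For reference, the Paley-Zygmund inequality invoked in the anti-concentration step can be stated in its standard form as follows. The symbols $Z$ and $\theta$ are generic and are not tied to the notation of the proof.

```latex
% Paley-Zygmund: Z is a non-negative random variable with finite
% second moment, and 0 <= theta <= 1.
\[
  \Pr\big[\, Z > \theta\, \mathbb{E}[Z] \,\big]
  \;\ge\; (1-\theta)^2 \, \frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]} .
\]
```

Applied to a non-negative random variable whose mean and second moment are both controlled, this gives a constant-probability lower bound, which is the shape of guarantee the proof needs.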
2.1 Application: Learning polynomials with neural networks
The work of Andoni et al. (2014) studied the problem of learning degree- polynomials (with real or complex coefficients) using polynomial neural networks described above. In this section we provide a comparative analysis of Andoni et al. (2014, Theorem 5.1) with Theorem 1 above. The approach of Andoni et al. (2014) is different from our approach in two ways: i) For the analysis of Andoni et al. (2014) to go through, the perturbation has to be complex, and ii) They consider additive perturbation to the weights as opposed to the multiplicative perturbation to the nodes exhibited by dropout.
In order to make the results comparable, we will assume that for each node , , and . (Since Andoni et al. (2014) deal with complex numbers, these bounds above are on the modulus.) Under this assumption, Theorem 1 suggests that the error can be brought down to , whereas Andoni et al. (2014, Theorem 5.1) show that the error can be brought down to . Notice that our bound is independent of the dimensionality () and the degree of the polynomial (). In terms of the rate of convergence, Andoni et al. (2014, Theorem 5.1) ensures that the error reduces by factor, while in our case it is . Another advantage of Theorem 1 is that it is oblivious to the data distribution , as opposed to the results of Andoni et al. (2014) which explicitly require to be either uniform or Gaussian.
3 Fast rates of convergence for dropout
In the previous section we saw how dropout helps one come out of local minimum encountered during gradient descent. In this section, we show that for generalized linear models (GLMs) (a class of one layer convex neural networks), dropout gradient descent provides an excess risk bound of , where is the number of training data samples.
Problem Setting. We first describe the exact problem setting that we study. Let be a fixed but unknown distribution over the data domain , where is the input feature domain and is the target output domain. Let the loss be a real-valued convex function (in the first parameter) defined over all and all . The population and excess risk of a model are defined as:
where is a fixed convex set. A learning algorithm typically has access to only a set of samples , drawn i.i.d. from . The goal of the algorithm is to find with small excess risk.
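One standard way to write these two quantities for a GLM is given below. The symbol names ($\theta$ for the model, $\mathcal{D}$ for the data distribution, $\mathcal{C}$ for the convex set, $\ell$ for the loss) are chosen here for illustration and should be read as assumptions about notation.

```latex
% Population risk and excess risk of a model theta in a convex set C,
% for a loss ell that is convex in its first argument.
\[
  \mathrm{Risk}(\theta) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}
      \big[\, \ell(\langle x, \theta\rangle;\, y) \,\big],
  \qquad
  \mathrm{ExcessRisk}(\theta) \;=\; \mathrm{Risk}(\theta)
      \;-\; \min_{\theta' \in \mathcal{C}} \mathrm{Risk}(\theta') .
\]
```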
Dropout Heuristic. We now describe the dropout based algorithm used to minimize the Excess Risk (see (3)). At a high level, we just use the standard stochastic gradient descent algorithm. However, at each step, a random fraction of the coordinates of the parameter vector is updated. That is, the data point generated by the stochastic gradient descent is perturbed to obtain where the -th coordinate of is given by . Now, the perturbed is used to update the parameter vector . In this section, we assume that the sampling probability . See Algorithm 1 for the exact dropout algorithm that we analyze.
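The following is a minimal sketch of dropout SGD for a GLM, written under assumptions that are ours for illustration only: the loss is the logistic loss with labels in {-1, +1}, each feature coordinate is kept with probability 1/2 and surviving coordinates are rescaled by 2, and no projection onto a constraint set is performed. The function names are hypothetical and this is not meant to reproduce Algorithm 1 exactly.

```python
import math
import random

def dropout_features(x, rng, keep_prob=0.5):
    """Zero each coordinate independently and rescale survivors by 1/keep_prob."""
    return [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]

def logistic_grad(theta, x, y):
    """Gradient in theta of log(1 + exp(-y * <theta, x>)) for a label y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * xi for xi in x]

def dropout_sgd(data, dim, lr=0.1, epochs=5, seed=0):
    """Plain SGD in which every sampled point is replaced by its dropped-out copy."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # one shuffled pass over the data
            x_hat = dropout_features(x, rng)
            grad = logistic_grad(theta, x_hat, y)
            # Coordinates with x_hat[j] == 0 receive a zero gradient and are not updated.
            theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Toy data: the first coordinate has the same sign as the label.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([-1.0, 0.1], -1), ([-0.8, -0.3], -1)]
theta = dropout_sgd(data, dim=2)
```

Because the gradient of a GLM loss is a scalar multiple of the (perturbed) feature vector, dropping a feature coordinate is what causes only a random subset of parameter coordinates to move at each step.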
We also analyze a stylized variant of dropout that can be effectively captured by a standard regularized empirical risk minimization setup. (See Appendix A.1.) Both of these analyses hinge on the observation that even though the loss functions are not strongly convex in general, the dropout variants of these loss functions are strongly convex in expectation and enable us to derive an excess risk of in both cases. Recall for non-strongly convex loss functions in general, the lower bound on excess risk is Shalev-Shwartz et al. (2009).
Assumption 2 (Data normalization).
i) For any , , and ii) The loss function is -strongly convex in (i.e., ) and -Lipschitz (i.e, ).
In Theorem 3 we provide the excess risk guarantee for the dropout heuristic.
Theorem 3 (Dropout generalization bound).
Let be a fixed convex set and let Assumption 2 be true for the data domain and the loss . Let be i.i.d. samples drawn from . Let be defined as in (13). Let the learning rate . Then over the randomness of the SGD algorithm and the distribution , we have excess risk
Here , , and and are defined in Assumption 2. The outer expectation is over the randomness of the algorithm.
The proof of this theorem is provided in Section A.2. Observe that if , and are assumed to be constants, then the excess risk bound of Theorem 3 is . Second, note that the bound is for the dropout risk defined in (13). For the special case of linear regression (see Lemma 4), dropout-based risk is the true risk (3) plus regularization. Hence, in this case, using standard arguments Shalev-Shwartz et al. (2009) we get the excess risk rate for population risk defined in (3). However, for other loss functions, it is not clear how close the dropout based risk is to the population risk.
Let be drawn uniformly from and let be the least squares loss function, i.e., . Then,
Note. Notice that even when is not full rank (e.g., all the are scaled versions of the -dimensional vector ), we can still obtain an excess risk of for the dropout loss. Recall that in general for non-strongly convex loss functions, the best excess risk one can hope for is Shalev-Shwartz et al. (2009).
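For the least squares case, the claim that the dropout risk equals the true risk plus a regularizer can be checked numerically on a single data point. The sketch below assumes a particular form of dropout (masks uniform on {0,1}^p with surviving coordinates rescaled by 2) and the unnormalized squared loss; both are assumptions made for illustration, so the constants may differ from those in Lemma 4. The function names are hypothetical.

```python
from itertools import product

def squared_loss(pred, y):
    return (y - pred) ** 2

def expected_dropout_loss(x, y, theta):
    """Average the squared loss over all 2^p masks b in {0,1}^p (b uniform)."""
    p = len(x)
    total = 0.0
    for b in product((0, 1), repeat=p):
        pred = 2.0 * sum(xi * bi * ti for xi, bi, ti in zip(x, b, theta))
        total += squared_loss(pred, y)
    return total / 2 ** p

def loss_plus_penalty(x, y, theta):
    """Undropped squared loss plus the data-weighted penalty sum_j x_j^2 * theta_j^2."""
    pred = sum(xi * ti for xi, ti in zip(x, theta))
    penalty = sum((xi * ti) ** 2 for xi, ti in zip(x, theta))
    return squared_loss(pred, y) + penalty

x, y, theta = [1.0, -2.0, 0.5], 0.7, [0.3, 0.1, -1.2]
lhs = expected_dropout_loss(x, y, theta)   # exact expectation over the dropout mask
rhs = loss_plus_penalty(x, y, theta)       # true loss plus weighted l2 penalty
```

The two quantities agree because the rescaled prediction is unbiased and its variance is exactly the penalty term. The penalty weights are the squared feature values, which can remain positive even when the data covariance is rank deficient.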
3.1 Comparison to related work
After the seminal paper of Hinton et al. (2012), which demonstrated a strong experimental advantage of “dropping off” nodes in the training of deep neural networks, there has been a series of works providing strong theoretical understanding of the dropout heuristic Baldi and Sadowski (2014); Wager et al. (2013); Wang et al. (2013); van Erven et al. (2014); Helmbold and Long (2014); Wager et al. (2014); McAllester (2013); Maaten et al. (2013). A high-level conclusion from all these works has been that dropout behaves as a regularizer, and in particular as an $\ell_2$ regularizer when the underlying optimization problem is convex. In terms of rates of convergence, the work of Wager et al. (2013) provides asymptotic consistency for the dropout heuristic w.r.t. convex models. They show (using a second order Taylor approximation) that asymptotically dropout behaves as an adaptive $\ell_2$-regularizer. The work of Wager et al. (2014) provides the precise rate of convergence of the excess risk when the data is assumed to be coming from a Poisson generative model, and the underlying optimization task is topic modeling. For the classic problem of linear optimization over a polytope, dropout recovers essentially the same bound as follow-the-perturbed-leader Kalai and Vempala (2005) while bypassing the issue of tuning the regularization parameter.
In this work we extend this line of work further by providing the precise (non-asymptotic) rate of convergence of the dropout heuristic for arbitrary generalized linear models (GLMs). In essence, by providing this analysis, we close the fourth open problem raised in the work of van Erven et al. (2014) which posed the problem of determining the generalization error bound for GLMs. One surprising aspect of our result is that the rate of convergence is (as opposed to ), even when the underlying data covariance matrix is not full-rank.
4 Private convex optimization using dropout
In this section we show that dropout can be used to design differentially private convex optimization algorithms. In the last few years, the design of differentially private optimization (learning) algorithms has received significant attention Chaudhuri and Monteleoni (2008); Dwork and Lei (2009); Chaudhuri et al. (2011); Jain et al. (2012); Kifer et al. (2012); Duchi et al. (2013); Song et al. (2013); Jain and Thakurta (2014); Bassily et al. (2014). We further extend this line of research to show that dropout allows one to exploit properties of the data (e.g., minimum entry in the diagonal of the Hessian) to ensure robustness, and hence differential privacy. Differential privacy is a cryptographically strong notion which by now is the de facto standard for statistical data privacy. It ensures the privacy of individual entries in the data set even in the presence of arbitrary auxiliary information Dwork (2006, 2008).
For any pairs of neighboring data sets differing in exactly one entry, an algorithm is -differentially private if for all measurable sets in the range space of the following holds:
Here, think of and to be a small constant.
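For concreteness, the standard $(\epsilon,\delta)$-differential privacy condition of Dwork et al. (2006b) can be written as follows. The symbol names ($\mathcal{A}$ for the algorithm, $D$ and $D'$ for the neighboring data sets, $S$ for the output set) are notational assumptions made here.

```latex
% (epsilon, delta)-differential privacy for a randomized algorithm A,
% neighboring data sets D, D' and any measurable set S of outputs.
\[
  \Pr[\mathcal{A}(D) \in S] \;\le\; e^{\epsilon}\, \Pr[\mathcal{A}(D') \in S] + \delta .
\]
```

The multiplicative factor $e^{\epsilon}$ bounds how much any output probability can change when one entry changes, and $\delta$ is an additive slack.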
Background. At an intuitive level, differential privacy ensures that the measure induced on the space of possible outputs by a randomized algorithm does not depend “too much” on the presence or absence of one data entry. This intuition has two immediate consequences: i) If the underlying training data contains potentially sensitive information (e.g., medical records), then it ensures that an adversary learns almost the same information about an individual independent of his/her presence or absence in the data set, hence protecting his/her privacy, and ii) Since the output does not depend “too much” on any one data entry, the algorithm cannot over-fit and hence will provably have good generalization error. Formalizations of both these implications can be found in Dwork (2006) and Bassily et al. (2014, Appendix F). This property of a learning algorithm to not over-fit the training data (also known as stability) is known to be both necessary and sufficient for a learning algorithm to generalize Shalev-Shwartz et al. (2009, 2010); Poggio et al. (2011); Bousquet and Elisseeff (2002).
In the following we provide a stylized example where dropout ensures differential privacy. In Appendix B, we provide a detailed approach of extending this example to arbitrary generalized linear models (GLMs). (See Section 3 for a refresher on GLMs.)
4.1 Private dropout learning over the simplex
In this section, we analyze a stylized example: linear loss functions over the simplex. The idea is to first show that for a given data set (with a fixed set of properties which we will describe in Theorem 6), the dropout algorithm satisfies the differential privacy condition in (8) for any data set differing in one entry from . (For the purposes of brevity, we will refer to this as local differential privacy at .) Later we will use a standard technique called the propose-test-release (PTR) framework Dwork and Lei (2009) to convert the above into a differential privacy guarantee. (The details are given in Algorithm 2.)
Let the data domain be and let be i.i.d. samples from a distribution over . Let the loss function be and let the constraint set be the -dimensional simplex. Let be i.i.d. uniform samples from . The dropout optimization problem for the linear case can be defined as below. Here the -th coordinate of is given by .
At a high level, Theorem 6 states that changing any one data entry in the training data set changes the induced probability measure on the set of possible outputs by a factor of , and with an additive slack that is exponentially small in the number of data samples ().
Theorem 6 (Local differential privacy).
Let and . For the given data set if and , then the solution of dropout ERM (9) is -local differentially private at , where .
First notice that since we are optimizing a linear function over the simplex, the minimizer is essentially one of the coordinates in . Therefore one can equivalently write the optimization problem in (9) as follows. Here refers to the -th coordinate for the vector .
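The observation that a linear objective over the simplex is minimized at a vertex can be illustrated with a short sketch. The code below assumes 0/1 dropout masks applied independently to every coordinate of every sample with keep probability 1/2; these choices and the function names are hypothetical and serve only to illustrate the reduction to a coordinate-wise argmin.

```python
import random

def dropout_erm_vertex(samples, rng, keep_prob=0.5):
    """Return the index k minimizing sum_i x_i[k] * b_i[k] for fresh masks b_i."""
    dim = len(samples[0])
    totals = [0.0] * dim
    for x in samples:
        for k in range(dim):
            if rng.random() < keep_prob:  # b_i[k] = 1: coordinate k of sample i is kept
                totals[k] += x[k]
    # A linear function over the simplex attains its minimum at a vertex,
    # so it suffices to compare the per-coordinate totals.
    return min(range(dim), key=totals.__getitem__)

def vertex(k, dim):
    """The k-th vertex of the simplex, i.e. the standard basis vector e_k."""
    return [1.0 if j == k else 0.0 for j in range(dim)]

samples = [[0.2, 0.9, 0.4], [0.1, 0.8, 0.7], [0.3, 0.6, 0.5]]
k_star = dropout_erm_vertex(samples, random.Random(0))
theta_hat = vertex(k_star, len(samples[0]))
```

Since the output is determined by which coordinate has the smallest dropped-out total, the privacy analysis only needs to track how one changed sample shifts these per-coordinate sums.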
W.l.o.g. we assume that the neighboring data set differs from in . Also, let . Clearly for any , . In the following we show that the measures induced on the random variable by (10) for data sets and have multiplicative closeness. The analysis of this part closely relates to the differential privacy guarantee under binomial distribution from Dwork et al. (2006a).
For a given , let be the number of non-zeroes in the -th coordinate of all the ’s, excluding . Therefore, we have the following for any .
So, as long as , the ratio in (11) is upper bounded by for . By Chernoff bound, such an event happens with probability at least . For the lower tail of the binomial distribution, an analogous argument provides such a bound.
One can use the same argument for other coordinates too. Notice . By union bound, for any given , the ratio of the measures induced by and on any which has probability measure at least , is in .
In the following we notice that not only individually each of the coordinates satisfy the multiplicative closeness in measure, in fact and satisfy analogous closeness in measure, where the closeness is within . This property follows by using Bhaskar et al. (2010, Theorem 5). This concludes the proof. ∎
4.1.1 From local differential privacy to differential privacy
Notice that Theorem 6 (ensuring local differential privacy) is independent of the data distribution . This has direct implications for differential privacy. We show that using the propose-test-release (PTR) framework Dwork and Lei (2009); Smith and Thakurta (2013), the dropout heuristic provides a differentially private algorithm.
Propose-test-release framework. Notice that for any pair of data sets and differing in one entry, and in Theorem 6 differs by at most . So using the standard Laplace mechanism from differential privacy Dwork et al. (2006b), one can show that satisfies -differential privacy, where is a random variable sampled from the Laplace distribution with the scaling parameter of . With in hand we check if . If the condition holds, we output from (9), and we output the failure symbol $\perp$ otherwise. Theorem 7 ensures that the above PTR framework is -differentially private.
Propose-test-release framework along with dropout (i.e., Algorithm 2) is -differentially private for optimizing linear functions over the simplex, where and .
This theorem is a direct consequence of Theorem 6 and Thakurta (2015). Using the tail property of Laplace distribution, one can show that as long as in Theorem 6 is at least , w.p. at least the above PTR framework outputs from (9) exactly. While the current exposition of the PTR framework is tuned to the problem of optimizing linear functions over the simplex, a much more general treatment is provided in Appendix B.
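The propose-test-release pattern described above can be sketched generically as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 2: the arguments `statistic`, `sensitivity`, `threshold` and `release` are hypothetical placeholders for the data-dependent quantity being tested, its worst-case change between neighboring data sets, the test threshold, and the dropout ERM solver, and `None` stands in for the failure output.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def propose_test_release(statistic, sensitivity, epsilon, threshold, release, rng):
    """Call `release()` only if the noisy statistic clears the threshold.

    Returns None (standing in for the failure symbol) otherwise.
    """
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noisy = statistic + laplace_noise(sensitivity / epsilon, rng)
    if noisy > threshold:
        return release()
    return None

rng = random.Random(0)
result = propose_test_release(
    statistic=50.0,        # hypothetical stability statistic of the data set
    sensitivity=1.0,       # assumed change of the statistic between neighbors
    epsilon=0.5,
    threshold=10.0,
    release=lambda: "dropout ERM output",
    rng=rng,
)
```

The test itself is private because only the noisy statistic is compared with the threshold; the released value is the un-noised solver output, which is why the local privacy guarantee at the given data set is needed.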
In this section, we provide experimental evidence to support the stability guarantees we provided for dropout in Section 4 (for more extensive results, refer to Appendix D). We empirically measure stability by observing the effect on the performance of the learning algorithm, as a function of the fraction of training examples removed. This measure captures how dependent an algorithm is on a particular subset of the training data. We show results for GLMs as well as for deep belief networks (DBNs). We compare against the following two baseline methods (wherever applicable): a) unregularized models and b) $\ell_2$-regularized GLMs. We describe our experimental setup and results for each of these model classes below.
Stability of dropout for logistic regression. We introduce perturbations of two forms: a) random removal of training examples and b) adversarial removal of training examples.
Random removal of training examples: For a given , we train a model on a randomly selected -fraction of the training data. We report the test error and the difference in mean test error which is the absolute difference between the test error and the baseline error (the test error obtained by using the complete training dataset). We refer to this difference as the marginal error.
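The random-removal protocol can be sketched as below. The helper names, the `keep_fraction` parameterization (the fraction of training data retained), and the toy majority-label model are assumptions made for illustration; any training routine and error measure can be plugged in through `train_fn` and `error_fn`.

```python
import random

def marginal_error(train, test, keep_fraction, train_fn, error_fn, seed=0):
    """Return (test error on a random subset, |that error - baseline error|)."""
    rng = random.Random(seed)
    size = max(1, int(round(keep_fraction * len(train))))
    subset = rng.sample(train, size)
    baseline = error_fn(train_fn(train), test)     # model trained on all the data
    perturbed = error_fn(train_fn(subset), test)   # model trained on the subset
    return perturbed, abs(perturbed - baseline)

def majority_label(train):
    """Toy 'model': always predict the most frequent training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def misclassification_rate(model, test):
    """Fraction of test examples whose label differs from the constant prediction."""
    return sum(1 for _, y in test if y != model) / len(test)

train = [((0,), 1), ((1,), 1), ((2,), 1), ((3,), 0)]
test = [((4,), 1), ((5,), 0)]
err, marg = marginal_error(train, test, 1.0, majority_label, misclassification_rate)
```

Keeping the full training set gives a marginal error of zero by construction, which makes it a convenient sanity check before sweeping over smaller fractions.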
We present results on the benchmark Atheist dataset from the 20 newsgroup corpus; total number of examples = 1427, dimensionality: 22178. We use of the data for training and the remaining for testing and use a dropout rate of . We measure the error in terms of the fraction of misclassified examples. Figure 1(a,b) shows the results for different values of when training a logistic regression model with no regularization, $\ell_2$ regularization, and two variants of dropout: “standard” dropout Hinton et al. (2012), and deterministic dropout outlined in Wang and Manning (2013).
We observe that the dropout variants exhibit more stability than the unregularized or the $\ell_2$-regularized versions. Moreover, deterministic dropout is more stable than standard dropout. Notice that, even though dropout consistently has a lower test error than other methods, its effectiveness diminishes with increasing . We hypothesize that with a decreasing amount of training data, the regularization provided by dropout also decreases (see Section 3 and Appendix A.1).
Adversarial removal of training examples: Let be a given training set. Let be a model learned on the complete set . For a given value of , we remove the samples which have minimum . The rest of the experiment remains the same as in the random removal setting. Figure 1(c,d) shows the test error and the marginal error for different regularization methods w.r.t. in this adversarial setting.
As with random removal, dropout continues to be at least as good as the other regularization methods studied. However, when observe that dropout’s advantage decreases very rapidly, and all the methods tend to perform similarly.
Stability of linear regression. Next, we apply our methods to linear regression using the Boston housing dataset Bache and Lichman (2013) (with 506 samples and 14 features) for our experiments. We use 300 examples for training and the rest for testing. Figure 2 (a), (b) shows that the marginal error of dropout is less than that of the other methods for all values of . Interestingly, for small values of , dropout performs worse than $\ell_2$ regularization, although it performs better at higher values. Here we use a dropout rate of , and we measure the mean squared error.
Stability of deep belief networks: While our theoretical stability guarantees hold only for generalized linear models, our experiments indicate that they extend to deep belief networks (DBNs) too. We posit that the dropout algorithm on DBNs (after pre-training) operates in a locally convex region, where the stability properties should hold.
We use the MNIST data set for our DBN experiments. Experiments with other data sets are in Appendix D.2. The MNIST dataset contains 60000 examples for training and 10000 for testing. For training a DBN on this data set, we use a network with four layers (we use the gdbn and nolearn Python toolkits for training a DBN). We use 784, 800, 800, and 10 units in each layer respectively. Our error measure is the number of misclassifications.
As in the previous experiments, we measure stability by randomly removing training examples. See Figure 2 (c), (d) for test error and marginal error of dropout as well as the standard SGD algorithm applied to DBNs. Similar to the GLM setting, we observe that dropout exhibits more stability and accuracy than the unregularized SGD procedure. In fact for the case of 50% training data, dropout is more accurate than SGD.
- Andoni et al.  Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In ICML, 2014.
- Bache and Lichman  K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
- Baldi and Sadowski  Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial intelligence, 2014.
- Bassily et al.  Raef E Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization, revisited. Personal communication, 2014.
- Bhaskar et al.  Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. Discovering frequent patterns in sensitive data. In KDD, 2010.
- Bousquet and Elisseeff  Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
- Bousquet et al.  Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning. 2004.
- Chaudhuri and Monteleoni  Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, NIPS. MIT Press, 2008.
- Chaudhuri et al.  Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 12:1069–1109, 2011.
- Duchi et al.  John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In FOCS, 2013.
- Dwork  Cynthia Dwork. Differential privacy. In ICALP, LNCS, pages 1–12, 2006.
- Dwork  Cynthia Dwork. Differential privacy: A survey of results. In TAMC, pages 1–19. Springer, 2008.
- Dwork and Lei  Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC, 2009.
- Dwork et al. [2006a] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. Springer, 2006a.
- Dwork et al. [2006b] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006b.
- Helmbold and Long  David P. Helmbold and Philip M. Long. On the inductive bias of dropout. CoRR, abs/1412.4736, 2014. URL http://arxiv.org/abs/1412.4736.
- Hinton et al.  Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- Jain and Thakurta  Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In ICML, 2014.
- Jain et al.  Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In COLT, 2012.
- Kalai and Vempala  Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 2005.
- Kifer et al.  Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In COLT, 2012.
- Maaten et al.  Laurens Maaten, Minmin Chen, Stephen Tyree, and Kilian Q Weinberger. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 410–418, 2013.
- McAllester  David McAllester. A pac-bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.
- Nikolov et al.  Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: the sparse and approximate cases. In STOC, 2013.
- Poggio et al.  Tomaso Poggio, Stephen Voinea, and Lorenzo Rosasco. Online learning, stability, and stochastic gradient descent. CoRR, abs/1105.4701, 2011.
- Shalev-Shwartz et al.  Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic Convex Optimization. In Proceedings of the Conference on Learning Theory (COLT), 2009.
- Shalev-Shwartz et al.  Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. JMLR, 2010.
- Shamir and Zhang  Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.
- Smith and Thakurta  Adam D. Smith and Abhradeep Thakurta. Differentially private model selection via stability arguments and the robustness of the lasso. In COLT, 2013.
- Song et al.  Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, 2013.
- Sridharan et al.  Karthik Sridharan, Shai Shalev-shwartz, and Nathan Srebro. Fast rates for regularized objectives. In NIPS, 2008.
- Szegedy et al.  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
- Thakurta  Abhradeep Thakurta. Beyond worst case sensitivity in private data analysis. In Encyclopedia of Algorithms. 2015.
- van Erven et al.  Tim van Erven, Wojciech Kotłowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In COLT, 2014.
- Wager et al.  Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NIPS, 2013.
- Wager et al.  Stefan Wager, William Fithian, Sida Wang, and Percy S Liang. Altitude training: Strong bounds for single-layer dropout. In NIPS, 2014.
- Wang and Manning  Sida Wang and Christopher D. Manning. Fast dropout training. In ICML (2), 2013.
- Wang et al.  Sida Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D Manning. Feature noising for log-linear structured prediction. In EMNLP, pages 1170–1179, 2013.
Appendix A Fast rates of convergence for dropout optimization
A.1 Empirical risk minimization (ERM) formulation of dropout
For simplicity of exposition, we modify the ERM formulation to incorporate dropout perturbation as a part of the optimization problem itself. We stress that the ERM formulation is for intuition only. In Section 3, we analyze the stochastic gradient descent (SGD) variant of the dropout heuristic and show that the excess risk bound for the SGD variant is similar to that of the ERM variant.
Given a loss function , convex set , and data set which consists of i.i.d. samples drawn from , fitting a model with dropout corresponds to the following optimization:
where each is an i.i.d. sample drawn uniformly from , and the operator refers to the Hadamard product. We assume that the loss function is strongly convex in . For example, in the case of least-squares linear regression the loss function is .
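As a rough illustration of this formulation, the sketch below evaluates a dropout ERM objective for least-squares regression in plain Python. It rests on assumptions that are not spelled out in the text here: each mask is drawn uniformly from {0,1}^p, the masked prediction is rescaled by a factor of 2 so that it is unbiased for the unmasked inner product, and the loss is the plain squared error. The function names are hypothetical.

```python
import random

def sample_mask(p, rng):
    """Draw a dropout mask uniformly from {0, 1}^p (assumed mask distribution)."""
    return [rng.randint(0, 1) for _ in range(p)]

def dropout_prediction(x, b, theta):
    """Return 2 * <x * b, theta>, where * is the Hadamard product.

    The factor 2 is an assumption: it makes the prediction unbiased for
    <x, theta> when each coordinate is kept with probability 1/2.
    """
    return 2.0 * sum(xj * bj * tj for xj, bj, tj in zip(x, b, theta))

def dropout_erm_objective(theta, xs, ys, masks):
    """Average squared loss over the sample, with one fixed mask per example."""
    n = len(xs)
    return sum((y - dropout_prediction(x, b, theta)) ** 2
               for x, y, b in zip(xs, ys, masks)) / n

# Toy data; the masks are sampled once and then held fixed, as in an ERM view.
rng = random.Random(0)
xs = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
ys = [1.0, 0.0, 2.0]
masks = [sample_mask(len(x), rng) for x in xs]
print(dropout_erm_objective([0.5, 0.25], xs, ys, masks))
```

Holding the masks fixed makes the objective an ordinary deterministic function of the parameter vector, which is what allows it to be treated as an ERM problem rather than a stochastic procedure.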
Let be drawn uniformly from and let the expected population risk be given by
Lemma 8.
Let be a -strongly convex function w.r.t. . Then, the expected population risk (13) is strongly convex w.r.t. , where and is the -th coordinate of .
where the second-to-last inequality follows from the strong convexity of and from the fact that is sampled uniformly from . ∎
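The role of the uniform mask in this argument can be made explicit with a short second-moment computation. This is only a sketch under assumptions not shown in the text here: the dropout prediction is $2\langle x * b, \theta\rangle$, the mask $b$ is uniform on $\{0,1\}^p$, and the loss $\ell(u; y)$ is twice differentiable with $\partial^2 \ell / \partial u^2 \ge \alpha$. The symbols $J$, $\alpha$, $b(j)$ and $x(j)$ are labels chosen for this illustration.

```latex
\begin{align*}
\mathbb{E}_b\big[4\,(x * b)(x * b)^\top\big]_{jk}
  &= 4\,x(j)\,x(k)\,\mathbb{E}\big[b(j)\,b(k)\big]
   = \begin{cases} 2\,x(j)^2, & j = k,\\ x(j)\,x(k), & j \neq k, \end{cases} \\
\text{so}\quad \mathbb{E}_b\big[4\,(x * b)(x * b)^\top\big]
  &= x x^\top + \mathrm{diag}\big(x(1)^2, \dots, x(p)^2\big), \\
\nabla^2 J(\theta)
  &\succeq \alpha\,\mathbb{E}_{x}\Big[x x^\top + \mathrm{diag}\big(x(1)^2, \dots, x(p)^2\big)\Big]
   \succeq \alpha\,\min_{j}\,\mathbb{E}\big[x(j)^2\big]\; I .
\end{align*}
```

The first line uses $\mathbb{E}[b(j)^2] = 1/2$ and $\mathbb{E}[b(j)\,b(k)] = 1/4$ for $j \neq k$. The diagonal term is what keeps the curvature bounded away from zero even when $\mathbb{E}[x x^\top]$ is rank deficient.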
An immediate corollary to the above lemma is that for normalized features, i.e., , the dropout risk function (13) is the same as that for regularized least squares (in expectation).
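This equivalence can be checked exactly in small dimension by enumerating every mask. The sketch below assumes the masked prediction is $2\langle x * b, \theta\rangle$ with $b$ uniform on $\{0,1\}^p$ and the plain squared loss; under those assumptions the mask-averaged loss equals the unmasked squared error plus the penalty $\sum_j x(j)^2\theta(j)^2$, which reduces to $\|\theta\|_2^2$ when $x(j)^2 = 1$. The function names are hypothetical.

```python
from itertools import product

def exact_dropout_risk(x, y, theta):
    """Average (y - 2<x*b, theta>)^2 over all 2^p masks b in {0,1}^p."""
    p = len(x)
    total = 0.0
    for b in product((0, 1), repeat=p):
        pred = 2.0 * sum(xj * bj * tj for xj, bj, tj in zip(x, b, theta))
        total += (y - pred) ** 2
    return total / 2 ** p

def ridge_form(x, y, theta):
    """(y - <x, theta>)^2 plus the data-weighted penalty sum_j x_j^2 theta_j^2."""
    fit = (y - sum(xj * tj for xj, tj in zip(x, theta))) ** 2
    penalty = sum((xj * tj) ** 2 for xj, tj in zip(x, theta))
    return fit + penalty

# Arbitrary toy values; the two quantities should agree up to rounding error.
x, y, theta = [1.0, -2.0, 0.5], 0.7, [0.3, 0.1, -1.2]
gap = abs(exact_dropout_risk(x, y, theta) - ridge_form(x, y, theta))
print(gap < 1e-12)
```

The identity is a bias-variance decomposition over the mask: the rescaled prediction has mean $\langle x, \theta\rangle$ and variance $\sum_j x(j)^2\theta(j)^2$.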
Corollary 9.
Let be drawn uniformly from and let be the least squares loss function, i.e., . Then,
Next, we provide an excess risk bound for , the optimal solution to the dropout-based ERM (12). Our proof technique closely follows that of Sridharan et al. [2008] and crucially uses the fact that their analysis requires only strong convexity of the expected loss function. The risk bound is stated below.
Theorem 10 (Dropout generalization bound).
Let be a fixed convex set and let Assumption 2 be true for the data domain and the loss . Let be i.i.d. samples drawn from . Let be i.i.d. vectors drawn uniformly from . Let and let be defined as in (13). Then, w.p. (over the randomness of both and ), we have the following:
Here , and the parameters and are defined in Assumption 2.
Define as: , where . Also, let . Following the technique of Sridharan et al. [2008], we now scale each of the ’s so that those with a higher expected value over receive exponentially smaller weight. This yields a more fine-grained bound on the Rademacher complexity, as will become apparent below.
Let . Using standard Rademacher complexity bounds (Bousquet et al. [2004, Theorem 5]), for any , the following holds (w.p. over the randomness in the selection of the dataset ):
Here refers to the Rademacher complexity of the hypothesis class. In the following, we bound each of the terms on the right-hand side of (15).
Let , where refers to the -th coordinate of . We claim that
By the definition of the bound on the domain of and the assumption on , we have . Therefore , we have the following.
We now bound . By Lemma 8, is strongly convex; hence, using the optimality of , we have:
where the last equation follows from the definition of . ∎