Unsupervised neural networks assume unlabeled data to be generated from a neural network structure, and have been applied extensively to pattern analysis and recognition. The most basic one is the restricted Boltzmann machine (RBM) [Salakhutdinov, Mnih, and Hinton2007], an energy-based model with a layer of hidden nodes and a layer of visible nodes. With such a basic structure, we can stack multiple layers of RBMs to create an unsupervised deep neural network structure, such as the deep belief network (DBN) and the deep Boltzmann machine (DBM) [Hinton, Osindero, and Teh2006; Salakhutdinov and Hinton2009a; Salakhutdinov, Mnih, and Hinton2007; Tieleman2008]. Once we learn the parameters of a model, we can retrieve the values of the hidden nodes from the visible nodes, thus applying unsupervised neural networks for feature selection. Alternatively, we may consider applying the parameters obtained from an unsupervised deep neural network to initialize a deep feedforward neural network (FFNN), thus improving supervised learning.
One essential question for such models is how to adjust for the high dimensionality of their parameters and avoid overfitting. In FFNNs, the simplest regularization is arguably the early stopping method, which stops the gradient descent algorithm before the validation error rate goes up. The weight decay method, or $L_p$ regularization, is also commonly used [Witten, Frank, and Hall2011]. Recently, Dropout was proposed, which optimizes the parameters over an average of exponentially many sub-models, each based on a subset of all nodes [Srivastava et al.2014]. It has been shown to outperform weight decay regularization in many situations.
For regularizing unsupervised neural networks, sparse-RBM-type models encourage a smaller proportion of activated hidden nodes [Cho, Ilin, and Raiko2012; Lee, Ekanadham, and Ng2007]. DBNs are regularized in GT13 (2013) with outcome labels. While these works tend to be goal-specific, we consider regularization for unsupervised neural networks in a more general setting. Our work and contributions are as follows: (1) we extend common regularization methods to unsupervised deep neural networks, and explain their underlying mechanisms; (2) we propose partial Dropout/DropConnect, which can improve the performance of Dropout/DropConnect; (3) we compare the performance of different regularization methods on real data sets, thus providing suggestions on regularizing unsupervised neural networks. To our knowledge, this is the first study to illustrate the mechanisms of various regularization methods for unsupervised neural nets via model convergence and likelihood bounds, including the effective newly proposed partial Dropout/DropConnect.
Section 2 reviews recent works for regularizing neural networks, and Section 3 exhibits RBM regularization as a basis for regularizing deeper networks. Section 4 discusses the model convergence of each regularization method. Section 5 extends regularization to unsupervised deep neural nets. Section 6 presents a numerical comparison of different regularization methods on RBM, DBN, DBM, RSM [Salakhutdinov and Hinton2009b] and Gaussian RBM [Salakhutdinov, Mnih, and Hinton2007]. Section 7 discusses potential future research and concludes the paper.
2 Related Work
To begin with, we consider a simple FFNN with a single layer of input nodes $x$ and a single layer of output nodes $y$. The weight matrix $W$ is of size $|y| \times |x|$. We assume the relation
$$y = a(Wx), \qquad (1)$$
where the activation function $a(\cdot)$ is applied element-wise. Equation (1) has the modified form in SH14 (2014),
$$y = a\big(W(m \odot x)\big), \qquad (2)$$
where $\odot$ denotes element-wise multiplication, and each component of the mask follows the Bernoulli distribution $m_i \sim \mathrm{Bern}(p)$, thereby achieving the Dropout (DO) regularization for neural networks. In Dropout, we minimize the objective function
$$\min_W \; \mathbb{E}_m\!\left[\ell\big(y,\, a(W(m \odot x))\big)\right], \qquad (3)$$
which can be achieved by a stochastic gradient descent algorithm, sampling a different mask per data example and per iteration. We observe that this can be readily extended to deep FFNNs. Dropout regularizes neural networks because it incorporates prediction based on any subset of all the nodes, therefore penalizing the likelihood. A theoretical explanation of Dropout is provided in WW13 (2013), noting that it can be viewed as feature noising for generalized linear models (GLMs), and we have the relation
$$\mathbb{E}_m\!\left[\ell(\beta; \tilde{x}, y)\right] \approx \ell(\beta; x, y) + R(\beta). \qquad (4)$$
Here $\tilde{x} = (m \odot x)/p$ for simplicity, and $R(\beta) \approx \tfrac{1}{2}\sum_i A''(x_i \cdot \beta)\,\mathrm{Var}[\tilde{x}_i \cdot \beta]$, where $A$ is the log-partition function of a GLM. Therefore, Dropout can be viewed approximately as an adaptive $L_2$ regularization [Baldi and Sadowski2013; Wager, Wang, and Liang2013]. A recursive approximation of Dropout is provided in BS13 (2013) using normalized weighted geometric means to study its averaging properties.
An intuitive extension of Dropout is DropConnect (DC) [Wan et al.2013], which has the form below,
$$y = a\big((M \odot W)\,x\big), \qquad (5)$$
and thus masks the weights rather than the nodes. The objective
$$\min_W \; \mathbb{E}_M\!\left[\ell\big(y,\, a((M \odot W)x)\big)\right]$$
has the same form as in (3). There are a number of related model averaging regularization methods, each of which averages over subsets of the original model. For instance, Standout varies the Dropout probabilities across nodes, computing them with a binary belief network [Ba and Frey2013]. Shakeout adds additional noise to Dropout so that it approximates elastic-net regularization [Kang, Li, and Tao2016]. Fast Dropout accelerates Dropout with a Gaussian approximation [Wang and Manning2013]. Variational Dropout applies variational Bayes to infer the Dropout function [Kingma, Salimans, and Welling2015].
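For concreteness, the following minimal numpy sketch (our own illustration, with a sigmoid activation and arbitrary toy dimensions) contrasts the node mask of (2) with the weight mask of (5).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions and parameters (illustrative only).
n_in, n_out, p = 8, 4, 0.5
x = rng.random(n_in)
W = rng.normal(scale=0.1, size=(n_out, n_in))

# Dropout, eq. (2): one Bernoulli mask over the input nodes.
m = rng.binomial(1, p, size=n_in)
y_dropout = sigmoid(W @ (m * x))

# DropConnect, eq. (5): one Bernoulli mask over the individual weights.
M = rng.binomial(1, p, size=W.shape)
y_dropconnect = sigmoid((M * W) @ x)
```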
We note that while Dropout has been discussed for RBMs [Srivastava et al.2014], to the best of our knowledge, there is no literature extending common regularization methods to RBMs and unsupervised deep neural networks; for instance, the adaptive regularization and DropConnect mentioned above. Therefore, below we discuss their implementations and examine their empirical performance. In addition to studying model convergence and likelihood bounds, we propose partial Dropout/DropConnect, which iteratively drops a subset of nodes or edges based on a given calibrated model, therefore improving robustness in many situations.
3 RBM Regularization
For a restricted Boltzmann machine, we assume that $v = (v_1, \ldots, v_{n_v})$ denotes the visible vector, and $h = (h_1, \ldots, h_{n_h})$ denotes the hidden vector. Each $v_i$, $i = 1, \ldots, n_v$, is a visible node and each $h_j$, $j = 1, \ldots, n_h$, is a hidden node. The joint probability is
$$P(v, h) = \frac{1}{Z}\exp\!\big(b^\top v + c^\top h + v^\top W h\big), \qquad (6)$$
where $Z$ is the partition function. We let the parameters $\theta = (W, b, c)$, which is a vector containing all components of $W$, $b$, and $c$. To calibrate the model is to find the maximum likelihood estimate $\hat{\theta}$.
An RBM is a neural network because we have the following conditional probabilities,
$$P(h_j = 1 \mid v) = \sigma\big(c_j + v^\top W_{\cdot j}\big), \qquad P(v_i = 1 \mid h) = \sigma\big(b_i + W_{i \cdot}\, h\big), \qquad (7)$$
where $\sigma(\cdot)$ is the sigmoid function, and $W_{i\cdot}$ and $W_{\cdot j}$ represent, respectively, the $i$-th row and $j$-th column of $W$. The gradient descent algorithm is applied to calibration. The gradient of the log-likelihood can be expressed in the following form,
$$\frac{\partial \log P(v)}{\partial \theta} = -\frac{\partial F(v)}{\partial \theta} + \sum_{\tilde{v}} P(\tilde{v})\, \frac{\partial F(\tilde{v})}{\partial \theta}, \qquad (8)$$
where $F(v) = -\log \sum_h \exp\!\big(b^\top v + c^\top h + v^\top W h\big)$ is the free energy. The right-hand side of (8) is approximated by contrastive divergence with $k$ steps of Gibbs sampling (CD-$k$) [Salakhutdinov, Mnih, and Hinton2007].
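As an illustration, a minimal numpy sketch of a single-example CD-$k$ update for a Bernoulli RBM might read as follows; the function and variable names are our own, and this is a sketch rather than the exact implementation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(v0, W, b, c, k=1, lr=0.01):
    """One CD-k update for a Bernoulli RBM, approximating eq. (8).

    v0 is one binary data example; W has shape (n_v, n_h)."""
    ph0 = sigmoid(c + v0 @ W)          # positive phase, eq. (7)
    v, ph = v0, ph0
    for _ in range(k):                 # k steps of Gibbs sampling
        h = rng.binomial(1, ph)        # sample hidden states
        pv = sigmoid(b + h @ W.T)
        v = rng.binomial(1, pv)        # sample visible states
        ph = sigmoid(c + v @ W)        # negative-phase hidden probabilities
    W += lr * (np.outer(v0, ph0) - np.outer(v, ph))   # data - model statistics
    b += lr * (v0 - v)
    c += lr * (ph0 - ph)
    return W, b, c
```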
3.1 Weight Decay Regularization
Weight decay, or $L_p$ regularization, adds the term $\lambda \|W\|_p^p$ to the negative log-likelihood of an RBM. The most commonly used is $p = 2$ (ridge regression), or $p = 1$ (LASSO). In all situations, we do not regularize the biases, for simplicity.
Here we consider a more general form. Suppose we have a trained set of weights $W^0$ from CD with no regularization. Instead of adding the term $\lambda \sum_{i,j} |w_{ij}|$, we add the term $\lambda \sum_{i,j} |w_{ij}| / |w^0_{ij}|$ to the negative log-likelihood. Apparently this adjusts for the different scales of the components of $W$. We refer to this approach as adaptive $L_1$. We note that adaptive $L_1$ is the adaptive LASSO [Zou2006], and adaptive $L_1$ plus $L_2$ is the elastic-net [Zou and Hastie2005]. We consider the performance of $L_2$ regularization plus adaptive $L_1$ regularization ($L_2 + aL_1$) below.
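A sketch of the extra gradient term induced by the adaptive $L_1$ penalty, under the notation above (the small constant `eps` is our own safeguard against zero-valued $w^0_{ij}$):

```python
import numpy as np

def adaptive_l1_grad(W, W0, lam, eps=1e-8):
    """Subgradient of the adaptive L1 term lam * sum_ij |w_ij| / |w0_ij|.

    W0 holds the unregularized CD-trained weights; a larger |w0_ij|
    implies a weaker penalty on w_ij, adjusting for component scales."""
    return lam * np.sign(W) / (np.abs(W0) + eps)

# Inside a CD update this term is simply added to the gradient of the
# negative log-likelihood, e.g. W -= lr * adaptive_l1_grad(W, W0, lam).
```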
3.2 Model Averaging Regularization
As discussed in SH14 (2014), to characterize a Dropout (DO) RBM, we simply need to apply the following conditional distributions,
$$P(h_j = 1 \mid v, m) = m_j\, \sigma\big(c_j + v^\top W_{\cdot j}\big). \qquad (9)$$
Therefore, given a fixed mask $m$, we actually obtain an RBM with all visible nodes and the hidden nodes $\{h_j : m_j = 1\}$. The hidden nodes $\{h_j : m_j = 0\}$ are fixed to zero, so they have no influence on the conditional RBM. Apart from replacing (7) with (9), the only other change needed is to replace the free energy $F(v)$ with that of the conditional RBM on the retained hidden nodes. In terms of training, we suggest sampling a different mask per data example and per iteration as in SH14 (2014).
A DropConnect (DC) RBM is closely related; given a mask $M$ on the weights, $W$ in a plain RBM is replaced by $M \odot W$ everywhere. We suggest sampling a different mask per mini-batch, since a mask on $W$ is usually much larger than a mask in a Dropout RBM.
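The two masked conditionals can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_rbm_hidden(v, W, c, m):
    """Hidden activations of a Dropout RBM, eq. (9): hidden nodes with
    m_j = 0 are pinned to zero and do not influence the conditional RBM."""
    return m * sigmoid(c + v @ W)

def dropconnect_rbm_hidden(v, W, c, M):
    """Hidden activations of a DropConnect RBM: W is replaced by
    M * W everywhere it appears."""
    return sigmoid(c + v @ (M * W))
```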
3.3 Network Pruning Regularization
There are typically many nodes or weights which are of little importance in a neural network. In network pruning, such unimportant nodes or weights are discarded, and the neural network is retrained. This process can be conducted iteratively [Reed1993]. Now we consider two variants of network pruning for RBMs. For a trained set of weights $W^0$ with no regularization, we consider implementing a fixed mask $M$ where
$$M_{ij} = \mathbf{1}\{|w^0_{ij}| > q\}, \qquad (10)$$
i.e. $q$ is the $(1-r)$-th left percentile of all $|w^0_{ij}|$, and $r$ is some fixed proportion of retained weights. We then recalibrate the weights and biases fixing the mask $M$, leading to a simple network pruning (SNP) procedure which deletes the proportion $1 - r$ of all weights. We may also consider deleting $(1-r)/K$ of all weights at a time, and conduct the above process $K$ times, leading to an iterative network pruning (INP) procedure.
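Both variants reduce to a percentile mask, sketched below under the magnitude-based ranking of (10); `inp_schedule` lists the cumulative retained proportions over the $K$ INP rounds (both helper names are ours):

```python
import numpy as np

def pruning_mask(W0, retain):
    """Fixed mask of eq. (10): keep the proportion `retain` of weights
    with the largest magnitudes and delete the rest."""
    threshold = np.percentile(np.abs(W0), 100.0 * (1.0 - retain))
    return (np.abs(W0) > threshold).astype(W0.dtype)

def inp_schedule(retain, K):
    """Cumulative retained proportions for iterative network pruning:
    delete (1 - retain) / K of all weights per round, over K rounds."""
    return [1.0 - (1.0 - retain) * (i + 1) / K for i in range(K)]
```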
3.4 Hybrid Regularization
We may consider combining some of the above approaches. For instance, SH14 (2014) considered a combination of $L_2$ and Dropout. We introduce two new hybrid approaches, namely partial DropConnect (PDC), presented in Algorithm 1, and partial Dropout (PDO), which generalize DropConnect and Dropout and borrow from network pruning. The rationale comes from some of the model convergence results exhibited later.
As before, suppose we have a trained set of weights $W^0$ with no regularization. Instead of implementing a fixed mask $M$, we perform DropConnect regularization with different retaining probabilities for each weight,
$$P(M_{ij} = 1) = p_{ij}, \qquad p_{ij} = \begin{cases} 1, & |w^0_{ij}| > q, \\ p, & \text{otherwise}, \end{cases} \qquad (11)$$
where $q$ is the $(1-r)$-th left percentile of all $|w^0_{ij}|$. Therefore, we sample a different mask $M$ per mini-batch, which means that we always keep the proportion $r$ of all the weights, and randomly drop the remaining weights with probability $1 - p$. The mask can be resampled iteratively. Intuitively, we are trying to maximize the following,
$$\max_{\theta,\, p_{ij}} \;\; \mathbb{E}_M\!\left[\log L(M \odot \theta)\right] \qquad (12)$$
such that $p_{ij} \in \{p, 1\}$, and $\#\{(i,j) : p_{ij} = 1\} = r\, n_v n_h$.
Algorithm 1. (Partial DropConnect)
Step 1. Initialize $\theta^0$, the unregularized trained parameters for an RBM.
Step 2. Find the retaining rates $p_{ij}$ from (11).
Step 3. Retrain the weights with DropConnect for a given number of iterations, and then update $\theta^0$.
Step 4. If the maximum number of iterations is reached, stop and obtain $\hat{\theta}$; otherwise, go back to Step 2.
This technique is proposed because we hypothesize that some weights could be more important than others a posteriori, so dropping them could cause much variation among the models being averaged. From (11), in partial DropConnect, we tend to drop only weights of smaller magnitude, since setting larger weights to zero may substantially alter the structure of a neural network. Experiments on real data show that this technique can effectively improve the performance of plain DropConnect.
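A minimal sketch of the mask logic behind Algorithm 1 follows; the helper names and the outline comment are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def pdc_retaining_rates(W0, r, p):
    """Retaining rates of eq. (11): always keep the proportion r of
    weights with the largest magnitudes; keep the rest with probability p."""
    q = np.percentile(np.abs(W0), 100.0 * (1.0 - r))
    return np.where(np.abs(W0) > q, 1.0, p)

def sample_pdc_mask(rates):
    """Sample one DropConnect mask per mini-batch from the rates."""
    return rng.binomial(1, rates)

# Algorithm 1 in outline: starting from unregularized weights, alternate
# (Step 2) recomputing `rates` from the current weights with
# pdc_retaining_rates, and (Step 3) retraining with DropConnect under
# masks drawn by sample_pdc_mask.
```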
We denote $\ell(\theta) = \log L(\theta)$, and $g(\theta) = \mathbb{E}_M[\ell(M \odot \theta)]$. From a first-order Taylor expansion,
$$|\ell(\theta) - g(\theta)| = \big|\mathbb{E}_M\big[\ell(\theta) - \ell(M \odot \theta)\big]\big| = \big|\mathbb{E}_M\big[\nabla \ell(\xi)^\top \big((\mathbf{1} - M) \odot \theta\big)\big]\big| \le C\, \mathbb{E}_M\big\|(\mathbf{1} - M) \odot \theta\big\|. \qquad (13)$$
Here $\xi$ lies between $M \odot \theta$ and $\theta$ from the Taylor expansion, and $C$ is a Lipschitz constant.
Note that given $r$ and $p$, Step 2 in Algorithm 1 lowers the term $\mathbb{E}_M\|(\mathbf{1} - M) \odot \theta\|$ by assigning the dropping probability only to weights of smaller magnitude, reducing an upper bound of $|\ell(\theta) - g(\theta)|$. Step 3 further increases $g(\theta)$ and reduces the gap $|\ell(\theta) - g(\theta)|$. Therefore, each iteration of Algorithm 1 tends to increase $\ell(\theta)$, and hence Algorithm 1 provides an intuitive solution to problem (12).
We also consider a partial Dropout approach which is analogous to partial DropConnect and keeps some important nodes rather than weights. We set retaining rates for the hidden nodes $h_j$, $j = 1, \ldots, n_h$, where
$$p_j = \begin{cases} 1, & \|W^0_{\cdot j}\| > q, \\ p, & \text{otherwise}, \end{cases} \qquad (14)$$
and $q$ is the $(1-r)$-th left percentile of the column norms $\|W^0_{\cdot j}\|$. This algorithm protects the more important hidden nodes from being dropped in order to reduce variation. We also evaluate its empirical performance later.
4 More Theoretical Considerations
Here we discuss the model convergence properties of different regularization methods when the number of data examples $N \to \infty$. We mark all regularization coefficients and parameter estimates with the subscript $N$ when there are $N$ data examples. We assume $\theta \in \Theta$, which is compact, that the maximum likelihood estimate is unique for each $N$, and that the data examples are i.i.d. generated from an RBM with a "true" set of parameters $\theta^*$. We denote each regularized calibrated set of parameters as $\hat{\theta}_N$.
Let $\mathcal{A} = \{j : \theta^*_j \neq 0\}$ and $\mathcal{A}_N = \{j : \hat{\theta}_{N,j} \neq 0\}$. [Zou2006] showed that a suitable growth rate of the adaptive LASSO coefficients guarantees asymptotic normality and identification of the set $\mathcal{A}$ for linear regression. We demonstrate that similar results hold for $L_2$ plus adaptive $L_1$ regularization for RBMs. We let $\lambda_N$ and $\mu_N$ be the $L_2$ and adaptive $L_1$ regularization coefficients for each component. The proofs of all propositions and corollaries below are in the supplementary material [Wang and Klabjan2016].
Proposition 1. (a) If $\lambda_N / N \to 0$ and $\mu_N / N \to 0$ as $N \to \infty$, then the estimate $\hat{\theta}_N \to \theta^*$; (b) if also $\lambda_N / \sqrt{N} \to 0$, $\mu_N / \sqrt{N} \to 0$, and $\mu_N \to \infty$, then $\sqrt{N}\,\big(\hat{\theta}_{N,\mathcal{A}} - \theta^*_{\mathcal{A}}\big) \to_d N\big(0, I(\theta^*_{\mathcal{A}})^{-1}\big)$, where $I$ is the Fisher information matrix; moreover, $\lim_{N \to \infty} P(\mathcal{A}_N = \mathcal{A}) = 1$, where $\mathcal{A} = \{j : \theta^*_j \neq 0\}$.
For Dropout and DropConnect RBMs, we also assume that the data is generated from a plain RBM structure. We assume the retaining rates $p_N$ are of the matrix form in (11) for DropConnect and of the vector form in (14) for Dropout, therefore covering the cases of both original and partial Dropout/DropConnect with a fixed set of dropping rates. With dropping rates decreasing in $N$, we obtain the following convergence result.
Proposition 2. If the dropping rates $1 - p_N \to 0$ as $N \to \infty$, then $\hat{\theta}_N \to \theta^*$.
For network pruning, we show that as the number of data examples increases, if the retained proportion of parameters can cover all nonzero components of $\theta^*$, we will not miss any important component.
Proposition 3. Assume the retained proportion $r$ is at least the proportion of nonzero components of $\theta^*$. Then for simple network pruning, as $N \to \infty$, (a) $\hat{\theta}_N \to \theta^*$; (b) for sufficiently large $N$, there exists a mask $M_N$ from (10) such that $\mathcal{A} \subseteq \{j : M_{N,j} = 1\}$.
Corollary 1. The above results also hold for iterative network pruning.
We note that for all regularization methods, under the above conditions, the calibrated weights converge to the "true" set of parameters $\theta^*$, which indicates consistency. Also, adding adaptive $L_1$ regularization guarantees that we can identify components of zero value with infinitely many examples. The major benefits of Dropout come from the facts that it makes $L_2$ regularization adaptive, and also encourages more confident prediction of the outcomes [Wager, Wang, and Liang2013]. We propose partial DropConnect also based on Proposition 3, i.e. we do not drop the more important components of $\theta$, therefore possibly reducing the variation caused by dropping influential weights. Partial Dropout follows from the same reasoning.
5 Extension to Other Networks
5.1 Deep Belief Networks
We consider the multilayer network below,
$$P(v, h^1, \ldots, h^L) = P(v \mid h^1)\, P(h^1 \mid h^2) \cdots P(h^{L-2} \mid h^{L-1})\, P(h^{L-1}, h^L), \qquad (15)$$
where each probability on the right-hand side is from an RBM. To train the weights of $(v, h^1)$, $(h^1, h^2)$, $\ldots$, we only need to carry out a greedy layer-wise training approach, i.e. we first train the weights of $(v, h^1)$, and then use $h^1$ to train $(h^1, h^2)$, etc. The weights of the RBMs are used to initialize a deep FFNN which is finetuned with gradient descent. RBM regularization is applicable to each layer of a DBN.
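A hedged outline of this greedy scheme, with `train_rbm` standing in for any (regularized) RBM trainer from Section 3:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(data, layer_sizes, train_rbm):
    """Greedy layer-wise training for eq. (15): fit one RBM per layer and
    feed its mean hidden activations to the next layer. `train_rbm` is a
    placeholder for any regularized RBM trainer returning (W, b, c)."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        weights.append((W, b, c))
        layer_input = sigmoid(c + layer_input @ W)
    return weights  # used to initialize a deep FFNN for finetuning
```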
Here we show that adding layers to a Dropout/DropConnect DBN improves the likelihood, given symmetry of the weights of two adjacent layers. Similar results for a plain DBN are in HO06 (2006) and B07 (2007). We demonstrate this by using likelihood bounds.
We let $\mathcal{N}_L$ denote an $L$-layer DBN and $\mathcal{N}_{L+1}$ denote an $(L+1)$-layer DBN with the first $L$ layers being the same as in $\mathcal{N}_L$. For a data example of a visible vector $v$, the log-likelihood is bounded as follows,
$$\log P(v) \geq \sum_{h^1} Q(h^1 \mid v)\,\big[\log P(h^1) + \log P(v \mid h^1)\big] + H\big(Q(h^1 \mid v)\big). \qquad (16)$$
Here, $H$ is the entropy function, $Q$ is the approximate posterior, and the derivation is analogous to Section 11 in B07 (2007). One mask is for $\mathcal{N}_L$, and another mask is for the new $(L+1)$-th layer. Note that after we have trained the first $L$ layers, and initialized the $(L+1)$-th layer symmetric to the $L$-th layer, assuming a constant dropping probability, we have that the prior term $\log P(h^1)$ under $\mathcal{N}_{L+1}$ equals that under $\mathcal{N}_L$,
so $\mathcal{N}_{L+1}$ has the same log-likelihood bound as $\mathcal{N}_L$. Training the $(L+1)$-th layer, the bound (16) is guaranteed to increase, and therefore the likelihood of $\mathcal{N}_{L+1}$ is expected to improve. As a result, for regularized unsupervised deep neural nets, adding layers also tends to elevate the explanatory power of the network. Adding nodes has the same effect, providing a rationale for deep and large-scale networks. We present the following proposition.
Proposition 4. Adding nodes or layers (preserving weight symmetry) to a Dropout/DropConnect DBN continually improves the likelihood bound; also, adding layers of the same size continually improves the likelihood bound.
5.2 Other RBM Variants
More descriptions of DBMs, RSMs, and Gaussian RBMs are in the supplementary material [Wang and Klabjan2016]. RBM regularization can be extended to all these situations.
6 Data Studies
In this section, we compare the empirical performance of the aforementioned regularization methods on the following data sets: MNIST, NORB (image recognition); 20 Newsgroups, Reuters21578 (text classification); ISOLET (speech recognition). All results are obtained using a GeForce GTX TITAN X GPU in Theano.
6.1 Experiment Settings
We consider the following unsupervised neural network structures: DBN/DBM for MNIST; DBN for NORB; RSM plus logistic regression for 20 Newsgroups and Reuters21578; GRBM for ISOLET. CD-1 is performed for the rest of the paper. The following regularization methods are considered: None (no regularization); DO; DC; $L_2$; $L_2 + aL_1$; SNP; INP; PDO; PDC. We fix the number of pretraining epochs per layer and the number of finetuning epochs, as well as the finetuning learning rate. For $L_2 + aL_1$, SNP, and INP, which need re-calibration, we cut the epochs into two halves (into quarters for INP). For the regularization parameters, we search over fixed ranges: dropping rates for DO/DC/SNP/INP; weight decay coefficients for $L_2$ and $L_2 + aL_1$, similar to H10 (2010); paired retaining rates for PDO/PDC. We only make one update to the "partial" dropping rates to maintain simplicity. From the results, we note that unsupervised neural networks tend to need less regularization than FFNNs. We choose the best iteration and regularization parameters over a fixed set of parameter values according to the validation error rates.
6.2 The MNIST Data Set
The MNIST data set consists of $28 \times 28$-pixel images of handwritten digits 0-9, split into training, validation, and testing examples. We first consider the likelihood of the testing data of an RBM for MNIST. There are two model fitting evaluation criteria: the pseudo-likelihood and the AIS-likelihood [Salakhutdinov and Murray2008]. The former is a sum of conditional likelihoods, while the latter directly estimates the partition function with annealed importance sampling (AIS).
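For a Bernoulli RBM, the pseudo-likelihood can be computed from free-energy differences; below is a minimal sketch of our own (the exact evaluation protocol, e.g. stochastic approximation over bits, may differ).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(v, W, b, c):
    """Free energy F(v) of a Bernoulli RBM."""
    return -v @ b - np.sum(np.logaddexp(0.0, c + v @ W))

def pseudo_log_likelihood(v, W, b, c):
    """Sum over i of log P(v_i | v_{-i}), via free-energy differences
    between v and v with bit i flipped."""
    total = 0.0
    for i in range(len(v)):
        v_flip = v.copy()
        v_flip[i] = 1 - v_flip[i]
        # P(v_i | v_{-i}) = sigma(F(v_flip) - F(v))
        total += np.log(sigmoid(free_energy(v_flip, W, b, c)
                                - free_energy(v, W, b, c)))
    return total
```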
In Figure 1 below, using log-scale, we plot the model fitting process with no regularization, with DO, and with $L_2$, at representative parameter values. These figures tend to be representative of the model fitting process. The pseudo-likelihood is a more optimistic estimate of the model fitting. We observe that Dropout outperforms the other two after enough epochs, and $L_2$ regularization does not improve the pseudo-likelihood. In terms of the AIS-likelihood, which is a much more conservative estimate of the model fitting, the fitting process seems to have three stages: (1) initial fitting; (2) "overfitting"; (3) re-fitting. We observe that $L_2$ improves the likelihood significantly, while Dropout catches up in later epochs. Therefore, Dropout tends to improve model fitting according to both likelihood criteria.
In Figure 2, we can observe that more hidden nodes increase the pseudo-likelihood, which is consistent with Proposition 4, but exhibit "overfitting" in terms of the AIS-likelihood. However, such "overfitting" is not harmful for pretraining purposes. Thus we suggest relying on the pseudo-likelihood, and viewing the AIS-likelihood as too conservative.
Classification error rates tend to be a more practical measure. We first consider a DBN with several hidden layers of equal size for MNIST; see Table 1. We tried DBNs of different depths and found the aforementioned structure to perform best with None as the baseline. The same was done for all other structures. We calculate the means of the classification errors for each regularization method averaged over random replicates, together with their standard deviations. In each table, we stress in bold the top performers, with ties broken by the standard deviation. We note that most of the regularization methods tend to improve the classification error rates, with DC and PDO yielding slightly higher error rates than no regularization.
In Table 2, we consider a three-hidden-layer DBM. For simplicity, we only classify based on the original features. The pretraining learning rate and batch size are fixed.
It can be observed that regularization tends to yield more improvement for the DBM than for the DBN, possibly because a DBM doubles both the visible layer and the third hidden layer, resulting in a "larger" neural network structure in general. Only INP proves to be unsuitable for the DBM; all other regularization methods work better, with PDC being the best.
6.3 The NORB Data Set
The NORB data set has five categories of images of 3D objects, with training, validation, and testing splits. We follow the preprocessing of NH09 (2009), and apply a sparse two-hidden-layer DBN as in LE07 (2007), with a fixed sparsity regularization coefficient and the first hidden layer being a Gaussian RBM. Separate pretraining learning rates are used for the first and second hidden layers, and separate batch sizes for pretraining and finetuning. Because the validation error often goes to zero, we choose a fixed epoch and fix the regularization parameters based on the best values from the other data sets. In Table 3, only weight decay and PDO/PDC perform better than None, with PDC again being the best.
6.4 The 20 Newsgroups Data Set
The 20 Newsgroups data set is a collection of news documents in 20 categories, with training and testing splits; validation examples are randomly held out from the training set. We adopt the stemmed version, retain the most common words, and train an RSM with a single layer of hidden nodes. We consider this a simple case of deep learning, since it is a two-step procedure. The pretraining learning rate and batch size are fixed. We apply logistic regression to classify the trained features, i.e. the hidden values of the RSM, as in SS13 (2013). This setting is quite challenging for unsupervised neural networks. In Table 4, Dropout performs best, with the other regularization methods yielding improvements except DropConnect.
6.5 The Reuters21578 Data Set
The Reuters21578 data set is a collection of newswire articles. We adopt the stemmed R-52 version, which has 52 categories, with training and testing splits; validation examples are randomly held out from the training set. We retain the most common words, and train an RSM with a single layer of hidden nodes. The pretraining learning rate and batch size are fixed; we make the learning rate relatively large because the cost function is quite bumpy. From Table 5, we note that PDC works best, and PDO improves the performance of Dropout.
6.6 The ISOLET Data Set
The ISOLET data set consists of voice recordings of the Latin alphabet (a-z), with training and testing splits; validation examples are randomly held out from the training set. We train a single-hidden-layer Gaussian RBM and use it to initialize an FFNN, which can be viewed as a single-hidden-layer DBN. From Table 6, it is evident that all regularization methods work better than None, with PDC again being the best.
From the above results, we observe that regularization does improve the structure of unsupervised deep neural networks and yields lower classification error rates for each data set studied herein. The most robust methods, which yield improvements for all six instances, are the two weight decay methods and PDC. SNP is also acceptable, and preferable over INP. PDO can improve upon Dropout when Dropout alone is unsuitable for the network structure. PDC turns out to be the most stable method of all, and thus the recommended choice.
7 Conclusion
Regularization for deep learning has aroused much interest, and in this paper, we extend regularization to unsupervised deep learning, i.e. to DBNs and DBMs. We have proposed several approaches, demonstrated their performance, and empirically compared the different techniques. For future work, it would be of interest to consider more variants of model averaging regularization for supervised deep learning as well as novel methods of unsupervised learning; for instance, KW14 (2014) provided an interesting variational Bayesian auto-encoder approach.
References
- [Ba and Frey2013] Ba, L., and Frey, B. 2013. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems 26. MIT Press.
- [Baldi and Sadowski2013] Baldi, P., and Sadowski, P. 2013. Understanding dropout. In Advances in Neural Information Processing Systems 26. MIT Press.
- [Bengio2007] Bengio, Y. 2007. Learning deep architectures for AI. https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf.
- [Cho, Ilin, and Raiko2012] Cho, K.; Ilin, A.; and Raiko, T. 2012. Tikhonov-type regularization for restricted Boltzmann machines. In 22nd International Conference on Artificial Neural Networks.
- [Goh et al.2013] Goh, H.; Thome, N.; Cord, M.; and Lim, J. 2013. Top-down regularization of deep belief networks. In Advances in Neural Information Processing Systems 26. MIT Press.
- [Hinton, Osindero, and Teh2006] Hinton, G.; Osindero, S.; and Teh, Y. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18:1527–1554.
- [Hinton2010] Hinton, G. 2010. A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, University of Toronto.
- [Kang, Li, and Tao2016] Kang, G.; Li, J.; and Tao, D. 2016. Shakeout: A new regularized deep neural network training scheme. In 30th AAAI Conference on Artificial Intelligence.
- [Kingma and Welling2014] Kingma, D., and Welling, M. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.
- [Kingma, Salimans, and Welling2015] Kingma, D.; Salimans, T.; and Welling, M. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28.
- [Lee, Ekanadham, and Ng2007] Lee, H.; Ekanadham, C.; and Ng, A. 2007. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems 20. MIT Press.
- [Nair and Hinton2009] Nair, V., and Hinton, G. 2009. 3D object recognition with deep belief nets. In Advances in Neural Information Processing Systems 22. MIT Press.
- [Reed1993] Reed, R. 1993. Pruning algorithms: a survey. IEEE Transactions on Neural Networks 4:740–747.
- [Salakhutdinov and Hinton2009a] Salakhutdinov, R., and Hinton, G. 2009a. Deep Boltzmann machines. In 12th International Conference on Artificial Intelligence and Statistics.
- [Salakhutdinov and Hinton2009b] Salakhutdinov, R., and Hinton, G. 2009b. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems 22. MIT Press.
- [Salakhutdinov and Murray2008] Salakhutdinov, R., and Murray, I. 2008. On the quantitative analysis of deep belief networks. In 25th International Conference on Machine Learning.
- [Salakhutdinov, Mnih, and Hinton2007] Salakhutdinov, R.; Mnih, A.; and Hinton, G. 2007. Restricted Boltzmann machines for collaborative filtering. In 24th International Conference on Machine Learning.
- [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.
- [Srivastava, Salakhutdinov, and Hinton2013] Srivastava, N.; Salakhutdinov, R.; and Hinton, G. 2013. Modeling documents with a deep Boltzmann machine. In 29th Conference on Uncertainty in Artificial Intelligence.
- [Tieleman2008] Tieleman, T. 2008. Training restricted Boltzmann machines using approximations to the likelihood gradient. In 25th International Conference on Machine Learning.
- [Wager, Wang, and Liang2013] Wager, S.; Wang, S.; and Liang, P. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26. MIT Press.
- [Wan et al.2013] Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In 30th International Conference on Machine Learning.
- [Wang and Klabjan2016] Wang, B., and Klabjan, D. 2016. Supplementary material for “regularization for unsupervised deep neural nets”. http://www.dynresmanagement.com/publications.html.
- [Wang and Manning2013] Wang, S., and Manning, C. 2013. Fast dropout training. In 30th International Conference on Machine Learning.
- [Witten, Frank, and Hall2011] Witten, I.; Frank, E.; and Hall, M. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Burlington, Massachusetts, USA: Morgan Kaufmann Publishers.
- [Zou and Hastie2005] Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67:301–320.
- [Zou2006] Zou, H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101:1418–1429.