1 Introduction
Dropout [1, 2] is a well-known regularization method that has been used very successfully for feedforward neural networks. Training large models with strong regularization from dropout provides state-of-the-art performance on numerous tasks (e.g., [3, 4]). The method inserts a “dropout” layer which stochastically zeroes individual activations in the previous layer with probability $1-\pi$ during training, where $\pi$ is the retention probability. At test time, these stochastic layers are deterministically approximated by rescaling the output of the previous layer by $\pi$ to account for the stochastic dropout during training. Empirically, dropout is most effective when applied to large models that would otherwise overfit [2]. Although training large models is usually not an issue anymore with multi-GPU training (e.g., [5, 6]), model size must be restricted in many applications to ensure efficient test time evaluation.
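The train-time and test-time behavior of a dropout layer described above can be illustrated with a minimal sketch. The function names, the toy activations, and the retention probability of 0.5 are illustrative assumptions, not taken from the paper.

```python
import random

def dropout_train(activations, retain_prob, rng):
    """Zero each activation independently with probability 1 - retain_prob."""
    return [a if rng.random() < retain_prob else 0.0 for a in activations]

def dropout_test(activations, retain_prob):
    """Deterministic test-time approximation: scale by the retention probability."""
    return [a * retain_prob for a in activations]

rng = random.Random(0)
hidden = [0.2, 1.5, 0.7, 3.0]   # hypothetical activations of one layer
retain_prob = 0.5

# Averaging many stochastic passes approaches the rescaled activations.
num_samples = 20000
mean = [0.0] * len(hidden)
for _ in range(num_samples):
    for i, value in enumerate(dropout_train(hidden, retain_prob, rng)):
        mean[i] += value / num_samples

scaled = dropout_test(hidden, retain_prob)
print(mean)
print(scaled)
```

For a single linear layer the rescaled output equals the average over masks; for deeper nonlinear networks it is only an approximation.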
In this paper, we propose a novel form of dropout training that provides the performance benefits of dropout training on a large model while producing a compact model for deployment. Our method is inspired by annealed dropout training [7], where the retention probability is slowly annealed to $1$ over multiple epochs. For our method, we introduce an independent retention probability parameter for each hidden unit with a bimodal prior distribution sharply peaked at $0$ and $1$ to encourage the posterior retention probability to converge to either $0$ or $1$. Similarly to annealed dropout, some units will converge to never being dropped; however, unlike annealed dropout, some units will converge to always being dropped. These units can be removed from the network without accuracy degradation. The annealing schedule and compaction rate of the resulting network can be controlled by changing the hyperparameters of the prior distribution.
In general, model compaction has been well investigated. Conventionally, this problem is addressed by sparsifying the weight matrices of the neural network. For example, [8] introduced a pruning criterion based on the second-order derivative of the objective function. L1 regularization is also widely used to obtain sparse weight matrices. However, these methods only achieve weight-level sparsity. Due to the relative efficiency of dense matrix multiplication compared to sparse matrix multiplication, these approaches do not improve test time efficiency without degrading accuracy. In contrast, our proposed method directly reduces the dimension of the weight matrices.
In another approach, singular value decomposition (SVD) is used to obtain approximate low-rank representations of the weight matrices [9]. SVD can directly reduce the dimensionality of internal representations; however, it is typically implemented as a linear bottleneck layer, which introduces additional parameters. This inefficiency requires additional compression, which degrades performance. For example, to halve the number of parameters, the SVD-compacted model would have to restrict the internal dimensionality to 25% of the original. Knowledge distillation can also be used to transfer knowledge from larger models to a small model [10]. However, this requires two separate optimization steps, which makes it difficult to apply directly to existing optimization configurations. [11] introduced additional multiplicative parameters on the output of the hidden layers and regularized these parameters with an L1 penalty. This method is conceptually similar to dropout compaction in that both methods introduce regularization to reduce the dimensionality of internal representations. Dropout compaction can be seen as an extension of this method that replaces L1 regularization with dropout-based regularization.
Similar to the SVD-based compaction technique, there are several approaches for achieving faster evaluation of neural networks by assuming a certain structure in the weight matrices. For example, [12] introduces Toeplitz weight matrices as a building block of neural networks. [13] defines a structured matrix via the discrete cosine transform (DCT) to enable fast matrix multiplication via a fast Fourier transform algorithm. These methods successfully reduce the computational cost of neural network prediction; however, since they restrict the parameter space prior to training, they may restrict the flexibility of neural networks. Our proposed method attempts to jointly optimize the structure of the neural network and its parameters. This way, the optimization algorithm can determine an effective low-dimensional representation of the hidden layers.
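The parameter-count argument for SVD-based compaction made earlier in this section can be written out as a short calculation. As an illustrative assumption (the symbols $n$ and $r$ are not from the paper), consider a square $n \times n$ weight matrix replaced by a rank-$r$ factorization, i.e., a linear bottleneck of width $r$. Halving the number of parameters then requires:

```latex
\underbrace{n r + r n}_{\text{two factor matrices}}
  \;=\; \tfrac{1}{2}\,\underbrace{n^{2}}_{\text{dense matrix}}
\quad\Longrightarrow\quad
r \;=\; \tfrac{n}{4}
```

Under this assumption the bottleneck retains only a quarter of the original internal dimensionality, whereas removing half of the units on both sides of a dense matrix already reduces it to a quarter of its parameters.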
Several approaches for reducing numerical precision to speed up training and evaluation have been proposed [14, 15, 16]. Most of these approaches are complementary to our method, since our proposed method only changes the dimensionality of the hidden layers while the structure of the network stays the same. In particular, we also apply a basic weight quantization technique in our experiments on automatic speech recognition tasks.
The remainder of this paper is organized as follows: In Section 2, we introduce a probabilistic formulation of dropout training and cast it as an ensemble learning method. Then, we derive a method for optimizing the dropout retention probabilities in Section 3. In Section 4, we present experimental results. In Section 5, we conclude the paper and suggest future extensions.
2 Conventional Dropout Training
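In this section, we describe conventional dropout training for feedforward neural networks. Hereafter, we denote the input vectors by $X = \{x_n\}_{n=1}^{N}$, the target labels by $T = \{t_n\}_{n=1}^{N}$, and the parameters of a neural network by $\Theta = \{W^{(\ell)} \in \mathbb{R}^{D_\ell \times D_{\ell-1}},\, b^{(\ell)} \in \mathbb{R}^{D_\ell}\}_{\ell=1}^{L}$, where $L$ is the number of layers and $D_\ell$ is the number of output units in the $\ell$-th layer.
2.1 Training
Dropout training can be viewed as the optimization of a lower bound on the log-likelihood of a probabilistic model. By introducing a set of random mask vectors $M = \{m^{(\ell)}\}_{\ell=1}^{L-1}$, which defines a subset of hidden units to be zeroed, the output of a dropout neural network can be expressed as follows:
$$h^{(\ell)}(x; M) = m^{(\ell)} \odot \sigma^{(\ell)}\left(W^{(\ell)} h^{(\ell-1)}(x; M) + b^{(\ell)}\right), \qquad p(t = k \mid x, M; \Theta) = \sigma_k^{(L)}\left(W^{(L)} h^{(L-1)}(x; M) + b^{(L)}\right) \tag{1}$$
where $h^{(0)}(x; M) = x$, $\odot$ denotes elementwise multiplication, $\sigma^{(\ell)}$ is the activation function (usually a sigmoid or rectifier function) of the $\ell$-th layer, and $\sigma_k^{(\ell)}$ denotes the $k$-th element in the vector function. The mask vectors are independent draws from a Bernoulli distribution (i.e., $m_j^{(\ell)} \sim \mathrm{Bernoulli}(\pi_j^{(\ell)})$) parameterized by retention probability hyperparameters, $\Pi = \{\pi_j^{(\ell)}\}$. In conventional dropout training [1, 2], all retention probabilities belonging to the same layer are tied (i.e., $\pi_j^{(\ell)} = \pi_{j'}^{(\ell)}$ for all possible $\ell$, $j$, and $j'$).
Optimizing the conditional log-likelihood of the target labels, $\log p(T \mid X; \Theta, \Pi)$, is intractable. However, a tractable lower bound can be used instead:
$$\log p(T \mid X; \Theta, \Pi) = \sum_{n=1}^{N} \log \sum_{M \in \mathcal{M}} p(M; \Pi)\, p(t_n \mid x_n, M; \Theta) \;\ge\; \sum_{n=1}^{N} \sum_{M \in \mathcal{M}} p(M; \Pi) \log p(t_n \mid x_n, M; \Theta) \tag{2}$$
where $\mathcal{M}$ is the set of all possible instances of $M$. A straightforward application of stochastic gradient descent (SGD) to this lower bound leads to conventional dropout training.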
2.2 Prediction
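At test time, it is not tractable to marginalize over all mask vectors as in Eq. (2). Instead, a crude (but fast) approximation is applied. The average over all network outputs with all possible mask vectors is replaced by the output of the network with the average mask, i.e., the vector $m^{(\ell)}$ in Eq. (1) is replaced by the expectation vector $\mathbb{E}[m^{(\ell)}] = \pi^{(\ell)}$, the vector of retention probabilities of layer $\ell$:
$$\bar{h}^{(\ell)}(x) = \pi^{(\ell)} \odot \sigma^{(\ell)}\left(W^{(\ell)} \bar{h}^{(\ell-1)}(x) + b^{(\ell)}\right) \tag{3}$$
In this way, we obtain an approximation of the predictive distribution $p(t \mid x; \Theta, \Pi)$, which we denote by $\tilde{p}(t \mid x; \Theta, \Pi)$. In practice, this approximation does not degrade prediction accuracy [2].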
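The three quantities discussed in this section (the exact marginal over masks, the lower bound used for training, and the expectation-scaling approximation used for prediction) can be compared on a toy network small enough to enumerate every mask. This is a sketch under stated assumptions: a single sigmoid hidden layer, no biases, and hypothetical weights and retention probabilities chosen only for illustration.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    peak = max(zs)
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, scale, w_hidden, w_out):
    """One hidden layer; `scale` multiplies the hidden activations elementwise."""
    hidden = [s * sigmoid(sum(w * xi for w, xi in zip(row, x)))
              for s, row in zip(scale, w_hidden)]
    return softmax([sum(w * h for w, h in zip(row, hidden)) for row in w_out])

# Hypothetical toy parameters: 3 inputs, 3 hidden units, 2 classes.
w_hidden = [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7], [-0.6, 0.9, 0.1]]
w_out = [[1.0, -2.0, 0.5], [-1.5, 0.7, 1.1]]
x, target = [1.0, 0.5, -1.0], 0
retain = [0.5, 0.8, 0.3]

marginal = 0.0      # sum over masks of p(M) * p(t | x, M)
lower_bound = 0.0   # sum over masks of p(M) * log p(t | x, M)
for mask in itertools.product([0.0, 1.0], repeat=len(retain)):
    p_mask = math.prod(r if m else 1.0 - r for m, r in zip(mask, retain))
    p_target = forward(x, mask, w_hidden, w_out)[target]
    marginal += p_mask * p_target
    lower_bound += p_mask * math.log(p_target)

# Expectation scaling: one pass with the mask replaced by its mean.
approx = forward(x, retain, w_hidden, w_out)[target]
print(math.log(marginal), lower_bound)
print(marginal, approx)
```

By Jensen's inequality the second printed value in the first line never exceeds the first. The enumeration costs $2^{D}$ forward passes for $D$ hidden units, which is why it is infeasible for realistic networks.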
3 Retention Probability Optimization
To leverage the retention probabilities as a unit pruning criterion, we propose to untie the retention probability parameters and put a bimodal prior on them. Then, we seek to optimize the joint log-likelihood $\log p(T, \Pi \mid X; \Theta)$. In the next subsection, we compute the parameter gradients. In the second subsection, we describe the control variates we used to reduce variance in the gradients. Next, we describe the prior we used to encourage the posterior retention probabilities to converge to $0$ or $1$. Finally, we summarize the algorithm.
3.1 Stochastic Gradient Estimation
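The joint log-likelihood is given by
$$\log p(T, \Pi \mid X; \Theta) = \log p(T \mid X; \Theta, \Pi) + \log p(\Pi) \tag{4}$$
where $p(\Pi)$ is the prior probability distribution of the retention probability parameters. The objective function with respect to the weight parameters $\Theta$ is unchanged (up to a constant), so we can follow the conventional dropout parameter updates for $\Theta$. The partial derivative of $\log p(T \mid X; \Theta, \Pi)$ with respect to the retention probability $\pi_j^{(\ell)}$ of the $j$-th unit in the $\ell$-th layer is:
$$\frac{\partial}{\partial \pi_j^{(\ell)}} \log p(T \mid X; \Theta, \Pi) = \sum_{n=1}^{N} \sum_{M \in \mathcal{M}} p(M; \Pi)\, w_n(M)\, \frac{\partial \log p(M; \Pi)}{\partial \pi_j^{(\ell)}} \tag{5}$$
where the weight function is:
$$w_n(M) = \frac{p(t_n \mid x_n, M; \Theta)}{\sum_{M' \in \mathcal{M}} p(M'; \Pi)\, p(t_n \mid x_n, M'; \Theta)} \tag{6}$$
Similarly to prediction in the conventional dropout method, computing the denominator in the weight function is intractable due to the summation over all binary mask vectors. Therefore, we employ the same approximation, i.e., the denominator in Eq. (6) is computed using Eq. (3). Hence, the weight function is approximated as:
$$w_n(M) \approx \tilde{w}_n(M) = \frac{p(t_n \mid x_n, M; \Theta)}{\tilde{p}(t_n \mid x_n; \Theta, \Pi)} \tag{7}$$
The approximated weight function is computed by two feed-forward passes: one with a stochastic binary mask and one with expectation scaling.
The partial derivatives of $\log p(M; \Pi)$ with respect to the retention probability parameters can be expressed as follows:
$$\frac{\partial \log p(M; \Pi)}{\partial \pi_j^{(\ell)}} = \frac{m_j^{(\ell)}}{\pi_j^{(\ell)}} - \frac{1 - m_j^{(\ell)}}{1 - \pi_j^{(\ell)}} = \frac{m_j^{(\ell)} - \pi_j^{(\ell)}}{\pi_j^{(\ell)}\left(1 - \pi_j^{(\ell)}\right)} \tag{8}$$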
3.2 Variance Reduction with Control Variates
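The standard SGD approach uses an unbiased Monte Carlo estimate of the true gradient vector. However, the gradients of $\log p(T \mid X; \Theta, \Pi)$ with respect to the retention probability parameters exhibit high variance. We can reduce the estimator’s variance using control variates, closely following [17, 18]. We exploit the fact that
$$\mathbb{E}_{M \sim p(M; \Pi)}\left[\frac{\partial \log p(M; \Pi)}{\partial \pi_j^{(\ell)}}\right] = 0 \tag{9}$$
cf. Eq. (8). This implies
$$\mathbb{E}_{M \sim p(M; \Pi)}\left[C\, \frac{\partial \log p(M; \Pi)}{\partial \pi_j^{(\ell)}}\right] = 0 \tag{10}$$
for any $C$ that does not depend on $M$. Thus, an unbiased estimator is given by:
$$\frac{\partial}{\partial \pi_j^{(\ell)}} \log p(T \mid X; \Theta, \Pi) \approx \frac{N}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} \left(w_n(M_n) - C\right) \frac{\partial \log p(M_n; \Pi)}{\partial \pi_j^{(\ell)}} \tag{11}$$
where $\mathcal{B}$ is a random minibatch of training data indices and $\{M_n\}_{n \in \mathcal{B}}$ is a set of mask vectors randomly drawn from $p(M; \Pi)$ for each element in $\mathcal{B}$. As before, we approximate $w_n$ by $\tilde{w}_n$ to make the computation tractable.
The optimal $C$ does not have a closed-form solution. However, a reasonable choice for $C$ is
$$C = \mathbb{E}_{M \sim p(M; \Pi)}\left[w_n(M)\right] = 1 \tag{12}$$
With this choice, we obtain an interpretable update rule: a training example only contributes to an update of the retention probabilities if the predictive distribution changes by applying a random dropout mask, i.e., if $p(t_n \mid x_n, M_n; \Theta) \neq \tilde{p}(t_n \mid x_n; \Theta, \Pi)$.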
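The effect of the control variate can be checked exactly on the smallest possible case: a single Bernoulli mask bit. The sketch below is an illustration under assumptions, not the paper's algorithm. It uses hypothetical weights `w_kept` and `w_dropped` constrained to have mean one under the mask distribution, and compares the estimator with no baseline against the estimator with a baseline of one.

```python
def score(mask_bit, retain):
    """Derivative of log Bernoulli(mask_bit; retain) with respect to retain."""
    return (mask_bit - retain) / (retain * (1.0 - retain))

def moments(retain, weight, baseline):
    """Exact mean and variance of (w(m) - baseline) * score(m), m ~ Bernoulli(retain)."""
    outcomes = [(retain, 1), (1.0 - retain, 0)]
    values = [(prob, (weight[m] - baseline) * score(m, retain)) for prob, m in outcomes]
    mean = sum(prob * value for prob, value in values)
    var = sum(prob * (value - mean) ** 2 for prob, value in values)
    return mean, var

retain = 0.3
w_kept = 1.5                                           # hypothetical weight when the unit is kept
w_dropped = (1.0 - retain * w_kept) / (1.0 - retain)   # enforces E[w] = 1
weight = {1: w_kept, 0: w_dropped}

mean_plain, var_plain = moments(retain, weight, baseline=0.0)
mean_cv, var_cv = moments(retain, weight, baseline=1.0)
print(mean_plain, var_plain)
print(mean_cv, var_cv)
```

Both estimators have the same mean, so subtracting the baseline introduces no bias, while the variance with the baseline is substantially lower in this example.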
3.3 Prior Distribution of Retention Probability
In order to encourage the posterior to place mass on compact models, a prior distribution that strongly prefers $\pi_j^{(\ell)} = 0$ or $\pi_j^{(\ell)} = 1$ is required. We use a powered beta distribution as the prior distribution:
$$p(\pi_j^{(\ell)}; \alpha, \beta, \gamma) \propto \left[\left(\pi_j^{(\ell)}\right)^{\alpha - 1} \left(1 - \pi_j^{(\ell)}\right)^{\beta - 1}\right]^{\gamma}$$
Computing the partition function is intractable when $\alpha < 1$ or $\beta < 1$. However, computing the partition function is not necessary for SGD-based optimization.
By setting $\alpha < 1$ and $\beta < 1$, the prior probability density goes to infinity as $\pi_j^{(\ell)}$ approaches $0$ and $1$, respectively. Thus, we can encourage the optimization result to converge to either $0$ or $1$. The exponent $\gamma$ is introduced in order to control the relative importance of the prior distribution. By setting $\gamma$ sufficiently large, we can ensure that the retention probabilities converge to either $0$ or $1$.
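Because only gradients are needed, the prior can be used without its normalizing constant. The sketch below assumes that the powered beta prior has an unnormalized log-density of the form $\gamma[(\alpha-1)\log\pi + (\beta-1)\log(1-\pi)]$; this form, the function names, and the hyperparameter values are assumptions made for illustration.

```python
import math

def log_prior_unnormalized(retain, alpha, beta, gamma):
    """Assumed unnormalized log-density of a powered beta prior."""
    return gamma * ((alpha - 1.0) * math.log(retain)
                    + (beta - 1.0) * math.log(1.0 - retain))

def log_prior_grad(retain, alpha, beta, gamma):
    """Derivative of the unnormalized log-density with respect to retain."""
    return gamma * ((alpha - 1.0) / retain - (beta - 1.0) / (1.0 - retain))

alpha, beta, gamma = 0.9, 0.9, 10.0   # hypothetical values with alpha, beta < 1
for retain in (0.1, 0.5, 0.9):
    print(retain, log_prior_grad(retain, alpha, beta, gamma))
```

With both shape parameters below one, the gradient is negative near zero and positive near one, so gradient ascent pushes a retention probability toward whichever end it is already closer to. A larger `gamma` scales this push relative to the likelihood term.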
3.4 Algorithm
Finally, the stochastic updates of the retention probabilities with control variates are summarized in Algorithm 1. In our experiments, we alternate optimization of the neural network parameters $\Theta$ and the retention probabilities $\Pi$. Specifically, updates computed with Algorithm 1 are applied after each epoch of conventional dropout training. Algorithm 2 shows the overall structure. After each epoch, we can remove hidden units with retention probability smaller than a threshold without degrading performance. Therefore, we can already benefit from compaction during the training phase.
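The pruning step mentioned above amounts to deleting, for every removed hidden unit, one row of the incoming weight matrix, one bias, and one column of the outgoing weight matrix. The following sketch shows this for a single hidden layer; the function name, the list-of-lists matrix layout, and the threshold value are assumptions for illustration and are not taken from Algorithm 1 or Algorithm 2.

```python
def prune_hidden_units(w_in, b_in, w_out, retain, threshold=0.05):
    """Drop hidden units whose retention probability is below `threshold`.

    w_in:   one row per hidden unit (incoming weights).
    b_in:   one bias per hidden unit.
    w_out:  one column per hidden unit (outgoing weights).
    retain: one retention probability per hidden unit.
    """
    keep = [j for j, r in enumerate(retain) if r >= threshold]
    new_w_in = [w_in[j] for j in keep]
    new_b_in = [b_in[j] for j in keep]
    new_w_out = [[row[j] for j in keep] for row in w_out]
    new_retain = [retain[j] for j in keep]
    return new_w_in, new_b_in, new_w_out, new_retain

# Hypothetical layer with 2 inputs, 3 hidden units, and 2 outputs.
w_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b_in = [0.0, 0.1, 0.2]
w_out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
retain = [1.0, 0.0, 0.98]

pruned = prune_hidden_units(w_in, b_in, w_out, retain)
print(pruned)
```

The result is a pair of smaller dense matrices, which is what distinguishes unit-level pruning from weight-level sparsity.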
4 Experiments
First, as a pilot study, we evaluated dropout compaction on the MNIST handwritten digit classification task. These experiments demonstrate the efficacy of our method on a widely used and publicly available dataset. Next, we conducted a systematic evaluation of dropout compaction on three real-world speech recognition tasks.
In the experiments, we compared the proposed method against the dropout annealing method and SVD-based compaction. Dropout annealing was chosen because the proposed method also varies the dropout retention probabilities while optimizing the DNN weights. The SVD-based compaction method was chosen since this technique is widely used and applicable to many tasks.
4.1 MNIST
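| Model size | Method | #Weights | Err. rates [%] | Avg. loss |
|---|---|---|---|---|
| Small | Baseline | 42200 | | |
| Small | Dropout | 42200 | | |
| Small | Annealing | 42200 | | |
| Small | SVD | 40400 | | |
| Small | SVD | 82000 | | |
| Small | Compaction | 46665.2 | | |
| Large | Baseline | 477600 | | |
| Large | Dropout | 477600 | | |
| Large | Annealing | 477600 | | |
| Large | SVD | 357600 | | |
| Large | SVD | 795200 | | |
| Large | Compaction | 481276.7 | | |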
The MNIST dataset consists of 60 000 training and 10 000 test images ($28 \times 28$ grayscale pixels) of handwritten digits. For simplicity, we focus on the permutation-invariant version of the task without data augmentation.
We used 2-layer neural networks with rectified linear units (ReLUs). The parameters of the DNNs were initialized with random values from a uniform distribution with adaptive width computed with Glorot’s formula [19]. (Footnote 1: We also evaluated the ReL variant of Glorot’s initialization [20]; however, the ReL variant did not outperform the original Glorot initialization in our experiments.)
As is standard, we split the training data into 50 000 images for training and 10 000 images for hyperparameter optimization. Learning rates, momentum, and the prior parameters were selected based on development set accuracy. The minibatch size for stochastic gradient descent was set to a fixed value. We evaluated networks with various numbers of hidden units. For dropout compaction training, we set the prior hyperparameters to produce approximately 50% compression. The SVD-compacted models were trained by applying SVD to the hidden-to-hidden weight matrices in the best performing neural networks for each number of hidden units. The sizes of the bottleneck layers were chosen to achieve 25% compression in terms of the number of parameters in the hidden-to-hidden matrix. After the decomposition, the SVD-compacted networks were again fine-tuned to compensate for the approximation error.
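The weight counts reported for the MNIST networks can be reproduced with a short calculation. The sketch below assumes 784 inputs, two hidden layers of equal width, 10 outputs, and no bias terms in the count; it also assumes hidden widths of 50, 100, 400, and 800 and a bottleneck of one eighth of the hidden width (rounded up). These values are inferred because they reproduce the counts in Table 1, not because the text states them.

```python
import math

def mlp_weights(hidden, inputs=784, outputs=10):
    """Weights (biases excluded) of a network with two equal-width hidden layers."""
    return inputs * hidden + hidden * hidden + hidden * outputs

def svd_weights(hidden, rank, inputs=784, outputs=10):
    """Same network with the hidden-to-hidden matrix replaced by a rank-`rank` bottleneck."""
    return inputs * hidden + 2 * hidden * rank + hidden * outputs

for hidden in (50, 100, 400, 800):
    rank = math.ceil(hidden / 8)   # 2 * hidden * rank is then about hidden**2 / 4
    print(hidden, mlp_weights(hidden), rank, svd_weights(hidden, rank))
```

Under these assumptions the dense networks with 50 and 400 hidden units have 42200 and 477600 weights, and the factorized networks have 40400, 82000, 357600, and 795200 weights, matching the baseline and SVD rows of the table.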
The learning rate and momentum were set by manual tuning on the development set accuracies. L2 regularization was found not to be effective for the baseline system. The optimal L2 regularization constants were tuned separately: dropout and annealed dropout shared one value, and dropout compaction used another. For the dropout annealing method, we increased the retention probability to $1$ over the first 4 epochs.
Figure 1 and Table 1 show the results of our proposed method and the other methods in comparison. In the figure, the plots in the first row show the differences in the average cross-entropy loss computed on the test set, and the plots in the second row show the differences in the classification error rate. The green lines with “o” markers denote the proposed method and the blue lines with “x” markers denote the compared method, i.e., the baseline feed-forward net, conventional dropout, dropout annealing, and SVD-compacted DNNs. The error bars in the plots represent two standard deviations estimated from 10 trials with different random initializations. The table shows the results for small and large networks. The numbers of weights in the table are averages over the trials.
In terms of test-set cross-entropy loss, dropout compaction performs consistently better than the other methods in comparison. For the application to automatic speech recognition, which is our main interest, the performance in terms of cross-entropy loss is decisive, because the DNN is used as an estimator for the label probability (rather than a classifier).
The behavior in terms of error rate differs. By increasing the model size, the error rate eventually saturates at the same point for all methods. However, with small networks, dropout compaction also clearly outperforms the other methods in terms of error rate. The case of small networks is the more relevant one, because our aim is to apply dropout compaction for training small models. Here, "small" must be understood relative to the complexity of the task. Neural networks that can be deployed in large-scale production for difficult tasks like speech recognition can typically be considered small.
4.2 Large-Vocabulary Continuous Speech Recognition
| Task | Criterion | Baseline | Annealing | SVD | Compaction | SVD | Compaction |
|---|---|---|---|---|---|---|---|
| # Hid. unit | | 1536 | 1536 | (3072, 384) | 3072 → ~1536 | (1536, 192) | 1536 → ~768 |
| VoiceSearchSmall | XEnt | 22.5 (22.6) | 22.3 (21.7) | 22.5 (22.3) | 21.8 (21.2) | 22.7 (21.6) | 22.6 (21.7) |
| VoiceSearchSmall | + bMMI | 21.2 (21.0) | 20.0 (19.8) | 21.5 (21.1) | 19.9 (19.7) | 20.8 (20.3) | 20.4 (19.8) |
| VoiceSearchLarge | XEnt | 18.9 (18.2) | 18.7 (18.2) | 18.8 (18.2) | 18.4 (18.0) | 19.2 (18.8) | 18.9 (18.5) |
| VoiceSearchLarge | + bMMI | 17.9 (17.4) | 17.2 (17.0) | 17.7 (17.2) | 16.9 (17.0) | 17.3 (17.2) | 17.3 (17.2) |
| GenericFarField | XEnt | 24.0 (21.9) | 23.6 (21.7) | 24.4 (22.5) | 23.4 (21.3) | 24.8 (22.8) | 23.9 (22.1) |
| GenericFarField | + bMMI | 21.4 (19.7) | 20.6 (18.8) | 21.3 (19.1) | 20.8 (18.7) | 20.8 (18.9) | 21.0 (18.9) |
As an example of a real-world application that requires large-scale deployment of neural networks, we applied dropout compaction to large-vocabulary continuous speech recognition (LVCSR). We performed experiments on three tasks: VoiceSearchLarge, which contains 453h of voice queries; VoiceSearchSmall, which is a 46h subset of VoiceSearchLarge; and GenericFarField, which contains 115h of far-field speech, where the signal is obtained by applying front-end processing to the seven channels from a microphone array. We used VoiceSearchSmall for conducting preliminary experiments to find the optimal hyperparameters, and used these for the other tasks.
As input vectors to the DNNs, we extracted 32-dimensional log Mel-filterbank energies over 25ms frames every 10ms. The DNN acoustic model processed 8 preceding frames, a middle frame, and 8 following frames as a stacked vector (i.e., $17 \times 32 = 544$-dimensional input features for each target). Thus, with our feature extraction, the number of training examples is 16.5M (VoiceSearchSmall), 163M (VoiceSearchLarge), and 41M (GenericFarField), respectively.
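The input dimensionality and the number of training examples follow directly from the frame stacking and the 10ms frame shift. The sketch below redoes this arithmetic; the constant and function names are illustrative.

```python
FRAMES_PER_SECOND = 100          # one frame every 10 ms
FILTERBANK_DIM = 32              # log Mel-filterbank energies per frame

context_frames = 8 + 1 + 8       # preceding + middle + following
input_dim = context_frames * FILTERBANK_DIM

def frames_in(hours):
    """Number of 10 ms frames in the given number of hours of audio."""
    return hours * 3600 * FRAMES_PER_SECOND

print(input_dim)
for task, hours in (("VoiceSearchSmall", 46), ("VoiceSearchLarge", 453), ("GenericFarField", 115)):
    print(task, frames_in(hours))
```

The frame counts come out at about 16.6M, 163M, and 41.4M, in line with the rounded figures quoted in the text.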
We used a random 10% of the training data as a validation set, which was used for “Newbob” performance-based learning rate control [21]. Specifically, we halved the learning rate when the improvement from the last epoch was less than a threshold. In analogy to cross-validation-based model selection, we used 10% of the validation set for optimizing the retention probabilities.
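The learning rate rule described above can be sketched in a few lines. This is a simplified illustration of the halving rule only, not the Quicknet implementation; the function name, the use of validation loss as the monitored quantity, and the threshold value are assumptions.

```python
def next_learning_rate(learning_rate, prev_loss, curr_loss, min_improvement):
    """Halve the rate when the validation improvement falls below the threshold."""
    improvement = prev_loss - curr_loss
    if improvement < min_improvement:
        return learning_rate * 0.5
    return learning_rate

# Hypothetical validation losses over consecutive epochs.
rate_after_good_epoch = next_learning_rate(0.1, prev_loss=2.00, curr_loss=1.50, min_improvement=0.01)
rate_after_flat_epoch = next_learning_rate(0.1, prev_loss=1.500, curr_loss=1.495, min_improvement=0.01)
print(rate_after_good_epoch, rate_after_flat_epoch)
```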
The baseline model size was designed such that the total ASR latency was below a certain threshold. The number of hidden units for each layer was determined to be 1 536 and the number of hidden layers was 4. The sigmoid activation function was used as the nonlinearity in the hidden layers. Following the standard DNN/HMM-hybrid modeling approach, the output targets were clustered HMM states obtained via triphone clustering. We used 2 500, 2 506, and 2 464 clustered states with the VoiceSearchSmall, VoiceSearchLarge, and GenericFarField tasks, respectively. For fast evaluation, we quantized the values in the weight matrices and used integer operations for the feed-forward computations. These networks are sufficiently small for achieving low latency in a speech recognition service. Therefore, in the experiments, we focus on two use cases: (a) enabling the use of a larger network within the given fixed budget, and (b) achieving faster evaluation by compressing the current network.
All networks were trained with distributed parallel asynchronous SGD on 8 GPUs [6]. In addition, all networks were pretrained with the greedy layer-wise training method [22]. For the dropout compaction and annealing methods, retention probabilities were kept fixed to 0.5 during pretraining. For annealed dropout, we use a schedule designed to increase the retention probabilities from 0.5 to 1.0 in the first 18 epochs. As in the MNIST experiments, we fixed the prior hyperparameters to obtain approximately 50% compression; the hyperparameter setting depends on the number of non-silence frames in the data set. (Footnote 2: 50% compression of hidden units yields roughly 25% compression of hidden-to-hidden weight matrices and 50% compression of the input and output layer weight matrices. This leads to a roughly 2.5x speedup in the feed-forward computation.)
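The relation between removing half of the hidden units and the overall reduction in weights can be checked for a network of the size described above. The sketch assumes 544 inputs (17 stacked frames of 32 features), 4 hidden layers of equal width, 2 500 output states as in VoiceSearchSmall, exactly half of the units removed in every hidden layer, and no bias terms in the count.

```python
def dnn_weights(hidden, inputs=544, outputs=2500, hidden_layers=4):
    """Weight count (biases excluded) of a DNN with equal-width hidden layers."""
    input_layer = inputs * hidden
    hidden_to_hidden = (hidden_layers - 1) * hidden * hidden
    output_layer = hidden * outputs
    return input_layer + hidden_to_hidden + output_layer

full = dnn_weights(1536)
compact = dnn_weights(768)       # assumes exactly 50% of the units are removed
ratio = full / compact
print(full, compact, ratio, compact / full)
```

Under these assumptions the compacted network keeps about 35% of the weights, a reduction factor between 2.5 and 3, which is of the same order as the roughly 2.5x speedup stated in the footnote.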
Because conventional dropout did not improve the performance of the baseline system in the ASR experiments, we only use annealed dropout as the reference system for dropout-based training. This may be due to the small size of the neural networks relative to our training corpus size.
Fig. 2 shows histograms of the retention probabilities as functions of the number of epochs on the VoiceSearchSmall task. As designed, the retention probabilities rapidly diffused from the initial value $0.5$ and converged to $0$ or $1$ in the first 11 epochs. We did not observe significant differences with regard to the pruning rate and the convergence speed over different hidden layers. The compression rates of hidden layers were around 50% in all layers.
Fig. 3 shows the evolution of the frame error rate on the VoiceSearchSmall task. We observed that annealed dropout started overfitting in the later epochs, after the retention probability was annealed to $1$. On the other hand, the dropout compaction methods exhibited a performance gain after the probabilities had converged completely. This suggests that, similar to SVD-based compaction, our proposed method requires some fine-tuning after the structure is fixed, even though the fine-tuning and compaction processes are smoothly connected in the proposed method. This might be the reason why the optimal prior parameters for dropout compaction, which were selected on the development set, implied that the retention probabilities converge in only 11 epochs, whereas the optimal parameters for dropout annealing yield a deterministic model after 18 epochs.
Table 2 shows the word error rates over the development and evaluation sets. As is standard in ASR, all cross-entropy models were fine-tuned according to a sequence-discriminative criterion, in this case the boosted maximum mutual information (bMMI) criterion [23, 24, 25]. Dropout compaction has been applied in the cross-entropy phase of the optimization, and the pruned structure was then used in the bMMI training phase.
The results in Table 2 show that dropout compaction models starting with a larger structure (use case (a)) yielded the best error rates in all cases except for the bMMI result on the GenericFarField task, and they always performed better than the baseline. The differences were especially large in comparison to the cross-entropy trained models. The reason for this might be that the dropout compaction method determines the structure based on the cross-entropy-based criterion.
Regarding use case (b), dropout compaction achieved better results than the baseline network on all tasks. Further, most of the gains by annealed dropout are retained, although the dropout compaction models are roughly 2.5 times smaller. Compared to SVD-based model compaction, the proposed method performed better in almost all cases. Similar to the comparison in use case (a), the relative advantage of dropout compaction became smaller with the additional bMMI training step. Therefore, adapting the proposed method to be compatible with sequence-discriminative training such as bMMI is a promising future research direction.
5 Conclusion
In this paper, we introduced dropout compaction, a novel method for training neural networks that converges to a smaller network starting from a larger network. At the same time, the method retains most of the performance gains of dropout regularization.
The method is based on the estimation of unit-wise dropout probabilities with a sparsity-inducing prior. On real-world speech recognition tasks, we demonstrated that dropout compaction provides comparable accuracy even when the final network has fewer than 40% of the original parameters. Since computational costs scale proportionally to the number of parameters in the neural network, this results in a 2.5x speedup in evaluation.
In future work, we want to study whether the results by dropout compaction can be further improved by using more sophisticated methods for estimating the expectation over the mask patterns. Further, the application of our proposed method to convolutional and recurrent neural networks is a promising direction.
References
[1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[2] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[4] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in Proc. of the IEEE Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Vancouver, Canada, May 2013, pp. 8609–8613.
[5] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech), Singapore, Sep. 2014, pp. 1058–1062.
[6] N. Strom, “Scalable distributed DNN training using commodity GPU cloud computing,” in Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech), Dresden, Germany, Sep. 2015, pp. 1488–1492.
[7] S. J. Rennie, V. Goel, and S. Thomas, “Annealed dropout training of deep networks,” in Proc. of the IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 159–164.
[8] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, “Optimal brain damage,” in Advances in Neural Information Processing Systems (NIPS), vol. 89, 1989, pp. 598–605.
[9] J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural network acoustic models with singular value decomposition,” in Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech), Lyon, France, Aug. 2013, pp. 2365–2369.
[10] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[11] K. Murray and D. Chiang, “Auto-sizing neural networks: With applications to n-gram language models,” in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 908–916.
[12] V. Sindhwani, T. N. Sainath, and S. Kumar, “Structured transforms for small-footprint deep learning,” in Proc. of the International Conference on Learning Representations (ICLR), 2016.
[13] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, “ACDC: A structured efficient linear layer,” in Proc. of the International Conference on Learning Representations (ICLR), 2016.
[14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” 2015.
[15] M. Courbariaux, Y. Bengio, and J.-P. David, “Low precision arithmetic for deep learning,” arXiv preprint arXiv:1412.7024, 2014.
[16] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 3105–3113.
[17] J. Ba, R. R. Salakhutdinov, R. B. Grosse, and B. J. Frey, “Learning wake-sleep recurrent attention models,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2575–2583.
[18] A. Mnih and K. Gregor, “Neural variational inference and learning in belief networks,” in Proc. of the International Conference on Machine Learning (ICML), 2014, pp. 1791–1799.
[19] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. of the International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[21] Quicknet. [Online]. Available: http://www1.icsi.berkeley.edu/Speech/qn.html
[22] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., “Greedy layer-wise training of deep networks,” Advances in Neural Information Processing Systems (NIPS), vol. 19, pp. 153–160, 2007.
[23] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in Proc. of the IEEE Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2008, pp. 4057–4060.
[24] B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 3761–3764.
[25] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech), 2013, pp. 2345–2349.