Deep Neural Networks (DNNs) have emerged as critical technologies for solving various complicated problems [2, 3]. The inference of DNNs is computationally expensive and memory intensive, and therefore urgently needs acceleration before we can fully embrace DNNs in power-limited devices. Extensive work has been proposed to reduce the computation by compressing the synaptic weights, such as weight pruning [4, 6], quantization [7, 8, 9], low-rank approximation [11, 12], and compact network design. The above compression techniques may require retraining the DNN to limit the accuracy loss. The success of these techniques relies on the sparsity and plasticity of DNNs; they cannot, however, be directly applied to the training phase of DNNs.
The training phase, which involves back-propagation through the network to update the weights, demands roughly three times the computational effort of inference. GPGPUs are well suited to this task thanks to their superior parallelism for large matrix multiplications [16, 17]. Extensive work proposes to accelerate the training phase on distributed GPU-based systems [18, 20]. Other works [22, 23] focus on accelerating the training phase using gradient pruning and weight quantization, respectively.
The random dropout technique addresses the over-fitting problem and is widely used in MLP and LSTM training. The most common method randomly drops some neurons of each layer in every training iteration, while the other (DropConnect) pursues the same goal by randomly dropping some synaptic connections between layers, namely some elements of the weight matrix. Theoretically, we could reduce the number of multiplications accordingly if we could skip the calculation of all the dropped neurons or synapses as the dropout rate varies from 0.3 to 0.7. However, such tremendous savings in multiplications, as well as the corresponding data accesses, are hard to exploit, because the neurons or synapses are dropped randomly and irregularly following a Bernoulli distribution. Such irregularity prevents the GPGPU's single-instruction-multiple-threads (SIMT) architecture from skipping the unnecessary multiplications and memory accesses.
Therefore, in this work, we replace the random dropout with two types of regular dropout patterns to make the choices of dropped neurons or synapses predictable, which allows the GPGPU to skip the calculation of those dropped neurons or synapses. We further develop an SGD-based search algorithm to produce a distribution of dropout patterns such that the dropout rate of each neuron approximately follows a Bernoulli distribution (we provide a brief proof). In each iteration, we sample a dropout pattern from that distribution and then eliminate the redundant computation by omitting the dropped data during hardware allocation. Our experiments show that applying approximate random dropout during training can substantially reduce the training time when the dropout rate ranges from 0.3 to 0.7 on MLP (LSTM), with negligible accuracy loss. We also find that as the batch size increases, the speedup rate increases while the accuracy of the neural network slightly declines.
II-A Accelerating DNN inference and training
Considerable work has gone into accelerating the inference of DNNs by leveraging their sparsity [15, 27, 7, 6]. Han et al. prune synaptic weights that are close to zero and then retrain the DNN to maintain the classification accuracy. The zero weights are then encoded and moved onto the on-chip memory. A special decoder is deployed in the accelerator to decode the zero weights and skip the corresponding computation. Consequently, the above methods can only benefit ASIC/FPGA-based DNN accelerators rather than GPUs. Jaderberg et al. and Ioannou et al. use low-rank representations to create computationally efficient neural networks. These methods cannot be used in the training phase because the subtle changes of the weights degrade the convergence and accuracy of training.
Wen et al. propose to use ternary gradients to accelerate distributed deep learning with data parallelism. Zhang et al. propose a variant of the asynchronous SGD algorithm to guarantee its convergence and accelerate the training in a distributed system. Other works accelerate the training process using gradient pruning and weight quantization. Köster et al. share the exponent part of the binary coding of the weights and thereby convert floating-point operations to fixed-point integer operations. Note that this work is compatible with ours, and we leave this topic to further research. Sun et al. prune comparatively small gradients to speed up the training phase. However, their work focuses on software-level optimization and thus yields marginal training acceleration, whereas this work enables computation reduction at the hardware level.
II-B Basics of the GPGPU
The GPGPU is commonly used for DNN training. It is composed of dozens of streaming multiprocessors (SMs). Each SM consists of single-instruction-multiple-threads (SIMT) cores and a group of on-chip memories including the register file, the shared memory, the L1 data cache, etc. Each SM manages and executes multiple threads. Those threads are clustered into warps (e.g., 32 threads on NVIDIA GPUs), executing the same instruction at the same time. Thus, branch divergence occurs when programmers write conditional branches (if-else) that lead threads of the same warp down different paths.
Shared memory is a performance-critical on-chip memory. The latency of accessing the global memory (DRAM) is roughly 100x higher than that of accessing the shared memory. Hence, reducing the frequency of accessing global memory is critical for performance. The capacity of the shared memory per block is 48KB in Nvidia GTX 1080Ti, which is much smaller than the capacity of the global memory. Therefore, reducing the superfluous data in shared memory is also important.
The key purpose of this work is to reduce the scale of matrices, by which we can reduce the access frequency of the shared memory and the global memory as well as the computation effort to accelerate the training.
II-C Random Dropout
Random dropout temporarily removes a subset of neurons or synapses from the network on each training iteration. The probability of a neuron or a synapse being dropped is subject to a Bernoulli distribution parameterized with the dropout rate [25, 27]. In a nutshell, the main reason why random dropout can effectively prevent over-fitting is that it generates an adequate number of different sub-models to learn diverse features during the training process and ensembles those sub-models to maximize the capability of the DNN for inference.
Popular deep learning frameworks, such as Caffe and Tensorflow, all adopt the dropout technique. For each layer in the forward propagation, the output matrix is computed and thereafter element-wise multiplied by a mask matrix composed of randomly generated 0s and 1s, as shown in Fig. 1(a). Similarly, in back-propagation, they first calculate the derivatives of the output matrix; the resulting derivative matrix is then multiplied by the same mask matrix.
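The mask-based mechanism above can be sketched in a few lines of plain Python. This is an illustrative sketch, not framework code; the function names and toy matrices are our own:

```python
import random

def dropout_mask(rows, cols, rate, seed=None):
    """Generate a 0/1 mask where each entry is 0 with probability `rate`."""
    rng = random.Random(seed)
    return [[0 if rng.random() < rate else 1 for _ in range(cols)]
            for _ in range(rows)]

def apply_dropout(output, mask):
    """Element-wise multiply the layer output by the mask."""
    return [[o * m for o, m in zip(orow, mrow)]
            for orow, mrow in zip(output, mask)]

out = [[0.5, -1.2, 3.0], [2.0, 0.1, -0.7]]
mask = dropout_mask(2, 3, rate=0.5, seed=0)
dropped = apply_dropout(out, mask)
# Every masked-out position is exactly zero in the result.
assert all(d == 0.0 for drow, mrow in zip(dropped, mask)
           for d, m in zip(drow, mrow) if m == 0)
```

Note that the full output matrix is computed before the mask is applied, which is exactly the redundancy this paper targets.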
A question arises: why not skip the calculation of those dropped neurons to reduce the redundant time spent on the matrix multiplication and the data movement? Intuitively, we can write conditional branches (if-else) to skip the redundant calculation. However, such conditional branches incur branch divergence on the GPU, which is a great hurdle for performance. As shown in Fig. 1(b), 'T' denotes the threads that satisfy the condition and execute the green function, while 'F' refers to those executing the red function. In the GPGPU's SIMT architecture, the red threads have to wait for the green threads. Thus, some processing elements (PEs) are idle, represented by the red crosses. The total execution time is not reduced (it may even increase) due to the branch divergence. Thus, it is non-trivial to exploit dropout for speedup on the GPU.
III Approximate Random Dropout
The key idea of accelerating DNN training is to reduce the scale of the matrices involved in multiplication and to avoid branch divergence on the GPU. However, the randomness in conventional dropout methods hampers the scale reduction.
In this work, we define a dropout pattern as the combination of dropped neurons in each training iteration. As shown in Fig. 2, we design two sets of regular dropout patterns and replace the random dropout with a dropout pattern sampled from them. As a result of this replacement, we can forecast which neurons or synapses will be dropped and thereby assist the GPU in skipping the calculation and data access of the dropped neurons without incurring divergence. We modify the Caffe source code to reduce the scale of the matrices, which becomes feasible thanks to the predictable dropout.
However, the loss of randomness induced by the regular dropout patterns increases the risk of over-fitting the DNN. To cope with this issue, we further develop a Stochastic Gradient Descent (SGD) based search algorithm (see Section III-C) to find a distribution over all possible dropout patterns such that the probability of each neuron or synapse being dropped is equivalent between our method and the conventional method. We provide a brief proof of this equivalence.
In this section, based on the computation characteristics of the GPU, we first propose two sets of dropout patterns—the Row-based Dropout Pattern (RDP) and the Tile-based Dropout Pattern (TDP)—and then analyze how they reduce the computation and data-access time. After that, we introduce our SGD-based search algorithm, which produces a distribution over possible dropout patterns, as well as the dropout pattern generation procedure in each iteration.
III-A Row-based Dropout Pattern
In the conventional dropout method, the computation associated with a dropped neuron is the multiplication between zero and the corresponding row in the weight matrix of the next layer. To shrink the scale of the matrix, in RDP we drop whole rows of the weight matrix, which is equivalent to dropping all the synapses of the corresponding neurons.
Concretely, RDP is parameterized by a scalar dp and a bias b as follows: we uniformly choose a bias b and drop all rows in the weight matrix whose indices i satisfy (i - b) mod dp ≠ 0. Consequently, (dp - 1)/dp of the neurons are dropped. For instance (the left of Fig. 3(a)), when dp = 3 and b = 0, starting from the first row, we drop two rows (i.e., neurons) in every three successive rows (neurons) in the weight matrix.
Given the size of the output matrix, the maximum dp is bounded by the number of its rows, and the maximum number of sub-models is the number of valid (dp, b) combinations, since there are dp possible biases for each dp.
The execution process on the GPU is also shown in Fig. 3(a). DRAM stores the whole weight matrix (step 1); the gray blocks denote the rows of the weight matrix corresponding to the dropped neurons. We write the kernel function to prevent the GPU from fetching those dropped data into the shared memory (step 2) and build two compact matrices (an input matrix and a weight matrix) for the next step. After the data fetch, every PE multiplies one row of the compact weight matrix by the whole input matrix. Thus, only 1/dp of the original weight matrix, as well as the input matrix, is fetched and calculated. The resulting rows fill the corresponding rows of the output matrix following the same pattern; the rest of the output matrix is set to zero by default. Note that RDP is agnostic to the matrix-multiplication algorithm, as it temporarily compresses the matrices into a compact layout. Therefore, RDP can comply with any optimization method for matrix multiplication.
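As a concrete illustration of RDP's row skipping, here is a minimal Python sketch. The keep rule (i - b) mod dp == 0 is our reading of the pattern described above, and real kernels operate on shared memory rather than Python lists:

```python
def rdp_kept_rows(num_rows, dp, bias):
    # Keep one row out of every dp successive rows, starting at `bias`;
    # the remaining dp - 1 rows (their neurons) are treated as dropped.
    return [i for i in range(num_rows) if (i - bias) % dp == 0]

def rdp_forward(weight, inputs, dp, bias):
    """Multiply only the kept rows of `weight` by `inputs`; dropped
    output rows stay zero, matching the mask-based dropout result."""
    n_cols = len(inputs[0])
    out = [[0.0] * n_cols for _ in range(len(weight))]
    for i in rdp_kept_rows(len(weight), dp, bias):
        for j in range(n_cols):
            out[i][j] = sum(w * inputs[k][j]
                            for k, w in enumerate(weight[i]))
    return out

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 output neurons
X = [[1.0], [1.0]]                          # one input column
# dp = 3, bias = 0: keep row 0, drop rows 1 and 2 (2 of every 3).
assert rdp_forward(W, X, dp=3, bias=0) == [[3.0], [0.0], [0.0]]
```

Only the kept rows participate in the multiplication, so both the arithmetic and the data fetch shrink proportionally.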
III-B Tile-based Dropout Pattern (TDP)
A tile is a sub-matrix of the weight matrix and contains multiple synaptic connections. For the purpose of regularity, we use tiles rather than individual synapses as the unit of dropping. TDP is also parameterized by a scalar dp and a bias b: dp - 1 tiles are dropped in every dp successive tiles, resulting in (dp - 1)/dp of the synaptic connections being dropped. As shown in the left of Fig. 3(b), when dp = 4, starting from the first tile, we drop 3 tiles in every 4 successive tiles.
TDP follows a procedure similar to RDP's, but differs in two respects: (1) TDP fetches the non-dropped tiles, rather than rows, into the shared memory and builds two compact matrices; (2) each PE conducts the multiplication of one tile of the compact weight matrix and the corresponding tile of the compact input matrix, according to its PE index. In the right of Fig. 3(b), the GPU only conducts the multiplication of two compact matrices whose scale is 1/dp of the original. This dropout pattern can naturally work with the tiling method [ryoo2008optimization] in matrix multiplication, which is an essential optimization technique.
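The tile-selection rule can be sketched in the same way as RDP's row rule. This is an illustrative Python fragment with tiles indexed linearly; the keep rule mirroring RDP's is our assumption:

```python
def tdp_kept_tiles(num_tiles, dp, bias):
    # Keep one tile out of every dp successive tiles; the other
    # dp - 1 tiles (their synapses) are dropped.
    return [t for t in range(num_tiles) if (t - bias) % dp == 0]

def tdp_dropped_fraction(num_tiles, dp, bias):
    kept = len(tdp_kept_tiles(num_tiles, dp, bias))
    return 1 - kept / num_tiles

# dp = 4: drop 3 tiles in every 4 successive tiles, i.e. 3/4 of synapses.
assert tdp_kept_tiles(8, dp=4, bias=0) == [0, 4]
assert tdp_dropped_fraction(8, dp=4, bias=0) == 0.75
```

Because the unit is a tile rather than a whole row, the same dp yields many more distinct patterns, which is why TDP produces more sub-models.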
Given the size of the output matrix and the size of the tile, the maximum dp and the maximum number of sub-models can be derived analogously. Because tiles partition the weight matrix at a finer granularity than whole rows, TDP can generate more sub-models than RDP when the matrix is sufficiently large relative to the tile size.
The choice of tile size is critical: the smaller the tile, the larger the number of dropout patterns and hence sub-models, but an overly small tile incurs fine-grained control overhead. Under such circumstances, the tile size is set to match the 32 banks of the shared memory in NVIDIA GPUs, balancing the maximization of the number of sub-models against the avoidance of shared-memory bank conflicts.
A typical training step is composed of three parts: the fully connected layer computation, the activation layer computation, and the dropout layer computation using the mask matrix. After applying a dropout pattern with a dropout rate of 0.5, we only need to spend half of the time on the fully connected layer computation and can skip the dropout layer computation entirely. Consequently, given the dropout pattern, the time spent on training can be markedly reduced.
III-C SGD-based Search Algorithm for Dropout Pattern Distribution
For each iteration of the training procedure, only one regular dropout pattern is applied to the network. In order to approximate the traditional dropout process, the dropout patterns we choose across iterations should satisfy two demands: (1) the dropout probability of each neuron should be subject to a given Bernoulli distribution, and (2) the sub-models derived from that series of dropout patterns should be adequate in number.
Therefore, we propose an efficient SGD-based search algorithm to find a dropout pattern distribution from which the sampled dropout patterns satisfy these demands. SGD consumes tractable time and is convenient for optimizing continuous variables. More specifically, the algorithm obtains a probability distribution that assigns a probability to each possible dropout pattern.
Here we define the global dropout rate as the proportion of neurons or synapses that are set to zero. Note that the global dropout rate is different from the conventional dropout rate, which refers to the probability of a single neuron or synapse being dropped. However, we prove that within our approach the two dropout rates are statistically equivalent.
Given the target global dropout rate and the maximum dp, we use Algorithm 1 to search for the desired distribution. A vector is first arbitrarily initialized (line 1) and, after normalization, serves as the probability distribution over dropout patterns (line 4). Then we set up a constant vector whose elements are the global dropout rates of the respective dropout patterns. The inner product of the two vectors is therefore the expected global dropout rate, and its difference from the target global dropout rate is our optimization objective (line 5). To enforce the distribution to be dense and to produce more diversified sub-models, the negative information entropy of the distribution is added to the loss (lines 6, 7). The algorithm then uses SGD to update the vector (lines 8, 9) and returns the distribution when the loss plateaus. With the loss function defined in line 7, the algorithm aims at finding a distribution that (1) makes the global dropout rate equal to the required value and (2) maximizes the sub-model diversity.
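A minimal, self-contained sketch of such a search in plain Python follows. The softmax parameterization, learning rate, and entropy weight are our illustrative choices and not necessarily those of Algorithm 1:

```python
import math

def sgd_search(rates, target, lam=0.001, lr=1.0, steps=5000):
    """Search a distribution p over dropout patterns such that the
    expected global dropout rate sum(p_i * rates[i]) matches `target`,
    while a negative-entropy term keeps p spread over many patterns."""
    v = [0.0] * len(rates)                       # logits; p = softmax(v)
    for _ in range(steps):
        m = max(v)
        e = [math.exp(x - m) for x in v]
        s = sum(e)
        p = [x / s for x in e]
        diff = sum(pi * ri for pi, ri in zip(p, rates)) - target
        # dL/dp_i for L = diff**2 + lam * sum_i p_i * log(p_i)
        g = [2 * diff * ri + lam * (math.log(pi) + 1)
             for pi, ri in zip(p, rates)]
        dot = sum(pi * gi for pi, gi in zip(p, g))
        # gradient through the softmax: dL/dv_i = p_i * (g_i - <p, g>)
        v = [vi - lr * pi * (gi - dot) for vi, pi, gi in zip(v, p, g)]
    return p

# Patterns dp = 2..8 have global dropout rates (dp - 1) / dp.
rates = [(dp - 1) / dp for dp in range(2, 9)]
p = sgd_search(rates, target=0.7)
expected = sum(pi * ri for pi, ri in zip(p, rates))
assert abs(expected - 0.7) < 0.02 and all(pi > 0 for pi in p)
```

The entropy penalty keeps every pattern's probability strictly positive, so many different sub-models remain reachable while the expected global dropout rate converges to the target.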
III-D Dropout Pattern Generation
The acquired distribution is then used to sample a dropout pattern in each iteration. Concretely, in each iteration, we randomly sample a dropout pattern (parameterized by dp) according to the distribution, and then uniformly choose a bias b. The dropout pattern is then fully determined.
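In code, the per-iteration sampling step could look like the following illustrative Python, where `patterns` holds the candidate dp values and `probs` the searched distribution (both names are our own):

```python
import random

def sample_pattern(patterns, probs, rng=random):
    """Sample a pattern's dp value from the searched distribution,
    then uniformly choose a bias in [0, dp)."""
    dp = rng.choices(patterns, weights=probs, k=1)[0]
    bias = rng.randrange(dp)
    return dp, bias

patterns = [2, 3, 4]          # candidate dp values
probs = [0.2, 0.3, 0.5]       # distribution found by the search
dp, bias = sample_pattern(patterns, probs)
assert dp in patterns and 0 <= bias < dp
```

Because the sampling is resolved before the kernel launch, the GPU knows exactly which rows or tiles to skip and no divergent branch is needed.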
In our method, the global dropout rate is statistically equivalent to the dropout rate of a single neuron or synapse. For a pattern with parameter dp_i and a uniformly chosen bias, each neuron or synapse is dropped with probability (dp_i - 1)/dp_i. Hence, over the whole training process, the probability of a given neuron or synapse being dropped (the conventional dropout rate) is

Pr[drop] = Σ_i p_i · (dp_i - 1)/dp_i,

where p_i is the probability of sampling pattern i. The expected global dropout rate of the sampled patterns is the same quantity:

E[global dropout rate] = Σ_i p_i · (dp_i - 1)/dp_i.

Therefore, in terms of the whole training process, the dropout rate of a single neuron or synapse is equal to the global dropout rate, and thus approximately equal to the target dropout rate enforced by the SGD-based search algorithm.
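The per-pattern part of this equivalence can be checked mechanically for RDP: with dp fixed and the bias drawn uniformly from [0, dp), every row is dropped for exactly dp - 1 of the dp bias choices. The following exact-arithmetic check assumes the keep rule (i - b) mod dp == 0:

```python
from fractions import Fraction

def per_neuron_drop_prob(num_rows, dp):
    """Count, over all dp equally likely biases, how often each row
    (neuron) of an RDP pattern with parameter dp is dropped."""
    counts = [0] * num_rows
    for bias in range(dp):
        for i in range(num_rows):
            if (i - bias) % dp != 0:     # row i is dropped for this bias
                counts[i] += 1
    return [Fraction(c, dp) for c in counts]

probs = per_neuron_drop_prob(num_rows=12, dp=3)
# Each neuron is dropped with probability (dp - 1)/dp = 2/3,
# which is exactly the pattern's global dropout rate.
assert all(p == Fraction(2, 3) for p in probs)
```

Averaging this per-pattern probability over the searched distribution then gives the summation above.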
To evaluate the effectiveness of the proposed approximate random dropout, we compare it with the conventional dropout technique in terms of DNN accuracy and training time. In Section IV-A, in order to explore the influence of the dropout rate on the performance of a specific 4-layer multilayer perceptron (MLP), we vary the dropout rate on that MLP. Note that the dropout rate in our method refers to the target dropout rate as described in Section III-C. In Section IV-B, we compare different MLPs with a fixed dropout rate. The data set we use with the MLPs is MNIST. A long short-term memory (LSTM) network is used in Section IV-C to verify the scalability of our method. The data sets we use with the LSTM include a dictionary whose size is 8800, and the Penn Treebank (PTB) data set, which has long been a central data set for language modeling. The experiments are implemented in Caffe and run on a single GTX 1080Ti GPU.
IV-A Comparison of different dropout rates
The structure of the 4-layer MLP is as follows: the input layer is shaped according to the batch size; the output layer has 10 neurons for digits 0 to 9; the two hidden layers have 2048 neurons each. During training, we set the following hyper-parameters: the batch size is 128, the learning rate is 0.01, and the momentum is 0.9.
We vary the dropout rate from 0.3 to 0.7 (the two hidden layers may have different dropout rates), and record the accuracy and training time for each dropout rate. The comparisons of the two metrics for RDP and TDP against the conventional dropout are shown in Fig. 4. The training time of the conventional dropout is divided by the training time of the proposed approximate random dropout to obtain the speedup rate.
The results show that RDP obtains a speedup over the traditional dropout technique when the dropout rate varies between 0.3 and 0.7, which complies with our intuition, as the amount of data requiring no calculation expands as the dropout rate increases. The speedup rate brought by TDP ranges from 1.18 to 1.6; the slightly lower speedup is induced by the calculation of the nonzero positions in the output matrix before the matrix multiplication. The accuracy loss of these two classes of dropout patterns is small, an acceptable concession for the speedup. TDP has less accuracy loss than RDP, which can be attributed to the abundance of sub-models in TDP.
IV-B Comparison of different networks
We investigate the speedup on different MLP structures using a fixed dropout rate (0.7, 0.7). Those MLPs have the same input and output layers as described in Section IV-A. Their hidden layer sizes are shown in Table I. For instance, the entry in the second column means the first and the second hidden layers' sizes are 1024 and 64, respectively. The hyper-parameters of the optimization algorithm follow the above experiments.
From Table I, the accuracy degradation is marginal; in some cases, the accuracy even increases. Moreover, the speedup rate increases as the network size increases; in particular, for the largest network, both of the proposed dropout patterns reach their highest speedup.
IV-C Scaling to Long Short-Term Memory Model
We evaluate the speedup rate and the model performance on an LSTM, which predicts the following word based on the given words. Each of the two hidden layers of the LSTM contains 1500 neurons. During training, we set the following hyper-parameters: the base learning rate is 1 (it gradually decreases), the batch size is 20, the maximum epoch is 50, and the length of the sequence is 35. It should be noted that the execution of the LSTM is also performed as matrix multiplication; thus our proposed approximate dropout can be easily applied to the LSTM.
As shown in Table II, the accuracy degradation is marginal. As the dropout rate increases, the speedup rate increases without further degrading the accuracy.
To illustrate the effectiveness of the proposed method, we fix the dropout rate and trace the accuracy of RDP until convergence. As shown in Fig. 5, the red curve records our approximate random dropout training process, and the blue one records the traditional dropout. Our method converges earlier than the traditional dropout. Moreover, the smoothness of the red curve indicates that the approximate random dropout is beneficial to the training process.
The results using the Penn Treebank (PTB) data set on the 3-layer LSTM are shown in Fig. 6(a). The test perplexity using RDP increases only slightly even when the dropout rate is 0.7, which further shows that our proposed approximate dropout algorithm can generate adequate sub-models for the PTB data set. Besides, when the dropout rate increases from 0.3 to 0.7, the speedup rate also increases.
We vary the batch size from 20 to 40. Note that the SGD-based search and the data initialization are a one-time effort. When the batch size is increased, only the matrix operation and data transmission times increase accordingly. As shown in Fig. 6(b), the speedup rate increases when the batch size increases. However, since one dropout pattern is applied to the whole batch, the sub-models generated during training may not be sufficient, which raises the perplexity.
Accelerating DNN training is difficult because it is nontrivial to leverage the sparsity of a DNN in dense matrix multiplication. In this work, we propose a novel approach to eliminate the unnecessary multiplications and data accesses by replacing the traditional random dropout with an approximate random dropout. The two classes of dropout patterns avoid the divergence issue on the GPU and reduce the scale of the matrices, and thus gain a significant improvement in energy efficiency with a marginal decline in model performance. The proposed SGD-based search algorithm guarantees that the dropout rate of a single neuron or synapse is equivalent to that of the conventional dropout, as well as the convergence and accuracy of the models. In general, the training time can be substantially reduced when the dropout rate is 0.3-0.7 on MLP and LSTM, and a higher speedup rate is expected on a larger DNN. The proposed method has been wrapped as an API and integrated into Caffe. The speedup could be much higher if the proposed method were integrated into the cuBLAS library.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in CVPR, pp. 1800–1807, 2017.
-  W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” 2017.
-  T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” 2017.
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in ICLR, 2016.
-  W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” 2016.
-  C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, and G. Yuan, “Circnn: Accelerating and compressing deep neural networks using block-circulantweight matrices,” 2017.
-  A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” 2017.
-  Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” Computer Science, 2014.
-  C. Leng, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” 2017.
-  Q. Hu, P. Wang, and J. Cheng, “From hashing to cnns: Training binaryweight networks via hashing,” 2018.
-  M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” Computer Science, vol. 4, no. 4, p. XIII, 2014.
-  Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, and A. Criminisi, “Training cnns with low-rank filters for efficient image classification,” Journal of Asian Studies, vol. 62, no. 3, pp. 952–953, 2015.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” 2017.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, “Inverted residuals and linear bottlenecks: Mobile networks forclassification, detection and segmentation,” 2018.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” 2017.
-  International Conference on Artificial Neural Networks, pp. 381–390, 2009.
-  S. Puri, “Training convolutional neural networks on graphics processing units,” 2010.
-  W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” 2017.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet in 1 hour,” 2017.
-  W. Zhang, S. Gupta, X. Lian, and J. Liu, “Staleness-aware async-sgd for distributed deep learning,” in International Joint Conference on Artificial Intelligence, pp. 2350–2356, 2016.
-  J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, and P. Tucker, “Large scale distributed deep networks,” in International Conference on Neural Information Processing Systems, pp. 1223–1231, 2012.
-  X. Sun, X. Ren, S. Ma, and H. Wang, “meprop: Sparsified back propagation for accelerated deep learning with reduced overfitting,” 2017.
-  U. Köster, T. J. Webb, X. Wang, M. Nassar, A. K. Bansal, W. H. Constable, O. H. Elibol, S. Gray, S. Hall, and L. Hornof, “Flexpoint: An adaptive numerical format for efficient training of deep neural networks,” in Neural Information Processing Systems, 2017.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  L. Wan, M. D. Zeiler, S. Zhang, Y. Lecun, and R. Fergus, “Regularization of neural networks using dropconnect,” in International Conference on Machine Learning, pp. 1058–1066, 2013.
-  V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves recurrent neural networks for handwriting recognition,” in International Conference on Frontiers in Handwriting Recognition, pp. 285–290, 2014.
-  W. Wen, Y. He, S. Rajbhandari, M. Zhang, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li, “Learning intrinsic sparse structures within long short-term memory,” 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, and J. Long, “Caffe: Convolutional architecture for fast feature embedding,” pp. 675–678, 2014.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” 2016.
-  A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
-  A. Graves, Long Short-Term Memory. Springer Berlin Heidelberg, 2012.
-  M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of english: The penn treebank,” Computational linguistics, vol. 19, no. 2, pp. 313–330, 1993.