1 Introduction
Convolutional Neural Networks (CNNs) have achieved great success in many fields, especially in visual recognition tasks such as object classification (Krizhevsky et al., 2012) and face verification (Taigman et al., 2014). Although the architecture of a CNN makes it possible to extract abstract features layer by layer, the convolution computation is costly and consumes a large share of the computing resources, which makes the CNN a compute-intensive model. In forward propagation, a CNN performs the convolution of each convolutional layer via a matrix multiplication, and almost the same amount of computation is needed again during back propagation. A method that reduces this computation would therefore help cut the time spent in both training and inference.
Sun et al. (2017) propose a minimal effort back propagation method, called meProp, to reduce the computation in back propagation. The idea is to compute only a very small but critical portion of the gradient information, and to update only the corresponding minimal portion of the parameters in each learning step. In this way, only the highly relevant parameters are updated, while the others stay untouched. Hence, this technique results in sparse gradients and sparse updates. In other words, fewer gradients are passed back and only some rows or columns (depending on the layout) of the weight matrix are modified. The experiments also show that models using meProp are more robust and less likely to overfit.
We extend this technique to Convolutional Neural Networks and call the result mePropCNN; its purpose is to reduce the computation in the back propagation of a CNN. In the back propagation of a CNN, the convolution operation is transformed into matrix multiplication operations, as in forward propagation. As in most neural networks, matrix multiplication consumes more computing resources than other operations such as addition and subtraction. To address this issue, we apply mePropCNN to the CNN, just as meProp is applied to the feedforward NN model (MLP) in Sun et al. (2017).
The differences from meProp in MLP and the contributions of this work are as follows:

Compared with the linear transformation in an MLP, the convolution operation in a CNN is a distinct operation, and this leads to different behavior during parameter updates. This is explained in detail in Section 2.2.

We implement a new sparse back propagation method for CNNs to reduce computation, which turns the dense convolution computation into sparse matrix multiplication. In this way, the proposed method can outperform the original methods.

We enhance the meProp technique with a momentum method to get more stable results. This is an optional method in our experiments.
2 Method
We introduce the meProp technique into the Convolutional Neural Network to reduce computation in back propagation. The forward process is computed as usual, while only a small subset of the gradients is used to update the parameters. Specifically, we select the top-$k$ elements to update the parameters and set the rest to 0, which is similar to the Dropout (Srivastava et al., 2014) technique. We do find that a model with meProp is more robust and less likely to overfit. We first present the proposed method and then describe the implementation details.


2.1 meProp
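We first introduce meProp in a feedforward neural network. For simplicity, we use a single linear transformation unit to explain the details of the proposed method:
$$y = Wx \tag{1}$$
$$z = \sigma(y) \tag{2}$$
where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^{m}$, $y \in \mathbb{R}^{n}$, $z \in \mathbb{R}^{n}$, $m$ is the dimension of the input vector, $n$ is the dimension of the output vector, and $\sigma$ is a nonlinear function (e.g., relu, tanh, and sigmoid). During back propagation, we need to compute the gradient of the parameter matrix $W$ and the input vector $x$:
$$\frac{\partial z}{\partial W_{ij}} = \sigma'_i x_j^{T} \quad (1 \le i \le n,\ 1 \le j \le m) \tag{3}$$
$$\frac{\partial z}{\partial x_i} = \sum_j W_{ij}^{T} \sigma'_j \quad (1 \le j \le n,\ 1 \le i \le m) \tag{4}$$
where $\sigma'_i$ means $\frac{\partial z_i}{\partial y_i}$. We can see that the computational cost of back propagation is directly proportional to the dimension $n$ of the output vector.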
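The proposed meProp uses approximate gradients by keeping only the top-$k$ elements based on magnitude. That is, only the $k$ elements with the largest absolute values are kept. For example, suppose a vector $v = \langle 1, 2, 3, -4 \rangle$, then $\mathrm{top}_2(v) = \langle 0, 0, 3, -4 \rangle$. We denote the indices of the top-$k$ values of the vector $\sigma'(y)$ as $\{t_1, t_2, \ldots, t_k\}$ $(1 \le k \le n)$, and the approximate gradients of the parameter matrix $W$ and the input vector $x$ are:
$$\frac{\partial z}{\partial W_{ij}} \leftarrow \sigma'_i x_j^{T} \ \text{ if } i \in \{t_1, t_2, \ldots, t_k\} \text{ else } 0 \tag{5}$$
$$\frac{\partial z}{\partial x_i} \leftarrow \sum_j W_{ij}^{T} \sigma'_j \ \text{ if } j \in \{t_1, t_2, \ldots, t_k\} \text{ else } 0 \tag{6}$$
As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost.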
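The top-$k$ operator described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the function name `top_k` and the use of `np.argpartition` are our own choices.

```python
import numpy as np

def top_k(v, k):
    """Keep the k entries of v with the largest absolute value; zero the rest."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    if k <= 0:
        return out
    k = min(k, v.size)
    # argpartition places the indices of the k largest |v| in the last k slots.
    idx = np.argpartition(np.abs(v), v.size - k)[v.size - k:]
    out[idx] = v[idx]
    return out

# The example vector from the text: only 3 and -4 survive for k = 2.
print(top_k([1, 2, 3, -4], 2).tolist())  # [0.0, 0.0, 3.0, -4.0]
```

A partial sort is enough here because only the membership of the top-$k$ set matters, not the order inside it.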
Figure 1 is an illustration of meProp for a single computation unit of neural models. The original back propagation uses the full gradient of the output vectors to compute the gradient of the parameters. The proposed method selects the top-$k$ values of the gradient of the output vector, and backpropagates the loss through the corresponding subset of the total model parameters.
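As for a complete neural network framework with a loss $L$, the original back propagation computes the gradient of the parameter matrix $W$ as:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial W} \tag{7}$$
while the gradient of the input vector $x$ is:
$$\frac{\partial L}{\partial x} = \frac{\partial y}{\partial x} \cdot \frac{\partial L}{\partial y} \tag{8}$$
The proposed meProp selects the top-$k$ elements of the gradient $\frac{\partial L}{\partial y}$ to approximate the original gradient, and passes them through the gradient computation graph according to the chain rule. Hence, the gradient of $W$ goes to:
$$\frac{\partial L}{\partial W} \leftarrow \mathrm{top}_k\left(\frac{\partial L}{\partial y}\right) \cdot \frac{\partial y}{\partial W} \tag{9}$$
while the gradient of the vector $x$ is:
$$\frac{\partial L}{\partial x} \leftarrow \frac{\partial y}{\partial x} \cdot \mathrm{top}_k\left(\frac{\partial L}{\partial y}\right) \tag{10}$$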
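To make the cost reduction concrete, the following NumPy sketch applies the sparsified backward pass to a single linear unit $y = Wx$. It is a minimal illustration under our own assumptions: the names `top_k_mask` and `meprop_linear_backward` are hypothetical, $W$ is stored with one row per output dimension, and $1 \le k \le n$.

```python
import numpy as np

def top_k_mask(g, k):
    """Boolean mask of the k largest-magnitude entries of a 1-D vector g (1 <= k <= len(g))."""
    mask = np.zeros(g.shape, dtype=bool)
    mask[np.argsort(np.abs(g))[-k:]] = True
    return mask

def meprop_linear_backward(W, x, grad_y, k):
    """Sparsified backward pass of y = W x, given grad_y = dL/dy."""
    idx = np.flatnonzero(top_k_mask(grad_y, k))
    grad_W = np.zeros_like(W)
    # Only the k selected rows of W receive a gradient.
    grad_W[idx] = np.outer(grad_y[idx], x)
    # Only the k selected rows of W take part in the product for dL/dx.
    grad_x = W[idx].T @ grad_y[idx]
    return grad_W, grad_x

rng = np.random.default_rng(0)
n, m, k = 8, 5, 2
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
grad_y = rng.normal(size=n)
grad_W, grad_x = meprop_linear_backward(W, x, grad_y, k)
# Number of rows of W that would be updated: k instead of n.
print(np.count_nonzero(np.abs(grad_W).sum(axis=1)))
```

Both products touch only $k$ of the $n$ rows, which is where the linear reduction in cost comes from.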
2.2 mePropCNN
The forward propagation process is the same as in feedforward neural networks. Compared with the linear transformation in Section 2.1, the convolution computation in a CNN is a distinct operation, and this leads to different behavior during parameter updates. In an MLP, only the corresponding critical portion of the parameters is updated in each learning step, which means only several rows or columns (depending on the layout) of the weight matrix are modified. This is not necessarily the case in mePropCNN: the meProp operation only produces a sparse matrix in the intermediate gradients. Take a simple convolution computation for example:
$$y = W \otimes x \tag{11}$$
We use $\otimes$ to denote the convolution operation of the CNN, $W$ to denote the parameters of the filters, and $x$ to denote the input of the current layer, which is the output of the previous layer. To get the gradient of $W$, $\frac{\partial L}{\partial W}$, we compute:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial W} \tag{12}$$
Then we apply meProp to Eq. (12):
$$\frac{\partial L}{\partial W} \leftarrow \mathrm{top}_k\left(\frac{\partial L}{\partial y}\right) \cdot \frac{\partial y}{\partial W} \tag{13}$$
Here the operands are all transformed into the right matrix shapes as needed. Note that after the convolution computation, the gradients of $W$ are probably not as sparse as in the MLP model. This is determined by the difference between the convolution operation and the linear transformation. So the benefit we get in mePropCNN is sparse matrix operations, and it is necessary to verify the validity of meProp in the CNN architecture.
Dense matrix operations in a CNN consume most of the time of back propagation. To address this, we propose the mePropCNN technique, which leads to sparse matrix operations, and we benefit from this transformation. We apply the proposed method to every convolution filter separately. As in Sun et al. (2017), for other element-wise operations (e.g., activation functions), the original back propagation procedure is kept, because those operations are already fast enough compared with matrix-matrix or matrix-vector multiplication operations.
In this paper, we use top-$k$ as the ratio of gradients selected in one hidden layer. For example, if we set top-$k$ = 5%, then 50 gradients will be selected for a layer of dimension 1000.
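The conversion from a top-$k$ ratio to a number of selected gradients is simple arithmetic. The helper below is a hypothetical sketch; rounding to the nearest integer and keeping at least one gradient are our assumptions, since the text does not say how fractional counts are handled.

```python
def num_selected(dim, ratio):
    """Number of gradients kept for a layer of size dim at the given top-k ratio."""
    # Assumption: round to the nearest integer and never select fewer than one.
    return max(1, int(round(dim * ratio)))

print(num_selected(1000, 0.05))  # 50, matching the example in the text
```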
As illustrated in Figure 3, this is a common architecture for Convolutional Neural Networks. We apply our method in the convolutional layers (Conv1 and Conv2 in Figure 3), and full gradients are passed back in the fully-connected layers. Note that the convolution operations in Eq. (11) are transformed into matrix-matrix or matrix-vector multiplication operations, so we can rewrite Eq. (11) as:
$$y = W \cdot x \tag{14}$$
Since we conduct our experiments in TensorFlow, we need not perform these matrix transformations manually. The operands are all transformed into the right matrix shapes by the TensorFlow framework; all we need to do is apply our method to the gradients. Since the chain rule is used in back propagation, we need the gradients of the parameter matrix $W$ and the input vector $x$. Because the convolution operations have been transformed into matrix-matrix or matrix-vector multiplication operations, the process of back propagation is similar to the meProp described in Section 2.1.
However, convolution differs from other operations because of its weight sharing: all of the units in a feature map share the same parameters, namely the same filter weights. The weight sharing mechanism makes the CNN powerful at extracting abstract features layer by layer. The filters slide over local patches of the feature maps of the current layer and generate the feature maps for the next layer. On the other hand, different filters hold different parameters and work for the model independently, which urges us to respect the same property when we apply the proposed method during back propagation. That is to say, we should select the top-$k$ gradients for every feature map separately, rather than mixing all feature maps together and then choosing among them. Concretely, the output $y$ of the current convolutional layer contains feature maps that are generated by the respective filters in forward propagation; in back propagation we then get the gradient matrix $\frac{\partial L}{\partial y}$, which has the same shape as the matrix $y$. We slice the gradient matrix into pieces corresponding to the filters in forward propagation and apply top-$k$ selection to each piece, that is, the top-$k$ elements with the largest absolute values are kept. One thing that should be pointed out is that we do not apply top-$k$ selection to the current gradient matrix directly; instead, we take the scale of the historical gradients into account. We achieve this by applying exponential decay. Formally, the accumulated gradient matrix is updated as:
(15) 
Then we select the top-$k$ elements based on the new matrix, and the unselected elements are set to 0. The selected sparse matrix replaces the original gradient matrix in the remaining computation for the previous layers, as in Eq. (9) and Eq. (10).
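The per-feature-map selection with a decayed accumulator can be sketched as follows. Several points here are assumptions and not taken from the text: the accumulator is updated as an exponential moving average of gradient magnitudes, `acc = decay * acc + (1 - decay) * |grad|`, which is only one plausible form of the update; the gradient uses an `(N, H, W, C)` layout with `C` indexing feature maps; and the $k$ entries of each feature map are chosen over all batch and spatial positions. The function name is hypothetical.

```python
import numpy as np

def select_per_feature_map(grad, acc, ratio, decay):
    """Top-k selection per feature map, ranked by a decayed gradient magnitude.

    grad, acc: arrays of shape (N, H, W, C); C indexes feature maps.
    Returns the sparsified gradient and the updated accumulator.
    """
    # ASSUMED update rule: exponential moving average of gradient magnitudes.
    acc = decay * acc + (1.0 - decay) * np.abs(grad)
    n, h, w, c = grad.shape
    k = max(1, int(round(ratio * n * h * w)))
    # Flatten to (N*H*W, C) so that each column is one feature map.
    flat_acc = acc.reshape(-1, c)
    flat_grad = grad.reshape(-1, c)
    # Row indices of the k largest accumulated magnitudes in each column.
    top_rows = np.argpartition(flat_acc, flat_acc.shape[0] - k, axis=0)[-k:]
    mask = np.zeros(flat_grad.shape, dtype=bool)
    np.put_along_axis(mask, top_rows, True, axis=0)
    sparse = np.where(mask, flat_grad, 0.0).reshape(grad.shape)
    return sparse, acc

rng = np.random.default_rng(0)
grad = rng.normal(size=(10, 14, 14, 64))   # hypothetical gradient shape
acc = np.zeros_like(grad)
sparse, acc = select_per_feature_map(grad, acc, ratio=0.05, decay=0.6)
# Every feature map keeps the same number of entries.
print(np.count_nonzero(sparse, axis=(0, 1, 2))[:4])
```

With `decay=0` the accumulator equals the current gradient magnitude, so the sketch reduces to plain top-$k$ selection on the current gradient.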
Also note that we use relu as the activation function, and one property of relu is that it tends to lead to sparsity. We therefore check the CNN to find out how much sparsity relu and the max-pooling layer contribute to the gradients. When we train the CNN for 1 iteration, the sparsity (the ratio of nonzero values) of 3 sparse layers (conv1, conv2 and the first fully-connected layer) is 23%, 3% and 50%, respectively. In contrast, the sparsity with sigmoid is 25%, 25% and 99%, and the rate of 25% is related to the kernel size of the max-pooling layer. Specifically, we set the kernel size to $2 \times 2$ and the strides to 2 for the max-pooling layer in our experiments, as in Table 1. As the max-pooling layer chooses the maximum element in each $2 \times 2$ grid, the other 3 unselected elements do not contribute to the next layer, so the gradients at these locations are 0 in back propagation. Hence the sparsity of a convolutional layer is 25% at most, and we use 5% of the full gradients, which is also 20% of 25%.
2.3 mePropCNN with Batch Normalization
Deep Neural Networks are difficult to train because the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. Ioffe & Szegedy (2015) propose a method called Batch Normalization to address this problem. It has been proven to be a very effective technique in Deep Neural Networks.
2.3.1 Batch Normalization
Ioffe & Szegedy (2015) find that the distribution of each layer’s inputs changes during training, and they refer to this phenomenon as internal covariate shift. Batch Normalization addresses this problem by performing the normalization for each training minibatch. LeCun et al. (2012) and Wiesler & Ney (2011) reveal that network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero means and unit variances, and decorrelated. By normalizing the inputs of each layer, in the same spirit as whitening, the model obtains fixed distributions of inputs, which removes the ill effects of internal covariate shift. Also note that the normalization procedure differs between training and inference. We use the minibatch inputs to compute the mean and variance during training, while an unbiased estimate is used during inference. We use the unbiased variance estimate $\mathrm{Var}[x] = \frac{m}{m-1} \cdot \mathrm{E}_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$, where the expectation is over training minibatches of size $m$ and $\sigma_{\mathcal{B}}^2$ are their sample variances.
2.3.2 Batch-Normalized mePropCNN
Batch Normalization achieves great success in Deep Neural Networks, such as deep Convolutional Neural Networks. Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, the model reaches the previous state of the art with only a small fraction of the training steps (Ioffe & Szegedy, 2015). Batch Normalization supplies a new way to regularize the model, just as Dropout and our proposed method do. So what if we combine our method mePropCNN with Batch Normalization? Will the Batch-Normalized mePropCNN still work properly? We test and verify this idea in our experiments, and the results are shown as follows.
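As described in Ioffe & Szegedy (2015), Batch Normalization is added before the nonlinearity. For example,
$$z = g(Wu + b) \tag{16}$$
where $W$ and $b$ are learned parameters of the model, and $g(\cdot)$ is the nonlinearity such as sigmoid or relu. The output of $Wu + b$ is normalized before being passed to $g(\cdot)$:
$$z = g(\mathrm{BN}(Wu + b)) \tag{17}$$
The Batch Normalization method normalizes the outputs of each layer before the activation function. Formally, the process is as follows. We compute the mean and variance over a minibatch $\mathcal{B} = \{x_1, \ldots, x_m\}$:
$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{18}$$
$$\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \tag{19}$$
$\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ are used to normalize the values of the minibatch:
$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \tag{20}$$
$\epsilon$ is a constant added to the minibatch variance for numerical stability (Ioffe & Szegedy, 2015). Finally we scale and shift the normalized value:
$$y_i = \gamma \hat{x}_i + \beta \tag{21}$$
$\gamma$ and $\beta$ are parameters we should learn during training.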
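The training-time normalization steps above map directly onto a few NumPy lines. This is a minimal sketch for a minibatch of feature vectors; the function name and the default value of `eps` are our assumptions, and the inference-time statistics are not covered.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize a minibatch x of shape (m, d), one statistic per feature."""
    mu = x.mean(axis=0)                      # minibatch mean
    var = ((x - mu) ** 2).mean(axis=0)       # biased minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 4))
out = batch_norm(x, gamma=1.0, beta=0.0)
# With gamma = 1 and beta = 0, each feature has roughly zero mean and unit variance.
print(out.mean(axis=0).round(6), out.var(axis=0).round(3))
```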
For the convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. See Figure 4 for an illustration. We let $\mathcal{B}$ be the set of all values in a feature map across both the elements of a minibatch and spatial locations: for a minibatch of size $m$ and feature maps of size $p \times q$, we use the effective minibatch of size $m' = |\mathcal{B}| = m \cdot pq$. We learn a pair of parameters $\gamma$ and $\beta$ per feature map, rather than per activation. A side effect of this constraint is that we should also obey the convolutional property when we apply the top-$k$ operation in convolutional layers during back propagation. In other words, we should apply the top-$k$ operation to the gradient matrix along the feature map dimension rather than other dimensions. Take the MNIST experiment as an example: the output of the first convolutional layer consists of 32 feature maps, one per filter (see Figure 4). During forward propagation we apply batch normalization to each feature map separately, which means there are 32 pairs of parameters $\gamma$ and $\beta$ for this layer in our model. Similarly, in back propagation we apply the top-$k$ operation to each feature map separately, as the feature maps are computed by the filters independently.
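For convolutional layers, the only change to the normalization is which axes the statistics are computed over. The sketch below assumes an `(m, p, q, C)` activation layout; the function name, the `eps` default, and the concrete shape `(10, 28, 28, 32)` are hypothetical choices for illustration.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x: (m, p, q, C). One mean and variance per feature map, shared over batch and space."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    # gamma and beta have shape (C,): one pair per feature map.
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(10, 28, 28, 32))
y = batch_norm_conv(x, gamma=np.ones(32), beta=np.zeros(32))
effective_minibatch = x.shape[0] * x.shape[1] * x.shape[2]  # m * p * q values per map
print(y.shape, effective_minibatch)
```

The same grouping of axes is what the per-feature-map top-$k$ selection uses in back propagation: one selection per feature map, over the batch and spatial axes.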
params  Conv1  Pool1  Conv2  Pool2  FC1  FC2  

MNIST  ksize  1024  10  
strides 
3 Related Work
Riedmiller & Braun (1993) proposed a direct adaptive method for fast learning, which performs a local adaptation of the weight update according to the behavior of the error function. Tollenaere (1990) also proposed an adaptive acceleration strategy for back propagation. Dropout (Srivastava et al., 2014) is proposed to improve training speed and reduce the risk of overfitting. Sparse coding is a class of unsupervised methods for learning sets of overcomplete bases to represent data efficiently (Olshausen & Field, 1996). Poultney et al. (2007) proposed a sparse autoencoder model for learning sparse overcomplete features. The proposed method is quite different from those prior studies on back propagation, dropout, and sparse coding.
The sampled-output-loss methods (Jean et al., 2014) are limited to the softmax layer (output layer) and are only based on random sampling, while our method does not have those limitations. The sparsely-gated mixture-of-experts (Shazeer et al., 2017) only sparsifies the mixture-of-experts gated layer and is limited to the specific setting of mixture-of-experts, while our method does not have those limitations. There are also prior studies focusing on reducing the communication cost in distributed systems (Seide et al., 2014; Dryden et al., 2016) by quantizing each value of the gradient from a 32-bit float to only 1 bit. Those settings are also different from ours.
4 Experiments
To demonstrate the effectiveness of our method, we perform experiments on the MNIST image recognition task. Sample images of the database are shown in Figure 2. The CNN model without top-$k$ and Batch Normalization is chosen as the baseline.
We implement the proposed method mePropCNN for the MNIST (LeCun et al., 2010) image recognition task to verify the method. We use Adam (Kingma & Ba, 2014) to optimize the model, and in our implementation the details of Adam stay untouched. We implement our experiments in TensorFlow (Abadi et al., 2015).
MNIST: The MNIST dataset of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The images in this dataset are all gray-scale images of size $28 \times 28$ pixels, and they belong to 10 classes, ranging from 0 to 9.
4.1 Settings
For the MNIST task we use two convolutional layers, two max-pooling layers and two fully-connected layers, and the output of the last fully-connected layer is fed to a softmax layer which produces a distribution over the ten class labels. The architecture of our model is shown in Figure 3. The first convolutional layer filters the input image with 32 kernels of size with a stride of 1 pixel. The second convolutional layer takes as input the output of the previous pooling layer and filters it with 64 kernels of size . The pooling window size of both max-pooling layers is $2 \times 2$ with a stride of 2 pixels. We use the Rectified Linear Unit (relu) (Glorot et al., 2011) as the activation function in our model. Krizhevsky et al. (2012) find that deep convolutional neural networks with ReLUs converge several times faster than their equivalents with tanh units. Table 1 shows more details of our parameter settings.
We perform the top-$k$ method in back propagation except in the last fully-connected layer. The hyperparameters of Adam optimization are as follows: the learning rate , and , . The minibatch size is 10.
4.2 Choice of topk ratio
An intuitive idea is that layers with different numbers of neurons should also have different numbers of gradients selected, and that the gradients of the front layers are influenced by the gradients from the back layers in back propagation, so these factors should be taken into account when we set the top-$k$ ratios. In our experiments, we explored many settings to find proper parameters. The experiments reveal that overly sparse gradients in the back layers (such as the fully-connected layers in our experiments) result in bad performance, possibly because too much gradient information is dropped, which prevents the parameters of the front layers from converging to appropriate values.

Table 2: Results on MNIST for different top-k values.
Top-k (%)  Decay  Epoch  Dev Acc (%)  Test Acc (%)
Baseline   -      18     99.44        99.02
5%         0      28     99.40        99.07
5%         0.6    17     99.40        99.27
8%         0      30     99.42        99.16
8%         0.6    27     99.34        99.23
10%        0      28     99.36        99.15
10%        0.6    19     99.36        99.21
4.3 Results
Table 2 shows the results for different top-$k$ values on the MNIST dataset. The minibatch size is 10, and the top-$k$ ratio ranges from 5% to 100% (the baseline), as shown in the table. The decay rate represents the tradeoff in Eq. (15). As usual, we first evaluate our model on the development data to obtain the optimal number of iterations and the corresponding accuracy, and then evaluate on the test data with the best setting tuned on the development set. As we can see in Table 2, mePropCNNs get better test accuracy than the baseline, and the gap between Dev Acc and Test Acc for the baseline suggests that the baseline without meProp tends to overfit. As for the momentum method, a higher decay rate does not always mean a better result: decay = 0.6 works better than 0.9 in our experiments. This may be because too large a momentum makes the model inflexible, which means only a small fixed subset of gradients is used while the others may never have a chance to be selected. Compared with the CNN, mePropCNNs match or even exceed its accuracy while keeping only a small subset of the full gradients in back propagation. The main reason could be that the minimal effort update does not modify weakly relevant parameters, which makes overfitting less likely, similar to the effect of Dropout.
Batch Normalization once again demonstrates the ability to accelerate convergence: the model with Batch Normalization gets a faster rate of convergence and higher accuracy, as shown in Table 3.
Table 3: Results on MNIST with Batch Normalization.
Top-k (%)  Decay  Epoch  Dev Acc (%)  Test Acc (%)
Baseline   -      30     99.60        99.28
5%         0      22     99.58        99.38
5%         0.6    28     99.56        99.39
8%         0      22     99.54        99.26
8%         0.6    22     99.54        99.48
10%        0      23     99.56        99.37
10%        0.6    19     99.52        99.14
Batch Normalization with top-$k$ = 5% gets better test accuracy than full gradients. In our experiments the gradients of the fully-connected layers are not processed by the meProp method, and top-$k$ = 5% means that 5% of the gradients are passed back in the convolutional layers in back propagation. The momentum mePropCNN with Batch Normalization is consistent with the earlier results: a proper decay rate of 0.6 works better in our experiments. The results are shown in Table 3.
5 Conclusion and future work
We propose a new technique called mePropCNN to reduce computation in back propagation. In the back propagation of a CNN, the convolution computation is transformed into matrix multiplication operations, as in forward propagation, and only a small subset of the gradients is used to update the parameters. Specifically, we select the top-$k$ elements to update the parameters and set the rest to 0, which is similar to the Dropout technique. We enhance the meProp technique with a momentum method for more stable results. Experiments show that our method performs as well as the CNN even when only a small subset of the gradients is used, and moreover, it helps to avoid overfitting. Our method also works together with Batch Normalization. In future work, we would like to apply the proposed method to lexical processing tasks (Gao et al., 2010; Sun et al., 2008, 2010, 2012), which may benefit from our method as well.
References

Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
Dryden et al. (2016) Dryden, Nikoli, Jacobs, Sam Ade, Moon, Tim, and Van Essen, Brian. Communication quantization for data-parallel training of deep neural networks. In Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, pp. 1–8. IEEE Press, 2016.
Gao et al. (2010) Gao, Jianfeng, Li, Xiaolong, Micol, Daniel, Quirk, Chris, and Sun, Xu. A large scale ranker-based system for search query spelling correction. In COLING, 2010.
Glorot et al. (2011) Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
Jean et al. (2014) Jean, Sébastien, Cho, Kyunghyun, Memisevic, Roland, and Bengio, Yoshua. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
LeCun et al. (2010) LeCun, Yann, Cortes, Corinna, and Burges, Christopher JC. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
LeCun et al. (2012) LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.
Olshausen & Field (1996) Olshausen, Bruno A and Field, David J. Natural image statistics and efficient coding. Network: computation in neural systems, 7(2):333–339, 1996.
Poultney et al. (2007) Poultney, Christopher, Chopra, Sumit, Cun, Yann L, et al. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pp. 1137–1144, 2007.
Riedmiller & Braun (1993) Riedmiller, Martin and Braun, Heinrich. A direct adaptive method for faster backpropagation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pp. 586–591. IEEE, 1993.
Seide et al. (2014) Seide, Frank, Fu, Hao, Droppo, Jasha, Li, Gang, and Yu, Dong. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
Shazeer et al. (2017) Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc, Hinton, Geoffrey, and Dean, Jeff. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Srivastava et al. (2014) Srivastava, Nitish, Hinton, Geoffrey E, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
Sun et al. (2008) Sun, Xu, Morency, Louis-Philippe, Okanohara, Daisuke, and Tsujii, Jun’ichi. Modeling latent-dynamic in shallow parsing: A latent conditional model with improved inference. 2008.
Sun et al. (2010) Sun, Xu, Gao, Jianfeng, Micol, Daniel, and Quirk, Chris. Learning phrase-based spelling error models from clickthrough data. In ACL, 2010.
Sun et al. (2012) Sun, Xu, Wang, Houfeng, and Li, Wenjie. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In ACL, 2012.
Sun et al. (2017) Sun, Xu, Ren, Xuancheng, Ma, Shuming, and Wang, Houfeng. meprop: Sparsified back propagation for accelerated deep learning with reduced overfitting. In ICML, 2017.
Taigman et al. (2014) Taigman, Yaniv, Yang, Ming, Ranzato, Marc’Aurelio, and Wolf, Lior. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708, 2014.
Tollenaere (1990) Tollenaere, Tom. Supersab: fast adaptive back propagation with good scaling properties. Neural networks, 3(5):561–573, 1990.
Wiesler & Ney (2011) Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Advances in Neural Information Processing Systems, pp. 657–665, 2011.