1 Introduction
Training neural networks tends to be time-consuming (Jean et al., 2015; Poultney et al., 2007; Seide et al., 2014), especially for architectures with a large number of learnable model parameters. An important reason why neural network learning is typically slow is that backpropagation requires computing the full gradient and updating all parameters in each learning step (Sun et al., 2017). As deep networks with massive numbers of parameters become more prevalent, more and more effort is devoted to accelerating backpropagation. Among existing efforts, a prominent research line is sparse backpropagation (Sun et al., 2017; Wei et al., 2017; Zhu et al., 2018), which sparsifies the full gradient vector to achieve significant savings in computational cost.
One effective solution for sparse backpropagation is top-$k$ sparseness, which keeps only the $k$ elements with the largest magnitude in the gradient vector and backpropagates them across layers. For instance, meProp (Sun et al., 2017) employs top-$k$ sparseness to compute only a very small but critical portion of the gradient information and update the corresponding parameters of the linear transformation. Going a step further, Wei et al. (2017) implement top-$k$ sparseness for backpropagation in convolutional neural networks. Experimental results demonstrate that these methods can significantly accelerate the backpropagation process. However, despite its success in saving computational cost, top-$k$ sparseness for backpropagation still suffers from some intractable drawbacks, elaborated as follows.
On the theoretical side, the theoretical characteristics of sparse backpropagation, and of top-$k$ sparseness in particular (Sun et al., 2018, 2017; Wei et al., 2017), have not been explored. Most previous work focuses on empirical explanations rather than providing rigorous theoretical guarantees. Towards filling this gap, we first present a unified sparse backpropagation framework, of which some existing methods (Sun et al., 2017; Wei et al., 2017) can be proven to be special cases. Furthermore, we analyze the theoretical characteristics of the proposed framework, which provides theoretical explanations for related work (Sun et al., 2018, 2017; Wei et al., 2017). The analysis shows that when applied to a multilayer perceptron, the proposed framework essentially performs gradient descent with an estimated gradient that is similar enough to the true gradient, which leads to convergence in probability under certain conditions.
On the empirical side, we find that top-$k$ sparseness for backpropagation tends to lose the information contained in unpropagated gradients. Although it propagates the most crucial gradient information by keeping only the elements with the largest magnitude in the gradient vector, the unpropagated gradient may also contain a certain amount of useful information. Such information loss usually leads to adverse effects such as poor stability in model performance (model performance here refers to task-specific evaluation scores, such as accuracy on classification tasks). To remedy this, we propose memorized sparse backpropagation (MSBP), which stores unpropagated gradients in memory for the next learning step while propagating a critical portion of the gradient information. Compared to previous work (Sun et al., 2017; Wei et al., 2017), the proposed MSBP alleviates the information loss with the memory mechanism and thus significantly improves model performance. To sum up, the main contributions of this work are twofold:

• We present a unified sparse backpropagation framework and prove that some existing methods (Sun et al., 2017; Wei et al., 2017) are special cases of it. In addition, we analyze the theoretical characteristics of the proposed framework in detail to provide theoretical guarantees for related work.

• We propose memorized sparse backpropagation (MSBP), which alleviates the information loss by storing unpropagated gradients in memory for the next learning step. Experiments demonstrate that our approach effectively alleviates the information loss while achieving comparable acceleration.
2 Preliminary
This section presents some preliminaries. Given a dataset $\mathcal{D}$, the training loss of an input instance $(x, y)$ is denoted $L(x, y; \theta)$, where $\theta$ denotes the learnable model parameters and the loss function is, for example, the logistic loss. Further, the training loss on the whole dataset is denoted $F(\theta)$. We write the angle between two vectors $u$ and $v$ as $\angle(u, v)$.
Definition 1 (Convex-smooth angle).
If the training loss $F$ on the dataset $\mathcal{D}$ is strongly convex and smooth with respect to the parameter vector $\theta$, the convex-smooth angle of $F$ is defined in terms of its strong convexity and smoothness constants (the condition in this definition always holds; see Appendix D.2 for details).
Definition 2 (Gradient estimation angle).
For any parameter vector $\theta$ and training loss (on a single instance or on the whole dataset), we use $g(\theta)$ to denote an estimation of the true gradient. The gradient estimation angle of $\theta$ is then defined as the angle between the estimated gradient $g(\theta)$ and the true gradient. (Once both the training loss and the gradient estimation method are fixed, the gradient estimation angle depends only on $\theta$.)
Definition 3 (Sparsifying function).
Given an integer $k$, the function $s(\cdot)$ maps an input vector $v \in \mathbb{R}^n$ to $s(v) = m \odot v$, where $m \in \{0,1\}^n$ is a binary vector, determined by $v$, containing exactly $k$ ones and $n-k$ zeros. If $s$ has this form for any input $v$, we call $s$ a sparsifying function and define its sparse ratio as $k/n$.
Definition 4 ($\mathrm{top}_k$).
Given an integer $k$, for a vector $v \in \mathbb{R}^n$ with $k \le n$, the function $\mathrm{top}_k(v)$ keeps the $i$-th element of $v$ if $|v_i|$ is among the $k$ largest magnitudes in $v$ and sets it to zero otherwise. In other words, $\mathrm{top}_k$ preserves only the $k$ elements with the largest magnitude in the input vector.
It is easy to verify that $\mathrm{top}_k$ is a special sparsifying function (see Appendix D.3).
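To make Definitions 3 and 4 concrete, the following is a minimal NumPy sketch of the $\mathrm{top}_k$ operator; the function name and the use of np.argpartition are our own choices, not part of the original paper.

```python
import numpy as np

def topk_sparsify(v: np.ndarray, k: int) -> np.ndarray:
    """Keep the k entries of v with the largest magnitude and zero out the rest.

    This realizes top_k(v) = m * v, where m is a binary mask with exactly k ones,
    i.e. a sparsifying function with sparse ratio k / len(v) (Definition 3).
    """
    assert 1 <= k <= v.size
    mask = np.zeros_like(v, dtype=float)
    keep = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    mask[keep] = 1.0
    return mask * v

# Example: keep the 2 largest-magnitude entries of a 5-dimensional gradient.
g = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
print(topk_sparsify(g, 2))  # only -3.0 and 2.0 survive; all other entries become 0
```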
3 A Unified Sparse Backpropagation Framework
This section presents a unified framework for sparse backpropagation, which can be used to explain some existing representative approaches (Sun et al., 2017; Wei et al., 2017). We first define the EGD algorithm in Section 3.1 and then formally introduce the proposed framework in Section 3.2.
3.1 Estimated Gradient Descent
Here we introduce the definition of the estimated gradient descent algorithm, which provides a framework for analyzing the convergence of sparse backpropagation.
Definition 5 (EGD).
Suppose $F(\theta)$ is the training loss defined on the dataset $\mathcal{D}$ and $\theta$ is the parameter vector to learn. The estimated gradient descent (EGD) algorithm adopts the following parameter update:

$\theta_{t+1} = \theta_t - \eta_t\, g(\theta_t),$   (1)

where $\theta_t$ is the parameter at timestep $t$, $\eta_t$ is the learning rate, and $g(\theta_t)$ is an estimation of the true gradient used for the parameter update.
Some existing optimizers can be regarded as special cases of EGD. For instance, when $g(\theta_t)$ is defined as the true gradient, EGD is simply the gradient descent (GD) algorithm. Several other optimizers (e.g., Adam (Kingma and Ba, 2014) and AdaDelta (Zeiler, 2012)) can also be expressed as instances of EGD with different gradient estimates. More importantly, sparse backpropagation in essence employs an estimated gradient to approximate the true gradient for model training, and can therefore also be regarded as a special case of EGD. This connection is the cornerstone of the subsequent theoretical analysis of sparse backpropagation.
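As an illustration of Definition 5, here is a minimal sketch of the EGD update loop in NumPy; the function names (full_grad, estimate, lr) are hypothetical names introduced for this example.

```python
import numpy as np

def egd(theta0, full_grad, estimate, lr, steps):
    """Estimated gradient descent: theta_{t+1} = theta_t - lr(t) * g_t,
    where g_t = estimate(full_grad(theta_t)) approximates the true gradient (Eq. 1)."""
    theta = theta0.copy()
    for t in range(steps):
        g = estimate(full_grad(theta))  # estimated gradient g(theta_t)
        theta = theta - lr(t) * g       # EGD parameter update
    return theta

# Plain gradient descent is recovered when the estimator is the identity;
# top-k sparse backpropagation corresponds to estimate = topk_sparsify(., k).
full_grad = lambda th: 2.0 * th                     # gradient of F(theta) = ||theta||^2
theta = egd(np.ones(4), full_grad, estimate=lambda g: g,
            lr=lambda t: 0.1, steps=200)
print(theta)  # approaches the global minimum at 0
```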
In this work, we theoretically prove that, once the gradient estimation angle of the parameter satisfies certain conditions at each timestep $t$, the EGD algorithm converges to the global minimum under some reasonable assumptions. This conclusion is stated in Theorem 1. Readers can refer to Appendix E for the detailed proofs, and we discuss the convergence speed of EGD in Appendix G.
Theorem 1 (Convergence of EGD).
Suppose $\theta$ is the parameter vector, $\theta^*$ is the global minimum, and the training loss $F$ defined on the dataset $\mathcal{D}$ is strongly convex and smooth. When the EGD algorithm is applied to minimize $F$, if the gradient estimation angle of $\theta_t$ is smaller than the convex-smooth angle of $F$ at every timestep $t$, then there exist learning rates $\eta_t$ such that $\theta_t$ converges to the global minimum $\theta^*$.

For a given training loss $F$, the convex-smooth angle is a fixed value. Therefore, the above theorem demonstrates that the EGD algorithm converges to the global minimum when the angle between the estimated gradient and the true gradient is small enough at each timestep.
3.2 Proposed Unified Sparse Backpropagation
In this section, we present a unified sparse backpropagation framework via the sparsifying function (Definition 3). The core idea is that, when performing backpropagation, the gradients propagated from the next layer are sparsified to achieve acceleration. Algorithm 1 presents the pseudocode of our unified sparse backpropagation framework, which is described in detail as follows.
Considering that a computation unit composed of one linear transformation and one activation function is the cornerstone of various neural networks, we elaborate our unified sparse backpropagation framework on such a computational unit:

$y = \sigma(Wx),$   (3)

where $x \in \mathbb{R}^m$ is the input vector, $W \in \mathbb{R}^{n \times m}$ is the parameter matrix, and $\sigma$ denotes a pointwise activation function. The original backpropagation computes the gradients of the parameter matrix and the input vector as follows:

$\dfrac{\partial L}{\partial W} = \left(\dfrac{\partial L}{\partial y} \odot \sigma'(Wx)\right) x^{\top}, \qquad \dfrac{\partial L}{\partial x} = W^{\top}\!\left(\dfrac{\partial L}{\partial y} \odot \sigma'(Wx)\right).$   (4)
In the proposed unified sparse backpropagation framework, a sparsifying function $s(\cdot)$ (Definition 3) is used to sparsify the gradient $\partial L / \partial y$ propagated from the next layer, and the sparsified gradient is then propagated through the computation graph according to the chain rule. Note that $\partial L / \partial y$ may itself be an estimated gradient passed from the next layer. The gradient estimations are finally performed as follows:

$\dfrac{\partial L}{\partial W} \approx \left(s\!\left(\dfrac{\partial L}{\partial y}\right) \odot \sigma'(Wx)\right) x^{\top}, \qquad \dfrac{\partial L}{\partial x} \approx W^{\top}\!\left(s\!\left(\dfrac{\partial L}{\partial y}\right) \odot \sigma'(Wx)\right).$   (5)
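As a concrete illustration, the following NumPy sketch implements the sparsified backward pass of Eq. (5) for the unit $y = \sigma(Wx)$; the function and argument names are ours, and the sparsifying function is passed in as an argument (e.g., $\mathrm{top}_k$).

```python
import numpy as np

def sparse_linear_backward(x, W, grad_y, sparsify, act_grad):
    """Estimated gradients of Eq. (5) for y = sigma(W x).

    x        : input vector, shape (m,)
    W        : parameter matrix, shape (n, m)
    grad_y   : (possibly already estimated) gradient dL/dy from the next layer, shape (n,)
    sparsify : a sparsifying function, e.g. lambda v: topk_sparsify(v, k)
    act_grad : sigma'(W x), the pointwise derivative of the activation, shape (n,)
    """
    g = sparsify(grad_y) * act_grad   # sparsified gradient at the pre-activation
    grad_W = np.outer(g, x)           # dL/dW ~ g x^T; rows with g_i = 0 stay zero
    grad_x = W.T @ g                  # dL/dx ~ W^T g; zero entries of g contribute nothing
    return grad_W, grad_x

# Example with a ReLU unit, keeping only the top-2 entries of the upstream gradient.
rng = np.random.default_rng(0)
W, x = rng.normal(size=(4, 3)), rng.normal(size=3)
z, grad_y = W @ x, rng.normal(size=4)
top2 = lambda v: np.where(np.abs(v) >= np.sort(np.abs(v))[-2], v, 0.0)
gW, gx = sparse_linear_backward(x, W, grad_y, top2, act_grad=(z > 0).astype(float))
```

Because the sparsified gradient has at most $k$ nonzero entries, only $k$ rows of the weight gradient need to be formed and only $k$ rows of $W$ are read when computing the input gradient, which is where the computational saving comes from.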
Since $\mathrm{top}_k$ is a special sparsifying function (see Section 2), some existing approaches based on top-$k$ sparseness (e.g., meProp (Sun et al., 2017) and meProp-CNN (Wei et al., 2017)) can be regarded as special cases of our framework. Depending on the specific task, the sparsifying function can be instantiated differently to improve model performance.
However, an intractable challenge for sparse backpropagation is the lack of theoretical analysis. To remedy this, we analyze the theoretical characteristics of the proposed framework. Using the fact that sparse backpropagation is a special case of EGD (Section 3.1), we show that when applied to a multilayer perceptron (MLP), the proposed framework converges to the global minimum in probability under several reasonable conditions, which is formalized in Theorem 2.
Theorem 2 (Convergence of unified sparse backpropagation).
For an ideal dataset $\mathcal{D}$ (i.e., $|\mathcal{D}|$ is large enough and the data instances are independent and identically distributed), if the training loss is strongly convex and smooth, then when the unified sparse backpropagation framework is used to train an MLP (under several mild constraints; see Appendix D.5 for details), there exist a sparse ratio $p$ and learning rates $\eta_t$ such that the parameters converge in probability to the global minimum if the sparse ratio of every sparsifying function is set to $p$.
The crucial idea in proving the above theorem is to show that the angle between the sparse gradient and the true full gradient is small enough for sparse backpropagation. Under this circumstance, the condition on the gradient estimation angle in Theorem 1 is satisfied, leading to the desired convergence in probability. Readers can refer to Appendix F for the detailed proofs.
Although Theorem 2 is restricted to the multilayer perceptron architecture and several additional conditions, it provides a degree of theoretical guarantee for the proposed unified sparse backpropagation framework. These theoretical analyses are valuable because they help explain the effectiveness of not only our framework but also some existing approaches (Sun et al., 2017; Wei et al., 2017) on the theoretical side.
4 Memorized Sparse Backpropagation
Although traditional sparse backpropagation achieves significant acceleration by keeping only part of the elements of the full gradient, the unpropagated gradient may also contain a certain amount of useful information. We empirically find that such information loss tends to bring negative effects (e.g., performance degradation in extremely sparse scenarios and poor stability in performance). To remedy this, we propose memorized sparse backpropagation (MSBP), which alleviates the information loss by storing unpropagated gradients in memory for the next learning step.
The core component of the proposed MSBP is the memory mechanism, which enables MSBP to store unpropagated gradients for the next learning step while propagating a critical portion of the gradient information. Formally, different from the unified sparse backpropagation in Section 3.2, we adopt the following gradient estimations:

$\dfrac{\partial L}{\partial W} \approx \left(s\!\left(\dfrac{\partial L}{\partial y} + m\right) \odot \sigma'(Wx)\right) x^{\top}, \qquad \dfrac{\partial L}{\partial x} \approx W^{\top}\!\left(s\!\left(\dfrac{\partial L}{\partial y} + m\right) \odot \sigma'(Wx)\right),$   (6)

where $s(\cdot)$ is a given sparsifying function and $m$ is the memory storing the unpropagated gradients from the last learning step. The memory is then updated with the unpropagated gradient information of the current learning step. Formally,

$m \leftarrow \lambda \left(\left(\dfrac{\partial L}{\partial y} + m\right) - s\!\left(\dfrac{\partial L}{\partial y} + m\right)\right),$   (7)

where $\lambda$ is the memory ratio, a hyperparameter controlling how much of the unpropagated gradient is memorized. When $\lambda$ is set to 0, the proposed MSBP degenerates to the unified sparse backpropagation that completely discards unpropagated gradients. Algorithm 2 presents the pseudocode of MSBP. Before model training begins, the memory is initialized to the zero vector.
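The following Python sketch shows one way to implement the memory mechanism of Eqs. (6)-(7) for a single layer; the class and variable names are ours, and the placement of the memory ratio follows our reading of the update rule above.

```python
import numpy as np

class MSBPMemory:
    """Per-layer memory for memorized sparse backpropagation (MSBP)."""

    def __init__(self, dim, memory_ratio):
        self.m = np.zeros(dim)            # memory, initialized to the zero vector
        self.memory_ratio = memory_ratio  # the memory-ratio hyperparameter

    def sparsify_with_memory(self, grad_y, sparsify):
        candidate = grad_y + self.m       # inject gradients remembered from the last step
        propagated = sparsify(candidate)  # e.g. top-k; used in place of grad_y in Eq. (5)
        # Keep a scaled copy of whatever was NOT propagated for the next step (Eq. 7).
        self.m = self.memory_ratio * (candidate - propagated)
        return propagated
```

With memory_ratio set to 0 the stored memory stays zero and the procedure reduces to the unified sparse backpropagation of Section 3.2, matching the degenerate case described above.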
Table 4: Time and memory complexity of a linear layer trained with standard backpropagation (Linear), traditional SBP, and the proposed MSBP (see Appendix A for the analysis).
Intuitively, by storing unpropagated gradients with the memory mechanism, the information loss in backpropagation due to sparseness can be alleviated. The experiments also illustrate that the proposed MSBP is more advantageous in various respects than approaches that completely discard unpropagated gradients. In fact, we find that for MSBP the angle between the sparse gradient and the true full gradient tends to be small, and moreover smaller than the corresponding angle in traditional sparse backpropagation. According to the theoretical analysis in Section 3, a smaller gradient estimation angle is more conducive to model convergence. This observation explains the effectiveness of our MSBP to a certain extent on the theoretical side. Readers can refer to Section 5.4 for a more detailed analysis.
Comparison to sparsified SGD with memory. A work that looks similar to ours is sparsified SGD with memory (Stich et al., 2018). It computes full gradients in backpropagation and sparsifies the gradients to be communicated in a distributed system. Therefore, unlike our method, which sparsifies gradients within backpropagation, the backpropagation process in Stich et al. (2018) remains unchanged and is not accelerated. Besides, Stich et al. (2018) is an optimization approach that applies only to distributed systems, while our MSBP is a backpropagation framework that applies to both distributed and centralized systems.
5 Experiments
Following meProp (Sun et al., 2017), we adopt $\mathrm{top}_k$ as the sparsifying function. For simplicity, we use SBP to denote the traditional sparse backpropagation that completely discards unpropagated gradients. Table 4 presents a comparison of the time and memory complexity of traditional SBP and our proposed MSBP. Readers can refer to Appendix A for the detailed discussion.
5.1 Evaluation Tasks
We evaluate the proposed MSBP on several typical benchmark tasks in computer vision as well as natural language processing. The baselines used for comparison on each task are also introduced. Due to page limitations, we include all details of the datasets and experimental settings in Appendix B.
MNIST image recognition (MNIST): This task aims to recognize the numerical digit (0-9) in each image, and the evaluation metric is classification accuracy. We use the standard MNIST handwritten digit dataset (LeCun et al., 1998) and adopt a 3-layer MLP as the base model.

CIFAR-10 image recognition (CIFAR-10): Similar to MNIST, this task also performs image classification with accuracy as the evaluation metric. We use the standard CIFAR-10 dataset (Krizhevsky and Hinton, 2009) and implement PreAct-ResNet18 (He et al., 2016) as the base model.
Transition-based dependency parsing (Parsing): Following previous work, the dataset is the English Penn TreeBank (PTB; Marcus et al., 1993) and the evaluation metric is the unlabeled attachment score (UAS). We implement a parser using an MLP, following Chen and Manning (2014), as the base model.
Part-of-speech tagging (POS-Tag): We use the standard benchmark dataset (Collins, 2002) derived from the Penn Treebank corpus, and the evaluation metric is per-word accuracy. We adopt a 2-layer bidirectional LSTM (BiLSTM) as our base model.
Polarity classification and subjectivity classification: Both tasks perform sentence classification, with accuracy as the evaluation metric. We use the dataset constructed by Pang and Lee (2004) and implement TextCNN (Kim, 2014) as the base model.
5.2 Experimental Results
The experimental results on the three tasks of MNIST, Parsing, and POS-Tag are shown in Table 2. We conduct an in-depth analysis of the results from the following aspects.
Improving model performance. As shown in Table 2, the proposed MSBP achieves the best performance on all tasks. For instance, on the parsing task, MSBP achieves a 0.44% absolute improvement over traditional SBP and also outperforms the base model by 0.65% in UAS. Considering that our ultimate goal is to accelerate neural network learning while achieving comparable model performance, such results are promising. Compared to traditional SBP (Sun et al., 2017; Wei et al., 2017), MSBP employs the memory mechanism to store unpropagated gradients. This reduces the information loss during backpropagation, leading to improvements in model performance.
Accelerating backpropagation. In contrast to traditional SBP, our MSBP memorizes unpropagated gradients to alleviate information loss. A potential concern is that introducing a memory of unpropagated gradients may impair the acceleration of backpropagation. Table 4 shows that MSBP has the same time complexity as traditional SBP, and we further verify this conclusion experimentally. As shown in Table 2, both traditional SBP and our proposed MSBP achieve substantial acceleration of backpropagation, and the latter shows only a negligible increase in computational cost compared to the former. This illustrates that our MSBP can achieve comparable acceleration while improving model performance.
Applicability to extremely sparse scenarios. In sparse backpropagation, the sparse ratio controls the trade-off between acceleration and model performance. In pursuit of very large acceleration, the sparse ratio tends to be set extremely small in real scenarios. However, we empirically find that traditional SBP usually suffers a significant degradation in model performance in this case. Table 2 shows that traditional SBP brings a 1.90% reduction in accuracy on MNIST image classification in the extremely sparse setting. The reason is that with a very small sparse ratio, only a very small amount of gradient information is propagated, so serious information loss occurs during backpropagation, leading to a significant degradation in model performance. In contrast, the results show that our MSBP remains effective in these extremely sparse scenarios. With the memory mechanism, the currently unpropagated gradient information is stored for the next learning step, reducing the information loss caused by sparseness.
5.3 Further InDepth Analysis
In this section, we conduct further analysis of the proposed approach and experimental results.
Universality to base network architectures. Here we compare traditional SBP and the proposed MSBP on CNN base models to verify the universality of our approach. Results show that traditional SBP improves model performance on shallow CNNs (sentence classification) but fails on deeper networks, e.g., PreAct-ResNet18 (image classification). As shown in Table 3, traditional SBP suffers a significant degradation of classification accuracy on CIFAR-10 image classification. In contrast, the proposed MSBP improves the performance of the base model on both sentence classification and image classification. This demonstrates that our MSBP is universal: it applies not only to different types of base networks but also to models of different depths.
Improvement in model stability.
We find that our MSBP is also effective in improving model stability, meaning that it helps reduce the variance of model performance across repeated experiments. To verify this, for each setting we repeat 20 experiments on the MNIST task with different random seeds. The mean and standard deviation of the accuracy over the repeated runs are presented in Figure 3. Results show that traditional SBP suffers from poor stability in repeated experiments, with a noticeably larger standard deviation than the base model (MLP). In contrast, all experiments conducted with MSBP show better stability than traditional SBP, and most of them are even more stable than the base model.

Insensitivity to hyperparameters. As depicted in Figure 3, for the same sparse ratio, MSBP performs better than traditional SBP (Sun et al., 2017) regardless of the choice of the memory ratio. The difference in accuracy between MSBP and traditional SBP ranges from 0.4% to 0.8%, while the difference between MSBP runs with different memory-ratio settings lies between 0.1% and 0.3%. This illustrates that the performance of MSBP is not very sensitive to the memory ratio compared to the improvement gained by the memory mechanism.
5.4 Why MSBP Works
As analyzed in Section 3, for sparse backpropagation, a smaller gradient estimation angle better guarantees the convergence of the approach. Therefore, we measure the averaged gradient estimation angle in the proposed MSBP and in traditional SBP to empirically explain the effectiveness of our method. As shown in Figure 3, a higher sparse ratio results in smaller gradient estimation angles, and for the same sparse ratio, the gradient estimation angles of MSBP are smaller than those of SBP. This illustrates that, by employing the memory mechanism to store unpropagated gradients, the sparse gradient calculated by our approach gives a more accurate estimate of the true gradient, which is also consistent with the results in Figure 3. In addition, the gap between the gradient estimation angles of SBP and MSBP tends to be larger for smaller sparse ratios, because SBP then loses more unpropagated gradient information; under these circumstances our proposal improves performance the most.
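For completeness, the gradient estimation angle reported in this section can be computed as the arccosine of the cosine similarity between the sparse and the full gradient; the snippet below is a minimal sketch with names of our own choosing.

```python
import numpy as np

def gradient_estimation_angle(g_est, g_true):
    """Angle (radians) between an estimated gradient and the true gradient."""
    cos = g_est @ g_true / (np.linalg.norm(g_est) * np.linalg.norm(g_true))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Example: a top-2 sparsified gradient is already well aligned with the full gradient.
g_true = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
g_sparse = np.where(np.abs(g_true) >= 2.0, g_true, 0.0)
print(np.degrees(gradient_estimation_angle(g_sparse, g_true)))  # roughly 9 degrees
```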
5.5 Related Systems of Evaluation Tasks
Here we present evaluation scores of related systems on each task. Our MSBP could also be combined with more complicated base models to advance these scores further, but that is not the focus of this work. We therefore adopt MLP, LSTM, and CNN as the base models, given their crucial roles in deep learning.
For MLP, MLP-based approaches achieve around 98% accuracy on MNIST (Cireşan et al., 2010; LeCun et al., 1998), while our method achieves 98.23%. For LSTM, the accuracy reported by existing approaches on POS-Tag lies between 97.2% and 97.4% (Collobert et al., 2011; Huang et al., 2015; Tsuruoka et al., 2011), whereas our method achieves 97.50%. For the shallow CNN model, TextCNN (Kim, 2014) reports around 81.3% and 93.4% on polarity classification and subjectivity classification respectively, while our method achieves around 81.5% and 93.8%. For the deep CNN model, the state-of-the-art approach reaches 96.53% accuracy on CIFAR-10 (Graham, 2014); prevalent ResNet architectures achieve around 93%-94%, and models based on PreAct-ResNet obtain around 95% (He et al., 2016), whilst our method achieves 94.92%.
6 Related Work
A prominent research line for accelerating backpropagation is sparse backpropagation, which strives to save computational cost by sparsifying the full gradient vector. For instance, a hardware-oriented structural sparsifying method (Zhu et al., 2018) was proposed for LSTM, which enforces a fixed level of sparsity in the LSTM gate gradients, yielding block-based sparse gradient matrices. Sun et al. (2017) proposed meProp for the linear transformation, which employs top-$k$ sparseness to compute only a small but critical portion of the gradients and update the corresponding model parameters. Furthermore, Wei et al. (2017) extended meProp to more complicated models such as CNNs, so as to achieve significant computational benefits.
There also exist several lines of work other than sparse backpropagation for accelerating network learning. Tollenaere (1990) proposed an adaptive acceleration strategy for backpropagation, while Riedmiller and Braun (1993) performed local adaptation of parameter updates based on the error function. To speed up the computation of the softmax layer, Jean et al. (2015) utilized importance sampling to make training more efficient. Srivastava et al. (2014) presented dropout, which improves training speed and reduces overfitting by randomly dropping units from the neural network during training. From the perspective of distributed systems, Seide et al. (2014) proposed a one-bit quantization mechanism to reduce the communication cost between multiple machines.

7 Conclusion
This work presents a unified sparse backpropagation framework, of which some previous representative approaches can be regarded as special cases. In addition, the theoretical characteristics of the proposed framework are analyzed in detail to provide theoretical guarantees for the relevant methods. Going a step further, we propose memorized sparse backpropagation (MSBP), which alleviates the information loss in traditional sparse backpropagation by using a memory mechanism to store unpropagated gradients. The experiments demonstrate that the proposed MSBP achieves better model performance and stability while achieving comparable acceleration.
References
Chen and Manning (2014). A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014, pp. 740–750.
Cireşan et al. (2010). Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12), pp. 3207–3220.
Collins (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of EMNLP 2002.
Collobert et al. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, pp. 2493–2537.
Graham (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.
He et al. (2016). Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
Huang et al. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Jean et al. (2015). On using very large target vocabulary for neural machine translation. In Proceedings of ACL-IJCNLP 2015, pp. 1–10.
Kim (2014). Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014, pp. 1746–1751.
Kingma and Ba (2014). Adam: a method for stochastic optimization. CoRR, abs/1412.6980.
Krizhevsky and Hinton (2009). Learning multiple layers of features from tiny images. Technical report.
LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp. 2278–2324.
Marcus et al. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), pp. 313–330.
Pang and Lee (2004). A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL 2004, pp. 271–278.
Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of EMNLP 2014, pp. 1532–1543.
Poultney et al. (2007). Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pp. 1137–1144.
Qian (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), pp. 145–151.
Riedmiller and Braun (1993). A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, pp. 586–591.
Seide et al. (2014). 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), pp. 1929–1958.
Stich et al. (2018). Sparsified SGD with memory. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 4452–4463.
Sun et al. (2018). Training simplification and model simplification for deep learning: a minimal effort back propagation method. IEEE Transactions on Knowledge and Data Engineering.
Sun et al. (2017). meProp: sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pp. 3299–3308.
Tollenaere (1990). Fast adaptive back propagation with good scaling properties. Neural Networks.
Tsuruoka et al. (2011). Learning with lookahead: can history-based models rival globally optimized models? In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL), pp. 238–246.
Wei et al. (2017). Minimal effort back propagation for convolutional neural networks. arXiv preprint arXiv:1709.05804.
Zeiler (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Zhu et al. (2018). Structurally sparsified backward propagation for faster long short-term memory training. CoRR, abs/1806.00512.
Appendix A Discussion of Complexity Information
Following meProp (Sun et al., 2017), we adopt $\mathrm{top}_k$ as the sparsifying function. For simplicity, we use SBP to denote the traditional sparse backpropagation that completely discards unpropagated gradients; our proposed memorized sparse backpropagation is denoted MSBP. Table 4 presents a comparison of the time and memory complexity of the two approaches.
Time complexity.
The backpropagation of the linear layer mainly computes the gradients of the parameter matrix and the input vector. The application of SBP consists of two steps: finding the top-$k$ dimensions of the incoming gradient using a maximum heap, and backpropagating only those $k$ dimensions of the gradients. The extra time cost of MSBP comes from adding the memory into the incoming gradient and updating the memory; both operations touch each gradient dimension only once and are negligible compared to the rest of the backward pass.
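The two SBP steps described above can be sketched as follows; this is an illustrative NumPy/heapq implementation with our own names, not the authors' code, but it reflects the stated structure: selecting the top-$k$ dimensions with a size-$k$ heap and backpropagating only those dimensions.

```python
import heapq
import numpy as np

def topk_indices(grad, k):
    """Indices of the k largest-magnitude entries via a heap (O(n log k))."""
    return [i for _, i in heapq.nlargest(k, ((abs(g), i) for i, g in enumerate(grad)))]

def sbp_linear_backward(x, W, grad_y, k):
    """Backpropagate only k of the n output dimensions: O(k*m) instead of O(n*m)."""
    idx = topk_indices(grad_y, k)
    grad_W = np.zeros_like(W)
    grad_W[idx] = np.outer(grad_y[idx], x)  # only k rows of dL/dW are computed
    grad_x = W[idx].T @ grad_y[idx]         # only k rows of W are read
    return grad_W, grad_x
```

For clarity the activation derivative is omitted in this sketch; it would multiply the incoming gradient elementwise as in Eq. (5).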
Memory complexity.
The analysis of memory complexity is similar. The backpropagation of the linear layer requires storing the gradients of the parameter matrix and the input vector. For traditional SBP, finding the top-$k$ dimensions of the incoming gradient with a maximum heap requires only a small amount of additional memory, while backpropagating the corresponding dimensions of the gradients requires no additional memory overhead. The extra memory cost of MSBP is the memory vector itself, which is negligible compared to the storage already required for the gradients.
Appendix B Experiment Details
B.1 Datasets
MNIST image recognition (MNIST)
The MNIST dataset (LeCun et al., 1998) consists of 60,000 training handwritten digit images and an additional 10,000 test images. The aim of the MNIST task is to recognize the numerical digit (0-9) in each image. We split the training images into 5,000 development images and 55,000 training images. The evaluation metric is classification accuracy. We adopt a 3-layer MLP model as the base model.
CIFAR-10 image recognition (CIFAR-10)
Similar to MNIST, the goal of this task is to predict the category of each image. We conduct experiments on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), which consists of 50,000 training images and an additional 10,000 test images. The evaluation metric is accuracy, and PreAct-ResNet18 (He et al., 2016) is implemented as the base model.
Transition-based dependency parsing (Parsing)
In this task, we use the English Penn TreeBank (PTB; Marcus et al., 1993) for experiments. We adopt sections 2-21, consisting of 39,832 sentences and 1,900,056 transition examples, as the training set. Each transition example contains a parsing context and its optimal transition action. The development set is section 22, composed of 1,700 sentences and 80,234 transition examples. The final test set is section 23, consisting of 2,416 sentences and 113,368 transition examples. We adopt the unlabeled attachment score (UAS) as the evaluation metric. A parser using an MLP, following Chen and Manning (2014), is implemented as the base model.
Part-of-speech tagging (POS-Tag)
In this task, we use the standard benchmark dataset derived from the Penn Treebank corpus (Collins, 2002). We adopt sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples) and sections 22-24 for testing (5,462 examples). The evaluation metric is per-word accuracy. We employ a 2-layer bidirectional LSTM (BiLSTM) as the base model. In addition, we use 100-dimensional pretrained GloVe word embeddings (Pennington et al., 2014).
Polarity classification and subjectivity classification
B.2 Experimental Settings
MNIST, Parsing, and POS-Tag
We train on the three tasks of MNIST, Parsing, and POS-Tag with task-specific numbers of epochs, batch sizes, and dropout probabilities. We use the Adam optimizer (Kingma and Ba, 2014) with the same learning rate on all three tasks.
CIFAR-10
For CIFAR-10, we report the averaged accuracy on the test set over all epochs. We conduct experiments with two optimizers, SGD and Adam, on this task. For SGD, we apply momentum (Qian, 1999) with weight decay and use a multi-step learning rate scheduler that decays the learning rate at preset milestones. Other hyperparameters and optimization techniques on CIFAR-10 are the same as those in He et al. (2016).
Polarity classification and subjectivity classification
For these two tasks, we report the averaged accuracy on the test set over all epochs. For the base model (TextCNN), the filter window sizes are 3, 4, and 5, each with the same number of feature maps. We conduct experiments with the Adam optimizer.
Appendix C Review of Definitions and Theorems in the Paper
In this section, we review some important definitions and theorems introduced in the paper.
C.1 Definitions
This section presents several definitions. Given a dataset $\mathcal{D}$, the training loss of an input instance $(x, y)$ is denoted $L(x, y; \theta)$, where $\theta$ denotes the learnable model parameters and the loss function is, for example, the logistic loss. Further, the training loss on the whole dataset is denoted $F(\theta)$. We write the angle between two vectors $u$ and $v$ as $\angle(u, v)$.
Definition 6 (Convex-smooth angle).
If the training loss $F$ on the dataset $\mathcal{D}$ is strongly convex and smooth with respect to the parameter vector $\theta$, the convex-smooth angle of $F$ is defined in terms of its strong convexity and smoothness constants.
Definition 7 (Gradient estimation angle).
For any parameter vector $\theta$ and training loss (on a single instance or on the whole dataset), we use $g(\theta)$ to denote an estimation of the true gradient. The gradient estimation angle of $\theta$ is then defined as the angle between the estimated gradient $g(\theta)$ and the true gradient. (Once both the training loss and the gradient estimation method are fixed, the gradient estimation angle depends only on $\theta$.)
Definition 8 (Sparsifying function).
Given an integer $k$, the function $s(\cdot)$ maps an input vector $v \in \mathbb{R}^n$ to $s(v) = m \odot v$, where $m \in \{0,1\}^n$ is a binary vector, determined by $v$, containing exactly $k$ ones and $n-k$ zeros. If $s$ has this form for any input $v$, we call $s$ a sparsifying function and define its sparse ratio as $k/n$.
Definition 9 ($\mathrm{top}_k$).
Given an integer $k$, for a vector $v \in \mathbb{R}^n$ with $k \le n$, the function $\mathrm{top}_k(v)$ keeps the $i$-th element of $v$ if $|v_i|$ is among the $k$ largest magnitudes in $v$ and sets it to zero otherwise. In other words, $\mathrm{top}_k$ preserves only the $k$ elements with the largest magnitude in the input vector.
It is easy to verify that $\mathrm{top}_k$ is a special sparsifying function (see Appendix D.3).
Definition 10 (EGD).
Suppose $F(\theta)$ is the training loss defined on the dataset $\mathcal{D}$ and $\theta$ is the parameter vector to learn. The estimated gradient descent (EGD) algorithm adopts the following parameter update:

$\theta_{t+1} = \theta_t - \eta_t\, g(\theta_t),$   (8)

where $\theta_t$ is the parameter at timestep $t$, $\eta_t$ is the learning rate, and $g(\theta_t)$ is an estimation of the true gradient used for the parameter update.
C.2 Theorems
Theorem 3 (Convergence of EGD).
Suppose $\theta$ is the parameter vector, $\theta^*$ is the global minimum, and the training loss $F$ defined on the dataset $\mathcal{D}$ is strongly convex and smooth. When the EGD algorithm is applied to minimize $F$, if the gradient estimation angle of $\theta_t$ is smaller than the convex-smooth angle of $F$ at every timestep $t$, then there exist learning rates $\eta_t$ such that $\theta_t$ converges to the global minimum $\theta^*$.
Theorem 4 (Convergence of unified sparse backpropagation).
For an ideal dataset $\mathcal{D}$ (i.e., $|\mathcal{D}|$ is large enough and the data instances are independent and identically distributed), if the training loss is strongly convex and smooth, then when the unified sparse backpropagation framework is used to train an MLP (under several mild constraints; see Appendix D.5 for details), there exist a sparse ratio $p$ and learning rates $\eta_t$ such that the parameters converge in probability to the global minimum if the sparse ratio of every sparsifying function is set to $p$.
Appendix D Preparation and Lemmas
Here we introduce some key definitions and lemmas used throughout the appendix. All vectors and matrices are assumed to be over the real field. In the appendix, vectors and matrices are written in bold.
D.1 Vectors
We first introduce two vector-related lemmas.
Lemma D.1.
For any vectors , and , we have
Lemma D.2.
For matrix (), suppose
is a positive definite matrix, the eigenvalue decomposition of
is , () andis an orthogonal matrix. We define
and . If , for any dimension vectors and , we have

D.2 Loss Function
We say the loss function $F$ is $\mu$-strongly convex if $\nabla^2 F(\theta) \succeq \mu I$ for all $\theta$, where $I$ denotes the identity matrix. If the loss function $F$ is $\mu$-strongly convex, then for any vectors $\theta$ and $\theta'$ we have
(10)
(11)
We say the loss function $F$ is $L$-smooth if $\nabla^2 F(\theta) \preceq L I$ for all $\theta$, where $I$ denotes the identity matrix. If the loss function $F$ is $L$-smooth, then for any vectors $\theta$ and $\theta'$ we have
(12)
(13)
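For reference, a minimal LaTeX sketch of the standard first-order inequalities implied by $\mu$-strong convexity and $L$-smoothness, which Eqs. (10)-(13) presumably instantiate, is:

```latex
\begin{align}
  F(\theta') &\ge F(\theta) + \nabla F(\theta)^\top(\theta'-\theta)
               + \tfrac{\mu}{2}\,\lVert\theta'-\theta\rVert^2, \\  % mu-strong convexity
  F(\theta') &\le F(\theta) + \nabla F(\theta)^\top(\theta'-\theta)
               + \tfrac{L}{2}\,\lVert\theta'-\theta\rVert^2.       % L-smoothness
\end{align}
```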
For the loss function $F$, we define $\theta^*$ as its global minimum. If $F$ is smooth, then for any $\theta$ we have
(14)
(15)
(16)
From Eq. 12 and Eq. 13, we can see
(17)
in other words, . When , we have ; in other words, when we set and , the problem has a closed-form solution and is trivial. Therefore, we assume that in most cases .
Back to the definition of the convex-smooth angle: if the loss function is $\mu$-strongly convex and $L$-smooth, the convex-smooth angle of $F$ is well defined, which justifies the condition in the definition of the convex-smooth angle.
D.3 The $\mathrm{top}_k$ Function
We now verify that the $\mathrm{top}_k$ function is a special sparsifying function.
Given an integer $k$, for a vector $v \in \mathbb{R}^n$ with $k \le n$, the function $\mathrm{top}_k(v)$ keeps the $i$-th element of $v$ if $|v_i|$ is among the $k$ largest magnitudes in $v$ and sets it to zero otherwise. It is easy to verify that
(18)
(19)
In other words,
$\mathrm{top}_k(v) = m \odot v$, where $m$ is a binary vector, determined by $v$, with exactly $k$ ones and $n-k$ zeros.   (20)
Therefore, the $\mathrm{top}_k$ function is a special sparsifying function with sparse ratio $k/n$.
D.4 Linear Layer Trained with SBP
Consider a linear layer with one linear transformation and one increasing pointwise activation function:
$y = \sigma(Wx),$   (21)
where $x \in \mathbb{R}^m$ is the input sample, $W \in \mathbb{R}^{n \times m}$ is the parameter matrix, $m$ is the dimension of the input vector, $n$ is the dimension of the output vector, and $\sigma$ is an increasing pointwise activation function.
For a matrix $W$, we define a flattening function that stacks it into a vector in $\mathbb{R}^{nm}$ by concatenating its columns, where $W_{:,j}$ denotes the $j$-th column of $W$ and the semicolon denotes the concatenation of column vectors into one long column vector.
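Under the column-stacking reading described above, the flattening can be written as follows (the name $\mathrm{vec}$ is our own shorthand):

```latex
% Column-wise flattening of W in R^{n x m} into a vector of length nm:
\mathrm{vec}(\mathbf{W}) \;=\; \big[\,\mathbf{W}_{:,1};\ \mathbf{W}_{:,2};\ \dots;\ \mathbf{W}_{:,m}\,\big] \in \mathbb{R}^{nm}
```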
Assume and , then , when backpropagating
(22) 
Assume
and , then
(23) 
In the proposed unified sparse backpropagation framework, the sparsifying function (Definition 8) is used to sparsify the gradient propagated from the next layer, and the sparsified gradient is then propagated through the computation graph according to the chain rule. Note that this incoming gradient may itself be an estimated gradient passed from the next layer. The gradient estimations are finally performed as follows:
(24) 
in other words,
(25) 
We introduce a lemma:
Lemma D.3.
For a linear layer trained with SBP, the sparse ratio of the sparsifying function in SBP is . Denote . If is the loss of the MLP trained with SBP on this input instance and the input of this layer is , which satisfies , we use SBP to estimate and . Suppose is a positive definite matrix, the eigenvalue decomposition of is (), and is an orthogonal matrix. We define , and , . It is easy to verify that and because is a positive definite matrix and is increasing. If , , and , then we have
and
D.5 MLP Trained with SBP
Consider an MLP trained with SBP: it is a multilayer perceptron (MLP) in which every layer except the last is a linear layer trained with SBP as above, with the input of the MLP fed to the first layer and the output of the MLP produced by the last layer. The $i$-th layer of the MLP is defined as
(26)
where the activation function of every layer except the last is an increasing pointwise activation function. Note that the last layer is not a linear layer trained with SBP; therefore, its activation need not be an increasing pointwise activation function. It can be, for example, the softmax function, which is not a pointwise activation function.
Assume is the parameter vector of MLP defined as
where .
We use the condition number to measure how sensitive the output is to perturbations in the input data and to round-off errors made during the solution process. The condition number of a matrix $A$ is defined as $\kappa(A) = \lVert A \rVert\,\lVert A^{-1} \rVert$; when we adopt the spectral norm, $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the maximum and minimum singular values of $A$, respectively. If the condition number is small, we say the matrix is well-posed; otherwise it is ill-posed. If a matrix is singular, its condition number is infinite and it is very ill-posed.
For an MLP trained with SBP, we assume that it is well-posed if there exist , such that in any layer and at any time step
(27)
here, for a -dimensional vector , we define .
We introduce a lemma here to ensure that the gradient estimation angle of the parameter vector can be made arbitrarily small for an input instance with its label in an MLP trained with SBP.
Lemma D.4.
For an MLP trained with SBP, consider any input instance with its label which satisfies . Assume is the parameter vector. If the MLP is well-posed, then for any , there exists a sparse ratio such that, if we set the sparse ratio of every sparsifying function in SBP to it, we can obtain an estimation of the gradient whose gradient estimation angle satisfies .
D.6 Review of the Term "In Probability"
We introduce a lemma here:
Lemma D.5.
For a sequence of random variables , when , if and