Memorized Sparse Backpropagation

05/24/2019
by Zhiyuan Zhang, et al.
Peking University

Neural network learning is typically slow, since backpropagation needs to compute full gradients and backpropagate them across multiple layers. Despite the success of existing work in accelerating backpropagation through sparseness, the relevant theoretical characteristics remain unexplored, and we empirically find that these methods suffer from the loss of the information contained in unpropagated gradients. To tackle these problems, in this work we present a unified sparse backpropagation framework and provide a detailed analysis of its theoretical characteristics. The analysis reveals that, when applied to a multilayer perceptron, our framework essentially performs gradient descent using an estimated gradient similar enough to the true gradient, resulting in convergence in probability under certain conditions. Furthermore, a simple yet effective algorithm named memorized sparse backpropagation (MSBP) is proposed to remedy the problem of information loss by storing unpropagated gradients in memory for the next learning step. Experiments demonstrate that the proposed MSBP effectively alleviates the information loss of traditional sparse backpropagation while achieving comparable acceleration.



1 Introduction

Training neural networks tends to be time-consuming Jean et al. (2015); Poultney et al. (2007); Seide et al. (2014), especially for architectures with a large number of learnable model parameters. An important reason why neural network learning is typically slow is that backpropagation requires the calculation of full gradients and updates all parameters in each learning step Sun et al. (2017). As deep networks with massive parameters become more prevalent, more and more efforts are devoted to accelerating backpropagation. Among existing efforts, a prominent research line is sparse backpropagation Sun et al. (2017); Wei et al. (2017); Zhu et al. (2018), which aims at sparsifying the full gradient vector to achieve significant savings in computational cost.

One effective solution for sparse backpropagation is top-$k$ sparseness, which keeps only the $k$ elements with the largest magnitude in the gradient vector and backpropagates them across different layers. For instance, meProp Sun et al. (2017) employs top-$k$ sparseness to compute only a very small but critical portion of the gradient information and update the corresponding model parameters of the linear transformation. Going a step further, Wei et al. (2017) implements top-$k$ sparseness for backpropagation on convolutional neural networks. Experimental results demonstrate that these methods can achieve significant acceleration of the backpropagation process. However, despite its success in saving computational cost, top-$k$ sparseness for backpropagation still suffers from some intractable drawbacks, elaborated on as follows.

On the theoretical side, the theoretical characteristics of sparse backpropagation, especially of top-$k$ sparseness Sun et al. (2018, 2017); Wei et al. (2017), have not been explored. Most previous work focuses on empirical explanations rather than providing solid theoretical guarantees. Towards filling this gap, we first present a unified sparse backpropagation framework, of which some existing methods Sun et al. (2017); Wei et al. (2017) can be proven to be special cases. Furthermore, we analyze the theoretical characteristics of the proposed framework, which provides theoretical explanations for related work Sun et al. (2018, 2017); Wei et al. (2017). The analysis illustrates that, when applied to a multilayer perceptron, the proposed framework essentially employs an estimated gradient similar enough to the true gradient to perform gradient descent, which leads to convergence in probability under certain conditions.

On the empirical side, we find that top-$k$ sparseness for backpropagation tends to result in the loss of the information contained in unpropagated gradients. Although it propagates the most crucial gradient information by keeping only the $k$ elements with the largest magnitude in the gradient vector, the unpropagated gradient may also contain a certain amount of useful information. Such information loss usually results in adverse effects such as poor stability of model performance.[1] To remedy this, we propose memorized sparse backpropagation (MSBP), which stores unpropagated gradients in memory for the next learning step while propagating a critical portion of the gradient information. Compared to previous work Sun et al. (2017); Wei et al. (2017), the proposed MSBP is capable of alleviating the information loss through its memory mechanism, thus improving model performance significantly. To sum up, the main contributions of this work are two-fold:

[1] Model performance refers to task-specific evaluation scores, such as accuracy on classification tasks.

  1. We present a unified sparse backpropagation framework and prove that some existing methods Sun et al. (2017); Wei et al. (2017) are special cases under this framework. In addition, the theoretical characteristics of the proposed framework are analyzed in detail to provide theoretical guarantees for related work.

  2. We propose memorized sparse backpropagation, which aims at alleviating the information loss by storing unpropagated gradients in memory for the next learning step. Experiments demonstrate that our approach effectively alleviates the information loss while achieving comparable acceleration.

2 Preliminary

This section presents some preliminary definitions. Given the dataset $\mathcal{D}$, the training loss of an input instance is defined through a loss function (e.g., the logistic loss) applied to the model output under the learnable parameter vector $\theta$. Further, the training loss on the whole dataset is defined by aggregating the instance losses over $\mathcal{D}$. We denote the angle between two vectors $a$ and $b$ by $\angle(a, b)$.

Definition 1 (Convex-smooth angle).

If the training loss on the dataset is $\mu$-strongly convex and $L$-smooth with respect to the parameter vector $\theta$, the convex-smooth angle of $\theta$ is defined in terms of the constants $\mu$ and $L$.[2]

[2] The condition required for this definition to be well defined always holds. Please refer to Appendix D.2 for the details.

Definition 2 (Gradient estimation angle).

For any parameter vector $\theta$ and training loss defined on an instance or the whole dataset, we use $\hat{g}(\theta)$ to denote an estimation of the true gradient. Then, the gradient estimation angle is defined as the angle between the estimated gradient $\hat{g}(\theta)$ and the true gradient.[3]

[3] Once both the training loss and the estimation method of the gradient are fixed, the gradient estimation angle depends only on $\theta$.

Definition 3 (Sparsifying function).

Given an integer $k$, the function $S$ maps an input vector $v \in \mathbb{R}^n$ to $S(v) = m(v) \odot v$, where $m(v)$ is a binary vector consisting of $k$ ones and $n-k$ zeros determined by $v$. If this holds for any input $v$, we call $S$ a sparsifying function and define its sparse ratio as $k/n$.

Definition 4 (top-$k$).

Given an integer $k$, for a vector $v \in \mathbb{R}^n$ with $n \ge k$, the function $\mathrm{top}_k$ is defined as $\mathrm{top}_k(v) = m(v) \odot v$, where the $i$-th element of $m(v)$ is $1$ if $|v_i|$ is among the $k$ largest magnitudes in $v$ and $0$ otherwise. In other words, the $\mathrm{top}_k$ function only preserves the $k$ elements with the largest magnitude in the input vector.

It is easy to verify that $\mathrm{top}_k$ is a special sparsifying function (see Appendix D.3).
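As a concrete illustration of Definitions 3 and 4 (our own sketch, not code from the paper), the following Python snippet implements top-k sparsification with NumPy; the binary mask plays the role of the vector $m(v)$ assumed above.

import numpy as np

def topk_sparsify(v: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k entries of v with the largest magnitude (cf. Definition 4)."""
    assert 1 <= k <= v.size
    mask = np.zeros_like(v)
    top_idx = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    mask[top_idx] = 1.0                            # binary vector with exactly k ones
    return v * mask                                # sparse ratio is k / len(v)

# Example: with k = 2, only the two largest-magnitude entries survive.
g = np.array([0.1, -3.0, 0.5, 2.0])
print(topk_sparsify(g, k=2))  # -> [ 0. -3.  0.  2.]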

3 A Unified Sparse Backpropagation Framework

This section presents a unified framework for sparse backpropagation, which can be used to explain some existing representative approaches Sun et al. (2017); Wei et al. (2017). We first define the EGD algorithm in Section 3.1 and then formally introduce the proposed framework in Section 3.2.

3.1 Estimated Gradient Descent

Here we introduce the definition of the estimated gradient descent algorithm, which provides a framework for analyzing the convergence of sparse backpropagation.

Definition 5 (EGD).

Suppose $\mathcal{L}(\theta)$ is the training loss defined on the dataset and $\theta$ is the parameter vector to learn. The estimated gradient descent (EGD) algorithm adopts the following parameter update:

$\theta_{t+1} = \theta_t - \eta_t\, \hat{g}(\theta_t), \qquad (1)$

where $\theta_t$ is the parameter at time-step $t$, $\eta_t$ is the learning rate, and $\hat{g}(\theta_t)$ is an estimation of the true gradient used for the parameter update.

Some existing optimizers can be regarded as special cases of EGD. For instance, when $\hat{g}(\theta_t)$ is defined as the true gradient, EGD is essentially the gradient descent (GD) algorithm. Several other methods (e.g., Adam Kingma and Ba (2014), AdaDelta Zeiler (2012)) can also be expressed as instances of EGD with different gradient estimates. More importantly, sparse backpropagation in essence employs an estimated gradient to approximate the true gradient for model training, and can therefore also be regarded as a special case of EGD. This connection lays the cornerstone of the subsequent theoretical analysis of sparse backpropagation.
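To make the connection concrete, here is a minimal sketch (our illustration; estimate_gradient is a hypothetical callback, not part of the paper) of the EGD update in Eq. (1). Plugging in the true gradient recovers plain gradient descent, while plugging in a sparsified gradient recovers sparse backpropagation.

import numpy as np

def egd(theta0, estimate_gradient, lr=0.1, steps=100):
    """Estimated gradient descent (Definition 5): theta <- theta - lr * g_hat."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        g_hat = estimate_gradient(theta)  # any estimate of the true gradient
        theta -= lr * g_hat
    return theta

# Example: minimize f(theta) = 0.5 * ||theta||^2 with a noisy gradient estimate
# (the true gradient of f is theta itself).
rng = np.random.default_rng(0)
noisy_grad = lambda th: th + 0.1 * rng.normal(size=th.shape)
print(egd(np.array([1.0, -2.0, 0.5]), noisy_grad))  # close to the global minimum at 0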

In this work, we theoretically prove that once the gradient estimation angle of the parameter satisfies certain conditions at each time-step $t$, the EGD algorithm converges to the global minimum under some reasonable assumptions. This conclusion is stated in Theorem 1. Readers can refer to Appendix E for the detailed proofs; we discuss the convergence speed of EGD in Appendix G.

Theorem 1 (Convergence of EGD).

Suppose $\theta$ is the parameter vector, $\theta^*$ is the global minimum, and the training loss defined on the dataset is $\mu$-strongly convex and $L$-smooth. When applying the EGD algorithm to minimize the training loss, if the gradient estimation angle of $\theta_t$ is smaller than the convex-smooth angle at each time-step, then there exists a learning rate $\eta_t$ for each time-step such that

$\lim_{t \to \infty} \theta_t = \theta^*. \qquad (2)$

For a given training loss, the convex-smooth angle is a fixed value. Therefore, the above theorem demonstrates that the EGD algorithm converges to the global minimum when the angle between the estimated gradient and the true gradient is small enough at each time-step.

3.2 Proposed Unified Sparse Backpropagation

In this section, we present a unified sparse backpropagation framework via the sparsifying function (Definition 3). The core idea is that when performing backpropagation, the gradients propagated from the next layer are sparsified to achieve acceleration. Algorithm 1 presents the pseudocode of our unified sparse backpropagation framework, which is described in detail as follows.

Considering that a computation unit composed of one linear transformation and one activation function is the cornerstone of various neural networks, we elaborate on our unified sparse backpropagation framework based on such a computational unit:

$z = Wx, \qquad y = \sigma(z), \qquad (3)$

where $x$ is the input vector, $W$ is the parameter matrix, and $\sigma$ denotes a pointwise activation function. Then, the original backpropagation computes the gradients of the parameter matrix and the input vector as follows:

$\frac{\partial \mathcal{L}}{\partial z} = \sigma'(z) \odot \frac{\partial \mathcal{L}}{\partial y}, \qquad \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial z}\, x^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial x} = W^{\top} \frac{\partial \mathcal{L}}{\partial z}. \qquad (4)$

In the proposed unified sparse backpropagation framework, the sparsifying function $S$ (Definition 3) is utilized to sparsify the gradient $\partial \mathcal{L}/\partial z$ propagated from the next layer, and the sparsified gradient is then propagated through the gradient computation graph according to the chain rule. Note that $\partial \mathcal{L}/\partial y$ is itself an estimated gradient passed from the next layer. The gradient estimations are finally performed as follows:

$\widehat{\frac{\partial \mathcal{L}}{\partial W}} = S\!\left(\frac{\partial \mathcal{L}}{\partial z}\right) x^{\top}, \qquad \widehat{\frac{\partial \mathcal{L}}{\partial x}} = W^{\top}\, S\!\left(\frac{\partial \mathcal{L}}{\partial z}\right). \qquad (5)$

Since $\mathrm{top}_k$ is a special sparsifying function (see Section 2), some existing approaches (e.g., meProp Sun et al. (2017), meProp-CNN Wei et al. (2017)) based on top-$k$ sparseness can be regarded as special cases of our framework. Depending on the specific task, the sparsifying function can be instantiated differently to improve model performance.

However, an intractable challenge for sparse backpropagation is the lack of theoretical analysis. To remedy this, we analyze the theoretical characteristics of the proposed framework. Using the fact that sparse backpropagation is a special case of EGD (Section 3.1), we theoretically show that, when applied to a multi-layer perceptron (MLP), the proposed framework converges to the global minimum in probability under several reasonable conditions, which is formalized in Theorem 2.

Theorem 2 (Convergence of unified sparse backpropagation).

For an ideal[4] dataset, if the training loss is $\mu$-strongly convex and $L$-smooth, then when applying the unified sparse backpropagation framework to train an MLP,[5] there exist a sparse ratio and learning rates such that $\theta$ converges in probability to the global minimum if we set the sparse ratio of the sparsifying functions to this value.

[4] That is, the dataset is large enough and the data instances are independent and identically distributed.

[5] There are several trivial constraints on the MLP. Please refer to Appendix D.5 for more details.

The crucial idea in proving the above theorem is to show that the angle between the sparse gradient and the true full gradient is small enough for sparse backpropagation. Under this circumstance, the condition on the gradient estimation angle in Theorem 1 is satisfied, leading to the desired convergence in probability. Readers can refer to Appendix F for the detailed proofs.

Although Theorem 2 is restricted to the multi-layer perceptron architecture and several additional conditions, it provides a degree of theoretical guarantee for the proposed unified sparse backpropagation framework. These theoretical analyses are valuable because they help explain the effectiveness of not only our framework but also some existing approaches Sun et al. (2017); Wei et al. (2017) on the theoretical side.

1:  Initialize the learnable parameter $W$
2:  /* No memory here. */
3:  while training do
4:     /* forward */
5:     Get the input $x$ of this layer
6:     $z \leftarrow Wx$
7:     $y \leftarrow \sigma(z)$
8:     Propagate $y$ to the next layer
9:     /* backward */
10:     Get $\partial \mathcal{L}/\partial y$ propagated from the next layer
11:     $\partial \mathcal{L}/\partial z \leftarrow \sigma'(z) \odot \partial \mathcal{L}/\partial y$
12:     $u \leftarrow S(\partial \mathcal{L}/\partial z)$
13:     /* Drop the unpropagated part of $\partial \mathcal{L}/\partial z$. */
14:     $\widehat{\partial \mathcal{L}/\partial W} \leftarrow u\, x^{\top}$
15:     $\widehat{\partial \mathcal{L}/\partial x} \leftarrow W^{\top} u$
16:     Backpropagate $\widehat{\partial \mathcal{L}/\partial x}$ to the previous layer
17:     /* update */
18:     Update $W$ with $\widehat{\partial \mathcal{L}/\partial W}$
19:  end while
Algorithm 1 Unified sparse backpropagation learning for a linear layer
1:  Initialize the learnable parameter $W$
2:  Initialize the gradient memory $m \leftarrow 0$
3:  while training do
4:     /* forward */
5:     Get the input $x$ of this layer
6:     $z \leftarrow Wx$
7:     $y \leftarrow \sigma(z)$
8:     Propagate $y$ to the next layer
9:     /* backward */
10:     Get $\partial \mathcal{L}/\partial y$ propagated from the next layer
11:     $\partial \mathcal{L}/\partial z \leftarrow \sigma'(z) \odot \partial \mathcal{L}/\partial y$
12:     $u \leftarrow S(\partial \mathcal{L}/\partial z + m)$
13:     $m \leftarrow \lambda\,(\partial \mathcal{L}/\partial z + m - u)$
14:     $\widehat{\partial \mathcal{L}/\partial W} \leftarrow u\, x^{\top}$
15:     $\widehat{\partial \mathcal{L}/\partial x} \leftarrow W^{\top} u$
16:     Backpropagate $\widehat{\partial \mathcal{L}/\partial x}$ to the previous layer
17:     /* update */
18:     Update $W$ with $\widehat{\partial \mathcal{L}/\partial W}$
19:  end while
Algorithm 2 Memorized sparse backpropagation learning for a linear layer
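For readers who prefer runnable code over pseudocode, the following PyTorch sketch mirrors the backward pass of Algorithm 1 for a single linear transformation. It is our illustrative sketch under the assumptions stated above (top-k as the sparsifying function, a batch of row-vector inputs, and the activation handled outside by ordinary autograd); it is not the authors' released implementation.

import torch

class SparseLinear(torch.autograd.Function):
    """Linear transform z = x W^T whose backward pass keeps only the top-k
    entries of the incoming gradient, as in Algorithm 1 (illustrative sketch)."""

    @staticmethod
    def forward(ctx, x, W, k):
        ctx.save_for_backward(x, W)
        ctx.k = k
        return x @ W.t()                       # z for a batch of row vectors x

    @staticmethod
    def backward(ctx, grad_z):
        x, W = ctx.saved_tensors
        # Binary mask keeping the k largest-magnitude entries of grad_z per example.
        _, idx = grad_z.abs().topk(ctx.k, dim=-1)
        mask = torch.zeros_like(grad_z).scatter_(-1, idx, 1.0)
        u = grad_z * mask                      # sparsified gradient S(dL/dz)
        grad_x = u @ W                         # estimated dL/dx
        grad_W = u.t() @ x                     # estimated dL/dW
        return grad_x, grad_W, None            # no gradient w.r.t. k

# Usage: z = SparseLinear.apply(x, W, k); y = torch.sigmoid(z)
# The activation's derivative is applied by regular autograd before grad_z arrives here.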

4 Memorized Sparse Backpropagation

Although traditional sparse backpropagation achieves significant acceleration by keeping only part of the elements of the full gradient, the unpropagated gradient may also contain a certain amount of useful information. We empirically find that such information loss tends to bring negative effects (e.g., performance degradation in extremely sparse scenarios and poor stability of performance). To remedy this, we propose memorized sparse backpropagation (MSBP), which aims at alleviating the information loss by storing unpropagated gradients in memory for the next learning step.

The core component of the proposed MSBP is the memory mechanism, which enables MSBP to store unpropagated gradients for the next learning step while propagating a critical portion of the gradient information. Formally, different from the unified sparse backpropagation in Section 3.2, we adopt the following gradient estimations:

$u = S\!\left(\frac{\partial \mathcal{L}}{\partial z} + m\right), \qquad \widehat{\frac{\partial \mathcal{L}}{\partial W}} = u\, x^{\top}, \qquad \widehat{\frac{\partial \mathcal{L}}{\partial x}} = W^{\top} u, \qquad (6)$

where $S$ is a given sparsifying function and $m$ is the memory storing the unpropagated gradients from the last learning step. Then, the memory is updated with the information of the gradients left unpropagated at the current learning step. Formally,

(7)

where is the memory ratio, a hyper-parameter controlling the ratio of memorizing unpropagated gradients. When is set to 0, the proposed MSBP degenerates to the unified sparse backpropagation that completely discards unpropagated gradients. Algorithm 2 presents the pseudo code of MSBP. Before the model training begins, we initialize memory to zero vector.
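A minimal sketch of the memory mechanism follows. It reflects our reading of Eqs. (6)-(7): the exact placement of the memory ratio is inferred from the surrounding text, so treat the update rule as an assumption rather than the paper's verbatim formula.

import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    mask = np.zeros_like(v)
    mask[np.argpartition(np.abs(v), -k)[-k:]] = 1.0
    return v * mask

class MSBPMemory:
    """Per-layer gradient memory for MSBP (sketch of Eqs. (6)-(7))."""

    def __init__(self, dim, k, lam):
        self.m = np.zeros(dim)   # memory, initialized to the zero vector
        self.k = k               # number of entries kept by the sparsifying function
        self.lam = lam           # memory ratio; lam = 0 recovers plain SBP

    def sparsify(self, grad_z):
        combined = grad_z + self.m            # add remembered gradients
        u = topk_sparsify(combined, self.k)   # Eq. (6): propagated (sparse) gradient
        self.m = self.lam * (combined - u)    # Eq. (7): store the unpropagated part
        return u                              # used to form dL/dW and dL/dx

# With lam = 0 the memory stays zero and sparsify() reduces to plain top-k SBP.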

Method Time Memory
Linear
+ SBP
+ MSBP
Table 1: Time and memory complexity of backpropagation for a linear layer with the given input and output sizes. We adopt $\mathrm{top}_k$ as the sparsifying function.

Intuitively, by storing unpropagated gradients with the memory mechanism, the information loss caused by sparseness in backpropagation can be alleviated. The experiments also illustrate that the proposed MSBP is more advantageous in various respects than approaches that completely discard unpropagated gradients. In fact, we find that for MSBP, the angle between the sparse gradient and the true full gradient tends to be small; moreover, this angle is smaller than that of traditional sparse backpropagation. According to the theoretical analysis in Section 3, a smaller gradient estimation angle is more conducive to model convergence. This observation explains the effectiveness of our MSBP to a certain extent on the theoretical side. Readers can refer to Section 5.4 for a more detailed analysis.

Comparison to sparsified SGD with memory. A work that looks similar to this paper is sparsified SGD with memory Stich et al. (2018). It calculates full gradients in backpropagation and sparsifies the gradients to be communicated in a distributed system. Therefore, unlike our method, which sparsifies gradients within backpropagation, the backpropagation process in Stich et al. (2018) remains unchanged and is not accelerated. Besides, Stich et al. (2018) is an optimization approach that can only be used in distributed systems, while our MSBP is a backpropagation framework that applies to both distributed and centralized systems.

5 Experiments

Following meProp Sun et al. (2017), we adopt $\mathrm{top}_k$ as the sparsifying function. For simplicity, we use SBP to denote the traditional sparse backpropagation that completely discards unpropagated gradients. Table 1 presents a comparison of the time and memory complexity of traditional SBP and our proposed MSBP. Readers can refer to Appendix A for a detailed discussion.

5.1 Evaluation Tasks

We evaluate the proposed MSBP on several typical benchmark tasks in computer vision and natural language processing. The baselines used for comparison on each task are also introduced. Due to page limitations, we include all details of the datasets and experimental settings in Appendix B.

MNIST image recognition (MNIST): This task aims to recognize the numerical digit (0-9) in each image, and the evaluation metric is classification accuracy. We use the standard MNIST handwritten digit dataset LeCun et al. (1998) and adopt a 3-layer MLP as the base model.

CIFAR-10 image recognition (CIFAR-10): Similar to MNIST, this task also performs image classification with accuracy as the evaluation metric. We use the standard CIFAR-10 dataset Krizhevsky and Hinton (2009) and implement PreAct-ResNet-18 He et al. (2016) as the base model.

Transition-based dependency parsing (Parsing): Following previous work, the dataset is selected as English Penn TreeBank (PTB) Marcus et al. (1993) and the evaluation metric is unlabeled attachment score (UAS). We implement a parser using MLP following Chen and Manning (2014) as the base model.

Part-of-speech tagging (POS-Tag): We use the standard benchmark dataset Collins (2002) derived from the Penn Treebank corpus and the evaluation metric is per-word accuracy. We adopt a 2-layer bi-directional LSTM (Bi-LSTM) as our base model.

Polarity classification and subjectivity classification: Both tasks are designed to perform sentence classification, with accuracy as the evaluation metric. We use the dataset constructed by Pang and Lee (2004) and implement TextCNN Kim (2014) as the base model.

Table 2: Results of time cost and evaluation scores on MNIST (left), Parsing (right-top), and POS-Tag (right-bottom). The settings vary the hidden size of the MLP, the sparse ratio of the sparsifying function, and the memory ratio of our MSBP. BP (s) refers to the backpropagation time cost on CPU in seconds (speed-ups are relative to the baseline). Acc and UAS denote the averaged accuracy and the unlabeled attachment score, respectively (differences are relative to the baseline).

5.2 Experimental Results

The experimental results on three tasks of MNIST, Parsing, and POS-Tag are shown in Table 2. We conduct an in-depth analysis of the results from the following aspects.

Improving model performance. As shown in Table 2, the proposed MSBP achieves the best performance on all tasks. For instance, on the parsing task, MSBP achieves 0.44% absolute improvement over traditional SBP and also outperforms the base model by 0.65% in the UAS score. Considering that our ultimate goal is to accelerate neural network learning while achieving comparable model performance, such results are promising and gratifying. Compared to traditional SBP Sun et al. (2017); Wei et al. (2017), MSBP employs the memory mechanism to store unpropagated gradients. This reduces the information loss during backpropagation, leading to improvements in the model performance.

Accelerating backpropagation. In contrast to traditional SBP, our MSBP memorizes unpropagated gradients to alleviate information loss. However, a potential concern is that introducing a memory of unpropagated gradients may impair the acceleration of backpropagation. Table 1 shows that MSBP has the same time complexity as traditional SBP. Here we further verify this conclusion experimentally. As shown in Table 2, both traditional SBP and our proposed MSBP achieve substantial acceleration of backpropagation, and the latter shows only a negligible increase in computational cost compared to the former. This illustrates that our MSBP achieves comparable acceleration while improving model performance.

Applicability to extremely sparse scenarios. In sparse backpropagation, the sparse ratio controls the trade-off between acceleration and model performance. In pursuit of ultra-large acceleration, the sparse ratio tends to be set extremely small in real scenarios. However, we empirically find that traditional SBP usually causes a significant degradation in model performance in this case. Table 2 shows that traditional SBP brings a 1.90% reduction in accuracy on MNIST image classification under an extremely small sparse ratio. The reason is that with a small sparse ratio, only a very small amount of gradient information is propagated, so there is serious information loss during backpropagation, leading to a significant degradation in model performance. In contrast, the results show that our MSBP remains effective in these extremely sparse scenarios. With the memory mechanism, the currently unpropagated gradient information is stored for the next learning step, reducing the information loss caused by sparseness.

Table 3: Results of different approaches on deep CNNs (left) and shallow CNNs (right). Acc denotes the accuracy averaged over all epochs rather than the best accuracy on the test set. Adam and SGD are two representative optimizers. The detailed explanations of the symbols can be found in Table 2.

5.3 Further In-Depth Analysis

In this section, we conduct further analysis of the proposed approach and experimental results.

Universality to base network architectures. Here we compare traditional SBP and the proposed MSBP on CNN base models to verify the universality of our approach. Results show that traditional SBP improves model performance on shallow CNNs (sentence classification) but fails on deeper networks, e.g., PreAct-ResNet-18 (image classification). As shown in Table 3, traditional SBP suffers from a significant degradation of classification accuracy on CIFAR-10 image classification. In contrast, the proposed MSBP improves the performance of the base model on both sentence classification and image classification. This demonstrates that our MSBP is universal, applying not only to different types of base networks but also to models of different depths.

Improvement in model stability. We find that our MSBP is also effective in improving model stability, meaning that it helps reduce the variance of model performance across repeated experiments. To verify this, for each setting we repeat the MNIST experiment 20 times with different random seeds. The mean and standard deviation of the accuracy over the repeated experiments are presented in Figure 1 and Figure 2, respectively. Results show that traditional SBP suffers from poor model stability in repeated experiments, with a standard deviation several times that of the base model (MLP). In contrast, all experiments conducted with MSBP show better stability than traditional SBP, and most of them are even more stable than the base model.

Insensitivity to hyper-parameters. As depicted in Figure 1, for the same sparse ratio, MSBP performs better than traditional SBP Sun et al. (2017) regardless of the choice of the memory ratio. The difference in accuracy between MSBP and traditional SBP ranges from 0.4% to 0.8%, while the difference between MSBP runs with different memory-ratio settings lies between 0.1% and 0.3%. This illustrates that the performance of MSBP is not very sensitive to the memory ratio compared to the improvement gained by the memory mechanism.

5.4 Why MSBP Works

As analyzed in Section 3, for sparse backpropagation, a smaller gradient estimation angle better guarantees the convergence of the approach. Therefore, we report the averaged gradient estimation angle of the proposed MSBP and of traditional SBP to empirically explain the effectiveness of our method. As shown in Figure 3, a higher memory ratio results in smaller gradient estimation angles, and for the same sparse ratio, the gradient estimation angles of MSBP are smaller than those of SBP. This illustrates that, by employing the memory mechanism to store unpropagated gradients, the sparse gradient calculated by our approach gives a more accurate estimate of the true gradient, which is also consistent with the accuracy results. In addition, the gap between the gradient estimation angles of SBP and MSBP tends to widen for lower sparse ratios, because SBP suffers more from the loss of unpropagated gradients in that regime; under these circumstances our proposal improves performance the most.
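The gradient estimation angle reported here is straightforward to measure. The short sketch below (our illustration, not the paper's measurement code) computes it for a top-k estimate and shows that the angle shrinks as more coordinates are kept, mirroring the trend discussed above.

import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    mask = np.zeros_like(v)
    mask[np.argpartition(np.abs(v), -k)[-k:]] = 1.0
    return v * mask

def estimation_angle(g_hat, g):
    """Angle (radians) between an estimated gradient and the true gradient."""
    cos = np.dot(g_hat, g) / (np.linalg.norm(g_hat) * np.linalg.norm(g) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Example: the angle of a top-k estimate decreases as k grows.
rng = np.random.default_rng(0)
g = rng.normal(size=1000)
for k in (50, 200, 800):
    print(k, estimation_angle(topk_sparsify(g, k), g))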

5.5 Related Systems of Evaluation Tasks

Here we present evaluation scores of related systems on each task. Our MSBP could also be applied to more sophisticated base models to advance the corresponding scores, but this is not the focus of this work. Therefore, we adopt MLP, LSTM, and CNN as the base models, due to their crucial roles in deep learning.

For the MLP, MLP-based approaches achieve around 98% accuracy on MNIST Cireşan et al. (2010); LeCun et al. (1998), while our method achieves 98.23%. For the LSTM, the accuracy reported by existing approaches lies between 97.2% and 97.4% on POS-Tag Collobert et al. (2011); Huang et al. (2015); Tsuruoka et al. (2011), whereas our method achieves 97.50%. As for the shallow CNN model, TextCNN Kim (2014) reports around 81.3% and 93.4% on polarity classification and subjectivity classification respectively, while our method achieves around 81.5% and 93.8%. For the deep CNN model, the state-of-the-art approach reaches 96.53% accuracy on CIFAR-10 Graham (2014). The prevalent ResNet architectures achieve around 93%-94%, and models based on PreAct-ResNet obtain around 95% He et al. (2016), whilst our method achieves 94.92%.

Figure 1: Averaged accuracy of the repeated experiments.
Figure 2: Averaged standard deviation of the repeated experiments.
Figure 3: Averaged run-time gradient estimation angles.

6 Related Work

A prominent research line for accelerating backpropagation is sparse backpropagation, which strives to save computational cost by sparsifying the full gradient vector. For instance, a hardware-oriented structural sparsifying method Zhu et al. (2018) was proposed for LSTMs, which enforces a fixed level of sparsity in the LSTM gate gradients, yielding block-based sparse gradient matrices. Sun et al. (2017) proposed meProp for the linear transformation, which employs top-$k$ sparseness to compute only a small but critical portion of the gradients and update the corresponding model parameters. Furthermore, Wei et al. (2017) extended meProp to more complicated models such as CNNs, so as to achieve significant computational benefits.

There also exist several lines of work other than sparse backpropagation for accelerating network learning. Tallaneare (1990) proposed an adaptive acceleration strategy for backpropagation, while Riedmiller and Braun (1993) performed local adaptation of parameter updates based on the error function. To speed up the computation of the softmax layer, Jean et al. (2015) utilized importance sampling to make training more efficient. Srivastava et al. (2014) presented dropout, which improves training speed and reduces overfitting by randomly dropping units from the neural network during training. From the perspective of distributed systems, Seide et al. (2014) proposed a one-bit quantization mechanism to reduce the communication cost between multiple machines.

7 Conclusion

This work presents a unified sparse backpropagation framework, of which some previous representative approaches can be regarded as special cases. Besides, the theoretical characteristics of the proposed framework are analyzed in detail to provide theoretical guarantees for the relevant methods. Going a step further, we propose memorized sparse backpropagation (MSBP), which aims at alleviating the information loss of traditional sparse backpropagation by utilizing a memory mechanism to store unpropagated gradients. The experiments demonstrate that the proposed MSBP improves both model performance and stability while achieving comparable acceleration.

References

  • D. Chen and C. D. Manning (2014) A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 740–750. Cited by: §B.1, §5.1.
  • D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural computation 22 (12), pp. 3207–3220. Cited by: §5.5.
  • M. Collins (2002) Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, EMNLP 2002, Philadelphia, PA, USA, July 6-7, 2002. Cited by: §B.1, §5.1.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537. Cited by: §5.5.
  • B. Graham (2014) Fractional max-pooling. arXiv preprint arXiv:1412.6071. Cited by: §5.5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §B.1, §B.2, §5.1, §5.5.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: §5.5.
  • S. Jean, K. Cho, R. Memisevic, and Y. Bengio (2015) On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1–10. Cited by: §1, §6.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1746–1751. Cited by: §B.1, §5.1, §5.5.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: 1412.6980 Cited by: §B.2, §3.1.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §B.1, §5.1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §B.1, §5.1, §5.5.
  • M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of english: the penn treebank. Computational Linguistics 19 (2), pp. 313–330. Cited by: §B.1, §5.1.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain, pp. 271–278. Cited by: §B.1, §5.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1532–1543. Cited by: §B.1.
  • C. Poultney, S. Chopra, Y. L. Cun, et al. (2007) Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pp. 1137–1144. Cited by: §1.
  • N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §B.2.
  • M. Riedmiller and H. Braun (1993) A direct adaptive method for faster backpropagation learning: the rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pp. 586–591. Cited by: §6.
  • F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association. Cited by: §1, §6.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §6.
  • S. U. Stich, J. Cordonnier, and M. Jaggi (2018) Sparsified SGD with memory. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4452–4463. Cited by: §4.
  • X. Sun, X. Ren, S. Ma, B. Wei, W. Li, J. Xu, H. Wang, and Y. Zhang (2018) Training simplification and model simplification for deep learning: a minimal effort back propagation method. IEEE Transactions on Knowledge and Data Engineering (), pp. 1–1. External Links: ISSN 1041-4347 Cited by: §1.
  • X. Sun, X. Ren, S. Ma, and H. Wang (2017) MeProp: sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3299–3308. Cited by: Appendix A, item 1, §1, §1, §1, §1, §3.2, §3.2, §3, §5.2, §5.3, §5, §6.
  • T. Tallaneare (1990) Fast adaptive backpropagation with good scaling properties. Neural Networks. Cited by: §6.
  • Y. Tsuruoka, Y. Miyao, and J. Kazama (2011) Learning with lookahead: can history-based models rival globally optimized models?. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 238–246. Cited by: §5.5.
  • B. Wei, X. Sun, X. Ren, and J. Xu (2017) Minimal effort back propagation for convolutional neural networks. arXiv preprint arXiv:1709.05804. Cited by: item 1, §1, §1, §1, §1, §3.2, §3.2, §3, §5.2, §6.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §3.1.
  • M. Zhu, J. Clemons, J. Pool, M. Rhu, S. W. Keckler, and Y. Xie (2018) Structurally sparsified backward propagation for faster long short-term memory training. CoRR abs/1806.00512. External Links: 1806.00512. Cited by: §1, §6.

Appendix A Discussion of Complexity Information

Method Time Memory
Linear
+ SBP
+ MSBP
Table 4: Time and memory complexity of backpropagation for a linear layer with the given input and output sizes. We adopt $\mathrm{top}_k$ as the sparsifying function.

Following meProp Sun et al. (2017), we adopt $\mathrm{top}_k$ as the sparsifying function. For simplicity, we use SBP to denote the traditional sparse backpropagation that completely discards unpropagated gradients; our proposed memorized sparse backpropagation is denoted as MSBP. Table 4 presents a comparison of the time and memory complexity of the two approaches.

Time complexity.

The backpropagation of a linear layer focuses on calculating the gradients of the parameter matrix and of the input vector, whose cost is dominated by two dense matrix products. The application of SBP consists of two steps: finding the top-$k$ dimensions of the gradient of the layer output using a maximum heap, and backpropagating only those top-$k$ dimensions of the gradients. The extra time cost of MSBP comes from adding the memory to the gradient of the layer output and updating the memory; both operations are linear in the dimension of that gradient and are negligible compared to the cost of the sparse backward products.
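As an illustration of the selection step (our sketch, not the authors' implementation), the indices of the k largest-magnitude entries can be found with a bounded heap, after which only those rows and columns of the two backward products need to be formed:

import heapq

def topk_indices(grad, k):
    """Indices of the k largest-magnitude entries, via a size-bounded heap."""
    return [i for _, i in heapq.nlargest(k, ((abs(g), i) for i, g in enumerate(grad)))]

# Only these indices participate in the sparse backward products, which is
# where the savings over the dense backward pass come from.
print(topk_indices([0.1, -3.0, 0.5, 2.0], k=2))  # -> [1, 3]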

Memory complexity.

The analysis of memory complexity is similar. The backpropagation of a linear layer requires storing the gradients of the parameter matrix and of the input vector. For traditional SBP, finding the top-$k$ dimensions of the gradient of the layer output with a maximum heap requires memory proportional to the heap size, while backpropagating the corresponding dimensions of the gradients requires no additional memory overhead. The extra memory cost of MSBP is the memory vector itself, which amounts to one vector per layer and is negligible compared to the memory already required for the gradients.

Appendix B Experiment Details

B.1 Datasets

MNIST image recognition (MNIST)

The MNIST dataset LeCun et al. (1998) consists of 60,000 training handwritten digit images and an additional 10,000 test images. The aim of the MNIST dataset is to recognize the numerical digit (0-9) in each image. We split the training images into 5,000 development images and 55,000 training images. The evaluation metric is classification accuracy. We adopt a 3-layer MLP model as the baseline.

CIFAR-10 image recognition (CIFAR-10)

Similar to MNIST, the goal of this task is to predict the category of each image. We conduct experiments on the CIFAR-10 dataset Krizhevsky and Hinton (2009), which consists of 50,000 training images and an additional 10,000 test images. The evaluation metric is accuracy, and PreAct-ResNet-18 He et al. (2016) is implemented as the base model.

Transition-based dependency parsing (Parsing)

In this task, we use English Penn TreeBank (PTB) Marcus et al. (1993) for experiments. We adopt sections 2-21 consisting of 39,832 sentences and 1,900,056 transition examples as the training set. Each transition example contains a parsing context and its optimal transition action. The development set is selected as section 22 composed of 1,700 sentences and 80,234 transition examples. The final test set is section 23 consisting of 2,416 sentences and 113,368 transition examples. We adopt the unlabeled attachment score (UAS) as the evaluation metric. A parser using MLP in Chen and Manning (2014) is implemented as the base model.

Part-of-speech tagging (POS-Tag)

In this task, we use the standard benchmark dataset derived from Penn Treebank corpus Collins (2002). We adopt sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples), and sections 22-24 for testing (5,462 examples). The evaluation metric is per-word accuracy. We employ a 2-layer bi-directional LSTM (Bi-LSTM) as the base model. In addition, we use 100-dim pre-trained GloVe Pennington et al. (2014) word embeddings.

Polarity classification and subjectivity classification

In these two tasks, we implement the base model as TextCNN Kim (2014). We evaluate different approaches on the dataset in Pang and Lee (2004) and the evaluation metric is the accuracy of classification.

B.2 Experimental Settings

MNIST, Parsing, and POS-Tag

We train for a fixed number of epochs on the three tasks of MNIST, Parsing, and POS-Tag; the batch size and dropout probability are set per task. We use the Adam optimizer Kingma and Ba (2014) with the same learning rate on all three tasks.

CIFAR-10

For CIFAR-10, we report the accuracy on the test set averaged over all epochs; the batch size and dropout probability are fixed. We conduct experiments with two optimizers, SGD and Adam, on this task. For SGD, we apply momentum Qian (1999) with weight decay and use a multi-step learning rate scheduler with fixed milestones and decay rate; for Adam, we use a fixed initial learning rate. Other hyper-parameters and optimization techniques on CIFAR-10 are the same as those in He et al. (2016).

Polarity classification and subjectivity classification

For these two tasks, we report the accuracy on the test set averaged over all epochs. For the base model (TextCNN), the filter window sizes are 3, 4, and 5, each with the same number of feature maps. We conduct experiments with the Adam optimizer.

Appendix C Review of Definitions and Theorems in Paper

In this section, we review some important definitions and theorems introduced in the paper.

C.1 Definitions

This section reviews several definitions. Given the dataset $\mathcal{D}$, the training loss of an input instance is defined through a loss function (e.g., the logistic loss) applied to the model output under the learnable parameter vector $\theta$. Further, the training loss on the whole dataset is defined by aggregating the instance losses over $\mathcal{D}$. We denote the angle between two vectors $a$ and $b$ by $\angle(a, b)$.

Definition 6 (Convex-smooth angle).

If the training loss on the dataset is $\mu$-strongly convex and $L$-smooth with respect to the parameter vector $\theta$, the convex-smooth angle of $\theta$ is defined in terms of the constants $\mu$ and $L$.

Definition 7 (Gradient estimation angle).

For any parameter vector $\theta$ and training loss defined on an instance or the whole dataset, we use $\hat{g}(\theta)$ to denote an estimation of the true gradient. Then, the gradient estimation angle is defined as the angle between the estimated gradient $\hat{g}(\theta)$ and the true gradient.[6]

[6] Once both the training loss and the estimation method of the gradient are fixed, the gradient estimation angle depends only on $\theta$.

Definition 8 (Sparsifying function).

Given an integer $k$, the function $S$ maps an input vector $v \in \mathbb{R}^n$ to $S(v) = m(v) \odot v$, where $m(v)$ is a binary vector consisting of $k$ ones and $n-k$ zeros determined by $v$. If this holds for any input $v$, we call $S$ a sparsifying function and define its sparse ratio as $k/n$.

Definition 9 (top-$k$).

Given an integer $k$, for a vector $v \in \mathbb{R}^n$ with $n \ge k$, the function $\mathrm{top}_k$ is defined as $\mathrm{top}_k(v) = m(v) \odot v$, where the $i$-th element of $m(v)$ is $1$ if $|v_i|$ is among the $k$ largest magnitudes in $v$ and $0$ otherwise. In other words, the $\mathrm{top}_k$ function only preserves the $k$ elements with the largest magnitude in the input vector.

It is easy to verify that $\mathrm{top}_k$ is a special sparsifying function (see Appendix D.3).

Definition 10 (EGD).

Suppose $\mathcal{L}(\theta)$ is the training loss defined on the dataset and $\theta$ is the parameter vector to learn. The estimated gradient descent (EGD) algorithm adopts the following parameter update:

$\theta_{t+1} = \theta_t - \eta_t\, \hat{g}(\theta_t), \qquad (8)$

where $\theta_t$ is the parameter at time-step $t$, $\eta_t$ is the learning rate, and $\hat{g}(\theta_t)$ is an estimation of the true gradient used for the parameter update.

c.2 Theorems

Theorem 3 (Convergence of EGD).

Suppose $\theta$ is the parameter vector, $\theta^*$ is the global minimum, and the training loss defined on the dataset is $\mu$-strongly convex and $L$-smooth. When applying the EGD algorithm to minimize the training loss, if the gradient estimation angle of $\theta_t$ is smaller than the convex-smooth angle at each time-step, then there exists a learning rate $\eta_t$ for each time-step such that

$\lim_{t \to \infty} \theta_t = \theta^*. \qquad (9)$
Theorem 4 (Convergence of unified sparse backpropagation).

For an ideal[7] dataset, if the training loss is $\mu$-strongly convex and $L$-smooth, then when applying the unified sparse backpropagation framework to train an MLP,[8] there exist a sparse ratio and learning rates such that $\theta$ converges in probability to the global minimum if we set the sparse ratio of the sparsifying functions to this value.

[7] That is, the dataset is large enough and the data instances are independent and identically distributed.

[8] There are several trivial constraints on the MLP. Please refer to Appendix D.5 for more details.

Appendix D Preparation and Lemmas

Here we introduce some key definitions and lemmas used throughout the appendix. All vectors and matrices are assumed to be over the real number field. In the appendix, vectors (e.g., $x$) and matrices (e.g., $W$) are written in bold.

D.1 Vectors

We first introduce two vector-related lemmas.

Lemma D.1.

For any vectors , and , we have

Lemma D.2.

For matrix (), suppose

is a positive definite matrix, the eigenvalue decomposition of

is , () and

is an orthogonal matrix. We define

and . If , for any -dimension vectors and , we have

D.2 Loss Function

We say the loss function $\mathcal{L}$ is $\mu$-strongly convex if $\mu I \preceq \nabla^2 \mathcal{L}(\theta)$ holds for all $\theta$, where $I$ denotes the identity matrix. If the loss function $\mathcal{L}$ is $\mu$-strongly convex, then for any two parameter vectors we have

(10)
(11)

We say the loss function $\mathcal{L}$ is $L$-smooth if $\nabla^2 \mathcal{L}(\theta) \preceq L I$ holds for all $\theta$, where $I$ denotes the identity matrix. If the loss function $\mathcal{L}$ is $L$-smooth, then for any two parameter vectors we have

(12)
(13)

For the loss function $\mathcal{L}$, we define $\theta^*$ as its global minimum. If $\mathcal{L}$ is $L$-smooth, then for any $\theta$ we have

(14)
(15)
(16)

From Eq. (12) and Eq. (13), we can see that

(17)

in other words, $\mu \le L$. When $\mu = L$, the minimization problem admits a closed-form solution and is trivial. Therefore, we assume in most cases that $\mu < L$.

Back to the definition of the convex-smooth angle: if the loss function is $\mu$-strongly convex and $L$-smooth with $\mu < L$, the convex-smooth angle of $\theta$ is determined by the constants $\mu$ and $L$.

D.3 The $\mathrm{top}_k$ Function

We now show that the $\mathrm{top}_k$ function is a special sparsifying function.

Given an integer $k$, for a vector $v \in \mathbb{R}^n$ with $n \ge k$, the function $\mathrm{top}_k$ is defined as $\mathrm{top}_k(v) = m(v) \odot v$, where the $i$-th element of $m(v)$ is $1$ if $|v_i|$ is among the $k$ largest magnitudes in $v$ and $0$ otherwise. It is easy to verify that

(18)
(19)

In other words,

(20)

Therefore, the $\mathrm{top}_k$ function is a special sparsifying function.

D.4 Linear Layer Trained with SBP

Consider a linear layer with one linear transformation and one increasing pointwise activation function:

$z = Wx, \qquad y = \sigma(z), \qquad (21)$

where $x$ is the input sample, $W$ is the parameter matrix, $n$ is the dimension of the input vector, $m$ is the dimension of the output vector, and $\sigma$ is an increasing pointwise activation function.

For a matrix $W$, we define a flattening function that flattens it into a single vector by stacking its columns, where $W_j$ represents the $j$-th column of $W$ and the semicolon denotes the concatenation of column vectors into one long column vector.

Assume and , then , when backpropagating

(22)

Assume
and , then

(23)

In the proposed unified sparse backpropagation framework, the sparsifying function (Definition 8) is utilized to sparsify the gradient propagated from the next layer and propagate it through the gradient computation graph according to the chain rule. Note that the incoming gradient is itself an estimated gradient passed from the next layer. The gradient estimations are finally performed as follows:

$\widehat{\frac{\partial \mathcal{L}}{\partial W}} = S\!\left(\frac{\partial \mathcal{L}}{\partial z}\right) x^{\top}, \qquad \widehat{\frac{\partial \mathcal{L}}{\partial x}} = W^{\top}\, S\!\left(\frac{\partial \mathcal{L}}{\partial z}\right), \qquad (24)$

in other words,

(25)

We introduce a lemma:

Lemma D.3.

For a linear layer trained with SBP, the sparse ratio of the sparsifying function in SBP is . Denote .If is the loss of MLP trained with SBP on this input instance and the input of this layer is which satisfies , we use SBP to estimate and . suppose is a positive definite matrix, the eigenvalue decomposition of is , () and is an orthogonal matrix. We define , and , . It is easy to verify that and because is a positive definite matrix and is increasing. If , , and , then we have

and

D.5 MLP Trained with SBP

Consider an MLP trained with SBP: it is a multi-layer perceptron (MLP) in which every layer except the last layer is a linear layer with SBP. The input of the MLP is fed to the first layer, and the output of the MLP is produced by the last layer. The $l$-th layer of the MLP is defined as

(26)

where $\sigma_l$ is an increasing pointwise activation function of layer $l$. Note that the last layer is not a linear layer trained with SBP. Therefore, its activation need not be an increasing pointwise activation function; it can be the softmax function, which is not a pointwise activation function.

Assume $\theta$ is the parameter vector of the MLP, obtained by concatenating the flattened parameter matrices of all layers.

We use the condition number to measure how sensitive the output is to perturbations of the input data and to round-off errors made during the solution process. The condition number of a matrix is defined with respect to a matrix norm; when we adopt the spectral norm, it equals the ratio of the maximum and minimum singular values of the matrix.

If the condition number is small, we say the matrix is well-posed, and otherwise ill-posed. If a matrix is singular, its condition number is infinite and it is very ill-posed.
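For reference, a spectral-norm condition number can be computed directly from the singular values; the snippet below is our illustration of the quantity used in this definition, not code from the paper.

import numpy as np

def condition_number(W):
    """Spectral-norm condition number: largest singular value over smallest."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values in descending order
    return np.inf if s[-1] == 0 else s[0] / s[-1]

print(condition_number(np.eye(3)))             # 1.0: perfectly well-posed
print(condition_number(np.diag([1.0, 1e-8])))  # ~1e8: nearly singular, ill-posed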

For an MLP trained with SBP, we assume that it is well-posed if there exist constants such that, in any layer and at any time step,

(27)

here for a -dim vector , we define .

We introduce a lemma here to ensure that the gradient estimation angle of the parameter vector can be made arbitrarily small for an input instance with its label in an MLP trained with SBP.

Lemma D.4.

For an MLP trained with SBP and any input instance with its label satisfying the stated condition, assume $\theta$ is the parameter vector. If the MLP is well-posed, then for any target angle there exists a sparse ratio such that, if we set the sparse ratio of every sparsifying function in SBP to this value, we can obtain an estimation of the true gradient whose gradient estimation angle is no larger than the target.

D.6 Review of the Term "In Probability"

A sequence of random variables $X_t$ converges to a random variable $X$ in probability if, for any $\epsilon > 0$,

$\lim_{t \to \infty} \Pr\left(\,|X_t - X| > \epsilon\,\right) = 0. \qquad (28)$

We introduce a lemma here.

Lemma D.5.

For a sequence of random variables , when , if and