Embedding Differentiable Sparsity into Deep Neural Network

06/23/2020 ∙ by Yongjin Lee, et al. ∙ ETRI

In this paper, we propose embedding sparsity into the structure of deep neural networks, so that model parameters can become exactly zero during training with stochastic gradient descent. The method thus learns the sparsified structure and the weights of a network simultaneously, and it can produce structured as well as unstructured sparsity.




1 Introduction

Deep neural networks have achieved great success in various recognition and prediction problems, but their size and required computing resources have grown as well. To reduce model size and inference time, several approaches have been attempted. Among them, pruning has long been adopted Mozer and Smolensky (1988); LeCun et al. (1989); Hassibi et al. (1993); Liu et al. (2015); Han et al. (2015, 2016); Dong et al. (2017). However, it typically requires a pre-trained model and needs to go through several steps: select unimportant parameters of the pre-trained model, delete them, retrain the slimmed model, and possibly repeat the whole process multiple times.

Another well-known approach is sparse regularization with the ℓ1-norm, which shrinks redundant parameters to zero during training Tibshirani (1996) and thus does not require a pre-trained model. However, since it acts on individual parameters, it often produces unstructured, irregular models and thus diminishes the benefit of computation on parallel hardware such as GPUs Wen et al. (2016). In order to obtain regular sparse structures, regularization with the ℓ2,1-norm Yuan and Lin (2006) was adopted on groups of parameters, where a group was defined as a set of parameters on the same filter, neuron, layer or building block, so that all parameters in a group were either retained or zeroed-out together Alvarez and Salzmann (2016); Wen et al. (2016); Yoon and Hwang (2017). The optimization of the regularized objective is performed with a proximal operation Yuan and Lin (2006); Parikh and Boyd (2014). The proximal operation involves soft-thresholding, which consists of weight-decaying and thresholding steps, and it is carried out separately from the gradient descent-based optimization of a prediction loss. Therefore, it interrupts end-to-end training with stochastic gradient descent, and it does not necessarily balance the prediction loss and the model complexity well.

In this work, we propose a differentiable approach that works in an end-to-end manner. Our method allows model parameters to become exactly zero during training with stochastic gradient descent, so it requires neither a proximal operation nor a separate parameter selection stage to sparsify a model. Since it learns the sparsified structure and the weights of a network simultaneously by optimizing a single objective function, it abstracts and simplifies the whole learning process. Another advantage of the proposed method is that it can be applied to groups of parameters, and thus it can produce a structured model.

2 Related Work

The proposed approach is inspired by group regularization with proximal operators Zhou et al. (2010); Alvarez and Salzmann (2016); Wen et al. (2016); Yoon and Hwang (2017) and by the differentiable sparsification approach of Lee (2019). We briefly review the two approaches to show how our method is motivated.

2.1 Group Sparse Regularization

The group regularization with the ℓ2,1-norm enforces sparsity at the group level. A group is defined as a set of parameters on the same filter, neuron or layer, and all parameters in a group are either retained or zeroed-out together. Group sparsity was successfully applied to automatically determine the number of neurons and layers Alvarez and Salzmann (2016); Wen et al. (2016). The regularized objective function with the ℓ2,1-norm Yuan and Lin (2006) is written as


min_W L(W; D) + λ R(W)    (1)

where L and R denote a prediction loss and a regularization term respectively, D is a set of training data, W is the set of model parameters, and λ controls the trade-off between the prediction loss and the model complexity. The regularization term is written as


R(W) = Σ_{g=1}^{G} ‖w_g‖₂    (2)

where w_g represents a group of model parameters and G is the number of groups. In order to optimize the regularization term, parameter updating is performed with the proximal operation Yuan and Lin (2006); Parikh and Boyd (2014),


w_g ← (w_g / ‖w_g‖₂) · relu(‖w_g‖₂ − λη)    (3)

where ← denotes an assignment operator and η is a learning rate.
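As an illustration, the group-wise soft-thresholding of the proximal operation can be sketched in a few lines of Python (a minimal sketch; the group values, λ and η are arbitrary, and a real implementation would operate on framework tensors):

```python
import math

def group_prox(w_g, lam, eta):
    """Proximal update for the group-lasso term: shrink the group's
    l2-norm by lam*eta and zero-out the whole group if it falls below zero."""
    norm = math.sqrt(sum(x * x for x in w_g))
    scale = max(norm - lam * eta, 0.0) / norm if norm > 0 else 0.0
    return [scale * x for x in w_g]

# A group with small magnitude is zeroed-out entirely...
print(group_prox([0.01, -0.02], lam=1.0, eta=0.1))  # all entries become zero
# ...while a large group is only shrunk (its norm goes from 5 to 4.9).
print(group_prox([3.0, 4.0], lam=1.0, eta=0.1))
```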

Another notable group regularization is the exclusive lasso with the ℓ1,2-norm Zhou et al. (2010); Yoon and Hwang (2017). Rather than retaining or removing an entire group altogether, it promotes sparsity within a group. The regularization term is written as

R(W) = (1/2) Σ_{g=1}^{G} ‖w_g‖₁²    (4)


To optimize the regularization term, learning is performed with the following proximal operator,

w_{g,i} ← sign(w_{g,i}) · relu(|w_{g,i}| − λη‖w_g‖₁)    (5)
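This per-parameter soft-thresholding can be sketched as follows (a toy Python illustration with arbitrary values; each weight is thresholded by an amount proportional to its group's ℓ1-norm):

```python
def exclusive_prox(w_g, lam, eta):
    """Per-parameter proximal update for the exclusive-lasso term:
    each weight is soft-thresholded by lam*eta times the group's l1-norm."""
    l1 = sum(abs(x) for x in w_g)
    t = lam * eta * l1
    sign = lambda x: (x > 0) - (x < 0)
    return [sign(x) * max(abs(x) - t, 0.0) for x in w_g]

# Small entries are pruned while large entries in the same group survive.
print(exclusive_prox([1.0, 0.1, -0.05], lam=0.5, eta=0.2))
```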


The proximal operators consist of weight-decaying and thresholding steps, and they are performed at every mini-batch or epoch in a separate step after the optimization of a prediction loss Alvarez and Salzmann (2016); Wen et al. (2016); Yuan and Lin (2006). Thus, parameter updating with the proximal gradient descent can be seen as an additional model discretization or pruning step. Moreover, because it is carried out separately from the optimization of the prediction loss, it does not necessarily balance the model complexity with the prediction loss well.

2.2 Differentiable Sparsification

The differentiable approach Lee (2019) embeds trainable architecture parameters into the structure of a neural network. By creating competition between the parameters and driving some of them to zero, it removes redundant or unnecessary components. For example, suppose that a neural network is given in a modularized form,


y = Σ_{i=1}^{n} a_i · f_i(x; w_i)    (6)

where x denotes an input to the module, w_i the model parameters of component f_i, and a_i an architecture parameter. Model parameters are ordinary parameters such as a filter in a convolutional layer or a weight in a fully connected layer. The value of a_i represents the importance of component f_i. Forcing a_i to be zero amounts to removing component f_i, or zeroing-out the whole w_i. Thus, by creating competition between the elements of a and driving some of them to be zero, unnecessary components can be removed.

In order to allow the elements of a to be zero and to set up the competition between them, the differentiable approach Lee (2019) parameterizes the architecture parameters as follows:


ã_i = exp(α_i)    (7)

a_i = relu(ã_i − σ(β) Σ_{j=1}^{n} ã_j)    (8)

where α_i and β are unconstrained trainable parameters, σ denotes the sigmoid function, and relu represents max(0, ·). It can be easily verified that a_i is allowed to be exactly zero and that the parameterization is differentiable in the view of modern deep learning. In addition, they employ the ℓp-norm with 0 < p < 1 in order to further encourage the sparsity of a, or the competition between its elements.


The proximal operator of Eq. (5) reduces to the form of Eq. (8) when the model parameters are non-negative. Although their forms are similar, they have completely different meanings: the proximal operator is a learning rule, whereas Eq. (8) is the parameterized form of the architecture parameters, which is part of the neural network itself. Inspired by these two approaches, we directly embed the learning rule into the structure of a deep neural network by re-parameterizing its model parameters.
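A tiny numeric sketch of this thresholded parameterization (the values of α and β below are hypothetical):

```python
import math

def arch_params(alphas, beta):
    """Thresholded architecture parameters:
    a_i = relu(exp(alpha_i) - sigmoid(beta) * sum_j exp(alpha_j)).
    Elements compete through the shared sum; losers become exactly zero."""
    exps = [math.exp(a) for a in alphas]
    thresh = (1.0 / (1.0 + math.exp(-beta))) * sum(exps)
    return [max(e - thresh, 0.0) for e in exps]

a = arch_params([2.0, 0.0, -1.0], beta=0.0)  # sigmoid(0) = 0.5
print(a)  # only the dominant component stays non-zero
```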

3 Proposed Approach

In this section, we derive two methods for embedding sparsity into a deep neural network. One is for structured sparsity and the other is for unstructured sparsity. We distinguish the two approaches simply to show how they are motivated and where they are derived from. With proper regularizers, the embedding method for unstructured sparsity can induce structured sparsity as well.

3.1 Structured Sparsity

Motivated by the proximal operator of Eq. (3) and the threshold operation of Eq. (8), we directly embed the proximal operator into a deep neural network without resorting to architecture parameters. An original variable w_g is re-parameterized as


w̃_g = (w_g / ‖w_g‖₂) · relu(‖w_g‖₂ − σ(β))    (9)

where σ(β) plays the role of the threshold λη in Eq. (3), and w̃_g is used in place of w_g as an ordinary parameter such as a convolutional filter. As in the proximal operator, if the magnitude of a group is less than σ(β), all parameters within the group are zeroed-out. Note that σ(β) is not a constant but a trainable parameter: it is adjusted through training by considering the trade-off between the prediction loss and the regularization term. As noted by Lee (2019), since relu is supported as a built-in differentiable function in modern deep learning tools, this form poses no difficulty for stochastic gradient descent-based learning. Thus, our approach can simultaneously optimize the prediction accuracy and the model complexity with stochastic gradient descent.

We eventually want to drive model parameters to zero, and thus ‖w_g‖₂ in the denominator can cause numerical problems as it approaches zero. Since ‖w_g‖₂ ≤ σ(β) implies w̃_g = 0, we can set w̃_g to zero whenever ‖w_g‖₂ ≤ σ(β), or simply add a small number ε to the denominator,

w̃_g = (w_g / (‖w_g‖₂ + ε)) · relu(‖w_g‖₂ − σ(β))    (10)

Also, we can reformulate it with a scaling factor:


w̃_g = γ · (w_g / (‖w_g‖₂ + ε)) · relu(‖w_g‖₂ − σ(β))    (11)

where γ denotes a learnable scale parameter.
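The structured re-parameterization can be sketched as follows (a minimal Python illustration; the group values and β are arbitrary, and ε guards the denominator as described above):

```python
import math

def embed_group(w_g, beta, eps=1e-8):
    """Re-parameterize a group of weights so that the whole group becomes
    exactly zero when its l2-norm falls below the learned threshold sigmoid(beta)."""
    norm = math.sqrt(sum(x * x for x in w_g))
    thresh = 1.0 / (1.0 + math.exp(-beta))  # sigmoid(beta) in (0, 1)
    return [x / (norm + eps) * max(norm - thresh, 0.0) for x in w_g]

print(embed_group([0.1, -0.2], beta=0.0))  # norm ~0.224 < 0.5 -> all zeros
print(embed_group([3.0, 4.0], beta=0.0))   # norm 5 -> shrunk toward 4.5
```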

3.2 Unstructured Sparsity

As in the previous section, unstructured sparsity can be embedded by re-parameterizing an original variable as


w̃_{g,i} = sign(w_{g,i}) · relu(|w_{g,i}| − σ(β)‖w_g‖₁)    (13)

which is motivated by Eq. (3) and (8). An individual parameter is zeroed-out according to its magnitude relative to its group, whereas Eq. (11) tends to remove an entire group altogether. However, this does not mean that Eq. (13) cannot induce structured sparsity: a group regularizer such as the ℓ2-norm of Eq. (2) drives parameters within the same group toward similar values, and thus all of them can be removed by raising the threshold.

The gradient of the sign function is zero almost everywhere, but this does not cause a problem for a modern deep learning tool. The equation can be rewritten as

w̃_{g,i} = relu(w_{g,i} − t_g)  if w_{g,i} ≥ 0,  and  w̃_{g,i} = −relu(−w_{g,i} − t_g)  otherwise,

where t_g = σ(β)‖w_g‖₁. That is, a parameter is handled separately according to whether its value is negative or not. For example, the conditional statement can be implemented with tf.cond or tf.where of TensorFlow Abadi et al. (2015).
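In plain Python, the same piecewise form can be sketched with an explicit branch (in TensorFlow, tf.where would play the same role; the values below are arbitrary):

```python
def embed_unstructured(w_g, sig_beta):
    """Per-parameter thresholding: each weight is soft-thresholded by
    sigmoid(beta) times the group's l1-norm, with the sign handled
    by an explicit branch instead of a sign() function."""
    t = sig_beta * sum(abs(x) for x in w_g)
    out = []
    for x in w_g:
        if x >= 0:
            out.append(max(x - t, 0.0))    # relu(x - t)
        else:
            out.append(-max(-x - t, 0.0))  # -relu(-x - t)
    return out

print(embed_unstructured([1.0, -0.8, 0.05], sig_beta=0.1))
```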

3.3 Regularized Objective Function

In the proposed approach, a regularized objective function can be written as


min_{W,B} L(W̃(W, B); D) + λ R(W̃(W, B))    (14)

where D is a set of training data, W and W̃ denote the sets of original and re-parameterized model parameters respectively, and B represents the set of threshold parameters. Usually, regularization is applied to the free parameters, i.e., W, but it is more appropriate to apply the regularization to W̃, since W̃ is a function of the threshold parameters: a threshold parameter then receives a learning signal directly from the regularization term and thus can better learn how to balance the model complexity with the prediction loss.

Although the embedding forms of Eq. (11) and (13) are derived from the ℓ2,1- and ℓ1,2-norms (Eq. (2) and (4)) respectively, they do not need to be paired with their origins; they are agnostic to the choice of regularizer. We can adopt the ℓ1-regularizer, or even the ℓp-norm with 0 < p < 1 as in Lee (2019),

R(w̃_g) = (Σ_i |w̃_{g,i}|^p)^{1/p}    (15)


The ℓ0-norm is well known for inducing sparsity, but it is rarely used in practice because it is not convex and no efficient optimization that makes parameters exactly zero is known for it. Thus, the ℓ1-norm is widely used instead as a convex surrogate, even though the ℓ0-norm is more ideal for inducing sparsity. In our approach, however, we can employ any kind of regularizer as long as it is differentiable almost everywhere.
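A toy comparison illustrating why p < 1 encourages sparsity (not from the paper; the two vectors are arbitrary but carry the same ℓ1 mass):

```python
def lp_penalty(w, p):
    """l_p penalty sum_i |w_i|^p (the raised form, without the 1/p root)."""
    return sum(abs(x) ** p for x in w)

dense = [0.5, 0.5]   # l1 mass spread over two weights
sparse = [1.0, 0.0]  # same l1 mass concentrated in one weight
print(lp_penalty(dense, 0.5), lp_penalty(sparse, 0.5))
# With p = 0.5 the sparse vector gets the smaller penalty,
# whereas the l1-norm (p = 1) cannot distinguish the two.
```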

3.4 Coarse Gradient for Threshold Operation

If the magnitude of a parameter or a group is less than the threshold, it is zeroed-out by relu and receives no learning signal, since the gradient of relu is zero on the negative domain. However, this does not necessarily mean that a parameter permanently dies once its magnitude falls below the threshold: it still has a chance to recover, because the threshold is adjustable and the magnitude of the parameter can grow by receiving a learning signal from the prediction loss. Nevertheless, in order to make sure that dropped parameters receive learning signals and have more chances to recover, we can approximate the gradient of the thresholding function. Previously, Xiao et al. (2019) approximated the gradient of a step function with that of leaky relu or softplus, and Lee (2019) used elu in place of relu. We follow the work of Lee (2019) in order to improve recoverability: relu is used in the forward pass, but elu is used in the backward pass. As noted by Lee (2019), this heuristic can be easily implemented with modern deep learning tools and does not interrupt end-to-end learning with stochastic gradient descent.
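The forward-relu/backward-elu substitution can be mimicked outside an autodiff framework by pairing relu's forward value with elu's derivative (a sketch; in practice one would combine the framework's stop-gradient mechanism with elu):

```python
import math

def coarse_threshold(z):
    """Forward pass uses relu; the surrogate gradient is taken from elu
    (with alpha = 1), so a dropped value (z < 0) still receives a
    non-zero learning signal."""
    value = max(z, 0.0)                   # relu(z): forward pass
    grad = 1.0 if z > 0 else math.exp(z)  # elu'(z): surrogate backward pass
    return value, grad

v, g = coarse_threshold(-0.5)
print(v, g)  # value is 0 (pruned), yet the surrogate gradient exp(-0.5) > 0
```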

3.5 Gradual Sparsity Regularization

In the early stage of training, it is difficult to figure out which parameters are necessary and which are not, because they are randomly initialized. Gradually scheduling the regularization strength helps to prevent accidental early dropping. In addition, it changes the structure of the neural network smoothly and makes the learning process more stable. Inspired by the gradual pruning of Zhu and Gupta (2017), we increase the value of λ from an initial value λ_i to a final value λ_f over a span of n epochs. Starting at epoch t₀, we update the value at every epoch:

λ_t = λ_f + (λ_i − λ_f) · (1 − (t − t₀)/n)³

where t denotes an epoch index.
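The schedule can be sketched as follows, adapting the cubic ramp of Zhu and Gupta (2017) to the regularization strength (the initial and final values and the span are hypothetical):

```python
def lambda_schedule(t, t0, n, lam_init, lam_final):
    """Cubic ramp of the regularization strength from lam_init to lam_final
    over n epochs starting at epoch t0, after Zhu and Gupta (2017)."""
    if t < t0:
        return lam_init
    if t >= t0 + n:
        return lam_final
    frac = (t - t0) / n
    return lam_final + (lam_init - lam_final) * (1.0 - frac) ** 3

for t in [0, 5, 10, 15, 20]:
    print(t, lambda_schedule(t, t0=0, n=20, lam_init=0.0, lam_final=1e-4))
```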

4 Conclusion

We have proposed a gradient-based sparsification method that learns the sparsified structure and the weights of a network simultaneously. The proposed method can produce structured as well as unstructured sparsity.

This research was supported by the National Research Council of Science & Technology (NST) grant by the Korea government (MSIP) (No. CRC-15-05-ETRI).


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §3.2.
  • J. M. Alvarez and M. Salzmann (2016) Learning the number of neurons in deep networks. In NIPS, Cited by: §1, §2.1, §2.1, §2.
  • S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, Cited by: §1.
  • S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. In NIPS, Cited by: §1.
  • X. Dong, S. Chen, and S. J. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NIPS, Cited by: §1.
  • B. Hassibi, D. G. Stork, and G. J. Wolff (1993) Optimal brain surgeon and general network pruning. In ICNN, Cited by: §1.
  • Y. LeCun, J. S. Denker, and S. A. Solla (1989) Optimal brain damage. In NIPS, Cited by: §1.
  • Y. Lee (2019) Differentiable sparsification for deep neural networks. CoRR abs/1910.03201. Cited by: §2.2, §2, §3.1, §3.3, §3.4.
  • B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Penksy (2015) Sparse convolutional neural networks. In CVPR, Cited by: §1.
  • M. C. Mozer and P. Smolensky (1988) Skeletonization: a technique for trimming the fat from a network via relevance assessment. In NIPS, Cited by: §1.
  • N. Parikh and S. Boyd (2014) Proximal algorithms. Foundations and Trends in Optimization 1 (3), pp. 127–239. Cited by: §1, §2.1.
  • R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 (1), pp. 267–288. Cited by: §1.
  • W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In NIPS, Cited by: §1, §2.1, §2.1, §2.
  • X. Xiao, Z. Wang, and S. Rajasekaran (2019) AutoPrune: automatic network pruning by regularizing auxiliary parameters. In NeurIPS, Cited by: §3.4.
  • J. Yoon and S. J. Hwang (2017) Combined group and exclusive sparsity for deep neural networks. In ICML, Cited by: §1, §2.1, §2.
  • M. Yuan and Y. Lin (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68 (1), pp. 49–67. Cited by: §1, §2.1.
  • Y. Zhou, R. Jin, and S. C. Hoi (2010) Exclusive lasso for multi-task feature selection. In AISTATS, Cited by: §2.1, §2.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR abs/1710.01878. Cited by: §3.5.