Compact Neural Representation Using Attentive Network Pruning

by Mahdi Biparva, et al.
York University

Deep neural networks have evolved to become power-demanding and consequently difficult to deploy on small mobile platforms. Network parameter reduction methods have been introduced to systematically deal with the computational and memory complexity of deep networks. We propose to examine the ability of attentive connection pruning to deal with redundancy reduction in neural networks as a contribution to the reduction of computational demand. In this work, we describe a Top-Down attention mechanism that is added to a Bottom-Up feedforward network to select important connections and subsequently prune redundant ones at all parametric layers. Our method not only introduces a novel hierarchical selection mechanism as the basis of pruning but also remains competitive with previous baseline methods in the experimental evaluation. We conduct experiments using different network architectures on popular benchmark datasets to show that a high compression ratio is achievable with negligible loss of accuracy.




1 Introduction

The human brain receives a tremendously large amount of raw sensory data every second. How the brain deals efficiently and accurately with this amount of input data to accomplish short- and long-range tasks is the target of various research studies. [20, 19] analyze the computational complexity of visual tasks and suggest the brain employs approximate solutions to overcome some of the difficulties presented by the vast amount of input sensory data.

Neural networks have been successful on various computational tasks in vision, language, and speech processing. Such networks are defined using a large number of parameters arranged in multiple layers of computation. Despite achieving good performance on benchmark datasets, parametric redundancy is known to be widespread, making these networks unsuitable for real-time mobile applications. Tensor processing, memory usage, and power consumption of mobile devices are limited, and consequently neural networks must be accelerated and compressed for such mobile applications [7]. Model compression reduces the number of parameters and primitive operations and consequently improves the computation speed at the inference phase [7].

Moreover, due to the over-fitting phenomenon, over-parameterized models suffer from low generalization and therefore must be regularized. Such models learn dataset biases very quickly, memorize the data distribution, and consequently lack proper generalization. One way to regularize parametric models is to add sparsity-inducing terms and consequently prune a number of parameters to zero, keeping a sparse subset of them [22, 23].

Neural network compression is defined as a systematic attempt to reduce parametric redundancies in dense multi-layer networks while maintaining generalization performance with the least accuracy drop. Parametric redundancy in such networks is empirically investigated in [3]. Various neural network compression approaches such as weight clustering [8, 2, 17], low-rank approximation [3, 4], weight pruning [9, 6, 5, 8, 16, 14, 10], and sparsity via regularization [15, 23, 21] have been introduced to reduce parameter redundancy for lower computational and memory complexity. Pruning methods have been shown to be computationally favorable while relying on straightforward heuristics and ad hoc approaches to schedule and devise pruning patterns. These compression approaches rely on defining some measure of importance based on which a significant subset of weight parameters is kept and the rest are pruned permanently.

STNet [1] introduces a selective attention approach in convolutional neural networks for the task of object localization. STNet leverages a small portion of the entire visual hierarchy to route through all parametric layers and localize the most important regions of the input image for a top label category. The selection process is hierarchical and provides a reliable source for weight pruning. The experimental results of STNet for object localization reveal that a sparse subset of the network units and weight parameters is sufficient for a successful localization result. We propose a novel attentive pruning method based on STNet to achieve compact neural representation using Top-Down selection mechanisms. Following [1], we define a neural network that benefits from two passes of information processing: Bottom-Up (BU) and Top-Down (TD). The BU pass is data-driven. It begins from raw input data, goes through multiple layers of feature transformation, and finally predicts abstract task-dependent outputs. The TD pass, on the other hand, is initialized from high-level attention signals, goes through selection layers, and outputs kernel importance activities. The importance activities are computed from three variable inputs in the TD selection pass: the hidden responses, the kernel filters, and the top attention signals. We show that all three sources of TD selection are crucial for strong network pruning.

Attentive pruning relies on kernel importance activities to decide on pruning patterns. We feed neural networks with input data and then activate TD selection to output kernel importance activities at every layer. These activities are accumulated and scheduled to generate pruning patterns. We evaluate the attentive pruning method using various network architectures on benchmark datasets. The competitive evaluation results reveal the selective nature of the TD mechanisms over the kernel filters. This complements the importance of the gating activities for visual tasks such as object localization and segmentation.

2 Network Pruning Using Selective Attention

We define neural networks with Bottom-Up (BU) and Top-Down (TD) information passes. The former transforms input data into high-level semantic information, while the TD pass begins from class predictions and computes the kernel importance responses at each layer. The TD selection process relies on three main sources of information. We propose to compute the important connections along which the TD attentional traces route through the visual hierarchy. In this work, we introduce a novel approach in which the pruning mechanism relies on the accumulated kernel importance responses, while the baseline models rely solely on the kernel filters of the feedforward pass. The kernel importance responses are computed using a local competition that receives three variable inputs: the kernel weights, the hidden activities, and the gating activities. The kernel weights are learned in the pre-training phase. The hidden activities represent the hierarchical feature representation of the underlying layers for some specific input data. The calculated kernel importance responses therefore take into consideration not only the kernel weights but also the input hidden activities. The baseline pruning methods rely only on magnitude thresholding, while the proposed method generalizes the baselines with the inclusion of the hidden activities. Furthermore, the kernel importance responses are category-specific. The important subset of weight parameters is determined not only for some specific input but also for some particular label category. The TD selection pass starts from a category initialization signal; consequently, all the TD selection mechanisms are informed by that particular category initialization. The category-specific nature of the attentive pruning method further reduces non-relevant pruning and therefore speeds up convergence in the retraining phase.

Figure 1: Schematic illustration of the proposed method for connection pruning, which leads to a reduction of the number of network parameters. On the left side, a toy multi-layer feedforward network is shown. On the right, the corresponding TD network is given. At each layer, once the active connections are computed using the TD selection mechanisms, they are additively accumulated into the persistent buffer; subsequently, the mask tensor M is scheduled to be updated after a number of iterations. The feedforward pass is always multiplicatively modulated with the mask tensors M. The arrows show the direction of information flow.

2.1 Method Overview

Figure 1 schematically demonstrates the information flow at the different computational stages of the proposed method. First, given some input at the bottom of the visual hierarchy shown on the left part of Fig. 1, feature extraction is performed by the parametric layers and the output hidden activities are computed at each layer until the top score layer is reached and the BU pass ends. The Predict Class Label block is a multi-class transfer function that outputs the class probability prediction given the input data. Then, the attention signal initialization determines the label category for which the TD pass (shown on the right side) will be activated. Once the attention signal is set, the selection mechanism within the local receptive field of the initialized category node is activated. According to the competition result, a number of important outgoing connections on the TD layer are activated. Then, the gating activity of the category node, proportional to the activated connection weights, propagates downward to the next gating layer. This is illustrated by the solid (activated) and dashed (deactivated) outgoing connections from the category node in Fig. 1. At this stage, the kernel importance responses for the top layer are updated with the activated connection patterns in an additive manner. This layer-wise computation continues at all of the subsequent lower layers until the TD selection pass ends at the input layer and returns the gating activities. The kernel importance accumulation is iterated for a number of randomly chosen input samples until a pruning phase is triggered by the scheduling strategy depicted by the yellow Pruning Scheduler module at the bottom of the figure. The pruning mask M is then updated based on the so-far-accumulated kernel importance responses. The pruning masks are initially set to one, meaning no kernel weight is pruned before the first scheduled mask update. Over successive pruning phases, mask entries gradually become zero.
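As a rough sketch of this accumulate-then-prune loop, the NumPy fragment below accumulates per-connection importance over iterations and, at scheduled phases, zeroes the mask entries with the lowest accumulated totals. The buffer layout, schedule, and keep-ratio are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def accumulate_and_prune(importance_batches, prune_every, keep_ratio):
    """Accumulate per-connection importance over iterations and
    periodically zero out the mask entries with the lowest totals."""
    buffer = np.zeros_like(importance_batches[0])
    mask = np.ones_like(buffer)                  # all connections start active
    for t, imp in enumerate(importance_batches, start=1):
        buffer += imp * mask                     # only active connections accumulate
        if t % prune_every == 0:                 # scheduled pruning phase
            k = int(buffer.size * (1.0 - keep_ratio))
            if k > 0:
                cutoff = np.partition(buffer.ravel(), k - 1)[k - 1]
                mask[buffer <= cutoff] = 0.0     # masks only move toward zero
    return mask
```

Note that pruned connections stop accumulating importance, so once a mask entry reaches zero it stays zero, matching the permanent-pruning behavior described above.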

2.2 Notations

A multi-layer neural network is at the core of the BU pass. The training set consists of N samples, where each input is an image of a given height and width paired with a ground-truth label over the set of classes. We define the BU pass as a feedforward network

y = f(x; θ),

in which f is a network with L layers, x and y are the input and output of the network, and θ is the set of network parameters over all L layers. The feature transformation at each fully-connected layer is a linear transformation of the input by the layer's weight matrix W; convolutional layers instead apply convolutions using the layer's kernel filter.

The BU network output is fed into a Softmax transfer function to compute the multinomial probability values. The cross-entropy loss function is used with the Stochastic Gradient Descent (SGD) optimization algorithm to update the network parameters.
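The BU pass with these notations can be sketched as a small NumPy forward network; the layer widths and the ReLU non-linearity here are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def bu_forward(x, weights):
    """Feedforward BU pass: a stack of linear layers (with ReLU on the
    hidden layers), ending in multinomial class probabilities."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0.0, h @ W)       # hidden activities of each layer
    return softmax(h @ weights[-1])      # class probability prediction

def cross_entropy(probs, label):
    """Cross-entropy loss for a single sample and ground-truth label."""
    return -np.log(probs[label])
```

In training, the gradient of this loss with respect to the weights is what SGD uses to update the parameters; the sketch covers only the forward computation.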

2.3 Top-Down Processing

The role of the Top-Down (TD) pass is to traverse the visual hierarchy downward by routing through the most significant weight connections of the network. The TD pass begins from an initialization signal generated according to the ground-truth label and traverses down layer by layer by selecting through network connections

K = TD(H, θ),

in which H is the set of BU hidden activities, θ the network parameters, and K the set of kernel importance responses over all layers. The attention signal is initialized based on the ground-truth label of the input image: the signal unit for the category corresponding to the ground truth is set to one, and the rest are kept at zero.

The TD pass at each layer computes the importance responses using three computational stages defined in STNet [1]. STNet is originally used to compute attention maps over the gating activities, while in this work we seek to compute attention maps over the kernel parameters.

The selection mechanism at each layer has three stages of computation. Each stage uses the element-wise multiplication of the hidden activities inside the receptive field of each top unit with the kernel parameters, and at the end of the selection mechanism the top gating activities are propagated to the layer below in proportion to the selected activities. We review the three stages in the following based on the definitions in STNet. In the first stage, noisy redundant activities that interfere with the important activities are pruned away. The goal is to find the subset of the most critical activities that play an important role in the final decision of the network. In the second stage, the activities returned by the first stage are partitioned into connected components. This imposes the connectivity criterion that is crucial for a reliable hierarchical representation. The most informative group of activities is marked as the winner of the selection process at the end of the second stage. The local competition between groups is based on a combination of size and total-strength objectives. In the third stage, the selected winner activities are normalized such that they sum to one. Then, the top gating activity is propagated in proportion to the selected activities to the bottom gating layer, and the activities of the bottom gating nodes are updated accordingly. This procedure is repeated for a number of layers, starting from the top of the network down to the early layers.
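A minimal one-dimensional sketch of the three stages might look as follows; the contribution measure, the keep fraction, and the size-plus-strength score are simplified assumptions rather than STNet's exact formulation:

```python
import numpy as np

def td_select(hidden, weights, top_gate, keep_frac=0.5):
    """Three-stage TD selection within one 1-D receptive field (sketch).
    Assumes each connection's contribution is hidden * weight."""
    contrib = hidden * weights
    # Stage 1: prune weak/noisy contributions, keeping the strongest fraction.
    k = max(1, int(np.ceil(keep_frac * contrib.size)))
    thresh = np.sort(contrib)[-k]
    selected = contrib >= thresh
    # Stage 2: partition survivors into connected runs; the winner is the
    # run with the best combined size-and-strength score.
    runs, run = [], []
    for i, s in enumerate(selected):
        if s:
            run.append(i)
        elif run:
            runs.append(run); run = []
    if run:
        runs.append(run)
    winner = max(runs, key=lambda r: len(r) + contrib[r].sum())
    # Stage 3: normalize the winner activities to sum to one and propagate
    # the top gating activity proportionally to the layer below.
    gates = np.zeros_like(contrib)
    w = contrib[winner]
    gates[winner] = top_gate * (w / w.sum())
    return gates
```

The returned gating vector is non-zero only inside the winning connected group, which is what lets the TD trace stay sparse as it descends the hierarchy.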

2.4 Kernel Importance Maps

The output hidden activities at a layer are computed by a linear multiplication of the kernel weight matrix and the input hidden activities for fully-connected layers. The extension to convolutional layers is straightforward using convolution operations and is omitted for the sake of brevity. Each output unit receives a weighted sum of the input vector according to its weight parameter vector.

The TD selection mechanism for an output unit operates on the input hidden activities, the weight parameters connecting all the input units to that unit, and the input gating unit. The selection mechanism is only executed for units with non-zero gating activity. The output of the selection mechanism contains two entities: the output gating activities, which are the source of TD selection at the layer below, and the kernel importance responses of the layer. Hereafter, we drop the layer index for notational brevity. The kernel importance responses are accumulated over all samples in the training set and used in selective pruning for network compression.

We categorize all the previous pruning approaches as class-agnostic pruning methods since they determine the connections to prune regardless of the target categories they are interested in. Our proposed attentive pruning method, however, is class-specific since the TD pass begins from some class hypothesis signal and routes through the network hierarchy accordingly. Therefore, the computed kernel responses are representative of the subset of the network connections that are most important for the true category label predictions.

Additionally, network parameters are trained according to the input data distribution. The BU information flows into the network hierarchy by measuring the numeric relation between an input unit and a connection weight that connects the input unit to the output unit. If both the input and the weight have high activity, the output will have a high value too. Motivated by this insight, we show that the TD selection process produces kernel importance maps by considering both the input and the kernel weights. Kernel importance is measured based on whether the input units and the kernel weights are both positively related. This generalizes the previous works in which the kernel weights are considered individually for connection pruning.
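The contrast with magnitude-only baselines can be illustrated in a few lines; `attentive_importance` is our simplified stand-in for the input-aware measure described above, not the paper's exact computation:

```python
import numpy as np

def magnitude_importance(weights):
    """Baseline: a connection's importance is just |w|."""
    return np.abs(weights)

def attentive_importance(hidden, weights, gate=1.0):
    """Input-aware sketch: a connection matters when the input unit and
    the kernel weight are strongly, positively related, scaled by the
    top-down gating activity (an assumed simplification)."""
    return gate * np.maximum(0.0, hidden * weights)
```

The difference shows up immediately: a large weight attached to an input unit that never fires gets high magnitude importance but zero attentive importance, so it becomes a pruning candidate.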

2.5 Attentive Pruning

We define an attentive pruning method using the kernel importance responses. The importance responses at each iteration are accumulated into a persistent buffer P. The binary pruning mask M defines the pattern by which the kernel weights are permanently pruned. A thresholding function determines the binary values of M:

M = 1 where P > μ + α σ, and M = 0 otherwise,

where α is a multiplicative factor, and μ and σ are the mean and the standard deviation of the persistent buffer. The binary mask is set element-wise at every layer. We run the BU and TD passes for a number of iterations, after which the attentive pruning starts to compress the network parameters. Once the set of binary mask tensors is determined after each pruning phase, the feedforward BU pass is modified using the binary pattern in the mask tensors:

h = (M ⊙ W) x,

where ⊙ denotes the element-wise (Hadamard) product of the mask M with the kernel weights W.
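One plausible reading of the thresholding and masking steps, with the `mu + alpha * sigma` threshold form assumed from the surrounding text, is:

```python
import numpy as np

def update_mask(buffer, alpha):
    """Thresholding function: keep a connection only if its accumulated
    importance exceeds mean + alpha * std of the buffer (assumed form)."""
    tau = buffer.mean() + alpha * buffer.std()
    return (buffer > tau).astype(buffer.dtype)

def masked_forward(x, W, mask):
    """BU layer modulated by the Hadamard product of mask and kernel."""
    return x @ (W * mask)
```

Raising `alpha` tightens the threshold and prunes more aggressively, which matches the role of the multiplicative factor in the pruning schedule.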

Fig. 2 illustrates the BU and TD interactions in detail. At a given layer, for instance, the TD selection mechanism receives three inputs: the hidden activities, the kernel weights, and the gating activities. Once the selection is completed, the kernel importance maps are set for the downward gating activity propagation. Additionally, they are used to additively update the persistent buffer. The pruning pattern of the kernel weights is determined according to the binary pruning mask. The mask is updated according to the pruning scheduler unit: once the scheduler turns updating on, the thresholding function updates the binary elements of the mask given the input persistent tensor. This procedure is applied at every layer for which pruning is defined.

Figure 2: Detailed demonstration of the different stages of computation of the BU and TD passes for selective connection pruning. At each layer, the inputs to the TD selection unit, the active connections, the additive accumulation into the persistent buffer, and the multiplicative masking of the BU kernel weights are depicted.

2.6 Retraining Strategy

At every iteration, using the samples in the mini-batch, we sequentially run the following computational stages: a feedforward BU pass, attention signal initialization, a TD pass, and an update of the persistent buffer using the kernel importance responses. After a number of initial iterations to accumulate kernel importance responses into the persistent buffer, we prune the network connections several times. The network is retrained from the first occurrence of pruning onward. This helps the network retain its level of accuracy for label prediction over multiple stages of connection pruning. Retraining is inevitable due to the high pruning rate of the network weight parameters. The network needs some iterations to shift its representational capability back to a high level of label prediction accuracy; retraining allows the adaptation to the reduced parameter space. It follows the exact optimization settings used for the pre-training of the network prior to compression.
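The retraining strategy amounts to a per-epoch schedule: accumulate first, then interleave pruning phases, retraining from the first prune onward. A sketch (the phase names are illustrative, not the paper's terminology):

```python
def pruning_schedule(total_epochs, accumulate_epochs, prune_epochs):
    """Return the phase of each epoch under the accumulate/prune/retrain
    strategy. `prune_epochs` is the set of epochs with a pruning phase."""
    phases = []
    for epoch in range(total_epochs):
        if epoch < accumulate_epochs:
            phases.append("accumulate")
        elif epoch in prune_epochs:
            phases.append("prune-and-retrain")
        else:
            phases.append("retrain")
    return phases
```

For example, the MNIST protocol below corresponds to one accumulation epoch, seven consecutive pruning epochs, and a final stretch of pure retraining.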

3 Experimental Results

In this section, we conduct experiments to evaluate the compression ratio of the attentive pruning method. The compression ratio is defined as the ratio of the total number of mask units over the total number of non-zero mask units (active connections). We use the PyTorch deep learning framework [18] to implement the model for the experiments in this work. The layers of the TD pass are implemented using the code provided by STNet [1]. We fix the learning rate, momentum, weight decay, and mini-batch size of the SGD optimizer unless otherwise mentioned.
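Given the stated definition, the compression ratio over a set of per-layer masks can be computed directly:

```python
import numpy as np

def compression_ratio(masks):
    """Compression ratio: total mask units over non-zero (active) units,
    summed across all pruned layers."""
    total = sum(m.size for m in masks)
    active = sum(int(np.count_nonzero(m)) for m in masks)
    return total / active
```

For instance, a mask set with 267K total units and roughly 4.6K active ones yields the ~58x ratio reported for LeNet-300-100 below.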

Model Top-1 error Parameters Compression
LeNet-300-100-reference 3.3% 267K -
LeNet-300-100-pruned 3.8% 5.2K 58
LeNet-5-reference 2.1% 83K -
LeNet-5-pruned 3.2% 4.7K 102
Table 1: LeNet error rate and compression ratio on MNIST dataset using the attentive connection pruning.
Model Top-1 error Parameters Compression
Lenet-5-reference 38.2% 83K -
Lenet-5-pruned 39.4% 8.3K 10
CifarNet-reference 30.4% 84K -
CifarNet-pruned 31.1% 7.6K 11
AlexNet-reference 23.5% 390K -
AlexNet-pruned 24.8% 13K 29
Table 2: LeNet and CifarNet error rate and compression ratio on CIFAR-10 dataset using the attentive connection pruning.
Method Network Dataset Error-degradation Compression Ratio
Han et al. [9] LeNet-300-100 MNIST 0.19% 12
Guo et al. [6] LeNet-300-100 MNIST 0.23% 56
Dong et al. [5] LeNet-300-100 MNIST 0.20% 66
Ours LeNet-300-100 MNIST 0.50% 58
Han et al. [9] LeNet-5 MNIST 0.09% 12
Guo et al. [6] LeNet-5 MNIST 0.09% 108
Dong et al. [5] LeNet-5 MNIST 0.39% 111
Ours LeNet-5 MNIST 0.90% 102
Table 3: Comparison of the compression ratio of the proposed method with the baseline approaches using the LeNet-300-100 and LeNet-5 network architectures on MNIST. Error degradation is the difference between the original error and the error at the end of the retraining phase.

3.1 The MNIST Dataset

One of the popular datasets widely used in the machine learning community to evaluate novel methods experimentally is the MNIST dataset. It contains gray-scale images of handwritten digits and is used for 10-way category classification.

We define the BU network for the MNIST dataset according to two classic network architectures: LeNet-300-100 [13] and LeNet-5 [13]. The former consists of three fully-connected layers with output channel sizes of 300, 100, and 10 successively and contains 267K learnable parameters. We train it for 10 epochs to obtain the reference model for the BU network. LeNet-5, on the other hand, has two convolutional layers at the beginning and 431K learnable parameters. Similarly, it is trained for 10 epochs.

After the first epoch, in which the persistent buffers are accumulated, we prune the network connections over the next 7 consecutive epochs. The multiplicative factor is set to a sequence of values, one per pruning epoch. We continue retraining the network for 25 epochs after the final pruning stage. We follow this pruning protocol for both LeNet architectures. The error rate and compression ratio of the networks are given in Table 1. The results confirm the selective nature of the TD mechanism in the parameter space of the BU network. According to the experimental results, the kernel importance responses are shown to be a reliable source of connection pruning for the LeNet architectures on MNIST. The proposed model is capable of reducing the number of kernel weights 58 and 102 times for LeNet-300-100 and LeNet-5 respectively, with a negligible drop in accuracy.

We compare the compression performance of the proposed attentive pruning mechanism with the baseline approaches on MNIST in Table 3. The experimental results reveal that the proposed approach outperforms two of the baseline approaches [9, 6] using the LeNet architectures while remaining competitive with [5]. It should be noted that [5] uses the computationally expensive second-order derivatives of a layer-wise error function to derive the pruning policy, while we rely only on the kernel importance responses derived from the TD selection mechanisms. [5] exhaustively relies on second-order derivatives at each layer, whereas we determine kernel responses in a hierarchical manner. Nevertheless, the proposed method outperforms [9, 6], which use magnitude pruning of weight parameters. This supports the role of the TD selection mechanisms in determining the most important parameters of neural networks as the source of a pruning procedure.

3.2 The CIFAR Dataset

The CIFAR-10 dataset [11] contains RGB images of the same size and scale as the MNIST dataset. The dataset consists of natural images of 10 semantic categories for object classification. In comparison with MNIST, the goal is to benchmark classifier performance at a higher level of complexity using CIFAR-10. We evaluate the performance of the proposed method using three network architectures on this dataset: LeNet-5, CifarNet, and AlexNet. CifarNet [11] is a multi-layer network with three convolutional layers and two fully-connected layers. It has a larger number of parameters than LeNet-5. AlexNet [12] has 5 convolutional and 2 fully-connected layers.

We empirically choose a slightly different pruning and retraining policy for the CIFAR-10 dataset, since it is considerably more complex and care must be taken with connection pruning. First, we change the mini-batch size to 16. The multiplicative factor is set to a single value. Moreover, unlike the MNIST pruning protocol, we prune layers individually. We observed in preliminary experiments that this approach helps maintain the label prediction accuracy with minimal performance compromise while keeping the compression ratio high. This policy helps the network maintain its representational capability for the classification task and avoid a deteriorating collapse of learning. We first accumulate the kernel importance responses in the persistent buffer for one epoch. Next, for every 4 epochs, we prune the connections of one layer, starting from the first parametric layer at the bottom to the last one at the top of the BU network. Once the pruning of the last layer is done, we continue retraining the pruned network for 40 epochs and then report the compression ratio in Table 2. For all three networks, the attentive pruning method is able to maintain the reference network error rate and achieve a high compression ratio.

4 Conclusion

We propose a novel pruning method to reduce the number of parameters of multi-layer neural networks. The attentive pruning method relies not only on a feedforward feature representation pass but also on a selective top-down pass. The TD pass computes the most important parameters of the kernel filters according to a selected category label. Additionally, the hidden activities at each layer participate in the stages of the TD selection mechanism. This ensures that both the top semantic information and the input data representation play roles in the kernel importance computation. We evaluate the compression ratio of the proposed method on two classification datasets and show improvement on three popular network architectures. The network achieves a high compression ratio with minimal compromise of generalization performance.


  • [1] M. Biparva and J. K. Tsotsos (2017) STNet: selective tuning of convolutional networks for object localization. In International Conference on Computer Vision Workshops, pp. 2715–2723. Cited by: §1, §2.3, §3.
  • [2] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294. Cited by: §1.
  • [3] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al. (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156. Cited by: §1.
  • [4] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §1.
  • [5] X. Dong, S. Chen, and S. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867. Cited by: §1, §3.1, Table 3.
  • [6] Y. Guo, A. Yao, and Y. Chen (2016) Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387. Cited by: §1, §3.1, Table 3.
  • [7] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations. Cited by: §1.
  • [8] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations. Cited by: §1.
  • [9] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1, §3.1, Table 3.
  • [10] B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171. Cited by: §1.
  • [11] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §3.2.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §3.2.
  • [13] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998-11) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document, ISSN 0018-9219, Link Cited by: §3.1.
  • [14] Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §1.
  • [15] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814. Cited by: §1.
  • [16] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §1.
  • [17] E. Park, J. Ahn, and S. Yoo (2017) Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5456–5464. Cited by: §1.
  • [18] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Cited by: §3.
  • [19] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo (1995) Modeling visual attention via selective tuning. Artificial Intelligence 78 (1–2), pp. 507–545. Note: Special Volume on Computer Vision External Links: ISSN 0004-3702, Document, Link Cited by: §1.
  • [20] J. K. Tsotsos (1990) Analyzing vision at the complexity level. Behavioral and Brain Sciences 13 (03), pp. 423–445. Cited by: §1.
  • [21] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §1.
  • [22] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2074–2082. External Links: Link Cited by: §1.
  • [23] H. Zhou, J. M. Alvarez, and F. Porikli (2016) Less is more: towards compact cnns. In European Conference on Computer Vision, pp. 662–677. Cited by: §1, §1.