NodeDrop: A Condition for Reducing Network Size without Effect on Output

06/03/2019
by Louis Jensen, et al.

Determining an appropriate number of features for each layer in a neural network is an important and difficult task. This task is especially important in applications on systems with limited memory or processing power. Many current approaches to reducing network size either rely on iterative procedures, which can extend training time significantly, or require very careful tuning of algorithm parameters to achieve reasonable results. In this paper we propose NodeDrop, a new method for eliminating features in a network. With NodeDrop, we define a condition that identifies nodes guaranteed to carry no information, and then use regularization to encourage nodes to meet this condition. We find that NodeDrop drastically reduces the number of features in a network while maintaining high performance, reducing the number of parameters by a factor of 114x for a VGG-like network on CIFAR10 without a drop in accuracy.


1 Introduction

A prime difficulty in neural network design is the appropriate tuning of network architectures. Choosing a size for each layer of a neural network is usually done by rough estimates and trial and error. This imprecise process can lead to network designs far larger than needed to perform a particular task. Although the capacity for training large and complex networks grows with improving graphics processing unit (GPU) technology, designing too large a network can result in applications impractical for general hardware use. Mobile devices and embedded systems limit compute, memory, and storage consumption, and as a result can only run small, minimally designed networks. A designer aiming to create such a minimal network is faced with the time-consuming task of manually tuning the number of neurons in each layer, which can require many extended experiments to balance the size and performance of the neural network.

The issues involved with using deep neural networks (DNNs) on constrained systems have inspired significant research. One interesting area of research is the design of systems which can automatically prune a network's parameters. Ideally these techniques maintain high performance while pruning as many parameters as possible, ensuring the network can fit on smaller systems. Many state-of-the-art methods for network pruning involve an iterative process of repeatedly pausing training, pruning parameters, and resuming training so that the network can reconverge. Such iterative procedures can lead to long training times. Other techniques use regularization to eliminate nodes, but the final performance of these networks is often highly sensitive to the algorithm's hyperparameters. Thus, while these techniques do offer parameter reduction benefits, the network designer is still faced with similar difficulties as before: a time-consuming training process and a potential hyperparameter tuning headache.

We address the problem of parameter reduction with our novel NodeDrop technique, which prunes the network during training. NodeDrop drops only nodes which carry no information, and removes them fluidly as training proceeds.

First, we formally define the conditions necessary to guarantee a neuron carries no information, and we propose a simple variant of $\ell_1$ regularization which drives nodes toward this condition. Second, we extend the NodeDrop technique to networks which use batch normalization (Ioffe & Szegedy, 2015). We test our technique on modern architectures for the MNIST, CIFAR10, and CIFAR100 datasets, and show that we are able to drop a significant number of nodes without a loss in performance. Our method requires no iterative retraining and only a modest increase in training time. We demonstrate effective results with a wide range of hyperparameters, indicating our method does not require precise hyperparameter tuning. In the best case we produce networks with substantially fewer parameters for MNIST, CIFAR10, and CIFAR100 (a reduction of more than 100x on CIFAR10), with no perceivable loss in performance.

2 Related Works

2.1 Pruning

Network pruning comprises a set of techniques which take a pretrained network and then prune off connections using some heuristic. This is usually followed by a retraining of the network and sometimes by an iterative process of pruning and retraining the network several times. Pruning techniques first appeared in the 1990s, with the first instances using second order gradients of connections to determine which neurons should be pruned (Hassibi & Stork, 1993; LeCun et al., 1990; Reed, 1993). More recent approaches have taken on a wide array of methods for determining which connections should be pruned. These approaches include correlation (Sun et al., 2015; Han et al., 2016; Srinivas & Babu, 2015), regularization (Han et al., 2015; Li et al., 2017), particle filtering on misclassification rate (Anwar et al., 2017), low rank approximation (Denton et al., 2014), vector quantization (Gong et al., 2014), and tensor decomposition (Kim et al., 2015).

All network pruning techniques suffer from extended training times due to iterative retraining of the network, which can lengthen training significantly and often makes tuning the various parameters of each method a lengthy chore.

2.2 Regularization

A more recently developed approach to network parameter reduction is to disable parameters through regularization. A majority of these techniques have focused on the sparsification of network connections using a group sparsity approach (Wen et al., 2016; Zhou et al., 2016; Alvarez & Salzmann, 2016; Lebedev & Lempitsky, 2016). This involves grouping the weights for every neuron and attempting to sparsify each group by penalizing its norm. These techniques require all weights to be driven to zero before a node can be guaranteed to carry no information. In practice nodes are removed based on a threshold since this guarantee is difficult to meet. Because of this, regularization methods can be difficult to use as they require very precise tuning of the regularization and threshold terms.
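For concreteness, a minimal sketch of the kind of group-sparsity penalty these methods apply is shown below; the specific grouping and norm differ across papers, so the ℓ2-over-groups form here is an illustrative assumption rather than any one method's exact penalty.

```python
import torch

def group_lasso_penalty(weight: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Sum of l2 norms over each output neuron's incoming weights.

    Driving a group's norm toward zero switches the whole neuron off, but in
    practice a threshold is still needed to decide which groups to prune.
    weight: (out_features, in_features); lam is an illustrative value.
    """
    return lam * weight.norm(p=2, dim=1).sum()
```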

The technique most similar to ours (Liu et al., 2017) uses $\ell_1$ regularization to drive the batch norm scale parameter, $\gamma$, towards zero. This is similar in principle to our own experiments with batch norm. However, Liu et al. require retraining after pruning in order to reconverge. We provide a more absolute condition to guarantee a node is off, eliminating the need for a retraining procedure and making node removal a more fluid process.

Our technique falls within the regularization category. Key differences in our approach involve special regularization of the bias for each neuron and a condition for node removal guaranteeing no effect on network output. Our condition is also more relaxed, utilizing the “dead” region in a node’s activation function, instead of requiring the node’s weights to be zero.

2.3 Other approaches

Several other approaches have appeared which do not fit into the categories of the previous two subsections. Many of these approaches focus on reducing precision rather than reducing the number of parameters (Hubara et al., 2016; Vanhoucke et al., 2011; Gupta et al., 2015; Rastegari et al., 2016). As such, these approaches are largely orthogonal to our own work, and can be used in conjunction with it to compound the reduction in memory and computation. One example is quantized and binarized neural networks (Hubara et al., 2016), which take this approach to its extreme by using binary weights and XNOR operations to replace multiplication.

An additional noteworthy work is that of Molchanov et al. (2017), who achieve impressive results by sparsifying a network's connections during training using variational dropout. In theory, this work should also be usable in conjunction with our own.

3 Methods

3.1 NodeDrop Condition

In this section we describe the condition for identifying useless nodes in a network. Nodes in a neural network carry information by outputting values from some distribution. A node can only be useful if that node sometimes outputs a non-zero value. A node which is guaranteed to always output a constant value is a node which can only be used as an extra bias node for future layers. Moreover, if a node is guaranteed to always output the constant zero, this node is entirely useless and can be removed from the network without impact. This occurs in activation functions with a flat zero region. The popular rectified linear unit (ReLU) activation function contains such a flat zero region. This flat zero region causes the observed “Dying ReLU” effect, in which nodes become stuck in this flat region with zero gradients. We can therefore design a condition to identify when a node is useless by taking advantage of this effect.

We propose the NodeDrop condition.

  1. Consider a node with input vector $x$, weight vector $w$, bias $b$, and an activation function $f$ satisfying $f(z) = 0$ for all $z \le 0$.

  2. We wish to find the condition under which this node is dead, i.e. $f(w \cdot x + b) = 0$ for all inputs $x$.

  3. Since $f(z) = 0$ whenever $z \le 0$, we simply need to find the condition under which $w \cdot x + b \le 0$ for all $x$. We have constrained the inputs to lie within $[0, 1]$, i.e. $0 \le x_i \le 1$, so the largest possible pre-activation is $\max_x (w \cdot x + b) = \sum_i \max(w_i, 0) + b$.

  4. Then $w \cdot x + b \le 0$ holds for every $x \in [0, 1]^n$ exactly when this maximum is non-positive.

This leaves us with the NodeDrop condition:

$$\sum_i \max(w_i, 0) + b \le 0 \qquad (1)$$

This condition can be applied to a fully connected layer, or in broader contexts such as the filter weights of a convolutional layer. Because nodes which satisfy this condition are guaranteed to always output zero, they can be dropped from the network without affecting its output. Note that the simpler condition $\|w\|_1 + b \le 0$ can also be used, but it is more conservative and identifies fewer nodes that can be dropped.
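For a fully connected layer this check reduces to a few lines. The following is an illustrative sketch, not the authors' code; the weight shapes and example values are chosen only for clarity.

```python
import torch

def droppable_nodes(weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Boolean mask of output nodes satisfying the NodeDrop condition (eq. 1).

    weight: (out_features, in_features), bias: (out_features,).
    Assumes layer inputs lie in [0, 1] and the activation outputs zero for
    non-positive pre-activations, as the condition requires.
    """
    max_preactivation = weight.clamp(min=0).sum(dim=1) + bias
    return max_preactivation <= 0.0

# Example: node 0 is dead (0.3 - 0.5 <= 0); node 1 is not (0.1 + 0.0 > 0).
W = torch.tensor([[0.2, -0.5, 0.1],
                  [0.05, 0.05, -1.0]])
b = torch.tensor([-0.5, 0.0])
print(droppable_nodes(W, b))  # tensor([ True, False])
```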

The constraint on the activation function can be achieved using the standard ReLU activation, $f(z) = \max(0, z)$. However, ReLU does not guarantee that the output will fall between 0 and 1, a necessary condition if we want to apply NodeDrop to the next layer in the network. In the following section we discuss an activation function for which the NodeDrop condition can be applied both to a layer and to the layer that follows it.

3.2 Activation Function

Supposing we want to apply NodeDrop to many or all layers of the network, we must use an activation function which possesses the appropriate flat zero region, and whose outputs always lie between 0 and 1. The flat zero region guarantees the NodeDrop condition can be applied to the layer preceding the activation, and the constraint on the output allows the NodeDrop condition to be applied to the layer immediately following the activation. These necessary constraints are restated below.

$$f(z) = 0 \quad \text{for all } z \le 0 \qquad (2)$$
$$0 \le f(z) \le 1 \quad \text{for all } z \qquad (3)$$

If the outputs of a layer are guaranteed to lie between 0 and 1 after activation, the inputs of the next layer will satisfy the conditions assumed in proving the NodeDrop condition. Many activation functions can satisfy these conditions, but none of the most popular activation functions satisfy both together. For example, the popular ReLU function satisfies the condition in equation 2 but not equation 3. Conversely, the popular sigmoid activation function satisfies equation 3 but not equation 2.

One option is a clamped ReLU activation function, $f(x) = \min(\max(x, 0), 1)$. This has two flat regions, $f(x) = 0$ for $x \le 0$ and $f(x) = 1$ for $x \ge 1$, and an intermediate region where $f(x) = x$. This does satisfy both of the NodeDrop constraints; however, we found that having two regions with zero gradients can lead to too many nodes being "stuck" at either 0 or 1, even at network initialization. Thus, we propose the SoftClampedReLU activation function, which is a combination of the ReLU and an inverted SoftPlus activation:

(4)

Intuitively this activation is much like a ClampedReLU, but has a soft gradient in the upper region. This upper region is not perfectly flat, so values do not become stuck at 1. The lower region is still perfectly flat, satisfying our flat region condition, equation (2). This activation function is shown in figure 1; in our experiments we use a single fixed setting of its sharpness.

Figure 1: The SoftClampedReLU activation function.
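A sketch of one activation with these properties is given below, combining a hard ReLU floor with an inverted-softplus ceiling. The exact functional form and the sharpness constant k are assumptions for illustration; equation 4 defines the paper's version.

```python
import torch
import torch.nn.functional as F

def soft_clamped_relu(x: torch.Tensor, k: float = 10.0) -> torch.Tensor:
    """Flat at zero for x <= 0, roughly linear in between, softly saturating toward 1.

    The upper region uses a softplus-shaped ceiling, so its gradient shrinks but
    never reaches exactly zero; only the lower region is perfectly flat.
    The constant k controls sharpness and is an illustrative choice.
    """
    upper = 1.0 - F.softplus(k * (1.0 - x)) / k  # soft ceiling approaching 1
    return torch.clamp(upper, min=0.0)           # hard floor at 0 (flat zero region)
```

With this form, outputs stay in [0, 1), so the same layer can both satisfy the flat zero region of equation 2 and feed inputs that satisfy the assumptions of the NodeDrop condition in the next layer.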

3.3 Regularization

The NodeDrop condition for identifying and eliminating useless nodes is powerful, but without encouragement, most trained networks will possess very few nodes satisfying the NodeDrop condition. Therefore, we add regularization during training to encourage features to satisfy the NodeDrop condition.

To encourage the condition $\sum_i \max(w_i, 0) + b \le 0$, we could directly penalize the distance of this quantity from zero:

$$R(w, b) = \lambda \left| \sum_i \max(w_i, 0) + b \right|.$$

However, this is too close to the boundary of our dead region. Alternatively, to encourage the quantity to become negative, we could penalize it directly:

$$R(w, b) = \lambda \left( \sum_i \max(w_i, 0) + b \right).$$

However, this causes the bias to tend toward negative infinity.

Instead we penalize the distance from a negative constant, $-C$, giving:

$$R(w, b) = \lambda \left| \sum_i \max(w_i, 0) + b + C \right|.$$

This encourages $\sum_i \max(w_i, 0) + b = -C$, safely within the "dead" region, and without tending to negative infinity. As such, the choice of $C$ is largely arbitrary; in our experiments a wide range of positive values for $C$ worked equally well.

We can also write our regularization term as a small modification to standard $\ell_1$ regularization. For the case where $b + C \ge 0$, which typically holds when a node is on, $\left| \sum_i \max(w_i, 0) + b + C \right| = \sum_i \max(w_i, 0) + |b + C|$. We use this modified $\ell_1$ regularization, given as:

$$R(w, b) = \lambda \left( \sum_i \max(w_i, 0) + |b + C| \right) \qquad (5)$$

This is normal $\ell_1$ regularization with two adjustments. We use the norm of $\max(w, 0)$ instead of $w$, since this is a tighter bound on the pre-activation given that the inputs satisfy $0 \le x_i \le 1$. Instead of penalizing the bias as $|b|$, we penalize $|b + C|$. This modified regularization encourages biases to take values near $-C$, and weights to take values near 0.
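A minimal PyTorch-style sketch of this regularizer follows, assuming equation 5 takes the form derived above; the default values of lam and C are placeholders, not the paper's settings.

```python
import torch

def nodedrop_penalty(weight: torch.Tensor, bias: torch.Tensor,
                     lam: float = 1e-5, C: float = 1.0) -> torch.Tensor:
    """NodeDrop regularization sketch: lam * (sum_i max(w_i, 0) + |b + C|) per node.

    weight: (out_features, in_features); bias: (out_features,).
    lam and C are illustrative defaults, not the authors' chosen values.
    """
    positive_weight_mass = weight.clamp(min=0).sum(dim=1)  # sum_i max(w_i, 0) per node
    shifted_bias = (bias + C).abs()                        # |b + C| per node
    return lam * (positive_weight_mass + shifted_bias).sum()
```

In training, this penalty would simply be added to the task loss for every layer to which NodeDrop is applied.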

We use $\ell_1$ regularization because $\ell_2$ regularization does not work well in our context. Under $\ell_2$ regularization on both the weights and the bias, it is cheaper for the network to emulate a bias with multiple weights than to use the bias itself: if a bias of magnitude $|b|$ is split across $n$ weights acting on near-constant inputs, the $\ell_2$ penalty is $b^2/n$ rather than $b^2$. This becomes worse when we penalize the distance of the bias from $-C$ rather than from zero, making the normal case of an active node with bias near zero quite costly. This encourages the network to use many nodes in the previous layer as an alternative to a bias, preventing us from removing those nodes even if they carry no information beyond that of a bias.

3.4 Extension to Batch Normalization

Many state-of-the-art networks utilize batch normalization or one of its alternatives (Ioffe & Szegedy, 2015; Ba et al., 2016; Salimans & Kingma, 2016). Here we consider our NodeDrop condition in a network with batch normalization. Batch normalization is given as follows:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \qquad \mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_B)^2,$$

where the sums are over the batch of size $N$, and both $\gamma$ and $\beta$ are learned parameters. Batch normalization is usually applied between the output of a layer and an activation function.

To achieve a NodeDrop condition for batch normalization similar to equation 1, we would like to determine when $\gamma \hat{x} + \beta \le 0$ for every sample in a training batch. We similarly require an activation with a flat zero region, but no longer require inputs between 0 and 1. Therefore, for our batch normalization NodeDrop (NodeDrop-BN) technique we are able to use the popular ReLU activation function. Our NodeDrop-BN condition is given in the following lemma.

Lemma 3.1.

If $|\gamma|\sqrt{N-1} + \beta \le 0$, then $\gamma \hat{x} + \beta \le 0$ for every sample in any training batch of size $N$.

Proof.
Given a batch of size $N$, the normalized values satisfy $\sum_i \hat{x}_i = 0$ and $\sum_i \hat{x}_i^2 \le N$ (with equality when $\epsilon = 0$).

Together these imply $|\hat{x}_i| \le \sqrt{N-1}$ for every $i$. Therefore $\gamma \hat{x}_i + \beta \le |\gamma|\sqrt{N-1} + \beta \le 0$ for every sample in the batch.

This gives us the NodeDrop-BN condition:

$$|\gamma|\sqrt{N-1} + \beta \le 0 \qquad (6)$$
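As a quick numerical sanity check of the bound used in the proof (a sketch assuming standard batch normalization with per-batch mean and biased variance):

```python
import torch

# Within any batch, standardized values are bounded by sqrt(N - 1) in magnitude,
# so gamma * x_hat + beta <= |gamma| * sqrt(N - 1) + beta for every sample.
N = 64
x = torch.randn(N) * 5.0 + 3.0  # arbitrary pre-normalization activations
x_hat = (x - x.mean()) / torch.sqrt(x.var(unbiased=False) + 1e-5)
print(x_hat.abs().max().item() <= (N - 1) ** 0.5)  # True
```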

Traditionally, batch normalization stores a running mean, $\mu$, and a running variance, $\sigma^2$, during training. These stored values are then used during testing. Our condition guarantees a node is always off during training, but does not guarantee a node will always be off during testing. We make the assumption that a node which is always off during training should also be off during testing, and can therefore safely remove these nodes without impact. We experimentally validate this assumption in section 4.

The condition in equation 6 implies that, so long as we use an activation function where $f(z) = 0$ for $z \le 0$ (for example ReLU), we can determine whether a node is off using only the batch normalization parameters, $\gamma$ and $\beta$, and the training batch size, $N$. Following the same methodology for regularization as in equation 5, we define our NodeDrop-BN regularization term as:

$$R(\gamma, \beta) = \lambda \left( \sqrt{N-1}\,|\gamma| + |\beta + C| \right) \qquad (7)$$

Standard weight regularization is generally applied to the layers before batch normalization. Unlike with the vanilla NodeDrop regularization, such regularization does not interfere with the NodeDrop-BN technique, because regularization applied to the layers before a batch normalization has no effect on the output of the batch normalization layer.
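Putting this together, the sketch below shows how droppable channels could be identified from a trained batch normalization layer using equation 6. The layer, batch size, and function name are illustrative assumptions, and a ReLU is assumed to follow the normalization.

```python
import torch
import torch.nn as nn

def droppable_bn_channels(bn: nn.BatchNorm2d, batch_size: int) -> torch.Tensor:
    """Boolean mask of channels guaranteed non-positive before the ReLU (eq. 6)."""
    gamma, beta = bn.weight.detach(), bn.bias.detach()
    bound = gamma.abs() * (batch_size - 1) ** 0.5 + beta
    return bound <= 0.0

# Illustrative usage on an untrained layer (in practice the check runs after training).
bn = nn.BatchNorm2d(num_features=128)
mask = droppable_bn_channels(bn, batch_size=64)
print(int(mask.sum()), "of", mask.numel(), "channels satisfy the NodeDrop-BN condition")
```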

4 Experiments

Network Name   Layer 1   Layer 2   Layer 3   Layer 4   Layer 5   Layer 6
               Conv2d    Conv2d    Conv2d    Conv2d    Dense     Output
                         Maxpool             Maxpool
Dense160       16        16        32        32        64        10
Dense240       24        24        48        48        96        10
Dense320       32        32        64        64        128       10
Dense480       48        48        96        96        192       10
Dense640       64        64        128       128       256       10

Table 1: MNIST Network Architectures: Number of Features by Layer (max-pooling follows the second and fourth convolution layers)

Figure 2: In the right and center figures, the parameter values plotted on the y-axis are on a logarithmic scale. We note that performance and parameter reduction both remain at desirable levels for a large range of λ values (over several orders of magnitude), indicating the ease of tuning the NodeDrop technique. In the leftmost figure, networks of different starting size converge to nearly the same size for a given λ. The dashed diagonal line represents networks without pruning. Note that increased initialization size has a slight effect on final size, as indicated by the slight upward slopes; the strength of this effect depends on λ.

Figure 3: Results on CIFAR10 for VGG with and without batch normalization over a spread of λ choices. Top left: classification error for VGG without batch normalization. Top right: final parameters after training using NodeDrop. Bottom left: classification error for VGG with batch normalization. Bottom right: final parameters after training using NodeDrop-BN. For both NodeDrop and NodeDrop-BN, a range of λ values is acceptable. Baseline accuracy and network size are indicated by the dashed lines.

Figure 4: Accuracy stabilizes after fewer than 100 epochs in this CIFAR10 run, indicating the NodeDrop technique does not delay performance convergence. Training for another 400 epochs helps maximize parameter reduction.

Network             Test Error   Parameters   Param. Pruned %   Factor   Nodes   Nodes Pruned %
VGG 16 w/o BN
  Baseline          13.01        15.04M       0.0               1.0      4736    0.0
                    14.14        0.45M        97.00             33.28    1115    76.46
                    13.27        0.31M        97.96             48.98    859     81.9
                    13.76        0.13M        99.12             114.00   612     87.08
                    90.00        0.0M         100.0             -        0       100.0
VGG 16
  Baseline          6.50         15.04M       0.0               1.0      4736    0.0
                    6.88         8.88M        40.7              1.69     3624    23.48
                    7.36         1.39M        90.75             10.81    1164    75.42
                    7.41         0.61M        95.96             24.76    751     84.14
                    20.16        0.10M        99.35             152.84   308     93.50
DenseNet40 w/o BN
  Baseline          14.94        1.04M        0.0               1.0      456     0.0
                    15.21        0.66M        35.69             1.55     363     20.39
                    14.74        0.41M        60.47             2.54     291     36.18
                    14.99        0.08M        91.96             12.43    154     66.22
DenseNet40
  Baseline          6.80         1.05M        0.0               1.0      456     0.0
                    7.13         0.99M        4.19              1.04     447     1.97
                    6.75         0.98M        5.67              1.06     443     2.85
                    7.79         0.55M        47.12             1.89     333     26.73

Table 2: CIFAR10 classification results. Rows below each baseline use NodeDrop (for the w/o BN networks) or NodeDrop-BN (for the batch-normalized networks) with different λ settings.
Network             Test Error   Parameters   Param. Pruned %   Factor   Nodes   Nodes Pruned %
VGG 16
  Baseline          27.65        15.04M       0.0               1.0      4736    0.0
                    27.69        9.78M        34.99             1.54     3914    17.35
                    28.04        1.83M        87.82             8.21     1623    65.73
                    38.49        0.46M        96.93             32.58    729     84.6
DenseNet40
  Baseline          26.5         1.05M        0.0               1.0      456     0.0
                    26.92        1.05M        2.27              1.02     451     1.09
                    27.01        1.03M        4.74              1.05     445     2.41
                    29.38        0.744M       31.12             1.45     376     17.54

Table 3: CIFAR100 classification results. Rows below each baseline use NodeDrop-BN with different λ settings.

Having established a theoretical basis for the NodeDrop condition and regularization technique, we now establish NodeDrop's practical viability as a method for shrinking networks. The NodeDrop technique requires two hyperparameters: λ and C. The C value is unimportant, and can be set to almost any positive value without impacting results or parameter reduction. The λ parameter, however, is crucial in determining the balance between learning the objective and dropping nodes. Therefore, we closely examine the effect that choosing different λ values has on both network performance and parameter reduction. We test many λ values on the MNIST and CIFAR10 datasets, and a few λ values on the CIFAR100 dataset.

The network initialization size should affect the number of nodes dropped. We show that if a network starts near its optimal size, NodeDrop will maintain accuracy and only drop what few nodes it can. Furthermore, we show that if a network is grossly oversized at initialization, NodeDrop will drop many nodes and converge towards the same size as a smaller network initialization. This result is desirable, as it demonstrates NodeDrop is largely unaffected by poor layer size choices. NodeDrop uses λ to determine the balance between performance and the number of nodes utilized. Therefore, a network architect using NodeDrop can afford to initialize a large network and remain confident that NodeDrop will eliminate needless nodes. Using the MNIST dataset, we demonstrate this ability by showing that networks converge to the same size from multiple initialization sizes, for a fixed λ.

Many pruning methods require an increase in training time to be effective. The NodeDrop technique does not delay performance or accuracy convergence, but in order to allow the number of network nodes to converge, one must train for a longer time. We examine the training time required for this convergence with experiments on the CIFAR10 dataset.

Most importantly, we test to ensure NodeDrop maintains performance and effectively drops nodes. We find that NodeDrop regularization does not affect a network's performance for a large swath of λ values, only reducing test accuracy if extreme values of λ are chosen.

Furthermore, we demonstrate that NodeDrop is able to reduce the parameter count of popular networks such as VGG16 by a factor of more than 100x, while continuing to maintain classification accuracy on the CIFAR10 dataset. We test NodeDrop network performance and parameter reduction on MNIST, CIFAR10, and CIFAR100.

4.1 MNIST Experiments

The MNIST dataset (LeCun & Cortes, 1998) provides an opportunity to perform a large number of experiments because of the dataset's rapid accuracy convergence. Thus, we used this dataset to sweep across λ values for five differently sized, but otherwise similar, network architectures, as shown in table 1. We demonstrate NodeDrop's ability to converge to similarly sized networks from different starting sizes.

For all MNIST experiments we used a simple network design: four convolution layers, a single fully connected layer, and an output layer. We used 3x3 filters in all convolution layers, and performed max-pooling after every second convolution layer. We varied the width of the layers in order to test the effects of changing network initialization size. We did not investigate the effects of changing network depth, but suspect that prudent selection of network depth remains important. The network architectures are described in table 1. The following hyperparameters were held fixed across all MNIST runs: learning rate, batch size, optimizer, cross-entropy loss, and number of training epochs.
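For concreteness, below is a sketch of the Dense320 variant under these architectural choices. The padding, flatten size, and use of a plain ReLU here are assumptions rather than the paper's exact configuration, which would pair the layers with the SoftClampedReLU activation and NodeDrop regularization.

```python
import torch.nn as nn

def mnist_dense320(activation: nn.Module = nn.ReLU()) -> nn.Sequential:
    """Dense320 from Table 1: conv widths 32-32-64-64, dense 128, 10 outputs.

    'Same' padding and the flatten size for 28x28 MNIST inputs are assumptions.
    """
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), activation,
        nn.Conv2d(32, 32, kernel_size=3, padding=1), activation,
        nn.MaxPool2d(2),                              # 28x28 -> 14x14
        nn.Conv2d(32, 64, kernel_size=3, padding=1), activation,
        nn.Conv2d(64, 64, kernel_size=3, padding=1), activation,
        nn.MaxPool2d(2),                              # 14x14 -> 7x7
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 128), activation,
        nn.Linear(128, 10),                           # output layer
    )
```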

4.1.1 Choosing Lambda

Choosing an appropriate value for NodeDrop's λ parameter remains an important task. In order to show that the NodeDrop technique remains robust for many selections of λ, we tested five different network initialization sizes to observe differences in convergence across λ values. The network architectures and hyperparameters are discussed in section 4.1. We tested ten different λ values spanning several orders of magnitude.

Our results indicate that easy tuning is a benefit of the NodeDrop technique. We found that λ selections across several orders of magnitude yielded desirable results, as shown in figure 2. For the largest λ values we noticed a drop in MNIST accuracy, and for the smallest we judged there to be a significant sacrifice in parameter reduction. Choosing an appropriate λ will always be dependent on both the application and the loss function. Based on these MNIST experiments, we expect the NodeDrop technique to be robust for a large range of λ selections. For a network designer using the popular cross-entropy loss objective, as we did, we suggest starting with a moderate λ within this range.

4.1.2 Network Sizes

In the previous section (4.1.1) we experimentally observed that tuning the λ parameter of the NodeDrop technique should not cause a network designer grief. In this section, we experimentally observe that choosing initialization layer sizes should also prove easy. We use the same experiments from the previous section (4.1.1), but instead plot the effect of initializing with differently sized networks. This plot, shown in figure 2, demonstrates that the NodeDrop technique converges to a similar "equilibrium" from many differently sized initial networks; the size of the final network is instead mostly dependent on λ. A network designer should err towards too large a network in order to ensure desirable performance.

4.2 CIFAR10 and CIFAR100 Experiments

4.2.1 Dataset

The CIFAR dataset (Krizhevsky et al., 2009) consists of 32x32 colored natural images. Both CIFAR10 and CIFAR100 are designed for classification, containing 10 and 100 classes respectively. There are 50,000 training images and 10,000 testing images for both. We adopt a standard data augmentation scheme in which the training images are shifted and mirrored horizontally (He et al., 2016; Liu et al., 2017).

4.2.2 Architectures and Training

We implement our technique on two standard models, VGG (Simonyan & Zisserman, 2014) and DenseNet (Gao Huang, 2017). Our VGG network is a slight variant of the standard VGG16 model. We follow the standard modification of VGG for CIFAR (Liu et al., 2017; Molchanov et al., 2017), removing the final fully connected layers of size 4096 and instead using only a single, smaller fully connected layer. We train the network using SGD with momentum, starting from an initial learning rate of 0.1 which is decayed at epochs 80 and 130. We tested both with and without batch normalization, and discovered that batch normalization is necessary for the large VGG16 initialization when applied to the more difficult CIFAR100 dataset. Therefore results without batch normalization are excluded for CIFAR100.

For DenseNet we implement the standard DenseNet-40 given in the original paper, with depth 40 and growth rate k = 12. We train the model as in the original paper, using SGD with momentum 0.9. The network is trained for 300 epochs with an initial learning rate of 0.1, decayed by 0.1 at epochs 150 and 225. As with VGG, we found that the CIFAR100 dataset required batch normalization, but we were again able to train a variant on CIFAR10 without batch normalization.

4.2.3 Lambda Parameter Tests

As with the MNIST experiments, we tested a range of λ values on CIFAR10 in order to determine the choices which suit the network and dataset well. Furthermore, here we test NodeDrop-BN, which was not tested in the MNIST experiments. Results for VGG on CIFAR10 with varying choices of λ are shown in figure 3.

For the case without batch normalization, our network maintains performance and prunes a large number of nodes over many choices of λ. As with the MNIST case, this suggests that choosing λ is relatively easy: all but the largest tested λ achieved high performance with significant pruning. For the largest λ, the regularization proved too strong, causing an entire layer to turn off, which in turn caused the network to turn off all other layers.

For NodeDrop-BN, we find that performance is maintained only up to a certain value of λ. Moreover, NodeDrop-BN requires more precise tuning than NodeDrop, as only a narrower band of λ also achieved desirable parameter reduction. Based on the above results we continue to recommend the same initial λ setting as in the MNIST experiments for the cross-entropy loss objective function.

4.2.4 Network Convergence Time

Sometimes it is important to avoid needlessly extending training time, so in this section we analyze NodeDrop's effect on training time. Using a single fixed λ, we train a network for 2000 epochs in order to observe the convergence of network parameters and performance over time. This experiment used the VGG16 network without batch normalization on the CIFAR10 dataset. Our results, shown in figure 4, indicate that while accuracy convergence is not delayed by the NodeDrop technique, one must wait longer to maximize NodeDrop's parameter reduction.

4.2.5 Parameter Reduction

Results for CIFAR10 and CIFAR100 are given in tables 2 and 3 respectively. The rows of greatest interest are those which provide the highest parameter reduction while maintaining high accuracy.

For the VGG network we are able to drop a significant number of parameters without degradation of the network's accuracy. With NodeDrop-BN, we can prune over 90 percent of the parameters for CIFAR10 and close to 88 percent for CIFAR100. With vanilla NodeDrop, we can prune over 99 percent of the parameters on CIFAR10. This suggests that VGG is a significantly oversized network for the CIFAR datasets.

It is more difficult to prune nodes from the DenseNet architecture than from VGG. We are only able to prune a small percentage of the parameters from DenseNet on CIFAR100. We believe this suggests that the DenseNet architecture is already well sized for CIFAR100. DenseNet starts at around 1 million parameters, which is close to the number of remaining parameters after our best case pruning of the VGG network.

5 Conclusion and Outlook

In this paper, we proposed the novel NodeDrop technique for reducing parameters in neural networks. The NodeDrop technique consists of a condition for identifying nodes which are guaranteed to carry no information, and a regularization term to encourage this condition to be met. We also propose a modified version of NodeDrop, NodeDrop-BN, for use in networks with batch normalization. Experiments on the MNIST and CIFAR10 datasets show that NodeDrop does not significantly increase training time, and facilitates network design with the easily tunable hyperparameter λ. With experiments on the MNIST, CIFAR10, and CIFAR100 datasets, using VGG16 and DenseNet architectures, we demonstrate that NodeDrop compares favorably with other parameter reduction techniques, reducing the number of parameters in a network by up to a factor of 114x. We hope that NodeDrop and NodeDrop-BN will prove useful in neural network design, and will help to make the implementation of neural networks on constrained systems more practical.

References

  • Alvarez & Salzmann (2016) Alvarez, J. M. and Salzmann, M. Learning the number of neurons in deep networks. Neural Information Processing Systems (NIPS), 2016.
  • Anwar et al. (2017) Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 2017.
  • Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Denton et al. (2014) Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. Neural Information Processing Systems (NIPS), 2014.
  • Gao Huang (2017) Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. Computer Vision and Pattern Recognition (CVPR), 2017.
  • Gong et al. (2014) Gong, Y., Liu, L., Yang, M., and Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
  • Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. International Conference on Machine Learning (ICML), 2015.
  • Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. Neural Information Processing Systems (NIPS), 2015.
  • Han et al. (2016) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR), 2016.
  • Hassibi & Stork (1993) Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. Neural Information Processing Systems (NIPS), 1993.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR), 2016.
  • Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. Neural Information Processing Systems (NIPS), 2016.
  • Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (ICML), 2015.
  • Kim et al. (2015) Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., and Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. International Conference on Learning Representations (ICLR), 2015.
  • Krizhevsky et al. (2009) Krizhevsky, A., Nair, V., and Hinton, G. Cifar-10 (canadian institute for advanced research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Lebedev & Lempitsky (2016) Lebedev, V. and Lempitsky, V. Fast convnets using group-wise brain damage. Computer Vision and Pattern Recognition (CVPR), 2016.
  • LeCun & Cortes (1998) LeCun, Y. and Cortes, C. MNIST handwritten digit database. 1998. URL http://yann.lecun.com/exdb/mnist/.
  • LeCun et al. (1990) LeCun, Y., Denker, J. S., Solla, S., Howard, R. E., and Jackel, L. D. Optimal brain damage. Neural Information Processing Systems (NIPS), 1990.
  • Li et al. (2017) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. International Conference on Learning Representations (ICLR), 2017.
  • Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. International Conference on Computer Vision (ICCV), 2017.
  • Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational dropout sparsifies deep neural networks. International Conference on Machine Learning (ICML), 2017.
  • Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. European Conference on Computer Vision (ECCV), 2016.
  • Reed (1993) Reed, R. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 1993.
  • Salimans & Kingma (2016) Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Neural Information Processing Systems (NIPS), 2016.
  • Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2014.
  • Srinivas & Babu (2015) Srinivas, S. and Babu, R. V. Data-free parameter pruning for deep neural networks. British Machine Vision Conference (BMVC), 2015.
  • Sun et al. (2015) Sun, Y., Wang, X., and Tang, X. Sparsifying neural network connections for face recognition. Computer Vision and Pattern Recognition (CVPR), 2015.
  • Vanhoucke et al. (2011) Vanhoucke, V., Senior, A., and Mao, M. Z. Improving the speed of neural networks on cpus. Workshop on Deep Learning and Unsupervised Feature Learning (NIPS), 2011.
  • Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. Neural Information Processing Systems (NIPS), 2016.
  • Zhou et al. (2016) Zhou, H., Alvarez, J. M., and Porikli, F. Less is more: Towards compact cnns. European Conference on Computer Vision (ECCV), 2016.