1 Introduction
Neural networks are a class of parametric models that achieve state-of-the-art results across a broad range of tasks, but their heavy computational requirements hinder practical deployment on resource-constrained devices, such as mobile phones, Internet-of-Things (IoT) devices, and offline embedded systems. Many recent works focus on alleviating these computational burdens, mainly falling under two non-mutually exclusive categories: manually designing resource-efficient models and automatically compressing popular architectures. In the latter category, increasingly sophisticated techniques have emerged li2017pruning ; liu2017learning ; louizos2017bayesian , which have achieved respectable accuracy–efficiency operating points, some even Pareto-better than that of the original network; for example, network slimming liu2017learning reaches an error rate of 6.20% on CIFAR-10 using VGGNet simonyan2014very with a 51% FLOPs reduction, an error decrease of 0.14% over the original.
However, few techniques impose a FLOPs constraint as part of a single optimization objective. Budgeted super networks veniat2018learning are closely related to this work, incorporating FLOPs and memory-usage objectives as part of a policy gradient-based algorithm for learning sparse neural architectures. MorphNets gordon2018morphnet apply an $L_1$-norm, shrinkage-based relaxation of a FLOPs objective, but for the purpose of searching over and training multiple models to find good network architectures; in this work, we learn a sparse neural network in a single training run. Other papers directly target device-specific metrics, such as energy usage yang2017designing , but the pruning procedure does not explicitly include the metrics of interest in the optimization objective, instead using them as heuristics. Falling short of continuously deploying a model candidate and measuring actual inference time, as in time-consuming neural architecture search tan2018mnasnet , we believe that the number of FLOPs is a reasonable proxy for actual latency and energy usage; across variants of the same architecture, Tang et al. tang2018experimental suggest that the number of FLOPs is a stronger predictor of energy usage and latency than the number of parameters. Indeed, there are compelling reasons to optimize the number of FLOPs as part of the training objective: first, it permits FLOPs-guided compression in a more principled manner; second, practitioners can directly specify a desired FLOPs target, which is important in deployment. Thus, our main contribution is a novel extension of the prior state of the art louizos2018learning that incorporates the number of FLOPs into the optimization objective, furthermore allowing practitioners to set and meet a desired compression target.
2 FLOPs Objective
Formally, we define the FLOPs objective as follows:
\[ \mathcal{T}(h_\theta) := f\big(\mathbb{1}[\theta \neq 0]\big) \tag{1} \]
where $\mathcal{T}(h_\theta)$ is the number of FLOPs associated with hypothesis $h_\theta$, $f$ is a function with the explicit dependencies, and $\mathbb{1}[\cdot]$ is the indicator function. We assume $\mathcal{T}$ to depend only on whether parameters are nonzero, e.g., through the number of active neurons in a neural network. For a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, our empirical risk thus becomes
\[ \mathcal{R}(\theta; \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h_\theta(x_i), y_i\big) + \lambda_1 \max\big(\mathcal{T}(h_\theta) - \lambda_2,\, 0\big) \tag{2} \]
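For concreteness, the clipped penalty term in Eq. (2) can be sketched in a few lines of code (a minimal sketch with hypothetical names; `lam1` scales the penalty and `lam2` is the FLOPs target):

```python
def flops_penalty(model_flops: float, lam1: float, lam2: float) -> float:
    """Clipped (hinge-style) FLOPs penalty: zero whenever the model is at
    or below the target lam2, and linear in the excess FLOPs otherwise."""
    return lam1 * max(model_flops - lam2, 0.0)
```

For example, a model at 1.3M FLOPs with a 1M target and a penalty strength of $10^{-6}$ incurs a penalty of 0.3, while any model at or under the target incurs none.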
Hyperparameters $\lambda_1$ and $\lambda_2$ control the strength of the FLOPs objective and the FLOPs target, respectively. The second term is a black-box function whose combinatorial nature prevents gradient-based optimization; thus, using the same procedure as in prior art louizos2018learning , we relax the objective to a surrogate of the evidence lower bound with a fully-factorized spike-and-slab posterior as the variational distribution, where the addition of the clipped FLOPs objective can be interpreted as a sparsity-inducing prior. Let $\mathbf{z} = (z_1, \ldots, z_{|\theta|})$ be Bernoulli random variables parameterized by $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_{|\theta|})$:
\[ \mathcal{R}(\theta, \boldsymbol{\pi}; \mathcal{D}) = \mathbb{E}_{\mathbf{z} \sim \mathrm{Bern}(\boldsymbol{\pi})}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h_{\theta \odot \mathbf{z}}(x_i), y_i\big) + \lambda_1 \max\big(\mathcal{T}(h_{\theta \odot \mathbf{z}}) - \lambda_2,\, 0\big) \right] \tag{3} \]
where $\odot$ denotes the Hadamard product. To allow for efficient reparameterization and exact zeros, Louizos et al. louizos2018learning propose using a hard concrete distribution as the approximation, a stretched and clipped version of the binary Concrete distribution maddison2016concrete : if $s \sim \mathrm{BinConcrete}(\alpha, \beta)$, then $z$ is said to be a hard concrete r.v., given the stretch parameters $\zeta > 1$ and $\gamma < 0$. Define $g(\cdot) := \min(1, \max(0, \cdot))$, and let $\bar{s} := s(\zeta - \gamma) + \gamma$ and $z := g(\bar{s})$. Then, the approximation becomes
\[ \mathcal{R}(\theta, \boldsymbol{\alpha}; \mathcal{D}) = \mathbb{E}_{\mathbf{s} \sim q(\mathbf{s} \mid \boldsymbol{\alpha}, \beta)}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h_{\theta \odot g(\bar{\mathbf{s}})}(x_i), y_i\big) \right] + \lambda_1\, \mathbb{E}_{\mathbf{z} \sim \mathrm{Bern}(\boldsymbol{\pi})}\!\left[ \max\big(\mathcal{T}(h_{\theta \odot \mathbf{z}}) - \lambda_2,\, 0\big) \right] \tag{4} \]
where $\pi_j := \sigma(\log \alpha_j - \beta \log(-\gamma/\zeta))$ is the probability of a gate being nonzero under the hard concrete distribution. In the second expectation, it is more efficient to sample from the equivalent Bernoulli parameterization than from the hard concrete distribution, which is more computationally expensive to sample multiple times. The first term now allows for efficient optimization via the reparameterization trick
kingma2013auto ; for the second, we apply the score function estimator (REINFORCE) williams1992simple , since the FLOPs objective is, in general, non-differentiable and thus precludes the reparameterization trick. High variance is a non-issue, because the number of FLOPs is fast to compute, allowing many samples to be drawn. At inference time, the deterministic estimator is $\hat{z} := \min\big(1, \max\big(0, \sigma(\log\alpha)(\zeta - \gamma) + \gamma\big)\big)$ for the final parameters $\theta \odot \hat{\mathbf{z}}$.

FLOPs under group sparsity. In practice, computational savings are achieved only if the model is sparse across "regular" groups of parameters, e.g., each filter in a convolutional layer. Thus, each computational group uses one hard concrete r.v. louizos2018learning : in fully-connected layers, one per input neuron; in 2D convolution layers, one per output filter. Under the convention in the literature where one addition and one multiplication each count as a FLOP, the number of FLOPs for a 2D convolution layer given a random draw $\mathbf{z}$ is then defined (assuming unit stride) as
\[ \mathcal{T}_{\text{conv}}(\mathbf{z}) = 2\, \|\mathbf{z}\|_0\, K_w K_h C\, (I_w - K_w + 2P_w + 1)(I_h - K_h + 2P_h + 1) \]
for kernel width and height $K_w$ and $K_h$, input width and height $I_w$ and $I_h$, padding width and height $P_w$ and $P_h$, and number of input channels $C$. The number of FLOPs for a fully-connected layer is $\mathcal{T}_{\text{fc}}(\mathbf{z}) = 2\, \|\mathbf{z}\|_0\, N$, where $\|\mathbf{z}\|_0$ is the number of active input neurons and $N$ is the number of output neurons. Note that these are conventional definitions in neural network compression papers; the objective can easily use instead the number of FLOPs incurred by other device-specific algorithms. Thus, at each training step, we compute the FLOPs objective by sampling from the Bernoulli r.v.'s and using the aforementioned definitions, e.g., $\mathcal{T}_{\text{conv}}$ for convolution layers. Then, we apply the score function estimator to the FLOPs objective as a black-box estimator.

3 Experimental Results
We report results on MNIST, CIFAR-10, and CIFAR-100, training multiple models on each dataset corresponding to different FLOPs targets. We follow the same initialization and hyperparameters as Louizos et al. louizos2018learning , using Adam kingma2014adam with temporal averaging for optimization, the same weight decay, and an initial $\log\alpha$ that corresponds to the original dropout rate of that layer. We similarly choose $\beta = 2/3$, $\gamma = -0.1$, and $\zeta = 1.1$. For brevity, we direct the interested reader to their repository for specifics: https://github.com/AMLabAmsterdam/L0_regularization. In all of our experiments, we replace the original $L_0$ penalty with our FLOPs objective, and we train all models for 200 epochs; at epoch 190, we prune the network by removing weights associated with zeroed gates, replace the r.v.'s with their deterministic estimators, and then fine-tune for 10 more epochs. For the score function estimator, we draw 1000 samples at each optimization step; this procedure is fast and has no visible effect on training time.
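The score function estimator described above can be sketched as follows (a minimal NumPy sketch with hypothetical names, not the authors' code; it assumes the hard concrete hyperparameters $\beta = 2/3$, $\gamma = -0.1$, $\zeta = 1.1$ from Louizos et al.'s setup, and estimates the gradient of the expected clipped FLOPs penalty with respect to each gate's $\log\alpha$):

```python
import numpy as np

# Assumed hard concrete hyperparameters (defaults from Louizos et al.'s setup).
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_probs(log_alpha):
    """P(gate != 0) under the hard concrete distribution."""
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA))

def reinforce_flops_grad(log_alpha, flops_fn, lam1, lam2, n_samples, rng):
    """Score-function (REINFORCE) estimate of the gradient, w.r.t. log_alpha,
    of E_{z ~ Bern(pi)}[ lam1 * max(flops_fn(z) - lam2, 0) ], using the
    identity d log p(z | pi) / d log_alpha = z - pi for Bernoulli gates."""
    pi = gate_probs(log_alpha)
    # Draw Bernoulli gate samples, shape (n_samples, n_gates).
    z = (rng.random((n_samples, log_alpha.size)) < pi).astype(float)
    # Per-sample clipped FLOPs penalty, shape (n_samples,).
    penalty = lam1 * np.maximum(flops_fn(z) - lam2, 0.0)
    # Average penalty-weighted score over samples.
    return (penalty[:, None] * (z - pi)).mean(axis=0)
```

As a sanity check, with $\lambda_2 = 0$ and a FLOPs function linear in the gates, the exact gradient is $\lambda_1 f\, \pi_j(1 - \pi_j)$ per gate, which the estimator recovers with many samples. A control-variate baseline would normally be used to reduce variance, but as noted above, FLOPs are cheap to evaluate, so simply drawing many samples suffices.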
Model | Architecture | Err. | FLOPs
GL wen2016learning | 3-12-192-500 | 1.0% | 205K
GD srinivas2016generalized | 7-13-208-16 | 1.1% | 254K
SBP neklyudov2017structured | 3-18-284-283 | 0.9% | 217K
BC-GNJ louizos2017bayesian | 8-13-88-13 | 1.0% | 290K
BC-GHS louizos2017bayesian | 5-10-76-16 | 1.0% | 158K
$L_0$ louizos2018learning | 20-25-45-462 | 0.9% | 1.3M
$L_0$-sep louizos2018learning | 9-18-65-25 | 1.0% | 403K
Ours ($\lambda_2$ target) | 3-13-208-500 | 0.9% | 218K
Ours ($\lambda_2$ target) | 3-8-128-499 | 1.0% | 153K
Ours ($\lambda_2$ target) | 2-7-112-478 | 1.1% | 111K
Table 1: Comparison of LeNet-5-Caffe results on MNIST
We choose a fixed $\lambda_1$ in all of the experiments for LeNet-5-Caffe, the Caffe variant of LeNet-5. We observe that our methods (Table 1, bottom three rows) achieve accuracy comparable to that of previous approaches while using fewer FLOPs, with the added benefit of providing a tunable "knob" for adjusting the FLOPs target. Note that the convolution layers are the most aggressively compressed, since they are responsible for most of the FLOPs in this model.
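To illustrate why the convolutions dominate, the per-layer definitions from Section 2 can be applied to the unpruned model (a sketch under the one-add-plus-one-multiply convention stated there, assuming the standard 20-50-800-500 LeNet-5-Caffe shapes with stride 1, no padding, and pooling handled outside these counts):

```python
def conv_flops(n_active_filters, in_channels, k_w, k_h, i_w, i_h, p_w=0, p_h=0):
    """FLOPs of a stride-1 2D convolution: one add and one multiply per
    weight per output position, summed over the active (unpruned) filters."""
    out_w, out_h = i_w - k_w + 2 * p_w + 1, i_h - k_h + 2 * p_h + 1
    return 2 * n_active_filters * in_channels * k_w * k_h * out_w * out_h

def fc_flops(n_active_inputs, n_outputs):
    """FLOPs of a fully-connected layer over the active input neurons."""
    return 2 * n_active_inputs * n_outputs

# Unpruned LeNet-5-Caffe: conv1 1->20 (5x5) on the 28x28 input, conv2 20->50
# (5x5) on the 12x12 pooled map, then 800->500 and 500->10 fully-connected.
layers = {
    "conv1": conv_flops(20, 1, 5, 5, 28, 28),
    "conv2": conv_flops(50, 20, 5, 5, 12, 12),
    "fc1": fc_flops(800, 500),
    "fc2": fc_flops(500, 10),
}
```

Under this convention, conv2 alone accounts for roughly 70% of the total, so zeroing one of its filters saves far more FLOPs than zeroing an fc1 input neuron.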
Method | CIFAR-10 Err. | E[FLOPs] | FLOPs | CIFAR-100 Err. | E[FLOPs] | FLOPs
Orig. | 4.00% | 5.9B | 5.9B | 21.18% | 5.9B | 5.9B
Orig. w/ dropout | 3.89% | 5.9B | 5.9B | 18.85% | 5.9B | 5.9B
$L_0$* | 3.83% | 5.3B | 5.9B | 18.75% | 5.3B | 5.9B
$L_0$-small* | 3.93% | 5.2B | 5.9B | 19.04% | 5.2B | 5.9B
Ours ($\lambda_2$ target) | 3.82% | 3.9B | 4.6B | 18.93% | 3.9B | 4.6B
Ours ($\lambda_2$ target) | 3.91% | 2.4B | 2.4B | 19.48% | 2.4B | 2.4B
Table 2: Comparison of WRN-28-10 results on CIFAR-10 and CIFAR-100
"Orig." in Table 2 denotes the original WRN-28-10 model zagoruyko2016wide , and * refers to the $L_0$-regularized models louizos2018learning ; as in that work, we augment CIFAR-10 and CIFAR-100 with standard random cropping and horizontal flipping. For each of our results (last two rows), we report the median error rate of five different runs, executing a total of 20 runs across two models for each of the two datasets; we use a fixed $\lambda_1$ in all of these experiments. We also report both the expected FLOPs and the actual FLOPs: the former denotes the number of FLOPs, on average, at training time under stochastic gates, and the latter the number of FLOPs at inference time. We restrict the FLOPs calculations to the penalized non-residual convolution layers only. For CIFAR-10, our approach results in Pareto-better models, with decreases in both error rate and the actual number of inference-time FLOPs. For CIFAR-100, we do not achieve a Pareto-better model, since our approach trades accuracy for improved efficiency; the acceptability of this tradeoff depends on the end application.
References

(1) Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
(2) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
(3) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
(4) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient ConvNets. In International Conference on Learning Representations, 2017.
(5) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
(6) Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
(7) Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations, 2018.
(8) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
(9) Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, pages 6775–6784, 2017.
(10) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
(11) Suraj Srinivas and R. Venkatesh Babu. Generalized dropout. arXiv:1611.06791, 2016.
(12) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv:1807.11626, 2018.
(13) Raphael Tang, Weijie Wang, Zhucheng Tu, and Jimmy Lin. An experimental analysis of the power consumption of convolutional neural networks for keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5479–5483, 2018.
(14) Tom Veniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3492–3500, 2018.
(15) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
(16) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
(17) Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6071–6079, 2017.
(18) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv:1605.07146, 2016.