1 Introduction
Over the past decade, convolutional neural networks (CNNs) have been accepted as the core of many computer vision solutions. Deep models trained on a massive amount of data have delivered impressive accuracy on a variety of tasks, including but not limited to semantic segmentation, face recognition, object detection and recognition.
In spite of these successes, mobile devices cannot take much advantage of such models, mainly due to their limited computational resources. Since camera-based applications and games rely heavily on object recognition and detection techniques, it is eagerly anticipated to deploy advanced CNNs (e.g., AlexNet [15], VGGNet [25] and ResNet [10]) on tablets and smartphones. Nevertheless, as the winner of the ILSVRC-2012 competition, AlexNet comes with nearly 61 million real-valued parameters and requires 1.5 billion floating-point operations (FLOPs) to classify an image, making it resource-intensive in different aspects. Running it for real-time applications would require considerably high CPU/GPU workloads and memory footprints, which is prohibitive on typical mobile devices. A similar situation occurs with deeper networks like VGGNet and ResNet.
Recently, CNNs with binary weights have been designed to resolve this problem. By forcing the connection weights to take only two possible values (normally +1 and −1), researchers attempt to eliminate the floating-point multiplications (FMULs) required for network inference, as they are considered to be the most expensive operations. In addition, since the real-valued weights are converted to binary ones, these networks require much less space for storing their parameters, which leads to a great saving in memory footprint and thus energy cost [8]. Several methods have been proposed to train such networks [1, 2, 22]. However, the reported accuracy of the obtained models is unsatisfactory on large datasets (e.g., ImageNet) [22]. Even worse, since straightforwardly widening the networks does not guarantee any increase in accuracy [14], it is also unclear how we can make a trade-off between model precision and expected accuracy with these methods.
In this paper, we introduce network sketching as a new way of pursuing binary-weight CNNs, where the binary structures are exploited in pre-trained models rather than being trained from scratch. To seek the possibility of yielding state-of-the-art models, we propose two theoretically grounded algorithms, making it possible to regulate the precision of sketching for more accurate inference. Moreover, to further improve the efficiency of the generated models (a.k.a. sketches), we also propose an algorithm to associatively implement binary tensor convolutions, with which the required number of floating-point additions and subtractions (FADDs)^1 is likewise reduced. Experimental results demonstrate that our method works extraordinarily well on both AlexNet and ResNet. That is, with a bit more FLOPs required and a little more memory space committed, the generated sketches outperform the existing binary-weight AlexNets and ResNets by large margins, producing near state-of-the-art recognition accuracy on the ImageNet dataset.

^1 Without ambiguity, we collectively abbreviate floating-point additions and floating-point subtractions as FADDs.
The remainder of this paper is structured as follows. In Section 2, we briefly introduce the related works on CNN acceleration and compression. In Section 3, we highlight the motivation of our method and provide some theoretical analyses for its implementations. In Section 4, we introduce the associative implementation of binary tensor convolutions. Finally, Section 5 experimentally demonstrates the efficacy of our method and Section 6 draws the conclusions.
2 Related Works
The deployment problem of deep CNNs has been a concern for years. Efficient models can be learnt either from scratch or from pre-trained models. Generally, training from scratch demands a strong integration of network architecture and training policy [18], so here we mainly discuss representative works on the latter strategy.
Early works are usually hardware-specific. Not restricted to CNNs, Vanhoucke et al. [28] take advantage of programmatic optimizations to produce a speedup on x86 CPUs. On the other hand, Mathieu et al. [20] perform fast Fourier transforms (FFTs) on GPUs and propose to compute convolutions efficiently in the frequency domain. Additionally, Vasilache et al. [29] introduce two new FFT-based implementations for more significant speedups.

More recently, low-rank based matrix (or tensor) decomposition has been used as an alternative way to accomplish this task. Mainly inspired by the seminal works of Denil et al. [4] and Rigamonti et al. [23], low-rank based methods attempt to exploit parameter redundancy among different feature channels and filters. By properly decomposing the pre-trained filters, these methods [5, 12, 16, 32, 19] can achieve appealing speedups with an acceptable accuracy drop.^2

^2 Some other works concentrate on learning low-rank filters from scratch [27, 11], which is beyond the scope of our paper.
Unlike the above-mentioned ones, some research works regard memory saving as the top priority. To tackle the storage issue of deep networks, Gong et al. [6], Wu et al. [31] and Lin et al. [18] consider applying quantization techniques to pre-trained CNNs, trying to compress the networks with only minor concessions on the inference accuracy. Another powerful category of methods in this scope is network pruning. Starting from the early works of LeCun et al. [17] and Hassibi & Stork [9], pruning methods have delivered surprisingly good compressions on a range of CNNs, including advanced ones like AlexNet and VGGNet [8, 26, 7]. In addition, due to the reduction in model complexity, a fair speedup can be observed as a byproduct.
As a method for generating binary-weight CNNs, our network sketching is orthogonal to most of the existing compression and acceleration methods. For example, it can be jointly applied with low-rank based methods, by first decomposing the weight tensors into low-rank components and then sketching them. As for the cooperation with quantization-based methods, sketching first and conducting product quantization thereafter would be a good choice.
3 Network Sketching
In general, convolutional layers and fully-connected layers are the most resource-hungry components in deep CNNs. Fortunately, both of them are considered to be over-parameterized [4, 30]. In this section, we highlight the motivation of our method and present its implementation details on the convolutional layers as an example. Fully-connected layers can be treated in a similar way.
Suppose that the learnable weights of a convolutional layer can be arranged and represented as {W^(i): 1 ≤ i ≤ N}, in which N indicates the target number of feature maps, and W^(i) ∈ R^(c×w×h) is the weight tensor of its i-th filter. Storing all these weights in 32-bit floating-point format would require 32 × N × c × w × h bits of memory, and the direct implementation of the convolutions would require N × c × w × h × S FMULs (along with the same number of FADDs), in which S indicates the spatial size (i.e., the number of positions) of the target feature maps.
Since many convolutional models are believed to be informationally redundant, it is possible to seek their low-precision and compact counterparts for better efficiency. With this in mind, we consider exploiting binary structures within the W^(i) by using a divide and conquer strategy. We shall first approximate the pre-trained filters with a linear span of a certain binary basis, and then group the identical basis tensors to pursue the maximal network efficiency. Details are described in the following two subsections, in which we drop the superscript from W^(i) because the arguments apply to all of the weight tensors.
3.1 Approximating the Filters
As claimed above, our first goal is to find a binary expansion of W that approximates it well (as illustrated in Figure 2), which means W ≈ α_1 B_1 + … + α_m B_m, in which B = [B_1, …, B_m] and α = [α_1, …, α_m] are the concatenations of m binary tensors B_j ∈ {+1, −1}^(c×w×h) and the same number of scale factors α_j ∈ R, respectively. We herein investigate the appropriate choice of B and α with a fixed m. Two theoretically grounded algorithms are proposed in Sections 3.1.1 and 3.1.2, respectively.
3.1.1 Direct Approximation
For ease of understanding, we shall first introduce the direct approximation algorithm. Generally, the reconstruction error (or approximation error, round-off error) should be minimized to retain the model accuracy after expansion. However, as a concrete decision problem, directly minimizing ||W − Σ_j α_j B_j||² seems NP-hard and thus solving it can be very time consuming [3]. In order to finish up in reasonable time, we propose a heuristic algorithm, in which B_j and α_j are sequentially learnt and each of them is selected to be the current optimum with respect to the minimization criterion. That is,

(1)  B_j, α_j = argmin ||Ŵ_{j−1} − α B||²,  s.t. B ∈ {+1, −1}^(c×w×h), α ∈ R,

in which ⟨·,·⟩ denotes the tensor inner product, the norm operator is defined as ||T||² = ⟨T, T⟩ for any 3D tensor T, and Ŵ_{j−1} indicates the approximation residue after combining all the previously generated tensor(s). In particular, Ŵ_0 = W if j = 1, and

(2)  Ŵ_{j−1} = W − Σ_{k=1}^{j−1} α_k B_k

if j ≥ 2. It can be easily known, through derivative calculations, that Equation (1) is equivalent with

(3)  B_j = sign(Ŵ_{j−1})  and  α_j = ⟨B_j, Ŵ_{j−1}⟩ / n

under this circumstance, in which the function sign(·) calculates the element-wise sign of the input tensor, and n = c × w × h.
The above algorithm is summarized in Algorithm 1. It is considered to be heuristic (or greedy) in the sense that each is selected to be the current optimum, regardless of whether it will preclude better approximations later on. Furthermore, some simple deductions give the following theoretical result.
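To make the procedure concrete, the greedy loop of Algorithm 1 can be sketched in a few lines of NumPy (the function name, tensor shapes, and the tie-breaking for sign(0) are our own illustrative choices):

```python
import numpy as np

def sketch_direct(W, m):
    """Greedy direct approximation (a sketch of Algorithm 1): expand a
    real-valued filter W into m binary tensors B_j with scale factors a_j."""
    n = W.size
    residue = W.copy()
    Bs, alphas = [], []
    for _ in range(m):
        B = np.sign(residue)           # element-wise sign, as in Eq. (3)
        B[B == 0] = 1                  # break sign(0) ties arbitrarily
        a = float(np.sum(B * residue)) / n   # optimal scale for this B
        Bs.append(B)
        alphas.append(a)
        residue = residue - a * B      # update the approximation residue
    return Bs, alphas
```

Each iteration removes the component of the residue best explained by a single scaled binary tensor, which is exactly the greedy behavior discussed above.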
Theorem 1.
For any W ∈ R^(c×w×h), Algorithm 1 achieves a reconstruction error satisfying

(4)  ||W − Σ_{j=1}^{m} α_j B_j||² ≤ ||W||² (1 − 1/n)^m.
3.1.2 Approximation with Refinement
We can see from Theorem 1 that, by utilizing the direct approximation algorithm, the reconstruction error decays exponentially with a rate proportional to 1/n. That is, given a W of small size (i.e., when n is small), the approximation in Algorithm 1 can be pretty good with only a handful of binary tensors. Nevertheless, when n is relatively large, the reconstruction error will decrease slowly and the approximation can be unsatisfactory even with a large number of binary tensors. In this section, we propose to refine the direct approximation algorithm for a better reconstruction property.
Consider that, in Algorithm 1, both B_j and α_j are chosen to minimize the current reconstruction error with their counterparts fixed. However, in most cases, it is doubtful whether the obtained B and α are optimal overall. If not, we may need to refine at least one of them for the sake of a better approximation. On account of the computational simplicity, we turn to a specific case in which B is fixed. That is, supposing the direct approximation has already produced B_1, …, B_m, we hereby seek another scale vector α = [α_1, …, α_m]^T satisfying α = argmin ||vec(W) − B α||². To achieve this, we follow the least squares regression method and get

(7)  α = (B^T B)^{−1} B^T vec(W),

in which the vec(·) operator gets a column vector whose elements are taken from the input tensor, and B gets [vec(B_1), …, vec(B_m)].
The above algorithm with scale factor refinement is summarized in Algorithm 2. Intuitively, the refinement operation attributes a memory-like feature to our method, and the following theorem ensures that it converges faster in comparison with Algorithm 1.
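The refinement step of Equation (7) amounts to an ordinary least-squares solve over the fixed binary basis. A minimal sketch (the function name is ours, and we use `lstsq` rather than forming the normal-equation inverse explicitly, which is numerically safer but equivalent for a full-rank basis):

```python
import numpy as np

def refine_scales(W, Bs):
    """Least-squares refinement of the scale vector (the Eq. (7) step of
    Algorithm 2): with the binary tensors fixed, solve
    min_a ||vec(W) - B a||^2 for B = [vec(B_1), ..., vec(B_m)]."""
    B = np.stack([b.ravel() for b in Bs], axis=1)   # n x m design matrix
    a, *_ = np.linalg.lstsq(B, W.ravel(), rcond=None)
    return a
```

Because the refined scales are optimal for the fixed basis, the resulting reconstruction error can never exceed that of the per-tensor scales produced by the direct algorithm.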
Theorem 2.
Proof.
To simplify the notations, let us further denote and , then we can obtain by block matrix multiplication and inverse that,
(9) 
in which , , and . Consequently, we have the following equation for ,
(10) 
by defining . As we know, given positive semidefinite matrices and , . Then, Equation (10) gives,
Obviously, it follows that,
(11) 
in which . Since represents the squared Euclidean norm of an orthogonal projection of , it is not difficult to prove , and then the result follows. ∎
3.2 Geometric Interpretation
After expanding the pre-trained filters, we can group the identical binary tensors to save some more memory. In this paper, the whole technique is named network sketching, and the generated binary-weight model is straightforwardly called a sketch. Next we shall interpret the sketching process from a geometric point of view.
For a start, we should be aware that Equation (1) is essentially seeking a linear subspace, spanned by a set of n-dimensional binary vectors, that minimizes its Euclidean distance to vec(W). In concept, there are two variables to be determined in this problem. Both Algorithms 1 and 2 solve it in a heuristic way, and the j-th binary vector is always estimated by minimizing the distance between itself and the current approximation residue. What makes them different is that Algorithm 2 takes advantage of the linear span of its previous estimations for a better approximation, whereas Algorithm 1 does not.

Let us now take a closer look at Theorem 2. Compared with Equation (4) in Theorem 1, the distinction of Equation (8) mainly lies in the existence of an extra term. Clearly, Algorithm 2 will converge faster than Algorithm 1 as long as this term is strictly positive. Geometrically speaking, if we consider the involved matrix as that of an orthogonal projection onto the span of the previous estimations, then the term equals the squared Euclidean norm of a vector projection. Therefore, it vanishes if and only if the current residue is orthogonal to that span, or in other words, orthogonal to each of its elements, which occurs with extremely low probability. That is, Algorithm 2 will probably prevail over Algorithm 1 in practice.

4 Speeding up the Sketch Model
Using Algorithm 1 or 2, one can easily get a set of binary tensors to represent W, which means the storage requirement of the learnable weights is reduced by a factor of roughly 32/m (from 32-bit floating-point values to m binary tensors plus m scale factors). When applying the model, the required number of FMULs is also significantly reduced, by a factor of c × w × h / m. Probably, the only side effect of sketching is some increase in the number of FADDs, as it poses an extra burden on the computing units.
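Under the standard counting (32-bit floats replaced by m binary tensors, and the c·w·h FMULs per output element replaced by m scale multiplications), the reduction factors can be sanity-checked in a few lines. The layer shape below is an illustrative AlexNet-like choice, not a quote from the paper:

```python
# Hypothetical AlexNet-like layer: c input channels, w x h kernel, N filters.
c, w, h, N = 48, 5, 5, 256
m = 3                                # number of binary tensors per filter

# 32-bit floats -> m binary tensors (the m scale factors per filter are
# negligible for large c*w*h, so we omit them from the ratio here)
storage_factor = 32.0 / m

# per output element: c*w*h FMULs -> m scale multiplications
fmul_factor = (c * w * h) / m
```

With m = 3 this gives roughly a 10.7× storage reduction and, for the shape above, a 400× FMUL reduction, which illustrates why the growth in FADDs becomes the dominant cost.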
In this section, we try to ameliorate this defect and introduce an algorithm to further speed up the binary-weight networks. We start from the observation that, although the required number of FADDs grows monotonically with m, the inherent number of addends and augends is always fixed for a given input. That is, some repetitive FADDs exist in the direct implementation of binary tensor convolutions. Let us denote by X an input sub-feature map, and see Figure 3 for a schematic illustration.
4.1 Associative Implementation
To properly avoid redundant operations, we first present an associative implementation of the multiple convolutions on X, in which the connection among different convolutions is fully exploited. To be more specific, our strategy is to perform the convolutions in a hierarchical and progressive way, such that each convolution result is used as a baseline for the following calculations. Suppose the j-th convolution is calculated in advance and produces X ∗ B_j; then the convolution of X and B_k can be derived by using

(12)  X ∗ B_k = X ∗ B_j + X ∗ (B_k ⊕ B_j),

or alternatively,

(13)  X ∗ B_k = −(X ∗ B_j) + X ∗ (B_k ⊕ ¬B_j),

in which ¬ denotes the element-wise not operator, and ⊕ denotes an element-wise operator whose behavior is in accordance with Table 1.

Since ⊕ produces ternary outputs on each index position, we can naturally regard its evaluation as an iteration of switch … case … statements. In this manner, only the entries corresponding to nonzero outputs of ⊕ need to be operated on, and thus acceleration is gained. Assuming that the inner product of B_j and B_k equals s, then (n − s)/2 and (n + s)/2 FADDs are still required for calculating Equation (12) and (13), respectively. Obviously, we expect |s| to be close to n for the possibility of fewer FADDs, and thus faster convolutions in our implementation. In particular, if s ≥ 0, Equation (12) is chosen for better efficiency; otherwise, Equation (13) should be chosen.
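For a single flattened input patch, the associative update of Equations (12) and (13) might look as follows. The helper name is ours, and the ⊕ operator is modeled directly as element-wise subtraction (resp. addition) over the ±1 entries, which is one consistent reading of its ternary behavior:

```python
import numpy as np

def assoc_dot(x, B_prev, s_prev, B_next):
    """Associative evaluation of <x, B_next> given s_prev = <x, B_prev>
    (a sketch of Equations (12)/(13)): only the positions where the two
    binary tensors differ (or agree) need fresh additions."""
    diff = B_next != B_prev
    if np.count_nonzero(diff) <= diff.size // 2:
        # Eq. (12)-style: s_next = s_prev + <x, B_next - B_prev>,
        # touching only the differing positions (B_next - B_prev is +-2 there)
        return s_prev + float(np.sum(x[diff] * (B_next[diff] - B_prev[diff])))
    # Eq. (13)-style: s_next = -s_prev + <x, B_next + B_prev>,
    # touching only the agreeing positions (B_next + B_prev is +-2 there)
    agree = ~diff
    return -s_prev + float(np.sum(x[agree] * (B_next[agree] + B_prev[agree])))
```

Whichever branch touches fewer positions is taken, mirroring the s ≥ 0 decision rule above.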
4.2 Constructing a Dependency Tree
Our implementation works by properly rearranging the binary tensors and implementing the binary tensor convolutions in an indirect way. For this reason, along with Equations (12) and (13), a dependency tree is also required to drive it. In particular, dependency is the notion that certain binary tensors are linked, specifying which convolution to perform first and which follows up. For instance, with the depth-first-search strategy, the tree in Figure 4 indicates first to calculate the convolution at its root, then to derive a child's result from the previous one, then the next result on the basis of that, and so forth. By traversing the whole tree, all the convolutions will be progressively and efficiently calculated.
In fact, our algorithm works with a stochastically given tree, but a dedicated tree is still in demand for its optimum performance. Let G = (V, E) be an undirected weighted graph with the vertex set V and a weight matrix of pairwise distances. Each element of V represents a single binary tensor, and each weight measures the distance between two chosen tensors. To keep in line with the previous subsection, we here define the distance function of the following form

(14)  r(B_j, B_k) = (n − ⟨B_j, B_k⟩) / (2n),

in which ⟨B_j, B_k⟩ indicates the inner product of B_j and B_k. Clearly, the defined function is a metric on V and its range is restricted in [0, 1]. Recall that we expect the inner product to be close to n in the previous subsection. In consequence, the optimal dependency tree should have the shortest distance from the root to each of its vertices, and thus the minimum spanning tree (MST) of G is what we want.
From this perspective, we can use some off-the-shelf algorithms to construct such a tree. Prim’s algorithm [21] is chosen in this paper on account of its efficiency on dense graphs. With the obtained tree, one can implement our algorithm easily, and the whole process is summarized in Algorithm 3. Note that, although the fully-connected layers calculate vector-matrix multiplications, they can be considered as a bunch of tensor convolutions. Therefore, in the binary case, we can also make accelerations in the fully-connected layers by using Algorithm 3.
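A naive version of this construction can be sketched as follows, with the pairwise distance taken as the normalized number of differing entries (our reading of Equation (14)). The names and the O(m³) scan are our simplifications; Prim's algorithm admits far more efficient bookkeeping:

```python
import numpy as np

def prim_dependency_tree(Bs):
    """Build a dependency tree over binary tensors with (naive) Prim's
    algorithm. Returns a child -> parent map; convolving each parent before
    its children realizes the associative evaluation order."""
    m = len(Bs)
    flat = np.stack([b.ravel() for b in Bs])
    n = flat.shape[1]
    D = (n - flat @ flat.T) / (2.0 * n)   # pairwise distances, Eq. (14)-style
    parent = {0: None}                    # tensor 0 as the (arbitrary) root
    in_tree = [0]
    while len(in_tree) < m:
        best = None
        for u in in_tree:                 # cheapest edge leaving the tree
            for v in range(m):
                if v in parent:
                    continue
                if best is None or D[u, v] < D[best[0], best[1]]:
                    best = (u, v)
        u, v = best
        parent[v] = u
        in_tree.append(v)
    return parent
```

A depth-first traversal of the returned map then drives the associative convolutions of Section 4.1.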
5 Experimental Results
In this section, we empirically analyze the proposed algorithms. For pragmatic reasons, all the experiments are conducted on the famous ImageNet ILSVRC-2012 database [24] with advanced CNNs and the open-source Caffe framework [13]. The training set comprises 1.2 million labeled images and the test set comprises 50,000 validation images.

In Sections 5.1 and 5.2, we test the performance of the sketching algorithms (i.e., Algorithms 1 and 2) and the associative implementation of convolutions (i.e., Algorithm 3) in the sense of filter approximation and computational efficiency, respectively. Then, in Section 5.3, we evaluate the whole-net performance of our sketches and compare them with other binary-weight models.
5.1 Efficacy of Sketching Algorithms
As a starting experiment, we consider sketching the famous AlexNet model [15]. Although it is the champion solution of the ILSVRC-2012 competition, AlexNet seems to be very resource-intensive, so it is indeed appealing to seek its low-precision and efficient counterparts. As claimed in Section 1, AlexNet is an 8-layer model with 61M learnable parameters. Layer-wise details are shown in Table 2, and the pre-trained reference model is available online.^3

^3 https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
Layer Name  Filters  Params (bits)  FLOPs

Conv1  96  1M  211M 
Conv2  256  10M  448M 
Conv3  384  28M  299M 
Conv4  384  21M  224M 
Conv5  256  14M  150M 
Fc6  1  1208M  75M 
Fc7  1  537M  34M 
Fc8  1  131M  8M 
Using Algorithms 1 and 2, we are able to generate binary-weight AlexNets with different precisions. Theoretical analyses have been given in Section 3, so in this subsection we shall empirically analyze the proposed algorithms. In particular, we demonstrate in Figure 6 how “energy” accumulates with a varying size of memory commitment for different approximators. Defined as the portion of ||W||² retained by the approximation, the accumulative energy is negatively correlated with the reconstruction error [32], so the faster it increases, the better. In the figure, our two algorithms are named “Sketching (direct)” and “Sketching (refined)”, respectively. To compare with other strategies, we also test a stochastically generated binary basis (named “Sketching (random)”) as used in [14], and the network pruning technique [8], which is not naturally orthogonal with our sketching method. The scale factors for “Sketching (random)” are calculated by Equation (7) to ensure its optimal performance.
We can see that, consistent with the theoretical results, Algorithm 1 converges much more slowly than Algorithm 2 on all the learnable layers, making it less effective on the filter approximation task. On the other hand, Algorithm 1 still compares favorably with “Sketching (random)” and the pruning technique: with a small working memory, our direct approximation algorithm approximates better, although as the memory size increases, the pruning technique may converge faster to its optimum.
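For reference, the accumulative-energy curve we plot can be computed as follows, under our reading that the energy after j terms is the fraction of ||W||² retained by the j-term approximation (the function name is ours):

```python
import numpy as np

def accumulative_energy(W, Bs, alphas):
    """Accumulative-energy curve of a sketch: for each prefix of the
    expansion, report 1 - ||W - sum_{k<=j} a_k B_k||^2 / ||W||^2."""
    total = float(np.sum(W ** 2))
    approx = np.zeros_like(W)
    curve = []
    for a, B in zip(alphas, Bs):
        approx = approx + a * B
        curve.append(1.0 - float(np.sum((W - approx) ** 2)) / total)
    return curve
```

For the greedy expansions above, the curve is non-decreasing in the number of terms, which is why the plots in Figure 6 rise monotonically with memory commitment.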
As discussed earlier, the parameter m balances the model accuracy and efficiency in our algorithms. Figure 6 shows that a small m (for example, m = 3) should be adequate for AlexNet to attain over 80% of the accumulative energy in its refined sketch. Let us take layers “Conv2” and “Fc6” as examples and see Table 3 for more details.
Layer Name  Energy (%)  Params (bits)  FMULs

Conv2_sketch  82.9  0.9M  560K 
Fc6_sketch  94.0  114M  12K 
5.2 Efficiency of Associative Manipulations
The associative implementation of binary tensor manipulations (i.e., Algorithm 3) is directly tested on the 3-bit refined sketch of AlexNet. To begin with, we still focus on “Conv2_sketch” and “Fc6_sketch”. To be clear, we produce the results of Algorithm 3 with both a stochastically generated dependency tree and a dedicatedly calculated MST, while the direct implementation results are compared as a benchmark. All the implementations require the same number of FMULs, as demonstrated in Table 3, but significantly different numbers of FADDs, as compared in Table 4. Note that, in the associative implementations, some logical evaluations and operations are specially involved. Nevertheless, they are much less expensive than the FADDs and FMULs [22], by at least an order of magnitude, hence their cost is not deeply analyzed in this subsection.^4

^4 Since the actual speedups vary dramatically with the architecture of the processing units, we do not measure them in this paper.
Implementation  Conv2_sketch  Fc6_sketch 

Direct  672M  113M 
Associative (random)  328M  56M 
Associative (MST)  265M  49M 
From the above table, we know that our associative implementation largely reduces the required number of FADDs on “Conv2_sketch” and “Fc6_sketch”. That is, it properly ameliorates the adverse effect of network sketching and enables us to evaluate the 3-bit sketch of AlexNet without any unbearable increase in the required amount of computation. In addition, the MST helps to further improve the performance, finally yielding 2.5× and 2.3× reductions on the two layers, respectively. Results on all the learnable layers are summarized in Figure 5.
5.3 Wholenet Performance
Having tested Algorithms 1, 2 and 3 on the basis of their own criteria, it is time to compare the whole-net performance of our sketch with that of other binary-weight models [1, 22]. Inspired by the previous experimental results, we still use the 3-bit (direct and refined) sketches for evaluation, as they are both very efficient and accurate. Considering that the fully-connected layers of AlexNet contain more than 95% of its parameters, we should try sketching them to an extreme, namely 1 bit. Similar to Rastegari et al. [22], we also keep the “fc8” layer (or say, the output layer) at its full precision, and we report the top-1 and top-5 inference accuracies. However, distinguished from their method, we sketch the “conv1” layer as well because it is also compute-intensive (as shown in Table 2).
Model  Params (bits)  Top-1 (%)  Top-5 (%)

Reference  1951M  57.2  80.3 
Sketch (ref.)  193M  55.2  78.8 
Sketch (dir.)  193M  54.4  78.1 
BWN [22]  190M  53.8  77.0 
BC [1]  189M  35.4  61.0 
To avoid the propagation of reconstruction errors, we need to somehow fine-tune the generated sketches. Naturally, there are two protocols that help accomplish this task: one is known as projected gradient descent, and the other is stochastic gradient descent with full-precision weight update [1]. We choose the latter by virtue of its better convergence. The training batch size is set to 256 and the momentum to 0.9. We let the learning rate drop every 20,000 iterations from 0.001, which is one tenth of the original value as set in Krizhevsky et al.’s paper [15], and use only the center crop for accuracy evaluation. After 70,000 iterations in total (i.e., roughly 14 epochs), our sketches can make faithful inference on the test set, and the refined model is better than the direct one. As shown in Table 5, our refined sketch of AlexNet achieves a top-1 accuracy of 55.2% and a top-5 accuracy of 78.8%, which means it outperforms the recently released models named BinaryConnect (BC) [1] and Binary-Weight Network (BWN) [22] by large margins, while the required number of parameters exceeds theirs only a little.

The network pruning technique also achieves compelling results on compressing AlexNet. However, it demands a lot of extra space for storing parameter indices, and more importantly, even the optimal pruning methods perform mediocrely on convolutional layers [8, 7]. In contrast, network sketching works sufficiently well on both layer types. Here we also test its efficacy on ResNet [10]. Equipped with many more convolutional layers than AlexNet, ResNet won the ILSVRC-2015 classification competition. There are many instantiations of its architecture, and for fair comparisons, we choose the type-B version with 18 layers (as with Rastegari et al. [22]).
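The fine-tuning protocol we follow, stochastic gradient descent with full-precision weight update [1], keeps a real-valued copy of the weights: gradients are evaluated at the sketched weights but applied to the copy, which is then re-sketched. A minimal single-step sketch, with a stand-in gradient function and our direct approximation as the quantizer (names and the absence of momentum are our simplifications):

```python
import numpy as np

def finetune_step(W_fp, grad_fn, lr, m=3):
    """One fine-tuning step with full-precision weight update: the forward
    and backward passes use the sketched (binarized) weights, but the
    gradient is applied to the full-precision copy. grad_fn stands in for
    backpropagation through the rest of the network."""
    # sketch the current full-precision weights (direct approximation, m terms)
    res, approx = W_fp.copy(), np.zeros_like(W_fp)
    for _ in range(m):
        B = np.sign(res)
        B[B == 0] = 1
        a = float(np.sum(B * res)) / res.size
        approx += a * B
        res -= a * B
    g = grad_fn(approx)        # gradient evaluated at the sketched weights...
    return W_fp - lr * g       # ...but applied to the full-precision copy
```

Keeping the full-precision copy is what lets many small gradient steps accumulate into eventual sign flips of the binary tensors, which a projected update would discard.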
A pre-trained Torch model is available online,^5 and we convert it into an equivalent Caffe model before sketching.^6 For the fine-tuning process, we set the training batch size to 64 and let the learning rate drop from 0.0001. After 200,000 iterations (i.e., roughly 10 epochs), the generated sketch achieves a top-1 accuracy of 67.8% and a top-5 accuracy of 88.4% on the ImageNet dataset. Refer to Table 6 for a comparison of the classification accuracy of different binary-weight models.

^5 https://github.com/facebook/fb.resnet.torch/tree/master/pretrained
^6 https://github.com/facebook/fb-caffe-exts

Model  Params (bits)  Top-1 (%)  Top-5 (%)

Reference  374M  68.8  89.0 
Sketch (ref.)  51M  67.8  88.4 
Sketch (dir.)  51M  67.3  88.2 
BWN [22]  28M  60.8  83.0 
6 Conclusions
In this paper, we introduce network sketching as a novel technique for pursuing binary-weight CNNs. It is more flexible than the currently available methods, enabling researchers and engineers to regulate the precision of the generated sketches and achieve a better trade-off between model efficiency and accuracy. Both theoretical and empirical analyses have been given to validate its efficacy. Moreover, we also propose an associative implementation of binary tensor convolutions to further speed up the sketches. After all these efforts, we are able to generate binary-weight AlexNets and ResNets that make both efficient and faithful inference on the ImageNet classification task. Future work shall include exploring the sketching results of other CNNs.
References
 [1] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), 2015.
 [2] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 [3] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive approximation, 13(1):57–98, 1997.

 [4] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NIPS), 2013.
 [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NIPS), 2014.
 [6] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 [7] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems (NIPS), 2016.
 [8] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
 [9] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NIPS), 1993.

 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [11] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training CNNs with low-rank filters for efficient image classification. In International Conference on Learning Representations (ICLR), 2016.
 [12] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference (BMVC), 2014.
 [13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia Conference (MM), 2014.
 [14] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Local binary convolutional neural networks. arXiv preprint arXiv:1608.06049, 2016.
 [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), 2012.
 [16] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 [17] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), 1989.

 [18] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning (ICML), 2016.
 [19] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [20] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
 [21] R. C. Prim. Shortest connection networks and some generalizations. Bell system technical journal, 36(6):1389–1401, 1957.
 [22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279v2, 2016.
 [23] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
 [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
 [26] S. Srinivas and R. V. Babu. Datafree parameter pruning for deep neural networks. In British Machine Vision Conference (BMVC), 2015.
 [27] C. Tai, T. Xiao, Y. Zhang, X. Wang, and W. E. Convolutional neural networks with lowrank regularization. In International Conference on Learning Representations (ICLR), 2016.
 [28] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, Advances in Neural Information Processing Systems (NIPS Workshop), 2011.
 [29] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. In International Conference on Learning Representations (ICLR), 2015.
 [30] A. Veit, M. Wilber, and S. Belongie. Residual networks are exponential ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
 [31] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [32] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.