I Introduction
Convolutional neural networks (CNNs) have become a powerful technique for many computer vision tasks in recent years. Besides improving the accuracy of generic tasks such as ILSVRC
[36], CNN researchers are designing models for efficient execution of domain-specific applications on edge devices. New efficient CNN models [13, 37, 47, 27] and various pruning techniques [30, 33, 29] have been proposed. Nevertheless, these techniques face several problems when applied on devices with limited resources or customisable hardware accelerators, e.g., introducing new operations that are unavailable in the instruction set, or producing models too irregular to be highly parallelised. The motivation of our approach is to adapt grouped convolution (Figure 1) for domain-adaptive CNN deployment. Grouped convolution splits a big convolution into multiple small convolution layers of identical shape, which independently handle different depth groups separated from the input feature map and combine their results at the end of the computation. A grouped convolution layer with G groups requires 1/G of the workload of its standard counterpart, and can be processed by existing, highly optimised convolution operations in parallel. Moreover, instead of designing and training a grouped-convolution-based model for different domains with various configurations from scratch, we can learn the optimal grouped convolution from pretrained models, specifically, the mapping from pretrained weights to grouped weights.
Efficiently learning grouped convolution under our problem setting is challenging. At first glance, the mapping from a convolution layer to a grouped one can be presented as a structured pruning problem. The structural constraints, which we will refer to as the group sparsity pattern, require that the retained connections can directly form independent convolution groups. After formalising the objective and constraints, the optimisation problem turns out to be hard to solve.
This paper proposes Dokei,^1 an end-to-end method that maps a pretrained CNN model to a grouped one. Dokei targets domain-specific applications for various hardware platforms. The major contributions of this paper include:

^1 Dokei means isomorphism in Japanese, since our method explores grouped convolution with the same interface as the original model.

(a) formalisation of the proposed approach as an optimisation problem constrained by the group sparsity pattern, with efficient solutions via maximal bipartite matching;

(b) structured regularisers that reduce the optimisation difficulty and improve the final performance;

(c) an end-to-end development flow and its evaluation on various domains and hardware platforms.
Results show that Dokei can produce regular and compact models from ResNet50 for different domains; VGG16 is also evaluated in selected aspects. When compared with prior work under similar reduction rates, our resulting models offer a higher level of parallelism, fewer changes to the instruction set, and competitive accuracy.
II Background and Related Work
Essential background information related to Dokei includes grouped convolution and model pruning techniques.
II-A Grouped Convolution and Efficient CNNs
Let G denote the number of convolution groups: a grouped convolution splits the input feature map into G depth groups and performs G independent convolutions, whose results are concatenated in a specific order along the depth of the output feature map. An extreme case is depthwise convolution, in which each group has exactly one input channel [13, 2].
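As a concrete illustration of this definition, the following sketch (illustrative only, not the paper's TensorFlow implementation; `conv2d` and `grouped_conv2d` are hypothetical names) computes a grouped convolution by splitting the input depth into G parts and concatenating the per-group outputs:

```python
import numpy as np

def conv2d(x, w):
    # Naive 2D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k),
    # stride 1, no padding.
    c_out, c_in, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h, wd))
    for f in range(c_out):
        for i in range(h):
            for j in range(wd):
                y[f, i, j] = np.sum(w[f] * x[:, i:i + k, j:j + k])
    return y

def grouped_conv2d(x, ws, G):
    # ws holds G filter banks of shape (C_out/G, C_in/G, k, k); each bank
    # convolves its own depth slice, and outputs are concatenated.
    step = x.shape[0] // G
    outs = [conv2d(x[g * step:(g + 1) * step], ws[g]) for g in range(G)]
    return np.concatenate(outs, axis=0)
```

A grouped layer with G groups stores C_out · C_in · k² / G weights, i.e., 1/G of the standard layer's count, and each group is an independent, identically shaped convolution that can run in parallel.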
Grouped convolution was originally proposed in AlexNet [21] for parallel processing on GPUs. Ioannou et al. [17] integrate grouped convolution in their root module with additional 1x1 convolution; models produced by [33] share a similar architecture. ResNeXt [43], derived from ResNet [10], simply turns the 3x3 convolution layer in a bottleneck block into a grouped one; more extremely, ShuffleNet [47] groups the 1x1 convolution and makes the 3x3 convolution depthwise, with an additional channel shuffle unit that permutes along the depth of the output feature map. Built on DenseNet [16], CondenseNet [15] replaces both 1x1 and 3x3 convolution with grouped ones and adds indexing layers as well. Except for [33], which prunes towards grouped models, the others are trained from scratch; specifically, CondenseNet gradually removes connections during training to create grouped convolution. In contrast, Dokei produces models from pretrained models without appending 1x1 convolution, for computational efficiency. Dokei can replace only the 3x3 convolution, or both the 1x1 and 3x3 convolution, of the pretrained model, similar to ResNeXt and CondenseNet respectively.
II-B Network Pruning
Dokei is related to neural network pruning methods, which remove unimportant weights for efficient inference. A pruning method can be sensitivity-based or regularisation-based, and can produce structured or unstructured models.
Sensitivity-based pruning selects unimportant weights by specific criteria and removes them directly. Criteria can be based on magnitude, such as the l1 and l2 norms [28, 8, 24, 26], on first-order [30, 6] or second-order Taylor expansion [23, 9, 4], on the average percentage of zeros (APoZ) [26, 14], or even on singular values [33, 29]. Each criterion has different characteristics regarding computational efficiency and measurement accuracy. These methods are normally unstructured, except for [33], which produces structured grouped models.
As an alternative, regularisation-based methods first sparsify models through carefully designed regularisers and then prune, normally by a magnitude-based criterion. The l1 norm, also known as the lasso in the statistical learning literature [40], is studied for sparsifying CNN models in [25, 8]. An l1 regulariser can only produce unstructured, irregular models, so some other methods use group-lasso [45] to encode specific structures during regularisation, such as channel- or filter-level pruning [32, 42, 22]. Similar to our objective, CondenseNet [15] regularises towards grouped convolution with group-lasso. Optimising with lasso or group-lasso regularisers is difficult due to their non-differentiability at the optimal points; this can be handled by forward-backward splitting [49, 5].
Regarding target applications, while most current papers take a general perspective and prune models while considering ImageNet accuracy, [30, 29] take domain adaptation into account and examine how their pruning methods work for different image domains [41, 31, 44, 35, 20]. Dokei falls into the structured pruning category: it uses magnitude criteria to find optimal structured sparsity patterns (Section III-A1), and a group-lasso regulariser to pre- and post-process models for easier optimisation and better performance (Section III-C). Meanwhile, Dokei explicitly targets domain applications. A comparison between Dokei and prior work is presented in Section IV-C.
III Method
By convention, we define a domain as a set of fine-grained image categories, e.g., birds or human actions, together with the corresponding classification task. Briefly, the core of Dokei solves an optimisation problem that minimises the accuracy drop after converting a pretrained convolution layer to a grouped one. The accuracy drop is measured indirectly by a criterion, for computational efficiency. This problem is difficult to solve due to the structural constraints introduced by grouped convolution. Dokei also adopts structured regularisation to process pretrained models for easier optimisation and enhanced performance. In all, Dokei provides an end-to-end flow that coordinates these modules.
III-A Problem Formalisation
Let N = (C, G, W) denote a CNN model, where C, G, and W are the sets of convolution layer configurations, numbers of convolution groups, and weights. Suppose N has L convolution layers; for each l in {1, ..., L}: C_l = (h_l, w_l, c_l, f_l, k_l) is the configuration of the l-th convolution layer, specifying its height, width, number of channels, number of filters, and kernel size respectively; G_l is the number of convolution groups (G_l = 1 for all layers in the original model); and W_l is the set of weights. N^(0) denotes the original model. Our objective is to find a model N^(g), which has the same topology as N^(0) but replaces some convolution layers with grouped ones, to meet a specific requirement on, e.g., model size, speed, or accuracy. To reduce the size of the architectural search space, we restrict the replacement so that it does not change the layer interface, i.e., the shape of the input and output feature maps, as shown in Figure 1. Hence, C^(g) is equivalent to C^(0) since the topology does not change, while G^(g) and W^(g) should be determined through optimisation, with each G^(g)_l >= 1.
III-A1 Group Sparsity Pattern
Since Dokei works in the context of domain adaptation, the important information contained in the pretrained weights should be utilised to avoid training each candidate from scratch. We can link this problem to structured pruning as presented in Section II-B.
Let M_l denote a structured pruning pattern of the l-th convolution layer: a boolean mask with the same shape as W_l. Then W^(g)_l = M_l ⊙ W_l, where ⊙ is the element-wise product that maps the l-th layer's masked weights to grouped weights. A weight that is masked (mask value 1) will be retained in the grouped convolution, while the others will be discarded.
A valid mask M_l should satisfy the following constraints (Figure 2):

(1) the sum of each row of M_l equals f_l / G_l and the sum of each column equals c_l / G_l, i.e., each channel connects to exactly f_l / G_l filters and each filter to exactly c_l / G_l channels;

(2) if the three entries M_l[i, j], M_l[i', j], and M_l[i, j'] are positive, then M_l[i', j'] should be positive as well.^2

^2 We omit the axes for kernel elements since we presume that masks within a kernel are identical.

These constraints follow from the presumption that connections between different groups should be removed, while channels and filters shall be fully connected within a group. Masks satisfying these constraints are said to follow the group sparsity pattern (GSP).
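Both constraints can be checked mechanically. The sketch below (a hypothetical helper under our reading of the constraints, not part of Dokei's codebase) validates a boolean channel-by-filter mask against the GSP for G equally sized groups:

```python
import numpy as np

def is_gsp(M, G):
    # M: boolean (channels, filters) mask; True iff M follows the group
    # sparsity pattern for G equally sized groups.
    c, f = M.shape
    if c % G or f % G:
        return False
    # Constraint 1: each channel connects to exactly f/G filters and
    # each filter to exactly c/G channels.
    if not (M.sum(axis=1) == f // G).all():
        return False
    if not (M.sum(axis=0) == c // G).all():
        return False
    # Constraint 2 (transitivity): channels sharing any filter must share
    # all of their filters, i.e., groups are fully connected internally.
    for i in range(c):
        for j in range(c):
            if (M[i] & M[j]).any() and not (M[i] == M[j]).all():
                return False
    return True
```

Under these checks, every valid mask is a row/column permutation of a block-diagonal mask, which is what makes an independent grouped layer recoverable from it.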
III-A2 GSP Optimisation Problem
Since W^(g)_l = M_l ⊙ W_l, optimising W^(g)_l is equivalent to solving the GSP optimisation problem below. Similar to [30], we set the objective function as the absolute difference in empirical risk E on dataset D after pruning by M_l:

min_{M_l} | E(D; M_l ⊙ W_l) − E(D; W_l) |
s.t.  Σ_j M_l[i, j] = f_l / G_l, ∀i;   Σ_i M_l[i, j] = c_l / G_l, ∀j;
      ¬( g(i, i') ∧ ( M_l[i, j] ⊕ M_l[i', j] ) ), ∀i, i', j.

The first two constraints were shown before, while the last is less intuitive: ⊕ denotes the exclusive-or operator, which gives true only if its two inputs differ; g(i, i') is true if channels i and i' are allocated to the same group; and the whole logical expression means that for any two channels in the same group, their masks toward any output filter should not differ. Note that this formalisation is layer-wise, i.e., solving it only provides the optimal mask for a specific layer, and G_l is predetermined.
This problem is quite hard to solve: since constraints involving logical operators can be converted to inequalities, the problem can be formalised as an Integer Linear Programming (ILP) problem, which is known to be at least NP-complete [19], over the entries of M_l for each layer.

III-B Pruning Criteria
Another aspect of the problem's difficulty is the evaluation time of the objective function, as the training dataset would be iterated once for every step the ILP solver takes. The expected outcome of the objective can be interpreted as removing the weights that contribute least to the empirical cost; we can therefore evaluate computationally cheaper criteria that measure weight importance instead. Prior sensitivity-based CNN pruning papers (Section II-B) propose or utilise various criteria. Let C be such a criterion; the new minimisation objective becomes (1). We require C to match the channel-filter granularity. Magnitude criteria are adapted: the l1 and l2 norms respectively.
min_{M_l} Σ_{i,j} (1 − M_l[i, j]) · C(W_l[i, j])   s.t. M_l follows GSP    (1)
Directly applying these criteria to our problem is hard: GSP drastically removes many weights, and models pretrained with an l2 regulariser are not sparse enough; both hinder good performance. Figure 3 shows the distribution of criterion values, whose mode is close to, but not exactly, zero, and whose minimum, unfortunately, scales with the layer size. Even so, GSP is still necessary: in theory, a mapped grouped convolution that ignores GSP significantly changes the original connection topology and requires much more reconstruction effort (empirical results in Tables III and IV).
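For concreteness, a channel-filter magnitude criterion can be evaluated directly from a 4-D weight tensor. The sketch below (illustrative; function names are ours) computes the l1 and l2 variants by reducing over the kernel axes, matching the channel-filter granularity required above:

```python
import numpy as np

def l1_criterion(W):
    # W: (C_out, C_in, k, k); returns a (C_in, C_out) matrix whose entry
    # (i, j) is the l1 norm of the kernel connecting channel i to filter j.
    return np.abs(W).sum(axis=(2, 3)).T

def l2_criterion(W):
    # Same layout, using the l2 norm of each kernel instead.
    return np.sqrt((W ** 2).sum(axis=(2, 3))).T
```

Either matrix can then serve as the per-connection importance C in objective (1).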
III-C Structured Regularisation
We adopt structured regularisation to make the GSP optimisation easier to solve and give better results. Group-lasso [45], a lasso [40] regulariser that applies to groups of weights,^3 suits our purpose well: it penalises weights towards zero based on the property of lasso, and the penalisation acts simultaneously upon the weights of each group, which effectively makes the weights coarser-grained and reduces complexity, as shown in Section III-D. Besides reducing the optimisation difficulty, once the mask is produced by the optimiser, applying group-lasso on all unmasked weights can suppress the accuracy impact of removing them.

^3 To clarify the terminology, a group in group-lasso is a subset of weights rather than an operator in grouped convolution.
This section elaborates on this approach by first formalising group-lasso in our GSP context, then justifying its effectiveness, and finally illustrating a forward-backward splitting [7, 5, 48] based regularisation implementation.
III-C1 Group-Lasso Regulariser
Let g be a set of weight indices representing a group and 𝒢 denote a set of groups; the group-lasso regulariser on W defined by 𝒢 is then formalised as (2). Note that the dimensions for kernel elements are omitted since they are regularised as a whole.
R_𝒢(W) = Σ_{g ∈ 𝒢} ‖W_g‖_2    (2)

R_block(W) = Σ_{p=1}^{c/B_c} Σ_{q=1}^{f/B_f} ‖W_{B(p,q)}‖_2    (3)

R_mask(W) = ‖(1 − M) ⊙ W‖_2    (4)

where B(p, q) denotes the (p, q)-th B_c × B_f block of W, and M is the given mask.
We propose two ways to form the group set 𝒢, by block or by mask: the former divides a weight matrix into blocks and sets them as groups, while the latter creates a single group containing all unmasked weights once a mask is specified. Let B_c and B_f denote the block sizes along the channel and filter axes; the block-based group-lasso term is specified as (3). Given a specified mask M, the mask-based group-lasso is listed as (4).
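Both group formations can be sketched on a channel-by-filter magnitude matrix as follows (a minimal illustration under our assumptions; names and the exact normalisation are ours, since group-lasso variants often also weight each group by its size):

```python
import numpy as np

def block_group_lasso(W, bc, bf):
    # Block-based regulariser: sum of l2 norms over non-overlapping
    # bc x bf blocks of the (channels, filters) matrix W.
    c, f = W.shape
    total = 0.0
    for i in range(0, c, bc):
        for j in range(0, f, bf):
            total += np.linalg.norm(W[i:i + bc, j:j + bf])
    return total

def mask_group_lasso(W, M):
    # Mask-based regulariser: a single group holding every weight that
    # the boolean mask M does NOT retain.
    return np.linalg.norm(W[~M])
```

In training, either term would be added to the empirical cost (scaled by the regularisation strength) rather than evaluated in isolation.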
M_B[i, j] = M̂[⌈i / B_c⌉, ⌈j / B_f⌉]    (5)
Two corollaries form the basis for the design of the block-based and mask-based regularisers. According to [45], the block-based regulariser encourages sparsity at the block granularity. The other comes from [18]: under their framework, both regularisers are specific sparsity-inducing norms, and the authors argue that only the set of elements complementary to the union of the norm-specified groups is allowed to be nonzero. Note that these corollaries are merely guidance, not guarantees, for our regulariser design, since they are deduced for linear models while CNNs are nonlinear.
We devise the block-based regulariser to make weights coarser-grained; after applying it, the performance of a block-based GSP mask solution will be closer to the optimal value. A block-based solution M_B, as shown in (5), is defined over blocks; in element-wise notation, all entries within a block take the same value. Suppose M_B uses the same block setting as the regulariser. We argue that, during group-lasso regularisation by the block-based term, the difference between the criterion values of M_B and the optimal mask will be reduced (the optimal mask is better at the beginning). This argument rests on the assumption that, if the weights retained by the optimal mask are scattered over the matrix with each block containing a varying number of them, group-lasso regularisation will nullify blocks starting from those containing the fewest such weights. During this process, more of the retained weights become concentrated in the surviving blocks and M_B gets closer to the optimal mask; the gap is therefore narrowed. Figure 3(a) shows the resulting weights regularised by the block-based term. This property is the basis for our polynomial-time method.
Since the mask-based regulariser penalises all weights not covered by the mask, and, according to [18], only elements selected by the mask are allowed to be nonzero, it can nullify the unmasked weights and further reduce the criterion-value drop after layer replacement.
III-C2 Forward-Backward Splitting
A group-lasso regulariser is non-differentiable at points where the elements within any group all equal zero. Most published papers adopting group-lasso for structured pruning in deep CNNs either ignore this issue or propose ad hoc solutions, e.g., manually nullifying weights below a threshold [42, 15, 22]. We adopt the forward-backward splitting approach, which is utilised by [48] to solve a similar group-lasso problem. Based on the formalisation from [5], we list our approach in (6), where E is the empirical cost function, η and λ are the learning rate and weight decay factor respectively, and R can be either the block-based or the mask-based regulariser.
W^{t+1/2} = W^t − η ∇E(W^t),
W^{t+1} = argmin_W { (1/2) ‖W − W^{t+1/2}‖²₂ + η λ R(W) }    (6)
This method suggests that, after performing a standard gradient descent step (forward), we find a point that is both close to the intermediate result, measured by the squared Euclidean norm, and small in terms of the group-lasso regulariser (backward). It avoids directly calculating the gradient of the non-differentiable regulariser and jointly minimises the empirical cost and the regularisation term.
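A minimal sketch of one such update (our own simplification on plain numpy arrays, not the paper's TensorFlow optimiser): the backward step is the proximal operator of group-lasso, which shrinks each group's l2 norm by the step size and nullifies groups whose norm falls below it:

```python
import numpy as np

def prox_group_lasso(W, groups, step):
    # Backward step: for each group g of flat indices, scale the group by
    # max(0, 1 - step / ||W_g||_2); groups with small norms are nullified.
    W = W.copy()
    for g in groups:
        norm = np.linalg.norm(W.flat[g])
        scale = max(0.0, 1.0 - step / norm) if norm > 0 else 0.0
        W.flat[g] = W.flat[g] * scale
    return W

def fbs_step(W, grad, groups, lr, lam):
    # Forward: plain gradient descent; backward: proximal shrinkage, so the
    # non-differentiable regulariser is never differentiated.
    return prox_group_lasso(W - lr * grad, groups, lr * lam)
```

Iterating `fbs_step` drives entire groups exactly to zero while the remaining weights keep following the empirical-cost gradient.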
III-D Polynomial Solution to GSP Optimisation
Once G_l is fixed, the optimisation problem can be formalised as an ILP, as shown in Section III-A2. Our first attempt is to pass the formalised ILP directly to a solver. Although a logical expression can be turned into a combination of inequalities, the problem is NP-complete, so increasing the number of constraints per variable even slightly will drastically increase the runtime. Therefore, we turn the last constraint into (8), which encodes the same requirement with inequalities over a carefully chosen list of coefficients.
(8) 
Even so, solving this ILP with large c_l and f_l values remains intractable. Based on Section III-C1, it is possible to reduce the complexity by optimising at the block level after group-lasso regularisation. If we coarsen the granularity of a weight matrix to blocks of size (c_l / G_l) × (f_l / G_l), the problem can be solved in polynomial time: a group then has the same size as a block, so the group constraint can be removed, and the problem reduces to a basic assignment problem in which each block row and block column can have only one block selected, i.e., a block of weights is assigned to a group as a whole. In other words, this method works well when important connections are localised within specific blocks, and this localisability is encouraged by the block-based regulariser. An assignment problem can be solved efficiently by maximal bipartite matching in polynomial time,^4 e.g., using the Hopcroft-Karp algorithm [12] with time complexity O(|E|√|V|). Figure 3(b) demonstrates how, given coarse-grained weights, the optimisation becomes an assignment problem on blocks.

^4 Note that the group-lasso regularisation time is not counted.
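The block-level search can be sketched as follows. For simplicity we solve the assignment exactly by enumerating permutations, which is fine for the small group counts used here; Hopcroft-Karp or the Hungarian algorithm replaces this with a polynomial-time search for large G. Function names are ours:

```python
import numpy as np
from itertools import permutations

def assign_blocks(S):
    # S: (G, G) matrix of per-block scores (e.g., summed criterion values).
    # Returns the assignment of block-rows to block-columns that maximises
    # the total retained score, one block per row and per column.
    G = S.shape[0]
    best, best_perm = -np.inf, None
    for perm in permutations(range(G)):
        score = sum(S[i, perm[i]] for i in range(G))
        if score > best:
            best, best_perm = score, perm
    return best_perm

def assignment_to_mask(perm, bc, bf):
    # Expand a block assignment into a full channel-filter mask that
    # satisfies the group sparsity pattern by construction.
    G = len(perm)
    M = np.zeros((G * bc, G * bf), dtype=bool)
    for i, j in enumerate(perm):
        M[i * bc:(i + 1) * bc, j * bf:(j + 1) * bf] = True
    return M
```

The resulting mask is block-diagonal up to a column permutation, so it satisfies the GSP constraints without any explicit check.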
III-E The End-to-End Flow
Given a pretrained model and a group configuration that specifies G_l for each layer, the end-to-end flow of our framework Dokei is summarised as follows:

(S.1) Dokei operates layer-wise, from top to bottom, based on the idea that higher-level features learned by top layers are more redundant for specific domains [46].

(S.2) For each selected layer, we run the GSP solver to find the optimal mask. We apply structured regularisation, specifically block-based group-lasso, to reduce the computational complexity. Maximal bipartite matching can be adopted if the block size matches a group.

(S.3) Once the mask is determined, we can choose to apply mask-based regularisation to control the performance drop after removing all unmasked weights.

(S.4) Finally, we replace the current layer with the corresponding grouped convolution, and assign the masked weights to initialise it. This layer-replaced model is further fine-tuned to improve its performance.
IV Experimental Results
This section presents experimental results that demonstrate the performance of Dokei on various image classification domains. We justify the effectiveness of each processing step through several ablation studies. Finally, we compare with related work in the corresponding aspects.
Our experiments are conducted with TensorFlow [1] (version 1.10) and its high-level library TF-slim [38]. ILSVRC-2012 [36] pretrained models are downloaded from TF-slim as well. Due to the lack of support for grouped convolution in TensorFlow, we provide a naïve implementation that splits the input tensor, maps the splits to convolution groups, and permutes the results by a provided index mapping. We also implement our own forward-backward splitting optimiser in TensorFlow. By default, when regularising a specific layer by forward-backward splitting, Dokei freezes the updates of all other layers and only updates the current layer; the regularisation strength is normally set to a fixed default value. This implementation, as well as the end-to-end optimisation flow, can be found in the open-sourced codebase of Dokei.^5

^5 The URL is removed for double-blind review.

IV-A Experiments on Domain Tasks
We adopt the ILSVRC-2012 pretrained ResNet50 [10] as the baseline and apply Dokei to it for specific domains. ResNet50 occupies 98 MB of parameter storage and requires 4 GFLOPs^6 to finish one inference pass. Its topology builds upon bottleneck blocks, which stack 1x1, 3x3, and 1x1 convolution layers with residual connections. VGG16 [39] is also evaluated in some scenarios.

^6 GFLOPs = giga floating-point operations.

Based on the major objective of Dokei, we focus on image classification tasks in different domains, specified by the datasets shown in Table I. For each dataset, we stick to its default train/test split and fine-tune the last fully-connected layer of ResNet50 to obtain the baseline models. During fine-tuning, we simply apply the RMSProp optimiser [11] with exponential learning-rate decay, and train on each dataset for at least 30 epochs. For each dataset and model, Dokei works layer-wise from top to bottom.
Dataset  Cls.  Img.  Baseline  DokeiA (70% / 60%)
Birds [41]  200  6,033  77.74%  74.8%  70.2% 
Flowers [31]  102  8,189  94.61%  95.4%  94.0% 
Actions [44]  40  9,532  80.67%  74.2%  67.1% 
Indoor [35]  67  6,700  74.21%  71.5%  69.4% 
Dogs [20]  120  20,580  82.65%  75.2%  70.2% 
Group 3x3 convolution only.
We explore one universal group configuration, for either ResNet50 or VGG16, across all datasets, named DokeiA: it starts with a given number of groups for the top layers and halves that number whenever the number of channels or filters is halved. Under this configuration, all grouped convolutions have the same number of channels per group. For each layer, Dokei performs block-based group-lasso regularisation, GSP optimisation, mask-based regularisation, and final replacement and fine-tuning, following the end-to-end flow (Section III-E). Although each step could run until a specific threshold on the criterion or another measurement is met, we choose a simpler, effective approach that runs each step for 100 epochs.
We apply Dokei on ResNet50 for all domains using the DokeiA configuration. Figure 5 illustrates how accuracy and model size change while replacing layer-wise from top to bottom. The resulting model retains only a fraction of the original parameters and requires fewer GFLOPs during inference.
In general, Dokei works well in converting ResNet50 to a grouped model for different domains, even though ResNet50 is already compact. Because different domains have various levels of classification difficulty and redundancy with respect to the original fine-tuned model, the accuracy curves diverge. Even so, Dokei can adjust to various domains, primarily due to the combination of structured regularisation and efficient criterion-based GSP optimisation. In other words, the GSP optimisation produces masks that are learned from the criterion values evaluated on different domains, while structured regularisation makes those masks distinctive. Figure 6 illustrates how Dokei adapts to different domains with different GSP masks, found by maximal bipartite matching.
Besides this adaptability, the resulting model is highly regular: each grouped convolution layer contains small convolutions that share the same configuration and can run completely in parallel. Table II shows the evaluated performance of DokeiA on GPU, CPU, and an embedded FPGA.

DokeiA performs better on CPU but worse on GPU, mainly because our naïve grouped convolution implementation invokes the CUDA kernel once per group.
GPU  CPU  FPGA (est.)  

Batch size  128  32  1 
Spec.  1080Ti  X5690  ZC7020 
Baseline  0.234s  2.12s  0.808 s 
DokeiA  0.244s  1.73s  0.114 s 
DokeiB  0.239s  2.01s  0.392 s 
DokeiB additionally groups 1x1 convolution.
Beyond DokeiA, which groups only 3x3 convolution, we intend to see how well Dokei can group 1x1 convolution. This new configuration, named DokeiB, groups all qualifying layers in conv5_x [10], which include 3 3x3 and 7 1x1 convolution layers. We evaluate on Birds and compare with DokeiA on the same dataset (Figure 7); the results show that grouping the top 1x1 layers is more beneficial than grouping the lower 3x3 layers. We will explore configurations involving 1x1 convolution in future work.
IV-B Ablation Study
Training from scratch.
We train DokeiA from scratch under two different settings: either all layers, or only the 3x3 convolutions, are trained from scratch. Results (Table III) show that Dokei outperforms training from scratch by a large margin.
Random initialisation.
We also assign pretrained weights to random positions rather than solving the GSP optimisation. We evaluate on VGG16 instead, to minimise effects from residual connections. Results show that Dokei performs best; a random GSP mask is worse, but still better than one that ignores GSP (Table III).
From scratch (RN50)  Random init. (VGG16)  
Dokei  3x3  All  Dokei  w/ GSP  w/o GSP 
77.3%  49.8%  34.2%  76.4%  75.7%  74.4% 
Grouplasso effect.
We study the effect of the block-based regulariser on the criteria measured for maximal bipartite matching solutions, to justify our claims in Section III-C1. Figure 8 shows the effect on the last 3x3 convolution layer of ResNet50 on Birds; changes are significant when a larger regularisation strength is applied. Figure 3(a) also shows the weights resulting from block-based regularisation.
IV-C Comparison with Prior Work
Model  Baseline  Tech.  Before FT  After FT 

VGG16  76.26%  CdN  0.552%  76.10% 
Ours  6.81%  76.38%  
RN50  77.74%  CdN  61.25%  77.17% 
Ours  62.20%  77.28% 
Grouplasso based method.
Most other group-lasso based methods have objectives different from ours: [22, 49, 42] intend to induce channel- or filter-level sparsity, which does not lead to grouped convolution, while CondenseNet [15] regularises towards grouped convolution. There are three major differences compared with our approach:

(1) Our regulariser is built on blocks, while theirs is built on filter groups, i.e., a filter group completely separates our block along the channel axis.

(2) After group-lasso, they select the channels with the highest magnitude-based importance, regardless of the GSP constraint (Section III-A1).

(3) They handle the non-differentiability of group-lasso by nullifying weights under a given threshold.
We implement their filter-wise group-lasso regulariser and top-channel selection method in Dokei and investigate whether the GSP constraint actually matters. We consider the top-1 validation accuracy before and after fine-tuning, to separate the effect of regularisation from the post-tuning process. Without the help of residual connections, VGG16's accuracy is sensitive to regularisation and replacement. Results in Table IV show that Dokei performs better on both metrics; the after-fine-tuning accuracy differs little, since the other layers are also updated.
Lowrank decomposition method.
Another recent work featuring the ability to produce grouped convolution is [33]. It claims that, using a low-rank decomposition based filter group decomposition method, it is possible to produce a grouped model directly, without regularisation. However, this method requires appending 1x1 convolution due to its approximation scheme. This is a major drawback: even 1x1 convolution consumes many resources when the numbers of channels and filters are large, and it is redundant to add such a layer to models like ResNet50, which already surround 3x3 convolutions with 1x1 ones. Table V illustrates, under the same group settings (DokeiA, B), how many parameters and operations remain for [33] and for Dokei.
Pruning with domain adaptation.
It is difficult to compare directly with prior work on pruning for efficient domain adaptation [30, 29] due to differences in network architecture: [30] uses VGG16 and AlexNet, while [29] uses only VGG19. They both cover fully-connected layers, while we focus only on convolution. Results are in Table VI. Note that since ResNet50 is already compact, removing parameters from it will likely decrease accuracy more than for other models like VGG16. To eliminate the effect of model differences, we will compare Dokei with prior work on the same models in the future.
Dataset  Model  Top1 Accuracy  Size Red.  
Origin  Result  
Birds  VGG16[30]  72.2%  70.0%  40.0% 
VGG19[29]  55.7%  56.0%  6.81%  
RN50 (A)  77.7%  71.4%  39.2%  
RN50 (B)  77.7%  75.0%  52.5%  
Flowers  AlexNet[30]  80.1%  79.8%  41.0% 
VGG19[29]  78.8%  77.6%  14.9%  
RN50 (A)  94.6%  94.0%  41.4%  
Actions  VGG19[29]  68.7%  69.4%  29.9% 
RN50 (A)  80.7%  74.2%  30.1% 
V Conclusion
This paper presents Dokei, a domain adaptation approach for converting pretrained CNN models to ones using efficient grouped convolution. The basis of Dokei is formalised as an optimisation problem constrained by the group sparsity pattern, and practical solutions based on structured regularisation and maximal bipartite matching are provided. Results show that Dokei is effective for various domains and outperforms prior work in several respects, including model efficiency and accuracy. Even for a compact model like ResNet50, Dokei performs well.
Future work includes evaluating Dokei on ImageNet, devising a method for selecting the number of groups based on criterion values, supporting fully-connected layers, improving the grouped convolution implementation on GPU, reducing regularisation time, finding a better layer-wise replacement schedule for models with complex topology, and understanding the mechanism of layer-wise replacement involving 1x1 convolution and residual connections.
References

[1]
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,
D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster,
J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke,
V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg,
M. Wicke, Y. Yu, X. Zheng, and G. Research.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
Technical report. 
[2] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR, 2016.
[3] W. Deng, W. Yin, and Y. Zhang. Group sparse optimization by alternating direction method. 77005:88580R, 2013.
[4] X. Dong, S. Chen, and S. J. Pan. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. In NIPS, 2017.
 [5] J. Duchi and Y. Singer. Efficient Online and Batch Learning Using Forward Backward Splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
 [6] M. Figurnov, A. Ibraimova, D. Vetrov, and P. Kohli. PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions. In NIPS, 2015.
[7] T. Goldstein, C. Studer, and R. Baraniuk. A Field Guide to Forward-Backward Splitting with a FASTA Implementation. 2014.
[8] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
 [9] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, 1993.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
 [11] G. E. Hinton, N. Srivastava, and K. Swersky. Overview of minibatch gradient descent, 2012.
 [12] J. E. Hopcroft and R. M. Karp. An $n^{5/2}$ Algorithm for Maximum Matchings in Bipartite Graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
 [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR, 2017.

[14] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. 2016.
[15] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. CondenseNet: An Efficient DenseNet using Learned Group Convolutions. CoRR, abs/1711.0, 2017.
 [16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely Connected Convolutional Networks. CoRR, abs/1608.0, 2016.
 [17] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. In CVPR, 2017.
 [18] R. Jenatton, J.Y. Audibert, and F. Bach. Structured Variable Selection with SparsityInducing Norms. JMLR, 12:2777–2824, 2011.
 [19] R. M. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations, pages 85–103, 1972.

[20]
A. Khosla, N. Jayadevaprakash, B. Yao, and L. FeiFei.
Novel dataset for finegrained image categorization.
In
First Workshop on FineGrained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition
, Colorado Springs, CO, June 2011.  [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
 [22] V. Lebedev and V. Lempitsky. Fast ConvNets Using GroupWise Brain Damage. In CVPR, pages 2554–2564, 2016.
 [23] Y. LeCun, J. S. Denker, and S. a. Solla. Optimal Brain Damage. In NIPS, pages 598–605, 1990.
 [24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning Filters for Efficient Convnets. In ICLR, 2017.
 [25] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning Efficient Convolutional Networks through Network Slimming. In ICCV, pages 2755–2763, 2017.
 [26] J. H. Luo, J. Wu, and W. Lin. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In ICCV, 2017.
 [27] N. Ma, X. Zhang, H.t. Zheng, and J. Sun. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. 2018.
 [28] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. In NIPS, 2017.
 [29] M. Masana, J. V. D. Weijer, L. Herranz, A. D. Bagdanov, and J. M. Alvarez. DomainAdaptive Deep Network Compression. In ICCV, pages 4299–4307, 2017.

[30]
P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz.
Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning.
In ICLR, 2017.  [31] M.E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
 [32] W. Pan, H. Dong, and Y. Guo. DropNeuron: Simplifying the Structure of Deep Neural Networks. In NIPS, 2016.
 [33] B. Peng, W. Tan, Z. Li, S. Zhang, D. Xie, and S. Pu. Extreme Network Compression via Filter Group Approximation. In ECCV, 2018.
 [34] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In FPGA, pages 26–35, 2016.
 [35] A. Quattoni and A. Torralba. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 413–420. IEEE, 2009.
 [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. CoRR, 2018.
 [38] N. Silberman and S. Guadarrama. Tensorflowslim image classification model library, 2016.
 [39] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for LargeScale Image Recognition. In ICLR, 2015.
 [40] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.
 [41] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. CaltechUCSD Birds 200. Technical Report CNSTR2010001, California Institute of Technology, 2010.
 [42] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning Structured Sparsity in Deep Neural Networks. In NIPS, 2016.
 [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated Residual Transformations for Deep Neural Networks. CoRR, abs/1611.0, 2016.
 [44] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. FeiFei. Human action recognition by learning bases of action attributes and parts. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1331–1338. IEEE, 2011.
 [45] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 68(1):49–67, 2006.
 [46] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ECCV, 8689:818–833, 2014.
 [47] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CoRR, 2017.
 [48] H. Zhou, J. M. Alvarez, and F. Porikli. Less Is More: Towards Compact CNNs. In ECCV, 2016.
 [49] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFaNet: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients, 2016.