Learning Grouped Convolution for Efficient Domain Adaptation

Ruizhe Zhao, et al., 11/23/2018

This paper presents Dokei, an effective supervised domain adaptation method to transform a pre-trained CNN model to one involving efficient grouped convolution. The basis of this approach is formalised as a novel optimisation problem constrained by group sparsity pattern (GSP), and a practical solution based on structured regularisation and maximal bipartite matching is provided. We show that it is vital to keep the connections specified by GSP when mapping pre-trained weights to grouped convolution. We evaluate Dokei on various domains and hardware platforms to demonstrate its effectiveness. The models resulting from Dokei are shown to be more accurate and slimmer than prior work targeting grouped convolution, and more regular and easier to deploy than other pruning techniques.


I Introduction

Convolutional Neural Networks (CNNs) have become a powerful technique for many computer vision tasks in recent years. Besides improving the accuracy of generic tasks such as ILSVRC [36], CNN researchers are designing models for efficient execution of domain-specific applications on edge devices. New efficient CNN models [13, 37, 47, 27] and various pruning techniques [30, 33, 29] have been proposed. Nevertheless, these techniques have several problems when applied to devices with limited resources or customisable hardware accelerators, e.g., introducing new operations that are unavailable in the instruction set, or resulting in models too irregular to be highly parallelised.

Fig. 1: Standard and grouped convolution. (a) Standard convolution layer: all input channels are correlated with all output filters; the kernel shape is flattened. (b) Grouped convolution: different groups are coloured correspondingly, and the channels and filters are split into consecutive parts. Grouped weights are mapped from the original weights (dotted) in a diagonal style. Blue regions denote receptive fields and corresponding kernels.

The motivation of our approach is to adapt grouped convolution (Figure 1) for domain-adaptive CNN deployment. Grouped convolution splits a big convolution into multiple small convolution layers of identical shape, which independently handle different depth groups separated from the input feature map and combine their results at the end of the computation. A G-grouped convolution layer requires 1/G of the workload of its standard counterpart, and its groups can be processed by existing, highly optimised convolution operations in parallel. Moreover, instead of designing and training a grouped-convolution-based model for different domains and configurations from scratch, we can learn the optimal grouped convolution, specifically the mapping from pre-trained weights to grouped weights, from pre-trained models.
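As a quick sanity check on this reduction, the following sketch (with hypothetical layer dimensions of our choosing) counts the weights and multiply-accumulate operations of a standard versus a G-grouped 3x3 layer; both ratios come out to exactly G:

    # A hypothetical layer: 56x56 output feature map, 256 input channels,
    # 256 filters, 3x3 kernels, and G = 8 groups.
    H, W, C, F, K, G = 56, 56, 256, 256, 3, 8

    std_params = K * K * C * F                    # standard convolution weights
    grp_params = G * K * K * (C // G) * (F // G)  # G independent small convolutions

    std_macs = H * W * std_params                 # multiply-accumulates per inference
    grp_macs = H * W * grp_params

    print(std_params // grp_params, std_macs // grp_macs)  # both ratios equal G = 8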

Efficiently learning grouped convolution under our problem setting is challenging. At first glance, the mapping from a convolution layer to a grouped one can be presented as a structured pruning problem, whose structural constraints, which we call the group sparsity pattern, require that the retained connections directly form independent convolution groups. Once the objective and constraints are formalised, the resulting optimisation problem turns out to be hard to solve.

This paper proposes Dokei (the name means isomorphism in Japanese, since our method explores grouped convolution with the same interface as the original model), an end-to-end method that maps a pre-trained CNN model to a grouped one. Dokei targets domain-specific applications on various hardware platforms. The major contributions of this paper include:

  1. formalisation of the proposed approach as an optimisation problem constrained by group sparsity pattern, with efficient solutions by maximal bipartite matching;

  2. structured regularisers that reduce the optimisation difficulty and improve the final performance;

  3. an end-to-end development flow and its evaluation on various domains and hardware platforms.

Results show that Dokei can produce regular and compact models from ResNet-50 for different domains; VGG-16 is also evaluated in some respects. Compared with prior work at similar reduction rates and accuracy, our resulting models have a higher level of parallelism, require fewer changes to the instruction set, and achieve competitive results.

II Background and Related Work

Essential background information related to Dokei includes grouped convolution and model pruning techniques.

II-A Grouped Convolution and Efficient CNN

Let G denote the number of convolution groups: a G-grouped convolution splits the input feature map into G depth groups and performs G independent convolutions, whose results are concatenated in a specific order along the depth of the output feature map. An extreme case is depthwise convolution, in which each group has exactly one input channel [13, 2].

Grouped convolution was originally proposed in AlexNet [21] for parallel processing on GPUs. Ioannou et al. [17] integrate grouped convolution in a root module with additional 1x1 convolution; models produced by [33] share a similar architecture. ResNeXt [43], derived from ResNet [10], simply turns the 3x3 convolution layer in a bottleneck block into a grouped one, while, more extremely, ShuffleNet [47] groups the 1x1 convolution and makes the 3x3 convolution depthwise, with an additional channel shuffle unit that permutes along the depth of the output feature map. Built on DenseNet [16], CondenseNet [15] replaces both 1x1 and 3x3 convolution with grouped ones and adds indexing layers as well. Except for [33], which prunes towards grouped models, these models are trained from scratch; CondenseNet, specifically, gradually removes connections to create grouped convolution during training. In contrast, Dokei produces models from pre-trained models without appending 1x1 convolution, for computational efficiency. Dokei can replace 3x3 only or both 1x1 and 3x3 convolution of the pre-trained model, similar to ResNeXt and CondenseNet respectively.

II-B Network Pruning

Dokei is related to neural network pruning methods, which remove unimportant weights for efficient inference. A pruning method can be sensitivity-based or regularisation-based, and can produce structured or unstructured models.

Sensitivity-based pruning selects unimportant weights by specific criteria and removes them directly. Criteria include magnitude, such as the l1 or l2 norm [28, 8, 24, 26], first-order [30, 6] or second-order [23, 9, 4] Taylor expansion, average percentage of zeros (APoZ) [26, 14], and even singular values [33, 29]. Each criterion has different characteristics regarding computational efficiency and accuracy of measurement. These methods normally produce unstructured models, except [33], which produces structured grouped models.

As an alternative approach, regularisation-based methods first sparsify models through carefully designed regularisers and then prune, normally by a magnitude-based criterion. The l1 norm, also known as lasso in the statistical learning literature [40], is studied for sparsifying CNN models in [25, 8]. An l1 regulariser can only produce unstructured, irregular models, while some other methods use group-lasso [45] to encode specific structures during regularisation, such as channel- or filter-level pruning [32, 42, 22]. Similar to our objective, CondenseNet [15] regularises towards grouped convolution with group-lasso. Optimising with lasso or group-lasso regularisers is difficult due to their non-differentiability at optimal points; this can be addressed by forward-backward splitting [49, 5].

Regarding target applications, most existing papers prune models from a general perspective, considering ImageNet accuracy, while [30, 29] take domain adaptation into account and examine how their pruning methods work for different image domains [41, 31, 44, 35, 20].

Dokei falls into the structured pruning category and uses magnitude criteria to find optimal structured sparsity patterns (Section III-A1), as well as a group-lasso regulariser to pre- and post-process models for easier optimisation and better performance (Section III-C). Meanwhile, Dokei explicitly targets domain applications. A comparison between Dokei and prior work is presented in Section IV-C.

III Method

By convention, we define a domain as a set of fine-grained image categories, e.g., birds or human actions, and the corresponding classification task. Briefly, the core of Dokei is solving an optimisation problem that minimises the accuracy drop after converting a pre-trained convolution layer into a grouped one; for computational efficiency, the accuracy drop is measured indirectly by a criterion. This problem is difficult to solve due to the structural constraints introduced by grouped convolution. Dokei also adopts structured regularisation to process pre-trained models, for easier optimisation and better performance. Overall, Dokei provides an end-to-end flow that coordinates these modules.

III-A Problem Formalisation

We describe a CNN model by (C, G, W): the sets of convolution layer configurations, numbers of convolution groups, and weights respectively. Suppose the model has L convolution layers; for each layer i, c_i specifies its height, width, number of channels, number of filters, and kernel size; g_i is its number of convolution groups, with g_i = 1 for every layer of the original model; and W_i is its set of weights. Our objective is to find an adapted model (C', G', W') that has the same topology as the original but replaces some convolution layers with grouped ones, so as to meet a specific requirement such as model size, speed, or accuracy. To reduce the size of the architectural search space, we restrict the replacement so that it does not change the layer interface, i.e., the shape of the input and output feature maps, as shown in Figure 1. Hence C' is identical to C since the topology does not change, while G' and W' must be determined through optimisation.

III-A1 Group Sparsity Pattern

Fig. 2: Given g_i, we create a GSP mask and apply it to transform the weights into g_i weight groups. (a) Original weights, showing only the channel and filter axes; (b) the mask, with different groups marked in different shades; (c) the masked weights; (d) the resulting weight groups.

Since Dokei works in the context of domain adaptation, the important information contained in the pre-trained weights should be utilised to avoid training each candidate from scratch. We can link this problem to structured pruning, as presented in Section II-B.

Let M_i denote a structured pruning pattern of the i-th convolution layer, a boolean mask with the same shape as the weights W_i. The grouped weights are obtained from the element-wise product of the mask and the original weights through a mapping from masked weights to convolution groups. A weight whose mask value is 1 will be retained in the grouped convolution, while the others will be discarded.

A valid mask should satisfy these constraints (Figure 2):

  • The sum over each channel row of the mask equals the number of filters per group, and the sum over each filter column equals the number of channels per group.

  • If the mask entries for (c1, f1), (c1, f2), and (c2, f1) are all positive, the entry for (c2, f2) must be positive as well. (We omit the axes for kernel elements since we presume that masks within a kernel are identical.)

These constraints follow from the presumption that connections between different groups should be removed, while channels and filters within a group remain fully connected. Masks satisfying these constraints are said to follow the group sparsity pattern (GSP).
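As an operational restatement of these constraints, the following NumPy sketch (our own helper, not taken from the paper's codebase) checks whether a boolean channel-by-filter mask is a valid GSP:

    import numpy as np

    def satisfies_gsp(mask, g):
        """Check whether a boolean C x F mask is a valid group sparsity pattern."""
        mask = np.asarray(mask, dtype=bool)
        C, F = mask.shape
        if C % g or F % g:
            return False
        # Channels sharing an identical set of connected filters form a group.
        groups = {}
        for c in range(C):
            groups.setdefault(tuple(mask[c]), []).append(c)
        if len(groups) != g:
            return False
        filter_sets = [np.flatnonzero(np.array(p)) for p in groups]
        # Equal-sized groups, equal-sized filter sets, no filter shared across groups.
        return (all(len(chs) == C // g for chs in groups.values())
                and all(len(fs) == F // g for fs in filter_sets)
                and len(set(np.concatenate(filter_sets))) == F)

    # A block-diagonal mask is a valid GSP; a random mask almost surely is not.
    g, C, F = 4, 8, 8
    block = np.kron(np.eye(g, dtype=int), np.ones((C // g, F // g), dtype=int)).astype(bool)
    print(satisfies_gsp(block, g))                        # True
    print(satisfies_gsp(np.random.rand(C, F) > 0.5, g))   # False (almost surely)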

III-A2 GSP Optimisation Problem

Since the grouped weights are fully determined by the mask and the pre-trained weights, optimising for them is equivalent to solving the GSP optimisation problem below. Similar to [30], we set the objective function as the absolute difference in empirical risk on the dataset after pruning by the mask. The first two constraints were introduced above, while the last is less intuitive: it uses an exclusive-or operator, which is true only if its two inputs differ, together with a predicate indicating whether two channels are allocated to the same group; the whole logical expression states that for any two channels in the same group, their mask entries towards any output filter must not differ. Note that this formalisation is layer-wise, i.e., solving it only provides the optimal mask for a specific layer, and g_i is pre-determined.

s.t.

This problem is hard to solve: since constraints involving logical operators can be converted into inequalities, the problem is actually an Integer Linear Programming (ILP) problem, which is known to be NP-complete [19], in the mask variables of each layer.

III-B Pruning Criteria

Fig. 3: Example criterion values in ResNet-50, pre-trained on ILSVRC-2012 and adapted to Birds [41]. (a) The distribution (PDF) of criterion values for selected ResNet-50 layers; (b) the minimal objective achievable for various numbers of groups.

Another aspect of the difficulty is the evaluation time of the objective function: the training dataset would have to be iterated once for every candidate the ILP solver evaluates. The intent of the objective is to remove the weights that contribute least to the empirical cost, so we can instead evaluate computationally cheaper criteria that measure weight importance. Prior sensitivity-based CNN pruning papers (Section II-B) propose or utilise various such criteria. Letting a criterion replace the empirical risk, the new minimisation objective becomes (1). We require the criterion to match the channel-filter granularity. Two magnitude criteria are adopted, based on the l1 and l2 norms respectively.

(1)
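For concreteness, a small sketch of our own (assuming a kernel laid out as K x K x C x F, as in TensorFlow) that evaluates such a magnitude criterion at the channel-filter granularity:

    import numpy as np

    def channel_filter_criterion(weights, ord=1):
        """Collapse a K x K x C x F kernel into a C x F importance matrix.

        Each entry aggregates the l1 (ord=1) or l2 (ord=2) magnitude of the
        K x K kernel slice connecting one input channel to one output filter.
        """
        k1, k2, C, F = weights.shape
        flat = np.abs(weights).reshape(k1 * k2, C, F)
        return flat.sum(axis=0) if ord == 1 else np.sqrt((flat ** 2).sum(axis=0))

    crit = channel_filter_criterion(np.random.randn(3, 3, 64, 128), ord=2)
    print(crit.shape)  # (64, 128): one importance value per channel-filter pair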

Directly applying these criteria to our problem is still hard: GSP removes a drastic number of weights, and models pre-trained with an ordinary weight-decay regulariser are not sparse enough; both hinder good performance. Figure 3 shows the distribution of criterion values, whose mode is close to but not exactly zero, and the minimal objective, which unfortunately scales with the number of groups. Even so, GSP remains necessary: a mapped grouped convolution that ignores GSP significantly changes the original connection topology and requires much more reconstruction effort (empirical results in Tables III and IV).

III-C Structured Regularisation

We adopt structured regularisation to make the GSP optimisation easier to solve and to give better results. Group-lasso [45], a lasso [40] regulariser that applies to groups of weights (to clarify the terminology, a group in group-lasso is a subset of weights rather than an operator in grouped convolution), suits our purpose well: it penalises weights towards zero, as lasso does, and the penalty acts simultaneously upon all weights in a group, which effectively makes the weights coarser-grained and reduces complexity, as shown in Section III-D. Besides reducing the optimisation difficulty, once the mask is produced by the optimiser, applying group-lasso to all unmasked weights can suppress the accuracy impact of removing them.

This section elaborates on this approach by first formalising group-lasso in our GSP context, then justifying its effectiveness, and finally illustrating a regularisation implementation based on forward-backward splitting [7, 5, 48].

III-C1 Group-Lasso Regulariser

Let a group be a set of weight indices, and let a group set collect such groups; the group-lasso regulariser defined by a group set is formalised as (2). Note that the dimensions for kernel elements are omitted since they are regularised as a whole.

(2)
(3)
(4)

We propose two ways to form the group set, by block or by mask: the former divides a weight matrix into blocks and sets them as groups, while the latter creates a single group that contains all unmasked weights once a mask is specified. With block sizes given along the channel and filter dimensions, the block-based group-lasso term is specified as (3). Given a specified mask, the mask-based group-lasso is listed as (4).
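A sketch of both penalty terms (our own NumPy rendering of the definitions above, with the block sizes as free parameters; kernel elements are regularised as a whole):

    import numpy as np

    def block_group_lasso(weights, bc, bf):
        """Block-based group-lasso: one l2 term per bc x bf channel-filter block."""
        _, _, C, F = weights.shape
        total = 0.0
        for c in range(0, C, bc):
            for f in range(0, F, bf):
                total += np.sqrt(np.sum(weights[:, :, c:c + bc, f:f + bf] ** 2))
        return total

    def mask_group_lasso(weights, mask):
        """Mask-based group-lasso: one l2 term over all weights NOT kept by the mask."""
        removed = ~np.asarray(mask, dtype=bool)   # C x F, True = to be nullified
        return np.sqrt(np.sum(weights[:, :, removed] ** 2))

    w = np.random.randn(3, 3, 64, 64)
    gsp_mask = np.kron(np.eye(4, dtype=int), np.ones((16, 16), dtype=int)).astype(bool)
    print(block_group_lasso(w, bc=16, bf=16))
    print(mask_group_lasso(w, gsp_mask))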

(5)

Two corollaries are the basis for the design of the block-based and mask-based regularisers. According to [45], the block-based regulariser encourages sparsity at the block granularity. The other comes from [18]: under their framework, both regularisers are specific sparsity-inducing norms, and only the set of elements complementary to the union of the norm-specified groups is allowed to be non-zero. Note that these corollaries are merely guidance, not guarantees, for our regulariser design, since they are deduced for linear models while CNNs are non-linear.

We devise the block-based regulariser to make the weights coarser-grained: after applying it, the block-based GSP mask solution performs closer to the optimal value. A block-based solution, as shown in (5), is defined over blocks; in other words, in element-wise notation, every mask entry within a block takes the same value. Suppose the block-based solution uses the same block setting as the regulariser. We argue that during group-lasso regularisation, the difference between the criterion values of the block-based solution and the optimal fine-grained solution will be reduced (the fine-grained solution is at least as good at the beginning). This argument rests on the following assumption: if the entries retained by the fine-grained solution are scattered over the matrix, so that each block contains anywhere from none to all of them, group-lasso regularisation will nullify weights starting from the blocks containing the fewest retained entries. During this process, the retained entries become concentrated in fewer blocks, the block-based solution approaches the fine-grained one, and the gap narrows. Figure 4(a) shows weights regularised in this way. This property is the basis for our polynomial-time method.

Since the mask-based regulariser penalises all weights not covered by the mask and, according to [18], only elements selected by the mask are then allowed to be non-zero, it can nullify the unmasked weights and further reduce the criterion drop after layer replacement.

III-C2 Forward-Backward Splitting

A group-lasso regulariser is non-differentiable at points where all elements within a group equal zero. Most published papers that adopt group-lasso for structured pruning in deep CNNs either ignore this issue or propose ad hoc solutions, e.g., manually nullifying weights below a threshold [42, 15, 22]. We adopt the forward-backward splitting approach, which is utilised by [48] to solve a similar group-lasso problem. Based on the formalisation from [5], we state our approach as (6), in terms of the empirical cost function, the learning rate, and the weight-decay factor; the regulariser can be either the block-based or the mask-based group-lasso.

(6)

This method performs a standard gradient descent step (forward), then finds a point that is both close to the intermediate result, measured by the squared Euclidean norm, and small in terms of the group-lasso regulariser (backward). It avoids directly calculating the gradient of the non-differentiable regulariser and jointly minimises the empirical cost and the regulariser.

(7)

We derive the solution to (6) as (7), based on the regularisation solutions from [5, 3]; the update acts on the weights within each group. For the block-based regulariser, the update is performed simultaneously within each block, while for the mask-based one we optimise a single group that contains all unmasked weights.
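The backward step has a well-known closed form for group-lasso: each group is scaled towards zero and vanishes once its norm falls below the threshold. A minimal sketch of one forward-backward iteration on a toy problem of our own (the threshold stands for the product of learning rate and regularisation strength):

    import numpy as np

    def group_shrink(w_group, threshold):
        """Proximal (backward) step for one group-lasso group: block soft-thresholding."""
        norm = np.linalg.norm(w_group)
        if norm <= threshold:
            return np.zeros_like(w_group)
        return (1.0 - threshold / norm) * w_group

    # Forward step: ordinary gradient descent; backward step: shrink each group.
    # Toy quadratic cost and two hypothetical groups of a weight vector.
    w = np.array([0.5, -0.4, 0.05, -0.02])
    grad = w - np.array([1.0, 0.0, 0.0, 0.0])   # gradient of 0.5*||w - target||^2
    w = w - 0.1 * grad                          # forward
    for g in ([0, 1], [2, 3]):                  # backward, group by group
        w[g] = group_shrink(w[g], threshold=0.1 * 0.5)
    print(w)                                    # the small second group is nullified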

III-D Polynomial Solution to GSP Optimisation

Fig. 4: Illustration of ResNet-50 convolution weights regularised by the block-based group-lasso, and how a grouped convolution can be efficiently solved by maximal bipartite matching on coarse-grained weights. (a) Regularised weights exhibiting a pattern of zero strips; (b) the maximal bipartite matching result.

Once the criterion is fixed, the optimisation problem can be formalised as an ILP problem, as shown in Section III-A2. Our first attempt is to pass the formalised ILP directly to a solver. Although a logical expression can be turned into a combination of inequalities, the problem is NP-complete, and increasing the number of constraints per variable even slightly drastically increases the runtime. Therefore, we rewrite the last constraint as (8), which requires far fewer repeated constraints and uses a carefully chosen list of coefficients.

(8)

Even so, solving this ILP with large channel and filter counts is still intractable. Based on Section III-C1, it is possible to reduce the complexity by optimising at the block level after group-lasso regularisation. If we coarsen the granularity of the weight matrix so that a block has the same size as a group, the problem can be solved in polynomial time: the group constraint can be dropped, and the problem reduces to a basic assignment problem in which each row and column can have only one block selected, i.e., a block of weights is assigned to a group as a whole. In other words, this method works well when important connections are localised within specific blocks, and this localisability is encouraged by the block-based group-lasso. An assignment problem can be solved by maximal bipartite matching in polynomial time (the group-lasso regularisation time is not counted), e.g., using the Hopcroft-Karp algorithm [12] with time complexity O(|E| sqrt(|V|)). Figure 4(b) demonstrates how, given coarse-grained weights, the optimisation becomes an assignment problem on blocks.
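The block-level search can be phrased directly as a weighted assignment and handed to a standard solver; a sketch of our own, using SciPy's linear_sum_assignment rather than an explicit Hopcroft-Karp implementation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def block_gsp_mask(crit, g):
        """Solve the block-level GSP problem as an assignment problem.

        crit: C x F channel-filter criterion matrix. The block size equals the
        group size (C/g x F/g). Returns a boolean C x F GSP mask keeping the g
        blocks of maximal total criterion, one per block-row and block-column.
        """
        C, F = crit.shape
        bc, bf = C // g, F // g
        blocks = crit.reshape(g, bc, g, bf).sum(axis=(1, 3))  # g x g block importances
        rows, cols = linear_sum_assignment(-blocks)           # maximise retained importance
        mask = np.zeros((C, F), dtype=bool)
        for r, c in zip(rows, cols):
            mask[r * bc:(r + 1) * bc, c * bf:(c + 1) * bf] = True
        return mask

    mask = block_gsp_mask(np.abs(np.random.randn(64, 128)), g=4)
    print(mask.sum(axis=1)[:4])   # every channel keeps F/g = 32 filters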

III-E The End-to-End Flow

Given a pre-trained model and a group configuration that specifies the number of groups for each layer, the end-to-end flow of our framework Dokei is summarised as follows:

  1. Dokei operates layer-wise, from top to bottom, based on the idea that the higher-level features learned by top layers are more redundant for specific domains [46].

  2. For each selected layer, we run the GSP solver to find the optimal mask. We apply structured regularisation, specifically block-based group-lasso, to reduce the computational complexity; maximal bipartite matching can be used when the block size matches a group.

  3. Once the mask is determined, we can choose to apply mask-based regularisation to control the performance drop caused by removing all unmasked weights.

  4. Finally, we replace the current layer with the corresponding grouped convolution and assign the retained weights to initialise it. The layer-replaced model is further fine-tuned to improve its performance.

IV Experimental Results

This section presents experimental results that demonstrate the performance of Dokei on various image classification domains. We justify the effectiveness of each processing step through several ablation studies. Finally, we compare with related work in the corresponding aspects.

Our experiments are conducted with TensorFlow [1] (version 1.10) and its high-level library TF-slim [38]; ILSVRC-2012 [36] pre-trained models are downloaded from TF-slim as well. Due to the lack of support for grouped convolution in TensorFlow, we provide a naïve implementation that splits the input tensor, maps the parts to convolution groups, and permutes the results by a provided index mapping. We also implement our own forward-backward splitting optimiser in TensorFlow. By default, when regularising a specific layer by forward-backward splitting, Dokei freezes the updates of the other layers and only updates the current layer; the regularisation strength is normally set to a fixed default. This implementation, as well as the end-to-end optimisation flow, can be found in the open-sourced codebase of Dokei (the URL is removed for double-blind review).
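For reference, a minimal sketch of such a split-and-concatenate grouped convolution, written against current TensorFlow rather than the TF 1.10 code described above (the function name and the index-mapping argument are our own, not the released implementation):

    import tensorflow as tf

    def grouped_conv2d(x, kernels, index_map=None, strides=1, padding="SAME"):
        """Naive grouped convolution: split the input depth-wise, convolve each
        group with its own kernel, concatenate, and optionally permute channels.

        x:         NHWC input tensor.
        kernels:   list of G kernels, each of shape [k, k, C/G, F/G].
        index_map: optional permutation of output channels.
        """
        groups = tf.split(x, num_or_size_splits=len(kernels), axis=-1)
        outs = [tf.nn.conv2d(g, k, strides=strides, padding=padding)
                for g, k in zip(groups, kernels)]
        y = tf.concat(outs, axis=-1)
        if index_map is not None:
            y = tf.gather(y, index_map, axis=-1)
        return y

    # Example: a 4-grouped 3x3 layer on a random feature map.
    x = tf.random.normal([1, 56, 56, 64])
    kernels = [tf.random.normal([3, 3, 16, 32]) for _ in range(4)]
    print(grouped_conv2d(x, kernels).shape)  # (1, 56, 56, 128)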

IV-A Experiments on Domain Tasks

We adopt ILSVRC-2012 pre-trained ResNet-50 [10] as the baseline and apply Dokei to it for specific domains. ResNet-50 occupies 98 MB of parameter storage and requires 4 GFLOPs (Giga FLoating-point OPerations) to finish one inference pass. Its topology builds on bottleneck blocks, which stack 1x1, 3x3, and 1x1 convolution layers with residual connections. VGG-16 [39] is also evaluated in some scenarios.

Following the major objective of Dokei, we focus on image classification tasks in different domains, specified by the datasets shown in Table I. For each dataset, we stick to its default train/test split and fine-tune the last fully-connected layer of ResNet-50 to obtain the baseline models. During fine-tuning, we simply apply the RMSProp optimiser [11] with an exponentially decaying learning rate and train on each dataset for at least 30 epochs. For each dataset and model, Dokei works layer-wise from top to bottom.

Dataset        Cls.  Img.    Baseline  Dokei-A (70% removed)  Dokei-A (60% removed)
Birds [41]     200   6,033   77.74%    74.8%                  70.2%
Flowers [31]   102   8,189   94.61%    95.4%                  94.0%
Actions [44]   40    9,532   80.67%    74.2%                  67.1%
Indoor [35]    67    6,700   74.21%    71.5%                  69.4%
Dogs [20]      120   20,580  82.65%    75.2%                  70.2%
TABLE I: Statistics of the domain datasets and baseline performance for evaluation. Cls. and Img. are the numbers of classes and images respectively. Baseline is the top-1 accuracy of the originally adapted ResNet-50. The Dokei-A columns list accuracy at checkpoints where 70% and 60% of the parameters have been removed (3x3 convolution grouped only).

We explore one universal group configuration, for both ResNet-50 and VGG-16 and all datasets, named Dokei-A: it starts with a given number of groups for the top layers and halves that number whenever the number of channels or filters is halved, so that every grouped convolution has the same number of channels per group. For each layer, Dokei performs block-based group-lasso regularisation, GSP optimisation, mask-based group regularisation, and the final replacement and fine-tuning, following the end-to-end flow (Section III-E). Although each step could be run until a specific threshold on the criterion or another measurement is met, we choose the simpler and effective approach of running each step for 100 epochs.
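To make the rule concrete, a small sketch of how such a configuration could be derived per layer (the starting group count g_top below is a hypothetical placeholder, not the paper's value):

    def dokei_a_groups(layer_shapes, g_top=8):
        """Assign a group count per layer, from top (last) to bottom: start from
        g_top and halve it whenever the channel or filter count halves.

        layer_shapes: list of (channels, filters), ordered from bottom to top.
        g_top: hypothetical starting group count for the top layers.
        """
        groups = []
        g = g_top
        prev_c, prev_f = layer_shapes[-1]
        for c, f in reversed(layer_shapes):
            if c <= prev_c // 2 or f <= prev_f // 2:
                g = max(1, g // 2)
                prev_c, prev_f = c, f
            groups.append(g)
        return list(reversed(groups))

    # ResNet-style 3x3 layer shapes, bottom to top (channels, filters).
    shapes = [(64, 64)] * 3 + [(128, 128)] * 4 + [(256, 256)] * 6 + [(512, 512)] * 3
    print(dokei_a_groups(shapes))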

We apply Dokei to ResNet-50 for all domains using the Dokei-A configuration. Figure 5 illustrates how accuracy and model size change while replacing layers from top to bottom. The resulting model retains only a fraction of the original parameters and GFLOPs during inference.

Fig. 5: Evaluation of Dokei-A for all domain datasets on ResNet-50. Each point indicates one layer replacement.

In general, Dokei works well at converting ResNet-50 into a grouped model for different domains, even though ResNet-50 is already compact. Because different domains have various levels of classification difficulty and redundancy with respect to the original fine-tuned model, the accuracy curves diverge. Even so, Dokei can adjust to various domains, primarily due to the combination of structured regularisation and efficient criterion-based GSP optimisation. In other words, the GSP optimisation produces masks that are learned from criterion values evaluated on different domains, while structured regularisation makes those masks distinct. Figure 6 illustrates how Dokei adapts to different domains with different GSP masks, found by maximal bipartite matching.

Fig. 6: Regularised 3x3 convolution weights of block2/unit_1 in ResNet-50, (a) for Indoor and (b) for Birds. The lighter blocks of weights are used to initialise the weights of the convolution groups.

Besides this adaptability, the resulting model is highly regular: each grouped convolution layer contains small convolutions with identical configuration that can run completely in parallel. Table II shows the evaluated performance of Dokei-A on GPU, CPU, and an embedded FPGA.

  • Dokei-A performs better on the CPU but worse on the GPU, mainly because our naïve grouped convolution implementation has to invoke a CUDA kernel per group.

  • The FPGA benefits more from the regularity: based on [34], we estimate the performance under the constraint that only 50% of the memory can store weights and the design can process several 3x3 kernels in parallel.

              GPU      CPU     FPGA (est.)
Batch size    128      32      1
Spec.         1080Ti   X5690   ZC7020
Baseline      0.234 s  2.12 s  0.808 s
Dokei-A       0.244 s  1.73 s  0.114 s
Dokei-B       0.239 s  2.01 s  0.392 s
TABLE II: Inference speed of the baseline ResNet-50 model, Dokei-A, and Dokei-B on various hardware platforms for Birds. The GPU (GeForce GTX 1080Ti with CUDA 9.0 and cuDNN 7.0) and CPU (Intel Xeon X5690) use the officially pre-compiled TensorFlow v1.10; the FPGA (Xilinx Zynq ZC7020) is estimated at 200 MHz. Dokei-B also groups 1x1 convolution.

Beyond Dokei-A, which groups only 3x3 convolution, we also examine how well Dokei can group 1x1 convolution. This new configuration, named Dokei-B, groups all layers in conv5_x [10], which includes three 3x3 and seven 1x1 convolution layers. We evaluate it on Birds and compare with Dokei-A on the same dataset (Figure 7); the comparison shows that grouping the top 1x1 layers is more beneficial than grouping the lower 3x3 layers. We will explore configurations involving 1x1 convolution in future work.

Fig. 7: Comparison between Dokei-A and Dokei-B on Birds. Dokei-B first groups all 3x3 layers, then starts over from the top 1x1 layers. The significant improvement comes from grouping the residual connection.

IV-B Ablation Study

Training from scratch.

We train the Dokei-A architecture from scratch under two settings: either all layers or only the 3x3 convolution layers are initialised from scratch. Results (Table III) show that Dokei outperforms training from scratch by a large margin.

Random initialisation.

We also initialise the grouped weights with pre-trained values taken from random positions rather than positions found by the GSP optimisation. We evaluate this on VGG-16 to minimise effects from residual connections. Results show that Dokei performs best; a random GSP mask is worse, but still better than a mapping that ignores GSP (Table III).

          From scratch (RN-50)      Random init. (VGG-16)
          Dokei   3x3     All       Dokei   w/ GSP   w/o GSP
          77.3%   49.8%   34.2%     76.4%   75.7%    74.4%
TABLE III: Top-1 accuracy on Birds when training Dokei-A of ResNet-50 from scratch with two strategies, and when using two different random-initialisation settings for VGG-16 (top convolution only).
Group-lasso effect.

We study the effect of block-based group-lasso regularisation on the criterion measured for maximal bipartite matching solutions, to justify our claims in Section III-C1. Figure 8 shows the effect on the last 3x3 convolution of ResNet-50 on Birds. The changes are significant when a larger regularisation scale is used. Figure 4(a) also shows the weights resulting from this regularisation.

Fig. 8: Changes in criterion differences during group-lasso regularisation, evaluated for three different regularisation scales.

IV-C Comparison with Prior Work

Model    Baseline  Tech.  Before FT  After FT
VGG-16   76.26%    CdN    0.552%     76.10%
                   Ours   6.81%      76.38%
RN-50    77.74%    CdN    61.25%     77.17%
                   Ours   62.20%     77.28%
TABLE IV: Comparison between CondenseNet (CdN) and our method on VGG-16 and ResNet-50, in terms of top-1 accuracy before and after fine-tuning (FT).
Group-lasso based method.

Most other group-lasso-based methods have objectives different from ours: [22, 49, 42] aim to induce channel- or filter-level sparsity, which does not lead to grouped convolution, while CondenseNet [15] does regularise towards grouped convolution. There are three major differences compared with our approach:

  1. Our group-lasso regulariser is built on blocks, while theirs is built on filter groups, i.e., their grouping completely separates our block along the channel axis.

  2. After group-lasso regularisation, they select the channels with the highest magnitude-based importance, regardless of the GSP constraint (Section III-A1).

  3. They handle the non-differentiability of group-lasso by nullifying weights under a given threshold.

We implement their filter-wise group-lasso regulariser and top-k channel selection method in Dokei and investigate whether the GSP constraint actually matters. We consider the top-1 validation accuracy both before and after fine-tuning, to separate the effect of regularisation from the subsequent tuning. Without help from residual connections, the accuracy of VGG-16 is sensitive to regularisation and replacement. Results in Table IV show that Dokei performs better on both metrics; after fine-tuning, the accuracy differs little, since the other layers are also updated.

Low-rank decomposition method.

Another recent work featuring the ability to produce grouped convolution is [33], which claims that a low-rank-decomposition-based filter group approximation can directly produce a grouped model without regularisation. However, this method requires appending 1x1 convolution as part of the approximation. This is a major drawback: even 1x1 convolution consumes considerable resources when the channel and filter counts are large, and the extra layer is redundant for models like ResNet-50, which already surround 3x3 convolution with 1x1 layers. Table V shows, under the same group settings (Dokei-A, B), how many parameters and operations remain for [33] and for Dokei.

         Dokei-A (3x3 only)        Dokei-B (3x3 and 1x1)
         # params.  GFLOPs         # params.  GFLOPs
[33]     13.5%      4.27%          75.9%      84.6%
Dokei    2.3%       4.20%          41.3%      84.4%
TABLE V: Remaining parameters and operations for [33] and Dokei under the Dokei-A and Dokei-B group configurations.
Pruning with domain adaptation.

It is difficult to compare directly with prior work on pruning for efficient domain adaptation [30, 29] because of differences in network architecture: [30] uses VGG-16 and AlexNet while [29] uses only VGG-19, and both also cover fully-connected layers while we focus only on convolution. Results are in Table VI. Note that since ResNet-50 is already compact, removing parameters from it is likely to decrease accuracy more than for models like VGG-16. To eliminate the effect of model differences, we will compare Dokei with prior work on the same models in the future.

Dataset  Model          Top-1 accuracy (origin / result)  Size red.
Birds    VGG-16 [30]    72.2% / 70.0%                     40.0%
         VGG-19 [29]    55.7% / 56.0%                     6.81%
         RN-50 (A)      77.7% / 71.4%                     39.2%
         RN-50 (B)      77.7% / 75.0%                     52.5%
Flowers  AlexNet [30]   80.1% / 79.8%                     41.0%
         VGG-19 [29]    78.8% / 77.6%                     14.9%
         RN-50 (A)      94.6% / 94.0%                     41.4%
Actions  VGG-19 [29]    68.7% / 69.4%                     29.9%
         RN-50 (A)      80.7% / 74.2%                     30.1%
TABLE VI: Rough comparison between Dokei and prior work across datasets, in top-1 accuracy and size reduction rate.

V Conclusion

This paper presents Dokei, a domain adaptation approach for converting pre-trained CNN models into ones involving efficient grouped convolution. The basis of Dokei has been formalised as an optimisation problem constrained by the group sparsity pattern, and practical solutions based on structured regularisation and maximal bipartite matching are provided. Results show that Dokei is effective for various domains and outperforms prior work in several respects, including model efficiency and accuracy. Even for a compact model like ResNet-50, Dokei performs well.

Future work includes evaluating Dokei on ImageNet, devising a method to select the number of groups based on criterion values, supporting fully-connected layers, improving the grouped convolution implementation on GPUs, reducing regularisation time, finding a better layer-wise replacement schedule for models with complex topology, and understanding the mechanism of layer-wise replacement involving 1x1 convolution and residual connections.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, and G. Research. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Technical report.
  • [2] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR, 2016.
  • [3] W. Deng, W. Yin, and Y. Zhang. Group sparse optimization by alternating direction method. 77005:88580R, 2013.
  • [4] X. Dong, S. Chen, and S. J. Pan. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. In NIPS, 2017.
  • [5] J. Duchi and Y. Singer. Efficient Online and Batch Learning Using Forward Backward Splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
  • [6] M. Figurnov, A. Ibraimova, D. Vetrov, and P. Kohli. PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions. In NIPS, 2015.
  • [7] T. Goldstein, C. Studer, and R. Baraniuk. A Field Guide to Forward-Backward Splitting with a FASTA Implementation. 2014.
  • [8] S. Han, H. Mao, and W. J. Dally. Deep Compression - Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
  • [9] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, 1993.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 12 2016.
  • [11] G. E. Hinton, N. Srivastava, and K. Swersky. Overview of mini-batch gradient descent, 2012.
  • [12] J. E. Hopcroft and R. M. Karp. An $n^{5/2}$ Algorithm for Maximum Matchings in Bipartite Graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
  • [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR, 2017.
  • [14] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. 2016.
  • [15] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. CondenseNet: An Efficient DenseNet using Learned Group Convolutions. CoRR, abs/1711.0, 2017.
  • [16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely Connected Convolutional Networks. CoRR, abs/1608.0, 2016.
  • [17] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. In CVPR, 2017.
  • [18] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured Variable Selection with Sparsity-Inducing Norms. JMLR, 12:2777–2824, 2011.
  • [19] R. M. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations, pages 85–103, 1972.
  • [20] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
  • [22] V. Lebedev and V. Lempitsky. Fast ConvNets Using Group-Wise Brain Damage. In CVPR, pages 2554–2564, 2016.
  • [23] Y. LeCun, J. S. Denker, and S. a. Solla. Optimal Brain Damage. In NIPS, pages 598–605, 1990.
  • [24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning Filters for Efficient Convnets. In ICLR, 2017.
  • [25] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning Efficient Convolutional Networks through Network Slimming. In ICCV, pages 2755–2763, 2017.
  • [26] J. H. Luo, J. Wu, and W. Lin. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In ICCV, 2017.
  • [27] N. Ma, X. Zhang, H.-t. Zheng, and J. Sun. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. 2018.
  • [28] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. In NIPS, 2017.
  • [29] M. Masana, J. V. D. Weijer, L. Herranz, A. D. Bagdanov, and J. M. Alvarez. Domain-Adaptive Deep Network Compression. In ICCV, pages 4299–4307, 2017.
  • [30] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. In ICLR, 2017.
  • [31] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
  • [32] W. Pan, H. Dong, and Y. Guo. DropNeuron: Simplifying the Structure of Deep Neural Networks. In NIPS, 2016.
  • [33] B. Peng, W. Tan, Z. Li, S. Zhang, D. Xie, and S. Pu. Extreme Network Compression via Filter Group Approximation. In ECCV, 2018.
  • [34] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In FPGA, pages 26–35, 2016.
  • [35] A. Quattoni and A. Torralba. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 413–420. IEEE, 2009.
  • [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. CoRR, 2018.
  • [38] N. Silberman and S. Guadarrama. Tensorflow-slim image classification model library, 2016.
  • [39] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
  • [40] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.
  • [41] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • [42] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning Structured Sparsity in Deep Neural Networks. In NIPS, 2016.
  • [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated Residual Transformations for Deep Neural Networks. CoRR, abs/1611.0, 2016.
  • [44] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1331–1338. IEEE, 2011.
  • [45] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 68(1):49–67, 2006.
  • [46] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ECCV, 8689:818–833, 2014.
  • [47] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CoRR, 2017.
  • [48] H. Zhou, J. M. Alvarez, and F. Porikli. Less Is More: Towards Compact CNNs. In ECCV, 2016.
  • [49] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients, 2016.