Merging and Evolution: Improving Convolutional Neural Networks for Mobile Applications

03/24/2018
by   Zheng Qin, et al.

Compact neural networks are inclined to exploit "sparsely-connected" convolutions such as depthwise convolution and group convolution for deployment in mobile applications. Compared with standard "fully-connected" convolutions, these convolutions are more computationally economical. However, "sparsely-connected" convolutions block the inter-group information exchange, which induces severe performance degradation. To address this issue, we present two novel operations named merging and evolution to leverage the inter-group information. Our key idea is encoding the inter-group information with a narrow feature map, then combining the generated features with the original network for better representation. Taking advantage of the proposed operations, we then introduce the Merging-and-Evolution (ME) module, an architectural unit specifically designed for compact networks. Finally, we propose a family of compact neural networks called MENet based on ME modules. Extensive experiments on the ILSVRC 2012 and PASCAL VOC 2007 datasets demonstrate that MENet consistently outperforms other state-of-the-art compact networks under different computational budgets. For instance, under the computational budget of 140 MFLOPs, MENet surpasses ShuffleNet by 1% on ILSVRC 2012 top-1 accuracy and by 2.3% on PASCAL VOC 2007 mAP.

I Introduction

Convolutional neural networks (CNNs) have achieved significant progress in computer vision tasks such as image classification [1, 2, 3, 4, 5], object detection [6, 7, 8, 9] and semantic segmentation [10]. However, state-of-the-art CNNs require computation at billions of FLOPs, which prevents them from being utilized in mobile or embedded applications. For instance, ResNet-101 [3], which is broadly used in detection tasks [6, 9], has a complexity of 7.8 GFLOPs and fails to achieve real-time detection even with a powerful GPU.

In view of the huge computational cost of modern CNNs, compact neural networks [11, 12, 13] have been proposed to deploy both accurate and efficient networks on mobile or embedded devices. Compact networks can achieve relatively high accuracy under a tight computational budget. For better computational efficiency, these networks are inclined to utilize "sparsely-connected" convolutions such as depthwise convolution and group convolution rather than standard "fully-connected" convolutions. For instance, ShuffleNet [13] utilizes a lightweight version of the bottleneck unit [3] termed the ShuffleNet unit. In a ShuffleNet unit, the original 3×3 convolution is replaced with a 3×3 depthwise convolution, while the 1×1 convolutions are substituted with pointwise group convolutions. This modification significantly reduces the computational cost, but blocks the information flow between channel groups and leads to severe performance degradation. For this reason, ShuffleNet introduces the channel shuffle operation to enable inter-group information exchange. As shown in Fig. 1, a channel shuffle operation permutes the channels so that each group in the second convolutional layer contains channels from every group in the first convolutional layer. Benefiting from the channel shuffle operation, ShuffleNet achieves 65.9% top-1 accuracy on the ILSVRC 2012 dataset [14] with 140 MFLOPs, and 70.9% top-1 accuracy with 524 MFLOPs, which is state-of-the-art among compact networks.

Fig. 1: Channel shuffle operation with 9 channels and 3 channel groups. Each group in the second convolution receives only 1 channel from each group in the first convolution. This leads to severe inter-group information loss.

However, the channel shuffle operation fails to eliminate the performance degradation, and ShuffleNet still suffers from the loss of inter-group information. Fig. 1 illustrates a channel shuffle operation with 9 channels and 3 channel groups. Each group in the second convolutional layer receives only 1 channel from every group in the first convolutional layer, whereas the 2 other channels in each group are ignored. As a result, a large portion of the inter-group information cannot be leveraged. This problem is aggravated given more channel groups. Although there are more channels in total given more groups, the number of channels in each group is smaller, which increases the loss of inter-group information. Consequently, when the computational budget is relatively large, ShuffleNet architectures with more channel groups perform worse than the narrower ones which have fewer groups. This indicates that it is difficult for ShuffleNet to gain performance increase by increasing the number of channels directly.
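
For reference, channel shuffle reduces to a reshape-transpose-reshape. The following PyTorch sketch (ours, not the authors' released code) reproduces the 9-channel, 3-group case of Fig. 1:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # x: (batch, channels, height, width); channels must be divisible by groups
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # permute so each new group mixes all old groups
    return x.view(n, c, h, w)

# Example: 9 channels, 3 groups (as in Fig. 1)
x = torch.arange(9.0).view(1, 9, 1, 1)
print(channel_shuffle(x, 3).flatten().tolist())  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```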

To address this issue, we propose two novel operations named merging and evolution to directly fuse features across all channels in a group convolution and alleviate the loss of inter-group information. For a feature map generated from a group convolution, a merging operation aggregates the features at the same spatial position across all channels and encodes the inter-group information into a narrow feature map. An evolution operation is performed afterwards to extract spatial information from the feature map. Then, based on the proposed operations, we introduce the Merging-and-Evolution (ME) module, a powerful and efficient architectural unit specifically for compact networks. For computational efficiency, ME modules exploit depthwise convolutions and group convolutions to reduce the computational cost. For better representation, ME modules utilize merging and evolution operations to leverage the inter-group information. Finally, we present a new family of compact neural networks called MENet which is built with ME modules. Compared with ShuffleNet [13], MENet alleviates the loss of inter-group information and gains substantial improvements as the group number increases.

We conduct extensive experiments to evaluate the effectiveness of MENet. Firstly, we compare MENet with other state-of-the-art network structures on the ILSVRC 2012 classification dataset [14]. Then, we examine the generalization ability of MENet on the PASCAL VOC 2007 detection dataset [15]. Experiments show that MENet consistently outperforms other state-of-the-art compact networks under different computational budgets. For instance, under a complexity of 140 MFLOPs, MENet achieves improvements of 1% over ShuffleNet and 1.95% over MobileNet on ILSVRC 2012 top-1 accuracy, while 2.3% and 4.1% on PASCAL VOC 2007 mAP, respectively. Our models have been made publicly available at https://github.com/clavichord93/MENet.

II Related Work

As deep neural networks suffer from heavy computational cost and large model size, the inference-time compression and acceleration of neural networks has become an attractive topic in the deep learning community. Commonly, the related work can be categorized into four groups.

Tensor decomposition factorizes a convolution into a sequence of smaller convolutions with fewer parameters and less computational cost. Jaderberg et al. [16] proposed to decompose a k×k convolution into a k×1 convolution and a 1×k convolution, reporting substantial speedup with 1% accuracy loss. Denton et al. [17] proposed a method exploiting a low-rank decomposition to estimate the original convolution. Recently, Zhang et al. [18] proposed a method based on generalized singular value decomposition without the need of stochastic gradient descent, which achieved significant speedup on VGG-16 [2] with a graceful accuracy degradation.

Parameter quantization utilizes low-bit parameters in neural networks. Vanhoucke et al. [19] proposed to use 8-bit fixed-point parameters and achieved considerable speedup. Gong et al. [20] applied k-means clustering on network parameters and provided substantial compression with only 1% accuracy drop. Binarization methods [21, 22, 23] attempted to train networks directly with 1-bit weights. Quantization methods provide significant memory savings and enormous theoretical speedup. However, current hardware is mainly optimized for half-/single-/double-precision computation, so it is difficult for quantization methods to achieve the theoretical speedup.

Network pruning attempts to recognize the structural redundancy in network architectures and cut off the redundant parameters. Han et al. [24] proposed a method to remove all connections with small weights, reporting significant reductions in model size. Network slimming [25] applied a sparsity-induced penalty on the scaling factors in batch normalization layers and removed the channels with small scaling factors. He et al. [26] proposed a LASSO-regression-based method to prune redundant channels, achieving substantial speedup with comparable accuracy. Yu et al. [27] proposed a group-wise 2D-filter pruning approach and provided considerable speedup on VGG-16. However, an iterative pruning strategy is commonly utilized in network pruning, which slows down the training procedure.

Compact networks are designed specifically for mobile or embedded applications. SqueezeNet [11] proposed fire modules, where a 1×1 convolutional layer is first applied to "squeeze" the width of the network, followed by an expand layer mixing 1×1 and 3×3 convolutional kernels to reduce parameters. MobileNet [12] exploited depthwise separable convolutions as its building unit, which decompose a standard convolution into a combination of a 3×3 depthwise convolution and a 1×1 pointwise convolution. ShuffleNet [13] introduced depthwise convolutions and pointwise group convolutions into the bottleneck unit [3], and proposed the channel shuffle operation to enable inter-group information exchange. Compact networks can be trained from scratch, so the training procedure is very fast. Moreover, compact networks are orthogonal to the aforementioned methods and can be further compressed.

III Merging-and-Evolution Networks

In this section, we first analyze the loss of inter-group information in ShuffleNet and introduce merging and evolution operations for alleviating the performance degradation. Next, we describe the structure of the ME module. Finally, the details of the MENet architecture are presented.

III-A Merging and Evolution Operations

As pointed out in Section I, ShuffleNet suffers from severe inter-group information loss. The loss of inter-group information can be measured with the number of inter-group connections. Specifically, for two consecutive convolutional layers with $c$ output channels and $g$ channel groups, each group contains $c/g$ channels, and there are in total

$c \cdot \frac{(g-1)c}{g}$   (1)

inter-group connections if the channels were "fully-connected". After a channel shuffle operation, each group in the later convolutional layer receives $c/g^2$ channels from every group in the former layer, so there are

$c \cdot \frac{(g-1)c}{g^2}$   (2)

actual inter-group connections. This means a ratio of

$1 - \frac{1}{g} = \frac{g-1}{g}$   (3)

of the inter-group connections is lost, which induces severe loss of inter-group information. This significantly weakens the representation capability and leads to serious performance degradation. The problem is aggravated when there are more channel groups. The ratio of lost inter-group connections is 66.7% when there are three groups, but increases to 87.5% given eight groups. This explains why ShuffleNet with three groups outperforms the one with eight groups.
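
As a quick sanity check of the counting above, the following Python snippet evaluates the reconstructed Eqs. (1)-(3); the channel count 240 is arbitrary, since the lost ratio depends only on the number of groups g:

```python
def lost_inter_group_ratio(c: int, g: int) -> float:
    # Eq. (1): if the layers were "fully-connected", each of the c output channels
    # would connect to the (g - 1) * c / g channels in the other groups.
    n_total = c * (g - 1) * c / g
    # Eq. (2): after a channel shuffle, each output channel receives only c / g inputs,
    # of which (g - 1) * c / g**2 come from other groups.
    n_actual = c * (g - 1) * c / g ** 2
    return 1 - n_actual / n_total  # Eq. (3): equals (g - 1) / g

print(lost_inter_group_ratio(240, 3))  # 0.666...  -> 66.7% of inter-group connections lost
print(lost_inter_group_ratio(240, 8))  # 0.875     -> 87.5% lost
```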

To address this issue, we design two operations termed merging and evolution to leverage inter-group information. As shown in Fig. 2, the proposed operations encode the inter-group information with a narrow feature map, and combine it with the original network for more discriminative features.

III-A1 Merging Operation

The merging operation is designed to fuse features across all channels and encode the inter-group information into a narrow feature map. Given the feature map $X \in \mathbb{R}^{c \times h \times w}$ generated from a group convolution, a merging transformation $F_{merge}: \mathbb{R}^{c \times h \times w} \rightarrow \mathbb{R}^{m \times h \times w}$ is applied to aggregate features over all channels, where $c$ is the number of channels in the original feature map, $h$ and $w$ are the spatial dimensions, and $m$ is the number of channels in the produced feature map. A small $m$ is chosen to make the merging operations computationally economical. As $c$ is relatively large, it is difficult to integrate the spatial information without harming the computational efficiency. So we aggregate only the features on the same spatial position along all channels in a merging operation. A single pointwise convolution is exploited as the merging transformation, followed by a batch normalization [28] and a ReLU activation. Formally, the output feature map $Z$ of a merging operation is calculated as

$Z = \delta(W_{M} \ast X)$   (4)

where $\delta$ indicates the ReLU function, $\ast$ represents the convolution operator, and $W_{M}$ is the convolutional kernel (batch normalization is omitted from the notation for brevity). By this means, each channel in $Z$ contains information from every channel in the previous group convolutional layer.
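
A minimal PyTorch sketch of the merging operation as described above (the module and variable names are ours, not the authors' implementation):

```python
import torch.nn as nn

class Merge(nn.Module):
    """Merging operation: encode inter-group information into a narrow feature map
    via a 1x1 convolution, batch normalization, and ReLU (a sketch of Eq. (4))."""
    def __init__(self, in_channels: int, fusion_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, fusion_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(fusion_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The 1x1 convolution aggregates features at each spatial position across all channels.
        return self.relu(self.bn(self.conv(x)))
```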

III-A2 Evolution Operation

After a merging operation, an evolution operation is performed to obtain more discriminative features. An evolution operation is defined in two steps. In the first step, an evolution transformation $F_{evol}: \mathbb{R}^{m \times h \times w} \rightarrow \mathbb{R}^{m \times h \times w}$ is applied to the feature map $Z$ from the previous merging operation. The number of channels is kept unchanged. In this step, we intend to leverage more spatial information, so a standard 3×3 convolution is selected as the evolution transformation, followed by a batch normalization and a ReLU activation. In the second step, a matching transformation $F_{match}$ is performed to match the size of the output feature map with the original network. As in the merging operation, a single pointwise convolution is chosen as $F_{match}$ to maintain the computational efficiency. Another batch normalization and a sigmoid activation are added afterwards. The whole process is formally written as

$Z' = \delta(W_{E} \ast Z)$   (5)
$T = \sigma(W_{T} \ast Z')$   (6)

where $W_{E}$ and $W_{T}$ are the convolutional kernels of the evolution and matching transformations, and $\sigma$ indicates the sigmoid function.

Fig. 2: Merging and evolution operations. A merging operation applies a merging transformation and encodes the inter-group information into a narrow feature map. An evolution operation consists of an evolution transformation and a matching transformation and leverages spatial information.

At last, the feature map $T$ generated from the evolution operation is regarded as a set of neuron-wise scaling factors and combined with the original network using an element-wise product to improve the representation capability of the features in the network:

$Y = T \odot F(X)$   (7)

where $F$ is the transformation in the original network, and $\odot$ represents element-wise product. As $T$ encodes information from every channel in the previous convolution, each channel in $Y$ also contains information from all channels. This alleviates the loss of inter-group information.
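
The evolution operation and the combination in Eq. (7) can be sketched as follows. This is an illustration of our reading of Eqs. (5)-(7) with illustrative channel sizes, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Evolve(nn.Module):
    """Evolution operation (Eqs. (5)-(6)): a 3x3 convolution followed by a pointwise
    matching convolution ending in a sigmoid. Layer names are our own."""
    def __init__(self, fusion_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.evolve = nn.Sequential(          # Eq. (5): 3x3 conv + BN + ReLU
            nn.Conv2d(fusion_channels, fusion_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(fusion_channels),
            nn.ReLU(inplace=True),
        )
        self.match = nn.Sequential(           # Eq. (6): 1x1 conv + BN + sigmoid
            nn.Conv2d(fusion_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.match(self.evolve(z))     # scaling map T

# Eq. (7): the scaling map gates the residual-branch features element-wise.
z = torch.randn(1, 12, 28, 28)          # stand-in for the merged feature map Z of Eq. (4)
residual = torch.randn(1, 60, 28, 28)   # stand-in for the features of the original network
y = residual * Evolve(12, 60)(z)        # element-wise product with T
```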

Fig. 3: The structure of ME module. (a): Standard ME module. (b): Downsampling ME module. GConv: Group convolution. DWConv: Depthwise convolution.

III-B Merging-and-Evolution Module

Taking advantage of the proposed merging and evolution operations, we present the Merging-and-Evolution (ME) module, an architectural unit specifically designed for compact neural networks.

The ME module is a variant of the conventional residual block [3]. An ME module consists of three branches: an identity branch, a residual branch and a fusion branch, as illustrated from left to right in Fig. 3. For computational efficiency, the residual branch adopts a bottleneck design [3] and exploits "sparsely-connected" convolutions. It consists of three layers: a pointwise group convolution to squeeze the channel dimension, a 3×3 depthwise convolution to leverage spatial information, and another pointwise group convolution to recover the channel dimension. A channel shuffle operation [13] is applied after the first pointwise group convolution for inter-group information exchange. The utilization of merging and evolution operations introduces the fusion branch. A merging operation is performed after the channel shuffle, followed by an evolution operation. The fusion branch is then combined with the residual branch before the second pointwise group convolutional layer. This design helps alleviate the loss of inter-group information in the second group convolutional layer. The merging and evolution operations are applied to the bottleneck channels to reduce the overall computational cost. Additionally, as described in Section III-A, the number of channels in the fusion branch is kept small to maintain computational efficiency.

For the downsampling version of ME modules, two more modifications are performed. (i) The strides of the 3×3 depthwise convolution in the residual branch and the 3×3 convolution in the fusion branch are altered to 2. (ii) Inspired by [13], a 3×3 average pooling with a stride of 2 is applied in the identity branch, and the element-wise addition is substituted with a concatenation to combine the identity branch and the residual branch. After a downsampling ME module, the spatial dimensions of the feature map are halved, while the channel dimension is doubled. Fig. 3 describes the structure of the downsampling ME module.
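
Putting the pieces together, a standard (stride-1) ME module might look like the sketch below. The exact placement of batch normalization and ReLU, and the bottleneck ratio of 1/4, are assumptions on our part:

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous().view(n, c, h, w)

class MEModule(nn.Module):
    """Standard (stride-1) ME module sketched from Section III-B, not the released code."""
    def __init__(self, channels: int, bottleneck: int, fusion: int, groups: int):
        super().__init__()
        self.groups = groups
        self.gconv1 = nn.Sequential(            # 1x1 group conv: squeeze the channel dimension
            nn.Conv2d(channels, bottleneck, 1, groups=groups, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True))
        self.dwconv = nn.Sequential(            # 3x3 depthwise conv: spatial information
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=bottleneck, bias=False),
            nn.BatchNorm2d(bottleneck))
        self.fusion = nn.Sequential(            # fusion branch: merging -> evolution -> matching
            nn.Conv2d(bottleneck, fusion, 1, bias=False),
            nn.BatchNorm2d(fusion), nn.ReLU(inplace=True),
            nn.Conv2d(fusion, fusion, 3, padding=1, bias=False),
            nn.BatchNorm2d(fusion), nn.ReLU(inplace=True),
            nn.Conv2d(fusion, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.Sigmoid())
        self.gconv2 = nn.Sequential(            # 1x1 group conv: recover the channel dimension
            nn.Conv2d(bottleneck, channels, 1, groups=groups, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = channel_shuffle(self.gconv1(x), self.groups)
        out = self.dwconv(out) * self.fusion(out)   # combine before the second group conv (Eq. (7))
        return self.relu(self.gconv2(out) + x)      # identity branch: element-wise addition

# Example: a Stage 2 block of 228-MENet-12x1 (g = 3), assuming 228 // 4 = 57 bottleneck channels
block = MEModule(channels=228, bottleneck=57, fusion=12, groups=3)
print(block(torch.randn(1, 228, 28, 28)).shape)     # torch.Size([1, 228, 28, 28])
```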

Stage       Output Size   228-MENet-12×1 (g = 3)    256-MENet-12×1 (g = 4)    352-MENet-12×1 (g = 8)
Image       224×224       -                         -                         -
Stage 1     112×112       3×3 conv, 24, /2          3×3 conv, 24, /2          3×3 conv, 24, /2
            56×56         3×3 max pool, /2          3×3 max pool, /2          3×3 max pool, /2
Stage 2     28×28         ME module, 228, /2        ME module, 256, /2        ME module, 352, /2
                          ME module, 228, ×3        ME module, 256, ×3        ME module, 352, ×3
Stage 3     14×14         ME module, 456, /2        ME module, 512, /2        ME module, 704, /2
                          ME module, 456, ×7        ME module, 512, ×7        ME module, 704, ×7
Stage 4     7×7           ME module, 912, /2        ME module, 1024, /2       ME module, 1408, /2
                          ME module, 912, ×3        ME module, 1024, ×3       ME module, 1408, ×3
Classifier  1×1           global average pool, 1000-d fc, softmax
FLOPs                     144M                      140M                      144M
The number after the layer/module type is the number of output channels. "×3" and "×7" indicate that the ME module is repeated 3 or 7 times respectively. "/2" represents that the stride of the layer is 2. The ME modules with "/2" perform downsampling.

TABLE I: MENet Architecture for ImageNet under the Computational Budget of 140 MFLOPs

III-C MENet Architecture

Based on ME modules, we propose MENet, a new family of compact neural networks. The overall architecture of MENet for ImageNet classification is demonstrated in Table I.

MENet begins with a 3×3 convolutional layer and a 3×3 max pooling layer, both with strides of 2. A batch normalization and a ReLU activation are applied after the convolutional layer. These two layers perform downsampling to reduce the overall computational cost. They are followed by a sequence of ME modules, which are grouped into three stages (Stage 2 to Stage 4). In each stage, the first building block is a downsampling ME module, while the remaining building blocks are standard ME modules. The number of output channels is kept the same within a stage and is doubled in the next stage. Furthermore, the number of bottleneck channels in the residual branch is set to 1/4 of the output channels in the same ME module, and we do not apply group convolution on the first pointwise layer in Stage 2. We build MENet with three group numbers $g$: $g = 3$, $g = 4$ and $g = 8$. Increasing the group number aggravates the connection sparsity in the residual branch, but allows wider feature maps under the same budget. The influence of the group number on the performance of MENet is discussed in the next section.

We furthermore introduce three hyper-parameters for customizing MENet to fit different computational budgets. The first two hyper-parameters are the fusion width $m$ and the expansion factor $e$, which control the complexity of the fusion branch. The fusion width is defined as the number of channels in the fusion branch of Stage 2, and the expansion factor represents the ratio of the channels in the fusion branch between two consecutive stages. The number of channels in the fusion branch of Stage $i$ ($i = 2, 3, 4$) is calculated as $m \cdot e^{i-2}$. Intuitively, wider fusion branches are beneficial for generating more discriminative features, but they also lead to more computational cost. The effects of the fusion width and the expansion factor on the performance of MENet are discussed in the next section. The third hyper-parameter is the residual width $w$, which is defined as the number of output channels in the residual branch of Stage 2. The residual width controls the computational cost in the residual branch.

Finally, we define the notation "$w$-MENet-$m{\times}e$" to represent a network with a residual width $w$, a fusion width $m$ and an expansion factor $e$. For example, the network in Table I with $g = 3$ can be denoted as "228-MENet-12×1".
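
A small illustration of the stage-wise fusion width rule m · e^(i-2); the rounding to an integer channel count is our assumption:

```python
def fusion_channels(m: int, e: float, stage: int) -> int:
    # Number of channels in the fusion branch of Stage i (i = 2, 3, 4): m * e**(i - 2)
    return int(round(m * e ** (stage - 2)))

# e.g. 228-MENet-12x1.5 (w = 228, m = 12, e = 1.5): fusion widths for Stages 2-4
print([fusion_channels(12, 1.5, s) for s in (2, 3, 4)])  # [12, 18, 27]
```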

IV Experiments

We conduct extensive experiments to examine the effectiveness of MENet on two benchmarks. We first evaluate MENet on the ILSVRC 2012 classification dataset [14] and compare MENet with other state-of-the-art networks. The influence of different model choices is then investigated. Finally, we conduct experiments on the PASCAL VOC 2007 detection dataset [15] to examine the generalization ability of MENet.

Models                          MFLOPs   Top-1 Acc.   Top-5 Acc.
VGG-16 [2]                      15300    71.5         89.8
GoogLeNet [4]                   1550     69.8         89.6
456-MENet-24×1 (g = 3, ours)    551      71.6         90.2
ResNet-15 [3]                   140      61.3         -
Xception-15 [29]                140      64.9         -
ResNeXt-15 [30]                 140      65.7         -
352-MENet-12×1 (g = 8, ours)    144      66.7         86.9
ResNet-15 [3]                   38       51.1         -
Xception-15 [29]                38       53.9         -
ResNeXt-15 [30]                 38       53.7         -
108-MENet-8×1 (g = 3, ours)     38       56.1         79.2
Results that surpass all competing networks are shown in bold. Larger numbers in top-1 and top-5 accuracy represent better performance.

TABLE II: ILSVRC 2012 Accuracy (%) Comparison with State-of-the-art Network Structures

Models                                 MFLOPs   Top-1 Acc.   Top-5 Acc.
ShuffleNet 1× (g = 3) [13]             137      65.9         -
ShuffleNet 1× (g = 4) [13]             134      65.7         -
ShuffleNet 1× (g = 8) [13]             138      65.3         -
ShuffleNet 1× (g = 3, re-impl.) [13]   137      65.7         86.3
ShuffleNet 1× (g = 4, re-impl.) [13]   134      65.5         86.2
ShuffleNet 1× (g = 8, re-impl.) [13]   138      65.4         86.3
228-MENet-12×1 (g = 3, ours)           144      66.4         86.7
256-MENet-12×1 (g = 4, ours)           140      66.6         86.7
352-MENet-12×1 (g = 8, ours)           144      66.7         86.9

TABLE III: ILSVRC 2012 Accuracy (%) Comparison with ShuffleNet

Models                          MFLOPs   Top-1 Acc.   Top-5 Acc.
1.0 MobileNet-224 [12]          569      70.73        89.72
ShuffleNet 2× (g = 3) [13]      524      70.79        89.80
456-MENet-24×1 (g = 3, ours)    551      71.60        90.07
0.75 MobileNet-224 [12]         325      68.60        88.33
ShuffleNet 1.5× (g = 3) [13]    292      68.85        88.41
348-MENet-12×1 (g = 3, ours)    299      69.91        89.08
0.5 MobileNet-224 [12]          149      64.74        85.63
ShuffleNet 1× (g = 3) [13]      137      65.69        86.29
228-MENet-12×1 (g = 3, ours)    144      66.43        86.72
352-MENet-12×1 (g = 8, ours)    144      66.69        86.92
0.25 MobileNet-224 [12]         41       53.71        77.04
ShuffleNet 0.5× (g = 3) [13]    38       55.40        78.70
108-MENet-8×1 (g = 3, ours)     38       56.08        79.24

TABLE IV: ILSVRC 2012 Accuracy (%) Comparison with ShuffleNet and MobileNet

IV-A ImageNet Classification

The ILSVRC 2012 dataset is composed of a training set of 1.28 million images and a validation set of 50,000 images, which are categorized into 1,000 classes. We train the networks on the training set and report the top-1 and the top-5 accuracy rates on the validation set using center-crop evaluations.

IV-A1 Implementation Details

All our experiments are conducted using PyTorch [31] with four GPUs. We utilize synchronous stochastic gradient descent to train the models for 120 epochs with a batch size of 256 and a momentum of 0.9. Following [13], a relatively small weight decay of 4e-5 is used to avoid underfitting. The learning rate starts from 0.1 and is divided by 10 every 30 epochs. Because our models are relatively small, we use less aggressive multi-scale data augmentation. Color jittering is not adopted because we find it can lead to underfitting. For evaluation, each validation image is first resized so that its shorter edge is 256 pixels, and then evaluated using the center 224×224 crop.
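
The training and evaluation settings above translate to roughly the following PyTorch configuration (a sketch; the placeholder model and the ImageNet normalization statistics are ours, not stated in the text):

```python
import torch
from torchvision import transforms

model = torch.nn.Conv2d(3, 24, 3)  # placeholder for the actual MENet model

# SGD with momentum 0.9 and weight decay 4e-5; the lr starts at 0.1 and is divided by 10 every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Center-crop evaluation: resize the shorter edge to 256 pixels, then take the center 224x224 crop.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```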

IV-A2 Comparison with Other State-of-the-art Networks

Table II demonstrates the comparison of MENet and some state-of-the-art network structures on ILSVRC 2012 dataset.

We first compare MENet with two popular networks, GoogLeNet [4] and VGG-16 [2]. GoogLeNet provides 69.8% top-1 accuracy and 89.6% top-5 accuracy, while VGG-16 produces remarkably better top-1 accuracy of 71.5%. However, both are computationally intensive. In comparison, 456-MENet-24×1 achieves 71.6% top-1 accuracy and 90.2% top-5 accuracy under a complexity of about 550 MFLOPs. MENet significantly surpasses GoogLeNet by 1.8% on top-1 accuracy with 2.8× fewer FLOPs, and slightly outperforms VGG-16 (by 0.1%) with 27× fewer FLOPs.

We further compare the ME module with the building structures of three state-of-the-art networks: ResNet [3], Xception [29] and ResNeXt [30]. Following [13], we replace the ME modules in the architecture shown in Table I with these structures and adapt the number of channels to the computational budgets. These networks are referred to as ResNet-15, Xception-15 and ResNeXt-15 (the number 15 indicates the number of building blocks in the network), and we use the results reported in [13] for comparison. As shown in Table II, 352-MENet-12×1 achieves significant improvements of 5.4% over ResNet-15, 1.8% over Xception-15 and 1% over ResNeXt-15 under a complexity of 140 MFLOPs, while 108-MENet-8×1 achieves improvements of 5.0%, 2.2% and 2.3% under 40 MFLOPs, respectively. These improvements prove the effectiveness of ME modules in building compact networks.

IV-A3 Comparison with Other Compact Networks

Fig. 4: Comparison of MENet with other network structures. MENet surpasses all competing structures under all four computational budgets.

We also compare the performance of MENet with two state-of-the-art compact networks: ShuffleNet [13] and MobileNet [12]. For fair comparison, we re-implement ShuffleNet and MobileNet with the same settings as described in Section IV-A1.

Table III demonstrates the comparison of the results between MENet and ShuffleNet with different group numbers. When the group number is kept the same, MENet surpasses ShuffleNet by a large margin. Considering there are fewer channels in the residual branch in MENet than in ShuffleNet, we attribute this improvement to the effectiveness of the proposed merging and evolution operations. Although ShuffleNet has more channels, it suffers from the loss of inter-group information. On the other hand, MENet leverages the inter-group information through the merging and evolution operations. Consequently, MENet generates more discriminative features than ShuffleNet and overcomes the performance degradation. This is the first advantage of MENet: it can achieve better performance with fewer channels.

Models MFLOPs Top-1 Acc. Top-5 Acc.
228-MENet-10×1 (g = 3) 140 66.37 86.70
228-MENet-12×1 (g = 3) 144 66.43 86.72
228-MENet-14×1 (g = 3) 148 66.56 86.95
228-MENet-16×1 (g = 3) 153 66.86 87.19
TABLE V: ILSVRC 2012 Accuracy (%) of Different Fusion Widths
Models MFLOPs Top-1 Acc. Top-5 Acc.
228-MENet-12×1 (g = 3) 144 66.43 86.72
228-MENet-12×1.5 (g = 3) 152 66.71 87.14
228-MENet-12×2 (g = 3) 163 67.25 87.30
228-MENet-12×2.5 (g = 3) 179 67.51 87.66
TABLE VI: ILSVRC 2012 Accuracy (%) of Different Expansion Rates
Models Top-1 Acc. (%) Top-5 Acc. (%)
228-MENet-12×1 (g = 3) 66.43 86.77
228-MENet-12×1 (add, g = 3) 66.27 86.25
256-MENet-12×1 (g = 4) 66.53 86.82
256-MENet-12×1 (add, g = 4) 65.48 86.30
TABLE VII: ILSVRC 2012 Accuracy (%) of Element-wise Product and Element-wise Addition

In ShuffleNet, the top-1 accuracy decreases as the number of groups increases. As pointed out in Section III-A, the ratio of lost inter-group connections is $(g-1)/g$ when there are $g$ channel groups. Increasing $g$ causes more inter-group connections to be lost, which aggravates the loss of inter-group information. More specifically, although there are more channels in total in the residual branch when the group number is larger, the number of channels within each channel group becomes smaller, which harms the representation capability. However, the results are the opposite for MENet: the classification accuracy rises given more channel groups. This is another advantage that MENet brings: it can gain accuracy improvement by directly increasing the width of the network and the number of groups. The merging and evolution operations fuse the features from all channels simultaneously, thus alleviating the loss of inter-group information. Consequently, MENet benefits from the wider feature maps and generates more discriminative features. These improvements are consistent with our initial motivation for designing ME modules.

Backbone mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
0.5 MobileNet-224 54.8 59.4 65.4 52.7 36.3 30.1 57.2 71.3 64.4 31.1 64.4 47.5 62.5 70.8 68.8 64.2 27.6 53.2 51.8 61.8 54.9
ShuffleNet 1× (g = 3) 56.6 57.4 64.6 55.4 37.7 30.4 60.1 72.1 66.8 29.9 64.0 50.3 65.7 73.5 68.6 66.9 34.2 58.5 50.0 65.7 60.0
352-MENet-12×1 (g = 8) 58.9 64.2 64.1 59.0 44.6 34.6 64.1 73.8 70.0 32.8 69.1 52.5 64.3 75.3 69.4 67.7 35.3 59.5 50.6 66.9 60.5
1.0 MobileNet-224 62.4 65.1 69.0 59.9 51.4 40.1 66.2 75.7 76.7 40.0 69.3 52.9 72.9 75.2 68.3 71.2 35.3 64.8 58.1 70.0 66.8
ShuffleNet 2× (g = 3) 63.5 68.4 69.7 58.8 49.7 38.2 70.5 76.4 77.0 38.7 73.0 54.7 73.8 79.0 71.1 72.0 36.1 65.8 60.2 71.4 64.7
456-MENet-24×1 (g = 3) 65.5 69.2 72.6 66.5 52.1 42.3 70.8 79.4 78.9 41.3 75.7 56.3 78.0 77.4 71.9 74.4 38.0 66.8 58.2 72.2 67.4
TABLE VIII: Comparison of mAP (%) and AP (%) on PASCAL VOC 2007 Test Set

We further compare the three compact networks under four computational budgets. The results are demonstrated in Table IV. The number of output channels in the first convolution in MENet is adjusted to fit the computational budget. According to the table, MENet significantly outperforms ShuffleNet and MobileNet under all the computational budgets. Under a budget of 140 MFLOPs, MENet surpasses ShuffleNet by 0.74% on top-1 accuracy with the same group number ($g = 3$), and by 1% with more groups ($g = 8$). Meanwhile, MENet surpasses MobileNet by 1.95%. We do not tune the group numbers for MENet much, but simply set $g = 3$ under the other computational budgets. For smaller networks with only 40 MFLOPs, MENet provides improvements of 0.68% over ShuffleNet and 2.37% over MobileNet. For larger networks under a complexity of 300 MFLOPs, MENet performs 1.06% better than ShuffleNet and 1.31% better than MobileNet. When the complexity is 550 MFLOPs, MENet surpasses ShuffleNet and MobileNet by 0.81% and 0.87% respectively. Similar results are observed on the top-5 accuracy. More detailed comparison results are illustrated in Fig. 4. These results prove that MENet has stronger representation capability and is both efficient and accurate for various scenarios.

IV-A4 Model Choices

We furthermore conduct experiments to examine the influences of several design choices on the performance of MENet, including the fusion width, the expansion rate, and the function which combines the fusion branch and the residual branch.

Fusion Width. The fusion width is the hyper-parameter which controls the initial number of channels in the fusion branch. We evaluate the effects of the fusion width using four models: 228-MENet-10×1, 228-MENet-12×1, 228-MENet-14×1 and 228-MENet-16×1, all with $g = 3$. Table V shows the comparison of these networks. Substantial improvements in accuracy are observed as the fusion width increases. In ME modules, we set the fusion branch to be relatively narrow for computational efficiency. This limits the representation capability of the features generated from the fusion branch. Increasing the fusion width improves the information capacity of the fusion branch, which allows more inter-group information to be encoded and improves the representation capability.

Expansion Rate. The expansion rate controls the "growth" of the channels in the fusion branch between stages. We also select four MENet models to examine the effect of the expansion rate: 228-MENet-12×1, 228-MENet-12×1.5, 228-MENet-12×2 and 228-MENet-12×2.5, all with $g = 3$. The results are shown in Table VI. It is observed that the networks with larger expansion rates are inclined to have higher accuracy. The model with an expansion rate of 2.5 achieves an improvement of above 1% on top-1 accuracy over the model whose expansion rate is 1. We conjecture that as the width of the residual branch increases from stage to stage, the inter-group information becomes increasingly complicated. This makes it difficult to encode all the information within a fixed number of channels in the fusion branch for all stages. By applying a larger expansion rate, a different number of channels is used to fuse the features in each stage, which helps improve the representation capability in the later stages.

Element-wise Product vs. Element-wise Addition. It is a conventional practice to learn residual information (element-wise addition) in state-of-the-art deep networks [3, 30, 5]. However, we choose to learn neuron-wise scaling information (element-wise product) instead in MENet. We evaluate the effects of these two choices using two MENet models with different group numbers ($g = 3$ and $g = 4$). For the networks using element-wise addition, we simply make two modifications: (i) the element-wise product is replaced by an element-wise addition; (ii) the sigmoid activation after the second pointwise convolution in the fusion branch is removed. The results are demonstrated in Table VII. It is clear that learning scaling information significantly outperforms learning residual information. The model with element-wise product is 0.16% better when $g = 3$, and 1.05% better when $g = 4$. Notice that the model learning residual information provides a worse result than its ShuffleNet counterpart when $g = 4$. These results indicate that residual information is not effective for inter-group feature fusion. This difference may be induced by the narrow feature maps in the fusion branch, which cannot encode adequate residual information. We plan to further examine this in future work.
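
The two combination choices compared in Table VII amount to the following (a sketch; for illustration the sigmoid of Eq. (6) is folded into the combination step):

```python
import torch

def combine(residual: torch.Tensor, fusion: torch.Tensor, mode: str = "product") -> torch.Tensor:
    # `fusion` is the pre-activation output of the matching 1x1 convolution.
    if mode == "product":       # MENet default: neuron-wise scaling via a sigmoid gate (Eq. (7))
        return residual * torch.sigmoid(fusion)
    if mode == "add":           # ablation variant: sigmoid removed, element-wise addition
        return residual + fusion
    raise ValueError(f"unknown mode: {mode}")
```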

IV-B Object Detection on PASCAL VOC

To investigate the generalization ability of MENet, we conduct comparative experiments on PASCAL VOC 2007 detection dataset [15]. PASCAL VOC 2007 dataset consists of about 10,000 images split into three (train/val/test) sets. We train the models on VOC 2007 trainval set and report the single-model results on VOC 2007 test set.

We adopt the Faster R-CNN [6] detection pipeline and compare the performance of MENet, ShuffleNet and MobileNet at 600× resolution under two computational budgets (140 MFLOPs and 550 MFLOPs). The models pre-trained on the ILSVRC 2012 dataset are used for transfer learning. For the MobileNet-based detectors, we use the first 28 layers as the R-CNN base network and the remaining 4 layers as the R-CNN subnet. For the ShuffleNet-based and the MENet-based detectors, the first three stages are used as the base network, and the last stage is used as the R-CNN subnet. All strides in the R-CNN subnets are set to 1 to obtain larger feature maps. RoI align [9] is used to encode RoIs instead of RoI pooling [32]. During testing, 300 region proposals are sent to the R-CNN subnet to generate the final predictions.

Table VIII demonstrates the comparison of the three compact networks on VOC 2007 test set. According to the results, MENet significantly outperforms MobileNet and ShuffleNet under both computational budgets. Under the computational budget of 140 MFLOPs, the MENet-based detector achieves the mAP of 58.9%, while the mAP of the ShuffleNet-based and the MobileNet-based detectors is 56.6% and 54.8%, respectively. MENet achieves improvements of 2.3% mAP over ShuffleNet and 4.1% over MobileNet. More specifically, MENet provides better results on most classes, with the improvements from 0.5% (tv) to 6.9% (boat). On the classes which are difficult for ShuffleNet and MobileNet, such as boat, bottle, table and plant, the MENet-based detector increases the AP by 6.9%, 4.2%, 2.2% and 1.1%, respectively. Under a complexity of 550 MFLOPs, the MENet-based detector surpasses the ShuffleNet-based one by 2% and the MobileNet-based one by 3.1% on mAP. Additionally, MENet also outperforms ShuffleNet and MobileNet on single-class results. These results have proven that the proposed MENet has strong generalization ability and can benefit various tasks.

V Conclusion

In this paper, we propose two novel operations, merging and evolution, to perform feature fusion across all channels in a group convolution and alleviate the performance degradation induced by the loss of inter-group information. Based on the proposed operations, we introduce an architectural unit named the ME module, specially designed for compact networks. Finally, we propose MENet, a family of compact neural networks. Compared with ShuffleNet, the proposed MENet leverages inter-group information and generates more discriminative features. Extensive experiments show that MENet consistently outperforms other state-of-the-art compact neural networks under different computational budgets. Experiments on object detection show that MENet has strong generalization ability for transfer learning. For future work, we plan to further evaluate MENet on other tasks such as semantic segmentation.

Acknowledgment

This work is supported by the National Key Research and Development Program of China (2016YFB1000100).

References