ANTNets: Mobile Convolutional Neural Networks for Resource Efficient Image Classification

04/07/2019 ∙ by Yunyang Xiong, et al. ∙ Amazon, University of Wisconsin-Madison, Korea University

Deep convolutional neural networks have achieved remarkable success in computer vision. However, deep neural networks require large computing resources to achieve high performance. Although depthwise separable convolution can be an efficient module to approximate a standard convolution, it often leads to reduced representational power of networks. In this paper, under budget constraints such as computational cost (MAdds) and parameter count, we propose a novel basic architectural block, ANTBlock. It boosts the representational power by modeling, in a high-dimensional space, the interdependency of channels between a depthwise convolution layer and a projection layer in the ANTBlocks. Our experiments show that ANTNet, built from a sequence of ANTBlocks, consistently outperforms state-of-the-art low-cost mobile convolutional neural networks across multiple datasets. On CIFAR100, our model achieves 75.7% top-1 accuracy, which is 1.5% higher than MobileNetV2, with 8.3% fewer parameters. On ImageNet, our model achieves 72.8% top-1 accuracy, a 0.8% improvement over MobileNetV2, and is 20% faster on iPhone 5s.




1 Introduction

Deep neural networks have emerged as state-of-the-art solutions for various tasks in computer vision, machine learning, and natural language processing. Recent research in deep learning mainly focuses on deeper and heavier models that achieve superhuman accuracy with a tremendous number of parameters. Inception [12], ResNets [5], HighwayNets [22], and DenseNet [10] are popular architectures in this direction and have been shown to be effective in a variety of tasks. However, in many real-world applications, limited computing resources and short latency requirements call for more efficient recognition systems, for example, in mobile phones, robots, and smart appliances that require on-device intelligence.

For the last few years, small and efficient neural networks have enabled the deployment of models on computationally limited hardware for a wide range of applications. One stream of such efforts substitutes existing layers with more efficient ones. Since convolutional neural networks (CNNs) are the most popular base feature extraction networks in vision systems and their main computational burden is the convolutional layers, faster convolution layers are crucial. The standard convolution layer computes each output channel from all the input channels, so the number of filter weights and calculations increases as the number of input channels grows. In contrast, group convolution connects each output channel only to one group of input channels, which reduces the number of filter weights (and calculations) by a factor equal to the number of groups. Group convolution has been used in multiple architectures. AlexNet [15] uses group convolution to train models on GPUs with limited memory. Later, ResNeXt [25] utilized group convolution to achieve better performance, and [11] proposed more complex group convolution with hierarchical arrangements. One extreme of group convolution is the depthwise separable convolution introduced in [21], in which each group involves only one input channel and one convolution filter. Since then, the trick has been adopted in other architectures such as Inception [12], Flattened Networks [13], and Xception [2]. Recently, depthwise separable convolution has been adopted by compact architectures specifically designed for mobile devices: MobileNetV1 [6] and MobileNetV2 [20] achieved significant improvements in inference time (latency) on mobile devices.

Efficient convolutional layers are preferable, but some accuracy degradation is inevitable. To fill the performance gap induced by approximate convolution, recent network enhancement techniques can be used as long as the additional cost is negligible. There have been multiple attempts to boost the representational power of models at negligible additional cost. Attention has proven to be a general module that improves performance by suppressing irrelevant information and focusing on the informative parts of the data. Temporal and spatial attention have been studied in the literature but, arguably, they come with a significant cost. Channel attention, however, can be implemented in a much more efficient way. For instance, the “Squeeze-and-Excitation” (SE) block proposed in SENet [8] selectively reweights channels based on global information from each channel. This improves a variety of architectures at a minor computational cost. A channel shuffle operation also boosts the performance, or mitigates the degradation, of group convolution, as shown in ShuffleNet [27, 18] and two-stage convolution [24]. This line of work motivates us to develop an efficient and powerful architecture.

Our contributions: (i) We propose a new efficient and powerful architectural block, ANTBlock, that dynamically utilizes channel relationships; (ii) we show that a naive adaptation of channel attention (e.g., SE) does not improve the representational power of depthwise convolutional layers, and we propose an optimal configuration that maximizes the number of channels and has full channel receptive fields; (iii) using group convolution, we make the ANTBlock more efficient w.r.t. parameter count and computational cost without significant performance loss, and we extend it to an ensemble block; (iv) ANTBlock is simple to implement in widely used deep learning frameworks and outperforms state-of-the-art lightweight CNNs. ANTNet achieves a 0.8% top-1 accuracy improvement over MobileNetV2 on ImageNet [19] with fewer parameters and fewer multiply-adds (MAdds), resulting in 20% faster inference on a mobile phone.

2 Related Work

The efficiency of neural networks has become an important topic as networks get larger and deeper. The Inception module was utilized in GoogLeNet [23] to obtain high performance with a drastically reduced number of parameters by using small convolutions. An efficient bottleneck structure was designed to construct ResNet and achieve high performance. Further, the large demand for on-device applications encourages studies on resource-efficient models with minimal latency and memory usage. To this end, [4] studied module designs and the trade-offs between multiple factors such as depth, width, filter size, and pooling layers.

Group convolution is a straightforward and effective technique to save computation while maintaining accuracy. It was introduced with AlexNet [16] as a workaround for small GPUs. Later, DeepRoots [11] and ResNeXt [25] adopted group convolutions to improve models. Depthwise separable convolution is an extreme case of group convolution that performs convolution on each channel separately. It was first introduced in [21]; Xception [2] integrates the idea into Inception, and CondenseNet [9] does so for DenseNet. For mobile platforms, MobileNetV1 [6] and MobileNetV2 [20] used depthwise convolutions with some hyperparameters to control the size of the models.

Channel relationships are a relatively underexplored source of performance gains. This is a promising direction since it usually requires only a small additional cost. ShuffleNets [27, 18] shuffle channels within a two-stage group convolution, which can be efficiently implemented by a “random sparse convolution” layer [1, 26]. Apart from random sparse channel grouping, Squeeze-and-Excitation Networks (SENet) [8] study a dynamic channel reweighting scheme to boost model capacity at a small cost. The success of channel grouping and channel manipulation motivates our work.

3 Model Architecture: ANTNet

The goal of this work is to design a basic low-cost architectural block that can be used to build efficient convolutional neural networks for mobile devices under budget constraints. The budget of a model varies with the implementation and the hardware of the real-world application. To allow a general and fair comparison, the budget (or complexity) of a model is measured in the literature [3] by the number of computations, e.g., multiply-adds (MAdds) or floating point operations (FLOPs), and by the number of model parameters. Our goal is to build a more accurate CNN, ANTNet (Attention NesTed Network), with fewer MAdds and Params by stacking our novel basic blocks. ANTNet utilizes depthwise separable convolution and channel attention. Before introducing our ANTBlock, we briefly discuss depthwise separable convolution and its variations in terms of computation budgets (i.e., MAdds and Params).

Depthwise Separable Convolution has proven to be an effective module for building efficient neural network architectures. It approximates a standard convolution with two separate convolutions. The most common form [6] consists of two layers: a depthwise convolution that filters the data and a pointwise convolution that combines the outputs of the depthwise convolution. Consider that the input and output of the depthwise separable convolution are three-dimensional feature maps of size $h_i \times w_i \times c_i$ and $h_o \times w_o \times c_o$, where $h$, $w$, and $c$ denote the height, width, and number of channels of the feature map, and the subscripts $i$ and $o$ indicate input and output. For a convolution kernel of size $k \times k$, the total number of MAdds for depthwise separable convolution is

$$h_o \cdot w_o \cdot c_i \cdot (k^2 + c_o). \tag{1}$$

Compared to the standard convolution, it reduces the computational cost by almost a factor of $k^2$. However, it often leads to reduced representational power.
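The cost comparison above can be checked numerically. The following is a minimal sketch, assuming the usual operation counts (one multiply-add per kernel weight per output position, biases ignored); the function names and the layer dimensions are illustrative and do not come from the paper.

```python
def standard_conv_madds(h_o, w_o, c_i, c_o, k):
    """MAdds of a standard k x k convolution producing an h_o x w_o x c_o map."""
    return h_o * w_o * c_i * c_o * k * k


def depthwise_separable_madds(h_o, w_o, c_i, c_o, k):
    """MAdds of a k x k depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = h_o * w_o * c_i * k * k   # one k x k filter per input channel
    pointwise = h_o * w_o * c_i * c_o     # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 output, 64 -> 128 channels, 3 x 3 kernel.
std = standard_conv_madds(56, 56, 64, 128, 3)
sep = depthwise_separable_madds(56, 56, 64, 128, 3)
ratio = std / sep
print(f"standard: {std:,}  separable: {sep:,}  reduction: {ratio:.2f}x")
```

For this hypothetical layer the reduction is $k^2 c_o / (k^2 + c_o)$, which approaches $k^2 = 9$ as the number of output channels grows.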

Layer Input Operator Output
(a) Expansion layer 1x1 conv2d, ReLU6
(b) Depthwise layer 3x3 dwise, stride = $s$, ReLU6
(c) Channel attention layer Global pooling, FC(2), Sigmoid
(d) Group-wise projection layer linear 1x1 gconv2d, group = $g$
Table 1: The structure of an ANTBlock, which transforms the input channels to the output channels with expansion factor $t$, group $g$, and stride $s$.

Inverted Residual Block. One quick fix for the reduced representational power is to increase the number of input channels of the depthwise separable convolution by adding an expansion layer before the depthwise convolution, as in the inverted residual bottleneck blocks of MobileNetV2 [20]. The expansion operation increases the number of input channels by a factor of $t$ (the expansion factor) using a pointwise convolution. The inverted residual block has three different types of layers: an expansion layer, a depthwise layer, and a projection layer. The projection layer takes the largest portion of the computation and has more parameters than the others when the number of input/output channels is larger than the number of kernel parameters $k^2$ (for a $k \times k$ kernel).

Based on this observation about the MAdds and parameters of the inverted residual block, we develop a more accurate and efficient block by saving computation in the projection layer with cheaper operations and allocating more resources to the depthwise layer at a small additional cost.
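To make the cost distribution inside an inverted residual block concrete, here is a small sketch that counts weights and MAdds per layer. It assumes a stride-1 block, ignores biases and batch normalization, and uses hypothetical channel sizes; the helper name `inverted_residual_costs` is introduced only for this illustration.

```python
def inverted_residual_costs(h, w, c_in, c_out, t, k=3):
    """Per-layer (params, MAdds) of a stride-1 inverted residual block.

    Biases and normalization layers are ignored; t is the expansion factor.
    """
    c_mid = t * c_in
    params = {
        "expansion": c_in * c_mid,    # 1x1 convolution, c_in -> t * c_in
        "depthwise": c_mid * k * k,   # one k x k filter per expanded channel
        "projection": c_mid * c_out,  # 1x1 convolution, t * c_in -> c_out
    }
    # With stride 1 every weight is applied at each of the h * w positions.
    return {name: (p, p * h * w) for name, p in params.items()}


# Hypothetical block: 14 x 14 feature map, 64 -> 96 channels, expansion 6.
costs = inverted_residual_costs(h=14, w=14, c_in=64, c_out=96, t=6)
for name, (params, madds) in costs.items():
    print(f"{name:10s} params={params:>7,}  MAdds={madds:>10,}")
```

In this example the projection layer is the most expensive of the three and the depthwise layer is by far the cheapest, which is the imbalance the text proposes to exploit.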

3.1 Designing Efficient Blocks: ANTBlock

In this section, we introduce our ANTBlock in detail. The ANTBlock presented in Fig. 1(b) is a residual block and can be written as

$$y = x + \mathcal{F}(x). \tag{2}$$

When the dimension of the input to the block is not the same as that of the output, i.e., $\dim(x) \neq \dim(\mathcal{F}(x))$, we simply skip the residual connection, as in MobileNetV2. For simplicity, let us focus on Eq. 2. ANTBlock is motivated by the Inverted Residual Block in MobileNetV2, and $\mathcal{F}$ can be factored into two parts: a mapping from the input space to the high-dimensional depthwise convolution space, $\mathcal{Q}$, and a projection back to the input space, $\mathcal{P}$. Then, Eq. 2 can be written as

$$y = x + \mathcal{P}(\mathcal{Q}(x)). \tag{3}$$

In ANTBlock, $\mathcal{Q}$ consists of one expansion layer $\mathcal{E}$ and one depthwise convolutional layer $\mathcal{D}$, and $\mathcal{P}$ is a projection layer. Now, our block can be rewritten as

$$y = x + \mathcal{P}(\mathcal{D}(\mathcal{E}(x))). \tag{4}$$

This construction can be further improved by the attention mechanism. In [17], models equipped with an attention mask show significant improvement on segmentation. We apply a similar idea to the output of the depthwise layer to boost the feature representation. In this case, channel attention is used to improve representational power without a significant increase in computational cost and parameters. With channel attention, we can write our ANTBlock (see Fig. 1(b)) as

$$y_i = x_i + \mathcal{P}_i\big(\mathcal{A}(z) \odot z\big), \qquad z = \mathcal{D}(\mathcal{E}(x)), \tag{5}$$

where $\odot$ stands for the element-wise product and $x$ is the input feature, corresponding to the input of ANTBlock in Fig. 1(b). $z$ denotes the output of the depthwise convolution layer (b) of ANTBlock. $\mathcal{A}(z)$ denotes the attention mask for $z$, represented by (c)-1, (c)-2, and (c)-3 in Fig. 1(b). $y_i$ denotes an output channel of ANTBlock, and $\mathcal{P}_i$ is the projection for each output channel, corresponding to (d) in Fig. 1(b) with group convolution (group = 1), which means that each output of ANTBlock uses all features from the attention-weighted output $\mathcal{A}(z) \odot z$.
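The channel attention step (global pooling, two fully connected layers, sigmoid, then channel-wise multiplication) can be sketched in plain Python. This is an illustrative squeeze-and-excitation style implementation, not the authors' code: biases are omitted, and the names `channel_attention`, `w1`, `w2` and the toy weights are assumptions made for the example.

```python
import math


def channel_attention(feature_maps, w1, w2):
    """Squeeze-and-excitation style channel reweighting (biases omitted).

    feature_maps: list of C channels, each an H x W list of lists of floats.
    w1: (C / r) x C weights of the reducing fully connected layer.
    w2: C x (C / r) weights of the restoring fully connected layer.
    Returns the per-channel mask and the reweighted feature maps.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    mask = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]
    # Channel-wise multiplication: scale every value of a channel by its mask.
    scaled = [[[m * v for v in row] for row in ch]
              for m, ch in zip(mask, feature_maps)]
    return mask, scaled


# Toy depthwise output: C = 4 channels of size 2 x 2, reduction ratio r = 2.
z = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
mask, out = channel_attention(z, w1, w2)
print([round(m, 3) for m in mask])
```

Every mask value depends on the pooled statistics of all channels, which is why each reweighted channel carries information from the whole depthwise output.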

Figure 1: ANTNet architecture for ImageNet. (a) ANTNet; (b) ANTBlock. $H \times W \times C$ is the dimension of the tensor, $t$ is the expansion factor of channels, and $r$ is the reduction ratio for channel attention. Symbol $\oplus$ denotes element-wise addition and symbol $\otimes$ denotes channel-wise multiplication. ANTBlock $\times n$ indicates the number of repeated layers within ANTBlock. Note that if the output resolution differs from the input resolution, only the stride of the first layer within ANTBlock is 2 and the residual connection of the block is skipped. DwConv stands for depthwise convolution and GConv stands for group convolution. (a) The ANTNet model is shown in Table 2 with more details; (b) the structure of the corresponding ANTBlock for building ANTNet is shown in Table 1.
Figure 2: The structure of e-ANTBlock for building e-ANTNet. Two types of ANTBlock are used to construct each e-ANTBlock, and the weight parameters of the two types in each e-ANTBlock are learned end-to-end when training e-ANTNet, which is built from a sequence of e-ANTBlocks. The e-ANTNet architecture is similar to ANTNet except that the ANTBlocks of ANTNet are replaced with e-ANTBlocks.

As discussed before, the parameters and MAdds of the depthwise convolutional layer are usually fewer than those of the expansion layer and the projection layer. We use a group convolutional layer for a more efficient projection, saving parameters and MAdds by a factor equal to the number of groups. Group convolution was first adopted in [16] to use multiple GPUs for distributed convolution computation. It reduces the computational cost and the number of parameters while still achieving high representational power. [28] proposed channel local convolution (CLC), in which an output channel can depend on an arbitrary subset of the input channels. It is a multi-stage group convolution with a nice property, the so-called full channel receptive field (FCRF). They found that, in order to achieve high accuracy, every output channel of a CLC should cover all the input channels. In our case, channel attention uses all the input channels of the depthwise convolution layer, so any group convolution used as the projection layer satisfies the FCRF condition and our ANTBlock becomes a CLC block. With a group convolution layer as the projection layer, our ANTBlock can be written as

$$y_i = x_i + \mathcal{P}^{g}_i\big(u_{S_i}\big), \qquad u = \mathcal{A}(z) \odot z, \tag{6}$$

where $g$ is the number of groups of the group convolution, $z$ is the output of the expansion layer followed by the depthwise layer, $u$ denotes the output channels of the channel attention layer (that is, $z$ reweighted by the attention mask $\mathcal{A}(z)$), and $u_{S_i}$ denotes the feature maps associated with each output channel $y_i$ of ANTBlock, i.e., the channels of $u$ in the group $S_i$ that feeds $y_i$. $\mathcal{P}^{g}_i$ is the projection for each output channel with group convolution (group $= g$), corresponding to (d) in Fig. 1(b).
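The full channel receptive field argument can be illustrated by tracking, for every output channel of the projection layer, which depthwise channels it depends on. The sketch below is a simplified dependency model written for this illustration (the function names are hypothetical): without attention a channel depends only on itself, whereas with channel attention its mask is computed from the pooled statistics of all channels.

```python
def output_dependencies(channels, out_channels, groups, use_attention):
    """For each projection output, the set of depthwise channels it depends on."""
    if use_attention:
        # Each reweighted channel is scaled by a mask computed from all channels.
        deps = [set(range(channels)) for _ in range(channels)]
    else:
        deps = [{c} for c in range(channels)]
    in_per_group = channels // groups
    out_per_group = out_channels // groups
    result = []
    for o in range(out_channels):
        group = o // out_per_group
        inputs = range(group * in_per_group, (group + 1) * in_per_group)
        # A grouped 1x1 convolution only reads the channels of its own group.
        result.append(set().union(*(deps[c] for c in inputs)))
    return result


def has_fcrf(dependencies, channels):
    """True if every output channel covers all input channels."""
    return all(d == set(range(channels)) for d in dependencies)


# Hypothetical sizes: 8 depthwise channels projected to 4 outputs in 2 groups.
plain = output_dependencies(8, 4, groups=2, use_attention=False)
attended = output_dependencies(8, 4, groups=2, use_attention=True)
print(has_fcrf(plain, 8), has_fcrf(attended, 8))
```

In this model a grouped projection on its own loses the full channel receptive field, while placing channel attention in front of it restores the property for any number of groups.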

3.2 Ensemble ANTBlocks: e-ANTBlock

The proposed block (ANTBlock) can be extended further to an ensemble block, denoted by e-ANTBlock. To construct more powerful networks, we can ensemble (i.e., take a weighted aggregate of) different types of ANTBlocks (e.g., with different numbers of groups). The e-ANTBlock can be written as

$$y = \sum_{j=1}^{m} w_j \, \mathcal{B}_{g_j}(x), \tag{7}$$

where $m$ is the number of different ANTBlocks, $\mathcal{B}_{g_j}$ denotes an ANTBlock with a group convolutional layer (with $g_j$ groups) for projection, and $w_j$ is the weight of that ANTBlock. The weights are outputs of a softmax function written as

$$w_j = \frac{\exp(\alpha_j)}{\sum_{l=1}^{m} \exp(\alpha_l)}, \tag{8}$$

so that $\sum_{j=1}^{m} w_j = 1$ and $w_j \geq 0$. The $\alpha_j$ are parameters of the e-ANTBlock trained by backpropagation; during standard training, these parameters are learned end-to-end. In our experiments, $m = 2$ is used. The structure of e-ANTBlock with $m = 2$ can be seen in Fig. 2.
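As a small illustration of the softmax-weighted aggregation described above, the sketch below combines the outputs of two blocks using weights derived from unconstrained parameters. The block outputs are hypothetical flattened vectors and the parameter name `alphas` is an assumption for this example; in a real network the parameters would be learned by backpropagation rather than set by hand.

```python
import math


def softmax(alphas):
    """Map unconstrained parameters to non-negative weights that sum to one."""
    shift = max(alphas)  # subtract the maximum for numerical stability
    exps = [math.exp(a - shift) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(block_outputs, alphas):
    """Weighted sum of the outputs of several blocks applied to the same input."""
    weights = softmax(alphas)
    length = len(block_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, block_outputs))
            for i in range(length)]


# Hypothetical flattened outputs of two ANTBlock variants for one input.
out_a = [1.0, 2.0, 3.0]
out_b = [3.0, 2.0, 1.0]
y = ensemble([out_a, out_b], alphas=[0.0, 0.0])  # equal parameters: plain average
print(y)
```

Because the weights come from a softmax, they stay positive and sum to one regardless of the parameter values, so the ensemble output remains a convex combination of the block outputs.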

3.3 ANTNet

ANTNet (Attention NesTed Network) is a new efficient convolutional neural network architecture constructed by a sequence of ANTBlocks. The architecture of the network is similar to MobileNetV2, but all the inverted residual blocks are replaced by ANTBlocks and one may use a different number of ANTBlocks depending on the target accuracy.

Now, we describe our architecture in detail. The basic building block is ANTBlock, which has an expansion layer, a depthwise convolutional layer, a channel attention layer, and a group-wise projection layer with residual connections. The detailed structure of ANTBlock is shown in Table 1 and Figure 1(b).

Name Input Operator Expansion Reduction ratio Output channels Repetition Stride Group
conv0 conv2d - - 32 1 2 -
ant1 ANTBlock 1 8 16 1 1 1
ant2 ANTBlock 6 8 24 2 2
ant3 ANTBlock 6 12 32 2 2
ant4 ANTBlock 6 16 64 2 2
ant5 ANTBlock 6 24 96 3 1 2
ant6 ANTBlock 6 32 160 2 2
ant7 ANTBlock 6 64 320 1 2
conv8 conv2d 1x1 - - 1280 1 1 -
pool9 avgpool 7x7 - - 1280 1 1 -
fc10 FC - - n - - -
Table 2: The architecture of ANTNet (g = 2). Each line gives a sequence of one or more identical (modulo stride) layers, repeated the number of times given in the Repetition column. All layers in the same sequence have the same number of output channels. The stride is applied only to the first block in each sequence. ANTNet (g = 1) has the same parameters as above, but the group is always 1.

A channel attention block, (c) in Fig. 1(b), introduces additional parameters and MAdds compared to the Inverted Residual Block. Consider our ANTNet, which has $N$ groups of repeated blocks as shown in Table 2, where group $i$ has $n_i$ ANTBlocks. Given reduction ratio $r_i$, the increase in computational cost can be written as

$$\sum_{i=1}^{N} n_i \cdot \frac{2 C_i^2}{r_i}, \tag{9}$$

where $C_i$ is the number of output channels from the depthwise convolutional layer. Eq. 9 shows that when the number of output channels increases, the number of additional parameters and MAdds increases as well. Also, [7] demonstrates that channel attention is prone to saturate at later layers, and saturation also appears in our experiments. Therefore, the reduction ratio can be optimized for each group of repeated blocks; our configuration is shown in Table 2. Later layers have fewer degrees of freedom in terms of channel reweighting.
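The effect of the reduction ratio on the overhead of channel attention can be estimated with a few lines of code. The sketch assumes the attention layer consists of two bias-free fully connected layers mapping $C$ channels to $C / r$ and back, so each block adds $2C^2 / r$ weights; the stage configurations are hypothetical examples, not the values of Table 2.

```python
def attention_overhead(stages):
    """Extra weights added by channel attention.

    stages: iterable of (repetitions, depthwise_channels, reduction_ratio).
    Each block adds two fully connected layers, C -> C / r and C / r -> C.
    """
    return sum(n * 2 * c * c // r for n, c, r in stages)


# Hypothetical stages: an early narrow one and a late wide one.
early = attention_overhead([(2, 144, 8)])
late_small_ratio = attention_overhead([(3, 576, 8)])
late_large_ratio = attention_overhead([(3, 576, 24)])
print(early, late_small_ratio, late_large_ratio)
```

Since the overhead grows quadratically with the channel count, a larger reduction ratio in the wide late stages removes far more weights than it would in the early stages, which is consistent with choosing the ratio per stage.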

The group-wise projection layer in the ANTBlock reduces the number of parameters and MAdds. The group parameter needs to be determined for every ANTBlock when building ANTNet; it is a trade-off between efficiency and model accuracy. The overall design of ANTNet is shown in Table 2. We set the expansion rate to 1 and the group to 1 for the first repeated blocks (ant1). In the others, we use a constant expansion rate of 6 and a group of 2 throughout the network.

When a larger budget (e.g., MAdds and Params) is allowed, e-ANTBlock can be used as a basic block to build a more powerful network. e-ANTNet achieves the highest accuracy as shown in Table 3.

4 Experiments

We evaluate the computational efficiency and accuracy of ANTNet and compare it with state-of-the-art mobile models with favorable classification accuracy. The computational efficiency is measured theoretically by MAdds, and Params, and empirically by CPU Latency and model size on a mobile platform (iPhone 5s). For accuracy of models, we evaluate the image classification accuracy on CIFAR100 dataset [14] and ImageNet dataset (ILSVRC 2012 image classification) [19]. For ImageNet, we follow the prior work and use the validation dataset as a test set.

ANTNet is implemented using PyTorch. We use the built-in 1×1 convolution and group convolution implementations for the channel attention and projection layers. Our ANTBlock is easy to reproduce in any deep learning framework, such as Caffe, TensorFlow, and MXNet, using built-in layers as long as 1×1 standard convolution and group convolution are available.

The SGD optimizer with Nesterov momentum was used for model training in our experiments. We use a multistep learning rate schedule, in which the initial learning rate is decayed by a multiplicative factor at two fixed epochs, and training stops at a fixed maximum epoch. The regularization parameter (weight decay) is set to the value used in the Inception model [23] and is the same for all the convolution layers in ANTNet. We use the same default data augmentation module as in ResNet for fair comparison. Random cropping and horizontal flipping are used for training images, and images are resized or cropped to 224 × 224 pixels for ImageNet and 32 × 32 pixels for CIFAR100. During testing, the trained model is evaluated on center crops. The same default image preprocessing settings as in ResNet [5] are used for evaluation.
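A multistep learning rate schedule of the kind described above can be sketched as follows. The base learning rate, decay factor, and milestone epochs in the example are placeholders chosen purely for illustration; they are not the values used in the paper.

```python
def multistep_lr(epoch, base_lr, milestones, gamma):
    """Learning rate at a given epoch for a multistep schedule.

    The rate is multiplied by gamma once for every milestone epoch that has
    been reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Placeholder hyperparameters (assumed for illustration only).
BASE_LR, GAMMA, MILESTONES, MAX_EPOCH = 0.1, 0.1, (100, 150), 200

schedule = [multistep_lr(e, BASE_LR, MILESTONES, GAMMA) for e in range(MAX_EPOCH)]
print(schedule[0], schedule[100], schedule[150])
```

The same behavior is available in common frameworks as a built-in scheduler, so a hand-written function like this is mainly useful for checking or plotting a planned schedule.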

4.1 CIFAR100 Classification

The CIFAR100 dataset consists of 32 × 32 RGB images of 100 classes, with 50,000 training images and 10,000 test images. We consider the state-of-the-art network architecture MobileNetV2 as our baseline. For fair comparison, we keep our settings the same as for MobileNetV2. The images are zero-padded on each side, and we then randomly sample a 32 × 32 crop from the padded image. Horizontal flipping and RGB mean value subtraction are applied as well. The overall network architecture and the hyperparameters for CIFAR100 are the same as for the ImageNet ANTNet described in Table 2, except for the different input and output sizes (100 classes vs. 1,000 classes) and the strides of the first conv2d and of one of the ANTBlock stages, which are set to 1.
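The pad, random crop, and flip augmentation can be sketched without any framework. The example works on a single-channel image stored as nested lists, and the padding of 4 pixels is an assumed, commonly used value rather than a number taken from the text; the function name is hypothetical.

```python
import random


def pad_crop_flip(image, pad, size, rng):
    """Zero-pad an H x W image, take a random size x size crop, maybe flip it."""
    h, w = len(image), len(image[0])
    blank = [0] * (w + 2 * pad)
    padded = [list(blank) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [list(blank) for _ in range(pad)]
    # Top-left corner of the crop, chosen uniformly among valid positions.
    top = rng.randint(0, h + 2 * pad - size)
    left = rng.randint(0, w + 2 * pad - size)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal flip
    return crop


# Dummy 32 x 32 image of ones; pad = 4 is an assumption for this sketch.
rng = random.Random(0)
image = [[1] * 32 for _ in range(32)]
augmented = pad_crop_flip(image, pad=4, size=32, rng=rng)
print(len(augmented), len(augmented[0]))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing models under identical data pipelines.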

As our purpose is to build a resource-efficient image classifier for mobile platforms, we only compare our model with low-cost models that have fewer parameters, consume less memory, and use a small network width. We consider the mobile-suitable models MobileNet and ShuffleNet as our comparison baselines. We evaluate the top-1 and top-5 accuracy and compare the MAdds and the number of parameters. The performance comparison between the baseline models and our ANTNet is listed in Table 3. Our ANTNet achieves significant improvements over MobileNetV2 and ShuffleNet with a lower computational cost and parameter count. Our ANTNet (g = 2) reduces both computation and parameters while increasing top-1 accuracy by 1.5% over MobileNetV2. In addition, our ANTNet (g = 1) achieves a larger top-1 accuracy improvement with a slightly higher computational cost and parameter count.

Network Top-1 Accu. Top-5 Accu. #Parameters #MAdds
ShuffleNet (1.5)
ANTNet (g = 1)
ANTNet (g = 2)
Table 3: Performance on CIFAR100. We compare ANTNet models with MobileNetV2. Our proposed model ANTNet (g = 2) reduces both computation and parameter count while increasing top-1 accuracy.

4.1.1 Optimal Configuration of Channel Attention

Channel attention in the ANTBlock is key to improving the feature representation, but a naive combination of channel attention and depthwise separable convolution does not necessarily yield better performance. Table 4 shows that the naive adaptation of squeeze-and-excitation [8] to MobileNetV2 (se-MobileNetV2) does not improve the representational power. Combining channel attention with depthwise convolutional layers needs a more careful design. We observed that channel attention is effective when the number of channels is large. Also, similar to the rule for the full channel receptive field (FCRF) [28], we design the ANTBlock so that each output channel of the depthwise convolutional layer has a full channel receptive field to maximize the representational power. Therefore, channel attention is inserted between the expansion and projection layers in the ANTBlock, as proposed in Fig. 1(b). One additional advantage of this design is that, since any output channel of the depthwise convolutional layer has an FCRF, the projection layer in the ANTBlock can be substituted with any group convolutional layer while ensuring that all output channels of the ANTBlock have an FCRF. All ANTBlocks, for any number of groups, have an FCRF.

Network Top-1 Accu. Top-5 Accu.
ANTNet (proposed) 75.7 93.6
Table 4: Different configurations of channel attention in the ANTBlock are evaluated on CIFAR100. For projection, all ANTNets use group convolution (g = 2) and MobileNetV2 uses the standard convolution (g = 1). All the blocks have similar computational costs and parameter counts. Our construction (ANTNet), with channel attention between the depthwise convolution layer and the projection layer, shows the largest improvement (1.5%), which is consistent with our intuition. Note that a naive adaptation of squeeze-and-excitation does not improve the performance of MobileNetV2: se-MobileNetV2, which is a simple concatenation of a MobileNetV2 block and an SE block, shows degradation compared to MobileNetV2. Even within a MobileNetV2 block, channel attention at arbitrary positions, such as before the expansion layer (c-ANTNet) or after the projection layer (ANTNet-c), is not effective.

We compare the accuracy of three different arrangements of channel attention against the MobileNetV2 and they have similar computational costs and parameter counts. Our experiments in Table 4 show that ANTBlock with channel attention between expansion and projection layers was most effective (+1.5% Top-1 Accuracy) whereas all other arrangements do not show a significant performance boost. The channel attention after the projection layer (ANTNet-c) has almost the same performance as MobileNetV2 and channel attention before the expansion layer (c-ANTNet) even reduces the representational power. This experimental result is consistent with our observation and shows that channel attention is most effective with a large number of channels and a full channel receptive field.

4.1.2 Reduction Ratio

The reduction ratios in Eq. (9), one per group of ANTBlocks, are hyperparameters that adjust the capacity and the MAdds/Params. We varied the reduction ratio across ANTBlocks, and our final model (ANTNet) achieved better accuracy (see Table 5) with fewer parameters than models with a reduction ratio fixed for all ANTBlocks. We also observed that the last stage of the network shows an interesting tendency towards a saturated state. We found that the reduction ratio setting of our ANTNet (see Table 2) achieves a good balance between accuracy and complexity, and we thus use this setting for all experiments.

Ratio Params MAdds Top-1 accu Top-5 accu
8 3.5M 92.3M 75.9 94.1
16 3.0M 91.7M 75.2 93.9
32 2.8M 91.5M 75.5 93.9
Ours (mixed) 2.7M 91.4M 75.9 94.3
Table 5: Performance comparison on CIFAR100 of our ANTNet configured with a different reduction ratio for each ANTBlock (mixed) versus a reduction ratio fixed for all ANTBlocks.

4.1.3 Parameters Learning of e-ANTBlock

Adaptively weighting different types of ANTBlocks in an e-ANTBlock allows building larger models with higher accuracy. In our experiment, we use two types of ANTBlocks (varying the number of groups in the projection convolution) to construct each e-ANTBlock, i.e., ANTBlocks with group convolution with group 1 and group 2. Let $w_1$ and $w_2$ denote the weights of the two types of ANTBlocks. We compared the accuracy on CIFAR100 against manually set weights $(w_1, w_2)$, e.g., (0, 1), (1, 0), (0.5, 0.5), etc. Setting $(w_1, w_2)$ to (0, 1) or (1, 0) means that only one type of ANTBlock is used to construct the e-ANTBlock and the other type is not used. Setting $(w_1, w_2) = (0.5, 0.5)$ means that both types of blocks are used and averaged. The experiment on CIFAR100 with e-ANTNet shows that automatic learning outperforms manual setting (see Table 6). The best top-1 accuracy with fixed weights was 76.2%, whereas the learned weights achieved 76.7%.

$w_1$ $w_2$ Top-1 accu
0 1 75.7
1 0 75.9
0.5 0.5 76.2
learned learned 76.7
Table 6: Performance of e-ANTNet on CIFAR100, with e-ANTBlocks that weight two types of ANTBlock, with and without learning the weights $w_1$ and $w_2$.

4.2 ImageNet Classification

The ImageNet 2012 dataset consists of 1.28 million training images and 50K validation images from 1,000 classes. We train our network on the training set and report top-1 and top-5 accuracy together with the corresponding MAdds and parameters of the models.

Our ANTNet architecture is shown in Fig. 1(a) and the details of its layers are listed in Table 2. We compare our models with other low-cost models, such as MobileNetV1, MobileNetV2, and ShuffleNet (1.5). The comparison of accuracy and computation budgets is shown in Table 7. Our ANTNet (g = 2) achieves a consistent improvement over MobileNetV2 of 0.8% Top-1 accuracy and outperforms ShuffleNet (1.5) by 1.3%. Compared with the most resource-efficient model, CondenseNet (G=C=4), our ANTNet (g = 2) achieves a 1.8% accuracy improvement, even with fewer MAdds. With slightly more parameters and MAdds, our ANTNet (g = 1) offers a 1.2% Top-1 accuracy improvement over MobileNetV2. Also, we have a variant of our ANTNet that has performance comparable to MobileNetV2 with MAdds and Params similar to CondenseNet (G=C=4).

Model #Parameters #MAdds Top-1 Accu. (%) Top-5 Accu. (%)
MobileNetV1 4.2M 575M 70.6 89.5
SqueezeNext 3.2M 708M 67.5 88.2
ShuffleNet (1.5) 3.4M 292M 71.5 -
ShuffleNet (x2) 5.4M 524M 73.7 -
CondenseNet (G=C=4) 2.9M 274M 71.0 90.0
CondenseNet (G=C=8) 4.8M 529M 73.8 91.7
MobileNetV2 3.4M 300M 72.0 91.0
MobileNetV2 (1.4) 6.9M 585M 74.7 92.5
NASNet-A 5.3M 564M 74.0 91.3
AmoebaNet-A 5.1M 555M 74.5 92.0
PNASNet 5.1M 588M 74.2 91.9
DARTS 4.9M 595M 73.1 91
ANTNet (g = 1) (ours) 3.7M 322M 73.2 91.2
ANTNet (g = 2) (ours) 3.2M 267M 72.8 91.0
e-ANTNet (ours) 5.5M 545M 74.2 91.6
ANTNet (with depth multiplier) (ours) 6.8M 598M 75.0 92.3
Table 7: Performance results on ImageNet classification. We compare our ANTNet models with other mobile models. Our proposed model ANTNet (g = 2) achieves a 0.8% absolute Top-1 accuracy improvement over MobileNetV2 with fewer parameters and fewer MAdds. Compared with the lightest model, CondenseNet (G=C=4), our model achieves 1.8% higher absolute Top-1 accuracy with fewer MAdds. To compare with models at a larger MAdds budget, we increase the dimension of the features of our ANTNet with a depth multiplier; the resulting model (last row) performs better than all baseline models, with a 0.3% Top-1 accuracy improvement over MobileNetV2 (1.4), 1.0% over NASNet-A, and 1.9% over DARTS.

4.3 Inference on a Mobile Device

As discussed above, MAdds and Params are used to measure the computational cost and model size. They are handy for comparing models across a variety of implementations and hardware. However, this estimate does not account for the cost of memory reads and writes, which can be a crucial factor in a real-world scenario. Since memory access is slower than computation, the amount of memory access has a big impact on the real speed on actual devices. Moreover, both CPUs and GPUs can use caching to speed up memory reads and writes. Memory coalescing can be very useful for speeding up memory reads, as each thread can read a chunk of memory in one go instead of doing separate reads. Kernels can also read small amounts of memory into local or thread-group storage for faster access. It is also possible for each thread to compute multiple outputs instead of only one, allowing it to reuse some of the input multiple times and thus requiring fewer memory reads overall. In short, the inference speed on actual devices depends on the hardware architecture and on how each layer is implemented, so the inference speed of models should be tested on actual devices as well.

Model MAdds CoreML Model Size Latency
ANTNet (g = 1)
ANTNet (g = 2)
Table 8: Latency (inference time) measured on an actual device, iPhone 5s. Our proposed model ANTNet (g = 2) is 20% faster than MobileNetV2.

We evaluate the actual inference time of models on a commodity iOS-based smartphone, iPhone 5s, which has a 64-bit 1.3 GHz dual-core Apple Cyclone CPU (Apple A7), an Apple M7 motion coprocessor, and 1GB of LPDDR3 RAM. To run inference on iPhone 5s, we convert our trained models to CoreML models, which can be deployed on iOS-based devices using Apple's machine learning platform. CoreML is optimized for on-device performance and minimizes memory footprint and power consumption. Although it targets only iOS-based platforms, it is still meaningful to compare the speed of ANTNet relative to other baseline models. The actual inference time of the models on iPhone 5s is reported in Table 8. We run each model multiple times, discard the fastest and the slowest runs, and take the average of the remaining runs as the final inference time. The table also provides the file sizes of the converted CoreML models. Table 8 shows that our ANTNet achieves a 20% speedup compared to MobileNetV2, and the latency improvement is consistent with our analysis of MAdds.
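The timing protocol of discarding the fastest and slowest runs before averaging can be expressed as a small helper. This is a generic sketch: the number of runs and the workload are placeholders, and `run_once` stands for whatever call performs a single inference on the target device.

```python
import time


def trimmed_mean(timings):
    """Average of the timings after dropping the fastest and the slowest run."""
    if len(timings) < 3:
        raise ValueError("need at least three runs to drop the extremes")
    kept = sorted(timings)[1:-1]
    return sum(kept) / len(kept)


def measure_latency(run_once, runs):
    """Time run_once() several times and return the trimmed mean in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return trimmed_mean(timings)


# Placeholder workload standing in for one forward pass of a model.
latency = measure_latency(lambda: sum(i * i for i in range(10_000)), runs=10)
print(f"trimmed mean latency: {latency * 1e3:.3f} ms")
```

Dropping the extremes makes the estimate less sensitive to one-off effects such as a cold cache on the first run or an interruption by another process.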

The CPU inference time on a desktop machine with a 2.10 GHz 32-core Intel(R) Xeon(R) CPU E5-2620 shows an improvement similar to that on iPhone 5s: our ANTNet (g = 2) is 8% faster than MobileNetV2 (1.11s vs. 1.21s).

5 Conclusion

In this paper, we proposed the ANTBlock, a novel basic architectural unit designed to boost the representational capacity of a network by imposing channel-wise attention and grouped convolution. The capacity of ANTBlock allows designing resource-efficient networks. MobileNetV2 can be viewed as a special case of our network with channel-wise attention and group convolution removed. Extensive experiments demonstrate the effectiveness and efficiency of our ANTNet, which achieves state-of-the-art performance on multiple datasets. In addition, the experiments on an actual device, iPhone 5s, show that ANTNet achieves a significant latency improvement over state-of-the-art low-cost models in practice. Finally, the improved capacity induced by ANTBlocks shows that leveraging the interdependency of channels is a promising direction for finding more resource-efficient mobile models under MAdds and parameter constraints.