1 Introduction
Deep neural networks have emerged as state-of-the-art solutions for various tasks in computer vision, machine learning, and natural language processing. Recent research in deep learning mainly focuses on deeper and heavier models that achieve superhuman accuracy with a tremendous number of parameters. Inception [12], ResNets [5], Highway Networks [22], and DenseNet [10] are popular architectures in this direction and have been shown to be effective in a variety of tasks. However, in many real-world applications, more efficient recognition systems are often required due to limited computing resources and low-latency requirements, for example, in mobile phones, robots, and smart appliances that need on-device intelligence.
For the last few years, small and efficient neural networks have enabled the deployment of models on computationally limited hardware for a wide range of applications. One stream of such efforts is to substitute existing layers with more efficient ones. Since convolutional neural networks (CNNs) are the most popular base feature extractors in vision systems and convolutional layers are the main computational burden, faster convolution layers are crucial. A standard convolution layer uses all the input channels for each output channel, so the number of filter weights and calculations grows with the number of input channels. Instead,
group convolution restricts each output channel to one group of input channels, reducing the number of filter weights (and calculations) by a factor of the number of groups. Group convolution has been used in multiple architectures. AlexNet [15] uses group convolution to train models on GPUs with limited memory. Later, ResNeXt [25] utilizes group convolution to achieve better performance, and [11] proposed more complex group convolutions with hierarchical arrangements. One extreme of group convolution is depthwise separable convolution, introduced in [21], where each group involves only one input channel and one convolution filter. Since then, the trick has been adopted in other architectures such as Inception [12], Flattened Networks [13], and Xception [2]. Recently, depthwise separable convolution has been adopted by compact architectures specifically designed for mobile devices: MobileNetV1 [6] and MobileNetV2 [20] achieved significant improvements in inference time (latency) on mobile devices.
Efficient convolutional layers are preferable, but some accuracy degradation is inevitable. To fill the performance gap induced by approximate convolution, recent network enhancement techniques can be used as long as the additional cost is negligible. There have been multiple attempts to boost the representational power of models at negligible extra cost. Attention has proven to be a general module that improves performance by suppressing irrelevant information and focusing on informative parts of the data. Temporal/spatial attention has been studied in the literature but, arguably, comes with a significant cost; channel attention, however, can be implemented much more efficiently. For instance, the "Squeeze-and-Excitation" (SE) block proposed in SENet [8] selectively reweights channels based on global information from each channel, improving a variety of architectures at a minor computational cost.
A channel shuffle operation also boosts performance or mitigates the degradation of group convolution, as shown in ShuffleNet [27, 18] and two-stage convolution [24]. This line of work motivates us to develop an efficient and powerful architecture.
Our contributions: (i) We propose a new efficient and powerful architectural block, ANTBlock, that dynamically utilizes channel relationships; (ii) we show that a naive adaptation of channel attention (e.g., SE) does not improve the representational power of depthwise convolutional layers, and we propose a configuration that maximizes the number of attended channels and has full channel receptive fields; (iii) using group convolution, we make the ANTBlock more efficient in parameter count and computational cost without significant performance loss, and we extend it to an ensemble block; (iv) ANTBlock is simple to implement in widely used deep learning frameworks and outperforms state-of-the-art lightweight CNNs. ANTNet improves over MobileNetV2 on ImageNet [19] with fewer parameters and fewer multiply-adds (MAdds), resulting in 20% faster inference on a mobile phone.
2 Related Work
The efficiency of neural networks has become an important topic as networks get larger and deeper. The Inception module was utilized in GoogLeNet [23] to obtain high performance with a drastically reduced number of parameters by using small convolutions. An efficient bottleneck structure was designed to construct ResNet [5] with high performance. Further, the large demand for on-device applications encourages studies on resource-efficient models with minimal latency and memory usage. To this end, [4] studied module designs that trade off multiple factors such as depth, width, filter size, and pooling layers.
Group convolution is a straightforward and effective technique to save computation while maintaining accuracy. It was introduced with AlexNet [16] as a workaround for small GPUs. Later, Deep Roots [11] and ResNeXt [25] adopted group convolutions to improve models. Depthwise separable convolution is an extreme case of group convolution that convolves each channel separately. It was first introduced in [21]; Xception [2] integrates the idea into Inception, and CondenseNet [9] does so for DenseNet. For mobile platforms, MobileNetV1 [6] and MobileNetV2 [20] use depthwise convolutions with hyperparameters that control the model size.
Channel relationships are a relatively underexplored source of performance gains. It is a promising direction since it usually requires little additional cost. ShuffleNets [27, 18] shuffle channels within two-stage group convolutions, which can be efficiently implemented by a "random sparse convolution" layer [1, 26]. Apart from random sparse channel grouping, Squeeze-and-Excitation Networks (SENet) [8] study a dynamic channel reweighting scheme that boosts model capacity at a small cost. The success of channel grouping and channel manipulation motivates our work.
3 Model Architecture: ANTNet
The goal of this work is to design a basic low-cost architectural block that can be used to build efficient convolutional neural networks for mobile devices with budget constraints. The budget of a model varies depending on implementations and hardware for real-world applications. For a general and fair comparison, the literature [3] measures the budget (or complexity) of a model by the number of computations, e.g., multiply-adds (MAdds) or floating point operations (FLOPs), and the number of model parameters (Params). Our goal is to build a more accurate CNN, ANTNet (Attention NesTed Network), with fewer MAdds and Params by stacking our novel basic blocks. ANTNet utilizes depthwise separable convolution and channel attention. Before introducing our ANTBlock, we briefly discuss depthwise separable convolution and its variations along with their computation budgets (i.e., MAdds and Params).
Depthwise Separable Convolution has proven to be an effective module for building efficient neural network architectures. It approximates a standard convolution with two separate convolutions: a depthwise convolution and a pointwise convolution. The most common form [6] consists of two layers: a depthwise convolution that filters the data and a pointwise convolution that combines the outputs of the depthwise convolution. Consider input and output three-dimensional feature maps of size $h_i \times w_i \times c_i$ and $h_o \times w_o \times c_o$, where $h$, $w$, and $c$ denote the height, width, and number of channels of the feature map and the subscripts $i$ and $o$ indicate input and output. For a convolution kernel of size $k \times k$, the total number of MAdds for depthwise separable convolution is

$h_o \cdot w_o \cdot c_i \cdot (k^2 + c_o). \quad (1)$

Compared to the standard convolution, which costs $h_o \cdot w_o \cdot c_i \cdot c_o \cdot k^2$ MAdds, this reduces the computational cost by a factor of $1/c_o + 1/k^2$, i.e., almost $k^2$ times. However, it often leads to reduced representational power.
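As a quick numerical check of Eq. (1), the helpers below compare the cost of a depthwise separable convolution against a standard convolution (the function names are ours, for illustration only):

```python
def madds_standard(h_o, w_o, c_i, c_o, k=3):
    """Standard convolution: every output channel sees all input channels."""
    return h_o * w_o * c_i * c_o * k * k

def madds_depthwise_separable(h_o, w_o, c_i, c_o, k=3):
    """Depthwise (k x k per channel) plus pointwise (1x1 channel mixing), Eq. (1)."""
    return h_o * w_o * c_i * (k * k + c_o)

if __name__ == "__main__":
    # The cost ratio is 1/c_o + 1/k^2, i.e. close to a k^2-fold saving
    # whenever c_o is much larger than k^2.
    std = madds_standard(56, 56, 64, 128)
    sep = madds_depthwise_separable(56, 56, 64, 128)
    print(sep / std)
```

For a 3x3 kernel the saving approaches 9x as the number of output channels grows, matching the "almost $k^2$ times" observation above.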
Table 1: The structure of the ANTBlock. $h \times w \times c$ is the input size, $t$ the expansion factor, $r$ the reduction ratio, $s$ the stride, $g$ the number of groups, and $c'$ the number of output channels.

Layer | Input | Operator | Output
(a) Expansion layer | $h \times w \times c$ | 1x1 conv2d, ReLU6 | $h \times w \times tc$
(b) Depthwise layer | $h \times w \times tc$ | 3x3 dwise stride = $s$, ReLU6 | $\frac{h}{s} \times \frac{w}{s} \times tc$
(c) Channel attention layer | $\frac{h}{s} \times \frac{w}{s} \times tc$ | Global pooling, FC(2), Sigmoid | $\frac{h}{s} \times \frac{w}{s} \times tc$
(d) Groupwise projection layer | $\frac{h}{s} \times \frac{w}{s} \times tc$ | linear 1x1 gconv2d, group = $g$ | $\frac{h}{s} \times \frac{w}{s} \times c'$
Inverted Residual Block. One quick fix for the reduced representational power is to increase the number of input channels of the depthwise separable convolution by adding an expansion layer before the depthwise convolution, as in the inverted residual bottleneck blocks of MobileNetV2 [20]. The expansion layer expands the number of input channels by a factor of $t$ using a pointwise convolution. The inverted residual block thus has three types of layers: an expansion layer, a depthwise layer, and a projection layer. The projection layer takes the largest portion of the computation and has more parameters than the others whenever the number of output channels is larger than the number of kernel parameters $k^2$.
Based on this observation on the MAdds and parameters of the inverted residual block, we develop a more accurate and efficient block by saving computation on the projection layer with cheaper operations and allocating more resources to the depthwise layer at a small additional cost.
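The per-layer breakdown above can be verified with a small calculator (a sketch with illustrative numbers; the function and its arguments are ours, not from the paper):

```python
def inverted_residual_madds(h, w, c, c_out, t=6, k=3):
    """Per-layer multiply-adds of a stride-1 inverted residual block.
    Returns (expansion, depthwise, projection)."""
    exp = h * w * c * (t * c)       # 1x1 conv: c -> t*c channels
    dw = h * w * (t * c) * k * k    # k x k depthwise on t*c channels
    proj = h * w * (t * c) * c_out  # 1x1 conv: t*c -> c_out channels
    return exp, dw, proj

if __name__ == "__main__":
    # The projection dominates the depthwise layer whenever c_out > k*k.
    exp, dw, proj = inverted_residual_madds(14, 14, 96, 96)
    print(exp, dw, proj)
```

With 96 output channels and a 3x3 kernel, the projection costs over ten times as many MAdds as the depthwise layer, which is what motivates cheapening the projection.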
3.1 Designing Efficient Blocks: ANTBlock
In this section, we introduce our ANTBlock in detail. The ANTBlock presented in Fig. 1(b) is a residual block and can be written as

$\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x}), \quad (2)$

where $\mathbf{x}$ and $\mathbf{y}$ are the input and output feature maps and $\mathcal{F}$ is the residual mapping.
When the dimension of the input to the block differs from that of the output, i.e., $\dim(\mathbf{x}) \neq \dim(\mathbf{y})$, we simply skip the residual connection, as in MobileNetV2. For simplicity, let us focus on Eq. 2. The ANTBlock is motivated by the inverted residual block in MobileNetV2, and $\mathcal{F}$ can be factored into two parts: a mapping $\mathcal{H}$ from the input space to the high-dimensional depthwise convolution space, and a projection $\mathcal{P}$ back to the output space. Then, Eq. 2 can be written as

$\mathbf{y} = \mathbf{x} + \mathcal{P}(\mathcal{H}(\mathbf{x})). \quad (3)$

In the ANTBlock, $\mathcal{H}$ consists of one expansion layer $\mathcal{E}$ and one depthwise convolutional layer $\mathcal{D}$, and $\mathcal{P}$ is a projection layer. Our block can then be rewritten as

$\mathbf{y} = \mathbf{x} + \mathcal{P}(\mathcal{D}(\mathcal{E}(\mathbf{x}))). \quad (4)$
This construction can be further improved by an attention mechanism. In [17], models equipped with an attention mask show significant improvement on segmentation. We apply a similar idea to the output of the depthwise layer to boost the feature representation. Here, channel attention is used to improve representational power without a significant increase in computational cost or parameters. With channel attention, our ANTBlock (see Fig. 1(b)) becomes

$y_j = x_j + \mathcal{P}_j(\mathbf{A}(\mathbf{u}) \odot \mathbf{u}), \quad (5)$

where $\odot$ stands for element-wise product and $\mathbf{x}$ is the input feature, corresponding to the input of the ANTBlock in Fig. 1(b). $\mathbf{u} = \mathcal{D}(\mathcal{E}(\mathbf{x}))$ denotes the output of the depthwise convolution layer (b) of the ANTBlock. $\mathbf{A}(\mathbf{u})$ denotes the attention mask for $\mathbf{u}$, represented by (c)1, (c)2, and (c)3 in Fig. 1(b). $y_j$ denotes an output channel of the ANTBlock and $\mathcal{P}_j$ is the projection for that output channel, corresponding to (d) in Fig. 1(b) with group convolution (group = 1), which means each output channel of the ANTBlock uses all feature maps of the attention-weighted output $\mathbf{A}(\mathbf{u}) \odot \mathbf{u}$.
Fig. 1: (a) The ANTNet model, shown in more detail in Table 2; (b) the structure of the corresponding ANTBlock used to build ANTNet, shown in Table 1. Here $t$ is the expansion factor of channels, $r$ is the reduction ratio for channel attention, $s$ is the stride, and $n$ is the number of repeated layers within an ANTBlock stage. If the output resolution differs from the input resolution, only the first layer within the stage has stride $s$ and the residual connection of that block is skipped. DwConv stands for depthwise convolution and GConv stands for group convolution.

As discussed before, the parameters and MAdds of the depthwise convolutional layer are usually fewer than those of the expansion and projection layers. We use a group convolutional layer for a more efficient projection, saving parameters and MAdds by a factor of the number of groups. Group convolution was first adopted in [16] to distribute convolution computation over multiple GPUs. It reduces the computational cost and the number of parameters while still achieving high representational power. [28] proposed channel local convolution (CLC), in which an output channel can depend on an arbitrary subset of the input channels. It is a multi-stage group convolution with a nice property, the so-called full channel receptive field (FCRF): they found that, to achieve high accuracy, every output channel of a CLC should cover all the input channels. In our case, channel attention uses all the input channels of the depthwise convolution layer, so any group convolution for the projection layer satisfies the FCRF condition and our ANTBlock becomes a CLC block. With a group convolution layer for the projection $\mathcal{P}$, our ANTBlock can be written as
$y_j = x_j + \mathcal{P}^g_j(\mathbf{A}(\mathbf{u})_{(j)} \odot \mathbf{u}_{(j)}), \quad (6)$

where $g$ is the number of groups of the group convolution, $y_j$ denotes the $j$-th output channel of the ANTBlock, and $\mathbf{u}_{(j)}$ denotes the group of feature maps associated with that output channel. $\mathcal{P}^g_j$ is the projection for each output channel with group convolution (group = $g$), corresponding to (d) in Fig. 1(b).
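To make the block concrete, here is a minimal PyTorch sketch of an ANTBlock as we read it from Table 1, Fig. 1(b), and Eq. (6). The SE-style two-FC attention and the BatchNorm placement are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class ANTBlock(nn.Module):
    """Expansion -> depthwise -> channel attention -> grouped linear projection,
    with a residual connection when input and output shapes match."""
    def __init__(self, c_in, c_out, t=6, r=8, stride=1, groups=2):
        super().__init__()
        c_mid = c_in * t
        self.expand = nn.Sequential(                       # (a) 1x1 conv, ReLU6
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True))
        self.depthwise = nn.Sequential(                    # (b) 3x3 dwise
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True))
        self.attention = nn.Sequential(                    # (c) pool, FC(2), sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_mid, c_mid // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid // r, c_mid, 1), nn.Sigmoid())
        self.project = nn.Sequential(                      # (d) linear 1x1 gconv
            nn.Conv2d(c_mid, c_out, 1, groups=groups, bias=False),
            nn.BatchNorm2d(c_out))
        self.use_residual = (stride == 1 and c_in == c_out)

    def forward(self, x):
        u = self.depthwise(self.expand(x))
        y = self.project(self.attention(u) * u)  # Eq. (6)
        return x + y if self.use_residual else y
```

Note that `c_in * t` and `c_out` must both be divisible by `groups` for the grouped projection to be valid; the broadcasted multiply `self.attention(u) * u` applies one scalar gate per channel.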
3.2 Ensemble ANTBlocks: eANTBlock
The proposed block (ANTBlock) can be extended to an ensemble block, denoted eANTBlock. To construct more powerful networks, we can ensemble (i.e., aggregate with learned weights) different types of ANTBlocks (e.g., with different numbers of groups). The eANTBlock can be written as

$\mathbf{y} = \sum_{i=1}^{E} w_i \cdot \mathcal{B}_{g_i}(\mathbf{x}), \quad (7)$

where $E$ is the number of different ANTBlocks, $\mathcal{B}_{g_i}$ denotes an ANTBlock with a group convolutional layer (group = $g_i$) for the projection, and $w_i$ is the weight of that ANTBlock. The weights are the outputs of a softmax function,

$w_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{E} \exp(\alpha_j)}, \quad (8)$

so that $0 < w_i < 1$ and $\sum_{i=1}^{E} w_i = 1$.
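The softmax-weighted aggregation of Eqs. (7)-(8) can be sketched as a small wrapper module (a hypothetical `EnsembleBlock`, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleBlock(nn.Module):
    """Softmax-weighted ensemble of blocks (Eqs. 7-8). `blocks` is a list of
    modules with identical input/output shapes, e.g. ANTBlocks with g=1 and g=2."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # Raw ensemble parameters alpha_i, learned end-to-end by backprop.
        self.alpha = nn.Parameter(torch.zeros(len(blocks)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)  # each w_i > 0 and sum_i w_i = 1
        return sum(w[i] * blk(x) for i, blk in enumerate(self.blocks))
```

Because the weights come from a softmax, the ensemble output is always a convex combination of the member blocks' outputs, and fixing `alpha` reproduces the manual (0, 1), (1, 0), or (0.5, 0.5) settings studied later.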
The $\alpha_i$ are parameters of the eANTBlock trained by backpropagation; during standard training, they are learned end-to-end along with the network weights. In our experiments, $E = 2$ is used. The structure of the eANTBlock with $E = 2$ can be seen in Fig. 2.

3.3 ANTNet
ANTNet (Attention NesTed Network) is a new efficient convolutional neural network architecture constructed by a sequence of ANTBlocks. The architecture of the network is similar to MobileNetV2, but all the inverted residual blocks are replaced by ANTBlocks and one may use a different number of ANTBlocks depending on the target accuracy.
Now, we describe our architecture in detail. The basic building block is the ANTBlock, which has an expansion layer, a depthwise convolutional layer, a channel attention layer, and a groupwise projection layer with a residual connection. The detailed structure of the ANTBlock is shown in Table 1 and Fig. 1(b).
Table 2: The overall architecture of ANTNet ($n$ output classes in the classifier).

Name | Input | Operator | Expansion ($t$) | Reduction Ratio ($r$) | Output Channels ($c$) | Repetition ($n$) | Stride ($s$) | Group ($g$)
conv0 | 224x224x3 | conv2d | - | - | 32 | 1 | 2 | -
ant1 | 112x112x32 | ANTBlock | 1 | 8 | 16 | 1 | 1 | 1
ant2 | 112x112x16 | ANTBlock | 6 | 8 | 24 | 2 | 2 | 2
ant3 | 56x56x24 | ANTBlock | 6 | 12 | 32 | 2 | 2 | 2
ant4 | 28x28x32 | ANTBlock | 6 | 16 | 64 | 2 | 2 | 2
ant5 | 14x14x64 | ANTBlock | 6 | 24 | 96 | 3 | 1 | 2
ant6 | 14x14x96 | ANTBlock | 6 | 32 | 160 | 2 | 2 | 2
ant7 | 7x7x160 | ANTBlock | 6 | 64 | 320 | 1 | 1 | 2
conv8 | 7x7x320 | conv2d 1x1 | - | - | 1280 | 1 | 1 | -
pool9 | 7x7x1280 | avgpool 7x7 | - | - | 1280 | 1 | 1 | -
fc10 | 1x1x1280 | FC | - | - | n | - | - | -
A channel attention block ((c) in Fig. 1(b)) introduces additional parameters and MAdds compared to the inverted residual block. Consider our ANTNet, which has groups of repeated blocks as shown in Table 2, where the $i$-th group contains $n_i$ ANTBlocks. Given reduction ratio $r_i$, the increase in cost can be written as

$\sum_i n_i \cdot \frac{2 \tilde{c}_i^2}{r_i}, \quad (9)$

where $\tilde{c}_i$ is the number of output channels of the depthwise convolutional layer in the $i$-th group (each attention block consists of two fully connected layers with $\tilde{c}_i \cdot \tilde{c}_i / r_i$ weights each). Equation 9 shows that as the number of channels grows, the number of additional parameters and MAdds grows quadratically. Also, [7] demonstrates that channel attention is prone to saturate at later layers, and this saturation also appears in our experiments. Therefore, the reduction ratio can be optimized for each group of repeated blocks; our configuration is shown in Table 2. Later layers have fewer degrees of freedom in terms of channel reweighting (larger $r$).
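Eq. (9) is easy to evaluate per configuration; the helper below is a sketch (the example stage tuples are illustrative, not the paper's exact configuration):

```python
def attention_extra_params(stages):
    """Extra parameters from channel attention, Eq. (9).
    Each stage is (n, c_dw, r): repetition count, number of depthwise output
    channels, and reduction ratio. Each attention layer adds two FC layers
    of c_dw * (c_dw / r) weights each."""
    return sum(n * 2 * c_dw * c_dw // r for n, c_dw, r in stages)

if __name__ == "__main__":
    # Hypothetical configurations for illustration: a fixed r everywhere
    # versus a larger r at the wide, late stage.
    fixed = attention_extra_params([(2, 192, 16), (1, 1920, 16)])
    mixed = attention_extra_params([(2, 192, 8), (1, 1920, 64)])
    print(fixed, mixed)
```

Because the cost grows with $\tilde{c}_i^2$, raising $r$ only at the widest (latest) stages removes most of the overhead, which matches the mixed-ratio configuration in Table 2.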
The groupwise projection layer in the ANTBlock reduces the number of parameters and MAdds. The group parameter $g$ needs to be determined for every ANTBlock when building ANTNet; it trades off efficiency against model accuracy. The overall design of ANTNet is shown in Table 2. We set the expansion rate $t = 1$ and $g = 1$ for the first block (ant1). For the others, we use a constant expansion rate $t = 6$ and a constant group $g$ throughout the network.
When a larger budget (e.g., MAdds and Params) is allowed, the eANTBlock can be used as the basic block to build a more powerful network. eANTNet achieves the highest accuracy, as shown in Table 3.
4 Experiments
We evaluate the computational efficiency and accuracy of ANTNet and compare it with state-of-the-art mobile models with favorable classification accuracy. Computational efficiency is measured theoretically by MAdds and Params, and empirically by CPU latency and model size on a mobile platform (iPhone 5s). For model accuracy, we evaluate image classification on the CIFAR-100 dataset [14] and the ImageNet dataset (ILSVRC 2012 image classification) [19]. For ImageNet, we follow prior work and use the validation set as the test set.
ANTNet is implemented in PyTorch. We use the built-in 1x1 convolution and group convolution implementations for the channel attention and projection layers. Our ANTBlock is easy to reproduce in any deep learning framework, such as Caffe, TensorFlow, or MXNet, using built-in layers, as long as 1x1 standard convolution and group convolution are available.

We train with the SGD optimizer using Nesterov momentum. We use a multi-step learning rate schedule that decays the learning rate multiplicatively at fixed epochs. We apply the same weight decay to all convolutional layers in ANTNet, using the value from the Inception model [23]. For a fair comparison, we use the same default data augmentation as ResNet: random cropping and horizontal flipping for training images, with images resized or cropped to 224x224 pixels for ImageNet and 32x32 pixels for CIFAR-100. At test time, the trained model is evaluated on center crops, with the same default image preprocessing as ResNet [5].

4.1 CIFAR-100 Classification
The CIFAR-100 dataset consists of 32x32 RGB images of 100 classes, with 50,000 training images and 10,000 test images. We consider the state-of-the-art network architecture MobileNetV2 as our baseline and, for a fair comparison, keep our settings the same as MobileNetV2. Training images are zero-padded on each side and a 32x32 crop is randomly sampled from the padded image; horizontal flipping and RGB mean subtraction are applied as well. The overall network architecture and hyperparameters for CIFAR-100 are the same as those of ANTNet for ImageNet described in Table 2, except for the different input and output sizes (100 classes vs. 1,000 classes) and the strides of the first conv2d and ANTBlock, which are set to 1.

As our purpose is to build a resource-efficient image classifier for mobile platforms, we only compare our model against models with low computational cost, few parameters, and small memory consumption. We consider the mobile-suitable models MobileNet and ShuffleNet as our comparison baselines. We evaluate top-1 and top-5 accuracy and compare MAdds and parameter counts for the benchmark. The performance comparison between the baseline models and our ANTNet is listed in Table 3. Our ANTNet achieves significant improvements over MobileNetV2 and ShuffleNet with lower computational cost and parameter count: ANTNet (g = 2) reduces both computation and parameters while increasing top-1 accuracy, and ANTNet (g = 1) improves top-1 accuracy further with slightly more computation and parameters.

Table 3: Performance comparison on CIFAR-100.

Network | Top-1 Accu. | Top-5 Accu. | #Parameters | #MAdds
ShuffleNet (1.5) | | | |
MobileNetV2 | | | |
ANTNet (g = 1) | | | |
ANTNet (g = 2) | | | |
eANTNet | | | |
4.1.1 Optimal Configuration of Channel Attention
Channel attention in the ANTBlock is key to improving the feature representation, but a naive combination of channel attention and depthwise separable convolution does not necessarily yield better performance. Table 4 shows that a naive adaptation of squeeze-and-excitation [8] to MobileNetV2 (seMobileNetV2) does not improve representational power; combining channel attention with depthwise convolutional layers needs a more careful design. We observed that channel attention is effective when the number of channels is large. Also, following the rule for the full channel receptive field (FCRF) [28], we design the ANTBlock so that each output channel of a depthwise convolutional layer has a full channel receptive field, to maximize representational power. Channel attention is therefore inserted between the expansion and projection layers of the ANTBlock, as proposed in Fig. 1(b). One additional advantage of this design is that, since every output channel of the depthwise convolutional layer has an FCRF, the projection layer of the ANTBlock can be substituted with any group convolutional layer while ensuring that all output channels of an ANTBlock have an FCRF. Hence all ANTBlocks, for any $g$, have an FCRF.
Table 4: Accuracy on CIFAR-100 for different placements of channel attention.

Network | Top-1 Accu. | Top-5 Accu.
MobileNetV2 | |
seMobileNetV2 | |
cANTNet | |
ANTNetc | |
ANTNet (proposed) | 75.7 | 93.6
We compare the accuracy of three different arrangements of channel attention against MobileNetV2; all have similar computational costs and parameter counts. Our experiments in Table 4 show that the ANTBlock with channel attention between the expansion and projection layers is most effective (+1.5% top-1 accuracy), whereas the other arrangements do not show a significant performance boost. Channel attention after the projection layer (ANTNetc) performs almost the same as MobileNetV2, and channel attention before the expansion layer (cANTNet) even reduces representational power. This result is consistent with our observation that channel attention is most effective with a large number of channels and a full channel receptive field.
4.1.2 Reduction Ratio
The reduction ratios $r_i$ in Eq. (9) are hyperparameters that adjust the capacity and MAdds/Params. We varied $r$ at each ANTBlock stage, and our final model (ANTNet) achieved better accuracy with fewer parameters than a fixed $r$ for all ANTBlocks (see Table 5). We also observed that the last stage of the network tends towards a saturated state. We found that the reduction ratio setting of our ANTNet (see Table 2) achieves a good balance between accuracy and complexity, and we thus use this setting in all experiments.
Ratio ($r$) | Params | MAdds | Top-1 accu. | Top-5 accu.
8 | 3.5M | 92.3M | 75.9 | 94.1
16 | 3.0M | 91.7M | 75.2 | 93.9
32 | 2.8M | 91.5M | 75.5 | 93.9
Ours (mixed) | 2.7M | 91.4M | 75.9 | 94.3
4.1.3 Parameter Learning of eANTBlock
Adaptively weighting different types of ANTBlocks via the eANTBlock allows building larger models with higher accuracy. In our experiment, we construct the eANTBlock from two types of ANTBlocks that vary in the number of groups of the projection convolution. $w_1$ and $w_2$ are the weights corresponding to the two types of ANTBlocks. We compared the accuracy on CIFAR-100 with manually set weights $(w_1, w_2)$, e.g., (0, 1), (1, 0), and (0.5, 0.5). Setting $(w_1, w_2) = (0, 1)$ or $(1, 0)$ means using only one type of ANTBlock; setting $(w_1, w_2) = (0.5, 0.5)$ averages both types. The experiment on CIFAR-100 with eANTNet shows that automatically learning the weights outperforms setting them manually (see Table 6): the best top-1 accuracy with fixed weights is 76.2%, whereas the learned weights achieve 76.7%.
$w_1$ | $w_2$ | Top-1 accu.
0 | 1 | 75.7
1 | 0 | 75.9
0.5 | 0.5 | 76.2
learned | learned | 76.7
4.2 ImageNet Classification
The ImageNet 2012 dataset consists of 1.28 million training images and 50K validation images from 1,000 classes. We train our network on the training set and report top-1 and top-5 accuracy together with the MAdds and parameter counts of the models.
Our ANTNet architecture is shown in Fig. 1(a) and the details of its layers are listed in Table 2. We compare our models with other low-cost models such as MobileNetV1, MobileNetV2, and ShuffleNet (1.5). The comparison of accuracy and computation budgets is shown in Table 7. Our ANTNet (g = 1) achieves a consistent improvement of 1.2% top-1 accuracy over MobileNetV2 and outperforms ShuffleNet (1.5) by 1.7%. Compared with the most resource-efficient model, CondenseNet (G=C=4), our ANTNet (g = 2) improves top-1 accuracy by 1.8% with even fewer MAdds. With slightly more parameters and MAdds, our wider ANTNet offers a 0.3% top-1 accuracy improvement over MobileNetV2 (1.4). Also, our ANTNet (g = 2) is a variant with performance comparable to MobileNetV2 at MAdds and Params similar to CondenseNet (G=C=4).
Model | #Parameters | #MAdds | Top-1 Accu. (%) | Top-5 Accu. (%)
MobileNetV1 | 4.2M | 575M | 70.6 | 89.5
SqueezeNext | 3.2M | 708M | 67.5 | 88.2
ShuffleNet (1.5) | 3.4M | 292M | 71.5 | -
ShuffleNet (x2) | 5.4M | 524M | 73.7 | -
CondenseNet (G=C=4) | 2.9M | 274M | 71.0 | 90.0
CondenseNet (G=C=8) | 4.8M | 529M | 73.8 | 91.7
MobileNetV2 | 3.4M | 300M | 72.0 | 91.0
MobileNetV2 (1.4) | 6.9M | 585M | 74.7 | 92.5
NASNet-A | 5.3M | 564M | 74.0 | 91.3
AmoebaNet-A | 5.1M | 555M | 74.5 | 92.0
PNASNet | 5.1M | 588M | 74.2 | 91.9
DARTS | 4.9M | 595M | 73.1 | 91.0
ANTNet (g = 1) (ours) | 3.7M | 322M | 73.2 | 91.2
ANTNet (g = 2) (ours) | 3.2M | 267M | 72.8 | 91.0
eANTNet (ours) | 5.5M | 545M | 74.2 | 91.6
ANTNet (ours) | 6.8M | 598M | 75.0 | 92.3
4.3 Inference on a Mobile Device
We briefly discussed that MAdds and Params are used to measure computational cost and model size. They are handy for comparing models across a variety of implementations and hardware, but these estimates do not account for the cost of memory reads and writes, which can be a crucial factor in a real-world scenario. Since memory access is relatively slower than computation, the amount of memory access has a big impact on real speed on actual devices. Moreover, both CPUs and GPUs use caching to speed up memory reads and writes. Memory coalescing can speed up memory reads, as each thread can read a chunk of memory in one go instead of performing separate reads. Kernels can also stage small amounts of memory into local or threadgroup storage for faster access, and each thread can compute multiple outputs instead of only one, reusing some of the input and thus requiring fewer memory reads overall. In short, the actual inference speed depends on the hardware architecture and on how each layer is implemented, so the inference speed of models should be tested on actual devices as well.
Model | MAdds | CoreML Model Size | Latency
MobileNetV2 | | |
ANTNet (g = 1) | | |
ANTNet (g = 2) | | |
We evaluate the actual inference time of models on a commodity iOS smartphone, the iPhone 5s, which has a 64-bit 1.3 GHz dual-core Apple Cyclone (Apple A7) processor, an Apple M7 motion coprocessor, and 1 GB of LPDDR3 RAM. To run inference on the iPhone 5s, we convert our trained models to CoreML models, which can be deployed on iOS devices using Apple's machine learning platform. CoreML is optimized for on-device performance and minimizes memory footprint and power consumption. Although it targets only the iOS platform, it is still meaningful for comparing the speed of ANTNet relative to the baseline models. The actual inference times on the iPhone 5s are given in Table 8, together with the converted CoreML model file sizes. We run each model several times, discard the fastest and slowest runs, and report the average of the remaining runs as the final inference time. Table 8 shows that our ANTNet achieves roughly a 20% speedup over MobileNetV2, and this latency improvement is consistent with our analysis of MAdds.
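The timing protocol described above (repeated runs, drop the fastest and slowest, average the rest) can be sketched in PyTorch for desktop CPU measurements; the run and warmup counts below are placeholders, not the paper's exact values:

```python
import time
import torch

@torch.no_grad()
def cpu_latency_ms(model, input_shape=(1, 3, 224, 224), runs=12, warmup=3):
    """Average CPU inference latency in milliseconds, discarding the
    fastest and slowest run as in the paper's protocol."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):            # warm up caches and allocator
        model(x)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    times = sorted(times)[1:-1]        # drop fastest and slowest
    return sum(times) / len(times)
```

Discarding the extremes makes the estimate robust to one-off scheduler hiccups; for mobile numbers, the converted CoreML model still has to be timed on the device itself.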
The CPU inference time on a desktop machine with a 2.10 GHz 32-core Intel(R) Xeon(R) CPU E5-2620 shows a similar improvement to the iPhone 5s: our ANTNet is 8% faster than MobileNetV2 (1.11 s vs. 1.21 s).
5 Conclusion
In this paper, we proposed the ANTBlock, a novel basic architectural unit designed to boost the representational capacity of a network by combining channel-wise attention and grouped convolution. The capacity of the ANTBlock allows the design of resource-efficient networks; MobileNetV2 can be viewed as a special case of our network with the channel-wise attention and group convolution removed. Extensive experiments demonstrate the effectiveness and efficiency of our ANTNet, which achieves state-of-the-art performance on multiple datasets. In addition, experiments on an actual device, the iPhone 5s, show that ANTNet achieves a significant latency improvement over state-of-the-art low-cost models in practice. Finally, the improved capacity induced by ANTBlocks shows that leveraging the interdependency of channels is a promising direction for finding more resource-efficient mobile models under MAdds and parameter constraints.
References
 [1] Soravit Changpinyo, Mark Sandler, and Andrey Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
 [2] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
 [3] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.

 [4] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5353–5360, 2015.
 [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [7] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
 [8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [9] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [11] Yani Ioannou, Duncan Robertson, Roberto Cipolla, Antonio Criminisi, et al. Deep roots: Improving CNN efficiency with hierarchical filter groups. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [13] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.
 [14] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
 [18] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. arXiv preprint arXiv:1807.11164, 2018.
 [19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [20] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [21] Laurent Sifre and Stéphane Mallat. Rigidmotion scattering for image classification. PhD thesis, Citeseer, 2014.
 [22] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.
 [23] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [24] Guotian Xie, Jingdong Wang, Ting Zhang, Jianhuang Lai, Richang Hong, and Guo-Jun Qi. Interleaved structured sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8847–8856, 2018.
 [25] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [26] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In Computer Vision and Pattern Recognition, 2017.
 [27] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [28] Dong-Qing Zhang et al. clcNet: Improving the efficiency of convolutional neural network using channel local convolutions. arXiv preprint arXiv:1712.06145, 2017.