1 Introduction
Convolutional neural networks (CNNs) have achieved high performance in several visual recognition tasks, including object classification, detection, and segmentation [19, 17, 5]. These networks learn representations by performing convolutional operations on an input tensor along its spatial dimensions using a sliding window approach. Because of the limited receptive field of a convolutional kernel, convolutional layers are not able to utilize the full encoding capacity of a tensor and therefore, we need to stack multiple convolutional layers to encode more information. This increases the complexity of a CNN.
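To make the receptive-field argument concrete, the receptive field of a stack of stride-1 convolutions grows only linearly with depth. The short Python sketch below is an illustration of this arithmetic, not part of any network described here:

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels, per axis) of `num_layers` stacked
    stride-1 convolutions with square kernels.
    Each layer widens the visible window by (kernel_size - 1)."""
    return num_layers * (kernel_size - 1) + 1

# A single 3x3 layer sees only a 3x3 patch; covering a 224-pixel
# axis with 3x3 layers alone takes 112 stacked layers.
print(receptive_field(1), receptive_field(112))  # -> 3 225
```

This linear growth is why many layers must be stacked before distant pixels can interact, which in turn drives up network complexity.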
[Figure 1 caption, partial: (c) compares the top-1 accuracy on the ImageNet dataset at two different settings. Dimension-wise convolutions maximize the use of information contained in the tensor, thus leading to more accurate solutions.]
To improve the efficiency of standard convolutions, most recent attempts have focused on reducing redundant operations and parameters in the convolutions [30, 20]. Recently, depth-wise separable convolutions [22] have been proposed to reduce CNN complexity in terms of floating-point operations (FLOPs) and network parameters. These convolutions factorize the convolutional operation into two steps: (1) a lightweight filter is applied to each spatial plane using depth-wise convolutions [6] to learn spatial representations, and (2) a point-wise (1×1) convolution is then applied to fuse channels and learn combinations of spatial representations. Similar to standard convolutions, depth-wise convolutions are not able to utilize the full encoding capacity of a tensor; therefore, point-wise convolutions are used to encode more information in depth-wise separable convolutions. Since depth-wise convolutions are computationally very efficient, this leaves a significant computational load on point-wise convolutions, which become a computational bottleneck. As an example, point-wise convolutions account for about 90% of the total operations in ShuffleNetv2 [37] and MobileNetv2 [51].
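The ~90% figure can be checked with simple FLOP counting. The sketch below compares a standard convolution with its depth-wise separable factorization; the symbols `c_in`, `c_out`, `h`, `w`, and `k` are illustrative placeholders, not names from the paper:

```python
def standard_conv_flops(c_in, c_out, h, w, k):
    # every output element needs a k*k*c_in multiply-accumulate
    return k * k * c_in * c_out * h * w

def separable_conv_flops(c_in, c_out, h, w, k):
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 convolution fusing channels
    return depthwise, pointwise

dw, pw = separable_conv_flops(256, 256, 28, 28, 3)
# point-wise share: 256 / (9 + 256), roughly 0.97 at this layer size,
# consistent with the ~90% observation for whole networks
print(pw / (dw + pw))
```

The factorization saves a large constant factor over the standard convolution, but almost all of the remaining cost sits in the point-wise step.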
In this paper, we introduce DiCENet, Dimension-wise Convolutions for Efficient Networks. Our dimension-wise convolution (DimConv) convolves separate filters across each dimension of the input tensor, including the channel (or depth¹) dimension. Unlike standard and depth-wise convolutions, DimConv leverages the spatial locality of the channels and is thus, in a sense made precise in Section 3, not invariant to channel-wise permutations of an input tensor. Moreover, learning representations across all dimensions leads to an encoding mechanism that increases the use of information contained in the tensor. In particular, the increased learning capacity due to DimConv allows us to further reduce the computational burden from expensive point-wise convolutions. We introduce an Efficient channel Fusion (EFuse) to fuse dimension-wise representations efficiently. With DimConv and EFuse, we are able to build a more efficient CNN while maintaining accuracy (see Figure 1). We highlight that the introduced module is novel and is not a part of any architecture search method [68, 56, 3].

¹We will use "depth" and "channel" interchangeably in this paper.
We have evaluated the performance of DiCENet on several visual recognition tasks, including object classification, detection, and semantic segmentation. Compared to existing methods, DiCENet shows a significant improvement across all tasks. When compared with existing efficient architecture designs, such as MobileNetv2 [51] and ShuffleNetv2 [37], on the ImageNet dataset [50], DiCENet (1) is 1-2% more accurate for extremely small models (7-15 MFLOPs), (2) delivers similar performance with 15-20% fewer FLOPs for small models (38-160 MFLOPs), and (3) outperforms existing efficient designs with fewer FLOPs for medium-size (270-600 MFLOPs) models. When DiCENet is used as a base feature extractor on tasks such as semantic segmentation and object detection, it delivers competitive performance and outperforms some of the task-specific tailored networks. For semantic segmentation on the PASCAL VOC 2012 dataset [10], DiCENet is 4.3% more accurate and has fewer FLOPs than ESPNet [39], a recent efficient segmentation network. DiCENet is 4.5% more accurate and has fewer FLOPs than YOLOv2 [46] on the MS-COCO object detection dataset [32].
2 Related Work
CNN architecture designs:
Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to the exploration of different CNN designs [29, 53, 19, 28, 55, 25]. To make these network designs more efficient, they have been extended with efficient and sparse forms of convolutions, such as depth-wise and group convolutions [22, 24, 64, 37, 51, 40]. In this paper, we introduce dimension-wise convolutions that generalize depth-wise convolutions to all dimensions of the input tensor.
Neural architecture search:
Recently, neural search methods, including reinforcement learning and genetic algorithms, have been proposed to automatically construct network architectures
[68, 62, 45, 69, 56, 33]. These methods search over a huge network space (e.g., MNASNet [56] searches over 8K different design choices) using a dictionary of predefined search-space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints, such as inference time. Recent search-based methods [56, 3, 60] use MobileNetv2 [51] as a basic search block for automatic network design. Since the proposed unit delivers better performance than MobileNetv2 (see Section 4), we believe that neural architecture search with our proposed unit would enable finding a better network design.

Other alternatives for efficient CNNs:
These approaches include network quantization [54, 44, 61, 8, 67, 26, 1], compression [14, 15, 59, 30, 57, 20], and distillation [21, 49, 13, 36, 63]. Network quantization-based approaches approximate convolution operations with fewer bits instead of using 32-bit full-precision floating-point values. This improves inference speed and also reduces the memory required for storing network weights. Network compression-based approaches improve the efficiency of a network by removing redundant weights and connections. Unlike network quantization and compression, distillation-based approaches improve the accuracy of (usually shallow) networks by supervising the training with large pretrained networks.
Though these approaches have been shown to be very effective for improving either the efficiency or the accuracy of a network, they fail to underline architectural design benefits. For a fair comparison with existing design-based methods such as ShuffleNets [37, 64], MobileNets [22, 51], and ESPNetv2 [40], we do not use these approaches.
3 DiCENet
In this section, we elaborate on the details of our model, DiCENet. We first discuss its building block, the DiCENet unit, and then describe the DiCENet architecture.
3.1 DiCENet unit
The main building block of our model, the DiCENet unit, is based on a factorization principle and is primarily composed of two steps: (1) Dimension-wise Convolution (DimConv), which convolves separate filters across each dimension of the input tensor to learn rich representations, and (2) Efficient channel Fusion (EFuse), which efficiently combines the channels of the output tensor of DimConv. Figure 2 shows an overview of the DiCENet unit. Dimension-wise convolutions encode information from all dimensions of the tensor. This increases the learning capacity of a convolutional layer and reduces the computational burden on other layers. Therefore, the proposed efficient channel fusion works effectively with dimension-wise convolutions.
Dimension-wise convolution (DimConv):
The input to a convolutional layer is a three-dimensional tensor defined by width W, height H, and depth (or channels) D. A standard depth-wise convolutional layer applies D convolutional kernels of size n×n, with each kernel processing a single channel along the spatial dimensions (width and height), to produce an output Y_D of size D×H×W, where n denotes the kernel size. This type of convolution encodes spatial information but is invariant to the ordering of channels: for any reordering of the channels in the input tensor, there exists a reordering of the weight filters in the model that produces the same output tensor. To encode information along the channel dimension as well, we extend depth-wise convolutions to all dimensions of the input tensor and call this operation Dimension-wise Convolution (DimConv). As illustrated in Figure 2, DimConv has three branches. These branches apply D depth-wise convolutional kernels along the depth, W width-wise convolutional kernels along the width, and H height-wise convolutional kernels along the height to produce outputs Y_D, Y_W, and Y_H that encode information from all dimensions of the input tensor. The outputs of these independent branches are concatenated along the channel dimension to produce an output Y_Dim of size 3D×H×W. To facilitate the learning of inter-dimension representations, we group the spatial planes (or feature maps) in Y_Dim such that each group has a spatial plane from each dimension. We then combine these dimension-wise grouped representations with point-wise convolutional kernels to produce a weighted average output Y_G.
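The three branches above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the loop-based per-channel convolution is for clarity only, and `mix` stands in for the per-group point-wise combination that produces the weighted average.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Per-channel 'same'-padded convolution. x: (C, H, W), kernels: (C, k, k)."""
    C, H, W = x.shape
    k = kernels.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out

def dim_conv(x, k_depth, k_width, k_height, mix):
    """DimConv sketch: one depth-wise-style convolution per tensor dimension
    (D, W, and H kernels respectively), then a per-channel weighted average
    of the three branches. mix: (C, 3) combination weights (illustrative)."""
    y_d = depthwise_conv(x, k_depth)                                        # along depth
    y_w = depthwise_conv(x.transpose(2, 0, 1), k_width).transpose(1, 2, 0)  # along width
    y_h = depthwise_conv(x.transpose(1, 0, 2), k_height).transpose(1, 0, 2) # along height
    stacked = np.stack([y_d, y_w, y_h])                                     # (3, C, H, W)
    return np.einsum('cb,bchw->chw', mix, stacked)
```

Note that the width and height branches are simply depth-wise convolutions applied to a transposed view of the tensor, which is why the branch with W (or H) "channels" sees spatial planes spanning the other two dimensions.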
Efficient channel fusion (EFuse):
Recent efficient CNN designs [22, 51, 64, 37] use point-wise convolutions for channel fusion. To produce an output tensor with C_o channels from an input tensor with C_i channels and spatial dimensions H×W, a point-wise convolution performs C_i·C_o·H·W operations. This is a computationally expensive operation and is the main computational bottleneck in recent efficient CNN designs. We propose an Efficient channel Fusion (EFuse) as a substitute for channel fusion with point-wise convolutions. This allows us to skip the majority of point-wise convolutions in a CNN architecture, thereby reducing the computational burden from point-wise convolutions by a large margin (Figure 4).
In EFuse, instead of fusing the channels across all elements in the spatial dimensions, we squeeze the spatial dimensions of the input using a global average pooling operation to extract a global vector descriptor. We then fuse the channels within this global vector using a fully connected layer to produce a fusion vector. In parallel, we encode spatial representations by applying depth-wise convolutional kernels to the input. Next, we propagate the fusion vector along the spatial dimensions by applying a sigmoid to it and then multiplying it with all elements of the depth-wise output along the spatial dimensions. We choose the sigmoid function to prevent gradient overflow during training. The complexity of the EFuse layer is roughly n²·C·H·W + C² operations (the depth-wise convolution plus the fully connected fusion); for typical values of C, H, and W, EFuse therefore requires far fewer operations than the standard point-wise convolution.

If we stack DimConv and EFuse, we obtain an efficient block structure, the DiCENet unit, which has fewer FLOPs across different input tensor sizes than other state-of-the-art efficient CNN units. Figure 1(b) compares FLOPs against input tensor size for different CNN units. We can see that the DiCENet unit has substantially fewer FLOPs than existing units, such as the ShuffleNetv2 [37] and MobileNetv2 [51] units, a major improvement over state-of-the-art efficient designs.
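The fusion step above can be sketched as follows. This is a simplified NumPy illustration, not the paper's code: `fc_weight` stands in for the fully connected layer, and the loop-based depth-wise convolution is unoptimized by design.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def efuse(x, dw_kernels, fc_weight):
    """EFuse sketch. x: (C, H, W); dw_kernels: (C, k, k) spatial filters;
    fc_weight: (C, C) fully connected layer fusing the pooled descriptor."""
    C, H, W = x.shape
    # 1) squeeze spatial dimensions into a global channel descriptor
    g = x.mean(axis=(1, 2))            # (C,)
    # 2) fuse channels cheaply: C*C ops instead of C*C*H*W
    f = sigmoid(fc_weight @ g)         # (C,), sigmoid bounds the gate
    # 3) depth-wise convolution keeps the spatial encoding ('same' padding)
    k = dw_kernels.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    y = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                y[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * dw_kernels[c])
    # 4) broadcast the fusion vector over the spatial dimensions
    return y * f[:, None, None]
```

Step 2 is where the savings come from: the channel interaction happens on a C-dimensional vector rather than on every one of the H·W spatial positions.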
3.2 DiCENet architecture
The architecture design of DiCENet is motivated by ShuffleNetv2 [37], a state-of-the-art efficient network that uses channel split and channel shuffle to learn representations efficiently. Figure 3 contrasts the building blocks of ShuffleNetv2 and DiCENet, while the overall architecture is shown in Table 1. Except for the first two layers, we replace all point-wise and depth-wise convolutions with the DiCENet unit. The first layer is a standard convolution with a stride of two, while the second layer is a max-pooling layer. All convolutional layers are followed by a batch normalization layer [27] and a PReLU non-linear activation layer [18], except for the last layer that feeds into a softmax for classification. Following previous work [22, 51, 37, 64, 40], we scale the number of output channels by a width scaling factor s to construct networks at different FLOP budgets. We initialize the weights of our network using the same method as in [18].

Table 1: Overall architecture of DiCENet at different network widths.

Layer            Stride  Repeat  Output channels (width scaling parameter s)
Image                            3      3      3      3      3
Conv1            2       1       8      16     24     24     48
Max Pool         2               8      16     24     24     48
Stage2           2       1       16     32     116    278    384
                 1       3       16     32     116    278    384
Stage3           2       1       32     64     232    556    768
                 1       7       32     64     232    556    768
Stage4           2       1       64     128    464    1112   1536
                 1       3       64     128    464    1112   1536
Global Pool                      512    1024   1024   1280   2048
Grouped FC [38]  1       1       512    1024   1024   1280   2048
FC                               1000   1000   1000   1000   1000
FLOPs                            6.5 M  12 M   24240 M  298 M  553 M
Dynamic input scaling:
A standard practice in the computer vision community is to train a CNN on a large-scale classification dataset (e.g., the ImageNet dataset [50]) and then transfer the learned representations to other tasks, including object detection and semantic segmentation, via fine-tuning. In general, visual recognition tasks such as semantic segmentation and object detection use images of higher spatial dimensions (e.g., on the PASCAL VOC segmentation dataset [10]) than the ones on which the classification network is trained (e.g., on the ImageNet dataset). Since DiCENet has two branches corresponding to the width and height dimensions of the input, a natural question arises: "Can we use DiCENet with images whose spatial dimensions differ from the ones used for training on the classification dataset?". To make DiCENet invariant to the spatial dimensions of the input image, we dynamically scale (either upsample or downsample) the height or width dimension of the input tensor to the height or width of the input tensor used in the pretrained network. The resultant tensors are then scaled (either downsampled or upsampled) back to their original size before being fed to the weighted average function; this makes DiCENet invariant to the input image size. We note that this dynamic scaling allows DiCENet to learn scale-invariant features, because each convolutional layer (depth-, width-, or height-wise) receives inputs with different dimensions. Figure 5 sketches the DiCENet block with dynamic input scaling.

4 Experimental Results
In this section, we demonstrate the performance of DiCENet on three different visual recognition tasks: (1) object classification, (2) semantic segmentation, and (3) object detection. To showcase the strength of our network, we also compare it with state-of-the-art efficient networks, including task-specific networks such as YOLOv2 [46] and ESPNet [39].

** DiCENet uses ShuffleNetv2-style blocks (see Figure 3). For a fair comparison, we compare the performance of DiCENet with our implementation of ShuffleNetv2.
4.1 Object classification on the ImageNet
Implementation and dataset details:
Following common practice [37, 64, 22, 51, 40], we evaluate the performance of DiCENet on the ImageNet 1000-way classification dataset [50] at different complexity levels, ranging from 6 MFLOPs to 500+ MFLOPs. The dataset consists of 1.28M training samples and 50K validation samples. We optimize our networks by minimizing a cross-entropy loss using SGD. We use the same learning rate and data augmentation policy as in [40]. We use PyTorch [43] as our framework for training these networks because of its ability to handle dynamic graphs.

Evaluation metric:
We evaluate our networks using single-crop top-1 accuracy on the validation set.
Results:
Tables 1(a) and 1(b) compare the performance of DiCENet with state-of-the-art efficient architectures at different complexity levels. For extremely small models (FLOP range: 6-20 M), DiCENet outperforms MobileNetv2 [51] and ShuffleNetv2 [37] by about 1.5%. Note that our model at 14 MFLOPs delivers similar performance to MobileNetv1 [22] at 41 MFLOPs (48% top-1 accuracy). For small models (FLOP range: 30-160 M), DiCENet either achieves the best accuracy or matches it with fewer FLOPs. At 41 MFLOPs, DiCENet outperforms MobileNetv2 and ShuffleNetv2 by about 3% and 1.5%, respectively. For medium-size models (FLOP range: 270-600 M), DiCENet delivers the best performance. Notably, DiCENet delivers a top-1 accuracy of 75.1% with 551 MFLOPs, outperforming all previous efficient designs, such as ShuffleNets [37, 64] and MobileNets [22, 51], with fewer FLOPs. We would like to highlight that at 551 MFLOPs, DiCENet is only slightly less accurate than ResNet-50 [19], but has far fewer FLOPs.
Recently, neural architecture search (NAS)-based methods have gained attention. Recent NAS-based methods rely on architectural innovations to deliver better performance. For example, MNASNet [56] and FBNet [60] have shown remarkable improvements using MobileNetv2 as a basic search block. For the sake of completeness, we include a comparison with such methods in Table 1(c). DiCENet delivers competitive performance to networks derived using NAS. In particular, DiCENet outperforms FBNet- and MNASNet-based small models (FLOP range: 70-100 M). We note that NAS-based methods are not directly comparable to DiCENet because they rely on existing CNN units, such as MobileNetv2, for network search. However, we believe that adding the DiCENet unit to the search dictionary of NAS-based methods would deliver better performance, because DiCENet learns rich representations from all dimensions, which should help it outperform MobileNetv2 across different complexity levels.
4.2 Multi-label classification on the MS-COCO
The goal of multilabel classification is to predict multiple labels per image. This is in contrast to the ImageNet classification task that aims to predict one label per image.
Implementation and dataset details:
Evaluation metrics:
Results:
Figure 6 summarizes the quantitative results for multi-label classification. For similar FLOPs, DiCENet is 3.5% and 2% more accurate than ShuffleNetv2 [37] and ESPNetv2 [40], respectively. Also, DiCENet is only 3-4% less accurate than heavyweight networks, including the Elastic variants [58], while having far fewer FLOPs.
4.3 Semantic segmentation
Implementation and dataset details:
For the task of efficient semantic segmentation, we use an encoder-decoder architecture, a widely studied efficient segmentation design [39, 40, 41, 48]. We initialize the encoder with the ImageNet-pretrained model and then fine-tune it by minimizing a cross-entropy loss using SGD. We evaluate the performance on two segmentation datasets: (1) the Cityscapes [7] and (2) the PASCAL VOC 2012 [10].
The Cityscapes dataset has been widely used for benchmarking efficient segmentation networks because of its application in urban scene understanding, especially for self-driving cars [39, 40, 41, 48]. The dataset provides pixel-wise fine annotations (20 classes including background) for 5K high-resolution images captured across 50 cities. We follow the same splits for training and validation as in [39, 40, 41, 48].

The PASCAL VOC 2012 dataset is a well-known segmentation dataset that provides annotations for 20 foreground objects. It has 1.4K training images, 1.4K validation images, and 1.4K test images. Following a standard convention [4, 66, 5], we also use additional images from [16] and [32] for training our networks.
Evaluation metrics:
We measure accuracy in terms of mean intersection over union (mIOU). For both datasets, we use the official scripts for evaluation on the validation set, while we use an online server for evaluation on the test set because the test set annotations are private. We do not use a multi-scale testing strategy for evaluation.

[51] uses additional data from the MS-COCO dataset.
http://host.robots.ox.ac.uk:8080/anonymous/XWF8QJ.html
http://host.robots.ox.ac.uk:8080/anonymous/T44DHQ.html
Results:
Figure 7 summarizes the quantitative results of DiCENet on both datasets, including a comparison with state-of-the-art methods². We can see that: (1) the accuracy of DiCENet drops off smoothly as we decrease the image resolution, and even at low resolutions, DiCENet delivers good accuracy; for example, at a reduced input image size, DiCENet has far fewer FLOPs than SegNet [2] while delivering the same accuracy on the PASCAL VOC dataset; (2) DiCENet delivers better accuracy than existing efficient networks such as ENet [41], ESPNet [39], and RTSeg [52] on the Cityscapes dataset while requiring fewer FLOPs. Note that DiCENet is the most efficient network while delivering performance competitive with state-of-the-art efficient networks, thus making it a potential candidate for resource-constrained devices.

²Most of the efficient networks have not reported performance on the PASCAL VOC dataset, with the exception of ESPNet [39]. Therefore, we compare the performance of DiCENet on this dataset with other networks that are not necessarily efficient.
4.4 Object detection
Implementation and dataset details:
For the task of efficient object detection, we use the single-shot object detector (SSD) [34]. We use DiCENet pretrained on the ImageNet dataset as the base feature extractor in SSD instead of VGG [53]. We fine-tune our network using SGD with a smooth L1 loss for object localization and a cross-entropy loss for object classification. We evaluate the performance on two widely studied object detection datasets: (1) the PASCAL VOC 2007 [9] and (2) the MS-COCO dataset [32]. Following the standard convention for training on the PASCAL VOC 2007 dataset, we use the union of the PASCAL VOC 2012 [10] and PASCAL VOC 2007 trainval sets for training and evaluate the performance on the PASCAL VOC 2007 test set.
Evaluation metrics:
We measure the accuracy in terms of mean Average Precision (mAP). For the COCO dataset, we report mAP @ IoU of 0.50:0.95.

Results:
Figure 4 summarizes the quantitative results of DiCENet on the PASCAL VOC 2007 and the MS-COCO datasets. It also provides a comparison with existing methods, including MobileNets [22, 51], YOLOv2 [46], and SSD [34]. We do not compare with other methods such as Fast R-CNN [12] and Faster R-CNN [47] because our focus is on efficient network designs. We can see that the accuracy of DiCENet improves with the image resolution on both datasets. DiCENet delivers competitive performance to existing methods on this task. Notably, DiCENet is 4.5% more accurate and requires fewer FLOPs than YOLOv2 on the MS-COCO dataset.
5 Ablations
In this section, we study the significance of the two main components of DiCENet, i.e., DimConv and EFuse, on the ImageNet 1000-way classification dataset.
Importance of DimConv:
Learning representations with depth-wise convolutions has been widely studied for efficient networks [51, 37, 40]. The new result reported in this paper is that learning representations along all dimensions of the input tensor better encodes the information contained in it, thus leading to better accuracy (Table 3). Note that DimConv does not significantly increase the total number of operations (FLOPs) or parameters.
Channel  Width  Height  # Params  FLOPs  top-1
✓                       2.29 M    84 M   64.2
✓        ✓              2.31 M    88 M   64.9
✓        ✓      ✓       2.32 M    91 M   66.4
Importance of EFuse:
EFuse efficiently fuses the representations from different dimensions. To demonstrate its benefit, we replace it with (1) a standard point-wise convolution and (2) a squeeze-and-excitation (SE) unit [23], which squeezes the spatial dimensions to encode channel-wise representations. Table 4 summarizes our findings. The point-wise convolution delivers about the same accuracy as EFuse, but less efficiently. On the other hand, the SE unit drops the performance of DiCENet significantly. The drop in accuracy is likely because the SE unit does not encode spatial information as EFuse does.
Operation        # Params  FLOPs  top-1
Point-wise conv  2.30 M    126 M  66.6
SE unit [23]     2.29 M    89 M   62.7
EFuse (ours)     2.32 M    91 M   66.4
6 Conclusion
In this paper, we introduce a novel convolutional unit, the DiCENet unit, that increases the learning capacity of a CNN by utilizing information from all dimensions of the tensor. Our experimental results suggest that the proposed unit delivers superior performance compared to existing state-of-the-art networks across several computer vision tasks and datasets. The introduced convolutional unit is generic and can be applied to higher-dimensional data, such as 3D (volumetric) images.
In this work, we study a homogeneous network structure. In future work, we will study the performance of DiCENet with neural architecture search methods.
Acknowledgement: This research was supported by the NSF III (1703166), Allen Distinguished Investigator Award, Samsung GRO award, and gifts from Google, Amazon, and Bloomberg. We also thank PRIOR team at AI2 for their helpful comments.
References
 [1] R. Andri, L. Cavigelli, D. Rossi, and L. Benini. Yodann: An architecture for ultra-low power binary-weight CNN acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
 [2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
 [3] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
 [4] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.
 [5] L.C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

 [6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
 [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [8] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
 [10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html.

 [11] W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, 2018.
 [12] R. Girshick. Fast R-CNN. In ICCV, 2015.

 [13] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
 [14] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [15] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 [16] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
 [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
 [19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [20] Y. He, J. Lin, Z. Liu, H. Wang, L.J. Li, and S. Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
 [21] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [23] J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
 [24] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. In CVPR, 2018.
 [25] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [26] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low-precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 [27] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [29] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a backpropagation network. In Advances in neural information processing systems, pages 396–404, 1990.
 [30] C. Li and C. R. Shi. Constrained optimization based low-rank approximation of deep neural networks. In ECCV, 2018.
 [31] Y. Li, Y. Song, and J. Luo. Improving pairwise ranking for multi-label image classification. In CVPR, 2017.
 [32] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
 [33] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.J. Li, L. FeiFei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [34] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
 [35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [36] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
 [37] N. Ma, X. Zhang, H.T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.

 [38] S. Mehta, R. Koncel-Kedziorski, M. Rastegari, and H. Hajishirzi. Pyramidal recurrent unit for language modeling. In EMNLP, 2018.
 [39] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018.
 [40] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi. Espnetv2: A lightweight, power efficient, and general purpose convolutional neural network. In CVPR, 2019.
 [41] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
 [42] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach. Contextnet: Exploring context and detail for semantic segmentation in realtime. In BMVC, 2018.
 [43] PyTorch. Tensors and Dynamic neural networks in Python with strong GPU acceleration. http://pytorch.org/. Accessed: 20181115.
 [44] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.

 [45] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2902–2911. JMLR.org, 2017.
 [46] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
 [47] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [48] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 2018.
 [49] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
 [50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 [51] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 [52] M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, and M. Jagersand. Rtseg: Real-time semantic segmentation comparative study. In 2018 25th IEEE International Conference on Image Processing (ICIP).
 [53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.

 [54] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
 [55] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [56] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 [57] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.
 [58] H. Wang, A. Kembhavi, A. Farhadi, A. Yuille, and M. Rastegari. Elastic: Improving cnns with instance specific scaling policies. In CVPR, 2019.
 [59] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
 [60] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
 [61] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
 [62] L. Xie and A. Yuille. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1379–1388, 2017.
 [63] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
 [64] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
 [65] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 405–420, 2018.
 [66] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
 [67] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bit-width convolutional neural networks with low bit-width gradients. arXiv preprint arXiv:1606.06160, 2016.
 [68] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [69] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.